Using Excel to Easily Coalesce Data
Let’s imagine you’re working with a list of Contacts that lists the company that they work at, doing a little clean-up before it gets used. Only some of the Contacts have an address listed, but they all have a Company address listed. This needs to be updated so each Contact record has an address.
In this case, the business requirements are:
- We need to have one address on file for each contact
- If we have that specific contact’s address…great, use that
- If we don’t have a contact’s address, use their Company’s address instead
Doing a simple Copy/paste could be your friend if the list is only a handful of Contacts long…but this can be tedious if you’re looking at dozens (or more).
I was doing something very similar to this for a client recently. My go-to tool for this is Excel…it really makes this problem simple to solve. Come to think of it I use this same concept in a lot of different places when doing data prep–so I’ll run with this example, but recognize that the same method can be used elsewhere in Excel.
OK I’m interested, what do I do?
Fortunately, the requirements above give us exactly the direction we need. We only need one address for each contact, and each row has as either one or two. We know that one address wins out (the contact’s address) if it’s there. So from a list standpoint, we need to do the following:
- Devise a systematic way to determine the correct address
- Choose the correct address for each contact
- Eliminate the extraneous data
Accomplishing this isn’t too much work, but it does involve an Excel formula that’s going to look like Greek when you first see it. We’ll get there in a second, but indulge me for a slight diversion.
When working with a list of data, my best practice in Excel is to keep the source data as pristine as possible. This means I don’t like to add/delete columns in the original list. This ends up being important later on…when somebody comes with an updated list, or a slight variation of what rules to use when identifying the desired data. If you’ve made your changes directly in the original workbook it becomes much more likely that you’ll have to start from square one, whereas my method allows you to more easily sub in new data or change one of the underlying rules.
To accomplish this, I’m using two Worksheets (tabs) in one Excel Workbook (file). “Raw List” (the raw list of data) and “Final Product” (where our modified version of the data will go). When you see formulas such as =’Raw List’!C3 that really means this cell should equal the cell C3 in the “Raw List” worksheet.
Hopefully that little explanation will keep this next screenshot from looking like total hoodoo.
We’re going to use this formula for each address field (Street 1, Street 2, City, State, Zip). We’re also going to make the assumption that if the “Contact Street 1” field is blank, that it is not a valid address and the Company address should be used (make sure this is a valid assumption based on your situation :)). This translates to “If the Contact’s Street value is blank, use Company Street, otherwise use Contact Street“.
Adding the visual context in Excel:
Sweet, now we’re cooking. Next, we want to use this same formula for each of the rest of the Address fields.
Note that when we’re checking the subsequent fields like City, State, etc, we’re always checking the Contact Street value. This is important because we don’t want to use parts of both addresses…we want the whole contact address or the whole company address.
At this point, we’ve done the most complicated part. The other fields (like name, company name, etc) don’t need a funky formula, they just need to be identified and added as columns in the data set. You can take some time to review how this is done by downloading the Excel file I worked on, which has all of the formulas preserved on each of the two worksheets.
Great, so what’s next?
From here, there are a few common things that I’ll do as next steps:
- Save results to a separate file, then save it as a new file. This is a great way to force only the “final” version of the data is available moving forward. A basic copy/paste has some problems since the formulas reference the original data set (and would start showing errors). Below are two common approaches.
- One option for doing this is to do a copy/paste values into a new workbook. This is great when you want to ensure there are no references to the original set of data
- An alternative is to copy the entire worksheet to a new workbook (by right-clicking on the worksheet name). This will also update the formula references to look back to the original file
Note that either of these options is acceptable if the new file will be saved as a .csv file, since Excel will remove the incompatible features (formulas) when you click Save. A .csv file is great if you want to import the data into Dynamics CRM.
- Clean up the original workbook, making it less potentially-confusing to the next person to touch it. Below are a couple of clean up ideas:
- Hide the “Raw List” worksheet (tab) so it isn’t selected by accident
- Freeze the column headers in the Final Product worksheet by clicking Freeze Panes->Freeze Top Row on the View tab
- Protect the list from being accidentally modified by clicking Protect Sheet on the Review tab
Keep in mind, there are many other ways to accomplish this same task. If there’s another method you’d like to share, feel free to do so in the comments.