Your Data isn’t a “Dirty” little secret
“My data is filthy.” “My data is terrible.” These are often the first two things we hear when we begin chatting about a data analytics engagement with a potential customer. Everyone thinks their data is messier and more complicated than everyone else. While sometimes it turns out to be the case, we data scientists have developed a pretty good toolkit to clean things up.
We call this process “ETL” Extract, Transform and Load. Let’s take a look at a particular example and how we’re able to clean the data to get a clear picture of what’s happening. One of our customers is a large B2B hard goods retailer. They work through a network of dealers and generate roughly $1.5B a year in sales. That’s around a million transactions per year, a pretty significant chunk of data. Let’s say we made a sale to O’Connors (an Irish pub in Troy, NY). That’s great, then we made a sale to OConnors (the same Irish Pub) and then a few months later we made a sale to O-Connors Irish Pub (same place). So my simple analysis says we’ve got three differently named customers – but that’s not really the case. This kind of variation is incredibly common in sales transactions. So what do we do?
First off, we can do some textual analysis to match up the names. At the most basic level, we simply eliminate capitalization and punctuations. This usually gets us 30% of the way there. Next, we can use “fuzzy matching,” which matches names which are “close” to each other, and “phonetic matching” which gets us similar sounding names. Now we’re about 60% of the way there. Finally, we can use the address. Often, however, these will have the same issues – so do I need to do everything again? No, because it turns out that address standardization is a solved problem. Think of all the people who use GPS’s and type their addresses in multiple ways. Google (or the GPS provider) needs to convert them to a “standard” form to match against them. So, we use the same technique and identify that all the O’Connors are at the same address. At this point we can make a pretty safe assumption they are the same business. That gets us to 90% of the way there.
I wish I could tell you I had a magic wand for the last 10%, but I don’t. This comes down to an efficient process of identifying the potential problem accounts and some level of examination. Even when I sell to thousands of accounts, the number we’re left with at the end usually turns out to be manageable.
In summary, your data is probably not much worse than everyone else’s, and it’s often possible to do most of “record-linking” and cleaning using an automated process and then finishing up with some manual data inspection.