Introduction
LiveRamp receives thousands of large files from our customers each day, and we need a column type configuration to know how to interpret each file. Many files are expected to conform to an existing configuration; for the rest, we need to auto-detect the type of data within the file. Here’s a contrived example of what a file could look like. Note that we don’t always have such helpful column labels to guide our auto-detection methods.
Our previous methods use string matching and regular expressions, which work well for some column types, such as email. Other columns, such as first name, are harder to detect this way. So during our recent Hackweek, Jade, Gordon, Louis, I, and others explored machine learning (ML) approaches to classifying column types. These new methods achieve over 95% accuracy. This blog post explains how we developed this ML approach to column auto-detection. All code and analyses are written in Python, and we’re grateful for the open-source Python libraries numpy, pandas, scikit-learn, tensorflow, and many more.
Data Preparation
To train the model, we select 500 random imports from the last few days that have already been manually configured. From each, we sample up to 10,000 rows. In total, this gives us 2.3 million labeled strings from several thousand different columns.
Here’s a made-up, illustrative example of the labeled strings:
Note that “field” refers to any data element about a person (e.g., “Yes” for a has-pets field), in contrast to the other column types, which correspond to identifiers (or parts thereof) for a person.
Feature Engineering and Selection
Now that we have labeled strings, we need to figure out how to represent each string as a collection of numbers so that we can apply modeling techniques. This process is known as feature engineering or vectorization, since it converts each string into a vector/array of numbers. We start with a conventional technique called count vectorization (or bag of characters), whereby we count the frequency of each character in each string. Going beyond individual characters, we also include bigrams and trigrams, i.e., sequences of two and three adjacent characters.
This method is best illustrated with an example. The string ‘[email protected]’ is converted into a vector of length 41, and the following table shows the values at twenty of these positions.
Applying this method to all strings in our sample results in a vector space of size 11,631, covering all characters and n-grams across all of the strings. Highly frequent features that show up in more than 50% of the records, as well as highly infrequent ones (less than 0.01%), are ignored. Each individual string includes only a very small fraction of the features in this vector space, so almost all positions in each vector are zero.
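As a minimal sketch, this kind of character n-gram counting with frequency cutoffs can be expressed with scikit-learn’s CountVectorizer; the toy strings below are illustrative, and the exact parameters in our pipeline may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Count character unigrams, bigrams, and trigrams, dropping n-grams that
# appear in more than 50% of strings or in fewer than 0.01% of them.
vectorizer = CountVectorizer(
    analyzer="char",      # character-level rather than word-level n-grams
    ngram_range=(1, 3),   # unigrams, bigrams, and trigrams
    max_df=0.50,          # ignore n-grams present in >50% of strings
    min_df=0.0001,        # ignore n-grams present in <0.01% of strings
)

strings = ["jane.doe@example.com", "John", "94105"]  # toy examples
X = vectorizer.fit_transform(strings)  # sparse matrix: one row per string
print(X.shape)  # (3, number of surviving n-grams)
```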
We also design a few additional features that we believe may help the ML models distinguish different types of strings (a sketch of how these might be computed follows the list).
- length — The length of each string in characters
- num_length — The number of digit characters in a string
- space_length — The number of spaces within a string
- special_length — The number of special (non-alphanumeric) characters
- vowel_length — The number of vowels in a string
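Here is one straightforward way these five features could be computed (the example string and its values are made up for illustration):

```python
VOWELS = set("aeiouAEIOU")

def designed_features(s: str) -> dict:
    """Compute the five hand-designed features for a single string."""
    return {
        "length": len(s),                                   # total characters
        "num_length": sum(c.isdigit() for c in s),          # digit characters
        "space_length": s.count(" "),                       # spaces
        "special_length": sum(not c.isalnum() for c in s),  # non-alphanumeric
        "vowel_length": sum(c in VOWELS for c in s),        # vowels
    }

designed_features("jane.doe@example.com")
# {'length': 20, 'num_length': 0, 'space_length': 0,
#  'special_length': 3, 'vowel_length': 8}
```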
In total, this gives 11,636 features describing each string. We find this number of features to be too large for many modeling techniques, and we suspect many features are of low value. Therefore we apply univariate feature selection to keep the top 1,000 features, ranked by how well each one individually distinguishes the labels, as quantified by chi-squared statistics. The top 20 features are as follows:
Note that all five designed features appear in the top 20.
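In scikit-learn terms, this selection step might look like the sketch below, where `X_all` (the full 11,636-column feature matrix) and `labels` are placeholder names for the training data assembled above:

```python
from sklearn.feature_selection import SelectKBest, chi2

# Chi-squared scoring requires non-negative features, which holds here since
# all of our features are counts or lengths.
selector = SelectKBest(score_func=chi2, k=1000)
X_selected = selector.fit_transform(X_all, labels)  # shape: (n_strings, 1000)
```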
Every time a model is trained in this post, we withhold a random 20% of the labeled strings to measure the model’s accuracy after training.
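Concretely, the split might be done as follows (the random seed is illustrative):

```python
from sklearn.model_selection import train_test_split

# Hold out a random 20% of the labeled strings for evaluating the model.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, labels, test_size=0.20, random_state=0
)
```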
Training and Evaluating a Neural Network
We’ve experimented with several different ML models and find a multilayer perceptron (a type of neural network) to give the highest accuracy. The model consists of just a single hidden layer of 512 rectified linear units and a softmax output layer. The model was constructed with Keras using the TensorFlow backend. Overall, the model classifies 93.7% of test strings correctly. We plot the confusion matrix to better understand what types of mistakes the model makes.
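A minimal sketch of this architecture in Keras follows; the optimizer, number of epochs, and batch size are illustrative choices, not our exact training settings:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow import keras

# Encode string labels ("email", "first_name", ...) as integers for the
# sparse categorical cross-entropy loss.
encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)
y_test_enc = encoder.transform(y_test)

# Dense layers expect dense input, so densify the sparse feature matrices.
X_train_dense = np.asarray(X_train.todense())
X_test_dense = np.asarray(X_test.todense())

# One hidden layer of 512 rectified linear units, then a softmax over classes.
model = keras.Sequential([
    keras.layers.Dense(512, activation="relu",
                       input_shape=(X_train_dense.shape[1],)),
    keras.layers.Dense(len(encoder.classes_), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train_dense, y_train_enc, epochs=10, batch_size=256,
          validation_data=(X_test_dense, y_test_enc))
```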
Here we see three main types of errors:
- Misclassifying cities as last names.
- Confusing first and last names with each other.
- Confusing MD5 emails and online identifiers. This likely happens because some online identifiers take the same form as an MD5 hash: 32 hexadecimal characters.
Our final goal is to classify columns of strings, not just individual strings. Therefore we also experiment with an approach whereby we run the model on 100 example strings from a column and then choose the most frequent label assigned by the model. This approach correctly classifies 95.3% of columns, and its confusion matrix is shown in the following figure.
Overall, we’re excited by the high performance of this approach in classifying columns. The main remaining deficiency is misclassifying online-ids as md5_emails, and it may not be possible for any method to distinguish these types of ids, since many online-ids appear to have the same structure as MD5 hashes. Dropping online-id labels gives us a column-level accuracy of 97.4%.
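The column-level voting scheme might look like the following sketch, where `vectorize` stands in for whatever function applies the same count vectorization, designed features, and feature selection used at training time:

```python
import numpy as np

def classify_column(model, vectorize, strings, n_samples=100):
    """Classify a whole column by majority vote over up to n_samples strings."""
    sample = strings[:n_samples]
    X = vectorize(sample)                            # same features as training
    class_indices = model.predict(X).argmax(axis=1)  # per-string predictions
    values, counts = np.unique(class_indices, return_counts=True)
    return values[counts.argmax()]                   # most frequent label wins
```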
Conclusions and Productization
To use these methods in production, we’ve recently developed a simple RESTful web server to apply the feature engineering and the model on a sample of column strings. We’re hosting the web server on Kubernetes to provide a robust and fault-tolerant service. We found this Docker template very useful for hosting the Python web server using flask, uwsgi, and nginx. In the future, we may also experiment with adding file column labels as additional features to further improve the model.
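As an illustration of what such a service could look like, here is a minimal flask endpoint that reuses the hypothetical `classify_column`, `model`, `vectorize`, and `encoder` names from the sketches above; the route and payload shape are assumptions, not our production API:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/classify", methods=["POST"])
def classify():
    # Expect a JSON body like {"strings": ["jane.doe@example.com", ...]}.
    strings = request.get_json()["strings"]
    label_index = classify_column(model, vectorize, strings)
    return jsonify({"column_type": str(encoder.classes_[label_index])})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```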