November 14, 2018
Predictive analytics is an undeniably valuable technology, with research indicating its market size could top $12 billion USD by 2022. Across a range of industries, businesses, and applications, using historical data to predict future outcomes can lead to greater operational efficiency in a variety of ways. Predictive analytics can enable organizations to streamline their operational processes, optimize their demand forecasting, drastically reduce downtime, and better understand their customers’ propensity to buy.
That said, reaping the benefits of predictive analytics requires a fair amount of engineering legwork. For instance, before applying machine learning techniques to identify the likelihood of future outcomes based on historical data, the data in question must be prepared for training those machine learning algorithms in the first place. Looking at the historical data, organizations and their data scientists need to determine which data is viable and how trustworthy it is, then transform it from its raw initial state into clean datasets that a machine learning algorithm can use.
Data preparation is a highly iterative process, and typically requires up to 80 percent of the time and resources needed to build predictive applications. To make data preparation more efficient and productive — particularly for organizations with disparate data silos, or data lakes housing decades of data in dozens of different formats — consider following these five steps:
- Create a normalized data format.
Before building any predictive analytics application, the data at hand needs to be normalized into a consistent format. This is the standard job of extract, transform and load (ETL). A good ETL process should convert the data from disparate forms into a normalized format — preferably in a database. Good ETL jobs also address any cryptic or inconsistent naming conventions for the various data fields. For example, in a case study for the digital oilfield, there were small JSON files and maintenance logs that needed to be ETLed into a single time-series record that could be used downstream. No matter the type of formatting required, the goal of this first step is to transform the data records’ format to make the data more useful.
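The ETL step above can be sketched in code. The following is a minimal, hypothetical example in the spirit of the oilfield case study: many small JSON files with inconsistent field names are merged into a single normalized time-series table. The file layout and field names (`unit_id`, `ts`/`time`, `pressure`/`press_psi`) are illustrative assumptions, not details from the actual project.

```python
import json
from pathlib import Path

import pandas as pd

def etl_sensor_files(raw_dir: str) -> pd.DataFrame:
    """Merge many small per-unit JSON files into one normalized time series.

    Field names here are hypothetical; real ETL jobs must map whatever
    cryptic or inconsistent names the source systems actually use.
    """
    records = []
    for path in Path(raw_dir).glob("*.json"):
        with open(path) as f:
            payload = json.load(f)
        for reading in payload["readings"]:
            records.append({
                "unit_id": payload["unit_id"],
                # normalize inconsistent field names to a single convention
                "timestamp": pd.to_datetime(reading.get("ts") or reading.get("time")),
                "pressure_psi": reading.get("pressure") or reading.get("press_psi"),
            })
    # a single, consistently ordered table for everything downstream
    df = pd.DataFrame(records)
    return df.sort_values(["unit_id", "timestamp"]).reset_index(drop=True)
```

The important design choice is that all downstream steps consume one table with one naming convention, so no later code needs to know about the raw formats.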
- Build trust in the data.
The next step is to explore the data to find any inconsistencies. Fix what you can, exclude what you can’t, and identify any supplemental data you might need. Going back to the oil field equipment example: after exploring and validating the equipment service records, we found that not all the data was usable. Specifically, there was high variation in how each service technician recorded data in the maintenance software, and the root cause of a failure was often not recorded at all. For those reasons, we decided not to use these records and instead rely solely on the machine-generated data. In addition, since the same units were moved over the years, we found it necessary to pull in ERP (enterprise resource planning) data to determine when the oil field equipment changed locations.
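A few simple checks can quantify whether a data source is trustworthy enough to use. Below is an illustrative sketch of auditing service-record data for the kinds of problems described above; the column names (`root_cause`, `note`) are hypothetical, not from the case study.

```python
import pandas as pd

def audit_service_records(logs: pd.DataFrame) -> dict:
    """Basic trust checks on free-text maintenance records.

    Column names are illustrative assumptions. High missingness or high
    spelling variation are signals that the source may not be usable.
    """
    notes = logs["note"].str.lower().str.strip()
    return {
        "rows": len(logs),
        # fraction of records with no recorded root cause
        "missing_root_cause": float(logs["root_cause"].isna().mean()),
        # after normalizing case/whitespace, how many distinct entries remain;
        # a large number relative to the known failure modes hints at
        # inconsistent free-text entry by technicians
        "distinct_note_values": int(notes.nunique()),
    }
```

Reports like this make the include/exclude decision explicit and repeatable, rather than a one-off judgment buried in a notebook.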
- Define the response variable.
What are you trying to predict? The answer here is not always clear cut. For instance, in the oil field example, we originally thought we were going to predict certain types of failures. But exploration showed that the maintenance logs were not trustworthy labels. As such, we pivoted to predicting long outages instead. Working with subject matter experts is very important at this stage, as they are often the best resource for understanding both the business problem and the data.
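When a pivot like this happens, the new response variable often has to be derived from the machine-generated data itself. Here is a hedged sketch of one way to label long outages from equipment heartbeat timestamps: any gap between consecutive readings longer than a threshold is treated as an outage. The 4-hour threshold is purely illustrative; the actual definition would come from subject matter experts.

```python
import pandas as pd

def label_long_outages(timestamps: pd.Series,
                       threshold: pd.Timedelta = pd.Timedelta(hours=4)) -> pd.Series:
    """Flag gaps between consecutive machine heartbeats that exceed a threshold.

    The threshold is an assumption for illustration; in practice it should
    be chosen with domain experts to match the business definition of an outage.
    """
    ts = timestamps.sort_values().reset_index(drop=True)
    gaps = ts.diff()              # time since the previous reading (NaT for the first)
    return gaps > threshold       # boolean response variable per reading
```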
- Create features from the data.
Feature engineering, or the creation of features from the data, is a critical step in most machine learning models. Though some modern deep learning methods claim not to require feature engineering, in practice they fare better when features are engineered from the raw data. Effectively, feature engineering is how domain knowledge about the problem is encoded as inputs to the model. For example, certain moving averages or trendlines can be engineered from the raw time series data. In other cases, a spectrogram could be created if there is signal in the frequency domain. The exact features to engineer are always problem-specific, and the work should ideally be undertaken with domain experts.
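The moving-average and frequency-domain ideas above can be sketched as follows. This is a minimal, generic example; the column name and the window sizes are illustrative assumptions, not values from the case study.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, col: str = "pressure_psi") -> pd.DataFrame:
    """Derive simple time-series features from one raw sensor column.

    Window lengths (12, 32) are illustrative; real choices should be
    driven by the sensor's sampling rate and domain knowledge.
    """
    out = df.copy()
    # smoothed level: trailing moving average
    out[f"{col}_ma_12"] = out[col].rolling(12, min_periods=1).mean()
    # simple trend: change versus 12 samples ago
    out[f"{col}_trend_12"] = out[col].diff(12)
    # a crude frequency-domain feature: the strongest non-DC FFT magnitude
    # over a trailing window (a one-column stand-in for a spectrogram)
    out[f"{col}_fft_peak"] = (
        out[col].rolling(32, min_periods=32)
        .apply(lambda w: np.abs(np.fft.rfft(w - w.mean()))[1:].max(), raw=True)
    )
    return out
```

Each engineered column is a place where domain knowledge enters the model: the smoothing window encodes what counts as noise, and the FFT peak encodes a belief that periodic behavior matters.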
- Build training, validation and test sets.
In the end, what matters is how your model performs on new data. This is called generalization. Unfortunately, when you are training a model you don’t have access to new data — you only have the historical data. To assess generalization performance effectively, it’s important to carefully construct training, validation and test sets from your data. The training set is what you train your algorithm on, the validation set is what you use for hyperparameter tuning, and the test set is what you finally assess performance on.
There are a number of important considerations to be aware of when constructing these datasets. Most importantly, you should prevent data leakage, which occurs when the training set knows more than it should about the test set. This leads you to overestimate how well your model will generalize to new data. Especially for time series data, it’s easy to leak data accidentally by using data from the future in the training set — or by overlapping time intervals. There may be additional considerations as well, such as rebalancing. This was important in our oil field case study, where the failure rate was only 2%. In this case, the training set needed to be rebalanced using a bootstrap sampling technique to make the incidence of failure 20–25% — otherwise the model would simply learn to always predict that the machine would not fail (and achieve 98% accuracy).
When preparing data for predictive analytics, organizations and their data scientists will often have a mental model of what their data should look like, and what it should be able to predict—however, rarely is data in such a form at the onset. Especially if your data has not always been collected or labeled with predictive analytics in mind, it may need quite a bit of work to become useful. But that work can be tremendously worthwhile if you do it carefully and well. In fact, the end result can produce an invaluable competitive advantage for your organization.
About the Authors:
Kyle Seaman leads product for Sentenai, a Boston-based company building infrastructure to automatically manage sensor data for advanced analytic applications. Kyle works with companies in the manufacturing and agriculture space building predictive maintenance and IoT applications. Previously, Kyle was Director of Farm Technology at Freight Farms, where he built a network of IoT-enabled, smart, hydroponic farms and led data science initiatives around yield optimization and farm performance.
Dr. Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google/Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds multiple patents for his work, has been published in several IEEE journals, and has won numerous awards. Sourav holds PhD, MS, and BS degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT).