November 14, 2018
By Kyle Seaman, Head of Product at Sentenai, and Sourav Dey, Co-Founder and CTO at Manifold
Predictive analytics is an undeniably valuable technology, with research indicating its market size could top $12 billion USD by 2022. Across a range of industries, businesses, and applications, using historical data to predict future outcomes can lead to greater operational efficiency in a variety of ways. Predictive analytics can enable organizations to streamline their operational processes, optimize their demand forecasting, drastically reduce downtime, and better understand their customers’ propensity to buy.
That said, reaping the benefits of predictive analytics requires a fair amount of engineering legwork. For instance, before applying machine learning techniques to identify the likelihood of future outcomes based on historical data, the data in question must be prepared for training those machine learning algorithms in the first place. Looking at the historical data, organizations and their data scientists need to determine which data is viable and how trustworthy it is, then transform it from its raw initial state into clean datasets that a machine learning algorithm can use.
Data preparation is a highly iterative process, and it can take up to 80 percent of the time and resources needed to build predictive applications. To make data preparation more efficient and productive, particularly for organizations with disparate data silos or data lakes housing decades of data in dozens of different formats, consider following these five steps:
There are a number of important considerations to be aware of when constructing these datasets. Most importantly, you should prevent data leakage, which occurs when the training set knows more than it should about the test set. Leakage will lead you to believe that the model generalizes better than it actually will on new data. With time series data in particular, it is easy to introduce leakage by accident, either by using data from the future in the training set or by letting the training and test time intervals overlap. There may be additional considerations as well, such as rebalancing. This was important in our oil field case study, where the failure rate was only 2%. There, the training set needed to be rebalanced using a bootstrap sampling technique to raise the incidence of failure to 20–25%; otherwise the model would simply learn to always predict that the machine would not fail (and achieve 98% accuracy).
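The two safeguards described here, splitting by time and rebalancing only the training set, can be sketched in a few lines of Python. This is a minimal illustration using the standard library, not the method used in the case study: the record fields (`t`, `failed`), the function names, the cutoff, the 22% target, and the synthetic data are all assumptions chosen for the example.

```python
import random


def chronological_split(records, cutoff):
    """Split by timestamp so that every training record precedes every test record."""
    train = [r for r in records if r["t"] < cutoff]
    test = [r for r in records if r["t"] >= cutoff]
    return train, test


def rebalance_by_bootstrap(train, target_rate, seed=0):
    """Resample failures with replacement until they make up about target_rate of the set."""
    rng = random.Random(seed)
    failures = [r for r in train if r["failed"]]
    normals = [r for r in train if not r["failed"]]
    if not failures or not normals:
        return list(train)  # nothing to rebalance
    # Solve n_fail / (n_fail + len(normals)) = target_rate for n_fail.
    n_fail = round(target_rate * len(normals) / (1 - target_rate))
    resampled = [rng.choice(failures) for _ in range(n_fail)]
    return normals + resampled


def failure_rate(rows):
    return sum(r["failed"] for r in rows) / len(rows)


# Synthetic, hypothetical data: 1,000 time-ordered records with a 2% failure rate.
records = [{"t": i, "failed": i % 50 == 0} for i in range(1000)]

# Train on the past, test on the future; the two time ranges do not overlap.
train, test = chronological_split(records, cutoff=800)

# Rebalance the training set only; the test set keeps the real-world failure rate.
balanced_train = rebalance_by_bootstrap(train, target_rate=0.22)

# Accuracy of a trivial model that always predicts "no failure" on the test set.
baseline_accuracy = sum(not r["failed"] for r in test) / len(test)

print(f"train failure rate before: {failure_rate(train):.3f}")
print(f"train failure rate after:  {failure_rate(balanced_train):.3f}")
print(f"always-'no failure' test accuracy: {baseline_accuracy:.2f}")
```

The test set is left untouched on purpose: resampling it would hide how the model behaves at the true failure rate. The last printed figure shows why accuracy alone is misleading on data this imbalanced.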
When preparing data for predictive analytics, organizations and their data scientists will often have a mental model of what their data should look like and what it should be able to predict. However, data is rarely in such a form at the outset. Especially if your data has not always been collected or labeled with predictive analytics in mind, it may need quite a bit of work to become useful. But that work can be tremendously worthwhile if you do it carefully and well. In fact, the end result can be an invaluable competitive advantage for your organization.
About the Authors:
Kyle Seaman
Kyle Seaman leads product for Sentenai, a Boston-based company building infrastructure to automatically manage sensor data for advanced analytic applications. Kyle works with companies in the manufacturing and agriculture space building predictive maintenance and IoT applications. Previously, Kyle was Director of Farm Technology at Freight Farms, where he built a network of IoT-enabled, smart, hydroponic farms and led data science initiatives around yield optimization and farm performance.
Sourav Dey
Dr. Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google/Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds multiple patents for his work, has been published in several IEEE journals, and has won numerous awards. Sourav holds PhD, MS, and BS degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT).