Preparing Your Data for Predictive Analytics

Predictive analytics can enable organizations to streamline their operational processes, among many other benefits for their business.

November 14, 2018

By Kyle Seaman, Head of Product at Sentenai, and Sourav Dey, Co-Founder and CTO at Manifold

Predictive analytics is an undeniably valuable technology, with research indicating its market size could top $12 billion by 2022. Across a range of industries, businesses, and applications, using historical data to predict future outcomes can lead to greater operational efficiency in a variety of ways. Predictive analytics can enable organizations to streamline their operational processes, optimize their demand forecasting, drastically reduce downtime, and better understand their customers’ propensity to buy.

That said, reaping the benefits of predictive analytics requires a fair amount of engineering legwork. For instance, before applying machine learning techniques to identify the likelihood of future outcomes based on historical data, the data in question must be prepared for training those machine learning algorithms in the first place. Looking at the historical data, organizations and their data scientists need to determine which data is viable and how trustworthy it is, then transform it from its raw initial state into clean datasets that a machine learning algorithm can use.

Data preparation is a highly iterative process, and typically requires up to 80 percent of the time and resources needed to build predictive applications. To make data preparation more efficient and productive — particularly for organizations with disparate data silos, or data lakes housing decades of data in dozens of different formats — consider following these five steps:

  1. Create a normalized data format.
    Before building any predictive analytics application, the data at hand needs to be normalized into a consistent format. This is the standard job of extract, transform, and load (ETL). A good ETL process should convert the data from disparate forms into a normalized format — preferably in a database. Good ETL jobs also address any cryptic or inconsistent naming conventions for the various data fields. For example, in a case study for the digital oilfield, there were small JSON files and maintenance logs that needed to be ETLed into a single time-series record that could be used downstream. No matter the type of formatting required, the goal of this first step is to transform the data records’ format to make the data more useful. (A minimal sketch of this kind of normalization appears after this list.)
  2. Build trust in the data.
    The next step is to explore the data to find any inconsistencies. Fix what you can, exclude what you can’t, and identify any supplemental data you might need. Going back to the oil field equipment example: after exploring and validating the equipment service records, we found that not all the data was usable. Specifically, there was high variation in how each human service technician recorded data in the maintenance software. In addition, the root cause of the failure was often not recorded. For that reason, we decided not to use these records and instead rely solely on the machine-generated data. In addition, since the same units were moved over the years, we found it necessary to pull in ERP (enterprise resource planning) data to determine when the oil field equipment changed locations.
  3. Define the response variable.
    What are you trying to predict? The answer here is not always clear cut. For instance, in the oil field example, we originally thought we were going to predict certain types of failures. But exploration showed that the maintenance logs were not trustworthy labels. As such, we pivoted and moved to predicting long outages instead. Working with subject matter experts is very important at this stage, as they are often the best resource for understanding the business problems and the data.
  4. Create features from the data.
    Feature engineering, or the creation of features from the data, is a critical step for most machine learning models. Though some modern deep learning methods claim to not require feature engineering, in practice they fare better if features are engineered from the raw data. Effectively, feature engineering is how domain knowledge about the problem is encoded as inputs into the model. For example, certain moving averages or trendlines can be engineered from the raw time-series data. In other cases, a spectrogram could be created if there is signal in the frequency domain. The exact features to be engineered are always problem-specific, and the work should ideally be undertaken with domain experts. (A short feature-engineering sketch appears after this list.)
  5. Build training, validation and test sets.
    In the end, what matters is how your model is going to perform on new data. This is called generalization. Unfortunately, when you are training a model you don’t have access to new data — you only have the historical data. In order to assess generalization performance effectively, it’s important to carefully construct training, validation and test sets from your data. The training set is what you train your algorithm on, the validation set is what you use for hyperparameter tuning, and the test set is what you finally assess performance on.

    There are a number of important considerations to be aware of when constructing these datasets. Most importantly, you should prevent data leakage, which occurs when the training set knows more than it should about the test set. This will lead you to believe your generalization performance will be better than it actually will be on new data. Especially for time series data, it’s easy to introduce leakage accidentally by using data from the future in the training set — or by overlapping time intervals. There may be additional considerations as well, such as rebalancing. This was important in our oil field case study, where the failure rate was only 2%. In this case, the training set needed to be rebalanced using a bootstrap sampling technique to make the incidence of failure 20–25% — otherwise the model would simply learn to always predict that the machine would not fail (and achieve 98% accuracy). (A sketch of a time-ordered split with this kind of rebalancing appears after this list.)
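
For step 1, here is a minimal sketch, in Python, of the kind of ETL normalization described above. It assumes small JSON sensor files on disk; the directory layout, field names, and the FIELD_MAP renaming are hypothetical, not the actual schema from the oilfield case study.

```python
# Minimal ETL sketch: normalize small JSON sensor files into one
# time-indexed table. File layout and field names are hypothetical.
import json
from pathlib import Path

import pandas as pd

# Map cryptic or inconsistent source field names onto one naming convention.
FIELD_MAP = {"ts": "timestamp", "press_psi": "pressure_psi", "tmp": "temperature_c"}

def load_raw_records(raw_dir: str) -> pd.DataFrame:
    records = []
    for path in Path(raw_dir).glob("*.json"):
        with open(path) as f:
            raw = json.load(f)
        # Rename known fields to the normalized schema; keep unknown fields as-is.
        records.append({FIELD_MAP.get(k, k): v for k, v in raw.items()})
    df = pd.DataFrame(records)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df.sort_values("timestamp").set_index("timestamp")

# df = load_raw_records("raw_sensor_data/")
# df.to_sql("sensor_readings", engine)  # load the normalized table into a database
```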
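
For step 4, a short feature-engineering sketch: rolling averages, a simple trend, and rolling variability computed from a time-indexed sensor column. The column name and window sizes are illustrative; the right features depend on the problem and on input from domain experts.

```python
# Feature-engineering sketch: rolling statistics and a simple trend from a
# raw sensor time series. Column name and window sizes are illustrative.
import pandas as pd

def make_features(df: pd.DataFrame, col: str = "pressure_psi") -> pd.DataFrame:
    feats = pd.DataFrame(index=df.index)
    # Moving averages over short and long windows capture recent level shifts.
    feats[f"{col}_ma_1h"] = df[col].rolling("1h").mean()
    feats[f"{col}_ma_24h"] = df[col].rolling("24h").mean()
    # A crude trendline: gap between the short- and long-window averages.
    feats[f"{col}_trend"] = feats[f"{col}_ma_1h"] - feats[f"{col}_ma_24h"]
    # Rolling variability often carries signal about degrading equipment.
    feats[f"{col}_std_24h"] = df[col].rolling("24h").std()
    return feats.dropna()
```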
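
For step 5, a sketch of a time-ordered train/validation/test split followed by bootstrap rebalancing of a rare failure label. The cut-off dates, the "failure" column name, and the 25% target rate are assumptions for illustration; only the training set is rebalanced, so the validation and test sets still reflect the true failure incidence.

```python
# Time-ordered split plus bootstrap upsampling of the rare failure class.
# Splitting on time (rather than shuffling) keeps future data out of training.
import pandas as pd

def time_split(df: pd.DataFrame, train_end: str, val_end: str):
    train_end, val_end = pd.Timestamp(train_end), pd.Timestamp(val_end)
    train = df[df.index <= train_end]
    val = df[(df.index > train_end) & (df.index <= val_end)]
    test = df[df.index > val_end]
    return train, val, test

def rebalance(train: pd.DataFrame, label: str = "failure",
              target_rate: float = 0.25, seed: int = 0) -> pd.DataFrame:
    pos = train[train[label] == 1]
    neg = train[train[label] == 0]
    # Bootstrap (sample with replacement) the rare positives until they make up
    # roughly target_rate of the training set; validation and test stay untouched.
    n_pos = int(target_rate * len(neg) / (1 - target_rate))
    pos_boot = pos.sample(n=n_pos, replace=True, random_state=seed)
    return pd.concat([neg, pos_boot]).sample(frac=1, random_state=seed)

# train, val, test = time_split(features_df, "2017-12-31", "2018-06-30")
# train_balanced = rebalance(train)
```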

When preparing data for predictive analytics, organizations and their data scientists will often have a mental model of what their data should look like and what it should be able to predict. Rarely, however, is data in such a form at the outset. Especially if your data has not always been collected or labeled with predictive analytics in mind, it may need quite a bit of work to become useful. But that work can be tremendously worthwhile if you do it carefully and well. In fact, the end result can produce an invaluable competitive advantage for your organization.

About the Authors:

Kyle Seaman
Kyle Seaman leads product for Sentenai, a Boston-based company building infrastructure to automatically manage sensor data for advanced analytic applications. Kyle works with companies in the manufacturing and agriculture space, building predictive maintenance and IoT applications. Previously, Kyle was Director of Farm Technology at Freight Farms, where he built a network of IoT-enabled, smart, hydroponic farms and led data science initiatives around yield optimization and farm performance.

Sourav Dey
Dr. Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google/Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds multiple patents for his work, has been published in several IEEE journals, and has won numerous awards. Sourav holds PhD, MS, and BS degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT).

Sentenai | Manifold
 
