November 16, 2018 Preparing Your Data for Predictive Analytics

Predictive analytics can help organizations streamline their operational processes, among many other business benefits.

November 14, 2018

By Kyle Seaman, Head of Product at Sentenai, and Sourav Dey, Co-Founder and CTO at Manifold

Predictive analytics is an undeniably valuable technology, with research indicating its market size could top $12 billion USD by 2022. Across a range of industries, businesses, and applications, using historical data to predict future outcomes can lead to greater operational efficiency in a variety of ways. Predictive analytics can enable organizations to streamline their operational processes, optimize their demand forecasting, drastically reduce downtime, and better understand their customers’ propensity to buy.

That said, reaping the benefits of predictive analytics requires a fair amount of engineering legwork. For instance, before applying machine learning techniques to identify the likelihood of future outcomes based on historical data, the data in question must be prepared for training those machine learning algorithms in the first place. Looking at the historical data, organizations and their data scientists need to determine which data is viable and how trustworthy it is, then transform it from its raw initial state into clean datasets that a machine learning algorithm can use.

Data preparation is a highly iterative process, and it can consume up to 80 percent of the time and resources needed to build predictive applications. To make data preparation more efficient and productive, particularly for organizations with disparate data silos or data lakes housing decades of data in dozens of different formats, consider following these five steps:

Create a normalized data format.
Before building any predictive analytics application, the data at hand needs to be normalized into a consistent format. This is the standard job of extract, transform and load (ETL). A good ETL process should convert the data from disparate forms into a normalized format — preferably in a database. Good ETL jobs also address any cryptic or inconsistent naming conventions for the various data fields. For example, in a case study for the digital oilfield, there were small JSON files and maintenance logs that needed to be ETLed into a single time-series record that could be used downstream. No matter the type of formatting required, the goal of this first step is to transform the data records’ format to make the data more useful.
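As a rough illustration of this step, the sketch below normalizes a handful of small JSON documents with inconsistent field names into a single time-ordered series. The field names, the alias table, and the assumption that timestamps arrive as Unix epoch seconds are all hypothetical; they are not taken from the case study.

```python
import json
from datetime import datetime, timezone

# Hypothetical mapping from inconsistent source field names to one canonical schema.
FIELD_ALIASES = {
    "ts": "timestamp", "time": "timestamp", "timestamp": "timestamp",
    "unit": "unit_id", "equip": "unit_id", "unit_id": "unit_id",
    "psi": "pressure_psi", "pressure": "pressure_psi",
}


def normalize_record(raw):
    """Rename known fields to canonical names and coerce their types."""
    record = {}
    for key, value in raw.items():
        canonical = FIELD_ALIASES.get(key.lower())
        if canonical is not None:  # unknown fields are dropped
            record[canonical] = value
    # Assumption: timestamps are Unix epoch seconds; store them as UTC datetimes.
    record["timestamp"] = datetime.fromtimestamp(
        float(record["timestamp"]), tz=timezone.utc
    )
    record["pressure_psi"] = float(record["pressure_psi"])
    return record


def build_time_series(json_documents):
    """Parse JSON strings and return records sorted by unit and then by time."""
    records = [normalize_record(json.loads(doc)) for doc in json_documents]
    return sorted(records, key=lambda r: (r["unit_id"], r["timestamp"]))
```

In practice the normalized records would be loaded into a database rather than kept in memory; the point here is only the renaming, type coercion, and ordering.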

Build trust in the data.
The next step is to explore the data to find any inconsistencies. Fix what you can, exclude what you can’t, and identify any supplemental data you might need. Going back to the oil field equipment example: after exploring and validating the equipment service records, we found that not all of the data was usable. Specifically, service technicians varied widely in how they recorded data in the maintenance software, and the root cause of a failure was often not recorded. For those reasons, we decided not to use these records and to rely solely on the machine-generated data instead. In addition, because the same units had been moved over the years, we needed to pull in ERP (enterprise resource planning) data to determine when the oil field equipment changed locations.
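A validation pass of this kind might look like the following sketch, which separates usable readings from excluded ones and records why each was excluded. The record layout (`unit_id`, `timestamp`, `value`) and the plausible-range bounds are illustrative assumptions, not details from the case study.

```python
def validate_readings(readings, min_value, max_value):
    """Split readings into (usable, excluded); excluded items carry a reason."""
    usable, excluded = [], []
    seen = set()  # (unit_id, timestamp) pairs already accepted
    for reading in readings:
        unit_id = reading.get("unit_id")
        timestamp = reading.get("timestamp")
        value = reading.get("value")
        if unit_id is None or timestamp is None:
            excluded.append((reading, "missing unit_id or timestamp"))
        elif (unit_id, timestamp) in seen:
            excluded.append((reading, "duplicate reading"))
        elif value is None:
            excluded.append((reading, "missing value"))
        elif not (min_value <= value <= max_value):
            excluded.append((reading, "value out of plausible range"))
        else:
            seen.add((unit_id, timestamp))
            usable.append(reading)
    return usable, excluded
```

Keeping the exclusion reasons makes it easier to review with subject matter experts which problems can be fixed at the source and which records have to be dropped.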

Define the response variable.
What are you trying to predict? The answer here is not always clear cut. For instance, in the oil field example, we originally thought we were going to predict certain types of failures. But exploration showed that the maintenance logs were not trustworthy labels. As such, we pivoted and moved to predicting long outages instead. Working with subject matter experts is very important at this stage, as they are often the best resource for understanding the business problems and the data.
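One way a "long outage" response variable could be made concrete is sketched below: an observation is labeled 1 if an outage of at least some minimum duration begins within a look-ahead horizon. The horizon, the duration threshold, and the representation of outages as (start, end) pairs are assumptions for illustration; the article does not say how the label was actually defined.

```python
from datetime import datetime, timedelta


def label_long_outage(observation_time, outages, horizon, min_duration):
    """Return 1 if a long outage starts within `horizon` after the observation.

    `outages` is a list of (start, end) datetime pairs. An outage counts as
    long when it lasts at least `min_duration`.
    """
    window_end = observation_time + horizon
    for start, end in outages:
        starts_in_window = observation_time < start <= window_end
        if starts_in_window and (end - start) >= min_duration:
            return 1
    return 0
```

Both thresholds are business decisions, so they are good candidates to settle with subject matter experts before any modeling begins.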

Create features from the data.
Feature engineering, or the creation of features from the data, is a critical step for most machine learning models. Though some modern deep learning methods claim not to require feature engineering, in practice they fare better if features are engineered from the raw data. Effectively, feature engineering is how domain knowledge about the problem is encoded as inputs to the model. For example, certain moving averages or trendlines can be engineered from the raw time series data. In other cases, a spectrogram could be created if there is signal in the frequency domain. The exact features to be engineered are always problem-specific, and the work should ideally be undertaken with domain experts.
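As a small example of a moving-average feature, the sketch below computes a trailing average that uses only the current and earlier values, so the feature at a given time step never depends on future readings. The function name and the choice of window are illustrative.

```python
def trailing_moving_average(values, window):
    """Average of the current value and up to `window - 1` preceding values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just fell out of the trailing window.
            running_sum -= values[i - window]
        count = min(i + 1, window)  # early points have a shorter history
        averages.append(running_sum / count)
    return averages


# Example: a 2-point trailing average over four readings.
print(trailing_moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```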

Build training, validation and test sets.
In the end, what matters is how your model is going to perform on new data. This is called generalization. Unfortunately, when you are training a model you don’t have access to new data; you only have the historical data. In order to assess generalization performance effectively, it’s important to carefully construct training, validation and test sets from your data. The training set is what you train your algorithm on, the validation set is what you use for hyperparameter tuning, and the test set is what you finally assess performance on.
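For time-ordered data, one common way to build the three sets is to split chronologically, so that the validation and test sets always come later than the training set. The sketch below assumes each record is a dict with a `timestamp` key; the 60/20/20 proportions are arbitrary defaults, not a recommendation from the authors.

```python
def chronological_split(records, train_frac=0.6, val_frac=0.2):
    """Split records into (train, validation, test) in time order."""
    if not (0 < train_frac and 0 < val_frac and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and sum to less than 1")
    ordered = sorted(records, key=lambda r: r["timestamp"])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    # Everything after the validation slice is held out as the test set.
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

A random shuffle would be simpler, but for time series it would mix later observations into the training set.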

There are a number of important considerations to be aware of when constructing these datasets. Most importantly, you should prevent data leakage, which occurs when the training set contains information about the test set that it should not have. Leakage leads you to believe the model will generalize better than it actually will on new data. With time series data especially, it’s easy to introduce leakage accidentally by using data from the future in the training set, or by letting time intervals overlap between sets. There may be additional considerations as well, such as rebalancing. This was important in our oil field case study, where the failure rate was only 2%. In this case, the training set needed to be rebalanced using a bootstrap sampling technique to make the incidence of failure 20–25%; otherwise the model would simply learn to always predict that the machine would not fail (and achieve 98% accuracy).
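The rebalancing idea can be sketched as follows: failure rows are resampled with replacement until they make up roughly the target share of the training set. This is only one plausible reading of "a bootstrap sampling technique"; the article does not give the exact procedure, and the `label` key, the default target rate, and the seed are assumptions. It should be applied to the training set only, leaving the validation and test sets at their natural failure rate.

```python
import math
import random


def rebalance_by_bootstrap(rows, target_rate=0.2, seed=0):
    """Oversample positive rows with replacement to reach about target_rate."""
    positives = [r for r in rows if r["label"] == 1]
    negatives = [r for r in rows if r["label"] == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative row")
    # Solve p / (p + n) = target_rate for the number of positives p.
    needed = math.ceil(target_rate * len(negatives) / (1 - target_rate))
    extra = max(0, needed - len(positives))
    rng = random.Random(seed)
    # Bootstrap draws: sample existing positives with replacement.
    resampled = [rng.choice(positives) for _ in range(extra)]
    balanced = negatives + positives + resampled
    rng.shuffle(balanced)
    return balanced
```

With 98 non-failures and 2 failures, as in a 2% failure rate, a target of 0.2 yields 25 failure rows out of 123, or about 20%.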

When preparing data for predictive analytics, organizations and their data scientists will often have a mental model of what their data should look like and what it should be able to predict. However, data is rarely in such a form at the outset. Especially if your data has not always been collected or labeled with predictive analytics in mind, it may need quite a bit of work to become useful. But that work can be tremendously worthwhile if you do it carefully and well. In fact, the end result can produce an invaluable competitive advantage for your organization.

About the Authors:

Kyle Seaman
Kyle Seaman leads product for Sentenai, a Boston-based company building infrastructure to automatically manage sensor data for advanced analytic applications. Kyle works with companies in the manufacturing and agriculture space building predictive maintenance and IoT applications. Previously, Kyle was Director of Farm Technology at Freight Farms, where he built a network of IoT-enabled, smart, hydroponic farms and led data science initiatives around yield optimization and farm performance.

Sourav Dey
Dr. Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google/Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds multiple patents for his work, has been published in several IEEE journals, and has won numerous awards. Sourav holds PhD, MS, and BS degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT).

Sentenai | Manifold



