November 14, 2018
Predictive analytics is an undeniably valuable technology, with research indicating its market size could top $12 billion USD by 2022. Across a range of industries, businesses, and applications, using historical data to predict future outcomes can lead to greater operational efficiency in a variety of ways. Predictive analytics can enable organizations to streamline their operational processes, optimize their demand forecasting, drastically reduce downtime, and better understand their customers’ propensity to buy.
That said, reaping the benefits of predictive analytics requires a fair amount of engineering legwork. For instance, before applying machine learning techniques to identify the likelihood of future outcomes based on historical data, the data in question must be prepared for training those machine learning algorithms in the first place. Looking at the historical data, organizations and their data scientists need to determine which data is viable and how trustworthy it is, then transform it from its raw initial state into clean datasets that a machine learning algorithm can use.
Data preparation is a highly iterative process, and typically requires up to 80 percent of the time and resources needed to build predictive applications. To make data preparation more efficient and productive — particularly for organizations with disparate data silos, or data lakes housing decades of data in dozens of different formats — consider following these five steps:
- Create a normalized data format.
Before building any predictive analytics application, the data at hand needs to be normalized into a consistent format. This is the standard job of extract, transform and load (ETL). A good ETL process should convert the data from disparate forms into a normalized format — preferably in a database. Good ETL jobs also address any cryptic or inconsistent naming conventions for the various data fields. For example, in a case study for the digital oilfield, there were small JSON files and maintenance logs that needed to be ETLed into a single time-series record that could be used downstream. No matter the type of formatting required, the goal of this first step is to transform the data records’ format to make the data more useful.
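The ETL step above can be sketched in code. The following is a minimal, hypothetical example in the spirit of the oilfield case study: many small JSON files with inconsistent field names are merged into a single normalized time-series table. The file layout and field names (`unit_id`, `ts`/`time`, `pressure`/`press_psi`) are illustrative assumptions, not details from the actual project.

```python
import json
from pathlib import Path

import pandas as pd

def etl_sensor_files(raw_dir: str) -> pd.DataFrame:
    """Merge many small per-unit JSON files into one normalized time series.

    Field names here are hypothetical; real ETL jobs must map whatever
    cryptic or inconsistent names the source systems actually use.
    """
    records = []
    for path in Path(raw_dir).glob("*.json"):
        with open(path) as f:
            payload = json.load(f)
        for reading in payload["readings"]:
            records.append({
                "unit_id": payload["unit_id"],
                # normalize inconsistent field names to a single convention
                "timestamp": pd.to_datetime(reading.get("ts") or reading.get("time")),
                "pressure_psi": reading.get("pressure") or reading.get("press_psi"),
            })
    # a single, consistently ordered table for everything downstream
    df = pd.DataFrame(records)
    return df.sort_values(["unit_id", "timestamp"]).reset_index(drop=True)
```

The important design choice is that all downstream steps consume one table with one naming convention, so no later code needs to know about the raw formats.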
- Build trust in the data.
The next step is to explore the data to find any inconsistencies. Fix what you can, exclude what you can’t, and identify any supplemental data you might need. Going back to the oil field equipment example: after exploring and validating the equipment service records, we found that not all the data was usable. Specifically, there was high variation in how each service technician recorded data in the maintenance software, and the root cause of a failure was often not recorded at all. For those reasons, we decided not to use these records and instead rely solely on the machine-generated data. In addition, since the same units were moved over the years, we found it necessary to pull in ERP (enterprise resource planning) data to determine when the oil field equipment changed locations.
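A few simple checks can quantify whether a data source is trustworthy enough to use. Below is an illustrative sketch of auditing service-record data for the kinds of problems described above; the column names (`root_cause`, `note`) are hypothetical, not from the case study.

```python
import pandas as pd

def audit_service_records(logs: pd.DataFrame) -> dict:
    """Basic trust checks on free-text maintenance records.

    Column names are illustrative assumptions. High missingness or high
    spelling variation are signals that the source may not be usable.
    """
    notes = logs["note"].str.lower().str.strip()
    return {
        "rows": len(logs),
        # fraction of records with no recorded root cause
        "missing_root_cause": float(logs["root_cause"].isna().mean()),
        # after normalizing case/whitespace, how many distinct entries remain;
        # a large number relative to the known failure modes hints at
        # inconsistent free-text entry by technicians
        "distinct_note_values": int(notes.nunique()),
    }
```

Reports like this make the include/exclude decision explicit and repeatable, rather than a one-off judgment buried in a notebook.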
- Define the response variable.
What are you trying to predict? The answer here is not always clear cut. For instance, in the oil field example, we originally thought we were going to predict certain types of failures. But exploration showed that the maintenance logs were not trustworthy labels. As such, we pivoted to predicting long outages instead. Working with subject matter experts is very important at this stage, as they are often the best resource for understanding both the business problem and the data.
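When a pivot like this happens, the new response variable often has to be derived from the machine-generated data itself. Here is a hedged sketch of one way to label long outages from equipment heartbeat timestamps: any gap between consecutive readings longer than a threshold is treated as an outage. The 4-hour threshold is purely illustrative; the actual definition would come from subject matter experts.

```python
import pandas as pd

def label_long_outages(timestamps: pd.Series,
                       threshold: pd.Timedelta = pd.Timedelta(hours=4)) -> pd.Series:
    """Flag gaps between consecutive machine heartbeats that exceed a threshold.

    The threshold is an assumption for illustration; in practice it should
    be chosen with domain experts to match the business definition of an outage.
    """
    ts = timestamps.sort_values().reset_index(drop=True)
    gaps = ts.diff()              # time since the previous reading (NaT for the first)
    return gaps > threshold       # boolean response variable per reading
```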
- Create features from the data.
Feature engineering, or the creation of features from the data, is a critical step in most machine learning models. Though some modern deep learning methods claim not to require feature engineering, in practice they fare better when features are engineered from the raw data. Effectively, feature engineering is how domain knowledge about the problem is encoded as inputs to the model. For example, certain moving averages or trendlines can be engineered from the raw time series data. In other cases, a spectrogram could be created if there is signal in the frequency domain. The exact features to engineer are always problem-specific, and the work should ideally be undertaken with domain experts.
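The moving-average and frequency-domain ideas above can be sketched as follows. This is a minimal, generic example; the column name and the window sizes are illustrative assumptions, not values from the case study.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame, col: str = "pressure_psi") -> pd.DataFrame:
    """Derive simple time-series features from one raw sensor column.

    Window lengths (12, 32) are illustrative; real choices should be
    driven by the sensor's sampling rate and domain knowledge.
    """
    out = df.copy()
    # smoothed level: trailing moving average
    out[f"{col}_ma_12"] = out[col].rolling(12, min_periods=1).mean()
    # simple trend: change versus 12 samples ago
    out[f"{col}_trend_12"] = out[col].diff(12)
    # a crude frequency-domain feature: the strongest non-DC FFT magnitude
    # over a trailing window (a one-column stand-in for a spectrogram)
    out[f"{col}_fft_peak"] = (
        out[col].rolling(32, min_periods=32)
        .apply(lambda w: np.abs(np.fft.rfft(w - w.mean()))[1:].max(), raw=True)
    )
    return out
```

Each engineered column is a place where domain knowledge enters the model: the smoothing window encodes what counts as noise, and the FFT peak encodes a belief that periodic behavior matters.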
- Build training, validation and test sets.
In the end, what matters is how your model performs on new data. This is called generalization. Unfortunately, when you are training a model you don’t have access to new data — you only have the historical data. To assess generalization performance effectively, it’s important to carefully construct training, validation and test sets from your data. The training set is what you train your algorithm on, the validation set is what you use for hyperparameter tuning, and the test set is what you finally assess performance on.
There are a number of important considerations to be aware of when constructing these datasets. Most importantly, you should prevent data leakage, which occurs when the training set knows more than it should about the test set. This leads you to overestimate how well your model will generalize to new data. Especially for time series data, it’s easy to leak data accidentally by using data from the future in the training set — or by overlapping time intervals. There may be additional considerations as well, such as rebalancing. This was important in our oil field case study, where the failure rate was only 2%. In this case, the training set needed to be rebalanced using a bootstrap sampling technique to make the incidence of failure 20–25% — otherwise the model would simply learn to always predict that the machine would not fail (and achieve 98% accuracy).
When preparing data for predictive analytics, organizations and their data scientists will often have a mental model of what their data should look like, and what it should be able to predict—however, rarely is data in such a form at the onset. Especially if your data has not always been collected or labeled with predictive analytics in mind, it may need quite a bit of work to become useful. But that work can be tremendously worthwhile if you do it carefully and well. In fact, the end result can produce an invaluable competitive advantage for your organization.
About the Authors:
Kyle Seaman leads product for Sentenai, a Boston-based company building infrastructure to automatically manage sensor data for advanced analytic applications. Kyle works with companies in the manufacturing and agriculture space building predictive maintenance and IoT applications. Previously, Kyle was Director of Farm Technology at Freight Farms, where he built a network of IoT-enabled, smart, hydroponic farms and led data science initiatives around yield optimization and farm performance.
Dr. Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google/Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds multiple patents for his work, has been published in several IEEE journals, and has won numerous awards. Sourav holds PhD, MS, and BS degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT).