April 28, 2017 Semi-Automated Conversion Projects: Best Practices

An overview of current processes, tools and methods for mitigating problems in a conversion project for high-volume legacy materials.

By Mark Gross, President, DCL

Many users have years or even decades of legacy materials they would like to bring into the 21st century. This could be thousands and thousands of pages of hard copy, PDF, SGML, or even XML that need to be brought up to date before they can be added to the current cache of materials. It would be nice if there were a totally “lights-out” solution that produced high-quality results. Sorry to say, after working on hundreds of key projects over the years at DCL, I have learned that for the most part there is no silver bullet. That’s the “bad” news.

Here’s the “good” news: There are ways to help mitigate problems in a conversion project. The more you do and know upfront before your project begins, and the better you manage it during the process, the better your results will be. I hope this article passes along what we have learned about the common pitfalls and solutions we’ve uncovered during DCL’s 35 years in business. I’d like to offer you an overview of our process and some of the tools we’ve developed to help in a semi-automated conversion process. Let’s first break down some common hurdles.

Finding the inconsistencies in legacy content

Source material and legacy material spend years, decades, and in the case of paper, even centuries, stuck in their non-structured format. Across data sources and decades, different authors march to the beat of their own drum, and they follow a variety of style guides. For example, callouts and citations can appear as superscript numbers, numbers in parentheses, or numbers in square brackets. Labels for figures and tables can be either fully spelled out or abbreviated. You can see how complexities can build up for projects like this.
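To make the callout example concrete, here is a minimal Python sketch that counts how often each callout style shows up in a block of text. The three patterns and the function names are my own illustrative assumptions, not a DCL tool, and a real collection would likely need more patterns than these.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical patterns for the three callout styles mentioned above.
CALLOUT_PATTERNS = {
    "superscript": re.compile(r"[\u00b9\u00b2\u00b3\u2070\u2074-\u2079]+"),
    "parentheses": re.compile(r"\(\d{1,3}\)"),
    "square_brackets": re.compile(r"\[\d{1,3}\]"),
}


def count_callout_styles(text: str) -> Counter:
    """Count how often each callout style appears in the text."""
    counts = Counter()
    for style, pattern in CALLOUT_PATTERNS.items():
        counts[style] = len(pattern.findall(text))
    return counts


def dominant_style(text: str) -> Optional[str]:
    """Return the most frequent callout style, or None if none were found."""
    style, hits = count_callout_styles(text).most_common(1)[0]
    return style if hits else None
```

Running a count like this per document or per decade during pre-analysis gives a rough picture of how many styles the conversion rules will have to handle. Note that the parentheses pattern will also match numbered list markers such as "(1)", so the counts are a prompt for human review rather than a final answer.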

Tagging is tricky business

There are multiple ways to mark up the same content, and you’ll need to decide which way is best. Document Type Definitions, or DTDs, define the structure of documents in a general way. They contain many optional tags, and some tags can be used in multiple structures, which leaves room for interpretation. You therefore need to make sure that tagging is done consistently, and in a way that properly interprets your materials. Without specific rules, the converted XML will likely be inconsistent, and the resulting materials will not render consistently.
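One simple way to spot inconsistent tagging across a batch is to compare which element names each document uses. The sketch below is an illustration under my own assumptions (the function names and the sample tags are hypothetical); it uses only Python's standard library and does not consult the DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_usage(xml_documents):
    """Map each element name to the number of documents that use it."""
    usage = Counter()
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        # A set, so each document counts at most once per element name.
        usage.update({element.tag for element in root.iter()})
    return usage


def inconsistent_tags(xml_documents):
    """Return element names used in some documents but not in all of them."""
    usage = tag_usage(xml_documents)
    total = len(xml_documents)
    return sorted(tag for tag, count in usage.items() if count < total)


docs = [
    "<article><title>A</title><p><italic>x</italic></p></article>",
    "<article><title>B</title><p><i>y</i></p></article>",
]
print(inconsistent_tags(docs))
```

A tag that appears in only some documents is not necessarily wrong, since optional tags are legitimate. It is simply a candidate for a rule in the specification document.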

Visual versus content

Automated tools will often decide tagging based on the look of the page. But the goal of XML is to mark up the content to retain its meaning, and that often means a need for human review.

Text extraction in PDFs

There are many tools for text extraction from PDF Normal files. They’re all wonderful tools, but none of them is perfect across all projects. Tools in common use include Adobe, Jade, and Gemini. We use all of these, plus a number of specialized tools developed at DCL.

Pre-analysis and zoning are a good way to start

Above we indicated some of the things to look out for when beginning a conversion. The following breaks down some possible tactics and solutions. At DCL, we have developed tools that have gotten us as close as possible to a fully automated process, with minimal human intervention. The goal is always to deliver high-quality, consistent results.

Pre-Analysis of a collection is essential. You need to decide which DTD best suits the dataset and what you need the data for. And you need to document your decision in a specification document. This document should be clearly laid out. It should be flexible and robust. It should be able to handle almost any situation thrown at it. Lastly, it should be updated as needed to handle new situations, and it should be available to all team members.

OCR/text extraction and proofreading go hand in hand

Unless document pages are very consistent, a zoning step prior to OCR/text extraction is often useful. Zoning identifies what needs to be captured. It classifies each structure, for instance as a text box, an image, or a table. It also defines the reading order of the page, which is necessary when you have multi-column pages.
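As a rough illustration of why reading order matters on multi-column pages, here is a small Python sketch that sorts zones column by column and then top to bottom. The Zone class, its field names, and the equal-width column assumption are all hypothetical choices of mine; zones that span several columns, such as a full-width heading, would need extra handling.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    kind: str      # "text", "image", or "table"
    left: float    # distance from the left edge of the page
    top: float     # distance from the top edge of the page
    width: float
    height: float


def reading_order(zones, page_width, columns=2):
    """Order zones left column first, top to bottom within each column.

    Assumes equal-width columns and that no zone spans more than one column.
    """
    column_width = page_width / columns

    def sort_key(zone):
        center_x = zone.left + zone.width / 2
        column = min(int(center_x // column_width), columns - 1)
        return (column, zone.top)

    return sorted(zones, key=sort_key)


page = [
    Zone("text", 320, 100, 250, 300),
    Zone("text", 30, 420, 250, 200),
    Zone("image", 30, 100, 250, 300),
    Zone("table", 320, 420, 250, 200),
]
for zone in reading_order(page, page_width=600):
    print(zone.kind, zone.left, zone.top)
```

Without a step like this, text extracted from a two-column page can come out interleaved line by line across both columns.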

Depending on the source – paper, PDF image, or PDF Normal – good OCR software will extract different elements. Proofreading followed by styling/pre-editing is the next step. DCL has developed text tools to help with the proofreading. Be aware of tricky bits like Os and zeros looking very similar. The same goes for Ls and 1s, and Zs and 2s. After this, our usual approach is to convert the result into a styled Word document, which is then reviewed and corrected where necessary.
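The look-alike characters above lend themselves to a simple automated flag for the proofreader. The following Python sketch is my own illustration, not one of DCL's text tools: it flags any token that mixes letters and digits and contains at least one of the confusable characters.

```python
import re

# The look-alike pairs named above: O/0, L/1, Z/2 (upper and lower case).
CONFUSABLE = set("Oo0Ll1Zz2")

TOKEN = re.compile(r"[A-Za-z0-9]+")


def suspicious_tokens(text):
    """Return tokens that mix letters and digits and include a look-alike."""
    flagged = []
    for token in TOKEN.findall(text):
        has_letter = any(ch.isalpha() for ch in token)
        has_digit = any(ch.isdigit() for ch in token)
        if has_letter and has_digit and any(ch in CONFUSABLE for ch in token):
            flagged.append(token)
    return flagged


print(suspicious_tokens("Invoice t0tal was 1O0 dollars in 2017 for part A45."))
```

This deliberately over-flags: a legitimate part number such as "B12" would be reported too. The point is to direct a human proofreader's attention, not to correct anything automatically.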

Let the automation begin – conversion and parsing

The styled Word documents are then run through our conversion software to create an XML file. If styling was done, the resulting output will be upwards of 90% accurate. The resulting document is then tested with a parser to make sure it follows the rules. The parser is software that analyzes the XML file to verify correctness and to indicate where corrections need to be made. Once parsing is complete and any flagged corrections have been made, we have valid XML, but we are not done yet.
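For readers who have not worked with a parser, here is a minimal Python sketch of the idea using the standard library. It is an assumption-laden stand-in: it only checks that the XML is well-formed and reports where the first problem is. Checking against the rules of a DTD requires a validating parser, which Python's standard library does not provide.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if the XML parses.

    Otherwise return (False, (line, column, message)) for the first error.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return False, (line, column, str(err))
    return True, None


print(check_well_formed("<article><title>Best Practices</title></article>"))
print(check_well_formed("<article><title>Best Practices</article>"))
```

The second call reports a mismatched tag along with its line and column, which is the kind of pointer an operator needs to make the correction.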

Adding back in the human element

At this point in the conversion, editorial review is often necessary. While the tagging may be technically correct, it may not have the right “meaning”. This is similar to the way a sentence can be grammatically correct but still not convey the right meaning, or any meaning. This review is done with viewing tools that render the XML, or lay it out in a way that’s readable and that conveys its meaning. The last step is quality control (QC). At DCL, we use QC software to process the XML as a last check to make sure that there are no discrepancies between the XML and the conversion step.
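One way to picture a final discrepancy check is a word-by-word comparison between the proofread text and the text content of the XML. The Python sketch below is only that, a picture: the function name and the choice of comparison are my assumptions for illustration, and it is not a description of DCL's QC software.

```python
import difflib
import xml.etree.ElementTree as ET


def text_discrepancies(source_text, xml_text):
    """List word-level differences between the source text and the XML text.

    Each item is (operation, words_in_source, words_in_xml).
    """
    root = ET.fromstring(xml_text)
    # Joining with spaces keeps adjacent block elements apart, but it will
    # also split a word that contains inline markup, such as H<sub>2</sub>O.
    xml_words = " ".join(root.itertext()).split()
    source_words = source_text.split()
    matcher = difflib.SequenceMatcher(a=source_words, b=xml_words, autojunk=False)
    return [
        (op, " ".join(source_words[i1:i2]), " ".join(xml_words[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


source = "Conversions are complex and need review"
xml = "<p>Conversions are <b>complex</b> and need revlew</p>"
print(text_discrepancies(source, xml))
```

An empty list means the words match one for one. Anything else is a place where text was dropped, added, or altered somewhere between proofreading and the final XML.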

So, what have we learned? Abraham Lincoln once said, “Give me six hours to chop down a tree, and I will spend the first four sharpening the axe.” Simply put – conversions are complex, and putting in the effort upfront will save you a lot of headaches in the end. Equally important is having the expertise to manage the process straight through to avoid time-consuming and costly speedbumps. All of this is vital to a successful conversion.

About the Author
Mark Gross, president of Data Conversion Laboratory (DCL), is an authority on XML implementation and document conversion. Prior to joining DCL in 1981, Gross was with the consulting practice of Arthur Young & Co. He has a B.S. degree in Engineering from Columbia University and an MBA from New York University. He also has taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML and SGML.

