Semi-Automated Conversion Projects: Best Practices

April 28, 2017

An overview of current processes, tools and methods for mitigating problems in a conversion project for high-volume legacy materials.

By Mark Gross, President, DCL

Many users have years or even decades of legacy materials they would like to bring into the 21st century. This could be thousands and thousands of pages of hard copy, PDF, SGML, or even XML that need to be brought up to date before they can be added to a current cache of materials. It would be nice if there were a totally “lights-out” solution that would produce high-quality results. Sorry to say, after working on hundreds of key projects over the years at DCL, I have learned that for the most part there is no silver bullet. That’s the “bad” news.

Here’s the “good” news: there are ways to mitigate problems in a conversion project. The more you know and do upfront, before your project begins, and the better you manage it during the process, the better your results will be. Hopefully this article will act as a knowledge share on the common pitfalls and solutions we’ve uncovered during DCL’s 35 years in business. I’d like to offer an overview of our process and some of the tools we’ve developed to help in a semi-automated conversion process. Let’s first break down some common hurdles.

Finding the inconsistencies in legacy content

Source material and legacy material spend years, decades, and in the case of paper, even centuries stuck in a non-structured format. Authors from each data source or each decade march to the beat of their own drum and follow a variety of style guides. For example, callouts and/or citations can be superscript numbers, numbers in parentheses, or numbers in square brackets. Labels in figures and tables can be either fully spelled out or abbreviated. You can see how complexities build up in projects like this.
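
To make the cleanup concrete, here is a minimal Python sketch, illustrative only and not DCL’s tooling, that rewrites two of those callout styles to a single convention; the <citation> element name is a hypothetical placeholder:

```python
import re

# The same citation expressed two different ways, as often happens
# across decades of legacy material.
samples = [
    "as shown in earlier work (12)",   # parenthesized callout
    "as shown in earlier work [12]",   # bracketed callout
]

# Rewrite (12) and [12] callouts as a single hypothetical element.
CALLOUT = re.compile(r"[\(\[](\d{1,3})[\)\]]")

def normalize(text: str) -> str:
    """Map parenthesized and bracketed callouts to one convention."""
    return CALLOUT.sub(r"<citation>\1</citation>", text)

for s in samples:
    print(normalize(s))
# Superscript callouts flattened to bare numbers ("...earlier work12")
# slip past a pattern like this, which is one reason human review
# stays in the loop.
```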

Tagging is tricky business

There are multiple ways to mark up the same content, and you’ll need to decide which way is best. Document Type Definitions (DTDs), which define the structure of documents in a general way, contain many optional tags, and some tags can be used in multiple structures, leaving room for interpretation. You therefore need to make sure that tagging is done consistently, and in a way that properly interprets your materials. Without specific conversion rules, the XML will likely be inconsistent, and the resulting materials will not render consistently.

Visual versus content

Automated tools often decide tagging based on the look of the page. But the goal of XML is to mark up content so that it retains its meaning, and that often requires human review.
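
As a small illustration of the difference, consider the same bold line tagged two ways; the element names below are hypothetical and not from any particular DTD:

```python
# A looks-based pass sees only bold text and tags the appearance.
LOOKS_BASED = "<para><b>Warning: disconnect power first.</b></para>"

# A content-based pass tags what the line actually is, so downstream
# systems can find, style, and reuse warnings as warnings.
MEANING_BASED = """<warning>
  <para>Disconnect power first.</para>
</warning>"""
```

Both fragments might validate, but only the second preserves the meaning; deciding between such options is exactly what the specification document discussed below is for.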

Text extraction in PDFs

There are many tools for text extraction from PDF normal (text-based) files. They’re all capable tools, but none of them is perfect across all projects. Tools in common use include Adobe, Jade, and Gemini. We use all of these, plus a number of specialized tools developed at DCL.

Pre-analysis and zoning are a good way to start

Above, we indicated some of the things to look out for when beginning a conversion. The following breaks down some possible tactics and solutions. At DCL, we have developed tools that get us as close as possible to a fully automated process, with minimal human intervention. The goal is always to deliver high-quality, consistent results.

Pre-analysis of a collection is essential. You need to decide which DTD best suits the dataset and what you need the data for, and you need to document those decisions in a specification document. The specification should be clearly laid out, flexible, and robust enough to handle almost any situation thrown at it. It should be updated as needed to handle new situations, and it should be available to all team members.

OCR/text extraction and proofreading go hand in hand

Unless document pages are very consistent, a zoning step prior to OCR/text extraction is often useful. It identifies what needs to be captured, detects whether each structure is a text box, an image, or a table, for instance, and defines the reading order of the page, which is necessary when you have multi-column pages.
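
As a sketch of what a zoning pass can look like, here is one way to pull typed blocks and a rough reading order from a PDF using the PyMuPDF library (one tool among many, not necessarily DCL’s; the file name and the column-width heuristic are assumptions for illustration):

```python
import fitz  # PyMuPDF

doc = fitz.open("legacy_manual.pdf")  # hypothetical input file
for page in doc:
    # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type)
    # tuples, where block_type 0 is text and 1 is an image.
    blocks = page.get_text("blocks")
    # Crude reading order for multi-column pages: bucket blocks into
    # columns by left edge, then read each column top to bottom.
    blocks.sort(key=lambda b: (round(b[0] / 250), b[1]))
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        kind = "image" if block_type == 1 else "text"
        print(f"page {page.number}, {kind}: {text[:50]!r}")
```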

Depending on the source – paper, PDF image, or PDF normal – good OCR software will extract different elements. Proofreading, followed by styling/pre-editing, is the next step. DCL has developed text tools to help with the proofreading. Be aware of tricky bits: the letter O and the digit zero look very similar, and the same goes for lowercase Ls and 1s, and Zs and 2s. After this, our usual approach is to convert the result of this step into a styled Word document, which is reviewed and corrected where necessary.
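
A cheap first pass for those look-alikes, sketched here in Python (illustrative, not DCL’s proofreading tools), is to flag every token that mixes letters and digits and queue the hits for a human:

```python
import re

# Tokens that mix letters and digits are where O/0, l/1, and Z/2
# confusions hide. Every hit is only a candidate; a proofreader
# decides which ones are real OCR errors.
MIXED_TOKEN = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

def suspicious_tokens(text: str) -> list[str]:
    """Return candidate tokens for proofreading, not confirmed errors."""
    return MIXED_TOKEN.findall(text)

print(suspicious_tokens("The mode1 runs at 60 Hz and stores 1O24 values."))
# -> ['mode1', '1O24']
```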

Let the automation begin – conversion and parsing

The styled Word documents are then run through our conversion software to create an XML file. If the styling step was done, the resulting output will be upwards of 90% accurate. The resulting document is then tested with a parser to make sure it follows the rules. The parser is software that analyzes the XML file to verify correctness and to indicate where corrections need to be made. After parsing is complete, we have valid XML, but we are not done yet.
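
A parser run can be as simple as this Python sketch using the lxml library; the DTD and XML file names are hypothetical, and DCL’s production parsers may differ:

```python
from lxml import etree

# Load the project DTD and the freshly converted XML.
dtd = etree.DTD(open("project.dtd"))  # hypothetical DTD file
root = etree.parse("converted_chapter.xml").getroot()

if dtd.validate(root):
    print("Valid XML")
else:
    # Each log entry points at the line that needs correction.
    for error in dtd.error_log:
        print(f"line {error.line}: {error.message}")
```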

Adding back in the human element

At this point in the conversion, editorial review is often necessary. While the tagging may be technically correct, it may not carry the right “meaning”, much as a sentence can be grammatically correct and still fail to convey the right meaning, or any meaning at all. This review is done with viewing tools that render the XML, or lay it out in a way that is readable and conveys its meaning. The last step is quality control (QC). At DCL, we use QC software to process the XML as a last check, to make sure that no discrepancies crept in during the conversion steps.
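
One QC probe, sketched here in Python with hypothetical file names (DCL’s QC suite is more involved), is to strip the tags from the XML and diff the remaining words against the proofread source, so that dropped or duplicated text stands out:

```python
import difflib
from lxml import etree

# Text content of the converted XML, with all markup stripped.
root = etree.parse("converted_chapter.xml").getroot()
xml_words = " ".join(root.itertext()).split()

# The proofread text from the earlier styling step.
source_words = open("proofread_source.txt", encoding="utf-8").read().split()

# Any insertion or deletion in the word stream is a discrepancy
# worth a human look.
for line in difflib.unified_diff(source_words, xml_words, lineterm="", n=2):
    print(line)
```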

So, what have we learned? Abraham Lincoln once said, “Give me six hours to chop down a tree, and I will spend the first four sharpening the axe.” Simply put, conversions are complex, and putting in the effort upfront will save you a lot of headaches in the end. Equally important is having the expertise to manage the process straight through, to avoid time-consuming and costly speedbumps. All of this is vital to a successful conversion.

About the Author
Mark Gross, president of Data Conversion Laboratory (DCL), is an authority on XML implementation and document conversion. Prior to joining DCL in 1981, Gross was with the consulting practice of Arthur Young & Co. He has a B.S. degree in Engineering from Columbia University and an MBA from New York University. He also has taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML and SGML.

Data Conversion Laboratory


 
