By Mark Gross, President, DCL
Many users have years or even decades of legacy materials they would like to bring into the 21st century. This could be thousands of pages of hard copy, PDF, SGML, or even XML that they need to bring up to date in order to add to their current cache of materials. It would be nice if there were a totally “lights-out” solution that produced high-quality results. I am sorry to say that, after working on hundreds of key projects over the years at DCL, I have learned that for the most part there is no silver bullet. That’s the “bad” news.
Here’s the “good” news: there are ways to mitigate problems in a conversion project. The more you do and know up front, and the better you manage the project as it runs, the better your results will be. I hope this article serves as a knowledge share on the common pitfalls and solutions we’ve uncovered during DCL’s 35 years in business. I’d like to offer you an overview of our process and some of the tools we’ve developed to support a semi-automated conversion. Let’s first break down some common hurdles.
Source and legacy materials spend years, decades, and, in the case of paper, even centuries stuck in their unstructured formats. Each data source, and each decade, brings authors who marched to the beat of their own drum, following a variety of style guides. For example, callouts and citations can be superscript numbers, numbers in parentheses, or numbers in square brackets. Labels in figures and tables can be either fully spelled out or abbreviated. You can see how complexities build up in projects like this.
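As a simple illustration of the normalization this variability requires, here is a minimal Python sketch that folds three common callout styles into a single form. The patterns and the target form are illustrative assumptions, not DCL’s actual rules; real projects need context (a bare “(3)” may be a list label, not a citation).

    import re

    # Recognize three common callout styles and normalize them to a
    # single form ("[n]") before tagging. Patterns are examples only.
    CALLOUT_PATTERNS = [
        re.compile(r"\((\d{1,3})\)"),         # numbers in parentheses: (12)
        re.compile(r"\[(\d{1,3})\]"),         # numbers in square brackets: [12]
        re.compile(r"<sup>(\d{1,3})</sup>"),  # superscripts as extracted: <sup>12</sup>
    ]

    def normalize_callouts(text: str) -> str:
        for pattern in CALLOUT_PATTERNS:
            text = pattern.sub(r"[\1]", text)
        return text

    print(normalize_callouts("Smith reported this earlier.(3) See also <sup>4</sup>."))
    # -> Smith reported this earlier.[3] See also [4].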
There are multiple ways to mark up the same content, and you’ll need to decide which way is best. Document Type Definitions, or DTDs, define the structure of documents in a general way: many of their tags are optional, and some can be used in multiple structures, leaving room for interpretation. You therefore need to make sure tagging is done consistently, and in a way that properly interprets your materials. Without specific conversion rules, the XML will likely be inconsistent, and the resulting materials will not render consistently.
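To make the problem concrete, here is a minimal Python sketch using lxml and a deliberately loose, made-up DTD fragment: two different taggings of the same phrase both validate, so a parser alone cannot enforce one house style.

    from io import StringIO
    from lxml import etree

    # An illustrative, deliberately loose DTD fragment: <para> may
    # contain either <emphasis> or <bold>, so both taggings of the same
    # content are valid. Only a written specification picks one.
    dtd = etree.DTD(StringIO("""
    <!ELEMENT para (#PCDATA | emphasis | bold)*>
    <!ELEMENT emphasis (#PCDATA)>
    <!ELEMENT bold (#PCDATA)>
    """))

    variant_a = etree.fromstring("<para>See <emphasis>Table 2</emphasis>.</para>")
    variant_b = etree.fromstring("<para>See <bold>Table 2</bold>.</para>")

    print(dtd.validate(variant_a), dtd.validate(variant_b))  # True True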
Automated tools often decide tagging based on the look of the page. But the goal of XML is to mark up content so that it retains its meaning, and that often calls for human review.
There are many tools for text extraction from PDF Normal files. They are all capable tools, but none is perfect across all projects. Tools in common use include Adobe, Jade, and Gemini. We use all of these, plus a number of specialized tools developed at DCL.
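As a stand-in for whichever extraction tool a project actually uses, here is a minimal sketch with pdfminer.six, a common open-source library; the file name is hypothetical.

    # pdfminer.six extracts the text layer of a "PDF Normal"
    # (text-based) file; the file name here is hypothetical.
    from pdfminer.high_level import extract_text

    text = extract_text("legacy_manual.pdf")
    print(text[:500])  # spot-check the first 500 characters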
Above we indicated some of the things to look out for when beginning a conversion. The following breaks down some possible tactics and solutions. At DCL, we have developed tools that have gotten us as close as possible to a fully automated process, with minimal human intervention. The goal is always to deliver high quality and consistent results.
Pre-analysis of a collection is essential. You need to decide what you need the data for and which DTD best suits the dataset, and you need to document those decisions in a specification document. The specification should be clearly laid out, flexible, and robust enough to handle almost any situation thrown at it. It should be updated as needed to handle new situations, and it should be available to all team members.
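One way to keep part of the specification enforceable is to hold it in machine-readable form alongside the written document. Here is a hypothetical Python sketch of a style-to-tag mapping; the style names and tags are invented for illustration.

    # One fragment of a specification kept as data: a mapping from
    # source styles to target XML tags. Names and tags are invented.
    STYLE_TO_TAG = {
        "Heading 1": "title",
        "Body Text": "para",
        "Caption": "caption",
    }

    def tag_for_style(style: str) -> str:
        if style not in STYLE_TO_TAG:
            # Unmapped styles trigger a documented decision rather than
            # letting each operator improvise, keeping the spec current.
            raise KeyError(f"style {style!r} is not in the specification")
        return STYLE_TO_TAG[style]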
Unless document pages are very consistent, a zoning step prior to OCR/text extraction is often useful. Zoning identifies what needs to be captured, determines whether each structure is, for instance, a text box, an image, or a table, and defines the reading order of the page, which is essential for multi-column pages.
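A minimal sketch of what zoning output might look like appears below; the field names and coordinates are illustrative assumptions.

    from dataclasses import dataclass

    # Zoning output for one page: what each region is, where it sits,
    # and where it falls in reading order. Field names are illustrative.
    @dataclass
    class Zone:
        kind: str            # "text", "image", or "table"
        bbox: tuple          # (x0, y0, x1, y1) page coordinates
        reading_order: int   # position in the page's reading sequence

    page_zones = [
        Zone("text", (72, 720, 300, 72), reading_order=1),    # left column
        Zone("text", (320, 720, 540, 400), reading_order=2),  # right column, top
        Zone("table", (320, 380, 540, 72), reading_order=3),  # right column, bottom
    ]

    for zone in sorted(page_zones, key=lambda z: z.reading_order):
        print(zone.reading_order, zone.kind, zone.bbox)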
Depending on the source – paper, PDF image, or PDF Normal – good OCR software will extract different elements. Proofreading, followed by styling/pre-editing, is the next step. DCL has developed text tools to help with the proofreading. Be aware of tricky bits: Os and zeros look very similar, as do Ls and 1s, and Zs and 2s. After this, our usual approach is to convert the result into a styled Word document, which is then reviewed and corrected where necessary.
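Here is a minimal sketch of one kind of proofreading aid: it flags tokens that mix letters and digits, where O/0, L/1, and Z/2 confusions tend to hide. The pattern is illustrative, not DCL’s actual tooling.

    import re

    # Flag tokens that mix letters and digits, where O/0, L/1, and Z/2
    # substitutions tend to hide. A human makes the final call, since
    # some mixed tokens (part numbers, "O2") are legitimate.
    SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

    def suspicious_tokens(text: str) -> list:
        return SUSPECT.findall(text)

    print(suspicious_tokens("The mode1 ran at 30 rpm with O2 sensors."))
    # -> ['mode1', 'O2']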
The styled Word documents are then run through our conversion software to create an XML file. If the styling step was done, the resulting output will be upwards of 90% accurate. The resulting document is then tested against the rules with a parser: software that analyzes the XML file to verify correctness and indicate where corrections need to be made. After parsing is complete we have valid XML, but we are not done yet.
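As a sketch of what this parsing pass looks like in practice, here is a minimal example using lxml; the file names are hypothetical, and the actual pipeline’s parser may differ.

    from lxml import etree

    # Parse the converted XML and validate it against the project DTD,
    # reporting where corrections are needed. File names are hypothetical.
    doc = etree.parse("converted_chapter.xml")
    dtd = etree.DTD(open("project.dtd", "rb"))

    if dtd.validate(doc):
        print("valid XML")
    else:
        for error in dtd.error_log.filter_from_errors():
            print(f"line {error.line}: {error.message}")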
At this point in the conversion, editorial review is often necessary. The tagging may be technically correct yet not carry the right “meaning” – much as a sentence can be grammatically correct but still fail to convey the intended meaning, or any meaning at all. This review is done with viewing tools that render the XML, or lay it out in a readable way that conveys its meaning. The last step is quality control (QC). At DCL, we use QC software to process the XML as a final check, making sure there are no discrepancies between the XML and the content at each conversion step.
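As one example of the kind of check QC software can run, here is a minimal sketch that compares the normalized text content of the XML against the normalized source text; it is a stand-in assumption, not DCL’s actual QC software.

    from lxml import etree

    # Compare the normalized text content of the XML against the
    # normalized source text so dropped or duplicated content surfaces
    # before delivery. File and variable names are hypothetical.
    def normalize(text: str) -> str:
        return " ".join(text.split())

    def content_matches(xml_path: str, source_text: str) -> bool:
        xml_text = " ".join(etree.parse(xml_path).getroot().itertext())
        return normalize(xml_text) == normalize(source_text)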
So, what have we learned? Abraham Lincoln once said, “Give me six hours to chop down a tree, and I will spend the first four sharpening the axe.” Simply put, conversions are complex, and putting in the effort up front will save you a lot of headaches in the end. Equally important is having the expertise to manage the process straight through, to avoid time-consuming and costly speed bumps. All of this is vital to a successful conversion.
About the Author
Mark Gross, president of Data Conversion Laboratory (DCL), is an authority on XML implementation and document conversion. Prior to joining DCL in 1981, Gross was with the consulting practice of Arthur Young & Co. He has a B.S. degree in Engineering from Columbia University and an MBA from New York University. He also has taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML and SGML.