April 28, 2017 Semi-Automated Conversion Projects: Best Practices

An overview of current processes, tools and methods for mitigating problems in a conversion project for high-volume legacy materials.

By Mark Gross, President, DCL

Many users have years or even decades of legacy materials they would like to bring into the 21st century. This could be thousands and thousands of pages of hard copy, PDF, SGML, or even XML that need to be brought up to date before they can be added to the current cache of materials. It would be nice if there were a totally “lights-out” solution that produced high-quality results. Sorry to say, after working on hundreds of key projects over the years at DCL, I have learned that for the most part there is no silver bullet. That’s the “bad” news.

Here’s the “good” news: There are ways to help mitigate problems in a conversion project. The more you do and know upfront before your project begins, and the better you manage it during the process, the better your results will be. Hopefully this article will act as a knowledge share on the common pitfalls and solutions we’ve uncovered during DCL’s 35 years in business. I’d like to offer you an overview of our process and some of the tools we’ve developed to help in a semi-automated conversion process. Let’s first break down some common hurdles.

Finding the inconsistencies in legacy content

Source and legacy materials can spend years, decades, and in the case of paper, even centuries, stuck in an unstructured format. Across data sources and decades, different authors march to the beat of their own drum, and style guides vary. For example, callouts and citations can appear as superscript numbers, numbers in parentheses, or numbers in square brackets. Labels for figures and tables can be either fully spelled out or abbreviated. You can see how complexities build up in projects like this.
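One way to get a feel for this kind of inconsistency is to count which callout styles a batch of text actually uses. The following is a minimal Python sketch, not a DCL tool: the pattern names and the `count_callout_styles` function are hypothetical, and the regular expressions are deliberately simple.

```python
import re
from collections import Counter

# Hypothetical patterns for the three callout styles mentioned above.
CALLOUT_PATTERNS = {
    "square_brackets": re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]"),
    "parentheses": re.compile(r"\(\d+(?:\s*[,-]\s*\d+)*\)"),
    # Unicode superscript digits 0-9.
    "superscript": re.compile(r"[\u00b9\u00b2\u00b3\u2070\u2074-\u2079]+"),
}


def count_callout_styles(text: str) -> Counter:
    """Count how many callouts of each style appear in the text."""
    counts = Counter()
    for style, pattern in CALLOUT_PATTERNS.items():
        counts[style] = len(pattern.findall(text))
    return counts


sample = "Earlier work [1, 2] disagrees with (3), as noted before\u00b9\u00b2."
print(count_callout_styles(sample))
```

A report like this only shows where to look. A parenthesized year such as (1999) would be counted as a callout too, so the results still need a human eye.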

Tagging is tricky business

There are multiple ways to mark up the same content, and you’ll need to decide which way is best. Document Type Definitions, or DTDs, define the structure of documents in a general way. They contain many optional tags, and some tags can be used in multiple structures, which leaves room for interpretation. You therefore need to make sure that tagging is done consistently, and in a way that properly interprets your materials. Without specific rules, the converted XML will likely be inconsistent, and the resulting materials will not render consistently.

Visual versus content

Automated tools will often decide tagging based on the look of the page. But the goal of XML is to mark up the content so that it retains its meaning, and that often requires human review.

Text extraction in PDFs.

There are many tools for text extraction from PDF Normal files. They’re all wonderful tools, but none of them is perfect across all projects. Tools in common use include Adobe, Jade, and Gemini. We use all of these, plus a number of specialized tools developed at DCL.

Pre-analysis and Zoning are a good way to start.

Above we indicated some of the things to look out for when beginning a conversion. The following breaks down some possible tactics and solutions. At DCL, we have developed tools that have gotten us as close as possible to a fully automated process, with minimal human intervention. The goal is always to deliver high quality and consistent results.

Pre-analysis of a collection is essential. You need to decide which DTD best suits the dataset and what you need the data for, and you need to document that decision in a specification document. The specification should be clearly laid out, and flexible and robust enough to handle almost any situation thrown at it. Lastly, it should be updated as needed to handle new situations, and it should be available to all team members.

OCR/text extraction and proofreading go hand in hand.

Unless document pages are very consistent, a zoning step prior to OCR/text extraction is often useful. Zoning identifies what needs to be captured and classifies each structure on the page: a text box, an image, or a table, for instance. It also defines the reading order of the page, which is necessary when you have multi-column pages.
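To illustrate why reading order matters on multi-column pages, here is a minimal Python sketch. It assumes zones have already been identified as rectangles measured from the top-left corner of the page, and that columns are of equal width. The `Zone` class and `reading_order` function are hypothetical names for illustration, not part of any particular zoning product.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    kind: str      # "text", "image", or "table"
    left: float    # distance from the left edge of the page
    top: float     # distance from the top edge of the page
    width: float
    height: float


def reading_order(zones, page_width, columns=2):
    """Sort zones column by column, top to bottom within each column."""
    column_width = page_width / columns

    def sort_key(zone):
        center = zone.left + zone.width / 2
        column = min(int(center // column_width), columns - 1)
        return (column, zone.top)

    return sorted(zones, key=sort_key)


page = [
    Zone("text", left=320, top=50, width=250, height=200),
    Zone("text", left=30, top=400, width=250, height=300),
    Zone("image", left=30, top=50, width=250, height=300),
    Zone("table", left=320, top=300, width=250, height=400),
]
for zone in reading_order(page, page_width=600):
    print(zone.kind, zone.left, zone.top)
```

Sorting by vertical position alone would interleave the two columns. Zones that span the full page width, such as a headline, would need extra handling that this sketch leaves out.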

Depending on the source (paper, PDF image, or PDF normal), good OCR software will extract different elements. Proofreading, followed by styling and pre-editing, is the next step. DCL has developed text tools to help with the proofreading. Be aware of tricky characters that look very similar, such as Os and zeros, Ls and 1s, and Zs and 2s. After this, our usual approach is to convert the result into a styled Word document, which is reviewed and corrected where necessary.
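As an illustration of the kind of check that can support proofreading, the Python sketch below flags tokens that mix letters and digits and contain one of the look-alike characters mentioned above. It is an assumption-laden example, not a description of DCL's tools; `LOOKALIKES` and `suspicious_tokens` are hypothetical names.

```python
import re

# Look-alike characters named in the text: O/0, L/1, Z/2.
LOOKALIKES = set("Oo0Ll1Zz2")
TOKEN = re.compile(r"[A-Za-z0-9]+")


def suspicious_tokens(text):
    """Return tokens that mix letters and digits and include a look-alike."""
    flagged = []
    for token in TOKEN.findall(text):
        has_letter = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if has_letter and has_digit and any(c in LOOKALIKES for c in token):
            flagged.append(token)
    return flagged


sample = "The t0tal was 1O0 units in z0ne 2, per section 4b."
print(suspicious_tokens(sample))
```

A check like this only produces candidates for a proofreader. Legitimate part numbers or identifiers will be flagged as well, and a misread that happens to produce a real word will slip through.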

Let the automation begin – conversion and parsing.

The styled Word documents are then run through our conversion software to create an XML file. If the styling was done properly, the resulting output will be upwards of 90% accurate. The resulting document is then tested with a parser to make sure it follows the rules. The parser is software that analyzes the XML file to verify correctness and to indicate where corrections need to be made. Once parsing is complete we have valid XML, but we are not done yet.
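For a sense of what a parser reports, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. Note the limitation: this module checks only that the XML is well-formed. Validating against a DTD requires a validating parser, which this sketch does not cover. The function name `check_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if the XML parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"
    return True, None


good = "<article><title>Sample</title></article>"
bad = "<article><title>Sample</article>"  # <title> is never closed

print(check_well_formed(good))
print(check_well_formed(bad))
```

The error position tells an operator where to start looking, which matches the parser's role described above: verify correctness and point to what needs fixing.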

Adding back in the human element

At this point in the conversion, editorial review is often necessary. While the tagging may be technically correct, it may not have the right “meaning”. This is much like a sentence that is grammatically correct but still fails to convey the right meaning, or any meaning. The review is done with viewing tools that render the XML, or lay it out in a way that is readable and conveys its meaning. The last step is quality control (QC). At DCL, we use QC software to process the XML as a last check to make sure that there are no discrepancies between the XML and the preceding conversion step.
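One plausible form of such a last check is to compare the words in the final XML against the proofread text from the earlier step. The Python sketch below is only an illustration of that idea under stated assumptions: it ignores markup and whitespace, compares word by word, and uses the hypothetical function name `text_discrepancies`. It does not describe DCL's QC software.

```python
import difflib
import xml.etree.ElementTree as ET


def text_discrepancies(xml_text, source_text):
    """List word-level differences between the source text and the XML text."""
    root = ET.fromstring(xml_text)
    xml_words = " ".join(root.itertext()).split()
    source_words = source_text.split()
    matcher = difflib.SequenceMatcher(a=source_words, b=xml_words, autojunk=False)
    return [
        (tag, source_words[i1:i2], xml_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


source = "Scope The total was 100 units."
clean_xml = "<sec><title>Scope</title><p>The total was 100 units.</p></sec>"
damaged_xml = "<sec><title>Scope</title><p>The total was 10 units.</p></sec>"

print(text_discrepancies(clean_xml, source))
print(text_discrepancies(damaged_xml, source))
```

Because text from separate elements is joined with spaces, a word split by inline markup would show up as a false discrepancy, so a real check would need to account for the tagging rules in the specification.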

So, what have we learned? Abraham Lincoln once said, “Give me six hours to chop down a tree, and I will spend the first four sharpening the axe.” Simply put, conversions are complex, and putting in the effort up front will save you a lot of headaches in the end. Equally important is having the expertise to manage the process straight through to avoid time-consuming and costly speedbumps. All of this is vital to a successful conversion.

About the Author
Mark Gross, president of Data Conversion Laboratory (DCL), is an authority on XML implementation and document conversion. Prior to joining DCL in 1981, Gross was with the consulting practice of Arthur Young & Co. He has a B.S. degree in Engineering from Columbia University and an MBA from New York University. He has also taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML and SGML.
