An overview of current processes, tools and methods for mitigating problems in a conversion project for high-volume legacy materials.
By Mark Gross, President, DCL
Many users have years or even decades of legacy materials they would like to bring into the 21st century. This could be thousands of pages of hard copy, PDF, SGML, or even XML that must be brought up to date before it can be added to their current body of materials. It would be nice if there were a totally “lights-out” solution that produced high-quality results. Sorry to say, after working on hundreds of key projects over the years at DCL, I have learned that for the most part there is no silver bullet. That’s the “bad” news.
Here’s the “good” news: There are ways to help mitigate problems in a conversion project. The more you know and do before your project begins, and the better you manage it along the way, the better your results will be. I hope this article passes along what we have learned about the common pitfalls and their solutions during DCL’s 35 years in business. I’d like to offer you an overview of our process and some of the tools we’ve developed to support a semi-automated conversion process. Let’s first break down some common hurdles.
Finding the inconsistencies in legacy content
Legacy source material can spend years, decades, and in the case of paper even centuries, stuck in an unstructured format. Across data sources and decades, authors march to the beat of their own drums and follow a variety of style guides. For example, callouts and citations can appear as superscript numbers, numbers in parentheses, or numbers in square brackets. Labels or figures in a table may be fully spelled out or abbreviated. You can see how complexities build up in projects like this.
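To make the callout example concrete, here is a minimal Python sketch that reports which callout styles appear in a passage. It is an illustration only, not one of DCL's tools: the names `CALLOUT_PATTERNS`, `find_callouts`, and `styles_used` are hypothetical, and it assumes superscripts survive extraction as Unicode superscript digits.

```python
import re

# Maps Unicode superscript digits to plain digits so they can be read as numbers.
SUPERSCRIPT_DIGITS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")

# Hypothetical patterns for the three callout styles named in the text.
# Caveat: a parenthesized year such as (1995) would also match "parentheses".
CALLOUT_PATTERNS = [
    ("superscript", re.compile(r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+")),
    ("parentheses", re.compile(r"\((\d+)\)")),
    ("brackets", re.compile(r"\[(\d+)\]")),
]


def find_callouts(text):
    """Return (style, number, offset) for each callout, sorted by position."""
    found = []
    for style, pattern in CALLOUT_PATTERNS:
        for match in pattern.finditer(text):
            # Patterns with a capture group keep only the digits inside the delimiters.
            raw = match.group(1) if pattern.groups else match.group(0)
            number = int(raw.translate(SUPERSCRIPT_DIGITS))
            found.append((style, number, match.start()))
    return sorted(found, key=lambda item: item[2])


def styles_used(text):
    """Return the set of callout styles that occur in the text."""
    return {style for style, _, _ in find_callouts(text)}
```

Running `styles_used` over each document in a collection before conversion gives a quick picture of how many conventions the specification has to cover.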
Tagging is tricky business
There are multiple ways to mark up the same content, and you’ll need to decide which way is best. Document Type Definitions, or DTDs, define the structure of documents in a general way. They contain many optional tags, and some tags can be used in multiple structures, which leaves room for interpretation. You therefore need to make sure that tagging is done consistently, and in a way that properly interprets your materials. Without specific rules for the conversion, the XML will likely be inconsistent and the resulting materials will not render consistently.
Visual versus content
Automated tools will often decide tagging based on the look of the page. But the goal of XML is to mark up the content so that it retains its meaning, and that often means a need for human review.
Text extraction in PDFs
There are many tools for text extraction from PDF Normal files. They’re all wonderful tools, but none of them is perfect across all projects. Tools in common use include Adobe, Jade, and Gemini. We use all of these, plus a number of specialized tools developed at DCL.
Pre-analysis and zoning are a good way to start
Above we indicated some of the things to look out for when beginning a conversion. The following breaks down some possible tactics and solutions. At DCL, we have developed tools that have gotten us as close as possible to a fully automated process, with minimal human intervention. The goal is always to deliver high quality and consistent results.
Pre-Analysis of a collection is essential. You need to decide which DTD best suits the dataset and what you need the data for. And you need to document your decision in a specification document. This document should be clearly laid out. It should be flexible and robust. It should be able to handle almost any situation thrown at it. Lastly, it should be updated as needed to handle new situations, and it should be available to all team members.
OCR/text extraction and proofreading go hand in hand
Unless document pages are very consistent, a zoning step prior to OCR/text extraction is often useful. Zoning identifies what needs to be captured. It detects each structure on the page and classifies it, for instance as a text box, an image, or a table. It also defines the reading order of the page, which is necessary when you have multi-column pages.
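As a rough illustration of the reading-order part of zoning, the Python sketch below sorts zones column by column and then top to bottom. The `Zone` class and `reading_order` function are hypothetical names, and the sketch assumes equal-width columns with no zone spanning more than one column, which real pages often violate.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    kind: str      # "text", "image", or "table"
    x: float       # left edge of the zone
    y: float       # top edge, measured downward from the top of the page
    width: float
    height: float


def reading_order(zones, page_width, columns=2):
    """Order zones left column first, top to bottom within each column.

    Assumes equal-width columns and that no zone spans two columns.
    """
    column_width = page_width / columns

    def sort_key(zone):
        # Assign the zone to a column by the horizontal position of its center.
        center = zone.x + zone.width / 2
        column = min(int(center // column_width), columns - 1)
        return (column, zone.y)

    return sorted(zones, key=sort_key)
```

For a two-column page, this reads all of the left column before moving to the right column, rather than jumping across the gutter line by line.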
Depending on the source (paper, PDF image, or PDF normal), good OCR software will extract different elements. Proofreading followed by styling/pre-editing is the next step. DCL has developed text tools to help with the proofreading. Be aware of tricky bits like Os and zeros, which look very similar. The same goes for Ls and 1s, and Zs and 2s. After this, our usual approach is to convert the result of this step into a styled Word document. This document is reviewed and corrected where necessary.
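A proofreading aid for those look-alike characters could be sketched along the following lines. This is not DCL's tooling; the names `LOOKALIKE_CHARS` and `suspicious_tokens` are hypothetical, and the heuristic is deliberately simple: it flags any token that mixes letters and digits and contains one of the confusable characters, so that a human can check it.

```python
import re

# Characters from the pairs named above: O/0, l/1, Z/2.
LOOKALIKE_CHARS = set("OolZ012")


def suspicious_tokens(text):
    """Return tokens that mix letters and digits and contain a look-alike character.

    This is a heuristic for human review, so it will also flag some legitimate
    tokens, such as part numbers that really do mix letters and digits.
    """
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_alpha = any(ch.isalpha() for ch in token)
        has_lookalike = any(ch in LOOKALIKE_CHARS for ch in token)
        if has_digit and has_alpha and has_lookalike:
            flagged.append(token)
    return flagged
```

The flagged tokens are only candidates. Deciding whether "2O19" should be "2019" still takes a proofreader or a check against the source page.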
Let the automation begin: conversion and parsing
The styled Word documents are then run through our conversion software to create an XML file. If the styling was done, the resulting output will be upwards of 90% accurate. The resulting document is then tested with a parser to make sure it follows the rules. A parser is software that analyzes the XML file to verify correctness and to indicate where corrections need to be made. Once parsing is complete and those corrections are made, we have valid XML, but we are not done yet.
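The simplest form of such a check can be sketched with Python's standard library. Note the limits of this illustration: `xml.etree.ElementTree` only tests whether the XML is well formed and does not validate against a DTD, so a validating parser would be needed for the full check described above. The function name `check_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, (line, column, message)).

    This tests well-formedness only; it does not validate against a DTD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position holds the (line, column) where parsing failed.
        line, column = err.position
        return False, (line, column, str(err))
    return True, None
```

The reported line and column show where a correction is needed, which is the kind of feedback a parser gives during this step.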
Adding back in the human element
At this point in the conversion, editorial review is often necessary. The tagging may be technically correct and still not carry the right “meaning”, much as a sentence can be grammatically correct and still convey the wrong meaning, or no meaning at all. This review is done with viewing tools that render the XML, laying it out in a way that is readable and that conveys its meaning. The last step is quality control (QC). At DCL, we use QC software to process the XML as a last check to make sure that there are no discrepancies between the XML and the output of the earlier conversion steps.
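One kind of QC check can be sketched as a word-by-word comparison between the text inside the XML and the proofread text it came from. This is an assumption about what such a check might look like, not a description of DCL's QC software. The function name `text_discrepancies` is hypothetical, and the sketch assumes inline tags never split a word.

```python
import difflib
import xml.etree.ElementTree as ET


def text_discrepancies(xml_text, source_text):
    """List word-level differences between source text and the XML's text content.

    Each entry is (operation, source_words, xml_words), where operation is
    "replace", "delete", or "insert". Assumes inline tags never split a word.
    """
    root = ET.fromstring(xml_text)
    # Join the text of all elements, then split on whitespace to get words.
    xml_words = " ".join(root.itertext()).split()
    source_words = source_text.split()
    matcher = difflib.SequenceMatcher(a=source_words, b=xml_words, autojunk=False)
    return [
        (tag, source_words[i1:i2], xml_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

An empty list means the words match. Anything else points a reviewer at the exact words that were changed, dropped, or added along the way.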
So, what have we learned? Abraham Lincoln once said, “Give me six hours to chop down a tree, and I will spend the first four sharpening the axe.” Simply put, conversions are complex, and putting in the effort upfront will save you a lot of headaches in the end. Equally important is having the expertise to manage the process straight through to avoid time-consuming and costly speedbumps. All of this is vital to a successful conversion.
About the Author
Mark Gross, president of Data Conversion Laboratory (DCL), is an authority on XML implementation and document conversion. Prior to joining DCL in 1981, Gross was with the consulting practice of Arthur Young & Co. He has a B.S. degree in Engineering from Columbia University and an MBA from New York University. He also has taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML and SGML.