Microsoft Word - pdf2XML

PDF2XML: Converting PDF to XML

Yonggao Yang Kwang Paick Yanxiong Peng Yukong Zhang

Department of Computer Science

Prairie View A&M University

Prairie View, TX 77446

Abstract: XML is a markup language for documents containing structured information. It is designed to make it

easy to interchange structured documents over the Internet and further integrate them with management database

system. PDF is a document format intended to electronically reproduce the look of a page. There is a huge

demand of converting existing PDF documents into XML documents, so that they will be searchable and

manageable. Since PDF is basically a page layout format and does not carry original document structure,

converting PDF to XML remains a challenging task. This paper addresses the related technique problems and

explores approaches. As part of the Data Conversion Project under development at the Data Conversion Center

funded by DoD, we present a system, PDF2XML, designed to automatically perform the conversion with minimum

human interaction.

Keywords: Data Conversion, PDF, XML

1. Introduction

Portable Document Format, PDF, is a widely used electronic document format intended to

electronically reproduce the look of a page [1, 2]. It was designed to be a publishing format, which is

often referred to as “Electronic Paper”. However, a PDF document does not contain any structure

information that its original electronic file possesses, such as paragraph information, table structures,

and so forth. This allows a PDF document to be usually much smaller in size than its original electronic

file in other formats, thus makes PDF documents convenient for transmitting on the Internet.

EXtensible Markup Language, XML, is a markup language for documents containing structured

information [3, 4, 5]. It allows that richly structured documents can represent database data as well as

other kinds of structured data, and be used over the web in communication between business

applications. XML provides a facility to define tags and the structural relationships between them.

Huge amount of documents are generated everyday and stored in PDF format. Furthermore, we

have very large amount of legacy PDF documents without their original electronic files from which

they were generated. Virk [8] lists the gains we can achieve by converting other documents into XML

format. It might be easy to perform some conversions among various formats, such as XML to PDF and

HTML to XML [6, 7].

Currently most of the conversion work is done manually, thus is time-consuming and error-prone.

Few efforts are directed towards designing automatic conversion tools. Ouahid and Karmouch [6]

address methods of converting web pages into XML documents. Youn and Ku [7] discuss issues of

migrating data from one system to another, without mentioning PDF-to-XML. One commercial

software worth of mention is Omnimark [9], which is designed to help convert documents in RTF

format into XML format.

We are aware of that it is extremely difficult to develop a system to automatically conduct the

100% of PDF to XML conversion due to the versatility of PDF documents and XML DTD

specification. We lunched a project to develop a computer system capable of performing the 60~70% of

the conversion task automatically, thus leaving about 30% of conversion task to be done manually. As

the result of this project, a system, PDF2XML, was developed.

2. Convert PDF to XML

Our goal is to design and implement a system to help convert PDF document into XML documents

with minimum human interaction. The structure of the system, PDF2XML, is depicted in Figure 1. The

filled thick arrows show the flow of the data from the input of the PDF document to the output of the

XML document. It goes through four phases: (1) PDF Extraction; (2) Structure Reconstruction; (3)

Manual Modification; and (4) XML Wrapping.

XML Document

Paragraph

Merger

PDF

Analyser &

Extracter

XML Wrapper

Content Type

Recognizer

Content

Recognizer

PDF Document

CECOM DTD

Tags

Formatted

Document

Filter

Filtering Rules

Original Document Structure

Reconstructor

Table

Recognizer

Figure

Recognizer

Manual

Modification

DTD Tag Rules

DTD File

Style Sheet

INPUT

OUTPUT

Interactive Parameters Configuration

Work Package

Splitter

Work Package

Rules

Figure 1: PDF2XML system structure

2.1 Phase-I: Information Extraction from PDF Document

The first phase is to extract as much information as possible from the PDF document, and filter out

those unnecessary contents, and save it as a text file. A PDF Normal document does not contain original

document structure. It is basically a page layout format, which shows where on the page a line should

be placed. The information we can extract includes: (1) Document information: title, author, and date;

(2) Page information: page number, page height and weight; (3) Each line string in the page: position

information (top, left, width, and height), font information (size, type, style, and color), and others.

Figure 2 is an example that shows part of a sample PDF page and the extracted information. The

numbers of each line in Figure 2(b) are Top, Left, Width, Height, Font Size, and Font Style (Other

information is ignored because they are not required by the XML DTD system in this example).

(a) Part of a page in PDF (b) Extracted information in plain text (c) Output of Phase-II

Figure 2: Extract information from PDF

The filter and the filtering rules in this phase are used to throw away those non-required contents on

PDF pages, such as the page numbers usually printed at the bottom of the pages, the document ID

appearing on each page (e.g. the ID is at top-right corner in Figure 2(a)), and so forth.

2.2 Phase-II: Reconstruction of Original Document Structure

Reconstructing the original document structure from PDF file is the most challenging task of data

conversion tools because PDF document does not carry any such information. The output of this phase

is also a plain text file (Figure 2c).

Paragraph Merger

: PDF document does not carry the paragraph information (Figure 2a). After the

extraction, each line is an independent entity (Figure 2b). By using the line position, font, page number,

and other parameters (such as space between regular lines and space between paragraphs, etc.),

Paragraph Merger is able to tell the start and the end of a paragraph, and merge lines into paragraphs.

Figure Recognizer

: XML treats each figure as an entity. Currently we manually snapshot each

figure from PDF file and save it as a file, and simply use XML tags to wrap the figure’s title to point to

the file. However, during the automatic conversion, PDF2XML still needs to ignore the words

belonging to the figure and extract only the figure title.

Table Recognizer and table structure reconstruction

: Identifying tables and reconstructing their

structures is the most difficult task for data conversion tools, particularly for those complex and

irregular tables. Based on the position information and parameters provided by users, PDF2XML

currently is capable of handling regular tables. However, for irregular tables, PDF2XML still relies on

human interaction (Phase-III) to recover their structures.

Title Identifier

: Identifying chapter titles, multi-level subtitles and footnotes is a critical task of

reconstructing original document structure. Title Identifier of PDF2XML uses font size, font style, and

line position, along with other parameters provided by the users to accomplish this task.

Figure 2c shows partial of the output from this phase. We use several predefined pseudo-tags (in

curly braces at the beginning of each line) to mark the types of the contents. The major pseudo-tags are

{CHAPTER_ID}, {CHAPTER_TITLE}, {SECTIONSTR}, {SUBTITLE_x}, {PARAGRAPH},

{FIGURE}, {TABLE}, and others. Here {SUBTITLE_x} means a level x subtitle.

2.3 Phase-III: Manual Modification

After Phase-II, we almost recover and reconstruct the original document structure from the PDF

document, except those special cases that the previous two phases are not able to handle. This phase

allows us to use any text-editing tool to modify the output of Phase II.

2.4 Phase-IV: XML Wrapper

After Phase-III, the remaining task is to use XML DTD tags to wrap the contents in the plain text

file accordingly, and to generate the XML document. This task involves XML DTD tags and their

syntax (rules), and original document structures stored in the plain text file from Phase 3. PDF2XML

uses an Access database system to store all the DTD tags and the rules regarding the use of them. Users

should modify this DTD database to meet their own XML DTD requirements. Sometimes, what DTD

tags we should use depends on the content (we name it content-sensitive tags), “Content Recognizer”

module is called to parse the content and make a decision on this. For some special cases, users might

want to identify this manually and insert specified control marks appropriately to the plain text file to

facilitate XML Wrapper doing its job.

Figure 3 is a snapshot of PDF2XML interface. The top panel is used to provide system-use-help

information, view the source PDF file, display and edit the multiple temporary files generated during

the several conversion phases, and display the generated XML file with or without its style sheet. The

bottom panel hosts all the control-buttons and various parameter-editing boxes. Through this panel,

users interact with PDF2XML.

Figure 3: A snapshot of the PDF2XML interface

PDF2XML system was implemented with Visual C++ on Windows environments. It uses Access

as the database system to store and locate CECOM DTD tags and wrapping rules.

3. Conclusion

Currently, by using PDF2XML, we are working with DoD to convert huge amount of weapon and

equipment manuals in PDF format to XML documents. We plan to continue improving PDF2XML

system and integrate other functions, including converting figures in PDF files to SVG Format that can

be integrated into XML documents.

References

[1] Inc. Adobe Systems, “PDF Reference: Version 1.4,” Addison-Wesley Pub Co., 2001.

[2] Inc. Adobe Systems, “PDF Reader,” http://www.adobe.com

[3] XML official website, http://www.xml.org

[4] Coyle, F. P., “XML, Web Services, and the Data Revolution,” Addison-Wesley Pub Co., 2002.

[5] Birbeck M., etc., “Professional XML,” Wrox Press Inc., 2000.

[6] Ouahid, H., and Karmouch, A., “Converting Web Pages into Well-formated XML Documents,” IEEE Proc. of

International Conference on Communications, 1999, pp. 676~680.

[7] Youn, C., and Ku, C. S., “Data Migration,” IEEE Proc. of the Fifth Distributed Memory Computing

Conference, 1992, pp. 1028~1037.

[8] Virk, R, “Why Use XML for Documents & Content?”

http://www.datawarehouse.com/iknowledge/whitepapers/CID3443.pdf

[9] Barker, M., “Internet Programming with OmniMark,” Kluwer Academic Pub., 2000.