CereSoft solutions are powered by two extraordinary technologies: our FreeStyle recognition engine and our Universal Script, which captures data from unstructured documents.
The following white paper outlines our approach to capturing data from
dynamic, loosely structured documents.
CereSoft incorporates our intelligent Universal Script into
industry-specific applications to help organizations improve productivity
and data accuracy.
H. H. Chen
CereSoft, Inc.
- Summary
- Introduction
- Types of Paper Documents
- Templates and Static Forms Processing
- Different Types of Dynamic Forms
- Intelligent Script Dynamic Forms Processing
- Recognition and Search Strategy
- Document Understanding
I. Summary
Document recognition technology has progressed to the point where computers can recognize and interpret the content of an image-based document. A natural application of this technology is the capture of data from dynamic documents. Documents are considered "dynamic" if they contain similar data elements but have varying page layouts, so that the data elements are located in different places from one document to the next.
Dynamic document processing will revolutionize the document and data capture industry. Users will be liberated from the chores of locating and entering data from documents such as invoices, remittance advices, bills, listings, etc. It is now possible to create a comprehensive data capture product that will automatically scan, sort, batch, extract data, index, store, retrieve and route all the business forms passing through an enterprise.
II. Introduction
Currently, document imaging systems perform two separate types of data capture. In the first, documents are scanned and the data on them is manually indexed for storage, search and retrieval. Manual indexing is needed because these documents are mostly unstructured or loosely structured, so the information needed from them appears in different places. In the second, documents that are so highly structured that their layout can be matched to a set of geometric templates are automatically recognized by a forms processing application that captures the required data. These geometric templates often contain information about the locations and shapes of the data fields, lines, checkmarks, and barcodes on the form, as well as information on how to handle the type of data found in a specific location. A template's information is the knowledge entry point for processing a static form, and its proper setup is crucial to the success of forms processing software.
Unfortunately, most of the paper documents that a business needs to process are not highly structured static forms. The ability to find and read the pertinent data from unstructured or loosely structured documents is essential to the creation of a comprehensive document imaging system.
III. Types of Paper Documents
Loosely speaking, we can classify paper documents into three categories:
Highly structured - These are forms whose layouts hardly ever change. The data on these forms can reliably be found in set locations. The format of these static forms is so precisely defined that templates can be built to geometrically identify and extract the critical data contained in them. Geometric template matching and character recognition technologies are the most important factors in creating a forms processing solution for structured forms. Forms processing is a mature industry, and a number of vendors currently offer targeted solutions for static forms processing.
Unstructured - These are documents with free-flowing formats and a constantly changing or unknowable set of data elements. They do not have useful geometrical structures from which we can identify and extract the needed data. Examples of unstructured documents are correspondence, pictures, manuals, contracts, class notes, etc. For text documents, we may use full-text OCR software to convert a text image into ASCII data for search and retrieval. We can even build a keyword list or a profile of the document and classify it according to the frequency of occurrence of these keywords. The full-text OCR approach works for documents with high image quality, but it fails badly once the image quality is degraded. This is an intrinsic deficiency of the currently popular post-OCR fuzzy match approach for indexing and retrieving unstructured documents.
Loosely structured - These are dynamic forms that have a regular set of data items but an imprecise or totally irregular page layout, so that templates become useless for identifying and extracting those data items. For this case, CereSoft technologies use a new approach based on an Intelligent Universal Script to locate and read data from dynamic form documents. In principle, the Universal Script approach will also work for unstructured documents.
IV. Templates and Static Forms Processing
For static forms, the page layouts are usually fixed. Their geometric specifications are so precise that it is possible to match the document images to a template. Although images scanned from different pages may vary, these variations can be compensated for by a mapping that employs a few adjustable parameters. Examples of these parameters are horizontal and vertical shifts, horizontal and vertical stretching, skew and shear. The data can then be found by matching the form image against the form templates.
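The mapping with a few adjustable parameters can be sketched as a simple affine transform. This is only an illustration under our own assumptions: the function names, the parameter names and the order in which the corrections are applied are hypothetical and are not taken from any particular forms processing product.

```python
import math

def template_to_image(x, y, shift=(0.0, 0.0), stretch=(1.0, 1.0),
                      skew_deg=0.0, shear=0.0):
    """Map a template coordinate to a scanned-image coordinate.

    Hypothetical model: stretch, then horizontal shear, then rotation
    by the skew angle, then shift.
    """
    # Stretch each axis, then shear horizontally in proportion to y.
    xs = stretch[0] * x + shear * stretch[1] * y
    ys = stretch[1] * y
    # Rotate by the skew angle of the scanned page.
    a = math.radians(skew_deg)
    xr = xs * math.cos(a) - ys * math.sin(a)
    yr = xs * math.sin(a) + ys * math.cos(a)
    # Apply the horizontal and vertical shifts.
    return xr + shift[0], yr + shift[1]

def map_field(box, **params):
    """Map a field rectangle (left, top, right, bottom) and return
    the bounding box of its four transformed corners."""
    left, top, right, bottom = box
    corners = [template_to_image(x, y, **params)
               for x, y in ((left, top), (right, top),
                            (left, bottom), (right, bottom))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return min(xs), min(ys), max(xs), max(ys)
```

In practice the parameters would be estimated by aligning landmarks of the blank form (lines, boxes, registration marks) with the scanned image; that estimation step is outside the scope of this sketch.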
A template is a recipe, based on a blank form image, that instructs the software where to find the data fields and how to process these data fields. It therefore contains all the intelligence that the software must have in order to process a particular static form.
The template approach, although powerful, has to be modified or even abandoned when geometric information regarding the locations of the data becomes imprecise or unavailable. This is the situation we encounter when processing dynamic form documents.
V. Different Types of Dynamic Forms
Depending on the degree of variation, we may classify dynamic documents into three different types.
Type 1 Floating Column - These documents have a fixed format but an indefinite number of items in a data column, such as the line items on an invoice. The exact locations of the data, although within the same column, therefore float along the column depending on the number of items present. This in turn affects the location of the summary data, such as the total amount, tax amount, etc., which follow the last item in the column.
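The floating-column case can be illustrated with a small sketch. It assumes, hypothetically, that the column has already been read into (label, amount) rows ordered from top to bottom, and that the summary rows are recognized by a short list of prompting words; both the row format and the word list are our assumptions for illustration.

```python
def split_floating_column(rows, summary_prompts=("subtotal", "tax", "total")):
    """Separate line items from the summary rows that follow them.

    rows: list of (label, amount) pairs in top-to-bottom order
    (a hypothetical, already-recognized representation of the column).
    """
    items, summary = [], {}
    for label, amount in rows:
        key = label.strip().lower()
        if key in summary_prompts:
            # Summary data float with the column: they start wherever
            # the last line item ends.
            summary[key] = amount
        elif not summary:
            # Anything before the first summary row is a line item.
            items.append((label, amount))
    return items, summary
```

Because the split is driven by the content of each row rather than by its position, the same code handles an invoice with two line items or with two hundred.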
Type 2 Geometrically Similar - These are pages that have similar but not identical page layouts, or whose layout is not printed precisely according to the page layout specifications. Sometimes a form is designed around a basic layout theme but without a precise specification of its geometrical parameters. There are also forms whose layouts are subject to periodic modification. Different generations of these forms inherit many common elements but also include changes that may confuse a template application. The documents may share a similar global structure but have different data locations and varying numbers of corresponding data fields. Traditional template matching cannot locate the data fields in these forms because precise information about the field locations is lacking. We have to relax the rigid geometrical template matching and replace it with a more flexible, dynamic template matching. The flexible template gives only a rough, imprecise location indicator for the data field. It is up to the form processing software to intelligently find the needed data within this specified domain of occurrence.
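A minimal sketch of searching a domain of occurrence follows. It assumes that recognition has already produced a list of words with page coordinates, and that the flexible template supplies a rough rectangle plus a content pattern for the field; the dictionary keys, the rectangle format and the function name are hypothetical.

```python
import re

def find_in_domain(words, domain, pattern):
    """Return the words inside a rough rectangle whose text matches
    the expected content pattern.

    words:   list of {"text": str, "x": number, "y": number}
    domain:  (left, top, right, bottom), the rough location indicator
    pattern: regular expression describing the field content
    """
    left, top, right, bottom = domain
    rx = re.compile(pattern)
    return [w for w in words
            if left <= w["x"] <= right
            and top <= w["y"] <= bottom
            and rx.fullmatch(w["text"])]
```

Enlarging the rectangle until it covers the whole page turns this into a purely content-driven search.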
Type 3 Content Similar but Geometrically Dissimilar - These are totally dynamic forms such as invoices and receipts. Their page layouts are totally irregular, while the data items to be extracted are similar. In order to extract data from this type of document, the application must be able to recognize and comprehend the content of the whole document and decide whether the needed data is present in it. In a sense, this is the extreme case of a Type 2 document, in which the domain of occurrence for the data is expanded to the whole page. The content-driven data search relies on building an intelligent Universal Script instead of a geometrically based template.
VI. Intelligent Script Dynamic Forms Processing
As discussed in the previous sections, the more irregular the data field locations become, the more we have to rely on the data content to find the fields. For a totally dynamic form, we use an intelligent Universal Script that contains only content information about the data fields and instructs the software how to locate and read the data.
In general, there are two kinds of dynamic data fields. The first kind has a description, called prompting text, in the vicinity of the data that points out the data for extraction. This is prompted data. The second kind has no prompting text to describe and point out the data; a reader must rely on an understanding of the data itself to identify and extract it. This is a-prompted, or self-prompted, data. In most cases, it is much easier and more reliable to find prompted data, which occurs predominantly in a form environment. A-prompted data usually occurs as keywords or indexed data in a text document such as a business letter. Finding a-prompted data in a form environment is usually more difficult.
An intelligent Universal Script provides the description of the prompted or a-prompted data fields we would like to extract from a document and instructions on how to identify and read these data fields from a dynamic form.
Prompted Data Field - These data fields have an explicit description, the prompting text, in the vicinity of the data. For example, the prompting text "Invoice Number" identifies the number in its immediate vicinity as the invoice number. Without the prompting text, it is usually difficult to know from the number itself whether or not it represents the invoice number. Many dynamic forms use prompting text in their designs, which greatly helps the data search. On the other hand, the search for the correct prompting text is still not trivial. In the first place, there are many different words and phrases that describe the same thing. For example, "Amount", "Balance" and "Total Due" may all serve as prompting text for the amount due on an invoice. A second complication is determining which of the data in the vicinity of the prompting text is actually associated with it.
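Both complications can be illustrated with a short sketch: an alias set covers the different wordings of a prompt, and a simple proximity rule picks the associated value. The sketch assumes that recognition yields words (or short phrases) with page coordinates; the alias list repeats the example above, and the proximity rule (nearest matching value to the right of or below the prompt) is our simplifying assumption, not a description of the Universal Script itself.

```python
import re

# Aliases for the "amount due" prompt, following the example in the text.
AMOUNT_DUE_ALIASES = {"amount", "balance", "total due"}

def find_prompted_value(words, aliases, value_pattern):
    """Return the text of the value closest to a matching prompt.

    words: list of {"text": str, "x": number, "y": number}; multi-word
    prompts are assumed to arrive as a single phrase (hypothetical).
    """
    rx = re.compile(value_pattern)
    prompts = [w for w in words
               if w["text"].rstrip(":").lower() in aliases]
    values = [w for w in words if rx.fullmatch(w["text"])]
    best, best_dist = None, None
    for p in prompts:
        for v in values:
            dx, dy = v["x"] - p["x"], v["y"] - p["y"]
            if dx < 0 or dy < 0:
                continue  # only look to the right of or below the prompt
            dist = dx + dy
            if best_dist is None or dist < best_dist:
                best, best_dist = v["text"], dist
    return best
```

A production rule set would also need to weigh alignment, column membership and intervening text, which is exactly why the association step is harder than it first appears.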
A-prompted Data Field - If a data field does not have prompting text, it usually does not need any: what the data represents is self-evident from its content. For example, "September 22, 1999" is evidently a date. We do not really need prompting text to tell us it is a date. However, in many circumstances it can be very difficult to determine the kind of data being read or captured from the content of the data alone.
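Recognizing a self-evident date can be sketched with the standard library. The list of accepted formats is a hypothetical choice, and month names are matched in the default (English) locale.

```python
from datetime import datetime

# Hypothetical set of date layouts to accept.
DATE_FORMATS = ("%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d")

def parse_date(text):
    """Return a date if the text matches one of the known layouts,
    otherwise None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next layout
    return None
```

The difficulty mentioned above shows up quickly: a string such as "03/04/1999" parses without complaint, yet nothing in its content says whether it means March 4 or April 3, or whether it is the invoice date or the due date.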
VII. Recognition and Search Strategy
The performance of an automatic machine print (OCR) or handprint (ICR) recognition system depends heavily on the quality of the image. If a document contains many broken, touching or noisy character images, the accuracy of recognition will degrade significantly, diminishing the chances of finding the right data in the document.
Traditional OCR systems use techniques such as OCR confusion tables, positional N-grams, etc., to correct the mistakes already committed by an OCR engine. The so-called fuzzy match can map an OCR result that contains a few errors to the correct word in a vocabulary, provided the number of errors in the word is limited. No software can match a badly garbled OCR result such as "eolvcatian" to what it should be: "education". But it is not hard to map a slightly garbled result such as "umderstand" to the correct one: "understand". Obviously, the error in the second example is much more limited than in the first.
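The limit of post-OCR fuzzy matching can be made concrete with a small sketch based on edit distance (the number of single-character insertions, deletions and substitutions separating two words). Edit distance is used here as one common way to implement a fuzzy match; the error threshold of 2 is an arbitrary assumption for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_lookup(word, vocabulary, max_errors=2):
    """Return the closest vocabulary word, or None if even the closest
    one needs more than max_errors corrections."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_errors else None

vocabulary = ["education", "understand"]
print(fuzzy_lookup("umderstand", vocabulary))  # one substitution away
print(fuzzy_lookup("eolvcatian", vocabulary))  # too far from any entry
```

With this threshold the first lookup recovers "understand", while the second returns None because "eolvcatian" is four edits away from "education".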
The FreeStyle recognition engine used in CereSoft's dynamic form processing software is based on a different strategy. It is lexicon driven instead of character driven.
In a character-driven system, the OCR engine does not know the lexicon. It first tries to segment the word image into character image blobs using heuristic rules. These character image blobs are then sent to a character recognizer to determine their class labels. Since character segmentation and recognition are done independently, the errors in each of these operations compound rather than cancel each other. Worse, errors committed this early in the process have little chance of being corrected by a post-OCR fuzzy matching process.
An alternative to character-driven recognition is the lexicon-driven recognition adopted by CereSoft's FreeStyle recognition engine. Since the prompting text and the syntax of the a-prompted data fields are known beforehand, the system looks for words that match the vocabulary set of these data fields. This makes it possible to avoid the cumulative errors and early commitment errors altogether.
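The difference between the two strategies can be illustrated with a toy example. This is not the FreeStyle algorithm: it assumes, for simplicity, that the word image has already been divided into character positions and that a recognizer has assigned each candidate character a score between 0 and 1. The function names, the scores and the floor value are all hypothetical.

```python
import math

def char_driven(scores):
    """Commit to the best-scoring character at every position."""
    return "".join(max(pos, key=pos.get) for pos in scores)

def lexicon_driven(scores, lexicon, floor=1e-3):
    """Pick the lexicon word whose characters fit the scores best overall."""
    def log_score(word):
        if len(word) != len(scores):
            return -math.inf  # cannot be aligned in this toy model
        # Unlisted characters get a small floor score instead of zero.
        return sum(math.log(pos.get(ch, floor))
                   for pos, ch in zip(scores, word))
    return max(lexicon, key=log_score)

# Hypothetical recognizer output for a noisy image of the word "total".
scores = [{"t": 0.6, "f": 0.4}, {"o": 0.45, "a": 0.55},
          {"t": 0.7, "l": 0.3}, {"a": 0.8, "o": 0.2},
          {"l": 0.4, "1": 0.6}]
print(char_driven(scores))                                # tata1
print(lexicon_driven(scores, ["total", "terms", "date"])) # total
```

The character-driven result commits to two wrong characters, while the lexicon-driven result never considers strings that are not in the vocabulary of the data fields being searched for.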
VIII. Document Understanding
In addition to character recognition, a successful dynamic forms processing system needs to have at least a rudimentary understanding of the document: its class, structure and content. For example, in order to process an invoice, we need to find the following set of data:
invoice number
PO number
vendor name
line items
amount due
terms
invoice date
ship to
charge to
and others. The presence or absence of some or all of these data items determines whether a document is indeed an invoice. Furthermore, invoices usually arrange their data in geometric structures such as text blocks, columns, tables, etc. There are also non-text elements in the documents, such as lines, graphics, logos, etc., that are irrelevant to the data capture process. There are also many different rules for linking prompting text to its associated data fields. Dynamic forms processing software must be able to perform such analyses and define data fields using the rules and supports it can get from a Script. Examples of such supports are:
Lexical - thesaurus, aliases, semantic networks for the keywords and phrases.
Geometrical - row, column, lines, proximity, below, above, etc.
Arithmetical - quantity, unit price, sum, total, percentage, etc.
Consistency - digit for amount, word length, etc.
Syntactic - date, phone number, serial number, etc.
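As an illustration of an arithmetical support, the following sketch checks that extracted invoice figures agree with one another. The data layout (tuples of quantity, unit price and amount) and the function name are hypothetical; exact decimal arithmetic is used so that currency values compare without rounding surprises.

```python
from decimal import Decimal

def check_invoice_arithmetic(line_items, total, tax=Decimal("0")):
    """Check extracted figures against each other.

    line_items: list of (quantity, unit_price, amount) with Decimal prices.
    Returns (indexes of inconsistent line items, whether the total agrees).
    """
    bad_lines = [i for i, (qty, unit_price, amount) in enumerate(line_items)
                 if Decimal(qty) * unit_price != amount]
    subtotal = sum((amount for _, _, amount in line_items), Decimal("0"))
    return bad_lines, subtotal + tax == total
```

A failed check does not say which figure was misread, but it flags the candidates that need to be re-examined or sent for manual review.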
In this sense, a computer reads data from a dynamic document in essentially the same way a human does. It scans the page, finds the plausible candidates, checks their relationships with their surroundings to weed out wrong candidates, and eventually converges on the correct data.