| Extracting content from document images | Model-directed recognition |
| Recognition, compression, & retrieval | Formal probabilistic models |
| Page layout and printed text | Automatic inference of models |
|
The Statistical Pattern and Image Analysis (SPIA) research team invents algorithms for extracting content from document images. We also develop the theories that motivate these algorithms and build prototype software tools that embody them. The Document Image Decoding (DID) effort continues as a project within SPIA. We focus on methods for recognition, compression, and retrieval that are particularly sensitive to machine-printed text, page layout formats, and the logical structure of the text.

For several years we have emphasized formal probabilistic models of the stages by which document images are generated. These models approximate the statistics of the language and domain of discourse, the typefaces and page layouts, and the degradations in image quality that result from printing and imaging. We have been able to express most of these models as finite-state Markov grammars, which allows us to compose them into a single model. Our algorithms can 'decode' a given page image with respect to the composed model in a rigorous manner, so that the resulting interpretation is optimal in a precise, realistic, and useful sense. In practice this yields remarkably high recognition accuracy (the fraction of characters interpreted correctly), and it also permits the logical structure of the document to be extracted as tags embedded in the raw text.

Our technology has past, present, and potential applications across a wide range of scan-based products. At present, we are actively engaged with Xerox business groups to improve:
Our group's technical skills include document image analysis, pattern recognition, information theory and source/channel coding, and general computer-science algorithms and software engineering. |
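The idea of decoding an observation against a composed finite-state Markov model can be illustrated with a small, generic Viterbi decoder. This is only a sketch under stated assumptions, not the DID implementation: the two-symbol alphabet, the probabilities, and the names `viterbi`, `STATES`, `LOG_START`, `LOG_TRANS`, `LOG_EMIT`, and `OBSERVED` are all hypothetical. The transition table stands in for a language model and the emission table stands in for a degradation (printing and imaging) model.

```python
import math

# Hypothetical toy alphabet: hidden characters "a" and "b".
STATES = ["a", "b"]

def _log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

# Assumed source (language) model: characters strongly tend to alternate.
LOG_START = {"a": math.log(0.5), "b": math.log(0.5)}
LOG_TRANS = _log_table({"a": {"a": 0.1, "b": 0.9},
                        "b": {"a": 0.9, "b": 0.1}})

# Assumed channel (degradation) model: each character is observed as the
# matching glyph "A" or "B" with probability 0.8, and misread otherwise.
LOG_EMIT = _log_table({"a": {"A": 0.8, "B": 0.2},
                       "b": {"A": 0.2, "B": 0.8}})

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log_prob, path) of the most probable hidden state sequence."""
    # best[s] holds the best score and path among sequences ending in state s.
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that maximizes the score of reaching s.
            score, path = max(
                ((best[p][0] + log_trans[p][s], best[p][1]) for p in states),
                key=lambda t: t[0],
            )
            new_best[s] = (score + log_emit[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

# A noisy observation sequence: the third glyph is inconsistent with alternation.
OBSERVED = ["A", "B", "B", "B", "A"]

if __name__ == "__main__":
    log_prob, path = viterbi(OBSERVED, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print("".join(path), math.exp(log_prob))
```

With these assumed numbers the decoder returns the sequence `a b a b a`: the language model outweighs the third, probably misread, glyph. The result is optimal in the sense that no other character sequence has higher joint probability under the combined source and channel model.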