Document Decomposition into Geometric and Logical Layout



LaTeX is a popular way to produce aesthetically pleasing PDF documents with diverse content. It is a low-level markup and programming language that offers great flexibility in placing text and figures and in designing the overall document structure. However, this precision is hard to reproduce: once a PDF document has been generated, there is generally no way to recover the code that produced it. In particular, it is very difficult to recreate the template used to design a document.

This project aims to analyze the layout of a PDF document in order to simplify the generation of LaTeX templates.

Images of the document are captured on an Android device and sent to a server. There they are processed with a combination of MATLAB and Java, and the final output is saved on the server. The algorithm comprises three main steps: preprocessing, detecting maximal white rectangles, and classifying the components of the document.
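
To make the white-rectangle step more concrete, here is a minimal sketch in Python (chosen for brevity; it is not the project's MATLAB/Java code). It assumes the preprocessing step has already produced a binary grid in which 0 marks a white pixel and 1 marks ink. It finds only the single largest all-white rectangle, using the standard histogram-and-stack technique. That is a simplification: enumerating every maximal white rectangle, as the project may well do, needs more machinery. The function name and the return layout are illustrative assumptions.

```python
def largest_white_rectangle(grid):
    """Return (area, top, left, height, width) of the largest all-white rectangle.

    grid is a list of equal-length rows; 0 = white pixel, 1 = ink (assumed encoding).
    Returns (0, 0, 0, 0, 0) when the grid is empty or contains no white pixel.
    """
    if not grid or not grid[0]:
        return (0, 0, 0, 0, 0)

    n_cols = len(grid[0])
    heights = [0] * n_cols  # run of white pixels ending at the current row, per column
    best = (0, 0, 0, 0, 0)

    for row_idx, row in enumerate(grid):
        # Extend or reset each column's white run.
        for c in range(n_cols):
            heights[c] = heights[c] + 1 if row[c] == 0 else 0

        # Largest rectangle in the histogram `heights`, via a monotonic stack.
        stack = []  # column indices whose heights are strictly increasing
        for c in range(n_cols + 1):
            cur = heights[c] if c < n_cols else 0  # sentinel 0 flushes the stack
            while stack and heights[stack[-1]] >= cur:
                h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                w = c - left
                if h * w > best[0]:
                    best = (h * w, row_idx - h + 1, left, h, w)
            stack.append(c)

    return best


if __name__ == "__main__":
    # Toy "page": a white vertical gutter two pixels wide, crossed by a white row.
    page = [
        [1, 1, 0, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [1, 1, 0, 0, 1],
    ]
    print(largest_white_rectangle(page))
```

On this toy page the function reports the two-column gutter spanning all four rows, with area 8, rather than the single white row of area 5. Large white rectangles of this kind are what separate columns and blocks, so a classification step could plausibly treat the regions between them as candidate components.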


