Document Layout Analysis
In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis. – Wikipedia
In our case, we use the pdf document itself rather than an image representation. The following categories of tools are available:
- Word extractors
- Page segmenters
- Reading order detectors
- Other layout tools
- Export – Tools to export the result of document layout analysis
Word extractors deal with the task of building words from the letters in a page. Two different methods are currently available:
Words | Default method | Nearest Neighbour |
---|---|---|
Horizontal | ✔️ | ✔️ |
Axis aligned rotated | ✖️ | ✔️ |
Rotated | ❌ | ✔️ |
Curved | ❌ | ✔️ |
Legend: ✔️: supported, ✖️: partial support, ❌: not supported
TO DO
The nearest neighbour word extractor is useful to extract words from pdf documents with complex layouts. It seeks to connect each glyph bounding box's `EndBaseLine` point with the closest glyph bounding box's `StartBaseLine` point.
In order to decide whether two glyphs are close enough to each other, the algorithm uses the maximum of both candidates' widths as a reference distance.
- For glyphs with a known text direction, the Manhattan distance is used and the threshold is set to 20% of the width.
- For glyphs with an unknown text direction, the Euclidean distance is used and the threshold is set to 50% of the width.
If the measured distance between the two glyphs is below this threshold, they are deemed to be connected.
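As an illustration of this closeness test, here is a small Python sketch (the library itself is C#; the function name and the point/width representation are hypothetical):

```python
import math

def are_connected(end_baseline, start_baseline, width1, width2, direction_known):
    """Decide whether two glyphs are close enough to belong to the same word.

    end_baseline: (x, y) of the first glyph's EndBaseLine point.
    start_baseline: (x, y) of the second glyph's StartBaseLine point.
    The reference distance is the larger of the two candidate widths.
    """
    reference = max(width1, width2)
    dx = start_baseline[0] - end_baseline[0]
    dy = start_baseline[1] - end_baseline[1]
    if direction_known:
        # Known text direction: Manhattan distance, 20% of the reference width.
        return abs(dx) + abs(dy) <= 0.20 * reference
    # Unknown text direction: Euclidean distance, 50% of the reference width.
    return math.hypot(dx, dy) <= 0.50 * reference
```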
Once glyphs are connected, they are grouped to form words via a depth-first search algorithm, which also handles the glyph ordering.
It seems that both left-to-right and right-to-left scripts have their glyph `StartBaseLine` on the left and `EndBaseLine` on the right. So for right-to-left scripts, the word's glyphs should be ordered in reverse.
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
foreach (var word in words)
{
// Do something
}
}
}
The algorithm was used on this map, which has a complex layout with glyphs and words having very diverse text directions. The algorithm is able to rebuild words independently of their direction.
Words are indicated by red rectangles:
Page segmenters deal with the task of finding blocks of text in a page. Three different methods are currently available:
- Default method
- Recursive XY Cut – a top-down method
- Docstrum for bounding boxes – a bottom-up method
Text | Default method | Recursive XY Cut | Docstrum |
---|---|---|---|
Single Column | ✔️ | ✔️ | ✔️ |
Multi Columns | ✖️ | ✔️ | ✔️ |
L-shaped text | ❌ | ❌ | ✔️ |
Rotated lines/paragraphs | ❌ | ✖️ | ✔️ |
Legend: ✔️: supported, ✖️: partial support, ❌: not supported
This method returns one single block containing all the words in the page.
var blocks = DefaultPageSegmenter.Instance.GetBlocks(words);
The recursive X-Y cut is a top-down page segmentation technique that decomposes a document image recursively into a set of rectangular blocks. – Wikipedia
The algorithm works in a top-down manner: it starts at the page level and scans it from left to right to find a large enough gap between words. Once a gap is found, a vertical cut is made at this position and the page is split into two blocks. The algorithm then scans each block from bottom to top to find a large enough gap and, once one is found, makes a horizontal cut. This two-step process is repeated until no further cut is possible.
The height and width of the dominant font are used to decide vertical and horizontal gap sizes. A minimum block width can also be used to filter small blocks.
Once all blocks are discovered, lines of text are simply created by grouping words by their bottom coordinate and ordering them from left to right.
The default implementation will use the mode of the height and width of the letters in each block as relevant gap sizes. The minimum block width is 0 (no minimum).
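The scanning and cutting logic can be sketched as follows in Python (a simplified illustration with hypothetical names, not the library's implementation; words are plain bounding-box tuples and the gap thresholds are passed in directly):

```python
def find_gap(intervals, min_gap):
    """Return a cut coordinate inside the first gap wider than min_gap, else None."""
    intervals = sorted(intervals)
    reach = intervals[0][1]
    for lo, hi in intervals[1:]:
        if lo - reach > min_gap:
            return reach + (lo - reach) / 2
        reach = max(reach, hi)
    return None

def xy_cut(boxes, h_gap, v_gap):
    """Recursively split word bounding boxes (x0, y0, x1, y1) into blocks."""
    if not boxes:
        return []
    # Scan left to right for a gap between words wide enough for a vertical cut.
    cut = find_gap([(b[0], b[2]) for b in boxes], h_gap)
    if cut is not None:
        left = [b for b in boxes if b[2] <= cut]
        right = [b for b in boxes if b[0] >= cut]
        return xy_cut(left, h_gap, v_gap) + xy_cut(right, h_gap, v_gap)
    # Scan bottom to top for a gap tall enough for a horizontal cut.
    cut = find_gap([(b[1], b[3]) for b in boxes], v_gap)
    if cut is not None:
        below = [b for b in boxes if b[3] <= cut]
        above = [b for b in boxes if b[1] >= cut]
        return xy_cut(below, h_gap, v_gap) + xy_cut(above, h_gap, v_gap)
    # No cut possible: these words form one block.
    return [boxes]
```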
- https://en.wikipedia.org/wiki/Recursive_X-Y_cut
- Recursive X-Y Cut using Bounding Boxes of Connected Components by Jaekyu Ha, Robert M.Haralick and Ihsin T. Phillips
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords();
// Use default parameters
// - mode of letters' height and width used as gap size
// - no minimum block width
var blocks = RecursiveXYCut.Instance.GetBlocks(words);
foreach (var block in blocks)
{
// Do something
}
}
}
The method can be tailored by providing a minimum block width, and horizontal and vertical gap sizes/functions:
- Minimum block width is set to 1/3 of page width:
var blocks = RecursiveXYCut.Instance.GetBlocks(words,
new RecursiveXYCut.RecursiveXYCutOptions()
{
MinimumWidth = page.Width / 3.0
});
- Average of the page letters' height and width is used as gap size (the values won’t change), and minimum block width is set to 1/3 of page width:
var blocks = RecursiveXYCut.Instance.GetBlocks(words,
new RecursiveXYCut.RecursiveXYCutOptions()
{
MinimumWidth = page.Width / 3.0,
DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width),
DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height)
});
- A function applied to each block's letters' heights and widths can also be provided. Here we use the average, and the minimum block width is set to 1/3 of the page width:
var blocks = RecursiveXYCut.Instance.GetBlocks(words,
new RecursiveXYCut.RecursiveXYCutOptions()
{
MinimumWidth = page.Width / 3.0,
DominantFontWidthFunc = letters => letters.Select(l => l.GlyphRectangle.Width).Average(),
DominantFontHeightFunc = letters => letters.Select(l => l.GlyphRectangle.Height).Average()
});
NB: Isolated bullet points can be handled by setting a minimum block width, e.g. RecursiveXYCut.Instance.GetBlocks(words, new RecursiveXYCut.RecursiveXYCutOptions() { MinimumWidth = page.Width / 3.0 })
Paraphrasing the abstract of the original paper: the document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbour clustering of page components. The method yields accurate within-line and between-line spacings, and locates text lines and text blocks.
A description of the original algorithm’s steps can be found on the Wikipedia page.
This implementation loosely follows these steps, using only a subset of them: since we start from the document itself rather than from an image, there is no need to estimate the skew. It works in a bottom-up fashion, starting at word level:
- Estimate within-line and between-line spacing
  - Within-line distance
  - Between-line distance
- Build lines of text (from words)
- Build blocks of text (from text lines)
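The bottom-up clustering above can be sketched in Python, in a much-simplified form (hypothetical names; words are reduced to centre points, and the two distance thresholds are passed in directly rather than estimated from nearest-neighbour statistics as in the real method):

```python
import math

def cluster(items, linked):
    """Group items by the transitive closure of a pairwise predicate (DFS)."""
    groups, seen = [], set()
    for i in range(len(items)):
        if i in seen:
            continue
        stack, group = [i], []
        seen.add(i)
        while stack:
            j = stack.pop()
            group.append(items[j])
            for k in range(len(items)):
                if k not in seen and linked(items[j], items[k]):
                    seen.add(k)
                    stack.append(k)
        groups.append(group)
    return groups

def angle_deg(a, b):
    return math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))

def docstrum_blocks(word_centres, within_line_dist, between_line_dist):
    # 1. Words linked by a roughly horizontal angle (here [-30, 30] degrees)
    #    and a small distance belong to the same text line.
    def same_line(a, b):
        ang = abs(angle_deg(a, b))
        ang = min(ang, 180 - ang)  # direction-agnostic
        return math.dist(a, b) <= within_line_dist and ang <= 30
    lines = cluster(word_centres, same_line)
    # 2. Lines linked by a roughly vertical angle (here [45, 135] degrees)
    #    and a small distance belong to the same block.
    centres = [(sum(x for x, _ in ln) / len(ln), sum(y for _, y in ln) / len(ln))
               for ln in lines]
    def same_block(a, b):
        return (math.dist(a, b) <= between_line_dist
                and 45 <= abs(angle_deg(a, b)) <= 135)
    return cluster(centres, same_block)
```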
- https://en.wikipedia.org/wiki/Document_layout_analysis#Example_of_a_bottom_up_approach
- The Document Spectrum for Page Layout Analysis by Lawrence O’Gorman
- Document Structure and Layout Analysis by Anoop M. Namboodiri and Anil K. Jain
- Document Layout Analysis by Garrett Hoch
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords();
// Use default parameters
// - within line angle is set to [-30, +30] degree (horizontal angle)
// - between lines angle is set to [45, 135] degree (vertical angle)
// - between line multiplier is 1.3
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
foreach (var block in blocks)
{
// Do something
}
}
}
The method can be tailored by providing a within line angle, a between lines angle and a between line multiplier:
- Here, the within-line angle is set to [-45, +45] degrees (horizontal), the between-lines angle to [35, 170] degrees (vertical), and the between-line multiplier to 1.5:
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words,
new DocstrumBoundingBoxes.DocstrumBoundingBoxesOptions()
{
WithinLineBounds = new DocstrumBoundingBoxes.AngleBounds(-45, 45),
BetweenLineBounds = new DocstrumBoundingBoxes.AngleBounds(35, 170),
BetweenLineMultiplier = 1.5
});
Reading order detectors deal with the task of finding the blocks' reading order in a page. Only `TextBlock`s are ordered at the moment. Three different methods are currently available:
This reading order detector does nothing: it returns the blocks as they are provided, and the `ReadingOrder` of each `TextBlock` remains -1.
n/a
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var orderedBlocks = DefaultReadingOrderDetector.Instance.Get(blocks);
foreach (var block in orderedBlocks)
{
// Do something
}
}
}
n/a
This reading order detector uses the average of the `Letter.TextSequence` values contained in each block to determine the reading order.
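A minimal Python sketch of this idea (hypothetical names; each block is reduced to the list of its letters' sequence numbers, i.e. the order in which the letters appear in the PDF content stream):

```python
def rendering_order(blocks):
    """Order blocks by the average TextSequence of the letters they contain."""
    return sorted(blocks, key=lambda seqs: sum(seqs) / len(seqs))
```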
n/a
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var orderedBlocks = RenderingReadingOrderDetector.Instance.Get(blocks);
foreach (var block in orderedBlocks)
{
// Do something
}
}
}
TO DO
As per the original paper (see References): We follow the approach of Aiello et al., who defined a set of binary relations for intervals in X and Y direction that allow a certain amount of tolerance for the coordinate values. In total there are 13 relations in both X and Y direction, and for each pair of two-dimensional bounding boxes exactly one X relation and exactly one Y relation is true. This tolerance is implemented by a parameter `T`; if two coordinates are closer than `T` they are considered equal. This flexibility is necessary because, due to the inherent noise in the PDF extraction, text blocks in the same column might not be exactly aligned (here we choose `T = 5`). Aiello et al. then defined the `BeforeInReading` relation as a Boolean combination of binary relations for intervals in X and Y direction, which states for any pair of bounding boxes whether the first one occurs at some point (not necessarily immediately) before the other in a column-wise reading order.
In addition to [the above], we also define the `BeforeInRendering` relation that tells whether a block is rendered at some time before another block in the PDF. We incorporate both relations into a single partial ordering of blocks by specifying a directed graph with an edge between every pair of blocks for which at least one of the two relations holds. – S. Klampfl et al.
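A much-simplified Python sketch of this idea (hypothetical names; the `before_in_reading` predicate below is a crude stand-in for the paper's 13-relation definition, and blocks are plain bounding boxes with y increasing downward):

```python
from graphlib import TopologicalSorter

T = 5  # tolerance: coordinates closer than T are considered equal

def before_in_reading(a, b):
    """Crude stand-in for BeforeInReading. Boxes are (x0, y_top, x1, y_bottom).

    a reads before b if a's column lies to the left of b's, or if both
    share a column and a sits above b.
    """
    if a[2] <= b[0] + T:
        return True
    return abs(a[0] - b[0]) < T and a[3] <= b[1] + T

def reading_order(blocks):
    # Build a directed graph: an edge i -> j means block i must be read before j.
    predecessors = {i: set() for i in range(len(blocks))}
    for i, a in enumerate(blocks):
        for j, b in enumerate(blocks):
            if i != j and before_in_reading(a, b):
                predecessors[j].add(i)
    # A topological sort of the graph yields a compatible reading order.
    return [blocks[i] for i in TopologicalSorter(predecessors).static_order()]
```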
- Section 5.1 of Unsupervised document structure analysis of digital scientific articles by S. Klampfl, M. Granitzer, K. Jack, R. Kern
- Document Understanding for a Broad Class of Documents by L. Todoran, M. Worring, M. Aiello and C. Monz.
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var orderedBlocks = UnsupervisedReadingOrderDetector.Instance.Get(blocks);
foreach (var block in orderedBlocks)
{
// Do something
}
}
}
- Set the tolerance parameter `T` to 10:
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
var unsupervisedReadingOrderDetector = new UnsupervisedReadingOrderDetector(10);
var orderedBlocks = unsupervisedReadingOrderDetector.Get(blocks);
foreach (var block in orderedBlocks)
{
// Do something
}
}
}
Viewing the exported xml file in PRImA Research Lab's LayoutEvalGUI:
NB: In this example, the decoration blocks (header and page number) are not ordered correctly. The pdf was generated using Word and it seems that the rendering order is not respected for decoration blocks. See the Decoration Text Block Classifier to know how to detect these blocks.
TO DO
TO DO
TO DO
TO DO
TO DO
TO DO: image
The algorithm finds a cover of the background whitespace of a document in terms of maximal empty rectangles. It is a top-down algorithm, which means it uses a whole page as its starting point, and works in a way analogous to quicksort or Branch and Bound algorithms. – Ø. R. Berg, 2011
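The branch-and-bound idea can be sketched in Python (a simplified illustration with hypothetical names, not the library's implementation; the real method also applies the fuzziness, minimum sizes and queue-size limit described below):

```python
import heapq

def area(r):
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def whitespace_cover(page, obstacles, n=1):
    """Return up to n maximal empty rectangles, largest first (branch and bound)."""
    found = []
    queue = [(-area(page), page, tuple(obstacles))]
    while queue and len(found) < n:
        _, bound, obs = heapq.heappop(queue)
        live = [o for o in obs if overlaps(o, bound)]
        if not live:
            # No obstacle intersects this bound: it is an empty rectangle, and
            # since bounds are popped largest-first it is the largest remaining.
            found.append(bound)
            continue
        # Split around a pivot obstacle into four sub-bounds and recurse.
        pivot = max(live, key=area)
        x0, y0, x1, y1 = bound
        px0, py0, px1, py1 = pivot
        for sub in ((x0, y0, px0, y1), (px1, y0, x1, y1),
                    (x0, y0, x1, py0), (x0, py1, x1, y1)):
            if area(sub) > 0:
                heapq.heappush(queue, (-area(sub), sub, tuple(live)))
    return found
```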
- Section 3.2 of High precision text extraction from PDF documents by Øyvind Raddum Berg
- Section 2 of Two geometric algorithms for layout analysis by Thomas M. Breuel
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Use default parameters
// - maximum number of whitespace rectangles is 40 by default
// - whitespace fuzziness is 0.15
// - no limit to queue size
var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
page.GetImages());
}
}
The method can be tailored by providing maxRectangleCount, maxBoundQueueSize, whitespaceFuzziness, minWidth and minHeight (the fuzziness is defined as the percentage by which candidate whitespace rectangles are allowed to overlap the surrounding obstacles):
- Here, the maximum number of whitespace rectangles is set to 100 and the maximum queue size is set to 10,000:
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
page.GetImages(),
100,
10_000);
}
}
- Here, the maximum number of whitespace rectangles is set to 100, the maximum queue size is set to 10,000, the minimum whitespace rectangles' Width and Height is set to 0, and the fuzziness is set to 0 (instead of 0.15 by default):
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
page.GetImages(),
minWidth: 0,
minHeight: 0,
maxRectangleCount: 100,
whitespaceFuzziness: 0.0,
maxBoundQueueSize: 10_000);
}
}
The algorithm returns text blocks that are classified as decoration blocks. From the paper in reference: Many digital documents have archival information such as author names, publication titles, page numbers, and release dates printed repeatedly at the border of each page. Most prominently this content is placed inside headers or footers, but sometimes also at the left or right edge of the page. We refer to text blocks containing this type of information as decoration blocks. – S. Klampfl et al.
The document should contain more than two pages for the classifier to work.
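The core comparison can be sketched in Python (hypothetical names; the real classifier works on border text blocks extracted per page): a normalised Levenshtein distance, plus a check that nearly the same border text repeats across pages:

```python
def normalised_distance(s1, s2):
    """Levenshtein distance divided by the longer string's length (0 = identical)."""
    if not s1 and not s2:
        return 0.0
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1] / max(len(s1), len(s2))

def looks_like_decoration(block_texts, threshold=0.25):
    """A border block is a decoration candidate when its text is nearly
    identical across most pages (e.g. running headers, page numbers)."""
    pairs = [(a, b) for i, a in enumerate(block_texts) for b in block_texts[i + 1:]]
    similar = sum(normalised_distance(a, b) < threshold for a, b in pairs)
    return similar >= len(pairs) / 2
```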
- Unsupervised document structure analysis of digital scientific articles by S. Klampfl, M. Granitzer, K. Jack, R. Kern.
using (var document = PdfDocument.Open(@"document.pdf"))
{
var docDecorations = DecorationTextBlockClassifier.Get(document.GetPages().ToList(),
DefaultWordExtractor.Instance,
DocstrumBoundingBoxes.Instance);
for (int i = 0; i < document.NumberOfPages; i++)
{
IReadOnlyList<TextBlock> decorations = docDecorations[i];
// Do something
}
}
If you want to use a different library to compute the minimum edit distance used by the algorithm to compare strings, you can do as follows (here using the Fastenshtein library):
using Fastenshtein;
using (var document = PdfDocument.Open(@"document.pdf"))
{
var docDecorations = DecorationTextBlockClassifier.Get(document.GetPages().ToList(),
DefaultWordExtractor.Instance,
DocstrumBoundingBoxes.Instance,
(string s1, string s2) => Levenshtein.Distance(s1, s2) / (double)Math.Max(s1.Length, s2.Length));
// The minimum edit distance should be normalised
for (int i = 0; i < document.NumberOfPages; i++)
{
IReadOnlyList<TextBlock> decorations = docDecorations[i];
// Do something
}
}
TO DO
Exporters deal with the task of transforming a pdf page into a text representation while keeping as much layout information as possible. Three different formats are currently available:
The PAGE (Page Analysis and Ground-Truth Elements) Format is an XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content. – PRImA Research Lab
You can use PRImA Research Lab's LayoutEvalGUI software to open the exported file.
- The PAGE xml exporter is the only exporter supporting reading order (as of Jan. 2020). See here for an example.
- https://www.primaresearch.org/publications/ICPR2010_Pletschacher_PAGE
- http://www.primaresearch.org/tools
- https://github.com/PRImA-Research-Lab/PAGE-XML
Using the `DefaultWordExtractor`, `RecursiveXYCut` and `UnsupervisedReadingOrderDetector`:
PageXmlTextExporter pageXmlTextExporter = new PageXmlTextExporter(
DefaultWordExtractor.Instance,
RecursiveXYCut.Instance,
UnsupervisedReadingOrderDetector.Instance);
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text
string xml = pageXmlTextExporter.Get(page);
// Save text to an xml file
File.WriteAllText("document.pagexml.xml", xml);
}
}
Opening the exported xml file in PRImA Research Lab's LayoutEvalGUI:
ALTO (Analyzed Layout and Text Object) is an open XML Schema developed by the EU-funded project called METAe. The standard was initially developed for the description of text OCR and layout information of pages for digitized material. The goal was to describe the layout and text in a form to be able to reconstruct the original appearance based on the digitized information - similar to the approach of a lossless image saving operation. – Wikipedia
You can use PRImA Research Lab's LayoutEvalGUI software to open the exported file.
- https://www.loc.gov/standards/alto/
- https://github.com/altoxml
- https://en.wikipedia.org/wiki/ALTO_(XML)
Using the `NearestNeighbourWordExtractor` and `DocstrumBoundingBoxes`:
AltoXmlTextExporter altoXmlTextExporter = new AltoXmlTextExporter(
NearestNeighbourWordExtractor.Instance,
DocstrumBoundingBoxes.Instance);
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text
string xml = altoXmlTextExporter.Get(page);
// Save text to an xml file
File.WriteAllText("document.altoxml.xml", xml);
}
}
Opening the ALTO xml file should yield the same result as for the PAGE xml format.
hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML. – Wikipedia
- http://kba.cloud/hocr-spec/1.2/
- https://github.com/kba/hocr-spec
- https://en.wikipedia.org/wiki/HOCR
- https://github.com/kba/hocrjs
Using the `DefaultWordExtractor` and `DocstrumBoundingBoxes`:
HOcrTextExporter hocrTextExporter = new HOcrTextExporter(
DefaultWordExtractor.Instance,
DocstrumBoundingBoxes.Instance);
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text
string html = hocrTextExporter.Get(page);
// Save text to an html file
File.WriteAllText("document.hocr.html", html);
}
}
Using the `DefaultWordExtractor` and `DocstrumBoundingBoxes`, and adding a reference to the hocrjs script (more on hocrjs here):
HOcrTextExporter hocrTextExporter = new HOcrTextExporter(
DefaultWordExtractor.Instance,
DocstrumBoundingBoxes.Instance);
using (var document = PdfDocument.Open(@"document.pdf"))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to text, adding a reference to hocrjs script
string html = hocrTextExporter.Get(page, useHocrjs: true);
// Save text to an html file
File.WriteAllText("document.hocr.html", html);
}
}
Viewing the exported html file using hocrjs:
Converts the pdf page to SVG (Scalable Vector Graphics).
SvgTextExporter exporter = new SvgTextExporter();
var options = new ParsingOptions() { ClipPaths = true }; // true if clipped paths are needed
using (var document = PdfDocument.Open(@"document.pdf", options))
{
for (var i = 0; i < document.NumberOfPages; i++)
{
var page = document.GetPage(i + 1);
// Convert page to svg
var svg = exporter.Get(page);
// Save the svg to a file
File.WriteAllText("document.svg", svg);
}
}