Skip to content

Document Layout Analysis

BobLd edited this page Jun 20, 2020 · 69 revisions

In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.Wikipedia

In our case, we are using the pdf document itself instead of image representation. The following categories of tools are available:

Word extractors

Word extractors deal with the task of building words using letters in a page. 2 different methods are currently available:

Use cases

Words Default method Nearest Neighbour
Horizontal ✔️ ✔️
Axis aligned rotated ✖️ ✔️
Rotated ✔️
Curved ✔️

Legend: ✔️: supported, ✖️: partial support, ❌: not supported

TO DO

Description

The nearest neighbour word extractor is useful to extract words from pdf documents with complex layouts. It will seek to connect each glyph bound box's EndBaseLine point with the closest glyph bound box StartBaseLine point. I order to decide wether two glyphs are close enough from each other, the algorithm uses the maximum of both candidates Width as a reference distance.

  • For glyphs with known text direction, the Manhattan distance is used and the threshold is set to 20% of the width .
  • For glyphs with unknown text direction, the Euclidean distance is used and the threshold is set to 50% of the width.

If the measured distance between the two glyphs is below this threshold, they are deemed to be connected.

Once glyphs are connected, they are then grouped to form words via a depth first search algorithm. The glyphs ordering is also done by the algorithm. It seems that both left-to-right and right-to-left scripts have there glyph StartBaseLine on the left and EndBaseLine on the right. So for right-to-left script, the word's glyph ordering should be in reverse oder.

Usage

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		var words = page.GetWords(NearestNeighbourWordExtractor.Instance);

		foreach (var word in words)
		{
			// Do something
		}
	}
}

Results

The algorithm was used on this map that has a complex layout, with glyphs/words having very diverse text directions. The algorithm is able to rebuild words independently of their direction.

Words are indicated by red rectangles:

nearest neighbour word example

Page segmenters

Page segmenters deal with the task of finding block of text in a page. 3 different methods are currently available:

Use cases

Text Default method Recursive XY Cut Docstrum
Single Column ✔️ ✔️ ✔️
Multi Columns ✖️ ✔️ ✔️
L-shaped text ✔️
Rotated lines/paragraphs ✖️ ✔️

Legend: ✔️: supported, ✖️: partial support, ❌: not supported

Description

This method returns one single block, containing all words in the page.

Usage

var blocks = DefaultPageSegmenter.Instance.GetBlocks(words);

Description

The recursive X-Y cut is a top-down page segmentation technique that decomposes a document image recursively into a set of rectangular blocks. – Wikipedia

The algorithm works in a top-down manner because it starts at the page level, then scan it from left to right to find a large enough gap between words. Once the gap is found, a vertical cut is made at this level and the page is now separated in two blocks. Then the algorithm start scanning from bottom to top each block in order to find a large enough gap, and once it is found, a horizontal cut is made. This 2-step process is repeated until no cut is possible.

The height and width of the dominant font are used to decide vertical and horizontal gap sizes. A minimum block width can also be used to filter small blocks.

Once all blocks are discovered, lines of text are simply created by grouping words by their bottom coordinate and ordering them from left to right.

The default implementation will use the mode of the height and width of the letters in each block as relevant gap sizes. The minimum block width is 0 (no minimum).

References

Usage

Simple case

using (var document = PdfDocument.Open(“document.pdf”))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);
		var words = page.GetWords();
		
		// Use default parameters
		// - mode of letters' height and width used as gap size
		// - no minimum block width 
		var blocks = RecursiveXYCut.Instance.GetBlocks(words);

		foreach (var block in blocks)
		{
			// Do something
		}
	}
}

Advanced cases

The method can be tailored by providing a minimum block width, and horizontal and vertical gap sizes/functions:

  • Minimum block width is set to 1/3 of page width:
var blocks = RecursiveXYCut.Instance.GetBlocks(words,
				new RecursiveXYCut.RecursiveXYCutOptions()
				{
					MinimumWidth = page.Width / 3.0
				});
  • Average of the page letters' height and width is used as gap size (the values won’t change), and minimum block width is set to 1/3 of page width:
var blocks = RecursiveXYCut.Instance.GetBlocks(words,
				new RecursiveXYCut.RecursiveXYCutOptions()
				{
					MinimumWidth = page.Width / 3.0,
					DominantFontWidthFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Width),
					DominantFontHeightFunc = _ => page.Letters.Average(l => l.GlyphRectangle.Height)
				});
  • A function that will be applied to each block letters’ height and width can also be provided. Here we use the average and minimum block width is set to 1/3 of page width:
var blocks = RecursiveXYCut.Instance.GetBlocks(words,
				new RecursiveXYCut.RecursiveXYCutOptions()
				{
					MinimumWidth = page.Width / 3.0,
					DominantFontWidthFunc = letters => letters.Select(l => l.GlyphRectangle.Width).Average(),
					DominantFontHeightFunc = letters => letters.Select(l => l.GlyphRectangle.Height).Average()
				});

Results

recursive xy cut example

NB: Isolated bullet points can be handled by setting a minimum block width, e.g. RecursiveXYCut.Instance.GetBlocks(words, new RecursiveXYCut.RecursiveXYCutOptions() { MinimumWidth = page.Width / 3.0 })

Description

Paraphrasing the abstract of the original paper, the document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbour clustering of page components. The method yields an accurate within-line, and between-line spacings and locates text lines and text blocks.

A description of the original algorithm’s steps can be found on the Wikipedia page.

This implementation loosely follows these steps, using only a subset them, as we don’t start from an image but from the document itself and don’t need to estimate the skew. It works in a bottom-up fashion, starting at word level:

  1. Estimate in line and between line spacing
    • Within-line distance
    • Between-line distance
  2. Build lines of text (from words)
  3. Build blocks of text (from text lines)

References

Usage

Simple case

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);
		var words = page.GetWords();

		// Use default parameters
		// - within line angle is set to [-30, +30] degree (horizontal angle)
		// - between lines angle is set to [45, 135] degree (vertical angle)
		// - between line multiplier is 1.3
		var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

		foreach (var block in blocks)
		{
			// Do something
		}
	}
}

Advanced case

The method can be tailored by providing a within line angle, a between lines angle and a between line multiplier:

  • Here, the within line angle is set to [-45, +45] degree (horizontal angle), the between lines angle is set to [35, 170] degree (vertical angle) and the between line multiplier is 1.5:
var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words,
				new DocstrumBoundingBoxes.DocstrumBoundingBoxesOptions()
				{
					WithinLineBounds = new DocstrumBoundingBoxes.AngleBounds(-45, 45),
					BetweenLineBounds = new DocstrumBoundingBoxes.AngleBounds(35, 170),
					BetweenLineMultiplier = 1.5
				});

Results

docstrum bounding boxes example

docstrum bounding boxes example rotated

Reading order detectors

Reading order detectors deal with the task of finding the blocks' reading order in a page. Only TextBlock are ordered at the moment. 3 different methods are currently available:

Description

This reading order detector does nothing. It will return the blocks as they are provided and the ReadingOrder of each TextBlock remains -1.

References

n/a

Usage

Simple case

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
		var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
		var orderedBlocks = DefaultReadingOrderDetector.Instance.Get(blocks);

		foreach (var block in orderedBlocks)
		{
			// Do something
		}
	}
}

Result

n/a

Description

This reading order detector uses the average of the Letter.TextSequence contained in each block to determine the reading order.

References

n/a

Usage

Simple case

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
		var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
		var orderedBlocks = RenderingReadingOrderDetector.Instance.Get(blocks);

		foreach (var block in orderedBlocks)
		{
			// Do something
		}
	}
}

Result

TO DO

Description

As per the original paper (see References): We follow the approach of Aiello et al., who defined a set of binary relations for intervals in X and Y direction that allow a certain amount of tolerance for the coordinate values. In total there are 13 relations in both X and Y direction, and for each pair of two-dimensional bounding boxes exactly one X relation and exactly one Y relation is true. This tolerance is implemented by a parameter T; if two coordinates are closer than T they are considered equal. This flexibility is necessary because due to the inherent noise in the PDF extraction text blocks in the same column might not be exactly aligned (here we choose T = 5). Aiello et al. then defined the BeforeInReading relation as a Boolean combination of binary relations for intervals in X and Y direction, which states for any pair of bounding boxes whether the first one occurs at some point (not necessarily immediately) before the other in a column-wise reading order.

In addition to [the above], we also define the BeforeInRendering relation that tells whether a block is rendered at some time before another block in the PDF. We incorporate both relations into a single partial ordering of blocks by specifying a directed graph with an edge between every pair of blocks for which at least one of the two relations hold. – S. Klampfl et al.

References

Usage

Simple case

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
		var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);
		var orderedBlocks = UnsupervisedReadingOrderDetector.Instance.Get(blocks);

		foreach (var block in orderedBlocks)
		{
			// Do something
		}
	}
}

Advanced case

  • Set the tolerance parameter T to 10.
using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
		var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

		var unsupervisedReadingOrderDetector = new UnsupervisedReadingOrderDetector(10);
		var orderedBlocks = unsupervisedReadingOrderDetector.Get(blocks);

		foreach (var block in orderedBlocks)
		{
			// Do something
		}
	}
}

Result

Viewing the exported xml file in PRImA Research Lab's LayoutEvalGUI: reading order example

NB: In this example, the decoration blocks (header and page number) are not ordered correctly. The pdf was generated using Word and it seems that the rendering order is not respected for decoration blocks. See the Decoration Text Block Classifier to know how to detect these blocks.

Other layout tools

TO DO

Description

TO DO

References

TO DO

Usage

Simple case

TO DO

Advanced case

TO DO

Result

TO DO: image

Description

The algorithm finds a cover of the background whitespace of a document in terms of maximal empty rectangles. It is a top-down algorithm, which means it uses a whole page as its starting point, and works in a way analogous to quicksort or Branch and Bound algorithms. – Ø. R. Berg, 2011

References

Usage

Simple case

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		// Use default parameters
		// - maximum number of whitespace rectangles is 40 by default
		// - whitespace fuzziness is 0.15
		// - no limit to queue size
		var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
									  page.GetImages());
	}
}

Advanced case

The method can be tailored by providing maxRectangleCount, maxBoundQueueSize, whitespaceFuzziness, minWidth and minHeight (the fuzziness is defined as the percentage by which candidate whitespace rectangles are allowed to overlap the surrounding obstacles):

  • Here, the maximum number of whitespace rectangles is set to 100 and the maximum queue size is set to 10,000:
using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
									  page.GetImages(), 
									  100,
									  10_000);
	}
}
  • Here, the maximum number of whitespace rectangles is set to 100, the maximum queue size is set to 10,000, the minimum whitespace rectangles' Width and Height is set to 0, and the fuzziness is set to 0 (instead of 0.15 by default):
using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		var whitespaces = WhitespaceCoverExtractor.GetWhitespaces(page.GetWords(),
									  page.GetImages(),
									  minWidth: 0,
									  minHeight: 0,
									  maxRectangleCount: 100,
									  whitespaceFuzziness: 0.0,
									  maxBoundQueueSize: 10_000);
	}
}

Result

whitespace coverage example

The algorithm returns text blocks that are classified as decoration blocks. From the paper in reference: Many digital documents have archival information such as author names, publication titles, page numbers, and release dates printed repeatedly at the border of each page. Most prominently this content is placed inside headers or footers, but sometimes also at the left or right edge of the page. We refer to text blocks containing this type of information as decoration blocks. – S. Klampfl et al.

The document should contain more than 2 pages to work.

References

Usage

Simple case

using (var document = PdfDocument.Open(@"document.pdf"))
{
	var docDecorations = DecorationTextBlockClassifier.Get(document.GetPages().ToList(),
						DefaultWordExtractor.Instance,
						DocstrumBoundingBoxes.Instance);

	for (int i = 0; i < document.NumberOfPages; i++)
	{
		IReadOnlyList<TextBlock> decorations = docDecorations[i];
		// Do something
	}
}

Advanced case

If you want to use a different library to compute the minimum edit distance, used in the algorithm to compare strings, you can do as follow (using the Fastenshtein library):

using Fastenshtein;

using (var document = PdfDocument.Open(@"document.pdf"))
{
	var docDecorations = DecorationTextBlockClassifier.Get(document.GetPages().ToList(),
						DefaultWordExtractor.Instance,
						DocstrumBoundingBoxes.Instance,
						(string s1, string s2) => Levenshtein.Distance(s1, s2) / (double)Math.Max(s1.Length, s2.Length));
	// The minimum edit distance should be normalised

	for (int i = 0; i < document.NumberOfPages; i++)
	{
		IReadOnlyList<TextBlock> decorations = docDecorations[i];
		// Do something
	}
}

Result

TO DO

Export

Exporters deal with the task of transforming a pdf page into a text representation, while keeping as much information about the layout as possible. 3 different formats are currently available:

Description

The PAGE (Page Analysis and Ground-Truth Elements) Format is an XML-based page image representation framework that records information on image characteristics (image borders, geometric distortions and corresponding corrections, binarisation etc.) in addition to layout structure and page content.PRImA Research Lab

You can use PRImA Research Lab's LayoutEvalGUI software to open the exported file.

  • The PAGE xml exporter is the only exporter supporting reading order (as of Jan. 2020). See here for an example.

References

Usage

Using the DefaultWordExtractor, RecursiveXYCut and UnsupervisedReadingOrderDetector:

PageXmlTextExporter pageXmlTextExporter = new PageXmlTextExporter(
	DefaultWordExtractor.Instance,
	RecursiveXYCut.Instance,
	UnsupervisedReadingOrderDetector.Instance);

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		// Convert page to text
		string xml = pageXmlTextExporter.Get(page);

		// Save text to an xml file
		File.WriteAllText("document.pagexml.xml", xml);
	}
}

Results

Opening the exported xml file in PRImA Research Lab's LayoutEvalGUI:

page xml example

Description

ALTO (Analyzed Layout and Text Object) is an open XML Schema developed by the EU-funded project called METAe. The standard was initially developed for the description of text OCR and layout information of pages for digitized material. The goal was to describe the layout and text in a form to be able to reconstruct the original appearance based on the digitized information - similar to the approach of a lossless image saving operation.Wikipedia

You can use PRImA Research Lab's LayoutEvalGUI software to open the exported file.

References

Usage

Using the NearestNeighbourWordExtractor and DocstrumBoundingBoxes:

AltoXmlTextExporter altoXmlTextExporter = new AltoXmlTextExporter(
	NearestNeighbourWordExtractor.Instance,
	DocstrumBoundingBoxes.Instance);

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		// Convert page to text
		string xml = altoXmlTextExporter.Get(page);

		// Save text to an xml file
		File.WriteAllText("document.altoxml.xml", xml);
	}
}

Results

Opening the Alto xml file in should yield the same result as for the PAGE xml format.

Description

hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR). The definition encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.Wikipedia

References

Usage

Simple case

Using the DefaultWordExtractor and DocstrumBoundingBoxes:

HOcrTextExporter hocrTextExporter = new HOcrTextExporter(
	DefaultWordExtractor.Instance,
	DocstrumBoundingBoxes.Instance);

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		// Convert page to text
		string html = hocrTextExporter.Get(page);

		// Save text to an html file
		File.WriteAllText("document.hocr.html", html);
	}
}

Advanced case

Using the DefaultWordExtractor and DocstrumBoundingBoxes and adding the reference to the hocrjs script (more on hocrjs here):

HOcrTextExporter hocrTextExporter = new HOcrTextExporter(
	DefaultWordExtractor.Instance,
	DocstrumBoundingBoxes.Instance);

using (var document = PdfDocument.Open(@"document.pdf"))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		// Convert page to text, adding a reference to hocrjs script
		string html = hocrTextExporter.Get(page, useHocrjs: true);

		// Save text to an html file
		File.WriteAllText("document.hocr.html", html);
	}
}

Results

Viewing the exported html file using hocrjs:

hocr example

Description

Converts the pdf page to SVG (Scalable Vector Graphics).

References

Usage

SvgTextExporter exporter = new SvgTextExporter();

var options = new ParsingOptions() { ClipPaths = true }; // true if clipped path are needed
using (var document = PdfDocument.Open(@"document.pdf", options ))
{
	for (var i = 0; i < document.NumberOfPages; i++)
	{
		var page = document.GetPage(i + 1);

		// Convert page to text
		var svg = exporter.Get(page);

		// Save text to an html file
		File.WriteAllText("document.html", svg);
	}
}

Results

Clone this wiki locally