davebrokit edited this page May 27, 2024 · 36 revisions

This wiki contains more detail on various aspects of the public API and the PDF document format.

Features

  • Extracts the position and size of letters from any PDF document. This enables access to the text and words in a PDF document.
  • Allows the user to retrieve images from the PDF document.
  • Allows the user to read PDF annotations, PDF forms, embedded documents and hyperlinks from a PDF.
  • Provides access to metadata in the document.
  • Exposes the internal structure of the PDF document.
  • Creates PDF documents containing text and path operations.
  • Reads content from encrypted files when the password is provided.
  • Document Layout Analysis: PdfPig also includes tools for document layout analysis such as the Recursive XY Cut, Document Spectrum and Nearest Neighbour algorithms, among others. It also supports exporting page contents to the Alto, PageXML and hOCR formats. See Document Layout Analysis.
  • Tables are not directly supported, but you can use Tabula Sharp or Camelot Sharp. As of 2023, Tabula Sharp is the most complete port.

For some use cases, PdfPig is an alternative to commercial libraries such as SpirePDF and to copyleft libraries such as iText 7 (AGPL).
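A few of the features listed above can be sketched together as follows. This is a minimal sketch, not a complete tour of the API: the file path and password are placeholders, and it assumes the `ParsingOptions.Password`, `PdfDocument.Information` and `Page.GetImages()` members behave the same way in the version you have installed.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Placeholder path and password: replace both with your own values.
var options = new ParsingOptions { Password = "password here" };

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\encrypted.pdf", options))
{
    // Document-level metadata.
    Console.WriteLine($"Title: {document.Information.Title}");
    Console.WriteLine($"Pages: {document.NumberOfPages}");

    foreach (Page page in document.GetPages())
    {
        // Count the images found on each page.
        int imageCount = page.GetImages().Count();
        Console.WriteLine($"Page {page.Number}: {imageCount} image(s)");
    }
}
```

For a document that is not encrypted, the options argument can be left out.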

Things you can't do:

Getting Started

PdfPig aims to provide two main areas of functionality:

  • Extracting PDF content.
  • Creating PDFs.

The simplest usage of the library for extracting content involves opening a document and extracting the position and text of all words across all pages:

using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
	foreach (Page page in document.GetPages())
	{
		// Each Word exposes its text (Text) and its position on the page (BoundingBox).
		IEnumerable<Word> words = page.GetWords();
	}
}

Pages can also be accessed individually with an index starting at 1. You can also access the positions and sizes of the individual letters on a page:

using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
	// Page numbers are 1-based, so this is the first page.
	Page page = document.GetPage(1);
	IReadOnlyList<Letter> letters = page.Letters;
}
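To show what "positions and sizes" look like in practice, the sketch below prints each letter's value, bounding box and point size. It assumes the `Letter` class exposes `Value`, `GlyphRectangle` and `PointSize` as in recent releases; check the names against the version you use. The file path is the same placeholder as above.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;

using (PdfDocument document = PdfDocument.Open(@"C:\path\to\file.pdf"))
{
    Page page = document.GetPage(1);

    foreach (Letter letter in page.Letters)
    {
        // GlyphRectangle is the letter's bounding box in PDF page coordinates.
        PdfRectangle box = letter.GlyphRectangle;
        Console.WriteLine(
            $"'{letter.Value}' at ({box.Left}, {box.Bottom}), " +
            $"size {box.Width} x {box.Height}, {letter.PointSize}pt");
    }
}
```

PDF page coordinates normally have their origin at the bottom-left corner of the page, so `Bottom` grows as text moves up the page.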

For document creation a new document can be created using the Standard14 fonts which are included in the PDF specification:

using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder page = builder.AddPage(PageSize.A4);
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
page.AddText("Hello World!", 12, new PdfPoint(25, 520), font);
byte[] b = builder.Build();

The resulting bytes are a valid PDF document and can be saved to the file system, served from a web server, etc.
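As a minimal sketch of the file-system case, the bytes can be written out with the standard `File.WriteAllBytes` method. The output name `hello.pdf` is an arbitrary choice for illustration and is resolved relative to the working directory.

```csharp
using System.IO;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Fonts.Standard14Fonts;
using UglyToad.PdfPig.Writer;

// Build the same one-page document as above.
PdfDocumentBuilder builder = new PdfDocumentBuilder();
PdfPageBuilder page = builder.AddPage(PageSize.A4);
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
page.AddText("Hello World!", 12, new PdfPoint(25, 520), font);
byte[] bytes = builder.Build();

// "hello.pdf" is a hypothetical output path.
File.WriteAllBytes("hello.pdf", bytes);
```

In a web application the same byte array could instead be returned as the response body with the `application/pdf` content type.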

You can use the document builder to visualise what PdfPig extracted while reading a document: copy a page from the source PDF and draw a rectangle for each word using its bounding box.

using UglyToad.PdfPig;
using UglyToad.PdfPig.Writer;

// pdf is the path to the source file and pageNumber is the 1-based page to copy.
using (var document = PdfDocument.Open(pdf))
{
    var builder = new PdfDocumentBuilder();
    // Copy the existing page so the rectangles are drawn over its original content.
    var pageBuilder = builder.AddPage(document, pageNumber);
    pageBuilder.SetStrokeColor(255, 0, 0);
    var page = document.GetPage(pageNumber);
    foreach (var word in page.GetWords())
    {
        var box = word.BoundingBox;
        pageBuilder.DrawRectangle(box.BottomLeft, (decimal)box.Width, (decimal)box.Height);
    }

    byte[] b = builder.Build();
    // Save to file etc
}

View this gist that goes through some basic beginner examples: https://gist.github.com/cordasfilip/c6d2510b358323dc2f71c843460cbcdf

Contents

More details on the API can be found here.

Additional automated documentation from doc-comments can be found on DotNetApis.

Release Notes

Release notes and downloadable packages can be found on the releases page: https://github.com/UglyToad/PdfPig/releases
