Images
PDF files can contain two types of images:
- Inline: these images appear inline in the page's content stream (the PostScript-like sequence of operators that defines the appearance of the page). These are generally used for small images.
- XObjects: larger images are defined outside the content stream and referenced from the content stream by name using the `Do` operator.
There are some differences between the types of information stored depending on how the image is defined. PdfPig defines both `InlineImage` and `XObjectImage`, which both implement `IPdfImage`.
The images for a page are accessed via the `Page.GetImages()` method, which returns the set of images on the page.
`IPdfImage` exposes the placement rectangle of the image on the page (`Bounds`) as well as the width and height of the original image before any PDF transforms are applied (`WidthInSamples` and `HeightInSamples`, where samples are usually pixels).
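Comparing `Bounds` with the sample dimensions gives a rough idea of the effective resolution of an image as placed on the page. The sketch below is self-contained and does not call PdfPig. `ImageResolution` and `SamplesPerInch` are hypothetical names, and the calculation assumes the bounds are expressed in default PDF user space units (1/72 inch) and that the image is not rotated or skewed.

```csharp
using System;

// Hypothetical helper for illustration; not part of PdfPig.
public static class ImageResolution
{
    // Assumption: the page uses the PDF default of 72 user space units per inch.
    private const double UnitsPerInch = 72.0;

    /// <summary>
    /// Estimates samples per inch along one axis from a sample count
    /// (for example WidthInSamples) and the placed length on the page
    /// (for example the width of Bounds).
    /// </summary>
    public static double SamplesPerInch(int samples, double placedLength)
    {
        if (samples <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(samples));
        }

        if (placedLength <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(placedLength));
        }

        // Convert the placed length to inches, then divide the sample count by it.
        return samples / (placedLength / UnitsPerInch);
    }
}
```

For an image 600 samples wide placed across 144 units, `ImageResolution.SamplesPerInch(600, 144.0)` evaluates to 300.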
The actual content of the image bytes is either:
- A PDF format bitmap based on the ColorSpace.
- A JPEG file directly embedded in the file.
Where the image is a JPEG, decoding the bytes is not supported directly (`IPdfImage.TryGetBytes(out var bytes)` will return false); `IPdfImage.RawBytes` is already a valid JPEG file. Where the image is in PDF format, `RawBytes` usually holds the bitmap with one or more PDF filters applied (`FlateDecode` etc.). `IPdfImage.TryGetBytes(out var bytes)` returns the bytes after reversing these filters. The decoded bytes are then subject to interpretation based on the ColorSpace, the bits per component, the width and height in samples, etc.
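To illustrate how those parameters determine the layout of the decoded bytes, the sketch below computes the size a decoded PDF bitmap would be expected to have. It is self-contained and does not call PdfPig; `BitmapLayout` and its methods are hypothetical names. It assumes a simple color space where the number of components per sample is known (1 for DeviceGray, 3 for DeviceRGB, 4 for DeviceCMYK) and relies on the PDF rule that each row of sample data starts on a byte boundary.

```csharp
using System;

// Hypothetical helper for illustration; not part of PdfPig.
public static class BitmapLayout
{
    /// <summary>
    /// Number of bytes in one row of samples. Rows are padded up to a whole byte.
    /// </summary>
    public static int BytesPerRow(int widthInSamples, int componentsPerSample, int bitsPerComponent)
    {
        if (widthInSamples <= 0 || componentsPerSample <= 0 || bitsPerComponent <= 0)
        {
            throw new ArgumentOutOfRangeException();
        }

        long bitsPerRow = (long)widthInSamples * componentsPerSample * bitsPerComponent;

        // Round up to the next whole byte.
        return checked((int)((bitsPerRow + 7) / 8));
    }

    /// <summary>
    /// Total number of bytes expected for the decoded bitmap.
    /// </summary>
    public static long ExpectedLength(int widthInSamples, int heightInSamples, int componentsPerSample, int bitsPerComponent)
    {
        if (heightInSamples <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(heightInSamples));
        }

        return (long)BytesPerRow(widthInSamples, componentsPerSample, bitsPerComponent) * heightInSamples;
    }
}
```

A 1-bit grayscale image 10 samples wide needs 2 bytes per row (10 bits rounded up), so 3 rows occupy 6 bytes. Comparing such an expected length with the number of bytes actually returned can be a quick sanity check before interpreting the data.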
For common image types, `IPdfImage.TryGetPng(out byte[] bytes)` takes the raw bytes, reverses the filters, and converts the resulting PDF bitmap into a PNG file. When it returns true, the resulting bytes are a valid PNG image.
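Because extracted bytes may be PNG data (from `TryGetPng`) or an embedded JPEG (from `RawBytes`), it can help to choose the file extension from the data itself when saving. The sketch below is self-contained and does not call PdfPig; `ImageSignature` and its methods are hypothetical names. It only inspects the leading signature bytes, so it is a heuristic and not a full validation of the file.

```csharp
using System;

// Hypothetical helper for illustration; not part of PdfPig.
public static class ImageSignature
{
    // The fixed 8-byte signature that starts every PNG file.
    private static readonly byte[] PngSignature =
    {
        0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    public static bool IsPng(byte[] data)
    {
        if (data == null || data.Length < PngSignature.Length)
        {
            return false;
        }

        for (int i = 0; i < PngSignature.Length; i++)
        {
            if (data[i] != PngSignature[i])
            {
                return false;
            }
        }

        return true;
    }

    // A JPEG stream starts with the SOI marker (FF D8) followed by another marker (FF ..).
    public static bool IsJpeg(byte[] data)
    {
        return data != null
            && data.Length >= 3
            && data[0] == 0xFF
            && data[1] == 0xD8
            && data[2] == 0xFF;
    }

    // Returns ".png", ".jpg", or ".bin" when the format is not recognised.
    public static string GuessExtension(byte[] data)
    {
        if (IsPng(data))
        {
            return ".png";
        }

        return IsJpeg(data) ? ".jpg" : ".bin";
    }
}
```

The returned extension can then be appended to the output file name, so PNG output is not accidentally saved with a JPEG extension.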
You will need PdfPig 0.1.10 or greater to add support for additional filters.
PdfPig does not support all filters out of the box. The filters that need external NuGet packages to run are JBIG2, JPX (JPEG 2000) and DCT (JPEG). The following packages are possible choices:
- https://github.com/BobLd/UglyToad.PdfPig.Filters.Jbig2.PdfboxJbig2
- https://github.com/BobLd/UglyToad.PdfPig.Filters.Jpx.OpenJpegDotNet
- https://github.com/BobLd/UglyToad.PdfPig.Filters.Dct.JpegLibrary
The filters can be used as below:
    using System.Collections.Generic;
    using UglyToad.PdfPig.Filters;
    using UglyToad.PdfPig.Filters.Dct.JpegLibrary;
    using UglyToad.PdfPig.Filters.Jbig2.PdfboxJbig2;
    using UglyToad.PdfPig.Filters.Jpx.OpenJpeg;
    using UglyToad.PdfPig.Tokens;

    /// <summary>
    /// Filter provider to add support for JBIG2, DCT and JPX filters.
    /// </summary>
    public sealed class MyFilterProvider : BaseFilterProvider
    {
        /// <summary>
        /// The single instance of this provider.
        /// </summary>
        public static readonly MyFilterProvider Instance = new MyFilterProvider();

        /// <inheritdoc/>
        private MyFilterProvider() : base(GetDictionary())
        {
        }

        private static Dictionary<string, IFilter> GetDictionary()
        {
            // New filters
            var dct = new JpegLibraryDctDecodeFilter();
            var jbig2 = new PdfboxJbig2DecodeFilter();
            var jpx = new OpenJpegJpxDecodeFilter();

            // Standard PdfPig filters
            var ascii85 = new Ascii85Filter();
            var asciiHex = new AsciiHexDecodeFilter();
            var ccitt = new CcittFaxDecodeFilter();
            var flate = new FlateFilter();
            var runLength = new RunLengthFilter();
            var lzw = new LzwFilter();

            return new Dictionary<string, IFilter>
            {
                { NameToken.Ascii85Decode.Data, ascii85 },
                { NameToken.Ascii85DecodeAbbreviation.Data, ascii85 },
                { NameToken.AsciiHexDecode.Data, asciiHex },
                { NameToken.AsciiHexDecodeAbbreviation.Data, asciiHex },
                { NameToken.CcittfaxDecode.Data, ccitt },
                { NameToken.CcittfaxDecodeAbbreviation.Data, ccitt },
                { NameToken.DctDecode.Data, dct },
                { NameToken.DctDecodeAbbreviation.Data, dct },
                { NameToken.FlateDecode.Data, flate },
                { NameToken.FlateDecodeAbbreviation.Data, flate },
                { NameToken.Jbig2Decode.Data, jbig2 },
                { NameToken.JpxDecode.Data, jpx },
                { NameToken.RunLengthDecode.Data, runLength },
                { NameToken.RunLengthDecodeAbbreviation.Data, runLength },
                { NameToken.LzwDecode.Data, lzw },
                { NameToken.LzwDecodeAbbreviation.Data, lzw }
            };
        }
    }
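The provider is then passed to the document through `ParsingOptions`:

    // Requires: using System.IO; using UglyToad.PdfPig;
    var parsingOption = new ParsingOptions()
    {
        UseLenientParsing = true,
        SkipMissingFonts = true,
        FilterProvider = MyFilterProvider.Instance
    };

    using (var doc = PdfDocument.Open("test.pdf", parsingOption))
    {
        int i = 0;
        foreach (var page in doc.GetPages())
        {
            foreach (var pdfImage in page.GetImages())
            {
                // TryGetPng returns false when the image cannot be converted.
                if (pdfImage.TryGetPng(out var bytes))
                {
                    // The output is PNG data, so save it with a .png extension.
                    File.WriteAllBytes($"image_{i++}.png", bytes);
                }
            }
        }
    }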