How to turn off OCR (useful if you only want metadata extraction)


Jan Schlautmann edited this page Sep 25, 2024 · 1 revision

Problem

Even when parser.from_file(x, service = 'meta') is used, Tika still extracts the content. For files that need OCR, this can take a long time.

Solution

The Tika documentation describes several ways to turn off OCR. Since tika-python uses a Tika Server, the last of these, a request header, can be used:

parser.from_file(x, service = 'meta', headers = {"X-Tika-OCRskipOcr": 'true'})

This also works with service = 'all'. In that case, any content that can be extracted without OCR is still returned.
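The call above can be wrapped in a small helper so the header is not retyped at every call site. This is only a sketch: the function names skip_ocr_headers and parse_without_ocr and the file name scan.pdf are made up for illustration, and it assumes tika-python is installed and a Tika Server is reachable.

```python
def skip_ocr_headers(extra=None):
    """Build request headers that ask the Tika Server to skip OCR.

    `extra` is an optional dict of additional headers (hypothetical
    convenience parameter, not part of tika-python).
    """
    headers = dict(extra or {})
    # Set last so a caller-supplied value cannot re-enable OCR by accident.
    headers["X-Tika-OCRskipOcr"] = "true"
    return headers


def parse_without_ocr(path, service="meta"):
    """Parse a file with tika-python while skipping OCR.

    Use service="all" to also get any content available without OCR.
    """
    # Imported lazily so skip_ocr_headers() works without tika installed.
    from tika import parser

    return parser.from_file(path, service=service, headers=skip_ocr_headers())


# Example usage (needs a running Tika Server; "scan.pdf" is a placeholder):
# parsed = parse_without_ocr("scan.pdf")
# print(parsed["metadata"])
```

Keeping the header in one place makes it easy to switch between service = 'meta' and service = 'all' without touching the OCR setting.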