API.md

init

Initialize the program and optionally pre-load resources.

Parameters

  • params Object?

    • params.pdf boolean Load PDF renderer. (optional, default false)
    • params.ocr boolean Load OCR engine. (optional, default false)
    • params.font boolean Load built-in fonts. (optional, default false)

The PDF renderer and OCR engine are loaded automatically when needed, so the only reason to set pdf or ocr to true is to pre-load them.
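
As a rough illustration of pre-loading, the sketch below builds the params object and passes it to init. The helper names `buildInitParams` and `warmUp` are hypothetical, and the `scribe` object is assumed to be the library's default export, passed in as an argument so the snippet stays self-contained. Awaiting init is also an assumption; it is harmless if the call returns synchronously.

```javascript
// Assumption: `scribe` is the library object (for example the default export
// of the package), injected as an argument rather than imported here.

/**
 * Build the `params` object for `init` from a list of resources to pre-load.
 * Anything not listed keeps its documented default of `false`.
 */
function buildInitParams(preload = []) {
  const params = { pdf: false, ocr: false, font: false };
  for (const name of preload) {
    if (!Object.hasOwn(params, name)) {
      throw new Error(`Unknown resource: ${name}`);
    }
    params[name] = true;
  }
  return params;
}

/** Pre-load the OCR engine and built-in fonts before the first document arrives. */
async function warmUp(scribe) {
  await scribe.init(buildInitParams(['ocr', 'font']));
}
```

Calling `warmUp(scribe)` at application start-up moves the loading cost away from the first recognition request; omitting it should only delay loading until the resources are first needed.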

extractText

Extracts text from image and PDF files in a single function call. By default, existing text content is extracted from text-native PDF files; otherwise, text is extracted using OCR. To control how text from PDF files is handled, set the options in the opt.usePDFText object. For finer control, call init, importFiles, recognize, and exportData separately.

Parameters

  • files
  • langs Array<string> (optional, default ['eng'])
  • outputFormat (optional, default 'txt')
  • options (optional, default {})
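
A minimal sketch of a single-call extraction follows. It assumes that `files` is an array (of file paths in Node.js or File objects in the browser, as with importFiles) and that the call can be awaited; neither is stated in the parameter list above. The wrapper name `extractPlainText` is hypothetical, and `scribe` is injected so the snippet runs without the package installed.

```javascript
// Assumption: `scribe` is the library object, passed in by the caller.

/**
 * Extract plain text from a batch of files, spelling out the documented
 * defaults so the call site shows every positional argument.
 */
async function extractPlainText(scribe, files, langs = ['eng']) {
  if (!Array.isArray(files) || files.length === 0) {
    throw new TypeError('files must be a non-empty array');
  }
  // Arguments: files, langs, outputFormat, options
  return scribe.extractText(files, langs, 'txt', {});
}
```

A call such as `await extractPlainText(scribe, ['scan.pdf'])` would then return whatever extractText produces for the 'txt' output format.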

writeDebugImages

Parameters

clear

Clears all document-specific data.

terminate

Terminates the program and releases resources.

exportData

Export the active OCR data to the specified format.

Parameters

  • format ("pdf" | "hocr" | "docx" | "xlsx" | "txt" | "text") (optional, default 'txt')
  • minPage number First page to export. (optional, default 0)
  • maxPage number Last page to export (inclusive). -1 exports through the last page. (optional, default -1)

Returns Promise<(string | ArrayBuffer)>
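
The minPage and maxPage semantics can be made concrete with a small helper. This is a hypothetical illustration, not part of the library: it assumes page indices are zero-based (which the default minPage of 0 suggests) and shows how -1 would resolve to the last page. How the library itself validates out-of-range values is not documented here.

```javascript
/**
 * Resolve minPage/maxPage into a concrete inclusive range.
 * Assumption: pages are zero-indexed; -1 for maxPage means "through the last page".
 */
function resolvePageRange(pageCount, minPage = 0, maxPage = -1) {
  const last = maxPage === -1 ? pageCount - 1 : maxPage;
  if (minPage < 0 || last >= pageCount || minPage > last) {
    throw new RangeError(
      `Invalid page range ${minPage}..${maxPage} for ${pageCount} page(s)`,
    );
  }
  return { first: minPage, last, count: last - minPage + 1 };
}

// With the defaults, a 5-page document resolves to pages 0 through 4.
const wholeDocument = resolvePageRange(5);
```

Under these assumptions, `exportData('txt', 1, 2)` would cover two pages, because maxPage is inclusive.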

download

Runs exportData and saves the result as a download (browser) or local file (Node.js).

Parameters

  • format ("pdf" | "hocr" | "docx" | "xlsx" | "txt" | "text")
  • fileName string
  • minPage number First page to export. (optional, default 0)
  • maxPage number Last page to export (inclusive). -1 exports through the last page. (optional, default -1)

SortedInputFiles

An object with this shape can be used to provide input to the importFiles function, without needing that function to figure out the file types. This is required when using ArrayBuffer inputs.

Type: Object

Properties

importFiles

Import files for processing. An object with pdfFiles, imageFiles, and ocrFiles arrays can be provided to import multiple types of files. Alternatively, for File objects (browser) and file paths (Node.js), a single array can be provided, which is sorted based on extension.
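
To show what the object form looks like, the sketch below sorts file paths into pdfFiles, imageFiles, and ocrFiles by extension. The function name `sortInputFiles` and the extension lists are assumptions made for illustration; the library's own sorting rules may recognize different extensions.

```javascript
// Hypothetical extension lists; the library's own rules may differ.
const PDF_EXTS = new Set(['pdf']);
const IMAGE_EXTS = new Set(['png', 'jpg', 'jpeg']);
const OCR_EXTS = new Set(['hocr', 'xml']);

/** Sort file paths into the three arrays that importFiles accepts. */
function sortInputFiles(paths) {
  const sorted = { pdfFiles: [], imageFiles: [], ocrFiles: [] };
  for (const path of paths) {
    const dot = path.lastIndexOf('.');
    const ext = dot === -1 ? '' : path.slice(dot + 1).toLowerCase();
    if (PDF_EXTS.has(ext)) sorted.pdfFiles.push(path);
    else if (IMAGE_EXTS.has(ext)) sorted.imageFiles.push(path);
    else if (OCR_EXTS.has(ext)) sorted.ocrFiles.push(path);
    else throw new Error(`Unrecognized file type: ${path}`);
  }
  return sorted;
}
```

The result could then be passed along as `importFiles(sortInputFiles(paths))`. For ArrayBuffer inputs there is no extension to inspect, which is why the caller has to build this object by hand.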

Parameters

recognize

Recognize all pages in the active document. Files must already be imported with importFiles before calling this function. The recognition results can then be exported by calling exportData.

Parameters

  • options Object (optional, default {})

    • options.mode ("speed" | "quality") Recognition mode. (optional, default 'quality')
    • options.langs Array<string> Language(s) in document. (optional, default ['eng'])
    • options.modeAdv ("lstm" | "legacy" | "combined") Alternative method of setting recognition mode. (optional, default 'combined')
    • options.combineMode ("conf" | "data" | "none") Method of combining OCR results. Used if OCR data already exists. (optional, default 'data')
    • options.vanillaMode boolean Whether to use the vanilla Tesseract.js model. (optional, default false)
    • options.config Object<string, string> Config params to pass to Tesseract.js. (optional, default {})
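
Putting the separate calls together, a step-by-step workflow might look like the sketch below. It assumes every call can be awaited (only exportData is documented as returning a Promise) and that `scribe` is the library object supplied by the caller. The function name `ocrToText` is hypothetical.

```javascript
// Assumption: `scribe` is the library object, passed in by the caller.

/** Run init -> importFiles -> recognize -> exportData, then release resources. */
async function ocrToText(scribe, files, langs = ['eng']) {
  await scribe.init({ ocr: true }); // optional pre-load of the OCR engine
  try {
    await scribe.importFiles(files);
    await scribe.recognize({ mode: 'quality', langs });
    return await scribe.exportData('txt');
  } finally {
    // Always release resources, even if import or recognition fails.
    await scribe.terminate();
  }
}
```

When several documents are processed in one session, calling clear between documents and terminate once at the end may be the better fit, since terminate shuts the program down.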