- init
- extractText
- writeDebugImages
- clear
- terminate
- exportData
- download
- SortedInputFiles
- importFiles
- recognize
Initialize the program and optionally pre-load resources.

- `params` Object?
  - `params.pdf` boolean Load PDF renderer. (optional, default `false`)
  - `params.ocr` boolean Load OCR engine. (optional, default `false`)
  - `params.font` boolean Load built-in fonts. The PDF renderer and OCR engine are automatically loaded when needed. Therefore, the only reason to set `pdf` or `ocr` to `true` is to pre-load them. (optional, default `false`)
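
As a minimal sketch of the `init` parameters described above: the helper names `buildInitParams` and `preload` are hypothetical, and `scribe` is assumed to be the library object that exposes the documented `init` function. It is passed in as an argument so the snippet does not depend on any particular import path.

```javascript
// Hypothetical helper (not part of the library): builds the `params` object for init.
// Only the documented keys pdf, ocr, and font are set; each defaults to false.
function buildInitParams({ pdf = false, ocr = false, font = false } = {}) {
  return { pdf, ocr, font };
}

// `scribe` is assumed to be the library object exposing the documented init function.
async function preload(scribe, wanted) {
  await scribe.init(buildInitParams(wanted));
}
```

Because the renderer and OCR engine load on demand, a call such as `preload(scribe, { ocr: true })` should only matter when the start-up cost needs to be paid early, for example before the first user interaction.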
Function for extracting text from image and PDF files with a single function call.
By default, existing text content is extracted for text-native PDF files; otherwise text is extracted using OCR.
To control how text from PDF files is handled, set the options in the `opt.usePDFText` object.
For more control, use `init`, `importFiles`, `recognize`, and `exportData` separately.

- `files`
- `langs` Array<string> (optional, default `['eng']`)
- `outputFormat` (optional, default `'txt'`)
- `options` (optional, default `{}`)
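
The single-call usage of `extractText` might look like the sketch below. The wrapper name `extractPlainText` is hypothetical, `scribe` is assumed to be the library object, and the positional argument order is taken from the parameter list above.

```javascript
// Hypothetical wrapper: extracts plain text from the given files in one call.
// Arguments are passed in the documented order: files, langs, outputFormat, options.
async function extractPlainText(scribe, files, langs = ['eng']) {
  return scribe.extractText(files, langs, 'txt', {});
}
```

In Node.js, `files` would presumably be an array of file paths; in the browser, an array of `File` objects.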
Clears all document-specific data.
Terminates the program and releases resources.
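
One way to read the difference between `clear` and `terminate`: clear between documents, terminate once at the end. The sketch below assumes `scribe` is the library object and that a session can be reused across documents after `clear`; `processBatches` is a hypothetical name.

```javascript
// Hypothetical helper: runs several documents through one initialized session.
async function processBatches(scribe, batches) {
  const results = [];
  await scribe.init();
  try {
    for (const files of batches) {
      await scribe.importFiles(files);
      await scribe.recognize();
      results.push(await scribe.exportData('txt'));
      // Drop document-specific data before the next batch.
      // (await is harmless if clear happens to be synchronous.)
      await scribe.clear();
    }
  } finally {
    // Release resources even if one of the batches fails.
    await scribe.terminate();
  }
  return results;
}
```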
Export active OCR data to specified format.

- `format` ("pdf" | "hocr" | "docx" | "xlsx" | "txt" | "text") (optional, default `'txt'`)
- `minPage` number First page to export. (optional, default `0`)
- `maxPage` number Last page to export (inclusive). -1 exports through the last page. (optional, default `-1`)

Returns Promise<(string | ArrayBuffer)>
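
The `minPage`/`maxPage` semantics of `exportData` can be illustrated with a small stand-alone function. This is not the library's implementation; `resolvePageRange` is a hypothetical helper, and it assumes pages are zero-indexed (suggested by the default first page of `0`).

```javascript
// Hypothetical helper: lists the page indices an export would cover.
// maxPage is inclusive, and -1 means "through the last page".
function resolvePageRange(pageCount, minPage = 0, maxPage = -1) {
  const last = maxPage === -1 ? pageCount - 1 : maxPage;
  const pages = [];
  for (let i = minPage; i <= last; i += 1) pages.push(i);
  return pages;
}

console.log(resolvePageRange(5));       // [ 0, 1, 2, 3, 4 ]
console.log(resolvePageRange(5, 1, 2)); // [ 1, 2 ]
```

Under the same assumptions, a call such as `scribe.exportData('txt', 1, 2)` would export the second and third pages only.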
Runs `exportData` and saves the result as a download (browser) or local file (Node.js).

- `format` ("pdf" | "hocr" | "docx" | "xlsx" | "txt" | "text")
- `fileName` string
- `minPage` number First page to export. (optional, default `0`)
- `maxPage` number Last page to export (inclusive). -1 exports through the last page. (optional, default `-1`)
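
A hedged sketch of calling `download` with an up-front check of the format string. `downloadChecked` and `EXPORT_FORMATS` are hypothetical names, `scribe` is assumed to be the library object, and the list of formats is copied from the parameter description above.

```javascript
// Formats listed in the documentation for exportData and download.
const EXPORT_FORMATS = ['pdf', 'hocr', 'docx', 'xlsx', 'txt', 'text'];

// Hypothetical wrapper: rejects unknown formats before delegating to download.
async function downloadChecked(scribe, format, fileName, minPage = 0, maxPage = -1) {
  if (!EXPORT_FORMATS.includes(format)) {
    throw new TypeError(`Unsupported export format: ${format}`);
  }
  await scribe.download(format, fileName, minPage, maxPage);
}
```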
An object with this shape can be used to provide input to the `importFiles` function, without needing that function to figure out the file types.
This is required when using ArrayBuffer inputs.

Type: Object

- `pdfFiles` (Array<File> | Array<string> | Array<ArrayBuffer>)?
- `imageFiles` (Array<File> | Array<string> | Array<ArrayBuffer>)?
- `ocrFiles` (Array<File> | Array<string> | Array<ArrayBuffer>)?
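
To make the `SortedInputFiles` shape concrete, the sketch below groups named buffers into the three arrays. The helper `sortByExtension`, the `{ name, data }` entry shape, and the extension-to-category mapping are all assumptions made for illustration; the library's own sorting rules may differ.

```javascript
// Hypothetical helper: builds a SortedInputFiles-shaped object from named buffers.
// The extension lists below are illustrative assumptions, not the library's rules.
function sortByExtension(entries) {
  const sorted = { pdfFiles: [], imageFiles: [], ocrFiles: [] };
  for (const { name, data } of entries) {
    const ext = name.split('.').pop().toLowerCase();
    if (ext === 'pdf') sorted.pdfFiles.push(data);
    else if (['png', 'jpg', 'jpeg'].includes(ext)) sorted.imageFiles.push(data);
    else if (['hocr', 'xml'].includes(ext)) sorted.ocrFiles.push(data);
    else throw new Error(`Unsupported extension: ${ext}`);
  }
  return sorted;
}
```

The resulting object can then be handed to `importFiles` in place of a plain array.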
Import files for processing.
An object with `pdfFiles`, `imageFiles`, and `ocrFiles` arrays can be provided to import multiple types of files.
Alternatively, for `File` objects (browser) and file paths (Node.js), a single array can be provided, which is sorted based on extension.

- `files` (Array<File> | FileList | Array<string> | SortedInputFiles)
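
Since a plain array is sorted by extension and ArrayBuffers carry no file name, a guard like the following can catch the unsupported combination early. `needsSortedInput` and `importChecked` are hypothetical names, and `scribe` is assumed to be the library object.

```javascript
// Hypothetical check: a plain array containing ArrayBuffers cannot be sorted by extension.
function needsSortedInput(files) {
  return Array.isArray(files) && files.some((f) => f instanceof ArrayBuffer);
}

// Hypothetical wrapper around the documented importFiles function.
async function importChecked(scribe, files) {
  if (needsSortedInput(files)) {
    throw new TypeError('ArrayBuffer inputs must be wrapped in a SortedInputFiles object');
  }
  await scribe.importFiles(files);
}
```

For example, `importChecked(scribe, ['scan.pdf'])` passes the array through, while `importChecked(scribe, { pdfFiles: [buffer] })` is the form to use for in-memory data.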
Recognize all pages in the active document.
Files for recognition should already be imported using `importFiles` before calling this function.
The results of recognition can be exported by calling `exportData` after this function.

- `options` Object (optional, default `{}`)
  - `options.mode` ("speed" | "quality") Recognition mode. (optional, default `'quality'`)
  - `options.langs` Array<string> Language(s) in document. (optional, default `['eng']`)
  - `options.modeAdv` ("lstm" | "legacy" | "combined") Alternative method of setting recognition mode. (optional, default `'combined'`)
  - `options.combineMode` ("conf" | "data" | "none") Method of combining OCR results. Used if OCR data already exists. (optional, default `'data'`)
  - `options.vanillaMode` boolean Whether to use the vanilla Tesseract.js model. (optional, default `false`)
  - `options.config` Object<string, string> Config params to pass to Tesseract.js. (optional, default `{}`)
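
The enumerated `recognize` options can be checked before the call, as in the sketch below. `RECOGNIZE_ENUMS`, `checkRecognizeOptions`, and `recognizeFast` are hypothetical names, and `scribe` is assumed to be the library object. Keys that are not supplied are left out so the library's own defaults apply.

```javascript
// Allowed values copied from the option descriptions above.
const RECOGNIZE_ENUMS = {
  mode: ['speed', 'quality'],
  modeAdv: ['lstm', 'legacy', 'combined'],
  combineMode: ['conf', 'data', 'none'],
};

// Hypothetical validator: throws on an unknown enumerated value, otherwise returns options unchanged.
function checkRecognizeOptions(options = {}) {
  for (const [key, allowed] of Object.entries(RECOGNIZE_ENUMS)) {
    if (key in options && !allowed.includes(options[key])) {
      throw new RangeError(`Invalid value for options.${key}: ${options[key]}`);
    }
  }
  return options;
}

// Hypothetical usage: speed-oriented recognition of an English document.
async function recognizeFast(scribe) {
  await scribe.recognize(checkRecognizeOptions({ mode: 'speed', langs: ['eng'] }));
}
```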