metadata #6
Comments
Can you provide more detail regarding your use-case? It is probably possible to provide more control over PDF text extraction, but I'm not sure what you mean by custom processing for each page. There is no pre-existing generic JavaScript interface for the PDF reader--the PDF reader build is specific to this project, so adding new features would require making changes.
If "whether the page is an image" refers to our categorization of "text native" and "image native" PDFs, this is not a metadata field, but rather something that is determined after reading the text content of the document. These categories are more nuanced than simply looking for whether images or text exist. For example, PDFs that contain no images may be categorized as "image native" if the document contains no valid encoding to map between glyphs and characters (so text cannot be extracted directly), which is surprisingly common.
Thanks. The use case is really the ability to 1) know, prior to processing, how many pages will be processed, and 2) process pages one by one, something like for(page in pages) { scribe(page) }. I would like to be able to time the processing per page (e.g. for logging), and I want to know whether a page is an image, since images take a lot more time to process. So that's what I'm aiming for: essentially the ability to parse page by page in a controlled way. Thanks again.
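To make the request concrete, here is a rough sketch of the per-page loop with timing. It assumes the page count is known up front and that some per-page call exists; `processPage`, `timePages`, and the `log` parameter are hypothetical names used only for illustration and are not part of the library's current API.

```javascript
// Hypothetical sketch: `processPage` stands in for whatever per-page
// call the library might expose. It is NOT an existing API.
async function timePages(pageCount, processPage, log = console.log) {
  const timings = [];
  for (let i = 0; i < pageCount; i++) {
    const start = performance.now();
    // Pages are processed strictly one at a time so each timing is isolated.
    const result = await processPage(i);
    const ms = performance.now() - start;
    timings.push({ page: i, ms, result });
    log(`page ${i + 1}/${pageCount}: ${ms.toFixed(1)} ms`);
  }
  return timings;
}
```

Awaiting each page before starting the next keeps the timings meaningful, at the cost of giving up any parallelism across pages.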
Hi, great library!
Is there a way to configure PDF reading behavior? I'd like to be able to get metadata on a file (for instance, the number of chapters and whether a page is an image or not) prior to processing it, along with any other metadata available for a PDF.
More generally, I'd like to enumerate the pages and go through them one at a time, with custom processing for each. Is there maybe an interface to your internal PDF reader that could be exposed?
Thanks!