Error during extraction #25

Open
bennyzen opened this issue Nov 30, 2024 · 5 comments

@bennyzen

I'm using scribe.js to batch process a large number of PDFs. The error below occurs intermittently. It appears to be random and is NOT tied to specific PDFs: after restarting the batch process, the same PDF is processed just fine.

file:///home/node/node_modules/.pnpm/[email protected]/node_modules/scribe.js-ocr/js/generalWorkerMain.js:335
    if (gs.schedulerInner.workers.length > 0) {
                          ^

TypeError: Cannot read properties of null (reading 'workers')
    at gs.initTesseract (file:///home/node/node_modules/.pnpm/[email protected]/node_modules/scribe.js-ocr/js/generalWorkerMain.js:335:27)
    at async Promise.all (index 2)
    at async init (file:///home/node/node_modules/.pnpm/[email protected]/node_modules/scribe.js-ocr/scribe.js:77:3)

Node.js v22.11.0
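
Because the failure is intermittent and the same PDF succeeds on a second run, one possible stopgap (not a fix for the underlying cause) is to retry each file inside the batch loop instead of restarting the whole batch. The sketch below is a minimal illustration under assumptions: `withRetry`, `runBatch`, and `extractOne` are hypothetical names, `extractOne` stands in for whatever per-file call the batch makes, and the error is assumed to surface as a rejected promise rather than crashing the process.

```javascript
// Hypothetical helper: retry an async per-file task a few times.
// `extractOne` is supplied by the caller; nothing here is scribe.js API.
async function withRetry(extractOne, file, attempts = 3) {
  let lastError;
  for (let i = 1; i <= attempts; i += 1) {
    try {
      return await extractOne(file);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // every attempt failed
}

// Process files strictly one at a time and record the outcome per file,
// so a single bad run does not abort the rest of the batch.
async function runBatch(files, extractOne, attempts = 3) {
  const results = [];
  for (const file of files) {
    try {
      const text = await withRetry(extractOne, file, attempts);
      results.push({ file, ok: true, text });
    } catch (err) {
      results.push({ file, ok: false, error: String(err) });
    }
  }
  return results;
}
```

Processing files sequentially keeps only one extraction in flight at a time, which also makes it easier to tell which file was active when an error occurred.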
@Balearica
Contributor

Can you post a minimal version of the code you are using? This error appears to be caused by resources being cleared while recognition is still running, so it's not possible to answer without knowing the code that is being run.
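
To illustrate the kind of ordering problem described above, here is a small self-contained simulation. It does not use scribe.js: `FakeScheduler`, `unsafeOrder`, and `safeOrder` are hypothetical stand-ins, and the sketch only assumes that teardown nulls out state that in-flight work reads later.

```javascript
// Hypothetical stand-in for a worker pool; this is NOT scribe.js code.
class FakeScheduler {
  constructor() {
    this.inner = { workers: ['w0', 'w1'] };
  }

  async recognize(name) {
    // Yield to the event loop, as real asynchronous work would.
    await new Promise((resolve) => setTimeout(resolve, 10));
    // Throws a TypeError if terminate() ran while we were waiting.
    return `${name}: ${this.inner.workers.length} workers`;
  }

  async terminate() {
    this.inner = null; // clear resources
  }
}

// Teardown starts while recognition is still pending.
async function unsafeOrder() {
  const s = new FakeScheduler();
  const pending = s.recognize('a.pdf'); // started, not awaited
  await s.terminate();                  // clears state mid-flight
  return pending;                       // rejects when the work resumes
}

// Teardown runs only after recognition has settled.
async function safeOrder() {
  const s = new FakeScheduler();
  try {
    return await s.recognize('a.pdf');
  } finally {
    await s.terminate();
  }
}
```

In this simulation `unsafeOrder()` rejects with a `TypeError` from reading a property of `null`, while `safeOrder()` resolves normally. Whether the real error follows the same path depends on the calling code.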

@bennyzen
Author

bennyzen commented Dec 9, 2024

Yes, sure. It's happening even using your code from the nodejs example:

import scribe from 'scribe.js-ocr'

const text = await scribe.extractText(['./image-native.pdf'])
await scribe.terminate()
console.log(text)

I'm not absolutely sure, but in v0.5.0 it seems to happen more often than in previous versions. With certain files it always happens. Unfortunately, I cannot attach the culprit file, as it's confidential.

Strange thing: using the scribeocr frontend, it just works without any problem.

Things I've tried so far:

  • using different Node versions 20, 22, 23 - always fully rebuilding packages
  • using Bun runtime, which does not work at all
  • using NPM or PNPM as package manager

With the minimal example above I get this error instead:

Error: TypeError: Cannot read properties of undefined (reading '0')
    at Worker.<anonymous> (/home/ben/repos/scribejs/node_modules/@scribe.js/tesseract.js/src/createWorker.js:283:15)
    at Worker.emit (node:events:518:28)
    at MessagePort.<anonymous> (node:internal/worker:263:53)
    at [nodejs.internal.kHybridDispatch] (node:internal/event_target:826:20)
    at exports.emitMessage (node:internal/per_context/messageport:23:28)

And yes, I've tried the recommended init, import, ... route, but unfortunately it fails there too.

Any hints on what I could try next? Any clue why it works in the browser (scribeocr) but not in the terminal?

Thank you in advance.

@Balearica
Contributor

@bennyzen Thanks for clarifying. I was able to replicate the second error message (Cannot read properties of undefined (reading '0')). However, I believe this is a separate issue given where in the code it occurs, so I have opened a separate issue to track it: #26.

I will troubleshoot and fix that error, and update #26 accordingly. However, as noted above, I doubt this is directly related to the original error message, so let me know if you can reliably replicate the original error message with a minimal example.

@bennyzen
Author

Yes, you're right. While trying and re-trying, I caused confusion by mixing up different errors. I'll track the other error over in #26.

Thank you for your effort. I really appreciate it.

@Balearica
Contributor

@bennyzen #26 has been patched in v0.5.1, so if you update to the latest version, any errors should be specific to the original issue.
