setup guide for nextjs? #7

Open
Elon-Mask12 opened this issue Sep 19, 2024 · 7 comments

Elon-Mask12 commented Sep 19, 2024

Hey, has anyone managed to get it working in the Next.js pages dir?
Here's what I've tried. I imported it as:

// @ts-ignore
import scribe from "scribe.js-ocr/scribe.js";
//other code
await scribe.init({ ocr: true, font: true });
scribe.extractText(files).then((res: any) => console.log(res));

I see an error saying:

Import trace for requested module:
./node_modules/.pnpm/[email protected][email protected]/node_modules/scribe.js-ocr/scribe.js
./src/components/upload-file-modal.tsx
./src/pages/f/index.tsx
 ○ Compiling /not-found ...
 ⨯ ./node_modules/.pnpm/[email protected][email protected]/node_modules/scribe.js-ocr/js/containers/fontContainer.js:14:26
  Module not found: Can't resolve 'module'

I updated my next.config to pass custom webpack configs similar to the one in the example app, but it doesn't work.
I also tried passing it in, and tried the Next.js serverComponentsExternalPackages option in next.config, which doesn't work either.

  experimental: {
    serverComponentsExternalPackages: [
      "scribe.js-ocr",
      "scribe.js-ocr/scribe.js",
    ],
  },

Tesseract seems to have a createWorker method that is used on the client side. Would it be possible to expose something similar?

Any help is appreciated, thanks.

@Balearica
Contributor

I generated a repo using npx create-next-app@latest, added some basic scribe.js code, and updated the webpack configuration, and everything seemed to run as expected. The repo is here.
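
For readers hitting the same "Can't resolve 'module'" error, here is a minimal sketch of one common kind of webpack adjustment: stubbing Node-only modules in the browser bundle through webpack's `resolve.fallback`. It is not necessarily what the linked repo does. The helper name `withBrowserFallbacks`, the structural types, and the `fs` and `path` entries are assumptions for illustration; only `module` comes from the error message in this thread.

```typescript
// Minimal structural types so the sketch type-checks without installing
// webpack or next. They are assumptions, not the real typings.
interface WebpackLikeConfig {
  resolve?: { fallback?: Record<string, false | string> };
  [key: string]: unknown;
}

interface BuildContext {
  isServer: boolean;
}

// "module" is the name from the error in this thread; "fs" and "path" are
// guesses that may not be needed in a given project.
const NODE_ONLY_MODULES = ["module", "fs", "path"];

// Hypothetical helper: tells webpack to resolve Node-only modules to an
// empty stub, but only for the client build.
function withBrowserFallbacks(
  config: WebpackLikeConfig,
  ctx: BuildContext,
): WebpackLikeConfig {
  if (ctx.isServer) return config; // leave the server build untouched

  const fallback: Record<string, false | string> = {
    ...(config.resolve?.fallback ?? {}),
  };
  for (const name of NODE_ONLY_MODULES) {
    fallback[name] = false; // `false` means "resolve to an empty module"
  }
  config.resolve = { ...(config.resolve ?? {}), fallback };
  return config;
}
```

In a Next.js project this would be called from the `webpack` function in next.config, which receives the config object and a context containing `isServer`, and must return the config.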

@Elon-Mask12
Author

Elon-Mask12 commented Sep 21, 2024

holy fuck, that works. also thanks a fuckton for building this library, and even replying to my query. <3

I got it working in my Next app, but I have a few doubts.
I'm writing a function that takes a PDF as a File and returns the OCR-ed version of the PDF as a File. So this will be the code, right?

await scribe.init({ ocr: true, font: true });
scribe.opt.displayMode = "invis";

await scribe.importFiles(
  files,
);
await scribe.recognize({
  mode: "quality",
  langs: ["eng"],
  modeAdv: "combined",
  vanillaMode: true,
  combineMode: "data",
});
const data = await scribe.exportData("pdf");

const blob = new Blob([data], { type: "application/pdf" });
const file = new File([blob], "test.pdf", {
  type: "application/pdf",
});
return [file];

Or am I doing something wrong? The docs seem a bit hard to understand.
This seems to take around 33 seconds in my Next app, but only around 6-7 seconds on the scribeocr.com website.

The options I passed in for each call are just what worked for me when I was playing around with https://scribeocr.com . Am I doing something wrong?
Thanks :)

@Balearica
Contributor

I'm writing a function that takes a PDF as a File and returns the OCR-ed version of the PDF as a File. So this will be the code, right?

Yes, this code looks fine at a glance. Several of the arguments specified won't do anything (e.g. mode is ignored when modeAdv is specified); however, that will not cause issues.
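
As an illustration, the same steps can be wrapped in a function that receives the scribe object as a parameter, with `mode` dropped because it is ignored when `modeAdv` is set. This is a sketch under assumptions: the `ScribeLike` interface only mirrors the calls shown in this thread and is not the library's published typings, `ocrToPdfData` is a hypothetical name, and the other options are kept unchanged because the reply does not say which of them are no-ops.

```typescript
// Shape of the calls used in this thread. An assumption for illustration,
// not the library's published typings.
interface ScribeLike {
  opt: { displayMode: string };
  init(options: { ocr: boolean; font: boolean }): Promise<unknown>;
  importFiles(files: unknown[]): Promise<unknown>;
  recognize(options: {
    langs: string[];
    modeAdv: string;
    vanillaMode: boolean;
    combineMode: string;
  }): Promise<unknown>;
  exportData(format: string): Promise<unknown>;
}

// Hypothetical wrapper: runs init -> import -> recognize -> export in order
// and returns whatever exportData("pdf") resolves to.
async function ocrToPdfData(
  scribe: ScribeLike,
  files: unknown[],
): Promise<unknown> {
  await scribe.init({ ocr: true, font: true });
  scribe.opt.displayMode = "invis";
  await scribe.importFiles(files);
  // `mode` is omitted: it is ignored when `modeAdv` is specified.
  await scribe.recognize({
    langs: ["eng"],
    modeAdv: "combined",
    vanillaMode: true,
    combineMode: "data",
  });
  return scribe.exportData("pdf");
}
```

The returned data can then be wrapped in a Blob and a File exactly as in the original snippet. Passing the scribe object in also makes the call order easy to check with a stub.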

This seems to take around 33 seconds in my Next app, but only around 6-7 seconds on the scribeocr.com website.

The scribeocr.com website uses scribe.js directly as a submodule, so there should not be any difference in performance for the same document. Additionally, the runtime observed in the example Next.js repo I linked to above seemed to be roughly comparable to the scribeocr.com site, so I don't think using Next.js should change things. Therefore, if you are noticing large disparities, the troubleshooting steps I would take are below.

  1. Make sure that devtools are closed when running both your site and scribeocr.com, to allow for an apples-to-apples comparison.
    1. Running with devtools open, or using test/debugging tools that use devtools under the hood (e.g. Cypress), causes a significant performance hit.
  2. Confirm the versions of scribe.js being compared are the same.
    1. Both scribe.js and scribeocr.com are in alpha and are constantly being updated, so different versions of scribe.js may have been used between the sites.
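
To put numbers on such a comparison, a small timing wrapper around the OCR call can help. The sketch below is generic and does not use any scribe.js API; `timeAsync` and `formatSeconds` are hypothetical helper names.

```typescript
// Runs an async task and reports how long it took in milliseconds.
async function timeAsync<T>(
  task: () => Promise<T>,
): Promise<{ result: T; ms: number }> {
  const start = Date.now();
  const result = await task();
  return { result, ms: Date.now() - start };
}

// Formats a millisecond duration as seconds with one decimal place.
function formatSeconds(ms: number): string {
  return (ms / 1000).toFixed(1) + " s";
}
```

Wrapping the whole import, recognize, and export sequence in `timeAsync`, with devtools closed and a production build, gives a figure that can be compared against a stopwatch timing of the same document on scribeocr.com.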

@zeus-12

zeus-12 commented Sep 24, 2024

I used the Next build version and tested the same PDF on both, and scribeocr.com is significantly faster.
I uploaded a 34-page file with plenty of images; it took around 40s in scribeocr and around 3 minutes in my app. I'm clearly doing something wrong. Also, is there any way I could access the progress (the percentage of completion)? I'm kind of debating now whether OCR in the browser is a good idea.

thanks.

@Balearica
Contributor

I uploaded a 34-page file with plenty of images; it took around 40s in scribeocr and around 3 minutes in my app.

I checked that the Next.js demo I linked has runtime similar to scribeocr.com, and it looks like it does, so I think any disparity is caused by something specific to your project.

Also, is there any way I could access the progress (the percentage of completion)?

The scribeocr.com website is simply one user of the scribe.js package--everything it does can be implemented by other users. While not all of the features it uses are fully documented, you can check that codebase to see how specific features work. In this case the relevant code for incrementing a progress bar after each page is here.
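
As a rough illustration of the per-page idea, a page-based progress counter can be sketched independently of the library. How scribe.js signals that a page has finished is not shown in this thread, so the code below only covers the counting side; `createPageProgress` and `pageDone` are hypothetical names, and wiring `pageDone` to the library's actual per-page signal is left open.

```typescript
type ProgressListener = (percent: number, done: number, total: number) => void;

// Hypothetical helper: counts completed pages and reports a rounded
// percentage to the listener each time a page finishes.
function createPageProgress(totalPages: number, onUpdate: ProgressListener) {
  if (totalPages <= 0 || Math.floor(totalPages) !== totalPages) {
    throw new RangeError("totalPages must be a positive integer");
  }
  let done = 0;
  return {
    // Call once per finished page; extra calls are clamped at 100%.
    pageDone(): number {
      if (done < totalPages) done += 1;
      const percent = Math.round((done / totalPages) * 100);
      onUpdate(percent, done, totalPages);
      return percent;
    },
  };
}
```

For the 34-page document mentioned above, the listener would receive 50 after the 17th page and 100 after the last one.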

@zeus-12

zeus-12 commented Sep 26, 2024

You're right, that improved a lot in a prod setting, and I'm getting very close results now. Thanks for building this <3

@gcphost

gcphost commented Oct 21, 2024

Hi there! I'm trying this with the Next.js app dir. When I set the Next config to define process as undefined, I am no longer able to read environment variables, for example from a middleware.

TypeError: Cannot set properties of undefined (setting 'env')

Is there any other solution?
