Using doclytics to update the "Content" of a document #97
Hi @hermanmak, thanks for bringing this to my attention. I wasn't aware that mixed-language documents are an issue for the paperless-ngx OCR engine. I did a bit of research, and it seems like llava could be a good candidate for that. I will have a look into it.
I'm wondering, does doclytics run OCR on the raw image, or does it infer based on the text layer of the images?
Hi, just discovered doclytics and gotta say I'm very interested. I'm pretty dissatisfied with paperless's OCR quality (not their fault), especially because I tend to take pictures with my phone instead of using a proper scanner.

I wanted to add here that llava models are pretty old now and there are much better models available. The most recent that come to mind are the Qwen2-VL family of models. They exist in both large and small (2B) sizes, seem SOTA for OCR/vision (including handwriting!) and are multilingual.

Here's an OCR example: https://simonwillison.net/2024/Sep/4/qwen2-vl/
Here's their page: https://github.com/QwenLM/Qwen2-VL

The only issue is that they are not yet supported by ollama, as you can see in this issue, because llama.cpp itself does not yet have it (issue).

Allow me to show an example using Qwen2 72B. My prompt: This resulted in the following markdown:

# Qwen2-VL 7B Instruct (free) 📦
| qwen/qwen-2-vl-7b-instruct:free 📦 |
|----------------------------------|
| Updated Aug 28 · 32,768 context · $0/M input tokens · $0/M output tokens · $0/K input imgs |
### Qwen2 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements:...
- Free 📦
### Model weights 📦
### Overview | Providers | Apps | Activity | Parameters | Uptime | API
### Providers for Qwen2-VL 7B Instruct (free)
OpenRouter [load-balances requests](https://docs.openrouter.ai/docs/developers/providers-and-pools) across providers
weighted by price unless you use [dynamic routing](https://docs.openrouter.ai/docs/developers/routing).
| **Hyperbolic** | Max Output | Input | Output | Latency | Throughput |
|----------------|------------|-------|--------|---------|------------|
| bf16 🚹 ⚙️ | 2,048 | $0 | $0 | 0.95s | 98.49t/s |

So yeah, this is much, much more readable than OCR, be it for humans or for LLMs, making it even easier to do an embedding search. If you want to give it a try, I suggest getting an openrouter API key.
Edit: I forgot to mention the minicpm-v model. It's super good at OCR even at 7B and seems to work in at least French and English.
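For anyone who wants to try the OpenRouter route suggested above, here is a minimal sketch of the request shape. The endpoint and OpenAI-style message format follow OpenRouter's documentation; the exact model slug is taken from the table above and should be treated as an assumption (it may have been renamed since).

```python
import base64

# Build an OpenAI-style chat payload for OpenRouter's
# /api/v1/chat/completions endpoint, embedding the scanned page
# as a base64 data URL in an "image_url" content part.
# The model slug is an assumption based on the table above.
def build_ocr_payload(image_bytes: bytes, prompt: str,
                      model: str = "qwen/qwen-2-vl-7b-instruct:free") -> dict:
    data_url = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    }

# Sending it is then a plain POST with your OpenRouter API key, e.g.:
#   requests.post("https://openrouter.ai/api/v1/chat/completions",
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 json=build_ocr_payload(page_bytes, "Transcribe this page as markdown."))
```

The HTTP call itself is left as a comment so the sketch stays dependency-free; any HTTP client works.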
Right now it uses the content field of the document from the doclytics API, so it uses the text extracted by the paperless OCR. I will evaluate the input from @thiswillbeyourgithub. Very interesting; I think this could improve the quality.
Hi, here are some recent tests using ollama and minicpm-v.

Here's a random image from the internet:
Here's the prompt I used:
Here's the output:
It even seems to work fine on vertical images:

**Title:**
Napoleon
---
**Subtitle:**
The Emperor Napoleon in His Study at the Tuileries, 1812
Emperor of the French
---
**Body Text:**
- **1st successor:** Louis XVI [Y][4] (born 6 August 1754 – died 20 April 1793)
- **2nd successor:** Louis XVIII[5] (born 20 March 1755 – died 18 June 1816)
First Consul of the French Republic, Second Consul
- **3rd successor:** Louis XVIII [Y][4]
- **4th successor:** First Consult of the French Republic
---
**Timeline:**
- Born: 20 March 1755 (born April–August), in Corsica
Aguja, Corsica, Kingdom of France
- Died:
- May 8, 1821 (aged [49])
Longwood Island,
Saint Helena[6] (Longwood House)
Carcavelos, Portugal
- Appointments:
- First Consul of the French Republic, Second Consul
---
**Family Information:**
Spouses:
Josephine de Beauharnais (m. [1796; deceased]; born in Aix-en-Provence on December 23)
Marie Louise de Bourbon[4]
Marriage to Marie Louise was dissolved after her death.
Issue:
Napoleon II
---
**Footer:**
More info... Napoleon

It can hallucinate things. For example, it says Napoleon was married to "Marie Louise de Bourbon" but the picture says "Marie Louise of Austria". Here's my takeaway:
What do you think? Any ETA on when we could test it on doclytics?
Btw, here's an update on qwen2 VL 3B support using llama.cpp. It should come to ollama rather quickly once it's merged.
I think Ollama also supports more streamlined Hugging Face GGUF model usage now. Basically, GGUF models on Hugging Face can come prepackaged with the model templates. Previously, the hard part was finding/writing the model templates before you could use the GGUF model.
-herman
Hello 👋 for interested folks who want to play around with LLM OCR and paperless:
+1, this is a no-brainer. Ollama already has the capability; it only needs to be called for each page of a document (converted to an image and base64-encoded, see https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion) and the results concatenated.
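That per-page flow can be sketched roughly as follows. The request body follows the ollama `/api/generate` doc linked above (base64 page images in the `images` field); the `minicpm-v` model name is an assumption, and rendering the PDF to page images (e.g. with `pdftoppm` or `pdf2image`) is left out of the sketch.

```python
import base64
from typing import Callable

# Build one ollama /api/generate request body per page image, per the API
# doc linked above: page images go in the "images" field as base64 strings.
# The "minicpm-v" model name is an assumption; use any vision model you
# have pulled locally.
def page_requests(pages: list[bytes], prompt: str, model: str = "minicpm-v") -> list[dict]:
    return [
        {
            "model": model,
            "prompt": prompt,
            "stream": False,
            "images": [base64.b64encode(page).decode("ascii")],
        }
        for page in pages
    ]

# Send each per-page request and concatenate the responses into one
# document text. `post` is injected so the HTTP layer (requests, httpx, ...)
# stays out of the sketch; it should POST to http://localhost:11434/api/generate
# and return the parsed JSON response, which carries the text in "response".
def ocr_document(pages: list[bytes], prompt: str, post: Callable[[dict], dict]) -> str:
    return "\n\n".join(post(body)["response"] for body in page_requests(pages, prompt))
```

The concatenated result could then be written back to the paperless document's content field.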
Paperless-ngx uses an OCR engine that is not particularly good with languages like Chinese and Korean, and it especially seems to perform badly when multiple languages are present in the same document.
Multiple languages in the same document are extremely common in Hong Kong.
Could doclytics be a bridge to apply LLMs to do the OCR instead of the built-in engine (or to overwrite it)?
For example, the new minicpm-v model is capable of OCR in multiple languages.
Thanks!