Currently we use a BERT model (more precisely, bert-base-chinese) to vectorize OCR text, then use cosine distance for indexing and searching.
However, this method seems to perform poorly on partial keywords and semantically similar sentences.
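For reference, the indexing step boils down to cosine similarity between embedding vectors. A minimal sketch (toy vectors stand in for the bert-base-chinese embeddings; the actual embedding code lives in the linked transformers_service.py):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim vectors; in the real service these would be BERT sentence embeddings.
query_vec = np.array([0.2, 0.9, 0.1])
doc_vec = np.array([0.3, 0.8, 0.2])
print(cosine_similarity(query_vec, doc_vec))  # close to 1 for similar vectors
```

The weak point is not the cosine step itself but the embeddings: a vanilla bert-base-chinese vector is not trained so that similar sentences land close together, which matches the behavior described above.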
For instance,
FYI, the OCR text of the image: 1. please 2. 你最 3. 叔 4. 什么情况兄弟 5. 爱 6. 爱 7. 害怕 8. 乳 9. 嘿
Only when I provide more detailed text does the server return more accurate results:
Any solution to improve the OCR text matching?
Related code
https://github.com/hv0905/NekoImageGallery/blob/master/app/Services/transformers_service.py#L59
Related documentation
https://huggingface.co/tasks/sentence-similarity
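The sentence-similarity docs linked above describe models trained specifically so that similar sentences embed close together; these typically mean-pool the token embeddings (masking out padding) rather than taking a raw CLS vector. A minimal sketch of that pooling step with NumPy and toy data (the shapes mirror what a transformer would output, but the numbers are made up):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the sequence axis, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim), e.g. a model's last hidden states.
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[..., None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# Toy batch: one sentence, three tokens (the last is padding), dim 2.
emb = np.array([[[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(emb, mask))  # -> [[2. 1.]], mean of the two real tokens
```

This is only the pooling piece; whether it helps here depends on swapping in a model actually trained for similarity (see the task page above), which is the question this issue is asking.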