Improve parsed-price accuracy #26

Open
EtienneLamoureux opened this issue Dec 1, 2023 · 0 comments
Labels
help wanted (extra attention is needed), refactor (change or improvement to an existing functionality)

Comments

@EtienneLamoureux
Owner

EtienneLamoureux commented Dec 1, 2023

Situation

Prices are prefixed with the ¤ symbol. This symbol is not in Tesseract's English training set, so it is read as an arbitrary character. When it is read as a digit, the parsed price gains an extra leading digit and is inflated, e.g. ¤900 becomes 2900.

Tasks

  1. Experiment with heuristics to mitigate the issue
    1. Thousands are always separated by a comma (,), and each group of digits is at most 3 digits long
    2. Only 1 digit is present before the comma (,) when the price is listed in kilo units (K)
    3. Others
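
As a starting point for the experiment, the two heuristics above could be sketched roughly as follows. This is only an illustration, not code from the project: the function name `strip_misread_prefix` and the regular expression are hypothetical, and the sketch assumes the misread ¤ contributes exactly one leading character and that a K price looks like `1,234K`.

```python
import re

# Hypothetical pattern: 1 to 3 leading digits, then comma-separated groups of exactly 3 digits.
_GROUPED = re.compile(r"^\d{1,3}(,\d{3})*$")


def strip_misread_prefix(raw: str) -> str:
    """Return an OCR'd price with a suspected misread currency symbol removed."""
    text = raw.strip()
    # If the symbol was read as a non-digit, dropping leading non-digits is enough.
    text = re.sub(r"^\D+", "", text)

    kilo = text.upper().endswith("K")
    body = text[:-1] if kilo else text
    head, sep, _ = body.partition(",")

    if kilo and sep and len(head) == 2:
        # Heuristic 2 (assumed reading): a K price has a single digit before
        # the comma, so a two-digit head carries one spurious leading digit.
        body = body[1:]
    elif not _GROUPED.match(body) and _GROUPED.match(body[1:]):
        # Heuristic 1: the grouping is only valid once the first digit is dropped.
        body = body[1:]

    return body + ("K" if kilo else "")


print(strip_misread_prefix("2900"))      # 900
print(strip_misread_prefix("7123,456"))  # 123,456
print(strip_misread_prefix("21,234K"))   # 1,234K
```

A reading such as `21,200` still satisfies the grouping rule, so heuristic 1 alone cannot tell whether the leading `2` is spurious. Cases like that would need one of the "other" heuristics.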

Results

  1. The ¤ character no longer inflates parsed prices

EtienneLamoureux added the help wanted and refactor labels Dec 1, 2023