generated from CDCgov/template
Implement tesseract backend #375
Merged
14 commits
e641bd8 Initial tesserocr (jonchang)
1e3827f drop pytesseract (jonchang)
6a926dc Use actual raw API backend for confidence score (jonchang)
11c48ca ensure PIL image is passed (jonchang)
93ce87c Guess at tessdata path (jonchang)
6586779 Install tesseract as part of docker setup (jonchang)
6aefc3e documentation (jonchang)
c60af35 lint check (jonchang)
0cb3182 Use tesserocr api instead of pathlib shenanigans (jonchang)
b28dd12 Update docstring (jonchang)
ea85a6d Fix path detection crash (jonchang)
8fcf3f5 Strip tesseract output (jonchang)
a31b4eb Update tests for tesseract comparisons (jonchang)
a9c815b Update CI runs (jonchang)
@@ -0,0 +1,63 @@
import os

import tesserocr
import numpy as np
from PIL import Image


class TesseractOCR:
    @staticmethod
    def _guess_tessdata_path(wanted_lang="eng") -> bytes:
        """
        Attempts to guess potential locations for the `tessdata` folder.

        The `tessdata` folder is needed to use pre-trained Tesseract OCR data, though the automatic detection
        provided in `tesserocr` may not be reliable. Instead, iterate over common paths on various systems (e.g.,
        Red Hat, Ubuntu, macOS) and check for the presence of a `tessdata` folder.

        If `TESSDATA_PREFIX` is available in the environment, this function will check that location first.
        If none of the guessed locations exist, fall back to automatic detection provided by `tesserocr` and
        the tesseract API.

        `wanted_lang` (str): a desired language to search for. Defaults to English `eng`.
        """
        candidate_paths = [
            "/usr/local/share/tesseract/tessdata",
            "/usr/share/tesseract/tessdata",
            "/usr/share/tesseract-ocr/4.00/tessdata",
            "/opt/homebrew/share/tessdata",
            "/opt/local/share/tessdata",
        ]

        # Prepend env variable if defined (list.insert takes the index first, then the value)
        if "TESSDATA_PREFIX" in os.environ:
            candidate_paths.insert(0, os.environ["TESSDATA_PREFIX"])

        # Test candidate paths
        for path in candidate_paths:
            # When compiled for certain systems (macOS), libtesseract aborts due to an untrapped exception if it
            # cannot access the path for any reason (e.g., does not exist, lacks read permissions). Attempt to
            # enumerate the directory and, if it fails, skip this path.
            try:
                os.listdir(path)
            except OSError:
                continue

            retpath, langs = tesserocr.get_languages(path)
            if wanted_lang in langs:
                return retpath

        # Nothing matched, just return the default path
        return tesserocr.get_languages()[0]

    def image_to_text(self, segments: dict[str, np.ndarray]) -> dict[str, tuple[str, float]]:
        digitized: dict[str, tuple[str, float]] = {}
        with tesserocr.PyTessBaseAPI(path=self._guess_tessdata_path()) as api:
            for label, image in segments.items():
                if image is None:
                    continue

                api.SetImage(Image.fromarray(image))
                digitized[label] = (api.GetUTF8Text().strip(), api.MeanTextConf())

        return digitized

Review comment on `image_to_text`: TODO: init class and invoke fxn in api call.
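The search order described in the docstring (environment variable first, then common system paths, then a fallback) can be sketched without `tesserocr` installed. This is an illustration only, not the PR's implementation: `find_tessdata_dir` is a hypothetical helper, and it assumes language data lives as `<lang>.traineddata` files directly inside the `tessdata` directory, which is the usual Tesseract layout but is not confirmed by this PR.

```python
import os
from typing import Mapping, Optional, Sequence


def find_tessdata_dir(
    candidates: Sequence[str],
    wanted_lang: str = "eng",
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first directory that holds `<wanted_lang>.traineddata`, else None.

    Hypothetical helper for illustration; the file-name convention is an assumption.
    """
    env = os.environ if env is None else env
    paths = list(candidates)

    # Check TESSDATA_PREFIX first. Note the argument order: index, then value.
    if "TESSDATA_PREFIX" in env:
        paths.insert(0, env["TESSDATA_PREFIX"])

    for path in paths:
        try:
            entries = os.listdir(path)
        except OSError:
            # Missing or unreadable directory: skip it rather than crash.
            continue
        if f"{wanted_lang}.traineddata" in entries:
            return path

    # Nothing matched; the caller decides what the fallback should be.
    return None
```

The code in the PR asks `tesserocr.get_languages(path)` for the available languages instead of inspecting file names, so it follows whatever libtesseract itself considers installed. The sketch only mirrors the ordering and the skip-on-`OSError` behaviour.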
@derekadombek these are the dockerfile-related changes
Oh gotcha! Kinda what I was imagining. Makes sense. Like what we chatted about earlier, it shouldn't make much of a difference in build time. Now that we're adding this, though, do you know whether we can eliminate other installed dependencies to make these images smaller, or will they still be needed?
Not sure if we'll be able to get this in by January, but it would be nice to scan these images for CVEs.
I'll be honest, I have no clue why ffmpeg and xlib are in there. I can look into it, though, if the image size is a problem. I also note that we don't clean up after `apt update`, which is also a concern.