Compatibility with tesseract 4 #273
Comments
I'm currently working on supporting Tesseract 4.0. Unfortunately, an upgrade attempt has revealed unforeseen problems:
Moreover, Tesseract's full page mode has proven to perform rather poorly on text recognition in the presence of musical symbols. Lyrics and chords are especially affected because they use uncommon layout and grammar. This is something we cannot work around easily. For the time being, Audiveris lets Tesseract perform one-shot text detection and recognition, relying on algorithms we have no control over. This needs to be reworked to allow multistage recognition/rejection using different parameters, see #44. |
Update:
fixed
Audiveris relies on the information returned by Tesseract. Tesseract 4 has been redesigned in such a way that font information other than character size isn't available anymore, see tesseract-ocr/tesseract#1074. Support for font attributes is feasible but isn't available yet. According to the principal Tesseract developer, Ray Smith, this is one more reason for delaying deprecation of the v3 engine. Many people recommend sticking to the old engine instead of switching to the recent one. The reality is a bit different:
I'm currently redesigning Tesseract-related classes to support the new engine. Results will be reported shortly... |
After spending several days analyzing Tesseract 4's output via TessAPI, I found several serious problems that prevent further adoption of the LSTM engine for our OMR task. I therefore decided to wait for the Tesseract team to fix all bounding-box-related issues first. Audiveris will stick to Tesseract 3.x for now. |
I just tested Tesseract 4 in the legacy engine mode (OEM_TESSERACT_ONLY). It seems to work as expected. The updated code was pushed to the tess4 feature branch. Please test it and give me feedback. |
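For anyone who wants to try the legacy engine outside Audiveris first, here is a minimal sketch that assembles a `tesseract` command line with `--oem 0`, the command-line counterpart of OEM_TESSERACT_ONLY. The class name, file names and language code are placeholders of mine, not anything from Audiveris or the tess4 branch. It assumes a Tesseract 4+ binary on the PATH and traineddata files that still include the legacy model.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Audiveris): builds a tesseract CLI call. */
final class LegacyOcrCommand {
    /** Engine mode 0 selects the legacy engine only (OEM_TESSERACT_ONLY). */
    static final int OEM_LEGACY_ONLY = 0;

    /** Returns the argument list for running tesseract in legacy mode. */
    static List<String> build(String image, String outputBase, String lang, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);        // input image, placeholder name
        cmd.add(outputBase);   // output file name without extension
        cmd.add("--oem");
        cmd.add(String.valueOf(OEM_LEGACY_ONLY));
        cmd.add("--psm");
        cmd.add(String.valueOf(psm));
        cmd.add("-l");
        cmd.add(lang);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("sheet.png", "sheet", "eng", 3);
        // Only prints the command; pass the list to ProcessBuilder to run it.
        System.out.println(String.join(" ", cmd));
    }
}
```

The list can be handed to `new ProcessBuilder(cmd)` as-is once the file names point at a real image.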
I can confirm that it does not crash and that it produces MusicXML files, but those files are almost completely empty for the handful of TIFF files I tried. This is the full output file:
<?xml version="1.0" ?>
<sheet last-persistent-id="0" number="1">
<glyph-index></glyph-index>
</sheet> |
I will close this issue and open a new one for the new problem. |
That's what I expected, too. Tesseract 4 and even the latest Tesseract 5.0.1 are still compatible with Tesseract 3 in legacy mode. Why was the update abandoned? I noticed that there are pre-built jar files which can be used for 4.0.0, but I could not find jar files for newer releases. |
Now I could at least build with Tesseract 4.1.1 (based on your tess4 branch). See https://github.com/stweil/audiveris/tree/tess4. |
@stweil But I did not really use OCR at that time; I was focusing on a new attempt at head recognition via a patch classifier. This work is on pause right now. It should get resurrected some day, but that's another story. If you could spend some time evaluating the actual OCR results (of 4.x, and perhaps 5.x as you mentioned), we would all benefit from that experience. |
In theory, Tesseract 4 and 5 in legacy mode should produce the same results as Tesseract 3 because they all use the same OCR engine (and the same kind of models), so the quality would be identical. Tesseract 5 would still be faster, include a lot of bug fixes and support more platforms (ARM, Apple M1, ...). I have a lot of experience with Tesseract, so I can help on that side, but I have no experience with Audiveris. |
@hbitteur Stefan asked for the reason for not merging the Line 14 in 8671b09
To my understanding, nothing prevents us from switching to the newer Tesseract 4.1 or even 5.x as long as they run in the legacy engine mode. This will require changes available in the |
Audiveris doesn't use pre-built binaries. It uses the javacpp-presets wrapper for accessing Tesseract. The recent |
@maximumspatium |
I'll go ahead and switch to |
I just tried Audiveris with 4.1.1, and that seems to work fine. The modifications from your |
@stweil I finally switched Audiveris to Tesseract 4.1.1, see ce97610 I also tried Audiveris with Tesseract 5.0.1. Unfortunately, |
@stweil We're experiencing issues with Tesseract sometimes reporting unreliable symbol positions when running in full page mode. Selecting the area and letting Tesseract recognize it again usually produces better results. It looks like a bug in the Tesseract API that I never managed to catch. |
Do you get those wrong positions also when the same page is processed by the |
The I tried two different page segmentation modes and got similar results:
PSM=3, Tesseract's default, also used by Audiveris:
<TextLine ID="line_3" HPOS="139" VPOS="392" WIDTH="536" HEIGHT="39">
<String ID="string_10" HPOS="139" VPOS="392" WIDTH="226" HEIGHT="39" WC="0.84" CONTENT="Arrangement"/><SP WIDTH="14" VPOS="392" HPOS="365"/>
<String ID="string_11" HPOS="379" VPOS="402" WIDTH="4" HEIGHT="21" WC="0.89" CONTENT=":"/><SP WIDTH="16" VPOS="402" HPOS="383"/>
<String ID="string_12" HPOS="399" VPOS="392" WIDTH="94" HEIGHT="31" WC="0.83" CONTENT="Alain"/><SP WIDTH="12" VPOS="392" HPOS="493"/>
<String ID="string_13" HPOS="505" VPOS="393" WIDTH="170" HEIGHT="30" WC="0.89" CONTENT="BRUNET"/>
</TextLine>
PSM=11, i.e. "find as much text as possible":
<TextLine ID="line_2" HPOS="139" VPOS="392" WIDTH="536" HEIGHT="39">
<String ID="string_8" HPOS="139" VPOS="392" WIDTH="226" HEIGHT="39" WC="0.82" CONTENT="Arrangement"/><SP WIDTH="14" VPOS="392" HPOS="365"/>
<String ID="string_9" HPOS="379" VPOS="402" WIDTH="4" HEIGHT="21" WC="0.89" CONTENT=":"/><SP WIDTH="16" VPOS="402" HPOS="383"/>
<String ID="string_10" HPOS="399" VPOS="392" WIDTH="94" HEIGHT="31" WC="0.83" CONTENT="Alain"/><SP WIDTH="12" VPOS="392" HPOS="493"/>
<String ID="string_11" HPOS="505" VPOS="393" WIDTH="170" HEIGHT="30" WC="0.89" CONTENT="BRUNET"/>
</TextLine>
I assume a bug somewhere in the public API. |
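To make this kind of comparison repeatable, here is a minimal sketch that reads the `String` boxes out of two ALTO fragments and reports the largest coordinate difference between them. It uses only the JDK's XML parser. The class and method names are made up for illustration, and words are matched purely by their order in the line, which is an assumption that only holds when both runs segment the line the same way.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: compares word boxes from two ALTO outputs. */
final class AltoBoxes {
    /** One ALTO String element with its bounding box. */
    static final class Word {
        final String content;
        final int x, y, w, h;
        Word(String content, int x, int y, int w, int h) {
            this.content = content; this.x = x; this.y = y; this.w = w; this.h = h;
        }
    }

    /** Collects all String elements (the parser is not namespace-aware by default). */
    static List<Word> parseWords(String alto) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(alto.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            NodeList nodes = root.getElementsByTagName("String");
            List<Word> words = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                words.add(new Word(e.getAttribute("CONTENT"), attr(e, "HPOS"),
                        attr(e, "VPOS"), attr(e, "WIDTH"), attr(e, "HEIGHT")));
            }
            return words;
        } catch (Exception ex) {
            throw new IllegalArgumentException("not parseable as ALTO XML", ex);
        }
    }

    private static int attr(Element e, String name) {
        return Integer.parseInt(e.getAttribute(name));
    }

    /** Largest absolute difference in any box coordinate, matching words by index. */
    static int maxShift(List<Word> a, List<Word> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("word counts differ");
        }
        int max = 0;
        for (int i = 0; i < a.size(); i++) {
            Word p = a.get(i), q = b.get(i);
            max = Math.max(max, Math.abs(p.x - q.x));
            max = Math.max(max, Math.abs(p.y - q.y));
            max = Math.max(max, Math.abs(p.w - q.w));
            max = Math.max(max, Math.abs(p.h - q.h));
        }
        return max;
    }

    public static void main(String[] args) {
        String line = "<TextLine ID=\"line_3\" HPOS=\"139\" VPOS=\"392\" WIDTH=\"536\" HEIGHT=\"39\">"
                + "<String ID=\"string_12\" HPOS=\"399\" VPOS=\"392\" WIDTH=\"94\" HEIGHT=\"31\" WC=\"0.83\" CONTENT=\"Alain\"/>"
                + "<String ID=\"string_13\" HPOS=\"505\" VPOS=\"393\" WIDTH=\"170\" HEIGHT=\"30\" WC=\"0.89\" CONTENT=\"BRUNET\"/>"
                + "</TextLine>";
        System.out.println("max shift: " + maxShift(parseWords(line), parseWords(line)));
    }
}
```

The same check could be pointed at the boxes Audiveris receives through the API, which would help narrow down whether the wrong positions come from Tesseract itself or from the wrapper.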
I just wanted to try the new code, but it looks like the javacpp-presets are unavailable for macOS on M1. |
@stweil That's true. Apparently, it's very easy to adapt javacpp-presets to an unsupported architecture. Each preset includes a build script that compiles both the native library and its JNI bridge. If you can compile Tesseract on your M1, you will be able to compile its Java bindings. Unfortunately, I can't do it because I don't own an M1 Mac :) |
Let's move our discussion regarding OCR issues to #575. |
FYI, |
5.1.0:6780b1f91
Tesseract OCR, version 3.04.01
When will we see support for Tesseract 4.0?