Incorrect word coordinates #1712

troplin · 2018-06-28T09:00:11Z

Environment

Tesseract Version:

tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.76.0 (Jun 26 2018, 18:21:40) [MSC v.1900 LIB Release x86]
 libgif 5.1.4 : libjpeg 9b : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX
Found SSE

(says beta.1, but beta.3 seems to be correct)

Commit Number: AppVeyor: 4.0.0-beta.3.1776
Platform: Windows 10 64bit (but tesseract running as 32bit)
Tessdata: tessdata-fast

We've integrated the engine in our (closed-source) application using the C API, so I cannot share the actual code.

What I do is basically just iterating over the result iterator using RIL_WORD, get the text, bounding box and baseline for each word and then creating a PDF out of it, with the recognized text as a red overlay and drawing the bounding boxes in green for better visibility.

But the official PDF output config has the same flaw, it's just more difficult to spot:

tesseract andromeda.png andromeda.tess4cli pdf

produces the following PDF:
andromeda.tess4cli.pdf

Files:

Behavior can be reproduced using the following PNG (which is a part of a bigger file):

Current Behavior:

The coordinates are correct for most words.
But for some words, there seems to be an error in computing the boundaries between the words.
Couriously, the wrong boundary is always before the last character of the previous word.
I almost looks like some kind of off-by-one error.

Expected Behavior:

Result with tesseract 3:

The text was updated successfully, but these errors were encountered:

willaaam · 2018-09-03T19:48:53Z

Seems like this could be related to the issue I'm facing.

#1192 (comment)

amitdo · 2018-10-05T19:06:25Z

I don't think there is a bug in this case.

The lstm network only returns xcoords. AFAIK, each xcoord is just one spot (pixel) in each recognized glyph. The spot can be in the beginning, middle or end of each glyph. Because the lstm net does not produce bboxes, tesseract tries to estimate the bboxes from these xcoords. This process can't be done accurately.

In cases where some bboxes are just complete garbage, there is certainly a bug somewhere, probably in the conversion from xcoords to bboxes.

Don't mix between the 'no bug' cases and the buggy cases.

amitdo · 2018-10-05T20:09:10Z

Thinking about this again.
It's possible that there is a small bug here.

As said, the cases with complete garbage boxes are certainly bugs.

Sintun · 2018-10-06T08:18:48Z

Hey there, I'm still working on this bug.
By looking at the internal values of tesseract applied to

I tracked the issue down to RecodeBeamSearch::Decode.
After the application of this function
beam_[beam_size_ - 1]->beams_[beam_index].get(h).data
contains slightly off positions, duplicates, unichar_id s for all valid values of beam_index, h.
For example, the positions of u and s are attributed to the left and right half of the letter u.

It is possible, that this function works correctly but gets the off-values from the lstmrecognizer. I will investigate this the next time i can take a look at this.

Reproducible by dropping the following code at the end of RecodeBeamSearch::Decode

{
    GenericVector<const RecodeNode*> _best_nodes;
    GenericVector<const RecodeNode*>* best_nodes = &_best_nodes;
    const RecodeBeam* last_beam = beam_[beam_size_ - 1];
    unsigned node_ct = 0;
    for (int c = 0; c < NC_COUNT; ++c)
    {
      NodeContinuation cont = static_cast<NodeContinuation>(c);
      for (int is_dawg = 0; is_dawg < 2; ++is_dawg)
      {
        int beam_index = BeamIndex(is_dawg, cont, 0);
        int heap_size = last_beam->beams_[beam_index].size();
        for (int h = 0; h < heap_size; ++h)
        {
          const RecodeNode* node = &last_beam->beams_[beam_index].get(h).data;
          ExtractPath(node, best_nodes);
          ++node_ct;
          
          int dbg_ct = 0;
          for( unsigned i = 0; i < best_nodes->size(); ++i )
          {
            if( (*best_nodes)[i]->unichar_id != INVALID_UNICHAR_ID )
              printf("~ %d %d x-break pos %d, "
                "conf:%f, duplicate:%d, u-id:%d\n", node_ct, dbg_ct++, int(std::round(i*2.75f)),
                (*best_nodes)[i]->certainty,int((*best_nodes)[i]->duplicate),
                (*best_nodes)[i]->unichar_id);
          }
        }
      }
    }
  }

Multiplication by 2.75 is done in order to obtain the original image coordinates. This factor probably only applies to exactly this image.

Example output (for one of the beam_index, h combinations) :

~ 2 13 x-break pos 61, conf:-0.085068, duplicate:1, u-id:101
~ 2 14 x-break pos 72, conf:-0.085045, duplicate:0, u-id:91
~ 2 15 x-break pos 74, conf:-0.085093, duplicate:1, u-id:91
~ 2 16 x-break pos 77, conf:-0.085023, duplicate:1, u-id:91
~ 2 17 x-break pos 80, conf:-0.089978, duplicate:1, u-id:91
~ 2 18 x-break pos 83, conf:-0.085023, duplicate:1, u-id:91
~ 2 19 x-break pos 91, conf:-0.085059, duplicate:0, u-id:94
~ 2 20 x-break pos 94, conf:-0.086384, duplicate:1, u-id:94
~ 2 21 x-break pos 96, conf:-0.085012, duplicate:1, u-id:94
~ 2 22 x-break pos 99, conf:-0.085054, duplicate:1, u-id:94
~ 2 23 x-break pos 102, conf:-0.085051, duplicate:1, u-id:94
~ 2 24 x-break pos 113, conf:-0.085026, duplicate:0, u-id:89
~ 2 25 x-break pos 116, conf:-0.085202, duplicate:1, u-id:89

u-id 91 and 94 are the letters u and s. the x-break positions show that they are attributed to the position of u in the original image.

Sintun · 2018-10-21T16:44:38Z

Hey there,

I invested another day, and got into the LSTM decoder. I found nothing wrong with it and so i tested the outcome of different lstm models for the same and different languages. It turns out that this behavior is strongly model dependent, for example the standard and best english model as well as the fast german give the correct bounding boxes for the "thousand Billion" example, while other models of these languages fail to do so.
So i came to the conclusion, that the LSTM output is unreliable for exact character & word positions. Nonetheless the output of the segmenter ist still very good, it just gets discarded before the lines go into the LSTM.

I will write a little function, that first uses tesseracts layout analysis to obtain word (and probably character) positions and then overwrite the LSTM results with the results of the layout analysis when possible.

So the bug source is unfixable for me and a possible workaround can be done.
I don't think workarounds should be part of tesseract, so i will do this outside of Tesseract after getting the necessary information via the api.

At the moment I'm planning to only create this workaround function for my own project. But I'm willing to integrate this workaround in the tesseract command line tool if someone is interested in using it.

Shreeshrii · 2018-10-22T01:06:27Z

I will write a little function, that first uses tesseracts layout analysis to obtain word (and probably character) positions and then overwrite the LSTM results with the results of the layout analysis when possible.

@Sintun I think this function will be useful for many Tesseract users, specially those who have been relying on the character coordinates in older versions. It would be great if you will integrate it in tesseract.

Shreeshrii · 2018-10-22T16:21:03Z

I will write a little function, that first uses tesseracts layout

analysis to obtain word (and probably character) positions and then overwrite the LSTM results with the results of the layout analysis when possible. @Sintun I think this function will be useful for many Tesseract users, specially those who have been relying on the character coordinates in older versions. It would be great if you will integrate it in tesseract.

FarhadKhalafi · 2018-10-26T04:56:21Z

I am wondering if what I see in this screenshot is related to the same problem. The text is correctly extracted but the bounding rectangles are wrong for a couple of words.

Sintun · 2018-10-26T07:18:33Z

Hi @FarhadKhalafi ,
yes that is this bug. I have a working workaround (for my own project) that reliably restores the word boxes and gives some reasonable results for character positions. It should also work for #1192. I'm planning on integrating it into tesseract within the next 2-3 weeks.

FarhadKhalafi · 2018-10-26T17:21:52Z

Hi @Sintun ,
Thanks for the update. Accurate location of, at least, words is often critical. I use the OCR text for table extraction and this kind of shift will break my algorithms. If you are willing to share your fix, I might be able to assist or at least test.

Sintun · 2018-10-26T18:24:26Z

Specifically testing would be greatly appreciated :) Sharing the workaround code would be of small value, because it solves the problem after the tesseract api calls (even after the tess object destruction) in my project. I will refactor this in order to do this within tesseract.

willaaam · 2018-10-27T17:42:16Z

Even sharing that might help someone Sintun, a person in my team is writing that after API fix as well for python (it corrects the HOCR). I'll be sharing that as well when its done.

Sintun · 2018-11-11T15:16:10Z

Hey there, i just started to write my workaround and looked at the state of affairs.
I had to realize, that ray created a native workaround using the historically grown tesseract objects. From my perspective, this is the best case, because i wouldn't had the time to do this in the tess framework. A closer look reveals, that this is similiar to my external approach, so i'm stopping my work on this.
As can be seen in #1192 it should find it's way into the git repo within the next days.

zdenop · 2018-11-12T11:59:59Z

Please test current code.

FarhadKhalafi · 2018-11-14T05:27:48Z

I tested the new code for accurate placement of bounding rectangles for recognized words. The code works much better in identifying the left and right edges of words. However, I did observe certain anomalies with the vertical placement of bounding rectangles. I have attached a ZIP of two TIFF files. The file named 85754_full.tif shows the vertical placement problem:

If I crop this image and zoom on the problem area, then the anomaly disappears (see 85754_crop.tif):

My guess is that the surrounding boxes influence the vertical placement when the text lines are detected.

85754.zip

LSTM model output produces only approximate character positions without boundary data. This creates a problem because the input blobs cannot be accuratelly mapped to characters and thus the accuracy of character bounding boxes is compromised as well. Current this problem is solved as follows. The character boundaries are computed according to the character positions from the LSTM output by placing the boundaries at the middle between two character positions. The blobs are then assigned according to which character the center of the blob falls to. In other words the blobs are assigned to the nearest characters. This unfortunately produces a lot of errors because the character positions in the LSTM output have a tendency to drift, thus the nearest character is often not the right one. Fortunately while the LSTM model produces approximate positions, the blob boundaries produced by the regular segmenter are pretty good. Most of the time a single blob corresponds to a single character and vice-versa. The above is used to create an optimization algorithm that treats the output of the regular segmenter as a template to which LSTM model output is matched. The selection of best match is done by assigning each unwanted property of the outcome a cost and then minimizing the total cost of the solution. This reliably solves the most frequent error present in the current solution when blobs are simply assigned to wrong character. As a result the current algorithm produces up to 20 times less errors. Fixes tesseract-ocr#1712.

LSTM model output produces only approximate character positions without boundary data. This creates a problem because the input blobs cannot be accurately mapped to characters and thus the accuracy of character bounding boxes is compromised as well. Current this problem is solved as follows. The character boundaries are computed according to the character positions from the LSTM output by placing the boundaries at the middle between two character positions. The blobs are then assigned according to which character the center of the blob falls to. In other words the blobs are assigned to the nearest characters. This unfortunately produces a lot of errors because the character positions in the LSTM output have a tendency to drift, thus the nearest character is often not the right one. Fortunately while the LSTM model produces approximate positions, the blob boundaries produced by the regular segmenter are pretty good. Most of the time a single blob corresponds to a single character and vice-versa. The above is used to create an optimization algorithm that treats the output of the regular segmenter as a template to which LSTM model output is matched. The selection of best match is done by assigning each unwanted property of the outcome a cost and then minimizing the total cost of the solution. This reliably solves the most frequent error present in the current solution when blobs are simply assigned to wrong character. As a result the current algorithm produces up to 20 times less errors. Fixes tesseract-ocr#1712.

When using LSTM models the accuracy of character bounding boxes is low with many blobs assigned to wrong characters. This is caused by the fact that LSTM model output produces only approximate character positions without boundary data. As a result the input blobs cannot be accurately mapped to characters and which compromises the accuracy of character bounding boxes. Current this problem is solved as follows. The character boundaries are computed according to the character positions from the LSTM output by placing the boundaries at the middle between two character positions. The blobs are then assigned according to which character the center of the blob falls to. In other words the blobs are assigned to the nearest characters. This unfortunately produces a lot of errors because the character positions in the LSTM output have a tendency to drift, thus the nearest character is often not the right one. Fortunately while the LSTM model produces approximate positions, the blob boundaries produced by the regular segmenter are pretty good. Most of the time a single blob corresponds to a single character and vice-versa. The above is used to create an optimization algorithm that treats the output of the regular segmenter as a template to which LSTM model output is matched. The selection of best match is done by assigning each unwanted property of the outcome a cost and then minimizing the total cost of the solution. This reliably solves the most frequent error present in the current solution when blobs are simply assigned to wrong character. As a result the current algorithm produces up to 20 times less errors. Fixes tesseract-ocr#1712.

Sintun mentioned this issue Jul 9, 2018

[Feature Request] Table structure extraction at the API #1714

Open

Sintun mentioned this issue Sep 3, 2018

Noise characters recognized with bbox as the entire page #1192

Open

troplin mentioned this issue Sep 14, 2018

RFC: Tesseract 4.0.0 – open tasks #1423

Closed

zdenop added bug accuracy labels Sep 30, 2018

vidiecan mentioned this issue Oct 23, 2018

Wrong symbol bounding box coordinates #2024

Closed

zdenop mentioned this issue Mar 6, 2019

Incorrect bounding boxes #2264

Closed

Kristinita mentioned this issue Aug 15, 2019

bug(pdf): incorrect selected area UB-Mannheim/tesseract#20

Open

tesseract-ocr deleted a comment from Shreeshrii Aug 16, 2019

RicketyRick mentioned this issue Dec 18, 2019

Overlapping Character Boundingboxes #2825

Open

amitdo added the bounding box label Mar 12, 2021

p12tic linked a pull request Apr 10, 2022 that will close this issue

RFC: Improve positioning of symbol bounding boxes #3787

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect word coordinates #1712

Incorrect word coordinates #1712

troplin commented Jun 28, 2018

willaaam commented Sep 3, 2018

amitdo commented Oct 5, 2018

amitdo commented Oct 5, 2018

Sintun commented Oct 6, 2018

Sintun commented Oct 21, 2018 •

edited

Loading

Shreeshrii commented Oct 22, 2018

Shreeshrii commented Oct 22, 2018 via email

FarhadKhalafi commented Oct 26, 2018

Sintun commented Oct 26, 2018

FarhadKhalafi commented Oct 26, 2018

Sintun commented Oct 26, 2018

willaaam commented Oct 27, 2018

Sintun commented Nov 11, 2018

zdenop commented Nov 12, 2018

FarhadKhalafi commented Nov 14, 2018 •

edited

Loading

Incorrect word coordinates #1712

Incorrect word coordinates #1712

Comments

troplin commented Jun 28, 2018

Environment

Files:

Current Behavior:

Expected Behavior:

willaaam commented Sep 3, 2018

amitdo commented Oct 5, 2018

amitdo commented Oct 5, 2018

Sintun commented Oct 6, 2018

Sintun commented Oct 21, 2018 • edited Loading

Shreeshrii commented Oct 22, 2018

Shreeshrii commented Oct 22, 2018 via email

FarhadKhalafi commented Oct 26, 2018

Sintun commented Oct 26, 2018

FarhadKhalafi commented Oct 26, 2018

Sintun commented Oct 26, 2018

willaaam commented Oct 27, 2018

Sintun commented Nov 11, 2018

zdenop commented Nov 12, 2018

FarhadKhalafi commented Nov 14, 2018 • edited Loading

Sintun commented Oct 21, 2018 •

edited

Loading

FarhadKhalafi commented Nov 14, 2018 •

edited

Loading