feature request: dump only failed images. #34

fundies · 2014-01-17T19:39:50Z

ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386

Can you add an argument to dump only the images that failed OCR? And, if possible, allow them to be opened in an external image editor so that I can be prompted on the CLI for a fix?

Signed-off-by: Rüdiger Sonderfeld <[email protected]>

ruediger · 2014-01-17T20:08:15Z

I implemented this option in a branch for now. I'm not sure if it is really needed. I think opening them in an external image editor should be done in a GUI which could also use dump-images and simply parse the error output to get to the images.

Eventually I want to figure out how to extract more information from tesseract about the OCR process. It should provide some kind of confidence or error estimate. That would probably be even more useful than simply looking for images with complete OCR failure.

fundies · 2014-01-17T20:24:11Z

Cloning into 'VobSub2SRT'...
done.
==> Starting pkgver()...
==> Updated version: vobsub2srt-git v1.0pre6.36.gde90184-1
==> Starting build()...
Switched to a new branch 'origin/dump-error-images'
-- The C compiler identification is GNU 4.8.2

Maybe I did something wrong, but:

[greg@greg-desktop test]$ vobsub2srt --dump-error-images vobsub
ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386
Wrote Subtitles to 'vobsub.srt'
[greg@greg-desktop test]$ ls
vobsub.idx vobsub.srt vobsub.sub

As you can see, there are no images. Also, I wanted a prompt because vobsub2srt deletes the line it can't OCR and then shifts all the timecodes. So not only do I have to figure out the OCR, I also have to manually open the idx, get the timecodes, and then edit the line in. It's kinda annoying :P

ruediger · 2014-01-17T20:34:08Z

Could you provide me with a sample file? (e.g., via e-mail [email protected]) I have several VobSub samples but none for which OCR fails.

> Also I wanted a prompt because vobsub2srt deletes the line it can't ocr then shifts all the timecodes.

Shifts all the timecodes? That's strange. I'll have to test it. I guess the best way would be to write the error message to the SRT as well. That way a GUI tool could easily point to the part of the SRT that needs fixing.
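To illustrate the idea of keeping a failed entry in the SRT instead of dropping it, here is a minimal Python sketch. It is not vobsub2srt's implementation (which is C++); the function names and the placeholder text `[OCR failed]` are assumptions made up for this example. The point is only that every cue keeps its own number and timecode, so nothing after a failure shifts.

```python
def format_time(ms):
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def build_srt(cues, placeholder="[OCR failed]"):
    """Build SRT text from (start_ms, end_ms, text_or_None) tuples.

    A cue whose text is None or empty (OCR failure) is written with a
    placeholder instead of being skipped, so numbering stays stable.
    The placeholder string is hypothetical, not vobsub2srt's message.
    """
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        body = text if text else placeholder
        blocks.append(
            f"{number}\n{format_time(start)} --> {format_time(end)}\n{body}\n"
        )
    return "\n".join(blocks)


cues = [(1000, 2000, "Hello"), (3000, 4500, None), (5000, 6000, "Bye")]
srt_text = build_srt(cues)
```

A GUI tool or a plain text search could then look for the placeholder to find the entries that need manual fixing.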

fundies · 2014-01-17T20:41:46Z

I sent them. By shifting timecodes, I mean that it completely deletes the empty line. I.e., if it were line 21, it'd make line 22 become line 21.

ruediger · 2014-01-18T12:32:34Z

Hmm, works for me.

$ ../build/bin/vobsub2srt --dump-error-images error-vobsub
ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386
Wrote Subtitles to 'error-vobsub.srt'
$ ls error-vobsub*
error-vobsub-001.pgm  error-vobsub-023.pgm  error-vobsub-133.pgm  error-vobsub-367.pgm
error-vobsub-386.pgm  error-vobsub.idx  error-vobsub.srt  error-vobsub.sub

Maybe you are calling an old version of vobsub2srt or haven't rebuilt it properly.
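For a wrapper or GUI that wants to find the dumped images by parsing the error output, a rough Python sketch could look like the following. The `<basename>-NNN.pgm` naming is inferred only from the listing above and is an assumption, not a documented guarantee; the function name is hypothetical.

```python
import re

# Matches the error lines shown in this thread, e.g. "ERROR: OCR failed for 23".
ERROR_RE = re.compile(r"^ERROR: OCR failed for (\d+)$")


def failed_image_names(error_output, basename):
    """Return the expected dump file name for each distinct failed index.

    Assumes images are written as "<basename>-<index, zero-padded to 3>.pgm",
    which is inferred from the `ls` output in this thread.
    """
    seen = []
    for line in error_output.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            index = int(match.group(1))
            if index not in seen:  # index 367 is reported twice above
                seen.append(index)
    return [f"{basename}-{index:03d}.pgm" for index in seen]


sample_output = """ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386
Wrote Subtitles to 'error-vobsub.srt'
"""
names = failed_image_names(sample_output, "error-vobsub")
```

The duplicate report for 367 is collapsed, so the result lists five file names, matching the five `.pgm` files in the listing.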

ruediger · 2014-01-18T12:46:43Z

b70b6f5 should fix the shifting problem and writes an error message to the SRT in case of OCR error.

Thanks for reporting that issue and providing me with the sample subtitles.

fundies · 2014-01-18T18:16:08Z

I got it, but I can't for the life of me figure out what's needed to open a PGM... Nothing I try can view it.

ruediger · 2014-01-18T18:24:07Z

PGM is a rather simple format. What operating system are you using? On Linux you can run xdg-open filename.pgm, which should open an appropriate image viewer if one is installed (e.g., KDE's Gwenview). If you have ImageMagick installed, you can also convert the file to a different format: convert filename.pgm filename.png converts it to PNG.

https://en.wikipedia.org/wiki/Portable_pixmap
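Because the PGM header is just a few whitespace-separated ASCII tokens (magic number, width, height, maximum gray value), it is easy to inspect a file without an image viewer. The following Python sketch reads only the header; the function name is hypothetical, and whether vobsub2srt writes the binary (`P5`) or plain (`P2`) variant is not stated in this thread, so both are accepted.

```python
def read_pgm_header(data):
    """Parse a PGM header from bytes and return (magic, width, height, maxval).

    Handles '#' comment lines between header tokens. Raises ValueError
    if the header is truncated or is not a PGM (P2 or P5) header.
    """
    tokens = []
    pos = 0
    while len(tokens) < 4:
        # Skip whitespace and comments before the next token.
        while pos < len(data):
            byte = data[pos:pos + 1]
            if byte.isspace():
                pos += 1
            elif byte == b"#":
                while pos < len(data) and data[pos:pos + 1] != b"\n":
                    pos += 1
            else:
                break
        if pos >= len(data):
            raise ValueError("truncated PGM header")
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos].decode("ascii"))
    magic = tokens[0]
    if magic not in ("P2", "P5"):
        raise ValueError(f"not a PGM file: magic {magic!r}")
    width, height, maxval = (int(token) for token in tokens[1:])
    return magic, width, height, maxval


header = read_pgm_header(b"P5\n# example comment\n720 0 255\n")
```

To check a real file, pass the first bytes of it, for example `read_pgm_header(open("filename.pgm", "rb").read(64))`. A width or height of 0 in the result means the file has no pixel data, which most viewers will refuse to open.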

fundies · 2014-01-18T18:28:54Z

Hmm, it appears the images it fails to OCR are corrupt? I can open the rest just fine :/

ruediger · 2014-01-18T18:47:51Z

Ah, ok. I was surprised that tesseract would simply return NULL for an OCR error but in fact it seems to be an error with the bitmap data. It seems the subtitle has a height of 0. Are those subtitles displayed when you watch them with MPlayer? Do they contain actual text?

fundies · 2014-01-18T19:02:25Z

Most of them are nothing, but occasionally it's a line :/. Watching in MPlayer, everything displays fine.

ruediger · 2014-01-18T19:09:25Z

Ah, that's bad, because it means the problem is not in the MPlayer code but in how I call the MPlayer code. This will probably take me a while to figure out. Are these the only subtitles you have with errors? There are only 6 frames with errors, so I guess you can work around that for now.

Sorry about that.

julien-nc · 2015-03-27T16:00:59Z

Hi, thanks for the work. Amazing tool to get rid of vobsub.

I had two or three missing lines in the subtitles I processed. I spotted them when I watched the movie, and I compared with the VobSub to make sure the lines were really missing.

My problem is that those mistakes are not detected or reported during processing, even with the --dump-error-images option. The missing lines leave no trace in the SRT file. There is apparently no way to detect those errors except by watching the whole movie. Am I missing something?

If one day you feel like tackling this issue, here are my files:
vobsub:
http://pluton.cassio.pe/~demo/manhunter.idx
http://pluton.cassio.pe/~demo/manhunter.sub
and result:
http://pluton.cassio.pe/~demo/manhunter.srt

You'll need the French tesseract data (tesseract-ocr-fra package in Ubuntu).

One miss is between 753 and 754, at 01:04:42.
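One possible way to look for silently dropped lines without watching the movie is to compare the `timestamp:` entries in the .idx file against the cue start times in the generated .srt. The Python sketch below is only a heuristic under several assumptions: the function names are hypothetical, the 500 ms tolerance is an arbitrary guess, all language streams in the .idx are treated alike, and entries that are legitimately empty would show up as false positives.

```python
import re

# VobSub .idx entries look like "timestamp: 01:04:42:000, filepos: 000001000".
IDX_RE = re.compile(r"^timestamp:\s*(\d+):(\d{2}):(\d{2}):(\d{3})", re.MULTILINE)
# SRT cue lines look like "01:04:42,000 --> 01:04:44,000".
SRT_RE = re.compile(r"^(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->", re.MULTILINE)


def _to_ms(match):
    """Convert a regex match with (h, m, s, ms) groups to milliseconds."""
    h, m, s, ms = (int(group) for group in match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def missing_cues(idx_text, srt_text, tolerance_ms=500):
    """Return .idx timestamps (ms) with no SRT cue starting within tolerance."""
    srt_starts = [_to_ms(m) for m in SRT_RE.finditer(srt_text)]
    missing = []
    for match in IDX_RE.finditer(idx_text):
        t = _to_ms(match)
        if not any(abs(t - start) <= tolerance_ms for start in srt_starts):
            missing.append(t)
    return missing


idx_text = (
    "timestamp: 00:00:01:000, filepos: 000000000\n"
    "timestamp: 01:04:42:000, filepos: 000001000\n"
    "timestamp: 01:04:50:000, filepos: 000002000\n"
)
srt_text = (
    "1\n00:00:01,100 --> 00:00:02,000\nA\n\n"
    "2\n01:04:50,050 --> 01:04:52,000\nB\n"
)
gaps = missing_cues(idx_text, srt_text)
```

In this made-up example the entry at 01:04:42 has no matching cue, so it is reported as a candidate for manual review.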

ruediger added a commit that referenced this issue Jan 17, 2014

Implement --dump-error-images (#34).

e27439d

Signed-off-by: Rüdiger Sonderfeld <[email protected]>

fundies closed this as completed Jan 18, 2014

ruediger reopened this Jan 18, 2014

abrasive mentioned this issue May 19, 2015

Skip invisible pictures #47

Closed
