feature request: dump only failed images. #34

Open
fundies opened this issue Jan 17, 2014 · 13 comments

@fundies

fundies commented Jan 17, 2014

ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386

Can you add an argument to dump only the images that failed OCR? And, if possible, allow them to be opened in an external image editor so that I can be prompted on the CLI for a fix?

ruediger added a commit that referenced this issue Jan 17, 2014
Signed-off-by: Rüdiger Sonderfeld <[email protected]>
@ruediger
Owner

I implemented this option in a branch for now. I'm not sure if it is really needed. I think opening them in an external image editor should be done in a GUI which could also use dump-images and simply parse the error output to get to the images.

Eventually I want to figure out how to extract more information from tesseract about the OCR process. It should provide some kind of confidence or error estimate. That would probably be even more useful than simply looking for images with complete OCR failure.

@fundies
Author

fundies commented Jan 17, 2014

Cloning into 'VobSub2SRT'...
done.
==> Starting pkgver()...
==> Updated version: vobsub2srt-git v1.0pre6.36.gde90184-1
==> Starting build()...
Switched to a new branch 'origin/dump-error-images'
-- The C compiler identification is GNU 4.8.2

Maybe I did something wrong, but:

[greg@greg-desktop test]$ vobsub2srt --dump-error-images vobsub
ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386
Wrote Subtitles to 'vobsub.srt'
[greg@greg-desktop test]$ ls
vobsub.idx vobsub.srt vobsub.sub

As you can see, there are no images. Also, I wanted a prompt because vobsub2srt deletes the line it can't OCR and then shifts all the timecodes. So not only do I have to figure out the OCR, I also have to manually open the idx, get the timecodes, and edit the line in. It's kinda annoying :P

@ruediger
Owner

Could you provide me with a sample file? (e.g., via e-mail [email protected]) I have several VobSub samples but none for which OCR fails.

Also I wanted a prompt because vobsub2srt deletes the line it can't ocr then shifts all the timecodes.

Shifts all the timecodes? That's strange. I'll have to test it. I guess the best way would be to write the error message to the SRT as well. That way a GUI tool could easily point to the part of the SRT that needs fixing.

@fundies
Author

fundies commented Jan 17, 2014

I sent them. By shifting timecodes I mean that it completely deletes the empty line, i.e., if it were line 21, it'd make line 22 become line 21.

@ruediger
Owner

Hmm, works for me.

$ ../build/bin/vobsub2srt --dump-error-images error-vobsub
ERROR: OCR failed for 1
ERROR: OCR failed for 23
ERROR: OCR failed for 133
ERROR: OCR failed for 367
ERROR: OCR failed for 367
ERROR: OCR failed for 386
Wrote Subtitles to 'error-vobsub.srt'
$ ls error-vobsub*
error-vobsub-001.pgm  error-vobsub-023.pgm  error-vobsub-133.pgm  error-vobsub-367.pgm
error-vobsub-386.pgm  error-vobsub.idx  error-vobsub.srt  error-vobsub.sub

Maybe you are calling an old version of vobsub2srt or haven't rebuilt it properly.
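
For anyone who wants to script the step suggested earlier (parsing the error output to find the dumped images), here is a minimal Python sketch. It assumes the error lines look exactly like the ones above and that the dump files follow the `<basename>-NNN.pgm` pattern seen in the `ls` output; the function name `failed_image_names` is made up for illustration and is not part of vobsub2srt.

```python
import re

# Matches the error lines shown in the log above, e.g. "ERROR: OCR failed for 23".
ERROR_LINE = re.compile(r"^ERROR: OCR failed for (\d+)$")


def failed_image_names(error_output, basename):
    """Return the expected dump file name for each subtitle index reported as failed."""
    names = []
    for line in error_output.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match is None:
            continue  # not an OCR error line (e.g. "Wrote Subtitles to ...")
        index = int(match.group(1))
        # Assumption: three-digit zero padding, as in error-vobsub-001.pgm.
        name = f"{basename}-{index:03d}.pgm"
        if name not in names:  # an index can be reported more than once
            names.append(name)
    return names
```

Fed the captured error output of the run above with `basename="error-vobsub"`, this would list the five `.pgm` files, which a wrapper could then hand to an image viewer or editor.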

@ruediger
Owner

b70b6f5 should fix the shifting problem; it also writes an error message to the SRT in case of an OCR error.

Thanks for reporting that issue and providing me with the sample subtitles.
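
To illustrate why writing something into the SRT for a failed subtitle keeps the numbering stable, here is a small Python sketch. It is not the code from b70b6f5 (vobsub2srt is not written in Python), and the `[OCR failed]` placeholder text and the `format_srt` name are assumptions made up for this example; the actual message the tool writes is not shown in this thread.

```python
def format_srt(entries, placeholder="[OCR failed]"):
    """Format (start, end, text) tuples as SRT.

    A text of None stands for a subtitle whose OCR failed. It still gets
    its own numbered block, so later entries keep their original numbers.
    """
    blocks = []
    for number, (start, end, text) in enumerate(entries, start=1):
        body = text if text else placeholder  # hypothetical placeholder text
        blocks.append(f"{number}\n{start} --> {end}\n{body}\n")
    return "\n".join(blocks)
```

With this approach, the entry that failed OCR still appears as block 2 (for example), and a later tool or a person can search the SRT for the placeholder to find the spots that need manual fixing.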

@fundies
Author

fundies commented Jan 18, 2014

I got it, but I can't for the life of me figure out what's needed to open a PGM... Nothing I try can view it.

@fundies fundies closed this as completed Jan 18, 2014
@ruediger
Owner

PGM is a rather simple format. What operating system are you using? On Linux you should enter xdg-open filename.pgm and it should open an appropriate image viewer if one is installed (e.g., KDE's Gwenview). You can also convert it into a different format if you have ImageMagick installed: convert filename.pgm filename.png should convert it to PNG.

https://en.wikipedia.org/wiki/Portable_pixmap

@fundies
Author

fundies commented Jan 18, 2014

Hmm, it appears the images it fails to OCR are corrupt? I can open the rest just fine :/

@ruediger
Owner

Ah, ok. I was surprised that tesseract would simply return NULL for an OCR error but in fact it seems to be an error with the bitmap data. It seems the subtitle has a height of 0. Are those subtitles displayed when you watch them with MPlayer? Do they contain actual text?
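
One way to check whether a dumped image has the zero-height problem is to read its PGM header, which is just a magic number followed by width, height and maximum gray value. The following is a minimal Python sketch under a few assumptions: it accepts both the plain (`P2`) and binary (`P5`) variants, since this thread does not say which one vobsub2srt writes, and the helper names `read_pgm_header` and `is_empty_image` are invented for this example.

```python
def read_pgm_header(data):
    """Parse (magic, width, height, maxval) from the start of PGM file bytes."""
    tokens = []
    pos = 0
    while len(tokens) < 4 and pos < len(data):
        byte = data[pos:pos + 1]
        if byte.isspace():
            pos += 1
        elif byte == b"#":
            # Header comment: skip to the end of the line.
            while pos < len(data) and data[pos:pos + 1] != b"\n":
                pos += 1
        else:
            start = pos
            while pos < len(data) and not data[pos:pos + 1].isspace():
                pos += 1
            tokens.append(data[start:pos])
    if len(tokens) < 4 or tokens[0] not in (b"P2", b"P5"):
        raise ValueError("not a PGM header")
    width, height, maxval = (int(token) for token in tokens[1:])
    return tokens[0].decode("ascii"), width, height, maxval


def is_empty_image(data):
    """True if the header declares a width or height of 0."""
    _, width, height, _ = read_pgm_header(data)
    return width == 0 or height == 0
```

Calling `is_empty_image(open("error-vobsub-001.pgm", "rb").read())` on each dumped file would separate images with no pixel data from ones where the bitmap exists but OCR still failed.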

@ruediger ruediger reopened this Jan 18, 2014
@fundies
Author

fundies commented Jan 18, 2014

Most of them are nothing, but occasionally it's a line :/. Watching in MPlayer, everything displays fine.

@ruediger
Owner

Ah, that's bad, because it means the problem is not in the MPlayer code but in how I call the MPlayer code. It will probably take me a while to figure this out. Are these the only subtitles you have with errors? There are only 6 frames with errors, so I guess you can work around that for now.

Sorry about that.

@julien-nc

Hi, thanks for the work. Amazing tool to get rid of VobSub.

I had two or three missing lines in the subtitles I processed. I spotted them when I watched the movie. I compared with the VobSub to make sure there was a miss.

My problem is that those mistakes are not detected or signaled during processing, even with the --dump-error-images option. The missing lines leave no trace in the SRT file. There is apparently no way to detect those errors except by watching the whole movie. Am I missing something?

If one day you feel like tackling this issue, here are my files:
VobSub:
http://pluton.cassio.pe/~demo/manhunter.idx
http://pluton.cassio.pe/~demo/manhunter.sub
and result:
http://pluton.cassio.pe/~demo/manhunter.srt

You'll need the French tesseract data (tesseract-ocr-fra package in Ubuntu).

One miss is between 753 and 754, at 01:04:42.
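
Until the tool reports such misses itself, one possible workaround is to compare the start times listed in the .idx file with the start times in the generated .srt. The Python sketch below assumes that .idx entries have the usual `timestamp: HH:MM:SS:mmm, filepos: ...` form, that the .idx holds a single language stream, and that vobsub2srt writes SRT start times close to the .idx timestamps. That last point is an untested assumption, hence the tolerance parameter. All function names are invented for this example.

```python
import re

IDX_TIME = re.compile(r"^timestamp: (\d+):(\d+):(\d+):(\d+),")
SRT_TIME = re.compile(r"^(\d+):(\d+):(\d+),(\d+) -->")


def to_ms(hours, minutes, seconds, millis):
    """Convert time fields (as strings) to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def start_times(text, pattern):
    """Collect the start time in ms of every line matching the pattern."""
    times = []
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if match:
            times.append(to_ms(*match.groups()))
    return times


def missing_entries(idx_text, srt_text, tolerance_ms=50):
    """Return .idx start times (ms) with no SRT entry starting within the tolerance."""
    srt_times = start_times(srt_text, SRT_TIME)
    missing = []
    for idx_time in start_times(idx_text, IDX_TIME):
        if not any(abs(idx_time - t) <= tolerance_ms for t in srt_times):
            missing.append(idx_time)
    return missing
```

Any timestamp this returns is only a candidate to check by hand, since an .idx entry can also correspond to an empty subtitle image.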
