Feature Request: add hOCR output support #13
Comments
Hi, thanks for raising this issue and glad to hear that you like rusty_tesseract! Tesseract (and rusty_tesseract) already provide the option to output in hOCR format by setting the 'tessedit_create_hocr' flag to '1'. Consider lines 31-40 in the main.rs file. You can simply add the hOCR flag to the config_variables HashMap as follows:
let image_to_string_args = Args {
    lang: "eng".into(),
    config_variables: HashMap::from([
        (
            "tessedit_char_whitelist".into(),
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".into(),
        ),
        ("tessedit_create_hocr".into(), "1".into()),
    ]),
    dpi: Some(150),
    psm: Some(6),
    oem: Some(3),
};
Then the rusty_tesseract::image_to_string() output looks as follows:
The String output is:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract 4.1.1' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "/tmp/rusty-tesseractkxwqOh.png"; bbox 0 0 696 89; ppageno 0'>
<div class='ocr_carea' id='block_1_1' title="bbox 18 29 671 64">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 18 29 671 64">
<span class='ocr_line' id='line_1_1' title="bbox 18 29 671 64; baseline 0 -1; x_size 44.862743; x_descenders 11.215686; x_ascenders 11.215686">
<span class='ocrx_word' id='word_1_1' title='bbox 18 29 162 64; x_wconf 95'>LOREM</span>
<span class='ocrx_word' id='word_1_2' title='bbox 181 29 304 64; x_wconf 91'>IPSUM</span>
<span class='ocrx_word' id='word_1_3' title='bbox 323 29 476 64; x_wconf 91'>DOLOR</span>
<span class='ocrx_word' id='word_1_4' title='bbox 490 29 540 64; x_wconf 96'>SIT</span>
<span class='ocrx_word' id='word_1_5' title='bbox 553 30 671 63; x_wconf 96'>AMET</span>
</span>
</p>
</div>
</div>
</body>
</html>
However, it might not be entirely clear to new users that such a config flag exists within Tesseract, so please feel free to create a new function
P.S. Similarly, you can append the
Thanks, Thomas
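For anyone consuming this output: each hOCR element carries its geometry in the `title` attribute, as in `bbox 18 29 162 64; x_wconf 95` above. The following is a minimal sketch, using only the standard library, of pulling the bounding box and word confidence out of one such title string. The names `HocrProps` and `parse_hocr_title` are hypothetical and not part of rusty_tesseract, and the sketch assumes properties are separated by `;` (so a file name containing `;` in the page-level title would confuse it).

```rust
/// Bounding box and optional word confidence taken from an hOCR `title` attribute.
/// Hypothetical type for illustration; not part of rusty_tesseract.
#[derive(Debug, PartialEq)]
struct HocrProps {
    /// (x0, y0, x1, y1) in pixels, if a well-formed `bbox` property was present.
    bbox: Option<(u32, u32, u32, u32)>,
    /// Word confidence from `x_wconf`, if present.
    x_wconf: Option<f32>,
}

/// Parses a title such as "bbox 18 29 162 64; x_wconf 95".
/// Assumption: properties are separated by ';' and each begins with its name.
fn parse_hocr_title(title: &str) -> HocrProps {
    let mut props = HocrProps { bbox: None, x_wconf: None };
    for prop in title.split(';') {
        let mut parts = prop.split_whitespace();
        match parts.next() {
            Some("bbox") => {
                // Collect the numeric fields; accept the bbox only if there are exactly four.
                let nums: Vec<u32> = parts.filter_map(|p| p.parse().ok()).collect();
                if let [x0, y0, x1, y1] = nums.as_slice() {
                    props.bbox = Some((*x0, *y0, *x1, *y1));
                }
            }
            Some("x_wconf") => {
                props.x_wconf = parts.next().and_then(|p| p.parse().ok());
            }
            // Properties this sketch does not handle (image, baseline, x_size, ...) are skipped.
            _ => {}
        }
    }
    props
}
```

Calling `parse_hocr_title` on the title of an `ocr_line` or `ocr_par` element yields the bounding box with `x_wconf` left as `None`, since those elements do not carry a word confidence. Extracting the `title` attributes themselves is better left to a real XML parser.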
Hi! rusty-tesseract is amazing work! It works pretty well on both my Linux and macOS machines!
I have used it in my personal project https://github.com/strrl/dejavu, and I found that I need more detailed information such as page, paragraph, and line, not only the "word". ref: STRRL/dejavu#7
I found that both alto and hOCR output could make this possible, and both of them are XML-based. I prefer hOCR because its spec still seems to be actively updated: https://github.com/kba/hocr-spec/
So here is my proposal:
image_to_hocr, whose output is a string containing the XML-based hOCR
What do you think? ❤️
I could draft a PR for that.
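One possible shape for such a function is a thin wrapper that forces the `tessedit_create_hocr` config variable to `1` and then delegates to the existing `image_to_string`. The sketch below covers only the flag-injection step and uses only the standard library; `with_hocr_output` is a hypothetical helper name, and the call into rusty_tesseract is deliberately left out because its exact shape would be decided in the PR.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of rusty_tesseract): returns a copy of the
/// caller's config variables with hOCR output switched on.
fn with_hocr_output(config_variables: &HashMap<String, String>) -> HashMap<String, String> {
    let mut vars = config_variables.clone();
    // "tessedit_create_hocr" = "1" is the Tesseract config variable that
    // enables hOCR output; any existing value for the key is overwritten.
    vars.insert("tessedit_create_hocr".to_string(), "1".to_string());
    vars
}
```

Copying the map leaves the caller's arguments untouched, so the same `Args` value could still be reused for plain-text output.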