Feature Request: add hOCR output support #13

Open
STRRL opened this issue Aug 19, 2023 · 1 comment

STRRL commented Aug 19, 2023

Hi! rusty-tesseract is amazing work! It works well on both my Linux and macOS machines!

I have used it in my personal project, https://github.com/strrl/dejavu, and found that I need more detailed information, such as
page, paragraph, and line, not only the "word". Ref: STRRL/dejavu#7

I found that both ALTO and hOCR output would make this possible, and both are XML-based. I prefer hOCR because its specification still seems to be actively updated: https://github.com/kba/hocr-spec/

So here is my proposal:

  • add a new function called image_to_hocr that returns a String containing the XML-based hOCR output

What do you think? ❤️

I could draft a PR for that.

@thomasgruebl
Owner

Hi, thanks for raising this issue and glad to hear that you like rusty_tesseract!

Tesseract (and rusty_tesseract) already provide the option to output in hOCR format by setting the 'tessedit_create_hocr' flag to '1'.

Consider lines 31-40 in the main.rs file: You can simply add the hOCR flag to the config_variables HashMap as follows:

let image_to_string_args = Args {
    lang: "eng".into(),
    config_variables: HashMap::from([
        (
            "tessedit_char_whitelist".into(),
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ".into(),
        ),
        // Ask tesseract to emit hOCR instead of plain text.
        ("tessedit_create_hocr".into(), "1".into()),
    ]),
    dpi: Some(150),
    psm: Some(6),
    oem: Some(3),
};

Then the rusty_tesseract::image_to_string() output looks as follows:

The String output is:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 4.1.1' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "/tmp/rusty-tesseractkxwqOh.png"; bbox 0 0 696 89; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 18 29 671 64">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 18 29 671 64">
     <span class='ocr_line' id='line_1_1' title="bbox 18 29 671 64; baseline 0 -1; x_size 44.862743; x_descenders 11.215686; x_ascenders 11.215686">
      <span class='ocrx_word' id='word_1_1' title='bbox 18 29 162 64; x_wconf 95'>LOREM</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 181 29 304 64; x_wconf 91'>IPSUM</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 323 29 476 64; x_wconf 91'>DOLOR</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 490 29 540 64; x_wconf 96'>SIT</span>
      <span class='ocrx_word' id='word_1_5' title='bbox 553 30 671 63; x_wconf 96'>AMET</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

However, it might not be entirely clear for new users that such a config flag exists within tesseract, so please feel free to create a new function image_to_hocr that automatically appends the tessedit_create_hocr flag to the config_variables HashMap.
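
As a rough sketch of the config-handling part of such a wrapper, the helper below copies a config_variables map and forces the hOCR flag on. The name with_hocr_flag is hypothetical and not part of rusty_tesseract; a real image_to_hocr would presumably build a new Args from the result and delegate to image_to_string, which is only described in a comment here so that the snippet depends on nothing but the standard library.

```rust
use std::collections::HashMap;

/// Tesseract config variable that switches the output format to hOCR.
const HOCR_FLAG: &str = "tessedit_create_hocr";

/// Returns a copy of `config_variables` with the hOCR flag set to "1".
/// Hypothetical helper for illustration; not an existing rusty_tesseract API.
pub fn with_hocr_flag(config_variables: &HashMap<String, String>) -> HashMap<String, String> {
    let mut vars = config_variables.clone();
    // `insert` overwrites any existing value, so a caller-supplied "0" is replaced.
    vars.insert(HOCR_FLAG.to_string(), "1".to_string());
    vars
}

// An `image_to_hocr` wrapper could (assumption) copy the caller's Args,
// swap in `with_hocr_flag(&args.config_variables)`, and then call
// `rusty_tesseract::image_to_string` with the modified Args.
```

Cloning the map rather than mutating it keeps the caller's Args usable for a later plain-text call.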

P.S. Similarly, you can append the tessedit_create_alto flag to the config_variables or any other flag that is listed in the tesseract --print-parameters list.
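
For the page, paragraph, and line details asked for in the request: in the hOCR sample above, each element carries its geometry in the title attribute as semicolon-separated properties, one of which is bbox x0 y0 x1 y1. The following is a minimal sketch of pulling that bounding box out of a title string. BBox and parse_bbox are hypothetical names, and the sketch assumes the title has already been extracted from the XML by whatever parser the caller uses.

```rust
/// Bounding box in pixel coordinates, as found in an hOCR `title` attribute.
/// Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BBox {
    pub x0: u32,
    pub y0: u32,
    pub x1: u32,
    pub y1: u32,
}

/// Extracts the `bbox x0 y0 x1 y1` property from an hOCR `title` value.
/// Returns `None` if there is no bbox property or it is malformed.
pub fn parse_bbox(title: &str) -> Option<BBox> {
    // Properties are separated by ';', e.g. "bbox 18 29 671 64; baseline 0 -1".
    for prop in title.split(';') {
        let mut parts = prop.split_whitespace();
        if parts.next() != Some("bbox") {
            continue;
        }
        let mut nums = [0u32; 4];
        for slot in nums.iter_mut() {
            *slot = parts.next()?.parse().ok()?;
        }
        return Some(BBox { x0: nums[0], y0: nums[1], x1: nums[2], y1: nums[3] });
    }
    None
}
```

Splitting on ';' is a simplification: an image path containing a semicolon in the ocr_page title would need a proper property tokenizer.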

Thanks,

Thomas
