Remove page IDs when saving image to text or scanning to text using OCR #128

Open
DraganRatkovich opened this issue Mar 12, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@DraganRatkovich
Collaborator

DraganRatkovich commented Mar 12, 2022

Is your feature request related to a problem? Please describe.

When saving an image to a text file, or when choosing the Scan to Text File option and selecting a scanned book for OCR text extraction, Bookworm adds "Page 1", "Page 2" identifiers to the text file. These are useless in this case: they do not help when pasting the text into a Word document to arrange the pages as in the original document. Word will easily do the rest of the work itself, and any additional font, paragraph style, or line spacing can be applied to the text if the user requires it. So writing "Page 1", "Page 2" and the extra page break character into a text file is unnecessary. No text exporters, at least not the popular ones like MS Word or Adobe PDF, do this.

Describe the solution you'd like

Simply extract pure text from a PDF file or image without adding the word "Page" and page numbers, or a page break symbol.
@mush42 It would be very useful if this were fixed soon, because people would save PDF or Word documents as text files far more often, and the resulting text would be clean.

@DraganRatkovich DraganRatkovich changed the title Remove page id words when saving any book to text file Remove page IDs when saving image to text or scanning to text using OCR. Mar 12, 2022
@DraganRatkovich DraganRatkovich changed the title Remove page IDs when saving image to text or scanning to text using OCR. Remove page IDs when saving image to text or scanning to text using OCR Mar 12, 2022
@mush42
Collaborator

mush42 commented Mar 12, 2022

Hello @DraganRatkovich

I may agree with removing the page numbering, but the page break character is semantically important, especially for OCR results.

Anyhow, I'll make text exporting customizable. A dialog box will be shown when exporting to plain text or scanning to a text file.
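
For illustration only, here is a minimal Python sketch of what such customizable export could look like. It is not Bookworm's actual code: `TextExportOptions`, `join_pages`, and both option names are hypothetical, and it assumes the page break character is the form feed (`\f`, U+000C).

```python
from dataclasses import dataclass

# Assumption: the "page break character" is the ASCII form feed.
FORM_FEED = "\f"


@dataclass
class TextExportOptions:
    """Hypothetical options a text export dialog might expose."""
    include_page_labels: bool = False   # write "Page N" before each page
    include_page_breaks: bool = True    # separate pages with a form feed


def join_pages(pages, options):
    """Join per-page text into a single string according to the options."""
    parts = []
    for number, text in enumerate(pages, start=1):
        body = text.strip()
        if options.include_page_labels:
            body = f"Page {number}\n{body}"
        parts.append(body)
    # Without page breaks, fall back to a blank line between pages.
    separator = FORM_FEED if options.include_page_breaks else "\n\n"
    return separator.join(parts)


if __name__ == "__main__":
    pages = ["First page text.", "Second page text."]
    clean = TextExportOptions(include_page_labels=False, include_page_breaks=False)
    print(join_pages(pages, clean))
```

With both options off, the output is just the page text separated by blank lines, which is what the request asks for. Leaving `include_page_breaks` on keeps the page boundaries recoverable for OCR results.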

Best
Musharraf

@DraganRatkovich
Collaborator Author

@mush42 Yes, it would be nice if checkboxes appeared during the save process to remove or keep page break symbols, etc.

@DraganRatkovich
Collaborator Author

Hello @mush42,
do you have any news on this issue?

@mush42
Collaborator

mush42 commented Apr 6, 2022

@DraganRatkovich
Yes, the fix is coming.

@DraganRatkovich
Collaborator Author

@mush42 Also, I didn't change the title, but please consider offering these options when saving any document in .txt format, for example from .pdf or .docx, not only when saving an image or scanning to text using OCR.

@DraganRatkovich DraganRatkovich added the Improvement (Improving or fixing an existing feature) and enhancement (New feature or request) labels and removed the Improvement label Jun 14, 2022

2 participants