Google drive pdf viewer #248
Comments
Actually, the text is rendered into images, and a hidden text layer is provided for selection. With that approach the hidden text layer may not be very accurate, which is why a lot of styles can be dropped.
OK, I see now. It isn't as good as it first appeared. Thanks. This library is better. I only miss lighter HTML code and a fix for the duplicated-fonts problem. ;-)
High optimization without losing accuracy. I've managed to reduce the number of HTML elements in the output. I've found a pattern that could be useful; I can't get this kind of improvement just by changing the library's parameters. 1 - Find neighboring divs (each corresponding to a line) that have the same "m x h fs fc sc ls ws" classes and merge them. The text then flows through the div's space and you get a similar result. I'll test it in the Chrome debugger. It may be difficult to get the line-height or letter-spacing right, but I think the improvement would be tremendous. What do you think?
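A minimal sketch of the merging idea in item 1, not code from pdf2htmlEX. It assumes each line div has already been parsed into a hypothetical `Line` record holding its class names and text, and that joining merged lines with a single space is acceptable. It does not address the line-height or width questions raised in this thread.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Line:
    # Hypothetical record for one line div, e.g.
    # classes = ("m0", "x1", "h2", "fs3", "fc0", "sc0", "ls0", "ws0")
    classes: tuple
    text: str


def merge_adjacent_lines(lines):
    """Merge runs of neighboring lines that share exactly the same classes."""
    merged = []
    # groupby only groups consecutive items, so non-adjacent lines with
    # the same classes are intentionally left as separate divs.
    for classes, group in groupby(lines, key=lambda line: line.classes):
        text = " ".join(line.text for line in group)
        merged.append(Line(classes, text))
    return merged
```

Only consecutive lines are merged, so a heading between two paragraphs with identical styling still splits them into two divs.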
(1) and part of (2) should be done with I'd actually been planning to do 3 and 4, at the cost of some inaccuracy. Not finished yet.
@Toneti777 There are other concerns of item 2,
@Toneti777 But if you managed to optimize the output a lot with items 1 and 2, it sounds like a bug in pdf2htmlEX: it is producing unnecessary span elements. In that case, can you please file a new bug with sample files?
For 1), I have one problem: how to calculate the width of the line when merging two or more lines. The library builds one div per line, and I think that could be improved. I tried it manually and the result is perfect, but I'm lost on how to find the width in an automatic process; maybe it's easier inside your library. For 2), I tried deleting the span elements with a very low margin-left value. For my PDFs, the library puts a lot of span elements between the letters of each word on every line. Most of them have a low value, and that could be handled with letter-spacing or word-spacing on the div element instead. I get a much less complex HTML output.
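A rough sketch of the span cleanup described in item 2, again not pdf2htmlEX code. It assumes a line has been parsed into a list of tokens where a `str` is a text run and a `float` is the margin-left offset of a span between runs. The function name `collapse_small_gaps` and the `threshold` value are hypothetical. Folding the dropped offsets back into letter-spacing or word-spacing is left out; the function only reports how much offset was discarded.

```python
def collapse_small_gaps(tokens, threshold=0.5):
    """Drop offset tokens smaller than `threshold` and join the text around them.

    Returns the simplified token list and the sum of the dropped offsets,
    which is the positioning error this simplification introduces.
    """
    result = []
    dropped = 0.0
    for token in tokens:
        if isinstance(token, str):
            if result and isinstance(result[-1], str):
                # Previous gap was dropped: glue this run onto the last one.
                result[-1] += token
            else:
                result.append(token)
        elif abs(token) < threshold:
            dropped += token  # small offset: remove the span
        else:
            result.append(token)  # large offset: keep the span
    return result, dropped
```

For example, `["He", 0.1, "llo", 3.0, "wor", -0.2, "ld"]` collapses to `["Hello", 3.0, "world"]`, trading three spans for one at the cost of a small accumulated offset.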
Maybe you can try a larger value for
I think this is a duplicate of #56, so I'm closing this one.
In recent days I have opened some PDF files attached to e-mails in my Gmail account. The online viewer surprised me.
They have a very good HTML viewer!
I have inspected some files and would suggest some improvements to this library based on them.
They use
for one or more lines of text without additional elements, resulting in more efficient and lighter output.
I don't know how it works, or whether you can study their process or achieve the same results.