Switch from athenapdf to pagedjs-cli for HTML to PDF conversion #394

dhimmel · 2020-11-24T14:37:00Z

Athenapdf has worked well but has two problems:

- It appears to no longer be maintained.
- It requires Docker and has started to hit Docker Hub rate limits on CI (see Docker Hub Rate Limits #393).

From https://www.pagedjs.org/documentation/02-getting-started-with-paged-js/

The command line version of Paged.js uses a headless browser (a browser without any graphical interface) to generate a PDF. It can be run on the server to launch a headless Chromium in fully automated workflows. With the command line version, you don't need to call the Paged.js script in your document: it will be done automatically.

Links:

It looks like pagedjs-cli is installed via npm, with a Dockerfile available such that we could also create an image if needed.

The first step is to see whether pagedjs-cli has conversion fidelity as good as or better than athenapdf.

agitter · 2020-11-25T22:57:39Z

I tried a few quick tests with the pagedjs-cli Docker image from Docker Hub, which corresponds to version 0.0.9. I was able to convert a toy HTML file that had a single header and a single paragraph.

However, it hangs if I try to convert manuscript.html from the rootstock output branch. The output is

✔ Loaded
◷ Rendering: Page 581

where the page count continued increasing indefinitely until I killed it after 20 min. There's a good chance I'm doing something wrong or that it would work better by building the Docker image locally using their latest version of pagedjs.

If anyone wants to test the Docker image, the executable is ./bin/paged, not pagedjs-cli.
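
A hang like the one above is easier to handle in automated runs if the conversion is started under a timeout. The sketch below is a minimal, hypothetical Python wrapper: the name `run_with_timeout` and the 300-second default are illustrative assumptions, not part of pagedjs-cli or Manubot. It runs any command and reports whether it finished in time.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=300.0):
    """Run cmd; return its exit code, or None if it exceeded timeout_s.

    Hypothetical helper for illustration, not part of pagedjs-cli or Manubot.
    """
    try:
        # subprocess.run kills the child process when the timeout expires.
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a converter that never finishes: sleep for 30 seconds.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hanging, timeout_s=1.0))
```

In a real test, `hanging` would be replaced by the actual conversion command, and a `None` result would mark the run as a hang rather than leaving it to be killed by hand.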

dhimmel · 2020-11-28T20:02:25Z

I installed pagedjs-cli 0.1.1 from npm:

pagedjs-cli \
  --page-size=A4 \
  --inputs https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ \
  --output output/pagedjs.pdf

Output:

✔ Loaded
✔ Rendering 10 pages took 1220.1599999971222 milliseconds.
✔ Generated
✔ Processed
✔ Saved to /home/dhimmel/Documents/repos/manubot-rootstock/output/pagedjs.pdf

Here's the rendered PDF: pagedjs.pdf. Compare to the athenapdf PDF here, generated from

rootstock/build/build.sh, lines 72 to 75 in 97b2948:

athenapdf \
  --delay=${MANUBOT_ATHENAPDF_DELAY:-1100} \
  --pagesize=A4 \
  manuscript.html manuscript.pdf
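
To compare the two converters on the same input, it may help to build both invocations in one place. The following is only a sketch in Python: the function names are hypothetical, the flags are the ones shown in this comment, and passing a local file to `--inputs` (rather than the URL used above) is an assumption.

```python
import os
import shlex


def athenapdf_command(html, pdf, env=None):
    """Mirror the athenapdf call from build.sh (lines 72 to 75 in 97b2948)."""
    env = os.environ if env is None else env
    # Like ${MANUBOT_ATHENAPDF_DELAY:-1100}: fall back when unset or empty.
    delay = env.get("MANUBOT_ATHENAPDF_DELAY") or "1100"
    return ["athenapdf", f"--delay={delay}", "--pagesize=A4", html, pdf]


def pagedjs_command(html, pdf):
    """Mirror the pagedjs-cli call shown above (input path is an assumption)."""
    return ["pagedjs-cli", "--page-size=A4", "--inputs", html, "--output", pdf]


if __name__ == "__main__":
    # Print both commands so they can be inspected or pasted into a shell.
    commands = (
        athenapdf_command("manuscript.html", "manuscript.pdf"),
        pagedjs_command("manuscript.html", "manuscript-pagedjs.pdf"),
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Keeping the commands as argument lists avoids shell quoting problems and makes it easy to write the two PDFs to different files for a side-by-side check.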

Opened upstream issues for the problems:

julientaq · 2020-11-29T18:50:32Z

Thanks for opening those issues.
I’ll take a look in the morning, as I believe we already had the issue with MathJax/scripts and fixed it.
Also, the margins issue is something new; we never ran into it before. I’ll check in the morning.

vincerubinetti · 2021-02-08T16:28:57Z

Here's something I hadn't considered until now: writing our own pdf conversion. It actually might not be as hard as we think... Take a look at this library:

https://github.com/Richienb/pdfly/blob/master/index.js

All we really need to do is have a way to programmatically open an instance of chrome (e.g. via Puppeteer) and print a document.

https://github.com/westmonroe/pdf-puppeteer#readme (javascript)
https://github.com/miyakogi/pyppeteer (python)
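
The same idea can also be tried without Puppeteer, since Chrome exposes printing through its `--headless` and `--print-to-pdf` command-line flags. Below is a minimal Python sketch under stated assumptions: the executable names in `CHROME_NAMES` vary by platform and are guesses, the function names are hypothetical, and the script waits for nothing beyond the page load, so scripts that render late (such as math typesetting) may not have finished.

```python
import shutil
import subprocess
from pathlib import Path

# Candidate executable names are assumptions; they vary by platform and install.
CHROME_NAMES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def find_chrome():
    """Return the path of the first Chrome-like executable on PATH, or None."""
    for name in CHROME_NAMES:
        path = shutil.which(name)
        if path:
            return path
    return None


def print_to_pdf_command(chrome, source, pdf):
    """Build a headless Chrome command that prints source (URL or file URI) to pdf."""
    return [chrome, "--headless", "--disable-gpu", f"--print-to-pdf={pdf}", source]


if __name__ == "__main__":
    chrome = find_chrome()
    html = Path("manuscript.html")
    if chrome is None or not html.exists():
        print("Need a Chrome/Chromium executable on PATH and manuscript.html")
    else:
        # Chrome expects a URL, so convert the local path to a file:// URI.
        source = html.resolve().as_uri()
        subprocess.run(print_to_pdf_command(chrome, source, "manuscript.pdf"), check=True)
```

This only reproduces what the browser's own print dialog would give; it adds no page numbers, footnotes, or other paged-media features.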

julientaq · 2021-02-08T17:02:26Z

All we really need to do is have a way to programmatically open an instance of chrome (e.g. via Puppeteer) and print a document.

That depends on how much functionality you’d like to support.

Having a headless browser that generates a PDF is one thing; supporting CSS print features is far more complex (page numbers, cross-references, footnotes, etc.; check the list here).

We’ve been working hard on the footnotes for the last 6 months or so, so we’re a little bit behind our timeline.

There is also a CLI update in the works. The issues you opened are the ones we want to check as soon as the footnotes are shipped.

Which features would you want to use?

vincerubinetti · 2021-02-08T17:06:01Z

Having a headless browser that generates a PDF is one thing; supporting CSS print features is far more complex (page numbers, cross-references, footnotes, etc.; check the list here).

Yes, those features are difficult. Afaik we don't support those features yet, which is why I suggested using Puppeteer. But those features have been requested and are something that the team has wanted to support for a long time, so perhaps using Puppeteer wasn't a good suggestion in the long term. It could be something to switch to in the short term if Athena gives us problems though.

Fwiw, of that feature list, I believe the most requested ones were page numbers and footnotes.

julientaq · 2021-02-08T17:13:16Z

so perhaps using Puppeteer wasn't a good suggestion in the long term

It’s a good starting point to see what’s doable :)
Pagedjs uses Puppeteer to generate the PDF from a pagedjs preview in headless Chrome, so yes, that’s the right idea.

Fwiw, of that feature list, I believe the most requested ones were page numbers and footnotes.

Awesome, we’re almost there with that. Page numbers already work fine (it’s easy to build a table of contents) :)

I’ll come back when our release is testable, so we’ll be able to help you if you wanna try it out.

dhimmel · 2021-02-15T19:10:32Z

Here's something I hadn't considered until now: writing our own pdf conversion

I'd strongly prefer that we piggyback on an existing project, as I don't think we want the responsibility of maintaining a converter. Athena has worked quite well, but is no longer maintained. I think HTML-to-PDF is a common enough conversion task that we should be able to find existing projects with long-term backing. Time might be best spent contributing features to existing projects if there are small blockers for Manubot's use case.

The pagedjs feature list looks impressive. And its affiliation with Cabbage Tree Labs, whose mission is to make publishing more open, is promising.

In my comment above, I linked to three issues that were potential blockers for Manubot to adopt pagedjs. I haven't gotten a reply on any of those issues. @julientaq is there a problem with notifications on the PagedMedia GitLab or insufficient developer bandwidth to respond to user feedback? We'd love to switch to pagedjs, and Manubot seems like an ideal use case for it, but we'll need the above issues looked at as well as more confidence that the project will have the resources to deal with user requests and bug reports in a timely fashion.

dhimmel · 2022-04-24T14:59:00Z

Noting that the source code for pagedjs has been migrated from gitlab.pagedmedia.org to gitlab.coko.foundation, so the issue links above are broken. Here are updated links for these issues (although the original author and date metadata appear to be missing):

Interestingly, there is also a pagedjs GitHub repository at https://github.com/pagedjs/pagedjs. It's not clear whether that repo or https://gitlab.coko.foundation/pagedjs/pagedjs is where contributions should occur. @fchasen (active contributor) might know? Also @fchasen, any ability to look into the issues we posted?

julientaq · 2022-04-24T15:32:20Z

Hi there!

I’m sorry, I completely missed your message (from last year; that’s not really acceptable, I’m sorry!)

So basically, our GitLab got completely screwed up by a couple of attacks and issues, and it was so silent that it wasn’t addressed for a while. And the GitHub was supposed to be a way to handle issues and merge requests coming in from different places, but it’s not working as we’d hoped (so long, interoperability :-/).

So yes, we’re back on Coko’s GitLab, which is the right place to manage your issues.

I’ll check your issues right now!

julientaq · 2022-04-24T15:47:31Z

@dhimmel do you have an account on gitlab.coko.foundation, so I can add you to the issues?

dhimmel · 2022-04-24T16:45:00Z

do you have an account on gitlab.coko.foundation

https://gitlab.coko.foundation/dhimmel

vincerubinetti · 2022-05-02T15:48:43Z

Please see this issue for another strong reason we need to abandon Athena:

greenelab/covid19-review#1133

Key points:

- Athena is using Electron 3.0.5. The current version of Electron is 18.
- Electron 3.0.5 is using Chromium version 66.0.3359.181. The current version of Chrome is ~100.
- Something about combining @media only screen with a complex selector within it is causing an issue with the Chrome 66 print preview.

dhimmel mentioned this issue Nov 24, 2020: Docker Hub Rate Limits #393 (open)

dhimmel mentioned this issue Nov 29, 2020: Use gotenberg for HTML to PDF conversion #396 (open)

dhimmel mentioned this issue Apr 24, 2022: Disable undesired plugins for Athena pdf print #467 (open)
