Switch from athenapdf to pagedjs-cli for HTML to PDF conversion #394

Open
dhimmel opened this issue Nov 24, 2020 · 13 comments

Comments

@dhimmel
Member

dhimmel commented Nov 24, 2020

Athenapdf has worked well but has two problems:

  1. it appears to no longer be maintained
  2. it requires Docker and has started to hit Docker Hub rate limits on CI (see Docker Hub Rate Limits #393)

From https://www.pagedjs.org/documentation/02-getting-started-with-paged-js/

The command line version of Paged.js uses a headless browser (a browser without any graphical interface) to generate a PDF. It can be run on the server to launch a headless Chromium in fully automated workflows. With the command line version, you don't need to call the Paged.js script in your document: it will be done automatically.

Links:

It looks like pagedjs-cli is installed via npm, with a Dockerfile available such that we could also create an image if needed.

The first step is to see whether pagedjs-cli has conversion fidelity as good as or better than athenapdf.

@agitter
Member

agitter commented Nov 25, 2020

I tried a few quick tests with the pagedjs-cli Docker image from DockerHub, which corresponds to version 0.0.9. I was able to convert a toy HTML file that had a single header and a single paragraph.

However, it hangs if I try to convert manuscript.html from the rootstock output branch. The output is

✔ Loaded
◷ Rendering: Page 581

where the page count continued increasing indefinitely until I killed it after 20 min. There's a good chance I'm doing something wrong or that it would work better by building the Docker image locally using their latest version of pagedjs.

If anyone wants to test the Docker image, the executable is ./bin/paged not pagedjs-cli.
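
When testing a converter that may never finish, as in the run above, it can help to cap the wall-clock time so a stuck render does not block a terminal or CI job. The following is a minimal sketch, assuming GNU coreutils `timeout` is available; `run_with_limit` is a hypothetical helper name, and the conversion command passed to it is whatever is being tested.

```shell
#!/usr/bin/env bash
# Run a command with a wall-clock limit so a stuck conversion cannot hang.
# Relies on `timeout` from GNU coreutils, which exits with status 124
# when the limit is reached.
run_with_limit() {
  local limit_seconds="$1"
  shift
  local status=0
  timeout "${limit_seconds}" "$@" || status=$?
  if [ "${status}" -eq 124 ]; then
    echo "command exceeded ${limit_seconds}s and was stopped" >&2
  fi
  return "${status}"
}

# Hypothetical usage: allow the conversion five minutes before giving up.
# run_with_limit 300 <conversion command and its arguments>
```

The helper passes through the exit status of the wrapped command, so a caller can distinguish a timeout (124) from an ordinary conversion failure.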

@dhimmel
Member Author

dhimmel commented Nov 28, 2020

I installed pagedjs-cli 0.1.1 from npm:

pagedjs-cli \
  --page-size=A4 \
  --inputs https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ \
  --output output/pagedjs.pdf

Output:

✔ Loaded
✔ Rendering 10 pages took 1220.1599999971222 milliseconds.
✔ Generated
✔ Processed
✔ Saved to /home/dhimmel/Documents/repos/manubot-rootstock/output/pagedjs.pdf

Here's the rendered PDF: pagedjs.pdf. Compare to athenapdf PDF here generated from

rootstock/build/build.sh

Lines 72 to 75 in 97b2948

athenapdf \
--delay=${MANUBOT_ATHENAPDF_DELAY:-1100} \
--pagesize=A4 \
manuscript.html manuscript.pdf
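
To compare the two converters side by side, the build script could pick one at run time. This is only a sketch: `MANUBOT_PDF_CONVERTER` and `pdf_command` are hypothetical names, not existing Manubot options, and passing a local file to `--inputs` is an assumption (the pagedjs-cli run above used a URL). The flags themselves are the ones shown in the two invocations above.

```shell
#!/usr/bin/env bash
# Print the PDF conversion command for the chosen converter.
# MANUBOT_PDF_CONVERTER is a hypothetical variable used for illustration.
pdf_command() {
  local converter="${MANUBOT_PDF_CONVERTER:-athenapdf}"
  case "${converter}" in
    athenapdf)
      # Same flags as build.sh; the delay falls back to 1100 when unset.
      echo "athenapdf --delay=${MANUBOT_ATHENAPDF_DELAY:-1100} --pagesize=A4 manuscript.html manuscript.pdf"
      ;;
    pagedjs)
      # Flags from the pagedjs-cli 0.1.1 run; a local input file is assumed.
      echo "pagedjs-cli --page-size=A4 --inputs manuscript.html --output manuscript.pdf"
      ;;
    *)
      echo "unknown converter: ${converter}" >&2
      return 1
      ;;
  esac
}
```

Printing the command rather than running it keeps the selection logic easy to check in isolation; a build script would evaluate the result only after the converter name has been validated.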

Opened upstream issues for the problems:

@julientaq

Thanks for opening those issues.
I’ll check in the morning, as I believe we already had the issue with MathJax/scripts and fixed it.
Also, the margins issue is new; we never saw it before. I’ll check that in the morning too.

@vincerubinetti
Collaborator

vincerubinetti commented Feb 8, 2021

Here's something I hadn't considered until now: writing our own pdf conversion. It actually might not be as hard as we think... Take a look at this library:

https://github.com/Richienb/pdfly/blob/master/index.js

All we really need to do is have a way to programmatically open an instance of chrome (e.g. via Puppeteer) and print a document.

https://github.com/westmonroe/pdf-puppeteer#readme (javascript)
https://github.com/miyakogi/pyppeteer (python)
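
For a sense of how small the simplest version of this could be, headless Chrome can print a page to PDF directly through its `--print-to-pdf` flag, without Puppeteer. The sketch below is an illustration under assumptions: `find_chrome` and `html_to_pdf` are hypothetical helper names, the binary names tried are common ones that vary by platform, and nothing here waits for scripts such as MathJax to finish rendering.

```shell
#!/usr/bin/env bash
# Find the first Chrome-family binary on PATH; names vary by platform.
find_chrome() {
  local candidate
  for candidate in chromium chromium-browser google-chrome google-chrome-stable; do
    if command -v "${candidate}" >/dev/null 2>&1; then
      echo "${candidate}"
      return 0
    fi
  done
  return 1
}

# Print a page to PDF with headless Chrome's built-in --print-to-pdf flag.
html_to_pdf() {
  local input="$1" output="$2" chrome
  chrome="$(find_chrome)" || { echo "no Chrome binary found" >&2; return 1; }
  "${chrome}" --headless --disable-gpu --print-to-pdf="${output}" "${input}"
}

# Hypothetical usage:
# html_to_pdf manuscript.html manuscript.pdf
```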

@julientaq

All we really need to do is have a way to programmatically open an instance of chrome (e.g. via Puppeteer) and print a document.

That depends on how much functionality you’d like to support.

Having a headless browser that generates a PDF is one thing; supporting CSS print features is far more complex (page numbers, cross references, footnotes, etc.; check the list here).

We’ve been working hard on the footnotes for the last 6 months or so, so we’re a little bit behind our timeline.

Especially as there is a CLI update in the works. The issues you opened are the ones we want to check as soon as the footnotes are shipped.

Which features would you want to use?

@vincerubinetti
Collaborator

vincerubinetti commented Feb 8, 2021

Having a headless browser that generates a PDF is one thing; supporting CSS print features is far more complex (page numbers, cross references, footnotes, etc.; check the list here).

Yes, those features are difficult. Afaik we don't support those features yet, which is why I suggested using Puppeteer. But those features have been requested and are something that the team has wanted to support for a long time, so perhaps using Puppeteer wasn't a good suggestion in the long term. It could be something to switch to in the short term if Athena gives us problems though.

Fwiw, of that feature list, I believe the most requested ones were page numbers and footnotes.

@julientaq

so perhaps using Puppeteer wasn't a good suggestion in the long term

It’s a good starting point to see what’s doable :)
Pagedjs uses Puppeteer to generate the PDF from a pagedjs preview in headless Chrome, so yes, that’s the right idea.

Fwiw, of that feature list, I believe the most requested ones were page numbers and footnotes.

Awesome, we’re almost there with that (page numbers already work fine, and it’s easy to build a table of contents) :)

I’ll come back when our release is testable, so we’ll be able to help you if you wanna try it out.

@dhimmel
Member Author

dhimmel commented Feb 15, 2021

Here's something I hadn't considered until now: writing our own pdf conversion

I'd strongly prefer that we piggyback on an existing project, as I don't think we want the responsibility of maintaining a converter. Athena has worked quite well, but is no longer maintained. I think HTML-to-PDF is a common enough conversion task that we should be able to find existing projects with long-term backing. Time might be best spent contributing features to existing projects if there are small blockers for Manubot's use case.

The pagedjs feature list looks impressive. And its affiliation with Cabbage Tree Labs, whose mission is to make publishing more open, is promising.

In my comment above, I linked to three issues that were potential blockers for Manubot to adopt pagedjs. I haven't gotten a reply on any of those issues. @julientaq is there a problem with notifications on the PagedMedia GitLab or insufficient developer bandwidth to respond to user feedback? We'd love to switch to pagedjs, and Manubot seems like an ideal use case for it, but we'll need the above issues looked at, as well as more confidence that the project will have the resources to deal with user requests and bug reports in a timely fashion.

@dhimmel
Member Author

dhimmel commented Apr 24, 2022

Noting that the source code for pagedjs has been migrated from gitlab.pagedmedia.org to gitlab.coko.foundation, so the issue links above are broken. Here are updated links for these issues (although the original author and date metadata appear to be missing):

Interestingly, there is also a pagedjs github at https://github.com/pagedjs/pagedjs. Not clear if that repo or https://gitlab.coko.foundation/pagedjs/pagedjs is where contributions should occur. @fchasen (active contributor) might know? Also @fchasen any ability to look into the issues we posted?

@julientaq

Hi there!

I’m sorry, I completely missed your message (from last year, which is not really acceptable, I’m sorry!)

So basically, our GitLab got completely screwed up by a couple of attacks and issues, and it was so silent that it wasn’t addressed for a while. And the GitHub was supposedly a way to handle issues and merge requests coming in from different places, but it’s not working as we’d hoped (so long, interoperability :-/).

So yes, we’re back on Coko’s GitLab, which is the right place to manage your issues.

I’ll check your issues right now!

@julientaq

@dhimmel do you have an account on gitlab.coko.foundation, so I can add you to the issues?

@dhimmel
Member Author

dhimmel commented Apr 24, 2022

do you have an account on gitlab.coko.foundation

https://gitlab.coko.foundation/dhimmel

@vincerubinetti
Collaborator

Please see this issue for another strong reason we need to abandon Athena:

greenelab/covid19-review#1133

Key points:

Athena is using Electron 3.0.5. The current version of Electron is 18. Electron 3.0.5 is using Chromium version 66.0.3359.181. The current version of Chrome is ~100. Something about combining @media only screen with a complex selector within it is causing an issue with the Chrome 66 print preview.
