Use gotenberg for HTML to PDF conversion #396

Open
dhimmel opened this issue Nov 29, 2020 · 9 comments

Comments

@dhimmel
Member

dhimmel commented Nov 29, 2020

Originally mentioned by @agitter at #393 (comment), gotenberg is a:

Docker-powered stateless API for converting HTML, Markdown and Office documents to PDF

Since we're looking at replacing athenapdf with pagedjs-cli in #394, it also makes sense to evaluate gotenberg.

Links:

@dhimmel
Member Author

dhimmel commented Nov 29, 2020

One challenge is that the Docker image is large: 844 MB for thecodingmachine/gotenberg:6.3.1, compared with 291 MB for arachnysdocker/athenapdf:2.16.0.
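
For anyone wanting to reproduce a size comparison like this locally, here is a minimal sketch. It assumes both images have already been pulled, and it reports the uncompressed on-disk size from `docker image inspect`, which may differ from the figures quoted above (it is not stated how those were measured). The function names `bytes_to_mb` and `image_size_mb` are hypothetical.

```shell
# Convert a byte count to whole decimal megabytes (1 MB = 1,000,000 bytes).
bytes_to_mb() {
  awk -v bytes="$1" 'BEGIN { printf "%.0f MB\n", bytes / 1000000 }'
}

# Report the local size of an image that is already present on this machine.
# Assumption: the image was pulled beforehand with `docker pull`.
image_size_mb() {
  bytes_to_mb "$(docker image inspect --format '{{.Size}}' "$1")"
}

# Example calls (commented out because they need Docker and the images):
# image_size_mb thecodingmachine/gotenberg:6.3.1
# image_size_mb arachnysdocker/athenapdf:2.16.0
```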

@dhimmel
Member Author

dhimmel commented Nov 29, 2020

Conversion from a URL

First, run the Docker container:

docker run --rm --publish 3000:3000 thecodingmachine/gotenberg:6.3

Second, make an API call to export the manuscript:

curl --request POST \
    --url http://localhost:3000/convert/url \
    --header 'Content-Type: multipart/form-data' \
    --form remoteURL=https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ \
    --output output/gotenberg.pdf

The result, gotenberg.pdf, looks good (similar to athenapdf).
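
The two manual steps above could be wrapped in a few shell functions, roughly as follows. This is a sketch under stated assumptions rather than a tested workflow: the image tag, port, and `/convert/url` endpoint come from the commands above, while the function names, the container name `gotenberg-sketch`, and the retry settings are hypothetical choices. The retry flags are there only so that curl keeps trying while the container is still starting.

```shell
# Values taken from the commands above.
GOTENBERG_IMAGE="thecodingmachine/gotenberg:6.3"
GOTENBERG_PORT=3000

# Print the URL-conversion endpoint for a given host port (default: 3000).
gotenberg_endpoint() {
  printf 'http://localhost:%s/convert/url\n' "${1:-$GOTENBERG_PORT}"
}

# Start the container in the background; --rm removes it once it is stopped.
# The container name is a hypothetical choice for this sketch.
gotenberg_start() {
  docker run --rm --detach --name gotenberg-sketch \
    --publish "${GOTENBERG_PORT}:3000" "$GOTENBERG_IMAGE"
}

# Convert a remote URL ($1) to a PDF written to the output path ($2).
gotenberg_convert_url() {
  mkdir -p "$(dirname "$2")"
  curl --fail --silent --show-error \
    --retry 10 --retry-delay 1 --retry-connrefused \
    --request POST \
    --url "$(gotenberg_endpoint)" \
    --form "remoteURL=$1" \
    --output "$2"
}

# Stop (and, because of --rm, remove) the container.
gotenberg_stop() {
  docker stop gotenberg-sketch
}

# Example session (commented out because it needs Docker and network access):
# gotenberg_start
# gotenberg_convert_url https://manubot.github.io/rootstock/ output/gotenberg.pdf
# gotenberg_stop
```

Creating the output directory inside the function avoids a curl write error when `output/` does not exist yet.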

@castedo

castedo commented Oct 6, 2022

@dhimmel It looks like manubot has settled on using WeasyPrint for HTML -> PDF conversion. Is this correct?

In my current manubot-like workflow (but not manubot) I use pandoc to generate JATS XML from markdown and then I generate HTML and PDF from JATS XML as an independent stage. I'm starting to think generating both HTML and PDF from the same JATS XML is a mistake. I'm now considering doing just JATS XML -> HTML -> PDF using WeasyPrint.

Any advice?

(It's a long explanation why I'm not doing markdown directly to HTML).

@agitter
Member

agitter commented May 30, 2023

I'm revisiting this after @vincerubinetti pointed out that athenapdf has been archived in #254 (comment)

It may be time to look more seriously into pagedjs-cli versus gotenberg as an athenapdf replacement. Based on @dhimmel's old comment above, it looks like gotenberg worked in initial testing. The latest gotenberg image, 7.8.3, is now somewhat smaller at 644 MB.

@castedo

castedo commented May 30, 2023

FWIW, I've gone pretty far down the WeasyPrint path and gotten good results, in large part because I'm careful to use fairly old HTML/CSS features. An example is the PDF link off this page:
https://popgen.es/H5NOlCVM9P5Vv4LbeuwJsaME8kM/1.1/
The PDF is by WeasyPrint from a subset of the webpage content.

I have decoupled much of the HTML/CSS implementation from the above example into a separate project:
https://gitlab.com/castedo/printstrap/
to help others do similarly with WeasyPrint.

In particular you might be interested in the article.html example on the article branch:
https://gitlab.com/castedo/printstrap/-/blob/article/example/article.html

@castedo

castedo commented May 30, 2023

Also, a quick clarification: the article.html example in the article branch is actually much more advanced than the live example I give above on popgen.es today. The article.html example is a 2-column format, kind of like eLife articles, but is fully responsive, with the PDF corresponding directly to the HTML content at a particular screen width.

@castedo

castedo commented May 30, 2023

This discussion might be helpful in evaluating Chromium vs not:

singlesourcepub/community#49

I've partly gone down the WeasyPrint path because I hesitate to rely on Chromium. I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.

@vincerubinetti
Collaborator

I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.

I think Chromium is probably necessary. We sometimes need to rely on "newer" CSS properties, like overflow-wrap and word-break, which are not supported in WeasyPrint. More importantly, we sometimes need to rely on JavaScript execution, such as the way the attributes plugin merges table cells together. You could argue that we should find ways to do things statically at build time as much as possible, without JavaScript, but that would be a significant effort.

@vincerubinetti
Collaborator

Maybe we should also emphasize somewhere in the docs that, as a last resort, one can manually print to PDF from the HTML version in any major browser.
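
For those who prefer a command line over the print dialog, a scripted counterpart could look roughly like the sketch below. It relies on the `--headless` and `--print-to-pdf` flags of Chromium-family browsers. The binary names it searches for are common defaults and an assumption, and the function names `find_browser` and `print_to_pdf` are hypothetical. The output will not necessarily match what the print dialog produces.

```shell
# Print the path of the first candidate command found on PATH.
# Returns non-zero if none of the candidates exist.
find_browser() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      command -v "$candidate"
      return 0
    fi
  done
  return 1
}

# Print the page at URL ($1) to a PDF file ($2) using headless Chromium.
# Assumption: one of these binary names exists on the system.
print_to_pdf() {
  browser="$(find_browser chromium chromium-browser google-chrome google-chrome-stable)" || {
    echo "no Chromium-family browser found on PATH" >&2
    return 1
  }
  "$browser" --headless --print-to-pdf="$2" "$1"
}

# Example call (commented out because it needs a browser and network access):
# print_to_pdf https://manubot.github.io/rootstock/ manuscript.pdf
```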
