Use gotenberg for HTML to PDF conversion #396

dhimmel · 2020-11-29T13:41:50Z

Originally mentioned by @agitter at #393 (comment), gotenberg is a:

Docker-powered stateless API for converting HTML, Markdown and Office documents to PDF

Since we're looking at replacing athenapdf with pagedjs-cli in #394, it also makes sense to evaluate gotenberg.

dhimmel · 2020-11-29T13:45:16Z

One challenge is that the Docker image is large: 844 MB for thecodingmachine/gotenberg:6.3.1, compared to 291 MB for arachnysdocker/athenapdf:2.16.0.
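To reproduce a size comparison like this locally, something along these lines might work. It is a minimal sketch: `image_size_mb` is a hypothetical helper name, it assumes the image has already been pulled, and the number it prints should only roughly match what `docker images` reports.

```shell
# Hypothetical helper: print a local Docker image's size in decimal megabytes.
# Assumes the image is already present locally (e.g. after `docker pull`).
image_size_mb() {
    local image="$1" bytes
    # `docker image inspect` reports the image size in bytes.
    bytes="$(docker image inspect --format '{{.Size}}' "$image")" || return 1
    echo "$(( bytes / 1000000 ))"
}
```

Usage: `image_size_mb thecodingmachine/gotenberg:6.3.1` and `image_size_mb arachnysdocker/athenapdf:2.16.0`.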

dhimmel · 2020-11-29T14:05:04Z

conversion from a URL

First, run the Docker container:

docker run --rm --publish 3000:3000 thecodingmachine/gotenberg:6.3

Second, make an API call to export the manuscript:

curl --request POST \
    --url http://localhost:3000/convert/url \
    --header 'Content-Type: multipart/form-data' \
    --form remoteURL=https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ \
    --output output/gotenberg.pdf

Result at gotenberg.pdf looks good (similar to athenapdf).
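The API call above could be wrapped in a small shell function for reuse. This is only a sketch, assuming a gotenberg 6 container is already listening on port 3000; `convert_url` and `GOTENBERG_URL` are hypothetical names, not part of gotenberg. The curl retry flags are there because the container may need a moment before it accepts connections, though they may not cover every startup race.

```shell
# Hypothetical helper: convert a remote URL to PDF via a running gotenberg 6 container.
# GOTENBERG_URL is an assumed environment variable; it defaults to localhost:3000.
convert_url() {
    local remote_url="$1" output_pdf="$2"
    local base="${GOTENBERG_URL:-http://localhost:3000}"
    # Make sure the output directory exists before curl writes to it.
    mkdir -p "$(dirname "$output_pdf")"
    # --form makes curl send multipart/form-data; --fail turns HTTP errors into a nonzero exit.
    curl --fail --silent --show-error \
        --retry 10 --retry-delay 1 --retry-connrefused \
        --request POST \
        --url "$base/convert/url" \
        --form "remoteURL=$remote_url" \
        --output "$output_pdf"
}
```

Usage: `convert_url https://manubot.github.io/rootstock/v/97b294802ffcd39071b6e5b8ab59f60faf4be118/ output/gotenberg.pdf`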

castedo · 2022-10-06T21:30:55Z

@dhimmel It looks like manubot has settled on using WeasyPrint for HTML -> PDF conversion. Is this correct?

In my current manubot-like workflow (but not manubot) I use pandoc to generate JATS XML from markdown and then I generate HTML and PDF from JATS XML as an independent stage. I'm starting to think generating both HTML and PDF from the same JATS XML is a mistake. I'm now considering doing just JATS XML -> HTML -> PDF using WeasyPrint.

Any advice?

(It's a long explanation why I'm not doing markdown directly to HTML).

agitter · 2023-05-30T21:16:43Z

I'm revisiting this after @vincerubinetti pointed out in #254 (comment) that athenapdf has been archived.

It may be time to look more seriously into pagedjs-cli versus gotenberg as an athenapdf replacement. Based on @dhimmel's old comment above, it looks like gotenberg worked in initial testing. The latest gotenberg image, 7.8.3, is now somewhat smaller at 644 MB.
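Note that the gotenberg 7 API differs from the version 6 call shown earlier in the thread. As far as I know, the 7.x image is published as gotenberg/gotenberg and URL conversion moved to the `/forms/chromium/convert/url` route with a `url` form field. The following is an untested sketch under those assumptions; `convert_url_v7` and `GOTENBERG_URL` are hypothetical names.

```shell
# Assumed way to start the container first (not run by this function):
#   docker run --rm --publish 3000:3000 gotenberg/gotenberg:7.8.3

# Hypothetical helper: convert a remote URL to PDF via a running gotenberg 7 container.
convert_url_v7() {
    local remote_url="$1" output_pdf="$2"
    local base="${GOTENBERG_URL:-http://localhost:3000}"
    # Make sure the output directory exists before curl writes to it.
    mkdir -p "$(dirname "$output_pdf")"
    # In gotenberg 7 the form field is `url` rather than `remoteURL`.
    curl --fail --silent --show-error \
        --request POST \
        --url "$base/forms/chromium/convert/url" \
        --form "url=$remote_url" \
        --output "$output_pdf"
}
```

Usage: `convert_url_v7 https://manubot.github.io/rootstock/ output/gotenberg.pdf`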

castedo · 2023-05-30T21:35:36Z

FWIW, I've gone pretty far down the WeasyPrint path and gotten good results, in large part because I'm careful to use fairly old HTML/CSS features. An example is the PDF link off this page:
https://popgen.es/H5NOlCVM9P5Vv4LbeuwJsaME8kM/1.1/
The PDF is by WeasyPrint from a subset of the webpage content.

I have decoupled much of the HTML/CSS implementation from the above example into a separate project:
https://gitlab.com/castedo/printstrap/
to help others do similarly with WeasyPrint.

In particular you might be interested in the article.html example on the article branch:
https://gitlab.com/castedo/printstrap/-/blob/article/example/article.html

castedo · 2023-05-30T21:38:43Z

Also quick clarification: the article.html example in the article branch is actually much more advanced than the live example I give above on popgen.es today. The article.html example is a 2-column format kind of like eLife articles but is fully responsive with the PDF corresponding directly to the HTML content at a particular screen width.

castedo · 2023-05-30T21:50:19Z

This discussion might be helpful in evaluating Chromium vs not:

singlesourcepub/community#49

I've partly gone down the WeasyPrint path because I hesitate to rely on Chromium. I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.

vincerubinetti · 2023-05-30T22:07:54Z

> I consider it an open question whether Chromium is the right tool for specialized HTML -> PDF conversion where the HTML is highly constrained and not really a full web page of a website.

I think Chromium is probably necessary. We sometimes need to rely on "newer" CSS properties, like overflow-wrap and word-break, which are not supported in WeasyPrint. More importantly, we sometimes need to rely on JavaScript execution, for example the way the attributes plugin merges table cells together. You could argue that we should find ways to do things statically at build time as much as possible, without JavaScript, but it would be a significant effort.

vincerubinetti · 2023-05-31T17:23:46Z

Maybe we should also emphasize somewhere in the docs that, as a last resort, one can manually print to PDF from the HTML version in any major browser.
