Convert PDFs to text as a web service.
Forked and extended from with the following changes:
- provide multiple modes to convert
- allow reading PDF from GCS bucket
- run conversion under a cgroup with a memory limit
This repository uses uv. In the following we assume that uv is installed.
Run
uv sync
to install dependencies (including development dependencies) into a new virtualenv in .venv.
uv run hypercorn --bind webserver:app
docker build -t arxiv-pdftotext:latest .
Start the docker service with
docker run --privileged --cgroupns=host -d -p8888:8888 arxiv-pdftotext:latest
Three modes are supported:
- pdftotext: uses the pdftotext tool from the Poppler utilities
- pdf2txt: uses the pdfminer.six tool
- auto: first tries pdftotext and, if that fails, falls back to pdf2txt
The default mode is auto.
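The fallback behaviour of the auto mode can be sketched as follows. This is a minimal illustration, not the service's actual code: the names convert_with_fallback and ConversionError are hypothetical, and the converters are passed in as plain callables so that the sketch stays self-contained.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical names for illustration only; the service's real
# functions and exception types may differ.
Converter = Callable[[str], str]


class ConversionError(Exception):
    """Raised when every converter in the chain has failed."""


def convert_with_fallback(
    pdf_path: str, converters: Sequence[Tuple[str, Converter]]
) -> Tuple[str, str]:
    """Try each (name, converter) pair in order.

    Returns the name of the first converter that succeeds together
    with the text it produced.
    """
    errors: List[str] = []
    for name, convert in converters:
        try:
            return name, convert(pdf_path)
        except Exception as exc:  # any failure moves on to the next converter
            errors.append(f"{name}: {exc}")
    raise ConversionError("; ".join(errors))
```

With this shape, auto corresponds to passing the pdftotext converter first and the pdf2txt converter second, while the single-tool modes pass a one-element list.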
curl -F "file=@tests/hello-world.pdf;" http://localhost:8888/
Passing a different mode:
curl -F "file=@tests/hello-world.pdf;" 'http://localhost:8888/?mode=pdf2txt'
Note that files not ending in .pdf will be rejected.
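A client can check these constraints before sending a request. The sketch below builds the request URL for a given mode and refuses file names that the service would reject anyway. The helper name build_request_url is hypothetical, and the check assumes the service compares the suffix case-sensitively, which the text above does not specify.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8888/"  # address used in the curl examples
MODES = ("pdftotext", "pdf2txt", "auto")  # the three documented modes


def build_request_url(filename: str, mode: str = "auto", base_url: str = BASE_URL) -> str:
    """Return the URL to POST `filename` to, or raise ValueError early."""
    # Assumption: the suffix check is case-sensitive.
    if not filename.endswith(".pdf"):
        raise ValueError(f"{filename!r} does not end in .pdf")
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}, expected one of {MODES}")
    return f"{base_url}?{urlencode({'mode': mode})}"
```

For example, build_request_url("tests/hello-world.pdf", "pdf2txt") yields the same URL as the second curl call.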
By default, if a conversion takes longer than 3 minutes (180 seconds), the conversion process is killed (and, depending on the mode, the next conversion method is tried, see above).
The timeout is configurable by passing convert_timeout=NN as an API parameter.
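One common way to enforce such a limit in Python is the timeout argument of subprocess.run, which kills the child process once the limit is exceeded. The sketch below shows that pattern under the assumption that the converter runs as an external command; the function name run_with_timeout is hypothetical and the service may implement the timeout differently.

```python
import subprocess
import sys
from typing import Optional, Sequence

DEFAULT_TIMEOUT = 180  # seconds, matching the documented default


def run_with_timeout(cmd: Sequence[str], timeout: float = DEFAULT_TIMEOUT) -> Optional[str]:
    """Run `cmd` and return its stdout, or None if it ran past `timeout`.

    A non-zero exit status raises subprocess.CalledProcessError.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return None
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a slow converter: sleeps longer than the 1 second limit.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, timeout=1))
```

Returning None on a timeout lets a caller decide whether to try the next conversion method or report the failure.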
It is possible to convert PDF files in GCS buckets via the /from_bucket endpoint. By default, all buckets are accepted.
If the program is started with the environment variable ACCEPTED_BUCKETS set to a comma-separated list of acceptable buckets, only gs:// URIs with one of these buckets will be accepted.
Note that ACCEPTED_BUCKETS can also be set via .env.
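The bucket filter can be pictured with the following sketch. It is an assumption about how such a check might look, not the service's code: the function names parse_accepted_buckets and is_accepted are hypothetical, and treating an unset or empty variable as "accept all buckets" follows the default described above.

```python
from typing import Optional, Set
from urllib.parse import urlparse


def parse_accepted_buckets(raw: Optional[str]) -> Optional[Set[str]]:
    """Turn the ACCEPTED_BUCKETS value into a set, or None for 'accept all'."""
    if not raw:
        return None
    return {name.strip() for name in raw.split(",") if name.strip()}


def is_accepted(uri: str, accepted: Optional[Set[str]]) -> bool:
    """Check that `uri` is a gs:// URI whose bucket is allowed."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        return False
    # For gs://bucket/path, urlparse puts the bucket name in netloc.
    return accepted is None or parsed.netloc in accepted
```

In a running program the raw value would typically come from os.environ.get("ACCEPTED_BUCKETS").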
Example conversion:
curl -X POST 'http://localhost:8888/from_bucket?uri=gs://some-bucket/some-file.pdf'
We provide a pre-commit configuration file. We strongly recommend activating the pre-commit hook, for example by running
uv tool install pre-commit --with pre-commit-uv --force-reinstall
pre-commit install --install-hooks