How to add Docling models as package dependency instead of downloading them at runtime? #61
I slightly disagree with the statement of the request and the comparison with spaCy:
Docling already has a section in the documentation on how to download the weights and use them offline. There is a simple one-line command to download the weights, e.g.

`python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf();'`

or

`python -c 'from huggingface_hub import snapshot_download; snapshot_download("ds4sd/docling-models", revision="v2.0.1")'`

I think the real point of the issue is whether the download of the weights could be simplified further, i.e. with helper scripts or other tools.
Please note that I haven't claimed or suggested that Docling should do so. Thinking about it, perhaps it would help discoverability if it were a fully optional dependency.
Two data scientist colleagues and I searched for this, yet we all missed it. Improving its discoverability may help.
Returning to the original topic of my issue description: I like the approach described and supported by spaCy. Developing extra tools requires maintenance. A requirement for us is that we can initialize these artifacts in a containerized environment. In my experience, running helper scripts is more of a hassle there: they often require installing (Docling) differently, having certain paths writable with sufficient space, having permissions for e.g. execution and networking, and they make bit-for-bit reproducibility difficult.
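For the containerized case, one workable pattern is to bake the weights into the image at build time so nothing is downloaded or written at runtime. A minimal sketch, assuming the `ds4sd/docling-models` repo id mentioned in this thread; the base image and `/opt/docling-models` path are illustrative choices, not anything Docling prescribes:

```dockerfile
# Sketch: bake Docling model weights into the image at build time.
FROM python:3.11-slim

RUN pip install --no-cache-dir docling huggingface_hub

# Download the weights once, during the build, into a fixed directory.
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('ds4sd/docling-models', revision='v2.1.0', \
    local_dir='/opt/docling-models')"

# At runtime, the application points artifacts_path at /opt/docling-models
# and never touches the network.
```

Pinning `revision` keeps builds reproducible; the download lands in a cacheable layer, so rebuilds don't re-fetch the weights.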
We actually don't dislike the idea and had been considering it, but we held off because PyPI doesn't allow pushing model artifacts, and, from our direct experience, hosting Python packages on other registries creates lots of downstream issues (e.g. torch is really not easy to handle in many situations).
Definitely!
spaCy also does it using GitHub. With Git LFS, pushing large binaries is feasible.
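For reference, this is how spaCy distributes its models: each model is a pip-installable wheel attached to a GitHub release, so no PyPI upload is needed (`en_core_web_sm` 3.7.1 shown as a real published example):

```shell
# Install a spaCy model straight from a GitHub release asset.
pip install "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl"
```

A Docling equivalent would let the models appear in a lockfile like any other pinned dependency.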
@dolfim-ibm Should the caller set something like this?

```python
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from huggingface_hub import snapshot_download


class PipelineDocuments:
    def __init__(
        self,
        *,
        # Resolve the locally cached weights; local_files_only=True
        # fails instead of downloading when the cache is missing.
        # Note: as a default argument, this is evaluated once at import.
        path_dir_artifacts: str | None = snapshot_download(
            local_files_only=True,
            repo_id="ds4sd/docling-models",
            revision="v2.1.0",
        ),
    ) -> None:
        pdf_pipeline_options = PdfPipelineOptions(
            artifacts_path=path_dir_artifacts,
            do_ocr=False,
            document_timeout=60,
            ocr_options=EasyOcrOptions(download_enabled=False),
        )
```

It would be helpful for us if Docling were to accept some global configuration option to never download models or anything else at runtime.
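Pending such a global option, a small application-side guard can enforce the no-download policy by failing fast when the weights are not already on disk. A sketch using only the standard library; `resolve_artifacts_path` is a hypothetical helper, not part of the Docling API:

```python
from pathlib import Path


def resolve_artifacts_path(base: str) -> str:
    """Return base if it is an existing, non-empty directory, else raise.

    Hypothetical helper: fails fast instead of letting any library
    fall back to downloading model weights at runtime.
    """
    path = Path(base)
    if not path.is_dir() or not any(path.iterdir()):
        raise FileNotFoundError(
            f"Model artifacts not found at {path}; "
            "provision them (e.g. in the image or a volume) before starting."
        )
    return str(path)
```

The resolved path would then be passed as `artifacts_path`, with `download_enabled=False` on the OCR options as in the snippet above.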
Yes, the argument is correct. |
See https://huggingface.co/docs/hub/spacy#using-existing-models for how this is done with spaCy. I didn't find wheels or tarballs for these models.