How to add Docling models as package dependency instead of downloading them at runtime? #61

Open
sanmai-NL opened this issue Dec 6, 2024 · 6 comments

@sanmai-NL

See https://huggingface.co/docs/hub/spacy#using-existing-models for how this is done with spaCy. I didn't find wheels or tarballs for the Docling models.

@dolfim-ibm
Contributor

I slightly disagree with the framing of the request and the comparison with spaCy:

  • spaCy does not have the models as a package dependency
  • the models are (usually) downloaded at runtime with spacy.load("en_core_web_sm")
  • additionally, spaCy also allows downloading the model weights as packages, but these are not a direct dependency of spaCy
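
To illustrate the last bullet: a spaCy model installed as a package is an ordinary importable Python module, so its presence can be checked without touching the network. The following is a minimal sketch using only the standard library; the helper name `is_model_packaged` is hypothetical and not part of spaCy or Docling.

```python
import importlib.util


def is_model_packaged(name: str) -> bool:
    """Return True if `name` is importable as a top-level installed package."""
    # find_spec only consults the import system for a top-level name;
    # it does not import the package and does not download anything.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "en_core_web_sm" is the model name mentioned in this thread.
    print(is_model_packaged("en_core_web_sm"))
```

If the check succeeds, the model was installed like any other package (for example from a wheel), which is the property the original request is about.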

Docling already has a section in the documentation on how to download the weights and use them offline. There is a simple one-line command to download the weights, e.g.

python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf();'

or

python -c 'from huggingface_hub import snapshot_download; snapshot_download("ds4sd/docling-models", revision="v2.0.1")'
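
For build scripts, the same download can be wrapped in a small function that pins the revision and writes to a fixed directory. This is only a sketch: the helper names `snapshot_kwargs` and `prefetch` are hypothetical, and the directory layout Docling expects under `artifacts_path` is not covered here. `repo_id`, `revision` and `local_dir` are existing parameters of `huggingface_hub.snapshot_download`.

```python
from pathlib import Path


def snapshot_kwargs(repo_id: str, revision: str, local_dir: Path) -> dict:
    """Build the keyword arguments passed to huggingface_hub.snapshot_download."""
    return {"repo_id": repo_id, "revision": revision, "local_dir": str(local_dir)}


def prefetch(repo_id: str, revision: str, local_dir: Path) -> str:
    """Download one pinned revision into local_dir and return the resulting path."""
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(**snapshot_kwargs(repo_id, revision, local_dir))
```

A call such as `prefetch("ds4sd/docling-models", "v2.0.1", Path("/opt/docling-models"))` (the target path is an arbitrary example) would then run once at image build time rather than at runtime.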

I think the real point of the issue is whether the download of the weights could be simplified further, for example by using helper scripts or other tools.

@sanmai-NL
Author

sanmai-NL commented Dec 11, 2024

> I slightly disagree with the framing of the request and the comparison with spaCy:
>
>   • spaCy does not have the models as a package dependency
>   • the models are (usually) downloaded at runtime with spacy.load("en_core_web_sm")
>   • additionally, spaCy also allows downloading the model weights as packages, but these are not a direct dependency of spaCy

Please note that I haven't claimed that, nor suggested that Docling should do so. Thinking about it, perhaps it would help discoverability if it were a fully optional dependency.

> Docling already has a section in the documentation on how to download the weights and use them offline. There is a simple one-line command to download the weights, e.g.
>
> python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf();'
>
> or
>
> python -c 'from huggingface_hub import snapshot_download; snapshot_download("ds4sd/docling-models", revision="v2.0.1")'

Two data scientist colleagues and I searched for this and still missed it. Improving discoverability would help.

> I think the real point of the issue is whether the download of the weights could be simplified further, i.e. using helper scripts or other tools for it.

Returning to the original topic of my issue description: I like the approach described and supported by spaCy. Developing extra tools requires maintenance. A requirement for us is that we can initialize these artifacts in a containerized environment. In my experience, running helper scripts is more of a hassle there: they often require installing Docling differently, having certain paths writable and with sufficient space, and having certain permissions (e.g., for execution and networking), and they make bit-for-bit reproducibility difficult.
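
To make the reproducibility concern concrete: one way to check that two container builds ended up with identical artifacts is to hash the artifact directory. This is a minimal standard-library sketch, not something Docling provides; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def _file_sha256(path: Path, chunk_size: int = 1 << 20) -> bytes:
    """Hash one file in chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def artifact_digest(root: Path) -> str:
    """Return one SHA-256 hex digest over relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    # Sort so the result does not depend on filesystem iteration order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(_file_sha256(path))
    return digest.hexdigest()
```

Comparing the digest printed during the image build against a value recorded earlier would flag any drift in the downloaded files.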

@dolfim-ibm
Contributor

We actually don't dislike the idea and have been thinking about it, but we have held off because PyPI doesn't allow pushing model artifacts and, in our direct experience, hosting Python packages on other registries creates many downstream issues (e.g. torch is really not easy to handle in many situations).

> [..] improve discoverability [..]

Definitely!
Just to add to it, we even have an example Dockerfile which does it all already: https://github.com/DS4SD/docling/blob/main/Dockerfile

@sanmai-NL
Author

sanmai-NL commented Dec 11, 2024

spaCy also does it using GitHub. With Git LFS, pushing large binaries is feasible.

@sanmai-NL
Author

sanmai-NL commented Dec 18, 2024

@dolfim-ibm Should the caller set artifacts_path explicitly like so, to prevent automatic downloading at runtime?

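from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from huggingface_hub import snapshot_download


class PipelineDocuments:
    def __init__(self, *, path_dir_artifacts: str | None = None) -> None:
        if path_dir_artifacts is None:
            # Resolve the snapshot from the local cache only; this raises
            # if the files were not downloaded beforehand.
            path_dir_artifacts = snapshot_download(
                local_files_only=True,
                repo_id="ds4sd/docling-models",
                revision="v2.1.0",
            )
        # Keep the options on the instance so the pipeline can use them later.
        self.pdfpipelineoptions = PdfPipelineOptions(
            artifacts_path=path_dir_artifacts,
            do_ocr=False,
            document_timeout=60,
            ocr_options=EasyOcrOptions(download_enabled=False),
        )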

It would be helpful for us if Docling were to accept some global configuration option to never download models or anything else at runtime.
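
Until such an option exists, a partial workaround might be the `HF_HUB_OFFLINE` environment variable that `huggingface_hub` honors. This is a sketch under assumptions: it only affects downloads that go through `huggingface_hub`, not other components such as EasyOCR, and the helper name `force_hf_offline` is hypothetical.

```python
import os


def force_hf_offline() -> None:
    """Ask huggingface_hub not to make network calls in this process."""
    # huggingface_hub reads HF_HUB_OFFLINE when it is first imported, so this
    # must run before anything imports huggingface_hub (directly or via Docling).
    os.environ["HF_HUB_OFFLINE"] = "1"
```

Setting the variable in the container image itself (rather than in Python) would have the same effect and avoids the import-order requirement.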

@dolfim-ibm
Contributor

Yes, the argument is correct.
I think we could consider some arguments similar to how the models can be configured for a given accelerator device.

2 participants