How to add Docling models as package dependency instead of downloading them at runtime? #61

sanmai-NL · 2024-12-06T12:55:53Z

See https://huggingface.co/docs/hub/spacy#using-existing-models for how this is done with spaCy. I didn't find wheels or tarballs for these models.

dolfim-ibm · 2024-12-11T06:17:52Z

I slightly disagree with the statement of the request and the comparison with spaCy:

- spaCy does not have the models as a package dependency
- the models are (usually) downloaded at runtime with spacy.load("en_core_web_sm")
- additionally, spaCy allows downloading the model weights as packages, but these are not a direct dependency of spaCy

Docling already has a section in the documentation on how to download the weights and use them offline. There is a simple one-line command to download the weights, e.g.

python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf();'

or

python -c 'from huggingface_hub import snapshot_download; snapshot_download("ds4sd/docling-models", revision="v2.0.1")'

I think the real point of the issue is whether the download of the weights could be simplified further, i.e. using helper scripts or other tools for it.
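A minimal sketch of what such a helper could look like, assuming `huggingface_hub` is installed and that the `ds4sd/docling-models` repository carries the tag mentioned in this thread. The function name `fetch_docling_models` and its parameters are hypothetical and not part of Docling. It pins the revision so that a build step fetches the same snapshot each time.

```python
from pathlib import Path
from typing import Optional


def fetch_docling_models(revision: str = "v2.0.1", cache_dir: Optional[str] = None) -> Path:
    """Download a pinned snapshot of the Docling models and return its local path.

    Hypothetical helper, not part of Docling. Intended to run once at build
    time (network access required), not at application start-up.
    """
    # Imported lazily so that importing this module does not require the hub client.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(
        repo_id="ds4sd/docling-models",
        revision=revision,  # a tag, branch, or commit hash
        cache_dir=cache_dir,  # None means the default Hugging Face cache
    )
    return Path(local_path)
```

The returned path could then be passed on explicitly, rather than relying on an implicit download later.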

sanmai-NL · 2024-12-11T07:39:20Z

> I slightly disagree with the statement of the request and the comparison with spaCy:
>
> - spaCy does not have the models as a package dependency
> - the models are (usually) downloaded at runtime with spacy.load("en_core_web_sm")
> - additionally, spaCy allows downloading the model weights as packages, but these are not a direct dependency of spaCy

Please note that I haven't claimed that, or suggested that Docling should do so. Thinking about it, perhaps it would help discoverability if the models were a fully optional dependency.

> Docling already has a section in the documentation on how to download the weights and use them offline. There is a simple one-line command to download the weights, e.g.
> python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; StandardPdfPipeline.download_models_hf();'
> or
> python -c 'from huggingface_hub import snapshot_download; snapshot_download("ds4sd/docling-models", revision="v2.0.1")'

Two data scientist colleagues and I searched for this and still missed it. It may be helpful to improve discoverability.

> I think the real point of the issue is whether the download of the weights could be simplified further, i.e. using helper scripts or other tools for it.

Returning to that topic, the original topic of my issue description, I like the approach described and supported by spaCy. Developing extra tools requires maintenance. A requirement for us would be that we can initialize these artifacts in a containerized environment. In my experience, running helper scripts is more of a hassle there: they often require installing Docling differently, having certain paths writable and with sufficient space, and having certain permissions (e.g. for execution and networking), and they make bit-for-bit reproducibility difficult.
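To make the path and space concerns concrete, here is a small preflight check that uses only the Python standard library. It is a sketch, not something Docling provides: the function name `check_artifacts_dir` and the `min_free_bytes` threshold are hypothetical, and the right threshold depends on the model snapshot being installed.

```python
import os
import shutil
from pathlib import Path


def check_artifacts_dir(path: str, min_free_bytes: int) -> list[str]:
    """Return a list of problems found with the target directory (empty if none)."""
    target = Path(path)
    if not target.is_dir():
        return [f"{target} does not exist or is not a directory"]

    problems = []
    # Writing files into a directory needs both write and execute permission on it.
    if not os.access(target, os.W_OK | os.X_OK):
        problems.append(f"{target} is not writable")
    # Free space on the filesystem that holds the directory.
    free = shutil.disk_usage(target).free
    if free < min_free_bytes:
        problems.append(f"{target} has {free} bytes free, need {min_free_bytes}")
    return problems
```

Running such a check in the image build, before any download step, makes a missing volume or a read-only path fail early with a readable message.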

dolfim-ibm · 2024-12-11T07:48:05Z

We actually don't dislike the idea and have been thinking about it, but we are holding off because PyPI doesn't allow pushing model artifacts and, in our direct experience, hosting Python packages on other registries creates lots of downstream issues (e.g. torch is really not easy to handle in many situations).

[..] improve discoverability [..]

Definitely!
Just to add to it, we even have an example Dockerfile which does it all already: https://github.com/DS4SD/docling/blob/main/Dockerfile

sanmai-NL · 2024-12-11T07:52:07Z

spaCy also does it using GitHub. With Git LFS, pushing large binaries is feasible.

sanmai-NL · 2024-12-18T10:21:03Z

@dolfim-ibm Should the caller set artifacts_path explicitly like so, to prevent automatic downloading at runtime?

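from huggingface_hub import snapshot_download

from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions


class PipelineDocuments:
    def __init__(
        self,
        *,
        # Note: this default is evaluated once, when the class is defined.
        # With local_files_only=True, snapshot_download only resolves the cached
        # snapshot and raises if it is missing, instead of downloading it.
        path_dir_artifacts: str | None = snapshot_download(
            local_files_only=True,
            repo_id="ds4sd/docling-models",
            revision="v2.1.0",
        ),
    ) -> None:
        # Point Docling at the local snapshot and disable EasyOCR downloads too.
        pdfpipelineoptions = PdfPipelineOptions(
            artifacts_path=path_dir_artifacts,
            do_ocr=False,
            document_timeout=60,
            ocr_options=EasyOcrOptions(download_enabled=False),
        )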

It would be helpful for us if Docling were to accept some global configuration option to never download models or anything else at runtime.
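Until such an option exists, one partial workaround is the `HF_HUB_OFFLINE` environment variable, which `huggingface_hub` honours. The sketch below assumes it runs before `huggingface_hub` (or anything that imports it) is loaded; the helper name `force_hub_offline` is hypothetical. It only affects Hugging Face Hub access: EasyOCR downloads are controlled separately via `download_enabled=False`, as in the snippet above, and whether this covers every download Docling may trigger is not confirmed in this thread.

```python
import os


def force_hub_offline() -> None:
    """Ask huggingface_hub to stay offline for this process.

    Hypothetical helper. It must run before huggingface_hub is imported,
    because the library reads the variable at import time.
    """
    os.environ["HF_HUB_OFFLINE"] = "1"


force_hub_offline()
```

In a container, setting the variable in the image itself (for example with an `ENV` line in the Dockerfile) achieves the same thing without touching application code.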

dolfim-ibm · 2024-12-20T16:36:54Z

Yes, the argument is correct.
I think we could consider some arguments similar to how the models can be configured with a given accelerator device.
