Add models_download.py script to download model files at Docker build time, to avoid model download issues at run time. #15

Status: Open. Wants to merge 1 commit into main.
8 changes: 4 additions & 4 deletions Containerfile
@@ -9,18 +9,18 @@ RUN apt-get update \

RUN pip install --no-cache-dir poetry

-COPY pyproject.toml poetry.lock README.md /docling-serve/
+COPY pyproject.toml poetry.lock README.md models_download.py /docling-serve/

RUN if [ "$CPU_ONLY" = "true" ]; then \
poetry install --no-root --with cpu; \
else \
poetry install --no-root; \
-    fi
+    fi && \
+    poetry run python models_download.py

ENV HF_HOME=/tmp/
ENV TORCH_HOME=/tmp/

-RUN poetry run python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; artifacts_path = StandardPdfPipeline.download_models_hf(force=True);'

# On container environments, always set a thread budget to avoid undesired thread congestion.
ENV OMP_NUM_THREADS=4
@@ -29,4 +29,4 @@ COPY ./docling_serve /docling-serve/docling_serve

EXPOSE 5000

-CMD ["poetry", "run", "uvicorn", "--port", "5000", "--host", "0.0.0.0", "docling_serve.app:app"]
+CMD ["poetry", "run", "uvicorn", "--port", "5000", "--host", "0.0.0.0", "--log-level", "debug", "docling_serve.app:app"]
40 changes: 40 additions & 0 deletions models_download.py
@@ -0,0 +1,40 @@
import os
import zipfile

import requests
from deepsearch_glm.utils.load_pretrained_models import load_pretrained_nlp_models
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

# Download Docling models
StandardPdfPipeline.download_models_hf(force=True)
load_pretrained_nlp_models(verbose=True)

# Download EasyOCR models
urls = [
"https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/latin_g2.zip",
"https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip"
]

local_zip_paths = [
"/root/latin_g2.zip",
"/root/craft_mlt_25k.zip"
]

extract_path = "/root/.EasyOCR/model/"

# Create the extract directory if it doesn't exist
os.makedirs(extract_path, exist_ok=True)
os.makedirs(os.path.dirname(local_zip_paths[0]), exist_ok=True) # Create directory for zip files

for url, local_zip_path in zip(urls, local_zip_paths):
    # Download the archive; fail the build early on network or HTTP errors
    # instead of silently writing an error page into the zip file
    response = requests.get(url, timeout=300)
    response.raise_for_status()
    with open(local_zip_path, "wb") as file:
        file.write(response.content)

# Unzip the file
with zipfile.ZipFile(local_zip_path, "r") as zip_ref:
zip_ref.extractall(extract_path)

# Clean up the zip file
os.remove(local_zip_path)
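As a sanity check after the image builds, a small helper along these lines could confirm the EasyOCR weights actually landed on disk. This is a hypothetical sketch, not part of the PR: the extract path matches the default in models_download.py above, but the `.pth` file names are an assumption about the zip contents and should be adjusted to what actually extracts.

```python
import os


def missing_model_files(extract_path, expected_names):
    """Return the expected model files that are not present in extract_path."""
    return [
        name for name in expected_names
        if not os.path.exists(os.path.join(extract_path, name))
    ]


# Assumed defaults matching models_download.py above; the .pth file names
# are a guess at the archive contents, so adjust them to the real files.
EXPECTED = ["latin_g2.pth", "craft_mlt_25k.pth"]
missing = missing_model_files("/root/.EasyOCR/model/", EXPECTED)
if missing:
    print(f"Missing EasyOCR model files: {missing}")
```

Running this as a `RUN` step right after the download would make the build fail-fast reporting incomplete model caches instead of surfacing the problem at container start.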