
Caption Recognition Model Download Timing Out at 30% (during the Docker image build process -- Github Actions) #900

Open
shahdyousefak opened this issue Oct 18, 2024 · 0 comments
Assignees
Labels
bug Something isn't working

Comments


shahdyousefak commented Oct 18, 2024

The caption recognition preprocessor currently fails to download its model during the Docker build. The failing command is in /preprocessors/caption-recognition/Dockerfile:

RUN wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_caption_base.pth -P /home/python/.cache/torch/hub/checkpoints
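One mitigation worth trying before any larger change is making the download itself more resilient. The sketch below is an assumption, not a confirmed fix: it adds retry and resume flags to the same `wget` call so a stalled transfer can pick up from the partial file instead of failing the whole build. Flag availability depends on the `wget` build in the base image.

```dockerfile
# Sketch: retry a stalled download and resume from the partial file.
# --continue (-c) resumes a partial .pth file, --tries retries on failure,
# --timeout bounds how long a stall is tolerated before retrying.
RUN wget --tries=5 --continue --timeout=60 \
    https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_caption_base.pth \
    -P /home/python/.cache/torch/hub/checkpoints
```

If the stall is caused by runner-side rate limiting rather than a transient network issue, retries may only delay the same failure, but this is cheap to test.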

We also attempted to download the model from the Pegasus server, but both attempts yield the same result:

#8 12.51 791150K .......... .......... .......... .......... .......... 30% 120M 28s

The download consistently stalls at 30%, then times out, causing the caption recognition build to fail.

This issue may be related to rate limiting by the server, or to limitations GitHub imposes on free accounts, but a more thorough investigation is required.

(Pasting from my conversation with @jeffbl)
so if the problem is the large 2.5GB checkpoint file, I think the issue is one of:

  • we were rate-limited by the server holding the checkpoint. This seemed likely when we were pulling from someone else's server, but you're seeing it from Pegasus as well. You can try doing a wget on that file and see if there are any perf issues; otherwise I think we can rule out server-side limiting. --> I was able to execute the wget command manually and the model downloaded without any issues, which suggests it is not a server performance issue.
  • it is a limitation of the build system on github, esp. for a free account. maybe they do rate limiting.

Options I can think of (probably there are more):

  1. Remove the GitHub build for this, build the image locally on unicorn, and push the image to GitHub (similar limits?) or someplace else.
  2. Move the checkpoint download out of the image build and into container startup. There are many downsides to this, but it would work for our use since everything is on the local network; startup time for that container will be bad.
  3. We don't use this in production. An LMM would likely do better at this point anyway, so maybe we should just deprecate this.
  4. Turn off GitHub Actions for this specific preprocessor, note in the README that it is not automatically built, and open a GitHub item to fix it once we figure out the best solution.
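Option 2 above could be sketched as an entrypoint script that fetches the checkpoint on container startup only if it is not already present. This is a hypothetical sketch, not the preprocessor's actual entrypoint: the `CHECKPOINT_DIR`/`CHECKPOINT_URL` environment overrides are assumptions added for illustration, while the default path and URL are taken from the Dockerfile above.

```shell
#!/bin/sh
# Sketch of option 2: download the BLIP checkpoint at container startup
# instead of at image build time. Skips the download when the file already
# exists, so restarts on the local network pay the cost only once.
# CHECKPOINT_DIR / CHECKPOINT_URL overrides are illustrative assumptions.

ensure_checkpoint() {
    ckpt_dir=${CHECKPOINT_DIR:-/home/python/.cache/torch/hub/checkpoints}
    ckpt="$ckpt_dir/blip_coco_caption_base.pth"
    url=${CHECKPOINT_URL:-https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP/blip_coco_caption_base.pth}

    if [ ! -f "$ckpt" ]; then
        mkdir -p "$ckpt_dir"
        # -c resumes a partial file left by a previous failed start
        wget -c --tries=5 "$url" -O "$ckpt"
    fi
}
```

A real entrypoint would call `ensure_checkpoint` and then `exec` the preprocessor process. The main trade-off, as noted above, is bad startup time for the container on first run and a runtime dependency on the checkpoint host being reachable.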