[WIP] Load or download the model. #55

Draft
georgeamccarthy wants to merge 2 commits into base: main

Conversation

georgeamccarthy (Owner) commented Aug 10, 2021

PR type

Purpose

  • Allows the model and tokenizer to be stored locally; downloads them if not found.

Why?

  • Unable to download the indexer within the flow on GCP (deployment).

Extra info

Adds a new protein_search/models directory to store models in:

models/
└── prot_bert
    ├── model
    │   ├── config.json
    │   └── pytorch_model.bin
    └── tokenizer
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        └── vocab.txt

I downloaded the models from Hugging Face and then moved them into these directories.
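
For reference, a minimal sketch of the load-or-download logic this PR describes (MODELS_DIR, HF_MODEL_NAME, and the load_or_download helper are illustrative names, not the exact code in this diff):

from pathlib import Path

from transformers import BertModel, BertTokenizer

# Hypothetical constants matching the directory layout above.
MODELS_DIR = Path("protein_search/models/prot_bert")
HF_MODEL_NAME = "Rostlab/prot_bert"

def load_or_download():
    model_dir = MODELS_DIR / "model"
    tokenizer_dir = MODELS_DIR / "tokenizer"
    if model_dir.exists() and tokenizer_dir.exists():
        # Load the local copy if it is already on disk.
        tokenizer = BertTokenizer.from_pretrained(str(tokenizer_dir), do_lower_case=False)
        model = BertModel.from_pretrained(str(model_dir))
    else:
        # Otherwise download from the Hugging Face hub and save a
        # local copy so the next start-up skips the download.
        tokenizer = BertTokenizer.from_pretrained(HF_MODEL_NAME, do_lower_case=False)
        model = BertModel.from_pretrained(HF_MODEL_NAME)
        tokenizer.save_pretrained(str(tokenizer_dir))
        model.save_pretrained(str(model_dir))
    return model, tokenizer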

Feedback required on

  • A quick pair of 👀 on the code
  • Discussion on the technical approach

Mentions

References

Legal

georgeamccarthy (Owner, Author) commented

Not sure if I'm going to merge this, but I need it on GCP without the Dockerization merged. Could probably use a simpler model file structure, like the upstream repo's flat layout: https://huggingface.co/Rostlab/prot_bert/tree/main
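
A flatter layout could mirror the hub repo by saving the model and tokenizer into a single directory; save_pretrained writes config.json, pytorch_model.bin, and the tokenizer files side by side (the path below is illustrative):

from transformers import BertModel, BertTokenizer

flat_dir = "protein_search/models/prot_bert"  # single flat directory (hypothetical path)

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

# Both save into the same directory, matching the hub repo layout,
# and from_pretrained(flat_dir) can later load each of them back.
tokenizer.save_pretrained(flat_dir)
model.save_pretrained(flat_dir)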

georgeamccarthy (Owner, Author) commented

There may be a simpler way to get around the issue. If I try to download the model with a simple script

from transformers import BertModel, BertTokenizer

model_path = "Rostlab/prot_bert"

print("Loading tokenizer.")
tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=False)
print("Loading model.")
model = BertModel.from_pretrained(model_path)

print("Done.")

then the system runs out of RAM (only ~1 GB is available) and the process dies with a Killed error.

To monitor RAM usage: ps -m -o %cpu,%mem,command

Instead of downloading the repo, I might just be able to configure the download to use a disk cache.
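
If so, the change could be as small as pointing from_pretrained at a persistent cache directory via its cache_dir parameter (the path here is an assumption about where a GCP disk would be mounted):

from transformers import BertModel, BertTokenizer

cache_dir = "protein_search/models/cache"  # assumed persistent disk location

tokenizer = BertTokenizer.from_pretrained(
    "Rostlab/prot_bert", do_lower_case=False, cache_dir=cache_dir
)
model = BertModel.from_pretrained("Rostlab/prot_bert", cache_dir=cache_dir)

Note that cache_dir only controls where the downloaded files land on disk; loading the weights into memory still costs the same RAM.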
