BioBERT Model Available - Trained on BioASQ for Question Answering #88
Comments
Hey @trisongz, thanks for sharing this! Did I understand correctly that you took the text corpus from BioASQ 7b and pretrained a BERT from scratch with the MLM and NSP objectives? For conversion: is it the TF format used in the original BERT repo by Google? It would be helpful if you (or someone from the community here) could convert it to PyTorch / transformers.
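For anyone picking this up: loading an original-format TF BERT checkpoint into transformers can typically be done via `from_tf=True`. A minimal sketch, assuming the checkpoint uses the standard BERT file layout (`bert_config.json`, `model.ckpt-18000.*`, `vocab.txt`); the directory names are placeholders, and the step number is taken from the model details below:

```python
import os

from transformers import BertConfig, BertForQuestionAnswering, BertTokenizer

# Loading from a TF checkpoint requires TensorFlow to be installed as well.
config = BertConfig.from_json_file("biobert-large-cased/bert_config.json")
model = BertForQuestionAnswering.from_pretrained(
    "biobert-large-cased/model.ckpt-18000.index",  # point at the .index file
    from_tf=True,
    config=config,
)
tokenizer = BertTokenizer("biobert-large-cased/vocab.txt", do_lower_case=False)

# Save in PyTorch / transformers format so others can load it directly.
os.makedirs("biobert-large-cased-pt", exist_ok=True)
model.save_pretrained("biobert-large-cased-pt")
tokenizer.save_pretrained("biobert-large-cased-pt")
```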
Yes, the current Docker image supports CPU only. We could add another one for GPU that inherits from
The objective for this BERT model is extractive QA in SQuAD style, so it should be able to do question answering given the input text. I figured that would be the most useful given the context of the solution we're aiming for. I took the implementation from BioBERT and used the original BERT checkpoints from Google, since the BioBERT model had a slightly smaller parameter count. So the TF format should match the original BERT implementation.
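To make the intended usage concrete: SQuAD-style extractive QA means predicting an answer span inside a supplied context. A minimal sketch with transformers, assuming a hypothetical path to the converted checkpoint; the question and context strings are purely illustrative:

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

# Hypothetical path to the converted (PyTorch) checkpoint.
model = BertForQuestionAnswering.from_pretrained("biobert-large-cased-pt")
tokenizer = BertTokenizer.from_pretrained("biobert-large-cased-pt")

question = "Which receptor does SARS-CoV-2 bind to?"
context = "SARS-CoV-2 enters cells by binding its spike protein to the ACE2 receptor."

inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
outputs = model(**inputs)
start_logits, end_logits = outputs[0], outputs[1]

# Take the most likely start/end positions and decode the answer span.
start = torch.argmax(start_logits)
end = torch.argmax(end_logits) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))
```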
Oh great, that's even better :) Could you convert the model to PyTorch / transformers format? We are also about to start some expert labeling sessions to gather training/eval data for QA on the CORD-19 dataset. We will share this data once it's available. Maybe you could try to evaluate your model or continue training on it?
I actually spent a bit of time cleaning up the CORD-19 dataset and compiled it into a single jsonl file. It's pre-processed: SciBERT was used to label potential diseases mentioned, solutions, and results; any non-English papers or papers shorter than 100 words were removed; and (I think) any that were missing an abstract had one generated with gensim's summarization library.
https://drive.google.com/open?id=1fd0QJ7soYpYQubeWUxeYmUJEK7eX_RhK
Would love to have the additional dataset to continue training. Would you also be able to add a script to convert it to the same format as https://storage.cloud.google.com/ce-covid-public/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json to save some time? I'll try to have the checkpoint converted and will provide both formats for anyone to continue training, since I know it's always a pain when models are only available in one format/framework.
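A rough sketch of the filter-and-summarize step described above. The jsonl field names (`abstract`, `body_text`) and file paths are assumptions, not the actual schema of the shared file, and the summarization module only exists in gensim < 4.0:

```python
import json

from gensim.summarization import summarize  # removed in gensim 4.0

with open("cord19.jsonl") as fin, open("cord19_clean.jsonl", "w") as fout:
    for line in fin:
        paper = json.loads(line)
        body = paper.get("body_text", "")
        # Drop papers shorter than 100 words.
        if len(body.split()) < 100:
            continue
        # Generate a stand-in abstract when one is missing.
        if not paper.get("abstract"):
            try:
                paper["abstract"] = summarize(body, word_count=150)
            except ValueError:
                # Text too short or too uniform for the summarizer.
                continue
        fout.write(json.dumps(paper) + "\n")
```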
Great! Thanks for sharing your data.
Looks like standard SQuAD format, right? We can definitely provide the dataset in this format. Just be aware that it might take some time until we have gathered enough labels.
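For reference, the SQuAD-style layout that the linked BioASQ training file follows, sketched as a Python dict; all values are illustrative:

```python
squad_style = {
    "version": "BioASQ-6b",
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "SARS-CoV-2 binds the ACE2 receptor via its spike protein.",
            "qas": [{
                "id": "example-q1",
                "question": "Which receptor does SARS-CoV-2 bind?",
                # answer_start is the character offset of the answer in context
                "answers": [{"text": "ACE2", "answer_start": 21}],
            }],
        }],
    }],
}
```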
Nice find, @ViktorAlm. From a quick look, this dataset seems to contain many question-answer pairs, but it lacks the "context" text, as it was extracted from FAQ websites. Not sure how this could be useful for extractive QA 🤔
I think the context is the link at the top; I'm still looking at it. I'm spending most of my time on normal work and on training some Swedish models. If the spans don't match the URLs, then I guess it's not very useful. Maybe for pretraining QNLI before QA; I'm not familiar with methods for better QA. It might be useful for the annotators to see a huge list of questions, though.
Edit: she does seem to have data that would be useful for Sentence-BERT and other things, though.
Hi - I wanted to share a model that I've pretrained from scratch using BERT Large Cased and the BioASQ 7b factoid dataset on a TPU v2-8.
Original Implementation:
https://github.com/dmis-lab/biobert
The dataset can also be found in their repo.
Model Details:
loss = 0.41782737
step = 18000
max_seq_length = 384
learning_rate = 3e-6
doc_stride = 128
The model is TensorFlow-based; I haven't converted it to PyTorch / transformers or evaluated it yet.
I'd like to continue training it on COVID-related questions, as well as on additional data from BioASQ, but I haven't yet found an easy way to convert the raw BioASQ data into the training format (a rough converter sketch follows below). If someone would like to do that so I can continue training the model, please let me know.
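A rough sketch of such a converter, assuming the standard raw BioASQ fields (`body`, `snippets`, `exact_answer`) and keeping only factoid answers that appear verbatim in a snippet; the file paths are placeholders:

```python
import json

with open("BioASQ-training7b.json") as f:  # placeholder path to raw BioASQ data
    bioasq = json.load(f)

paragraphs = []
for q in bioasq["questions"]:
    if q.get("type") != "factoid":
        continue
    # exact_answer entries may be strings or lists of synonyms.
    answers = [a[0] if isinstance(a, list) else a
               for a in q.get("exact_answer", [])]
    for i, snippet in enumerate(q.get("snippets", [])):
        context = snippet["text"]
        qas = []
        for ans in answers:
            start = context.find(ans)  # keep only answers found verbatim
            if start != -1:
                qas.append({
                    "id": "%s_%03d" % (q["id"], i),
                    "question": q["body"],
                    "answers": [{"text": ans, "answer_start": start}],
                })
        if qas:
            paragraphs.append({"context": context, "qas": qas})

squad = {"version": "BioASQ", "data": [{"title": "BioASQ", "paragraphs": paragraphs}]}
with open("BioASQ-train-factoid-squad.json", "w") as f:
    json.dump(squad, f)
```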
With gsutil installed, you should be able to download all the files by running:
gsutil -m cp -r gs://ce-covid-public/biobert-large-cased/* /path/to/folder/
If someone wants to run evaluation on the models and provide the metrics, I can update this.
Question: when running the backend in Docker with GPU enabled and BERT embeddings, it doesn't seem to use the GPUs, even with all the correct drivers installed. Is there any documentation around this?
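One quick sanity check (a sketch, assuming the backend's Python environment has PyTorch): run this inside the container. If it prints False, the container itself has no GPU access, e.g. it was started without the NVIDIA runtime or Docker's --gpus flag, regardless of the host drivers; the CPU-only image mentioned in the reply above would produce exactly this:

```python
import torch

# False here usually means the container has no GPU access,
# not that the host drivers are broken.
print(torch.cuda.is_available())
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```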
Great job on the progress so far! I believe there's a lot of value in what's being done.