Medical Named Entity Recognition (MedNER) is a deep learning project that extracts medical entities from text using a fine-tuned BERT model. It uses the Hugging Face transformers
library to identify named entities such as diseases, medications, genes, and other biomedical terms.
- Dataset
  - The dataset is sourced from parsa-mhmdi/Medical_NER on Hugging Face.
  - It consists of tokenized medical text with annotated named entities in the IOB format.
- Model
  - A fine-tuned bert-base-cased model is used for Named Entity Recognition (NER).
  - The model is trained using the Hugging Face Trainer API.
- Training Pipeline
  - Tokenization using AutoTokenizer from Hugging Face.
  - Data alignment to match tokenized input with entity labels.
  - Training with evaluation and model selection based on best validation performance.
- Deployment
  - The trained model is deployed as a Hugging Face Space using Gradio.
  - A web-based interactive demo is provided for real-time text analysis.
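To make the IOB annotation scheme mentioned under Dataset concrete, here is a minimal sketch of how word-level tags can be grouped into entity spans. The tag names (`B-DISEASE`, `I-DISEASE`, `B-MEDICATION`) and the sample sentence are illustrative assumptions, not taken from the Medical_NER label set, and `iob_to_spans` is a hypothetical helper rather than part of this repository.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    words: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # An I- tag of the same type extends the entity that is currently open.
        if tag.startswith("I-") and label == tag[2:]:
            words.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if words:
            spans.append((" ".join(words), label))
        if tag.startswith("B-"):
            words, label = [token], tag[2:]  # start a new entity
        else:
            words, label = [], None  # "O", or a stray I- tag treated as outside
    if words:
        spans.append((" ".join(words), label))
    return spans


# Illustrative sentence with assumed label names.
tokens = ["Patient", "has", "type", "2", "diabetes", "and", "takes", "metformin", "."]
tags = ["O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O", "B-MEDICATION", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

In this scheme `B-` marks the first token of an entity, `I-` marks a continuation, and `O` marks tokens outside any entity, so multi-word terms such as "type 2 diabetes" are recovered as a single span.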
This repository contains the following essential files:
- .git: Version control folder (not necessary for direct use).
- .gradio: Configuration files for Gradio interface settings.
- .gitattributes: Defines Git LFS tracking for large files.
- app.py: Main script for running the Gradio interface.
- config.json: Configuration file for the model, specifying hyperparameters.
- README.md: Documentation containing project details and usage instructions.
- requirements.txt: Lists all dependencies required to run the project.
- tokenizer.json: Tokenizer configuration containing vocabulary and model-specific settings.
- tokenizer_config.json: Configuration settings for the tokenizer.
- trainer_code.ipynb: Jupyter Notebook containing training scripts and model fine-tuning process.
- vocab.txt: Vocabulary file used by the tokenizer.
To run the project locally, clone the repository and install dependencies:
git clone https://huggingface.co/spaces/parsa-mhmdi/MedNER
cd MedNER
pip install -r requirements.txt
Run the application using:
python app.py
This will launch a Gradio interface where you can enter medical text to identify named entities.
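For orientation, the following is a rough sketch of how a Gradio NER interface of this kind could be wired up. It is not the contents of `app.py`. The model path `./ner_model`, the helper names `to_highlighted` and `build_demo`, and the choice of `gr.HighlightedText` for output are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

MODEL_PATH = "./ner_model"  # assumption: local directory holding the fine-tuned checkpoint


def to_highlighted(text: str, entities: List[Dict]) -> List[Tuple[str, Optional[str]]]:
    """Turn character-offset entity predictions into (segment, label) pairs.

    Assumes the entities do not overlap. Unlabeled stretches get a None label.
    """
    segments = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] > cursor:
            segments.append((text[cursor:ent["start"]], None))
        segments.append((text[ent["start"]:ent["end"]], ent["entity_group"]))
        cursor = ent["end"]
    if cursor < len(text):
        segments.append((text[cursor:], None))
    return segments


def build_demo():
    """Build (but do not launch) a Gradio interface around a token-classification pipeline."""
    # Imported lazily so the helper above can be used without the heavy dependencies.
    import gradio as gr
    from transformers import pipeline

    # aggregation_strategy="simple" merges subword predictions into whole entities
    # and reports them with "entity_group", "start" and "end" keys.
    ner = pipeline("token-classification", model=MODEL_PATH, aggregation_strategy="simple")

    def analyze(text: str):
        return to_highlighted(text, ner(text))

    return gr.Interface(
        fn=analyze,
        inputs=gr.Textbox(lines=4, label="Medical text"),
        outputs=gr.HighlightedText(label="Entities"),
        title="MedNER",
    )
```

Calling `build_demo().launch()` would start the local web interface under these assumptions. Keeping the formatting step in a separate pure function makes it easy to check without loading the model.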
To train the model from scratch, run the following script:
python train.py
This will:
- Load the dataset
- Tokenize and preprocess text
- Train the bert-base-cased model
- Save the best-performing model checkpoint
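The preprocessing step has to reconcile word-level labels with the subword tokens that a BERT tokenizer produces. The sketch below shows one common convention, assumed here rather than confirmed from the training code: only the first subword of each word keeps the label, and special tokens and later subwords receive `-100`, which PyTorch's cross-entropy loss ignores by default. The `word_ids` list mimics what a fast tokenizer's `word_ids()` method returns when called with `is_split_into_words=True`. The sample values and the `align_labels` helper are hypothetical.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to one label id per subword token."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] map to no word.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords of the same word are excluded from the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Three words with made-up label ids; the first word is assumed to split into three subwords.
word_labels = [1, 0, 2]
word_ids = [None, 0, 0, 0, 1, 2, None]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

An alternative design copies the label to every subword of a word. Either choice works as long as evaluation uses the same convention.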
To save storage space, the best model is compressed and uploaded to Hugging Face:
import shutil
shutil.make_archive("./ner_model_compressed", 'zip', "./ner_model")
The compressed model is then uploaded to the repository:
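from huggingface_hub import upload_file
# upload_folder expects a directory, so the single .zip archive is uploaded with upload_file instead.
upload_file(path_or_fileobj="./ner_model_compressed.zip", path_in_repo="ner_model_compressed.zip", repo_id="parsa-mhmdi/MedNER")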
Try the live demo of MedNER on Hugging Face Spaces: 🔗 MedNER Hugging Face Space (https://huggingface.co/spaces/parsa-mhmdi/MedNER)
We welcome contributions! Feel free to fork the repository and submit a pull request with improvements.
This project is open-source and available under the MIT License.