
Medical Named Entity Recognition (MedNER)

Overview

Medical Named Entity Recognition (MedNER) is a deep learning-based project designed to extract medical entities from text using a fine-tuned BERT model. This project utilizes the Hugging Face transformers library to identify named entities such as diseases, medications, genes, and other biomedical terms.

Project Outline

  1. Dataset

    • The dataset is sourced from parsa-mhmdi/Medical_NER on Hugging Face.
    • It consists of tokenized medical text with annotated named entities in the IOB format.
  2. Model

    • A fine-tuned bert-base-cased model is used for Named Entity Recognition (NER).
    • The model is trained using the Hugging Face Trainer API.
  3. Training Pipeline

    • Tokenization using AutoTokenizer from Hugging Face.
    • Data alignment to match the tokenized (subword) input with the entity labels (a sketch of this step follows the outline).
    • Training with evaluation and model selection based on best validation performance.
  4. Deployment

    • The trained model is deployed as a Hugging Face Space using Gradio.
    • A web-based interactive demo is provided for real-time text analysis.

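The label-alignment step mentioned in item 3 is worth illustrating. The sketch below follows the common Hugging Face pattern for word-piece tokenizers: only the first sub-token of each word keeps its IOB label, and special tokens receive the ignore index -100. The column names tokens and ner_tags are assumptions about the dataset schema, not confirmed by this repository.

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("parsa-mhmdi/Medical_NER")            # IOB-annotated medical text
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(batch):
    # "tokens" / "ner_tags" are assumed column names for the word list and IOB tag ids
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)                  # special tokens ([CLS], [SEP]) are ignored by the loss
            elif word_id != previous:
                labels.append(word_labels[word_id])  # first sub-token keeps the word's IOB label
            else:
                labels.append(-100)                  # remaining sub-tokens are ignored
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
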
Project Files

This repository contains the following essential files:

  • .git - Version control folder (not necessary for direct use).
  • .gradio - Configuration files for Gradio interface settings.
  • .gitattributes - Defines Git LFS tracking for large files.
  • app.py - Main script for running the Gradio interface.
  • config.json - Configuration file for the model, specifying hyperparameters.
  • README.md - Documentation containing project details and usage instructions.
  • requirements.txt - Lists all dependencies required to run the project.
  • tokenizer.json - Tokenizer configuration containing vocabulary and model-specific settings.
  • tokenizer_config.json - Configuration settings for the tokenizer.
  • trainer_code.ipynb - Jupyter Notebook containing training scripts and model fine-tuning process.
  • vocab.txt - Vocabulary file used by the tokenizer.

Installation

To run the project locally, clone the repository and install dependencies:

git clone https://huggingface.co/spaces/parsa-mhmdi/MedNER
cd MedNER
pip install -r requirements.txt

Usage

Run the application using:

python app.py

This will launch a Gradio interface where you can enter medical text to identify named entities.
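
For orientation, a minimal app of this kind can be built from the transformers pipeline API and Gradio. This is only a sketch of what app.py might contain; the model path ./ner_model and the Textbox/JSON interface below are illustrative assumptions, not the repository's actual code.

import gradio as gr
from transformers import pipeline

# "./ner_model" is an assumed path to the fine-tuned checkpoint; adjust to your local setup
ner = pipeline("token-classification", model="./ner_model", aggregation_strategy="simple")

def extract_entities(text):
    # Returns a list of dicts with entity_group, word, score, start, end
    return ner(text)

demo = gr.Interface(
    fn=extract_entities,
    inputs=gr.Textbox(lines=5, label="Clinical text"),
    outputs="json",
    title="Medical Named Entity Recognition (MedNER)",
)

demo.launch()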

Training the Model

To train the model from scratch, run the training script (the same fine-tuning workflow is also provided in trainer_code.ipynb):

python train.py

This will:

  • Load the dataset
  • Tokenize and preprocess text
  • Train the bert-base-cased model
  • Save the best-performing model checkpoint
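
The core of that training step, as far as it can be inferred from the outline, looks roughly like the sketch below, which continues from the tokenization sketch in the Project Outline section (it reuses tokenizer and tokenized_dataset). The hyperparameters, the number of labels, and the split names train/validation are placeholder assumptions; see trainer_code.ipynb for the actual settings.

from transformers import (AutoModelForTokenClassification, DataCollatorForTokenClassification,
                          Trainer, TrainingArguments)

# num_labels and the hyperparameters below are placeholder assumptions
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

args = TrainingArguments(
    output_dir="./ner_model",
    evaluation_strategy="epoch",   # recent transformers versions rename this to eval_strategy
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    load_best_model_at_end=True,   # keep the checkpoint with the best validation performance
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./ner_model")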

Model Compression & Upload

To save storage space, the best model is compressed and uploaded to Hugging Face:

import shutil

# Arguments: output base name, archive format, directory to compress
shutil.make_archive("./ner_model_compressed", "zip", "./ner_model")

The compressed model is then uploaded to the repository:

from huggingface_hub import upload_file

# upload_folder expects a directory, so the single zip archive is pushed with upload_file instead
upload_file(path_or_fileobj="./ner_model_compressed.zip", path_in_repo="ner_model_compressed.zip", repo_id="parsa-mhmdi/MedNER")
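
To reuse the uploaded archive elsewhere, it can be fetched and unpacked with huggingface_hub. This is a minimal sketch, assuming the zip file sits at the root of the parsa-mhmdi/MedNER repo under the same name.

import shutil
from huggingface_hub import hf_hub_download

# Download the archive and unpack it into a local model directory
archive_path = hf_hub_download(repo_id="parsa-mhmdi/MedNER", filename="ner_model_compressed.zip")
shutil.unpack_archive(archive_path, "./ner_model")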

Demo Link

Try the live demo of MedNER on Hugging Face Spaces: 🔗 https://huggingface.co/spaces/parsa-mhmdi/MedNER

Contribution

We welcome contributions! Feel free to fork the repository and submit a pull request with improvements.

License

This project is open-source and available under the MIT License.

About

MedNER: A deep learning tool for medical NER. Fine-tune transformer models (BERT/BioBERT) to extract entities like diseases, treatments, and medications from clinical texts. Enjoy GPU-accelerated training, TensorBoard visualization, and an easy inference pipeline.
