Named Entity Recognition (NER) System


In order to run the system, you need to have the following tools installed locally:

Set up

Clone the project repo:

Navigate to the project root directory and install the ner-system package using Poetry:

poetry install

Activate the Poetry Shell, which enables the project's virtual environment:

poetry shell

Download SpaCy's English language model using the Poetry CLI:

poetry run python -m spacy download en_core_web_sm

NOTE: A lightweight model is used here for demonstration purposes. You can download a more robust English model by following the guidelines from this list.
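To confirm that the download succeeded, a quick check such as the following can help. It is a minimal sketch that relies on spaCy models being installed as regular Python packages; the helper name `model_installed` is hypothetical and is not part of this project.

```python
import importlib.util


def model_installed(package_name: str = "en_core_web_sm") -> bool:
    """Return True if the named package is importable by the current interpreter.

    Hypothetical helper for illustration; it only checks that the package
    can be located, it does not load the model.
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    name = "en_core_web_sm"
    print(f"{name} installed: {model_installed(name)}")
```

Run it with `poetry run python` so that the check uses the project's virtual environment rather than the system interpreter.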

Finally, unzip the provided file and copy the data, models and artifacts directories into the project's root directory. If no such file was provided, create these directories at the project root and follow the instructions in the Reproducibility section.

The following should be the final project structure after completing all these steps:

│   .env
│   .gitignore
│   poetry.lock
│   pyproject.toml
│       test_data.pkl
│       test_tags.pkl
│       train_data.pkl
│       train_tags.pkl
│       validation_data.pkl
│       validation_tags.pkl
│   ├───CoNLL003
│   │       metadata
│   │       test.txt
│   │       train.txt
│   │       valid.txt
│   │
│   └───DataWorld
│           cnn_data.json
│           cnn_data_sample.txt
│       crf_model_vanilla.pkl
│       crf_model_optim.pkl
│   │
│   │
│   │
│   │
│   │
│   │
│   │
│   ├───api
│   │   │
│   │   │   sample_request.json
│   │   │   sample_response.json
│   │   │
│   │   │
│   │   └───__pycache__
│   │
│   ├───models
│   │   │
│   │   │
│   │   │
│   │   └───__pycache__
│   │        
│   └───__pycache__
│       R&D_BERT+CRF_Model.ipynb
│       R&D_CRF_Model.ipynb


First, launch the NER System Application:

poetry run python ner-system/

INFO:     Started server process [5980]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)

Open your browser and navigate to the URL displayed in the terminal: http://localhost:8000. It should redirect to the API documentation page at http://localhost:8000/docs.

The endpoint /api/v0/ner/predict is responsible for extracting entities from a payload containing news articles. Sample JSON files sample_request.json and sample_response.json have been provided inside the ./api directory to illustrate the API schema as documented in localhost:8000/docs.

With the application running, send a request to the API by running the file ./tests/ using another terminal. First activate the Poetry shell and then run the following:

poetry run python tests/
  • Alternatively, you can use an API testing tool such as Postman to submit the request to http://localhost:8000/api/v0/ner/predict, using the sample_request.json file located inside the ./api folder as a reference for the request body.
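The same request can also be sent from a short Python script. The sketch below uses only the standard library and assumes that the endpoint accepts a JSON body via POST, as the sample files suggest. The function names are hypothetical, and the payload is read from whatever path you pass in, so no request schema is assumed here.

```python
import json
import urllib.request
from pathlib import Path

# Endpoint as given in this README; adjust the host or port if yours differ.
PREDICT_URL = "http://localhost:8000/api/v0/ner/predict"


def build_request(payload: dict, url: str = PREDICT_URL) -> urllib.request.Request:
    """Build a JSON POST request for the predict endpoint (no network I/O)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(payload: dict, url: str = PREDICT_URL, timeout: float = 30.0) -> dict:
    """Send the payload to the running application and return the decoded JSON."""
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def predict_from_file(path: str, url: str = PREDICT_URL) -> dict:
    """Load a JSON payload from disk (e.g. sample_request.json) and submit it."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return predict(payload, url)
```

With the application running, calling `predict_from_file(...)` with the path to your copy of sample_request.json should return a response shaped like sample_response.json.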


A simple unit test suite has been provided under the folder ./tests and can be run using PyTest:

poetry run python -m pytest

===================================== test session starts =====================================
platform win32 -- Python 3.11.0, pytest-8.0.0, pluggy-1.4.0
rootdir: path/to/root/dir/ner-system
plugins: anyio-4.2.0
collected 6 items

tests\ ..                                                                     [ 33%]
tests\ ....                                                              [100%]

To run an end-to-end integration test, run as follows:

poetry run python tests/

Reproducibility


  • Run the notebook ./notebooks/R&D_CRF_Model.ipynb from top to bottom. It trains a CRF model on the CoNLL-2003 dataset and persists it in the ./models folder. The corresponding artifacts are persisted in the ./artifacts folder.

    NOTE: Make sure to run the notebook using the Kernel that contains the virtual environment activated by Poetry. For reference, it should start with the name ner-system-XXXX.
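For orientation, the CoNLL-2003 files (train.txt, valid.txt, test.txt) hold one token per line with whitespace-separated columns, the NER tag in the last column, blank lines between sentences and `-DOCSTART-` lines between documents. The sketch below shows one way to read that layout into (token, tag) pairs. It is not the notebook's loading code; the function name `read_conll` and the inline sample (including its tags) are illustrative only.

```python
from typing import Iterable


def read_conll(lines: Iterable[str]) -> list[list[tuple[str, str]]]:
    """Group CoNLL-style lines into sentences of (token, ner_tag) pairs."""
    sentences: list[list[tuple[str, str]]] = []
    current: list[tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # First column is the token, last column is the NER tag.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Tiny illustrative sample in the same column layout (not taken from the dataset files).
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

for sentence in read_conll(SAMPLE.splitlines()):
    print(sentence)
```

To read a real split, pass an open file handle for the corresponding file in the data directory instead of `SAMPLE.splitlines()`.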


For any questions, comments or bug reports, please submit an issue or contact me at juan at helelab dot org.