Bangla Named Entity Recognition (NER) using SpaCy. The model extracts named entities from Bangla input sentences. The experiments use only one entity type, person names, labeled as PER.
Five different experiments were run on this data, and the Transformer-based model performs better than the other models tried so far. Please check the experimental details and F1 scores in the experimental history; the best F1 score is ~0.80.
Bangla NER data is collected from,
conda install spacy=3.1
pip install spacy-transformers # needed only if you want to use the transformer model
NOTE: If you want to just test the ner model please,
- Clean the IOB data and remove sentences that are in the wrong IOB format
- Convert IOB to the .spacy data format used in SpaCy 3.x
python -m spacy convert -c iob -s -n 1 ner-token-per-line.iob ./data
- Example SpaCy json data
- Convert BLIOU json format to .spacy data format
python -m spacy convert train.json ./data
- To automate data preparation, just run
python utils/convert_to_spacy_json_format.py
This script will generate data/train.json and data/val.json
- Convert json data to .spacy data
python -m spacy convert data/train.json ./data
python -m spacy convert data/val.json ./data
# Outputs
✔ Generated output file (8986 documents): data/train.spacy
✔ Generated output file (999 documents): data/val.spacy
The above two commands will generate data/train.spacy and data/val.spacy.
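The first step in the list above, removing sentences with malformed IOB tags, could look roughly like the sketch below. It is not the repository's cleaning script: the function names are hypothetical, and it assumes tags are written as O, B-PER and I-PER and that each sentence is already available as a (tokens, tags) pair.

```python
from typing import List, Tuple


def is_valid_iob(tags: List[str]) -> bool:
    """Return True if the tag sequence is well-formed (assumed B-X / I-X / O style)."""
    prev_type = None  # entity type of the previous token, None if it was "O"
    for tag in tags:
        if tag == "O":
            prev_type = None
            continue
        prefix, sep, ent_type = tag.partition("-")
        # Every non-O tag must look like "B-<type>" or "I-<type>".
        if prefix not in ("B", "I") or not sep or not ent_type:
            return False
        # An I- tag may only continue an entity of the same type.
        if prefix == "I" and prev_type != ent_type:
            return False
        prev_type = ent_type
    return True


def drop_invalid_sentences(
    sentences: List[Tuple[List[str], List[str]]]
) -> List[Tuple[List[str], List[str]]]:
    """Keep only sentences whose tokens and tags align and whose tags are valid."""
    return [
        (tokens, tags)
        for tokens, tags in sentences
        if len(tokens) == len(tags) and is_valid_iob(tags)
    ]
```

A sentence tagged O, I-PER would be dropped by this check, because the I-PER tag has no preceding B-PER to continue.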
Go to the link, create a base config file, and save it under ./configs/base_config.cfg
The required files for the NER task are already in,
configs/
├── base_config.cfg # base NER configuration file downloaded from the SpaCy website
├── config.cfg # used to train the NER pipeline
└── config_pretrain.cfg # used to train only tok2vec separately
Now convert ./configs/base_config.cfg to the full config file ./configs/config.cfg
python -m spacy init fill-config configs/base_config.cfg configs/config.cfg
python -m spacy train configs/config.cfg \
--output ./models \
--paths.train ./data/train.spacy \
--paths.dev ./data/val.spacy
You should get an F1 score of around 0.66 on the validation data.
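For readers unfamiliar with how an entity-level F1 score like this is computed, here is a minimal sketch. It assumes the score counts only exact span matches (same start, end and label); the function name and the (start, end, label) tuple layout are illustrative and not taken from this repository.

```python
from typing import Set, Tuple

# An entity span as (start_token, end_token, label); the layout is an assumption.
Span = Tuple[int, int, str]


def entity_prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching entity spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One of two predicted spans matches one of two gold spans.
gold_spans = {(0, 2, "PER"), (5, 6, "PER")}
pred_spans = {(0, 2, "PER"), (7, 8, "PER")}
print(entity_prf(gold_spans, pred_spans))  # (0.5, 0.5, 0.5)
```

Because only exact matches count, a prediction that covers just part of a name is scored as both a false positive and a missed entity.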
For inference, please run,
python test.py
You can use the already pretrained model in test.py. Please download the pretrained model from Google Drive (4.4MB) and set the model path in the test.py file.
To train the SpaCy transformer model (a GPU is needed), please check the
Transformer training and inference guide
You should get an F1 score of around 0.80 on the validation data.
If you want to use the already trained model, please download the pretrained model from Google Drive (622.8MB) and set the model path in the test.py file.
import spacy
nlp = spacy.load("./models_multilingual_bert/model-best")
text_list = [
"আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম",
"১০০ টাকা জমা দিয়েছেন কবির",
"ডিপিডিসির স্পেশাল টাস্কফোর্সের প্রধান মুনীর চৌধুরী জানান",
"অগ্রণী ব্যাংকের জ্যেষ্ঠ কর্মকর্তা পদে নিয়োগ পরীক্ষার প্রশ্নপত্র ফাঁসের অভিযোগ উঠেছে।",
"সে আজকে ঢাকা যাবে",
]
for text in text_list:
    doc = nlp(text)
    print(f"Input: {text}")
    for entity in doc.ents:
        print(f"Entity: {entity.text}, Label: {entity.label_}")
    print("---")
# Outputs
Input: আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম
Entity: আব্দুর রহিম, Label: PER
---
Input: ১০০ টাকা জমা দিয়েছেন কবির
Entity: কবির, Label: PER
---
Input: ডিপিডিসির স্পেশাল টাস্কফোর্সের প্রধান মুনীর চৌধুরী জানান
Entity: মুনীর চৌধুরী, Label: PER
---
Input: অগ্রণী ব্যাংকের জ্যেষ্ঠ কর্মকর্তা পদে নিয়োগ পরীক্ষার প্রশ্নপত্র ফাঁসের অভিযোগ উঠেছে।
---
Input: সে আজকে ঢাকা যাবে
---
NOTE: Why use a Transformer-based model?
{"text": "Can I ask where you work now and what you do, and if you enjoy it?"}
{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."}
python -m spacy pretrain config.cfg ./output_pretrain --paths.raw_text ./data.jsonl
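The raw text file passed to the pretrain command is JSONL with one {"text": ...} object per line, as in the two example lines above. A minimal sketch for writing such a file from a list of sentences follows; the function name is hypothetical and the source of the sentences is up to you.

```python
import json
from pathlib import Path
from typing import Iterable


def write_raw_text_jsonl(sentences: Iterable[str], path: str) -> int:
    """Write one {"text": ...} JSON object per line and return the line count."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for sentence in sentences:
            sentence = sentence.strip()
            if not sentence:
                continue  # skip empty lines
            # ensure_ascii=False keeps Bangla characters readable in the file
            f.write(json.dumps({"text": sentence}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling write_raw_text_jsonl(sentences, "./data.jsonl") would produce a file at the path the command above expects.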
python -m spacy init vectors bn pretrain_vectors/bangla_word2vec_gen4/bangla_word2vec/bnwiki_word2vec.vector pretrain_vectors/bangla_word2vec_gen4/bangla_word2vec_spacy --verbose
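BLIOU data format meaning (the SpaCy documentation calls this scheme BILUO)
B = Begin (first token of a multi-token entity)
L = Last (last token of a multi-token entity)
I = Inside (inner token of a multi-token entity)
O = Outside (token is not part of any entity)
U = Unique (single-token entity; SpaCy calls this Unit)
IOB data format meaning
I = Inside (token continues an entity)
O = Outside (token is not part of any entity)
B = Begin (first token of an entity)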
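To make the relation between the two schemes concrete, the sketch below converts an IOB tag sequence into BLIOU tags. It is a plain-Python illustration with a hypothetical function name, it assumes well-formed input tags such as B-PER and I-PER, and it is not the conversion code used by this repository. SpaCy also ships its own helper for this (iob_to_biluo in spacy.training), which is the better choice in real pipelines.

```python
from typing import List


def iob_to_bliou(tags: List[str]) -> List[str]:
    """Convert well-formed IOB tags (B-X, I-X, O) into BLIOU tags."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, _, ent_type = tag.partition("-")
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same type.
        continues = next_tag == f"I-{ent_type}"
        if prefix == "B":
            # A Begin tag with no continuation is a single-token entity (U).
            converted.append(f"B-{ent_type}" if continues else f"U-{ent_type}")
        else:
            # An Inside tag with no continuation is the last token (L).
            converted.append(f"I-{ent_type}" if continues else f"L-{ent_type}")
    return converted


print(iob_to_bliou(["B-PER", "I-PER", "O", "B-PER"]))
# ['B-PER', 'L-PER', 'O', 'U-PER']
```

In the example, the two-token name becomes B-PER, L-PER and the standalone name becomes U-PER, which matches the tag meanings listed above.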