Vietnamese text classification using a pretrained model: PhoBERT

dangvansam/phobert-text-classification

Latest commit: f0d0a72 · Feb 1, 2021 (14 commits)

Vietnamese Text Classification with PhoBERT

Use PhoBERT (base), https://huggingface.co/vinai/phobert-base, to extract a 768-dimensional embedding vector for each word in a sequence (max_len=256, pad=0).
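
A minimal sketch of the extraction step, assuming the Hugging Face `transformers` and `torch` packages. The helper names `pad_or_truncate` and `embed` are hypothetical, and reading `pad=0` as "pad the id list with the value 0" is an assumption; the attention mask is built from the real length so the padded positions are ignored either way. This is not necessarily how the repository's scripts do it.

```python
from typing import List, Tuple

MAX_LEN = 256  # from the README
PAD_ID = 0     # from the README ("pad=0"); assumed to be the id padding value


def pad_or_truncate(ids: List[int], max_len: int = MAX_LEN,
                    pad: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Cut ids to max_len, pad on the right, and return (ids, attention_mask)."""
    ids = ids[:max_len]
    n_real = len(ids)
    n_pad = max_len - n_real
    mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad] * n_pad, mask


def embed(segmented_text: str):
    """Return the last hidden states for one word-segmented sentence."""
    # Imported lazily so pad_or_truncate works without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
    model = AutoModel.from_pretrained("vinai/phobert-base")
    model.eval()

    ids, mask = pad_or_truncate(tokenizer.encode(segmented_text))
    with torch.no_grad():
        out = model(input_ids=torch.tensor([ids]),
                    attention_mask=torch.tensor([mask]))
    return out.last_hidden_state  # shape: (1, MAX_LEN, 768)
```

Calling `embed("Tôi là sinh_viên")` downloads the model on first use; the input is expected to be word-segmented already.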

Download the PhoBERT pretrained model

Download the model files from https://public.vinai.io/PhoBERT_base_transformers.tar.gz or https://huggingface.co/vinai/phobert-base
and arrange them with the following folder structure:
[image: folder structure]
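
The download can also be scripted. The sketch below uses only the Python standard library; the function name `download_and_extract` and the destination directory are hypothetical, and it returns the archive's member names so the extracted layout can be compared with the folder structure the README expects.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import List

MODEL_URL = "https://public.vinai.io/PhoBERT_base_transformers.tar.gz"


def download_and_extract(url: str = MODEL_URL, dest: str = ".") -> List[str]:
    """Download a .tar.gz archive into dest, unpack it, and list its members."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / "PhoBERT_base_transformers.tar.gz"

    # Fetch the archive to disk.
    urllib.request.urlretrieve(url, str(archive))

    # Only unpack archives from a source you trust.
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

For example, `print(download_and_extract(dest="."))` shows what was unpacked into the current directory.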

Install transformers

https://github.com/huggingface/transformers
pip install transformers

Install vncorenlp

https://github.com/vncorenlp/VnCoreNLP

pip install vncorenlp
mkdir -p vncorenlp/models/wordsegmenter  
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar  
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab  
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr  
mv VnCoreNLP-1.1.1.jar vncorenlp/   
mv vi-vocab vncorenlp/models/wordsegmenter/  
mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/  
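
With the jar and segmenter models in place, word segmentation can be sketched as follows. This assumes Java is installed and uses the `vncorenlp` Python wrapper; the helper names `to_phobert_input` and `segment` and the heap size are illustrative choices, not taken from this repository.

```python
from typing import List

# Location produced by the commands above.
JAR_PATH = "vncorenlp/VnCoreNLP-1.1.1.jar"


def to_phobert_input(sentences: List[List[str]]) -> str:
    """Join segmented sentences (lists of tokens) into one space-separated string."""
    return " ".join(" ".join(tokens) for tokens in sentences)


def segment(text: str, jar_path: str = JAR_PATH) -> str:
    """Word-segment raw Vietnamese text with VnCoreNLP."""
    # Imported lazily so to_phobert_input works without Java or vncorenlp.
    from vncorenlp import VnCoreNLP

    segmenter = VnCoreNLP(jar_path, annotators="wseg", max_heap_size="-Xmx500m")
    try:
        # tokenize() returns one list of tokens per sentence; multi-syllable
        # words are joined with underscores, e.g. "sinh_viên".
        return to_phobert_input(segmenter.tokenize(text))
    finally:
        segmenter.close()  # shut down the background Java process
```

The resulting string is what gets passed to the PhoBERT tokenizer.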

Training

  • Train the model with TensorFlow Keras:
    python train_classifier_keras.py
    [image]

  • Train the model with transformers (RobertaForSequenceClassification) in PyTorch:
    python train_transformers_classifier_pytorch.py
    [image]
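
For orientation, a single fine-tuning step with `RobertaForSequenceClassification` might look like the sketch below. It is not the content of `train_transformers_classifier_pytorch.py`; the helper names `build_label_map`, `build_model`, and `train_step` are hypothetical, and the optimizer and learning rate are left to the caller.

```python
from typing import Dict, List


def build_label_map(labels: List[str]) -> Dict[str, int]:
    """Map each distinct class name to an integer id (sorted for stability)."""
    return {name: i for i, name in enumerate(sorted(set(labels)))}


def build_model(num_labels: int):
    """Load PhoBERT with a freshly initialized classification head."""
    # Imported lazily so build_label_map works without transformers installed.
    from transformers import RobertaForSequenceClassification

    return RobertaForSequenceClassification.from_pretrained(
        "vinai/phobert-base", num_labels=num_labels
    )


def train_step(model, optimizer, input_ids, attention_mask, labels) -> float:
    """Run one optimization step on a batch of tensors and return the loss."""
    model.train()
    optimizer.zero_grad()
    # Passing labels makes the model compute the classification loss itself.
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    out.loss.backward()
    optimizer.step()
    return out.loss.item()
```

A typical loop would build the model once, create an optimizer such as `torch.optim.AdamW(model.parameters())`, and call `train_step` for each batch.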
