This is an intelligent tagger for Spanish-Mixtec parallel corpora that uses CRF, LSTM, and Transformer models. It lets you train new neural network and CRF models and then incorporate them into the application to generate labeled texts. Because the newly labeled data is added to the training data, the tagger can improve its precision over time.
The application can generate synthetic tagged text in the Mixtec language using GPT-4 and GPT-4o. The generated tagged text is a useful resource for developing technologies for other languages with limited digital resources, machine translation systems, and speech recognition, among other tools.
- Clone the repository
git clone https://github.com/hermilocap/PosTaggingMixtec-Spanish.git
- Navigate to the project directory
cd PosTaggingMixtec-Spanish
- Create the virtual environment using
python -m venv env
- Activate the environment. On Windows, in PowerShell:
env\Scripts\activate.ps1
or, in Command Prompt:
env\Scripts\activate
- Install the libraries listed in the requirements file using
python -m pip install -r requirements.txt
The project contains the following subdirectories and files.
Data: The data directory, which contains the training file train.txt and the output file tags.txt.
- Train.txt. The file must contain one sentence per line, with a maximum of 25 sentences, as shown below:
Yaa yìì Ñuu Kò’yó.
NàÑuu Kò’yó nàndà’yìyó kuàtyi
nákoo tì’va kàa xí’ín kuáyì
tandà nìkisiin miímà’ñú ñu’ùn
ndànìsìsò nìka’ndi tùxìí.
Katúúnyó xìnì yiváyó yùkù kuíì
ñàkoo viiyó, xí’ín yi’ya kúuñà và’a
tyiñàndiví, nìnì’ìn tá’vikún kandú’ukún
xí’ín nduku nda’à yi’ya kà’yirañà.
Tàa ñinka ñuu kàtyira kani tná’anyó
taxí’ín xà’àrá nìxàñùrà nùù ñú’ùnkún
naki’in xìnìkún, yiváyó tyíndiví
ñii sè’e nàñuu tàxina kundaa yó’ó.
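The two constraints on the training file (one sentence per line, at most 25 sentences) can be checked before training. The following is a minimal sketch, not part of the project: the function name `load_sentences` and the decision to skip blank lines are assumptions made for illustration.

```python
MAX_SENTENCES = 25  # limit stated for train.txt


def load_sentences(text: str, max_sentences: int = MAX_SENTENCES) -> list[str]:
    """Return the sentences found in the body of a train.txt file.

    Hypothetical helper: each non-empty line is treated as one sentence,
    and blank lines are skipped (an assumption, not documented behavior).
    """
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    if len(sentences) > max_sentences:
        raise ValueError(
            f"{len(sentences)} sentences found; at most {max_sentences} are allowed"
        )
    return sentences


# Usage with the first two lines of the example above.
sample = "Yaa yìì Ñuu Kò’yó.\nNàÑuu Kò’yó nàndà’yìyó kuàtyi\n"
print(len(load_sentences(sample)), "sentences")
```

When reading the real file, open it with `encoding="utf-8"` so that the Mixtec tone marks and the ’ character are preserved.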
- Tags.txt. Contains the tagged output, with one token per line in the form token#TAG. An example is shown below:
Yaa#DA0MS0
yìì#NCMS000
Ñuu#NP00000
Kò’yó#VMSI3S0
.#Fp
NàÑuu#DA0MS0
Kò’yó#VMSI3S0
nàndà’yìyó#NCMS000
kuàtyi#NCFS000
nákoo#RN
tì’va#NCFS000
kàa#CC
xí’ín#DA0FS0
kuáyì#NCFS000
tandà#VMN0000
nìkisiin#NCMS000
miímà’ñú#NCMS000
ñu’ùn#NCMS000
ndànìsìsò#NCMS000
nìka’ndi#VMN0000
tùxìí#NCMS000
.#Fp
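The output in the example above can be read back into (token, tag) pairs for inspection or evaluation. This is a minimal sketch under the assumption that every line has the form token#TAG, as in the example; `parse_tagged_lines` is a hypothetical helper and not part of the project.

```python
from collections import Counter


def parse_tagged_lines(text: str) -> list[tuple[str, str]]:
    """Split each non-empty line of a tags.txt body into (token, tag).

    The line is split on the last '#', so a token that itself contains
    '#' is still handled. Lines without both parts raise ValueError.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        token, sep, tag = line.rpartition("#")
        if not sep or not token or not tag:
            raise ValueError(f"malformed line: {line!r}")
        pairs.append((token, tag))
    return pairs


# Usage with a few lines taken from the example above.
sample = "Yaa#DA0MS0\nyìì#NCMS000\nÑuu#NP00000\n.#Fp\n"
pairs = parse_tagged_lines(sample)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts.most_common())
```

Counting tags this way gives a quick view of the label distribution before the tagged text is added to the training data.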
Notebooks: Contains 3 Google Colab notebooks for training the CRF, LSTM, and Transformer models.
- Add environment variables.
On Windows 11, open
Settings/Advanced system settings/Environment variables
Then add 2 new environment variables.
The first is the GPT key and the second is the name of the GPT model to use.
Variable name: KEYGPT
Variable value: Your GPT key
Variable name: MODELGPT
Variable value: The name of the GPT model to use
- Run the tool as:
python AITagger.py
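If the tool cannot find the GPT settings, a quick check like the following can confirm that both variables are visible to Python. This is a sketch only: how AITagger.py actually reads the variables is not shown here, so reading them from the process environment is an assumption, and `missing_variables` is a hypothetical helper.

```python
import os

# Variable names taken from the setup step above.
REQUIRED_VARS = ("KEYGPT", "MODELGPT")


def missing_variables(env=None, names=REQUIRED_VARS):
    """Return the names from `names` that are unset or empty in `env`.

    `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        # Only the model name is printed; the key itself is never echoed.
        print("Both variables are set; model:", os.environ["MODELGPT"])
```

On Windows, a terminal that was already open when the variables were added does not see them; open a new terminal before running the check or the tool.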