Automatic Grammatical Tagger for a Spanish-Mixtec Parallel Corpus

This tool is an intelligent tagger for Spanish-Mixtec parallel corpora based on CRF, LSTM and Transformer models. You can train new neural network and CRF models and then load them into the application to generate tagged texts. Because newly tagged data can be added to the training data, the models can be retrained over time to improve precision.

The application can generate synthetic tagged text in the Mixtec language using GPT-4 and GPT-4o. The resulting tagged text is a useful resource for developing technologies for other languages with few digital resources, such as machine translation and speech recognition systems, among other tools.

Installation

Clone the repository

git clone https://github.com/hermilocap/PosTaggingMixtec-Spanish.git

Navigate to the project directory

cd PosTaggingMixtec-Spanish

Create the virtual environment using

python -m venv env

Activate the environment. On Windows, in PowerShell:

env\Scripts\activate.ps1

or, in the Command Prompt:

env\Scripts\activate

Install the dependencies listed in requirements.txt using

python -m pip install -r requirements.txt
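
If you are unsure whether the virtual environment is active when you run pip, a quick check from Python can help. This is a minimal sketch that is not part of the project; the helper name in_virtualenv is hypothetical, and it relies only on the standard library behavior that sys.prefix differs from sys.base_prefix inside an environment created with venv.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```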

Usage

Before using the tool, it helps to know the subdirectories and files that the project contains.
Data: The data directory, which contains the training file train.txt and the output file tags.txt.

Train.txt. The file must contain one sentence per line, with a maximum of 25 sentences, as shown below.

Yaa yìì Ñuu Kò’yó.
NàÑuu Kò’yó nàndà’yìyó kuàtyi
nákoo tì’va kàa xí’ín kuáyì
tandà nìkisiin miímà’ñú ñu’ùn
ndànìsìsò nìka’ndi tùxìí.
Katúúnyó xìnì yiváyó yùkù kuíì
ñàkoo viiyó, xí’ín yi’ya kúuñà và’a
tyiñàndiví, nìnì’ìn tá’vikún kandú’ukún
xí’ín nduku nda’à yi’ya kà’yirañà.
Tàa ñinka ñuu kàtyira kani tná’anyó
taxí’ín xà’àrá nìxàñùrà nùù ñú’ùnkún
naki’in xìnìkún, yiváyó tyíndiví
ñii sè’e nàñuu tàxina kundaa yó’ó.
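
The two constraints on the training file (one sentence per line, at most 25 sentences) can be checked before running the tagger. The following is a minimal sketch and not code from the project: the function name load_sentences and the choice to ignore blank lines are assumptions made for illustration.

```python
from pathlib import Path

MAX_SENTENCES = 25  # limit stated in this README


def load_sentences(path):
    """Return the non-empty lines of a training file, one sentence per line."""
    text = Path(path).read_text(encoding="utf-8")
    # Assumption: blank lines are separators and do not count as sentences.
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    if len(sentences) > MAX_SENTENCES:
        raise ValueError(
            f"{path} has {len(sentences)} sentences; the limit is {MAX_SENTENCES}"
        )
    return sentences
```

For example, load_sentences("train.txt") would return the list of sentences or raise ValueError if the file is too long; the actual location of train.txt depends on your checkout.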

Tags.txt. Contains the tagged output, with one token per line in the form token#tag. An example is shown below:

Yaa#DA0MS0
yìì#NCMS000
Ñuu#NP00000
Kò’yó#VMSI3S0
.#Fp
NàÑuu#DA0MS0
Kò’yó#VMSI3S0
nàndà’yìyó#NCMS000
kuàtyi#NCFS000
nákoo#RN
tì’va#NCFS000
kàa#CC
xí’ín#DA0FS0
kuáyì#NCFS000
tandà#VMN0000
nìkisiin#NCMS000
miímà’ñú#NCMS000
ñu’ùn#NCMS000
ndànìsìsò#NCMS000
nìka’ndi#VMN0000
tùxìí#NCMS000
.#Fp
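
To reuse the tagged output in other tools, each line can be split into a token and its tag. The sketch below is an illustration, not the project's own reader: the function names are hypothetical, and splitting on the last "#" is an assumption chosen so that a token containing "#" would stay intact.

```python
def parse_tagged_line(line):
    """Split one 'token#tag' line into a (token, tag) pair."""
    # rpartition splits on the LAST '#', so the tag is whatever follows it.
    token, sep, tag = line.strip().rpartition("#")
    if not sep or not token or not tag:
        raise ValueError(f"not a token#tag line: {line!r}")
    return token, tag


def parse_tagged_text(text):
    """Parse every non-empty line of a tags file's contents."""
    return [parse_tagged_line(line) for line in text.splitlines() if line.strip()]


sample = "Yaa#DA0MS0\nyìì#NCMS000\n.#Fp\n"
pairs = parse_tagged_text(sample)
print(pairs)  # [('Yaa', 'DA0MS0'), ('yìì', 'NCMS000'), ('.', 'Fp')]
```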

Notebooks: Contains three Google Colab notebooks for training the CRF, LSTM, and Transformer models.

Add environment variables. On Windows 11, open Settings/Advanced system settings/Environment variables.
Then add two new environment variables.
The first holds your GPT key and the second holds the name of the GPT model to use.
Variable name: KEYGPT. Variable value: your GPT key.
Variable name: MODELGPT. Variable value: the name of the GPT model to use.
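
Before launching the tool, you may want to confirm that both variables are visible to Python. This is a minimal sketch under the assumption that the application reads KEYGPT and MODELGPT from the process environment; the helper read_gpt_settings is hypothetical and is not taken from AITagger.py.

```python
import os


def read_gpt_settings(environ=os.environ):
    """Return (key, model) from KEYGPT and MODELGPT, or raise if one is missing."""
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in ("KEYGPT", "MODELGPT") if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return environ["KEYGPT"], environ["MODELGPT"]
```

Calling read_gpt_settings() in a new terminal is a quick way to verify the setup; on Windows, a terminal opened before the variables were added will not see them.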
Run the tool as:

python AITagger.py

Tagging. The main screen of the application is then displayed.
The steps to tag a corpus are detailed below.
1.- Select the path of your input file.
2.- Select the path of your input file.
3.- Press AITagger to start.
4.- The app shows the results and generates the output file.

ElsevierSoftwareX/SOFTX-D-24-00345
