Skip to content

unhammer/evttohus

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Various scripts and such used to create dictionary candidates for nob2smj and nob2sma from nob2sme dictionary + corpora.

The scripts are as of now quite tied to Giellatekno’s formats and infrastructure, and would probably need a lot of work to be made generally re-usable.

Prerequisites

The main prerequisites are

  • CorpusTools
  • the sme/nob/sma/smj analysers from $GTHOME/langs
  • all sme/nob/sma/smj-related folders from $GTHOME/words/dicts

See http://giellatekno.uit.no/doc/infra/GettingStarted.html for how to set things up – there’s no need to mess up your ~/.bashrc, but you’ll need to set at least the $GTHOME/$GTFREE/$GTCORE (optionally $GTBOUND) variables, and for now the Xerox tools (xfst/lookup etc.) are needed.

Running the candidate generation

Run make to make the stuff; the first time around you need to also run make corpus (just runs convert2xml on your corpora).

Output file format

The results will appear in the out/ directory.

Folders with two language codes are intermediate outputs, where e.g. out/smesmj has a two-column format where the first word is a sme input and the second is a smj candidate.

The final files appear in out/nobsmjsme and out/nobsmasme, they include both nob and sme translations for the candidates, annotated with normalised frequencies for all the three words as well as number of hits in parallel sentences. The format is, tab-separated:

nob 	candidate	sme	fr_nob	fr_candidate	fr_sme	para_hits

(Note that parallel counts can actually be higher than fr_candidate, since they consider dynamic compounds as hits for the concatenated lemmas.)

Kintel files

The files named *_kintel_* have been merged with words from nobsmj-kintel. For every candidate we generate (whether from smj or nob), we also include the Kintel translation of the Bokmål word. The words are grouped by the Bokmål word, and the Kintel translations are marked with an initial `+’ symbol. Some times the Kintel translation was part of our generated candidates, in which case it’ll have frequency numbers, other times it was not, in which case only the nob and smj words will be listed for that line.

Output filename format

The filenames have this format:

PoS_method_inFST_NN_sourcelang

where PoS is part of speech (V, N, A or nonVNA for “the rest”) and method is one of

decomp
input is compound analysed, parts are translated with existing dictionaries and glued back together
precomp
existing dictionaries are compound analysed to create a dictionary of compound-part-translations; then input is compound analysed, parts are translated using the decompounded dictionaries, and glued back together
anymalign
from parallel word alignment (see para/anymalign)
xfst
using $GTHOME/words/dicts/smesmj/scripts/sme2smj-$PoS.fst
lexc
using $GTHOME/words/dicts/smesmj/bin/smesmj.fst
kintel
candidates here might come from other files, but there is also a suggestion from Kintel for every Bokmål word, pre-marked with a plus sign (+).
loan
input is translated using very simple loan-word regex replacement rules

The sourcelang is the input for the method (nob or sme), while inFST is “ana” or “noana” depending on whether the word had an analysis with the right PoS in $GTHOME/langs/${lang}/src/analyser-gt-desc.xfst.

The numbers (NN above) indicate the frequency rank; the 00 file contains the 1000 highest-frequency candidates, the 01 the 1000 next-highest-frequency candidates, and so on. Within each file, candidates are sorted alphabetically by the reverse of the source string (so e.g. all words ending in «-miljø» will be near each other).

Running the parallel word alignment

Follow para/anymalign/README.org to run the alignment. To get them into the same format as the other files, in this directory do

make anymalign
make

You should now have some files in out/nob*sme/*anymalign*

Running the candidate-respelling

… currently uses some ocaml stuff, TODO

stuff

See TODO.org.

Quality impressions

In general:

  • candidates from _decomp are better than _precomp
  • candidates from _sme are better than _nob
  • candidates from _multis are better than _singles
  • candidates from _ana are better than _noana

About

a jumble of scripts for generating translation candidates

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published