V IV Inf @FS-IMV #7->6
- "." CLB #8->8
+ "." CLB #8->8
- / \ \
+ / \ \
fal munnje \
/ \
doavttir boahtit.
- goas
+ goas
Our dependency structure is based upon a compromise between the Saami
grammatical tradition and the conventions used within [the visl
-Verb tags
+## Verb tags
The Saami disambiguation file `disambiguator.cg3` adds dependency tags to each
cohort. The CG verb tags are substituted with these tags:
-- <mv> <aux>
+- <mv> <aux>
There are main verbs and auxiliary verbs.
In main clauses: Finite main verb and auxiliary verb, and infinite main
verb and auxiliary verb.
-- @FS-STA @FS-N<
+- @FS-STA @FS-N<
`@FS` marks a finite verb in a subclause: `@FS-STA` in a subclause that
functions as a statement, `@FS-N<` in a relative subclause.
These are infinite main verbs and auxiliary verbs in an ordinary
subclause and in a relative subclause.
-- <ctjHead>
+- <ctjHead>
This tag helps in coordination contexts.
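
To see how these tags sit next to the dependency relation on a reading, here is a minimal Python sketch. It assumes a reading's tag string looks like the fragment `V IV Inf @FS-IMV #7->6` from the tree example, where `#7->6` is read as "cohort 7 depends on cohort 6". The names `Reading` and `parse_reading` are hypothetical helpers for illustration and are not part of `disambiguator.cg3` or of any CG tool.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: a dependency relation is written as "#<self>-><parent>".
DEP_RE = re.compile(r"#(\d+)->(\d+)")


class Reading(NamedTuple):
    morph: List[str]            # tags that are neither @-functions nor #-relations
    function: Optional[str]     # e.g. "@FS-IMV", or None if absent
    self_id: Optional[int]      # number of this cohort
    parent_id: Optional[int]    # number of the head cohort


def parse_reading(tags: str) -> Reading:
    """Split a whitespace-separated tag string into its three kinds of tags."""
    morph: List[str] = []
    function = None
    self_id = None
    parent_id = None
    for tok in tags.split():
        dep = DEP_RE.fullmatch(tok)
        if dep:
            self_id, parent_id = int(dep.group(1)), int(dep.group(2))
        elif tok.startswith("@"):
            function = tok
        else:
            morph.append(tok)
    return Reading(morph, function, self_id, parent_id)
```

In the sample fragment `"." CLB #8->8`, the cohort has the same number for itself and its parent, so a caller can compare `self_id` with `parent_id` to detect that case.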
-Dependency tags alphabetically
+## Dependency tags alphabetically
Dependency tags look different from syntactic grammar tags.
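
Judging from pairs such as `@<OBJ` / `@OBJ>` and `@>N` / `@N<`, the angle bracket in a tag name appears to point towards the head. The following Python sketch reads the direction off a tag name under that assumption. `head_direction` is a hypothetical helper, and it deliberately returns `None` for tags it cannot classify: those with no bracket, those with brackets in both directions, and those that do not start with `@`.

```python
from typing import Optional


def head_direction(tag: str) -> Optional[str]:
    """Return 'left' or 'right' for the side on which the head is expected.

    Assumption: '<' means the head is to the left, '>' means it is to the
    right. Tags with no bracket, or with both kinds, are not classified.
    """
    if not tag.startswith("@"):
        return None
    has_right = ">" in tag
    has_left = "<" in tag
    if has_right == has_left:  # neither bracket, or both
        return None
    return "right" if has_right else "left"
```

For example, `head_direction("@<OBJ")` gives `"left"` (the object stands to the right of its verb), while `head_direction("@ADVL")` gives `None`.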
-- **@>A:**
- Modifier of an adjective to the left.
- - **nu (Adv):**
- *Gulahallan Sámedikkiin dán gažaldagas šaddá *nu*
- konkrehtalažžan go vejolaš. - 'The discussion in the Saami
- Parliament about this issue gets *as* concrete as possible.'*
-- **@A<:**
- Modifier of an adjective to the right.
- - **:**
-- **@>Adv:**
- Modifier of an adverb.
- - **:**
-- **@Adv<:**
- Complement of an adverb.
- - **:**
-- **@ADVL:**
- Sentence adverbial.
- - **dál (Adv):**
- **Dál* lea Bireha vuorru. - 'It is Biret's turn *now*.'*
-- **@>ADVL:**
- Modifier of an adverbial.
- - **Man (Pron):**
- **Man* dávjá don lávet fitnat doppe? - '*How* often do you
- usually go there?'*
-- **@<ADVL:**
- adverbial to the right of the finite verb
- - **beaivvážis (N):**
- *Gávpot ii dárbbaš čuovgga *beaivvážis* ii ge mánus. - 'The city
- does not need light *from the sun* and not the from the moon
- either.*
-- **@ADVL<:**
- Complement of an adverbial.
- - **vahkus (N):**
- *Mun málestan guktii *vahkus*. - 'I make food twice a *week*.'*
-- **@ADVL>:**
- Adverbial to the left of the finite verb.
- - **lasttain (N):**
- *Ja muora *lasttain* ožžot álbmogat dearvvašvuođa. - 'And from
- the tree's *leaves*, the people get health.'*
-- **@ADVL>CS:**
- adverbial modifying a conjunction
- - **dallah (Adv):**
- **Dallah* goh Jeesuse tjaetseste tjuedtjele, dellie vuajna Elmie
- rihpesåvva jih Voejkene altasasse goh ledtie suaja. - '*Right*
- after Jesus stood up from the water, he sees that heaven opens
- and the holy spirit flies to him like a bird.'
- (*sma*)*
- - **dan dihte (Adv):**
- *Muhto go lassánedje olbmot, de bohte čáhppesbivttasolbmot fas
- dohko, gosa ledje sámit vuohččan ballán, ja dahke orohagaid jur
- dasa gos sámit ledje orrume, *dan dihte* go sii oidne, ahte das
- leai čáppa gieddi, maid ledje bohccot dutken, gožžan ja baikán —
- gos ledje sámit orron mánga olmmošbuolvva.*
-- **@ADVL<OBJ:**
- - **:**
-- **@ADVL>SUBJ:**
- - **:**
-- **@AGENS>:**
- kal
- - **atorfilittanit:**
- *Attartortumiit piginnittumut aaqqissuussineq
- namminersornerusuni atorfilittanit politikerinillu nuimasunit
- isertortumik atornerlunneqarsimammat illoqarfinni anginerni
- pingasuni attartortut nalinginnaasut pillarneqartussanngorput.*
-- **@APP-ADVL<:**
- Apposition to an adverbial to the left. If the apposition consists
- of more than one word, the head will get this tag.
- - **ovdal (Pr):**
- *Dolin, *ovdal* soađi, olbmot lávejedje vuovdit joŋaid. - 'In
- old times, *before* the war, people used to sell cowberries.'*
-- **@APP-N<:**
- Apposition to a noun to the left of it. If the apposition is more
- than one constituent, the head will get this tag.
- - **eatnigiela (N):**
- *Viimmat mun ohppen čállit sámegiela, mu *eatnigiela*. -
- 'Finally, I learned to write in Sámi, my *mother tongue*.'*
-- **@APP-Num<:**
- Apposition to a numeral to the left.
- - **suinniid (N):**
- *Juohke heasta borrá sullii 6 kilu *suinniid* beaivái. - 'Every
- horse eats approximately 6 kilograms of *grass* a day.'*
-- **@APP>Pron:**
- Apposition to a pronoun to the right. If the apposition is more than
- one constituent, the head will get this tag.
- - **Turner (N Prop):**
- *Muhto diet Will *Turner*, son nai lea fiinna olmmái. - 'But
- this Will *Turner*, he is also a nice guy.'*
-- **@APP-Pron<:**
- Apposition to a pronoun to the left. If the apposition is more than
- one constituent, the head will get this tag.
- - **olbmái (N):**
- *Dan mun muitalan dušše dutnje, mu buoremus *olbmái*. - 'This I
- tell only you, my best *friend*.'*
-- **@>CC:**
- modifier of CC
- - **sihke (CC):**
-- **@>CC:**
- modifier of CC
- - **sihke (CC):**
-- **@CL-ADVL>:**
- - **:**
-- **@CL-<ADVL:**
- - **:**
-- **@CMPND:**
- First part of a compound followed by a hyphen
- - **skaehtie-:**
- *Reerenasse galka båetije stoerredigkieboelhkesne jåerhkedh dam
- *skaehtie-* jïh åasadaltesem mij lea daelie, jïh daennie
- daltesisnie hov lea nuepie buerebe joekedimmiem darjodh.*
-- **@CNP:**
- Local conjunction or subjunction.
- - **ja (CC):**
- *Sihke Mázes *ja* Guovdageainnus leat boarrásat viššalit finadan
- doaibmaguovddážiin. - 'Both in Máze *and* Guovdageaidnu, the
- oldest people frequently got to the activitycentre.'*
- - **go (CS):**
- *Sámi geavaheaddjit hállet dávjá metaforaiguin ja sis leat ollu
- eará gulahallanvuogit *go* giella. - 'Saami users speak often in
- metaphores and the have many other ways of comunicating *than*
- by means of language.'*
-- **@COMPL-CS<:**
- Complement of subjunction.
- - **vejolaš (A):**
- *Gulahallan Sámedikkiin dán gažaldagas šaddá nu konkrehtalažžan
- go *vejolaš*. - 'The contact with the Saami Parliament about
- this issue gets as concrete as *possible*.'*
-- **@CVP:**
- Conjunction or subjunction that conjoins finite verb phrases
- - **ja (CC):**
- *Bealatjogas leat dološ rájes leamaš bálvvossajit *ja* dát
- golbma sieiddi ledje dovddus gitta olgoriikii. - 'Long since,
- there have been sacrificial sites at Bealatjohka *and* the three
- 'sieidi' (cult images) were known even abroad.*
- - **go (CS):**
- *Leago guhkes áigi dassá *go* Máreha oidnet? - 'Has it been a
- long time *since* you have seen Máret?'*
-- **@FAUX:**
- finite auxiliary
- - **ledje (V):**
- *Gávpotmuvrra vuođđogeađggit *ledje* čiŋahuvvon juohke lágán
- divrras geđggiiguin. - \`The cornerstones of the wall *were*
- decorated with every kind of expensive stones.'*
-- **@-F<ADVL:**
- Adverbial of infinite verb outside of the predicate
- - **árbbolaččain (N):**
- *Danne dárbbašit mii oažžut lobi Nils Aslak Valkeapää
- *árbbolaččain* almmuhit dán guokte lávlaga min sálbma-CD:s. -
- \`Therefore we need to get permission from Nils Aslak
- Valkeapää's *heirs* to release these two songs on our
- psalm-CD.'*
-- **@-FADVL>:**
- Adverbial of infinite verb outside the predicate
- - **várrogasat (Adv):**
- *Dihkkadeaddji rávve skohtervuddjiid *várrogasat* mátkkoštit.
- \`The roadman warns snowscooter drivers to drive *carefully*.'*
-- **@FMV:**
- finite mainverb
- - **lei (V):**
- *Gávpot lei njealječiegat, seammá guhkki go govdat. - \`The city
- *was* a square, same width as length.'*
-- **@FMVdic:**
- - **muitala (V):**
- *Ja go geassit eret dábálaš goluid, de lea buhtes sisaboahtu
- sullii 100 000 ruvnnu, *muitala* Eriksen. - \`And when we take
- away/subtract? the regular expenses, there is a remaining income
- of about 100 000 crowns, *says* Eriksen.'*
-- **@-F<OBJ:**
- Object of infinite verb outside the verbal to the right of it.
- - **govaid (N):**
- *Boađe mu lusa geahččat *govaid*! - \`Come to me and look *at
- the pictures*!'*
-- **@-FOBJ>:**
- Object of infinite verb outside the verbal to the left of it.
- - **váldovuoittuid (N):**
- *Valáštallanhálla lei njealjehas dievva olbmuiguin geat vurde
- *váldovuoittuid* fasket. - \`The gymn was to a quarter full of
- people that wait to grab *the main prizes*.'*
-- **@-F<SPRED:**
- - **duhtavaččat (A):**
- *IL Nordlysa beaivválaš jođiheaddji, Nils Peder Eriksen, lohká
- iežaset leat oalle *duhtavaččat* dán jagáš básárdoaluin.*
-- **@FM-SPRED<:**
- main clause functioning as a subject predicate to the right of
- another main clause
- - **ii (V):**
- *Ja dasa lea dát sivva: go sápmelaš boahtá moskkus gámmirii, de
- son ii *ii* ipmir ii báljo maidege, go ii biegga beasa bossut
- njuni vuostá. - \`And this is the reason: if a Saami comes ...,
- then he does *not* understand ...'*
-- **@FS-ADVL>:**
- subclause functioning as an adverbial to the finite verb of the main
- clause to the right of it.
- - **bohtet (V):**
- *Ja mo jos Muhtinlágan Stálu ustibat *bohtet* fitnat. - \`And
- what if the friends of some-kind-of troll *come* for a visit.'*
-- **@FS-<ADVL:**
- subclause functioning as an adverbial to the finite verb of the main
- clause to the left of it.
- - **galggai (V):**
- *Go *galggai* bargat rehkenastimiin sus šattai álo
- oaivvebávččas. - \`When they *should* work with arithmetics, she
- always got a headache.'*
-- **@FS-IAUX:**
- subclause infinite auxiliary
- - **sáhte:**
- *Mun in *sáhte* muitalit dán dutnje. - \`I *can*not tell you
- this.'*
-- **@FS-IMV:**
- subclause infinite mainverb
- - **ohcamin (V):**
- *Naba jos eadni lea sudno *ohcamin*, iige gávnna. - \`And if
- mother is *searching* for them, she will not find them.'*
-- **@FS-N<:**
- finite verb (either an auxiliary or main verb) of a relative
- subclause (with a noun (N) antecedent)
- - **lea (V):**
- *De son viežžá liegga liema ruittus mii *lea* oapmana alde.
- \`Then he fetched warm broth from the pot that *is* on the
- stove.'*
-- **@FS-N<IAUX:**
- infinite auxiliary of a (relative) subclause
- - **sáhttán (V):**
- *Mun oidnen nieidda gii ii *sáhttán* boahtit. - \`I saw the girl
- that *could* not come.'*
-- **@FS-N<IMV:**
- infinite mainverb of a (relative) subclause
- - **bargan (V):**
- *Mon lean okta sápmelaš, guhte lean *bargan* visot sámi bargguid
- ja mon dovddan visot sámi dili. - \`I am a Sámi, who has
- *worked* in all Saami occupations and I know all Saami
- affairs.'*
-- **@FS-OBJ:**
- finite verb of the subclause that has an object function
- - **leahkkasii (V):**
- *Arne ii fuobmán ahte uksa *leahkkasii*. - \`Arne did not notice
- that the door *opened*.'*
-- **@FS-OBJ>:**
- finite verb of a subclause that has object function used for kal
- e.g.
- - **pigilissagaa (V):**
- *Nunani issittuni sila pillugu tunngaviusumik ilisimasalernissaq
- silamut ilisimatusarfiup pigilissagaa ministerit marluullutik
- isumaqatigiipput, silallu allanngoriartornerata sunniutai
- maluginiarneqassasut.*
-- **@FS-P<:**
- finite verb of a subclause
- - **eru (V):**
- *Tað er ikki longur pláss fyri, at lutir og kenslur bara eru. -
- 'There is no longer any space for that things and feelings
- simply are.'*
- - **:**
- *(fao)*
-- **@FS-P<IMV:**
- finite verb of a subclause in fao
- - **:**
-- **@-F<OPRED:**
- - **:**
-- **@-FSUBJ>:**
- subject of a verbal infinitival object
- - **mánáid (N):**
- *Muhtinlágan Stállu cáhpá goikebierggu sudnuide ja dáhttu
- *mánáid* boradit. - 'Some-sort-of troll cuts dried meat for them
- and asks *the children* to eat.'*
-- **@FS-SUBJ:**
- finite verb head of a subordinate clause functioning as a subject
- - **boađát (V):**
- *Dehálaš lea ahte don maid *boađát*. - 'It is important that you
- *come* too.'*
-- **@FS-VFIN<:**
- - **eai (V):**
- *Idja ii leat šat, *eai* ge sii dárbbaš lámppá dahje beaivváža
- čuovgga, dasgo Hearrá Ipmil lea sin čuovga. - \`The night is not
- anymore, they do *not* need the lamp- or day- light either,
- because God the Lord is their light.'*
-- **@HAB:**
- Habitive, for a human target in illative or locative in a habitive
- construction (copula), is translated as "have". Possible verbs in a
- habitive construction are "boahtit" 'come', "leat" 'be', "goallut"
- 'pass', "heaŋgát", 'hang', "jápmit" 'die', "šaddat" 'become'.
- - **Máhtes (N):**
- **Máhtes* lea beana. - '*Máhtte* has a dog.'*
- - **sus (Pron):**
- **Sus* šattai álo oaivvebávččas go galggai bargat
- rehkenastimiin. - '*She* always got a headache when she was
- supposed to work with arithmetics.'*
-- **@HNOUN:**
- Stray noun in sentence fragments.
- - **boddu (N):**
- *Vuosttaš *boddu*. - 'First *lesson*.'*
-- **@IAUX:**
- non-finite auxiliary
- - **veaje (V):**
- *Dattetge ii *veaje* oađđit. - \`Still she did not *manage* to
- sleep.'*
-- **@ICL-ADVL:**
- infinitival clause adverbial
- - **árvvoštallat (V):**
- *Son namuha ahte sii leat gal ávžžuhan olbmuid geat dihtet ahte
- dárbbašit jođánit beassat buohccevissui, nugo ovdamearkka dihte
- áhpehis nissonolbmuid geain lahkona riegádahttináigi,
- *árvvoštallat* galget go ovdal go buohccájit juo vuolgit
- buohccevissui.*
-- **@ICL-OBJ:**
- infinitival clause object
- - **boradit (V):**
- *Muhtinlágan Stállu cáhpá goikebierggu sudnuide ja dáhttu mánáid
- *boradit*. - 'Some-sort-of troll cuts dried meat for them and
- asks the children *to eat*.'*
-- **@ICL-P<:**
- infinitival complement of a preposition
- - **skriva (V):**
- *Kenslan gav Unn íblástur til at skriva nakrar yrkingar um
- næstrakærleika.*
- - **:**
- *(fao)*
-- **@ICL-SUBJ:**
- infinitival subject
- - **sløkkja (V):**
- *Men tað er líka skjótt at sløkkja ljósið, lata eygu og oyru
- aftur og rista ábyrgdina av okkum.*
- - **:**
- *(fao)*
-- **@IM:**
- fao
- - **at (IM):**
- *Tá koyrdi Harrin Guð hann út úr aldingarðinum í Eden og setti
- hann til at dyrka ta jørð, sum hann var sjálvur tikin av.*
-- **@IMV:**
- non-finite mainverb
- - **čiŋahuvvon (V):**
- *Gávpotmuvrra vuođđogeađggit ledje *čiŋahuvvon* juohke lágán
- divrras geđggiiguin. - \`The cornerstones of the wall were
- *decorated* with every kind of expensive stones.'*
-- **@INF->N:**
- kal
- - **pillugu (V):**
- *Nunani issittuni sila pillugu tunngaviusumik ilisimasalernissaq
- silamut ilisimatusarfiup pigilissagaa ministerit marluullutik
- isumaqatigiipput, silallu allanngoriartornerata sunniutai
- maluginiarneqassasut.*
-- **@INTERJ:**
- Interjection.
- - **maid (Interj):**
- **Maid*, iigo leat boahtán? - '*What*, hasn't he/she come?'*
-- **@<IOBJ:**
- indirect object to the right of the finite verb.
- - **(N):**
-- **@IOBJ>:**
- Indirect object to the left of the finite verb.
- - **(N):**
-- **@MIK-OBJ:**
- kal
- - **illunik (N):**
- *Namminersornerusut Nuummi illunik ima amerlatigisunik
- tunisaqarsimalerput aningaasanut inatsit
- iluatsitaariniarfigisariaqalerlugu inissianik isatereriarlutik
- nutaanik sanaartortariaqaleramik atorfilittatik naammaginartunik
- inissaqartissinnaajumallugit.*
-- **@>N:**
- Prenominal modifier to the left
- - **geavatlaš (A):**
- *Ráđđehussii lea *geavatlaš* politihkka deaŧalaš. - 'For the
- government, *practical* politics is important.'*
- - **oahppo-:**
- **Oahppo-* ja dutkanministtar dat lea ráhkadan dieđáhusa alit
- sámi oahpu ja dutkama birra. - 'The secretary for *education*
- and research has given a notice about Saami higher education and
- research.'*
- - **rektor (N):**
- **Rektor* Tove Bull álgaga mielde... - 'According to *principal*
- Tove Bull ...'*
- - **Tove (N Prop):**
- *Rektor *Tove* Bull álgaga mielde... - 'According to principal
- *Tove* Bull ...'*
-- **@N<:**
- Modifier of the noun to the left.
- - **33 (Num):**
- *Mun lean ilus go beasan ovdanbuktit St.dieđ. nr. *33*. - 'I am
- happy that I get the opportunity to present the parliament
- notice number *33*.'* (In this case *33* modifies *St.dieđ.*.)
- - **vihtta (Num):**
- *Mun boađán diibmu *vihtta*. - 'I will come at *five* o'clock.'*
-- **@>Num:**
- Attributes of numeral to the right.
- - **nr (N):**
- *Mun lean ilus go beasan ovdanbuktit St.dieđ. *nr.* 33. - 'I am
- happy that I get the opportunity to present the parliament
- notice *number* 33.'*
-- **@Num<:**
- Attributes of numeral to the left.
- - **jagi (N):**
- *Son lea guoktelogi *jagi* boaris. - 'She/he is twenty *years*
- old.'*
-- **@<OBJ:**
- Direct object to the right of the finite verb.
- - **áiggi (N):**
- *Dat gáibida ollu *áiggi*. - 'That demands a lot of *time*.'*
-- **@OBJ>:**
- Direct object to the left of the finite verb.
- - **maid:**
- *Filbma lea oassi prošeavttas *maid* Sámi instituhtta lea
- ruthadan. - 'The film is a part of the project that the Saami
- institute has financed.'*
-- **@OPRED>:**
- Object predicative to the left of the finite verb.
- - **luoikkasin (N):**
- *Gaup dojii stivrrana hárjehallamiin, muhto oaččui *luoikkasin*
- eará stivrrana. - 'Gaup broke the handlebars during the
- practises, but got to *borrow* another steering.'*
-- **@<OPRED:**
- Object predicative.
- - **buriid (A):**
- *Gáhkkuid son ráhkada hui *buriid*. - 'Cakes, she/he makes
- really *good ones*.'*
- - **sámegielhállin (N):**
- *Dagat iežat *sámegielhállin*. - 'You make yourself *a Saami
- speaker*.'*
-- **@>P:**
- Complement of postposition to the left of it.
- - **oahpu (N), dutkama (N):**
- *Oahppo- ja dutkanministtar dat lea ráhkadan dieđáhusa alit sámi
- *oahpu* ja *dutkama* birra. - 'The secretary for education and
- research has given a notice about Saami higher *education* and
- *research*.'*
-- **@P<:**
- Complement of preposition to the right of it.
- - **oasálaččaid (N):**
- *Finnmárkkus ii goassige leat leamaš ságastallan gaskal muhtun
- muddui seammadássásaš *oasálaččaid*. - 'There has never been a
- discussion in Finnmark between somehow equal *parts*.'*
-- **@PCLE:**
- Particle.
- - **amma (Pcle):**
- **Amma* mii eat leat máksán? - 'We haven't paid, *have we*?'*
-- **@POSS>:**
- kal
- - **Jiisusi-Kristusip (N):**
- **Jiisusi-Kristusip*, Daavip ernerata, Aaperaap ernerata,
- eqqarliisa allassimaffiat.*
-- **@PPRED:**
- a predicative with a predicative as its head
- - **reaŋgan (N):**
- *Máhtes lea Jovnna *reaŋgan*. - 'Máhtte has Jovnna *as a
- searvant*.'*
-- **@>Pron:**
- Modifier of a pronoun to the left of it.
- - **buot (Pron):**
- *Mun, Johanas, lean dat guhte lean gullan ja oaidnán *buot*
- dán. - 'I, Johanas, am the one who has heard and seen *all* of
- it.'*
-- **@Pron<:**
- Modifier of pronoun to the right of it.
- - **ipmašiid (N):**
- *Maid *ipmašiid* doppe dagat? - 'What *the heck* are you doing
- there?'*
- - **golmmas (N):**
- *Mii *golmmas* oktan du vieljain finaimet Niillas-čeazi
- geahčen. - 'We *three* together with your brother visited uncle
- Niillas.'*
-- **@SPRED:**
- Subject predicative in elliptical sentences.
- - **nommh (N):**
- *Die maa onterligksh nommh, ih goh tuhtjh, men die ligan
- onterligksh nierretjh aaj. - ' '*
- - **:**
- *(sma)*
-- **@<SPRED:**
- Subject predicative to the right of the finite verb.
- - **galbmasat (A):**
- *Mus leat gieđat nu *galbmasat*. - 'My hands are so *cold*.'*
- - **beana (N):**
- *Mus lea *beana*. - 'I have *a dog*.'*
-- **@SPRED>:**
- Subject predicative to the right of the finite verb.
- - **vuođđun (N):**
- *Kommišuvnna evttohusaid *vuođđun* lea guohtundilalašvuođaid
- vuđolaš čielggadeapmi, man fágalávdegotti ášše-dovdit dahke.*
-- **@SPRED<OBJ:**
- - **:**
-- **@SUBJ:**
- Elliptical subject.
- - **ålma (N):**
- *Dennie synnagovgesne jis akte ålma maam doenh-aajmoe
- doerelamme. - ' '*
-- **@SUBJ>:**
- Subject to the left of the finite verb.
- - **son (Pron):**
- **Son* lea mu oabbá. - '*She* is my sister.'*
- - **luopmánat (N):**
- *Jeakkis leat *luopmánat*. - 'There are *cloudberries* in the
- swamp.'*
-- **@<SUBJ:**
- Subject to the right of the finite verb.
- - **ollusat (Pron):**
- *...ja dan vejolašvuođa orro gal *ollusat* geavahan. - '...and
- this opportunity, *many* seem to make use of.'*
-- **@SUBJ<ADVL:**
- - **:**
-- **@SUBJ\_COMP:**
- fao
- - **:**
-- **@<SUBJ\_COMP:**
- predicate of a subject/ subject complement (kal)
- - **:**
-- **@SUBJ<OBJ:**
- - **:**
-- **@tSUBJ:**
- Elliptical subject.
- - **tað (Pron):**
- **Tað* er ikki longur pláss fyri, at lutir og kenslur bara eru.*
-- **@i-ADVL>:**
- kal
- - **Babylonimut (N):**
- *Josijap Jekonja qatanngutaalu Babylonimut aallarussaanerup
- nalaani.*
-- **@i-><ADVL:**
- kal
- - **pruffiitikkut (N):**
- *Tamakku tamarmik pipput Naalakkap pruffiitikkut oqaaserisaa
- eqquuteqqullugu, oqarmat: »Takuat, niviarsiaq naartulissaaq
- ernertaassallunilu, atsissavaallu Immanuelimik« – imaappoq:
- Guuti ilagaarput.*
-- **@i->N:**
- kal
- - **naammaginartunik (N):**
- *Namminersornerusut Nuummi illunik ima amerlatigisunik
- tunisaqarsimalerput aningaasanut inatsit
- iluatsitaariniarfigisariaqalerlugu inissianik isatereriarlutik
- nutaanik sanaartortariaqaleramik atorfilittatik naammaginartunik
- inissaqartissinnaajumallugit.*
-- **@i->V:**
- kal
- - **tutinneq (N):**
- *Ernertaartinnaguli tutinneq ajorpaa.*
-- **>@V:**
- kal
- - **:**
-- **@VOC:**
- Vocative.
- - **hearrá:**
- **Hearrá*, du ráhkis ustit lea buohcci. - '*Lord*, your beloved
- friend is ill.'*
-- **<ctjHead>:**
- coordinated head, can be of different PoS' (V, A, N ...). The PoS
- taking part in coordination do not necessarily be of the same kind.
- The tag is useful if the coordinated part does not directly follow
- it's predecessor.
- - **geahččá:**
- *Dat *geahččá* nuppiid stáluide girkes čalmmiiguin ja reašká
- romet. - \`He/she *looks* at the other trolls with clear eyes
- and laughs hideously.'*
- - **stuorrát:**
- *Gieđat leat *stuorrát* dego steaikabánnot ja guolgan. - \`The
- hands are as *big* as fryingpans and covered with hair.'*
- - **soalsin:**
- *Skávžá lea buot *soalsin* ja njuoskkas. - \`The beard is all
- *covered with spit* and wet.'*
-- **<mv>:**
- main verb, especially useful in cases where the verb can be both a
- main verb and an auxiliary
- - **dárbbašit, oažžut:**
- *Danne *dárbbašit* mii *oažžut* lobi Nils Aslak Valkeapää
- árbbolaččain almmuhit dán guokte lávlaga min sálbma-CD:s. -
- \`Therefore we *need* to *get* permission from Nils Aslak
- Valkeapää's heirs to release these two songs on our psalm-CD.'*
-- **<vdic>:**
- verba dicendi, those that introduce direct speech, typically words
- of communication such as lohkat, cealkat, dadjat, oaivildit
- - **celkkii:**
- *Eŋgel *celkkii* munnje: Dát leat luohtehahtti ja duohta
- sánit. - \`The angel *told* me: These are trustworthy and true
- words.'*
+- **@>A:**
+ Modifier of an adjective to the left.
+ - **nu (Adv):**
+ Gulahallan Sámedikkiin dán gažaldagas šaddá _nu_
+ konkrehtalažžan go vejolaš. - 'The contact with the Saami
+ Parliament about this issue gets _as_ concrete as possible.'
+- **@A<:**
+ Modifier of an adjective to the right.
+ - **:**
+- **@>Adv:**
+ Modifier of an adverb.
+ - **:**
+- **@Adv<:**
+ Complement of an adverb.
+ - **:**
+- **@ADVL:**
+ Sentence adverbial.
+ - **dál (Adv):**
+ _Dál_ lea Bireha vuorru. - 'It is Biret's turn _now_.'
+- **@>ADVL:**
+ Modifier of an adverbial.
+ - **Man (Pron):**
+ _Man_ dávjá don lávet fitnat doppe? - '_How_ often do you
+ usually go there?'
+- **@<ADVL:**
+ adverbial to the right of the finite verb
+ - **beaivvážis (N):**
+ Gávpot ii dárbbaš čuovgga _beaivvážis_ ii ge mánus. - 'The city
+ does not need light _from the sun_, nor from the moon.'
+- **@ADVL<:**
+ Complement of an adverbial.
+ - **vahkus (N):**
+ Mun málestan guktii _vahkus_. - 'I make food twice a _week_.'
+- **@ADVL>:**
+ Adverbial to the left of the finite verb.
+ - **lasttain (N):**
+ Ja muora _lasttain_ ožžot álbmogat dearvvašvuođa. - 'And from
+ the tree's _leaves_, the people get health.'
+- **@ADVL>CS:**
+ Adverbial modifying a subjunction (CS).
+ - **dallah (Adv):**
+ _Dallah_ goh Jeesuse tjaetseste tjuedtjele, dellie vuajna Elmie
+ rihpesåvva jih Voejkene altasasse goh ledtie suaja. - '_Right_
+ after Jesus stood up from the water, he sees that heaven opens
+ and the holy spirit flies to him like a bird.'
+ (_sma_)
+ - **dan dihte (Adv):**
+ Muhto go lassánedje olbmot, de bohte čáhppesbivttasolbmot fas
+ dohko, gosa ledje sámit vuohččan ballán, ja dahke orohagaid jur
+ dasa gos sámit ledje orrume, _dan dihte_ go sii oidne, ahte das
+ leai čáppa gieddi, maid ledje bohccot dutken, gožžan ja baikán —
+ gos ledje sámit orron mánga olmmošbuolvva.
+- **@ADVL<OBJ:**
+ - **:**
+- **@ADVL>SUBJ:**
+ - **:**
+- **@AGENS>:**
+ kal
+ - **atorfilittanit:**
+ _Attartortumiit piginnittumut aaqqissuussineq
+ namminersornerusuni atorfilittanit politikerinillu nuimasunit
+ isertortumik atornerlunneqarsimammat illoqarfinni anginerni
+ pingasuni attartortut nalinginnaasut pillarneqartussanngorput._
+- **@APP-ADVL<:**
+ Apposition to an adverbial to the left. If the apposition consists
+ of more than one word, the head will get this tag.
+ - **ovdal (Pr):**
+ Dolin, _ovdal_ soađi, olbmot lávejedje vuovdit joŋaid. - 'In
+ old times, _before_ the war, people used to sell cowberries.'
+- **@APP-N<:**
+ Apposition to a noun to the left of it. If the apposition is more
+ than one constituent, the head will get this tag.
+ - **eatnigiela (N):**
+ Viimmat mun ohppen čállit sámegiela, mu _eatnigiela_. -
+ 'Finally, I learned to write in Sámi, my _mother tongue_.'
+- **@APP-Num<:**
+ Apposition to a numeral to the left.
+ - **suinniid (N):**
+ Juohke heasta borrá sullii 6 kilu _suinniid_ beaivái. - 'Every
+ horse eats approximately 6 kilograms of _grass_ a day.'
+- **@APP>Pron:**
+ Apposition to a pronoun to the right. If the apposition is more than
+ one constituent, the head will get this tag.
+ - **Turner (N Prop):**
+ Muhto diet Will _Turner_, son nai lea fiinna olmmái. - 'But
+ this Will _Turner_, he is also a nice guy.'
+- **@APP-Pron<:**
+ Apposition to a pronoun to the left. If the apposition is more than
+ one constituent, the head will get this tag.
+ - **olbmái (N):**
+ Dan mun muitalan dušše dutnje, mu buoremus _olbmái_. - 'This I
+ tell only you, my best _friend_.'
+- **@>CC:**
+ modifier of CC
+ - **sihke (CC):**
+- **@CL-ADVL>:**
+ - **:**
+- **@CL-<ADVL:**
+ - **:**
+- **@CMPND:**
+ First part of a compound followed by a hyphen
+ - **skaehtie-:**
+ Reerenasse galka båetije stoerredigkieboelhkesne jåerhkedh dam
+ _skaehtie-_ jïh åasadaltesem mij lea daelie, jïh daennie
+ daltesisnie hov lea nuepie buerebe joekedimmiem darjodh.
+- **@CNP:**
+ Local conjunction or subjunction.
+ - **ja (CC):**
+ Sihke Mázes _ja_ Guovdageainnus leat boarrásat viššalit finadan
+ doaibmaguovddážiin. - 'Both in Máze _and_ in Guovdageaidnu, the
+ elderly have frequently visited the activity centre.'
+ - **go (CS):**
+ Sámi geavaheaddjit hállet dávjá metaforaiguin ja sis leat ollu
+ eará gulahallanvuogit _go_ giella. - 'Saami users often speak in
+ metaphors and they have many other ways of communicating _than_
+ by means of language.'
+- **@COMPL-CS<:**
+ Complement of subjunction.
+ - **vejolaš (A):**
+ Gulahallan Sámedikkiin dán gažaldagas šaddá nu konkrehtalažžan
+ go _vejolaš_. - 'The contact with the Saami Parliament about
+ this issue gets as concrete as _possible_.'
+- **@CVP:**
+ Conjunction or subjunction that conjoins finite verb phrases
+ - **ja (CC):**
+ Bealatjogas leat dološ rájes leamaš bálvvossajit _ja_ dát
+ golbma sieiddi ledje dovddus gitta olgoriikii. - 'Since ancient times,
+ there have been sacrificial sites at Bealatjohka _and_ the three
+ 'sieidi' (cult images) were known even abroad.'
+ - **go (CS):**
+ Leago guhkes áigi dassá _go_ Máreha oidnet? - 'Has it been a
+ long time _since_ you have seen Máret?'
+- **@FAUX:**
+ finite auxiliary
+ - **ledje (V):**
+ Gávpotmuvrra vuođđogeađggit _ledje_ čiŋahuvvon juohke lágán
+ divrras geđggiiguin. - 'The cornerstones of the wall _were_
+ decorated with every kind of expensive stones.'
+- **@-F<ADVL:**
+ Adverbial of infinite verb outside of the predicate
+ - **árbbolaččain (N):**
+ Danne dárbbašit mii oažžut lobi Nils Aslak Valkeapää
+ _árbbolaččain_ almmuhit dán guokte lávlaga min sálbma-CD:s. -
+ 'Therefore we need to get permission from Nils Aslak
+ Valkeapää's _heirs_ to release these two songs on our
+ psalm-CD.'
+- **@-FADVL>:**
+ Adverbial of infinite verb outside the predicate
+ - **várrogasat (Adv):**
+ Dihkkadeaddji rávve skohtervuddjiid _várrogasat_ mátkkoštit. -
+ 'The roadman warns snowmobile drivers to drive _carefully_.'
+- **@FMV:**
+ finite mainverb
+ - **lei (V):**
+ Gávpot _lei_ njealječiegat, seammá guhkki go govdat. - 'The city
+ _was_ square, as long as it was wide.'
+- **@FMVdic:**
+ - **muitala (V):**
+ Ja go geassit eret dábálaš goluid, de lea buhtes sisaboahtu
+ sullii 100 000 ruvnnu, _muitala_ Eriksen. - 'And when we subtract
+ the regular expenses, there is a remaining income
+ of about 100 000 crowns, _says_ Eriksen.'
+- **@-F<OBJ:**
+ Object of infinite verb outside the verbal to the right of it.
+ - **govaid (N):**
+ Boađe mu lusa geahččat _govaid_! - 'Come to me and look _at
+ the pictures_!'
+- **@-FOBJ>:**
+ Object of infinite verb outside the verbal to the left of it.
+ - **váldovuoittuid (N):**
+ Valáštallanhálla lei njealjehas dievva olbmuiguin geat vurde
+ _váldovuoittuid_ fasket. - 'The gym was to a quarter full of
+ people who were waiting to grab _the main prizes_.'
+- **@-F<SPRED:**
+ - **duhtavaččat (A):**
+ IL Nordlysa beaivválaš jođiheaddji, Nils Peder Eriksen, lohká
+ iežaset leat oalle _duhtavaččat_ dán jagáš básárdoaluin.
+- **@FM-SPRED<:**
+ main clause functioning as a subject predicate to the right of
+ another main clause
+ - **ii (V):**
+ Ja dasa lea dát sivva: go sápmelaš boahtá moskkus gámmirii, de
+ son ii _ii_ ipmir ii báljo maidege, go ii biegga beasa bossut
+ njuni vuostá. - 'And this is the reason: if a Saami comes ...,
+ then he does _not_ understand ...'
+- **@FS-ADVL>:**
+ subclause functioning as an adverbial to the finite verb of the main
+ clause to the right of it.
+ - **bohtet (V):**
+ Ja mo jos Muhtinlágan Stálu ustibat _bohtet_ fitnat. - 'And
+ what if the friends of some-kind-of troll _come_ for a visit.'
+- **@FS-<ADVL:**
+ subclause functioning as an adverbial to the finite verb of the main
+ clause to the left of it.
+ - **galggai (V):**
+ Go _galggai_ bargat rehkenastimiin sus šattai álo
+ oaivvebávččas. - 'When she _was supposed to_ work with arithmetic, she
+ always got a headache.'
+- **@FS-IAUX:**
+ subclause infinite auxiliary
+ - **sáhte:**
+ Mun in _sáhte_ muitalit dán dutnje. - 'I *can*not tell you
+ this.'
+- **@FS-IMV:**
+ subclause infinite mainverb
+ - **ohcamin (V):**
+ Naba jos eadni lea sudno _ohcamin_, iige gávnna. - 'And if
+ mother is _searching_ for them, she will not find them.'
+- **@FS-N<:**
+ finite verb (either an auxiliary or main verb) of a relative
+ subclause (with a noun (N) antecedent)
+ - **lea (V):**
+ De son viežžá liegga liema ruittus mii _lea_ oapmana alde. -
+ 'Then he fetches warm broth from the pot that _is_ on the
+ stove.'
+- **@FS-N<IAUX:**
+ infinite auxiliary of a (relative) subclause
+ - **sáhttán (V):**
+ Mun oidnen nieidda gii ii _sáhttán_ boahtit. - 'I saw the girl
+ that _could_ not come.'
+- **@FS-N<IMV:**
+ infinite mainverb of a (relative) subclause
+ - **bargan (V):**
+ Mon lean okta sápmelaš, guhte lean _bargan_ visot sámi bargguid
+ ja mon dovddan visot sámi dili. - 'I am a Sámi, who has
+ _worked_ in all Saami occupations and I know all Saami
+ affairs.'
+- **@FS-OBJ:**
+ finite verb of the subclause that has an object function
+ - **leahkkasii (V):**
+ Arne ii fuobmán ahte uksa _leahkkasii_. - 'Arne did not notice
+ that the door _opened_.'
+- **@FS-OBJ>:**
+ finite verb of a subclause that has object function used for kal
+ e.g.
+ - **pigilissagaa (V):**
+ _Nunani issittuni sila pillugu tunngaviusumik ilisimasalernissaq
+ silamut ilisimatusarfiup pigilissagaa ministerit marluullutik
+ isumaqatigiipput, silallu allanngoriartornerata sunniutai
+ maluginiarneqassasut._
+- **@FS-P<:**
+ finite verb of a subclause
+ - **eru (V):**
+ _Tað er ikki longur pláss fyri, at lutir og kenslur bara eru._ -
+ 'There is no longer room for things and feelings
+ simply to be.'
+ - **:**
+ _(fao)_
+- **@FS-P<IMV:**
+ finite verb of a subclause in fao
+ - **:**
+- **@-F<OPRED:**
+ - **:**
+- **@-FSUBJ>:**
+ subject of a verbal infinitival object
+ - **mánáid (N):**
+ Muhtinlágan Stállu cáhpá goikebierggu sudnuide ja dáhttu
+ _mánáid_ boradit. - 'Some-sort-of troll cuts dried meat for them
+ and asks _the children_ to eat.'
+- **@FS-SUBJ:**
+ finite verb head of a subordinate clause functioning as a subject
+ - **boađát (V):**
+ Dehálaš lea ahte don maid _boađát_. - 'It is important that you
+ _come_ too.'
+- **@FS-VFIN<:**
+ - **eai (V):**
+ Idja ii leat šat, _eai_ ge sii dárbbaš lámppá dahje beaivváža
+ čuovgga, dasgo Hearrá Ipmil lea sin čuovga. - 'The night is not
+ anymore, they do _not_ need the lamp- or day- light either,
+ because God the Lord is their light.'
+- **@HAB:**
+ Habitive, for a human target in illative or locative in a habitive
+ construction (copula), is translated as "have". Possible verbs in a
+ habitive construction are "boahtit" 'come', "leat" 'be', "goallut"
+ 'pass', "heaŋgát", 'hang', "jápmit" 'die', "šaddat" 'become'.
+ - **Máhtes (N):**
+ _Máhtes_ lea beana. - '_Máhtte_ has a dog.'
+ - **sus (Pron):**
+ _Sus_ šattai álo oaivvebávččas go galggai bargat
+ rehkenastimiin. - '_She_ always got a headache when she was
+ supposed to work with arithmetic.'
+- **@HNOUN:**
+ Stray noun in sentence fragments.
+ - **boddu (N):**
+ Vuosttaš _boddu_. - 'First _lesson_.'
+- **@IAUX:**
+ non-finite auxiliary
+ - **veaje (V):**
+ Dattetge ii _veaje_ oađđit. - 'Still she did not _manage_ to
+ sleep.'
+- **@ICL-ADVL:**
+ infinitival clause adverbial
+ - **árvvoštallat (V):**
+ Son namuha ahte sii leat gal ávžžuhan olbmuid geat dihtet ahte
+ dárbbašit jođánit beassat buohccevissui, nugo ovdamearkka dihte
+ áhpehis nissonolbmuid geain lahkona riegádahttináigi,
+ _árvvoštallat_ galget go ovdal go buohccájit juo vuolgit
+ buohccevissui.
+- **@ICL-OBJ:**
+ infinitival clause object
+ - **boradit (V):**
+ Muhtinlágan Stállu cáhpá goikebierggu sudnuide ja dáhttu mánáid
+ _boradit_. - 'Some-sort-of troll cuts dried meat for them and
+ asks the children _to eat_.'
+- **@ICL-P<:**
+ infinitival complement of a preposition
+ - **skriva (V):**
+ _Kenslan gav Unn íblástur til at skriva nakrar yrkingar um
+ næstrakærleika._
+ - **:**
+ _(fao)_
+- **@ICL-SUBJ:**
+ infinitival subject
+ - **sløkkja (V):**
+ _Men tað er líka skjótt at sløkkja ljósið, lata eygu og oyru
+ aftur og rista ábyrgdina av okkum._
+ - **:**
+ _(fao)_
+- **@IM:**
+ fao
+ - **at (IM):**
+ _Tá koyrdi Harrin Guð hann út úr aldingarðinum í Eden og setti
+ hann til at dyrka ta jørð, sum hann var sjálvur tikin av._
+- **@IMV:**
+ non-finite mainverb
+ - **čiŋahuvvon (V):**
+ Gávpotmuvrra vuođđogeađggit ledje _čiŋahuvvon_ juohke lágán
+ divrras geđggiiguin. - 'The cornerstones of the wall were
+ _decorated_ with every kind of expensive stones.'
+- **@INF->N:**
+ kal
+ - **pillugu (V):**
+ _Nunani issittuni sila pillugu tunngaviusumik ilisimasalernissaq
+ silamut ilisimatusarfiup pigilissagaa ministerit marluullutik
+ isumaqatigiipput, silallu allanngoriartornerata sunniutai
+ maluginiarneqassasut._
+- **@INTERJ:**
+ Interjection.
+ - **maid (Interj):**
+ _Maid_, iigo leat boahtán? - '_What_, hasn't he/she come?'
+- **@<IOBJ:**
+ indirect object to the right of the finite verb.
+ - **(N):**
+- **@IOBJ>:**
+ Indirect object to the left of the finite verb.
+ - **(N):**
+- **@MIK-OBJ:**
+ kal
+ - **illunik (N):**
+ Namminersornerusut Nuummi _illunik_ ima amerlatigisunik
+ tunisaqarsimalerput aningaasanut inatsit
+ iluatsitaariniarfigisariaqalerlugu inissianik isatereriarlutik
+ nutaanik sanaartortariaqaleramik atorfilittatik naammaginartunik
+ inissaqartissinnaajumallugit.
+- **@>N:**
+ Prenominal modifier to the left
+ - **geavatlaš (A):**
+ Ráđđehussii lea _geavatlaš_ politihkka deaŧalaš. - 'For the
+ government, _practical_ politics is important.'
+ - **oahppo-:**
+ _Oahppo-_ ja dutkanministtar dat lea ráhkadan dieđáhusa alit
+ sámi oahpu ja dutkama birra. - 'The secretary for _education_
+ and research has given a notice about Saami higher education and
+ research.'
+ - **rektor (N):**
+ _Rektor_ Tove Bull álgaga mielde... - 'According to _principal_
+ Tove Bull ...'
+ - **Tove (N Prop):**
+ Rektor _Tove_ Bull álgaga mielde... - 'According to principal
+ _Tove_ Bull ...'
+- **@N<:**
+ Modifier of the noun to the left.
+ - **33 (Num):**
+ Mun lean ilus go beasan ovdanbuktit St.dieđ. nr. _33_. - 'I am
+ happy that I get the opportunity to present the parliament
+ notice number _33_.' (In this case _33_ modifies _St.dieđ._.)
+ - **vihtta (Num):**
+ Mun boađán diibmu _vihtta_. - 'I will come at _five_ o'clock.'
+- **@>Num:**
+ Attributes of numeral to the right.
+ - **nr (N):**
+ Mun lean ilus go beasan ovdanbuktit St.dieđ. _nr._ 33. - 'I am
+ happy that I get the opportunity to present the parliament
+ notice _number_ 33.'
+- **@Num<:**
+ Attributes of numeral to the left.
+ - **jagi (N):**
+ Son lea guoktelogi _jagi_ boaris. - 'She/he is twenty _years_
+ old.'
+- **@<OBJ:**
+ Direct object to the right of the finite verb.
+ - **áiggi (N):**
+ Dat gáibida ollu _áiggi_. - 'That demands a lot of _time_.'
+- **@OBJ>:**
+ Direct object to the left of the finite verb.
+ - **maid:**
+ Filbma lea oassi prošeavttas _maid_ Sámi instituhtta lea
+ ruthadan. - 'The film is a part of the project that the Saami
+ institute has financed.'
+- **@OPRED>:**
+ Object predicative to the left of the finite verb.
+ - **luoikkasin (N):**
+ Gaup dojii stivrrana hárjehallamiin, muhto oaččui _luoikkasin_
+ eará stivrrana. - 'Gaup broke the handlebars during the
+ practice, but got to _borrow_ another handlebar.'
+- **@<OPRED:**
+ Object predicative.
+ - **buriid (A):**
+ Gáhkkuid son ráhkada hui _buriid_. - 'Cakes, she/he makes
+ really _good ones_.'
+ - **sámegielhállin (N):**
+ Dagat iežat _sámegielhállin_. - 'You make yourself _a Saami
+ speaker_.'
+- **@>P:**
+ Complement of postposition to the left of it.
+ - **oahpu (N), dutkama (N):**
+ Oahppo- ja dutkanministtar dat lea ráhkadan dieđáhusa alit sámi
+ _oahpu_ ja _dutkama_ birra. - 'The secretary for education and
+ research has given a notice about Saami higher _education_ and
+ _research_.'
+- **@P<:**
+ Complement of preposition to the right of it.
+ - **oasálaččaid (N):**
+ Finnmárkkus ii goassige leat leamaš ságastallan gaskal muhtun
+ muddui seammadássásaš _oasálaččaid_. - 'There has never been a
+ discussion in Finnmark between somehow equal _parts_.'
+- **@PCLE:**
+ Particle.
+ - **amma (Pcle):**
+ _Amma_ mii eat leat máksán? - 'We haven't paid, _have we_?'
+- **@POSS>:**
+ kal
+ - **Jiisusi-Kristusip (N):**
+ _Jiisusi-Kristusip_, Daavip ernerata, Aaperaap ernerata,
+ eqqarliisa allassimaffiat.
+- **@PPRED:**
+ a predicative with a predicative as its head
+ - **reaŋgan (N):**
+ Máhtes lea Jovnna _reaŋgan_. - 'Máhtte has Jovnna _as a
+ servant_.'
+- **@>Pron:**
+ Modifier of a pronoun to the left of it.
+ - **buot (Pron):**
+ Mun, Johanas, lean dat guhte lean gullan ja oaidnán _buot_
+ dán. - 'I, Johanas, am the one who has heard and seen _all_ of
+ it.'
+- **@Pron<:**
+ Modifier of pronoun to the right of it.
+ - **ipmašiid (N):**
+ Maid _ipmašiid_ doppe dagat? - 'What _the heck_ are you doing
+ there?'
+ - **golmmas (N):**
+ Mii _golmmas_ oktan du vieljain finaimet Niillas-čeazi
+ geahčen. - 'We _three_ together with your brother visited uncle
+ Niillas.'
+- **@SPRED:**
+ Subject predicative in elliptical sentences.
+ - **nommh (N):**
+ _Die maa onterligksh nommh, ih goh tuhtjh, men die ligan
+ onterligksh nierretjh aaj. - ' '_
+ - **:**
+ _(sma)_
+- **@<SPRED:**
+ Subject predicative to the right of the finite verb.
+ - **galbmasat (A):**
+ Mus leat gieđat nu _galbmasat_. - 'My hands are so _cold_.'
+ - **beana (N):**
+ Mus lea _beana_. - 'I have _a dog_.'
+- **@SPRED>:**
+ Subject predicative to the left of the finite verb.
+ - **vuođđun (N):**
+ Kommišuvnna evttohusaid _vuođđun_ lea guohtundilalašvuođaid
+ vuđolaš čielggadeapmi, man fágalávdegotti ášše-dovdit dahke.
+- **@SPRED<OBJ:**
+ - **:**
+- **@SUBJ:**
+ Elliptical subject.
+ - **ålma (N):**
+ _Dennie synnagovgesne jis akte ålma maam doenh-aajmoe
+ doerelamme. - ' '_
+- **@SUBJ>:**
+ Subject to the left of the finite verb.
+ - **son (Pron):**
+ _Son_ lea mu oabbá. - '_She_ is my sister.'
+ - **luopmánat (N):**
+ Jeakkis leat _luopmánat_. - 'There are _cloudberries_ in the
+ swamp.'
+- **@<SUBJ:**
+ Subject to the right of the finite verb.
+ - **ollusat (Pron):**
+ ...ja dan vejolašvuođa orro gal _ollusat_ geavahan. - '...and
+ this opportunity, _many_ seem to make use of.'
+- **@SUBJ<ADVL:**
+ - **:**
+- **@SUBJ_COMP:**
+ fao
+ - **:**
+- **@<SUBJ_COMP:**
+ predicate of a subject/ subject complement (kal)
+ - **:**
+- **@SUBJ<OBJ:**
+ - **:**
+- **@tSUBJ:**
+ Elliptical subject.
+ - **tað (Pron):**
+ _Tað_ er ikki longur pláss fyri, at lutir og kenslur bara eru.
+- **@i-ADVL>:**
+ kal
+ - **Babylonimut (N):**
+ _Josijap Jekonja qatanngutaalu Babylonimut aallarussaanerup
+ nalaani._
+- **@i-><ADVL:**
+ kal
+ - **pruffiitikkut (N):**
+ Tamakku tamarmik pipput Naalakkap _pruffiitikkut_ oqaaserisaa
+ eqquuteqqullugu, oqarmat: »Takuat, niviarsiaq naartulissaaq
+ ernertaassallunilu, atsissavaallu Immanuelimik« – imaappoq:
+ Guuti ilagaarput.
+- **@i->N:**
+ kal
+ - **naammaginartunik (N):**
+ Namminersornerusut Nuummi illunik ima amerlatigisunik
+ tunisaqarsimalerput aningaasanut inatsit
+ iluatsitaariniarfigisariaqalerlugu inissianik isatereriarlutik
+ nutaanik sanaartortariaqaleramik atorfilittatik _naammaginartunik_
+ inissaqartissinnaajumallugit.
+- **@i->V:**
+ kal
+ - **tutinneq (N):**
+ Ernertaartinnaguli _tutinneq_ ajorpaa.
+- **>@V:**
+ kal
+ - **:**
+- **@VOC:**
+ Vocative.
+ - **hearrá:**
+ _Hearrá_, du ráhkis ustit lea buohcci. - '_Lord_, your beloved
+ friend is ill.'
+- **<ctjHead>:**
+ coordinated head, can be of different PoS (V, A, N ...). The PoS
+ taking part in coordination need not be of the same kind.
+ The tag is useful if the coordinated part does not directly follow
+ its predecessor.
+ - **geahččá:**
+ Dat _geahččá_ nuppiid stáluide girkes čalmmiiguin ja reašká
+ romet. - 'He/she _looks_ at the other trolls with clear eyes
+ and laughs hideously.'
+ - **stuorrát:**
+ Gieđat leat _stuorrát_ dego steaikabánnot ja guolgan. - 'The
+ hands are as _big_ as frying pans and covered with hair.'
+ - **soalsin:**
+ Skávžá lea buot _soalsin_ ja njuoskkas. - 'The beard is all
+ _covered with spit_ and wet.'
+- **<mv>:**
+ main verb, especially useful in cases where the verb can be both a
+ main verb and an auxiliary
+ - **dárbbašit, oažžut:**
+ Danne _dárbbašit_ mii _oažžut_ lobi Nils Aslak Valkeapää
+ árbbolaččain almmuhit dán guokte lávlaga min sálbma-CD:s. -
+ 'Therefore we _need_ to _get_ permission from Nils Aslak
+ Valkeapää's heirs to release these two songs on our psalm-CD.'
+- **<vdic>:**
+ verba dicendi, those that introduce direct speech, typically words
+ of communication such as lohkat, cealkat, dadjat, oaivildit
+ - **celkkii:**
+ Eŋgel _celkkii_ munnje: Dát leat luohtehahtti ja duohta
+ sánit. - 'The angel _told_ me: These are trustworthy and true
+ words.'
+## Coordination
Here are some examples of our coordination-analysis:
"Náhkiin sii gorro roavgguid, dorkkaid ja gápmagiid."
From the skin they were sewing furs, coats and shoes.
- / | \
+ / | \
Náhkiin sii roavgguid,
/ \
doarkkaid gápmagiid.
- ja
+ ja
"Bárdni válddii niibbi ja čuohpai ráiggi sehkkii ja luittii mánáid olggos."
The boy took the knife and cut a hole in the bag and let the children out.
/ / \ \
Bárdni niibbi \ luittii
| čuohpai | \
ja / | mánáid olggos.
- / |
+ / |
ráiggi sehkkii
- |
+ |
-Complex sentences
+## Complex sentences
Here are some examples:
@@ -694,7 +685,7 @@ Here are some examples:
/ | \ \
- Jus stállu de \
+ Jus stállu de \
/ \
son olbmo.
@@ -704,8 +695,8 @@ Here are some examples:
it __
/ | \
- don beasa \
- / \ \
+ don beasa \
+ / \ \
Dasto ruoktot borat
| \
jus luhtte.
@@ -724,9 +715,9 @@ Here are some examples:
| / | \
ja iđeda fas dan.
- nuppi
+ nuppi
- No verb in the main clause:
+ No verb in the main clause:
"Ovdal buorida Ipmil dálkkiidis go neavrres olmmoš dábiidis."
Rather does God improve the weather than a miserable person his habits.
@@ -737,16 +728,14 @@ Here are some examples:
/ | \
go neavrres dábiidis.
+## Punctuation
Punctuation such as ".", "," and ";" also receive dependency tags. The
sentence "Arvigoahtá. - It starts raining" actually consists of two
elements, the finite verb and the punctuation. The full stop is also
interpreted as a dependent of the root "\#2->0".
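
The relation tag quoted here, `#2->0`, pairs a cohort's own position with the position of its head, and `0` stands for the root. A minimal Python sketch of reading such a tag might look as follows. The names `parse_dep` and `is_root_dependent` are illustrative assumptions and are not part of the giellalt tools or of CG itself.

```python
import re

# Matches relation tags of the form "#self->parent", e.g. "#2->0".
DEP_RE = re.compile(r"#(\d+)->(\d+)")


def parse_dep(tag):
    """Return (self_position, parent_position) for a tag such as '#2->0'."""
    match = DEP_RE.fullmatch(tag.strip())
    if match is None:
        raise ValueError(f"not a dependency relation tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def is_root_dependent(tag):
    """True when the cohort attaches directly to the root (parent 0)."""
    return parse_dep(tag)[1] == 0
```

With this sketch, the full stop in "Arvigoahtá." would be read as position 2 attached to the root, while a tag such as `#7->6` would be read as position 7 attached to the cohort at position 6.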
-Arguments and adjuncts
+## Arguments and adjuncts
Subcategorized arguments such as "beatnagis - of the dog" in the
sentence "Balat go beatnagis? - Are you afraid of the dog" are
diff --git a/lang/common/docu-sme-grammartags.md b/lang/common/docu-sme-grammartags.md
index a0d31b8a..18e6f12e 100644
--- a/lang/common/docu-sme-grammartags.md
+++ b/lang/common/docu-sme-grammartags.md
@@ -1,8 +1,7 @@
On the bottom of this page is a list with all tags in alphabetical
+# Overview
All the words are analysed with dictionary form + grammatical tags. Each
tag is introduced with a "+" sign. We thus have
@@ -21,8 +20,7 @@ Nouns (+N), adjectives (+A), verbs (+V), pronouns (+Pron), adverbs
(+Adv), particles (+Pcle), subjunctions (+CS), conjunctions (+CC),
postpositions (+Po), prepositions (+Pr) and interjections (+Interj).
-The nouns
+## The nouns
The string is
+N+(Subclass)+(Semclass)+Number+Case(+Possessivesuffix)(+Clitic)". The
@@ -34,7 +32,7 @@ analysis. Note that the grammatical categories in parentheses can be
| | |
+| ------------------- | -------------------------------------------------------------- |
| Part of speech | +N |
| Subclass | +Prop, +G3, +NomAg |
| Semantic class | +Sem/Hum, +Sem/Plc, +Sem/Veh (see list) |
@@ -43,13 +41,12 @@ omitted.
| (Possessive suffix) | +PxSg1 +PxSg2 +PxSg3 +PxDu1 +PxDu2 +PxDu3 +PxPl1 +PxPl2 +PxPl3 |
| (Clitic) | +Qst +Foc/ |
-The adjectives
+## The adjectives
Used non-attributively the adjective resembles the noun:
| | |
+| -------------- | ---------------------------------- |
| Part of speech | +A |
| (Grade) | +Comp, +Superl |
| Number | +Sg, +Pl |
@@ -59,19 +56,18 @@ Used non-attributively the adjective resembles the noun:
Used attributively the adjective has a quite simple tag scheme:
| | |
+| -------------- | -------------------- |
| Part of speech | +A |
| Attribute | +Attr |
| (Clitic) | e.g. +Qst (see list) |
-The verbs
+## The verbs
Finite and infinite verb forms have quite distinct paradigms. Finite
| | |
+| -------------- | ----------------------------------------------------- |
| Part of speech | +V |
| (Derivation) | +Der/PassL, +Der/PassS, +Der/h (see list) |
| Mood | +Ind, +Pot, +Cond, +Imprt |
@@ -82,7 +78,7 @@ first:
Infinite verb forms:
| | |
+| ----------------- | -------------------------------------------------- |
| Part of speech | +V |
| (Derivation) | +Der/PassL |
| Nominal verb form | +Inf, +Act, +Ger, +PrsPrc, +PrfPrc, +VGen, +VAbess |
@@ -91,7 +87,7 @@ Infinite verb forms:
Other derived verb forms:
| | |
+| ------------------- | -------------------------------------------------------------- |
| Part of speech | +V |
| Part of speech | +N |
| Derivation | +NomAg |
@@ -101,8 +97,7 @@ Other derived verb forms:
Here is an example: `oahppit` > `oahppi+N+NomAg+Pl+Nom`
-The pronouns
+## The pronouns
The personal, demonstrative and interrogative pronouns:
@@ -114,230 +109,223 @@ underlying form: `mun+Pron+Pers+Sg1+Com`, surface form: `muinna`
The reflexive pronouns:
-baseform+Pron+pronoun\_type+Case(+possessive suffix)
+baseform+Pron+pronoun_type+Case(+possessive suffix)
Example: underlying form: `ieš+Pron+Refl+Loc+PxDu1`, surface form:
-The indeclinable words
+## The indeclinable words
These have their POS tag as their only tag:
underlying form: `birra+Pr` or `birra+Po`, surface form: `birra`
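
Since every tag is introduced with a "+" sign, an analysis string can be taken apart by splitting on that character. The following Python sketch shows the idea. The function name `split_analysis` is a hypothetical helper, and the sketch assumes that the dictionary form itself contains no "+" and that no tag ends in "+" (so it would not handle a tag like `+Cmpnd+`).

```python
def split_analysis(analysis):
    """Split 'lemma+Tag1+Tag2' into the lemma and a list of '+'-prefixed tags."""
    lemma, *tags = analysis.split("+")
    # Re-attach the "+" so the tags look the same as in the documentation.
    return lemma, ["+" + tag for tag in tags]


if __name__ == "__main__":
    lemma, tags = split_analysis("oahppi+N+NomAg+Pl+Nom")
    print(lemma, tags)
```

An indeclinable word such as `birra+Po` then yields the lemma `birra` and the single tag `+Po`.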
-Alphabetic list over the tags
-Part of speech and subclass
-- **+A** adjective
- - **+Ord** ordinal
-- **+Adv** adverb
-- **+CC** conjunction
-- **+CS** subjunction
-- **+Interj** interjection
-- **+N** noun
- - **+Prop** proper noun
- - **+NomAg** agent noun
- - **+G3** geminat grade 3, e.g. s's
-- **+Num** numeral
- - **+Card** cardinal number
-- **+Pcle** particle
-- **+Po** postposition
-- **+Pr** preposition
-- **+Pron** pronoun
- - **+Dem** demonstrative pronoun
- - **+Indef** indefinite pronoun
- - **+Interr** interrogative pronoun
- - **+Pers** person pronoun
- - **+Refl** reflecsive pronoun
- - **+Recipr** reciprocal pronoun
- - **+Rel** relative pronoun
-- **+N, +Adv**for several parts of speech
- - **+ABBR** abbreviation
- - **+ACR** acronym
-Grammatical properties
-- **+Acc** Accusative
-- **+Actio** Actio form of the verb
-- **+Attr** Attributive
-- **+CLB** Clause boundary
-- **+Cmpnd** Compound (inconsistent with other notation for this tag)
-- **+Cmpnd+** Compound (left tag)(inconsistent with other notation for
- this tag)
-- **+Com** Comitative
-- **+Comp** Comperative
-- **+ConNeg** Negationform of the verb
-- **+Cond** Conditional
-- **+Du**Dual
-- **+Du1** Dual 1. person
-- **+Du2** Dual 2. person
-- **+Du3** Dual 3. person
-- **+Ess** Essive
-- **+Foc/ba** Focusclitic (-)ba
-- **+Foc/bat** Focusclitic (-)bat
-- **+Foc/be** Focusclitic (-)be
-- **+Foc/ge** Focusclitic (-)ge
-- **+Foc/gen** Focusclitic (-)gen
-- **+Foc/ges** Focusclitic (-)ges
-- **+Foc/gis** Focusclitic (-)gis
-- **+Foc/hal** Focusclitic (-)hal
-- **+Foc/han** Focusclitic (-)han
-- **+Foc/naj** Focusclitic (-)nai
-- **+Foc/naj+Qst** Focusclitic (-)naigo
-- **+Foc/son** Focusclitic (-)son
-- **+Gen** Genitive
-- **+Ger** Gerund
-- **+IV** Intrasitive verb
-- **+Ill** Illative
-- **+Imprt** Imperative
-- **+Ind** Indicative
-- **+Inf** Infinitive
-- **+LEFT** Left parenthesis
-- **+Loc** Locative
-- **+Neg** Negationverb
-- **+Nom** Nominative
-- **+PUNCT** Punctuation other than clause boundaries*"/", "(", ")",
- "+"*
-- **+Pl** Plural
-- **+Pl1** Plural 1. person
-- **+Pl2** Plural 2. person
-- **+Pl3** Plural 3. person
-- **+Cmp/PlGen** Plural genitive compound
-- **+Pot** Potential
-- **+PrfPrc** Perfect participle
-- **+Prs** Present tense
-- **+PrsPrc** Present participle
-- **+Prt** Preteritum
-- **+PxDu1** Possessivesuffix dual 1. person
-- **+PxDu2** Possessivesuffix dual 2. person
-- **+PxDu3** Possessivesuffix dual 3. person
-- **+PxPl1** Possessivesuffix plural 1. person
-- **+PxPl2** Possessivesuffix plural 2. person
-- **+PxPl3** Possessivesuffix plural 3. person
-- **+PxSg1** Possessivesuffix singular 1. person
-- **+PxSg2** Possessivesuffix singular 2. person
-- **+PxSg3** Possessivesuffix singular 3. person
-- **+Qst** Question clitic *(-)go*
-- **+Qst+Foc/son** Question clitic
-- **+RIGHT** Right parenthesis
-- **+Sg** Singular
-- **+Sg1** Singular 1. person
-- **+Sg2** Singular 2. person
-- **+Sg3** Singular 3. person
-- **+Cmp/Sg** Singular compound
-- **+Cmp/SgGen** Singular genitive compound
-- **+Cmp/SgNom** Singular nominative compound
-- **+Sup** Supinum
-- **+Superl** Superlative
-- **+TV** Transitive verb
-- **+VAbess** Verb abessive
-- **+VGen** Verb genitive
-Derivational suffix tags
-- **+Der/Dimin** Diminutive
-- **+Der/adda** suffix
-- **+Der/ahtti** suffix
-- **+Der/alla** suffix
-- **+Der/amoš** suffix
-- **+Der/asti** suffix
-- **+Der/aš** suffix
-- **+Der/d** suffix
-- **+Der/duohkai** suffix
-- **+Der/duohke** suffix
-- **+Der/NomAg** agent noun
-- **+Der/eamoš** suffix
-- **+Der/NomAct** action noun
-- **+Der/easti** suffix
-- **+Der/g** suffix
-- **+Der/geahtes** suffix
-- **+Der/goahti** suffix
-- **+Der/h** suffix
-- **+Der/halla** suffix
-- **+Der/hat** suffix
-- **+Der/heapmi** suffix
-- **+Der/hudda** suffix
-- **+Der/huhtti** suffix
-- **+Der/huvva** suffix
-- **+Der/j** suffix
-- **+Der/l** suffix
-- **+Der/las** suffix
-- **+Der/laš** suffix
-- **+Der/lágan** suffix
-- **+Der/meahttun** suffix
-- **+Der/muš** suffix
-- **+Der/PassL** passive verb, long form
-- **+Der/PassS** passive verb, short form
-- **+Der/st** suffix
-- **+Der/stuvva** suffix
-- **+Der/supmi** suffix
-- **+Der/upmi** suffix
-- **+Der/us** suffix
-- **+Der/viđi** suffix
-- **+Der/viđá** suffix
-- **+Der/vuohta** suffix
-- **+Der/vuolde** suffix
-- **+Der/vuollai** suffix
-- **+Der/vuolle** suffix
-- **+Der/š** suffix
-Semantic tags
+## Alphabetic list over the tags
+## Part of speech and subclass
+- **+A** adjective
+ - **+Ord** ordinal
+- **+Adv** adverb
+- **+CC** conjunction
+- **+CS** subjunction
+- **+Interj** interjection
+- **+N** noun
+ - **+Prop** proper noun
+ - **+NomAg** agent noun
+ - **+G3** geminate grade 3, e.g. s's
+- **+Num** numeral
+ - **+Card** cardinal number
+- **+Pcle** particle
+- **+Po** postposition
+- **+Pr** preposition
+- **+Pron** pronoun
+ - **+Dem** demonstrative pronoun
+ - **+Indef** indefinite pronoun
+ - **+Interr** interrogative pronoun
+ - **+Pers** personal pronoun
+ - **+Refl** reflexive pronoun
+ - **+Recipr** reciprocal pronoun
+ - **+Rel** relative pronoun
+- **+N, +Adv** for several parts of speech
+ - **+ABBR** abbreviation
+ - **+ACR** acronym
+## Grammatical properties
+- **+Acc** Accusative
+- **+Actio** Actio form of the verb
+- **+Attr** Attributive
+- **+CLB** Clause boundary
+- **+Cmpnd** Compound (inconsistent with other notation for this tag)
+- **+Cmpnd+** Compound (left tag)(inconsistent with other notation for
+ this tag)
+- **+Com** Comitative
+- **+Comp** Comparative
+- **+ConNeg** Negation form of the verb
+- **+Cond** Conditional
+- **+Du** Dual
+- **+Du1** Dual 1. person
+- **+Du2** Dual 2. person
+- **+Du3** Dual 3. person
+- **+Ess** Essive
+- **+Foc/ba** Focusclitic (-)ba
+- **+Foc/bat** Focusclitic (-)bat
+- **+Foc/be** Focusclitic (-)be
+- **+Foc/ge** Focusclitic (-)ge
+- **+Foc/gen** Focusclitic (-)gen
+- **+Foc/ges** Focusclitic (-)ges
+- **+Foc/gis** Focusclitic (-)gis
+- **+Foc/hal** Focusclitic (-)hal
+- **+Foc/han** Focusclitic (-)han
+- **+Foc/naj** Focusclitic (-)nai
+- **+Foc/naj+Qst** Focusclitic (-)naigo
+- **+Foc/son** Focusclitic (-)son
+- **+Gen** Genitive
+- **+Ger** Gerund
+- **+IV** Intransitive verb
+- **+Ill** Illative
+- **+Imprt** Imperative
+- **+Ind** Indicative
+- **+Inf** Infinitive
+- **+LEFT** Left parenthesis
+- **+Loc** Locative
+- **+Neg** Negationverb
+- **+Nom** Nominative
+- **+PUNCT** Punctuation other than clause boundaries: _"/", "(", ")",
+ "+"_
+- **+Pl** Plural
+- **+Pl1** Plural 1. person
+- **+Pl2** Plural 2. person
+- **+Pl3** Plural 3. person
+- **+Cmp/PlGen** Plural genitive compound
+- **+Pot** Potential
+- **+PrfPrc** Perfect participle
+- **+Prs** Present tense
+- **+PrsPrc** Present participle
+- **+Prt** Preteritum
+- **+PxDu1** Possessivesuffix dual 1. person
+- **+PxDu2** Possessivesuffix dual 2. person
+- **+PxDu3** Possessivesuffix dual 3. person
+- **+PxPl1** Possessivesuffix plural 1. person
+- **+PxPl2** Possessivesuffix plural 2. person
+- **+PxPl3** Possessivesuffix plural 3. person
+- **+PxSg1** Possessivesuffix singular 1. person
+- **+PxSg2** Possessivesuffix singular 2. person
+- **+PxSg3** Possessivesuffix singular 3. person
+- **+Qst** Question clitic _(-)go_
+- **+Qst+Foc/son** Question clitic
+- **+RIGHT** Right parenthesis
+- **+Sg** Singular
+- **+Sg1** Singular 1. person
+- **+Sg2** Singular 2. person
+- **+Sg3** Singular 3. person
+- **+Cmp/Sg** Singular compound
+- **+Cmp/SgGen** Singular genitive compound
+- **+Cmp/SgNom** Singular nominative compound
+- **+Sup** Supinum
+- **+Superl** Superlative
+- **+TV** Transitive verb
+- **+VAbess** Verb abessive
+- **+VGen** Verb genitive
+## Derivational suffix tags
+- **+Der/Dimin** Diminutive
+- **+Der/adda** suffix
+- **+Der/ahtti** suffix
+- **+Der/alla** suffix
+- **+Der/amoš** suffix
+- **+Der/asti** suffix
+- **+Der/aš** suffix
+- **+Der/d** suffix
+- **+Der/duohkai** suffix
+- **+Der/duohke** suffix
+- **+Der/NomAg** agent noun
+- **+Der/eamoš** suffix
+- **+Der/NomAct** action noun
+- **+Der/easti** suffix
+- **+Der/g** suffix
+- **+Der/geahtes** suffix
+- **+Der/goahti** suffix
+- **+Der/h** suffix
+- **+Der/halla** suffix
+- **+Der/hat** suffix
+- **+Der/heapmi** suffix
+- **+Der/hudda** suffix
+- **+Der/huhtti** suffix
+- **+Der/huvva** suffix
+- **+Der/j** suffix
+- **+Der/l** suffix
+- **+Der/las** suffix
+- **+Der/laš** suffix
+- **+Der/lágan** suffix
+- **+Der/meahttun** suffix
+- **+Der/muš** suffix
+- **+Der/PassL** passive verb, long form
+- **+Der/PassS** passive verb, short form
+- **+Der/st** suffix
+- **+Der/stuvva** suffix
+- **+Der/supmi** suffix
+- **+Der/upmi** suffix
+- **+Der/us** suffix
+- **+Der/viđi** suffix
+- **+Der/viđá** suffix
+- **+Der/vuohta** suffix
+- **+Der/vuolde** suffix
+- **+Der/vuollai** suffix
+- **+Der/vuolle** suffix
+- **+Der/š** suffix
+## Semantic tags
These are tags used for classifying names and nouns, e.g. +Prop+Sem/Fem
-- **+Sem/Ani:**
- Animal name
-- **+Sem/Fem:**
- Female name
-- **+Sem/Mal:**
- Male name
-- **+Sem/Obj:**
- Object
-- **+Sem/Org:**
- Organisation name
-- **+Sem/Plc:**
- Place
-- **+Sem/Sur:**
- Surname
-- **+Sem/WEB:**
- Web addresse
-A critical discussion of some particular tags
-**Determiner or Pronoun**
+- **+Sem/Ani:**
+ Animal name
+- **+Sem/Fem:**
+ Female name
+- **+Sem/Mal:**
+ Male name
+- **+Sem/Obj:**
+ Object
+- **+Sem/Org:**
+ Organisation name
+- **+Sem/Plc:**
+ Place
+- **+Sem/Sur:**
+ Surname
+- **+Sem/WEB:**
+ Web address
+## A critical discussion of some particular tags
+### Determiner or Pronoun
the POSs of words like buot, dat, etc. get different terms in the
-- indefinite pronouns ("ubestemte pronomen"): vaikko mii, mihkkege,
- buot, buohkat, eanas...
+- indefinite pronouns ("ubestemte pronomen"): vaikko mii, mihkkege,
+ buot, buohkat, eanas...
Vesa Guttorm:
-- determiner ("pronomendeterminatiiva")
+- determiner ("pronomendeterminatiiva")
Magga 1980:
-- demonstrative pronouns ("čujuheaddji pronomenat"): dat, dát, diet,
- duot, dot
-- interrogative pronouns ("gažaldatpronomenat")
-- relative pronouns ("relatiivapronomenat")
-- personal pronouns ("persovnnalaš pronomenat")
-- indefinite pronouns ("indefinihtta (mearritkeahtes) pronomenat"):
- muhtun, soames, goappašagat
-- reciprocal pronouns ("resiprohka pronomenat: goabbat buoibmame,
- guhtet guoimmiset")
+- demonstrative pronouns ("čujuheaddji pronomenat"): dat, dát, diet,
+ duot, dot
+- interrogative pronouns ("gažaldatpronomenat")
+- relative pronouns ("relatiivapronomenat")
+- personal pronouns ("persovnnalaš pronomenat")
+- indefinite pronouns ("indefinihtta (mearritkeahtes) pronomenat"):
+ muhtun, soames, goappašagat
+- reciprocal pronouns ("resiprohka pronomenat: goabbat buoibmame,
+ guhtet guoimmiset")
Outi Kilpimaa:
-- determiner ("dat-determinánttalš"): dat,...
+- determiner ("dat-determinánttalš"): dat,...
diff --git a/lang/common/docu-sme-syntaxtags.md b/lang/common/docu-sme-syntaxtags.md
index f985ff04..24bc5dd6 100644
--- a/lang/common/docu-sme-syntaxtags.md
+++ b/lang/common/docu-sme-syntaxtags.md
@@ -1,5 +1,4 @@
-Documentation of the syntactic tags
+# Documentation of the syntactic tags
See also separate pages on [compound](CompoundTags.html),
[semantic](SemanticTags.html), [morphological](MorphologicalTags.html)
@@ -8,19 +7,18 @@ and [dependency](docu-deptags.html) tags.
On the bottom of this page there is a list with all tags in alphabetical
-Syntactic tags
+## Syntactic tags
Our syntactic tags, or grammatical function tags, like @SUBJ>, @OBJ>,
etc., are based upon a balanced compromise between 3 principles:
-1. use the same tags across *giellalt* languages
-1. use the conventions from within within constraint grammar (CG), e.g. as found in [the visl project](http://visl.sdu.dk/) for interactive syntax learning
-1. take the grammatical tradition of the language in question into account
+1. use the same tags across _giellalt_ languages
+1. use the conventions from within constraint grammar (CG), e.g. as found in [the visl project](http://visl.sdu.dk/) for interactive syntax learning
+1. take the grammatical tradition of the language in question into account
The main difference between the CG tradition (both giellalt and visl CG) and other descriptions is that CG is a linear system, where tags are given to **wordforms**, and not to **phrases**.
-Thus, in a sentence like the Saami equivalent of *Peter's dog barks*
-only the word *dog* will get the tag @SUBJ>. The word *Peter's* gets
+Thus, in a sentence like the Saami equivalent of _Peter's dog barks_
+only the word _dog_ will get the tag @SUBJ>. The word _Peter's_ gets
the tag @>N, or "modifying a noun to its right". It is then up to the
reader (or to further computer processing) to interpret the combination
of @>N and @SUBJ> as a phrase (phrase information will also be available via the [dependency tags](docu-deptags.html) when they are present).
@@ -43,45 +41,40 @@ distinguish them from morphological tags, which do not have such a
prefix. In the analysis, the syntactic tags are printed at the end of
the tag string.
-The syntactic tags for Saami
+## The syntactic tags for Saami
-We present here the tags used for the Saami languages (the best developed languages in the *Giellalt* infrastructure), but linguists working on other languages will find the presentation useful. The rules assigning tags are found in the file `lang-xxx/src/cg3/disambiguation.cg3`, where xxx is the ISO code of your language.
+We present here the tags used for the Saami languages (the best developed languages in the _Giellalt_ infrastructure), but linguists working on other languages will find the presentation useful. The rules assigning tags are found in the file `lang-xxx/src/cg3/disambiguation.cg3`, where xxx is the ISO code of your language.
+### The verb tags
-The verb tags
These tags are self-explanatory: there are finite and infinite main and
auxiliary verbs.
-The major function tags
+### The major function tags
-- @<SUBJ (@<SUBJ @<ext>) @SUBJ> @SUBJ @<SPRED
+- @<SUBJ (@<SUBJ @<ext>) @SUBJ> @SUBJ @<SPRED
The four main functions for subject, object and their predicatives are
-- @-FSUBJ> @-FOBJ> @-F<OBJ
+- @-FSUBJ> @-FOBJ> @-F<OBJ
These are tags for the same functions of infinite verbs outside the
-verbal: *mu* gets @-FSUBJ> in *Diet dáhpáhuvai mu dieđikeahttá* (the
-infinite verb gets @<ADVL) and *girjji* gets @-F<OBJ in *Munnje
-lei lossat lohkat girjji.* (the infinite verb gets @<SPRED).
+verbal: _mu_ gets @-FSUBJ> in _Diet dáhpáhuvai mu dieđikeahttá_ (the
+infinite verb gets @<ADVL) and _girjji_ gets @-F<OBJ in _Munnje
+lei lossat lohkat girjji._ (the infinite verb gets @<SPRED).
-The adverbial tags
+### The adverbial tags
-- @-FADVL
-- @P< @>P
-- @ADVL< @>ADVL
+- @-FADVL
+- @P< @>P
+- @ADVL< @>ADVL
The @ADVL> @<ADVL @ADVL tags mark adverbials (many, but not all of
the adverbials are adverbs). The two first ones indicate the direction
@@ -101,377 +94,371 @@ or is a complement of the adverbial to the left, respectively. Note that
these tags mark modifyers of adverbials rather than adverbials
-The NP-internal modifiers
+### The NP-internal modifiers
The other syntactic tags for modifiers tell which word they modify, and
whether they modify to the left or to the right.
-- @>N @>A @>Num @>Pron
-- @Pron< @N< @Num<
+- @>N @>A @>Num @>Pron
+- @Pron< @N< @Num<
The morphological tag will tell what kind of part of speech the
constituent itself is.
The @Pron< tag is for eg. numerals modifying pronouns to their left,
-like in *Mii golmmas finaimet máná geahčen*.
+like in _Mii golmmas finaimet máná geahčen_.
-The @Num< tag is for complements of numerals, like *máná* in *Sudnos
-leat golbma máná*.
+The @Num< tag is for complements of numerals, like _máná_ in _Sudnos
+leat golbma máná_.
+### Appositions
-- @APP-N< @APP-Pron< @APP-Num< @APP-ADVL<
-- @APP>Pron
+- @APP-N< @APP-Pron< @APP-Num< @APP-ADVL<
+- @APP>Pron
The apposition tag marks whether it is an apposition of a noun, a
pronoun, a numeral or an adverbial.
-The function words
+### The function words
-- @CNP @CVP
+- @CNP @CVP
Conjunction connecting NPs and VPs.
-Sentence-external tags
+### Sentence-external tags
Stray noun in sentence fragment, interjection and vocative.
-The @X tag
+### The @X tag
-- @X
+- @X
An @X tag is assigned to mark that no tag has been assigned (because of
omissions in our rule component)
-The tags, listed alphabetically
+## The tags, listed alphabetically
Here is a list of the tags, with a definition or description, and one or
more examples following each of them.
-- **@+FAUXV:**
- Finite auxiliary verb.
- - **ferte (V):**
- *Sámi geavaheddjiid bálvalusaid vuođđun *ferte* leat
- sámegielmáhttu ja sámi kulturmáhttu. - 'Saami user services
- *need* to be based on Saami language competence and Saami
- cultural competence.'*
-- **@+FMAINV:**
- Finite main verb.
- - **Boađe (V):**
- **Boađe* boahtte vahku. - '*Come* next week.'*
-- **@-F<ADVL:**
- - **árbbolaččain (N):**
- *Danne dárbbašit mii oažžut lobi Nils Aslak Valkeapää
- *árbbolaččain* almmuhit dán guokte lávlaga min sálbma-CD:s. -
- 'Therefore we need to get permission from Nils Aslak Valkeapää's
- *heirs* to release these two songs on our psalm-CD.'*
-- **@-F<OBJ:**
- Object of infinite verb outside the verbal.
- - **govaid (N):**
- *Boađe mu lusa geahččat *govaid*! - 'Come to me and look at *the
- pictures*!'*
-- **@-F<OPRED:**
- Object predicative of infinite verb outside the verbal.
- - **xxx:**
- *xxx*
-- **@-F<SPRED:**
- Subject predicative of infinite verb outside the verbal.
- - **xxx:**
- *xxx*
-- **@>A:**
- Modifier of an adjective to the left.
- - **nu (Adv):**
- *Gulahallan Sámedikkiin dán gažaldagas šaddá *nu*
- konkrehtalažžan go vejolaš. - 'The discussion in the Saami
- Parliament about this issue gets *as* concrete as possible.'*
-- **@A<:**
- Modifier of an adjective to the right.
- - **básárdoaluin (N):**
- *IL Nordlysa beaivválaš jođiheaddji, Nils Peder Eriksen, lohká
- iežaset leat oalle duhtavaččat dán jagáš *básárdoaluin*. - 'The
- business manager of IL Nordlys, Nils Peder Eriksen, says he is
- really satisfied with this year's *bazar arrangment*.'*
-- **@ADVL:**
- Sentence adverbial, @ADVL> or @<ADVL.
-- **@>ADVL:**
- Modifier of an adverbial.
- - **Man (Adv):**
- **Man* dávjá don lávet fitnat doppe? - '*How* often do you
- usually go there?'*
-- **@<ADVL:**
- adverbial to the right of the finite verb
- - **beaivvážis (N):**
- *Gávpot ii dárbbaš čuovgga *beaivvážis* ii ge mánus. - 'The city
- does not need light *from the sun* and not the from the moon
- either.*
-- **@ADVL>:**
- Adverbial to the left of the finite verb.
- - **lasttain (N):**
- *Ja muora *lasttain* ožžot álbmogat dearvvašvuođa. - 'And from
- the tree's *leaves*, the people get health.'*
-- **@ADVL<:**
- Complement of an adverbial to the right of its head.
- - **vahkus (N):**
- *Mun málestan guktii *vahkus*. - 'I make food twice a *week*.'*
-- **@ADVL>CS:**
- adverbial modifying a conjunction
- - **dallah (Adv):**
- **Dallah* goh Jeesuse tjaetseste tjuedtjele, dellie vuajna Elmie
- rihpesåvva jih Voejkene altasasse goh ledtie suaja. - '*Right*
- after Jesus stood up from the water, he sees that heaven opens
- and the holy spirit flies to him like a bird.'*
- - **:**
- *(sma)*
-- **@APP-ADVL<:**
- Apposition to an adverbial to the left. If the apposition consists
- of more than one word, the head will get this tag.
- - **ovdal (Pr):**
- *Dolin, *ovdal* soađi, olbmot lávejedje vuovdit joŋaid. - 'In
- old times, *before* the war, people used to sell cowberries.'*
-- **@APP-N<:**
- Apposition to a noun to the left of it. If the apposition is more
- than one word, the head will get this tag.
- - **eatnigiela (N):**
- *Viimmat mun ohppen čállit sámegiela, mu *eatnigiela*. -
- 'Finally, I learned to write in Sámi, my *mother tongue*.'*
-- **@APP-Num<:**
- Apposition to a numeral to the left.
- - **suinniid (N):**
- *Juohke heasta borrá sullii 6 kilu *suinniid* beaivái. - 'Every
- horse eats approximately 6 kilograms of *grass* a day.'*
-- **@APP>Pron:**
- Apposition to a pronoun to the right. If the apposition is more than
- one constituent, the head will get this tag.
- - **Turner (N Prop):**
- *Muhto diet Will *Turner*, son nai lea fiinna olmmái. - 'But
- this Will *Turner*, he is also a nice guy.'*
-- **@APP-Pron<:**
- Apposition to a pronoun to the left. If the apposition is more than
- one constituent, the head will get this tag.
- - **olbmái (N):**
- *Dan mun muitalan dušše dutnje, mu buoremus *olbmái*. - 'This I
- tell only you, my best *friend*.'*
-- **@CMPND:**
- First part of a compound followed by a hyphen
- - **skaehtie-:**
- *Reerenasse galka båetije stoerredigkieboelhkesne jåerhkedh dam
- *skaehtie-* jïh åasadaltesem mij lea daelie, jïh daennie
- daltesisnie hov lea nuepie buerebe joekedimmiem darjodh.*
-- **@CNP:**
- Local conjunction or subjunction.
- - **ja (CC):**
- *Sihke Mázes *ja* Guovdageainnus leat boarrásat viššalit finadan
- doaibmaguovddážiin. - 'Both in Máze *and* Guovdageaidnu, the
- oldest people frequently got to the activitycentre.'*
- - **go (CS):**
- *Sámi geavaheaddjit hállet dávjá metaforaiguin ja sis leat ollu
- eará gulahallanvuogit *go* giella. - 'Saami users speak often in
- metaphores and the have many other ways of comunicating *than*
- by means of language.'*
-- **@COMP-CS<:**
- Complement of subjunction.
- - **vejolaš (A):**
- *Gulahallan Sámedikkiin dán gažaldagas šaddá nu konkrehtalažžan
- go *vejolaš*. - 'The contact with the Saami Parliament about
- this issue gets as concrete as *possible*.'*
-- **@CVP:**
- Conjunction or subjunction that conjoins finite verb phrases.
- - **ja (CC):**
- *Bealatjogas leat dološ rájes leamaš bálvvossajit *ja* dát
- golbma sieiddi ledje dovddus gitta olgoriikii. - 'Long since,
- there have been sacrificial sites at Bealatjohka *and* the three
- 'sieidi' (cult images) were known even abroad.*
- - **go (CS):**
- *Leago guhkes áigi dassá *go* Máreha oidnet? - 'Has it been a
- long time *since* you have seen Máret?'*
-- **@-FADVL>:**
- Complement of infinite verb outside the verbal.
- - **várrogasat (Adv):**
- *Dihkkadeaddji rávve skohtervuddjiid *várrogasat* mátkkoštit.
- 'The roadman warns snowscooter drivers to drive *carefully*.'*
-- **@-FAUXV:**
- Infinite auxiliary verb.
- - **sáhte (V):**
- *Eat mii *sáhte* vuolgit. - 'We *can*not leave.'*
-- **@-FMAINV:**
- Infinite main verb.
- - **geargan (V):**
- *Ja Biret-Elle lea easka skuvllas *geargan*. - 'And Biret-Elle
- has just *finished* school.'*
-- **@-FOBJ>:**
- Object of infinite verb outside the verbal.
- - **váldovuoittuid (N):**
- *Valáštallanhálla lei njealjehas dievva olbmuiguin geat vurde
- *váldovuoittuid* fasket. - 'The gymn was to a quarter full of
- people that wait to grab *the main prizes*.'*
-- **@-FSUBJ>:**
- Subject of infinite verb outside the verbal.
- - **mu (Pron):**
- *Diet dáhpáhuvai *mu* dieđikeahttá. - 'It happened without *me*
- knowing about it.'*
-- **@ADVL> <hab>:**
- Habitive to the left of the finite verb.
- - **Máhtes (N):**
- **Máhtes* lea beana. - '*Máhtte* has a dog.'*
-- **@<ADVL <hab>:**
- Habitive to the right of the finite verb.
- - **dus (Pron):**
- *Leago *dus* ruhta? - 'Do *you* have money?'*
-- **@HNOUN:**
- Stray noun in sentence fragments.
- - **boddu (N):**
- *Vuosttaš *boddu*. - 'First *lesson*.'*
-- **@INTERJ:**
- Interjection.
- - **maid (Interj):**
- **Maid*, iigo leat boahtán? - '*What*, hasn't he/she come?'*
-- **@>N:**
- Prenominal modifier to the left
- - **geavatlaš (A):**
- *Ráđđehussii lea *geavatlaš* politihkka deaŧalaš. - 'For the
- government, *practical* politics is important.'*
- - **oahppo-:**
- **Oahppo-* ja dutkanministtar dat lea ráhkadan dieđáhusa alit
- sámi oahpu ja dutkama birra. - 'The secretary for *education*
- and research has given a notice about Saami higher education and
- research.'*
- - **rektor (N):**
- **Rektor* Tove Bull álgaga mielde... - 'According to *principal*
- Tove Bull ...'*
- - **Tove (N Prop):**
- *Rektor *Tove* Bull álgaga mielde... - 'According to principal
- *Tove* Bull ...'*
-- **@N<:**
- Modifier of the noun to the left.
- - **33 (Num):**
- *Mun lean ilus go beasan ovdanbuktit St.dieđ. nr. *33*. - 'I am
- happy that I get the opportunity to present the parliament
- notice number *33*.'* (In this case *33* modifies *St.dieđ.*.)
- - **vihtta (Num):**
- *Mun boađán diibmu *vihtta*. - 'I will come at *five* o'clock.'*
-- **@>Num:**
- Attributes of numeral to the right.
- - **nr (N):**
- *Mun lean ilus go beasan ovdanbuktit St.dieđ. *nr.* 33. - 'I am
- happy that I get the opportunity to present the parliament
- notice *number* 33.'*
-- **@Num<:**
- Attributes of numeral to the left.
- - **jagi (N):**
- *Son lea guoktelogi *jagi* boaris. - 'She/he is twenty *years*
- old.'*
-- **@<OBJ:**
- Direct object to the right of the finite verb.
- - **áiggi (N):**
- *Dat gáibida ollu *áiggi*. - 'That demands a lot of *time*.'*
-- **@OBJ>:**
- Direct object to the left of the finite verb.
- - **maid (Pron):**
- *Filbma lea oassi prošeavttas *maid* Sámi instituhtta lea
- ruthadan. - 'The film is a part of the project *that* the Saami
- institute has financed.'*
-- **@OPRED>:**
- Object predicative to the left of the finite verb.
- - **luoikkasin (N):**
- *Gaup dojii stivrrana hárjehallamiin, muhto oaččui *luoikkasin*
- eará stivrrana. - 'Gaup broke the handlebars during the
- practises, but got to *borrow* another steering.'*
-- **@<OPRED:**
- Object predicative to the right of the finite verb.
- - **buriid (A):**
- *Gáhkkuid son ráhkada hui *buriid*. - 'Cakes, she/he makes
- really *good ones*.'*
- - **sámegielhállin (N):**
- *Dagat iežat *sámegielhállin*. - 'You make yourself *a Saami
- speaker*.'*
-- **@>P:**
- Complement of postposition to the left of it.
- - **oahpu (N), dutkama (N):**
- *Oahppo- ja dutkanministtar dat lea ráhkadan dieđáhusa alit sámi
- *oahpu* ja *dutkama* birra. - 'The secretary for education and
- research has given a notice about Saami higher *education* and
- *research*.'*
-- **@P<:**
- Complement of preposition to the right of it.
- - **oasálaččaid (N):**
- *Finnmárkkus ii goassige leat leamaš ságastallan gaskal muhtun
- muddui seammadássásaš *oasálaččaid*. - 'There has never been a
- discussion in Finnmark between somehow equal *parts*.'*
-- **@PCLE:**
- Particle.
- - **amma (Pcle):**
- **Amma* mii eat leat máksán? - 'We haven't paid, *have we*?'*
-- **@<PPRED:**
- a predicative with a predicative as its head
- - **reaŋgan (N):**
- *Máhtes lea Jovnna *reaŋgan*. - 'Máhtte has Jovnna *as a
- searvant*.'*
-- **@>Pron:**
- Modifier of a pronoun to the left of it.
- - **buot (Pron):**
- *Mun, Johanas, lean dat guhte lean gullan ja oaidnán *buot*
- dán. - 'I, Johanas, am the one who has heard and seen *all* of
- it.'*
-- **@Pron<:**
- Modifier of pronoun to the right of it.
- - **ipmašiid (N):**
- *Maid *ipmašiid* doppe dagat? - 'What *the heck* are you doing
- there?'*
- - **golmmas (N):**
- *Mii *golmmas* oktan du vieljain finaimet Niillas-čeazi
- geahčen. - 'We *three* together with your brother visited uncle
- Niillas.'*
-- **@SPRED:**
- Subject predicative in elliptical sentences.
- - **nommh (N):**
- *Die maa onterligksh nommh, ih goh tuhtjh, men die ligan
- onterligksh nierretjh aaj.*
- - **:**
- *(sma)*
-- **@<SPRED:**
- Subject predicative to the right of the finite verb.
- - **galbmasat (A):**
- *Mus leat gieđat nu *galbmasat*. - 'My hands are so *cold*.'*
-- **@SPRED>:**
- Subject predicative to the left of the finite verb.
- - **bargu (N):**
- *Sin *bargun* lei váldit fáŋgan Gonagasa. - 'Their *job* was to
- capture the King.'*
-- **@SUBJ:**
- Elliptical subject.
- - **ålma (N):**
- *Dennie synnagovgesne jis akte ålma maam doenh-aajmoe
- doerelamme.*
-- **@SUBJ>:**
- Subject to the left of the finite verb.
- - **son (Pron):**
- **Son* lea mu oabbá. - '*She* is my sister.'*
-- **@<SUBJ:**
- Subject to the right of the finite verb.
- - **ollusat (Pron):**
- *...ja dan vejolašvuođa orro gal *ollusat* geavahan. - '...and
- this opportunity, *many* seem to make use of.'*
-- **@<SUBJ <ext>:**
- Subject to the right of the finite verb, in a habitive or extencial
- construction.
- - **beana (N):**
- *Mus lea *beana*. - 'I have *a dog*.'*
- - **luopmánat (N):**
- *Jeakkis leat *luopmánat*. - 'There are *cloudberries* in the
- swamp.'*
-- **@VOC:**
- Vocative.
- - **hearrá:**
- **Hearrá*, du ráhkis ustit lea buohcci. - '*Lord*, your beloved
- friend is ill.'*
-- **@X:**
- A dummy tag assigned when no tag assignment rule has hit. This tag
- is useful for finding the flaws in the tag mapping section.
+- **@+FAUXV:**
+ Finite auxiliary verb.
+ - **ferte (V):**
+ Sámi geavaheddjiid bálvalusaid vuođđun _ferte_ leat
+ sámegielmáhttu ja sámi kulturmáhttu. - 'Saami user services
+ _need_ to be based on Saami language competence and Saami
+ cultural competence.'
+- **@+FMAINV:**
+ Finite main verb.
+ - **Boađe (V):**
+ _Boađe_ boahtte vahku. - '_Come_ next week.'
+- **@-F<ADVL:**
+ - **árbbolaččain (N):**
+ Danne dárbbašit mii oažžut lobi Nils Aslak Valkeapää
+ _árbbolaččain_ almmuhit dán guokte lávlaga min sálbma-CD:s. -
+ 'Therefore we need to get permission from Nils Aslak Valkeapää's
+ _heirs_ to release these two songs on our psalm-CD.'
+- **@-F<OBJ:**
+ Object of infinite verb outside the verbal.
+ - **govaid (N):**
+ Boađe mu lusa geahččat _govaid_! - 'Come to me and look at _the
+ pictures_!'
+- **@-F<OPRED:**
+ Object predicative of infinite verb outside the verbal.
+ - **xxx:**
+ _xxx_
+- **@-F<SPRED:**
+ Subject predicative of infinite verb outside the verbal.
+ - **xxx:**
+ _xxx_
+- **@>A:**
+ Modifier of an adjective to the left.
+ - **nu (Adv):**
+ Gulahallan Sámedikkiin dán gažaldagas šaddá _nu_
+ konkrehtalažžan go vejolaš. - 'The discussion in the Saami
+ Parliament about this issue gets _as_ concrete as possible.'
+- **@A<:**
+ Modifier of an adjective to the right.
+ - **básárdoaluin (N):**
+ IL Nordlysa beaivválaš jođiheaddji, Nils Peder Eriksen, lohká
+ iežaset leat oalle duhtavaččat dán jagáš _básárdoaluin_. - 'The
+ business manager of IL Nordlys, Nils Peder Eriksen, says he is
+ really satisfied with this year's _bazaar arrangement_.'
+- **@ADVL:**
+ Sentence adverbial, @ADVL> or @<ADVL.
+- **@>ADVL:**
+ Modifier of an adverbial.
+ - **Man (Adv):**
+ _Man_ dávjá don lávet fitnat doppe? - '_How_ often do you
+ usually go there?'
+- **@<ADVL:**
+ Adverbial to the right of the finite verb.
+ - **beaivvážis (N):**
+ Gávpot ii dárbbaš čuovgga _beaivvážis_ ii ge mánus. - 'The city
+ does not need light _from the sun_, nor from the moon
+ either.'
+- **@ADVL>:**
+ Adverbial to the left of the finite verb.
+ - **lasttain (N):**
+ Ja muora _lasttain_ ožžot álbmogat dearvvašvuođa. - 'And from
+ the tree's _leaves_, the people get health.'
+- **@ADVL<:**
+ Complement of an adverbial to the right of its head.
+ - **vahkus (N):**
+ Mun málestan guktii _vahkus_. - 'I make food twice a _week_.'
+- **@ADVL>CS:**
+ Adverbial modifying a conjunction.
+ - **dallah (Adv):**
+ _Dallah_ goh Jeesuse tjaetseste tjuedtjele, dellie vuajna Elmie
+ rihpesåvva jih Voejkene altasasse goh ledtie suaja. - '_Right_
+ after Jesus stood up from the water, he sees that heaven opens
+ and the holy spirit flies to him like a bird.'
+ - **:**
+ _(sma)_
+- **@APP-ADVL<:**
+ Apposition to an adverbial to the left. If the apposition consists
+ of more than one word, the head will get this tag.
+ - **ovdal (Pr):**
+ Dolin, _ovdal_ soađi, olbmot lávejedje vuovdit joŋaid. - 'In
+ old times, _before_ the war, people used to sell cowberries.'
+- **@APP-N<:**
+ Apposition to a noun to the left of it. If the apposition is more
+ than one word, the head will get this tag.
+ - **eatnigiela (N):**
+ Viimmat mun ohppen čállit sámegiela, mu _eatnigiela_. -
+ 'Finally, I learned to write in Sámi, my _mother tongue_.'
+- **@APP-Num<:**
+ Apposition to a numeral to the left.
+ - **suinniid (N):**
+ Juohke heasta borrá sullii 6 kilu _suinniid_ beaivái. - 'Every
+ horse eats approximately 6 kilograms of _grass_ a day.'
+- **@APP>Pron:**
+ Apposition to a pronoun to the right. If the apposition is more than
+ one constituent, the head will get this tag.
+ - **Turner (N Prop):**
+ Muhto diet Will _Turner_, son nai lea fiinna olmmái. - 'But
+ this Will _Turner_, he is also a nice guy.'
+- **@APP-Pron<:**
+ Apposition to a pronoun to the left. If the apposition is more than
+ one constituent, the head will get this tag.
+ - **olbmái (N):**
+ Dan mun muitalan dušše dutnje, mu buoremus _olbmái_. - 'This I
+ tell only you, my best _friend_.'
+- **@CMPND:**
+ First part of a compound followed by a hyphen
+ - **skaehtie-:**
+ Reerenasse galka båetije stoerredigkieboelhkesne jåerhkedh dam
+ _skaehtie-_ jïh åasadaltesem mij lea daelie, jïh daennie
+ daltesisnie hov lea nuepie buerebe joekedimmiem darjodh.
+- **@CNP:**
+ Local conjunction or subjunction.
+ - **ja (CC):**
+ Sihke Mázes _ja_ Guovdageainnus leat boarrásat viššalit finadan
+ doaibmaguovddážiin. - 'Both in Máze _and_ Guovdageaidnu, the
+ oldest people have frequently visited the activity centre.'
+ - **go (CS):**
+ Sámi geavaheaddjit hállet dávjá metaforaiguin ja sis leat ollu
+ eará gulahallanvuogit _go_ giella. - 'Saami users often speak in
+ metaphors and they have many other ways of communicating _than_
+ by means of language.'
+- **@COMP-CS<:**
+ Complement of subjunction.
+ - **vejolaš (A):**
+ Gulahallan Sámedikkiin dán gažaldagas šaddá nu konkrehtalažžan
+ go _vejolaš_. - 'The contact with the Saami Parliament about
+ this issue gets as concrete as _possible_.'
+- **@CVP:**
+ Conjunction or subjunction that conjoins finite verb phrases.
+ - **ja (CC):**
+ Bealatjogas leat dološ rájes leamaš bálvvossajit _ja_ dát
+ golbma sieiddi ledje dovddus gitta olgoriikii. - 'Long since,
+ there have been sacrificial sites at Bealatjohka _and_ the three
+ 'sieidi' (cult images) were known even abroad.'
+ - **go (CS):**
+ Leago guhkes áigi dassá _go_ Máreha oidnet? - 'Has it been a
+ long time _since_ you have seen Máret?'
+- **@-FADVL>:**
+ Complement of infinite verb outside the verbal.
+ - **várrogasat (Adv):**
+ Dihkkadeaddji rávve skohtervuddjiid _várrogasat_ mátkkoštit. -
+ 'The roadman warns snowscooter drivers to drive _carefully_.'
+- **@-FAUXV:**
+ Infinite auxiliary verb.
+ - **sáhte (V):**
+ Eat mii _sáhte_ vuolgit. - 'We *can*not leave.'
+- **@-FMAINV:**
+ Infinite main verb.
+ - **geargan (V):**
+ Ja Biret-Elle lea easka skuvllas _geargan_. - 'And Biret-Elle
+ has just _finished_ school.'
+- **@-FOBJ>:**
+ Object of infinite verb outside the verbal.
+ - **váldovuoittuid (N):**
+ Valáštallanhálla lei njealjehas dievva olbmuiguin geat vurde
+ _váldovuoittuid_ fasket. - 'The gym was to a quarter full of
+ people who waited to grab _the main prizes_.'
+- **@-FSUBJ>:**
+ Subject of infinite verb outside the verbal.
+ - **mu (Pron):**
+ Diet dáhpáhuvai _mu_ dieđikeahttá. - 'It happened without _me_
+ knowing about it.'
+- **@ADVL> <hab>:**
+ Habitive to the left of the finite verb.
+ - **Máhtes (N):**
+ _Máhtes_ lea beana. - '_Máhtte_ has a dog.'
+- **@<ADVL <hab>:**
+ Habitive to the right of the finite verb.
+ - **dus (Pron):**
+ Leago _dus_ ruhta? - 'Do _you_ have money?'
+- **@HNOUN:**
+ Stray noun in sentence fragments.
+ - **boddu (N):**
+ Vuosttaš _boddu_. - 'First _lesson_.'
+- **@INTERJ:**
+ Interjection.
+ - **maid (Interj):**
+ _Maid_, iigo leat boahtán? - '_What_, hasn't he/she come?'
+- **@>N:**
+ Prenominal modifier to the left
+ - **geavatlaš (A):**
+ Ráđđehussii lea _geavatlaš_ politihkka deaŧalaš. - 'For the
+ government, _practical_ politics is important.'
+ - **oahppo-:**
+ _Oahppo-_ ja dutkanministtar dat lea ráhkadan dieđáhusa alit
+ sámi oahpu ja dutkama birra. - 'The secretary for _education_
+ and research has given a notice about Saami higher education and
+ research.'
+ - **rektor (N):**
+ _Rektor_ Tove Bull álgaga mielde... - 'According to _principal_
+ Tove Bull ...'
+ - **Tove (N Prop):**
+ Rektor _Tove_ Bull álgaga mielde... - 'According to principal
+ _Tove_ Bull ...'
+- **@N<:**
+ Modifier of the noun to the left.
+ - **33 (Num):**
+ Mun lean ilus go beasan ovdanbuktit St.dieđ. nr. _33_. - 'I am
+ happy that I get the opportunity to present the parliament
+ notice number _33_.' (In this case _33_ modifies _St.dieđ._.)
+ - **vihtta (Num):**
+ Mun boađán diibmu _vihtta_. - 'I will come at _five_ o'clock.'
+- **@>Num:**
+ Attributes of numeral to the right.
+ - **nr (N):**
+ Mun lean ilus go beasan ovdanbuktit St.dieđ. _nr._ 33. - 'I am
+ happy that I get the opportunity to present the parliament
+ notice _number_ 33.'
+- **@Num<:**
+ Attributes of numeral to the left.
+ - **jagi (N):**
+ Son lea guoktelogi _jagi_ boaris. - 'She/he is twenty _years_
+ old.'
+- **@<OBJ:**
+ Direct object to the right of the finite verb.
+ - **áiggi (N):**
+ Dat gáibida ollu _áiggi_. - 'That demands a lot of _time_.'
+- **@OBJ>:**
+ Direct object to the left of the finite verb.
+ - **maid (Pron):**
+ Filbma lea oassi prošeavttas _maid_ Sámi instituhtta lea
+ ruthadan. - 'The film is a part of the project _that_ the Saami
+ institute has financed.'
+- **@OPRED>:**
+ Object predicative to the left of the finite verb.
+ - **luoikkasin (N):**
+ Gaup dojii stivrrana hárjehallamiin, muhto oaččui _luoikkasin_
+ eará stivrrana. - 'Gaup broke the handlebars during
+ practice, but got to _borrow_ another set of handlebars.'
+- **@<OPRED:**
+ Object predicative to the right of the finite verb.
+ - **buriid (A):**
+ Gáhkkuid son ráhkada hui _buriid_. - 'Cakes, she/he makes
+ really _good ones_.'
+ - **sámegielhállin (N):**
+ Dagat iežat _sámegielhállin_. - 'You make yourself _a Saami
+ speaker_.'
+- **@>P:**
+ Complement of postposition to the left of it.
+ - **oahpu (N), dutkama (N):**
+ Oahppo- ja dutkanministtar dat lea ráhkadan dieđáhusa alit sámi
+ _oahpu_ ja _dutkama_ birra. - 'The secretary for education and
+ research has given a notice about Saami higher _education_ and
+ _research_.'
+- **@P<:**
+ Complement of preposition to the right of it.
+ - **oasálaččaid (N):**
+ Finnmárkkus ii goassige leat leamaš ságastallan gaskal muhtun
+ muddui seammadássásaš _oasálaččaid_. - 'There has never been a
+ discussion in Finnmark between somewhat equal _parties_.'
+- **@PCLE:**
+ Particle.
+ - **amma (Pcle):**
+ _Amma_ mii eat leat máksán? - 'We haven't paid, _have we_?'
+- **@<PPRED:**
+ A predicative with a predicative as its head.
+ - **reaŋgan (N):**
+ Máhtes lea Jovnna _reaŋgan_. - 'Máhtte has Jovnna _as a
+ servant_.'
+- **@>Pron:**
+ Modifier of a pronoun to the left of it.
+ - **buot (Pron):**
+ Mun, Johanas, lean dat guhte lean gullan ja oaidnán _buot_
+ dán. - 'I, Johanas, am the one who has heard and seen _all_ of
+ it.'
+- **@Pron<:**
+ Modifier of pronoun to the right of it.
+ - **ipmašiid (N):**
+ Maid _ipmašiid_ doppe dagat? - 'What _the heck_ are you doing
+ there?'
+ - **golmmas (N):**
+ Mii _golmmas_ oktan du vieljain finaimet Niillas-čeazi
+ geahčen. - 'We _three_ together with your brother visited uncle
+ Niillas.'
+- **@SPRED:**
+ Subject predicative in elliptical sentences.
+ - **nommh (N):**
+ _Die maa onterligksh nommh, ih goh tuhtjh, men die ligan
+ onterligksh nierretjh aaj._
+ - **:**
+ _(sma)_
+- **@<SPRED:**
+ Subject predicative to the right of the finite verb.
+ - **galbmasat (A):**
+ Mus leat gieđat nu _galbmasat_. - 'My hands are so _cold_.'
+- **@SPRED>:**
+ Subject predicative to the left of the finite verb.
+ - **bargu (N):**
+ Sin _bargun_ lei váldit fáŋgan Gonagasa. - 'Their _job_ was to
+ capture the King.'
+- **@SUBJ:**
+ Elliptical subject.
+ - **ålma (N):**
+ _Dennie synnagovgesne jis akte ålma maam doenh-aajmoe
+ doerelamme._
+- **@SUBJ>:**
+ Subject to the left of the finite verb.
+ - **son (Pron):**
+ _Son_ lea mu oabbá. - '_She_ is my sister.'
+- **@<SUBJ:**
+ Subject to the right of the finite verb.
+ - **ollusat (Pron):**
+ ...ja dan vejolašvuođa orro gal _ollusat_ geavahan. - '...and
+ this opportunity, _many_ seem to make use of.'
+- **@<SUBJ <ext>:**
+ Subject to the right of the finite verb, in a habitive or existential
+ construction.
+ - **beana (N):**
+ Mus lea _beana_. - 'I have _a dog_.'
+ - **luopmánat (N):**
+ Jeakkis leat _luopmánat_. - 'There are _cloudberries_ in the
+ swamp.'
+- **@VOC:**
+ Vocative.
+ - **hearrá:**
+ _Hearrá_, du ráhkis ustit lea buohcci. - '_Lord_, your beloved
+ friend is ill.'
+- **@X:**
+ A dummy tag assigned when no tag assignment rule has hit. This tag
+ is useful for finding the flaws in the tag mapping section.
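
In the examples above, the angle bracket in a tag consistently points towards the head: `>` when the head stands to the right of the tagged word, `<` when it stands to the left. The following is a minimal Python sketch of that reading. The function name `head_direction` is hypothetical and is not part of the grammar files; the rule itself is inferred from the listed examples, not taken from `disambiguator.cg3`.

```python
from typing import Optional


def head_direction(tag: str) -> Optional[str]:
    """Return 'right' or 'left' for the side on which the head is expected.

    Only the first whitespace-separated token is inspected, so secondary
    tags such as <hab> or <ext> do not disturb the result.
    """
    function_tag = tag.split()[0]
    if not function_tag.startswith("@"):
        raise ValueError(f"not a syntactic tag: {tag!r}")
    points_right = ">" in function_tag
    points_left = "<" in function_tag
    if points_right and not points_left:
        return "right"
    if points_left and not points_right:
        return "left"
    # Tags such as @X, @CNP or @+FMAINV carry no direction.
    return None


for example in ["@>N", "@N<", "@ADVL> <hab>", "@<SUBJ <ext>", "@X"]:
    print(example, head_direction(example))
```

For instance, `@-F<OBJ` on _govaid_ is read as "the governing infinite verb (_geahččat_) is to the left".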
diff --git a/lang/common/flag-diacritics.md b/lang/common/flag-diacritics.md
index c2e6031e..1f27b579 100644
--- a/lang/common/flag-diacritics.md
+++ b/lang/common/flag-diacritics.md
@@ -1,8 +1,6 @@
-Flag diacritics
+# Flag diacritics
+## Introduction
The use of flag diacritics is documented in chapter 8 of the Xerox book.
The present page documents the flag diacritics format, and the use of
@@ -13,29 +11,28 @@ remove illegal compounds, and in order to handle automatic downcasing of
proper names when they are converted to e.g. adjectives. See the
documentation for each language for an overview.
-Flag diacritics format
+## Flag diacritics format
There are six types of flag diacritics, all of them with the format
@operator.feature.value@ or @operator.feature@:
-- **U or Unification flags, @U.feature.value@:**
- U is the unification operator, and the form is accepted if, for the
- relevant feature, the two flags in the derivation string have the
- same value.
-- **P or Positive (Re)Setting, @P.feature.value@:**
- Sets or resets the feature to the given value.
-- **N or Negative (Re)Setting, @N.feature.value@:**
- Sets or resets the feature to the negation of the given value.
-- **R or Require Test, @R.feature.value@:**
- For this diacritic, a test is performed, and it succeeds iff feature
- is currently set to value, otherwise the path is blocked.
-- **D or Disallow Test, @D.feature.value@:**
- A test is performed that succeds if feature is neutral or set to a
- value that is incompatible with value.
-- **C or Clear Feature, @C.feature@:**
- For this flag, the value of feature is reset to neutral.
-- **U or Unification Test, @U.feature.value@:**
- If feature is currently neutra, this diacritic causes feature to be
- set to value. Else if feature is currently set, then the test
- succeeds iff value is compatible with the current value of feature.
+- **U or Unification flags, @U.feature.value@:**
+ U is the unification operator, and the form is accepted if, for the
+ relevant feature, the two flags in the derivation string have the
+ same value.
+- **P or Positive (Re)Setting, @P.feature.value@:**
+ Sets or resets the feature to the given value.
+- **N or Negative (Re)Setting, @N.feature.value@:**
+ Sets or resets the feature to the negation of the given value.
+- **R or Require Test, @R.feature.value@:**
+ For this diacritic, a test is performed, and it succeeds iff feature
+ is currently set to value, otherwise the path is blocked.
+- **D or Disallow Test, @D.feature.value@:**
+ A test is performed that succeeds if feature is neutral or set to a
+ value that is incompatible with value.
+- **C or Clear Feature, @C.feature@:**
+ For this flag, the value of feature is reset to neutral.
+- **U or Unification Test, @U.feature.value@:**
+ If feature is currently neutral, this diacritic causes feature to be
+ set to value. Else if feature is currently set, then the test
+ succeeds iff value is compatible with the current value of feature.
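
The operator descriptions above can be read as a small state machine over feature settings. The sketch below is a toy Python interpreter that follows those descriptions literally; it is not the implementation used by any finite-state toolkit, and the names `apply_flag` and `accepts` as well as the features `case` and `cmp` are made up for illustration. It only handles the valueless form for `C`, and real tools may differ in details (for example in what `U` does to a negatively set feature).

```python
def compatible(current, value):
    """current is (value, negated); a negated setting means 'anything but value'."""
    wanted, negated = current
    return (wanted != value) if negated else (wanted == value)


def apply_flag(state, flag):
    """Apply one flag diacritic to state; return False if the path is blocked."""
    parts = flag.strip("@").split(".")
    if len(parts) not in (2, 3):
        raise ValueError(f"malformed flag diacritic: {flag!r}")
    op, feature = parts[0], parts[1]
    value = parts[2] if len(parts) == 3 else None
    current = state.get(feature)  # None means the feature is neutral

    if op == "C":  # clear: reset the feature to neutral
        state.pop(feature, None)
        return True
    if value is None:
        raise ValueError("this sketch only supports the valueless form for C")
    if op == "P":  # positive (re)setting
        state[feature] = (value, False)
        return True
    if op == "N":  # negative (re)setting
        state[feature] = (value, True)
        return True
    if op == "R":  # require: feature must currently be set to value
        return current == (value, False)
    if op == "D":  # disallow: neutral, or set to something incompatible
        return current is None or not compatible(current, value)
    if op == "U":  # unification: set if neutral, otherwise test
        if current is None:
            state[feature] = (value, False)
            return True
        return compatible(current, value)
    raise ValueError(f"unknown operator in {flag!r}")


def accepts(flags):
    """True if the sequence of flags along a path never blocks."""
    state = {}
    return all(apply_flag(state, flag) for flag in flags)


print(accepts(["@U.case.nom@", "@U.case.nom@"]))  # same value twice
print(accepts(["@U.case.nom@", "@U.case.acc@"]))  # conflicting values
```

The first call prints `True` and the second `False`, which mirrors the statement that a form is accepted only if the two `U` flags for a feature agree.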
diff --git a/lang/common/index.md b/lang/common/index.md
index 9ab9153f..b138764c 100644
--- a/lang/common/index.md
+++ b/lang/common/index.md
@@ -1,50 +1,45 @@
-Language models (transducers)
+# Language models (transducers)
Working with LEXC, TWOLC and Constraint Grammar
+## Transducers
- [Transducer infrastructure](../../infra/Infrastructure.md)
- [Tutorials for lexc, twolc and constraint grammar](Tutorials.html)
-- [Test scripts and routines for use when working on the tools](developingwork.html)
-- [Handling morphological variation in lexc](Variation_in_lexc.html)
-- [Principles for common (language-independent) lexicon entries](PrinciplesForCommonTagsAndLexiconEntries.html)
+- [Test scripts and routines for use when working on the tools](developingwork.html)
+- [Handling morphological variation in lexc](Variation_in_lexc.html)
+- [Principles for common (language-independent) lexicon entries](PrinciplesForCommonTagsAndLexiconEntries.html)
+## Shared resources
-Shared resources
Description of [how to set up](SharedResources.md) shared resources.
-Documentation of tags
+## Documentation of tags
These links document the different types of tags used in the grammar models.
-- [How the different tags are interacting with the FSTs](DifferentFSTs.html)
-- [Harmonising the most frekvent derivations in Saami languages](DerivationOverview.html)
-- [Compoundtags](CompoundTags.html)
-- [Morphological tags](MorphologicalTags.html)
-- [Derivational tags](DerivationOverview.html)
-- [Syntax](docu-sme-syntaxtags.html)
-- [Dependency](docu-deptags.html)
-- [Semantic tags](SemanticTags.html)
+- [How the different tags are interacting with the FSTs](DifferentFSTs.html)
+- [Harmonising the most frequent derivations in Saami languages](DerivationOverview.html)
+- [Compound tags](CompoundTags.html)
+- [Morphological tags](MorphologicalTags.html)
+- [Derivational tags](DerivationOverview.html)
+- [Syntax](docu-sme-syntaxtags.html)
+- [Dependency](docu-deptags.html)
+- [Semantic tags](SemanticTags.html)
-Language-specific documentation
+## Language-specific documentation
-- [Work on each languages is documented on their respective pages](https://giellalt.github.io/LanguageModels.html)
-- [Page for improving our linguistic analysis for the Saami languages](../smi/index.html)
+- [Work on each language is documented on its respective page](https://giellalt.github.io/LanguageModels.html)
+- [Page for improving our linguistic analysis for the Saami languages](../smi/index.html)
-Obsolete documentation
+## Obsolete documentation
-Here we keep some documentation that *now is obsolete*, but that we
+Here we keep some documentation that _now is obsolete_, but that we
don't want to throw away. Sometimes looking at how things were before
helps us understand the present situation, or it may support our memory.
-- [The original sme flowchart over the old
- infra](../sme/docu-sme-flowchart.html)
-- [The makefile setup in our old infra](../sme/docu-sme-makefile.html)
-- [Our oldinfra system for flag
- diacritics"](../sme/docu-sme-flag-diacritics.html)
+- [The original sme flowchart over the old
+ infra](../sme/docu-sme-flowchart.html)
+- [The makefile setup in our old infra](../sme/docu-sme-makefile.html)
+- [Our old infra system for flag
+ diacritics](../sme/docu-sme-flag-diacritics.html)
diff --git a/lang/common/korp-enkel.md b/lang/common/korp-enkel.md
index 546d1940..ac299afd 100644
--- a/lang/common/korp-enkel.md
+++ b/lang/common/korp-enkel.md
@@ -1,4 +1,4 @@
-# Searching with the *Simple* search box in Korp
+# Searching with the _Simple_ search box in Korp
Go to one of the Korp interfaces, e.g. [the Saami one](http://gtweb.uit.no/korp/). Click the **Simple** tab right below the **KORP** logo.
@@ -7,17 +7,16 @@ Gå til et av Korp-grensesnitta, f.eks. [det samiske](http://gtweb.uit.no/korp/)
The box has one search field. Enter a word form and press **Search**. Note the drop-down menu to the right of the word **Search**: it is possible to save the search you have made and then reuse it when comparing with other searches.
Below the search field there are 4 options:
- in order and also as
-- prefix
-- suffix
-- case-insensitive
+- prefix
+- suffix
+- case-insensitive
-This gives a simple regular expression, e.g. "all words ending in *-guin*".
+This gives a simple regular expression, e.g. "all words ending in _-guin_".
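
As a rough illustration of how the prefix, suffix and case options combine into one regular expression, here is a minimal Python sketch. It uses Python's `re` module, not Korp's own query syntax, and the function name `simple_search_pattern` and its parameters are hypothetical.

```python
import re


def simple_search_pattern(text, prefix=False, suffix=False, ignore_case=False):
    """Build a whole-word pattern from the search text and the ticked options."""
    body = re.escape(text)
    if prefix:  # the text is the beginning of a longer word
        body = body + r"\w*"
    if suffix:  # the text is the end of a longer word
        body = r"\w*" + body
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(body, flags)


# "All words ending in -guin": tick the suffix option only.
pattern = simple_search_pattern("guin", suffix=True)
words = ["metaforaiguin", "olbmuiguin", "giella"]
print([word for word in words if pattern.fullmatch(word)])
```

Ticking both prefix and suffix in this sketch yields a pattern that matches any word containing the text.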
## Possibilities with simple search
### Ordbild
Uncheck all four options, but tick **Ordbild** to the right in the blue bar below the search field. Enter e.g. a verb in the infinitive, and search. The search will take several minutes, but the result is an overview of the arguments associated with the verb (or two: one with the arguments before and one with the arguments after the verb).
diff --git a/lang/common/korp-extended.md b/lang/common/korp-extended.md
index 673df721..4165a057 100644
--- a/lang/common/korp-extended.md
+++ b/lang/common/korp-extended.md
@@ -1,59 +1,63 @@
-# Search with the search box *Extended* in Korp
+# Search with the search box _Extended_ in Korp
Go to one of the Korp interfaces, e.g. [the Sami](http://gtweb.uit.no/korp/). Tap the **Extended** tab right below the **KORP** logo.
-# The search box itself
+## The search box itself
(picture in English forthcoming)
![Alt text](korp-extended.png?raw=true "The search box *Extended*")
+### Simple use of the search box
+The box has 9 different search modes, _word, Part-of-speech, Grammatical analysis, Baseform, Dependency relation, Domain, Translated from, Title, date, time interval_. We go through them one by one:
-## Simple use of the search box
+#### word
-The box has 9 different search modes, *word, Part-of-speech, Grammatical analysis, Baseform, Dependency relation, Domain, Translated from, Title, date, time interval*. We go through them one by one than:
+Here you enter a _word form_. Note the alternatives to the right, e.g. _is, is not, ..._ The option _is not_ only makes sense when several search boxes are used.
-### word
-Here you enter *word form*. Mark alternative to the right, e.g. *is, is not, ...* The option *is not* only makes sense with the use of multiple search boxes.
+#### Part of speech
-### Part of speech
Here there are predefined options, one for each Part of speech.
-### Grammatical analysis
-Here you enter the grammatical tag. The dropdown menu immediately to the right says **contains**, because the tag is only a part of the string *word form + analysis*. If you want to search for several tags, e.g. locative singular, type **Sg.Loc** in the search field.
+#### Grammatical analysis
+Here you enter the grammatical tag. The dropdown menu immediately to the right says **contains**, because the tag is only a part of the string _word form + analysis_. If you want to search for several tags, e.g. locative singular, type **Sg.Loc** in the search field.
+#### Baseform
+Here you search for the **lexeme**. Selecting _sátni_ here gives hits on the inflected forms _sátni, sáni, sániid, ..._
-### Baseform
-Here you search for the **lexeme**. Selecting *sátni* here gives hit on the inflected forms *sátni, sáni, sániid, ...*
+#### Dependency relation
-### Dependency relation
-Here you can search for tags for syntactic function, e.g. **deprel_←OBJ** (in u\_corp it only says **deprel_←OBJ**). The drop-down menu provides a list of available function tags. Here is an [explanation of the tags for syntactic function](https://giellalt.uit.no/lang/sme/docu-sme-syntaxtags.html).
+Here you can search for tags for syntactic function, e.g. **deprel_←OBJ** (in u\_corp it only says **deprel_←OBJ**). The drop-down menu provides a list of available function tags. Here is an [explanation of the tags for syntactic function](https://giellalt.uit.no/lang/sme/docu-sme-syntaxtags.html).
+#### Domain
-### Domain
This is the set of corpus domains: **administration, bible, facts, ficti, news, ...** This does not seem to be implemented to work in search. On the other hand, it is possible to sort hits by domain during a search on **Statistics**.
-### Title
+#### Title
This is the **title** of the document. This doesn't seem to be implemented to work in search. On the other hand, it is possible to sort hits by domain during a search on **Statistics**.
-### Translated from
+#### Translated from
Metadata is very poor here, and it also does not seem to be implemented in search.
-## Combine several wrap legs in the same search box
+### Combine several conditions in the same search box
-It is possible to copy searches with the operators **AND** and **OR**. Press **or** at the bottom of the box to search for the union of two or more requirements (eg search for *noun or pronoun*). Press **and** to get a new part of the same search box, to search for an intersection of two requirements (eg search for a word that is *plural and object*).
+It is possible to combine search conditions with the operators **AND** and **OR**. Press **or** at the bottom of the box to search for the union of two or more requirements (e.g. search for _noun or pronoun_). Press **and** to get a new part of the same search box, to search for the intersection of two requirements (e.g. search for a word that is _plural and object_).
-# Combining multiple conditions in the same search box
-(The picture shows Norwegian as metalanguage)
+## Combining multiple search boxes
+(The picture shows Norwegian as metalanguage)
![Alt text](korp-treboksar.png?raw=true "Combination of several boxes")
By pressing **⨁** to the right of the search box, you get another search box, so you can search for word combinations. Here it might also be a good idea to search for **Part of speech is not**.
-# Search for more words and show statistics
+## Search for more words and show statistics
-Search for two words (mark the empty box between verb ob object), and select **Statistics**. The result is a frequency-sorted statistic of *verb + object*.
+Search for two words (note the empty box between verb and object), and select **Statistics**. The result is a frequency-sorted statistic of _verb + object_.
![Alt text](korp-treboks-obj.png?raw=true "Unspecified word between the verb and the object")
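A note on what the Extended box expresses: the conditions described in korp-extended.md correspond to a CQP query, with one bracketed token per search box. The following is a minimal Python sketch of that mapping, not Korp's own code. The helper names (`cond`, `token`, `sequence`) are hypothetical, and the attribute names and values (`pos`, `msd`, `deprel`, `N`, `Pron`, `V`, `Pl`, `←OBJ`) as well as the `_=` operator for "contains" are assumptions about the corpus configuration; check them against the drop-down menus before relying on them.

```python
def cond(attr, value, op="="):
    """One condition on a token attribute, e.g. pos = "N"."""
    return f'{attr} {op} "{value}"'


def token(*conds, join="&"):
    """One search box: its conditions joined by '&' (and) or '|' (or)."""
    if join not in ("&", "|"):
        raise ValueError("join must be '&' or '|'")
    return "[" + f" {join} ".join(conds) + "]"


def sequence(*tokens):
    """Several search boxes side by side: tokens that follow each other."""
    return " ".join(tokens)


ANY = "[]"  # an empty search box: any single token

# "noun or pronoun": the union of two requirements in one box (assumed tag names)
noun_or_pronoun = token(cond("pos", "N"), cond("pos", "Pron"), join="|")

# "plural and object": the intersection of two requirements in one box
plural_and_object = token(cond("msd", "Pl", op="_="), cond("deprel", "←OBJ"))

# verb + unspecified word + object: three boxes in a row
verb_x_object = sequence(token(cond("pos", "V")), ANY, token(cond("deprel", "←OBJ")))

print(noun_or_pronoun)
print(plural_and_object)
print(verb_x_object)
```

Keeping each box as one bracketed token makes the difference between **and**/**or** (inside a box) and **⨁** (a new box) visible in the query string.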
diff --git a/lang/common/korp-simple.md b/lang/common/korp-simple.md
index 2f722bac..803e5a56 100644
--- a/lang/common/korp-simple.md
+++ b/lang/common/korp-simple.md
@@ -1,4 +1,4 @@
-# Search with the search box *Simple* in Korp
+# Search with the search box _Simple_ in Korp
Go to one of the corpus collections, e.g. [the Saami one](http://gtweb.uit.no/korp/). Press the tab **Simple** just underneath the **KORP** logo.
@@ -13,11 +13,10 @@ Under the search field there are 4 options:
- final part and
- case-insensitive
-This gives the possibility to use simple regular expressions, e.g. "all words in *-guin*".
+This gives the possibility to use simple regular expressions, e.g. "all words in _-guin_".
-## Possibilities with *simple search*
+## Possibilities with _simple search_
### Word image
Uncheck all four alternatives, then tick **Word picture** on the right in the blue line below the search field. Enter e.g. a verb in the infinitive, and search. The search will take several minutes, but the result will be an overview of arguments linked to the verb (or two: One with an argument before and one with an argument after the verb).
diff --git a/lang/common/korp-utvidet.md b/lang/common/korp-utvidet.md
index 2e98e59e..6344983c 100644
--- a/lang/common/korp-utvidet.md
+++ b/lang/common/korp-utvidet.md
@@ -1,55 +1,59 @@
-# Søk med søkeboksen *Utvidet* i Korp
+# Søk med søkeboksen _Utvidet_ i Korp
Gå til et av Korp-grensesnitta, f.eks. [det samiske](http://gtweb.uit.no/korp/). Trykk på fliken **Utvidet** rett under **KORP**-logoen.
+## Selve søkeboksen
+![Alt text](korp-utvidet.png?raw=true "Søkeboksen *Utvidet*")
-# Selve søkeboksen
+### Enkel bruk av søkeboksen
-![Alt text](korp-utvidet.png?raw=true "Søkeboksen *Utvidet*")
+Boksen har 9 ulike søkemodi, _ord, ordklasse, grunnform, dependensrelasjon, domain, tittel, translated from, tidsintervall_. Vi går gjennom dem en etter en:
+#### ord
-## Enkel bruk av søkeboksen
+Her skriver du inn _ordform_. Merk alternativa til høyre, f.eks. _er, er ikke, ..._ Alternativet _er ikke_ gir bare mening med bruk av flere søkebokser.
-Boksen har 9 ulike sækemodi, *ord, ordklasse, grunnform, dependensrelasjon, domain, tittel, translated from, tidsintervall*. Vi går gjennom dem en etter enn:
+#### msd (morphosyntactic description)
-### ord
-Her skriver du inn *ordform*. Merk alternativa til høyre, f.eks. *er, er ikke, ...* Alternativet *er ikke* gir bare mening med bruk av flere søkebokser.
+Her skriver du inn grammatisk tagg. Menyen til venstre står på **inneholder**, fordi taggen bare er en del av _ordform + analyse_. Viss målet t.d. er lokativ entall, skriv **Sg.Loc** i søkefeltet.
-### msd (morphosyntactic description)
-Her skriver du inn grammatisk tagg. Menyen til venstre står på **inneholder**, fordi taggen bare er en del av *ordform + analyse*. Viss målet t.d. er lokativ entall, skriv **Sg.Loc** i søkefeltet.
+#### ordklasse
-### ordklasse
Her er det ferdigdefinerte alternativ, et for hver ordklasse.
-### grunnform
-Her kan du søke på leksemet. Å velge *sátni* her gir *sátni, sáni, sániid, ...*
+#### grunnform
+Her kan du søke på leksemet. Å velge _sátni_ her gir _sátni, sáni, sániid, ..._
+#### dependensrelasjon
-### dependensrelasjon
-Her kan du søke på tagger for syntaktisk funksjon, f.eks. **deprel_←OBJ** (i u_korp står det bare **deprel_←OBJ**). Nedfallsmenyen gir ei liste over tilgjengelig funksjonstagger. Her er ei [forklaring av taggene for syntaktisk funksjon](https://giellalt.uit.no/lang/sme/docu-sme-syntaxtags.html).
+Her kan du søke på tagger for syntaktisk funksjon, f.eks. **deprel_←OBJ** (i u\_korp står det bare **deprel_←OBJ**). Nedfallsmenyen gir ei liste over tilgjengelige funksjonstagger. Her er ei [forklaring av taggene for syntaktisk funksjon](https://giellalt.uit.no/lang/sme/docu-sme-syntaxtags.html).
+#### domain
-### domain
Dette er korpusdomena **administration, bible, facts, ficti, news, ...** Det ser ikke ut til at dette er implementert til å fungere i søk. Derimot er det mulig å sortere treff etter domene under søk på **Statistikk**.
-### tittel
+#### tittel
Dette er **tittelen** til dokumentet. Det ser ikke ut til at dette er implementert til å fungere i søk. Derimot er det mulig å sortere treff etter domene under søk på **Statistikk**.
-### translated from
+#### translated from
Her er metadata svært dårlig, og det ser heller ikke ut til at dette er implementert i søk.
-## Kombinere flere viklår i samme søkeboks
+### Kombinere flere vilkår i samme søkeboks
-Det er mulig å kopiere søk med operatorene **OG** og **ELLER**. Trykk på **eller** nederst i boksen for å få søke etter unionen av to eller flere krav (f.eks. søk etter *substantiv eller pronomen*). Trykk på **og** for å få en ny del av samme søkeboks, for å søke etter et snitt av to krav (f.eks. søk etter et ord som er *plural og objekt*).
+Det er mulig å kopiere søk med operatorene **OG** og **ELLER**. Trykk på **eller** nederst i boksen for å få søke etter unionen av to eller flere krav (f.eks. søk etter _substantiv eller pronomen_). Trykk på **og** for å få en ny del av samme søkeboks, for å søke etter et snitt av to krav (f.eks. søk etter et ord som er _plural og objekt_).
-# Kombinere flere søkebokser
+## Kombinere flere søkebokser
![Alt text](korp-treboksar.png?raw=true "Kombinasjon av fleire boksar")
Med å trykke på **⨁** til høyre for søkeboksen får du en søkeboks til, slik at du kan søke på ordkombinasjoner. Her kan det også være en god idé å søke på **ordklasse er ikke**.
-# Søk på flere ord og vis statistikk
+## Søk på flere ord og vis statistikk
-Søk ett er to ord (merk den tomme boksen mellom verb ob objekt), og velg **Statistikk**. Resultatet blir en frekvenssortert statistikk over *verb + objekt*.
+Søk etter to ord (merk den tomme boksen mellom verb og objekt), og velg **Statistikk**. Resultatet blir en frekvenssortert statistikk over _verb + objekt_.
![Alt text](korp-treboks-obj.png?raw=true "Uspesifisert ord mellom verbet og objektet")
diff --git a/lang/docu-makefile.md b/lang/docu-makefile.md
index c47a8ee7..daa4194b 100644
--- a/lang/docu-makefile.md
+++ b/lang/docu-makefile.md
@@ -1,5 +1,4 @@
-The common Makefile and scripts
+# The common Makefile and scripts
The Makefile is used to compile the xfst and aspell source files, i.e.
to make the programs. It is put to use by (being in `gt/`) writing the
@@ -11,8 +10,7 @@ project is found in Appendix C of the Beesley and Karttunen book. The
makefiles for the other languages follow the same layout, but they are
-Makefile structure
+## Makefile structure
The makefile contains variables defining tools and files to be used in
compiling the programs. In the beginning of makefile are commonly used
@@ -21,8 +19,7 @@ tools and files, and after those there are language specific variables.
The rest of the makefile is documented in [sme
-Common scripts
+## Common scripts
Common scripts to all languages are in `gt/common/src/`, and binaries of
these scripts are in `gt/common/bin/`.
diff --git a/lang/index.md b/lang/index.md
index 69138688..758190f0 100644
--- a/lang/index.md
+++ b/lang/index.md
@@ -3,153 +3,147 @@ here](../infra/infraremake/NewinfraCatalogues.html). Languages with no
work done so far are marked with an asterisk (\*). There are
[tutorials](common/Tutorials.html) explaining the grammar format.
-Saami languages
+# Saami languages
-- [North](sme/j-sme.html), [Lule](smj/j-smj.html),
- [South](sma/j-sma.html), [Inari](smn/j-smn.html),
- [Kildin](sjd/index.html), [Pite](sje/PiteSaamiDocumentation.html),
- [Skolt](sms/j-sms.html) // [Common for all Saami
- languages](smi/index.html)
+- [North](sme/j-sme.html), [Lule](smj/j-smj.html),
+ [South](sma/j-sma.html), [Inari](smn/j-smn.html),
+ [Kildin](sjd/index.html), [Pite](sje/PiteSaamiDocumentation.html),
+ [Skolt](sms/j-sms.html) // [Common for all Saami
+ languages](smi/index.html)
-Other Finnic languages
+## Other Finnic languages
-- [Estonian (version 1)](est/EstonianDocumentation.html), [Estonian
- (version 2)](experimentest/EstonianDocumentation.html),
- [Finnish](fin/j-fin.html), [Ingrian](izh/IngrianDocumentation.html),
- [Kven](fkv/KvenDocumentation.html),
- [Livonian](liv/LivonianDocumentation.html),
- [Meänkieli\*](fit/MeankieliDocumentation.html),
- [Olonetsian](olo/OlonetsianDocumentation.html),
- [Veps](vep/VepsDocumentation.html),
- [Võro](vro/VoroDocumentation.html),
+- [Estonian (version 1)](est/EstonianDocumentation.html), [Estonian
+ (version 2)](experimentest/EstonianDocumentation.html),
+ [Finnish](fin/j-fin.html), [Ingrian](izh/IngrianDocumentation.html),
+ [Kven](fkv/KvenDocumentation.html),
+ [Livonian](liv/LivonianDocumentation.html),
+ [Meänkieli\*](fit/MeankieliDocumentation.html),
+ [Olonetsian](olo/OlonetsianDocumentation.html),
+ [Veps](vep/VepsDocumentation.html),
+ [Võro](vro/VoroDocumentation.html),
-Other Uralic languages
+## Other Uralic languages
-- [Eastern Mari](mhr/EasternMariDocumentation.html),
- [Erzya](myv/ErzyaDocumentation.html),
- [Khanty](kca/KhantyDocumentation.html), [Komi](kom/index.html),
- [Komi Permyak](koi/KomiPermyakDocumentation.html),
- [Moksha](mdf/MokshaDocumentation.html),
- [Nganasan](nio/NganasanDocumentation.html), [Northern
- Mansi](mns/NorthernMansiDocumentation.html),
- [Selkup](sel/SelkupDocumentation.html), [Tundra
- Nenets](yrk/TundraNenetsDocumentation.html),
- [Udmurt](udm/UdmurtDocumentation.html), [Western
- Mari](mrj/WesternMariDocumentation.html)
+- [Eastern Mari](mhr/EasternMariDocumentation.html),
+ [Erzya](myv/ErzyaDocumentation.html),
+ [Khanty](kca/KhantyDocumentation.html), [Komi](kom/index.html),
+ [Komi Permyak](koi/KomiPermyakDocumentation.html),
+ [Moksha](mdf/MokshaDocumentation.html),
+ [Nganasan](nio/NganasanDocumentation.html), [Northern
+ Mansi](mns/NorthernMansiDocumentation.html),
+ [Selkup](sel/SelkupDocumentation.html), [Tundra
+ Nenets](yrk/TundraNenetsDocumentation.html),
+ [Udmurt](udm/UdmurtDocumentation.html), [Western
+ Mari](mrj/WesternMariDocumentation.html)
-American languages
+## American languages
-- [Apuriña\*](apu/ApurinaDocumentation.html), [Central Alaskan
- Yupik\*](esu/CentralAlaskanYupikDocumentation.html), [Central
- Siberian Yupik\*](ess/CentralSiberianYupikDocumentation.html),
- [Cherokee\*](chr/CherokeeDocumentation.html),
- [Dogrib\*](dgr/DogribDocumentation.html),
- [Greenlandic](kal/index.html), [Iñupiaq](ipk/index.html),
- [Kiowa\*](kio/KiowaDocumentation.html), [Northern
- Haida](hdn/NorthernHaidaDocumentation.html),
- [Ojibwa](oji/OjibwaDocumentation.html), [Ojibwe
- (Chippewa)](ciw/OjibweDocumentation.html), [Plains
- Cree](crk/PlainsCreeDocumentation.html), [Southern Puget Sound
- Salish
- (Lushootseed)](lut/SouthernPugetSoundSalishDocumentation.html),
- [Tsuut’ina (Sarcee)](srs/TsuutinaDocumentation.html), [Upper Necaxa
- Totonac](tku/UpperNecaxaTotonacDocumentation.html), [Upper
- Tanana](tau/UpperTananaDocumentation.html)
+- [Apuriña\*](apu/ApurinaDocumentation.html), [Central Alaskan
+ Yupik\*](esu/CentralAlaskanYupikDocumentation.html), [Central
+ Siberian Yupik\*](ess/CentralSiberianYupikDocumentation.html),
+ [Cherokee\*](chr/CherokeeDocumentation.html),
+ [Dogrib\*](dgr/DogribDocumentation.html),
+ [Greenlandic](kal/index.html), [Iñupiaq](ipk/index.html),
+ [Kiowa\*](kio/KiowaDocumentation.html), [Northern
+ Haida](hdn/NorthernHaidaDocumentation.html),
+ [Ojibwa](oji/OjibwaDocumentation.html), [Ojibwe
+ (Chippewa)](ciw/OjibweDocumentation.html), [Plains
+ Cree](crk/PlainsCreeDocumentation.html), [Southern Puget Sound
+ Salish
+ (Lushootseed)](lut/SouthernPugetSoundSalishDocumentation.html),
+ [Tsuut’ina (Sarcee)](srs/TsuutinaDocumentation.html), [Upper Necaxa
+ Totonac](tku/UpperNecaxaTotonacDocumentation.html), [Upper
+ Tanana](tau/UpperTananaDocumentation.html)
-Other languages
+## Other languages
-- [Bashkir](bak/BashkirDocumentation.html),
- [Buryaad](bxr/BuryadDocumentation.html),
- [Chukchi\*](ckt/ChukchiDocumentation.html),
- [Cornish](cor/CornishDocumentation.html),
- [Evenki\*](evn/EvenkiDocumentation.html), [Faroese](fao/index.html),
- [Finnish Romani\*](rmf/FinnishRomaniDocumentation.html),
- [Irish\*](gle/IrishDocumentation.html), [Kalderash
- Romani\*](rmy/KalderashRomaniDocumentation.html), [Khalkha
- Mongolian\*](khk/KhalkhaMongolianDocumentation.html),
- [Khakas\*](kjh/KhakasDocumentation.html),
- [Latvian](lav/LatvianDocumentation.html), [Norwegian
- Bokmål](nob/j-nob.html), [Romanian](ron/RomanianDocumentation.html),
- [Aromanian](rup/AromanianDocumentation.html),
- [Russian](rus/RussianDocumentation.html),
- [Somali](som/SomaliDocumentation.html),
- [Klingon\*](tlh/KlingonDocumentation.html),
- [Tuvan\*](tyv/TuvanDocumentation.html),
- [Kalmyk\*](xal/KalmykDocumentation.html), [Todo
- Oirat\*](xwo/TodoOiratDocumentation.html),
+- [Bashkir](bak/BashkirDocumentation.html),
+ [Buryaad](bxr/BuryadDocumentation.html),
+ [Chukchi\*](ckt/ChukchiDocumentation.html),
+ [Cornish](cor/CornishDocumentation.html),
+ [Evenki\*](evn/EvenkiDocumentation.html), [Faroese](fao/index.html),
+ [Finnish Romani\*](rmf/FinnishRomaniDocumentation.html),
+ [Irish\*](gle/IrishDocumentation.html), [Kalderash
+ Romani\*](rmy/KalderashRomaniDocumentation.html), [Khalkha
+ Mongolian\*](khk/KhalkhaMongolianDocumentation.html),
+ [Khakas\*](kjh/KhakasDocumentation.html),
+ [Latvian](lav/LatvianDocumentation.html), [Norwegian
+ Bokmål](nob/j-nob.html), [Romanian](ron/RomanianDocumentation.html),
+ [Aromanian](rup/AromanianDocumentation.html),
+ [Russian](rus/RussianDocumentation.html),
+ [Somali](som/SomaliDocumentation.html),
+ [Klingon\*](tlh/KlingonDocumentation.html),
+ [Tuvan\*](tyv/TuvanDocumentation.html),
+ [Kalmyk\*](xal/KalmykDocumentation.html), [Todo
+ Oirat\*](xwo/TodoOiratDocumentation.html),
-All languages listed alphabetically
+## All languages listed alphabetically
-- [Apuriña](apu/ApurinaDocumentation.html)
-- [Aromanian](rup/AromanianDocumentation.html)
-- [Bashkir](bak/BashkirDocumentation.html)
-- [Central Alaskan Yupik\*](esu/CentralAlaskanYupikDocumentation.html)
-- [Central Siberian
- Yupik\*](ess/CentralSiberianYupikDocumentation.html)
-- [Cherokee\*](chr/CherokeeDocumentation.html)
-- [Chukchi\*](cor/ChukchiDocumentation.html)
-- [Cornish](cor/CornishDocumentation.html)
-- [Dogrib\*](dgr/DogribDocumentation.html)
-- [Eastern Mari](mhr/EasternMariDocumentation.html)
-- [Erzya Mordvin](myv/ErzyaDocumentation.html)
-- [Estonian (version 1)](est/EstonianDocumentation.html)
-- [Estonian (version 2)](experimentest/EstonianDocumentation.html)
-- [Evenki](evn/EvenkiDocumentation.html)
-- [Faroese](fao/index.html)
-- [Finnish](fin/j-fin.html)
-- [Finnish Romani\*](rmf/FinishRomaniDocumentation.html)
-- [Greenlandic](kal/index.html)
-- [Inari Saami](smn/j-smn.html)
-- [Ingrian](izh/IngrianDocumentation.html)
-- [Iñupiaq](ipk/index.html)
-- [Irish\*](gle/IrishDocumentation.html)
-- [Kalderash Romani\*](rmy/KalderashRomaniDocumentation.html)
-- [Kalmyk](xal/KalmykDocumentation.html)
-- [Khalkha Mongolian](khk/KhalkhaMongolianDocumentation.html)
-- [Khakas](kjh/KhakasDocumentation.html)
-- [Khanty](kca/KhantyDocumentation.html)
-- [Kiowa](kio/KiowaDocumentation.html)
-- [Klingon](tlh/KlingonDocumentation.html)
-- [Komi](kom/index.html)
-- [Komi Permyak](koi/KomiPermyakDocumentation.html)
-- [Kven](fkv/KvenDocumentation.html)
-- [Latvian](lav/LatvianDocumentation.html)
-- [Livonian](liv/LivonianDocumentation.html)
-- [Lule Saami](smj/j-smj.html)
-- [Moksha Mordvin](mdf/MokshaDocumentation.html)
-- [Meänkieli\*](mdf/MeankieliDocumentation.html)
-- [Nganasan](nio/NganasanDocumentation.html)
-- [North Saami](sme/j-sme.html)
-- [Northern Haida](hdn/NorthernHaidaDocumentation.html)
-- [Northern Mansi](mns/NorthernMansiDocumentation.html)
-- [Norwegian Bokmål](nob/j-nob.html)
-- [Ojibwa](oji/OjibwaDocumentation.html)
-- [Ojibwe (Chippewa)](ciw/OjibweDocumentation.html)
-- [Olonetsian](olo/OlonetsianDocumentation.html)
-- [Kildin Saami](sjd/index.html)
-- [Pite Saami](sje/PiteSaamiDocumentation.html)
-- [Plains Cree](crk/PlainsCreeDocumentation.html)
-- [Romanian](ron/RomanianDocumentation.html)
-- [Russian](rus/RussianDocumentation.html)
-- [Selkup\*](sel/SelkupDocumentation.html)
-- [South Saami](sma/j-sma.html)
-- [Southern Puget Sound Salish
- (Lushootseed)](lut/SouthernPugetSoundSalishDocumentation.html)
-- [Skolt Saami](sms/j-sms.html)
-- [Todo Oirat](xwo/TodoOiratDocumentation.html)
-- [Tsuut’ina](srs/TsuutinaDocumentation.html)
-- [Tundra Nenets](yrk/TundraNenetsDocumentation.html)
-- [Tuvan](tyv/TuvanDocumentation.html)
-- [Udmurt](udm/UdmurtDocumentation.html)
-- [Upper Necaxa Totonac](tku/UpperNecaxaTotonacDocumentation.html)
-- [Upper Tanana](tau/UpperTananaDocumentation.html)
-- [Võro](vro/VoroDocumentation.html)
-- [Western Mari](mrj/WesternMariDocumentation.html)
+- [Apuriña](apu/ApurinaDocumentation.html)
+- [Aromanian](rup/AromanianDocumentation.html)
+- [Bashkir](bak/BashkirDocumentation.html)
+- [Central Alaskan Yupik\*](esu/CentralAlaskanYupikDocumentation.html)
+- [Central Siberian
+ Yupik\*](ess/CentralSiberianYupikDocumentation.html)
+- [Cherokee\*](chr/CherokeeDocumentation.html)
+- [Chukchi\*](ckt/ChukchiDocumentation.html)
+- [Cornish](cor/CornishDocumentation.html)
+- [Dogrib\*](dgr/DogribDocumentation.html)
+- [Eastern Mari](mhr/EasternMariDocumentation.html)
+- [Erzya Mordvin](myv/ErzyaDocumentation.html)
+- [Estonian (version 1)](est/EstonianDocumentation.html)
+- [Estonian (version 2)](experimentest/EstonianDocumentation.html)
+- [Evenki](evn/EvenkiDocumentation.html)
+- [Faroese](fao/index.html)
+- [Finnish](fin/j-fin.html)
+- [Finnish Romani\*](rmf/FinnishRomaniDocumentation.html)
+- [Greenlandic](kal/index.html)
+- [Inari Saami](smn/j-smn.html)
+- [Ingrian](izh/IngrianDocumentation.html)
+- [Iñupiaq](ipk/index.html)
+- [Irish\*](gle/IrishDocumentation.html)
+- [Kalderash Romani\*](rmy/KalderashRomaniDocumentation.html)
+- [Kalmyk](xal/KalmykDocumentation.html)
+- [Khalkha Mongolian](khk/KhalkhaMongolianDocumentation.html)
+- [Khakas](kjh/KhakasDocumentation.html)
+- [Khanty](kca/KhantyDocumentation.html)
+- [Kiowa](kio/KiowaDocumentation.html)
+- [Klingon](tlh/KlingonDocumentation.html)
+- [Komi](kom/index.html)
+- [Komi Permyak](koi/KomiPermyakDocumentation.html)
+- [Kven](fkv/KvenDocumentation.html)
+- [Latvian](lav/LatvianDocumentation.html)
+- [Livonian](liv/LivonianDocumentation.html)
+- [Lule Saami](smj/j-smj.html)
+- [Moksha Mordvin](mdf/MokshaDocumentation.html)
+- [Meänkieli\*](fit/MeankieliDocumentation.html)
+- [Nganasan](nio/NganasanDocumentation.html)
+- [North Saami](sme/j-sme.html)
+- [Northern Haida](hdn/NorthernHaidaDocumentation.html)
+- [Northern Mansi](mns/NorthernMansiDocumentation.html)
+- [Norwegian Bokmål](nob/j-nob.html)
+- [Ojibwa](oji/OjibwaDocumentation.html)
+- [Ojibwe (Chippewa)](ciw/OjibweDocumentation.html)
+- [Olonetsian](olo/OlonetsianDocumentation.html)
+- [Kildin Saami](sjd/index.html)
+- [Pite Saami](sje/PiteSaamiDocumentation.html)
+- [Plains Cree](crk/PlainsCreeDocumentation.html)
+- [Romanian](ron/RomanianDocumentation.html)
+- [Russian](rus/RussianDocumentation.html)
+- [Selkup\*](sel/SelkupDocumentation.html)
+- [South Saami](sma/j-sma.html)
+- [Southern Puget Sound Salish
+ (Lushootseed)](lut/SouthernPugetSoundSalishDocumentation.html)
+- [Skolt Saami](sms/j-sms.html)
+- [Todo Oirat](xwo/TodoOiratDocumentation.html)
+- [Tsuut’ina](srs/TsuutinaDocumentation.html)
+- [Tundra Nenets](yrk/TundraNenetsDocumentation.html)
+- [Tuvan](tyv/TuvanDocumentation.html)
+- [Udmurt](udm/UdmurtDocumentation.html)
+- [Upper Necaxa Totonac](tku/UpperNecaxaTotonacDocumentation.html)
+- [Upper Tanana](tau/UpperTananaDocumentation.html)
+- [Võro](vro/VoroDocumentation.html)
+- [Western Mari](mrj/WesternMariDocumentation.html)
\*) = Languages with no real work done, found in `startup-langs/`.
diff --git a/lang/parallel_names.md b/lang/parallel_names.md
index eb694345..f24265e7 100644
--- a/lang/parallel_names.md
+++ b/lang/parallel_names.md
@@ -1,44 +1,32 @@
# Names and multilinguality
Meeting between **Sjur, Thomas, Trond** on Nov. 14, 2006.
1. First problem:
-* All names in all languages will likely be misunderstood if the material is published in
+- All names in all languages will likely be misunderstood if the material is published in
-* "foreign" names can be as much noise as they are valuable, and including them must be
+- "foreign" names can be as much noise as they are valuable, and including them must be
done carefully
We need a more principled approach to this.
Background: the name lexicon is getting attention from the SD name/terminology
sections, and they would like to use our name lexicon also for public searching.
+1. Multilinguality is always optional.
-1) Multilinguality is always optional.
+2. We can observe that "foreign" names in texts follow a domination pattern:
+ majority language forms can be found in minority language texts as real names
+ ("Kautokeino produkter"), whereas minority language names _almost always_
+ occur in majority language texts as citations. And citations should not be
+ considered a natural part of the text.
+3. When looking at our name classification, multilinguality varies according to:
-2) We can observe that "foreign" names in texts follows a domination pattern:
-majority language forms can be found in minority language texts as real names
-("Kautokeino produkter"), whereas minority language names *almost always*
-occur in majority language texts as citations. And citations should not be
-considered a natural part of the text.
-3) When looking at our name classification, multilinguality varies according to:
Ani - weak/none? (pet, myth anim. names)
Fem - weak (informative)
Mal - weak (informative)
@@ -49,48 +37,51 @@ Sur - none
Tit - strong (titles)
-We need to reconsider the *all names in all languages* policy. That policy is
+We need to reconsider the _all names in all languages_ policy. That policy is
valid only for `Fem, Mal,` and `Sur` (and Ani and Tit?). For
`Obj, Org, Plc` the rule should be that if they have multilingual names, each
name should only be used in its own language. Then we need a modification
saying that majority language names can be included in minority language
lexicons **if attested** in our corpus.
Also, the majority language varies
according to country (obviously), which means that in a speller context, we
might consider tailoring spellers for each country, leaving out noise relating
to majority language names from another country.
+## TODO
-# finish first version of the editing (**Sjur, Tomi**)
-# add @type=secondary and @excl=speller,hyph to all names marked with !SUB (**Saara**)
-# test editing of the xml files. If ok, then: (**Sjur, Thomas, Trond**)
-# make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well)
- (den morfologiske delen skal vere intakt i t.d. propernoun-sme-morph.txt) (**Sjur**)
-# convert propernoun-($lang)-lex.txt to a derived file from common xml files
- (**Sjur, Tomi, Saara**)
-# Rens terms-sme.xml slik at alle namn har rett tagging for ulik bruk (@type=secondary)
- (**Thomas, Maaren, linguists**)
-# Slå i hop stadnamn som ikkje er i same termposten: Helsinki, Helsingfors, Helsset
- (**linguists**)
-# Gjer namnematerialet søkbart i risten.no (**Sjur**)
-# Legg til evt. manglande parallellnamn (stadnamn) (**linguists**)
-# Lag koplingar mellom Niillas og Nils (**linguists**)
+- finish first version of the editing (**Sjur, Tomi**)
+- add @type=secondary and @excl=speller,hyph to all names marked with !SUB (**Saara**)
+- test editing of the xml files. If ok, then: (**Sjur, Thomas, Trond**)
+- make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well)
+(the morphological part should stay intact in e.g. propernoun-sme-morph.txt) (**Sjur**)
+- convert propernoun-($lang)-lex.txt to a derived file from common xml files
+(**Sjur, Tomi, Saara**)
+- Clean up terms-sme.xml so that all names have the correct tagging for different uses (@type=secondary)
+(**Thomas, Maaren, linguists**)
+- Merge place names that are not in the same term entry: Helsinki, Helsingfors, Helsset
+- Make the name material searchable in risten.no (**Sjur**)
+- Add any missing parallel names (place names) (**linguists**)
+- Create links between Niillas and Nils (**linguists**)
======= termcenter.xml =========
@@ -163,25 +154,25 @@ After merge:
diff --git a/lang/smi/AvgrenseAvleiing.md b/lang/smi/AvgrenseAvleiing.md
index e350dec6..4857891d 100644
--- a/lang/smi/AvgrenseAvleiing.md
+++ b/lang/smi/AvgrenseAvleiing.md
@@ -1,52 +1,43 @@
-Avgrense avleiing
+# Avgrense avleiing
Sjå [møtedokumentet](https://divvungiellatekno.github.io/giellalt.uit.no/admin/linguists/200423_AvgrenseAvleiing.html) (frå 23.4. 2020)
Her held vi fram frå møtet:
-# Konklusjon frå møtet
-* Vi skal endre +Comp og +Superl til +DerN+Der/Comp+A og +DerN+Der/Superl+A (nummeret på DerN må vi sjå på)
-** dette medfører endringer i filene for (signer når det er gjort)
-*** NDS
-*** Apertium
-*** Gramchecker?
-*** cgi paradigmegenerator
-*** andre?
-* Vi bør observere og deretter forbetre 12345-grammatikken, og evt andre begrensninger
-* måtar å avgrensa derivasjonane på:
-** Bare for normativ HFST: ved hjelp av Der12345-grammatikken
-** For all analyse/generering: ved hjelp av fortsettingsleksikoner
-** For all analyse/generering: ved hjelp av diakritiske flagg som også fungerer på desc (Px er løst slik)
-** Leksikalisere selve derivasjonen, feks. buoremusvuohta
-** Leksikalisere derivasjoner som er utgangspunkt for nye derivasjoner
-* I smj endrer vi slik at på -dibme og -ahtes ikke får Comp og Superl (ahtes skal få nummer likt eller etter Comp/Superl)
-# Steg framover
+## Konklusjon frå møtet
+- Vi skal endre +Comp og +Superl til +DerN+Der/Comp+A og +DerN+Der/Superl+A (nummeret på DerN må vi sjå på)
+ - dette medfører endringer i filene for (signer når det er gjort)
+ - NDS
+ - Apertium
+ - Gramchecker?
+ - cgi paradigmegenerator
+ - andre?
+- Vi bør observere og deretter forbetre 12345-grammatikken, og evt andre begrensninger
+- måtar å avgrensa derivasjonane på:
+ - Bare for normativ HFST: ved hjelp av Der12345-grammatikken
+ - For all analyse/generering: ved hjelp av fortsettingsleksikoner
+ - For all analyse/generering: ved hjelp av diakritiske flagg som også fungerer på desc (Px er løst slik)
+ - Leksikalisere selve derivasjonen, feks. buoremusvuohta
+ - Leksikalisere derivasjoner som er utgangspunkt for nye derivasjoner
+- I smj endrer vi slik at på -dibme og -ahtes ikke får Comp og Superl (ahtes skal få nummer likt eller etter Comp/Superl)
+## Steg framover
1. Legge Der/ til Comp/Superl
1. Deretter dei andre stega ovafor
+### Derivasjoner som vi bør se på
-## Derivasjoner som vi bør se på
-### Caritiv med komparativ, når brukes det?
-I korp finnes noen få reelle eksempler (her trenges bedre disambiguering), så vi kunne ihvertfall begrense til bare leksikaliserte -heapme:
-* [Korpsøk sme -heapme Comp](http://gtweb.uit.no/korp/#?stats_reduce=word&cqp=%5Bmsd%20_%3D%20%22Comp%22%20%26%20lemma%20%26%3D%20%22heapme%22%5D&search_tab=1&sort=keyword&hpp=1000&search=cqp)
-* [Korpsøk sme -heapme Superl|http://gtweb.uit.no/korp/#?stats_reduce=word&cqp=%5Bmsd%20_%3D%20%22Superl%22%20%26%20lemma%20%26%3D%20%22heapme%22%5D&search_tab=1&sort=keyword&hpp=1000&search=cqp
-* Her er det meste ikke komperlativ. Her trenges god disambiguering]
-* Eksempler på bruk:
-** Geahnohat bealli dán riiddus lea ge eahpitkeahttá palestiinnálaččat.
-** gutneheappo mielkkeheabbon návccaheappot
-## Logg over hva som blir gjort, med dato
+#### Caritiv med komparativ, når brukes det?
+I korp finnes noen få reelle eksempler (her trenges bedre disambiguering), så vi kunne ihvertfall begrense til bare leksikaliserte -heapme:
+- [Korpsøk sme -heapme Comp](http://gtweb.uit.no/korp/#?stats_reduce=word&cqp=%5Bmsd%20_%3D%20%22Comp%22%20%26%20lemma%20%26%3D%20%22heapme%22%5D&search_tab=1&sort=keyword&hpp=1000&search=cqp)
+- [Korpsøk sme -heapme Superl](http://gtweb.uit.no/korp/#?stats_reduce=word&cqp=%5Bmsd%20_%3D%20%22Superl%22%20%26%20lemma%20%26%3D%20%22heapme%22%5D&search_tab=1&sort=keyword&hpp=1000&search=cqp)
+- Her er det meste ikke komperlativ. Her trenges god disambiguering.
+- Eksempler på bruk:
+ - Geahnohat bealli dán riiddus lea ge eahpitkeahttá palestiinnálaččat.
+ - gutneheappo mielkkeheabbon návccaheappot
+### Logg over hva som blir gjort, med dato
diff --git a/lang/smi/DisambiguerenBargovuohki.md b/lang/smi/DisambiguerenBargovuohki.md
index aade7c16..81348e65 100644
--- a/lang/smi/DisambiguerenBargovuohki.md
+++ b/lang/smi/DisambiguerenBargovuohki.md
@@ -1,22 +1,14 @@
-Disambigueren -- bargovuohki
+# Disambigueren -- bargovuohki
(dát lea álgu, dokumeanta ii leat válmmaš)
Go lea gávdnan cealkaga mas lea boasttoanalysa: Dás leat ideat das mo sáhttá bargat.
1. Makkár njuolggadus addá boasttoanalysa (= Rule1)? Kommentere dan gaskaboddosaččat eret.
1. Jus dalle oažžu rivttes analysa, de sáhttá buoridit Rule1.
-1. Jus ii leat vejolaš Rule1 divvut: de sáhtát álkidahttit cealkaga (= váldit eret modifikáhtora/advearbbaid jnv), ja analyseret ođđasit.
- 1. Jus dat lihkostuvvá, de sáhtát buoridit njuolggadusa mii vállje riekta (= Rule2) nu ahte dohkkeha modivikáhtoriid/advearbbaid, jus Rule2 boahtá árabut go Rule1.
- 1. Jus buorideapmi ii leat vejolaš, de sáhtát ráhkadit ođđa njuolggadusa (= Rule3) mii galgá boahtit árabut fiillas go Rule1.
- 1. Geahča CG-fiilla struktuvrra, ja geahččal bidjat ođđa njuolggadusaid eará seammásullásaš njuolggadusaid searvái.
+1. Jus ii leat vejolaš Rule1 divvut: de sáhtát álkidahttit cealkaga (= váldit eret modifikáhtora/advearbbaid jnv), ja analyseret ođđasit.
+ 1. Jus dat lihkostuvvá, de sáhtát buoridit njuolggadusa mii vállje riekta (= Rule2) nu ahte dohkkeha modifikáhtoriid/advearbbaid, jus Rule2 boahtá árabut go Rule1.
+ 1. Jus buorideapmi ii leat vejolaš, de sáhtát ráhkadit ođđa njuolggadusa (= Rule3) mii galgá boahtit árabut fiillas go Rule1.
+ 1. Geahča CG-fiilla struktuvrra, ja geahččal bidjat ođđa njuolggadusaid eará seammásullásaš njuolggadusaid searvái.
1. Teste ovdalgo šekket sisa: `sme$ sh script/testCGrules.sh`
1. Árvvoštala lasihit testkorpus.txt:i ja testkorpus.dis.corr.txt:i cealkaga mainna leat bargan.
diff --git a/lang/smi/Samansetjing.md b/lang/smi/Samansetjing.md
index c201900c..cb99c840 100644
--- a/lang/smi/Samansetjing.md
+++ b/lang/smi/Samansetjing.md
@@ -1,163 +1,143 @@
+# Samansetjing
Her samlar vi analyser og konklusjonar for samansetjing for samiske språk.
Se [møtereferat](https://divvungiellatekno.github.io/giellalt.uit.no/admin/linguists/210302_Cmp_avledninger.html)
+## Dynamiske sammensetningstyper og tagging av dem
+### +N+Cmp/SgNom+Cmp\#
+- sme: skuvlahistorjá - skuvla+N+Cmp/SgNom+Cmp#historjá+N+Sg+Nom
+- sme: jorgalanbargu - jorgalit+V+TV+Der/NomAct+N+Cmp/SgNom+Cmp#bargu+N+Sg+Nom
-# Dynamiske sammensetningstyper og tagging av dem
-## +N+Cmp/SgNom+Cmp#
-* sme: skuvlahistorjá - skuvla+N+Cmp/SgNom+Cmp#historjá+N+Sg+Nom
-* sme: jorgalanbargu - jorgalit+V+TV+Der/NomAct+N+Cmp/SgNom+Cmp#bargu+N+Sg+Nom
-## +N+Cmp/SgNom+Cmp/Hyph+Cmp#
-* sme: čoahkkinbovdehus-áššelistá čoahkkinbovdehus+N+Cmp/SgNom+Cmp/Hyph+Cmp#áššelistu+v3+N+Sg+Nom
-## +Prop+Cmp/SgNom+Cmp-#
-* sme: Nils-Henrik - Nils+N+Prop+Sem/Mal+Cmp/SgNom+Cmp-#Henrik+N+Prop+Sem/Mal+Sg+Nom
-## +Cmp/Sh+Cmp# (short form)
-* sme: čoarbbeallađas - čoarbbealli+N+Cmp/Sh+Cmp#lađas+N+Sg+Nom
-* sme: justiskomitea - justiisa+N+Cmp/Sh+Cmp#komitea+N+Sg+Nom
-## +N+Cmp/SgGen+Cmp#
-* sme: sámegiella - sápmi+N+Cmp/SgGen+Cmp#giella+N+Sg+Nom
-## +N+Cmp/PlGen+Cmp#
-* sme: mánáidskuvla - mánná+N+Cmp/PlGen+Cmp#skuvla+N+Sg+Nom
-## +A+Cmp/SgNom+Cmp#
-* sme: buoridahkki - buorre+A+Cmp/SgGen+Cmp#dahkki+N+NomAg+Sg+Nom
+### +N+Cmp/SgNom+Cmp/Hyph+Cmp\#
+- sme: čoahkkinbovdehus-áššelistá čoahkkinbovdehus+N+Cmp/SgNom+Cmp/Hyph+Cmp#áššelistu+v3+N+Sg+Nom
+### +Prop+Cmp/SgNom+Cmp-\#
+- sme: Nils-Henrik - Nils+N+Prop+Sem/Mal+Cmp/SgNom+Cmp-#Henrik+N+Prop+Sem/Mal+Sg+Nom
-## +A+Cmp/PlGen+Cmp#
-* sme: čalmmehemiidlihttu - čalmmeheapme+A+Cmp/PlGen+Cmp#lihttu+N+Sg+Nom
+### +Cmp/Sh+Cmp# (short form)
+- sme: čoarbbeallađas - čoarbbealli+N+Cmp/Sh+Cmp#lađas+N+Sg+Nom
+- sme: justiskomitea - justiisa+N+Cmp/Sh+Cmp#komitea+N+Sg+Nom
-## +A+Cmp/Attr+Cmp#
-* sme: oktasaščoahkkin - oktasaš+A+Cmp/Attr+Cmp#čoahkkin+N+Sg+Nom
+### +N+Cmp/SgGen+Cmp\#
+- sme: sámegiella - sápmi+N+Cmp/SgGen+Cmp#giella+N+Sg+Nom
-## +A+Cmp/Attr+Cmp/Hyph+Cmp#
-* sme: oppalaš-ávkkálaš - oppalaš+A+Cmp/Attr+Cmp/Hyph+Cmp#ávki+N+Der/lasj+A+Sg+Nom
+### +N+Cmp/PlGen+Cmp\#
+- sme: mánáidskuvla - mánná+N+Cmp/PlGen+Cmp#skuvla+N+Sg+Nom
-## +Num+Cmp-#
-* sme: 1700-lohku - 1700+Num+Cmp-#lohku+N+Sg+Nom
+### +A+Cmp/SgNom+Cmp\#
+- sme: buoridahkki - buorre+A+Cmp/SgGen+Cmp#dahkki+N+NomAg+Sg+Nom
-## +Num+Cmp/SgNom+Cmp#
-* sme: logijahki - logi+Num+Cmp/SgNom+Cmp#jahki+N+Sg+Nom
+### +A+Cmp/PlGen+Cmp\#
+- sme: čalmmehemiidlihttu - čalmmeheapme+A+Cmp/PlGen+Cmp#lihttu+N+Sg+Nom
-## +Num+Cmp/SgGen+Cmp#
-* sme: guovttejahkásaš - guokte+Num+Cmp/SgGen+Cmp#jahki+N+Der/sasj+A+Sg+Nom
+### +A+Cmp/Attr+Cmp\#
+- sme: oktasaščoahkkin - oktasaš+A+Cmp/Attr+Cmp#čoahkkin+N+Sg+Nom
-## +Adv+Cmp#
-* sme: dáppeolmmoš - dáppe+Adv+Err/Orth+Cmp#olmmoš+N+Sg+Nom : Hvorfor Err/Orth?
+### +A+Cmp/Attr+Cmp/Hyph+Cmp\#
+- sme: oppalaš-ávkkálaš - oppalaš+A+Cmp/Attr+Cmp/Hyph+Cmp#ávki+N+Der/lasj+A+Sg+Nom
-## +ACR+Cmp-#
-* sme: EU-válggain - EU+N+ACR+Cmp-#válga+N+Pl+Loc
+### +Num+Cmp-\#
+- sme: 1700-lohku - 1700+Num+Cmp-#lohku+N+Sg+Nom
+### +Num+Cmp/SgNom+Cmp\#
+- sme: logijahki - logi+Num+Cmp/SgNom+Cmp#jahki+N+Sg+Nom
-## Eksempler på sammensetninger som ikke gir sammensetningsanalyse
+### +Num+Cmp/SgGen+Cmp\#
+- sme: guovttejahkásaš - guokte+Num+Cmp/SgGen+Cmp#jahki+N+Der/sasj+A+Sg+Nom
-### Forleddet er substantiv, ikke nominativ eller genitiv
-* sme: buorringeavaheapmi (buorri+N+Ess)
-* sme: fápmuibidjan (fápmu+N+Sg+Ill)
-* sme: árvvusatnin (árvu+N+Sg+Loc)
+### +Adv+Cmp\#
+- sme: dáppeolmmoš - dáppe+Adv+Err/Orth+Cmp#olmmoš+N+Sg+Nom : Hvorfor Err/Orth?
-### Forleddet er substantiv, nominativ eller genitiv, men sammensetninga følger ikke regler i lexc for stammevokalen i forledd
-* sme: jahkeduhát (jahki+N+Sg+Nom)
-* sme: lottibeassi (loddi+N+Sg+Gen)
+### +ACR+Cmp-\#
+- sme: EU-válggain - EU+N+ACR+Cmp-#válga+N+Pl+Loc
-### Forleddet er substantiv, men har en forkortet form
-* sme: oahpaheaibargu (oahpaheaddji+N+Sg+Nom)
+### Eksempler på sammensetninger som ikke gir sammensetningsanalyse
+#### Forleddet er substantiv, ikke nominativ eller genitiv
+- sme: buorringeavaheapmi (buorri+N+Ess)
+- sme: fápmuibidjan (fápmu+N+Sg+Ill)
+- sme: árvvusatnin (árvu+N+Sg+Loc)
+#### Forleddet er substantiv, nominativ eller genitiv, men sammensetninga følger ikke regler i lexc for stammevokalen i forledd
-### Forleddet er pronomen
-* sme: iešdovdu (ieš+Pron+Refl+Sg+Nom)
-* sme: iežasmáksu (ieš+Pron+Refl+Gen+PxSg3)
-* sme: buohkaidopmodat (buohkat+Pron+Indef+Pl+Gen)
-* sme: dasagulli (dat+Pron+Dem+Sg+Ill)
+- sme: jahkeduhát (jahki+N+Sg+Nom)
+- sme: lottibeassi (loddi+N+Sg+Gen)
+#### Forleddet er substantiv, men har en forkortet form
-### Forleddet er tallord
-* sme: duhátjahki (duhát+Num+Sg+Nom)
-* sme: guoktelogi (guokte+Num+Sg+Nom)
-* sme: guvttiidlohku (guokte+Num+Pl+Gen)
+- sme: oahpaheaibargu (oahpaheaddji+N+Sg+Nom)
+#### Forleddet er pronomen
-### Forleddet er verb
-* sme: buolleviidna (buollit+V+IV+PrsPrc)
-* sme: báhcánvuoigatvuohta (báhcit+V+IV+PrfPrc)
+- sme: iešdovdu (ieš+Pron+Refl+Sg+Nom)
+- sme: iežasmáksu (ieš+Pron+Refl+Gen+PxSg3)
+- sme: buohkaidopmodat (buohkat+Pron+Indef+Pl+Gen)
+- sme: dasagulli (dat+Pron+Dem+Sg+Ill)
+#### Forleddet er tallord
+- sme: duhátjahki (duhát+Num+Sg+Nom)
+- sme: guoktelogi (guokte+Num+Sg+Nom)
+- sme: guvttiidlohku (guokte+Num+Pl+Gen)
+#### Forleddet er verb
-### Forleddet er adjektiv
-* sme: arvvesdálki (arvves+A+Attr)
-* sme: bajimusčoahkkin (bajit+A+Superl+Attr)
+- sme: buolleviidna (buollit+V+IV+PrsPrc)
+- sme: báhcánvuoigatvuohta (báhcit+V+IV+PrfPrc)
+#### Forleddet er adjektiv
-### Forleddet er adverb
-* sme: aisttonmearka (aistton+Adv)
-* sme: oktováhnen (okto+Adv)
-* sme: bajásgeassin (bajás+Adv)
-* sme: bieđgguidássanguovlu (bieđgguid+Adv)
+- sme: arvvesdálki (arvves+A+Attr)
+- sme: bajimusčoahkkin (bajit+A+Superl+Attr)
+#### Forleddet er adverb
-### Forleddet er postposisjon
-* sme: maŋisboahtti (maŋis+Po)
+- sme: aisttonmearka (aistton+Adv)
+- sme: oktováhnen (okto+Adv)
+- sme: bajásgeassin (bajás+Adv)
+- sme: bieđgguidássanguovlu (bieđgguid+Adv)
+#### Forleddet er postposisjon
-### Forleddet er propernoun
-* sme: Gáivuonasuopman (Gáivuotna+N+Prop+Sem/Plc+Sg+Gen)
-* sme: anársápmelaš (Anár+N+Prop+Sem/Plc+Sg+Nom)
+- sme: maŋisboahtti (maŋis+Po)
+#### Forleddet er propernoun
-### Ikke tydelige lånord, men forleddet får ikke egen analyse
-* sme: bužosdákti (bužos - ikke eget lemma)
+- sme: Gáivuonasuopman (Gáivuotna+N+Prop+Sem/Plc+Sg+Gen)
+- sme: anársápmelaš (Anár+N+Prop+Sem/Plc+Sg+Nom)
+#### Ikke tydelige lånord, men forleddet får ikke egen analyse
-### Etterleddet får ikke egen analyse
-* sme: bággopántideapmi (pántideapmi - ikke eget lemma)
-* sme: báikegoddelaš (goddelaš - ikke eget lemma)
+- sme: bužosdákti (bužos - ikke eget lemma)
+#### Etterleddet får ikke egen analyse
+- sme: bággopántideapmi (pántideapmi - ikke eget lemma)
+- sme: báikegoddelaš (goddelaš - ikke eget lemma)
+#### Lånord, forleddet får ikke egen analyse
-### Lånord, forleddet får ikke egen analyse
-* sme: adoptiivaváhnen (adoptiiva - ikke eget lemma)
-* sme: allegrohápmi (allegro - ikke eget lemma)
+- sme: adoptiivaváhnen (adoptiiva - ikke eget lemma)
+- sme: allegrohápmi (allegro - ikke eget lemma)
+#### Etterleddet er adjektiv
-### Etterleddet er adjektiv
-* sme: čálaoahppavaš
+- sme: čálaoahppavaš
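The dynamic compound analyses listed above, such as `skuvla+N+Cmp/SgNom+Cmp#historjá+N+Sg+Nom`, share a simple shape: parts are separated by `#`, and within each part the lemma is followed by `+`-separated tags. A minimal Python sketch of reading that shape follows. The names `Part`, `split_compound` and `first_part_cmp_tag` are hypothetical helpers for illustration, not part of any GiellaLT tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Part:
    lemma: str
    tags: List[str]


def split_compound(analysis: str) -> List[Part]:
    """Split an analysis string into compound parts at the '#' boundary."""
    parts = []
    for chunk in analysis.split("#"):
        lemma, *tags = chunk.split("+")
        parts.append(Part(lemma, tags))
    return parts


def first_part_cmp_tag(analysis: str) -> Optional[str]:
    """Return the Cmp/... tag of the first part (e.g. 'Cmp/SgNom'), if any."""
    first = split_compound(analysis)[0]
    for tag in first.tags:
        if tag.startswith("Cmp/"):
            return tag
    return None


if __name__ == "__main__":
    for part in split_compound("skuvla+N+Cmp/SgNom+Cmp#historjá+N+Sg+Nom"):
        print(part.lemma, part.tags)
```

Types without a `Cmp/` tag on the first part, such as `1700+Num+Cmp-#lohku+N+Sg+Nom`, yield `None` from `first_part_cmp_tag`.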
diff --git a/lang/smi/index.md b/lang/smi/index.md
index a2d8637e..f901514b 100644
--- a/lang/smi/index.md
+++ b/lang/smi/index.md
@@ -1,14 +1,13 @@
+# Topics
-- [Compounding in the Saami languages](Samansetjing.html)
-- [Harmonising derivational tags for Saami languages.
- Overview.](../common/DerivationOverview.html)
-- [Discussions on restricting derivation for Saami languages. Log for
- what is done.](AvgrenseAvleiing.html)
-- [Prinsipper for homonymi i lemma, varianter og subfomer](lemma.html)
-- [How to handle variation in LEXC: Main documentation in
- English](../common/Variation_in_lexc.html)
-- [Discussions on restricting generating of possessive suffixes, esp.
- North Saami](https://giellalt.github.io/lang-sme/PXdiscussion.html)
-- [Bruken av Use/-Spell taggen](minusspelltag.html)
+- [Compounding in the Saami languages](Samansetjing.html)
+- [Harmonising derivational tags for Saami languages.
+ Overview.](../common/DerivationOverview.html)
+- [Discussions on restricting derivation for Saami languages. Log for
+ what is done.](AvgrenseAvleiing.html)
+- [Prinsipper for homonymi i lemma, varianter og subformer](lemma.html)
+- [How to handle variation in LEXC: Main documentation in
+ English](../common/Variation_in_lexc.html)
+- [Discussions on restricting generating of possessive suffixes, esp.
+ North Saami](https://giellalt.github.io/lang-sme/PXdiscussion.html)
+- [Bruken av Use/-Spell taggen](minusspelltag.html)
diff --git a/lang/smi/lemma.md b/lang/smi/lemma.md
index 7e01c105..4ac4184a 100644
--- a/lang/smi/lemma.md
+++ b/lang/smi/lemma.md
@@ -1,66 +1,39 @@
# Prinsipp for lemmatisering av samiske språk
## Lemma som ikkje skal inn i stavekontrollen - Err/Lex
+Bakgrunnen for dette er ord i leksikon som ikkje skal inn i stavekontrollen, men som likevel skal bli generert. Døme på slike ord er på sørsamisk _cubanske, juni_, og det kan være behov for det i nordsamisk også.
-Bakgrunnen for dette er ord i leksikon som ikkje er skal inn i stavekontrollen, men som likevel skal bli generert. Døme på slike ord er på sørsamisk *cubanske, juni*, og det kan være behov for det i nordsamisk også.
-Desse blir merka med *+Err/Lex* i leksikon. Dei kjem med i genereringsfilene, men ikkje i den normative fila.
+Desse blir merka med _+Err/Lex_ i leksikon. Dei kjem med i genereringsfilene, men ikkje i den normative fila.
## Leksikalsk homonymi: identifisere riktig lemma
+Lemmaene er homonyme, men det er semantisk forskjell og forskjellige bøyningsparadigmer. I nordsamisk skiller vi de fleste med G3- og NomAg-tagger, fordi det er systematikk for store grupper av lemmaer.
-Lemmaene er homonyme, men det er samantisk forskjell og forskjellige bøyningsparadigmer. I nordsamisk skiller vi de fleste med G3- og NomAg-tagger, fordi det er systematikk for store grupper av lemmaer.
-| Nom | Gen | norsk | norm-fst-analyse
-| --- | --- | --- | ---
-| lohkki | lohki | lokk | lohkki+N+Sg+Nom
-| lohkki | lohkki | lesar | lohkki+N+NomAg+Sg+Nom
-| beassi | beasi | reir | beassi+N+Sg+Nom
-| beassi | beassi | never | beassi+G3+N+Sg+Nom
+| Nom | Gen | norsk | norm-fst-analyse |
+| ------ | ------ | ----- | --------------------- |
+| lohkki | lohki | lokk | lohkki+N+Sg+Nom |
+| lohkki | lohkki | lesar | lohkki+N+NomAg+Sg+Nom |
+| beassi | beasi | reir | beassi+N+Sg+Nom |
+| beassi | beassi | never | beassi+G3+N+Sg+Nom |
Når det er snakk om enkelttilfeller, gir vi disse arbitrære taggar `+Hom1, +Hom2, …` (nummerert oppover ad lib).
-Taggane blir lagt inn i leksikon før POS, men burde flyttast til etter POS
+Taggane blir lagt inn i leksikon før POS, men burde flyttast til etter POS
i kompileringa.
-* Eksempler fra sørsamisk:
-** govledh+Hom1 - kl. IV å høre
-** govledh+Hom2 - kl. V å høres
+- Eksempler fra sørsamisk:
+ - govledh+Hom1 - kl. IV å høre
+ - govledh+Hom2 - kl. V å høres
## Varianter under samme lemma: sortere bøyningsformer til riktig grunnform - v1, v2 osv
Ortografiske varianter av samme lemma, dvs. grunnform og ihvertfall deler av bøyingsparadigmet, bør i fst sorteres under samme lemma. Men vi legger til en tag for å kunne sortere bøyningsparadigmene til riktig grunnform.
Vi brukar taggane `+v1, +v2, …` (nummerert oppover ad lib) for å skilje mellom
dei ulike paradigmene.
-* Eksempler:
-** sihkar+v1:sihkar
-** sihkar+v2:sihkkar
+- Eksempler:
+ - sihkar+v1:sihkar
+ - sihkar+v2:sihkkar
Hvis grunnformen er den samme, men det er to mulige bøyningsparadigmer, bruker vi ikke denne merkinga.
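The `+v1, +v2, …` and `+Hom1, +Hom2, …` tags described above can be read mechanically from entries such as `sihkar+v1:sihkar`. The sketch below groups such entries by lemma so that each variant or homonym tag maps to its stem form. It is only an illustration under assumptions: the function names are hypothetical, and the entry format is simplified to `lemma+tag` with an optional `:stem`, which is narrower than real lexc lines.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# lemma, then a +vN or +HomN tag, then an optional ":stem" (simplified format)
_ENTRY = re.compile(
    r"^(?P<lemma>[^+:]+)\+(?P<tag>(?:v|Hom)\d+)(?::(?P<stem>.+))?$"
)


def parse_entry(entry: str) -> Tuple[str, str, Optional[str]]:
    """Return (lemma, tag, stem); stem is None when no ':stem' is given."""
    match = _ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"not a variant/homonym entry: {entry!r}")
    return match.group("lemma"), match.group("tag"), match.group("stem")


def group_by_lemma(entries: Iterable[str]) -> Dict[str, Dict[str, Optional[str]]]:
    """Collect all vN/HomN tags seen for each lemma."""
    grouped: Dict[str, Dict[str, Optional[str]]] = defaultdict(dict)
    for entry in entries:
        lemma, tag, stem = parse_entry(entry)
        grouped[lemma][tag] = stem
    return dict(grouped)


if __name__ == "__main__":
    print(group_by_lemma(["sihkar+v1:sihkar", "sihkar+v2:sihkkar"]))
```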
diff --git a/lang/smi/minusspelltag.md b/lang/smi/minusspelltag.md
index 7c8f5d7d..30974e7f 100644
--- a/lang/smi/minusspelltag.md
+++ b/lang/smi/minusspelltag.md
@@ -1,13 +1,11 @@
# Bruken av +Use/-Spell for samiske språk
Dokumentasjon over bruken i lexc for samiske språk.
Møtereferat er [her](https://divvungiellatekno.github.io/giellalt.uit.no/admin/linguists/220324_Tagger_Adverber.html)
-+Use/-Spell Orthographically correct, typically perifer words,
++Use/-Spell Orthographically correct, typically peripheral words,
excluded in speller because they cause trouble for frequent words (fra sme root)
lang-sme lan000$ cut -d '!' -f1 src/fst/stems/* |grep 'Use/-Spell' |wc -l 33
@@ -21,65 +19,66 @@ lang-sma lan000$ cut -d '!' -f1 src/fst/affixes/* |grep 'Use/-Spell' |wc -l
lang-sms lan000$ cut -d '!' -f1 src/fst/stems/* |grep 'Use/-Spell' |wc -l 0
lang-sms lan000$ cut -d '!' -f1 src/fst/affixes/* |grep 'Use/-Spell' |wc -l 14
lang-smn: 0
- ```
Linjene med denne taggen blir ikke med i normativ HFST. Vi diskuterte bruken.
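The counts above come from a `cut -d '!' -f1 … | grep 'Use/-Spell' | wc -l` pipeline: everything after the first `!` (the lexc comment) is dropped, and the remaining lines containing the tag are counted. A rough Python equivalent is sketched here for cases where the count is needed inside a script. The function name `count_tagged_lines` is hypothetical, and the third sample line is invented to show that a tag mentioned only in a comment is not counted.

```python
from typing import Iterable


def count_tagged_lines(lines: Iterable[str], tag: str = "Use/-Spell") -> int:
    """Count lines that contain `tag` before the first '!' comment marker."""
    return sum(1 for line in lines if tag in line.split("!", 1)[0])


if __name__ == "__main__":
    sample = [
        "buddi+N+Use/-Spell:buddás DER-SAS ;",
        "vahkku+N+Use/-Spell+Sem/Ani_Hum:vahkkos DER-SAS ; !50-vahkkosaš",
        # Invented line: the tag occurs only in the comment part.
        "dássi+N:dássás DER-SAS ; ! had +Use/-Spell earlier",
    ]
    print(count_tagged_lines(sample))
```

Like the shell pipeline, this treats every `!` as the start of a comment and does not try to handle escaped characters in lexc.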
## Bruken i sme
-### Oftest for å begrense generering, for å unngå å generere marginale former.
+### Oftest for å begrense generering, for å unngå å generere marginale former
-LEXICON acrooblique
+LEXICON acrooblique
+Der2+Der/ár+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Use/-Spell:»ár GAHPIRLONG ; !
-ČSV:ár ČSV:ár+v1+N+Sg+Nom
-ČSV:ár ČSV+v1+N+Prop+Sem/Org+ACR+Der/ár+N+Sg+Nom
-SG:ár SG+N+Prop+Sem/Org+ACR+Der/ár+N+Sg+Nom
+ČSV:ár ČSV:ár+v1+N+Sg+Nom
+ČSV:ár ČSV+v1+N+Prop+Sem/Org+ACR+Der/ár+N+Sg+Nom
+SG:ár SG+N+Prop+Sem/Org+ACR+Der/ár+N+Sg+Nom
-ČSV:ár ČSV:ár+N+Sg+Nom
-SG:ár SG:ár+? inf
+ČSV:ár ČSV:ár+N+Sg+Nom
+SG:ár SG:ár+? inf
+Jeg har sammenlikna med korpus. [i SIKOR finnes, med bøyningsformer](https://gtweb.uit.no/korp/#?stats_reduce=word&cqp=%5Bword%20*%3D%20%22.*%5BA-Z%C4%8C%C5%A0%5D%7B2%7D:*%5Ba%C3%A1%5Dr.%7B0,5%7D%22%5D&prefix&isCaseInsensitive&search_tab=1&search=cqp) : AUF:ár, TIFF:ár, NSR:ár, SG:ár, ČSV:ár. Bare ČSV:ár er leksikalisert.
- ```
-Jeg har sammenlikna med korpus. [i SIKOR finnes, med bøyningformer](https://gtweb.uit.no/korp/#?stats_reduce=word&cqp=%5Bword%20*%3D%20%22.*%5BA-Z%C4%8C%C5%A0%5D%7B2%7D:*%5Ba%C3%A1%5Dr.%7B0,5%7D%22%5D&prefix&isCaseInsensitive&search_tab=1&search=cqp) : AUF:ár, TIFF:ár, NSR:ár, SG:ár, ČSV:ár. Bare ČSV:ár er leksikalisert.
#### Hva er argumentet mot å la alle akronymer også få :ár med alle kasus? Eller burde man lage egen sti for Sem/Org som er typiske for slik bruk?
f.eks. for alle politiske parti?
-numerals.lexc: +Use/-Spell+Use/Circ: NUM-PREFIXES ; ! for §34 etc.
+numerals.lexc: +Use/-Spell+Use/Circ: NUM-PREFIXES ; ! for §34 etc.
Av 28 stier for adjektiv + vuohta har 8 +Use/-Spell, kanskje fordi de er mindre produktive? Men -vuohta skulle kanskje ikke dekke over for skrivefeil?
Disse bør sjekkes og sammenliknes med korpus. (Ved første blikk ser de ut til å være veldig marginale, f.eks. med adjektiv i flertall før derivasjon med vuohta: cealkemeahttumatvuohta.)
-### Noen ganger for å unngå genererte former som er svært marginale og som kan dekke over skrivefeil i frekvente ord,
+### Noen ganger for å unngå genererte former som er svært marginale og som kan dekke over skrivefeil i frekvente ord
-LEXICON ENGEL Restricted denominals for speller -eŋgel
-eŋgelaš eŋgel+N+Der/Dimin+N+Sg+Nom som også er en Err/Orth av eŋgelas
+LEXICON ENGEL Restricted denominals for speller -eŋgel
+eŋgelaš eŋgel+N+Der/Dimin+N+Sg+Nom som også er en Err/Orth av eŋgelas
-### Full dokumentasjon for sme, med kommentarer:
+### Full dokumentasjon for sme, med kommentarer
-#### nouns-fila: for å begrense generering, unngå for mange irrelevante former:
+#### nouns-fila: for å begrense generering, unngå for mange irrelevante former
-sis+N+CmpN/SgN+Use/-Spell+Sem/Dummytag+Cmp/SgNom:sis%> Rreal ;
+sis+N+CmpN/SgN+Use/-Spell+Sem/Dummytag+Cmp/SgNom:sis%> Rreal ;
sisa+N+CmpN/SgN+Use/-Spell+Sem/Dummytag+Cmp/SgNom:sisa%> Rreal ;
(disse gir bare støy, svært få relevante ord mangler leksikalisering, jeg kommenterer stiene ut)
+#### 108 substantiver med dynamisk førsteledd fra adj+attr
-#### 108 substantiver med dynamisk førsteledd fra adj+attr:
-Disse har jeg sammenlikna med korpus. Dette er lite produktive stier. Jeg har kommentert dem ut, sjekka i korpus og DG-ordbok og lagt til lemmaer i adj-fila.
+Disse har jeg sammenlikna med korpus. Dette er lite produktive stier. Jeg har kommentert dem ut, sjekka i korpus og DG-ordbok og lagt til lemmaer i adj-fila.
Mange av disse får også analyse som N+Pl+Nom, og overgenerering skaper dermed støy i analysen, siden adjektivanalysen blir +Attr
Ved at det ikke er dynamisk analyse, kan vi fange dem opp i missinglist.
-LEXICON NAMATCont second-part compounds (fra adj+attr og fra arabics)
+LEXICON NAMATCont second-part compounds (fra adj+attr og fra arabics)
nuolus+N+Use/-Spell:nuollus AHKASAS "unravelled? A" ;
stávval+N+Use/-Spell:stávval AGAdjINFL "syllabled A" ; Ikke i bruk
náittot+N+CmpN/SgN+CmpN/PlG+Use/-Spell+Sem/Hum:náittog AGAdjINFL "-gamic A" ;
@@ -88,10 +87,11 @@ suttat+N+Use/-Spell+Sem/Plc:sutt AGAdj ;
dáfot+N+Use/-Spell:dáfog AGAdjINFL "faceted A" ;
-#### substantiver med dynamisk førsteledd fra arabics:
+#### substantiver med dynamisk førsteledd fra arabics
Disse har jeg sammenlikna med korpus. Jeg har lagt til noen som manglet. Jeg forstår ikke at disse skulle lage problemer, så jeg har fjerna Use/-Spell for dem som er produktive
-LEXICON SASCont FROM NUMERALS, gives -kilosaš etc.
+LEXICON SASCont FROM NUMERALS, gives -kilosaš etc.
buddi+N+Use/-Spell:buddás DER-SAS ;
báiki+N+Use/-Spell+Sem/Ani_Hum:báikás DER-SAS ;
dássi+N+Use/-Spell:dássás DER-SAS ;
@@ -105,27 +105,29 @@ vahkku+N+Use/-Spell+Sem/Ani_Hum:vahkkos DER-SAS ; !50-vahkkosaš
čiehka+N+Der2+Der/has+N+Use/-Spell:čiegahass JOHTOLAT ;
giella+N+Der2+Der/lasj+A+Use/-Spell:gielal AHKASAS ; !2-gielalaš
#### hit går f.eks. fra NAMATCont ahki+N+Sem/Ani_Hum:ag DER-AGAdj ;
Her fjerner jeg Use/-Spell, den hindrer fornuftige dynamiske ord. Begrensninga bør skje tidligere i stien.
- +Der2+Der/t+A+CmpN/SgN+CmpN/PlG+Use/-Spell: AGAdj ; (2-agat)
++Der2+Der/t+A+CmpN/SgN+CmpN/PlG+Use/-Spell: AGAdj ; (2-agat)
#### hit går f.eks. fra NAMATCont lahttu+N+Sem/Hum:laht DER-OGAdj "membered A" ;
Her fjerner jeg Use/-Spell, den hindrer fornuftige dynamiske ord. Begrensninga bør skje tidligere i stien.
- +Der2+Der/t+A+CmpN/SgN+CmpN/PlG+Use/-Spell:og AGAdjINFL ; (2-lahtot)
++Der2+Der/t+A+CmpN/SgN+CmpN/PlG+Use/-Spell:og AGAdjINFL ; (2-lahtot)
+#### hit går f.eks. fra NAMATCont málli+N+Sem/Ani_Hum:máll DER-EGAdj "modelled A" ;
-#### hit går f.eks. fra NAMATCont málli+N+Sem/Ani_Hum:máll DER-EGAdj "modelled A" ;
Her fjerner jeg Use/-Spell, den hindrer fornuftige dynamiske ord. Begrensninga bør skje tidligere i stien.
- +Der2+Der/t+A+CmpN/SgN+CmpN/PlG+Use/-Spell:eg AGAdjINFL ; (2-mállet)
++Der2+Der/t+A+CmpN/SgN+CmpN/PlG+Use/-Spell:eg AGAdjINFL ; (2-mállet)
+#### Hvorfor disse?
-#### Hvorfor disse?
dávvirvuorkásuorgi+N+Use/-Spell+Sem/Plc-abstr:dávvir#vuorká#suorºgi GOAHTI-I ;
gákcilotlohku+v1+N+Use/-Spell+Sem/Dummytag:gákci#lot#lohºku LOTLOHKU ;
gákcilotlohku+v2+N+Use/-Spell+Sem/Dummytag:gákci#loh9#lohºku LOTLOHKU ;
@@ -141,62 +143,61 @@ sábbát+v1+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Sem/Time:sabºbah GAHPIRLONG ; ! NT
sme-acronyms.lexc:iežaskap+Use/-Spell+Sem/Dummytag:iežaskap9 UNIT ; (forkortelse for iežaskapitála, ikke i bruk i SIKOR)
-LEXICON ENGEL Restricted denominals for speller -eŋgel
+LEXICON ENGEL Restricted denominals for speller -eŋgel
+LEXICON BUORRE For this adj only
++Use/-Spell: VUOHTA ; ! ... to A > N -vuohta derivation
++Use/-Spell:»X7 NAMAT ; ! comp-only adj. Here since buorre has no Attr, not compound.
++Use/-Spell:»X7# NAMATLAGANLAGASCont ;
-LEXICON BUORRE For this adj only
-+Use/-Spell: VUOHTA ; ! ... to A > N -vuohta derivation
-+Use/-Spell:»X7 NAMAT ; ! comp-only adj. Here since buorre has no Attr, not compound.
-+Use/-Spell:»X7# NAMATLAGANLAGASCont ;
+LEXICON RIEKTA Bisyll adj w/o obl sg forms, WeG Attr
++Use/-Spell:a VUOHTA ; ! ... to A > N -vuohta derivation
-LEXICON RIEKTA Bisyll adj w/o obl sg forms, WeG Attr
-+Use/-Spell:a VUOHTA ; ! ... to A > N -vuohta derivation
+LEXICON MEAHTTUS meahttun-adj. with comp. and superl. forms -seabbo, -seamos etc.
++Pl+Nom+Use/-Spell:m%>at VUOHTA ;
-LEXICON MEAHTTUS meahttun-adj. with comp. and superl. forms -seabbo, -seamos etc.
- +Pl+Nom+Use/-Spell:m%>at VUOHTA ;
-LEXICON BEAKKAN Trisyll. Non-gradating C-Adj. without Separate Attr.
- +Pl+Nom+Use/-Spell:%>at VUOHTA
- +Pl+Nom+Use/-Spell:%>at VUOHTA ;
+LEXICON BEAKKAN Trisyll. Non-gradating C-Adj. without Separate Attr.
++Pl+Nom+Use/-Spell:%>at VUOHTA
-LEXICON GEARDAN Trisyll. Non-gradating C-Adj. without Separate Attr.
+Pl+Nom+Use/-Spell:%>at VUOHTA ;
-+Use/-Spell: VUOHTA ; ! VUOHTA, without j
+LEXICON GEARDAN Trisyll. Non-gradating C-Adj. without Separate Attr.
++Pl+Nom+Use/-Spell:%>at VUOHTA ;
++Use/-Spell: VUOHTA ; ! VUOHTA, without j
+Pl+Acc+Use/-Spell:%>Y5jd K ; !riiduid, ruvsuid
-LEXICON LAS from verbs: čirrolas, bealkálas etc
+LEXICON LAS from verbs: čirrolas, bealkálas etc
+Use/-Spell: VUOHTA ;
-LEXICON DenominalAdjsV1 caritives and their derivatives (huvva, huhtti), from bisyll nouns
- +Der1+Der2+Der/laakan+A+Use/-Spell:» LAGAN ; ! ! biilalágan, noaidelágán noaiddilágán beatnagalágán beanalágán all these goes Nielsen: beatnatlágán, beatnatlágáš, beanalágáš, giđalágáš, áhččelágáš,
- +Der1+Der2+Der/laagasj+A+Use/-Spell:» LAGAS ; ! ! etc.
-LEXICON DenominalAdjsV1Long caritives and their derivatives (huvva, huhtti), from bisyll nouns without vowel shortening
- +Der1+Der2+Der/laakan+A+Use/-Spell:» LAGAN ; ! ! biilalágan, noaidelágán noaiddilágán beatnagalágán beanalágán all these goes Nielsen: beatnatlágán, beatnatlágáš, beanalágáš, giđalágáš, áhččelágáš,
- +Der1+Der2+Der/laagasj+A+Use/-Spell:» LAGAS ; ! ! etc.
-LEXICON DenominalAdjsV1Short caritives and their derivatives (huvva, huhtti), from bisyll nouns with vowel shortening
- +Der1+Der2+Der/laakan+A+Use/-Spell:» LAGAN ; ! ! biilalágan, noaidelágán noaiddilágán beatnagalágán beanalágán all these goes Nielsen: beatnatlágán, beatnatlágáš, beanalágáš, giđalágáš, áhččelágáš,
- +Der1+Der2+Der/laagasj+A+Use/-Spell:» LAGAS ; ! ! etc.
+LEXICON DenominalAdjsV1 caritives and their derivatives (huvva, huhtti), from bisyll nouns
++Der1+Der2+Der/laakan+A+Use/-Spell:» LAGAN ; ! ! biilalágan, noaidelágán noaiddilágán beatnagalágán beanalágán all these goes Nielsen: beatnatlágán, beatnatlágáš, beanalágáš, giđalágáš, áhččelágáš,
++Der1+Der2+Der/laagasj+A+Use/-Spell:» LAGAS ; ! ! etc.
+LEXICON DenominalAdjsV1Long caritives and their derivatives (huvva, huhtti), from bisyll nouns without vowel shortening
++Der1+Der2+Der/laakan+A+Use/-Spell:» LAGAN ; ! ! biilalágan, noaidelágán noaiddilágán beatnagalágán beanalágán all these goes Nielsen: beatnatlágán, beatnatlágáš, beanalágáš, giđalágáš, áhččelágáš,
++Der1+Der2+Der/laagasj+A+Use/-Spell:» LAGAS ; ! ! etc.
+LEXICON DenominalAdjsV1Short caritives and their derivatives (huvva, huhtti), from bisyll nouns with vowel shortening
++Der1+Der2+Der/laakan+A+Use/-Spell:» LAGAN ; ! ! biilalágan, noaidelágán noaiddilágán beatnagalágán beanalágán all these goes Nielsen: beatnatlágán, beatnatlágáš, beanalágáš, giđalágáš, áhččelágáš,
++Der1+Der2+Der/laagasj+A+Use/-Spell:» LAGAS ; ! ! etc.
-LEXICON DenominalAdjsV2_lasj from bisyllables, muoralaš, gieđalaš etc
- +Sg+Nom+PxDu2+Use/-Spell:»X6lažža%>X2t RPXADD_FLAG ; ! ! tentative.
+LEXICON DenominalAdjsV2_lasj from bisyllables, muoralaš, gieđalaš etc
++Sg+Nom+PxDu2+Use/-Spell:»X6lažža%>X2t RPXADD_FLAG ; ! ! tentative.
@R.Px.add@ K ;
-LEXICON acrooblique
+LEXICON acrooblique
- +Der2+Der/ár+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Use/-Spell:»ár GAHPIRLONG ; !
++Der2+Der/ár+N+CmpN/SgN+CmpN/SgG+CmpN/PlG+Use/-Spell:»ár GAHPIRLONG ; !
#### Propernouns
Alle leksikon har denne: (men fra RProp kreves hyphen, hvis ikke Err/Orth, så Use/-Spell her er unødvendig, jeg kommenterer den ut)
+Cmp/SgNom+Use/-Spell:%> RProp ;
@@ -204,4 +205,3 @@ LEXICON SULLOT-plc
+N+Prop+Sem/Plc+Sg+Gen+Use/-Spell:%>Y5 VUONAT ;
LEXICON ADJAGAT-plc ! Place names
+N+Prop+Sem/Plc+Sg+Nom+Use/-Spell:X4 VUONAT ;
diff --git a/ling/CorpusConvertingManipulation.md b/ling/CorpusConvertingManipulation.md
index b9993396..272806b8 100644
--- a/ling/CorpusConvertingManipulation.md
+++ b/ling/CorpusConvertingManipulation.md
@@ -1,18 +1,9 @@
# How to get a better conversion
Conversion of:
-* [doc-documents](DocConvertingManipulation.html)
-* [html-documents](HtmlConvertingManipulation.html)
-* [pdf-documents](PdfConvertingManipulation.html)
-* [rtf-documents](RtfConvertingManipulation.html)
-* [txt-documents](TxtConvertingManipulation.html)
+- [doc-documents](DocConvertingManipulation.html)
+- [html-documents](HtmlConvertingManipulation.html)
+- [pdf-documents](PdfConvertingManipulation.html)
+- [rtf-documents](RtfConvertingManipulation.html)
+- [txt-documents](TxtConvertingManipulation.html)
diff --git a/ling/CorpusTools.md b/ling/CorpusTools.md
index 6d090bf2..2bfb2244 100644
--- a/ling/CorpusTools.md
+++ b/ling/CorpusTools.md
@@ -1,110 +1,94 @@
-Corpus Tools
+# Corpus Tools
Corpus Tools contains tools to manipulate a corpus in different ways.
These scripts will be installed
+## Howto install and update the tools
-# Howto install and update the tools
-## First time install
-* [Install requirements](#Requirements).
-* [Install CorpusTools](#To-own-home-directory-(recommended))
+### First time install
-## Update
-* [Howto update CorpusTools](#To-own-home-directory-(recommended))
+- [Install requirements](#Requirements).
+- [Install CorpusTools](<#To-own-home-directory-(recommended)>)
+### Update
+- [Howto update CorpusTools](<#To-own-home-directory-(recommended)>)
+## Use the content of the corpus
-# Use the content of the corpus
-* [convert2xml: Convert original files to giellatekno xml](CorpusTools.html#convert2xml)
-* [ccat: Print the contents of a converted corpus file as plain text](CorpusTools.html#ccat)
-* [analyse_corpus: Do syntactic analysis of converted files](CorpusTools.html#analyse_corpus)
-* [parallelize: Sentence align file pairs](CorpusTools.html#parallelize)
-* [reparallelize: Reconvert and realign a given .tmx.html file](CorpusTools.html#reparallelize)
-* [tmx2html: Convert tmx files to html files](CorpusTools.html#tmx2html)
+- [convert2xml: Convert original files to giellatekno xml](CorpusTools.html#convert2xml)
+- [ccat: Print the contents of a converted corpus file as plain text](CorpusTools.html#ccat)
+- [analyse_corpus: Do syntactic analysis of converted files](CorpusTools.html#analyse_corpus)
+- [parallelize: Sentence align file pairs](CorpusTools.html#parallelize)
+- [reparallelize: Reconvert and realign a given .tmx.html file](CorpusTools.html#reparallelize)
+- [tmx2html: Convert tmx files to html files](CorpusTools.html#tmx2html)
+## Add files to the corpus
-# Add files to the corpus
-* [add_files_to_corpus: Add file(s) to a corpus directory](CorpusTools.html#add_files_to_corpus)
-* [saami_crawler: Crawl saami sites, add files to corpus](CorpusTools.html#saami_crawler)
+- [add_files_to_corpus: Add file(s) to a corpus directory](CorpusTools.html#add_files_to_corpus)
+- [saami_crawler: Crawl saami sites, add files to corpus](CorpusTools.html#saami_crawler)
+## Manage the corpus repositories
-# Manage the corpus repositories
-* [move_corpus_file: Move or rename a file inside the corpus](CorpusTools.html#move_corpus_file)
-* [remove_corpus_file: Remove a file from the corpus](CorpusTools.html#remove_corpus_file)
-* [normalise_corpus_names: Program to normalise file names](CorpusTools.html#normalise_corpus_names)
-* [paracheck: Check if the parallel files found in the metadata files exist](CorpusTools.html#paracheck)
-* [duperemover: Remove duplicate files from the given directory](CorpusTools.html#duperemover)
-* [dupefinder: Find files with more than 90% similarity in the given directory](CorpusTools.html#dupefinder)
-* [clean_prestable:Remove files in prestable that have no original files](CorpusTools.html#clean_prestable)
-* [pick_parallel_docs: Pick out parallel files from converted to prestable/converted](CorpusTools.html#pick_parallel_docs)
-* [update_metadata: Update metadata files in given directories](CorpusTools.html#update_metadata)
+- [move_corpus_file: Move or rename a file inside the corpus](CorpusTools.html#move_corpus_file)
+- [remove_corpus_file: Remove a file from the corpus](CorpusTools.html#remove_corpus_file)
+- [normalise_corpus_names: Program to normalise file names](CorpusTools.html#normalise_corpus_names)
+- [paracheck: Check if the parallel files found in the metadata files exist](CorpusTools.html#paracheck)
+- [duperemover: Remove duplicate files from the given directory](CorpusTools.html#duperemover)
+- [dupefinder: Find files with more than 90% similarity in the given directory](CorpusTools.html#dupefinder)
+- [clean_prestable: Remove files in prestable that have no original files](CorpusTools.html#clean_prestable)
+- [pick_parallel_docs: Pick out parallel files from converted to prestable/converted](CorpusTools.html#pick_parallel_docs)
+- [update_metadata: Update metadata files in given directories](CorpusTools.html#update_metadata)
+## Miscellaneous
-# Miscellaneous
-* [pytextcat: textcat implemented in Python](CorpusTools.html#pytextcat)
-* [generate_anchor_list: Generate paired anchor list for languages lang1 and lang2](CorpusTools.html#generate_anchor_list)
-* [html_cleaner: Program to print out a nicely indented html document](CorpusTools.html#html_cleaner)
-* [epubchooser: Program to set metadata of an epub file](CorpusTools.html#epubchooser)
-* [make_training_corpus: Program to make training corpus from giella xml analysed files](CorpusTools.html#make_training_corpus)
+- [pytextcat: textcat implemented in Python](CorpusTools.html#pytextcat)
+- [generate_anchor_list: Generate paired anchor list for languages lang1 and lang2](CorpusTools.html#generate_anchor_list)
+- [html_cleaner: Program to print out a nicely indented html document](CorpusTools.html#html_cleaner)
+- [epubchooser: Program to set metadata of an epub file](CorpusTools.html#epubchooser)
+- [make_training_corpus: Program to make training corpus from giella xml analysed files](CorpusTools.html#make_training_corpus)
+## Requirements
-# Requirements
-* python3
-* pip for python3
-* pysvn (only needed for add_files_to_corpus)
-* wvHtml (only needed for convert2xml)
-* pdftohtml (only needed for convert2xml)
-* latex2html (only needed for convert2xml)
-* Java (only needed for parallelize)
-* Perl (only needed for parallelize)
+- python3
+- pip for python3
+- pysvn (only needed for add_files_to_corpus)
+- wvHtml (only needed for convert2xml)
+- pdftohtml (only needed for convert2xml)
+- latex2html (only needed for convert2xml)
+- Java (only needed for parallelize)
+- Perl (only needed for parallelize)
On Mac, do:
sudo port install py-pysvn py-pip wv latex2html poppler
On Debian/Ubuntu, do:
sudo apt-get install python3-svn python3-pip wv latex2html poppler-utils
On Arch Linux, do:
sudo pacman -S python3-pip wv
yaourt -S python3-pysvn
You also need to have the $GTLANGS variable set to where you checked
-out *https://github.com/giellalt/CorpusTools/* (see the *Getting Started* documentation).
-## Custom version of pdftohtml (poppler)
+out *https://github.com/giellalt/CorpusTools/* (see the _Getting Started_ documentation).
+### Custom version of pdftohtml (poppler)
The standard version of pdftohtml sometimes produces invalid xml-documents.
A version that fixes this bug is found at https://github.com/albbas/poppler
and the poppler developers have been notified about the bug.
To install it do the following
git clone https://github.com/albbas/poppler
cd poppler
@@ -114,120 +98,96 @@ make
sudo make install
+## Installation
-# Installation
-## To own home directory (recommended)
+### To own home directory (recommended)
Install the tools for the current user by writing
cd $GTLANGS/CorpusTools
python3 setup.py install --user --install-scripts=$HOME/bin --record installed_files.txt
+### System wide (recommended for servers only)
-## System wide (recommended for servers only)
Install the tools for all users on a machine by
cd $GTLANGS/CorpusTools
sudo python3 setup.py install --install-scripts=/usr/local/bin --record installed_files.txt
+## Uninstalling
+### Remove from own home directory
-# Uninstalling
-## Remove from own home directory
cd $GTLANGS/CorpusTools
cat installed_files.txt | xargs rm -rf
-## System wide
+### System wide
cd $GTLANGS/CorpusTools
cat installed_files.txt | xargs sudo rm -rf
-# ccat
+## ccat
Convert corpus format xml to clean text.
ccat has three usage modes. It prints to stdout the content of:
-* converted files (produced by [convert2xml](CorpusTools.html#convert2xml))
-* converted files containing errormarkup (produced by [convert2xml](CorpusTools.html#convert2xml))
-* analysed files (produced by [analyse_corpus](CorpusTools.html#analyse_corpus))
+- converted files (produced by [convert2xml](CorpusTools.html#convert2xml))
+- converted files containing errormarkup (produced by [convert2xml](CorpusTools.html#convert2xml))
+- analysed files (produced by [analyse_corpus](CorpusTools.html#analyse_corpus))
-## Printing content of converted files to stdout
+### Printing content of converted files to stdout
To print out all sme content of all the converted files found in
$GTFREE/converted/sme/admin and its subdirectories, issue the command:
ccat -a -l sme $GTFREE/converted/sme/admin
It is also possible to print a file at a time:
ccat -a -l sme $GTFREE/converted/sme/admin/sd/other_files/vl_05_1.doc.xml
To print out the content of e.g. all converted pdf files found in a directory
and its subdirectories, issue this command:
find converted/sme/science/ -name "*.pdf.xml" | xargs ccat -a -l sme
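The `find ... | xargs ccat` pattern above can also be sketched in Python. This is only an illustration: the helper name `build_ccat_command` is hypothetical and not part of CorpusTools. It collects the matching files recursively and builds the same argument list, using only the `-a` and `-l` options shown above.

```python
from pathlib import Path


def build_ccat_command(root, suffix=".pdf.xml", lang="sme"):
    """Build a ccat argument list for all files under root ending in suffix.

    Hypothetical helper: mirrors `find root -name "*SUFFIX" | xargs ccat -a -l LANG`.
    """
    # rglob walks the directory and all of its subdirectories.
    files = sorted(str(p) for p in Path(root).rglob(f"*{suffix}") if p.is_file())
    return ["ccat", "-a", "-l", lang, *files]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once ccat is installed. Building the list separately makes it easy to inspect which files were picked up before running anything.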
-## Printing content of analysed files to stdout
+### Printing content of analysed files to stdout
The analysed files produced by
[analyse_corpus](CorpusTools.html#analyse_corpus) contain, among other things,
one dependency element and one disambiguation element, which hold the
dependency and disambiguation analyses of the original file's content.
ccat -dis sda/sda_2006_1_aikio1.pdf.xml
Prints the content of the disambiguation element.
ccat -dep sda/sda_2006_1_aikio1.pdf.xml
Prints the content of the dependency element.
The usage pattern for printing these elements is otherwise the same as
printing the content of converted files.
Printing dependency elements
@@ -236,7 +196,6 @@ ccat -dep $GTFREE/analysed/sme/admin/sd/other_files/vl_05_1.doc.xml
find analysed/sme/science/ -name "*.pdf.xml" | xargs ccat -dep
Printing disambiguation elements
@@ -245,16 +204,12 @@ ccat -dis $GTFREE/analysed/sme/admin/sd/other_files/vl_05_1.doc.xml
find analysed/sme/science/ -name "*.pdf.xml" | xargs ccat -dis
-## Printing errormarkup content
+### Printing errormarkup content
This usage mode is used in the speller tests. Examples of this usage pattern
are found in the make files in $GTBIG/prooftools.
-## The complete help text from the program:
+### The complete help text from the program:
usage: ccat [-h] [--version] [-l LANG] [-T] [-L] [-t] [-a] [-c] [-C] [-ort]
@@ -310,91 +265,66 @@ optional arguments:
Replace hyph tags with the given argument
**Source code**
-* [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
-# convert2xml
+- [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
+## convert2xml
Convert original files in a corpus to giellatekno/divvun xml format.
convert2xml depends on these external programs:
-* pdftotext
-* wvHtml
+- pdftotext
+- wvHtml
as well as various files from the Divvun/Giellatekno SVN, at least the
following files/directories need to exist under $GTHOME:
-* gt/dtd
+- gt/dtd
Convert all files in the directory $GTFREE/orig/sme and its subdirectories:
convert2xml $GTFREE/orig/sme
The converted files are placed in $GTFREE/converted/sme
with the same directory structure as that in $GTFREE/orig/sme.
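The mapping from an original file to its converted counterpart can be sketched as follows. This is an illustration under stated assumptions, not the code in `corpuspath.py`: the function name `converted_path` is hypothetical, and the rule it applies (swap the `orig` directory for `converted`, append `.xml` to the file name) is inferred from the paths shown in this document, such as `vl_05_1.doc.xml`.

```python
from pathlib import PurePosixPath


def converted_path(orig_path):
    """Return the assumed location of the converted version of orig_path."""
    parts = list(PurePosixPath(orig_path).parts)
    if "orig" not in parts:
        raise ValueError(f"not inside an orig directory: {orig_path}")
    # Keep the directory structure, only swap the top-level corpus directory.
    parts[parts.index("orig")] = "converted"
    # Converted files keep the original name and gain an .xml suffix.
    parts[-1] += ".xml"
    return str(PurePosixPath(*parts))
```

For example, `converted_path("orig/sme/admin/sd/file1.html")` yields a path under `converted/sme/admin/sd`.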
Convert only one file:
convert2xml $GTFREE/orig/sme/admin/sd/file1.html
The converted file is found in $GTFREE/converted/sme/admin/sd/file1.html.xml
Convert all sme files in directories ending with corpus
convert2xml *corpus/orig/sme
If convert2xml is not able to convert a file these kinds of message will appear:
A log file will be found in
explaining what went wrong.
The complete help text from the program:
usage: convert2xml [-h] [--version] [--serial] [--lazy-conversion]
[--write-intermediate] [--goldstandard]
@@ -422,79 +352,62 @@ optional arguments:
--goldstandard Convert goldstandard and .correct files
**Source code**
-* [converter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/converter.py)
-* [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
-* [corpuspath.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/corpuspath.py)
-* [decode.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/decode.py)
-* [errormarkup.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/errormarkup.py)
-* [text_cat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/text_cat.py)
-* [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# analyse_corpus
+- [converter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/converter.py)
+- [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
+- [corpuspath.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/corpuspath.py)
+- [decode.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/decode.py)
+- [errormarkup.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/errormarkup.py)
+- [text_cat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/text_cat.py)
+- [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## analyse_corpus
Analyse converted corpus files.
analyse_corpus depends on these external programs:
-* preprocess (found in the Divvun/Giellatekno svn)
-* lookup2cg (found in the Divvun/Giellatekno svn)
-* lookup (from xfst)
-* vislcg3
-* hfst
+- preprocess (found in the Divvun/Giellatekno svn)
+- lookup2cg (found in the Divvun/Giellatekno svn)
+- lookup (from xfst)
+- vislcg3
+- hfst
To be able to use this program you must either use the
[nightly giella packages](https://giellalt.uit.no/infra/compiling_HFST3.html#The+simple+installation+%28you+download+ready-made+programs%29)
or build the needed resources for the supported
languages (exchange "sma" with "sme, smj" ad lib):
`cd $GTHOME/langs/sma`
Configure the language, using at least these two options: `--prefix=$HOME/.local --with-hfst --enable-tokenisers`
-./configure --prefix=$HOME/.local --with-hfst --enable-tokenisers # add your own flags to taste
+./configure --prefix=$HOME/.local --with-hfst --enable-tokenisers ## add your own flags to taste
make install
Then you must convert the corpus files as explained in the [convert2xml](CorpusTools.html#convert2xml) section.
When this is done you can analyse all files in the directory $GTFREE/converted/sme (and sma, smj) and its subdirectories by issuing this command:
analyse_corpus -k hfst sme $GTFREE/converted/sme
-The analysed file will be found in
+The analysed file will be found in
To analyse only one file, issue this command:
analyse_corpus -k hfst --serial sme $GTFREE/converted/sme/file.html.xml
The complete help text from the program:
@@ -522,21 +435,17 @@ optional arguments:
**Source code**
-* [analyser.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/analyser.py)
-* [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
-* [parallelize.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/parallelize.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# add_files_to_corpus
+- [analyser.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/analyser.py)
+- [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
+- [parallelize.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/parallelize.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## add_files_to_corpus
The complete help text from the program is as follows:
usage: add_files_to_corpus [-h] [-v] [-p PARALLEL_FILE] [-l LANG]
@@ -571,77 +480,54 @@ no_parallel:
The directory where the origs should be placed
Download and add parallel files from the net to the corpus:
**Adding the first file**
The command
-```add_files_to_corpus -d orig/sme/admin/sd/other_files http://www.samediggi.no/content/download/5407/50892/version/2/file/Sametingets+%C3%A5rsmelding+2013+-+nordsamisk.pdf```
+`add_files_to_corpus -d orig/sme/admin/sd/other_files http://www.samediggi.no/content/download/5407/50892/version/2/file/Sametingets+%C3%A5rsmelding+2013+-+nordsamisk.pdf`
Gives the message:
-```Added orig/sme/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_nordsamisk.pdf```
+`Added orig/sme/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_nordsamisk.pdf`
**Adding the parallel file**
-```add_files_to_corpus -p orig/sme/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_nordsamisk.pdf -l nob http://www.samediggi.no/content/download/5406/50888/version/2/file/Sametingets+%C3%A5rsmelding+2013+-+norsk.pdf```
+`add_files_to_corpus -p orig/sme/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_nordsamisk.pdf -l nob http://www.samediggi.no/content/download/5406/50888/version/2/file/Sametingets+%C3%A5rsmelding+2013+-+norsk.pdf`
Gives the message:
-```Added orig/nob/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_norsk.pdf```
+`Added orig/nob/admin/sd/other_files/sametingets_ay-rsmelding_2013_-_norsk.pdf`
After this is done, you will have to commit the files to
the working copy, like this:
svn ci orig
**Source code**
-* [adder.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/adder.py)
-* [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-* [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
-# parallelize
+- [adder.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/adder.py)
+- [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+- [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
+## parallelize
Parallelize parallel corpus files, write the results to
.tmx and .tmx.html files.
NB! When debugging alignment, use [reparallelize](CorpusTools.html#reparallelize); it
reconverts all files and realigns them anew.
parallelize depends on various files from the Divvun/Giellatekno SVN,
at least the following directories need to exist in $GTHOME:
-* langs (specifically, the abbr.txt files)
-* gt/common
-* gt/script
+- langs (specifically, the abbr.txt files)
+- gt/common
+- gt/script
It also requires Java if you wish to use the default (included)
alignment program TCA2. For convenience, a pre-compiled version of
@@ -649,13 +535,11 @@ TCA2's alignment.jar-file is included in SVN and installed by
CorpusTools, but if you have ant installed, you can recompile it by
simply typing "ant" in corpustools/tca2.
Alternatively, you can align with Hunalign, if you have that installed
(or don't have Java). Hunalign is faster, and the quality is less
dependent on predefined dictionaries (though it can use those as
well). Neither system gives perfect alignments.
By default, it uses the $GTHOME/gt/common/src/anchor.txt file as an
anchor dictionary for alignment. If your language pair is not in this
dictionary, you can provide your own with the --dict argument. If you
@@ -663,13 +547,10 @@ do not have a dictionary, you can use "--dict=<(echo)" to provide an
"empty" dictionary – in this case, you should also use
**Compile dependencies**
XXX is one of the languages in $GTHOME/langs.
cd $GTHOME/langs/XXX
./configure --prefix="$HOME"/.local \
@@ -682,12 +563,10 @@ XXX is one of the languages in $GTHOME/langs.
make install
To prepare for parallelising e.g. nob and sme files, do the following:
-for LANG in sme nob # Replace sme and nob by languages for your own needs
+for LANG in sme nob ## Replace sme and nob by languages for your own needs
cd $GTHOME/langs/$LANG
./configure --prefix="$HOME"/.local \
@@ -701,8 +580,8 @@ do
The complete help text from the program is as follows:
usage: parallelize [-h] [--version] [-s] [-f] [-q] [-a {hunalign,tca2}]
[-d DICT] -l2 LANG2
@@ -739,39 +618,33 @@ optional arguments:
parallelised with
You run the program on the files created by convert2xml by running a command with the following syntax:
for instance, with nob as SOURCE_LANGUAGE and sma as TARGET_LANGUAGE
parallelize -l2 sma converted/nob/admin/ntfk/tsaekeme.html.xml
This will create a file named
If you want to parallelize all your sma files with nob in one go, you
can do e.g.
convert2xml orig/{sma,nob}
parallelize -l2 sma converted/nob
The files will end up in corresponding directories under
-* ****CAVEAT 1****: *If you get a message such as*
+- **CAVEAT 1**: _If you get a message such as_
parallelize -l2 sma converted/sma/admin/ntfk/tsaekeme.html.xml
@@ -779,29 +652,24 @@ Error reading file '/Users/xxx/freecorpus/converted/sma/admin/ntfk/.xml':
failed to load external entity "/Users/xxx/freecorpus/converted/sma/admin/ntfk/.xml"
then you gave nob as l1 but passed the path to an sma file as the argument.
+- **CAVEAT 2**: _If you get an error message similar to_
-* ****CAVEAT 2****: *If you get a similar error message as*
parallelize -l2 sma converted/nob/admin/ntfk/rup_2013_trykt_versjon.pdf.xml
ERROR: /Users/xxx/gtsvn/langs/nob/tools/preprocess/tokeniser-gramcheck-gt-desc.pmhfst does not exist
you have to recompile the language tools for the language in question (nob in the example above)
with a different configuration; see the instructions above on how to
[compile dependencies](CorpusTools.html#Compile+dependencies).
Afterwards, go back to the directory where you are parallelizing files and
run parallelize again. You may need to recompile the language tools for all the languages
you are working with.
-* ****CAVEAT 3****: *If you get a message like*
+- **CAVEAT 3**: _If you get a message like_
Exception in thread "main" java.lang.UnsupportedClassVersionError: aksis/alignment/Alignment : Unsupported major.minor version 51.0
@@ -819,52 +687,37 @@ Exception in thread "main" java.lang.UnsupportedClassVersionError: aksis/alignme
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
then you need to recompile the Java parts and reinstall CorpusTools.
Make sure you have Apache ant installed, then do:
cd $GTHOME/tools/CorpusTools/corpustools/tca2
Then follow the instructions on [how to install CorpusTools](CorpusTools.html#Installation)
**Source code**
-* [parallelize.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/parallelize.py)
-* [generate_anchor_list.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/generate_anchor_list.py)
-* [typosfile.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/typosfile.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# saami_crawler
+- [parallelize.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/parallelize.py)
+- [generate_anchor_list.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/generate_anchor_list.py)
+- [typosfile.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/typosfile.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## saami_crawler
Add files to freecorpus from a given site.
The crawler can currently only crawl www.samediggi.fi, and it collects html files only.
Run it like this:
saami_crawler www.samediggi.fi
The complete help text from the program is as follows:
usage: saami_crawler [-h] [-v] sites [sites ...]
@@ -881,38 +734,29 @@ optional arguments:
-v, --version show program's version number and exit
**Source code**
-* [saami_crawler.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/saami_crawler.py)
-* [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
-* [text_cat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/text_cat.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-* [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
-# pytextcat
+- [saami_crawler.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/saami_crawler.py)
+- [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
+- [text_cat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/text_cat.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+- [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
+## pytextcat
Pytextcat is an implementation of the "N-Gram-Based Text Categorization" algorithm.
Original article:
-Cavnar, W. B. and J. M. Trenkle,
-*N-Gram-Based Text Categorization*
+Cavnar, W. B. and J. M. Trenkle,
+_N-Gram-Based Text Categorization_
In Proceedings of Third Annual Symposium on
Document Analysis and Information Retrieval, Las Vegas, NV, UNLV
Publications/Reprographics, pp. 161-175, 11-13 April 1994.
Original Perl implementation and article available from
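The technique from the cited article can be sketched in a few lines. This is a minimal illustration of the Cavnar and Trenkle approach (rank character n-grams by frequency, then compare rank lists with the "out-of-place" distance). It is not the code in `text_cat.py`, and the function names, the 300-entry profile size and the underscore padding are choices made for this sketch.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top=300):
    """Map the most frequent character n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically to keep ranks stable.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def classify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

In practice the language profiles are built once from large training texts and reused; the tiny profiles used to try out this sketch are only good enough to show the mechanics.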
usage: pytextcat [-h] [--version] [-V] {proc,complm,compwm,compdir} ...
@@ -935,22 +779,11 @@ optional arguments:
-V, --verbose Print some info to stderr
**Source code**
-* [text_cat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/text_cat.py)
-# generate_anchor_list
+- [text_cat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/text_cat.py)
+## generate_anchor_list
usage: generate_anchor_list.py [-h] [-v] [--lang1 LANG1] [--lang2 LANG2]
@@ -975,19 +808,15 @@ optional arguments:
--outdir OUTDIR The output directory
**Source code**
-* [generate_anchor_list.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/generate_anchor_list.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# normalise_corpus_names
+- [generate_anchor_list.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/generate_anchor_list.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## normalise_corpus_names
Normalise the filenames of the files found in the given directories.
usage: normalise_corpus_names [-h] [--version] target_dirs [target_dirs ...]
@@ -1006,15 +835,12 @@ optional arguments:
--version show program's version number and exit
**Source code**
-* [normalise_filenames](https://github.com/giellalt/CorpusTools/blob/main/corpustools/normalise_filenames.py)
-* [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
-# move_corpus_file
+- [normalise_filenames](https://github.com/giellalt/CorpusTools/blob/main/corpustools/normalise_filenames.py)
+- [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
+## move_corpus_file
usage: move_corpus_file [-h] [-v] oldpath newpath
@@ -1034,15 +860,12 @@ optional arguments:
-v, --version show program's version number and exit
**Source code**
-* [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
-* [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
-# paracheck
+- [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
+- [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
+## paracheck
usage: paracheck [-h] [-v] orig_dir
@@ -1061,17 +884,14 @@ optional arguments:
-v, --version show program's version number and exit
**Source code**
-* [check_para_consistency.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/check_para_consistency.py)
-* [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-* [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
-# html_cleaner
+- [check_para_consistency.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/check_para_consistency.py)
+- [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+- [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
+## html_cleaner
usage: html_cleaner [-h] [-v] inhtml outhtml
@@ -1092,16 +912,13 @@ optional arguments:
-v, --version show program's version number and exit
**Source code**
-* [html_cleaner.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/html_cleaner.py)
-* [converter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/converter.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# duperemover
+- [html_cleaner.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/html_cleaner.py)
+- [converter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/converter.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## duperemover
usage: duperemover [-h] [-v] dir
@@ -1119,19 +936,14 @@ optional arguments:
-v, --version show program's version number and exit
**Source code**
-* [dupe_finder.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/dupe_finder.py)
-* [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
-* [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# dupefinder
+- [dupe_finder.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/dupe_finder.py)
+- [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
+- [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## dupefinder
usage: dupefinder [-h] [-v] dir
@@ -1149,17 +961,14 @@ optional arguments:
-v, --version show program's version number and exit
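The tool index describes dupefinder as finding files with more than 90% similarity. How such a threshold can be applied is sketched below with the standard library's `difflib`. This is an assumption-laden illustration: the function name `similar_pairs` is hypothetical, and the real `dupe_finder.py` may compute similarity differently.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(texts, threshold=0.9):
    """Return (name_a, name_b, ratio) for every pair of texts above the threshold.

    texts maps a file name to that file's text content.
    """
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        # ratio() is 1.0 for identical texts and 0.0 for texts with nothing in common.
        ratio = SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()
        if ratio > threshold:
            pairs.append((name_a, name_b, ratio))
    return pairs
```

Comparing every pair is quadratic in the number of files, so a sketch like this only suits small directories.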
**Source code**
-* [dupe_finder.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/dupe_finder.py)
-* [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
-* [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# move_corpus_file
+- [dupe_finder.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/dupe_finder.py)
+- [ccat.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/ccat.py)
+- [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## move_corpus_file
usage: move_corpus_file [-h] [-v] oldpath newpath
@@ -1179,15 +988,12 @@ optional arguments:
-v, --version show program's version number and exit
**Source code**
-* [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
-* [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
-# remove_corpus_file
+- [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
+- [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
+## remove_corpus_file
usage: remove_corpus_file [-h] [-v] oldpath
@@ -1205,15 +1011,12 @@ optional arguments:
-v, --version show program's version number and exit
**Source code**
-* [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
-* [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
-# pick_parallel_docs
+- [move_files.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/move_files.py)
+- [namechanger.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/namechanger.py)
+## pick_parallel_docs
usage: pick_parallel_docs [-h] [-v] -p PARALLEL_LANGUAGE --minratio MINRATIO
@@ -1238,16 +1041,13 @@ optional arguments:
--maxratio MAXRATIO The maximum ratio
**Source code**
-* [pick_parallel_docs.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/pick_parallel_docs.py)
-* [parallelize.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/parallelize.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# clean_prestable
+- [pick_parallel_docs.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/pick_parallel_docs.py)
+- [parallelize.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/parallelize.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## clean_prestable
usage: clean_prestable [-h] [--version] corpusdirs [corpusdirs ...]
@@ -1265,17 +1065,14 @@ optional arguments:
--version show program's version number and exit
**Source code**
-* [clean_prestable.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/clean_prestable.py)
-* [corpuspath.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/corpuspath.py)
-* [versioncontrol.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/versioncontrol.py)
-* [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
-# reparallelize
+- [clean_prestable.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/clean_prestable.py)
+- [corpuspath.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/corpuspath.py)
+- [versioncontrol.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/versioncontrol.py)
+- [util.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/util.py)
+## reparallelize
usage: reparallelize [-h] [--version] [--files] [--convert] tmxhtml
@@ -1300,18 +1097,15 @@ optional arguments:
converted files.
**Source code**
-* [realign.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/realign.py)
-* [corpuspath.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/corpuspath.py)
-* [converter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/converter.py)
-* [parallelize.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/parallelize.py)
-* [tmx.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/tmx.py)
-# tmx2html
+- [realign.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/realign.py)
+- [corpuspath.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/corpuspath.py)
+- [converter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/converter.py)
+- [parallelize.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/parallelize.py)
+- [tmx.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/tmx.py)
+## tmx2html
usage: tmx2html [-h] [--version] sources [sources ...]
@@ -1329,13 +1123,11 @@ optional arguments:
--version show program's version number and exit
**Source code**
-* [tmx.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/tmx.py)
+- [tmx.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/tmx.py)
-# update_metadata
+## update_metadata
usage: update_metadata [-h] [--version] directories [directories ...]
@@ -1356,15 +1148,12 @@ optional arguments:
--version show program's version number and exit
**Source code**
-* [update_metadata.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/update_metadata.py)
-* [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
-# epubchooser
+- [update_metadata.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/update_metadata.py)
+- [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
+## epubchooser
usage: epubchooser [-h] [--version] epubfile
@@ -1382,16 +1171,13 @@ optional arguments:
--version show program's version number and exit
**Source code**
-* [epubchooser.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/epubchooser.py)
-* [epubconverter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/epubconverter.py)
-* [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
-# make_training_corpus
+- [epubchooser.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/epubchooser.py)
+- [epubconverter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/epubconverter.py)
+- [xslsetter.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/xslsetter.py)
+## make_training_corpus
Make training corpus from analysed giella xml files. Sentences with words
@@ -1407,13 +1193,6 @@ optional arguments:
--version show program's version number and exit
**Source code**
-* [trainingcorpusmaker.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/trainingcorpusmaker.py)
+- [trainingcorpusmaker.py](https://github.com/giellalt/CorpusTools/blob/main/corpustools/trainingcorpusmaker.py)
diff --git a/ling/DerivationalTagSystem.md b/ling/DerivationalTagSystem.md
index 7a1057c2..bf2d6077 100644
--- a/ling/DerivationalTagSystem.md
+++ b/ling/DerivationalTagSystem.md
@@ -1,36 +1,25 @@
+# Documenting the derivational tags & system
-Documenting the derivational tags & system
-# The most common verb nominalisations
+## The most common verb nominalisations
+Der/NomAct - Nomen Actionis, Actionnoun, Handlingsnomen
-+Der/NomAg - Nomen Agentis, Agentnoun, Handlernomen
++Der/NomAg - Nomen Agentis, Agentnoun, Handlernomen
+Actio - this is a verbal tag, i.e. when the word form is part of the verbal construction.
-# Position tags
+## Position tags
The numbers are merely names for the placement of the derivations. They describe not the position itself but the order, in a normative sense:
-* +Der1
-* +Der2
-* +Der3
-* +Der4
+- +Der1
+- +Der2
+- +Der3
+- +Der4
Each derivation tag is preceded by a position tag in the LexC code:
@@ -44,23 +33,14 @@ LEXICON BÅETEDH
+Der1+Der/ldahke:%»%^1UMLaldahk LDAHKE ;
+Der1+Der/ldh+N+SgNomCmp:%»%^1UMLeldh R ;
+Der1+Der/ldh+N+Sg+Nom:%»%^1UMLeldh FINAL1 ;
- +Der1+Der/ldh+N+Sg+Nom+Err/Sub:%»%^1UMLeld FINAL1 ;
+ +Der1+Der/ldh+N+Sg+Nom+Err/Sub:%»%^1UMLeld FINAL1 ;
-The LexC code allows a fairly free combination of different derivations, which will produce combinations that are ungrammatical. The position classification is an attempt to limit the overgeneration by stating that it is ungrammatical to use position 1 derivations *after* position 2 derivations.
+The LexC code allows a fairly free combination of different derivations, which will produce combinations that are ungrammatical. The position classification is an attempt to limit the overgeneration by stating that it is ungrammatical to use position 1 derivations _after_ position 2 derivations.
In summary: define the grammatical combinations in order to avoid the ungrammatical ones! But how productive must a derivational pattern be for it to count as a grammatical derivational combination?
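The ordering constraint on position tags (a position 1 derivation should not follow a position 2 derivation) can be illustrated with a small string-level check. This is only a sketch: it assumes that position tags appear literally as `+Der1` to `+Der4` in an analysis string and that "in order" means the position numbers never decrease from left to right. The function name `positions_in_order` and the sample strings in the usage note are hypothetical, and the real restriction would be implemented in the LexC source, not in Python.

```python
import re

# Matches position tags such as +Der1 ... +Der4, but not +Der/NomAct.
POSITION_TAG = re.compile(r"\+Der(\d+)")


def positions_in_order(analysis: str) -> bool:
    """Return True if the position tags never decrease from left to right."""
    positions = [int(n) for n in POSITION_TAG.findall(analysis)]
    return all(a <= b for a, b in zip(positions, positions[1:]))
```

For instance, a made-up string such as `+Der1+Der/ldh+Der2+Der/NomAg` passes the check, while `+Der2+Der/NomAg+Der1+Der/ldh` does not. Whether two derivations may share the same position is not stated here, so the sketch allows it.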
diff --git a/ling/DocConvertingManipulation.md b/ling/DocConvertingManipulation.md
index b844101c..731d9a76 100644
--- a/ling/DocConvertingManipulation.md
+++ b/ling/DocConvertingManipulation.md
@@ -1,23 +1,9 @@
-Converting .doc files
-# Skip pages
-# Skip part of pages
-# Skip lines
-# Skip words
+# Converting .doc files
+## Skip pages
+## Skip part of pages
+## Skip lines
+## Skip words
diff --git a/ling/HtmlConvertingManipulation.md b/ling/HtmlConvertingManipulation.md
index 6bd18789..c6d8804f 100644
--- a/ling/HtmlConvertingManipulation.md
+++ b/ling/HtmlConvertingManipulation.md
@@ -1,26 +1,20 @@
-Converting html pages
+# Converting html pages
## Skip pages
## Skip part of pages or lines
Select contains comma-separated pairs of XPath paths.
The two paths in a pair are separated by a semicolon.
In this:
-``` ```
Each path should start with .//body
Examples of valid pairs (from, inclusive, to):
.//body/div[5];.//body/div[8]/div[3]/h1[1], .//body/div[11]/div[2];.//body/div[11]/div[5]
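The pair syntax can be made concrete with a small parser. This is a minimal sketch, not the converter's own code: the function name `parse_skip_pairs` is hypothetical, and it assumes that no XPath expression in the string contains a comma or semicolon of its own (for example inside a predicate).

```python
from typing import List, Tuple


def parse_skip_pairs(select: str) -> List[Tuple[str, str]]:
    """Split a select string into (start, end) XPath pairs."""
    pairs = []
    for chunk in select.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Each pair is "start;end".
        start, sep, end = chunk.partition(";")
        if not sep:
            raise ValueError(f"missing ';' in pair: {chunk!r}")
        start, end = start.strip(), end.strip()
        for path in (start, end):
            if not path.startswith(".//body"):
                raise ValueError(f"path must start with .//body: {path!r}")
        pairs.append((start, end))
    return pairs
```

Applied to the example line above, the sketch yields two pairs, the first running from `.//body/div[5]` to `.//body/div[8]/div[3]/h1[1]`.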
@@ -28,10 +22,8 @@ Examples of valid pairs (fra og med - til):
### Comments:
.//body is the XSLT intro, which is the way to give
@@ -44,12 +36,8 @@ vi vil slette området frå første p under første div under body til andre p
## Skip words in
Change or remove problematic characters from the text.
Specify the elements to match (here all p's within
//body, that do contain text, but do NOT contain em and
@@ -61,7 +49,6 @@ of elements - then only one of them will apply. Also try
to restrict the template to nodes that do not contain
other markup, as such markup otherwise will be removed.
@@ -90,18 +77,16 @@ other markup, as such markup otherwise will be removed.
### Skip words in e.g. span
- <=
+ <=
- <=
+ <=
diff --git a/ling/ImprovementNotes2016.md b/ling/ImprovementNotes2016.md
index 5a84181d..31929a02 100644
--- a/ling/ImprovementNotes2016.md
+++ b/ling/ImprovementNotes2016.md
@@ -1,5 +1,4 @@
-Improvement notes 2016
+# Improvement notes 2016
This is a document without any real content. It is supposed to be filled as we
move forward.
diff --git a/ling/LanguageIndependentTagsInTheGiellaInfra.md b/ling/LanguageIndependentTagsInTheGiellaInfra.md
index 4872b938..117e06d4 100644
--- a/ling/LanguageIndependentTagsInTheGiellaInfra.md
+++ b/ling/LanguageIndependentTagsInTheGiellaInfra.md
@@ -1,145 +1,120 @@
-Language Independent Tags In The Giella Infra
+# Language Independent Tags In The Giella Infra
There are a number of classes of tags where the classes are language
independent, but the actual tags are language specific. Some examples of
such classes of tags are:
-* **Error tags**: tags describing parts of the language outside the established norm
-* **Dialect tags**: tags describing variation (in the written language) based on
- dialect
-* **Derivation tags**: tags describing derivational morphology
+- **Error tags**: tags describing parts of the language outside the established norm
+- **Dialect tags**: tags describing variation (in the written language) based on
+ dialect
+- **Derivation tags**: tags describing derivational morphology
All such classes of tags are described below. New classes will probably be added
in the future, but we'll try to keep the document updated. See also the
[documentation for each language](/lang/index.html).
Each class is recognised by having a **tag prefix**, a short string starting
with "**+**" (for suffix tags; prefix tags for prefixing languages end with
**+** as their last character) and ending with "**/**". Examples of such tag
prefixes are: `+Err/`, `+Dial/` etc.
It is assumed — and required — that all tags described here (and all other tags,
for that matter) are declared as multichar symbols in the `root.lexc` file of
each language.
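As an illustration of how a tag prefix identifies its class, the following sketch lists the classes used in one analysis string. It works on plain strings only and assumes suffix-style tags of the form `+Class/...`; the function name `tag_classes` and the sample analyses in the usage note are hypothetical.

```python
import re
from typing import List

# '+', a class name, then '/': for example +Err/, +Dial/, +Sem/.
TAG_PREFIX = re.compile(r"\+([A-Za-z]+)/")


def tag_classes(analysis: str) -> List[str]:
    """List the tag classes (Err, Dial, ...) used in one analysis string."""
    return TAG_PREFIX.findall(analysis)
```

A made-up analysis such as `guolli+N+Err/Orth+Sg+Nom` gives `["Err"]`, while plain tags like `+N` or `+Sg` are ignored because they carry no `/`.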
-# Error tags
+## Error tags
The error tag class is defined as follows:
+- **Tag prefix**: `+Err/`
+- **Definition**: tags describing parts of the language outside the established norm
+- **FST implication**: all strings containing one or more such tags are removed from
+ all normative transducers
-* **Tag prefix**: `+Err/`
-* **Definition**: tags describing parts of the language outside the established norm
-* **FST implication**: all strings containing one or more such tags are removed from
- all normative transducres
-# Dialect tags
+## Dialect tags
The dialect tag class is defined as follows:
-* **Tag prefix**: `+Dial/`
-* **Definition**: tags describing (written) variation based on dialect
-* **FST implication**: when the `DIALECTS` variable is set in `configure.ac`, one
- filter for each dialect defined there is built automatically. Each
- filter will remove all strings tagged with a dialect different from
- the one specific to the filter. Untagged strings will be left as is.
- The dialect tags are presently only made use of in Oahpa generators.
+- **Tag prefix**: `+Dial/`
+- **Definition**: tags describing (written) variation based on dialect
+- **FST implication**: when the `DIALECTS` variable is set in `configure.ac`, one
+ filter for each dialect defined there is built automatically. Each
+ filter will remove all strings tagged with a dialect different from
+ the one specific to the filter. Untagged strings will be left as is.
+ The dialect tags are presently only made use of in Oahpa generators.
Other notes:
-* The first character after the **/** *must* be one of `+` or `–`,
+- The first character after the **/** _must_ be one of `+` or `–`,
denoting either inclusion (the entry/form is valid for the specified dialect)
or exclusion (the entry/form is NOT valid for the specified dialect - but for
all others)
-* The string following **/** and **+/–** *must* be one of the strings
+- The string following **/** and **+/–** _must_ be one of the strings
specified in `configure.ac` for the variable `DIALECTS`.
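The inclusion and exclusion semantics can be sketched on plain strings as follows. This is an illustration under stated assumptions, not the filter that the build actually generates: the real filters are transducers, the dialect names `GG` and `KJ` used in the usage note are placeholders, and the function names are hypothetical. The sign is matched as either an ASCII hyphen or the dash character used in the text above.

```python
import re
from typing import Iterable, List

# Sign is '+' (inclusion) or a hyphen/dash (exclusion), followed by the dialect name.
DIAL_TAG = re.compile(r"\+Dial/([+\-–])([A-Za-z0-9]+)")


def valid_for_dialect(analysis: str, dialect: str) -> bool:
    """Decide whether one analysis survives the filter for the given dialect."""
    included, excluded = set(), set()
    for sign, name in DIAL_TAG.findall(analysis):
        (included if sign == "+" else excluded).add(name)
    if dialect in excluded:
        return False
    # Untagged strings, and strings with only exclusions, are left as is.
    return not included or dialect in included


def dialect_filter(analyses: Iterable[str], dialect: str) -> List[str]:
    """Keep only the analyses that are valid for the given dialect."""
    return [a for a in analyses if valid_for_dialect(a, dialect)]
```

With the made-up inputs `a+V+Dial/+GG`, `b+V+Dial/-GG` and `c+V`, the filter for `GG` keeps the first and the third, and the filter for `KJ` keeps the second and the third.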
-# Area tags
+## Area tags
The area/country tag class is defined as follows:
-* **Tag prefix**: `+Area/`
-* **Definition**: tags describing (written) variation based on country or another
- geographical unit, as per the
- [ISO 3166](https://en.wikipedia.org/wiki/ISO_3166) standard.
-* **FST implication**: not yet actively used, but will be used to build proofing
- tools and possibly other normative fst's where strings for other
- areas than the specified one will be removed. This will e.g. make
- the Lule Sámi speller for Sweden a better tool, as all strings
- with Norwegian *æ* (except for in names) will be removed: smaller
- and faster, and with less irrelevant suggestions.
+- **Tag prefix**: `+Area/`
+- **Definition**: tags describing (written) variation based on country or another
+ geographical unit, as per the
+ [ISO 3166](https://en.wikipedia.org/wiki/ISO_3166) standard.
+- **FST implication**: not yet actively used, but will be used to build proofing
+ tools and possibly other normative fst's where strings for other
+ areas than the specified one will be removed. This will e.g. make
+ the Lule Sámi speller for Sweden a better tool, as all strings
+ with Norwegian _æ_ (except for in names) will be removed: smaller
+ and faster, and with fewer irrelevant suggestions.
Other notes:
-* The tag prefix must be followed by an
- [ISO 3166](https://en.wikipedia.org/wiki/ISO_3166) string.
-# Semantic tags
+- The tag prefix must be followed by an
+ [ISO 3166](https://en.wikipedia.org/wiki/ISO_3166) string.
+## Semantic tags
The semantic tag class is defined as follows:
-* **Tag prefix**: `+Sem/`
-* **Definition**: tags describing semantic properties of the lexeme
-* **FST implication**: all semantic tags are automatically identified, and a couple
- of filters for manipulating them are built and applied, see further
- notes below.
+- **Tag prefix**: `+Sem/`
+- **Definition**: tags describing semantic properties of the lexeme
+- **FST implication**: all semantic tags are automatically identified, and a couple
+ of filters for manipulating them are built and applied, see further
+ notes below.
Other notes:
-* **the raw fst:** the semantic tags are moved relative to the POS tag, to
- ensure consistent tag ordering
-* **all fst's except disambiguators and grammar checkers:**
- the semantic tags are removed.
-* **disambiguators and grammar checkers:** the tags are kept (i.e. they are
- untouched)
+- **the raw fst:** the semantic tags are moved relative to the POS tag, to
+ ensure consistent tag ordering
+- **all fst's except disambiguators and grammar checkers:**
+ the semantic tags are removed.
+- **disambiguators and grammar checkers:** the tags are kept (i.e. they are
+ untouched)
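The effect of removing semantic tags can be shown on a single analysis string. This is a string-level sketch only, since the actual removal happens inside the transducers; the function name `strip_semantic_tags` and the sample tag `+Sem/Ani` are hypothetical.

```python
import re

# A semantic tag runs from '+Sem/' up to the next '+' or the end of the string.
SEM_TAG = re.compile(r"\+Sem/[^+]+")


def strip_semantic_tags(analysis: str) -> str:
    """Remove every +Sem/... tag and leave all other tags in place."""
    return SEM_TAG.sub("", analysis)
```

A disambiguator or grammar checker would see the string unchanged, while the other fst's would correspond to the stripped result.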
-# Derivation tags
+## Derivation tags
The derivation tag class is defined as follows:
+- **Tag prefix**: `+Der/`
+- **Definition**: tags describing derivational morphology
+- **FST implication**: there is no language-independent processing of these tags ATM
-* **Tag prefix**: `+Der/`
-* **Definition**: tags describing derivational morphology
-* **FST implication**: there is no language-independent processing of these tags ATM
-# Originating language tags
+## Originating language tags
The originating language tag class is defined as follows:
-* **Tag prefix**: `+OLang/`
-* **Definition**: tags describing originating language for loan words in cases where
- such information is required to get proper pronunciation in speech
- synthesis
-* **FST implication**: there is no language-independent processing of these tags ATM,
- and they are removed from all fst's; for North Sámi there is some
- language-specific processing to split the lexical fst into separate
- fst's for each defined `+OLang/` language, after which it is
- possible to apply OLang-specific phonetic rules
+- **Tag prefix**: `+OLang/`
+- **Definition**: tags describing originating language for loan words in cases where
+ such information is required to get proper pronunciation in speech
+ synthesis
+- **FST implication**: there is no language-independent processing of these tags ATM,
+ and they are removed from all fst's; for North Sámi there is some
+ language-specific processing to split the lexical fst into separate
+ fst's for each defined `+OLang/` language, after which it is
+ possible to apply OLang-specific phonetic rules
Other notes:
So far the only speech synthesis system we have built is for North Sámi. It was
furthermore built without using our text processing technology, and the features
being made possible with these tags (ie pronouncing «u» as /ʉː/ instead
diff --git a/ling/LexCIntro.md b/ling/LexCIntro.md
index cb9418fe..bc290598 100644
--- a/ling/LexCIntro.md
+++ b/ling/LexCIntro.md
@@ -5,7 +5,7 @@
1. The LexC formalism - part two: multicharacter symbols (Multichar_Symbols)
1. part three: continuation lexicons, start and end
-# briefly about FSTs - what are they, how do they work?
+## briefly about FSTs - what are they, how do they work?
- fst = Finite state transducer
- two levels: word form + analysis
@@ -18,7 +18,7 @@ g å e t i e +N +Pl +Gen
g ö ö t i - - - -
-# The LexC formalism - part one: lexicon structure
+## The LexC formalism - part one: lexicon structure
lemma+Tag:stem continuation-lexicon "info string" ;
@@ -26,11 +26,11 @@ g ö ö t i - - - -
I.e. lemma + analysis on the left side of the colon, (abstract) word form on the right side.
-# The LexC formalism - part two: multicharacter symbols (Multichar_Symbols)
+## The LexC formalism - part two: multicharacter symbols (Multichar_Symbols)
What about the tags? Every tag must be defined as a multicharacter symbol.
-# part three: continuation lexicons, start and end
+## part three: continuation lexicons, start and end
- start: `LEXICON Root` - **MUST** come first
- end: `#` - all paths **MUST** end at `#`
diff --git a/ling/LinguisticAnalysis.md b/ling/LinguisticAnalysis.md
index 2981db87..a779edf7 100644
--- a/ling/LinguisticAnalysis.md
+++ b/ling/LinguisticAnalysis.md
@@ -1,18 +1,15 @@
-Linguistic analysis with GiellaLT models
+# Linguistic analysis with GiellaLT models
Instead of compiling the grammatical tools yourself (as described elsewhere on these pages), you may also **download ready-compiled analysers for text analysis**. This page explains how. If you have compiled the tools on your machine **already**, we recommend [this page](../tools/docu-sme-manual.md) instead. If not, read on.
+## 1. Download the programs
-# 1. Download the programs
-## 1.1. Download the required *support programs*
-These commands will download the compilers *hfst* and *vislcg3*. They require a unix system. For use on Windows, see below.
+### 1.1. Download the required _support programs_
+These commands will download the compilers _hfst_ and _vislcg3_. They require a unix system. For use on Windows, see below.
**Download on Mac:**
curl http://apertium.projectjj.com/osx/install-nightly.sh > install-nightly.sh
@@ -21,7 +18,6 @@ chmod a+x install-nightly.sh
sudo ./install-nightly.sh
**Download on Linux ubuntu:**
@@ -38,35 +34,32 @@ curl https://apertium.projectjj.com/rpm/install-nightly.sh |sudo bash
sudo apt-get -f install apertium-all-devel
-## 1.2. Download the *analyser and disambiguator for your language:*
+### 1.2. Download the _analyser and disambiguator for your language:_
You will need both morphology and syntax. We use North Sámi (ISO code: **sme**) as an example:
+**Morphological analyser:**
-**Morphological analyser:**
curl https://gtsvn.uit.no/biggies/trunk/bin/sme/tokeniser-disamb-gt-desc.pmhfst > sme.pmhfst
+**Syntactic disambiguator:**
-**Syntactic disambiguator:**
curl https://gtsvn.uit.no/biggies/trunk/bin/sme/disambiguator.cg3 > sme.cg3
**NOTE!** For North Sámi (but not for the other languages) you should also run this command:
curl https://gtsvn.uit.no/biggies/trunk/bin/sme/semsets.cg3 > semsets.cg3
-The file *semset.cg3* should be in the same catalogue as the file *sme.cg3*.
+The file _semsets.cg3_ should be in the same directory as the file _sme.cg3_.
Replace the language code **sme** with the language you want (note! the language code is mentioned **twice** in the commands above, replace both!):
- **fao**: Faroese
- **fin**: Finnish
- **smn**: Inari Saami
@@ -77,40 +70,35 @@ Replace the language code **sme** with the language you want (note! the language
- **rus**: Russian (Note! For Russian only morphology is available)
- **sma**: South Saami
More languages may be added upon request, from [this list](https://giellalt.github.io/LanguageModels.html). Feel free to contact us if your language is missing.
+## 2. Use the programs
-# 2. Use the programs
-## 2.1. Automatic grammatical analysis
+### 2.1. Automatic grammatical analysis
**Summary:** When you have downloaded the files (cf. the **Download...** links above), you will be able to run the following command in a terminal window (again with **sme** as an example):
-cat yourtextfile.txt | hfst-tokenise -cg sme.pmhfst | vislcg3 -g sme.cg3
+cat yourtextfile.txt | hfst-tokenise -cg sme.pmhfst | vislcg3 -g sme.cg3
+The text file is sent through a two-step analysis: first through the morphological analyser `sme.pmhfst`,
+using the support program `hfst-tokenise`. The flag `-cg` ensures morphological analysis in the required format.
+Thereafter the output is disambiguated with the disambiguator `sme.cg3`, using the support program `vislcg3`.
+The flag `-g` identifies the file `sme.cg3` as the grammar file. To see more options, run
+`hfst-tokenise -h` and `vislcg3 -h`.
-The textfile is sent through a two-step analysis: First through the morphological analyser ``sme.pmhfst``,
-by using the support program ``hfst-tokenise``. The flag ``-cg`` ensures morphological analysis in the required format.
-Thereafter the output is disambiguated with the disambiguator sme.cg3, by using the support program ``vislcg3``.
-The flag ``-g`` identifies the file ``sme.cg3`` as the grammar file. In order to see more options, you may write
-``hfst-tokenise -h`` and ``vislcg3 -h``.
-You may also conduct automatic dictionary lookup, see below.
+You may also conduct automatic dictionary lookup, see below.
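For scripted use, the two-step analysis can be wrapped in a few lines of Python. This is a sketch rather than a supported interface: it assumes that `hfst-tokenise` and `vislcg3` are on the `PATH` and that the files were saved in the current directory under the names used in the download commands (for example `sme.pmhfst` and `sme.cg3`). The function names are hypothetical.

```python
import subprocess
from typing import List, Tuple


def analysis_commands(lang: str) -> Tuple[List[str], List[str]]:
    """Argument lists for the tokeniser and the disambiguator of one language."""
    tokenise = ["hfst-tokenise", "-cg", f"{lang}.pmhfst"]
    disambiguate = ["vislcg3", "-g", f"{lang}.cg3"]
    return tokenise, disambiguate


def analyse(text: str, lang: str = "sme") -> str:
    """Send text through both steps and return the disambiguated output."""
    tokenise, disambiguate = analysis_commands(lang)
    # Step 1: morphological analysis in CG format.
    tokens = subprocess.run(tokenise, input=text, capture_output=True,
                            text=True, check=True).stdout
    # Step 2: disambiguation of the analysed text.
    return subprocess.run(disambiguate, input=tokens, capture_output=True,
                          text=True, check=True).stdout
```

Calling `analyse(open("yourtextfile.txt").read())` would then correspond to the shell pipeline shown earlier in this section.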
+## 3. Download other programs
-# 3. Download other programs
+### 3.1. Dictionaries
-## 3.1. Dictionaries
-You may also use the *Neahttadigisánit* dictionaries on the command line. **Warning!!** The program to be downloaded here gives translation equivalent only, not explanations or example sentences. For dictionary lookup the online dictionaries are thus far better, the programs presented here are good for automatic lookup.
+You may also use the _Neahttadigisánit_ dictionaries on the command line. **Warning!!** The program to be downloaded here gives translation equivalents only, not explanations or example sentences. For dictionary lookup the online dictionaries are thus far better; the programs presented here are good for automatic lookup.
-### 3.1.1. Fetching the dictionaries
+#### 3.1.1. Fetching the dictionaries
-The dictionaries are found in the catalogue of **the first language**, the language to translate **from**. Each dictionary has the file name *Lang1Lang2-all.hfst*.
+The dictionaries are found in the catalogue of **the first language**, the language to translate **from**. Each dictionary has the file name _Lang1Lang2-all.hfst_.
Here are two command examples for fetching the dictionaries.
@@ -124,36 +112,36 @@ curl https://gtsvn.uit.no/biggies/trunk/bin/fin/finsme-all.hfst > finsme.hfst
-For other dictionaries, replace *sme/smenob-all.hfst* above with *smn/smnfin-all.hfst*, *fin/finsmn-all.hfst*, *sma/smanob-all.hfst*, *nob/nobsma-all.hfst*, and correspondingly for *sme/smenob.hfst* etc.
+For other dictionaries, replace _sme/smenob-all.hfst_ above with _smn/smnfin-all.hfst_, _fin/finsmn-all.hfst_, _sma/smanob-all.hfst_, _nob/nobsma-all.hfst_, and correspondingly for _sme/smenob.hfst_ etc.
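The naming pattern for the dictionary files can be written out as a small helper. This is a sketch based only on the example URLs in this section: the function name `dictionary_url` is hypothetical, and the helper does not check whether a dictionary for a given language pair actually exists on the server.

```python
BASE_URL = "https://gtsvn.uit.no/biggies/trunk/bin"


def dictionary_url(source: str, target: str) -> str:
    """Build the download URL for the dictionary translating from source to target."""
    # The file lives in the directory of the language translated from.
    return f"{BASE_URL}/{source}/{source}{target}-all.hfst"
```

The result can be passed to `curl` in the same way as in the commands above.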
-### 3.1.2. Using the dictionaries
+#### 3.1.2. Using the dictionaries
The dictionaries may be used in two ways:
-- send a list of baseforms through it: ``cat smn-words.txt | hfst-lookup smnfin-all.hfst``
-- use the dictionary interactively: ``hfst-lookup smnfin-all.hfst``and thereafter write Inari Saami words and press ENTER. Leave the program with ``ctrl C``.
+- send a list of baseforms through it: `cat smn-words.txt | hfst-lookup smnfin-all.hfst`
+- use the dictionary interactively: `hfst-lookup smnfin-all.hfst` and thereafter write Inari Saami words and press ENTER. Leave the program with `ctrl C`.
-## 3.2. Word analysers
+### 3.2. Word analysers
curl https://gtsvn.uit.no/biggies/trunk/bin/smn/smn.hfstol > smn.hfstol
Use the word analysers in two ways:
a. send lists with one word per line through them: `cat wordlist | hfst-lookup smn.hfstol`
b. use the analyser interactively (put it on stand-by) with `hfst-lookup smn.hfstol` and feed it with one word at a time (press ENTER). Leave the program with `ctrl C`.
+### 3.3. Spellers
-## 3.3. Spellers
-**Note** The spellers will need the *hfst-ospell* program (**TODO**: Document how to get hfst-ospell from nightly).
+**Note** The spellers will need the _hfst-ospell_ program (**TODO**: Document how to get hfst-ospell from nightly).
curl https://gtsvn.uit.no/biggies/trunk/bin/smn/smn.zhfst > smn.zhfst
-Thereafter use them as follows (presuming you have the *hfst-ospell* program:
+Thereafter use them as follows (presuming you have the _hfst-ospell_ program):
hfst-ospell -S -n 5 smn.zhfst
@@ -161,11 +149,8 @@ hfst-ospell -S -n 5 smn.zhfst
The flag `-S` means "present a correction suggestion", and the flag `-n 5` specifies the number of suggestions (here: 5).
+## 4. Running the analysers on Windows:
-# 4. Running the analysers on Windows:
All the above works on Linux and Mac. In order to make it work on Windows, do the following:
[Install a Linux shell](https://www.howtogeek.com/249966/how-to-install-and-use-the-linux-bash-shell-on-windows-10/). It is not too complicated, but requires admin rights on your machine. Thereafter, execute the commands for Linux ubuntu above.
diff --git a/ling/Ordbild.md b/ling/Ordbild.md
index 7e41a4f0..196ae558 100644
--- a/ling/Ordbild.md
+++ b/ling/Ordbild.md
@@ -1,49 +1,31 @@
+# Ordbild
The goal of this project is to make a corpus-driven valency dictionary,
along the lines of Korp's Ordbild, Gramtrans' DeepDict and Kilgarriff's
Sketch Engine. We use the Korp interface.
It is intended for researchers, students and translators.
+## Documents
+- [Plan for the content of the Saami Ordbild](ordbild/OversiktOverOrdbild.html)
+## Meeting memos
+- [DeepDict discussion 3.11.2011](/ped/nudoc/meetings/111103.html)
+## Links, (as reference)
-# Documents
-* [Plan for the content of the Saami Ordbild](ordbild/OversiktOverOrdbild.html)
-# Meeting memos
-* [DeepDict discussion 3.11.2011](/ped/nudoc/meetings/111103.html)
-# Links, (as reference)
-* [GramTrans DeepDict](http://gramtrans.com/deepdict/), in cooperation with
-* [Comments (to Tino) on an earlier version of the project](GammalKravspesifikasjon.html)
-* 6 examples from the experimental Saami DeepDict test version that we set up earlier
-** V:
-*** [leat](http://gramtrans.com/deepdict/lookup.php?word=leat&class=V&lang=smi)
-*** [mannat](http://gramtrans.com/deepdict/lookup.php?word=mannat&class=V&lang=smi)
-*** [boahtit](http://gramtrans.com/deepdict/lookup.php?word=boahtit&class=V&lang=smi)
-** N:
-*** [boazu](http://gramtrans.com/deepdict/lookup.php?word=boazu&class=N&lang=smi)
-*** [eallu](http://gramtrans.com/deepdict/lookup.php?word=eallu&class=N&lang=smi)
-** ADJ:
-*** [olu](http://gramtrans.com/deepdict/lookup.php?word=olu&class=ADJ&lang=smi)
-*** [stuoris](http://gramtrans.com/deepdict/lookup.php?word=stuoris&class=ADJ&lang=smi)
-** ADV:
-*** [de](http://gramtrans.com/deepdict/lookup.php?word=de&class=ADV&lang=smi)
-*** [nu](http://gramtrans.com/deepdict/lookup.php?word=nu&class=ADV&lang=smi)
+- [GramTrans DeepDict](http://gramtrans.com/deepdict/), in cooperation with
+ [GramTrans](http://gramtrans.com).
+- [Comments (to Tino) on an earlier version of the project](GammalKravspesifikasjon.html)
+- 6 examples from the experimental Saami DeepDict test version that we set up earlier
+  - V:
+    - [leat](http://gramtrans.com/deepdict/lookup.php?word=leat&class=V&lang=smi)
+    - [mannat](http://gramtrans.com/deepdict/lookup.php?word=mannat&class=V&lang=smi)
+    - [boahtit](http://gramtrans.com/deepdict/lookup.php?word=boahtit&class=V&lang=smi)
+  - N:
+    - [boazu](http://gramtrans.com/deepdict/lookup.php?word=boazu&class=N&lang=smi)
+    - [eallu](http://gramtrans.com/deepdict/lookup.php?word=eallu&class=N&lang=smi)
+  - ADJ:
+    - [olu](http://gramtrans.com/deepdict/lookup.php?word=olu&class=ADJ&lang=smi)
+    - [stuoris](http://gramtrans.com/deepdict/lookup.php?word=stuoris&class=ADJ&lang=smi)
+  - ADV:
+    - [de](http://gramtrans.com/deepdict/lookup.php?word=de&class=ADV&lang=smi)
+    - [nu](http://gramtrans.com/deepdict/lookup.php?word=nu&class=ADV&lang=smi)
diff --git a/ling/ParallelCorpusCheckFix.md b/ling/ParallelCorpusCheckFix.md
index 9bd1ae96..8b7d66ba 100644
--- a/ling/ParallelCorpusCheckFix.md
+++ b/ling/ParallelCorpusCheckFix.md
@@ -1,23 +1,19 @@
-Check and fix parallel corpus
+# Check and fix parallel corpus
Do this if you find files in freecorpus that aren't parallel anyway (or how to
improve freecorpus in sixteen simple steps)
To find files with wrong sentence alignment:
1. Open a terminal. Run tmx2html.sh. This converts all .tmx files to html files.
-This makes it easier to read the parallelized files.
+ This makes it easier to read the parallelized files.
2. Go to:
+ $GTFREE/prestable/tmx/nob2sme
3. To find out which files contain it, grep for a word from the sentence that
-has wrong alignment:
+ has wrong alignment:
nob2sme $ grep -rl bransjep . | grep -v '.svn' | less
@@ -26,22 +22,17 @@ nob2sme $ grep -rl bransjep . | grep -v '.svn' | less
Check all the listed files
4. Check if correctly aligned:
-nob2sme $ open ./admin/depts/regjeringen.no/aktuelt.html_id=166.tmx.html
-Then you will see the sentence alignment and can check if it is correctly
-aligned. Use cmd+f bransjep to search for words, etc.
+ nob2sme $ open ./admin/depts/regjeringen.no/aktuelt.html_id=166.tmx.html
+ Then you will see the sentence alignment and can check if it is correctly
+ aligned. Use cmd+f bransjep to search for words, etc.
1. Find all the files with the same id number in orig:
~ $ cd freecorpus
$ find orig -name '*id=210*' | grep -v ".svn"
@@ -55,31 +46,26 @@ orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210
2. Check if nob and sme are parallel files:
-$ see orig/nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210
+ $ see orig/nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210
+ orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210
In SubEthaEdit press cmd+r to open the files in a webbrowser.
3. If the files are not parallel files, change the sme xsl-file and delete
-information about parallel files (line 83-96). Check also line 60-79
-$ see orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210.xsl
+ information about parallel files (line 83-96). Check also line 60-79
+ (multilanguages):
+ $ see orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210.xsl
4. Convert the file to xml to check if there are any errors in the xsl-file:
-$ convert2xml orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210
+ $ convert2xml orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210
5. If converted successfully, check in the xsl-file:
-$ svn ci -m "This file doesn't have a parallel nob version"
+ $ svn ci -m "This file doesn't have a parallel nob version"
+ orig/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210.xsl
6. Find the rest of the files with the same id number:
$ find orig -name '*id=210*' | grep -v ".svn"
@@ -90,41 +76,33 @@ orig/nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210
7. svn rm the other languages except sme (6 files; 2 eng, 2 nno, 2 nob):
-$ svn rm orig/eng/admin/depts/regjeringen.no/engelsk-tema.html_id=210
+ $ svn rm orig/eng/admin/depts/regjeringen.no/engelsk-tema.html_id=210
+ ./eng/admin/depts/regjeringen.no/engelsk-tema.html_id=210.xsl
+ ./nno/admin/depts/regjeringen.no/arbeid-og-velferd-nynorsk.html_id=210
+ ./nno/admin/depts/regjeringen.no/arbeid-og-velferd-nynorsk.html_id=210.xsl
+ ./nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210
+ ./nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210.xsl
8. check in the changes:
-svn ci -m "deleted files with no parallel Saami translation"
+ svn ci -m "deleted files with no parallel Saami translation"
+9. Find the rest of the files with same id number in prestable and delete
+ them:
-10. Find the rest of the files with same id number in prestable and delete
freecorpus $ find prestable -name '*id=210*' | grep -v '.svn'
11. Delete the files (5 files: 3 in converted, 1 in tmx and 1 in toktmx):
-freecorpus $ svn rm
+ freecorpus $ svn rm
+ prestable/converted/nob/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210.xml
+ prestable/converted/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210.typos
+ prestable/converted/sme/admin/depts/regjeringen.no/bargu-ja-algu.html_id=210.xml
+ prestable/tmx/nob2sme/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210.tmx
+ prestable/toktmx/nob2sme/admin/depts/regjeringen.no/arbeid_og_velferd.html_id=210.toktmx
12. Check in the changes:
-svn ci -m "deleting useless files, they weren't parallel anyway"
+ svn ci -m "deleting useless files, they weren't parallel anyway"
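The repeated `find ... | grep -v '.svn'` step can also be done from Python, which may be handy when many id numbers have to be checked. This is a minimal sketch: the function name `files_with_id` is hypothetical, and like the `find` pattern it matches any name containing `id=<number>`, so `id=210` would also match `id=2100`.

```python
from pathlib import Path
from typing import List


def files_with_id(root: str, doc_id: int) -> List[Path]:
    """Files under root whose name contains 'id=<doc_id>', skipping .svn directories."""
    return sorted(
        p for p in Path(root).rglob(f"*id={doc_id}*")
        if ".svn" not in p.parts
    )
```

Calling `files_with_id("orig", 210)` from the freecorpus directory lists the same kind of files as the shell command in step 1, and `files_with_id("prestable", 210)` covers the later steps.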
diff --git a/ling/ParallelCorpusConversion.md b/ling/ParallelCorpusConversion.md
index 1ea2303d..b994fb1d 100644
--- a/ling/ParallelCorpusConversion.md
+++ b/ling/ParallelCorpusConversion.md
@@ -1,17 +1,15 @@
-Parallel corpus conversion
+# Parallel corpus conversion
This is an example of how to:
-- fetch parallel documents
-- add metadata that make the parallel documents refer to each other
-- add the parallel documents and their metadata to the corpus
- repository
-- convert them to giellateknos xml format
-- move the converted documents to prestable/converted
+- fetch parallel documents
+- add metadata that make the parallel documents refer to each other
+- add the parallel documents and their metadata to the corpus
+ repository
+- convert them to giellateknos xml format
+- move the converted documents to prestable/converted
-Fetch parallel documents
+## Fetch parallel documents
1. Open Safari, go to this address:
@@ -43,7 +41,7 @@ Fetch parallel documents
4. Open a new Terminal window and go to:
- freecorpus/orig/sme/admin/sd/other\_files/
+ freecorpus/orig/sme/admin/sd/other_files/
Fetch the saami document with this command: wget (cmd+v, to paste
the link we just copied)
@@ -63,7 +61,7 @@ Fetch parallel documents
7. Go to the already opened Terminal. Press cmd+t to open a new tab and go to:
- freecorpus/orig/nob/admin/sd/other\_files/
+ freecorpus/orig/nob/admin/sd/other_files/
Fetch the Norwegian document using this command: wget (press cmd+v
to paste the link you copied)
@@ -76,8 +74,7 @@ Fetch parallel documents
(Now you have the saami pages in the left tab of Terminal and the
norwegian pages in the right tab)
-Add metadata
+## Add metadata
1. Open the saami xsl file:
@@ -95,28 +92,21 @@ Add metadata
This entry has to be entered in the saami xsl file (don't fill in
"translated from" in the norwegian xsl file):
-->NB!! This is the full link, note that you have to replace some
characters in the link. (Paste the link into a clean SubEthaEdit
document, use the search and replace function and replace & with
(&amp;). Don't include the parentheses). Copy the link and paste it
into the Terminal. IMPORTANT: replace & with &amp;
- --->NB! Only use translated\_from if it is a translated document!
+ --->NB! Only use translated_from if it is a translated document!
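Since the xsl metadata file is XML, a bare `&` in a link has to be written as `&amp;`. A minimal Python sketch of that replacement, assuming the link is available as a plain string (the helper name and the example URL are made up for illustration):

```python
from xml.sax.saxutils import escape


def escape_link(url: str) -> str:
    """Return the link with &, < and > replaced by XML entities."""
    return escape(url)


# Hypothetical link, only for illustration.
print(escape_link("http://example.org/doc.pdf?id=12&lang=sme"))
# prints: http://example.org/doc.pdf?id=12&amp;lang=sme
```

Run it only once per link: escaping an already escaped link turns `&amp;` into `&amp;amp;`.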
@@ -126,12 +116,9 @@ Add metadata
-Add the parallel documents and their metadata to the corpus repository
+## Add the parallel documents and their metadata to the corpus repository
1. Rerun convert2xml in the SME Terminal window (note that you can
press up in the Terminal until the right command appears):
@@ -171,8 +158,7 @@ Add the parallel documents and their metadata to the corpus repository
svn ci -m "your svn message"
-Move the converted documents to prestable/converted
+## Move the converted documents to prestable/converted
1. In the SME Terminal, write:
@@ -191,14 +177,14 @@ Move the converted documents to prestable/converted
4. Write svn stat and the result is:
- ? prestable/converted/nob/admin/sd/other\_files/sp2012-2.pdf.xml ?
- prestable/converted/sme/admin/sd/other\_files/dc2012-2.pdf.xml
+ ? prestable/converted/nob/admin/sd/other_files/sp2012-2.pdf.xml ?
+ prestable/converted/sme/admin/sd/other_files/dc2012-2.pdf.xml
5. Write:
svn add (copy and paste both paths, remember to add a space between
- them) prestable/converted/nob/admin/sd/other\_files/sp2012-2.pdf.xml
- prestable/converted/sme/admin/sd/other\_files/dc2012-2.pdf.xml
+ them) prestable/converted/nob/admin/sd/other_files/sp2012-2.pdf.xml
+ prestable/converted/sme/admin/sd/other_files/dc2012-2.pdf.xml
6. Write:
diff --git a/ling/PdfConvertingManipulation.md b/ling/PdfConvertingManipulation.md
index 40847848..c2fb8d4f 100644
--- a/ling/PdfConvertingManipulation.md
+++ b/ling/PdfConvertingManipulation.md
@@ -1,93 +1,67 @@
-Converting PDF files
-# PDF
+# Converting PDF files
+## PDF
This is by far the most problematic format to convert to xml, often needing extensive manipulation of the variables in the metadata documents to get the wanted output in the converted documents.
Portable Document Format (PDF) is a digital document format developed by Adobe Systems and introduced in 1993. Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it.
A loose definition of the format could be "digital paper".
Extracting text from a pdf document is comparable to extracting text using OCR: to retain the "story" of the document, we often need to skip pages, headers, footers, page numbers, footnotes, etc.
-# Converted document contains less (or no) text compared to the original document
+## Converted document contains less (or no) text compared to the original document
Decrease margins to 0, then compare document to the converted output.
Then adjust variables to taste.
-# Extracting individual articles from a document
+## Extracting individual articles from a document
Some documents contain many articles written by different authors. To attribute each text to the right authors, we need to extract each article from the document.
-First download the document into a corpus, preferrably using __add_files_to_corpus__. Remove the metadata document of the downloaded document, we will not need it.
+First download the document into a corpus, preferably using **add_files_to_corpus**. Remove the metadata document of the downloaded document; we will not need it.
Make a soft link to the document, e.g.
ln -s original.pdf original-author1-author2.pdf
ln -s original.pdf original-author3-author4.pdf
Run convert2xml on both the soft-linked documents to make basic metadata files belonging to these soft linked files.
convert2xml original-author1-author2.pdf
convert2xml original-author3-author4.pdf
+Then use **skip_pages** in the files `original-author1-author2.pdf.xsl` and `original-author3-author4.pdf.xsl` so that only the wanted pages are left in the converted documents.
-Then use __skip_pages__ in the files `original-author1-author2.pdf.xsl` and `original-author3-author4.pdf.xsl` so that only the wanted pages are left in the converted documents.
-# Order in the converted document is not retained
+## Order in the converted document is not retained
Run the command:
pdftohtml -hidden -enc UTF-8 -stdout -nodrm -i -xml documentname.pdf | less
to see if the order of the text is retained. This is the command that the pdf converter uses for the first conversion from pdf to xml. It produces an xml format specific to the [poppler](https://poppler.freedesktop.org/) tools, which pdftohtml is a part of.
If the order of the text in the output of the above command differs from the order in the converted document, then there is a bug in the pdf converter. File a bug on bugzilla. Use the **product**
"Corpus", **component**
"xml conversion".
+## Most of the text lines in the pdf documents are interpreted as paragraphs
-# Most of the text lines in the pdf documents are interpreted as paragraphs
-Have a look at the documentation on linespacing below.
+Have a look at the documentation on linespacing below.
+## Variables specific to pdf documents
-# Variables specific to pdf documents
-# Skipping pages
+## Skipping pages
Typical uses are to skip front page, pages containing tables of content, indexes, etc. In short, removing pages not relevant for the "story" of the document.
@@ -100,15 +74,10 @@ Examples:
1, 6-10, 15, 20, 25-30
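A page list like the one above can be expanded into individual page numbers. The following is a minimal sketch, not the converter's own code; the function name is hypothetical and it handles only the numeric forms shown here (single pages and `first-last` ranges):

```python
from typing import List


def expand_pages(spec: str) -> List[int]:
    """Expand a spec such as "1, 6-10, 15" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "6-10" includes both end points.
            first, last = (int(number) for number in part.split("-", 1))
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1, 6-10, 15, 20, 25-30"))
```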
-# Margins
+## Margins
This option is used to remove text outside a given rectangle. Typical uses are to remove page numbers, page headers, page footers, footnotes at the bottom of the page, and info boxes on the left or right of the "real" document.
@@ -153,13 +122,10 @@ all=9, 8=12
1;3;8=20, 4;5;7=10
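Margin settings like the examples above can be read as a mapping from pages to values. The sketch below is only an illustration in plain Python: the function names, the default value and the lookup order (explicit page number first, then `odd`/`even`, then `all`) are assumptions, not the converter's documented behaviour.

```python
from typing import Dict


def parse_margin(spec: str) -> Dict[str, int]:
    """Parse e.g. "1;3;8=20, 4;5;7=10" into {"1": 20, "3": 20, "8": 20, ...}."""
    margins = {}
    for part in spec.split(","):
        keys, value = part.split("=")
        # Several pages may share one value, separated by semicolons.
        for key in keys.split(";"):
            margins[key.strip()] = int(value)
    return margins


def margin_for_page(margins: Dict[str, int], page: int, default: int = 0) -> int:
    """Look up the margin for a page (assumed order: page, odd/even, all)."""
    if str(page) in margins:
        return margins[str(page)]
    parity = "odd" if page % 2 else "even"
    if parity in margins:
        return margins[parity]
    return margins.get("all", default)


settings = parse_margin("odd=5, even=8, 8=15, 11=3")
print(margin_for_page(settings, 8), margin_for_page(settings, 3))
```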
-# Removing content from a page
+## Removing content from a page
A typical use of this is to remove info boxes inside the margins of a page.
@@ -172,16 +138,12 @@ as *_margin above. For a given page, all four margins
must be defined.
-# Line spacing
+## Line spacing
The pdf converter in CorpusTools uses guesswork to glue text lines into paragraphs. Usually documents have a line spacing of 1.5 or less. This means that the distance from the bottom of one line to the bottom of the next is at most 1.5 times the font size.
Some documents, typically student texts, have a larger linespacing. When using the default linespacing, lines in the documents are interpreted as paragraphs, leading to output like this:
This sentence
is divided
@@ -192,10 +154,8 @@ Some documents, typically student texts, have a larger linespacing. When using t
Increasing the value for this variable improves this situation.
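The effect of the variable can be illustrated with a small sketch of the heuristic described above. This is not the CorpusTools implementation; the function name and the input format (a list of line bottoms and texts, with positions growing down the page) are assumptions made for the example.

```python
from typing import List, Tuple


def glue_lines(lines: List[Tuple[float, str]], font_size: float,
               linespacing: float = 1.5) -> List[str]:
    """Join consecutive lines into paragraphs.

    A new paragraph starts when the distance between the bottoms of two
    consecutive lines is larger than linespacing * font_size.
    """
    paragraphs = []
    current = []
    previous_bottom = None
    for bottom, text in lines:
        too_far = (previous_bottom is not None
                   and bottom - previous_bottom > linespacing * font_size)
        if too_far:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text)
        previous_bottom = bottom
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Double-spaced text: 20 units between lines set in a 10 unit font.
page = [(10, "This sentence"), (30, "is divided"), (50, "into lines")]
print(glue_lines(page, font_size=10))                   # three short paragraphs
print(glue_lines(page, font_size=10, linespacing=2.5))  # one paragraph
```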
@@ -227,7 +187,3 @@ odd=5, even=8, 8=15, 11=3
all=9, 8=12
1;3;8=20, 4;5;7=10
diff --git a/ling/RtfConvertingManipulation.md b/ling/RtfConvertingManipulation.md
index adeace23..2b6c83a5 100644
--- a/ling/RtfConvertingManipulation.md
+++ b/ling/RtfConvertingManipulation.md
@@ -1,28 +1,7 @@
-## Skip pages
+# Skip pages
## Skip part of pages
## Skip lines
## Skip words
diff --git a/ling/SaamiTextOnline.md b/ling/SaamiTextOnline.md
index 9e692b82..16922be1 100644
--- a/ling/SaamiTextOnline.md
+++ b/ling/SaamiTextOnline.md
@@ -1,68 +1,58 @@
-Online texts in Sámi
+# Online texts in Sámi
We want as much text material as possible.
The principle is that we collect html pages
automatically, but pdf pages manually.
Add new pages to freecorpus/urls.yaml.
-# Automatic fetching
-* samediggi.fi
-* samediggi.no
-* yle.fi/uutiset/osasto/sapmi/
-* nrk.no/sapmi
-* avvir.no
-* samas.no
-* beaivvas.no
-* giella.org
-* paliskunnat.fi
-* regjeringen.no
-* arkivverket.no/om-oss/sami-arkiiva
-* samisklegeforening.no/
-* finnmarksarkivene.no
-# Websites that have pdf files
-* [Research Council](https://www.forskningsradet.no/prognett-samisk/Sentrale_dokumenter/1229378700479)
-* [FeFo](http://www.fefo.no/sa/sider/start.aspx)
-* [Municipalities](https://nn.wikipedia.org/wiki/Samiske_kommunar) (NB! we already have some of them)
-* Munin uit.no
-** [PhD|https://munin.uit.no/handle/10037/281/browse?type=title], [master](https://munin.uit.no/handle/10037/159/browse?type=title) search for "ahte" site:munin.uit.no/handle
-* NDLA (Nasjonal digital læringsarena
-** [NDLA South Sámi as first language](https://ndla.no/nn/node/162357?fag=126960)
-## Other sources
-* Counties
-* State administration in Norway, Sweden, Finland
-** Directorates
-** Finland: Metsähallitus
-* Sámi parliaments
-** Press releases
-* Sámi political parties
-* Blogs
-* Churches
-## From Reetta's email
+## Automatic fetching
+- samediggi.fi
+- samediggi.no
+- yle.fi/uutiset/osasto/sapmi/
+- nrk.no/sapmi
+- avvir.no
+- samas.no
+- beaivvas.no
+- giella.org
+- paliskunnat.fi
+- regjeringen.no
+- arkivverket.no/om-oss/sami-arkiiva
+- samisklegeforening.no/
+- finnmarksarkivene.no
+## Websites that have pdf files
+- [Research Council](https://www.forskningsradet.no/prognett-samisk/Sentrale_dokumenter/1229378700479)
+- [FeFo](http://www.fefo.no/sa/sider/start.aspx)
+- [Municipalities](https://nn.wikipedia.org/wiki/Samiske_kommunar) (NB! we already have some of them)
+- Munin uit.no
+  - [PhD](https://munin.uit.no/handle/10037/281/browse?type=title), [master](https://munin.uit.no/handle/10037/159/browse?type=title); search for "ahte" site:munin.uit.no/handle
+- NDLA (Nasjonal digital læringsarena)
+  - [NDLA South Sámi as first language](https://ndla.no/nn/node/162357?fag=126960)
+### Other sources
+- Counties
+- State administration in Norway, Sweden, Finland
+  - Directorates
+  - Finland: Metsähallitus
+- Sámi parliaments
+  - Press releases
+- Sámi political parties
+- Blogs
+- Churches
+### From Reetta's email
Here is a list of websites that I found but did not have time to fetch new files from. Perhaps Risten can do it when she is back at work?
-* giella.org: Lule Sámi and South Sámi
-* paliskunnat.fi
-* saminuorra.org
-* saamivillage.fi
-* rdm.no
-* kuati.fi?
-* dutkansearvi.fi?
-* Local Sámi associations?
+- giella.org: Lule Sámi and South Sámi
+- paliskunnat.fi
+- saminuorra.org
+- saamivillage.fi
+- rdm.no
+- kuati.fi?
+- dutkansearvi.fi?
+- Local Sámi associations?
diff --git a/ling/TxtConvertingManipulation.md b/ling/TxtConvertingManipulation.md
index ed8376bb..f2eb4bd0 100644
--- a/ling/TxtConvertingManipulation.md
+++ b/ling/TxtConvertingManipulation.md
@@ -1,15 +1,9 @@
-Converting .txt files
+# Converting .txt files
-# txt
+## txt
+### Skip lines
-## Skip lines
-## Skip words
+### Skip words
diff --git a/ling/UnicodeNormalisation.md b/ling/UnicodeNormalisation.md
index 2a235ab8..574bac01 100644
--- a/ling/UnicodeNormalisation.md
+++ b/ling/UnicodeNormalisation.md
@@ -1,44 +1,29 @@
-(*or: how to fix decomposed Sami letters*)
+(_or: how to fix decomposed Sami letters_)
In Unicode, many glyphs (letter symbols) may either be represented
by one character, or by a sequence of many. The letter á may thus be
either one character á or two characters a and combining ´ . Normalisation
forms are used to standardise the representation.
1. NFKD = Normalization Form Compatibility Decomposition
1. NFKC = Normalization Form Compatibility Composition
-The first, NFKD, **decomposes** the characters (á as two characters),
+The first, NFKD, **decomposes** the characters (á as two characters),
whereas the second, NFKC, **composes** them (á as one character).
Our North Sami analysers use the **composed** representation.
If you get text with decomposed letters (**UnicodeChecker** will tell you that č is two characters), you must **compose** them with the following command:
cat infile.txt \
| uconv -f utf8 -t utf8 -x Any-NFKC > outfile.txt
See also `man uconv`
The uconv program should be installed on your machine as part of
the ICU installation.
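If uconv is not available, the same composition can be sketched with Python's standard `unicodedata` module. This is an alternative to the command above, not something the project scripts are known to use; the function name is chosen for the example.

```python
import unicodedata


def compose(text: str) -> str:
    """Return the text in composed form (NFKC), like `uconv -x Any-NFKC`."""
    return unicodedata.normalize("NFKC", text)


decomposed = "c\u030c"  # the letter c followed by a combining caron
composed = compose(decomposed)
print(len(decomposed), len(composed))  # prints: 2 1
```

Note that NFKC also replaces compatibility characters (for instance ligatures) with their plain equivalents; NFC is the form that only composes.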
-* [Unicode on normalization](http://unicode.org/reports/tr15/)
-* [Exmple script where the command is used](https://github.com/redpony/cdec/blob/master/corpus/utf8-normalize.sh)
+- [Unicode on normalization](http://unicode.org/reports/tr15/)
+- [Example script where the command is used](https://github.com/redpony/cdec/blob/master/corpus/utf8-normalize.sh)
diff --git a/ling/WikipediaAsCorpus.md b/ling/WikipediaAsCorpus.md
index 508bcb95..9e727f8e 100644
--- a/ling/WikipediaAsCorpus.md
+++ b/ling/WikipediaAsCorpus.md
@@ -1,38 +1,33 @@
# Wikipedia as a Corpus
This page explains how to fetch whole Wikipedia editions as raw text
-# Do the following:
-1. Find the language code for the language you want: It is the two-letter ISO code (**se**, etc.). If the language has no two-letter code, use the 3-letter code.
-1. Go to the download page. The URL is [http://dumps.wikimedia.org/sewiki/](http://dumps.wikimedia.org/sewiki/) will give you North Sámi, exchange the **se** in *sewiki* with the language code you want.
-1. In the list that follows, choose the last one **before** *latest/*. The
- latest one is the same as the one with the last dates (it is just a stable url), the download headers are more nicely formatted in the last dated link.
-1. Download the .bz2 file found under the header
- **Articles, templates, image descriptions, and primary meta-pages.**
- This will give you the articles. \\
- If you want revision history (e.g. for spellchecker testing), you need
- *All pages with complete edit history* (this use is not documented).
+## Do the following
+1. Find the language code for the language you want: It is the two-letter ISO code (**se**, etc.). If the language has no two-letter code, use the 3-letter code.
+1. Go to the download page. The URL [http://dumps.wikimedia.org/sewiki/](http://dumps.wikimedia.org/sewiki/) will give you North Sámi; exchange the **se** in _sewiki_ with the language code you want.
+1. In the list that follows, choose the last one **before** _latest/_. The
+ latest one is the same as the one with the last date (it is just a stable url), but the download headers are more nicely formatted in the last dated link.
+1. Download the .bz2 file found under the header
+ **Articles, templates, image descriptions, and primary meta-pages.**
+ This will give you the articles.
+ If you want revision history (e.g. for spellchecker testing), you need
+ _All pages with complete edit history_ (this use is not documented).
1. When downloaded, open the .bz2 file. (On Mac and Linux, just double-click on the file.)
-You now want to convert the xml files to text. Use e.g. the script [https://pypi.org/project/wikiextractor/](WikiExtractor.py). If you have downloaded the svn giellalt file tree from Tromsø, you already have this script, in ``$GTHOME/gt/script/corpus/``. If not, look at the documentation on the script's homepage. The script has a --help option explaining
- usage. Let us say you call the folder for output `outf`.
+You now want to convert the xml files to text. Use e.g. the script [WikiExtractor.py](https://pypi.org/project/wikiextractor/). If you have downloaded the svn giellalt file tree from Tromsø, you already have this script, in `$GTHOME/gt/script/corpus/`. If not, look at the documentation on the script's homepage. The script has a --help option explaining
+usage. Let us say you call the folder for output `outf`.
1. The output is xml. If you want clean text, you may strip the tags.
Here are two ways of stripping xml tags. First, just with sed:
- ```
- cat outf/* | sed 's/<[^>]*>//g;' | ...
- ```
+ cat outf/* | sed 's/<[^>]*>//g;' | ...
For Tromsø users we have made a script that refines this command somewhat. It is also found in `$GTHOME/gt/script/corpus/` and is called `rydd_i_wikipedia.sh`:
- ```
- cat outf/* | sh $GTHOME/gt/script/corpus/rydd_i_wikipedia.sh | ...
- ```
+ cat outf/* | sh $GTHOME/gt/script/corpus/rydd_i_wikipedia.sh | ...
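For readers who prefer Python to sed, roughly the same tag stripping can be sketched as below. It mirrors the sed expression `s/<[^>]*>//g` and does nothing more; it is not a port of `rydd_i_wikipedia.sh`, and the function name is chosen for the example.

```python
import re

# Same pattern as in the sed command: "<", anything but ">", then ">".
TAG = re.compile(r"<[^>]*>")


def strip_tags(text: str) -> str:
    """Remove everything that looks like an xml tag from the text."""
    return TAG.sub("", text)


print(strip_tags('<doc id="1" title="Sápmi">Sápmi lea</doc>'))
```

One difference from sed: sed works line by line, whereas this pattern also removes a tag that is split over two lines.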
diff --git a/ling/bokhylla/BrukJupyter.md b/ling/bokhylla/BrukJupyter.md
index 280d6906..766bb94a 100644
--- a/ling/bokhylla/BrukJupyter.md
+++ b/ling/bokhylla/BrukJupyter.md
@@ -1,25 +1,24 @@
Documentation for using Bokhylla as a corpus, with the help of Jupyter.
-* [The National Library's documentation](https://nbviewer.jupyter.org/github/DH-LAB-NB/DHLAB/blob/master/DHLAB_ved_Nasjonalbiblioteket.ipynb)
-* [Installing Jupyter](https://realpython.com/jupyter-notebook-introduction/)
+- [The National Library's documentation](https://nbviewer.jupyter.org/github/DH-LAB-NB/DHLAB/blob/master/DHLAB_ved_Nasjonalbiblioteket.ipynb)
+- [Installing Jupyter](https://realpython.com/jupyter-notebook-introduction/)
# Installation
These are our notes; see also Jupyter's own notes (above).
-* What you need to run jupyter:
-**python3, which you may already have (if not, install it, e.g. from Anaconda):
-** [Anaconda](https://www.datacamp.com/community/tutorials/installing-anaconda-mac-os-x) (a Python distribution)
-** Then jupyter, installed with these two commands: \\
-python3 -m pip install setuptools \\
-python3 -m pip install jupyter
-* Then start jupyter in the terminal: \\
-jupyter notebook
+- What you need to run jupyter:
+  - python3, which you may already have (if not, install it, e.g. from Anaconda):
+  - [Anaconda](https://www.datacamp.com/community/tutorials/installing-anaconda-mac-os-x) (a Python distribution)
+  - Then jupyter, installed with these two commands: `python3 -m pip install setuptools` and `python3 -m pip install jupyter`
+- Then start jupyter in the terminal: `jupyter notebook`
Note that **Python3** should be shown at the top right of the page that opens in the browser.
-# Lars Johnsen's seminar
+## Lars Johnsen's seminar
@@ -53,7 +52,7 @@ Rediger denne fila (t.d. ved å fjerne irrelevante bøker), eller berre spar hen
If the next session starts from scratch, we must import bokhylla again (see above).
-Then load the corpus back in from bokhylla using the (edited) xls file:
+Then load the corpus back in from bokhylla using the (edited) xls file:
@@ -85,11 +84,11 @@ Translation
And in the example with sami_count (a count for the whole corpus), it must first be converted to a data frame, since it is a Counter object (a dict with extra functionality, generally handy for counting text):
-sami_count = nb.frame(sami_count) # nb.frame() is a wrapper for pandas functionality with a little extra...
+sami_count = nb.frame(sami_count) ## nb.frame() is a wrapper for pandas functionality with a little extra...
then it can be summed:
@@ -163,9 +162,9 @@ Nye kommandoer og hjelpekommandoer kan du lage som du vil egentlig. Mesteparten
import dhlab.module_update as mu
-mu.update('nbtext') # downloads nbtext.py
-mu.update('nbtokenizer') # tokenizer for Norwegian
-mu.update('token_map') # for name handling
+mu.update('nbtext') ## downloads nbtext.py
+mu.update('nbtokenizer') ## tokenizer for Norwegian
+mu.update('token_map') ## for name handling
Hope this helps.
diff --git a/ling/cgii-writing.md b/ling/cgii-writing.md
index d7cf5299..3976bab3 100644
--- a/ling/cgii-writing.md
+++ b/ling/cgii-writing.md
@@ -1,8 +1,6 @@
-How to write disambiguation files
+# How to write disambiguation files
-Constraint grammar
+## Constraint grammar
The main introduction to CG-2 is Tapanainen 1996. Karlsson & al 1992
gives a good introduction to CG-1, and also the most thorough
@@ -11,18 +9,17 @@ presentation of the philosophy behind the constraint grammar framework.
The project uses the CG-2 formalism, and this formalism is presented
below. The concrete implementation is vislcg.
-The structure of the disambiguation file
+## The structure of the disambiguation file
The disambiguation file has the suffix .rle, in our case it is called
sme-dis.rle, smj-dis.rle, etc. The file consists of the following
sections (an additional section CORRECTIONS may also be used, it then
follows the CONSTRAINTS sections):
-- CONSTRAINTS (there are several CONSTRAINT sections)
+- CONSTRAINTS (there are several CONSTRAINT sections)
### The delimiters
@@ -114,7 +111,6 @@ the rule to work since Acc is not a member of the set PRE-NP-HEAD.
The constraints of the North Saami file are documented
-The format of the constraint rules
+## The format of the constraint rules
diff --git a/ling/common-regex.md b/ling/common-regex.md
index 7488bef2..25beabd3 100644
--- a/ling/common-regex.md
+++ b/ling/common-regex.md
@@ -1,47 +1,34 @@
# Utility regexes found in gt/common/src
The following is the beginning of documentation of the different utility regexes found in `$GTHOME/gt/common/src/`. To be extended as more is found.
-# tag-pos.fst
+## tag-pos.fst
In order to make pos.fst we need a binary tag-pos.fst
This goal depends on tag-pos.regex. The way it
is done is that all tags except the POS one are deleted.
+## tag-not-save.fst
-# tag-not-save.fst
We want to delete the +TV +IV tags for the generator (and other
tags later on). For that we need our tag-deleter.
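What the two tag filters described above do to an analysis string can be illustrated in plain Python. This is a sketch of the idea only, not the xfst regexes themselves; the input format (`lemma+Tag+Tag`), the function names and the small set of POS tags are assumptions made for the example.

```python
from typing import Iterable


def delete_tags(analysis: str, unwanted: Iterable[str] = ("TV", "IV")) -> str:
    """Remove the listed tags, as the tag deleter does for +TV and +IV."""
    lemma, *tags = analysis.split("+")
    kept = [tag for tag in tags if tag not in set(unwanted)]
    return "+".join([lemma] + kept)


def keep_pos(analysis: str, pos_tags: Iterable[str] = ("N", "V", "A", "Adv")) -> str:
    """Delete every tag except the POS tag (hypothetical POS set)."""
    lemma, *tags = analysis.split("+")
    kept = [tag for tag in tags if tag in set(pos_tags)]
    return "+".join([lemma] + kept)


print(delete_tags("boahtit+V+IV+Inf"))
print(keep_pos("boahtit+V+IV+Inf"))
```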
-# hyphen-convert.fst
+## hyphen-convert.fst
No documentation yet
-# hyphen-remove.fst
+## hyphen-remove.fst
No documentation yet
-# tag-inclusion-filter.fst
+## tag-inclusion-filter.fst
No documentation yet
-# tag-no.fst
+## tag-no.fst
No documentation yet
-# webadr.fst
+## webadr.fst
The goal is to make a regex for filenames, urls and mail addresses
diff --git a/ling/common.md b/ling/common.md
index 376c56dc..d72658f9 100644
--- a/ling/common.md
+++ b/ling/common.md
@@ -1,43 +1,39 @@
+# Tutorials
-- [Tutorials for working with Lexc/Twolc, CG, and UNIX
- commands](../lang/common/Tutorials.html)
-- [How to use the morphological parsers](/tools/docu-sme-manual.html)
-- [A flowchart of the analysis pipeline](global-flowchart.html)
+- [Tutorials for working with Lexc/Twolc, CG, and UNIX
+ commands](../lang/common/Tutorials.html)
+- [How to use the morphological parsers](/tools/docu-sme-manual.html)
+- [A flowchart of the analysis pipeline](global-flowchart.html)
-Linguistic issues
+## Linguistic issues
-- Preprocessing of text
- - [How to split text into tokens](preprocessor-usage.html)
- - [Documentation of the preprocessor files](preprocessor.html)
-- Morphological tagging of text, with LEXC and TWOLC
- - [Principles for common (language-independent) lexicon
- entries](../lang/common/PrinciplesForCommonTagsAndLexiconEntries.html)
- - [Handling variation in
- LEXC](../lang/common/Variation_in_lexc.html)
-- Documentation of tags
- - [Compoundtags](../lang/common/CompoundTags.html)
- - [Morphological tags](../lang/common/MorphologicalTags.html)
- - [Derivational tags in Sámi](../lang/common/DerivationOverview.html)
- - [How the different tags are interacting with the
- FSTs](../lang/common/DifferentFSTs.html)
- - [Syntax tags](../lang/common/docu-sme-syntaxtags.html)
- - [Dependency tags](../lang/common/docu-deptags.html)
- - [Semantic tags](../lang/common/SemanticTags.html)
-- Disambiguation of morphological analysis
- - [Morphological disambiguation](docu-disambiguation.html)
+- Preprocessing of text
+ - [How to split text into tokens](preprocessor-usage.html)
+ - [Documentation of the preprocessor files](preprocessor.html)
+- Morphological tagging of text, with LEXC and TWOLC
+ - [Principles for common (language-independent) lexicon
+ entries](../lang/common/PrinciplesForCommonTagsAndLexiconEntries.html)
+ - [Handling variation in
+ LEXC](../lang/common/Variation_in_lexc.html)
+- Documentation of tags
+ - [Compound tags](../lang/common/CompoundTags.html)
+ - [Morphological tags](../lang/common/MorphologicalTags.html)
+ - [Derivational tags in Sámi](../lang/common/DerivationOverview.html)
+ - [How the different tags interact with the
+ FSTs](../lang/common/DifferentFSTs.html)
+ - [Syntax tags](../lang/common/docu-sme-syntaxtags.html)
+ - [Dependency tags](../lang/common/docu-deptags.html)
+ - [Semantic tags](../lang/common/SemanticTags.html)
+- Disambiguation of morphological analysis
+ - [Morphological disambiguation](docu-disambiguation.html)
+## Testing
-- [LEXC/TWOLC work – Testscripts](../lang/common/developingwork.html)
-- Check analysis regressions, TO BE WRITTEN
-- [Testing the disambiguation](docu-distesting.html)
+- [LEXC/TWOLC work – Testscripts](../lang/common/developingwork.html)
+- Check analysis regressions, TO BE WRITTEN
+- [Testing the disambiguation](docu-distesting.html)
-Outdated documentation
+## Outdated documentation
[Analyzed corpus,](corpus_analyze.html) [Correct
corpus,](correct-dir.html) [Corpus plan,](corpus_plan.html) [The tools
diff --git a/ling/corpus_analyze.md b/ling/corpus_analyze.md
index 398d4586..2ad13684 100644
--- a/ling/corpus_analyze.md
+++ b/ling/corpus_analyze.md
@@ -1,8 +1,6 @@
-Corpus analysis
-# Overview and introduction
+# Corpus analysis
+## Overview and introduction
One of the goals of the giellatekno-project is to provide easy access to
the text materials for non-commercial purposes such as research. The
@@ -11,15 +9,14 @@ with which a user can fetch different types of data from the Sámi
corpora. The raw corpus material is collected in co-operation with the
owners of the documents. The documents are preprocessed so that the
texts can be used in research. The process of text collection is
-described in documents [corpus\_conversion.html](corpus_conversion.html)
-and [corpus\_conversion\_tech.html.](corpus_conversion_tech.html) This
+described in the documents [corpus_conversion.html](corpus_conversion.html)
+and [corpus_conversion_tech.html](corpus_conversion_tech.html). This
document describes the process where the document is transferred to the
graphical corpus interface. The graphical corpus interface is developed
and maintained by [Textlaboratoriet](http://www.hf.uio.no/tekstlab/) at the
University of Oslo.
-# How to parallelize documents
+## How to parallelize documents
Files that are ready to be parallelized exist in
`$GTFREE/prestable/converted`. The steps to parallelize between sme and
@@ -33,40 +30,38 @@ nob are:
5. `make`
2. Make an anchor file using the command (note that different text
domains may have different additional anchor files):
- - `generate-anchor-list.pl --lang1=sme --lang2=nob --outdir=$GTFREE $GTHOME/gt/common/src/anchor.txt $GTHOME/gt/common/src/anchor_admin.txt `
+ - `generate-anchor-list.pl --lang1=sme --lang2=nob --outdir=$GTFREE $GTHOME/gt/common/src/anchor.txt $GTHOME/gt/common/src/anchor_admin.txt `
3. Compile TCA2, the sentence aligner, using these commands:
- - `cd $GTHOME/tools/alignment-tools/tca2`
- - `ant`
+ - `cd $GTHOME/tools/alignment-tools/tca2`
+ - `ant`
The files may be parallelized in command-line mode.
1. Parallelize the files in `$GTFREE/prestable/converted/sme` and
`$GTFREE/prestable/converted/nob` using this command:
- - `` for file in `find $GTFREE/prestable/converted/sme -name \*.xml | grep -v .svn`; do corpus-parallel.pl --lang1=sme --lang2=nob $file ; done ``
+ - `` for file in `find $GTFREE/prestable/converted/sme -name \*.xml | grep -v .svn`; do corpus-parallel.pl --lang1=sme --lang2=nob $file ; done ``
2. The output is found in
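The shell loop in step 1 can also be sketched in Python, which makes it easier to inspect the commands before running them. The sketch only builds the command lines for `corpus-parallel.pl`; it does not run anything, and the function name is chosen for the example.

```python
import os
from typing import List


def parallelize_commands(converted_root: str, lang1: str = "sme",
                         lang2: str = "nob") -> List[List[str]]:
    """Build one corpus-parallel.pl command per xml file under lang1."""
    commands = []
    for dirpath, dirnames, filenames in os.walk(os.path.join(converted_root, lang1)):
        # Skip svn bookkeeping directories, like `grep -v .svn` in the loop.
        dirnames[:] = sorted(name for name in dirnames if name != ".svn")
        for name in sorted(filenames):
            if name.endswith(".xml"):
                commands.append([
                    "corpus-parallel.pl",
                    f"--lang1={lang1}",
                    f"--lang2={lang2}",
                    os.path.join(dirpath, name),
                ])
    return commands


# Example use (assumes $GTFREE is set as in the commands above):
# for command in parallelize_commands(
#         os.path.join(os.environ["GTFREE"], "prestable", "converted")):
#     print(" ".join(command))
```

Each command list can be passed to `subprocess.run` once the list looks right.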
The files may also be parallelized in graphical mode.
1. Issue the command
- - ` java -jar $GTHOME/tools/alignment-tools/tca2/dist/lib/alignment.jar `
+ - ` java -jar $GTHOME/tools/alignment-tools/tca2/dist/lib/alignment.jar `
2. To load files when starting tca2 in gui mode, issue this command:
- - ` java -jar $GTHOME/tools/alignment-tools/tca2/dist/lib/alignment.jar -anchor= -in1= -in2= `
+ - ` java -jar $GTHOME/tools/alignment-tools/tca2/dist/lib/alignment.jar -anchor= -in1= -in2= `
To parallelize the other way, exchange the values for lang1 and lang2 in
step 2 and 4, and change the find command in step 4 to
`find $GTFREE/prestable/converted/nob`
-# Analyzing the corpus text.
-## The files and formats
+## Analyzing the corpus text.
+### The files and formats
The project-internal corpus format contains the basic elements, such as
paragraphs, lists and tables that can be extracted from the original
document format. The xml-format of the Saami corpus resources is
-documented in [corpus\_conversion.html](corpus_conversion.html)
+documented in [corpus_conversion.html](corpus_conversion.html)
The original name of the document is preserved in the process with the
suffix indicating the document type, e.g. `file.doc`. When the text is
@@ -81,8 +76,7 @@ those files are indicated with suffix `.sent.xml`, e.g.
The xml-files reside in either `$GTFREE/converted` or
-## XML-format of the analyzed corpus.
+### XML-format of the analyzed corpus.
The XML format of the analyzed text is basically the following:
@@ -91,7 +85,7 @@ The XML format of the analyzed text is basically the following: