what's the spec of a conll representation? #72

vcvpaiva · 2017-03-03T23:55:45Z

A sentence that works in SICK-SANE is

# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=

Using this as a model, we could say that a conll-representation is a pair of a sentence (in the example, "People are walking") and a matrix M. M always has as many rows as there are tokens/words in the sentence (three in the example) and always has ten columns.

column 1 is just a numbering of the tokens.
column 2 is the word as originally given
column 3 is the word of the sentence lowercased and lemmatized (walking ==> walk)
column 4 is the POS of the word in universal POS?
column 5 is the POS in Freeling's POS? (what's the set of labels? VBG=verb gerund, VBP aux?)
column 6 is ALWAYS empty?
column 7 is an encoding of the dependency tree?
column 8 is the collection of labels of the dependency tree
column 9 is always EMPTY?
column 10 is the processing produced by Freeling, including the mapping into PWN and SUMO.

is this correct?
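
A minimal sketch of the "sentence plus matrix" reading described above, in Python. The function name `parse_block` is hypothetical, the field names are the ones used in the CoNLL-U documentation, and the example is split on whitespace only because it is shown here with spaces (CoNLL-U files use tabs).

```python
# Sketch only: split one CoNLL-U style block into (sentence text, matrix of rows).
# parse_block is an illustrative name, not a function from any library.
FIELDS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def parse_block(block):
    text = None
    rows = []
    for line in block.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment line such as "# text = People are walking".
            key, _, value = line[1:].partition("=")
            if key.strip() == "text":
                text = value.strip()
            continue
        cols = line.split()  # assumption: no spaces inside any field
        if len(cols) != len(FIELDS):
            raise ValueError("expected 10 columns, got %d" % len(cols))
        rows.append(dict(zip(FIELDS, cols)))
    return text, rows

example = """
# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
"""
text, rows = parse_block(example)
print(text, len(rows), rows[2]["LEMMA"])
```

Each row is kept as a dictionary keyed by field name, so the matrix has one row per token and ten entries per row.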

fcbr · 2017-03-06T11:47:27Z

The CoNLL-U format is documented here: http://universaldependencies.org/format.html

Notes:

Column 5 is what is called a "language-specific POS". In practice, however, corpora usually put there the legacy POS tag that each corpus was originally annotated with. Since we are using Parsey McParseface, we need to figure out the legacy tagset of the corpus it was trained on. My guess would be the Penn Treebank POS tags.
We are not using enhanced dependencies (column 9).
It looks like the corpus that Parsey was trained on doesn't have morphological features explicitly defined, so column 6 will always be empty. If we used the UD corpus, this column would not be empty.
Column 10 is the MISC column, which is the catch-all for this format, and yes, we are putting everything extra there. This is a big mess, but it is a limitation of the format. I wish we had a better solution to this.
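
To make the MISC packing concrete, here is a small Python sketch that unpacks the column as it appears in the examples of this thread. It assumes the three "|"-separated parts are the Freeling tag, the PWN synset and the SUMO mapping, in that order, and that "?" marks a missing mapping; the function name `split_misc` and the dictionary keys are hypothetical.

```python
# Sketch only: unpack the MISC column as used in this thread's examples.
# Assumption: the layout is always tag|synset|sumo, with "?" for "no mapping".
def split_misc(misc):
    parts = misc.split("|")
    if len(parts) != 3:
        raise ValueError("expected tag|synset|sumo, got %r" % misc)
    tag, synset, sumo = parts
    return {
        "freeling_tag": tag,
        "pwn_synset": None if synset == "?" else synset,
        "sumo": None if sumo == "?" else sumo,  # keeps the trailing "=" or "+"
    }

print(split_misc("NNS|07942152-n|GroupOfPeople="))
print(split_misc("PRP|?|?"))
```

A positional layout like this is fragile, which is part of why the column feels like a mess: nothing in the format itself says which part is which.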

vcvpaiva · 2017-03-06T14:22:23Z

excellent, thanks a lot!

vcvpaiva · 2017-03-06T14:36:47Z

@fcbr many thanks! excellent.

I am trying to understand why in a simple example like:

# text = Someone is writing
1	Someone	someone	NOUN	NN	_	3	nsubj	_	PRP|?|?
2	is	be	VERB	VBZ	_	3	aux	_	VBZ|02604760-v|Entity+
3	writing	write	VERB	VBG	_	0	ROOT	_	VBG|00993014-v|WrittenCommunication=

"someone", which should be a pronoun (PRP), gets tagged as a noun (NN and NOUN) and then doesn't get a mapping. If it were a noun it should get the mapping, as the noun is in PWN.

Finally, can you also shed some light on the need to 'disable' Freeling's compound module and on what you know about the module itself?

arademaker · 2017-03-06T15:22:46Z

@vcvpaiva I agree that we would expect the mapping to http://wnpt.brlcloud.com/wn/synset?id=00007846-n . But Freeling tagged it as PRP (pronoun, see the MISC column) and therefore would not look it up as a noun in WN.Pr. Column 4 is the SyntaxNet tag.

About understanding the errors, we have already talked about this. The parser and Freeling's tagger are trained. In particular, the short sentences of SICK are not helping the parser, which was trained on a corpus with longer sentences of a quite different nature.

arademaker · 2017-03-06T15:25:21Z

@vcvpaiva the documentation of the compound module is at

https://talp-upc.gitbooks.io/freeling-user-manual/content/modules/dictionary.html

vcvpaiva · 2017-03-07T15:01:02Z

@fcbr

We are not using enhanced dependencies (column 9).

why? because we cannot? or because we prefer not?

@arademaker just to make it abundantly clear,

About understanding the errors, we have already talked about this.

I do know how statistical systems work, no need to repeat it. The question here was whether the issue is caused by the disconnect between the POS tags of Parsey and FreeLing, or just by Freeling assigning that tag for no reason.
It seems clear now that the mappings to PWN come straight from Freeling (which makes the discussion with @lluisp important, about having the full collection of PWN lemmas there, including things such as bored, animated, waterboard, jetski). But it is still not clear where/how to add the new mappings that one needs, like all the pronouns, the soon-to-be-created prepositional phrases, etc.

fcbr · 2017-03-07T15:39:01Z

Nothing stops us from using enhanced dependencies, but Parsey McParseface does not emit them, most likely because the corpus it was trained on doesn't contain them (neither does the UD corpus).

arademaker · 2017-03-07T17:17:56Z

@vcvpaiva From what I now understand from the paper:

S. Schuster and C. D. Manning, “Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks,” pp. 1–8, Mar. 2016.

Indeed, we could try to reproduce this paper (if it contains all the rules or points to where they are) to produce enhanced dependencies from the basic dependencies. It seems they used rules in Semgrex: http://nlp.stanford.edu/software/tregex.shtml

I still have not understood the next question.

vcvpaiva · 2017-03-07T17:49:09Z

if it contains all the rules or points to where they are

Yes, it would be very good to be able to use their "enhanced dependencies".
Unfortunately the paper only says that the rules are in CoreNLP, which is a monster of a codebase...
but the paper is very good!

What is the next question? The one below?

but still not clear where/how to add new mappings that one needs, like all the pronouns, the soon to be created prepositional phrases, etc.

We know that we will have to remove pieces of the mappings and also that we will have to add things to the conll representation, right? For example, for the sentence:

# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=

we want to "read out" the following representation from this dependency:

subconcept-of (people2, GroupOfPeople)
subconcept-of (walking1, Walking)
subj(walking1,people2)

and for the second sentence

# text = Someone is writing
1 Someone someone NOUN NN _ 3 nsubj _ PRP|?|?
2 is be VERB VBZ _ 3 aux _ VBZ|02604760-v|Entity+
3 writing write VERB VBG _ 0 ROOT _ VBG|00993014-v|WrittenCommunication=

we want to "read out":

subconcept-of (writing1 , WrittenCommunication)
subconcept-of (someone2 ,Person)
subj(writing1, someone2)

Hence we need a mechanism to invent mappings for pronouns, since the Freeling mechanism only works for nouns, verbs, adjectives and adverbs.
We will need this mechanism for all the prepositional phrases that we decide to use,
and I proposed that we use at least those of the Enhanced Dependencies, repeated in #65.
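
A rough Python sketch of such a read-out, under several assumptions that are not settled in this thread: individuals are named by lowercased word form plus token ID (the examples above use a different numbering), auxiliaries are dropped, the trailing "=" or "+" of the SUMO mapping is discarded, and pronouns get their concept from a hand-written table. `read_out` and `PRONOUN_CONCEPTS` are hypothetical names.

```python
# Sketch only: read subconcept-of / subj facts out of the 10-column rows.
# PRONOUN_CONCEPTS is a hypothetical hand-written fallback for words that
# Freeling leaves unmapped ("?" in the MISC column).
PRONOUN_CONCEPTS = {"someone": "Person"}

def read_out(rows):
    """rows: one list of 10 strings per token (ID, FORM, LEMMA, ..., MISC)."""
    # Name each individual by lowercased form plus token ID (an assumption).
    names = {r[0]: r[1].lower() + r[0] for r in rows}
    facts = []
    for r in rows:
        tok_id, lemma, head, deprel, misc = r[0], r[2], r[6], r[7], r[9]
        if deprel == "aux":
            continue  # assumption: auxiliaries are left out of the read-out
        sumo = misc.split("|")[-1]
        if sumo == "?":
            concept = PRONOUN_CONCEPTS.get(lemma)
        else:
            concept = sumo.rstrip("=+")  # drop the mapping-type suffix
        if concept:
            facts.append("subconcept-of(%s, %s)" % (names[tok_id], concept))
        if deprel == "nsubj":
            facts.append("subj(%s, %s)" % (names[head], names[tok_id]))
    return facts

rows = [
    ["1", "Someone", "someone", "NOUN", "NN", "_", "3", "nsubj", "_", "PRP|?|?"],
    ["2", "is", "be", "VERB", "VBZ", "_", "3", "aux", "_", "VBZ|02604760-v|Entity+"],
    ["3", "writing", "write", "VERB", "VBG", "_", "0", "ROOT", "_",
     "VBG|00993014-v|WrittenCommunication="],
]
for fact in read_out(rows):
    print(fact)
```

The fallback table is the part that would have to grow to cover all pronouns and, later, the prepositional phrases.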

vcvpaiva · 2017-03-07T19:46:57Z

The validator spec is implicit in the rules for the representations (at http://universaldependencies.org/format.html) and in the error signals the validator emits (UniversalDependencies/tools#20 (comment)).

So the CoNLL-U format rules say:

Sentences consist of one or more word lines (no empty sentences), and word lines contain the following fields:

ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
FORM: Word form or punctuation symbol.
LEMMA: Lemma or stem of word form.
UPOSTAG: Universal part-of-speech tag.
XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
HEAD: Head of the current word, which is either a value of ID or zero (0).
DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
MISC: Any other annotation.

First the validator checks whether there are spurious empty lines (non-spurious empty lines are those between the representations of sentences) or spurious comment lines (# text), and counts the number of columns for word lines, which needs to be ten.

The first 6 error codes in @martinp are about this:

"Spurious empty line.",u"Format"
"Spurious comment line.",u"Format"
"The line has %d columns, but %d (10) are expected."%(len(cols),COLCOUNT),u"Format"
"Spurious line: '%s'. All non-empty lines should start with a digit or the # character."%(line),u"Format"
"Missing empty line after the last tree.",u"Format" (tree?)
"Spurious sent_id line: '%s' Should look like '# sent_id = xxxxxx' where xxxx is not whitespace. Forward slash reserved for special purposes." %c,u"Metadata"

vcvpaiva · 2017-03-07T20:04:54Z

the next 9 error codes are about metadata and I am not worrying about it.
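
For reference, the format-level checks described above (empty lines, comment lines, column count) can be sketched in a few lines of Python. This is not the code of validator.py, only an illustration of the checks as I read them from the error messages; `format_errors` is a hypothetical name, and the rule used here for "spurious" is an assumption.

```python
# Sketch only, not validator.py: line-level format checks on a CoNLL-U file.
COLCOUNT = 10

def format_errors(lines):
    """Return a list of (line number, message) pairs."""
    errors = []
    words_seen = 0  # word lines read so far in the current sentence
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            # Assumption: an empty line is fine only right after word lines.
            if words_seen == 0:
                errors.append((n, "Spurious empty line."))
            words_seen = 0
        elif line.startswith("#"):
            # Assumption: comments are allowed only before the word lines.
            if words_seen > 0:
                errors.append((n, "Spurious comment line."))
        elif line[0].isdigit():
            cols = line.split("\t")
            if len(cols) != COLCOUNT:
                errors.append((n, "The line has %d columns, but %d are expected."
                               % (len(cols), COLCOUNT)))
            words_seen += 1
        else:
            errors.append((n, "Spurious line: '%s'." % line))
    if words_seen > 0:
        errors.append((len(lines), "Missing empty line after the last tree."))
    return errors

bad = ["", "1\tPeople\tpeople", "# late comment"]
for n, msg in format_errors(bad):
    print(n, msg)
```

Note that the column count is taken after splitting on tabs, so an example pasted with spaces instead of tabs would already fail this check.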

fcbr · 2017-03-07T21:07:42Z

please notice that validator.py checks not only the CoNLL-U format, but also the UD tagset.

vcvpaiva · 2017-03-07T22:32:55Z

Well, it takes as parameters the UD tagset, the set of UD dependency relations (DEPREL) and the set of morphological features (whatever this may be, depending on the language), at least
if I am reading the other error codes correctly, e.g.

"Morphological features must be sorted: '%s'"%feats,u"Morpho"
"Spurious morphological feature: '%s'. Should be of the form attribute=value and must start with [A-Z0-9] and only contain [A-Za-z0-9]."%f,u"Morpho"
"Repeated feature values are disallowed: %s"%feats,u"Morpho"
"If an attribute has multiple values, these must be sorted as well: '%s'"%f,u"Morpho"
"Incorrect value '%s' in '%s'. Must start with [A-Z0-9] and only contain [A-Za-z0-9]."%(v,f),u"Morpho"
"Unknown attribute-value pair %s=%s"%(attr,v),u"Morpho"
"Repeated features are disallowed: %s"%feats, u"Morpho"
"Unknown UPOS tag: %s"%cols[UPOSTAG],u"Morpho"
"Unknown XPOS tag: %s"%cols[XPOSTAG],u"Morpho"
"Unknown UD DEPREL: %s"%cols[DEPREL],u"Syntax"
"Malformed head:deprel pair '%s'"%head_deprel,u"Syntax"
"Unknown dependency relation '%s' in '%s'"%(deprel,head_deprel),u"Syntax"

I had stopped analyzing the other error codes (at number 15), but I meant to continue checking them later tonight.
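
As an illustration of what the "Morpho" messages above imply for the FEATS column, here is a hedged Python sketch. It is not the validator's code: `feats_errors` is a hypothetical name, the regular expression is derived from the wording of the messages, and the case-insensitive sort order is an assumption. The check against a list of known attribute-value pairs is left out.

```python
import re

# Sketch only, not validator.py. One feature looks like Attribute=Value or
# Attribute=Val1,Val2; each part starts with [A-Z0-9] and contains only
# [A-Za-z0-9], as the error messages state.
NAME = r"[A-Z0-9][A-Za-z0-9]*"
FEATURE = re.compile(r"^%s=%s(,%s)*$" % (NAME, NAME, NAME))

def feats_errors(feats):
    if feats == "_":
        return []  # underscore means "no features"
    errors = []
    items = feats.split("|")
    # Assumption: sorting ignores case.
    if items != sorted(items, key=str.lower):
        errors.append("Morphological features must be sorted: '%s'" % feats)
    attrs = []
    for item in items:
        if not FEATURE.match(item):
            errors.append("Spurious morphological feature: '%s'" % item)
            continue
        attr, values = item.split("=", 1)
        attrs.append(attr)
        vals = values.split(",")
        if len(set(vals)) != len(vals):
            errors.append("Repeated feature values are disallowed: %s" % feats)
        if vals != sorted(vals, key=str.lower):
            errors.append("If an attribute has multiple values, "
                          "these must be sorted as well: '%s'" % item)
    if len(set(attrs)) != len(attrs):
        errors.append("Repeated features are disallowed: %s" % feats)
    return errors

print(feats_errors("Number=Plur|Case=Nom"))
```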

fcbr · 2017-03-07T22:35:19Z

Absolutely, good reminder. I was thinking of the default behavior.

vcvpaiva · 2017-03-07T22:44:52Z

What I wanted to know, though, was whether it would check more semantic things about the meaning of the dependencies,
like saying that a ROOT can only be a main verb, or an adjective or a noun in case the sentence is copular,
or that if a dependency is dobj it should relate a verb to a noun phrase (perhaps? I don't know).

But I believe that there are more things that it could be saying...

vcvpaiva · 2017-03-07T22:50:10Z

@fcbr I am very interested in understanding how they check

"Words do not form a sequence. Got: %s."%(u",".join(unicode(x) for x in words)),u"Format"

because most of our issues at the moment in the Bosque have to do with multiword tokens.
and in SICK-SANE have to do with creating appropriate multiword tokens.
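
My guess at what such a check amounts to, as a Python sketch (this is not the validator's code): plain word IDs must run 1, 2, 3, ... while multiword-token ranges such as "1-2" and empty nodes such as "1.1" are set aside. The function name `sequence_errors` and the range message are hypothetical.

```python
# Sketch only, not validator.py: check the ID column of one sentence.
def sequence_errors(ids):
    """ids: the ID fields of one sentence, in file order, as strings."""
    words = []
    ranges = []
    for tok_id in ids:
        if "-" in tok_id:
            # Multiword token line such as "1-2"; it covers words 1 and 2.
            start, end = (int(x) for x in tok_id.split("-"))
            ranges.append((start, end))
        elif "." in tok_id:
            continue  # empty node such as "1.1"; ignored in this sketch
        else:
            words.append(int(tok_id))
    errors = []
    if words != list(range(1, len(words) + 1)):
        errors.append("Words do not form a sequence. Got: %s."
                      % ",".join(str(w) for w in words))
    for start, end in ranges:
        # Hypothetical extra check: a range must point at existing words.
        if start < 1 or end > len(words):
            errors.append("Range %d-%d goes beyond the words of the sentence."
                          % (start, end))
    return errors

print(sequence_errors(["1-2", "1", "2", "3"]))
print(sequence_errors(["1", "3"]))
```

Under this reading, a multiword token does not consume an ID of its own, so the words it covers still have to appear as numbered lines right after it.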

vcvpaiva · 2017-03-07T22:59:57Z

@arademaker as you can see, the representation

subconcept-of (people2, GroupOfPeople)
subconcept-of (walking1, Walking)
subj(walking1,people2)

is very similar to TIL (over SUMO), instead of TIL over Unified Lexicon, as we have with XLE.

The only thing missing is the context "true", which is not necessary in this simple case.

But the representation is also very close to OWL, as we always have triples and we pushed the difficult bits about quantifiers somewhere else, just like Enhanced Dependencies do.

vcvpaiva · 2017-03-07T23:11:24Z

@fcbr have you tried to download and run UDPipe https://github.com/ufal/udpipe?

did you see the numbers in
http://ufal.mff.cuni.cz/udpipe/users-manual#universal_dependencies_12_models
they say they can do better in PT than in EN, which is crazy.

vcvpaiva · 2017-03-17T22:34:53Z

Closing this issue, as the representation is described in detail in
https://github.com/own-pt/cl-conllu

vcvpaiva closed this as completed Mar 17, 2017
