what's the spec of a conll representation? #72

Closed
vcvpaiva opened this issue Mar 3, 2017 · 19 comments

@vcvpaiva
Member

vcvpaiva commented Mar 3, 2017

A sentence that works in SICK-SANE is

# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=

Using this as a model, we could say that a conll-representation is a pair of a sentence (in the example, People are walking) and a matrix M. The matrix M always has as many rows as there are tokens/words in the sentence (only 3 in the example), and it always has ten columns.

  1. column 1 is just a numbering of the tokens.
  2. column 2 is the word as originally given.
  3. column 3 is the word of the sentence lowercased and lemmatized (walking ==> walk).
  4. column 4 is the POS of the word in the universal POS tagset?
  5. column 5 is the POS in Freeling's tagset? (what's the set of labels? VBG = verb gerund, VBP = aux?)
  6. column 6 is ALWAYS empty?
  7. column 7 is an encoding of the dependency tree?
  8. column 8 is the collection of labels of the dependency tree.
  9. column 9 is ALWAYS empty?
  10. column 10 is the processing produced by Freeling, including the mapping into PWN and SUMO.

Is this correct?

@fcbr
Member

fcbr commented Mar 6, 2017

The CoNLL-U format is documented at http://universaldependencies.org/format.html.

Notes:

  1. column 5 is what is called a "language specific POS". In practice, however, corpora usually put there the legacy POS tag that each corpus was originally annotated with. Since we are using Parsey McParseface, we need to figure out the legacy tagset of the corpus it was trained on. My guess would be the Penn Treebank POS tags.

  2. We are not using enhanced dependencies (column 9).

  3. Looks like the corpus that Parsey was trained on doesn't have morphological features explicitly defined, so column 6 will always be empty. If we used the UD corpus, this column would not be empty.

  4. Column 10 is the MISC column, which is the catch-all for this format, and yes, we are putting everything extra there. This is a big mess, but it is a limitation of the format. I wish we had a better solution to this.
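To make the "everything extra goes into MISC" point concrete, here is a minimal sketch of unpacking values such as `NNS|07942152-n|GroupOfPeople=`. The field order (Freeling tag, PWN synset, SUMO concept) is inferred from the examples in this thread, the names `MiscAnnotation` and `parse_misc` are hypothetical, and treating `?` as "no mapping" is an assumption. The trailing `=` or `+` is kept as an opaque marker, since its meaning is not stated here.

```python
from typing import NamedTuple, Optional


class MiscAnnotation(NamedTuple):
    freeling_tag: str
    synset: Optional[str]        # e.g. "07942152-n", or None when missing
    sumo_concept: Optional[str]  # e.g. "GroupOfPeople", or None when missing
    sumo_marker: Optional[str]   # trailing "=" or "+", kept uninterpreted


def parse_misc(misc: str) -> MiscAnnotation:
    """Unpack a MISC value of the (assumed) form TAG|SYNSET|SUMO."""
    parts = misc.split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '|'-separated fields, got {len(parts)}")
    tag, synset, sumo = parts
    # Assumption: "?" marks a field for which no mapping was found.
    synset_value = None if synset == "?" else synset
    if sumo == "?":
        return MiscAnnotation(tag, synset_value, None, None)
    # Split off a trailing non-alphanumeric marker, if there is one.
    if sumo and not sumo[-1].isalnum():
        return MiscAnnotation(tag, synset_value, sumo[:-1], sumo[-1])
    return MiscAnnotation(tag, synset_value, sumo, None)


print(parse_misc("NNS|07942152-n|GroupOfPeople="))
print(parse_misc("PRP|?|?"))
```

A fixed three-field layout is fragile, which is part of the "big mess" described above: any extra annotation added to MISC later would break this parser.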

@vcvpaiva
Member Author

vcvpaiva commented Mar 6, 2017

excellent, thanks a lot!

@vcvpaiva
Member Author

vcvpaiva commented Mar 6, 2017

@fcbr many thanks! excellent.

I am trying to understand why in a simple example like:

# text = Someone is writing
1	Someone	someone	NOUN	NN	_	3	nsubj	_	PRP|?|?
2	is	be	VERB	VBZ	_	3	aux	_	VBZ|02604760-v|Entity+
3	writing	write	VERB	VBG	_	0	ROOT	_	VBG|00993014-v|WrittenCommunication=

"someone", which should be a pronoun (PRP), gets tagged as a noun (NN and NOUN) and then doesn't get a mapping. If it were a noun it should get the mapping, as the noun is in PWN.

Finally, can you also shed some light on the need to 'disable' Freeling's compound module, and on what you know about the module itself?
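The tag disagreement in the "Someone is writing" example can be surfaced mechanically. The sketch below assumes that column 5 holds the tag emitted by the parser and that the first `|`-separated field of MISC holds the Freeling tag. That layout is read off the examples in this thread, and `tag_disagreements` is a hypothetical helper name.

```python
def tag_disagreements(word_lines):
    """List (id, form, column-5 tag, MISC tag) for tokens whose tags differ."""
    found = []
    for line in word_lines:
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed word line; skip it in this sketch
        misc_tag = cols[9].split("|")[0]
        if cols[4] != misc_tag:
            found.append((cols[0], cols[1], cols[4], misc_tag))
    return found


example = [
    "1\tSomeone\tsomeone\tNOUN\tNN\t_\t3\tnsubj\t_\tPRP|?|?",
    "2\tis\tbe\tVERB\tVBZ\t_\t3\taux\t_\tVBZ|02604760-v|Entity+",
    "3\twriting\twrite\tVERB\tVBG\t_\t0\tROOT\t_\tVBG|00993014-v|WrittenCommunication=",
]
print(tag_disagreements(example))  # [('1', 'Someone', 'NN', 'PRP')]
```

Running a check like this over a whole corpus would show how often the two tools disagree, and on which word classes.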

@arademaker
Member

@vcvpaiva I agree that we would expect the mapping to http://wnpt.brlcloud.com/wn/synset?id=00007846-n . But Freeling tagged it as PRP (pronoun, see the MISC column), so it would not look it up as a noun in WN.Pr. Column 4 is the SyntaxNet tag.

As for understanding the errors, we have already talked about this. The parser and Freeling's tagger are trained. In particular, the short sentences of SICK do not help the parser, which was trained on a corpus of longer sentences of a quite different nature.

@arademaker
Member

@vcvpaiva the documentation for the compound module is at

https://talp-upc.gitbooks.io/freeling-user-manual/content/modules/dictionary.html

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

@fcbr

We are not using enhanced dependencies (column 9).

Why? Because we cannot, or because we prefer not to?

@arademaker just to make it abundantly clear,

As for understanding the errors, we have already talked about this.

I do know how statistical systems work; no need to repeat it. The question here was whether the issue comes from the mismatch between the POS tags of Parsey and Freeling, or simply from Freeling assigning a tag for no apparent reason.
It seems clear now that the mappings to PWN come straight from Freeling (which makes the discussion with @lluisp about having the full collection of PWN lemmas there important, including items such as bored, animated, waterboard, jetski). But it is still not clear where or how to add the new mappings that one needs, like all the pronouns, the soon-to-be-created prepositional phrases, etc.

@fcbr
Member

fcbr commented Mar 7, 2017

Nothing stops us from using enhanced dependencies, but Parsey McParseface does not emit them, most likely because the corpus it was trained on doesn't contain them (neither does the UD corpus).

@arademaker
Member

@vcvpaiva From what I now understand from the paper:

S. Schuster and C. D. Manning, “Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks,” pp. 1–8, Mar. 2016.

Indeed, we could try to reproduce this paper (if it contains all the rules or points to where they are) to produce enhanced dependencies from the basic dependencies. It seems they used rules written in Semgrex: http://nlp.stanford.edu/software/tregex.shtml

I still haven't understood the next question.

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

if it contains all the rules or points to where they are

Yes, it would be very good to be able to use their "enhanced dependencies".
Unfortunately, the paper only says that the rules are in CoreNLP, which is a monster of a codebase...
but the paper is very good!

Which is the next question? The one below?

but still not clear where/how to add new mappings that one needs, like all the pronouns, the soon to be created prepositional phrases, etc.

We know that we will have to remove parts of the mappings and also add things to the CoNLL representation, right? For example, for the sentence:

# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=

we want to "read out" the following representation from this dependency:

subconcept-of (people2, GroupOfPeople)
subconcept-of (walking1, Walking)
subj(walking1, people2)

and for the second sentence

# text = Someone is writing
1 Someone someone NOUN NN _ 3 nsubj _ PRP|?|?
2 is be VERB VBZ _ 3 aux _ VBZ|02604760-v|Entity+
3 writing write VERB VBG _ 0 ROOT _ VBG|00993014-v|WrittenCommunication=

we want to "read out":

subconcept-of (writing1, WrittenCommunication)
subconcept-of (someone2, Person)
subj(writing1, someone2)

Hence we need a mechanism for inventing mappings for pronouns, since the Freeling mechanism only works for nouns, verbs, adjectives, and adverbs.
We will need this mechanism for all the prepositional phrases we decide to use.
And I proposed that we use at least those from the Enhanced Dependencies, repeated in #65.
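As a rough illustration of the read-out and of the fallback mechanism for pronouns, here is a minimal sketch. Several things are assumptions rather than anything decided in this thread: the table `FALLBACK_CONCEPTS` is a hypothetical hand-written mapping, auxiliaries are skipped because they do not appear in the target representation, and node names are built from the lowercased form plus the token ID, so they differ from the ad hoc indices (`people2`, `walking1`) used above.

```python
# Hypothetical hand-written mappings for words Freeling leaves unmapped.
FALLBACK_CONCEPTS = {"someone": "Person"}


def node_name(row):
    """Name a node after its lowercased form and token ID, e.g. 'people1'."""
    return f"{row[1].lower()}{row[0]}"


def read_out(word_lines):
    """Turn word lines into subconcept-of and subj facts."""
    rows = [line.split() for line in word_lines]
    by_id = {row[0]: row for row in rows}
    facts = []
    for row in rows:
        if row[7] == "aux":
            continue  # assumption: auxiliaries contribute no concept
        concept = row[9].split("|")[2]
        if concept == "?":
            concept = FALLBACK_CONCEPTS.get(row[2])  # look up by lemma
        else:
            concept = concept.rstrip("=+")  # drop the trailing marker
        if concept:
            facts.append(f"subconcept-of({node_name(row)}, {concept})")
    for row in rows:
        if row[7] == "nsubj":
            facts.append(f"subj({node_name(by_id[row[6]])}, {node_name(row)})")
    return facts


someone = [
    "1 Someone someone NOUN NN _ 3 nsubj _ PRP|?|?",
    "2 is be VERB VBZ _ 3 aux _ VBZ|02604760-v|Entity+",
    "3 writing write VERB VBG _ 0 ROOT _ VBG|00993014-v|WrittenCommunication=",
]
for fact in read_out(someone):
    print(fact)
```

Keeping the fallback table separate from the Freeling output makes it easy to see which mappings were invented by hand and which came from the tool.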

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

The validator spec is implicit in the rules for the representation (at http://universaldependencies.org/format.html) and in the error messages the validator emits (UniversalDependencies/tools#20 (comment)).

So the CoNLL-U format rules say:

Sentences consist of one or more word lines (no empty sentences), and word lines contain the following fields:

  1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
  2. FORM: Word form or punctuation symbol.
  3. LEMMA: Lemma or stem of word form.
  4. UPOSTAG: Universal part-of-speech tag.
  5. XPOSTAG: Language-specific part-of-speech tag; underscore if not available.
  6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  7. HEAD: Head of the current word, which is either a value of ID or zero (0).
  8. DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  10. MISC: Any other annotation.
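The ten fields listed above map naturally onto a small record type. The following is a minimal sketch, not the code of any existing library. The names `Word`, `parse_word_line`, and `parse_sentence` are hypothetical, and the whitespace fallback exists only because the examples pasted in this thread are space-separated, whereas the format itself uses tabs.

```python
from typing import NamedTuple


class Word(NamedTuple):
    id: str
    form: str
    lemma: str
    upostag: str
    xpostag: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_word_line(line: str) -> Word:
    """Split one word line into the ten CoNLL-U fields."""
    line = line.rstrip("\n")
    # Tabs are the real separator; whitespace is a fallback for this sketch.
    cols = line.split("\t") if "\t" in line else line.split()
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    return Word(*cols)


def parse_sentence(block: str):
    """Return (comment lines, words) for one sentence block."""
    comments, words = [], []
    for line in block.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            comments.append(line)
        else:
            words.append(parse_word_line(line))
    return comments, words


block = """# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
"""
comments, words = parse_sentence(block)
for word in words:
    print(word.id, word.form, word.head, word.deprel)
```

All fields are kept as strings on purpose, because ID can be a range or a decimal number and therefore cannot always be converted to an integer.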

First the validator checks whether there are spurious empty lines (non-spurious empty lines are between reps of sentences) or spurious comment lines (#text) and counts the number of columns for word-lines, which needs to be ten.

The first 6 error codes in @martinp are about this:

  1. "Spurious empty line.",u"Format"

  2. "Spurious comment line.",u"Format"

  3. "The line has %d columns, but %d (10) are expected."%(len(cols),COLCOUNT),u"Format"

  4. "Spurious line: '%s'. All non-empty lines should start with a digit or the # character."%(line),u"Format"

  5. "Missing empty line after the last tree.",u"Format" (tree?)

  6. "Spurious sent_id line: '%s' Should look like '# sent_id = xxxxxx' where xxxx is not whitespace. Forward slash reserved for special purposes." %c,u"Metadata"
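The layout checks behind the first few messages can be sketched as follows. This is my reading of what the messages imply, not the actual code of validator.py. The function name `format_errors` is hypothetical, while `COLCOUNT` mirrors the name that appears in the quoted message.

```python
COLCOUNT = 10  # number of tab-separated columns expected on a word line


def format_errors(text):
    """Return (line_number, message) pairs for basic layout problems."""
    errors = []
    in_tree = False  # True while inside the word lines of a sentence
    number = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            # An empty line is only legitimate right after a sentence.
            if not in_tree:
                errors.append((number, "Spurious empty line."))
            in_tree = False
        elif line.startswith("#"):
            # Comments belong before the first word line of a sentence.
            if in_tree:
                errors.append((number, "Spurious comment line."))
        elif line[0].isdigit():
            cols = line.split("\t")
            if len(cols) != COLCOUNT:
                errors.append((number, "The line has %d columns, but %d are expected."
                               % (len(cols), COLCOUNT)))
            in_tree = True
        else:
            errors.append((number, "Spurious line: '%s'." % line))
    if in_tree:
        errors.append((number, "Missing empty line after the last tree."))
    return errors


good = ("# text = People are walking\n"
        "1\tPeople\tpeople\tNOUN\tNNS\t_\t3\tnsubj\t_\t_\n"
        "\n")
print(format_errors(good))
print(format_errors(good.rstrip("\n")))
```

The second call drops the final blank line, so it reports the missing empty line after the last tree.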

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

The next 9 error codes are about metadata, and I am not worrying about them.

@fcbr
Member

fcbr commented Mar 7, 2017

please notice that validator.py checks not only the CoNLL-U format, but also the UD tagset.

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

Well, it takes as parameters at least the UD tagset, the set of UD dependency relations (DEPREL), and the set of morphological features (whatever this may be, depending on the language),
if I am reading the other error codes correctly, e.g.

  1. "Morphological features must be sorted: '%s'"%feats,u"Morpho"
  2. "Spurious morphological feature: '%s'. Should be of the form attribute=value and must start with [A-Z0-9] and only contain [A-Za-z0-9]."%f,u"Morpho"
  3. "Repeated feature values are disallowed: %s"%feats,u"Morpho"
  4. "If an attribute has multiple values, these must be sorted as well: '%s'"%f,u"Morpho"
  5. "Incorrect value '%s' in '%s'. Must start with [A-Z0-9] and only contain [A-Za-z0-9]."%(v,f),u"Morpho"
  6. "Unknown attribute-value pair %s=%s"%(attr,v),u"Morpho"
  7. "Repeated features are disallowed: %s"%feats, u"Morpho"
  8. "Unknown UPOS tag: %s"%cols[UPOSTAG],u"Morpho"
  9. "Unknown XPOS tag: %s"%cols[XPOSTAG],u"Morpho"
  10. "Unknown UD DEPREL: %s"%cols[DEPREL],u"Syntax"
  11. "Malformed head:deprel pair '%s'"%head_deprel,u"Syntax"
  12. "Unknown dependency relation '%s' in '%s'"%(deprel,head_deprel),u"Syntax"
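The morphological-feature messages suggest checks along the following lines. This is a sketch written from the wording of the messages, not the validator's own code. `feats_errors` and `known_pairs` are hypothetical names, and comparing case-insensitively when checking the sort order is an assumption.

```python
import re

# attribute=value[,value...] using the character classes named in the messages
FEATURE = re.compile(
    r"[A-Z0-9][A-Za-z0-9]*=[A-Z0-9][A-Za-z0-9]*(?:,[A-Z0-9][A-Za-z0-9]*)*"
)


def feats_errors(feats, known_pairs=None):
    """Check one FEATS value; known_pairs is an optional set of (attr, value)."""
    errors = []
    if feats == "_":
        return errors  # underscore means "no features"
    items = feats.split("|")
    if items != sorted(items, key=str.lower):
        errors.append("Morphological features must be sorted: '%s'" % feats)
    attributes = []
    for item in items:
        if FEATURE.fullmatch(item) is None:
            errors.append("Spurious morphological feature: '%s'" % item)
            continue
        attr, _, value_list = item.partition("=")
        attributes.append(attr)
        values = value_list.split(",")
        if len(set(values)) != len(values):
            errors.append("Repeated feature values are disallowed: %s" % feats)
        if values != sorted(values, key=str.lower):
            errors.append("Multiple values must be sorted: '%s'" % item)
        if known_pairs is not None:
            for value in values:
                if (attr, value) not in known_pairs:
                    errors.append("Unknown attribute-value pair %s=%s" % (attr, value))
    if len(set(attributes)) != len(attributes):
        errors.append("Repeated features are disallowed: %s" % feats)
    return errors


print(feats_errors("Case=Nom|Number=Plur"))
print(feats_errors("Number=Plur|Case=Nom"))
```

Passing `known_pairs` is what corresponds to handing the validator a language-specific feature inventory; without it, only the shape of the value is checked.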

I had stopped analyzing the other error codes (at number 15), but I meant to continue checking them later tonight.

@fcbr
Member

fcbr commented Mar 7, 2017

Absolutely, good reminder. I was thinking of the default behavior.

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

What I actually wanted to know was whether it would also do more semantic checks on the meaning of the dependencies,
like saying that a ROOT can only be a main verb, or an adjective or a noun in case the sentence is copular,
or that a dobj dependency should relate a verb to a noun phrase (perhaps? I don't know).

But I believe that there are more things that it could be saying...
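To make the idea of such semantic checks concrete, here is a small sketch of the two constraints mentioned above. These are purely hypothetical rules for illustration. Nothing in this thread says the validator implements them, and the allowed tag sets are assumptions.

```python
ROOT_UPOS = {"VERB", "ADJ", "NOUN"}          # assumed allowed tags for a root
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON"}     # assumed heads of noun phrases


def semantic_warnings(rows):
    """rows: lists of the ten CoNLL-U fields; returns warning strings."""
    by_id = {row[0]: row for row in rows}
    warnings = []
    for row in rows:
        token_id, form, upos, head, deprel = row[0], row[1], row[3], row[6], row[7]
        if deprel.lower() == "root" and upos not in ROOT_UPOS:
            warnings.append(f"{token_id} ({form}): root has unexpected tag {upos}")
        if deprel == "dobj":
            head_row = by_id.get(head)
            if head_row is None or head_row[3] != "VERB":
                warnings.append(f"{token_id} ({form}): dobj head is not a verb")
            if upos not in NOMINAL_UPOS:
                warnings.append(f"{token_id} ({form}): dobj dependent is not nominal")
    return warnings


sentence = [
    "1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=".split(),
    "2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+".split(),
    "3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=".split(),
]
print(semantic_warnings(sentence))  # []
```

Rules of this kind are better reported as warnings than as errors, since a parser trained on real text will produce legitimate exceptions.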

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

@fcbr I am very interested in understanding how they check

"Words do not form a sequence. Got: %s."%(u",".join(unicode(x) for x in words)),u"Format"

because most of our issues at the moment in the Bosque have to do with multiword tokens.
and in SICK-SANE have to do with creating appropriate multiword tokens.
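One plausible way to check the ID column, including multiword-token ranges, is sketched below. It follows the format description (integer IDs starting at 1, ranges for multiword tokens, decimals for empty nodes) rather than the validator's source, so the exact rules and every message other than the quoted one are assumptions. `sequence_errors` is a hypothetical name.

```python
import re

RANGE = re.compile(r"(\d+)-(\d+)")


def sequence_errors(ids):
    """ids: the ID column of one sentence, in file order, as strings."""
    errors = []
    # Plain integer IDs must be exactly 1, 2, ..., n.
    words = [int(i) for i in ids if i.isdigit()]
    if words != list(range(1, len(words) + 1)):
        errors.append("Words do not form a sequence. Got: %s."
                      % ",".join(str(w) for w in words))
    covered_until = 0  # last word ID covered by a multiword range so far
    for position, token_id in enumerate(ids):
        match = RANGE.fullmatch(token_id)
        if match is None:
            continue  # plain words and decimal (empty node) IDs are skipped
        start, end = int(match.group(1)), int(match.group(2))
        following = ids[position + 1] if position + 1 < len(ids) else None
        if start > end or end > len(words):
            errors.append("Invalid range: %s" % token_id)
        elif start <= covered_until:
            errors.append("Overlapping range: %s" % token_id)
        elif following != str(start):
            errors.append("Range %s is not followed by word %d" % (token_id, start))
        covered_until = max(covered_until, end)
    return errors


print(sequence_errors(["1-2", "1", "2", "3"]))  # []
print(sequence_errors(["1", "3"]))
```

In this sketch a range line must come immediately before the first word it covers, which is the layout used for contractions split into two syntactic words.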

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

@arademaker as you can see, the representation

subconcept-of (people2, GroupOfPeople)
subconcept-of (walking1, Walking)
subj(walking1,people2)

is very similar to TIL (over SUMO), instead of TIL over Unified Lexicon, as we have with XLE.

The only thing missing is the context "true", which is not necessary in this simple case.

But the representation is also very close to OWL, since we always have triples and we pushed the difficult bits about quantifiers somewhere else, just as Enhanced Dependencies do.

@vcvpaiva
Member Author

vcvpaiva commented Mar 7, 2017

@fcbr have you tried to download and run UDPipe https://github.com/ufal/udpipe?

did you see the numbers in
http://ufal.mff.cuni.cz/udpipe/users-manual#universal_dependencies_12_models
they say they can do better in PT than in EN, which is crazy.

@vcvpaiva
Member Author

closing this issue as the representation is described in detail in
https://github.com/own-pt/cl-conllu
