what's the spec of a conll representation? #72
Comments
The CoNLL-U format is documented here. Notes:
|
excellent, thanks a lot! |
@fcbr many thanks! excellent. I am trying to understand why, in a simple example like:
"someone", which should be a pronoun (PRP), gets tagged as a noun (NN and NOUN) and then doesn't get a mapping. If it were a noun it should get the mapping, as the noun is in PWN. Finally, can you also shed some light on the need to 'disable' Freeling's compound module and what you know about the module itself? |
@vcvpaiva I agree that we would expect the mapping to http://wnpt.brlcloud.com/wn/synset?id=00007846-n . But Freeling tagged it as PRP (pronoun, see the MISC column), so it would not look it up as a noun in WN.Pr. Column 4 is the SyntaxNet tag. As for understanding the errors, we have already talked about this. Freeling's parser and tagger are trained. In particular, the short sentences in SICK are not helping the parser, which was trained on a corpus of longer sentences of a quite different nature. |
@vcvpaiva the documentation for the compound module is at https://talp-upc.gitbooks.io/freeling-user-manual/content/modules/dictionary.html |
why? because we cannot? or because we prefer not? @arademaker just to make it abundantly clear,
I do know how statistical systems work, no need to repeat it. The question here was whether the issue is caused by the disconnect between the POS tags of Parsey and FreeLing, or just by FreeLing assigning a tag for no apparent reason. |
Nothing stops us from using enhanced dependencies, but Parsey McParseface does not emit them, most likely because the corpus it was trained on doesn't contain them (neither does the UD corpus). |
@vcvpaiva From what I now understand from the paper: S. Schuster and C. D. Manning, “Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks,” pp. 1–8, Mar. 2016. We could indeed try to reproduce this paper (if it contains all the rules or points to where they are) to produce enhanced dependencies from the basic dependencies. It seems they used rules in Semgrex: http://nlp.stanford.edu/software/tregex.shtml I still have not understood the next question. |
Yes, it would be very good to be able to use their "enhanced dependencies". Which one is the next question? The one below?
We know that we will have to remove pieces of the mappings and also add things to the conll representation, right? For example, for the sentence: # text = People are walking we want to "read out" from this dependency the following representation: subconcept-of (people2, GroupOfPeople) and for the second sentence we want to "read out" subconcept-of (writing1, WrittenCommunication). Hence we need a mechanism to invent mappings for pronouns, since the Freeling mechanism only works for nouns, verbs, adjectives and adverbs. |
the validator spec is implicit from the rules for the representations (in http://universaldependencies.org/format.html) and the error signals the validator emits (UniversalDependencies/tools#20 (comment)). So the CoNLL-U format rules say: Sentences consist of one or more word lines (no empty sentences), and word lines contain the following fields:
First the validator checks whether there are spurious empty lines (non-spurious empty lines are the ones between sentence representations) or spurious comment lines (# text) and counts the number of columns for word lines, which needs to be ten. The first 6 error codes in @martinp are about this:
|
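The structural checks described in the comment above can be sketched roughly as follows. This is a minimal Python illustration, not the actual UD validator: `check_structure` and its messages are hypothetical, fields are assumed to be tab-separated as the CoNLL-U format specifies, and "spurious" is assumed to mean an empty line with no word lines before it, or a comment line that appears after word lines of the same sentence.

```python
def check_structure(lines):
    """Return (line_number, message) pairs for structural problems.

    Hypothetical helper: mimics the kind of checks discussed in the
    thread, not the real validator's error codes.
    """
    errors = []
    seen_words = False  # have we seen a word line in the current sentence?
    for n, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line == "":
            # An empty line is only legitimate right after a sentence.
            if not seen_words:
                errors.append((n, "spurious empty line"))
            seen_words = False
        elif line.startswith("#"):
            # Comments are assumed to belong before the word lines.
            if seen_words:
                errors.append((n, "comment line inside a sentence"))
        else:
            seen_words = True
            cols = line.split("\t")
            if len(cols) != 10:
                errors.append((n, f"expected 10 columns, found {len(cols)}"))
    return errors
```

Calling `check_structure` on the lines of a file returns an empty list when none of these three problems occurs.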
the next 9 error codes are about metadata and I am not worrying about it. |
please notice that |
well, it takes as a parameter the UD tagset, the set of UD dependency relations DEPREL and the set of morphological features (whatever this may be, depending on the language). at least.
I had stopped analyzing the other error codes (at number 15), but I meant to continue checking them later tonight. |
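To make the "takes as a parameter" point concrete, here is a small Python sketch of a check that receives the allowed UPOS tags and DEPREL labels from the caller. The function name `check_tags` is hypothetical and this is not how the real validator is implemented; it only relies on UPOS being the 4th column and DEPREL the 8th, and it leaves out morphological features.

```python
def check_tags(rows, upos_tags, deprels):
    """Report tokens whose UPOS or DEPREL is not in the supplied sets.

    rows: list of 10-field word lines, each already split into a list.
    upos_tags, deprels: sets supplied by the caller (per tagset/language).
    """
    errors = []
    for row in rows:
        token_id, upos, deprel = row[0], row[3], row[7]
        if upos not in upos_tags:
            errors.append((token_id, f"unknown UPOS {upos}"))
        if deprel not in deprels:
            errors.append((token_id, f"unknown DEPREL {deprel}"))
    return errors
```

With this shape, the same check can be reused for any language by swapping the sets that are passed in.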
Absolutely, good reminder. I was thinking of the default behavior. |
while what I wanted to know was if it would do more semantic things about the meaning of the dependencies. but I believe that there are more things that it could be saying... |
@fcbr I am very interested in understanding how they check
because most of our issues at the moment in the Bosque have to do with multiword tokens. |
@arademaker as you can see, the representation subconcept-of (people2, GroupOfPeople) is very similar to TIL (over SUMO), instead of TIL over Unified Lexicon, as we have with XLE. The only thing missing is the context "true", which is not necessary in this simple case. But the representation is also very close to OWL, as we have always triples and we pushed the difficult bits about quantifiers somewhere else. just like enhanced Dependencies do. |
@fcbr have you tried to download and run UDPipe https://github.com/ufal/udpipe? did you see the numbers in |
closing this issue as the representation is described in detail in |
A sentence that works in SICK-SANE is
# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
Using this as a model we could say that a conll-representation is a pair of a sentence (in the example, "People are walking") and a matrix M. The matrix M always has as many rows as there are tokens/words in the sentence (in the example, only 3 words/tokens). The matrix M always has ten columns.
is this correct?
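The proposed definition can be checked mechanically on the example. A minimal Python sketch, assuming whitespace-separated columns as the example appears here (the CoNLL-U format itself specifies tab-separated fields); `read_representation` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
EXAMPLE = """\
# text = People are walking
1 People people NOUN NNS _ 3 nsubj _ NNS|07942152-n|GroupOfPeople=
2 are be VERB VBP _ 3 aux _ VBP|02604760-v|Entity+
3 walking walk VERB VBG _ 0 ROOT _ VBG|01904930-v|Walking=
"""


def read_representation(block):
    """Split one sentence block into (sentence text, matrix of fields)."""
    text = None
    matrix = []
    for line in block.splitlines():
        if line.startswith("# text = "):
            text = line[len("# text = "):]
        elif line.strip() and not line.startswith("#"):
            # Assumption: no field contains spaces, so split() is safe here.
            matrix.append(line.split())
    return text, matrix


text, matrix = read_representation(EXAMPLE)
print(text)                          # the sentence half of the pair
print(len(matrix), len(matrix[0]))   # rows and columns of M
```

For this example M has 3 rows and 10 columns, matching the 3 words of the sentence. One caveat on the "as many rows as tokens" part: CoNLL-U also allows range lines for multiword tokens (IDs such as 1-2), so in files that use them the number of rows can exceed the number of surface tokens.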