Further problems with input format #6

Open
jpiitula opened this issue Aug 21, 2018 · 8 comments

Comments

@jpiitula

I have run into new problems with the input format option since the previous issue (many thanks for fixing that so promptly). Briefly:

  1. Tags don't seem to be matched against dictionary entries when an input format (-I) is given. (Dictionary matching works with the slash-separated format and the -t option.)
  2. Reading from stdin with an input format seems to produce a final ghost entry with an empty "word" and an empty "tag". (This is my guess; what I observe is three trailing lines in the output: an empty line, a line containing only a tab, and another empty line.)

The same problems occur both with -I '$w/$t\n' on slash-separated tokens and with -I '$w\t$t\n' on tab-separated tokens. I will need to use tabs, because the actual model has slashes in some tags. (When I played with output formats in earlier experiments, I saw indications that the tags or the third "word" are considered "unknown", but this is not in the attached logs.)
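
The trailing lines described in item 2 are easy to miss by eye, so here is a minimal sketch of a check for them. It is not part of cstlemma; the function name count_trailing_blank is made up for illustration, and the sample input is a stand-in that only mimics the observed pattern (empty line, tab on line, empty line).

```shell
# Count how many lines at the end of stdin are empty or contain only
# spaces/tabs. Prints a single number. (Hypothetical helper.)
count_trailing_blank() {
    awk '
        /^[ \t]*$/ { n++; next }   # blank-looking line: extend the current run
        { n = 0 }                  # any other line resets the run
        END { print n + 0 }        # length of the run that reaches end of input
    '
}

# Stand-in output: one real line followed by empty, tab-only, empty.
printf 'kommer\tkomma\tVB.PRS.AKT\n\n\t\n\n' | count_trailing_blank   # prints 3
```

Piping real lemmatizer output through the same function should print 0 once the ghost entry is gone.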

While investigating this I also saw that the STREAM==0 version of cstlemma sends its diagnostics to stdout (when writing to a file) or nowhere (when writing to stdout). Surely it would be better to use stderr. But this is just by the way.

I attach a summary of my experiments, a shell log, the compilation script (fresh clones yesterday, compiled with and without STREAM), the test script, and an archive containing the two input files (slashed, tabbed) and the different output files (correct lemmas accompanying tags when using -t, incorrect lemmas, incorrect lemmas with the unexpected trail of empty-looking lines when using appropriate -I from file or from stdin). Hope some of these are useful. (The dictionary is from Språkbanken's sparv distribution, I'm not attaching that yet, but see test log for examples of the format.)

sum.txt
log.txt
test.sh.txt
compile.sh.txt
input-output.zip

@BartJongejan
Contributor

Version 7.35 solves (I think) two problems: warnings and errors are sent to stderr, not stdout, and spurious empty lines near the end of the output are suppressed.

@jpiitula
Author

jpiitula commented Sep 1, 2018

Running the same compilation and test scripts with 7.35, I now get a segmentation fault (core dumped) from the streaming version when using the -I option. This happens with both input formats (word SLASH tag NL, word TAB tag NL) and their appropriate inputs, regardless of whether the input comes from a file or from stdin.

The default version, compiled with STREAM set to 0, does not dump core with either -I format.

So I have eight core files (5.6M or 9.6M each) but I don't know what to do with them, or what other information would be relevant. Gcc version is 4.8.2; readelf -d lists libstdc++.so.6, libm.so.6, libgcc_s.so.1, libc.so.6 as the needed shared libraries. Is there something I can check on my end?

@jpiitula
Author

jpiitula commented Sep 2, 2018

I made the attached table, which compares the output file sizes from the previous version (7.34) and the new version (7.35). It seems to show that the newly core-dumping cases are exactly those that previously had the spurious trailing whitespace, unless I'm getting confused by the combinatorics.

I will test if the segfault happens early or late.

regress.txt

@jpiitula
Author

jpiitula commented Sep 2, 2018

OK, the streaming version (STREAM set to 1) with an -I format appears to segfault early. I tested this by sending it so much input that it would have produced some output before crashing if the crash happened only at the end:

$ seq 4000000 | xargs -I{} cat slashed | cstlemma1/cstlemma -I '$w/$t\n' -d models/dict0 -f empty 2> /dev/null
xargs: cat: terminated by signal 13
Segmentation fault (core dumped)

With -t instead of an -I format, it produces the expected output:

$ seq 2 | xargs -I{} cat slashed | cstlemma1/cstlemma -t -d models/dict0 -f empty 2> /dev/null
kommer	komma	VB.PRS.AKT
kommer	kommer	UO
kommer	komma	VB.PRS.AKT
kommer	kommer	UO

But I cannot use -t because some of the tags contain slashes. I wish to use tabs.
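
To illustrate why slashes inside tags are a problem for the slash-separated format, here is a small sketch in plain POSIX shell. The token form/TAG/SUB is a made-up placeholder, not a tag from the actual model, and the function names are hypothetical. It only shows that a slash-separated token admits two readings while a tab-separated one admits one; it does not show which reading cstlemma itself would choose.

```shell
# Split a slash-separated token at the FIRST slash: word, then tag.
split_first_slash() {
    printf '%s\t%s\n' "${1%%/*}" "${1#*/}"
}

# Split the same token at the LAST slash: word, then tag.
split_last_slash() {
    printf '%s\t%s\n' "${1%/*}" "${1##*/}"
}

token='form/TAG/SUB'          # hypothetical token; the tag contains a slash

split_first_slash "$token"    # prints: form<TAB>TAG/SUB
split_last_slash  "$token"    # prints: form/TAG<TAB>SUB

# With a tab as the separator there is only one reading of the tag field.
printf 'form\tTAG/SUB\n' | cut -f2    # prints: TAG/SUB
```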

@BartJongejan
Contributor

There is a new version on GitHub.

@jpiitula
Author

jpiitula commented Sep 2, 2018

Thanks. All my test combinations run properly now. All combinations that use -t produce the correct result. Those that use an -I format still produce the incorrect result (they fail to recognize that the word form with that tag is in the dictionary). The trailing whitespace is gone, as you promised.

@BartJongejan
Contributor

I am glad to know cstlemma works better now.
If the lemmatizer must take PoS tags into account, then use -t, even if there is also an -I input format specifying that the input contains PoS tags. Without -t, the lemmatizer defaults to -t- and ignores the PoS information in the input. (If the PoS tagger isn't very good, ignoring the PoS tags may in fact give better lemmatization results.)
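
As a sketch of what that means for the tab-separated case in this thread: the wrapper below simply combines -t with the -I format from the original report. The binary path, dictionary and -f argument are copied from the test commands above; the function name lemmatize_tabbed, the variable names and the input file name tabbed are assumptions for illustration, and the order of the options is not prescribed by anything in this thread.

```shell
# Paths as they appear in the test commands in this thread; override as needed.
CSTLEMMA=${CSTLEMMA:-cstlemma1/cstlemma}
DICT=${DICT:-models/dict0}
RULES=${RULES:-empty}          # value passed to -f in the thread's commands

# Lemmatize "word TAB tag" lines from stdin (hypothetical wrapper).
# -I only describes the input layout; -t is what makes the tags count.
lemmatize_tabbed() {
    "$CSTLEMMA" -t -I '$w\t$t\n' -d "$DICT" -f "$RULES"
}

# Usage (assumes a tab-separated input file named "tabbed"):
#   lemmatize_tabbed < tabbed
```

Dropping -t from the function body corresponds to the default -t- behaviour described above, where the tags are read but ignored.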

@jpiitula
Author

jpiitula commented Sep 2, 2018

Aha, I had misunderstood -t. Sorry about that. With the recent fixes, and with -t added to those test cases that use -I, all my current tests work now, including the cases that I most need to work.

So I think this issue is solved. Thank you.
