Further problems with input format #6
Comments
Version 7.35 solves (I think) two problems: warnings and errors are sent to stderr, not stdout, and spurious empty lines near the end of the output are suppressed.
Running the same compilation and test scripts with 7.35, I now get segmentation fault (core dumped) from the streaming version when using the -I option. This happens with both input formats (word SLASH tag NL, word TAB tag NL) and their appropriate inputs, regardless of whether the input comes from a file or from stdin. The default version, compiled with STREAM set to 0, does not dump core with either -I format. So I have eight core files (5.6M or 9.6M each) but I don't know what to do with them, or what other information would be relevant. GCC version is 4.8.2; readelf -d lists libstdc++.so.6, libm.so.6, libgcc_s.so.1, libc.so.6 as the needed shared libraries. Is there something I can check on my end?
I made the attached table that compares the output file sizes from the previous version (7.34) and the new version (7.35). It seems to show that the newly core-dumping cases are exactly those that previously had the spurious trail of space, unless I'm getting confused with the combinatorics. I will test whether the segfault happens early or late.
OK, the streaming version (STREAM set to 1) with an -I format appears to segfault early. I tested this by sending it so much input that it would have produced some output before crashing if the crash happened only at the end:
With input format specified as -t it produces the expected output:
But I cannot use -t because some of the tags contain slashes. I wish to use tabs.
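The early-versus-late check described above can be sketched roughly as follows. This is not the script used in this thread: `run_with_bulk_input` and `classify` are hypothetical helper names, the command to run is whatever cstlemma invocation is being tested (its arguments are not reproduced here), and it assumes a POSIX system where Python reports death by signal as a negative return code.

```python
import subprocess


def run_with_bulk_input(cmd, line, repeats=200_000):
    """Feed `line`, repeated `repeats` times, to `cmd` on stdin.

    Returns (returncode, number_of_stdout_bytes). stderr is captured
    separately so diagnostics do not count as output.
    """
    data = (line * repeats).encode("utf-8")
    proc = subprocess.run(
        cmd, input=data, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    return proc.returncode, len(proc.stdout)


def classify(returncode, n_out):
    """Interpret the result of run_with_bulk_input.

    Assumption: on POSIX a negative return code means the process was
    killed by a signal (for example -11 for SIGSEGV).
    """
    if returncode >= 0:
        return "exited"
    # Killed by a signal: output before the crash suggests a late crash.
    return "late crash" if n_out > 0 else "early crash"
```

With a large enough `repeats`, a program that streams its output would normally have written something before reaching the end of its input, so zero bytes on stdout together with a signal exit points to a crash near startup. A program that buffers all output until the end would look "early" either way, so the result is only a hint.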
There is a new version on GitHub.
Thanks. All my test combinations produce a proper result now. All combinations that use -t produce the correct result. Those that use an -I format produce the incorrect result (fail to recognize that the word form with that tag is in the dictionary). Trailing space is gone as you promised.
I am glad to know cstlemma works better now.
Aha, I had misunderstood -t. Sorry about that. With the recent fixes, and with -t added to those test cases that use -I, all my current tests work now, including the cases that I most need to work. So I think this issue is solved. Thank you.
I encountered new problems with the input format option since the previous issue (many thanks for the prompt fixing of that). Briefly:
The same problems occur both with -I '$w/$t\n' on slash-separated tokens and with -I '$w\t$t\n' on tab-separated tokens. (I will need to use tab. The actual model has slashes in some tags.) (I saw indications that the tags or the third "word" are considered "unknown" when I played with output formats in earlier experiments but this is not in the attached logs.)
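To illustrate why tags containing slashes make the slash-separated layout awkward, here is a small sketch. It says nothing about how cstlemma itself parses its input; the three helper functions and the sample token `bil/NN/UTR` are made up for illustration and are not taken from the actual model.

```python
def split_first_slash(line):
    """Split 'word/tag' at the first slash."""
    word, _, tag = line.rstrip("\n").partition("/")
    return word, tag


def split_last_slash(line):
    """Split 'word/tag' at the last slash."""
    word, _, tag = line.rstrip("\n").rpartition("/")
    return word, tag


def split_tab(line):
    """Split 'word<TAB>tag' at the tab character."""
    word, _, tag = line.rstrip("\n").partition("\t")
    return word, tag


# Hypothetical token whose tag contains a slash.
slashed = "bil/NN/UTR\n"
tabbed = "bil\tNN/UTR\n"

print(split_first_slash(slashed))  # ('bil', 'NN/UTR')
print(split_last_slash(slashed))   # ('bil/NN', 'UTR')
print(split_tab(tabbed))           # ('bil', 'NN/UTR')
```

The two slash-based readings disagree on where the word ends, while the tab-separated line has only one reading as long as neither words nor tags contain a tab.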
While investigating this I also saw that the STREAM==0 version of cstlemma sends its diagnostics to stdout (when writing to a file) or nowhere (when writing to stdout). Surely it would be better to use stderr. But this is just by the way.
I attach a summary of my experiments, a shell log, the compilation script (fresh clones yesterday, compiled with and without STREAM), the test script, and an archive containing the two input files (slashed, tabbed) and the different output files (correct lemmas accompanying tags when using -t, incorrect lemmas, incorrect lemmas with the unexpected trail of empty-looking lines when using appropriate -I from file or from stdin). Hope some of these are useful. (The dictionary is from Språkbanken's sparv distribution, I'm not attaching that yet, but see test log for examples of the format.)
sum.txt
log.txt
test.sh.txt
compile.sh.txt
input-output.zip