
Collocation Grammar #4

Open
LenaHenke opened this issue Sep 10, 2021 · 4 comments
Comments

@LenaHenke

Hi Ke! I am very excited about your toolbox; unfortunately, however, I cannot get it to work for the collocation grammar. Everything works fine for the unigram grammar. I have tried using the input files (train.dat/test.dat) that you provided and get an assertion error in hybrid.py:

line 1333, in model_state_assertion adapted_production_dependent[adapted_production]))
AssertionError: : Word -> a (Word -> Chars, Chars -> Char, Char -> 'a')
: 0 set()
: 5 {Collocation -> i ' l l p u t i t a w a y (Collocation -> Words, Words -> Word Words, Word -> i ' l l p u (Word -> Chars, Chars -> Char Chars, Char -> ...

Unfortunately, I could not figure out how to solve it. Do you maybe have a solution?
Thank you very much for your help in advance!
Best regards,
Lena

@kzhai
Owner

kzhai commented Sep 13, 2021

Hi, Lena,

Thanks for your interest in the package.
I have not kept up with the package for quite a long time. It was originally implemented years ago in Python 2.7 (hence the old nltk version) and later ported to Python 3. During that port, there was a big change in nltk's FreqDist API. The lines reporting the error are likely due to the "block" assertion statement, which I kept around to validate the intermediate data structures and cache format.
One possible quick (and somewhat hacky) fix is to comment out the entire assertion block, lines 1319 to 1345. It runs fine on my end.
I will revisit the internal logic when I have some spare time.
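As an aside on the suggested workaround: rather than editing hybrid.py by hand, Python's `-O` flag strips all `assert` statements at compile time, so running the training script under `python3 -O` would skip the failing block as well. A minimal demonstration of the flag's effect (a generic illustration, not specific to this package):

```shell
# Without -O, the assert fires and the interpreter exits non-zero.
python3 -c "assert False" 2>/dev/null || echo "assertion raised"
# With -O, assert statements are compiled out entirely, so nothing fires.
python3 -O -c "assert False" && echo "assertion skipped"
```

Note that `-O` disables every `assert` in the program, not just the one at lines 1319 to 1345, so commenting out the specific block is the more surgical option.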

Best,
Ke

@LenaHenke
Author

Dear Ke,

Thank you so much for your reply! I have only just seen your suggestion, and it also works perfectly on my own data!

I do, however, have another question about how to apply the model to new input. I basically just want to use the collocation model to make inferences about new sentences. I was hoping there would be a simple function (along the lines of Model.inference(newsentence)); however, from the previously closed issues, I understood that launch_test returns parses of new data. The function itself works for me, but I am unsure whether I understand and apply it correctly, and I would be very grateful for your insights:

(1) I am very new to NLP, and I apologize if this is very basic, but why does the function take truth and training data if I have already used the training data to train the model? The output for the truth data is also exactly the same as my truth input, so I am not sure why I need both inputs. Maybe this is also a misconception on my side about what train.dat and truth.dat should be. Could you possibly clarify this for me?

(2) In my output file for train.dat, each sentence is parsed 10 times (sometimes in different ways). Which of these parses should I consider the final output (i.e., the final/most likely parse given the trained model)?

Thank you very much for your help again!
Lena

@kzhai
Owner

kzhai commented Oct 8, 2021

Hi, Lena,

> (1) I am very new to NLP, and I apologize if this is very basic, but why does the function take truth and training data if I have already used the training data to train the model? The output for the truth data is also exactly the same as my truth input, so I am not sure why I need both inputs. Maybe this is also a misconception on my side about what train.dat and truth.dat should be. Could you possibly clarify this for me?

If I understand your question correctly: the adaptor grammar is an unsupervised model, so it does not need external labels/annotations. If you compare truth and train, train is simply a tokenized version of the truth data.
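To make the truth/train relationship concrete, here is a hedged sketch (my own illustration, not code from the package) of how a truth.dat line for word segmentation is typically turned into a train.dat line: word boundaries are removed and the remaining characters are space-separated, matching the character-level input visible in the assertion trace:

```python
# Illustrative only: convert a truth line ("i'll put it away") into the
# character-tokenized form that an unsupervised segmentation model consumes.
def to_train_line(truth_line: str) -> str:
    # Drop the word boundaries, then separate every remaining
    # character with a single space.
    return " ".join(truth_line.replace(" ", ""))

print(to_train_line("i'll put it away"))
# i ' l l p u t i t a w a y
```

The model's job is then to recover the word boundaries that were stripped out, which is why no separate annotation file is required.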

> (2) In my output file for train.dat, each sentence is parsed 10 times (sometimes in different ways). Which of these parses should I consider the final output (i.e., the final/most likely parse given the trained model)?

Ideally, there should be one dominant parse tree. In some cases, however, there could be two or more; in that case, you may sample over the parse trees, or simply take the most frequent one.
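Taking the most frequent of the sampled parses is a one-liner with `collections.Counter`; the parse strings below are invented purely for illustration:

```python
from collections import Counter

# Ten hypothetical sampled parses of one sentence, as they might be
# emitted across sampling iterations (the bracketed strings are made up).
sampled_parses = (
    ["(Word (Chars d o g))"] * 7
    + ["(Word (Chars d) (Chars o g))"] * 3
)

# Pick the single most frequent parse as the final analysis.
best_parse, count = Counter(sampled_parses).most_common(1)[0]
print(best_parse)  # the dominant parse
print(count)       # how often it appeared: 7 of 10 here
```

Sampling a parse in proportion to its frequency instead would be `random.choices(list(c), weights=c.values())` over the same `Counter`.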

@LenaHenke
Author

Thank you so much, Ke! Your answers helped a lot!
