Are all turns "real"? #26
Comments
Hi, Thanks! Answers inline:
Yes, with only a few caveats: we renamed a few functions after exporting to match aesthetic choices in the paper's writing. Our annotation tool executes programs and displays the results to annotators, so very few unexecutable programs slip through the process.
For correctness, we say the following in the paper:
The test set is likely > 75% agreement, but AFAIK we haven't done a triple review process to measure that. There are also various kinds of spurious ambiguity, and some subjectivity (like meeting titles), so not all disagreements necessarily imply "wrong". "Completeness" is hard to define, but our goal is to capture what the bot should do in response to the utterance, so it's by no means a complete representation of the semantics of the utterance. I would say the programs are all pretty efficient in terms of execution speed and number of API calls required, but could be made terser. Many annotator disagreements are due to changing annotation guidelines (i.e. they were correct at some time in the past and do the right thing, but are no longer the preferred way to write the program).
The data was collected while having an end-to-end system in place. We may release a new version in the future to correct or remove errors that inevitably surface despite that, but likely not in the very near future. Cheers,
Interesting, thanks! So the programs are executed in the context of the full dialog, right? Somewhat related question (an example of a program which may not be "right"): One example picked at random: the answer may be OK, but is this the way it was supposed to be done?
@jrmmrn yes, this is how this specific case was handled in this release version of the data (resorting to the NLP to decide that "mom" is unknown), and as you noted, it's not optimal. Execution + exception is a better way to handle this case, though there are more nuances.
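As a rough illustration of the "execution + exception" idea mentioned above, here is a minimal Python sketch. It is not the project's code: `CONTACTS`, `find_person`, `PersonNotFound`, and `respond` are all hypothetical names invented for this example. The point is only that the lookup is actually executed, and the "unknown person" case is discovered through an exception rather than decided up front from the utterance.

```python
class PersonNotFound(Exception):
    """Raised when a name cannot be resolved to a known contact."""


# Hypothetical contact store; a stand-in for whatever backend a real system queries.
CONTACTS = {"alice": "alice@example.com", "bob": "bob@example.com"}


def find_person(name: str) -> str:
    """Execute the lookup and raise if the name is unknown."""
    key = name.strip().lower()
    if key not in CONTACTS:
        raise PersonNotFound(name)
    return CONTACTS[key]


def respond(name: str) -> str:
    """Run the lookup; turn a failed execution into a reply."""
    try:
        email = find_person(name)
    except PersonNotFound as err:
        # The failure surfaces at execution time, not from language analysis.
        return f"I couldn't find anyone named {err.args[0]}."
    return f"Found {email}."
```

With this sketch, `respond("mom")` goes through the exception path, while `respond("Alice")` resolves normally.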
Hi,
I really like this work!
A lot of work went into creating this dataset, and obviously you had to rely on many people to do the labelling, so the quality/style of the labelling (sexp/lispress's) varies from labeler to labeler.
Also, you may have first constructed the dataset, even before you had a functioning end-to-end system.
My question is - are the labeled lispress expressions "real"?
Meaning:
Now that you have an end-to-end system, could you maybe go over the dataset and verify/correct/unify it?
Or separate it to subsets (e.g. by correctness / quality...)?
Especially since you cannot release more details about your system, the presence of incorrect data makes it even more difficult to understand the concept and use the data...
Thanks!