
Are all turns "real"? #26

Open
jrmmrn opened this issue Feb 19, 2021 · 3 comments


@jrmmrn

jrmmrn commented Feb 19, 2021

Hi,

I really like this work!

A lot of work went into creating this dataset, and obviously you had to rely on many people to do the labelling, so the quality/style of the labels (the sexp/Lispress programs) varies from labeler to labeler.
Also, you may have constructed the dataset first, even before you had a functioning end-to-end system.

My question is: are the labeled Lispress expressions "real"?
Meaning:

  1. Is there a system which gets these expressions (exactly as they appear in the dataset) and actually executes them "correctly"/"as intended"?
  2. Are the labeled Lispress programs a "good" translation of the user input, given the dialog context? Good in the sense of correct/complete/efficient/... (a made-up example of the kind of labeled turn I mean is sketched right after this list).
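
For concreteness, here is the kind of labeled turn I have in mind. The utterance, program, and all function names below are made up by me in roughly the dataset's style, not an actual entry:

```python
# Made-up sketch of a single labeled turn; the function names and
# fields are illustrative only, not the dataset's exact inventory.
turn = {
    "user_utterance": "Set up coffee with Dan tomorrow",
    "lispress": (
        "(Yield"
        " (CreateCommitEventWrapper"
        "  (CreatePreflightEventWrapper"
        '   (Event.subject_? (?= "coffee")))))'
    ),
}
```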

Now that you have an end-to-end system, could you maybe go over the dataset and verify/correct/unify it?
Or separate it into subsets (e.g. by correctness / quality / ...)?

Especially since you cannot release more details about your system, trying to understand the concepts and use the data is even more difficult in the presence of incorrect data...

Thanks!

@sammthomson
Member

Hi,

Thanks! Answers inline:

> Is there a system which gets these expressions (exactly as they appear in the dataset) and actually executes them "correctly"/"as intended"?

Yes, with only a few caveats... we renamed a few functions after exporting, to match aesthetic choices in the paper's writing. Our annotation tool executes programs and displays the results to annotators, so very few unexecutable programs slip through the process.
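
As a rough offline illustration of the weaker half of that check (well-formedness rather than full execution), here is a toy parser of my own, not the released tooling, that verifies a program string is at least a balanced s-expression:

```python
# Toy well-formedness check (my own sketch, not the released tooling).
# It ignores string-literal quoting, so atoms containing spaces would
# need a real tokenizer; it only illustrates the shape of the check.

def parse_sexp(text: str):
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def parse(i: int):
        if tokens[i] != "(":
            return tokens[i], i + 1      # atom
        i += 1
        children = []
        while tokens[i] != ")":
            child, i = parse(i)
            children.append(child)
        return children, i + 1           # consume the closing paren

    tree, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after program")
    return tree

print(parse_sexp("(Yield (toRecipient (CurrentUser)))"))
# ['Yield', ['toRecipient', ['CurrentUser']]]
```

Unbalanced programs fail with an IndexError or ValueError, which is the point of the check; the annotation tool's execution-based check is of course much stronger.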

> Are the labeled Lispress programs a "good" translation of the user input, given the dialog context? Good in the sense of correct/complete/efficient/...

For correctness, we say the following in the paper:

> We review every dialogue in the test set with two additional annotators. 75% of turns pass through this double review process with no changes, which serves as an approximate measure of inter-annotator consensus on full programs.

The test set likely has > 75% agreement, but AFAIK we haven't done a triple-review process to measure that. There are also various kinds of spurious ambiguity, and some subjectivity (like meeting titles), so not all disagreements necessarily imply "wrong".

"Completeness" is hard to define, but our goal is to capture what the bot should do in response to the utterance, so it's by no means a complete representation of the semantics of the utterance.

I would say the programs are all pretty efficient in terms of execution speed and number of API calls required, but could be made terser. Many annotator disagreements are due to changing annotation guidelines (i.e. they were correct at some time in the past and do the right thing, but are no longer the preferred way to write the program).

> Now that you have an end-to-end system, could you maybe go over the dataset and verify/correct/unify it?
> Or separate it into subsets (e.g. by correctness / quality / ...)?

The data was collected with an end-to-end system already in place. We may release a new version in the future to correct or remove the errors that inevitably surface despite that, but likely not in the very near future.

Cheers,
-Sam

@jrmmrn
Author

jrmmrn commented Feb 20, 2021

Interesting, thanks!

So the programs are executed in the context of the full dialog, right?

Somewhat related question (an example of a program which may not be "right"):

One example, picked at random:
at one point, the user asks:
"Please add coffee with mom a little later afterwards."
This is labeled simply as:
`FenceAttendee`
and generates the answer:
"I can only look up names in your address book. Please use a full name and try again."

The answer may be OK, but is this the way it was supposed to be done?
Here, rather than going through a graph computation and throwing an exception for an unknown attendee, the NLP side had to decide that "mom" is unknown...

@ysu1989

ysu1989 commented Mar 11, 2021

@jrmmrn yes, this is how this specific case was handled (resorting to the NLP side to decide that "mom" is unknown) in this release version of the data, and as you noted, it's not optimal. Execution + exception is a better way to handle this case, though there are more nuances.
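
To make the contrast concrete, here is a hypothetical sketch of the two strategies; none of these names are the dataset's actual functions or APIs:

```python
# Hypothetical sketch of the two strategies discussed above; the names
# here are invented for illustration, not the dataset's real functions.

ADDRESS_BOOK = {"Dan Smith", "Megan White"}

class UnknownAttendee(Exception):
    """Raised when an attendee cannot be resolved at execution time."""

def find_person(name: str) -> str:
    if name not in ADDRESS_BOOK:
        raise UnknownAttendee(name)
    return name

# Strategy 1 (as in the released data): the NLP side predicts a fence
# program like FenceAttendee when it decides "mom" is unresolvable, so
# execution never attempts the lookup at all.

# Strategy 2 (execution + exception): always predict the lookup, and
# let the executor's exception drive the clarification response.
def run_turn(attendee: str) -> str:
    try:
        person = find_person(attendee)
        return f"Scheduling coffee with {person}."
    except UnknownAttendee:
        return ("I can only look up names in your address book. "
                "Please use a full name and try again.")

print(run_turn("mom"))  # -> the clarification message, via the exception
```

With the second strategy the decision lives in one place (the executor), so the parser doesn't have to guess which names will resolve.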
