
Are all turns "real"? #26

Open
jrmmrn opened this issue Feb 19, 2021 · 3 comments


@jrmmrn

jrmmrn commented Feb 19, 2021

Hi,

I really like this work!

A lot of work went into creating this dataset, and obviously you had to rely on many people to do the labelling, so the quality/style of the labels (the sexp/Lispress programs) varies from labeler to labeler.
Also, you may have constructed the dataset first, even before you had a functioning end-to-end system.

My question is: are the labeled Lispress expressions "real"?
Meaning:

  1. Is there a system which gets these expressions (exactly as they appear in the dataset) and actually executes them "correctly"/"as intended"?
  2. Are the labeled Lispress programs a "good" translation of the user input, given the dialog context? Good in the sense of correct/complete/efficient/... (a made-up example of the kind of labeled turn I mean is sketched right after this list).
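
For concreteness, here is the kind of labeled turn I have in mind. The utterance, program, and all function names below are made up by me in roughly the dataset's style, not an actual entry:

```python
# Made-up sketch of a single labeled turn; the function names and
# fields are illustrative only, not the dataset's exact inventory.
turn = {
    "user_utterance": "Set up coffee with Dan tomorrow",
    "lispress": (
        "(Yield"
        " (CreateCommitEventWrapper"
        "  (CreatePreflightEventWrapper"
        '   (Event.subject_? (?= "coffee")))))'
    ),
}
```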

Now that you have an end-to-end system, could you maybe go over the dataset and verify/correct/unify it?
Or separate it into subsets (e.g. by correctness / quality / ...)?

Especially since you cannot release more details about your system, trying to understand the concepts and use the data is even more difficult in the presence of incorrect data...

Thanks!

@sammthomson
Member

Hi,

Thanks! Answers inline:

> Is there a system which gets these expressions (exactly as they appear in the dataset) and actually executes them "correctly"/"as intended"?

Yes, with only a few caveats... we renamed a few functions after exporting, to match aesthetic choices in the paper's writing. Our annotation tool executes programs and displays the results to annotators, so very few unexecutable programs slip through the process.
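
As a rough offline illustration of the weaker half of that check (well-formedness rather than full execution), here is a toy parser of my own, not the released tooling, that verifies a program string is at least a balanced s-expression:

```python
# Toy well-formedness check (my own sketch, not the released tooling).
# It ignores string-literal quoting, so atoms containing spaces would
# need a real tokenizer; it only illustrates the shape of the check.

def parse_sexp(text: str):
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def parse(i: int):
        if tokens[i] != "(":
            return tokens[i], i + 1      # atom
        i += 1
        children = []
        while tokens[i] != ")":
            child, i = parse(i)
            children.append(child)
        return children, i + 1           # consume the closing paren

    tree, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after program")
    return tree

print(parse_sexp("(Yield (toRecipient (CurrentUser)))"))
# ['Yield', ['toRecipient', ['CurrentUser']]]
```

Unbalanced programs fail with an IndexError or ValueError, which is the point of the check; the annotation tool's execution-based check is of course much stronger.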

> Are the labeled Lispress programs a "good" translation of the user input, given the dialog context? Good in the sense of correct/complete/efficient/...

For correctness, we say the following in the paper:

> We review every dialogue in the test set with two additional annotators. 75% of turns pass through this double review process with no changes, which serves as an approximate measure of inter-annotator consensus on full programs.

The test set likely has > 75% agreement, but AFAIK we haven't done a triple-review process to measure that. There are also various kinds of spurious ambiguity, and some subjectivity (like meeting titles), so not all disagreements necessarily imply "wrong".

"Completeness" is hard to define, but our goal is to capture what the bot should do in response to the utterance, so it's by no means a complete representation of the semantics of the utterance.

I would say the programs are all pretty efficient in terms of execution speed and number of API calls required, but could be made terser. Many annotator disagreements are due to changing annotation guidelines (i.e. they were correct at some time in the past and do the right thing, but are no longer the preferred way to write the program).

> Now that you have an end-to-end system, could you maybe go over the dataset and verify/correct/unify it?
> Or separate it into subsets (e.g. by correctness / quality / ...)?

The data was collected with an end-to-end system already in place. We may release a new version in the future to correct or remove the errors that inevitably surface despite that, but likely not in the very near future.

Cheers,
-Sam

@jrmmrn
Author

jrmmrn commented Feb 20, 2021

Interesting, thanks!

So the programs are executed in the context of the full dialog, right?

Somewhat related question (an example of a program which may not be "right"):

One example, picked at random:
at one point, the user asks:
"Please add coffee with mom a little later afterwards."
This is labeled simply as:
`FenceAttendee`
and generates the answer:
"I can only look up names in your address book. Please use a full name and try again."

The answer may be OK, but is this the way it was supposed to be done?
Here, rather than going through a graph computation and throwing an exception for an unknown attendee, the NLP side had to decide that "mom" is unknown...

@ysu1989

ysu1989 commented Mar 11, 2021

@jrmmrn yes, this is how this specific case was handled (resorting to the NLP side to decide that "mom" is unknown) in this release version of the data, and as you noted, it's not optimal. Execution + exception is a better way to handle this case, though there are more nuances.
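
To make the contrast concrete, here is a hypothetical sketch of the two strategies; none of these names are the dataset's actual functions or APIs:

```python
# Hypothetical sketch of the two strategies discussed above; the names
# here are invented for illustration, not the dataset's real functions.

ADDRESS_BOOK = {"Dan Smith", "Megan White"}

class UnknownAttendee(Exception):
    """Raised when an attendee cannot be resolved at execution time."""

def find_person(name: str) -> str:
    if name not in ADDRESS_BOOK:
        raise UnknownAttendee(name)
    return name

# Strategy 1 (as in the released data): the NLP side predicts a fence
# program like FenceAttendee when it decides "mom" is unresolvable, so
# execution never attempts the lookup at all.

# Strategy 2 (execution + exception): always predict the lookup, and
# let the executor's exception drive the clarification response.
def run_turn(attendee: str) -> str:
    try:
        person = find_person(attendee)
        return f"Scheduling coffee with {person}."
    except UnknownAttendee:
        return ("I can only look up names in your address book. "
                "Please use a full name and try again.")

print(run_turn("mom"))  # -> the clarification message, via the exception
```

With the second strategy the decision lives in one place (the executor), so the parser doesn't have to guess which names will resolve.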
