Overlap in dataset splits #28

Open
jjcmoon opened this issue Jul 20, 2020 · 0 comments

jjcmoon commented Jul 20, 2020

When looking at the output of `make data` in a clean clone of the repo, I noticed a small overlap between the NL descriptions of the train and test datasets (and likewise between train and dev). After investigating, the cause seems to be that a single NL description can have multiple corresponding bash commands, and those pairs can be placed in different splits. The code in `data/scripts/split_data.py` seems to guard against the wrong thing: it checks whether identical bash commands are placed in different splits. That would be appropriate for Bash2NL, but not for the other direction (NL to Bash).
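To illustrate what I mean by splitting on the description instead of the command, here is a minimal sketch. It is not the code from `data/scripts/split_data.py`; the function name `split_by_description`, the 80/10/10 ratios, and the strip/lowercase normalization are all assumptions made for the example. It groups the (NL, bash) pairs by description and assigns each whole group to one split, so a description can never appear in two splits.

```python
import random
from collections import defaultdict


def split_by_description(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (nl, cmd) pairs so that each NL description lands in exactly one split.

    Hypothetical helper: the ratios and the normalization below are
    assumptions, not the project's actual settings.
    """
    # Group all pairs that share the same (normalized) description.
    groups = defaultdict(list)
    for nl, cmd in pairs:
        groups[nl.strip().lower()].append((nl, cmd))

    # Shuffle the descriptions, not the individual pairs.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = int(len(keys) * ratios[0])
    n_dev = int(len(keys) * ratios[1])
    buckets = {
        "train": keys[:n_train],
        "dev": keys[n_train:n_train + n_dev],
        "test": keys[n_train + n_dev:],
    }

    # Expand each group back into its pairs.
    return {name: [p for k in ks for p in groups[k]] for name, ks in buckets.items()}
```

Because the ratios apply to descriptions rather than pairs, the resulting split sizes (counted in pairs) will deviate slightly from the nominal ratios when some descriptions have many commands.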

Since few descriptions have multiple commands, the overlap is small, so I expect the reported performance to drop only slightly once this is fixed (I guesstimate around 1%, but have not tried it). I figured you might still want to be aware of it.
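For anyone who wants to put a number on the overlap, a small sketch along these lines should do. The helper name `description_overlap` is made up for this example, and it assumes the descriptions of each split have already been read into lists of strings (one description per entry); it only compares them after stripping surrounding whitespace.

```python
def description_overlap(train_nl, eval_nl):
    """Return (count, fraction) of eval descriptions that also occur in train.

    Hypothetical helper: inputs are plain lists of description strings.
    """
    seen = {s.strip() for s in train_nl}
    eval_nl = [s.strip() for s in eval_nl]
    hits = sum(1 for s in eval_nl if s in seen)
    # Guard against an empty eval set to avoid division by zero.
    fraction = hits / len(eval_nl) if eval_nl else 0.0
    return hits, fraction
```

Calling it once with the train and test descriptions and once with train and dev gives the size of both overlaps.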
