preprocessing of texts and source code #24

Open
liuhuigmail opened this issue Mar 5, 2020 · 6 comments

@liuhuigmail

Great work on source code generation.
The details of the preprocessing of the texts (natural language) and the source code are missing from the paper. Would you kindly let me know what kind of preprocessing has been conducted, e.g., unifying identifiers?

Thanks.

Hui

@pcyin pcyin self-assigned this Mar 13, 2020
@jason-hanling

Have you solved this question?
I am also curious about it.

@liuhuigmail
Author

No. But I built a new dataset (https://github.com/ds4an/CoDas4CG) and conducted a sequence of preprocessing steps as I saw fit :)

@jason-hanling

I think datasets/conala/dataset.py may do the preprocessing.
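
To make "unifying identifiers" concrete, here is a minimal sketch of one possible approach. It is not taken from datasets/conala/dataset.py, and whether that script does anything like this is not confirmed in this thread. The function name `unify_identifiers`, the `slot_N` placeholder scheme, and the assumption that identifiers appear in backticks or double quotes in the text are all hypothetical.

```python
import re

# Hypothetical assumption: identifiers and literals are marked in the text
# with backticks or double quotes, e.g. "sort list `my_list` in reverse".
QUOTED = re.compile(r"`([^`]+)`|\"([^\"]+)\"")


def unify_identifiers(intent, snippet):
    """Replace quoted spans in the intent with slot_N placeholders and
    apply the same mapping to whole-token occurrences in the snippet."""
    slot_map = {}

    def to_slot(match):
        # Exactly one of the two groups is set for any match.
        literal = next(g for g in match.groups() if g is not None)
        if literal not in slot_map:
            slot_map[literal] = f"slot_{len(slot_map)}"
        return slot_map[literal]

    new_intent = QUOTED.sub(to_slot, intent)

    new_snippet = snippet
    # Longer literals first, so a short literal cannot rewrite part of a longer one.
    for literal in sorted(slot_map, key=len, reverse=True):
        pattern = rf"(?<!\w){re.escape(literal)}(?!\w)"
        new_snippet = re.sub(pattern, slot_map[literal], new_snippet)

    return new_intent, new_snippet, slot_map
```

The slot map is returned so that the original names can be restored in generated code afterwards. Identifiers that are not quoted in the text are left untouched by this sketch.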

@ShangwenWang

@jason-hanling @liuhuigmail

Hi, Professor Liu,

I am also interested in how the data is pre-processed.
I note that the pre-processing is done by the dataset.py script (you are right). I'd like to know what the files look like before pre-processing (conala-train.json). However, I found that the official CoNaLa webpage (https://conala-corpus.github.io/) no longer supports downloading.
I wonder whether you have a copy you could share. Thanks!

@neubig
Collaborator

neubig commented May 20, 2022

I'm not sure why, but some people have been having trouble downloading the dataset with the Chrome browser. The dataset is still available; here's a direct link that should work:
http://www.phontron.com/download/conala-corpus-v1.1.zip
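
For anyone who wants to see what the files look like before pre-processing, here is a minimal sketch for peeking into the archive once it has been downloaded by hand. It assumes the zip contains a member whose name ends with conala-train.json and that this file holds a JSON list of records; the function name `peek_records` and the local file name in the usage comment are hypothetical.

```python
import json
import zipfile


def peek_records(zip_path, member_suffix="conala-train.json", n=3):
    """Return the first n records of the first archive member whose
    name ends with member_suffix (assumed to hold a JSON list)."""
    with zipfile.ZipFile(zip_path) as zf:
        matches = [name for name in zf.namelist() if name.endswith(member_suffix)]
        if not matches:
            raise FileNotFoundError(
                f"no member ending with {member_suffix!r} in {zip_path}"
            )
        # json.load accepts the binary file object returned by ZipFile.open.
        with zf.open(matches[0]) as fh:
            records = json.load(fh)
    return records[:n]


# Usage, assuming the archive was saved locally under this name:
#   for record in peek_records("conala-corpus-v1.1.zip"):
#       print(json.dumps(record, indent=2))
```

Printing a few records this way shows the raw fields that any later pre-processing starts from.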

@ShangwenWang

Oh great @neubig
Thanks a lot.
