Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)

The repository also contains code for resplitting the Python150k and JavaScript150k datasets. The new splits are made by repository, duplicates are removed, and the redistributable version of Py150k is used.

Repository structure

data_utils: scripts for downloading the Python150k and JavaScript150k datasets and obtaining new train / val / test splits (split by repository, with duplicates removed, using the redistributable version of Py150k)
vm_fn: code for the Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training, etc.)
cc: code for the Code Completion (CC) task (additional preprocessing, models, training, etc.)

See README in each directory for details.
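To illustrate what "splitting by repository with duplicates removed" means, here is a minimal sketch. It is not the code in data_utils: the function names, the path layout ("owner/repo/file"), the split fractions, and the use of exact-content hashing for deduplication are all assumptions made for this example.

```python
import hashlib


def repo_of(path):
    # Assumption: paths look like "owner/repo/...", so the first two
    # components identify the repository.
    return "/".join(path.split("/")[:2])


def split_by_repository(files, val_frac=0.1, test_frac=0.1):
    """files maps path -> source text; returns split name -> list of paths."""
    # Drop exact duplicates: keep the first path (in sorted order) per content hash.
    seen = set()
    unique = []
    for path in sorted(files):
        digest = hashlib.sha256(files[path].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)

    # Assign every repository to exactly one split using a stable hash of its
    # name, so files from one repository never land in different splits.
    splits = {"train": [], "val": [], "test": []}
    for path in unique:
        repo_hash = hashlib.sha256(repo_of(path).encode("utf-8")).hexdigest()
        bucket = (int(repo_hash, 16) % 1000) / 1000
        if bucket < test_frac:
            splits["test"].append(path)
        elif bucket < test_frac + val_frac:
            splits["val"].append(path)
        else:
            splits["train"].append(path)
    return splits
```

Hashing the repository name rather than the file path is what keeps a repository from leaking across train, val and test; the actual scripts may select repositories and detect duplicates differently.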

Run

The code was tested on a system running Linux 3.10.0, and experiments were run on a Tesla V100 GPU. Required libraries are listed in requirments.txt in the vm_fn and cc directories. The implementation is based on PyTorch>=1.5.
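A quick way to check the PyTorch>=1.5 requirement before running anything is to compare version strings. The helper below is a hypothetical convenience for this README, not part of the repository, and it deliberately avoids importing torch so it can be run anywhere.

```python
import re


def version_tuple(version):
    # Keep the first three numeric components and ignore local suffixes
    # such as "+cu101"; missing components are treated as 0.
    numbers = re.findall(r"\d+", version.split("+")[0])
    numbers = (numbers + ["0", "0", "0"])[:3]
    return tuple(int(n) for n in numbers)


def meets_minimum(installed, minimum="1.5"):
    # True if the installed version is at least the required minimum.
    return version_tuple(installed) >= version_tuple(minimum)
```

With PyTorch installed, the check would be `meets_minimum(torch.__version__)` after `import torch`.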

Running experiments:

Download and resplit the data; see data_utils for details.
Preprocess the data for the task you are interested in (VM, FN or CC); see vm_fn or cc for details.
Run the experiment you are interested in; see vm_fn or cc for details.

Attribution

Parts of this code are based on the following repositories:

Citation

If you find this code useful, please cite our papers:

@inproceedings{10.1145/3468264.3468611,
    author = {Chirkova, Nadezhda and Troshin, Sergey},
    title = {Empirical Study of Transformers for Source Code},
    year = {2021},
    isbn = {9781450385626},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3468264.3468611},
    doi = {10.1145/3468264.3468611},
    booktitle = {Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
    pages = {703--715},
    numpages = {13},
    keywords = {code completion, neural networks, transformer, function naming, variable misuse detection},
    location = {Athens, Greece},
    series = {ESEC/FSE 2021}
}

@inproceedings{chirkova-troshin-2021-simple,
    title = "A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code",
    author = "Chirkova, Nadezhda and Troshin, Sergey",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.26",
    doi = "10.18653/v1/2021.naacl-main.26",
    pages = "278--288",
}
