Multilingual Code Co-Evolution Using Large Language Models

This repo hosts the code and data for the following FSE 2023 paper:

Title: Multilingual Code Co-Evolution Using Large Language Models

Authors: Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

@inproceedings{ZhangETAL23Codeditor,
  author = {Zhang, Jiyang and Nie, Pengyu and Li, Junyi Jessy and Gligoric, Milos},
  title = {Multilingual Code Co-Evolution Using Large Language Models},
  booktitle = {Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
  year = {2023},
}

News

May 2024: The fine-tuned EditsTranslation models are released on 🤗! 🔥 See cs2java and java2cs.

How to Use

from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load the fine-tuned Java-to-C# edit-translation checkpoint from the Hugging Face Hub.
checkpoint = "EngineeringSoftware/EditsTranlation-java2cs"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# An incomplete Java snippet (missing closing braces and semicolon).
code_input = """class HelloWorld { public static void main(String[] args) { System.out.println("Hello, World!")"""

# Tokenize the input and generate the edit sequence followed by the updated code.
input_ids = tokenizer(code_input, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=200)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
# output: <INSERT>; } } ;<INSERT_END> class HelloWorld { public static void main(String[] args) { System.out.println("Hello, World!") ; } } ;
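
The generated sequence is an edit script (here, an <INSERT> ... <INSERT_END> span) followed by the full updated code. As a minimal post-processing sketch, assuming the code always follows the last <INSERT_END> marker (an assumption based on the example output above, not a documented output format), the plain code can be recovered like this:

# Post-processing sketch (assumption: the code follows the last <INSERT_END> marker).
def strip_edit_markers(generated: str) -> str:
    marker = "<INSERT_END>"
    if marker in generated:
        # Keep only the text after the final edit span.
        return generated.rsplit(marker, 1)[-1].strip()
    return generated.strip()

print(strip_edit_markers(tokenizer.decode(generated_ids[0], skip_special_tokens=True)))
# class HelloWorld { public static void main(String[] args) { System.out.println("Hello, World!") ; } } ;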

Introduction

This repo contains the code and artifacts for reproducing the experiments in Multilingual Code Co-Evolution Using Large Language Models. In this work, we introduce Codeditor for co-evolving software implemented in multiple programming languages.

The code includes:

  • scripts for processing the datasets
  • scripts for training and evaluating Codeditor models

The artifacts include:

  • Java to C# raw paired changes
  • Java to C# translation dataset processed for Codeditor models

Data Downloads

All our data is hosted on UTBox via a shared folder.

Code for Processing Fine-tuning Data

We provide a sample script to process the datasets for edit-translation. It requires the raw data files to be placed at raw_data/.

cd python/
python -m deltr.collector.DataProcessor edit_translation_data_process --exp cs2java --src_lang cs --tgt_lang java
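
The same script should handle the reverse direction by swapping the language arguments; a hedged example for java2cs (assuming java2cs is an accepted --exp value, matching the dataset names used elsewhere in this README):

cd python/
python -m deltr.collector.DataProcessor edit_translation_data_process --exp java2cs --src_lang java --tgt_lang cs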

Code for Training and Evaluating Models

Train ML models

cd python/
python -m deltr.coditT5.CodeT5 fit --exp_dir ${MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml

# Example: python -m deltr.coditT5.CodeT5 fit --exp_dir models/edit-translation/java2cs --data.dataset java2cs --data.model edit-translation --config configs/coditT5.yaml

Results are written to models/${model_name}/${dataset}/, where:

  • model/: stores the trained model.

  • logs/: stores logs during training.
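
For the example training run above, the resulting layout would presumably look like:

models/edit-translation/java2cs/
  model/    # the trained model
  logs/     # training logs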

Run ML models to do inference

Requires the dataset at data/${model_name}/${dataset}/ and the trained model at models/${model_name}/${dataset}/model/.

cd python/
python -m deltr.coditT5.CodeT5 predict --exp_dir ${MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml

Results are written to models/${model_name}/${dataset}/, where:

  • output.hyp: the predictions.
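
For instance, mirroring the training example above, a prediction run for the java2cs edit-translation model would presumably be (paths assumed to match the training step):

# Example: python -m deltr.coditT5.CodeT5 predict --exp_dir models/edit-translation/java2cs --data.dataset java2cs --data.model edit-translation --config configs/coditT5.yaml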
