Year-Id Title Venue Name
2023-3 Textbooks Are All You Need arXiv
2023-2 Data Quality Matters: A Case Study of Obsolete Comment Detection ICSE
2023-1 Data Quality for Software Vulnerability Datasets ICSE
2022-3 On the Importance of Building High-quality Training Datasets for Neural Code Search ICSE
2022-2 Data smells in public datasets CAIN
2022-1 Data smells: categories, causes and consequences, and detection of suspicious data in AI-based systems CAIN
2021-1 Data Quality Matters: A Case Study on Data Label Correctness for Security Bug Report Prediction TSE

SE Dataset Papers

Year-Id Title Venue Name
2023-24 Evaluating software user feedback classifier performance on unseen apps, datasets, and metadata. ESE
2023-23 JEMMA: An extensible Java dataset for ML4Code applications. ESE
2023-22 The software heritage license dataset (2022 edition). ESE
2023-21 Data Quality for Software Vulnerability Datasets. ICSE
2023-20 On the Reproducibility of Software Defect Datasets. ICSE
2023-19 An Automated and Flexible Multilingual Bug-Fix Dataset Construction System. ASE
2023-18 BugMiner: Automating Precise Bug Dataset Construction by Code Evolution History Mining. ASE
2023-17 Compsuite: A Dataset of Java Library Upgrade Incompatibility Issues. ASE
2023-16 DeepScenario: An Open Driving Scenario Dataset for Autonomous Driving System Testing. MSR
2023-15 PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages. MSR
2023-14 NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python. MSR
2023-13 microSecEnD: A Dataset of Security-Enriched Dataflow Diagrams for Microservice Applications. MSR
2023-12 SecretBench: A Dataset of Software Secrets. MSR
2023-11 Defectors: A Large, Diverse Python Dataset for Defect Prediction. MSR
2023-10 DocMine: A Software Documentation-Related Dataset of 950 GitHub Repositories. MSR
2023-9 DACOS - A Manually Annotated Dataset of Code Smells. MSR
2023-8 A Dataset of Bot and Human Activities in GitHub. MSR
2023-7 Snapshot Testing Dataset. MSR
2023-6 LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations. MSR
2023-5 GitHub OSS Governance File Dataset. MSR
2023-4 CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models. FSE
2023-3 npm-follower: A Complete Dataset Tracking the NPM Ecosystem. FSE
2023-2 Improving Fine-tuning Pre-trained Models on Small Source Code Datasets via Variational Information Bottleneck. SANER
2023-1 CILIATE: Towards Fairer Class-Based Incremental Learning by Dataset and Training Refinement. ISSTA
2022-33 A large-scale empirical study of commit message generation: models, datasets and evaluation. ESE
2022-32 Making the Most of Small Software Engineering Datasets With Modern Machine Learning. TSE
2022-31 An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets. TOSEM
2022-30 Assessing and Improving an Evaluation Dataset for Detecting Semantic Code Clones via Deep Learning. TOSEM
2022-29 On the Importance of Building High-quality Training Datasets for Neural Code Search. ICSE
2022-28 Robust Learning of Deep Predictive Models from Noisy and Imbalanced Software Engineering Datasets. ASE
2022-27 Towards Robust Models of Code via Energy-Based Learning on Auxiliary Datasets. ASE
2022-26 Which bugs are missed in code reviews: An empirical study on SmartSHARK dataset. MSR
2022-25 An Alternative Issue Tracking Dataset of Public Jira Repositories. MSR
2022-24 ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction. MSR
2022-23 ReCover: a Curated Dataset for Regression Testing Research. MSR
2022-22 DISCO: A Dataset of Discord Chat Conversations for Software Engineering Research. MSR
2022-21 SOSum: A Dataset of Stack Overflow Post Summaries. MSR
2022-20 ManyTypes4TypeScript: A Comprehensive TypeScript Dataset for Sequence-Based Type Inference. MSR
2022-19 METHODS2TEST: A dataset of focal methods mapped to test cases. MSR
2022-18 The Unsolvable Problem or the Unheard Answer? A Dataset of 24,669 Open-Source Software Conference Talks. MSR
2022-17 DaSEA - A Dataset for Software Ecosystem Analysis. MSR
2022-16 GitDelver Enterprise Dataset (GDED): An Industrial Closed-source Dataset for Socio-Technical Research. MSR
2022-15 AndroOBFS: Time-tagged Obfuscated Android Malware Dataset with Family Information. MSR
2022-14 TriggerZoo: A Dataset of Android Applications Automatically Infected with Logic Bombs. MSR
2022-13 Vul4J: A Dataset of Reproducible Java Vulnerabilities Geared Towards the Study of Program Repair Techniques. MSR
2022-12 TwinDroid: A Dataset of Android app System call traces and Trace Generation Pipeline. MSR
2022-11 Constructing Dataset of Functionally Equivalent Java Methods Using Automated Test Generation Techniques. MSR
2022-10 A Time Series-Based Dataset of Open-Source Software Evolution. MSR
2022-9 A Versatile Dataset of Agile Open Source Software Projects. MSR
2022-8 FixJS: A Dataset of Bug-fixing JavaScript Commits. MSR
2022-7 A Large-scale Dataset of (Open Source) License Text Variants. MSR
2022-6 Lighting up supervised learning in user review-based code localization: dataset and benchmark. FSE
2022-5 Python-by-contract dataset. FSE
2022-4 RegMiner: mining replicable regression dataset from code repositories. FSE
2022-3 PANDORA: Continuous Mining Software Repository and Dataset Generation. SANER
2022-2 CoolTeD: A Web-based Collaborative Labeling Tool for the Textual Dataset. SANER
2022-1 RegMiner: towards constructing a large regression dataset from code evolution history. ISSTA
2021-16 Are datasets for information retrieval-based bug localization techniques trustworthy? ESE
2021-15 GreenHub: a large-scale collaborative dataset to battery consumption analysis of android devices. ESE
2021-14 AndroidCompass: A Dataset of Android Compatibility Checks in Code Repositories. MSR
2021-13 Duets: A Dataset of Reproducible Pairs of Java Library-Clients. MSR
2021-12 KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle. MSR
2021-11 A Traceability Dataset for Open Source Systems. MSR
2021-10 The Wonderless Dataset for Serverless Computing. MSR
2021-9 Andromeda: A Dataset of Ansible Galaxy Roles and Their Evolution. MSR
2021-8 ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference. MSR
2021-7 QScored: A Large Dataset of Code Smells and Quality Metrics. MSR
2021-6 Apache Software Foundation Incubator Project Sustainability Dataset. MSR
2021-5 AndroR2: A Dataset of Manually-Reproduced Bug Reports for Android apps. MSR
2021-4 GE526: A Dataset of Open-Source Game Engines. MSR
2021-3 EqBench: A Dataset of Equivalent and Non-equivalent Program Pairs. MSR
2021-2 CrossVul: a cross-language vulnerability dataset with commit data. FSE
2021-1 Is the Ground Truth Really Accurate? Dataset Purification for Automated Program Repair. SANER
2020-18 A Framework and DataSet for Bugs in Ethereum Smart Contracts. ICSME
2020-17 Defining a Software Maintainability Dataset: Collecting, Aggregating and Analysing Expert Evaluations of Software Maintainability. ICSME
2020-16 Towards Robust Production Machine Learning Systems: Managing Dataset Shift. ASE
2020-15 The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History. MSR
2020-14 An Exploratory Study to Find Motives Behind Cross-platform Forks from Software Heritage Dataset. MSR
2020-13 RTPTorrent: An Open-source Dataset for Evaluating Regression Test Prioritization. MSR
2020-12 A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. MSR
2020-11 A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits. MSR
2020-10 A Dataset for GitHub Repository Deduplication. MSR
2020-9 A Dataset of Dockerfiles. MSR
2020-8 A Dataset of Enterprise-Driven Open Source Software. MSR
2020-7 A Mixed Graph-Relational Dataset of Socio-technical Interactions in Open Source Systems. MSR
2020-6 On the Shoulders of Giants: A New Dataset for Pull-based Development Research. MSR
2020-5 Dataset of Video Game Development Problems. MSR
2020-4 GitterCom: A Dataset of Open Source Developer Communications in Gitter. MSR
2020-3 How Often Do Single-Statement Bugs Occur?: The ManySStuBs4J Dataset. MSR
2020-2 TestRoutes: A Manually Curated Method Level Dataset for Test-to-Code Traceability. MSR
2020-1 Cross-Dataset Design Discussion Mining. SANER




| Paper Id | Title | Venue | Year | Target Task | Task Description | Used Data | Used LLMs | Replication Package |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Backdooring Neural Code Search | ACL | 2023 | Code search | Given a natural language description (query), the code search task is to return related code snippets from a large code corpus. | CodeSearchNet | CodeBERT, CodeT5 | |
| 2 | Multi-target Backdoor Attacks for Code Pre-trained Models | ACL | 2023 | Defect detection | Predict whether the input code is vulnerable or not. | CodeXGLUE | PLBART, CodeT5 | |
| | | | | Clone detection | Predict whether two programs are semantically equivalent. | | | |
| | | | | Code2Code translation | Translate a piece of Java (C#) code into its C# (Java) version. | | | |
| | | | | Text2Code | Generate the source code of class member functions in Java given the natural language description as well as the class context. | | | |
| | | | | Code refinement | Fix a piece of buggy Java code and generate its refined version. | | | |
| 3 | CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning | WWW | 2022 | Code generation | Generate source code based on a natural language description. | CodeSearchNet | DeepCS, GPT-2, NCS-T | |
| | | | | Code search | Retrieve the related code snippets from a codebase given a natural language query. | | | |
| | | | | Code summarization | Summarize the code snippet into a summary sentence that describes its functionality. | | | |
| 4 | You See What I Want You to See: Poisoning Vulnerabilities in Neural Code Search | FSE | 2022 | Code search | Input: a natural language description (query). Output: related code snippets from a large code corpus. | CodeSearchNet | BiRNN, CodeBERT | |
| 5 | You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion | USENIX | 2021 | Code completion | Input: previous k tokens. Output: next token. | 2800 repositories from GitHub | GPT-2 | |




Year-Id Title Venue Name
2023-1 Inconsistent Defect Labels: Essence, Causes, and Influence TSE
2021-1 Deep Just-In-Time Inconsistency Detection Between Comments and Source Code AAAI
2019-1 A Large-Scale Empirical Study on Code-Comment Inconsistencies ICPC

