It's impossible to have trustworthy AI without good data for AI models to learn.
The vision of SEEDGuard is to provide a platform for researchers and practitioners to share and discuss data-centric methods for improving the quality of software engineering datasets.
The quality of software engineering datasets is crucial for the success of data-driven software engineering research. However, the quality of software engineering datasets is often overlooked.
In this project, we aim to develop a data-centric library for researchers and practitioners (especially LLM developers) for improving the quality of software engineering datasets.
Btw, SEEDGuard is short for Software EnginEEring Data Guard.
The workflow of SEEDGuard.AI is shown in the following figure:
Similar to the data in other domains, the data in software engineering also suffers from various quality issues. For example, limited high-quality labeled data, data privacy issues, data imbalance, etc. Moreover, SE data also has its own unique quality issues especially related to code. For example, code used for training LLM can be poisoned to mislead developers to use insecure code.
We are actively expanding the list of SE data quality issues. If you have any suggestions, please feel free to open an issue or pull request. Currently, we mainly focus on the following SE data quality issues:
- SE Data Security
- SE Data Quality Assessment
- SE Data Augmentation
We're excited that you're interested in contributing to SEEDGuard.AI! This document outlines the process for contributing to our project. Your contributions can make a real difference, and we appreciate every effort you make to help improve this project.
We will be always happy to help for any problem or difficulties you may face during the contribution process.
Follow the guidelines provided in CONTRIBUTING file for further information.
- Identify your target
Based on your own interests, you may start in the following 2 different ways:
If you are interested in a specific dataset (you can find many datasets here), your can:
- find the corresponding documentation about the dataset to know more about how the dataset was built
- based on your understanding, decide which data-centric method you want to apply to the dataset which will help to improve the dataset quality
If you are interested in a specific data-centric method, you can:
- identify a related research paper about a data-centric method which links to a specific data aspect (such as data security, data augmentation) Data-centric LLM4SE Paper Repo
- find the specific dataset (with its documentation) mentioned in the paper
In short, at the end of this step, you should have a clear idea about:
- which dataset
- which data-centric method
- which data aspect
- how to evaluate the method
- Integrate the specifc data-centric method
Once you manage to find a data-centric method fits your interests, you can either choose to reuse the replication package released by the original authors or implement the method by yourself. In both cases, you should be able to integrate the method into our project.
One important thing to note is that you should pack the method into a docker image. We provide a standard docker image template in the docker
folder. You can find more details about how to build the docker image in the docker
folder.
- Evaluate the method
We provide a standard evaluation framework for evaluating the data-centric methods. You can find the evaluation framework in the evaluation
folder. Please be aware that you need to standardize your input and output format by following the requirement of our evaluation framework.
If you have any questions, please feel free to contact us via email [email protected] or open an issue.
MIT © SEEDGuard