This repository contains the source code for our paper "Broken Promises: Measuring Confounding Effects in Learning-based Vulnerability Discovery", which was accepted at AISec '23.
Experiments regarding the Causal Graph Model reside in CGIN, while experiments using the StackLSTM are in StackLSTM. Experiments using CodeT5+ and LineVul are in LLM. The directory Perturbations contains scripts to apply obfuscation and styling to obtain the perturbed training data. The experiments using the graph-based model ReVeal were performed using this repository. We used and modified the original code from both LineVul and CodeT5 for finetuning all our LLM models.
For LLM:
- Python
- PyTorch
- transformers
- datasets
- scikit-learn
- numpy
- matplotlib
- tqdm
- tree_sitter
- sacrebleu==1.2.11
For CGIN:
- Python
- PyTorch
- PyTorch Geometric
- torch_scatter
- numpy
- networkx
- scikit-learn
- tqdm
- gensim
For StackLSTM:
- Python
- PyTorch
- tqdm
- sctokenizer
- scikit-learn
- pickle
- torchray
- stacknn
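The requirement lists above can be checked against the active Python environment before running any experiment. The following is a minimal sketch, not part of the repository: the `REQUIREMENTS` mapping and the `missing_modules` helper are hypothetical names, and the import names (for example `sklearn` for scikit-learn, `torch_geometric` for PyTorch Geometric) are assumptions derived from the package names listed above.

```python
from importlib.util import find_spec

# Assumed import names for the packages listed per component.
# These are illustrative; adjust them if a package exposes a different module name.
REQUIREMENTS = {
    "LLM": [
        "torch", "transformers", "datasets", "sklearn", "numpy",
        "matplotlib", "tqdm", "tree_sitter", "sacrebleu",
    ],
    "CGIN": [
        "torch", "torch_geometric", "torch_scatter", "numpy",
        "networkx", "sklearn", "tqdm", "gensim",
    ],
    "StackLSTM": [
        "torch", "tqdm", "sctokenizer", "sklearn",
        "pickle", "torchray", "stacknn",
    ],
}


def missing_modules(modules):
    """Return the names in `modules` that cannot be located in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    # Report, per component, which modules could not be found.
    for component, modules in REQUIREMENTS.items():
        missing = missing_modules(modules)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{component}: {status}")
```

This check only tests whether a module can be found; it does not verify versions, so the `sacrebleu==1.2.11` pin for LLM still has to be confirmed separately.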
For Perturbations (generating the perturbed dataset):
- Download the file from here and move it to the folder Perturbations
- Download the file from here and also move it to the folder Perturbations
Create an issue in this repository if you find a bug or have questions about the content.
For additional support, ask a question in SAP Community.
If you wish to contribute code, offer fixes or improvements, please send a pull request. Due to legal reasons, contributors will be asked to accept a DCO when they create the first pull request to this project. This happens in an automated fashion during the submission process. SAP uses the standard DCO text of the Linux Foundation.
Copyright (c) 2023 SAP SE or an SAP affiliate company. All rights reserved. This project is licensed under the Apache Software License, version 2.0 except as noted otherwise in the LICENSE file.