Malware detection with added explainability through saliency maps on Android APKs, using PyTorch and Androguard.
To run the code you need to create a conda environment. After cloning the repository, run the following commands:

```bash
conda env create --file ./conda-package-list.yml
conda activate malware_detection_research
```
🗒️ Note: we used Androguard 3.3.5 rather than 4.0.2 because of a bug where the submodules are not recognized by Pylance.
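For illustration, a minimal sketch of the Androguard 3.3.5 import path this project relies on; `sample.apk` is a placeholder path:

```python
# Androguard 3.3.5 import path; in 4.x these submodules were reorganized.
from androguard.core.bytecodes.apk import APK

apk = APK("sample.apk")        # placeholder APK path
print(apk.get_package())       # application package name
print(apk.get_permissions())   # requested Android permissions
```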
⚠️ Warning: this repository is developed as a Python package. To run the scripts, you need to be at the root of the repository and use the syntax `python -m folder.script`. For example, to run the `apk_to_image.py` script with the required parameters:

```bash
python -m pre_processing.apk_to_image -t RGB -p random -e jpg
```
This repository contains multiple files that can be used to train a model, test it, and generate saliency maps.
🗒️ Note: you can use the `start_training.sh` script and modify it to your needs; it runs all the scripts in the correct order.
The usual workflow is:
1. Insert the APKs in the `_dataset` folder, in subdirectories corresponding to their nature. For example, our experiments used two datasets:

   - Random-split dataset: `30k_dataset`, with four subdirectories: `Goodware_Obf`, `Goodware_NoObf`, `Malware_Obf`, `Malware_NoObf`.
   - Time-split dataset: `71k_dataset`, with the same four subdirectories (`Goodware_Obf`, `Goodware_NoObf`, `Malware_Obf`, `Malware_NoObf`), but inside each of them the APKs are sorted by period: `2022_01`, ..., `2022_12`, `2023_01`, ..., `2023_12`, etc.

   The subdirectories are used to create the dataset and the labels. Their names are hardcoded in the code, so you may need to change them if you want to use different names (see `model_training/train_test_model.py`).

   ⚠️ Warning: if you have a time-based dataset, you need a .csv file with a `hash` column to identify each APK, a `num_antivirus_malicious` column (an integer) giving the number of detections the APK has on VirusTotal, a `first_submission_date` column with the date of the APK's first submission to VirusTotal in the format `DD-MM-YYYY hh:mm`, and an `obfuscated` column (0 or 1) indicating whether the APK is obfuscated. All the scripts to manipulate the dataset are in the `pre_processing/dataset_manipulation/` folder. To organize the dataset, first run `pre_processing/dataset_manipulation/sort_by_period.py` and then `pre_processing/dataset_manipulation/select_data_by_period.py`; the parameters are described in the script usage section. (A sketch of loading this metadata file appears after this list.)
2. Run `pre_processing/apk_to_image.py` to transform the APKs into images. You have to specify the image type (RGB or BW), the padding type (random, black, white), and the extension (jpg or png). For example:

   ```bash
   python -m pre_processing.apk_to_image -t RGB -p random -e jpg
   ```

   This creates a folder in `_images` corresponding to the chosen conversion, with the nature of the APKs as subdirectories. For example, with RGB and random padding you get the following structure: `_images/Goodware_Obf_RGB_random/{apk_name}.jpg`. (The byte-to-image idea is sketched after this list.)

   If you have a time-based dataset, add the `--time_based` flag. The images will then be created using the `train` and `test` directories already produced by the `sort_by_period.py` script:

   ```bash
   python -m pre_processing.apk_to_image -t RGB -p random -e jpg --time_based
   ```
3. Run `pre_processing/create_train_test.py` to automatically create the `train` and `test` directories with the given ratio (default 80:20 train/test):

   ```bash
   python -m pre_processing.create_train_test -r 0.8
   ```

   If you have a time-based dataset, add the `--time_based` flag. The images will then be sorted using the already created `train` and `test` directories, without a random split.
Run
model_training/train_test_models.py
to train and test ResNet18 and ResNet50 model on multiple epochs. You have to specify the type of images (RGB or BW), the padding type (random, black, white) and the extension of the image (jpg or png) so the script knows where to find the images. Here's an example on how to run the script:python -m model_training.train_test_models -t RGB -p random -ex jpg
This will save the model in the
_models
folder. The name of the model will be{model_name}_{type}_{padding_type}_padding_{extension}.pth
. For example, if you chose RGB and random padding, you will have the following model:model_training/_models/resnet18_RGB_random_10_epochs_jpg.pth ... model_training/_models/resnet18_RGB_random_50_epochs_jpg.pth ... model_training/_models/resnet50_RGB_random_10_epochs_jpg.pth ... model_training/_models/resnet50_RGB_random_50_epochs_jpg.pth
5. Run `visualization/saliency.py` to generate the saliency maps. You have to specify the image type (RGB or BW), the padding type (random, black, white), the model name (resnet18, resnet50), and the number of epochs (10, 20, ...). For example:

   ```bash
   python -m visualization.saliency -t RGB -p random -mn resnet18 -e 5 -ex jpg
   ```

   This saves the saliency maps in the `_saliency_maps` folder. (The gradient-based recipe behind these maps is sketched below.)
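As referenced in step 1, here is a minimal sketch of loading and validating the time-based metadata file with pandas. The file name `metadata.csv` and the example filter are assumptions for illustration; only the column names and the date format come from the requirements above.

```python
import pandas as pd

# Placeholder path; the required columns are described in step 1.
df = pd.read_csv("metadata.csv")

required = {"hash", "num_antivirus_malicious", "first_submission_date", "obfuscated"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"metadata file is missing columns: {missing}")

# Parse the VirusTotal first-submission date (format DD-MM-YYYY hh:mm).
df["first_submission_date"] = pd.to_datetime(
    df["first_submission_date"], format="%d-%m-%Y %H:%M"
)

# Hypothetical filter: obfuscated APKs with at least one detection, seen before 2023.
subset = df[(df["obfuscated"] == 1)
            & (df["num_antivirus_malicious"] > 0)
            & (df["first_submission_date"] < "2023-01-01")]
print(len(subset), "APKs selected")
```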
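As referenced in step 2, the general byte-to-image idea behind this kind of conversion can be sketched as follows. This is not the repository's `apk_to_image.py` implementation; the square layout and the random tail padding are illustrative assumptions mirroring the `-p random` option.

```python
import math

import numpy as np
from PIL import Image

def apk_bytes_to_rgb(apk_path: str, out_path: str) -> None:
    """Illustrative sketch: map the raw APK bytes to a square RGB image,
    padding the tail with random bytes (cf. the -p random option)."""
    data = np.fromfile(apk_path, dtype=np.uint8)

    # Side length of the smallest square holding all bytes (3 per pixel).
    side = math.ceil(math.sqrt(len(data) / 3))
    total = side * side * 3

    # Random padding so the byte stream fills the square exactly.
    pad = np.random.randint(0, 256, total - len(data), dtype=np.uint8)
    pixels = np.concatenate([data, pad]).reshape(side, side, 3)

    Image.fromarray(pixels, mode="RGB").save(out_path)
```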
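As referenced in step 4, here is a minimal PyTorch/torchvision fine-tuning skeleton of the kind of loop `train_test_models.py` runs. The `_images/train` layout, batch size, learning rate, and epoch count are assumptions rather than the script's actual values, and the `weights` argument requires torchvision ≥ 0.13.

```python
import torch
from torch import nn
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed layout: _images/train/<class>/... (one folder per label).
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("_images/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Pretrained ResNet18 with the final layer resized to our classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):  # illustrative epoch count
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```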
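As referenced in step 5, the maps follow the classic gradient-based saliency recipe (Simonyan et al.): take the gradient of the winning class score with respect to the input pixels. A hedged sketch, independent of the repository's exact implementation:

```python
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Gradient of the top class score w.r.t. the input pixels.

    `image` is a (C, H, W) tensor already normalized for the model.
    """
    model.eval()
    x = image.detach().unsqueeze(0)  # add a batch dimension
    x.requires_grad_(True)

    scores = model(x)  # shape (1, num_classes)
    scores[0, scores.argmax()].backward()

    # Per-pixel importance: max absolute gradient over the channels.
    return x.grad.abs().squeeze(0).max(dim=0).values
```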
If you want to see the results of our experiments with TensorBoard (the built-in VSCode way doesn't work with WSL), you can run the following command:

```bash
tensorboard --logdir=model_training/runs
```

⚠️ Warning: you must be at the root of the repository to run this command.
🗒️ Note: you can use `tensorboard` in the CLI if you're using a precompiled TensorFlow package (e.g. installed via pip). See here for more details.
This project is based on the following papers:
[1] *Obfuscation detection for Android applications*: we used `create_image.py` and `map-saturation.png` to transform the APKs into images while developing our own method.
[2] *Fast adversarial training using FGSM*, based on the paper "Fast is better than free: Revisiting adversarial training" by Wong et al.: we used `fast_adversarial.py` to train our model.
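For context, the single FGSM step at the heart of fast adversarial training looks like the sketch below. This is a generic illustration, not the contents of `fast_adversarial.py`; the function name and the `eps` default are hypothetical.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, images, labels, eps=8 / 255):
    """One FGSM step: shift every pixel by eps in the direction
    (sign of the input gradient) that increases the loss."""
    delta = torch.zeros_like(images, requires_grad=True)
    loss = F.cross_entropy(model(images + delta), labels)
    loss.backward()
    # Wong et al. additionally initialize delta uniformly at random
    # in [-eps, eps] before this step; omitted here for brevity.
    return (images + eps * delta.grad.sign()).clamp(0, 1)
```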