An Airflow-based pipeline designed for the continuous analysis of GitHub repositories using the Arcan tool.
The primary goal of this pipeline is to regularly conduct software analysis using the Arcan tool on a diverse dataset of GitHub projects. This effort aims to expand the dataset that Arcan uses for project classification. The pipeline is responsible for the following tasks:
- Detecting changes in the source code of GitHub projects, via the GitHub REST API, to determine which projects need to be analyzed (see the sketch after this list).
- Running Arcan to perform reverse engineering on the projects and storing the output in a database.
- Executing Arcan to analyze the technical debt of the projects and storing the output in a database.
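For illustration, the hypothetical sketch below shows the kind of change-detection call the pipeline performs: it asks the GitHub REST API for the most recent commit of a repository, whose SHA can be compared with the value stored in the database. The actual logic lives in `gitHubRepository.py` and may use different endpoints or criteria.

```python
import requests

GITHUB_API = "https://api.github.com"


def latest_commit_sha(owner: str, repo: str, token: str) -> str:
    """Return the SHA of the newest commit on the repository's default branch.

    Comparing this value with the SHA stored in the database is one way to
    decide whether a project has changed since its last analysis.
    """
    response = requests.get(
        f"{GITHUB_API}/repos/{owner}/{repo}/commits",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        params={"per_page": 1},  # only the newest commit is needed
        timeout=30,
    )
    response.raise_for_status()
    return response.json()[0]["sha"]
```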
The general workflow can be divided into two separate and independent pipelines, each consisting of a set of well-defined steps.
The first pipeline, Ingestion, is responsible for identifying, for each project in the database, the new versions available on GitHub and selecting those that need to be analyzed. It then stores the selected versions in the database, ensuring that the list of versions available for subsequent analysis remains up to date.
The second pipeline, Execution, is responsible for executing the parsing process, when necessary, and analyzing the versions that were selected by the first pipeline.
Both pipelines are dynamically generated based on a subset of projects and versions selected from the database.
The foundation for this implementation is the Docker image of Apache Airflow. The Directed Acyclic Graphs (DAGs) that represent the Ingestion and Execution pipelines have been implemented in Python. The logic for the pipeline flow, the scheduling settings, and the error-handling configuration are found in `inception.py` and `execution.py` in the `Dag` directory. These files define the tasks and their dependencies.
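As a rough illustration of how such a dynamically generated DAG can be wired together, here is a minimal sketch; the task names, placeholder callables, and schedule are assumptions and do not reflect the actual contents of `execution.py`.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def select_versions():
    # Placeholder: the real pipeline reads the selected versions from MySQL
    # through the modules in Dag/utilities.
    return ["1.0.0", "1.1.0"]


def parse_version(version, **_):
    print(f"parsing {version}")    # placeholder for the Arcan parsing step


def analyze_version(version, **_):
    print(f"analyzing {version}")  # placeholder for the Arcan analysis step


with DAG(
    dag_id="execution_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(hours=1),  # illustrative "continuous" schedule
    catchup=False,
) as dag:
    # One parse/analyze pair is generated dynamically for every selected version.
    for version in select_versions():
        parse = PythonOperator(
            task_id=f"parse_{version}",
            python_callable=parse_version,
            op_kwargs={"version": version},
        )
        analyze = PythonOperator(
            task_id=f"analyze_{version}",
            python_callable=analyze_version,
            op_kwargs={"version": version},
        )
        parse >> analyze  # parsing must complete before the analysis runs
```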
The logic for individual tasks is implemented in functions found in the `tasksFunction.py` file within the `Dag/utilities` directory. Modules have been created for interacting with services such as MySQL, the GitHub REST API, the Docker Engine, and the file system; these modules are also located in the `Dag/utilities` directory. Specifically, `mySqlGateway.py` manages database access operations, `gitHubRepository.py` handles communication with the GitHub REST API, `dockerRunner.py` takes care of Docker container execution, and `fileManager.py` handles file system access.
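As an illustration of the Docker interaction, the sketch below uses the Docker SDK for Python to run an Arcan container; the image name, command, and mount points are assumptions rather than the actual contents of `dockerRunner.py`.

```python
import docker


def run_arcan(image: str, command: list, project_dir: str, output_dir: str) -> str:
    """Run an Arcan container and return its log output.

    The mount points and command are placeholders; the real dockerRunner.py
    may configure the container differently.
    """
    client = docker.from_env()
    logs = client.containers.run(
        image=image,
        command=command,
        volumes={
            project_dir: {"bind": "/project", "mode": "ro"},
            output_dir: {"bind": "/output", "mode": "rw"},
        },
        remove=True,   # clean up the container once it exits
        stdout=True,
        stderr=True,
    )
    return logs.decode("utf-8")
```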
Before proceeding with the installation, ensure you have the following prerequisites:
- Docker Community Edition (CE) installed on your workstation.
- Docker Compose version 1.29.1 or newer installed on your workstation.
- An initialized MySQL server.
- The Arcan tool's Docker image available on your workstation.
To install the platform:
- Clone this GitHub repository to a directory on your computer.
- Create a `.env` file based on the provided `.env.example` file, entering the necessary data.
- Run `docker compose build` in your terminal.
- Execute `docker compose up airflow-init` in your terminal.
- Start the platform by running `docker compose up`.
To configure Airflow:
- Access the Airflow web interface.
- Create a new connection with the ID `mysql` in the connection management section, entering the relevant MySQL server details.
- Create the variables `git_token` and `git_username` in the variable management section and populate them with the appropriate values.
- Create the `docker_run_pool` in the pool management section, specifying as its number of slots the maximum number of Docker containers that can run in parallel.
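The DAG code consumes these settings roughly as in the following sketch; the query, task ids, and callables are illustrative and do not reflect the actual implementation.

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator
from airflow.providers.mysql.hooks.mysql import MySqlHook


def list_projects():
    # Uses the connection created in the Airflow UI with the ID "mysql".
    hook = MySqlHook(mysql_conn_id="mysql")
    return hook.get_records("SELECT id, url FROM project")  # illustrative query


def run_arcan(**_):
    # The variables created in the UI authenticate calls to the GitHub API.
    token = Variable.get("git_token")
    username = Variable.get("git_username")
    print(f"running Arcan for projects fetched as {username}")  # token would be passed to the GitHub client here


with DAG(
    dag_id="configuration_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    select = PythonOperator(task_id="select_projects", python_callable=list_projects)

    # Assigning the task to the pool caps how many such tasks, and therefore
    # how many Docker containers, run at the same time.
    analyze = PythonOperator(
        task_id="run_arcan",
        python_callable=run_arcan,
        pool="docker_run_pool",
    )

    select >> analyze
```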
To restart the platform:
- Execute `docker compose down` in the terminal.
- Restart the platform by running `docker compose up`.
The Ingestion pipeline runs daily on a subset of repositories, while the Execution pipeline runs continuously on a subset of versions. For details on using the Airflow web interface, please refer to the official documentation.
Please be aware of the following limitations:
- The Execution pipeline is designed to run on a single machine and relies on an external shared volume for the projects' source code and the results of Arcan executions. This design limits scalability; to scale out, the project would have to be cloned from its GitHub repository inside both the parsing and analysis tasks instead.
- There's a limitation of 1024 concurrent tasks in Airflow, with a maximum of 5 Docker containers running simultaneously.
- In the Ingestion pipeline, the rate limit of the GitHub REST API can affect the pipeline's responsiveness. Consider using GitHub WebHooks for real-time notifications.
- In the Execution pipeline, the analysis and parsing tasks are limited to 4 hours, and some tasks may fail due to resource constraints.
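Limits of this kind are typically expressed at the task level; the sketch below is a hypothetical example of how the 4-hour timeout and the container pool could be attached to an analysis task.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="limits_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    # The pool (configured with 5 slots) caps the number of concurrent Docker
    # containers, while execution_timeout stops a parsing or analysis task
    # that runs for more than 4 hours.
    analysis = PythonOperator(
        task_id="arcan_analysis",
        python_callable=lambda: None,          # placeholder callable
        pool="docker_run_pool",
        execution_timeout=timedelta(hours=4),
        retries=1,                             # illustrative retry on resource-related failures
    )
```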