Commit

docs: update documentation

sjungling committed Dec 19, 2024
1 parent e4dbfd2 commit ce3ead3
Showing 2 changed files with 106 additions and 240 deletions.
180 changes: 44 additions & 136 deletions LOCAL_INSTALL.md
@@ -1,152 +1,60 @@
# Local Installation Guide

This project requires Python 3.12.x and manages dependencies using [`pyproject.toml`](pyproject.toml). Below are several methods to set up your environment.

## Prerequisites

Please ensure you have the following tools installed on your system:

* Python 3.12 (newer versions will not work)
  * [pyenv](https://github.com/pyenv/pyenv) is recommended
  * [Homebrew installation](https://formulae.brew.sh/formula/python@3.12)
  * [Official Python installer](https://www.python.org/downloads/)
* [Git](https://git-scm.com/downloads)

## Using System Python with `venv`

1. Ensure Python 3.12.x is installed:

   ```bash
   python --version
   ```

2. Create a virtual environment:

   ```bash
   python -m venv .venv
   ```

3. Activate the virtual environment:

   ```bash
   # On Unix/macOS
   source .venv/bin/activate
   # On Windows
   .venv\Scripts\activate
   ```

4. Install dependencies:

   ```bash
   # Installs the project and the dependencies declared in pyproject.toml
   pip install .
   ```

## Instructions

### Step 1: Clone this project

```bash
git clone git@github.com:moderneinc/moderne-cluster-build-logs.git
cd moderne-cluster-build-logs
```

### Step 2: Set up the Python virtual environment

You will be creating a server and running clustering inside of a Python virtual environment. To create said environment, please run:

```bash
## Pick the one that applies to your system
python -m venv venv

## For Mac or Linux users
source venv/bin/activate

## For Windows users
venv\Scripts\activate
```

After running the `source` command, you should see that you're in a Python virtual environment.

## Using [uv](https://docs.astral.sh/uv/) (Fast Python Package Installer)

1. [Install uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods):

   ```bash
   pip install uv
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   uv venv
   source .venv/bin/activate  # or .venv\Scripts\activate on Windows
   uv pip install -r pyproject.toml
   ```

## Using DevContainer

This project includes DevContainer configuration for VS Code:

1. Install the [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) in VS Code
2. Open the project in VS Code
3. Click "Reopen in Container" when prompted, or use the command palette (F1) and select "Dev Containers: Reopen in Container"

The container will automatically set up Python 3.12 and install all dependencies.

### Step 3: Install dependencies

Double-check that `pip` is pointing to the correct Python version by running the following command. The output should include `python 3.12.X`. If it doesn't, try using `pip3` instead.

```bash
pip --version
```
Once you've confirmed which `pip` works for you, install dependencies by running the following command:

```bash
pip install -r requirements.txt
```
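
If both `pip` and `pip3` resolve to the wrong interpreter, another option (standard Python tooling, not specific to this project) is to invoke pip through the interpreter you want:

```bash
# Run pip via a specific interpreter to avoid PATH ambiguity
python3.12 -m pip --version
python3.12 -m pip install -r requirements.txt
```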

### Step 4: Download the model

Download the model which will assist with tokenizing and clustering of the build log data.

```bash
python scripts/download_model.py
```

### Step 5: Gather build logs

In order to perform an analysis on your build logs, all of them need to be copied over to this directory. Please ensure that they are copied over inside a folder named `repos`.

You will also need a `builds.xlsx` file that provides details about the builds, such as where the build logs are located, what the outcome was, and what the path to the project is. This file should exist inside of the `repos` directory.

Here is an example of what your directory should look like if everything was set up correctly:

```
moderne-cluster-build-logs
├───scripts
│ (4 files)
└───repos
│ builds.xlsx
├───Org1
│ ├───Repo1
│ │ └───main
│ │ build.log
│ │
│ └───Repo2
│ └───master
│ build.log
├───Org2
│ ├───Repo1
│ │ └───main
│ │ build.log
│ │
│ └───Repo2
│ └───master
│ build.log
└───Org3
└───Repo1
└───main
build.log
```
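
For example, here is a minimal sketch of copying logs into that layout; the source paths are placeholders for wherever your exported build logs actually live:

```bash
# Illustrative only: recreate the expected repos/ layout for one repository
mkdir -p repos/Org1/Repo1/main
cp /path/to/exported-logs/Org1/Repo1/main/build.log repos/Org1/Repo1/main/
cp /path/to/exported-logs/builds.xlsx repos/
```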


#### Using Moderne mass ingest logs

If you want to use Moderne's mass ingest logs to run these scripts, you may use the following script to download a sample.

```bash
python scripts/00.download_ingest_samples.py
```

You will be prompted to choose which of the slices you want to download. Enter the corresponding number and press `Enter`.


### Step 6: Run the scripts

_Please note these scripts won't function correctly if you haven't copied the logs and the `builds.xlsx` file into the `repos` directory inside the `Clustering` directory you're working out of._

**Run the following scripts in order**:

1. Load the logs and extract relevant error messages and stacktraces from the logs:

```bash
python scripts/01.load_logs_and_extract.py
```

_Please note that the loaded logs only include those generated from failures to build Maven or Gradle projects. You can open `builds.xlsx` if fewer logs are loaded than expected._

2. Embed logs and cluster:

```bash
python scripts/02.embed_summaries_and_cluster.py
```

### Step 7: Analyze the results

Once you've run the two scripts, you should find that `clusters_scatter.html` and `clusters_logs.html` files were produced. Open those in the browser of your choice to get detailed information about your build failures.
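
If your browser can't open the files directly, one option (the same approach the README uses) is to serve the directory over HTTP:

```bash
# Serve the generated HTML reports locally
python -m http.server 8080
# then open http://localhost:8080/clusters_scatter.html
# and http://localhost:8080/clusters_logs.html
```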

Success! You can now exit the Python virtual environment by running `deactivate`.

## Example results

Below you can see some examples of the HTML files produced by following the above steps.

### clusters_scatter.html

This file is a visual representation of the build failure clusters. Clusters that contain the most dots should generally be prioritized over ones that contain fewer. You can hover over the dots to see part of the build logs.

![expected_clusters](images/expected_clusters.gif)

### clusters_logs.html

To see the full extracted logs, you may use this file; it shows all of the logs that belong to each cluster.

![logs](images/expected_logs.png)
## Using Docker

Build and run the project using Docker:

```bash
# Build the image
docker build -t moderne-cluster-build-logs:latest .
```
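
A sketch of running the analysis from that image, mirroring the invocation shown in the README; the mounted paths are placeholders for your own directories:

```bash
# Run the containerized analysis against locally mounted logs
docker run --rm -it \
  -v <path_to_build_logs>:/app/logs \
  -v <path_to_output_directory>:/app/output \
  moderne-cluster-build-logs:latest \
  python analyze_logs.py --logs /app/logs --output /app/output
```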
166 changes: 62 additions & 104 deletions README.md
@@ -1,133 +1,91 @@
# Clustering build logs to analyze common build issues

When your company attempts to build [Lossless Semantic Trees (LSTs)](https://docs.moderne.io/administrator-documentation/moderne-platform/references/lossless-semantic-trees/) for all of your repositories, you may find that some of them do not build successfully. While you _could_ go through each of those by hand and attempt to figure out common patterns, there is a better way: [cluster analysis](https://en.wikipedia.org/wiki/Cluster_analysis).

You can think of cluster analysis as a way of grouping data into easily identifiable chunks. In other words, it can take in all of your build failures and then find what issues are the most common - so you can prioritize what to fix first.

This repository will walk you through everything you need to do to perform a cluster analysis on your build failures. By the end, you will have produced two HTML files:
1. [one that visually displays the clusters](#analysis_build_failureshtml)
2. [one that contains samples for each cluster](#cluster_id_reasonhtml).

> [!NOTE]
> Clustering is currently limited to Maven, Gradle, .NET, and Bazel builds because our heuristic-based extraction of build errors is specific to these build types. Although build failures for other types won't cause errors when clustering, the heuristic extraction may overlook valuable parts of the stack trace.

## Prerequisites

> [!NOTE]
> This repository contains a devcontainer specification; it is the recommended path to get set up, as it ensures a consistent developer experience. If you so choose, you can install the necessary components locally instead. Running without Docker might be faster if your local machine has GPU or Metal support. See [LOCAL_INSTALL.md](/LOCAL_INSTALL.md) for how to get started.

Please ensure you have the following tools installed on your system:

* A Devcontainer-compatible client (GitHub Codespaces, GitPod, DevPod, Docker Desktop, Visual Studio Code, etc.)
* Optional:
  * [Git](https://git-scm.com/downloads)

## Setup

Before you begin, you will need to complete one of the setup methods in [LOCAL_INSTALL.md](LOCAL_INSTALL.md). This will ensure that you have all the necessary dependencies installed.

1. [Using System Python with `venv`](LOCAL_INSTALL.md#using-system-python-with-venv)
2. [Using uv](LOCAL_INSTALL.md#using-uv-fast-python-package-installer) (Fast Python Package Installer)
3. [Using DevContainer](LOCAL_INSTALL.md#using-devcontainer)
4. [Using Docker](LOCAL_INSTALL.md#using-docker)


## Instructions

### Clone this project

Most Devcontainer clients will perform the clone on your behalf as well as initialize the workspace for you. If your specific client requires you to clone the repository locally first, you will need to do that using the following commands.

```bash
git clone git@github.com:moderneinc/moderne-cluster-build-logs.git
cd moderne-cluster-build-logs
```
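
If you don't have SSH keys configured with GitHub, cloning over HTTPS is an equivalent alternative:

```bash
# Same repository, fetched over HTTPS instead of SSH
git clone https://github.com/moderneinc/moderne-cluster-build-logs.git
cd moderne-cluster-build-logs
```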

### Gather build logs

In order to perform an analysis on your build logs, all of them need to be copied over to this directory (`Clustering`). Please ensure that they are copied over inside a folder named `repos`.

You will also need a `builds.xlsx` file that provides details about the builds, such as where the build logs are located, what the outcome was, and what the path to the project is. This file should exist inside of the `repos` directory.

Here is an example of what your directory should look like if everything was set up correctly:

```
moderne-cluster-build-logs
├───scripts
│ (4 files)
└───repos
│ builds.xlsx
├───Org1
│ ├───Repo1
│ │ └───main
│ │ build.log
│ │
│ └───Repo2
│ └───master
│ build.log
├───Org2
│ ├───Repo1
│ │ └───main
│ │ build.log
│ │
│ └───Repo2
│ └───master
│ build.log
└───Org3
└───Repo1
└───main
build.log
```
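
A quick, optional sanity check (plain shell, nothing project-specific) that the layout matches what the scripts expect:

```bash
# The spreadsheet and at least one build log should be found
ls repos/builds.xlsx
find repos -name build.log | head
```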


#### Using Moderne mass ingest logs

If you want to use Moderne's mass ingest logs to run these scripts, you may use the following script to download a sample.

```bash
python scripts/00.download_ingest_samples.py
```

You will be prompted to choose which of the slices you want to download. Enter the corresponding number and press `Enter`.

After [set-up / installation](LOCAL_INSTALL.md), you can run the analysis script in one of two ways:

1. [Using an existing build log file](#using-an-existing-build-log-file)
2. [Downloading build logs from an Artifactory repository](#downloading-build-logs-from-an-artifactory-repository)


### Run the scripts

> [!WARNING]
> Please note these scripts won't function correctly if you haven't copied the logs and the `builds.xlsx` file into the `repos` directory you're working out of.

**Run the following scripts in order**:

#### Step 1
The first time you run this analysis, run `01.extract_failures.py` to extract only the log paths for the failed build stacktraces.

```bash
python scripts/01.extract_failures.py
```

#### Step 2
Load the logs and extract relevant error messages and stacktraces from the logs:

```bash
python scripts/02.load_logs_and_extract.py
```

#### Step 3
Embed logs and cluster:

```bash
python scripts/03.embed_summaries_and_cluster.py
```

### Analyze the results

Once you've run the two scripts, you should find that `clusters_scatter.html` and `clusters_logs.html` files were produced. Open those in the browser of your choice to get detailed information about your build failures.

```bash
python -m http.server 8080
```

Success! These can now be viewed in your browser at http://localhost:8080/clusters_scatter.html and http://localhost:8080/clusters_logs.html.

### Using an existing build log file

```bash
# System Python with venv / DevContainer
python script/analyze_logs.py \
  --logs <path_to_build_logs> \
  --output <output_directory>

# or using uv
uv run script/analyze_logs.py \
  --logs <path_to_build_logs> \
  --output <output_directory>

# or using Docker
docker run --rm -it \
  -v <path_to_build_logs>:/app/logs \
  -v <path_to_output_directory>:/app/output \
  moderne-cluster-build-logs:latest \
  python analyze_logs.py \
    --logs /app/logs \
    --output /app/output
```

### Downloading build logs from an Artifactory repository

The script can also download build logs from an Artifactory repository. You will need to provide the URL, repository path, username, and password to access the logs. You can also provide a specific log file to analyze; if you do not, the script will offer an interactive selection of the available logs.

```bash
# System Python with venv / DevContainer (--log-file is optional)
python script/analyze_logs.py \
  --url <artifactory_url> \
  --repository-path <artifactory_repository_path_to_logs> \
  --username <artifactory_username> \
  --password <artifactory_passwd> \
  --log-file <specific_log_file> \
  --output-dir ./output

# or using uv
uv run script/analyze_logs.py \
  --url <artifactory_url> \
  --repository-path <artifactory_repository_path_to_logs> \
  --username <artifactory_username> \
  --password <artifactory_passwd> \
  --log-file <specific_log_file> \
  --output-dir ./output

# or using Docker
docker run --rm -it \
  -v <path_to_output_directory>:/app/output \
  moderne-cluster-build-logs:latest \
  python analyze_logs.py \
    --url <artifactory_url> \
    --repository-path <artifactory_repository_path_to_logs> \
    --username <artifactory_username> \
    --password <artifactory_passwd> \
    --log-file <specific_log_file> \
    --output-dir /app/output
```
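
A small sketch for keeping the Artifactory password out of your shell history, assuming you store it in an environment variable named `ARTIFACTORY_PASSWORD` (the variable name is an assumption, not something the script requires):

```bash
# Read the password from the environment instead of typing it inline
export ARTIFACTORY_PASSWORD='<artifactory_passwd>'
python script/analyze_logs.py \
  --url <artifactory_url> \
  --repository-path <artifactory_repository_path_to_logs> \
  --username <artifactory_username> \
  --password "$ARTIFACTORY_PASSWORD" \
  --output-dir ./output
```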

### Optional: Marking a certain repository as "solved"

As you work through the build failures, you might want to exclude logs that have been marked as solved from the clustering process. To do this, open the `failures.csv` file and set the `Solved` column to `True` for the logs you want to ignore. Alternatively, you can delete or rename the `build.log` file for that repository. After making these changes, re-run the clustering process by restarting at [step 2](#step-2). You may repeat steps [2](#step-2) and [3](#step-3) as many times as needed to update the graphics.

## Example results

