Commit

docs: update documentation

sjungling committed Dec 19, 2024
1 parent e4dbfd2 commit ce3ead3
Showing 2 changed files with 106 additions and 240 deletions.
180 changes: 44 additions & 136 deletions LOCAL_INSTALL.md
@@ -1,152 +1,60 @@
# Local Installation Guide

This project requires Python 3.12.x and manages dependencies using [`pyproject.toml`](pyproject.toml). Below are several methods to set up your environment.

## Prerequisites

Please ensure you have the following tools installed on your system:

* Python 3.12 (newer versions will not work)
  * [pyenv](https://github.com/pyenv/pyenv) is recommended
  * [Homebrew installation](https://formulae.brew.sh/formula/python@3.12)
  * [Official Python installer](https://www.python.org/downloads/)
* [Git](https://git-scm.com/downloads)

## Using System Python with `venv`

1. Ensure Python 3.12.x is installed:

   ```bash
   python --version
   ```

2. Create a virtual environment:

   ```bash
   python -m venv .venv
   ```

3. Activate the virtual environment:

   ```bash
   # On Unix/macOS
   source .venv/bin/activate
   # On Windows
   .venv\Scripts\activate
   ```

4. Install dependencies:

   ```bash
   # Installs the project and the dependencies declared in pyproject.toml
   pip install .
   ```

## Instructions

### Step 1: Clone this project

```bash
git clone git@github.com:moderneinc/moderne-cluster-build-logs.git
cd moderne-cluster-build-logs
```

### Step 2: Set up the Python virtual environment

You will be creating a server and running clustering inside of a Python virtual environment. To create said environment, please run:

```bash
## Pick the one that applies to your system
python -m venv venv

## For Mac or Linux users
source venv/bin/activate

## For Windows users
venv\Scripts\activate
```

After running the `source` command, you should see that you're in a Python virtual environment.

## Using [uv](https://docs.astral.sh/uv/) (Fast Python Package Installer)

1. [Install uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods):

   ```bash
   pip install uv
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   uv venv
   source .venv/bin/activate  # or .venv\Scripts\activate on Windows
   uv pip install -r pyproject.toml
   ```

## Using DevContainer

This project includes DevContainer configuration for VS Code:

1. Install the [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) in VS Code
2. Open the project in VS Code
3. Click "Reopen in Container" when prompted, or use the command palette (F1) and select "Dev Containers: Reopen in Container"

The container will automatically set up Python 3.12 and install all dependencies.

### Step 3: Install dependencies

Double-check that `pip` is pointing to the correct Python version by running the following command. The output should include `python 3.12.X`. If it doesn't, try using `pip3` instead.

```bash
pip --version
```
Once you've confirmed which `pip` works for you, install dependencies by running the following command:

```bash
pip install -r requirements.txt
```
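
If both `pip` and `pip3` resolve to the wrong interpreter, another option (standard Python tooling, not specific to this project) is to invoke pip through the interpreter you want:

```bash
# Run pip via a specific interpreter to avoid PATH ambiguity
python3.12 -m pip --version
python3.12 -m pip install -r requirements.txt
```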

### Step 4: Download the model

Download the model which will assist with tokenizing and clustering of the build log data.

```bash
python scripts/download_model.py
```

### Step 5: Gather build logs

In order to perform an analysis on your build logs, all of them need to be copied over to this directory. Please ensure that they are copied over inside a folder named `repos`.

You will also need a `builds.xlsx` file that provides details about the builds, such as where the build logs are located, what the outcome was, and what the path to the project is. This file should exist inside of the `repos` directory.

Here is an example of what your directory should look like if everything was set up correctly:

```
moderne-cluster-build-logs
├───scripts
│ (4 files)
└───repos
│ builds.xlsx
├───Org1
│ ├───Repo1
│ │ └───main
│ │ build.log
│ │
│ └───Repo2
│ └───master
│ build.log
├───Org2
│ ├───Repo1
│ │ └───main
│ │ build.log
│ │
│ └───Repo2
│ └───master
│ build.log
└───Org3
└───Repo1
└───main
build.log
```
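
For example, here is a minimal sketch of copying logs into that layout; the source paths are placeholders for wherever your exported build logs actually live:

```bash
# Illustrative only: recreate the expected repos/ layout for one repository
mkdir -p repos/Org1/Repo1/main
cp /path/to/exported-logs/Org1/Repo1/main/build.log repos/Org1/Repo1/main/
cp /path/to/exported-logs/builds.xlsx repos/
```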


#### Using Moderne mass ingest logs

If you want to use Moderne's mass ingest logs to run these scripts, you may use the following script to download a sample.

```bash
python scripts/00.download_ingest_samples.py
```

You will be prompted to choose which of the slices you want to download. Enter the corresponding number and press `Enter`.


### Step 6: Run the scripts

_Please note these scripts won't function correctly if you haven't copied the logs and the `builds.xlsx` file into the `repos` directory inside the `Clustering` directory you're working out of._

**Run the following scripts in order**:

1. Load the logs and extract relevant error messages and stacktraces from the logs:

```bash
python scripts/01.load_logs_and_extract.py
```

_Please note that the loaded logs only include those generated from failures to build Maven or Gradle projects. You can open `builds.xlsx` if fewer logs are loaded than expected._

2. Embed logs and cluster:

```bash
python scripts/02.embed_summaries_and_cluster.py
```

### Step 7: Analyze the results

Once you've run the two scripts, you should find that `clusters_scatter.html` and `clusters_logs.html` files were produced. Open those in the browser of your choice to get detailed information about your build failures.
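
If your browser can't open the files directly, one option (the same approach the README uses) is to serve the directory over HTTP:

```bash
# Serve the generated HTML reports locally
python -m http.server 8080
# then open http://localhost:8080/clusters_scatter.html
# and http://localhost:8080/clusters_logs.html
```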

Success! You can now exit the Python virtual environment by running `deactivate`.

## Example results

Below you can see some examples of the HTML files produced by following the above steps.

### clusters_scatter.html

This file is a visual representation of the build failure clusters. Clusters that contain the most dots should generally be prioritized over ones that contain fewer. You can hover over the dots to see part of the build logs.

![expected_clusters](images/expected_clusters.gif)

### clusters_logs.html

To see the full extracted logs, you may use this file; it shows all of the logs that belong to each cluster.

![logs](images/expected_logs.png)
## Using Docker

Build and run the project using Docker:

```bash
# Build the image
docker build -t moderne-cluster-build-logs:latest .
```
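
A sketch of running the analysis from that image, mirroring the invocation shown in the README; the mounted paths are placeholders for your own directories:

```bash
# Run the containerized analysis against locally mounted logs
docker run --rm -it \
  -v <path_to_build_logs>:/app/logs \
  -v <path_to_output_directory>:/app/output \
  moderne-cluster-build-logs:latest \
  python analyze_logs.py --logs /app/logs --output /app/output
```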
166 changes: 62 additions & 104 deletions README.md
@@ -1,133 +1,91 @@
# Clustering build logs to analyze common build issues

When your company attempts to build [Lossless Semantic Trees (LSTs)](https://docs.moderne.io/administrator-documentation/moderne-platform/references/lossless-semantic-trees/) for all of your repositories, you may find that some of them do not build successfully. While you _could_ go through each of those by hand and attempt to figure out common patterns, there is a better way: [cluster analysis](https://en.wikipedia.org/wiki/Cluster_analysis).

You can think of cluster analysis as a way of grouping data into easily identifiable chunks. In other words, it can take in all of your build failures and then find what issues are the most common - so you can prioritize what to fix first.

This repository will walk you through everything you need to do to perform a cluster analysis on your build failures. By the end, you will have produced two HTML files:
1. [one that visually displays the clusters](#analysis_build_failureshtml)
2. [one that contains samples for each cluster](#cluster_id_reasonhtml).

> [!NOTE]
> Clustering is currently limited to Maven, Gradle, .NET, and Bazel builds because our heuristic-based extraction of build errors is specific to these build types. Although build failures for other types won't cause errors when clustering, the heuristic extraction may overlook valuable parts of the stack trace.

## Prerequisites

> [!NOTE]
> This repository contains a devcontainer specification; it is the recommended path to get set up, as it ensures a consistent developer experience. If you so choose, you can install the necessary components locally instead. Running without Docker might be faster if your local machine has GPU or Metal support. See [LOCAL_INSTALL.md](/LOCAL_INSTALL.md) for how to get started.

Please ensure you have the following tools installed on your system:

* A Devcontainer-compatible client (GitHub Codespaces, GitPod, DevPod, Docker Desktop, Visual Studio Code, etc.)
* Optional:
  * [Git](https://git-scm.com/downloads)

## Setup

Before you begin, you will need to complete one of the setup methods in [LOCAL_INSTALL.md](LOCAL_INSTALL.md). This will ensure that you have all the necessary dependencies installed.

1. [Using System Python with `venv`](LOCAL_INSTALL.md#using-system-python-with-venv)
2. [Using uv](LOCAL_INSTALL.md#using-uv-fast-python-package-installer) (Fast Python Package Installer)
3. [Using DevContainer](LOCAL_INSTALL.md#using-devcontainer)
4. [Using Docker](LOCAL_INSTALL.md#using-docker)


## Instructions

### Clone this project

Most Devcontainer clients will perform the clone on your behalf as well as initialize the workspace for you. If your specific client requires you to clone the repository locally first, you will need to do that using the following commands.

```bash
git clone git@github.com:moderneinc/moderne-cluster-build-logs.git
cd moderne-cluster-build-logs
```
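
If you don't have SSH keys configured with GitHub, cloning over HTTPS is an equivalent alternative:

```bash
# Same repository, fetched over HTTPS instead of SSH
git clone https://github.com/moderneinc/moderne-cluster-build-logs.git
cd moderne-cluster-build-logs
```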

### Gather build logs

In order to perform an analysis on your build logs, all of them need to be copied over to this directory (`Clustering`). Please ensure that they are copied over inside a folder named `repos`.

You will also need a `builds.xlsx` file that provides details about the builds, such as where the build logs are located, what the outcome was, and what the path to the project is. This file should exist inside of the `repos` directory.

Here is an example of what your directory should look like if everything was set up correctly:

```
moderne-cluster-build-logs
├───scripts
│ (4 files)
└───repos
│ builds.xlsx
├───Org1
│ ├───Repo1
│ │ └───main
│ │ build.log
│ │
│ └───Repo2
│ └───master
│ build.log
├───Org2
│ ├───Repo1
│ │ └───main
│ │ build.log
│ │
│ └───Repo2
│ └───master
│ build.log
└───Org3
└───Repo1
└───main
build.log
```
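
A quick, optional sanity check (plain shell, nothing project-specific) that the layout matches what the scripts expect:

```bash
# The spreadsheet and at least one build log should be found
ls repos/builds.xlsx
find repos -name build.log | head
```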


#### Using Moderne mass ingest logs

If you want to use Moderne's mass ingest logs to run these scripts, you may use the following script to download a sample.

```bash
python scripts/00.download_ingest_samples.py
```

You will be prompted to choose which of the slices you want to download. Enter the corresponding number and press `Enter`.

After [set-up / installation](LOCAL_INSTALL.md), you can run the analysis script in one of two ways:

1. [Using an existing build log file](#using-an-existing-build-log-file)
2. [Downloading build logs from an Artifactory repository](#downloading-build-logs-from-an-artifactory-repository)


### Run the scripts

> [!WARNING]
> Please note these scripts won't function correctly if you haven't copied the logs and the `builds.xlsx` file into the `repos` directory you're working out of.

**Run the following scripts in order**:

#### Step 1
The first time you run this analysis, run `01.extract_failures.py` to extract only the log paths for the failed build stacktraces.

```bash
python scripts/01.extract_failures.py
```

#### Step 2
Load the logs and extract relevant error messages and stacktraces from the logs:

```bash
python scripts/02.load_logs_and_extract.py
```

#### Step 3
Embed logs and cluster:

```bash
python scripts/03.embed_summaries_and_cluster.py
```

### Analyze the results

Once you've run the two scripts, you should find that `clusters_scatter.html` and `clusters_logs.html` files were produced. Open those in the browser of your choice to get detailed information about your build failures.

```bash
python -m http.server 8080
```

Success! These can now be viewed in your browser at http://localhost:8080/clusters_scatter.html and http://localhost:8080/clusters_logs.html.

### Using an existing build log file

```bash
# System Python with venv / DevContainer
python script/analyze_logs.py \
  --logs <path_to_build_logs> \
  --output <output_directory>

# or using uv
uv run script/analyze_logs.py \
  --logs <path_to_build_logs> \
  --output <output_directory>

# or using Docker
docker run --rm -it \
  -v <path_to_build_logs>:/app/logs \
  -v <path_to_output_directory>:/app/output \
  moderne-cluster-build-logs:latest \
  python analyze_logs.py \
    --logs /app/logs \
    --output /app/output
```

### Downloading build logs from an Artifactory repository

The script can also download build logs from an Artifactory repository. You will need to provide the URL, repository path, username, and password to access the logs. You can also provide a specific log file to analyze; if you do not, the script will offer an interactive selection of the available logs.

```bash
# System Python with venv / DevContainer (--log-file is optional)
python script/analyze_logs.py \
  --url <artifactory_url> \
  --repository-path <artifactory_repository_path_to_logs> \
  --username <artifactory_username> \
  --password <artifactory_passwd> \
  --log-file <specific_log_file> \
  --output-dir ./output

# or using uv
uv run script/analyze_logs.py \
  --url <artifactory_url> \
  --repository-path <artifactory_repository_path_to_logs> \
  --username <artifactory_username> \
  --password <artifactory_passwd> \
  --log-file <specific_log_file> \
  --output-dir ./output

# or using Docker
docker run --rm -it \
  -v <path_to_output_directory>:/app/output \
  moderne-cluster-build-logs:latest \
  python analyze_logs.py \
    --url <artifactory_url> \
    --repository-path <artifactory_repository_path_to_logs> \
    --username <artifactory_username> \
    --password <artifactory_passwd> \
    --log-file <specific_log_file> \
    --output-dir /app/output
```
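
A small sketch for keeping the Artifactory password out of your shell history, assuming you store it in an environment variable named `ARTIFACTORY_PASSWORD` (the variable name is an assumption, not something the script requires):

```bash
# Read the password from the environment instead of typing it inline
export ARTIFACTORY_PASSWORD='<artifactory_passwd>'
python script/analyze_logs.py \
  --url <artifactory_url> \
  --repository-path <artifactory_repository_path_to_logs> \
  --username <artifactory_username> \
  --password "$ARTIFACTORY_PASSWORD" \
  --output-dir ./output
```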

### Optional: Marking a certain repository as "solved"

As you work through the build failures, you might want to exclude logs that have been marked as solved from the clustering process. To do this, open the `failures.csv` file and set the `Solved` column to `True` for the logs you want to ignore. Alternatively, you can delete or rename the `build.log` file for that repository. After making these changes, re-run the clustering process by restarting at [step 2](#step-2). You may repeat steps [2](#step-2) and [3](#step-3) as many times as needed to update the graphics.

## Example results

