Add more dev tools and update readme (#27)

This PR adds

1. A gRPC code generation script (`make gen`)
2. A cleaner script (`make clean`)
3. A sample data downloader (`make get-dataset`)
4. A major update of README.md after OSPP and GSOC submission.
5. Include Log-analysis architecture/design diagrams.

Signed-off-by: Superskyyy <[email protected]>
SkyAPM · Sep 25, 2022 · 8402746
1 parent eae743d
Showing 14 changed files with 466 additions and 60 deletions.
diff --git a/Makefile b/Makefile
@@ -24,14 +24,14 @@ endif
 $(VENV):
 	python3 -m venv $(VENV_DIR)
 	poetry run python -m pip install --upgrade pip
+	poetry install --sync
 
 all: gen get-dataset prune-dataset lint license clean
 
 .PHONY: all
 
 gen:
-	poetry run python -m pip install grpcio-tools
-	poetry run python -m tools.grpc_code_gen
+	poetry run python -m tools.grpc_gen
 
 #argument indicates a dataset name defined in sample_data_manager.py
 get-dataset:
@@ -53,11 +53,5 @@ lint-fix: lint-setup
 	$(VENV)/unify -r --in-place .
 	$(VENV)/flynt -tc -v .
 
-# todo make this work on windows
 clean:
-	find . -name "*.egg-info" -exec rm -r {} +
-	find . -name "dist" -exec rm -r {} +
-	find . -name "build" -exec rm -r {} +
-	find . -name "__pycache__" -exec rm -r {} +
-	find . -name ".pytest_cache" -exec rm -r {} +
-	find . -name "*.pyc" -exec rm -r {} +
+	poetry run python -m tools.cleaner
diff --git a/README.md b/README.md
@@ -1,80 +1,88 @@
 # SkyWalking AIOps Engine
-**An AIOps Engine for Observability.**
 
-A usable open-source AIOps framework for the domain of cloud computing observability. 
+*A practical open-source AIOps engine for the
+era of cloud computing.*
 
-### Why this project matters?
-We could answer this from the following progressive questions:
-1. Are there existing algorithms for telemetry data? 
+## Why do we build this project?
+
+**We strongly believe that this project will bring value
+to AIOps practitioners and researchers.**
+<details>
+  <summary>Towards better Observability</summary>
+We can reason about this through the following progressive questions:
+
+1. Are there existing algorithms for telemetry data?
    - **Abundant.**
 
-2. Are the existing algorithms empirically verified? 
-
-   - **Most proposed algorithms are not empirically verified**
 
-3. Are there AIOps tools that embed machine learning algorithms? 
+2. Are the existing algorithms empirically verified?
+
+   - **Most algorithms are not verified in production**
+
+
+3. Are there practical AIOps frameworks?
    - **Limited, often out of maintenance or commercialized.**
-
-4. Are there open-source AIOps solutions that integrates with popular backends?
+
+
+4. Are there open-source AIOps solutions that offer out-of-the-box integrations?
    - **Hardly any.**
 
+
 5. Why would I need that?
    1. For developers & organizations curious for AIOps:
-      - a. Just install and start using it, saves budget, saves head-scratching.
+      - a. Just install and start using it, saves budget, prevents head-scratching.
       - b. Treat this project as a good (or bad) reference for your own AIOps pipeline.
    2. For researchers in the AIOps domain:
       - a. For software engineering researchers - sample for AIOps evolution and empirical study.
       - b. For algorithm researchers - playground for new algorithms, solid case studies.
-
 
-The above is where we place the value of this project, though our current aim is to become the official AIOps engine 
-of [Apache SkyWalking](https://github.com/apache/skywalking), each component could be easily swapped given its 
-plugable design.
+</details>
+
+
+Expand the section above to see where we place the value of this project.
+Although our current aim is to become the official AIOps engine
+of [Apache SkyWalking](https://github.com/apache/skywalking),
+each component can be easily swapped, extended and scaled to fit your own needs.
 
 ### Current Goal
 
-At the current stage, it serves as an **anomaly detection** engine, in the future, we will also explore root cause analysis and 
-automatic problem recovery.
+At the current stage, it targets log and metric analysis.
+In the future, we will also explore root cause analysis and
+automatic problem recovery based on traces.
 
-This is also the tentative repository for OSPP 2022 and GSOC 2022 student project outcomes.
+This is also the repository for
+OSPP 2022 and GSOC 2022 student research outcomes.
 
-Project `Exploration of Advanced Metrics Anomaly Detection & Alerts with Machine Learning in Apache SkyWalking`
+1. `Exploration of Advanced Metrics Anomaly Detection & Alerts with Machine Learning in Apache SkyWalking`
 
-Project `Log Outlier Detection in Apache SkyWalking`
+2. `Log Outlier Detection in Apache SkyWalking`
 
 ### Architecture
 
-**TBA**
+**Log Clustering and Log Trend Analysis**
 
-**Data pulling:**
+![img.png](docs/static/log-clustering-arch.png)
 
-The current data pulling and retention rely on a common set of ingestion methods, with a 
-first focus on SkyWalking OAP GraphQL and static file loader. We maintain a local storage for processed data.
+![img_1.png](docs/static/log-trend-analysis-arch.png)
 
-**Alert component:**
+**Metric Anomaly Detection and Visualizations**
 
-An anomaly does not directly trigger an alert, it 
-goes through a tolerance mechanism.
+TBD - Soon to be added
 
 ### Roadmap
 
-Phase 0 (current)
-1. [ ] Implement essential development infrastructure.
-2. [ ] Implement naive algorithms as baseline & pipline POC (on existing datasets).
-3. [ ] Implement a SkyWalking `GraphQLDataLoaderProvider` to test data pulling.
-
-Phase 1 (summer -> fall 2022, OSPP & GSOC period)
-1. [ ] Implement the remaining core default providers.
-2. [ ] **Research and implement algorithms with OSPP & GSOC students.**
-3. [ ] Integrate with Apache Airflow for orchestration.
-5. [ ] Evaluation based on benchmark microservices systems (anomaly injection).
-6. [ ] MVP ready without UI-side changes. 
-
-Phase 2 (fall -> end of 2022)
-1. [ ] Join as an Apache SkyWalking subproject.
-2. [ ] Integrate with SkyWalking Backend & rule-based alert module.
-3. [ ] Propose and request SkyWalking UI-side changes.
-4. [ ] First release for end-user testing.
-
-Phase Next 
+For the details of our progress, please refer to our project dashboard
+[Here](https://github.com/SkyAPM/aiops-engine-for-skywalking/projects?query=is%3Aopen).
+
+Phase Current (fall -> end of 2022)
+
+0. [ ] Finish the POC stage and start implementing dashboards for first-stage users (demo purposes).
+1. [ ] Real-world data testing and chaos engineering benchmark experiments.
+2. [ ] Join Apache Software Foundation as an Apache SkyWalking subproject.
+3. [ ] Integrate with SkyWalking Backend (Export analytics results to SkyWalking)
+4. [ ] Propose and request SkyWalking UI-side changes.
+5. [ ] First release for SkyWalking end-user testing.
+
+Phase Next
+
 1.[ ] Towards production-ready.
diff --git a/sample_data_gaia/README.md → assets/README.md b/sample_data_gaia/README.md → assets/README.md
@@ -1,12 +1,15 @@
 # GAIA Dataset
+
 **Sample dataset can be downloaded from**
 
 https://github.com/CloudWise-OpenSource/GAIA-DataSet
 
 **We don't host dataset in this repo because of its size and GPL2.0 license.**
 
-To evaluate the models, please download the dataset and unzip each subset of
-`Companion_Data` into this directory. 
+**To get the data from the source, simply run `make get-dataset name=gaia` in the root directory.**
+
+The above command downloads the dataset and places each subset of
+`Companion_Data` in this directory, ready for model evaluation.
 
 After that, this folder should be the following exact structure:
 

diff --git a/docs/en/contribute/Datasets.md b/docs/en/contribute/Datasets.md
@@ -0,0 +1,33 @@
+# Get a range of common datasets for testing and development
+
+At the root of the project, run `make get-dataset name=<name>` to get them;
+the datasets will be extracted to the `assets/datasets` folder.
+
+Use one of the following names to download the batch of datasets you need:
+
+1. `gaia`: the [GAIA](https://github.com/CloudWise-OpenSource/GAIA-DataSet) dataset.
+   - 4+ GB with log, trace and metric data.
+2. `log_s`: small [LogHub](https://github.com/logpai/loghub) datasets.
+   1. SSH.tar.gz: (Server)
+   2. Hadoop.tar.gz: (Distributed system)
+   3. Apache.tar.gz: (Server)
+   4. HealthApp.tar.gz: (Mobile application)
+   5. Zookeeper.tar.gz: (Distributed system)
+   6. HPC.tar.gz: (Supercomputer)
+3. `log_m`: medium [LogHub](https://github.com/logpai/loghub) datasets.
+   1. Android.tar.gz = 1,555,005 logs (183MB Mobile system)
+   2. BGL.tar.gz = 4,747,963 logs (700MB Supercomputer)
+   3. Spark.tar.gz = 33,236,604 logs (2.7GB Distributed system)
+4. `log_l`: large [LogHub](https://github.com/logpai/loghub) datasets.
+   1. HDFS_2.tar.gz = 71,118,073 logs (16GB Distributed system)
+   2. Thunderbird.tar.gz = 211,212,192 logs (30GB Supercomputer)
+
+**Note: large datasets require substantial disk space and memory to extract.**
+
+## To remove the datasets/zip/tar files
+
+If you want to keep all zip/tar files after extracting, pass an additional `save=True`,
+for example `make get-dataset name=log_m save=True`.
+
+If you want to remove all datasets, run `make prune-dataset`.
+
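The extract-then-optionally-keep behaviour described in this document can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's downloader: the function name `extract_archive`, its `save` parameter, and the destination layout are all hypothetical, and the download step is left out entirely.

```python
import tarfile
from pathlib import Path


def extract_archive(archive: Path, target_dir: Path, save: bool = False) -> Path:
    """Extract a gzip-compressed tarball into target_dir/<dataset name>.

    Hypothetical helper: the archive is deleted after extraction unless
    save is True, mirroring the documented `save=True` option.
    """
    # Derive the dataset name, e.g. 'Apache.tar.gz' -> 'Apache'
    name = archive.name
    for suffix in ('.tar.gz', '.tgz'):
        if name.endswith(suffix):
            name = name[:-len(suffix)]
            break

    destination = target_dir / name
    destination.mkdir(parents=True, exist_ok=True)

    # Only extract archives from sources you trust
    with tarfile.open(archive, mode='r:gz') as tar:
        tar.extractall(destination)

    if not save:
        archive.unlink()  # drop the tarball to free disk space
    return destination
```

Deleting the archive by default keeps disk usage down for the larger batches, while `save=True` avoids a repeated download when the extracted data has to be regenerated.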
diff --git a/docs/static/log-clustering-arch.png b/docs/static/log-clustering-arch.png
diff --git a/docs/static/log-trend-analysis-arch.png b/docs/static/log-trend-analysis-arch.png
diff --git a/poetry.lock b/poetry.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -51,6 +51,7 @@ redis = { extras = ["hiredis"], version = "^4.3.4" }
 pyzstd = "^0.15.3"
 dynaconf = "^3.1.9"
 cachetools = "^5.2.0"
+grpcio-tools = ">=1.42.0"
 
 [tool.poetry.dev-dependencies]
 PySnooper = "^1.1.1"

diff --git a/tools/README.md b/tools/README.md
@@ -0,0 +1,11 @@
+# Convenient developer tools
+
+Tools in this folder should only be run via the `make <target>` command.
+
+1. grpc_gen.py => generate grpc code from proto files at any depth.
+2. get_data.py => download and extract some sample datasets from the web.
+3. cleaner.py => cleans up things like pycache and local installation manifests.
+
+## Future
+
+The above tools will be replaced with Poetry-based scripts in the future (a dev CLI).
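To illustrate what "proto files at any depth" could involve, the sketch below collects `.proto` files recursively and builds an argument list for `grpc_tools.protoc`. It is not the repository's `grpc_gen.py`: the helper names `find_protos`, `build_protoc_args`, and `generate`, as well as the directory arguments, are assumptions made for this example. Only `grpc_tools.protoc.main` comes from the `grpcio-tools` package.

```python
from pathlib import Path
from typing import List


def find_protos(proto_root: Path) -> List[Path]:
    """Return every .proto file below proto_root, at any nesting depth."""
    return sorted(proto_root.rglob('*.proto'))


def build_protoc_args(proto_root: Path, out_dir: Path) -> List[str]:
    """Build the argument list for protoc (hypothetical helper)."""
    files = [str(p) for p in find_protos(proto_root)]
    return [
        'grpc_tools.protoc',  # placeholder for argv[0], as protoc's main() expects
        f'--proto_path={proto_root}',
        f'--python_out={out_dir}',
        f'--grpc_python_out={out_dir}',
        *files,
    ]


def generate(proto_root: Path, out_dir: Path) -> int:
    """Run code generation; returns protoc's exit code (0 on success)."""
    # Imported lazily so the helpers above work without grpcio-tools installed
    from grpc_tools import protoc

    out_dir.mkdir(parents=True, exist_ok=True)
    return protoc.main(build_protoc_args(proto_root, out_dir))
```

Keeping argument construction separate from the `protoc` call makes the file discovery easy to check without running the compiler.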
diff --git a/tools/__init__.py b/tools/__init__.py
@@ -0,0 +1,13 @@
+#  Copyright 2022 SkyAPM org
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
diff --git a/tools/cleaner.py b/tools/cleaner.py
@@ -0,0 +1,36 @@
+#  Copyright 2022 SkyAPM org
+#
+#  Licensed under the Apache License, Version 2.0 (the "License");
+#  you may not use this file except in compliance with the License.
+#  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+#  limitations under the License.
+import os
+import shutil
+
+
+def find_and_clean(folders_to_remove: list, root='.') -> None:
+    """
+    Find and clean all files in the given folder list
+    :param folders_to_remove: list of directories to remove
+    :param root: from which folder to start searching, default current
+    :return:
+    """
+    exclude: set = {'.venv'}
+    for path, dirs, _ in os.walk(root):
+        dirs[:] = [d for d in dirs if d not in exclude]
+        for folder in folders_to_remove:
+            if any(folder in d for d in dirs):
+                shutil.rmtree(removed := os.path.join(path, folder))
+                print(f'Removed {removed}')
+
+
+if __name__ == '__main__':
+    find_and_clean(folders_to_remove=['__pycache__', 'generated', 'build', 'dist', 'egg-info', 'pytest_cache', '.pyc'],
+                   root='.')
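The walk above inspects directory names only, so an entry such as `.pyc` never removes compiled files, which the old `find . -name "*.pyc"` rule did. One possible variant, sketched here under the assumption that glob-style patterns are acceptable, matches both directories and files with `fnmatch`. The function name `clean` and its parameters are hypothetical and not part of this commit.

```python
import fnmatch
import os
import shutil
from typing import Iterable, List


def clean(patterns: Iterable[str], root: str = '.',
          exclude: Iterable[str] = ('.venv',)) -> List[str]:
    """Remove directories and files whose names match any glob pattern.

    Returns the list of removed paths. Directories named in `exclude`
    are neither removed nor searched.
    """
    patterns = list(patterns)
    skipped = set(exclude)
    removed: List[str] = []

    for path, dirs, files in os.walk(root):
        kept = []
        for d in dirs:
            if d in skipped:
                continue  # leave excluded directories alone
            if any(fnmatch.fnmatch(d, p) for p in patterns):
                target = os.path.join(path, d)
                shutil.rmtree(target)
                removed.append(target)
            else:
                kept.append(d)
        dirs[:] = kept  # os.walk only descends into what remains here

        for f in files:
            if any(fnmatch.fnmatch(f, p) for p in patterns):
                target = os.path.join(path, f)
                os.remove(target)
                removed.append(target)
    return removed
```

A call such as `clean(['__pycache__', '*.egg-info', '*.pyc', 'build', 'dist'])` would then cover the cases the removed `find` commands handled, and whole-name glob matching avoids deleting a directory that merely contains `build` in its name.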