Merge pull request #237 from amosproj/documentation/sprint-13-ahmed

Last sprint's deliverables
amosproj · Feb 7, 2024 · 31bdcae
2 parents 48b1a1f + 02cc828

Showing 28 changed files with 707 additions and 159 deletions.
diff --git a/Deliverables/sprint-13/build-documentation.pdf b/Deliverables/sprint-13/build-documentation.pdf
diff --git a/Deliverables/sprint-13/build-documentation.pdf.license b/Deliverables/sprint-13/build-documentation.pdf.license
@@ -0,0 +1,2 @@
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
diff --git a/Deliverables/sprint-13/design-documentation.pdf b/Deliverables/sprint-13/design-documentation.pdf
diff --git a/Deliverables/sprint-13/design-documentation.pdf.license b/Deliverables/sprint-13/design-documentation.pdf.license
@@ -0,0 +1,2 @@
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
diff --git a/Deliverables/sprint-13/feature-board.png b/Deliverables/sprint-13/feature-board.png
diff --git a/Deliverables/sprint-13/feature-board.png.license b/Deliverables/sprint-13/feature-board.png.license
@@ -0,0 +1,2 @@
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2024 Simon Zimmermann <[email protected]>
diff --git a/Deliverables/sprint-13/imp-squared-backlog.jpg b/Deliverables/sprint-13/imp-squared-backlog.jpg
diff --git a/Deliverables/sprint-13/imp-squared-backlog.jpg.license b/Deliverables/sprint-13/imp-squared-backlog.jpg.license
@@ -0,0 +1,2 @@
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2023 Nico Hambauer <[email protected]>
diff --git a/Deliverables/sprint-13/planning-document.pdf b/Deliverables/sprint-13/planning-document.pdf
diff --git a/Deliverables/sprint-13/planning-document.pdf.license b/Deliverables/sprint-13/planning-document.pdf.license
@@ -0,0 +1,2 @@
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2024 Simon Zimmermann <[email protected]>
diff --git a/Deliverables/sprint-13/user-documentation.pdf b/Deliverables/sprint-13/user-documentation.pdf
diff --git a/Deliverables/sprint-13/user-documentation.pdf.license b/Deliverables/sprint-13/user-documentation.pdf.license
@@ -0,0 +1,2 @@
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
diff --git a/Documentation/Architecture.md b/Documentation/Architecture.md
diff --git a/Documentation/Build-Documentation.md b/Documentation/Build-Documentation.md
@@ -0,0 +1,60 @@
+<!--
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]>
+-->
+
+# Creating the Environment
+
+The repository contains the file `.env.template`. This file is a template for
+the environment variables that need to be set for the application to run. Copy
+this file into a file called `.env` at the root level of this repository and
+fill in all values with the corresponding secrets.
+
+To create the virtual environment in this project you must have `pipenv`
+installed on your machine. Then run the following commands:
+
+```bash
+# for development environment
+pipenv install --dev
+# for production environment
+pipenv install
+```
+
+To work within the environment you can now run:
+
+```bash
+# to activate the virtual environment
+pipenv shell
+# to run a single command
+pipenv run <COMMAND>
+```
+
+# Build Process
+
+This application is built and tested on every push and pull request creation
+through GitHub Actions. For this, the `pipenv` environment is installed and then
+the code style is checked using `flake8`. Finally, the tests in the `tests/`
+directory are run using `pytest` and a test coverage report is created using
+`coverage`. The test coverage report can be found in the GitHub Actions output.
+
+In another task, the licenses of all used packages are checked to ensure that
+the software does not depend on any copyleft-licensed packages and remains open
+source and free to use.
+
+If any of these steps fails for a pull request, the pull request is blocked from
+being merged until the corresponding step is fixed.
+
+Furthermore, it is required to install the pre-commit hooks as described
+[here](https://github.com/amosproj/amos2023ws06-sales-lead-qualifier/wiki/Knowledge#pre-commit).
+This ensures a uniform coding style throughout the project and that the
+software is compliant with the REUSE licensing specifications.
+
+# Running the app
+
+To run the application the `pipenv` environment must be installed and all needed
+environment variables must be set in the `.env` file. Then the application can
+be started via:
+
+```bash
+pipenv run python src/main.py
+```
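The `.env` step described in Build-Documentation.md can be checked before starting the app. The following is a minimal sketch, not part of the repository: it assumes that `.env.template` and `.env` use plain `KEY=VALUE` lines with `#` comments, and the helper names `parse_env_keys` and `missing_env_values` are hypothetical.

```python
from pathlib import Path


def parse_env_keys(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments (assumed format)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_env_values(template_text: str, env_text: str) -> list[str]:
    """Return the template keys that are absent or empty in the .env content."""
    template = parse_env_keys(template_text)
    env = parse_env_keys(env_text)
    return sorted(key for key in template if not env.get(key))


if __name__ == "__main__":
    template_path, env_path = Path(".env.template"), Path(".env")
    if template_path.exists() and env_path.exists():
        missing = missing_env_values(template_path.read_text(), env_path.read_text())
        print("Missing values:", ", ".join(missing) if missing else "none")
    else:
        print("Run this from the repository root after creating .env")
```

Run from the repository root, this lists every variable from the template that still needs a value in `.env`.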
diff --git a/Documentation/Classifier-Comparison.md b/Documentation/Classifier-Comparison.md
@@ -6,8 +6,14 @@ SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
 
 # Classifier Comparison
 
-This document compares the results of the following classifiers on the enriched and
-preprocessed data set from the 22.01.2024.
+## Abstract
+
+This report presents a comprehensive evaluation of various classifiers trained on the historical data set, which has been enriched and preprocessed through our pipeline. Each model type was tested on two splits of the data set. The data set has five
+classes for prediction corresponding to different merchant sizes, namely XS, S, M, L, and XL. The first split uses exactly these five classes, as given by SumUp. The second split groups the classes S, M, and L into one new class, resulting in the three classes {XS}, {S, M, L}, and {XL}. While this does not exactly correspond to the classes given by SumUp, this simplification of the prediction task generally resulted in a better F1-score across models.
+
+## Experimental Attempts
+
+In accordance with the no free lunch theorem, which states that no single model performs best on every problem, multiple attempts were made to find the best solution. Unfortunately, certain models did not perform satisfactorily. The following models and methodologies were tried:
 
 - Quadratic Discriminant Analysis (QDA)
 - Ridge Classifier
@@ -18,24 +24,13 @@ preprocessed data set from the 22.01.2024.
 - XGBoost Classifier Model
 - K Nearest Neighbor Classifier (KNN)
 - Bernoulli Naive Bayes Classifier
-
-Each model type was tested on two splits of the data set. The used data set has five
-classes for prediction corresponding to different merchant sizes, namely XS, S, M, L, and XL.
-The first split of the data set used exactly these classes for the prediction corresponding
-to the exact classes given by SumUp. The other data set split grouped the classes S, M, and L
-into one new class resulting in three classes of the form {XS}, {S, M, L}, and {XL}. While
-this does not exactly correspond to the given classes from SumUp, this simplification of
-the prediction task generally resulted in a better F1-score across models.
-
-## Experimental Attempts
-
-According to free lunch theorem, there is no universal model or methodology that is top performing on every problem or data, therefore multiple attempts are crucal. In this section, we will document the experiments we tried and their corresponding performance and outputs.
+- LightGBM
 
 ## Models not performing well
 
 ### Support Vector Machine Classifier Model
 
-Training Support Vector Machine (SVM) took a while such that the training never ended. It is believed that it is the case because SVMs are very sensitive to the misclassifications and it finds a hard time minimizing them, given the data.
+Training the Support Vector Machine (SVM) took so long that it never finished. We believe this is because SVMs are very sensitive to misclassifications and have a hard time minimizing them on this data.
 
 ### Fully Connected Neural Networks Classifier Model
 
@@ -82,7 +77,6 @@ The following subsets are available:
 - The XGBoost model was trained for 10000 rounds.
 - The LightGBM model was trained with 2000 leaves.
 
-
 In the following table we can see each model's overall weighted F1-score on the 3-class and
 5-class data set split. The best performing classifier per row is marked in **bold**.
 
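The two splits and the weighted F1-score mentioned in this hunk can be sketched in a few lines. This is an illustration on toy labels, not the project's evaluation code: the merged label name `SML`, the helper names, and the example labels are assumptions, and the score uses the usual support-weighted definition (per-class F1 averaged by the number of true samples per class).

```python
from collections import Counter

# Grouping used for the 3-class split: S, M and L collapse into one class.
# "SML" is a hypothetical label name for the merged {S, M, L} class.
MERGE_MAP = {"XS": "XS", "S": "SML", "M": "SML", "L": "SML", "XL": "XL"}


def merge_classes(labels):
    """Map 5-class labels onto the 3-class split."""
    return [MERGE_MAP[label] for label in labels]


def f1_per_class(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for every class present in y_true."""
    scores = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by class support."""
    support = Counter(y_true)
    scores = f1_per_class(y_true, y_pred)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy labels for illustration only, not project data.
y_true = ["XS", "XS", "XS", "S", "M", "L", "XL"]
y_pred = ["XS", "XS", "S", "M", "M", "S", "XL"]

print("5-class weighted F1:", round(weighted_f1(y_true, y_pred), 3))
print("3-class weighted F1:",
      round(weighted_f1(merge_classes(y_true), merge_classes(y_pred)), 3))
```

Confusions among S, M, and L count as errors in the 5-class split but disappear once the classes are merged, which is one reason the 3-class score tends to be higher.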
@@ -141,3 +135,7 @@ In the following table we can see the F1-score of each model for each class in t
 
 For the 3-class split we observe similar performance for the XS and {S, M, L} classes for each model, while the LightGBM model slightly outperforms the other models. The LightGBM classifier is performing the best on the XL class while the Naive Bayes classifier performs worst. Interestingly, we can observe that the performance of the models on the XS class was barely affected by the merging of the S, M, and L classes while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split.
 The AdaBoost Classifier, trained on subset 1, performs best for the XL class. The KNN classifier got a slight boost in performance for the {S, M, L} and XL classes when using subset 1. All other models perform worse on subset 1.
+
+## Conclusion
+
+In summary, XGBoost consistently achieved the best overall scores across the various splits and subsets. However, its high score may be attributable to overfitting on the XS class. Given SumUp's emphasis on accurate predictions for the larger merchant size classes, we recommend considering LightGBM. This model outperformed XGBoost in predicting the XL class and the other classes, offering better results in both the five-class and three-class splits.
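A support-weighted score can favor a model that does well on the largest class even when another model is better on a small class such as XL. The sketch below shows the arithmetic with hypothetical class counts and per-class F1 values; the numbers, the label `SML`, and the names `model_a` and `model_b` are assumptions for illustration and are not results from this project.

```python
def support_weighted_average(per_class_f1, support):
    """Weight each class's F1 by its number of true samples."""
    total = sum(support.values())
    return sum(per_class_f1[c] * support[c] for c in support) / total


# Hypothetical class counts and per-class scores, NOT project results.
support = {"XS": 800, "SML": 150, "XL": 50}
model_a = {"XS": 0.95, "SML": 0.60, "XL": 0.40}  # strong on the majority class
model_b = {"XS": 0.90, "SML": 0.65, "XL": 0.70}  # stronger on the rare XL class

score_a = support_weighted_average(model_a, support)
score_b = support_weighted_average(model_b, support)
print(f"model_a weighted F1: {score_a:.4f}, XL F1: {model_a['XL']:.2f}")
print(f"model_b weighted F1: {score_b:.4f}, XL F1: {model_b['XL']:.2f}")
```

With these made-up numbers, `model_a` has the higher weighted score while `model_b` is clearly better on XL, which is why per-class scores matter when the larger merchant classes are the priority.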
diff --git a/Documentation/Controller.md b/Documentation/Controller.md
@@ -0,0 +1,32 @@
+<!--
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2023 Simon Zimmermann
+SPDX-FileCopyrightText: 2023 Berkay Bozkurt <[email protected]>
+-->
+
+# Automation
+
+The _Controller_ is a planned component that has not been implemented beyond a
+conceptual prototype. In the planned scenario, the controller would coordinate
+BDC, MSP and the external components as a centralized instance of control. In
+contrast to our current design, this scenario would enable the automation of our
+workflow, which currently requires several steps of human interaction to obtain
+a prediction result for initially unprocessed lead data.
+
+## Diagrams
+
+The following diagrams were created during the prototyping phase for the
+Controller component. As they are from an early stage of our project, the
+Merchant Size Predictor is labelled as the (Estimated) Value Predictor here.
+
+### Component Diagram
+
+![Component Diagram](Media/component-diagram-with-controller.svg)
+
+### Sequence Diagram
+
+![Sequence Diagram](Media/sequence-diagram.svg)
+
+### Controller Workflow Diagram
+
+![Controller Workflow Diagram](Media/controller-workflow-diagram.jpg)
diff --git a/Documentation/Data-Field-Definition.md b/Documentation/Data-Field-Definition.md
diff --git a/Documentation/Data-Fields.md b/Documentation/Data-Fields.md
@@ -0,0 +1,23 @@
+<!--
+SPDX-License-Identifier: MIT
+SPDX-FileCopyrightText: 2023 Sophie Heasman <[email protected]>
+SPDX-FileCopyrightText: 2024 Simon Zimmermann <[email protected]>
+-->
+
+# Data Field Definitions
+
+This document outlines the data fields obtained for each lead. The data can be
+sourced from the online _Lead Form_ or be retrieved from the internet using
+APIs.
+
+## Data Field Table
+
+The most recent Data Fields table can now be found in a
+[separate CSV File](./data-fields.csv).
+
+## Links to Data Sources
+
+Lead form: https://www.sumup.com/de-de/kontaktieren-vertriebsteam/ \
+Google Places API: https://developers.google.com/maps/documentation/places/web-service/overview \
+OpenAI API: https://platform.openai.com/docs/overview \
+Meta API: https://developers.facebook.com/docs/graph-api/overview