Move some existing documentation into Git Repo

Signed-off-by: Tims777 <[email protected]>
Tims777 committed Feb 6, 2024
1 parent 08573f9 commit 3ab293d
Showing 3 changed files with 329 additions and 0 deletions.
60 changes: 60 additions & 0 deletions Documentation/Build-Documentation.md
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]>
-->

# Creating the Environment

The repository contains the file `.env.template`. This file is a template for
the environment variables that need to be set for the application to run. Copy
this file into a file called `.env` at the root level of this repository and
fill in all values with the corresponding secrets.
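
On a Unix-like shell, for example, this could look as follows (the required variable names are those defined in `.env.template`):

```bash
# copy the template, then fill in the secret values manually
cp .env.template .env
```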

To create the virtual environment for this project, you must have `pipenv`
installed on your machine. Then run the following commands:

```bash
# for development environment
pipenv install --dev
# for production environment
pipenv install
```

To work within the environment you can now run:

```bash
# to activate the virtual environment
pipenv shell
# to run a single command
pipenv run <COMMAND>
```

# Build Process

This application is built and tested on every push and pull request creation
through GitHub Actions. For this, the `pipenv` environment is installed and then
the code style is checked using `flake8`. Finally, the `tests/` directory is
executed using `pytest` and a test coverage report is created using `coverage`.
The test coverage report can be found in the GitHub Actions output.
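
These checks can also be run locally from within the `pipenv` environment. The exact commands and flags used by the CI workflow may differ, but a typical invocation of these tools looks roughly like this:

```bash
# check the code style with flake8
pipenv run flake8 .
# run the test suite and collect coverage data
pipenv run coverage run -m pytest tests/
# print the coverage report
pipenv run coverage report
```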

In another task, all used packages are checked for their licenses to ensure that
the software does not use any copyleft licenses and remains open source and
free to use.

If any of these steps fail for a pull request, the pull request is blocked from
being merged until the corresponding step is fixed.

Furthermore, it is required to install the pre-commit hooks as described
[here](https://github.com/amosproj/amos2023ws06-sales-lead-qualifier/wiki/Knowledge#pre-commit).
This ensures a uniform coding style throughout the project and that the
software complies with the REUSE licensing specification.
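
Assuming the hooks are configured as described in the linked wiki page and `pre-commit` is available in the development environment, installing and running them locally looks roughly like this:

```bash
# register the git hooks described in the wiki
pipenv run pre-commit install
# optionally run all hooks against the whole code base once
pipenv run pre-commit run --all-files
```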

# Running the app

To run the application, the `pipenv` environment must be installed and all
required environment variables must be set in the `.env` file. Then the
application can be started via

```bash
pipenv run python src/main.py
```
82 changes: 82 additions & 0 deletions Documentation/Design-Documentation.md
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]>
SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
-->

# Introduction

This application is a tool used by our industry partner, SumUp, to enrich
information about potential leads collected through their sign-up website. The
enriched data is then used by a machine learning model to predict the potential
value a lead could contribute to SumUp. The application is split into two main
components: the Base Data Collector (BDC) and the Estimated Value Predictor
(EVP).

## Base Data Collector

### General description

The Base Data Collector (BDC) plays a crucial role in enriching the dataset of
potential client leads. The initial dataset comprises only basic lead
information: the lead's first and last name, phone number, email address, and
company name. Because this baseline data is not sufficient for value
prediction, the BDC queries diverse data sources, including various
Application Programming Interfaces (APIs), to enrich the provided lead data.

### Design

The different data sources are organised as steps in the program. Each step
extends a common parent class and implements methods to validate that it can
run, to perform the data collection from its source, and to perform clean-up
and statistics reporting for itself. These steps are collected in a pipeline
object that runs them sequentially to enhance the given data with all chosen
data sources. The data sources include:

- inspecting the possible custom domain of the email address.
- retrieving data from the Google Places API.
- performing sentiment analysis on Google reviews using the GPT-4 model.
- inspecting the surrounding areas of the business using the Regional Atlas API.
- searching for company-related data using the OffeneRegisterAPI.

### Data storage

All data for this project is stored in CSV files in the client's AWS S3 storage.
The files are split across three buckets: the input data and enhanced data are
stored in the events bucket, pre-processed data ready for use by the ML models
is stored in the features bucket, and the used model and its inference results
are stored in the model bucket.

### Data preprocessing

Following data enrichment, a pivotal phase in the machine learning pipeline is
data preprocessing, which encompasses scaling operations, numerical outlier
elimination, and categorical one-hot encoding. This preprocessing stage
transforms the output of the BDC into feature vectors, rendering them amenable
to predictive analysis by the machine learning model.

## Estimated Value Predictor

The primary objective of the EVP was initially to forecast the estimated
lifetime value of leads. This objective evolved during the project's
progression, primarily influenced by labelling considerations. The machine
learning model at the core of the EVP is trained on proprietary historical data
sourced from SumUp. The training process aims to discern discriminative
features that effectively separate the classes of the Merchant Size taxonomy.
Note that the underlying data is confidential and cannot be disclosed publicly.

## Merchant Size Predictor

In the context of Merchant Size Prediction, our aim is to leverage pre-trained
ML models on new lead data. By applying these models, we intend to predict the
potential Merchant Size, thereby assisting SumUp in prioritizing leads and
making informed decisions on which leads to contact first. This predictive
approach enhances the efficiency of lead management and optimizes resource
allocation for maximum impact.

## Data field definitions

This section describes the data field definitions obtained for each lead. The
data is either collected from the online Lead Form or extracted from online
sources using APIs.
187 changes: 187 additions & 0 deletions Documentation/User-Documentation.md
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]>
SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
-->

# Project vision

This product gives our industry partner a tool that can effectively increase
the conversion of leads to customers, primarily by providing the sales team
with valuable information. The modular architecture makes the product
future-proof by making it easy to add further data sources, employ improved
prediction models, or adjust the output format if desired.

# Project mission

The mission of this project is to enrich historical data about customers and
recent data about leads (with information from external sources) and to leverage
the enriched data in machine learning, so that the estimated Merchant Size of
leads can be predicted.

# Usage

To execute the final program, ensure the environment is installed (refer to
Build-Documentation.md) and run `python .\src\main.py` either locally or via
the build process. The user will be presented with the following options:

```bash
Choose demo:
(0) : Base Data Collector
(1) : Data preprocessing
(2) : Estimated Value Predictor
(3) : Merchant Size Prediction
(4) : Exit
```
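
For reference, one way to launch this menu is through the `pipenv` environment described in `Build-Documentation.md`:

```bash
# launch the interactive menu from the repository root
pipenv run python src/main.py
```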

## (0) : Base Data Collector

This is the data enrichment pipeline, utilizing multiple data enrichment steps.
Configuration options are presented:

`Do you want to list all available pipeline configs? (y/N)` If `y`:

```bash
Please enter the index of requested pipeline config:
(0) : config_sprint09_release.json
(1) : just_run_search_offeneregister.json
(2) : run_all_steps.json
(3) : Exit
```

- (0) Configuration used in sprint 9.
- (1) Configuration for OffeneRegister.
- (2) Runs all steps of the pipeline without step selection.
- (3) Exit to the pipeline step selection.

If `n`, the program proceeds to the pipeline step selection for data
enrichment and asks the following questions:

```bash
Run Scrape Address (will take a long time)(y/N)?
Run Search OffeneRegister (will take a long time)(y/N)?
Run Phone Number Validation (y/N)?
Run Google API (will use token and generate cost!)(y/N)?
Run Google API Detailed (will use token and generate cost!)(y/N)?
Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?
Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?
Run Smart Review Insights (will take looong time!)(y/N)?
Run Regionalatlas (y/N)?
```

- `Run Scrape Address (will take a long time)(y/N)?`: This enrichment step
  scrapes the lead's website for an address using regex.
- `Run Search OffeneRegister (will take a long time)(y/N)?`: This enrichment
step searches for company-related data using the OffeneRegisterAPI.
- `Run Phone Number Validation (y/N)?`: This enrichment step checks whether the
  provided phone numbers are valid and extracts geographical information using
  geocoder.
- `Run Google API (will use token and generate cost!)(y/N)?`: This enrichment
  step tries to find the correct business entry in the Google Maps database. It
  saves basic information along with the place ID, which can be used to
  retrieve further detailed information, and a confidence score indicating how
  confident we are that the correct result was found.
- `Run Google API Detailed (will use token and generate cost!)(y/N)?`: This
  enrichment step tries to gather detailed information for a given Google
  business entry, identified by the place ID.
- `Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?`:
  This enrichment step performs sentiment analysis on reviews using the GPT-4
  model.
- `Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?`: This
  enrichment step attempts to download a business's website in raw HTML format
  and passes this information to OpenAI's GPT, which then attempts to summarize
  the raw contents and extract valuable information for a salesperson.
- `Run Smart Review Insights (will take looong time!)(y/N)?`: This enrichment
  step enhances review insights for smart review analysis.
- `Run Regionalatlas (y/N)?`: This enrichment step queries the RegionalAtlas
  database for location-based geographic and demographic information, based on
  the address that was found for a business through the Google API.

It is emphasized that some steps are dependent on others, and excluding one
might result in dependency issues for subsequent steps.

After selecting the desired enrichment steps, a prompt asks the user to
`Set limit for data points to be processed (0=No limit)`, letting the user
choose whether to apply the data enrichment steps to all leads (no limit) or
only to a certain number of leads.

**Note**: If `DATABASE_TYPE="S3"` is set in your `.env` file, the limit is
removed so that all data is enriched and written to the
`s3://amos--data--events` S3 bucket.
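
As a minimal sketch, the relevant entry in the `.env` file would look like this (all other required variables are listed in `.env.template`):

```bash
# excerpt of a .env file that writes enriched data to S3
DATABASE_TYPE="S3"
```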

## (1) : Data preprocessing

Post data enrichment, preprocessing is crucial for machine learning models,
involving scaling, numerical outlier removal, and categorical one-hot encoding.
The user is prompted with questions:

- `Filter out the API-irrelevant data? (y/n)`: This filters out all the leads
  that could not be enriched during the data enrichment steps. Removing them is
  useful for the machine learning algorithms to avoid introducing bias, even if
  the missing features are padded with zeros.
- `Run on historical data ? (y/n) Note: DATABASE_TYPE should be S3!`: The user
  has to have `DATABASE_TYPE="S3"` in the `.env` file in order to run on
  historical data; otherwise, preprocessing runs locally. After preprocessing,
  the log shows where the preprocessed data is stored.

## (2) : Estimated Value Predictor

Six machine learning models are available:

```bash
(0) : Random Forest
(1) : XGBoost
(2) : Naive Bayes
(3) : KNN Classifier
(4) : AdaBoost
(5) : LightGBM
```

After selecting the desired machine learning model, the user is prompted with a
series of questions:

- `Load model from file? (y/N)` : In case of `y`, the program will ask for a
file location of a previously saved model to use for predictions and testing.
- `Use 3 classes ({XS}, {S, M, L}, {XL}) instead of 5 classes ({XS}, {S}, {M}, {L}, {XL})? (y/N)`:
  In case of `y`, the S, M, and L labels of the data are grouped together into
  one class, so that training is performed on 3 classes ({XS}, {S, M, L}, {XL})
  instead of 5. It is worth noting that grouping the S, M, and L classes into
  one class boosted the classification performance.
- ```bash
Do you want to train on a subset of features?
(0) : ['Include all features']
(1) : ['google_places_rating', 'google_places_user_ratings_total', 'google_places_confidence', 'regional_atlas_regional_score']
```

Choosing `0` includes all numerical and categorical one-hot encoded features,
while `1` selects only a small subset of features for the machine learning
models.

Then, the user is given multiple options:

```bash
(1) Train
(2) Test
(3) Predict on single lead
(4) Save model
(5) Exit
```

- (1): Train the current model on the current training dataset.
- (2): Test the current model on the test dataset, displaying the mean squared
error.
- (3): Choose a single lead from the test dataset and display the prediction and
true label.
- (4): Save the current model to `amos--models/models` on S3 if
  `DATABASE_TYPE=S3`; otherwise, it is saved locally.
- (5): Exit the EVP submenu.

## (3) : Merchant Size Prediction

After a model has been trained, tested, and saved, its true value lies in
generating predictions for previously unseen leads. This option applies the
selected pre-trained model to new lead data and outputs the predicted Merchant
Size.

## (4) : Exit

Gracefully exit the program.
