Move some existing documentation into Git Repo

Signed-off-by: Tims777 <[email protected]>
Tims777 committed Feb 6, 2024
1 parent 08573f9 commit 3ab293d
Showing 3 changed files with 329 additions and 0 deletions.
60 changes: 60 additions & 0 deletions Documentation/Build-Documentation.md
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]>
-->

# Creating the Environment

The repository contains the file `.env.template`. This file is a template for
the environment variables that need to be set for the application to run. Copy
this file into a file called `.env` at the root level of this repository and
fill in all values with the corresponding secrets.
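
On a Unix-like shell, for example, this could look as follows (the required variable names are those defined in `.env.template`):

```bash
# copy the template, then fill in the secret values manually
cp .env.template .env
```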

To create the virtual environment for this project, you must have `pipenv`
installed on your machine. Then run the following commands:

```bash
# for development environment
pipenv install --dev
# for production environment
pipenv install
```

To work within the environment you can now run:

```bash
# to activate the virtual environment
pipenv shell
# to run a single command
pipenv run <COMMAND>
```

# Build Process

This application is built and tested on every push and pull request creation
through GitHub Actions. For this, the `pipenv` environment is installed and then
the code style is checked using `flake8`. Finally, the `tests/` directory is
executed using `pytest` and a test coverage report is created using `coverage`.
The test coverage report can be found in the GitHub Actions output.
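
These checks can also be run locally from within the `pipenv` environment. The exact commands and flags used by the CI workflow may differ, but a typical invocation of these tools looks roughly like this:

```bash
# check the code style with flake8
pipenv run flake8 .
# run the test suite and collect coverage data
pipenv run coverage run -m pytest tests/
# print the coverage report
pipenv run coverage report
```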

In another task, all used packages are checked for their licenses to ensure that
the software does not use any copyleft licenses and remains open source and
free to use.

If any of these steps fail for a pull request, the pull request is blocked from
being merged until the corresponding step is fixed.

Furthermore, it is required to install the pre-commit hooks as described
[here](https://github.com/amosproj/amos2023ws06-sales-lead-qualifier/wiki/Knowledge#pre-commit).
This ensures a uniform coding style throughout the project and that the
software complies with the REUSE licensing specification.
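
Assuming the hooks are configured as described in the linked wiki page and `pre-commit` is available in the development environment, installing and running them locally looks roughly like this:

```bash
# register the git hooks described in the wiki
pipenv run pre-commit install
# optionally run all hooks against the whole code base once
pipenv run pre-commit run --all-files
```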

# Running the app

To run the application, the `pipenv` environment must be installed and all
required environment variables must be set in the `.env` file. Then the
application can be started via

```bash
pipenv run python src/main.py
```
82 changes: 82 additions & 0 deletions Documentation/Design-Documentation.md
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]>
SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
-->

# Introduction

This application is a tool used by our industry partner, SumUp, to enrich
information about potential leads collected through their sign-up website. The
enriched data is then used by a machine learning model to predict the potential
value a lead could contribute to SumUp. The application is split into two main
components: the Base Data Collector (BDC) and the Estimated Value Predictor
(EVP).

## Base Data Collector

### General description

The Base Data Collector (BDC) plays a crucial role in enriching the dataset of
potential client leads. The initial dataset comprises only basic lead
information: the lead's first and last name, phone number, email address, and
company name. Because this baseline data is not sufficient for value
prediction, the BDC queries diverse data sources, including various
Application Programming Interfaces (APIs), to enrich the provided lead data.

### Design

The different data sources are organised as steps in the program. Each step
extends a common parent class and implements methods to validate that it can
run, to perform the data collection from its source, and to perform clean-up
and statistics reporting for itself. These steps are collected in a pipeline
object that runs them sequentially to enhance the given data with all chosen
data sources. The data sources include:

- inspecting the possible custom domain of the email address.
- retrieving data from the Google Places API.
- performing sentiment analysis on Google reviews using the GPT-4 model.
- inspecting the surrounding areas of the business using the Regional Atlas API.
- searching for company-related data using the OffeneRegisterAPI.

### Data storage

All data for this project is stored in CSV files in the client's AWS S3 storage.
The files are split across three buckets: the input data and enhanced data are
stored in the events bucket, pre-processed data ready for use by the ML models
is stored in the features bucket, and the used model and its inference results
are stored in the model bucket.

### Data preprocessing

Following data enrichment, a pivotal phase in the machine learning pipeline is
data preprocessing, which encompasses scaling operations, numerical outlier
elimination, and categorical one-hot encoding. This preprocessing stage
transforms the output of the BDC into feature vectors, rendering them amenable
to predictive analysis by the machine learning model.

## Estimated Value Predictor

The primary objective of the EVP was initially to forecast the estimated
lifetime value of leads. This objective evolved during the project's
progression, primarily influenced by labelling considerations. The machine
learning model at the core of the EVP is trained on proprietary historical data
sourced from SumUp. The training process aims to discern discriminative
features that effectively separate the classes of the Merchant Size taxonomy.
Note that the underlying data is confidential and cannot be disclosed publicly.

## Merchant Size Predictor

In the context of Merchant Size Prediction, our aim is to leverage pre-trained
ML models on new lead data. By applying these models, we intend to predict the
potential Merchant Size, thereby assisting SumUp in prioritizing leads and
making informed decisions on which leads to contact first. This predictive
approach enhances the efficiency of lead management and optimizes resource
allocation for maximum impact.

## Data field definitions

This section describes the data field definitions obtained for each lead. The
data is either collected from the online Lead Form or extracted from online
sources using APIs.
187 changes: 187 additions & 0 deletions Documentation/User-Documentation.md
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]>
SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]>
-->

# Project vision

This product gives our industry partner a tool that can effectively increase
the conversion of leads to customers, primarily by providing the sales team
with valuable information. The modular architecture makes the product
future-proof by making it easy to add further data sources, employ improved
prediction models, or adjust the output format if desired.

# Project mission

The mission of this project is to enrich historical data about customers and
recent data about leads (with information from external sources) and to leverage
the enriched data in machine learning, so that the estimated Merchant Size of
leads can be predicted.

# Usage

To execute the final program, ensure the environment is installed (refer to
Build-Documentation.md) and run `python .\src\main.py` either locally or via
the build process. The user will be presented with the following options:

```bash
Choose demo:
(0) : Base Data Collector
(1) : Data preprocessing
(2) : Estimated Value Predictor
(3) : Merchant Size Prediction
(4) : Exit
```
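
For reference, one way to launch this menu is through the `pipenv` environment described in `Build-Documentation.md`:

```bash
# launch the interactive menu from the repository root
pipenv run python src/main.py
```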

## (0) : Base Data Collector

This is the data enrichment pipeline, utilizing multiple data enrichment steps.
Configuration options are presented:

`Do you want to list all available pipeline configs? (y/N)` If `y`:

```bash
Please enter the index of requested pipeline config:
(0) : config_sprint09_release.json
(1) : just_run_search_offeneregister.json
(2) : run_all_steps.json
(3) : Exit
```

- (0) Configuration used in sprint 9.
- (1) Configuration for OffeneRegister.
- (2) Runs all steps of the pipeline without step selection.
- (3) Exit to the pipeline step selection.

If `n`, the program proceeds to the pipeline step selection for data
enrichment and asks the following questions:

```bash
Run Scrape Address (will take a long time)(y/N)?
Run Search OffeneRegister (will take a long time)(y/N)?
Run Phone Number Validation (y/N)?
Run Google API (will use token and generate cost!)(y/N)?
Run Google API Detailed (will use token and generate cost!)(y/N)?
Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?
Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?
Run Smart Review Insights (will take looong time!)(y/N)?
Run Regionalatlas (y/N)?
```

- `Run Scrape Address (will take a long time)(y/N)?`: This enrichment step
  scrapes the lead's website for an address using regex.
- `Run Search OffeneRegister (will take a long time)(y/N)?`: This enrichment
step searches for company-related data using the OffeneRegisterAPI.
- `Run Phone Number Validation (y/N)?`: This enrichment step checks whether the
  provided phone numbers are valid and extracts geographical information using
  geocoder.
- `Run Google API (will use token and generate cost!)(y/N)?`: This enrichment
  step tries to find the correct business entry in the Google Maps database. It
  saves basic information along with the place ID, which can be used to
  retrieve further detailed information, and a confidence score indicating how
  confident we are that the correct result was found.
- `Run Google API Detailed (will use token and generate cost!)(y/N)?`: This
  enrichment step tries to gather detailed information for a given Google
  business entry, identified by the place ID.
- `Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?`:
  This enrichment step performs sentiment analysis on reviews using the GPT-4
  model.
- `Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?`: This
  enrichment step attempts to download a business's website in raw HTML format
  and passes this information to OpenAI's GPT, which then attempts to summarize
  the raw contents and extract valuable information for a salesperson.
- `Run Smart Review Insights (will take looong time!)(y/N)?`: This enrichment
  step enhances review insights for smart review analysis.
- `Run Regionalatlas (y/N)?`: This enrichment step queries the RegionalAtlas
  database for location-based geographic and demographic information, based on
  the address that was found for a business through the Google API.

It is emphasized that some steps are dependent on others, and excluding one
might result in dependency issues for subsequent steps.

After selecting the desired enrichment steps, a prompt asks the user to
`Set limit for data points to be processed (0=No limit)`, letting the user
choose whether to apply the data enrichment steps to all leads (no limit) or
only to a certain number of leads.

**Note**: If `DATABASE_TYPE="S3"` is set in your `.env` file, the limit is
removed so that all data is enriched and written to the
`s3://amos--data--events` S3 bucket.
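
As a minimal sketch, the relevant entry in the `.env` file would look like this (all other required variables are listed in `.env.template`):

```bash
# excerpt of a .env file that writes enriched data to S3
DATABASE_TYPE="S3"
```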

## (1) : Data preprocessing

Post data enrichment, preprocessing is crucial for machine learning models,
involving scaling, numerical outlier removal, and categorical one-hot encoding.
The user is prompted with questions:

- `Filter out the API-irrelevant data? (y/n)`: This filters out all the leads
  that could not be enriched during the data enrichment steps. Removing them is
  useful for the machine learning algorithms to avoid introducing bias, even if
  the missing features are padded with zeros.
- `Run on historical data ? (y/n) Note: DATABASE_TYPE should be S3!`: The user
  has to have `DATABASE_TYPE="S3"` in the `.env` file in order to run on
  historical data; otherwise, preprocessing runs locally. After preprocessing,
  the log shows where the preprocessed data is stored.

## (2) : Estimated Value Predictor

Six machine learning models are available:

```bash
(0) : Random Forest
(1) : XGBoost
(2) : Naive Bayes
(3) : KNN Classifier
(4) : AdaBoost
(5) : LightGBM
```

After selecting the desired machine learning model, the user is prompted with a
series of questions:

- `Load model from file? (y/N)` : In case of `y`, the program will ask for a
file location of a previously saved model to use for predictions and testing.
- `Use 3 classes ({XS}, {S, M, L}, {XL}) instead of 5 classes ({XS}, {S}, {M}, {L}, {XL})? (y/N)`:
  In case of `y`, the S, M, and L labels of the data are grouped together into
  one class, so that training is performed on 3 classes ({XS}, {S, M, L}, {XL})
  instead of 5. It is worth noting that grouping the S, M, and L classes into
  one class boosted the classification performance.
- ```bash
Do you want to train on a subset of features?
(0) : ['Include all features']
(1) : ['google_places_rating', 'google_places_user_ratings_total', 'google_places_confidence', 'regional_atlas_regional_score']
```

Choosing `0` includes all numerical and categorical one-hot encoded features,
while `1` selects only a small subset of features for the machine learning
models.

Then, the user is given multiple options:

```bash
(1) Train
(2) Test
(3) Predict on single lead
(4) Save model
(5) Exit
```

- (1): Train the current model on the current training dataset.
- (2): Test the current model on the test dataset, displaying the mean squared
error.
- (3): Choose a single lead from the test dataset and display the prediction and
true label.
- (4): Save the current model to `amos--models/models` on S3 if
  `DATABASE_TYPE=S3`; otherwise, it is saved locally.
- (5): Exit the EVP submenu.

## (3) : Merchant Size Prediction

After a model has been trained, tested, and saved, its true value lies in
generating predictions for previously unseen leads. This option applies the
selected pre-trained model to new lead data and outputs the predicted Merchant
Size.

## (4) : Exit

Gracefully exit the program.
