generated from amosproj/amos202Xss0Y-projname
Commit
Move some existing documentation into Git Repo
Signed-off-by: Tims777 <[email protected]>
Showing 3 changed files with 329 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
<!-- | ||
SPDX-License-Identifier: MIT | ||
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]> | ||
--> | ||
|
||
# Creating the Environment | ||
|
||
The repository contains the file `.env.template`. This file is a template for | ||
the environment variables that need to be set for the application to run. Copy | ||
this file into a file called `.env` at the root level of this repository and | ||
fill in all values with the corresponding secrets. | ||
|
||
To create the virtual environment in this project you must have `pipenv` | ||
installed on your machine. Then run the following commands: | ||
|
||
```[bash] | ||
# for development environment | ||
pipenv install --dev | ||
# for production environment | ||
pipenv install | ||
``` | ||
|
||
To work within the environment you can now run: | ||
|
||
```[bash] | ||
# to activate the virtual environment | ||
pipenv shell | ||
# to run a single command | ||
pipenv run <COMMAND> | ||
``` | ||
|
||
# Build Process | ||
|
||
This application is built and tested on every push and pull request creation | ||
through GitHub Actions. For this, the `pipenv` environment is installed and then | ||
the code style is checked using `flake8`. Finally, the tests in the `tests/` | ||
directory are run using `pytest` and a test coverage report is created using | ||
`coverage`. The test coverage report can be found in the GitHub Actions output. | ||
|
||
In another task, the licenses of all used packages are checked to ensure that | ||
the software does not depend on any copyleft licenses and remains open source | ||
and free to use. | ||
|
||
If any of these steps fails for a pull request, the pull request is blocked from | ||
being merged until the failing step is fixed. | ||
|
||
Furthermore, it is required to install the pre-commit hooks as described | ||
[here](https://github.com/amosproj/amos2023ws06-sales-lead-qualifier/wiki/Knowledge#pre-commit). | ||
This ensures uniform coding style throughout the project as well as that the | ||
software is compliant with the REUSE licensing specifications. | ||
|
||
# Running the app | ||
|
||
To run the application, the `pipenv` environment must be installed and all needed | ||
environment variables must be set in the `.env` file. Then the application can | ||
be started via: | ||
|
||
```[bash] | ||
pipenv run python src/main.py | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
<!-- | ||
SPDX-License-Identifier: MIT | ||
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]> | ||
SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]> | ||
--> | ||
|
||
# Introduction | ||
|
||
This application is used by our industry partner, SumUp, to enrich information | ||
about potential leads gathered through their sign-up website. The enriched data | ||
is then used by a machine learning model to predict the potential value that a | ||
lead could contribute to SumUp. The application is divided into two components: | ||
the Base Data Collector (BDC) and the Estimated Value Predictor (EVP). | ||
|
||
## Base Data Collector | ||
|
||
### General description | ||
|
||
The Base Data Collector (BDC) plays a crucial role in enriching the dataset | ||
related to potential client leads. The initial dataset solely comprises | ||
fundamental lead information, encompassing the lead's first and last name, phone | ||
number, email address, and company name. Recognizing the insufficiency of this | ||
baseline data for value prediction, the BDC is designed to query diverse data | ||
sources, incorporating various Application Programming Interfaces (APIs), to | ||
enrich the provided lead data. | ||
|
||
### Design | ||
|
||
The different data sources are organised as steps in the program. Each step | ||
extends from a common parent class and implements methods to validate that it | ||
can run, perform the data collection from the source and perform clean up and | ||
statistics reports for itself. These steps are then collected in a pipeline | ||
object sequentially performing the steps to enhance the given data with all | ||
chosen data sources. The data sources include: | ||
|
||
- inspecting the possible custom domain of the email address. | ||
- retrieving multiple data points from the Google Places API. | ||
- analysing the sentiment of Google reviews using GPT. | ||
- inspecting the surrounding areas of the business using the Regional Atlas API. | ||
- searching for company-related data using the OffeneRegisterAPI. | ||
- performing sentiment analysis on reviews using the GPT-4 model. | ||
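
The step and pipeline design described above could look roughly like the following minimal sketch. All names here (`Step`, `verify`, `run`, `finish`, `Pipeline`, `EmailDomainStep`, the freemail list) are assumptions made for illustration and are not taken from the project's source code.

```python
from abc import ABC, abstractmethod


class Step(ABC):
    """Hypothetical parent class: one subclass per data source."""

    name = "step"

    @abstractmethod
    def verify(self, leads):
        """Return True if the step has everything it needs to run."""

    @abstractmethod
    def run(self, leads):
        """Enrich the leads and return them."""

    def finish(self, leads):
        """Clean up and report simple statistics for this step."""
        print(f"{self.name}: processed {len(leads)} leads")


class EmailDomainStep(Step):
    """Illustrative step: keep the email domain unless it is a freemail one."""

    name = "email_domain"
    FREEMAIL = {"gmail.com", "yahoo.com", "outlook.com"}  # assumed list

    def verify(self, leads):
        return all("email" in lead for lead in leads)

    def run(self, leads):
        for lead in leads:
            domain = lead["email"].rsplit("@", 1)[-1].lower()
            lead["custom_domain"] = None if domain in self.FREEMAIL else domain
        return leads


class Pipeline:
    """Runs the configured steps one after another on the same data."""

    def __init__(self, steps):
        self.steps = steps

    def run(self, leads):
        for step in self.steps:
            if not step.verify(leads):
                print(f"{step.name}: verification failed, skipping")
                continue
            leads = step.run(leads)
            step.finish(leads)
        return leads


leads = [
    {"company": "Cafe Example", "email": "owner@cafe-example.de"},
    {"company": "Solo Trader", "email": "trader@gmail.com"},
]
enriched = Pipeline([EmailDomainStep()]).run(leads)
```

Keeping validation, collection and clean-up behind one interface means a new data source only needs a new subclass; the pipeline loop itself stays unchanged.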
|
||
### Data storage | ||
|
||
All data for this project is stored in CSV files in the client's AWS S3 storage. | ||
The files here are split into three buckets. The input data and enhanced data | ||
are stored in the events bucket, pre-processed data ready for use by ML models | ||
is stored in the features bucket and the used model and inference is stored in | ||
the model bucket. | ||
|
||
### Data preprocessing | ||
|
||
Following data enrichment, data preprocessing is an essential phase in the | ||
machine learning pipeline. It comprises scaling operations, numerical outlier | ||
elimination, and categorical one-hot encoding. This preprocessing stage | ||
transforms the output of the BDC into feature vectors that the machine learning | ||
model can use for prediction. | ||
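
To make the three preprocessing operations concrete, here is a small standard-library sketch. The column names, the z-score threshold, the use of min-max scaling and the order of the operations are all assumptions for illustration; the project's actual preprocessing may differ.

```python
from statistics import mean, pstdev


def remove_outliers(rows, column, max_z=3.0):
    """Drop rows whose value lies more than max_z standard deviations from the mean."""
    values = [row[column] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(rows)
    return [row for row in rows if abs(row[column] - mu) / sigma <= max_z]


def min_max_scale(rows, column):
    """Rescale a numerical column to the range [0, 1]."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return [{**row, column: (row[column] - low) / span} for row in rows]


def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Hypothetical enriched leads as they might come out of the BDC.
rows = [
    {"rating": 4.5, "business_type": "cafe"},
    {"rating": 3.0, "business_type": "bakery"},
    {"rating": 4.0, "business_type": "cafe"},
]
rows = remove_outliers(rows, "rating")
rows = min_max_scale(rows, "rating")
features = one_hot_encode(rows, "business_type")
```

Each resulting dictionary contains only numerical values and can be read as one feature vector.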
|
||
## Estimated Value Predictor | ||
|
||
The primary objective of the EVP was initially oriented towards forecasting the | ||
estimated life-value of leads. However, this objective evolved during the | ||
project's progression, primarily influenced by labelling considerations. The | ||
machine learning model, integral to the EVP, undergoes training on proprietary | ||
historical data sourced from SumUp. The training process aims to discern | ||
discriminative features that effectively stratify each class within the Merchant | ||
Size taxonomy. It is imperative to note that the confidentiality of the | ||
underlying data prohibits its public disclosure. | ||
|
||
## Merchant Size Predictor | ||
|
||
In the context of Merchant Size Prediction, our aim is to leverage pre-trained | ||
ML models on new lead data. By applying these models, we intend to predict the | ||
potential Merchant Size, thereby assisting SumUp in prioritizing leads and | ||
making informed decisions on which leads to contact first. This predictive | ||
approach enhances the efficiency of lead management and optimizes resource | ||
allocation for maximum impact. | ||
|
||
## Data field definitions | ||
|
||
This section highlights the data field definitions obtained for each lead. The | ||
acquisition of such data may derive from the online Lead Form or may be | ||
extracted from online sources utilizing APIs. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,187 @@ | ||
<!-- | ||
SPDX-License-Identifier: MIT | ||
SPDX-FileCopyrightText: 2023 Felix Zailskas <[email protected]> | ||
SPDX-FileCopyrightText: 2024 Ahmed Sheta <[email protected]> | ||
--> | ||
|
||
# Project vision | ||
|
||
This product will give our industry partner a tool that can effectively | ||
increase the conversion of their leads to customers, primarily by providing the | ||
sales team with valuable information. The modular architecture makes our product | ||
future-proof by making it easy to add further data sources, employ improved | ||
prediction models, or adjust the output format if desired. | ||
|
||
# Project mission | ||
|
||
The mission of this project is to enrich historical data about customers and | ||
recent data about leads (with information from external sources) and to leverage | ||
the enriched data in machine learning, so that the estimated Merchant Size of | ||
leads can be predicted. | ||
|
||
# Usage | ||
|
||
To execute the final program, ensure the environment is installed (refer to | ||
build-documents.md) and run `python .\src\main.py` either locally or via the | ||
build process. The user will be presented with the following options: | ||
|
||
```[bash] | ||
Choose demo: | ||
(0) : Base Data Collector | ||
(1) : Data preprocessing | ||
(2) : Estimated Value Predictor | ||
(3) : Merchant Size Prediction | ||
(4) : Exit | ||
``` | ||
|
||
## (0) : Base Data Collector | ||
|
||
This is the data enrichment pipeline, utilizing multiple data enrichment steps. | ||
Configuration options are presented: | ||
|
||
`Do you want to list all available pipeline configs? (y/N)` If `y`: | ||
|
||
```[bash] | ||
Please enter the index of requested pipeline config: | ||
(0) : config_sprint09_release.json | ||
(1) : just_run_search_offeneregister.json | ||
(2) : run_all_steps.json | ||
(3) : Exit | ||
``` | ||
|
||
- (0) Configuration used in sprint 9. | ||
- (1) Configuration for OffeneRegister. | ||
- (2) Running all the steps of the pipeline without step selection. | ||
- (3) Exit to the pipeline step selection. | ||
|
||
If `n`: proceed to pipeline step selection for data enrichment. Subsequent | ||
questions arise: | ||
|
||
```[bash] | ||
Run Scrape Address (will take a long time)(y/N)? | ||
Run Search OffeneRegister (will take a long time)(y/N)? | ||
Run Phone Number Validation (y/N)? | ||
Run Google API (will use token and generate cost!)(y/N)? | ||
Run Google API Detailed (will use token and generate cost!)(y/N)? | ||
Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)? | ||
Run openAI GPT Summarizer (will use token and generate cost!)(y/N)? | ||
Run Smart Review Insights (will take looong time!)(y/N)? | ||
Run Regionalatlas (y/N)? | ||
``` | ||
|
||
- `Run Scrape Address (will take a long time)(y/N)?`: This enrichment step | ||
  scrapes the lead's website for an address using regex. | ||
- `Run Search OffeneRegister (will take a long time)(y/N)?`: This enrichment | ||
  step searches for company-related data using the OffeneRegisterAPI. | ||
- `Run Phone Number Validation (y/N)?`: This enrichment step checks if the | ||
  provided phone numbers are valid and extracts geographical information using | ||
  geocoder. | ||
- `Run Google API (will use token and generate cost!)(y/N)?`: This enrichment | ||
  step tries to find the correct business entry in the Google Maps database. It | ||
  saves basic information along with the place ID, which can be used to retrieve | ||
  further detailed information, and a confidence score that indicates the | ||
  confidence in having found the correct result. | ||
- `Run Google API Detailed (will use token and generate cost!)(y/N)?`: This | ||
  enrichment step tries to gather detailed information for a given Google | ||
  business entry, identified by the place ID. | ||
- `Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?`: | ||
  This enrichment step performs sentiment analysis on reviews using the GPT-4 | ||
  model. | ||
- `Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?`: This | ||
  enrichment step attempts to download a business's website in raw HTML format | ||
  and pass this information to OpenAI's GPT, which will then attempt to | ||
  summarize the raw contents and extract valuable information for a salesperson. | ||
- `Run Smart Review Insights (will take looong time!)(y/N)?`: This enrichment | ||
  step enhances review insights for smart review analysis. | ||
- `Run Regionalatlas (y/N)?`: This enrichment step queries the RegionalAtlas | ||
  database for location-based geographic and demographic information, based on | ||
  the address that was found for a business through the Google API. | ||
|
||
It is emphasized that some steps are dependent on others, and excluding one | ||
might result in dependency issues for subsequent steps. | ||
|
||
After selecting the desired enrichment steps, a prompt asks the user to | ||
`Set limit for data points to be processed (0=No limit)` so that the user can | ||
choose whether to apply the data enrichment steps to all the leads (no limit) | ||
or only to a certain number of leads. | ||
|
||
**Note**: If `DATABASE_TYPE="S3"` is set in your `.env` file, the limit is | ||
removed so that all the data is enriched into the `s3://amos--data--events` S3 | ||
bucket. | ||
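
The limit rule described above can be summarised in a short sketch. The function name `select_leads` and the default value `"Local"` for the database type are hypothetical; only the `0 = no limit` convention and the `S3` override come from the text.

```python
def select_leads(leads, limit, database_type="Local"):
    """Return the leads that the enrichment steps should process.

    A limit of 0 means "no limit". With DATABASE_TYPE="S3" the limit is
    ignored and all leads are processed.
    """
    if database_type == "S3" or limit == 0:
        return list(leads)
    return list(leads[:limit])


leads = ["lead-1", "lead-2", "lead-3"]
first_two = select_leads(leads, limit=2)
everything = select_leads(leads, limit=0)
s3_run = select_leads(leads, limit=2, database_type="S3")
```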
|
||
## (1) : Data preprocessing | ||
|
||
Post data enrichment, preprocessing is crucial for machine learning models, | ||
involving scaling, numerical outlier removal, and categorical one-hot encoding. | ||
The user is prompted with questions: | ||
|
||
- `Filter out the API-irrelevant data? (y/n)`: This filters out all the leads | ||
  that couldn't be enriched during the data enrichment steps. Removing them is | ||
  useful for the machine learning algorithms because it avoids the bias these | ||
  leads would introduce, even if their features were padded with zeros. | ||
- `Run on historical data ? (y/n) Note: DATABASE_TYPE should be S3!`: The user | ||
  must have `DATABASE_TYPE="S3"` in the `.env` file in order to run on historical | ||
  data; otherwise, it will run locally. After preprocessing, the log will show | ||
  where the preprocessed data is stored. | ||
|
||
## (2) : Estimated Value Predictor | ||
|
||
Six machine learning models are available: | ||
|
||
```[bash] | ||
(0) : Random Forest | ||
(1) : XGBoost | ||
(2) : Naive Bayes | ||
(3) : KNN Classifier | ||
(4) : AdaBoost | ||
(5) : LightGBM | ||
``` | ||
|
||
After selecting the desired machine learning model, the user is prompted with a | ||
series of questions: | ||
|
||
- `Load model from file? (y/N)` : In case of `y`, the program will ask for a | ||
file location of a previously saved model to use for predictions and testing. | ||
- `Use 3 classes ({XS}, {S, M, L}, {XL}) instead of 5 classes ({XS}, {S}, {M}, {L}, {XL})? (y/N)`: | ||
  In case of `y`, the S, M and L labels of the data are grouped together as one | ||
  class, so that the training is done on 3 classes ({XS}, {S, M, L}, {XL}) | ||
  instead of the 5 classes. It is worth noting that grouping the S, M and L | ||
  classes together as one class improved the classification performance. | ||
- ```[bash] | ||
Do you want to train on a subset of features? | ||
(0) : ['Include all features'] | ||
(1) : ['google_places_rating', 'google_places_user_ratings_total', 'google_places_confidence', 'regional_atlas_regional_score'] | ||
``` | ||
|
||
`0` includes all the numerical and categorical one-hot encoded features, while | ||
`1` chooses a small subset of the data as features for the machine learning | ||
models. | ||
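
The three-class option described above amounts to relabelling the targets before training. A minimal sketch follows; the merged label name `"SML"` and the function name `group_labels` are assumptions for illustration only.

```python
# Mapping from the five Merchant Size classes to the three grouped classes.
# "SML" is a hypothetical name for the merged {S, M, L} class.
THREE_CLASS_MAP = {"XS": "XS", "S": "SML", "M": "SML", "L": "SML", "XL": "XL"}


def group_labels(labels, use_three_classes):
    """Return the labels unchanged, or with S, M and L merged into one class."""
    if not use_three_classes:
        return list(labels)
    return [THREE_CLASS_MAP[label] for label in labels]


labels = ["XS", "S", "M", "L", "XL", "M"]
grouped = group_labels(labels, use_three_classes=True)
```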
|
||
Then, the user is given multiple options: | ||
|
||
```[bash] | ||
(1) Train | ||
(2) Test | ||
(3) Predict on single lead | ||
(4) Save model | ||
(5) Exit | ||
``` | ||
|
||
- (1): Train the current model on the current training dataset. | ||
- (2): Test the current model on the test dataset, displaying the mean squared | ||
  error. | ||
- (3): Choose a single lead from the test dataset and display the prediction and | ||
  true label. | ||
- (4): Save the current model to `amos--models/models` on S3 in case of | ||
  `DATABASE_TYPE=S3`; otherwise, it is saved locally. | ||
- (5): Exit the EVP submenu. | ||
|
||
## (3) : Merchant Size Prediction | ||
|
||
After training, testing, and saving a model, its real value lies in the crucial | ||
task of generating predictions for previously unseen leads. | ||
|
||
## (4) : Exit | ||
|
||
Gracefully exit the program. |