diff --git a/Deliverables/sprint-13/build-documentation.pdf b/Deliverables/sprint-13/build-documentation.pdf new file mode 100644 index 0000000..9614ffc Binary files /dev/null and b/Deliverables/sprint-13/build-documentation.pdf differ diff --git a/Deliverables/sprint-13/build-documentation.pdf.license b/Deliverables/sprint-13/build-documentation.pdf.license new file mode 100644 index 0000000..adcf5d5 --- /dev/null +++ b/Deliverables/sprint-13/build-documentation.pdf.license @@ -0,0 +1,2 @@ +SPDX-License-Identifier: MIT +SPDX-FileCopyrightText: 2024 Ahmed Sheta diff --git a/Deliverables/sprint-13/design-documentation.pdf b/Deliverables/sprint-13/design-documentation.pdf new file mode 100644 index 0000000..68e6579 Binary files /dev/null and b/Deliverables/sprint-13/design-documentation.pdf differ diff --git a/Deliverables/sprint-13/design-documentation.pdf.license b/Deliverables/sprint-13/design-documentation.pdf.license new file mode 100644 index 0000000..adcf5d5 --- /dev/null +++ b/Deliverables/sprint-13/design-documentation.pdf.license @@ -0,0 +1,2 @@ +SPDX-License-Identifier: MIT +SPDX-FileCopyrightText: 2024 Ahmed Sheta diff --git a/Deliverables/sprint-13/feature-board.png b/Deliverables/sprint-13/feature-board.png new file mode 100644 index 0000000..dfc38f0 Binary files /dev/null and b/Deliverables/sprint-13/feature-board.png differ diff --git a/Deliverables/sprint-13/feature-board.png.license b/Deliverables/sprint-13/feature-board.png.license new file mode 100644 index 0000000..875941a --- /dev/null +++ b/Deliverables/sprint-13/feature-board.png.license @@ -0,0 +1,2 @@ +SPDX-License-Identifier: MIT +SPDX-FileCopyrightText: 2024 Simon Zimmermann diff --git a/Deliverables/sprint-13/imp-squared-backlog.jpg b/Deliverables/sprint-13/imp-squared-backlog.jpg new file mode 100644 index 0000000..1f7fc01 Binary files /dev/null and b/Deliverables/sprint-13/imp-squared-backlog.jpg differ diff --git a/Deliverables/sprint-13/imp-squared-backlog.jpg.license b/Deliverables/sprint-13/imp-squared-backlog.jpg.license new file mode 100644 index 0000000..f6aa6f0 --- /dev/null +++ b/Deliverables/sprint-13/imp-squared-backlog.jpg.license @@ -0,0 +1,2 @@ +SPDX-License-Identifier: MIT +SPDX-FileCopyrightText: 2023 Nico Hambauer diff --git a/Deliverables/sprint-13/planning-document.pdf b/Deliverables/sprint-13/planning-document.pdf new file mode 100644 index 0000000..c458ca5 Binary files /dev/null and b/Deliverables/sprint-13/planning-document.pdf differ diff --git a/Deliverables/sprint-13/planning-document.pdf.license b/Deliverables/sprint-13/planning-document.pdf.license new file mode 100644 index 0000000..da783bc --- /dev/null +++ b/Deliverables/sprint-13/planning-document.pdf.license @@ -0,0 +1,2 @@ +SPDX-License-Identifier: MIT +SPDX-FileCopyrightText: 2024 Simon Zimmermann diff --git a/Deliverables/sprint-13/user-documentation.pdf b/Deliverables/sprint-13/user-documentation.pdf new file mode 100644 index 0000000..4c3c50c Binary files /dev/null and b/Deliverables/sprint-13/user-documentation.pdf differ diff --git a/Deliverables/sprint-13/user-documentation.pdf.license b/Deliverables/sprint-13/user-documentation.pdf.license new file mode 100644 index 0000000..adcf5d5 --- /dev/null +++ b/Deliverables/sprint-13/user-documentation.pdf.license @@ -0,0 +1,2 @@ +SPDX-License-Identifier: MIT +SPDX-FileCopyrightText: 2024 Ahmed Sheta diff --git a/Documentation/Architecture.md b/Documentation/Architecture.md deleted file mode 100644 index 9560376..0000000 --- a/Documentation/Architecture.md +++ /dev/null @@ -1,54
+0,0 @@ - - -# Software Architecture - -The goal of this project is to qualify sales leads in different ways, both in terms of -their likelihood of becoming customers and the size of their potential revenue. - -## External Software - -### Lead Form (LF) - -The _Lead Form_ is submitted by every new lead and provides a small set of data about the lead. - -### Customer Relationship Management (CRM) - -The project output is made available to the sales team. -This can be done in different ways, e.g. writing to a Google Sheet or pushing directly to SalesForce. - -## Components - -### Base Data Collector (BDC) - -The _Base Data Collector_ fullfills the task of collecting data about a lead from various online sources. -All collected data is then stored in a database for later retrieval in a standardized manner. - -### Expected Value Predictor (EVP) - -The _Expected Value Predictor_ estimates the expected value of a lead by analyzing the collected data of that lead. -This is done using a machine learning approach, where the EVP is trained on historical data. -Preprocessing of both the collected and the historical data should be done inside the EVP, -if it goes beyond the scope of standardization. - -### Controller - -The _Controller_ is an optional component, which coordinates BDC, EVP and the external components as a centralized control instance. -That said, another (more advanced) approach would be to use a pipelined control flow, driven by web hooks or similar signals. - -## Diagrams - -### Component Diagram - -![Component Diagram](Media/component-diagram.svg) - -### Sequence Diagram - -![Sequence Diagram](Media/sequence-diagram.svg) - -### Controller Workflow Diagram - -![Controller Workflow Diagram](Media/controller-workflow-diagram.jpg) diff --git a/Documentation/Build-Documentation.md b/Documentation/Build-Documentation.md new file mode 100644 index 0000000..0f5628c --- /dev/null +++ b/Documentation/Build-Documentation.md @@ -0,0 +1,60 @@ + + +# Creating the Environment + +The repository contains the file `.env.template`. This file is a template for +the environment variables that need to be set for the application to run. Copy +this file into a file called `.env` at the root level of this repository and +fill in all values with the corresponding secrets. + +To create the virtual environment in this project you must have `pipenv` +installed on your machine. Then run the following commands: + +```[bash] +# for development environment +pipenv install --dev +# for production environment +pipenv install +``` + +To work within the environment you can now run: + +```[bash] +# to activate the virtual environment +pipenv shell +# to run a single command +pipenv run <command> +``` + +# Build Process + +This application is built and tested on every push and pull request creation +through GitHub Actions. For this, the `pipenv` environment is installed and then +the code style is checked using `flake8`. Finally, the `tests/` directory is +executed using `pytest` and a test coverage report is created using `coverage`. +The test coverage report can be found in the GitHub Actions output. + +In another task, all used packages are tested for their license to ensure that +the software does not use any copy-left licenses and remains open source and +free to use. + +If any of these steps fails for a pull request, the pull request is blocked from +being merged until the corresponding step is fixed.
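+
+As an illustration of the `pytest` convention used for the `tests/` directory, a minimal sketch is shown below (a hypothetical test written for this documentation, not an actual test from the repository):
+
+```[python]
+# tests/test_example.py -- hypothetical example, only to illustrate the convention
+
+
+def normalize_phone_number(number: str) -> str:
+    """Toy function standing in for real application code."""
+    return number.replace(" ", "")
+
+
+def test_normalize_phone_number() -> None:
+    # pytest discovers `test_`-prefixed functions in the tests/ directory
+    assert normalize_phone_number("49 1234 56789") == "49123456789"
+```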
+ +Furthermore, it is required to install the pre-commit hooks as described +[here](https://github.com/amosproj/amos2023ws06-sales-lead-qualifier/wiki/Knowledge#pre-commit). +This ensures uniform coding style throughout the project as well as that the +software is compliant with the REUSE licensing specifications. + +# Running the app + +To run the application the `pipenv` environment must be installed and all needed +environment variables must be set in the `.env` file. Then the application can +be started via + +```[bash] +pipenv run python src/main.py +``` diff --git a/Documentation/Classifier-Comparison.md b/Documentation/Classifier-Comparison.md index 9aeb313..b38a5b4 100644 --- a/Documentation/Classifier-Comparison.md +++ b/Documentation/Classifier-Comparison.md @@ -6,8 +6,14 @@ SPDX-FileCopyrightText: 2024 Ahmed Sheta # Classifier Comparison -This document compares the results of the following classifiers on the enriched and -preprocessed data set from the 22.01.2024. +## Abstract + +This report presents a comprehensive evaluation of various classifiers trained on the historical dataset, which has been enriched and preprocessed through our pipeline. Each model type was tested on two splits of the data set. The used data set has five +classes for prediction corresponding to different merchant sizes, namely XS, S, M, L, and XL. The first split of the data set used exactly these classes for the prediction corresponding to the exact classes given by SumUp. The other data set split grouped the classes S, M, and L into one new class resulting in three classes of the form {XS}, {S, M, L}, and {XL}. While this does not exactly correspond to the given classes from SumUp, this simplification of the prediction task generally resulted in a better F1-score across models. + +## Experimental Attempts + +In accordance with the no free lunch theorem, which indicates that no model is universally superior, multiple attempts were made to find the optimal solution. Unfortunately, certain models did not perform satisfactorily. Here are the models and methodologies we experimented with: - Quadratic Discriminant Analysis (QDA) - Ridge Classifier @@ -18,24 +24,13 @@ preprocessed data set from the 22.01.2024. - XGBoost Classifier Model - K Nearest Neighbor Classifier (KNN) - Bernoulli Naive Bayes Classifier - -Each model type was tested on two splits of the data set. The used data set has five -classes for prediction corresponding to different merchant sizes, namely XS, S, M, L, and XL. -The first split of the data set used exactly these classes for the prediction corresponding -to the exact classes given by SumUp. The other data set split grouped the classes S, M, and L -into one new class resulting in three classes of the form {XS}, {S, M, L}, and {XL}. While -this does not exactly correspond to the given classes from SumUp, this simplification of -the prediction task generally resulted in a better F1-score across models. - -## Experimental Attempts - -According to free lunch theorem, there is no universal model or methodology that is top performing on every problem or data, therefore multiple attempts are crucal. In this section, we will document the experiments we tried and their corresponding performance and outputs. +- LightGBM ## Models not performing well ### Support Vector Machine Classifier Model -Training Support Vector Machine (SVM) took a while such that the training never ended.
It is believed that it is the case because SVMs are very sensitive to the misclassifications and it finds a hard time minimizing them, given the data. +Training the Support Vector Machine (SVM) took so long that the training never finished. We believe this is because SVMs are very sensitive to misclassifications and have a hard time minimizing them, given the data. ### Fully Connected Neural Networks Classifier Model @@ -82,7 +77,6 @@ The following subsets are available: - The XGBoost was trained for 10000 rounds. - The LightGBM was trained with 2000 leaves - In the following table we can see the model's overall weighted F1-score on the 3-class and 5-class data set split. The best performing classifier per row is marked **bold**. @@ -141,3 +135,7 @@ In the following table we can see the F1-score of each model for each class in t For the 3-class split we observe similar performance for the XS and {S, M, L} classes for each model, while the LightGBM model slightly outperforms the other models. The LightGBM classifier is performing the best on the XL class while the Naive Bayes classifier performs worst. Interestingly, we can observe that the performance of the models on the XS class was barely affected by the merging of the S, M, and L classes while the performance on the XL class got worse for all of them. This needs to be considered when evaluating the overall performance of the models on this data set split. The AdaBoost Classifier, trained on subset 1, performs best for the XL class. The KNN classifier got a slight boost in performance for the {S, M, L} and XL classes when using subset 1. All other models perform worse on subset 1. + +# Conclusion + +In summary, XGBoost consistently demonstrated superior performance, showcasing robust results across various splits and subsets. However, it is crucial to note that its elevated score is attributed to potential overfitting on the XS class. Given SumUp's emphasis on accurate predictions for higher classes, we recommend considering LightGBM. This model outperformed XGBoost in predicting the XL class as well as the other classes, offering better results in both the five-class and three-class splits. diff --git a/Documentation/Controller.md b/Documentation/Controller.md new file mode 100644 index 0000000..09e46d0 --- /dev/null +++ b/Documentation/Controller.md @@ -0,0 +1,32 @@ + + +# Automation + +The _Controller_ is a planned component that has not been implemented beyond a +conceptual prototype. In the planned scenario, the controller would coordinate +BDC, MSP and the external components as a centralized instance of control. In +contrast to our current design, this scenario would enable the automation of our +current workflow, where several steps of human interaction are currently +required to achieve a prediction result for initially unprocessed lead data.
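+
+As a very rough sketch of this planned control flow (purely illustrative code, assuming the queue-based Listener/messageQueue/routingQueue concept outlined in Miscellaneous.md; none of these names exist in the code base):
+
+```[python]
+import queue
+import threading
+
+
+def handle(message: dict) -> dict:
+    """Placeholder for the actual message processing (illustrative only)."""
+    return message
+
+
+def controller_loop(message_queue: queue.Queue, routing_queue: queue.Queue) -> None:
+    """Consume messages from the components (BDC, MSP, ...), process them and
+    enqueue them again for routing to the target component."""
+    while True:
+        message = message_queue.get()  # blocks (idle) while no messages arrive
+        routing_queue.put(handle(message))
+
+
+message_queue: queue.Queue = queue.Queue()  # filled by a listener thread
+routing_queue: queue.Queue = queue.Queue()  # consumed by a routing thread
+threading.Thread(target=controller_loop, args=(message_queue, routing_queue), daemon=True).start()
+```
+
+## Diagrams
+
+The following diagrams were created during the prototyping phase for the
+Controller component. As they are from an early stage of our project, the
+Merchant Size Predictor is labelled as the (Estimated) Value Predictor here.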
+ +### Component Diagram + +![Component Diagram](Media/component-diagram-with-controller.svg) + +### Sequence Diagram + +![Sequence Diagram](Media/sequence-diagram.svg) + +### Controller Workflow Diagram + +![Controller Workflow Diagram](Media/controller-workflow-diagram.jpg) diff --git a/Documentation/Data-Field-Definition.md b/Documentation/Data-Field-Definition.md deleted file mode 100644 index c7dab11..0000000 --- a/Documentation/Data-Field-Definition.md +++ /dev/null @@ -1,29 +0,0 @@ - - -# Data Field Definitions - -This document outlines the data fields obtained for each lead. The data can be sourced from the online _Lead Form_ or be retrieved from the internet using APIs. It is currently unfinished, and will be updated once we finalise what data points will be used for the AI model. - -The data types selected are on the assumption that we’re using the PostgreSQL database. - -## Data Field Table - -| Field Name | Data Type | Description | Validation Rules | Data Source | Sample Data (if available) | Name Convention | -| -------------------------------- | :-------: | ----------------------------------------------------------------------------------------------------- | -------------------------------- | :------------: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------: | -| First Name | text | First name of business owner | | Lead Form | | first_name | -| Last Name | text | Last name of business owner | | Lead Form | | last_name | -| Email Address | text | Owner’s email address (doesn’t specify business or personal) | | Lead Form | | email_address | -| Telephone Number | varchar | Owner’s telephone number (doesn’t specify business or personal) | Length dependent on country code | Lead Form | | phone_number | -| Annual Income from Card Payments | enum | Enumerated income-ranges that indicate how much of the company’s income is comprised of card payments | | Lead Form | Categories:
Keine
0 – 35.000
35.000 - 60.000
60.000 - 100.000
100.000 - 200.000
200.000 - 400.000
400.000 - 600.000
600.000 - 1 Mio.
1 Mio. – 2 Mio.
2 Mio. – 5 Mio.
Mehr als 5 Mio. | annual_income | -| Products of Interest | enum | Enumerated categories indicating SumUp products the owner is interested in | | Lead Form | Categories:
Keine
Alle
Kartenterminals
Kassensystem
Geshäftskonto
Andere | products_of_interest | -| Email Domain | text | Domain of the email address provided by the lead form | | Pre-processing | | domain | - -## Links to Data Sources: - -Lead form: https://www.sumup.com/de-de/kontaktieren-vertriebsteam/ \ -Google Places API: https://developers.google.com/maps/documentation/places/web-service/overview \ -OpenAI API: https://platform.openai.com/docs/overview \ -Meta API: https://developers.facebook.com/docs/graph-api/overview diff --git a/Documentation/Data-Fields.md b/Documentation/Data-Fields.md new file mode 100644 index 0000000..1043218 --- /dev/null +++ b/Documentation/Data-Fields.md @@ -0,0 +1,23 @@ + + +# Data Field Definitions + +This document outlines the data fields obtained for each lead. The data can be +sourced from the online _Lead Form_ or be retrieved from the internet using +APIs. + +## Data Field Table + +The most recent Data Fields table can now be found in a +[separate CSV File](./data-fields.csv). + +## Links to Data Sources: + +Lead form: https://www.sumup.com/de-de/kontaktieren-vertriebsteam/ \ +Google Places API: https://developers.google.com/maps/documentation/places/web-service/overview \ +OpenAI API: https://platform.openai.com/docs/overview \ +Meta API: https://developers.facebook.com/docs/graph-api/overview diff --git a/Documentation/Design-Documentation.md b/Documentation/Design-Documentation.md new file mode 100644 index 0000000..d9a97f1 --- /dev/null +++ b/Documentation/Design-Documentation.md @@ -0,0 +1,104 @@ + + +# Introduction + +This application serves as a pivotal tool employed by our esteemed industry +partner, SumUp, for the enrichment of information pertaining to potential leads +garnered through their sign-up website. The refined data obtained undergoes +utilization in the prediction of potential value that a lead could contribute to +SumUp, facilitated by a sophisticated machine learning model. The application is +branched into two integral components: the Base Data Collector (BDC) and the +Merchant Size Predictor (MSP). + +## Component Diagram + +![Component Diagram](Media/component-diagram.svg) + +## External Software + +### Lead Form (LF) + +The _Lead Form_ is submitted by every new lead and provides a small set of data +about the lead. + +### Customer Relationship Management (CRM) + +The project output is made available to the sales team. This can be done in +different ways, e.g. writing to a Google Sheet or pushing directly to +SalesForce. + +## Components + +## Base Data Collector (BDC) + +### General description + +The Base Data Collector (BDC) plays a crucial role in enriching the dataset +related to potential client leads. The initial dataset solely comprises +fundamental lead information, encompassing the lead's first and last name, phone +number, email address, and company name. Recognizing the insufficiency of this +baseline data for value prediction, the BDC is designed to query diverse data +sources, incorporating various Application Programming Interfaces (APIs), to +enrich the provided lead data. + +### Design + +The different data sources are organised as steps in the program. Each step +extends from a common parent class and implements methods to validate that it +can run, perform the data collection from the source and perform clean up and +statistics reports for itself. These steps are then collected in a pipeline +object sequentially performing the steps to enhance the given data with all +chosen data sources. 
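+
+A minimal sketch of this step/pipeline design is given below (all class, method and attribute names are illustrative assumptions, not the actual implementation):
+
+```[python]
+from abc import ABC, abstractmethod
+
+import pandas as pd
+
+
+class Step(ABC):
+    """Common parent class of all data enrichment steps (name is illustrative)."""
+
+    @abstractmethod
+    def verify(self) -> bool:
+        """Validate that this step can run, e.g. that required API keys are set."""
+
+    @abstractmethod
+    def run(self, leads: pd.DataFrame) -> pd.DataFrame:
+        """Collect data from this step's source and add it to the lead data."""
+
+    @abstractmethod
+    def finish(self) -> None:
+        """Clean up and report statistics for this step."""
+
+
+class Pipeline:
+    """Collects the chosen steps and applies them sequentially to the lead data."""
+
+    def __init__(self, steps: list[Step]) -> None:
+        self.steps = steps
+
+    def run(self, leads: pd.DataFrame) -> pd.DataFrame:
+        for step in self.steps:
+            if step.verify():  # only run steps that validated successfully
+                leads = step.run(leads)
+                step.finish()
+        return leads
+```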
The data sources include: + +- inspecting the possible custom domain of the email address. +- retrieving various data from the Google Places API. +- analysing the sentiment of Google reviews using the GPT-4 model. +- inspecting the surrounding areas of the business using the Regional Atlas API. +- searching for company-related data using the OffeneRegisterAPI. + +### Data storage + +All data for this project is stored in CSV files in the client's AWS S3 storage. +The files here are split into three buckets. The input data and enhanced data +are stored in the events bucket, pre-processed data ready for use of ML models +is stored in the features bucket and the used model and inference is stored in +the model bucket. + +### Data preprocessing + +Following data enrichment, a pivotal phase +in the machine learning pipeline is data preprocessing, an essential process +encompassing scaling operations, numerical outlier elimination, and categorical +one-hot encoding. This preprocessing stage serves to transform the output +originating from the BDC into feature vectors, thereby rendering them amenable +for predictive analysis by the machine learning model. + +## Merchant Size Predictor (MSP) / Estimated Value Predictor (EVP) + +### Historical Note + +The primary objective of the Estimated Value Predictor was initially oriented +towards forecasting the estimated life-time value of leads. However, this +objective evolved during the project's progression, primarily influenced by +labelling considerations. The main objective has therefore changed to predicting +only the size of a given lead, which can then be used as an indication for their +potential life-time value. As a consequence, the component in question is now +(somewhat inconsistently) either referred to as the Estimated Value Predictor +(EVP) or as the Merchant Size Predictor (MSP). + +### Design + +In the context of Merchant Size Prediction, our aim is to leverage pre-trained +ML models on new lead data. By applying these models, we intend to predict the +potential Merchant Size, thereby assisting SumUp in prioritizing leads and +making informed decisions on which leads to contact first. This predictive +approach enhances the efficiency of lead management and optimizes resource +allocation for maximum impact. + +The machine learning model, integral to the MSP, undergoes training on +proprietary historical data sourced from SumUp. The training process aims to +discern discriminative features that effectively stratify each class within the +Merchant Size taxonomy. It is imperative to note that the confidentiality of the +underlying data prohibits its public disclosure. diff --git a/Documentation/Media/component-diagram-with-controller.svg b/Documentation/Media/component-diagram-with-controller.svg new file mode 100644 index 0000000..778caeb --- /dev/null +++ b/Documentation/Media/component-diagram-with-controller.svg @@ -0,0 +1,4 @@ + + + +
«component»
Base Data Collector
«component»Base Data Collector
«component»
Value Predictor
«component»...
SumUp
SumUp
«component»
Lead Form
«component»...
«component»
Customer Relationship
Management
«component»...
«component»
Controller
«component»Controller
Text is not SVG - cannot display
diff --git a/Documentation/Media/component-diagram-with-controller.svg.license b/Documentation/Media/component-diagram-with-controller.svg.license new file mode 100644 index 0000000..899fd72 --- /dev/null +++ b/Documentation/Media/component-diagram-with-controller.svg.license @@ -0,0 +1,3 @@ +SPDX-License-Identifier: MIT +SPDX-FileCopyrightText: 2023 Simon Zimmermann +SPDX-FileCopyrightText: 2023 Lucca Baumgärtner diff --git a/Documentation/Media/component-diagram.svg b/Documentation/Media/component-diagram.svg index 778caeb..4aab7d4 100644 --- a/Documentation/Media/component-diagram.svg +++ b/Documentation/Media/component-diagram.svg @@ -1,4 +1,4 @@ -
«component»
Base Data Collector
«component»Base Data Collector
«component»
Value Predictor
«component»...
SumUp
SumUp
«component»
Lead Form
«component»...
«component»
Customer Relationship
Management
«component»...
«component»
Controller
«component»Controller
Text is not SVG - cannot display
+
«component»
Base Data Collector
«component»
Merchant Size Predictor
SumUp
«component»
Lead Form
«component»
Customer Relationship
Management
AWS S3
diff --git a/Documentation/Media/component-diagram.svg.license b/Documentation/Media/component-diagram.svg.license index 899fd72..2621f97 100644 --- a/Documentation/Media/component-diagram.svg.license +++ b/Documentation/Media/component-diagram.svg.license @@ -1,3 +1,3 @@ SPDX-License-Identifier: MIT -SPDX-FileCopyrightText: 2023 Simon Zimmermann -SPDX-FileCopyrightText: 2023 Lucca Baumgärtner +SPDX-FileCopyrightText: 2023 Lucca Baumgärtner +SPDX-FileCopyrightText: 2024 Simon Zimmermann diff --git a/Documentation/Miscellaneous.md b/Documentation/Miscellaneous.md new file mode 100644 index 0000000..8a61015 --- /dev/null +++ b/Documentation/Miscellaneous.md @@ -0,0 +1,206 @@ + + +# Miscellaneous Content + +This file contains content that was moved over from our [Wiki](https://github.com/amosproj/amos2023ws06-sales-lead-qualifier/wiki), which we gave up in favor of having the documentation available more centrally. The contents of this file might to some extent overlap with the contents found in other documentation files. + +# Knowledge Base + +## AWS + +1. New password has to be >= 16 char and contain special chars +2. After changing the password you have to re-login +3. Add MFA (IAM -> Users -> Your Name -> Access Info) +4. MFA device = FirstName.LastName like the credential +5. Re-login +6. Get access keys: + - IAM -> Users -> Your Name -> Access Info -> Scroll to Access Keys + - Create new access key (for local development) + - Accept the warning + - Copy the secret key to your .env file + - Don’t add description tags to your key + +## PR Management: + +1. Create PR +2. Link issue +3. Other SD reviews the PR + - Modification needed? + - Fix/Discuss issue in the GitHub comments + - Make new commit + - Return to step 3 + - No Modification needed + - Reviewer approves PR +4. PR creator merges PR +5. Delete the used branch + +## Branch-Management: + +- Remove branches after merging +- Add reviews / pull requests so others check the code +- Feature branches with dev instead of main as base + +## Pre-commit: + +```[bash] +# If not installed yet +pip install pre-commit + +# install the hooks; they are then executed automatically before every commit +pre-commit install + +# execute the hooks manually on all files +pre-commit run --all-files +``` + +## Features + +- Existing Website (Pingable, SEO-Score, DNS Lookup) +- Existing Google Business Entry (using the [Google Places API](https://developers.google.com/maps/documentation/places/web-service?hl=de)) + - Opening Times + - Number, Quality of Ratings + - Overall “completeness” of the entry/# of available datapoints + - Price category + - Phone Number (compare with lead form input) + - Website (compare with lead form input) + - Number of visitors (estimate revenue from that?) + - Product recognition from images + - Merchant Category (e.g. cafe, restaurant, retailer, etc.) +- Performance Indicators (NorthData, some other API) + - Revenue (as I understood, this should be > 5000$/month) + - Number of Employees + - Bundesanzeiger / Handelsregister (Deutschland API) +- Popularity: Insta / Facebook followers or website ranking on Google +- Business type: Google or website extraction (maybe with ChatGPT) +- Size of business: To categorize leads to decide whether they need to deal with a salesperson or self-direct their solution +- Business profile +- Sentiment Analysis: https://arxiv.org/pdf/2307.10234.pdf + +## Storage + +- Unique ID for Lead (Felix)? +- How to handle frequent data layout changes at S3 (Simon)? +- 3 stage file systems (Felix) vs. DB (Ruchita)?
+- 3 stage file system (Felix): + - BDC trigger on single new lead entries or batches + - After BDC enriched the data => store in a parquet file in the events folder with some tag + - BDC triggers the creation of the feature vectors + - Transform the data in the parquet file after it was stored in the events folder and store them in the feature folder with the same tag + - Use the data as an input for the model, which is triggered after the creation of the input, and store the results in the model folder +- Maybe the 3 stage file system as a part of the DB and hide the final decision behind the database abstraction layer (Simon)? + +## Control flow (Berkay) + +- Listener +- MessageQueue +- RoutingQueue + +The Listener, as the name suggests, listens for incoming messages from other components, such as BDC and EVP, and enqueues these messages in the messageQueue to be “read” and processed. If there are no incoming messages, it is in idle status. The messageQueue is where received messages are processed. After each message is processed by the messageQueue, it is enqueued in the routingQueue, to be routed to the corresponding component. Both messageQueue and routingQueue are idle if there are no elements in the queues. The whole concept of the Controller is multi-threaded and asynchronous: while it accepts new incoming messages, it processes messages and at the same time routes other messages. + +## AI + +> `expected value = life-time value of lead x probability of the lead becoming a customer` + +AI models are needed that solve a regression or probability problem + +### AI Models + +- Classification: + - Decision Trees + - Random Forest + - Neural Networks + - Naïve Bayes + +### What data do we need? + +- Classification: Labeled data +- Probability: Data with leads and customers + +### ML Pipeline + +1. Preprocessing +2. Feature selection +3. Dataset split / cross validation +4. Dimensional reduction +5. Training +6. Testing / Evaluation +7. Improve performance + - Batch Normalization + - Optimizer + - L1 / L2 regularization: reduces overfitting by regularizing the model + - Dropout (NN) + - Depth and width (NN) + - Initialization techniques (NN: Xavier and He) + - He: Layers with ReLU activation + - Xavier: Layers with sigmoid activation + +# Troubleshooting + +## Build + +### pipenv + +#### install stuck + +```[bash] +pipenv install --dev +``` + +**Solution**: Remove .lock file + restart PC + +### Docker + +#### VSCode + +Terminal can't run Docker image (on Windows) + +- **Solution**: workaround with Git Bash or with Ubuntu + +### Testing + +#### Reuse + +Don't analyze a certain part of the code with reuse. +**Solution**: + +```[bash] +# REUSE-IgnoreStart + ... +# REUSE-IgnoreEnd +``` + +#### Failed checks + +1. Go to the specific pull request or to [Actions](https://github.com/amosproj/amos2023ws06-sales-lead-qualifier/actions) +2. Click "show all checks" +3. Click "details" +4.
Click on the elements with the "red marks" + +## BDC + +### Google Places API + +Language is adjusted to the location from which the API is run + +- **Solution**: adjust the language feature, documentation in [Google Solution](https://developers.google.com/places/web-service/search#FindPlaceRequests) + +Google search results are based on the location from which the API is run + +- **Solution**: Pass a fixed point in the center of the country / city / area of the company (OSMNX) as a location bias, documentation in + [Google Solution](https://developers.google.com/places/web-service/search#FindPlaceRequests) + +## Branch-Management + +### Divergent branch + +Commits on local and remote are not the same + +- **Solution**: + 1. Pull remote changes + 2. Rebase the changes + 3. Solve any conflict during any commit you get from remote diff --git a/Documentation/User-Documentation.md b/Documentation/User-Documentation.md new file mode 100644 index 0000000..54bc3ac --- /dev/null +++ b/Documentation/User-Documentation.md @@ -0,0 +1,187 @@ + + +# Project vision + +This product will give our industry partner a tool at hand that can effectively +increase conversion of their leads to customers, primarily by providing the +sales team with valuable information. The modular architecture makes our product +future-proof, by making it easy to add further data sources, employ improved +prediction models or to adjust the output format if desired. + +# Project mission + +The mission of this project is to enrich historical data about customers and +recent data about leads (with information from external sources) and to leverage +the enriched data in machine learning, so that the estimated Merchant Size of +leads can be predicted. + +# Usage + +To execute the final program, ensure the environment is installed (refer to +Build-Documentation.md) and run `python .\src\main.py` either locally or via the +build process. The user will be presented with the following options: + +```[bash] +Choose demo: +(0) : Base Data Collector +(1) : Data preprocessing +(2) : ML model training +(3) : Merchant Size Predictor +(4) : Exit +``` + +## (0) : Base Data Collector + +This is the data enrichment pipeline, utilizing multiple data enrichment steps. +Configuration options are presented: + +`Do you want to list all available pipeline configs? (y/N)` If `y`: + +```[bash] +Please enter the index of requested pipeline config: +(0) : config_sprint09_release.json +(1) : just_run_search_offeneregister.json +(2) : run_all_steps.json +(3) : Exit +``` + +- (0) Configuration used in sprint 9. +- (1) Configuration for OffeneRegister. +- (2) Running all the steps of the pipeline without step selection. +- (3) Exit to the pipeline step selection. + +If `n`: proceed to pipeline step selection for data enrichment. Subsequent +questions arise: + +```[bash] +Run Scrape Address (will take a long time)(y/N)? +Run Search OffeneRegister (will take a long time)(y/N)? +Run Phone Number Validation (y/N)? +Run Google API (will use token and generate cost!)(y/N)? +Run Google API Detailed (will use token and generate cost!)(y/N)? +Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)? +Run openAI GPT Summarizer (will use token and generate cost!)(y/N)? +Run Smart Review Insights (will take looong time!)(y/N)? +Run Regionalatlas (y/N)? +``` + +- `Run Scrape Address (will take a long time)(y/N)?`: This enrichment step + scrapes the lead's website for an address using regex.
+- `Run Search OffeneRegister (will take a long time)(y/N)?`: This enrichment + step searches for company-related data using the OffeneRegisterAPI. +- `Run Phone Number Validation (y/N)?`: This enrichment step checks if the + provided phone numbers are valid and extracts geographical information using + geocoder. +- `Run Google API (will use token and generate cost!)(y/N)?`: This enrichment + step tries to find the correct business entry in the Google Maps database. It will + save basic information along with the place ID, which can be used to retrieve + further detailed information, and a confidence score that should indicate the + confidence in having found the correct result. +- `Run Google API Detailed (will use token and generate cost!)(y/N)?`: This + enrichment step tries to gather detailed information for a given Google + business entry, identified by the place ID. +- `Run openAI GPT Sentiment Analyzer (will use token and generate cost!)(y/N)?`: + This enrichment step performs sentiment analysis on reviews using the GPT-4 model. +- `Run openAI GPT Summarizer (will use token and generate cost!)(y/N)?`: This + enrichment step attempts to download a business's website in raw HTML format + and pass this information to OpenAI's GPT, which will then attempt to summarize + the raw contents and extract valuable information for a salesperson. +- `Run Smart Review Insights (will take looong time!)(y/N)?`: This enrichment + step computes additional review insights, such as grammatical scores, rating + polarization and rating trends. +- `Run Regionalatlas (y/N)?`: This enrichment step will query the RegionalAtlas + database for location based geographic and demographic information, based on + the address that was found for a business through Google API. + +It is emphasized that some steps are dependent on others, and excluding one +might result in dependency issues for subsequent steps. + +After selecting the desired enrichment steps, a prompt asks the user to +`Set limit for data points to be processed (0=No limit)` such that the user +chooses whether to apply the data enrichment steps to all the leads (no limit) +or only to a certain number of leads. + +**Note**: In case `DATABASE_TYPE="S3"` in your `.env` file, the limit will be +removed, in order to enrich all the data into the `s3://amos--data--events` S3 +bucket. + +## (1) : Data preprocessing + +Post data enrichment, preprocessing is crucial for machine learning models, +involving scaling, numerical outlier removal, and categorical one-hot encoding. +The user is prompted with questions: + +`Filter out the API-irrelevant data? (y/n)`: This will filter out all the leads +that couldn't be enriched during the data enrichment steps; removing them is +useful for the machine learning algorithms, to avoid introducing any bias, +even if we pad the features with zeros. +`Run on historical data ? (y/n) Note: DATABASE_TYPE should be S3!`: The user has +to have `DATABASE_TYPE="S3"` in the `.env` file in order to run on historical data; +otherwise, it will run locally. After preprocessing, the log will show where the +preprocessed_data is stored. + +## (2) : ML model training + +Six machine learning models are available: + +```[bash] +(0) : Random Forest +(1) : XGBoost +(2) : Naive Bayes +(3) : KNN Classifier +(4) : AdaBoost +(5) : LightGBM +``` + +After selecting the desired machine learning model, the user would be +prompted with a series of questions: + +- `Load model from file? (y/N)` : In case of `y`, the program will ask for a + file location of a previously saved model to use for predictions and testing.
+- `Use 3 classes ({XS}, {S, M, L}, {XL}) instead of 5 classes ({XS}, {S}, {M}, {L}, {XL})? (y/N)`: + In case of `y`, the S, M, L labels of the data would be grouped altogether as + one class such that the training would be on 3 classes ({XS}, {S, M, L}, {XL}) + instead of the 5 classes. It is worth noting that grouping the S, M and L + classes altogether as one class resulted in boosting the classification + performance. +- ```[bash] + Do you want to train on a subset of features? + (0) : ['Include all features'] + (1) : ['google_places_rating', 'google_places_user_ratings_total', 'google_places_confidence', 'regional_atlas_regional_score'] + ``` + +`0` would include all the numerical and categorical one-hot encoded features, +while `1` would choose a small subset of data as features for the machine +learning models. + +Then, the user would be given multiple options: + +```[bash] +(1) Train +(2) Test +(3) Predict on single lead +(4) Save model +(5) Exit +``` + +- (1): Train the current model on the current training dataset. +- (2): Test the current model on the test dataset, displaying the mean squared + error. +- (3): Choose a single lead from the test dataset and display the prediction and + true label. +- (4): Save the current model to `amos--models/models` on S3 in case of + `DATABASE_TYPE=S3`, otherwise it will save it locally. +- (5): Exit the EVP submenu. + +## (3) : Merchant Size Predictor + +After training, testing, and saving a model, its true value lies in the crucial +task of generating predictions for previously unseen leads, which is what this +option does. + +## (4) : Exit + +Gracefully exit the program. diff --git a/Documentation/data-fields.csv b/Documentation/data-fields.csv new file mode 100644 index 0000000..892206d --- /dev/null +++ b/Documentation/data-fields.csv @@ -0,0 +1,58 @@ +Field Name,Type,Description,Data source,Dependencies,Example +Last Name,string,Last name of the lead,Lead data,-,Mustermann +First Name,string,First name of the lead,Lead data,-,Mustername +Company / Account,string,Company name of the lead,Lead data,-,Mustercompany +Phone,string,Phone number of the lead,Lead data,-,49 1234 56789 +Email,string,Email of the lead,Lead data,-,musteremail@example.com +domain,string,"The domain of the email is the part that follows the ""@"" symbol, indicating the organization or service hosting the email address.",processing,Email,example.com +email_valid,boolean,Checks if the email is valid.,email_validator package,Email,True/False +first_name_in_account,boolean,Checks if first name is written in "Account" input,processing,First Name,True/False +last_name_in_account,boolean,Checks if last name is written in "Account" input,processing,Last Name,True/False +number_formatted,string,Phone number (formatted),phonenumbers package,Phone,49123456789 +number_country,string,Country derived from phone number,phonenumbers package,Phone,Germany +number_area,string,Area derived from phone number,phonenumbers package,Phone,Erlangen +number_valid,boolean,Indicator whether a phone number is valid,phonenumbers package,Phone,True/False +number_possible,boolean,Indicator whether a phone number is possible,phonenumbers package,Phone,True/False +google_places_place_id,string,Place ID used by Google,Google Places API,Company / Account,- +google_places_business_status,string,Business Status,Google Places API,Company / Account,Operational +google_places_formatted_address,string,Formatted address,Google Places API,Company / Account,Musterstr.1 +google_places_name,string,Business Name,Google Places API,Company / Account,Mustername
+google_places_user_ratings_total,integer,Total number of ratings,Google Places API,Company / Account,100 +google_places_rating,float,Average star rating,Google Places API,Company / Account,4.5 +google_places_price_level,float,Price level (1-3),Google Places API,Company / Account,- +google_places_candidate_count_mail,integer,Number of results from E-Mail based search,Google Places API,Company / Account,1 +google_places_candidate_count_phone,integer,Number of results from Phone based search,Google Places API,Company / Account,1 +google_places_place_id_matches_phone_search,boolean,Indicator whether phone based and E-Mail based search gave the same result,Google Places API,Company / Account,True/False +google_places_confidence,float,Indicator of confidence in the Google result,processing,,0.9 +google_places_detailed_website,string,Link to business website,Google Places API,Company / Account,www.musterwebsite.de +google_places_detailed_type,list,Type of business,Google Places API,Company / Account,"[""florist"", ""store""]" +reviews_sentiment_score,float,Sentiment score between -1 and 1 for the reviews,GPT,Google reviews,0.9 +regional_atlas_pop_density,float,Population density,Regional Atlas,google_places_formatted_address,2649.6 +regional_atlas_pop_development,float,Population development,Regional Atlas,google_places_formatted_address,-96.5 +regional_atlas_age_0,float,Age group,Regional Atlas,google_places_formatted_address,16.3 +regional_atlas_age_1,float,Age group,Regional Atlas,google_places_formatted_address,8.2 +regional_atlas_age_2,float,Age group,Regional Atlas,google_places_formatted_address,31.1 +regional_atlas_age_3,float,Age group,Regional Atlas,google_places_formatted_address,26.8 +regional_atlas_age_4,float,Age group,Regional Atlas,google_places_formatted_address,17.7 +regional_atlas_pop_avg_age,float,Average population age,Regional Atlas,google_places_formatted_address,42.1 +regional_atlas_per_service_sector,float,-,Regional Atlas,google_places_formatted_address,88.4 +regional_atlas_per_trade,float,-,Regional Atlas,google_places_formatted_address,28.9 +regional_atlas_employment_rate,float,Employment rate,Regional Atlas,google_places_formatted_address,59.9 +regional_atlas_unemployment_rate,float,Unemployment rate,Regional Atlas,google_places_formatted_address,6.4 +regional_atlas_per_long_term_unemployment,float,Long term unemployment,Regional Atlas,google_places_formatted_address,49.9 +regional_atlas_investments_p_employee,float,Investments per employee,Regional Atlas,google_places_formatted_address,6.8 +regional_atlas_gross_salary_p_employee,float,Gross salary per employee,Regional Atlas,google_places_formatted_address,63.9 +regional_atlas_disp_income_p_inhabitant,float,Income per inhabitant,Regional Atlas,google_places_formatted_address,23703 +regional_atlas_tot_income_p_taxpayer,float,Income per taxpayer,Regional Atlas,google_places_formatted_address,45.2 +regional_atlas_gdp_p_employee,float,GDP per employee,Regional Atlas,google_places_formatted_address,84983 +regional_atlas_gdp_development,float,GDP development,Regional Atlas,google_places_formatted_address,5.2 +regional_atlas_gdp_p_inhabitant,float,GDP per inhabitant,Regional Atlas,google_places_formatted_address,61845 +regional_atlas_gdp_p_workhours,float,GDP per workhours,Regional Atlas,google_places_formatted_address,60.7 +regional_atlas_pop_avg_age_zensus,float,Average population age (from zensus),Regional Atlas,google_places_formatted_address,41.3
+regional_atlas_regional_score,float,Regional score,Regional Atlas,google_places_formatted_address,3761.93 +review_avg_grammatical_score,float,Average grammatical score of reviews,processing,google_places_place_id,0.56 +review_polarization_type,string,Polarization type of review ratings,processing,google_places_place_id,High-Rating Dominance +review_polarization_score,float,Polarization score of review ratings,processing,google_places_place_id,1 +review_highest_rating_ratio,float,Ratio of the highest review ratings,processing,google_places_place_id,1 +review_lowest_rating_ratio,float,Ratio of the lowest review ratings,processing,google_places_place_id,0 +review_rating_trend,float,Value indicating the trend of ratings,processing,google_places_place_id,0 diff --git a/Documentation/data_fields.csv.license b/Documentation/data-fields.csv.license similarity index 63% rename from Documentation/data_fields.csv.license rename to Documentation/data-fields.csv.license index 2b087bd..8f96900 100644 --- a/Documentation/data_fields.csv.license +++ b/Documentation/data-fields.csv.license @@ -1,2 +1,3 @@ # SPDX-License-Identifier: MIT # SPDX-FileCopyrightText: 2023 Lucca Baumgärtner +# SPDX-FileCopyrightText: 2024 Ahmed Sheta diff --git a/Documentation/data_fields.csv b/Documentation/data_fields.csv deleted file mode 100644 index fe475c6..0000000 --- a/Documentation/data_fields.csv +++ /dev/null @@ -1,57 +0,0 @@ -column_id,description,source -domain,Custom domain (if any),calculated -email_valid,Indicator if E-Mail is valid,calculated -first_name_in_account,Indicator if first name is part of the E-Mail Account,calculated -last_name_in_account,Indicator if last name is part of the E-Mail Account,calculated -email,Normalized version of the E-Mail,calculated -category,, -reviews_sentiment_score,Sentiment score between -1 and 1 for the reviews,GPT/calculated -sales_person_summary,A summary of the website to support a salesperson in their call,GPT -google_places_place_id,Place ID used by Google,Google Places API -google_places_business_status,Business Status,Google Places API -google_places_formatted_address,Formatted address,Google Places API -google_places_name,Business Name,Google Places API -google_places_user_ratings_total,Total number of ratings,Google Places API -google_places_rating,Average star rating,Google Places API -google_places_price_level,Price level (1-3),Google Places API -google_places_candidate_count_mail,Number of results from E-Mail based search,calculated -google_places_candidate_count_phone,Number of results from Phone based search,calculated -google_places_place_id_matches_phone_search,Indicator weather phone based and E-Mail based search gave the same result,calculated -google_places_confidence,Indicator of confidence in the Google result,calculated -google_places_detailed_website,Link to business website,Google Places (detailed) API -google_places_detailed_type,Type of business,Google Places (detailed) API -number_formatted,Phone number (formatted),calculated -number_country,Country derived from phone number,calculated -number_area,Area derived from phone number,calculated -number_valid,Indicator weather a phone number is valid,calculated -number_possible,Indicator weather a phone number is possible,calculated -regional_atlas_pop_density,Population density,Regional Atlas -regional_atlas_pop_development,Population development,Regional Atlas -regional_atlas_age_0,,Regional Atlas -regional_atlas_age_1,,Regional Atlas -regional_atlas_age_2,,Regional Atlas
-regional_atlas_age_3,,Regional Atlas -regional_atlas_age_4,,Regional Atlas -regional_atlas_pop_avg_age,Average population age,Regional Atlas -regional_atlas_per_service_sector,,Regional Atlas -regional_atlas_per_trade,,Regional Atlas -regional_atlas_employment_rate,Employment rate,Regional Atlas -regional_atlas_unemployment_rate,Unemployment rate,Regional Atlas -regional_atlas_per_long_term_unemployment,Long term unemployment,Regional Atlas -regional_atlas_investments_p_employee,Investments per employee,Regional Atlas -regional_atlas_gross_salary_p_employee,Gross salary per employee,Regional Atlas -regional_atlas_disp_income_p_inhabitant,Income per inhabitant,Regional Atlas -regional_atlas_tot_income_p_taxpayer,Income per taxpayer,Regional Atlas -regional_atlas_gdp_p_employee,GDP per employee,Regional Atlas -regional_atlas_gdp_development,GDP development,Regional Atlas -regional_atlas_gdp_p_inhabitant,GDP per inhabitant,Regional Atlas -regional_atlas_gdp_p_workhours,GDP per workhours,Regional Atlas -regional_atlas_pop_avg_age_zensus,Average population age (from zensus),Regional Atlas -regional_atlas_regional_score,Regional score,calculated -address_ver_1,?,? -review_avg_grammatical_score,Average grammatical score of reviews,calculated -review_polarization_type,Polarization type of review ratings,calculated -review_polarization_score,Polarization score of review ratings ,calculated -review_highest_rating_ratio,Ratio of the highest review ratings,calculated -review_lowest_rating_ratio,Ratio of the lowest review ratings,calculated -review_rating_trend,Value indicating the trend of ratings,calculated