Commit 910141c — updated user documentation
Signed-off-by: Ahmed Sheta <[email protected]>
ultiwinter7 committed Feb 2, 2024 (parent c2f0f3a)
File: Deliverables/sprint-13/user-documentation.md
To execute the final program, ensure the environment is installed (refer to the build documentation). The following demo menu is presented:

```[bash]
Choose demo:
(0) : Base Data Collector
(1) : Data preprocessing
(2) : Estimated Value Predictor
(3) : Merchant Size Prediction
(4) : Exit
```
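A menu like the one above can be driven by a small prompt-and-dispatch loop. The sketch below is a hypothetical illustration of that pattern, not the project's actual code; the function and option names are assumptions:

```python
def run_menu(options, read_choice=input):
    """Print numbered options and return the index the user chooses."""
    for idx, label in enumerate(options):
        print(f"({idx}) : {label}")
    while True:
        raw = read_choice("Choose demo: ")
        # Accept only a digit that maps to a listed option.
        if raw.isdigit() and int(raw) < len(options):
            return int(raw)
        print("Invalid choice, try again.")

OPTIONS = [
    "Base Data Collector",
    "Data preprocessing",
    "Estimated Value Predictor",
    "Merchant Size Prediction",
    "Exit",
]
```

Each returned index would then dispatch to the corresponding demo described in the sections below.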

## (0) : Base Data Collector

This is the data enrichment pipeline, utilizing multiple data enrichment steps. Configuration options are presented:

Expand All @@ -106,7 +49,6 @@ If `n`: proceed to pipeline step selection for data enrichment. Subsequent quest
```[bash]
Run Scrape Address (will take a long time)(y/N)?
Run Search OffeneRegister (will take a long time)(y/N)?
Run Phone Number Validation (y/N)?
Run Google API (will use token and generate cost!)(y/N)?
Run Google API Detailed (will use token and generate cost!)(y/N)?
Expand All @@ -118,7 +60,6 @@ Run Regionalatlas (y/N)?

- `Run Scrape Address (will take a long time)(y/N)?`: This enrichment step scrapes the lead's website for an address using regular expressions.
- `Run Search OffeneRegister (will take a long time)(y/N)?`: This enrichment step searches for company-related data using the OffeneRegister API.
- `Run Phone Number Validation (y/N)?`: This enrichment step checks whether the provided phone numbers are valid and extracts geographical information using geocoder.
- `Run Google API (will use token and generate cost!)(y/N)?`: This enrichment step tries to find the correct business entry in the Google Maps database. It saves basic information along with the place ID, which can be used to retrieve further detailed information, and a confidence score indicating how likely it is that the correct entry was found.
- `Run Google API Detailed (will use token and generate cost!)(y/N)?`: This enrichment step tries to gather detailed information for a given Google business entry, identified by the place ID.
After selecting the desired enrichment steps, a prompt asks the user to `Set li…`

**Note**: If `DATABASE_TYPE="S3"` is set in your `.env` file, the limit is removed so that all the data is enriched into the `s3://amos--data--events` S3 bucket.
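The address-scraping step described above relies on regular expressions. As a rough illustration only (the pattern below is a simplified assumption, not the pipeline's actual regex), a German-style street address can be pulled from page text like this:

```python
import re

# Simplified pattern for German-style addresses:
# "<Street name> <house number>, <5-digit postal code> <city>"
ADDRESS_RE = re.compile(
    r"([A-ZÄÖÜ][\wäöüß.\- ]+?\s\d+[a-z]?),?\s+(\d{5})\s+([A-ZÄÖÜ][\wäöüß\- ]+)"
)

def scrape_address(page_text):
    """Return (street, postal_code, city) or None if nothing matches."""
    match = ADDRESS_RE.search(page_text)
    return match.groups() if match else None

scrape_address("Kontakt: Musterstraße 12, 91058 Erlangen")
# → ("Musterstraße 12", "91058", "Erlangen")
```

A production scraper would need to handle many more address formats and malformed HTML than this sketch does.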

## (1) : Data preprocessing

After data enrichment, preprocessing prepares the data for the machine learning models: numerical features are scaled and cleared of outliers, and categorical features are one-hot encoded. The user is prompted with questions:

`Filter out the API-irrelevant data? (y/n)`: This filters out all leads that could not be enriched during the data enrichment steps. Removing them helps the machine learning algorithms avoid the bias that would be introduced even if the missing features were padded with zeros.
`Run on historical data ? (y/n) Note: DATABASE_TYPE should be S3!`: The user must have `DATABASE_TYPE="S3"` in the `.env` file in order to run on historical data; otherwise, it runs locally. After preprocessing, the log shows where the preprocessed data is stored.
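The three preprocessing operations mentioned above can be sketched in plain Python. This is a simplified illustration of the ideas, not the project's implementation (which presumably uses standard libraries such as pandas or scikit-learn):

```python
import statistics

def remove_outliers(values, z_max=3.0):
    """Drop numerical values more than z_max standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0  # avoid division by zero
    return [v for v in values if abs(v - mean) / stdev <= z_max]

def min_max_scale(values):
    """Scale numerical values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def one_hot(value, categories):
    """One-hot encode a categorical value against a fixed category list."""
    return [1 if value == c else 0 for c in categories]

one_hot("M", ["XS", "S", "M", "L", "XL"])  # → [0, 0, 1, 0, 0]
```

Real pipelines typically fit these transformations on the training split only and reuse the fitted parameters on the test split.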

## (2) : Estimated Value Predictor

Six machine learning models are available:

```[bash]
(0) : Random Forest
(1) : XGBoost
(2) : Naive Bayes
(3) : KNN Classifier
(4) : AdaBoost
(5) : LightGBM
```

After selecting the desired machine learning model, the user is prompted with a series of questions:

- `Load model from file? (y/N)` : In case of `y`, the program will ask for a file location of a previously saved model to use for predictions and testing.
- `Use 3 classes ({XS}, {S, M, L}, {XL}) instead of 5 classes ({XS}, {S}, {M}, {L}, {XL})? (y/N)`: In case of `y`, the S, M, and L labels of the data are grouped together as one class, so that training uses 3 classes ({XS}, {S, M, L}, {XL}) instead of 5. It is worth noting that grouping S, M, and L into one class was found to boost classification performance.
- ```[bash]
Do you want to train on a subset of features?
(0) : ['Include all features']
(1) : ['google_places_rating', 'google_places_user_ratings_total', 'google_places_confidence', 'regional_atlas_regional_score']
```

`0` includes all the numerical and categorical one-hot encoded features, while `1` selects a small subset of the data as features for the machine learning models.
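The 3-class grouping offered above amounts to a simple label mapping. A minimal sketch, assuming string size labels (the merged-class name `"S-M-L"` is an assumption for illustration):

```python
# Collapse the five merchant-size labels into three classes:
# {XS}, {S, M, L}, {XL}.
THREE_CLASS_MAP = {"XS": "XS", "S": "S-M-L", "M": "S-M-L", "L": "S-M-L", "XL": "XL"}

def to_three_classes(labels):
    """Map 5-class size labels onto the coarser 3-class scheme."""
    return [THREE_CLASS_MAP[lbl] for lbl in labels]

to_three_classes(["XS", "M", "XL"])  # → ["XS", "S-M-L", "XL"]
```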

Then, the user would be given multiple options:

```[bash]
(1) Train
(2) Test
(3) Predict on single lead
(4) Save model
(5) Exit
```

- (1): Train the current model on the current training dataset.
- (2): Test the current model on the test dataset, displaying the mean squared error.
- (3): Choose a single lead from the test dataset and display the prediction and true label.
- (4): Save the current model to `amos--models/models` on S3 if `DATABASE_TYPE=S3`; otherwise, save it locally.
- (5): Exit the EVP submenu.
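Saving a model (option 4) and later answering `y` to `Load model from file?` implies serializing the fitted model to disk. One common way to do this in Python is `pickle`; the sketch below is illustrative only (the project's actual persistence and S3-upload code is not shown here, and the file name is an assumption):

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize a fitted model object to the given file path."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

def load_model(path):
    """Deserialize a previously saved model object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Round-trip demo with a stand-in for a fitted model:
model = {"name": "RandomForest", "n_estimators": 100}
path = os.path.join(tempfile.gettempdir(), "evp_model.pkl")
save_model(model, path)
assert load_model(path) == model
```

When `DATABASE_TYPE=S3`, the serialized file would additionally be uploaded to the configured bucket rather than kept locally.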

## (3) : Merchant Size Prediction

After training, testing, and saving the model, its true purpose is fulfilled in the crucial task of generating predictions for previously unseen leads.

## (4) : Exit

Gracefully exit the program.
