Final sprint release #238

Merged
merged 66 commits into from
Feb 7, 2024
66 commits
7a1c360
Removed all unused or unnecessary code
felix-zailskas Jan 30, 2024
ed4b0b5
removed models folder as models are stored elsewhere now
felix-zailskas Jan 30, 2024
01da8ff
Merge branch 'dev' of github.com:amosproj/amos2023ws06-sales-lead-qua…
felix-zailskas Feb 1, 2024
3c02157
Added documentation on Cotroller and possible prediction improvement
felix-zailskas Feb 1, 2024
06330b0
Removed docker support
felix-zailskas Feb 1, 2024
91e4090
Cleanup Pipfile and remove unsused imports
felix-zailskas Feb 1, 2024
bd398df
added functionality of applying ml model on lead data
ultiwinter7 Feb 2, 2024
4ee3015
documentation
ultiwinter7 Feb 2, 2024
4a92080
added functionality to export the predictions and to update the prepr…
ultiwinter7 Feb 2, 2024
f2c2286
added MerchantSize Prediction to demos
ultiwinter7 Feb 2, 2024
9324ece
Merge pull request #230 from amosproj/refactor/remove-unused-code
felix-zailskas Feb 2, 2024
92d64d7
Merge branch 'dev' into feature/applying-ml-model-test
ultiwinter Feb 2, 2024
98d315e
modified the merge conflicts and removed the applying_ml.py
ultiwinter7 Feb 2, 2024
558296f
Merge pull request #231 from amosproj/feature/applying-ml-model-test
ultiwinter Feb 2, 2024
4d99756
inital commit to fix the file names bug
ultiwinter7 Feb 3, 2024
8debd4a
documentation.yml created
rbbozkurt Feb 3, 2024
4a25985
pip install fixed
rbbozkurt Feb 3, 2024
c746b94
path fixed
rbbozkurt Feb 3, 2024
f8ff032
path fixed
rbbozkurt Feb 3, 2024
1fadef1
path fixed
rbbozkurt Feb 3, 2024
0cd4e45
install all dependencies
rbbozkurt Feb 3, 2024
f23de91
workflow set to main
rbbozkurt Feb 3, 2024
93ba51c
fixed bug and now the pipeline can run locally
ultiwinter7 Feb 3, 2024
46cf065
sphinx author name changed
rbbozkurt Feb 3, 2024
d919567
demo/__init__.py failing imports are added
rbbozkurt Feb 3, 2024
ce6bf6a
README.md typo fixed and included to sphinx
rbbozkurt Feb 3, 2024
65f68a8
licenses added to readme linker
rbbozkurt Feb 3, 2024
960fdab
sphinx moved under dev-packages
rbbozkurt Feb 3, 2024
524b72e
Merge pull request #233 from amosproj/feature/pdoc
rbbozkurt Feb 3, 2024
94dccd5
Merge branch 'dev' into bugfix/datanames-ahmed
ultiwinter7 Feb 3, 2024
aa8ced8
updated pipfile, removed debugging prints, added logs
ultiwinter7 Feb 3, 2024
98dc73f
modified such hat models can be loaded from local path and applied in…
ultiwinter7 Feb 4, 2024
af8a7e2
modifications after review
ultiwinter7 Feb 4, 2024
b27c0b5
quick fix: make sure db type is respected in preprocessing step
luccalb Feb 5, 2024
5dad860
add folder for models
luccalb Feb 5, 2024
c718b1b
return to menu if invalid model name is given
luccalb Feb 5, 2024
d842143
Merge pull request #236 from amosproj/bugfix/datanames-ahmed
ultiwinter Feb 5, 2024
64a23ba
Added test suite for AnalyzeEmails and HashGenerator steps
felix-zailskas Feb 5, 2024
108a57b
Added simple execution test
felix-zailskas Feb 5, 2024
d2fb7ec
Merge branch 'main' of github.com:amosproj/amos2023ws06-sales-lead-qu…
felix-zailskas Feb 5, 2024
6dcd084
Added test case for phone numbers
felix-zailskas Feb 6, 2024
9834ce3
Tests for pipeline utils
felix-zailskas Feb 6, 2024
a38dc94
Merge branch 'dev' of github.com:amosproj/amos2023ws06-sales-lead-qua…
felix-zailskas Feb 6, 2024
55d71fe
Merge pull request #240 from amosproj/feature/229-more-test-cases
felix-zailskas Feb 6, 2024
ddc1eb5
Bugfix/235 all steps bdc errors (#239)
luccalb Feb 6, 2024
6c55b48
removed the redundant S3 question prompt from data preprocessing
ultiwinter7 Feb 6, 2024
6aac331
Merge pull request #241 from amosproj/bugfix/S3_prompt_redundancy
ultiwinter Feb 6, 2024
7a50187
fix minor error in regionalatlas step + moved scrape address step to …
luccalb Feb 5, 2024
1db8d9f
undo change to input filename
luccalb Feb 5, 2024
7f31452
Adjust gpt caching error logging
luccalb Feb 5, 2024
2e8eb12
change input file location back to original
luccalb Feb 5, 2024
3f7b1b4
Fix get_multiple_choice in main.py
Tims777 Feb 1, 2024
ea58ace
Rename menu options in main.py
Tims777 Feb 6, 2024
68156cc
Refactor handling of local and S3 file paths
Tims777 Feb 6, 2024
6a58037
Add the SBOM generator as a markdown file (shellscript) and my featur…
ur-tech Feb 6, 2024
eb39e24
supress warnings, fix preprocessing path for historical df
luccalb Feb 6, 2024
c86d898
Merge pull request #242 from amosproj/feature_analysis_notebook
ur-tech Feb 6, 2024
0c3b369
Update fabian_feature_analysis.ipynb
ur-tech Feb 6, 2024
65e1780
Add demo_pipeline.json
Tims777 Feb 6, 2024
e92a57b
Update fabian_feature_analysis.ipynb
ur-tech Feb 6, 2024
c05ad76
Needed update to portray result
ur-tech Feb 6, 2024
9a3120b
Merge pull request #243 from amosproj/update-notebook
ur-tech Feb 7, 2024
2b9f289
adjust log levels
luccalb Feb 7, 2024
cab53ad
Add report generation file to the deprecated steps
ur-tech Feb 7, 2024
7ae6adf
Merge pull request #245 from amosproj/feature/reports
ur-tech Feb 7, 2024
4a6ca72
Merge pull request #244 from amosproj/demo
ultiwinter Feb 7, 2024
10 changes: 2 additions & 8 deletions .env.template
@@ -9,15 +9,9 @@
GOOGLE_PLACES_API_KEY=
OPEN_AI_API_KEY=

DB_USER=
DB_PASSWORD=
DB_CONNECTION=

FACEBOOK_APP_ID=
FACEBOOK_APP_SECRET=
OPEN_AI_API_KEY=

# Need to be set when 'DATABASE_TYPE' is 'S3'
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

# Choose between 'Local' and 'S3'
DATABASE_TYPE=
39 changes: 39 additions & 0 deletions .github/workflows/documentation.yml
@@ -0,0 +1,39 @@
# SPDX-License-Identifier: MIT
# SPDX-FileCopyrightText: 2023 Berkay Bozkurt <[email protected]>

name: documentation

on: [push, pull_request, workflow_dispatch]

permissions:
  contents: write

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install pipenv
          # if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
          pipenv install --dev
      - name: Generate Sphinx
        run: |
          cd src/docs
          pipenv run sphinx-apidoc -o . ..
          pipenv run make clean
          pipenv run make html
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
        with:
          publish_branch: gh-pages
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: src/docs/_build/html/
          force_orphan: true
13 changes: 12 additions & 1 deletion .gitignore
@@ -53,9 +53,17 @@ bin/
!**/data/merged_geo.geojson
**/data/reviews/*.json
**/data/gpt-results/*.json
**/data/models/*
**/data/models/*.pkl
**/data/models/*.joblib
**/data/classification_reports/*

**/docs/*
!**/docs/conf.py
!**/docs/index.rst
!**/docs/make.bat
!**/docs/Makefile
!**/docs/readme_link.md

# Env files
*.env

@@ -70,3 +78,6 @@ report.pdf
**/cache/*

!.gitkeep

# testing
.coverage
16 changes: 0 additions & 16 deletions Dockerfile

This file was deleted.

61 changes: 61 additions & 0 deletions Documentation/SBOM_generator.md
@@ -0,0 +1,61 @@
# Automatic SBOM generation

```console
pipenv install
pipenv shell

pip install pipreqs
pip install cyclonedx-bom
pip install pip-licenses

# Create the SBOM (cyclonedx-bom) based on (pipreqs) requirements that are actually imported in the .py files

$sbom = pipreqs --print | cyclonedx-py -r -pb -o - -i -

# Create an XmlDocument object
$xml = New-Object System.Xml.XmlDocument

# Load XML content into the XmlDocument
$xml.LoadXml($sbom)


# Path of the CSV file that will hold the SBOM
$csvPath = "SBOM.csv"

# Initialize an empty array to store rows
$result = @()

# Iterate through the XML nodes and create rows for each node
$xml.SelectNodes("//*[local-name()='component']") | ForEach-Object {

    $row = @{
        "Version" = $_.Version
        "Context" = $_.Purl
        "Name" = if ($_.Name -eq 'scikit_learn') { 'scikit-learn' } else { $_.Name }
    }

    # Get license information
    $match = pip-licenses --from=mixed --format=csv --with-system --packages $row.Name | ConvertFrom-Csv

    # Add license information to the row
    $result += [PSCustomObject]@{
        "Context" = $row.Context
        "Name" = $row.Name
        "Version" = $row.Version
        "License" = $match.License
    }
}

# Export the data to the CSV file
$result | Export-Csv -Path $csvPath -NoTypeInformation

# Create the license file
$licensePath = $csvPath + '.license'
@"
SPDX-License-Identifier: CC-BY-4.0
SPDX-FileCopyrightText: 2023 Fabian-Paul Utech <[email protected]>
"@ | Out-File -FilePath $licensePath

exit

```
41 changes: 41 additions & 0 deletions Documentation/ideas.md
@@ -0,0 +1,41 @@
<!--
SPDX-License-Identifier: MIT
SPDX-FileCopyrightText: 2024 Felix Zailskas <[email protected]>
-->

# Unused Ideas

This document lists ideas and implementations that have either not been tried yet or have been deprecated because they are not used in the current product version, but that still carry some conceptual value.

## Deprecated

The original implementation of the deprecated modules can be found in the `deprecated/` directory.

### Controller

**_Note:_** This package has the additional dependency `pydantic==2.4.2`

The controller module was originally planned as a communication layer between the EVP and the BDC. Whenever the salesperson interface registered a new lead, the controller was supposed to trigger the BDC pipeline to enrich that lead's data and preprocess it into a feature vector. The successful completion of the BDC pipeline would then be registered at the controller, which would trigger an EVP inference to compute the predicted merchant size and write it back to the lead data. The computed merchant size could then be used to rank the leads, helping the salesperson judge their value and decide which one to call.

The current implementation of the module supports queueing messages from the BDC and EVP, as indicated by their type. Depending on its type, each message is then routed to the corresponding module (EVP or BDC). The actual processing of the messages by the modules is not implemented. All of this is done asynchronously using the Python `threading` library.
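
The queue-and-route behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in `deprecated/`: the names `Target`, `Message`, and `Controller` are hypothetical, and routed messages are only collected in lists because the actual processing is not implemented.

```python
import queue
import threading
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    # Hypothetical message types; the deprecated module may name them differently.
    BDC = "bdc"
    EVP = "evp"


@dataclass
class Message:
    target: Target
    lead_id: int


class Controller:
    """Queues incoming messages and routes them by type on a worker thread."""

    def __init__(self):
        self._queue = queue.Queue()
        # Stand-ins for the modules: routed messages are only collected here,
        # mirroring the note that actual processing is not implemented.
        self.routed = {Target.BDC: [], Target.EVP: []}
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, message):
        self._queue.put(message)

    def stop(self):
        self._queue.put(None)  # sentinel: finish queued work, then exit
        self._worker.join()

    def _run(self):
        while True:
            message = self._queue.get()
            if message is None:
                break
            self.routed[message.target].append(message)


controller = Controller()
controller.send(Message(Target.BDC, lead_id=1))  # new lead: enrich first
controller.send(Message(Target.EVP, lead_id=1))  # then request a prediction
controller.stop()
print({target.name: len(msgs) for target, msgs in controller.routed.items()})
```

A single worker thread with a FIFO queue keeps the routing order deterministic; a real controller would replace the lists with calls into the BDC pipeline and the EVP.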

### FacebookGraphAPI

**_Note:_** This package has the additional dependency `facebook-sdk==3.1.0`. In addition, the environment variables `FACEBOOK_APP_ID` and `FACEBOOK_APP_SECRET` need to be set with a valid token.

This step was supposed to query lead data from Facebook using either the business owner's name or the company name. The attempt was deprecated because the cost of the required API token was judged too high and because the usage permissions of the Facebook API were changed. Furthermore, it is paramount to check the legal ramifications of querying Facebook for this kind of data, as searching for individuals on Facebook instead of their businesses might have legal consequences under data privacy regulations in the EU.

### ScrapeAddresses

This step was an early experiment that used only the custom domain from an email address. It checks whether a live website is running
for the domain and then tries to parse the main page for a business address using a RegEx pattern. The pattern is not very precise,
and requesting and parsing the website takes considerable time, which adds up over many entries. The Google Places
step yields better results for the business address and is faster, which is why `scrape_addresses.py` was deprecated.
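
The two offline parts of this idea, extracting the custom domain and matching an address in page text, might look roughly like the sketch below. The free-mail list, the function names, and the pattern are illustrative assumptions and are not taken from `scrape_addresses.py`; the HTTP request to the website is left out.

```python
import re

# Free-mail providers are not custom domains; this short list is illustrative only.
FREE_MAIL_DOMAINS = {"gmail.com", "gmx.de", "web.de", "yahoo.com", "outlook.com"}

# Deliberately simple pattern for German-style addresses such as
# "Musterstraße 12, 91054 Erlangen". It is an assumption for this sketch,
# not the pattern used in the deprecated step.
ADDRESS_PATTERN = re.compile(
    r"([A-ZÄÖÜ][\wäöüß.\- ]*?(?:straße|strasse|str\.|weg|platz|allee)\s+\d+[a-zA-Z]?)"
    r",?\s+(\d{5})\s+([A-ZÄÖÜ][\wäöüß\-]+)"
)


def custom_domain(email):
    """Return the domain of an email address, or None for free-mail providers."""
    _, separator, domain = email.strip().lower().rpartition("@")
    if not separator or not domain or domain in FREE_MAIL_DOMAINS:
        return None
    return domain


def find_address(page_text):
    """Return the first street/postcode/city match in the text, or None."""
    match = ADDRESS_PATTERN.search(page_text)
    if match is None:
        return None
    street, postcode, city = match.groups()
    return {"street": street, "postcode": postcode, "city": city}


print(custom_domain("info@example-bakery.de"))
print(find_address("Kontakt: Musterstraße 12, 91054 Erlangen"))
```

Even this small pattern shows the weakness noted above: any street suffix or address layout it does not anticipate is silently missed.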

## Possible ML improvements

### Creating data subsets

The data collected by the BDC pipeline has not been refined to include only semantically valuable data fields. It is possible that some data fields have no predictive power, which would mean they pollute the dataset with unnecessary information. A proper analysis of the predictive power of all data fields would allow cutting down the amount of data for each lead, reducing processing time and possibly making predictions more precise. This approach has been explored very briefly with subset 1, as described in `Classifier-Comparison.md`. However, the choice of included features has not been justified by experiments, which makes it somewhat arbitrary. Additionally, an analysis of this type could give insights into which data fields to expand on and what new data one might want to collect to increase the EVP's performance in predicting merchant sizes.
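
One possible starting point for such an analysis is to rank fields by their mutual information with the merchant size label. The sketch below is only an illustration with the standard library: mutual information is a technique chosen here, not one the project is known to have used, and the field names and toy values are hypothetical. It handles discrete fields only; scikit-learn's `mutual_info_classif` would be the more practical tool on the real data.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    return sum(
        (count / n) * math.log2(count * n / (feature_counts[f] * label_counts[l]))
        for (f, l), count in joint.items()
    )


def rank_features(columns, labels):
    """Return (field name, mutual information) pairs, most informative first."""
    scores = {name: mutual_information(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data with hypothetical field names and merchant size classes.
labels = ["S", "S", "L", "L"]
columns = {
    "uninformative_field": ["u", "v", "u", "v"],
    "informative_field": ["x", "x", "y", "y"],
}
ranking = rank_features(columns, labels)
print(ranking)
```

Fields that score close to zero are candidates for removal, although fields that are only informative in combination with others would be missed by this per-field view.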

Filtering data based on a quality metric could possibly also improve general performance. The `regional_atlas_score` and `google_confidence_score` have been tried for this but did not improve performance. However, these values are computed somewhat arbitrarily, and implementing a more refined quality metric might yield more promising results.
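
As a rough illustration of such a filter, the sketch below keeps only leads whose quality scores all reach a threshold. The dictionary representation, the assumption that scores lie between 0 and 1, the threshold value, and the function name are all hypothetical; only the two score field names come from the text above.

```python
def filter_by_quality(leads, score_fields, threshold):
    """Keep leads whose listed scores are all present and at least `threshold`."""
    kept = []
    for lead in leads:
        scores = [lead.get(field) for field in score_fields]
        if all(score is not None and score >= threshold for score in scores):
            kept.append(lead)
    return kept


# Toy leads; the score values are made up for illustration.
leads = [
    {"id": 1, "regional_atlas_score": 0.9, "google_confidence_score": 0.8},
    {"id": 2, "regional_atlas_score": 0.4, "google_confidence_score": 0.9},
    {"id": 3, "regional_atlas_score": None, "google_confidence_score": 0.7},
]
kept = filter_by_quality(
    leads, ["regional_atlas_score", "google_confidence_score"], threshold=0.5
)
print([lead["id"] for lead in kept])
```

Comparing model performance on the filtered and unfiltered data for several thresholds would show whether a given metric is worth keeping.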
63 changes: 30 additions & 33 deletions Pipfile
@@ -7,53 +7,50 @@ verify_ssl = true
name = "pypi"

[dev-packages]
pytest = "==7.4.0"
coverage = "==7.4.1"
pre-commit = "==3.5.0"
flake8 = "==6.0.0"
pytest-env = "==1.0.1"
matplotlib = "==3.8.2"
plotly = "==5.18.0"
geopy = "==2.4.1"
matplotlib = "==3.8.2"
notebook = "==7.0.6"
plotly = "==5.18.0"
pre-commit = "==3.5.0"
pytest = "==7.4.0"
pytest-env = "==1.0.1"
sphinx = "==7.2.6"
sphinx_rtd_theme = "==2.0.0"
myst_parser = "==2.0.0"

[packages]
numpy = "==1.26.1"
requests = "==2.31.0"
scikit-learn = "==1.3.2"
pydantic = "==2.4.2"
email-validator = "==2.1.0.post1"
pandas = "==2.0.3"
autocorrect = "==2.6.1"
beautifulsoup4 = "==4.12.2"
tqdm = "==4.65.0"
python-dotenv = "==0.21.0"
googlemaps = "==4.10.0"
phonenumbers = "==8.13.25"
pymongo = "==4.6.0"
facebook-sdk = "==3.1.0"
boto3 = "==1.33.1"
colorama = "==0.4.6"
deep-translator = "==1.11.4"
deutschland = "==0.4.0"
email-validator = "==2.1.0.post1"
fsspec = "==2023.12.2"
geopandas = "==0.14.1"
googlemaps = "==4.10.0"
joblib = "==1.3.2"
lightgbm = "==4.3.0"
numpy = "==1.26.1"
openai = "==1.3.3"
tiktoken = "==0.5.1"
osmnx = "==1.7.1"
pandas = "==2.0.3"
phonenumbers = "==8.13.25"
pylanguagetool = "==0.10.0"
pyspellchecker = "==0.7.2"
python-dotenv = "==0.21.0"
reportlab = "==4.0.7"
osmnx = "==1.7.1"
geopandas = "==0.14.1"
requests = "==2.31.0"
s3fs = "==2023.12.2"
scikit-learn = "==1.3.2"
shapely = "==2.0.2"
pyspellchecker = "==0.7.2"
autocorrect = "==2.6.1"
textblob = "==0.17.1"
deep-translator = "==1.11.4"
fsspec = "2023.12.2"
s3fs = "2023.12.2"
imblearn = "==0.0"
sagemaker = "==2.198.0"
joblib = "1.3.2"
tiktoken = "==0.5.1"
torch = "==2.1.2"
tqdm = "==4.65.0"
xgboost = "==2.0.3"
colorama = "==0.4.6"
torch = "2.1.2"
deutschland = "0.4.0"
bs4 = "0.0.2"
lightgbm = "==4.3.0"

[requires]
python_version = "3.10"