[CAI-118] Presidio (#1183)
* Update webapp

* Update modules

* Add handler

* Add presidio

* Update config

* Update poetry

* Update modules

* Update webapp

* Update poetry files

* Remove handler script

* Update config

* Update modules

* Update poetry files

* Update config

* Update config

* Update modules

* Update config

* Update modules

* Update README

* Update README

* Update README

* Update Redis tunnel bash script

* Update modules

* Update changeset

* Update env variables

* Update env vars example

* Update config

* Update modules

* Update redis tunnel

* Update modules

* Update env vars example

* Update modules

* Update config

* feat: added index_id ssm parameter

* fix: efs name

* chore: ran terraform fmt

* fix: ssm parameter type

* Update webapp

* Update presidio model to medium size

* feat: added presidio models caching

---------

Co-authored-by: christian-calabrese <[email protected]>
mdciri and christian-calabrese authored Oct 11, 2024
1 parent 27ed2cf commit cee4135
Showing 20 changed files with 1,162 additions and 269 deletions.
5 changes: 5 additions & 0 deletions .changeset/long-camels-sell.md
@@ -0,0 +1,5 @@
---
"chatbot": minor
---

"Add Presidio to detect and mask PII entities"
6 changes: 5 additions & 1 deletion apps/chatbot/.env.example
@@ -4,12 +4,16 @@ PYTHONPATH=app-path
LOG_LEVEL=DEBUG
CHB_AWS_ACCESS_KEY_ID=...
CHB_AWS_SECRET_ACCESS_KEY=...
CHB_AWS_DEFAULT_REGION=eu-west-3
CHB_AWS_DEFAULT_REGION=eu-south-1
CHB_AWS_BEDROCK_REGION=eu-west-3
CHB_AWS_S3_BUCKET=...
CHB_AWS_GUARDRAIL_ID=...
CHB_AWS_GUARDRAIL_VERSION=...
CHB_REDIS_URL=...
CHB_WEBSITE_URL=...
CHB_REDIS_INDEX_NAME=...
CHB_LLAMAINDEX_INDEX_ID=...
CHB_DOCUMENTATION_DIR=...
CHB_GOOGLE_API_KEY=...
CHB_PROVIDER=...
CHB_MODEL_ID=...
40 changes: 11 additions & 29 deletions apps/chatbot/README.md
@@ -1,10 +1,15 @@
# PagoPA Chatbot

This folder contains all the details to build a RAG using the documentation provided in [`PagoPA Developer Portal`](https://developer.pagopa.it/). The retriever chosen is the `Auto Merging Retriever`, implemented using [`llama-index`](https://docs.llamaindex.ai/en/stable/). Check out `src/modules/retriever.py`.
This folder contains all the details to build a RAG using the documentation provided in [`PagoPA Developer Portal`](https://developer.pagopa.it/).

This chatbot uses [`AWS Bedrock`](https://aws.amazon.com/bedrock/) as its provider, so be sure to have installed [`aws-cli`](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and stored your credentials in `~/.aws/credentials`.
This chatbot uses [Google](https://ai.google.dev/) or [AWS Bedrock](https://aws.amazon.com/bedrock/) as its provider.
Even when the provider is Google, its API key is stored in AWS. So, be sure to have installed [aws-cli](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) and stored your credentials in `~/.aws/credentials`.

All the parameters and prompts used to build the Retrieval-Augmented Generation (RAG) are available in `config`.
The Retrieval-Augmented Generation (RAG) was implemented using [llama-index](https://docs.llamaindex.ai/en/stable/). All the parameters and prompts used are stored in `config`.

## Environment Variables

Create a `.env` file inside this folder and store the environment variables listed in `.env.example`.
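
A quick way to confirm that a local `.env` covers everything in `.env.example` is to compare the variable names in the two files. The following is a minimal sketch using only the standard library; the helper names `read_env_keys` and `missing_keys` are hypothetical and not part of this repository, and the parser assumes plain `KEY=value` lines.

```python
from pathlib import Path


def read_env_keys(path):
    """Return the variable names defined in a dotenv-style file, in file order."""
    keys = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.append(line.split("=", 1)[0].strip())
    return keys


def missing_keys(example_path, env_path):
    """List variables present in the example file but absent from the env file."""
    defined = set(read_env_keys(env_path))
    return [key for key in read_env_keys(example_path) if key not in defined]
```

Calling `missing_keys(".env.example", ".env")` from this folder would then list any variable that still has to be added, for example `CHB_AWS_BEDROCK_REGION` if the `.env` predates this change.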

## Virtual environment

@@ -27,40 +32,17 @@ The working directory is `/developer-portal/apps/chatbot`. So, to set the `PYTHO

In this way, `PYTHONPATH` points to where the Python packages and modules are, not where your checkouts are.

## File for Environment Variables

Create a `.env` file inside the folder and write to the file the following environment variables:

CHB_AWS_ACCESS_KEY_ID=...
CHB_AWS_SECRET_ACCESS_KEY=...
CHB_AWS_DEFAULT_REGION=...
CHB_AWS_S3_BUCKET=...
CHB_AWS_GUARDRAIL_ID=...
CHB_AWS_GUARDRAIL_VERSION=...
CHB_REDIS_URL=...
CHB_REDIS_INDEX_NAME=...
CHB_WEBSITE_URL=...
CHB_GOOGLE_API_KEY=...
CHB_PROVIDER=...
CHB_MODEL_ID=...
CHB_MODEL_TEMPERATURE=...
CHB_MODEL_MAXTOKENS=...
CHB_EMBED_MODEL_ID=...
CHB_ENGINE_SIMILARITY_TOPK=...
CHB_ENGINE_SIMILARITY_CUTOFF=...
CHB_ENGINE_USE_ASYNC=...
CHB_ENGINE_USE_STREAMING=...

## Knowledge vector database
## Knowledge index vector database

To reach the remote Redis instance, you need to open a tunnel:

```
./scripts/redis-tunnel.sh
```

Verify that the HTML files that compose the Developer Portal documentation exist in a directory; otherwise, create the documentation. Once the documentation directory is ready, set its path in `params` and then create the vector index by running:

```
python src/modules/create_vector_index.py --params config/params.yaml
```

52 changes: 50 additions & 2 deletions apps/chatbot/config/params.yaml
@@ -5,9 +5,57 @@ vector_index:
  path: index
  chunk_sizes: [2816, 704, 176]
  chunk_overlap: 20
  use_redis: True
  use_s3: False

engine:
  response_mode: compact
  verbose: False

config_presidio:
  nlp_engine_name: spacy
  models:
    -
      lang_code: en
      model_name: en_core_web_md
    -
      lang_code: it
      model_name: it_core_news_md
    # -
    #   lang_code: de
    #   model_name: de_core_news_md
    # -
    #   lang_code: es
    #   model_name: es_core_news_md
    # -
    #   lang_code: fr
    #   model_name: fr_core_news_md
  ner_model_configuration:
    labels_to_ignore:
      - ORDINAL
      - QUANTITY
      - ORGANIZATION
      - ORG
      - LANGUAGE
      - PRODUCT
      - MONEY
      - PERCENT
      - O
      - CARDINAL
      - EVENT
      - WORK_OF_ART
      - LAW
      - MISC
    model_to_presidio_entity_mapping:
      PER: PERSON
      PERSON: PERSON
      LOC: LOCATION
      LOCATION: LOCATION
      GPE: LOCATION
      ORG: ORGANIZATION
      DATE: DATE_TIME
      TIME: DATE_TIME
      NORP: NRP
    low_confidence_score_multiplier: 0.4
    low_score_entity_names:
      - ORGANIZATION
      - ORG
    default_score: 0.8
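
To make the effect of the `ner_model_configuration` values above easier to follow, here is a small standard-library sketch of how a label emitted by the spaCy model could be filtered, renamed, and scored. It illustrates the documented meaning of these settings and is not Presidio's own code: the function `resolve_label` is hypothetical, and passing unmapped labels through unchanged is an assumption.

```python
from typing import FrozenSet, Optional, Tuple

# Values copied from the `ner_model_configuration` block in params.yaml.
LABELS_TO_IGNORE = frozenset({
    "ORDINAL", "QUANTITY", "ORGANIZATION", "ORG", "LANGUAGE", "PRODUCT",
    "MONEY", "PERCENT", "O", "CARDINAL", "EVENT", "WORK_OF_ART", "LAW", "MISC",
})
MODEL_TO_PRESIDIO_ENTITY = {
    "PER": "PERSON", "PERSON": "PERSON",
    "LOC": "LOCATION", "LOCATION": "LOCATION", "GPE": "LOCATION",
    "ORG": "ORGANIZATION",
    "DATE": "DATE_TIME", "TIME": "DATE_TIME",
    "NORP": "NRP",
}
LOW_SCORE_ENTITY_NAMES = frozenset({"ORGANIZATION", "ORG"})
LOW_CONFIDENCE_SCORE_MULTIPLIER = 0.4
DEFAULT_SCORE = 0.8


def resolve_label(
    label: str,
    labels_to_ignore: FrozenSet[str] = LABELS_TO_IGNORE,
) -> Optional[Tuple[str, float]]:
    """Hypothetical helper: map a model label to (entity, score), or None if dropped."""
    if label in labels_to_ignore:
        return None
    # Assumption: labels without a mapping keep their original name.
    entity = MODEL_TO_PRESIDIO_ENTITY.get(label, label)
    if entity in labels_to_ignore:
        return None
    # spaCy NER gives no per-entity confidence, so a fixed default is used.
    score = DEFAULT_SCORE
    if entity in LOW_SCORE_ENTITY_NAMES:
        score *= LOW_CONFIDENCE_SCORE_MULTIPLIER
    return entity, score
```

With the values in this commit, `resolve_label("GPE")` yields `("LOCATION", 0.8)` while `resolve_label("ORG")` yields `None`. Because `ORG` and `ORGANIZATION` appear in `labels_to_ignore`, the low-confidence multiplier has no effect unless those labels are removed from the ignore list.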
2 changes: 1 addition & 1 deletion apps/chatbot/config/prompts.yaml
@@ -1,6 +1,6 @@
qa_prompt_str: |
You are a customer service chatbot.
Your name is Discovery and your duty is to assist the user with the PagoPA DevPortal documentation!
Your name is Discovery and your duty is to assist the user with the PagoPA DevPortal documentation, homepage: https://dev.developer.pagopa.it!
--------------------
Context information:
{context_str}
2 changes: 2 additions & 0 deletions apps/chatbot/docker/app.Dockerfile
@@ -14,4 +14,6 @@ RUN poetry install

COPY . ${LAMBDA_TASK_ROOT}
RUN python ./scripts/nltk_download.py
RUN python ./scripts/spacy_download.py

CMD ["src.app.main.handler"]