[ENH]: De-identify PII and Sensitive Data Before Sending to LLM or AI Agents and Re-identify in the Response #110

dbosco · 2024-10-06T23:12:43Z

Contact Details

No response

Feature Description

Implement a mechanism to de-identify PII and other sensitive data before sending prompts to LLMs or AI agents. Upon receiving the response, re-identify the tokenized data before displaying it to the user. This helps maintain privacy and ensures that sensitive information is not exposed to untrusted systems.

Acceptance Criteria:

De-identification of PII:
- Create a process to identify and tokenize sensitive data (e.g., PII such as names, addresses, email addresses, phone numbers, social security numbers).
- Ensure that tokenization:
  - Replaces PII with unique, random tokens that retain the original structure (e.g., [email protected] → [EMAIL_123]).
  - Produces tokens that cannot be reversed without access to the mapping.
  - Leaves no meaningful information inferable from the tokenized data.
- Add support for multiple types of sensitive data, including:
  - Email addresses.
  - Phone numbers.
  - Financial data (e.g., credit card numbers).
  - Social security numbers.
  - Personal addresses.
- Ensure de-identification occurs before sending any prompts to LLMs or AI agents.
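
The tokenization criteria above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `PATTERNS` table, the `deidentify` function, and the `[TYPE_hex]` token shape are all assumptions made for the example, and the regular expressions are deliberately simple. Real PII detection would need much more robust rules or a dedicated detector.

```python
import re
import secrets

# Hypothetical detection patterns, for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def deidentify(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected PII with random tokens; return the text and the mapping."""
    mapping: dict[str, str] = {}  # token -> original value
    seen: dict[str, str] = {}     # original value -> token, so repeats share a token

    for pii_type, pattern in PATTERNS.items():
        def substitute(match: re.Match, pii_type: str = pii_type) -> str:
            value = match.group(0)
            if value not in seen:
                # The random suffix carries no information about the value.
                token = f"[{pii_type}_{secrets.token_hex(4)}]"
                while token in mapping:  # regenerate on the rare collision
                    token = f"[{pii_type}_{secrets.token_hex(4)}]"
                seen[value] = token
                mapping[token] = value
            return seen[value]

        text = pattern.sub(substitute, text)
    return text, mapping


masked, mapping = deidentify("Mail jane.doe@example.com about SSN 123-45-6789.")
print(masked)  # the email and SSN are replaced by random tokens
```

Because the suffix is random rather than derived from the value (for example by hashing), a token cannot be reversed without the mapping, which is what the criteria ask for.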

Re-identification in Responses:
- Implement a process to re-identify the sensitive data from the LLM/AI agent's response.
- Match tokens from the response to the original data and replace the tokens with the original PII or sensitive data (e.g., [EMAIL_123] → [email protected]).
- Ensure accurate and secure mapping between tokens and original data to prevent incorrect re-identification.
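
One possible shape for the re-identification step is sketched below. The `reidentify` function and the `[TYPE_hex]` token format are assumptions for illustration, not an existing API. Tokens are restored in a single regex pass, so a restored value is never re-scanned, and a token that is not in the mapping (for example, one invented by the model) is left untouched rather than guessed at.

```python
import re

# Assumed token shape: [TYPE_hexdigits], e.g. [EMAIL_1a2b3c4d].
TOKEN_RE = re.compile(r"\[[A-Z_]+_[0-9a-f]+\]")


def reidentify(response: str, mapping: dict[str, str]) -> str:
    """Swap known tokens back to their original values in one pass."""
    def restore(match: re.Match) -> str:
        token = match.group(0)
        # Unknown tokens stay as they are to avoid incorrect re-identification.
        return mapping.get(token, token)

    return TOKEN_RE.sub(restore, response)


mapping = {"[EMAIL_1a2b3c4d]": "jane.doe@example.com"}
print(reidentify("Please reply to [EMAIL_1a2b3c4d].", mapping))
# prints "Please reply to jane.doe@example.com."
```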

Tokenization Storage and Security:
- Ensure tokens and mappings are stored securely, either in memory or in a temporary data store, encrypted where needed.
- Ensure the mapping between tokens and sensitive data is cleared immediately after the response has been returned to the user to prevent leakage.
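
For the in-memory case, the "clear immediately after the response" requirement could be approached by scoping the mapping to a single request. The sketch below assumes a hypothetical `token_session` helper; it is one way to bound the mapping's lifetime, not a description of how the project does it.

```python
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def token_session() -> Iterator[dict[str, str]]:
    """Yield a per-request token mapping and clear it when the request ends."""
    mapping: dict[str, str] = {}
    try:
        yield mapping
    finally:
        # Runs on success and on error, so the mapping never outlives the request.
        mapping.clear()


with token_session() as mapping:
    mapping["[EMAIL_1a2b3c4d]"] = "jane.doe@example.com"
    # ... de-identify, call the LLM, re-identify, return the response ...

print(mapping)  # prints {}
```

Note that `dict.clear()` only drops the references; it does not overwrite the underlying memory, so stricter threat models would need additional measures.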

Admin Controls and Configuration:
- Provide configuration options for administrators to:
  - Define custom PII types and patterns.
  - Configure tokenization rules and settings.
  - View logs of de-identified interactions for audit purposes (without exposing PII).
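
As a rough illustration of the admin options above, a custom PII type could be expressed as a small rule object, and an audit entry could record only token types and counts. The `PiiRule` and `audit_record` names, the `EMP-` pattern, and the `[TYPE_hex]` token format are all hypothetical and chosen for this sketch.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PiiRule:
    """One administrator-defined PII type (hypothetical configuration shape)."""
    name: str     # used as the token prefix, e.g. EMPLOYEE_ID
    pattern: str  # regular expression that detects the value

    def compiled(self) -> re.Pattern:
        return re.compile(self.pattern)


def audit_record(request_id: str, mapping: dict[str, str]) -> dict:
    """Summarize an interaction using token types and counts, never the values."""
    # "[EMAIL_1a2b3c4d]" -> "EMAIL"; only the mapping's keys (tokens) are read.
    counts = Counter(token.strip("[]").rsplit("_", 1)[0] for token in mapping)
    return {"request_id": request_id, "pii_counts": dict(counts)}


rule = PiiRule(name="EMPLOYEE_ID", pattern=r"\bEMP-\d{6}\b")
print(rule.compiled().findall("Badge EMP-004217 was issued."))  # prints ['EMP-004217']
print(audit_record("req-1", {"[EMAIL_1a2b3c4d]": "jane.doe@example.com"}))
# prints {'request_id': 'req-1', 'pii_counts': {'EMAIL': 1}}
```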

Documentation:
- Provide detailed documentation that includes:
  - How to configure PII detection and tokenization.
  - Examples of how the de-identification and re-identification process works.
  - Security considerations for token storage and mapping.

Code of Conduct

I agree to follow this project's Code of Conduct

dbosco added the enhancement (New feature or request) label on Oct 6, 2024
