[ENH]: De-identify PII and Sensitive Data Before Sending to LLM or AI Agents and Re-identify in the Response #110

Open
1 task done
dbosco opened this issue Oct 6, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

Contributor

dbosco commented Oct 6, 2024

Contact Details

No response

Feature Description

Implement a mechanism to de-identify PII and other sensitive data before sending prompts to LLMs or AI agents. Upon receiving the response, re-identify the tokenized data before displaying it to the user. This helps maintain privacy and ensures that sensitive information is not exposed to untrusted systems.

Acceptance Criteria:

  1. De-identification of PII:

    • Create a process to identify and tokenize sensitive data (e.g., PII such as names, addresses, email addresses, phone numbers, social security numbers).
    • Ensure that tokenization:
      • Replaces PII with unique, random tokens that retain the original structure (e.g., an email address → [EMAIL_123]).
      • Produces tokens that cannot be reversed without access to the mapping.
      • Leaves no meaningful information inferable from the tokenized data.
    • Add support for multiple types of sensitive data, including:
      • Email addresses.
      • Phone numbers.
      • Financial data (e.g., credit card numbers).
      • Social security numbers.
      • Personal addresses.
    • Ensure de-identification occurs before sending any prompts to LLMs or AI agents.
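
A minimal sketch of the de-identification step described above, assuming simple regular-expression detection. The names `PII_PATTERNS` and `deidentify` are hypothetical and not part of this project, and the patterns are deliberately simplistic (US-style phone numbers and SSNs only). Real detection would need broader rules or a dedicated PII detection library.

```python
import re
import secrets

# Hypothetical patterns for illustration only; not the project's rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def deidentify(text):
    """Replace detected PII with random tokens; return (text, mapping)."""
    mapping = {}  # token -> original value, needed later for re-identification
    seen = {}     # original value -> token, so repeated values share one token

    for label, pattern in PII_PATTERNS.items():
        def substitute(match, label=label):
            value = match.group(0)
            if value not in seen:
                # The random suffix carries no information about the value.
                token = f"[{label}_{secrets.token_hex(4)}]"
                seen[value] = token
                mapping[token] = value
            return seen[value]

        text = pattern.sub(substitute, text)
    return text, mapping


masked, mapping = deidentify("Email alice@example.com or call 555-123-4567.")
print(masked)  # the email and phone number are replaced by [EMAIL_...] and [PHONE_...] tokens
```

Reusing one token per distinct value keeps the prompt coherent for the model when the same email or number appears several times, while the token itself reveals only the data type.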
  2. Re-identification in Responses:

    • Implement a process to re-identify the sensitive data from the LLM/AI agent's response.
    • Match tokens from the response to the original data and replace the tokens with the original PII or sensitive data (e.g., [EMAIL_123] → the original email address).
    • Ensure accurate and secure mapping between tokens and original data to prevent incorrect re-identification.
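
One way the re-identification step could look, as a sketch only. It assumes tokens have the shape `[LABEL_suffix]` (as in the `[EMAIL_123]` example) and that the token-to-value mapping from the de-identification step is available as a dictionary. `TOKEN_RE` and `reidentify` are illustrative names.

```python
import re

# Assumed token shape: [LABEL_suffix], e.g. [EMAIL_123] or [EMAIL_1a2b3c4d].
TOKEN_RE = re.compile(r"\[[A-Z_]+_[0-9a-f]+\]")


def reidentify(response, mapping):
    """Swap known tokens back to their original values in a single pass."""
    def restore(match):
        token = match.group(0)
        # Tokens missing from the mapping (for example, ones the model
        # invented) are left untouched rather than guessed at.
        return mapping.get(token, token)

    return TOKEN_RE.sub(restore, response)


mapping = {"[EMAIL_123]": "alice@example.com"}
print(reidentify("A reply was sent to [EMAIL_123].", mapping))
# prints: A reply was sent to alice@example.com.
```

Doing the replacement in a single regex pass means restored values are never rescanned, so original data that happens to look like a token cannot trigger a second, incorrect substitution.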
  3. Tokenization Storage and Security:

    • Ensure tokens and mappings are stored securely in memory or a temporary data store with encryption if needed.
    • Ensure the mapping between tokens and sensitive data is cleared immediately after the response has been returned to the user to prevent leakage.
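
The "clear immediately after the response" requirement could be sketched with a context manager that scopes the mapping to a single request. `token_session` is a hypothetical name, and this version keeps the mapping in process memory only; encryption at rest for an external temporary store is not shown.

```python
from contextlib import contextmanager


@contextmanager
def token_session():
    """Yield a per-request token mapping and clear it on exit."""
    mapping = {}
    try:
        yield mapping
    finally:
        # Runs even if the LLM call raises, so the mapping does not outlive
        # the request. Clearing a dict only drops references; it does not
        # scrub the underlying memory.
        mapping.clear()


with token_session() as mapping:
    mapping["[EMAIL_123]"] = "alice@example.com"
    # ... de-identify, call the LLM, re-identify, return the response ...
# At this point the mapping is empty again.
```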
  4. Admin Controls and Configuration:

    • Provide configuration options for administrators to:
      • Define custom PII types and patterns.
      • Configure tokenization rules and settings.
      • View logs of de-identified interactions for audit purposes (without exposing PII).
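
A rough sketch of what the admin-facing pieces might look like. It assumes custom PII types are supplied as label-to-regex pairs and that audit logs record only counts per PII type. `DeidConfig`, `audit_record`, `log_interaction`, and the `EMPLOYEE_ID` pattern are all hypothetical and not an existing API of this project.

```python
import logging
import re
from dataclasses import dataclass, field

logger = logging.getLogger("deid.audit")


@dataclass
class DeidConfig:
    """Hypothetical admin settings; field names are illustrative."""
    patterns: dict = field(default_factory=dict)  # label -> regex string
    token_bytes: int = 4                          # random bytes per token

    def compiled(self):
        # Compiling up front makes an invalid admin-supplied regex fail
        # immediately with re.error instead of at request time.
        return {label: re.compile(rx) for label, rx in self.patterns.items()}


def audit_record(mapping):
    """Summarise one interaction as counts per PII type, with no values."""
    counts = {}
    for token in mapping:
        # "[CREDIT_CARD_ab12]" -> "CREDIT_CARD"
        label = token.strip("[]").rsplit("_", 1)[0]
        counts[label] = counts.get(label, 0) + 1
    return counts


def log_interaction(request_id, mapping):
    # Only the type counts are logged, never the original values.
    logger.info("request=%s deidentified=%s", request_id, audit_record(mapping))


config = DeidConfig(patterns={"EMPLOYEE_ID": r"\bEMP-\d{6}\b"})
```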
  5. Documentation:

    • Provide detailed documentation that includes:
      • How to configure PII detection and tokenization.
      • Examples of how the de-identification and re-identification process works.
      • Security considerations for token storage and mapping.

Code of Conduct

  • I agree to follow this project's Code of Conduct
@dbosco dbosco added the enhancement New feature or request label Oct 6, 2024