ltm920716/zh_PII

zh PII

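A data anonymization API for protecting data privacy when interacting with external models.
Built on Microsoft Presidio.

Dependencies and deployment

  • See the Dockerfile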
  • local
    $ pip3 install -r requirements.txt
    $ python3 -m spacy download zh_core_web_sm
    $ python3 -m spacy download en_core_web_lg
    $ python3 -m spacy download xx_ent_wiki_sm
    $ python3 start_app.py
    optional
    $ cp .env_copy .env
      or
    $ export OPENAI_API_KEY=''

Features

Supported languages

  • ['zh', 'en']

Currently supported entity types for analysis

  • ['PERSON', 'IBAN_CODE', 'ID_CARD', 'CRYPTO', 'MEDICAL_LICENSE', 'DATE_TIME', 'URL', 'LOCATION', 'PHONE_NUMBER', 'CREDIT_CARD', 'EMAIL_ADDRESS', 'IP_ADDRESS', 'NRP']
    Entity type descriptions

Currently supported anonymization operations

  • ['replace', 'redact', 'hash', 'mask', 'encrypt']
    Operator descriptions
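
As a rough illustration of what these operators do to a matched span, the plain-Python sketch below approximates four of them. It is not the service's or Presidio's implementation: the function names are illustrative, the `masking_char`, `chars_to_mask`, and `from_end` parameters are modeled on Presidio's mask operator, and the hash variant assumes unsalted SHA-256. `encrypt` is left out because it needs a key and a cipher library.

```python
import hashlib


def replace(span: str, new_value: str) -> str:
    """Substitute the matched span with a caller-supplied value."""
    return new_value


def redact(span: str) -> str:
    """Remove the matched span entirely."""
    return ""


def mask(span: str, masking_char: str = "*", chars_to_mask: int = 4,
         from_end: bool = True) -> str:
    """Overwrite part of the span with a masking character."""
    n = max(0, min(chars_to_mask, len(span)))
    if n == 0:
        return span
    if from_end:
        return span[:len(span) - n] + masking_char * n
    return masking_char * n + span[n:]


def hash_value(span: str) -> str:
    """Assumption: unsalted SHA-256, returned as a hex digest."""
    return hashlib.sha256(span.encode("utf-8")).hexdigest()


phone = "13122832932"
print(mask(phone))                  # 1312283****
print(mask(phone, from_end=False))  # ****2832932
```

Masking keeps the length of the original value visible, while redact and hash do not, which may matter when the surrounding text is later sent to an external model.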

Todo

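  • Optimize processing for Chinese by subclassing and modifying NlpArtifacts, RecognizerRegistry, and PatternRecognizer
  • Optimize existing entity types for Chinese or add Chinese versions, and unify scores
    • Adapt the built-in PHONE_NUMBER entity type to Chinese
    • Adapt the built-in CREDIT_CARD entity type to Chinese
    • Adapt the built-in MEDICAL_LICENSE entity type to Chinese
    • Adapt the built-in IBAN_CODE entity type to Chinese
    • Add a Chinese ID_CARD entity type
    • Add a Chinese BANK_CARD entity type
  • Support custom keyword entities and custom regex entities
    • Support combining custom entities with built-in entities
  • Support using an LLM (OpenAI) to synthesize replacement data for known private spans
  • Support direct extraction of sensitive fields with a local LLM
  • Support PII extraction from images
  • Optimize based on JioNLP

API parameters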

get supported entities

curl -X 'GET' \
  'http://0.0.0.0:8080/pii/supported_entities/zh' \
  -H 'accept: application/json'
  • response
    {
      "status": 200,
      "msg": "success",
      "data": [
        "URL",
        "PHONE_NUMBER",
        "IP_ADDRESS",
        "CREDIT_CARD",
        "MEDICAL_LICENSE",
        "IBAN_CODE",
        "EMAIL_ADDRESS",
        "NRP",
        "LOCATION",
        "CRYPTO",
        "DATE_TIME",
        "PERSON"
      ]
    }

get supported anonymizers

curl -X 'GET' \
  'http://0.0.0.0:8080/pii/supported_anonymizers' \
  -H 'accept: application/json'
  • response
    {
      "status": 200,
      "msg": "success",
      "data": [
        "replace",
        "redact",
        "hash",
        "mask",
        "encrypt"
      ]
    }
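
Both GET endpoints can also be called from Python. The sketch below uses only the standard library and assumes the service is reachable at the `http://0.0.0.0:8080` address used in the curl examples. `BASE_URL`, `endpoint_url`, `get_json`, and `print_supported` are illustrative names, not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://0.0.0.0:8080"  # taken from the curl examples; adjust as needed


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(url: str, timeout: float = 10.0) -> dict:
    """Send a GET request and decode the JSON response body."""
    req = urllib.request.Request(
        url, headers={"accept": "application/json"}, method="GET"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def print_supported(lang: str = "zh") -> None:
    """Print the 'data' field of both endpoints (requires a running service)."""
    entities = get_json(endpoint_url(f"/pii/supported_entities/{lang}"))
    anonymizers = get_json(endpoint_url("/pii/supported_anonymizers"))
    print(entities["data"])
    print(anonymizers["data"])
```

Calling `print_supported("zh")` against a running instance should print the two `data` lists shown in the responses above.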

text pii analyze

curl -X 'POST' \
  'http://0.0.0.0:8080/pii/analyze' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "李雷的电话号码是13122832932",
  "lang": "zh",
  "entities": [
    "PHONE_NUMBER", "PERSON"
  ],
  "score_threshold": 0.3,
  "with_anonymize": false,
  "llm_synthesize": false,
  "anonymize_operators": []
}'
  • response
    {
      "status": 200,
      "msg": "success",
      "data": {
        "analyze": [
          {
            "entity_type": "PERSON",
            "start": 0,
            "end": 2,
            "score": 0.85
          },
          {
            "entity_type": "PHONE_NUMBER",
            "start": 8,
            "end": 19,
            "score": 0.75
          }
        ],
        "anonymize": []
      }
    }
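
In this example the `start` and `end` values behave like half-open character offsets into the input string, the same convention as Python slicing. The short sketch below recovers the matched substrings from the response shown above; `extract_spans` is a hypothetical helper, not part of the service.

```python
text = "李雷的电话号码是13122832932"

# The "analyze" list copied from the response above.
analyze = [
    {"entity_type": "PERSON", "start": 0, "end": 2, "score": 0.85},
    {"entity_type": "PHONE_NUMBER", "start": 8, "end": 19, "score": 0.75},
]


def extract_spans(text: str, results: list) -> list:
    """Return (entity_type, matched_text) pairs using start/end as slice bounds."""
    return [(r["entity_type"], text[r["start"]:r["end"]]) for r in results]


print(extract_spans(text, analyze))
# [('PERSON', '李雷'), ('PHONE_NUMBER', '13122832932')]
```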

text pii anonymize

curl -X 'POST' \
  'http://0.0.0.0:8080/pii/anonymize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "李雷的电话号码是13122832932",
  "analyzer_results": [
    {
      "entity_type": "PERSON",
      "start": 0,
      "end": 2,
      "score": 0.4
    }
  ],
  "llm_synthesize": false,
  "operators": [
    {
      "entity_type": "PERSON",
      "operator_name": "replace",
      "params": {"new_value": "韩梅梅"}
    }
  ]
}'
  • response
    {
      "status": 200,
      "msg": "success",
      "data": {
        "text": "韩梅梅的电话号码是13122832932",
        "items": [
          {
            "start": 0,
            "end": 3,
            "entity_type": "PERSON",
            "text": "韩梅梅",
            "operator": "replace"
          }
        ]
      }
    }
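
Note that the offsets in `items` describe the anonymized text, not the original: the replacement `韩梅梅` occupies characters 0 to 3 of the output. The sketch below reproduces the replace step locally to make that concrete. It assumes non-overlapping spans, and `apply_replacements` is a hypothetical helper rather than the service's implementation.

```python
def apply_replacements(text: str, results: list, new_values: dict) -> str:
    """Replace each detected span with the value configured for its entity type."""
    out = text
    # Work from right to left so that offsets of earlier spans stay valid
    # even when a replacement changes the length of the text.
    for r in sorted(results, key=lambda r: r["start"], reverse=True):
        new_value = new_values.get(r["entity_type"])
        if new_value is None:
            continue  # no operator configured for this entity type
        out = out[:r["start"]] + new_value + out[r["end"]:]
    return out


text = "李雷的电话号码是13122832932"
analyzer_results = [{"entity_type": "PERSON", "start": 0, "end": 2, "score": 0.4}]

anonymized = apply_replacements(text, analyzer_results, {"PERSON": "韩梅梅"})
print(anonymized)       # 韩梅梅的电话号码是13122832932
print(anonymized[0:3])  # 韩梅梅
```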

text pii anonymize with llm_synthesize

curl -X 'POST' \
  'http://0.0.0.0:8080/pii/anonymize' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "李雷的电话号码是13122832932",
  "analyzer_results": [
    {
      "entity_type": "PERSON",
      "start": 0,
      "end": 2,
      "score": 0.4
    }
  ],
  "llm_synthesize": true,
  "operators": []
}'
  • response
    {
      "status": 200,
      "msg": "success",
      "data": {
        "text": "王磊的电话号码是13122832932",
        "items": []
      }
    }

text pii custom analyze

curl -X 'POST' \
  'http://0.0.0.0:8080/pii/custom_analyze' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "李雷的电话号码是13122832932",
  "lang": "zh",
  "entities": [
    {
      "entity": "abc",
      "deny_list": ["电话", "是"],
      "patterns": [],
      "context": []
    }
  ],
  "with_anonymize": false,
  "llm_synthesize": false,
  "anonymize_operators": [],
  "allow_list": []
}'
  • response
    {
      "status": 200,
      "msg": "success",
      "data": {
        "analyze": [
          {
            "entity_type": "abc",
            "start": 3,
            "end": 5,
            "score": 1
          },
          {
            "entity_type": "abc",
            "start": 7,
            "end": 8,
            "score": 1
          }
        ],
        "anonymize": []
      }
    }
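
The two matches in this response line up with literal occurrences of the `deny_list` words in the text. The sketch below reproduces those offsets with a plain substring scan. This is an assumption about the observable behavior, not the service's actual matching logic, and `deny_list_matches` is a hypothetical helper.

```python
import re


def deny_list_matches(text: str, entity: str, deny_list: list) -> list:
    """Find every literal occurrence of a deny-list word in the text."""
    if not deny_list:
        return []
    # Try longer words first so a short word cannot shadow a longer one.
    words = sorted(deny_list, key=len, reverse=True)
    pattern = "|".join(re.escape(w) for w in words)
    return [
        {"entity_type": entity, "start": m.start(), "end": m.end(), "score": 1}
        for m in re.finditer(pattern, text)
    ]


text = "李雷的电话号码是13122832932"
for match in deny_list_matches(text, "abc", ["电话", "是"]):
    print(match["start"], match["end"], text[match["start"]:match["end"]])
# 3 5 电话
# 7 8 是
```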

About

An extension of the Presidio PII library for Chinese-language scenarios
