Docs revamp (#1500)
omri374 authored Dec 29, 2024
1 parent 9fee330 commit bed2979
Showing 48 changed files with 1,954 additions and 395 deletions.
21 changes: 11 additions & 10 deletions CONTRIBUTING.md
@@ -13,11 +13,12 @@ Presidio is both a framework and a system. It's a framework in a sense that you
When contributing to presidio, it's important to keep this in mind, as some "framework" contributions might not be suitable for a deployment, or vice-versa.

### PR guidelines

Commit message should be clear, explaining the committed changes.

Update CHANGELOG.md:

Under the `Unreleased` section, use the category most suitable for your change (changed/removed/deprecated).
Document the change with simple readable text and push it as part of the commit.
Next release, the change will be documented under the new version.
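As an illustration, an entry under `Unreleased` could look like this (the change text below is hypothetical):

```markdown
## Unreleased

### Changed

- Improved handling of international prefixes in the phone number recognizer.
```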

@@ -30,12 +31,6 @@ For more details follow the [Build and Release documentation](docs/build_release

To get started, refer to the documentation for [setting up a development environment](docs/development.md).

### How can I contribute?

- [Testing](#how-to-test)
- [Adding new recognizers for new PII types](#adding-new-recognizers-for-new-pii-types)
- [Fixing Bugs and improving the code](#fixing-bugs-and-improving-the-code)

### How to test?

For Python, Presidio leverages `pytest` and `ruff`. See [this tutorial](docs/development.md#testing) for more information on testing Presidio modules.
@@ -50,14 +45,20 @@ Best practices for developing recognizers [are described here](docs/analyzer/dev

Please review the open [issues on Github](https://github.com/microsoft/presidio/issues) for known bugs and feature requests. We sometimes add 'good first issue' labels on those we believe are simpler, and 'advanced' labels on those which require more work or multiple changes across the solution.

### Adding samples

We would love to see more samples demonstrating how to use Presidio in different scenarios. If you have a sample that you think would be useful for others, please consider contributing it. You can find the samples in the [samples folder](docs/samples/).

When contributing a sample, make sure it is self-contained (e.g., external dependencies are documented), and add it [to the index](docs/samples/index.md) and to the [mkdocs.yml](mkdocs.yml) file.

## Contacting Us

For any questions, please email <[email protected]>.

## Contribution guidelines

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit <https://cla.microsoft.com>.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact <[email protected]> with any additional questions or comments.
19 changes: 12 additions & 7 deletions docs/analyzer/adding_recognizers.md
@@ -150,7 +150,7 @@ To add a recognizer to the list of pre-defined recognizers:

1. Clone the repo.
2. Create a file containing the new recognizer Python class.
3. Add the recognizer to the `recognizers` list in the [`default_recognizers`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml) config. Details of recognizer parameters are given [here](./recognizer_registry_provider.md#the-recognizer-parameters).
4. Optional: Update documentation (e.g., the [supported entities list](../supported_entities.md)).

### Azure AI Language recognizer
@@ -218,22 +218,27 @@ Additional examples can be found in the [OpenAPI spec](../api-docs/api-docs.html).
### Reading pattern recognizers from YAML

Recognizers can be loaded from a YAML file, which allows users to add recognition logic without writing code.
An example YAML file can be found [here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml).

Once the YAML file is created, it can be loaded into the `RecognizerRegistry` instance.

This example creates a `RecognizerRegistry` holding only the recognizers in the YAML file:

<!--pytest-codeblocks:skip-->
``` python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry import RecognizerRegistryProvider

recognizer_registry_conf_file = "./analyzer/recognizers-config.yml"

provider = RecognizerRegistryProvider(
    conf_file=recognizer_registry_conf_file
)
registry = provider.create_recognizer_registry()
analyzer = AnalyzerEngine(registry=registry)

results = analyzer.analyze(text="My name is Morris", language="en")
print(results)
```

This example adds the new recognizers to the predefined recognizers in Presidio:
2 changes: 1 addition & 1 deletion docs/analyzer/customizing_nlp_models.md
@@ -123,7 +123,7 @@ Configuration can be done in two ways:

## Leverage frameworks other than spaCy, Stanza and transformers for ML based PII detection

In addition to the built-in spaCy/Stanza/transformers capabilities, it is possible to create new recognizers which serve as interfaces to other models.
For more information:

- [Remote recognizer documentation](adding_recognizers.md#creating-a-remote-recognizer) and [samples](../samples/python/integrating_with_external_services.ipynb).
10 changes: 5 additions & 5 deletions docs/analyzer/developing_recognizers.md
@@ -7,7 +7,7 @@ Recognizers define the logic for detection, as well as the confidence a predicti

### Accuracy

Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system.
A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
For tools and documentation on evaluating and analyzing recognizers, refer to the [presidio-research Github repository](https://github.com/microsoft/presidio-research).

@@ -23,7 +23,7 @@ Make sure your recognizer doesn't take too long to process text. Anything above

### Environment

When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies.
In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint.

## Recognizer Types
@@ -34,7 +34,7 @@ Generally speaking, there are three types of recognizers:

A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]` to detect a "Title" entity.)

See [this documentation](adding_recognizers.md) on adding a new recognizer. The [`PatternRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.

### Pattern Based

@@ -57,13 +57,13 @@ Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analy
`spaCy` provides decent results compared to state-of-the-art NER models, but with much better computational performance.
`spaCy`, `stanza` and `transformers` models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned.

In addition to those, it is also possible to use other ML models. In that case, a new `EntityRecognizer` should be created.
See an example using [Flair here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py).

#### Apply Custom Logic

In some cases, rule-based logic provides reasonable ways for detecting entities.
The Presidio `EntityRecognizer` API allows you to use `spaCy` extracted features like lemmas, part of speech, dependencies and more to create your logic.
When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.

!!! attention "Considerations for selecting one option over another"
107 changes: 107 additions & 0 deletions docs/analyzer/index.md
@@ -58,6 +58,113 @@ see [Installing Presidio](../installation.md).
curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' -H "Content-Type: application/json" -X POST http://localhost:3000/analyze
```

## Main concepts

Presidio analyzer is a set of tools that are used to detect entities in text. The main object in Presidio Analyzer is the `AnalyzerEngine`. In the following section we'll describe the main concepts in Presidio Analyzer.

This simplified class diagram shows the main classes in Presidio Analyzer:

```mermaid
classDiagram
direction LR
class RecognizerResult {
+str entity_type
+float score
+int start
+int end
}
class EntityRecognizer {
+str name
+int version
+List[str] supported_entities
+analyze(text, entities) List[RecognizerResult]
}
class RecognizerRegistry {
+add_recognizer(recognizer) None
+remove_recognizer(recognizer) None
+load_predefined_recognizers() None
+get_recognizers() List[EntityRecognizer]
}
class NlpEngine {
+process_text(text, language) NlpArtifacts
+process_batch(texts, language) Iterator[NlpArtifacts]
}
class ContextAwareEnhancer {
+enhance_using_context(text, recognizer_results) List[RecognizerResult]
}
class AnalyzerEngine {
+NlpEngine nlp_engine
+RecognizerRegistry registry
+ContextAwareEnhancer context_aware_enhancer
+analyze(text: str, language) List[RecognizerResult]
}
NlpEngine <|-- SpacyNlpEngine
NlpEngine <|-- TransformersNlpEngine
NlpEngine <|-- StanzaNlpEngine
AnalyzerEngine *-- RecognizerRegistry
AnalyzerEngine *-- NlpEngine
AnalyzerEngine *-- ContextAwareEnhancer
RecognizerRegistry o-- "0..*" EntityRecognizer
ContextAwareEnhancer <|-- LemmaContextAwareEnhancer
%% Defining styles
style RecognizerRegistry fill:#E6F7FF,stroke:#005BAC,stroke-width:2px
style NlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px
style SpacyNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px
style TransformersNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px
style StanzaNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px
style ContextAwareEnhancer fill:#E6FFE6,stroke:#008000,stroke-width:2px
style LemmaContextAwareEnhancer fill:#E6FFE6,stroke:#008000,stroke-width:2px
style EntityRecognizer fill:#F5F5DC,stroke:#8B4513,stroke-width:2px
style RecognizerResult fill:#FFF0F5,stroke:#FF69B4,stroke-width:2px
```

### `RecognizerResult`

A `RecognizerResult` holds the type and span of a PII entity.
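As a hedged illustration, the essence of the class can be sketched as a plain dataclass (field set simplified; the real `RecognizerResult` in `presidio_analyzer` carries additional metadata such as analysis explanations):

```python
from dataclasses import dataclass

@dataclass
class SimpleRecognizerResult:
    """Simplified stand-in for Presidio's RecognizerResult."""
    entity_type: str  # e.g. "PERSON" or "PHONE_NUMBER"
    start: int        # span start offset in the input text
    end: int          # span end offset (exclusive)
    score: float      # detection confidence, 0.0-1.0

text = "My name is Morris"
result = SimpleRecognizerResult(entity_type="PERSON", start=11, end=17, score=0.85)
# The span indexes back into the analyzed text: text[result.start:result.end]
```

Holding offsets rather than the matched string keeps results cheap and lets the anonymizer operate on the original text.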

### `EntityRecognizer`

An entity recognizer is an object in Presidio that is responsible for detecting entities in text. An entity recognizer can be a rule-based recognizer, a machine learning model, or a combination of both.

### `PatternRecognizer`

A `PatternRecognizer` is a type of entity recognizer that uses regular expressions to detect entities in text. One can create new `PatternRecognizer` objects by providing a list of regular expressions, context words, validation and invalidation logic and additional parameters that facilitate the detection of entities.
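To illustrate the pattern-plus-validation idea without depending on Presidio's API, here is a standard-library sketch (the function names and the 16-digit pattern are illustrative) in which a regex proposes candidates and a Luhn checksum validates or invalidates them:

```python
import re

def luhn_checksum_ok(digits):
    """Validate a digit string with the Luhn algorithm (used by card numbers)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_like_numbers(text):
    """The regex proposes 16-digit candidates; the Luhn check accepts or rejects."""
    candidates = re.finditer(r"\b\d{16}\b", text)
    return [m.group(0) for m in candidates if luhn_checksum_ok(m.group(0))]

hits = find_card_like_numbers("valid: 4111111111111111, invalid: 1234567812345678")
```

In a `PatternRecognizer`, this split corresponds to the regex patterns on one hand and the validation/invalidation hooks on the other.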

### `AnalyzerEngine`

The `AnalyzerEngine` is the main object in Presidio Analyzer that is responsible for detecting entities in text. The `AnalyzerEngine` can be configured in various ways to fit the specific needs of the user.

### `RecognizerRegistry`

The `RecognizerRegistry` is a registry that contains all the entity recognizers that are available in Presidio. The `AnalyzerEngine` uses the `RecognizerRegistry` to detect entities in text.

### `NlpEngine`

An NLP Engine is an object that holds the NLP model that is used by the `AnalyzerEngine` to parse the input text and extract different features from it, such as tokens, lemmas, entities, and more. Note that Named Entity Recognition (NER) models can be added in two ways to Presidio: One is through the `NlpEngine` object, and the other is through a new `EntityRecognizer` object. By creating a Named Entity Recognition model through the `NlpEngine`, the named entities will be available to the different modules in Presidio. Furthermore, the `NlpEngine` object supports a batch mode (i.e., processing multiple texts at once) which allows for faster processing of large amounts of text.
It is possible to mix multiple NER models in Presidio, for instance, one model as the `NlpEngine` and others as additional `EntityRecognizer` objects.

Presidio has an off-the-shelf support for multiple NLP packages, such as spaCy, stanza, and huggingface. The simplest way to integrate a model from these packages is through the `NlpEngine`. More information on this [can be found in the NlpEngine documentation](customizing_nlp_models.md). The samples gallery has several examples of leveraging NER models as new `EntityRecognizer` objects. For example, [flair](../samples/python/flair_recognizer.py) and [spanmarker](../samples/python/span_marker_recognizer.py).
For a detailed flow of Named Entities within presidio, see the diagram [in this document](nlp_engines/transformers.md#how-ner-results-flow-within-presidio).

### `ContextAwareEnhancer`

The `ContextAwareEnhancer` is a module that enhances the detection of entities by using the context of the text. The `ContextAwareEnhancer` can be used to improve the detection of entities that are dependent on the context of the text, such as dates, locations, and more. The default implementation is the `LemmaContextAwareEnhancer` which uses the lemmas of the tokens in the text to enhance the detection of entities. Note that it's possible (and sometimes recommended) to create custom `ContextAwareEnhancer` objects to fit the specific needs of the user, for example if the context should support more than one word, which is currently not supported by the default Lemma based enhancer.
More information on this can be found [in this sample](../samples/python/customizing_presidio_analyzer.ipynb).
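As a toy, standard-library sketch of the general idea (the function, window size and boost value are illustrative, not how `LemmaContextAwareEnhancer` is implemented — the real enhancer operates on NLP artifacts such as lemmas):

```python
def enhance_with_context(text, span_start, span_end, score,
                         context_words, window=30, boost=0.35, max_score=1.0):
    """Boost a detection's score when a context word appears near the span."""
    left = max(0, span_start - window)
    # Look at a fixed-size character window on both sides of the detected span.
    neighborhood = (text[left:span_start] + text[span_end:span_end + window]).lower()
    if any(word in neighborhood for word in context_words):
        return min(max_score, score + boost)
    return score

text = "My phone number is 555-1234"
boosted = enhance_with_context(text, 19, 27, 0.4, context_words=["phone", "call"])
unchanged = enhance_with_context("code 555-1234", 5, 13, 0.4, context_words=["phone"])
```

The same ambiguous pattern gets a higher score when a supporting word like "phone" appears nearby, and is left untouched otherwise.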

## Creating PII recognizers

Presidio analyzer can be easily extended to support additional PII entities.
14 changes: 6 additions & 8 deletions docs/analyzer/nlp_engines/transformers.md
@@ -8,7 +8,9 @@ Presidio leverages other types of information from spaCy such as tokens, lemmas
Therefore, the pipeline returns both the NER model results and results from other pipeline components.

## How NER results flow within Presidio

This diagram describes the flow of NER results within Presidio, and the relationship between the `TransformersNlpEngine` component and the `TransformersRecognizer` component:

```mermaid
sequenceDiagram
AnalyzerEngine->>TransformersNlpEngine: Call engine.process_text(text) <br>to get model results
@@ -55,7 +57,6 @@ Then, also download a spaCy pipeline/model:
python -m spacy download en_core_web_sm
```


### Configuring the NER pipeline

Once the models are downloaded, one option to configure them is to create a YAML configuration file.
@@ -193,7 +194,7 @@ Once the configuration file is created, it can be used to create a new `Transfor
print(results_english)
```

#### Explaining the configuration options

- `model_name.spacy` is a name of a spaCy model/pipeline, which would wrap the transformers NER model. For example, `en_core_web_sm`.
- The `model_name.transformers` is the full path for a huggingface model. Models can be found on the [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2`.
@@ -208,19 +209,16 @@ The `ner_model_configuration` section contains the following parameters:
- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.


!!! note "Defining the entity mapping"
    To be able to create the `model_to_presidio_entity_mapping` dictionary, it is advised to check which classes the model is able to predict.
    This can be found on the huggingface hub site for the model in some cases. In others, one can check the model's `config.json` under `id2label`.
    For example, for `bert-base-NER-uncased`, it can be found here: <https://huggingface.co/dslim/bert-base-NER-uncased/blob/main/config.json>.
    Note that most NER models add a prefix to the class (e.g. `B-PER` for class `PER`). When creating the mapping, do not add the prefix.
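As an illustrative (hypothetical) fragment for a model predicting `PER`, `LOC` and `ORG`, the mapping could look like:

```yaml
ner_model_configuration:
  model_to_presidio_entity_mapping:
    PER: PERSON        # note: no B-/I- prefix on the model-side keys
    LOC: LOCATION
    ORG: ORGANIZATION
```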


See more information on parameters on the [spacy-huggingface-pipelines Github repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification).

Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#configure-presidio-to-use-the-new-model) for more information.


### Training your own model

!!! note "Note"
Expand Down