Docs revamp (#1500)
omri374 authored Dec 29, 2024
1 parent 9fee330 commit bed2979
Showing 48 changed files with 1,954 additions and 395 deletions.
21 changes: 11 additions & 10 deletions CONTRIBUTING.md
@@ -13,11 +13,12 @@ Presidio is both a framework and a system. It's a framework in a sense that you
When contributing to presidio, it's important to keep this in mind, as some "framework" contributions might not be suitable for a deployment, or vice-versa.

### PR guidelines

Commit message should be clear, explaining the committed changes.

Update CHANGELOG.md:

Under the `Unreleased` section, use the category most suitable for your change (changed/removed/deprecated).
Document the change with simple readable text and push it as part of the commit.
Next release, the change will be documented under the new version.
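As an illustration, an entry under `Unreleased` could look like this (the change text below is hypothetical):

```markdown
## Unreleased

### Changed

- Improved handling of international prefixes in the phone number recognizer.
```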

@@ -30,12 +31,6 @@ For more details follow the [Build and Release documentation](docs/build_release

To get started, refer to the documentation for [setting up a development environment](docs/development.md).

### How can I contribute?

- [Testing](#how-to-test)
- [Adding new recognizers for new PII types](#adding-new-recognizers-for-new-pii-types)
- [Fixing Bugs and improving the code](#fixing-bugs-and-improving-the-code)

### How to test?

For Python, Presidio leverages `pytest` and `ruff`. See [this tutorial](docs/development.md#testing) for more information on testing Presidio modules.
@@ -50,14 +45,20 @@ Best practices for developing recognizers [are described here](docs/analyzer/dev

Please review the open [issues on Github](https://github.com/microsoft/presidio/issues) for known bugs and feature requests. We sometimes add 'good first issue' labels on those we believe are simpler, and 'advanced' labels on those which require more work or multiple changes across the solution.

### Adding samples

We would love to see more samples demonstrating how to use Presidio in different scenarios. If you have a sample that you think would be useful for others, please consider contributing it. You can find the samples in the [samples folder](docs/samples/).

When contributing a sample, make sure it is self-contained (e.g., external dependencies are documented), and add it [to the index](docs/samples/index.md) and to the [mkdocs.yml](mkdocs.yml) file.

## Contacting Us

For any questions, please email <[email protected]>.

## Contribution guidelines

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit <https://cla.microsoft.com>.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact <[email protected]> with any additional questions or comments.
19 changes: 12 additions & 7 deletions docs/analyzer/adding_recognizers.md
@@ -150,7 +150,7 @@ To add a recognizer to the list of pre-defined recognizers:

1. Clone the repo.
2. Create a file containing the new recognizer Python class.
3. Add the recognizer to the `recognizers` list in the [`default_recognizers`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml) config. Details of recognizer parameters are given [here](./recognizer_registry_provider.md#the-recognizer-parameters).
4. Optional: Update documentation (e.g., the [supported entities list](../supported_entities.md)).

### Azure AI Language recognizer
@@ -218,22 +218,27 @@ Additional examples can be found in the [OpenAPI spec](../api-docs/api-docs.html).
### Reading pattern recognizers from YAML

Recognizers can be loaded from a YAML file, which allows users to add recognition logic without writing code.
An example YAML file can be found [here](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml).

Once the YAML file is created, it can be loaded into the `RecognizerRegistry` instance.

This example creates a `RecognizerRegistry` holding only the recognizers in the YAML file:

<!--pytest-codeblocks:skip-->
``` python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry import RecognizerRegistryProvider

recognizer_registry_conf_file = "./analyzer/recognizers-config.yml"

provider = RecognizerRegistryProvider(
    conf_file=recognizer_registry_conf_file
)
registry = provider.create_recognizer_registry()
analyzer = AnalyzerEngine(registry=registry)

results = analyzer.analyze(text="My name is Morris", language="en")
print(results)
```

This example adds the new recognizers to the predefined recognizers in Presidio:
2 changes: 1 addition & 1 deletion docs/analyzer/customizing_nlp_models.md
@@ -123,7 +123,7 @@ Configuration can be done in two ways:

## Leverage frameworks other than spaCy, Stanza and transformers for ML based PII detection

In addition to the built-in spaCy/Stanza/transformers capabilities, it is possible to create new recognizers which serve as interfaces to other models.
For more information:

- [Remote recognizer documentation](adding_recognizers.md#creating-a-remote-recognizer) and [samples](../samples/python/integrating_with_external_services.ipynb).
10 changes: 5 additions & 5 deletions docs/analyzer/developing_recognizers.md
@@ -7,7 +7,7 @@ Recognizers define the logic for detection, as well as the confidence a predicti

### Accuracy

Each recognizer, regardless of its complexity, could have false positives and false negatives. When adding new recognizers, we try to balance the effect of each recognizer on the entire system.
A recognizer with many false positives would affect the system's usability, while a recognizer with many false negatives might require more work before it can be integrated. For reproducibility purposes, it is best to note how the recognizer's accuracy was tested, and on which datasets.
For tools and documentation on evaluating and analyzing recognizers, refer to the [presidio-research Github repository](https://github.com/microsoft/presidio-research).

@@ -23,7 +23,7 @@ Make sure your recognizer doesn't take too long to process text. Anything above

### Environment

When adding new recognizers that have 3rd party dependencies, make sure that the new dependencies don't interfere with Presidio's dependencies.
In the case of a conflict, one can create an isolated model environment (outside the main presidio-analyzer process) and implement a [`RemoteRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/remote_recognizer.py) on the presidio-analyzer side to interact with the model's endpoint.

## Recognizer Types
@@ -34,7 +34,7 @@ Generally speaking, there are three types of recognizers:

A deny list is a list of words that should be removed during text analysis. For example, it can include a list of titles (`["Mr.", "Mrs.", "Ms.", "Dr."]` to detect a "Title" entity.)

See [this documentation](adding_recognizers.md) on adding a new recognizer. The [`PatternRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/pattern_recognizer.py) class has built-in support for a deny-list input.

### Pattern Based

@@ -57,13 +57,13 @@ Presidio currently uses [spaCy](https://spacy.io/) as a framework for text analy
`spaCy` provides decent results compared to state-of-the-art NER models, but with much better computational performance.
`spaCy`, `stanza` and `transformers` models could be trained from scratch, used in combination with pre-trained embeddings, or be fine-tuned.

In addition to those, it is also possible to use other ML models. In that case, a new `EntityRecognizer` should be created.
See an example using [Flair here](https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py).

#### Apply Custom Logic

In some cases, rule-based logic provides reasonable ways for detecting entities.
The Presidio `EntityRecognizer` API allows you to use `spaCy` extracted features like lemmas, part of speech, dependencies and more to create your logic.
When integrating such logic into Presidio, a class inheriting from the [`EntityRecognizer`](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/entity_recognizer.py) should be created.

!!! attention "Considerations for selecting one option over another"
107 changes: 107 additions & 0 deletions docs/analyzer/index.md
@@ -58,6 +58,113 @@ see [Installing Presidio](../installation.md).
curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' -H "Content-Type: application/json" -X POST http://localhost:3000/analyze
```

## Main concepts

Presidio analyzer is a set of tools that are used to detect entities in text. The main object in Presidio Analyzer is the `AnalyzerEngine`. In the following section we'll describe the main concepts in Presidio Analyzer.

This simplified class diagram shows the main classes in Presidio Analyzer:

```mermaid
classDiagram
direction LR
class RecognizerResult {
+str entity_type
+float score
+int start
+int end
}
class EntityRecognizer {
+str name
+int version
+List[str] supported_entities
+analyze(text, entities) List[RecognizerResult]
}
class RecognizerRegistry {
+add_recognizer(recognizer) None
+remove_recognizer(recognizer) None
+load_predefined_recognizers() None
+get_recognizers() List[EntityRecognizer]
}
class NlpEngine {
+process_text(text, language) NlpArtifacts
+process_batch(texts, language) Iterator[NlpArtifacts]
}
class ContextAwareEnhancer {
+enhance_using_context(text, recognizer_results) List[RecognizerResult]
}
class AnalyzerEngine {
+NlpEngine nlp_engine
+RecognizerRegistry registry
+ContextAwareEnhancer context_aware_enhancer
+analyze(text: str, language) List[RecognizerResult]
}
NlpEngine <|-- SpacyNlpEngine
NlpEngine <|-- TransformersNlpEngine
NlpEngine <|-- StanzaNlpEngine
AnalyzerEngine *-- RecognizerRegistry
AnalyzerEngine *-- NlpEngine
AnalyzerEngine *-- ContextAwareEnhancer
RecognizerRegistry o-- "0..*" EntityRecognizer
ContextAwareEnhancer <|-- LemmaContextAwareEnhancer
%% Defining styles
style RecognizerRegistry fill:#E6F7FF,stroke:#005BAC,stroke-width:2px
style NlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px
style SpacyNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px
style TransformersNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px
style StanzaNlpEngine fill:#FFF5E6,stroke:#FFA500,stroke-width:2px
style ContextAwareEnhancer fill:#E6FFE6,stroke:#008000,stroke-width:2px
style LemmaContextAwareEnhancer fill:#E6FFE6,stroke:#008000,stroke-width:2px
style EntityRecognizer fill:#F5F5DC,stroke:#8B4513,stroke-width:2px
style RecognizerResult fill:#FFF0F5,stroke:#FF69B4,stroke-width:2px
```

### `RecognizerResult`

A `RecognizerResult` holds the type and span of a PII entity.
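As a hedged illustration, the essence of the class can be sketched as a plain dataclass (field set simplified; the real `RecognizerResult` in `presidio_analyzer` carries additional metadata such as analysis explanations):

```python
from dataclasses import dataclass

@dataclass
class SimpleRecognizerResult:
    """Simplified stand-in for Presidio's RecognizerResult."""
    entity_type: str  # e.g. "PERSON" or "PHONE_NUMBER"
    start: int        # span start offset in the input text
    end: int          # span end offset (exclusive)
    score: float      # detection confidence, 0.0-1.0

text = "My name is Morris"
result = SimpleRecognizerResult(entity_type="PERSON", start=11, end=17, score=0.85)
# The span indexes back into the analyzed text: text[result.start:result.end]
```

Holding offsets rather than the matched string keeps results cheap and lets the anonymizer operate on the original text.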

### `EntityRecognizer`

An entity recognizer is an object in Presidio that is responsible for detecting entities in text. An entity recognizer can be a rule-based recognizer, a machine learning model, or a combination of both.

### `PatternRecognizer`

A `PatternRecognizer` is a type of entity recognizer that uses regular expressions to detect entities in text. One can create new `PatternRecognizer` objects by providing a list of regular expressions, context words, validation and invalidation logic and additional parameters that facilitate the detection of entities.
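To illustrate the pattern-plus-validation idea without depending on Presidio's API, here is a standard-library sketch (the function names and the 16-digit pattern are illustrative) in which a regex proposes candidates and a Luhn checksum validates or invalidates them:

```python
import re

def luhn_checksum_ok(digits):
    """Validate a digit string with the Luhn algorithm (used by card numbers)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_like_numbers(text):
    """The regex proposes 16-digit candidates; the Luhn check accepts or rejects."""
    candidates = re.finditer(r"\b\d{16}\b", text)
    return [m.group(0) for m in candidates if luhn_checksum_ok(m.group(0))]

hits = find_card_like_numbers("valid: 4111111111111111, invalid: 1234567812345678")
```

In a `PatternRecognizer`, this split corresponds to the regex patterns on one hand and the validation/invalidation hooks on the other.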

### `AnalyzerEngine`

The `AnalyzerEngine` is the main object in Presidio Analyzer that is responsible for detecting entities in text. The `AnalyzerEngine` can be configured in various ways to fit the specific needs of the user.

### `RecognizerRegistry`

The `RecognizerRegistry` is a registry that contains all the entity recognizers that are available in Presidio. The `AnalyzerEngine` uses the `RecognizerRegistry` to detect entities in text.

### `NlpEngine`

An NLP Engine is an object that holds the NLP model that is used by the `AnalyzerEngine` to parse the input text and extract different features from it, such as tokens, lemmas, entities, and more. Note that Named Entity Recognition (NER) models can be added in two ways to Presidio: One is through the `NlpEngine` object, and the other is through a new `EntityRecognizer` object. By creating a Named Entity Recognition model through the `NlpEngine`, the named entities will be available to the different modules in Presidio. Furthermore, the `NlpEngine` object supports a batch mode (i.e., processing multiple texts at once) which allows for faster processing of large amounts of text.
It is possible to mix multiple NER models in Presidio, for instance, one model as the `NlpEngine` and others as additional `EntityRecognizer` objects.

Presidio has an off-the-shelf support for multiple NLP packages, such as spaCy, stanza, and huggingface. The simplest way to integrate a model from these packages is through the `NlpEngine`. More information on this [can be found in the NlpEngine documentation](customizing_nlp_models.md). The samples gallery has several examples of leveraging NER models as new `EntityRecognizer` objects. For example, [flair](../samples/python/flair_recognizer.py) and [spanmarker](../samples/python/span_marker_recognizer.py).
For a detailed flow of Named Entities within presidio, see the diagram [in this document](nlp_engines/transformers.md#how-ner-results-flow-within-presidio).

### `ContextAwareEnhancer`

The `ContextAwareEnhancer` is a module that enhances the detection of entities by using the context of the text. The `ContextAwareEnhancer` can be used to improve the detection of entities that are dependent on the context of the text, such as dates, locations, and more. The default implementation is the `LemmaContextAwareEnhancer` which uses the lemmas of the tokens in the text to enhance the detection of entities. Note that it's possible (and sometimes recommended) to create custom `ContextAwareEnhancer` objects to fit the specific needs of the user, for example if the context should support more than one word, which is currently not supported by the default Lemma based enhancer.
More information on this can be found [in this sample](../samples/python/customizing_presidio_analyzer.ipynb).
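As a toy, standard-library sketch of the general idea (the function, window size and boost value are illustrative, not how `LemmaContextAwareEnhancer` is implemented — the real enhancer operates on NLP artifacts such as lemmas):

```python
def enhance_with_context(text, span_start, span_end, score,
                         context_words, window=30, boost=0.35, max_score=1.0):
    """Boost a detection's score when a context word appears near the span."""
    left = max(0, span_start - window)
    # Look at a fixed-size character window on both sides of the detected span.
    neighborhood = (text[left:span_start] + text[span_end:span_end + window]).lower()
    if any(word in neighborhood for word in context_words):
        return min(max_score, score + boost)
    return score

text = "My phone number is 555-1234"
boosted = enhance_with_context(text, 19, 27, 0.4, context_words=["phone", "call"])
unchanged = enhance_with_context("code 555-1234", 5, 13, 0.4, context_words=["phone"])
```

The same ambiguous pattern gets a higher score when a supporting word like "phone" appears nearby, and is left untouched otherwise.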

## Creating PII recognizers

Presidio analyzer can be easily extended to support additional PII entities.
14 changes: 6 additions & 8 deletions docs/analyzer/nlp_engines/transformers.md
@@ -8,7 +8,9 @@ Presidio leverages other types of information from spaCy such as tokens, lemmas
Therefore, the pipeline returns both the NER model results and results from other pipeline components.

## How NER results flow within Presidio

This diagram describes the flow of NER results within Presidio, and the relationship between the `TransformersNlpEngine` component and the `TransformersRecognizer` component:

```mermaid
sequenceDiagram
AnalyzerEngine->>TransformersNlpEngine: Call engine.process_text(text) <br>to get model results
@@ -55,7 +57,6 @@ Then, also download a spaCy pipeline/model:
python -m spacy download en_core_web_sm
```


### Configuring the NER pipeline

Once the models are downloaded, one option to configure them is to create a YAML configuration file.
@@ -193,7 +194,7 @@ Once the configuration file is created, it can be used to create a new `Transfor
print(results_english)
```

#### Explaining the configuration options

- `model_name.spacy` is a name of a spaCy model/pipeline, which would wrap the transformers NER model. For example, `en_core_web_sm`.
- The `model_name.transformers` is the full path for a huggingface model. Models can be found on the [HuggingFace Models Hub](https://huggingface.co/models?pipeline_tag=token-classification). For example, `obi/deid_roberta_i2b2`.
@@ -208,19 +209,16 @@ The `ner_model_configuration` section contains the following parameters:
- `low_confidence_score_multiplier`: A multiplier to apply to the score of entities with low confidence.
- `low_score_entity_names`: A list of entity types to apply the low confidence score multiplier to.


!!! note "Defining the entity mapping"
    To be able to create the `model_to_presidio_entity_mapping` dictionary, it is advised to check which classes the model is able to predict.
    This can be found on the huggingface hub site for the model in some cases. In others, one can check the model's `config.json` under `id2label`.
    For example, for `bert-base-NER-uncased`, it can be found here: <https://huggingface.co/dslim/bert-base-NER-uncased/blob/main/config.json>.
    Note that most NER models add a prefix to the class (e.g. `B-PER` for class `PER`). When creating the mapping, do not add the prefix.
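As an illustrative (hypothetical) fragment for a model predicting `PER`, `LOC` and `ORG`, the mapping could look like:

```yaml
ner_model_configuration:
  model_to_presidio_entity_mapping:
    PER: PERSON        # note: no B-/I- prefix on the model-side keys
    LOC: LOCATION
    ORG: ORGANIZATION
```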


See more information on parameters on the [spacy-huggingface-pipelines Github repo](https://github.com/explosion/spacy-huggingface-pipelines#token-classification).

Once created, see [the NLP configuration documentation](../customizing_nlp_models.md#configure-presidio-to-use-the-new-model) for more information.


### Training your own model

!!! note "Note"
Expand Down