Using Presidio with Huggingface support #1083
Actually, after checking the source code more, it's actually not clear to me how one is supposed to use the Using the |
Hi @Matei9721, thanks for your feedback! I can understand why this causes confusion. We initially wanted to support Huggingface the same way we support Stanza, but ran into some issues. In the future, the plan is to integrate the new spacy-huggingface-pipelines package for a more seamless integration. The easiest path forward, IMHO, is to use the TransformerRecognizer in parallel to the default SpacyNlpEngine. In our demo website's code, you'll find a method which does this. It uses the small spaCy model to reduce the overhead (but maintain capabilities like lemmas), and removes the

from typing import Tuple

import spacy
from presidio_analyzer import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, NlpEngineProvider


def create_nlp_engine_with_transformers(
    model_path: str,
) -> Tuple[NlpEngine, RecognizerRegistry]:
    """
    Instantiate an NlpEngine with a TransformersRecognizer and a small spaCy model.
    The TransformersRecognizer would return results from Transformers models, the spaCy model
    would return NlpArtifacts such as POS and lemmas.
    :param model_path: HuggingFace model path.
    """
    # transformers_rec is the helper package shipped with the demo website's code
    from transformers_rec import (
        STANFORD_COFIGURATION,
        BERT_DEID_CONFIGURATION,
        TransformersRecognizer,
    )

    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()

    if not spacy.util.is_package("en_core_web_sm"):
        spacy.cli.download("en_core_web_sm")

    # Using a small spaCy model + a HF NER model
    transformers_recognizer = TransformersRecognizer(model_path=model_path)

    if model_path == "StanfordAIMI/stanford-deidentifier-base":
        transformers_recognizer.load_transformer(**STANFORD_COFIGURATION)
    elif model_path == "obi/deid_roberta_i2b2":
        transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)
    else:
        print("Warning: Model has no configuration, loading default.")
        transformers_recognizer.load_transformer(**BERT_DEID_CONFIGURATION)

    # Use small spaCy model, no need for both spacy and HF models
    # The transformers model is used here as a recognizer, not as an NlpEngine
    nlp_configuration = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }

    registry.add_recognizer(transformers_recognizer)
    registry.remove_recognizer("SpacyRecognizer")

    nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

    return nlp_engine, registry

Hope this helps. We'll work on making this easier going forward. |
Thank you for your swift reply @omri374, that's exactly what I ended up following! I just wanted to make sure that I am doing it in the "best" way possible and not reinventing the wheel. :) Looking forward to the spacy-huggingface-pipelines addition, as it does seem to streamline the process. I will close the issue as my questions were answered and it's clear how to approach the task now! |
Dear @omri374 & @Matei9721 , Sorry to re-open this issue. The answers are really helpful. Am I wrong and can I use the Thanks in advance ! |
Hi @LSD-98, you are correct. There are essentially two flows here, and we're also about to improve the experience in the upcoming weeks, but in essence, the flows are:
In essence, flow 1:
sequenceDiagram
    AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
    SpacyNlpEngine->>NamedEntityRecognitionModel: call spaCy NER model
    NamedEntityRecognitionModel->>SpacyNlpEngine: return PII entities
    SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<br>(Entities, lemmas, tokens etc.)
    Note over AnalyzerEngine: Call all recognizers
    AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
    Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
    SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]<br>based on entities
Flow 2:
sequenceDiagram
    Note over AnalyzerEngine: Call all recognizers, <br>including <br>MyNerModelRecognizer
    AnalyzerEngine->>MyNerModelRecognizer: call .analyze
    MyNerModelRecognizer->>transformers_model: Call transformers model
    transformers_model->>MyNerModelRecognizer: get NER/PII entities
    MyNerModelRecognizer->>AnalyzerEngine: Return List[RecognizerResult] <br>of PII entities
Where |
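To make the difference between the two flows concrete, here is a toy sketch in plain Python. None of these classes are Presidio's; `ToyNlpEngine`, `ArtifactsRecognizer`, `ModelRecognizer` and the keyword-based "models" are hypothetical stand-ins that only mirror the call order in the diagrams.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end)


@dataclass
class NlpArtifacts:
    """Toy stand-in for the artifacts an NLP engine hands to recognizers."""
    tokens: List[str]
    entities: List[Span] = field(default_factory=list)


def find_word(text: str, word: str, label: str) -> List[Span]:
    """Fake 'model': tags the first occurrence of a single known word."""
    start = text.find(word)
    return [(label, start, start + len(word))] if start >= 0 else []


class ToyNlpEngine:
    """Plays the role of SpacyNlpEngine: tokenizes and runs its own NER."""

    def __init__(self, ner: Callable[[str], List[Span]]):
        self.ner = ner

    def process_text(self, text: str) -> NlpArtifacts:
        return NlpArtifacts(tokens=text.split(), entities=self.ner(text))


class ArtifactsRecognizer:
    """Flow 1: only reads entities the NLP engine already computed."""

    def analyze(self, text: str, artifacts: NlpArtifacts) -> List[Span]:
        return list(artifacts.entities)


class ModelRecognizer:
    """Flow 2: ignores the engine's entities and calls its own model."""

    def __init__(self, model: Callable[[str], List[Span]]):
        self.model = model

    def analyze(self, text: str, artifacts: NlpArtifacts) -> List[Span]:
        return self.model(text)


def analyze(text: str, engine: ToyNlpEngine, recognizers: list) -> List[Span]:
    """Toy analyzer: run the engine once, then call every recognizer."""
    artifacts = engine.process_text(text)
    results: List[Span] = []
    for recognizer in recognizers:
        results.extend(recognizer.analyze(text, artifacts))
    return results


engine = ToyNlpEngine(ner=lambda t: find_word(t, "Paris", "LOCATION"))
text = "Alice lives in Paris"

flow1 = analyze(text, engine, [ArtifactsRecognizer()])
flow2 = analyze(text, engine, [ModelRecognizer(lambda t: find_word(t, "Alice", "PERSON"))])
print(flow1, flow2)
```

In flow 2 the engine still runs (so tokens and lemmas remain available), but its entities are not used once the artifacts-based recognizer is removed from the list.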
Reopening to improve logic and docs. Will be fixed in #1159 |
@omri374 This seems to work for the current model and for English. However, when I want to use French HF models, I get the following error.
These are the changes I made:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from transformers_recognizer import TransformersRecognizer
import spacy
from presidio_analyzer.nlp_engine import NlpEngineProvider

FR_MODEL_CONF = {
    'PRESIDIO_SUPPORTED_ENTITIES': ['LOCATION', 'PERSON', 'ORGANIZATION', 'DATE_TIME', 'NRP'],
    'DEFAULT_MODEL_PATH': 'Jean-Baptiste/camembert-ner-with-dates',
    'DATASET_TO_PRESIDIO_MAPPING': {'DATE': 'DATE_TIME', 'MISC': 'NRP', 'PER': 'PERSON', 'ORG': 'ORGANIZATION', 'LOC': 'LOCATION'},
    "MODEL_TO_PRESIDIO_MAPPING": {'DATE': 'DATE_TIME', 'MISC': 'NRP', 'PER': 'PERSON', 'ORG': 'ORGANIZATION', 'LOC': 'LOCATION'},
    "CHUNK_OVERLAP_SIZE": 40,
    "CHUNK_SIZE": 600,
    "ID_SCORE_MULTIPLIER": 0.4,
    "ID_ENTITY_NAME": "ID",
}

registry = RecognizerRegistry()
registry.load_predefined_recognizers()

if not spacy.util.is_package("fr_core_news_sm"):
    spacy.cli.download("fr_core_news_sm")

supported_entities = FR_MODEL_CONF.get("PRESIDIO_SUPPORTED_ENTITIES")
model = "Jean-Baptiste/camembert-ner-with-dates"

transformers_recognizerr = TransformersRecognizer(model_path=model, supported_entities=supported_entities)
transformers_recognizerr.load_transformer(**FR_MODEL_CONF)

registry.add_recognizer(transformers_recognizerr)
registry.remove_recognizer("SpacyRecognizer")

nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "fr", "model_name": "fr_core_news_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=nlp_configuration).create_engine()

analyzer = AnalyzerEngine(registry=registry, nlp_engine=nlp_engine)
results = analyzer.analyze(
    text="Je m'appelle jean-baptiste et j'habite à montréal depuis fevr 2012",
    language="fr",
    entities=['LOCATION', 'PERSON', 'ORGANIZATION', 'DATE_TIME', 'NRP'],
    return_decision_process=True,
)
for result in results:
    print(result)
    print(result.analysis_explanation) |
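As a side note on the CHUNK_SIZE and CHUNK_OVERLAP_SIZE settings in the configuration above: they suggest that long inputs are split into overlapping windows before being sent to the model. The sketch below shows one way such windows could be computed. It assumes character-based windows, and `split_into_chunks` is a hypothetical helper, not the recognizer's actual implementation.

```python
from typing import List, Tuple


def split_into_chunks(text_length: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of overlapping windows.

    Assumption: windows are measured in characters and each new window
    starts `chunk_size - overlap` characters after the previous one.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if text_length <= chunk_size:
        return [(0, text_length)]

    step = chunk_size - overlap
    chunks: List[Tuple[int, int]] = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunks.append((start, end))
        if end == text_length:
            break  # last window reached the end of the text
        start += step
    return chunks


# Values taken from the configuration above; the text length is made up.
chunks = split_into_chunks(text_length=1500, chunk_size=600, overlap=40)
print(chunks)
```

The overlap gives an entity that straddles a window boundary a chance to appear whole in at least one window, at the cost of possibly duplicated predictions that have to be merged afterwards.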
Many thanks @omri374 for the reply, very clear.
I tried the same thing last week and had the exact same issue. I did not manage to solve it and moved to another project. I assume there will be an easier way to use HF models when #1159 is pushed! |
Make sure you pass the language argument to the |
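One possible reason language matters here, sketched under assumptions: Presidio recognizers carry a supported language that defaults to English, and the registry only returns recognizers whose language matches the request. The classes below are toy stand-ins (`ToyRecognizer` and `get_recognizers` are hypothetical, not Presidio's code) that illustrate how a recognizer left on the default would be skipped for a French request.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRecognizer:
    name: str
    # Assumed default, mirroring a recognizer that was not told its language
    supported_language: str = "en"


def get_recognizers(recognizers: List[ToyRecognizer], language: str) -> List[ToyRecognizer]:
    """Keep only recognizers registered for the requested language."""
    matches = [r for r in recognizers if r.supported_language == language]
    if not matches:
        raise ValueError(f"No recognizers found for language '{language}'")
    return matches


default_lang = ToyRecognizer("transformers")  # silently registered as "en"
french = ToyRecognizer("transformers", supported_language="fr")

# A French request only sees the recognizer that declared "fr".
names = [r.name for r in get_recognizers([default_lang, french], "fr")]
print(names)

# With only the default-language recognizer, the French request fails.
try:
    get_recognizers([default_lang], "fr")
    failed = False
except ValueError:
    failed = True
print(failed)
```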
Hi, currently I am using presidio with Spacy and Stanza by creating an nlp_engine using NlpEngineProvider and passing it the correct model in the config. I was planning on adding support for HuggingFace transformer models, but I was a bit confused by the fact that there are 2 ways of doing this:
- TransformerRecognizer
- TransformerNlpEngine
As far as I understand, if you use the recognizer then you apply the recognizer on top of the usual e.g. Spacy NER pipeline so you will get results from both Spacy and HuggingFace model. On the other hand, using the TransformerNlpEngine substitutes the Spacy NER module in the pipeline.
In this example: https://microsoft.github.io/presidio/samples/python/transformers_recognizer/ it is shown how to use the TransformersRecognizer with a specific configuration given as an example in configuration.py where you can do the MODEL_TO_PRESIDIO_MAPPING. If you are to use the TransformerNlpEngine, how are you supposed to do the mapping between model types and presidio types similar to the ones done in TransformerRecognizer?
Is my understanding above right and if yes, is there a way to create an AnalyzerEngine with a TransformerNlpEngine with the same configuration as a TransformerRecognizer?
Thanks for the help!
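For readers unfamiliar with what a MODEL_TO_PRESIDIO_MAPPING does, here is a minimal sketch of the idea: translate the labels a model emits into Presidio entity names and drop labels with no mapping. The mapping values are the ones quoted elsewhere in this thread; `map_predictions` is a hypothetical helper, and the predictions are hand-written samples shaped like aggregated Hugging Face token-classification output, not real model results.

```python
from typing import Dict, List

# Label mapping as quoted in this thread for a French NER model
MODEL_TO_PRESIDIO_MAPPING: Dict[str, str] = {
    "DATE": "DATE_TIME",
    "MISC": "NRP",
    "PER": "PERSON",
    "ORG": "ORGANIZATION",
    "LOC": "LOCATION",
}


def map_predictions(predictions: List[dict], mapping: Dict[str, str]) -> List[dict]:
    """Rename model labels to Presidio entity types; skip unmapped labels."""
    mapped = []
    for pred in predictions:
        entity_type = mapping.get(pred["entity_group"])
        if entity_type is None:
            continue  # label the mapping does not know about
        mapped.append(
            {
                "entity_type": entity_type,
                "start": pred["start"],
                "end": pred["end"],
                "score": pred["score"],
            }
        )
    return mapped


# Hand-written sample predictions (scores are invented for illustration)
sample_predictions = [
    {"entity_group": "PER", "score": 0.99, "start": 13, "end": 26},
    {"entity_group": "LOC", "score": 0.98, "start": 41, "end": 49},
    {"entity_group": "UNKNOWN", "score": 0.50, "start": 0, "end": 2},
]

mapped = map_predictions(sample_predictions, MODEL_TO_PRESIDIO_MAPPING)
for item in mapped:
    print(item)
```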