Using Flair model with PySpark dataframes #1499
Unanswered
jeffreyhebrank
asked this question in Q&A
Replies: 1 comment
-
Hi, I'm not an expert in Spark, but I can offer a few suggestions to try. It looks like Flair is not being properly serialized and broadcast to all the nodes. To check whether this is indeed a serialization issue, I would run the following test:

```python
from pyspark.serializers import CloudPickleSerializer

# FlairRecognizer here is the class defined in your notebook
serializer = CloudPickleSerializer()
try:
    serializer.dumps(FlairRecognizer())
    print("Serialization successful!")
except Exception as e:
    print(f"Serialization failed: {e}")
```
One option that could work is to instantiate the FlairRecognizer inside the worker function, instead of broadcasting it:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry.recognizer_registry import RecognizerRegistry
from presidio_anonymizer.entities import OperatorConfig
from mypackage.flair_recognizer import FlairRecognizer


def analyze_text(text):
    # Broadcast variables created on the driver (see below)
    analyzer = broadcasted_analyzer.value
    anonymizer = broadcasted_anonymizer.value

    # Add FlairRecognizer on the worker and drop the default spaCy recognizer
    registry = RecognizerRegistry()
    registry.add_recognizer(FlairRecognizer())
    registry.remove_recognizer("SpacyRecognizer")
    analyzer.registry = registry

    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})
        },
    )
    return anonymized_results.text
```

Hope this helps!
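The snippet above assumes that `broadcasted_analyzer` and `broadcasted_anonymizer` already exist and that `analyze_text` gets applied to your dataframe somewhere. For completeness, here is a rough sketch of how the driver side could look; I haven't tested this on Synapse, and `input_df` and its `text` column are placeholders rather than names from your code:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# Driver side: build the engines once and broadcast them to the executors.
# `spark` is the SparkSession provided by the notebook environment.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
broadcasted_analyzer = spark.sparkContext.broadcast(analyzer)
broadcasted_anonymizer = spark.sparkContext.broadcast(anonymizer)

# Wrap the worker function in a UDF and apply it to the text column.
anonymize_udf = udf(analyze_text, StringType())
anonymized_df = input_df.withColumn("anonymized_text", anonymize_udf("text"))
anonymized_df.show(truncate=False)
```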
-
Hi Team,
I'm trying to combine these two samples / tutorials:
- Flair Recognizer: https://github.com/microsoft/presidio/blob/main/docs/samples/python/flair_recognizer.py
- Anonymize PII using Presidio on Spark: https://microsoft.github.io/presidio/samples/deployments/spark/
I'm getting an error that looks related to broadcasting the FlairRecognizer (full stack trace attached below):
File "/opt/spark/python/lib/pyspark.zip/pyspark/broadcast.py", line 259, in load
gc.enable()
AttributeError: Can't get attribute 'FlairRecognizer' on <module 'pyspark.daemon' from '/opt/spark/python/lib/pyspark.zip/pyspark/daemon.py'>
My environment is Apache Spark 3.4 and Python 3.10, running on a Synapse Analytics workspace.
Here is my code (FlairRecognizer class defined immediately above this in my notebook). Any help would be appreciated!
fullStackTrace.txt