Missing codeless language configuration #1182

Open
loeschzwerg opened this issue Oct 2, 2023 · 2 comments
Labels
analyzer, bug, docker, good first issue

Comments

@loeschzwerg

Is your feature request related to a problem? Please describe.

I don't see a way to start the presidio-analyzer with a language configuration other than the hardcoded one (i.e. self.engine = AnalyzerEngine()) in server mode.

For example, it is impossible to start the analyzer in language mode "de" (python3 -m app). Maybe this is intentional, or it is because German is not officially supported, but in that case I would rather see a disclaimer than have to modify code I would first need to understand. The code to handle this configuration of the AnalyzerEngine already appears to be present in presidio_analyzer/analyzer_engine.py.

Describe the solution you'd like
I would like to have either or both:

  • configuration file, similar to conf/default.yml
  • CLI params like --supported-languages en,de or --supported-languages ALL
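A minimal sketch of how the proposed CLI parameter could be parsed. The --supported-languages flag is the hypothetical option suggested above, not something app.py currently accepts, and the helper name is made up for illustration. The special value "ALL" is not handled here.

```python
import argparse
from typing import List, Optional

# Mirrors the fallback shown in AnalyzerEngine.__init__ in this issue.
DEFAULT_LANGUAGES = ["en"]


def parse_supported_languages(argv: Optional[List[str]] = None) -> List[str]:
    """Parse a hypothetical --supported-languages flag into a list of codes."""
    parser = argparse.ArgumentParser(description="presidio-analyzer server (sketch)")
    parser.add_argument(
        "--supported-languages",
        default=",".join(DEFAULT_LANGUAGES),
        help="Comma-separated language codes, e.g. en,de",
    )
    args = parser.parse_args(argv)
    # Split on commas, trim whitespace, drop empty entries, keep the order.
    languages = [code.strip() for code in args.supported_languages.split(",")]
    languages = [code for code in languages if code]
    # Fall back to the default when nothing usable was given.
    return languages or list(DEFAULT_LANGUAGES)
```

The resulting list would then be passed to AnalyzerEngine(supported_languages=...) in app.py. Each non-English language would presumably also need a matching NLP model to be configured, which this sketch does not cover.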

Describe alternatives you've considered
The only apparent alternative for starting presidio with a different language is to modify app.py. However, this still means the user depends on understanding and modifying code, which as far as I can tell should be unnecessary here.

Additional context
The predefined recognizers registry contains an entry for language "ALL", which I interpret as being available for all languages. As such, even languages without language-specific predefined recognizers should be able to leverage these. Languages should therefore be configurable from the CLI instead of always falling back to "en".

# app.py
    def __init__(self):
        fileConfig(Path(Path(__file__).parent, LOGGING_CONF_FILE))
        self.logger = logging.getLogger("presidio-analyzer")
        self.logger.setLevel(os.environ.get("LOG_LEVEL", self.logger.level))
        self.app = Flask(__name__)
        self.logger.info("Starting analyzer engine")
        self.engine = AnalyzerEngine()         # <<<
        self.logger.info(WELCOME_MESSAGE)

# presidio_analyzer/analyzer_engine.py
    def __init__(
        self,
        registry: RecognizerRegistry = None,
        nlp_engine: NlpEngine = None,
        app_tracer: AppTracer = None,
        log_decision_process: bool = False,
        default_score_threshold: float = 0,
        supported_languages: List[str] = None,    # <<<
        context_aware_enhancer: Optional[ContextAwareEnhancer] = None,
    ):
        if not supported_languages:         #<<<
            supported_languages = ["en"]
@omri374
Contributor

omri374 commented Oct 4, 2023

Thanks for raising this. The app.py is not part of the package, and is meant to be customized. We will look into your suggestion, and are also open to community contributions.

Some parameters, such as language, are easy to configure, but others require the specific AnalyzerEngine pipeline to be configured through code (for example, if you integrate new types of recognizers, or a custom ContextAwareEnhancer). In other words, the specific instance in app.py is meant to be customized.
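As one possible customization of app.py, the language list could be read from the environment, in the same way app.py already reads LOG_LEVEL. This is only a sketch: the SUPPORTED_LANGUAGES variable name and the languages_from_env helper are assumptions, not existing presidio settings, and the Server class is reduced to the one relevant line.

```python
import os
from typing import List, Mapping

# Hypothetical variable name, chosen for this sketch only.
ENV_VAR = "SUPPORTED_LANGUAGES"


def languages_from_env(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Read a comma-separated language list, falling back to ["en"]."""
    raw = environ.get(ENV_VAR, "en")
    languages = [code.strip() for code in raw.split(",") if code.strip()]
    return languages or ["en"]


class Server:
    """Stripped-down stand-in for the Server class in app.py."""

    def __init__(self):
        # Imported here so the helper above can be used without presidio installed.
        from presidio_analyzer import AnalyzerEngine

        # Non-English languages also need a matching NLP model; not shown here.
        self.engine = AnalyzerEngine(supported_languages=languages_from_env())
```

With this in place, a call such as `SUPPORTED_LANGUAGES=en,de python3 -m app` would select the languages without touching the code again.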

omri374 added the bug, good first issue, analyzer, and docker labels on Oct 4, 2023
@TheIndra55
Contributor

Would it make sense to add this to the documentation? I just ran into this while adding a new model for my language to the YAML file.

I got confused since the analyzer writes during startup:

Created NLP engine: spacy. Loaded models: ['en', 'nl']

But the app.py needs to be edited to actually use these languages.
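For anyone hitting the same mismatch, here is a rough sketch of an edit that keeps the engine's languages in sync with the loaded models. It derives supported_languages from the same NLP configuration that is used to create the NLP engine. The helper names and the Dutch model name are assumptions for illustration; NlpEngineProvider and the supported_languages parameter are the existing presidio pieces.

```python
from typing import Any, Dict, List


def languages_from_nlp_config(nlp_configuration: Dict[str, Any]) -> List[str]:
    """Collect the lang_code of every model listed in the NLP configuration."""
    return [model["lang_code"] for model in nlp_configuration.get("models", [])]


def build_engine(nlp_configuration: Dict[str, Any]):
    """Create an AnalyzerEngine whose languages match the configured models."""
    # Imported here so the helper above can be used without presidio installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(nlp_configuration=nlp_configuration)
    nlp_engine = provider.create_engine()
    return AnalyzerEngine(
        nlp_engine=nlp_engine,
        supported_languages=languages_from_nlp_config(nlp_configuration),
    )


# Same shape as the YAML file, written as a dict for brevity.
EXAMPLE_CONFIGURATION = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "en", "model_name": "en_core_web_lg"},
        # Model name is an assumption; use whichever model was added to the YAML.
        {"lang_code": "nl", "model_name": "nl_core_news_md"},
    ],
}
```

In app.py, `self.engine = AnalyzerEngine()` would then become something like `self.engine = build_engine(EXAMPLE_CONFIGURATION)`.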
