[ISSUE] Backends using TupleFilesystemStoreBackend constantly request new Azure credentials #10896
Comments
Hi @jschra, what version are you using? Custom store backends are no longer supported, and the tuple store backend code has been removed in newer versions. Unfortunately, we won't be implementing this feature. However, you can still configure backends with GX, but setup and management are now the user's responsibility.
I am using 1.3.3, so the latest. Below you can also see my great_expectations.yml file:
analytics_enabled: null
checkpoint_store_name: checkpoint_store
config_variables_file_path: null
config_version: 4
data_context_id: null
data_docs_sites:
  TNE Expectations 2025:
    class_name: SiteBuilder
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: \$web
expectations_store_name: expectations_store
fluent_datasources: {}
plugins_directory: null
progress_bars: null
stores:
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: checkpoints
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: expectations
  validation_definition_store:
    class_name: ValidationDefinitionStore
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: validation-definitions
  validation_results_store:
    class_name: ValidationResultsStore
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: validation-results
validation_results_store_name: validation_results_store
What do you mean by no longer supporting custom backends? Are all the hyperscaler-backed options deprecated (S3, Azure Storage Account etc.)?
I also pulled the latest dev version of great_expectations and the backend is still in there, so I find it highly doubtful that it is supposed to have been removed.
To clarify, users can still store their GX configurations in Azure, but they must manage it themselves. We currently only support writing files to disk using the standard FileSystem backend provider; what happens beyond that is up to the user. This decision allows us to focus on GX as a data quality and trust library rather than maintaining a limited set of configuration management backends. While Azure Blob Storage may still exist in the codebase, it is not a supported backend, and we won't be making feature changes for it. If you're interested in contributing a fix for the if/else order block, I can discuss it with the team. However, a broader implementation like a singleton connection manager is not something we plan to support in the near future.
Understood and fair enough, it is a bit of scope creep, I agree. With that being said, I'd be happy to contribute the fix since it is literally just adjusting the order of the if/else block in the script I mentioned, so super easy fix!
Although to be fair, this does complicate the usage of GX in e.g. serverless services or remote agents, since such instances do not always have write permissions on their local file storage. If that is the default and only way of using GX in the future, it will definitely complicate things. I'd say there should at least be an alternative to capture outputs in memory and then handle them in whichever way you want.
Hi @jschra, it's definitely not the only way, just the default. Anyone without local write permissions can still write to any backend of their choice; they just need to configure it. We would still run into the same issue, since we only ever had a few backends, not all of them. I’ll make sure to pass this feedback to the team and double-check whether we can accept a contribution. However, it may not be possible, as I believe this part of the code is either slated for removal or should have already been removed entirely.
Are there or will there be guides on how one can set up such backends themselves? Specifically, how you can customize the GX setup to handle the storage of GX artifacts yourself? Because that would be very helpful in such a case.
Is your feature request related to a problem? Please describe.
We are using GX Core with our backends configured to an Azure Storage Account, with a separate container for each of the stores. Whenever we run our validation jobs, the logs show that connections to our backend are constantly reinitialised, and every time new credentials are requested via DefaultAzureCredential. The source code shows that there is an option to set these credentials as an environment variable (AZURE_CREDENTIAL) to prevent them from being requested again every time. However, the following if/else block, which goes over the authentication options, is in the wrong order, so the arm that uses the AZURE_CREDENTIAL passed beforehand is never reached and the code re-authenticates every single time.
Here is a snippet from tuple_store_backend.py:
As can be seen, one arm of the if/else block first checks for the presence of self.account_url, and the next arm checks for the combination of self.credential and self.account_url, which will never be reached. As a result, even when the credentials are set in the environment variable, the code always takes the account_url arm and regenerates the credentials using DefaultAzureCredential.
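The ordering problem described above can be illustrated with a simplified stand-in. This is only a sketch and not the actual GX source: the function names `pick_auth_current` and `pick_auth_reordered` and the returned labels are hypothetical, and the real code builds Azure clients rather than returning strings.

```python
from typing import Optional


def pick_auth_current(account_url: Optional[str], credential: Optional[str]) -> str:
    """Hypothetical stand-in for the arm order described in this issue."""
    if account_url:
        # Taken whenever account_url is set, with or without a credential.
        return "default_credential"
    elif credential and account_url:
        # Unreachable: a truthy account_url was already handled by the arm above.
        return "explicit_credential"
    raise ValueError("no authentication option configured")


def pick_auth_reordered(account_url: Optional[str], credential: Optional[str]) -> str:
    """Same arms with the more specific check first."""
    if credential and account_url:
        return "explicit_credential"
    elif account_url:
        return "default_credential"
    raise ValueError("no authentication option configured")


# Placeholder values, used for illustration only.
url = "https://example.blob.core.windows.net"
current_choice = pick_auth_current(url, "my-credential")
reordered_choice = pick_auth_reordered(url, "my-credential")
```

With both values set, the first function still falls into the default-credential arm, while the reordered one reaches the explicit-credential arm and only falls back to the default when no credential is given.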
Describe the solution you'd like
In general, I think GX could really benefit from implementing singleton connection managers for e.g. blob storage and S3, so that connections are created, stored and re-used throughout the codebase instead of reconnecting every single time. It is not just the credentials that are requested over and over again: if I am not mistaken, the connections to the blob resources are also constantly recreated, leading to unnecessary overhead and decreased performance.
However, at the very least I'd hope that you can correct the order of this if/else block (so that the arm that checks self.credential and self.account_url comes before the one that just requires self.account_url), so that we can prevent the logic from requesting credentials over and over.
Describe alternatives you've considered
So the singleton implementation I mentioned, but that'd be a looooot more work.
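For reference, the connection re-use idea could look roughly like the sketch below: a cached factory that hands back the same client object for the same account and container. Everything here is an assumption for illustration: `FakeBlobClient` and `get_client` are hypothetical names, the URL is a placeholder, and a real version would construct an Azure client instead of the stand-in class.

```python
from functools import lru_cache


class FakeBlobClient:
    """Hypothetical stand-in for a storage client; counts constructions."""

    instances = 0

    def __init__(self, account_url: str, container: str) -> None:
        FakeBlobClient.instances += 1
        self.account_url = account_url
        self.container = container


@lru_cache(maxsize=None)
def get_client(account_url: str, container: str) -> FakeBlobClient:
    # lru_cache keys on the arguments, so each (account_url, container)
    # pair is constructed once and re-used on later calls.
    return FakeBlobClient(account_url, container)


url = "https://example.blob.core.windows.net"  # placeholder account URL
first = get_client(url, "expectations")
second = get_client(url, "expectations")   # served from the cache
other = get_client(url, "checkpoints")     # different container, new client
```

Here `first` and `second` are the same object, and only two clients are constructed in total. A cache like this is process-wide, so credential expiry and thread safety would still need to be handled separately.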
Additional context
Screenshot of re-authentication logs. This happens when I call a checkpoint which validates one pandas in-memory dataframe using one expectation suite, with the rendering of our Data Docs website as the only related action. The website is also hosted on an Azure Storage Account.
Below you can also find the code we use for calling checkpoints and running validations. As is apparent from the TODO, the problem occurs when the checkpoint is called.
Our checkpoint is defined as follows (it uses some of our supporting methods, but it just couples a pandas in-memory frame to one expectation suite and sets that in the checkpoint, along with the update_data_docs action):