-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[BUG] Raise Error when can't deserialize configuration json from server, lazily load ef on CollectionModel, warn on api_key #4471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
c11b7ad
to
cd63d71
Compare
Stricter Error Handling and Lazy Embedding Function Loading for Collection Configuration This PR refactors the configuration loading and embedding function behavior throughout ChromaDB. It enforces that configuration deserialization errors now raise exceptions rather than being silently handled or replaced with default/empty values, preventing silent misconfigurations and client/server version mismatches. Embedding functions specified in configuration are now lazily constructed at embed time (e.g., add/query), allowing basic collection operations like list_collections and get_collection to work even if EF setups (e.g., API keys) are not present, and surfacing more precise errors only when embedding is triggered. Additionally, a deprecation warning is added when users provide API keys directly to embedding function constructors, guiding them to use environment variables for persistence. Key Changes: Affected Areas: Potential Impact: Functionality: Invalid or out-of-sync configuration now causes explicit errors rather than fallback; downstream systems must be ready for exceptions. Basic collection operations succeed even when embedding backends are misconfigured, but embedding-related collection operations will fail predictably. Performance: No major performance impact expected; lazy EF loading may reduce unnecessary object construction. Security: Deprecated use of direct API keys improves security by encouraging environment variable usage. Scalability: Improved error granularity benefits large installations with many collections having diverse embedding setups. Review Focus: Testing Needed• Verify that collections with malformed config fail at the correct point (add, query) with clear error messages Code Quality Assessmentchromadb/api/models/CollectionCommon.py: Constructor simplifies EF handling, embed() refactored for lazy loading and clarity; removes silent mismatches chromadb/utils/embedding_functions/*: Adds deprecation warnings for API key direct usage in all external EF providers; consistent pattern applied chromadb/test/configurations/test_collection_configuration.py: Significant new tests for invalid/edge cases; test logic is comprehensive and systematic chromadb/types.py: Systematically updated to enforce strict config handling, but removal of warnings import (fixed) and 'json' naming nit; otherwise, clean chromadb/api/collection_configuration.py: Rewritten EF builder logic for better error granularity; follows modern error-handling patterns Best PracticesError Handling: Configuration Management: Test Coverage: Security: Potential Issues• Downstream consumer code that relied on warning+empty config fallback may now break if not ready for exceptions This summary was automatically generated by @propel-code-bot |
b40998d
to
af51006
Compare
chromadb/types.py
Outdated
configuration = CollectionConfiguration( | ||
hnsw=None, | ||
spann=None, | ||
embedding_function=None, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[DataTypeCheck]
Now the fallback value for configuration in from_json()
constructs CollectionConfiguration(hnsw=None, spann=None, embedding_function=None)
. Confirm that this matches the expected constructor signature and field types. Add appropriate type annotations to CollectionConfiguration to clarify that None is valid for these fields, or provide documentation about acceptable values.
af51006
to
318d249
Compare
@@ -8,16 +8,13 @@ | |||
from uuid import UUID | |||
from enum import Enum | |||
from pydantic import BaseModel | |||
import warnings | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
The warnings
module is used in this file now (see usage in chromadb/api/collection_configuration.py
), but it is not imported. Add the import at the top of the file to prevent runtime errors.
warnings.warn( | ||
f"Embedding function {ef_config['name']} not found. Add @register_embedding_function decorator to the class definition.", | ||
stacklevel=2, | ||
) | ||
ef = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Changing from ValueError
to warnings.warn
and ef = None
when an embedding function is not found might mask critical configuration errors. If an embedding function is specified in the configuration but cannot be located, it typically indicates a misconfiguration that should prevent the system from proceeding as if everything is normal. This could lead to a collection being created or used without the intended embedding capabilities, with the issue only becoming apparent later. Raising an error ensures such issues are addressed immediately. This approach is also more consistent with the stricter error handling for configuration loading adopted elsewhere in these changes (e.g., in chromadb/types.py
).
Consider restoring the behavior of raising an error:
warnings.warn( | |
f"Embedding function {ef_config['name']} not found. Add @register_embedding_function decorator to the class definition.", | |
stacklevel=2, | |
) | |
ef = None | |
raise ValueError( | |
f"Embedding function {ef_config['name']} not found. Add @register_embedding_function decorator to the class definition." | |
) |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
318d249
to
7bb55b9
Compare
try: | ||
return load_collection_configuration_from_json(self.configuration_json) | ||
except Exception as e: | ||
warnings.warn( | ||
f"Server does not respond with configuration_json. Please update server: {e}", | ||
DeprecationWarning, | ||
stacklevel=2, | ||
) | ||
return CollectionConfiguration( | ||
hnsw=HNSWConfiguration(), | ||
spann=SpannConfiguration(), | ||
embedding_function=None, | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
Instead of returning a default configuration on deserialization error, now code raises ValueError
. Ensure that all consumers of this method are prepared to handle this exception appropriately. If unhandled, it may lead to crashes during deserialization.
chromadb/types.py
Outdated
try: | ||
configuration_json = json_map.get("configuration_json", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[DataTypeCheck]
Calls to load_collection_configuration_from_json
may now receive None
for configuration_json
. Before calling the loader, consider explicitly validating the presence of configuration_json
to provide a clearer error message.
0ccccea
to
0c840c7
Compare
2d2dbde
to
6e26c32
Compare
UserWarning, | ||
stacklevel=2, | ||
) | ||
ef = MalformedEmbeddingFunction( # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also gives us leeway for future embedding functions to make breaking changes. The embedding function being "malformed" (i.e. the config could not build an ef in the case where let's say a model name gets deprecated, an argument name is changed etc.) should not stop the user from the using the ef. Instead it should give them a warning, and now give them the ability to update the ef if necessary. see line 669.
I was also able to get into the sqlite, modify the config json str manually and trigger this path, and this gives users a way out, basically if they ever hit a "broken" state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also help with the discussion about what happens if the user has thousands of collections, all configured with diff api key env vars. this will allow them to not need to set them all to build their efs. Instead they will get the malformed ef, and not error for list_collections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we not instead defer ef construction to use, making it lazy, removing this misc type? This feels very messy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can, the downside with that is on collection construction, if theres no ef it auto gives default ef
] = ef.DefaultEmbeddingFunction(), # type: ignore |
so if a user does collection._embed, checks collection._embedding_function they'll see the default ef.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added this, ef's are now lazily loaded during _embed
users will see the configuration_json as expected, but it will not fail even if the ef is "misconfigured" until embedding time. this includes if the ef somehow gets out of sync, or if they dont set the env var at list_collections/get_collection time. thanks for the suggestion it feels a lot cleaner now
6e26c32
to
a76cbc4
Compare
a76cbc4
to
4a22684
Compare
0976e2c
to
fe4811f
Compare
fe4811f
to
874e3d9
Compare
9be5a1b
to
2c18cb2
Compare
if embedding_function is not None and not isinstance( | ||
embedding_function, ef.DefaultEmbeddingFunction | ||
): | ||
if embedding_function.name() is not config_ef.name(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want this validation?
@@ -92,7 +89,7 @@ def __init__( | |||
self, | |||
id: UUID, | |||
name: str, | |||
configuration: CollectionConfiguration, | |||
configuration_json: Dict[str, Any], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming nit- its not json!
configuration=load_collection_configuration_from_json_str( | ||
collection.configuration_json_str | ||
), | ||
configuration_json=json.loads(collection.configuration_json_str), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you care if this errors?
configuration = load_collection_configuration_from_json_str( | ||
request.configuration_json_str | ||
) | ||
configuration_json = json.loads(request.configuration_json_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you care that this can error? Should it be made nice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test more.
a862a92
to
e8ffaa9
Compare
e8ffaa9
to
1aff970
Compare
1aff970
to
457a780
Compare
Description of changes
This PR fixes a bug where it warns users if the configuration is not deserializable rather than raising an error, giving the users an empty config. Now, it raises an error if the configuration is not deserializable. This will only happen when the client is ahead of the server, which we do not need to support.
EF's from the config are now lazily loaded. only when ._embed is invoked from the model is it tried to be built, and will take precedent over the provided ef.
This also fixes a bug with list_collections, where if the user tries to load collections without the proper embedding function parameters set up, like api key env var, etc it would throw an hnsw and spann not configured error, which is not the root cause. instead, it will now load only the configuration_json, and at embed time try to load the ef, so errors will only propogate at that point, and with much clearer error messages.
Now, using api_key directly in embedding_functions will give a deprecation warning saying it is not persisted.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?