[CLN] schema: build default with config ef & default_knn_index, remove #document population in defaults #5775
Align default schema generation with tenant KNN prefs & clean up default index metadata

This change set overhauls how the system builds an internal CollectionSchema when the user relies on built-in defaults. The builder now honours the tenant-wide

Key Changes
• Schema builder now selects HNSW or SPANN based on tenant

Affected Areas
• rust/types::collection_schema

This summary was automatically generated by @propel-code-bot
    if collection_config.embedding_function.is_some() {
        if let Some(float_list) = &mut new_schema.defaults.float_list {
            if let Some(vector_index) = &mut float_list.vector_index {
                vector_index.config.embedding_function =
                    collection_config.embedding_function.clone();
            }
        }
        if let Some(embedding_types) = new_schema.keys.get_mut(EMBEDDING_KEY) {
            if let Some(float_list) = &mut embedding_types.float_list {
                if let Some(vector_index) = &mut float_list.vector_index {
                    vector_index.config.embedding_function =
                        collection_config.embedding_function.clone();
                }
            }
        }
    }
[BestPractice]
This block sets the embedding function on both the default and #embedding vector indexes, and the nested if let chains are repetitive. Chaining with and_then follows Rust's idiomatic Option patterns: and_then applies an Option-returning closure to the contained value and yields a single Option<T> rather than a nested Option<Option<T>>, which removes most of the nesting.
Suggested Change

    if let Some(ef) = &collection_config.embedding_function {
        if let Some(vector_index) = new_schema
            .defaults
            .float_list
            .as_mut()
            .and_then(|fl| fl.vector_index.as_mut())
        {
            vector_index.config.embedding_function = Some(ef.clone());
        }
        if let Some(vector_index) = new_schema
            .keys
            .get_mut(EMBEDDING_KEY)
            .and_then(|vt| vt.float_list.as_mut())
            .and_then(|fl| fl.vector_index.as_mut())
        {
            vector_index.config.embedding_function = Some(ef.clone());
        }
    }
This approach is more idiomatic in Rust and reduces nesting while keeping the same behavior. and_then is the standard combinator for chaining Option-returning operations.
File: rust/types/src/collection_schema.rs
Line: 1430
rust/types/src/collection.rs
Outdated
    /// The read path needs to tolerate collections that only have a configuration persisted.
    /// This helper hydrates `schema` from the stored configuration when needed, or regenerates
    /// the configuration from the existing schema to keep both representations consistent.
    pub fn reconcile_schema_for_read(&mut self, knn_index: KnnIndex) -> Result<(), SchemaError> {
The invariant assumed here is that either configuration or schema exists, never both. Therefore, on the read path, the only conversion needed is to use the populated one to fill in the missing one. The previous logic was reconciling, moving spaces between the two, and checking for defaults, all of which is unnecessary. This is purely to show users the correct info in both config and schema.
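To make that invariant concrete, here is a minimal sketch with heavily simplified stand-in types. `Config`, `Schema`, `Collection`, `HydrateError`, and `hydrate_for_read` are all hypothetical names for illustration, and modelling both sides as `Option` is an assumption; the real types in rust/types may be shaped differently.

```rust
// Hypothetical, heavily simplified stand-ins for the real types.
#[derive(Clone, Debug, PartialEq)]
struct Config {
    space: String,
}

#[derive(Clone, Debug, PartialEq)]
struct Schema {
    space: String,
}

#[derive(Debug, PartialEq)]
enum HydrateError {
    NothingPersisted,
}

struct Collection {
    config: Option<Config>,
    schema: Option<Schema>,
}

impl Collection {
    /// Fills in whichever representation is missing from the one that is
    /// populated. No merging or default detection happens here.
    fn hydrate_for_read(&mut self) -> Result<(), HydrateError> {
        match (self.config.take(), self.schema.take()) {
            // Only the configuration was persisted: derive the schema.
            (Some(config), None) => {
                self.schema = Some(Schema { space: config.space.clone() });
                self.config = Some(config);
                Ok(())
            }
            // Only the schema was persisted: derive the configuration.
            (None, Some(schema)) => {
                self.config = Some(Config { space: schema.space.clone() });
                self.schema = Some(schema);
                Ok(())
            }
            // Outside the stated invariant; this sketch leaves both untouched.
            (Some(config), Some(schema)) => {
                self.config = Some(config);
                self.schema = Some(schema);
                Ok(())
            }
            (None, None) => Err(HydrateError::NothingPersisted),
        }
    }
}
```

The point of the sketch is that each arm is a one-way copy from the populated side; nothing is compared or merged.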
    // since default schema doesnt have an ef, we need to use the coll config to create
    // a schema with the ef.
    let new_schema = Self::convert_collection_config_to_schema(collection_config)?;
    // if both are default, use the schema, and apply the ef from config if available
Just a note on the assumptions made here about what counts as a "default":
For config: it is not default if the ef has a name that isn't "default" or "unknown" (unknown looks like a special case for sparse vec, we should revisit this @sanketkedia), or if any attribute in either hnsw or spann is not the default.
For schema: if ANY single attribute does not match the default, it is not default.
In this case, since they are both essentially equivalent, we can just use the schema and take the ef from config. If the config has the default ef set, we will use it; otherwise it just takes the None, so nothing really happens.
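The rules in this note can be sketched as two predicates plus a merge step. Everything below is a simplified illustration: the struct shapes, the `IndexParams` fields, and the function names are hypothetical, and treating a missing ef as default is an assumption drawn from the last sentence of the note.

```rust
// Hypothetical stand-ins; the real config and schema carry many more fields.
#[derive(Clone, Debug, PartialEq, Default)]
struct IndexParams {
    ef_construction: usize,
    max_neighbors: usize,
}

#[derive(Clone, Debug, PartialEq)]
struct EmbeddingFunction {
    name: String,
}

#[derive(Clone, Debug, PartialEq, Default)]
struct Config {
    embedding_function: Option<EmbeddingFunction>,
    index: IndexParams,
}

#[derive(Clone, Debug, PartialEq, Default)]
struct Schema {
    embedding_function: Option<EmbeddingFunction>,
    index: IndexParams,
}

/// Config is default when the ef is absent (assumption) or named "default"
/// or "unknown", and every index attribute equals its default value.
fn config_is_default(config: &Config) -> bool {
    let ef_is_default = match &config.embedding_function {
        None => true,
        Some(ef) => ef.name == "default" || ef.name == "unknown",
    };
    ef_is_default && config.index == IndexParams::default()
}

/// Schema is default only if every single attribute matches the default.
fn schema_is_default(schema: &Schema) -> bool {
    *schema == Schema::default()
}

/// When both are default, keep the schema and take the ef from the config.
/// Returns `None` when this shortcut does not apply.
fn merge_when_both_default(config: &Config, schema: &Schema) -> Option<Schema> {
    if config_is_default(config) && schema_is_default(schema) {
        let mut merged = schema.clone();
        merged.embedding_function = config.embedding_function.clone();
        Some(merged)
    } else {
        None
    }
}
```

If the config carries no ef, the merged schema simply keeps `None`, which matches the "nothing really happens" case.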
    // for both defaults and #embedding key
    let mut new_schema = Schema::new_default(default_knn_index);

    if collection_config.embedding_function.is_some() {
Using and_then is more idiomatic Rust here.
Also, this should be a method instead of being inlined here.
I'm not sure what this means.
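One possible reading of the two suggestions in this thread, sketched with simplified stand-in types: move the Option walk into a small accessor and the ef assignment into a method on the schema. The struct definitions, `vector_index_mut`, and `apply_embedding_function` are hypothetical names for illustration, not the actual API in rust/types.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the schema types.
const EMBEDDING_KEY: &str = "#embedding";

#[derive(Clone, Debug, PartialEq)]
struct EmbeddingFunction(String);

#[derive(Default)]
struct VectorIndexConfig {
    embedding_function: Option<EmbeddingFunction>,
}

#[derive(Default)]
struct VectorIndex {
    config: VectorIndexConfig,
}

#[derive(Default)]
struct FloatList {
    vector_index: Option<VectorIndex>,
}

#[derive(Default)]
struct ValueTypes {
    float_list: Option<FloatList>,
}

#[derive(Default)]
struct Schema {
    defaults: ValueTypes,
    keys: HashMap<String, ValueTypes>,
}

impl ValueTypes {
    /// Walks float_list -> vector_index, stopping at the first `None`.
    fn vector_index_mut(&mut self) -> Option<&mut VectorIndex> {
        self.float_list
            .as_mut()
            .and_then(|fl| fl.vector_index.as_mut())
    }
}

impl Schema {
    /// Copies `ef` onto the default vector index and onto the `#embedding`
    /// vector index, wherever those indexes exist.
    fn apply_embedding_function(&mut self, ef: &EmbeddingFunction) {
        if let Some(vector_index) = self.defaults.vector_index_mut() {
            vector_index.config.embedding_function = Some(ef.clone());
        }
        if let Some(vector_index) = self
            .keys
            .get_mut(EMBEDDING_KEY)
            .and_then(|vt| vt.vector_index_mut())
        {
            vector_index.config.embedding_function = Some(ef.clone());
        }
    }
}
```

With a helper like this, the call site would shrink to an `if let Some(ef) = &collection_config.embedding_function` wrapping a single method call.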
      space: Some(hnsw_config.space.clone()),
      embedding_function: collection_config.embedding_function.clone(),
    - source_key: Some(DOCUMENT_KEY.to_string()), // Default source key
    + source_key: None,
Note to self: will need to migrate existing collections for uniformity
      ) -> Result<Schema, SchemaError> {
          // Start with a default schema structure
    -     let mut schema = Schema::new_default(KnnIndex::Spann); // Default to HNSW, will be overridden
    +     let mut schema = Schema::new_default(default_knn_index);
why does it matter what default you pass here since it is overridden below by config anyways?
It doesn't, but the code feels cleaner this way than always creating a SPANN default.
rust/types/src/collection_schema.rs
Outdated
    if let Some(embedding_types) = schema.keys.get_mut(EMBEDDING_KEY) {
        if let Some(float_list) = &mut embedding_types.float_list {
            if let Some(vector_index) = &mut float_list.vector_index {
                let mut vector_config = vector_config.clone();
don't need to clone vector_config here
    )
    .map_err(CollectionsWithSegmentsProviderError::InvalidSchema)?;
    collection_and_segments_sysdb.collection.schema = Some(reconciled_schema);
    if collection_and_segments_sysdb.collection.schema.is_none() {
have we tested that schema is None and not {} for older collections?
yes
    .map_err(CollectionsWithSegmentsProviderError::InvalidSchema)?;
    collection_and_segments_sysdb.collection.schema = Some(reconciled_schema);
    if collection_and_segments_sysdb.collection.schema.is_none() {
        collection_and_segments_sysdb.collection.schema = Some(
If we are always passing schema down to the reader, should we update the reader in distributed_hnsw.rs to use schema instead of collection config (with fallback to legacy metadata), similar to local_hnsw.rs? That would make things uniform and easier to understand. (I understand that the current code is also correct; this is just a code design nit.)
    )
    .map_err(SpannSegmentWriterError::InvalidSchema)?;
    let schema = if let Some(schema) = collection.schema.as_ref() {
        schema.clone()
why clone here?
Because I can't use collection.schema directly; I have to borrow the schema with as_ref.
The other alternative would be to borrow collection.schema directly. This works, I just want to confirm this is safe?
    let schema = if let Some(schema) = &collection.schema {
        schema.clone()
    } else {
        Schema::convert_collection_config_to_schema(&collection.config, KnnIndex::Spann)
            .map_err(SpannSegmentWriterError::InvalidSchema)?
    };
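On the question of whether the two forms differ: matching on `&collection.schema` and matching on `collection.schema.as_ref()` both bind a shared reference to the inner value, and neither moves the schema out of the collection. A small sketch with hypothetical stand-in types (`Schema`, `Collection`, and `fallback_schema` are illustrative names, not the real definitions):

```rust
// Hypothetical stand-ins for illustration only.
#[derive(Clone, Debug, PartialEq)]
struct Schema {
    source: &'static str,
}

struct Collection {
    schema: Option<Schema>,
}

// Stands in for deriving a schema from the collection config.
fn fallback_schema() -> Result<Schema, String> {
    Ok(Schema { source: "config" })
}

/// Borrows through `Option::as_ref`, which turns `&Option<Schema>` into
/// `Option<&Schema>`.
fn schema_via_as_ref(collection: &Collection) -> Result<Schema, String> {
    let schema = if let Some(schema) = collection.schema.as_ref() {
        schema.clone()
    } else {
        fallback_schema()?
    };
    Ok(schema)
}

/// Borrows the field directly; the `Some(schema)` pattern then binds
/// `schema` as `&Schema`.
fn schema_via_borrow(collection: &Collection) -> Result<Schema, String> {
    let schema = if let Some(schema) = &collection.schema {
        schema.clone()
    } else {
        fallback_schema()?
    };
    Ok(schema)
}
```

In both functions the clone is what produces an owned `Schema`; the collection itself is left unchanged.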
rust/types/src/collection.rs
Outdated
    /// The read path needs to tolerate collections that only have a configuration persisted.
    /// This helper hydrates `schema` from the stored configuration when needed, or regenerates
    /// the configuration from the existing schema to keep both representations consistent.
    pub fn reconcile_schema_for_read(&mut self, knn_index: KnnIndex) -> Result<(), SchemaError> {
Why does the read path need to take a default_knn config? That seems unnecessary because the knn index type has already been decided and persisted.
…ent population in defaults
Found 4 test failures on Blacksmith runners:

Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
Added unit tests for all 8 default cases (config hnsw or spann default, schema hnsw or spann default, default_knn_index), plus tests that #document is not populated for defaults and that embedding functions are.
pytest for python, yarn test for js, cargo test for rust
Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?