@jairad26 jairad26 commented Oct 30, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • This PR cleans up how collections are created:
    1. When the collection configuration was the HNSW default and the schema was any default, the code blindly converted config -> schema. Instead, it should use the KNN index to build the correct default schema and take only the embedding function from the config.
    2. When converting config -> schema, the code wrote #document as the source key for both the defaults and the #embedding vector indexes. Instead, it should write #document as the source key only for #embedding.
    3. On the distributed modify path, the JSON mapping for the boolean type in Go does not match the Rust type.
  • New functionality
    • ...
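
As a rough illustration of points 1 and 2 above, the sketch below builds a default schema from a default KNN index plus the embedding function taken from the config, and sets the #document source key only on the #embedding index. The types `VectorIndex` and `Schema`, their fields, and the function `default_schema` are simplified, hypothetical stand-ins; the real definitions in rust/types/src/collection_schema.rs are richer and may differ.

```rust
// Hypothetical, simplified stand-ins for the PR's types (assumption).
#[derive(Clone, Copy, Debug, PartialEq)]
enum KnnIndex {
    Hnsw,
    Spann,
}

#[derive(Clone, Debug, PartialEq)]
struct VectorIndex {
    kind: KnnIndex,
    embedding_function: Option<String>,
    source_key: Option<String>,
}

#[derive(Debug)]
struct Schema {
    defaults: VectorIndex,  // applies to keys without an explicit entry
    embedding: VectorIndex, // the #embedding key
}

const DOCUMENT_KEY: &str = "#document";

/// Build the schema for the all-defaults case: the index kind comes from
/// `default_knn_index`, and only the embedding function comes from the config.
fn default_schema(default_knn_index: KnnIndex, config_ef: Option<&str>) -> Schema {
    let ef = config_ef.map(str::to_string);
    Schema {
        defaults: VectorIndex {
            kind: default_knn_index,
            embedding_function: ef.clone(),
            source_key: None, // no #document source key on defaults
        },
        embedding: VectorIndex {
            kind: default_knn_index,
            embedding_function: ef,
            source_key: Some(DOCUMENT_KEY.to_string()),
        },
    }
}
```

Calling `default_schema(KnnIndex::Spann, Some("my_ef"))` in this sketch yields two SPANN indexes that both carry the embedding function, with the source key set only on the #embedding one.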

Test plan

How are these changes tested?
Added unit tests for all 8 default cases (config HNSW or SPANN default, schema HNSW or SPANN default, default_knn_index), and tests that #document is not populated for defaults while embedding functions are.

  • [x] Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch from 4bb47ac to 53142f3 Compare October 30, 2025 17:50
@jairad26 jairad26 changed the title [BUG] schema: build default with config ef & knn_index, remove #document population in defaults [BUG] schema: build default with config ef & default_knn_index, remove #document population in defaults Oct 30, 2025
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch 4 times, most recently from b3975e9 to 83d8fb6 Compare October 30, 2025 19:10
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch 2 times, most recently from b428f20 to 8f2ee14 Compare October 30, 2025 21:37
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch 2 times, most recently from 9ab2efd to 2f1e7bc Compare October 31, 2025 16:59
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch 5 times, most recently from 7ad31cf to dcb63e8 Compare November 1, 2025 01:08
@jairad26 jairad26 changed the base branch from main to graphite-base/5775 November 3, 2025 18:47
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch from dcb63e8 to 3cd61f8 Compare November 3, 2025 18:47
@jairad26 jairad26 changed the base branch from graphite-base/5775 to jai/fix-is-default-schema-check November 3, 2025 18:47
@jairad26 jairad26 changed the base branch from jai/fix-is-default-schema-check to graphite-base/5775 November 3, 2025 20:22
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch from 3cd61f8 to 1c0b019 Compare November 3, 2025 20:22
@jairad26 jairad26 changed the base branch from graphite-base/5775 to main November 3, 2025 20:23
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch from 1c0b019 to 463f83b Compare November 3, 2025 20:52
@jairad26 jairad26 marked this pull request as ready for review November 3, 2025 20:52
@jairad26 jairad26 marked this pull request as draft November 3, 2025 20:52
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch from 463f83b to 3117661 Compare November 3, 2025 21:04
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch 2 times, most recently from a1daf85 to 97a318b Compare November 3, 2025 21:27
@jairad26 jairad26 marked this pull request as ready for review November 3, 2025 21:50
propel-code-bot bot commented Nov 3, 2025

Align default schema generation with tenant KNN prefs & clean up default index metadata

This change set overhauls how the system builds an internal CollectionSchema when the user relies on built-in defaults. The builder now honours the tenant-wide default_knn_index, correctly carries over search-tuning parameters (e.g. ef_search) and avoids leaking the "#document" source key to every automatically created index. In parallel it fixes a Go↔Rust JSON enum mismatch that could corrupt distributed modify requests. Together these fixes remove silent inconsistencies, reduce memory overhead and make default-path collections behave predictably across single- and multi-node deployments.

Key Changes

• Schema builder now selects HNSW or SPANN based on tenant default_knn_index, not the hard-coded default in the CollectionConfiguration
• ef_search is propagated into the generated HNSW schema so search behaviour remains tunable
• Removed unconditional "#document" source key on auto-generated vector indexes; only the #embedding index keeps it
• Fixed boolean/enum mismatch in Go->Rust JSON mapping for distributed modify operations
• Expanded unit-test matrix to cover 8 default combinations and verify source-key placement
• Updated Rust components (schema types, segments, CLI, compaction) and Go coordinator models to consume the new builder

Affected Areas

• rust/types::collection_schema
• rust/types::collection
• distributed HNSW/SPANN segment creation
• compaction manager & CLI vacuum
• go/sysdb/coordinator/model/collection_configuration.*
• unit-test suites in Rust & Go

This summary was automatically generated by @propel-code-bot

Comment on lines +1415 to +1430
if collection_config.embedding_function.is_some() {
    if let Some(float_list) = &mut new_schema.defaults.float_list {
        if let Some(vector_index) = &mut float_list.vector_index {
            vector_index.config.embedding_function =
                collection_config.embedding_function.clone();
        }
    }
    if let Some(embedding_types) = new_schema.keys.get_mut(EMBEDDING_KEY) {
        if let Some(float_list) = &mut embedding_types.float_list {
            if let Some(vector_index) = &mut float_list.vector_index {
                vector_index.config.embedding_function =
                    collection_config.embedding_function.clone();
            }
        }
    }
}
Contributor

[BestPractice]

This block for setting the embedding function on the default and #embedding vector indexes is repetitive and can be simplified. Chaining with and_then follows Rust's idiomatic Option patterns: and_then applies a closure that itself returns an Option and yields a flat Option<T>, whereas map would produce a nested Option<Option<T>>. That removes most of the nested if let blocks.

Suggested Change
if let Some(ef) = &collection_config.embedding_function {
    if let Some(vector_index) = new_schema
        .defaults
        .float_list
        .as_mut()
        .and_then(|fl| fl.vector_index.as_mut())
    {
        vector_index.config.embedding_function = Some(ef.clone());
    }
    if let Some(vector_index) = new_schema
        .keys
        .get_mut(EMBEDDING_KEY)
        .and_then(|vt| vt.float_list.as_mut())
        .and_then(|fl| fl.vector_index.as_mut())
    {
        vector_index.config.embedding_function = Some(ef.clone());
    }
}

This approach is more idiomatic in Rust and reduces nesting while maintaining the same functionality. The and_then combinator is a standard way to chain Option operations in which each step itself returns an Option.
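
To make the difference between map and and_then concrete, here is a minimal, self-contained sketch. The structs are hypothetical miniatures of the nested optional fields discussed in this thread, not the project's real types, and `set_ef`, `nested`, and `flat` are illustrative helper names.

```rust
// Hypothetical miniature of the nested Option fields (assumption).
struct VectorIndex {
    embedding_function: Option<String>,
}
struct FloatList {
    vector_index: Option<VectorIndex>,
}
struct ValueTypes {
    float_list: Option<FloatList>,
}

/// `map` wraps the closure's Option in another Option.
fn nested(types: &ValueTypes) -> Option<Option<&VectorIndex>> {
    types.float_list.as_ref().map(|fl| fl.vector_index.as_ref())
}

/// `and_then` keeps the result flat.
fn flat(types: &ValueTypes) -> Option<&VectorIndex> {
    types.float_list.as_ref().and_then(|fl| fl.vector_index.as_ref())
}

/// Set the embedding function if the whole chain is present.
/// Returns true when a vector index was found and updated.
fn set_ef(types: &mut ValueTypes, ef: &str) -> bool {
    // `as_mut` turns `&mut Option<T>` into `Option<&mut T>` so the chain
    // can reach the inner field without taking ownership.
    match types.float_list.as_mut().and_then(|fl| fl.vector_index.as_mut()) {
        Some(vector_index) => {
            vector_index.embedding_function = Some(ef.to_string());
            true
        }
        None => false,
    }
}
```

When any link in the chain is None, `set_ef` simply does nothing and reports false, which matches the behaviour of the nested if let version.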

File: rust/types/src/collection_schema.rs
Line: 1430

@jairad26 jairad26 changed the title [BUG] schema: build default with config ef & default_knn_index, remove #document population in defaults [ENH] schema: build default with config ef & default_knn_index, remove #document population in defaults Nov 3, 2025
@jairad26 jairad26 changed the title [ENH] schema: build default with config ef & default_knn_index, remove #document population in defaults [CLN] schema: build default with config ef & default_knn_index, remove #document population in defaults Nov 3, 2025
/// The read path needs to tolerate collections that only have a configuration persisted.
/// This helper hydrates `schema` from the stored configuration when needed, or regenerates
/// the configuration from the existing schema to keep both representations consistent.
pub fn reconcile_schema_for_read(&mut self, knn_index: KnnIndex) -> Result<(), SchemaError> {
Contributor Author

The invariant assumed here is that either the configuration or the schema exists, never both. Therefore, on the read path, the only conversion needed is to use the populated one to write the missing one. The previous logic was reconciling, moving spaces between them, and checking for defaults, all of which is unnecessary. This is purely to show users the correct info in both config and schema.
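
A minimal sketch of that read-path rule, under the stated invariant. `Config`, `Schema`, `Collection`, `ReconcileError`, and `reconcile_for_read` are hypothetical stand-ins invented for illustration; in particular, modelling both sides as `Option` and carrying only a `space` field are assumptions, not the project's actual data model.

```rust
// Hypothetical stand-ins; the real types carry far more than `space`.
#[derive(Clone, Debug, PartialEq)]
struct Config {
    space: String,
}

#[derive(Clone, Debug, PartialEq)]
struct Schema {
    space: String,
}

impl Schema {
    fn from_config(config: &Config) -> Schema {
        Schema { space: config.space.clone() }
    }
}

impl Config {
    fn from_schema(schema: &Schema) -> Config {
        Config { space: schema.space.clone() }
    }
}

#[derive(Debug, PartialEq)]
enum ReconcileError {
    BothMissing,
}

struct Collection {
    config: Option<Config>,
    schema: Option<Schema>,
}

impl Collection {
    /// Hydrate whichever representation is missing from the one present.
    fn reconcile_for_read(&mut self) -> Result<(), ReconcileError> {
        match (self.config.is_some(), self.schema.is_some()) {
            (true, false) => {
                self.schema = self.config.as_ref().map(Schema::from_config);
            }
            (false, true) => {
                self.config = self.schema.as_ref().map(Config::from_schema);
            }
            // Both present: nothing to hydrate in this sketch.
            (true, true) => {}
            (false, false) => return Err(ReconcileError::BothMissing),
        }
        Ok(())
    }
}
```

No default checks or merging happen here; the function only copies from the populated side to the empty side.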

// since default schema doesnt have an ef, we need to use the coll config to create
// a schema with the ef.
let new_schema = Self::convert_collection_config_to_schema(collection_config)?;
// if both are default, use the schema, and apply the ef from config if available
Contributor Author

@jairad26 jairad26 Nov 4, 2025

Just a note here on the assumptions made about what counts as a "default".
For config: it is not default if the ef has a name that isn't "default" or "unknown" (unknown looks like a special case for sparse vec, we should revisit this @sanketkedia), or if any attribute in either hnsw or spann is not the default.

For schema: if ANY single attribute does not match the default, it is not default.

In this case, since they are both essentially equivalent, we can just use the schema and take the ef from config. If the config has the default ef set, we will use it; otherwise it just takes the None, so nothing really happens.
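
The two default checks described above could be sketched roughly as follows. All names (`IndexParams`, `Config`, `Schema`, `config_is_default`, `schema_is_default`) and the numeric default values are placeholders chosen for illustration, and treating a missing ef as default is an assumption.

```rust
#[derive(Clone, Debug, PartialEq)]
struct IndexParams {
    ef_construction: usize,
    ef_search: usize,
}

impl Default for IndexParams {
    fn default() -> Self {
        // Placeholder values, not necessarily the project's real defaults.
        Self { ef_construction: 100, ef_search: 100 }
    }
}

struct Config {
    embedding_function_name: Option<String>,
    params: IndexParams,
}

#[derive(Debug, Default, PartialEq)]
struct Schema {
    params: IndexParams,
    source_key: Option<String>,
}

/// Config counts as default when the ef name is "default" or "unknown"
/// (or absent, by assumption) and every index attribute is at its default.
fn config_is_default(config: &Config) -> bool {
    let ef_is_default = match config.embedding_function_name.as_deref() {
        None | Some("default") | Some("unknown") => true,
        Some(_) => false,
    };
    ef_is_default && config.params == IndexParams::default()
}

/// Schema counts as default only if every attribute matches the default.
fn schema_is_default(schema: &Schema) -> bool {
    *schema == Schema::default()
}
```

Comparing against a freshly built default value keeps the schema check strict: adding a new field to the struct automatically tightens the comparison.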

@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch 3 times, most recently from b791b8f to 061ee06 Compare November 6, 2025 16:22
@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch from 061ee06 to fe04c8f Compare November 6, 2025 18:00
// for both defaults and #embedding key
let mut new_schema = Schema::new_default(default_knn_index);

if collection_config.embedding_function.is_some() {
Contributor

Using and_then is more idiomatic Rust here.

Contributor

Also, this should be a method instead of being inlined here.

Contributor Author

I'm not sure what this means.

  space: Some(hnsw_config.space.clone()),
  embedding_function: collection_config.embedding_function.clone(),
- source_key: Some(DOCUMENT_KEY.to_string()), // Default source key
+ source_key: None,
Contributor

Note to self: will need to migrate existing collections for uniformity

  ) -> Result<Schema, SchemaError> {
  // Start with a default schema structure
- let mut schema = Schema::new_default(KnnIndex::Spann); // Default to HNSW, will be overridden
+ let mut schema = Schema::new_default(default_knn_index);
Contributor

Why does it matter what default you pass here, since it is overridden below by the config anyway?

Contributor Author

It doesn't, but the code feels cleaner like this instead of always creating a SPANN.

if let Some(embedding_types) = schema.keys.get_mut(EMBEDDING_KEY) {
    if let Some(float_list) = &mut embedding_types.float_list {
        if let Some(vector_index) = &mut float_list.vector_index {
            let mut vector_config = vector_config.clone();
Contributor

don't need to clone vector_config here

)
.map_err(CollectionsWithSegmentsProviderError::InvalidSchema)?;
collection_and_segments_sysdb.collection.schema = Some(reconciled_schema);
if collection_and_segments_sysdb.collection.schema.is_none() {
Contributor

have we tested that schema is None and not {} for older collections?

Contributor Author

yes

.map_err(CollectionsWithSegmentsProviderError::InvalidSchema)?;
collection_and_segments_sysdb.collection.schema = Some(reconciled_schema);
if collection_and_segments_sysdb.collection.schema.is_none() {
collection_and_segments_sysdb.collection.schema = Some(
Contributor

@sanketkedia sanketkedia Nov 7, 2025

If we are always passing the schema down to the reader, should we update the reader in distributed_hnsw.rs to use the schema instead of the collection config (with fallback to legacy metadata), similar to local_hnsw.rs? That makes things uniform and easier to understand. (I understand that the current code is also correct; this is just a code design nit.)

)
.map_err(SpannSegmentWriterError::InvalidSchema)?;
let schema = if let Some(schema) = collection.schema.as_ref() {
schema.clone()
Contributor

why clone here?

Contributor Author

Because I can't move out of collection.schema directly, I have to borrow the schema with as_ref.

The other alternative would be to borrow collection.schema directly with &. This works; I just want to confirm it is safe?

        let schema = if let Some(schema) = &collection.schema {
            schema.clone()
        } else {
            Schema::convert_collection_config_to_schema(&collection.config, KnnIndex::Spann)
                .map_err(SpannSegmentWriterError::InvalidSchema)?
        };
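
For what it is worth, matching on `&collection.schema` and on `collection.schema.as_ref()` both bind a shared reference to the inner value, so the two forms behave the same. The sketch below shows both, plus one possible way to avoid the clone when a stored schema exists by returning a `Cow`. The `Schema` and `Collection` types and the `schema_from_config` fallback are hypothetical stand-ins for the real code, and the `Cow` variant is a suggestion for illustration, not something this PR does.

```rust
use std::borrow::Cow;

#[derive(Clone, Debug, PartialEq)]
struct Schema {
    name: String,
}

struct Collection {
    schema: Option<Schema>,
}

// Hypothetical fallback standing in for the config -> schema conversion.
fn schema_from_config() -> Result<Schema, String> {
    Ok(Schema { name: "from-config".to_string() })
}

/// Borrow through `as_ref`, then clone.
fn via_as_ref(collection: &Collection) -> Result<Schema, String> {
    if let Some(schema) = collection.schema.as_ref() {
        Ok(schema.clone())
    } else {
        schema_from_config()
    }
}

/// Borrow the field directly; `schema` is also a `&Schema` here.
fn via_borrow(collection: &Collection) -> Result<Schema, String> {
    if let Some(schema) = &collection.schema {
        Ok(schema.clone())
    } else {
        schema_from_config()
    }
}

/// Avoid the clone when a stored schema exists; own only the fallback.
fn borrowed_or_owned(collection: &Collection) -> Result<Cow<'_, Schema>, String> {
    match &collection.schema {
        Some(schema) => Ok(Cow::Borrowed(schema)),
        None => schema_from_config().map(Cow::Owned),
    }
}
```

Whether the `Cow` form is worth it depends on how the caller uses the schema afterwards; if an owned value is required anyway, the clone is unavoidable.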

/// The read path needs to tolerate collections that only have a configuration persisted.
/// This helper hydrates `schema` from the stored configuration when needed, or regenerates
/// the configuration from the existing schema to keep both representations consistent.
pub fn reconcile_schema_for_read(&mut self, knn_index: KnnIndex) -> Result<(), SchemaError> {
Contributor

Why does the read path need to take a default_knn config? That seems unnecessary because the KNN index type has already been decided and persisted.

@jairad26 jairad26 force-pushed the jai/fix-default-path-reconcile branch from fe04c8f to db3b74e Compare November 7, 2025 19:26
blacksmith-sh bot commented Nov 7, 2025

Found 4 test failures on Blacksmith runners:

Test
worker/compactor::compaction_manager::tests::test_compaction_manager
worker/execution::functions::statistics::tests::test_k8s_integration_statistics_function
worker/execution::orchestration::compact::tests::test_rebuild
worker/execution::orchestration::compact::tests::test_rebuild_empty_filepath

