[ENH] Try shoe-horning sparse vector tokens into the metadata value. #5767

rescrv · 2025-10-29T20:53:14Z

Description of changes

This is an attempt to put the tokens for a sparse vector in said sparse vector.

Test plan

CI

Migration plan

N/A

Observability plan

N/A

Documentation Changes

N/A

github-actions · 2025-10-29T20:53:29Z

propel-code-bot · 2025-10-29T20:54:03Z

Unify Sparse-Vector Tokens into Metadata Across Storage, Execution, and Client APIs

This PR removes the dedicated tokens field from the sparse-vector model and instead stores the token list as a first-class metadata value (array of strings). The change propagates from the Rust core through the protobuf definition to the JS/Python clients and execution operators. By standardising on the public metadata map we eliminate special-case plumbing, reduce schema divergence between layers, and allow tokens to flow through every external API without further wire-format changes.

Key Changes

• Extended MetadataValue to support array-of-string values and added validation/round-trip logic
• Refactored execution operators (rank, idf, sparse_log_knn, BM25 embedder) to read tokens from metadata instead of SparseVector::tokens
• Updated chroma.proto, regenerated Rust & JS typings, and adapted Python bindings to expose tokens solely via metadata
• Removed obsolete tokens field from internal structs; all persistence layers (block-file, collection schema) now rely on metadata
• Added/updated tests and minor async/boxing clean-ups while touching the same code paths

Affected Areas

• Metadata schema & validators
• Execution operators (ranking / sparse KNN / BM25)
• Persistence (segment store, block-file metadata)
• Protobuf wire format and generated JS/Python client code
• Collection schema validation

This summary was automatically generated by @propel-code-bot

rust/types/src/metadata.rs

rust/worker/src/execution/operators/idf.rs

propel-code-bot · 2025-10-29T21:06:15Z

rust/types/src/metadata.rs

+    /// Create a sparse vector from an iterator of ((index, string), value) pairs.
+    pub fn from_pairs(triples: impl IntoIterator<Item = (u32, f32)>) -> Self {
+        let mut indices = vec![];
+        let mut values = vec![];
+        for (index, value) in triples {
+            indices.push(index);
+            values.push(value);
+        }
+        let tokens = None;
+        Self {
+            indices,
+            values,
+            tokens,
+        }
    }

-    /// Create a sparse vector from an iterator of (index, value) pairs.
-    pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
-        let (indices, values) = pairs.into_iter().unzip();
-        Self { indices, values }
+    /// Create a sparse vector from an iterator of ((index, string), value) pairs.
+    pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self {
+        let mut tokens = vec![];
+        let mut indices = vec![];
+        let mut values = vec![];
+        for (token, index, value) in triples {
+            tokens.push(token);
+            indices.push(index);
+            values.push(value);
+        }
+        let tokens = Some(tokens);
+        Self {
+            indices,
+            values,
+            tokens,
+        }
    }


[BestPractice]

The documentation for from_pairs and from_triples appears to be incorrect, and the parameter name in from_pairs could be more descriptive. I've updated the doc comments and parameter name for clarity. I also restored the more idiomatic unzip implementation for from_pairs.

Suggested Change

    /// Create a sparse vector from an iterator of (index, value) pairs.
    pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
        let (indices, values) = pairs.into_iter().unzip();
        let tokens = None;
        Self {
            indices,
            values,
            tokens,
        }
    }

    /// Create a sparse vector from an iterator of (token, index, value) triples.
    pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self {
        let mut tokens = vec![];
        let mut indices = vec![];
        let mut values = vec![];
        for (token, index, value) in triples {
            tokens.push(token);
            indices.push(index);
            values.push(value);
        }
        let tokens = Some(tokens);
        Self {
            indices,
            values,
            tokens,
        }
    }

⚡ Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
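
For readers who want to try the two constructors outside the repository, here is a minimal sketch. It assumes a stand-in `SparseVector` whose three fields are inferred from the diff in this thread; the real struct in rust/types/src/metadata.rs may carry additional derives or fields.

```rust
/// Stand-in for the PR's `SparseVector`; the field layout is an assumption
/// inferred from the diff hunk, not copied from the repository.
#[derive(Debug, Clone, PartialEq)]
pub struct SparseVector {
    pub indices: Vec<u32>,
    pub values: Vec<f32>,
    pub tokens: Option<Vec<String>>,
}

impl SparseVector {
    /// Create a sparse vector from an iterator of (index, value) pairs.
    /// No tokens are attached, so `tokens` is `None`.
    pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
        let (indices, values): (Vec<u32>, Vec<f32>) = pairs.into_iter().unzip();
        Self {
            indices,
            values,
            tokens: None,
        }
    }

    /// Create a sparse vector from an iterator of (token, index, value) triples.
    /// The three output vectors stay aligned position by position.
    pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self {
        let mut tokens = Vec::new();
        let mut indices = Vec::new();
        let mut values = Vec::new();
        for (token, index, value) in triples {
            tokens.push(token);
            indices.push(index);
            values.push(value);
        }
        Self {
            indices,
            values,
            tokens: Some(tokens),
        }
    }
}
```

With this sketch, `SparseVector::from_pairs([(1, 0.5), (7, 1.5)])` leaves `tokens` as `None`, while `from_triples` always produces `Some(..)`, including `Some(vec![])` for an empty iterator.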

Context for Agents

File: rust/types/src/metadata.rs Line: 132

rust/types/src/metadata.rs

propel-code-bot · 2025-10-29T23:03:41Z

rust/types/src/metadata.rs

 impl From<chroma_proto::SparseVector> for SparseVector {
    fn from(proto: chroma_proto::SparseVector) -> Self {
-        SparseVector::new(proto.indices, proto.values)
+        let tokens = if proto.tokens.is_empty() && !proto.indices.is_empty() {


[CriticalError]

Proto conversion logic error: When proto.tokens.is_empty() && !proto.indices.is_empty(), tokens is set to None, but when proto.tokens.is_empty() && proto.indices.is_empty(), tokens becomes Some(vec![]). This inconsistency could cause unexpected behavior. Both cases should result in None:

let tokens = if proto.tokens.is_empty() { None } else { Some(proto.tokens) };
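
The normalization proposed above can be sketched in isolation. `ProtoSparseVector` below is a hypothetical stand-in for the generated `chroma_proto::SparseVector` message, and the `SparseVector` fields are inferred from the diff; neither is copied from the repository. The one fact relied on is that a proto3 `repeated` field decodes to a plain `Vec`, so "absent" and "empty" cannot be told apart on the wire.

```rust
/// Hypothetical stand-in for the generated protobuf message.
/// A proto3 `repeated string` field decodes to a `Vec<String>` that is
/// simply empty when the sender set nothing.
#[derive(Debug, Clone, Default)]
pub struct ProtoSparseVector {
    pub indices: Vec<u32>,
    pub values: Vec<f32>,
    pub tokens: Vec<String>,
}

/// Stand-in for the domain type; the field layout is an assumption.
#[derive(Debug, Clone, PartialEq)]
pub struct SparseVector {
    pub indices: Vec<u32>,
    pub values: Vec<f32>,
    pub tokens: Option<Vec<String>>,
}

impl From<ProtoSparseVector> for SparseVector {
    fn from(proto: ProtoSparseVector) -> Self {
        // Treat an empty token list as "no tokens", whether or not the
        // vector itself has entries, so both cases map to `None`.
        let tokens = if proto.tokens.is_empty() {
            None
        } else {
            Some(proto.tokens)
        };
        Self {
            indices: proto.indices,
            values: proto.values,
            tokens,
        }
    }
}
```

The design choice here is that `Some(vec![])` is never produced by decoding, which keeps equality checks and round-trip tests from depending on whether the vector happened to be empty.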

Context for Agents

File: rust/types/src/metadata.rs Line: 184

rust/types/src/metadata.rs

propel-code-bot · 2025-10-31T00:21:52Z

rust/types/src/metadata.rs


        let dict = ob.downcast::<PyDict>()?;
        let indices_obj = dict.get_item("indices")?;
+        if indices_obj.is_none() {
+            return Err(pyo3::exceptions::PyKeyError::new_err(
+                "missing 'indices' key",
+            ));
+        }
+        let indices: Vec<u32> = indices_obj.unwrap().extract()?;
+
        let values_obj = dict.get_item("values")?;
+        if values_obj.is_none() {
+            return Err(pyo3::exceptions::PyKeyError::new_err(
+                "missing 'values' key",
+            ));
+        }
+        let values: Vec<f32> = values_obj.unwrap().extract()?;

-        let indices: Vec<u32> = indices_obj.extract()?;
-        let values: Vec<f32> = values_obj.extract()?;
+        let tokens_obj = dict.get_item("tokens")?;
+        let tokens = match tokens_obj {
+            Some(obj) if obj.is_none() => None,
+            Some(obj) => Some(obj.extract::<Vec<String>>()?),
+            None => None,
+        };



[BestPractice]

Token length validation missing in Python extraction. If a Python dict contains mismatched token and indices lengths, it won't be caught until later operations:

    # This passes Python extraction but fails later
    dict_in.set_item("indices", [0, 1, 2])
    dict_in.set_item("tokens", ["a", "b"])  # Length mismatch!

Add validation using PyO3's standard error handling:

    if let Some(tokens) = &tokens {
        if tokens.len() != indices.len() {
            return Err(pyo3::exceptions::PyValueError::new_err(format!(
                "tokens length ({}) must match indices length ({})",
                tokens.len(),
                indices.len()
            )));
        }
    }
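
The same check can be exercised without a Python interpreter. The sketch below pulls the length comparison into a plain function; `validate_token_len` is a hypothetical name introduced for illustration, and it returns a `String` message that a caller could wrap in `PyValueError` on the PyO3 path.

```rust
/// Check that an optional token list lines up one-to-one with `indices`.
/// Hypothetical helper: the name and the `String` error type are
/// assumptions made for this sketch, not part of the PR.
pub fn validate_token_len(indices: &[u32], tokens: Option<&[String]>) -> Result<(), String> {
    match tokens {
        // Tokens are present but do not match the number of indices.
        Some(tokens) if tokens.len() != indices.len() => Err(format!(
            "tokens length ({}) must match indices length ({})",
            tokens.len(),
            indices.len()
        )),
        // Either no tokens were supplied, or the lengths agree.
        _ => Ok(()),
    }
}
```

Keeping the comparison in a pure function makes it straightforward to unit test and to reuse from any other constructor that accepts tokens.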

Context for Agents

File: rust/types/src/metadata.rs Line: 272

propel-code-bot · 2025-10-31T00:21:53Z

clients/new-js/packages/ai-embeddings/chroma-bm25/src/index.ts

                (1 - this.b + (this.b * docLen) / this.avgDocLength);
            return (tf * (this.k + 1)) / denominator;
        });
+        const tokens = indices.map((idx) => tokenMap.get(idx)!);


[BestPractice]

Unchecked non-null assertion: tokenMap.get(idx)! does not throw when the index is missing from the map. The ! operator is a compile-time assertion that is erased from the emitted JavaScript, so a missing entry silently puts undefined into the tokens array. This would happen if indices ever contains a value that was not inserted into tokenMap:

    // If tokenMap.get(idx) returns undefined, ! only silences the type checker
    // and undefined flows into tokens.
    const tokens = indices.map((idx) => tokenMap.get(idx)!);

Add an explicit check so the failure is reported at the point where it occurs. Comparing against undefined, rather than testing truthiness, avoids rejecting an empty-string token:

    const tokens = indices.map((idx) => {
      const token = tokenMap.get(idx);
      if (token === undefined) throw new Error(`Token not found for index ${idx}`);
      return token;
    });

Context for Agents

File: clients/new-js/packages/ai-embeddings/chroma-bm25/src/index.ts Line: 218

rescrv requested a review from codetheweb as a code owner October 29, 2025 20:53

rescrv requested a review from tanujnay112 October 29, 2025 20:55