Conversation

@rescrv
Contributor

@rescrv rescrv commented Oct 29, 2025

Description of changes

This is an attempt to put the tokens for a sparse vector in said sparse vector.

Test plan

CI

Migration plan

N/A

Observability plan

N/A

Documentation Changes

N/A

@rescrv rescrv requested a review from codetheweb as a code owner October 29, 2025 20:53
@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@propel-code-bot
Contributor

propel-code-bot bot commented Oct 29, 2025

Unify Sparse-Vector Tokens into Metadata Across Storage, Execution, and Client APIs

This PR removes the dedicated tokens field from the sparse-vector model and instead stores the token list as a first-class metadata value (array of strings). The change propagates from the Rust core through the protobuf definition to the JS/Python clients and execution operators. By standardising on the public metadata map we eliminate special-case plumbing, reduce schema divergence between layers, and allow tokens to flow through every external API without further wire-format changes.

Key Changes

• Extended MetadataValue to support array-of-string values and added validation/round-trip logic
• Refactored execution operators (rank, idf, sparse_log_knn, BM25 embedder) to read tokens from metadata instead of SparseVector::tokens
• Updated chroma.proto, regenerated Rust & JS typings, and adapted Python bindings to expose tokens solely via metadata
• Removed obsolete tokens field from internal structs; all persistence layers (block-file, collection schema) now rely on metadata
• Added/updated tests and minor async/boxing clean-ups while touching the same code paths

Affected Areas

• Metadata schema & validators
• Execution operators (ranking / sparse KNN / BM25)
• Persistence (segment store, block-file metadata)
• Protobuf wire format and generated JS/Python client code
• Collection schema validation

This summary was automatically generated by @propel-code-bot

@rescrv rescrv requested a review from tanujnay112 October 29, 2025 20:55
Comment on lines 100 to 132
  /// Create a sparse vector from an iterator of ((index, string), value) pairs.
  pub fn from_pairs(triples: impl IntoIterator<Item = (u32, f32)>) -> Self {
      let mut indices = vec![];
      let mut values = vec![];
      for (index, value) in triples {
          indices.push(index);
          values.push(value);
      }
      let tokens = None;
      Self {
          indices,
          values,
          tokens,
      }
  }

- /// Create a sparse vector from an iterator of (index, value) pairs.
- pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
-     let (indices, values) = pairs.into_iter().unzip();
-     Self { indices, values }
  /// Create a sparse vector from an iterator of ((index, string), value) pairs.
  pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self {
      let mut tokens = vec![];
      let mut indices = vec![];
      let mut values = vec![];
      for (token, index, value) in triples {
          tokens.push(token);
          indices.push(index);
          values.push(value);
      }
      let tokens = Some(tokens);
      Self {
          indices,
          values,
          tokens,
      }
  }

[**BestPractice**]

The documentation for `from_pairs` and `from_triples` appears to be incorrect, and the parameter name in `from_pairs` could be more descriptive. I've updated the doc comments and parameter name for clarity. I also restored the more idiomatic `unzip` implementation for `from_pairs`.

<details>
<summary>Suggested Change</summary>

```suggestion
    /// Create a sparse vector from an iterator of (index, value) pairs.
    pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
        let (indices, values) = pairs.into_iter().unzip();
        let tokens = None;
        Self {
            indices,
            values,
            tokens,
        }
    }

    /// Create a sparse vector from an iterator of (token, index, value) triples.
    pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self {
        let mut tokens = vec![];
        let mut indices = vec![];
        let mut values = vec![];
        for (token, index, value) in triples {
            tokens.push(token);
            indices.push(index);
            values.push(value);
        }
        let tokens = Some(tokens);
        Self {
            indices,
            values,
            tokens,
        }
    }
```

⚡ **Committable suggestion**

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

</details>

File: rust/types/src/metadata.rs
Line: 132
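
As a rough illustration of the two constructors in the suggestion above, the sketch below uses a stand-in `SparseVector` struct. The field layout is assumed from the snippets in this thread rather than copied from `rust/types/src/metadata.rs`, and the real type may carry additional derives or validation.

```rust
/// Stand-in for the sparse vector type under review (assumed layout, not the
/// actual definition from the repository).
#[derive(Debug, Clone, PartialEq)]
pub struct SparseVector {
    pub indices: Vec<u32>,
    pub values: Vec<f32>,
    pub tokens: Option<Vec<String>>,
}

impl SparseVector {
    /// Create a sparse vector from an iterator of (index, value) pairs.
    pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
        // `unzip` splits the pairs into two parallel vectors in one pass.
        let (indices, values): (Vec<u32>, Vec<f32>) = pairs.into_iter().unzip();
        Self {
            indices,
            values,
            tokens: None,
        }
    }

    /// Create a sparse vector from an iterator of (token, index, value) triples.
    pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self {
        let mut tokens = Vec::new();
        let mut indices = Vec::new();
        let mut values = Vec::new();
        // There is no three-way `unzip` in std, so push into each vector by hand.
        for (token, index, value) in triples {
            tokens.push(token);
            indices.push(index);
            values.push(value);
        }
        Self {
            indices,
            values,
            tokens: Some(tokens),
        }
    }
}
```

With this shape, `from_pairs` always yields `tokens == None`, while `from_triples` yields a token list of the same length as `indices`.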

@HammadB HammadB self-requested a review October 29, 2025 22:09
  impl From<chroma_proto::SparseVector> for SparseVector {
      fn from(proto: chroma_proto::SparseVector) -> Self {
-         SparseVector::new(proto.indices, proto.values)
          let tokens = if proto.tokens.is_empty() && !proto.indices.is_empty() {

[**CriticalError**]

Proto conversion logic error: When `proto.tokens.is_empty() && !proto.indices.is_empty()`, tokens is set to `None`, but when `proto.tokens.is_empty() && proto.indices.is_empty()`, tokens becomes `Some(vec![])`. This inconsistency could cause unexpected behavior. Both cases should result in `None`:

```rust
let tokens = if proto.tokens.is_empty() {
    None
} else {
    Some(proto.tokens)
};
```

File: rust/types/src/metadata.rs
Line: 184
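
To make the suggested normalization concrete, here is a minimal sketch with a hypothetical `ProtoSparseVector` standing in for the generated `chroma_proto::SparseVector`. Both struct definitions are assumptions for illustration. The point it shows is that a protobuf repeated field cannot distinguish "absent" from "empty", so any empty token list maps to `None` regardless of `indices`.

```rust
/// Hypothetical stand-in for the generated protobuf message.
pub struct ProtoSparseVector {
    pub indices: Vec<u32>,
    pub values: Vec<f32>,
    pub tokens: Vec<String>,
}

/// Stand-in for the domain type (assumed layout).
#[derive(Debug, PartialEq)]
pub struct SparseVector {
    pub indices: Vec<u32>,
    pub values: Vec<f32>,
    pub tokens: Option<Vec<String>>,
}

impl From<ProtoSparseVector> for SparseVector {
    fn from(proto: ProtoSparseVector) -> Self {
        // An empty repeated field is treated as "no tokens", whether or not
        // the vector itself has entries.
        let tokens = if proto.tokens.is_empty() {
            None
        } else {
            Some(proto.tokens)
        };
        Self {
            indices: proto.indices,
            values: proto.values,
            tokens,
        }
    }
}
```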

Comment on lines 248 to 272

  let dict = ob.downcast::<PyDict>()?;
  let indices_obj = dict.get_item("indices")?;
  if indices_obj.is_none() {
      return Err(pyo3::exceptions::PyKeyError::new_err(
          "missing 'indices' key",
      ));
  }
  let indices: Vec<u32> = indices_obj.unwrap().extract()?;

  let values_obj = dict.get_item("values")?;
  if values_obj.is_none() {
      return Err(pyo3::exceptions::PyKeyError::new_err(
          "missing 'values' key",
      ));
  }
  let values: Vec<f32> = values_obj.unwrap().extract()?;

- let indices: Vec<u32> = indices_obj.extract()?;
- let values: Vec<f32> = values_obj.extract()?;
  let tokens_obj = dict.get_item("tokens")?;
  let tokens = match tokens_obj {
      Some(obj) if obj.is_none() => None,
      Some(obj) => Some(obj.extract::<Vec<String>>()?),
      None => None,
  };


[**BestPractice**]

Token length validation is missing from the Python extraction path. If a Python dict supplies `tokens` and `indices` of different lengths, the mismatch is not caught until a later operation:

```python
# This passes Python extraction but fails later
dict_in["indices"] = [0, 1, 2]
dict_in["tokens"] = ["a", "b"]  # Length mismatch!
```

Add validation using PyO3's standard error handling:
```rust
if let Some(tokens) = &tokens {
    if tokens.len() != indices.len() {
        return Err(pyo3::exceptions::PyValueError::new_err(
            format!("tokens length ({}) must match indices length ({})", tokens.len(), indices.len())
        ));
    }
}
```

File: rust/types/src/metadata.rs
Line: 272
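
The same check can be factored into a small helper that does not depend on PyO3, which makes it easy to unit test. This is only a sketch: `validate_token_len` is a hypothetical name, and it returns a plain `String` error where the real binding would presumably map the failure to `PyValueError`.

```rust
/// Check that an optional token list lines up one-to-one with `indices`.
/// Returns a descriptive error message on mismatch (hypothetical helper).
pub fn validate_token_len(indices: &[u32], tokens: Option<&[String]>) -> Result<(), String> {
    match tokens {
        // Tokens were supplied but their count differs from the index count.
        Some(tokens) if tokens.len() != indices.len() => Err(format!(
            "tokens length ({}) must match indices length ({})",
            tokens.len(),
            indices.len()
        )),
        // Either no tokens were supplied, or the lengths agree.
        _ => Ok(()),
    }
}
```

Running the check at extraction time means a malformed dict is rejected at the API boundary instead of surfacing as a failure in a later operator.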

(1 - this.b + (this.b * docLen) / this.avgDocLength);
return (tf * (this.k + 1)) / denominator;
});
const tokens = indices.map((idx) => tokenMap.get(idx)!);

[**BestPractice**]

Unchecked non-null assertion: `tokenMap.get(idx)!` does not fail if the index is missing from the map. The `!` is erased at compile time, so a missing entry silently puts `undefined` into `tokens`. This can happen if `indices` and `tokenMap` ever get out of sync:

```typescript
// If tokenMap.get(idx) returns undefined, ! only silences the type checker
const tokens = indices.map((idx) => tokenMap.get(idx)!);
```

Add an explicit check (comparing against `undefined` so that an empty-string token is not rejected):
```typescript
const tokens = indices.map((idx) => {
    const token = tokenMap.get(idx);
    if (token === undefined) throw new Error(`Token not found for index ${idx}`);
    return token;
});
```

File: clients/new-js/packages/ai-embeddings/chroma-bm25/src/index.ts
Line: 218
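
For comparison with the Rust side of this PR, the same "fail loudly on a missing token" lookup can be sketched in Rust. This is an analog for illustration only, not code from the PR: `resolve_tokens` and its `String` error type are hypothetical.

```rust
use std::collections::HashMap;

/// Resolve each index to its token, returning an error for the first index
/// that has no entry in `token_map` (hypothetical helper).
pub fn resolve_tokens(
    indices: &[u32],
    token_map: &HashMap<u32, String>,
) -> Result<Vec<String>, String> {
    indices
        .iter()
        .map(|idx| {
            token_map
                .get(idx)
                .cloned()
                // Turn a missing entry into an explicit error instead of unwrapping.
                .ok_or_else(|| format!("Token not found for index {}", idx))
        })
        // Collecting into `Result` stops at the first error.
        .collect()
}
```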

4 participants