-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[ENH] Try shoe-horning sparse vector tokens into the metadata value. #5767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
|
Unify Sparse-Vector Tokens into Metadata Across Storage, Execution, and Client APIs This PR removes the dedicated Key Changes• Extended Affected Areas• Metadata schema & validators This summary was automatically generated by @propel-code-bot |
| /// Create a sparse vector from an iterator of ((index, string), value) pairs. | ||
| pub fn from_pairs(triples: impl IntoIterator<Item = (u32, f32)>) -> Self { | ||
| let mut indices = vec![]; | ||
| let mut values = vec![]; | ||
| for (index, value) in triples { | ||
| indices.push(index); | ||
| values.push(value); | ||
| } | ||
| let tokens = None; | ||
| Self { | ||
| indices, | ||
| values, | ||
| tokens, | ||
| } | ||
| } | ||
|
|
||
| /// Create a sparse vector from an iterator of (index, value) pairs. | ||
| pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self { | ||
| let (indices, values) = pairs.into_iter().unzip(); | ||
| Self { indices, values } | ||
| /// Create a sparse vector from an iterator of ((index, string), value) pairs. | ||
| pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self { | ||
| let mut tokens = vec![]; | ||
| let mut indices = vec![]; | ||
| let mut values = vec![]; | ||
| for (token, index, value) in triples { | ||
| tokens.push(token); | ||
| indices.push(index); | ||
| values.push(value); | ||
| } | ||
| let tokens = Some(tokens); | ||
| Self { | ||
| indices, | ||
| values, | ||
| tokens, | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The documentation for from_pairs and from_triples appears to be incorrect, and the parameter name in from_pairs could be more descriptive. I've updated the doc comments and parameter name for clarity. I also restored the more idiomatic unzip implementation for from_pairs.
Suggested Change
| /// Create a sparse vector from an iterator of ((index, string), value) pairs. | |
| pub fn from_pairs(triples: impl IntoIterator<Item = (u32, f32)>) -> Self { | |
| let mut indices = vec![]; | |
| let mut values = vec![]; | |
| for (index, value) in triples { | |
| indices.push(index); | |
| values.push(value); | |
| } | |
| let tokens = None; | |
| Self { | |
| indices, | |
| values, | |
| tokens, | |
| } | |
| } | |
| /// Create a sparse vector from an iterator of (index, value) pairs. | |
| pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self { | |
| let (indices, values) = pairs.into_iter().unzip(); | |
| Self { indices, values } | |
| /// Create a sparse vector from an iterator of ((index, string), value) pairs. | |
| pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self { | |
| let mut tokens = vec![]; | |
| let mut indices = vec![]; | |
| let mut values = vec![]; | |
| for (token, index, value) in triples { | |
| tokens.push(token); | |
| indices.push(index); | |
| values.push(value); | |
| } | |
| let tokens = Some(tokens); | |
| Self { | |
| indices, | |
| values, | |
| tokens, | |
| } | |
| } | |
| /// Create a sparse vector from an iterator of (index, value) pairs. | |
| pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self { | |
| let (indices, values) = pairs.into_iter().unzip(); | |
| let tokens = None; | |
| Self { | |
| indices, | |
| values, | |
| tokens, | |
| } | |
| } | |
| /// Create a sparse vector from an iterator of (token, index, value) triples. | |
| pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self { | |
| let mut tokens = vec![]; | |
| let mut indices = vec![]; | |
| let mut values = vec![]; | |
| for (token, index, value) in triples { | |
| tokens.push(token); | |
| indices.push(index); | |
| values.push(value); | |
| } | |
| let tokens = Some(tokens); | |
| Self { | |
| indices, | |
| values, | |
| tokens, | |
| } | |
| } |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
The documentation for `from_pairs` and `from_triples` appears to be incorrect, and the parameter name in `from_pairs` could be more descriptive. I've updated the doc comments and parameter name for clarity. I also restored the more idiomatic `unzip` implementation for `from_pairs`.
<details>
<summary>Suggested Change</summary>
```suggestion
/// Create a sparse vector from an iterator of (index, value) pairs.
pub fn from_pairs(pairs: impl IntoIterator<Item = (u32, f32)>) -> Self {
let (indices, values) = pairs.into_iter().unzip();
let tokens = None;
Self {
indices,
values,
tokens,
}
}
/// Create a sparse vector from an iterator of (token, index, value) triples.
pub fn from_triples(triples: impl IntoIterator<Item = (String, u32, f32)>) -> Self {
let mut tokens = vec![];
let mut indices = vec![];
let mut values = vec![];
for (token, index, value) in triples {
tokens.push(token);
indices.push(index);
values.push(value);
}
let tokens = Some(tokens);
Self {
indices,
values,
tokens,
}
}
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: rust/types/src/metadata.rs
Line: 132| impl From<chroma_proto::SparseVector> for SparseVector { | ||
| fn from(proto: chroma_proto::SparseVector) -> Self { | ||
| SparseVector::new(proto.indices, proto.values) | ||
| let tokens = if proto.tokens.is_empty() && !proto.indices.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
Proto conversion logic error: When proto.tokens.is_empty() && !proto.indices.is_empty(), tokens is set to None, but when proto.tokens.is_empty() && proto.indices.is_empty(), tokens becomes Some(vec![]). This inconsistency could cause unexpected behavior. Both cases should result in None:
let tokens = if proto.tokens.is_empty() {
None
} else {
Some(proto.tokens)
};Context for Agents
[**CriticalError**]
Proto conversion logic error: When `proto.tokens.is_empty() && !proto.indices.is_empty()`, tokens is set to `None`, but when `proto.tokens.is_empty() && proto.indices.is_empty()`, tokens becomes `Some(vec![])`. This inconsistency could cause unexpected behavior. Both cases should result in `None`:
```rust
let tokens = if proto.tokens.is_empty() {
None
} else {
Some(proto.tokens)
};
```
File: rust/types/src/metadata.rs
Line: 184
This comment has been minimized.
This comment has been minimized.
|
|
||
| let dict = ob.downcast::<PyDict>()?; | ||
| let indices_obj = dict.get_item("indices")?; | ||
| if indices_obj.is_none() { | ||
| return Err(pyo3::exceptions::PyKeyError::new_err( | ||
| "missing 'indices' key", | ||
| )); | ||
| } | ||
| let indices: Vec<u32> = indices_obj.unwrap().extract()?; | ||
|
|
||
| let values_obj = dict.get_item("values")?; | ||
| if values_obj.is_none() { | ||
| return Err(pyo3::exceptions::PyKeyError::new_err( | ||
| "missing 'values' key", | ||
| )); | ||
| } | ||
| let values: Vec<f32> = values_obj.unwrap().extract()?; | ||
|
|
||
| let indices: Vec<u32> = indices_obj.extract()?; | ||
| let values: Vec<f32> = values_obj.extract()?; | ||
| let tokens_obj = dict.get_item("tokens")?; | ||
| let tokens = match tokens_obj { | ||
| Some(obj) if obj.is_none() => None, | ||
| Some(obj) => Some(obj.extract::<Vec<String>>()?), | ||
| None => None, | ||
| }; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Token length validation missing in Python extraction. If a Python dict contains mismatched token and indices lengths, it won't be caught until later operations:
# This passes Python extraction but fails later
dict_in.set_item("indices", [0, 1, 2])
dict_in.set_item("tokens", ["a", "b"]) # Length mismatch!Add validation using PyO3's standard error handling:
if let Some(tokens) = &tokens {
if tokens.len() != indices.len() {
return Err(pyo3::exceptions::PyValueError::new_err(
format!("tokens length ({}) must match indices length ({})", tokens.len(), indices.len())
));
}
}Context for Agents
[**BestPractice**]
Token length validation missing in Python extraction. If a Python dict contains mismatched token and indices lengths, it won't be caught until later operations:
```python
# This passes Python extraction but fails later
dict_in.set_item("indices", [0, 1, 2])
dict_in.set_item("tokens", ["a", "b"]) # Length mismatch!
```
Add validation using PyO3's standard error handling:
```rust
if let Some(tokens) = &tokens {
if tokens.len() != indices.len() {
return Err(pyo3::exceptions::PyValueError::new_err(
format!("tokens length ({}) must match indices length ({})", tokens.len(), indices.len())
));
}
}
```
File: rust/types/src/metadata.rs
Line: 272| (1 - this.b + (this.b * docLen) / this.avgDocLength); | ||
| return (tf * (this.k + 1)) / denominator; | ||
| }); | ||
| const tokens = indices.map((idx) => tokenMap.get(idx)!); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Potential panic on unwrap: tokenMap.get(idx)! will panic if the index doesn't exist in the map. This can happen if there are hash collisions or concurrent modifications:
// If tokenMap.get(idx) returns undefined, ! will throw
const tokens = indices.map((idx) => tokenMap.get(idx)!);Add safety check:
const tokens = indices.map((idx) => {
const token = tokenMap.get(idx);
if (!token) throw new Error(`Token not found for index ${idx}`);
return token;
});Context for Agents
[**BestPractice**]
Potential panic on unwrap: `tokenMap.get(idx)!` will panic if the index doesn't exist in the map. This can happen if there are hash collisions or concurrent modifications:
```typescript
// If tokenMap.get(idx) returns undefined, ! will throw
const tokens = indices.map((idx) => tokenMap.get(idx)!);
```
Add safety check:
```typescript
const tokens = indices.map((idx) => {
const token = tokenMap.get(idx);
if (!token) throw new Error(`Token not found for index ${idx}`);
return token;
});
```
File: clients/new-js/packages/ai-embeddings/chroma-bm25/src/index.ts
Line: 218
Description of changes
This is an attempt to put the tokens for a sparse vector in said sparse vector.
Test plan
CI
Migration plan
N/A
Observability plan
N/A
Documentation Changes
N/A