-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[ENH] Add muvera and colBERT support to python client #5744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
15304a1 to
998da94
Compare
|
Add Muvera Fixed-Dimensional Encoding & ColBERT multivector support Introduces Key Changes• Added 700-line implementation of MuVERA FDE ( Affected Areas• This summary was automatically generated by @propel-code-bot |
|
|
||
| dims = len(multi_embeddings[0][0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
Potential null pointer dereference: Line 186 accesses multi_embeddings[0][0] without checking if multi_embeddings is empty or if the first embedding is empty. If the Jina API returns empty multivector embeddings, this will cause an IndexError. This is a known pattern in ChromaDB - similar issues were addressed in PR #3183 for empty embedding list validation.
if not multi_embeddings or not multi_embeddings[0]:
raise RuntimeError("Invalid multivector embeddings format from Jina API")
dims = len(multi_embeddings[0][0])Note: ChromaDB has established patterns for defensive programming against empty embedding arrays, and this follows the same principle.
Context for Agents
[**CriticalError**]
Potential null pointer dereference: Line 186 accesses `multi_embeddings[0][0]` without checking if `multi_embeddings` is empty or if the first embedding is empty. If the Jina API returns empty multivector embeddings, this will cause an IndexError. This is a known pattern in ChromaDB - similar issues were addressed in PR #3183 for empty embedding list validation.
```python
if not multi_embeddings or not multi_embeddings[0]:
raise RuntimeError("Invalid multivector embeddings format from Jina API")
dims = len(multi_embeddings[0][0])
```
Note: ChromaDB has established patterns for defensive programming against empty embedding arrays, and this follows the same principle.
File: chromadb/utils/embedding_functions/jina_embedding_function.py
Line: 186| return create_fdes( | ||
| multivec, | ||
| dims=len(multivec[0][0]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
Potential null pointer dereference: Line 46 accesses multivec[0][0] without checking if multivec is empty or if the first vector is empty. If the pylate model returns empty multivectors, this will cause an IndexError. This follows the same vulnerability pattern that ChromaDB addressed in their normalization code (PR #3183).
if not multivec or not multivec[0]:
raise ValueError("Model returned empty multivector embeddings")
return create_fdes(
multivec,
dims=len(multivec[0][0]),
is_query=False,
fill_empty_partitions=True,
)Note: Pylate/ColBERT models generate multivector embeddings where each document produces multiple vectors, making this validation critical.
Context for Agents
[**CriticalError**]
Potential null pointer dereference: Line 46 accesses `multivec[0][0]` without checking if `multivec` is empty or if the first vector is empty. If the pylate model returns empty multivectors, this will cause an IndexError. This follows the same vulnerability pattern that ChromaDB addressed in their normalization code (PR #3183).
```python
if not multivec or not multivec[0]:
raise ValueError("Model returned empty multivector embeddings")
return create_fdes(
multivec,
dims=len(multivec[0][0]),
is_query=False,
fill_empty_partitions=True,
)
```
Note: Pylate/ColBERT models generate multivector embeddings where each document produces multiple vectors, making this validation critical.
File: chromadb/utils/embedding_functions/pylate_colbert_embedding_function.py
Line: 46| return create_fdes( | ||
| multivec, | ||
| dims=len(multivec[0][0]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
Potential null pointer dereference: Same issue as in __call__ method - line 64 accesses multivec[0][0] without validation. This will fail if the model returns empty embeddings. The fix should be consistent with the document embedding method for maintainability.
if not multivec or not multivec[0]:
raise ValueError("Model returned empty multivector embeddings")
return create_fdes(
multivec,
dims=len(multivec[0][0]),
is_query=True,
fill_empty_partitions=False,
)Note: Query and document embedding methods should use consistent validation patterns to prevent similar IndexErrors across the codebase.
Context for Agents
[**CriticalError**]
Potential null pointer dereference: Same issue as in `__call__` method - line 64 accesses `multivec[0][0]` without validation. This will fail if the model returns empty embeddings. The fix should be consistent with the document embedding method for maintainability.
```python
if not multivec or not multivec[0]:
raise ValueError("Model returned empty multivector embeddings")
return create_fdes(
multivec,
dims=len(multivec[0][0]),
is_query=True,
fill_empty_partitions=False,
)
```
Note: Query and document embedding methods should use consistent validation patterns to prevent similar IndexErrors across the codebase.
File: chromadb/utils/embedding_functions/pylate_colbert_embedding_function.py
Line: 64| distances = [ | ||
| _distance_to_simhash_partition(sketches[j], i) | ||
| for j in range(num_points) | ||
| ] | ||
| nearest_point_idx = np.argmin(distances) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[PerformanceOptimization]
Performance concern: The nested loop on lines 357-361 creates O(N²) complexity when processing empty partitions. For each empty partition, it calculates distances to all num_points, which can be expensive for large point clouds.
Consider caching distance calculations or using vectorized operations:
# Pre-calculate all distances once per repetition
if config.fill_empty_partitions and num_points > 0:
all_distances = np.array([
[_distance_to_simhash_partition(sketches[j], i)
for i in range(num_partitions)]
for j in range(num_points)
])Context for Agents
[**PerformanceOptimization**]
Performance concern: The nested loop on lines 357-361 creates O(N²) complexity when processing empty partitions. For each empty partition, it calculates distances to all `num_points`, which can be expensive for large point clouds.
Consider caching distance calculations or using vectorized operations:
```python
# Pre-calculate all distances once per repetition
if config.fill_empty_partitions and num_points > 0:
all_distances = np.array([
[_distance_to_simhash_partition(sketches[j], i)
for i in range(num_partitions)]
for j in range(num_points)
])
```
File: chromadb/utils/muvera.py
Line: 361
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?