@Himanshu7921
Description of changes

Summary

This PR adds a small, deterministic, dependency-free local embedding implementation for quick smoke tests, examples, and contributor onboarding.
It introduces SimpleHashEmbeddingFunction, which produces embeddings without requiring any external models or API keys.

This change is Python-only, focused on improving testability and developer experience when setting up Chroma locally.


Improvements & Additions

New Embedding Function

  • simple_hash_embedding_function.py

    • Implements SimpleHashEmbeddingFunction, following the repository’s EmbeddingFunction convention.
    • Accepts list[str] inputs and returns fixed-dimensional NumPy embeddings.
    • Deterministic — produces identical results across runs for the same input.
    • Lightweight — requires no external dependencies or network access.
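The PR description does not show the implementation itself. As a rough illustration of how a deterministic hash-based embedding can work, here is a minimal standard-library sketch. The function name `simple_hash_embed`, the choice of SHA-256, the byte-to-float mapping, and the L2 normalization are all assumptions made for illustration, and the sketch returns plain lists rather than the NumPy arrays the PR describes.

```python
import hashlib
import math
from typing import Any, List


def simple_hash_embed(texts: List[Any], dim: int = 16) -> List[List[float]]:
    """Hypothetical sketch: map each input to a fixed-length, unit-norm vector."""
    if texts is None:
        raise ValueError("texts must not be None")
    if dim <= 0:
        raise ValueError("dim must be positive")

    vectors = []
    for text in texts:
        data = str(text).encode("utf-8")  # non-string inputs are stringified
        values: List[float] = []
        counter = 0
        # Hash the text together with a running counter until `dim` bytes exist.
        while len(values) < dim:
            digest = hashlib.sha256(data + counter.to_bytes(4, "big")).digest()
            # Map each byte from [0, 255] to [-1.0, 1.0].
            values.extend(b / 127.5 - 1.0 for b in digest)
            counter += 1
        values = values[:dim]
        # No byte maps to exactly 0.0, so the norm is always positive.
        norm = math.sqrt(sum(v * v for v in values))
        vectors.append([v / norm for v in values])
    return vectors


if __name__ == "__main__":
    first = simple_hash_embed(["hello", ""], dim=8)
    second = simple_hash_embed(["hello", ""], dim=8)
    print(first == second)          # same input, same output
    print([len(v) for v in first])  # every vector has `dim` entries
```

Using `hashlib` rather than the built-in `hash()` is what keeps the output stable between interpreter runs, since `hash()` on strings is randomized per process by default.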

Exports & Registry

  • __init__.py updated to expose and register the new embedding under the name "local_simple_hash" for both direct import and config-driven creation.
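Name-based registration is what makes config-driven creation possible: a dictionary maps a string key to a class, and a helper looks the class up and builds an instance from a config dict. The sketch below imitates that pattern with stand-ins. `REGISTRY`, `DummyHashEmbedding`, `build_from_config`, and `from_config` are hypothetical names chosen for illustration, not Chroma's actual API.

```python
from typing import Any, Dict, List, Type


class DummyHashEmbedding:
    """Hypothetical stand-in for an embedding function with a `dim` setting."""

    def __init__(self, dim: int = 16) -> None:
        if dim <= 0:
            raise ValueError("dim must be positive")
        self.dim = dim

    def __call__(self, texts: List[str]) -> List[List[float]]:
        # Placeholder vectors; a real implementation would hash the text.
        return [[0.0] * self.dim for _ in texts]

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "DummyHashEmbedding":
        return cls(dim=config.get("dim", 16))

    def get_config(self) -> Dict[str, Any]:
        return {"dim": self.dim}


# Registry: string key -> class (assumed shape, for illustration only).
REGISTRY: Dict[str, Type[DummyHashEmbedding]] = {
    "local_simple_hash": DummyHashEmbedding,
}


def build_from_config(spec: Dict[str, Any]) -> DummyHashEmbedding:
    """Look up spec["name"] in the registry and build it from spec["config"]."""
    name = spec["name"]
    if name not in REGISTRY:
        raise ValueError(f"Unknown embedding function: {name}")
    return REGISTRY[name].from_config(spec.get("config", {}))


ef = build_from_config({"name": "local_simple_hash", "config": {"dim": 16}})
print(len(ef(["hello"])[0]))  # vector length follows the configured dim
```

Keeping `get_config` symmetric with `from_config` means a function can be rebuilt from its own saved settings.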

Tests

  • test_simple_hash_embedding.py added to validate:

    • Deterministic embeddings (same input → same output)
    • Handling of empty strings and long text
    • Type flexibility (non-string inputs auto-stringified)
    • Dimensional consistency and correct dtype
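The checks listed above could look roughly like the following. This is a sketch, not the PR's test file: `embed` is a hypothetical stand-in defined inline so the example is self-contained, and it returns lists of floats, so the dtype check is replaced by an element-type check.

```python
import hashlib
from typing import Any, List


def embed(items: List[Any], dim: int = 8) -> List[List[float]]:
    """Hypothetical stand-in under test: first `dim` SHA-256 bytes scaled to [0, 1]."""
    if not 0 < dim <= 32:
        raise ValueError("this stand-in supports 1 <= dim <= 32")
    out = []
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        out.append([b / 255.0 for b in digest[:dim]])
    return out


def test_deterministic() -> None:
    assert embed(["chroma"]) == embed(["chroma"])


def test_empty_and_long_text() -> None:
    vectors = embed(["", "x" * 100_000])
    assert all(len(v) == 8 for v in vectors)


def test_non_string_inputs_are_stringified() -> None:
    assert embed([42]) == embed(["42"])


def test_dimension_and_element_type() -> None:
    (vector,) = embed(["chroma"], dim=16)
    assert len(vector) == 16
    assert all(isinstance(x, float) for x in vector)


if __name__ == "__main__":
    # Run without pytest; pytest would also collect the test_* functions.
    for test in (
        test_deterministic,
        test_empty_and_long_text,
        test_non_string_inputs_are_stringified,
        test_dimension_and_element_type,
    ):
        test()
    print("all checks passed")
```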

Example Script

  • examples/local_simple_hash_example.py

    • Runnable demonstration of embedding generation and configuration-based construction.
    • Prints embedding metadata (length, dtype, norm) for given inputs.

Documentation

  • examples/README.md – references the new example script.
  • DEVELOP.md – includes a short note for Windows Python-only environments relevant to local testing.
  • docs/docs.trychroma.com/.../embedding-functions.md – adds a new entry describing local_simple_hash usage and config pattern.

Test plan

Local Testing (Windows PowerShell or Unix Terminal)

  1. Install editable package:

    python -m pip install -e .
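  2. Run pre-commit only on the modified files (the trailing \ line continuations are for Unix shells; in PowerShell, use a backtick instead or put all paths on one line):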

    pre-commit run --files \
    chromadb/utils/embedding_functions/simple_hash_embedding_function.py \
    chromadb/utils/embedding_functions/__init__.py \
    chromadb/test/test_simple_hash_embedding.py \
    examples/local_simple_hash_example.py \
    DEVELOP.md examples/README.md \
    docs/docs.trychroma.com/markdoc/content/docs/embeddings/embedding-functions.md
  3. Execute new unit tests:

    python -m pytest -q chromadb/test/test_simple_hash_embedding.py
  4. Run the example script:

    python examples/local_simple_hash_example.py

Expected output:

  • Embedding metadata (vector length, dtype, L2 norm) for each input.

  • A second embedding displayed for the function created via config:

    config_to_embedding_function({"name": "local_simple_hash", "config": {"dim": 16}})

Validation performed:

  • pytest passed for all added tests.
  • pre-commit auto-fixes applied and passed on modified files.
  • Confirmed deterministic behavior across repeated runs.
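Determinism across repeated runs is best checked in separate processes, because Python randomizes the built-in `hash()` for strings per process unless `PYTHONHASHSEED` is fixed, whereas `hashlib` digests do not vary. The sketch below is not taken from the PR; it runs a short snippet in two fresh interpreters and compares the output. The helper name `run_in_fresh_interpreter` is hypothetical.

```python
import subprocess
import sys

# One-liner executed in each child interpreter.
SNIPPET = (
    "import hashlib;"
    "print(hashlib.sha256('hello'.encode('utf-8')).hexdigest())"
)


def run_in_fresh_interpreter(code: str) -> str:
    """Run `code` in a new Python process and return its stripped stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


first = run_in_fresh_interpreter(SNIPPET)
second = run_in_fresh_interpreter(SNIPPET)
print(first == second)  # hashlib digests match across processes
```

The same harness can wrap a call to the embedding function itself to confirm that its vectors do not change between interpreter launches.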

Migration plan

No migrations or backward compatibility concerns.
This feature is isolated, purely additive, and does not impact existing embeddings or APIs.
Existing users or pipelines remain unaffected.


Observability plan

No new runtime instrumentation required.
Developers can validate correct functionality locally via:

  • Running the included example script.
  • Unit test logs showing deterministic embedding results.

No production monitoring changes needed — this is a local-only utility function.


Documentation Changes

  • Added documentation:

    • New entry in embedding-functions.md describing local_simple_hash and its config-driven usage pattern.
  • Updated:

    • DEVELOP.md with an additional note relevant to Windows Python-only contributor setups.
    • examples/README.md to include the runnable example reference.

All doc changes validated for formatting and linted with pre-commit.


Notes for reviewers

  • The PR is intentionally scoped and Python-only, avoiding modifications to Rust or complex integrations.
  • Designed to simplify CI smoke testing and onboarding for contributors.
  • If CI shows unrelated flake8 or mypy warnings, note that only the files listed above were intentionally changed.
  • A repo-wide lint/format PR can be submitted separately if desired.

- Adds a deterministic, dependency-free 'local_simple_hash' embedding for smoke tests
- Adds unit tests covering determinism, long strings, and non-string inputs
- Adds example script and updates DEVELOP.md and examples/README.md
- Adds docs section describing lightweight local embeddings
@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (readability, modularity, intuitiveness)?

@propel-code-bot
Contributor

propel-code-bot bot commented Oct 24, 2025

Add local_simple_hash deterministic embedding function with tests, docs & examples

Introduces a tiny, dependency-free embedding implementation (SimpleHashEmbeddingFunction) intended for smoke tests, CI runs and onboarding scenarios. The PR wires the new class into the embedding registry (known_embedding_functions), supplies unit tests, a runnable example script, Windows-specific dev-setup docs, and updates general documentation and READMEs. All changes are Python-only and strictly additive.

Key Changes

• New file chromadb/utils/embedding_functions/simple_hash_embedding_function.py implementing SimpleHashEmbeddingFunction (deterministic hash-based embeddings, configurable dimension, supports build/get_config/validation).
• Registers new embedding under key "local_simple_hash" in chromadb/utils/embedding_functions/__init__.py and exports symbol.
• Adds unit suite chromadb/test/test_simple_hash_embedding.py covering determinism, edge cases, config integration.
• Adds example script examples/local_simple_hash_example.py plus notebook and README/DEVELOP.md snippets for Windows Python-only setup.
• Extends docs (embedding-functions.md, examples/README.md, README.md) to describe lightweight local embeddings and usage pattern.

Affected Areas

chromadb.utils.embedding_functions registry
chromadb.utils.embedding_functions.simple_hash_embedding_function
• Documentation & examples
• Unit-test corpus

This summary was automatically generated by @propel-code-bot

…ion.py


Update the __call__ method to raise ValueError if the input is None

Co-authored-by: propel-code-bot[bot] <203372662+propel-code-bot[bot]@users.noreply.github.com>
@Himanshu7921
Author

Hi maintainers 👋

Could you please approve the workflow runs for this pull request?
It looks like the required GitHub Actions workflows are currently awaiting manual approval, which is blocking the status checks and further review.

Once the workflows are approved, all required checks should run and report their results, allowing the PR to move forward in the review and merge process.

If you have any feedback or requested changes after the checks complete, please let me know—I’m happy to address them promptly.

Thank you very much for your time and for helping with the approval and review!
