Embedding service for Data House

The embedding service transform text to a vector representation. This can be useful, for example, for semantic textual similarity or semantic search.

An embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space.

The service can use sentence transformers models available on Hugging Face.

The default model is T-Systems-onsite/cross-en-de-roberta-sentence-transformer which supports English and German.

Getting started

The Embedding service is available as a Docker image.

docker pull ghcr.io/data-house/embedding-service:main

The model is downloaded at startup time, so depending on the model size the first start can take several minutes. We suggest to mount a persistent volume to cache the downloaded models (folder: /root/.cache/torch/sentence_transformers).

A sample docker-compose.yaml file is available within the repository.

Please refer to Releases and Packages for the available tags.

Available environment variables

variable	default	description
`MODEL_NAME`	`T-Systems-onsite/cross-en-de-roberta-sentence-transformer`	The name of a sentence transformer model published on Hugging Face
`STRICT_MODE`	`false`	Whenever to return an error if the text to embed is longer than the maximum context length of the model
`WORKERS`	2	The number of Gunicorn sync workers
`WORKERS_TIMEOUT`	600	The timeout, in seconds, of each worker

Usage

The Embedding service expose a web application on port 5000. The available API receive the text and return the vector representation as a JSON response.

The exposed service is unauthenticated therefore consider exposing it only within a trusted network. If you plan to make it available publicly consider adding a reverse proxy with authentication in front.

Embed endpoint

POST /embed

The /embed endpoint accepts a POST request with the following input as a json body:

corpus the text to transform

It returns an JSON with the following fields:

embedding the array representing the embedding of the given text

warning The processing is performed synchronously

Embed maximum size endpoint

GET /embed

To obtain the maximum embedding size, the /embed endpoint accepts a GET request with no parameters.

It returns an JSON with the following fields:

embedding_size the maximum length of the embedding

Error handling

The service can return the following errors

code	message	description
`422`	Missing corpus in json body	In case no text is passed to the API
`422`	Corpus too long	In case the text to embed is too long for the model, when STRICT_MODE is enabled

The body of the response can contain a JSON with the following fields:

error the error description

{
  "error": "Missing corpus in json body",
}

Development

The Embedding service is built using Flask on Python 3.9.

Given the selected stack the development requires:

Python 3.9 with PIP
Docker (optional) to test the build

Install all the required dependencies:

pip install -r requirements.txt

Run the local development application using:

python -m flask --app embedding_service run

Testing

to be documented

Contributing

Thank you for considering contributing to the Embedding service! The contribution guide can be found in the CONTRIBUTING.md file.

Supporters

The project is supported by OneOff-Tech (UG) and Oaks S.r.l.

Security Vulnerabilities

If you discover a security vulnerability within Embedding service, please send an e-mail to OneOff-Tech team via [email protected]. All security vulnerabilities will be promptly addressed.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.github		.github
embedding_service		embedding_service
.dockerignore		.dockerignore
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yaml		docker-compose.yaml
gunicorn.sh		gunicorn.sh
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Embedding service for Data House

Getting started

Usage

Embed endpoint

Embed maximum size endpoint

Error handling

Development

Testing

Contributing

Supporters

Security Vulnerabilities

About

Releases

Packages

Contributors 2

Languages

data-house/embedding-service

Folders and files

Latest commit

History

Repository files navigation

Embedding service for Data House

Getting started

Usage

Embed endpoint

Embed maximum size endpoint

Error handling

Development

Testing

Contributing

Supporters

Security Vulnerabilities

About

Topics

Resources

Security policy

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages