LanceDB Destination #1375

Merged
merged 179 commits into from
Jun 27, 2024
179 commits
5c35057
Added lancedb as an optional dependency
Pipboyguy May 16, 2024
f4450a7
Added lancedb to dependencies in test workflow
Pipboyguy May 16, 2024
0808bb2
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy May 17, 2024
5380892
Add initial capabilities for LanceDB destination
Pipboyguy May 17, 2024
0f40ffa
Added new lancedb_adapter
Pipboyguy May 17, 2024
d34ab6f
Added LanceDB factory in destinations implementation
Pipboyguy May 17, 2024
0f3d57f
Added LanceDB client configuration with embedding details
Pipboyguy May 17, 2024
69e1daa
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy May 17, 2024
c51a9bc
Added LanceDB Client with data load and schema management functionali…
Pipboyguy May 20, 2024
c49ee2b
Merge remote-tracking branch 'origin/devel' into 1370-lancedb-destina…
Pipboyguy May 29, 2024
389d06b
Lockfile
Pipboyguy May 29, 2024
1d9a072
Wireframe LanceDB client implementation
Pipboyguy May 30, 2024
8c0de4c
Add abstract methods
Pipboyguy May 30, 2024
d6d02b2
Enhance LanceDB client with additional functionality
Pipboyguy May 31, 2024
ada3ebe
Add tests and GitHub workflow for LanceDB destination
Pipboyguy Jun 1, 2024
5d961b5
Update Python version to 3.11.x in GitHub workflow
Pipboyguy Jun 2, 2024
c3cad2f
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy Jun 3, 2024
f97256e
Refactor and cleanup LanceDBClient and LoadLanceDBJob classes
Pipboyguy Jun 3, 2024
1b68391
Refactor load tests in lancedb/utils.py and add test for LanceDB mode…
Pipboyguy Jun 3, 2024
ae70f97
Added functionality to infer LanceDB model from data and refactored n…
Pipboyguy Jun 3, 2024
2222a88
Remove storage options
Pipboyguy Jun 4, 2024
f819c25
Refactor test pipeline and implement lancedb_adapter in LanceDBClient
Pipboyguy Jun 4, 2024
8bbd515
Add schema argument to LoadLanceDBJob function
Pipboyguy Jun 4, 2024
fb7565e
Format
Pipboyguy Jun 4, 2024
4c73541
Refactor LanceDB related code and increase type hint coverage
Pipboyguy Jun 4, 2024
bfcc8bb
Refactor LanceDB client and tests, enhance DB type mapping
Pipboyguy Jun 4, 2024
4827798
Refactor code to improve readability by reducing line breaks
Pipboyguy Jun 4, 2024
53c0b0d
Refactor LanceDB client code by adding schema_conversion and utils mo…
Pipboyguy Jun 4, 2024
2cfbdb4
Remove redundant variables in lancedb_client.py
Pipboyguy Jun 4, 2024
092fcf0
Refactor code to improve readability and move environment variable se…
Pipboyguy Jun 4, 2024
b1783b2
Refactor LanceDB client implementation and error handling
Pipboyguy Jun 4, 2024
08dcea1
Refactor code for better readability and add type ignore comments
Pipboyguy Jun 4, 2024
ad43996
Added lancedb as an optional dependency
Pipboyguy May 16, 2024
c225916
Added lancedb to dependencies in test workflow
Pipboyguy May 16, 2024
cfc4038
Add initial capabilities for LanceDB destination
Pipboyguy May 17, 2024
7e1b651
Added new lancedb_adapter
Pipboyguy May 17, 2024
892c4f8
Added LanceDB factory in destinations implementation
Pipboyguy May 17, 2024
b4e8ad5
Added LanceDB client configuration with embedding details
Pipboyguy May 17, 2024
605ec89
Added LanceDB Client with data load and schema management functionali…
Pipboyguy May 20, 2024
8cd3208
Wireframe LanceDB client implementation
Pipboyguy May 30, 2024
d4a9314
Add abstract methods
Pipboyguy May 30, 2024
99fb712
Enhance LanceDB client with additional functionality
Pipboyguy May 31, 2024
e024925
Add tests and GitHub workflow for LanceDB destination
Pipboyguy Jun 1, 2024
c1e84b5
Update Python version to 3.11.x in GitHub workflow
Pipboyguy Jun 2, 2024
fe0cfbe
Refactor and cleanup LanceDBClient and LoadLanceDBJob classes
Pipboyguy Jun 3, 2024
32fd5c8
Refactor load tests in lancedb/utils.py and add test for LanceDB mode…
Pipboyguy Jun 3, 2024
1eea62a
Added functionality to infer LanceDB model from data and refactored n…
Pipboyguy Jun 3, 2024
eb05fcd
Remove storage options
Pipboyguy Jun 4, 2024
d560269
Refactor test pipeline and implement lancedb_adapter in LanceDBClient
Pipboyguy Jun 4, 2024
80eef96
Add schema argument to LoadLanceDBJob function
Pipboyguy Jun 4, 2024
1414ded
Format
Pipboyguy Jun 4, 2024
033c981
Refactor LanceDB related code and increase type hint coverage
Pipboyguy Jun 4, 2024
c655c8a
Refactor LanceDB client and tests, enhance DB type mapping
Pipboyguy Jun 4, 2024
1815da4
Refactor code to improve readability by reducing line breaks
Pipboyguy Jun 4, 2024
32478ae
Refactor LanceDB client code by adding schema_conversion and utils mo…
Pipboyguy Jun 4, 2024
c4bd4f2
Remove redundant variables in lancedb_client.py
Pipboyguy Jun 4, 2024
6c74cc5
Refactor code to improve readability and move environment variable se…
Pipboyguy Jun 4, 2024
9e5d6db
Refactor LanceDB client implementation and error handling
Pipboyguy Jun 4, 2024
fb6974d
Refactor code for better readability and add type ignore comments
Pipboyguy Jun 4, 2024
ee239e5
Dependency Versioning
Pipboyguy Jun 6, 2024
42e32d9
Merge remote-tracking branch 'origin/1370-lancedb-destination' into 1…
Pipboyguy Jun 6, 2024
827ebd3
Remove unnecessary dependencies and update lancedb and pylance versions
Pipboyguy Jun 6, 2024
71e3579
Silence mypy warnings
Pipboyguy Jun 6, 2024
d5b8cda
Revert mypy ignores
Pipboyguy Jun 6, 2024
3ea0a3a
Revert mypy ignores
Pipboyguy Jun 6, 2024
395e1b7
Fix versioning with 3.8
Pipboyguy Jun 6, 2024
ec2774d
Fix versioning
Pipboyguy Jun 6, 2024
1591a1f
Update default URI and dataset separator in LanceDB configuration
Pipboyguy Jun 6, 2024
7f3c772
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy Jun 7, 2024
a727cf3
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy Jun 8, 2024
66d1cc7
Refactor LanceDB typemapper with timestamp and decimal precision adju…
Pipboyguy Jun 8, 2024
21bc285
Updated method for retrieving sentinel table name
Pipboyguy Jun 8, 2024
7fadbe5
Remove redundant table normalisation for version_table_name
Pipboyguy Jun 8, 2024
bda6123
Refactor LanceDB functionalities and improve handling of optional emb…
Pipboyguy Jun 8, 2024
74ac9f3
Refactor LanceDBClient and update parameter defaults in schema.py
Pipboyguy Jun 8, 2024
e4b3a8d
Added lancedb to default vector configs and improved type annotations…
Pipboyguy Jun 9, 2024
5e8718c
Return self in enter context manager method
Pipboyguy Jun 9, 2024
310eccf
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy Jun 10, 2024
6a5db5c
Handle FileNotFoundError
Pipboyguy Jun 10, 2024
f4bcfe7
Replace FileNotFoundError with DestinationUndefinedEntity in lancedb_…
Pipboyguy Jun 10, 2024
b193f38
Refactor LanceDB client for simplified table name handling
Pipboyguy Jun 10, 2024
6e61c7a
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy Jun 10, 2024
4701547
Refactored LanceDB schema creation and storage update processes to py…
Pipboyguy Jun 10, 2024
74080a4
Remove LanceModels
Pipboyguy Jun 10, 2024
f3ac7e2
Ensure 'records' is a list in lancedb_client.py
Pipboyguy Jun 11, 2024
173be9e
Refactor code and add batch error handling in lancedb client
Pipboyguy Jun 11, 2024
9d6c6d2
Refactor LanceDB client and schema for improved embedding handling
Pipboyguy Jun 11, 2024
bb165a8
Improve error handling and retries in LanceDB client
Pipboyguy Jun 11, 2024
2adeaea
Add error decorator to get_stored_state method in lancedb_client
Pipboyguy Jun 11, 2024
ba02ed5
Change error handling from FileNotFoundError to IndexError
Pipboyguy Jun 11, 2024
432dd76
Refactor lancedb_client.py and add error decorators
Pipboyguy Jun 11, 2024
fa1202e
Add configurable read consistency to LanceDB client
Pipboyguy Jun 11, 2024
8963876
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy Jun 12, 2024
26f497e
Versioning
Pipboyguy Jun 12, 2024
30cc9a7
Refactor code for readability and change return type in tests
Pipboyguy Jun 12, 2024
175b6db
Update queries in lancedb_client to order by insertion date
Pipboyguy Jun 12, 2024
719fcfb
Refactor LanceDB client and schema for better table creation and mana…
Pipboyguy Jun 12, 2024
aa683da
Combine "skip" and "append" write dispositions in batch upload
Pipboyguy Jun 12, 2024
6483152
Add schema version hash check in LanceDB client write operations
Pipboyguy Jun 12, 2024
65ede6f
Remove testing code
Pipboyguy Jun 12, 2024
fbc558f
Refactor return statement in lancedb_client for successful state loads
Pipboyguy Jun 12, 2024
8fbae21
Update lancedb_client.py to improve table handling and embedding fields
Pipboyguy Jun 12, 2024
88bc519
Refactor LanceDB schema generation and handle metadata for embedding …
Pipboyguy Jun 12, 2024
aab21a6
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy Jun 13, 2024
3b85442
Refactor schema creation and remove unused code
Pipboyguy Jun 13, 2024
31e1895
Add mapping for provider environment variables and update schema comment
Pipboyguy Jun 13, 2024
9400e02
Update package versions in pyproject.toml and poetry.lock
Pipboyguy Jun 13, 2024
10c41a8
Refactor LanceDB utils and client, handle exception and remove unnece…
Pipboyguy Jun 13, 2024
9ba97a8
Refactor utility functions in lancedb tests
Pipboyguy Jun 13, 2024
97127a5
Update 'replace' mode and improve table handling in lancedb client
Pipboyguy Jun 13, 2024
b77b6f7
Refactor assert_unordered_list_equal to handle dictionaries
Pipboyguy Jun 13, 2024
bc8948e
Refactor code for better readability and remove unnecessary blank lines
Pipboyguy Jun 13, 2024
c9a1667
Refactor code for readability and remove redundant comments
Pipboyguy Jun 13, 2024
84357d7
Update sentinel table name in test_pipeline.py
Pipboyguy Jun 13, 2024
a334c98
"Add order by clause to database query in lancedb_client"
Pipboyguy Jun 13, 2024
d4b56da
Use super method to reduce redundancy
Pipboyguy Jun 14, 2024
8f89d38
Syntax
Pipboyguy Jun 14, 2024
3b44631
Remove bare except clauses
Pipboyguy Jun 14, 2024
ee6f525
Revert "Remove bare except clauses"
Pipboyguy Jun 14, 2024
d8db465
Remove bare except clause
Pipboyguy Jun 14, 2024
5aed9e5
Remove bare except clause
Pipboyguy Jun 14, 2024
cd68966
Remove bare except clause
Pipboyguy Jun 14, 2024
11f562f
Remove bare except clause
Pipboyguy Jun 14, 2024
06408b9
Refactor error handling in LanceDB client
Pipboyguy Jun 14, 2024
cf41859
Add configurable sentinel table name in LanceDB client configuration
Pipboyguy Jun 14, 2024
bbc00a5
Update embedding model config and schema in LanceDB
Pipboyguy Jun 14, 2024
9322477
Refactor lancedb_client.py, remove unused methods and imports
Pipboyguy Jun 14, 2024
9b4c519
Add support for adding multiple fields to LanceDB table in a single o…
Pipboyguy Jun 15, 2024
71ffaaa
Only filter by successful loads
Pipboyguy Jun 15, 2024
9a7c0b5
Remove redundant exception handling in JSON extraction
Pipboyguy Jun 15, 2024
a854649
Refactor lancedb_client.py for better code readability
Pipboyguy Jun 15, 2024
2751699
Merge remote-tracking branch 'origin/1370-lancedb-destination' into 1…
Pipboyguy Jun 15, 2024
348825f
Refactor lancedb_client.py for improved code readability
Pipboyguy Jun 15, 2024
4bfee6d
Fix module docstring
Pipboyguy Jun 15, 2024
54ef1ac
Remove embedding_fields from make_arrow_field_schema function
Pipboyguy Jun 15, 2024
a11f0e4
Add merge key support
Pipboyguy Jun 15, 2024
aff0032
Refactor `get_stored_state` to perform join in memory
Pipboyguy Jun 15, 2024
7c693b8
Packaging
Pipboyguy Jun 15, 2024
78cb85d
Format
Pipboyguy Jun 15, 2024
de23786
Update dependencies in GitHub workflow for testing lancedb
Pipboyguy Jun 15, 2024
894585c
Add "cohere" to package dependencies in pyproject.toml
Pipboyguy Jun 15, 2024
c228ea3
Dependencies
Pipboyguy Jun 16, 2024
d472e16
Update dependencies installation in GitHub workflow
Pipboyguy Jun 16, 2024
ba6d8b8
Dependencies
Pipboyguy Jun 16, 2024
71b7fce
Update dependency in GitHub workflow
Pipboyguy Jun 16, 2024
b7d1ebf
Dependencies
Pipboyguy Jun 16, 2024
a4f355d
Dependencies
Pipboyguy Jun 16, 2024
a7249b2
Add documentation for LanceDB
Pipboyguy Jun 16, 2024
1052757
Add limitations
Pipboyguy Jun 16, 2024
5c7288b
Offload ordering logic from LanceDB
Pipboyguy Jun 16, 2024
554c471
Update import statements in lancedb client and exceptions files
Pipboyguy Jun 16, 2024
3750aae
Create `_get_table_name` getter
Pipboyguy Jun 16, 2024
289d679
Format
Pipboyguy Jun 16, 2024
a8e4e62
Avoid race conditions by delegating all state management to dlt
Pipboyguy Jun 16, 2024
b977e5c
Imports
Pipboyguy Jun 16, 2024
232a84f
small doc and test fixes
sh-rp Jun 17, 2024
e08bb8a
Fix OpenAI embedding handling of empty strings
Pipboyguy Jun 17, 2024
a862742
Add 'embeddings' dependencies manually
Pipboyguy Jun 17, 2024
c722979
Finally...
Pipboyguy Jun 17, 2024
eb16858
Dependencies
Pipboyguy Jun 17, 2024
d510af1
Dependencies
Pipboyguy Jun 17, 2024
15a65c9
Docs
Pipboyguy Jun 17, 2024
5d42963
Remove superfluous helper method.
Pipboyguy Jun 18, 2024
9e4d8bd
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy Jun 21, 2024
e07ab1f
Lock File
Pipboyguy Jun 21, 2024
ec78282
Make api_key and embedding_model_provider_api_key optional
Pipboyguy Jun 21, 2024
9d3c57d
Clear environment for config test
Pipboyguy Jun 21, 2024
7f08f85
Minor test config
Pipboyguy Jun 21, 2024
db761e9
test config
Pipboyguy Jun 21, 2024
e450f47
lancedb config
Pipboyguy Jun 21, 2024
44405d7
Config test
Pipboyguy Jun 21, 2024
656a9b7
Config
Pipboyguy Jun 21, 2024
107907a
config
Pipboyguy Jun 21, 2024
114e74d
Import lancedb_adapter function instead of module in adapter collecti…
Pipboyguy Jun 21, 2024
433ce28
Merge branch 'devel' into 1370-lancedb-destination
Pipboyguy Jun 25, 2024
38f9e11
Clarify embedding facilities in LanceDB docs
Pipboyguy Jun 25, 2024
a318ccb
Merge branch 'devel' into 1370-lancedb-destination
sh-rp Jun 27, 2024
19f583d
update lancedb to support new naming setup (cleanup will follow)
sh-rp Jun 27, 2024
db1e81d
update lockfile
sh-rp Jun 27, 2024
81 changes: 81 additions & 0 deletions .github/workflows/test_destination_lancedb.yml
@@ -0,0 +1,81 @@
name: dest | lancedb

on:
  pull_request:
    branches:
      - master
      - devel
  workflow_dispatch:
  schedule:
    - cron: '0 2 * * *'

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

env:
  DLT_SECRETS_TOML: ${{ secrets.DLT_SECRETS_TOML }}

  RUNTIME__SENTRY_DSN: https://[email protected]/4504819859914752
  RUNTIME__LOG_LEVEL: ERROR
  RUNTIME__DLTHUB_TELEMETRY_ENDPOINT: ${{ secrets.RUNTIME__DLTHUB_TELEMETRY_ENDPOINT }}

  ACTIVE_DESTINATIONS: "[\"lancedb\"]"
  ALL_FILESYSTEM_DRIVERS: "[\"memory\"]"

jobs:
  get_docs_changes:
    name: docs changes
    uses: ./.github/workflows/get_docs_changes.yml
    if: ${{ !github.event.pull_request.head.repo.fork || contains(github.event.pull_request.labels.*.name, 'ci from fork')}}

  run_loader:
    name: dest | lancedb tests
    needs: get_docs_changes
    if: needs.get_docs_changes.outputs.changes_outside_docs == 'true'
    defaults:
      run:
        shell: bash
    runs-on: "ubuntu-latest"

    steps:
      - name: Check out
        uses: actions/checkout@master

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11.x"

      - name: Install Poetry
        uses: snok/[email protected]
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true

      - name: Load cached venv
        id: cached-poetry-dependencies
        uses: actions/cache@v3
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}-gcp

      - name: create secrets.toml
        run: pwd && echo "$DLT_SECRETS_TOML" > tests/.dlt/secrets.toml

      - name: Install dependencies
        run: poetry install --no-interaction -E lancedb -E parquet --with sentry-sdk --with pipeline

      - name: Install embedding provider dependencies
        run: poetry run pip install openai

      - run: |
          poetry run pytest tests/load -m "essential"
        name: Run essential tests Linux
        if: ${{ ! (contains(github.event.pull_request.labels.*.name, 'ci full') || github.event_name == 'schedule')}}

      - run: |
          poetry run pytest tests/load
        name: Run all tests Linux
        if: ${{ contains(github.event.pull_request.labels.*.name, 'ci full') || github.event_name == 'schedule'}}
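The last two steps of this workflow carry mutually exclusive `if:` conditions: the full suite runs when the pull request has the `ci full` label or the run was triggered by the schedule, and the `essential` subset runs otherwise. A minimal Python sketch of that selection logic follows. The helper name `select_test_scope` is hypothetical and not part of the PR, and the lower-casing approximates the case-insensitive matching that GitHub documents for `contains`.

```python
from typing import Iterable


def select_test_scope(labels: Iterable[str], event_name: str) -> str:
    """Return the pytest arguments the workflow would pick for a given run.

    Hypothetical helper mirroring the two `if:` expressions in the workflow.
    """
    normalized = {label.lower() for label in labels}
    # Full suite: PR labelled 'ci full' or a scheduled (cron) run.
    run_full = "ci full" in normalized or event_name == "schedule"
    if run_full:
        return "tests/load"
    # Otherwise only tests marked "essential" are run.
    return 'tests/load -m "essential"'
```

Because the second condition is the exact negation of the first, exactly one of the two test steps executes on any run that reaches them.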
2 changes: 1 addition & 1 deletion .github/workflows/test_doc_snippets.yml
@@ -61,7 +61,7 @@ jobs:

       - name: Install dependencies
         # if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
-        run: poetry install --no-interaction -E duckdb -E weaviate -E parquet -E qdrant -E bigquery -E postgres --with docs,sentry-sdk --without airflow
+        run: poetry install --no-interaction -E duckdb -E weaviate -E parquet -E qdrant -E bigquery -E postgres -E lancedb --with docs,sentry-sdk --without airflow

       - name: create secrets.toml for examples
         run: pwd && echo "$DLT_SECRETS_TOML" > docs/examples/.dlt/secrets.toml
2 changes: 2 additions & 0 deletions dlt/destinations/__init__.py
@@ -8,6 +8,7 @@
 from dlt.destinations.impl.athena.factory import athena
 from dlt.destinations.impl.redshift.factory import redshift
 from dlt.destinations.impl.qdrant.factory import qdrant
+from dlt.destinations.impl.lancedb.factory import lancedb
 from dlt.destinations.impl.motherduck.factory import motherduck
 from dlt.destinations.impl.weaviate.factory import weaviate
 from dlt.destinations.impl.destination.factory import destination
@@ -28,6 +29,7 @@
     "athena",
     "redshift",
     "qdrant",
+    "lancedb",
     "motherduck",
     "weaviate",
     "synapse",
2 changes: 2 additions & 0 deletions dlt/destinations/adapters.py
@@ -2,6 +2,7 @@

 from dlt.destinations.impl.weaviate.weaviate_adapter import weaviate_adapter
 from dlt.destinations.impl.qdrant.qdrant_adapter import qdrant_adapter
+from dlt.destinations.impl.lancedb import lancedb_adapter
 from dlt.destinations.impl.bigquery.bigquery_adapter import bigquery_adapter
 from dlt.destinations.impl.synapse.synapse_adapter import synapse_adapter
 from dlt.destinations.impl.clickhouse.clickhouse_adapter import clickhouse_adapter
@@ -10,6 +11,7 @@
 __all__ = [
     "weaviate_adapter",
     "qdrant_adapter",
+    "lancedb_adapter",
     "bigquery_adapter",
     "synapse_adapter",
     "clickhouse_adapter",
1 change: 1 addition & 0 deletions dlt/destinations/impl/lancedb/__init__.py
@@ -0,0 +1 @@
from dlt.destinations.impl.lancedb.lancedb_adapter import lancedb_adapter
111 changes: 111 additions & 0 deletions dlt/destinations/impl/lancedb/configuration.py
@@ -0,0 +1,111 @@
import dataclasses
from typing import Optional, Final, Literal, ClassVar, List

from dlt.common.configuration import configspec
from dlt.common.configuration.specs.base_configuration import (
    BaseConfiguration,
    CredentialsConfiguration,
)
from dlt.common.destination.reference import DestinationClientDwhConfiguration
from dlt.common.typing import TSecretStrValue
from dlt.common.utils import digest128


@configspec
class LanceDBCredentials(CredentialsConfiguration):
    uri: Optional[str] = ".lancedb"
    """LanceDB database URI. Defaults to local, on-disk instance.

    The available schemas are:

    - `/path/to/database` - local database.
    - `db://host:port` - remote database (LanceDB cloud).
    """
    api_key: Optional[TSecretStrValue] = None
    """API key for the remote connections (LanceDB cloud)."""
    embedding_model_provider_api_key: Optional[str] = None
    """API key for the embedding model provider."""

    __config_gen_annotations__: ClassVar[List[str]] = [
        "uri",
        "api_key",
        "embedding_model_provider_api_key",
    ]


@configspec
class LanceDBClientOptions(BaseConfiguration):
    max_retries: Optional[int] = 3
    """`EmbeddingFunction` class wraps the calls for source and query embedding
    generation inside a rate limit handler that retries the requests with exponential
    backoff after successive failures.

    You can tune it by setting it to a different number, or disable it by setting it to 0."""

    __config_gen_annotations__: ClassVar[List[str]] = [
        "max_retries",
    ]


TEmbeddingProvider = Literal[
    "gemini-text",
    "bedrock-text",
    "cohere",
    "gte-text",
    "imagebind",
    "instructor",
    "open-clip",
    "openai",
    "sentence-transformers",
    "huggingface",
    "colbert",
]


@configspec
class LanceDBClientConfiguration(DestinationClientDwhConfiguration):
    destination_type: Final[str] = dataclasses.field(  # type: ignore
        default="LanceDB", init=False, repr=False, compare=False
    )
    credentials: LanceDBCredentials = None
    dataset_separator: str = "___"
    """Character for the dataset separator."""
    dataset_name: Final[Optional[str]] = dataclasses.field(  # type: ignore
        default=None, init=False, repr=False, compare=False
    )

    options: Optional[LanceDBClientOptions] = None
    """LanceDB client options."""

    embedding_model_provider: TEmbeddingProvider = "cohere"
    """Embedding provider used for generating embeddings. Default is "cohere". You can find the full list of
    providers at https://github.com/lancedb/lancedb/tree/main/python/python/lancedb/embeddings as well as
    https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/."""
    embedding_model: str = "embed-english-v3.0"
    """The model used by the embedding provider for generating embeddings.
    Check with the embedding provider which options are available.
    Reference https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/."""
    embedding_model_dimensions: Optional[int] = None
    """The dimensions of the embeddings generated. In most cases it will be automatically inferred, by LanceDB,
    but it is configurable in rare cases.

    Make sure it corresponds with the associated embedding model's dimensionality."""
    vector_field_name: str = "vector__"
    """Name of the special field to store the vector embeddings."""
    id_field_name: str = "id__"
    """Name of the special field to manage deduplication."""
    sentinel_table_name: str = "dltSentinelTable"
    """Name of the sentinel table that encapsulates datasets. Since LanceDB has no
    concept of schemas, this table serves as a proxy to group related dlt tables together."""

    __config_gen_annotations__: ClassVar[List[str]] = [
        "embedding_model",
        "embedding_model_provider",
    ]

    def fingerprint(self) -> str:
        """Returns a fingerprint of a connection string."""

        if self.credentials and self.credentials.uri:
            return digest128(self.credentials.uri)
        return ""
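The fields above can presumably be supplied the way other dlt destination settings are, for example through environment variables. The sketch below assumes dlt's usual naming convention (upper-cased section names joined by double underscores) applies to this spec; the diff itself does not state this. The helper names `env_key` and `export_settings` are hypothetical, and the values are just the defaults declared in the configuration classes.

```python
from typing import Dict, MutableMapping, Tuple


def env_key(*path: str) -> str:
    """Join configuration sections into an environment variable name.

    Assumes the dlt convention of upper-cased sections separated by '__'.
    """
    return "__".join(part.upper() for part in path)


# Section paths follow the nesting of the config classes:
# destination.lancedb.<field> and destination.lancedb.credentials.<field>.
LANCEDB_SETTINGS: Dict[Tuple[str, ...], str] = {
    ("destination", "lancedb", "credentials", "uri"): ".lancedb",
    ("destination", "lancedb", "embedding_model_provider"): "cohere",
    ("destination", "lancedb", "embedding_model"): "embed-english-v3.0",
}


def export_settings(
    settings: Dict[Tuple[str, ...], str], environ: MutableMapping[str, str]
) -> None:
    """Write each setting into the given environment mapping."""
    for path, value in settings.items():
        environ[env_key(*path)] = value
```

Passing `os.environ` as the second argument would export the values for the current process; secrets such as `embedding_model_provider_api_key` would be set the same way under the `credentials` section.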
30 changes: 30 additions & 0 deletions dlt/destinations/impl/lancedb/exceptions.py
@@ -0,0 +1,30 @@
from functools import wraps
from typing import (
    Any,
)

from lancedb.exceptions import MissingValueError, MissingColumnError  # type: ignore

from dlt.common.destination.exceptions import (
    DestinationUndefinedEntity,
    DestinationTerminalException,
)
from dlt.common.destination.reference import JobClientBase
from dlt.common.typing import TFun


def lancedb_error(f: TFun) -> TFun:

Collaborator: I thought I left a comment about this somewhere: why can we not use the same error handling mechanism we use on all the other destinations, @raise_database_error?

Collaborator (Author): @sh-rp I don't think it will work, because lancedb doesn't have a DBAPI driver we can use, i.e. it doesn't inherit from SqlClientBase.

Collaborator (Author): Let me know if I'm missing something.

Collaborator: Ah yes, you are right, we should make this decorator more universal. Forget about it for now ;)

    @wraps(f)
    def _wrap(self: JobClientBase, *args: Any, **kwargs: Any) -> Any:
        try:
            return f(self, *args, **kwargs)
        except (
            FileNotFoundError,
            MissingValueError,
            MissingColumnError,
        ) as status_ex:
            raise DestinationUndefinedEntity(status_ex) from status_ex
        except Exception as e:
            raise DestinationTerminalException(e) from e

    return _wrap  # type: ignore[return-value]
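The decorator translates low-level errors into two destination-level categories: "entity not found" errors and everything else, which is treated as terminal. The pattern can be tried without dlt or lancedb installed using the self-contained sketch below. `UndefinedEntity`, `TerminalError`, `translate_errors` and `open_table` are hypothetical stand-ins for illustration, and `KeyError` stands in for lancedb's `MissingValueError` and `MissingColumnError`.

```python
from functools import wraps
from typing import Any, Callable, TypeVar

TFun = TypeVar("TFun", bound=Callable[..., Any])


class UndefinedEntity(Exception):
    """Stand-in for DestinationUndefinedEntity."""


class TerminalError(Exception):
    """Stand-in for DestinationTerminalException."""


def translate_errors(f: TFun) -> TFun:
    @wraps(f)
    def _wrap(*args: Any, **kwargs: Any) -> Any:
        try:
            return f(*args, **kwargs)
        except (FileNotFoundError, KeyError) as ex:
            # "Not found" style errors map to the undefined-entity category.
            raise UndefinedEntity(ex) from ex
        except Exception as ex:
            # Anything else is treated as non-retryable.
            raise TerminalError(ex) from ex

    return _wrap  # type: ignore[return-value]


@translate_errors
def open_table(name: str) -> str:
    """Toy function that fails in two different ways."""
    if name == "missing":
        raise FileNotFoundError(name)
    if not name:
        raise ValueError("empty table name")
    return f"table:{name}"
```

The order of the `except` clauses matters: the specific tuple must come first, otherwise the catch-all would classify every failure as terminal. Using `raise ... from ex` keeps the original exception available as `__cause__` for debugging.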
53 changes: 53 additions & 0 deletions dlt/destinations/impl/lancedb/factory.py
@@ -0,0 +1,53 @@
import typing as t

from dlt.common.destination import Destination, DestinationCapabilitiesContext
from dlt.destinations.impl.lancedb.configuration import (
    LanceDBCredentials,
    LanceDBClientConfiguration,
)


if t.TYPE_CHECKING:
    from dlt.destinations.impl.lancedb.lancedb_client import LanceDBClient


class lancedb(Destination[LanceDBClientConfiguration, "LanceDBClient"]):
    spec = LanceDBClientConfiguration

    def _raw_capabilities(self) -> DestinationCapabilitiesContext:
        caps = DestinationCapabilitiesContext()
        caps.preferred_loader_file_format = "jsonl"
        caps.supported_loader_file_formats = ["jsonl"]

        caps.max_identifier_length = 200
        caps.max_column_identifier_length = 1024
        caps.max_query_length = 8 * 1024 * 1024
        caps.is_max_query_length_in_bytes = False
        caps.max_text_data_type_length = 8 * 1024 * 1024
        caps.is_max_text_data_type_length_in_bytes = False
        caps.supports_ddl_transactions = False

        caps.decimal_precision = (38, 18)
        caps.timestamp_precision = 6

        return caps

    @property
    def client_class(self) -> t.Type["LanceDBClient"]:
        from dlt.destinations.impl.lancedb.lancedb_client import LanceDBClient

        return LanceDBClient

    def __init__(
        self,
        credentials: t.Union[LanceDBCredentials, t.Dict[str, t.Any]] = None,
        destination_name: t.Optional[str] = None,
        environment: t.Optional[str] = None,
        **kwargs: t.Any,
    ) -> None:
        super().__init__(
            credentials=credentials,
            destination_name=destination_name,
            environment=environment,
            **kwargs,
        )
58 changes: 58 additions & 0 deletions dlt/destinations/impl/lancedb/lancedb_adapter.py
@@ -0,0 +1,58 @@
from typing import Any

from dlt.common.schema.typing import TColumnNames, TTableSchemaColumns
from dlt.destinations.utils import ensure_resource
from dlt.extract import DltResource


VECTORIZE_HINT = "x-lancedb-embed"


def lancedb_adapter(
    data: Any,
    embed: TColumnNames = None,
) -> DltResource:
    """Prepares data for the LanceDB destination by specifying which columns should be embedded.

    Args:
        data (Any): The data to be transformed. It can be raw data or an instance
            of DltResource. If raw data, the function wraps it into a DltResource
            object.
        embed (TColumnNames, optional): Specify columns to generate embeddings for.
            It can be a single column name as a string, or a list of column names.

    Returns:
        DltResource: A resource with applied LanceDB-specific hints.

    Raises:
        ValueError: If input for `embed` invalid or empty.

    Examples:
        >>> data = [{"name": "Marcel", "description": "Moonbase Engineer"}]
        >>> lancedb_adapter(data, embed="description")
        [DltResource with hints applied]
    """
    resource = ensure_resource(data)

    column_hints: TTableSchemaColumns = {}

    if embed:
        if isinstance(embed, str):
            embed = [embed]
        if not isinstance(embed, list):
            raise ValueError(
                "'embed' must be a list of column names or a single column name as a string."
            )

        for column_name in embed:
            column_hints[column_name] = {
                "name": column_name,
                VECTORIZE_HINT: True,  # type: ignore[misc]
            }

    if not column_hints:
        raise ValueError("A value for 'embed' must be specified.")
    else:
        resource.apply_hints(columns=column_hints)

    return resource
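To see what the adapter attaches to a resource, the hint-building part can be isolated from dlt. The sketch below reproduces only the normalisation and validation of `embed` shown in the diff. `build_embed_hints` is a hypothetical name, and the real adapter additionally wraps the data with `ensure_resource` and calls `apply_hints`.

```python
from typing import Any, Dict, List, Optional, Union

VECTORIZE_HINT = "x-lancedb-embed"


def build_embed_hints(
    embed: Optional[Union[str, List[str]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Build per-column hints marking which columns should be embedded."""
    column_hints: Dict[str, Dict[str, Any]] = {}

    if embed:
        # A single column name is normalised to a one-element list.
        if isinstance(embed, str):
            embed = [embed]
        if not isinstance(embed, list):
            raise ValueError(
                "'embed' must be a list of column names or a single column name as a string."
            )
        for column_name in embed:
            column_hints[column_name] = {"name": column_name, VECTORIZE_HINT: True}

    # None, an empty string and an empty list all end up here.
    if not column_hints:
        raise ValueError("A value for 'embed' must be specified.")
    return column_hints
```

Note that a tuple of column names is rejected by the list check, so callers following this logic need to pass a `list` or a single `str`.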