LanceDB Destination #1375
Merged
Commits
179 commits
5c35057
Added lancedb as an optional dependency
Pipboyguy f4450a7
Added lancedb to dependencies in test workflow
Pipboyguy 0808bb2
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy 5380892
Add initial capabilities for LanceDB destination
Pipboyguy 0f40ffa
Added new lancedb_adapter
Pipboyguy d34ab6f
Added LanceDB factory in destinations implementation
Pipboyguy 0f3d57f
Added LanceDB client configuration with embedding details
Pipboyguy 69e1daa
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy c51a9bc
Added LanceDB Client with data load and schema management functionali…
Pipboyguy c49ee2b
Merge remote-tracking branch 'origin/devel' into 1370-lancedb-destina…
Pipboyguy 389d06b
Lockfile
Pipboyguy 1d9a072
Wireframe LanceDB client implementation
Pipboyguy 8c0de4c
Add abstract methods
Pipboyguy d6d02b2
Enhance LanceDB client with additional functionality
Pipboyguy ada3ebe
Add tests and GitHub workflow for LanceDB destination
Pipboyguy 5d961b5
Update Python version to 3.11.x in GitHub workflow
Pipboyguy c3cad2f
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy f97256e
Refactor and cleanup LanceDBClient and LoadLanceDBJob classes
Pipboyguy 1b68391
Refactor load tests in lancedb/utils.py and add test for LanceDB mode…
Pipboyguy ae70f97
Added functionality to infer LanceDB model from data and refactored n…
Pipboyguy 2222a88
Remove storage options
Pipboyguy f819c25
Refactor test pipeline and implement lancedb_adapter in LanceDBClient
Pipboyguy 8bbd515
Add schema argument to LoadLanceDBJob function
Pipboyguy fb7565e
Format
Pipboyguy 4c73541
Refactor LanceDB related code and increase type hint coverage
Pipboyguy bfcc8bb
Refactor LanceDB client and tests, enhance DB type mapping
Pipboyguy 4827798
Refactor code to improve readability by reducing line breaks
Pipboyguy 53c0b0d
Refactor LanceDB client code by adding schema_conversion and utils mo…
Pipboyguy 2cfbdb4
Remove redundant variables in lancedb_client.py
Pipboyguy 092fcf0
Refactor code to improve readability and move environment variable se…
Pipboyguy b1783b2
Refactor LanceDB client implementation and error handling
Pipboyguy 08dcea1
Refactor code for better readability and add type ignore comments
Pipboyguy ad43996
Added lancedb as an optional dependency
Pipboyguy c225916
Added lancedb to dependencies in test workflow
Pipboyguy cfc4038
Add initial capabilities for LanceDB destination
Pipboyguy 7e1b651
Added new lancedb_adapter
Pipboyguy 892c4f8
Added LanceDB factory in destinations implementation
Pipboyguy b4e8ad5
Added LanceDB client configuration with embedding details
Pipboyguy 605ec89
Added LanceDB Client with data load and schema management functionali…
Pipboyguy 8cd3208
Wireframe LanceDB client implementation
Pipboyguy d4a9314
Add abstract methods
Pipboyguy 99fb712
Enhance LanceDB client with additional functionality
Pipboyguy e024925
Add tests and GitHub workflow for LanceDB destination
Pipboyguy c1e84b5
Update Python version to 3.11.x in GitHub workflow
Pipboyguy fe0cfbe
Refactor and cleanup LanceDBClient and LoadLanceDBJob classes
Pipboyguy 32fd5c8
Refactor load tests in lancedb/utils.py and add test for LanceDB mode…
Pipboyguy 1eea62a
Added functionality to infer LanceDB model from data and refactored n…
Pipboyguy eb05fcd
Remove storage options
Pipboyguy d560269
Refactor test pipeline and implement lancedb_adapter in LanceDBClient
Pipboyguy 80eef96
Add schema argument to LoadLanceDBJob function
Pipboyguy 1414ded
Format
Pipboyguy 033c981
Refactor LanceDB related code and increase type hint coverage
Pipboyguy c655c8a
Refactor LanceDB client and tests, enhance DB type mapping
Pipboyguy 1815da4
Refactor code to improve readability by reducing line breaks
Pipboyguy 32478ae
Refactor LanceDB client code by adding schema_conversion and utils mo…
Pipboyguy c4bd4f2
Remove redundant variables in lancedb_client.py
Pipboyguy 6c74cc5
Refactor code to improve readability and move environment variable se…
Pipboyguy 9e5d6db
Refactor LanceDB client implementation and error handling
Pipboyguy fb6974d
Refactor code for better readability and add type ignore comments
Pipboyguy ee239e5
Dependency Versioning
Pipboyguy 42e32d9
Merge remote-tracking branch 'origin/1370-lancedb-destination' into 1…
Pipboyguy 827ebd3
Remove unnecessary dependencies and update lancedb and pylance versions
Pipboyguy 71e3579
Silence mypy warnings
Pipboyguy d5b8cda
Revert mypy ignores
Pipboyguy 3ea0a3a
Revert mypy ignores
Pipboyguy 395e1b7
Fix versioning with 3.8
Pipboyguy ec2774d
Fix versioning
Pipboyguy 1591a1f
Update default URI and dataset separator in LanceDB configuration
Pipboyguy 7f3c772
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy a727cf3
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy 66d1cc7
Refactor LanceDB typemapper with timestamp and decimal precision adju…
Pipboyguy 21bc285
Updated method for retrieving sentinel table name
Pipboyguy 7fadbe5
Remove redundant table normalisation for version_table_name
Pipboyguy bda6123
Refactor LanceDB functionalities and improve handling of optional emb…
Pipboyguy 74ac9f3
Refactor LanceDBClient and update parameter defaults in schema.py
Pipboyguy e4b3a8d
Added lancedb to default vector configs and improved type annotations…
Pipboyguy 5e8718c
Return self in enter context manager method
Pipboyguy 310eccf
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy 6a5db5c
Handle FileNotFoundError
Pipboyguy f4bcfe7
Replace FileNotFoundError with DestinationUndefinedEntity in lancedb_…
Pipboyguy b193f38
Refactor LanceDB client for simplified table name handling
Pipboyguy 6e61c7a
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy 4701547
Refactored LanceDB schema creation and storage update processes to py…
Pipboyguy 74080a4
Remove LanceModels
Pipboyguy f3ac7e2
Ensure 'records' is a list in lancedb_client.py
Pipboyguy 173be9e
Refactor code and add batch error handling in lancedb client
Pipboyguy 9d6c6d2
Refactor LanceDB client and schema for improved embedding handling
Pipboyguy bb165a8
Improve error handling and retries in LanceDB client
Pipboyguy 2adeaea
Add error decorator to get_stored_state method in lancedb_client
Pipboyguy ba02ed5
Change error handling from FileNotFoundError to IndexError
Pipboyguy 432dd76
Refactor lancedb_client.py and add error decorators
Pipboyguy fa1202e
Add configurable read consistency to LanceDB client
Pipboyguy 8963876
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy 26f497e
Versioning
Pipboyguy 30cc9a7
Refactor code for readability and change return type in tests
Pipboyguy 175b6db
Update queries in lancedb_client to order by insertion date
Pipboyguy 719fcfb
Refactor LanceDB client and schema for better table creation and mana…
Pipboyguy aa683da
Combine "skip" and "append" write dispositions in batch upload
Pipboyguy 6483152
Add schema version hash check in LanceDB client write operations
Pipboyguy 65ede6f
Remove testing code
Pipboyguy fbc558f
Refactor return statement in lancedb_client for successful state loads
Pipboyguy 8fbae21
Update lancedb_client.py to improve table handling and embedding fields
Pipboyguy 88bc519
Refactor LanceDB schema generation and handle metadata for embedding …
Pipboyguy aab21a6
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy 3b85442
Refactor schema creation and remove unused code
Pipboyguy 31e1895
Add mapping for provider environment variables and update schema comment
Pipboyguy 9400e02
Update package versions in pyproject.toml and poetry.lock
Pipboyguy 10c41a8
Refactor LanceDB utils and client, handle exception and remove unnece…
Pipboyguy 9ba97a8
Refactor utility functions in lancedb tests
Pipboyguy 97127a5
Update 'replace' mode and improve table handling in lancedb client
Pipboyguy b77b6f7
Refactor assert_unordered_list_equal to handle dictionaries
Pipboyguy bc8948e
Refactor code for better readability and remove unnecessary blank lines
Pipboyguy c9a1667
Refactor code for readability and remove redundant comments
Pipboyguy 84357d7
Update sentinel table name in test_pipeline.py
Pipboyguy a334c98
"Add order by clause to database query in lancedb_client"
Pipboyguy d4b56da
Use super method to reduce redundancy
Pipboyguy 8f89d38
Syntax
Pipboyguy 3b44631
Remove bare except clauses
Pipboyguy ee6f525
Revert "Remove bare except clauses"
Pipboyguy d8db465
Remove bare except clause
Pipboyguy 5aed9e5
Remove bare except clause
Pipboyguy cd68966
Remove bare except clause
Pipboyguy 11f562f
Remove bare except clause
Pipboyguy 06408b9
Refactor error handling in LanceDB client
Pipboyguy cf41859
Add configurable sentinel table name in LanceDB client configuration
Pipboyguy bbc00a5
Update embedding model config and schema in LanceDB
Pipboyguy 9322477
Refactor lancedb_client.py, remove unused methods and imports
Pipboyguy 9b4c519
Add support for adding multiple fields to LanceDB table in a single o…
Pipboyguy 71ffaaa
Only filter by successful loads
Pipboyguy 9a7c0b5
Remove redundant exception handling in JSON extraction
Pipboyguy a854649
Refactor lancedb_client.py for better code readability
Pipboyguy 2751699
Merge remote-tracking branch 'origin/1370-lancedb-destination' into 1…
Pipboyguy 348825f
Refactor lancedb_client.py for improved code readability
Pipboyguy 4bfee6d
Fix module docstring
Pipboyguy 54ef1ac
Remove embedding_fields from make_arrow_field_schema function
Pipboyguy a11f0e4
Add merge key support
Pipboyguy aff0032
Refactor `get_stored_state` to perform join in memory
Pipboyguy 7c693b8
Packaging
Pipboyguy 78cb85d
Format
Pipboyguy de23786
Update dependencies in GitHub workflow for testing lancedb
Pipboyguy 894585c
Add "cohere" to package dependencies in pyproject.toml
Pipboyguy c228ea3
Dependencies
Pipboyguy d472e16
Update dependencies installation in GitHub workflow
Pipboyguy ba6d8b8
Dependencies
Pipboyguy 71b7fce
Update dependency in GitHub workflow
Pipboyguy b7d1ebf
Dependencies
Pipboyguy a4f355d
Dependencies
Pipboyguy a7249b2
Add documentation for LanceDB
Pipboyguy 1052757
Add limitations
Pipboyguy 5c7288b
Offload ordering logic from LanceDB
Pipboyguy 554c471
Update import statements in lancedb client and exceptions files
Pipboyguy 3750aae
Create `_get_table_name` getter
Pipboyguy 289d679
Format
Pipboyguy a8e4e62
Avoid race conditions by delegating all state management to dlt
Pipboyguy b977e5c
Imports
Pipboyguy 232a84f
small doc and test fixes
sh-rp e08bb8a
Fix OpenAI embedding handling of empty strings
Pipboyguy a862742
Add 'embeddings' dependencies manually
Pipboyguy c722979
Finally...
Pipboyguy eb16858
Dependencies
Pipboyguy d510af1
Dependencies
Pipboyguy 15a65c9
Docs
Pipboyguy 5d42963
Remove superfluous helper method.
Pipboyguy 9e4d8bd
Merge branch 'refs/heads/devel' into 1370-lancedb-destination
Pipboyguy e07ab1f
Lock File
Pipboyguy ec78282
Make api_key and embedding_model_provider_api_key optional
Pipboyguy 9d3c57d
Clear environment for config test
Pipboyguy 7f08f85
Minor test config
Pipboyguy db761e9
test config
Pipboyguy e450f47
lancedb config
Pipboyguy 44405d7
Config test
Pipboyguy 656a9b7
Config
Pipboyguy 107907a
config
Pipboyguy 114e74d
Import lancedb_adapter function instead of module in adapter collecti…
Pipboyguy 433ce28
Merge branch 'devel' into 1370-lancedb-destination
Pipboyguy 38f9e11
Clarify embedding facilities in LanceDB docs
Pipboyguy a318ccb
Merge branch 'devel' into 1370-lancedb-destination
sh-rp 19f583d
update lancedb to support new naming setup (cleanup will follow)
sh-rp db1e81d
update lockfile
sh-rp
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
name: dest | lancedb | ||
|
||
on: | ||
pull_request: | ||
branches: | ||
- master | ||
- devel | ||
workflow_dispatch: | ||
schedule: | ||
- cron: '0 2 * * *' | ||
|
||
concurrency: | ||
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }} | ||
cancel-in-progress: true | ||
|
||
env: | ||
DLT_SECRETS_TOML: ${{ secrets.DLT_SECRETS_TOML }} | ||
|
||
RUNTIME__SENTRY_DSN: https://[email protected]/4504819859914752 | ||
RUNTIME__LOG_LEVEL: ERROR | ||
RUNTIME__DLTHUB_TELEMETRY_ENDPOINT: ${{ secrets.RUNTIME__DLTHUB_TELEMETRY_ENDPOINT }} | ||
|
||
ACTIVE_DESTINATIONS: "[\"lancedb\"]" | ||
ALL_FILESYSTEM_DRIVERS: "[\"memory\"]" | ||
|
||
jobs: | ||
get_docs_changes: | ||
name: docs changes | ||
uses: ./.github/workflows/get_docs_changes.yml | ||
if: ${{ !github.event.pull_request.head.repo.fork || contains(github.event.pull_request.labels.*.name, 'ci from fork')}} | ||
|
||
run_loader: | ||
name: dest | lancedb tests | ||
needs: get_docs_changes | ||
if: needs.get_docs_changes.outputs.changes_outside_docs == 'true' | ||
defaults: | ||
run: | ||
shell: bash | ||
runs-on: "ubuntu-latest" | ||
|
||
steps: | ||
- name: Check out | ||
uses: actions/checkout@master | ||
|
||
- name: Setup Python | ||
uses: actions/setup-python@v4 | ||
with: | ||
python-version: "3.11.x" | ||
|
||
- name: Install Poetry | ||
uses: snok/[email protected] | ||
with: | ||
virtualenvs-create: true | ||
virtualenvs-in-project: true | ||
installer-parallel: true | ||
|
||
- name: Load cached venv | ||
id: cached-poetry-dependencies | ||
uses: actions/cache@v3 | ||
with: | ||
path: .venv | ||
key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}-gcp | ||
|
||
- name: create secrets.toml | ||
run: pwd && echo "$DLT_SECRETS_TOML" > tests/.dlt/secrets.toml | ||
|
||
- name: Install dependencies | ||
run: poetry install --no-interaction -E lancedb -E parquet --with sentry-sdk --with pipeline | ||
|
||
- name: Install embedding provider dependencies | ||
run: poetry run pip install openai | ||
|
||
- run: | | ||
poetry run pytest tests/load -m "essential" | ||
name: Run essential tests Linux | ||
if: ${{ ! (contains(github.event.pull_request.labels.*.name, 'ci full') || github.event_name == 'schedule')}} | ||
|
||
- run: | | ||
poetry run pytest tests/load | ||
name: Run all tests Linux | ||
if: ${{ contains(github.event.pull_request.labels.*.name, 'ci full') || github.event_name == 'schedule'}} |
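The two `pytest` steps in the workflow above are guarded by complementary `if` expressions, so exactly one of them runs per job. A minimal Python sketch of that decision follows; `select_pytest_command` is a hypothetical helper used only for illustration and is not part of the PR.

```python
from typing import Iterable


def select_pytest_command(labels: Iterable[str], event_name: str) -> str:
    """Mirror the `if` guards of the two test steps (hypothetical helper)."""
    # Full suite: the PR carries the 'ci full' label, or the run was started by the schedule.
    run_full = "ci full" in set(labels) or event_name == "schedule"
    if run_full:
        return "poetry run pytest tests/load"
    # Otherwise only the tests marked "essential" are run.
    return 'poetry run pytest tests/load -m "essential"'
```

For an unlabeled pull request this selects the `-m "essential"` command; adding the `ci full` label or running on the nightly cron selects the full `tests/load` suite.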
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from dlt.destinations.impl.lancedb.lancedb_adapter import lancedb_adapter |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
import dataclasses | ||
from typing import Optional, Final, Literal, ClassVar, List | ||
|
||
from dlt.common.configuration import configspec | ||
from dlt.common.configuration.specs.base_configuration import ( | ||
BaseConfiguration, | ||
CredentialsConfiguration, | ||
) | ||
from dlt.common.destination.reference import DestinationClientDwhConfiguration | ||
from dlt.common.typing import TSecretStrValue | ||
from dlt.common.utils import digest128 | ||
|
||
|
||
@configspec | ||
class LanceDBCredentials(CredentialsConfiguration): | ||
uri: Optional[str] = ".lancedb" | ||
"""LanceDB database URI. Defaults to local, on-disk instance. | ||
|
||
The available schemas are: | ||
|
||
- `/path/to/database` - local database. | ||
- `db://host:port` - remote database (LanceDB cloud). | ||
""" | ||
api_key: Optional[TSecretStrValue] = None | ||
"""API key for the remote connections (LanceDB cloud).""" | ||
embedding_model_provider_api_key: Optional[str] = None | ||
"""API key for the embedding model provider.""" | ||
|
||
__config_gen_annotations__: ClassVar[List[str]] = [ | ||
"uri", | ||
"api_key", | ||
"embedding_model_provider_api_key", | ||
] | ||
|
||
|
||
@configspec | ||
class LanceDBClientOptions(BaseConfiguration): | ||
max_retries: Optional[int] = 3 | ||
"""`EmbeddingFunction` class wraps the calls for source and query embedding | ||
generation inside a rate limit handler that retries the requests with exponential | ||
backoff after successive failures. | ||
|
||
You can tune it by setting it to a different number, or disable it by setting it to 0.""" | ||
|
||
__config_gen_annotations__: ClassVar[List[str]] = [ | ||
"max_retries", | ||
] | ||
|
||
|
||
TEmbeddingProvider = Literal[ | ||
"gemini-text", | ||
"bedrock-text", | ||
"cohere", | ||
"gte-text", | ||
"imagebind", | ||
"instructor", | ||
"open-clip", | ||
"openai", | ||
"sentence-transformers", | ||
"huggingface", | ||
"colbert", | ||
] | ||
|
||
|
||
@configspec | ||
class LanceDBClientConfiguration(DestinationClientDwhConfiguration): | ||
destination_type: Final[str] = dataclasses.field( # type: ignore | ||
default="LanceDB", init=False, repr=False, compare=False | ||
) | ||
credentials: LanceDBCredentials = None | ||
dataset_separator: str = "___" | ||
"""Character for the dataset separator.""" | ||
dataset_name: Final[Optional[str]] = dataclasses.field( # type: ignore | ||
default=None, init=False, repr=False, compare=False | ||
) | ||
|
||
options: Optional[LanceDBClientOptions] = None | ||
"""LanceDB client options.""" | ||
|
||
embedding_model_provider: TEmbeddingProvider = "cohere" | ||
"""Embedding provider used for generating embeddings. Default is "cohere". You can find the full list of | ||
providers at https://github.com/lancedb/lancedb/tree/main/python/python/lancedb/embeddings as well as | ||
https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/.""" | ||
embedding_model: str = "embed-english-v3.0" | ||
"""The model used by the embedding provider for generating embeddings. | ||
Check with the embedding provider which options are available. | ||
Reference https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/.""" | ||
embedding_model_dimensions: Optional[int] = None | ||
"""The dimensions of the embeddings generated. In most cases it will be automatically inferred, by LanceDB, | ||
but it is configurable in rare cases. | ||
|
||
Make sure it corresponds with the associated embedding model's dimensionality.""" | ||
vector_field_name: str = "vector__" | ||
"""Name of the special field to store the vector embeddings.""" | ||
id_field_name: str = "id__" | ||
"""Name of the special field to manage deduplication.""" | ||
sentinel_table_name: str = "dltSentinelTable" | ||
"""Name of the sentinel table that encapsulates datasets. Since LanceDB has no | ||
concept of schemas, this table serves as a proxy to group related dlt tables together.""" | ||
|
||
__config_gen_annotations__: ClassVar[List[str]] = [ | ||
"embedding_model", | ||
"embedding_model_provider", | ||
] | ||
|
||
def fingerprint(self) -> str: | ||
"""Returns a fingerprint of a connection string.""" | ||
|
||
if self.credentials and self.credentials.uri: | ||
return digest128(self.credentials.uri) | ||
return "" |
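To make the URI handling in the configuration above concrete, here is a simplified sketch that runs without dlt. `CredentialsSketch` and `digest128_standin` are hypothetical stand-ins: the diff does not show how `digest128` is implemented, so the SHAKE-128 plus base64 hashing below is an assumption, and the real class also carries the dlt config machinery that is omitted here.

```python
import base64
import hashlib
from dataclasses import dataclass
from typing import Optional


def digest128_standin(value: str, length: int = 15) -> str:
    """Hypothetical stand-in for dlt's digest128; the real implementation may differ."""
    raw = hashlib.shake_128(value.encode("utf-8")).digest(length)
    return base64.b64encode(raw).decode("ascii").rstrip("=")


@dataclass
class CredentialsSketch:
    """Simplified stand-in for LanceDBCredentials (no dlt config machinery)."""

    uri: Optional[str] = ".lancedb"
    api_key: Optional[str] = None

    def is_remote(self) -> bool:
        # Per the docstring in the diff, `db://host:port` URIs point at LanceDB cloud.
        return bool(self.uri) and self.uri.startswith("db://")

    def fingerprint(self) -> str:
        # Same shape as LanceDBClientConfiguration.fingerprint: hash the URI, else "".
        return digest128_standin(self.uri) if self.uri else ""
```

With the defaults, `CredentialsSketch()` describes a local on-disk database at `.lancedb`. Two configurations that share a URI produce the same fingerprint, and a missing URI yields an empty string.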
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
from functools import wraps | ||
from typing import ( | ||
Any, | ||
) | ||
|
||
from lancedb.exceptions import MissingValueError, MissingColumnError # type: ignore | ||
|
||
from dlt.common.destination.exceptions import ( | ||
DestinationUndefinedEntity, | ||
DestinationTerminalException, | ||
) | ||
from dlt.common.destination.reference import JobClientBase | ||
from dlt.common.typing import TFun | ||
|
||
|
||
def lancedb_error(f: TFun) -> TFun: | ||
@wraps(f) | ||
def _wrap(self: JobClientBase, *args: Any, **kwargs: Any) -> Any: | ||
try: | ||
return f(self, *args, **kwargs) | ||
except ( | ||
FileNotFoundError, | ||
MissingValueError, | ||
MissingColumnError, | ||
) as status_ex: | ||
raise DestinationUndefinedEntity(status_ex) from status_ex | ||
except Exception as e: | ||
raise DestinationTerminalException(e) from e | ||
|
||
return _wrap # type: ignore[return-value] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
import typing as t | ||
|
||
from dlt.common.destination import Destination, DestinationCapabilitiesContext | ||
from dlt.destinations.impl.lancedb.configuration import ( | ||
LanceDBCredentials, | ||
LanceDBClientConfiguration, | ||
) | ||
|
||
|
||
if t.TYPE_CHECKING: | ||
from dlt.destinations.impl.lancedb.lancedb_client import LanceDBClient | ||
|
||
|
||
class lancedb(Destination[LanceDBClientConfiguration, "LanceDBClient"]): | ||
spec = LanceDBClientConfiguration | ||
|
||
def _raw_capabilities(self) -> DestinationCapabilitiesContext: | ||
caps = DestinationCapabilitiesContext() | ||
caps.preferred_loader_file_format = "jsonl" | ||
caps.supported_loader_file_formats = ["jsonl"] | ||
|
||
caps.max_identifier_length = 200 | ||
caps.max_column_identifier_length = 1024 | ||
caps.max_query_length = 8 * 1024 * 1024 | ||
caps.is_max_query_length_in_bytes = False | ||
caps.max_text_data_type_length = 8 * 1024 * 1024 | ||
caps.is_max_text_data_type_length_in_bytes = False | ||
caps.supports_ddl_transactions = False | ||
|
||
caps.decimal_precision = (38, 18) | ||
caps.timestamp_precision = 6 | ||
|
||
return caps | ||
|
||
@property | ||
def client_class(self) -> t.Type["LanceDBClient"]: | ||
from dlt.destinations.impl.lancedb.lancedb_client import LanceDBClient | ||
|
||
return LanceDBClient | ||
|
||
def __init__( | ||
self, | ||
credentials: t.Union[LanceDBCredentials, t.Dict[str, t.Any]] = None, | ||
destination_name: t.Optional[str] = None, | ||
environment: t.Optional[str] = None, | ||
**kwargs: t.Any, | ||
) -> None: | ||
super().__init__( | ||
credentials=credentials, | ||
destination_name=destination_name, | ||
environment=environment, | ||
**kwargs, | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
from typing import Any | ||
|
||
from dlt.common.schema.typing import TColumnNames, TTableSchemaColumns | ||
from dlt.destinations.utils import ensure_resource | ||
from dlt.extract import DltResource | ||
|
||
|
||
VECTORIZE_HINT = "x-lancedb-embed" | ||
|
||
|
||
def lancedb_adapter( | ||
data: Any, | ||
embed: TColumnNames = None, | ||
) -> DltResource: | ||
"""Prepares data for the LanceDB destination by specifying which columns should be embedded. | ||
|
||
Args: | ||
data (Any): The data to be transformed. It can be raw data or an instance | ||
of DltResource. If raw data, the function wraps it into a DltResource | ||
object. | ||
embed (TColumnNames, optional): Specify columns to generate embeddings for. | ||
It can be a single column name as a string, or a list of column names. | ||
|
||
Returns: | ||
DltResource: A resource with applied LanceDB-specific hints. | ||
|
||
Raises: | ||
ValueError: If input for `embed` is invalid or empty. | ||
|
||
Examples: | ||
>>> data = [{"name": "Marcel", "description": "Moonbase Engineer"}] | ||
>>> lancedb_adapter(data, embed="description") | ||
[DltResource with hints applied] | ||
""" | ||
resource = ensure_resource(data) | ||
|
||
column_hints: TTableSchemaColumns = {} | ||
|
||
if embed: | ||
if isinstance(embed, str): | ||
embed = [embed] | ||
if not isinstance(embed, list): | ||
raise ValueError( | ||
"'embed' must be a list of column names or a single column name as a string." | ||
) | ||
|
||
for column_name in embed: | ||
column_hints[column_name] = { | ||
"name": column_name, | ||
VECTORIZE_HINT: True, # type: ignore[misc] | ||
} | ||
|
||
if not column_hints: | ||
raise ValueError("A value for 'embed' must be specified.") | ||
else: | ||
resource.apply_hints(columns=column_hints) | ||
|
||
return resource |
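The hint-building logic of `lancedb_adapter` can be exercised on its own, without a `DltResource`. The sketch below copies the branching from the diff into a hypothetical helper, `build_embed_hints`, which returns the column hints instead of applying them to a resource.

```python
from typing import Any, Dict, List, Optional, Union

VECTORIZE_HINT = "x-lancedb-embed"


def build_embed_hints(
    embed: Optional[Union[str, List[str]]],
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper reproducing the hint-building part of lancedb_adapter."""
    hints: Dict[str, Dict[str, Any]] = {}
    if embed:
        if isinstance(embed, str):
            embed = [embed]  # a single column name becomes a one-element list
        if not isinstance(embed, list):
            raise ValueError(
                "'embed' must be a list of column names or a single column name as a string."
            )
        for column_name in embed:
            hints[column_name] = {"name": column_name, VECTORIZE_HINT: True}
    if not hints:
        # None, an empty string, and an empty list all end up here.
        raise ValueError("A value for 'embed' must be specified.")
    return hints
```

Calling `build_embed_hints("description")` gives a single entry keyed by `description` with the `x-lancedb-embed` flag set. Passing `None`, an empty list, or a tuple raises `ValueError`, which matches the two error paths in the adapter.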
I thought I left a comment about this somewhere: Why can we not use the same error handling mechanism we use on all the other destinations: @raise_database_error?
@sh-rp I don't think it will work because lancedb doesn't have a DBAPI driver we can use, i.e. the client doesn't inherit from SqlClientBase
Let me know if I'm missing something?
Ah yes, you are right, we should make this decorator more universal, forget about it for now ;)
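The thread leaves open what a "more universal" decorator could look like. One possible shape is a decorator factory that takes the destination's own "entity not found" exceptions as a parameter, so it does not depend on a DBAPI driver or on `SqlClientBase`. This is only a sketch under stated assumptions: `map_destination_errors`, `UndefinedEntityError` and `TerminalError` are hypothetical names, the two exception classes stand in for dlt's `DestinationUndefinedEntity` and `DestinationTerminalException`, and nothing here describes how `raise_database_error` actually works.

```python
from functools import wraps
from typing import Any, Callable, Tuple, Type, TypeVar

TFun = TypeVar("TFun", bound=Callable[..., Any])


class UndefinedEntityError(Exception):
    """Stand-in for dlt's DestinationUndefinedEntity."""


class TerminalError(Exception):
    """Stand-in for dlt's DestinationTerminalException."""


def map_destination_errors(
    undefined: Tuple[Type[BaseException], ...],
) -> Callable[[TFun], TFun]:
    """Hypothetical factory: each destination passes its own 'not found' errors."""

    def decorator(f: TFun) -> TFun:
        @wraps(f)
        def _wrap(*args: Any, **kwargs: Any) -> Any:
            try:
                return f(*args, **kwargs)
            except undefined as ex:
                # Destination-specific "missing table/column" style errors.
                raise UndefinedEntityError(ex) from ex
            except Exception as ex:
                # Everything else is treated as non-retryable.
                raise TerminalError(ex) from ex

        return _wrap  # type: ignore[return-value]

    return decorator


@map_destination_errors((FileNotFoundError, KeyError))
def open_table(name: str) -> str:
    tables = {"items": "items-table"}  # toy in-memory "database"
    return tables[name]
```

Here `open_table("missing")` raises `UndefinedEntityError` with the original `KeyError` preserved as `__cause__`, while any exception outside the configured tuple surfaces as `TerminalError`.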