SAP HANA SQLAlchemy destination fails to synchronize #2110

Open
kouzapo opened this issue Nov 29, 2024 · 2 comments

Comments

kouzapo commented Nov 29, 2024

dlt version

1.4.0

Describe the problem

I'm trying to load data into a SAP HANA database using a SQLAlchemy engine as the destination, defined like so: engine = create_engine('hana+hdbcli://XXXXXX:[email protected]:12345'). The HANA instance is up and running, and the credentials are correct. The source is a simple REST API. This is how I define my pipeline:

pipeline = dlt.pipeline(
    pipeline_name='hana_load',
    destination=dlt.destinations.sqlalchemy(engine),
    dataset_name='MYSCHEMA',
)

load_info = pipeline.run(source)

Immediately, an exception is raised:

(hdbcli.dbapi.Error) (362, 'invalid schema name: MYSCHEMA: line 2 col 6 (at pos 302)')
[SQL: SELECT myschema._dlt_pipeline_state.version, myschema._dlt_pipeline_state.engine_version, myschema._dlt_pipeline_state.pipeline_name, myschema._dlt_pipeline_state.state, myschema._dlt_pipeline_state.created_at, myschema._dlt_pipeline_state.version_hash, myschema._dlt_pipeline_state._dlt_load_id
FROM myschema._dlt_pipeline_state JOIN myschema._dlt_loads ON myschema._dlt_loads.load_id = myschema._dlt_pipeline_state._dlt_load_id
WHERE myschema._dlt_pipeline_state.pipeline_name = ? AND myschema._dlt_loads.status = ? ORDER BY myschema._dlt_loads.load_id DESC]
[parameters: ('quick_start', 0)]
(Background on this error at: https://sqlalche.me/e/14/dbapi)

which is expected, since at this point neither the schema nor the metadata tables have been created. The failure occurs in destinations/impl/sqlalchemy/sqlalchemy_job_client.py, in get_stored_state(); the same thing happens in _get_stored_schema().

Surprisingly, this is not the case with a Postgres SQLAlchemy engine: no exception is thrown, the pipeline runs without errors, and the data is loaded into the destination table. Curiously, though, the code block within the context manager is never reached:

with self.sql_client.execute_query(q) as cur:
    row = cur.fetchone()    # <---- Never reached, even in Postgres case
    if row is None:
        return None

    # TODO: Decode compressed schema str if needed
    return StorageSchemaInfo.from_normalized_mapping(
        row._mapping, self.schema.naming  # type: ignore[attr-defined]
    )

The Postgres and HANA SQLAlchemy engines use different drivers underneath (psycopg2 and hdbcli respectively), so this might be related to the issue. I was able to bypass it by adding a try/except block in get_stored_state() and _get_stored_schema(), but I'm not sure this is the proper way to solve it. I haven't tested other RDBMS destinations (like MySQL).
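
Concretely, the bypass wraps the execution shown above in a try/except, e.g. (a sketch only; whether DBAPIError is the right class to catch, and whether catching it this broadly is safe, is exactly what I'm unsure about):

from sqlalchemy.exc import DBAPIError

try:
    with self.sql_client.execute_query(q) as cur:
        row = cur.fetchone()
        if row is None:
            return None
        return StorageSchemaInfo.from_normalized_mapping(
            row._mapping, self.schema.naming  # type: ignore[attr-defined]
        )
except DBAPIError:
    # On the first run the dataset schema and the _dlt_* tables don't exist
    # yet, so hdbcli raises "invalid schema name"; treat this as "no state".
    return None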

Expected behavior

No error thrown, as in the Postgres case, or proper exception handling for the cases where one is raised.

Steps to reproduce

  • Create a SAP HANA and a Postgres SQLAlchemy engine like so:
    hana_engine = create_engine('hana+hdbcli://XXXXXX:[email protected]:12345')
    postgres_engine = create_engine('postgresql+psycopg2://XXXXXX:[email protected]:5432/some_db')

  • Run this simple pipeline, with engine set to one of the two engines above:

pipeline = dlt.pipeline(
    pipeline_name='hana_load',
    destination=dlt.destinations.sqlalchemy(engine),
    dataset_name='MYSCHEMA',
)

load_info = pipeline.run(source)

  • Observe that an exception is raised in the HANA case, but not in the Postgres case.
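
For completeness, a minimal end-to-end sketch (hypothetical DSN; a trivial in-memory resource stands in for the REST API source):

import dlt
from sqlalchemy import create_engine

# Hypothetical HANA DSN; swap in the Postgres one to see the passing case.
engine = create_engine('hana+hdbcli://XXXXXX:[email protected]:12345')

@dlt.resource(name='items')
def items():
    yield [{'id': 1, 'value': 'a'}, {'id': 2, 'value': 'b'}]

source = items()

pipeline = dlt.pipeline(
    pipeline_name='hana_load',
    destination=dlt.destinations.sqlalchemy(engine),
    dataset_name='MYSCHEMA',
)

load_info = pipeline.run(source)
print(load_info)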

Operating system

Linux

Runtime environment

Local

Python version

3.11

dlt data source

A simple REST API

dlt destination

SQLAlchemy (SAP HANA & Postgres)

Other deployment details

No response

Additional information

Adding a try/except block solves the issue, but there might be a more elegant approach.

MrJack91 commented Dec 20, 2024

Exactly; this still happens with dlt 1.5.0.
It would be great if HANA were supported as well.

My guess is that it's a problem in sqlalchemy-hana or even hdbcli.
I see that there is a rollback after this select; I'm not sure whether the error should still be raised further after that.

Do any devs have a clearer picture of the problem?
It's also perhaps not ideal that dlt tries to read the tables before creating them, even though this works for other DBMSs.

Easy / ugly workaround:

  • In site-packages\dlt\destinations\impl\sqlalchemy\sqlalchemy_job_client.py:
    • for the first run only, add a return None before both .execute_query calls (see the sketch after this list)
  • The tables will then be created, but only with nclobs and bigints (and some date columns):
    • nclobs, at least, do not support string comparisons (in joins),
    • so alter those columns manually afterwards.
  • Revert the adaptations from the first point.
  • Now dlt works for HANA.
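
The first-run hack amounts to something like this in get_stored_state() (and likewise in _get_stored_schema()); a sketch, to be reverted once the tables exist:

def get_stored_state(self, pipeline_name):
    # ... query construction as shipped ...
    # TEMPORARY first-run hack: bail out before self.sql_client.execute_query(q)
    # so the missing _dlt_* tables are never read; revert after the first load.
    return None
    # with self.sql_client.execute_query(q) as cur:
    #     ...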

Or, even easier, just create the final tables manually up front (correct at least for version 1.5.0):

create table <my_schema>._dlt_loads
(
    load_id             nvarchar(5000) not null,
    schema_name         nvarchar(5000),
    status              int not null,
    inserted_at         longdate CS_LONGDATE not null,
    schema_version_hash nvarchar(5000)
) UNLOAD PRIORITY 5  AUTO merge
;
create table <my_schema>._dlt_pipeline_state
(
    version        int not null,
    engine_version int not null,
    pipeline_name  nvarchar(5000) not null,
    state          nvarchar(5000) not null,
    created_at     longdate CS_LONGDATE not null,
    version_hash   nvarchar(5000),
    _dlt_load_id nvarchar(5000) not null,
    _dlt_id      nvarchar(128) not null,
    unique CPBTREE (_DLT_ID)
) UNLOAD PRIORITY 5  AUTO merge
;
create table <my_schema>._dlt_version
(
    version        integer CS_INT not null,
    engine_version integer CS_INT not null,
    inserted_at    longdate CS_LONGDATE not null,
    schema_name    nvarchar(5000) not null,
    version_hash   nvarchar(5000) not null,
    schema         nclob MEMORY THRESHOLD 1000 not null
) UNLOAD PRIORITY 5  AUTO merge
;

@MrJack91

This behaviour of the HANA driver will also block the use of pyarrow.
