SAP HANA SQLAlchemy destination fails to synchronize #2110

Open
kouzapo opened this issue Nov 29, 2024 · 2 comments

Comments

kouzapo commented Nov 29, 2024

dlt version

1.4.0

Describe the problem

I'm trying to load data into a SAP HANA database using a SQLAlchemy engine as the destination, defined like so: engine = create_engine('hana+hdbcli://XXXXXX:[email protected]:12345'). The HANA instance is up and running, and the credentials are correct. The source is a simple REST API. This is how I define my pipeline:

pipeline = dlt.pipeline(
    pipeline_name='hana_load',
    destination=dlt.destinations.sqlalchemy(engine),
    dataset_name='MYSCHEMA',
)

load_info = pipeline.run(source)

Immediately, an exception is raised:

(hdbcli.dbapi.Error) (362, 'invalid schema name: MYSCHEMA: line 2 col 6 (at pos 302)')
[SQL: SELECT myschema._dlt_pipeline_state.version, myschema._dlt_pipeline_state.engine_version, myschema._dlt_pipeline_state.pipeline_name, myschema._dlt_pipeline_state.state, myschema._dlt_pipeline_state.created_at, myschema._dlt_pipeline_state.version_hash, myschema._dlt_pipeline_state._dlt_load_id
FROM myschema._dlt_pipeline_state JOIN myschema._dlt_loads ON myschema._dlt_loads.load_id = myschema._dlt_pipeline_state._dlt_load_id
WHERE myschema._dlt_pipeline_state.pipeline_name = ? AND myschema._dlt_loads.status = ? ORDER BY myschema._dlt_loads.load_id DESC]
[parameters: ('quick_start', 0)]
(Background on this error at: https://sqlalche.me/e/14/dbapi)

which is expected, since at this point neither the schema nor the metadata tables have been created. The failure occurs in destinations/impl/sqlalchemy/sqlalchemy_job_client.py, in get_stored_state(); the same thing happens in _get_stored_schema().

Surprisingly, this is not the case with a Postgres SQLAlchemy engine: no exception is thrown, the pipeline runs without errors, and the data is loaded into the destination table. Curiously, though, the code block within the context manager is never reached:

with self.sql_client.execute_query(q) as cur:
    row = cur.fetchone()    # <---- Never reached, even in Postgres case
    if row is None:
        return None

    # TODO: Decode compressed schema str if needed
    return StorageSchemaInfo.from_normalized_mapping(
        row._mapping, self.schema.naming  # type: ignore[attr-defined]
    )

The Postgres and HANA SQLAlchemy engines use different drivers underneath (psycopg2 and hdbcli respectively), so this might be related to the issue. I was able to bypass it by adding a try/except block in get_stored_state() and _get_stored_schema(), but I'm not sure this is the proper way to solve it. I haven't tested other RDBMS destinations (like MySQL).
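
Concretely, the bypass wraps the execution shown above in a try/except, e.g. (a sketch only; whether DBAPIError is the right class to catch, and whether catching it this broadly is safe, is exactly what I'm unsure about):

from sqlalchemy.exc import DBAPIError

try:
    with self.sql_client.execute_query(q) as cur:
        row = cur.fetchone()
        if row is None:
            return None
        return StorageSchemaInfo.from_normalized_mapping(
            row._mapping, self.schema.naming  # type: ignore[attr-defined]
        )
except DBAPIError:
    # On the first run the dataset schema and the _dlt_* tables don't exist
    # yet, so hdbcli raises "invalid schema name"; treat this as "no state".
    return None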

Expected behavior

No error thrown, as in the Postgres case, or proper exception handling for the cases where one is raised.

Steps to reproduce

  • Create a SAP HANA and a Postgres SQLAlchemy engine like so:
    hana_engine = create_engine('hana+hdbcli://XXXXXX:[email protected]:12345')
    postgres_engine = create_engine('postgresql+psycopg2://XXXXXX:[email protected]:5432/some_db')

  • Run this simple pipeline, with engine set to one of the two engines above:

pipeline = dlt.pipeline(
    pipeline_name='hana_load',
    destination=dlt.destinations.sqlalchemy(engine),
    dataset_name='MYSCHEMA',
)

load_info = pipeline.run(source)

  • Observe that an exception is raised in the HANA case, but not in the Postgres case.
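
For completeness, a minimal end-to-end sketch (hypothetical DSN; a trivial in-memory resource stands in for the REST API source):

import dlt
from sqlalchemy import create_engine

# Hypothetical HANA DSN; swap in the Postgres one to see the passing case.
engine = create_engine('hana+hdbcli://XXXXXX:[email protected]:12345')

@dlt.resource(name='items')
def items():
    yield [{'id': 1, 'value': 'a'}, {'id': 2, 'value': 'b'}]

source = items()

pipeline = dlt.pipeline(
    pipeline_name='hana_load',
    destination=dlt.destinations.sqlalchemy(engine),
    dataset_name='MYSCHEMA',
)

load_info = pipeline.run(source)
print(load_info)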

Operating system

Linux

Runtime environment

Local

Python version

3.11

dlt data source

A simple REST API

dlt destination

SQLAlchemy (SAP HANA & Postgres)

Other deployment details

No response

Additional information

Adding a try/except block solves the issue, but there might be a more elegant approach.

MrJack91 commented Dec 20, 2024

Exactly; this still happens with dlt 1.5.0.
It would be great if HANA were supported as well.

My guess is that it's a problem in sqlalchemy-hana or even hdbcli.
I see that there is a rollback after this select; I'm not sure whether the error should still be raised further after that.

Do any devs have a clearer picture of the problem?
It's also perhaps not ideal that dlt tries to read the tables before creating them, even though this works for other DBMSs.

Easy / ugly workaround:

  • In site-packages\dlt\destinations\impl\sqlalchemy\sqlalchemy_job_client.py:
    • for the first run only, add a return None before both .execute_query calls (see the sketch after this list)
  • The tables will then be created, but only with nclobs and bigints (and some date columns):
    • nclobs, at least, do not support string comparisons (in joins),
    • so alter those columns manually afterwards.
  • Revert the adaptations from the first point.
  • Now dlt works for HANA.
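
The first-run hack amounts to something like this in get_stored_state() (and likewise in _get_stored_schema()); a sketch, to be reverted once the tables exist:

def get_stored_state(self, pipeline_name):
    # ... query construction as shipped ...
    # TEMPORARY first-run hack: bail out before self.sql_client.execute_query(q)
    # so the missing _dlt_* tables are never read; revert after the first load.
    return None
    # with self.sql_client.execute_query(q) as cur:
    #     ...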

Or, even easier, just create the final tables manually up front (correct at least for version 1.5.0):

create table <my_schema>._dlt_loads
(
    load_id             nvarchar(5000) not null,
    schema_name         nvarchar(5000),
    status              int not null,
    inserted_at         longdate CS_LONGDATE not null,
    schema_version_hash nvarchar(5000)
) UNLOAD PRIORITY 5  AUTO merge
;
create table <my_schema>._dlt_pipeline_state
(
    version        int not null,
    engine_version int not null,
    pipeline_name  nvarchar(5000) not null,
    state          nvarchar(5000) not null,
    created_at     longdate CS_LONGDATE not null,
    version_hash   nvarchar(5000),
    _dlt_load_id nvarchar(5000) not null,
    _dlt_id      nvarchar(128) not null,
    unique CPBTREE (_DLT_ID)
) UNLOAD PRIORITY 5  AUTO merge
;
create table <my_schema>._dlt_version
(
    version        integer CS_INT not null,
    engine_version integer CS_INT not null,
    inserted_at    longdate CS_LONGDATE not null,
    schema_name    nvarchar(5000) not null,
    version_hash   nvarchar(5000) not null,
    schema         nclob MEMORY THRESHOLD 1000 not null
) UNLOAD PRIORITY 5  AUTO merge
;

@MrJack91

This behaviour of the HANA driver will also block the use of pyarrow.
