Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update dependencies to fix bug on databricks has_dataset #1436

Closed

Conversation

oonyoontong
Copy link
Contributor

@oonyoontong oonyoontong commented Jun 3, 2024

Description

Fix error in parsing results back from the underlying databricks sql client by updating dependencies.

Traceback (most recent call last):
File "dlt\destinations\sql_client.py", line 243, in _wrap_gen
return (yield from f(self, *args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "dlt\destinations\impl\databricks\sql_client.py", line 115, in execute_query
yield DBApiCursorImpl(curr) # type: ignore[abstract]
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "dlt\destinations\impl\databricks\sql_client.py", line 84, in execute_sql
f = curr.fetchall()
^^^^^^^^^^^^^^^
File "databricks\sql\client.py", line 670, in fetchall
return self.active_result_set.fetchall()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "databricks\sql\client.py", line 944, in fetchall
return self._convert_arrow_table(self.fetchall_arrow())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "databricks\sql\client.py", line 884, in _convert_arrow_table
res = df.to_numpy(na_value=None)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas\core\frame.py", line 1993, in to_numpy
result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas\core\internals\managers.py", line 1703, in as_array
arr[isna(arr)] = na_value
~~~^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

Related Issues

  • Fixes #...
  • Closes #...
  • Resolves #...

Additional Context

https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1716857509703029

Copy link

netlify bot commented Jun 3, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 3e6e4ee
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/665fe9ccc8a0be000890c37b
😎 Deploy Preview https://deploy-preview-1436--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link
Collaborator

@steinitzu steinitzu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @oonyoontong , thanks for the contribution. I have a few comments/questions, so please take a look :)

dlt/destinations/impl/databricks/sql_client.py Outdated Show resolved Hide resolved
dlt/destinations/impl/databricks/sql_client.py Outdated Show resolved Hide resolved
dlt/destinations/impl/databricks/sql_client.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
@oonyoontong oonyoontong force-pushed the fix/databricks-destination branch from 955eabc to 39779e9 Compare June 4, 2024 04:59
@oonyoontong oonyoontong requested a review from steinitzu June 4, 2024 05:05
@oonyoontong
Copy link
Contributor Author

sorry for the messy PR @steinitzu, was planning clean it up today. The PR was done early to let the others facing this issue that it was being actively worked on

Copy link
Collaborator

@steinitzu steinitzu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! So this is due to an incompatibility with latest pandas databricks/databricks-sql-python#326
Latest connector pins pandas version, so updating is good imo and we should do it anyway.

The only problem is that the named paramstyle fails with a confusing error on cluster runtime version 13.x for some reason. I had to upgrade the test cluster to 14.3.
If there's no way around that I think we should try to detect which version is running at the beginning of has_dataset. I.e. https://docs.databricks.com/en/sql/language-manual/functions/current_version.html
And raise a DatabaseTerminalException with a helpful message.
What do you think?

dlt/destinations/impl/databricks/sql_client.py Outdated Show resolved Hide resolved
dlt/destinations/impl/databricks/sql_client.py Outdated Show resolved Hide resolved
@steinitzu steinitzu changed the base branch from master to devel June 4, 2024 21:44
@steinitzu steinitzu changed the base branch from devel to master June 4, 2024 21:44
@oonyoontong
Copy link
Contributor Author

Ok! So this is due to an incompatibility with latest pandas databricks/databricks-sql-python#326 Latest connector pins pandas version, so updating is good imo and we should do it anyway.

The only problem is that the named paramstyle fails with a confusing error on cluster runtime version 13.x for some reason. I had to upgrade the test cluster to 14.3. If there's no way around that I think we should try to detect which version is running at the beginning of has_dataset. I.e. https://docs.databricks.com/en/sql/language-manual/functions/current_version.html And raise a DatabaseTerminalException with a helpful message. What do you think?

from my investigation of current_version() and digging through the docs of databricks-sql-python, i dont see any clear way of getting

  1. the cluster runtime version from databricks sql
  2. the versions of the databricks runtime which this implementation breaks for, where we can clearly say This is incompatible with databricks version x.x.x, please upgrade to >= y.y.y for this connector to work

I am more inclined to catch the name paramstyle error and provide a error message like Known error seen with databricks cluster <14.x, try again after upgrading the databricks cluster to >= 14.x

I dont have the ability to change my databricks cluster to 13.x to test named paramstyle error returned, if you think the above solution works, could you help me add a commit to except the Error?

@oonyoontong oonyoontong requested a review from steinitzu June 5, 2024 04:31
@oonyoontong oonyoontong changed the title update dependencies and fix bug on has_dataset update dependencies and fix bug on databricks has_dataset Jun 5, 2024
@oonyoontong oonyoontong changed the title update dependencies and fix bug on databricks has_dataset update dependencies to fix bug on databricks has_dataset Jun 5, 2024
@oonyoontong
Copy link
Contributor Author

let me know how i can expedit this, this is a huge blocker for us

@steinitzu
Copy link
Collaborator

Alright, @oonyoontong . Thanks for looking into this! I moved to a new branch here and I'll finish this up #1443
Running the full CI suite from forks seems to not be working yet.

I found out we can still use inline params in the new driver: https://github.com/databricks/databricks-sql-python/blob/main/docs/parameters.md#using-inline-parameters
So we'll stick with that for now and we don't have to drop support for old clusters.

@steinitzu
Copy link
Collaborator

Moved to #1443

@steinitzu steinitzu closed this Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants