Import missing pyarrow compute for transforms on arrowitems #1010

Merged: 2 commits into devel from d#/fix_arrow_compute_missing, Mar 2, 2024

Conversation

sh-rp
Collaborator

@sh-rp sh-rp commented Feb 27, 2024

Description

Adding an incremental to loads with arrow items fails because the pyarrow compute module is never imported. This PR fixes that.
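
For context, Python does not attach a submodule to its parent package until that submodule has been imported somewhere. The sketch below illustrates this general behavior with a throwaway package; the package name `demo_pkg_1010` and its `compute.max` function are made up for the demonstration, and whether `pyarrow` behaves the same way in a given environment depends on its version and on what else has already been imported.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_submodule_attribute():
    """Create a throwaway package and report when its submodule becomes an attribute."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical package layout: demo_pkg_1010/__init__.py and demo_pkg_1010/compute.py
        pkg_dir = Path(tmp) / "demo_pkg_1010"
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "compute.py").write_text(
            "def max(values):\n    return sorted(values)[-1]\n"
        )
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            pkg = importlib.import_module("demo_pkg_1010")
            # Only the package itself has been imported so far.
            before = hasattr(pkg, "compute")
            # Importing the submodule explicitly binds it on the parent package.
            importlib.import_module("demo_pkg_1010.compute")
            after = hasattr(pkg, "compute")
            largest = pkg.compute.max([3, 1, 2])
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop("demo_pkg_1010.compute", None)
            sys.modules.pop("demo_pkg_1010", None)
    return before, after, largest
```

Here `before` is `False` and `after` is `True`, which is why an explicit import of the submodule is the safe option when code relies on attribute access through the parent package.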

@sh-rp sh-rp marked this pull request as ready for review February 27, 2024 10:45

netlify bot commented Feb 27, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit aa0db6d
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/65e2302e0e4b5e0008118773
😎 Deploy Preview https://deploy-preview-1010--dlt-hub-docs.netlify.app

@@ -11,6 +11,11 @@
except ModuleNotFoundError:
    np = None

try:
Collaborator

Please add this to libs/pyarrow, together with the other imports.

@sh-rp
Collaborator Author

sh-rp commented Mar 1, 2024

I have also fixed a few other imports. I am just not 100% certain whether the user still gets the right error messages in all cases, especially where I have this grouped import. Let me know.

@sh-rp sh-rp requested a review from rudolfix March 1, 2024 20:39
@rudolfix rudolfix merged commit fc34dd0 into devel Mar 2, 2024
58 of 66 checks passed
@rudolfix rudolfix deleted the d#/fix_arrow_compute_missing branch March 2, 2024 21:53
@sspaeti
Contributor

sspaeti commented Mar 15, 2024

@sh-rp, just FYI regarding error handling:

> I have also fixed a few other imports. I am just not 100% certain whether the user still gets the right error messages in all cases, especially where I have this grouped import. Let me know.

I just ran into this. The error now says:

AttributeError: 'NoneType' object has no attribute 'compute'

but the error goes away after installing:

pip install pandas SQLAlchemy pandasql

It's a bit confusing now. Maybe this should be added to the docs somewhere.

Full stack:

❯ python delta-load.py
Traceback (most recent call last):
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 221, in __next__
    next_item = step(item, meta=pipe_item.meta)  # type: ignore
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/incremental/__init__.py", line 463, in __call__
    return self._transform_item(transformer, rows)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/incremental/__init__.py", line 316, in _transform_item
    row, self.start_out_of_range, self.end_out_of_range = transformer(row)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/incremental/transform.py", line 228, in __call__
    compute = pa.compute.max
AttributeError: 'NoneType' object has no attribute 'compute'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 412, in extract
    self._extract_source(extract_step, source, max_parallel_items, workers)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 1060, in _extract_source
    load_id = extract.extract(source, max_parallel_items, workers)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/extract.py", line 350, in extract
    self._extract_single_source(
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/extract.py", line 280, in _extract_single_source
    for pipe_item in pipes:
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 236, in __next__
    raise ResourceExtractionError(
dlt.extract.exceptions.ResourceExtractionError: In processing pipe a02t044: extraction of resource a02t044 in transform Incremental caused an exception: 'NoneType' object has no attribute 'compute'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sspaeti/Documents/git/bedag/susa/hellodata-svsa/hellodata-ws-svsa/src/dlt/delta-load.py", line 80, in <module>
    info = pipeline.run(resources)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 210, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 255, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 640, in run
    self.extract(
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 210, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 165, in _wrap
    rv = f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 151, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 255, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 424, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1710396526.7155468 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe a02t044: extraction of resource a02t044 in transform Incremental caused an exception: 'NoneType' object has no attribute 'compute'
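
One way to avoid the confusing `AttributeError` in the trace above would be to fail early with a message that names the missing package. This is only a minimal sketch: the `require_module` helper and the `MissingDependencyError` class are hypothetical names introduced for illustration and are not taken from dlt.

```python
import importlib
from types import ModuleType


class MissingDependencyError(ImportError):
    """Hypothetical error type for this sketch; not part of dlt."""


def require_module(name: str, install_hint: str) -> ModuleType:
    """Return the imported module, or raise an error that says what to install."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        # Chain the original error so the root cause stays visible in the traceback.
        raise MissingDependencyError(
            f"'{name}' is required for this operation. Try: {install_hint}"
        ) from exc
```

Calling something like `require_module("pyarrow", "pip install pyarrow")` at the point of use would then surface the missing dependency directly, instead of failing later on a `None` placeholder.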

@rudolfix
Collaborator

@sh-rp SQLAlchemy was removed from our dependencies. This does not show up in the tests, most probably because one of the extras or dev deps installs pandas. What is wrong:

  1. We import pandas and arrow in one try/except block in transform.py, so a lack of pandas kills arrow even if it exists.
  2. In the pandas helper we import pandas and pandas sql together. Those should come from separate modules so we can instruct the user to install different deps.

  1 + 2 -> a lack of SQLAlchemy and pandas makes the pyarrow import fail.
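
The effect described in point 1 can be reproduced with the standard library alone. The sketch below is an illustration under stated assumptions, not the code in transform.py: `import_grouped` mimics a single try/except around two imports, `import_optional` guards each import separately, and `no_such_module_for_demo` is a made-up module name standing in for the missing dependency.

```python
import importlib
from types import ModuleType
from typing import Optional, Tuple


def import_grouped(
    first: str, second: str
) -> Tuple[Optional[ModuleType], Optional[ModuleType]]:
    """Mimic one try/except around two imports: any failure discards both."""
    try:
        mod_a = importlib.import_module(first)
        mod_b = importlib.import_module(second)
    except ModuleNotFoundError:
        # A single handler cannot tell which import failed, so both become None.
        mod_a = None
        mod_b = None
    return mod_a, mod_b


def import_optional(name: str) -> Optional[ModuleType]:
    """Guard a single import so one missing package does not hide another."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        return None


# "json" stands in for an installed package, the other name for a missing one.
grouped_missing, grouped_present = import_grouped("no_such_module_for_demo", "json")
separate_missing = import_optional("no_such_module_for_demo")
separate_present = import_optional("json")
```

With the grouped form, `grouped_present` ends up as `None` even though `json` is importable. With separate guards, only the module that is actually missing is `None`.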

3 participants