Import missing pyarrow compute for transforms on arrowitems #1010

sh-rp · 2024-02-27T10:44:55Z

Description

Adding an incremental to a load that yields arrowitems fails on a missing `compute` attribute of `pyarrow`. This PR fixes that by importing `pyarrow.compute`.
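
For context, here is a minimal sketch of the failure mode, not the actual dlt code. It assumes pyarrow is an optional dependency. `import pyarrow` on its own may not load the `pyarrow.compute` submodule, so the sketch imports it explicitly. The helper name `arrow_max` is hypothetical and exists only for illustration.

```python
try:
    import pyarrow as pa
    import pyarrow.compute  # explicit submodule import, makes pa.compute available
except ModuleNotFoundError:
    pa = None  # pyarrow is treated as optional in this sketch


def arrow_max(values):
    """Return the maximum of `values` via pyarrow.compute (hypothetical helper)."""
    if pa is None:
        # Fail with an actionable message instead of an AttributeError on None.
        raise RuntimeError("pyarrow is required for this sketch: pip install pyarrow")
    return pa.compute.max(pa.array(values)).as_py()
```

Without the explicit `import pyarrow.compute`, the attribute lookup `pa.compute` can fail even though pyarrow itself is installed.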

netlify · 2024-02-27T10:45:11Z

✅ Deploy Preview for dlt-hub-docs ready!

Name	Link
🔨 Latest commit	`aa0db6d`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/65e2302e0e4b5e0008118773
😎 Deploy Preview	https://deploy-preview-1010--dlt-hub-docs.netlify.app

rudolfix · 2024-02-28T08:52:15Z

dlt/extract/incremental/transform.py

@@ -11,6 +11,11 @@
 except ModuleNotFoundError:
    np = None

+try:


Please add this to libs/pyarrow together with the other imports.

sh-rp · 2024-03-01T19:45:44Z

I have also fixed a few other imports. I am just not 100% certain whether the user still gets the right error messages in all cases, especially where I have this grouped import. Let me know.

sspaeti · 2024-03-15T14:34:03Z

@sh-rp, just FYI regarding error handling:

> I have also fixed a few other imports. I am just not 100% certain whether the user still gets the right error messages in all cases, especially where I have this grouped import. Let me know.

I just ran into this. The error now says:

AttributeError: 'NoneType' object has no attribute 'compute'

but the error goes away after installing:

pip install pandas SQLAlchemy pandasql

It's a bit confusing now. Maybe this should be added to the docs somewhere.

Full stack:

❯ python delta-load.py
Traceback (most recent call last):
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 221, in __next__
    next_item = step(item, meta=pipe_item.meta)  # type: ignore
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/incremental/__init__.py", line 463, in __call__
    return self._transform_item(transformer, rows)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/incremental/__init__.py", line 316, in _transform_item
    row, self.start_out_of_range, self.end_out_of_range = transformer(row)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/incremental/transform.py", line 228, in __call__
    compute = pa.compute.max
AttributeError: 'NoneType' object has no attribute 'compute'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 412, in extract
    self._extract_source(extract_step, source, max_parallel_items, workers)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 1060, in _extract_source
    load_id = extract.extract(source, max_parallel_items, workers)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/extract.py", line 350, in extract
    self._extract_single_source(
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/extract.py", line 280, in _extract_single_source
    for pipe_item in pipes:
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 236, in __next__
    raise ResourceExtractionError(
dlt.extract.exceptions.ResourceExtractionError: In processing pipe a02t044: extraction of resource a02t044 in transform Incremental caused an exception: 'NoneType' object has no attribute 'compute'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sspaeti/Documents/git/bedag/susa/hellodata-svsa/hellodata-ws-svsa/src/dlt/delta-load.py", line 80, in <module>
    info = pipeline.run(resources)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 210, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 255, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 640, in run
    self.extract(
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 210, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 165, in _wrap
    rv = f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 151, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 255, in _wrap
    return f(self, *args, **kwargs)
  File "/Users/sspaeti/.venvs/dlt/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 424, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1710396526.7155468 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe a02t044: extraction of resource a02t044 in transform Incremental caused an exception: 'NoneType' object has no attribute 'compute'

rudolfix · 2024-03-17T13:34:02Z

@sh-rp SQLAlchemy was removed from our dependencies. This does not show up in the tests, most probably because one of the extra or dev deps installs pandas. What is wrong:

1. We import pandas and arrow in one try/except block in transform.py, so a missing pandas kills arrow even if arrow is installed.
2. In the pandas helper we import pandas and pandas sql together. Those should come from separate modules so we can instruct the user to install different deps.

1 + 2 -> missing sqlalchemy and pandas break the pyarrow import
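
The separation described above could look roughly like the following sketch. It is an illustration under assumptions, not the code from this PR or from #1112: `optional_import`, `require`, and `MissingDependencyError` are hypothetical names. The point is that each optional library gets its own guard, so a missing pandas cannot set pyarrow to `None`, and the user is told which package to install.

```python
import importlib
from types import ModuleType
from typing import Optional


class MissingDependencyError(RuntimeError):
    """Raised when an optional library is needed but not installed (hypothetical)."""


def optional_import(name: str) -> Optional[ModuleType]:
    """Import module `name` and return it, or None if it cannot be found."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        return None


def require(module: Optional[ModuleType], package: str) -> ModuleType:
    """Return `module`, or raise an error naming the package to install."""
    if module is None:
        raise MissingDependencyError(
            f"This feature needs '{package}'. Install it with: pip install {package}"
        )
    return module


# One guard per library: the imports no longer share a single try/except block.
pandas = optional_import("pandas")
pyarrow = optional_import("pyarrow")
```

A caller would then write something like `pa = require(pyarrow, "pyarrow")` at the point of use, which produces an install hint rather than `'NoneType' object has no attribute 'compute'`.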

fix missing arrow compute for incrementals on arrow loads

aab6baa

sh-rp marked this pull request as ready for review February 27, 2024 10:45

sh-rp mentioned this pull request Feb 27, 2024

fixing: module 'pyarrow' has no attribute 'compute' #1007

Closed

sh-rp requested a review from rudolfix February 27, 2024 11:12

sh-rp self-assigned this Feb 27, 2024

rudolfix requested changes Feb 28, 2024

fix numpy and pandas imports

aa0db6d

sh-rp requested a review from rudolfix March 1, 2024 20:39

rudolfix approved these changes Mar 2, 2024

rudolfix merged commit fc34dd0 into devel Mar 2, 2024
58 of 66 checks passed

rudolfix deleted the d#/fix_arrow_compute_missing branch March 2, 2024 21:53

rudolfix mentioned this pull request Mar 18, 2024

splits pandas and arrow imports #1112

Merged
