
[Data] Re-implement APIs like select_columns with PyArrow batch format #48140

Open · wants to merge 33 commits into master

Conversation

@ArturNiederfahrenhorst (Contributor) commented Oct 21, 2024

Related issue number

Closes #48090

Prerequisite: #48575

@ArturNiederfahrenhorst (Contributor, Author):

Looking at the failed test...

python/ray/data/dataset.py (outdated review thread, resolved)
@ArturNiederfahrenhorst (Contributor, Author):

I'll rebase once the fix is in, and the MongoDB test should pass.

@alexeykudinkin (Contributor) left a comment:

@ArturNiederfahrenhorst please hold off on landing this one.

python/ray/data/dataset.py (outdated review thread, resolved)
Comment on lines 701 to 702
Callable[["pandas.DataFrame"], "pandas.Series"],
Callable[["pyarrow.Table"], "pyarrow.Array"],
Contributor:

These APIs have to be consistent with map_batches for both inputs and outputs.

@bveeramani (Member) commented Nov 6, 2024:

It wasn't consistent to begin with:

fn: Callable[["pandas.DataFrame"], "pandas.Series"],
*,

Without making breaking changes, what should the type of fn be?

Contributor:

Understood. That shouldn't be an excuse not to make it right, though. It should match map_batches (REF):

DataBatch = Union["pyarrow.Table", "pandas.DataFrame", Dict[str, np.ndarray]]

(i.e., also permit ndarray to be accepted/returned)

Member:

What would the return type of the callable be? Currently it's pandas.Series. Changing the return type from Series to DataFrame would be a breaking change.

Contributor:

map_batches accepts a "table"-like structure; here we expect a list of column values, so it's not about literally replacing it with DataFrame, but about aligning the APIs:

  • map_batches accepts: DataBatch = Union["pyarrow.Table", "pandas.DataFrame", Dict[str, np.ndarray]]
  • add_column should accept Union[pa.Array, pandas.Series, ndarray] (see the sketch below)
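
For concreteness, a sketch of the proposed alignment (hypothetical type aliases, not the PR's final code):

```python
from typing import Callable, Dict, Union

import numpy as np
import pandas as pd
import pyarrow as pa

# Batch type accepted by map_batches, as quoted above:
DataBatch = Union[pa.Table, pd.DataFrame, Dict[str, np.ndarray]]

# Hypothetical aligned signature for add_column's fn: a batch in, one column out.
AddColumnFn = Callable[[DataBatch], Union[pa.Array, pd.Series, np.ndarray]]
```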

@ArturNiederfahrenhorst (Contributor, Author):

@alexeykudinkin Do you want us to add the numpy functionality in this PR as well, for consistency with map_batches?

Member:

Discussed offline with @alexeykudinkin -- let's do Callable[[DataBatch], Union[pa.Array, pd.Series, ndarray]].

That would not be correct, because that type would allow, for example, Callable[[pyarrow.Table], ndarray], which I don't think we want to allow?

It's weird, but map_batches allows you to change the batch format. Something like this is valid:

def udf(batch: pa.Table) -> Dict[str, np.ndarray]:
    ...

ds.map_batches(udf, batch_format="pyarrow")
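
A runnable version of that snippet (a minimal sketch; the column name assumes ray.data.range's default "id" column):

```python
from typing import Dict

import numpy as np
import pyarrow as pa
import ray

def udf(batch: pa.Table) -> Dict[str, np.ndarray]:
    # The input arrives as an Arrow table, but the UDF hands back a numpy
    # batch; map_batches converts the returned dict back into blocks.
    return {name: batch[name].to_numpy() for name in batch.column_names}

ds = ray.data.range(10).map_batches(udf, batch_format="pyarrow")
print(ds.take(2))  # [{'id': 0}, {'id': 1}]
```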

Contributor:

@bveeramani this happens because Arrow is able to do zero-copy conversion from ndarray (with some exceptions).
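
A small sketch of that conversion (assuming a primitive dtype with no nulls, where Arrow can wrap the numpy buffer directly):

```python
import numpy as np
import pyarrow as pa

arr = np.arange(5, dtype=np.int64)
pa_arr = pa.array(arr)  # wraps the numpy buffer without copying for primitive dtypes
print(pa_arr.type)  # int64
```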

Member:

Oh, I meant it more from an interface perspective. At least personally, I found it unexpected that I could take an Arrow table as input and return a DataFrame as output (not that it's necessarily an issue).

@ArturNiederfahrenhorst (Contributor, Author):

Thanks for the input, guys. I've made the change. Waiting for CI...

)

assert ds.count() == 5
assert ds.schema().names == ["_id", "float_field", "int_field"]
@ArturNiederfahrenhorst (Contributor, Author):

Made these changes to decouple the assertions from the string representation, which may vary across versions. On my local environment, it was different from here/CI.

Contributor:

Great!

Member:

Nice

@@ -362,7 +383,7 @@ def test_drop_columns(ray_start_regular_shared, tmp_path):
     assert ds.drop_columns(["col2"]).take(1) == [{"col1": 1, "col3": 3}]
     assert ds.drop_columns(["col1", "col3"]).take(1) == [{"col2": 2}]
     assert ds.drop_columns([]).take(1) == [{"col1": 1, "col2": 2, "col3": 3}]
-    assert ds.drop_columns(["col1", "col2", "col3"]).take(1) == [{}]
+    assert ds.drop_columns(["col1", "col2", "col3"]).take(1) == []
@ArturNiederfahrenhorst (Contributor, Author):
As discussed offline, this behavior is arbitrary and probably has little practical relevance.
Since our pyarrow implementation of the drop operation returns an empty list, we decided to just change the test in this case.
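
For context, a minimal sketch of the underlying Arrow behavior (hypothetical data, just to show the operation):

```python
import pyarrow as pa

t = pa.table({"col1": [1], "col2": [2], "col3": [3]})
t2 = t.drop_columns(["col1", "col2", "col3"])
print(t2.column_names)  # [] -- with no columns left, there are no row dicts to emit
```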

def add_column(batch: "pandas.DataFrame") -> "pandas.DataFrame":
    batch.loc[:, col] = fn(batch)
    return batch

def add_column(batch: "pyarrow.Table") -> "pyarrow.Table":
Contributor:

The typing here is off -- batch is of type DataBatch, right? For example, it could be pandas.

@ArturNiederfahrenhorst (Contributor, Author):

Thanks!
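
A sketch of what the corrected pyarrow-format wrapper could look like (a hypothetical factory form; the PR's final code may differ):

```python
import pyarrow as pa

def make_add_column(col, fn):
    # fn maps a pyarrow.Table to a pyarrow.Array (or ChunkedArray).
    def add_column(batch: pa.Table) -> pa.Table:
        column = fn(batch)
        if col in batch.column_names:
            # Replace an existing column (the overwrite case discussed below).
            idx = batch.schema.get_field_index(col)
            return batch.set_column(idx, col, column)
        # Otherwise append the new column at the end.
        return batch.append_column(col, column)
    return add_column
```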

Comment on lines 781 to 789

if batch_format not in [
    "pandas",
    "pyarrow",
]:
    raise ValueError(
        f"batch_format argument must be 'pandas' or 'pyarrow', "
        f"got: {batch_format}"
    )

Contributor:

I don't think you need to validate here; this should happen in map_batches.

Comment on lines +843 to +846
# Historically, we have also accepted lists with duplicate column names.
# This is not tolerated by the underlying pyarrow.Table.drop_columns method.
cols_without_duplicates = list(set(cols))

Contributor:

I think we should just enforce this via validation and raise an error.

@ArturNiederfahrenhorst (Contributor, Author):

That would be a breaking change, then! Still want it?

Contributor:

I think it's fine, yes.
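
A minimal sketch of that validation (a hypothetical helper, not the PR's final code):

```python
def _check_no_duplicate_columns(cols):
    # Fail fast instead of silently deduplicating, per the review discussion.
    seen = set()
    for col in cols:
        if col in seen:
            raise ValueError(f"drop_columns received duplicate column name: {col!r}")
        seen.add(col)
```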

python/ray/data/dataset.py (outdated review thread, resolved)
Comment on lines 781 to 788

if batch_format not in [
    "pandas",
    "pyarrow",
]:
    raise ValueError(
        f"batch_format argument must be 'pandas' or 'pyarrow', "
        f"got: {batch_format}"
    )
Member:

Any reason we can't support the numpy batch format?
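
If numpy were supported, the wrapper could mirror the pandas/pyarrow variants (a hypothetical sketch):

```python
from typing import Dict

import numpy as np

def make_add_column(col, fn):
    # fn maps a numpy batch (dict of column name to ndarray) to an ndarray.
    def add_column(batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        batch[col] = fn(batch)
        return batch
    return add_column
```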

Comment on lines +775 to +776
# Create a new table with the updated column
return batch.set_column(column_idx, col, column)
Member:

Should we either error or emit a warning here? Overriding a column might be unexpected.

@ArturNiederfahrenhorst (Contributor, Author):

@bveeramani Does Ray Data have existing helpers to log this without spamming? I'd do the same for numpy, pandas, and arrow then.
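
One possible warn-once pattern (a hypothetical sketch; an existing Ray Data helper, if there is one, would be preferable):

```python
import logging

logger = logging.getLogger(__name__)
_warned_keys = set()

def warn_once(key: str, message: str) -> None:
    # Emit each distinct warning only once per process to avoid log spam.
    if key not in _warned_keys:
        _warned_keys.add(key)
        logger.warning(message)
```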


Co-authored-by: Balaji Veeramani <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
@richardliaw added the "go" label (add ONLY when ready to merge, run all tests) on Nov 19, 2024