Add `ParquetObjectReader::with_runtime` #6612

itsjunetime · 2024-10-21T20:35:06Z

Which issue does this PR close?

What changes are included in this PR?

This PR builds on #6249: it adds a test for the new `with_runtime` fn and slightly changes the signature of the `spawn` function to avoid an extra re-boxing when a runtime is set.

This also fixes a few things that clippy was complaining about.

Rationale for this change

See #6248 for the API addition.

With regard to the test, I felt like this is really the only test we'd want for this feature - we just want to make sure that the runtime is actually being used by ParquetObjectReader. We can't make any guarantees about how it actually performs or would work if there's another runtime being used for CPU-bound operations, so all we really want to test is if it is used.
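The check described here boils down to comparing thread IDs: run a closure through the reader and assert that it executed on a different thread than the test itself. A minimal std-only sketch of that idea follows. It uses a plain `std::thread` in place of a tokio runtime, and the `run_elsewhere` and `thread_ids` helpers are hypothetical names for illustration, not part of the parquet crate.

```rust
use std::thread::{self, ThreadId};

/// Hypothetical stand-in for "spawn this work on the other runtime":
/// runs `f` on a freshly spawned thread and returns its result.
fn run_elsewhere<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(f).join().expect("worker thread panicked")
}

/// Returns (id of the calling thread, id of the thread that ran the work).
fn thread_ids() -> (ThreadId, ThreadId) {
    let current_id = thread::current().id();
    let other_id = run_elsewhere(|| thread::current().id());
    (current_id, other_id)
}

fn main() {
    let (current_id, other_id) = thread_ids();
    // The work ran somewhere else, which is all the test wants to establish.
    assert_ne!(current_id, other_id);
    println!("caller: {current_id:?}, worker: {other_id:?}");
}
```

This says nothing about performance or about how a second runtime behaves under CPU-bound load; it only shows that the work left the calling thread.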

Are there any user-facing changes?

No

tustvold · 2024-10-21T20:47:05Z

parquet/src/arrow/async_reader/store.rs

+    #[tokio::test]
+    // We need to mark this with the `target_has_atomic` because the spawned_tasks_count() fn is
+    // only available for that cfg
+    #[cfg(all(target_has_atomic = "64", tokio_unstable))]


I wonder if we could instead create a runtime with IO / blocking threads disabled and use that to determine that the IO was spawned to a different runtime?

I don't think that would work. I'm not certain why, but ParquetObjectReader seems to work fine regardless of whether or not IO is 'enabled' on its runtime. I was able to change the tests so they don't rely on tokio_unstable anymore and (I think) still show what we want them to show, so I'll push that in a minute.

alamb

Thanks @itsjunetime -- this is looking very good. I think we also need to move `get_metadata` to `spawn`.

Otherwise I have a few other suggestions, but this one is looking very close

parquet/src/arrow/async_reader/store.rs

alamb · 2024-10-28T11:07:30Z

parquet/src/arrow/async_reader/store.rs

+
+        assert_ne!(current_id, other_id);
+
+        tokio::runtime::Handle::current().spawn_blocking(move || drop(rt));


Can you also add unit tests for each of the three APIs in ParquetObjectReader where `spawn` is used?

get_bytes

get_byte_ranges

get_metadata?

alamb

Looks good to me -- thank you @itsjunetime

alamb · 2024-10-29T20:47:44Z

parquet/src/errors.rs

@@ -107,6 +107,13 @@ impl From<str::Utf8Error> for ParquetError {
    }
 }

+#[cfg(test)]
+impl From<std::convert::Infallible> for ParquetError {


This is a nice improvement too. Thank you. Maybe it is worth adding publicly as well.

I'm not sure about this, the whole point of infallible is that it can't be constructed and so doesn't need to be handled

Well, it can't be constructed, but it often does need to be "handled" (aka to transform a Result<.., Infallible> to Result<.., Error> type expected by an API)

I don't feel strongly about this particular code.

> aka to transform a Result<.., Infallible> to Result<.., Error> type expected by an API)

Right but this is a little funky, because it then makes code look more fallible than it is. Often you can use an infallible version of the API, i.e. into() instead of try_into(), but sometimes you do have to either unwrap() or let _ = ...

FWIW Rust 1.82 gives us a very nice way to handle this, but I'm not sure whether our MSRV policy covers tests.

let Ok(value) = expression();

Removed in 8d24cd7
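For reference, the options discussed in this thread can be sketched with the standard library alone. This is an illustrative sketch, not code from the PR: `expression`, `into_ok`, and `widen` are hypothetical names. It uses an exhaustive empty `match`, which compiles on older toolchains, and notes the Rust 1.82 form in a comment.

```rust
use std::convert::Infallible;

/// An operation that cannot fail, but whose signature still returns `Result`.
fn expression() -> Result<u32, Infallible> {
    Ok(42)
}

/// Extracts the value without `unwrap()`: `Infallible` has no variants,
/// so the empty `match` proves the `Err` arm is unreachable.
fn into_ok<T>(result: Result<T, Infallible>) -> T {
    match result {
        Ok(value) => value,
        Err(never) => match never {},
    }
}

/// Where an infallible conversion exists, `into()` avoids the `Result` entirely.
fn widen(x: u8) -> u32 {
    x.into()
}

fn main() {
    // On Rust 1.82+ this can be written as `let Ok(value) = expression();`.
    let value = into_ok(expression());
    assert_eq!(value, 42);
    assert_eq!(widen(7), 7);
}
```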

tustvold · 2024-10-30T11:22:50Z

parquet/src/arrow/async_reader/store.rs

+        let current_id = std::thread::current().id();
+
+        let other_id = reader
+            .spawn(|_, _| async move { Ok::<_, Infallible>(std::thread::current().id()) }.boxed())


Suggested change

-            .spawn(|_, _| async move { Ok::<_, Infallible>(std::thread::current().id()) }.boxed())
+            .spawn(|_, _| async move { Ok::<_, ParquetError>(std::thread::current().id()) }.boxed())

Would remove the need for the std::convert::Infallible conversion

I had this repo checked out and open in my editor, so I just made this change in 8d24cd7 to accelerate getting this PR in.

It results in a nice simplification
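Why annotating the closure with the crate's own error type removes the need for a `From<Infallible>` impl can be shown with a small std-only sketch. It assumes a spawn-like helper whose error type only has to convert into the crate error; `MyError` and `run` are hypothetical stand-ins, not the actual `ParquetError` or `spawn` definitions.

```rust
/// Hypothetical crate error, standing in for `ParquetError`.
#[derive(Debug, PartialEq)]
struct MyError(String);

/// Hypothetical spawn-like helper: accepts any error type that
/// converts into `MyError` and normalizes the result.
fn run<F, T, E>(f: F) -> Result<T, MyError>
where
    F: FnOnce() -> Result<T, E>,
    E: Into<MyError>,
{
    f().map_err(Into::into)
}

fn main() {
    // `MyError: Into<MyError>` holds through the standard library's
    // reflexive `impl<T> From<T> for T`, so no extra `From` impl is needed.
    let ok = run(|| Ok::<_, MyError>(5));
    assert_eq!(ok, Ok(5));

    // Real errors pass through unchanged.
    let err = run(|| Err::<u32, _>(MyError("boom".to_string())));
    assert_eq!(err, Err(MyError("boom".to_string())));
}
```

Writing `Ok::<_, Infallible>` instead would require `Infallible: Into<MyError>`, which is the conversion the suggestion avoids.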

alamb · 2024-11-02T11:41:55Z

Thanks again @itsjunetime and @tustvold

tustvold and others added 2 commits October 21, 2024 13:49

Add ParquetObjectReader::with_runtime (apache#6248)

62cb6be

Add test for ParquetObjectReader::with_runtime and fix clippy complaints

74ffb19

github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels Oct 21, 2024

tustvold reviewed Oct 21, 2024

Switch ParquetObjectReader runtime tests to not depend on tokio_unstable anymore

80befa1

alamb mentioned this pull request Oct 25, 2024

Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) apache/datafusion#12393

Open

alamb reviewed Oct 28, 2024

itsjunetime and others added 2 commits October 29, 2024 10:57

Add doc-comment for test_runtime_thread_id_different

ff4437d

Co-authored-by: Andrew Lamb <[email protected]>

- Add comment about why we don't use spawn for metadata

e2270b0

- Remove outdated comment about target_has_atomic
- Add test to verify reader fails when spawned on a shutdown runtime

alamb approved these changes Oct 29, 2024

tustvold approved these changes Oct 29, 2024

tustvold reviewed Oct 30, 2024

alamb added 2 commits November 2, 2024 07:05

Avoid use of Infallable and From conversion

8d24cd7

Merge remote-tracking branch 'apache/master' into june/object_reader_with_runtime

c2a667d

alamb merged commit 22bc772 into apache:master Nov 2, 2024
26 checks passed

Tom-Newton mentioned this pull request Nov 26, 2024

to_pyarrow_table() on a table in S3 kept getting "Generic S3 error: error decoding response body" delta-io/delta-rs#2595

Open
