
Ray Dataset - improve documentation #28484

Closed
DevJake opened this issue Sep 13, 2022 · 7 comments
Assignees
Labels
data (Ray Data-related issues), docs (An issue or change related to documentation), P1 (Issue that should be fixed within a few weeks)

Comments

@DevJake
Contributor

DevJake commented Sep 13, 2022

Description

To be really blunt, the documentation is of pretty low quality. I've just spent several hours trying to work out how to apply a specific schema when loading Parquet data, and it turns out to be as simple as supplying it directly: read_parquet(schema=[Schema object]). I'm glad the solution is simple, but spending this long on it points to a bigger issue: the documentation neither mentions that this is possible nor lists what other parameters a Dataset might take. Navigating to the actual Dataset page is also difficult, since the search does not return meaningful results, and mentions of it are only highlighted, not hyperlinked.
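
For anyone who hits the same wall, this is roughly what ended up working for me (a minimal sketch; the path and column names are placeholders, and as far as I can tell the schema argument is simply forwarded to the underlying Arrow reader):

```python
import pyarrow as pa
import ray

# Spell the schema out explicitly instead of letting it be inferred from the
# first Parquet file (placeholder column names and types).
schema = pa.schema([
    ("user_id", pa.int64()),
    ("score", pa.float64()),   # the column that was wrongly inferred as null
    ("comment", pa.string()),
])

ds = ray.data.read_parquet("s3://my-bucket/my-dataset/", schema=schema)
ds = ds.repartition(200)  # no longer trips over the null-typed column
```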

Take this page on the read_parquet method. The listed parameters aren't really described or given examples; the descriptions simply restate the parameter names: "filesystem - The filesystem implementation to read from." That much is obvious from the name, but what types of values are accepted? I would assume string values, but what strings, exactly?
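
If I understand the source correctly, filesystem actually expects a pyarrow.fs.FileSystem instance rather than a string, but I had to read the code to learn that (the bucket name and region below are placeholders):

```python
import pyarrow.fs as pafs
import ray

# `filesystem` apparently wants a pyarrow filesystem object, not a string.
fs = pafs.S3FileSystem(region="us-west-2")  # placeholder region
ds = ray.data.read_parquet("my-bucket/my-dataset/", filesystem=fs)
```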

I'm sure this is a great library, and I can tell an enormous amount of work and effort has gone into it. But shipping with such uninformative, inaccessible and poorly structured documentation makes it incredibly frustrating for new people like me to get to grips with it. I'd love to use this library further, but if anything more complex than just loading in data is equally unclear, I won't be able to...

The solution I found came from this Stack Overflow post, which I only discovered by googling the uninformative error message I received when calling .repartition(): pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to null using function cast_null (raised via pyarrow.lib.check_status). Perhaps such an error message should show a snippet of the record and/or column(s) in question, or mention the need to verify that no null-typed fields exist in the dataset's schema?
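
For context, the failure itself comes straight from pyarrow rather than Ray: it is what Arrow raises when a real column has to be cast to the null type that was inferred from an entirely empty column in the first file. A tiny snippet (illustrative only) reproduces the same message:

```python
import pyarrow as pa

# The first file's column was entirely empty, so the dataset-level schema
# inferred it as null. Later files contain doubles, and double -> null is
# not an implemented cast:
pa.array([1.0, 2.0]).cast(pa.null())
# pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to null
# using function cast_null
```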

Link

https://docs.ray.io/en/latest/data/consuming-datasets.html#consuming-datasets
https://docs.ray.io/en/latest/data/api/input_output.html#ray.data.read_parquet

DevJake added the docs (An issue or change related to documentation) label Sep 13, 2022
richardliaw changed the title from "Ray Dataset" to "Ray Dataset - improve documentation" Sep 13, 2022
matthewdeng added the data (Ray Data-related issues) and P1 (Issue that should be fixed within a few weeks) labels Sep 13, 2022
@richardliaw
Contributor

@DevJake I think this is really good feedback, and we've been trying to make the documentation better.

That being said, from our perspective it's very hard to write great docs when we're writing them for ourselves. I'm wondering if you'd be willing to sit down with us, walk us through your experience from a first-person perspective, and give some more color on what you're trying to do?

@pcmoritz
Contributor

Thanks, this is super great feedback! I'll try to make a PR to improve the documentation with these points and tag you in it @DevJake

As Richard points out, we'd also love to get more feedback from you. Ray Data is still a very new library and we are actively improving the documentation -- such immediate feedback is incredibly valuable!

Thanks for your help :)

@pcmoritz
Contributor

pcmoritz commented Sep 13, 2022

I made a PR to address the issues you brought up on the Ray side (#28488).

The error message will need to be improved in Apache Arrow. I'm happy to do that and have done so in the past (e.g. apache/arrow#13979, apache/arrow#14001, apache/arrow#14019) -- the Apache Arrow project is still relatively young and can use all our help to iron out usability issues like this. I'm incredibly excited that Apache Arrow is developing a (Python-)native data processing stack, and we should help them as best we can by giving them feedback and sending them PRs.

Sorry for your frustration and thanks for your help in surfacing and fixing these issues!

@DevJake
Contributor Author

DevJake commented Sep 14, 2022

Hey @pcmoritz and @richardliaw, I really appreciate your quick responses and action on this. I apologise for the bluntness in my message, but I'm sure you can understand. A lot of new open-source software is being developed in response to the AI explosion we're seeing lately, so it naturally needs time to catch up with tools that have been around much longer. I'd be more than happy to help work through the documentation, perhaps you'll see me on the Slack soon enough 👀

When I've got more time, I'll probably end up making some pull requests myself to contribute. No use knowing an issue exists without trying to also resolve it. 😅

DevJake closed this as completed Sep 14, 2022
pcmoritz added a commit that referenced this issue Sep 14, 2022
@pcmoritz
Contributor

pcmoritz commented Sep 14, 2022

Thanks, that sounds great! Any further help, feedback, and of course PRs are certainly very much appreciated :)

This very concrete feedback really helps improve things, keep it coming!

ArturNiederfahrenhorst added a commit that referenced this issue Sep 15, 2022
…tests (#28535)

* [core/ci] Disallow protobuf 3.19.5 (#28504)

This leads to hangs in Ray client (e.g. test_dataclient_disconnect)

Signed-off-by: Kai Fricke <[email protected]>

* [tune] Fix trial checkpoint syncing after recovery from other node (#28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated, yet. Instead, the syncer callback should keep its own map of trials to IPs, and only act on this.

Signed-off-by: Kai Fricke <[email protected]>

* [air] minor example fix. (#28379)

Signed-off-by: xwjiang2010 <[email protected]>

* [cleanup] Remove memory unit conversion (#28396)

The internal memory unit was switched back to bytes years ago, there's no point in keeping confusing conversion code around anymore.

Recommendation: Review #28394 first, since this is stacked on top of it.

Co-authored-by: Alex <[email protected]>

* [RLlib] Sync policy specs from local_worker_for_synching while recovering rollout/eval workers. (#28422)

* Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)

* Cast rewards as tf.float32 to fix error in DQN in tf2

Signed-off-by: mgerstgrasser <[email protected]>

* Add test case for DQN with integer rewards

Signed-off-by: mgerstgrasser <[email protected]>

Signed-off-by: mgerstgrasser <[email protected]>

* [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)

This addresses some of the issues brought up in #28484

* [ci] Increase timeout on test_metrics (#28508)

10 milliseconds is ambitious for the CI to do anything.

Co-authored-by: Alex <[email protected]>

* [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)

* Add imports to object-spilling.rst Python code (#28507)

* Add imports to object-spilling.rst Python code

Also adjust a couple descriptions, retaining the same general information

Signed-off-by: Jake <[email protected]>

* fix doc build / keep note formatting

Signed-off-by: Philipp Moritz <[email protected]>

* another tiny fix

Signed-off-by: Philipp Moritz <[email protected]>

Signed-off-by: Jake <[email protected]>
Signed-off-by: Philipp Moritz <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>

* [AIR] Make PathPartitionScheme a dataclass (#28390)

Signed-off-by: Balaji Veeramani <[email protected]>

* [Telemetry][Kubernetes] Distinguish Kubernetes deployment stacks (#28490)

Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

- KubeRay >= 0.4.0
- Legacy Ray Operator with Ray >= 2.1.0
- All other methods

The strategy is to have the operators inject an env variable into the Ray container's environment.
The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <[email protected]>

* [autoscaler][observability] Experimental verbose mode (#28392)

This PR introduces a super secret hidden verbose mode for ray status, which we can keep hidden while collecting feedback before going through the process of officially declaring it part of the public API.

Example output

======== Autoscaler status: 2020-12-28 01:02:03 ========
GCS request time: 3.141500s
Node Provider non_terminated_nodes time: 1.618000s

Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Total Usage:
 1/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00/8.000 GiB memory
 3.14/16.000 GiB object_store_memory

Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()

Node: 192.168.1.1
 Usage:
  0.1/1 AcceleratorType:V100
  5.0/20.0 CPU
  0.7/1 GPU
  1.00/4.000 GiB memory
  3.14/4.000 GiB object_store_memory

Node: 192.168.1.2
 Usage:
  0.9/1 AcceleratorType:V100
  15.0/20.0 CPU
  0.3/1 GPU
  1.00/12.000 GiB memory
  0.00/4.000 GiB object_store_memory

Co-authored-by: Alex <[email protected]>

* [doc/tune] fix tune stopper attribute name (#28517)

* [doc] Fix tune stopper doctests (#28531)

* [air] Use self-hosted mirror for CIFAR10 dataset (#28480)

The CIFAR10 website host has been unreliable in the past. This PR injects our own mirror into our CI packages for testing.

Signed-off-by: Kai Fricke <[email protected]>

* draft

Signed-off-by: Artur Niederfahrenhorst <[email protected]>

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: mgerstgrasser <[email protected]>
Signed-off-by: Jake <[email protected]>
Signed-off-by: Philipp Moritz <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Jun Gong <[email protected]>
Co-authored-by: mgerstgrasser <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Jake <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Dmitri Gekhtman <[email protected]>
Co-authored-by: Árpád Rózsás <[email protected]>
@pcmoritz
Contributor

Btw, I was digging a bit deeper into the underlying issue of why you needed to specify a schema at all here, and my findings are in https://issues.apache.org/jira/projects/ARROW/issues/ARROW-17719

I wonder how large the performance overhead would be if we made it the default to inspect the schemata of all partitions. That seems more user friendly :)
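
Roughly what I have in mind, just to illustrate (not an existing Ray option; the file names are placeholders, and this only reads the Parquet footers rather than the data, so it may be cheaper than it sounds):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import ray

def unified_schema(paths):
    # Read only each file's footer and merge the schemas; if I read the
    # pyarrow docs right, null-typed fields unify with the concrete type
    # found in the other files.
    return pa.unify_schemas([pq.read_schema(p) for p in paths])

paths = ["part-0.parquet", "part-1.parquet"]  # placeholder partition files
ds = ray.data.read_parquet(paths, schema=unified_schema(paths))
```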

@DevJake
Contributor Author

DevJake commented Sep 16, 2022

That makes complete sense, with the schema being inferred from the very first file.

I imagine inspecting all partitions might carry a considerable overhead, especially with a large number of partitions. Perhaps any or all of the following would work:

  • Prompting the user to explicitly specify a schema, as I have done.
  • Providing an optional parameter to infer the schema by merging across all partitions, as you suggested. Something like get_schema_from_all_partitions=False.
  • Issuing a clear warning for any field inferred as null in the schema, since a null-typed field is almost always a sign of this problem (sketched below).

I'm glad you managed to track down the underlying issue. I suppose it's particularly unlikely that someone would load a dataset that features one or more entirely empty/null columns, but not impossible.
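
To illustrate the third bullet, something along these lines (entirely hypothetical, not an existing Ray helper):

```python
import warnings
import pyarrow as pa

def warn_on_null_fields(schema: pa.Schema) -> None:
    # A null-typed field almost always means the schema was inferred from a
    # file whose column was entirely empty, which later breaks repartition().
    for field in schema:
        if pa.types.is_null(field.type):
            warnings.warn(
                f"Column '{field.name}' was inferred as type null; consider "
                "passing an explicit schema to read_parquet()."
            )
```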

PaulFenton pushed a commit to PaulFenton/ray that referenced this issue Sep 19, 2022