
Ray Dataset - improve documentation #28484

Closed
DevJake opened this issue Sep 13, 2022 · 7 comments
Assignees
Labels
data (Ray Data-related issues), docs (An issue or change related to documentation), P1 (Issue that should be fixed within a few weeks)

Comments

@DevJake
Contributor

DevJake commented Sep 13, 2022

Description

To be really blunt, the documentation is of pretty low quality. I've just spent several hours trying to work out how to apply a specific schema when loading Parquet data, and it turns out to be as simple as supplying it directly: read_parquet(schema=[Schema object]). I'm glad the solution is simple, but spending this long on it points to a bigger issue: the documentation neither mentions that this is possible nor lists what other parameters a Dataset might take. Navigating to the actual Dataset page is also difficult, since the search does not return meaningful results, and mentions of it are only highlighted, not hyperlinked.
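
For anyone who hits the same wall, this is roughly what ended up working for me (a minimal sketch; the path and column names are placeholders, and as far as I can tell the schema argument is simply forwarded to the underlying Arrow reader):

```python
import pyarrow as pa
import ray

# Spell the schema out explicitly instead of letting it be inferred from the
# first Parquet file (placeholder column names and types).
schema = pa.schema([
    ("user_id", pa.int64()),
    ("score", pa.float64()),   # the column that was wrongly inferred as null
    ("comment", pa.string()),
])

ds = ray.data.read_parquet("s3://my-bucket/my-dataset/", schema=schema)
ds = ds.repartition(200)  # no longer trips over the null-typed column
```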

Take this page on the read_parquet method. The listed parameters aren't really described or given examples; the descriptions simply restate the parameter names: "filesystem - The filesystem implementation to read from." That much is obvious from the name, but what types of values are accepted? I would assume string values, but what strings, exactly?
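
If I understand the source correctly, filesystem actually expects a pyarrow.fs.FileSystem instance rather than a string, but I had to read the code to learn that (the bucket name and region below are placeholders):

```python
import pyarrow.fs as pafs
import ray

# `filesystem` apparently wants a pyarrow filesystem object, not a string.
fs = pafs.S3FileSystem(region="us-west-2")  # placeholder region
ds = ray.data.read_parquet("my-bucket/my-dataset/", filesystem=fs)
```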

I'm sure this is a great library, and I can tell an enormous amount of work and effort has gone into it. But shipping with such uninformative, inaccessible and poorly structured documentation makes it incredibly frustrating for new people like me to get to grips with it. I'd love to use this library further, but if anything more complex than just loading in data is equally unclear, I won't be able to...

The solution I found came from this Stack Overflow post, which I only discovered by googling the uninformative error message I received when calling .repartition(): pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to null using function cast_null (raised via pyarrow.lib.check_status). Perhaps such an error message should show a snippet of the record and/or column(s) in question, or mention the need to verify that no null-typed fields exist in the dataset's schema?
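
For context, the failure itself comes straight from pyarrow rather than Ray: it is what Arrow raises when a real column has to be cast to the null type that was inferred from an entirely empty column in the first file. A tiny snippet (illustrative only) reproduces the same message:

```python
import pyarrow as pa

# The first file's column was entirely empty, so the dataset-level schema
# inferred it as null. Later files contain doubles, and double -> null is
# not an implemented cast:
pa.array([1.0, 2.0]).cast(pa.null())
# pyarrow.lib.ArrowNotImplementedError: Unsupported cast from double to null
# using function cast_null
```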

Link

https://docs.ray.io/en/latest/data/consuming-datasets.html#consuming-datasets
https://docs.ray.io/en/latest/data/api/input_output.html#ray.data.read_parquet

DevJake added the docs (An issue or change related to documentation) label Sep 13, 2022
richardliaw changed the title from "Ray Dataset" to "Ray Dataset - improve documentation" Sep 13, 2022
matthewdeng added the data (Ray Data-related issues) and P1 (Issue that should be fixed within a few weeks) labels Sep 13, 2022
@richardliaw
Contributor

@DevJake I think this is really good feedback, and we've been trying to make the documentation better.

That being said, from our perspective it's very hard to write great docs when we're writing them for ourselves. I'm wondering if you'd be willing to sit down with us, walk us through your experience from a first-person perspective, and give some more color on what you're trying to do?

@pcmoritz
Contributor

Thanks, this is super great feedback! I'll try to make a PR to improve the documentation with these points and tag you in it @DevJake

As Richard points out, we'd also love to get more feedback from you. Ray Data is still a very new library and we are actively improving the documentation -- such immediate feedback is incredibly valuable!

Thanks for your help :)

@pcmoritz
Contributor

pcmoritz commented Sep 13, 2022

I made a PR to address the issues you brought up on the Ray side (#28488).

The error message will need to be improved in Apache Arrow. I'm happy to do that and have done so in the past (e.g. apache/arrow#13979, apache/arrow#14001, apache/arrow#14019) -- the Apache Arrow project is still relatively young and can use all our help to iron out usability issues like this. I'm incredibly excited that Apache Arrow is developing a (Python-)native data processing stack, and we should help them as best we can by giving them feedback and sending them PRs.

Sorry for your frustration and thanks for your help in surfacing and fixing these issues!

@DevJake
Contributor Author

DevJake commented Sep 14, 2022

Hey @pcmoritz and @richardliaw, I really appreciate your quick responses and action on this. I apologise for the bluntness in my message, but I'm sure you can understand. A lot of new open-source software is being developed in response to the AI explosion we're seeing lately, so it naturally needs time to catch up with tools that have been around much longer. I'd be more than happy to help work through the documentation, perhaps you'll see me on the Slack soon enough 👀

When I've got more time, I'll probably end up making some pull requests myself to contribute. No use knowing an issue exists without trying to also resolve it. 😅

DevJake closed this as completed Sep 14, 2022
pcmoritz added a commit that referenced this issue Sep 14, 2022
@pcmoritz
Contributor

pcmoritz commented Sep 14, 2022

Thanks, that sounds great! Any further help, feedback, and of course PRs are certainly very much appreciated :)

This very concrete feedback really helps improve things, keep it coming!

ArturNiederfahrenhorst added a commit that referenced this issue Sep 15, 2022
…tests (#28535)

* [core/ci] Disallow protobuf 3.19.5 (#28504)

This leads to hangs in Ray client (e.g. test_dataclient_disconnect)

Signed-off-by: Kai Fricke <[email protected]>

* [tune] Fix trial checkpoint syncing after recovery from other node (#28470)

On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated, yet. Instead, the syncer callback should keep its own map of trials to IPs, and only act on this.

Signed-off-by: Kai Fricke <[email protected]>

* [air] minor example fix. (#28379)

Signed-off-by: xwjiang2010 <[email protected]>

* [cleanup] Remove memory unit conversion (#28396)

The internal memory unit was switched back to bytes years ago, there's no point in keeping confusing conversion code around anymore.

Recommendation: Review #28394 first, since this is stacked on top of it.

Co-authored-by: Alex <[email protected]>

* [RLlib] Sync policy specs from local_worker_for_synching while recovering rollout/eval workers. (#28422)

* Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)

* Cast rewards as tf.float32 to fix error in DQN in tf2

Signed-off-by: mgerstgrasser <[email protected]>

* Add test case for DQN with integer rewards

Signed-off-by: mgerstgrasser <[email protected]>

Signed-off-by: mgerstgrasser <[email protected]>

* [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)

This addresses some of the issues brought up in #28484

* [ci] Increase timeout on test_metrics (#28508)

10 milliseconds is ambitious for the CI to do anything.

Co-authored-by: Alex <[email protected]>

* [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)

* Add imports to object-spilling.rst Python code (#28507)

* Add imports to object-spilling.rst Python code

Also adjust a couple descriptions, retaining the same general information

Signed-off-by: Jake <[email protected]>

* fix doc build / keep note formatting

Signed-off-by: Philipp Moritz <[email protected]>

* another tiny fix

Signed-off-by: Philipp Moritz <[email protected]>

Signed-off-by: Jake <[email protected]>
Signed-off-by: Philipp Moritz <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>

* [AIR] Make PathPartitionScheme a dataclass (#28390)

Signed-off-by: Balaji Veeramani <[email protected]>

* [Telemetry][Kubernetes] Distinguish Kubernetes deployment stacks (#28490)

Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes.

This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

- KubeRay >= 0.4.0
- Legacy Ray Operator with Ray >= 2.1.0
- All other methods

The strategy is to have the operators inject an env variable into the Ray container's environment.
The variable identifies the deployment method.

This PR also modifies the legacy Ray operator to inject the relevant env variable.
A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <[email protected]>

* [autoscaler][observability] Experimental verbose mode (#28392)

This PR introduces a super secret hidden verbose mode for ray status, which we can keep hidden while collecting feedback before going through the process of officially declaring it part of the public API.

Example output

======== Autoscaler status: 2020-12-28 01:02:03 ========
GCS request time: 3.141500s
Node Provider non_terminated_nodes time: 1.618000s

Node status
--------------------------------------------------------
Healthy:
 2 p3.2xlarge
 20 m4.4xlarge
Pending:
 m4.4xlarge, 2 launching
 1.2.3.4: m4.4xlarge, waiting-for-ssh
 1.2.3.5: m4.4xlarge, waiting-for-ssh
Recent failures:
 p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

Resources
--------------------------------------------------------
Total Usage:
 1/2 AcceleratorType:V100
 530.0/544.0 CPU
 2/2 GPU
 2.00/8.000 GiB memory
 3.14/16.000 GiB object_store_memory

Total Demands:
 {'CPU': 1}: 150+ pending tasks/actors
 {'CPU': 4} * 5 (PACK): 420+ pending placement groups
 {'CPU': 16}: 100+ from request_resources()

Node: 192.168.1.1
 Usage:
  0.1/1 AcceleratorType:V100
  5.0/20.0 CPU
  0.7/1 GPU
  1.00/4.000 GiB memory
  3.14/4.000 GiB object_store_memory

Node: 192.168.1.2
 Usage:
  0.9/1 AcceleratorType:V100
  15.0/20.0 CPU
  0.3/1 GPU
  1.00/12.000 GiB memory
  0.00/4.000 GiB object_store_memory

Co-authored-by: Alex <[email protected]>

* [doc/tune] fix tune stopper attribute name (#28517)

* [doc] Fix tune stopper doctests (#28531)

* [air] Use self-hosted mirror for CIFAR10 dataset (#28480)

The CIFAR10 website host has been unreliable in the past. This PR injects our own mirror into our CI packages for testing.

Signed-off-by: Kai Fricke <[email protected]>

* draft

Signed-off-by: Artur Niederfahrenhorst <[email protected]>

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: mgerstgrasser <[email protected]>
Signed-off-by: Jake <[email protected]>
Signed-off-by: Philipp Moritz <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Jun Gong <[email protected]>
Co-authored-by: mgerstgrasser <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Jake <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Dmitri Gekhtman <[email protected]>
Co-authored-by: Árpád Rózsás <[email protected]>
@pcmoritz
Contributor

Btw, I was digging a bit deeper into the underlying issue of why you needed to specify a schema at all here, and my findings are in https://issues.apache.org/jira/projects/ARROW/issues/ARROW-17719

I wonder how large the performance overhead would be if we made it the default to inspect the schemata of all partitions. That seems more user friendly :)
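
Roughly what I have in mind, just to illustrate (not an existing Ray option; the file names are placeholders, and this only reads the Parquet footers rather than the data, so it may be cheaper than it sounds):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import ray

def unified_schema(paths):
    # Read only each file's footer and merge the schemas; if I read the
    # pyarrow docs right, null-typed fields unify with the concrete type
    # found in the other files.
    return pa.unify_schemas([pq.read_schema(p) for p in paths])

paths = ["part-0.parquet", "part-1.parquet"]  # placeholder partition files
ds = ray.data.read_parquet(paths, schema=unified_schema(paths))
```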

@DevJake
Contributor Author

DevJake commented Sep 16, 2022

That makes complete sense, with the schema being inferred from the very first file.

I imagine inspecting all partitions might carry a considerable overhead, especially with a large number of partitions. Perhaps any or all of the following would work:

  • Prompting the user to explicitly specify a schema, as I have done.
  • Providing an optional parameter to infer the schema by merging across all partitions, as you suggested. Something like get_schema_from_all_partitions=False.
  • Issuing a clear warning for any field inferred as null in the schema, since a null-typed field is almost always a sign of this problem (sketched below).

I'm glad you managed to track down the underlying issue. I suppose it's particularly unlikely that someone would load a dataset that features one or more entirely empty/null columns, but not impossible.
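
To illustrate the third bullet, something along these lines (entirely hypothetical, not an existing Ray helper):

```python
import warnings
import pyarrow as pa

def warn_on_null_fields(schema: pa.Schema) -> None:
    # A null-typed field almost always means the schema was inferred from a
    # file whose column was entirely empty, which later breaks repartition().
    for field in schema:
        if pa.types.is_null(field.type):
            warnings.warn(
                f"Column '{field.name}' was inferred as type null; consider "
                "passing an explicit schema to read_parquet()."
            )
```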

PaulFenton pushed a commit to PaulFenton/ray that referenced this issue Sep 19, 2022