[DO NOT SUBMIT] diff 2.23 to 2.24 #48263

Draft
wants to merge 120 commits into
base: releases/2.23.0

rynewang
Contributor

No description provided.

aslonnie and others added 30 commits May 17, 2024 22:46
we use clang now; no gcc

Signed-off-by: Lonnie Liu <[email protected]>
not being used anymore. images are built with wanda and `ray_ci` scripts

Signed-off-by: Lonnie Liu <[email protected]>
old `bazel_tools` constraints are deprecated.

Signed-off-by: Lonnie Liu <[email protected]>
to latest version of 1.6.1

required to upgrade bazel. old skylib uses platform constraints that are
deprecated in newer versions of bazel.

Signed-off-by: Lonnie Liu <[email protected]>
Package uploading is CPU-intensive work in the Dashboard: it collects the whole 500 MiB working_dir and uploads it to the GCS. This can take 30s, during which the Dashboard event loop is blocked.

This PR moves the uploading to another thread. This avoids event loop blocking.

This PR also removes a dead reference to gcs_client in http_server_head.py.
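
A minimal sketch of the pattern described above, not the Dashboard's actual code: the blocking call is handed to a worker thread with `asyncio.to_thread`, so the event loop keeps serving other coroutines. `upload_package` and `heartbeat` are hypothetical stand-ins, and `time.sleep` simulates the slow packaging step.

```python
import asyncio
import time


def upload_package(path: str) -> str:
    """Hypothetical stand-in for the blocking packaging and upload step."""
    time.sleep(0.2)  # simulates slow, blocking work
    return f"uploaded:{path}"


async def heartbeat(ticks: list) -> None:
    """Appends a timestamp every 20 ms; it only advances if the loop is free."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main() -> tuple:
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks))
    # Run the blocking function in a worker thread instead of on the loop.
    result = await asyncio.to_thread(upload_package, "working_dir")
    hb.cancel()
    return result, len(ticks)


result, tick_count = asyncio.run(main())
```

Calling `upload_package` directly inside `main` would stall `heartbeat` for the full duration of the upload. Note that pure-Python CPU-bound work still contends for the GIL, so a thread keeps the loop responsive without making the work itself faster.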

Signed-off-by: Ruiyang Wang <[email protected]>
…passed via NCCL in accelerated DAG (#45332)

This adds support for dynamically sized torch.Tensors to be passed
between accelerated DAG nodes via NCCL. Specifically, the following code
is now supported, whereas previously `shape` and `dtype` had to be
explicitly passed to `TorchTensorType`.

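```python
# Import paths are an assumption based on the Ray source tree at the time.
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

# `sender` and `receiver` are actor handles created elsewhere.
with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
```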

The feature works by creating a shared memory channel to pass the
metadata for the shape and dtype of the tensor. The metadata is then
used to create a buffer of the correct size on the NCCL receiver.
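
The idea in the paragraph above can be sketched with standard-library stand-ins. This is an illustration only, not the actual implementation: `meta_channel` plays the role of the shared memory metadata channel, `data_channel` the role of the NCCL transport, and flat `array` objects stand in for tensors. All names are hypothetical.

```python
import queue
from array import array
from math import prod

meta_channel: queue.Queue = queue.Queue()  # stand-in for the metadata channel
data_channel: queue.Queue = queue.Queue()  # stand-in for the NCCL transport


def send(values: array, shape: tuple) -> None:
    # 1. Publish shape and element type so the receiver can size its buffer.
    meta_channel.put((shape, values.typecode))
    # 2. Send the raw payload, which carries no shape information itself.
    data_channel.put(values.tobytes())


def recv() -> tuple:
    shape, typecode = meta_channel.get()
    # Allocate a receive buffer of exactly the advertised size.
    nbytes = prod(shape) * array(typecode).itemsize
    buf = bytearray(nbytes)
    payload = data_channel.get()
    if len(payload) != nbytes:
        raise ValueError("payload does not match advertised metadata")
    buf[:] = payload
    out = array(typecode)
    out.frombytes(buf)
    return shape, out


send(array("f", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]), shape=(2, 3))
shape, received = recv()
```

The extra metadata round trip on every call is consistent with the per-call overhead reported for the dynamic path.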

Initial microbenchmarks show this adds about 50% throughput overhead
compared to statically declaring the shape and dtype, or about 160us/DAG
call. This seems a bit higher than expected (see also #45319).

This also adds a few other fixes:
- adds support for reusing actors to create new NCCL groups, which is
needed if a DAG is torn down and a new one is created.
- adds a lock to DAG teardown, to prevent the same NCCL group from
getting destructed twice.
- User-defined TorchTensorType shape or dtype is now used as a hint for
the buffer size, instead of a required size. Since buffers are currently
static, an error will be thrown if the user tries to return a too-large
tensor.
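
The teardown lock mentioned in the list above can be illustrated as follows. This is a hedged sketch, not the code from this PR: `CommunicatorGroup` and `Dag` are hypothetical stand-ins, and the point is only that the lock makes the check-and-set atomic so the group is destroyed once.

```python
import threading


class CommunicatorGroup:
    """Hypothetical stand-in for a resource that must be destroyed once."""

    def __init__(self) -> None:
        self.destroy_calls = 0

    def destroy(self) -> None:
        self.destroy_calls += 1


class Dag:
    """Hypothetical stand-in for an object that owns the group."""

    def __init__(self, group: CommunicatorGroup) -> None:
        self._group = group
        self._teardown_lock = threading.Lock()
        self._torn_down = False

    def teardown(self) -> None:
        # Holding the lock across the check and the update means concurrent
        # callers cannot both see `_torn_down` as False.
        with self._teardown_lock:
            if self._torn_down:
                return
            self._torn_down = True
            self._group.destroy()


group = CommunicatorGroup()
dag = Dag(group)
threads = [threading.Thread(target=dag.teardown) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```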

Part 1 of #45306, will follow up with a separate PR for nested tensors.


---------

Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Kai-Hsun Chen <[email protected]>
so that they do not have to execute in sequential order

Signed-off-by: Lonnie Liu <[email protected]>
This
[commit](0de88e4)
accidentally added these files to the `benchmarks/benchmarks/` directory
instead of `benchmarks/`. This PR moves them back into the
`benchmarks/` directory.

Signed-off-by: khluu <[email protected]>
… store as artifact (#45363)

- This is for use by automation in the `product` repo
- Builds the `update_version` binary into a python zip file and uploads it as
an artifact in the `release-automation` pipeline
- Adds `root_dir` as an arg for `update_version`, since automation runs
it on a cloned Ray repo

---------

Signed-off-by: khluu <[email protected]>
#45392)

Avoid pickling LanceFragment when creating read tasks for Lance, as this is expensive.

Signed-off-by: Cheng Su <[email protected]>
…45210)

Make the "Experiment state snapshotting has been triggered multiple..." warning message less confusing, and remove the false-positive log at the end of every run. Also deprecates the legacy `TUNE_RESULT_DIR`,
`RAY_AIR_LOCAL_CACHE_DIR`, and `local_dir` settings.

---------

Signed-off-by: Justin Yu <[email protected]>
Co-authored-by: Cuong Nguyen <[email protected]>
not built or used anywhere anymore

Signed-off-by: Lonnie Liu <[email protected]>
approved by @jjyao 

---------

Signed-off-by: khluu <[email protected]>
Signed-off-by: kevin <[email protected]>
This PR removes several methods from BlockList and LazyBlockList that aren't used anywhere.

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: Cuong Nguyen <[email protected]>
Some minor code cleanup split out from #45450,
so that PR can focus on new changes only.
…45194)

Currently, calling get_runtime_context().get_actor_name() from the driver crashes. With this PR, the call returns None in that case instead.

Signed-off-by: 982945902 <[email protected]>
Co-authored-by: Huaiwei Sun <[email protected]>
Fix compute config for microbenchmark_gpu_unstable.

Closes #45322.

---------

Signed-off-by: Stephanie Wang <[email protected]>
to version 1.14.0

Signed-off-by: Lonnie Liu <[email protected]>
not supported on newer version of bazel

Signed-off-by: Lonnie Liu <[email protected]>
the flag already flipped its default to true in bazel 5.6.x, and it is
removed in bazel 6.x

Signed-off-by: Lonnie Liu <[email protected]>
More recent versions of `jax` (e.g. `0.4.28`) will cause this to fail.

Signed-off-by: Matthew Deng <[email protected]>
to 0.29.37; required for bazel upgrade.

Signed-off-by: Lonnie Liu <[email protected]>
so that we know which archive import it is talking about

Signed-off-by: Lonnie Liu <[email protected]>
The _split_at_index function isn't used anywhere. This PR removes it.

Signed-off-by: Balaji Veeramani <[email protected]>
cleaner to write, and easier to parse

Signed-off-by: Lonnie Liu <[email protected]>
This package is not available for macOS, so skip it on that platform.

Test:
- CI

Signed-off-by: can <[email protected]>
and moving it out, as it is a very fundamental bazel package, not
specific to ray.

Signed-off-by: Lonnie Liu <[email protected]>
dudeperf3ct and others added 30 commits May 29, 2024 00:08
## Why are these changes needed?

Update the experimental feature guide on multi-container deployment
approach for Ray Serve.

## Related issue number

Closes: #45026

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: dudeperf3ct <[email protected]>
…n MultiAgentEnvRunner when sampling whole episodes. (#45617)
for bumping package versions up in the container and dodging CVEs

also upgrade `idna` and add the missing `cupy-cuda11x` package in
constraints.

Signed-off-by: Lonnie Liu <[email protected]>
some packages are declared more than once.

Signed-off-by: Lonnie Liu <[email protected]>
This PR adds multi-arg and kwarg support by serializing all positional
args and kwargs and passing them through the channel. When the channel is
read at runtime, the individual args are extracted first and then passed
to the consuming tasks.
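
A rough sketch of that flow, under stated assumptions: a `queue.Queue` stands in for the DAG channel, `pickle` stands in for whatever serializer is actually used, and `write_inputs`, `read_and_call`, and `consumer` are hypothetical names.

```python
import pickle
import queue

channel: queue.Queue = queue.Queue()  # hypothetical stand-in for the channel


def write_inputs(*args, **kwargs) -> None:
    # Pack everything into one payload so the channel carries a single value.
    channel.put(pickle.dumps((args, kwargs)))


def read_and_call(task):
    # Unpack the payload first, then forward the pieces to the consumer.
    args, kwargs = pickle.loads(channel.get())
    return task(*args, **kwargs)


def consumer(x, y, scale=1):
    return (x + y) * scale


write_inputs(1, 2, scale=10)
output = read_and_call(consumer)
```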

Closes #42793
---------

Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
…ult value is set (#45301)

Currently it's unclear how the default value is set

Signed-off-by: Jiajun Yao <[email protected]>
This code path deletes the release test working directory when the job
completes.

We found repeated cases where users want the data to remain available for
debugging purposes. Let's rely on the S3 policy to clean up the data after a
few days.

Test:
- CI

Signed-off-by: can <[email protected]>
Note that this support has not been removed completely yet; that will happen
once I work on upgrading to Python 3.12.

Need to change some runtime environment to `oss-ci-base_build` since
`forge` is using python 3.8.

Test:
- CI

Signed-off-by: can <[email protected]>
Refactor ResourceManager and avoid it directly depending on concrete
operators.

---------

Signed-off-by: Hao Chen <[email protected]>
… symlinks (#45618)

New env var is called RAY_DASHBOARD_BUILD_FOLLOW_SYMLINKS.

This is an advanced setting that should only be used with special Ray installations
where the dashboard build files are symlinked to a different directory.
This is not recommended for most users and can pose a security risk.
Please reference the aiohttp docs here:
https://docs.aiohttp.org/en/stable/web_reference.html#aiohttp.web.UrlDispatcher.add_static
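
A hedged sketch of how such a setting could be wired up. The env var name comes from the description above; treating the value `"1"` as "enabled" is an assumption, and `follow_symlinks_enabled` and `static_route_kwargs` are hypothetical helpers. `follow_symlinks` itself is a real keyword argument of aiohttp's `UrlDispatcher.add_static`.

```python
import os

ENV_VAR = "RAY_DASHBOARD_BUILD_FOLLOW_SYMLINKS"


def follow_symlinks_enabled(environ=os.environ) -> bool:
    # Assumption: "1" enables the behavior. The description above does not
    # say how the value is parsed. Off by default, since following symlinks
    # can expose files outside the static directory.
    return environ.get(ENV_VAR, "0") == "1"


def static_route_kwargs(environ=os.environ) -> dict:
    # Keyword arguments intended for aiohttp's
    # UrlDispatcher.add_static(prefix, path, *, follow_symlinks=False, ...).
    return {"follow_symlinks": follow_symlinks_enabled(environ)}
```

With aiohttp installed, the result could be passed as `app.router.add_static("/static", build_dir, **static_route_kwargs())`, where `"/static"` and `build_dir` are placeholder values.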
add oss tag to container tests

Add `oss` tag to container tests.


Signed-off-by: Cindy Zhang <[email protected]>

Signed-off-by: Cindy Zhang <[email protected]>
…45217)

This PR adds an example for stable diffusion model fine-tuning and
serving using HPU. Moreover, it also covers how to adapt an existing HPU
example to run on Ray, so that users can use Ray to run the examples on
huggingface/optimum-habana.

---------

Signed-off-by: Zhi Lin <[email protected]>
Signed-off-by: Yunxuan Xiao <[email protected]>
Signed-off-by: Samuel Chan <[email protected]>
Co-authored-by: Yunxuan Xiao <[email protected]>
Co-authored-by: Yunxuan Xiao <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
Co-authored-by: Peyton Murray <[email protected]>
Add keys to a few cheap builds and tests that I noticed failing on
people's PRs, so we can include them in microcheck. These tests are not
covered by the scope of test_in_docker.

Test:
- CI

Signed-off-by: can <[email protected]>
This PR adds telemetry recording for newly added datasources.

Signed-off-by: Cheng Su <[email protected]>
Generated by release-automation bot

---------

Signed-off-by: kevin <[email protected]>
Signed-off-by: khluu <[email protected]>