[DO NOT SUBMIT] diff 2.23 to 2.24 #48263

rynewang · 2024-10-24T22:18:36Z

No description provided.

we use clang now; no gcc Signed-off-by: Lonnie Liu <[email protected]>

not being used anymore. images are built with wanda and `ray_ci` scripts Signed-off-by: Lonnie Liu <[email protected]>

old `bazel_tools` constraints are deprecated. Signed-off-by: Lonnie Liu <[email protected]>

to the latest version, 1.6.1; required to upgrade bazel. old skylib uses platform constraints that are deprecated in newer versions of bazel. Signed-off-by: Lonnie Liu <[email protected]>

…ucing object size (#45309) Signed-off-by: Ruiyang Wang <[email protected]>

Package uploading is CPU-intensive work in the Dashboard, where it collects the whole 500 MiB working_dir and uploads it to the GCS. This can take 30s, during which the Dashboard event loop is blocked. This PR moves the upload to another thread so that the event loop is no longer blocked. This PR also removes a dead reference to gcs_client in http_server_head.py. Signed-off-by: Ruiyang Wang <[email protected]>
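The thread offloading described above can be illustrated with a minimal sketch. It is not the Dashboard's code: `upload_package`, `heartbeat`, and `handle_upload` are hypothetical names, and the blocking work is simulated with `time.sleep`. It only shows that an event loop keeps running other coroutines while a blocking call runs in a worker thread via `asyncio.to_thread` (Python 3.9+).

```python
import asyncio
import hashlib
import time


def upload_package(data: bytes) -> str:
    """Hypothetical stand-in for the blocking packaging and upload step."""
    time.sleep(0.2)  # simulate slow, blocking work
    return hashlib.sha256(data).hexdigest()


async def heartbeat(ticks: list) -> None:
    """Record a tick every 10 ms for as long as the event loop is free."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def handle_upload(data: bytes):
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    # The blocking function runs in a worker thread, so the loop
    # continues to schedule the heartbeat coroutine in the meantime.
    digest = await asyncio.to_thread(upload_package, data)
    beat.cancel()
    return digest, len(ticks)


def run_demo(data: bytes = b"working_dir contents"):
    return asyncio.run(handle_upload(data))
```

Calling `upload_package` directly inside `handle_upload` would instead stall the heartbeat for the whole duration of the call, which is the blocking behavior the commit message describes.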

…passed via NCCL in accelerated DAG (#45332)

This adds support for dynamically sized torch.Tensors to be passed between accelerated DAG nodes via NCCL. Specifically, the following code is now supported, whereas previously `shape` and `dtype` had to be explicitly passed to `TorchTensorType`.

```python
with InputNode() as inp:
    dag = sender.send.bind(inp)
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
```

The feature works by creating a shared memory channel to pass the metadata for the shape and dtype of the tensor. The metadata is then used to create a buffer of the correct size on the NCCL receiver.

Initial microbenchmarks show this adds about 50% throughput overhead compared to statically declaring the shape and dtype, or about 160us/DAG call. This seems a bit higher than expected (see also #45319).

This also adds a few other fixes:
- adds support for reusing actors to create new NCCL groups, which is needed if a DAG is torn down and a new one is created.
- adds a lock to DAG teardown, to prevent the same NCCL group from getting destructed twice.
- User-defined TorchTensorType shape or dtype is now used as a hint for the buffer size, instead of a required size. Since buffers are currently static, an error will be thrown if the user tries to return a too-large tensor.

Part 1 of #45306, will follow up with a separate PR for nested tensors.

---------

Signed-off-by: Stephanie Wang <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Co-authored-by: Kai-Hsun Chen <[email protected]>
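The metadata-then-buffer idea can be sketched without torch or NCCL. The following is an illustration under stated assumptions, not Ray's implementation: two `queue.Queue` objects stand in for the shared memory metadata channel and the data transport, `array.array` stands in for a tensor, and `send`, `recv`, and `TYPECODES` are hypothetical names.

```python
import array
import math
import queue

# Hypothetical dtype table: dtype name -> array typecode.
TYPECODES = {"float32": "f", "float64": "d"}


def send(values, shape, dtype, meta_channel, data_channel):
    """Publish (shape, dtype) first, then the raw bytes."""
    if math.prod(shape) != len(values):
        raise ValueError("shape does not match number of elements")
    meta_channel.put((tuple(shape), dtype))
    data_channel.put(array.array(TYPECODES[dtype], values).tobytes())


def recv(meta_channel, data_channel):
    """Read the metadata, size a buffer from it, then fill the buffer."""
    shape, dtype = meta_channel.get()
    out = array.array(TYPECODES[dtype])
    nbytes = math.prod(shape) * out.itemsize
    buffer = bytearray(nbytes)  # allocated only after the metadata arrives
    payload = data_channel.get()
    if len(payload) != nbytes:
        raise ValueError("payload size does not match metadata")
    buffer[:] = payload
    out.frombytes(bytes(buffer))
    return shape, dtype, out.tolist()
```

The receiver never needs a declared shape up front; it learns the size from the metadata message, at the cost of one extra message per call.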

so that they do not have to execute in sequential order Signed-off-by: Lonnie Liu <[email protected]>

This [commit](0de88e4) added these files into `benchmarks/benchmarks/` directory instead of just `benchmarks/` by accident. This PR moves these files back into just `benchmarks/` directory Signed-off-by: khluu <[email protected]>

… store as artifact (#45363) - This is to use for automation from `product` repo - Builds `update_version` binary into a python zip file and upload it as an artifact in `release-automation` pipeline - Have `root_dir` as an arg for `update_version` since automation is using this on a cloned Ray repo --------- Signed-off-by: khluu <[email protected]>

#45392) Avoid pickling LanceFragment when creating read tasks for Lance, as this is expensive. Signed-off-by: Cheng Su <[email protected]>

…45210) Make the "Experiment state snapshotting has been triggered multiple..." warning message less confusing, and remove the false positive log at the end of every run. Also deprecates the legacy `TUNE_RESULT_DIR`, `RAY_AIR_LOCAL_CACHE_DIR`, and `local_dir` settings. --------- Signed-off-by: Justin Yu <[email protected]> Co-authored-by: Cuong Nguyen <[email protected]>

not built or used anywhere anymore Signed-off-by: Lonnie Liu <[email protected]>

@jjyao

approved by @jjyao --------- Signed-off-by: khluu <[email protected]> Signed-off-by: kevin <[email protected]>

This PR removes several methods from BlockList and LazyBlockList that aren't used anywhere. Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: Cuong Nguyen <[email protected]>

Some minor code cleanup separated from #45450 . To focus that PR more on new changes only.

…45194) Currently calling get_runtime_context().get_actor_name() from driver will crash. Instead of crashing, this PR returns None in this case. Signed-off-by: 982945902 <[email protected]> Co-authored-by: Huaiwei Sun <[email protected]>

Fix compute config for microbenchmark_gpu_unstable. Closes #45322. --------- Signed-off-by: Stephanie Wang <[email protected]>

to version 1.14.0 Signed-off-by: Lonnie Liu <[email protected]>

not supported on newer version of bazel Signed-off-by: Lonnie Liu <[email protected]>

the flag already flipped its default to true in bazel 5.6.x , and it is removed in bazel 6.x Signed-off-by: Lonnie Liu <[email protected]>

fixes https://errorprone.info/bugpattern/DoubleBraceInitialization Signed-off-by: Lonnie Liu <[email protected]>

More recent versions of `jax` (e.g. `0.4.28`) will cause this to fail. Signed-off-by: Matthew Deng <[email protected]>

to 0.29.37; required for bazel upgrade. Signed-off-by: Lonnie Liu <[email protected]>

…eads and skip mixin buffer if not needed. (#45467)

so that we know which archive import it is talking about Signed-off-by: Lonnie Liu <[email protected]>

The _split_at_index function isn't used anywhere. This PR removes it. Signed-off-by: Balaji Veeramani <[email protected]>

cleaner to write, and easier to parse Signed-off-by: Lonnie Liu <[email protected]>

This package is not available for mac, let's skip it on mac platform Test: - CI Signed-off-by: can <[email protected]>

and moving it out, as it is a very fundamental bazel package, not specific to ray. Signed-off-by: Lonnie Liu <[email protected]>

## Why are these changes needed?

Update the experimental feature guide on multi-container deployment approach for Ray Serve.

## Related issue number

Closes: #45026

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: dudeperf3ct <[email protected]>

Signed-off-by: lishuo121 <[email protected]>

…chanism (#45156) Signed-off-by: Cindy Zhang <[email protected]>

…n MultiAgentEnvRunner when sampling whole episodes. (#45617)

for bumping package versions up in the container and dodging CVEs. also upgrades `idna` and adds the missing `cupy-cuda11x` package in constraints. Signed-off-by: Lonnie Liu <[email protected]>

some packages are declared more than once. Signed-off-by: Lonnie Liu <[email protected]>

This PR adds multi-arg and kwarg support by serializing all positional args and kwargs and passing them through the channel. When the channel is read at runtime, the individual args are extracted first before being passed to the consuming tasks. Closes #42793 --------- Signed-off-by: Rui Qiao <[email protected]> Signed-off-by: Rui Qiao <[email protected]>
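A minimal sketch of that pack-then-extract flow, assuming `pickle` as the serializer and a plain bytes payload in place of the real channel. The helper names `pack_call`, `unpack_call`, and `select_arg` are hypothetical and are not Ray APIs.

```python
import pickle


def pack_call(*args, **kwargs) -> bytes:
    """Serialize all positional args and kwargs into one payload."""
    return pickle.dumps((args, kwargs))


def unpack_call(payload: bytes):
    """Recover the (args, kwargs) pair written by pack_call."""
    args, kwargs = pickle.loads(payload)
    return args, kwargs


def select_arg(payload: bytes, key):
    """Extract one input: an int key picks a positional arg, a str key a kwarg."""
    args, kwargs = unpack_call(payload)
    return args[key] if isinstance(key, int) else kwargs[key]
```

For example, a payload built with `pack_call(1, "a", scale=2.5)` lets one consumer read `select_arg(payload, 0)` while another reads `select_arg(payload, "scale")`, so each task receives only the input it was bound to.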

…ult value is set (#45301) Currently it's unclear how the default value is set Signed-off-by: Jiajun Yao <[email protected]>

This code path deletes the release test working directory upon job completion. We found repeated cases where users want the data to be available for debugging purposes. Let's rely on the s3 policy to clean up the data after a few days. Test: - CI Signed-off-by: can <[email protected]>

Note that this support has not been removed completely yet; that will follow once I work on upgrading to python 3.12. Some runtime environments need to change to `oss-ci-base_build` since `forge` is using python 3.8. Test: - CI Signed-off-by: can <[email protected]>

Refactor ResourceManager and avoid it directly depending on concrete operators. --------- Signed-off-by: Hao Chen <[email protected]>

Signed-off-by: Rui Qiao <[email protected]>

… symlinks (#45618) New env var is called RAY_DASHBOARD_BUILD_FOLLOW_SYMLINKS. This is an advanced setting that should only be used with special Ray installations where the dashboard build files are symlinked to a different directory. This is not recommended for most users and can pose a security risk. Please reference the aiohttp docs here: https://docs.aiohttp.org/en/stable/web_reference.html#aiohttp.web.UrlDispatcher.add_static
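As a rough illustration of how such a setting could be wired up, the sketch below reads the environment variable and builds keyword arguments for aiohttp's `add_static`, which accepts a `follow_symlinks` flag that defaults to `False`. The accepted values (`"1"`, `"true"`), the `/static` prefix, and both helper names are assumptions made for this example; the commit message does not state how Ray parses the variable.

```python
import os

ENV_VAR = "RAY_DASHBOARD_BUILD_FOLLOW_SYMLINKS"


def follow_symlinks_enabled(environ=None) -> bool:
    """Return True only when the variable is explicitly switched on.

    Assumption: "1" and "true" (case-insensitive) count as enabled.
    """
    environ = os.environ if environ is None else environ
    return environ.get(ENV_VAR, "0").strip().lower() in ("1", "true")


def static_route_kwargs(build_dir: str, environ=None) -> dict:
    """Build kwargs intended for app.router.add_static(**kwargs) in aiohttp.

    The "/static" prefix is a placeholder chosen for this sketch.
    """
    return {
        "prefix": "/static",
        "path": build_dir,
        "follow_symlinks": follow_symlinks_enabled(environ),
    }
```

Keeping the default off matches the warning above: following symlinks lets the static route serve files outside the build directory.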

)

Signed-off-by: hongchaodeng <[email protected]>

Add `oss` tag to container tests. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: Cindy Zhang <[email protected]>

…45217) This PR adds an example for stable diffusion model fine-tuning and serving using HPU. Moreover, it also covers how to adapt an existing HPU example to run on Ray, so that users can use Ray to run the examples on huggingface/optimum-habana. --------- Signed-off-by: Zhi Lin <[email protected]> Signed-off-by: Yunxuan Xiao <[email protected]> Signed-off-by: Samuel Chan <[email protected]> Co-authored-by: Yunxuan Xiao <[email protected]> Co-authored-by: Yunxuan Xiao <[email protected]> Co-authored-by: Samuel Chan <[email protected]> Co-authored-by: Peyton Murray <[email protected]>

…onGroup` (#45523) Signed-off-by: Yang, Bo <[email protected]>

Signed-off-by: Rui Qiao <[email protected]>

…45614)

) Signed-off-by: hejialing.hjl <[email protected]>

…arnerGroup.update_from_batch()`. (#45419)

Add keys to a few cheap builds and tests that I noticed failed on people's PR so we can include them in microcheck. These tests are not covered in the scope of test_in_docker. Test: - CI Signed-off-by: can <[email protected]>

Signed-off-by: Jiajun Yao <[email protected]>

This PR is to add the telemetry recording for newly added datasources. Signed-off-by: Cheng Su <[email protected]>

Signed-off-by: Rui Qiao <[email protected]>

…/2.24.0 fast forward

Generated by release-automation bot --------- Signed-off-by: kevin <[email protected]> Signed-off-by: khluu <[email protected]>

aslonnie and others added 30 commits May 17, 2024 22:46

[cpp] stop mentioning gcc (#45428)

a96866e

we use clang now; no gcc Signed-off-by: Lonnie Liu <[email protected]>

[ci] remove old docker image building logic (#45429)

4db8b1c

not being used anymore. images are built with wanda and `ray_ci` scripts Signed-off-by: Lonnie Liu <[email protected]>

[bazel] use new platform constraint (#45424)

be2272d

old `bazel_tools` constraints are deprecated. Signed-off-by: Lonnie Liu <[email protected]>

[bazel] upgrade skylib (#45435)

3f5aa5c

to the latest version, 1.6.1; required to upgrade bazel. old skylib uses platform constraints that are deprecated in newer versions of bazel. Signed-off-by: Lonnie Liu <[email protected]>

[core] Deflake windows://python/ray/tests:test_get_locations by red…

74fc9be

…ucing object size (#45309) Signed-off-by: Ruiyang Wang <[email protected]>

[ci] add depends_on for wheels steps (#45425)

6ae3f8c

so that they do not have to execute in sequential order Signed-off-by: Lonnie Liu <[email protected]>

[Data] Avoid pickling LanceFragment when creating read tasks for Lance (

e2028e0

#45392) Avoid pickling LanceFragment when creating read tasks for Lance, as this is expensive. Signed-off-by: Cheng Su <[email protected]>

[docker] remove worker container (#45447)

fbcd106

not built or used anywhere anymore Signed-off-by: Lonnie Liu <[email protected]>

Add perf metrics for 2.23.0 (#45443)

9eb2ce7

approved by @jjyao --------- Signed-off-by: khluu <[email protected]> Signed-off-by: kevin <[email protected]>

[Data] Remove some dead code from BlockList (#45398)

348145c

This PR removes several methods from BlockList and LazyBlockList that aren't used anywhere. Signed-off-by: Balaji Veeramani <[email protected]> Co-authored-by: Cuong Nguyen <[email protected]>

[core] chaos test code cleanup (#45450)

56c1df5

Some minor code cleanup separated from #45450 . To focus that PR more on new changes only.

[core][experimental] Fix GPU microbenchmark (#45426)

ab2b442

Fix compute config for microbenchmark_gpu_unstable. Closes #45322. --------- Signed-off-by: Stephanie Wang <[email protected]>

[bazel] upgrade googletest package (#45449)

41a2b4b

to version 1.14.0 Signed-off-by: Lonnie Liu <[email protected]>

[bazel] remove flag that has no use (#45455)

6c2424c

not supported on newer version of bazel Signed-off-by: Lonnie Liu <[email protected]>

[bazel] remove incompatible_linkopts_to_linklibs flag (#45457)

592b5da

the flag already flipped its default to true in bazel 5.6.x , and it is removed in bazel 6.x Signed-off-by: Lonnie Liu <[email protected]>

[java] fix double bracket init (#45460)

7767457

fixes https://errorprone.info/bugpattern/DoubleBraceInitialization Signed-off-by: Lonnie Liu <[email protected]>

[train] Pin jax for Dreambooth Fine-Tuning template (#45389)

f403087

More recent versions of `jax` (e.g. `0.4.28`) will cause this to fail. Signed-off-by: Matthew Deng <[email protected]>

[bazel] upgrade cython package (#45436)

fab4bcb

to 0.29.37; required for bazel upgrade. Signed-off-by: Lonnie Liu <[email protected]>

[RLlib] Add option for APPO/IMPALA to change number of GPU-loader thr…

2f5d9c7

…eads and skip mixin buffer if not needed. (#45467)

[bazel] add name in auto_http_archive error msg (#45471)

ac80772

so that we know which archive import it is talking about Signed-off-by: Lonnie Liu <[email protected]>

[Data] Remove unused _split_at_index (#45481)

cd3d3b7

The _split_at_index function isn't used anywhere. This PR removes it. Signed-off-by: Balaji Veeramani <[email protected]>

[bazel] use .bazelversion file (#45476)

7d516c2

cleaner to write, and easier to parse Signed-off-by: Lonnie Liu <[email protected]>

[ci] fix mac build (#45482)

da9946a

This package is not available for mac, let's skip it on mac platform Test: - CI Signed-off-by: can <[email protected]>

[bazel] upgrade platforms to 0.0.9 (#45470)

6a8997c

and moving it out, as it is a very fundamental bazel package, not specific to ray. Signed-off-by: Lonnie Liu <[email protected]>

dudeperf3ct and others added 30 commits May 29, 2024 00:08

[Core] Remove duplicate included header (#45406)

1719a8f

Signed-off-by: lishuo121 <[email protected]>

[runtime_env] unify container and (new) image_uri under plugin me…

a84a1b2

…chanism (#45156) Signed-off-by: Cindy Zhang <[email protected]>

[RLlib] Fix wrong env being passed into on_episode_end callback o…

3f29274

…n MultiAgentEnvRunner when sampling whole episodes. (#45617)

[ci] add security requirements constraint (#45616)

14bf327

for bumping package versions up in the container and dodging CVEs. also upgrades `idna` and adds the missing `cupy-cuda11x` package in constraints. Signed-off-by: Lonnie Liu <[email protected]>

[deps] remove duplicated lines in requirements.txt (#45615)

75161ab

some packages are declared more than once. Signed-off-by: Lonnie Liu <[email protected]>

[Core] Improve doc for --object-store-memory to describe how the defa…

d76518f

…ult value is set (#45301) Currently it's unclear how the default value is set Signed-off-by: Jiajun Yao <[email protected]>

Fail gracefully when cluster status is not available yet. (#45620)

0bb2600

[data] Refactor resource manager (#45623)

5faf476

Refactor ResourceManager and avoid it directly depending on concrete operators. --------- Signed-off-by: Hao Chen <[email protected]>

[Core] Fix race condition in setting node death info (#45619)

f2dfc37

Signed-off-by: Rui Qiao <[email protected]>

[RLlib] Complete do-over of RLlib release tests (new API stack). (#45589

c94140a

)

[core] add EC2InstanceTerminator and refactor killer creation (#45630)

e528cb0

Signed-off-by: hongchaodeng <[email protected]>

add oss tag to container tests (#45629)

ff3e393

Add `oss` tag to container tests. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: Cindy Zhang <[email protected]>

[Core] Handle TypeError when RayTaskError.cause is a `BaseExcepti…

3240167

…onGroup` (#45523) Signed-off-by: Yang, Bo <[email protected]>

[Core] Fix worker column off-by-one in dashboard (#45648)

b73e037

Signed-off-by: Rui Qiao <[email protected]>

[RLlib] Fix bug: Target nets are not synched with main nets in SAC. (#…

d6f97cc

…45614)

[Core] Fix the GIL deadlock issue caused by list_named_actors. (#45582

f9ab439

) Signed-off-by: hejialing.hjl <[email protected]>

[RLlib] DreamerV3 on tf: Fix bug w/ reduce_fn still passed into `Le…

a95ec7f

…arnerGroup.update_from_batch()`. (#45419)

[Core] Ray c++ backend structured logging (#44468)

a30630a

Signed-off-by: Jiajun Yao <[email protected]>

[Data] Record more telemetry for newly added datasources (#45647)

3c9edf1

This PR is to add the telemetry recording for newly added datasources. Signed-off-by: Cheng Su <[email protected]>

[Core] Expose NodeDeathInfo in state CLI (#45644)

fe191e6

Signed-off-by: Rui Qiao <[email protected]>

[Core] Expose NodeDeathInfo in ActorDiedError (#45497)

7021b10

Signed-off-by: Rui Qiao <[email protected]>

Merge commit '7021b10356069cf424556f1a5683c5f270a87e5b' into releases…

cfea8b2

…/2.24.0 fast forward

[release] Update Docker dependencies for 2.24.0 (#45788)

f18654a

Generated by release-automation bot --------- Signed-off-by: kevin <[email protected]> Signed-off-by: khluu <[email protected]>
