Releases: ray-project/ray
Ray-2.31.0
Ray Libraries
Ray Data
🔨 Fixes:
- Fixed bug where `preserve_order` doesn't work with file reads (#46135)
📖 Documentation:
- Added documentation for `dataset.Schema` (#46170)
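A minimal sketch of inspecting the newly documented schema object; the toy dataset is illustrative.

```python
import ray

ds = ray.data.from_items([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}])
schema = ds.schema()      # returns a ray.data.Schema
print(schema.names)       # column names, e.g. ['x', 'y']
print(schema.types)       # corresponding column types
```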
Ray Train
💫 Enhancements:
- Add API for Ray Train run stats (#45711)
Ray Tune
💫 Enhancements:
- Missing stopping criterion should not error (just warn). (#45613)
📖 Documentation:
- Fix broken references in Ray Tune documentation (#45233)
Ray Serve
WARNING: the following default values will change in Ray 2.32:
- Default for `max_ongoing_requests` will change from 100 to 5.
- Default for `target_ongoing_requests` will change from 1 to 2.
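To keep behavior stable across the upcoming default change, a deployment can pin both values explicitly. A minimal sketch, with a placeholder deployment body; the values shown are today's (2.31) defaults.

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=100,  # pin today's default before it drops to 5 in Ray 2.32
    autoscaling_config={"target_ongoing_requests": 1},  # pin before the default rises to 2
)
class MyModel:
    def __call__(self, request):
        return "ok"

app = MyModel.bind()
```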
💫 Enhancements:
- Optimize DeploymentStateManager.get_deployment_statuses (#45872)
🔨 Fixes:
- Fix logging error on passing traceback object into exc_info (#46105)
- Run del even if constructor is still in-progress (#45882)
- Spread replicas with custom resources in torch tune serve release test (#46093)
- [1k release test] don't run replicas on head node (#46130)
📖 Documentation:
- Remove todo since issue is fixed (#45941)
RLlib
🎉 New Features:
- IMPALA runs on the new API stack (with EnvRunners and ConnectorV2s). (#42085)
- SAC/DQN: Prioritized multi-agent episode replay buffer. (#45576)
💫 Enhancements:
- New API stack stability: Add systematic CI learning tests for all possible combinations of: [PPO|IMPALA] + [1CPU|2CPU|1GPU|2GPU] + [single-agent|multi-agent]. (#46162, #46161)
📖 Documentation:
- New API stack: Example script for action masking (#46146)
- New API stack: PyFlight example script cleanup (#45956)
- Old API stack: Enhanced ONNX example (+LSTM). (#43592)
Ray Core and Ray Clusters
Ray Core
💫 Enhancements:
- [runtime-env] automatically infer worker path when starting worker in container (#42304)
🔨 Fixes:
- On GCS restart, destroy rather than forget the unused workers, fixing PG leaks. (#45854)
- Cancel lease requests before returning a PG bundle (#45919)
- Fix boost fiber stack overflow (#46133)
Thanks
Many thanks to all those who contributed to this release!
@jjyao, @kevin85421, @vincent-pli, @khluu, @simonsays1980, @sven1977, @rynewang, @can-anyscale, @richardsliu, @jackhumphries, @alexeykudinkin, @bveeramani, @ruisearch42, @shrekris-anyscale, @stephanie-wang, @matthewdeng, @zcin, @hongchaodeng, @ryanaoleary, @liuxsh9, @GeneDer, @aslonnie, @peytondmurray, @Bye-legumes, @woshiyyya, @scottjlee, @JoshKarpel
Ray-2.30.0
Ray Libraries
Ray Data
💫 Enhancements:
- Improve fractional CPU/GPU formatting (#45673)
- Use sampled fragments to estimate Parquet reader batch size (#45749)
- Refactoring ParquetDatasource and metadata fetching logic (#45728, #45727, #45733, #45734, #45767)
- Refactor planner.py (#45706)
Ray Tune
💫 Enhancements:
- Change the behavior of a missing stopping criterion metric to warn instead of raising an error. This enables the use case of reporting different sets of metrics on different iterations (e.g., a separate set of training and validation metrics). (#45613)
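A minimal sketch of the use case this change enables: a trainable that only reports validation metrics on some iterations. With this release, the stop criterion key being absent on training-only iterations produces a warning instead of an error. The metric names and values are illustrative.

```python
from ray import train, tune

def trainable(config):
    for step in range(10):
        metrics = {"train_loss": 1.0 / (step + 1)}
        if step % 5 == 4:                      # validation runs only every 5th step
            metrics["val_loss"] = 1.2 / (step + 1)
        train.report(metrics)                  # "val_loss" is missing on most steps

tuner = tune.Tuner(
    trainable,
    run_config=train.RunConfig(stop={"val_loss": 0.3}),  # criterion metric may be absent
)
tuner.fit()
```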
Ray Serve
💫 Enhancements:
- Create internal request id to track request objects (#45761)
RLlib
💫 Enhancements:
- Stability: DreamerV3 weekly release test (#45654); Add "official" benchmark script for Atari PPO benchmarks. (#45697)
- Enhance env-rendering callback (#45682)
🔨 Fixes:
- Bug fix in new MetricsLogger API: EMA stats w/o window would lead to infinite list mem-leak. (#45752)
- Various other bug fixes: (#45819, #45820, #45683, #45651, #45753)
📖 Documentation:
Ray Core
🎉 New Features:
- Alpha release of job-level logging configuration: users can now configure user logging to use the logfmt format with logging context attached. (#45344)
💫 Enhancements:
- Integrate amdsmi in AMDAcceleratorManager (#44572)
🔨 Fixes:
- Fix the C++ GcsClient Del not respecting del_by_prefix (#45604)
- Fix exit handling of FiberState threads (#45834)
Dashboard
💫 Enhancements:
- Parse out json logs (#45853)
Many thanks to all those who contributed to this release: @liuxsh9, @peytondmurray, @pcmoritz, @GeneDer, @saihaj, @khluu, @aslonnie, @yucai, @vickytsang, @can-anyscale, @bthananjeyan, @raulchen, @hongchaodeng, @x13n, @simonsays1980, @peterghaddad, @kevin85421, @rynewang, @angelinalg, @jjyao, @BenWilson2, @jackhumphries, @zcin, @chris-ray-zhang, @c21, @shrekris-anyscale, @alanwguo, @stephanie-wang, @Bye-legumes, @sven1977, @WeichenXu123, @bveeramani, @nikitavemuri
Ray-2.24.0
Ray Libraries
Ray Data
🎉 New Features:
- Allow user to configure timeout for actor pool (#45508)
- Add override_num_blocks to from_pandas and perform auto-partition (#44937)
- Upgrade Arrow version to 16 in CI (#45565)
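A hedged sketch of the new `override_num_blocks` argument on `from_pandas`, per the feature above; the DataFrame and block count are illustrative.

```python
import pandas as pd
import ray

df = pd.DataFrame({"x": range(1_000)})
# split the input into 4 blocks instead of a single block
ds = ray.data.from_pandas(df, override_num_blocks=4)
print(ds.count())
```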
💫 Enhancements:
- Clarify that num_rows_per_file isn't strict (#45529)
- Record more telemetry for newly added datasources (#45647)
- Avoid pickling LanceFragment when creating read tasks for Lance (#45392)
Ray Train
📖 Documentation:
- [HPU] Add example of Stable Diffusion fine-tuning and serving on Intel Gaudi (#45217)
- [HPU] Add example of Llama-2 fine-tuning on Intel Gaudi (#44667)
Ray Tune
🏗 Architecture refactoring:
- Improve excessive syncing warning and deprecate TUNE_RESULT_DIR, RAY_AIR_LOCAL_CACHE_DIR, local_dir (#45210)
Ray Serve
💫 Enhancements:
- Clean up Serve proxy files (#45486)
📖 Documentation:
- vLLM example to serve LLM models (#45430)
RLlib
💫 Enhancements:
- DreamerV3 on tf: Bug fix so it can run again with tf==2.11.1 (2.11.0 is no longer available) (#45419); added a weekly release test for DreamerV3.
- Added support for multi-agent off-policy algorithms (DQN and SAC) in the new API stack (#45182)
- Config option for APPO/IMPALA to change number of GPU-loader threads (#45467)
🔨 Fixes:
- Various MetricsLogger bug fixes (#45543, #45585, #45575)
- Other fixes: #45588, #45617, #45517, #45465
📖 Documentation:
- Example script for new API stack: How-to restore 1 of n agents from a checkpoint. (#45462)
- Example script for new API stack: Autoregressive action module. #45525
Ray Core
💫 Enhancements:
- Improve node death observability (#45320, #45357, #45533, #45644, #45497)
- Ray c++ backend structured logging (#44468)
🔨 Fixes:
- Fix worker crash when getting actor name from runtime context (#45194)
- log dedup should not dedup number only lines (#45385)
📖 Documentation:
- Improve doc for `--object-store-memory` to describe how the default value is set (#45301)
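For reference, a minimal sketch of setting the value explicitly from Python instead of relying on the default described in the doc above; the 2 GiB figure is illustrative (the CLI flag is typically passed to `ray start`).

```python
import ray

# override the object store size in bytes rather than using the computed default
ray.init(object_store_memory=2 * 1024**3)
```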
Dashboard
🔨 Fixes:
- Move Job package uploading to another thread to unblock the event loop. (#45282)
Many thanks to all those who contributed to this release: @maxliuofficial, @simonsays1980, @GeneDer, @dudeperf3ct, @khluu, @justinvyu, @andrewsykim, @Catch-Bull, @zcin, @bveeramani, @rynewang, @angelinalg, @matthewdeng, @jjyao, @kira-lin, @harborn, @hongchaodeng, @peytondmurray, @aslonnie, @timkpaine, @982945902, @maxpumperla, @stephanie-wang, @ruisearch42, @alanwguo, @can-anyscale, @c21, @Atry, @KamenShah, @sven1977, @raulchen
Ray-2.23.0
Ray Libraries
Ray Data
🎉 New Features:
- Add support for using GPUs with map_groups (#45305)
- Add support for using actors with map_groups (#45310)
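A heavily hedged sketch of how the `map_groups` additions above might be used, assuming `map_groups` forwards `num_gpus` like other map APIs (actor support was added separately); the grouping column and per-group function are illustrative.

```python
import ray

ds = ray.data.from_items([{"group": i % 2, "value": float(i)} for i in range(10)])

def center(batch):
    # runs once per group; a GPU model could be applied here instead
    batch["value"] = batch["value"] - batch["value"].mean()
    return batch

# assumption: num_gpus is accepted and forwarded as a remote arg for the group tasks
out = ds.groupby("group").map_groups(center, batch_format="pandas", num_gpus=1)
```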
💫 Enhancements:
- Refine exception handling from arrow data conversion (#45294)
🔨 Fixes:
- Fix Ray databricks UC reader with dynamic Databricks notebook scope token (#45153)
- Fix bug where you can't return both objects and arrays from a UDF (#45287)
- Fix bug where map_groups triggers execution during input validation (#45314)
Ray Tune
🔨 Fixes:
- [tune] Fix PB2 scheduler error resulting from trying to sort by Trial objects (#45161)
Ray Serve
🔨 Fixes:
- Log application unhealthy errors at error level instead of warning level (#45211)
RLlib
💫 Enhancements:
- Examples and `tuned_examples` learning tests for the new API stack are now "self-executable" (they no longer require a third-party script to run). + WandB support. (#45023)
🔨 Fixes:
- Fix result dict “spam” (duplicate, deprecated keys, e.g. “sampler_results” dumped into top level). (#45330)
📖 Documentation:
- Add example for training with fractional GPUs on new API stack. (#45379)
- Cleanup examples folder and remove deprecated sub directories. (#45327)
Ray Core
💫 Enhancements:
- [Logs] Add runtime env started logs to job driver (#45255)
- `ray.util.collective` support for `torch.bfloat16` (#39845)
- [Core] Better propagate node death information (#45128)
🔨 Fixes:
- [Core] Fix worker process leaks after job finishes (#44214)
Many thanks to all those who contributed to this release: @hongchaodeng, @khluu, @antoni-jamiolkowski, @ameroyer, @bveeramani, @can-anyscale, @WeichenXu123, @peytondmurray, @jackhumphries, @kevin85421, @jjyao, @robcaulk, @rynewang, @scottsun94, @swang, @GeneDer, @zcin, @ruisearch42, @aslonnie, @angelinalg, @raulchen, @ArthurBook, @sven1977, @wuxibin89
Ray-2.22.0
Ray Libraries
Ray Data
🎉 New Features:
- Add function to dynamically generate `ray_remote_args` for Map APIs (#45143)
- Allow manually setting resource limits for training jobs (#45188)
💫 Enhancements:
- Introduce abstract interface for data autoscaling (#45002)
- Add debugging info for `SplitCoordinator` (#45226)
🔨 Fixes:
- Don't show `AllToAllOperator` progress bar if the disable flag is set (#45136)
- Don't load Arrow `PyExtensionType` by default (#45084)
- Don't raise batch size error if `num_gpus=0` (#45202)
Ray Train
💫 Enhancements:
- [XGBoost][LightGBM] Update RayTrainReportCallback to only save checkpoints on rank 0 (#45083)
Ray Core
🔨 Fixes:
- Fix the cpu percentage metrics for dashboard process (#45124)
Dashboard
💫 Enhancements:
- Improvements to log viewer so line numbers do not get selected when copying text.
- Improvements to the log viewer to avoid unnecessary re-rendering which causes text selection to clear.
Many thanks to all those who contributed to this release: @justinvyu, @simonsays1980, @chris-ray-zhang, @kevin85421, @angelinalg, @rynewang, @brycehuang30, @alanwguo, @jjyao, @shaikhismail, @khluu, @can-anyscale, @bveeramani, @jrosti, @WeichenXu123, @MortalHappiness, @raulchen, @scottjlee, @ruisearch42, @aslonnie, @alexeykudinkin
Ray-2.21.0
Ray Libraries
Ray Data
🎉 New Features:
- Add `read_lance` API to read Lance Dataset (#45106)
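A hedged usage sketch for the new `read_lance` API, assuming only a dataset URI is required; the path points to a hypothetical Lance dataset created elsewhere.

```python
import ray

# assumption: a Lance dataset already exists at this local path
ds = ray.data.read_lance("/tmp/example.lance")
print(ds.schema())
print(ds.take(3))
```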
🔨 Fixes:
- Retry RaySystemError application errors (#45079)
📖 Documentation:
- Fix broken references in data documentation (#44956)
Ray Train
📖 Documentation:
- Fix broken links in Train documentation (#44953)
Ray Tune
📖 Documentation:
- Update Hugging Face example to add reference (#42771)
🏗 Architecture refactoring:
- Remove deprecated `ray.air.callbacks` modules (#45104)
Ray Serve
💫 Enhancements:
- Allow methods decorated with @serve.batch to pass type hints (#45004)
- Allow configuring Serve control loop interval (#45063)
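A hedged sketch of a batched method carrying type hints, in the spirit of the @serve.batch enhancement above; the deployment class, batch size, and timeout are illustrative.

```python
from typing import List

from ray import serve

@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def predict(self, inputs: List[float]) -> List[float]:
        # the decorator gathers individual calls into one list and expects a list back
        return [x * 2 for x in inputs]

    async def __call__(self, request):
        # a single call is transparently grouped into a batch with concurrent callers
        return await self.predict(1.0)

app = BatchedModel.bind()
```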
🔨 Fixes:
- Fix bug with controller failing to recover for autoscaling deployments (#45118)
- Fix issue where Ctrl-C after `serve run` doesn't shut down Serve components (#45087)
- Fix lightweight update of max ongoing requests (#45006)
RLlib
🎉 New Features:
- New MetricsLogger API now fully functional on the new API stack (working now also inside Learner classes, i.e. loss functions). (#44995, #45109)
💫 Enhancements:
- Renamings and cleanups (toward new API stack and more consistent naming schemata): WorkerSet -> EnvRunnerGroup, DEFAULT_POLICY_ID -> DEFAULT_MODULE_ID, config.rollouts() -> config.env_runners(), etc.. (#45022, #44920)
- Changed behavior of `EnvRunnerGroup.foreach_worker…` methods to new defaults: `mark_healthy=True` (used to be False) and `healthy_only=True` (used to be False). (#44993)
- Fix `get_state()/from_state()` methods in SingleAgent- and MultiAgentEpisodes. (#45012)
🔨 Fixes:
- Bug fix for (torch) global_norm clipping overflow problem: (#45055)
- Various bug- and test case fixes: #45030, #45031, #45070, #45053, #45110, #45111
📖 Documentation:
Ray Core
🔨 Fixes:
- Fix `ray.init(logging_format)` argument being ignored (#45037)
- Handle unserializable user exception (#44878)
- Fix dashboard process event loop blocking issues (#45048, #45047)
Dashboard
🔨 Fixes:
- Fix Nodes page sorting not working correctly.
- Add back “actors per page” UI control in the actors page.
Many thanks to all those who contributed to this release: @rynewang, @can-anyscale, @scottsun94, @bveeramani, @ceddy4395, @GeneDer, @zcin, @JoshKarpel, @nikitavemuri, @stephanie-wang, @jackhumphries, @matthewdeng, @yash97, @simonsays1980, @peytondmurray, @evalaiyc98, @c21, @alanwguo, @shrekris-anyscale, @kevin85421, @hongchaodeng, @sven1977, @st--, @khluu
Ray-2.20.0
Ray Libraries
Ray Data
💫 Enhancements:
- Dedupe repeated schema during `ParquetDatasource` metadata prefetching (#44750)
- Update `map_groups` implementation to better handle large outputs (#44862)
- Deprecate `prefetch_batches` arg of `iter_rows` and change default value (#44982)
- Default to not creating directories on S3 writes (#44972)
- Make internal UDF names more descriptive (#44985)
- Make `name` a required argument for `AggregateFn` (#44880)
📖 Documentation:
- Add key concepts to and revise "Data Internals" page (#44751)
Ray Train
💫 Enhancements:
- Setup XGBoost `CommunicatorContext` automatically (#44883)
- Track Train Run Info with `TrainStateActor` (#44585)
📖 Documentation:
Ray Tune
💫 Enhancements:
- Remove trial table when running Ray Train in a Jupyter notebook (#44858)
- Clean up temporary checkpoint directories for class Trainables (ex: RLlib) (#44366)
📖 Documentation:
Ray Serve
💫 Enhancements:
- The handle metric push interval is now configurable via the RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S environment variable (#32920)
- Improve performance of developer API serve.get_app_handle (#44812)
🔨 Fixes:
- Fix memory leak in handles for autoscaling deployments (the leak happens when RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=1) (#44877)
RLlib
🎉 New Features:
- Introduce `MetricsLogger`, a unified API for users of RLlib to log custom metrics and stats in all of RLlib's components (Algorithm, EnvRunners, and Learners). Rolled out for new API stack for Algorithm (`training_step`) and EnvRunners (custom callbacks). `Learner` (custom loss functions) support in progress. #44888, #44442 (see the sketch below)
- Introduce "inference-only" (slim) mode for RLModules that run inside an EnvRunner (and thus don't require value-functions or target networks): #44797
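A heavily hedged sketch of logging a custom metric through the new `MetricsLogger` from an EnvRunner callback; the exact callback hook and `log_value` arguments are assumptions based on the description above, and the callback would be wired in via `config.callbacks(...)` on a new-API-stack algorithm.

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks

class CustomMetricsCallback(DefaultCallbacks):
    # On the new API stack, callbacks receive the MetricsLogger of the component
    # they run inside (here: an EnvRunner).
    def on_episode_end(self, *, episode, metrics_logger=None, **kwargs):
        if metrics_logger is not None:
            # assumption: log_value aggregates the value with the given reduce mode
            metrics_logger.log_value("custom/episode_len", len(episode), reduce="mean")
```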
💫 Enhancements:
- MultiAgentEpisodeReplayBuffer for new API stack (preparation for multi-agent support of SAC and DQN): #44450
- AlgorithmConfig cleanup and renaming of properties and methods for better consistency/transparency: #44896
🔨 Fixes:
Ray Core and Ray Clusters
💫 Enhancements:
- Report GCS internal pubsub buffer metrics and cap message size (#44749)
🔨 Fixes:
- Fix task submission never return when network partition happens (#44692)
- Fix incorrect use of ssh port forward option. (#44973)
- Make sure dashboard will exit if grpc server fails (#44928)
- Make sure dashboard agent will exit if grpc server fails (#44899)
Thanks @can-anyscale, @hongchaodeng, @zcin, @marwan116, @khluu, @bewestphal, @scottjlee, @andrewsykim, @anyscalesam, @MortalHappiness, @justinvyu, @JoshKarpel, @woshiyyya, @rynewang, @Abirdcfly, @omatthew98, @sven1977, @marcelocarmona, @rueian, @mattip, @angelinalg, @aslonnie, @matthewdeng, @abizjakpro, @simonsays1980, @jjyao, @terraflops1048576, @hongpeng-guo, @stephanie-wang, @bw-matthew, @bveeramani, @ruisearch42, @kevin85421, @Tongruizhe
Many thanks to all those who contributed to this release!
Ray-2.12.0
Ray Libraries
Ray Data
🎉 New Features:
- Store Ray Data logs in special subdirectory (#44743)
💫 Enhancements:
- Add in `local_read` option to `from_torch` (#44752)
🔨 Fixes:
- Fix the config to disable progress bar (#44342)
📖 Documentation:
- Clarify deprecated Datasource docstrings (#44790)
Ray Train
🔨 Fixes:
- Disable gathering the full state dict in `RayFSDPStrategy` for `lightning>2.1` (#44569)
Ray Tune
💫 Enhancements:
Ray Serve
🔨 Fixes:
- [Serve] fix getting attributes on stdout during Serve logging redirect (#44787)
RLlib
🎉 New Features:
- Support of images and video logging in WandB (env rendering example script for the new API stack coming up). (#43356)
💫 Enhancements:
- Better support and separation-of-concerns for `model_config_dict` in new API stack. (#44263)
- Added example script to pre-train an `RLModule` in single-agent fashion, then bring checkpoint into multi-agent setup and continue training. (#44674)
- More `examples` scripts got translated from the old to the new API stack: curriculum learning, custom gym env, etc.: (#44706, #44707, #44735, #44841)
Ray Core and Ray Clusters
🔨 Fixes:
- Fix GetAllJobInfo `is_running_tasks` not returning the correct value when the driver starts Ray (#44459)
Thanks
Many thanks to all those who contributed to this release!
@can-anyscale, @hongpeng-guo, @sven1977, @zcin, @shrekris-anyscale, @liuxsh9, @jackhumphries, @GeneDer, @woshiyyya, @simonsays1980, @omatthew98, @andrewsykim, @n30111, @architkulkarni, @bveeramani, @aslonnie, @alexeykudinkin, @WeichenXu123, @rynewang, @matthewdeng, @angelinalg, @c21
Ray-2.11.0
Release Highlights
- [data] Support reading Avro files with `ray.data.read_avro`
- [train] Added experimental support for AWS Trainium (Neuron) and Intel HPU.
Ray Libraries
Ray Data
🎉 New Features:
- Support reading Avro files with `ray.data.read_avro` (#43663)
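A minimal usage sketch for the new Avro reader; the S3 prefix is hypothetical and assumed to contain one or more `.avro` files.

```python
import ray

ds = ray.data.read_avro("s3://my-bucket/events/")
ds.show(3)
```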
💫 Enhancements:
- Pin `ipywidgets==7.7.2` to enable Data progress bars in VSCode Web (#44398)
- Change log level for ignored exceptions (#44408)
🔨 Fixes:
- Change Parquet encoding ratio lower bound from 2 to 1 (#44470)
- Fix throughput time calculations for metrics (#44138)
- Fix nested ragged `numpy.ndarray` (#44236)
- Fix Ray debugger incompatibility caused by trimmed error stack trace (#44496)
📖 Documentation:
- Update "Data Loading and Preprocessing" doc (#44165)
- Move imports into `TFPredictor` in batch inference example (#44434)
Ray Train
🎉 New Features:
- Add experimental support for AWS Trainium (Neuron) (#39130)
- Add experimental support for Intel HPU (#43343)
💫 Enhancements:
- Log a deprecation warning for local_dir and related environment variables (#44029)
- Enforce xgboost>=1.7 for XGBoostTrainer usage (#44269)
🔨 Fixes:
- Fix ScalingConfig(accelerator_type) to request an appropriate resource amount (#44225)
- Fix maximum recursion issue when serializing exceptions (#43952)
- Remove base config deepcopy when initializing the trainer actor (#44611)
🏗 Architecture refactoring:
- Remove deprecated `BatchPredictor` (#43934)
Ray Tune
💫 Enhancements:
- Add support for new style lightning import (#44339)
- Log a deprecation warning for local_dir and related environment variables (#44029)
🏗 Architecture refactoring:
- Remove scikit-optimize search algorithm (#43969)
Ray Serve
🔨 Fixes:
- Dynamically-created applications will no longer be deleted when a config is PUT via the REST API (#44476).
- Fix `_to_object_ref` memory leak (#43763)
- Log warning to reconfigure `max_ongoing_requests` if `max_batch_size` is less than `max_ongoing_requests` (#43840)
- Deployment fails to start with `ModuleNotFoundError` in Ray 2.10 (#44329)
  - This was fixed by reverting the original core changes on the `sys.path` behavior. Revert "[core] If there's working_dir, don't set _py_driver_sys_path." (#44435)
- The `batch_queue_cls` parameter is removed from the `@serve.batch` decorator (#43935)
RLlib
🎉 New Features:
- New API stack: DQN Rainbow is now available for single-agent (#43196, #43198, #43199)
- `PrioritizedEpisodeReplayBuffer` is available for off-policy learning using the EnvRunner API (`SingleAgentEnvRunner`) and supports random n-step sampling (#42832, #43258, #43458, #43496, #44262)
💫 Enhancements:
- Restructured `examples/` folder; started moving example scripts to the new API stack (#44559, #44067, #44603)
- Evaluation do-over: Deprecate `enable_async_evaluation` option (in favor of existing `evaluation_parallel_to_training` setting). (#43787)
- Add `module_for` API to MultiAgentEpisode (analogous to `policy_for` API of the old Episode classes). (#44241)
- All `rllib_contrib` old stack algorithms have been removed from `rllib/algorithms` (#43656)
🔨 Fixes:
- New API stack: Multi-GPU + multi-agent has been fixed. This completes support for any combinations of the following on the new API stack: [single-agent, multi-agent] vs [0 GPUs, 1 GPU, >1GPUs] vs [any number of EnvRunners] (#44420, #44664, #44594, #44677, #44082, #44669, #44622)
- Various other bug fixes: #43906, #43871, #44000, #44340, #44491, #43959, #44043, #44446, #44040
📖 Documentation:
Ray Core and Ray Clusters
🎉 New Features:
- Added Ray check-open-ports CLI for checking potential open ports to the public (#44488)
💫 Enhancements:
- Support nodes sharing the same spilling directory without conflicts. (#44487)
- Create two subclasses of `RayActorError` to distinguish between actor died (`ActorDiedError`) and actor temporarily unavailable (`ActorUnavailableError`) cases.
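A minimal sketch of handling the two new error types when calling an actor; the actor here is a placeholder and will not actually fail, the handlers just show where each error type would surface.

```python
import ray
from ray.exceptions import ActorDiedError, ActorUnavailableError

@ray.remote
class Worker:
    def ping(self):
        return "pong"

w = Worker.remote()
try:
    print(ray.get(w.ping.remote()))
except ActorDiedError:
    # permanent failure: recreate the actor or surface the error
    print("actor died")
except ActorUnavailableError:
    # transient failure (e.g. node temporarily unreachable): retry later
    print("actor temporarily unavailable")
```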
🔨 Fixes:
- Fixed the `ModuleNotFound` issue introduced in 2.10 (#44435)
- Fixed an issue where the agent process was using too much CPU (#44348)
- Fixed race condition in multi-threaded actor creation (#44232)
- Fixed several streaming generator bugs (#44079, #44257, #44197)
- Fixed an issue where user exception raised from tasks cannot be subclassed (#44379)
Dashboard
💫 Enhancements:
- Add serve controller metrics to serve system dashboard page (#43797)
- Add Serve Application rows to Serve top-level deployments details page (#43506)
- [Actor table page enhancements] Include "NodeId", "CPU", "Memory", "GPU", "GRAM" columns in the actor table page. Add sort functionality to resource utilization columns. Enable searching table by "Class" and "Repr". (#42588) (#42633) (#42788)
🔨 Fixes:
- Fix default sorting of nodes in Cluster table page to first be by "Alive" nodes, then head nodes, then alphabetical by node ID. (#42929)
- Fix bug where the Serve Deployment detail page fails to load if the deployment is in "Starting" state (#43279)
Docs
💫 Enhancements:
- Landing page refreshes its look and feel. (#44251)
Thanks
Many thanks to all those who contributed to this release!
@aslonnie, @brycehuang30, @MortalHappiness, @astron8t-voyagerx, @edoakes, @sven1977, @anyscalesam, @scottjlee, @hongchaodeng, @slfan1989, @hebiao064, @fishbone, @zcin, @GeneDer, @shrekris-anyscale, @kira-lin, @chappidim, @raulchen, @c21, @WeichenXu123, @marian-code, @bveeramani, @can-anyscale, @mjd3, @justinvyu, @jackhumphries, @Bye-legumes, @ashione, @alanwguo, @Dreamsorcerer, @KamenShah, @jjyao, @omatthew98, @autolisis, @Superskyyy, @stephanie-wang, @simonsays1980, @davidxia, @angelinalg, @architkulkarni, @chris-ray-zhang, @kevin85421, @rynewang, @peytondmurray, @zhangyilun, @khluu, @matthewdeng, @ruisearch42, @pcmoritz, @mattip, @jerome-habana, @alexeykudinkin
Ray-2.10.0
Release Highlights
Ray 2.10 release brings important stability improvements and enhancements to Ray Data, with Ray Data becoming generally available (GA).
- [Data] Ray Data becomes generally available with stability improvements in streaming execution, reading and writing data, better tasks concurrency control, and debuggability improvement with dashboard, logging and metrics visualization.
- [RLlib] “New API Stack” officially announced as alpha for PPO and SAC.
- [Serve] Added a default autoscaling policy set via `num_replicas="auto"` (#42613).
- [Serve] Added support for active load shedding via `max_queued_requests` (#42950).
- [Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
  - This should improve overhead in the Serve proxy and handles. `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced (#42947).
  - If you see any issues, please report them on GitHub and you can disable this behavior by setting: `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal. (A sketch combining these Serve changes follows this list.)
  - `max_concurrent_queries` -> `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` -> `target_ongoing_requests`
  - `downscale_smoothing_factor` -> `downscaling_factor`
  - `upscale_smoothing_factor` -> `upscaling_factor`
- [Core] Autoscaler v2 is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
- [Train] Added support for accelerator types via `ScalingConfig(accelerator_type)`.
- [Train] Revamped the `XGBoostTrainer` and `LightGBMTrainer` to no longer depend on `xgboost_ray` and `lightgbm_ray`. A new, more flexible API will be released in a future release.
- [Train/Tune] Refactored local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`.
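A hedged sketch tying the Serve highlights together: the default autoscaling policy via `num_replicas="auto"`, active load shedding via `max_queued_requests`, and the renamed `max_ongoing_requests` parameter. The deployment body and numeric values are illustrative.

```python
from ray import serve

@serve.deployment(
    num_replicas="auto",      # default autoscaling policy (#42613)
    max_queued_requests=20,   # requests beyond this per-caller queue limit are rejected (#42950)
    max_ongoing_requests=5,   # renamed from max_concurrent_queries
)
class Model:
    def __call__(self, request):
        return "ok"

app = Model.bind()
```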
Ray Libraries
Ray Data
🎉 New Features:
- Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
- Metadata read stability improvements to avoid AWS transient errors, including retry on application-level exceptions, spreading tasks across multiple nodes, and configurable retry intervals (#42044, #43216, #42922, #42759).
- Allow task concurrency control for read, map, and write APIs (#42849, #43113, #43177, #42637)
- Data dashboard and statistics improvements with more runtime metrics for each component (#43790, #43628, #43241, #43477, #43110, #43112)
- Allow specifying application-level errors to retry for actor tasks (#42492)
- Add `num_rows_per_file` parameter to file-based writes (#42694) (see the sketch after this list)
- Add `DataIterator.materialize` (#43210)
- Skip schema call in `DataIterator.to_tf` if `tf.TypeSpec` is provided (#42917)
- Add option to append for `Dataset.write_bigquery` (#42584)
- Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)
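A hedged sketch combining two of the additions above, task concurrency control on a map call and the `num_rows_per_file` argument on a file-based write; the path and sizes are illustrative.

```python
import ray

ds = ray.data.range(100_000)
# cap the number of concurrent map tasks (task concurrency control)
ds = ds.map_batches(lambda batch: batch, concurrency=4)
# target roughly 25k rows per output file
ds.write_parquet("/tmp/ray_out", num_rows_per_file=25_000)
```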
💫 Enhancements:
- Restructure stdout logging for better readability (#43360)
- Add a more performant way to read large TFRecord datasets (#42277)
- Modify `ImageDatasource` to use `Image.BILINEAR` as the default image resampling filter (#43484)
- Reduce internal stack trace output by default (#43251)
- Perform incremental writes to Parquet files (#43563)
- Warn on excessive driver memory usage during shuffle ops (#42574)
- Distributed reads for `ray.data.from_huggingface` (#42599)
- Remove `Stage` class and related usages (#42685)
- Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)
🔨 Fixes:
- Turn off actor locality by default (#44124)
- Normalize block types before internal multi-block operations (#43764)
- Fix memory metrics for `OutputSplitter` (#43740)
- Fix race condition issue in `OpBufferQueue` (#43015)
- Fix early stop for multiple `Limit` operators. (#42958)
- Fix deadlocks caused by `Dataset.streaming_split` that could hang jobs (#42601)
📖 Documentation:
Ray Train
🎉 New Features:
- Add support for accelerator types via `ScalingConfig(accelerator_type)` for improved worker scheduling (#43090)
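A minimal sketch of requesting a specific accelerator type for Train workers; the trainer choice, worker count, and accelerator string are illustrative and the training function is a placeholder.

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    pass  # per-worker training loop goes here

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,
        use_gpu=True,
        accelerator_type="A100",  # only schedule workers on nodes with this accelerator
    ),
)
# result = trainer.fit()
```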
💫 Enhancements:
- Add a backend-specific context manager for `train_func` for setup/teardown logic (#43209)
- Remove `DEFAULT_NCCL_SOCKET_IFNAME` to simplify network configuration (#42808)
- Colocate the Trainer with the rank 0 Worker to improve scheduling behavior (#43115)
🔨 Fixes:
- Enable scheduling workers with `memory` resource requirements (#42999)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (#42037)
- [Lightning] Fix resuming from checkpoint when using `RayFSDPStrategy` (#43594)
- [Lightning] Fix deadlock in `RayTrainReportCallback` (#42751)
- [Transformers] Fix checkpoint reporting behavior when `get_latest_checkpoint` returns None (#42953)
📖 Documentation:
- Enhance docstring and user guides for `train_loop_config` (#43691)
- Clarify in `ray.train.report` docstring that it is not a barrier (#42422)
- Improve documentation for `prepare_data_loader` shuffle behavior and `set_epoch` (#41807)
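A hedged sketch of the shuffle pattern the `prepare_data_loader` docs describe: calling `set_epoch` on the distributed sampler each epoch so shuffling differs across epochs. This is meant to run as the `train_func` of a `TorchTrainer`; the dataset and loop body are placeholders.

```python
import torch
import ray.train
import ray.train.torch

def train_func():
    dataset = torch.utils.data.TensorDataset(torch.arange(100).float().unsqueeze(1))
    loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)
    # wraps the loader with a DistributedSampler when running on multiple workers
    loader = ray.train.torch.prepare_data_loader(loader)

    for epoch in range(3):
        if ray.train.get_context().get_world_size() > 1:
            # reshuffle differently each epoch when running distributed
            loader.sampler.set_epoch(epoch)
        for (batch,) in loader:
            pass  # training step goes here
```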
🏗 Architecture refactoring:
- Simplify XGBoost and LightGBM Trainer integrations. Implemented `XGBoostTrainer` and `LightGBMTrainer` as `DataParallelTrainer`. Removed dependency on `xgboost_ray` and `lightgbm_ray`. (#42111, #42767, #43244, #43424)
- Refactor local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user's home directory (`~/ray_results`). (#43369, #43403, #43689)
- Split overloaded `ray.train.torch.get_device` into another `get_devices` API for multi-GPU worker setup (#42314)
- Refactor restoration configuration to be centered around `storage_path` (#42853, #43179)
- Deprecations related to `SyncConfig` (#42909)
- Remove deprecated `preprocessor` argument from Trainers (#43146, #43234)
- Hard-deprecate `MosaicTrainer` and remove `SklearnTrainer` (#42814)
Ray Tune
💫 Enhancements:
- Increase the minimum number of allowed pending trials for faster auto-scaleup (#43455)
- Add support to `TBXLogger` for logging images (#37822)
- Improve validation of `Experiment(config)` to handle RLlib `AlgorithmConfig` (#42816, #42116)
🔨 Fixes:
- Fix `reuse_actors` error on actor cleanup for function trainables (#42951)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (#42037)
📖 Documentation:
🏗 Architecture refactoring:
- Refactor local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user's home directory (`~/ray_results`). (#43369, #43403, #43689)
- Deprecations related to `SyncConfig` and `chdir_to_trial_dir` (#42909)
- Refactor restoration configuration to be centered around `storage_path` (#42853, #43179)
- Add back `NevergradSearch` (#42305)
- Clean up invalid `checkpoint_dir` and `reporter` deprecation notices (#42698)
Ray Serve
🎉 New Features:
- Added support for active load shedding via `max_queued_requests` (#42950).
- Added a default autoscaling policy set via `num_replicas="auto"` (#42613).
🏗 API Changes:
- Renamed the following parameters. Each of the old names will be supported for another release before removal.
  - `max_concurrent_queries` to `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` to `target_ongoing_requests`
  - `downscale_smoothing_factor` to `downscaling_factor`
  - `upscale_smoothing_factor` to `upscaling_factor`
- WARNING: the following default values will change in Ray 2.11:
  - Default for `max_ongoing_requests` will change from 100 to 5.
  - Default for `target_ongoing_requests` will change from 1 to 2.
💫 Enhancements:
- Add `RAY_SERVE_LOG_ENCODING` env to set the global logging behavior for Serve (#42781).
- Config Serve's gRPC proxy to allow large payload (#43114).
- Add blocking flag to serve.run() (#43227).
- Add actor id and worker id to Serve structured logs (#43725).
- Added replica queue length caching to the DeploymentHandle scheduler (#42943).
  - This should improve overhead in the Serve proxy and handles. `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced (#42947).
  - If you see any issues, please report them on GitHub and you can disable this behavior by setting: `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- Autoscaling metrics (trackin...