Releases: ray-project/ray
Ray-2.31.0
Ray Libraries
Ray Data
🔨 Fixes:
- Fixed bug where `preserve_order` doesn't work with file reads (#46135)
📖 Documentation:
- Added documentation for `dataset.Schema` (#46170)
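A minimal sketch of inspecting the newly documented schema object; the toy dataset is illustrative.

```python
import ray

ds = ray.data.from_items([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}])
schema = ds.schema()      # returns a ray.data.Schema
print(schema.names)       # column names, e.g. ['x', 'y']
print(schema.types)       # corresponding column types
```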
Ray Train
💫 Enhancements:
- Add API for Ray Train run stats (#45711)
Ray Tune
💫 Enhancements:
- Missing stopping criterion should not error (just warn). (#45613)
📖 Documentation:
- Fix broken references in Ray Tune documentation (#45233)
Ray Serve
WARNING: the following default values will change in Ray 2.32:
- Default for `max_ongoing_requests` will change from 100 to 5.
- Default for `target_ongoing_requests` will change from 1 to 2.
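To keep behavior stable across the upcoming default change, a deployment can pin both values explicitly. A minimal sketch, with a placeholder deployment body; the values shown are today's (2.31) defaults.

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=100,  # pin today's default before it drops to 5 in Ray 2.32
    autoscaling_config={"target_ongoing_requests": 1},  # pin before the default rises to 2
)
class MyModel:
    def __call__(self, request):
        return "ok"

app = MyModel.bind()
```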
💫 Enhancements:
- Optimize DeploymentStateManager.get_deployment_statuses (#45872)
🔨 Fixes:
- Fix logging error on passing traceback object into exc_info (#46105)
- Run del even if constructor is still in-progress (#45882)
- Spread replicas with custom resources in torch tune serve release test (#46093)
- [1k release test] don't run replicas on head node (#46130)
📖 Documentation:
- Remove todo since issue is fixed (#45941)
RLlib
🎉 New Features:
- IMPALA runs on the new API stack (with EnvRunners and ConnectorV2s). (#42085)
- SAC/DQN: Prioritized multi-agent episode replay buffer. (#45576)
💫 Enhancements:
- New API stack stability: Add systematic CI learning tests for all possible combinations of: [PPO|IMPALA] + [1CPU|2CPU|1GPU|2GPU] + [single-agent|multi-agent]. (#46162, #46161)
📖 Documentation:
- New API stack: Example script for action masking (#46146)
- New API stack: PyFlight example script cleanup (#45956)
- Old API stack: Enhanced ONNX example (+LSTM). (#43592)
Ray Core and Ray Clusters
Ray Core
💫 Enhancements:
- [runtime-env] automatically infer worker path when starting worker in container (#42304)
🔨 Fixes:
- On GCS restart, destroy rather than forget the unused workers, fixing PG leaks. (#45854)
- Cancel lease requests before returning a PG bundle (#45919)
- Fix boost fiber stack overflow (#46133)
Thanks
Many thanks to all those who contributed to this release!
@jjyao, @kevin85421, @vincent-pli, @khluu, @simonsays1980, @sven1977, @rynewang, @can-anyscale, @richardsliu, @jackhumphries, @alexeykudinkin, @bveeramani, @ruisearch42, @shrekris-anyscale, @stephanie-wang, @matthewdeng, @zcin, @hongchaodeng, @ryanaoleary, @liuxsh9, @GeneDer, @aslonnie, @peytondmurray, @Bye-legumes, @woshiyyya, @scottjlee, @JoshKarpel
Ray-2.30.0
Ray Libraries
Ray Data
💫 Enhancements:
- Improve fractional CPU/GPU formatting (#45673)
- Use sampled fragments to estimate Parquet reader batch size (#45749)
- Refactoring ParquetDatasource and metadata fetching logic (#45728, #45727, #45733, #45734, #45767)
- Refactor planner.py (#45706)
Ray Tune
💫 Enhancements:
- Change the behavior of a missing stopping criterion metric to warn instead of raising an error. This enables the use case of reporting different sets of metrics on different iterations (e.g., a separate set of training and validation metrics). (#45613)
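A minimal sketch of the use case this change enables: a trainable that only reports validation metrics on some iterations. With this release, the stop criterion key being absent on training-only iterations produces a warning instead of an error. The metric names and values are illustrative.

```python
from ray import train, tune

def trainable(config):
    for step in range(10):
        metrics = {"train_loss": 1.0 / (step + 1)}
        if step % 5 == 4:                      # validation runs only every 5th step
            metrics["val_loss"] = 1.2 / (step + 1)
        train.report(metrics)                  # "val_loss" is missing on most steps

tuner = tune.Tuner(
    trainable,
    run_config=train.RunConfig(stop={"val_loss": 0.3}),  # criterion metric may be absent
)
tuner.fit()
```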
Ray Serve
💫 Enhancements:
- Create internal request id to track request objects (#45761)
RLlib
💫 Enhancements:
- Stability: DreamerV3 weekly release test (#45654); Add "official" benchmark script for Atari PPO benchmarks. (#45697)
- Enhance env-rendering callback (#45682)
🔨 Fixes:
- Bug fix in new MetricsLogger API: EMA stats w/o window would lead to infinite list mem-leak. (#45752)
- Various other bug fixes: (#45819, #45820, #45683, #45651, #45753)
📖 Documentation:
Ray Core
🎉 New Features:
- Alpha release of job-level logging configuration: users can now configure user logging to use the logfmt format with logging context attached. (#45344)
💫 Enhancements:
- Integrate amdsmi in AMDAcceleratorManager (#44572)
🔨 Fixes:
- Fix the C++ GcsClient Del not respecting del_by_prefix (#45604)
- Fix exit handling of FiberState threads (#45834)
Dashboard
💫 Enhancements:
- Parse out json logs (#45853)
Many thanks to all those who contributed to this release: @liuxsh9, @peytondmurray, @pcmoritz, @GeneDer, @saihaj, @khluu, @aslonnie, @yucai, @vickytsang, @can-anyscale, @bthananjeyan, @raulchen, @hongchaodeng, @x13n, @simonsays1980, @peterghaddad, @kevin85421, @rynewang, @angelinalg, @jjyao, @BenWilson2, @jackhumphries, @zcin, @chris-ray-zhang, @c21, @shrekris-anyscale, @alanwguo, @stephanie-wang, @Bye-legumes, @sven1977, @WeichenXu123, @bveeramani, @nikitavemuri
Ray-2.24.0
Ray Libraries
Ray Data
🎉 New Features:
- Allow user to configure timeout for actor pool (#45508)
- Add override_num_blocks to from_pandas and perform auto-partition (#44937)
- Upgrade Arrow version to 16 in CI (#45565)
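A hedged sketch of the new `override_num_blocks` argument on `from_pandas`, per the feature above; the DataFrame and block count are illustrative.

```python
import pandas as pd
import ray

df = pd.DataFrame({"x": range(1_000)})
# split the input into 4 blocks instead of a single block
ds = ray.data.from_pandas(df, override_num_blocks=4)
print(ds.count())
```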
💫 Enhancements:
- Clarify that num_rows_per_file isn't strict (#45529)
- Record more telemetry for newly added datasources (#45647)
- Avoid pickling LanceFragment when creating read tasks for Lance (#45392)
Ray Train
📖 Documentation:
- [HPU] Add example of Stable Diffusion fine-tuning and serving on Intel Gaudi (#45217)
- [HPU] Add example of Llama-2 fine-tuning on Intel Gaudi (#44667)
Ray Tune
🏗 Architecture refactoring:
- Improve excessive syncing warning and deprecate TUNE_RESULT_DIR, RAY_AIR_LOCAL_CACHE_DIR, local_dir (#45210)
Ray Serve
💫 Enhancements:
- Clean up Serve proxy files (#45486)
📖 Documentation:
- vLLM example to serve LLM models (#45430)
RLlib
💫 Enhancements:
- DreamerV3 on tf: Bug fix so it can run again with tf==2.11.1 (2.11.0 is no longer available) (#45419); added a weekly release test for DreamerV3.
- Added support for multi-agent off-policy algorithms (DQN and SAC) in the new API stack (#45182)
- Config option for APPO/IMPALA to change number of GPU-loader threads (#45467)
🔨 Fixes:
- Various MetricsLogger bug fixes (#45543, #45585, #45575)
- Other fixes: #45588, #45617, #45517, #45465
📖 Documentation:
- Example script for new API stack: How-to restore 1 of n agents from a checkpoint. (#45462)
- Example script for new API stack: Autoregressive action module. #45525
Ray Core
💫 Enhancements:
- Improve node death observability (#45320, #45357, #45533, #45644, #45497)
- Ray c++ backend structured logging (#44468)
🔨 Fixes:
- Fix worker crash when getting actor name from runtime context (#45194)
- log dedup should not dedup number only lines (#45385)
📖 Documentation:
- Improve doc for `--object-store-memory` to describe how the default value is set (#45301)
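For reference, a minimal sketch of setting the value explicitly from Python instead of relying on the default described in the doc above; the 2 GiB figure is illustrative (the CLI flag is typically passed to `ray start`).

```python
import ray

# override the object store size in bytes rather than using the computed default
ray.init(object_store_memory=2 * 1024**3)
```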
Dashboard
🔨 Fixes:
- Move Job package uploading to another thread to unblock the event loop. (#45282)
Many thanks to all those who contributed to this release: @maxliuofficial, @simonsays1980, @GeneDer, @dudeperf3ct, @khluu, @justinvyu, @andrewsykim, @Catch-Bull, @zcin, @bveeramani, @rynewang, @angelinalg, @matthewdeng, @jjyao, @kira-lin, @harborn, @hongchaodeng, @peytondmurray, @aslonnie, @timkpaine, @982945902, @maxpumperla, @stephanie-wang, @ruisearch42, @alanwguo, @can-anyscale, @c21, @Atry, @KamenShah, @sven1977, @raulchen
Ray-2.23.0
Ray Libraries
Ray Data
🎉 New Features:
- Add support for using GPUs with map_groups (#45305)
- Add support for using actors with map_groups (#45310)
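A heavily hedged sketch of how the `map_groups` additions above might be used, assuming `map_groups` forwards `num_gpus` like other map APIs (actor support was added separately); the grouping column and per-group function are illustrative.

```python
import ray

ds = ray.data.from_items([{"group": i % 2, "value": float(i)} for i in range(10)])

def center(batch):
    # runs once per group; a GPU model could be applied here instead
    batch["value"] = batch["value"] - batch["value"].mean()
    return batch

# assumption: num_gpus is accepted and forwarded as a remote arg for the group tasks
out = ds.groupby("group").map_groups(center, batch_format="pandas", num_gpus=1)
```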
💫 Enhancements:
- Refine exception handling from arrow data conversion (#45294)
🔨 Fixes:
- Fix Ray databricks UC reader with dynamic Databricks notebook scope token (#45153)
- Fix bug where you can't return both objects and arrays from a UDF (#45287)
- Fix bug where map_groups triggers execution during input validation (#45314)
Ray Tune
🔨 Fixes:
- [tune] Fix PB2 scheduler error resulting from trying to sort by Trial objects (#45161)
Ray Serve
🔨 Fixes:
- Log application unhealthy errors at error level instead of warning level (#45211)
RLlib
💫 Enhancements:
- Examples and `tuned_examples` learning tests for the new API stack are now "self-executable" (they no longer require a third-party script to run). + WandB support. (#45023)
🔨 Fixes:
- Fix result dict “spam” (duplicate, deprecated keys, e.g. “sampler_results” dumped into top level). (#45330)
📖 Documentation:
- Add example for training with fractional GPUs on new API stack. (#45379)
- Cleanup examples folder and remove deprecated sub directories. (#45327)
Ray Core
💫 Enhancements:
- [Logs] Add runtime env started logs to job driver (#45255)
- `ray.util.collective` support for `torch.bfloat16` (#39845)
- [Core] Better propagate node death information (#45128)
🔨 Fixes:
- [Core] Fix worker process leaks after job finishes (#44214)
Many thanks to all those who contributed to this release: @hongchaodeng, @khluu, @antoni-jamiolkowski, @ameroyer, @bveeramani, @can-anyscale, @WeichenXu123, @peytondmurray, @jackhumphries, @kevin85421, @jjyao, @robcaulk, @rynewang, @scottsun94, @swang, @GeneDer, @zcin, @ruisearch42, @aslonnie, @angelinalg, @raulchen, @ArthurBook, @sven1977, @wuxibin89
Ray-2.22.0
Ray Libraries
Ray Data
🎉 New Features:
- Add function to dynamically generate `ray_remote_args` for Map APIs (#45143)
- Allow manually setting resource limits for training jobs (#45188)
💫 Enhancements:
- Introduce abstract interface for data autoscaling (#45002)
- Add debugging info for `SplitCoordinator` (#45226)
🔨 Fixes:
- Don't show `AllToAllOperator` progress bar if the disable flag is set (#45136)
- Don't load Arrow `PyExtensionType` by default (#45084)
- Don't raise batch size error if `num_gpus=0` (#45202)
Ray Train
💫 Enhancements:
- [XGBoost][LightGBM] Update RayTrainReportCallback to only save checkpoints on rank 0 (#45083)
Ray Core
🔨 Fixes:
- Fix the cpu percentage metrics for dashboard process (#45124)
Dashboard
💫 Enhancements:
- Improvements to log viewer so line numbers do not get selected when copying text.
- Improvements to the log viewer to avoid unnecessary re-rendering which causes text selection to clear.
Many thanks to all those who contributed to this release: @justinvyu, @simonsays1980, @chris-ray-zhang, @kevin85421, @angelinalg, @rynewang, @brycehuang30, @alanwguo, @jjyao, @shaikhismail, @khluu, @can-anyscale, @bveeramani, @jrosti, @WeichenXu123, @MortalHappiness, @raulchen, @scottjlee, @ruisearch42, @aslonnie, @alexeykudinkin
Ray-2.21.0
Ray Libraries
Ray Data
🎉 New Features:
- Add `read_lance` API to read Lance Dataset (#45106)
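A hedged usage sketch for the new `read_lance` API, assuming only a dataset URI is required; the path points to a hypothetical Lance dataset created elsewhere.

```python
import ray

# assumption: a Lance dataset already exists at this local path
ds = ray.data.read_lance("/tmp/example.lance")
print(ds.schema())
print(ds.take(3))
```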
🔨 Fixes:
- Retry RaySystemError application errors (#45079)
📖 Documentation:
- Fix broken references in data documentation (#44956)
Ray Train
📖 Documentation:
- Fix broken links in Train documentation (#44953)
Ray Tune
📖 Documentation:
- Update Hugging Face example to add reference (#42771)
🏗 Architecture refactoring:
- Remove deprecated `ray.air.callbacks` modules (#45104)
Ray Serve
💫 Enhancements:
- Allow methods decorated with @serve.batch to pass type hints (#45004)
- Allow configuring Serve control loop interval (#45063)
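A hedged sketch of a batched method carrying type hints, in the spirit of the @serve.batch enhancement above; the deployment class, batch size, and timeout are illustrative.

```python
from typing import List

from ray import serve

@serve.deployment
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def predict(self, inputs: List[float]) -> List[float]:
        # the decorator gathers individual calls into one list and expects a list back
        return [x * 2 for x in inputs]

    async def __call__(self, request):
        # a single call is transparently grouped into a batch with concurrent callers
        return await self.predict(1.0)

app = BatchedModel.bind()
```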
🔨 Fixes:
- Fix bug with controller failing to recover for autoscaling deployments (#45118)
- Fix issue where Ctrl-C after `serve run` doesn't shut down Serve components (#45087)
- Fix lightweight update of max ongoing requests (#45006)
RLlib
🎉 New Features:
- New MetricsLogger API now fully functional on the new API stack (working now also inside Learner classes, i.e. loss functions). (#44995, #45109)
💫 Enhancements:
- Renamings and cleanups (toward new API stack and more consistent naming schemata): WorkerSet -> EnvRunnerGroup, DEFAULT_POLICY_ID -> DEFAULT_MODULE_ID, config.rollouts() -> config.env_runners(), etc.. (#45022, #44920)
- Changed behavior of `EnvRunnerGroup.foreach_worker…` methods to new defaults: `mark_healthy=True` (used to be False) and `healthy_only=True` (used to be False). (#44993)
- Fix `get_state()/from_state()` methods in SingleAgent- and MultiAgentEpisodes. (#45012)
🔨 Fixes:
- Bug fix for (torch) global_norm clipping overflow problem: (#45055)
- Various bug- and test case fixes: #45030, #45031, #45070, #45053, #45110, #45111
📖 Documentation:
Ray Core
🔨 Fixes:
- Fix `ray.init(logging_format)` argument being ignored (#45037)
- Handle unserializable user exception (#44878)
- Fix dashboard process event loop blocking issues (#45048, #45047)
Dashboard
🔨 Fixes:
- Fix Nodes page sorting not working correctly.
- Add back “actors per page” UI control in the actors page.
Many thanks to all those who contributed to this release: @rynewang, @can-anyscale, @scottsun94, @bveeramani, @ceddy4395, @GeneDer, @zcin, @JoshKarpel, @nikitavemuri, @stephanie-wang, @jackhumphries, @matthewdeng, @yash97, @simonsays1980, @peytondmurray, @evalaiyc98, @c21, @alanwguo, @shrekris-anyscale, @kevin85421, @hongchaodeng, @sven1977, @st--, @khluu
Ray-2.20.0
Ray Libraries
Ray Data
💫 Enhancements:
- Dedupe repeated schema during `ParquetDatasource` metadata prefetching (#44750)
- Update `map_groups` implementation to better handle large outputs (#44862)
- Deprecate `prefetch_batches` arg of `iter_rows` and change default value (#44982)
- Default to not creating directories on S3 writes (#44972)
- Make internal UDF names more descriptive (#44985)
- Make `name` a required argument for `AggregateFn` (#44880)
📖 Documentation:
- Add key concepts to and revise "Data Internals" page (#44751)
Ray Train
💫 Enhancements:
- Setup XGBoost `CommunicatorContext` automatically (#44883)
- Track Train Run Info with `TrainStateActor` (#44585)
📖 Documentation:
Ray Tune
💫 Enhancements:
- Remove trial table when running Ray Train in a Jupyter notebook (#44858)
- Clean up temporary checkpoint directories for class Trainables (ex: RLlib) (#44366)
📖 Documentation:
Ray Serve
💫 Enhancements:
- The handle metric push interval is now configurable via the RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S environment variable (#32920)
- Improve performance of developer API serve.get_app_handle (#44812)
🔨 Fixes:
- Fix memory leak in handles for autoscaling deployments (the leak happens when RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=1) (#44877)
RLlib
🎉 New Features:
- Introduce `MetricsLogger`, a unified API for users of RLlib to log custom metrics and stats in all of RLlib's components (Algorithm, EnvRunners, and Learners). Rolled out for new API stack for Algorithm (`training_step`) and EnvRunners (custom callbacks). `Learner` (custom loss functions) support in progress. #44888, #44442 (see the sketch below)
- Introduce "inference-only" (slim) mode for RLModules that run inside an EnvRunner (and thus don't require value-functions or target networks): #44797
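A heavily hedged sketch of logging a custom metric through the new `MetricsLogger` from an EnvRunner callback; the exact callback hook and `log_value` arguments are assumptions based on the description above, and the callback would be wired in via `config.callbacks(...)` on a new-API-stack algorithm.

```python
from ray.rllib.algorithms.callbacks import DefaultCallbacks

class CustomMetricsCallback(DefaultCallbacks):
    # On the new API stack, callbacks receive the MetricsLogger of the component
    # they run inside (here: an EnvRunner).
    def on_episode_end(self, *, episode, metrics_logger=None, **kwargs):
        if metrics_logger is not None:
            # assumption: log_value aggregates the value with the given reduce mode
            metrics_logger.log_value("custom/episode_len", len(episode), reduce="mean")
```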
💫 Enhancements:
- MultiAgentEpisodeReplayBuffer for new API stack (preparation for multi-agent support of SAC and DQN): #44450
- AlgorithmConfig cleanup and renaming of properties and methods for better consistency/transparency: #44896
🔨 Fixes:
Ray Core and Ray Clusters
💫 Enhancements:
- Report GCS internal pubsub buffer metrics and cap message size (#44749)
🔨 Fixes:
- Fix task submission never return when network partition happens (#44692)
- Fix incorrect use of ssh port forward option. (#44973)
- Make sure dashboard will exit if grpc server fails (#44928)
- Make sure dashboard agent will exit if grpc server fails (#44899)
Thanks @can-anyscale, @hongchaodeng, @zcin, @marwan116, @khluu, @bewestphal, @scottjlee, @andrewsykim, @anyscalesam, @MortalHappiness, @justinvyu, @JoshKarpel, @woshiyyya, @rynewang, @Abirdcfly, @omatthew98, @sven1977, @marcelocarmona, @rueian, @mattip, @angelinalg, @aslonnie, @matthewdeng, @abizjakpro, @simonsays1980, @jjyao, @terraflops1048576, @hongpeng-guo, @stephanie-wang, @bw-matthew, @bveeramani, @ruisearch42, @kevin85421, @Tongruizhe
Many thanks to all those who contributed to this release!
Ray-2.12.0
Ray Libraries
Ray Data
🎉 New Features:
- Store Ray Data logs in special subdirectory (#44743)
💫 Enhancements:
- Add in `local_read` option to `from_torch` (#44752)
🔨 Fixes:
- Fix the config to disable progress bar (#44342)
📖 Documentation:
- Clarify deprecated Datasource docstrings (#44790)
Ray Train
🔨 Fixes:
- Disable gathering the full state dict in `RayFSDPStrategy` for `lightning>2.1` (#44569)
Ray Tune
💫 Enhancements:
Ray Serve
🔨 Fixes:
- [Serve] fix getting attributes on stdout during Serve logging redirect (#44787)
RLlib
🎉 New Features:
- Support of images and video logging in WandB (env rendering example script for the new API stack coming up). (#43356)
💫 Enhancements:
- Better support and separation-of-concerns for `model_config_dict` in new API stack. (#44263)
- Added example script to pre-train an `RLModule` in single-agent fashion, then bring checkpoint into multi-agent setup and continue training. (#44674)
- More `examples` scripts got translated from the old to the new API stack: curriculum learning, custom gym env, etc.: (#44706, #44707, #44735, #44841)
Ray Core and Ray Clusters
🔨 Fixes:
- Fix GetAllJobInfo `is_running_tasks` not returning the correct value when the driver starts Ray (#44459)
Thanks
Many thanks to all those who contributed to this release!
@can-anyscale, @hongpeng-guo, @sven1977, @zcin, @shrekris-anyscale, @liuxsh9, @jackhumphries, @GeneDer, @woshiyyya, @simonsays1980, @omatthew98, @andrewsykim, @n30111, @architkulkarni, @bveeramani, @aslonnie, @alexeykudinkin, @WeichenXu123, @rynewang, @matthewdeng, @angelinalg, @c21
Ray-2.11.0
Release Highlights
- [data] Support reading Avro files with `ray.data.read_avro`
- [train] Added experimental support for AWS Trainium (Neuron) and Intel HPU.
Ray Libraries
Ray Data
🎉 New Features:
- Support reading Avro files with `ray.data.read_avro` (#43663)
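A minimal usage sketch for the new Avro reader; the S3 prefix is hypothetical and assumed to contain one or more `.avro` files.

```python
import ray

ds = ray.data.read_avro("s3://my-bucket/events/")
ds.show(3)
```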
💫 Enhancements:
- Pin `ipywidgets==7.7.2` to enable Data progress bars in VSCode Web (#44398)
- Change log level for ignored exceptions (#44408)
🔨 Fixes:
- Change Parquet encoding ratio lower bound from 2 to 1 (#44470)
- Fix throughput time calculations for metrics (#44138)
- Fix nested ragged `numpy.ndarray` (#44236)
- Fix Ray debugger incompatibility caused by trimmed error stack trace (#44496)
📖 Documentation:
- Update "Data Loading and Preprocessing" doc (#44165)
- Move imports into `TFPredictor` in batch inference example (#44434)
Ray Train
🎉 New Features:
- Add experimental support for AWS Trainium (Neuron) (#39130)
- Add experimental support for Intel HPU (#43343)
💫 Enhancements:
- Log a deprecation warning for local_dir and related environment variables (#44029)
- Enforce xgboost>=1.7 for XGBoostTrainer usage (#44269)
🔨 Fixes:
- Fix ScalingConfig(accelerator_type) to request an appropriate resource amount (#44225)
- Fix maximum recursion issue when serializing exceptions (#43952)
- Remove base config deepcopy when initializing the trainer actor (#44611)
🏗 Architecture refactoring:
- Remove deprecated `BatchPredictor` (#43934)
Ray Tune
💫 Enhancements:
- Add support for new style lightning import (#44339)
- Log a deprecation warning for local_dir and related environment variables (#44029)
🏗 Architecture refactoring:
- Remove scikit-optimize search algorithm (#43969)
Ray Serve
🔨 Fixes:
- Dynamically-created applications will no longer be deleted when a config is PUT via the REST API (#44476).
- Fix `_to_object_ref` memory leak (#43763)
- Log warning to reconfigure `max_ongoing_requests` if `max_batch_size` is less than `max_ongoing_requests` (#43840)
- Deployment fails to start with `ModuleNotFoundError` in Ray 2.10 (#44329)
  - This was fixed by reverting the original core changes on the `sys.path` behavior. Revert "[core] If there's working_dir, don't set _py_driver_sys_path." (#44435)
- The `batch_queue_cls` parameter is removed from the `@serve.batch` decorator (#43935)
RLlib
🎉 New Features:
- New API stack: DQN Rainbow is now available for single-agent (#43196, #43198, #43199)
- `PrioritizedEpisodeReplayBuffer` is available for off-policy learning using the EnvRunner API (`SingleAgentEnvRunner`) and supports random n-step sampling (#42832, #43258, #43458, #43496, #44262)
💫 Enhancements:
- Restructured `examples/` folder; started moving example scripts to the new API stack (#44559, #44067, #44603)
- Evaluation do-over: Deprecate `enable_async_evaluation` option (in favor of existing `evaluation_parallel_to_training` setting). (#43787)
- Add `module_for` API to MultiAgentEpisode (analogous to `policy_for` API of the old Episode classes). (#44241)
- All `rllib_contrib` old stack algorithms have been removed from `rllib/algorithms` (#43656)
🔨 Fixes:
- New API stack: Multi-GPU + multi-agent has been fixed. This completes support for any combinations of the following on the new API stack: [single-agent, multi-agent] vs [0 GPUs, 1 GPU, >1GPUs] vs [any number of EnvRunners] (#44420, #44664, #44594, #44677, #44082, #44669, #44622)
- Various other bug fixes: #43906, #43871, #44000, #44340, #44491, #43959, #44043, #44446, #44040
📖 Documentation:
Ray Core and Ray Clusters
🎉 New Features:
- Added Ray check-open-ports CLI for checking potential open ports to the public (#44488)
💫 Enhancements:
- Support nodes sharing the same spilling directory without conflicts. (#44487)
- Create two subclasses of `RayActorError` to distinguish between actor died (`ActorDiedError`) and actor temporarily unavailable (`ActorUnavailableError`) cases.
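A minimal sketch of handling the two new error types when calling an actor; the actor here is a placeholder and will not actually fail, the handlers just show where each error type would surface.

```python
import ray
from ray.exceptions import ActorDiedError, ActorUnavailableError

@ray.remote
class Worker:
    def ping(self):
        return "pong"

w = Worker.remote()
try:
    print(ray.get(w.ping.remote()))
except ActorDiedError:
    # permanent failure: recreate the actor or surface the error
    print("actor died")
except ActorUnavailableError:
    # transient failure (e.g. node temporarily unreachable): retry later
    print("actor temporarily unavailable")
```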
🔨 Fixes:
- Fixed the `ModuleNotFound` issue introduced in 2.10 (#44435)
- Fixed an issue where the agent process was using too much CPU (#44348)
- Fixed race condition in multi-threaded actor creation (#44232)
- Fixed several streaming generator bugs (#44079, #44257, #44197)
- Fixed an issue where user exception raised from tasks cannot be subclassed (#44379)
Dashboard
💫 Enhancements:
- Add serve controller metrics to serve system dashboard page (#43797)
- Add Serve Application rows to Serve top-level deployments details page (#43506)
- [Actor table page enhancements] Include "NodeId", "CPU", "Memory", "GPU", "GRAM" columns in the actor table page. Add sort functionality to resource utilization columns. Enable searching table by "Class" and "Repr". (#42588) (#42633) (#42788)
🔨 Fixes:
- Fix default sorting of nodes in Cluster table page to first be by "Alive" nodes, then head nodes, then alphabetical by node ID. (#42929)
- Fix bug where the Serve Deployment detail page fails to load if the deployment is in "Starting" state (#43279)
Docs
💫 Enhancements:
- Landing page refreshes its look and feel. (#44251)
Thanks
Many thanks to all those who contributed to this release!
@aslonnie, @brycehuang30, @MortalHappiness, @astron8t-voyagerx, @edoakes, @sven1977, @anyscalesam, @scottjlee, @hongchaodeng, @slfan1989, @hebiao064, @fishbone, @zcin, @GeneDer, @shrekris-anyscale, @kira-lin, @chappidim, @raulchen, @c21, @WeichenXu123, @marian-code, @bveeramani, @can-anyscale, @mjd3, @justinvyu, @jackhumphries, @Bye-legumes, @ashione, @alanwguo, @Dreamsorcerer, @KamenShah, @jjyao, @omatthew98, @autolisis, @Superskyyy, @stephanie-wang, @simonsays1980, @davidxia, @angelinalg, @architkulkarni, @chris-ray-zhang, @kevin85421, @rynewang, @peytondmurray, @zhangyilun, @khluu, @matthewdeng, @ruisearch42, @pcmoritz, @mattip, @jerome-habana, @alexeykudinkin
Ray-2.10.0
Release Highlights
Ray 2.10 release brings important stability improvements and enhancements to Ray Data, with Ray Data becoming generally available (GA).
- [Data] Ray Data becomes generally available with stability improvements in streaming execution, reading and writing data, better tasks concurrency control, and debuggability improvement with dashboard, logging and metrics visualization.
- [RLlib] “New API Stack” officially announced as alpha for PPO and SAC.
- [Serve] Added a default autoscaling policy set via `num_replicas="auto"` (#42613).
- [Serve] Added support for active load shedding via `max_queued_requests` (#42950).
- [Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
  - This should improve overhead in the Serve proxy and handles. `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced (#42947).
  - If you see any issues, please report them on GitHub and you can disable this behavior by setting: `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal. (A sketch combining these Serve changes follows this list.)
  - `max_concurrent_queries` -> `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` -> `target_ongoing_requests`
  - `downscale_smoothing_factor` -> `downscaling_factor`
  - `upscale_smoothing_factor` -> `upscaling_factor`
- [Core] Autoscaler v2 is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
- [Train] Added support for accelerator types via `ScalingConfig(accelerator_type)`.
- [Train] Revamped the `XGBoostTrainer` and `LightGBMTrainer` to no longer depend on `xgboost_ray` and `lightgbm_ray`. A new, more flexible API will be released in a future release.
- [Train/Tune] Refactored local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`.
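A hedged sketch tying the Serve highlights together: the default autoscaling policy via `num_replicas="auto"`, active load shedding via `max_queued_requests`, and the renamed `max_ongoing_requests` parameter. The deployment body and numeric values are illustrative.

```python
from ray import serve

@serve.deployment(
    num_replicas="auto",      # default autoscaling policy (#42613)
    max_queued_requests=20,   # requests beyond this per-caller queue limit are rejected (#42950)
    max_ongoing_requests=5,   # renamed from max_concurrent_queries
)
class Model:
    def __call__(self, request):
        return "ok"

app = Model.bind()
```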
Ray Libraries
Ray Data
🎉 New Features:
- Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
- Metadata read stability improvements to avoid AWS transient errors, including retry on application-level exceptions, spreading tasks across multiple nodes, and configurable retry intervals (#42044, #43216, #42922, #42759).
- Allow task concurrency control for read, map, and write APIs (#42849, #43113, #43177, #42637)
- Data dashboard and statistics improvements with more runtime metrics for each component (#43790, #43628, #43241, #43477, #43110, #43112)
- Allow specifying application-level errors to retry for actor tasks (#42492)
- Add `num_rows_per_file` parameter to file-based writes (#42694) (see the sketch after this list)
- Add `DataIterator.materialize` (#43210)
- Skip schema call in `DataIterator.to_tf` if `tf.TypeSpec` is provided (#42917)
- Add option to append for `Dataset.write_bigquery` (#42584)
- Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)
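A hedged sketch combining two of the additions above, task concurrency control on a map call and the `num_rows_per_file` argument on a file-based write; the path and sizes are illustrative.

```python
import ray

ds = ray.data.range(100_000)
# cap the number of concurrent map tasks (task concurrency control)
ds = ds.map_batches(lambda batch: batch, concurrency=4)
# target roughly 25k rows per output file
ds.write_parquet("/tmp/ray_out", num_rows_per_file=25_000)
```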
💫 Enhancements:
- Restructure stdout logging for better readability (#43360)
- Add a more performant way to read large TFRecord datasets (#42277)
- Modify `ImageDatasource` to use `Image.BILINEAR` as the default image resampling filter (#43484)
- Reduce internal stack trace output by default (#43251)
- Perform incremental writes to Parquet files (#43563)
- Warn on excessive driver memory usage during shuffle ops (#42574)
- Distributed reads for `ray.data.from_huggingface` (#42599)
- Remove `Stage` class and related usages (#42685)
- Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)
🔨 Fixes:
- Turn off actor locality by default (#44124)
- Normalize block types before internal multi-block operations (#43764)
- Fix memory metrics for `OutputSplitter` (#43740)
- Fix race condition issue in `OpBufferQueue` (#43015)
- Fix early stop for multiple `Limit` operators. (#42958)
- Fix deadlocks caused by `Dataset.streaming_split` that could hang jobs (#42601)
📖 Documentation:
Ray Train
🎉 New Features:
- Add support for accelerator types via `ScalingConfig(accelerator_type)` for improved worker scheduling (#43090)
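A minimal sketch of requesting a specific accelerator type for Train workers; the trainer choice, worker count, and accelerator string are illustrative and the training function is a placeholder.

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    pass  # per-worker training loop goes here

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,
        use_gpu=True,
        accelerator_type="A100",  # only schedule workers on nodes with this accelerator
    ),
)
# result = trainer.fit()
```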
💫 Enhancements:
- Add a backend-specific context manager for `train_func` for setup/teardown logic (#43209)
- Remove `DEFAULT_NCCL_SOCKET_IFNAME` to simplify network configuration (#42808)
- Colocate the Trainer with the rank 0 Worker to improve scheduling behavior (#43115)
🔨 Fixes:
- Enable scheduling workers with `memory` resource requirements (#42999)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (#42037)
- [Lightning] Fix resuming from checkpoint when using `RayFSDPStrategy` (#43594)
- [Lightning] Fix deadlock in `RayTrainReportCallback` (#42751)
- [Transformers] Fix checkpoint reporting behavior when `get_latest_checkpoint` returns None (#42953)
📖 Documentation:
- Enhance docstring and user guides for `train_loop_config` (#43691)
- Clarify in `ray.train.report` docstring that it is not a barrier (#42422)
- Improve documentation for `prepare_data_loader` shuffle behavior and `set_epoch` (#41807)
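A hedged sketch of the shuffle pattern the `prepare_data_loader` docs describe: calling `set_epoch` on the distributed sampler each epoch so shuffling differs across epochs. This is meant to run as the `train_func` of a `TorchTrainer`; the dataset and loop body are placeholders.

```python
import torch
import ray.train
import ray.train.torch

def train_func():
    dataset = torch.utils.data.TensorDataset(torch.arange(100).float().unsqueeze(1))
    loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True)
    # wraps the loader with a DistributedSampler when running on multiple workers
    loader = ray.train.torch.prepare_data_loader(loader)

    for epoch in range(3):
        if ray.train.get_context().get_world_size() > 1:
            # reshuffle differently each epoch when running distributed
            loader.sampler.set_epoch(epoch)
        for (batch,) in loader:
            pass  # training step goes here
```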
🏗 Architecture refactoring:
- Simplify XGBoost and LightGBM Trainer integrations. Implemented `XGBoostTrainer` and `LightGBMTrainer` as `DataParallelTrainer`. Removed dependency on `xgboost_ray` and `lightgbm_ray`. (#42111, #42767, #43244, #43424)
- Refactor local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user's home directory (`~/ray_results`). (#43369, #43403, #43689)
- Split overloaded `ray.train.torch.get_device` into another `get_devices` API for multi-GPU worker setup (#42314)
- Refactor restoration configuration to be centered around `storage_path` (#42853, #43179)
- Deprecations related to `SyncConfig` (#42909)
- Remove deprecated `preprocessor` argument from Trainers (#43146, #43234)
- Hard-deprecate `MosaicTrainer` and remove `SklearnTrainer` (#42814)
Ray Tune
💫 Enhancements:
- Increase the minimum number of allowed pending trials for faster auto-scaleup (#43455)
- Add support to `TBXLogger` for logging images (#37822)
- Improve validation of `Experiment(config)` to handle RLlib `AlgorithmConfig` (#42816, #42116)
🔨 Fixes:
- Fix `reuse_actors` error on actor cleanup for function trainables (#42951)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (#42037)
📖 Documentation:
🏗 Architecture refactoring:
- Refactor local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user's home directory (`~/ray_results`). (#43369, #43403, #43689)
- Deprecations related to `SyncConfig` and `chdir_to_trial_dir` (#42909)
- Refactor restoration configuration to be centered around `storage_path` (#42853, #43179)
- Add back `NevergradSearch` (#42305)
- Clean up invalid `checkpoint_dir` and `reporter` deprecation notices (#42698)
Ray Serve
🎉 New Features:
- Added support for active load shedding via `max_queued_requests` (#42950).
- Added a default autoscaling policy set via `num_replicas="auto"` (#42613).
🏗 API Changes:
- Renamed the following parameters. Each of the old names will be supported for another release before removal.
  - `max_concurrent_queries` to `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` to `target_ongoing_requests`
  - `downscale_smoothing_factor` to `downscaling_factor`
  - `upscale_smoothing_factor` to `upscaling_factor`
- WARNING: the following default values will change in Ray 2.11:
  - Default for `max_ongoing_requests` will change from 100 to 5.
  - Default for `target_ongoing_requests` will change from 1 to 2.
💫 Enhancements:
- Add `RAY_SERVE_LOG_ENCODING` env to set the global logging behavior for Serve (#42781).
- Config Serve's gRPC proxy to allow large payload (#43114).
- Add blocking flag to serve.run() (#43227).
- Add actor id and worker id to Serve structured logs (#43725).
- Added replica queue length caching to the DeploymentHandle scheduler (#42943).
  - This should improve overhead in the Serve proxy and handles. `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced (#42947).
  - If you see any issues, please report them on GitHub and you can disable this behavior by setting: `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- Autoscaling metrics (trackin...