Releases: ray-project/ray
ray-1.0.1
Ray 1.0.1
Ray 1.0.1 is now officially released!
Highlights
- If you're migrating from Ray < 1.0.0, be sure to check out the 1.0 Migration Guide.
- The autoscaler now uses Docker by default.
- RLlib features multiple new environments.
- Tune supports population-based bandits, checkpointing in Docker, and multiple usability improvements.
- SGD supports PyTorch Lightning.
- All of Ray's components and libraries have improved performance, scalability, and stability.
Core
- 1.0 Migration Guide.
- Many bug fixes and optimizations in GCS.
- Polishing of the Placement Group API.
- Improved Java language support.
RLlib
- Added documentation for Curiosity exploration module (#11066).
- Added RecSim environment wrapper (#11205).
- Added Kaggle’s football environment (multi-agent) wrapper (#11249).
- Multiple bug fixes: GPU related fixes for SAC (#11298), MARWIL, all example scripts run on GPU (#11105), lifted limitation on 2^31 timesteps (#11301), fixed eval workers for ES and ARS (#11308), fixed broken no-eager-no-workers mode (#10745).
- Support custom MultiAction distributions (#11311).
- No environment is created on driver (local worker) if not necessary (#11307).
- Added simple SampleCollector class for Trajectory View API (#11056).
- Code cleanup: Docstrings and type annotations for Exploration classes (#11251), DQN (#10710), MB-MPO algorithm, SAC algorithm (#10825).
Serve
- API: Serve will error when `serve_client` is serialized. (#11181)
- Performance: `serve_client.get_handle("endpoint")` will now get a handle to the nearest node, increasing scalability in distributed mode (sketched below). (#11477)
- Doc: Added FAQ page and updated architecture page (#10754, #11258)
- Testing: New distributed tests and benchmarks are added (#11386)
- Testing: Serve now runs on Windows (#10682)
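A hedged sketch of the client-based API referenced above; the backend function, names, and route are hypothetical:

```python
import ray
from ray import serve

ray.init()
client = serve.start()  # the serve_client; serializing it now raises an error

def hello(request):
    return "hello"

client.create_backend("hello_backend", hello)
client.create_endpoint("hello", backend="hello_backend", route="/hello")

# get_handle returns a handle routed to the nearest node.
handle = client.get_handle("hello")
print(ray.get(handle.remote()))
```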
SGD
- PyTorch Lightning integration is now supported (#11042)
- Support `num_steps` to continue training (#11142)
- Callback API for SGD+Tune (#11316)
Tune
- New Algorithm: Population-based Bandits (#11466)
- `tune.with_parameters()`, a wrapper function to pass arbitrary objects through the object store to trainables (sketched below) (#11504)
- Strict metric checking: by default, Tune will now error if a result dict does not include the optimization metric as a key. You can disable this with `TUNE_DISABLE_STRICT_METRIC_CHECKING` (#10972)
- Syncing checkpoints between multiple Docker containers on a cluster is now supported with the `DockerSyncer` (#11035)
- Added type hints (#10806)
- Trials are now dynamically created (instead of created up front) (#10802)
- Use `tune.is_session_enabled()` in the Function API to toggle between Tune and non-Tune code (#10840)
- Support hierarchical search spaces for HyperOpt (#11431)
- The Tune function API now also supports `yield` and `return` statements (#10857)
- Tune now supports callbacks with `tune.run(callbacks=...)` (#11001)
- By default, the experiment directory will be dated (#11104)
- Tune now supports `reuse_actors` for the function API, which can largely accelerate tuning jobs.
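A minimal sketch combining `tune.with_parameters` with strict metric checking; the array, trainable, and metric name are hypothetical:

```python
import numpy as np
from ray import tune

data = np.random.rand(1000, 32)  # a large object, shipped once via the object store

def trainable(config, data=None):
    # 1.0.1 errors if the optimization metric is missing from the result dict.
    tune.report(mean_loss=float(data.mean()) * config["lr"])

tune.run(
    tune.with_parameters(trainable, data=data),
    config={"lr": tune.grid_search([0.01, 0.1])},
    metric="mean_loss",
    mode="min",
)
```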
Thanks
We thank all the contributors for their contributions to this release!
@acxz, @gekho457, @allenyin55, @AnesBenmerzoug, @michaelzhiluo, @SongGuyang, @maximsmol, @WangTaoTheTonic, @Basasuya, @sumanthratna, @juliusfrost, @maxco2, @Xuxue1, @jparkerholder, @AmeerHajAli, @raulchen, @justinkterry, @herve-alanaai, @richardliaw, @raoul-khour-ts, @C-K-Loan, @mattearllongshot, @robertnishihara, @internetcoffeephone, @Servon-Lee, @clay4444, @fangyeqing, @krfricke, @ffbin, @akotlar, @rkooo567, @chaokunyang, @PidgeyBE, @kfstorm, @barakmich, @amogkam, @edoakes, @ashione, @jseppanen, @ttumiel, @desktable, @pcmoritz, @ingambe, @ConeyLiu, @wuisawesome, @fyrestone, @oliverhu, @ericl, @weepingwillowben, @rkube, @alanwguo, @architkulkarni, @lasagnaphil, @rohitrawat, @ThomasLecat, @stephanie-wang, @suquark, @ijrsvt, @VishDev12, @Leemoonsoo, @scottwedge, @sven1977, @yiranwang52, @carlos-aguayo, @mvindiola1, @zhongchun, @mfitton, @simon-mo
ray-1.0.0
Ray 1.0
We're happy to announce the release of Ray 1.0, an important step towards the goal of providing a universal API for distributed computing.
To learn more about Ray 1.0, check out our blog post and whitepaper.
Ray Core
- The `ray.init()` and `ray start` commands have been cleaned up to remove deprecated arguments
- The Ray Java API is now stable
- Improved detection of Docker CPU limits
- Add support and documentation for Dask-on-Ray and MARS-on-Ray: https://docs.ray.io/en/master/ray-libraries.html
- Placement groups for fine-grained control over scheduling decisions (sketched below): https://docs.ray.io/en/latest/placement-group.html.
- New architecture whitepaper: https://docs.ray.io/en/master/whitepaper.html
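A minimal sketch of the placement group API named above; the bundles and task are hypothetical:

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Reserve two 1-CPU bundles, spread across different nodes.
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="SPREAD")
ray.get(pg.ready())  # block until the reservation is fulfilled

@ray.remote(num_cpus=1)
def task():
    return "scheduled inside the placement group"

print(ray.get(task.options(placement_group=pg).remote()))
```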
Autoscaler
- Support for multiple instance types in the same cluster: https://docs.ray.io/en/master/cluster/autoscaling.html
- Support for specifying GPU/accelerator type in `@ray.remote`
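A hedged sketch of requesting a specific accelerator type; the constant import path below follows the `ray.util.accelerators` module and should be treated as an assumption:

```python
import ray
from ray.util.accelerators import NVIDIA_TESLA_V100  # assumed constant name

ray.init()

@ray.remote(num_gpus=1, accelerator_type=NVIDIA_TESLA_V100)
def train():
    # Scheduled only on a node that reports a V100 GPU.
    return "training on a V100"
```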
Dashboard & Metrics
- Improvements to the memory usage tab and machine view
- The dashboard now supports visualization of actor states
- Support for Prometheus metrics reporting: https://docs.ray.io/en/latest/ray-metrics.html
RLlib
- Two model-based RL algorithms were added: MB-MPO (“Model-based meta-policy optimization”) and “Dreamer”. Both algorithms were benchmarked and perform comparably to the results reported in their respective papers.
- A “Curiosity” (intrinsic motivation) module was added via RLlib’s Exploration API and benchmarked on a sparse-reward Unity3D environment (Pyramids).
- Added documentation for the Distributed Execution API.
- Removed (already soft-deprecated) APIs: the Model(V1) class, Trainer config keys, and some methods/functions. Where these previously produced a deprecation warning, they now raise an error.
- Added DeepMind Control Suite examples.
Tune
Breaking changes:
- Multiple `tune.run` parameters have been deprecated: `ray_auto_init`, `run_errored_only`, `global_checkpoint_period`, `with_server` (#10518)
- The `tune.run` parameters `upload_dir`, `sync_to_cloud`, `sync_to_driver`, and `sync_on_checkpoint` have been moved to `tune.SyncConfig` [docs] (#10518)
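A hedged migration sketch for the sync settings above; the trainable and bucket URL are hypothetical:

```python
from ray import tune

def trainable(config):
    tune.report(score=0.0)

# Before 1.0: tune.run(trainable, upload_dir="s3://my-bucket/results", sync_on_checkpoint=True)
tune.run(
    trainable,
    sync_config=tune.SyncConfig(
        upload_dir="s3://my-bucket/results",  # hypothetical bucket
        sync_on_checkpoint=True,
    ),
)
```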
New APIs:
- `mode`, `metric`, and `time_budget` parameters for `tune.run` (#10627, #10642)
- Search Algorithms now share a uniform API (#10621, #10444). You can also use the new `create_scheduler`/`create_searcher` shim layer to create search algorithms/schedulers via string, reducing boilerplate code (see the sketch after this list) (#10456).
- Native callbacks for MXNet, Horovod, Keras, XGBoost, and PyTorch Lightning (#10533, #10304, #10509, #10502, #10220)
- PBT runs can be replayed with PopulationBasedTrainingReplay scheduler (#9953)
- Search Algorithms are saved/resumed automatically (#9972)
- New Optuna Search Algorithm docs (#10044)
- Tune now can sync checkpoints across Kubernetes pods (#10097)
- Failed trials can be rerun with `tune.run(resume="run_errored_only")` (#10060)
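A minimal sketch of the string-based shim layer together with the new `metric`/`mode` parameters; the shim name and the `time_budget_s` spelling are assumptions:

```python
from ray import tune

def trainable(config):
    for step in range(100):
        tune.report(score=config["x"] ** 2 + 1.0 / (step + 1))

tune.run(
    trainable,
    config={"x": tune.uniform(-1.0, 1.0)},
    scheduler=tune.create_scheduler("async_hyperband"),  # assumed shim name
    metric="score",
    mode="min",
    num_samples=10,
    time_budget_s=60,  # assumed parameter spelling; stops the experiment after 60 s
)
```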
Other Changes:
- Trial outputs can be saved to file via `tune.run(log_to_file=...)` (#9817)
- Trial directories can be customized, and the default trial directory now includes the trial name (#10608, #10214)
- Improved Experiment Analysis API (#10645)
- Support for Multi-objective search via SigOpt Wrapper (#10457, #10446)
- BOHB Fixes (#10531, #10320)
- Wandb improvements + RLlib compatibility (#10950, #10799, #10680, #10654, #10614, #10441, #10252, #8521)
- Updated documentation for FAQ, Tune+serve, search space API, lifecycle (#10813, #10925, #10662, #10576, #9713, #10222, #10126, #9908)
RaySGD
- Creator functions are subsumed by the TrainingOperator API (#10321)
- Training happens on actors by default (#10539)
Serve
- The `serve.client` API makes it easy to appropriately manage the lifetime of multiple Serve clusters. (#10460)
- Serve APIs are fully typed. (#10205, #10288)
- Backend configs are now typed and validated via Pydantic. (#10559, #10389)
- Progress towards application level backend autoscaler. (#9955, #9845, #9828)
- New architecture page in documentation. (#10204)
Thanks
We thank all the contributors for their contributions to this release!
@MissiontoMars, @ijrsvt, @desktable, @kfstorm, @lixin-wei, @Yard1, @chaokunyang, @justinkterry, @pxc, @ericl, @WangTaoTheTonic, @carlos-aguayo, @sven1977, @gabrieleoliaro, @alanwguo, @aryairani, @kishansagathiya, @barakmich, @rkube, @SongGuyang, @qicosmos, @ffbin, @PidgeyBE, @sumanthratna, @yushan111, @juliusfrost, @edoakes, @mehrdadn, @Basasuya, @icaropires, @michaelzhiluo, @fyrestone, @robertnishihara, @yncxcw, @oliverhu, @yiranwang52, @ChuaCheowHuan, @raphaelavalos, @suquark, @krfricke, @pcmoritz, @stephanie-wang, @hekaisheng, @zhijunfu, @Vysybyl, @wuisawesome, @sanderland, @richardliaw, @simon-mo, @janblumenkamp, @zhuohan123, @AmeerHajAli, @iamhatesz, @mfitton, @noahshpak, @maximsmol, @weepingwillowben, @raulchen, @09wakharet, @ashione, @henktillman, @architkulkarni, @rkooo567, @zhe-thoughts, @amogkam, @kisuke95, @clarkzinzow, @holli, @raoul-khour-ts
ray-0.8.7
Highlight
- Ray is moving towards 1.0! It has had several important naming changes.
- `ObjectID`s are now called `ObjectRef`s because they are not just IDs.
- The Ray Autoscaler is now called the Ray Cluster Launcher. The autoscaler will be a module of the Ray Cluster Launcher.
- The Ray Cluster Launcher now has a much cleaner and more concise output style. Try it out with `ray up --log-new-style`. The new output style will be enabled by default (with opt-out) in a later release.
- Windows is now officially supported by RLlib. Multi-node support for Windows is still in progress.
Cluster Launcher/CLI (formerly autoscaler)
- Highlight: This release contains a new colorful, concise output style for `ray up` and `ray down`, available with the `--log-new-style` flag. It will be enabled by default (with opt-out) in a later release. Full output style coverage for Cluster Launcher commands will also be available in a later release. (#9322, #9943, #9960, #9690)
- Documentation improvements (with guides and new sections) (#9687)
- Improved Cluster launcher docker support (#9001, #9105, #8840)
- Ray now has Docker images available on Docker hub. Please check out the ray image (#9732, #9556, #9458, #9281)
- Azure improvements (#8938)
- Improved on-prem cluster autoscaler (#9663)
- Add option for continuous sync of file mounts (#9544)
- Add `ray status` debug tool and `ray --version` (#9091, #8886)
- `ray memory` now also supports `redis_password` (#9492)
- Bug fixes for the Kubernetes cluster launcher mode (#9968)
- Various improvements: disabling the cluster config cache (#8117), Python API requires keyword arguments (#9256), removed fingerprint checking for SSH (#9133), Initial support for multiple worker types (#9096), various changes to the internal node provider interface (#9340, #9443)
Core
- Support Python type checking for Ray tasks (#9574)
- Rename ObjectID => ObjectRef (#9353)
- New GCS Actor manager on by default (#8845, #9883, #9715, #9473, #9275)
- Work towards placement groups (#9039)
- Plasma store process is merged with raylet (#8939, #8897)
- Option to automatically reconstruct objects stored in plasma after a failure. See the documentation for more information. (#9394, #9557, #9488)
- Many bug fixes.
RLlib
- New algorithm: “Model-Agnostic Meta-Learning” (MAML), an algorithm that learns and generalizes well across a distribution of environments.
- New algorithm: “Model-Based Meta-Policy-Optimization” (MB-MPO), our first model-based RL algorithm.
- Windows is now officially supported by RLlib.
- Native TensorFlow 2.x support. Use `framework="tf2"` in your config to tap into TF2’s full potential. Also: SAC, DDPG, DQN Rainbow, ES, and ARS now run in TF1.x eager mode.
- DQN PyTorch support for full Rainbow setup (including distributional DQN).
- Python type hints for Policy, Model, Offline, Evaluation, and Env classes.
- Deprecated “Policy Optimizer” package (in favor of new distributed execution API).
- Enhanced test coverage and stability.
- Flexible multi-agent replay modes and `replay_sequence_length`. We now allow storing sequences (over time) in replay buffers and retrieving “lock-stepped” multi-agent samples.
- Environments: Unity3D soccer game (tuned example/benchmark) and DM Control Suite wrapper and examples.
- Various Bug fixes: QMIX not learning, DDPG torch bugs, IMPALA learning rate updates, PyTorch custom loss, PPO not learning MuJoCo due to action clipping bug, DQN w/o dueling layer error.
Tune
- API Changes:
- You can now stop experiments upon convergence with Bayesian Optimization (#8808)
- `DistributedTrainableCreator`, a simple wrapper for distributed parameter tuning with multi-node DistributedDataParallel models (#9550, #9739)
- New integration and tutorial for using Ray Tune with Weights and Biases (Logger and native API) (#9725)
- Tune now provides a Scikit-learn compatible wrapper for hyperparameter tuning (#9129)
- New tutorials for integrations like XGBoost (#9060), multi GPU PyTorch (#9338), PyTorch Lightning (#9151, #9451), and Huggingface-Transformers (#9789)
- CLI Progress reporting improvements (#8802, #9537, #9525)
- Various bug fixes: handling of NaN values (#9381), Tensorboard logging improvements (#9297, #9691, #8918), enhanced cross-platform compatibility (#9141), re-structured testing (#9609), documentation reorganization and versioning (#9600, #9427, #9448)
RaySGD
Serve
- Horizontal scalability: Serve will now start one HTTP server per Ray node. (#9523)
- Various performance improvements matching Serve to FastAPI (#9490, #8709, #9531, #9479, #9225, #9216, #9485)
- API changes:
- `serve.shadow_traffic(endpoint, backend, fraction)` duplicates and sends a fraction of the incoming traffic to a specific backend (sketched below). (#9106)
- `serve.shutdown()` cleans up the current Serve instance in the Ray cluster. (#8766)
- An exception will be raised if `num_replicas` exceeds the maximum resources in the cluster (#9005)
- Added doc examples for how to perform metric monitoring and model composition.
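A hedged sketch of the traffic-shadowing API above; the backends, endpoint, and split fraction are hypothetical:

```python
from ray import serve

serve.init()

def model_v1(request):
    return "v1"

def model_v2(request):
    return "v2"

serve.create_backend("model_v1", model_v1)
serve.create_endpoint("predict", backend="model_v1", route="/predict")

# Mirror 10% of incoming traffic to the new backend without serving its responses.
serve.create_backend("model_v2", model_v2)
serve.shadow_traffic("predict", "model_v2", 0.1)

# ... later, tear down the Serve instance.
serve.shutdown()
```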
Dashboard
- Configurable Dashboard Port: The port on which the dashboard runs is now configurable using the `--dashboard-port` command-line argument and the `dashboard_port` argument to `ray.init`
- GPU monitoring improvements
- For machines with more than one GPU, the GPU and GRAM utilization is now broken out on a per-GPU basis.
- Assignments to physical GPUs are now shown at the worker level.
- Sortable Machine View: It is now possible to sort the machine view by almost any of its columns by clicking next to the title. In addition, whereas the workers are normally grouped by node, you can now ungroup them if you only want to see details about workers.
- Actor Search Bar: It is now possible to search for actors by their title (the class name of the actor in Python, in addition to the arguments it received).
- Logical View UI Updates: This includes things like color-coded names for each of the actor states, a more grid-like layout, and tooltips for the various data.
- Sortable Memory View: Like the machine view, the memory view now has sortable columns and can be grouped / ungrouped by node.
Windows Support
Others
- Ray Streaming Library Improvements (#9240, #8910, #8780)
- Java Support Improvements (#9371, #9033, #9037, #9032, #8858, #9777, #9836, #9377)
- Parallel Iterator Improvements (#8964, #8978)
Thanks
We thank the following contributors for their work on this release:
@jsuarez5341, @amitsadaphule, @krfricke, @williamFalcon, @richardliaw, @heyitsmui, @mehrdadn, @robertnishihara, @gabrieleoliaro, @amogkam, @fyrestone, @mimoralea, @edoakes, @andrijazz, @ElektroChan89, @kisuke95, @justinkterry, @SongGuyang, @barakmich, @bloodymeli, @simon-mo, @TomVeniat, @lixin-wei, @alanwguo, @zhuohan123, @michaelzhiluo, @ijrsvt, @pcmoritz, @LecJackS, @sven1977, @ashione, @JerryLeeCS, @raphaelavalos, @stephanie-wang, @ruifangChen, @vnlitvinov, @yncxcw, @weepingwillowben, @goulou, @acmore, @wuisawesome, @gramhagen, @anabranch, @internetcoffeephone, @Alisahhh, @henktillman, @deanwampler, @p-christ, @Nicolaus93, @WangTaoTheTonic, @allenyin55, @kfstorm, @rkooo567, @ConeyLiu, @09wakharet, @piojanu, @mfitton, @KristianHolsheimer, @AmeerHajAli, @pdames, @ericl, @VishDev12, @suquark, @stefanbschneider, @raulchen, @dcfidalgo, @chappers, @aaarne, @chaokunyang, @sumanthratna, @clarkzinzow, @BalaBalaYi, @maximsmol, @zhongchun, @wumuzi520, @ffbin
ray-0.8.6
Highlight
- Experimental support for Windows is now available for single node Ray usage. Check out the Windows section below for known issues and other details.
- Have you had trouble monitoring GPU or memory usage while using Ray? The Ray dashboard now supports GPU monitoring and a memory view.
- Want to use RLlib with Unity? RLlib officially supports the Unity3D adapter! Please check out the documentation.
- Ray Serve is ready for feedback! We've gotten feedback from many users, and Ray Serve is already being used in production. Please reach out to us with your use cases, ideas, documentation improvements, and feedback. We'd love to hear from you. Please do so on the Ray Slack and join #serve! Please see the Serve section below for more details.
Core
- We’ve introduced a new feature to automatically retry failed actor tasks after an actor has been restarted by Ray (by specifying `max_restarts` in `@ray.remote`). Try it out with `max_task_retries=-1`, where -1 indicates that the system can retry the task until it succeeds (sketched below).
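A minimal sketch of the retry options named above; the actor class is hypothetical:

```python
import ray

ray.init()

# Ray restarts the actor up to 5 times if its process dies, and retries
# failed actor tasks until they succeed (max_task_retries=-1).
@ray.remote(max_restarts=5, max_task_retries=-1)
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

counter = Counter.remote()
print(ray.get(counter.increment.remote()))
```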
API Change
- To enable automatic restarts of a failed actor, you must now use `max_restarts` in the `@ray.remote` decorator instead of `max_reconstructions`. You can use -1 to indicate infinity, i.e., the system should always restart the actor if it fails unexpectedly.
- We’ve merged the named and detached actor APIs. To create an actor that will survive past the duration of its job (a “detached” actor), specify `name=<str>` in its remote constructor (`Actor.options(name='<str>').remote()`). To delete the actor, you can use `ray.kill` (sketched below).
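A hedged sketch of the merged naming API above; the actor class and name are hypothetical:

```python
import ray

ray.init()

@ray.remote
class Registry:
    def lookup(self, key):
        return f"value-for-{key}"

# Giving the actor a name makes it survive past its creating job ("detached").
registry = Registry.options(name="registry").remote()
print(ray.get(registry.lookup.remote("model")))

ray.kill(registry)  # explicitly delete the named actor when done
```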
RLlib
- PyTorch: IMPALA PyTorch version, and all `rllib/examples` scripts now work for either TensorFlow or PyTorch (`--torch` command-line option).
- Switched to using the distributed execution API by default (replaces Policy Optimizers) for all algorithms.
- Unity3D adapter (supports all Env types: multi-agent, external env, vectorized) with example scripts for running locally or in the cloud.
- Added support for variable length observation Spaces ("Repeated").
- Added support for arbitrarily nested action spaces.
- Added experimental GTrXL (Transformer/Attention net) support to RLlib + learning tests for PPO and IMPALA.
- QMIX now supports complex observation spaces.
API Change
- Retired the `use_pytorch` and `eager` flags in configs and replaced them with `framework=[tf|tfe|torch]` (see the sketch after this list).
- Deprecated PolicyOptimizers in favor of the new distributed execution API.
- Retired support for Model(V1) class. Custom Models should now only use the ModelV2 API. There is still a warning when using ModelV1, which will be changed into an error message in the next release.
- Retired TupleActions (in favor of arbitrarily nested action Spaces).
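A minimal sketch of the new framework flag, assuming a standard CartPole training run via Tune:

```python
from ray import tune

# "framework" replaces the old use_pytorch/eager flags:
# "tf" = TF graph mode, "tfe" = TF eager mode, "torch" = PyTorch.
tune.run(
    "PPO",
    config={"env": "CartPole-v0", "framework": "torch"},
    stop={"episode_reward_mean": 150},
)
```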
Ray Tune / RaySGD
- There is now a Dataset API for handling large datasets with RaySGD. (#7839)
- You can now filter by an average of the last results using the `ExperimentAnalysis` tool (#8445).
- BayesOptSearch received numerous contributions, enabling preliminary random search and warm starting. (#8541, #8486, #8488)
API Changes
- `tune.report` is now the right way to use the Tune function API; `tune.track` is deprecated (sketched below) (#8388)
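A minimal function-API sketch using `tune.report` in place of the deprecated `tune.track`; the trainable is hypothetical:

```python
from ray import tune

def trainable(config):
    for step in range(10):
        # Previously: tune.track.log(mean_loss=...)
        tune.report(mean_loss=config["lr"] / (step + 1))

tune.run(trainable, config={"lr": tune.grid_search([0.01, 0.1])})
```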
Serve
- New APIs to inspect and manage Serve objects:
- `serve.create_endpoint` now requires specifying the backend directly. You can remove `serve.set_traffic` if there's only one backend per endpoint. (#8764)
- `serve.init` API cleanup (several options were removed).
- `serve.init` now supports namespacing with `name`. You can run multiple Serve clusters with different names on the same Ray cluster. (#8449)
- You can specify session affinity when splitting traffic across backends using the `X-SERVE-SHARD-KEY` HTTP header. (#8449)
- Various documentation improvements.
Dashboard / Metrics
- The Machine View of the dashboard now shows information about GPU utilization such as:
- Average GPU/GRAM utilization at a node and cluster level
- Worker-level information about how many GPUs each worker is assigned as well as its GRAM use.
- The dashboard has a new Memory View tab that should be very useful for debugging memory issues. It has:
- Information about objects in the Ray object store, including size and call-site
- Information about reference counts and what is keeping an object pinned in the Ray object store.
Small changes
- IDLE workers get automatically sorted to the end of the worker list in the Machine View
Autoscaler
- Improved logging output. Errors are more clearly propagated and excess output has been reduced. (#7198, #8751, #8753)
- Added support for k8s services.
API Changes
- `ray up` accepts remote URLs that point to the desired cluster YAML. (#8279)
Windows support
- Windows wheels are now available for basic experimental usage (via `ray.init()`).
- Windows support is currently unstable. Unusual, unattended, or production usage is not recommended.
- Various functionality may still lack support, including Ray Serve, Ray SGD, the autoscaler, the dashboard, non-ASCII file paths, etc.
- Please check the latest nightly wheels & known issues (#9114), and let us know if any issue you encounter has not yet been addressed.
- Wheels are available for Python 3.6, 3.7, and 3.8. (#8369)
- redis-py has been patched for Windows sockets. (#8386)
Others
- Moving towards highly available Ray (#8650, #8639, #8606, #8601, #8591, #8442)
- Java Support (#8730, #8640, #8637)
- Ray streaming improvements (#8612, #8594, #7464)
- Parallel iterator improvements (#8140, #7931, #8712)
Thanks
We thank the following contributors for their work on this release:
@pcmoritz, @akharitonov, @devanderhoff, @ffbin, @anabranch, @jasonjmcghee, @kfstorm, @mfitton, @alecbrick, @simon-mo, @konichuvak, @aniryou, @wuisawesome, @robertnishihara, @ramanNarasimhan77, @09wakharet, @richardliaw, @istoica, @ThomasLecat, @sven1977, @ceteri, @acxz, @iamhatesz, @JarnoRFB, @rkooo567, @mehrdadn, @thomasdesr, @janblumenkamp, @ujvl, @edoakes, @maximsmol, @krfricke, @amogkam, @gehring, @ijrsvt, @internetcoffeephone, @LucaCappelletti94, @chaokunyang, @WangTaoTheTonic, @fyrestone, @raulchen, @ConeyLiu, @stephanie-wang, @suquark, @ashione, @Coac, @JosephTLucas, @ericl, @AmeerHajAli, @pdames
Ray 0.8.5
Highlight
- You can now cancel remote tasks using the `ray.cancel` API (sketched in the Core section below).
- PyTorch is now a first-class citizen in RLlib! We've achieved parity between TensorFlow and PyTorch.
- Did you struggle to find good example code for Ray ML libraries? We wrote more examples for Ray SGD and Ray Serve.
- Ray Serve: Keras/TensorFlow, PyTorch, Scikit-Learn.
- Ray SGD: New Semantic Segmentation and HuggingFace GLUE Fine-tuning Examples.
Core
- Task cancellation is now available for locally submitted tasks (sketched below). (#7699)
- Experimental support for recovering objects that were lost from the Ray distributed memory store. You can try this out by setting `lineage_pinning_enabled: 1` in the internal config. (#7733)
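A minimal cancellation sketch; the long-running task is hypothetical, and the broad exception catch is an assumption since the exact error raised for cancelled tasks has varied across versions:

```python
import time
import ray

ray.init()

@ray.remote
def long_task():
    time.sleep(3600)

ref = long_task.remote()
ray.cancel(ref)  # request cancellation of the in-flight task

try:
    ray.get(ref)
except ray.exceptions.RayError:  # cancelled tasks error out on get
    print("task was cancelled")
```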
RLlib
- PyTorch support has now reached parity with TensorFlow. (#7926, #8188, #8120, #8101, #8106, #8104, #8082, #7953, #7984, #7836, #7597, #7797)
- Improved callbacks API. (#6972)
- Enable Ray distributed reference counting. (#8037)
- Work towards customizable distributed training workflows. (#7958, #8077)
Tune
- Documentation has improved with a new format. (#8083, #8201, #7716)
- Search algorithms were refactored to make them easier to extend, deprecating the `max_concurrent` argument. (#7037, #8258, #8285)
- TensorboardX errors are now handled safely. (#8174)
- Bug fix in PBT checkpointing. (#7794)
- New ZOOpt search algorithm added. (#7960)
Serve
- Improved APIs.
- Added overview section to the documentation.
- Added tutorials for serving models in Tensorflow/Keras, PyTorch, and Scikit-Learn.
- Made Serve clusters tolerant to process failures. (#8116, #8008, #7970, #7936)
SGD
- New Semantic Segmentation and HuggingFace GLUE Fine-tuning Examples. (#7792, #7825)
- Fix GPU Reservations in SLURM usage. (#8157)
- Update learning rate scheduler stepping parameter. (#8107)
- Make serialization of data creation optional. (#8027)
- Automatic DDP wrapping is now optional. (#7875)
Other Projects
- Progress towards the highly available and fault tolerant control plane. (#8144, #8119, #8145, #7909, #7949, #7771, #7557, #7675)
- Progress towards the Ray streaming library. (#8044, #7827, #7955, #7961, #7348)
- Autoscaler improvement. (#8178, #8168, #7986, #7844, #7717)
- Progress towards Java support. (#8014)
- Progress towards Windows compatibility. (#8237, #8186)
- Progress towards cross language support. (#7711)
Thanks
We thank the following contributors for their work on this release:
@simon-mo, @robertnishihara, @BalaBalaYi, @ericl, @kfstorm, @tirkarthi, @nflu, @ffbin, @chaokunyang, @ijrsvt, @pcmoritz, @mehrdadn, @sven1977, @iamhatesz, @nmatthews-asapp, @mitchellstern, @edoakes, @anabranch, @billowkiller, @eisber, @ujvl, @allenyin55, @yncxcw, @deanwampler, @DavidMChan, @ConeyLiu, @micafan, @rkooo567, @datayjz, @wizardfishball, @sumanthratna, @ashione, @marload, @stephanie-wang, @richardliaw, @jovany-wang, @MissiontoMars, @aannadi, @fyrestone, @JarnoRFB, @wumuzi520, @roireshef, @acxz, @gramhagen, @Servon-Lee, @clarkzinzow, @mfitton, @maximsmol, @janblumenkamp, @istoica
Ray 0.8.4
Highlight
- Add Python 3.8 support. (#7754)
Core
- Fix asyncio actor deserialization. (#7806)
- Fix a segfault caused by a symbol collision when importing Pyarrow. (#7568)
- `ray memory` will collect statistics from all nodes. (#7721)
- Pin lineage of plasma objects that are still in scope. (#7690)
RLlib
- Add contextual bandit algorithms. (#7642)
- Add parameter noise exploration API. (#7772)
- Add scaling guide. (#7780)
- Enable restoring a Keras model from an h5 file. (#7482)
- Store tf-graph by default when doing `Policy.export_model()`. (#7759)
- Fix default policy overriding torch policy. (#7756, #7769)
RaySGD
- BREAKING: Add new API for tuning TorchTrainer using Tune. (#7547)
- BREAKING: Convert the head worker to a local model. (#7746)
- Added a new API for save/restore. (#7547)
- Add tqdm support to TorchTrainer. (#7588)
Tune
- Add sorted columns and TensorBoard to Tune tab. (#7140)
- Tune experiments can now be cancelled via the REST client. (#7719)
- `fail_fast` enables experiments to fail quickly. (#7528)
- Override the IP retrieval process if needed. (#7705)
- TensorBoardX nested dictionary support. (#7705)
Serve
- Performance improvements:
- Add async methods support for serve actors. (#7682)
- Add multiple method support for serve actors. (#7709)
- You can specify HTTP methods in `serve.create_backend(..., methods=["GET", "POST"])`.
- The ability to specify which actor method to execute over HTTP, through the `X-SERVE-CALL-METHOD` header or in `RayServeHandle` through `handle.options("method").remote(...)`.
Others
- Progress towards highly available control plane. (#7822, #7742)
- Progress towards Windows compatibility. (#7740, #7739, #7657)
- Progress towards Ray Streaming library. (#7813)
- Progress towards metrics export service. (#7809)
- Basic C++ worker implementation. (#6125)
Thanks
We thank the following contributors for their work on this release:
@carlbalmer, @BalaBalaYi, @saurabh3949, @maximsmol, @SongGuyang, @istoica, @pcmoritz, @aannadi, @kfstorm, @ijrsvt, @richardliaw, @mehrdadn, @wumuzi520, @cloudhan, @edoakes, @mitchellstern, @robertnishihara, @hhoke, @simon-mo, @ConeyLiu, @stephanie-wang, @rkooo567, @ffbin, @ericl, @hubcity, @sven1977
Ray 0.8.3
Highlights
- Autoscaler has added Azure Support. (#7080, #7515, #7558, #7494)
- Ray autoscaler helps you launch a distributed ray cluster using a single command line call!
- It works on Azure, AWS, GCP, Kubernetes, Yarn, Slurm and local nodes.
- Distributed reference counting is turned on by default. (#7628, #7337)
- This means all Ray objects are tracked and garbage collected only when all references go out of scope. It can be turned off with `ray.init(_internal_config=json.dumps({"distributed_ref_counting_enabled": 0}))`.
- When the object store is full of objects that are still in scope, you can turn on least-recently-used eviction to force-remove objects using `ray.init(lru_evict=True)`.
- A new command, `ray memory`, has been added to help debug memory usage. (#7589)
- It shows all object IDs that are in scope, along with their reference types, sizes, and creation sites.
- Read more in the docs: https://ray.readthedocs.io/en/latest/memory-management.html.
> ray memory
-----------------------------------------------------------------------------------------------------
Object ID Reference Type Object Size Reference Creation Site
=====================================================================================================
; worker pid=51230
ffffffffffffffffffffffff0100008801000000 PINNED_IN_MEMORY 8231 (deserialize task arg) __main__..sum_task
; driver pid=51174
45b95b1c8bd3a9c4ffffffff010000c801000000 USED_BY_PENDING_TASK ? (task call) memory_demo.py:<module>:13
ffffffffffffffffffffffff0100008801000000 USED_BY_PENDING_TASK 8231 (put object) memory_demo.py:<module>:6
ef0a6c221819881cffffffff010000c801000000 LOCAL_REFERENCE ? (task call) memory_demo.py:<module>:14
-----------------------------------------------------------------------------------------------------
API change
- Change `actor.__ray_kill__()` to `ray.kill(actor)`. (#7360)
- Deprecate the `use_pickle` flag for serialization. (#7474)
- Remove `experimental.NoReturn`. (#7475)
- Remove the `experimental.signal` API. (#7477)
Core
- Add Apache 2 license header to C++ files. (#7520)
- Reduce per worker memory usage to 50MB. (#7573)
- Option to fallback to LRU on OutOfMemory. (#7410)
- Reference counting for actor handles. (#7434)
- Reference counting for returning object IDs created by a different process. (#7221)
- Use `prctl(PR_SET_PDEATHSIG)` on Linux instead of reaper. (#7150)
- Route asyncio plasma through the raylet instead of a direct plasma connection. (#7234)
- Remove static concurrency limit from gRPC server. (#7544)
- Remove `get_global_worker()` and `RuntimeContext`. (#7638)
- Fix known issues from the 0.8.2 release.
RLlib
- New features:
- Bug fix highlights:
Tune
- Integrate Dragonfly optimizer. (#5955)
- Fix HyperBand errors. (#7563)
- Access Trial Name, Trial ID inside trainable. (#7378)
- Add a new `Repeater` class for high-variance trials. (#7366)
- Prevent deletion of checkpoints during user-initiated restoration. (#7501)
Libraries
- [Parallel Iterators] Allow for operator chaining after repartition. (#7268)
- [Parallel Iterators] Repartition functionality. (#7163)
- [Serve] `@serve.route` returns a handle; add `handle.scale` and `handle.set_max_batch_size`. (#7569)
- [RaySGD] PyTorchTrainer --> TorchTrainer. (#7425)
- [RaySGD] Custom training API. (#7211)
- [RaySGD] Breaking User API changes: (#7384)
- The `data_creator` fed to TorchTrainer now must return a dataloader rather than datasets.
- TorchTrainer automatically sets up a `DistributedSampler` if a DataLoader is returned.
- `data_loader_config` and `batch_size` are no longer parameters for TorchTrainer.
- TorchTrainer parallelism is now set by `num_workers`.
- All TorchTrainer args must now be named parameters.
Java
- New Java actor API (#7414)
- The `@RayRemote` annotation is removed.
- Instead of `Ray.call(ActorClass::method, actor)`, the new API is `actor.call(ActorClass::method)`.
- Allow passing internal config from raylet to Java worker. (#7532)
- Enable direct call by default. (#7408)
- Pass large object by reference. (#7595)
Others
- Progress towards Ray Streaming, including a Python API. (#7070, #6755, #7152, #7582)
- Progress towards GCS Service for GCS fault tolerance. (#7292, #7592, #7601, #7166)
- Progress towards cross language call between Java and Python. (#7614, #7634)
- Progress towards Windows compatibility. (#7529, #7509, #7658, #7315)
- Improvement in K8s Operator. (#7521, #7621, #7498, #7459, #7622)
- New documentation for Ray Dashboard. (#7304)
Known issues
- Ray currently doesn't work on Python 3.5.0, but works on 3.5.3 and above.
Thanks
We thank the following contributors for their work on this release:
@rkooo567, @maximsmol, @suquark, @mitchellstern, @micafan, @clarkzinzow, @Jimpachnet, @mwbrulhardt, @ujvl, @chaokunyang, @robertnishihara, @jovany-wang, @hyeonjames, @zhijunfu, @datayjz, @fyrestone, @eisber, @stephanie-wang, @allenyin55, @BalaBalaYi, @simon-mo, @thedrow, @ffbin, @amogkam, @tisonkun, @richardliaw, @ijrsvt, @wumuzi520, @mehrdadn, @raulchen, @landcold7, @ericl, @edoakes, @sven1977, @ashione, @jorenretel, @gramhagen, @kfstorm, @anthonyhsyu, @pcmoritz
Ray 0.8.2
Highlights
- Pyarrow is no longer vendored. Ray directly uses the C++ Arrow API, and you can use any version of pyarrow with Ray. (#7233)
- The dashboard is turned on by default. It shows node and process information, actor information, and Ray Tune trial information. You can also use `ray.show_in_webui` to display custom messages for actors. Please try it out and send us feedback! (#6705, #6820, #6822, #6911, #6932, #6955, #7028, #7034)
- We have made progress on distributed reference counting (behind a feature flag). You can try it out with `ray.init(_internal_config=json.dumps({"distributed_ref_counting_enabled": 1}))`. It is designed to help manage memory using precise distributed garbage collection. (#6945, #6946, #7029, #7075, #7218, #7220, #7222, #7235, #7249)
Breaking changes
- Many experimental Ray libraries are moved to the `ray.util` namespace (see the sketch after this list). (#7100)
- `ray.experimental.multiprocessing` => `ray.util.multiprocessing`
- `ray.experimental.joblib` => `ray.util.joblib`
- `ray.experimental.iter` => `ray.util.iter`
- `ray.experimental.serve` => `ray.serve`
- `ray.experimental.sgd` => `ray.util.sgd`
- Tasks and actors are cleaned up if their owner process dies. (#6818)
- The `OMP_NUM_THREADS` environment variable defaults to 1 if unset. This improves training performance and reduces resource contention. (#6998)
- We now vendor `psutil` and `setproctitle` to support turning the dashboard on by default. Running `import psutil` after `import ray` will use the version of psutil that ships with Ray. (#7031)
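A minimal sketch of the new import paths; the squaring workload is hypothetical:

```python
# Old: from ray.experimental.multiprocessing import Pool
from ray.util.multiprocessing import Pool

def square(x):
    return x * x

pool = Pool(processes=4)  # each pool worker runs on the Ray cluster
print(pool.map(square, range(8)))
```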
Core
- The Python raylet client is removed. All raylet communication now goes through the core worker. (#6018)
- Calling `delete()` will not delete objects in the in-memory store. (#7117)
- Removed vanilla pickle serialization for task arguments. (#6948)
- Fix bug passing empty bytes into Python tasks. (#7045)
- Progress toward the next-generation Ray scheduler. (#6913)
- Progress toward service based global control store (GCS). (#6686, #7041)
RLlib
- Improved PyTorch support, including a PyTorch version of PPO. (#6826, #6770)
- Added distributed SGD for PPO. (#6918, #7084)
- Added an exploration API for controlling epsilon greedy and stochastic exploration. (#6974, #7155)
- Fixed schedule values going negative past the end of the schedule. (#6971, #6973)
- Added support for histogram outputs in TensorBoard. (#6942)
- Added support for parallel and customizable evaluation step. (#6981)
Tune
- Improved Ax Example. (#7012)
- Process saves asynchronously. (#6912)
- Default to tensorboardx and include it in requirements. (#6836)
- Added experiment stopping API. (#6886)
- Expose progress reporter to users. (#6915)
- Fix directory naming regression. (#6839)
- Handle the NaN case for AsyncHyperBand. (#6916)
- Prevent memory checkpoints from breaking trial fault tolerance. (#6691)
- Remove keras dependency. (#6827)
- Remove unused tf loggers. (#7090)
- Set correct path when deleting checkpoint folder. (#6758)
- Support callable objects in variant generation. (#6849)
Autoscaler
- Ray nodes now respect docker limits. (#7039)
- Add `--all-nodes` option to rsync-up. (#7065)
- Add port-forwarding support for attach. (#7145)
- For AWS, default to latest deep learning AMI. (#6922)
- Added 'ray dashboard' command to proxy ray dashboard in remote machine. (#6959)
Utility libraries
- Support for scikit-learn with the Ray joblib backend. (#6925)
- Parallel iterators support local shuffle. (#6921)
- [Serve] Support no-HTTP (headless) services. (#7010)
- [Serve] Refactor the router to use Ray asyncio support. (#6873)
- [Serve] Support composing arbitrary DAGs. (#7015)
- [RaySGD] Support fp16 via PyTorch apex. (#7061)
- [RaySGD] Refactor PyTorch SGD documentation. (#6910)
- Improvement in Ray Streaming. (#7043, #6666, #7071)
Other improvements
- Progress toward Windows compatibility. (#6882, #6823)
- Ray Kubernetes operator improvements. (#6852, #6851, #7091)
- Java support for concurrent actor calls API. (#7022)
- Java support for direct call for normal tasks. (#7193)
- Java support for cross language Python invocation. (#6709)
- Java support for cross language serialization for actor handles. (#7134)
Known issues
- Passing the same ObjectIDs multiple times as arguments currently doesn't work. (#7296)
- Tasks can exceed gRPC max message size. (#7263)
Thanks
We thank the following contributors for their work on this release:
@mitchellstern, @hugwi, @deanwampler, @alindkhare, @ericl, @ashione, @fyrestone, @robertnishihara, @pcmoritz, @richardliaw, @yutaizhou, @istoica, @edoakes, @ls-daniel, @BalaBalaYi, @raulchen, @justinkterry, @roireshef, @elpollouk, @kfstorm, @Bassstring, @hhbyyh, @Qstar, @mehrdadn, @chaokunyang, @flying-mojo, @ujvl, @AnanthHari, @rkooo567, @simon-mo, @jovany-wang, @ijrsvt, @ffbin, @AmeerHajAli, @gaocegege, @suquark, @MissiontoMars, @zzyunzhi, @sven1977, @stephanie-wang, @amogkam, @wuisawesome, @aannadi, @maximsmol
ray-0.8.1
Ray 0.8.1 Release Notes
Highlights
- `ObjectID`s corresponding to `ray.put()` objects and task returns are now reference counted locally in Python and when passed into a remote task as an argument. `ObjectID`s that have a nonzero reference count will not be evicted from the object store. Note that references for `ObjectID`s passed into remote tasks inside of other objects (e.g., `f.remote((ObjectID,))` or `f.remote([ObjectID])`) are not currently accounted for. (#6554)
- `asyncio` actor support: actors can now define `async def` methods, and Ray will run multiple method invocations in the same event loop. The maximum concurrency level can be adjusted with `ActorClass.options(max_concurrency=2000).remote()` (see the sketch after this list).
- `asyncio` `ObjectID` support: Ray ObjectIDs can now be directly awaited using the Python API. `await my_object_id` is similar to `ray.get(my_object_id)`, but allows context switching to make the operation non-blocking. You can also convert an `ObjectID` to an `asyncio.Future` using `ObjectID.as_future()`.
- Added experimental parallel iterators API (#6644, #6726): `ParallelIterator`s can be used to more conveniently load and process data into Ray actors. See the documentation for details.
- Added a multiprocessing.Pool API (#6194): Ray now supports the `multiprocessing.Pool` API out of the box, so you can scale existing programs up from a single node to a cluster by only changing the import statement. See the documentation for details.
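A minimal sketch of the asyncio actor support named above; the sleep workload and concurrency value are hypothetical:

```python
import asyncio
import ray

ray.init()

@ray.remote
class AsyncActor:
    async def work(self, i):
        # While this coroutine awaits, other invocations of work()
        # can run in the same event loop.
        await asyncio.sleep(0.1)
        return i

actor = AsyncActor.options(max_concurrency=100).remote()
print(ray.get([actor.work.remote(i) for i in range(4)]))
```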
Core
- Deprecated Python 2 (#6581, #6601, #6624, #6665)
- Fixed bug when failing to import remote functions or actors with args and kwargs (#6577)
- Many improvements to the dashboard (#6493, #6516, #6521, #6574, #6590, #6652, #6671, #6683, #6810)
- Progress towards Windows compatibility (#6446, #6548, #6653, #6706)
- Redis now binds to localhost and has a password set by default (#6481)
- Added `actor.__ray_kill__()` to terminate actors immediately (#6523)
- Added a `ray stat` command for debugging (#6622)
- Added documentation for fault tolerance behavior (#6698)
- Treat static methods as class methods instead of instance methods in actors (#6756)
RLlib
- DQN distributional model: Replace all legacy tf.contrib imports with tf.keras.layers.xyz or tf.initializers.xyz (#6772)
- SAC site changes (#6759)
- PG unify/cleanup tf vs torch and PG functionality test cases (tf + torch) (#6650)
- SAC for Mujoco Environments (#6642)
- Tuple action dist tensors not reduced properly in eager mode (#6615)
- Changed foreach_policy to foreach_trainable_policy (#6564)
- Wrapper for the dm_env interface (#6468)
Tune
- Get checkpoints paths for a trial after tuning (#6643)
- Async restores and S3/GCP-capable trial FT (#6376)
- Usability errors PBT (#5972)
- Demo exporting trained models in pbt examples (#6533)
- Avoid duplication in TrialRunner execution (#6598)
- Update params for optimizer in reset_config (#6522)
- Support Type Hinting for py3 (#6571)
Other Libraries
- [serve] Pluggable Queueing Policy (#6492)
- [serve] Added BackendConfig (#6541)
- [sgd] Fault tolerance support for pytorch + revamp documentation (#6465)
Thanks
We thank the following contributors for their work on this release:
@chaokunyang, @Qstar, @simon-mo, @wlx65003, @stephanie-wang, @alindkhare, @ashione, @harrisonfeng, @JingGe, @pcmoritz, @zhijunfu, @BalaBalaYi, @kfstorm, @richardliaw, @mitchellstern, @michaelzhiluo, @ziyadedher, @istoica, @EyalSel, @ffbin, @raulchen, @edoakes, @chenk008, @frthjf, @mslapek, @gehring, @hhbyyh, @zzyunzhi, @zhu-eric, @MissiontoMars, @sven1977, @walterddr, @micafan, @inventormc, @robertnishihara, @ericl, @ZhongxiaYan, @mehrdadn, @jovany-wang, @ujvl, @bharatpn
Ray 0.8.0 Release Notes
This is the first release with gRPC direct calls enabled by default for both tasks and actors, which substantially improves task submission performance.
Highlights
- Enable gRPC direct calls by default (#6367). In this mode, actor tasks are sent directly from actor to actor over gRPC; the Raylet only coordinates actor creation. Similarly, tasks are submitted directly from worker to worker over gRPC; the Raylet only coordinates the scheduling decisions. In addition, small objects (<100KB in size) are no longer placed in the object store; they are inlined into task submissions and returns when possible. Note: in some cases, reconstruction of large evicted objects is not possible with direct calls. To revert to the 0.7.7 behavior, you can set the environment variable `RAY_FORCE_DIRECT=0`.
Core
- [Dashboard] Add remaining features from old dashboard (#6489)
- Ray Kubernetes Operator Part 1: readme, structure, config, and CRD-related files (#6332)
- Make sure numpy >= 1.16.0 is installed for fast pickling support (#6486)
- Avoid workers starting with the same random seed (#6471)
- Properly handle a forwarded task that gets forwarded back (#6271)
RLlib
- (Bug Fix): Remove the extra 0.5 in the Diagonal Gaussian entropy (#6475)
- AlphaZero and Ranked reward implementation (#6385)
Tune
- Add example and tutorial for DCGAN (#6400)
- Report trials by state fairly (#6395)
- Fixed bug in PBT where initial trial result is empty. (#6351)
Other Libraries
- [sgd] Add support for multi-model multi-optimizer training (#6317)
- [serve] Added deadline awareness (#6442)
- [projects] Return parameters for a command (#6409)
- [streaming] Streaming data transfer and python integration (#6185)
Thanks
We thank the following contributors for their work on this release:
@zplizzi, @istoica, @ericl, @mehrdadn, @walterddr, @ujvl, @alindkhare, @timgates42, @chaokunyang, @eugenevinitsky, @kfstorm, @Maltimore, @visatish, @simon-mo, @AmeerHajAli, @wumuzi520, @robertnishihara, @micafan, @pcmoritz, @zhijunfu, @edoakes, @sytelus, @ffbin, @richardliaw, @Qstar, @stephanie-wang, @Coac, @mitchellstern, @MissiontoMars, @deanwampler, @hhbyyh, @raulchen