Breaking changes
- Drop support for Python 3.7
- Upgrade the Docker base image PyTorch version to 2.0
torchx.schedulers
- Add support for options in the create_schedulers factory method, allowing scheduler configuration in the runner
- Kubernetes MCAD Scheduler
  - Add support for retrying
  - Test, formatting and documentation updates
- AWS Batch Scheduler
  - Fix logging rank attribution
- Ray Scheduler
  - Add ability to programmatically define the Ray job client
torchx.tracker
- Fix adding artifacts to MLFlowTracker from multiple ranks
torchx.components
- dist.ddp
  - Add ability to specify the rendezvous backend, with c10d as the default mechanism (see the sketch below)
  - Add node_rank parameter value for static rank setup
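A minimal sketch of the new option, assuming the dist.ddp parameter name rdzv_backend and the programmatic runner API; the script name, resource layout, and scheduler choice are placeholders:

```python
import torchx.components.dist as dist
from torchx.runner import get_runner

# c10d is now the default rendezvous backend; it is spelled out here only
# for illustration. "train.py" and the 2x4 layout are placeholders.
app = dist.ddp(script="train.py", j="2x4", rdzv_backend="c10d")

runner = get_runner()
handle = runner.run(app, scheduler="local_cwd", cfg={})
print(runner.status(handle))
```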
torchx.runner
- Resolve run_opts when passing to torchx.workspace and for dry-run, so the values are correctly populated (see the sketch below)
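A sketch of the dry-run path this fixes; runner.dryrun and the builtin utils.echo component are existing APIs, while the scheduler choice is illustrative:

```python
import torchx.components.utils as utils
from torchx.runner import get_runner

# dryrun resolves the scheduler's run_opts (including defaults) without
# submitting anything; printing the request shows the populated values.
app = utils.echo(msg="hello torchx")
runner = get_runner()
dryrun_info = runner.dryrun(app, scheduler="local_cwd", cfg={})
print(dryrun_info)
```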
torchx.runner.events
- Add support to log CPU and wall times
-
torchx.cli
- Wait for the app to start when tailing logs
torchx.specs
- Role.resources now uses a default_factory to initialize its value
torchx.schedulers
- Kubernetes MCAD Scheduler (Prototype)
  - Newly added integration for easily scheduling jobs via the Multi-Cluster-Application-Dispatcher (MCAD) on Kubernetes (see the sketch after this list)
  - Features include:
    - scheduling different types of components, including DDP components
    - scheduling on different compute resources (CPU, GPU)
    - support for the Docker workspace
    - support for bind, volume and device mounts
    - getting logs for jobs
    - describing, listing and cancelling jobs
    - can be used with a secondary scheduler on Kubernetes
- AWS Batch
  - Add privileged option to enable running containers with elevated networking permissions on EFA-enabled instances
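A rough sketch of launching on the MCAD integration; the scheduler name kubernetes_mcad follows this release, while the image and the namespace config key are assumptions to check against the scheduler's run_opts:

```python
import torchx.components.dist as dist
from torchx.runner import get_runner

# A DDP component scheduled through MCAD like any other scheduler.
app = dist.ddp(script="train.py", j="2x1", image="my-registry/train:latest")

runner = get_runner()
handle = runner.run(app, scheduler="kubernetes_mcad", cfg={"namespace": "default"})
print(runner.status(handle))
```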
torchx.tracker
- MLflow backend (Prototype)
  - New support for an MLflow backend for the TorchX tracker
- Add ability for the fsspec tracker to read nested kwargs
- Support for tracking apps not launched by TorchX
- Load tracker config from .torchxconfig
torchx.components
- Add dist.spmd component to support Single-Program-Multiple-Data style applications
torchx.workspace
- Add ability to access the image and workspace path from the Dockerfile while building the Docker workspace
Usability improvements
- Fix entrypoint loading to deal with deferred loading of modules, so that component registration works properly
Changes to ease maintenance
- Add ability to run integration tests for AWS Batch, Slurm, and Kubernetes without remote dedicated clusters. This makes the environment reproducible, reduces maintenance, and makes it easier for more users to contribute.
Additional changes
- Bug fix: make it possible to launch jobs with more than 5 nodes on AWS Batch
torchx.schedulers
- GCP Batch (Prototype)
  - Newly added integration for easily scheduling jobs on GCP Batch
  - Features include:
    - scheduling different types of components, including DDP components
    - scheduling on different compute resources (CPU, GPU)
    - describing jobs, including getting job status
    - getting logs for jobs
    - listing jobs
    - cancelling jobs
- AWS Batch
  - Listing jobs now returns only jobs launched on AWS Batch by TorchX, and uses pagination to enable listing all jobs in all queues
  - Named resources now account for the ECS and EC2 memtax (platform-reserved memory), and the closest match is suggested when a resource is not found
  - Named resources expanded to include all instance types for g4d, g5, p4d, p3 and trn1
torchx.workspace
- Improve docker push logging to prevent log spamming when pushing for the first time
Additional changes
- Remove classyvision from examples since it's no longer supported in OSS; the examples use torchvision/torch dataset APIs instead of ClassyDataset
torchx.schedulers
- List API (Prototype)
  - New list API to list jobs and their statuses for all schedulers, removing the need for secondary tools to list jobs (see the sketch after this list)
- AWS Batch (promoted to Beta)
  - Get logs for running jobs
  - Added configs for job priorities and queue policies
  - Easily access the job UI via ui_url
- Ray
  - Add elasticity to jobs launched on a Ray cluster to automatically scale jobs up as resources become available
- Kubernetes
  - Add elasticity to jobs launched on Kubernetes
- LSF Scheduler (Prototype)
  - Newly added support for scheduling on the IBM Spectrum LSF scheduler
- Local Scheduler
  - Better formatting when using pdb
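A sketch of the prototype from Python, assuming it is surfaced on the runner as list(scheduler); the response fields shown are illustrative:

```python
from torchx.runner import get_runner

# List jobs and their statuses for one scheduler, no kubectl/awscli needed.
runner = get_runner()
for job in runner.list("kubernetes"):
    print(job.app_id, job.state)
```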
torchx.tracker (Prototype)
- TorchX Tracker is a new lightweight experiment and artifact tracking tool
- Add tracker API that can track any inputs and outputs to your model on any infrastructure (see the sketch below)
- FSSpec-based TorchX tracker implementation and sample app
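A minimal sketch of the tracker API from inside a launched app, following the pattern in the tracker docs; the metadata keys and artifact path are placeholders:

```python
from torchx import tracker

# Obtain the tracker for the current app run (configured by the launching
# environment) and record inputs and outputs.
app_run = tracker.app_run_from_env()
app_run.add_metadata(lr=0.01, epochs=10)                    # hyperparameters
app_run.add_artifact("checkpoint", "s3://bucket/model.pt")  # output location
```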
torchx.runner
- Allow overriding TORCHX_IMAGE via entrypoints
- Capture the image used when logging schedule calls
torchx.components
- Add debug flag to dist component
torchx.cli
- New list feature also available as a subcommand to list jobs and their statuses on a given scheduler
- New tracker feature also available as a subcommand to track experiments and artifacts
- Defer loading schedulers until used
torchx.workspace
- Preserve Unix file mode when patching files into docker image.
Docs
- Add airflow example
Additional changes
- Bug fixes for Python 3.10 support
torchx.schedulers
- DeviceMounts
  - New mount type DeviceMount that allows mounting a host device into a container on the supported schedulers (Docker, AWS Batch, Kubernetes). Custom accelerators and network devices such as Infiniband or Amazon EFA are now supported. (See the sketch after this list.)
- Slurm
  - The scheduler integration now supports max_retries the same way our other schedulers do. This only handles whole-job retries and doesn't support per-replica retries.
  - Autodetects the "nomem" setting by using sinfo to get the "Memory" setting for the specified partition
  - More robust slurmint script
- Kubernetes
  - Support for Kubernetes device plugins/resource limits:
    - Added a "devices" list of (str, int) tuples to role/resource
    - Added devices.py to map from named devices to DeviceMounts
    - Added logic in kubernetes_scheduler to add devices from the resource to resource limits
    - Added logic in aws_batch_scheduler and docker_scheduler to add DeviceMounts for any devices from the resource
  - Added a "priority_class" argument to the Kubernetes scheduler to set the priorityClassName of the Volcano job
- Ray
  - Fixes for distributed training, which is now supported in Beta
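A sketch of attaching a host device with the new mount type; the field names follow specs.DeviceMount as described above, and the device path and image are examples:

```python
from torchx import specs

# Mount an Infiniband/EFA device from the host into the container.
role = specs.Role(
    name="trainer",
    image="my-registry/train:latest",
    entrypoint="python",
    args=["-m", "train"],
    mounts=[
        specs.DeviceMount(
            src_path="/dev/infiniband/uverbs0",
            dst_path="/dev/infiniband/uverbs0",
            permissions="rwm",  # read, write, mknod
        )
    ],
)
```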
torchx.specs
- Moved factory/builder methods from the datastruct-specific "specs.api" to a new "specs.factory" module
torchx.runner
- Renamed the "stop" method to "cancel" for consistency; Runner.stop is now deprecated
- Added a warning message when the "name" parameter is specified: it is used as part of the session name, which is deprecated, making "name" obsolete
- New env variable TORCHXCONFIG for specifying the config file
torchx.components
- Removed "base" + "torch_dist_role" since users should prefer the dist.ddp components instead
- Removed custom components for example apps in favor of using builtins
- Added "env", "max_retries" and "mounts" arguments to utils.sh
torchx.cli
- Better parsing of configs from a string literal
- Added support to delimit kv-pairs and list values with "," and ";" interchangeably
- Allow the default scheduler to be specified via .torchxconfig
- Better invalid-scheduler messaging
- Log message about how to disable workspaces
- Job cancellation support via torchx cancel <job>
torchx.workspace
- Support for .dockerignore files used as include lists, fixing some behavioral differences between how .dockerignore files are interpreted by TorchX and Docker
Testing
- Component tests now run sequentially
- Components can be tested with a runner using the components.components_test_base.ComponentTestCase#run_component() method (see the sketch below)
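A hedged sketch of such a test; the module path follows the release note above, and the exact run_component signature may differ by version:

```python
import torchx.components.utils as utils
from torchx.components.components_test_base import ComponentTestCase

class EchoComponentTest(ComponentTestCase):
    def test_echo(self) -> None:
        # Run the builtin echo component end to end with a runner
        # (argument format assumed).
        self.run_component(utils.echo, {"msg": "hello"})
```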
Additional changes
- Updated Pyre configuration to preemptively guard against upcoming semantic changes
- Formatting changes from black 22.3.0
- Now using pyfmt with usort 1.0 and the new import-merging behavior
- Added a script to automatically gather system diagnostics for reporting purposes
Milestone: https://github.com/pytorch/torchx/milestones/3
- PyTorch 1.11 Support
- Python 3.10 Support
torchx.workspace
- TorchX now supports a concept of workspaces. This enables seamless launching of jobs using changes present in your local workspace. For Docker-based schedulers, we automatically build a new Docker container on job launch, making it easier than ever to run experiments. #333
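A sketch of a workspace launch; the workspace argument on the runner and the paths shown are illustrative:

```python
import torchx.components.dist as dist
from torchx.runner import get_runner

app = dist.ddp(script="train.py", j="1x2")

runner = get_runner()
# For Docker-based schedulers, the workspace (here a local project
# checkout) is built into a new image at launch time.
handle = runner.run(
    app,
    scheduler="local_docker",
    cfg={},
    workspace="file:///path/to/my/project",
)
```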
torchx.schedulers
- Ray #329
  - The newly added Ray scheduler makes it easy to launch jobs on Ray.
  - https://pytorch.medium.com/large-scale-distributed-training-with-torchx-and-ray-1d09a329aacb
- AWS Batch #381
  - The newly added AWS Batch scheduler makes it easy to launch jobs in AWS with minimal infrastructure setup.
- Slurm
  - Slurm jobs now launch in the current working directory by default to match local_cwd and workspace behavior. #372
  - Replicas now have their own log files and can be accessed programmatically. #373
  - Support for comment, mail-user and constraint fields. #391
  - WorkspaceMixin support (prototype): Slurm jobs can now be launched in isolated experiment directories. #416
- Kubernetes
  - Support for running jobs under service accounts. #408
  - Support for specifying instance types. #433
- All Docker-based schedulers (Kubernetes, Batch, Docker)
  - Added bind mount and volume support. #420, #426
  - Bug fix: better shm support for large dataloaders. #429
  - Support for .dockerignore and custom Dockerfiles. #401
- Local Scheduler
  - Automatically set CUDA_VISIBLE_DEVICES. #383
  - Improved log ordering. #366
torchx.components
- dist.ddp
  - Rendezvous works out of the box on all schedulers #400
  - Logs are now prefixed with local ranks #412
  - Can specify resources via the CLI #395
  - Can specify environment variables via the CLI #399
- HPO
  - The Ax runner now lives in the Ax repo: https://github.com/facebook/Ax/commit/8e2e68f21155e918996bda0b7d97b5b9ef4e0cba
torchx.cli
- .torchxconfig
  - You can now specify component argument defaults in .torchxconfig (https://github.com/pytorch/torchx/commit/c37cfd7846d5a0cb527dd19c8c95e881858f8f0a)
  - ~/.torchxconfig can now be used to set user-level defaults #378
  - --workspace can be configured #397
- Color change and bug fixes #419
torchx.runner
- Now supports workspace interfaces #360
- Returned log lines now preserve whitespace to provide support for progress bars #425
- Events are now logged to torch.monitor when available #379
torchx.notebook (prototype)
- Added a new workspace interface for developing models and launching jobs via a Jupyter Notebook #356
Docs
- Improvements to clarify TorchX usage with workspaces and general cleanups
- #374, #402, #404, #407, #434
torchx.schedulers
- #287, #286 - Implement the local_docker scheduler using the Docker client lib
Docs
- #336 - Add context/intro to each docs page
- Minor document corrections
torchx
- #267 - Make torchx.version.TORCHX_IMAGE follow the same semantics as version
- #299 - Use base docker image pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
torchx.specs
- #301 - Add metadata field to the torchx.specs.Role dataclass
- #302 - Deprecate RunConfig in favor of raw Dict[str, ConfigValue]
torchx.cli
- #316 - Implement torchx builtins --print, which prints the source code of the component
torchx.runner
- #331 - Split run_component into run_component and dryrun_component
torchx.schedulers
- local_docker now prints a nicer error if Docker is not installed #284
torchx.cli
- Improved error messages when -cfg is not provided #271
torchx.components
- Update dist.ddp to use the c10d backend as default #263
torchx.aws
- Removed entirely as it was unused
Docs
- Restructure documentation to be more clear
- Merged Hello World example with the Quickstart guide to reduce confusion
- Updated Train / Distributed component documentation
- Renamed configure page to "Advanced Usage" to avoid confusion with experimental .torchxconfig
- Renamed Localhost page to just Local to better match the class name
- Misc cleanups / improvements
Tests
- Fixed test failure when no secrets are present #274
- Added macOS variant to our unit tests #209
torchx.specs
- base_image has been deprecated
- Some predefined AWS specific named_resources have been added
- Docstrings are no longer required for component definitions, to make them easier to write. If present they will still be rendered as help text and are encouraged, but they aren't required.
- Improved vararg handling logic for components
torchx.runner
- Username has been removed from the session name
- Standardized runopts naming
torchx.cli
- Added experimental .torchxconfig file, which can be used to set default scheduler arguments for all runs
- Added --version flag
- builtins now ignores the torchx.components.base folder
Docs
- Improved entry_points and resources docs
- Better component documentation
- General improvements and fixes
Examples
- Moved examples to be under torchx/ and merged the examples container with the primary container to simplify usage.
- Added a self contained "Hello World" example
- Switched the lightning_classy_vision example to use a ResNet model architecture so it will actually converge
- Removed CIFAR example and merged functionality into lightning_classy_vision
CI
- Switched to OpenID Connect based auth
torchx.specs
- API release candidate (still experimental, but no major changes expected for 0.1.0)
torchx.components
- Made all components use Docker images by default for consistency
- Removed binary_component in favor of directly writing app defs
- serve.torchserve - added an optional --port argument for the upload server
- utils.copy - added a copy component for easy file transfer between fsspec path locations
- ddp - nnodes no longer needs to be specified and is set from num_replicas instead
- Bug fixes
- End-to-end integration tests on Slurm and Kubernetes
- Better unit testing support via ComponentTestCase
torchx.schedulers
- Split the local scheduler into local_docker and local_cwd
  - For local execution, local_docker provides the closest experience to remote behavior
  - local_cwd allows reusing the same component definition for local development purposes, but resolves the entrypoint and deps relative to the current working directory
- Improvements/bug fixes to the Slurm and Kubernetes schedulers
torchx.pipelines
- kfp - added the ability to launch distributed apps via the new resource_from_app method, which creates a Volcano Job from Kubeflow Pipelines (see the sketch below)
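A sketch of the adapter inside a pipeline definition; the queue name and component arguments are placeholders:

```python
import kfp
import torchx.components.dist as dist
from torchx.pipelines.kfp.adapter import resource_from_app

def pipeline() -> None:
    # Build a distributed AppDef and schedule it as a Volcano Job from
    # within the Kubeflow Pipeline.
    app = dist.ddp(script="train.py", j="2x1")
    resource_from_app(app, queue="default")

kfp.compiler.Compiler().compile(pipeline, "pipeline.yaml")
```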
torchx.runner
- General fixes and improvements around wait behavior
torchx.cli
- Improvements to output formatting to improve clarity
- log can now log from all roles instead of just one
- run now supports boolean arguments
- Experimental support for the CLI being used from scripts; exit codes are consistent and only script-consumable data is logged on stdout for key commands such as run
- --log_level configuration flag
- Default scheduler is now local_docker, decided by the first scheduler in entrypoints
- More robust component finding and better behavior on malformed components
torchx.examples
- Distributed CIFAR Training Example
- HPO
- Improvements to the lightning_classy_vision example: it now uses components, with datapreproc separated from injection
- Updated to use same file directory layout as github repo
- Added documentation on setting up kubernetes cluster for use with TorchX
- Added distributed KFP pipeline example
torchx.runtime
- Added experimental hpo support with Ax (https://github.com/facebook/Ax)
- Added experimental tracking.ResultTracker for distributed tracking of metrics for use with HPO (see the sketch below)
- Bumped the PyTorch version to 1.9.0
- Deleted the deprecated storage/plugins interface
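A minimal sketch using the fsspec-backed implementation of this interface; the tracker path and metric name are placeholders:

```python
from torchx.runtime.tracking import FsspecResultTracker

# Writer side (e.g., inside a trial): record a metric under a trial index.
tracker = FsspecResultTracker("memory://torchx-tracker")
tracker[0]["accuracy"] = 0.95

# Reader side (e.g., the HPO loop): read it back by the same index.
print(tracker[0]["accuracy"])
```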
Docs
- Added app/component best practices
- Added more information on different component archetypes such as training
- Refactored structure to more accurately represent components, runtime and scheduler libraries.
- README: added information on how to install from source, nightly and different dependencies
- Added scheduler feature compatibility matrices
- General cleanups and improvements
CI
- component integration test framework
- code coverage
- renamed primary branch to main
- automated doc push
- distributed kubernetes integration tests
- nightly builds at https://pypi.org/project/torchx-nightly/
- pyre now uses nightly builds
- added slurm integration tests
torchx.specs
- API release candidate (still experimental, but no major changes expected for 0.1.0)
torchx.pipelines
- Kubeflow Pipeline adapter support
torchx.runner
- SLURM and local scheduler support
torchx.components
- several utils, ddp, torchserve builtin components
torchx.examples
- Colab support for examples
- apps:
  - classy vision + lightning trainer
  - torchserve deploy
  - captum model visualization
- pipelines:
  - the apps above as a Kubeflow Pipeline
  - basic vs advanced Kubeflow Pipeline examples
CI
- unittest, pyre, linter, KFP launch, doc build/test