CHANGELOG

torchx-0.6.0

  • Breaking changes

    • Drop support for python 3.7.
    • Upgrade docker base image pytorch version to 2.0
  • torchx.schedulers

    • Add support for options in create_schedulers factory method that allows scheduler configuration in runner
    • Kubernetes MCAD Scheduler
      • Add support for retrying
      • Test, formatting and documentation updates
    • AWS Batch Scheduler
      • Fix logging rank attribution
    • Ray Scheduler
      • Add ability to programmatically define ray job client
  • torchx.tracker

    • Fix adding artifacts to MLFlowTracker by multiple ranks
  • torchx.components

    • dist.ddp
      • Add ability to specify rendezvous backend and use c10d as a default mechanism
      • Add node_rank parameter value for static rank setup
  • torchx.runner

    • Resolve run_opts when passing to torchx.workspace and for dry-run to correctly populate the values
  • torchx.runner.events

    • Add support to log CPU and wall times
  • torchx.cli

    • Wait for app to start when logging
  • torchx.specs

    • Role.resources uses default_factory method to initialize its value
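
The `default_factory` change to `Role.resources` above can be illustrated with a minimal sketch. The `Resource` and `Role` classes below are hypothetical stand-ins written for this example, not the `torchx.specs` definitions; the point is only that `default_factory` gives each instance its own default object.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Resource:
    # Hypothetical stand-in for a resource spec; not the torchx.specs class.
    cpu: int = 1
    gpu: int = 0
    capabilities: Dict[str, str] = field(default_factory=dict)


@dataclass
class Role:
    name: str
    # default_factory builds a fresh Resource per Role, so roles never
    # share (and accidentally mutate) a single default instance.
    resource: Resource = field(default_factory=Resource)


trainer = Role(name="trainer")
reader = Role(name="reader")
trainer.resource.gpu = 8  # only the trainer's resource changes
```

With a shared default object, the last line would silently change every role that relied on the default.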

torchx-0.5.0

  • Milestone: https://github.com/pytorch/torchx/milestone/7

  • torchx.schedulers

    • Kubernetes MCAD Scheduler (Prototype)
      • Newly added integration for easily scheduling jobs on Multi-Cluster-Application-Dispatcher (MCAD).
      • Features include:
        • scheduling different types of components including DDP components
        • scheduling on different compute resources (CPU, GPU)
        • support for docker workspace
        • support for bind, volume and device mounts
        • getting logs for jobs
        • describing, listing and cancelling jobs
        • can be used with a secondary scheduler on Kubernetes
    • AWS Batch
      • Add privileged option to enable running containers on EFA enabled instances with elevated networking permissions
  • torchx.tracker

    • MLflow backend (Prototype)
      • New support for MLFlow backend for torchx tracker
    • Add ability for fsspec tracker to read nested kwargs
    • Support for tracking apps not launched by torchx
    • Load tracker config from .torchxconfig
  • torchx.components

    • Add dist.spmd component to support Single-Process-Multiple-Data style applications
  • torchx.workspace

    • Add ability to access image and workspace path from Dockerfile while building docker workspace
  • Usability improvements

    • Fix entrypoint loading to deal with deferred loading of modules to enable component registration to work properly
  • Changes to ease maintenance

    • Add ability to run integration tests for AWS Batch, Slurm, and Kubernetes without relying on remote dedicated clusters. This makes the environment reproducible, reduces maintenance, and makes it easier for more users to contribute.
  • Additional changes

    • Bug fixes: Make it possible to launch jobs with more than 5 nodes on AWS Batch

torchx-0.4.0

  • Milestone: https://github.com/pytorch/torchx/milestone/6

  • torchx.schedulers

    • GCP Batch (Prototype)
      • Newly added integration for easily scheduling jobs on GCP Batch.
      • Features include:
        • scheduling different types of components including DDP components
        • scheduling on different compute resources (CPU, GPU)
        • describing jobs including getting job status
        • getting logs for jobs
        • listing jobs
        • cancelling jobs
    • AWS Batch
      • Listing jobs now returns just jobs launched on AWS Batch by TorchX and uses pagination to enable listing all jobs in all queues.
      • Named resources now account for ECS and EC2 memtax, and suggest the closest match when a resource is not found.
      • Named resources expanded to include all instance types for g4d, g5, p4d, p3 and trn1.
  • torchx.workspace

    • Improve docker push logging to prevent log spamming when pushing for the first time
  • Additional Changes

    • Remove classyvision from examples since it's no longer supported in OSS. Uses torchvision/torch dataset APIs instead of ClassyDataset.
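
The closest-match suggestion for named resources mentioned above could look roughly like the following sketch. It uses the standard library's `difflib`; the `NAMED_RESOURCES` table, its values, and the `lookup` helper are illustrative assumptions, not the torchx implementation.

```python
import difflib
from typing import Dict, Tuple

# Illustrative table: name -> (cpu, gpu, memory in MiB). Values are made up.
NAMED_RESOURCES: Dict[str, Tuple[int, int, int]] = {
    "aws_p3.2xlarge": (8, 1, 61_000),
    "aws_g5.xlarge": (4, 1, 16_000),
    "aws_trn1.2xlarge": (8, 0, 32_000),
}


def lookup(name: str) -> Tuple[int, int, int]:
    """Return the resource for `name`, or raise with a closest-match hint."""
    if name in NAMED_RESOURCES:
        return NAMED_RESOURCES[name]
    # Rank known names by similarity and keep the single best candidate.
    close = difflib.get_close_matches(name, list(NAMED_RESOURCES), n=1)
    hint = f" Did you mean `{close[0]}`?" if close else ""
    raise KeyError(f"no named resource `{name}`.{hint}")
```

Calling `lookup` with a misspelled name raises a `KeyError` whose message names the nearest registered resource, if any is similar enough.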

torchx-0.3.0

  • Milestone: https://github.com/pytorch/torchx/milestone/5

  • torchx.schedulers

    • List API (Prototype)
      • New list API to list jobs and their statuses for all schedulers, which removes the need to use secondary tools to list jobs
    • AWS Batch (promoted to Beta)
      • Get logs for running jobs
      • Added configs for job priorities and queue policies
      • Easily access job UI via ui_url
    • Ray
      • Add elasticity to jobs launched on ray cluster to automatically scale jobs up as resources become available
    • Kubernetes
      • Add elasticity to jobs launched on Kubernetes
    • LSF Scheduler (Prototype)
      • Newly added support for scheduling on IBM Spectrum LSF scheduler
    • Local Scheduler
      • Better formatting when using pdb
  • torchx.tracker (Prototype)

    • TorchX Tracker is a new lightweight experiment and artifact tracking tool
    • Add tracker API that can track any inputs and outputs to your model in any infrastructure
    • FSSpec based Torchx tracking implementation and sample app
  • torchx.runner

    • Allow overriding TORCHX_IMAGE via entrypoints
    • Capture the image used when logging schedule calls
  • torchx.components

    • Add debug flag to dist component
  • torchx.cli

    • New list feature also available as a subcommand to list jobs and their statuses on a given scheduler
    • New tracker feature also available as a subcommand to track experiments and artifacts
    • Defer loading schedulers until used
  • torchx.workspace

    • Preserve Unix file mode when patching files into docker image.
  • Docs

    • Add airflow example
  • Additional changes

    • Bug fixes for Python 3.10 support

torchx-0.2.0

  • Milestone: https://github.com/pytorch/torchx/milestone/4

  • torchx.schedulers

    • DeviceMounts
      • New mount type 'DeviceMount' that allows mounting a host device into a container in the supported schedulers (Docker, AWS Batch, K8). Custom accelerators and network devices such as Infiniband or Amazon EFA are now supported.
    • Slurm
      • Scheduler integration now supports "max_retries" the same way that our other schedulers do. This only handles whole job level retries and doesn't support per replica retries.
      • Autodetects "nomem" setting by using sinfo to get the "Memory" setting for the specified partition
      • More robust slurmint script
    • Kubernetes
      • Support for k8s device plugins/resource limits
        • Added "devices" list of (str, int) tuples to role/resource
        • Added devices.py to map from named devices to DeviceMounts
        • Added logic in kubernetes_scheduler to add devices from resource to resource limits
        • Added logic in aws_batch_scheduler and docker_scheduler to add DeviceMounts for any devices from resource
      • Added "priority_class" argument to kubernetes scheduler to set the priorityClassName of the volcano job.
    • Ray
      • fixes for distributed training, now supported in Beta
  • torchx.specs

    • Moved factory/builder methods from datastruct specific "specs.api" to "specs.factory" module
  • torchx.runner

    • Renamed "stop" method to "cancel" for consistency. Runner.stop is now deprecated
    • Added warning message when "name" parameter is specified. It is used as part of the Session name, which is deprecated, making "name" obsolete.
    • New env variable TORCHXCONFIG for specified config
  • torchx.components

    • Removed "base" + "torch_dist_role" since users should prefer to use the dist.ddp components instead
    • Removed custom components for example apps in favor of using builtins.
    • Added "env", "max_retries" and "mounts" arguments to utils.sh
  • torchx.cli

    • Better parsing of configs from a string literal
    • Added support to delimit kv-pairs and list values with "," and ";" interchangeably
    • Allow the default scheduler to be specified via .torchxconfig
    • Better invalid scheduler messaging
    • Log message about how to disable workspaces
    • Job cancellation support via torchx cancel <job>

  • torchx.workspace

    • Support for .dockerignore files used as include lists to fix some behavioral differences between how .dockerignore files are interpreted by torchx and docker

  • Testing

    • Component tests now run sequentially
    • Components can be tested with a runner using components.components_test_base.ComponentTestCase#run_component() method.
  • Additional Changes

    • Updated Pyre configuration to preemptively guard against upcoming semantic changes
    • Formatting changes from black 22.3.0
    • Now using pyfmt with usort 1.0 and the new import merging behavior.
    • Added script to automatically get system diagnostics for reporting purposes
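
To show what treating "," and ";" interchangeably can mean for the config string parsing listed under torchx.cli above, here is a small sketch. The `parse_cfg` function and the key names are hypothetical and written only for this example; they are not torchx's parser and may not match its exact rules.

```python
import re
from typing import Dict, List, Union

CfgValue = Union[str, List[str]]


def parse_cfg(cfg: str) -> Dict[str, CfgValue]:
    """Parse "k=v" pairs where "," and ";" separate both pairs and list items."""
    result: Dict[str, CfgValue] = {}
    key = None
    for token in filter(None, re.split(r"[,;]", cfg)):
        if "=" in token:
            # A token containing "=" starts a new key-value pair.
            key, value = token.split("=", 1)
            result[key] = value
        elif key is None:
            raise ValueError(f"value `{token}` has no key")
        else:
            # A token without "=" continues the previous key's list value.
            prev = result[key]
            result[key] = (prev if isinstance(prev, list) else [prev]) + [token]
    return result
```

Under these assumptions, `parse_cfg("queue=default,tags=a;b")` and `parse_cfg("queue=default;tags=a,b")` produce the same dictionary.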

torchx-0.1.2

  • Milestone: https://github.com/pytorch/torchx/milestones/3

  • PyTorch 1.11 Support
  • Python 3.10 Support
  • torchx.workspace
    • TorchX now supports a concept of workspaces. This enables seamless launching of jobs using changes present in your local workspace. For Docker based schedulers, we automatically build a new docker container on job launch making it easier than ever to run experiments. #333
  • torchx.schedulers
    • Ray #329
    • AWS Batch #381
      • Newly added AWS Batch scheduler makes it easy to launch jobs in AWS with minimal infrastructure setup.
    • Slurm
      • Slurm jobs will by default launch in the current working directory to match local_cwd and workspace behavior. #372
      • Replicas now have their own log files and can be accessed programmatically. #373
      • Support for comment, mail-user and constraint fields. #391
      • WorkspaceMixin support (prototype) - Slurm jobs can now be launched in isolated experiment directories. #416
    • Kubernetes
      • Support for running jobs under service accounts. #408
      • Support for specifying instance types. #433
    • All Docker-based Schedulers (Kubernetes, Batch, Docker)
      • Added bind mount and volume supports #420, #426
      • Bug fix: Better shm support for large dataloader #429
      • Support for .dockerignore and custom Dockerfiles #401
    • Local Scheduler
      • Automatically set CUDA_VISIBLE_DEVICES #383
      • Improved log ordering #366
  • torchx.components
  • torchx.cli
  • torchx.runner
    • Now supports workspace interfaces. #360
    • Returned lines now preserve whitespace to provide support for progress bars #425
    • Events are now logged to torch.monitor when available. #379
  • torchx.notebook (prototype)
    • Added new workspace interface for developing models and launching jobs via a Jupyter Notebook. #356
  • Docs
    • Improvements to clarify TorchX usage w/ workspaces and general cleanups.
    • #374, #402, #404, #407, #434
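
The local scheduler's automatic `CUDA_VISIBLE_DEVICES` handling noted above can be pictured with the following sketch, which hands each local replica a disjoint block of device indices. The `assign_gpus` helper and its contiguous-block policy are assumptions made for illustration; the actual scheduler logic may differ.

```python
from typing import Dict, List


def assign_gpus(gpus_per_replica: List[int], total_gpus: int) -> List[Dict[str, str]]:
    """Give each local replica a disjoint, contiguous block of device indices."""
    if sum(gpus_per_replica) > total_gpus:
        raise ValueError("replicas request more GPUs than the host has")
    envs: List[Dict[str, str]] = []
    start = 0
    for count in gpus_per_replica:
        devices = range(start, start + count)
        # Each replica only sees its own devices through this env variable.
        envs.append({"CUDA_VISIBLE_DEVICES": ",".join(map(str, devices))})
        start += count
    return envs
```

For two replicas that each request two GPUs on a four-GPU host, the first replica would see devices `0,1` and the second `2,3`.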

torchx-0.1.1

  • Milestone: https://github.com/pytorch/torchx/milestone/2

  • torchx.schedulers

    • #287, #286 - Implement local_docker scheduler using docker client lib
  • Docs

    • #336 - Add context/intro to each docs page
    • Minor document corrections
  • torchx

    • #267 - Make torchx.version.TORCHX_IMAGE follow the same semantics as version
    • #299 - Use base docker image pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
  • torchx.specs

    • #301 - Add metadata field to torchx.specs.Role dataclass
    • #302 - Deprecate RunConfig in favor of raw Dict[str, ConfigValue]
  • torchx.cli

    • #316 - Implement torchx builtins --print that prints the source code of the component
  • torchx.runner

    • #331 - Split run_component into run_component and dryrun_component

torchx-0.1.0

  • torchx.schedulers

    • local_docker print a nicer error if Docker is not installed #284
  • torchx.cli

    • Improved error messages when -cfg is not provided #271
  • torchx.components

    • Update dist.ddp to use c10d backend as default #263
  • torchx.aws

    • Removed entirely as it was unused
  • Docs

    • Restructure documentation to be more clear
    • Merged Hello World example with the Quickstart guide to reduce confusion
    • Updated Train / Distributed component documentation
    • Renamed configure page to "Advanced Usage" to avoid confusion with experimental .torchxconfig
    • Renamed Localhost page to just Local to better match the class name
    • Misc cleanups / improvements
  • Tests

    • Fixed test failure when no secrets are present #274
    • Added macOS variant to our unit tests #209

torchx-0.1.0rc1

  • torchx.specs

    • base_image has been deprecated
    • Some predefined AWS specific named_resources have been added
    • Docstrings are no longer required for component definitions to make it easier to write them. They will be still rendered as help text if present and are encouraged but aren't required.
    • Improved vararg handling logic for components
  • torchx.runner

    • Username has been removed from the session name
    • Standardized runopts naming
  • torchx.cli

    • Added experimental .torchxconfig file which can be used to set default scheduler arguments for all runs.
    • Added --version flag
    • builtins ignores torchx.components.base folder
  • Docs

    • Improved entry_points and resources docs
    • Better component documentation
    • General improvements and fixes
  • Examples

    • Moved examples to be under torchx/ and merged the examples container with the primary container to simplify usage.
    • Added a self contained "Hello World" example
    • Switched lightning_classy_vision example to use ResNet model architecture so it will actually converge
    • Removed CIFAR example and merged functionality into lightning_classy_vision
  • CI

    • Switched to OpenID Connect based auth

torchx-0.1.0rc0

  • torchx.specs API release candidate (still experimental but no major changes expected for 0.1.0)
  • torchx.components
    • made all components use docker images by default for consistency
    • removed binary_component in favor of directly writing app defs
    • serve.torchserve - added optional --port argument for upload server
    • utils.copy - added copy component for easy file transfer between fsspec path locations
    • ddp
      • nnodes no longer needs to be specified and is set from num_replicas instead.
      • Bug fixes.
      • End to end integration tests on Slurm and Kubernetes.
    • better unit testing support via ComponentTestCase.
  • torchx.schedulers
    • Split local scheduler into local_docker and local_cwd.
      • For local execution local_docker provides the closest experience to remote behavior.
      • local_cwd allows reusing the same component definition for local development purposes but resolves entrypoint and deps relative to the current working directory.
    • Improvements/bug fixes to Slurm and Kubernetes schedulers.
  • torchx.pipelines
    • kfp Added the ability to launch distributed apps via the new resource_from_app method which creates a Volcano Job from Kubeflow Pipelines.
  • torchx.runner - general fixes and improvements around wait behavior
  • torchx.cli
    • Improvements to output formatting to improve clarity.
    • log can now log from all roles instead of just one
    • run now supports boolean arguments
    • Experimental support for CLI being used from scripts. Exit codes are consistent and only script consumable data is logged on stdout for key commands such as run.
    • --log_level configuration flag
    • Default scheduler is now local_docker and decided by the first scheduler in entrypoints.
    • More robust component finding and better behavior on malformed components.
  • torchx.examples
    • Distributed CIFAR Training Example
    • HPO
    • Improvements to lightning_classy_vision example -- uses components, datapreproc separated from injection
    • Updated to use same file directory layout as github repo
    • Added documentation on setting up kubernetes cluster for use with TorchX
    • Added distributed KFP pipeline example
  • torchx.runtime
    • Added experimental hpo support with Ax (https://github.com/facebook/Ax)
    • Added experimental tracking.ResultTracker for distributed tracking of metrics for use with HPO.
    • Bumped pytorch version to 1.9.0.
    • Deleted deprecated storage/plugins interface.
  • Docs
    • Added app/component best practices
    • Added more information on different component archetypes such as training
    • Refactored structure to more accurately represent components, runtime and scheduler libraries.
    • README: added information on how to install from source, nightly and different dependencies
    • Added scheduler feature compatibility matrices
    • General cleanups and improvements
  • CI
    • component integration test framework
    • code coverage
    • renamed primary branch to main
    • automated doc push
    • distributed kubernetes integration tests
    • nightly builds at https://pypi.org/project/torchx-nightly/
    • pyre now uses nightly builds
    • added slurm integration tests

torchx-0.1.0b0

  • torchx.specs API release candidate (still experimental but no major changes expected for 0.1.0)

  • torchx.pipelines - Kubeflow Pipeline adapter support

  • torchx.runner - SLURM and local scheduler support

  • torchx.components - several utils, ddp, torchserve builtin components

  • torchx.examples

    • Colab support for examples
    • apps:
      • classy vision + lightning trainer
      • torchserve deploy
      • captum model visualization
    • pipelines:
      • apps above as a Kubeflow Pipeline
      • basic vs advanced Kubeflow Pipeline examples
  • CI

    • unittest, pyre, linter, KFP launch, doc build/test