AIStore is a lightweight object storage system with the capability to linearly scale out with each added storage node and a special focus on petascale deep learning.
AIStore (AIS for short) is a built-from-scratch, lightweight storage stack tailored for AI apps. It's an elastic cluster that can grow and shrink at runtime and can be ad-hoc deployed, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size.
AIS consistently shows balanced I/O distribution and linear scalability across arbitrary numbers of clustered nodes. The ability to scale linearly with each added disk was, and remains, one of the main incentives. Much of the initial design was also driven by the idea of offloading custom dataset transformations (often referred to as ETL). Finally, since AIS is a software system that aggregates Linux machines to provide storage for user data, requirement number one is reliability and data protection.
- Deploys anywhere. AIS clusters are immediately deployable on any commodity hardware, on any Linux machine(s).
- Highly available control and data planes, end-to-end data protection, self-healing, n-way mirroring, erasure coding, and an arbitrary number of extremely lightweight access points.
- REST API. Comprehensive native HTTP-based API, as well as a compliant Amazon S3 API to run unmodified S3 clients and apps.
- Unified namespace across multiple remote backends including Amazon S3, Google Cloud, and Microsoft Azure.
- Network of clusters. Any AIS cluster can attach any other AIS cluster, thus gaining immediate visibility and fast access to the respective hosted datasets.
- Turn-key cache. Can be used as a standalone highly-available protected storage and/or LRU-based fast cache. Eviction watermarks, as well as numerous other management policies, are per-bucket configurable.
- ETL offload. The capability to run I/O intensive custom data transformations close to data - offline (dataset to dataset) and inline (on-the-fly).
- File datasets. AIS can be immediately populated from any file-based data source (local or remote, ad-hoc/on-demand or via asynchronous batch).
- Read-after-write consistency. Reading and writing (as well as all other control and data plane operations) can be performed via any (random, selected, or load-balanced) AIS gateway (a.k.a. "proxy"). Once the first replica of an object is written and finalized, subsequent reads are guaranteed to view the same content. Additional copies and/or EC slices, if configured, are added asynchronously via `put-copies` and `ec-put` jobs, respectively.
- Write-through. In the presence of any remote backend, AIS executes a remote write (e.g., using the vendor's SDK) as part of the transaction that places and finalizes the first replica.
- Small file datasets. To serialize small files and facilitate batch processing, AIS supports TAR, TAR.GZ (or TGZ), ZIP, and TAR.LZ4 formatted objects (often called shards). Resharding (for optimal sorting and sizing), listing contained files (samples), appending to existing shards, and generating new ones from existing objects and/or client-side files are also fully supported.
- Kubernetes. Provides for easy Kubernetes deployment via a separate GitHub repo and AIS/K8s Operator.
- Access control. For security and fine-grained access control, AIS includes an OAuth 2.0 compliant Authentication Server (AuthN). A single AuthN instance executes CLI requests over HTTPS and can serve multiple clusters.
- Distributed shuffle extension for massively parallel resharding of very large datasets.
- Batch jobs. APIs and CLI to start, stop, and monitor documented batch operations, such as `prefetch`, `download`, copy or transform datasets, and many more.
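
To make the "shard" idea from the small-file item above more concrete, here is a minimal client-side sketch that packs a few small samples into a single uncompressed TAR and then lists the contained files. It uses only Python's standard `tarfile` module; the helper names (`make_shard`, `list_shard`) and the sample file names are hypothetical, and this is not how AIS itself builds or lists shards.

```python
import io
import tarfile


def make_shard(samples):
    """Pack small in-memory files (name -> bytes) into one uncompressed TAR."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # Sort by name so that files of the same sample end up adjacent.
        for name, payload in sorted(samples.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_shard(shard):
    """Return the names of the files (samples) contained in a shard."""
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        return tar.getnames()


# Hypothetical samples: an "image" and its class label per sample key.
samples = {
    "0001.jpg": b"fake-image-bytes",
    "0001.cls": b"7",
    "0002.jpg": b"more-fake-bytes",
    "0002.cls": b"3",
}
shard = make_shard(samples)
print(list_shard(shard))
```

A shard produced this way is an ordinary TAR file, so it could be stored as a single object and later read with any standard tool.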
For easy usage, management, and monitoring, there's also:
- Integrated and powerful CLI. As of early 2024, top-level CLI commands include:
$ ais
bucket etl help log create dsort stop blob-download
object job advanced performance download evict cp rmo
cluster auth storage remote-cluster prefetch get rmb wait
config show archive alias put ls start search
AIS runs natively on Kubernetes and features open format - thus, the freedom to copy or move your data from AIS at any time using the familiar Linux `tar(1)`, `scp(1)`, `rsync(1)` and similar.
For developers and data scientists, there's also:
- native Go (language) API that we utilize in a variety of tools including CLI and Load Generator;
- native Python SDK;
- PyTorch integration and usage examples;
- Boto3 support for interoperability with AWS SDK for Python (aka Boto3) client and other Botocore derivatives.
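
For a rough idea of what a call against the native HTTP-based API looks like without any SDK, the sketch below only builds (and does not send) an object request using Python's standard library. The gateway address, bucket, and object names are hypothetical, and the `/v1/objects/<bucket>/<object>` path layout with a `provider` query parameter is an assumption that should be checked against the API reference for your AIS version.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def object_request(gateway, bucket, name, provider="ais", method="GET"):
    """Build (but do not send) an HTTP request for one object.

    Assumed layout: <gateway>/v1/objects/<bucket>/<object>?provider=<provider>
    """
    # Keep '/' inside the object name so virtual directories stay readable.
    path = f"/v1/objects/{quote(bucket, safe='')}/{quote(name)}"
    query = urlencode({"provider": provider})
    return Request(f"{gateway.rstrip('/')}{path}?{query}", method=method)


# Hypothetical gateway endpoint and names, for illustration only.
req = object_request("http://localhost:8080", "my-bucket", "train/shard-0001.tar")
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) requires a reachable gateway and a client that follows HTTP redirects.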
For the original AIStore white paper and design philosophy, for introduction to large-scale deep learning and the most recently added features, please see AIStore Overview (where you can also find six alternative ways to work with existing datasets). Videos and animated presentations can be found at videos.
Finally, getting started with AIS takes only a few minutes.
AIS deployment options, as well as intended (development vs. production vs. first-time) usages, are all summarized here.
Since prerequisites essentially boil down to having Linux with a disk, the deployment options range from an all-in-one container to a petascale bare-metal cluster of any size, and from a single VM to multiple racks of high-end servers. Practical use cases, of course, require further consideration and may include:
Option | Objective |
---|---|
Local playground | AIS developers or first-time users, Linux or Mac OS; to get started, run `make kill cli aisloader deploy <<< $'N\nM'`, where N is the number of targets and M is the number of gateways |
Minimal production-ready deployment | Utilizes a preinstalled Docker image and targets first-time users or researchers (who could immediately start training their models on smaller datasets) |
Easy automated GCP/GKE deployment | Developers, first-time users, AI researchers |
Large-scale production deployment | Requires Kubernetes and is provided via a separate repository: ais-k8s |
Further, there's the capability referred to as global namespace: given HTTP(S) connectivity, AIS clusters can be easily interconnected to "see" each other's datasets. Hence the idea: start "small", then gradually and incrementally build high-performance shared capacity.
For detailed discussion on supported deployments, please refer to Getting Started.
For performance tuning and preparing AIS nodes for bare-metal deployment, see performance.
AIStore supports multiple ways to populate itself with existing datasets, including (but not limited to):
- on demand, often during the first epoch;
- copy entire bucket or its selected virtual subdirectories;
- copy multiple matching objects;
- archive multiple objects;
- prefetch remote bucket or parts thereof;
- download raw http(s) addressable directories, including (but not limited to) Cloud storage;
- promote NFS or SMB shares accessible by one or multiple (or all) AIS target nodes.
The on-demand "way" is perhaps the most popular: users simply start running their workloads against a remote bucket, with an AIS cluster positioned as an intermediate fast tier.
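
The effect of the on-demand approach can be illustrated with a toy read-through tier: the first read of each object goes to the remote backend, and later reads (for instance, in subsequent epochs) are served locally. This is a conceptual sketch only; the `ReadThroughTier` class and the in-memory "remote bucket" are hypothetical and say nothing about how AIS implements cold reads, placement, or eviction.

```python
class ReadThroughTier:
    """Toy intermediate tier: fetch from the remote on a miss, then serve locally."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable: object name -> bytes
        self._local = {}                   # locally stored objects
        self.remote_reads = 0              # how many times the remote was hit

    def get(self, name):
        if name not in self._local:
            # Cold read: pull the object from the remote backend exactly once.
            self._local[name] = self._fetch_remote(name)
            self.remote_reads += 1
        return self._local[name]


# Hypothetical remote bucket, modeled as a plain dictionary.
remote_bucket = {"a.tar": b"AAAA", "b.tar": b"BBBB"}
tier = ReadThroughTier(remote_bucket.__getitem__)

for epoch in range(3):
    for name in sorted(remote_bucket):
        tier.get(name)

# Only the first epoch touched the remote backend.
print(tier.remote_reads)
```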
But there's more. In v3.22, we introduce blob downloader, a special facility to download very large remote objects (BLOBs). And in v3.23, there's a new capability, dubbed bucket inventory, to list very large S3 buckets fast.
Generally, AIStore (cluster) requires at least some sort of deployment procedure. There are standalone binaries, though, that can be built from source or installed directly from GitHub:
$ ./scripts/install_from_binaries.sh --help
The script installs aisloader and CLI from the most recent, or the previous, GitHub release. For CLI, it'll also enable auto-completions (which is strongly recommended).
PyTorch integration is a growing set of datasets (both iterable and map-style), samplers, and dataloaders:
- Taxonomy of abstractions and API reference
- AIS plugin for PyTorch: usage examples
- Jupyter notebook examples
Since AIS natively supports remote backends, you can also use (PyTorch + AIS) to iterate over Amazon S3, GCS and Azure buckets, and more.
- Getting Started
- Technical Blog
- API and SDK
- Amazon S3
- CLI
- Security and Access Control
- Power tools and extensions
- Benchmarking and tuning Performance
- Buckets and Backend Providers
- Storage Services
- Cluster Management
- Configuration
- Observability
- For users and developers
- Getting started
- Docker
- Useful scripts
- Profiling, race-detecting and more
- Batch jobs
- Assorted Topics
- Virtual directories
- System files
- Switching cluster between HTTP and HTTPS
- TLS: testing with self-signed certificates
- Feature flags
- `aisnode` command line
- Traffic patterns
- Highly available control plane
- Start/stop maintenance mode, shutdown, decommission, and related operations
- Downloader
- On-disk layout
- Buckets: definition, operations, properties
- Out-of-band updates
MIT
Alex Aizman (NVIDIA)