CloudClusters.jl

A package for creating, using, and managing clusters of virtual machine (VM) instances deployed with IaaS cloud providers.

Target users

CloudClusters.jl targets users of the Julia programming language who need on-demand access to the cutting-edge computing resources offered by IaaS cloud providers to meet the requirements of high-performance computing (HPC) applications.

Prerequisites

Cloud providers' credentials

Currently, CloudClusters.jl supports AWS EC2 and Google Cloud Platform (GCP). Support for other IaaS cloud providers may be added in future versions.

CloudClusters.jl assumes that the user has configured the system with the required credentials for the cloud providers' services they will use. For GCP, CloudClusters.jl starts a session using the JSON credential file provided through the GOOGLE_APPLICATION_CREDENTIALS environment variable. In turn, the EC2 API looks for credential files in the $HOME/.aws folder.
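For example, the GCP credentials may be set before loading the package (a minimal sketch; the JSON path is illustrative):

ENV["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/gcp-credentials.json"  # illustrative path

using CloudClusters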

The configuration file (CCconfig.toml)

Creating clusters with CloudClusters.jl requires specifying some configuration parameters. By default, they are specified in a file named CCconfig.toml that is searched in the following locations, in this order:

  • the path pointed to by the CLOUD_CLUSTERS_CONFIG environment variable, if defined;
  • the current path;
  • the home path.
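
For example, the user may point CloudClusters.jl to a specific configuration file before loading the package (a minimal sketch; the path is illustrative):

ENV["CLOUD_CLUSTERS_CONFIG"] = "/home/myuser/my-project/CCconfig.toml"  # illustrative path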

Section Configuration parameters describes default configuration parameters and how they can be overridden in programs.

A CCconfig.toml file is provided in the repository's top-level directory and is downloaded to the current directory if no CCconfig.toml file is found. It is configured to create clusters using prebuilt virtual machine images for each supported cloud provider. These images are based on the latest version of Ubuntu and include a recent stable Julia installation with all the packages needed to instantiate the clusters added and precompiled. Users can create customized images, possibly derived from the provided ones, using their preferred version of Julia and adding the packages they need.

[!WARNING]

The version of Julia on the host computer running CloudClusters.jl must be the same as the version of Julia in the image used to deploy the clusters.

[!NOTE]

The current prebuilt image for EC2 is located in the us-east-1 (North Virginia) region. If the user intends to deploy a cluster in another region, they must create a copy of the image for that region in their account and assign its ID to the imageid parameter of CCconfig.toml.

The PlatformAware.jl package

CloudClusters.jl relies on an experimental package called PlatformAware.jl for the specification of platform types, aimed at specifying assumptions about the architectural features of virtual machine instances. Indeed, PlatformAware.jl may be used with CloudClusters.jl to write functions specifically tuned to the features of the VM instances that comprise the clusters. This is called platform-aware programming. The users of CloudClusters.jl, particularly package developers, are invited to explore and use the ideas behind PlatformAware.jl.

Section The integration with PlatformAware.jl provides a deeper discussion about the integration of PlatformAware.jl within CloudClusters.jl.

Tutorial

Next, we show a tutorial on how CloudClusters.jl works, divided into two parts: basic use and advanced use.

The basic tutorial teaches the reader how to create and deploy computations on peer-workers clusters, comprising a set of homogeneous VM instances deployed in the infrastructure of an IaaS cloud provider.

The advanced tutorial covers working with cluster contracts, Peer-Workers-MPI clusters, Manager-Workers clusters, and configuration parameters.

Basic use

In what follows, we teach how to create peer-workers clusters and deploy computations on them using Distributed.jl primitives.

How to create a cluster

CloudClusters.jl offers seven primitives, as macros, to create and manage a cluster's lifecycle. They are: @cluster, @resolve, @deploy, @terminate, @interrupt, @resume, and @restart.
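As a roadmap, the following minimal sketch chains these primitives over a cluster's lifecycle; each primitive is detailed in the sequel, and the contract parameters are illustrative:

using CloudClusters

contract = @cluster node_count => 2 node_machinetype => PlatformAware.EC2Type_T3_Medium
@resolve contract           # find an instance type that satisfies the contract
cluster = @deploy contract  # create the VM instances and worker processes

@interrupt cluster          # stop the VM instances (worker processes are killed)
@resume cluster             # restart the VM instances with fresh worker processes
@restart cluster            # recreate the worker processes only

@terminate cluster          # release all cloud resources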

First, let's try a simple scenario where a user creates a cluster comprising four t3.xlarge virtual machine (VM) instances through the AWS EC2 services. In the simplest way to do this, the user applies the @cluster macro to the number of nodes and the instance type, as arguments.

using CloudClusters
my_first_cluster_contract = @cluster node_count => 4 node_machinetype => PlatformAware.EC2Type_T3_xLarge

EC2Type_T3_xLarge is a Julia type from the PlatformAware.jl package that represents the t3.xlarge instance type (and size) of EC2. PlatformAware.jl offers a hierarchy of types representing the instance types of supported providers (e.g., MachineType → EC2Type → EC2Type_T3 → EC2Type_T3_xLarge and MachineType → GCPType → GCPType_E2 → GCPType_E2_Medium).

The user may list all the supported EC2 instance types by executing subtypes(PlatformAware.EC2Type) in the REPL, or subtypes(PlatformAware.EC2Type_T3) to list the available instance sizes for the t3 instance type. Analogously for GCP.

@cluster does not instantiate a cluster yet. It creates a cluster contract and returns a handle for it. In the example, the contract handle is stored in the my_first_cluster_contract variable, from which the user can create one or more clusters later.

[!NOTE]

In CloudClusters.jl, a handle is a symbol comprising 15 random lower- and upper-case alphabetic characters (e.g., :FXqElAnSeTEpQAm). As symbols, they are printable and may be used directly to refer to a cluster contract.

A cluster contract must be resolved before creating clusters using it. For that, the user needs to apply @resolve to the contract handle, as below:

@resolve my_first_cluster_contract

The @resolve macro triggers a resolution procedure to calculate which instance type offered by one of the supported IaaS providers satisfies the contract. For my_first_cluster_contract, the result is explicitly specified: the t3.xlarge instance type of AWS EC2. For advanced contract specifications, where cluster contract resolution shows its power, see the Working with cluster contracts section.

A cluster may be instantiated by using @deploy:

my_first_cluster = @deploy my_first_cluster_contract

The @deploy macro will create a 4-node cluster comprising t3.xlarge AWS EC2 instances, returning a cluster handle, assigned to the my_first_cluster variable.

After @deploy, a set of worker processes is created, one at each cluster node. Their PIDs may be inspected by applying the @nodes macro to the cluster handle.

In the following code, the user uses @nodes to fetch the PIDs of the processes running at the nodes of the cluster referred to by my_first_cluster.

julia> @nodes my_first_cluster
4-element Vector{Int64}:
 2
 3
 4
 5

The example shows that the default number of worker processes per cluster node is 1. However, the user may create N worker processes per cluster node using the node_process_count => N parameter in the contract specification. For example, in the following contract, the number of worker processes per cluster node is set to 2:

@cluster node_count => 4 node_process_count => 2 node_machinetype => EC2Type_T3_xLarge
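
Deploying such a contract yields two worker processes per node; for a 4-node cluster, @nodes returns eight PIDs (illustrative output, assuming a cluster handle my_cluster):

julia> @nodes my_cluster
8-element Vector{Int64}:
 2
 3
 4
 5
 6
 7
 8
 9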

Running computations on the cluster

The user may execute parallel computations on the cluster using Distributed.jl operations. In fact, the user can employ any parallel/distributed computing package in the Julia ecosystem to launch computations across a set of worker processes. For instance, the advanced tutorial will show how to use MPI.jl integrated with Distributed.jl.

The following code, adapted from The ultimate guide to distributed computing in Julia, processes a set of CSV files in a data folder in parallel, using pmap, across the worker processes placed at the cluster nodes. The result of each file processing is saved locally, as a CSV file in a results folder.

using Distributed
using ProgressMeter   # @showprogress runs on the driver process

@everywhere cluster_nodes(my_first_cluster) begin

	# load dependencies
	using ProgressMeter
	using CSV

	# helper functions
	function process(infile, outfile)
		# read file from disk
		csv = CSV.File(infile)

		# perform calculations
		sleep(60)

		# save new file to disk
		CSV.write(outfile, csv)
	end
end

# MAIN SCRIPT
# -----------

# relevant directories
indir = joinpath(@__DIR__, "data")
outdir = joinpath(@__DIR__, "results")

# files to process
infiles = readdir(indir, join=true)
outfiles = joinpath.(outdir, basename.(infiles))
nfiles = length(infiles)

# map the processing over the cluster's worker processes
status = @showprogress pmap(WorkerPool(cluster_nodes(my_first_cluster)), 1:nfiles) do i
	try
		process(infiles[i], outfiles[i])
		true  # success
	catch e
		false  # failure
	end
end

Multiple clusters

Users can create cluster contracts and deploy clusters from them as many times as they need. For example, the following code creates a second cluster contract, named my_second_cluster_contract, asking for a cluster comprising eight VM instances equipped with exactly eight NVIDIA GPUs of the Ada Lovelace architecture and at least 512GB of memory per node. Then, it creates two clusters from the new contract.

my_second_cluster_contract = @cluster(node_count => 8,
                                      node_memory_size => @atleast(512G),
                                      accelerator_count => @just(8),
                                      accelerator_architecture => Ada)

@resolve my_second_cluster_contract

my_second_cluster = @deploy my_second_cluster_contract
my_third_cluster = @deploy my_second_cluster_contract

This is an advanced use of cluster contracts, requiring instance types that satisfy a set of assumptions specified in the contract through instance parameters. At the time of writing, the AWS EC2 instance type satisfying these assumptions is g6.48xlarge, equipped with eight NVIDIA L4 Tensor Core GPUs and 768GB of memory.

Now, there are three available clusters. The PIDs of the two new ones may also be inspected:

julia> @nodes my_second_cluster
8-element Vector{Int64}:
 6
 7
 8
 9
 10
 11
 12
 13

julia> @nodes my_third_cluster
8-element Vector{Int64}:
 14
 15
 16
 17
 18
 19
 20
 21

The user may orchestrate the processing power of multiple clusters to run computations of their interest, independently of their providers. This is multicluster computation. However, communication between processes placed at nodes of different clusters (inter-cluster communication) is costly, especially when the clusters are deployed at different IaaS providers. It should be used with care, only when necessary, overlapping communication and computation through asynchronous operations.
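
As a minimal sketch of a multicluster computation, assuming the cluster_nodes function used elsewhere in this tutorial, independent tasks may be spawned on different clusters and their results combined on the driver:

using Distributed

# spawn one independent task on the first node of each cluster
t2 = @spawnat first(cluster_nodes(my_second_cluster)) sum(rand(10^7))
t3 = @spawnat first(cluster_nodes(my_third_cluster)) sum(rand(10^7))

# both tasks run concurrently while the driver waits on the fetches
result = fetch(t2) + fetch(t3)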

Interrupting and resuming a cluster

A cluster may be interrupted through the @interrupt macro:

@interrupt my_first_cluster

The effect of @interrupt is pausing/stopping the VM instances of the cluster nodes. The semantics of interrupting a cluster may vary across IaaS providers.

An interrupted cluster can be brought back to the running state using the @resume macro:

@resume my_first_cluster

The resume operation starts the VM instances and creates a fresh set of worker processes, with new PIDs.

[!CAUTION]

@interrupt does not preserve the state of ongoing computations in the cluster, since it kills the worker processes running at the cluster nodes. Interrupting a cluster may be used to avoid the cost of cloud resources that are not currently in use. The user is responsible for saving the state of ongoing computations before interrupting a cluster and reloading that state after resuming, if necessary.
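
A minimal checkpointing sketch around an interruption is shown below; get_state and load_state are hypothetical user-defined functions, assumed to be already defined on the workers:

using Distributed, Serialization

# before interrupting: collect and save the worker state on the driver
pids = cluster_nodes(my_first_cluster)
states = [@fetchfrom p get_state() for p in pids]  # get_state is hypothetical user code
serialize("checkpoint.jls", states)

@interrupt my_first_cluster
# ... later ...
@resume my_first_cluster  # fresh worker processes, new PIDs

# after resuming: reload the saved state into the new workers
states = deserialize("checkpoint.jls")
for (p, s) in zip(cluster_nodes(my_first_cluster), states)
    remotecall_wait(load_state, p, s)  # load_state is hypothetical user code
end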

Restarting processes

A user can restart the processes at the cluster nodes by using the @restart macro:

@restart my_first_cluster

The restart procedure kills all the current processes at the cluster nodes, losing their current state, and creates new processes, with fresh PIDs.

Terminating a cluster

Finally, a cluster may be terminated using the @terminate macro:

@terminate my_first_cluster

After termination, the cloud resources associated with the cluster are released.

How to reconnect to a non-terminated cluster

If a cluster was not terminated in a previous execution of a Julia program or REPL session, the user may reconnect to it using the @reconnect macro. For example:

@reconnect :FXqElAnSeTEpQAm

In the above code, :FXqElAnSeTEpQAm is the handle of a cluster that was not terminated in a previous execution session. But how can the user discover the handle of a non-terminated cluster, for example, after a system crash? For that, the user may invoke the @clusters macro, which returns the list of clusters created in previous sessions that are still alive and can be reconnected:

julia> @clusters
[ Info: PeerWorkers FXqElAnSeTEpQAm, created at 2024-10-08T09:12:40.847 on PlatformAware.AmazonEC2
1-element Vector{Any}:
Dict{Any, Any}(:handle => :FXqElAnSeTEpQAm, :provider => PlatformAware.AmazonEC2, :type => PeerWorkers, :timestamp => Dates.DateTime("2024-10-08T09:12:40.847"))

Advanced Use

Working with cluster contracts

As shown in the previous examples of using the @cluster macro, CloudClusters.jl supports cluster contracts to specify assumptions about cluster features, with special attention to the types of VM instances comprising cluster nodes.

A cluster contract is a set of key-value pairs k => v, called assumption parameters, where k is a parameter name and v is a value or platform type. A predefined set of assumption parameters is supported, each with a name and a default value or base platform type.

The currently supported set of assumption parameters is listed in the next section, providing a wide spectrum of assumptions through which users can specify the architectural characteristics of a cluster that satisfies their needs. Note that assumption parameters are classified into cluster parameters and instance parameters; only instance parameters are taken into consideration by the instance resolution procedure (@resolve).

In the case of my_first_cluster_contract, the user uses the assumption parameters node_count and node_machinetype to specify that the required cluster must have four nodes and that the VM instances comprising the cluster nodes must be of the t3.xlarge type, offered by the AWS EC2 provider. This is a direct approach, the simplest and least abstract one, in which the resolution procedure, triggered by a call to @resolve, returns EC2's t3.xlarge as the VM instance type that satisfies the contract.

On the other hand, my_second_cluster_contract employs an indirect approach, demonstrating that the resolution procedure may look for a VM instance type from a set of abstract assumptions. They are specified using the instance parameters accelerator_count, accelerator_architecture, and node_memory_size, asking for cluster nodes with eight GPUs of the NVIDIA Ada Lovelace architecture and at least 512GB of memory. Under these assumptions, the call to @resolve returns the g6.48xlarge instance type of AWS EC2.

List of assumption parameters

Cluster parameters specify features of the cluster:

  • cluster_type::Cluster, denoting the cluster type: ManagerWorkers, PeerWorkers, or PeerWorkersMPI;
  • node_count::Integer, denoting the number of cluster nodes (defaults to 1);
  • node_process_count::Integer, denoting the number of Julia processes (MPI ranks) per node (defaults to 1).

Instance parameters, with their respective base platform types, are listed below (a contract combining several of them is sketched after the list):

  • node_provider::CloudProvider, the provider of VM instances for the cluster nodes;
  • cluster_locale::Locale, the geographic location where the cluster nodes will be instantiated;
  • node_machinetype::InstanceType, the VM instance type of cluster nodes;
  • node_memory_size::@atleast 0, the memory size of each cluster node;
  • node_ecu_count::@atleast 1, the number of EC2 compute units, a processing performance measure for VM instances (EC2 only);
  • node_vcpus_count::@atleast 1, the number of virtual CPUs in each cluster node;
  • accelerator_count::@atleast 0, the number of accelerators in the cluster node;
  • accelerator_memory::@atleast 0, the amount of memory of the cluster node accelerators;
  • accelerator_type::AcceleratorType, the type of accelerator;
  • accelerator_manufacturer::AcceleratorManufacturer, the manufacturer of the accelerator;
  • accelerator_architecture::AcceleratorArchitecture, the architecture of the accelerator, depending on its type and manufacturer;
  • accelerator::AcceleratorModel, the accelerator model;
  • processor_manufacturer::Manufacturer, the processor manufacturer;
  • processor_microarchitecture::ProcessorArchitecture, the processor microarchitecture;
  • processor::ProcessorModel, the processor model;
  • storage_type::StorageType, the type of storage in cluster nodes;
  • storage_size::@atleast 0, the size of the storage in cluster nodes;
  • network_performance::@atleast 0, the network performance between cluster nodes.
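
For illustration, a contract might combine several instance parameters as below. The platform types GPU and NVIDIA appear in the @select output shown later; the particular combination is illustrative and may or may not resolve unambiguously:

cc = @cluster(node_count => 2,
              accelerator_type => GPU,
              accelerator_manufacturer => NVIDIA,
              accelerator_count => @atleast(1),
              node_memory_size => @atleast(128G))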

Most platform types are specified in the PlatformAware.jl package. The user may open a REPL session to query the types defined in PlatformAware.jl. For example, the user may apply the subtypes function to learn the subtypes of a given platform type, which define the available choices:

julia> using PlatformAware

julia> subtypes(Accelerator)
3-element Vector{Any}:
NVIDIAAccelerator
AMDAccelerator
IntelAccelerator

julia> subtypes(EC2Type_T3)
8-element Vector{Any}:
EC2Type_T3A
EC2Type_T3_2xLarge
EC2Type_T3_Large
EC2Type_T3_Medium
EC2Type_T3_Micro
EC2Type_T3_Nano
EC2Type_T3_Small
EC2Type_T3_xLarge

Querying contracts

In the current implementation of CloudClusters.jl, contract resolution through @resolve is implemented on top of Julia's multiple dispatch mechanism and therefore does not support ambiguity, i.e., exactly one VM instance type must satisfy the contract. Otherwise, @resolve raises an ambiguity error, as in the example below:

julia> cc = @cluster(node_count => 4,
                     accelerator_count => @atleast(4),
                     accelerator_architecture => Ada,
                     node_memory_size => @atleast(256G))
:NKPlCvagfSSpIgD

julia> @resolve cc
ERROR: MethodError: resolve(::Type{CloudProvider}, ::Type{MachineType}, ::Type{Tuple{AtLeast256G, AtMostInf, var"#92#X"} where var"#92#X"}, ::Type{Tuple{AtLeast1, AtMostInf, Q} where Q}, ::Type{Tuple{AtLeast4, AtMostInf, var"#91#X"} where var"#91#X"}, ::Type{AcceleratorType}, ::Type{Ada}, ::Type{Manufacturer}, ::Type{Tuple{AtLeast0, AtMostInf, Q} where Q}, ::Type{Accelerator}, ::Type{Processor}, ::Type{Manufacturer}, ::Type{ProcessorMicroarchitecture}, ::Type{StorageType}, ::Type{Tuple{AtLeast0, AtMostInf, Q} where Q}, ::Type{Tuple{AtLeast0, AtMostInf, Q} where Q}) is ambiguous.

The user can use the @select macro to query which instance types satisfy the ambiguous contract:

julia> @select(node_count => 4,
               accelerator_count => @atleast(4),
               accelerator_architecture => Ada,
               node_memory_size => @atleast(256G))
┌ Warning: Only instance features are allowed. Ignoring node_count.
└ @ CloudClusters ~/Dropbox/Copy/ufc_mdcc_hpc/CloudClusters.jl/src/resolve.jl:78
Dict{String, Any} with 3 entries:
"g6.48xlarge" => Dict{Symbol, Any}(:processor => Type{>:AMDEPYC_7R13}, :accelerator_architecture => Type{>:Ada}, :processor_manufacturer => Type{>:AMD}, :storage_type => Type{>:StorageType_EC2_NVMeSSD}, :node_memory_size => Type{>:Tuple{AtLeast512G, AtMost1T, 8.24634e11}}, :storage_size => Type{>:Tuple{AtLeast32T, AtMost64T, 6.52835e13}}, :node_provider => Type{>:AmazonEC2}, :node_vcpus_count => Type{>:Tuple{AtLeast128, AtMost256, 192.0}}, :accelerator_count => Type{>:Tuple{AtLeast8, AtMost8, 8.0}}, :network_performance => Type{>:Tuple{AtLeast64G, AtMost128G, 1.07374e11}}, :accelerator => Type{>:NVIDIA_L4}, :accelerator_type => Type{>:GPU}, :accelerator_memory_size => Type{>:Tuple{AtLeast16G, AtMost32G, 2.57698e10}}, :accelerator_manufacturer => Type{>:NVIDIA}, :node_machinetype => Type{>:EC2Type_G6_48xLarge}, :processor_microarchitecture => Type{>:Zen})
"g2-standard-96" => Dict{Symbol, Any}(:processor => Type{>:IntelXeon_8280L}, :accelerator_architecture => Type{>:Ada}, :processor_manufacturer => Type{>:Intel}, :storage_type => Type{>:StorageType}, :node_memory_size => Type{>:Tuple{AtLeast256G, AtMost512G, 4.12317e11}}, :storage_size => Type{>:Tuple{AtLeast0, AtMostInf, Q} where Q}, :node_provider => Type{>:GoogleCloud}, :node_vcpus_count => Type{>:Tuple{AtLeast64, AtMost128, 96.0}}, :accelerator_count => Type{>:Tuple{AtLeast8, AtMost8, 8.0}}, :network_performance => Type{>:Tuple{AtLeast64G, AtMost128G, 1.07374e11}}, :accelerator => Type{>:NVIDIA_L4}, :accelerator_type => Type{>:GPU}, :accelerator_memory_size => Type{>:Tuple{AtLeast16G, AtMost32G, 2.57698e10}}, :accelerator_manufacturer => Type{>:NVIDIA}, :node_machinetype => Type{>:GCPType_G2}, :processor_microarchitecture => Type{>:CascadeLake})
"g6.24xlarge" => Dict{Symbol, Any}(:processor => Type{>:AMDEPYC_7R13}, :accelerator_architecture => Type{>:Ada}, :processor_manufacturer => Type{>:AMD}, :storage_type => Type{>:StorageType_EC2_NVMeSSD}, :node_memory_size => Type{>:Tuple{AtLeast256G, AtMost512G, 4.12317e11}}, :storage_size => Type{>:Tuple{AtLeast8T, AtMost16T, 1.63209e13}}, :node_provider => Type{>:AmazonEC2}, :node_vcpus_count => Type{>:Tuple{AtLeast64, AtMost128, 96.0}}, :accelerator_count => Type{>:Tuple{AtLeast4, AtMost4, 4.0}}, :network_performance => Type{>:Tuple{AtLeast32G, AtMost64G, 5.36871e10}}, :accelerator => Type{>:NVIDIA_L4}, :accelerator_type => Type{>:GPU}, :accelerator_memory_size => Type{>:Tuple{AtLeast16G, AtMost32G, 2.57698e10}}, :accelerator_manufacturer => Type{>:NVIDIA}, :node_machinetype => Type{>:EC2Type_G6_24xLarge}, :processor_microarchitecture => Type{>:Zen})

Notice that @select emits a warning because node_count is ignored since only instance features are considered in contract resolution.

Three VM instance types satisfy the contract, since they offer at least 256GB of memory and at least four NVIDIA GPUs of the Ada architecture (L4 Tensor Core): g6.48xlarge, g2-standard-96, and g6.24xlarge. The user may inspect the features of each instance type and write a contract that selects one directly:

julia> cc = @cluster node_count => 4 node_machinetype => EC2Type_G6_48xLarge
:mBrvXUsilkpxWJC

julia> @resolve cc
1-element Vector{Pair{Symbol, SubString{String}}}:
:instance_type => "g6.48xlarge"

Peer-Workers-MPI clusters

Peer-Workers-MPI is a variation of Peer-Workers clusters, where worker processes are connected through a global MPI communicator. This is possible through MPI.jl and MPIClusterManagers.jl.

In what follows, we modify my_second_cluster_contract to build a Peer-Workers-MPI cluster, which will be referred to by my_fourth_cluster, by using the cluster_type parameter:

my_third_cluster_contract = @cluster(cluster_type => PeerWorkersMPI,
                                     node_count => 8,
                                     node_memory_size => @atleast(512G),
                                     accelerator_count => @just(8),
                                     accelerator_architecture => Ada)

my_fourth_cluster = @deploy my_third_cluster_contract

The following code launches a simple MPI.jl code in my_fourth_cluster, using the @everywhere primitive of Distributed.jl.

@everywhere cluster_nodes(my_fourth_cluster) begin
	@eval using MPI
	MPI.Init()
	rank = MPI.Comm_rank(MPI.COMM_WORLD)
	size = MPI.Comm_size(MPI.COMM_WORLD)
	@info "I am $rank among $size processes"
	root_rank = 0
	rank_sum = MPI.Reduce(rank, (x,y) -> x + y, root_rank, MPI.COMM_WORLD)
end

result = @fetchfrom ranks(my_fourth_cluster)[0] rank_sum
@info "The sum of ranks in the cluster is $result"

The parallel code sums the ranks of the processes using the Reduce collective operation of MPI.jl and stores the result in the global variable rank_sum of the root process (rank 0). Then, this value is fetched by the program and assigned to the result variable using @fetchfrom. For that, the ranks function is used to discover the PID of the root process.

Manager-Workers clusters

A Manager-Workers cluster comprises an access node and a homogeneous set of compute nodes. The compute nodes are only accessible from the access node. The instance type of the access node may differ from the instance type of the compute nodes.

In a Manager-Workers cluster, the master process, running in the REPL or main program, is called the driver process. It is responsible for launching the so-called entry process in the cluster's access node. In turn, the entry process launches worker processes across the compute nodes, using MPIClusterManagers.jl. The worker processes perform the computation, while the entry process is responsible for communication between the driver and the worker processes. A global MPI communicator exists between worker processes, like in Peer-Workers-MPI clusters.

A Manager-Workers cluster is useful when compute nodes are not directly accessible from the external network. This is a common situation in on-premises clusters, but it also arises in clusters built from the services of cloud providers specifically tailored to HPC applications.

[!IMPORTANT]

Manager-Workers clusters are not natively supported by Julia, because Distributed.jl does not allow worker processes to create new processes, as shown below:

 julia> addprocs(1)
 1-element Vector{Int64}:
  2

 julia> @fetchfrom 2 addprocs(1)
 ERROR: On worker 2:
 Only process 1 can add or remove workers

The CloudClusters.jl developers maintain an extended version of Distributed.jl that removes this limitation, making it possible to create hierarchies of Julia processes [2]. However, this multilevel extension of Distributed.jl is necessary only on the access node of a Manager-Workers cluster, where the so-called entry process runs; the entry process is launched by the master process at the REPL/program and is responsible for launching the worker processes across the compute nodes of the cluster.

Therefore, only users who need to develop customized images for cluster nodes must be concerned with adapting the Julia installation to the extended Distributed.jl version, and only if the image is intended to be used for the manager node of a Manager-Workers cluster.

The multilevel extension to Distributed.jl is hosted at https://github.com/PlatformAwareProgramming/Distributed.jl, as a fork of the original Distributed.jl repository. The README of Distributed.jl explains how to use development versions in a current Julia installation. In case of difficulties, the user may contact the developers of CloudClusters.jl. For more information about the multilevel extension of Distributed.jl, read the SSCAD'2024 paper Towards multicluster computations with Julia.

Users may apply the cluster_type parameter to request the creation of a Manager-Workers cluster. Let us modify my_first_cluster_contract to create a Manager-Workers cluster instead of a Peer-Workers one (the default):

my_first_cluster_contract = @cluster(cluster_type => ManagerWorkers,
                                     node_count => 4,
                                     node_machinetype => EC2Type_T3_xLarge)

In this case, the node_count parameter specifies the number of worker nodes. So, for a cluster deployed using my_first_cluster_contract, five VM instances will be created, including the manager node.

The user may use "dot notation" to specify different assumptions for manager and worker nodes. For example:

my_second_contract = @cluster(cluster_type => ManagerWorkers,
                              node_count => 8,
                              manager.node_machinetype => EC2Type_T3_xLarge,
                              worker.accelerator_count => @just(8),
                              worker.accelerator_architecture => Ada,
                              worker.accelerator_memory => @atleast(512G))

This contract specifies that the manager node must be a t3.xlarge VM instance, while each worker node must have eight NVIDIA GPUs of the Ada architecture with at least 512GB of accelerator memory.

Configuration parameters

Configuration parameters control the proper instantiation of clusters; their default values are specified in the CCconfig.toml file. The user may override the defaults by passing configuration parameters through the @cluster and @deploy operations. For instance:

my_cluster_contract = @cluster(node_count => 4,
                               node_machinetype => EC2Type_T3_xLarge,
                               image_id => "ami-07f6c5b6de73ce7ae")

my_cluster = @deploy(my_cluster_contract,
                     user => "ubuntu",
                     sshflags => "-i mykey.pem")

In the above code, image_id specifies that the EC2 image identified by ami-07f6c5b6de73ce7ae must be used when creating clusters from my_cluster_contract. On the other hand, user and sshflags will be used to access the nodes of my_cluster. For instance, ami-07f6c5b6de73ce7ae may provide a set of predefined users with different privileges to access the features offered by such an image.

Currently, there are four categories of configuration parameters. They are described in the following paragraphs.

The following configuration parameters set up the SSH connections to the nodes of Peer-Workers clusters and to the manager node of Manager-Workers clusters, i.e., the nodes that are externally accessible:

  • user::String, the user login to access the VM instances (e.g., ubuntu in ubuntu@xxx.xxx.xxx.xxx, where xxx.xxx.xxx.xxx is the public IP of the VM instance);
  • sshflags::String, the flags that must be passed to the ssh command to access the VM instances;
  • tunneled::Bool, a keyword argument to be passed to addprocs to determine whether or not ssh access should be tunneled.

The following configuration parameters apply to cluster nodes of any cluster type:

  • exename::String, the full path to the julia executable (e.g., /home/ubuntu/.juliaup/bin/julia);
  • exeflags::String, flags to be passed to the julia executable when starting processes on cluster nodes;
  • directory::String, the current directory of the julia execution in the VM instance.

The following configuration parameters apply to the nodes of Peer-Workers-MPI clusters and the worker nodes of Manager-Workers clusters, i.e., the ones with MPI-based message passing enabled (an example override follows the list):

  • threadlevel::Symbol, a keyword argument passed to MPI.Init, whose possible values are :single, :serialized, :funneled, and :multiple;
  • mpiflags::String, a keyword argument passed to MPI (e.g., "--map-by node --hostfile /home/ubuntu/hostfile").
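
For example, these parameters might be overridden at deployment of the Peer-Workers-MPI contract defined earlier (a sketch; the flag values are illustrative):

my_mpi_cluster = @deploy(my_third_cluster_contract,
                         threadlevel => :multiple,
                         mpiflags => "--map-by node")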

The last set of configuration parameters depends on the IaaS provider selected through @resolve; an illustrative CCconfig.toml excerpt follows the lists below.

For AWS EC2, they are:

  • imageid::String, the ID of the image used to instantiate the VM instances that form the cluster nodes;
  • subnet_id::String, the ID of a subnet for the communication between VM instances that form the cluster nodes;
  • placement_group::String, the ID of an existing placement group where the user wishes to colocate the VM instances that form the cluster nodes (the default is to create a temporary placement group);
  • security_group_id::String, the ID of an existing security group for the VM instances that form the cluster nodes.

Finally, for GCP, they are:

  • imageid::String, the ID of the image used to instantiate the VM instances that form the cluster nodes;
  • zone::String, the zone where the cluster node instances will be placed;
  • project::String, the project where the cluster node instances will be created;
  • network_interface::String, the network interface of cluster node instances.
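
For illustration only, a CCconfig.toml excerpt might combine common and EC2-specific parameters as below; the flat layout is an assumption, and the CCconfig.toml shipped with the repository is the authoritative reference:

# hypothetical CCconfig.toml excerpt; the layout is an assumption
user = "ubuntu"
exename = "/home/ubuntu/.juliaup/bin/julia"
directory = "/home/ubuntu"
threadlevel = "serialized"
imageid = "ami-07f6c5b6de73ce7ae"  # the EC2 image from the example above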

The integration with PlatformAware.jl

UNDER CONSTRUCTION

Publications

  • Francisco Heron de Carvalho Junior, João Marcelo Uchoa de Alencar, and Claro Henrique Silva Sales. 2024. Cloud-based parallel computing across multiple clusters in Julia. In Proceedings of the 28th Brazilian Symposium on Programming Languages (SBLP'2024), September 30, 2024, Curitiba, Brazil. SBC, Porto Alegre, Brasil, 44-52. DOI: https://doi.org/10.5753/sblp.2024.3470.

  • Francisco Heron de Carvalho Junior and Tiago Carneiro. 2024. Towards multicluster computations with Julia. In Proceedings of the XXV Symposium on High-Performance Computational Systems (SSCAD'2024), October 25, 2024, São Carlos, Brazil. SBC, Porto Alegre, Brazil. DOI: https://doi.org/10.5753/sscad.2024.244307.
