wip: add support for node extraction -> cluster metadata #26

Merged 2 commits on Feb 25, 2024
5 changes: 4 additions & 1 deletion .gitignore
@@ -17,4 +17,7 @@
# Dependency directories (remove the comment below to include it)
# vendor/
bin
vendor
vendor
cache
lib
*.json
35 changes: 35 additions & 0 deletions Makefile.hwloc
@@ -0,0 +1,35 @@
# This makefile will be used when we can add hwloc - there is currently a bug.
HERE ?= $(shell pwd)
LOCALBIN ?= $(shell pwd)/bin

# Install hwloc here for use in compiling, etc.
LOCALLIB ?= $(shell pwd)/lib
HWLOC_INCLUDE ?= $(LOCALLIB)/include/hwloc.h
BUILDENVVAR=CGO_CFLAGS="-I$(LOCALLIB)/include" CGO_LDFLAGS="-L$(LOCALLIB)/lib -lhwloc"

.PHONY: all

all: build

.PHONY: $(LOCALBIN)
$(LOCALBIN):
	mkdir -p $(LOCALBIN)

.PHONY: $(LOCALLIB)
$(LOCALLIB):
	mkdir -p $(LOCALLIB)

$(HWLOC_INCLUDE):
	git clone --depth 1 https://github.com/open-mpi/hwloc /tmp/hwloc || true && \
	cd /tmp/hwloc && ./autogen.sh && \
	./configure --enable-static --disable-shared LDFLAGS="-static" --prefix=$(LOCALLIB)/ && \
	make LDFLAGS=-all-static && make install

build: $(LOCALBIN) $(HWLOC_INCLUDE)
	GO111MODULE="on" $(BUILDENVVAR) go build -ldflags '-w' -o $(LOCALBIN)/compspec cmd/compspec/compspec.go

build-arm: $(LOCALBIN) $(HWLOC_INCLUDE)
	GO111MODULE="on" $(BUILDENVVAR) GOARCH=arm64 go build -ldflags '-w' -o $(LOCALBIN)/compspec-arm cmd/compspec/compspec.go

build-ppc: $(LOCALBIN) $(HWLOC_INCLUDE)
	GO111MODULE="on" $(BUILDENVVAR) GOARCH=ppc64le go build -ldflags '-w' -o $(LOCALBIN)/compspec-ppc cmd/compspec/compspec.go
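
A likely way to use this alternate makefile, assuming GNU make and the filename above, is sketched below; the default Makefile is not affected.

```bash
# Build compspec against the locally built static hwloc (sketch; targets come from Makefile.hwloc above)
make -f Makefile.hwloc build

# Cross-compilation variants defined above
make -f Makefile.hwloc build-arm
make -f Makefile.hwloc build-ppc
```
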
2 changes: 2 additions & 0 deletions README.md
@@ -11,6 +11,8 @@ This is a prototype compatibility checking tool. Right now our aim is to use in

- I'm starting with just Linux. I know there are those "other" platforms, but if it doesn't run on HPC or Kubernetes easily I'm not super interested (ahem, Mac and Windows)!
- not all extractors work in containers (e.g., kernel needs to be on the host)
- The node feature discovery source doesn't provide a mapping of socket -> cores, nor does it give details about logical vs. physical CPUs.
- We will likely want to add hwloc Go bindings, but there is currently a bug.

Note that for development we are using nfd-source that does not require kubernetes:

38 changes: 29 additions & 9 deletions cmd/compspec/compspec.go
@@ -54,12 +54,21 @@ func main() {
cachePath := matchCmd.String("", "cache", &argparse.Options{Help: "A path to a cache for artifacts"})
saveGraph := matchCmd.String("", "cache-graph", &argparse.Options{Help: "Load or use a cached graph"})

// Create arguments
options := createCmd.StringList("a", "append", &argparse.Options{Help: "Append one or more custom metadata fields to append"})
specname := createCmd.String("i", "in", &argparse.Options{Required: true, Help: "Input yaml that contains spec for creation"})
specfile := createCmd.String("o", "out", &argparse.Options{Help: "Save compatibility json artifact to this file"})
mediaType := createCmd.String("m", "media-type", &argparse.Options{Help: "The expected media-type for the compatibility artifact"})
allowFailCreate := createCmd.Flag("f", "allow-fail", &argparse.Options{Help: "Allow any specific extractor to fail (and continue extraction)"})
// Create subcommands - note that "nodes" could be cluster, but could want to make a subset of one
artifactCmd := createCmd.NewCommand("artifact", "Create a new artifact")
nodesCmd := createCmd.NewCommand("nodes", "Create nodes in Json Graph format from extraction data")

// Artifact creation arguments
options := artifactCmd.StringList("a", "append", &argparse.Options{Help: "Append one or more custom metadata fields to append"})
specname := artifactCmd.String("i", "in", &argparse.Options{Required: true, Help: "Input yaml that contains spec for creation"})
specfile := artifactCmd.String("o", "out", &argparse.Options{Help: "Save compatibility json artifact to this file"})
mediaType := artifactCmd.String("m", "media-type", &argparse.Options{Help: "The expected media-type for the compatibility artifact"})
allowFailCreate := artifactCmd.Flag("f", "allow-fail", &argparse.Options{Help: "Allow any specific extractor to fail (and continue extraction)"})

// Nodes creation arguments
nodesOutFile := nodesCmd.String("", "nodes-output", &argparse.Options{Help: "Output json file for cluster nodes"})
nodesDir := nodesCmd.String("", "node-dir", &argparse.Options{Required: true, Help: "Input directory with extraction data for nodes"})
clusterName := nodesCmd.String("", "cluster-name", &argparse.Options{Required: true, Help: "Cluster name to describe in graph"})

// Now parse the arguments
err := parser.Parse(os.Args)
@@ -75,10 +84,21 @@ func main() {
log.Fatalf("Issue with extraction: %s\n", err)
}
} else if createCmd.Happened() {
err := create.Run(*specname, *options, *specfile, *allowFailCreate)
if err != nil {
log.Fatal(err.Error())
if artifactCmd.Happened() {
err := create.Artifact(*specname, *options, *specfile, *allowFailCreate)
if err != nil {
log.Fatal(err.Error())
}
} else if nodesCmd.Happened() {
err := create.Nodes(*nodesDir, *clusterName, *nodesOutFile)
if err != nil {
log.Fatal(err.Error())
}
} else {
fmt.Println(Header)
fmt.Println("Please provide a --node-dir and (optionally) --nodes-output (json file to write)")
}

} else if matchCmd.Happened() {
err := match.Run(
*manifestFile,
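
Based on the flags wired up above, the new subcommands would presumably be invoked along these lines (the input and output paths here are hypothetical placeholders):

```bash
# Create a compatibility artifact from a creation spec (spec.yaml is a placeholder path)
compspec create artifact --in ./spec.yaml --out ./compatibility-artifact.json

# Combine per-node extraction data into a single cluster graph (directory is a placeholder)
compspec create nodes --cluster-name cluster-red --node-dir ./nodes/ --nodes-output ./cluster-nodes.json
```
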
31 changes: 31 additions & 0 deletions cmd/compspec/create/artifact.go
@@ -0,0 +1,31 @@
package create

import (
"strings"

"github.com/compspec/compspec-go/plugins/creators/artifact"
)

// Artifact will create a compatibility artifact based on a request in YAML
// TODO likely want to refactor this into a proper create plugin
func Artifact(specname string, fields []string, saveto string, allowFail bool) error {

	// This is janky, oh well
	allowFailFlag := "false"
	if allowFail {
		allowFailFlag = "true"
	}

	// assemble options for the artifact creator
	creator, err := artifact.NewPlugin()
	if err != nil {
		return err
	}
	options := map[string]string{
		"specname":  specname,
		"fields":    strings.Join(fields, "||"),
		"saveto":    saveto,
		"allowFail": allowFailFlag,
	}
	return creator.Create(options)
}
23 changes: 23 additions & 0 deletions cmd/compspec/create/nodes.go
@@ -0,0 +1,23 @@
package create

import (
"github.com/compspec/compspec-go/plugins/creators/cluster"
)

// Nodes will read in one or more node extraction metadata files and generate a single nodes JGF graph
// This is intended for a registration command.
// TODO this should be converted to a creation (converter) plugin
func Nodes(nodesDir, clusterName, nodeOutFile string) error {

	// assemble options for node creator
	creator, err := cluster.NewPlugin()
	if err != nil {
		return err
	}
	options := map[string]string{
		"nodes-dir":    nodesDir,
		"cluster-name": clusterName,
		"node-outfile": nodeOutFile,
	}
	return creator.Create(options)
}
4 changes: 2 additions & 2 deletions cmd/compspec/extract/extract.go
@@ -15,7 +15,7 @@ func Run(filename string, pluginNames []string, allowFail bool) error {
// Womp womp, we only support linux! There is no other way.
operatingSystem := runtime.GOOS
if operatingSystem != "linux" {
return fmt.Errorf("🤓️ Sorry, we only support linux.")
return fmt.Errorf("🤓️ sorry, we only support linux")
}

// parse [section,...,section] into named plugins and sections
@@ -37,7 +37,7 @@ func Run(filename string, pluginNames []string, allowFail bool) error {
// This returns an array of bytes
b, err := result.ToJson()
if err != nil {
return fmt.Errorf("There was an issue marshalling to JSON: %s\n", err)
return fmt.Errorf("there was an issue marshalling to JSON: %s", err)
}
err = os.WriteFile(filename, b, 0644)
if err != nil {
6 changes: 3 additions & 3 deletions docs/README.md
@@ -2,9 +2,9 @@

This is early documentation that will be converted eventually to something prettier. Read more about:

- [Design](design.md)
- [Usage](usage.md)

- [Design](design.md) of compspec
- [Usage](usage.md) generic use cases
- [Rainbow](rainbow.md) use cases and examples for the rainbow scheduler

## Thanks and Previous Art

26 changes: 17 additions & 9 deletions docs/design.md
@@ -7,18 +7,28 @@ The compatibility tool is responsible for extracting information about a system,

## Definitions

### Extractor
### Plugin

An **extractor** is a core plugin that knows how to retrieve metadata about a host. An extractor is usually going to be run for two cases:
A plugin can define one or more functionalities:

- "Extract" is expected to know how to extract metadata about an application or environment
- "Create" is expected to create something from extracted data

This means that an **extractor** is a core plugin that knows how to retrieve metadata about a host. An extractor is usually going to be run for two cases:

1. During CI to extract (and save) metadata about a particular build to put in a compatibility artifact.
2. During image selection to extract information about the host to compare to.

Examples extractors could be "library" or "system."
Example extractors could be "library" or "system." You interact with extractor plugins via the "extract" command.

A **creator** is a plugin that is responsible for creating an artifact that includes some extracted metadata. The creator is agnostic to what it is being asked to generate in the sense that it just needs a mapping. The mapping will be from the extractor namespace to the compatibility artifact namespace. For our first prototype, this just means asking for particular extractor attributes to map to a set of annotations that we want to dump into json. To start, there should only be one creator plugin needed; however, if there are different structures of artifacts needed, I could imagine more. An example creation specification for a prototype experiment where we care about architecture, MPI, and GPU is provided in [examples](examples).

Plugins can be one or the other, or both.

### Section
#### Section

A **section** is a group of metadata within an extractor. For example, within "library" a section is for "mpi." This allows a user to specify running the `--name library[mpi]` extractor to ask for the mpi section of the library family. Another example is under kernel.
A **section** is a group of metadata typically within an extractor, and could also be defined for creators when we have more use cases.
For example, within "library" a section is for "mpi." This allows a user to specify running the `--name library[mpi]` extractor to ask for the mpi section of the library family. Another example is under kernel.
The user might want to ask for more than one group to be extracted and might ask for `--name kernel[boot,config]`. Section basically provides more granularity to an extractor namespace. For the above two examples, the metadata generated would be organized like:

```
@@ -31,12 +41,10 @@

For the above, right now I am implementing extractors generally, or "wild-westy" in the sense that the namespace is oriented toward the extractor name and sections it owns (e.g., no community namespaces like archspec, spack, opencontainers, etc). This is subject to change depending on the design the working group decides on.

### Creator

A creator is a plugin that is responsible for creating an artifact that includes some extracted metadata. The creator is agnostic to what it it being asked to generate in the sense that it just needs a mapping. The mapping will be from the extractor namespace to the compatibility artifact namespace. For our first prototype, this just means asking for particular extractor attributes to map to a set of annotations that we want to dump into json. To start there should only be one creator plugin needed, however if there are different structures of artifacts needed, I could imagine more. An example creation specification for a prototype experiment where we care about architecture, MPI, and GPU is provided in [examples](examples).

## Overview

> This was the original proposal and may be out of date.

The design is based on the prototype from that pull request, shown below.

![img/proposal-c-plugin-design.png](img/proposal-c-plugin-design.png)
Binary file added docs/img/rainbow-scheduler-register.png
48 changes: 48 additions & 0 deletions docs/rainbow/README.md
@@ -0,0 +1,48 @@
# Rainbow Scheduler

The [rainbow scheduler](https://github.com/converged-computing/rainbow) has a registration step that requires a cluster to send over node metadata. The reason is that when a user sends a request for work, the scheduler needs to understand
how to properly assign it. To do that, it needs to be able to see all the resources (clusters) available to it.

![../img/rainbow-scheduler-register.png](../img/rainbow-scheduler-register.png)

For the purposes of compspec here, we care about the registration step. This is what that includes:

## Registration

1. At registration, the cluster also sends over metadata about itself (and the nodes it has). This is going to allow for selection of those nodes.
1. When submitting a job, the user is no longer giving an exact command, but a command + an image with compatibility metadata. The compatibility metadata (somehow) needs to be used to inform the cluster selection.
1. At selection, the rainbow scheduler needs to filter down cluster options, and choose a subset.
   - Level 1: Don't ask, just choose the top choice and submit
   - Level 2: Ask the cluster for TBA time or cost, choose based on that.
   - The job is added to that queue.

Specifically, this means two steps for compspec go:

1. A step to ask each node to extract its own metadata, saved to a directory.
2. A second step to combine those nodes into a graph.

Likely we will take a simple approach: an extract for each node that captures its metadata in JSON Graph Format (JGF) and dumps it into a shared directory (we might imagine this being run with a flux job),
and then some combination step.

## Example

In the example below, we will extract node-level metadata with `compspec extract` and then generate the cluster JGF to send for registration with `compspec create nodes`.

### 1. Extract Metadata

Let's first generate faux node metadata for a "cluster" - I will just run an extraction a few times and generate equivalent files :) This isn't such a crazy idea because it emulates nodes that are the same!

```bash
mkdir -p ./docs/rainbow/cluster
compspec extract --name library --name nfd[cpu,memory,network,storage,system] --name system[cpu,processor,arch,memory] --out ./docs/rainbow/cluster/node-1.json
compspec extract --name library --name nfd[cpu,memory,network,storage,system] --name system[cpu,processor,arch,memory] --out ./docs/rainbow/cluster/node-2.json
compspec extract --name library --name nfd[cpu,memory,network,storage,system] --name system[cpu,processor,arch,memory] --out ./docs/rainbow/cluster/node-3.json
```

### 2. Create Nodes

Now we are going to give compspec the directory and ask it to create nodes. The output will be in JSON Graph Format (JGF) and is printed to the terminal:

```bash
compspec create nodes --cluster-name cluster-red --node-dir ./docs/rainbow/cluster/
```
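
To write the graph to a file instead of printing it, the `--nodes-output` flag added in this pull request can presumably be supplied as well (the output filename here is just an example):

```bash
compspec create nodes --cluster-name cluster-red --node-dir ./docs/rainbow/cluster/ --nodes-output ./cluster-red-nodes.json
```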