add ldms (via ovis-hpc) (#46)
* add ldms (via ovis-hpc)
* wrong url and remove non used prefix
* default completions should be 0 (unset)

Signed-off-by: vsoch <[email protected]>
vsoch authored Aug 17, 2023
1 parent 1a57bc0 commit fd599cc
Showing 10 changed files with 980 additions and 20 deletions.
1 change: 1 addition & 0 deletions .github/workflows/main.yaml
@@ -69,6 +69,7 @@ jobs:
test: [["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # performance test
["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # storage test
["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120], # storage test
["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120], # standalone app test
["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120], # standalone app test
["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120], # standalone app test
["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120], # standalone app test
8 changes: 8 additions & 0 deletions docs/_static/data/metrics.json
@@ -23,6 +23,14 @@
"image": "ghcr.io/converged-computing/metric-lammps:latest",
"url": "https://www.lammps.org/"
},
{
"name": "app-ldms",
"description": "provides LDMS, a low-overhead, low-latency framework for collecting, transferring, and storing metric data on a large distributed computer system.",
"family": "performance",
"type": "application",
"image": "ghcr.io/converged-computing/metric-ovis-hpc:latest",
"url": "https://github.com/ovis-hpc/ovis"
},
{
"name": "app-pennant",
"description": "Unstructured mesh hydrodynamics for advanced architectures ",
47 changes: 28 additions & 19 deletions docs/getting_started/metrics.md
@@ -11,7 +11,7 @@ Each of the above is a metric design, which is primarily represented in the Metr
there are different families of metrics (e.g., storage, network, performance, simulation) shown in the table below as the "Family" column.
We likely will tweak and improve upon these categories.

<iframe src="../_static/data/table.html" style="width:100%; height:850px;" frameBorder="0"></iframe>
<iframe src="../_static/data/table.html" style="width:100%; height:900px;" frameBorder="0"></iframe>


## Implemented Metrics
@@ -21,7 +21,7 @@ family once we decide on a more final set.

### Performance

These metrics are intended to assess application performance.
These metrics are intended to assess application performance, where they run alongside an application of interest.

#### perf-sysstat

@@ -32,7 +32,7 @@ These metrics are intended to assess application performance.
This metric provides the "pidstat" executable of the sysstat library. The following options are available:


|Name | Description | Type | Default |
| Name | Description | Type | Default |
|-----|-------------|------------|------|
| color | Set to turn on color parsing | Anything set | unset |
| pids | For debugging, show consistent output of ps aux | Anything set | unset |
@@ -82,7 +82,7 @@ Options you can set include:
|Name | Description | Type | Default |
|-----|-------------|------------|------|
|testname | Name for the test | string | test |
| blocksize | Size of block to write. It dfaults to 4k, but can be set from 256 to 8k. | string | 4k |
| blocksize | Size of block to write. It defaults to 4k, but can be set from 256 to 8k. | string | 4k |
| iodepth | Number of I/O units to keep in flight against the file. | int | 64 |
| size | Total size of file to write | string | 4G |
| directory | Directory (usually mounted) to test. | string | /tmp |
@@ -105,9 +105,11 @@ This is the "iostat" executable of the sysstat library.
This is good for mounted storage that can be seen by the operating system, but may not work for something like NFS.
### Standalone
Standalone metrics can take on many designs, from a launcher/worker design to test networking, to running
a metric across nodes to assess the node performance.
#### network-netmark
- [Standalone Metric Set](user-guide.md#application-metric-set)
@@ -505,24 +507,31 @@ ex3_colored-indexset_solution ex6_stencil-offset-layout_solution ex9_matrix-tr
(meaning on the PATH in `/opt/Kripke/build/bin` in the container).
For apps / metrics to be added, please see [this issue](https://github.com/converged-computing/metrics-operator/issues/30).
## Containers
#### app-ldms
The following tools are folded into the metrics above. Often, one tool can be built into one container and used across multiple metrics.
- [Standalone Metric Set](user-guide.md#application-metric-set)
- *[app-ldms](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-ldms)*
### Sysstat
- [ghcr.io/converged-computing/metric-sysstat](https://github.com/converged-computing/metrics-operator/pkgs/container/metric-sysstat)
LDMS is "a low-overhead, low-latency framework for collecting, transferring, and storing metric data on a large distributed computer system"
and is packaged alongside [ovis-hpc](https://github.com/ovis-hpc/ovis). While there are complex aggregator setups we could run,
for this simple metric we run a single command on each separate pod/node. The following variables are supported:
Sysstat is stored as a general metrics analyzer, as it provides several different metric types; it generally provides utilities to monitor system performance and usage, including:
|Name | Description | Type | Default |
|-----|-------------|------|------|
| command | The full command to run (by default, an `ldms_ls` invocation) | string | (see below) |
| workdir | The working directory for the command | string | /opt |
| completions | Number of times to run metric | int32 | unset (runs for lifetime of application or indefinitely) |
| rate | Seconds to pause between measurements | int32 | 10 |
The following is the default command:
- *iostat* reports CPU statistics and input/output statistics for block devices and partitions.
- *mpstat* reports individual or combined processor related statistics.
- *pidstat* reports statistics for Linux tasks (processes) : I/O, CPU, memory, etc.
- *tapestat* reports statistics for tape drives connected to the system.
- *cifsiostat* reports CIFS statistics.
```bash
ldms_ls -h localhost -x sock -p 10444 -l -v
```
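The interaction between `command`, `rate`, and `completions` can be sketched in a few lines of Python. This is only an illustration of the semantics described in the table, not the operator's actual entrypoint: the function name `run_metric` and its `runner` and `sleep` parameters are hypothetical, and treating `completions=0` as "unset" is an assumption taken from the commit message.

```python
import shlex
import subprocess
import time

# Default command as documented above.
DEFAULT_COMMAND = "ldms_ls -h localhost -x sock -p 10444 -l -v"


def run_metric(command=DEFAULT_COMMAND, rate=10, completions=0,
               runner=subprocess.run, sleep=time.sleep):
    """Run `command` repeatedly, pausing `rate` seconds between runs.

    Hypothetical helper: completions=0 is treated as "unset", so the
    loop continues until the process is interrupted. Returns the number
    of runs completed.
    """
    argv = shlex.split(command)
    done = 0
    while completions == 0 or done < completions:
        # check=False: a failing measurement should not stop the loop.
        runner(argv, check=False)
        done += 1
        # Only pause if another measurement will follow.
        if completions == 0 or done < completions:
            sleep(rate)
    return done
```

With `completions=3` and `rate=2`, this sketch runs the command three times and sleeps twice; with the default `completions=0` it only stops when interrupted, which mirrors the "runs for lifetime of application or indefinitely" default in the table.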
## LLNL Storage / Filesystems
## Containers
- NFS
- Vast
- Lustre
To see all associated app containers, look at the [converged-computing/metrics-containers](https://github.com/converged-computing/metrics-containers)
repository (with `Dockerfile`s and automation) and associated packages.