Commit 7538570: Cleanup

moebiusband73 committed Jun 18, 2024
1 parent 79e4929 commit 7538570
Showing 7 changed files with 80 additions and 781 deletions.
126 changes: 0 additions & 126 deletions Makefile.orig

This file was deleted.

115 changes: 70 additions & 45 deletions README.md

[![Build & Test](https://github.com/ClusterCockpit/cc-metric-store/actions/workflows/test.yml/badge.svg)](https://github.com/ClusterCockpit/cc-metric-store/actions/workflows/test.yml)

The cc-metric-store provides a simple in-memory time series database for storing metrics of cluster nodes at preconfigured intervals. It is meant to be used as part of the [ClusterCockpit suite](https://github.com/ClusterCockpit). As all data is kept in-memory (but written to disk as compressed JSON for long term storage), accessing it is very fast. It also provides aggregations over time _and_ nodes/sockets/cpus.

There is one major limitation: data only gets written to disk at periodic checkpoints, not as soon as it is received.

Go look at the `TODO.md` file and the [GitHub Issues](https://github.com/ClusterCockpit/cc-metric-store/issues) for a progress overview. Things work, but are not properly tested. The [NATS.io](https://nats.io/) based writing endpoint consumes messages in [this format of the InfluxDB line protocol](https://github.com/ClusterCockpit/cc-specifications/blob/master/metrics/lineprotocol_alternative.md).
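For illustration, a measurement in that format could look like the line below. The metric name and values are made up; the tag names follow the linked specification, and the timestamp is in nanoseconds:

```
flops_any,cluster=cluster1,hostname=host1,type=node value=42.0 1718700000000000000
```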

## REST API Endpoints

The REST API is documented in [openapi.yaml](./api/openapi.yaml) in the OpenAPI 3.0 format.

## Run tests

Some benchmarks concurrently access the `MemoryStore`, so enabling the
[Race Detector](https://golang.org/doc/articles/race_detector) might be useful.
```sh
go test -v ./...
go test -bench=. -race -v ./...
```

## What are these selectors mentioned in the code?

Tags in InfluxDB are used to build indexes over the stored data. InfluxDB tags have no relation to each other; they do not depend on each other and have no hierarchy. Different tags build up different indexes (I am no expert at all, but this is how I think they work).

This project also works as a time-series database and uses the InfluxDB line protocol. Unlike InfluxDB, the data is indexed by one single, strictly hierarchical tree structure. A selector is built from the tags in the InfluxDB line protocol and can be used to select a node (not in the sense of a compute node; it can also be a socket, cpu, ...) in that tree. The implementation calls those nodes `level` to avoid confusion. It is impossible to access data by knowing only the _socket_ or _cpu_ tag; all levels above it have to be specified as well (a small code sketch follows the examples below).

This is what the hierarchy currently looks like:

- cluster1
  - host1
    - socket0
    - socket1
    - ...
    - cpu0
    - cpu1
    - ...
  - host2
  - ...
- cluster2
- ...

Example selectors:

1. `["cluster1", "host1", "cpu0"]`: Select only the cpu0 of host1 in cluster1
2. `["cluster1", "host1", ["cpu4", "cpu5", "cpu6", "cpu7"]]`: Select only CPUs 4-7 of host1 in cluster1
3. `["cluster1", "host1"]`: Select the complete node. If querying for a CPU-specific metric such as floats, all CPUs are implied

## Config file

All durations are specified as strings that will be parsed [like this](https://pkg.go.dev/time#ParseDuration) (allowed suffixes: `s`, `m`, `h`, ...). A sketch of a complete file follows the list.

- `metrics`: Map of metric-name to objects with the following properties
  - `frequency`: Timestep/Interval/Resolution of this metric
  - `aggregation`: Can be `"sum"`, `"avg"` or `null`
    - `null` means aggregation across nodes is forbidden for this metric
    - `"sum"` means that values from the child levels are summed up for the parent level
    - `"avg"` means that values from the child levels are averaged for the parent level
  - `scope`: Unused at the moment, should be something like `"node"`, `"socket"` or `"hwthread"`
- `nats`:
  - `address`: URL of NATS.io server, example: "nats://localhost:4222"
  - `username` and `password`: Optional, if provided use those for the connection
  - `subscriptions`:
    - `subscribe-to`: Where to expect the measurements to be published
    - `cluster-tag`: Default value for the cluster tag
- `http-api`:
  - `address`: Address to bind to, for example `0.0.0.0:8080`
  - `https-cert-file` and `https-key-file`: Optional, if provided enable HTTPS using those files as certificate/key
- `jwt-public-key`: Base64 encoded string, use this to verify requests to the HTTP API
- `retention-on-memory`: Keep all values in memory for at least that amount of time
- `checkpoints`:
  - `interval`: Do checkpoints every X seconds/minutes/hours
  - `directory`: Path to a directory
  - `restore`: After a restart, load the last X seconds/minutes/hours of data back into memory
- `archive`:
  - `interval`: Move and compress all checkpoints not needed anymore every X seconds/minutes/hours
  - `directory`: Path to a directory
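As referenced above, here is a sketch of a `config.json` using only the keys described in this list. All values (metric names, paths, durations) are illustrative, and the exact nesting (for example, whether `subscriptions` is a list) is an assumption, not taken from the repository's actual `config.json`:

```json
{
  "metrics": {
    "load_one": { "frequency": "10s", "aggregation": null, "scope": "node" },
    "flops_any": { "frequency": "10s", "aggregation": "sum", "scope": "hwthread" }
  },
  "nats": {
    "address": "nats://localhost:4222",
    "subscriptions": [
      { "subscribe-to": "updates", "cluster-tag": "cluster1" }
    ]
  },
  "http-api": {
    "address": "0.0.0.0:8080"
  },
  "jwt-public-key": "<base64 encoded public key>",
  "retention-on-memory": "48h",
  "checkpoints": {
    "interval": "12h",
    "directory": "./var/checkpoints",
    "restore": "48h"
  },
  "archive": {
    "interval": "168h",
    "directory": "./var/archive"
  }
}
```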

## Test the complete setup (excluding cc-backend itself)

There are two ways for sending data to the cc-metric-store, both of which are supported by the [cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector). This example uses NATS; the alternative is to use HTTP.

```sh
# Only needed once, downloads the docker image
docker pull nats:latest
docker run -p 4222:4222 -ti nats:latest
```

Second, build and start the [cc-metric-collector](https://github.com/ClusterCockpit/cc-metric-collector) using the following as Sink-Config:

```json
{
  ...
}
```

Third, build and start the metric store. For this example, the `config.json` file already in the repository should work just fine.

```sh
# Assuming you have a clone of this repo in ./cc-metric-store:
cd cc-metric-store
make
./cc-metric-store
```

And finally, use the API to fetch some data. The API is protected by JWT-based authentication if `jwt-public-key` is set in `config.json`. You can use this JWT for testing:
`eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw`

```sh
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/query" -d "..."
```
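The request body is collapsed in this view. Based on the selector and metric concepts described above, it might look roughly like the sketch below; the field names and metric name are assumptions, so check [openapi.yaml](./api/openapi.yaml) for the authoritative schema:

```sh
# Hypothetical query body; field names are assumptions, see openapi.yaml.
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/query" -d '{
  "cluster": "testcluster",
  "from": 1718700000,
  "to": 1718700060,
  "queries": [
    { "metric": "load_one", "host": "host1" }
  ]
}'
```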

For debugging there is a debug endpoint to dump the current content to stdout:

```sh
JWT="eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"

...
```
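The command itself is collapsed in this view. Judging from the other examples, it is presumably a single authenticated request against a debug route; a hypothetical sketch (the `/api/debug` path is an assumption):

```sh
# Hypothetical; the exact debug route is an assumption.
curl -H "Authorization: Bearer $JWT" -D - "http://localhost:8080/api/debug"
```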
19 changes: 10 additions & 9 deletions TODO.md
# TODOs

- Improve checkpoints/archives
  - Store information in each buffer if already archived
  - Do not create new checkpoint if all buffers already archived
- Missing Testcases:
  - General tests
  - Check for corner cases that should fail gracefully
  - Write more realistic `ToArchive`/`FromArchive` tests
- Optimization: Once a buffer is full, calculate min, max and avg
  - Calculate averages buffer-wise, average weighted by length of buffer (see the sketch after this list)
  - Only the head-buffer needs to be fully traversed
- Optimization: If aggregating over hwthreads/cores/sockets cache those results and reuse some of that for new queries aggregating only over the newer data
- ...
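A minimal sketch of the buffer-wise weighted average mentioned above, assuming each buffer already caches its own average and sample count (the names `buffer`, `weightedAvg` are made up for illustration):

```go
package main

import "fmt"

// buffer caches a precomputed average over its samples; only the
// still-filling head buffer would need to be traversed in full.
type buffer struct {
	avg float64 // precomputed average of this buffer's samples
	len int     // number of samples in this buffer
}

// weightedAvg combines per-buffer averages, weighting each by the
// number of samples it covers.
func weightedAvg(bufs []buffer) float64 {
	sum, total := 0.0, 0
	for _, b := range bufs {
		sum += b.avg * float64(b.len)
		total += b.len
	}
	if total == 0 {
		return 0
	}
	return sum / float64(total)
}

func main() {
	bufs := []buffer{{avg: 1.0, len: 100}, {avg: 3.0, len: 50}}
	fmt.Println(weightedAvg(bufs)) // (1.0*100 + 3.0*50) / 150 = 1.666...
}
```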
File renamed without changes.
37 changes: 0 additions & 37 deletions go.mod.orig

This file was deleted.
