feat: add module to deploy and integrate grafana agent with slurmctld #9

NucciTheBoss · 2024-11-22T18:57:18Z

Changes:

Replace juju_application entry for mysql with the tf module from the upstream mysql-operator GitHub repository.

Docs:

Added comments for what each section does.
Removed mentions that the plan was originally used to deploy a small cluster on LXD. The main terraform plan can be used to deploy Charmed HPC pretty much anywhere.

Changes: * Replace `juju_application` entry for `mysql` with the tf module from the upstream `mysql-operator` GitHub repository. Signed-off-by: Jason C. Nucciarone <[email protected]>

NucciTheBoss · 2024-11-25T21:40:59Z

@jedel1043 @dsloanm R4R - deploys a grafana agent on slurmctld that's ready to party with a deployed COS cloud. We'll need to figure out if we want to have a subdirectory that contains a plan for deploying COS, or if we just want to define the grafana-agent endpoints to be consumed by another product module, but this tf plan can at least get you a Charmed HPC cluster that's ready to be integrated with COS.

Here's what the final deployment looks like with grafana-agent-operator added:

jedel1043

Looks great!

> We'll need to figure out if we want to have a subdirectory that contains a plan for deploying COS, or if we just want to define the grafana-agent endpoints to be consumed by another product module, but this tf plan can at least get you a Charmed HPC cluster that's ready to be integrated with COS.

Sounds like the COS plans should be the responsibility of the observability team, but we can discuss that later.

jedel1043 · 2024-11-25T23:22:20Z

main.tf

+## Grafana Agent - forwards collected cluster metrics to COS.
+module "grafana-agent" {
+  source = "git::https://github.com/canonical/grafana-agent-operator//terraform"
+
+  model_name = juju_model.charmed-hpc.name
+  app_name   = "grafana-agent"
+  channel    = var.grafana-agent-channel
+  units      = 0 # Units should always be zero since grafana-agent is a subordinate operator.
+}
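
For context, here is a minimal sketch of the wiring that would surround this module block: a declaration for the `grafana-agent-channel` variable it references, and an integration between `slurmctld` and the agent. The default channel, the `juju_application.slurmctld` resource name, and the `cos-agent` endpoint names are assumptions for illustration and are not taken from this diff.

```hcl
# Hypothetical declaration for the variable used by the module block above.
variable "grafana-agent-channel" {
  description = "Channel to deploy the grafana-agent charm from."
  type        = string
  default     = "latest/stable" # Assumed default; not taken from the PR.
}

# Assumed wiring: relate slurmctld to the subordinate grafana-agent.
# `juju_application.slurmctld` and both endpoint names are illustrative.
resource "juju_integration" "slurmctld-to-grafana-agent" {
  model = juju_model.charmed-hpc.name

  application {
    name     = juju_application.slurmctld.name
    endpoint = "cos-agent"
  }

  application {
    name     = "grafana-agent" # Matches the app_name passed to the module.
    endpoint = "cos-agent"
  }

  # The agent is referenced by literal name, so state the ordering explicitly.
  depends_on = [module.grafana-agent]
}
```

Because the agent is a subordinate with `units = 0`, it would only get units once an integration like this attaches it to a principal application.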


Thought: hmm, I'm wondering if we really want to always deploy it. From the user's perspective, seeing a big red "BLOCKED" message could set off alarm bells. Maybe make this optional with a configuration option?
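
One way the optional deployment suggested here could be expressed is with the `count` meta-argument on the module block. This is only a sketch: the `enable-grafana-agent` variable is a hypothetical name, and it assumes the upstream module does not configure its own provider (which would rule out `count`).

```hcl
# Hypothetical toggle; the variable name is illustrative only.
variable "enable-grafana-agent" {
  description = "Deploy grafana-agent alongside slurmctld."
  type        = bool
  default     = false
}

module "grafana-agent" {
  source = "git::https://github.com/canonical/grafana-agent-operator//terraform"

  # No instance of the module is created when the toggle is off.
  count = var.enable-grafana-agent ? 1 : 0

  model_name = juju_model.charmed-hpc.name
  app_name   = "grafana-agent"
  channel    = var.grafana-agent-channel
  units      = 0 # Subordinate operator, so units stay at zero.
}
```

The cost of this approach is that every reference to the module becomes indexed (for example `module.grafana-agent[0]`), and any dependent integration needs the same condition.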

Can we set the status to active even if it isn't related?

Externally I don't think so; that's just the logic of the grafana-agent charm.

Blocked and error mean two different things in my mind. Blocked implies that further conditions must be met before the application is active, while error implies that something went wrong in the deployment.

Tbh I'd like to avoid making the Terraform for our reference deployment complicated with conditionals and dynamic blocks as they make it harder to maintain the deployment plan. I'd rather deploy the Grafana Agent operator and then tell folks "hey this will stay in a blocked state until you integrate the Canonical Observability Stack (COS). See for how to set up COS with your Charmed HPC cluster."

We could also just add a module that deploys COS Lite; it's pretty straightforward from what I have seen. We just need to add a cos.tf plan and provide a Kubernetes cluster that it can use.

# Example COS Lite deployment with Terraform.

resource "juju_application" "alertmanager" {
  name  = "alertmanager"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "alertmanager-k8s"
    channel = "latest/stable"
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    data = "10G"
  }
}

resource "juju_application" "catalogue" {
  name  = "catalogue"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "catalogue-k8s"
    channel = "latest/stable"
  }

  units       = 1
  constraints = "arch=amd64"

  config = {
    "description" : "Canonical Observability Stack Lite"
  }
}

resource "juju_application" "grafana" {
  name  = "grafana"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "grafana-k8s"
    channel = "latest/stable"
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    database = "10G"
  }
}

resource "juju_application" "loki" {
  name  = "loki"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "loki-k8s"
    channel = "latest/stable"
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    active-index-directory = "10G"
    loki-chunks            = "500G"
  }
}

resource "juju_application" "prometheus" {
  name  = "prometheus"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "prometheus-k8s"
    channel = "latest/stable"
  }

  config = {
    "metrics_retention_time" : "90d"
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    database = "500G"
  }
}

resource "juju_application" "traefik" {
  name  = "traefik"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "traefik-k8s"
    channel = "latest/stable"
  }

  config = {
    "tls-cert" : var.COS_TLS_CERT,
    "tls-key" : var.COS_TLS_KEY,
    "tls-ca" : var.COS_TLS_CA
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    configurations = "10G"
  }
}

resource "juju_integration" "traefik-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.traefik.name
    endpoint = "traefik-route"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "ingress"
  }
}

resource "juju_integration" "prometheus-alertmanager-alerting" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "alertmanager"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "alerting"
  }
}

resource "juju_integration" "grafana-prometheus-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "grafana-source"
  }
}

resource "juju_integration" "grafana-loki-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "grafana-source"
  }
}

resource "juju_integration" "grafana-alertmanager-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "grafana-source"
  }
}

resource "juju_integration" "loki-alertmanager" {
  model = juju_model.cos.name

  application {
    name     = juju_application.loki.name
    endpoint = "alertmanager"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "alerting"
  }
}

resource "juju_integration" "prometheus-traefik" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.traefik.name
    endpoint = "metrics-endpoint"
  }
}

resource "juju_integration" "prometheus-alertmanager-metrics" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "self-metrics-endpoint"
  }
}

resource "juju_integration" "prometheus-loki" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "metrics-endpoint"
  }
}

resource "juju_integration" "prometheus-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "metrics-endpoint"
  }
}

resource "juju_integration" "grafana-loki-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "grafana-dashboard"
  }
}

resource "juju_integration" "grafana-prometheus-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "grafana-dashboard"
  }
}

resource "juju_integration" "grafana-alertmanager-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "grafana-dashboard"
  }
}

resource "juju_integration" "catalogue-traefik" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "ingress"
  }

  application {
    name     = juju_application.traefik.name
    endpoint = "ingress"
  }
}

resource "juju_integration" "catalogue-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "catalogue"
  }
}

resource "juju_integration" "catalogue-prometheus" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "catalogue"
  }
}

resource "juju_integration" "catalogue-alertmanager" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "catalogue"
  }
}

resource "juju_offer" "prometheus-receive-remote-write" {
  model            = juju_model.cos.name
  application_name = juju_application.prometheus.name
  endpoint         = "receive-remote-write"
}

resource "juju_offer" "grafana-dashboards" {
  model            = juju_model.cos.name
  application_name = juju_application.grafana.name
  endpoint         = "grafana-dashboard"
}

NucciTheBoss · 2024-11-26T14:21:40Z

I'm also wondering if it's perhaps better to have this be a module? And then someone else writes the plan that deploys Slurm with COS, consuming this module as a product module within their own deployment plan. So something like the following:

# Some magic with Juju Offers happens in the back-end of the Charmed HPC plan. 
# Still requires us to deploy Grafana Agent with Charmed HPC however.

terraform {
  required_providers {
    juju = {
      version = "~> 0.15.0"
      source  = "juju/juju"
    }
  }
}


terraform {
  backend "http" {
  }
}

provider "juju" {
  controller_addresses = var.JUJU_CONTROLLER_IPS
  username             = var.JUJU_USERNAME
  password             = var.JUJU_PASSWORD
  ca_certificate       = base64decode(var.JUJU_CA_CERTIFICATE)
}

resource "juju_model" "charmed-hpc" {
  name       = "charmed-hpc"
  credential = var.CREDENTIAL
  cloud {
    name = var.CLOUD
  }
  config = {
    agent-version       = "3.5.4"
    resource-group-name = var.RESOURCE_GROUP
    network             = var.NETWORK
  }
}

resource "juju_model" "cos-lite" {
  name       = "cos-lite"
  credential = var.K8S_CREDENTIAL
  cloud {
    name = var.K8S_CLOUD
  }
  config = {
    agent-version = "3.5.4"
  }
  depends_on = [
    juju_model.charmed-hpc
  ]
}
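
The "magic with Juju Offers" mentioned in the comment at the top of this snippet could look roughly like the cross-model integrations below. This is a hedged sketch, not the plan's actual back-end: it assumes `juju_offer` resources named as in the COS Lite example earlier in the thread exist in the same plan, that the agent application is named `grafana-agent`, and that the agent charm exposes `send-remote-write` and `grafana-dashboards-provider` endpoints.

```hcl
# Assumed: juju_offer.prometheus-receive-remote-write and
# juju_offer.grafana-dashboards are defined elsewhere in the same plan.
# The grafana-agent endpoint names below are assumptions.

resource "juju_integration" "grafana-agent-to-prometheus" {
  model = juju_model.charmed-hpc.name

  application {
    name     = "grafana-agent"
    endpoint = "send-remote-write"
  }

  # Consume the offer exported from the COS model.
  application {
    offer_url = juju_offer.prometheus-receive-remote-write.url
  }
}

resource "juju_integration" "grafana-agent-to-grafana" {
  model = juju_model.charmed-hpc.name

  application {
    name     = "grafana-agent"
    endpoint = "grafana-dashboards-provider"
  }

  application {
    offer_url = juju_offer.grafana-dashboards.url
  }
}
```

With integrations like these in place, the agent would be expected to leave the blocked state discussed in the review thread, since its COS-facing relations are satisfied.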

feat: add module to deploy and integrate grafana agent with slurmctld

a9efc5c


NucciTheBoss requested review from jedel1043 and dsloanm November 22, 2024 18:57

NucciTheBoss added the enhancement New feature or request label Nov 22, 2024

NucciTheBoss mentioned this pull request Nov 22, 2024

fix(ci): do not use destructive packing when building charms for release charmed-hpc/slurm-charms#42

Merged

NucciTheBoss marked this pull request as ready for review November 25, 2024 21:38

jedel1043 reviewed Nov 25, 2024


jedel1043 Nov 25, 2024

jamesbeedy Nov 25, 2024

jedel1043 Nov 25, 2024

NucciTheBoss Nov 26, 2024
