feat: add module to deploy and integrate grafana agent with slurmctld #9

Open
wants to merge 1 commit into
base: main

@NucciTheBoss (Member) commented Nov 22, 2024

Changes:

  • Replace the juju_application entry for mysql with the tf module from the upstream mysql-operator GitHub repository.

Docs:

  • Added comments for what each section does.
  • Removed mentions that the plan was originally used to deploy a small cluster on LXD. The main terraform plan can be used to deploy Charmed HPC pretty much anywhere.

Signed-off-by: Jason C. Nucciarone <[email protected]>
@NucciTheBoss (Member, Author) commented

@jedel1043 @dsloanm R4R - deploys a grafana agent on slurmctld that's ready to party with a deployed COS cloud. We'll need to figure out if we want to have a subdirectory that contains a plan for deploying COS, or if we just want to define the grafana-agent endpoints to be consumed by another product module, but this tf plan can at least get you a Charmed HPC cluster that's ready to be integrated with COS.

Here's what the final deployment looks like with grafana-agent-operator added:

[image: final deployment with grafana-agent-operator added]

@jedel1043 (Contributor) left a comment

Looks great!

> We'll need to figure out if we want to have a subdirectory that contains a plan for deploying COS, or if we just want to define the grafana-agent endpoints to be consumed by another product module, but this tf plan can at least get you a Charmed HPC cluster that's ready to be integrated with COS.

Sounds like the COS plans should be the responsibility of the observability team, but we can discuss that later.

Comment on lines +85 to +93
## Grafana Agent - forwards collected cluster metrics to COS.
module "grafana-agent" {
  source = "git::https://github.com/canonical/grafana-agent-operator//terraform"

  model_name = juju_model.charmed-hpc.name
  app_name   = "grafana-agent"
  channel    = var.grafana-agent-channel
  units      = 0 # Units should always be zero since grafana-agent is a subordinate operator.
}
Contributor

Thought: hmm, I'm wondering if we really want to always deploy it. From the user's perspective, seeing a big red "BLOCKED" message could set off alarm bells. Maybe make this optional with a configuration variable?
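
For reference, the optional deployment floated here could be sketched with a module-level `count`. This is only an illustration, not something in this PR: `enable-grafana-agent` is a hypothetical variable, and the sketch assumes Terraform 0.13 or later, where `count` is accepted on `module` blocks.

```hcl
# Hypothetical toggle; not part of this PR.
variable "enable-grafana-agent" {
  description = "Deploy the grafana-agent subordinate alongside slurmctld."
  type        = bool
  default     = false
}

## Grafana Agent - forwards collected cluster metrics to COS.
module "grafana-agent" {
  # Zero instances when the toggle is off, one when it is on.
  count  = var.enable-grafana-agent ? 1 : 0
  source = "git::https://github.com/canonical/grafana-agent-operator//terraform"

  model_name = juju_model.charmed-hpc.name
  app_name   = "grafana-agent"
  channel    = var.grafana-agent-channel
  units      = 0 # Subordinate operator, so units stay at zero.
}
```

Anything that references the module would then have to index it (for example `module.grafana-agent[0]`), and any integration with slurmctld would need the same `count` guard.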

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we set the status to active even if it isn't related?

Contributor

Externally I don't think so; that's just the logic of the grafana-agent charm.

Member Author

Blocked and error mean two different things in my mind. Blocked implies that further conditions must be met before the application is active, while error implies that something went wrong in the deployment.

Tbh I'd like to avoid complicating the Terraform for our reference deployment with conditionals and dynamic blocks, as they make the deployment plan harder to maintain. I'd rather deploy the Grafana Agent operator and then tell folks "hey, this will stay in a blocked state until you integrate the Canonical Observability Stack (COS). See [link] for how to set up COS with your Charmed HPC cluster."

We could also just add a module that deploys COS Lite; it's pretty straightforward from what I have seen. We just need to add a cos.tf plan and provide a Kubernetes cluster that it can use.

# Example COS Lite deployment with Terraform.

resource "juju_application" "alertmanager" {
  name  = "alertmanager"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "alertmanager-k8s"
    channel = "latest/stable"
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    data = "10G"
  }
}

resource "juju_application" "catalogue" {
  name  = "catalogue"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "catalogue-k8s"
    channel = "latest/stable"
  }
  units       = 1
  constraints = "arch=amd64"
  config = {
    "description" : "Canonical Observability Stack Lite"
  }

}


resource "juju_application" "grafana" {
  name  = "grafana"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "grafana-k8s"
    channel = "latest/stable"
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    database = "10G"
  }
}


resource "juju_application" "loki" {
  name  = "loki"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "loki-k8s"
    channel = "latest/stable"
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    active-index-directory = "10G"
    loki-chunks            = "500G"
  }
}


resource "juju_application" "prometheus" {
  name  = "prometheus"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "prometheus-k8s"
    channel = "latest/stable"
  }
  config = {
    "metrics_retention_time" : "90d"
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    database = "500G"
  }
}


resource "juju_application" "traefik" {
  name  = "traefik"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "traefik-k8s"
    channel = "latest/stable"
  }
  config = {
    "tls-cert" : var.COS_TLS_CERT,
    "tls-key" : var.COS_TLS_KEY,
    "tls-ca" : var.COS_TLS_CA
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    configurations = "10G"
  }
}


resource "juju_integration" "traefik-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.traefik.name
    endpoint = "traefik-route"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "ingress"
  }
}

resource "juju_integration" "prometheus-alertmanager-alerting" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "alertmanager"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "alerting"
  }
}


resource "juju_integration" "grafana-prometheus-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "grafana-source"
  }
}


resource "juju_integration" "grafana-loki-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "grafana-source"
  }
}


resource "juju_integration" "grafana-alertmanager-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "grafana-source"
  }
}


resource "juju_integration" "loki-alertmanager" {
  model = juju_model.cos.name

  application {
    name     = juju_application.loki.name
    endpoint = "alertmanager"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "alerting"
  }
}


resource "juju_integration" "prometheus-traefik" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.traefik.name
    endpoint = "metrics-endpoint"
  }
}


resource "juju_integration" "prometheus-alertmanager-metrics" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "self-metrics-endpoint"
  }
}


resource "juju_integration" "prometheus-loki" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "metrics-endpoint"
  }
}


resource "juju_integration" "prometheus-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "metrics-endpoint"
  }
}


resource "juju_integration" "grafana-loki-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "grafana-dashboard"
  }
}


resource "juju_integration" "grafana-prometheus-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "grafana-dashboard"
  }
}


resource "juju_integration" "grafana-alertmanager-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "grafana-dashboard"
  }
}


resource "juju_integration" "catalogue-traefik" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "ingress"
  }

  application {
    name     = juju_application.traefik.name
    endpoint = "ingress"
  }
}


resource "juju_integration" "catalogue-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "catalogue"
  }
}


resource "juju_integration" "catalogue-prometheus" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "catalogue"
  }
}


resource "juju_integration" "catalogue-alertmanager" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "catalogue"
  }
}


resource "juju_offer" "prometheus-receive-remote-write" {
  model            = juju_model.cos.name
  application_name = juju_application.prometheus.name
  endpoint         = "receive-remote-write"
}


resource "juju_offer" "grafana-dashboards" {
  model            = juju_model.cos.name
  application_name = juju_application.grafana.name
  endpoint         = "grafana-dashboard"
}
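
The example above references a `juju_model.cos` resource and three `var.COS_TLS_*` inputs without declaring them. A minimal sketch of the missing declarations might look like the following. The model name, the `K8S_CLOUD` and `K8S_CREDENTIAL` variables, and the choice to mark the TLS inputs as sensitive are all assumptions made for illustration.

```hcl
# Assumed inputs describing the Kubernetes cloud that hosts COS Lite.
variable "K8S_CLOUD" {
  type = string
}

variable "K8S_CREDENTIAL" {
  type = string
}

# TLS material passed to traefik; marked sensitive so it is redacted in plan output.
variable "COS_TLS_CERT" {
  type      = string
  sensitive = true
}

variable "COS_TLS_KEY" {
  type      = string
  sensitive = true
}

variable "COS_TLS_CA" {
  type      = string
  sensitive = true
}

# Model that the COS Lite applications above are deployed into.
resource "juju_model" "cos" {
  name       = "cos"
  credential = var.K8S_CREDENTIAL
  cloud {
    name = var.K8S_CLOUD
  }
}
```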

@NucciTheBoss (Member, Author) commented

I'm also wondering if it's perhaps better to have this be a module? And then someone else writes the plan that deploys Slurm with COS, consuming this module as a product module within their own deployment plan. So something like the following:

# Some magic with Juju Offers happens in the back-end of the Charmed HPC plan. 
# Still requires us to deploy Grafana Agent with Charmed HPC however.

terraform {
  required_providers {
    juju = {
      version = "~> 0.15.0"
      source  = "juju/juju"
    }
  }

  backend "http" {}
}

provider "juju" {
  controller_addresses = var.JUJU_CONTROLLER_IPS
  username             = var.JUJU_USERNAME
  password             = var.JUJU_PASSWORD
  ca_certificate       = base64decode(var.JUJU_CA_CERTIFICATE)
}

resource "juju_model" "charmed-hpc" {
  name       = "charmed-hpc"
  credential = var.CREDENTIAL
  cloud {
    name = var.CLOUD
  }
  config = {
    agent-version       = "3.5.4"
    resource-group-name = var.RESOURCE_GROUP
    network             = var.NETWORK
  }
}

resource "juju_model" "cos-lite" {
  name       = "cos-lite"
  credential = var.K8S_CREDENTIAL
  cloud {
    name = var.K8S_CLOUD
  }
  config = {
    agent-version = "3.5.4"
  }
  depends_on = [
    juju_model.charmed-hpc
  ]
}
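
The "magic with Juju Offers" could be sketched as cross-model integrations that consume the offers from the COS Lite example earlier in this thread (`juju_offer.prometheus-receive-remote-write` and `juju_offer.grafana-dashboards`). This is a rough sketch under assumptions: the grafana-agent endpoint names `send-remote-write` and `grafana-dashboards-provider` should be checked against the charm's metadata, and it assumes the `grafana-agent` module from this PR lives in the same plan.

```hcl
# Forward metrics from grafana-agent to the Prometheus offer in the COS model.
resource "juju_integration" "grafana-agent-to-prometheus" {
  model = juju_model.charmed-hpc.name

  application {
    name     = "grafana-agent"
    endpoint = "send-remote-write" # Assumed endpoint name.
  }

  application {
    # Consume the offer instead of naming an application in this model.
    offer_url = juju_offer.prometheus-receive-remote-write.url
  }

  # Assumes the grafana-agent module is declared in this plan.
  depends_on = [module.grafana-agent]
}

# Publish dashboards shipped with the charms to the Grafana offer.
resource "juju_integration" "grafana-agent-to-grafana" {
  model = juju_model.charmed-hpc.name

  application {
    name     = "grafana-agent"
    endpoint = "grafana-dashboards-provider" # Assumed endpoint name.
  }

  application {
    offer_url = juju_offer.grafana-dashboards.url
  }

  depends_on = [module.grafana-agent]
}
```

Until integrations like these exist, grafana-agent would be expected to stay in the blocked state discussed above.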

Labels
enhancement New feature or request