feat: add module to deploy and integrate grafana agent with slurmctld #9

Open: wants to merge 1 commit into base: main
64 changes: 44 additions & 20 deletions main.tf
@@ -12,14 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.

# Deploy a minimal Charmed HPC cloud on LXD
## Deploy a Charmed HPC cluster.

provider "juju" {}

resource "juju_model" "charmed-hpc" {
name = var.model
}

## Slurm - workload manager for Charmed HPC.
module "controller" {
source = "git::https://github.com/charmed-hpc/slurm-charms//charms/slurmctld/terraform"

@@ -56,21 +57,19 @@ module "rest-api" {
units = var.rest-api-scale
}

# FIXME: Source from upstream mysql operator once tf module is published.
resource "juju_application" "mysql" {
name = "mysql"
model = juju_model.charmed-hpc.name

charm {
name = "mysql"
channel = var.mysql-channel
revision = var.mysql-revision
}
## MySQL - provides backing database for `slurmdbd`.
module "mysql" {
source = "git::https://github.com/canonical/mysql-operator//terraform"

units = var.mysql-scale
juju_model_name = juju_model.charmed-hpc.name
app_name = "mysql"
channel = var.mysql-channel
units = var.mysql-scale
}

# FIXME: Source from upstream mysql-router operator once tf module is published.
# TODO:
# Pull a Terraform module for mysql-router-operator once
# it has been published to the upstream repository.
resource "juju_application" "database-mysql-router" {
name = "database-mysql-router"
model = juju_model.charmed-hpc.name
@@ -80,9 +79,20 @@ resource "juju_application" "database-mysql-router" {
channel = var.mysql-router-channel
revision = var.mysql-router-revision
}
units = 0
units = 0 # Units should always be zero since mysql-router is a subordinate operator.
}

## Grafana Agent - forwards collected cluster metrics to COS.
module "grafana-agent" {
source = "git::https://github.com/canonical/grafana-agent-operator//terraform"

model_name = juju_model.charmed-hpc.name
app_name = "grafana-agent"
channel = var.grafana-agent-channel
units = 0 # Units should always be zero since grafana-agent is a subordinate operator.
}
Comment on lines +85 to +93
Contributor

Thought: I'm wondering whether we really want to always deploy it. From the user's perspective, a big red "BLOCKED" status could set off alarm bells. Maybe make this optional via a configuration option?
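
A minimal sketch of what an opt-in deployment could look like, assuming a hypothetical `enable-grafana-agent` variable that is not part of this PR. The module inputs and the integration are copied from this PR's diff; only the `count` lines and the `[0]` index are new.

```hcl
# Hypothetical toggle; not part of this PR.
variable "enable-grafana-agent" {
  description = "Deploy grafana-agent and integrate it with slurmctld."
  type        = bool
  default     = false
}

module "grafana-agent" {
  # Create the module only when the toggle is set.
  count  = var.enable-grafana-agent ? 1 : 0
  source = "git::https://github.com/canonical/grafana-agent-operator//terraform"

  model_name = juju_model.charmed-hpc.name
  app_name   = "grafana-agent"
  channel    = var.grafana-agent-channel
  units      = 0 # Subordinate operator, so no units of its own.
}

resource "juju_integration" "controller-to-grafana-agent" {
  # The integration must follow the same toggle as the module.
  count = var.enable-grafana-agent ? 1 : 0
  model = juju_model.charmed-hpc.name

  application {
    name = module.controller.app_name
  }

  application {
    # A module that uses `count` has to be referenced by index.
    name = module.grafana-agent[0].app_name
  }
}
```

Every resource that refers to the module would need the same `count` guard and the `[0]` index, which is the extra bookkeeping this option brings with it.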

Can we set the status to active even if it isn't related?

Contributor

Externally I don't think so; that's just the logic of the grafana-agent charm.

Member Author

Blocked and error mean two different things in my mind. Blocked implies that further conditions must be met before the application is active, while error implies that something went wrong in the deployment.

Tbh I'd like to avoid making the Terraform for our reference deployment complicated with conditionals and dynamic blocks as they make it harder to maintain the deployment plan. I'd rather deploy the Grafana Agent operator and then tell folks "hey this will stay in a blocked state until you integrate the Canonical Observability Stack (COS). See for how to set up COS with your Charmed HPC cluster."

We could also just add a module that deploys COS Lite; it's pretty straightforward from what I have seen. We just need to add a cos.tf plan and provide a Kubernetes cluster that it can use.
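
The example that follows refers to `juju_model.cos` and to three `COS_TLS_*` variables without declaring them. A rough sketch of those supporting declarations, assuming a Kubernetes cloud is already registered with the Juju controller under the placeholder name `microk8s`:

```hcl
# Hypothetical supporting declarations for a cos.tf plan.
resource "juju_model" "cos" {
  name = "cos"

  cloud {
    # Placeholder: name of a Kubernetes cloud known to the Juju controller.
    name = "microk8s"
  }
}

variable "COS_TLS_CERT" {
  description = "TLS certificate passed to traefik as `tls-cert`."
  type        = string
}

variable "COS_TLS_KEY" {
  description = "TLS private key passed to traefik as `tls-key`."
  type        = string
  sensitive   = true
}

variable "COS_TLS_CA" {
  description = "CA certificate passed to traefik as `tls-ca`."
  type        = string
}
```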

# Example COS Lite deployment with Terraform.

resource "juju_application" "alertmanager" {
  name  = "alertmanager"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "alertmanager-k8s"
    channel = "latest/stable"
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    data = "10G"
  }
}

resource "juju_application" "catalogue" {
  name  = "catalogue"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "catalogue-k8s"
    channel = "latest/stable"
  }
  units       = 1
  constraints = "arch=amd64"
  config = {
    "description" : "Canonical Observability Stack Lite"
  }

}


resource "juju_application" "grafana" {
  name  = "grafana"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "grafana-k8s"
    channel = "latest/stable"
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    database = "10G"
  }
}


resource "juju_application" "loki" {
  name  = "loki"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "loki-k8s"
    channel = "latest/stable"
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    active-index-directory = "10G"
    loki-chunks            = "500G"
  }
}


resource "juju_application" "prometheus" {
  name  = "prometheus"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "prometheus-k8s"
    channel = "latest/stable"
  }
  config = {
    "metrics_retention_time" : "90d"
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    database = "500G"
  }
}


resource "juju_application" "traefik" {
  name  = "traefik"
  trust = true
  model = juju_model.cos.name
  charm {
    name    = "traefik-k8s"
    channel = "latest/stable"
  }
  config = {
    "tls-cert" : var.COS_TLS_CERT,
    "tls-key" : var.COS_TLS_KEY,
    "tls-ca" : var.COS_TLS_CA
  }
  units       = 1
  constraints = "arch=amd64"
  storage_directives = {
    configurations = "10G"
  }
}


resource "juju_integration" "traefik-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.traefik.name
    endpoint = "traefik-route"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "ingress"
  }
}

resource "juju_integration" "prometheus-alertmanager-alerting" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "alertmanager"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "alerting"
  }
}


resource "juju_integration" "grafana-prometheus-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "grafana-source"
  }
}


resource "juju_integration" "grafana-loki-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "grafana-source"
  }
}


resource "juju_integration" "grafana-alertmanager-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "grafana-source"
  }
}


resource "juju_integration" "loki-alertmanager" {
  model = juju_model.cos.name

  application {
    name     = juju_application.loki.name
    endpoint = "alertmanager"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "alerting"
  }
}


resource "juju_integration" "prometheus-traefik" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.traefik.name
    endpoint = "metrics-endpoint"
  }
}


resource "juju_integration" "prometheus-alertmanager-metrics" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "self-metrics-endpoint"
  }
}


resource "juju_integration" "prometheus-loki" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "metrics-endpoint"
  }
}


resource "juju_integration" "prometheus-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "metrics-endpoint"
  }
}


resource "juju_integration" "grafana-loki-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "grafana-dashboard"
  }
}


resource "juju_integration" "grafana-prometheus-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "grafana-dashboard"
  }
}


resource "juju_integration" "grafana-alertmanager-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "grafana-dashboard"
  }
}


resource "juju_integration" "catalogue-traefik" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "ingress"
  }

  application {
    name     = juju_application.traefik.name
    endpoint = "ingress"
  }
}


resource "juju_integration" "catalogue-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "catalogue"
  }
}


resource "juju_integration" "catalogue-prometheus" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "catalogue"
  }
}


resource "juju_integration" "catalogue-alertmanager" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "catalogue"
  }
}


resource "juju_offer" "prometheus-receive-remote-write" {
  model            = juju_model.cos.name
  application_name = juju_application.prometheus.name
  endpoint         = "receive-remote-write"
}


resource "juju_offer" "grafana-dashboards" {
  model            = juju_model.cos.name
  application_name = juju_application.grafana.name
  endpoint         = "grafana-dashboard"
}
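
With the two offers above in place, the blocked `grafana-agent` could be related to COS from the Charmed HPC model. This is only a sketch: it assumes both models live on the same Juju controller, and the endpoint names `send-remote-write` and `grafana-dashboards-provider` are my assumption about the grafana-agent charm, so they should be checked against its metadata.

```hcl
# Sketch only: consume the COS offers from the Charmed HPC model.
resource "juju_integration" "grafana-agent-to-prometheus" {
  model = juju_model.charmed-hpc.name

  application {
    name     = module.grafana-agent.app_name
    endpoint = "send-remote-write" # Assumed endpoint name.
  }

  application {
    # Cross-model side of the relation, addressed by offer URL.
    offer_url = juju_offer.prometheus-receive-remote-write.url
  }
}

resource "juju_integration" "grafana-agent-to-grafana" {
  model = juju_model.charmed-hpc.name

  application {
    name     = module.grafana-agent.app_name
    endpoint = "grafana-dashboards-provider" # Assumed endpoint name.
  }

  application {
    offer_url = juju_offer.grafana-dashboards.url
  }
}
```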


## Integrate `slurmctld`, `slurmd`, `slurmdbd`, and `slurmrestd` together.
resource "juju_integration" "compute-to-controller" {
model = juju_model.charmed-hpc.name

@@ -125,30 +135,44 @@ resource "juju_integration" "rest-api-to-controller" {
}
}

## Integrate `slurmdbd` with `mysql`.
resource "juju_integration" "database-to-mysql-router" {
model = juju_model.charmed-hpc.name

application {
name = module.database.app_name
endpoint = module.database.requires.database
name = juju_application.database-mysql-router.name
endpoint = "database"
}

application {
name = juju_application.database-mysql-router.name
endpoint = "database"
name = module.database.app_name
endpoint = module.database.requires.database
}
}

resource "juju_integration" "mysql-router-to-mysql" {
model = juju_model.charmed-hpc.name

application {
name = module.mysql.application_name
endpoint = module.mysql.provides.database
}

application {
name = juju_application.database-mysql-router.name
endpoint = "backend-database"
}
}

## Integrate `slurmctld` with `grafana-agent`.
resource "juju_integration" "controller-to-grafana-agent" {
model = juju_model.charmed-hpc.name

application {
name = juju_application.mysql.name
endpoint = "database"
name = module.controller.app_name
}

application {
name = module.grafana-agent.app_name
}
}
12 changes: 6 additions & 6 deletions variables.tf
@@ -54,18 +54,18 @@ variable "database-scale" {
default = 1
}

variable "grafana-agent-channel" {
description = "Channel to deploy grafana-agent-operator from."
type = string
default = "latest/stable"
}

variable "mysql-channel" {
description = "Channel to deploy mysql from."
type = string
default = "8.0/stable"
}

variable "mysql-revision" {
description = "Revision of mysql to deploy from channel."
type = number
default = null
}

variable "mysql-scale" {
description = "Scale of mysql application"
type = number