feat: add module to deploy and integrate grafana agent with slurmctld #9
Changes:
* Replace `juju_application` entry for `mysql` with the tf module from the upstream `mysql-operator` GitHub repository.

Signed-off-by: Jason C. Nucciarone <[email protected]>
@jedel1043 @dsloanm R4R - deploys a grafana agent on slurmctld that's ready to party with a deployed COS cloud. We'll need to figure out if we want to have a subdirectory that contains a plan for deploying COS, or if we just want to define the grafana-agent endpoints to be consumed by another product module, but this tf plan can at least get you a Charmed HPC cluster that's ready to be integrated with COS. Here's what the final deployment looks like with grafana-agent-operator added:
Looks great!
> We'll need to figure out if we want to have a subdirectory that contains a plan for deploying COS, or if we just want to define the grafana-agent endpoints to be consumed by another product module, but this tf plan can at least get you a Charmed HPC cluster that's ready to be integrated with COS.
Sounds like the COS plans should be the responsibility of the observability team, but we can discuss that later.
## Grafana Agent - forwards collected cluster metrics to COS.
module "grafana-agent" {
  source = "git::https://github.com/canonical/grafana-agent-operator//terraform"

  model_name = juju_model.charmed-hpc.name
  app_name   = "grafana-agent"
  channel    = var.grafana-agent-channel
  units      = 0 # Units should always be zero since grafana-agent is a subordinate operator.
}
Thought: hmm, I'm wondering if we really want to always deploy it. From the user's perspective, seeing a big red "BLOCKED" message could trigger alarm sounds. Maybe make this optional with a configuration?
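A minimal sketch of what such a toggle could look like, assuming Terraform 0.13 or later (needed for `count` on a module block). The `enable-grafana-agent` variable is hypothetical and not part of this PR; the module inputs are the ones already used in the diff above.

```hcl
# Hypothetical toggle; the variable name and default are assumptions, not part of this PR.
variable "enable-grafana-agent" {
  description = "Deploy the grafana-agent subordinate alongside slurmctld."
  type        = bool
  default     = false
}

module "grafana-agent" {
  # Zero module instances when the toggle is off, so no blocked application is deployed.
  count  = var.enable-grafana-agent ? 1 : 0
  source = "git::https://github.com/canonical/grafana-agent-operator//terraform"

  model_name = juju_model.charmed-hpc.name
  app_name   = "grafana-agent"
  channel    = var.grafana-agent-channel
  units      = 0 # Subordinate operator, so no units of its own.
}
```

With `count` set, anything that references the module has to use the indexed form `module.grafana-agent[0]` and carry a matching condition, which is the main maintenance cost of this approach.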
Can we set the status to active even if it isn't related?
Externally I don't think so; that's just the logic of the grafana-agent charm.
Blocked and error mean two different things in my mind. Blocked implies that further conditions must be met before the application is active, while error implies that something went wrong in the deployment.
Tbh I'd like to avoid making the Terraform for our reference deployment complicated with conditionals and dynamic blocks as they make it harder to maintain the deployment plan. I'd rather deploy the Grafana Agent operator and then tell folks "hey this will stay in a blocked state until you integrate the Canonical Observability Stack (COS). See for how to set up COS with your Charmed HPC cluster."
We could also just add a module that deploys COS Lite; it's pretty straightforward from what I have seen. We just need to add a cos.tf plan and provide a Kubernetes cluster that it can use.
# Example COS Lite deployment with Terraform.
resource "juju_application" "alertmanager" {
  name  = "alertmanager"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "alertmanager-k8s"
    channel = "latest/stable"
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    data = "10G"
  }
}

resource "juju_application" "catalogue" {
  name  = "catalogue"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "catalogue-k8s"
    channel = "latest/stable"
  }

  units       = 1
  constraints = "arch=amd64"

  config = {
    "description" : "Canonical Observability Stack Lite"
  }
}

resource "juju_application" "grafana" {
  name  = "grafana"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "grafana-k8s"
    channel = "latest/stable"
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    database = "10G"
  }
}

resource "juju_application" "loki" {
  name  = "loki"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "loki-k8s"
    channel = "latest/stable"
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    active-index-directory = "10G"
    loki-chunks            = "500G"
  }
}

resource "juju_application" "prometheus" {
  name  = "prometheus"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "prometheus-k8s"
    channel = "latest/stable"
  }

  config = {
    "metrics_retention_time" : "90d"
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    database = "500G"
  }
}

resource "juju_application" "traefik" {
  name  = "traefik"
  trust = true
  model = juju_model.cos.name

  charm {
    name    = "traefik-k8s"
    channel = "latest/stable"
  }

  config = {
    "tls-cert" : var.COS_TLS_CERT,
    "tls-key" : var.COS_TLS_KEY,
    "tls-ca" : var.COS_TLS_CA
  }

  units       = 1
  constraints = "arch=amd64"

  storage_directives = {
    configurations = "10G"
  }
}

resource "juju_integration" "traefik-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.traefik.name
    endpoint = "traefik-route"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "ingress"
  }
}

resource "juju_integration" "prometheus-alertmanager-alerting" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "alertmanager"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "alerting"
  }
}

resource "juju_integration" "grafana-prometheus-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "grafana-source"
  }
}

resource "juju_integration" "grafana-loki-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "grafana-source"
  }
}

resource "juju_integration" "grafana-alertmanager-source" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-source"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "grafana-source"
  }
}

resource "juju_integration" "loki-alertmanager" {
  model = juju_model.cos.name

  application {
    name     = juju_application.loki.name
    endpoint = "alertmanager"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "alerting"
  }
}

resource "juju_integration" "prometheus-traefik" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.traefik.name
    endpoint = "metrics-endpoint"
  }
}

resource "juju_integration" "prometheus-alertmanager-metrics" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "self-metrics-endpoint"
  }
}

resource "juju_integration" "prometheus-loki" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "metrics-endpoint"
  }
}

resource "juju_integration" "prometheus-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.prometheus.name
    endpoint = "metrics-endpoint"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "metrics-endpoint"
  }
}

resource "juju_integration" "grafana-loki-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.loki.name
    endpoint = "grafana-dashboard"
  }
}

resource "juju_integration" "grafana-prometheus-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "grafana-dashboard"
  }
}

resource "juju_integration" "grafana-alertmanager-dashboard" {
  model = juju_model.cos.name

  application {
    name     = juju_application.grafana.name
    endpoint = "grafana-dashboard"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "grafana-dashboard"
  }
}

resource "juju_integration" "catalogue-traefik" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "ingress"
  }

  application {
    name     = juju_application.traefik.name
    endpoint = "ingress"
  }
}

resource "juju_integration" "catalogue-grafana" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.grafana.name
    endpoint = "catalogue"
  }
}

resource "juju_integration" "catalogue-prometheus" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.prometheus.name
    endpoint = "catalogue"
  }
}

resource "juju_integration" "catalogue-alertmanager" {
  model = juju_model.cos.name

  application {
    name     = juju_application.catalogue.name
    endpoint = "catalogue"
  }

  application {
    name     = juju_application.alertmanager.name
    endpoint = "catalogue"
  }
}

resource "juju_offer" "prometheus-receive-remote-write" {
  model            = juju_model.cos.name
  application_name = juju_application.prometheus.name
  endpoint         = "receive-remote-write"
}

resource "juju_offer" "grafana-dashboards" {
  model            = juju_model.cos.name
  application_name = juju_application.grafana.name
  endpoint         = "grafana-dashboard"
}
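
For illustration, the two offers above could be consumed from the Charmed HPC model roughly as follows. This is a sketch under assumptions, not something taken from the PR: it assumes the agent is deployed under the app name `grafana-agent`, that the grafana-agent charm's `send-remote-write` and `grafana-dashboards-provider` endpoints are the ones to relate, and that both models are reachable from the same Juju controller.

```hcl
# Sketch only: cross-model integrations from the Charmed HPC model to the COS offers.
# Application and endpoint names on the grafana-agent side are assumptions.
resource "juju_integration" "grafana-agent-to-prometheus" {
  model = juju_model.charmed-hpc.name

  application {
    name     = "grafana-agent"
    endpoint = "send-remote-write"
  }

  # The remote side is addressed by offer URL rather than by name and endpoint.
  application {
    offer_url = juju_offer.prometheus-receive-remote-write.url
  }
}

resource "juju_integration" "grafana-agent-to-grafana" {
  model = juju_model.charmed-hpc.name

  application {
    name     = "grafana-agent"
    endpoint = "grafana-dashboards-provider"
  }

  application {
    offer_url = juju_offer.grafana-dashboards.url
  }
}
```

Once integrations like these exist, the agent should have what it needs to leave the blocked state, though that depends on the charm's own status logic.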
I'm also wondering if it's perhaps better to have this be a module? And then someone else writes the plan that deploys Slurm with COS, consuming this module as a product module within their own deployment plan. So something like the following:
# Some magic with Juju Offers happens in the back-end of the Charmed HPC plan.
# Still requires us to deploy Grafana Agent with Charmed HPC however.
terraform {
  required_providers {
    juju = {
      version = "~> 0.15.0"
      source  = "juju/juju"
    }
  }
}

terraform {
  backend "http" {
  }
}

provider "juju" {
  controller_addresses = var.JUJU_CONTROLLER_IPS
  username             = var.JUJU_USERNAME
  password             = var.JUJU_PASSWORD
  ca_certificate       = base64decode(var.JUJU_CA_CERTIFICATE)
}

resource "juju_model" "charmed-hpc" {
  name       = "charmed-hpc"
  credential = var.CREDENTIAL

  cloud {
    name = var.CLOUD
  }

  config = {
    agent-version       = "3.5.4"
    resource-group-name = var.RESOURCE_GROUP
    network             = var.NETWORK
  }
}

resource "juju_model" "cos-lite" {
  name       = "cos-lite"
  credential = var.K8S_CREDENTIAL

  cloud {
    name = var.K8S_CLOUD
  }

  config = {
    agent-version = "3.5.4"
  }

  depends_on = [
    juju_model.charmed-hpc
  ]
}
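
The plan above stops at the two models, so here is a rough sketch of how the product modules might then be consumed. Everything in it is hypothetical: the module source paths, the input names, and the output names are placeholders for illustration and do not exist in this PR or in any upstream repository that I know of.

```hcl
# Hypothetical product modules; source paths, inputs, and outputs are all assumptions.
module "cos-lite" {
  source     = "./modules/cos-lite"
  model_name = juju_model.cos-lite.name
}

module "charmed-hpc" {
  source     = "./modules/charmed-hpc"
  model_name = juju_model.charmed-hpc.name

  # Offer URLs exported by the COS module, consumed by grafana-agent in the HPC model.
  prometheus_offer_url = module.cos-lite.prometheus_receive_remote_write_offer_url
  grafana_offer_url    = module.cos-lite.grafana_dashboards_offer_url
}
```

The idea is that the Charmed HPC module only needs offer URLs as inputs, so the COS side can be owned and versioned by a different team.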