Module to upgrade servers #24971

Merged: 55 commits, Feb 7, 2025
Commits
ff74b47
func: add initial enos skeleton
Juanadelacuesta Jan 6, 2025
4ae6856
style: add headers
Juanadelacuesta Jan 6, 2025
43bbce8
func: change the variables input to a map of objects to simplify the …
Juanadelacuesta Jan 8, 2025
b9bc8e5
style: formating
Juanadelacuesta Jan 8, 2025
4f3e78d
Add tests for servers and clients
Juanadelacuesta Jan 8, 2025
2d5f43f
style: separate the tests in diferent scripts
Juanadelacuesta Jan 8, 2025
feca4a5
style: add missing headers
Juanadelacuesta Jan 8, 2025
df96270
func: add tests for allocs
Juanadelacuesta Jan 9, 2025
30ffd32
style: improve output
Juanadelacuesta Jan 10, 2025
dc4ff8f
func: add step to copy remote upgrade version
Juanadelacuesta Jan 10, 2025
1e063e2
style: hcl formatting
Juanadelacuesta Jan 10, 2025
550e238
fix: remove the terraform nomad provider
Juanadelacuesta Jan 15, 2025
9607720
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
36e5269
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
13b0878
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
63d3d76
fix: add missing license headers
Juanadelacuesta Jan 29, 2025
292db44
style: hcl fmt
Juanadelacuesta Jan 29, 2025
6dcd06a
style: rename variables and fix format
Juanadelacuesta Jan 29, 2025
5132fe7
func: remove the template step on the workloads module and chop the n…
Juanadelacuesta Jan 29, 2025
a1e08f0
fix: correct the jobspec path on the workloads module
Juanadelacuesta Jan 29, 2025
4d8ea8a
fix: add missing variable definitions on job specs for workloads
Juanadelacuesta Jan 29, 2025
000dc44
style: formatting
Juanadelacuesta Jan 30, 2025
b317bda
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
cc7362b
func: add module to upgrade servers
Juanadelacuesta Jan 20, 2025
e8bf2c7
style: missing headers
Juanadelacuesta Jan 20, 2025
a540135
func: add upgrade module
Juanadelacuesta Jan 27, 2025
d514c55
func: add install for windows as well
Juanadelacuesta Jan 27, 2025
2caa315
func: add an intermediate module that runs the upgrade server for eac…
Juanadelacuesta Jan 28, 2025
4ec10cd
fix: add missing license headers
Juanadelacuesta Jan 28, 2025
bc78c95
fix: remove extra input variables and connect upgrade servers to the …
Juanadelacuesta Jan 29, 2025
663f41c
fix: rename missing env variables for cluster health scripts
Juanadelacuesta Jan 29, 2025
4529b34
func: move the cluster health test outside of the modules and into th…
Juanadelacuesta Jan 29, 2025
8c79529
fix: fix the regex to ignore snap files on the gitignore file
Juanadelacuesta Jan 29, 2025
6e6d023
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
e14ac9c
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
f56a65b
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
2c509e0
fix: remove extra input variables and connect upgrade servers to the …
Juanadelacuesta Jan 29, 2025
729f370
style: formatting
Juanadelacuesta Jan 30, 2025
ae40fc4
fix: move taken and restoring snapshots out of the upgrade_single_ser…
Juanadelacuesta Jan 30, 2025
36cdab4
Merge branch 'main' into f-NET-11478-enos-2
Juanadelacuesta Feb 5, 2025
83fed5a
fix: rename variable in health test
Juanadelacuesta Jan 30, 2025
45c19ca
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
75a3c08
func: add an intermediate module that runs the upgrade server for eac…
Juanadelacuesta Jan 28, 2025
515f104
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
0402284
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
d72f8ec
fix: Add clean token to remove extra new line added in provision
Juanadelacuesta Jan 17, 2025
1c9e683
func: fix the last_log_index check and add a versions check
Juanadelacuesta Jan 31, 2025
140bb64
func: done use for_each when upgrading the servers, hardcodes each on…
Juanadelacuesta Feb 5, 2025
63bee4f
Update enos/modules/upgrade_instance/variables.tf
Juanadelacuesta Feb 5, 2025
d224b75
Update enos/modules/upgrade_instance/variables.tf
Juanadelacuesta Feb 5, 2025
395b5a0
Update enos/modules/upgrade_instance/variables.tf
Juanadelacuesta Feb 5, 2025
cb32f51
func: make snapshot by calling every server and allowing stale data
Juanadelacuesta Feb 5, 2025
609919e
style: formatting
Juanadelacuesta Feb 5, 2025
655c1f3
fix: make the source for the upgrade binary unknow until apply
Juanadelacuesta Feb 5, 2025
982d154
func: use enos bundle to install remote upgrade version, enos_files i…
Juanadelacuesta Feb 5, 2025
6 changes: 4 additions & 2 deletions enos/enos-modules.hcl
@@ -1,8 +1,6 @@
// Copyright (c) HashiCorp, Inc.
// SPDX-License-Identifier: BUSL-1.1

// Find any released RPM or Deb in Artifactory. Requires the version, edition, distro, and distro
// version.
module "build_artifactory" {
source = "./modules/fetch_artifactory"
}
@@ -18,3 +16,7 @@ module "run_workloads" {
module "test_cluster_health" {
source = "./modules/test_cluster_health"
}

module "upgrade_servers" {
source = "./modules/upgrade_servers"
}
125 changes: 81 additions & 44 deletions enos/enos-scenario-upgrade.hcl
@@ -30,12 +30,14 @@ scenario "upgrade" {
linux_count = matrix.os == "linux" ? "4" : "0"
windows_count = matrix.os == "windows" ? "4" : "0"
arch = matrix.arch
clients_count = local.linux_count + local.windows_count
}

step "copy_initial_binary" {
description = <<-EOF
Determine which Nomad artifact we want to use for the scenario, depending on the
'arch', 'edition' and 'os' and bring it from the artifactory to a local instance.
'arch', 'edition' and 'os' and bring it from the artifactory to the local instance
running enos.
EOF

module = module.build_artifactory
@@ -52,9 +54,11 @@ scenario "upgrade" {
}

step "provision_cluster" {
depends_on = [step.copy_initial_binary]
depends_on = [step.copy_initial_binary]

description = <<-EOF
Using the binary from the previous step, provision a Nomad cluster using the e2e
module.
EOF

module = module.provision_cluster
@@ -73,7 +77,8 @@ scenario "upgrade" {
}

step "run_initial_workloads" {
depends_on = [step.provision_cluster]
depends_on = [step.provision_cluster]

description = <<-EOF
Verify the health of the cluster by running new workloads
EOF
@@ -86,28 +91,34 @@ scenario "upgrade" {
key_file = step.provision_cluster.key_file
nomad_token = step.provision_cluster.nomad_token
}

verifies = [
quality.nomad_register_job,
]
}

step "initial_test_cluster_health" {
depends_on = [step.run_initial_workloads]
depends_on = [step.run_initial_workloads]

description = <<-EOF
Verify the health of the cluster by checking the status of all servers, nodes, jobs and allocs and stopping random allocs to check for correct reschedules"
Verify the health of the cluster by checking the status of all servers, nodes,
jobs and allocs, and stopping random allocs to check for correct reschedules.
EOF

module = module.test_cluster_health
variables {
nomad_addr = step.provision_cluster.nomad_addr
ca_file = step.provision_cluster.ca_file
cert_file = step.provision_cluster.cert_file
key_file = step.provision_cluster.key_file
nomad_token = step.provision_cluster.nomad_token
server_count = var.server_count
client_count = local.linux_count + local.windows_count
jobs_count = step.run_initial_workloads.jobs_count
alloc_count = step.run_initial_workloads.allocs_count
nomad_addr = step.provision_cluster.nomad_addr
ca_file = step.provision_cluster.ca_file
cert_file = step.provision_cluster.cert_file
key_file = step.provision_cluster.key_file
nomad_token = step.provision_cluster.nomad_token
server_count = var.server_count
client_count = local.clients_count
jobs_count = step.run_initial_workloads.jobs_count
alloc_count = step.run_initial_workloads.allocs_count
servers = step.provision_cluster.servers
clients_version = var.product_version
servers_version = var.product_version
}

verifies = [
@@ -120,10 +131,11 @@ scenario "upgrade" {
]
}

step "copy_upgrade_binary" {
depends_on = [step.provision_cluster]
step "fetch_upgrade_binary" {
depends_on = [step.provision_cluster]

description = <<-EOF
Bring the new upgraded binary from the artifactory
Bring the new upgraded binary from the artifactory to the instance running enos.
EOF

module = module.build_artifactory
@@ -135,51 +147,71 @@ scenario "upgrade" {
edition = matrix.edition
product_version = var.upgrade_version
os = matrix.os
binary_path = "${var.nomad_local_binary}/${matrix.os}-${matrix.arch}-${matrix.edition}-${var.upgrade_version}"
download_binary = false
}
}
/*

step "upgrade_servers" {
depends_on = [step.fetch_upgrade_binary]

description = <<-EOF
Upgrade the cluster's servers by invoking nomad-cc ...
EOF
Takes the servers one by one, makes a snapshot, updates the binary with the
new one previously fetched and restarts the servers.

module = module.run_cc_nomad
Important: The path where the binary will be placed is hardcoded to match
what the provision-cluster module does. It can be configurable in the future
but for now it is:

* "C:/opt/nomad.exe" for windows
* "/usr/local/bin/nomad" for linux

To ensure the servers are upgraded one by one, the module chains them with the
depends_on meta-argument; ONLY 3 SERVERS are upgraded in the module.
EOF
module = module.upgrade_servers

verifies = [
quality.nomad_agent_info,
quality.nomad_agent_info_self,
nomad_restore_snapshot
quality.nomad_agent_info,
quality.nomad_agent_info_self,
quality.nomad_restore_snapshot
]

variables {
cc_update_type = "server"
nomad_upgraded_binary = step.copy_initial_binary.nomad_local_binary
// ...
nomad_addr = step.provision_cluster.nomad_addr
ca_file = step.provision_cluster.ca_file
cert_file = step.provision_cluster.cert_file
key_file = step.provision_cluster.key_file
nomad_token = step.provision_cluster.nomad_token
servers = step.provision_cluster.servers
ssh_key_path = step.provision_cluster.ssh_key_file
artifactory_username = var.artifactory_username
artifactory_token = var.artifactory_token
artifact_url = step.fetch_upgrade_binary.artifact_url
artifact_sha = step.fetch_upgrade_binary.artifact_sha
}
}

step "run_servers_workloads" {
// ...
}

step "server_upgrade_test_cluster_health" {
depends_on = [step.run_initial_workloads]
depends_on = [step.upgrade_servers]
description = <<-EOF
Verify the health of the cluster by checking the status of all servers, nodes, jobs and allocs and stopping random allocs to check for correct reschedules"
Verify the health of the cluster by checking the status of all servers, nodes,
jobs and allocs, and stopping random allocs to check for correct reschedules.
EOF

module = module.test_cluster_health
variables {
nomad_addr = step.provision_cluster.nomad_addr
ca_file = step.provision_cluster.ca_file
cert_file = step.provision_cluster.cert_file
key_file = step.provision_cluster.key_file
nomad_token = step.provision_cluster.nomad_token
server_count = var.server_count
client_count = local.linux_count + local.windows_count
jobs_count = step.run_initial_workloads.jobs_count
alloc_count = step.run_initial_workloads.allocs_count
nomad_addr = step.provision_cluster.nomad_addr
ca_file = step.provision_cluster.ca_file
cert_file = step.provision_cluster.cert_file
key_file = step.provision_cluster.key_file
nomad_token = step.provision_cluster.nomad_token
server_count = var.server_count
client_count = local.linux_count + local.windows_count
jobs_count = step.run_initial_workloads.jobs_count
alloc_count = step.run_initial_workloads.allocs_count
servers = step.provision_cluster.servers
clients_version = var.product_version
servers_version = var.upgrade_version
}

verifies = [
@@ -192,6 +224,11 @@ scenario "upgrade" {
]
}

/*
step "run_servers_workloads" {
// ...
}

step "upgrade_client" {
description = <<-EOF
Upgrade the cluster's clients by invoking nomad-cc ...
@@ -244,6 +281,7 @@ scenario "upgrade" {
]
}
*/

output "servers" {
value = step.provision_cluster.servers
}
@@ -280,5 +318,4 @@ scenario "upgrade" {
value = step.provision_cluster.nomad_token
sensitive = true
}

}
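The `upgrade_servers` step description says the servers are upgraded one at a time by chaining `depends_on`, with exactly three servers hardcoded. The module body is not shown in this diff, so the following is only a minimal sketch of that pattern. It assumes `servers` is a list of three addresses, that all servers share one `platform`, and that the per-server module lives at `../upgrade_instance`. The module labels and the shape of the `artifactory_release` object are illustrative assumptions, and the snapshot steps are left out.

```hcl
# Hypothetical sketch of sequential upgrades; not the module from this PR.

variable "servers" {
  description = "Addresses of the three servers to upgrade (assumed list of strings)"
  type        = list(string)
}

variable "platform" {
  description = "Operating system shared by all servers (assumption)"
  type        = string
  default     = "linux"
}

variable "ssh_key_path" {
  description = "Path to the SSH private key used to reach the servers"
  type        = string
}

variable "artifactory_release" {
  description = "Artifact location and credentials (assumed object shape)"
  type = object({
    username = string
    token    = string
    url      = string
    sha256   = string
  })
}

module "upgrade_first_server" {
  source = "../upgrade_instance"

  platform            = var.platform
  server_address      = var.servers[0]
  ssh_key_path        = var.ssh_key_path
  artifactory_release = var.artifactory_release
}

module "upgrade_second_server" {
  # Wait until the first server has been upgraded and restarted.
  depends_on = [module.upgrade_first_server]
  source     = "../upgrade_instance"

  platform            = var.platform
  server_address      = var.servers[1]
  ssh_key_path        = var.ssh_key_path
  artifactory_release = var.artifactory_release
}

module "upgrade_third_server" {
  # Wait until the second server has been upgraded and restarted.
  depends_on = [module.upgrade_second_server]
  source     = "../upgrade_instance"

  platform            = var.platform
  server_address      = var.servers[2]
  ssh_key_path        = var.ssh_key_path
  artifactory_release = var.artifactory_release
}
```

Hardcoding three calls trades flexibility for ordering: instances created by a single `for_each` module call cannot depend on one another, so they would not be guaranteed to run in sequence.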
2 changes: 2 additions & 0 deletions enos/modules/fetch_artifactory/main.tf
@@ -23,6 +23,8 @@ data "enos_artifactory_item" "nomad" {
}

resource "enos_local_exec" "install_binary" {
count = var.download_binary ? 1 : 0

environment = {
URL = data.enos_artifactory_item.nomad.results[0].url
BINARY_PATH = var.binary_path
10 changes: 10 additions & 0 deletions enos/modules/fetch_artifactory/outputs.tf
@@ -5,3 +5,13 @@ output "nomad_local_binary" {
description = "Path where the binary will be placed"
value = var.os == "windows" ? "${var.binary_path}/nomad.exe" : "${var.binary_path}/nomad"
}

output "artifact_url" {
description = "URL to fetch the artifact"
value = data.enos_artifactory_item.nomad.results[0].url
}

output "artifact_sha" {
description = "SHA256 checksum of the artifact"
value = data.enos_artifactory_item.nomad.results[0].sha256
}
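The two new outputs expose the artifact's URL and checksum so that a later step can install it on the remote servers instead of copying a local file. A minimal sketch of how a consuming module might bundle them, assuming it receives the values under the names used by the `upgrade_servers` step variables (`artifactory_username`, `artifactory_token`, `artifact_url`, `artifact_sha`). The attribute names `username`, `token`, `url` and `sha256` are an assumption about what the `artifactory` argument of `enos_bundle_install` expects and should be checked against the enos provider documentation.

```hcl
# Hypothetical sketch; variable names follow the scenario step, the rest is assumed.

variable "artifactory_username" {
  description = "Artifactory user"
  type        = string
}

variable "artifactory_token" {
  description = "Artifactory access token"
  type        = string
  sensitive   = true
}

variable "artifact_url" {
  description = "URL of the artifact, as exported by fetch_artifactory"
  type        = string
}

variable "artifact_sha" {
  description = "SHA256 checksum of the artifact, as exported by fetch_artifactory"
  type        = string
}

locals {
  # Single object that can be handed to each per-server upgrade module.
  artifactory_release = {
    username = var.artifactory_username
    token    = var.artifactory_token
    url      = var.artifact_url
    sha256   = var.artifact_sha
  }
}
```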
5 changes: 5 additions & 0 deletions enos/modules/fetch_artifactory/variables.tf
@@ -53,3 +53,8 @@ variable "binary_path" {
type = string
default = "/home/ubuntu/nomad"
}

variable "download_binary" {
description = "Used to control if the artifact should be downloaded to the local instance or not"
default = true
}
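`download_binary` gates the local install through `count`, so with `false` the artifact is only looked up and its URL and checksum remain available as outputs. A minimal self-contained sketch of the same gating pattern, using the built-in `terraform_data` resource (Terraform 1.4 or later) as a stand-in for `enos_local_exec`. The explicit `type = bool` and the `install_marker` output are additions for illustration and are not part of this PR.

```hcl
variable "download_binary" {
  description = "Controls whether the artifact is downloaded to the local instance"
  type        = bool
  default     = true
}

# Stand-in for the gated enos_local_exec resource: created only when requested.
resource "terraform_data" "install_binary" {
  count = var.download_binary ? 1 : 0
  input = "binary installed"
}

output "install_marker" {
  # The splat yields a list with zero or one element;
  # one() turns that into null when the resource was skipped.
  value = one(terraform_data.install_binary[*].output)
}
```

Referencing a counted resource through `one(...[*])` avoids an index error when `count` evaluates to 0.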
1 change: 1 addition & 0 deletions enos/modules/test_cluster_health/main.tf
@@ -51,3 +51,4 @@ resource "enos_local_exec" "verify_versions" {
]
}


3 changes: 2 additions & 1 deletion enos/modules/test_cluster_health/scripts/allocs.sh
@@ -31,7 +31,8 @@ MAX_WAIT_TIME=30 # Maximum wait time in seconds
POLL_INTERVAL=2 # Interval between status checks

random_alloc_id=$(echo "$running_allocs" | jq -r ".[$((RANDOM % ($allocs_length + 1)))].ID")
nomad alloc stop -detach "$random_alloc_id" || error_exit "Failed to stop allocation $random_alloc_id."
nomad alloc stop "$random_alloc_id" || error_exit "Failed to stop allocation $random_alloc_id."


echo "Waiting for allocation $random_alloc_id to reach 'complete' status..."
elapsed_time=0
1 change: 1 addition & 0 deletions enos/modules/test_cluster_health/variables.tf
@@ -44,6 +44,7 @@ variable "jobs_count" {

variable "alloc_count" {
description = "Number of allocations that should be running in the cluster"
type = number
}

variable "clients_version" {
2 changes: 2 additions & 0 deletions enos/modules/upgrade_instance/.gitignore
@@ -0,0 +1,2 @@
# Don't commit cluster snapshots
*.snap
64 changes: 64 additions & 0 deletions enos/modules/upgrade_instance/main.tf
@@ -0,0 +1,64 @@
# Copyright (c) HashiCorp, Inc.
# SPDX-License-Identifier: BUSL-1.1

terraform {
required_providers {
enos = {
source = "registry.terraform.io/hashicorp-forge/enos"
}
}
}

locals {
binary_destination = var.platform == "windows" ? "C:/opt/" : "/usr/local/bin/"
ssh_user = var.platform == "windows" ? "Administrator" : "ubuntu"
}

resource "enos_bundle_install" "nomad" {
destination = local.binary_destination

artifactory = var.artifactory_release

transport = {
ssh = {
host = var.server_address
private_key_path = var.ssh_key_path
user = local.ssh_user
}
}
}

resource "enos_remote_exec" "restart_linux_services" {
count = var.platform == "linux" ? 1 : 0
depends_on = [enos_bundle_install.nomad]


transport = {
ssh = {
host = var.server_address
private_key_path = var.ssh_key_path
user = local.ssh_user
}
}

inline = [
"sudo systemctl restart nomad",
]
}

resource "enos_remote_exec" "restart_windows_services" {
count = var.platform == "windows" ? 1 : 0
depends_on = [enos_bundle_install.nomad]

transport = {
ssh = {
host = var.server_address
private_key_path = var.ssh_key_path
user = local.ssh_user
}
}

inline = [
"powershell Restart-Service Nomad"
]
}
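Both the destination path and the restart resources branch on `var.platform`: any value other than "windows" installs to the Linux path, but only the exact value "linux" triggers a service restart. The module's `variables.tf` is not shown here, so as a hedged sketch, a validation block like the following (hypothetical, with an assumed default) would reject unexpected values before anything is installed.

```hcl
# Hypothetical declaration; the real variables.tf of upgrade_instance may differ.
variable "platform" {
  description = "Operating system of the instance to upgrade"
  type        = string
  default     = "linux"

  validation {
    # Only the two values that main.tf knows how to handle are accepted.
    condition     = contains(["linux", "windows"], var.platform)
    error_message = "The platform must be either \"linux\" or \"windows\"."
  }
}
```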