Staggered upgrade procedure for OVS clouds #1408

Open · wants to merge 1 commit into base: stackhpc/2024.1
65 changes: 65 additions & 0 deletions doc/source/operations/upgrading-openstack.rst
@@ -1063,6 +1063,12 @@ This will block the upgrade, but may be overridden by setting
``etc/kayobe/kolla/globals.yml`` or
``etc/kayobe/environments/<env>/kolla/globals.yml``.

Depending on the networking architecture of your cloud, the steps used
to upgrade the containerised services will differ.

OVN
^^^

To upgrade the containerised control plane services:

.. code-block:: console
@@ -1076,6 +1082,65 @@ scope of the upgrade:

kayobe overcloud service upgrade --tags config --kolla-tags keystone

OVS (with dedicated network nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You should first stop the Octavia health manager to prevent alerts during
the service upgrade.

.. code-block:: console

kayobe overcloud host command run --command "docker stop octavia_health_manager" --limit controllers --become

Upgrade the control plane services:

.. code-block:: console

kayobe overcloud service upgrade --kolla-limit controllers

To ensure L3 reliability during the upgrade, you will need to manually drain
each network node of all agents and upgrade the nodes sequentially.

Kolla credentials will need to be activated before running the
``neutron-namespace-drain`` role:

.. code-block:: console

source $KOLLA_CONFIG_PATH/public-openrc.sh

Substitute ``<network0>`` with the first network node to be drained. To set
the node for maintenance and begin draining the agents:

.. code-block:: console

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/neutron-l3-drain.yml -e neutron_drain_host=<network0> -e maintenance=true -e drain_dhcp_agents=true
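
While the drain runs, you can check what remains scheduled to the node's
agents using the same OpenStack CLI calls the role makes, where
``<l3-agent-id>`` and ``<dhcp-agent-id>`` are placeholders for the agent IDs
reported for the node:

.. code-block:: console

openstack network agent list --host <network0>
openstack router list --agent <l3-agent-id> -f value -c ID
openstack network list --agent <dhcp-agent-id> -f value -c ID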

You can monitor the L3/DHCP agents being drained from the node by running:

.. code-block:: console

ssh -t <network0> watch ip netns ls
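
Each router hosted by the node's L3 agent appears as a ``qrouter-<router-id>``
namespace, and each DHCP-served network as a ``qdhcp-<network-id>`` namespace,
so the node is fully drained once the following returns no matches:

.. code-block:: console

ssh <network0> ip netns ls | grep -E 'qrouter|qdhcp'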

Once all agents have been drained, you can upgrade the containerised services
on the network node.

.. code-block:: console

kayobe overcloud service upgrade --kolla-limit <network0>

Following the service upgrade, the agents can be restored on the node by disabling maintenance:

.. code-block:: console

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/neutron-l3-drain.yml -e neutron_drain_host=<network0> -e maintenance=false -e drain_dhcp_agents=true

The above steps should be repeated for the remaining network nodes. Once all
network nodes have been upgraded, the remaining containerised services can be
upgraded:

.. code-block:: console

kayobe overcloud service upgrade --kolla-tags common,nova,prometheus,openvswitch,neutron --skip-prechecks --kolla-limit controllers,compute --limit controllers,compute

Updating the Octavia Amphora Image
----------------------------------

23 changes: 23 additions & 0 deletions etc/kayobe/ansible/neutron-l3-drain.yml
@@ -0,0 +1,23 @@
---
- name: Drain neutron of l3 agents and dhcp agents
hosts: localhost
gather_facts: true
tags:
- neutron-l3-drain
vars:
maintenance: false
drain_ctrl1: false
drain_ctrl2: false
drain_ctrl3: false
tasks:
- name: Drain or restore agents on the target host
  import_role:
    name: neutron-namespace-drain
    tasks_from: main.yml
when: drain_ctrl1 | bool or drain_ctrl2 | bool or drain_ctrl3 | bool or neutron_drain_host is defined
- name: "Print Info"
debug:
msg:
  - "{{ neutron_drain_host }} is ready for maintenance"
  - "re-run this playbook with -e maintenance=false to re-add routers"
when: maintenance | bool
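
For example, on a cloud where the controllers also host the network services,
the playbook can target a controller via the ``drain_ctrl*`` switches defined
above instead of ``neutron_drain_host``:

kayobe playbook run $KAYOBE_CONFIG_PATH/ansible/neutron-l3-drain.yml -e drain_ctrl1=true -e maintenance=true -e drain_dhcp_agents=true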
@@ -0,0 +1,8 @@
---
neutron_drain_venv: "{{ virtualenv_path }}/openstack"
neutron_drain_host: "{% if drain_ctrl1 | bool %}{{ groups['controllers'][0] }}{% elif drain_ctrl2 | bool %}{{ groups['controllers'][1] }}{% elif drain_ctrl3 | bool %}{{ groups['controllers'][2] }}{% endif %}"
neutron_drain_venv_python: "{{ 'python' ~ ansible_facts.python.version.major ~ '.' ~ ansible_facts.python.version.minor }}"
drain_ctrl1: false
drain_ctrl2: false
drain_ctrl3: false
drain_dhcp_agents: false
@@ -0,0 +1,51 @@
---
- name: Query source SRC_DHCP_ID
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent list --host {{ neutron_drain_host }}
--agent-type dhcp -f value -c ID
register: SRC_DHCP_ID
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Set fact containing SRC_DHCP_ID
set_fact:
DHCP_SRC_ID: "{{ SRC_DHCP_ID.stdout }}"

- name: Enable DHCP agent
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent set "{{ DHCP_SRC_ID }}" --enable
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Get Network IDs
command: >
{{ neutron_drain_venv }}/bin/openstack
network list -f value -c ID
register: network_ids
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Set Network IDs
set_fact:
NETWORK_IDS: "{{ network_ids.stdout_lines }}"

- name: Add DHCP agent
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent add network {{ DHCP_SRC_ID }} {{ item }} --dhcp
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"
with_items:
- "{{ NETWORK_IDS }}"
loop_control:
pause: 10
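
The tasks above mirror this manual CLI sequence for restoring a drained DHCP
agent, with placeholder IDs:

openstack network agent list --host <network0> --agent-type dhcp -f value -c ID
openstack network agent set <dhcp-agent-id> --enable
openstack network agent add network <dhcp-agent-id> <network-id> --dhcp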
@@ -0,0 +1,43 @@
---
- name: Query L3_IDs
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent list --agent-type l3 -f value -c ID
register: L3_ID
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Set fact containing SRC_L3_IDs
set_fact:
L3_IDS: "{{ L3_ID.stdout_lines }}"

- name: Get agents for each router
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent list --router {{ router_id }} --agent-type l3 -f value -c ID
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
failed_when: false
register: ROUTER_L3_IDS
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"
loop: "{{ ROUTER_IDS }}"
loop_control:
loop_var: router_id

- name: Add agent to router
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent add router --l3 {{ L3_ADD }} {{ item.router_id }}
loop: "{{ ROUTER_L3_IDS.results }}"
loop_control:
label: "{{ item.router_id }}"
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"
L3_ADD_DIFF: "{{ L3_IDS | difference([L3_SRC_ID]) | difference(item.stdout_lines) }}"
L3_ADD: "{{ L3_ADD_DIFF[:1] | first }}"
when: L3_ADD_DIFF | length > 0
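
Manually, the same re-scheduling amounts to choosing an L3 agent that does not
already host the router and adding it, for example:

openstack network agent list --router <router-id> --agent-type l3 -f value -c ID
openstack network agent add router <new-l3-agent-id> <router-id> --l3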
@@ -0,0 +1,87 @@
---
- name: Query source SRC_DHCP_ID
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent list --host {{ neutron_drain_host }}
--agent-type dhcp -f value -c ID
register: SRC_DHCP_ID
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Set fact containing SRC_DHCP_ID
set_fact:
DHCP_SRC_ID: "{{ SRC_DHCP_ID.stdout }}"

- name: Get DHCP agent network IDs
command: >
{{ neutron_drain_venv }}/bin/openstack
network list --agent {{ DHCP_SRC_ID }} -f value -c ID
register: dhcp_agent_ids
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Set DHCP agent network IDs
set_fact:
DHCP_AGENT_IDS: "{{ dhcp_agent_ids.stdout_lines }}"

- name: Remove DHCP agent
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent remove network {{ DHCP_SRC_ID }} {{ item }} --dhcp
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"
with_items:
- "{{ DHCP_AGENT_IDS }}"

- name: Wait for no more dhcp agents to be attached to the host
block:
- name: Retry count
set_fact:
retry_count: "{{ 0 if retry_count is undefined or retry_count == 'reset' else retry_count | int + 1 }}"
max_retries: 20

- name: Verify dhcp agents exist
command: >
{{ neutron_drain_venv }}/bin/openstack
network list --agent {{ DHCP_SRC_ID }} -f value -c ID
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"
register: agent_status

- name: Fail if DHCP agent still attached
fail:
        msg: a DHCP agent is still attached to "{{ DHCP_SRC_ID }}"
when: agent_status.stdout | length > 0

- name: Reset retry count after success
set_fact:
retry_count: reset
rescue:
- fail:
msg: |
Maximum retries waiting for DHCP agents to be detached reached
when: retry_count | int == max_retries

- name: Reset retry counter if max retries reached (exit loop)
set_fact:
retry_count: reset
failed_when: retry_count == 'reset'
when: retry_count | int >= max_retries | int

- name: Disable DHCP agent
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent set "{{ DHCP_SRC_ID }}" --disable
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"
when: agent_status.stdout | length == 0
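
The wait performed by the retry block can also be followed by hand; the drain
is complete when this prints nothing:

watch "openstack network list --agent <dhcp-agent-id> -f value -c ID"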
@@ -0,0 +1,90 @@
---
- name: Query source SRC_L3_ID
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent list --host {{ neutron_drain_host }}
--agent-type l3 -f value -c ID
register: SRC_L3_ID
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Set fact containing SRC_L3_ID
set_fact:
L3_SRC_ID: "{{ SRC_L3_ID.stdout }}"

- name: Get Router IDs
command: >
{{ neutron_drain_venv }}/bin/openstack
router list --agent {{ L3_SRC_ID }} -f value -c ID
register: router_ids
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Set Router IDs
set_fact:
ROUTER_IDS: "{{ router_ids.stdout_lines }}"

- name: Add agents to router
include_tasks: add-new-l3.yml
when: ROUTER_IDS | length > 0

- name: Remove router
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent remove router {{ L3_SRC_ID }} {{ item }} --l3
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"
with_items:
- "{{ ROUTER_IDS }}"

- name: Wait for no more routers to be attached to the host
block:
- name: Retry count
set_fact:
retry_count: "{{ 0 if retry_count is undefined or retry_count == 'reset' else retry_count | int + 1 }}"
max_retries: 20

- name: Verify routers exist
command: >
{{ neutron_drain_venv }}/bin/openstack router list --agent {{ L3_SRC_ID }} -f value -c ID
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
register: agent_status
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"

- name: Fail if routers still attached
fail:
msg: a Router is still attached to agent "{{ L3_SRC_ID }}"
when: agent_status.stdout | length > 0

- name: Reset retry count after success
set_fact:
retry_count: reset
rescue:
- fail:
msg: |
Maximum retries waiting for routers to be detached reached
when: retry_count | int == max_retries

- name: Reset retry counter if max retries reached (exit loop)
set_fact:
retry_count: reset
failed_when: retry_count == 'reset'
when: retry_count | int >= max_retries | int

- name: Disable L3 agent
command: >
{{ neutron_drain_venv }}/bin/openstack
network agent set "{{ L3_SRC_ID }}" --disable
environment: "{{ openstack_auth_env }}"
delegate_to: "{{ neutron_drain_host }}"
when: agent_status.stdout | length == 0
vars:
ansible_host: "{{ hostvars[neutron_drain_host].ansible_host }}"
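
The equivalent manual check before upgrading the node is to confirm that no
routers remain on the agent, then disable it:

openstack router list --agent <l3-agent-id> -f value -c ID
openstack network agent set <l3-agent-id> --disable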