Multi-cell adoption #517

Open: bogdando wants to merge 1 commit into main

bogdando (Contributor) commented Jul 3, 2024

Keep the renaming of the 'default' cell consistent for single-cell and multi-cell cases:

  • default becomes cellX (or it can be imported as is, in the multi-cell
    case only)
  • cell1 is mapped to the openstack-cell1 osdp node set
  • cell2 is mapped to the openstack-cell2 osdp node set, etc.
  • cellX (X=3 here) is mapped to openstack-cell3. Alternatively, the
    default cell retains its name for the openstack-default osdpns
    mapping
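
A minimal sketch of the cell-to-node-set naming listed above, assuming the node set name is simply the target cell name with an `openstack-` prefix, as the examples suggest. The helper name `nodeset_name` is hypothetical and is not taken from the tests in this PR.

```python
def nodeset_name(cell: str) -> str:
    """Return the node set name for a target cell (assumed 'openstack-' prefix)."""
    return f"openstack-{cell}"


# Target cell names after adoption; "default" appears only if imported as is.
for cell in ["cell1", "cell2", "cell3", "default"]:
    print(f"{cell} -> {nodeset_name(cell)}")
```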

Evaluate podified MariaDB passwords for cells from osp-secret
to align the tests with the documented commands. Remove the podified
DB password variable, which is no longer needed.

Make Ansible and shell variables compute-cell aware.
Split EDPM nodes into compute cells, mapping each cell 1:1 to a
dataplane node set.

Rework the vars and secrets YAML values for the source and EDPM
nodes so that the different cell naming schemes in OSP/TripleO
and RHOSO are not confused.

Use the edpm_nodes var to describe the computes for each cell,
instead of static host and IP vars that only worked for
single-cell standalone or multi-node single-cell cases.
Also explain the EDPM net config requirements in vars.sample for when
it is used outside of ci-framework (local deployments).
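
To illustrate the idea of describing computes per cell, here is a rough sketch of such a structure and of deriving per-cell host lists from it. The layout, the key names `hostname` and `ip`, and the helper `ips_by_cell` are assumptions made for illustration only; the real schema is whatever vars.sample defines.

```python
# Hypothetical layout: cell name -> node name -> connection details.
# The keys "hostname" and "ip" are illustrative, not the real schema.
edpm_nodes = {
    "cell1": {
        "compute-0": {"hostname": "compute-0.example.com", "ip": "192.0.2.10"},
        "compute-1": {"hostname": "compute-1.example.com", "ip": "192.0.2.11"},
    },
    "cell2": {
        "compute-2": {"hostname": "compute-2.example.com", "ip": "192.0.2.20"},
    },
}


def ips_by_cell(nodes):
    """Collect the node IPs of every cell, sorted for stable output."""
    return {
        cell: sorted(node["ip"] for node in members.values())
        for cell, members in nodes.items()
    }
```

Keying by cell first means a single-cell deployment is just the special case with one top-level entry, so no separate static host and IP variables are needed.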

Remove the edpm_computes vars, which are no longer used now that
stopping the control-plane TripleO services has moved into edpm-ansible.

Remove the cached fact for the pulled OSP configuration, as it can no longer
be generated in a multi-cell setup, where the related shell variables
become bash arrays.

Simplify ENV headers management by collecting them in a single place.

Provide a variable to define the source cloud Ironic topology,
for any cells with Ironic services.

Align the ordering of nova/libvirt and related services in the
lists of services defined in multiple places with the ordering
specified in VA.

Add a missing step in the fast forward upgrade guide
to complete the adoption of the remaining dataplane services.

Align the names in the tests with the documented steps
to make the corresponding code easy to discover.

Adjust the storage/storageRequests values to better fit
multi-cell test scenarios. Also provide the values in the docs and
add a comment to adjust them as needed.

Stop OVN services only if they exist and are active (they may be
missing, for example, on the cell controllers).
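
The "stop only if active" guard can be sketched as follows. This is not the code from the PR, which uses shell commands; it only relies on `systemctl is-active --quiet` exiting with 0 for an active unit. The function names and the injectable `run` parameter are hypothetical, added so the logic can be exercised without systemd.

```python
import subprocess


def is_active(unit, run=subprocess.run):
    """True only when `systemctl is-active --quiet` exits with 0."""
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def stop_if_active(unit, run=subprocess.run):
    """Stop the unit if it is active; skip missing or inactive units."""
    if not is_active(unit, run):
        return False  # nothing to stop on this host
    run(["systemctl", "stop", unit], check=True)
    return True
```

A unit that does not exist on a host is reported as not active, so the same call can be used on cell controllers without failing.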

Depends-On: openstack-k8s-operators/ci-framework#2485

JIRA OSPRH-6548

bogdando changed the title from "Multi-cell adoption" to "[WIP] Multi-cell adoption" on Jul 3, 2024

bogdando (Contributor, Author) commented Jul 8, 2024

The recent revision gives an overview of the approach taken, PTAL.
As long as we need to maintain docs-as-code here, I'm afraid there is no cleaner solution than that.
For the ci-framework and rdo-jobs side of things, which should template all of that in, I have WIP as well...
@jistr @SeanMooney

This change depends on a change that failed to merge.

Change openstack-k8s-operators/install_yamls#826 is needed.

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,18b084c576712d289411bfab3a4bfee4b60a3fbf

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,4687df731d7a30007950c91ac21ee931ebfebf8c

bogdando (Contributor, Author) commented:

Based on feedback from @SeanMooney, we should not shift cell names as I proposed here. Instead, we want it like this:

  • Single-cell adoption (only the default cell exists): rename default to cell1.
  • Multi-cell (default, cell1, etc. exist): omit importing the default cell, as compute hosts are not supported there in a multi-cell OSP, hence there is nothing to adopt from it.
  • Or, multi-cell (default, cell1, etc. exist): omit renaming the default cell and import it as is.
  • Or, multi-cell (default, cell1, etc. exist): rename the default cell to the highest cell number + 1:
default -> cell4
cell1 -> cell1
cell2 -> cell2
cell3 -> cell3

Implementing any of these is quite challenging given the local requirement to maintain the code in tests in the same form as it is documented (meaning shell commands). This sophisticated logic will bring even more loops and array handling into the already overcomplicated code proposed in this PR draft.
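
For reference, the naming options above boil down to a small mapping function. The sketch below is in Python purely for illustration, since the tests themselves must stay as shell commands; the function name `plan_cell_renames` and the `default_mode` knob are hypothetical.

```python
import re


def plan_cell_renames(source_cells, default_mode="rename"):
    """Map source cell names to target names.

    default_mode (hypothetical knob) applies to multi-cell sources only:
    "rename" -> highest cell number + 1, "keep" -> import as is,
    "skip" -> do not import the default cell.
    """
    if default_mode not in ("rename", "keep", "skip"):
        raise ValueError(f"unknown default_mode: {default_mode}")
    numbered = [c for c in source_cells if re.fullmatch(r"cell\d+", c)]
    plan = {c: c for c in numbered}  # cellN keeps its name
    if "default" not in source_cells:
        return plan
    if not numbered:
        plan["default"] = "cell1"  # single-cell: always rename
    elif default_mode == "rename":
        highest = max(int(c[len("cell"):]) for c in numbered)
        plan["default"] = f"cell{highest + 1}"
    elif default_mode == "keep":
        plan["default"] = "default"
    return plan
```

With `["default", "cell1", "cell2", "cell3"]` as input and the default mode, the default cell maps to cell4, matching the example above. The difficulty described here is less the logic itself than expressing it in documented shell commands.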

@jistr @gibizer looking for your ideas on that

gibizer (Contributor) commented Jul 11, 2024

As nova-operator allows a cell to be named "default", the simplest solution would be your second proposal: just import the cells as is. This also has the benefit that it will work even if a given customer wrongly attached computes to the default cell.
After GA, nova-operator will get the ability to delete cells, so that feature can be used later to delete the "default" cell and make the deployment structurally the same as a greenfield 18 deployment.

bogdando (Contributor, Author) commented Jul 11, 2024

I now lean toward implementing the last option: for multi-cell (default, cell1, etc. exist), rename the default cell to the highest cell number + 1. This keeps it consistent for single-cell and multi-cell...

/update: See the combined option, which allows either renaming or importing as is

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,9541ce7f013b9b35b2cbd681cb30259da1a85157

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,54d110489b8215e014580b8b77b05ce107fd1e04

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,9954245ae2addd169cc80deab137024b7046f30e

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Unable to update github.com/openstack-k8s-operators/install_yamls

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 517,b946bca930ed67ffe94465e36e742abd9ba55d95

bogdando (Contributor, Author) commented:

With that milestone, the minimal test suite passes on my local dev env (with a minor hiccup applying netconfig, which leaves EDPM nodes disconnected until rebooted manually)!

bogdando force-pushed the multi-cell branch 2 times, most recently from ffe6546 to d5eaf58, on October 16, 2024 12:54

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/2ea9af6e68bb474e987ad859ba0bcd4f

adoption-standalone-to-crc-ceph FAILURE in 1h 34m 48s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 44m 14s
adoption-docs-preview FAILURE in 1m 11s

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/84108484e70d4e55abb69276f09327e5

adoption-standalone-to-crc-ceph FAILURE in 1h 36m 08s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 42m 09s
adoption-docs-preview FAILURE in 1m 10s

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/7b3becb36c664293919aae64119558a0

adoption-standalone-to-crc-ceph FAILURE in 1h 36m 29s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 43m 01s
adoption-docs-preview FAILURE in 1m 11s

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/43b6623e7fde44868a93840e48caa89f

adoption-standalone-to-crc-ceph FAILURE in 50m 42s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 40m 49s
adoption-docs-preview POST_FAILURE in 1m 30s

This change depends on a change with an invalid configuration.

bogdando force-pushed the multi-cell branch 3 times, most recently from b3fc88a to 3192f68, on October 25, 2024 14:07

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/9aa83b0821fd473686fe7b396d797661

adoption-standalone-to-crc-ceph FAILURE in 1h 43m 18s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 43m 26s
✔️ adoption-docs-preview SUCCESS in 1m 14s

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/00da43296435421d9dde38cb03701b90

adoption-standalone-to-crc-ceph FAILURE in 1h 38m 41s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 45m 40s
✔️ adoption-docs-preview SUCCESS in 1m 14s

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/1fd40b6d652b46dba73d848ce273ce5e

adoption-standalone-to-crc-ceph FAILURE in 1h 35m 43s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 41m 17s
✔️ adoption-docs-preview SUCCESS in 1m 15s

Signed-off-by: Bohdan Dobrelia <[email protected]>

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/12498e1c961d449483b185313ca815a4

adoption-standalone-to-crc-ceph FAILURE in 1h 37m 58s
adoption-standalone-to-crc-no-ceph FAILURE in 1h 48m 54s
✔️ adoption-docs-preview SUCCESS in 1m 12s
