Add in logging and must gathers for failed jobs #37

Open
paigerube14 opened this issue Dec 23, 2021 · 5 comments

@paigerube14
Contributor

Similar to the Upgrade CI, we need to be able to run lots of jobs and log issues without the cluster still being around.

It would be helpful to print out the logs, and in certain cases maybe a must-gather, so that bugs can be opened properly.

Some thoughts:

  • Add a describe or the logs of the machineset/node in case of a cluster-workers-scaling failure
  • Add a new must-gather branch to be called when a scale-ci or upgrade job fails
  • Collect the logs of failing pods in scale-ci jobs (rough sketch below)
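
Roughly, I'm picturing a failure handler along these lines in the job pipelines (the commands, namespaces, and archive step are only illustrative, not final):

```groovy
// Illustrative sketch only: dump debug info on failure, before the cluster goes away.
pipeline {
    agent any
    stages {
        stage('scale-ci workload') {
            steps {
                echo 'existing job steps run here'
            }
        }
    }
    post {
        failure {
            sh '''
                # machineset/node state, for cluster-workers-scaling failures
                oc describe machinesets -n openshift-machine-api || true
                oc describe nodes || true

                # state of failing pods in scale-ci jobs
                oc get pods --all-namespaces --field-selector=status.phase!=Running -o wide || true

                # full must-gather so there is enough data to open a bug properly
                oc adm must-gather --dest-dir=must-gather-${BUILD_NUMBER} || true
            '''
            archiveArtifacts artifacts: 'must-gather-*/**', allowEmptyArchive: true
        }
    }
}
```
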
@paigerube14
Contributor Author

The issue Simon opened, and the open PR attached to it, cover the first bullet: #89

@paigerube14
Contributor Author

@skordas @rpattath @mffiedler @qiliRedHat

What are your thoughts about moving all of the job calls into the loaded-upgrade job itself? I.e., have loaded-upgrade do the calls to flexy (already set up), scale up (currently done in each scale-ci job specifically), the scale job itself (done), cluster check (which I want to add based on this issue), upgrade (done), and destroy if necessary.

I think this would mean a lot fewer PRs and make things easier to manage in each of the branches. For this issue specifically, I want to add a call to cerberus to run one iteration of certain checks on the cluster after the load is complete, to make sure we should continue. Currently I would have to add a PR to each of the scale-ci jobs, and if any parameters ever needed to be added to the cerberus job I would have to open PRs in each scale-ci job for that update too. I am suggesting adding the call from the base loaded-upgrade job so that the loaded-upgrade job is all-encompassing.
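
Roughly, I'm imagining the loaded-upgrade pipeline owning all of those calls, something like this (the job and parameter names are placeholders based on what we run today, not the final wiring):

```groovy
// Rough sketch only: loaded-upgrade driving every downstream job (names are placeholders).
pipeline {
    agent any
    parameters {
        string(name: 'SCALE_CI_JOB', description: 'which scale-ci job to load the cluster with')
    }
    stages {
        stage('flexy-install')     { steps { build job: 'flexy-install' } }            // already set up today
        stage('scale up workers')  { steps { build job: 'cluster-workers-scaling' } }  // moves out of each scale-ci job
        stage('scale-ci workload') { steps { build job: params.SCALE_CI_JOB } }        // the scale job itself
        stage('cluster check') {                                                       // what this issue wants to add
            steps {
                build job: 'cerberus', parameters: [string(name: 'ITERATIONS', value: '1')]
            }
        }
        stage('upgrade')           { steps { build job: 'upgrade' } }                  // already done today
    }
    post {
        cleanup {
            build job: 'flexy-destroy', propagate: false   // destroy at the end (could be made conditional)
        }
    }
}
```
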

For example, the env_vars variable wasn't being passed to the scale-worker job from some of the different scale-ci jobs, so I had to create a PR for each place in the scale-ci jobs that calls the scale-worker job to add the specific variable.

Cons I can think of: a lot more clicking around, into each job that ran off of the main loaded-upgrade run.

Overall, what are your thoughts?

@skordas

skordas commented Mar 9, 2022

I think we can have a long pipeline if everything is well documented - that is mostly for the sake of new team members. We are all growing with the changes :)

my few cents:

  • (about clicking around) Set default values in such a way that you only have to click Build and it will work (that, of course, would be tricky).
  • (about clicking around) Set the parameter inputs in the same order in which we run them.
  • must-gather: run it just before flexy-destroy, always, if some step failed (sketch below).
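
Something like this at the tail end of the loaded-upgrade pipeline is what I have in mind for the last point (must-gather here is the job proposed in this issue; the exact wiring is up for discussion):

```groovy
// Sketch only: must-gather runs only when a step failed, and always before flexy-destroy.
post {
    failure {
        build job: 'must-gather', propagate: false    // collect data while the cluster is still up
    }
    cleanup {
        build job: 'flexy-destroy', propagate: false  // cleanup always runs last in declarative post
    }
}
```
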

@qiliRedHat
Contributor

@paigerube14 I agree with your thinking about reducing PRs. I also think we could even use the loaded-upgrade job in our scale-ci test runs. This would merge two steps, 1) calling the Flexy-install job and 2) triggering the scale-ci job, into one and save some effort. But we have to be able to pass more configuration from the loaded-upgrade job to the Flexy-install job, like VARIABLES_LOCATION and LAUNCHER_VARS. At least vm_type should be configurable, since on some large-scale clusters we need a bigger master and worker type, e.g. for the cluster density test https://deploy-preview-41943--osdocs.netlify.app/openshift-enterprise/latest/scalability_and_performance/recommended-host-practices.html#mast[…]ing_
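
For example, when loaded-upgrade triggers Flexy-install it could forward something like this (the parameter names are the ones mentioned above; the values are only placeholders):

```groovy
// Sketch only: loaded-upgrade passing extra configuration through to Flexy-install.
build job: 'flexy-install',
      parameters: [
          string(name: 'VARIABLES_LOCATION', value: params.VARIABLES_LOCATION),
          // e.g. a bigger master/worker vm_type for large-scale / cluster-density tests
          text(name: 'LAUNCHER_VARS', value: 'vm_type: m5.4xlarge')
      ]
```
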

@qiliRedHat
Contributor

@paigerube14 I created an Epic for our scale-ci automation enhancement work in 4.11, https://issues.redhat.com/browse/OCPQE-9141, to better organize that work. I added your Jira task for this issue under it.
