
[RFE] QUADS Self-Scheduling Phase 1 via API #487

Open · sadsfae opened this issue Jun 14, 2024 · 16 comments

sadsfae (Member) commented Jun 14, 2024

This is the skeleton feature that will enable future self-scheduling via the UI here: #98

  • Design a flexible API scheduling framework using dedicated RBAC roles / token auth (perhaps we pre-define a set number of "slots" as users, like we do with cloud environments)

  • Design a threshold activation mechanism based on both cloud capacity % and per-model % usage, e.g. a host-level metadata attribute like ss_enabled: true/false

    • We can manage this with a cron-interval tool like selfservice_manager.py to check / set it as needed (see the sketch after this list)
    • We can provide a global argparse override like --enable-ss true/false
  • Design what the tenant workflow might look like, e.g.

  • Query available self scheduling
  • If enabled, what models/systems?
  • Use API to obtain a "self scheduling" cloud/user
  • Perform set commands to define the cloud and add systems (schedule times are pre-set, e.g. starting now and lasting X days or a week)
  • Possibly include API mechanisms to auto-extend so long as thresholds are met.
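
A rough sketch of what the cron-interval threshold check above could look like (everything here is illustrative: the endpoint paths, field names and thresholds are assumptions, not the actual QUADS API):

```python
# Hypothetical cron-driven check that flips a host-level "ss_enabled"
# metadata flag based on overall cloud capacity and per-model usage.
import requests

QUADS_API = "http://quads.example.com/api/v3"   # placeholder base URL
CLOUD_CAPACITY_THRESHOLD = 0.70                 # assumed: disable above 70% cloud usage
MODEL_USAGE_THRESHOLD = 0.80                    # assumed: disable above 80% per-model usage


def thresholds_ok() -> bool:
    """Return True if self-scheduling should remain enabled."""
    summary = requests.get(f"{QUADS_API}/summary", timeout=30).json()  # assumed endpoint
    cloud_usage = summary["used_clouds"] / summary["total_clouds"]
    models_ok = all(m["used"] / m["total"] <= MODEL_USAGE_THRESHOLD
                    for m in summary["models"])
    return cloud_usage <= CLOUD_CAPACITY_THRESHOLD and models_ok


def set_ss_enabled(enabled: bool) -> None:
    """Toggle the hypothetical ss_enabled attribute on eligible hosts."""
    requests.post(f"{QUADS_API}/hosts/metadata",            # assumed endpoint
                  json={"ss_enabled": enabled}, timeout=30)


if __name__ == "__main__":
    # A global --enable-ss true/false override could simply bypass thresholds_ok().
    set_ss_enabled(thresholds_ok())
```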
jtaleric (Contributor) commented

This is awesome!

Just to give a high-level what we are looking to do with Quads with our baremetal CPT, here is a quick and dirty visual ;-)

[image: CPT (8)]

Our use case for integration would be for leases <= 24 hours -- depending on the cluster size we are requesting. Basically these allocations will be ephemeral in nature. Not ideal for long investigations, but great for quick response to hardened performance automation.

Let us know if this fits the model you describe here! Thanks Will!

josecastillolema commented Jun 14, 2024

Thanks @sadsfae and @jtaleric ! I think this is a great step in the right direction to enable dynamic allocations in the scale lab.

After discussing with @jtaleric the way I have thought of the process is:

  • Step 1 - The request: This would be a new Prow step that would query the QUADS API for the leftovers (servers that are not statically assigned in today's dynamic inventory via the usual reservation process). The reservation request would look something like the following (a rough sketch also follows this list):

    • NUM_SERVERS: the number of servers being requested
    • SERVERS_LABELS: Just some examples: sriov_intel, sriov_mellanox, dell_640, disk_nvme, disk_ssd. I don't know if these labels exist today or whether they will have to be implemented.

    If the servers are available:

    • interact with the QUADS API to create a new cloud with those servers: this would set up the networking, BMCs, VLANs, generate the corresponding OCPINV, etc.
    • interact with the QUADS API (or the foreman one, not sure) to deploy the bastion server
  • Step 2 - Deploy OCP corresponds to the current openshift-qe-installer-bm-deploy which we would need to improve by implementing a dynamic inventory

  • Step 3 - Run benchmarks, no planned changes here, this is already implemented

  • Step 4 - Return the machines, a new Prow step to cleanup, delete the cloud, etc.
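
A minimal sketch of what that Prow request step might send to QUADS (the /self-schedule route, payload shape and defaults are assumptions for discussion, not an existing API):

```python
# Illustrative Prow step: ask QUADS for a lease on free servers.
import os
import requests

QUADS_API = "http://quads.example.com/api/v3"  # placeholder base URL

payload = {
    "num_servers": int(os.environ.get("NUM_SERVERS", "3")),
    "labels": os.environ.get("SERVERS_LABELS", "dell_640,disk_nvme").split(","),
    "workload": "openshift-qe baremetal CPT",
}

resp = requests.post(f"{QUADS_API}/self-schedule/request",  # hypothetical route
                     json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # e.g. assigned cloud, JIRA ticket, bastion host to deploy against
```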

sadsfae moved this from To do to To Do: High Priority and Bugs in QUADS 2.0 Series, Jun 26, 2024
sadsfae (Member, Author) commented Jun 28, 2024

Great feedback so far, keep it coming. We have a work-in-progress design document that is mostly done but needs internal review before we can share it; I gave @jtaleric and @radez a preview.

I think what you'd care about is the tenant API request workflow. Ideally it would break down into a GET request to obtain eligible systems and three API POSTs to get your own workload delivered (a rough sketch follows the list below).

You'd receive JSON responses for each of the API requests below that can be fed into any automation you have; note the switch to using the generated API token once it's issued.

1) Query available, eligible systems (OPEN)

  • API GET to return list of systems eligible for self-scheduling (if self-scheduling is unavailable, return as such)

2) Request assignment (OPEN)

  • API POST to Request a new self-service assignment
    • Sends: Workload description, tenant kerberos name, e.g. dradez (mapped to email), optionally a public VLAN requirement (we would auto-allocate a free one) and Q-in-Q design
    • Receives: Permanent unique ID for their user (creates ID if it doesn’t exist)
    • Receives: Newly generated API auth token (for this request only)
    • Receives: JIRA ticket URL for the new JIRA issue (--cloud-ticket for this request)

3) Acquire assignment (Authenticated)

  • API POST to Acquire an assignment
    • Sends: Generated token, JIRA ticket number and permanent unique ID
    • Receives: JSON return with cloud number, ticket, description

4) Deploy Assignment (Authenticated)

  • Curl API POST to Deploy their assignment

    • Sends: Token, unique ID and list of hosts they want
    • Receives: JSON status return on success/failure
  • After step 4 you can also query the open public API to view your list of machines

  • After step 4 you can also poll the release status of the cloud environment and have your automation gate on that before proceeding.
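
Putting the four steps together, an end-to-end sketch might look something like this (a minimal illustration only: the route names, payload fields and response keys are assumptions, not the final API):

```python
# Hypothetical tenant workflow: one open GET, then three POSTs.
import requests

API = "http://quads.example.com/api/v3"  # placeholder base URL

# 1) Query available, eligible systems (open endpoint)
eligible = requests.get(f"{API}/self-schedule/available", timeout=30).json()

# 2) Request a new self-service assignment (open endpoint)
req = requests.post(f"{API}/self-schedule/request", json={
    "description": "perf CI burst",
    "owner": "dradez",       # kerberos name, mapped to email
    "vlan": None,            # optional; a free public VLAN would be auto-allocated
}, timeout=60).json()
token, ticket, uid = req["token"], req["ticket"], req["user_id"]
auth = {"Authorization": f"Bearer {token}"}

# 3) Acquire the assignment (authenticated)
cloud = requests.post(f"{API}/self-schedule/acquire", headers=auth,
                      json={"ticket": ticket, "user_id": uid}, timeout=60).json()

# 4) Deploy the assignment (authenticated)
hosts = [h["name"] for h in eligible][:10]   # assumes each entry carries a "name"
status = requests.post(f"{API}/self-schedule/deploy", headers=auth,
                       json={"user_id": uid, "hosts": hosts}, timeout=60).json()
print(cloud, status)  # afterwards, poll the cloud's release status before proceeding
```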

More Details WIP

(excuse the screenshot, didn't convert to markdown)

[screenshot: Screenshot_2024-06-28_16-58-36]

sadsfae (Member, Author) commented Jun 28, 2024

Thanks @sadsfae and @jtaleric ! I think this is a great step in the right direction to enable dynamic allocations in the scale lab.

After discussing with @jtaleric the way I have thought of the process is:

  • Step 1 - The request: This would be a new Prow step that would query the QUADS API for the leftovers (servers that are not statically assigned in today's dynamic inventory via the usual reservation process). The reservation request would look something like:

    • NUM_SERVERS: the number of servers being requested
    • SERVERS_LABELS: Just some examples: sriov_intel, sriov_mellanox, dell_640, disk_nvme, disk_ssd. I don't know if these labels exist today or whether they will have to be implemented.

This already exists in our metadata models and filters for --ls-available

https://github.com/redhat-performance/quads/blob/latest/docs/quads-host-metadata-search.md

This can also be expanded upon to import anything of value via lshw and our Python metadata import tool:

https://github.com/redhat-performance/quads/blob/22bb54a2bf989103532057486542eef52fe29d1a/src/quads/tools/lshw2meta.py

If the servers are available:

  • interact with the QUADS API to create a new cloud with those servers: this would set up the networking, BMCs, VLANs, generate the corresponding OCPINV, etc.
  • interact with the QUADS API (or the foreman one, not sure) to deploy the bastion server

We would consume all the frills and features of a normal, deliberately scheduled future QUADS assignment, so all of that would be included. What you'd receive would not differ at all from what someone with a future assignment scheduled by us receives; the only difference is that you can request it yourself with a few API calls (we need several because we do a lot on the backend, including talking to JIRA and Foreman).

  • Step 2 - Deploy OCP corresponds to the current openshift-qe-installer-bm-deploy which we would need to improve by implementing a dynamic inventory
  • Step 3 - Run benchmarks, no planned changes here, this is already implemented
  • Step 4 - Return the machines, a new Prow step to cleanup, delete the cloud, etc.

The world is your oyster, but we're just auto-delivering the hardware/networks here. Any additional hour-zero work, like deploying OCP and running your workloads, is yours alone to action. You will be able to release the systems (or extend them) with the same set of APIs, though.

One thing to keep in mind is that our provisioning/release time is what it is; there are no significant speedups beyond our use of asyncio (already in the codebase and current QUADS) and what multiple concurrent gunicorn threads/listeners provide. Bare-metal / IPMI / boot mechanics are just slow, prone to weird issues and often need hands-on work, so the only thing I'd add here is to keep your expectations reasonable: we're not going to have within-the-hour deployments 100% of the time. It may take a few hours to get your systems once the dust clears, longer if hands-on intervention is required to push them through validation, like any normal QUADS assignment.

josecastillolema commented

Thanks for the great explanation @sadsfae, some newbie questions about the scale lab internals:

  • When a new cloud is assigned and goes through validation, do all of its hosts get provisioned through Foreman? Is this needed?
    • Assuming the answer to the previous question is yes, and considering our automatic deployment scenario where we only need the bastion node deployed (the other servers will be handled by the bastion installer), would it be possible to skip provisioning the rest of the nodes, or is the provisioning itself needed in order to do the validation?
  • Could the validation of the servers be done when the previous cloud assignment finishes instead of when the new cloud assignment is released, or is it dependent on the new assignment (i.e. VLANs, etc.)?

sadsfae (Member, Author) commented Jul 1, 2024

Thanks for the great explanation @sadsfae, some newbie questions about the scale lab internals:

  • When a new cloud is assigned and goes through validation, do all of its hosts get provisioned through Foreman? Is this needed?

Yes, it is absolutely needed. We have no other way to ensure systems data, settings, etc. are cleaned. More importantly, we need a way to physically test/validate network functionality. We have a series of fping tests and other validation testing to ensure the hardware is working 100%, that traffic passes on ports, and that data and settings from previous tenants are removed. We have to deploy our own Foreman RHEL because it sets up templates for all of the VLAN interfaces with deliberate IP schemes to facilitate this testing.

  • Assuming the answer to the previous question is yes, and considering our automatic deployment scenario where we only need the bastion node deployed (the other servers will be handled by the bastion installer), would it be possible to skip provisioning the rest of the nodes, or is the provisioning itself needed in order to do the validation?

Yes, but also no. There is no way to perform any validation at all without wiping the systems for a new tenant. We have an option called "no wipe" which skips all provisioning and even reboots; it simply performs the network VLAN and switch automation needed to "move" an assignment to a new cloud. We can allow this as an option via the API phases, but you would have to be 100% sure they were the same systems you used before and that nobody else used them since you last did. Otherwise you'd get stuck with effectively the same running OS/systems as whoever had those systems before you, and then you'd have to burn extra time provisioning them.

No-wipe is used more frequently in expansion scenarios, where we can't be sure there aren't conflicting broadcast services on the internal VLANs that would hijack kickstart (like DHCP/PXE from an internal tenant installer or service), or on occasion when we need to resurrect an expired assignment. It's really not designed to be used on new assignments unless you can be sure of the systems' integrity, which I don't see being a tenable or reliable situation in a self-service pool of changing hardware.

We also have no way of knowing ahead of time which system you may pick for your bastion node before you get your systems, as availability would only be based on what's free at the time. The Foreman-deployed OS, while perfectly fine as a generic RHEL, is more a vehicle for us to validate hardware and network functionality, ensure clean baselines, and catch any hands-on issues that can occur with bare metal (which happens more often than we'd like when you have thousands of anything). There is just no getting around this, and it would cause a lot more headache to try skipping it. An environment doesn't get released until all systems pass our validation phases.

  • Could the validation of the servers be done when the previous cloud assignment finishes instead of when the new cloud assignment is released, or is it dependent on the new assignment (i.e. VLANs, etc.)?

When machines finish their allocation they roll directly to the next tenant if they have an active schedule, but they still need to be wiped/validated/tested and then pass our hardware/network validation and release gating; in other words, they immediately go into this process anyway if they have another place to be.

What you're asking for generally doesn't save a lot of time anyway. Kickstart via Foreman is a fairly fast part of the QUADS workflow (network automation is the speediest, taking 5-10 seconds or so per machine).

It takes around 5-8 minutes or less (not counting reboots) for a modern system to fully kickstart with our local SDN mirror, and it's all done in parallel via the asyncio-capable Foreman QUADS library. Even if we used image-based deployments it would still take a comparable amount of time, because disk images have to be written and reboots still have to happen, and it's less flexible because we'd have to maintain many different flavors of images to account for hardware variety. In general this step is fundamental to ensuring we deliver and properly validate 100% functional systems to tenants.

sadsfae self-assigned this Jul 1, 2024
jtaleric (Contributor) commented Jul 1, 2024

@sadsfae ack! Thanks for the detailed responses!

My concern was that if we are trying to have a highly dynamic, transactional lease with QUADS there would be an enormous amount of time spent on validation/provisioning. From the above description, I don't think we are talking hours, but maybe minutes depending on the size of the request.

Initially, I envision the request size around 25 nodes: 1 bastion, 3 workers, 3 infra, 18 workers -- this would be our "largest" deployment to start with in this POC. The lease would be 24-48 hours. We might want to have the ability to run 2-3 jobs in parallel, so my hope would be that the "dynamic pool" of machines is ~75, if we can accommodate that for a POC?

sadsfae (Member, Author) commented Jul 1, 2024

@sadsfae ack! Thanks for the detailed responses!

My concern was that if we are trying to have a highly dynamic, transactional lease with QUADS there would be an enormous amount of time spent on validation/provisioning. From the above description, I don't think we are talking hours, but maybe minutes depending on the size of the request.

Hey Joe, I would set your expectations to 45 minutes to an hour if nothing goes wrong: from the time you finish all the API POST(s) needed until you receive fully validated, fresh hardware/networks. That's a reasonable goal. The number of systems doesn't matter as much because almost everything is done in parallel, but of course one machine not passing validation holds up the rest, because we demand 100% validation integrity. Most of the time this entire process goes off without a hitch unless tenants really trash the systems before you get them.

Initially, I envision the request size around 25 nodes: 1 bastion, 3 workers, 3 infra, 18 workers -- this would be our "largest" deployment to start with in this POC. The lease would be 24-48 hours. We might want to have the ability to run 2-3 jobs in parallel, so my hope would be that the "dynamic pool" of machines is ~75, if we can accommodate that for a POC?

You'll have whatever limits we need for testing as I think you'll be our first "beta" adopters, but afterwards we will be setting per-user limits for multiple concurrent self-scheduled requests; we simply don't want one user occupying the entire self-schedule pool. We don't know what that limit is yet, but we'll figure out something that allows what you're trying to do while curbing hogging as well.

I just don't know what our dynamic pool will look like when this is ready; it depends on what's free at the time, because regardless of the priority of this feature we need to operate the R&D and product-scale business first and foremost. Yes, I think 75+ to let it really grind is a great number to aim for, and very reachable looking at usage so far this summer.

One other note about delivery expectations: there are certainly areas where we can speed up the QUADS phases (move_and_rebuild and validate_environment), but our resources are aimed first at functionality, so we'd expect the same or slightly faster delivery than QUADS 1.1.x due to moving to independent gunicorn listeners and the benefits of nginx. We will need to pull out some profiling tools once this is working well to tighten things up where it makes sense, as future RFEs.

One area we need to revisit is the Foreman library: while we are using asyncio, we do limit API activity via semaphores because we've overloaded it in the past and never got back to digging into RoR and their Sinatra API too deeply. I think this needs revisiting, and there's likely tuning on the Foreman API side we need to do as well, beyond what we do with mod_passenger. cc: @grafuls

https://quads.dev/2019/10/04/concurrent-provisioning-with-asyncio/
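
Not the actual QUADS Foreman library, but a minimal illustration of the semaphore pattern described above: capping the number of concurrent Foreman API calls with asyncio so the backend doesn't get overloaded (the host names and concurrency limit are placeholders):

```python
# Sketch: limit concurrent Foreman API requests with an asyncio.Semaphore.
import asyncio
import aiohttp

MAX_CONCURRENT = 10  # assumed cap; the real value would be tuned against Foreman

async def fetch_host(session, sem, host_id):
    async with sem:  # at most MAX_CONCURRENT requests in flight at once
        async with session.get(f"https://foreman.example.com/api/hosts/{host_id}") as resp:
            return host_id, resp.status

async def main(host_ids):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_host(session, sem, h) for h in host_ids))

# results = asyncio.run(main(["host01.example.com", "host02.example.com"]))
```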

Edit: I just checked a few 30+ node assignments that went out flawlessly in the last week and they were completely validated and released in around 40-45 minutes, so I think 1 hour is a good target, possibly two hours maximum assuming no hardware/switch issues requiring hands-on work.

jtaleric (Contributor) commented Jul 1, 2024

Super helpful to set expectations, thanks Will!

sadsfae (Member, Author) commented Jul 2, 2024

Super helpful to set expectations, thanks Will!

Hey for sure, no problem. Better to set a target we're pretty sure we can hit and then start profiling and looking to see where we can reduce the delivery time going forward. So much has changed design/architecture-wise that I don't want to be too aggressive on estimates. Once we have 2.0 running in production we'll have a better baseline; ideally I'd like to aim for 30 minutes or less as a start-to-end delivery goal.

grafuls (Contributor) commented Aug 16, 2024

There has been some internal discussion on the design; here's a first look at how the flow of the system would work.

[diagram: Self Scheduling Flow]

jtaleric (Contributor) commented

Thanks @grafuls -- so for our BM CPT we will focus on 2 paths I assume

  1. Pass our CloudID for our LTA -- which it will check if things are "free"
  2. "Find Free Cloud" -- which I assume we will send some concept of "machines with these characteristics, and number of nodes"

Thanks for throwing this together!

sadsfae (Member, Author) commented Aug 19, 2024

Hey Joe, this is how it'll likely work.

Thanks @grafuls -- so for our BM CPT we will focus on 2 paths I assume

1. Pass our CloudID for our LTA -- which it will check if things are "free"

You'll first do an open GET request to obtain a JSON list of systems (max 10) and store it somewhere; you'll pass it along later as part of a bearer-auth authenticated POST payload once the first successful 201 response is returned with your temporary token, generated JIRA ticket number, and auto-allocated cloud.

We are using the concept of users like --cloud-owner, so you could, for example, also specify others as --cc-users here, because we need real human contacts for the requests. Free cloud slots are also chosen randomly here based on what is available. Roles are then associated with that token/user for the specific cloud environment slot for the lifetime of the schedules.

2. "Find Free Cloud" -- which I assume we will send some concept of "machines with these characteristics, and number of nodes"

The way we have it scoped now, you'll just pick from the actual physical systems (max of 10) that get returned. But we do have the capability to support filtering based on hardware, model or capability to curate this API response.

Thanks for throwing this together!

grafuls (Contributor) commented Aug 19, 2024

  1. Pass our CloudID for our LTA -- which it will check if things are "free"

For self-scheduling we expect users not to pass any cloud, in which case QUADS, on the backend, will auto-select an available cloud name. We are also opening up the possibility of passing a cloud name (which must also be available) to override the auto-selection.

2. "Find Free Cloud" -- which I assume we will send some concept of "machines with these characteristics, and number of nodes"

These are somewhat separate concepts.
We have a set number of cloud names [cloud01..cloud99] which represent the naming of the environments. "Find free cloud" looks for cloud names that are not assigned to currently scheduled clouds.
To get a list of available servers with specific characteristics, you would use the /available/ endpoint, passing the filters as arguments on the request.
E.g.:

curl "http://quads2.example.com/api/v3/available?interfaces.vendor=Mellanox+Technologies&model=R640&disks.disk_type=nvme"
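
For automation, the same filtered query can be issued from Python; a small sketch using the requests library (the hostname is the same placeholder as in the curl example):

```python
import requests

params = {
    "interfaces.vendor": "Mellanox Technologies",
    "model": "R640",
    "disks.disk_type": "nvme",
}
resp = requests.get("http://quads2.example.com/api/v3/available", params=params, timeout=30)
print(resp.json())  # list of available hosts matching the filters
```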

sadsfae (Member, Author) commented Dec 5, 2024

Moving efforts here to development branch, first WIP patchset: https://review.gerrithub.io/c/redhat-performance/quads/+/1204959

sadsfae pushed a commit that referenced this issue Dec 11, 2024
related: #487
Change-Id: I9fab08d06b94ed0d6cbd494d4c8049b7d1bba5de
sadsfae (Member, Author) commented Dec 18, 2024

We have started testing self-scheduling internally, please find us via the normal mechanisms if you'd like to try it out.

sadsfae moved this from To do to To Do: High Priority and Bugs in QUADS 2.2 Series, Dec 20, 2024