[spike] Gather info for Vai Testing #1276

Closed
5 tasks
git-ival opened this issue Apr 10, 2024 · 8 comments

@git-ival

git-ival commented Apr 10, 2024

  • What framework(s) are expected to be used here?
    • Is there any known pre-existing code that can help in the test effort?
    • What functionality is needed in order to automate this testing as part of a regression suite?
  • Is there a particular cluster configuration that should be targeted for this testing?
    • HA, single-node, # of CPUs, amount of RAM, K8s distro, K8s version, etc.
    • Cloud provider (AWS, Azure, etc?)
  • What metrics must be tracked?
    • What are the cutoff points for each metric that needs to be tracked? (When should we mark a given test as pass/fail based on a given metric's performance?)
    • What tool(s) can be used to track the listed metrics? If Prometheus/Grafana: what query(ies)/dashboards should be used for tracking?
  • What benchmark testing, if any, needs to be accounted for?
    • # of clusters, # of nodes, # of nodes per cluster, etc.
    • # of rolebindings, # of secrets, # of namespaces, etc.
  • Cache testing - tbd
@git-ival git-ival changed the title [spike] Gather info on Vai Testing [spike] Gather info for Vai Testing Apr 10, 2024
@git-ival
Author

git-ival commented Apr 11, 2024

UI Benchmark Considerations:

  • Assuming a cluster with 80,000 ConfigMaps with 1MiB payload each

  • First load of ConfigMap page should happen within 2.5 seconds

  • Changing to a different page should happen within 1 second

    • Same for changing filtering
  • Not more than 750MiB RAM per Browser Tab

We will need to rely on Cypress or a similar tool in order to perform browser-based frontend tests. We hope to leverage the rancher/dashboard test framework wherever we can for this effort.

Loading up the cluster with ConfigMaps can be done via shepherd or a relatively simple bash script. This process will need to be batched.
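
As a rough illustration of the batching mentioned above, here is a minimal Go sketch that splits the 80,000 ConfigMaps into fixed-size batches and renders each batch as a Kubernetes `List` manifest. The batch size of 10, the `vai-test-` name prefix, and the `payload` data key are all assumptions made for the example; applying each manifest (via shepherd, `kubectl apply -f`, or anything else) is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// batchRanges splits total items into consecutive [start, end) ranges
// of at most size items each. It returns nil for a non-positive size.
func batchRanges(total, size int) [][2]int {
	if size <= 0 {
		return nil
	}
	var ranges [][2]int
	for start := 0; start < total; start += size {
		end := start + size
		if end > total {
			end = total
		}
		ranges = append(ranges, [2]int{start, end})
	}
	return ranges
}

// configMapList renders one batch as a v1 List of ConfigMaps in JSON.
// The "vai-test-" prefix and the "payload" key are hypothetical names.
func configMapList(start, end int, namespace, payload string) ([]byte, error) {
	items := make([]map[string]any, 0, end-start)
	for i := start; i < end; i++ {
		items = append(items, map[string]any{
			"apiVersion": "v1",
			"kind":       "ConfigMap",
			"metadata": map[string]any{
				"name":      fmt.Sprintf("vai-test-%05d", i),
				"namespace": namespace,
			},
			"data": map[string]string{"payload": payload},
		})
	}
	return json.Marshal(map[string]any{
		"apiVersion": "v1",
		"kind":       "List",
		"items":      items,
	})
}

func main() {
	// Batch size is an assumption; tune it to what the API server tolerates.
	ranges := batchRanges(80000, 10)
	fmt.Println("batches:", len(ranges))

	// Render a tiny first batch with a one-byte payload for inspection.
	manifest, err := configMapList(0, 2, "default", "x")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(manifest))
}
```

Keeping batches small matters here because each manifest carries the full payload of every ConfigMap in it.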

Steve API Benchmark Considerations:

  • Paginated ConfigMaps must be returned from the API at a rate of 100 Resources/500ms (not including network latency and transfer time)
  • (Nice-to-have) Modify increasing amounts of objects until reaching the "limit" of objects that can be updated while remaining at <1 second page load times

Ideally we can automate these sooner rather than later; if push comes to shove, we can do the 1st run "manually".

We can utilize k6 for bullet 1 and bullet 2. Bullet 2 is more complex, as verification would require kicking off a Cypress test that confirms page load stays under 1 second at each step of the way.

@git-ival
Author

git-ival commented Apr 17, 2024

Found an old golang library for ingesting JUnit XML reports: https://github.com/joshdk/go-junit
This library could be useful in order to parse Cypress results in dartboard for a final pass/fail.

Cypress can output a JUnit XML report: https://docs.cypress.io/guides/tooling/reporters
k6 does not support outputting JUnit XML reports, but it does support outputting as JSON (https://k6.io/docs/results-output/real-time/json/).
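
To give an idea of what a final pass/fail over a JUnit XML report could look like, here is a minimal sketch that uses only Go's standard `encoding/xml` rather than the go-junit library linked above. It assumes the report has a `<testsuites>` root with `<testsuite>` children carrying `failures` and `errors` attributes; the sample report is made up for the example and real Cypress output may carry more fields.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// testSuites mirrors the <testsuites> root element of a JUnit report.
type testSuites struct {
	Suites []testSuite `xml:"testsuite"`
}

// testSuite holds the per-suite counters needed for a verdict.
type testSuite struct {
	Name     string `xml:"name,attr"`
	Tests    int    `xml:"tests,attr"`
	Failures int    `xml:"failures,attr"`
	Errors   int    `xml:"errors,attr"`
}

// junitPassed reports whether the report contains at least one suite
// and no suite recorded a failure or an error.
func junitPassed(report []byte) (bool, error) {
	var suites testSuites
	if err := xml.Unmarshal(report, &suites); err != nil {
		return false, err
	}
	if len(suites.Suites) == 0 {
		return false, nil // an empty report is not treated as a pass
	}
	for _, s := range suites.Suites {
		if s.Failures > 0 || s.Errors > 0 {
			return false, nil
		}
	}
	return true, nil
}

// failingReport is a made-up sample with one failing test case.
const failingReport = `<testsuites name="Mocha Tests" tests="2" failures="1">
  <testsuite name="ConfigMap list" tests="2" failures="1" errors="0">
    <testcase name="loads first page" time="1.2"/>
    <testcase name="changes page" time="1.4">
      <failure message="took too long"/>
    </testcase>
  </testsuite>
</testsuites>`

func main() {
	passed, err := junitPassed([]byte(failingReport))
	if err != nil {
		panic(err)
	}
	fmt.Println("passed:", passed)
}
```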

@moio
Collaborator

moio commented Apr 23, 2024

Organizational notes:

  • I suggest we spin two tasks off this spike: implementation of a backend and a frontend benchmark. Then tackle them in order. Reasons:
    1. frameworks and code will be necessarily different
    2. there is no way frontend will ever pass until backend passes first
    3. backend is much more likely to fail than frontend, as we are lifting complexity from frontend and moving it to backend
    4. we have less know-how about frontend benchmarking than we have in backend benchmarking as a group today, so they even have different risk levels from a project perspective

I suggest you create two separate issues and discuss frameworks, test setup, metrics and criteria separately.

@moio
Collaborator

moio commented Apr 23, 2024

Backend test notes

Cluster setup

  • 3 server nodes in HA, 16 vCPUs, 64 GiB RAM, one local or fast network SSD (eg. AWS's EBS gp3 type volumes) each
  • external etcd cluster running on 3 servers with 8 vCPUs, 32 GiB RAM and locally attached NVMe SSDs (eg. AWS's m6gd.2xlarge)
  • Local cluster distribution is the latest supported RKE2 or latest supported K3s. RKE1 is explicitly out of scope

That is a monster setup, I expect this benchmark to actually pass with way less hardware - and in any case 95% of the development should be carried out on a smaller setup, and hardware maxing should happen as a last step, if needed.

A starting point could be: AWS: 3 nodes, 4 vCPUs, 16 GiB of RAM each (eg. t3a.xlarge) with 50 GiB EBS gp3 root volumes, on latest supported RKE2, internal etcd cluster.

(I have no problem with doing development with k3d on your laptop if that is more convenient, then re-running on the above "light" setup and keeping the "heavy monster" setup as a last option, only if all else fails.)

Repeating the test on k3s is relatively unimportant and can be left for a later point, eg. after the browser tests are complete.

If you need any other details about setup please ask.

Benchmarking criteria notes

  • according to the PD&O "All time targets are intended as 95-percentile over an adequate number of repetitions."
  • a starting point for adequate can be 30. We can look at variance after the fact to tweak that number (if variance is low it can be safely reduced)
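
For reference, the 95-percentile and the variance check above can be sketched in a few lines of Go. This is only an illustration: it uses the nearest-rank percentile definition, which is a choice made here and may differ slightly from the value k6 reports, and the sample durations are invented.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of samples, or 0 when there are no samples.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // leave the input untouched
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// stddev returns the sample standard deviation (n-1 denominator),
// or 0 when fewer than two samples are given.
func stddev(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	mean := sum / float64(len(samples))
	var sq float64
	for _, s := range samples {
		sq += (s - mean) * (s - mean)
	}
	return math.Sqrt(sq / float64(len(samples)-1))
}

func main() {
	// 30 invented repetitions, in milliseconds: 10, 20, ..., 300.
	samples := make([]float64, 0, 30)
	for i := 1; i <= 30; i++ {
		samples = append(samples, float64(i*10))
	}
	fmt.Printf("p(95) = %.0f ms, stddev = %.1f ms\n",
		percentile(samples, 95), stddev(samples))
}
```

With 30 repetitions the nearest-rank p(95) is the 29th smallest sample, so a single outlier is tolerated but two are not.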

What you really care about is that asking for a page (100 resources) consistently stays below half a second in HTTP duration (see below) in 95% of cases, no matter the sorting, filtering, resource type and size. You should also see how well that number scales as the number of virtual users grows, the minimum being 20 users making 1 request every 5 seconds.

As a second objective, add virtual users who concurrently change the ConfigMaps and see how performance degrades as more virtual users change them (this is more exploratory, we can set a pass/fail limit when we see the first results).

Metrics tracking: as a first step, make sure relevant stats are recorded in Qase (eg. p(95) expected: under 500ms, actual: 234 ms, test PASS). Full k6 output nice to have. Grafana tracking can be added later.
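
A small helper along these lines could produce the stat line quoted above. This is a sketch only: the function name and signature are hypothetical, values are taken in milliseconds as k6 reports them, and actually posting the line to Qase is out of scope.

```go
package main

import "fmt"

// verdict compares an observed value against an upper limit (both in
// milliseconds) and returns a one-line summary plus the pass/fail result.
func verdict(metric string, limitMs, actualMs float64) (string, bool) {
	pass := actualMs < limitMs
	status := "FAIL"
	if pass {
		status = "PASS"
	}
	line := fmt.Sprintf("%s expected: under %.0fms, actual: %.0f ms, test %s",
		metric, limitMs, actualMs, status)
	return line, pass
}

func main() {
	line, _ := verdict("p(95)", 500, 234)
	fmt.Println(line)
}
```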

Framework choice notes

  • k6 makes working with the above stats easy, as it computes percentiles and divides them between download and processing time by default
  • is it possible for a shepherd-based test, integrated in one of the regularly run testsuites, to shell out to k6? Do we have a setup in which the node running shepherd+k6 is on the same network as the cluster under test, and it is decently sized hardware-wise (k6 can generate quite some load)? Is it easy to do that in, eg., AWS?
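
On the shelling-out question, a Go test can start k6 through `os/exec`. The sketch below is an illustration under assumptions, not how shepherd does it: the script name, the `BASE_URL` and `VUS` variable names, and the timeout are hypothetical, and it assumes a `k6` binary on the `PATH`. It relies on `k6 run` accepting `-e KEY=VALUE` to pass environment variables to the script.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"sort"
	"time"
)

// k6Args builds the argument list for "k6 run", passing each entry of
// env as a "-e KEY=VALUE" pair in sorted key order.
func k6Args(script string, env map[string]string) []string {
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order makes logs and tests stable

	args := []string{"run"}
	for _, k := range keys {
		args = append(args, "-e", k+"="+env[k])
	}
	return append(args, script)
}

// runK6 executes k6 with a timeout and streams its output to the
// terminal. A non-nil error covers both a missing binary and a
// non-zero exit code from k6.
func runK6(ctx context.Context, script string, env map[string]string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "k6", k6Args(script, env)...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical script and variable names, for illustration only.
	env := map[string]string{
		"BASE_URL": "https://rancher.example.com",
		"VUS":      "20",
	}
	// Only print the command here; call runK6 once k6 and the script exist.
	fmt.Println("k6", k6Args("steve_read.js", env))
}
```

Since k6 should exit with a non-zero code when a threshold defined in the script fails, the error returned by `runK6` can serve as a first coarse pass/fail signal before any JSON output is parsed.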

Implementation notes

  • parsing k6 JSON output is easy. What you need is something like:
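type Metrics struct {
	HTTPReqDuration struct {
		Values struct {
			P95 float64 `json:"p(95)"`
		} `json:"values"`
	} `json:"http_req_duration"`
}

// Result was not defined in the original snippet. This wrapper assumes
// the metrics sit under a top-level "metrics" key in the k6 summary.
type Result struct {
	Metrics Metrics `json:"metrics"`
}

...
	// needs the "encoding/json" and "io" imports; io.ReadAll replaces
	// the deprecated ioutil.ReadAll
	bytes, err := io.ReadAll(jsonFile)
	if err != nil {
		return err // assumes the enclosing function returns an error
	}

	var result Result
	if err := json.Unmarshal(bytes, &result); err != nil {
		return err
	}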
  • here you have a k6 script that we generally use for read benchmarks at customers as a starting point. It already supports the new Steve pagination style (and it also supports Norman which you can easily and safely drop). Feel free to reach out to me when you have the infra running and you need guidance to go deeper on benchmark specifics

@richard-cox
Member

richard-cox commented Apr 23, 2024

UI side, it's important to note that the new Vai-backed API and its features will be used

  • Eventually, in all resource lists via server-side pagination
  • In multiple different places to remove times when the UI fetches ALL of a resource
  • Generally for all steve based API requests, regardless of filtering / sorting / pagination

This effort is tracked in rancher/dashboard#8527 and will be partially complete in 2.9.0 (as described in Server-Side Pagination - 2.9.0 State / Solution). That doc also has a rough spec for QA.

@git-ival
Author

@moio In regards to the upstream cluster setup for vai testing, should we model the same config? Example: 20 projects, 1000 Secrets, 5 users, 10 roles, 50 workload pods, etc.

@moio
Collaborator

moio commented Apr 26, 2024

@git-ival FMPOV not necessarily. To me, they could as well be empty or almost empty (as empty as a default installation is).

What you will need is tens of thousands of the specific resource under test (eg. ConfigMaps if you are testing the ConfigMaps page, Secrets if it is Secrets and so on) on the cluster under test (upstream and at least one downstream should be tested, because the affected Steve code is in both). But in principle, other resources should not matter.
