Snapshot for load test environment has broken config for teams #24361

Closed
lucasmrod opened this issue Dec 4, 2024 · 10 comments
Labels
bug Something isn't working as documented ~engineering-initiated Engineering-initiated story, such as a bug, refactor, or contributor experience improvement. #g-orchestration Orchestration product group :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. ~released bug This bug was found in a stable release.
Milestone

Comments

@lucasmrod
Member

lucasmrod commented Dec 4, 2024

This bug affects every load test we run. It is caused by invalid data in the snapshot we use to populate load test environments when they are created.

The person running the load test has to perform a manual step (connecting to the database and running MySQL commands) to fix the teams after the load test environment has been created. This wastes time and is confusing for people who are new to running load tests.

We should instead automate the fix.

I found the following command we used to fix this issue in the past here.

UPDATE fleet.teams SET config = '{"mdm": {"macos_setup": {"bootstrap_package": null, "macos_setup_assistant": null, "enable_end_user_authentication": false, "enable_release_device_manually": false}, "macos_updates": {"deadline": null, "minimum_version": null}, "macos_settings": {"custom_settings": null}, "windows_updates": {"deadline_days": null, "grace_period_days": null}, "windows_settings": {"custom_settings": null}, "enable_disk_encryption": false}, "scripts": null, "features": {"enable_host_users": true, "enable_software_inventory": true}, "integrations": {"jira": null, "zendesk": null, "google_calendar": null}, "agent_options": {"config": {"options": {"pack_delimiter": "/", "logger_tls_period": 10, "distributed_plugin": "tls", "disable_distributed": false, "logger_tls_endpoint": "/api/osquery/log", "distributed_interval": 10, "distributed_tls_max_attempts": 3}, "decorators": {"load": ["SELECT uuid AS host_uuid FROM system_info;", "SELECT hostname AS hostname FROM system_info;"]}}, "overrides": {}}, "webhook_settings": {"host_status_webhook": null, "failing_policies_webhook": {"policy_ids": null, "destination_url": "", "host_batch_size": 0, "enable_failing_policies_webhook": false}}, "host_expiry_settings": {"host_expiry_window": 0, "host_expiry_enabled": false}}';
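
If this fix ends up scripted, one option is to keep the default team config as structured data and serialize it, rather than hand-editing a long quoted JSON string. The following is only a sketch: the dictionary mirrors the values in the statement above, while the helper name `build_team_config_update` and the `%s` placeholder style (as used by common Python MySQL drivers) are assumptions, not something Fleet provides.

```python
import json

# Default team config, copied from the UPDATE statement above.
TEAM_CONFIG = {
    "mdm": {
        "macos_setup": {
            "bootstrap_package": None,
            "macos_setup_assistant": None,
            "enable_end_user_authentication": False,
            "enable_release_device_manually": False,
        },
        "macos_updates": {"deadline": None, "minimum_version": None},
        "macos_settings": {"custom_settings": None},
        "windows_updates": {"deadline_days": None, "grace_period_days": None},
        "windows_settings": {"custom_settings": None},
        "enable_disk_encryption": False,
    },
    "scripts": None,
    "features": {"enable_host_users": True, "enable_software_inventory": True},
    "integrations": {"jira": None, "zendesk": None, "google_calendar": None},
    "agent_options": {
        "config": {
            "options": {
                "pack_delimiter": "/",
                "logger_tls_period": 10,
                "distributed_plugin": "tls",
                "disable_distributed": False,
                "logger_tls_endpoint": "/api/osquery/log",
                "distributed_interval": 10,
                "distributed_tls_max_attempts": 3,
            },
            "decorators": {
                "load": [
                    "SELECT uuid AS host_uuid FROM system_info;",
                    "SELECT hostname AS hostname FROM system_info;",
                ]
            },
        },
        "overrides": {},
    },
    "webhook_settings": {
        "host_status_webhook": None,
        "failing_policies_webhook": {
            "policy_ids": None,
            "destination_url": "",
            "host_batch_size": 0,
            "enable_failing_policies_webhook": False,
        },
    },
    "host_expiry_settings": {"host_expiry_window": 0, "host_expiry_enabled": False},
}


def build_team_config_update(config):
    """Return (sql, params) for a parameterized UPDATE of every team's config.

    Hypothetical helper: passing the JSON as a parameter lets the driver
    handle quoting instead of embedding the document in the SQL text.
    """
    payload = json.dumps(config)
    return "UPDATE fleet.teams SET config = %s", (payload,)


sql, params = build_team_config_update(TEAM_CONFIG)
print(sql)
```

The returned pair could then be handed to a driver's `cursor.execute(sql, params)`; the connection details are left out because they depend on the load test environment.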

Additional fixes

We also want to fix the invalid webhook configuration, which is causing noise in the server error logs.

@lucasmrod lucasmrod added bug Something isn't working as documented #g-mdm MDM product group #g-endpoint-ops Endpoint ops product group :incoming New issue in triage process. ~engineering-initiated Engineering-initiated story, such as a bug, refactor, or contributor experience improvement. labels Dec 4, 2024
@rfairburn
Contributor

Wouldn't the best bet be to restore the DB, run the MySQL commands, save a new snapshot, and then update the snapshot that loadtesting uses as its starting point in the terraform config?

This would remove the need to add a permanent new first step.

https://github.com/fleetdm/fleet/blob/main/infrastructure/loadtesting/terraform/rds.tf#L67 is where we specify the snapshot used in the terraform.

@rfairburn
Contributor

If migrations need to be run first, we could also migrate up to the desired patch level before applying the SQL above and use the result as the new starting point for the snapshot.

We'll just want to make sure that no loadtest containers connect to the system prior to saving the new snapshot.

@lucasmrod
Member Author

> Wouldn't the best bet be to restore the DB, run the MySQL commands, save a new snapshot, and then update the snapshot that loadtesting uses as its starting point in the terraform config?

Good idea, and probably better, yes. This assumes the new config doesn't break migrations, which is unlikely; we could apply the snapshot locally to the version of Fleet we use as the starting point and then run migrations.

Do you have access to that snapshot to run this yourself? Alternatively, if you give me read and write access, I could run it myself.

@rfairburn
Contributor

I checked and you are already a member of Loadtesting Admins and should have full access to read and write RDS snapshots in AWS for that account.

@dantecatalfamo
Member

Hey @lucasmrod, it looks like the loadtesting snapshot is also causing issues with scheduled query ingestion: #24386

@lukeheath lukeheath added ~released bug This bug was found in a stable release. :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. and removed #g-mdm MDM product group labels Dec 6, 2024
@lukeheath
Member

@lucasmrod Thanks for filing. I'm putting this on the Endpoint Ops board for now to explore.

@sharon-fdm sharon-fdm added this to the 4.62.0-tentative milestone Dec 10, 2024
@sharon-fdm sharon-fdm removed this from the 4.62.0-tentative milestone Dec 19, 2024
@lukeheath lukeheath added the #g-orchestration Orchestration product group label Dec 19, 2024
@sharon-fdm sharon-fdm removed the #g-endpoint-ops Endpoint ops product group label Jan 6, 2025
@sharon-fdm sharon-fdm added this to the 4.63.0-tentative milestone Jan 10, 2025
@sharon-fdm sharon-fdm removed the :incoming New issue in triage process. label Jan 15, 2025
@sharon-fdm sharon-fdm modified the milestones: 4.63.0, 4.64.0-tentative Jan 15, 2025
@rfairburn
Contributor

I am creating a snapshot that does the following.

  1. It updates the minimum version of a loadtest to v4.55.0. I wanted to make sure that MDM at least existed in the snapshot so that the teams config above wouldn't cause issues.
  2. I ran UPDATE fleet.teams SET config = '{"mdm": {"macos_setup": {"bootstrap_package": null, "macos_setup_assistant": null, "enable_end_user_authentication": false, "enable_release_device_manually": false}, "macos_updates": {"deadline": null, "minimum_version": null}, "macos_settings": {"custom_settings": null}, "windows_updates": {"deadline_days": null, "grace_period_days": null}, "windows_settings": {"custom_settings": null}, "enable_disk_encryption": false}, "scripts": null, "features": {"enable_host_users": true, "enable_software_inventory": true}, "integrations": {"jira": null, "zendesk": null, "google_calendar": null}, "agent_options": {"config": {"options": {"pack_delimiter": "/", "logger_tls_period": 10, "distributed_plugin": "tls", "disable_distributed": false, "logger_tls_endpoint": "/api/osquery/log", "distributed_interval": 10, "distributed_tls_max_attempts": 3}, "decorators": {"load": ["SELECT uuid AS host_uuid FROM system_info;", "SELECT hostname AS hostname FROM system_info;"]}}, "overrides": {}}, "webhook_settings": {"host_status_webhook": null, "failing_policies_webhook": {"policy_ids": null, "destination_url": "", "host_batch_size": 0, "enable_failing_policies_webhook": false}}, "host_expiry_settings": {"host_expiry_window": 0, "host_expiry_enabled": false}}';
  3. I also ran the query UPDATE app_config_json SET json_value = json_set(json_value, '$.webhook_settings.failing_policies_webhook.enable_failing_policies_webhook', false); to disable the failing webhook.
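
To make the effect of the `json_set` call in step 3 concrete, here is a small pure-Python illustration of what writing `false` at that path does to an app config document. It is a sketch under assumptions: the `json_set` function below is a simplified stand-in that handles only dotted object keys, and the sample document (including its `destination_url`) is made up for the example rather than taken from the snapshot.

```python
import copy
import json


def json_set(document, path, value):
    """Return a copy of `document` with `value` written at a '$.a.b.c' path.

    Simplified stand-in for MySQL's JSON_SET: only dotted object keys are
    supported (no array indexes or wildcards).
    """
    if not path.startswith("$."):
        raise ValueError("path must start with '$.'")
    keys = path[2:].split(".")
    result = copy.deepcopy(document)
    node = result
    for key in keys[:-1]:
        child = node.get(key)
        if not isinstance(child, dict):
            # A missing parent object means nothing is written; the
            # unchanged copy is returned.
            return result
        node = child
    node[keys[-1]] = value
    return result


# Hypothetical app config fragment with the webhook still enabled.
app_config = {
    "webhook_settings": {
        "failing_policies_webhook": {
            "destination_url": "https://example.invalid/hook",
            "enable_failing_policies_webhook": True,
        }
    }
}

PATH = "$.webhook_settings.failing_policies_webhook.enable_failing_policies_webhook"
updated = json_set(app_config, PATH, False)
print(json.dumps(updated, indent=2))
```

Only the one flag changes; sibling keys such as `destination_url` are left as they were, which is why the statement disables the webhook without discarding the rest of the configuration.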

@rfairburn
Contributor

#25495 is the PR that enables the new snapshot. @dantecatalfamo please feel free to try the snapshot at your convenience to ensure it does what you need. We can do further updates if needed.

@PezHub
Contributor

PezHub commented Feb 11, 2025

QA Notes:

A fresh loadtest build off RC 4.64.0 shows that:

  • The pre-populated teams are now working as expected without applying the manual fix. I spot-checked that the settings were set accordingly.
  • The webhook has been disabled under the Other workflows automation.
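
A spot check like the one described above could also be scripted against a team's `config` value. This is a rough sketch, not the procedure used during QA: the function names are hypothetical, and the handful of expected values is an assumed sample drawn from the UPDATE statement earlier in the thread.

```python
import json

# Assumed sample of settings to verify, taken from the UPDATE statement
# earlier in the thread. The actual QA check may have covered other fields.
EXPECTED = {
    "mdm.enable_disk_encryption": False,
    "features.enable_host_users": True,
    "features.enable_software_inventory": True,
    "webhook_settings.failing_policies_webhook.enable_failing_policies_webhook": False,
    "host_expiry_settings.host_expiry_enabled": False,
}

_MISSING = object()


def lookup(config, dotted_path):
    """Walk nested dicts along a dotted path; return _MISSING if absent."""
    node = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def spot_check(raw_config):
    """Return a list of problems found in a team's JSON config string."""
    try:
        config = json.loads(raw_config)
    except json.JSONDecodeError as exc:
        return [f"config is not valid JSON: {exc}"]
    problems = []
    for path, expected in EXPECTED.items():
        actual = lookup(config, path)
        if actual is _MISSING:
            problems.append(f"{path}: missing")
        # Compare types as well, because 0 == False in Python.
        elif actual != expected or type(actual) is not type(expected):
            problems.append(f"{path}: expected {expected!r}, got {actual!r}")
    return problems
```

An empty list means the sampled settings match; each entry otherwise names the path that is missing or differs.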

@fleet-release
Contributor

Snapshots fix, automated,
Smooth load tests, time liberated.
Silent servers, no more jaded.

7 participants