Set exact time of first audit #2587
Comments
To be clear, you want to set a specific start time for audit, correct? Do you still want audit to run at regular intervals as well? What happens if one cluster takes longer for audit to finish and misses its next start time? Should it run ASAP? Not run at all? Are users expected to keep the flag up-to-date, or is it "start audit the next time the pod is alive and the clock crosses the specified time"?
Is it important for how you collate the results that the start times line up exactly? "Scheduled start time" we can probably get consistent, but "real start time", probably not. TBH, because of the time skew involved with running a process across multiple systems (there is no guarantee all audits will finish at the same time, even if they start at the same time, for example), I'm a bit leery of leaning too heavily into a synchronization approach. That being said, maybe it's only rough correlation you're looking for?
The idea of blackout periods seems like a good one, and probably something we'd use separate from logs. For consistency, having audit start immediately, or having the intervals be defined by something outside of the running process, seems like a good idea (e.g. write out a "next audit at" resource to the API server). That would also allow audit to restart if the audit pod was killed mid-previous-run.
Hi Max,
I want to set a specific start time for the first cycle. Then audit runs periodically using the specified "--auditInterval" flag.
Yes, I expect audit to run at regular intervals specified by the "--auditInterval" flag as usual after the first cycle. Of course there will be some drift across clusters, but it should be no longer than several minutes.
Users are expected to provide the flag once while installing/updating, with the understanding that the first cycle will be executed at the specified time and subsequent cycles follow the specified interval flag. If the pod is temporarily down, the next tick will launch it, as it works now.
I do not expect millisecond-precise start/stop times; a drift of several minutes is completely fine. If the first cycle fails, we expect the next one to be executed the next day at the same time (with 86400 as the audit interval). In the long run we expect to have results daily at the same time across the clusters, with a drift of several minutes, e.g. scheduling audit at 9:00 and running a daily log query over the timeframe "from 9:00 to 10:00". To achieve this now, we must install/update Gatekeeper at the exact same time (~9:00) in every cluster and then wait 24 h for the first results, which is much harder than providing a flag. I agree that our case with daily audits may be fairly unique, but I also believe that providing an exact time for the first start would bring more predictability for longer intervals than the current behaviour. It should work pretty well with hourly intervals too.
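For illustration, here is a minimal Go sketch of the behaviour described above, assuming a hypothetical start-time setting; neither the flag nor the delayUntil helper exists in Gatekeeper today.

```go
// Sketch only: wait until the next occurrence of a wall-clock start time
// (09:00 here), run the first audit then, and repeat every --auditInterval
// seconds afterwards.
package main

import (
	"fmt"
	"time"
)

// delayUntil returns how long to wait until the next occurrence of hh:mm local time.
func delayUntil(hour, minute int) time.Duration {
	now := time.Now()
	next := time.Date(now.Year(), now.Month(), now.Day(), hour, minute, 0, 0, now.Location())
	if !next.After(now) {
		next = next.Add(24 * time.Hour) // today's slot already passed; use tomorrow's
	}
	return next.Sub(now)
}

func main() {
	auditInterval := 86400 * time.Second // --auditInterval=86400

	time.Sleep(delayUntil(9, 0)) // wait for the first scheduled cycle
	for {
		fmt.Println("running audit at", time.Now().Format(time.RFC3339))
		time.Sleep(auditInterval)
	}
}
```

Each cluster would compute the same 09:00 anchor locally, which keeps runs aligned to within clock skew plus audit duration.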
Great idea, it works for my case and provides even more flexibility than specifying a start flag once.
This depends on how homogeneous the clusters are. With a large size difference, the spread could approach O(hours). This could also be true if a long-running audit gets restarted due to something like a pod eviction.
+1, though I think there may be easier approaches than a flag (this sounds similar to how cronjobs work).
I don't think the scheduling requirement is that out there. I could see users having differing assumptions WRT error recovery, though. One short-term workaround would be to have audit run as a CronJob (of course this would require audit to start immediately). CronJob may even be a fair permanent solution, though that does mean each audit run would have a bootstrapping delay while the new audit pod loads state (constraints, templates, cached data).
Good point on pod eviction, agree with you.
Does that mean it is completely fine to have the audit pod exist as a CronJob (only short-term appearances)? I mean, aren't there any heartbeat-style interactions between the audit and controller-manager pods that could be broken? Scraping metrics from audit looks like it would be broken, at first glance.
Yes, though there is the cost of repeatedly incurring bootstrapping. Also, the infrastructure would need to be re-architected slightly, since there are other operations that are meant to be long-running singletons running on the pod as-configured, namely status/mutation-status (see this doc for more info). The "start immediately" functionality doesn't exist, however. Also, I don't think audit has a "run once and exit" mode, so we'd need to add that too, come to think of it.
Gatekeeper pods do not currently talk directly to each other, as such a design wouldn't scale as cleanly. There is some indirect coordination that happens via the API server.
Correct, trying to scrape metrics from a transient pod would be a poor experience. I'd rely on stdout in that case, or scraping the status of constraints if you're okay with potentially partial results.
I think having audit start immediately after Gatekeeper is ready, instead of starting it after the audit interval passes, would be beneficial for different use cases, and might help with this issue too. How about adding "start immediately" for now? Should there be a flag for it? If yes, what should the default value be -- start immediately, or start after the first interval as before?
IMO we should just switch to starting immediately (immediately being defined as "as soon as all policy is loaded"). IIRC there was some race condition that arose when we tried doing this before. @ritazh Do you remember?
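For reference, a minimal Go sketch of what "start immediately once all policy is loaded" could look like; the policyLoaded channel and runAudit function are stand-ins for illustration, not Gatekeeper's actual audit code.

```go
package main

import (
	"fmt"
	"time"
)

func runAudit() { fmt.Println("audit run at", time.Now().Format(time.RFC3339)) }

func main() {
	auditInterval := 60 * time.Second // stands in for --auditInterval

	policyLoaded := make(chan struct{})
	go func() {
		time.Sleep(2 * time.Second) // stand-in for constraint/template sync
		close(policyLoaded)
	}()

	<-policyLoaded // block until all policy is loaded
	runAudit()     // first audit right away, instead of waiting one full interval

	ticker := time.NewTicker(auditInterval)
	defer ticker.Stop()
	for range ticker.C {
		runAudit()
	}
}
```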
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
this is still an issue
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
this is still an issue
@ikusomr @stek29 we are introducing pubsub for audit in GK 3.13 (WIP docs PR: https://github.com/open-policy-agent/gatekeeper/pull/2808/files), would this address this issue since you don't have to parse and aggregate logs? |
@sozercan thanks for bringing it up, I’ve totally missed this feature! I think this issue is still relevant even with PubSub — there are more reasons to start audit without waiting. One more use case — currently I have to wait for the interval before I can see audit results and metrics after updating Gatekeeper, which makes updates more difficult. However, I believe it would solve #1037.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
this is still an issue. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Describe the solution you'd like
To be able to specify, via a flag for the audit process, the exact time at which it must start the first audit cycle.
Anything else you would like to add:
We mostly rely on the audit part of Gatekeeper's functionality, providing teams across many clusters with audit results. The audit interval is 86400 (once per 24 h). The interval is the same across the clusters, but the time at which we install or update the audit manager across the clusters is not. We consume audit results by parsing and aggregating logs. Since Gatekeeper is installed at different times in different clusters, the time when audit cycles occur is also different (as far as I understand the code, the first audit cycle takes place only after the audit interval passes, which is how Golang's time.NewTicker works). The latter is also inconvenient, because with 86400 we must wait 24 hours for the first results, or install with a shorter interval and reinstall after getting the results.
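A small standalone Go snippet illustrating the time.NewTicker behaviour mentioned above (the first tick arrives only after one full interval has elapsed, so nothing runs at startup); the interval here is shortened for demonstration and does not come from Gatekeeper's source.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 2 * time.Second // stands in for --auditInterval (86400 s in our setup)
	start := time.Now()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	// No tick at t=0; the first value on ticker.C arrives at roughly t=interval.
	first := <-ticker.C
	fmt.Printf("first tick after %v\n", first.Sub(start).Round(time.Second))
}
```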
Having a flag with the exact audit time would give us:
Environment: