at which point do collectors shut themselves off? #9

Open
Dieterbe opened this issue May 15, 2015 · 4 comments

Comments

@Dieterbe
Contributor

I know that if collectors realize that everything they monitor is unreachable, they stop sending errors, on the reasoning that something is wrong with the collector or its connectivity itself.

How long does this take? The alerting config, where you define "x errors for y points in a row", depends on it. Around 30s seems reasonable? There's always the risk of low-frequency checks that run every 60 or 120s. Ideally we would wait to see errors for all of them, but that could be too long for the value those bring. If people monitor everything every 10s, they would have to wait 12 steps (12 × 10s = 120s) to cover at least the "collector-shutdown" interval.
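To make the "y points in a row" arithmetic concrete, here is a minimal sketch. It assumes the alert should only fire after errors have persisted for at least the collector-shutdown delay; the function name `points_to_wait` and its parameters are hypothetical and not part of any existing alerting config.

```python
import math


def points_to_wait(shutdown_delay_s: float, check_interval_s: float) -> int:
    """Consecutive error points needed to cover the shutdown delay.

    Hypothetical helper: a check that runs every `check_interval_s` seconds
    must report errors for this many points in a row before the observed
    error window is at least `shutdown_delay_s` long.
    """
    if shutdown_delay_s < 0 or check_interval_s <= 0:
        raise ValueError("delay must be >= 0 and interval must be > 0")
    # Always require at least one error point, even for slow checks.
    return max(1, math.ceil(shutdown_delay_s / check_interval_s))


if __name__ == "__main__":
    for delay in (30, 120):
        for interval in (10, 60, 120):
            n = points_to_wait(delay, interval)
            print(f"delay={delay}s interval={interval}s -> {n} point(s)")
```

With a 30s delay, a 10s check needs 3 points in a row; covering a 120s window with a 10s check gives the 12 steps mentioned above.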

@woodsaj
Contributor

woodsaj commented May 15, 2015

The collectors don't have anything implemented at this stage to shut themselves down. When we add it, it will probably just use the raindex as a measure, i.e. if x% of the Alexa top 50 are unreachable, then shut down. In this scenario the shutdown delay would be controlled and would not be more than 30 seconds (3 consecutive failures at a 10-second interval). We can even reduce this, as we can check the Alexa sites every second if desired.
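A rough sketch of the rule described here, since nothing is implemented yet: the class name `ShutdownDetector`, the 50% default threshold (the comment only says "x%"), and the list-of-booleans input are all assumptions for illustration.

```python
from collections import deque
from typing import Sequence


class ShutdownDetector:
    """Signals shutdown after N consecutive rounds with too many sites down.

    Assumed defaults: threshold_pct stands in for the unspecified "x%";
    consecutive=3 matches "3 consecutive failures" from the comment.
    """

    def __init__(self, threshold_pct: float = 50.0, consecutive: int = 3):
        if consecutive < 1:
            raise ValueError("consecutive must be >= 1")
        self.threshold_pct = threshold_pct
        # Keep only the outcome of the most recent `consecutive` rounds.
        self._recent = deque(maxlen=consecutive)

    def observe(self, reachable: Sequence[bool]) -> bool:
        """Record one probe round; return True if the collector should stop."""
        if not reachable:
            raise ValueError("need at least one probe result")
        down = sum(1 for ok in reachable if not ok)
        unreachable_pct = 100.0 * down / len(reachable)
        self._recent.append(unreachable_pct >= self.threshold_pct)
        # Shut down only when the window is full and every round failed.
        return len(self._recent) == self._recent.maxlen and all(self._recent)
```

Calling `observe` once per 10-second round gives the 30-second worst case; probing every second with the same `consecutive=3` would shorten it to about 3 seconds.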

@Dieterbe
Contributor Author

OK, so for now @mattttt and I will assume it takes at least 30s for collectors to shut themselves off, so customers should wait at least 30s before alerting. Hopefully this gets implemented before alerting goes live, around Monitorama.

@nopzor1200

So, just to update this ticket with the latest thoughts around raindex, since it hasn't been talked about in a while:

  • still think the concept is cool and has validity - even more so than months ago
  • raindex on ICMP might be the best and most consistent measurement (using the mean value of latency and the overall loss %) as opposed to HTTP.
  • raindex is per collector and its initial use case is a mechanism for collectors to go offline or online (later on it could be used as a "trust" rating for a particular collector's measurements or something more sophisticated)
  • overall raindex for a collector is tbd but is something like a moving average or 90th percentile of latency and loss across the basket of sites.
  • the basket of raindex sites should be 50-100 sites, with as little common infra dependency as possible
  • raindex can initially be used as an alert for ops to investigate a collector and potentially disable it, before we decide to further automate the process.
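
The overall raindex formula is still tbd in the list above, so the following sketch only computes the ingredients it mentions: mean latency and loss % for one round over the basket, plus a nearest-rank percentile and a moving average over several rounds. `ProbeResult` and all function names are hypothetical, and the nearest-rank method is one assumed choice of percentile definition.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class ProbeResult:
    """One ICMP probe to one basket site (hypothetical structure)."""
    site: str
    latency_ms: Optional[float]  # None means the probe was lost


def basket_summary(results: Sequence[ProbeResult]) -> Tuple[Optional[float], float]:
    """Return (mean latency in ms over answered probes, loss %) for one round."""
    if not results:
        raise ValueError("need at least one probe result")
    answered = [r.latency_ms for r in results if r.latency_ms is not None]
    loss_pct = 100.0 * (len(results) - len(answered)) / len(results)
    mean_latency = sum(answered) / len(answered) if answered else None
    return mean_latency, loss_pct


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=90 for the 90th percentile."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need values and 0 < pct <= 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Simple moving average; one output per full window."""
    if window < 1:
        raise ValueError("window must be >= 1")
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]
```

Feeding either the per-round loss % or the mean latency through `moving_average` or `percentile_nearest_rank` would give a per-collector series that ops could alert on before any automatic shutdown is wired in.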

@mattttt
Contributor

mattttt commented Aug 10, 2016

I don't have anything to add.
