Healthcheck and stats for monitoring #181

Open
ThomDietrich opened this issue Jan 25, 2024 · 4 comments

Comments

@ThomDietrich
Contributor

ThomDietrich commented Jan 25, 2024

Hey @djmaze and all,

not sure if you remember me. We did work on some good little improvements some years ago.
Since then, I've been a happy user of your image. One constant problem I've had is the lack of visibility. Backups can be stalled by an issue for months before I eventually notice. This is of course partially my fault, but also the reason why the observability industry is thriving :)

I would like to discuss how this image could provide users with actionable and informative data on the activities of their backup jobs. Specifically,

  • A counter for consecutive errors (to delay warnings and notifications beyond a single hiccup)
  • An indicator for permanent errors (if possible)
  • Timestamp for the last successful sync (for downtime detection and notification)
  • Performance stats (because 📊🤩)
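To make the first and third bullet concrete, here is a minimal sketch of how a consecutive-error counter and a last-success timestamp could be kept in a small state file. It assumes a script like this is called from the POST_COMMANDS_SUCCESS and POST_COMMANDS_FAILURE hooks; the file layout, the function names and the threshold of 3 are made up for illustration and are not part of the image.

```python
import json
import time
from pathlib import Path
from typing import Optional


def record_run(state_file: Path, success: bool, now: Optional[float] = None) -> dict:
    """Update the persisted backup state after one run and return it."""
    now = time.time() if now is None else now
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        # First run ever: no errors counted, no successful backup seen yet.
        state = {"consecutive_errors": 0, "last_success": None}
    if success:
        state["consecutive_errors"] = 0
        state["last_success"] = now
    else:
        state["consecutive_errors"] += 1
    state_file.write_text(json.dumps(state))
    return state


def should_alert(state: dict, threshold: int = 3) -> bool:
    """Only alert after several failures in a row, not on a single hiccup."""
    return state["consecutive_errors"] >= threshold
```

A success resets the counter, so a notification would only fire once the threshold is reached without any successful run in between.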

How does that sound?

In #171 (comment) you mentioned your solution to some of these points: Healthchecks.io. The service looks good but does not (I believe) solve all of the above. Also, everyone tends to use different tools (I'd like to use Grafana); fortunately, most of them cater to the same needs.

Long story short, I propose to

  1. Generate a string of stats after each backup run. This string could be in any of the common formats, like JSON, Prometheus, or ..., or even be translated into several of them
  2. Provide the stats string as an env variable to POST_COMMANDS_SUCCESS etc., e.g. for Healthchecks.io, Telegraf, or Prometheus (push principle)
  3. Provide the stats string via an HTTP endpoint (pub principle)
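As a rough idea of step 1, the same stats could be rendered both as JSON and in the Prometheus text format. This is only a sketch: the `resticker_backup` prefix and the stat names are placeholders I picked, not anything the image provides today.

```python
import json


def to_json(stats: dict) -> str:
    """Render the stats as a compact JSON string with stable key order."""
    return json.dumps(stats, sort_keys=True)


def to_prometheus(stats: dict, prefix: str = "resticker_backup") -> str:
    """Render numeric stats as 'name value' lines in the Prometheus text format."""
    lines = []
    for key in sorted(stats):
        value = stats[key]
        if isinstance(value, bool):
            value = int(value)  # expose flags as 0/1
        if not isinstance(value, (int, float)):
            continue  # the text format only carries numeric samples
        lines.append(f"{prefix}_{key} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical stats for one run; the keys are illustrative only.
example = {
    "consecutive_errors": 0,
    "duration_seconds": 42.5,
    "last_success_timestamp_seconds": 1706140800,
}
print(to_prometheus(example), end="")
```

For step 2, either string could then be exported in an environment variable (name to be decided) before the POST_COMMANDS_* hooks run.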

What do you think? Cheers!

@escoand
Contributor

escoand commented Jan 26, 2024

I was interested in something similar, in my case to get monitored with Home Assistant. So I execute restic --no-lock stats latest --json and restic --no-lock snapshots latest --json every now and then. Not that much information, but a first starting point.

The question is how to expose these values. I use an HTTP REST endpoint to post these two JSON documents.
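Roughly, that setup could look like the sketch below. The two restic commands are the ones mentioned above; the function names and the way both results are combined into one payload are assumptions made for illustration, and the target URL is left to the caller.

```python
import json
import subprocess
import urllib.request

# The two read-only queries from the comment above.
RESTIC_QUERIES = {
    "stats": ["restic", "--no-lock", "stats", "latest", "--json"],
    "snapshots": ["restic", "--no-lock", "snapshots", "latest", "--json"],
}


def run_restic(argv):
    """Run one restic command and return its stdout; raises on a non-zero exit."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def collect(run=run_restic) -> dict:
    """Run both queries and parse their JSON output into one dict."""
    return {name: json.loads(run(argv)) for name, argv in RESTIC_QUERIES.items()}


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST request for the monitoring endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def post(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `post(build_request(url, collect()))` from a scheduled job would push both documents in one go; `collect` takes the runner as a parameter so it can be tried out without a repository at hand.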

@ThomDietrich
Contributor Author

ThomDietrich commented Jan 26, 2024

Hey @escoand, the stats command is certainly a good start, but it does not offer enough detail about individual sync runs and snapshots, especially failures.

Regarding the transmission, I am inclined to say that I would love to see both strategies (push and pub) implemented. This seems rather trivial after all.

The real issue here is retrieving the data from restic. I believe we all agree that human-readable output in the Docker logs is desired, hence we can't just switch the backup command to --json. I think this is the way to go: restic/restic#3274
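A different option, purely as a sketch and not something proposed elsewhere in this thread: run the backup with --json anyway and let a small filter turn the messages back into readable log lines while keeping the final summary for the stats. This assumes the backup command emits one JSON object per line with a `message_type` field ("status", "summary", "error"); please check restic's scripting documentation for the exact messages and fields before relying on it.

```python
import json


def split_stream(lines):
    """Return (human_lines, summary) from line-delimited JSON messages."""
    human, summary = [], None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            msg = json.loads(raw)
        except json.JSONDecodeError:
            human.append(raw)  # pass through anything that is not JSON
            continue
        if not isinstance(msg, dict):
            human.append(raw)
            continue
        kind = msg.get("message_type")
        if kind == "summary":
            summary = msg
            fields = ", ".join(
                f"{key}={value}"
                for key, value in sorted(msg.items())
                if key != "message_type"
            )
            human.append(f"backup finished: {fields}")
        elif kind == "error":
            human.append(f"error: {raw}")
        # "status" progress updates are dropped to keep the log quiet
    return human, summary
```

The readable lines would go to the Docker logs as before, and the summary dict is what a stats string could be built from.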

@spychodelics

I am new to restic and resticker. For now I added a "dirty" restic ..... > filename_date.log 2>filename_error_date.log at the end of the restic backup step in the backup script, so I get a human-readable log file.

Now I am able to send myself the two log files as attachments via apprise/email, but when I try to send the logs as the "Body" of an email via apprise/curl, I can't get it working... so far.

POST_COMMANDS_SUCCESS & POST_COMMANDS_FAILURE & POST_COMMANDS_INCOMPLETE Scripts help...

@djmaze
Owner

djmaze commented Feb 17, 2024

Sorry for answering this late. @ThomDietrich Yeah, I remember :) Concerning your ideas, Healthchecks.io indeed more or less solves all of those for me except the last one (performance stats). But I get that other people want to have different tools and more detailed insights.

Points 1 and 2 could probably be implemented quite easily. Point 3 would mean running an additional HTTP service somehow, which I am not a fan of. It should also be totally optional. (I see that there is also a request for improving restic itself.)

All that said, I am personally not really interested in using this additional functionality (I am not doing advanced performance monitoring for my stuff currently), so I would be open to more detailed proposals or even PRs.
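To give an idea of what "totally optional" could mean for point 3, here is a minimal sketch of an endpoint that only starts when a port is configured. The `STATS_HTTP_PORT` variable, the `/metrics` path and the stats file are hypothetical names for illustration; nothing like this exists in the image.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def make_handler(stats_file: Path):
    """Build a request handler that serves the current content of stats_file."""

    class StatsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics" or not stats_file.exists():
                self.send_error(404)
                return
            body = stats_file.read_bytes()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the container log free of access lines

    return StatsHandler


def maybe_start_server(stats_file: Path, env=os.environ, host: str = "0.0.0.0"):
    """Start the endpoint only if STATS_HTTP_PORT (hypothetical) is set."""
    port = env.get("STATS_HTTP_PORT")
    if not port:
        return None  # feature disabled: no extra service is started
    server = ThreadingHTTPServer((host, int(port)), make_handler(stats_file))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Without the variable nothing is started, so users who do not want an extra service would not notice any change.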


4 participants