WIP: Add more testing + memory stats from cgroups #10
Conversation
Looking into current systemd source code, this value only increases. e.g. https://github.com/systemd/systemd/blob/7c286cd6a615fa9bce8a2830133bcf89becfbf9f/src/core/socket.c#L2416
The main problem is that this does not support test coverage: we are using an external binary as the binary-under-test, and coverage works by compiling a custom binary with coverage-tracking logic when running `go test`. It's not simple to force compilation of those tracking statements, so as long as we use an external binary we cannot have coverage data. A minor issue is that process launch time is relatively slow. Not a problem when running 2-3 tests, but it may become an issue down the road
This is a minimal proof of concept for how to test the exporter while also getting code coverage details. It should also be a bit faster.

Bad things: if something ever goes wrong, cleaning up is much harder because we cannot just kill the external process. So arguably this could be a bad thing, or in the future we could split testing into smoke tests that run in this manner and larger tests that run against an external process. But for now I'm happy with this, I like seeing code coverage reports :-)

While it's tempting to think this magically enables parallelism in the test suite, remember we are sharing the global default HTTP server here. It would not be too hard to change that, and also auto-update the port numbers, but it's overkill for now. I'll just run stuff serially until it becomes an issue
Replaces the 'call an external binary' approach with the new 'call a handler function in main.go' approach. Much faster. This commit also shows a few example test cases just to get the ball rolling
Their focus on testing-inside-docker means they do not have a good testing-in-vm experience. Only one Ubuntu version is supported, and upgrading it on each CI run is both wasteful and a huge pain to get correct (we only want to upgrade golang, but they have a customized version of Ubuntu, so the typical backported-PPA approach does not work out of the box). As a nice side bonus, Travis CI is mostly OSS while CircleCI is not, so we can read the source if we have any issues with Travis.
See golangci/golangci-lint#658 for details. Summary: golangci-lint, built with Go 1.12, will generate this error when linting Go 1.13 code
Hey 👋 let me know when it's ready for review :)
@povilasv Sure, anytime now should be OK. This PR is big already (sorry, seems to be my bad habit), so let's aim to clean it up as needed and get it merged before the next push ;-) I could use any insight you have on "what memory values matter enough to export as metrics"
Also, a second concern - what should the exported metrics be named? node_exporter moved to using the pattern
Will review this week, thanks for this :) Re naming: I am not sure why node_exporter does this, but I think we should follow https://prometheus.io/docs/practices/naming/ and do whatever makes sense for our use case and query patterns. @SuperQ would be great to get some insights from you regarding naming :)
Labels only make sense when there is an aggregation that can be applied, which doesn't fit here. Having separate metrics also helps for more typical use cases like this:
If it were a label, you would have to do something crazy like this:
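The query examples from this comment did not survive the page export. A hypothetical sketch of the contrast, using invented metric and label names (`systemd_unit_memory_rss_bytes` vs. a single metric with a `kind` label), might look like:

```promql
# Separate metrics: the common case is a plain selector
systemd_unit_memory_rss_bytes{name="nginx.service"}

# With a `kind` label instead, every query must filter the label, and a
# careless sum() across kinds silently double-counts parts and totals:
sum without (kind) (systemd_unit_memory_bytes{kind!="total"})
```

This is only an illustration of the trade-off the comment describes, not the author's original queries.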
That's very helpful, thanks. We should export distinct metrics then. @povilasv to answer the second question (what should we export): maybe we just expose the same metrics that node_exporter does, but do it per unit being monitored. That would be more than enough to get anyone started. Thoughts?
Side point - we may want to document that users may want to run more than one systemd_exporter, with different units monitored at different levels, to deal with cardinality issues. I've got an idea on this for the future: we could allow folks to filter by the slice or scope units are in
My opinion on metrics selection: I do not see harm in essentially exporting all of them. If they are constant, they will be well compressed by Prometheus and consume almost no resources. The more interesting question is which metrics to keep as a separate metric/gauge, and which ones to put together with separate labels. I think it is useful to have some metrics grouped into a single gauge with separate labels if it makes sense to 1) visualise all of them at the same time, or 2) sum them to get a total of something. From a quick look, this only applies to a few metrics.
As for the other metrics, I think they should all be separate (because doing aggregation over them doesn't make much sense), but I didn't look too deeply into this, and there might be some exceptions.
Thanks @baryluk. One thing to point out: if we have metrics with parts that can be summed, and we also expose a "total" as another part, Prometheus best practices say to avoid exposing the total. This allows cleaner use of
Sounds good! Following Prometheus best practices is a very good idea to reduce user confusion; in general the document is actually very logical, short, and well written to help avoid common mistakes with naming metrics and labels. However, regarding your comment, I can't find a general guideline in the Prometheus best practices about not exporting
```go
// path appends the given path elements to the filesystem path, adding separators
// as necessary.
func (fs FS) path(p ...string) string {
	return filepath.Join(append([]string{string(fs.mountPoint)}, p...)...)
}
```
this is a bit much; consider just using `filepath.Join(fs.MountPoint, subpath, suffix)` directly in https://github.com/povilasv/systemd_exporter/pull/10/files#diff-966caeb64fc8c233e22788d6e3caccbbR171 and https://github.com/povilasv/systemd_exporter/pull/10/files#diff-966caeb64fc8c233e22788d6e3caccbbR173
cgroup/cgroup.go (outdated)

```go
// if cgroupUnified != unifModeUnknown {
// 	return cgroupUnified, nil
// }
```
Let's get rid of commented-out code
```go
switch fs.Type {
case cgroup2SuperMagic:
	log.Debugf("Found cgroup2 on /sys/fs/cgroup/, full unified hierarchy")
```
As this is a package, consider not logging, or allowing users to choose their own logging pkg; more info: https://dave.cheney.net/2015/11/05/lets-talk-about-logging
```go
}

func testMain(wg *sync.WaitGroup) *http.Server {
```
Consider moving this into a testing package.
```go
// b, err := ioutil.ReadAll(resp.Body)
// if err != nil {
// 	return nil, err
// }
// if err := resp.Body.Close(); err != nil {
// 	return nil, err
// }
// if want, have := http.StatusOK, resp.StatusCode; want != have {
// 	return nil, fmt.Errorf("want /metrics status code %d, have %d. Body:\n%s", want, have, b)
// }
```
Let's remove commented-out code
```go
if found != "service" {
	t.Errorf("Bad unit name parsing. Wanted %s got %s", "service", found)
}
```
Let's get rid of empty space
```go
	break
}
err = c.collectUnitCPUMetrics(*cgroupPath, conn, ch, unit)
if err != nil {
```
Suggested change:

```diff
- if err != nil {
+ if err != nil && parseUnitType(unit) != "socket" {
```
```go
if err != nil {
	// Most sockets do not have a cpu cgroupfs entry, but a few big ones do (notably docker.socket). Quiet down
	// error reporting if error came from a socket
	if parseUnitType(unit) != "socket" {
```
```go
if parseUnitType(unit) != "socket" {
```
@baryluk It's under the labels section of "writing exporters". https://prometheus.io/docs/instrumenting/writing_exporters/#labels
Exports the ability to manually create a new cgroup filesystem struct, as well as the various mount modes. Cleans up documentation
Retitled as WIP while I work through the feedback
@hamiltont thanks for all the amazing work you've done here. do you have plans to resume work on this PR? those cgroup rss metrics are very appealing :)
Thanks! Yes, I actually have some nice commits to push, been busy with my day job. I'll try to get out an update this week
What's the status here? I've been trying this out for a while and it really makes the metrics much more useful. I had to add a small patch to ignore services with the `RemainAfterExit` property (talyz@284e372) in order for it to not spam my logs, but that's the only issue I've had with it so far.
Well, unfortunately real life jumped up and took hold, but I am thrilled to hear that some folks found it useful.

At this point, if anyone has time to wrap this work up (as I recall that's mainly responding to the feedback that was given), the help would be super appreciated.

I'll see if I can push all the other improvements that I had to a distinct branch in case anyone is interested. As I recall they were definitely not ready (I was/am learning golang in the evenings, so while I can see the concepts needed to make a system better, it takes me quite a while to translate those into useful go code)
I see :) Yes, it's really useful - great work! Unfortunately, I don't think I can be of much help - I only know golang at a very basic level. I hope someone can, though.
I filed bug #46 about how memory usage is reported incorrectly by this exporter. Would this PR fix that problem?
After a very long time, we've moved this to prometheus-community. If you're still interested in working on this, please rebase this change against the latest commits.
I tried to "rebase this change against the latest commits", but there were a lot of merge conflicts. The result as it is (squashed, because there were simply way too many conflicts in the commit-by-commit rebase) is in #66.
PR adds:
- `memory.stat` from cgroupfs

Remaining TODOs

Fixes #2