Metrics: merge different metrics into series + labels #3309

Closed
Issif opened this issue Sep 3, 2024 · 11 comments · Fixed by #3319

@Issif
Member

Issif commented Sep 3, 2024

Describe the bug

I started working on the Grafana dashboard for the Falco metrics and saw something we should change, if possible before the new release.

This concerns the drops_enter/exit metrics, but I think others are affected too (such as the _scap_ ones, though I'm not sure whether they measure the same things in different contexts).

For the drops we have this:

falcosecurity_falco_n_drops_buffer_clone_fork_enter_total{raw_name="n_drops_buffer_clone_fork_enter"} 0
falcosecurity_falco_n_drops_buffer_clone_fork_exit_total{raw_name="n_drops_buffer_clone_fork_exit"} 0
falcosecurity_falco_n_drops_buffer_execve_enter_total{raw_name="n_drops_buffer_execve_enter"} 0
falcosecurity_falco_n_drops_buffer_execve_exit_total{raw_name="n_drops_buffer_execve_exit"} 0
falcosecurity_falco_n_drops_buffer_connect_enter_total{raw_name="n_drops_buffer_connect_enter"} 0
falcosecurity_falco_n_drops_buffer_connect_exit_total{raw_name="n_drops_buffer_connect_exit"} 0
falcosecurity_falco_n_drops_buffer_open_enter_total{raw_name="n_drops_buffer_open_enter"} 0
falcosecurity_falco_n_drops_buffer_open_exit_total{raw_name="n_drops_buffer_open_exit"} 0
falcosecurity_falco_n_drops_buffer_dir_file_enter_total{raw_name="n_drops_buffer_dir_file_enter"} 0
falcosecurity_falco_n_drops_buffer_dir_file_exit_total{raw_name="n_drops_buffer_dir_file_exit"} 0

It should be something like this:

falcosecurity_falco_n_drops_enter_total{drop="clone_fork"} 0
falcosecurity_falco_n_drops_enter_total{drop="buffer_execve"} 0
falcosecurity_falco_n_drops_enter_total{drop="buffer_connect"} 0
falcosecurity_falco_n_drops_enter_total{drop="buffer_open"} 0
falcosecurity_falco_n_drops_enter_total{drop="buffer_dir_file"} 0

falcosecurity_falco_n_drops_exit_total{drop="clone_fork"} 0
falcosecurity_falco_n_drops_exit_total{drop="buffer_execve"} 0
falcosecurity_falco_n_drops_exit_total{drop="buffer_connect"} 0
falcosecurity_falco_n_drops_exit_total{drop="buffer_open"} 0
falcosecurity_falco_n_drops_exit_total{drop="buffer_dir_file"} 0

This way, we can run PromQL queries like this one:

sum by(drop) (irate(falcosecurity_falco_n_drops_enter_total{}[$__interval]))

and get all drops in a single graph (a pie chart, for example), without running one query per metric.
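To make the benefit concrete, here is a minimal Python sketch of what grouping by the `drop` label does once all series share one metric name. It is only an illustration: the helper names (`parse_samples`, `sum_by`) and the sample values are hypothetical, the parser handles only simple labelled lines, and it sums raw counter values rather than computing `irate` as the PromQL query does.

```python
import re
from collections import defaultdict

# One labelled sample line: name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # this sketch ignores unlabelled samples
        labels = dict(LABEL_RE.findall(match.group("labels")))
        yield match.group("name"), labels, float(match.group("value"))


def sum_by(samples, metric, label):
    """Sum the values of one metric, grouped by a single label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and label in labels:
            totals[labels[label]] += value
    return dict(totals)


# Illustrative values only, not real Falco output.
EXAMPLE = """
falcosecurity_falco_n_drops_enter_total{drop="buffer_execve"} 3
falcosecurity_falco_n_drops_enter_total{drop="buffer_open"} 5
falcosecurity_falco_n_drops_exit_total{drop="buffer_open"} 7
"""

totals = sum_by(
    parse_samples(EXAMPLE), "falcosecurity_falco_n_drops_enter_total", "drop"
)
print(totals)
```

With one metric name per drop type, as in the current output, the same grouping would need one lookup per metric name instead of one pass over a single series family.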

How to reproduce it

Call the /metrics endpoint to get the available series.

Expected behaviour

Screenshots

Environment

  • Falco version:
  • System info: 0.38.2
  • Cloud provider or hardware configuration:
  • OS: any
  • Kernel: any
  • Installation method: any

Additional context

@Issif Issif added the kind/bug label Sep 3, 2024
@FedeDP
Contributor

FedeDP commented Sep 3, 2024

Hey, I agree with the proposed changes; they follow the same spirit as #3272!

/cc @incertum do you agree? And, if yes, are you willing to take on the duty once again? 🥇
/milestone 0.39.0

@poiana poiana added this to the 0.39.0 milestone Sep 3, 2024
@incertum
Contributor

incertum commented Sep 3, 2024

/assign

Agree. Yes, we can have an updated general approach for Prometheus metrics: if any "metric" splits out into 5+ sub-metrics or similar, we follow the #3272 approach; otherwise we keep metrics in 1:1 sync with the JSON rule output.

It would also apply to the new per-CPU counters Andrea added.

@sgaist would you have additional thoughts?

@Issif we previously concluded that it would be fine to keep the memory related metrics as separate metrics. Do we want to stick to that or is there a desire to re-discuss and change that as well given we now introduce more breaking changes?

@leogr updated info: This would mean a follow up breaking change for Prometheus metrics, but not the rule output metrics.

@Issif
Member Author

Issif commented Sep 3, 2024

For the memory metrics it's OK to keep them as they are; we're talking about 3 metrics, not dozens like for the drops or the rule counters.

@FedeDP
Contributor

FedeDP commented Sep 3, 2024

This would mean a follow up breaking change for Prometheus metrics, but not the rule output metrics.

IMHO that's an incubating feature, so it is somewhat expected to change rapidly toward the desired final design; then we'll promote it to stable.
I think we should leave it as incubating for Falco 0.39.0 and make this new breaking change right now, then in Falco 0.40 we can promote it. WDYT?

Btw thanks for tackling this once again, it's been a pleasure to work with you and @sgaist on this :)

@incertum
Contributor

incertum commented Sep 3, 2024

For the memory metrics it's OK to keep them as they are; we're talking about 3 metrics, not dozens like for the drops or the rule counters.

Agree

I think we should leave it as incubating for Falco 0.39.0 and make this new breaking change right now, then in Falco 0.40 we can promote it. WDYT?

Also agreed, @FedeDP. Just in case we need another round of tweaks 🙃 Fingers crossed the metrics framework stabilizes soon.

@leogr
Member

leogr commented Sep 6, 2024

Update:

As discussed in yesterday's maintainer meeting, the enter/exit directions should be reported as a label. For example:

falcosecurity_falco_n_drops_total{drop="clone_fork", dir="enter"} 0
falcosecurity_falco_n_drops_total{drop="clone_fork", dir="exit"} 0

Note that we must also review all other metrics to ensure they are consistent with this pattern.

Final note: introducing a breaking change at this stage shouldn't be an issue; the metrics feature will remain at the "incubating" maturity level for 0.39.
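As a rough illustration of the proposed pattern, the Python sketch below rewrites one of the old per-metric names from the issue description into a single series name with `drop` and `dir` labels. This is not how Falco implements it; the function name `to_labelled` and the regular expression are hypothetical and only cover the `n_drops_buffer_*_enter/exit` names listed above.

```python
import re

# Old naming scheme: ..._n_drops_buffer_<drop>_<enter|exit>_total
OLD_NAME_RE = re.compile(
    r"^falcosecurity_falco_n_drops_buffer_(?P<drop>.+)_(?P<dir>enter|exit)_total$"
)


def to_labelled(old_name, new_name="falcosecurity_falco_n_drops_total"):
    """Return the labelled series for an old metric name, or None if it does not match."""
    match = OLD_NAME_RE.match(old_name)
    if match is None:
        return None
    drop = match.group("drop")
    direction = match.group("dir")
    return '%s{drop="%s", dir="%s"}' % (new_name, drop, direction)


converted = to_labelled("falcosecurity_falco_n_drops_buffer_clone_fork_enter_total")
print(converted)
```

A dashboard migration could use a mapping like this to find which old queries need rewriting, assuming the final metric name matches whatever the PR settles on.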

@incertum
Contributor

PR is up, please help check the initial test data, ty!

@leogr perhaps falcosecurity_falco_n_drops_buffer_total is better, so that we preserve falcosecurity_falco_n_drops_total; as you can see in the test data, we now also have falcosecurity_falco_n_drops_cpu_total, etc.

# HELP falcosecurity_falco_n_evts_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_evts_cpu_total counter
falcosecurity_falco_n_evts_cpu_total{cpu="2"} 873
# HELP falcosecurity_falco_n_drops_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_cpu_total counter
falcosecurity_falco_n_drops_cpu_total{cpu="2"} 0
# HELP falcosecurity_falco_n_evts_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_evts_cpu_total counter
falcosecurity_falco_n_evts_cpu_total{cpu="3"} 0
# HELP falcosecurity_falco_n_drops_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_cpu_total counter
falcosecurity_falco_n_drops_cpu_total{cpu="3"} 0
# HELP falcosecurity_falco_n_evts_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_evts_cpu_total counter
falcosecurity_falco_n_evts_cpu_total{cpu="4"} 2651
# HELP falcosecurity_falco_n_drops_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_cpu_total counter
falcosecurity_falco_n_drops_cpu_total{cpu="4"} 0
# HELP falcosecurity_falco_n_evts_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_evts_cpu_total counter
falcosecurity_falco_n_evts_cpu_total{cpu="5"} 2
# HELP falcosecurity_falco_n_drops_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_cpu_total counter
falcosecurity_falco_n_drops_cpu_total{cpu="5"} 0
# HELP falcosecurity_falco_n_evts_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_evts_cpu_total counter
falcosecurity_falco_n_evts_cpu_total{cpu="6"} 2076
# HELP falcosecurity_falco_n_drops_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_cpu_total counter
falcosecurity_falco_n_drops_cpu_total{cpu="6"} 0
# HELP falcosecurity_falco_n_evts_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_evts_cpu_total counter
falcosecurity_falco_n_evts_cpu_total{cpu="7"} 237
# HELP falcosecurity_falco_n_drops_cpu_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_cpu_total counter
falcosecurity_falco_n_drops_cpu_total{cpu="7"} 0



# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="enter",drop="clone_fork"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="exit",drop="clone_fork"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="enter",drop="execve"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="exit",drop="execve"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="enter",drop="connect"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="exit",drop="connect"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="enter",drop="open"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="exit",drop="open"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="enter",drop="dir_file"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="exit",drop="dir_file"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="enter",drop="other_interest"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="exit",drop="other_interest"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="exit",drop="close"} 0
# HELP falcosecurity_falco_n_drops_buffer_total https://falco.org/docs/metrics/
# TYPE falcosecurity_falco_n_drops_buffer_total counter
falcosecurity_falco_n_drops_buffer_total{dir="exit",drop="proc"} 0
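For anyone checking the test data, the per-CPU samples above can be collapsed across the `cpu` label, which is roughly what a PromQL `sum(...)` over the metric would return. The Python sketch below copies only the sample lines from the output above (HELP/TYPE lines omitted); the helper name `sum_over_cpus` is hypothetical and the regular expression only handles series with a single `cpu` label.

```python
import re

# name{cpu="N"} value
CPU_SAMPLE_RE = re.compile(r'^(\w+)\{cpu="(\d+)"\}\s+(\d+)$')

# Sample lines copied from the test data in this comment.
TEST_DATA = """\
falcosecurity_falco_n_evts_cpu_total{cpu="2"} 873
falcosecurity_falco_n_drops_cpu_total{cpu="2"} 0
falcosecurity_falco_n_evts_cpu_total{cpu="3"} 0
falcosecurity_falco_n_drops_cpu_total{cpu="3"} 0
falcosecurity_falco_n_evts_cpu_total{cpu="4"} 2651
falcosecurity_falco_n_drops_cpu_total{cpu="4"} 0
falcosecurity_falco_n_evts_cpu_total{cpu="5"} 2
falcosecurity_falco_n_drops_cpu_total{cpu="5"} 0
falcosecurity_falco_n_evts_cpu_total{cpu="6"} 2076
falcosecurity_falco_n_drops_cpu_total{cpu="6"} 0
falcosecurity_falco_n_evts_cpu_total{cpu="7"} 237
falcosecurity_falco_n_drops_cpu_total{cpu="7"} 0
"""


def sum_over_cpus(text):
    """Sum each metric's value across all cpu label values."""
    totals = {}
    for line in text.splitlines():
        match = CPU_SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # skips comments and anything without a cpu label
        name, _cpu, value = match.groups()
        totals[name] = totals.get(name, 0) + int(value)
    return totals


cpu_totals = sum_over_cpus(TEST_DATA)
print(cpu_totals)
```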

@leogr
Member

leogr commented Sep 11, 2024

cc @Issif PTAL 🙏

@Issif
Member Author

Issif commented Sep 11, 2024

Seems good to me 👍

@incertum
Contributor

Another fix / breaking change is that I removed the doubled "falco" here:
falcosecurity_falco_falco_sha256_rules_files_info -> falcosecurity_falco_sha256_rules_files_info

All Prometheus kernel counters had the wrong subsystem: it is "scap", not "falco", so all of these metric names change as well.

Also, the raw_name label is now removed, which is not necessarily a breaking change.

@leogr maybe an easy message for Falco 0.39.0 could be: Prometheus metric names should be considered broken compared to previous releases.

On that note, do we need or want any additional name changes, @Issif?

After we merge this I can stage the website update PR.

@incertum
Contributor

See also #3324
Changing rules_counters -> rules_matches for the Prometheus metrics.
