[C++] Add filesystem stats #38465

Open
pitrou opened this issue Oct 25, 2023 · 5 comments

@pitrou
Member

pitrou commented Oct 25, 2023

Describe the enhancement requested

It would be nice to have per-filesystem stats to better analyze how filesystems are used by libraries (such as the Parquet reader).

Potentially useful stats:

  • total number of IO requests issued (one IO request could be one POSIX API call for local filesystems, or one network request for S3, etc.)
  • total number of IO reads issued
  • total number of IO writes issued
  • average size of IO reads
  • average size of IO writes
  • approximate quantiles of IO read sizes
  • approximate quantiles of IO write sizes
  • average duration of IO reads
  • average duration of IO writes
  • approximate quantiles of IO read durations
  • approximate quantiles of IO write durations

(perhaps some of those would be too costly to record, we'll see)
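As a rough illustration of the cheaper stats in the list above (request counts and averages), here is a minimal sketch using relaxed atomic counters. The `IoStats` class and its method names are hypothetical and not part of any existing Arrow API; it assumes the filesystem implementation calls `RecordRead` / `RecordWrite` once per IO request, and `RecordOther` for requests that are neither (such as metadata calls).

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-filesystem counters (illustrative only, not an Arrow API).
// Relaxed atomics keep the recording cost low; the getters give an
// approximate snapshot if other threads are recording concurrently.
class IoStats {
 public:
  void RecordRead(std::uint64_t bytes, std::uint64_t nanos) {
    num_reads_.fetch_add(1, std::memory_order_relaxed);
    read_bytes_.fetch_add(bytes, std::memory_order_relaxed);
    read_nanos_.fetch_add(nanos, std::memory_order_relaxed);
  }

  void RecordWrite(std::uint64_t bytes, std::uint64_t nanos) {
    num_writes_.fetch_add(1, std::memory_order_relaxed);
    write_bytes_.fetch_add(bytes, std::memory_order_relaxed);
    write_nanos_.fetch_add(nanos, std::memory_order_relaxed);
  }

  // Requests that are neither reads nor writes (e.g. metadata lookups).
  void RecordOther() { num_other_.fetch_add(1, std::memory_order_relaxed); }

  std::uint64_t num_reads() const { return num_reads_.load(std::memory_order_relaxed); }
  std::uint64_t num_writes() const { return num_writes_.load(std::memory_order_relaxed); }
  std::uint64_t num_requests() const {
    return num_reads() + num_writes() + num_other_.load(std::memory_order_relaxed);
  }

  double avg_read_size() const { return Ratio(read_bytes_, num_reads_); }
  double avg_write_size() const { return Ratio(write_bytes_, num_writes_); }
  double avg_read_nanos() const { return Ratio(read_nanos_, num_reads_); }
  double avg_write_nanos() const { return Ratio(write_nanos_, num_writes_); }

 private:
  // Returns total / count, or 0 when nothing has been recorded yet.
  static double Ratio(const std::atomic<std::uint64_t>& total,
                      const std::atomic<std::uint64_t>& count) {
    const std::uint64_t n = count.load(std::memory_order_relaxed);
    if (n == 0) return 0.0;
    return static_cast<double>(total.load(std::memory_order_relaxed)) /
           static_cast<double>(n);
  }

  std::atomic<std::uint64_t> num_reads_{0}, num_writes_{0}, num_other_{0};
  std::atomic<std::uint64_t> read_bytes_{0}, write_bytes_{0};
  std::atomic<std::uint64_t> read_nanos_{0}, write_nanos_{0};
};
```

Counters and averages like these only need a handful of atomic increments per request; the quantile stats are the ones more likely to carry a noticeable recording cost.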

Component(s)

C++

@pitrou
Member Author

pitrou commented Oct 25, 2023

@mrocklin @lidavidm Any ideas and opinions here?

@pitrou
Member Author

pitrou commented Oct 25, 2023

Note: we have a T-Digest implementation for approximate quantiles, though https://github.com/HdrHistogram/HdrHistogram_c could be another efficient solution.
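To give a feel for what approximate quantiles involve, here is a deliberately simple sketch based on power-of-two buckets. This is neither T-Digest nor HdrHistogram, just a much cruder stand-in technique; the `Log2Histogram` class is hypothetical, it is not thread-safe, and its answers are only accurate to within a factor of two.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical power-of-two histogram (illustrative only).
// Bucket 0 holds the value 0; bucket i >= 1 holds [2^(i-1), 2^i - 1].
class Log2Histogram {
 public:
  void Record(std::uint64_t value) {
    ++buckets_[BucketIndex(value)];
    ++count_;
  }

  // Returns the upper bound of the bucket containing the q-quantile,
  // for 0 < q <= 1. Returns 0 if nothing has been recorded.
  std::uint64_t QuantileUpperBound(double q) const {
    if (count_ == 0) return 0;
    // Rank of the requested quantile among the recorded values (1-based).
    auto rank = static_cast<std::uint64_t>(
        std::ceil(q * static_cast<double>(count_)));
    if (rank < 1) rank = 1;
    std::uint64_t seen = 0;
    for (std::size_t i = 0; i < buckets_.size(); ++i) {
      seen += buckets_[i];
      if (seen >= rank) return UpperBound(i);
    }
    return UpperBound(buckets_.size() - 1);
  }

 private:
  // Number of significant bits in the value (0 for the value 0).
  static std::size_t BucketIndex(std::uint64_t v) {
    std::size_t i = 0;
    while (v != 0) {
      v >>= 1;
      ++i;
    }
    return i;
  }

  static std::uint64_t UpperBound(std::size_t i) {
    if (i == 0) return 0;
    if (i >= 64) return UINT64_MAX;
    return (std::uint64_t{1} << i) - 1;
  }

  std::array<std::uint64_t, 65> buckets_{};
  std::uint64_t count_ = 0;
};
```

One such histogram per metric (read sizes, write sizes, read durations, write durations) would cover the quantile items in the list, at the price of coarse resolution; a T-Digest or HdrHistogram would give much tighter estimates.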

@mapleFU
Member

mapleFU commented Oct 25, 2023

Awesome. RocksDB also has a simple histogram for this (though it only covers disk I/O).

@pitrou pitrou self-assigned this Oct 25, 2023
@mrocklin

These things seem good to know. At a high level, the question I want answered is whether I'm operating efficiently. Sub-questions include:

  • What is my overall bandwidth over time? (including overlapping / parallel reads)
  • If it's low, what's going on?
    • What is my bandwidth per read?
    • If it's low, the next thing I'll ask is about the distribution of sizes
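The two bandwidth figures asked about above can be derived from per-read timing records. The sketch below is one possible reading of those questions, under these assumptions: each read is logged with a start time, an end time, and a byte count (the `ReadRecord` struct and both function names are hypothetical), and "overall bandwidth" means total bytes divided by the wall-clock span from the first start to the last end, so overlapping reads add up rather than being averaged.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record of one completed read (times in seconds).
struct ReadRecord {
  double start_s;
  double end_s;
  std::uint64_t bytes;
};

// Total bytes divided by the wall-clock span covering all reads.
// Parallel reads raise this figure because their bytes share the same span.
double OverallBandwidth(const std::vector<ReadRecord>& reads) {
  if (reads.empty()) return 0.0;
  double first_start = reads.front().start_s;
  double last_end = reads.front().end_s;
  double total_bytes = 0.0;
  for (const ReadRecord& r : reads) {
    first_start = std::min(first_start, r.start_s);
    last_end = std::max(last_end, r.end_s);
    total_bytes += static_cast<double>(r.bytes);
  }
  const double elapsed = last_end - first_start;
  return elapsed > 0.0 ? total_bytes / elapsed : 0.0;
}

// Mean of bytes / duration over individual reads, skipping reads whose
// recorded duration is not positive.
double MeanPerReadBandwidth(const std::vector<ReadRecord>& reads) {
  double sum = 0.0;
  std::size_t n = 0;
  for (const ReadRecord& r : reads) {
    const double duration = r.end_s - r.start_s;
    if (duration <= 0.0) continue;
    sum += static_cast<double>(r.bytes) / duration;
    ++n;
  }
  return n == 0 ? 0.0 : sum / static_cast<double>(n);
}
```

With two overlapping reads, the overall figure can exceed the per-read mean, which is the signal that parallelism is helping; a low per-read figure would then point toward looking at the size distribution.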

@wjones127
Member

FWIW in Lance what we've found most helpful in understanding filesystem / object store use is traces rather than metrics. We use the Rust tracing library and Perfetto to visualize. Here's an example investigation: lancedb/lance#1352 (comment)

I think in Arrow C++, you could integrate OpenTelemetry tracing in the filesystems. @amoeba did some similar work integrating it with Flight C++.

That being said, collecting these kinds of metrics would provide useful additional information in benchmark comparisons. It would be nice, for example, for a benchmark to show that a certain optimization reduced the number of IO requests.
