Log fsspec caching statistics #82

Open
chuckwondo opened this issue Jul 3, 2024 · 0 comments

The implementation of #77 added a version of fsspec that captures read cache statistics: file size, total bytes requested, hit count, and miss count.

The current s3fs/fsspec cache parameters were chosen only to maximize speed. Further investigation showed that the "all" cache type in fsspec behaves effectively the same as downloading a file: it reads the entire file in a single request and caches it in memory (rather than writing it to local disk). While this makes the overall algorithm slightly faster than downloading (~10% faster on average), it transfers the same volume of data.

We would also like to run profiling to determine which combination of cache type and block size transfers the least data (i.e., minimizes the total number of bytes requested). Now that fsspec's read cache implementations capture this statistic, we can record the stats for each granule file.
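As a rough sketch of the comparison we have in mind, the snippet below sums the requested bytes per (cache type, block size) combination and picks the smallest total. It assumes one JSON record per line with the field names proposed in the acceptance criteria of this issue; the function name `least_transfer` is hypothetical and the sample records are made up purely for illustration, not measured results.

```python
import json
from collections import defaultdict


def least_transfer(lines):
    """Return ((cache_type, blocksize_bytes), total_requested_bytes) for the
    combination with the smallest total number of requested bytes."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stats file
        rec = json.loads(line)
        key = (rec["default_cache_type"], rec["blocksize_bytes"])
        totals[key] += rec["requested_bytes"]
    return min(totals.items(), key=lambda kv: kv[1])


# Illustrative records only (not real measurements).
sample = [
    '{"default_cache_type": "all", "blocksize_bytes": 5242880, "requested_bytes": 1000}',
    '{"default_cache_type": "all", "blocksize_bytes": 5242880, "requested_bytes": 2000}',
    '{"default_cache_type": "blockcache", "blocksize_bytes": 1048576, "requested_bytes": 200}',
    '{"default_cache_type": "blockcache", "blocksize_bytes": 1048576, "requested_bytes": 100}',
    '{"default_cache_type": "bytes", "blocksize_bytes": 4194304, "requested_bytes": 700}',
]

best = least_transfer(sample)
print(best)
```

In practice the lines would come from the stats file rather than an in-memory list, and the same aggregation could be expressed as a query in duckdb.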

To make the cache stats easy to analyze, we want to write them to a separate log file so that we do not have to separate them from the rest of the log messages. Further, we want to write the stats in JSON format so that no custom parsing logic is needed to read the stats log file. We should consider using https://colin-b.github.io/logging_json/ to simplify the implementation.
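For reference, here is a minimal sketch of the separate-file idea using only the standard library `logging` module, as a fallback if we decide not to pull in logging_json (whose API is not shown here). The logger name `fsspec_stats`, the helper `make_stats_logger`, and the file `granule.h5` are hypothetical names chosen for illustration.

```python
import json
import logging
import os
import tempfile


class JsonLineFormatter(logging.Formatter):
    """Format each record as one JSON object per line.

    Assumes the log message passed to the logger is a dict of stats.
    """

    def format(self, record):
        return json.dumps(record.msg, sort_keys=True)


def make_stats_logger(path):
    """Create a logger that writes only to `path`, one JSON record per line."""
    logger = logging.getLogger("fsspec_stats")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep stats out of the main log output
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.FileHandler(path, mode="a", encoding="utf-8")
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger, handler


# Demo: write one record to a temporary directory and read it back.
stats_path = os.path.join(tempfile.mkdtemp(), "fsspec-stats.json")
stats_logger, handler = make_stats_logger(stats_path)
stats_logger.info({"path": "granule.h5", "hits": 3, "misses": 1})
handler.close()

with open(stats_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records)
```

Setting `propagate = False` is what keeps these records from also showing up in the regular log, which is the separation we are after.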

Acceptance Criteria

  1. Upon successful subsetting of each granule file, the fsspec cache stats are written to a separate log file, in JSON format, one JSON record per line (which should allow us to later easily use duckdb for analysis), perhaps using https://colin-b.github.io/logging_json/ to do so. The file should land in the same directory as the other outputs already produced, and should be named fsspec-stats.json (or similar).
  2. Each record should include:
    • file size in bytes ("filesize_bytes")
    • total number of bytes requested ("requested_bytes")
    • number of cache hits ("hits")
    • number of cache misses ("misses")
    • block size ("blocksize_bytes")
    • number of blocks ("blocks")
    • file path ("path")
    • file system type ("filesystem", likely easiest to simply use type(fs).__name__ to write the class name of the filesystem)
    • number of seconds to subset the individual file ("subset_seconds", the time to run the subset_hdf5 function)
    • each element of fsspec_kwargs as a separate entry (e.g., separate entries for "default_cache_type", etc., excluding "default_block_size", which is captured in "blocksize_bytes" above)
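
A minimal sketch of how one such record might be assembled is shown below. It assumes the cache object exposes attributes named `size`, `total_requested_bytes`, `hit_count`, `miss_count`, `blocksize`, and `nblocks`; those names should be verified against the fsspec version we have installed. The helper `build_stats_record`, the `FakeFileSystem` class, and the path `s3://bucket/granule.h5` are hypothetical stand-ins so that the example runs without fsspec.

```python
import time
from types import SimpleNamespace


def build_stats_record(cache, fs, path, fsspec_kwargs, subset_seconds):
    """Build one stats record with the keys listed in the acceptance criteria.

    The attribute names read from `cache` are assumptions (see lead-in).
    """
    record = {
        "filesize_bytes": cache.size,
        "requested_bytes": cache.total_requested_bytes,
        "hits": cache.hit_count,
        "misses": cache.miss_count,
        "blocksize_bytes": cache.blocksize,
        "blocks": cache.nblocks,
        "path": path,
        "filesystem": type(fs).__name__,
        "subset_seconds": subset_seconds,
    }
    # Flatten fsspec_kwargs into separate entries, skipping default_block_size,
    # which is already captured as blocksize_bytes.
    record.update(
        {k: v for k, v in fsspec_kwargs.items() if k != "default_block_size"}
    )
    return record


class FakeFileSystem:
    """Stand-in for a real filesystem class such as the one s3fs provides."""


# Stand-in for a cache object; the numbers are illustrative only.
cache = SimpleNamespace(
    size=1000,
    total_requested_bytes=400,
    hit_count=7,
    miss_count=2,
    blocksize=200,
    nblocks=5,
)

start = time.perf_counter()
# The call to subset_hdf5 would go here.
elapsed = time.perf_counter() - start

record = build_stats_record(
    cache,
    FakeFileSystem(),
    "s3://bucket/granule.h5",
    {"default_cache_type": "first", "default_block_size": 200},
    elapsed,
)
print(record)
```

The resulting dict can be passed directly to whatever JSON logger we end up using, so each granule produces exactly one line in the stats file.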