Stats query framework #2210

phoebusm · 2025-02-27T16:34:25Z

Reference Issues/PRs

https://man312219.monday.com/boards/7852509418/pulses/8297768017

What does this implement or fix?

Add basic C++ framework to stat query
Add stat to list_symbols on s3 to demostrate the framework

Any other comments?

For python layer, the change is minimum, only serve the purpose of outputing something that can be verified in the test
The missing functions will be handled in later PRs.

Design

It is designed that one row will only have one piece of non-groupable data in it
Columns that being grouped are ["arcticdb_call", "stage", "key_type", "storage_op", "parallelized"]


1. Original data output by C++, just a simple vector of vector of string, in the form of [key, value, key, value, ...]:
[{'arcticdb_call': 'list_streams', 'count': '1', 'key_type': 'l', 'stage': 'list', 'storage_op': 'ListObjectsV2'}, 
{'arcticdb_call': 'list_streams', 'key_type': 'l', 'stage': 'list', 'storage_op': 'ListObjectsV2', 'time': '47'}, 
{'arcticdb_call': 'list_streams', 'count': '1', 'key_type': 'r', 'stage': 'list', 'storage_op': 'ListObjectsV2'}, 
{'arcticdb_call': 'list_streams', 'key_type': 'r', 'stage': 'list', 'storage_op': 'ListObjectsV2', 'time': '43'}, 
{'arcticdb_call': 'list_streams', 'count': '0', 'key_type': 'l', 'stage': 'list', 'storage_op': 'ListObjectsV2'}, 
{'arcticdb_call': 'list_streams', 'key_type': 'l', 'stage': 'list', 'storage_op': 'ListObjectsV2', 'time': '24'}, 
{'arcticdb_call': 'list_streams', 'stage': 'list', 'time': '630'}, {'arcticdb_call': 'list_streams', 'time': '630'}]


2.  Load it to pandas:
  arcticdb_call count key_type stage     storage_op time
0  list_streams     1        l  list  ListObjectsV2  NaN
1  list_streams   NaN        l  list  ListObjectsV2   47
2  list_streams     1        r  list  ListObjectsV2  NaN
3  list_streams   NaN        r  list  ListObjectsV2   43
4  list_streams     0        l  list  ListObjectsV2  NaN
5  list_streams   NaN        l  list  ListObjectsV2   24
6  list_streams   NaN      NaN  list            NaN  630
7  list_streams   NaN      NaN   NaN            NaN  630


3. Add parallelized columns and change the placeholder to _NA_
  arcticdb_call  count key_type   stage     storage_op time parallelized
0  list_streams    1.0        l    list  ListObjectsV2  NaN        False
1  list_streams    NaN        l    list  ListObjectsV2   47        False
2  list_streams    1.0        r    list  ListObjectsV2  NaN        False
3  list_streams    NaN        r    list  ListObjectsV2   43        False
4  list_streams    0.0        l    list  ListObjectsV2  NaN        False
5  list_streams    NaN        l    list  ListObjectsV2   24        False
6  list_streams    NaN   __NA__    list         __NA__  630       __NA__
7  list_streams    NaN   __NA__  __NA__         __NA__  630       __NA__


4. Use pandas::group_by()  to group by rows by specifying columns to group by are ["arcticdb_call", "stage", "key_type", "storage_op", "parallelized"] to merge rows in the same group into 1 row

name: ('list_streams', '__NA__', '__NA__', '__NA__', '__NA__') # Overall time for the entire API call
group:   arcticdb_call  count key_type   stage storage_op time parallelized
7  list_streams    NaN   __NA__  __NA__     __NA__  630       __NA__

name: ('list_streams', 'list', '__NA__', '__NA__', '__NA__') # Overall time used in this stage; There is only one stage here
group:   arcticdb_call  count key_type stage storage_op time parallelized
6  list_streams    NaN   __NA__  list     __NA__  630       __NA__

name: ('list_streams', 'list', 'l', 'ListObjectsV2', 'False') # S3 SDK API calls for listing symbol list keys. Non-grouped columns are count and time here. One s3 API call will always log one entry that has count (Row 0 and 4) and one entry that has time (Row 1 and 5). It is designed that one row will only have one piece of non-group data in it. So a simple sum of non-groupable data will do the job to merge all the rows into one
group:   arcticdb_call  count key_type stage     storage_op time parallelized
0  list_streams    1.0        l  list  ListObjectsV2  NaN        False
1  list_streams    NaN        l  list  ListObjectsV2   47        False
4  list_streams    0.0        l  list  ListObjectsV2  NaN        False
5  list_streams    NaN        l  list  ListObjectsV2   24        False

name: ('list_streams', 'list', 'r', 'ListObjectsV2', 'False') # S3 SDK API calls for listing vref keys. Similar merge as above. Row 2 has non-groupable data count only. Row 3 has non-groupable data time only
group:   arcticdb_call  count key_type stage     storage_op time parallelized
2  list_streams    1.0        r  list  ListObjectsV2  NaN        False
3  list_streams    NaN        r  list  ListObjectsV2   43        False


5. After merging the rows, add time bucket as well:
  arcticdb_call stage key_type     storage_op parallelized  time_count_630  count  time_count_40  time_count_20
0  list_streams  None     None           None         None             1.0    NaN            NaN            NaN
1  list_streams  list     None           None         None             1.0    NaN            NaN            NaN
2  list_streams  list        l  ListObjectsV2        False             NaN    1.0            1.0            1.0
3  list_streams  list        r  ListObjectsV2        False             NaN    1.0            1.0            NaN


6. Reorder the column and replace NaN in time bucket to 0:
  arcticdb_call stage key_type     storage_op parallelized count  time_count_20  time_count_40  time_count_630
0  list_streams  None     None           None         None  None              0              0               1
1  list_streams  list     None           None         None  None              0              0               1
2  list_streams  list        l  ListObjectsV2        False     1              1              1               0
3  list_streams  list        r  ListObjectsV2        False     1              0              1               0

Checklist

Checklist for code changes...

Have you updated the relevant docstrings, documentation and copyright notice?
Is this contribution tested against all ArcticDB's features?
Do all exceptions introduced raise appropriate error messages?
Are API changes highlighted in the PR description?
Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

github-actions · 2025-02-27T16:34:36Z

Label error. Requires exactly 1 of: patch, minor, major. Found: enhancement

phoebusm · 2025-02-28T14:41:08Z

python/arcticdb/toolbox/stats_query.py

+    def __sub__(self, other):
+        return self._populate_stats(other._create_time)
+
+    def _populate_stats(self, other_time):


Boilerplate code to beautify the output for now
Pending changes and improvement in later PRs

poodlewars

Nice job figuring out how to pass this stuff through Folly. Can I suggest merging a PR to start with that just introduces the custom Folly executors (with a suite of tests to show that the stats calculation works with both our task based APIs and normal Folly::.via) and then we can figure out the other discussions after.

It would have been helpful if your PR description had explained your design.

poodlewars · 2025-03-03T09:36:04Z