[DONT MERGE] Draft query stat API #2148
base: master
Conversation
```python
'count': [1, 5],
'max_time': [1, 10],
'min_time': [1, 20],
'avg_time': [1, 15],
```
I think a histogram will be better than min/max/avg.
Updated
```python
'max_time': [1, 10],
'min_time': [1, 20],
'avg_time': [1, 15],
'uncompressed_size': [10, 1000],
```
What does `10` mean in the list here? You need to give realistic output for anyone to understand this proposal.
Updated it with a slightly more realistic output.
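As a rough illustration of the direction discussed in this thread, here is a hypothetical per-entry record with the min/max/avg fields replaced by a latency histogram. All field names, bucket edges, and values are invented for illustration, not the actual draft API:

```python
# Hypothetical sketch of one stats entry; the min/max/avg time fields
# are replaced by a latency histogram, as suggested in the review.
# Field names and numbers are assumptions, not the real API.
entry = {
    "count": 6,  # number of storage calls recorded for this entry
    "latency_histogram": {
        # bucket upper bounds in milliseconds, and calls per bucket
        "buckets_ms": [10, 50, 100, 500],
        "counts": [3, 2, 1, 0],
    },
    "uncompressed_size": 10_000,
}

# Sanity check: histogram buckets should account for every call
total_calls = sum(entry["latency_histogram"]["counts"])
assert total_calls == entry["count"]
```

A histogram keeps the same information the min/max/avg triple tried to convey, but also shows the shape of the latency distribution (e.g. a few slow outliers vs. uniformly slow calls).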
Reference Issues/PRs
What does this implement or fix?
Draft query stat API, for discussion and as a placeholder.
If the columns stated below are checked, it means the column will be used to group the corresponding entries.
There are 3 kinds of entry in the stats, each applying to the `GetObject` and `PutObject` operations only.

Columns shared by all types of entry:
- Latency histogram

Result:
- Unit
- Risk of overflow

Output format: Pandas dataframe
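Since the output format is a Pandas dataframe, here is a minimal sketch of what consuming the stats could look like. The column names, key types, and numbers are invented for illustration and are not the actual output of the draft API:

```python
import pandas as pd

# Illustrative only: column names and values are assumptions about what
# a query-stats dataframe might contain, not the real draft API output.
stats = pd.DataFrame(
    {
        "key_type": ["tdata", "vref", "ver"],
        "count": [9000, 9, 60],
        "latency_0_10ms": [8000, 9, 55],    # calls completing in <10 ms
        "latency_10_100ms": [1000, 0, 5],   # calls completing in 10-100 ms
    }
).set_index("key_type")

# A disproportionately high tdata-to-vref count ratio is the kind of
# signal the use cases below rely on (e.g. fragmented segments).
ratio = stats.loc["tdata", "count"] / stats.loc["vref", "count"]
```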
Sample use cases
(GitHub rendering gets weird with wide markdown tables like the ones below, so the tables can't be uncommented.)
Very slow `list_symbols` because of a very outdated symbol list cache

In a library with frequent symbol adds/deletes, a lot of SymbolList keys will be added as journal entries.
The list of keys will usually be compacted by a `list_symbols` call if the caller has write permission. If that compaction does not happen, then when users call `list_symbols`, all of those keys need to be iterated to build the symbol list, which can be very time consuming if the list is long. With query stats, users can learn the cause by checking the `count` in the entry.

Very slow s3 API call
When the s3 endpoint performs poorly, the `latency` in the stats is very helpful for pinpointing the culprit of the poor performance. In this case, the user can turn on the s3 log to find out which exact step is slow.
Slow read due to fragmented segments
The library has a very small setting for the number of columns per segment, so a simple read requires reading many segments. This can be discovered by checking `count` in the stats, and verified by a disproportionately high ratio of tdata key count to vref key count. Below is an example of a simple read of 9 symbols with 1000 segments in each symbol's latest version.

Traversing a long version chain
When old versions are deleted, more version keys need to be traversed. E.g. if 10 versions are appended and the last 5 versions are deleted, the version keys will be traversed in this order (T = tombstone):
vref -> ver 5T -> ver 6T -> ver 7T -> ver 8T -> ver 9T -> ver 9 -> ver 8 -> ver 7 -> ver 6 -> ver 5 -> ver 4
This can be discovered by a disproportionately high ratio of tdata key count to vref key count, with the count of tombstone keys highlighted among the version keys:
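The walk quoted above can be sketched in a few lines. The key layout is assumed from the example only, not taken from the actual storage code:

```python
# Illustrative simulation of the version-chain walk described above:
# tombstone keys sit at the head of the chain, followed by the version
# keys newest-first; the walk stops at the first live (non-deleted)
# version it finds.
def keys_traversed(appended, deleted):
    tombstones = [f"ver {v}T" for v in sorted(deleted)]
    versions = [f"ver {v}" for v in sorted(appended, reverse=True)]
    chain = ["vref"] + tombstones + versions
    walked = []
    for key in chain:
        walked.append(key)
        suffix = key.replace("ver ", "")
        if suffix.isdigit() and int(suffix) not in deleted:
            break  # found a live version, stop walking
    return walked

# 10 versions appended, last 5 deleted: 12 keys are walked before the
# live "ver 4" is reached, matching the order quoted above.
path = keys_traversed(appended=range(10), deleted=set(range(5, 10)))
```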
Slow response while trying to read tombstone/delete version of a symbol
When a user specifically tries to read a tombstoned/deleted version of a symbol, the process iterates all existing snapshots of the library. This is a last-resort attempt to find the version, but it can be painstakingly slow because every snapshot must be iterated. Below is an example of a user trying to read a non-existent version number larger than the latest version, with 100 snapshots in the library:
Analyzing why the script is slow
A very common request. With query stats, the user can pinpoint which exact step is slow. Below is a common use case.

With the above stats, the user can tell the script run is slow because the library has delayed delete OFF. It therefore has to iterate all snapshots to find out whether the version being overwritten is safe to delete.
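A sketch of the kind of check a user could make from such stats: if each write iterates roughly as many snapshot keys as the library holds, delayed delete is likely OFF. All numbers and names below are illustrative, not real API output:

```python
# Hypothetical values read off a query-stats dataframe.
snapshot_key_reads = 300    # `count` for snapshot keys (illustrative)
writes = 3                  # number of writes the script performed
snapshots_in_library = 100

# Roughly one full snapshot scan per write suggests delayed delete is
# OFF, so every overwrite must check all snapshots before deleting.
reads_per_write = snapshot_key_reads / writes
scans_all_snapshots = reads_per_write >= snapshots_in_library
```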