Efficiency improvements #136
Conversation
Force-pushed from d37c5e3 to 0f632c8
@@ -20,10 +20,17 @@ def __init__(self, project, io_dir, data_to_aggregate):
     :param data_to_aggregate: list of dictionaries. The data that will be aggregated
     """
     self.project = project
-    self.data_to_aggregate = data_to_aggregate
+    self.data_to_aggregate = sorted(data_to_aggregate, key=lambda x: x['timestamp'])
Isn't the data already sorted during parsing? Is additional sorting needed?
@@ -20,10 +20,17 @@ def __init__(self, project, io_dir, data_to_aggregate):
     :param data_to_aggregate: list of dictionaries. The data that will be aggregated
     """
     self.project = project
-    self.data_to_aggregate = data_to_aggregate
+    self.data_to_aggregate = sorted(data_to_aggregate, key=lambda x: x['timestamp'])
+    self.data_start_date = hlp.get_timeframe_beginning(self.data_to_aggregate[0]['timestamp'][:10])
I suggest creating helper functions to extract the date from a timestamp (instead of using [:10] or [:7])
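A minimal sketch of such helpers (the function names are hypothetical, not from the codebase):

```python
# Hypothetical helper names; the codebase would choose its own.
# They give the magic slices [:10] and [:7] descriptive names.

def get_date_string_from_timestamp(timestamp):
    """Return the YYYY-MM-DD prefix of an ISO-8601 timestamp string."""
    return timestamp[:10]


def get_month_string_from_timestamp(timestamp):
    """Return the YYYY-MM prefix of an ISO-8601 timestamp string."""
    return timestamp[:7]
```

For example, `get_date_string_from_timestamp('2018-02-04 12:34:56')` returns `'2018-02-04'`, and the month helper returns `'2018-02'`.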
self.aggregated_data_dir = io_dir / 'blocks_per_entity'
self.aggregated_data_dir.mkdir(parents=True, exist_ok=True)

self.monthly_data_breaking_points = [(self.data_start_date.strftime('%Y-%m'), 0)]
Is there a benefit to using strings (which have to be cast into date objects again later) instead of date objects here?
Also, is there a benefit to using a list of tuples instead of a dictionary? With a dictionary we wouldn't have to iterate later in order to get the index.
If the data are not continuous, the dict wouldn't work because some dates wouldn't be there.
But we could still iterate the dict just for the cases where dates are missing, right? So it would be equally efficient for the missing dates but more efficient for the ones that are present in the dictionary
Also what are your thoughts on using date objects instead of strings here? Is there some benefit in using strings (which have to be cast into dates again later on)?
You mean try to use a dict key and iterate instead if a key error is thrown? That could work, but I don't think it'll save that much time; the size of this list is at most ~160 elements (~13 years), so it takes a few ms to iterate over it.
For the date objects I have no strong opinion; I just find it easier to do the comparisons here. If you prefer date objects you can add a commit in this PR to change this.
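For illustration, the dict-based variant discussed above could look like this: a dictionary keyed by date objects with a linear fallback scan only for months that are missing from the data. The data and the function name are made up for this sketch.

```python
from datetime import date

# Made-up breaking points: month -> index of that month's first block.
# March 2020 is deliberately missing, to model a gap in the data.
breaking_points = {
    date(2020, 1, 1): 0,
    date(2020, 2, 1): 310,
    date(2020, 4, 1): 550,
}


def start_index_for(month):
    """O(1) lookup for months present in the data; fall back to a
    linear scan (latest earlier month) only when the month is missing."""
    try:
        return breaking_points[month]
    except KeyError:
        earlier = [m for m in breaking_points if m < month]
        return breaking_points[max(earlier)] if earlier else 0
```

So a present month resolves in constant time, while a gap such as March 2020 falls back to February's index.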
blocks_per_entity[block['creator']] += 1
if self.data_start_date <= timeframe_end and self.data_end_date >= timeframe_start:
    for month, month_block_index in self.monthly_data_breaking_points:
        start_index = 0
Shouldn't the initialization be before the loop? Also, wouldn't an end_index be equally useful here?
end_index doesn't really matter here, because the loop exits anyway when the block timestamp goes beyond the timeframe end.
for i, chunk in enumerate(timeframe_chunks):
    chunk_start, chunk_end = chunk
    t_chunk = hlp.format_time_chunks(time_chunks=[chunk], granularity=aggregate_by)[0]
We are already doing the formatting on L129, so you can move it above the loop and then just access the corresponding element (so that we don't call the function once per chunk).
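A sketch of the suggested change. The formatter here is a stand-in that only approximates the real hlp.format_time_chunks helper; the point is hoisting the call out of the loop.

```python
# Stand-in formatter for illustration; the real project uses
# hlp.format_time_chunks, whose exact behavior this only approximates.
def format_time_chunks(time_chunks, granularity):
    return [f"{start}..{end}" for start, end in time_chunks]


timeframe_chunks = [(1, 5), (6, 10), (11, 15)]

# Format every chunk once, before the loop ...
formatted_chunks = format_time_chunks(timeframe_chunks, granularity='month')

# ... then index into the precomputed list instead of re-formatting
# a single-element list on every iteration.
labels = []
for i, chunk in enumerate(timeframe_chunks):
    chunk_start, chunk_end = chunk
    t_chunk = formatted_chunks[i]
    labels.append(t_chunk)
```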
aggregated_data_dir = output_dir / project / 'blocks_per_entity'
time_chunks, blocks_per_entity = hlp.get_blocks_per_entity_from_file(aggregated_data_dir / aggregated_data_filename)
chunks_with_blocks = set()
for _, block_values in blocks_per_entity.items():
If the keys are not needed, it's better to iterate over blocks_per_entity.values() instead of blocks_per_entity.items().
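For instance (with made-up data), the loop above becomes:

```python
# Made-up data: entity -> blocks produced per time chunk.
blocks_per_entity = {'pool_a': [1, 0, 2], 'pool_b': [0, 3, 0]}

# The keys are unused, so iterate over the values directly.
chunks_with_blocks = set()
for block_values in blocks_per_entity.values():
    for i, value in enumerate(block_values):
        if value > 0:
            chunks_with_blocks.add(i)
```

This avoids unpacking a (key, value) tuple per entity and drops the throwaway `_` variable.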
current_max_name = name
nc += 1
power_percentage += 100 * blocks_per_entity[current_max_name] / total_blocks
top_entities.add(current_max_name)
I think a heap would be more suitable (and efficient) for this kind of operation. Alternatively, we could use numpy for all the metric functions, which would probably make them more efficient (it includes functions for partial sorting too). You don't have to change this now since it works, but we can keep it in mind in case we need better performance.
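As a sketch of the heap idea: instead of repeatedly scanning for the current maximum, heapify once and pop the largest producers until they cover the threshold. The data is made up, and the >50% cut-off mimics a Nakamoto-coefficient-style computation rather than the project's exact metric.

```python
import heapq

# Made-up block counts for illustration.
blocks_per_entity = {'a': 50, 'b': 30, 'c': 15, 'd': 5}
total_blocks = sum(blocks_per_entity.values())

# Max-heap via negated counts (heapq is a min-heap): pop the largest
# producers until they jointly cover more than 50% of all blocks.
heap = [(-blocks, name) for name, blocks in blocks_per_entity.items()]
heapq.heapify(heap)

nc, covered = 0, 0
while covered * 2 <= total_blocks:
    neg_blocks, _name = heapq.heappop(heap)
    covered -= neg_blocks
    nc += 1
```

Heapify is O(n) and each pop is O(log n), so finding the top k entities costs O(n + k log n) instead of O(k * n) repeated scans.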
consensus_decentralization/helper.py (outdated)
def get_date_from_block(block):
    """
    Gets the date from the timestamp of a block.
    :param block: dictionary of block data, which include at least one entry with key 'timestamp'
why "at least one entry" - is it possible to have more?
It's possible to have more entries, but they should have at least one with the key "timestamp".
but is it possible to have more than one entry with the key "timestamp"?
Of course not, it's a dictionary.
That's my point: the description of the param should be updated; there's no reason to include the "at least" part, as it doesn't add anything and can be confusing.
Force-pushed from 945aeb9 to 6ecba74
All Submissions:
Description
Increases efficiency in the aggregate process by:
Increases efficiency in the analyze process by: