
Efficiency improvements #136

Merged: 11 commits merged into main from metrics_calc on Dec 7, 2023

Conversation

@dimkarakostas (Member) commented Dec 5, 2023

All Submissions:

  • Have you followed the guidelines in our Contributing documentation?
  • Have you verified that there aren't any other open Pull Requests for the same update/change?
  • Does the Pull Request pass all tests?

Description

Increases the efficiency of the aggregation process (see the sketch after this description) by:

  • decreasing the number of iterations needed
  • decreasing the amount of data that is accessed in each iteration
  • decreasing the memory consumption of the process by removing unnecessary list creations

Increases the efficiency of the analysis process by:

  • reducing the size of the dictionary that stores the aggregate data read from the csv file
  • reorganizing the loops such that the aggregate data is only loaded once (since opening the file, which is large, takes time)
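
For context, the sketch below illustrates the aggregation pattern these changes introduce, pieced together from the diff fragments in this conversation; the function name, signature, and index values are illustrative assumptions, not the project's exact implementation.

```python
from collections import defaultdict
from datetime import date


def count_blocks(data_to_aggregate, monthly_data_breaking_points, timeframe_start, timeframe_end):
    """Count blocks per entity in [timeframe_start, timeframe_end], using
    precomputed per-month start indices to avoid scanning the whole dataset."""
    # Jump to the first block of the last month starting on or before the
    # timeframe, instead of always iterating from index 0.
    start_index = 0
    for month, month_block_index in monthly_data_breaking_points:
        if month <= timeframe_start.strftime('%Y-%m'):
            start_index = month_block_index
        else:
            break

    blocks_per_entity = defaultdict(int)
    for block in data_to_aggregate[start_index:]:
        block_date = date.fromisoformat(block['timestamp'][:10])
        if block_date > timeframe_end:
            break  # data is sorted by timestamp, so nothing later can match
        if block_date >= timeframe_start:
            blocks_per_entity[block['creator']] += 1
    return blocks_per_entity


# Hypothetical usage with one block and one precomputed breaking point:
blocks = count_blocks(
    [{'timestamp': '2023-02-03 12:00:00', 'creator': 'pool_a'}],
    [('2023-02', 0)],
    date(2023, 2, 1), date(2023, 2, 28),
)
```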

@dimkarakostas changed the title from "[WIP] Aggregate efficiency improvements" to "Efficiency improvements" on Dec 6, 2023
@@ -20,10 +20,17 @@ def __init__(self, project, io_dir, data_to_aggregate):
         :param data_to_aggregate: list of dictionaries. The data that will be aggregated
         """
         self.project = project
-        self.data_to_aggregate = data_to_aggregate
+        self.data_to_aggregate = sorted(data_to_aggregate, key=lambda x: x['timestamp'])

Member

Isn't the data already sorted during parsing? Is additional sorting needed?

@@ -20,10 +20,17 @@ def __init__(self, project, io_dir, data_to_aggregate):
         :param data_to_aggregate: list of dictionaries. The data that will be aggregated
         """
         self.project = project
-        self.data_to_aggregate = data_to_aggregate
+        self.data_to_aggregate = sorted(data_to_aggregate, key=lambda x: x['timestamp'])
+        self.data_start_date = hlp.get_timeframe_beginning(self.data_to_aggregate[0]['timestamp'][:10])

Member

I suggest creating helper functions to extract the date from a timestamp (instead of using [:10] or [:7]).
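
For reference, such helpers might look like the minimal sketch below; both names are hypothetical (the get_date_from_block helper that appears later in this diff presumably covers the [:10] case):

```python
def get_date_from_timestamp(timestamp):
    """Return the 'YYYY-MM-DD' part of an ISO-format timestamp string."""
    return timestamp[:10]


def get_month_from_timestamp(timestamp):
    """Return the 'YYYY-MM' part of an ISO-format timestamp string."""
    return timestamp[:7]
```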

         self.aggregated_data_dir = io_dir / 'blocks_per_entity'
         self.aggregated_data_dir.mkdir(parents=True, exist_ok=True)

+        self.monthly_data_breaking_points = [(self.data_start_date.strftime('%Y-%m'), 0)]

Member

Is there a benefit to using strings (which have to be cast into date objects again later) instead of date objects here?
Also, is there a benefit to using a list of tuples instead of a dictionary? With a dictionary we wouldn't have to iterate later in order to get the index.

Member (Author)

If the data are not continuous, the dict wouldn't work because some dates wouldn't be there.

Member

But we could still iterate the dict just for the cases where dates are missing, right? So it would be equally efficient for the missing dates but more efficient for the ones that are present in the dictionary.

Member

Also, what are your thoughts on using date objects instead of strings here? Is there some benefit in using strings (which have to be cast into dates again later on)?

@dimkarakostas (Member, Author) commented Dec 7, 2023

You mean try to use a dict key and iterate instead if a key error is thrown? That could work, but I don't think it'll save much time; the size of this list is at most ~160 elements (~13 years), so it takes a few ms to iterate over it.
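
A minimal sketch of that lookup-with-fallback idea, assuming the breaking points were restructured as a dict mapping 'YYYY-MM' strings to first-block indices (the function name and data are hypothetical):

```python
def get_start_index(breaking_points, month):
    """O(1) lookup for months present in the data; for missing months,
    fall back to scanning for the latest earlier month's index."""
    try:
        return breaking_points[month]
    except KeyError:
        start_index = 0
        for m in sorted(breaking_points):  # at most ~160 months, so cheap
            if m > month:
                break
            start_index = breaking_points[m]
        return start_index


# Example: March 2023 has no blocks, so we fall back to February's index.
points = {'2023-01': 0, '2023-02': 1042, '2023-04': 2310}
assert get_start_index(points, '2023-02') == 1042
assert get_start_index(points, '2023-03') == 1042
```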

Member (Author)

For the date objects, I have no strong thoughts; I just find it easier to do the comparisons here. If you prefer date objects, you can add a commit in this PR to change this.

            blocks_per_entity[block['creator']] += 1
        if self.data_start_date <= timeframe_end and self.data_end_date >= timeframe_start:
            for month, month_block_index in self.monthly_data_breaking_points:
                start_index = 0

Member

Shouldn't the initialization be before the loop? Also, wouldn't an end_index be equally useful here?

Member (Author)

end_index doesn't really matter here, because the loop exits anyway when the block timestamp goes beyond the timeframe end.

    for i, chunk in enumerate(timeframe_chunks):
        chunk_start, chunk_end = chunk
        t_chunk = hlp.format_time_chunks(time_chunks=[chunk], granularity=aggregate_by)[0]

Member

We are already doing the formatting on L129, so you can move it above the loop and then just access the corresponding element (so that we don't calculate the function twice for each chunk).
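
A sketch of the suggested restructuring; since hlp.format_time_chunks is project-internal, a stand-in with assumed behavior is used here:

```python
from datetime import date


def format_time_chunks(time_chunks, granularity='month'):
    """Stand-in for hlp.format_time_chunks: label each (start, end) chunk."""
    return [f'{start:%Y-%m}' for start, end in time_chunks]


timeframe_chunks = [(date(2023, 1, 1), date(2023, 1, 31)),
                    (date(2023, 2, 1), date(2023, 2, 28))]

# Format every chunk once, up front, instead of calling the formatter
# again inside the loop for each individual chunk.
formatted_chunks = format_time_chunks(time_chunks=timeframe_chunks)
for chunk, t_chunk in zip(timeframe_chunks, formatted_chunks):
    chunk_start, chunk_end = chunk
    print(t_chunk, chunk_start, chunk_end)
```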

    aggregated_data_dir = output_dir / project / 'blocks_per_entity'
    time_chunks, blocks_per_entity = hlp.get_blocks_per_entity_from_file(aggregated_data_dir / aggregated_data_filename)
    chunks_with_blocks = set()
    for _, block_values in blocks_per_entity.items():

Member

If the keys are not needed, it's better to iterate over blocks_per_entity.values() instead of blocks_per_entity.items().
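
For illustration (the per-entity value shape, a list of per-chunk counts, is an assumption):

```python
blocks_per_entity = {'pool_a': [3, 0, 5], 'pool_b': [0, 2, 0]}

chunks_with_blocks = set()
for block_values in blocks_per_entity.values():  # keys unused, so skip .items()
    for i, count in enumerate(block_values):
        if count > 0:
            chunks_with_blocks.add(i)
# chunks_with_blocks == {0, 1, 2}
```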

            current_max_name = name
    nc += 1
    power_percentage += 100 * blocks_per_entity[current_max_name] / total_blocks
    top_entities.add(current_max_name)

Member

I think a heap would be more suitable (and more efficient) for this kind of operation. Or we could use numpy for all the metric functions, which would probably make them more efficient (it includes functions for partial sorting too). You don't have to change this now, since it works, but we can keep it in mind in case we need better performance.
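
A sketch of the heap-based alternative with illustrative counts; the 50% threshold and the nc / power_percentage / top_entities names follow the snippet above:

```python
import heapq

blocks_per_entity = {'pool_a': 50, 'pool_b': 30, 'pool_c': 15, 'pool_d': 5}
total_blocks = sum(blocks_per_entity.values())

# Heapify once in O(n), then pop only as many entities as needed, instead
# of rescanning the whole dict for the current maximum on every iteration.
heap = [(-blocks, name) for name, blocks in blocks_per_entity.items()]
heapq.heapify(heap)

nc, power_percentage = 0, 0.0
top_entities = set()
while heap and power_percentage <= 50:  # minimum set controlling >50% of blocks
    neg_blocks, name = heapq.heappop(heap)  # O(log n) per pop
    nc += 1
    power_percentage += 100 * -neg_blocks / total_blocks
    top_entities.add(name)
# nc == 2: pool_a (50%) and pool_b (30%) together exceed 50%
```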


def get_date_from_block(block):
    """
    Gets the date from the timestamp of a block.
    :param block: dictionary of block data, which include at least one entry with key 'timestamp'

Member

Why "at least one entry"? Is it possible to have more?

Member (Author)

It's possible to have more entries, but they should have at least one with the key "timestamp".

Member

But is it possible to have more than one entry with the key "timestamp"?

Member (Author)

Of course not, it's a dictionary.

Member

That's my point: the description of the param should be updated; there's no reason to include the "at least" part, as it doesn't add anything and can be confusing.

@LadyChristina merged commit 2d0526e into main on Dec 7, 2023
1 check passed
@LadyChristina deleted the metrics_calc branch on December 7, 2023 at 13:23