Efficiency improvements #136
Conversation
Force-pushed from d37c5e3 to 0f632c8
@@ -20,10 +20,17 @@ def __init__(self, project, io_dir, data_to_aggregate):
     :param data_to_aggregate: list of dictionaries. The data that will be aggregated
     """
     self.project = project
-    self.data_to_aggregate = data_to_aggregate
+    self.data_to_aggregate = sorted(data_to_aggregate, key=lambda x: x['timestamp'])
Isn't the data already sorted during parsing? Is additional sorting needed?
@@ -20,10 +20,17 @@ def __init__(self, project, io_dir, data_to_aggregate):
     :param data_to_aggregate: list of dictionaries. The data that will be aggregated
     """
     self.project = project
-    self.data_to_aggregate = data_to_aggregate
+    self.data_to_aggregate = sorted(data_to_aggregate, key=lambda x: x['timestamp'])
+    self.data_start_date = hlp.get_timeframe_beginning(self.data_to_aggregate[0]['timestamp'][:10])
I suggest creating helper functions to extract the date from a timestamp (instead of using [:10] or [:7])
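A minimal sketch of such helpers (the function names are hypothetical, not from the codebase):

```python
# Hypothetical helper names; the codebase would choose its own.
# They give the magic slices [:10] and [:7] descriptive names.

def get_date_string_from_timestamp(timestamp):
    """Return the YYYY-MM-DD prefix of an ISO-8601 timestamp string."""
    return timestamp[:10]


def get_month_string_from_timestamp(timestamp):
    """Return the YYYY-MM prefix of an ISO-8601 timestamp string."""
    return timestamp[:7]
```

For example, `get_date_string_from_timestamp('2018-02-04 12:34:56')` returns `'2018-02-04'`, and the month helper returns `'2018-02'`.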
self.aggregated_data_dir = io_dir / 'blocks_per_entity'
self.aggregated_data_dir.mkdir(parents=True, exist_ok=True)

self.monthly_data_breaking_points = [(self.data_start_date.strftime('%Y-%m'), 0)]
Is there a benefit to using strings (which have to be cast into date objects again later) instead of date objects here?
Also, is there a benefit to using a list of tuples instead of a dictionary? With a dictionary we wouldn't have to iterate later in order to get the index.
If the data are not continuous, the dict wouldn't work because some dates wouldn't be there.
But we could still iterate the dict just for the cases where dates are missing, right? So it would be equally efficient for the missing dates but more efficient for the ones that are present in the dictionary
Also what are your thoughts on using date objects instead of strings here? Is there some benefit in using strings (which have to be cast into dates again later on)?
You mean try to use a dict key and iterate instead if a key error is thrown? That could work, but I don't think it'll save that much time; the size of this list is at most ~160 elements (~13 years), so it takes a few ms to iterate over it.
For the date objects I have no strong opinion; I just find it easier to do the comparisons here. If you prefer date objects you can add a commit in this PR to change this.
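For illustration, the dict-based variant discussed above could look like this: a dictionary keyed by date objects with a linear fallback scan only for months that are missing from the data. The data and the function name are made up for this sketch.

```python
from datetime import date

# Made-up breaking points: month -> index of that month's first block.
# March 2020 is deliberately missing, to model a gap in the data.
breaking_points = {
    date(2020, 1, 1): 0,
    date(2020, 2, 1): 310,
    date(2020, 4, 1): 550,
}


def start_index_for(month):
    """O(1) lookup for months present in the data; fall back to a
    linear scan (latest earlier month) only when the month is missing."""
    try:
        return breaking_points[month]
    except KeyError:
        earlier = [m for m in breaking_points if m < month]
        return breaking_points[max(earlier)] if earlier else 0
```

So a present month resolves in constant time, while a gap such as March 2020 falls back to February's index.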
blocks_per_entity[block['creator']] += 1
if self.data_start_date <= timeframe_end and self.data_end_date >= timeframe_start:
    for month, month_block_index in self.monthly_data_breaking_points:
        start_index = 0
Shouldn't the initialization be before the loop? Also, wouldn't an end_index be equally useful here?
end_index doesn't really matter here, because the loop exits anyway when the block timestamp goes beyond the timeframe end.
for i, chunk in enumerate(timeframe_chunks):
    chunk_start, chunk_end = chunk
    t_chunk = hlp.format_time_chunks(time_chunks=[chunk], granularity=aggregate_by)[0]
We are already doing the formatting on L129, so you can move it above the loop and then just access the corresponding element (so that we don't call the function once per chunk).
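A sketch of the suggested change. The formatter here is a stand-in that only approximates the real hlp.format_time_chunks helper; the point is hoisting the call out of the loop.

```python
# Stand-in formatter for illustration; the real project uses
# hlp.format_time_chunks, whose exact behavior this only approximates.
def format_time_chunks(time_chunks, granularity):
    return [f"{start}..{end}" for start, end in time_chunks]


timeframe_chunks = [(1, 5), (6, 10), (11, 15)]

# Format every chunk once, before the loop ...
formatted_chunks = format_time_chunks(timeframe_chunks, granularity='month')

# ... then index into the precomputed list instead of re-formatting
# a single-element list on every iteration.
labels = []
for i, chunk in enumerate(timeframe_chunks):
    chunk_start, chunk_end = chunk
    t_chunk = formatted_chunks[i]
    labels.append(t_chunk)
```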
aggregated_data_dir = output_dir / project / 'blocks_per_entity'
time_chunks, blocks_per_entity = hlp.get_blocks_per_entity_from_file(aggregated_data_dir / aggregated_data_filename)
chunks_with_blocks = set()
for _, block_values in blocks_per_entity.items():
If the keys are not needed, it's better to iterate over blocks_per_entity.values() instead of blocks_per_entity.items().
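For instance (with made-up data), the loop above becomes:

```python
# Made-up data: entity -> blocks produced per time chunk.
blocks_per_entity = {'pool_a': [1, 0, 2], 'pool_b': [0, 3, 0]}

# The keys are unused, so iterate over the values directly.
chunks_with_blocks = set()
for block_values in blocks_per_entity.values():
    for i, value in enumerate(block_values):
        if value > 0:
            chunks_with_blocks.add(i)
```

This avoids unpacking a (key, value) tuple per entity and drops the throwaway `_` variable.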
current_max_name = name
nc += 1
power_percentage += 100 * blocks_per_entity[current_max_name] / total_blocks
top_entities.add(current_max_name)
I think a heap would be more suitable (and efficient) for this kind of operation. Alternatively, we could use numpy for all the metric functions, which would probably make them more efficient (it includes functions for partial sorting too). You don't have to change this now since it works, but we can keep it in mind in case we need better performance.
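As a sketch of the heap idea: instead of repeatedly scanning for the current maximum, heapify once and pop the largest producers until they cover the threshold. The data is made up, and the >50% cut-off mimics a Nakamoto-coefficient-style computation rather than the project's exact metric.

```python
import heapq

# Made-up block counts for illustration.
blocks_per_entity = {'a': 50, 'b': 30, 'c': 15, 'd': 5}
total_blocks = sum(blocks_per_entity.values())

# Max-heap via negated counts (heapq is a min-heap): pop the largest
# producers until they jointly cover more than 50% of all blocks.
heap = [(-blocks, name) for name, blocks in blocks_per_entity.items()]
heapq.heapify(heap)

nc, covered = 0, 0
while covered * 2 <= total_blocks:
    neg_blocks, _name = heapq.heappop(heap)
    covered -= neg_blocks
    nc += 1
```

Heapify is O(n) and each pop is O(log n), so finding the top k entities costs O(n + k log n) instead of O(k * n) repeated scans.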
consensus_decentralization/helper.py (outdated)
def get_date_from_block(block):
    """
    Gets the date from the timestamp of a block.
    :param block: dictionary of block data, which include at least one entry with key 'timestamp'
why "at least one entry" - is it possible to have more?
It's possible to have more entries, but they should have at least one with the key "timestamp".
but is it possible to have more than one entry with the key "timestamp"?
Of course not, it's a dictionary.
That's my point: the description of the param should be updated; there's no reason to include the "at least" part, as it doesn't add anything and can be confusing.
Force-pushed from 945aeb9 to 6ecba74
All Submissions:
Description
Increases efficiency in the aggregate process by:
Increases efficiency in the analyze process by: