Clustering utilities #770

Aske-Rosted · 2024-11-19T01:20:08Z

This PR add utility functions which will enable DOM-level clustering as described in #752. This PR does not include a NodesASDOM's definition but enables the creation of one. The new utilities completely contain the functionality of cluster_summarize_with_percentiles and as such a warning of this function being deprecated and a "TODO" of the removal of said function has been added so it can be removed some time in the future.

The PercentileClusters node defintion has been changed to use the new utility class/function, and this node definition is tested in the graphnet/tests/models/test_node_definition.py test_percentile_cluster() test.

I plan to soon follow up with a new node definition but thought it would be better to not grow the PR too large.

I created some test to ensure the performance of the new utilities and gathered clustering times as seen below

    ts = time()
    test = cluster_summarize_with_percentiles(tensors, [3,4], [0,1,2], percentiles, add_counts=True)
    old_times.append(time()-ts)

    ts = time()
    cluster_class = cluster_and_pad(tensors,[0,1,2])
    cluster_class.percentile_summarize([3,4],percentiles)
    cluster_class.add_counts(cluster_class.clustered_x.shape[1])
    new_times.append(time()-ts)

    ts = time()
    cluster_class_charge = cluster_and_pad(tensors,[0,1,2])
    cluster_class_charge.add_charge_threshold_summary([3,4],percentiles,4)
    cluster_class_charge.add_counts(cluster_class_charge.clustered_x.shape[1])
    charge_cluster_times.append(time()-ts)

With the tensor being some pseudo pulse data. The plots below show that the new utilities are about a factor 2 faster than the old implementation.

…t into clustering_utilities

RasmusOrsoe

This is a very elegant solution @Aske-Rosted! I really like the reusability of this method, and I think the code is quite readable - well done!

Also great to see the speed benchmarks. It looks like the new clustering logic is quite a bit faster than the current version. This is great!

I added a few minor comments

RasmusOrsoe · 2024-11-22T13:16:51Z

src/graphnet/models/graphs/nodes/nodes.py

            )
+            array = cluster_class.add_counts()


Looks like the resulting node definition will always have counts added as a node feature - but in the arguments to PercentileCluster we allow the user to specify if they want this. The new implementation should respect this.

Would recommend a simple change such as:

if self._add_counts: array = cluster_class.add_counts()

RasmusOrsoe · 2024-11-22T13:33:40Z

src/graphnet/models/graphs/utils.py

@@ -172,6 +179,251 @@ def cluster_summarize_with_percentiles(
    return array


+class cluster_and_pad:
+    """Cluster and pad the data for further summarization."""


Lets expand the doc string here to make it clear that the class stores a clustered tensor that is manipulated when the public methods are called. Here's a suggestion:

""" Cluster the input data according to a set of specified columns and compute aggregate statistics of cluster attributes. The method will cluster the input data according to unique combinations of values in the specified `cluster_columns` and store the clustered tensor. The clustered tensor can be manipulated further by calling public methods that add statistical quantities to the tensor of the values in each cluster. The order in which the public methods are called determines the order of manipulations. Example: # Build cluster of single event on first three columns cluster_class = cluster_and_pad(x = single_event_as_array, cluster_columns = [0,1,2]) # Summarize columns 3 and 4 from original array on each cluster with 10th, 50th and 90th percentiles clustered_array_with_percentile_summary = cluster_class.percentile_summarize(summarization_indices =[3,4], percentiles = [10, 50, 90]) # Add the std of the 5th column to the array above clustered_array_with_percentile_summary_and_std = cluster_class.add_std(column = 5) """

RasmusOrsoe · 2024-11-22T13:34:19Z

src/graphnet/models/graphs/utils.py

+        """
+        if location is None:
+            self.clustered_x = np.column_stack([self.clustered_x, column])
+        else:


The location parameter could potentially overwrite information, right? E.g. if column 6 already contains some aggregate statistics and the user specifies location=6. Maybe a warning/error here would be useful?

The location feature is added using insert which does not remove information but instead expands the vector. The only possible confusion that could come from this is that the final ordering of the features might not be as the user expects as since every feature after the insert will have their location incremented by 1.

pweigel · 2024-11-25T15:08:46Z

src/graphnet/models/graphs/utils.py

+        "This function is deprecated and will be removed,",
+        "use the class cluster_and_pad with add_percentile_summary",
+        "instead for the same functionality",
+    )


While it's good to warn the user that this function is deprecated, this will print for every call (lots of printing). I am not sure what a good solution is here. Maybe it can just be removed immediately since the nodes no longer use it?

Great catch! I missed that.

I have removed the warning but kept the TODO and function for now. If we all agree to remove the function then I am not against it but just prefer to air on the side of caution.

sevmag · 2024-11-25T15:48:35Z

Idea for improving Handling of Summarization Features in the Clustering nodes

I’ve been thinking about how we can make the clustering (especially the upcoming new NodeDefiniton class lets call it Cluster) more flexible and easier to maintain, and I have a suggestion. Instead of having a function within the cluster_and_pad class to create new cluster features, we could create a new class called ClusterFeature that handles summarizing each feature. The class might look something like this:

class ClusterFeature(Model):
    """Base class for summarizing feature series per cluster."""

    @abstractmethod
    def summarisation_feature_label(self):
        """Returns the label for the summarization feature."""

    @abstractmethod
    def add_feature(self):
        """Calculates the summarization feature."""

For every summarization feature we could then create a new subclass of this. E.g. for Totals of input features, Time of first pulse, Cumulative value of input feature after specific times, ....

Then, the new NodeDefinition (which we’ll call Cluster) can just take a list of these ClusterFeature instances. For example:

features = [
    Percentiles(percentiles=[10, 30, 50, 70]),
    Totals(),
    CumulativeAfterT(times=[10, 50, 40]),
    TimeOfPercentiles(percentiles=[10, 30, 50, 70]),
    Std(),
    TimeOfFirstPulse(),
    AddCounts(),
]

Why This Might Be Useful:

Cleaner Constructor: Instead of passing a ton of individual arguments to __init__(), we’d just pass a list of ClusterFeature instances. This will make the Cluster class constructor much cleaner and easier to work with.
Modular and Flexible: Each summarization feature could be its own class, which makes it super easy to add new features down the line without cluttering the Cluster class.
Easier Subsetting: You could easily calculate features on just a subset of the input data (like just the charge), instead of always calculating for everything at once.
Consistency: By looping through the ClusterFeature list, we ensure that the feature labels and data always line up correctly, so there’s less risk of mistakes with mismatched ordering in the _get_indices_and_feature_names and _construct_nodes.

Potential Concerns:

Speed: There might be a performance hit because of the extra abstraction but I don't have a feeling for that
Configuration: We’ll want to make sure this change doesn’t mess with any of the config file stuff or other initialization logic. (I am not too familiar of how the yaml files are being used)

Let me know what you think! I think this could make things more modular and easier to extend, but I’d love to hear your thoughts. I am happy to help out in any way.

I am aware that this would not add much functionality and is more a style choice I guess. I'm already very happy with this PR and I can build on this as well to implement some more summarization features once its through.

Aske-Rosted · 2024-12-06T05:08:12Z

Idea for improving Handling of Summarization Features in the Clustering nodes

I’ve been thinking about how we can make the clustering (especially the upcoming new NodeDefiniton class lets call it Cluster) more flexible and easier to maintain, and I have a suggestion. Instead of having a function within the cluster_and_pad class to create new cluster features, we could create a new class called ClusterFeature that handles summarizing each feature. The class might look something like this:
class ClusterFeature(Model):
    """Base class for summarizing feature series per cluster."""

    @abstractmethod
    def summarisation_feature_label(self):
        """Returns the label for the summarization feature."""

    @abstractmethod
    def add_feature(self):
        """Calculates the summarization feature."""
For every summarization feature we could then create a new subclass of this. E.g. for Totals of input features, Time of first pulse, Cumulative value of input feature after specific times, ....

Then, the new NodeDefinition (which we’ll call Cluster) can just take a list of these ClusterFeature instances. For example:
features = [
    Percentiles(percentiles=[10, 30, 50, 70]),
    Totals(),
    CumulativeAfterT(times=[10, 50, 40]),
    TimeOfPercentiles(percentiles=[10, 30, 50, 70]),
    Std(),
    TimeOfFirstPulse(),
    AddCounts(),
]
Why This Might Be Useful:
1. **Cleaner Constructor**: Instead of passing a ton of individual arguments to `__init__()`, we’d just pass a list of `ClusterFeature` instances. This will make the `Cluster` class constructor much cleaner and easier to work with.

2. **Modular and Flexible**: Each summarization feature could be its own class, which makes it super easy to add new features down the line without cluttering the `Cluster` class.

3. **Easier Subsetting**: You could easily calculate features on just a subset of the input data (like just the charge), instead of always calculating for everything at once.

4. **Consistency**: By looping through the `ClusterFeature` list, we ensure that the feature labels and data always line up correctly, so there’s less risk of mistakes with mismatched ordering in the `_get_indices_and_feature_names` and `_construct_nodes`.
Potential Concerns:
* **Speed**: There might be a performance hit because of the extra abstraction but I don't have a feeling for that

* **Configuration**: We’ll want to make sure this change doesn’t mess with any of the config file stuff or other initialization logic. (I am not too familiar of how the yaml files are being used)
Let me know what you think! I think this could make things more modular and easier to extend, but I’d love to hear your thoughts. I am happy to help out in any way.

I am aware that this would not add much functionality and is more a style choice I guess. I'm already very happy with this PR and I can build on this as well to implement some more summarization features once its through.

Thanks for the input.

After having thought about the suggestions for a while I believe that the implementation with the cluster and pad class and then functions in order to add extra information is the way to go, my main arguments for this is the following

Feeding instances of classes to the node definition would mean that we would have to add some translation from/to the config files in order to accommodate this.
I do not see a good way to share the clustering and padding information between the now separate classes, recreating this clustering and padding would basically linearly increase the over head of each of the added summarization features, which is part of what the cluster_and_pad class implementation tries to solve.
While this way definitely makes it easier for people make their own clustering classes which they can use in their private repos, I would worry that this might lead people to not uploading their contributions to the clustering features as these classes can be completely separate from the rest of the library.
for your point regarding consistency I believe that an automatic calculation of the final output features can also be implemented in the current solution, and I will try to add this.

Hope this is somewhat agreeable to you, and again thanks for your input. 😄

RasmusOrsoe

Very nice @Aske-Rosted! Thank you for the great discussion @sevmag @pweigel

Aske-Rosted added 10 commits October 23, 2024 18:21

cluster and pad utility

2233cdd

Location default None

19122e7

more default none

1e9e45a

Update PercentileCluster

81416b8

cluster and pad utility

64b728c

Location default None

b8654fa

more default none

d964efc

Update PercentileCluster

e6357b5

Merge branch 'clustering_utilities' of github.com:Aske-Rosted/graphne…

f4c4d16

…t into clustering_utilities

align with prehooks

7fcc7fe

Aske-Rosted requested review from pweigel and RasmusOrsoe November 19, 2024 01:20

sevmag mentioned this pull request Nov 21, 2024

Implementation of CNN for IceCube #771

Open

RasmusOrsoe requested changes Nov 22, 2024

View reviewed changes

pweigel reviewed Nov 25, 2024

View reviewed changes

Aske-Rosted added 9 commits December 6, 2024 14:29

fix add_counts optional

7f7000f

update docstrings

505a82e

move/use internal functions + output x

4a4083b

remove warning

87f41c6

re-add old function

51da4b0

remove returns

3e21f7f

docstrings udpdates

eaa12e5

automatic_name_generation

165fedb

small_fix

75b3260

RasmusOrsoe self-requested a review December 10, 2024 08:24

RasmusOrsoe approved these changes Dec 10, 2024

View reviewed changes

Aske-Rosted merged commit 92b150a into graphnet-team:main Dec 11, 2024
13 of 14 checks passed

Aske-Rosted deleted the clustering_utilities branch December 11, 2024 01:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clustering utilities #770

Clustering utilities #770

Aske-Rosted commented Nov 19, 2024

RasmusOrsoe left a comment •

edited

Loading

RasmusOrsoe Nov 22, 2024

RasmusOrsoe Nov 22, 2024

RasmusOrsoe Nov 22, 2024

Aske-Rosted Dec 6, 2024 •

edited

Loading

pweigel Nov 25, 2024 •

edited

Loading

RasmusOrsoe Nov 26, 2024

Aske-Rosted Dec 6, 2024

sevmag commented Nov 25, 2024 •

edited

Loading

Aske-Rosted commented Dec 6, 2024

Idea for improving Handling of Summarization Features in the Clustering nodes

Why This Might Be Useful:

Potential Concerns:

RasmusOrsoe left a comment

Clustering utilities #770

Clustering utilities #770

Conversation

Aske-Rosted commented Nov 19, 2024

RasmusOrsoe left a comment • edited Loading

Choose a reason for hiding this comment

RasmusOrsoe Nov 22, 2024

Choose a reason for hiding this comment

RasmusOrsoe Nov 22, 2024

Choose a reason for hiding this comment

RasmusOrsoe Nov 22, 2024

Choose a reason for hiding this comment

Aske-Rosted Dec 6, 2024 • edited Loading

Choose a reason for hiding this comment

pweigel Nov 25, 2024 • edited Loading

Choose a reason for hiding this comment

RasmusOrsoe Nov 26, 2024

Choose a reason for hiding this comment

Aske-Rosted Dec 6, 2024

Choose a reason for hiding this comment

sevmag commented Nov 25, 2024 • edited Loading

Idea for improving Handling of Summarization Features in the Clustering nodes

Why This Might Be Useful:

Potential Concerns:

Aske-Rosted commented Dec 6, 2024

Idea for improving Handling of Summarization Features in the Clustering nodes

Why This Might Be Useful:

Potential Concerns:

RasmusOrsoe left a comment

Choose a reason for hiding this comment

RasmusOrsoe left a comment •

edited

Loading

Aske-Rosted Dec 6, 2024 •

edited

Loading

pweigel Nov 25, 2024 •

edited

Loading

sevmag commented Nov 25, 2024 •

edited

Loading