Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clustering utilities #770

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

Aske-Rosted
Copy link
Collaborator

This PR add utility functions which will enable DOM-level clustering as described in #752. This PR does not include a NodesASDOM's definition but enables the creation of one. The new utilities completely contain the functionality of cluster_summarize_with_percentiles and as such a warning of this function being deprecated and a "TODO" of the removal of said function has been added so it can be removed some time in the future.

The PercentileClusters node defintion has been changed to use the new utility class/function, and this node definition is tested in the graphnet/tests/models/test_node_definition.py test_percentile_cluster() test.

I plan to soon follow up with a new node definition but thought it would be better to not grow the PR too large.

I created some test to ensure the performance of the new utilities and gathered clustering times as seen below

    ts = time()
    test = cluster_summarize_with_percentiles(tensors, [3,4], [0,1,2], percentiles, add_counts=True)
    old_times.append(time()-ts)

    ts = time()
    cluster_class = cluster_and_pad(tensors,[0,1,2])
    cluster_class.percentile_summarize([3,4],percentiles)
    cluster_class.add_counts(cluster_class.clustered_x.shape[1])
    new_times.append(time()-ts)

    ts = time()
    cluster_class_charge = cluster_and_pad(tensors,[0,1,2])
    cluster_class_charge.add_charge_threshold_summary([3,4],percentiles,4)
    cluster_class_charge.add_counts(cluster_class_charge.clustered_x.shape[1])
    charge_cluster_times.append(time()-ts)

With the tensor being some pseudo pulse data. The plots below show that the new utilities are about a factor 2 faster than the old implementation.
cluster_time_ratio
cluster_time_scatter
cluster_time_hist

Copy link
Collaborator

@RasmusOrsoe RasmusOrsoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very elegant solution @Aske-Rosted! I really like the reusability of this method, and I think the code is quite readable - well done!

Also great to see the speed benchmarks. It looks like the new clustering logic is quite a bit faster than the current version. This is great!

I added a few minor comments

)
array = cluster_class.add_counts()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the resulting node definition will always have counts added as a node feature - but in the arguments to PercentileCluster we allow the user to specify if they want this. The new implementation should respect this.

Would recommend a simple change such as:

if self._add_counts:
    array = cluster_class.add_counts()

@@ -172,6 +179,251 @@ def cluster_summarize_with_percentiles(
return array


class cluster_and_pad:
"""Cluster and pad the data for further summarization."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets expand the doc string here to make it clear that the class stores a clustered tensor that is manipulated when the public methods are called. Here's a suggestion:

"""
Cluster the input data according to a set of specified columns and 
compute aggregate statistics of cluster attributes.

The method will cluster the input data according to unique combinations of values in 
the specified `cluster_columns` and store the clustered tensor. The clustered tensor can
be manipulated further by calling public methods that add statistical quantities to the 
tensor of the values in each cluster. The order in which the public methods are called
determines the order of manipulations. 

Example:
# Build cluster of single event on first three columns
cluster_class = cluster_and_pad(x = single_event_as_array, cluster_columns = [0,1,2])

# Summarize columns 3 and 4 from original array on each cluster with 10th, 50th and 90th percentiles
clustered_array_with_percentile_summary = cluster_class.percentile_summarize(summarization_indices =[3,4], percentiles = [10, 50, 90])
# Add the std of the 5th column to the array above
clustered_array_with_percentile_summary_and_std = cluster_class.add_std(column = 5)
"""

"""
if location is None:
self.clustered_x = np.column_stack([self.clustered_x, column])
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location parameter could potentially overwrite information, right? E.g. if column 6 already contains some aggregate statistics and the user specifies location=6. Maybe a warning/error here would be useful?

"This function is deprecated and will be removed,",
"use the class cluster_and_pad with add_percentile_summary",
"instead for the same functionality",
)
Copy link
Collaborator

@pweigel pweigel Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's good to warn the user that this function is deprecated, this will print for every call (lots of printing). I am not sure what a good solution is here. Maybe it can just be removed immediately since the nodes no longer use it?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! I missed that.

@sevmag
Copy link

sevmag commented Nov 25, 2024

Idea for improving Handling of Summarization Features in the Clustering nodes

I’ve been thinking about how we can make the clustering (especially the upcoming new NodeDefiniton class lets call it Cluster) more flexible and easier to maintain, and I have a suggestion. Instead of having a function within the cluster_and_pad class to create new cluster features, we could create a new class called ClusterFeature that handles summarizing each feature. The class might look something like this:

class ClusterFeature(Model):
    """Base class for summarizing feature series per cluster."""

    @abstractmethod
    def summarisation_feature_label(self):
        """Returns the label for the summarization feature."""

    @abstractmethod
    def add_feature(self):
        """Calculates the summarization feature."""

For every summarization feature we could then create a new subclass of this. E.g. for Totals of input features, Time of first pulse, Cumulative value of input feature after specific times, ....

Then, the new NodeDefinition (which we’ll call Cluster) can just take a list of these ClusterFeature instances. For example:

features = [
    Percentiles(percentiles=[10, 30, 50, 70]),
    Totals(),
    CumulativeAfterT(times=[10, 50, 40]),
    TimeOfPercentiles(percentiles=[10, 30, 50, 70]),
    Std(),
    TimeOfFirstPulse(),
    AddCounts(),
]

Why This Might Be Useful:

  1. Cleaner Constructor: Instead of passing a ton of individual arguments to __init__(), we’d just pass a list of ClusterFeature instances. This will make the Cluster class constructor much cleaner and easier to work with.

  2. Modular and Flexible: Each summarization feature could be its own class, which makes it super easy to add new features down the line without cluttering the Cluster class.

  3. Easier Subsetting: You could easily calculate features on just a subset of the input data (like just the charge), instead of always calculating for everything at once.

  4. Consistency: By looping through the ClusterFeature list, we ensure that the feature labels and data always line up correctly, so there’s less risk of mistakes with mismatched ordering in the _get_indices_and_feature_names and _construct_nodes.

Potential Concerns:

  • Speed: There might be a performance hit because of the extra abstraction but I don't have a feeling for that
  • Configuration: We’ll want to make sure this change doesn’t mess with any of the config file stuff or other initialization logic. (I am not too familiar of how the yaml files are being used)

Let me know what you think! I think this could make things more modular and easier to extend, but I’d love to hear your thoughts. I am happy to help out in any way.

I am aware that this would not add much functionality and is more a style choice I guess. I'm already very happy with this PR and I can build on this as well to implement some more summarization features once its through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants