
Increase speed for KNN, added MI and Entropy examples for multidimensional variables #54

Merged (13 commits) on Aug 30, 2024

Conversation

@chrisferreyra13 (Contributor) commented Jul 18, 2024

Thanks for contributing a pull request!

Please be aware that we are a loose team of volunteers so patience is
necessary. Assistance handling other issues is very welcome. We value
all user contributions, no matter how minor they are. If we are slow to
review, either the pull request needs some benchmarking, tinkering,
convincing, etc. or more likely the reviewers are simply busy. In either
case, we ask for your understanding during the review process.

Again, thanks for contributing!

Reference issue

None, new examples.

What does this implement/fix?

  • This PR adds two new example scripts comparing MI/entropy estimators on Gaussian variables with multiple dimensions.
  • I also changed minsize to 1 in the InfoTot estimator, so that pairwise mutual information is estimated between every x feature and the target.
  • I increased computational speed by almost 5x for entropy_knn and mi_knn.
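(An illustrative aside, not code from the PR.) Speed-ups of this kind for kNN estimators usually come from replacing a per-sample Python loop over neighbour distances with a single broadcasted NumPy computation. A minimal sketch, with hypothetical function names that are not HOI's actual internals:

```python
import numpy as np

def knn_distances_loop(x, k):
    """k-th nearest-neighbour (Chebyshev) distance per sample, one sample at a time."""
    n = x.shape[0]
    out = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i]).max(axis=1)  # distances from sample i to all samples
        out[i] = np.sort(d)[k]            # index k skips the zero self-distance
    return out

def knn_distances_broadcast(x, k):
    """Same quantity for all samples at once via a broadcasted (n, n, d) temporary."""
    d = np.abs(x[:, None, :] - x[None, :, :]).max(axis=-1)  # (n, n) pairwise distances
    return np.sort(d, axis=1)[:, k]
```

The broadcasted version trades the Python-level loop for one large temporary array, which is the RAM trade-off raised later in the review.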

Additional information

The kernel estimator should be added to the MI example after the following fix.

@chrisferreyra13 chrisferreyra13 changed the title MI and Entropy examples for multidimensional variables Increase speed for KNN, added MI and Entropy examples for multidimensional variables Jul 31, 2024
@chrisferreyra13 (Contributor Author)

Hey, sorry, GitHub automatically included my new entropy/MI kNN change in this PR. Let me know if I should remove it.

@EtienneCmb (Collaborator) left a comment

Hey @chrisferreyra13,

Thanks for the PR. I made several comments, most of which involve minor changes.

Best,

Collaborator

Please remove this file. You can add .vscode/* to the .gitignore.

Contributor Author

Sorry! I didn't realize it was included.

Collaborator

@chrisferreyra13 I guess it's faster because you use broadcasting rules. Can you check that it doesn't consume too much RAM with a larger number of nodes? Same question for the MI.
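A quick back-of-the-envelope for this RAM concern: a broadcasted pairwise computation materialises an (n, n, d) temporary, so memory grows quadratically with the number of samples. A tiny hypothetical helper (not part of the PR) makes the scaling explicit:

```python
def broadcast_intermediate_mb(n, d, itemsize=8):
    """Size in MiB of an (n, n, d) temporary of `itemsize`-byte floats."""
    return n * n * d * itemsize / 1024**2
```

For example, with n = 5000 samples and d = 10 features in float64, the temporary alone is roughly 1.9 GiB.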

Collaborator

Also, have you checked that both implementations give exactly the same results?

Contributor Author

Yes, exactly. The motivation was that for >5k samples the method became really slow.

I tried to measure peak memory consumption with the tracemalloc package, using the get_traced_memory function. I don't trust my results hahaha 😅 because the new method should definitely consume more memory! Do you know another way to measure memory? I attach the corresponding plots.
I am following this approach (similarly for total information):

import tracemalloc

import numpy as np
from hoi.metrics import TC  # assuming HOI's total-correlation metric

n_samples = 5_000            # illustrative sizes, not the exact ones used
n_features = range(5, 25, 5)
mem_usg_knn_old = []

for nf in n_features:
    x = np.random.rand(n_samples, nf)
    # inject redundancy between nodes (0, 1, 2)
    x[:, 1] += x[:, 0]
    x[:, 2] += x[:, 0]
    model = TC(x)

    tracemalloc.start()
    model.fit(method="knn_old", minsize=3, maxsize=nf)
    # get_traced_memory() returns (current, peak); record the peak in MiB
    mem_usg_knn_old.append(tracemalloc.get_traced_memory()[1] / 1024**2)
    tracemalloc.stop()
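(A suggestion from the editor, not from the PR.) One possible answer to the question about other ways to measure memory: tracemalloc may miss buffers that libraries allocate outside Python's memory APIs, so an OS-level peak resident set size from the standard-library resource module (Unix-only) can serve as a cross-check:

```python
import resource
import sys

def peak_rss_mb():
    """Peak RSS of the current process in MiB (Linux reports KiB, macOS bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak /= 1024  # macOS: bytes -> KiB
    return peak / 1024  # KiB -> MiB
```

Since ru_maxrss is a high-water mark that never decreases, per-configuration measurements are best taken in separate subprocesses.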


I also checked whether they give the same results. They do, but comparing the values by eye, the last decimals differ; I guess this is just a matter of the numerical precision of the functions used. I attach plots.
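The "last decimals differ" observation is exactly what tolerance-based comparison is for: check the two implementations with np.allclose rather than exact equality. A small illustrative check (the values here are made up):

```python
import numpy as np

a = np.array([1.23456789012345, 2.34567890123456])  # e.g. old implementation
b = a + 1e-12                                       # e.g. new implementation

assert not np.array_equal(a, b)              # bitwise equality fails on last decimals
assert np.allclose(a, b, rtol=1e-9, atol=1e-9)  # tolerance-based check passes
```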


If the method turns out to be memory-heavy in practice, we can keep an option to choose between the slow and fast methods. That said, I expect people to use HOI on a cluster for real analyses, where RAM should be fine.

Let me know!

@EtienneCmb EtienneCmb merged commit 4a4aab3 into brainets:main Aug 30, 2024
8 checks passed
@EtienneCmb (Collaborator)

Thank you @chrisferreyra13. Indeed, it's surprising that the new method doesn't use more memory. I merged your modifications; if we/you/someone finds memory issues in the future, we can come back to this PR. Best,
