Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dadac integration alt #107

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

lykos98
Copy link

@lykos98 lykos98 commented Jan 15, 2024

Proposed changes

Optimization of ADP procedure and explicit parallelization of many tasks through openmp.

Types of changes

In the last year (or so) I have been working on my implementation of ADP due to the fact the original one, provided in dadapy was not able to scale well for my use case. Together with the help of @alexdepremia and @lucatornatore I developed a very small library in C (dadaC) which implemented the ADP procedure with the following optimizations:

  • In the first heuristic, in the second part where candidate centers are asked to not be part of the neighborhood of a point of higher density the loop on the centers has been removed allowing the procedure to scale better at a cost of a O(n) in memory to keep track of the changes.
  • In the third heuristic merges between the clusters can be computed ahead of time then sorted in increasing density ordered and processed in this way. The motivation for this optimization is that at the beginning of the procedure it is possible to compute all the borders that will lead to a merging, and later then in the process the number of "merging candidates" can only decrease. The only thing to carefully take care of is the labeling of the clusters and the topology of the borders, and to check for each one of them if later on they still satisfy the merging condition. This optimization is particularly useful since there is no need to explore at each iteration the entire topology of the borders

The implementation has been tested against the one in dadapy and preserves a 1 to 1 match of the results in terms of cluster labels, border indices and densities. A detailed discussion, tests, and benchmarks can be found in the repository of my implementation https://github.com/lykos98/dadaC .

Integration in the library requires minimal modifications of the code to allow the implementation in C to interface with yours, actually it requires to add dadaC as dependency, since dadaC requires to build a small .so file that is then used by an interface in python built with ctypes. I kindly ask you if this can be in line with your development, and I am open to modify my implementation to comply with your conventions if needed.

Thank you for your attention,
let me know your opinion.
F

@AldoGl
Copy link
Collaborator

AldoGl commented Jan 19, 2024

Wow!!
Thank you @lykos98, I had a look at your dadaC repo, at the benchmarks, and at you your PR. It all seems excellent work to me. I was considering asking you to contribute the entire dadaC codebase in dadapy, but probably is better to keep them separate as you originally designed.
I would ask you to make the tests pass, then I will have a closer look as soon as I have time before merging but it should not take much

@lykos98
Copy link
Author

lykos98 commented Jan 22, 2024

Goodafternoon,
I fixed the issue on the tests and run them in my fork, everything seems fine. The issue was linked to a leftover, I overlooked in the diff.

I think that ideally the codebase of dadac should be included but when I tried to perform that kind of integration it was not easy to deal with the inclusion of the .so file needed for dadac to run. In the next future moreover, I think I will also ship dadac as a proper package with precompiled binaries.

Thank you!
F

@codecov-commenter
Copy link

codecov-commenter commented Feb 21, 2024

Codecov Report

Attention: Patch coverage is 90.74074% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 81.17%. Comparing base (f0b3fdd) to head (d065ae1).

Files Patch % Lines
dadapy/clustering.py 90.74% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #107      +/-   ##
==========================================
+ Coverage   81.07%   81.17%   +0.10%     
==========================================
  Files          17       17              
  Lines        3006     3034      +28     
==========================================
+ Hits         2437     2463      +26     
- Misses        569      571       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants