Why is the default setting to replace values in the diagonal by the maximum value of that row? #33

wuzhitian111 · 2024-11-12T10:17:14Z

Hi, everyone. I am learning to use mcl for a student project in a bioinformatics program. The project itself is related to protein clustering. My ultimate goal is to get a feeling for how mcl splits some of the largest graphs in the network. By the way, the input matrices are all symmetric, which also simplifies the task of testing mcl.

I have read the minimcl program and also the info page of both mcl and mcxio. Now, I am doing some experiments using small graphs. To my surprise, if I provide a graph with values along the diagonal, e.g.,

A A 10
A B  1
B B 10

then the diagonal values will be discarded and replaced by 1 (that is, the weight of the connection to the neighbour with the largest weight), unless I specify --discard-loops=n. I think it is quite normal for a protein not to have a neighbour as similar as itself, so --discard-loops=n must be specified to return a reasonable grouping in my experiments. That is, A and B form two different clusters rather than one.

However, I am still confused about why the default setting is --discard-loops=y. Could anyone explain it? Thanks.
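
To make the two-node example above concrete, here is a minimal pure-Python sketch of an expansion/inflation loop. It is not mcl's implementation: the function names, the fixed iteration count, the inflation value of 2 and the simplified way clusters are read off the limit matrix are all assumptions made for illustration, and `loops_to_row_max` only mimics the default loop handling as described in this post.

```python
def normalise_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: ordinary matrix squaring."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: element-wise power followed by column rescaling."""
    return normalise_columns([[x ** r for x in row] for row in m])


def loops_to_row_max(m):
    """Assumed default: overwrite each loop with the largest off-diagonal weight in its row."""
    n = len(m)
    out = [row[:] for row in m]
    for i in range(n):
        out[i][i] = max(m[i][j] for j in range(n) if j != i)
    return out


def toy_mcl(m, inflation=2.0, iterations=50):
    """Alternate expansion and inflation for a fixed number of rounds."""
    m = normalise_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    return m


def clusters(m, eps=1e-6):
    """Simplified reading of the limit: join nodes i and j whenever entry (i, j) stays positive."""
    n = len(m)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


graph = [[10.0, 1.0],   # A A 10, A B 1
         [1.0, 10.0]]   # B A 1,  B B 10

kept = clusters(toy_mcl(graph))
replaced = clusters(toy_mcl(loops_to_row_max(graph)))
print("loops kept:    ", kept)
print("loops replaced:", replaced)
```

In this sketch, keeping the loops of weight 10 drives the matrix towards the identity, so A and B end up in separate clusters. Replacing the loops by 1 gives a matrix with all entries 0.5 that no longer changes, so A and B stay together.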

micans (Maintainer) · 2024-11-12T13:22:04Z

Hello, this is a very good question. A few aspects come into it.

(1) mcl is not specifically made for protein clustering. The approach was chosen as not all application domains have a natural concept of self-similarity that is meaningful in relationship to between-object similarity.
(2) Additionally, my view is that loops in mcl are a bit special compared to other edge weights. In the absence of loops mcl may find clusters that are not connected in the graph, reflecting a graph region that has locally some aspect of bipartiteness. This may still happen even with loops. That said, the behaviour you describe (when not discarding loops) seems useful and in line with what you expect.
(3) I wonder if it is possible that self-similarities are lower than a neighbour similarity (unlike in your example). In that case, using the self-similarity could lead to the issue described above.

The way similarities are computed matters as well by the way; it seems that bit scores or normalised bit scores are best, see e.g. this analysis and this Orthofinder paper.

Whether using --discard-loops=n has much effect depends on your data and probably on the inflation value as well. I assume you can certainly use --discard-loops=n and it might have a (small?) beneficial effect in some cases. Perhaps this could become a general recommendation for everyone clustering sequences. However, this will require making sure that negative consequences such as the third item above can be ruled out.

You write that your ultimate goal is to get a feeling for how mcl splits some of the largest graphs in the network. What exactly does that mean? Is it about breaking up a large (super-)family of proteins into (still large) families?
I am interested in organising multiple mcl clusterings (at different levels of granularity) into a single hierarchical tree - for this one can use RCL, which is now part of the mcl software distribution.

wuzhitian111 (Author) · 2024-11-13T10:34:40Z

Hi, it was a pleasant surprise to receive your reply. Now I understand why --discard-loops=y is the default option.

Please allow me to explain my goal a bit: I actually want to know more about some of the largest components in the graph. The graph is not fully connected because the similarity between many protein pairs is so low that the corresponding edges have been removed. I already have this graph at hand. Before feeding it to mcl, it seems to me that there are several different types of components:

(1) Some components contain only one protein from each species, and these proteins are almost identical. They are likely orthologs of each other and should not be split by mcl anyway, as the entries in the input matrix would be essentially identical;
(2) Some other components are likely to be a mix of two orthologous groups that are connected by some spurious edges. In this case, mcl may be fine-tuned to detect the individual clusters during its squaring-inflation-stochastic cycle.
(3) Still other components are quite large (with hundreds of vertices), and it is difficult to tell what kind of proteins such a component contains, whether superfamilies or paralogs. In this case, I would like to know how mcl clusters them under different settings. I guess the RCL program you pointed to will be helpful as well (I have not tried it yet).

I am not sure whether I have explained the goal clearly enough, but I still have three other questions regarding the usage of the mcl program. I would very much appreciate advice on any of them.

My first question: how should I choose between raising each entry to a power before the first squaring (mcxio power) and changing the inflation after the first matrix squaring operation? I feel that they may have similar effects, but I am not sure what the differences are.
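
As a purely mathematical side note on this question, the sketch below checks that raising the raw weights to a power r and then rescaling the columns gives the same matrix as applying inflation r to the already column-normalised matrix. So the two operations differ only in where they sit relative to the squaring steps. The helper names and the small matrix are hypothetical, loop handling is ignored, and nothing here is taken from mcl's code or describes the order in which mcl applies its own transformations.

```python
def normalise_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def elementwise_power(m, r):
    """Raise every entry to the power r (no rescaling)."""
    return [[x ** r for x in row] for row in m]


# Hypothetical symmetric similarity matrix without loops.
raw = [[0.0, 4.0, 1.0],
       [4.0, 0.0, 2.0],
       [1.0, 2.0, 0.0]]
r = 2.0

# Option 1: power on the raw weights, then make the matrix stochastic.
power_then_normalise = normalise_columns(elementwise_power(raw, r))

# Option 2: make the matrix stochastic first, then inflate with the same r.
normalise_then_inflate = normalise_columns(
    elementwise_power(normalise_columns(raw), r))

max_diff = max(abs(a - b)
               for row_a, row_b in zip(power_then_normalise, normalise_then_inflate)
               for a, b in zip(row_a, row_b))
print(max_diff)
```

The two results agree up to rounding, because the column sums of the first normalisation cancel out when the columns are rescaled again.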

My second question: given the different nature of the components I described above, is it reasonable to use different transformations/inflation values for different components? I believe that applications like Orthofinder apply a single inflation value to all components.

My third question: does mcl still represent numbers in 32-bit floating point? In our program, we transform the weight into a score between 0 and 100, which is a fraction represented in the 64-bit standard format (IEEE 754, 53 bits of significand precision). If mcl only represents numbers in 32 bits, then it is unnecessary for us to use the extra space.

micans (Maintainer) · 2024-11-13T12:56:31Z

(1) I would suggest, initially, not changing inflation after the first iteration, and I would also not pre-take a power, but rather only use different inflation values (the -I parameter). This is just to keep life simple. Those other options were intended to give some flexibility and research options, but I think doing the simple thing is best. Alternatively, time could be spent on finding the best network and edge creation methods that best encapsulate true relationships and suppress noise.
(2) I think it is certainly reasonable to use different inflation values for different components. I would be interested in trying RCL. Dealing with hierarchical input makes life more complicated, so it depends on time and interest whether it's an option for you. As RCL is new it adds an element of unknown/experimental, so my default expectation is it's not a good option for most people. Should you want to try it then I'm happy to answer questions and address issues. When not using RCL of course you can still use different inflation values. In this discussion I shared some thoughts about picking inflation value using the concept of 'cluster stability' across different values. What is the size of your graph - how many nodes and edges (to get a sense of what are reasonable things to try)?
(3) mcl does use 32-bit floating point by default, but can be compiled to use (64-bit) double. The latter mode has probably seen less testing (to make sure double precision is maintained everywhere it is needed). I would be a little bit surprised if it makes a qualitative difference.
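
To get a feeling for what 32-bit storage means for a score between 0 and 100, here is a small sketch using only the Python standard library. The score value and the helper name `to_float32` are made up for illustration, and the snippet says nothing about mcl's internals; it only shows the rounding that any IEEE 754 single-precision value undergoes.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


score = 87.123456789012          # hypothetical score on the 0..100 scale
stored = to_float32(score)       # what survives in single precision
abs_error = abs(stored - score)

print(stored, abs_error)
```

Single precision keeps a 24-bit significand, roughly 7 significant decimal digits, so digits beyond that in a 64-bit score are lost when the value is stored as a 32-bit float.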

As you looked into it, perhaps you could see whether --discard-loops=n improves your results in some way (for example, perhaps it improves the stability I mentioned above).

1 reply

wuzhitian111 (Author) · Nov 27, 2024

Thanks for your valuable suggestions! I have tried to learn a bit more theory before working with real data. I was really surprised when I realized that the first part of the RCL paper is really about comparing different set partitions, and the idea can also be found in section 9.3 of your PhD thesis. For the second part related to comparing each element in different partitions, I think I still need some more time to experiment and comprehend.

I am still trying to understand how adding loops increases clustering granularity. In "paired tests", it is quite clear that adding loops results in more clusters.

# inflation 1.2
  d=3954   d1=449   d2=3505   nn=225707  c1=12209  c2=11646    n1:keep-loops n2:discard-loops=y

# inflation 1.5
  d=8828   d1=1665  d2=7163   nn=225707  c1=20803  c2=16094    n1:keep-loops n2:discard-loops=y

# inflation 2
  d=11037  d1=531   d2=10506  nn=225707  c1=27090  c2=18193    n1:keep-loops n2:discard-loops=y

The main difference is that there are many more "singletons" when loops are kept. For example, with inflation=2 there are 1330 singletons in the clustering when discarding loops, but 9589 singletons if the loops are retained. This basically explains why d2=10506.
However, I still need to see whether these singletons are biologically meaningful. Since my test dataset contains animal species that are evolutionarily distant (fruit fly, human, fugu...), adding loops may also make it easier for real orthologs to be split up. In OrthoMCL, there is a compensation for the distance between input proteomes, but that goes beyond mcl :)
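
For readers who want to reproduce numbers like these on a toy case, here is a small sketch. It assumes that d, d1 and d2 above are the split/join distance and its two parts, where d1 (d2) is the distance from the first (second) clustering to their greatest common subclustering; this reading is consistent with d = d1 + d2 in all three lines, but it is an assumption. The function names and the four-protein example are hypothetical.

```python
def projection(a, b):
    """Sum, over the clusters of a, of the largest overlap with any cluster of b."""
    return sum(max(len(x & y) for y in b) for x in a)


def split_join(a, b):
    """Return (d, d1, d2) for two clusterings of the same node set."""
    n = sum(len(x) for x in a)
    d1 = n - projection(a, b)   # moves needed to refine a into the common subclustering
    d2 = n - projection(b, a)   # moves needed to refine b into the common subclustering
    return d1 + d2, d1, d2


# Toy clusterings: keeping loops splits off two singletons.
keep_loops = [{"p1"}, {"p2"}, {"p3", "p4"}]
discard_loops = [{"p1", "p2"}, {"p3", "p4"}]

d, d1, d2 = split_join(keep_loops, discard_loops)
print(d, d1, d2)
```

In the toy case the finer clustering is a subclustering of the coarser one, so d1 is 0 and all of the distance sits in d2. That is the same pattern as the large d2 and small d1 in the paired tests, where extra singletons account for most of d2.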

micans (Maintainer) · 2024-11-29T10:58:51Z

Loops will generally increase granularity (more small clusters), and getting more singletons is one aspect of that.
Regarding whether these singletons are biologically meaningful, my starting assumption is that they are not. There are two issues here in my view:

(1) how do MCL and other network clustering algorithms perform
(2) how well does the input network capture evolutionary relationships

There are a lot of potential intricacies; a tree-like structure seems most appropriate for capturing evolutionary relationship, but is this correct? Let's assume it is. With (2) we are already dealing with a quite obscured and simplified view of what we think is reality. We just have a lot of pairwise numerical scores between proteins. (As an aside, I'm a bit rusty in how Orthofinder, OrthoMCL and other algorithms deal with paralogues vs orthologues. Is it just using best-reciprocal-hit type edge filtering?).
These thoughts are not well-connected, but my message is that the networks we are dealing with are both noisy (too much/wrong information) and simplified (too little information). Then clustering methods that generate flat clusterings provide just one particular slice/view of that data (more simplification).

Clustering of these networks only generates a quantified truth I think. It might be "at this level of granularity the co-clustering of proteins seems to be right about 90% of the time" (I don't know, just making up an example). I assume a lot of analysis exists about this aspect ...

Clustering is good for a broad/overall view; it may even be good at doing this at different levels of granularity. It may also generate specific clusters that are actually quite good (a nice set of orthologues). To a very large extent this depends on the construction of the input graph and the quality of the input. If it is within scope, one might look at singletons and their connections in the input graph to see whether it makes sense that they end up alone or in small clusters. Is there something particular about them? Are their sequences shorter? Do they have more intrinsically disordered regions? Et cetera. And it is of course possible that a protein truly has few or no orthologues.

A plot I've found sometimes useful to QC data is the scatter plot where each point is a protein p, where x = #neighbours(p) and y = median-edge-weight(p). You could additionally colour nodes based on the size of the cluster they are in.
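
The coordinates for such a plot can be computed along these lines. This is only a sketch under assumptions: the input is a list of (protein, protein, weight) triples with each undirected edge listed once, loops are left out of both counts, and the function name `qc_points` is made up. The resulting pairs can be passed to any plotting library.

```python
from collections import defaultdict
from statistics import median


def qc_points(edges):
    """Map each protein p to (number of neighbours, median edge weight)."""
    weights = defaultdict(list)
    for a, b, w in edges:
        if a == b:
            continue  # assumption: loops are not counted as neighbours
        weights[a].append(w)
        weights[b].append(w)
    return {p: (len(ws), median(ws)) for p, ws in weights.items()}


# Hypothetical edge list in the same "node node weight" shape as the example at the top.
edges = [("A", "B", 1.0), ("A", "C", 3.0), ("A", "A", 10.0)]
points = qc_points(edges)

for protein, (x, y) in sorted(points.items()):
    print(protein, x, y)
```

Colouring by cluster size would need one extra lookup per protein from the clustering output, which is not shown here.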
