Why is the default setting to replace diagonal values with the maximum value of that row? #33
-
Hello, this is a very good question. A few aspects come into it.
The way similarities are computed matters as well, by the way; it seems that bit scores or normalised bit scores work best, see e.g. this analysis and this Orthofinder paper. It depends on your data, and probably on the inflation value as well, whether using

You write: "My ultimate goal is to have a feeling about how mcl split some largest graphs in the network". What exactly does that mean? Is it about breaking up a large (super-)family of proteins into (still large) families?
-
Hi, it was quite a surprise to receive your reply. Now I understand why

Please allow me to explain my goal a bit: I actually want to know more about some of the largest components in the graph. The graph is not fully connected because the similarity between many protein pairs is so low that those edges have been removed. Now I already have a graph at hand. Before feeding it to mcl, it seems to me that there are several different types of components
I am not sure whether I have explained the goal clearly enough, but I still have three other questions regarding the usage of the mcl program. I would very much appreciate advice on any of them.

My first question: how should I choose between taking the power of each entry before the first squaring (mcxio power) and changing the inflation applied after the first matrix squaring operation? I feel they may have similar effects, but I am not sure what the differences are.

My second question: given the different nature of the components I described above, is it reasonable to use different transformations/inflation values for different components? I believe that applications like Orthofinder apply a single inflation value to all the components.

My third question: does mcl still represent numbers in 32-bit floating point? In our program, we transform the weight into a score between 0 and 100, stored as a 64-bit float (IEEE 754, 53 bits of significand precision). If mcl only represents numbers in 32 bits, then it is unnecessary for us to use the extra space.
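On the first question, a small sketch in plain Python (not mcl itself) may help to see how the two knobs relate. It assumes the textbook definition of inflation, i.e. an element-wise power followed by rescaling each column to sum to 1, and it ignores whatever mcl does with loops. The weight matrix and all function names are made up for illustration. Under those assumptions, raising the input weights to a power before the matrix is made column-stochastic gives the same matrix as one inflation step applied before the first squaring.

```python
def normalise_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def elementwise_power(m, r):
    """Raise every entry to the power r (no rescaling)."""
    return [[x ** r for x in row] for row in m]


def inflate(m, r):
    """Inflation as usually defined: element-wise power, then column rescaling."""
    return normalise_columns(elementwise_power(m, r))


# Hypothetical symmetric similarity matrix, loops included.
weights = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]

# (a) power applied to the raw input, then made stochastic
pre_powered = normalise_columns(elementwise_power(weights, 2))

# (b) input made stochastic first, then one inflation step
inflated_first = inflate(normalise_columns(weights), 2)

same = all(
    abs(pre_powered[i][j] - inflated_first[i][j]) < 1e-12
    for i in range(3)
    for j in range(3)
)
print(same)
```

If that reading is right, the practical difference is where and how often the operation is applied: the input power acts once on the starting matrix, while inflation is repeated after every squaring step.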
-
As you looked into it, perhaps you could see whether
-
Loops will generally increase granularity (more small clusters), and getting more singletons is one aspect of that.

(1) how do MCL and other network clustering algorithms perform

There are a lot of potential intricacies; a tree-like structure seems most appropriate for capturing evolutionary relationships, but is this correct? Let's assume it is. With (2) we are already dealing with a quite obscured and simplified view of what we think is reality. We just have a lot of pairwise numerical scores between proteins. (As an aside, I'm a bit rusty on how Orthofinder, OrthoMCL and other algorithms deal with paralogues vs orthologues. Is it just best-reciprocal-hit type edge filtering?)

Clustering of these networks only generates a quantified truth, I think. It might be "at this level of granularity the co-clustering of proteins seems to be right about 90% of the time" (I don't know, just making up an example). I assume a lot of analysis exists about this aspect ... Clustering is good for a broad/overall view; it may even be good at doing this at different levels of granularity. It may also generate specific clusters that are actually quite good (a nice set of orthologues). To a very large extent this depends on the construction of the input graph and the quality of the input.

If it is within scope, one might look at singletons and their connections in the input graph to see whether it makes sense that they end up alone or in small clusters. Is there something particular about them? Are their sequences shorter? Do they have more intrinsically disordered regions? Et cetera ... and it is possible of course that a protein truly has few or no orthologues. A plot I've sometimes found useful to QC data is the scatter plot where each point is a protein p, where
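The suggestion to look at singletons and their connections in the input graph could be sketched roughly as follows. This is only an illustration: the cluster list, the edge list, and the function name `singleton_report` are hypothetical, and reading real mcl output or the input graph into these structures is left out.

```python
def singleton_report(clusters, edges):
    """Summarise the input-graph neighbourhood of every singleton cluster.

    clusters: list of lists of node names
    edges:    list of (u, v, weight) tuples from the undirected input graph
    """
    neighbours = {}
    for u, v, w in edges:
        if u == v:
            continue  # loops say nothing about connections to other nodes
        neighbours.setdefault(u, {})[v] = w
        neighbours.setdefault(v, {})[u] = w

    report = {}
    for cluster in clusters:
        if len(cluster) != 1:
            continue
        node = cluster[0]
        nb = neighbours.get(node, {})
        report[node] = {
            "degree": len(nb),
            "max_weight": max(nb.values(), default=0.0),
            "total_weight": sum(nb.values()),
        }
    return report


# Made-up example: p4 is weakly attached to a larger cluster, p5 is isolated.
clusters = [["p1", "p2", "p3"], ["p4"], ["p5"]]
edges = [
    ("p1", "p2", 80.0),
    ("p2", "p3", 75.0),
    ("p3", "p4", 12.0),
    ("p1", "p4", 9.0),
]

report = singleton_report(clusters, edges)
for node, stats in sorted(report.items()):
    print(node, stats)
```

A singleton with many but uniformly weak edges and one with no edges at all probably call for different explanations, which is the kind of distinction such a table makes visible.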
-
Hi, everyone. I am learning to use mcl for a student project in a bioinformatics program. The project itself is related to protein clustering. My ultimate goal is to get a feeling for how mcl splits some of the largest graphs in the network. By the way, the input matrices are all symmetric, which simplifies testing mcl as well.
I have read the minimcl program and also the info pages of both mcl and mcxio. Now I am doing some experiments using small graphs. To my surprise, if I provide a graph with values along the diagonal, e.g.,
then the diagonal values will be discarded and replaced by 1 (the so-called connection to a neighbour with the largest weight), unless I specify discard-loop=n. I think it is quite normal for a protein not to have a neighbour as similar as itself, so discard-loop=n must be specified to return reasonable groupings in my experiments. That is, A and B form two different clusters rather than one.

However, I am still confused about why the default setting is discard-loop=y. Could anyone explain it? Thanks.
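The effect described in this post can be reproduced with a toy re-implementation of the MCL iteration in plain Python. This is a minimal sketch and not mcl's code: the two-node graph, the loop weight of 10, the fixed iteration count, and all function names are assumptions for illustration. Replacing each loop with the largest off-diagonal weight in its row stands in for the behaviour described above for discard-loop=y.

```python
def normalise_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: element-wise power followed by column rescaling."""
    return normalise_columns([[x ** r for x in row] for row in m])


def replace_loops_with_max(m):
    """Overwrite each diagonal entry with the largest off-diagonal weight in its row."""
    n = len(m)
    out = [row[:] for row in m]
    for i in range(n):
        out[i][i] = max(m[i][j] for j in range(n) if j != i)
    return out


def mini_mcl(weights, inflation=2.0, iterations=50):
    """Alternate expansion and inflation for a fixed number of rounds."""
    m = normalise_columns(weights)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    return m


def clusters(m, eps=1e-6):
    """Read clusters off the limit matrix: each attractor row lists its members."""
    n = len(m)
    found = set()
    for i in range(n):
        if m[i][i] > eps:  # node i is an attractor
            found.add(frozenset(j for j in range(n) if m[i][j] > eps))
    return sorted(sorted(c) for c in found)


# Hypothetical graph: A (0) and B (1) joined by weight 1, each with a loop of weight 10.
loops_kept = [[10.0, 1.0],
              [1.0, 10.0]]

print(clusters(mini_mcl(loops_kept)))                          # loops as given
print(clusters(mini_mcl(replace_loops_with_max(loops_kept))))  # loops replaced
```

In this toy case the heavy loops keep A and B apart, while the replaced loops put them in a single cluster. With milder loop weights the two nodes can still merge even when the loops are kept, so the outcome depends on how the loop weight compares with the edge weights and on the inflation value.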