Journal

Title

"Predicting disease modules using machine learning based Approach"

Introduction

Purpose of the research:
- Evaluate the Diamond Disease Module Detection algorithm
- Use machine learning approach to detect the module in a network

Background

I) Network Topological properties of network：[1]

Understanding Topological properties is essential to understand the previous algorithms for detecting modules

Local:
- Clustering coefficient: Fraction of interconnections
Star network/A highly centrailized network has a clustering coefficient of 0.

Let v be a node, and k_v denotes v's degree, and N_v denotes number of links between neighbors of v.

CC(v) = 2N_v/(k_v * (k_v-1))
- Shortest path: Longer the shortest path distance from the seed node, less relevant
- Centrality: Identify the most important nodes in a graph
Betweenness is a centrality measure, and it had been used in the Divisive Link Community algorithm.
- Modularity: Measure how well is it to group the network in certain way. Represents the possibility difference with the observed edges and the expected edges.
The original formula was introduced in this paper.

Then there are some modified forms introduced here.

Derivation between the above two forms are shown here
Global
- Modularity
- Clustering coefficient
CC(G) = avg_{v in G}(CC(v))

II) Previous Algorithm

There are handful of module detection algorithms in the previous studies, and some of the most commonly used ones are summarized as below

Modularity based
- Louvain algorithm [2]
- Greedy optimization [3]
- Divisive Link community [4]
Random flow based
- MCL [5]
Most recent
- Diamond (based on connectivity significance)
- this paper proposed a machine learning approach

III) Comparison for the previous work (use a table)

IV) Types of Modules in molecular networks

Specifically for protein-protein network, or generally for a biological network (regulatory and metabolic networks etc.), the following three types of modules should be taken care.

Topological modules

Can be found through: Compute and determine certain thing through the calculation of one or more topological network properties

Functional modules

A functional module (5) is defined as a group of genes or their products which are related by one or more genetic or cellular interactions.

A functional module has more relations between themselves rather than with others.

Integrative(meaning independent but biologically related data sets or different genetic and molecular networks) data analysis has been performed with correlation mapping and mean Pearson correlations.

Material and Methods

Results and Discussion

Conclusion

References:

[1] Mason, O., & Verwoerd, M. (2007). Graph theory and networks in biology. IET systems biology, 1(2), 89-119.

[2] Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10), P10008.

[3] Clauset, A., Newman, M. E., & Moore, C. (2004). Finding community structure in very large networks. Physical review E, 70(6), 066111.

[4] Newman, M. E., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review E, 69(2), 026113.

[5] Van Dongen, S. M. (2000). Graph clustering by flow simulation (Doctoral dissertation).

[6]Hartwell L.H., Hopfield,J.J., Leibler,S. and Murray,A.W. (1999) From molecular to modular cell biology. Nature, 402 (suppl.), C47–C52

[7] Tornow, S., & Mewes, H. W. (2003). Functional modules by relating protein interaction networks and gene expression. Nucleic acids research, 31(21), 6283-9.

Question 1: How to Reference those files that are in ppt format, youtube or github link? Question 2: If the paper refered to other papers, how should we write the refercen?

Journal

Network basics

Small world property: the average path length is usually log(n) , n is the number of nodes. Details
High clustering coefficient

Mathematical Preliminary (Details refer to this)

Shortest path / Geodesic distance: δ(u, v)

In our case, we are using unweighted graph. Commonly used instructions: nx.shortest_path_length[G,source,target],nx.all_shortest_path[G,source,target]. Please refer to this for the full shortest-path documentation
Clustering coefficient
Degree distribution: in scale-free networks, most nodes have relatively low degrees. Those nodes with unusually high degrees are called hubs. when building degree distribution, the following mathematical models are used: 1. Classic model 2. Barabasi-ALbert Model 3. Duplication-Divergence Model
Modularity (Q): Represents the possiblity difference with the observed edges and the expected edges. The original formula was introducted in this paper. Then there are some modified forms introduced here. Derivation between the above two forms are shown here Another matric form is introduced as the picture.
Centraility: a significant measurement to find out the most important nodes Have three types: Degree centraility, Closeness centraility, and Betweeness centraility
Markov Centraility: a concept that reflects the notion of how central a node t is in a network relative to a particular node r or all nodes R. For more details, please refer to this. The importance that node t relative to root node(s) R is denoted:

I(t|R) = 1/( (1/|R|) * Σr∈R mrt )

where mrt = Σ (end:∞ from:n=1) nf(n)

f(n) is the probalibity of transitioning from node i to node j. If there is an edge between node i and node j, f(n) = 1/doutj; Otherwise, f(n) = 0

The I(t|R) formula is derived from its orginal definition formula:

I(t|r) = Σ (end: all paths from t to r; from the first path) λ to the power of -|ith path length|

Here, we choose the path as K-short node-disjoint path.

A random walk is an example of Markov Chain using the transition matrix.

Module detection algorithm

Agglomerative Bottom-up
1. Hierarchical clustering:
Expressed in the Dendrogram form.

Merge pair of clusters that are closest. Please refer to this for more details. Or please refer to this

Disadvantage: Tendency to find only the cores of community while leave out the periphery
1. Greedy way:
Optimize modularity approach.

Its time complexity is O(mdlogn) with d is the depth of the dendrogram.

Start with each vertex being the sole member of a community of one, then repeatedly join together the two communities whoes amalgamation produces the largest increase in Q. After n-1 steps, we have only one (the difference with Louvain algorithm) community. Please refer to this for more details.

Algorithm details:

i) Use a matrix to store ΔQij

∆Qij =1/2m − kikj/(2m)² if i, j are connected. 0 otherwise (as it is going to be sparse).

Each row is represented as both a balanced binary tree(insertion/searching can be down in O(logn)time) and a max heap(largest ∆Qij can be found in linear time)

An array to store aij=ki/2m

ii) Select the largest ∆Qij from the matrix. Assume if i and j join together, we name the community as j community. Then we only need to update the jth row and column, delete ith column.

The updating(will benefit from balanced binary search tree) jth row and column (exactly same to row, we use row as example here) is according to this:

If k is connected to i and j: new ∆Qjk = ∆Qik+∆Qij

If k is connected to i not j: new ∆Qjk = ∆Qik-2ajak

If k is connected to j not i: new ∆Qjk = ∆Qjk-2aiak

iii) Repeat ii) until only one community remains.

Disadvantage: Using max heap can help us find the largest value quickly, it also requires more effort to maintain it, means reform the max heap after updating. In real case, we often use the simplified version, that is to say, ignore the process of reforming the max heap. Instead, directly search the next largest value in the 'max heap'.
Divisive Top-down (Details refer to this)

Remove the edges with the highest betweeness from the graph, and recalculate after each change.

The betweeness have three commonly-used definition:

i) Shortest parth betweeness: use BFS algorithm

ii) Random walk betweeness: How often on average random walks starting at vertex s will pass down a particular edge from vertex v to vertex w before reaching the target t. For more details, can refer to this.

(Probability matrix) M = (Adjacency matrix) A * ��(Diagonal matrix of the degree of each node) D^(-1)

To prevent the edge reaching t, we delete the t's row and column. Denoted as: M_t

Assume we start from s, take n steps to end up with node v (different with t). Denoted as: [M_tⁿ]_vs

Summing over all n (the mean number of times that a walk of any length) traverses the edge from v to w is: k_v^-1[I-M_t]_vs^-1. Denotes as V_v.

Similar for from w to v.

The random walk betweeness of edge (v,w) is |V_v-V_w|.

iii) Current-flow betweeness

Disadvantage: Demand heavy computational power, O(m²n) with m edges and n nodes, O(n³) for a spare graph.
Louvain algorithm (Modularity Optimazation) It will merge a large network into several communities according to which types of grouping will give the largest increase in Q (formula is give in the orginal paper). Do it repeatedly until get the required number of communities.

Original paper.

Code implementation here.
Markov Chain Algorithm (Unsupervised)

MCL is based on: if start traversing from a random node, it is more likely to traverse inside a cluster instead of traverse between clusters.

Flow is easier within dense regions than across sparse boundaries, however, in the long run this effect disappears. This process can be oberserved:

The process is done through several round of raising to a non-negative power, then re-normalizing.

Raising to a non-negative power is called: Expansion.

Re-normalizing is called: MCL Inflation, and this is in order to strengthens strong currents, and weakens already weak currents. The formula for inflation is shown as below:

Repeat the above process until it reaches a convergent state. Attractors will attract the postive values in the same row, and sometimes overlapping clusters will occur.

Disadvantage: For clusters with large diameters, MCL requires long expansion and low inflation, and requires many itereations as small changes are turbulent to the whole. Speed could be further improved by: setting small values to zero at beginning

Understanding purpose.

Original paper.

Implementation code

Detailed implementation code

Analysis on Protein-Protein networks
**Diamond Algorithm

Networkx Environment setup

Install Networkx

Install in terminal: $ pip install networkx.

Details refer to here.
Operation to the remote server through terminal
1. Copy a directory: scp -r <local/file/path> username@IPaddress:<remote path>
  
  E.g. scp -r ~/Workspaces/Module-Detection [email protected]:~
  
  E.g. scp -r ~/Workspaces/Module-Detection/diamond_self.py [email protected]:/home2/lime0400/Module-Detection
  
  Source
2. Edit a file inside the remote server:
  
  vim <file name>
  
  press i to enter insert mode
  
  after the change, press esc to exit from the insert mode
  
  key in :wq to save the changes
  
  Open the SSH access in MAC OS(local): `systemsetup -f -setremotelogin o
  
  Transfering files from remote server to local desktop: scp -r gene-network.png MCL\ Inflation.png [email protected]:/Users/limengyang/Module-Detection/
  
  Source
Setup Microsoft Remote desktop

Intall the version 10
File explanation

gene-network.tsv contains all nodes and edges that are in our experiment (extracted from the interactome.tsv file)

gene-disease0 contains the seed nodes of 70 diseases after small modification to feed the gene-visualization.py file

gene-disease contains the seed nodes of 70 diseases (original)

Recommended Structure

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
.idea		.idea
.ipynb_checkpoints		.ipynb_checkpoints
FunctionalFeatures		FunctionalFeatures
LearningKit		LearningKit
ML		ML
Re-feeding Using New Dataset		Re-feeding Using New Dataset
SequenceFeatures		SequenceFeatures
TopologicalFeatures		TopologicalFeatures
data		data
.DS_Store		.DS_Store
ModuleDetectionEvaluationMetrix.png		ModuleDetectionEvaluationMetrix.png
PaperTemplate.pdf		PaperTemplate.pdf
README.md		README.md
Ureca-Report-draft.docx		Ureca-Report-draft.docx
allFunctionalFeatures.csv		allFunctionalFeatures.csv
allSequenceFeatures.csv		allSequenceFeatures.csv
allTopoFeatures.csv		allTopoFeatures.csv
cleanFeatures.csv		cleanFeatures.csv
re_feeding_feature_generation.py		re_feeding_feature_generation.py
visualizer.py		visualizer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Title

Introduction

Background

Material and Methods

Results and Discussion

Conclusion

References:

Journal

Network basics

Mathematical Preliminary (Details refer to this)

Module detection algorithm

Networkx Environment setup

Recommended Structure

About

Releases

Packages

Contributors 2

Languages

LiMengyang990726/Module-Detection

Folders and files

Latest commit

History

Repository files navigation

Title

Introduction

Background

Material and Methods

Results and Discussion

Conclusion

References:

Journal

Network basics

Mathematical Preliminary (Details refer to this)

Module detection algorithm

Networkx Environment setup

Recommended Structure

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages