DiffuPy is a generalizable Python implementation of numerous label propagation algorithms. DiffuPy supports generic graph formats such as JSON, CSV, GraphML, or GML. Check out DiffuPy's documentation here.
If you use DiffuPy in your work, please consider citing:
Marín-Llaó, J., et al. (2020). MultiPaths: a Python framework for analyzing multi-layer biological networks using diffusion algorithms. Bioinformatics, 37(1), 137-139.
The latest stable code can be installed from PyPI with:
$ python3 -m pip install diffupy
The most recent code can be installed from the source on GitHub with:
$ python3 -m pip install git+https://github.com/multipaths/DiffuPy.git
For developers, the repository can be cloned from GitHub and installed in editable mode with:
$ git clone https://github.com/multipaths/DiffuPy.git
$ cd DiffuPy
$ python3 -m pip install -e .
The two required input elements to run diffusion using DiffuPy are:
- A network/graph (see Network-Input Formatting below).
- A dataset of scores/weights (see Scores-Input Formatting below).
To use DiffuPy, you can either:
- Use the command line interface (see below).
- Use the functions provided in diffupy.diffuse directly from Python:
from diffupy.diffuse import run_diffusion
# Data input and graph given as paths -> returned as a pandas DataFrame
diffusion_scores = run_diffusion('~/data/input_scores.csv', '~/data/network.csv').as_pd_dataframe()
# Data input and graph given as Python objects -> exported via as_csv
diffusion_scores = run_diffusion(input_scores, network).as_csv('~/output/diffusion_results.csv')
The default diffusion method is z, whose statistical normalization has previously been shown to outperform the alternatives. Further parameters can be provided to adapt the propagation procedure, such as choosing among the available diffusion methods or supplying a custom method function. See the diffusion Methods and/or Method modularity sections.
diffusion_scores_select_method = run_diffusion(input_scores, network, method='raw')

from networkx import pagerank  # custom method function
diffusion_scores_custom_method = run_diffusion(input_scores, network, method=pagerank)
You can also provide your own kernel method, or select one of the other kernels provided in kernels.py, by passing it as the kernel_method argument. By default, regularised_laplacian_kernel is used.
from diffupy.kernels import p_step_kernel  # custom kernel calculation function
diffusion_scores_custom_kernel_method = run_diffusion(input_scores, network, method='raw', kernel_method=p_step_kernel)
In short, method stands for the diffusion process method, and kernel_method for the kernel calculation method.
The following commands can be used directly from your terminal:
1. Run a diffusion analysis. The following command will run a diffusion method on a given network with the given data. More information here.
$ python3 -m diffupy diffuse --network=<path-to-network-file> --data=<path-to-data-file> --method=<method>
2. Generate a kernel with one of the seven methods implemented. The following command generates the regularised Laplacian kernel of a given graph. More information in the documentation.
$ python3 -m diffupy kernel --network=<path-to-network-file>
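For example, assuming a network file network.csv and an input scores file scores.csv in the working directory (file names are illustrative), the two commands could be invoked as:

$ python3 -m diffupy diffuse --network=network.csv --data=scores.csv --method=z
$ python3 -m diffupy kernel --network=network.csv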
Before running diffusion algorithms on your network using DiffuPy, take into account the supported graph and input data/score formats. Samples of supported input scores and networks can be found here.
The input is preprocessed and further mapped before the diffusion. See input mapping or the process_input docs for further details. The covered input formats and their preprocessing are described below.
You can submit your dataset in any of the following formats:
- CSV (.csv)
- TSV (.tsv)
- pandas.DataFrame
- List
- Dictionary
(check Input dataset examples)
You can either provide a path to a .csv or .tsv file:
from diffupy.diffuse import run_diffusion
diffusion_scores_from_file = run_diffusion('~/data/diffusion_scores.csv', network)
or pass a Python data structure directly as the input_scores parameter:
import pandas as pd

data = {'Node': ['A', 'B', ...],
        'NodeType': ['Metabolite', 'Gene', ...],
        ...
        }
df = pd.DataFrame(data, columns=['Node', 'NodeType', ...])

diffusion_scores_from_dict = run_diffusion(df, network)
Please ensure that the dataset minimally has a column 'Node' containing node IDs. You can also optionally add the following columns to your dataset:
- NodeType
- LogFC [*]
- p-value
[*] Log2 fold change
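For illustration, a minimal .csv input combining these columns could look as follows (values mirror the example tables further below; only the 'Node' column is strictly required):

Node,NodeType,LogFC,p-value
A,Gene,4,0.03
B,Gene,-1,0.05
C,Metabolite,1.5,0.001
D,Gene,3,0.07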
If you would like to submit your own networks, please ensure they are in one of the following formats:
- BEL (.bel)
- CSV (.csv)
- Edge list (.lst)
- GML (.gml or .xml)
- GraphML (.graphml or .xml)
- Pickle (.pickle): BELGraph object from PyBEL 0.13.2
- TSV (.tsv)
- TXT (.txt)
Minimally, please ensure each of the following columns are included in the network file you submit:
- Source
- Target
Optionally, you can add a third column, "Relation", to your network (as in the example below). If the relation between the Source and Target nodes is omitted, and/or if the directionality is ambiguous, either node can be assigned as the Source or the Target.
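As an illustration only, a network .csv file matching the example table shown further below would contain:

Source,Target,Relation
A,B,Increase
B,C,Association
A,D,Association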
If you have a precalculated kernel, you can provide it directly as the network parameter, without needing to also provide a graph object.
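Here is a minimal sketch of that workflow, assuming a networkx graph object named graph has already been loaded and that the kernel functions are importable from diffupy.kernels, as referenced above:

from diffupy.kernels import regularised_laplacian_kernel
from diffupy.diffuse import run_diffusion

# Precompute the kernel once from the background graph (assumed networkx object)...
kernel = regularised_laplacian_kernel(graph)

# ...and reuse it directly as the network parameter for one or more diffusion runs.
diffusion_scores = run_diffusion(input_scores, kernel)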
DiffuPy accepts several input formats, which can be codified in different ways. See the diffusion scores summary for more details on how the input labels are treated according to each available method.
1. You can provide a dataset with a column 'Node' containing node IDs.
| Node |
|------|
| A |
| B |
| C |
| D |
from diffupy.diffuse import run_diffusion
diffusion_scores = run_diffusion(dataframe_nodes, network)
Also as a list of nodes:
['A', 'B', 'C', 'D']
diffusion_scores = run_diffusion(['A', 'B', 'C', 'D'], network)
2. You can also provide a dataset with a column 'Node' containing node IDs as well as a column 'NodeType', indicating the entity type of the node to run diffusion by entity type.
| Node | NodeType |
|------|----------|
| A | Gene |
| B | Gene |
| C | Metabolite |
| D | Gene |
Also as a dictionary of the form type: list of nodes:
{'Gene': ['A', 'B', 'D'], 'Metabolite': ['C']}
diffusion_scores = run_diffusion({'Gene': ['A', 'B', 'D'], 'Metabolite': ['C']}, network)
3. You can also choose to provide a dataset with a column 'Node' containing node IDs as well as a column 'logFC' with their logFC. You may also add a 'NodeType' column to run diffusion by entity type.
| Node | LogFC |
|------|-------|
| A | 4 |
| B | -1 |
| C | 1.5 |
| D | 3 |
Also as a dictionary of the form node: score_value:
{'A': 4, 'B': -1, 'C': 1.5, 'D': 3}

diffusion_scores = run_diffusion({'A': 4, 'B': -1, 'C': 1.5, 'D': 3}, network)
Combining with point 2, you can also indicate the node type:
| Node | LogFC | NodeType |
|------|-------|----------|
| A | 4 | Gene |
| B | -1 | Gene |
| C | 1.5 | Metabolite |
| D | 3 | Gene |
Also as a dictionary of the form type: {node: score_value}:

{'Gene': {'A': 4, 'B': -1, 'D': 3}, 'Metabolite': {'C': 1.5}}

diffusion_scores = run_diffusion({'Gene': {'A': 4, 'B': -1, 'D': 3}, 'Metabolite': {'C': 1.5}}, network)
4. Finally, you can provide a dataset with a column 'Node' containing node IDs, a column 'logFC' with their logFC and a column 'p-value' with adjusted p-values. You may also add a 'NodeType' column to run diffusion by entity type.
| Node | LogFC | p-value |
|------|-------|---------|
| A | 4 | 0.03 |
| B | -1 | 0.05 |
| C | 1.5 | 0.001 |
| D | 3 | 0.07 |
This format is only accepted programmatically, as a pandas DataFrame.
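A minimal sketch of building such a DataFrame and running the diffusion (the values mirror the table above; network is assumed to be an already loaded graph or kernel):

import pandas as pd
from diffupy.diffuse import run_diffusion

df = pd.DataFrame({
    'Node': ['A', 'B', 'C', 'D'],
    'LogFC': [4, -1, 1.5, 3],
    'p-value': [0.03, 0.05, 0.001, 0.07],
})

diffusion_scores = run_diffusion(df, network)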
See the sample datasets directory for example files.
| Source | Target | Relation |
|--------|--------|----------|
| A | B | Increase |
| B | C | Association |
| A | D | Association |
You can also take a look at our sample networks folder for some examples.
Although it does not affect how you provide the input, the mapping of input entities onto the background network is relevant when assessing the diffusion process, since the input coverage determines which entity scores are actually diffused. In other words, only the entities whose labels match an entity in the network will be further processed for diffusion.
The diffusion run will report the mapping as follows:
Mapping descriptive statistics
wikipathways:
gene_nodes (474 mapped entities, 15.38% input coverage)
mirna_nodes (2 mapped entities, 4.65% input coverage)
metabolite_nodes (12 mapped entities, 75.0% input coverage)
bp_nodes (1 mapped entities, 0.45% input coverage)
total (489 mapped entities, 14.54% input coverage)
kegg:
gene_nodes (1041 mapped entities, 33.80% input coverage)
mirna_nodes (3 mapped entities, 6.98% input coverage)
metabolite_nodes (6 mapped entities, 0.375% input coverage)
bp_nodes (12 mapped entities, 5.36% input coverage)
total (1062 mapped entities, 31.58% input coverage)
reactome:
gene_nodes (709 mapped entities, 23.02% input coverage)
mirna_nodes (1 mapped entities, 2.33% input coverage)
metabolite_nodes (6 mapped entities, 37.5% input coverage)
total (716 mapped entities, 22.8% input coverage)
total:
gene_nodes (1461 mapped entities, 43.44% input coverage)
mirna_nodes (4 mapped entities, 0.12% input coverage)
metabolite_nodes (13 mapped entities, 0.38% input coverage)
bp_nodes (13 mapped entities, 0.39% input coverage)
total (1491 mapped entities, 44.34% input coverage)
To visualize the mapping coverage, you can also plot a heatmap view of the mapping (see views). To see how the mapping is performed within an input preprocessing pipeline, take a look at this Jupyter Notebook or see the process_input docs in DiffuPy.
The returned format is a custom Matrix type, with node labels as rows and a column with the diffusion score, which can be exported into the following formats:
diffusion_scores.to_dict()
diffusion_scores.as_pd_dataframe()
diffusion_scores.as_csv()
diffusion_scores.to_nx_graph()
DiffuPy is a scientific software that has been developed in an academic capacity, and thus comes with no warranty or guarantee of maintenance, support, or back-up of data.