Adding Directionality #120

Merged

merged 50 commits into master from direction on Dec 3, 2023

Changes from 5 commits

50 commits
af6a590
documented research per algorithm as docstrings per class for the alg…
ntalluri Aug 23, 2023
3f173b5
added directionality for generate inputs
ntalluri Aug 29, 2023
a214cc8
Merge branch 'master' of github.com:ntalluri/spras into direction
ntalluri Aug 30, 2023
1e7fef9
added in parse_output directionality
ntalluri Aug 30, 2023
4fdaadb
removed directed from config file, testing with analysis all false
ntalluri Aug 30, 2023
cae057a
made updates to code and attempted to add testing for interactome
ntalluri Sep 4, 2023
319a5d3
precommit formatting
ntalluri Sep 5, 2023
bb55ee0
cleaned up code and finished interactome test
ntalluri Sep 5, 2023
7aed4ee
updated util to deal with the idea if someone is using an old config …
ntalluri Sep 5, 2023
0e15534
current ml repairs
ntalluri Sep 5, 2023
75be8b8
ml post processing pre-commit and config file
ntalluri Sep 6, 2023
550614a
fixed testing
ntalluri Sep 12, 2023
760b1b7
Merge branch 'master' into direction
ntalluri Sep 18, 2023
9e29ceb
updated summary.py/associated files and tests. updated interactome.py
ntalluri Sep 18, 2023
0b47bd9
Merge branch 'direction' of github.com:ntalluri/spras into direction
ntalluri Sep 18, 2023
713279a
pre-commit
ntalluri Sep 18, 2023
1198c9a
added generate inputs test, cleaned up code
ntalluri Sep 19, 2023
39b4748
Resolve merge conflicts
agitter Sep 22, 2023
af1b849
Update EGFR network with edge directions
agitter Sep 22, 2023
a34f832
added back graphspace to work for directed and undirected graphs only
ntalluri Sep 26, 2023
9b551d6
precommit
ntalluri Sep 26, 2023
b314751
automate test_prepare_inputs
ntalluri Sep 29, 2023
32be496
renamed the tests for creating the inputs
ntalluri Oct 4, 2023
57599f7
fix break in test
ntalluri Oct 4, 2023
e991e0a
added parse_output tests and still fixing generate inputs
ntalluri Oct 4, 2023
dd6c899
precommit
ntalluri Oct 4, 2023
ae276f5
added more information to step 5 of contributing guide
ntalluri Oct 4, 2023
df7da32
add cytoscape into workflow
ntalluri Oct 4, 2023
e5be5a0
cleaning up generate inputs/parse outputs test suites
ntalluri Oct 17, 2023
09c2913
clean up gen inputs and parse outputs
ntalluri Oct 17, 2023
8dd0990
Merge with master
agitter Oct 18, 2023
d9b0f0b
Fix ruff errors on GitHub actions
agitter Oct 18, 2023
6f631ef
made changes based on review
ntalluri Oct 26, 2023
f651376
made more changes based on review
ntalluri Oct 26, 2023
b764129
fixed error
ntalluri Oct 27, 2023
8eb9ebb
some of the comments
ntalluri Nov 27, 2023
5c0939f
more comments
ntalluri Nov 27, 2023
884fe4f
precommit
ntalluri Nov 27, 2023
19a4ad4
more comments
ntalluri Nov 27, 2023
99e6e6f
more comments resolved
ntalluri Nov 28, 2023
e8735ea
resolving parse output comments
ntalluri Nov 29, 2023
96c0482
precommit
ntalluri Nov 29, 2023
83262c7
add check to dataset.py
ntalluri Dec 1, 2023
7bc9e3e
Rename .csv to .txt in test directory
agitter Dec 3, 2023
d918c00
Add tests for invalid 4th edge column
agitter Dec 3, 2023
cd625d4
Remove self-edges from EGFR data
agitter Dec 3, 2023
ee58069
Update Cytoscape wrapper for directed edges
agitter Dec 3, 2023
da655b3
Remove directed from EGFR config and run more algs
agitter Dec 3, 2023
b4d6e1c
Systematic proofreading and formatting
agitter Dec 3, 2023
cccf5ce
Bump Cytoscape image version in workflow
agitter Dec 3, 2023
13 changes: 3 additions & 10 deletions config/config.yaml
@@ -29,14 +29,12 @@
   - name: "pathlinker"
     params:
       include: true
-      directed: true
       run1:
         k: range(100,201,100)

   - name: "omicsintegrator1"
     params:
       include: true
-      directed: false
       run1:
         r: [5]
         b: [5, 6]
@@ -47,7 +45,6 @@
   - name: "omicsintegrator2"
     params:
       include: true
-      directed: false
       run1:
         b: [4]
         g: [0]
@@ -58,7 +55,6 @@
   - name: "meo"
     params:
       include: true
-      directed: true
       run1:
         max_path_length: [3]
         local_search: ["Yes"]
@@ -67,20 +63,17 @@
   - name: "mincostflow"
     params:
       include: true
-      directed: false
       run1:
         flow: [1] # The flow must be an int
         capacity: [1]

   - name: "allpairs"
     params:
       include: true
-      directed: false

   - name: "domino"
     params:
       include: true
-      directed: false
       run1:
         slice_threshold: [0.3]
         module_threshold: [0.05]
@@ -125,13 +118,13 @@
 analysis:
   # Create one summary per pathway file and a single summary table for all pathways for each dataset
   summary:
-    include: true
+    include: false
   # Create output files for each pathway that can be visualized with GraphSpace
   graphspace:
-    include: true
+    include: false
   # Machine learning analysis (e.g. clustering) of the pathway output files for each dataset
   ml:
-    include: true
+    include: false
     # specify how many principal components to calculate
     components: 2
     # boolean to show the labels on the pca graph
18 changes: 9 additions & 9 deletions input/alternative-network.txt
@@ -1,9 +1,9 @@
-A B 0.98
-B C 0.77
-A D 0.12
-C D 0.89
-C E 0.59
-C F 0.50
-F G 0.76
-G H 0.92
-G I 0.66
+A B 0.98 U
+B C 0.77 U
+A D 0.12 U
+C D 0.89 U
+C E 0.59 U
+C F 0.50 U
+F G 0.76 U
+G H 0.92 U
+G I 0.66 U
4 changes: 2 additions & 2 deletions input/network.txt
@@ -1,2 +1,2 @@
-A B 0.98
-B C 0.77
+A B 0.98 U
+B C 0.77 U
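As the diffs above show, the interactome files gain a fourth column containing 'U' (undirected) or 'D' (directed) per edge, while three-column files are still accepted and treated as fully undirected. A minimal sketch of reading either format, mirroring the Dataset loading logic below ("my-network.txt" is a hypothetical path):

import pandas as pd

edges = pd.read_table("my-network.txt", sep="\t", header=None)
num_cols = edges.shape[1]
if num_cols == 3:
    # Legacy three-column file: assume every edge is undirected
    edges.columns = ["Interactor1", "Interactor2", "Weight"]
    edges["Direction"] = "U"
elif num_cols == 4:
    edges.columns = ["Interactor1", "Interactor2", "Weight", "Direction"]
else:
    raise ValueError(f"edge files must have three or four columns but found {num_cols}")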
8 changes: 7 additions & 1 deletion src/allpairs.py
@@ -3,6 +3,7 @@

 import pandas as pd

+from src.dataset import convert_directed_to_undirected, readd_direction_col_undirected
 from src.prm import PRM
 from src.util import prepare_volume, run_container

@@ -42,8 +43,12 @@ def generate_inputs(data, filename_map):

     input_df.to_csv(filename_map["nodetypes"], sep="\t", index=False, columns=["#Node", "Node type"])

     # Create network file
+    edges_df = data.get_interactome()
+
+    # Format network file
+    edges_df = convert_directed_to_undirected(edges_df)
     # This is pretty memory intensive. We might want to keep the interactome centralized.
-    data.get_interactome().to_csv(filename_map["network"], sep="\t", index=False,
+    edges_df.to_csv(filename_map["network"], sep="\t", index=False,
                     columns=["Interactor1", "Interactor2", "Weight"],
                     header=["#Interactor1", "Interactor2", "Weight"])

@@ -100,4 +105,5 @@ def parse_output(raw_pathway_file, standardized_pathway_file):
     """
     df = pd.read_csv(raw_pathway_file, sep='\t', header=None)
     df['Rank'] = 1  # add a rank column of 1s since the edges are not ranked.
+    df = readd_direction_col_undirected(df, 2)
     df.to_csv(standardized_pathway_file, header=False, index=False, sep='\t')
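To summarize the pattern above: All Pairs only handles undirected graphs, so its wrapper collapses any directed edges before writing the network file and restores an all-'U' Direction column when standardizing the output. A small sketch of that round trip, using made-up edges:

import pandas as pd

from src.dataset import convert_directed_to_undirected, readd_direction_col_undirected

edges = pd.DataFrame(
    [["A", "B", 0.98, "D"], ["B", "C", 0.77, "U"]],
    columns=["Interactor1", "Interactor2", "Weight", "Direction"],
)
# Before running the algorithm: every edge becomes undirected
edges = convert_directed_to_undirected(edges)

# After parsing the raw output: insert a Direction column of 'U' at position 2
pathway = pd.DataFrame([["A", "B", 1]])  # node pair plus a constant rank
pathway = readd_direction_col_undirected(pathway, 2)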
215 changes: 200 additions & 15 deletions src/dataset.py
@@ -14,7 +14,7 @@
 class Dataset:

     NODE_ID = "NODEID"
-    warning_threshold = 0.05 #Threshold for scarcity of columns to warn user
+    warning_threshold = 0.05  # Threshold for scarcity of columns to warn user

     def __init__(self, dataset_dict):
         self.label = None
@@ -63,25 +63,57 @@ def load_files_from_dict(self, dataset_dict):

         self.label = dataset_dict["label"]

-        #Get file paths from config
+        # Get file paths from config
         # TODO support multiple edge files
         interactome_loc = dataset_dict["edge_files"][0]
         node_data_files = dataset_dict["node_files"]
-        #edge_data_files = [""] # Currently None
+        # edge_data_files = [""] # Currently None
         data_loc = dataset_dict["data_dir"]

-        #Load everything as pandas tables
-        self.interactome = pd.read_table(os.path.join(data_loc,interactome_loc), names = ["Interactor1","Interactor2","Weight"])
+        # Load everything as pandas tables
+        print("about to create self.interactome")
+        print(data_loc)
+        print(interactome_loc)
+
+        with open(os.path.join(data_loc, interactome_loc), "r") as f:
+            for _i in range(9):  # print the first lines of the file
+                print(f.readline())
+
+        self.interactome = pd.read_table(
+            os.path.join(data_loc, interactome_loc), sep="\t", header=None
+        )
+        print(self.interactome)
+        num_cols = self.interactome.shape[1]
+        print(num_cols)
+        if num_cols == 3:
+            self.interactome.columns = ["Interactor1", "Interactor2", "Weight"]
+            self.interactome["Direction"] = "U"
+        elif num_cols == 4:
+            self.interactome.columns = [
+                "Interactor1",
+                "Interactor2",
+                "Weight",
+                "Direction",
+            ]
+        else:
+            raise ValueError(
+                f"edge_files must have three or four columns but found {num_cols}"
+            )

         node_set = set(self.interactome.Interactor1.unique())
         node_set = node_set.union(set(self.interactome.Interactor2.unique()))

-        #Load generic node tables
+        # Load generic node tables
         self.node_table = pd.DataFrame(node_set, columns=[self.NODE_ID])
         for node_file in node_data_files:
-            single_node_table = pd.read_table(os.path.join(data_loc,node_file))
-            #If we have only 1 column, assume this is an indicator variable
-            if len(single_node_table.columns)==1:
-                single_node_table = pd.read_table(os.path.join(data_loc,node_file),header=None)
+            single_node_table = pd.read_table(os.path.join(data_loc, node_file))
+            # If we have only 1 column, assume this is an indicator variable
+            if len(single_node_table.columns) == 1:
+                single_node_table = pd.read_table(
+                    os.path.join(data_loc, node_file), header=None
+                )
             single_node_table.columns = [self.NODE_ID]
             new_col_name = node_file.split(".")[0]
             single_node_table[new_col_name] = True
@@ -91,7 +123,9 @@ def load_files_from_dict(self, dataset_dict):
             # will be ignored
             # TODO may want to warn about duplicates before removing them, for instance, if a user loads two files that
             # both have prizes
-            self.node_table = self.node_table.merge(single_node_table, how="left", on=self.NODE_ID, suffixes=(None, "_DROP")).filter(regex="^(?!.*DROP)")
+            self.node_table = self.node_table.merge(
+                single_node_table, how="left", on=self.NODE_ID, suffixes=(None, "_DROP")
+            ).filter(regex="^(?!.*DROP)")
         # Ensure that the NODEID column always appears first, which is required for some downstream analyses
         self.node_table.insert(0, "NODEID", self.node_table.pop("NODEID"))
         self.other_files = dataset_dict["other_files"]
@@ -103,11 +137,18 @@ def request_node_columns(self, col_names):
         """
         col_names.append(self.NODE_ID)
         filtered_table = self.node_table[col_names]
-        filtered_table = filtered_table.dropna(axis=0, how='all',subset=filtered_table.columns.difference([self.NODE_ID]))
-        percent_hit = (float(len(filtered_table))/len(self.node_table))*100
-        if percent_hit <= self.warning_threshold*100:
+        filtered_table = filtered_table.dropna(
+            axis=0, how="all", subset=filtered_table.columns.difference([self.NODE_ID])
+        )
+        percent_hit = (float(len(filtered_table)) / len(self.node_table)) * 100
+        if percent_hit <= self.warning_threshold * 100:
             # Only use stacklevel 1 because this is due to the data not the code context
-            warnings.warn("Only %0.2f of data had one or more of the following columns filled:"%(percent_hit) + str(col_names), stacklevel=1)
+            warnings.warn(
+                "Only %0.2f of data had one or more of the following columns filled:"
+                % (percent_hit)
+                + str(col_names),
+                stacklevel=1,
+            )
         return filtered_table

     def contains_node_columns(self, col_names):
@@ -131,3 +172,147 @@ def get_other_files(self):

     def get_interactome(self):
         return self.interactome.copy()

+def convert_undirected_to_directed(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Turns a graph into a fully directed graph:
+    - turns every undirected edge into a bi-directed edge (a pair of directed edges)
+    - with bi-directed edges, we are not losing too much information because the relationship of the undirected edge is still preserved
+
+    *A user must keep the Direction column when using this function
+
+    @param df: input network df of edges, weights, and directionality
+    @return a dataframe with no undirected edges in the Direction column
+    """
+
+    # TODO: add a check to make sure there is a Direction column in df
+
+    for index, row in df.iterrows():
+        if row["Direction"] == "U":
+            df.at[index, "Direction"] = "D"
+
+            new_directed_row = row.copy(deep=True)
+            new_directed_row["Interactor1"], new_directed_row["Interactor2"] = (
+                row["Interactor2"],
+                row["Interactor1"],
+            )
+            print("new directed row\n", new_directed_row)
+            new_directed_row["Direction"] = "D"
+            df.loc[len(df)] = new_directed_row
+
+    return df
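A quick usage sketch (the one-edge network is made up): a single undirected edge becomes a pair of directed edges, one in each direction.

import pandas as pd

from src.dataset import convert_undirected_to_directed

df = pd.DataFrame(
    [["A", "B", 0.98, "U"]],
    columns=["Interactor1", "Interactor2", "Weight", "Direction"],
)
df = convert_undirected_to_directed(df)
# df now holds two directed rows:
#   A  B  0.98  D
#   B  A  0.98  D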


+def convert_directed_to_undirected(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Turns a graph into a fully undirected graph:
+    - turns the directed edges directly into undirected edges
+    - we will lose any sense of directionality, and the graph won't be wholly accurate, but the basic relationship between the two connected nodes remains intact
+
+    @param df: input network df of edges, weights, and directionality
+    @return a dataframe with no directed edges in the Direction column
+    """
+
+    for index, row in df.iterrows():
+        if row["Direction"] == "D":
+            df.at[index, "Direction"] = "U"
+
+    return df


+def add_seperator(df: pd.DataFrame, col_loc: int, col_name: str, sep: str) -> pd.DataFrame:
+    """
+    Adds a constant separator column to the input dataframe
+
+    @param df: input network df of edges, weights, and directionality
+    @param col_loc: the position in the dataframe to insert the new column
+    @param col_name: the name of the new column
+    @param sep: the separator value to place in every row
+    @return a df with the new separator column added to every row
+    """
+
+    df.insert(col_loc, col_name, sep)
+    return df


+def add_directionality_seperators(df: pd.DataFrame, col_loc: int, col_name: str, dir_sep: str, undir_sep: str) -> pd.DataFrame:
+    """
+    Adds a directionality separator column for mixed graphs whose format is not the universal input format
+
+    *A user must keep the Direction column when using this function
+
+    @param df: input network df of edges, weights, and directionality
+    @param col_loc: the position in the dataframe to insert the new column
+    @param col_name: the name of the new column
+    @param dir_sep: the directed edge separator
+    @param undir_sep: the undirected edge separator
+    @return a df converted to show directionality with the given separators
+    """
+
+    # TODO: add a check to make sure there is a Direction column in df
+
+    df.insert(col_loc, col_name, dir_sep)
+
+    for index, row in df.iterrows():
+        if row["Direction"] == "U":
+            df.at[index, col_name] = undir_sep
+        elif row["Direction"] == "D":
+            continue
+        else:
+            raise ValueError(
+                f'direction must be a \'U\' or \'D\', but found {row["Direction"]}'
+            )
+
+    return df
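A sketch of how a wrapper might use this, with hypothetical separators ('->' for directed, '--' for undirected; real algorithm formats may use different tokens). Inserting at column position 1 places the separator between the two interactors:

import pandas as pd

from src.dataset import add_directionality_seperators

df = pd.DataFrame(
    [["A", "B", 0.98, "D"], ["B", "C", 0.77, "U"]],
    columns=["Interactor1", "Interactor2", "Weight", "Direction"],
)
df = add_directionality_seperators(df, 1, "EdgeType", "->", "--")
# The A-B row gets '->' in the new EdgeType column; the B-C row gets '--'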

+def readd_direction_col_mixed(df: pd.DataFrame, direction_col_loc: int, existing_direction_column: str, dir_sep: str, undir_sep: str) -> pd.DataFrame:
+    """
+    Re-adds a 'Direction' column containing 'U' or 'D' based on the directed/undirected separators in the existing direction column
+
+    *A user must keep the existing direction column when using this function
+
+    @param df: input network df that contains directionality
+    @param direction_col_loc: the position in the dataframe to put back the 'Direction' column
+    @param existing_direction_column: the name of the existing directionality column
+    @param dir_sep: the directed edge separator
+    @param undir_sep: the undirected edge separator
+    @return a df with the Direction column added back
+    """
+
+    df.insert(direction_col_loc, "Direction", "D")
+
+    for index, row in df.iterrows():
+        if row[existing_direction_column] == undir_sep:
+            df.at[index, "Direction"] = "U"
+        elif row[existing_direction_column] == dir_sep:
+            df.at[index, "Direction"] = "D"
+        else:
+            raise ValueError(
+                f'direction must be a \'{dir_sep}\' or \'{undir_sep}\', but found {row[existing_direction_column]}'
+            )
+
+    return df
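And the inverse direction, continuing the hypothetical '->'/'--' separators from the sketch above: recover 'U'/'D' from an algorithm's output column when standardizing a mixed pathway.

import pandas as pd

from src.dataset import readd_direction_col_mixed

out = pd.DataFrame(
    [["A", "->", "B", 0.98], ["B", "--", "C", 0.77]],
    columns=["Interactor1", "EdgeType", "Interactor2", "Weight"],
)
out = readd_direction_col_mixed(out, 4, "EdgeType", "->", "--")
# out gains a Direction column: 'D' for the A->B row, 'U' for the B--C row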

+def readd_direction_col_undirected(df: pd.DataFrame, direction_col_loc: int) -> pd.DataFrame:
+    """
+    Re-adds a 'Direction' column filled with 'U'
+
+    @param df: input network df that contains directionality
+    @param direction_col_loc: the position in the dataframe to put back the 'Direction' column
+    @return a df with the Direction column added back
+    """
+    df.insert(direction_col_loc, "Direction", "U")
+    return df

+def readd_direction_col_directed(df: pd.DataFrame, direction_col_loc: int) -> pd.DataFrame:
+    """
+    Re-adds a 'Direction' column filled with 'D'
+
+    @param df: input network df that contains directionality
+    @param direction_col_loc: the position in the dataframe to put back the 'Direction' column
+    @return a df with the Direction column added back
+    """
+    df.insert(direction_col_loc, "Direction", "D")
+    return df