Omics Integrator 2 Testing Error #133

Open
ntalluri opened this issue Oct 4, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@ntalluri
Collaborator

ntalluri commented Oct 4, 2023

When I was writing the tests for parse outputs, I kept getting an error on this line of the code for oi2: df = df[df['in_solution'] == True] # Check whether this column can be empty before revising this line

I was taking the output from the oi2 test suite, but it was failing because it was missing the 'in_solution' column, yet the column 'in_solution' appears in the raw-pathways for oi2 when SPRAS is run over the datasets.

There might be an issue with the test suite or merely the result. I haven't checked it out, but it's something we should look at because it's not a current test.

agitter added the bug label Oct 4, 2023
@agitter
Collaborator

agitter commented Oct 14, 2023

This may not be a problem with the Omics Integrator 2 container or run function but rather have to do with its input data and parameters. I modified the test function to use the parameters (b=4, g=0) and input data from the workflow. I was able to get a complete output file with the in_solution column when running the test_oi2_required test:

protein1	protein2	cost	in_solution
B	A	0.52	True
B	C	0.73	True

I'll need to run more tests or read through its source code to understand why this column is sometimes missing.

@agitter
Collaborator

agitter commented Oct 18, 2023

I pushed the branch oi2-misssing-col with my initial test dataset changes and explorations and some documentation updates.

I started looking at where Omics Integrator 2 sets in_solution in its source code, but didn't find anything that helps explain this behavior: https://github.com/agitter/OmicsIntegrator2/blob/command-line/src/graph.py#L335

I was testing pytest -k test_oi2_required.

@ntalluri
Collaborator Author

ntalluri commented Aug 28, 2024

When I was grid searching the parameters for the EGFR dataset, oi2 kept breaking due to this error. For parameter tuning, it will be helpful to find a solution for this issue soon.

@ntalluri
Collaborator Author

ntalluri commented Aug 28, 2024

Examples of headers from the raw pathway file

protein1 protein2 cost

  • sometimes the in_solution column is missing
  • why it is missing is important

protein1 protein2

  • sometimes cost and in_solution are missing
  • seems to happen when the raw pathway is empty

protein1 protein2 in_solution cost

  • sometimes the in_solution column is not in the right order

protein1 protein2 cost in_solution

  • normal behavior?
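The four header variants listed above could be told apart with a small check like the one below. This is only a sketch: the function name `classify_oi2_header` and the category labels are made up for illustration, and it assumes the header is a single line of tab- or space-separated column names.

```python
# Hypothetical helper: the name and the category labels are illustrative only.
EXPECTED = {"protein1", "protein2", "cost", "in_solution"}


def classify_oi2_header(header_line):
    """Classify a raw pathway header without depending on column order."""
    columns = set(header_line.split())  # splits on tabs or spaces
    if columns == EXPECTED:
        return "complete"
    if columns == EXPECTED - {"in_solution"}:
        return "missing in_solution"
    if columns == EXPECTED - {"cost", "in_solution"}:
        return "missing cost and in_solution"
    return "unexpected"


for header in (
    "protein1 protein2 cost",
    "protein1 protein2",
    "protein1 protein2 in_solution cost",
    "protein1 protein2 cost in_solution",
):
    print(header, "->", classify_oi2_header(header))
```

Because the comparison uses sets, the swapped-order header is treated the same as the normal one.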

@ntalluri
Collaborator Author

ntalluri commented Aug 28, 2024

Quick fix: if only the first 3 column headers are there, write an empty file and report the raw pathway as corrupted.

Then try to figure out what the code is doing.
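A rough sketch of the quick fix described above, using only the standard library. The function name `filter_or_flag`, the tab delimiter, and the two-column output format are assumptions made for this example and are not taken from the SPRAS code.

```python
import csv


def filter_or_flag(raw_path, out_path):
    """Write in-solution edges to out_path.

    Returns False and leaves out_path empty when the raw file has no
    in_solution column (including a completely empty raw file).
    """
    with open(raw_path, newline="") as raw:
        reader = csv.DictReader(raw, delimiter="\t")
        fields = reader.fieldnames or []  # None for a completely empty file
        rows = list(reader) if "in_solution" in fields else None

    with open(out_path, "w", newline="") as out:
        if rows is None:
            print(f"{raw_path}: in_solution column missing, writing empty output")
            return False
        writer = csv.writer(out, delimiter="\t")
        for row in rows:
            # Values are read as strings, so compare against the text "True"
            if row["in_solution"] == "True":
                writer.writerow([row["protein1"], row["protein2"]])
    return True
```

Reading columns by name with `csv.DictReader` also means the result does not depend on whether cost or in_solution comes first.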

@agitter
Collaborator

agitter commented Aug 30, 2024

I resumed investigating the in_solution column. All of the following is speculative and needs to be tested and confirmed.

Omics Integrator 2 adds that information here using nx.set_edge_attributes. We use version 2.1 of networkx so the source code is here. My initial read suggests the graph edge attributes are dicts. Maybe the order of the attributes is not guaranteed? That could explain why cost and in_solution are sometimes swapped in order. We can work to confirm this idea.

In addition, if Omics Integrator 2 adds in_solution only for forest edges, that suggests the edge attribution is not added when forest is empty. We can also test that behavior. My initial testing in the branch above may support that theory.

None of this explains why cost may be missing sometimes. Does this only happen when the raw pathway is empty?

@ntalluri
Collaborator Author

ntalluri commented Sep 3, 2024

#182 Pull request for error

@ntalluri
Collaborator Author

ntalluri commented Sep 4, 2024

The cost is always missing when the raw pathway is empty.

@ntalluri
Collaborator Author

ntalluri commented Sep 6, 2024

For the in_solution and cost swapping situation, the Python version might be a factor as well. Since Python 3.7, dictionaries are guaranteed to preserve insertion order. In Python 3.6 the language specification does not guarantee this (CPython 3.6 preserves insertion order only as an implementation detail). So there is a chance that the column order could change under Python 3.6.
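Whatever the cause of the swap turns out to be, a consumer can avoid depending on dictionary order by selecting attributes by name. The snippet below is a minimal illustration with made-up values; `COLUMNS` and `ordered_values` are hypothetical names, not part of Omics Integrator 2 or SPRAS.

```python
# Same attributes, inserted in a different order
cost_first = {"cost": 0.52, "in_solution": True}
solution_first = {"in_solution": True, "cost": 0.52}

COLUMNS = ("cost", "in_solution")  # assumed canonical column order


def ordered_values(attrs, columns=COLUMNS):
    """Return attribute values in a fixed order, None for missing names."""
    return [attrs.get(name) for name in columns]


# Key order follows insertion order, so the two dicts iterate differently
print(list(cost_first), list(solution_first))
# Selecting by name gives the same row either way
print(ordered_values(cost_first) == ordered_values(solution_first))
```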

@ntalluri
Collaborator Author

ntalluri commented Sep 6, 2024

Based on my understanding, if the forest is empty, the code will attempt to iterate through the edges, but since there are none, no exception will be raised, and the function will simply end without making any changes. Therefore, in_solution will never be added as an edge attribute.
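The reasoning above can be reproduced with a plain-Python stand-in. This is not the Omics Integrator 2 or networkx code; `mark_in_solution` and `columns_present` are hypothetical helpers that only mimic the pattern of setting an attribute on forest edges and then deriving output columns from whatever attributes exist.

```python
def mark_in_solution(edge_attrs, forest_edges):
    """Set in_solution on each forest edge; a no-op when the forest is empty."""
    for edge in forest_edges:
        edge_attrs[edge]["in_solution"] = True


def columns_present(edge_attrs):
    """Collect attribute names in first-seen order, as a table writer might."""
    names = []
    for attrs in edge_attrs.values():
        for name in attrs:
            if name not in names:
                names.append(name)
    return names


edges = {("B", "A"): {"cost": 0.52}, ("B", "C"): {"cost": 0.73}}

mark_in_solution(edges, forest_edges=[])  # empty forest: loop body never runs
print(columns_present(edges))  # ['cost']

mark_in_solution(edges, forest_edges=[("B", "A")])
print(columns_present(edges))  # ['cost', 'in_solution']
```

With an empty forest no error is raised and the in_solution attribute never appears, which matches the 'protein1 protein2 cost' header seen in some raw pathway files.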

@ntalluri
Collaborator Author

We still need to figure out what counts as an empty file for OI2: completely empty files versus files with only the 'protein1 protein2 cost' header, which are also technically empty because the forest is empty.

@agitter
Copy link
Collaborator

agitter commented Sep 22, 2024

#182 has been merged. Are we ready to close this?

It looks like you want to leave it open until we understand the protein1 protein2 cost output better.
