-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GEN-974] Allow NaN, nan and NA strings for mutation data #549
Merged
Merged
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
113e288
initial commit
rxu17 8720949
deprecated setup, replace with setup_method
rxu17 8501a62
lint
rxu17 664bde7
update black version and re-lint
rxu17 1f9760a
update _get_dataframe docstring for vcf and maf
rxu17 f952d19
add additional tests
rxu17 b6a3d6b
fix relevant code smells
rxu17 25c3e91
remove unused conversion code
rxu17 4d4147b
add None into valid vals in test
rxu17 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
"""Configuration to obtain registry classes""" | ||
|
||
import importlib | ||
import logging | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
""" | ||
Creates case lists per cancer type | ||
""" | ||
|
||
from collections import defaultdict | ||
import csv | ||
import os | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
"""Processing functions that are used in the GENIE pipeline""" | ||
|
||
import datetime | ||
import json | ||
import logging | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
"""Write invalid reasons""" | ||
|
||
import logging | ||
import os | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
"""Initialize GENIE registry""" | ||
|
||
# Import logging last to not take in synapseclient logging | ||
import logging | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
"""Assay information class""" | ||
|
||
import os | ||
import yaml | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
"""GENIE bed class and functions""" | ||
|
||
import os | ||
import logging | ||
import subprocess | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
from io import StringIO | ||
import os | ||
import logging | ||
import os | ||
from typing import List | ||
|
||
import pandas as pd | ||
|
||
|
@@ -198,10 +199,6 @@ def _validate(self, mutationDF): | |
for col in numerical_cols: | ||
col_exists = process_functions.checkColExist(mutationDF, col) | ||
if col_exists: | ||
# Since NA is an allowed value, when reading in the dataframe | ||
# the 'NA' string is not converted. This will convert all | ||
# 'NA' values in the numerical columns into actual float('nan') | ||
mutationDF.loc[mutationDF[col] == "NA", col] = float("nan") | ||
# Attempt to convert column to float | ||
try: | ||
mutationDF[col] = mutationDF[col].astype(float) | ||
|
@@ -352,13 +349,38 @@ def _cross_validate(self, mutationDF: pd.DataFrame) -> tuple: | |
) | ||
return errors, warnings | ||
|
||
def _get_dataframe(self, filePathList): | ||
"""Get mutation dataframe""" | ||
# Must do this because pandas.read_csv will allow for a file to | ||
# have more column headers than content. E.g. | ||
# A,B,C,D,E | ||
# 1,2 | ||
# 2,3 | ||
def _get_dataframe(self, filePathList: List[str]) -> pd.DataFrame: | ||
"""Get mutation dataframe | ||
|
||
1) Starts reading the first line in the file | ||
2) Skips lines that starts with # | ||
3) Reads in second line | ||
4) Checks that first line fields matches second line. Must do this because | ||
pandas.read_csv will allow for a file to have more column headers than content. | ||
E.g) A,B,C,D,E | ||
1,2 | ||
2,3 | ||
|
||
5) We keep the 'NA', 'nan', and 'NaN' as strings in the data because | ||
these are valid allele values | ||
then convert the ones in the non-allele columns back to actual NAs | ||
|
||
NOTE: Because allele columns are case-insensitive in maf data, we must | ||
standardize the case of the columns when checking for the non-allele columns | ||
to convert the NA strings to NAs | ||
|
||
NOTE: This code allows empty dataframes to pass through | ||
without errors | ||
|
||
Args: | ||
filePathList (List[str]): list of filepath(s) | ||
|
||
Raises: | ||
ValueError: First line fields doesn't match second line fields in file | ||
|
||
Returns: | ||
pd.DataFrame: mutation data | ||
""" | ||
with open(filePathList[0], "r") as maf_f: | ||
firstline = maf_f.readline() | ||
if firstline.startswith("#"): | ||
|
@@ -370,34 +392,43 @@ def _get_dataframe(self, filePathList): | |
"Number of fields in a line do not match the " | ||
"expected number of columns" | ||
) | ||
|
||
read_csv_params = { | ||
"filepath_or_buffer": filePathList[0], | ||
"sep": "\t", | ||
"comment": "#", | ||
# Keep the value 'NA' | ||
"keep_default_na": False, | ||
"na_values": [ | ||
"-1.#IND", | ||
"1.#QNAN", | ||
"1.#IND", | ||
"-1.#QNAN", | ||
"#N/A N/A", | ||
"NaN", | ||
"#N/A", | ||
"N/A", | ||
"#NA", | ||
"NULL", | ||
"-NaN", | ||
"nan", | ||
"-nan", | ||
"", | ||
], | ||
"keep_default_na": False, | ||
# This is to check if people write files | ||
# with R, quote=T | ||
"quoting": 3, | ||
# Retain completely blank lines so that | ||
# validator will cause the file to fail | ||
"skip_blank_lines": False, | ||
} | ||
|
||
mutationdf = transform._convert_df_with_mixed_dtypes(read_csv_params) | ||
|
||
mutationdf = transform._convert_values_to_na( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👀 I did not think about this at all... Good catch! |
||
input_df=mutationdf, | ||
values_to_replace=["NA", "nan", "NaN"], | ||
columns_to_convert=[ | ||
col | ||
for col in mutationdf.columns | ||
if col.upper() not in self._allele_cols | ||
], | ||
) | ||
return mutationdf |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you're converting to
none
here, but should it befloat('nan')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up.
So, I'm seeing this in maf.py's _validation in which this is handled (originally). I'm not seeing anything in processing in annotation-tools for maf.py that resembles what is happening in validation. Which is surprising, because I would think we'd want to do the same?
I think I have to move part of the code from the above specifically to the _convert_values_to_na in genie/transform.py method just to make sure it's int/float dtypes for both validation and processing
Or am I missing something and we don't need them to be numeric for processing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After experimenting, I think having
None
works. A dataframe with string values and None (e.g:pd.DataFrame({"bye":["2314", None, "124"]})
) converted to float will convert the column to float just fine.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like for
vcf
files' validation, we don't have to worry about this numeric stuff handling in validation or processing becausePOS
is the only expected numeric column and it can't have NAs anyways.So just maf processing side needs the handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So most of the heavy lifting of processing is done via genome nexus and once we write it out into a csv, it doesn't matter what "type" the column is.
That said - we do want to be diligent about making sure the data itself isn't changing, but a '1' and 1 is going to look the same when we write it out (unless we deliberately add quotes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, then I think it's okay as it is? Since we've come this far without much issue without having to implement that for the maf processing side. I am for keeping
None
mainly because we can have more than just numeric columns and conversion to numeric works fine even withNone
values in the column