-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GEN-974] Allow NaN, nan and NA strings for mutation data #549
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 Fantastic! Going to approve, just a small comment.
pd.DataFrame: dataset with specified values replaced with NAs | ||
""" | ||
if not input_df.empty: | ||
replace_mapping = {value: None for value in values_to_replace} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you're converting to none
here, but should it be float('nan')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up.
So, I'm seeing this in maf.py's _validation in which this is handled (originally). I'm not seeing anything in processing in annotation-tools for maf.py that resembles what is happening in validation. Which is surprising, because I would think we'd want to do the same?
I think I have to move part of the code from the above specifically to the _convert_values_to_na in genie/transform.py method just to make sure it's int/float dtypes for both validation and processing
Or am I missing something and we don't need them to be numeric for processing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After experimenting, I think having None
works. A dataframe with string values and None (e.g: pd.DataFrame({"bye":["2314", None, "124"]})
) converted to float will convert the column to float just fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like for vcf
files' validation, we don't have to worry about this numeric stuff handling in validation or processing because POS
is the only expected numeric column and it can't have NAs anyways.
So just maf processing side needs the handling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So most of the heavy lifting of processing is done via genome nexus and once we write it out into a csv, it doesn't matter what "type" the column is.
That said - we do want to be diligent about making sure the data itself isn't changing, but a '1' and 1 is going to look the same when we write it out (unless we deliberately add quotes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, then I think it's okay as it is? Since we've come this far without much issue without having to implement that for the maf processing side. I am for keeping None
mainly because we can have more than just numeric columns and conversion to numeric works fine even with None
values in the column
mutationdf = transform._convert_df_with_mixed_dtypes(read_csv_params) | ||
|
||
mutationdf = transform._convert_values_to_na( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀 I did not think about this at all... Good catch!
Quality Gate passedThe SonarCloud Quality Gate passed, but some issues were introduced. 32 New issues |
Purpose: This is a draft PR. This PR allows in the "NaN", "nan" and "NA" strings for allele columns in the vcf and maf datasets because these are valid allele combinations.
Changes:
_get_dataframe
methods of thevcf.py
andmaf.py
filesread_csv
to only convert specific columns, especially since it already has a bunch of arguments for NA specific handling, and order of operations comes into play here. See the docs for more details.Testing: