[GEN-974] Allow NaN, nan and NA strings for mutation data #549

rxu17 · 2024-02-06T20:00:32Z

Purpose: This is a draft PR. It allows the "NaN", "nan" and "NA" strings in the allele columns of the vcf and maf datasets, because these are valid allele combinations.

Changes:

_convert_values_to_na in genie/transform.py: new helper function that converts all occurrences of the given values in a dataframe to NA; it is used in the _get_dataframe methods of vcf.py and maf.py
_get_dataframe in maf.py
_get_dataframe in vcf.py
I decided to allow the "NaN", "nan" and "NA" strings in this method because we need this for both validation and processing, and this is the method used by both. I also had to add some special handling for maf files because their column names can be case-insensitive. The in-depth reasoning behind allowing "NaN", "nan" and "NA" strings for ALL of the dataset, then converting them back to NA in non-allele columns, can be found in the comments of the JIRA ticket. In summary, it is difficult to use read_csv to convert only specific columns, especially since it already has a number of arguments for NA-specific handling, and order of operations comes into play here. See the docs for more details.
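
The "read everything, then convert back" approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `read_mutation_table`, the `OTHER_NA_STRINGS` list, and the sample column names are assumptions made for the example. It also uses `DataFrame.mask` for the conversion step, whereas the PR's helper builds a replace mapping to None.

```python
import io

import pandas as pd

# Strings that are valid alleles and must survive parsing (from the PR description).
ALLELE_STRINGS = ["NaN", "nan", "NA"]
# Hypothetical subset of NA markers that should still be parsed as missing everywhere.
OTHER_NA_STRINGS = ["", "N/A", "NULL", "null"]


def read_mutation_table(handle, allele_cols, sep="\t"):
    """Read a table keeping "NaN"/"nan"/"NA" as text, then blank them outside allele columns."""
    df = pd.read_csv(
        handle,
        sep=sep,
        dtype=str,
        keep_default_na=False,  # turn off pandas' built-in NA markers
        na_values=OTHER_NA_STRINGS,  # only these are parsed as missing
    )
    non_allele_cols = [col for col in df.columns if col not in allele_cols]
    if non_allele_cols:
        # Convert the allowed strings back to missing values in non-allele columns only.
        is_na_string = df[non_allele_cols].isin(ALLELE_STRINGS)
        df[non_allele_cols] = df[non_allele_cols].mask(is_na_string)
    return df


# Illustrative input: "NA" is a missing Center but a literal allele value.
raw = "Center\tReference_Allele\tTumor_Seq_Allele2\nNA\tNA\tnan\nSAGE\tA\tNaN\n"
table = read_mutation_table(
    io.StringIO(raw), allele_cols=["Reference_Allele", "Tumor_Seq_Allele2"]
)
```

Doing the conversion after parsing avoids depending on how the NA-related read_csv arguments interact with per-column settings.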

Testing:

Updated and ran pytests locally
Updated maf and vcf test files in test synapse project to include NaN, nan and NA as valid allele values and ran genie pipeline all the way through locally (integration test)
Ran genie pipeline under docker image

thomasyu888

🔥 Fantastic! Going to approve, just a small comment.

thomasyu888 · 2024-02-06T21:29:25Z

genie/transform.py

+        pd.DataFrame: dataset with specified values replaced with NAs
+    """
+    if not input_df.empty:
+        replace_mapping = {value: None for value in values_to_replace}


I see you're converting to None here, but should it be float('nan')?

Thanks for bringing this up.

So, I see that this is handled (originally) in maf.py's _validation. I don't see anything in the maf processing in annotation-tools that resembles what happens in validation, which is surprising, because I would think we'd want to do the same?

I think I have to move part of the code above into the _convert_values_to_na function in genie/transform.py, just to make sure the columns are int/float dtypes for both validation and processing.

Or am I missing something and we don't need them to be numeric for processing?

After experimenting, I think having None works. A dataframe with string values and None (e.g: pd.DataFrame({"bye":["2314", None, "124"]})) converted to float will convert the column to float just fine.
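
A quick way to check this claim is sketched below. It uses `pd.to_numeric` as one possible conversion route (the comment above does not say which call was used, so that choice is an assumption) and compares None against float('nan') as the placeholder.

```python
import pandas as pd

# Same column, once with None and once with float("nan") as the missing marker.
with_none = pd.DataFrame({"bye": ["2314", None, "124"]})
with_nan = pd.DataFrame({"bye": ["2314", float("nan"), "124"]})

# Convert the string column to a numeric dtype; missing entries become NaN.
none_as_float = pd.to_numeric(with_none["bye"])
nan_as_float = pd.to_numeric(with_nan["bye"])

print(none_as_float.dtype)
print(none_as_float.equals(nan_as_float))
```

Under these assumptions, either placeholder ends up as NaN once the column is numeric, which is why None is sufficient for columns that are later converted.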

It seems that for vcf files we don't have to worry about this numeric handling in either validation or processing, because POS is the only expected numeric column and it can't have NAs anyway.

So only the maf processing side needs the handling.

So most of the heavy lifting of processing is done via genome nexus and once we write it out into a csv, it doesn't matter what "type" the column is.

That said, we do want to be diligent about making sure the data itself isn't changing, but a '1' and a 1 are going to look the same when we write them out (unless we deliberately add quotes).
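
As a small illustration of the point about written output, the sketch below writes the same values as strings, ints, and floats and compares the resulting text. The column name and the `to_tsv_text` helper are placeholders for this example, not part of the pipeline.

```python
import pandas as pd


def to_tsv_text(df):
    """Serialize a dataframe to tab-separated text without the index."""
    return df.to_csv(sep="\t", index=False)


as_strings = pd.DataFrame({"Start_Position": ["1", "2"]})
as_ints = pd.DataFrame({"Start_Position": [1, 2]})
as_floats = pd.DataFrame({"Start_Position": [1.0, 2.0]})

# Strings and ints serialize to identical text.
same_text = to_tsv_text(as_strings) == to_tsv_text(as_ints)
# Floats do not: 1.0 is written as "1.0", so a dtype change can alter the output.
float_text_differs = to_tsv_text(as_floats) != to_tsv_text(as_ints)
```

The float case is where the data can visibly change, since an integer column containing a missing value is typically held as float by pandas.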

Ah I see, then I think it's okay as it is? Since we've come this far without much issue without having to implement that for the maf processing side. I am for keeping None mainly because we can have more than just numeric columns and conversion to numeric works fine even with None values in the column

thomasyu888 · 2024-02-06T22:46:27Z

genie_registry/maf.py

        mutationdf = transform._convert_df_with_mixed_dtypes(read_csv_params)
+
+        mutationdf = transform._convert_values_to_na(


👀 I did not think about this at all... Good catch!

sonarqubecloud · 2024-02-07T03:42:05Z

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

32 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

rxu17 added 7 commits February 6, 2024 11:51

initial commit

113e288

deprecated setup, replace with setup_method

8720949

lint

8501a62

update black version and re-lint

664bde7

update _get_dataframe docstring for vcf and maf

1f9760a

add additional tests

f952d19

fix relevant code smells

b6a3d6b

rxu17 marked this pull request as ready for review February 6, 2024 22:44

rxu17 requested a review from a team as a code owner February 6, 2024 22:44

thomasyu888 approved these changes Feb 6, 2024


rxu17 added 2 commits February 6, 2024 19:39

remove unused conversion code

25c3e91

add None into valid vals in test

4d4147b

rxu17 merged commit 5e69a73 into develop Feb 7, 2024
8 checks passed

rxu17 deleted the gen-974-allow-na-strings branch February 7, 2024 23:47
