Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create SampleQC and Batching notebooks #744

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

kjaisingh
Copy link
Collaborator

This PR introduces the initial set of SampleQC and Batching notebooks part of the featured workspace into the GATK-SV repository to enable fine-grained source control. They have been added to the newly-created scripts/notebooks/ directory.

@kjaisingh kjaisingh added the enhancement New feature or request label Nov 5, 2024
@kjaisingh kjaisingh self-assigned this Nov 5, 2024
Copy link
Collaborator

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great! I've only had a chance to go through the QC notebook so far and my only broad comment is to document the figures so they are all clearly explained.

},
"outputs": [],
"source": [
"# sample_set_response = fapi.get_entities_tsv(\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be a little easier to do a block comment with """ just so you don't have to delete all the #'s

Copy link
Collaborator Author

@kjaisingh kjaisingh Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Jupyter Notebooks, you can simply do + / to comment/uncomment an entire code-block. I thought this would be easier than explicitly deleting/undeleting the """. Happy to change this if you think it's more suitable though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a neat trick. That's fine then, but maybe add the shortcut as a tip in or above the cell

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly - left a comment in notebook execution tips.

scripts/notebooks/SampleQC.ipynb Outdated Show resolved Hide resolved
scripts/notebooks/SampleQC.ipynb Outdated Show resolved Hide resolved
scripts/notebooks/SampleQC.ipynb Outdated Show resolved Hide resolved
Comment on lines 1735 to 1739
"1. Each step should be executed by first using the **Analysis** section, which plots figures to assist with determining appropriate filtering cutoff values. This will involve modifying the following parameters:\n",
" - `NUM_BINS`: The number of bins to include in each histogram plot displayed.\n",
" - `LOG_SCALE`: Determines whether to log-scale the resulting plot.\n",
" - `LINE_DEVIATIONS`: A list of integers that defines the cutoff lines to draw on each histogram plot. This is a list of multipliers *x* for the median absolute deviation (MAD) such that lines are drawn at Median ± x * MAD. It has been initialized with default values, but set this to `[]` or `None` if you don't wish to include any cutoffs.\n",
" - `LINE_STYLES`: A list of strings that defines the line styles that each cutoff line should use in each histogram plot displayed. It has been initialized with default values, but set this to `[]` or `None` if you don't wish to include any cutoffs.\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"1. Each step should be executed by first using the **Analysis** section, which plots figures to assist with determining appropriate filtering cutoff values. This will involve modifying the following parameters:\n",
" - `NUM_BINS`: The number of bins to include in each histogram plot displayed.\n",
" - `LOG_SCALE`: Determines whether to log-scale the resulting plot.\n",
" - `LINE_DEVIATIONS`: A list of integers that defines the cutoff lines to draw on each histogram plot. This is a list of multipliers *x* for the median absolute deviation (MAD) such that lines are drawn at Median ± x * MAD. It has been initialized with default values, but set this to `[]` or `None` if you don't wish to include any cutoffs.\n",
" - `LINE_STYLES`: A list of strings that defines the line styles that each cutoff line should use in each histogram plot displayed. It has been initialized with default values, but set this to `[]` or `None` if you don't wish to include any cutoffs.\n",
"1. Each step should be executed by first using the **Analysis** section, which plots figures to assist with determining appropriate filtering cutoff values. Parameters are defined and documented in the following cells.

I think it's hard to read all this and then have to scroll down to change the parameters. IMO it would be easier to read to just have the definitions inline in the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see where you are coming from. However, given that the parameters are (more or less) constant across all the various QC analyses, it feels a bit redundant to have the definitions in-line in the code. Or are you suggesting just having them in-line in the first QC analysis?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, but I know that lazy people will not want to have to scroll. A compromise could be to just have a brief description inline and the more thorough description in this section for reference?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly.

scripts/notebooks/SampleQC.ipynb Outdated Show resolved Hide resolved
scripts/notebooks/SampleQC.ipynb Outdated Show resolved Hide resolved
scripts/notebooks/SampleQC.ipynb Outdated Show resolved Hide resolved
scripts/notebooks/SampleQC.ipynb Outdated Show resolved Hide resolved
scripts/notebooks/SampleQC.ipynb Outdated Show resolved Hide resolved
Copy link
Collaborator

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your updates. I moved on to the Batching notebook but ran into a problem, see below.

"\n",
"**Prerequisites**: EvidenceQC, Sample QC Notebook.\n",
"\n",
"**Next Steps**: TrainGCNV, GatherBatchEvidence.\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"**Next Steps**: TrainGCNV, GatherBatchEvidence.\n",
"**Next Step**: TrainGCNV\n",

Clearer since GatherBatchEvidence requires TrainGcnv first.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly.

"* Memory: 13 GB\n",
"* Persistent Disk: 100 GB\n",
"\n",
"**Prerequisites**: EvidenceQC, Sample QC Notebook.\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"**Prerequisites**: EvidenceQC, Sample QC Notebook.\n",
"**Prerequisites**: SampleQC notebook\n",

I think this is clearer. EvidenceQC is implied since that is required by SampleQC.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly.

Comment on lines 1890 to 1892
"validate_qc_inputs(samples_qc_table, 'median_coverage', num_bins=NUM_BINS, \n",
" method=METHOD, lower_cutoff=LOWER_CUTOFF, upper_cutoff=UPPER_CUTOFF, \n",
" mad_cutoff=MAD_CUTOFF, log_scale=LOG_SCALE)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still getting an error here with the defaults:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[43], line 9
      6 UPPER_CUTOFF = None
      7 MAD_CUTOFF = None
----> 9 validate_qc_inputs(samples_qc_table, 'median_coverage', num_bins=NUM_BINS, 
     10                    method=METHOD, lower_cutoff=LOWER_CUTOFF, upper_cutoff=UPPER_CUTOFF, 
     11                    mad_cutoff=MAD_CUTOFF, log_scale=LOG_SCALE)

Cell In[15], line 58, in validate_qc_inputs(table, filter_type, num_bins, line_deviations, line_styles, method, lower_cutoff, upper_cutoff, mad_cutoff, caller, caller_type, log_scale)
     56 if (method == 'hard'):
     57     if(not lower_cutoff and not upper_cutoff):
---> 58         raise Exception('Given that the chosen method is "hard", a value for the upper/lower cutoff must be provided.')
     60     if (lower_cutoff and not isinstance(lower_cutoff, (int, float))):
     61         raise Exception('Given that the chosen method is "hard", the value for the lower cutoff should be numeric.')

Exception: Given that the chosen method is "hard", a value for the upper/lower cutoff must be provided.

but should allow None to be -inf for LOWER_CUTOFF and inf for UPPER_CUTOFF

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly - also set inf for MAD_CUTOFF if none provided.

Comment on lines 2730 to 2731
"created_ped = pd.read_table(file_path, names=['Family_ID', 'Sample_ID', 'Paternal_ID', \n",
" 'Maternal_ID', 'Sex', 'Phenotype'])\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[84], line 3
      1 # Save and display generated PED file, and update 'cohort_ped' workspace variable to this
      2 file_path = generate_file_path(TLD_PATH, 'sex_analysis', 'sample_qc.ped')
----> 3 create_ped(updated_sex, reference_ped, file_path)
      5 created_ped = pd.read_table(file_path, names=['Family_ID', 'Sample_ID', 'Paternal_ID', 
      6                                                'Maternal_ID', 'Sex', 'Phenotype'])
      7 created_ped

Cell In[30], line 25, in create_ped(sex_assignments, reference_ped, file_path)
     23 # Account for case where reference PED file does not exist
     24 if (reference_ped.empty):
---> 25     reference_ped["Family_ID"] = reference_ped["Sample_ID"]
     26     reference_ped = pd.DataFrame({'Sample_ID': sex_assignments.keys()})
     27     reference_ped["Paternal_ID"] = "0"

File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File /opt/conda/lib/python3.10/site-packages/pandas/core/indexes/range.py:349, in RangeIndex.get_loc(self, key)
    347         raise KeyError(key) from err
    348 if isinstance(key, Hashable):
--> 349     raise KeyError(key)
    350 self._check_indexing_error(key)
    351 raise KeyError(key)

KeyError: 'Sample_ID'

which looks like a bug. I think this line just needs to be moved below the next one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, I'd noticed this post-hoc so updated it only in the template workspace rather than in the Github repo as well. I've included this version in the latest commit though.

"source": [
"# Display metadata table, which was generated as part of EvidenceQC\n",
"PASS_METADATA = os.path.join(WS_BUCKET, \"evidence_qc/filtering/passed_samples_qc_metadata.tsv\")\n",
"pass_df = pd.read_table(PASS_METADATA)\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is breaking on me:

FileNotFoundError: b/fc-41a63557-9ddf-41ba-b5d0-807be90ddb87/o/evidence_qc%2Ffiltering%2Fpassed_samples_qc_metadata.tsv

Does pandas support reading directly from buckets?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does - you need to use gs:// as a prefix though.

Copy link
Collaborator Author

@kjaisingh kjaisingh Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, on closer inspection I realized that this is automatically set based on WS_BUCKET, which is pulled from os.environ['WORKSPACE_BUCKET']. Strange to me that this does not automatically prepend gs:// in this case though, as it does for me...do you mind sharing the value of WS_BUCKET in your notebook?

"MINIMUM_BATCH_SIZE = 'TODO'\n",
"MAXIMUM_BATCH_SIZE = 'TODO'\n",
"\n",
"validate_numeric_inputs([TARGET_BATCH_SIZE, MINIMUM_BATCH_SIZE, MAXIMUM_BATCH_SIZE])"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should validate that min size is at most the number of total samples

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly.

"outputs": [],
"source": [
"# Perform hierarchical batching and print summary of batch size and sex balance\n",
"batches, batches_meta = generate_hierarchical_batches(pass_df, INCLUDE_METRICS, INCLUDE_BINS, TARGET_BATCH_SIZE, \n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting stuck in an infinite loop with target size 500, max 600, min 100

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind sharing the workspace you encountered this in with me? It would be helpful to trigger the error myself, as I'm not 100% sure about the underlying source of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants