-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create SampleQC and Batching notebooks #744
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! I've only had a chance to go through the QC notebook so far and my only broad comment is to document the figures so they are all clearly explained.
}, | ||
"outputs": [], | ||
"source": [ | ||
"# sample_set_response = fapi.get_entities_tsv(\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be a little easier to do a block comment with """
just so you don't have to delete all the #
's
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Jupyter Notebooks, you can simply do ⌘
+ /
to comment/uncomment an entire code-block. I thought this would be easier than explicitly deleting/undeleting the """
. Happy to change this if you think it's more suitable though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's a neat trick. That's fine then, but maybe add the shortcut as a tip in or above the cell
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated accordingly - left a comment in notebook execution tips.
scripts/notebooks/SampleQC.ipynb
Outdated
"1. Each step should be executed by first using the **Analysis** section, which plots figures to assist with determining appropriate filtering cutoff values. This will involve modifying the following parameters:\n", | ||
" - `NUM_BINS`: The number of bins to include in each histogram plot displayed.\n", | ||
" - `LOG_SCALE`: Determines whether to log-scale the resulting plot.\n", | ||
" - `LINE_DEVIATIONS`: A list of integers that defines the cutoff lines to draw on each histogram plot. This is a list of multipliers *x* for the median absolute deviation (MAD) such that lines are drawn at Median ± x * MAD. It has been initialized with default values, but set this to `[]` or `None` if you don't wish to include any cutoffs.\n", | ||
" - `LINE_STYLES`: A list of strings that defines the line styles that each cutoff line should use in each histogram plot displayed. It has been initialized with default values, but set this to `[]` or `None` if you don't wish to include any cutoffs.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"1. Each step should be executed by first using the **Analysis** section, which plots figures to assist with determining appropriate filtering cutoff values. This will involve modifying the following parameters:\n", | |
" - `NUM_BINS`: The number of bins to include in each histogram plot displayed.\n", | |
" - `LOG_SCALE`: Determines whether to log-scale the resulting plot.\n", | |
" - `LINE_DEVIATIONS`: A list of integers that defines the cutoff lines to draw on each histogram plot. This is a list of multipliers *x* for the median absolute deviation (MAD) such that lines are drawn at Median ± x * MAD. It has been initialized with default values, but set this to `[]` or `None` if you don't wish to include any cutoffs.\n", | |
" - `LINE_STYLES`: A list of strings that defines the line styles that each cutoff line should use in each histogram plot displayed. It has been initialized with default values, but set this to `[]` or `None` if you don't wish to include any cutoffs.\n", | |
"1. Each step should be executed by first using the **Analysis** section, which plots figures to assist with determining appropriate filtering cutoff values. Parameters are defined and documented in the following cells. |
I think it's hard to read all this and then have to scroll down to change the parameters. IMO it would be easier to read to just have the definitions inline in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see where you are coming from. However, given that the parameters are (more or less) constant across all the various QC analyses, it feels a bit redundant to have the definitions in-line in the code. Or are you suggesting just having them in-line in the first QC analysis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point, but I know that lazy people will not want to have to scroll. A compromise could be to just have a brief description inline and the more thorough description in this section for reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated accordingly.
Co-authored-by: Mark Walker <[email protected]>
Co-authored-by: Mark Walker <[email protected]>
Co-authored-by: Mark Walker <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your updates. I moved on to the Batching notebook but ran into a problem, see below.
scripts/notebooks/Batching.ipynb
Outdated
"\n", | ||
"**Prerequisites**: EvidenceQC, Sample QC Notebook.\n", | ||
"\n", | ||
"**Next Steps**: TrainGCNV, GatherBatchEvidence.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"**Next Steps**: TrainGCNV, GatherBatchEvidence.\n", | |
"**Next Step**: TrainGCNV\n", |
Clearer since GatherBatchEvidence requires TrainGcnv first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated accordingly.
scripts/notebooks/Batching.ipynb
Outdated
"* Memory: 13 GB\n", | ||
"* Persistent Disk: 100 GB\n", | ||
"\n", | ||
"**Prerequisites**: EvidenceQC, Sample QC Notebook.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"**Prerequisites**: EvidenceQC, Sample QC Notebook.\n", | |
"**Prerequisites**: SampleQC notebook\n", |
I think this is clearer. EvidenceQC is implied since that is required by SampleQC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated accordingly.
scripts/notebooks/SampleQC.ipynb
Outdated
"validate_qc_inputs(samples_qc_table, 'median_coverage', num_bins=NUM_BINS, \n", | ||
" method=METHOD, lower_cutoff=LOWER_CUTOFF, upper_cutoff=UPPER_CUTOFF, \n", | ||
" mad_cutoff=MAD_CUTOFF, log_scale=LOG_SCALE)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still getting an error here with the defaults:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[43], line 9
6 UPPER_CUTOFF = None
7 MAD_CUTOFF = None
----> 9 validate_qc_inputs(samples_qc_table, 'median_coverage', num_bins=NUM_BINS,
10 method=METHOD, lower_cutoff=LOWER_CUTOFF, upper_cutoff=UPPER_CUTOFF,
11 mad_cutoff=MAD_CUTOFF, log_scale=LOG_SCALE)
Cell In[15], line 58, in validate_qc_inputs(table, filter_type, num_bins, line_deviations, line_styles, method, lower_cutoff, upper_cutoff, mad_cutoff, caller, caller_type, log_scale)
56 if (method == 'hard'):
57 if(not lower_cutoff and not upper_cutoff):
---> 58 raise Exception('Given that the chosen method is "hard", a value for the upper/lower cutoff must be provided.')
60 if (lower_cutoff and not isinstance(lower_cutoff, (int, float))):
61 raise Exception('Given that the chosen method is "hard", the value for the lower cutoff should be numeric.')
Exception: Given that the chosen method is "hard", a value for the upper/lower cutoff must be provided.
but should allow None
to be -inf
for LOWER_CUTOFF
and inf
for UPPER_CUTOFF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated accordingly - also set inf
for MAD_CUTOFF
if none provided.
scripts/notebooks/SampleQC.ipynb
Outdated
"created_ped = pd.read_table(file_path, names=['Family_ID', 'Sample_ID', 'Paternal_ID', \n", | ||
" 'Maternal_ID', 'Sex', 'Phenotype'])\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[84], line 3
1 # Save and display generated PED file, and update 'cohort_ped' workspace variable to this
2 file_path = generate_file_path(TLD_PATH, 'sex_analysis', 'sample_qc.ped')
----> 3 create_ped(updated_sex, reference_ped, file_path)
5 created_ped = pd.read_table(file_path, names=['Family_ID', 'Sample_ID', 'Paternal_ID',
6 'Maternal_ID', 'Sex', 'Phenotype'])
7 created_ped
Cell In[30], line 25, in create_ped(sex_assignments, reference_ped, file_path)
23 # Account for case where reference PED file does not exist
24 if (reference_ped.empty):
---> 25 reference_ped["Family_ID"] = reference_ped["Sample_ID"]
26 reference_ped = pd.DataFrame({'Sample_ID': sex_assignments.keys()})
27 reference_ped["Paternal_ID"] = "0"
File /opt/conda/lib/python3.10/site-packages/pandas/core/frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File /opt/conda/lib/python3.10/site-packages/pandas/core/indexes/range.py:349, in RangeIndex.get_loc(self, key)
347 raise KeyError(key) from err
348 if isinstance(key, Hashable):
--> 349 raise KeyError(key)
350 self._check_indexing_error(key)
351 raise KeyError(key)
KeyError: 'Sample_ID'
which looks like a bug. I think this line just needs to be moved below the next one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, I'd noticed this post-hoc so updated it only in the template workspace rather than in the Github repo as well. I've included this version in the latest commit though.
"source": [ | ||
"# Display metadata table, which was generated as part of EvidenceQC\n", | ||
"PASS_METADATA = os.path.join(WS_BUCKET, \"evidence_qc/filtering/passed_samples_qc_metadata.tsv\")\n", | ||
"pass_df = pd.read_table(PASS_METADATA)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is breaking on me:
FileNotFoundError: b/fc-41a63557-9ddf-41ba-b5d0-807be90ddb87/o/evidence_qc%2Ffiltering%2Fpassed_samples_qc_metadata.tsv
Does pandas support reading directly from buckets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does - you need to use gs://
as a prefix though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, on closer inspection I realized that this is automatically set based on WS_BUCKET
, which is pulled from os.environ['WORKSPACE_BUCKET']
. Strange to me that this does not automatically prepend gs://
in this case though, as it does for me...do you mind sharing the value of WS_BUCKET
in your notebook?
"MINIMUM_BATCH_SIZE = 'TODO'\n", | ||
"MAXIMUM_BATCH_SIZE = 'TODO'\n", | ||
"\n", | ||
"validate_numeric_inputs([TARGET_BATCH_SIZE, MINIMUM_BATCH_SIZE, MAXIMUM_BATCH_SIZE])" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should validate that min size is at most the number of total samples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated accordingly.
"outputs": [], | ||
"source": [ | ||
"# Perform hierarchical batching and print summary of batch size and sex balance\n", | ||
"batches, batches_meta = generate_hierarchical_batches(pass_df, INCLUDE_METRICS, INCLUDE_BINS, TARGET_BATCH_SIZE, \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting stuck in an infinite loop with target size 500, max 600, min 100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind sharing the workspace you encountered this in with me? It would be helpful to trigger the error myself, as I'm not 100% sure about the underlying source of this.
This PR introduces the initial set of
SampleQC
andBatching
notebooks part of the featured workspace into the GATK-SV repository to enable fine-grained source control. They have been added to the newly-createdscripts/notebooks/
directory.