Feature: Export Library #156
base: master
Conversation
Add options to calibrate/not calibrate IM, RT, MS2 Frag
Also add tests for osw/parquet to test for the NotImplementedError.
If the q-values are the same for two different runs, the behaviour of which precursor is selected is undefined. This can mean that transitions belonging to the same transition group have different RT/IM values. To address this, also sort by RunId: if the q-values are the same, just take the first run.
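A minimal sketch of that tie-breaking logic in pandas, using made-up column names (`QValue`, `RunId`, `TransitionGroupId`) rather than the PR's actual schema:

```python
import pandas as pd

# Toy data; column names here are assumptions for illustration only.
df = pd.DataFrame({
    "TransitionGroupId": ["tg1", "tg1", "tg2", "tg2"],
    "RunId": [2, 1, 1, 2],
    "QValue": [0.001, 0.001, 0.0005, 0.002],
})

# Sort by q-value first and break ties on RunId, then keep a single run
# per transition group so all its transitions share the same RT/IM source.
best_run = (
    df.sort_values(["QValue", "RunId"])
      .drop_duplicates(subset="TransitionGroupId", keep="first")
)
print(best_run)
```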
Thanks for the addition! Looks mostly good to go, I just had a few questions and suggestions.
pyprophet/cli/export.py
Outdated
```
type=float,
help="Filter results to maximum run-specific peak group-level q-value, should not use values > 0.01.",
```
If we want to limit the filters to be only less than or equal to 0.01, maybe we should change the type to a click.FloatRange? Or add param validation in the export_library if we want to limit the qvalue thresholds.
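For reference, a small sketch of the `click.FloatRange` variant; the option name and bounds are assumptions for illustration, not the PR's actual definition:

```python
import click

@click.command()
@click.option(
    "--max-peakgroup-qvalue",  # hypothetical option name
    type=click.FloatRange(0.0, 0.01),  # values outside [0, 0.01] are rejected by click
    default=0.01,
    show_default=True,
    help="Filter results to maximum run-specific peak group-level q-value.",
)
def export_library(max_peakgroup_qvalue):
    click.echo(f"Filtering at q-value <= {max_peakgroup_qvalue}")

if __name__ == "__main__":
    export_library()
```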
Can you also add to the description that, if there are multiple runs with the same precursor, the run with the lowest q-value is used?
> If we want to limit the filters to be only less than or equal to 0.01, maybe we should change the type to a click.FloatRange? Or add param validation in the export_library if we want to limit the qvalue thresholds.
I am not sure I want to enforce a hard limit because I am still experimenting with values greater than 1% FDR, and greater than 1% is fine if you are filtering to that value anyways.
E.g. if you are doing your entire analysis at 5% FDR, it is probably fine to use 5% FDR here.
Should probably change the help description then. Either remove the "should not use values > 0.01", or change the wording as a suggestion.
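One possible middle ground, not part of this PR: keep the plain float type but warn above a soft threshold via a click callback. The option name and threshold below are assumptions for illustration:

```python
import click

def _warn_high_qvalue(ctx, param, value):
    # Soft check: warn above 1% FDR instead of rejecting the value outright.
    if value is not None and value > 0.01:
        click.echo(
            f"Warning: {param.name}={value} is above 0.01; make sure the rest "
            "of your analysis uses the same FDR threshold.",
            err=True,
        )
    return value

max_qvalue_option = click.option(
    "--max-peakgroup-qvalue",  # hypothetical option name
    type=float,
    default=0.01,
    callback=_warn_high_qvalue,
    help="Filter results to maximum run-specific peak group-level q-value.",
)
```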
I think the tests need to be updated with the added
Co-authored-by: Justin Sing <[email protected]>
…to feature/lib_export
After minor changes in data manipulation, update snapshot tests
If there is still a tie, sort by runId
Pull Request Overview
This PR adds functionality to export library files (.oswpq and .oswpqd) to a TSV library format that can be used with OpenSWATH. The export supports both experimental and previous libraries for RT/IM or fragment ion intensity.
- Implements library export functionality through a new `export library` command
- Adds support for various calibration options (RT, IM, intensity) and filtering parameters
- Restricts library export to split parquet files only (OSW and non-split parquet files raise NotImplementedError)
Reviewed Changes
Copilot reviewed 11 out of 19 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_pyprophet_export.py | Adds test cases for library export functionality with different calibration and RT unit configurations |
| tests/_regtest_outputs/ | Reference outputs for the new library export test cases showing expected TSV format |
| pyprophet/io/export/split_parquet.py | Implements library-specific data reading logic with proper validation and SQL queries |
| pyprophet/io/export/parquet.py | Adds NotImplementedError for library export from non-split parquet files |
| pyprophet/io/export/osw.py | Adds NotImplementedError for library export from OSW files |
| pyprophet/io/_base.py | Implements library cleaning, processing, and export functionality with calibration support |
| pyprophet/cli/export.py | Adds new CLI command for library export with comprehensive configuration options |
| pyprophet/_config.py | Extends configuration to support library export parameters and options |
```
@@ -68,6 +68,18 @@ def read(self) -> pd.DataFrame:
    try:
        self._init_duckdb_views(con)

        if self.config.export_format == "library":
            if self._is_unscored_file():
                descr= "Files must be scored for library generation."
```
Inconsistent spacing around assignment operator. Should be 'descr = "Files must be scored for library generation."'
```
                logger.exception(descr)
                raise ValueError(descr)
            if not self._has_peptide_protein_global_scores():
                descr= "Files must have peptide and protein level global scores for library generation."
```
Inconsistent spacing around assignment operator. Should be 'descr = "Files must have peptide and protein level global scores for library generation."'
```
        if self.config.keep_decoys:
            decoy_query = ""
        else:
            decoy_query ="p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and"
```
Inconsistent spacing around assignment operator. Should be 'decoy_query = "p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and"'
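For context on the snippet above: the `decoy_query` fragment is presumably prepended to the WHERE clause of the library query. A minimal sketch of that pattern with a made-up query; the real SQL in `split_parquet.py` is more involved:

```python
keep_decoys = False  # stands in for self.config.keep_decoys

# Empty fragment when decoys are kept; otherwise a condition ending in "and"
# so it composes with the rest of the WHERE clause.
decoy_query = (
    ""
    if keep_decoys
    else "p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and"
)

# Hypothetical table and column names, for illustration only.
query = f"""
SELECT p.*, t.*
FROM precursors p
JOIN transitions t ON t.PRECURSOR_ID = p.PRECURSOR_ID
WHERE {decoy_query} p.QVALUE <= 0.01
"""
print(query)
```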
```
@@ -48,6 +48,7 @@
import duckdb
import pandas as pd
import polars as pl
import sklearn.preprocessing as preprocessing # For MinMaxScaler
```
Import comment should follow PEP 8 guidelines with proper spacing: the inline comment `# For MinMaxScaler` should be preceded by two spaces.
```suggestion
import sklearn.preprocessing as preprocessing  # For MinMaxScaler
```
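For readers unfamiliar with the import: `MinMaxScaler` rescales values into a fixed range (by default [0, 1]). A standalone sketch of scaling fragment intensities per precursor; whether the PR scales per precursor or globally is an assumption here, as are the column names:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy fragment intensities; column names are illustrative only.
frags = pd.DataFrame({
    "Precursor": ["p1", "p1", "p1", "p2", "p2"],
    "Intensity": [1200.0, 450.0, 90.0, 30000.0, 7500.0],
})

scaler = MinMaxScaler()  # maps each group's min to 0 and max to 1

# Scale within each precursor so relative fragment ranks are preserved.
frags["ScaledIntensity"] = frags.groupby("Precursor")["Intensity"].transform(
    lambda x: scaler.fit_transform(x.to_numpy().reshape(-1, 1)).ravel()
)
print(frags)
```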
| logger.info(f"Library Contains {len(data['Precursor'].drop_duplicates())} Precursors") | ||
|
|
||
| logger.info(f"Precursor Fragment Distribution (Before Filtering)") | ||
| num_frags_per_prec = data[['Precursor', 'TransitionId']].groupby("Precursor").count().reset_index(names='Precursor').groupby('TransitionId').count() |
This line is overly complex with multiple chained operations. Consider breaking it into multiple steps for better readability and debugging.
```suggestion
precursor_transition = data[['Precursor', 'TransitionId']]
precursor_counts = precursor_transition.groupby("Precursor").count()
precursor_counts_reset = precursor_counts.reset_index(names='Precursor')
num_frags_per_prec = precursor_counts_reset.groupby('TransitionId').count()
```
```
logger.info(f"After filtering, library contains {len(data['Precursor'].drop_duplicates())} Precursors")
if cfg.keep_decoys:
    logger.info("Of Which {} are decoys".format(len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())))
```
Use f-string formatting instead of .format() for consistency with the rest of the codebase and better performance: f"Of which {len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())} are decoys"
| logger.info("Of Which {} are decoys".format(len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates()))) | |
| logger.info(f"Of Which {len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())} are decoys") |
This adds functionality to export a .oswpq file or a .oswpqd file to a library. The library can use either the experimental or the previous library's RT/IM or fragment ion intensity.
Currently .osw and .parquet are unsupported, but they can be added in the future.