Feature: Export Library #156

Open
wants to merge 26 commits into
base: master

jcharkow
Contributor

@jcharkow jcharkow commented Aug 8, 2025

This adds functionality to export a .oswpq or .oswpqd file to a library. The library can use either the experimental or the previous library's RT/IM and fragment ion intensity values.

Currently, .osw and .parquet inputs are unsupported, but support can be added in the future.

Contributor

@singjc singjc left a comment

Thanks for the addition! Looks mostly good to go; I just had a few questions and suggestions.

Comment on lines 371 to 372
type=float,
help="Filter results to maximum run-specific peak group-level q-value, should not use values > 0.01.",
Contributor

If we want to limit the filter to values less than or equal to 0.01, maybe we should change the type to click.FloatRange? Or add parameter validation in export_library if we want to limit the q-value thresholds.
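
A minimal sketch of the second option (validation inside the export step), assuming a hypothetical helper named `validate_max_qvalue`; neither the name nor the `strict` flag comes from this PR. It rejects thresholds outside [0, 1] and, above the 0.01 named in the help text, either warns or raises depending on `strict`.

```python
import warnings

# Assumption: mirrors the 0.01 value mentioned in the option's help text.
RECOMMENDED_MAX_QVALUE = 0.01


def validate_max_qvalue(value: float, strict: bool = False) -> float:
    """Hypothetical helper: check a q-value threshold before filtering."""
    # A q-value is a probability-like quantity, so anything outside [0, 1] is invalid.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"q-value threshold must be in [0, 1], got {value}")
    if value > RECOMMENDED_MAX_QVALUE:
        message = (
            f"q-value threshold {value} exceeds the recommended "
            f"{RECOMMENDED_MAX_QVALUE}"
        )
        if strict:
            # Hard limit: refuse thresholds above the recommendation.
            raise ValueError(message)
        # Soft limit: allow the threshold but tell the user.
        warnings.warn(message)
    return value
```

With `strict=False`, a 5% threshold passes with a warning. With `strict=True` it is rejected, which is roughly what a `click.FloatRange` with `max=0.01` would enforce at the CLI level.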

Contributor

Can you also add to the description that, if multiple runs contain the same precursor, the run with the lowest q-value is used?
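
To make the requested behaviour concrete, here is a rough sketch of keeping, for each precursor, the peak group from the run with the lowest q-value. The `PeakGroup` record, its field names, and the sample values are made up for illustration and are not the PR's data model.

```python
from typing import Dict, Iterable, NamedTuple


class PeakGroup(NamedTuple):
    """Hypothetical record for one scored peak group."""
    precursor: str
    run: str
    qvalue: float


def best_per_precursor(peak_groups: Iterable[PeakGroup]) -> Dict[str, PeakGroup]:
    """Keep, for every precursor, the peak group with the lowest q-value."""
    best: Dict[str, PeakGroup] = {}
    for pg in peak_groups:
        current = best.get(pg.precursor)
        # Replace the stored entry only when this run scores strictly better.
        if current is None or pg.qvalue < current.qvalue:
            best[pg.precursor] = pg
    return best


# Made-up example: the same precursor identified in two runs.
groups = [
    PeakGroup("PEPTIDEK_2", "run_a", 0.008),
    PeakGroup("PEPTIDEK_2", "run_b", 0.002),
    PeakGroup("ELVISK_2", "run_a", 0.004),
]
selected = best_per_precursor(groups)
```

Here `selected["PEPTIDEK_2"]` comes from `run_b`, since that run has the lower q-value for the precursor.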

Contributor Author

> If we want to limit the filter to values less than or equal to 0.01, maybe we should change the type to click.FloatRange? Or add parameter validation in export_library if we want to limit the q-value thresholds.

I am not sure I want to enforce a hard limit, because I am still experimenting with values greater than 1% FDR, and a value greater than 1% is fine if you are filtering to that value anyway.

E.g., if you are doing your entire analysis at 5% FDR, it is probably fine to use 5% FDR here.

Contributor

We should probably change the help description then: either remove the "should not use values > 0.01" part, or reword it as a suggestion.

@singjc
Contributor

singjc commented Aug 11, 2025

I think the tests need to be updated with the added rt_unit option?

@singjc singjc requested a review from Copilot August 19, 2025 17:12
Copilot

This comment was marked as outdated.

@jcharkow jcharkow requested review from Copilot and singjc August 19, 2025 21:56
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds functionality to export library files (.oswpq and .oswpqd) to a TSV library format that can be used with OpenSWATH. The export supports both experimental and previous libraries for RT/IM or fragment ion intensity.

  • Implements library export functionality through a new export library command
  • Adds support for various calibration options (RT, IM, intensity) and filtering parameters
  • Restricts library export to split parquet files only (OSW and non-split parquet files raise NotImplementedError)

Reviewed Changes

Copilot reviewed 11 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| tests/test_pyprophet_export.py | Adds test cases for library export functionality with different calibration and RT unit configurations |
| tests/_regtest_outputs/ | Reference outputs for the new library export test cases showing expected TSV format |
| pyprophet/io/export/split_parquet.py | Implements library-specific data reading logic with proper validation and SQL queries |
| pyprophet/io/export/parquet.py | Adds NotImplementedError for library export from non-split parquet files |
| pyprophet/io/export/osw.py | Adds NotImplementedError for library export from OSW files |
| pyprophet/io/_base.py | Implements library cleaning, processing, and export functionality with calibration support |
| pyprophet/cli/export.py | Adds new CLI command for library export with comprehensive configuration options |
| pyprophet/_config.py | Extends configuration to support library export parameters and options |

@@ -68,6 +68,18 @@ def read(self) -> pd.DataFrame:
try:
    self._init_duckdb_views(con)

    if self.config.export_format == "library":
        if self._is_unscored_file():
            descr= "Files must be scored for library generation."

Copilot AI Aug 19, 2025

Inconsistent spacing around assignment operator. Should be 'descr = "Files must be scored for library generation."'

            logger.exception(descr)
            raise ValueError(descr)
        if not self._has_peptide_protein_global_scores():
            descr= "Files must have peptide and protein level global scores for library generation."

Copilot AI Aug 19, 2025

Inconsistent spacing around assignment operator. Should be 'descr = "Files must have peptide and protein level global scores for library generation."'

if self.config.keep_decoys:
    decoy_query = ""
else:
    decoy_query ="p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and"

Copilot AI Aug 19, 2025

Inconsistent spacing around assignment operator. Should be 'decoy_query = "p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and"'
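
Separately from the spacing issue, one way to avoid hand-managing the trailing `and` in the decoy filter is to collect the conditions in a list and join them. This is only a sketch: `build_where_clause` and the `other_conditions` argument are hypothetical, and only the two decoy column names are taken from the snippet above.

```python
from typing import List, Sequence


def build_where_clause(keep_decoys: bool, other_conditions: Sequence[str]) -> str:
    """Hypothetical helper: join SQL filter conditions with 'and'."""
    conditions: List[str] = []
    if not keep_decoys:
        # Column names as they appear in the snippet under review.
        conditions.append("p.PRECURSOR_DECOY is false")
        conditions.append("t.TRANSITION_DECOY is false")
    # Any further filters supplied by the caller.
    conditions.extend(other_conditions)
    # Joining avoids a dangling 'and' whether or not decoys are filtered.
    return " and ".join(conditions)
```

Because the separator is added only between conditions, the decoy filter can be switched on or off without editing the rest of the query string.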

@@ -48,6 +48,7 @@
import duckdb
import pandas as pd
import polars as pl
import sklearn.preprocessing as preprocessing # For MinMaxScaler

Copilot AI Aug 19, 2025

Inline comments should be separated from the statement by at least two spaces, per PEP 8: ' # For MinMaxScaler' should be '  # For MinMaxScaler' (two spaces before #).

Suggested change
- import sklearn.preprocessing as preprocessing # For MinMaxScaler
+ import sklearn.preprocessing as preprocessing  # For MinMaxScaler

logger.info(f"Library Contains {len(data['Precursor'].drop_duplicates())} Precursors")

logger.info(f"Precursor Fragment Distribution (Before Filtering)")
num_frags_per_prec = data[['Precursor', 'TransitionId']].groupby("Precursor").count().reset_index(names='Precursor').groupby('TransitionId').count()

Copilot AI Aug 19, 2025

This line is overly complex with multiple chained operations. Consider breaking it into multiple steps for better readability and debugging.

Suggested change
- num_frags_per_prec = data[['Precursor', 'TransitionId']].groupby("Precursor").count().reset_index(names='Precursor').groupby('TransitionId').count()
+ precursor_transition = data[['Precursor', 'TransitionId']]
+ precursor_counts = precursor_transition.groupby("Precursor").count()
+ precursor_counts_reset = precursor_counts.reset_index(names='Precursor')
+ num_frags_per_prec = precursor_counts_reset.groupby('TransitionId').count()
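
For intuition about what the chained groupby computes, the same two-step count can be sketched in plain Python with `collections.Counter`. The `(Precursor, TransitionId)` rows below are made up, and `fragment_distribution` is a hypothetical name; the PR itself works on a pandas DataFrame.

```python
from collections import Counter

# Made-up (Precursor, TransitionId) pairs standing in for the DataFrame rows.
rows = [
    ("PEPTIDEK_2", 1), ("PEPTIDEK_2", 2), ("PEPTIDEK_2", 3),
    ("ELVISK_2", 4), ("ELVISK_2", 5),
    ("SAMPLER_3", 6), ("SAMPLER_3", 7),
]


def fragment_distribution(rows):
    """Map 'number of fragments' to 'number of precursors with that many'."""
    # Step 1: count transitions per precursor (the first groupby/count).
    frags_per_precursor = Counter(precursor for precursor, _ in rows)
    # Step 2: count precursors per fragment count (the second groupby/count).
    return Counter(frags_per_precursor.values())


distribution = fragment_distribution(rows)
```

In this toy input one precursor has three fragments and two precursors have two, so `distribution` maps 3 to 1 and 2 to 2.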


logger.info(f"After filtering, library contains {len(data['Precursor'].drop_duplicates())} Precursors")
if cfg.keep_decoys:
    logger.info("Of Which {} are decoys".format(len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())))

Copilot AI Aug 19, 2025

Use f-string formatting instead of .format() for consistency with the rest of the codebase and better performance: f"Of which {len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())} are decoys"

Suggested change
- logger.info("Of Which {} are decoys".format(len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())))
+ logger.info(f"Of Which {len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())} are decoys")
