Feature: Export Library #156

jcharkow · 2025-08-08T13:50:03Z

This adds functionality to export a .oswpq or .oswpqd file to a library. For RT/IM and for fragment ion intensity, the library can use either the experimental values or the values from the previous library.

Currently .osw and .parquet are unsupported but can be added in the future.

Adds options to calibrate or not calibrate IM, RT, and MS2 fragment intensity.

Also adds tests for osw/parquet that check for the NotImplementedError.
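For illustration, a minimal sketch of what such a test can look like. `StubOSWReader` and `raises_not_implemented` are hypothetical stand-ins written for this example; they are not pyprophet classes or the tests added in this PR.

```python
class StubOSWReader:
    """Hypothetical stand-in for the .osw reader; not pyprophet's class."""

    def __init__(self, export_format: str):
        self.export_format = export_format

    def read(self):
        # Library export is rejected up front for this input type.
        if self.export_format == "library":
            raise NotImplementedError(
                "Library export is not implemented for .osw input; "
                "use .oswpq or .oswpqd instead."
            )
        return []  # placeholder for the regular export path


def raises_not_implemented(reader) -> bool:
    """Return True if reader.read() raises NotImplementedError."""
    try:
        reader.read()
    except NotImplementedError:
        return True
    return False


if __name__ == "__main__":
    print(raises_not_implemented(StubOSWReader("library")))
```

In a pytest suite the same check would usually be written with `pytest.raises(NotImplementedError)`.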

If the q-values are the same for two different runs, then which precursor gets selected is undefined. This can mean that transitions that are part of the same transition group end up with different RT/IM. To address this, ties are broken by also sorting by RunId: if the q-values are the same, the first run is taken.
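A minimal pandas sketch of this tie-breaking, under stated assumptions: the column names `Precursor`, `RunId`, `QValue`, and `Intensity` are hypothetical, and the order (lowest q-value, then highest intensity, then lowest RunId) follows the commit messages in this PR rather than the actual implementation.

```python
import pandas as pd


def pick_best_run(df: pd.DataFrame) -> pd.DataFrame:
    """Keep, for each precursor, only the rows from one deterministic run."""
    # One row per (precursor, run) carrying the values used for ranking.
    per_run = df[["Precursor", "RunId", "QValue", "Intensity"]].drop_duplicates(
        subset=["Precursor", "RunId"]
    )
    # Lowest q-value first, then highest intensity, then lowest RunId.
    # RunId is unique per precursor here, so the order is fully determined.
    ranked = per_run.sort_values(
        by=["Precursor", "QValue", "Intensity", "RunId"],
        ascending=[True, True, False, True],
    )
    best = ranked.drop_duplicates(subset="Precursor", keep="first")
    # Keep every transition row that belongs to the selected run.
    return df.merge(best[["Precursor", "RunId"]], on=["Precursor", "RunId"])


if __name__ == "__main__":
    demo = pd.DataFrame(
        {
            "Precursor": ["A", "A", "A", "A"],
            "RunId": [1, 1, 2, 2],
            "QValue": [0.001, 0.001, 0.001, 0.001],
            "Intensity": [100.0, 100.0, 100.0, 100.0],
            "TransitionId": [10, 11, 10, 11],
        }
    )
    print(pick_best_run(demo))
```

Selecting the run once per precursor and then merging back means all transitions of a transition group come from the same run, so they share the same RT/IM.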

singjc

Thanks for the addition! Looks mostly good to go, I just had a few questions and suggestions.

singjc · 2025-08-11T17:08:28Z

pyprophet/cli/export.py

+    type=float,
+    help="Filter results to maximum run-specific peak group-level q-value, should not use values > 0.01.",


If we want to limit the filters to be only less than or equal to 0.01, maybe we should change the type to a click.FloatRange? Or add param validation in the export_library if we want to limit the qvalue thresholds.

Can you also add to the description that if there are multiple runs with the same precursor, the run with the lowest q-value is used?

> If we want to limit the filters to be only less than or equal to 0.01, maybe we should change the type to a click.FloatRange? Or add param validation in the export_library if we want to limit the qvalue thresholds.

I am not sure I want to enforce a hard limit, because I am still experimenting with values greater than 1% FDR, and a value greater than 1% is fine if you are filtering to that value anyway.

E.g., if you are doing your entire analysis at 5% FDR, it is probably fine to use 5% FDR here.

Should probably change the help description then. Either remove the "should not use values > 0.01", or change the wording as a suggestion.
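A small sketch of the two ideas discussed in this thread: `click.FloatRange` rejects values outside the valid q-value range, while values above 0.01 only produce a warning instead of a hard error. The option name `--max-qvalue` and the toy command are hypothetical and are not the actual pyprophet CLI.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "--max-qvalue",  # hypothetical option name for this example
    type=click.FloatRange(min=0.0, max=1.0),  # hard bounds: a valid q-value
    default=0.01,
    show_default=True,
    help="Maximum run-specific peak-group-level q-value. Values above 0.01 "
    "are only recommended if the whole analysis uses that FDR.",
)
def export_library(max_qvalue: float) -> None:
    """Toy command used only to demonstrate the option."""
    if max_qvalue > 0.01:
        # Soft limit: warn on stderr, but do not reject the value.
        click.echo("warning: q-value cutoff above 0.01", err=True)
    click.echo(f"cutoff={max_qvalue}")


def run(args):
    """Invoke the toy command in-process and return the click result."""
    return CliRunner().invoke(export_library, args)


if __name__ == "__main__":
    print(run(["--max-qvalue", "0.05"]).output)
```

Setting `max=0.01` in `FloatRange` instead would turn the recommendation into a hard limit, which is the behaviour the author preferred to avoid.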

pyprophet/cli/export.py

pyprophet/io/_base.py

pyprophet/io/export/osw.py

singjc · 2025-08-11T22:40:55Z

I think the tests need to be updated with the added rt_unit option?


Copilot

Pull Request Overview

This PR adds functionality to export library files (.oswpq and .oswpqd) to a TSV library format that can be used with OpenSWATH. The export supports both experimental and previous libraries for RT/IM or fragment ion intensity.

- Implements library export functionality through a new export library command
- Adds support for various calibration options (RT, IM, intensity) and filtering parameters
- Restricts library export to split parquet files only (OSW and non-split parquet files raise NotImplementedError)

Reviewed Changes

Copilot reviewed 11 out of 19 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| `tests/test_pyprophet_export.py` | Adds test cases for library export functionality with different calibration and RT unit configurations |
| `tests/_regtest_outputs/` | Reference outputs for the new library export test cases showing expected TSV format |
| `pyprophet/io/export/split_parquet.py` | Implements library-specific data reading logic with proper validation and SQL queries |
| `pyprophet/io/export/parquet.py` | Adds NotImplementedError for library export from non-split parquet files |
| `pyprophet/io/export/osw.py` | Adds NotImplementedError for library export from OSW files |
| `pyprophet/io/_base.py` | Implements library cleaning, processing, and export functionality with calibration support |
| `pyprophet/cli/export.py` | Adds new CLI command for library export with comprehensive configuration options |
| `pyprophet/_config.py` | Extends configuration to support library export parameters and options |


Copilot · 2025-08-19T21:57:41Z

pyprophet/io/export/split_parquet.py

@@ -68,6 +68,18 @@ def read(self) -> pd.DataFrame:
        try:
            self._init_duckdb_views(con)

+            if self.config.export_format == "library":
+                if self._is_unscored_file():
+                    descr= "Files must be scored for library generation."


Inconsistent spacing around assignment operator. Should be 'descr = "Files must be scored for library generation."'

Copilot · 2025-08-19T21:57:42Z

pyprophet/io/export/split_parquet.py

+                    logger.exception(descr)
+                    raise ValueError(descr)
+                if not self._has_peptide_protein_global_scores():
+                    descr= "Files must have peptide and protein level global scores for library generation."


Inconsistent spacing around assignment operator. Should be 'descr = "Files must have peptide and protein level global scores for library generation."'

Copilot · 2025-08-19T21:57:42Z

pyprophet/io/export/split_parquet.py

+        if self.config.keep_decoys:
+            decoy_query = ""
+        else:
+            decoy_query ="p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and" 


Inconsistent spacing around assignment operator. Should be 'decoy_query = "p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and"'
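Beyond the spacing, the fragment above ends in a trailing `and`, so it only works when another condition always follows it. One possible alternative, shown as a sketch only, is to collect the conditions in a list and join them. The function name and the `p.SCORE_MS2_Q_VALUE` column are assumptions made for this example, not taken from the PR's query.

```python
def build_where_clause(keep_decoys: bool, max_qvalue: float) -> str:
    """Compose a WHERE clause from optional conditions (illustrative only)."""
    conditions = []
    if not keep_decoys:
        # Same decoy filters as in the diff, one condition per entry.
        conditions.append("p.PRECURSOR_DECOY is false")
        conditions.append("t.TRANSITION_DECOY is false")
    # Hypothetical q-value column; float() guards against non-numeric input.
    conditions.append(f"p.SCORE_MS2_Q_VALUE <= {float(max_qvalue)}")
    return "WHERE " + " AND ".join(conditions)


if __name__ == "__main__":
    print(build_where_clause(keep_decoys=False, max_qvalue=0.01))
```

With this shape, no condition needs to know whether another one follows it.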

Copilot · 2025-08-19T21:57:42Z

pyprophet/io/_base.py

@@ -48,6 +48,7 @@
 import duckdb
 import pandas as pd
 import polars as pl
+import sklearn.preprocessing as preprocessing # For MinMaxScaler


Import comment should follow PEP 8 guidelines with proper spacing: ' # For MinMaxScaler' should be '  # For MinMaxScaler' (two spaces before #)

Suggested change

-import sklearn.preprocessing as preprocessing # For MinMaxScaler
+import sklearn.preprocessing as preprocessing  # For MinMaxScaler

Copilot · 2025-08-19T21:57:42Z

pyprophet/io/_base.py

+        logger.info(f"Library Contains {len(data['Precursor'].drop_duplicates())} Precursors")
+
+        logger.info(f"Precursor Fragment Distribution (Before Filtering)")
+        num_frags_per_prec = data[['Precursor', 'TransitionId']].groupby("Precursor").count().reset_index(names='Precursor').groupby('TransitionId').count()


This line is overly complex with multiple chained operations. Consider breaking it into multiple steps for better readability and debugging.

Suggested change

-num_frags_per_prec = data[['Precursor', 'TransitionId']].groupby("Precursor").count().reset_index(names='Precursor').groupby('TransitionId').count()
+precursor_transition = data[['Precursor', 'TransitionId']]
+precursor_counts = precursor_transition.groupby("Precursor").count()
+precursor_counts_reset = precursor_counts.reset_index(names='Precursor')
+num_frags_per_prec = precursor_counts_reset.groupby('TransitionId').count()
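For comparison, the same distribution (how many precursors have a given number of fragments) can also be computed with `value_counts`. This is only a sketch of an alternative, not what the PR uses; the helper name `fragment_distribution` is hypothetical, and the column names are taken from the diff above.

```python
import pandas as pd


def fragment_distribution(data: pd.DataFrame) -> pd.Series:
    """Map each fragment count to the number of precursors having it."""
    # Number of transitions (fragments) per precursor.
    frags_per_precursor = data.groupby("Precursor")["TransitionId"].count()
    # Count how many precursors share each fragment count.
    return frags_per_precursor.value_counts().sort_index()


if __name__ == "__main__":
    demo = pd.DataFrame(
        {
            "Precursor": ["A", "A", "A", "B", "B", "C", "C", "C"],
            "TransitionId": [1, 2, 3, 4, 5, 6, 7, 8],
        }
    )
    print(fragment_distribution(demo))
```

The result is a Series indexed by fragment count, which is convenient to log directly.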

Copilot · 2025-08-19T21:57:42Z

pyprophet/io/_base.py

+
+        logger.info(f"After filtering, library contains {len(data['Precursor'].drop_duplicates())} Precursors")
+        if cfg.keep_decoys:
+            logger.info("Of Which {} are decoys".format(len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())))


Use f-string formatting instead of .format() for consistency with the rest of the codebase and better performance: f"Of which {len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())} are decoys"

Suggested change

-logger.info("Of Which {} are decoys".format(len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())))
+logger.info(f"Of Which {len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())} are decoys")

jcharkow added 19 commits June 17, 2025 16:04

feature: start implementation of lib export with pyprophet

fbbffd3

more functionality to lib export

02d5d65

Add options to calibrate/not calibrate IM,RT, MS2 Frag

change default min frags to 4

1ef91f9

filter fragments with 0 library intensity

0c999a6

require a --out parameter

0a6e7db

change config from 6 to 4

5f41220

fix bugs, update docs

d725eb8

fix: export protein info in lib

9b108aa

fix: lib export compute annotation col if empty

e36e23f

Merge branch 'feature/polars_explode' into feature/lib_export

d4c48a5

feature: option to keep significant decoys in lib refinement

76bbff7

verbose: note that keep_decoys in lib gen is experimental feature

f8b3753

test: add test for lib generation

4b9076b

minor refactor for better support across different i/o

f455b10

add not implemented error for osw/parquet output

1667c25

also add tests for osw/parquet to test for the not implemented error

feature: add option to export rt unit in non iRT

609e84f

remove debug line

5c6bda2

swtich keep_decoys default to no-keep_decoys

bb5607c

singjc requested changes Aug 11, 2025


jcharkow and others added 5 commits August 12, 2025 15:03

update parameter descriptions

989ade1

fix: error description

0aae3a8

Co-authored-by: Justin Sing

Merge branch 'feature/lib_export' of github.com:Roestlab/pyprophet in…

ee3bb62

…to feature/lib_export

apply suggestions from PR review

c3daea1

test: update tests with new snapshots

3a8d203

after minor changes in data manipulation update snapshot tests

singjc requested a review from Copilot August 19, 2025 17:12


jcharkow added 2 commits August 19, 2025 16:07

feature: sort by intensity if q value tie

76ac52c

if still a tie sort by runId

apply copilot suggestions

1c32cb6

jcharkow requested review from Copilot and singjc August 19, 2025 21:56

Copilot AI reviewed Aug 19, 2025
