Parameters
ANN-SoLo can be fully configured. Parameters can be specified as command-line arguments or in a configuration file. If an argument is specified in more than one place, then command-line values override configuration file values which override the default values.
The configuration file syntax consists of key = value pairs.
There are three required, positional command-line arguments which cannot be listed in the configuration file (see below). All other parameters can be specified using either the appropriate command-line flag or in the configuration file.
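The precedence rule can be illustrated with a minimal Python sketch. This is not ANN-SoLo's own parsing code: the `parse_config` helper, the configuration file contents, and the command-line values are all hypothetical, and only the defaults are taken from the parameter descriptions on this page.

```python
from collections import ChainMap


def parse_config(text):
    """Parse 'key = value' lines into a dict of strings (hypothetical helper)."""
    settings = {}
    for line in text.splitlines():
        # Ignore lines that do not contain a key = value pair.
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Defaults as documented on this page.
defaults = {"fdr": "0.01", "mode": "ann", "bin_size": "0.04"}

# Hypothetical configuration file contents.
config_text = """
precursor_tolerance_mass = 20
precursor_tolerance_mode = ppm
fragment_mz_tolerance = 0.02
fdr = 0.05
"""

# Hypothetical command-line values, already parsed into a dict.
command_line = {"fdr": "0.02"}

# ChainMap searches its mappings from left to right, so the command line
# overrides the configuration file, which overrides the defaults.
resolved = ChainMap(command_line, parse_config(config_text), defaults)

print(resolved["fdr"])                       # taken from the command line
print(resolved["precursor_tolerance_mode"])  # taken from the configuration file
print(resolved["mode"])                      # taken from the defaults
```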
The input and output files need to be specified in the following order:
spectral_library_filename query_filename out_filename
- spectral_library_filename: The path of the spectral library file (in .splib format) used to identify the experimental spectra.
- query_filename: The path of the spectral file (in .mgf format) to be searched.
- out_filename: The path of the mzTab output file containing the search results. The ".mztab" extension is added if the filename does not already have it.
- -h / --help: Show the help message with command-line argument information and exit.
- -c / --config: The path of the configuration file.
ANN-SoLo uses a two-stage cascade search. During the first stage of the cascade search, unmodified spectra are identified using a small precursor mass window. Specify this precursor mass window (typically in the ppm range) using the following parameters:
- --precursor_tolerance_mass PRECURSOR_TOLERANCE_MASS: The precursor tolerance mass for the first stage of the cascade search.
- --precursor_tolerance_mode {Da,ppm}: The precursor tolerance mode for the first stage of the cascade search. Can be either "Da" or "ppm".
During the second stage of the cascade search, modified spectra are identified using a wide precursor mass window. Specify this precursor mass window (typically in the range of hundreds of Da) using the following parameters:
- --precursor_tolerance_mass_open PRECURSOR_TOLERANCE_MASS_OPEN: The precursor tolerance mass for the second stage of the cascade search.
- --precursor_tolerance_mode_open {Da,ppm}: The precursor tolerance mode for the second stage of the cascade search. Can be either "Da" or "ppm".
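To make the difference between the two tolerance modes concrete, the following sketch computes the mass window that a tolerance value implies. It assumes a symmetric window around a neutral precursor mass and ignores charge; the `precursor_window` function is hypothetical and is not part of ANN-SoLo.

```python
def precursor_window(precursor_mass, tolerance, mode):
    """Return the (low, high) mass window for a tolerance in "Da" or "ppm".

    Assumption: the window is symmetric around the precursor mass.
    """
    if mode == "ppm":
        # Parts per million scale with the mass itself.
        delta = precursor_mass * tolerance / 1e6
    elif mode == "Da":
        # An absolute tolerance is independent of the mass.
        delta = tolerance
    else:
        raise ValueError(f"Unknown tolerance mode: {mode!r}")
    return precursor_mass - delta, precursor_mass + delta


# Narrow first-stage window: 20 ppm around a 1000 Da precursor.
print(precursor_window(1000.0, 20, "ppm"))

# Wide second-stage (open) window: 300 Da around the same precursor.
print(precursor_window(1000.0, 300, "Da"))
```

For a 1000 Da precursor, 20 ppm corresponds to a deviation of only 0.02 Da in each direction, whereas the open window spans hundreds of Da.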
Specify the fragment mass tolerance (in m/z) for both stages of the cascade search as follows:
- --fragment_mz_tolerance FRAGMENT_MZ_TOLERANCE: The fragment mass tolerance (in m/z).
These parameters have no default values because they depend heavily on the properties of your data set, so they always need to be specified.
The one exception is the precursor mass tolerance for the second stage of the cascade search: omit it to perform a standard search instead of a two-stage open search.
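As an illustration of how the required arguments fit together, the sketch below assembles a command line for either an open or a standard search. The executable name `ann_solo`, the file names, and all tolerance values are assumptions chosen for the example; placing the positional arguments before the flags is also just a choice made here.

```python
import shlex


def build_command(open_search=True):
    """Assemble an example command line (all values are illustrative)."""
    command = [
        "ann_solo",  # assumed executable name
        # Positional arguments, in the required order.
        "library.splib", "queries.mgf", "results.mztab",
        # First stage of the cascade search: narrow precursor window.
        "--precursor_tolerance_mass", "20",
        "--precursor_tolerance_mode", "ppm",
        "--fragment_mz_tolerance", "0.02",
    ]
    if open_search:
        # Second stage of the cascade search: wide precursor window.
        # Leaving these out results in a standard search.
        command += [
            "--precursor_tolerance_mass_open", "300",
            "--precursor_tolerance_mode_open", "Da",
        ]
    return command


print(shlex.join(build_command(open_search=True)))
print(shlex.join(build_command(open_search=False)))
```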
Some search parameters are only in effect during the second stage of the cascade search:
- --allow_peak_shifts: Boolean flag to enable the shifted dot product, which better matches modified spectra against their unmodified counterparts by accounting for a single modification.
ANN-SoLo automatically controls the FDR during each part of its cascade search and only SSMs below the specified FDR threshold will be reported. The following parameters can be used to control the FDR threshold and how subgroup FDR filtering for the open search is performed:
- --fdr FDR: The FDR threshold used to accept identifications during each stage of the cascade search. SSMs below the FDR threshold are accepted; spectra without an accepted SSM are passed on to the next stage of the cascade search. The default is 0.01.
- --fdr_tolerance_mass FDR_TOLERANCE_MASS: The bin width used to group SSMs for subgroup FDR calculation during the second stage of the cascade search. The default is 0.1 Da.
- --fdr_tolerance_mode {Da,ppm}: The unit in which the FDR tolerance bin width is specified. Can be either "Da" or "ppm". The default is "Da".
- --fdr_min_group_size FDR_MIN_GROUP_SIZE: The minimum group size required to perform FDR control individually for a subgroup. Subgroups that contain fewer SSMs are combined into a residual group instead, whose FDR is calculated jointly at the end. The default is 20.
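The interplay between the bin width and the minimum group size can be sketched as follows. This is a simplified illustration that assigns SSMs to fixed-width bins by their precursor mass difference; ANN-SoLo's actual grouping procedure may differ, and the `group_ssms` function and the example mass differences are hypothetical.

```python
from collections import defaultdict


def group_ssms(mass_diffs, bin_width=0.1, min_group_size=20):
    """Split SSM indices into subgroups and a residual group.

    Simplifying assumption: SSMs are assigned to fixed-width bins (in Da)
    based on their precursor mass difference.
    """
    bins = defaultdict(list)
    for index, diff in enumerate(mass_diffs):
        bins[round(diff / bin_width)].append(index)

    groups, residual = {}, []
    for key, members in bins.items():
        if len(members) >= min_group_size:
            # Large enough for individual FDR control.
            groups[key] = members
        else:
            # Too small: pooled with the other small subgroups.
            residual.extend(members)
    return groups, residual


# Illustrative mass differences: one frequent and two rare modifications.
mass_diffs = [15.995] * 25 + [0.984] * 3 + [79.966] * 2
groups, residual = group_ssms(mass_diffs)
print(len(groups), len(residual))
```

With the default minimum group size of 20, only the frequent mass difference forms its own subgroup; the five remaining SSMs end up in the residual group.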
- --mode {ann,bf}: Search using ANN indexing ("ann") or the traditional brute-force ("bf") mode. The default is "ann".
- --no_gpu: Boolean flag to not use the GPU for ANN indexing. The default is to use a GPU if available.
ANN-SoLo uses approximate nearest neighbor indexing to speed up open modification searching. The following parameters control how spectra are converted into vectors for ANN indexing:
- --bin_size BIN_SIZE: The bin width (in Da) used to convert MS/MS spectra to vectors. Ideally, the bin width should be slightly larger than the fragment mass tolerance to tightly capture the fragment masses while allowing small negative or positive mass deviations. The default is 0.04 Da.
- --hash_len HASH_LEN: The number of hash bins used to convert a high-dimensional spectrum vector to a low-dimensional vector using feature hashing. The default is 800.
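The conversion from a spectrum to a fixed-length vector can be sketched as below. This is a minimal illustration of binning followed by feature hashing, not ANN-SoLo's implementation: `zlib.crc32` stands in for whichever hash function is actually used, and the `spectrum_to_vector` function and its `min_mz` offset are assumptions.

```python
import math
import zlib


def spectrum_to_vector(peaks, min_mz=11.0, bin_size=0.04, hash_len=800):
    """Convert (m/z, intensity) peaks to a vector of length hash_len."""
    vector = [0.0] * hash_len
    for mz, intensity in peaks:
        # Step 1: assign the peak to a bin of width bin_size.
        bin_index = math.floor((mz - min_mz) / bin_size)
        # Step 2: hash the (potentially very large) bin index down to one
        # of hash_len positions. crc32 is a stand-in hash function.
        position = zlib.crc32(str(bin_index).encode()) % hash_len
        # Peaks that share a position have their intensities summed.
        vector[position] += intensity
    return vector


peaks = [(100.01, 1.0), (100.02, 2.0), (500.30, 0.5)]
vector = spectrum_to_vector(peaks)
print(len(vector), sum(vector))
```

The first two peaks fall into the same 0.04 Da bin and therefore always land on the same vector position. A smaller `hash_len` gives shorter vectors at the cost of more collisions between unrelated bins.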
The following parameters control the ANN index performance:
- --num_candidates NUM_CANDIDATES: The number of candidates to retrieve from the ANN index for each query spectrum for final rescoring. The more candidates that are retrieved, the smaller the chance of missing the best matching library spectra in case the query spectrum is heavily modified, at the expense of running time. The maximum number of candidates when using a GPU for ANN searching is 1024. The default is 1024.
- --batch_size BATCH_SIZE: The number of query spectra to process simultaneously. The appropriate value depends on how much memory the GPU has. The default is 16384.
- --num_list NUM_LIST: The number of lists used to partition the spectral library during inverted index construction. Higher is generally better to get a more fine-grained partitioning of the data space. The default is 256.
- --num_probe NUM_PROBE: The number of lists to inspect during querying, up to the number of lists used to construct the ANN index. The more lists that are inspected during querying, the smaller the chance of missing the best matching library spectra, at the expense of running time. For computational efficiency, we don't recommend setting this parameter higher than half of num_list. The maximum when using a GPU for ANN indexing is 1024. The default is 128.
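The constraints stated above can be collected into a small sanity check. This is only a sketch of the documented limits; the `check_index_settings` function is hypothetical and ANN-SoLo's own validation, if any, may behave differently.

```python
GPU_LIMIT = 1024  # documented maximum for num_candidates and num_probe on a GPU


def check_index_settings(num_list=256, num_probe=128, num_candidates=1024,
                         use_gpu=True):
    """Return a list of warnings for settings that conflict with this page."""
    problems = []
    if num_probe > num_list:
        problems.append("num_probe exceeds num_list")
    elif num_probe > num_list / 2:
        problems.append("num_probe is more than half of num_list")
    if use_gpu and num_probe > GPU_LIMIT:
        problems.append("num_probe exceeds the GPU maximum of 1024")
    if use_gpu and num_candidates > GPU_LIMIT:
        problems.append("num_candidates exceeds the GPU maximum of 1024")
    return problems


print(check_index_settings())                # the defaults
print(check_index_settings(num_probe=200))   # more than half of num_list
print(check_index_settings(num_candidates=2048, use_gpu=False))
```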
The following parameters can be used to customize spectrum processing prior to spectral matching:
- --resolution RESOLUTION: The m/z resolution; m/z values are rounded to the given number of decimal places. The default is no m/z rounding.
- --min_mz MIN_MZ: The minimum m/z value to consider (inclusive). The default is 11 m/z.
- --max_mz MAX_MZ: The maximum m/z value to consider (inclusive). The default is 2010 m/z.
- --remove_precursor: Boolean flag to remove peaks around the precursor mass. The default is not to remove such peaks.
- --remove_precursor_tolerance REMOVE_PRECURSOR_TOLERANCE: The window (in m/z) around the precursor mass in which peaks are removed. The default is 0 m/z.
- --min_intensity MIN_INTENSITY: Remove peaks whose intensity relative to the base peak intensity is lower than this value. The default is 0.01.
- --min_peaks MIN_PEAKS: Discard low-quality spectra with fewer peaks than this value. The default is 10.
- --min_mz_range MIN_MZ_RANGE: Discard low-quality spectra with a smaller mass range than this value. The default is 250 m/z.
- --max_peaks_used MAX_PEAKS_USED: Only retain the specified number of most intense peaks for the query spectra. The default is 50.
- --max_peaks_used_library MAX_PEAKS_USED_LIBRARY: Only retain the specified number of most intense peaks for the library spectra. The default is 50.
- --scaling {sqrt,rank}: Scale the peak intensities by their square root ("sqrt") or by their rank ("rank") to reduce the influence of very intense peaks. The default is "rank".
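The effect of these settings on a single spectrum can be sketched as follows. The order of the steps and the exact rank definition (least intense peak gets rank 1) are assumptions, precursor peak removal and m/z rounding are left out for brevity, and the `preprocess` function is hypothetical rather than ANN-SoLo's own code.

```python
import math


def preprocess(peaks, min_mz=11.0, max_mz=2010.0, min_intensity=0.01,
               min_peaks=10, min_mz_range=250.0, max_peaks_used=50,
               scaling="rank"):
    """Return processed (m/z, intensity) peaks, or None if discarded."""
    # Keep peaks inside the inclusive m/z range.
    peaks = [(mz, i) for mz, i in peaks if min_mz <= mz <= max_mz]
    if not peaks:
        return None
    base = max(i for _, i in peaks)
    if base <= 0:
        return None
    # Remove peaks that are too weak relative to the base peak.
    peaks = [(mz, i) for mz, i in peaks if i / base >= min_intensity]
    # Retain only the most intense peaks, then restore m/z order.
    peaks = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks_used]
    peaks.sort()
    # Discard low-quality spectra.
    if len(peaks) < min_peaks or peaks[-1][0] - peaks[0][0] < min_mz_range:
        return None
    if scaling == "sqrt":
        return [(mz, math.sqrt(i)) for mz, i in peaks]
    if scaling == "rank":
        # Replace each intensity by its rank (1 = least intense peak).
        order = sorted(range(len(peaks)), key=lambda k: peaks[k][1])
        ranks = [0.0] * len(peaks)
        for rank, k in enumerate(order, start=1):
            ranks[k] = float(rank)
        return [(mz, r) for (mz, _), r in zip(peaks, ranks)]
    raise ValueError(f"Unknown scaling: {scaling!r}")


# Illustrative spectrum: 12 peaks spanning 330 m/z.
spectrum = [(100.0 + 30 * k, float(k + 1)) for k in range(12)]
print(preprocess(spectrum))
print(preprocess(spectrum[:5]))  # too few peaks, so the spectrum is discarded
```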