biosaur2 - A feature detection tool for LC-MS1 spectra. This project is a rewritten version of the Biosaur software (https://github.com/abdrakhimov1/Biosaur).
A centroided mzML file is required as input for the script.
The algorithm can be run with the following command:
biosaur2 path_to_MZML
or with precomputed hills input:
biosaur2 path_to_input.hills.tsv
biosaur2 path_to_input.hills.parquet
The script outputs peptide features in tsv format by default, or in parquet format when --feature_format parquet is used.
All available arguments can be shown with the command "biosaur2 -h".
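When biosaur2 is called from a pipeline, it can help to assemble the command line in one place. The following is a minimal sketch, not part of biosaur2 itself: the helper name build_biosaur2_command is hypothetical, and it only uses flags described in this README (-minlh, -nprocs, --feature_format, -o).

```python
import shlex
import subprocess


def build_biosaur2_command(input_path, minlh=None, nprocs=None,
                           feature_format=None, output=None):
    """Assemble an argument list for the biosaur2 CLI (hypothetical helper)."""
    cmd = ["biosaur2", str(input_path)]
    if minlh is not None:
        cmd += ["-minlh", str(minlh)]
    if nprocs is not None:
        cmd += ["-nprocs", str(nprocs)]
    if feature_format is not None:
        if feature_format not in ("tsv", "parquet"):
            raise ValueError("feature_format must be 'tsv' or 'parquet'")
        cmd += ["--feature_format", feature_format]
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


cmd = build_biosaur2_command("path_to_MZML", minlh=5, feature_format="parquet")
print(shlex.join(cmd))
# To actually run it (requires biosaur2 to be installed and on PATH):
# subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems with paths that contain spaces.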
The default value of the minlh parameter (the minimal number of consecutive scans for a peptide feature) is 1, which is optimal for ultra-short LC gradients (a few minutes). For longer LC gradients, this value can be increased to reduce feature detection time and remove noisy isotopic clusters.
For TOF data, please add the "-tof" argument.
For PASEF data, please convert the data to mzML using msconvert with the '--combineIonMobilitySpectra --filter "msLevel 1"' options. Do not use the --filter "scanSumming" option! It is often required for MS/MS data analysis but breaks MS1 feature detection.
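The PASEF conversion step can be sketched as a small command builder that also guards against the scanSumming filter. This is an illustration under assumptions: the helper name is hypothetical, the input path is a placeholder, and the --mzML and -o options are standard msconvert options that this README does not mention.

```python
def build_msconvert_command(raw_path, outdir=".", extra_filters=()):
    """Assemble an msconvert call for MS1-only PASEF conversion (hypothetical helper)."""
    for flt in extra_filters:
        # scanSumming breaks MS1 feature detection, so refuse it outright.
        if flt.strip().lower().startswith("scansumming"):
            raise ValueError("scanSumming must not be used for MS1 feature detection")
    cmd = [
        "msconvert", str(raw_path),
        "--mzML",                      # assumption: write mzML output
        "--combineIonMobilitySpectra",
        "--filter", "msLevel 1",
        "-o", str(outdir),             # assumption: output directory option
    ]
    for flt in extra_filters:
        cmd += ["--filter", flt]
    return cmd


print(build_msconvert_command("path_to_raw_data"))
```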
For negative mode data, please add the "-nm" argument.
Abdrakhimov, et al. Biosaur: An open-source Python software for liquid chromatography-mass spectrometry peptide feature detection with ion mobility support. https://doi.org/10.1002/rcm.9045
Using pip:
pip install biosaur2
-minlh: Minimum number of MS1 scans for peaks extracted from the mzML file. The optimal value is usually in the 1-3 range for 5-15 min LC gradients and in the 5-10 range for 60-180 min gradients. Default = 2
-mini : Minimal intensity threshold for peaks extracted from the mzML file. Default = 1
-minmz : Minimal m/z value for peaks extracted from the mzML file. Default = 350
-maxmz : Maximal m/z value for peaks extracted from the mzML file. Default = 1500
-htol : Mass accuracy in ppm to combine peaks into hills between scans. Default = 8 ppm
-itol : Mass accuracy in ppm for isotopic hills. Default = 8 ppm
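To make the ppm tolerances above concrete, the sketch below converts a ppm value into an absolute m/z window and checks whether two peaks agree within it. The function names are hypothetical and this is ordinary ppm arithmetic, not code taken from biosaur2.

```python
def ppm_window(mz, tol_ppm):
    """Return the (low, high) m/z bounds for a tolerance given in ppm."""
    delta = mz * tol_ppm * 1e-6
    return mz - delta, mz + delta


def within_ppm(mz_observed, mz_reference, tol_ppm):
    """True if the relative m/z difference does not exceed tol_ppm."""
    return abs(mz_observed - mz_reference) / mz_reference * 1e6 <= tol_ppm


low, high = ppm_window(500.0, 8)       # default htol/itol of 8 ppm
print(low, high)                       # roughly 500 -/+ 0.004
print(within_ppm(500.003, 500.0, 8))   # about 6 ppm apart
print(within_ppm(500.005, 500.0, 8))   # about 10 ppm apart
```

At m/z 500, the default 8 ppm corresponds to a window of about ±0.004; the same ppm value gives a wider absolute window at higher m/z.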
-ignore_iso_calib : Turns off accurate isotope error estimation when added as a parameter. The input "itol" value is then used instead of Gaussian fitting of mass errors and systematic shifts for every isotope number.
-o : Path to output feature file. Default is the input mzML name with .features.tsv (or .features.parquet when --feature_format parquet) in the same folder.
-hvf: Threshold to split a hill into multiple hills if the local minimum intensity multiplied by hvf is less than both surrounding local maxima. All peaks after splitting must have at least max(2, minlh) MS1 scans. Default = 1.3
-ivf: Threshold to split an isotope pattern into multiple features if the local minimum intensity multiplied by ivf is less than the right local maximum. The local minimum position should be higher than max(4th isotope, isotope position with maximum intensity according to the averagine model). Default = 5.0
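The -hvf valley criterion can be illustrated with a simplified check on a single candidate split point. This is a sketch under assumptions, not the biosaur2 implementation: the function name is hypothetical, the valley scan is assigned to the right-hand part, and no smoothing of intensities is applied.

```python
def should_split(intensities, valley_idx, hvf=1.3, minlh=2):
    """Decide whether a hill should be split at valley_idx (simplified sketch)."""
    min_scans = max(2, minlh)
    left = intensities[:valley_idx]
    right = intensities[valley_idx:]   # assumption: valley goes to the right part
    # Both resulting parts must keep enough MS1 scans.
    if len(left) < min_scans or len(right) < min_scans:
        return False
    valley = intensities[valley_idx] * hvf
    # The scaled valley must be below the maximum on both sides.
    return valley < max(left) and valley < max(right)


print(should_split([10, 50, 20, 60, 15], 2))   # deep valley between two maxima
print(should_split([10, 50, 45, 60, 15], 2))   # shallow valley
```

A larger hvf makes splitting stricter, because the valley has to be deeper relative to both neighbouring maxima.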
-nm : Negative mode. 1-true, 0-false. Affects only the neutral mass column calculated in the output features table. Default = 0
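The effect of the polarity on the neutral mass can be shown with the standard conversion from m/z and charge. This is a sketch that assumes the charge comes from gained (positive mode) or lost (negative mode) protons; the exact constant and formula used by biosaur2 are not confirmed here, and the function name is hypothetical.

```python
PROTON_MASS = 1.007276467  # Da


def neutral_mass(mz, charge, negative_mode=False):
    """Neutral mass from m/z and charge, assuming protonation or deprotonation."""
    sign = -1.0 if negative_mode else 1.0
    return (mz - sign * PROTON_MASS) * charge


print(neutral_mass(500.0, 2))                      # positive mode
print(neutral_mass(500.0, 2, negative_mode=True))  # negative mode
```

For the same m/z and charge, the two modes differ by 2 × charge × proton mass, which is why the wrong polarity setting shifts every neutral mass in the table.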
-cmin: Minimum allowed charge for isotopic clusters. Default = 1
-cmax: Maximal allowed charge for isotopic clusters. Default = 6
-nprocs: Number of processes used by biosaur2. Automatically set to 1 on Windows systems due to multiprocessing issues. Default = 4
-write_hills: Writes hills output when added as a parameter. Output format is controlled by --hills_format. When used without --stop_after_hills, both feature and hills outputs include feature_idx (1-based feature label). In hills output, feature_idx = -1 means the hill is not assigned to any detected feature.
--hills_format: Format for hills output generated by -write_hills. Supported values: tsv, parquet. Default = tsv. The parquet output is compressed with zstd (balanced compression ratio and speed).
--no_hill_list: For -write_hills, exclude hills_scan_lists, hills_intensity_list, and hills_mz_array from the hills output. This reduces size, but such a hills file cannot be used later as input for feature detection.
-write_hills output now includes scanApex (the mzML scan ID corresponding to rtApex, using the scan= value from spectrum ID, e.g. scan=1).
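As an illustration of how feature_idx can be used, the sketch below counts hills per feature and the number of unassigned hills (feature_idx = -1) in a tsv hills file. The sample rows are made up for the example, only the column names rtApex, scanApex and feature_idx come from the descriptions above, and a real hills file contains more columns.

```python
import csv
import io
from collections import Counter

# Made-up sample in the shape of a hills tsv; values are illustrative only.
SAMPLE_HILLS_TSV = (
    "rtApex\tscanApex\tfeature_idx\n"
    "1.20\t12\t1\n"
    "1.25\t13\t1\n"
    "2.40\t30\t-1\n"
    "3.10\t41\t2\n"
)


def summarize_hills(handle):
    """Return (hills per feature, number of unassigned hills)."""
    per_feature = Counter()
    unassigned = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        idx = int(row["feature_idx"])
        if idx == -1:
            unassigned += 1      # hill not assigned to any detected feature
        else:
            per_feature[idx] += 1
    return per_feature, unassigned


per_feature, unassigned = summarize_hills(io.StringIO(SAMPLE_HILLS_TSV))
print(dict(per_feature), unassigned)
```

For a real file, pass an open file handle instead of the in-memory sample.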
--write_ms1: Write MS1 summary output (default file suffix: .ms1.tsv or .ms1.parquet). Default columns: scan_id, RT, total_intensity. scan_id follows mzML scan= numbering (e.g. scan=1), and RT is written in seconds.
--ms1_format: Format for MS1 summary output generated by --write_ms1. Supported values: tsv, parquet. Default = tsv.
--feature_format: Format for feature output. Supported values: tsv, parquet. Default = tsv.
--stop_after_hills: Automatically enables -write_hills and stops the run after hills are written, skipping feature detection. In this mode, feature_idx is not added because feature detection is skipped.
-write_extra_details: Add extra diagnostic columns to feature output (for example: isotope candidate details, theoretical/experimental isotope intensity vectors, and monoisotopic hill/index identifiers). Useful for debugging and method development; increases output file size.
--no-mono-hills: Exclude mono_hills_scan_lists and mono_hills_intensity_list from feature output (-dia requires these columns, so this option cannot be combined with -dia).
--64: For parquet output, store key hill/feature coordinates, MS1 summary columns, and list elements as 64-bit values. By default, parquet output uses 32-bit values (int32/float32), including list elements in hills_scan_lists, hills_intensity_list, hills_mz_array, and mono_hills_* columns.
-pasefminlh: For TIMS-TOF data. Minimum number of ion mobility values for m/z peaks to be kept in the analysis. Default = 1
-paseftol: For TIMS-TOF data. Ion mobility tolerance used to combine close peaks into a single one. Default = 0.05
-pasefmini: For TIMS-TOF data. Minimal intensity threshold for peaks after combining peaks with close m/z (itol option) and ion mobility (paseftol option) values. Default = 100
-tof: Experimental. If added as a parameter, biosaur2 estimates the noise intensity distribution across the m/z range and automatically calculates intensity cutoffs for different m/z ranges. This is an alternative way to reduce noise compared to the "-mini" option, which applies a fixed intensity threshold to all m/z values. Can be useful for TOF data.
- GitHub repo & issue tracker: https://github.com/markmipt/biosaur2
- Mailing list: markmipt@gmail.com