Data processing code

raw2feats.py

raw2feats.py is the main workhorse of the MS processing code. It reads in the summary file, picks peaks (if applicable), and aligns peaks for all batches specified in the sequence file. It basically coordinates all of the inputs and outputs and calls wrappers to R functions as necessary. Most of the functions that actually do work (i.e. pick and align peaks) are found in preprocessing_mtab.py.

Parsing the summary file

The SummaryParserMtab.py module simply reads in summary_file.txt and stores its attributes in a dictionary. SummaryParserMtab.py looks for the summary_file.txt in the input directory.

Picking peaks

If an Rimage file is specified in summary_file.txt, this part is skipped. If not, raw2feats.py calls the pick_peaks function in the preprocessing_mtab.py module. pick_peaks() calls pick_peaks.R and saves the PDF and Rimage files resulting from the call to xcmsSet. The Rimage file contains an xcmsSet object called xs. Once peaks are picked, summary_file.txt is updated with the respective RIMAGE file so that future processing calls skip the time-consuming peak picking step and go straight to aligning.

Aligning peaks

After peaks are picked, raw2feats.py reads in all of the specified batches in the batches column in the sequence file. One sample may be in multiple batches - batch names should be comma-separated in the sample’s cell in the batches column. If a batch contains samples of multiple ionization modes, that batch is thrown out and never processed.

align_peaks.R first loads in the Rimage file that was either specified in the summary file or created by picking peaks. It identifies which samples to align and uses the xcms functions to align peaks across samples, group these peaks together, and fill in any peaks that weren’t found in individual samples but are considered real peaks in some other samples. align_peaks.R then finds isotopes and adducts (for the specified mode) using CAMERA.

Back in preprocessing_mtab.py, the sample IDs in the aligned table, which are currently the mzML file names, are replaced by their sample ID in the sequence file.