Preparing your dataset for processing¶

You should create a directory where you put all of the data and associated files described below.

Raw data¶

The pipeline does not currently support raw data; it begins instead with open-format mzML files.

mzML files¶

Currently, you need to convert your raw data files into mzML files manually, using MSConvert. Run each file through MSConvert twice: once with a threshold of 1000 (named with the suffix *.threshold1000.mzML) and once with no threshold (named *.mzML). Negative-mode peak-picking works better with the thresholded files, whereas positive-mode and MS2 processing do not need any pre-thresholding by MSConvert.

Note that the associated GitHub repository contains a helper script, msconvert_wrapper.py, which you can use to convert raw data into mzML files. Some raw data formats (e.g. Thermo Fisher) can only be converted with the Windows version of MSConvert. If this applies to your data, we recommend installing Cygwin and running msconvert_wrapper.py from the Cygwin command line.
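As a rough illustration of the two-pass conversion described above, the sketch below only builds the two MSConvert command lines for one raw file; it does not run them and is not the logic of msconvert_wrapper.py. The function name `build_msconvert_commands` is hypothetical. The flags (`--mzML`, `-o`, `--outfile`, `--filter "threshold ..."`) and the choice of an "absolute" intensity threshold are assumptions, so check them, including how `--outfile` treats the file extension, against `msconvert --help` for your installed version.

```python
from pathlib import Path


def build_msconvert_commands(raw_file, out_dir, threshold=1000):
    """Return two argument lists: the unthresholded and thresholded conversions.

    Hypothetical helper; the msconvert flags used here are assumptions.
    """
    raw_path = Path(raw_file)
    stem = raw_path.stem  # file name without its extension
    base = ["msconvert", str(raw_path), "--mzML", "-o", str(out_dir)]

    # Pass 1: no threshold, named <stem>.mzML
    unthresholded = base + ["--outfile", f"{stem}.mzML"]

    # Pass 2: intensity threshold, named <stem>.threshold1000.mzML
    # Assumption: "a threshold of 1000" means an absolute intensity cutoff.
    thresholded = base + [
        "--filter", f"threshold absolute {threshold} most-intense",
        "--outfile", f"{stem}.threshold{threshold}.mzML",
    ]
    return [unthresholded, thresholded]


if __name__ == "__main__":
    for command in build_msconvert_commands("mtab_alm_sample1.raw", "mzml_out"):
        print(" ".join(command))
```

Each returned list could be handed to `subprocess.run` once the flags have been verified for your MSConvert version.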

Sequence file¶

The sequence file should be provided to you by whoever ran the samples on the machine. The sequence file is essentially a mapping file containing the names of each data file and its corresponding metadata. Some metadata that is often present includes: instrument method, file path, injection volume, etc.

The required parts of the sequence file are as follows:

Sample ID: The first column in your sequence file should be the sample IDs. This is how each sample will be labeled in all downstream processing.
File Name: This column should contain the raw data file name (without any extensions). The processing code assumes that the mzML files are created from these file names. For example, if you have mtab_alm_sample1 in this column, the code assumes that the corresponding mzML files are mtab_alm_sample1.threshold1000.mzML and mtab_alm_sample1.mzML.
Ion Mode: This column contains the ion mode used for each sample. Accepted values are negative and positive.
Batches: This column specifies the “batches” of samples you want to align. When aligning the picked features to create an aligned feature table, you may only want to consider a subset of your samples (e.g. all PPL samples in one batch, all direct injection samples in another). Each sample can be in many (or no) batches. If a sample is in multiple batches, the batch names should be comma-separated within the same cell. The order of batches in a cell doesn’t matter, but the case does. If a batch contains samples from multiple ionization modes, that batch will not be aligned. Each batch yields an aligned feature table with only the samples in that batch aligned.
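The batch rules above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name `group_batches` and the tuple layout of its input are hypothetical, and stripping whitespace around batch names is an assumption. Batch names are compared case-sensitively, a sample may belong to several batches or none, and a batch that mixes ion modes is set aside rather than aligned.

```python
from collections import defaultdict


def group_batches(samples):
    """Group sample IDs by batch name.

    samples: iterable of (sample_id, ion_mode, batches_cell) tuples, where
    batches_cell is the comma-separated content of the Batches column.
    Returns (alignable, skipped), each a dict of batch name -> sample IDs.
    """
    members = defaultdict(list)
    modes = defaultdict(set)
    for sample_id, ion_mode, cell in samples:
        for name in cell.split(","):
            name = name.strip()  # assumption: surrounding spaces are ignored
            if not name:
                continue  # an empty cell means the sample is in no batch
            members[name].append(sample_id)
            modes[name].add(ion_mode.strip())

    # A batch containing more than one ionization mode is not aligned.
    alignable = {b: ids for b, ids in members.items() if len(modes[b]) == 1}
    skipped = {b: ids for b, ids in members.items() if len(modes[b]) > 1}
    return alignable, skipped


if __name__ == "__main__":
    example = [
        ("s1", "negative", "PPL,all"),
        ("s2", "negative", "PPL"),
        ("s3", "positive", "all"),
        ("s4", "negative", ""),
    ]
    alignable, skipped = group_batches(example)
    print("alignable:", alignable)
    print("skipped:", skipped)
```

In this example the batch `all` mixes negative and positive samples, so it ends up in `skipped`, while `PPL` is returned as alignable.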

The first row of the sequence file should contain at least the following case-insensitive column headers: SampleID, File Name, Ion Mode, and batches. If there is an additional line at the top of the file (above the column headers), delete it before providing the file to the pipeline. The sequence file is assumed to be comma-separated, but a different delimiter can be specified in the summary file with the attribute SEQUENCE_FILE_DELIMITER. To specify a different sequence file delimiter, include the Pythonic string representation of the separator. For example, a tab-delimited sequence file would have a SEQUENCE_FILE_DELIMITER of \t.
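To check a sequence file against these requirements before running the pipeline, something like the following sketch may help. It is not the pipeline's parser: the names `decode_separator`, `read_sequence_rows`, and `expected_mzml_names` are hypothetical, and the required header spellings are taken from the paragraph above. It turns a written-out separator such as `\t` into the real character, checks the headers case-insensitively, and derives the mzML file names expected for each File Name entry.

```python
import csv
import io

# Required headers as listed in the text, compared case-insensitively.
REQUIRED_HEADERS = ("sampleid", "file name", "ion mode", "batches")


def decode_separator(text):
    """Turn a written-out escape (backslash followed by t) into a real tab."""
    return text.encode("utf-8").decode("unicode_escape")


def read_sequence_rows(handle, separator=","):
    """Read a sequence file and return one dict per sample, keyed by lowercased header."""
    reader = csv.reader(handle, delimiter=separator)
    header = [column.strip().lower() for column in next(reader)]
    missing = [name for name in REQUIRED_HEADERS if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for fields in reader:
        if not any(fields):
            continue  # skip completely empty lines
        rows.append(dict(zip(header, fields)))
    return rows


def expected_mzml_names(file_name):
    """Return the unthresholded and thresholded mzML names for one File Name entry."""
    return f"{file_name}.mzML", f"{file_name}.threshold1000.mzML"


if __name__ == "__main__":
    text = (
        "SampleID\tFile Name\tIon Mode\tBatches\n"
        "s1\tmtab_alm_sample1\tnegative\tPPL,all\n"
    )
    for row in read_sequence_rows(io.StringIO(text), decode_separator("\\t")):
        print(row["sampleid"], expected_mzml_names(row["file name"]))
```

For a comma-separated file, a Batches cell holding several batch names needs to be quoted so that the `csv` module reads it as a single field.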

Summary File¶

Once you have your data and sequence file all sorted, you need to create a summary file that “talks” to raw2feats.py (through the SummaryParserMtab.py module). Your summary file should be a tab-delimited file named summary_file.txt and placed in the same directory as your sequence file.

Required attributes¶

The following attributes are required to be specified in summary_file.txt:

`DATASET_ID`	Identifier for this processing run. Output files will contain this string as an identifier.
`MODE`	Ionization mode. Accepted values are negative or positive. If you have both positive and negative mode files to process, you will need to do them in two separate runs.
`SEQUENCE_FILE`	Name of sequence file. Full path is not necessary, as sequence file is assumed to (and should) be in the same directory as the summary file.

Optional attributes¶

The following attributes can be specified in the summary file, but are not required for processing:

`DATA_DIRECTORY`	If the data is in a different directory, you can provide the full path to the directory here. Otherwise the code assumes that all of your mzML files are in the input directory.
`SEQUENCE_FILE_DELIMITER`	The sequence file is assumed to be comma-delimited. If this is not the case, specify the delimiter here (e.g. if your sequence file is tab-delimited, this should be `\t`).
`RIMAGE`	If peak-picking has already been performed on this dataset, you may provide the full path to an Rimage file containing the results, which will be loaded so that peak-picking can be skipped. This Rimage should contain an `xcmsSet` object named `xs`. If no Rimage file is specified, this attribute will be updated with the correct file once peak-picking has been run once, so you may re-use this summary file to run different alignments without necessarily re-picking peaks.
`RAW_DATA`	`True` if you are providing raw data that needs to be converted to mzML, `False` if you are directly providing mzML files. Note: the current code does NOT accept raw data. This functionality may be added in future iterations.

Metabolomics summary file attributes should be located between #mtab_start and #mtab_end in the summary file.

Sample Summary File¶

Note: all blank spaces are tab characters.

DATASET_ID test

#mtab_start
MODE negative
SEQUENCE_FILE test_sequence_file.csv
#mtab_end

Sample Summary File, with optional attributes¶

DATASET_ID test

#mtab_start
MODE negative
SEQUENCE_FILE test_sequence_file.csv
RIMAGE full/path/to/file.Rimage
SEQUENCE_FILE_DELIMITER \t
#mtab_end
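A quick way to sanity-check a summary file like the samples above is sketched below. This is not SummaryParserMtab.py; the function names `parse_summary` and `missing_required` are hypothetical, and skipping other lines that start with `#` is an assumption. It splits each line on the first tab, collects the attributes found between #mtab_start and #mtab_end separately from the rest, and reports any missing required attribute.

```python
REQUIRED_ATTRIBUTES = ("DATASET_ID", "MODE", "SEQUENCE_FILE")


def parse_summary(text):
    """Parse summary-file text into (general, mtab) attribute dicts."""
    general, mtab = {}, {}
    in_mtab = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "#mtab_start":
            in_mtab = True
            continue
        if line == "#mtab_end":
            in_mtab = False
            continue
        if line.startswith("#"):
            continue  # assumption: other marker or comment lines are ignored
        # Attribute name and value are separated by a tab character.
        key, _, value = line.partition("\t")
        target = mtab if in_mtab else general
        target[key.strip()] = value.strip()
    return general, mtab


def missing_required(general, mtab):
    """Return the required attribute names found in neither section."""
    found = set(general) | set(mtab)
    return [name for name in REQUIRED_ATTRIBUTES if name not in found]


if __name__ == "__main__":
    sample = (
        "DATASET_ID\ttest\n"
        "\n"
        "#mtab_start\n"
        "MODE\tnegative\n"
        "SEQUENCE_FILE\ttest_sequence_file.csv\n"
        "#mtab_end\n"
    )
    general, mtab = parse_summary(sample)
    print(general, mtab, missing_required(general, mtab))
```

Splitting on the first tab only means a value that is itself written as `\t`, as for SEQUENCE_FILE_DELIMITER, is kept as those two literal characters.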