Untargeted metabolomics processing pipeline¶
This documentation describes the process of going from open-source mzML mass spectrometry files to aligned feature table(s). This is orchestrated by the script raw2feats.py. Each dataset folder must contain a machine-readable text file called a summary file which gives instructions to raw2feats.py. The format of this summary file, and all files required for MS processing, are described in this documentation.
Preparing your dataset for processing¶
You should create a directory where you put all of the data and associated files described below.
Raw data¶
The pipeline does not currently support raw data, and begins instead with open-sourced mzMLfiles.
mzML files¶
Currently, you need to convert your raw data files into mzML files
manually. Use MSConvert for this. Each file should be run through
MSConvert twice: once with a threshold of 1000, (named with the suffix
*.threshold1000.mzML
), and once with no threshold (named
*.mzML
). Negative mode peak-picking works better with the
thresholded files, whereas positive-mode and MS2 processing don’t need
any of the pre-thresholding by MSConvert.
Note that the associated github
repository contains a
helper script, msconvert_wrapper.py
which you can use to easily
convert raw data into mzML files. Note that some raw data formats (e.g.
Thermo Fisher) can only be converted with the Windows version of
MSConvert. If this is your case, we recommend installing
Cygwin and using msconvert_wrapper.py
from that command line.
Sequence file¶
The sequence file should be provided to you by whoever ran the samples on the machine. The sequence file is essentially a mapping file containing the names of each data file and its corresponding metadata. Some metadata that is often present includes: instrument method, file path, injection volume, etc.
The required parts of the sequence file are as follows:
- Sample ID: The first column in your sequence file should be the sample IDs. This is how each sample will be labeled in all downstream processing.
- File Name: This column should contain the raw data file name
(without any extensions). The processing code assumes that the mzML
files are created from these file names. For example, if you have
mtab_alm_sample1
in this column, the code assumes that the corresponding mzML files aremtab_alm_sample1.threshold1000.mzML
andmtab_alm_sample1.mzML
. - Ion Mode: This column contains the ion mode used for each sample.
Accepted values are
negative
andpositive
. - Batches: This column specifies the “batches” of samples you want to align. When aligning the picked features to created an aligned feature table, you may only want to consider a subset of your samples (e.g. all PPL samples in one batch, all direct injection samples in another). Each sample can be in many (or no) batches. If a sample is in multiple batches, the batch names should be comma-separated within the same cell. The order of batches in a cell doesn’t matter, but the case does. If a batch contains samples from multiple ionization modes, that batch will not be aligned. Each batch yields an aligned feature table with only the samples in that batch aligned.
The first row of the sequence file should contain at least the following
case- insensitive column headers: SampleID
, File Name
,
Ion Mode
, and batches
. If there is an additional line at the top
of the file (above the column headers), delete it before providing to
the pipeline. The sequence file is assumed to be comma-separated, but
the delimitation can be specified in the summary file with the attribute
SEQUENCE_FILE_SEPARATOR
. To specify a different sequence file
delimiter, include the Pythonic string representation of the separator.
For example, a tab-delimited sequence file would have a
SEQUENCE_FILE_SEPARATOR
of \t
.
Summary File¶
Once you have your data and sequence file all sorted, you need to create
a summary file that “talks” to raw2feats.py
(through the
SummaryParserMtab.py
module). Your summary file should be a
tab-delimited file named summary_file.txt
and placed in the same
directory as your sequence file.
Required attributes¶
The following attributes are required to be specified in
summary_file.txt
:
|
|
Ionization mode. Accepted values are negative or
positive . If you have both positive and negative
mode files to process, you will need to do them in two
separate runs. |
|
Name of sequence file. Full path is not necessary, as sequence file is assumed to (and should) be in the same directory as the summary file. |
Optional attributes¶
The following attributes can be specified in the summary file, but are not required for processing:
DATA_DIRECTORY |
If the data is in a different directory, you can provide the full path to the directory here. Otherwise the code assumes that all of your mzML files are in the input directory. |
SEQUENCE_FILE_DELIMITER |
The sequence file is assumed to be
comma-delimited. If this is not the case,
specify the delimitation here. (i.e. if
your sequence file is tab-delimited, this
should be \t ). |
RIMAGE |
If peak-picking has already been performed
on this dataset, you may provide the full
path to an Rimage file containing this
results to load up and skip the
peak-picking. This Rimage should contain
an xcmsSet object named xs . If no
Rimage file is specified, this attribute
will be updated with the correct file once
peak-picking has been run once, so you may
re-use this summary file to run different
alignments without necessarily re-picking
peaks. |
RAW _DATA |
True if you are providing raw data
that needs to be converted to mzML,
False if you are directly providing
mzML files. Note: the current code does
NOT accept raw data. This functionality
may be added in future iterations. |
Metabolomics summary file attributes should be located between
#mtab_start
and #mtab_end
in the summary file.
Sample Summary File¶
Note: all blank spaces are tab characters.
DATASET_ID test
#mtab_start
MODE negative
SEQUENCE_FILE test_sequence_file.csv
#mtab_end
Sample Summary File, with optional attributes¶
DATASET_ID test
#mtab_start
MODE negative
SEQUENCE_FILE testsequencefile.csv
RIMAGE full/path/to/file.Rimage
SEQUENCE_FILE_DELIMITER \t
#mtab_end
Data processing code¶
raw2feats.py¶
raw2feats.py
is the main workhorse of the MS processing code. It
reads in the summary file, picks peaks (if applicable), and aligns peaks
for all batches specified in the sequence file. It basically coordinates
all of the inputs and outputs and calls wrappers to R functions as
necessary. Most of the functions that actually do work (i.e. pick and
align peaks) are found in preprocessing_mtab.py
.
Parsing the summary file¶
The SummaryParserMtab.py
module simply reads in summary_file.txt
and stores its attributes in a dictionary. SummaryParserMtab.py
looks for the summary_file.txt
in the input directory.
Picking peaks¶
If an Rimage file is specified in summary_file.txt
, this part is
skipped. If not, raw2feats.py
calls the pick_peaks
function in
the preprocessing_mtab.py
module. pick_peaks()
calls
pick_peaks.R
and saves the PDF and Rimage files resulting from the
call to xcmsSet
. The Rimage file contains an xcmsSet
object
called xs
. Once peaks are picked, summary_file.txt
is updated
with the respective RIMAGE
file so that future processing calls skip
the time-consuming peak picking step and go straight to aligning.
Aligning peaks¶
After peaks are picked, raw2feats.py
reads in all of the specified
batches in the batches
column in the sequence file. One sample may
be in multiple batches - batch names should be comma-separated in the
sample’s cell in the batches
column. If a batch contains samples of
multiple ionization modes, that batch is thrown out and never processed.
align_peaks.R
first loads in the Rimage file that was either
specified in the summary file or created by picking peaks. It identifies
which samples to align and uses the xcms functions to align peaks across
samples, group these peaks together, and fill in any peaks that weren’t
found in individual samples but are considered real peaks in some other
samples. align_peaks.R
then finds isotopes and adducts (for the
specified mode) using CAMERA
.
Back in preprocessing_mtab.py
, the sample IDs in the aligned table,
which are currently the mzML file names, are replaced by their sample ID
in the sequence file.
Processing your data¶
Running raw2feats.py
¶
To process your data, navigate to the directory containing your summary
and sequence files using the command line. If this directory does not
also contain your data, make sure to specify that in the
DATA_DIRECTORY
attribute in the summary file. Then, type
python /path/to/raw2feats.py -i /path/to/input/directory
Alternatively, if you want to specify a different output directory than the default, you can type
python /path/to/raw2feats.py -i /path/to/input/directory -o /full/path/to/output/directory
Note that you must have raw2feats.py
, SummaryParserMtab.py
, and
preprocessingmtab.py
all added to your PYTHONPATH
or in the same
folder.
Knowing when it’s finished¶
When the data is finished processing, you should see
[[Processing mzML files]] Done.
print to screen.
Saving the outputs to files¶
Using Windows DOS command line, you can save the stdout
(i.e.
anything that is spit out by any of the python codes) by piping your
command to a file name using >
. You can also save the outputs of the
calls to the R files using 2>
. For example, the following command
saves the python outputs to out1.txt
and the R outputs to
out2.txt
python /path/to/raw2feats.py -i /path/to/input/directory > out1.txt 2> out2.txt
Saving outputs on UNIX command line should be straightforward (but to be honest I wrote most of this code back in the dark ages before I got a Mac so I haven’t tested it out).
Common confusing errors¶
Error in ``phenoDataFromPaths(files)``: Directory tree must be level
This error means that there is probably a mismatch between the files
specified in your sequence file and those present in your data folder.
The pick_peaks
code goes through all files it finds in the sequence
file’s File Name
column and tries to pick peaks for them.
A value is trying to be set on a copy of a slice from a DataFrame
This is a non-fatal error. Don’t worry about it.
A subdirectory or file already exists, or other copying files errors
Errors related to making directories and copying files should also be non-fatal. As long as the processing continues after them, you can ignore them.
IOError: [Errno 13] Permission denied: file_name
This error means that one of the files that the code is trying to edit,
like the processing_tracker
or summary_file
are open in another
program. Close them and try again.
Rscript, pickpeaks.R, or alignpeaks.R not found
Make sure that the paths to the respective files are correct. These are
in the preprocessingmtab.py
module, in either (or both, in the case
of the call to Rscript) pickpeaks()
or alignpeaks()
. If there
are spaces in the file path and you are using a DOS command prompt,
enclose the string that has spaces in quotation marks (“).