API of seq-to-first-iso

seq-to-first-iso computes the intensities of the first two isotopologues (M0 and M1) of peptide sequences, both with natural carbon and with carbon enriched to 99.99 % 12C.

The program can treat selected amino acids as unlabelled in order to simulate amino acid auxotrophies.

seq-to-first-iso is available as a Python module.

[1]:
from pprint import pprint

from pkg_resources import get_distribution  # Comes with setuptools.
import pandas as pd
from pyteomics import mass

import seq_to_first_iso as stfi
[2]:
try:
    print(f"pyteomics version: {get_distribution('pyteomics').version}")
except Exception:  # A bare "except:" would also swallow KeyboardInterrupt.
    print("pyteomics version not found")

print(f"pandas version: {pd.__version__}\n"
      f"seq-to-first-iso version: {stfi.__version__}"
     )
pyteomics version: 4.1.2
pandas version: 0.24.2
seq-to-first-iso version: 0.5.0
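pkg_resources is deprecated in recent setuptools releases. As an alternative, here is a minimal sketch that reads installed versions with the standard-library importlib.metadata module (Python 3.8 and later). The helper name installed_version is hypothetical and not part of seq-to-first-iso.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or a fallback message."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not found"


for name in ("pyteomics", "pandas", "seq-to-first-iso"):
    print(f"{name} version: {installed_version(name)}")
```

Catching only PackageNotFoundError keeps unrelated errors visible instead of hiding them behind the fallback message.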
[3]:
# Variables used for showcase.
peptide_seq = "YAQEISRAR"
unlabelled_amino_acids = ["A", "R"]

Abundances defined in seq-to-first-iso

[4]:
isotopic_abundance = stfi.isotopic_abundance
C12_abundance = stfi.C12_abundance
[5]:
pprint(isotopic_abundance)
{'C[12]': 0.9893,
 'C[13]': 0.0107,
 'H[1]': 0.999885,
 'H[2]': 0.000115,
 'N[14]': 0.99632,
 'N[15]': 0.00368,
 'O[16]': 0.99757,
 'O[17]': 0.00038,
 'O[18]': 0.00205,
 'S[32]': 0.9493,
 'S[33]': 0.0076,
 'S[34]': 0.0429,
 'X[12]': 0.9893,
 'X[13]': 0.0107}
[6]:
pprint(C12_abundance)
{'C[12]': 0.9999,
 'C[13]': 9.999999999998899e-05,
 'H[1]': 0.999885,
 'H[2]': 0.000115,
 'N[14]': 0.99632,
 'N[15]': 0.00368,
 'O[16]': 0.99757,
 'O[17]': 0.00038,
 'O[18]': 0.00205,
 'S[32]': 0.9493,
 'S[33]': 0.0076,
 'S[34]': 0.0429,
 'X[12]': 0.9893,
 'X[13]': 0.0107}
isotopic_abundance and C12_abundance are dictionaries that map the common isotopes of organic elements to their abundances.
In C12_abundance, the abundance of 12C is 99.99 %, hence the abundance of 13C is 0.01 %.
Element X is a virtual element that replaces the carbon of unlabelled amino acids; it has the same isotopic abundances as natural carbon in both dictionaries.
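To make the relation between the two dictionaries concrete, here is a minimal sketch that derives an enriched dictionary from the natural one. The helper enrich_carbon is hypothetical (it is not part of the seq-to-first-iso API), and only the carbon entries are shown.

```python
# Natural abundances of carbon and of the virtual element X, copied from the output above.
natural = {"C[12]": 0.9893, "C[13]": 0.0107, "X[12]": 0.9893, "X[13]": 0.0107}


def enrich_carbon(abundances, c12=0.9999):
    """Return a copy of the abundances with labelled carbon enriched in 12C."""
    enriched = dict(abundances)  # Copy, so the natural dictionary is left untouched.
    enriched["C[12]"] = c12
    enriched["C[13]"] = 1 - c12  # The two carbon isotopes must sum to 1.
    return enriched


enriched = enrich_carbon(natural)
print(enriched)
```

Only the C entries change; X keeps the natural abundances, which is what lets it stand in for unlabelled carbon. Computing the 13C abundance as a floating-point subtraction would also explain why the dictionary shows a value very slightly different from 0.0001.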

Separate sequences according to unlabelled amino acids

[7]:
help(stfi.separate_labelled)
Help on function separate_labelled in module seq_to_first_iso.seq_to_first_iso:

separate_labelled(sequence, unlabelled_aa)
    Get the sequence of unlabelled amino acids from a sequence.

    Parameters
    ----------
    sequence : str
        String of amino acids.
    unlabelled_aa : container object
        Container (list, string...) of unlabelled amino acids.

    Returns
    -------
    tuple(str, str)
        | The sequences as a tuple of string with:
        |    - the sequence without the unlabelled amino acids
        |    - the unlabelled amino acids in the sequence

[8]:
# Separate sequence "YAQEISRAR" with amino acids A and R unlabelled.
labelled_sequence, unlabelled_sequence = stfi.separate_labelled(peptide_seq, unlabelled_aa=unlabelled_amino_acids)
print(f"sequence with labelled carbon: {labelled_sequence}\n"
      f"sequence with unlabelled carbon: {unlabelled_sequence}")
sequence with labelled carbon: YQEIS
sequence with unlabelled carbon: ARAR
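The separation itself is simple to picture. The following sketch mimics the documented behaviour in plain Python; split_by_label is a hypothetical name and this is not the library's implementation.

```python
def split_by_label(sequence, unlabelled_aa):
    """Split a sequence into its labelled and unlabelled amino acids, keeping their order."""
    labelled = "".join(aa for aa in sequence if aa not in unlabelled_aa)
    unlabelled = "".join(aa for aa in sequence if aa in unlabelled_aa)
    return labelled, unlabelled


print(split_by_label("YAQEISRAR", ["A", "R"]))
```

Because only membership is tested, any container works for unlabelled_aa: a list, a set or a plain string such as "AR".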

Obtain a composition with element X

[9]:
help(stfi.seq_to_xcomp)
Help on function seq_to_xcomp in module seq_to_first_iso.seq_to_first_iso:

seq_to_xcomp(sequence_l, sequence_nl)
    Take 2 amino acid sequences and return the composition with X.

    The second sequence will have its C replaced by X.

    Parameters
    ----------
    sequence_l : str or pyteomics.mass.Composition
        Sequence or composition with labelled amino acids.
    sequence_nl : str or pyteomics.mass.Composition
        Sequence or composition where amino acids are not labelled.

    Returns
    -------
    pyteomics.mass.Composition
        Composition with unlabelled carbon as element X.

    Notes
    -----
    | The function assumes the second sequence has no terminii (H-, -OH).
    | Supports pyteomics.mass.Composition as argument (0.5.1).
    | If mass.Composition objects are provided, the function assumes
      the terminii of the second composition were already removed.

[10]:
# Get the chemical formula with unlabelled carbon as element X.
chem_formula = stfi.seq_to_xcomp(labelled_sequence, unlabelled_sequence)
print(f"Composition of {peptide_seq} with {unlabelled_amino_acids} unlabelled:\n{chem_formula}")
Composition of YAQEISRAR with ['A', 'R'] unlabelled:
Composition({'H': 76, 'C': 28, 'O': 15, 'N': 16, 'X': 18})
[11]:
# If all amino acids are labelled, you can pass an empty string as the second sequence.
labelled_formula = stfi.seq_to_xcomp(labelled_sequence, "")
print(f"Composition of {labelled_sequence} with all amino acids labelled:\n{labelled_formula}")
Composition of YQEIS with all amino acids labelled:
Composition({'H': 42, 'C': 28, 'O': 11, 'N': 6})
[12]:
# You can also provide pyteomics.mass.Composition objects.
# For a coherent result, the H- and -OH termini should be on only one of the two compositions.
labelled_composition = mass.Composition(labelled_sequence)
# parsed_sequence does not add the termini.
unlabelled_composition = mass.Composition(parsed_sequence=unlabelled_sequence)

chem_formula = stfi.seq_to_xcomp(labelled_composition, unlabelled_composition)
print(f"Composition of {peptide_seq} with {unlabelled_amino_acids} unlabelled:\n{chem_formula}")
Composition of YAQEISRAR with ['A', 'R'] unlabelled:
Composition({'H': 76, 'C': 28, 'O': 15, 'N': 16, 'X': 18})
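The element counts above can be checked by hand. This sketch rebuilds the composition from hard-coded residue formulas using only the standard library; xcomp_sketch, RESIDUES and WATER are hypothetical names, and the table covers only the amino acids of YAQEISRAR.

```python
from collections import Counter

# Residue compositions (amino acid minus H2O) for the amino acids of YAQEISRAR.
RESIDUES = {
    "A": {"C": 3, "H": 5, "N": 1, "O": 1},
    "E": {"C": 5, "H": 7, "N": 1, "O": 3},
    "I": {"C": 6, "H": 11, "N": 1, "O": 1},
    "Q": {"C": 5, "H": 8, "N": 2, "O": 2},
    "R": {"C": 6, "H": 12, "N": 4, "O": 1},
    "S": {"C": 3, "H": 5, "N": 1, "O": 2},
    "Y": {"C": 9, "H": 9, "N": 1, "O": 2},
}
WATER = {"H": 2, "O": 1}  # H- and -OH termini, counted once per peptide.


def xcomp_sketch(labelled, unlabelled):
    """Sum residue compositions, renaming the carbon of unlabelled residues to X."""
    total = Counter(WATER)
    for aa in labelled:
        total.update(RESIDUES[aa])
    for aa in unlabelled:
        residue = dict(RESIDUES[aa])
        residue["X"] = residue.pop("C")  # Unlabelled carbon becomes element X.
        total.update(residue)
    return dict(total)


print(xcomp_sketch("YQEIS", "ARAR"))
```

The termini are added a single time, which mirrors the requirement that only one of the two inputs carries them.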

Compute isotopologue intensity

[13]:
help(stfi.compute_M0_nl)
print("-" * 79)
help(stfi.compute_M1_nl)
Help on function compute_M0_nl in module seq_to_first_iso.seq_to_first_iso:

compute_M0_nl(f, a)
    Return the monoisotopic abundance M0 of a formula with mixed labels.

    Parameters
    ----------
    f : pyteomics.mass.Composition
        Chemical formula, as a dict of counts for each element:
        {element_name: count_of_element_in_sequence, ...}.
    a : dict
        Dictionary of abundances of isotopes, in the format:
        {element_name[isotope_number]: relative abundance, ..}.

    Returns
    -------
    float
        Value of M0.

    Notes
    -----
    X represents C with default isotopic abundance.

-------------------------------------------------------------------------------
Help on function compute_M1_nl in module seq_to_first_iso.seq_to_first_iso:

compute_M1_nl(f, a)
    Compute abundance of second isotopologue M1 from its formula.

    Parameters
    ----------
    f : pyteomics.mass.Composition
        Chemical formula, as a dict of counts for each element:
        {element_name: count_of_element_in_sequence, ...}.
    a : dict
        Dictionary of abundances of isotopes, in the format:
        {element_name[isotope_number]: relative abundance, ..}.

    Returns
    -------
    float
        Value of M1.

    Notes
    -----
    X represents C with default isotopic abundance.

[14]:
# Compute M0 with natural carbon, then with 99.99 % 12C.
first_isotopologue = stfi.compute_M0_nl(chem_formula, isotopic_abundance)
print(f"M0 in Normal Condition: {first_isotopologue}")

first_isotopologue = stfi.compute_M0_nl(chem_formula, C12_abundance)
print(f"M0 in 12C condition: {first_isotopologue}")
M0 in Normal Condition: 0.5493191520383802
M0 in 12C condition: 0.7403283857401063
[15]:
# Compute M1 with natural carbon, then with 99.99 % 12C.
second_isotopologue = stfi.compute_M1_nl(chem_formula, isotopic_abundance)
print(f"M1 in Normal Condition: {second_isotopologue}")

second_isotopologue = stfi.compute_M1_nl(chem_formula, C12_abundance)
print(f"M1 in 12C condition: {second_isotopologue}")
M1 in Normal Condition: 0.313702912736476
M1 in 12C condition: 0.200655465179031
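For readers who want to see where these numbers come from, here is a sketch under the usual assumption that isotopes are incorporated independently: M0 is the probability that every atom is the lightest isotope, and M1 is the probability that exactly one atom is one mass unit heavier. The functions m0_sketch and m1_sketch are hypothetical and are not claimed to match the library's code.

```python
# Abundances copied from the isotopic_abundance and C12_abundance outputs above.
NATURAL = {
    "C[12]": 0.9893, "C[13]": 0.0107, "X[12]": 0.9893, "X[13]": 0.0107,
    "H[1]": 0.999885, "H[2]": 0.000115, "N[14]": 0.99632, "N[15]": 0.00368,
    "O[16]": 0.99757, "O[17]": 0.00038, "S[32]": 0.9493, "S[33]": 0.0076,
}
ENRICHED = {**NATURAL, "C[12]": 0.9999, "C[13]": 1 - 0.9999}

# Lightest isotope and the isotope one mass unit heavier, per element.
LIGHT = {"C": "C[12]", "X": "X[12]", "H": "H[1]", "N": "N[14]", "O": "O[16]", "S": "S[32]"}
HEAVY = {"C": "C[13]", "X": "X[13]", "H": "H[2]", "N": "N[15]", "O": "O[17]", "S": "S[33]"}


def m0_sketch(formula, abundance):
    """Probability that every atom of the formula is its lightest isotope."""
    m0 = 1.0
    for element, count in formula.items():
        if element in LIGHT:  # Other elements (e.g. P, one stable isotope) contribute 1.
            m0 *= abundance[LIGHT[element]] ** count
    return m0


def m1_sketch(formula, abundance):
    """Probability that exactly one atom carries one extra neutron."""
    ratio_sum = sum(
        count * abundance[HEAVY[element]] / abundance[LIGHT[element]]
        for element, count in formula.items()
        if element in LIGHT
    )
    return m0_sketch(formula, abundance) * ratio_sum


formula = {"H": 76, "C": 28, "O": 15, "N": 16, "X": 18}
for label, abundance in (("Normal Condition", NATURAL), ("12C condition", ENRICHED)):
    print(f"M0 in {label}: {m0_sketch(formula, abundance)}")
    print(f"M1 in {label}: {m1_sketch(formula, abundance)}")
```

In the 12C condition only the 28 labelled carbons use the enriched abundance; the 18 X atoms keep the natural one, which is why M0 rises to about 0.74 rather than approaching 1.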

Get the composition of a list of modifications

[16]:
help(stfi.get_mods_composition)
Help on function get_mods_composition in module seq_to_first_iso.seq_to_first_iso:

get_mods_composition(modifications)
    Return the composition of a list of modifications.

    Parameters
    ----------
    modifications: list of str
        List of modifications string (corresponding to Unimod titles).

    Returns
    -------
    pyteomics.mass.Composition
        The total composition change.

[17]:
# Modification names must exactly match Unimod entry titles (matching is case-sensitive).
modification_list = ["Acetyl", "Phospho", "phospho"]  # "phospho" is not a Unimod title, so it is ignored with a warning.
total_composition = stfi.get_mods_composition(modification_list)
print(f"Total composition shift for {modification_list} is {total_composition}")
[2019-07-10, 10:38:33] WARNING : Unimod entry not found for : phospho
Total composition shift for ['Acetyl', 'Phospho', 'phospho'] is Composition({'H': 3, 'C': 2, 'O': 4, 'P': 1})
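The total shift is the element-wise sum of the individual modification compositions. The sketch below reproduces the result with a two-entry lookup table instead of the Unimod database; UNIMOD_SUBSET and mods_composition_sketch are hypothetical names used for illustration only.

```python
from collections import Counter

# Composition changes of two Unimod entries, hard-coded for this sketch.
UNIMOD_SUBSET = {
    "Acetyl": {"H": 2, "C": 2, "O": 1},
    "Phospho": {"H": 1, "O": 3, "P": 1},
}


def mods_composition_sketch(modifications):
    """Sum the composition changes of known modifications, skipping unknown titles."""
    total = Counter()
    for title in modifications:
        if title not in UNIMOD_SUBSET:  # Exact, case-sensitive match on the title.
            print(f"Unimod entry not found for : {title}")
            continue
        total.update(UNIMOD_SUBSET[title])
    return dict(total)


print(mods_composition_sketch(["Acetyl", "Phospho", "phospho"]))
```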

Get human-readable chemical formula

[18]:
help(stfi.formula_to_str)
Help on function formula_to_str in module seq_to_first_iso.seq_to_first_iso:

formula_to_str(composition)
    Return formula from Composition as a string.

    Parameters
    ----------
    composition : pyteomics.mass.Composition
        Chemical formula.

    Returns
    -------
    str
        Human-readable string of the formula.

    Warnings
    --------
    If the composition has elements not in USED_ELEMS, they will not
    be added to the output.

[19]:
# This is the function used to get the formulas in the output.
formula_str = stfi.formula_to_str(total_composition)
print(f"{total_composition} becomes {formula_str}")
Composition({'H': 3, 'C': 2, 'O': 4, 'P': 1}) becomes C2H3O4P1
[20]:
# !!! Warning: if the Composition has elements not in "CHONPSX", they will not be in the final string.
bad_composition = mass.Composition("U")
formula_str = stfi.formula_to_str(bad_composition)
print(f"{bad_composition} becomes {formula_str}")
Composition({'H': 7, 'C': 3, 'O': 2, 'N': 1, 'Se': 1}) becomes C3H7O2N1

Here, the “non-CHONPSX” element Se (selenium) of selenocysteine (U) is left out of the string.
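The formatting rule can be sketched in a few lines. This version assumes that USED_ELEMS is the string "CHONPSX" and that elements are written in that order, which is consistent with the two outputs above; formula_to_str_sketch is a hypothetical name.

```python
USED_ELEMS = "CHONPSX"  # Assumed value and ordering, inferred from the outputs above.


def formula_to_str_sketch(composition):
    """Concatenate element symbols and counts, keeping only elements in USED_ELEMS."""
    return "".join(
        f"{element}{composition[element]}"
        for element in USED_ELEMS
        if composition.get(element)
    )


print(formula_to_str_sketch({"H": 3, "C": 2, "O": 4, "P": 1}))
print(formula_to_str_sketch({"H": 7, "C": 3, "O": 2, "N": 1, "Se": 1}))
```

Iterating over USED_ELEMS rather than over the composition is what silently drops Se: an element that is never looked up never reaches the output.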

Parse a file with peptide sequences

seq-to-first-iso accepts files with one sequence per line.
Optionally, an annotation or sequence ID can precede each sequence on the same line, separated from it by a separator (default: “\t”). The program assumes that the file has annotations if the separator is found on the first line.
The parser ignores lines whose sequence contains invalid characters (not in “ACDEFGHIKLMNPQRSTVWY”), unless those characters belong to XTandem’s Post-Translational Modification notation.
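The first-line rule can be illustrated as follows. This is a sketch of the behaviour described above, not the parser's code; split_lines is a hypothetical helper and the input lines are made up for the example.

```python
def split_lines(lines, sep="\t"):
    """Split lines into (annotation, sequence) pairs, deciding the format from the first line."""
    annotated = bool(lines) and sep in lines[0]  # Only the first line is inspected.
    records = []
    for line in lines:
        line = line.strip()
        if annotated:
            annotation, _, sequence = line.partition(sep)
        else:
            annotation, sequence = None, line
        records.append((annotation, sequence))
    return records


print(split_lines(["0\tVPK(Oxidation)ER", "14\tFHNK"]))
print(split_lines(["YAQEISR", "VGFPVLSVKEHK"]))
```

Since the decision is taken once, a file that mixes annotated and plain lines would be read inconsistently, so it is safer to keep a single format per file.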
[21]:
help(stfi.sequence_parser)
Help on function sequence_parser in module seq_to_first_iso.seq_to_first_iso:

sequence_parser(file, sep='\t')
    Return information on sequences parsed from a file.

    Parameters
    ----------
    file : str
        Filename, the file can either just have sequences for each line or
        can have have annotations and sequences with a separator in-between.
    sep : str, optional
        Separator for files with annotations (default is ``\t``).

    Returns
    -------
    dict
        | Parsed output with "key: values" :
        |     - "annotations": a list of annotations if any.
        |     - "raw_sequences": a list of unmodified peptide sequences.
        |     - "sequences": a list of uppercase peptide sequences.
        |     - "modifications": a list of lists of PTMs.
        |     - "ignored_lines": the number of ignored lines.

    Warnings
    --------
    The function uses the first line to evaluate if the file has
    annotations or not, hence a file should have a consistent format.

    Notes
    -----
    | Supports Xtandem's Post-Translational Modification notation (0.4.0).
    | Supports annotations (0.3.0).

[22]:
parsed_output = stfi.sequence_parser("peptides.txt")
pprint(parsed_output)
{'annotations': [],
 'ignored_lines': 0,
 'modifications': [[], [], []],
 'raw_sequences': ['YAQEISR', 'VGFPVLSVKEHK', 'LAMVIIKEFVDDLK'],
 'sequences': ['YAQEISR', 'VGFPVLSVKEHK', 'LAMVIIKEFVDDLK']}

For peptides.txt, the list of annotations is empty because the file has no annotations, and no lines were ignored. No modifications were found, so raw_sequences (with modifications) is identical to sequences (without modifications).

[23]:
# Get the values in the returned dict.
annotations = parsed_output["annotations"]
ignored_lines = parsed_output["ignored_lines"]
modifications = parsed_output["modifications"]
sequences = parsed_output["sequences"]
raw_sequences = parsed_output["raw_sequences"]
[24]:
# Parsing a file with annotations and modifications following XTandem notation.
parsed_output = stfi.sequence_parser("peptides_mod.tsv")
pprint(parsed_output)
{'annotations': ['0', '1', '2', '4', '7', '11', '13', '14', '16', '24', '27'],
 'ignored_lines': 1,
 'modifications': [['Oxidation'],
                   ['Phospho'],
                   [],
                   ['Glutathione'],
                   ['Acetyl', 'Oxidation'],
                   ['Heme'],
                   ['Pro->Val'],
                   [],
                   ['Glutathione'],
                   [],
                   ['Acetyl', 'Oxidation', 'Acetyl', 'Acetyl']],
 'raw_sequences': ['VPK(Oxidation)ER',
                   'VLLIDLRIPQR(Phospho)SAINHIVAPNLVNVDPNLLWDK',
                   'QRTTFFVLGINTVNYPDIYEHILER',
                   'AELFL(Glutathione)LNR',
                   '.(Acetyl)VGEVFINYIQRQNELFQGKLAYLII(Oxidation)DTCLSIVRPNDSKPLDNR',
                   'YKTMNTFDPD(Heme)EKFEWFQVWQAVK',
                   'HKSASSPAV(Pro->Val)NADTDIQDSSTPSTSPSGRR',
                   'FHNK',
                   '.(Glutathione)MDLEIK',
                   'LANEKPEDVFER',
                   '.(Acetyl)SDTPLR(Oxidation)D(Acetyl)EDG(Acetyl)LDFWETLRSLATTNPNPPVEK'],
 'sequences': ['VPKER',
               'VLLIDLRIPQRSAINHIVAPNLVNVDPNLLWDK',
               'QRTTFFVLGINTVNYPDIYEHILER',
               'AELFLLNR',
               'VGEVFINYIQRQNELFQGKLAYLIIDTCLSIVRPNDSKPLDNR',
               'YKTMNTFDPDEKFEWFQVWQAVK',
               'HKSASSPAVNADTDIQDSSTPSTSPSGRR',
               'FHNK',
               'MDLEIK',
               'LANEKPEDVFER',
               'SDTPLRDEDGLDFWETLRSLATTNPNPPVEK']}
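The relation between raw_sequences, sequences and modifications can be reproduced with a regular expression. This is a sketch based only on the examples above (modifications in parentheses, with a leading "." for N-terminal ones); split_modifications is a hypothetical name and the real parser may handle more cases.

```python
import re

# A modification is written in parentheses; N-terminal ones are preceded by a dot.
MOD_PATTERN = re.compile(r"\.?\((.*?)\)")


def split_modifications(raw_sequence):
    """Return the bare uppercase sequence and the list of modification names."""
    modifications = MOD_PATTERN.findall(raw_sequence)
    sequence = MOD_PATTERN.sub("", raw_sequence).upper()
    return sequence, modifications


print(split_modifications("VPK(Oxidation)ER"))
print(split_modifications(".(Glutathione)MDLEIK"))
print(split_modifications("HKSASSPAV(Pro->Val)NADTDIQDSSTPSTSPSGRR"))
```

The non-greedy `.*?` stops at the first closing parenthesis, so several modifications in one sequence are captured separately and in order.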

Create a dataframe from a list of sequences

[25]:
help(stfi.seq_to_df)
Help on function seq_to_df in module seq_to_first_iso.seq_to_first_iso:

seq_to_df(sequences, unlabelled_aa, **kwargs)
    Create a dataframe from sequences and return its name.

    Parameters
    ----------
    sequences : list of str
        List of pure peptide sequences string.
    unlabelled_aa : container object
        Container of unlabelled amino acids.
    annotations : list of str, optional
        List of IDs for the sequences.
    raw_sequences : list of str, optional
        List of sequences with Xtandem PTMs.
    modifications : list of str, optional
        List of modifications for raw_sequences.

    Returns
    -------
    pandas.Dataframe
        | Dataframe with :
        |                  annotation (optional), sequence, mass,
                           formula, formula_X, M0_NC, M1_NC, M0_12C, M1_12C.

    Warnings
    --------
    If raw_sequence is provided, modifications must also be provided
    and vice-versa.

[26]:
# Dataframe from a list of sequences; pass an empty list as unlabelled_aa to have all amino acids labelled.
df_peptides = stfi.seq_to_df(sequences, unlabelled_aa=[])
df_peptides.head()
[2019-07-10, 10:38:33] INFO    : Computing formula
[2019-07-10, 10:38:33] INFO    : Computing mass
[2019-07-10, 10:38:33] INFO    : Computing M0 and M1
[26]:
         sequence         mass          formula        formula_X     M0_NC     M1_NC    M0_12C    M1_12C
0         YAQEISR   865.429381     C37H59O13N11     C37H59O13N11  0.620641  0.280871  0.920656  0.051619
1    VGFPVLSVKEHK  1338.765971    C63H102O16N16    C63H102O16N16  0.455036  0.345060  0.890522  0.074113
2  LAMVIIKEFVDDLK  1632.916066  C76H128O21N16S1  C76H128O21N16S1  0.369940  0.337319  0.831576  0.081017
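The mass column is consistent with the monoisotopic mass of the formula column. As a cross-check, this sketch parses a formula string and sums standard monoisotopic masses; parse_formula and monoisotopic_mass are hypothetical helpers, and the mass table covers only C, H, N, O and S.

```python
import re

# Monoisotopic masses (in Da) of the lightest isotopes.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}


def parse_formula(formula):
    """Turn a string such as 'C37H59O13N11' into a dict of element counts."""
    return {
        element: int(count)
        for element, count in re.findall(r"([A-Z][a-z]?)(\d+)", formula)
    }


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC_MASS[e] * n for e, n in parse_formula(formula).items())


for formula in ("C37H59O13N11", "C63H102O16N16", "C76H128O21N16S1"):
    print(formula, round(monoisotopic_mass(formula), 6))
```

The parser relies on every element being followed by an explicit count, which is how the formulas are written in the dataframe (for example S1 rather than S).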
[27]:
# The dict returned by sequence_parser can be unpacked into the function (remove "ignored_lines" first to avoid a warning).
parsed_output.pop("ignored_lines")
df_peptides = stfi.seq_to_df(unlabelled_aa=unlabelled_amino_acids, **parsed_output)
# Dataframe with annotations.
df_peptides.head()
[2019-07-10, 10:38:34] INFO    : Computing formula
[2019-07-10, 10:38:34] INFO    : Computing composition of modifications
[2019-07-10, 10:38:34] WARNING : Fe in (Heme) is not supported in the computation of M0 and M1
[2019-07-10, 10:38:34] INFO    : Computing mass
[2019-07-10, 10:38:34] INFO    : Computing M0 and M1
[27]:
   annotation                                           sequence         mass           formula            formula_X     M0_NC     M1_NC    M0_12C    M1_12C
0           0                                   VPK(Oxidation)ER   643.365324        C27H49O9N9         C21H49O9N9X6  0.703864  0.235324  0.880417  0.096230
1           1         VLLIDLRIPQR(Phospho)SAINHIVAPNLVNVDPNLLWDK  3838.102264  C172H285O49N48P1  C154H285O49N48P1X18  0.113124  0.236320  0.583917  0.256235
2           2                          QRTTFFVLGINTVNYPDIYEHILER  3037.566156    C140H212O40N36    C128H212O40N36X12  0.171960  0.290060  0.672794  0.212051
3           4                              AELFL(Glutathione)LNR  1279.623072    C55H89O18N15S1     C46H89O18N15S1X9  0.470936  0.318055  0.768911  0.140284
4           7  .(Acetyl)VGEVFINYIQRQNELFQGKLAYLII(Oxidation)D...  5049.638616  C226H361O68N61S1  C205H361O68N61S1X21  0.054198  0.148778  0.481767  0.264187

If modifications have “non-CHONPSX” elements, computation of isotopologue intensities will be less accurate.
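Before trusting the intensities of modified peptides, it can help to list the elements that fall outside “CHONPSX”. The following check is a sketch; unsupported_elements is a hypothetical helper that works on any mapping of element counts.

```python
SUPPORTED_ELEMENTS = set("CHONPSX")


def unsupported_elements(composition):
    """Return the sorted elements of a composition that are not in CHONPSX."""
    return sorted(
        element
        for element, count in composition.items()
        if count and element not in SUPPORTED_ELEMENTS
    )


# Composition of selenocysteine, as printed earlier in this document.
print(unsupported_elements({"H": 7, "C": 3, "O": 2, "N": 1, "Se": 1}))
```

An empty list means every element of the composition is taken into account; a non-empty one, such as Fe for the Heme modification flagged in the log above, signals that M0 and M1 will be approximate.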