API of seq-to-first-iso

seq-to-first-iso computes the intensities of the first two isotopologues (M0 and M1) of peptide sequences, both with natural carbon and with carbon enriched to 99.99% 12C.

The program can also treat selected amino acids as unlabelled, to simulate auxotrophies for those amino acids.

seq-to-first-iso is available as a Python module.

from pathlib import Path
from pprint import pprint

from pkg_resources import get_distribution  # Comes with setuptools.
import pandas as pd
from pyteomics import mass

import seq_to_first_iso as stfi

try:
    print(f"pyteomics version: {get_distribution('pyteomics').version}")
except Exception:  # pkg_resources raises DistributionNotFound if the package is absent.
    print("pyteomics version not found")

print(f"pandas version: {pd.__version__}\n"
      f"seq-to-first-iso version: {stfi.__version__}")
pyteomics version: 4.1.2
pandas version: 0.25.1
seq-to-first-iso version: 0.5.1
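
Recent setuptools releases mark pkg_resources as deprecated. As an alternative, the standard library module importlib.metadata (Python 3.8 and later) can report installed versions. This is a minimal sketch; the helper name package_version is hypothetical and not part of seq-to-first-iso.

```python
from importlib.metadata import PackageNotFoundError, version


def package_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


for package in ("pyteomics", "pandas", "seq-to-first-iso"):
    found = package_version(package)
    print(f"{package} version: {found if found is not None else 'not found'}")
```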

Abundances defined in seq-to-first-iso

pprint(stfi.NATURAL_ABUNDANCE)
{'C[12]': 0.9893,
 'C[13]': 0.0107,
 'H[1]': 0.999885,
 'H[2]': 0.000115,
 'N[14]': 0.99632,
 'N[15]': 0.00368,
 'O[16]': 0.99757,
 'O[17]': 0.00038,
 'O[18]': 0.00205,
 'S[32]': 0.9493,
 'S[33]': 0.0076,
 'S[34]': 0.0429,
 'X[12]': 0.9893,
 'X[13]': 0.0107}
pprint(stfi.C12_ABUNDANCE)
{'C[12]': 0.9999,
 'C[13]': 9.999999999998899e-05,
 'H[1]': 0.999885,
 'H[2]': 0.000115,
 'N[14]': 0.99632,
 'N[15]': 0.00368,
 'O[16]': 0.99757,
 'O[17]': 0.00038,
 'O[18]': 0.00205,
 'S[32]': 0.9493,
 'S[33]': 0.0076,
 'S[34]': 0.0429,
 'X[12]': 0.9893,
 'X[13]': 0.0107}
NATURAL_ABUNDANCE and C12_ABUNDANCE are dictionaries holding the abundances of the common isotopes of the organic elements.
In C12_ABUNDANCE the 12C abundance is 99.99 %, so the 13C abundance is 0.01 %.
Element X is a virtual element that stands in for the carbon of unlabelled amino acids; it keeps the isotopic abundances of natural carbon in both dictionaries.
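
The relation between the two dictionaries can be sketched in plain Python. The helper enrich_carbon below is hypothetical (it is not part of seq-to-first-iso) and only the carbon entries are shown. Only C changes, while X keeps the natural values. The C[13] value printed above (9.999999999998899e-05 rather than 0.0001) is consistent with computing 1 - 0.9999 in floating point.

```python
# Carbon entries only; values copied from the output above.
NATURAL_CARBON = {
    "C[12]": 0.9893,
    "C[13]": 0.0107,
    "X[12]": 0.9893,
    "X[13]": 0.0107,
}


def enrich_carbon(abundance, c12_fraction):
    """Return a copy of the abundances with C set to the given 12C fraction.

    Element X (carbon of unlabelled amino acids) is left untouched.
    """
    enriched = dict(abundance)
    enriched["C[12]"] = c12_fraction
    enriched["C[13]"] = 1 - c12_fraction
    return enriched


enriched = enrich_carbon(NATURAL_CARBON, 0.9999)
print(enriched)
```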

Separate sequences according to unlabelled amino acids

help(stfi.separate_labelled)
Help on function separate_labelled in module seq_to_first_iso.seq_to_first_iso:

separate_labelled(sequence, unlabelled_aa)
    Get the sequence of unlabelled amino acids from a sequence.

    sequence : str
        String of amino acids.
    unlabelled_aa : container object
        Container (list, string...) of unlabelled amino acids.

    tuple(str, str)
        | The sequences as a tuple of string with:
        |    - the sequence without the unlabelled amino acids
        |    - the unlabelled amino acids in the sequence

# Separate sequence "YAQEISRAR" with amino acids A and R unlabelled.
peptide_seq = "YAQEISRAR"
unlabelled_amino_acids = ["A", "R"]

labelled_sequence, unlabelled_sequence = stfi.separate_labelled(peptide_seq, unlabelled_aa=unlabelled_amino_acids)

print(
    f"Original sequence: {peptide_seq}\n"
    f"Unlabelled amino acids: {unlabelled_amino_acids}\n"
    f"Sequence with labelled carbon: {labelled_sequence}\n"
    f"Sequence with unlabelled carbon: {unlabelled_sequence}")
Original sequence: YAQEISRAR
Unlabelled amino acids: ['A', 'R']
Sequence with labelled carbon: YQEIS
Sequence with unlabelled carbon: ARAR
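
The separation itself amounts to two filters over the sequence. The function below is an illustrative re-implementation with a hypothetical name, not the library's code; it reproduces the result shown above.

```python
def split_by_label(sequence, unlabelled_aa):
    """Split a sequence into (labelled, unlabelled) parts, keeping residue order."""
    labelled = "".join(aa for aa in sequence if aa not in unlabelled_aa)
    unlabelled = "".join(aa for aa in sequence if aa in unlabelled_aa)
    return labelled, unlabelled


labelled_part, unlabelled_part = split_by_label("YAQEISRAR", ["A", "R"])
print(labelled_part, unlabelled_part)
```

Note that the positions of the residues are lost: only the counts matter for the composition computed in the next step.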

Obtain a composition with element X

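# Get the chemical formula with unlabelled carbon as element X.
# The labelled part is built from a sequence, so it includes the terminal H and OH.
# The unlabelled part is built with parsed_sequence, which adds the residues only,
# so the terminal groups are counted exactly once in the sum.
labelled_formula = mass.Composition(labelled_sequence)
unlabelled_formula = stfi.convert_atom_C_to_X(mass.Composition(parsed_sequence=unlabelled_sequence))
peptide_formula = unlabelled_formula + labelled_formula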
print(f"Composition of labelled amino acids: {labelled_formula}")
print(f"Composition of unlabelled amino acids (X is C): {unlabelled_formula}")
print(f"Composition of {peptide_seq} with {unlabelled_amino_acids} unlabelled:\n{peptide_formula}")
Composition of labelled amino acids: Composition({'H': 42, 'C': 28, 'O': 11, 'N': 6})
Composition of unlabelled amino acids (X is C): Composition({'H': 34, 'O': 4, 'N': 10, 'X': 18})
Composition of YAQEISRAR with ['A', 'R'] unlabelled:
Composition({'H': 76, 'O': 15, 'N': 16, 'X': 18, 'C': 28})
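
The bookkeeping behind element X can be sketched with plain dictionaries. The helpers carbon_to_x and add_compositions are hypothetical stand-ins for stfi.convert_atom_C_to_X and the + operator of pyteomics compositions. The input counts are the ones printed above, with the 18 carbons of the unlabelled part written as C before conversion.

```python
def carbon_to_x(composition):
    """Return a copy of the composition with every C atom renamed to X."""
    converted = {el: n for el, n in composition.items() if el != "C"}
    carbon = composition.get("C", 0)
    if carbon:
        converted["X"] = converted.get("X", 0) + carbon
    return converted


def add_compositions(first, second):
    """Add two compositions element by element."""
    total = dict(first)
    for element, count in second.items():
        total[element] = total.get(element, 0) + count
    return total


labelled = {"H": 42, "C": 28, "O": 11, "N": 6}      # YQEIS, with terminal H and OH
unlabelled = {"H": 34, "C": 18, "O": 4, "N": 10}    # residues A, R, A, R only

peptide = add_compositions(carbon_to_x(unlabelled), labelled)
print(peptide)
```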

Compute isotopologue intensity

help(stfi.compute_M0_nl)
print("-" * 79)
help(stfi.compute_M1_nl)
Help on function compute_M0_nl in module seq_to_first_iso.seq_to_first_iso:

compute_M0_nl(formula, abundance)
    Compute intensity of the first isotopologue M0.

    Handle element X with specific abundance.

    formula : pyteomics.mass.Composition
        Chemical formula, as a dict of the number of atoms for each element:
        {element_name: number_of_atoms, ...}.
    abundance : dict
        Dictionary of abundances of isotopes:
        {"element_name[isotope_number]": relative abundance, ..}.

        Value of M0.

    X represents C with default isotopic abundance.

Help on function compute_M1_nl in module seq_to_first_iso.seq_to_first_iso:

compute_M1_nl(formula, abundance)
    Compute intensity of the second isotopologue M1.

    Handle element X with specific abundance.

    formula : pyteomics.mass.Composition
        Chemical formula, as a dict of the number of atoms for each element:
        {element_name: number_of_atoms, ...}.
    abundance : dict
        Dictionary of abundances of isotopes:
        {"element_name[isotope_number]": relative abundance, ..}.

        Value of M1.

    X represents C with default isotopic abundance.

# Compute M0 with natural carbon.
first_isotopologue = stfi.compute_M0_nl(peptide_formula, stfi.NATURAL_ABUNDANCE)
print(f"M0 in normal (98.93% 12C) condition: {first_isotopologue}")

# Compute M0 with 99.99% 12C-enriched carbon.
first_isotopologue = stfi.compute_M0_nl(peptide_formula, stfi.C12_ABUNDANCE)
print(f"M0 in    12C (99.99% 12C) condition: {first_isotopologue}")
M0 in normal (98.93% 12C) condition: 0.5493191520383802
M0 in    12C (99.99% 12C) condition: 0.7403283857401063
# Compute M1 with natural carbon.
second_isotopologue = stfi.compute_M1_nl(peptide_formula, stfi.NATURAL_ABUNDANCE)
print(f"M1 in normal (98.93% 12C) condition: {second_isotopologue}")

# Compute M1 with 99.99% 12C-enriched carbon.
second_isotopologue = stfi.compute_M1_nl(peptide_formula, stfi.C12_ABUNDANCE)
print(f"M1 in    12C (99.99% 12C) condition: {second_isotopologue}")
M1 in normal (98.93% 12C) condition: 0.313702912736476
M1 in    12C (99.99% 12C) condition: 0.200655465179031
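
To make these numbers less of a black box, here is a sketch of the usual expressions for the first two isotopologues. It is an assumption that the library computes them this way. M0 is the probability that every atom is its lightest isotope, that is the product of (lightest-isotope abundance) ** (atom count) over all elements. M1 is the probability that exactly one atom carries one extra neutron, that is M0 multiplied by the sum over elements of count * (heavier abundance / lightest abundance). All names below are hypothetical. Elements missing from the FIRST_TWO table (for example P) would raise a KeyError in this sketch.

```python
NATURAL = {
    "C[12]": 0.9893, "C[13]": 0.0107,
    "X[12]": 0.9893, "X[13]": 0.0107,
    "H[1]": 0.999885, "H[2]": 0.000115,
    "N[14]": 0.99632, "N[15]": 0.00368,
    "O[16]": 0.99757, "O[17]": 0.00038,
    "S[32]": 0.9493, "S[33]": 0.0076,
}
# Same table with labelled carbon enriched to 99.99 % 12C; X is unchanged.
ENRICHED = {**NATURAL, "C[12]": 0.9999, "C[13]": 1 - 0.9999}

# Mass numbers of the lightest isotope and of the isotope one unit heavier.
FIRST_TWO = {
    "C": (12, 13), "X": (12, 13), "H": (1, 2),
    "N": (14, 15), "O": (16, 17), "S": (32, 33),
}


def first_two_intensities(formula, abundance):
    """Return (M0, M1) for a formula given isotope abundances."""
    m0 = 1.0
    ratio_sum = 0.0
    for element, count in formula.items():
        light, heavy = FIRST_TWO[element]
        a_light = abundance[f"{element}[{light}]"]
        a_heavy = abundance[f"{element}[{heavy}]"]
        m0 *= a_light ** count
        ratio_sum += count * a_heavy / a_light
    return m0, m0 * ratio_sum


# YAQEISRAR with A and R unlabelled, as computed above.
formula = {"H": 76, "O": 15, "N": 16, "X": 18, "C": 28}
m0_nc, m1_nc = first_two_intensities(formula, NATURAL)
m0_12c, m1_12c = first_two_intensities(formula, ENRICHED)
print(m0_nc, m1_nc, m0_12c, m1_12c)
```

With these inputs the sketch should agree with the four values printed above to within rounding. In the 12C condition only the 28 labelled carbons benefit from the enrichment; the 18 X atoms still contribute natural 13C, which is why M0 does not get closer to 1.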

Get the composition of a list of post-translational modifications (PTMs)

help(stfi.get_mods_composition)
Help on function get_mods_composition in module seq_to_first_iso.seq_to_first_iso:

get_mods_composition(modifications)
    Return the composition of a list of modifications.

    modifications : list of str
        List of modifications string (corresponding to Unimod titles).

        The total composition change.

# Modification names must match Unimod entry titles exactly (case-sensitive).
# "phospho" is not a Unimod title, so it is skipped with a warning.
modification_list = ["Acetyl", "Phospho", "phospho"]
total_composition = stfi.get_mods_composition(modification_list)
print(f"Total composition for {modification_list} is {total_composition}")
[2019-12-05, 13:55:32] WARNING : Unimod entry not found for : phospho
Total composition for ['Acetyl', 'Phospho', 'phospho'] is Composition({'H': 3, 'C': 2, 'O': 4, 'P': 1})
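
The summation can be illustrated without a Unimod database. The sketch below uses a tiny hand-written table with the Unimod composition changes of Acetyl and Phospho. The table, the function name and the returned list of skipped titles are all hypothetical; the real function looks titles up in Unimod.

```python
import logging

logger = logging.getLogger("mods_sketch")

# Composition changes for two Unimod titles (illustrative subset only).
KNOWN_MODS = {
    "Acetyl": {"H": 2, "C": 2, "O": 1},
    "Phospho": {"H": 1, "O": 3, "P": 1},
}


def sum_mod_compositions(modifications, table=KNOWN_MODS):
    """Sum the composition changes of known titles; report unknown ones."""
    total = {}
    skipped = []
    for title in modifications:
        delta = table.get(title)  # Exact, case-sensitive match.
        if delta is None:
            logger.warning("Unimod entry not found for : %s", title)
            skipped.append(title)
            continue
        for element, count in delta.items():
            total[element] = total.get(element, 0) + count
    return total, skipped


total, skipped = sum_mod_compositions(["Acetyl", "Phospho", "phospho"])
print(total, skipped)
```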

Get human-readable chemical formula

help(stfi.formula_to_str)
Help on function formula_to_str in module seq_to_first_iso.seq_to_first_iso:

formula_to_str(composition)
    Return formula from Composition as a string.

    composition : pyteomics.mass.Composition
        Chemical formula.

        Human-readable string of the formula.

    If the composition has elements not in USED_ELEMS, they will not
    be added to the output.

# This is the function used to get the formulas in the output.
formula_str = stfi.formula_to_str(total_composition)
print(f"{total_composition} becomes {formula_str}")
Composition({'H': 3, 'C': 2, 'O': 4, 'P': 1}) becomes C2H3O4P1
# Warning: elements outside "CHONPSX" are dropped from the final string.
bad_composition = mass.Composition("U")  # "U" is selenocysteine, which contains Se.
formula_str = stfi.formula_to_str(bad_composition)
print(f"Composition with unsupported element {bad_composition} becomes {formula_str}")
Composition with unsupported element Composition({'H': 7, 'C': 3, 'O': 2, 'N': 1, 'Se': 1}) becomes C3H7O2N1

Here, the “non-CHONPSX” element Se (selenium) is left out of the string.
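
The behaviour shown in both examples can be reproduced with a fixed element order. This is a sketch under the assumption that the order is the "CHONPSX" string quoted in the warning above; the function name is hypothetical.

```python
ELEMENT_ORDER = "CHONPSX"  # Assumed order, taken from the warning above.


def to_formula_string(composition):
    """Concatenate element symbols and counts in a fixed order.

    Elements outside ELEMENT_ORDER, and elements with a zero count, are skipped.
    """
    return "".join(
        f"{element}{composition[element]}"
        for element in ELEMENT_ORDER
        if composition.get(element)
    )


print(to_formula_string({"H": 3, "C": 2, "O": 4, "P": 1}))
print(to_formula_string({"H": 7, "C": 3, "O": 2, "N": 1, "Se": 1}))
```

Because the dropped element leaves no trace in the string, it is safer to compare the string with the original composition when unusual residues or modifications are expected.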

Parse a file with peptide sequences and charges

seq-to-first-iso reads TSV files that contain at least a sequence column and a charge column.

The parser ignores lines whose sequence contains invalid characters (anything outside ACDEFGHIKLMNPQRSTVWY), unless those characters belong to X!Tandem’s PTM notation.
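
As an illustration of that notation, the sketch below extracts the modifications written in parentheses, drops the leading dot used for N-terminal modifications, and checks the remaining letters. The regular expression and the function name are assumptions made for this example, not the parser used by seq-to-first-iso.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MOD_PATTERN = re.compile(r"\(([^()]+)\)")  # Text between a pair of parentheses.


def split_sequence_and_mods(raw_sequence):
    """Return (clean_sequence, modifications, is_valid) for one input sequence."""
    modifications = MOD_PATTERN.findall(raw_sequence)
    clean = MOD_PATTERN.sub("", raw_sequence).lstrip(".")
    is_valid = bool(clean) and set(clean) <= AMINO_ACIDS
    return clean, modifications, is_valid


for raw in ("YAQEISR", "AELFL(Glutathione)LNR", ".(Glutathione)MDLEIK", "YAQ1ISR"):
    print(raw, split_sequence_and_mods(raw))
```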

df_raw = stfi.parse_input_file("peptides.tsv")
df_filtered = stfi.filter_input_dataframe(df_raw, "pep_sequence", "pep_charge")
[2019-12-05, 13:55:32] INFO    : Read peptides.tsv
[2019-12-05, 13:55:32] INFO    : Found 11 lines and 3 columns
                                             sequence  charge
0                                             YAQEISR       2
2                           QRTTFFVLGINTVNYPDIYEHILER       2
3                               AELFL(Glutathione)LNR       1
4   .(Acetyl)VGEVFINYIQRQNELFQGKLAYLII(Oxidation)D...       4
5                       YKTMNTFDPD(Heme)EKFEWFQVWQAVK       2
7                                                FHNK       1
8                                .(Glutathione)MDLEIK       3
9                                        LANEKPEDVFER       2
10  .(Acetyl)SDTPLR(Oxidation)D(Acetyl)EDG(Acetyl)...       3
df_final = stfi.compute_intensities(df_filtered, unlabelled_aa=["A", "R"])
[2019-12-05, 13:55:33] INFO    : Reading sequences.
[2019-12-05, 13:55:33] INFO    : Computing composition and formula.
[2019-12-05, 13:55:33] WARNING : Fe in (Heme) is not supported in the computation of M0 and M1
[2019-12-05, 13:55:33] INFO    : Computing neutral mass
[2019-12-05, 13:55:33] INFO    : Computing M0 and M1
stfi_sequence stfi_charge stfi_sequence_clean stfi_modification stfi_sequence_without_mod stfi_sequence_to_process stfi_log stfi_sequence_labelled stfi_sequence_unlabelled stfi_composition_mod ... stfi_composition_peptide_neutral stfi_composition_peptide_with_charge stfi_composition_peptide_with_charge_X stfi_formula stfi_formula_X stfi_neutral_mass stfi_M0_NC stfi_M1_NC stfi_M0_12C stfi_M1_12C
0 YAQEISR 2 YAQEISR [] YAQEISR YAQEISR YQEIS AR {} ... {'H': 59, 'C': 37, 'O': 13, 'N': 11} {'H': 61, 'C': 37, 'O': 13, 'N': 11} {'H': 61, 'C': 28, 'O': 13, 'N': 11, 'X': 9} C37H61O13N11 C28H61O13N11X9 865.429381 0.620499 0.280949 0.836258 0.127729
1 VLLIDLRIPQR(Phospho)SAINHIVAPNLVNVDPNLLWDK 3 VLLIDLRIPQR(Phospho)SAINHIVAPNLVNVDPNLLWDK [Phospho] VLLIDLRIPQRSAINHIVAPNLVNVDPNLLWDK VLLIDLRIPQRSAINHIVAPNLVNVDPNLLWDK VLLIDLIPQSINHIVPNLVNVDPNLLWDK RRAA {'H': 1, 'O': 3, 'P': 1} ... {'H': 285, 'C': 172, 'O': 49, 'N': 48, 'P': 1} {'H': 288, 'C': 172, 'O': 49, 'N': 48, 'P': 1} {'H': 288, 'C': 154, 'O': 49, 'N': 48, 'X': 18... C172H288O49N48P1 C154H288O49N48P1X18 3838.102264 0.113085 0.236277 0.583716 0.256348
2 QRTTFFVLGINTVNYPDIYEHILER 2 QRTTFFVLGINTVNYPDIYEHILER [] QRTTFFVLGINTVNYPDIYEHILER QRTTFFVLGINTVNYPDIYEHILER QTTFFVLGINTVNYPDIYEHILE RR {} ... {'H': 212, 'C': 140, 'O': 40, 'N': 36} {'H': 214, 'C': 140, 'O': 40, 'N': 36} {'H': 214, 'C': 128, 'O': 40, 'N': 36, 'X': 12} C140H214O40N36 C128H214O40N36X12 3037.566156 0.171920 0.290033 0.672639 0.212157
3 AELFL(Glutathione)LNR 1 AELFL(Glutathione)LNR [Glutathione] AELFLLNR AELFLLNR ELFLLN AR {'H': 15, 'C': 10, 'N': 3, 'O': 6, 'S': 1} ... {'H': 89, 'C': 55, 'O': 18, 'N': 15, 'S': 1} {'H': 90, 'C': 55, 'O': 18, 'N': 15, 'S': 1} {'H': 90, 'C': 46, 'O': 18, 'N': 15, 'X': 9, '... C55H90O18N15S1 C46H90O18N15S1X9 1279.623072 0.470882 0.318073 0.768822 0.140356
4 .(Acetyl)VGEVFINYIQRQNELFQGKLAYLII(Oxidation)D... 4 .(Acetyl)VGEVFINYIQRQNELFQGKLAYLII(Oxidation)D... [Acetyl, Oxidation] VGEVFINYIQRQNELFQGKLAYLIIDTCLSIVRPNDSKPLDNR VGEVFINYIQRQNELFQGKLAYLIIDTCLSIVRPNDSKPLDNR VGEVFINYIQQNELFQGKLYLIIDTCLSIVPNDSKPLDN RARR {'H': 2, 'C': 2, 'O': 2} ... {'H': 361, 'C': 226, 'O': 68, 'N': 61, 'S': 1} {'H': 365, 'C': 226, 'O': 68, 'N': 61, 'S': 1} {'H': 365, 'C': 205, 'O': 68, 'N': 61, 'S': 1,... C226H365O68N61S1 C205H365O68N61S1X21 5049.638616 0.054173 0.148735 0.481545 0.264287
5 YKTMNTFDPD(Heme)EKFEWFQVWQAVK 2 YKTMNTFDPD(Heme)EKFEWFQVWQAVK [Heme] YKTMNTFDPDEKFEWFQVWQAVK YKTMNTFDPDEKFEWFQVWQAVK YKTMNTFDPDEKFEWFQVWQVK A {'H': 32, 'C': 34, 'N': 4, 'O': 4, 'Fe': 1} ... {'H': 225, 'C': 173, 'O': 42, 'N': 35, 'S': 1,... {'H': 227, 'C': 173, 'O': 42, 'N': 35, 'S': 1,... {'H': 227, 'C': 170, 'O': 42, 'N': 35, 'S': 1,... C173H227O42N35S1 C170H227O42N35S1X3 3552.561645 0.114128 0.234021 0.698631 0.159873
6 HKSASSPAV(Pro->Val)NADTDIQDSSTPSTSPSGRR 2 HKSASSPAV(Pro->Val)NADTDIQDSSTPSTSPSGRR [Pro->Val] HKSASSPAVNADTDIQDSSTPSTSPSGRR HKSASSPAVNADTDIQDSSTPSTSPSGRR HKSSSPVNDTDIQDSSTPSTSPSG AAARR {'H': 2} ... {'H': 196, 'C': 118, 'N': 40, 'O': 49} {'H': 198, 'C': 118, 'N': 40, 'O': 49} {'H': 198, 'C': 97, 'N': 40, 'O': 49, 'X': 21} C118H198O49N40 C97H198O49N40X21 2957.407483 0.210376 0.308292 0.591515 0.251993
7 FHNK 1 FHNK [] FHNK FHNK FHNK {} ... {'H': 36, 'C': 25, 'O': 6, 'N': 8} {'H': 37, 'C': 25, 'O': 6, 'N': 8} {'H': 37, 'C': 25, 'O': 6, 'N': 8} C25H37O6N8 C25H37O6N8 544.275781 0.728121 0.223157 0.950424 0.036677
8 .(Glutathione)MDLEIK 3 .(Glutathione)MDLEIK [Glutathione] MDLEIK MDLEIK MDLEIK {'H': 15, 'C': 10, 'N': 3, 'O': 6, 'S': 1} ... {'H': 72, 'C': 42, 'S': 2, 'O': 17, 'N': 10} {'H': 75, 'C': 42, 'S': 2, 'O': 17, 'N': 10} {'H': 75, 'C': 42, 'S': 2, 'O': 17, 'N': 10} C42H75O17N10S2 C42H75O17N10S2 1052.451833 0.525852 0.274658 0.822740 0.059443
9 LANEKPEDVFER 2 LANEKPEDVFER [] LANEKPEDVFER LANEKPEDVFER LNEKPEDVFE AR {} ... {'H': 99, 'C': 63, 'O': 22, 'N': 17} {'H': 101, 'C': 63, 'O': 22, 'N': 17} {'H': 101, 'C': 54, 'O': 22, 'N': 17, 'X': 9} C63H101O22N17 C54H101O22N17X9 1445.715058 0.446843 0.341468 0.794506 0.147405
10 .(Acetyl)SDTPLR(Oxidation)D(Acetyl)EDG(Acetyl)... 3 .(Acetyl)SDTPLR(Oxidation)D(Acetyl)EDG(Acetyl)... [Acetyl, Oxidation, Acetyl, Acetyl] SDTPLRDEDGLDFWETLRSLATTNPNPPVEK SDTPLRDEDGLDFWETLRSLATTNPNPPVEK SDTPLDEDGLDFWETLSLTTNPNPPVEK RRA {'H': 6, 'C': 6, 'O': 4} ... {'H': 243, 'C': 159, 'O': 58, 'N': 41} {'H': 246, 'C': 159, 'O': 58, 'N': 41} {'H': 246, 'C': 144, 'O': 58, 'N': 41, 'X': 15} C159H246O58N41 C144H246O58N41X15 3654.732565 0.131200 0.252105 0.608763 0.230393

11 rows × 22 columns
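
In the table above, stfi_composition_peptide_with_charge differs from stfi_composition_peptide_neutral by one hydrogen per charge: for YAQEISR with charge 2, H goes from 59 to 61. A minimal sketch of that step, with a hypothetical helper name:

```python
def add_charge_protons(composition, charge):
    """Return a copy of the composition with one extra H per positive charge."""
    charged = dict(composition)
    charged["H"] = charged.get("H", 0) + charge
    return charged


neutral = {"H": 59, "C": 37, "O": 13, "N": 11}  # YAQEISR, neutral peptide
print(add_charge_protons(neutral, 2))
```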

# The most useful columns are the following.
df_final[["stfi_sequence", "stfi_charge", "stfi_M0_NC", "stfi_M1_NC", "stfi_M0_12C", "stfi_M1_12C"]]
                                        stfi_sequence  stfi_charge  stfi_M0_NC  stfi_M1_NC  stfi_M0_12C  stfi_M1_12C
0                                             YAQEISR            2    0.620499    0.280949     0.836258     0.127729
1          VLLIDLRIPQR(Phospho)SAINHIVAPNLVNVDPNLLWDK            3    0.113085    0.236277     0.583716     0.256348
2                           QRTTFFVLGINTVNYPDIYEHILER            2    0.171920    0.290033     0.672639     0.212157
3                               AELFL(Glutathione)LNR            1    0.470882    0.318073     0.768822     0.140356
4   .(Acetyl)VGEVFINYIQRQNELFQGKLAYLII(Oxidation)D...            4    0.054173    0.148735     0.481545     0.264287
5                       YKTMNTFDPD(Heme)EKFEWFQVWQAVK            2    0.114128    0.234021     0.698631     0.159873
6             HKSASSPAV(Pro->Val)NADTDIQDSSTPSTSPSGRR            2    0.210376    0.308292     0.591515     0.251993
7                                                FHNK            1    0.728121    0.223157     0.950424     0.036677
8                                .(Glutathione)MDLEIK            3    0.525852    0.274658     0.822740     0.059443
9                                        LANEKPEDVFER            2    0.446843    0.341468     0.794506     0.147405
10  .(Acetyl)SDTPLR(Oxidation)D(Acetyl)EDG(Acetyl)...            3    0.131200    0.252105     0.608763     0.230393

Concatenation of results with input data

input_file_name = "peptides.tsv"
output_file_name = Path(input_file_name).stem + "_stfi.tsv"

column_of_interest = ["stfi_neutral_mass",
                      "stfi_formula", "stfi_formula_X",
                      "stfi_M0_NC", "stfi_M1_NC",
                      "stfi_M0_12C", "stfi_M1_12C"]

# Read original file and append STFI data.
df_old = pd.read_csv(input_file_name, sep="\t")
df_new = pd.concat([df_old, df_final[column_of_interest]], axis=1)
df_new.to_csv(output_file_name, sep="\t", index=False)
!head peptides_stfi.tsv
pep_name        pep_sequence    pep_charge      stfi_neutral_mass       stfi_formula    stfi_formula_X  stfi_M0_NC      stfi_M1_NC      stfi_M0_12C     stfi_M1_12C
seq1    YAQEISR 2       865.42938099921 C37H61O13N11    C28H61O13N11X9  0.6204986747402674      0.28094895790268576     0.8362584492452608      0.1277294394585608
seq2    VLLIDLRIPQR(Phospho)SAINHIVAPNLVNVDPNLLWDK      3       3838.1022643587894      C172H288O49N48P1        C154H288O49N48P1X18     0.11308454311128492     0.23627735941497488     0.5837157078086469      0.256348239423703
seq3    QRTTFFVLGINTVNYPDIYEHILER       2       3037.56615575404        C140H214O40N36  C128H214O40N36X12       0.17192000472677066     0.29003268314604863     0.6726389393255647      0.2121565119028707
seq4    AELFL(Glutathione)LNR   1       1279.6230720783099      C55H90O18N15S1  C46H90O18N15S1X9        0.47088227298965996     0.31807282610880205     0.7688224723128251      0.1403559631032404
seq5    .(Acetyl)VGEVFINYIQRQNELFQGKLAYLII(Oxidation)DTCLSIVRPNDSKPLDNR 4       5049.63861600015        C226H365O68N61S1        C205H365O68N61S1X21     0.05417296058666768     0.14873470210020426     0.48154538801515706     0.26428662893114313
seq6    YKTMNTFDPD(Heme)EKFEWFQVWQAVK   2       3552.56164490527        C173H227O42N35S1        C170H227O42N35S1X3      0.11412815567709074     0.23402086836029898     0.6986310451922292      0.15987291091234185
seq7    HKSASSPAV(Pro->Val)NADTDIQDSSTPSTSPSGRR  2       2957.40748283616        C118H198O49N40  C97H198O49N40X21        0.21037550761092094     0.30829218128938995     0.5915145465128161      0.2519928490706656
seq8    FHNK    1       544.27578091028 C25H37O6N8      C25H37O6N8      0.7281205110566825      0.2231565512772339      0.950423678912205       0.036676880813002036
seq9    .(Glutathione)MDLEIK    3       1052.4518328895601      C42H75O17N10S2  C42H75O17N10S2  0.5258517009900313      0.27465762228958784     0.8227403058336873      0.05944288050042882
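
pd.concat with axis=1 aligns rows on the index, so the approach above relies on df_final keeping the same row index as the input file. For comparison, here is a sketch of the same column-appending step using only the standard library's csv module instead of pandas. It matches rows by position and refuses to proceed if the row counts differ. The function name and the sample data are illustrative.

```python
import csv
import io


def append_columns(tsv_text, extra_rows):
    """Append the keys of each dict in extra_rows as new columns of a TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    if len(rows) != len(extra_rows):
        raise ValueError("Input rows and computed rows must have the same length.")
    original_fields = list(reader.fieldnames or [])
    new_fields = [key for key in (extra_rows[0] if extra_rows else [])
                  if key not in original_fields]
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=original_fields + new_fields,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row, extra in zip(rows, extra_rows):
        writer.writerow({**row, **extra})
    return output.getvalue()


sample = "pep_name\tpep_sequence\tpep_charge\nseq1\tYAQEISR\t2\n"
result = append_columns(sample, [{"stfi_formula": "C37H61O13N11"}])
print(result)
```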
