Look at overlap with ProteomeScout Dataset#

This example is from an experiment that created phosphopeptides that matched proteins for binding measurements. Therefore, the peptide is given (where the central Y is phosphorylated) and the site and accession number these peptides were created from are given.

The general framework in interacting with the API will be:

  1. Check that the protein accession can be found in ProteomeScout PTM_API.get_sequence(ACC)

  2. For each protein/site pair, see if that phosphorylation site is known to exist in the list of PTM_API.get_phosphosites

The remainder of the code is handling possible errors in the file.

[ ]:
# Setup the workspace,
from proteomeScoutAPI.proteomeScoutAPI import ProteomeScoutAPI
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from pylab import *
import pandas as pd
from scipy import stats
import pickle
import re

# Establish the API to the current data file

# define input file for ProteomeScout data
proteomeScoutFile = '../../data/proteomescout_everything_20161218/data.tsv'
# read in ProteomeScout data
PTM_API = ProteomeScoutAPI(proteomeScoutFile)

# Open up and load the file of interest, this file may have redundant sites, so uniquify it.
dataFile = 'Overlap_exampleList.txt'
df = pd.DataFrame.from_csv(dataFile, sep='\t')

Get list of all proteins and check they have a record in ProteomeScout database file#

[2]:
protein_list = [] #this will hold accessions with protein records that can be found
protein_list_noMatch = []
for protein, group_proteins in df.groupby('acc'): # walk through each list
    sequence = PTM_API.get_sequence(protein)
    if sequence != '-1': #means the record could not be found
        protein_list.append(protein)
    else:
        protein_list_noMatch.append(protein)

For proteins, get phosphorylation sites and look at overlap#

Site codes: E_NP - Error: Peptide not in protein E_NY - Error :Peptide does not conform to Y is center E_PNY - Error: Position in protein sequence is not a Y E_P_NF - Error: Protein not found in ProteomeScout No - Not known to be phosphorylated Yes - Known to be phosphorylated

[3]:
columns = ['pep', 'acc', 'site', 'phosphorylation_code' ]
df_u = pd.DataFrame(columns=columns)
rows_list = []
for protein_acc in protein_list:
    known_phosphosites = PTM_API.get_phosphosites(protein_acc)
    site_pos_Y = []
    for pSites in known_phosphosites:
        if pSites[1] == 'Y':
            site_pos_Y.append(int(pSites[0]))

    df_sub = df[df['acc']==protein_acc]
    sequence = PTM_API.get_sequence(protein_acc)
    #Walk through each site for a protein
    for sites, group in df_sub.groupby(['peptide', 'site']):
        site_code = ''
        pep = sites[0]
        #check that the middle of the peptide is a Y
        offset = len(pep)/2
        if pep[offset] != 'Y':
            site_code = 'E_NY'
        else:
            match = sequence.find(pep)
            if not match:
                site_code = 'E_NP'
            else:

                pos = match+len(pep)/2 #phosphorylation site is in center of this peptide string
                #print pos
                #is that pos a tyrosine
                if sequence[pos] == 'Y':
                    if int(sites[1]) in site_pos_Y:
                        site_code = 'Yes'
                    else:
                        site_code = 'No'
                else:
                    site_code = 'E_PNY'
        #append a row in the dataframe
        row_list = (sites[0], protein_acc, sites[1], site_code)
        df_u = df_u.append({'pep':sites[0], 'acc':protein_acc, 'site':sites[1], 'phosphorylation_code':site_code}, ignore_index=True)

[4]:
# append all the sites and proteins for proteins that could not be found
site_code = 'E_P_NF'
for protein_acc in protein_list_noMatch:
    df_sub = df[df['acc']==protein_acc]
    for sites, group in df_sub.groupby(['peptide', 'site']):
        row_list = (sites[0], protein_acc, sites[1], site_code)
        df_u = df_u.append({'pep':sites[0], 'acc':protein_acc, 'site':sites[1], 'phosphorylation_code':site_code}, ignore_index=True)

Create a report#

[5]:
print "The total Number of Peptides\t\t%d"%(len(df_u))
df_u.groupby(['phosphorylation_code']).size()
The total Number of Peptides            1979
[5]:
phosphorylation_code
E_PNY       35
E_P_NF     107
No        1137
Yes        700
dtype: int64
[ ]: