Look at overlap with ProteomeScout Dataset#
This example is from an experiment that created phosphopeptides that matched proteins for binding measurements. Therefore, the peptide is given (where the central Y is phosphorylated) and the site and accession number these peptides were created from are given.
The general framework in interacting with the API will be:
Check that the protein accession can be found in ProteomeScout PTM_API.get_sequence(ACC)
For each protein/site pair, see if that phosphorylation site is known to exist in the list of PTM_API.get_phosphosites
The remainder of the code is handling possible errors in the file.
[ ]:
# Setup the workspace,
from proteomeScoutAPI.proteomeScoutAPI import ProteomeScoutAPI
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from pylab import *
import pandas as pd
from scipy import stats
import pickle
import re
# Establish the API to the current data file
# define input file for ProteomeScout data
proteomeScoutFile = '../../data/proteomescout_everything_20161218/data.tsv'
# read in ProteomeScout data
PTM_API = ProteomeScoutAPI(proteomeScoutFile)
# Open up and load the file of interest, this file may have redundant sites, so uniquify it.
dataFile = 'Overlap_exampleList.txt'
df = pd.DataFrame.from_csv(dataFile, sep='\t')
Get list of all proteins and check they have a record in ProteomeScout database file#
[2]:
protein_list = [] #this will hold accessions with protein records that can be found
protein_list_noMatch = []
for protein, group_proteins in df.groupby('acc'): # walk through each list
sequence = PTM_API.get_sequence(protein)
if sequence != '-1': #means the record could not be found
protein_list.append(protein)
else:
protein_list_noMatch.append(protein)
For proteins, get phosphorylation sites and look at overlap#
Site codes: E_NP - Error: Peptide not in protein E_NY - Error :Peptide does not conform to Y is center E_PNY - Error: Position in protein sequence is not a Y E_P_NF - Error: Protein not found in ProteomeScout No - Not known to be phosphorylated Yes - Known to be phosphorylated
[3]:
columns = ['pep', 'acc', 'site', 'phosphorylation_code' ]
df_u = pd.DataFrame(columns=columns)
rows_list = []
for protein_acc in protein_list:
known_phosphosites = PTM_API.get_phosphosites(protein_acc)
site_pos_Y = []
for pSites in known_phosphosites:
if pSites[1] == 'Y':
site_pos_Y.append(int(pSites[0]))
df_sub = df[df['acc']==protein_acc]
sequence = PTM_API.get_sequence(protein_acc)
#Walk through each site for a protein
for sites, group in df_sub.groupby(['peptide', 'site']):
site_code = ''
pep = sites[0]
#check that the middle of the peptide is a Y
offset = len(pep)/2
if pep[offset] != 'Y':
site_code = 'E_NY'
else:
match = sequence.find(pep)
if not match:
site_code = 'E_NP'
else:
pos = match+len(pep)/2 #phosphorylation site is in center of this peptide string
#print pos
#is that pos a tyrosine
if sequence[pos] == 'Y':
if int(sites[1]) in site_pos_Y:
site_code = 'Yes'
else:
site_code = 'No'
else:
site_code = 'E_PNY'
#append a row in the dataframe
row_list = (sites[0], protein_acc, sites[1], site_code)
df_u = df_u.append({'pep':sites[0], 'acc':protein_acc, 'site':sites[1], 'phosphorylation_code':site_code}, ignore_index=True)
[4]:
# append all the sites and proteins for proteins that could not be found
site_code = 'E_P_NF'
for protein_acc in protein_list_noMatch:
df_sub = df[df['acc']==protein_acc]
for sites, group in df_sub.groupby(['peptide', 'site']):
row_list = (sites[0], protein_acc, sites[1], site_code)
df_u = df_u.append({'pep':sites[0], 'acc':protein_acc, 'site':sites[1], 'phosphorylation_code':site_code}, ignore_index=True)
Create a report#
[5]:
print "The total Number of Peptides\t\t%d"%(len(df_u))
df_u.groupby(['phosphorylation_code']).size()
The total Number of Peptides 1979
[5]:
phosphorylation_code
E_PNY 35
E_P_NF 107
No 1137
Yes 700
dtype: int64
[ ]: