Annotate phosphoproteomic dataset


In many cases, you may want to append information from ProteomeScout to a particular dataset. This is done through the ProteomicDataset class, which reads a UniProt accession and a peptide for each row and looks up the information associated with that peptide.

For this to work, the annotation class expects each peptide to be formatted so that phosphorylated residues are lowercase and all other residues are uppercase. For example, the peptide QTyFLPVIGLVDAEK has a single phosphorylated tyrosine, marked by the lowercase y.
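
Before annotating, it can help to check that peptides follow this convention. The following is a minimal sketch, not part of proteomeScoutAPI: the helper names `lowercase_sites` and `looks_annotated` are hypothetical, and restricting modified residues to s, t, and y is an assumption made here for phosphorylation data.

```python
def lowercase_sites(peptide):
    """Return (0-based offset, residue) for every lowercase residue in a peptide."""
    return [(i, aa) for i, aa in enumerate(peptide) if aa.islower()]


def looks_annotated(peptide, phospho_residues="sty"):
    """Check that a peptide is alphabetic and has at least one lowercase s/t/y.

    The allowed residue set is an assumption for this sketch; adjust it if
    your data marks other modified residues in lowercase.
    """
    sites = lowercase_sites(peptide)
    return (
        peptide.isalpha()
        and len(sites) > 0
        and all(aa in phospho_residues for _, aa in sites)
    )


# Peptide taken from the example dataset (PSMC3, Y132)
print(lowercase_sites("QTyFLPVIGLVDAEK"))
print(looks_annotated("QTyFLPVIGLVDAEK"))
# The same peptide without a lowercase residue fails the check
print(looks_annotated("QTYFLPVIGLVDAEK"))
```

Applied to a DataFrame column, a check like this could flag rows that would need reformatting before they are passed to the annotation class.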

Load dataset

First, load the dataset as a dataframe. Notice the format of the peptides in the ‘pep’ column, with a lowercased phosphorylation site. If there are multiple phosphorylated sites in the peptide, there can be multiple lowercased residues.

[1]:
import pandas as pd

df = pd.read_csv('example_dataset_for_annotation.csv')
df.head()
[1]:
| | acc | Protein name | Protein symbol | Phosphorylation site | modification | pep | data:time(hours):0 | data:time(hours):0.5 | data:time(hours):1 | data:time(hours):2 | data:time(hours):4 | data:time(hours):8 | data:time(hours):24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q9C0C2 | 182 kDa tankyrase-1-binding protein. | TNKS1BP1 | S672 | phosphorylation | TEAQDLCRAsPEPPGPESSSR | 1.05 | 0.75 | 0.59 | 0.89 | 0.95 | 0.89 | 1.0 |
| 1 | Q9C0C2 | 182 kDa tankyrase-1-binding protein. | TNKS1BP1 | S691 | phosphorylation | WLDDLLAsPPPSGGGAR | 0.56 | 0.66 | 0.63 | 0.58 | 0.48 | 0.64 | 1.0 |
| 2 | P17980 | 26S protease regulatory subunit 6A | PSMC3 | Y132 | phosphorylation | QTyFLPVIGLVDAEK | 0.11 | 0.09 | 0.11 | 0.12 | 0.14 | 0.04 | 1.0 |
| 3 | O15530 | 3-phosphoinositide-dependent protein kinase 1 | PDPK1 | S241 | phosphorylation | ANsFVGTAQYVSPELLTEK | 0.64 | 0.81 | 0.58 | 0.98 | 0.62 | 0.66 | 1.0 |
| 4 | P23396 | 40S ribosomal protein S3. | RPS3 | T220/ T221* | phosphorylation | DEILPtTPISEQK | 0.73 | 0.64 | 0.59 | 0.56 | 0.61 | 0.58 | 1.0 |

Once the data has been loaded, pass the DataFrame to the ProteomicDataset class, indicating which column contains the SwissProt accession and which column contains the formatted peptide.

[2]:
from proteomeScoutAPI import ProteomicDataset

# Initialize the ProteomicDataset object
dataset = ProteomicDataset(df, accession_col='acc', peptide_col='pep', find_site=True, GO_terms=True)

# Annotate the dataset
dataset.annotate_dataset()
Annotation information has been appended to the dataset DataFrame as new columns.

This appends two types of information: gene-level information for the whole protein (gene name, domains, GO terms) and PTM-level information for the individual sites (position, modification, and whether the site lies in a domain or macromolecular structure).

Gene level information will look like this:

[4]:
dataset.dataset[['gene_name', 'domains', 'domain_architecture', 'GO_terms']].head()
[4]:
| | gene_name | domains | domain_architecture | GO_terms |
|---|---|---|---|---|
| 0 | TNKS1BP1 | Tankyrase-bd_C:1554:1726:IPR032764 | Tankyrase-bd_C | GO:0071479-P:cellular response to ionizing rad... |
| 1 | TNKS1BP1 | Tankyrase-bd_C:1554:1726:IPR032764 | Tankyrase-bd_C | GO:0071479-P:cellular response to ionizing rad... |
| 2 | PSMC3 | Prot_ATP_ID_OB_2nd:112:165:IPR032501;AAA+_ATPa... | Prot_ATP_ID_OB_2nd~AAA+_ATPase~AAA_lid_3 | GO:0001824-P:blastocyst development;  GO:00439... |
| 3 | PDPK1 | Prot_kinase_dom:82:342:IPR000719;PDK1-typ_PH:4... | Prot_kinase_dom~PDK1-typ_PH | GO:0030036-P:actin cytoskeleton organization; ... |
| 4 | RPS3 | KH_dom_type_2:20:93:IPR004044;Ribosomal_uS3_C:... | KH_dom_type_2~Ribosomal_uS3_C | GO:0006915-P:apoptotic process;  GO:0006284-P:... |

Each column contains:

  • gene_name: gene name associated with SwissProt accession

  • domains: domain information associated with the protein, with each entry separated by a semicolon. Domains are structured like “domain_name:start:stop:interpro_ID”

  • domain_architecture: the domain names in the order they appear in the protein, separated by “~”

  • GO_terms: gene ontology terms associated with the gene
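
The `domains` strings can be split back into structured records for downstream filtering. The sketch below assumes only the “domain_name:start:stop:interpro_ID” layout described above; `parse_domains` and `architecture` are hypothetical helpers, not functions provided by proteomeScoutAPI, and the two-domain input string is made up for illustration.

```python
def parse_domains(domains):
    """Split a 'name:start:stop:interpro_ID;...' string into a list of dicts."""
    parsed = []
    for entry in domains.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # Split from the right so a domain name containing ':' stays intact
        name, start, stop, interpro_id = entry.rsplit(":", 3)
        parsed.append(
            {
                "name": name,
                "start": int(start),
                "stop": int(stop),
                "interpro_id": interpro_id,
            }
        )
    return parsed


def architecture(domains):
    """Join domain names in order of start position, separated by '~'."""
    ordered = sorted(parse_domains(domains), key=lambda d: d["start"])
    return "~".join(d["name"] for d in ordered)


# Domain string taken from the TNKS1BP1 rows above
print(parse_domains("Tankyrase-bd_C:1554:1726:IPR032764"))

# Made-up two-domain string, listed out of order on purpose
print(architecture("B_dom:50:90:IPR000002;A_dom:1:40:IPR000001"))
```

Sorting by start position is one plausible way to reproduce the ordering seen in `domain_architecture`; how the library itself builds that column is not confirmed here.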

PTM-level information will look like this:

[6]:
dataset.dataset[['aligned_peps', 'modification_sites', 'documented_phosphosites', 'site_in_domain']].head()
[6]:
| | aligned_peps | modification_sites | documented_phosphosites | site_in_domain |
|---|---|---|---|---|
| 0 | AQDLCRAsPEPPGPE | S672 | 1 | |
| 1 | WLDDLLAsPPPSGGG | S691 | 1 | |
| 2 | KTSTRQTyFLPVIGL | Y132 | 1 | Prot_ATP_ID_OB_2nd |
| 3 | SKQARANsFVGTAQY | S241 | 1 | Prot_kinase_dom |
| 4 | PKDEILPtTPISEQK | T220 | 1 | |

Each column indicates:

  • aligned_peps: sequence surrounding the phosphorylation site (+/- 7 amino acids)

  • modification_sites: residue and position in the protein for each modification

  • documented_phosphosites: indicates whether the specific site is found in ProteomeScout (1) or not (0)

  • site_in_domain: protein domains from InterPro that contain the phosphosite

  • site_in_macro: macromolecular structures that contain the phosphosite
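
To make the relationship between `modification_sites`, `domains`, and `site_in_domain` concrete, the sketch below checks whether a site position falls inside any domain range. It is an illustration only: `parse_site` and `domains_containing` are hypothetical helpers, and treating the start and stop positions as inclusive is an assumption, not a documented behavior of the library.

```python
import re


def parse_site(site):
    """Split a site label such as 'S672' into its residue and position."""
    match = re.fullmatch(r"([A-Z])(\d+)", site)
    if match is None:
        raise ValueError(f"Unrecognized site label: {site!r}")
    return match.group(1), int(match.group(2))


def domains_containing(site, domains):
    """Return names of domains whose start..stop range includes the site.

    Assumes 'name:start:stop:interpro_ID' entries separated by ';' and
    inclusive range boundaries.
    """
    _, position = parse_site(site)
    hits = []
    for entry in domains.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, start, stop, _interpro_id = entry.rsplit(":", 3)
        if int(start) <= position <= int(stop):
            hits.append(name)
    return hits


# Y132 of PSMC3 lies inside the first listed domain (residues 112 to 165)
print(domains_containing("Y132", "Prot_ATP_ID_OB_2nd:112:165:IPR032501"))

# S672 of TNKS1BP1 lies outside its only domain (residues 1554 to 1726)
print(domains_containing("S672", "Tankyrase-bd_C:1554:1726:IPR032764"))
```

Both results agree with the `site_in_domain` values shown in the output above, where Y132 is reported in Prot_ATP_ID_OB_2nd and S672 has no domain.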