Annotate phosphoproteomic dataset#
In many cases, you may want to append information from ProteomeScout to your own dataset. This is done through the ProteomicDataset class, which reads in UniProt accessions and peptides and finds the information associated with each peptide.
For this to work, the annotation class expects each peptide to be formatted so that phosphorylation sites are lowercase and all other residues are uppercase. For example, in the peptide TEAQDLCRAsPEPPGPESSSR, the lowercase s marks the phosphorylated serine.
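To make the expected peptide format concrete, here is a minimal sketch that lists the lowercase (phosphorylated) residues in a peptide string. The helper name `find_lowercase_sites` is hypothetical and is not part of proteomeScoutAPI; it only illustrates the lowercase convention and reports positions within the peptide, not within the protein.

```python
def find_lowercase_sites(peptide):
    """Return (offset, residue) pairs for every lowercase residue.

    Offsets are 1-based positions within the peptide itself.
    Hypothetical helper for illustration; not part of proteomeScoutAPI.
    """
    return [
        (offset, residue.upper())
        for offset, residue in enumerate(peptide, start=1)
        if residue.islower()
    ]


# Peptide taken from the example dataset: one phosphorylated serine
print(find_lowercase_sites("TEAQDLCRAsPEPPGPESSSR"))
```

A peptide that returns an empty list has no lowercase residue, so it carries no site annotation in this format.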
Load dataset#
First, load the dataset as a dataframe. Notice the format of the peptides in the ‘pep’ column, with a lowercased phosphorylation site. If there are multiple phosphorylated sites in the peptide, there can be multiple lowercased residues.
[1]:
import pandas as pd
df = pd.read_csv('example_dataset_for_annotation.csv')
df.head()
[1]:
| | acc | Protein name | Protein symbol | Phosphorylation site | modification | pep | data:time(hours):0 | data:time(hours):0.5 | data:time(hours):1 | data:time(hours):2 | data:time(hours):4 | data:time(hours):8 | data:time(hours):24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q9C0C2 | 182 kDa tankyrase-1-binding protein. | TNKS1BP1 | S672 | phosphorylation | TEAQDLCRAsPEPPGPESSSR | 1.05 | 0.75 | 0.59 | 0.89 | 0.95 | 0.89 | 1.0 |
| 1 | Q9C0C2 | 182 kDa tankyrase-1-binding protein. | TNKS1BP1 | S691 | phosphorylation | WLDDLLAsPPPSGGGAR | 0.56 | 0.66 | 0.63 | 0.58 | 0.48 | 0.64 | 1.0 |
| 2 | P17980 | 26S protease regulatory subunit 6A | PSMC3 | Y132 | phosphorylation | QTyFLPVIGLVDAEK | 0.11 | 0.09 | 0.11 | 0.12 | 0.14 | 0.04 | 1.0 |
| 3 | O15530 | 3-phosphoinositide-dependent protein kinase 1 | PDPK1 | S241 | phosphorylation | ANsFVGTAQYVSPELLTEK | 0.64 | 0.81 | 0.58 | 0.98 | 0.62 | 0.66 | 1.0 |
| 4 | P23396 | 40S ribosomal protein S3. | RPS3 | T220/ T221* | phosphorylation | DEILPtTPISEQK | 0.73 | 0.64 | 0.59 | 0.56 | 0.61 | 0.58 | 1.0 |
Once the data has been loaded, pass the dataset to the ProteomicDataset class, indicating which column contains the SwissProt accession and which column contains your formatted peptide.
[2]:
from proteomeScoutAPI import ProteomicDataset
# initialize the ProteomicDataset object
dataset = ProteomicDataset(df, accession_col='acc', peptide_col='pep', find_site=True, GO_terms=True)
# annotate the dataset
dataset.annotate_dataset()
Annotation information has been appended to the dataset DataFrame as new columns.
This appends two types of information: gene-level information (gene name, domains, GO terms) and information about the individual PTMs (position, modification, and whether the site falls in a domain or macromolecular structure).
Gene level information will look like this:
[4]:
dataset.dataset[['gene_name', 'domains', 'domain_architecture', 'GO_terms']].head()
[4]:
| | gene_name | domains | domain_architecture | GO_terms |
|---|---|---|---|---|
| 0 | TNKS1BP1 | Tankyrase-bd_C:1554:1726:IPR032764 | Tankyrase-bd_C | GO:0071479-P:cellular response to ionizing rad... |
| 1 | TNKS1BP1 | Tankyrase-bd_C:1554:1726:IPR032764 | Tankyrase-bd_C | GO:0071479-P:cellular response to ionizing rad... |
| 2 | PSMC3 | Prot_ATP_ID_OB_2nd:112:165:IPR032501;AAA+_ATPa... | Prot_ATP_ID_OB_2nd~AAA+_ATPase~AAA_lid_3 | GO:0001824-P:blastocyst development; GO:00439... |
| 3 | PDPK1 | Prot_kinase_dom:82:342:IPR000719;PDK1-typ_PH:4... | Prot_kinase_dom~PDK1-typ_PH | GO:0030036-P:actin cytoskeleton organization; ... |
| 4 | RPS3 | KH_dom_type_2:20:93:IPR004044;Ribosomal_uS3_C:... | KH_dom_type_2~Ribosomal_uS3_C | GO:0006915-P:apoptotic process; GO:0006284-P:... |
Each column contains:
- gene_name: gene name associated with the SwissProt accession
- domains: domain information associated with the protein, with each entry separated by a semicolon. Domains are structured like “domain_name:start:stop:interpro_ID”
- domain_architecture: presents each domain in the order it appears in the protein
- GO_terms: gene ontology terms associated with the gene
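Because the domains column packs several fields into one string, it can help to split it into records before further analysis. The following is a minimal sketch, assuming only the “domain_name:start:stop:interpro_ID” layout described above; `Domain` and `parse_domains` are hypothetical names and are not part of proteomeScoutAPI.

```python
from typing import NamedTuple


class Domain(NamedTuple):
    name: str
    start: int
    stop: int
    interpro_id: str


def parse_domains(domains):
    """Split a semicolon-separated domains string into Domain records.

    Hypothetical helper for illustration; not part of proteomeScoutAPI.
    """
    parsed = []
    for entry in domains.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip empty pieces, e.g. from an empty string
        # Split from the right so a colon inside the domain name is preserved
        name, start, stop, interpro_id = entry.rsplit(":", 3)
        parsed.append(Domain(name, int(start), int(stop), interpro_id))
    return parsed


# Domain string for TNKS1BP1, taken from the table above
for domain in parse_domains("Tankyrase-bd_C:1554:1726:IPR032764"):
    print(domain.name, domain.start, domain.stop, domain.interpro_id)
```

The sketch expects a string, so missing values in the column would need to be filtered out or replaced with an empty string first.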
PTM-level information will look like this:
[6]:
dataset.dataset[['aligned_peps', 'modification_sites', 'documented_phosphosites', 'site_in_domain']].head()
[6]:
| | aligned_peps | modification_sites | documented_phosphosites | site_in_domain |
|---|---|---|---|---|
| 0 | AQDLCRAsPEPPGPE | S672 | 1 | |
| 1 | WLDDLLAsPPPSGGG | S691 | 1 | |
| 2 | KTSTRQTyFLPVIGL | Y132 | 1 | Prot_ATP_ID_OB_2nd |
| 3 | SKQARANsFVGTAQY | S241 | 1 | Prot_kinase_dom |
| 4 | PKDEILPtTPISEQK | T220 | 1 | |
Each column indicates:
- aligned_peps: sequence surrounding each phosphorylation site (+/- 7 amino acids)
- modification_sites: residue and position in the protein for each modification
- documented_phosphosites: indicates whether the specific site is found in ProteomeScout (1) or not (0)
- site_in_domain: protein domains from InterPro that contain the phosphosite
- site_in_macro: macromolecular structures that contain the phosphosite
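The relationship between these columns can be sketched in a few lines of plain Python. This is an illustration of the idea, not the package's implementation: the function names are hypothetical, domain boundaries are assumed to be inclusive, and windows near a protein terminus are simply truncated here, which may differ from what proteomeScoutAPI does.

```python
import re


def parse_site(site):
    """Split a site label such as 'Y132' into ('Y', 132)."""
    match = re.fullmatch(r"([A-Za-z])(\d+)", site.strip())
    if match is None:
        raise ValueError(f"unrecognized site label: {site!r}")
    return match.group(1).upper(), int(match.group(2))


def domains_containing(position, domains):
    """Names of domains whose start..stop range covers the position.

    Assumes 'name:start:stop:interpro_ID' entries separated by semicolons
    and treats both boundaries as inclusive (an assumption).
    """
    hits = []
    for entry in domains.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        name, start, stop, _interpro_id = entry.rsplit(":", 3)
        if int(start) <= position <= int(stop):
            hits.append(name)
    return hits


def aligned_window(sequence, position, flank=7):
    """Residues within +/- flank of a 1-based position, site in lowercase.

    Windows that run past either end of the sequence are truncated.
    """
    idx = position - 1
    left = sequence[max(0, idx - flank):idx].upper()
    right = sequence[idx + 1:idx + 1 + flank].upper()
    return left + sequence[idx].lower() + right


# PSMC3 Y132 lies inside the first domain listed for that protein above
residue, position = parse_site("Y132")
print(domains_containing(position, "Prot_ATP_ID_OB_2nd:112:165:IPR032501"))

# Toy fragment assembled from the PDPK1 rows above, not the full protein;
# the serine of interest is residue 8 of this fragment
print(aligned_window("SKQARANSFVGTAQYVSPELLTEK", 8))
```

With the values shown in the tables above, Y132 falls within Prot_ATP_ID_OB_2nd (112 to 165), while S672 of TNKS1BP1 lies outside Tankyrase-bd_C (1554 to 1726), which is consistent with its empty site_in_domain entry.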