Annotate phosphoproteomic dataset

In many cases, you may want to append information from ProteomeScout to a particular dataset. This is done through the ProteomicDataset class, which reads in UniProt accessions and peptides and finds the information associated with each peptide.

For this to work, the annotation class expects each peptide to be formatted so that phosphorylated residues are lowercase and all other residues are uppercase. For example, in the peptide TEAQDLCRAsPEPPGPESSSR, the lowercase s marks a phosphorylated serine.

Load dataset

First, load the dataset as a dataframe. Notice the format of the peptides in the ‘pep’ column, with a lowercased phosphorylation site. If a peptide contains multiple phosphorylated sites, each of them should be lowercased.

[1]:

import pandas as pd

df = pd.read_csv('example_dataset_for_annotation.csv')
df.head()

[1]:

	acc	Protein name	Protein symbol	Phosphorylation site	modification	pep	data:time(hours):0	data:time(hours):0.5	data:time(hours):1	data:time(hours):2	data:time(hours):4	data:time(hours):8	data:time(hours):24
0	Q9C0C2	182 kDa tankyrase-1-binding protein.	TNKS1BP1	S672	phosphorylation	TEAQDLCRAsPEPPGPESSSR	1.05	0.75	0.59	0.89	0.95	0.89	1.0
1	Q9C0C2	182 kDa tankyrase-1-binding protein.	TNKS1BP1	S691	phosphorylation	WLDDLLAsPPPSGGGAR	0.56	0.66	0.63	0.58	0.48	0.64	1.0
2	P17980	26S protease regulatory subunit 6A	PSMC3	Y132	phosphorylation	QTyFLPVIGLVDAEK	0.11	0.09	0.11	0.12	0.14	0.04	1.0
3	O15530	3-phosphoinositide-dependent protein kinase 1	PDPK1	S241	phosphorylation	ANsFVGTAQYVSPELLTEK	0.64	0.81	0.58	0.98	0.62	0.66	1.0
4	P23396	40S ribosomal protein S3.	RPS3	T220/ T221*	phosphorylation	DEILPtTPISEQK	0.73	0.64	0.59	0.56	0.61	0.58	1.0
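Before annotating, it can help to confirm that the peptides follow the lowercase convention described above. The following is a minimal sketch using only the Python standard library. The helper names `find_marked_sites` and `is_well_formatted` are hypothetical and are not part of proteomeScoutAPI, and restricting lowercase residues to s, t, and y is an assumption that suits phosphorylation data only.

```python
# Residues that can be phosphorylated (assumption for this sketch).
PHOSPHO_RESIDUES = set("sty")


def find_marked_sites(peptide):
    """Return (index, residue) pairs for every lowercase residue (0-based index)."""
    return [(i, aa) for i, aa in enumerate(peptide) if aa.islower()]


def is_well_formatted(peptide):
    """True if the peptide is all letters, has at least one lowercase
    residue, and every lowercase residue is s, t, or y."""
    if not peptide.isalpha():
        return False
    marked = find_marked_sites(peptide)
    return bool(marked) and all(aa in PHOSPHO_RESIDUES for _, aa in marked)


print(find_marked_sites("TEAQDLCRAsPEPPGPESSSR"))  # [(9, 's')]
print(is_well_formatted("QTyFLPVIGLVDAEK"))        # True
print(is_well_formatted("QTYFLPVIGLVDAEK"))        # False, no site is marked
```

With a dataframe loaded, a check like this could be applied to the peptide column, for example with `df['pep'].apply(is_well_formatted)`, to flag rows that need reformatting.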

Once the data has been loaded, pass the dataframe to the ProteomicDataset class, indicating which column contains the SwissProt accession and which column contains your formatted peptide.

[2]:

from proteomeScoutAPI import ProteomicDataset

#initialize ProteomicDataset object
dataset = ProteomicDataset(df, accession_col='acc', peptide_col='pep', find_site=True, GO_terms=True)

#annotate the dataset
dataset.annotate_dataset()

Annotation information has been appended to the dataset DataFrame as new columns.

This appends two types of information: gene-level annotations based on the entire gene (gene name, domains, GO terms) and PTM-level annotations for the individual sites (position, modification, and whether the site falls in a domain or macromolecular structure).

Gene-level information will look like this:

[3]:

dataset.dataset[['gene_name', 'domains', 'domain_architecture', 'GO_terms']].head()

[3]:

	gene_name	domains	domain_architecture	GO_terms
0	TNKS1BP1	Tankyrase-bd_C:1554:1726:IPR032764	Tankyrase-bd_C	GO:0071479-P:cellular response to ionizing rad...
1	TNKS1BP1	Tankyrase-bd_C:1554:1726:IPR032764	Tankyrase-bd_C	GO:0071479-P:cellular response to ionizing rad...
2	PSMC3	Prot_ATP_ID_OB_2nd:112:165:IPR032501;AAA+_ATPa...	Prot_ATP_ID_OB_2nd~AAA+_ATPase~AAA_lid_3	GO:0001824-P:blastocyst development; GO:00439...
3	PDPK1	Prot_kinase_dom:82:342:IPR000719;PDK1-typ_PH:4...	Prot_kinase_dom~PDK1-typ_PH	GO:0030036-P:actin cytoskeleton organization; ...
4	RPS3	KH_dom_type_2:20:93:IPR004044;Ribosomal_uS3_C:...	KH_dom_type_2~Ribosomal_uS3_C	GO:0006915-P:apoptotic process; GO:0006284-P:...

Each column contains:

gene_name: gene name associated with SwissProt accession
domains: domain information associated with the protein, with each entry separated by a semicolon. Domains are structured like “domain_name:start:stop:interpro_ID”
domain_architecture: the domains in the order they appear in the protein, separated by “~” in the output above
GO_terms: gene ontology terms associated with the gene
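Because the domains column packs several fields into one string, downstream analysis usually needs it split apart. The following is a minimal sketch of one way to do that, assuming the “domain_name:start:stop:interpro_ID” layout described above and the “~” separator seen in the domain_architecture output. The function names `parse_domains` and `parse_architecture` are hypothetical helpers, not part of proteomeScoutAPI.

```python
def parse_domains(domains):
    """Split a 'domains' cell into a list of dicts, one per domain."""
    parsed = []
    for entry in domains.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip empty cells or trailing semicolons
        # Split from the right so the last three fields are always
        # start, stop, and InterPro ID, whatever the name contains.
        name, start, stop, interpro_id = entry.rsplit(":", 3)
        parsed.append({
            "name": name,
            "start": int(start),
            "stop": int(stop),
            "interpro_id": interpro_id,
        })
    return parsed


def parse_architecture(architecture):
    """Split a 'domain_architecture' cell into an ordered list of domain names."""
    return architecture.split("~")


print(parse_domains("Tankyrase-bd_C:1554:1726:IPR032764"))
print(parse_architecture("Prot_kinase_dom~PDK1-typ_PH"))
```

This sketch expects string input, so missing values in the dataframe would need to be filtered out or replaced with an empty string first.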

PTM-level information will look like this:

[4]:

dataset.dataset[['aligned_peps', 'modification_sites', 'documented_phosphosites', 'site_in_domain:name', 'site_in_domain:interpro', 'site_in_macro', 'site_in_structure']].head()

[4]:

	aligned_peps	modification_sites	documented_phosphosites	site_in_domain:name	site_in_domain:interpro	site_in_macro	site_in_structure
0	AQDLCRAsPEPPGPE	S672	1			Acidic;Disordered
1	WLDDLLAsPPPSGGG	S691	1			Acidic;Disordered
2	KTSTRQTyFLPVIGL	Y132	1	Prot_ATP_ID_OB_2nd	IPR032501		STRAND
3	SKQARANsFVGTAQY	S241	1	Prot_kinase_dom	IPR000719
4	PKDEILPtTPISEQK	T220	1			Disordered

Each column indicates:

aligned_peps: sequence surrounding the phosphorylation site (+/- 7 amino acids)
modification_sites: residue and position in the protein for each modification
documented_phosphosites: indicates whether the specific site is found in ProteomeScout (1) or not (0)
site_in_domain:name and site_in_domain:interpro: name and InterPro ID of the protein domains that contain the phosphosite
site_in_macro: macromolecular structures that contain the phosphosite
site_in_structure: secondary structures associated with phosphosite
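To make the site_in_domain columns concrete, the sketch below shows how a site such as Y132 can be compared against domain boundaries. It illustrates the idea only and is not the library's own implementation. The helper names `parse_site` and `domains_containing` are hypothetical, and treating the start and stop positions as inclusive is an assumption.

```python
import re


def parse_site(site):
    """Split a site label such as 'S672' into ('S', 672)."""
    match = re.fullmatch(r"([A-Za-z])(\d+)", site.strip())
    if match is None:
        raise ValueError(f"Unrecognized site format: {site!r}")
    return match.group(1), int(match.group(2))


def domains_containing(site, domains):
    """Return names of domains whose range contains the site.

    domains is a list of (name, start, stop) tuples; bounds are
    treated as inclusive (assumption for this sketch).
    """
    _, position = parse_site(site)
    return [name for name, start, stop in domains if start <= position <= stop]


# Values taken from the example output above.
print(domains_containing("Y132", [("Prot_ATP_ID_OB_2nd", 112, 165)]))  # ['Prot_ATP_ID_OB_2nd']
print(domains_containing("S672", [("Tankyrase-bd_C", 1554, 1726)]))    # []
```

These results agree with the table above: Y132 of PSMC3 lies inside Prot_ATP_ID_OB_2nd, while S672 of TNKS1BP1 lies outside its only listed domain, so its site_in_domain columns are empty.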
