Map Dataset to Reference
Download and Process Dataset of Interest
Prior to predicting kinase activities, a dataset needs to be mapped to KinPred to obtain the UniProt ID, phosphosite, and +/-7 peptide sequence that KSTAR uses to identify which kinases are associated with each phosphosite. To be mapped, the dataframe containing the phosphoproteomic data should contain each peptide's UniProt accession, as well as either the site number or the peptide sequence. If the peptide sequence is used, it should be formatted so that only the phosphorylated residues are lowercased. For example, if a peptide sequence is annotated with ‘(ph)’ in front of the phosphorylated amino acid, you would need to remove the ‘(ph)’ from the sequence and lowercase the phosphorylated amino acid, so that the peptide sequence SGLAYCPND(ph)YHQLFSPR becomes SGLAYCPNDyHQLFSPR.
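The reformatting described above can be scripted. The following is a minimal sketch, assuming modifications are marked with a ‘(ph)’ tag placed directly in front of the modified residue, as in the example. The function name `lowercase_phosphosites` is hypothetical and is not part of KSTAR.

```python
import re

def lowercase_phosphosites(peptide, tag="(ph)"):
    """Remove each tag and lowercase the residue that immediately follows it."""
    # Assumption: the tag always precedes the modified residue.
    pattern = re.escape(tag) + r"([A-Za-z])"
    return re.sub(pattern, lambda m: m.group(1).lower(), peptide)

print(lowercase_phosphosites("SGLAYCPND(ph)YHQLFSPR"))  # SGLAYCPNDyHQLFSPR
```

If your search software places the tag after the modified residue instead, the pattern would need to be adjusted accordingly.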
It is recommended to use the peptide sequence rather than the site number when possible, as the peptide is more likely to be found in the most recent version of KinPred. An example of a processed dataset can be seen below, which is a trimmed, processed, and mapped version of a publicly available dataset (Chylek, 2014). You can download the data from the original publication referenced below, or the pre-mapped version from FigShare.
Reference: L. A. Chylek, V. Akimov, J. Dengjel, K. T. G. Rigbolt, B. Hu, W. S. Hlavacek, and B. Blagoev. Phosphorylation Site Dynamics of Early T-cell Receptor Signaling. PLoS ONE, 9(8):e104240, 2014.
[1]:
#import necessary packages
import pandas as pd
#load data
df = pd.read_csv('example.tsv', index_col = 0, sep = '\t')
df.head()
[1]:
| MS_id | query_accession | mod_sites | peptide | data:time:5 | data:time:15 | data:time:30 | data:time:60 | KSTAR_ACCESSION | KSTAR_PEPTIDE | KSTAR_SITE |
|---|---|---|---|---|---|---|---|---|---|---|
| 7605136 | Q9P2D3-1 | Y1104 | EAAEVCEyAMSLAK | -0.01 | -0.28 | -0.03 | -0.27 | Q9P2D3 | EAAEVCEyAMSLAK | NaN |
| 7605137 | A0FGR8-6 | Y845 | NLIAFSEDGSDPyVR | 0.26 | 0.27 | 0.04 | 0.05 | A0FGR8 | NLIAFSEDGSDPyVR | NaN |
| 7605138 | Q5T4S7-2 | Y5156 | HNDMPIyEAADK | 0.31 | -0.15 | 0.01 | -0.23 | Q5T4S7 | HNDMPIyEAADK | NaN |
| 7605139 | Q16181-1 | Y30 | NLEGyVGFANLPNQVYR | -0.14 | -0.19 | 0.07 | 0.15 | Q16181 | NLEGyVGFANLPNQVYR | NaN |
| 7605140 | Q16181-1 | Y41 | NLEGYVGFANLPNQVyR | -0.14 | -0.09 | 0.04 | -0.06 | Q16181 | NLEGYVGFANLPNQVyR | NaN |
Notice that all data columns in the dataset are prefixed with ‘data:’. This prefix is how KSTAR identifies which columns to use when making evidence decisions. You can add the prefix manually prior to mapping, or KSTAR will add it automatically once you indicate which columns you would like to use as evidence.
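If you prefer to add the prefix manually, the renaming can be sketched as follows. This is an illustrative helper under stated assumptions: the function name `data_column_renames` and the unprefixed column names (`time:5`, `time:15`) are hypothetical, and columns that already carry the prefix are left untouched.

```python
def data_column_renames(columns, evidence_columns, prefix="data:"):
    """Build a rename mapping that adds the prefix to the chosen columns only."""
    renames = {}
    for col in columns:
        # Skip columns that are not evidence or are already prefixed.
        if col in evidence_columns and not col.startswith(prefix):
            renames[col] = prefix + col
    return renames

# Hypothetical column names for illustration.
columns = ["query_accession", "peptide", "time:5", "time:15", "data:time:30"]
renames = data_column_renames(columns, ["time:5", "time:15", "data:time:30"])
print(renames)  # {'time:5': 'data:time:5', 'time:15': 'data:time:15'}
```

With a pandas dataframe, the resulting mapping could then be applied with `df = df.rename(columns=renames)`.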
Map the Dataset to Reference
Before running KSTAR, you need to map your dataset to the reference phosphoproteome used by KSTAR to ensure that site positions agree with the kinase-substrate information. First, indicate where to save the mapped data and the name you would like to use for output files:
[2]:
#define the directory where mapped dataset and run information will be saved.
odir = './example'
name = 'example_run'
Next, construct a dictionary that tells KSTAR which columns contain information about each phosphopeptide. This should include:
accession_id: the UniProt accession corresponding to the identified peptide.
and either:
peptide: amino acid sequence with phosphorylation sites lowercased
site: modified amino acid + modification location, such as Y11
It is recommended to use the peptide sequence when possible.
[3]:
#set up mapping columns: we will use the peptide column of the example data (and not provide a site column)
accession_col = 'query_accession'
peptide_col = 'peptide'
mapDict = {'accession_id': accession_col, 'peptide': peptide_col}
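If a dataset provides site numbers rather than peptide sequences, the dictionary would use the `site` key instead. The sketch below shows one way to build either form; the helper `build_column_map` is hypothetical (not part of KSTAR), and treating `mod_sites` from the example table as the site column is an assumption for illustration. The peptide column is preferred when both are given, following the recommendation above.

```python
def build_column_map(accession_col, peptide_col=None, site_col=None):
    """Return a column dictionary with 'accession_id' plus 'peptide' or 'site'."""
    column_map = {"accession_id": accession_col}
    if peptide_col is not None:
        # Peptide sequences are the recommended option when available.
        column_map["peptide"] = peptide_col
    elif site_col is not None:
        column_map["site"] = site_col
    else:
        raise ValueError("Provide either a peptide column or a site column.")
    return column_map

# Assumed site column name, taken from the example table for illustration.
site_map = build_column_map("query_accession", site_col="mod_sites")
print(site_map)  # {'accession_id': 'query_accession', 'site': 'mod_sites'}
```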
[4]:
from kstar import mapping
#map dataset and record process in the logger
exp_mapper = mapping.ExperimentMapper(experiment = df,
                                      columns = mapDict,
                                      odir = odir,
                                      name = name)
Warning: Could not find network directory as specified in configuration file. If you have not downloaded networks, please do so using config.install_network_files(). If using your own networks, please update the configuration file using config.update_configuration() to point to correct directory and network name
Processing provided accessions...
Aligning peptides/sites to reference sequences...
Mapping peptides/sites to reference sequences: 100%|███████████████████████████████████████████████████| 689/689 [00:00<00:00, 2078.42it/s]
Mapping complete.
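Conceptually, aligning a peptide to a reference sequence yields both the site position and the +/-7 window around it. The following is a simplified sketch of that idea and not KSTAR's actual implementation: the function `map_peptide` and the toy reference sequence are made up for illustration, and windows near the protein termini are simply left shorter here.

```python
def map_peptide(peptide, reference, flank=7):
    """Locate a peptide in a reference sequence and return (site, window) pairs."""
    start = reference.find(peptide.upper())
    if start == -1:
        return []  # peptide not found in this reference sequence
    results = []
    for offset, aa in enumerate(peptide):
        if aa.islower():  # lowercase marks a phosphorylated residue
            pos = start + offset  # 0-based index in the reference
            left = reference[max(0, pos - flank):pos]
            right = reference[pos + 1:pos + 1 + flank]
            results.append((f"{aa.upper()}{pos + 1}", left + aa + right))
    return results

# Toy reference sequence (hypothetical, not a real UniProt entry).
TOY_REFERENCE = "MAAAQQKNLEGYVGFANLPNQVYRKSVKRGAAA"
print(map_peptide("NLEGyVGFANLPNQVYR", TOY_REFERENCE))
# [('Y12', 'QQKNLEGyVGFANLP')]
```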
If you look at the experiment attribute of the ExperimentMapper object, you will find that five new columns have been added to the original dataset, which allow for easy mapping to the KSTAR networks used in activity prediction.
[5]:
exp_mapper.experiment.head()
[5]:
|  | query_accession | mod_sites | peptide | data:time:5 | data:time:15 | data:time:30 | data:time:60 | KSTAR_ACCESSION | KSTAR_PEPTIDE | KSTAR_SITE | KSTAR_NUM_COMPENDIA | KSTAR_NUM_COMPENDIA_CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q9P2D3-1 | Y1104 | EAAEVCEyAMSLAK | -0.01 | -0.28 | -0.03 | -0.27 | Q9P2D3 | EAAEVCEyAMSLAKN | Y1104 | 0 | 0 |
| 1 | A0FGR8-6 | Y845 | NLIAFSEDGSDPyVR | 0.26 | 0.27 | 0.04 | 0.05 | A0FGR8 | SEDGSDPyVRMYLLP | Y824 | 2 | 1 |
| 2 | Q5T4S7-2 | Y5156 | HNDMPIyEAADK | 0.31 | -0.15 | 0.01 | -0.23 | Q5T4S7 | RHNDMPIyEAADKAL | Y5135 | 1 | 1 |
| 3 | Q16181-1 | Y30 | NLEGyVGFANLPNQVYR | -0.14 | -0.19 | 0.07 | 0.15 | Q16181 | QQKNLEGyVGFANLP | Y30 | 5 | 2 |
| 4 | Q16181-1 | Y41 | NLEGYVGFANLPNQVyR | -0.14 | -0.09 | 0.04 | -0.06 | Q16181 | ANLPNQVyRKSVKRG | Y41 | 1 | 1 |
These additional columns have the following meaning:
KSTAR_ACCESSION: UniProt accession ID corresponding to the reviewed protein sequence, focusing only on the canonical isoform of each protein.
KSTAR_PEPTIDE: Peptide sequence containing a single phosphorylation site, including the 7 amino acids both before and after the modified residue.
KSTAR_SITE: Location of modified residue in the protein
KSTAR_NUM_COMPENDIA: The number of different phosphoproteome compendia that modification site is identified in, used as an indicator of the study bias of each modification site. For this purpose, PhosphoSitePlus, PhosphoELM, HPRD, ______, and ProteomeScout were profiled.
KSTAR_NUM_COMPENDIA_CLASS: Same as KSTAR_NUM_COMPENDIA, but sites are grouped into three classes based on study bias (0 is <1 compendia, 1 is 1-3 compendia, 2 is >3 compendia)
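The relationship between the compendia count and its class can be sketched as below. This only restates the thresholds listed above; the function name `compendia_class` is hypothetical and the code is not taken from KSTAR.

```python
def compendia_class(num_compendia):
    """Bin a compendia count into the study-bias classes 0, 1, or 2."""
    if num_compendia < 1:
        return 0  # not found in any compendium
    if num_compendia <= 3:
        return 1  # found in 1-3 compendia
    return 2      # found in more than 3 compendia

counts = [0, 2, 1, 5, 1]  # KSTAR_NUM_COMPENDIA values from the table above
print([compendia_class(n) for n in counts])  # [0, 1, 1, 2, 1]
```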
Save Mapping Results
You can save the result using the save_experiment() function. This will save the following files in a MAPPED_DATA directory:
name_mapped.csv: dataset mapped to reference phosphoproteome, to be used for activity prediction
name_mapping_stats.txt: summary of how successful mapping process was
name_removed_sites.csv: dataframe of all sites that were removed from the dataset and the reason for removal
Looking at the mapping_stats.txt file is a good check to make sure the data was effectively mapped:
[6]:
exp_mapper.save_experiment()
You can either inspect the mapping_stats.txt file directly or load it into Python to look at the results:
[7]:
#load and print the mapping statistics
stats_file = f"{odir}/MAPPED_DATA/{name}_mapping_stats.txt"
with open(stats_file, 'r') as f:
    print(f.read())
Site counts per data column after mapping:
Phospho type: Y
data:time:5 -> 663
data:time:15 -> 663
data:time:30 -> 663
data:time:60 -> 663
Phospho type: ST
data:time:5 -> 1
data:time:15 -> 1
data:time:30 -> 1
data:time:60 -> 1
Mapping success statistics:
Mapped Peptides: 653/653 peptides mapped (100.00%).
Reasons for unmapped sites/peptides:
See ./example/MAPPED_DATA/example_run_removed_sites.csv for details on removed sites/peptides.
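To check the mapping success programmatically rather than by eye, the summary line can be parsed from the text. This is a minimal sketch that assumes the ‘Mapped Peptides: X/Y’ line keeps the format shown in the output above; the function name `mapped_fraction` is hypothetical.

```python
import re

def mapped_fraction(stats_text):
    """Return (mapped, total) parsed from the 'Mapped Peptides' line, or None."""
    match = re.search(r"Mapped Peptides:\s*(\d+)/(\d+)", stats_text)
    if match is None:
        return None  # line missing or format differs from the assumed one
    return int(match.group(1)), int(match.group(2))

example = (
    "Mapping success statistics:\n"
    "Mapped Peptides: 653/653 peptides mapped (100.00%).\n"
)
mapped, total = mapped_fraction(example)
print(f"{mapped / total:.2%}")  # 100.00%
```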
You are now ready to proceed to activity calculation.