{ "cells": [ { "cell_type": "markdown", "id": "acdd338e", "metadata": {}, "source": [ "# Map Datasets to KinPred" ] }, { "cell_type": "markdown", "id": "05c70fbb", "metadata": {}, "source": [ "## Download and Process Dataset of Interest\n", "Prior to predicting kinase activities, datasets need to be mapped to KinPred to obtain the UniProt ID, phosphosite, and the +/-7 peptide sequence (the phosphorylated residue plus seven flanking residues on each side) that KSTAR uses to identify which kinases are associated with each phosphosite. To be mapped, the dataframe containing the phosphoproteomic data should include each peptide's UniProt accession, as well as either the site number or the peptide sequence. If the peptide sequence is used, it should be formatted so that only the phosphorylated residues are lowercase. For example, if a peptide sequence is annotated with '(ph)' in front of the phosphorylated amino acid, you would need to remove the '(ph)' from the sequence and lowercase the phosphorylated amino acid. So, the peptide sequence SGLAYCPND(ph)YHQLFSPR would become SGLAYCPNDyHQLFSPR.\n", "\n", "It is recommended to use the peptide sequence rather than the site number when possible, as the sequence is more likely to be found in the most recent version of KinPred. An example of the processed dataset can be seen below, which is a trimmed, processed, and mapped version of the publicly available dataset from Chylek et al. (2014). You can download this data from the original publication below or the pre-mapped version at [FigShare](https://figshare.com/articles/dataset/KSTAR_Supplementary_Data/14919726).\n", "\n", "Reference: L. A. Chylek, V. Akimov, J. Dengjel, K. T. G. Rigbolt, B. Hu, W. S. Hlavacek, and B. Blagoev.\n", "Phosphorylation Site Dynamics of Early T-cell Receptor Signaling. PLoS ONE, 9(8):e104240,\n", "2014." 
] }, { "cell_type": "code", "execution_count": 1, "id": "67047061", "metadata": {}, "outputs": [], "source": [ "# Import KSTAR and other necessary packages\n", "import pandas as pd\n", "import numpy as np\n", "import pickle\n", "import os\n", "\n", "from kstar import config, helpers, mapping" ] }, { "cell_type": "code", "execution_count": 2, "id": "cdf7b2fd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| MS_id | query_accession | mod_sites | peptide | data:time:0 | data:time:5 | data:time:15 | data:time:30 | data:time:60 |
|---|---|---|---|---|---|---|---|---|
| 7605136 | Q9P2D3-1 | Y1104 | EAAEVCEyAMSLAK | 0.0 | -0.01 | -0.28 | -0.03 | -0.27 |
| 7605137 | A0FGR8-6 | Y845 | NLIAFSEDGSDPyVR | 0.0 | 0.26 | 0.27 | 0.04 | 0.05 |
| 7605138 | Q5T4S7-2 | Y5156 | HNDMPIyEAADK | 0.0 | 0.31 | -0.15 | 0.01 | -0.23 |
| 7605139 | Q16181-1 | Y30 | NLEGyVGFANLPNQVYR | 0.0 | -0.14 | -0.19 | 0.07 | 0.15 |
| 7605140 | Q16181-1 | Y41 | NLEGYVGFANLPNQVyR | 0.0 | -0.14 | -0.09 | 0.04 | -0.06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7605855 | O95801 | Y129 | AAAQYyLGNFR | 0.0 | 0.15 | 0.39 | -0.05 | 0.08 |
| 7605859 | O60711 | Y22 | STLQDSDEySNPAPLPLDQHSR | 0.0 | -0.73 | 0.76 | 4.48 | 5.88 |
| 7605860 | O60711 | Y203 | SGLAYCPNDyHQLFSPR | 0.0 | 0.55 | -1.77 | 3.58 | 5.15 |
| 7605861 | P47736 | Y374 | LINAEyACYK | 0.0 | -0.12 | 0.19 | 0.02 | -0.16 |
| 7605863 | Q14511 | Y345 | DGVyDVPLHNPPDAK | 0.0 | -0.41 | -0.14 | -0.12 | -0.29 |

665 rows × 8 columns
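
The `peptide` column above follows the lowercase convention described in the introduction. As a minimal sketch of that preprocessing step, assuming the raw data marks each phosphosite with a `(ph)` tag placed directly in front of the modified residue and that all other residues are already uppercase, the conversion could look like the following. The function name `lowercase_phosphosites` is hypothetical and is not part of KSTAR.

```python
import re

# Matches a '(ph)' tag together with the single residue that follows it.
PH_TAG = re.compile(r'\(ph\)([A-Za-z])')


def lowercase_phosphosites(annotated):
    # Drop each '(ph)' tag and lowercase the residue it was attached to.
    # Assumes untagged residues are already uppercase; they are left as-is.
    return PH_TAG.sub(lambda match: match.group(1).lower(), annotated)


# The example peptide from the text above.
converted = lowercase_phosphosites('SGLAYCPND(ph)YHQLFSPR')
print(converted)
```

On a pandas dataframe, the same function could be applied to a column of annotated sequences with `Series.map`. Other annotation styles (for example, a tag placed after the residue) would need a different pattern.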
|  | query_accession | mod_sites | peptide | data:time:0 | data:time:5 | data:time:15 | data:time:30 | data:time:60 | KSTAR_ACCESSION | KSTAR_PEPTIDE | KSTAR_SITE | KSTAR_NUM_COMPENDIA | KSTAR_NUM_COMPENDIA_CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q9P2D3-1 | Y1104 | EAAEVCEyAMSLAK | 0.0 | -0.01 | -0.28 | -0.03 | -0.27 | Q9P2D3 | EAAEVCEyAMSLAKN | Y1104 | 0 | 0 |
| 1 | A0FGR8-6 | Y845 | NLIAFSEDGSDPyVR | 0.0 | 0.26 | 0.27 | 0.04 | 0.05 | A0FGR8 | SEDGSDPyVRMYLLP | Y824 | 2 | 1 |
| 2 | Q5T4S7-2 | Y5156 | HNDMPIyEAADK | 0.0 | 0.31 | -0.15 | 0.01 | -0.23 | Q5T4S7 | RHNDMPIyEAADKAL | Y5135 | 1 | 1 |
| 3 | Q16181-1 | Y30 | NLEGyVGFANLPNQVYR | 0.0 | -0.14 | -0.19 | 0.07 | 0.15 | Q16181 | QQKNLEGyVGFANLP | Y30 | 5 | 2 |
| 4 | Q16181-1 | Y41 | NLEGYVGFANLPNQVyR | 0.0 | -0.14 | -0.09 | 0.04 | -0.06 | Q16181 | ANLPNQVyRKSVKRG | Y41 | 1 | 1 |
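
Each `KSTAR_PEPTIDE` value above is a 15-residue window with the phosphorylated residue lowercase at its center, which is the +/-7 sequence mentioned in the introduction. The sketch below only illustrates how such a window relates to a protein sequence and a 1-based site number. It is not KinPred's or KSTAR's implementation: the function name `flanking_window`, the `_` padding used near the protein termini, and the toy sequence are all assumptions made for illustration.

```python
def flanking_window(sequence, site, flank=7, pad='_'):
    # 'site' is a 1-based residue position, as in site labels like Y30.
    index = site - 1
    if not 0 <= index < len(sequence):
        raise ValueError('site lies outside the sequence')
    # Pad both ends so that windows near the termini keep a fixed length.
    # The padding character is an assumption, not a documented KSTAR convention.
    padded = pad * flank + sequence.upper() + pad * flank
    center = index + flank  # position of the site within the padded string
    window = padded[center - flank:center + flank + 1]
    # Lowercase only the central (phosphorylated) residue.
    return window[:flank] + window[flank].lower() + window[flank + 1:]


# Toy sequence invented for this example; position 5 is a tyrosine.
toy_sequence = 'MKTAYIAKQRQISFVKSHFSRQ'
window = flanking_window(toy_sequence, 5)
print(window)
```

With `flank=7` the result always has 15 characters, matching the length of the `KSTAR_PEPTIDE` entries in the table.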
|  | KSTAR_ACCESSION | KSTAR_SITE | data:time:5:0 | data:time:5:1 | data:time:5:2 | data:time:5:3 | data:time:5:4 | data:time:5:5 | data:time:5:6 | data:time:5:7 | ... | data:time:60:140 | data:time:60:141 | data:time:60:142 | data:time:60:143 | data:time:60:144 | data:time:60:145 | data:time:60:146 | data:time:60:147 | data:time:60:148 | data:time:60:149 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A0AUZ9 | Y792 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | A0AV02 | Y107 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | A0AVF1 | Y174 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | A0AVK6 | Y202 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | A0AVK6 | Y316 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 602 columns