# Map Dataset to Reference

## Download and Process Dataset of Interest
Before predicting kinase activities, datasets need to be mapped to KinPred to obtain the UniProt ID, phosphosite, and the +/-7 peptide sequence that KSTAR uses to identify which kinases are associated with each phosphosite. For mapping, the dataframe containing phosphoproteomic data should contain each peptide's UniProt accession, as well as either the site number or the peptide sequence. If the peptide sequence is used, it should be formatted so that only the phosphorylated residues are lowercased. For example, if a peptide sequence is annotated with '(ph)' in front of the phosphorylated amino acid, you would need to remove the '(ph)' from the sequence and lowercase the phosphorylated amino acid. So, the peptide sequence SGLAYCPND(ph)YHQLFSPR would become SGLAYCPNDyHQLFSPR.
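The reformatting described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of KSTAR: the function name `lowercase_phosphosites` is hypothetical, and it assumes that every '(ph)' tag sits directly in front of the modified residue and that all other residues are already uppercase.

```python
import re

# Matches a '(ph)' tag together with the single residue that follows it.
# Assumption: the tag always precedes the modified residue, as in the example above.
PH_TAG = re.compile(r"\(ph\)([A-Za-z])")

def lowercase_phosphosites(peptide: str) -> str:
    """Drop each '(ph)' tag and lowercase the residue it annotates."""
    return PH_TAG.sub(lambda match: match.group(1).lower(), peptide)

print(lowercase_phosphosites("SGLAYCPND(ph)YHQLFSPR"))
```

If the annotated sequences live in a dataframe column, the same function could be applied with something like `df['peptide'].map(lowercase_phosphosites)`, where the column name is whatever your dataset uses.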

It is recommended to use the peptide sequence rather than the site number when possible, as the peptide sequence is more likely to be found in the most recent version of KinPred. An example of the processed dataset can be seen below, which is a trimmed, processed, and mapped version of a publicly available dataset (Chylek, 2014). You can download this data from the original publication cited below or the pre-mapped version from [FigShare](https://figshare.com/articles/dataset/KSTAR_Supplementary_Data/14919726).

Reference: L. A. Chylek, V. Akimov, J. Dengjel, K. T. G. Rigbolt, B. Hu, W. S. Hlavacek, and B. Blagoev. Phosphorylation Site Dynamics of Early T-cell Receptor Signaling. PLoS ONE, 9(8):e104240, 2014.

In [None]:
#import necessary packages
import pandas as pd

#load data
df = pd.read_csv('example.tsv', index_col = 0, sep = '\t')
df.head()

Notice that all data columns in the dataset have 'data:' in front of them. This is how KSTAR identifies which columns to use when making evidence decisions. The prefix can be added manually prior to mapping, or KSTAR will add it automatically once you indicate which columns you would like to use as evidence.
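If you prefer to add the prefix manually, one possible approach is to build a rename mapping for the columns that lack it. The helper `data_prefix_map` and the column names below are hypothetical and only illustrate the idea; they are not part of KSTAR.

```python
def data_prefix_map(columns, prefix="data:"):
    """Return {old_name: new_name} for every column that lacks the prefix."""
    return {col: f"{prefix}{col}" for col in columns if not col.startswith(prefix)}

# Hypothetical quantification columns, used only for illustration.
evidence_cols = ["0min", "5min", "data:10min"]
rename_map = data_prefix_map(evidence_cols)
print(rename_map)

# With a pandas dataframe, the mapping could then be applied as:
#   df = df.rename(columns=rename_map)
```

Columns that already carry the prefix are left out of the mapping, so applying it twice does not produce names like 'data:data:10min'.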

## Map the Dataset to Reference

Before running KSTAR, you need to map your dataset to the reference phosphoproteome used by KSTAR to ensure site positions agree with the kinase-substrate information. First, indicate where to save the mapped data and the name you would like to use for output files:

In [None]:
#define the directory where mapped dataset and run information will be saved. 
odir = './example'
name = 'example_run'

Next, construct a dictionary that tells KSTAR which columns contain information about each phosphopeptide. This should include:
- *accession_id*: the UniProt accession corresponding to the identified peptide. 

and either:
- *peptide*: amino acid sequence with phosphorylation sites lowercased
- *site*: modified amino acid + modification location, such as Y11

It is recommended to use the peptide sequence when possible.
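Before handing the dictionary to KSTAR, it can help to confirm that it follows the rules listed above. The check below is a sketch under stated assumptions: `check_mapping_columns` is a hypothetical helper, not a KSTAR function, and it reads "either" as "at least one of *peptide* or *site*".

```python
def check_mapping_columns(columns, available):
    """Return a list of problems with a column dictionary (empty if none found).

    columns:   dictionary such as {'accession_id': ..., 'peptide': ...}
    available: column names present in the dataset, e.g. list(df.columns)
    """
    problems = []
    if "accession_id" not in columns:
        problems.append("missing required key 'accession_id'")
    # Assumption: at least one of 'peptide' or 'site' must be given.
    if "peptide" not in columns and "site" not in columns:
        problems.append("need at least one of 'peptide' or 'site'")
    for key, col in columns.items():
        if col not in available:
            problems.append(f"column '{col}' (for '{key}') not found in dataset")
    return problems

example_columns = {"accession_id": "query_accession", "peptide": "peptide"}
print(check_mapping_columns(example_columns, ["query_accession", "peptide"]))
```

An empty list means the dictionary names an accession column plus a peptide or site column, and that every named column exists in the dataset.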

In [None]:
#set up mapping columns: the example data only has a peptide column, so we use it and do not provide a site column
accession_col = 'query_accession'
peptide_col = 'peptide'
mapDict = {'accession_id': accession_col, 'peptide': peptide_col}

In [None]:
from kstar import mapping
#map dataset and record process in the logger
exp_mapper = mapping.ExperimentMapper(experiment = df,
                                      columns = mapDict, 
                                      odir = odir,
                                      name = name)

If you look at the `experiment` attribute of the ExperimentMapper object, you will find that five new columns have been added to the original dataset, which allow for easy mapping to the KSTAR networks used in activity prediction.

In [None]:
exp_mapper.experiment.head()

These additional columns have the following meaning:

1. *KSTAR_ACCESSION*: UniProt accession ID corresponding to the reviewed protein sequence, focusing only on the canonical isoform of each protein.
2. *KSTAR_PEPTIDE*: Peptide sequence containing a single phosphorylation site, including the 7 amino acids both before and after the modified residue.
3. *KSTAR_SITE*: Location of the modified residue in the protein.
4. *KSTAR_NUM_COMPENDIA*: The number of different phosphoproteome compendia in which the modification site is identified, used as an indicator of the study bias of each modification site. For this purpose, PhosphoSitePlus, PhosphoELM, HPRD, ______, and ProteomeScout were profiled.
5. *KSTAR_NUM_COMPENDIA_CLASS*: Same as *KSTAR_NUM_COMPENDIA*, but sites are grouped into smaller classes based on study bias (0 is <1 compendia, 1 is 1-3 compendia, 2 is >3 compendia).
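To make two of these definitions concrete, the sketch below shows how a +/-7 window and a compendia class could be derived. It only illustrates the descriptions above and is not KSTAR's implementation: the function names are hypothetical, and lowercasing the central residue and padding with '-' near the protein ends are assumptions.

```python
def flanking_window(sequence, position, flank=7, pad="-"):
    """Return the residue at 1-based `position` with `flank` residues on each side.

    Assumptions: the central residue is lowercased, and windows that run past
    either end of the protein are filled with `pad`.
    """
    idx = position - 1
    if not 0 <= idx < len(sequence):
        raise ValueError("position lies outside the sequence")
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    return (pad * (flank - len(left)) + left.upper()
            + sequence[idx].lower()
            + right.upper() + pad * (flank - len(right)))

def compendia_class(num_compendia):
    """Bin a compendia count: 0 for <1, 1 for 1-3, 2 for >3."""
    if num_compendia < 1:
        return 0
    if num_compendia <= 3:
        return 1
    return 2

toy_protein = "ACDEFGHIKLMNPQRSTVWY"  # made-up sequence, for illustration only
print(flanking_window(toy_protein, 10))
print([compendia_class(n) for n in (0, 1, 3, 4)])
```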

## Save Mapping Results

You can save the result using the `save_experiment()` method. This will save the following files in a MAPPED_DATA directory:
- *name_mapped.csv*: dataset mapped to the reference phosphoproteome, to be used for activity prediction
- *name_mapping_stats.txt*: summary of how successful the mapping process was
- *name_missed_sites.csv*: dataframe of all sites that were removed from the dataset, along with the reason for removal

After saving, the mapping_stats.txt file is a good check that the data was effectively mapped:

In [None]:
exp_mapper.save_experiment()

You can either inspect the mapping_stats.txt file directly or load it into Python to look at the results:

In [None]:
#load and print the mapping statistics
stats_file = f"{odir}/MAPPED_DATA/{name}_mapping_stats.txt"
with open(stats_file, 'r') as f:
    print(f.read())
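It can also be useful to confirm that all three output files were written. The sketch below assumes the other two files follow the same `{odir}/MAPPED_DATA/{name}_...` pattern as the stats file; `expected_outputs` is a hypothetical helper, not a KSTAR function.

```python
from pathlib import Path

def expected_outputs(odir, name):
    """Build the paths where the three mapping outputs are expected to be saved."""
    mapped_dir = Path(odir) / "MAPPED_DATA"
    suffixes = ("mapped.csv", "mapping_stats.txt", "missed_sites.csv")
    return [mapped_dir / f"{name}_{suffix}" for suffix in suffixes]

# Report which of the expected files exist for the example run.
for path in expected_outputs("./example", "example_run"):
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```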

You are now ready to proceed to activity calculation.