Project PTMs onto ESRP1 knockdown data from MATS (Yang et al, 2016)#

Here is an example of running PTM-POSE on MATS analysis of RNA sequencing data from ESRP1 knockdown experiments performed by Yang et al, 2016

First, let’s focus on skipped exon events.

Phase 1: Load the data and initialize PTM-POSE#

To identify differentially included PTMs as a result of ESRP1 knockdown, we need three layers of information for each splice event: 1. Chromosome 2. DNA strand 2. Start and end coordinates of the event (either hg19 or hg38)

Optionally, we can also provide: 1. Gene name 2. Event ID 3. Delta PSI for the event 4. Significance of the event

With PTM-POSE, we need to indicate where to find this information within the splice data

[1]:
import pandas as pd

SE_data = pd.read_excel('../../ESRP1_data/Yang2016/esrp1_knockdown_data_Yang2016.xlsx', sheet_name='rMATS ESRP KD', header = 2).iloc[0:179]


# required column information
chromosome_col = 'chr'
strand_col = 'strand'
region_start_col = 'exonStart_0base'
region_end_col = 'exonEnd'

# optional column information (None if nothing is provided and will not be appended to the output)
gene_col = 'geneSymbol'
event_id_col = None   #not in the data
dPSI_col = 'meanDeltaPSI'
sig_col = 'FDR'

#look at the data
SE_data[[gene_col, chromosome_col, strand_col, region_start_col, region_end_col, dPSI_col, sig_col]].head()
[1]:
geneSymbol chr strand exonStart_0base exonEnd meanDeltaPSI FDR
0 SPAG9 chr17 - 49053223 49053262 0.227 0
1 ARHGAP17 chr16 - 24950684 24950918 0.413 0
2 ITGA6 chr2 + 173366499 173366629 -0.361 0
3 KRAS chr12 - 25368370 25368494 -0.068 0
4 TCIRG1 chr11 + 67817953 67818131 0.368 0

The strand can either be provided use ‘+’ and ‘-’ or using 1 and -1 to indicate the forward and reverse strand, the code will convert strand to integer format (-1 or 1) when running.

If this is the first time running PTM-POSE, you will need to download ptm_coordinates. If you set save = True, the coordinates will be saved for the future so you do not need to redownload them, but you can also set save = False to avoid saving the coordinates (will take ~60MB of space)

[3]:
from ptm_pose import pose_config
pose_config.ptm_coordinates = pose_config.download_ptm_coordinates(save = True)

Phase 2: Project PTMs onto differentially included regions#

We can then use the project module of PTM-POSE to identify PTMs that can be found in these regions. This dataset uses the hg19 genome build, so we need to specify this using the ‘coordinate_type’ parameter.

[2]:
from ptm_pose import project

splice_data, spliced_ptms = project.project_ptms_onto_splice_events(SE_data, chromosome_col = chromosome_col, strand_col = strand_col, region_start_col = region_start_col, region_end_col = region_end_col, gene_col = gene_col, event_id_col = event_id_col, dPSI_col = dPSI_col, sig_col = sig_col, coordinate_type = 'hg19')
Translator file not found. Downloading mapping information between UniProt and Gene Names from pybiomart
Projecting PTMs onto splice events using hg19 coordinates.: 100%|██████████| 179/179 [00:03<00:00, 48.82it/s]
PTMs projection successful (475 identified).


From this, there are two outputs: 1. The original splice dataframe with additional PTM information added

[4]:
splice_data[[gene_col, chromosome_col, strand_col, region_start_col, region_end_col, dPSI_col, sig_col] + ['PTMs', 'Number of PTMs Affected', 'Number of Unique PTM Sites by Position', 'Event Length', 'PTM Density (PTMs/bp)']].head()
[4]:
geneSymbol chr strand exonStart_0base exonEnd meanDeltaPSI FDR PTMs Number of PTMs Affected Number of Unique PTM Sites by Position Event Length PTM Density (PTMs/bp)
0 SPAG9 17 - 49053223 49053262 0.227 0 NaN 0 0 39 0.0
1 ARHGAP17 16 - 24950684 24950918 0.413 0 Q68EM7_S575.0 (Phosphorylation)/Q68EM7_S570.0 ... 6 1 234 0.004274
2 ITGA6 2 + 173366499 173366629 -0.361 0 P23229_Ynan (Phosphorylation)/P23229_Tnan (Pho... 7 4 130 0.030769
3 KRAS 12 - 25368370 25368494 -0.068 0 P01116_C186 (Methylation)/P01116_C180 (Palmito... 3 2 124 0.016129
4 TCIRG1 11 + 67817953 67818131 0.368 0 NaN 0 0 178 0.0
  1. New dataframe that has each PTM and additional information about the PTM in its own row

[5]:
spliced_ptms.head()
[5]:
dPSI Significance Gene Source of PTM UniProtKB Accession Residue PTM Position in Canonical Isoform Gene Location (hg19) Modification Modification Class Proximity to Region Start (bp) Proximity to Region End (bp) Proximity to Splice Boundary (bp)
0 0.413 0.0 ARHGAP17 Q68EM7-1_S575 Q68EM7 S 575.0 24950686.0 Phosphoserine Phosphorylation 2.0 232.0 2.0
1 0.413 0.0 ARHGAP17 Q68EM7-1_S570 Q68EM7 S 570.0 24950701.0 Phosphoserine Phosphorylation 17.0 217.0 17.0
2 0.413 0.0 ARHGAP17 Q68EM7-1_S560 Q68EM7 S 560.0 24950731.0 Phosphoserine Phosphorylation 47.0 187.0 47.0
3 0.413 0.0 ARHGAP17 Q68EM7-1_S553 Q68EM7 S 553.0 24950752.0 Phosphoserine Phosphorylation 68.0 166.0 68.0
4 0.413 0.0 ARHGAP17 Q68EM7-1_S547 Q68EM7 S 547.0 24950770.0 Phosphoserine Phosphorylation 86.0 148.0 86.0

For MATS data, there is also a built in function for running PTM-POSE on MATS data, including all events:

Phase 3: Identify PTMs with altered flanking sequences as a result of splice events#

In addition to differential inclusion of PTMs, some PTMs may experience altered flanking sequences. We can use the project module of PTM-POSE to identify PTMs for which this happens. You will need to provide the same layers of information, plus the genomic coordinates of the regions flanking the spliced region.

[12]:
from ptm_pose import flanking_sequences

first_flank_start_col = 'firstFlankingES'
first_flank_end_col='firstFlankingEE'
second_flank_start_col = 'secondFlankingES'
second_flank_end_col = 'secondFlankingEE'

flanks = flanking_sequences.get_flanking_changes_from_splice_data(SE_data, chromosome_col = chromosome_col, strand_col = strand_col, first_flank_start_col = first_flank_start_col, first_flank_end_col=first_flank_end_col, second_flank_start_col = second_flank_start_col, second_flank_end_col = second_flank_end_col , spliced_region_start_col = region_start_col, spliced_region_end_col = region_end_col, dPSI_col=dPSI_col, sig_col = sig_col, event_id_col = event_id_col, coordinate_type = 'hg19')
c:\Users\Sam\miniconda3\envs\testing_pose\Lib\site-packages\Bio\pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
[3]:
flanks.head()
[3]:
Event ID Source of PTM Residue PTM Position in Canonical Isoform Inclusion Sequence Exclusion Sequence Region Translation Success Matched
0 3 P01116-2_T148;P01116-1_T148 T 148 ETSAKtRQESG ETSAKtRQGC* Second True False
1 3 P01116-1_K147;P01116-2_K147 K 147 IETSAkTRQES IETSAkTRQGC Second True False
0 8 Q9UPQ0-1_S746 S 746 LPNLNsQGVAW LPNLNsQGGFS First True False
1 8 Q9UPQ0-10_S750;Q9UPQ0-6_S596;Q9UPQ0-1_S750 S 750 PSQVDsPSSEK ILKVDsPSSEK Second True False
0 11 P62847-1_K129 K NaN NVGAGkKSVSW NVGAGkKAEGV First True False

We can also do additional comparisons, such as comparing sequence identity and looking for matching elm motifs.

[ ]:
flanks = flanking_sequences.compare_flanking_sequences(flanks)
flanks = flanking_sequences.compare_inclusion_motifs(flanks)
flanks[['Source of PTM','Sequence Identity', 'Altered Positions','Residue Changes', 'Altered Flank Side', 'Motif only in Inclusion', 'Motif only in Exclusion']].head()

Phase 4: Annotate PTMs with functional information#

Once we have PTMs impacted by splicing, we can also annotate them with additional information. This can be done using the annotate module of PTM-POSE, and can be used with outputs from either the project module (differentially included PTMs) or the flanking_sequence module (PTMs with altered flanking sequences).

Currently, there are functions for appending information from: 1. PhosphoSitePlus (function, biological process, disease association, interactions, and kinase-substrate), 2. PTMsigDB (iKiP db, perturbations) 3. RegPhos (kinase-substrate), 4. PTMcode (inter and intraprotein interactions) 5. PTMInt (interactions) 6. DEPOD (Phosphatase-substrate) 7. ELM (interactions, motifs)

[3]:
from ptm_pose import annotate

#where to find PhosphoSitePlus data
psp_regulatory_file = '/PhosphoSitePlus/Regulatory_sites.gz'
psp_disease_file = '/PhosphoSitePlus/Disease-associated_sites.gz'
psp_kinase_file = '/Database_Information/PhosphoSitePlus/Kinase_Substrate_Dataset.gz'

#where to find ELM data

#PhosphoSitePlus data (due to licencsing issues, must be downloaded manually from PhosphoSitePlus and the file path provided)
spliced_ptms = annotate.add_PSP_regulatory_site_data(spliced_ptms, '/PhosphoSitePlus/Regulatory_sites.gz')
spliced_ptms = annotate.add_PSP_disease_association(spliced_ptms, '/PhosphoSitePlus/Disease-associated_sites.gz')
spliced_ptms = annotate.add_PSP_kinase_substrate_data(spliced_ptms, '/Database_Information/PhosphoSitePlus/Kinase_Substrate_Dataset.gz')

#ELM interactions (will be faster if file is downloaded manually from ELM and the file path provided)
spliced_ptms = annotate.add_ELM_interactions(spliced_ptms)

#PTMint interactions
spliced_ptms = annotate.add_PTMint_data(spliced_ptms)

#PTMcode interactions (will be faster/more reliable if file is downloaded manually from PTMcode and the file path provided)
ptm_code_interprotein = '/PTMcode2_associations_between_proteins.txt.gz'

#DEPOD phosphatase data
spliced_ptms = annotate.add_DEPOD_phosphatase_data(spliced_ptms)

#RegPhos data
spliced_ptms = annotate.add_RegPhos_data(spliced_ptms)

#annotate ptms
spliced_ptms = annotate.annotate_ptms(spliced_ptms)
PhosphoSitePlus regulatory_site information added:
         ->6 PTMs in dataset found associated with a molecular function
         ->7 PTMs in dataset found associated with a biological process
         ->2 PTMs in dataset found associated with a protein interaction
PhosphoSitePlus disease associations added: 1 PTM sites in dataset found associated with a disease in PhosphoSitePlus
PhosphoSitePlus kinase-substrate interactions added: 6 phosphorylation sites in dataset found associated with a kinase in PhosphoSitePlus
ELM interaction instances added: 1 PTMs in dataset found associated with at least one known ELM instance
PTMInt data added: 2 PTMs in dataset found with PTMInt interaction information
PTMcode interprotein interactions added: 27 PTMs in dataset found with PTMcode interprotein interaction information
DEPOD Phosphatase substrates added: 0 PTMs in dataset found with Phosphatase substrate information
RegPhos kinase-substrate data added: 3 PTMs in dataset found with kinase-substrate information
c:\Users\Sam\miniconda3\envs\testing_pose\Lib\site-packages\ptm_pose\annotate.py:558: DtypeWarning: Columns (4) have mixed types. Specify dtype option on import or set low_memory=False.
  regphos = pd.read_csv('http://140.138.144.141/~RegPhos/download/RegPhos_Phos_human.txt', sep = '\t')

Phase 5: Analyze Results#

Once we have all of this information, we can start to assess how PTMs are impacted by splicing. Let’s first get an idea for how many PTMs have different annotations associated with them from the various sources

[6]:
from ptm_pose import analyze

analyze.show_available_annotations(spliced_ptms, figsize = (5,5))
../_images/Examples_ESRP1_knockdown_19_0.png

There are several ptms that have previously been annotated with specific functions, let’s take a look at those:

[10]:
annotations, annotation_counts = analyze.get_ptm_annotations(spliced_ptms, annotation_type = 'Process', database = 'PhosphoSitePlus')
annotations.head()
[10]:
Gene UniProtKB Accession Residue PTM Position in Canonical Isoform Modification Class PSP:ON_PROCESS
145 CEACAM1 P13688 S 461.0 Phosphorylation apoptosis, altered
184 YAP1 P46937 K 342.0 Ubiquitination carcinogenesis, altered
217 TSC2 P49815 S 981.0 Phosphorylation carcinogenesis, inhibited; cell growth, inhibi...
395 SPHK2 Q9NRA0 S 387.0 Phosphorylation cell motility, altered
407 SPHK2 Q9NRA0 T 614.0 Phosphorylation cell motility, altered
[11]:
annotation_counts
[11]:
PSP:ON_PROCESS count
0 cell motility, altered 3
1 cell growth, induced 2
2 apoptosis, altered 1
3 carcinogenesis, altered 1
4 carcinogenesis, inhibited 1
5 cell growth, inhibited 1
6 autophagy, inhibited 1
7 signaling pathway regulation 1
8 cytoskeletal reorganization 1
9 cell adhesion, inhibited 1