PTM-POSE Reference

Configuration

ptm_pose.pose_config.download_ptm_coordinates(save=False, max_retries=5, delay=10)

Download the ptm_coordinates dataframe from GitHub Large File Storage (LFS). By default, the file is not saved locally because of its large size (users are not forced to download it, although doing so is highly encouraged), but an option to save the file is provided if desired.

Parameters:
save: bool, optional

Whether to save the file locally into the Resource Files directory. The default is False.

max_retries: int, optional

Number of times to attempt to download the file. The default is 5.

delay: int, optional

Time to wait between download attempts. The default is 10.
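
The retry behaviour described by max_retries and delay can be pictured with a minimal sketch. This is not the library's implementation: fetch_with_retries and flaky_download are hypothetical names, and the sketch assumes delay is measured in seconds.

```python
import time


def fetch_with_retries(fetch, max_retries=5, delay=10):
    """Call fetch() until it succeeds or max_retries attempts are used up."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as err:  # a real downloader would catch narrower errors
            last_error = err
            if attempt < max_retries:
                time.sleep(delay)  # assumption: delay is in seconds
    raise RuntimeError(f"download failed after {max_retries} attempts") from last_error


# Hypothetical stand-in for a download that fails twice, then succeeds.
calls = {"n": 0}


def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ptm_coordinates"


result = fetch_with_retries(flaky_download, max_retries=5, delay=0)
```

With the defaults shown in the signature, a download that keeps failing would wait between each of its five attempts before giving up.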

PTM Projection

ptm_pose.project.project_ptms_onto_MATS(ptm_coordinates=None, SE_events=None, fiveASS_events=None, threeASS_events=None, RI_events=None, MXE_events=None, coordinate_type='hg38', identify_flanking_sequences=False, dPSI_col='meanDeltaPSI', sig_col='FDR', separate_modification_types=False, PROCESSES=1)

Given splice quantification from the MATS algorithm, annotate with PTMs that are found in the differentially included regions.

Parameters:
ptm_coordinates: pandas.DataFrame

dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs

SE_events: pandas.DataFrame

dataframe containing skipped exon event information from MATS

fiveASS_events: pandas.DataFrame

dataframe containing 5’ alternative splice site event information from MATS

threeASS_events: pandas.DataFrame

dataframe containing 3’ alternative splice site event information from MATS

RI_events: pandas.DataFrame

dataframe containing retained intron event information from MATS

MXE_events: pandas.DataFrame

dataframe containing mutually exclusive exon event information from MATS

coordinate_type: str

indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.

identify_flanking_sequences: bool

Indicate whether to look for altered flanking sequences from spliced events, in addition to those directly in the spliced region. Default is False. (not yet active)

PROCESSES: int

Number of processes to use for multiprocessing. Default is 1.

ptm_pose.project.project_ptms_onto_splice_events(splice_data, ptm_coordinates=None, annotate_original_df=True, chromosome_col='chr', strand_col='strand', region_start_col='exonStart_0base', region_end_col='exonEnd', dPSI_col=None, sig_col=None, event_id_col=None, gene_col=None, extra_cols=None, separate_modification_types=False, coordinate_type='hg38', taskbar_label=None, PROCESSES=1)

Given splice event quantification data, project PTMs onto the regions impacted by the splice events. Assumes that the splice event data has chromosome, strand, and genomic start/end positions for the regions of interest, and that each row of splice_data corresponds to a unique region.

Parameters:

splice_data: pandas.DataFrame

dataframe containing splice event information, including chromosome, strand, and genomic location of regions of interest

ptm_coordinates: pandas.DataFrame

dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs. If None, it will pull from the config file.

chromosome_col: str

column name in splice_data that contains chromosome information. Default is ‘chr’. Expects it to be a str with only the chromosome number: ‘Y’, ‘1’, ‘2’, etc.

strand_col: str

column name in splice_data that contains strand information. Default is ‘strand’. Expects it to be a str with ‘+’ or ‘-’, or integers as 1 or -1. Will convert to integers automatically if string format is provided.

region_start_col: str

column name in splice_data that contains the start position of the region of interest. Default is ‘exonStart_0base’.

region_end_col: str

column name in splice_data that contains the end position of the region of interest. Default is ‘exonEnd’.

event_id_col: str

column name in splice_data that contains the unique identifier for the splice event. If provided, will be used to annotate the ptm information with the specific splice event ID. Default is None.

gene_col: str

column name in splice_data that contains the gene name. If provided, will be used to make sure the projected PTMs stem from the same gene (there are some cases where genomic coordinates overlap between distinct genes). Default is None.

dPSI_col: str

column name in splice_data that contains the delta PSI value for the splice event. Default is None, which will not include this information in the output.

sig_col: str

column name in splice_data that contains the significance value for the splice event. Default is None, which will not include this information in the output.

extra_cols: list

list of additional columns to include in the output dataframe. Default is None, which will not include any additional columns.

coordinate_type: str

indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.

separate_modification_types: bool

Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is False.

taskbar_label: str

Label to display in the tqdm progress bar. Default is None, which will automatically state “Projecting PTMs onto regions using —– coordinates”.

PROCESSES: int

Number of processes to use for multiprocessing. Default is 1 (single process).

Returns:
spliced_ptm_info: pandas.DataFrame

Contains the PTMs identified across the different splice events

splice_data: pandas.DataFrame

dataframe containing the original splice data with an additional column ‘PTMs’ that contains the PTMs found in the region of interest, in the format of ‘SiteNumber(ModificationType)’. If no PTMs are found, the value will be np.nan.
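
The projection step can be illustrated with a small, self-contained sketch that uses plain dictionaries in place of DataFrames. It is not the library's code: the record fields, the inclusive boundary handling, the ';' separator between multiple PTMs, and the use of None in place of np.nan are all assumptions made for illustration.

```python
def normalize_strand(strand):
    """Convert '+'/'-' strings to 1/-1; pass integers through unchanged."""
    if isinstance(strand, str):
        return 1 if strand == "+" else -1
    return strand


def project_ptms(regions, ptms):
    """Return, per region, the PTMs whose genomic location falls inside it."""
    projected = []
    for region in regions:
        strand = normalize_strand(region["strand"])
        hits = [
            f"{p['site']}({p['modification']})"  # 'SiteNumber(ModificationType)'
            for p in ptms
            if p["chr"] == region["chr"]
            and p["strand"] == strand
            and region["start"] <= p["location"] <= region["end"]
        ]
        # None stands in for the np.nan used when no PTM is found
        projected.append(";".join(hits) if hits else None)
    return projected


# Hypothetical example data
regions = [
    {"chr": "1", "strand": "+", "start": 100, "end": 200},
    {"chr": "1", "strand": "-", "start": 100, "end": 200},
]
ptms = [
    {"chr": "1", "strand": 1, "location": 150, "site": "S10", "modification": "Phosphorylation"},
    {"chr": "1", "strand": 1, "location": 250, "site": "K20", "modification": "Acetylation"},
]
ptm_column = project_ptms(regions, ptms)
```

Here only the forward-strand region picks up a PTM; the second region shares coordinates but lies on the opposite strand.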

Flanking Sequences

ptm_pose.flanking_sequences.extract_region_from_splicegraph(splicegraph, region_id)

Given a region id and the splicegraph from SpliceSeq, extract the chromosome, strand, and start and stop locations of that exon. Start and stop are forced to be in ascending order, which is not necessarily true from the splice graph (i.e. start > stop for negative strand exons). This is done to make the region extraction consistent with the rest of the codebase.

Parameters:
splicegraph: pandas.DataFrame

SpliceSeq splicegraph dataframe, with region_id as index

region_id: str

Region ID to extract information from, in the format of ‘GeneName_ExonNumber’

Returns:
list

List containing the chromosome, strand (1 for forward, -1 for negative), start, and stop locations of the region
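
As a rough illustration of the ascending-order rule, the sketch below looks up a region in a dictionary standing in for the splicegraph and sorts its coordinates. The field names ("chromosome", "strand", "start", "stop") and the example region are hypothetical, not the SpliceSeq column names.

```python
def extract_region(splicegraph, region_id):
    """Return [chromosome, strand, start, stop] with start <= stop."""
    row = splicegraph[region_id]
    # Negative-strand exons may be stored with start > stop; sort to normalize.
    start, stop = sorted((row["start"], row["stop"]))
    return [row["chromosome"], row["strand"], start, stop]


# Hypothetical splicegraph keyed by 'GeneName_ExonNumber'
splicegraph = {
    "GENE1_2": {"chromosome": "7", "strand": -1, "start": 5000, "stop": 4800},
}
region = extract_region(splicegraph, "GENE1_2")
```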

ptm_pose.flanking_sequences.get_flanking_changes(ptm_coordinates, chromosome, strand, first_flank_region, spliced_region, second_flank_region, gene=None, dPSI=None, sig=None, event_id=None, flank_size=5, coordinate_type='hg38', lowercase_mod=True, order_by='Coordinates')

Currently has been tested with MATS splicing events.

Given flanking and spliced regions associated with a splice event, identify PTMs that have potential to have an altered flanking sequence depending on whether spliced region is included or excluded (if PTM is close to splice boundary). For these PTMs, extract the flanking sequences associated with the inclusion and exclusion cases and translate into amino acid sequences. If the PTM is not associated with a codon that codes for the expected amino acid, the PTM will be excluded from the results.

Parameters:
ptm_coordinates: pandas.DataFrame

DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions

chromosome: str

Chromosome associated with the splice event

strand: int

Strand associated with the splice event (1 for forward, -1 for negative)

first_flank_region: list

List containing the start and stop locations of the first flanking region (first is currently defined based on location in the genome, not the coding sequence)

spliced_region: list

List containing the start and stop locations of the spliced region

second_flank_region: list

List containing the start and stop locations of the second flanking region (second is currently defined based on location in the genome, not the coding sequence)

event_id: str, optional

Event ID associated with the splice event, by default None

flank_size: int, optional

Number of amino acids to include flanking the PTM, by default 5

coordinate_type: str, optional

Coordinate system used for the regions, by default ‘hg38’. The other option is ‘hg19’.

lowercase_mod: bool, optional

Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True

order_by: str, optional

Whether the first, spliced, and second regions are defined by their genomic coordinates (first has the smallest coordinate, spliced next, then second) or by their order in translation (first is the first to be translated, etc.), by default ‘Coordinates’

Returns:
pandas.DataFrame

DataFrame containing the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases

ptm_pose.flanking_sequences.get_flanking_changes_from_splice_data(splice_data, ptm_coordinates=None, chromosome_col=None, strand_col=None, first_flank_start_col=None, first_flank_end_col=None, spliced_region_start_col=None, spliced_region_end_col=None, second_flank_start_col=None, second_flank_end_col=None, dPSI_col=None, sig_col=None, event_id_col=None, gene_col=None, flank_size=5, coordinate_type='hg38', lowercase_mod=True)

Given a DataFrame containing information about splice events, extract the flanking sequences associated with the PTMs in the flanking regions if there is potential for this to be altered. The DataFrame should contain columns for the chromosome, strand, start and stop locations of the first flanking region, spliced region, and second flanking region. The DataFrame should also contain a column for the event ID associated with the splice event. If the DataFrame does not contain the necessary columns, the function will raise an error.

Parameters:
splice_data: pandas.DataFrame

DataFrame containing information about splice events

ptm_coordinates: pandas.DataFrame

DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions

chromosome_col: str, optional

Column name indicating chromosome, by default None

strand_col: str, optional

Column name indicating strand, by default None

first_flank_start_col: str, optional

Column name indicating start location of the first flanking region, by default None

first_flank_end_col: str, optional

Column name indicating end location of the first flanking region, by default None

spliced_region_start_col: str, optional

Column name indicating start location of the spliced region, by default None

spliced_region_end_col: str, optional

Column name indicating end location of the spliced region, by default None

second_flank_start_col: str, optional

Column name indicating start location of the second flanking region, by default None

second_flank_end_col: str, optional

Column name indicating end location of the second flanking region, by default None

event_id_col: str, optional

Column name indicating event ID, by default None

flank_size: int, optional

Number of amino acids to include flanking the PTM, by default 5

coordinate_type: str, optional

Coordinate system used for the regions, by default ‘hg38’. The other option is ‘hg19’.

lowercase_mod: bool, optional

Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True

Returns:
list

List containing DataFrames with the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases

ptm_pose.flanking_sequences.get_flanking_changes_from_splicegraph(psi_data, splicegraph, ptm_coordinates=None, dPSI_col=None, sig_col=None, event_id_col=None, extra_cols=None, gene_col='symbol', flank_size=5, coordinate_type='hg19')

Given a DataFrame containing information about splice events obtained from SpliceSeq and the corresponding splicegraph, extract the flanking sequences of PTMs that are near the splice boundary (potential for flanking sequence to be altered). Coordinate information of individual exons should be found in splicegraph. You can also provide columns with specific psi or significance information. Extra columns not in these categories can be provided with the extra_cols parameter.

Parameters:
psi_data: pandas.DataFrame

DataFrame containing information about splice events obtained from SpliceSeq

splicegraph: pandas.DataFrame

DataFrame containing information about individual exons and their coordinates

ptm_coordinates: pandas.DataFrame

DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions

dPSI_col: str, optional

Column name indicating delta PSI value, by default None

sig_col: str, optional

Column name indicating significance of the event, by default None

event_id_col: str, optional

Column name indicating event ID, by default None

extra_cols: list, optional

List of column names for additional information to add to the results, by default None

gene_col: str, optional

Column name indicating gene symbol of spliced gene, by default ‘symbol’

flank_size: int, optional

Number of amino acids to include flanking the PTM, by default 5

coordinate_type: str, optional

Coordinate system used for the regions, by default ‘hg19’. The other option is ‘hg38’.

Returns:
altered_flanks: pandas.DataFrame

DataFrame containing the PTMs associated with the flanking regions that are altered, and the flanking sequences that arise depending on whether the flanking sequence is included or not

ptm_pose.flanking_sequences.get_flanking_sequence(ptm_loc, seq, ptm_residue, flank_size=5, lowercase_mod=True, full_flanking_seq=False)

Given a PTM location in a sequence of DNA, extract the flanking sequence around the PTM location and translate it into the amino acid sequence. If the sequence is not the correct length and full_flanking_seq is not True, the function will attempt to extract the flanking sequence, using spaces to account for the missing parts. If the sequence is still not the correct length, the function will raise an error. Codons that are not in the standard codon table are translated as ‘X’ (unknown), and stop codons are translated as ‘*’.

Parameters:
ptm_loc: int

Location of the first base pair associated with PTM in the DNA sequence

seq: str

DNA sequence containing the PTM

ptm_residue: str

Amino acid residue associated with the PTM

flank_size: int, optional

Number of amino acids to include flanking the PTM, by default 5

lowercase_mod: bool, optional

Whether to lowercase the amino acid associated with the PTM, by default True

full_flanking_seq: bool, optional

Whether to require the flanking sequence to be the correct length, by default False

Returns:
str

Amino acid sequence of the flanking sequence around the PTM if translation was successful, otherwise np.nan
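
The extraction and translation described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the package's implementation: ptm_loc is treated as a 0-based index, the helper names translate and flanking_sequence are hypothetical, the check against ptm_residue is omitted, and sequences that are too short raise an error instead of being padded with spaces.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, stop_codon_symbol="*", unknown_codon_symbol="X"):
    """Translate DNA codon by codon; trailing partial codons are ignored."""
    residues = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3].upper(), unknown_codon_symbol)
        residues.append(stop_codon_symbol if aa == "*" else aa)
    return "".join(residues)


def flanking_sequence(ptm_loc, seq, flank_size=5, lowercase_mod=True):
    """Translate flank_size codons on each side of the codon starting at ptm_loc."""
    start = ptm_loc - 3 * flank_size
    end = ptm_loc + 3 + 3 * flank_size
    if start < 0 or end > len(seq):
        raise ValueError("sequence too short for the requested flank size")
    peptide = translate(seq[start:end])
    if lowercase_mod:
        # The modified residue sits exactly in the middle of the peptide
        peptide = peptide[:flank_size] + peptide[flank_size].lower() + peptide[flank_size + 1:]
    return peptide


# ATG AAA TCT GGG GAA -> M K S G E, with the PTM on the serine codon at index 6
flank = flanking_sequence(6, "ATGAAATCTGGGGAA", flank_size=2)
```

Lowercasing the central residue makes the modified position easy to spot when inclusion and exclusion flanks are compared side by side.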

ptm_pose.flanking_sequences.get_ptm_locs_in_spliced_sequences(ptm_loc_in_flank, first_flank_seq, spliced_seq, second_flank_seq, strand, which_flank='First', order_by='Coordinates')

Given the location of a PTM in a flanking sequence, extract the location of the PTM in the Inclusion Flanking Sequence and the Exclusion Flanking Sequence associated with a given splice event. Inclusion Flanking Sequence will include the skipped exon region, retained intron, or longer alternative splice site depending on event type. The PTM location should be associated with where the PTM is located relative to spliced region (before = ‘First’, after = ‘Second’).

Parameters:
ptm_loc_in_flank: int

Location of the PTM in the flanking sequence it is found (either first or second)

first_flank_seq: str

Flanking exon sequence before the spliced region

spliced_seq: str

Spliced region sequence

second_flank_seq: str

Flanking exon sequence after the spliced region

which_flank: str, optional

Which flank the PTM is associated with, by default ‘First’

order_by: str, optional

Whether the first, spliced, and second regions are defined by their genomic coordinates (first has the smallest coordinate, spliced next, then second) or by their order in translation (first is the first to be translated, etc.), by default ‘Coordinates’

Returns:
tuple

Tuple containing the PTM location in the Inclusion Flanking Sequence and the Exclusion Flanking Sequence
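
For the simplest case, the offset arithmetic behind the two locations might look like the sketch below. It assumes a forward-strand event in which the first flank, spliced region, and second flank are already in translation order, and it uses 0-based positions; the function name is hypothetical, and strand and order_by handling are left out.

```python
def ptm_locs_in_spliced_sequences(ptm_loc_in_flank, first_flank_seq, spliced_seq,
                                  second_flank_seq, which_flank="First"):
    """Return (inclusion_loc, exclusion_loc) for a PTM located in one flank."""
    if which_flank == "First":
        # Nothing precedes the first flank, so the location is unchanged.
        return ptm_loc_in_flank, ptm_loc_in_flank
    if which_flank == "Second":
        exclusion_loc = len(first_flank_seq) + ptm_loc_in_flank
        inclusion_loc = exclusion_loc + len(spliced_seq)
        return inclusion_loc, exclusion_loc
    raise ValueError("which_flank must be 'First' or 'Second'")


first, spliced, second = "ATGAAA", "GGGCCC", "TCTGAA"
inclusion_seq = first + spliced + second  # spliced region retained
exclusion_seq = first + second            # spliced region skipped
inc_loc, exc_loc = ptm_locs_in_spliced_sequences(0, first, spliced, second, which_flank="Second")
```

Both locations point at the same codon (TCT) even though its neighbours differ between the inclusion and exclusion sequences.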

ptm_pose.flanking_sequences.get_ptms_in_splicegraph_flank(gene_name, chromosome, strand, flank_region_start, flank_region_end, coordinate_type='hg19', which_flank='First', flank_size=5)

ptm_pose.flanking_sequences.get_spliceseq_event_regions(gene_name, from_exon, spliced_exons, to_exon, splicegraph)

Given all exons associated with a splicegraph event, obtain the coordinates associated with the flanking exons and the spliced region. The spliced region is defined as the exons that are associated with psi values, while flanking regions include the “from” and “to” exons that indicate the adjacent, unspliced exons.

Parameters:
gene_name: str

Gene name associated with the splice event

from_exon: int

Exon number associated with the first flanking exon

spliced_exons: str

Exon numbers associated with the spliced region, separated by colons for each unique exon

to_exon: int

Exon number associated with the second flanking exon

splicegraph: pandas.DataFrame

DataFrame containing information about individual exons and their coordinates

Returns:
tuple

Tuple containing the genomic coordinates of the first flanking region, spliced regions, and second flanking region
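
A hedged sketch of how the colon-separated spliced_exons string could be resolved into coordinates is shown below. A dictionary keyed by ‘GeneName_ExonNumber’ stands in for the splicegraph DataFrame; the gene name, exon numbers, and coordinates are invented for the example.

```python
def event_regions(gene_name, from_exon, spliced_exons, to_exon, exon_coords):
    """Return (first_flank, spliced_regions, second_flank) as [start, stop] pairs."""
    def coords(exon):
        start, stop = exon_coords[f"{gene_name}_{exon}"]
        return sorted((start, stop))  # keep coordinates in ascending order

    spliced = [coords(exon) for exon in spliced_exons.split(":")]
    return coords(from_exon), spliced, coords(to_exon)


# Hypothetical exon coordinates
exon_coords = {
    "GENE1_1": (100, 200),
    "GENE1_2": (300, 400),
    "GENE1_3": (500, 600),
    "GENE1_4": (700, 800),
}
first_flank, spliced_regions, second_flank = event_regions("GENE1", 1, "2:3", 4, exon_coords)
```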

ptm_pose.flanking_sequences.get_spliceseq_flank_loc(ptm, strand, from_region_coords, to_region_coords, coordinate_type='hg19')

Given ptm information for identifying flanking sequences from splicegraph information, extract the relative location of the ptm in the flanking region (where it is located in translation of the flanking region).

Parameters:
ptm: pandas.Series

Series containing PTM information

strand: int

Strand associated with the splice event (1 for forward, -1 for negative)

from_region_coords: list

List containing the chromosome, strand, start, and stop locations of the first flanking region

to_region_coords: list

List containing the chromosome, strand, start, and stop locations of the second flanking region

Returns:
int

Relative location of the PTM in the flanking region

ptm_pose.flanking_sequences.translate_flanking_sequence(seq, flank_size=7, full_flanking_seq=True, lowercase_mod=True, first_flank_length=None, stop_codon_symbol='*', unknown_codon_symbol='X')

Given a DNA sequence, translate the sequence into an amino acid sequence. If the sequence is not the correct length and full_flanking_seq is not True, the function will attempt to extract the flanking sequence, using spaces to account for the missing parts. If the sequence is still not the correct length, the function will raise an error. By default, codons that are not in the standard codon table are translated as ‘X’ (unknown), and stop codons are translated as ‘*’.

Parameters:
seq: str

DNA sequence to translate

flank_size: int, optional

Number of amino acids to include flanking the PTM, by default 7

full_flanking_seq: bool, optional

Whether to require the flanking sequence to be the correct length, by default True

lowercase_mod: bool, optional

Whether to lowercase the amino acid associated with the PTM, by default True

first_flank_length: int, optional

Length of the flanking sequence in front of the PTM, by default None. If full_flanking_seq is False and sequence is not the correct length, this is required.

stop_codon_symbol: str, optional

Symbol to use for stop codons, by default ‘*’

unknown_codon_symbol: str, optional

Symbol to use for unknown codons, by default ‘X’

Returns:
str

Amino acid sequence of the flanking sequence if translation was successful, otherwise np.nan

Annotating PTMs

ptm_pose.annotate.add_ELM_interactions(spliced_ptms, file=None, report_success=True)

Given a spliced PTMs dataframe from the project module, add ELM interaction data to the dataframe.

ptm_pose.annotate.add_PSP_disease_association(spliced_ptms, file='Disease-associated_sites.gz', report_success=True)

Process disease association data from PhosphoSitePlus (Disease-associated_sites.gz), and add it to the spliced_ptms dataframe from the project_ptms_onto_splice_events() function.

Parameters:
file: str

Path to the PhosphoSitePlus Disease-associated_sites.gz file. Should be downloaded from PhosphoSitePlus in the zipped format

Returns:
spliced_ptms: pandas.DataFrame

Contains the PTMs identified across the different splice events with an additional column indicating the diseases associated with that site

ptm_pose.annotate.add_PSP_kinase_substrate_data(spliced_ptms, file='Kinase_Substrate_Dataset.gz', report_success=True)

Add kinase-substrate data from PhosphoSitePlus (Kinase_Substrate_Dataset.gz) to the spliced_ptms dataframe from the project_ptms_onto_splice_events() function.

Parameters:
file: str

Path to the PhosphoSitePlus Kinase_Substrate_Dataset.gz file. Should be downloaded from PhosphoSitePlus in the zipped format

Returns:
spliced_ptms: pandas.DataFrame

Contains the PTMs identified across the different splice events with an additional column indicating the kinases known to phosphorylate that site (not relevant to non-phosphorylation PTMs)

ptm_pose.annotate.add_PSP_regulatory_site_data(spliced_ptms, file='Regulatory_sites.gz', report_success=True)

Add functional information from PhosphoSitePlus (Regulatory_sites.gz) to the spliced_ptms dataframe from the project_ptms_onto_splice_events() function.

Parameters:
file: str

Path to the PhosphoSitePlus Regulatory_sites.gz file. Should be downloaded from PhosphoSitePlus in the zipped format

Returns:
spliced_ptms: pandas.DataFrame

Contains the PTMs identified across the different splice events with additional columns for regulatory site information, including domains, biological process, functions, and protein interactions associated with the PTMs

ptm_pose.annotate.add_PTMInt_data(spliced_ptms, file=None, report_success=True)

Given spliced_ptms data from the project module, add PTMInt interaction data, which will include the protein that is being interacted with, whether the PTM enhances or inhibits binding, and the localization of the interaction. This will be added as a new column labeled PTMInt:Interactions, and each entry will be formatted like ‘Protein->Effect|Localization’. If there are multiple interactions, they will be separated by a semicolon.
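
Entries in the ‘Protein->Effect|Localization’ format can be split back into their parts with plain string handling, as in this minimal sketch. The function name and the example entry (protein names, effect labels, localizations) are hypothetical and only illustrate the format described above.

```python
def parse_ptmint_interactions(entry):
    """Split a 'Protein->Effect|Localization;...' string into dictionaries."""
    interactions = []
    for item in entry.split(";"):
        item = item.strip()
        if not item:
            continue
        protein, rest = item.split("->", 1)
        effect, localization = rest.split("|", 1)
        interactions.append(
            {"protein": protein, "effect": effect, "localization": localization}
        )
    return interactions


# Hypothetical entry with two interactions
parsed = parse_ptmint_interactions("SFN->Enhance|Cytoplasm;YWHAZ->Inhibit|Nucleus")
```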

ptm_pose.annotate.add_annotation(spliced_ptms, database='PhosphoSitePlus', annotation_type='Function', file=None, check_existing=False)

Given a desired database and annotation type, add the corresponding annotation data to the spliced ptm dataframe

Parameters:
spliced_ptms: pd.DataFrame

Dataframe containing PTM data

database: str

Database to extract annotation data from. Options include ‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’

annotation_type: str

Type of annotation to extract. Options include ‘Function’, ‘Process’, ‘Interactions’, ‘Disease’, ‘Kinase’, ‘Phosphatase’, but depend on the specific database (run analyze.get_annotation_categories())

file: str

File path to annotation data. If None, will download from online source, except for PhosphoSitePlus (due to licensing restrictions)

ptm_pose.annotate.add_custom_annotation(spliced_ptms, annotation_data, source_name, annotation_type, annotation_col, accession_col='UniProtKB Accession', residue_col='Residue', position_col='PTM Position in Canonical Isoform')

Add custom annotation data to spliced_ptms or altered flanking sequence dataframes

Parameters:
annotation_data: pandas.DataFrame

Dataframe containing the annotation data to be added to the spliced_ptms dataframe. Must contain columns for UniProtKB Accession, Residue, PTM Position in Canonical Isoform, and the annotation data to be added

source_name: str

Name of the source of the annotation data, will be used to label the columns in the spliced_ptms dataframe

annotation_type: str

Type of annotation data being added, will be used to label the columns in the spliced_ptms dataframe

annotation_col: str

Column name in the annotation data that contains the annotation data to be added to the spliced_ptms dataframe

Returns:
spliced_ptms: pandas.DataFrame

Contains the PTMs identified across the different splice events with an additional column for the custom annotation data
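
Conceptually, adding a custom annotation is a keyed join on accession, residue, and position. The sketch below shows that idea with lists of dictionaries instead of DataFrames. The helper name attach_annotation and the example rows are hypothetical, and the ‘source:type’ column label is an assumption modeled on labels such as ‘PTMInt:Interactions’.

```python
KEY_COLS = ("UniProtKB Accession", "Residue", "PTM Position in Canonical Isoform")


def attach_annotation(ptm_rows, annotation_rows, source_name, annotation_type, annotation_col):
    """Return copies of ptm_rows with one extra annotation column."""
    lookup = {
        tuple(row[c] for c in KEY_COLS): row[annotation_col]
        for row in annotation_rows
    }
    new_col = f"{source_name}:{annotation_type}"  # assumed labeling convention
    return [
        {**row, new_col: lookup.get(tuple(row[c] for c in KEY_COLS))}
        for row in ptm_rows
    ]


# Hypothetical rows
ptm_rows = [
    {"UniProtKB Accession": "P00001", "Residue": "S", "PTM Position in Canonical Isoform": 10},
    {"UniProtKB Accession": "P00002", "Residue": "K", "PTM Position in Canonical Isoform": 20},
]
annotation_rows = [
    {"UniProtKB Accession": "P00001", "Residue": "S",
     "PTM Position in Canonical Isoform": 10, "Note": "example annotation"},
]
annotated = attach_annotation(ptm_rows, annotation_rows, "MySource", "Notes", "Note")
```

PTMs without a matching annotation row keep a missing value (None here) in the new column.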

ptm_pose.annotate.annotate_ptms(spliced_ptms, psp_regulatory_site_file=None, psp_ks_file=None, psp_disease_file=None, elm_interactions=False, elm_motifs=False, PTMint=False, PTMcode_interprotein=False, DEPOD=False, RegPhos=False, ptmsigdb_file=None, interactions_to_combine=['PTMcode', 'PhosphoSitePlus', 'RegPhos', 'PTMInt'], kinases_to_combine=['PhosphoSitePlus', 'RegPhos'], combine_similar=True)

Given spliced ptm data, add annotations from various databases. The annotations that can be added are the following:

  • PhosphoSitePlus
    • regulatory site data (file must be provided)

    • kinase-substrate data (file must be provided)

    • disease association data (file must be provided)

  • ELM
    • interaction data (can be downloaded automatically or provided as a file)

    • motif matches (elm class data can be downloaded automatically or provided as a file)

  • PTMInt
    • interaction data (will be downloaded automatically)

  • PTMcode
    • intraprotein interactions (can be downloaded automatically or provided as a file)

    • interprotein interactions (can be downloaded automatically or provided as a file)

  • DEPOD
    • phosphatase-substrate data (will be downloaded automatically)

  • RegPhos
    • kinase-substrate data (will be downloaded automatically)

Parameters:
spliced_ptms: pd.DataFrame

Spliced PTM data from project module

psp_regulatory_site_file: str

File path to PhosphoSitePlus regulatory site data

psp_ks_file: str

File path to PhosphoSitePlus kinase-substrate data

psp_disease_file: str

File path to PhosphoSitePlus disease association data

elm_interactions: bool or str

If True, download ELM interaction data automatically. If str, provide file path to ELM interaction data

elm_motifs: bool or str

If True, download ELM motif data automatically. If str, provide file path to ELM motif data

PTMint: bool

If True, download PTMInt data automatically

PTMcode_intraprotein: bool or str

If True, download PTMcode intraprotein data automatically. If str, provide file path to PTMcode intraprotein data

PTMcode_interprotein: bool or str

If True, download PTMcode interprotein data automatically. If str, provide file path to PTMcode interprotein data

DEPOD: bool

If True, download DEPOD data automatically

RegPhos: bool

If True, download RegPhos data automatically

ptmsigdb_file: str

File path to PTMsigDB data

interactions_to_combine: list

List of databases to combine interaction data from. Default is [‘PTMcode’, ‘PhosphoSitePlus’, ‘RegPhos’, ‘PTMInt’]

kinases_to_combine: list

List of databases to combine kinase-substrate data from. Default is [‘PhosphoSitePlus’, ‘RegPhos’]

combine_similar: bool

Whether to combine annotations of similar information (kinase, interactions, etc) from multiple databases into another column labeled as ‘Combined’. Default is True

ptm_pose.annotate.check_file(fname, expected_extension='.tsv')

Given a file name, check if the file exists and has the expected extension. If the file does not exist or has the wrong extension, raise an error.

Parameters:
fname: str

File name to check

expected_extension: str

Expected file extension. Default is ‘.tsv’

ptm_pose.annotate.combine_KS_data(spliced_ptms, ks_databases=['PhosphoSitePlus', 'RegPhos'], regphos_conversion={'ABL1(ABL)': 'ABL1', 'CDC2': 'CDK1', 'CK2A1': 'CSNK2A1', 'ERK1(MAPK3)': 'MAPK3', 'ERK2(MAPK1)': 'MAPK1', 'JNK2(MAPK9)': 'MAPK9', 'PKACA': 'PRKACA'})

Given spliced ptm information, combine kinase-substrate data from multiple databases (currently supports PhosphoSitePlus and RegPhos), assuming that the kinase data from these resources has already been added to the spliced ptm data. The combined kinase data will be added as a new column labeled ‘Combined:Kinase’.

Parameters:
spliced_ptms: pd.DataFrame

Spliced PTM data from project module

ks_databases: list

List of databases to combine kinase data from. Currently support PhosphoSitePlus and RegPhos

regphos_conversion: dict

Allows conversion of RegPhos names to matching names in PhosphoSitePlus.

Returns:
spliced_ptms: pd.DataFrame

Spliced PTM data with combined kinase data added
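
The role of regphos_conversion can be seen in a small sketch that merges kinase names from the two sources for a single PTM. The conversion dictionary is copied from the default in the signature above; the function name, the ';' separator, and the sorted output are assumptions for illustration only.

```python
REGPHOS_CONVERSION = {
    "ABL1(ABL)": "ABL1", "CDC2": "CDK1", "CK2A1": "CSNK2A1",
    "ERK1(MAPK3)": "MAPK3", "ERK2(MAPK1)": "MAPK1",
    "JNK2(MAPK9)": "MAPK9", "PKACA": "PRKACA",
}


def combine_kinases(psp_kinases, regphos_kinases, conversion=REGPHOS_CONVERSION):
    """Merge two ';'-separated kinase strings into one de-duplicated string."""
    combined = set()
    if psp_kinases:
        combined.update(k.strip() for k in psp_kinases.split(";") if k.strip())
    if regphos_kinases:
        for kinase in regphos_kinases.split(";"):
            kinase = kinase.strip()
            if kinase:
                # Map RegPhos naming onto the PhosphoSitePlus-style name
                combined.add(conversion.get(kinase, kinase))
    return ";".join(sorted(combined)) if combined else None


combined_kinases = combine_kinases("CDK1;MAPK1", "CDC2;ERK2(MAPK1);PKACA")
```

Without the conversion step, CDC2 and CDK1 would be counted as two different kinases even though they name the same protein.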

ptm_pose.annotate.combine_interaction_data(spliced_ptms, interaction_databases=['PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD', 'ELM'], include_enzyme_interactions=True)

Given annotated spliced ptm data, extract interaction data from various databases and combine into a single dataframe. This will include the interacting protein, the type of interaction, and the source of the interaction data

Parameters:
spliced_ptms: pd.DataFrame

Dataframe containing PTM data and associated interaction annotations from various databases

interaction_databases: list

List of databases to extract interaction data from. Options include ‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’. These should already have annotation columns in the spliced_ptms dataframe, otherwise they will be ignored. For kinase-substrate interactions, if combined column is present, will use that instead of individual databases

include_enzyme_interactions: bool

If True, will include kinase-substrate and phosphatase interactions in the output dataframe

Returns:
interact_data: list

List of dataframes containing PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES), and the source of the interaction data

ptm_pose.annotate.convert_PSP_label_to_UniProt(label)

Given a label for an interacting protein from PhosphoSitePlus, convert to UniProtKB accession. Required as PhosphoSitePlus interactions are recorded in various ways that aren’t necessarily consistent with other databases (i.e. not always gene name)

Parameters:
label: str

Label for interacting protein from PhosphoSitePlus

ptm_pose.annotate.extract_positions_from_DEPOD(x)

Given string object consisting of multiple modifications in the form of ‘Residue-Position’ separated by ‘, ‘, extract the residue and position. Ignore any excess details in the string.
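
One way to parse such strings is a regular expression applied to each comma-separated item, as sketched below. The example string and the exact shape of real DEPOD entries are assumptions; the function name is hypothetical.

```python
import re

# Residue name, a hyphen, then the position; anything after that is ignored
SITE_PATTERN = re.compile(r"\s*([A-Za-z]+)-(\d+)")


def extract_positions(x):
    """Return (residue, position) pairs from 'Residue-Position' items split by ', '."""
    sites = []
    for item in x.split(", "):
        match = SITE_PATTERN.match(item)
        if match:
            sites.append((match.group(1), int(match.group(2))))
    return sites


# Hypothetical entry with an extra detail attached to the first site
sites = extract_positions("Ser-473 (in vitro), Thr-308")
```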

ptm_pose.annotate.unify_interaction_data(spliced_ptms, interaction_col, name_dict={})

Given spliced ptm data and a column containing interaction data, extract the interacting protein, type of interaction, and convert to UniProtKB accession. This will be added as a new column labeled ‘Interacting ID’

Parameters:
spliced_ptms: pd.DataFrame

Dataframe containing PTM data

interaction_col: str

column containing interaction information from a specific database

name_dict: dict

dictionary to convert names within given database to UniProt IDs. For cases when name is not necessarily one of the gene names listed in UniProt

Returns:
interact: pd.DataFrame

Contains PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES)

Analysis

ptm_pose.analyze.annotation_enrichment(spliced_ptms, database='PhosphoSitePlus', annotation_type='Function', background_type='pregenerated', collapse_on_similar=False, mod_class=None, alpha=None, min_dPSI=None, annotation_file=None, save_background=False)

In progress, needs to be tested

Given spliced ptm information (differential inclusion, altered flanking sequences, or both), calculate the enrichment of specific annotations in the dataset using a hypergeometric test. Background data can be provided/constructed in a few ways:

  1. Use preconstructed background data for the annotation of interest, which considers the entire proteome present in the ptm_coordinates dataframe. While this is the default, it may not be the most accurate representation of your data, so you may alternatively decide to use the other option, which will be more specific to your context.

  2. Use the alpha and min_dPSI parameters to construct a foreground that only includes significantly spliced PTMs, and use the entire provided spliced_ptms dataframe as the background. This will allow you to compare the enrichment of specific annotations in the significantly spliced PTMs compared to the entire dataset. Will do this automatically if alpha or min_dPSI is provided.

Parameters:
spliced_ptms: pd.DataFrame

Dataframe with PTMs projected onto splicing events and with annotations appended from various databases

database: str

database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, ‘RegPhos’, ‘PTMsigDB’. Default is ‘PhosphoSitePlus’.

annotation_type: str

Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.

background_type: str

how to construct the background data. Options include ‘pregenerated’ (default) and ‘significance’. If ‘significance’ is selected, the alpha and min_dPSI parameters must be provided. Otherwise, will use whole proteome in the ptm_coordinates dataframe as the background.

collapse_on_similar: bool

Whether to collapse similar annotations (for example, increasing and decreasing functions) into a single category. Default is False.

mod_class: str

modification class to subset, if any

alpha: float

significance threshold to use to subset foreground PTMs. Default is None.

min_dPSI: float

minimum delta PSI value to use to subset foreground PTMs. Default is None.

annotation_file: str

file to use to annotate custom background data. Default is None.

save_background: bool

Whether to save the background data constructed from the ptm_coordinates dataframe into Resource_Files within package. Default is False.
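For readers unfamiliar with the hypergeometric test mentioned in the description, the sketch below computes the upper-tail probability using only the standard library. The function name and argument names (k, n, K, N) are assumptions for illustration, and the counts are made up; the package itself may compute the statistic differently (for example with SciPy).

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated PTMs in a foreground
    of size n, drawn from a background of N PTMs of which K are annotated."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Made-up counts: 1000 background PTMs, 50 carry the annotation,
# 20 PTMs in the foreground, 5 of which carry the annotation.
p_value = hypergeom_enrichment_p(k=5, n=20, K=50, N=1000)
print(f"p = {p_value:.3g}")
```

A small p-value here would suggest the annotation is over-represented in the foreground relative to the chosen background, which is why the choice between the 'pregenerated' and 'significance' backgrounds matters.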

ptm_pose.analyze.combine_outputs(spliced_ptms, altered_flanks, mod_class=None, include_stop_codon_introduction=False, remove_conflicting=True)[source]#

Given the spliced_ptms (differentially included) and altered_flanks (altered flanking sequences) dataframes obtained from project and flanking_sequences modules, combine the two into a single dataframe that categorizes each PTM by the impact on the PTM site

Parameters:
spliced_ptms: pd.DataFrame

Dataframe with PTMs projected onto splicing events and with annotations appended from various databases

altered_flanks: pd.DataFrame

Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases

mod_class: str

modification class to subset, if any

include_stop_codon_introduction: bool

Whether to include PTMs that introduce stop codons in the altered flanks. Default is False.

remove_conflicting: bool

Whether to remove PTMs that are both included and excluded across different splicing events. Default is True.

ptm_pose.analyze.compare_inclusion_motifs(flanking_sequences, elm_classes=None)[source]#

Given a DataFrame containing flanking sequences with changes and a DataFrame containing ELM class information, identify motifs that are found in the inclusion and exclusion events, identifying motifs unique to each case. This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).

Parameters:
flanking_sequences: pandas.DataFrame

DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data()

elm_classes: pandas.DataFrame

DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv). Recommended to download this file and input it manually, but will download from ELM otherwise

Returns:
flanking_sequences: pandas.DataFrame

DataFrame containing flanking sequences with changes and motifs found in the inclusion and exclusion events

ptm_pose.analyze.edit_sequence_for_kinase_library(seq)[source]#

Convert a flanking sequence to the format accepted by the kinase library (modified residue denoted by an asterisk)
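A minimal sketch of this kind of conversion is shown below. It assumes that the modified residue is the only lowercase letter in the flanking sequence and that the target format places the asterisk directly after it; both are assumptions made for illustration, and the helper name is hypothetical rather than the package's own code.

```python
import re

def mark_modified_residue(seq):
    """Append '*' after each lowercase residue (assumed to be the modified site)."""
    return re.sub(r"([a-z])", r"\1*", seq)

# Made-up flanking sequence with a lowercase serine at the centre.
print(mark_modified_residue("PSVEPPLsQETFSDL"))
# PSVEPPLs*QETFSDL
```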

ptm_pose.analyze.findAlteredPositions(seq1, seq2, flank_size=5)[source]#

Given two sequences, identify the location of positions that have changed

Parameters:
seq1, seq2: str

sequences to compare (order does not matter)

flank_size: int

size of the flanking sequences (default is 5). This is used to make sure the provided sequences are the correct length

Returns:
altered_positions: list

list of positions that have changed

residue_change: list

list of residues that have changed associated with that position

flank_side: str

indicates which side of the flanking sequence the change has occurred (N-term, C-term, or Both)
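The comparison can be pictured with the sketch below. It assumes both sequences have length 2 * flank_size + 1 with the PTM at the centre, reports positions relative to the PTM (negative values are N-terminal), and writes each change as 'old->new'. These conventions and the function name are assumptions for illustration and may not match the package's output format.

```python
def find_altered_positions(seq1, seq2, flank_size=5):
    """Compare two equal-length flanking sequences position by position."""
    expected = 2 * flank_size + 1
    if len(seq1) != expected or len(seq2) != expected:
        raise ValueError(f"both sequences must have length {expected}")

    altered_positions, residue_change = [], []
    for index, (a, b) in enumerate(zip(seq1, seq2)):
        if a != b:
            altered_positions.append(index - flank_size)  # 0 is the PTM itself
            residue_change.append(f"{a}->{b}")

    n_term = any(pos < 0 for pos in altered_positions)
    c_term = any(pos > 0 for pos in altered_positions)
    if n_term and c_term:
        flank_side = "Both"
    elif n_term:
        flank_side = "N-term"
    elif c_term:
        flank_side = "C-term"
    else:
        flank_side = None  # no change in either flank
    return altered_positions, residue_change, flank_side

print(find_altered_positions("AAAAAsAAAAA", "AAAAAsAAGAA"))
# ([3], ['A->G'], 'C-term')
```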

ptm_pose.analyze.find_motifs(seq, elm_classes)[source]#

Given a sequence and a dataframe containing ELM class information, identify motifs that can be found in the provided sequence using the regular expressions provided by ELM (PTMs not considered). This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (e.g. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of ELM (http://elm.eu.org/elms/elms_index.tsv).

Parameters:
seq: str

sequence to search for motifs

elm_classes: pandas.DataFrame

DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv)
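The matching step can be sketched with the standard re module. The two motif entries below are made-up stand-ins for rows of the ELM class table (only the ELMIdentifier and Regex column names come from the description above), and a list of dictionaries is used in place of a DataFrame to keep the example dependency-free. The function name is hypothetical.

```python
import re

# Made-up stand-ins for ELM class rows; real identifiers and patterns
# come from the ELM download.
elm_classes = [
    {"ELMIdentifier": "EXAMPLE_SH3_LIKE", "Regex": "P..P"},
    {"ELMIdentifier": "EXAMPLE_BASIC_SITE", "Regex": "R.[ST]"},
]

def find_motifs_sketch(seq, classes):
    """Return identifiers of all classes whose pattern occurs anywhere in seq."""
    seq = seq.upper()  # drop the lowercase PTM marking: PTMs are not considered
    return [c["ELMIdentifier"] for c in classes if re.search(c["Regex"], seq)]

print(find_motifs_sketch("AAPLLPAAsAA", elm_classes))
# ['EXAMPLE_SH3_LIKE']
```

Because re.search matches anywhere in the string, a hit says nothing about where the motif sits relative to the PTM, which mirrors the caveat in the description.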

ptm_pose.analyze.gene_set_enrichment(spliced_ptms=None, altered_flanks=None, combined=None, alpha=0.05, min_dPSI=None, gene_sets=['KEGG_2021_Human', 'GO_Biological_Process_2023', 'GO_Cellular_Component_2023', 'GO_Molecular_Function_2023', 'Reactome_2022'], background=None, return_sig_only=True, max_retries=5, delay=10)[source]#

Given spliced_ptms and/or altered_flanks dataframes (or the dataframes combined from combine_outputs()), perform gene set enrichment analysis using the enrichr API

Parameters:
spliced_ptms: pd.DataFrame

Dataframe with differentially included PTMs projected onto splicing events and with annotations appended from various databases. Default is None (will not be considered in analysis). If combined dataframe is provided, this dataframe will be ignored.

altered_flanks: pd.DataFrame

Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases. Default is None (will not be considered). If combined dataframe is provided, this dataframe will be ignored.

combined: pd.DataFrame

Combined dataframe with spliced_ptms and altered_flanks dataframes. Default is None. If provided, spliced_ptms and altered_flanks dataframes will be ignored.

gene_sets: list

List of gene sets to use in enrichment analysis. Default is [‘KEGG_2021_Human’, ‘GO_Biological_Process_2023’, ‘GO_Cellular_Component_2023’, ‘GO_Molecular_Function_2023’, ‘Reactome_2022’]. See the gseapy and enrichr documentation for other available gene sets.

background: list

List of genes to use as background in enrichment analysis. Default is None (all genes in the gene set database will be used).

return_sig_only: bool

Whether to return only significantly enriched gene sets. Default is True.

max_retries: int

Number of times to retry downloading gene set enrichment data from enrichr API. Default is 5.

delay: int

Number of seconds to wait between retries. Default is 10.

Returns:
results: pd.DataFrame

Dataframe with gene set enrichment results from enrichr API
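The max_retries and delay parameters suggest a retry loop around the API request. The sketch below shows one generic way such a loop could look; the helper call_with_retries and the flaky function are hypothetical and stand in for the real request, which is not reproduced here.

```python
import time

def call_with_retries(func, max_retries=5, delay=10):
    """Call func() up to max_retries times, sleeping delay seconds between tries."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as err:  # a real client would catch narrower errors
            last_error = err
            if attempt < max_retries:
                time.sleep(delay)
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error

# Stand-in for a request that fails twice before succeeding.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

# delay=0 keeps the demonstration fast.
print(call_with_retries(flaky_request, max_retries=5, delay=0))
# ok
```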

ptm_pose.analyze.getSequenceIdentity(seq1, seq2)[source]#

Given two flanking sequences, calculate the sequence identity between them using Biopython and parameters defined by Pillman et al. BMC Bioinformatics 2011

Parameters:
seq1, seq2: str

flanking sequence

Returns:
normalized_score: float

normalized score of sequence similarity between flanking sequences (calculated similarity/max possible similarity)

ptm_pose.analyze.get_annotation_categories(spliced_ptms)[source]#

Given spliced ptm information, return the available annotation categories that have been appended to dataframe

Parameters:
spliced_ptms: pd.DataFrame

PTMs projected onto splicing events and with annotations appended from various databases

Returns:
annot_categories: pd.DataFrame

Dataframe that indicates the available databases, annotations from each database, and column associated with that annotation

ptm_pose.analyze.get_annotation_col(spliced_ptms, annotation_type='Function', database='PhosphoSitePlus')[source]#

Given the database of interest and annotation type, return the annotation column that will be found in an annotated spliced_ptms dataframe

Parameters:
spliced_ptms: pd.DataFrame

Dataframe with PTM annotations added from annotate module

annotation_type: str

Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.

database: str

database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, and ‘RegPhos’. Default is ‘PhosphoSitePlus’.

Returns:
annotation_col: str

Column name in spliced_ptms dataframe that contains the requested annotation

ptm_pose.analyze.get_enrichment_inputs(spliced_ptms, annotation_type='Function', database='PhosphoSitePlus', background_type='pregenerated', background=None, collapse_on_similar=False, mod_class=None, alpha=0.05, min_dPSI=0, annotation_file=None, save_background=False)[source]#

Given the spliced ptms, altered_flanks, or combined PTMs dataframe, identify the number of PTMs corresponding to specific annotations in the foreground (PTMs impacted by splicing) and the background (all PTMs in the proteome or all PTMs in dataset not impacted by splicing). This information can be used to calculate the enrichment of specific annotations among PTMs impacted by splicing. Several options are provided for constructing the background data: pregenerated (based on entire proteome in the ptm_coordinates dataframe) or significance (foreground PTMs are extracted from provided spliced PTMs based on significance and minimum delta PSI)

Parameters:
spliced_ptms: pd.DataFrame

ptm_pose.analyze.get_modification_counts(ptms)[source]#

Given PTM data (either spliced ptms, altered flanks, or combined data), return the counts of each modification class

Parameters:
ptms: pd.DataFrame

Dataframe with PTMs projected onto splicing events or with altered flanking sequences

Returns:
modification_counts: pd.Series

Series with the counts of each modification class

ptm_pose.analyze.get_ptm_annotations(spliced_ptms, annotation_type='Function', database='PhosphoSitePlus', mod_class=None, collapse_on_similar=False, dPSI_col=None, sig_col=None)[source]#

Given spliced ptm information obtained from project and annotate modules, grab PTMs in spliced ptms associated with specific PTM modules

Parameters:
spliced_ptms: pd.DataFrame

PTMs projected onto splicing events and with annotations appended from various databases

annotation_type: str

Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.

database: str

database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, or ‘PTMInt’. ELM and PTMInt data will automatically be downloaded, but due to download restrictions, PhosphoSitePlus data must be manually downloaded and annotated in the spliced_ptms data using functions from the annotate module. Default is ‘PhosphoSitePlus’.

mod_class: str

modification class to subset

ptm_pose.analyze.simplify_annotation(annotation, sep=',')[source]#

Given an annotation, remove additional information such as whether or not a function is increasing or decreasing. For example, ‘cell growth, induced’ would be simplified to ‘cell growth’

Parameters:
annotation: str

Annotation to simplify

sep: str

Separator that splits the core annotation from additional detail. Default is ‘,’. Assumes the first element is the core annotation.

Returns:
annotation: str

Simplified annotation
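The behaviour described above amounts to keeping the text before the first separator. A minimal sketch is given below; the function name and the whitespace stripping are illustrative assumptions, not the package's code.

```python
def simplify(annotation, sep=","):
    """Keep only the core annotation, i.e. the text before the first separator."""
    return annotation.split(sep)[0].strip()

print(simplify("cell growth, induced"))
# cell growth
```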

Plotting#
