PTM-POSE Reference

PTM-POSE Reference#

Configuration#

ptm_pose.pose_config.download_translator(save=False)[source]#

Using rest API from UniProt, download mapping information between UniProt IDs, Gene names, and Ensembl Gene IDs. This information is used to convert between different gene identifiers and UniProt IDs

Parameters:

savebool, optional: Whether to save the translator file locally. The default is False.

PTM Projection#

ptm_pose.project.find_ptms_in_region(ptm_coordinates, chromosome, strand, start, end, gene=None, coordinate_type='hg38')[source]#

Given an genomic region in either hg38 or hg19 coordinates (such as the region encoding an exon of interest), identify PTMs that are mapped to that region. If so, return the exon number. If none are found, return np.nan.

Parameters:

chromosome: str: chromosome where region is located
strand: int: DNA strand for region is found on (1 for forward, -1 for reverse)
start: int: start position of region on the chromosome/strand (should always be less than end)
end: int: end position of region on the chromosome/strand (should always be greater than start)
coordinate_type: str: indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.

Returns:

ptms_in_region: pandas.DataFrame: dataframe containing all PTMs found in the region. If no PTMs are found, returns np.nan.

ptm_pose.project.project_ptms_onto_MATS(SE_events=None, A5SS_events=None, A3SS_events=None, RI_events=None, MXE_events=None, coordinate_type='hg38', identify_flanking_sequences=False, dPSI_col='meanDeltaPSI', sig_col='FDR', extra_cols=None, separate_modification_types=False, PROCESSES=1, ptm_coordinates=None, **kwargs)[source]#

Given splice quantification from the MATS algorithm, annotate with PTMs that are found in the differentially included regions.

Parameters:

ptm_coordinates: pandas.DataFrame: dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs
SE_events: pandas.DataFrame: dataframe containing skipped exon event information from MATS
A5SS_events: pandas.DataFrame: dataframe containing 5’ alternative splice site event information from MATS
A3SS_events: pandas.DataFrame: dataframe containing 3’ alternative splice site event information from MATS
RI_events: pandas.DataFrame: dataframe containing retained intron event information from MATS
MXE_events: pandas.DataFrame: dataframe containing mutually exclusive exon event information from MATS
coordinate_type: str: indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
dPSI_col: str: Column name indicating delta PSI value. Default is ‘meanDeltaPSI’.
sig_col: str: Column name indicating significance of the event. Default is ‘FDR’.
extra_cols: list: List of column names for additional information to add to the results. Default is None.
separate_modification_types: bool: Indicate whether residues with multiple modifications (i.e. phosphorylation and acetylation) should be treated as separate PTMs and be placed in unique rows of the output dataframe. Default is False.
PROCESSES: int: Number of processes to use for multiprocessing. Default is 1.
**kwargs: additional keyword arguments: Additional keyword arguments to pass to the find_ptms_in_many_regions function, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

ptm_pose.project.project_ptms_onto_SpliceSeq(psi_data, splicegraph, gene_col='symbol', dPSI_col=None, sig_col=None, extra_cols=None, coordinate_type='hg19', separate_modification_types=False, identify_flanking_sequences=False, flank_size=5, ptm_coordinates=None, PROCESSES=1, **kwargs)[source]#

Given splice event quantification from SpliceSeq (such as what can be downloaded from TCGASpliceSeq), annotate with PTMs that are found in the differentially included regions.

Parameters:

psi_data: pandas.DataFrame: dataframe containing splice event quantification from SpliceSeq. Must contain the following columns: ‘symbol’, ‘exons’, ‘splice_type’.
splicegraph: pandas.DataFrame: dataframe containing exon information from the splicegraph used during splice event quantification. Must contain the following columns: ‘Symbol’, ‘Exon’, ‘Chr_Start’, ‘Chr_Stop’.
gene_col: str: column name in psi_data that contains the gene name. Default is ‘symbol’.
dPSI_col: str: column name in psi_data that contains the delta PSI value for the splice event. Default is None, which will not include this information in the output.
sig_col: str: column name in psi_data that contains the significance value for the splice event. Default is None, which will not include this information in the output.
extra_cols: list: list of additional columns to include in the output dataframe. Default is None, which will not include any additional columns.
coordinate_type: str: indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg19’.
separate_modification_types: bool: Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is True.
identify_flanking_sequences: bool: Indicate whether to identify and return the flanking sequences for the splice events. Default is False.
flank_size: int: Size of the flanking sequence to extract from the splice event. Default is 5, which will extract 5 bases upstream and downstream of the splice event. Only relevant if identify_flanking_sequences is True.
PROCESSES: int: Number of processes to use for multiprocessing. Default is 1 (single processing).
**kwargs: additional keyword arguments: Additional keyword arguments to pass to the find_ptms_in_many_regions function, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

ptm_pose.project.project_ptms_onto_splice_events(splice_data, annotate_original_df=True, chromosome_col='chr', strand_col='strand', region_start_col='exonStart_0base', region_end_col='exonEnd', dPSI_col=None, sig_col=None, event_id_col=None, gene_col=None, extra_cols=None, separate_modification_types=False, coordinate_type='hg38', start_coordinate_system='1-based', end_coordinate_system='1-based', taskbar_label=None, ptm_coordinates=None, PROCESSES=1, **kwargs)[source]#

Given splice event quantification data, project PTMs onto the regions impacted by the splice events. Assumes that the splice event data will have chromosome, strand, and genomic start/end positions for the regions of interest, and each row of the splice_event_data corresponds to a unique region.

Important note: PTM-POSE relies on Ensembl based coordinates (1-based), so if the coordinates are 0-based, make sure to indicate using the start_coordinate_system and end_coordinate_system parameters. For example, rMATS uses 0-based for the start coordinates, but 1-based for the end coordinates. In this case, set start_coordinate_system = ‘0-based’ and end_coordinate_system = ‘1-based’.

Parameters:

splice_data: pandas.DataFrame: dataframe containing splice event information, including chromosome, strand, and genomic location of regions of interest
ptm_coordinates: pandas.DataFrame: dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs. If none, it will pull from the config file.
chromosome_col: str: column name in splice_data that contains chromosome information. Default is ‘chr’. Expects it to be a str with only the chromosome number: ‘Y’, ‘1’, ‘2’, etc.
strand_col: str: column name in splice_data that contains strand information. Default is ‘strand’. Expects it to be a str with ‘+’ or ‘-’, or integers as 1 or -1. Will convert to integers automatically if string format is provided.
region_start_col: str: column name in splice_data that contains the start position of the region of interest. Default is ‘exonStart_0base’.
region_end_col: str: column name in splice_data that contains the end position of the region of interest. Default is ‘exonEnd’.
event_id_col: str: column name in splice_data that contains the unique identifier for the splice event. If provided, will be used to annotate the ptm information with the specific splice event ID. Default is None.
gene_col: str: column name in splice_data that contains the gene name. If provided, will be used to make sure the projected PTMs stem from the same gene (some cases where genomic coordiantes overlap between distinct genes). Default is None.
dPSI_col: str: column name in splice_data that contains the delta PSI value for the splice event. Default is None, which will not include this information in the output
sig_col: str: column name in splice_data that contains the significance value for the splice event. Default is None, which will not include this information in the output.
extra_cols: list: list of additional columns to include in the output dataframe. Default is None, which will not include any additional columns.
coordinate_type: str: indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
start_coordinate_system: str: indicates the coordinate system used for the start position. Either ‘0-based’ or ‘1-based’. Default is ‘1-based’.
end_coordinate_system: str: indicates the coordinate system used for the end position. Either ‘0-based’ or ‘1-based’. Default is ‘1-based’.
separate_modification_types: bool: Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is True.
taskbar_label: str: Label to display in the tqdm progress bar. Default is None, which will automatically state “Projecting PTMs onto regions using —– coordinates”.
PROCESSES: int: Number of processes to use for multiprocessing. Default is 1 (single processing)
**kwargs: additional keyword arguments: Additional keyword arguments to pass to the find_ptms_in_many_regions function, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

Returns:

spliced_ptm_info: pandas.DataFrame: Contains the PTMs identified across the different splice events
splice_data: pandas.DataFrame: dataframe containing the original splice data with an additional column ‘PTMs’ that contains the PTMs found in the region of interest, in the format of ‘SiteNumber(ModificationType)’. If no PTMs are found, the value will be np.nan.

Flanking Sequences#

ptm_pose.flanking_sequences.get_flanking_changes(ptm_coordinates, chromosome, strand, first_flank_region, spliced_region, second_flank_region, gene=None, dPSI=None, sig=None, event_id=None, flank_size=5, coordinate_type='hg38', lowercase_mod=True, order_by='Coordinates')[source]#

Given flanking and spliced regions associated with a splice event, identify PTMs that have potential to have an altered flanking sequence depending on whether spliced region is included or excluded (if PTM is close to splice boundary). For these PTMs, extract the flanking sequences associated with the inclusion and exclusion cases and translate into amino acid sequences. If the PTM is not associated with a codon that codes for the expected amino acid, the PTM will be excluded from the results.

Important note: It is assumed that all region coordinates are based on a 1-based coordinate system, not 0-based, consistent with Ensembl. If using a 0-based system, please adjust the coordinates accordingly prior to running this function

Parameters:

ptm_coordinatespandas.DataFrame: DataFrame containing PTM coordinate information for identify PTMs in the flanking regions
chromosomestr: Chromosome associated with the splice event
strandint: Strand associated with the splice event (1 for forward, -1 for negative)
first_flank_regionlist: List containing the start and stop locations of the first flanking region (first is currently defined based on location the genome not coding sequence)
spliced_regionlist: List containing the start and stop locations of the spliced region
second_flank_regionlist: List containing the start and stop locations of the second flanking region (second is currently defined based on location the genome not coding sequence)
event_idstr, optional: Event ID associated with the splice event, by default None
flank_sizeint, optional: Number of amino acids to include flanking the PTM, by default 7
coordinate_typestr, optional: Coordinate system used for the regions, by default ‘hg38’. Other options is hg19.
lowercase_modbool, optional: Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
order_bystr, optional: Whether the first, spliced and second regions are defined by their genomic coordinates (first has smallest coordinate, spliced next, then second), or if they are defined by their translation (first the first when translated, etc.)

Returns:

pandas.DataFrame: DataFrame containing the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases

ptm_pose.flanking_sequences.get_flanking_changes_from_rMATS(ptm_coordinates=None, SE_events=None, A5SS_events=None, A3SS_events=None, RI_events=None, coordinate_type='hg38', dPSI_col='meanDeltaPSI', sig_col='FDR', extra_cols=None, **kwargs)[source]#

Given splice events identified rMATS extract quantified PTMs that are nearby the splice boundary (potential for flanking sequence to be altered). Coordinate information of individual exons should be found in splicegraph. You can also provide columns with specific psi or significance information. Extra cols not in these categories can be provided with extra_cols parameter.

Only use this function if you do not care about differentially included sites, otherwise you can use the project module set identify_flanking_sequences = True (project.project_ptms_onto_MATS(identify_flanking_sequences = True))

Parameters:

ptm_coordinates: pandas.DataFrame: dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs. If none, will use the PTM coordinates from the pose_config file.
SE_events: pandas.DataFrame: dataframe containing skipped exon event information from MATS
A5SS_events: pandas.DataFrame: dataframe containing 5’ alternative splice site event information from MATS
A3SS_events: pandas.DataFrame: dataframe containing 3’ alternative splice site event information from MATS
RI_events: pandas.DataFrame: dataframe containing retained intron event information from MATS
MXE_events: pandas.DataFrame: dataframe containing mutually exclusive exon event information from MATS
coordinate_type: str: indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
dPSI_col: str: Column name indicating delta PSI value. Default is ‘meanDeltaPSI’.
sig_col: str: Column name indicating significance of the event. Default is ‘FDR’.
extra_cols: list: List of column names for additional information to add to the results. Default is None.
**kwargs: additional keyword arguments: Additional keyword arguments, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

ptm_pose.flanking_sequences.get_flanking_changes_from_splice_data(splice_data, ptm_coordinates=None, chromosome_col=None, strand_col=None, first_flank_start_col=None, first_flank_end_col=None, spliced_region_start_col=None, spliced_region_end_col=None, second_flank_start_col=None, second_flank_end_col=None, dPSI_col=None, sig_col=None, event_id_col=None, gene_col=None, extra_cols=None, flank_size=5, coordinate_type='hg38', start_coordinate_system='1-based', end_coordinate_system='1-based', lowercase_mod=True, **kwargs)[source]#

Given a DataFrame containing information about splice events, extract the flanking sequences associated with the PTMs in the flanking regions if there is potential for this to be altered. The DataFrame should contain columns for the chromosome, strand, start and stop locations of the first flanking region, spliced region, and second flanking region. The DataFrame should also contain a column for the event ID associated with the splice event. If the DataFrame does not contain the necessary columns, the function will raise an error.

Parameters:

splice_datapandas.DataFrame: DataFrame containing information about splice events
ptm_coordinatespandas.DataFrame: DataFrame containing PTM coordinate information for identify PTMs in the flanking regions
chromosome_colstr, optional: Column name indicating chromosome, by default None
strand_colstr, optional: Column name indicating strand, by default None
first_flank_start_colstr, optional: Column name indicating start location of the first flanking region, by default None
first_flank_end_colstr, optional: Column name indicating end location of the first flanking region, by default None
spliced_region_start_colstr, optional: Column name indicating start location of the spliced region, by default None
spliced_region_end_colstr, optional: Column name indicating end location of the spliced region, by default None
second_flank_start_colstr, optional: Column name indicating start location of the second flanking region, by default None
second_flank_end_colstr, optional: Column name indicating end location of the second flanking region, by default None
event_id_colstr, optional: Column name indicating event ID, by default None
gene_colstr, optional: Column name indicating gene name, by default None
extra_colslist, optional: List of additional columns to include in the output DataFrame, by default None
flank_sizeint, optional: Number of amino acids to include flanking the PTM, by default 7
coordinate_typestr, optional: Coordinate system used for the regions, by default ‘hg38’. Other options is hg19.
lowercase_modbool, optional: Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
start_coordinate_systemstr, optional: Coordinate system used for the start locations of the regions, by default ‘1-based’. Other option is ‘0-based’.
end_coordinate_systemstr, optional: Coordinate system used for the end locations of the regions, by default ‘1-based’. Other option is ‘0-based’.
kwargskeyword arguments, optional: Additional keyword arguments to pass to the find_ptms_in_many_regions function, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

Returns:

list: List containing DataFrames with the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases

ptm_pose.flanking_sequences.get_flanking_changes_from_splicegraph(psi_data, splicegraph, ptm_coordinates=None, dPSI_col=None, sig_col=None, event_id_col=None, extra_cols=None, gene_col='symbol', flank_size=5, coordinate_type='hg19', **kwargs)[source]#

Given a DataFrame containing information about splice events obtained from SpliceSeq and the corresponding splicegraph, extract the flanking sequences of PTMs that are nearby the splice boundary (potential for flanking sequence to be altered). Coordinate information of individual exons should be found in splicegraph. You can also provide columns with specific psi or significance information. Extra cols not in these categories can be provided with extra_cols parameter.

Parameters:

psi_datapandas.DataFrame: DataFrame containing information about splice events obtained from SpliceSeq
splicegraphpandas.DataFrame: DataFrame containing information about individual exons and their coordinates
ptm_coordinatespandas.DataFrame: DataFrame containing PTM coordinate information for identify PTMs in the flanking regions
dPSI_colstr, optional: Column name indicating delta PSI value, by default None
sig_colstr, optional: Column name indicating significance of the event, by default None
event_id_colstr, optional: Column name indicating event ID, by default None
extra_colslist, optional: List of column names for additional information to add to the results, by default None
gene_colstr, optional: Column name indicating gene symbol of spliced gene, by default ‘symbol’
flank_sizeint, optional: Number of amino acids to include flanking the PTM, by default 5
coordinate_typestr, optional: Coordinate system used for the regions, by default ‘hg19’. Other options is hg38.
**kwargs: additional keyword arguments: Additional keyword arguments, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

Returns:

altered_flankspandas.DataFrame: DataFrame containing the PTMs associated with the flanking regions that are altered, and the flanking sequences that arise depending on whether the flanking sequence is included or not

Annotating PTMs#

ptm_pose.annotate.add_ELM_interactions(ptms, file=None, report_success=True)[source]#

Given a spliced ptms or altered flanks dataframe from the project module, add ELM interaction data to the dataframe

Parameters:

ptms: pandas.DataFrame: Contains the PTMs identified across the different splice events, either differentially included events, or altered flanking sequences
file: str: Path to the ELM data file. If not provided, the data will be downloaded directly from the ELM website
report_success: bool: If True, will print out the number of PTMs identified in the dataset that have ELM interaction information

Returns:

ptms: pandas.DataFrame: Contains the PTMs identified across the different splice events with additional columns for ELM interaction data

ptm_pose.annotate.add_ELM_matched_motifs(ptms, flank_size=7, file=None, report_success=True)[source]#

Given spliced ptms or altered flanks dataframes, compare the canonical flanking sequences of each PTM to motifs recorded in the ELM database. If a match is found, the ELM motif will be recorded in the ELM:Motif Matches column

Parameters:

ptms: pandas.DataFrame: Contains the PTMs identified across the different splice events, either differentially included events, or altered flanking sequences
flank_size: int: Number of residues to include on either side of the PTM for the motif search. Default is 7
file: str: Path to the ELM data file. If not provided, the data will be downloaded directly from the ELM website
report_success: bool: If True, will print out the number of PTMs identified in the dataset that have ELM motif data

ptm_pose.annotate.add_custom_annotation(ptms, annotation_data, source_name, annotation_type, annotation_col, accession_col='UniProtKB Accession', residue_col='Residue', position_col='PTM Position in Isoform')[source]#

Add custom annotation data to ptms or altered flanking sequence dataframes

Parameters:

annotation_data: pandas.DataFrame: Dataframe containing the annotation data to be added to the ptms dataframe. Must contain columns for UniProtKB Accession, Residue, PTM Position in Isoform, and the annotation data to be added
source_name: str: Name of the source of the annotation data, will be used to label the columns in the ptms dataframe
annotation_type: str: Type of annotation data being added, will be used to label the columns in the ptms dataframe
annotation_col: str: Column name in the annotation data that contains the annotation data to be added to the ptms dataframe
accession_col: str: Column name in the annotation data that contains the UniProtKB Accession information. Default is ‘UniProtKB Accession’
residue_col: str: Column name in the annotation data that contains the residue information
position_col: str: Column name in the annotation data that contains the PTM position information

Returns:

ptms: pandas.DataFrame: Contains the PTMs identified across the different splice events with an additional column for the custom annotation data

ptm_pose.annotate.add_omnipath_data(ptms, min_sources=1, min_references=1, convert_to_gene_name=True, replace_old_annotations=True, report_success=True)[source]#

Given spliced ptms or altered flanks dataframe, append enzyme-substrate interactions recorded in OmniPath database. These will be split between ‘Writer’ enzymes, or enzymes that add the modification (OmniPath:Writer Enzyme), and ‘Eraser’ enzymes, or enzymes that remove the modification (OmniPath:Eraser Enzyme). Note, we do not consider the ‘post translational modification’ or ‘cleavage’ entries for this purpose.

Parameters:

ptmspandas.DataFrame: Spliced PTMs or altered flanks dataframe.
min_sourcesint: Minimum number of sources (i.e. database) for enzyme-substrate interaction. Default is 1, or all entries.
min_referencesint: Minimum number of references (i.e. publications) for enzyme-substrate interaction. Default is 1, or all entries.
convert_to_gene_namebool: Whether to convert enzyme names from UniProt IDs to gene names using pose_config.uniprot_to_genename. Default is True.
report_successbool: Whether to report success message. Default is True.

ptm_pose.annotate.annotate_ptms(ptms, annot_type='All', phosphositeplus=True, ptmsigdb=True, ptmcode=True, ptmint=True, omnipath=True, regphos=True, depod=True, elm=False, interactions_to_combine='All', enzymes_to_combine='All', combine_similar=True, report_success=True, **kwargs)[source]#

Given spliced ptm data, add annotations from various databases. The annotations that can be added are the following:

PhosphoSitePlus: regulatory site data (file must be provided), kinase-substrate data (file must be provided), and disease association data (file must be provided) ELM: interaction data (can be downloaded automatically or provided as a file), motif matches (elm class data can be downloaded automatically or provided as a file) PTMInt: interaction data (will be downloaded automatically) PTMcode: intraprotein interactions (can be downloaded automatically or provided as a file), interprotein interactions (can be downloaded automatically or provided as a file) DEPOD: phosphatase-substrate data (will be downloaded automatically) RegPhos: kinase-substrate data (will be downloaded automatically)

Parameters:

ptms: pd.DataFrame: Spliced PTM data from project module
psp_regulatorybool
interactions_to_combine: list: List of databases to combine interaction data from. Default is [‘PTMcode’, ‘PhosphoSitePlus’, ‘RegPhos’, ‘PTMInt’]
kinases_to_combine: list: List of databases to combine kinase-substrate data from. Default is [‘PhosphoSitePlus’, ‘RegPhos’]
combine_similar: bool: Whether to combine annotations of similar information (kinase, interactions, etc) from multiple databases into another column labeled as ‘Combined’. Default is True

ptm_pose.annotate.append_from_gmt(ptms, database=None, annot_type=None, gmt_df=None, column_name=None, **kwargs)[source]#

Given a gmt annotation file format, add the annotations to the ptms dataframe

Parameters:

ptmspd.DataFrame: dataframe containing ptm information, which can be the spliced_ptms or altered_flanks dataframe generated during projection
databasestr: Name of the database for the annotation. Used to identify proper annotation if gmt_df not provided
annot_typestr: Type of annotation to append to the ptms dataframe
gmt_dfpd.DataFrame: If using custom gmt file, provide the dataframe loaded from the GMT file. This will override the database and annot_type parameters if provided.
column_namestr or None: Name of the column to use for the annotations in the ptms dataframe. If None, will use a default name based on the database and annot_type. Default is None.
**kwargsadditional keyword arguments: Passes additional keyword arguments to annotation specific functions. For example, you could pass min_sources for the construct_omnipath_gmt() function

ptm_pose.annotate.check_file(fname, expected_extension='.tsv')[source]#

Given a file name, check if the file exists and has the expected extension. If the file does not exist or has the wrong extension, raise an error.

Parameters:

fname: str: File name to check
expected_extension: str: Expected file extension. Default is ‘.tsv’

ptm_pose.annotate.check_gmt_file(gmt_file, database, annot_type, automatic_download=False, odir=None, **kwargs)[source]#

Given a gmt file path, check to make sure it exists. If it doesn’t, either raise error or download and save a gmt file in the provided directory.

Parameters:

gmt_filestr: file path to gmt file
databasestr: name of database associated with gmt file
annot_typestr: type of annotation to check for. This is used to provide more specific error messages
automatic download: bool: whether to automatically download data and process into gmt file if it does not exist and can be done. Default is false
odirstr or None: location to save annotations, if automatic download is true
kwargsadditional keyword arguments: Passes additional keyword arguments to annotation specific functions. For example, you could pass min_sources for the construct_omnipath_gmt() function

ptm_pose.annotate.combine_enzyme_data(ptms, enzyme_databases=['PhosphoSitePlus', 'RegPhos', 'OmniPath', 'DEPOD'], regphos_conversion={'ABL1(ABL)': 'ABL1', 'CDC2': 'CDK1', 'CK2A1': 'CSNK2A1', 'ERK1(MAPK3)': 'MAPK3', 'ERK2(MAPK1)': 'MAPK1', 'JNK2(MAPK9)': 'MAPK9', 'PKACA': 'PRKACA'})[source]#

Given spliced ptm information, combine enzyme-substrate data from multiple databases (currently support PhosphoSitePlus, RegPhos, OmniPath, DEPOD, and iKiP downloaded from PTMsigDB), assuming that the enzyme data from these resources has already been added to the spliced ptm data. The combined kinase data will be added as a new column labeled ‘Combined:Writer Enzyme’ and ‘Combined:Eraser Enzyme’

Parameters:

ptms: pd.DataFrame: Spliced PTM data from project module
enzyme_databases: list: List of databases to combine enzyme data from. Currently support PhosphoSitePlus, RegPhos, OmniPath, and DEPOD
regphos_conversion: dict: Allows conversion of RegPhos names to matching names in PhosphoSitePlus.

Returns:

ptms: pd.DataFrame: PTM data with combined kinase data added

ptm_pose.annotate.combine_interaction_data(ptms, interaction_databases=['PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD'], include_enzyme_interactions=True)[source]#

Given annotated spliced ptm data, extract interaction data from various databases and combine into a single dataframe. This will include the interacting protein, the type of interaction, and the source of the interaction data

Parameters:

ptms: pd.DataFrame: Dataframe containing PTM data and associated interaction annotations from various databases
interaction_databases: list: List of databases to extract interaction data from. Options include ‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’. These should already have annotation columns in the ptms dataframe, otherwise they will be ignored. For kinase-substrate interactions, if combined column is present, will use that instead of individual databases
include_enzyme_interactions: bool: If True, will include kinase-substrate and phosphatase interactions in the output dataframe

Returns:

interact_data: list: List of dataframes containing PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES), and the source of the interaction data

ptm_pose.annotate.construct_DEPOD_gmt_file(odir=None, overwrite=False, max_retries=5, delay=10)[source]#

Download and process DEPOD data to create a GMT file for PTM-POSE. DEPOD contains information on dephosphorylation sites and their corresponding substrates.

Parameters:

odirstr, optional: Output directory for the GMT file. If not provided, it will default to the ‘Resource_Files/Annotations/DEPOD’ directory within the PTM-POSE package directory.
overwritebool, optional: If True, will overwrite any existing GMT file in the output directory. If False and the GMT file already exists, the function will skip processing and print a message.
max_retriesint, optional: Number of times to try downloading data from DEPOD
delayint, optional: Delay in seconds between download attempts. Default is 10 seconds.

ptm_pose.annotate.construct_PTMInt_gmt_file(file=None, odir=None, overwrite=False, max_retries=5, delay=10)[source]#

Download and process PTMInt interaction data to create gmt files for PTM-POSE

Parameters:

filestr, optional: Path to the PTMInt data file. If not provided, the data will be downloaded directly from the PTMInt website. Default is None.
odirstr, optional: Output directory for the gmt file. If not provided, will default to the PTM-POSE resource directory for annotations. Default is None.
overwritebool, optional: If True, will overwrite any existing gmt files in the output directory. If False, will skip the creation of the gmt file if it already exists. Default is False.
max_retriesint, optional: Number of times to retry downloading the PTMInt data if the download fails. Default is 5.
delayint, optional: Amount of time to wait (in seconds) before retrying the download if it fails. Default is 10 seconds.

ptm_pose.annotate.construct_PTMcode_interprotein_gmt_file(file=None, odir=None, overwrite=False, max_retries=5, delay=10)[source]#

Given the PTMcode interprotein interaction data, convert to readily usable format with PTM-POSE in gmt file format

file: str: Path to the PTMcode interprotein interaction data file. If not provided, the data will be downloaded directly from the PTMcode website
odirstr: Output directory for the gmt file. If not provided, will default to the PTM-POSE resource directory for annotations
overwritebool, optional: If True, will overwrite any existing gmt files in the output directory. If False, will skip the creation of the gmt file if it already exists. Default is False.
max_retriesint, optional: Number of times to retry downloading the PTMcode data if the initial attempt fails. Default is 5.
delayint, optional: Number of seconds to wait between retries if the download fails. Default is 10 seconds.

ptm_pose.annotate.construct_PTMsigDB_gmt_files(file, odir=None, overwrite=False, process_PSP_data=True)[source]#

Given the PTMsigDB xlsx file, convert to readily usable format with PTM-POSE in gmt file format. This will also process the PhosphoSitePlus data in PTMsigDB if requested.

Parameters:

filestr: PTMsigDB excel file path. This file can be downloaded from the PTMsigDB website.
odirstr: Output directory for the gmt files. If None, will default to the PTM-POSE resource directory.
overwritebool, optional: If True, will overwrite any existing gmt files in the output directory. Default is False.
process_PSP_databool, optional: If True, will process the PhosphoSitePlus data included in the PTMsigDB file, but only if not already found in odir. Default is True.

ptm_pose.annotate.construct_PhosphoSitePlus_gmt_files(regulatory_site_file=None, kinase_substrate_file=None, disease_association_file=None, odir=None, overwrite=False)[source]#

Given three PhosphoSitePlus annotation files, convert to readily usable format with PTM-POSE in gmt file format

Parameters:

regulatory_site_file: str or None: Path to the PhosphoSitePlus regulatory site file (gzipped). If None, will skip creating function annotations.
kinase_substrate_file: str or None: Path to the PhosphoSitePlus kinase-substrate file (gzipped). If None, will skip creating kinase-substrate annotations.
disease_association_file: str or None: Path to the PhosphoSitePlus disease association file (gzipped). If None, will skip creating disease association annotations.
odirstr or None: Path to the output directory where the GMT files will be saved. If None, will save to the default resource directory for PhosphoSitePlus annotations.
overwritebool: If True, will overwrite existing GMT files if they already exist. If False, will skip creating the GMT files if they already exist. Default is False.

ptm_pose.annotate.construct_RegPhos_gmt_file(file=None, odir=None, overwrite=False)[source]#

filestr: RegPhos text file path. This file can be downloaded from the RegPhos website. If None, the function will raise an error.
odirstr: Output directory for the gmt files. If None, will default to the PTM-POSE resource directory.
overwritebool, optional: If True, will overwrite any existing gmt files in the output directory. Default is False.

ptm_pose.annotate.construct_annotation_dict_from_df(ptms, annot_col, key_type='annotation')[source]#

Given an annotated PTM dataframe, construct a dictionary mapping each item to its annotations, with either the annotation as key or PTM as the key

Parameters:

ptmspd.DataFrame: Dataframe containing PTM data with annotation data, could be either spliced_ptm or altered_flanks dataframe
annot_colstr: Column name in the dataframe that contains the annotation data to construct the dictionary from.
key_typestr: Whether the annotation or ptm should be the key of the output dictionary. Default is annotation

ptm_pose.annotate.construct_annotation_dict_from_gmt(gmt_df, key_type='annotation')[source]#: Given a gmt annotation file format, construct a dictionary mapping each item to its annotations, with either the annotation as key or PTM as the key

ptm_pose.annotate.construct_combined_enzyme_gmt_df(annot_type='Writer Enzyme', enzyme_databases=['PhosphoSitePlus', 'RegPhos', 'OmniPath', 'DEPOD'], **kwargs)[source]#

Combine enzyme-substrate interaction information and format into gmt format for downstream analysis. To avoid confusion, we will not include ability to save these files, as they are dependent on the specific databases chosen.

Parameters:

annot_typestr: Type of enzyme annotation to include. Must be either ‘Writer Enzyme’ or ‘Eraser Enzyme’. Default is ‘Writer Enzyme’.
enzyme_databaseslist of str: List of enzyme databases to include in the analysis. Default is [‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’].

ptm_pose.annotate.construct_combined_interactions_gmt_df(interaction_databases=['PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD'], include_enzyme_interactions=True, **kwargs)[source]#: Combine interaction information and format into gmt format for downstream analysis. To avoid confusion, we will not include ability to save these files, as they are dependent on the specific databases chosen.

ptm_pose.annotate.construct_custom_gmt_file(annotation_df, database, annot_type, annot_col, accession_col='UniProtKB Accession', residue_col='Residue', position_col='PTM Position in Isofrom', odir=None, **kwargs)[source]#

Function for constructing a gmt file for annotations not currently provided by PTM-POSE. Ideally, these annotations should be partially processed to have the same format as PTM-POSE annotations. For example, they should have columns for UniProtKB Accession, Residue, PTM Position in Isoform, and the annotation data to be added.

Parameters:

annotation_df: pandas.DataFrame: Dataframe containing the annotation data to be added to the ptms dataframe. Must contain columns for UniProtKB Accession, Residue, PTM Position in Isoform, and the annotation data to be added.
databasestr: Name of the database for the annotation. This will be used to create the output directory and file name.
annot_typestr: Type of annotation data being added. This will be used to create the output file name and description.
annot_colstr: Column name in the annotation data that contains the annotation data to be added to the ptms dataframe. This will be used as the annotation column in the output GMT file.
accession_colstr: Column name in the annotation data that contains the UniProtKB Accession information. Default is ‘UniProtKB Accession’.
residue_colstr: Column name in the annotation data that contains the residue information. Default is ‘Residue’.
position_colstr: Column name in the annotation data that contains the PTM position information. Default is ‘PTM Position in Isoform’.
odirstr or None: Path to the output directory where the GMT file will be saved. If None, will save to the default resource directory for annotations. Default is None.
kwargsadditional keyword arguments: additional keywords to pass to construct_gmt_df function. This can include parameters such as annotation_separator, description, and compressed.

ptm_pose.annotate.construct_gmt_df(df, annotation_col, description=nan, annotation_separator=None, odir=None, fname=None, compressed=True)[source]#

Given annotation data, construct a dataframe in the gmt file format. Save if odir and fname are provided

Parameters:

dfpd.DataFrame: Dataframe containing the annotation data to be converted to GMT format. Must contain columns for UniProtKB Accession, Residue, PTM Position in Isoform, and the annotation data to be added.
annotation_colstr: Column name in the dataframe that contains the annotation data to be added to the GMT file. This will be used as the annotation column in the output GMT file.
descriptionstr or np.nan: description to add to description column
annotation_separatorstr or None: what separator to use for splitting annotations in the annotation_col. If None, will not split annotations. Default is None.
odirstr or None: file path to output directory where the GMT file will be saved. If None, will not save. Default is None.
fnamestr or None:: name of output file. If None, will use the annotation_col as the file name. Default is None.
compressedbool: whether to save gmt file in gzip format. Default is True.

ptm_pose.annotate.construct_omnipath_gmt_file(min_sources=1, min_references=1, convert_to_gene_name=True, odir=None)[source]#: Download enzyme-substrate interactions from the OmniPath database. The data will be filtered based on the number of sources and references specified. The resulting data will be split into two categories: ‘Writer’ enzymes, which add the modification, and ‘Eraser’ enzymes, which remove the modification. The output will be saved as GMT files in resource files directory.

ptm_pose.annotate.convert_PSP_label_to_UniProt(label)[source]#

Given a label for an interacting protein from PhosphoSitePlus, convert to UniProtKB accession. Required as PhosphoSitePlus interactions are recorded in various ways that aren’t necessarily consistent with other databases (i.e. not always gene name)

Parameters:

label: str: Label for interacting protein from PhosphoSitePlus

ptm_pose.annotate.extract_ids_PTMcode(df, col='## Protein1')[source]#: Many different ID forms are used in PTMcode, but we rely on UniProt IDs. This function is used to convert between Ensembl Gene IDs to UniProt IDs

ptm_pose.annotate.extract_interaction_details(interaction, column='PhosphoSitePlus:Interactions')[source]#: Given an interaction string from a specific database, extract the type of interaction and the interacting protein. This is required as different databases format their interaction strings differently.

ptm_pose.annotate.extract_positions_from_DEPOD(x)[source]#

Given string object consisting of multiple modifications in the form of ‘Residue-Position’ separated by ‘, ‘, extract the residue and position. Ignore any excess details in the string.

Parameters:

xstr: dephosphosite entry from DEPOD data

Returns:

new_x :str: ptm residue and position in format that PTM-POSE recognizes

ptm_pose.annotate.get_available_gmt_annotations(format='dict')[source]#

Get the annotations available in resource files in GMT format. Can be outputted as either a dictionary or pandas DataFrame

Parameters:

format: str: Format to output the available annotations. Options are ‘dict’ or ‘dataframe’

ptm_pose.annotate.load_gmt_file(gmt_file)[source]#: Load a GMT file into a pandas DataFrame

ptm_pose.annotate.process_database_annotations(database='PhosphoSitePlus', annot_type='Function', key_type='annotation', collapsed=False, resource_dir=None, automatic_download=False, **kwargs)[source]#

Given a database and annotation type, find and process the annotations into a dictionary mapping each PTM to its annotations, or vice versa

Parameters:

database: str: source of annotation
annot_typestr: type of annotation to retrieve
key_typestr: whether the annotation or ptm should be the key of the output dictionary. Default is annotation
collapsedbool: whether to combine annotations for similar types into a single annotation. For example, ‘cell growth, induced’ and ‘cell growth, inhibited’ would be simplified to ‘cell growth’. Default is False.
resource_dirstr or None: location of annotations. By default, this will look for annotations in PTM-POSE resource directory
automatic_download: bool: Whether to automatically download annotations that are not yet present in resource files directory
kwargsadditional keyword arguments: Passes additional keyword arguments to annotation specific functions. For example, you could pass min_sources for the construct_omnipath_gmt() function

ptm_pose.annotate.simplify_annotation(annotation, sep=',')[source]#

Given an annotation, remove additional information such as whether or not a function is increasing or decreasing. For example, ‘cell growth, induced’ would be simplified to ‘cell growth’

Parameters:

annotation: str: Annotation to simplify
sep: str: Separator that splits the core annotation from additional detail. Default is ‘,’. Assumes the first element is the core annotation.

Returns:

annotation: str: Simplified annotation

ptm_pose.annotate.unify_interaction_data(ptms, interaction_col, name_dict={})[source]#

Given spliced ptm data and a column containing interaction data, extract the interacting protein, type of interaction, and convert to UniProtKB accession. This will be added as a new column labeled ‘Interacting ID’

Parameters:

ptms: pd.DataFrame: Dataframe containing PTM data
interaction_col: str: column containing interaction information from a specific database
name_dict: dict: dictionary to convert names within given database to UniProt IDs. For cases when name is not necessarily one of the gene names listed in UniProt

Returns:

interact: pd.DataFrame: Contains PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES)

Analyze Modules#

Summaries#

ptm_pose.analyze.summarize.combine_outputs(spliced_ptms, altered_flanks, report_removed_annotations=True, include_stop_codon_introduction=False, remove_conflicting=True, **kwargs)[source]#

Given the spliced_ptms (differentially included) and altered_flanks (altered flanking sequences) dataframes obtained from project and flanking_sequences modules, combine the two into a single dataframe that categorizes each PTM by the impact on the PTM site

Parameters:

spliced_ptms: pd.DataFrame: Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
altered_flanks: pd.DataFrame: Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases
include_stop_codon_introduction: bool: Whether to include PTMs that introduce stop codons in the altered flanks. Default is False.
remove_conflicting: bool: Whether to remove PTMs that are both included and excluded across different splicing events. Default is True.
kwargs: dict: Additional keyword arguments to pass to the function, will be passed to helpers.filter_ptms if filtering is desired. Will automatically filter out insignificant events if not provided

ptm_pose.analyze.summarize.get_modification_class_data(ptms, mod_class)[source]#

Given ptm dataframe and a specific modification class, return a dataframe with only the PTMs of that class

Parameters:

ptmspd.DataFrame: Dataframe with ptm information, such as the spliced_ptms or altered_flanks dataframe obtained during projection
mod_classstr: The modification class to filter by, e.g. ‘Phosphorylation’, ‘Acetylation’, etc.

ptm_pose.analyze.summarize.get_modification_counts(ptms, **kwargs)[source]#

Given PTM data (either spliced ptms, altered flanks, or combined data), return the counts of each modification class

Parameters:

ptms: pd.DataFrame: Dataframe with PTMs projected onto splicing events or with altered flanking sequences

Returns:

modification_counts: pd.Series: Series with the counts of each modification class

ptm_pose.analyze.summarize.plot_modification_breakdown(spliced_ptms=None, altered_flanks=None, colors=[(0.00392156862745098, 0.45098039215686275, 0.6980392156862745), (0.8705882352941177, 0.5607843137254902, 0.0196078431372549), (0.00784313725490196, 0.6196078431372549, 0.45098039215686275), (0.8352941176470589, 0.3686274509803922, 0.0), (0.8, 0.47058823529411764, 0.7372549019607844), (0.792156862745098, 0.5686274509803921, 0.3803921568627451), (0.984313725490196, 0.6862745098039216, 0.8941176470588236), (0.5803921568627451, 0.5803921568627451, 0.5803921568627451), (0.9254901960784314, 0.8823529411764706, 0.2), (0.33725490196078434, 0.7058823529411765, 0.9137254901960784)], ax=None, **kwargs)[source]#

Plot the number of PTMs that are differentially included or have altered flanking sequences, separated by PTM type

Parameters:

spliced_ptms: pd.DataFrame: Dataframe with PTMs that are differentially included
altered_flanks: pd.DataFrame: Dataframe with PTMs that have altered flanking sequences
colors: list: List of colors to use for the bar plot (first two will be used). Default is seaborn colorblind palette.
ax: matplotlib.Axes: Axis to plot on. If None, will create new figure. Default is None.
kwargs: dict: Additional keyword arguments to pass to the function, will be passed to helpers.filter_ptms if filtering is desired. Will automatically filter out insignificant events by min_dpsi and significance if the columns are present

Filtering PTMs and Events#

ptm_pose.analyze.filter.assess_filter_range(ptms, min_value=0, max_value=None, step=None, filter_type='min_studies', phospho_only_evidence_filter=False, ax=None, fontsize=11)[source]#

Given a dataframe of PTMs and a PTM evidence filter type, assess how adjusting the filter value impacts the number of PTMs and the fraction of those PTMs that are phosphorylated. This is done by plotting the number of PTMs and the fraction of phosphorylated PTMs as a function of the filter value.

Parameters:

ptmspd.DataFrame: Dataframe containing PTM data, either the spliced_ptms or altered_flank data
min_valueint, optional: The minimum value for the filter. The default is 0.
max_valueint, optional: The maximum value for the filter. If None, the maximum value will be determined based on the filter type. The default is None.
stepint, optional: The step size for the filter. If None, the step size will be determined based on the maximum value. The default is None.
filter_typestr, optional: The type of filter to apply. Must be one of [‘min_studies’, ‘min_compendia’, ‘min_MS’, ‘min_LTP’]. The default is ‘min_studies’.
phospho_only_evidence_filterbool, optional: Whether to apply the phospho only evidence filter (only filter phosphorylatio sites). The default is False.
axmatplotlib.axes.Axes, optional: The axes to plot on. If None, a new figure and axes will be created. The default is None.
fontsizeint, optional: The font size for the plot. The default is 11.

ptm_pose.analyze.filter.plot_filter_impact(ptms, output_type='count', topn=10, ax=None, **kwargs)[source]#

Given a dataframe of PTMs and a set of filter arguments to be passed to helpers.filter_ptms, this function will plot the number or fraction of PTMs that are retained after filtering for each modification type

Parameters:

ptmspd.DataFrame: Dataframe containing PTM data with a column ‘Modification Class’ that contains the type of modification (e.g. phosphorylation, acetylation, etc.)
output_typestr, optional: Type of output to plot, either ‘count’ or ‘fraction’. The default is ‘count’.
topnint, optional: The number of top modification classes to plot. The default is 10.
axmatplotlib.axes.Axes, optional: The axes to plot on. If None, a new figure and axes will be created. The default is None.
**kwargskeyword arguments: Additional keyword arguments to be passed to the filter_ptms function (e.g. min_studies, min_compendia, etc.). These will be extracted and checked for validity.

Annotations#

ptm_pose.analyze.annotations.annotation_enrichment(ptms, database='PhosphoSitePlus', annot_type='Function', background_type='all', collapse_on_similar=False, mod_class=None, alpha=0.05, min_dpsi=0.1, **kwargs)[source]#

Given spliced ptm information (differential inclusion, altered flanking sequences, or both), calculate the enrichment of specific annotations in the dataset using a hypergeometric test. Background data can be provided/constructed in a few ways:

Use annotations from the entire phosphoproteome (background_type = ‘all’)
Use the alpha and min_dpsi parameter to construct a foreground that only includes significantly spliced PTMs, and use the entire provided spliced_ptms dataframe as the background. This will allow you to compare the enrichment of specific annotations in the significantly spliced PTMs compared to the entire dataset. Will do this automatically if alpha or min_dpsi is provided.

Parameters:

ptms: pd.DataFrame: Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
database: str: database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, ‘RegPhos’, ‘PTMsigDB’. Default is ‘PhosphoSitePlus’.
annot_type: str: Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
background_type: str: how to construct the background data. Options include ‘pregenerated’ (default) and ‘significance’. If ‘significance’ is selected, the alpha and min_dpsi parameters must be provided. Otherwise, will use whole proteome in the ptm_coordinates dataframe as the background.
collapse_on_similar: bool: Whether to collapse similar annotations (for example, increasing and decreasing functions) into a single category. Default is False.
mod_class: str: modification class to subset, if any
alpha: float: significance threshold to use to subset foreground PTMs. Default is None.
min_dpsi: float: minimum delta PSI value to use to subset foreground PTMs. Default is None.
kwargs: additional keyword arguments: Additional keyword arguments to pass to the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

ptm_pose.analyze.annotations.draw_pie(dist, xpos, ypos, size, colors, edgecolor=None, type='donut', ax=None)[source]#

Draws pies individually, as if points on a scatter plot. This function was taken from this stack overflow post: https://stackoverflow.com/questions/56337732/how-to-plot-scatter-pie-chart-using-matplotlib

Parameters:

dist: list: list of values to be represented as pie slices for a single point
xpos: float: x position of pie chart in the scatter plot
ypos: float: y position of pie chart in the scatter plot
size: float: size of pie chart
colors: list: list of colors to use for pie slices
ax: matplotlib.Axes: axis to plot on, if None, will create new figure

ptm_pose.analyze.annotations.gene_set_enrichment(spliced_ptms=None, altered_flanks=None, sig_col='Significance', dpsi_col='dPSI', alpha=0.05, min_dpsi=None, gene_sets=['GO_Biological_Process_2023', 'Reactome_2022'], background=None, return_sig_only=True, max_retries=5, delay=10, **kwargs)[source]#

Given spliced_ptms and/or altered_flanks dataframes (or the dataframes combined from combine_outputs()), perform gene set enrichment analysis using the enrichr API

Parameters:

spliced_ptms: pd.DataFrame: Dataframe with differentially included PTMs projected onto splicing events and with annotations appended from various databases. Default is None (will not be considered in analysis). If combined dataframe is provided, this dataframe will be ignored.
altered_flanks: pd.DataFrame: Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases. Default is None (will not be considered). If combined dataframe is provided, this dataframe will be ignored.
combined: pd.DataFrame: Combined dataframe with spliced_ptms and altered_flanks dataframes. Default is None. If provided, spliced_ptms and altered_flanks dataframes will be ignored.
gene_sets: list: List of gene sets to use in enrichment analysis. Default is [‘KEGG_2021_Human’, ‘GO_Biological_Process_2023’, ‘GO_Cellular_Component_2023’, ‘GO_Molecular_Function_2023’,’Reactome_2022’]. Look at gseapy and enrichr documentation for other available gene sets
background: list: List of genes to use as background in enrichment analysis. Default is None (all genes in the gene set database will be used).
return_sig_only: bool: Whether to return only significantly enriched gene sets. Default is True.
max_retries: int: Number of times to retry downloading gene set enrichment data from enrichr API. Default is 5.
delay: int: Number of seconds to wait between retries. Default is 10.
**kwargs: additional keyword arguments: Additional keyword arguments to pass to the combine_outputs() function from the summarize module. These will be used to filter the spliced_ptms and altered_flanks dataframes before performing gene set enrichment analysis. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the combine_outputs() function for more options.

Returns:

results: pd.DataFrame: Dataframe with gene set enrichment results from enrichr API

ptm_pose.analyze.annotations.get_available_annotations(ptms)[source]#

Given a PTM dataframe, indicate the annotations that are available for analysis and indicate whether they have already been appended to the PTM dataset

Parameters:

ptmspd.DataFrame: contains PTM information and may have annotations already appended, such as spliced_ptms and altered_flanks dataframes generated during projection

Returns:

available_annotspd.DataFrame: DataFrame indicating the available annotation types and their sources, as well as whether they have been appended to the PTM data.

ptm_pose.analyze.annotations.get_background_annotation_counts(database='PhosphoSitePlus', annot_type='Function', **kwargs)[source]#

Given a database and annotation type, retrieve the counts of PTMs associated with the requested annotation type across all PTMs in the ptm_coordinates dataframe used for projection

Parameters:

databasestr: Source of annotation. Default is PhosphoSitePlus
annot_typestr: Type of annotation that can be found in indicated database. Default is ‘Function’. Other options include ‘Process’, ‘Disease’, ‘Enzyme’, ‘Interactions’, etc.
kwargs: additional keyword arguments: Additional keyword arguments to pass to the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

ptm_pose.analyze.annotations.get_enrichment_inputs(ptms, annot_type='Function', database='PhosphoSitePlus', background_type='all', collapse_on_similar=False, mod_class=None, alpha=0.05, min_dpsi=0.1, **kwargs)[source]#

Given the spliced ptms, altered_flanks, or combined PTMs dataframe, identify the number of PTMs corresponding to specific annotations in the foreground (PTMs impacted by splicing) and the background (all PTMs in the proteome or all PTMs in dataset not impacted by splicing). This information can be used to calculate the enrichment of specific annotations among PTMs impacted by splicing. Several options are provided for constructing the background data: all (based on entire proteome in the ptm_coordinates dataframe) or significance (foreground PTMs are extracted from provided spliced PTMs based on significance and minimum delta PSI)

Parameters:

ptms: pd.DataFrame: Dataframe with PTMs projected onto splicing events and with annotations appended from various databases. This can be either the spliced_ptms, altered_flanks, or combined dataframe.
annot_typestr: type of annotation to pull the annotations from. Default is ‘Function’.
databasestr: source of annotations. Default is ‘PhosphoSitePlus’.
background_typestr: Type of background to construct. Options are either ‘all’ (all PTMs in proteome) or ‘significance’ (only PTMs in dataset). Note that significance option assumes that PTMs have not already been filtered for significance.
collapse_on_similarbool: Whether to collapse similar annotations (for example, “cell growth, increased” and “cell growth, decreased”) into a single category. Default is False.
mod_classstr: Type of modification to perform enrichment for
min_dpsi: float: Minimum change in PSI required to return a PTM as associated with the annotation. Default is 0.1. This can be used to filter out PTMs that are not significantly spliced.
alphafloat: Significance threshold to use to filter PTMs based on their significance. Default is 0.05. This can be used to filter out PTMs that are not significantly spliced.
kwargs: additional keyword arguments: Additional keyword arguments to pass to the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

ptm_pose.analyze.annotations.get_ptm_annotations(ptms, annot_type='Function', database='PhosphoSitePlus', collapse_on_similar=False, min_dpsi=0.1, alpha=0.05, **kwargs)[source]#

Given spliced ptm information obtained from project and annotate modules, grab PTMs in spliced ptms associated with specific PTM modules

Parameters:

spliced_ptms: pd.DataFrame: PTMs projected onto splicing events and with annotations appended from various databases
annot_type: str: Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
database: str: database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, or ‘PTMInt’. ELM and PTMInt data will automatically be downloaded, but due to download restrictions, PhosphoSitePlus data must be manually downloaded and annotated in the spliced_ptms data using functions from the annotate module. Default is ‘PhosphoSitePlus’.
collapse_on_similarbool: Whether to collapse similar annotations (for example, “cell growth, increased” and “cell growth, decreased”) into a single category. Default is False.
min_dpsi: float: Minimum change in PSI required to return a PTM as associated with the annotation. Default is 0.1. This can be used to filter out PTMs that are not significantly spliced.
alphafloat: Significance threshold to use to filter PTMs based on their significance. Default is 0.05. This can be used to filter out PTMs that are not significantly spliced.
kwargs: additional keyword arguments: Additional keyword arguments to pass to the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

Returns:

annotationspd.DataFrame: Individual PTM information of PTMs that have been associated with the requested annotation type.
annotation_countspd.Series or pd.DataFrame: Number of PTMs associated with each annotation of the requested annotation type. If dPSI col is provided or impact col is present, will output annotation counts for each type of impact (‘Included’, ‘Excluded’, ‘Altered Flank’) separately.

ptm_pose.analyze.annotations.plot_EnrichR_pies(enrichr_results, top_terms=None, terms_to_plot=None, colors=None, edgecolor=None, row_height=0.3, type='circle', ax=None)[source]#

Given PTM-specific EnrichR results, plot EnrichR score for the provided terms, with each self point represented as a pie chart indicating the fraction of genes in the group with PTMs

Parameters:

ptm_results: pd.selfFrame: selfFrame containing PTM-specific results from EnrichR analysis
num_to_plot: int: number of terms to plot, if None, will plot all terms. Ignored if specific terms are provided in terms to plot list
terms_to_plot: list: list of terms to plot
colors: list: list of colors to use for pie slices. Default is None, which will use seaborn colorblind palette
edgecolor: str: color to use for edge of pie slices. Default is None, which will use the same color as the pie slice
row_height: float: height of each row in the plot. Default is 0.3.
type: str: type of pie chart to plot. Default is ‘circle’. Options include ‘circle’ and ‘donut’ (hole in center).
ax: matplotlib.Axes: axis to plot on, if None, will create new figure

ptm_pose.analyze.annotations.plot_annotation_counts(spliced_ptms=None, altered_flanks=None, database='PhosphoSitePlus', annot_type='Function', collapse_on_similar=True, colors=None, fontsize=10, top_terms=5, legend=False, legend_loc=(1.05, 0.5), title=None, title_type='database', ax=None, **kwargs)[source]#

Given a dataframe with PTM annotations added, plot the top annotations associated with the PTMs

Parameters:

spliced_ptms: pd.DataFrame: Dataframe with differentially included PTMs
altered_flanks: pd.DataFrame: Dataframe with PTMs associated with altered flanking sequences
database: str: Database to use for annotations. Default is ‘PhosphoSitePlus’.
annot_type: str: Type of annotation to plot. Default is ‘Function’.
collapse_on_similar: bool: Whether to collapse similar annotations into a single category. Default is True.
colors: list: List of colors to use for the bar plot. Default is None.
top_terms: int: Number of top terms to plot. Default is 5.
legend: bool: Whether to show the legend. Default is True.
legend_loc: tuple: Location of the legend. Default is None, which will place the legend in the upper right corner.
ax: matplotlib.Axes: Axis to plot on. If None, will create new figure. Default is None.
title_type: str: Type of title to use for the plot. Default is ‘database’. Options include ‘database’ and ‘detailed’.
title: str: Title to use for the plot. Default is None.
fontsize: int: Font size for the plot. Default is 10.
legend_loc: tuple: Location of the legend. Default is None, which will place the legend to the right of the plot.
**kwargs: additional keyword arguments: Additional keyword arguments, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

ptm_pose.analyze.annotations.plot_available_annotations(ptms, only_annotations_in_data=False, show_all_ptm_count=False, ax=None, **kwargs)[source]#

Given a dataframe with ptm annotations added, show the number of PTMs associated with each annotation type

Parameters:

ptms: pd.DataFrame: Dataframe with PTMs and annotations added
only_annotations_in_databool: Only plot annotations that are already appended to the dataset
show_all_ptm_count: bool: Whether to show the total number of PTMs in the dataset. Default is True.
ax: matplotlib.Axes: Axis to plot on. If None, will create new figure. Default is None.

Protein Interactions#

ptm_pose.analyze.interactions.get_edge_colors(interaction_graph, network_data, defaultedgecolor='gray', color_edges_by='Database', database_color_dict={'Multiple': 'purple', 'PSP/RegPhos': 'red', 'PTMInt': 'gold', 'PTMcode': 'blue', 'PhosphoSitePlus': 'green'})[source]#

Get the edge colors to use for a provided networkx graph, either plotting all edges the same color or coloring them based on the database they are from.

Parameters:

interaction_graph: nx.Graph: networkx graph containing the interaction network
network_data: pd.DataFrame: specific network edge data that contains information on which database the interaction is from and any other relevant information (such as regulation change)
defaultedgecolor‘str’: Default color to use for edges if no specific database color is found. Default is ‘gray’.
color_edges_bystr: How to color the edges. Options are ‘Database’ to color by the database they are from, or ‘Same’ to color all edges the same color. Default is ‘Database’.
database_color_dictdict: Colors to use for specific databases

ptm_pose.analyze.interactions.get_interaction_stats(interaction_graph)[source]#: Given the networkx interaction graph, calculate various network centrality measures to identify the most relevant PTMs or genes in the network

ptm_pose.analyze.interactions.plot_interaction_network(interaction_graph, network_data, network_stats=None, modified_color='red', modified_node_size=10, interacting_color='lightblue', interacting_node_size=1, defaultedgecolor='gray', color_edges_by='Same', seed=200, legend_fontsize=8, ax=None, proteins_to_label=None, labelcolor='black', legend=True)[source]#

Given the interaction graph and network data outputted from analyze.protein_interactions, plot the interaction network, signifying which proteins or ptms are altered by splicing and the specific regulation change that occurs. by default, will only label proteins

Parameters:

interaction_graph: nx.Graph: NetworkX graph object representing the interaction network, created from analyze.get_interaction_network
network_data: pd.DataFrame: Dataframe containing details about specifici protein interactions (including which protein contains the spliced PTMs)
network_stats: pd.DataFrame: Dataframe containing network statistics for each protein in the interaction network, obtained from analyze.get_interaction_stats(). Default is None, which will not label any proteins in the network.
modified_color: str: Color to use for proteins that are spliced. Default is ‘red’.
modified_node_size: int: Size of nodes that are spliced. Default is 10.
interacting_color: str: Color to use for proteins that are not spliced. Default is ‘lightblue’.
interacting_node_size: int: Size of nodes that are not spliced. Default is 1.
edgecolor: str: Color to use for edges in the network. Default is ‘gray’.
seed: int: Seed to use for spring layout of network. Default is 200.
ax: matplotlib.Axes: Axis to plot on. If None, will create new figure. Default is None.
proteins_to_label: list, int, or str: Specific proteins to label in the network. If list, will label all proteins in the list. If int, will label the top N proteins by degree centrality. If str, will label the specific protein. Default is None, which will not label any proteins in the network.
labelcolor: str: Color to use for labels. Default is ‘black’.

ptm_pose.analyze.interactions.plot_network_centrality(network_stats, network_data=None, centrality_measure='Degree', top_N=10, modified_color='red', interacting_color='black', ax=None)[source]#

Given the network statistics data obtained from analyze.get_interaction_stats(), plot the top N proteins in the protein interaction network based on centrality measure (Degree, Betweenness, or Closeness)

Parameters:

network_stats: pd.DataFrame: Dataframe containing network statistics for each protein in the interaction network, obtained from analyze.get_interaction_stats()
network_data: pd.DataFrame: Dataframe containing information on which proteins are spliced and how they are altered. Default is None, which will plot all proteins the same color (interacting_color)
centrality_measure: str: Centrality measure to use for plotting. Default is ‘Degree’. Options include ‘Degree’, ‘Degree Centrality’, ‘Betweenness Centrality’, ‘Closeness Centrality’.
top_N: int: Number of top proteins to plot. Default is 10.
modified_color: str: Color to use for proteins that are spliced. Default is ‘red’.
interacting_color: str: Color to use for proteins that are not spliced. Default is ‘black’.
ax: matplotlib.Axes: Axis to plot on. If None, will create new figure. Default is None.

class ptm_pose.analyze.interactions.protein_interactions(spliced_ptms, include_enzyme_interactions=True, interaction_databases=['PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD', 'OmniPath'], **kwargs)[source]#

Class to assess interactions facilitated by PTMs in splicing network

Parameters:

spliced_ptms: pd.DataFrame: Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
include_enzyme_interactions: bool: Whether to include interactions with enzymes in the network, such as kinase-substrate interactions. Default is True
interaction_databases: list: List of databases whose information to include in the network. Default is [‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’, ‘ELM’]
**kwargs: additional keyword arguments: Additional keyword arguments, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.

Attributes:

interaction_graph: nx.Graph: NetworkX graph object representing the interaction network, created from analyze.get_interaction_network
network_data: pd.DataFrame: Dataframe containing details about specifici protein interactions (including which protein contains the spliced PTMs)
network_stats: pd.DataFrame: Dataframe containing network statistics for each protein in the interaction network, obtained from analyze.get_interaction_stats(). Default is None, which will not label any proteins in the network.

Methods

`compare_to_nease`(nease_edges)	Given the network edges generated by NEASE, compare the network edges generated by NEASE to the network edges generated by the PTM-driven interactions
`get_interaction_network`([node_type])	Given the spliced PTM data, extract interaction information and construct a dataframe containing all possible interactions driven by PTMs, either centered around specific PTMs or specific genes.
`get_interaction_stats`()	Given the networkx interaction graph, calculate various network centrality measures to identify the most relevant PTMs or genes in the network
`get_protein_interaction_network`(protein)	Given a specific protein, return the network data for that protein
`plot_interaction_network`([modified_color, ...])	Given the interactiong graph and network data outputted from analyze.get_interaction_network, plot the interaction network, signifying which proteins or ptms are altered by splicing and the specific regulation change that occurs.
`plot_nease_comparison`([nease_edges, ax])	Given the comparison data generated by compare_to_nease, plot the number of edges identified by NEASE and PTM-POSE
`plot_network_centrality`([...])	Plot the centrality measure for the top N proteins in the interaction network based on a specified centrality measure.
`summarize_protein_network`(protein)	Given a protein of interest, summarize the network data for that protein

compare_to_nease(nease_edges)[source]#

Given the network edges generated by NEASE, compare the network edges generated by NEASE to the network edges generated by the PTM-driven interactions

Parameters:

nease_edgespd.DataFrame: Interactions found by NEASE, which is the output of nease.get_edges() function after running NEASE (see nease_runner module for example)

Returns:

nease_comparisonpd.DataFrame: Dataframe containing the all edges found by NEASE and PTM-POSE. This will include edges that are unique to NEASE, unique to PTM-POSE, and common between the two.

get_interaction_network(node_type='Gene')[source]#

Given the spliced PTM data, extract interaction information and construct a dataframe containing all possible interactions driven by PTMs, either centered around specific PTMs or specific genes.

Parameters:

node_type: str: What to define interactions by. Can either be by ‘PTM’, which will consider each PTM as a separate node, or by ‘Gene’, which will aggregate information across all PTMs of a single gene into a single node. Default is ‘Gene’

get_interaction_stats()[source]#: Given the networkx interaction graph, calculate various network centrality measures to identify the most relevant PTMs or genes in the network

get_protein_interaction_network(protein)[source]#

Given a specific protein, return the network data for that protein

Parameters:

protein: str: Gene name of the protein of interest

Returns:

protein_network: pd.DataFrame: Dataframe containing network data for the protein of interest

plot_interaction_network(modified_color='red', modified_node_size=10, interacting_color='lightblue', interacting_node_size=5, defaultedgecolor='gray', color_edges_by='Same', seed=200, ax=None, proteins_to_label=None, labelcolor='black', legend=True)[source]#

Given the interactiong graph and network data outputted from analyze.get_interaction_network, plot the interaction network, signifying which proteins or ptms are altered by splicing and the specific regulation change that occurs. by default, will only label proteins

Parameters:

modified_color: str: Color to use for nodes that are modified by splicing. Default is ‘red’
modified_node_size: int: Size of nodes that are modified by splicing. Default is 10
interacting_color: str: Color to use for nodes that interact with modified nodes. Default is ‘lightblue’
interacting_node_size: int: Size of nodes that interact with modified nodes. Default is 5
defaultedgecolor: str: Color to use for edges in the network. Default is ‘gray’. Can choose to color by database by providing a dictionary with database names as keys and colors as values or by specifying ‘database’ as color
color_edges_by: str: How to color edges in the network. Default is ‘Same’, which will color all edges the same color. Can also specify ‘Database’ to color edges based on the database they are from. If using ‘Database’, please provide a dictionary with database names as keys and colors as values in defaultedgecolor parameter.
seed: int: Seed to use for random number generator. Default is 200
ax: matplotlib.pyplot.Axes: Axes object to plot the network. Default is None

plot_nease_comparison(nease_edges=None, ax=None)[source]#

Given the comparison data generated by compare_to_nease, plot the number of edges identified by NEASE and PTM-POSE

Parameters:

axmatplotlib.pyplot.Axes: axes to plot on
nease_edgespd.DataFrame, optional: Interactions found by NEASE, which is the output of nease.get_edges() function after running NEASE (see nease_runner module for example). Only needed if you have not run comparison_to_nease() previously

plot_network_centrality(centrality_measure='Degree', top_N=10, modified_color='red', interacting_color='black', ax=None)[source]#

Plot the centrality measure for the top N proteins in the interaction network based on a specified centrality measure. This will help identify the most relevant PTMs/genes in the network.

Parameters:

centrality_measure: str: How to calculate centrality. Options include ‘Degree’, ‘Degree Centrality’, ‘Betweenness Centrality’, ‘Closeness Centrality’, and ‘Eigenvector Centrality’. Default is ‘Degree’.
top_N: int: Number of top proteins to plot based on the centrality measure. Default is 10.
modified_color: str: Color to use for proteins that are spliced. Default is ‘red’.
interacting_color: str: Color to use for proteins that are not spliced. Default is ‘black’.
ax: matplotlib.pyplot.Axes: Axes object to plot the centrality bar plot. If None, a new figure will be created. Default is None.

summarize_protein_network(protein)[source]#

Given a protein of interest, summarize the network data for that protein

Parameters:

protein: str: Gene name of the protein of interest

Enzyme Regulation#

ptm_pose.analyze.enzyme.compare_KL_for_sequence(inclusion_seq, exclusion_seq, dpsi=None, comparison_type='percentile')[source]#

Given two sequences, compare the kinase library scores, percentiles, or ranks for each sequence. Optionally, provide a dPSI value to calculate the relative change in preference for each kinase.

Parameters:

inclusion_seqstr: sequence to score for inclusion preference, with modification lowercased
exclusion_seqstr: sequence to score for exclusion preference, with modification lowercased
dpsifloat: dPSI value for the PTM event, which will be used to calculate the relative change in preference for each kinase (score difference * dPSI). Default is None.
comparison_typestr: type of comparison to perform. Can be ‘percentile’, ‘score’, or ‘rank’. Default is ‘percentile’.

ptm_pose.analyze.enzyme.get_all_KL_scores(seq_data, seq_col, kin_type=['ser_thr', 'tyrosine'], score_type='percentiles')[source]#

Given a dataset with flanking sequences, score each flanking sequence

Parameters:

seq_datapandas dataframe: processed dataframe containing flanking sequences to score
seq_colstr: column in seq_data containing the flanking sequences
kin_typelist: list of kinase types to score. Can be ‘ser_thr’ or ‘ST’ for serine/threonine kinases, or ‘tyrosine’ or ‘Y’ for tyrosine kinases. Default is [‘ser_thr’, ‘tyrosine’].
score_typestr: type of score to calculate. Can be ‘percentile’, ‘score’, or ‘rank’. Default is ‘percentile’.

Returns:

merged_datadict: dictionary containing the merged dataframes for each kinase type with KinaseLibrary scores

Flanking Sequences#

ptm_pose.analyze.flank_analysis.compare_flanking_sequences(altered_flanks, flank_size=5)[source]#

Given the altered_flanks dataframe, compare the flanking sequences in the inclusion and exclusion isoforms. This includes identifying sequence identity, altered positions, residue changes, and the side of the flanking sequence that is altered (N-term, C-term, or both).

Parameters:

altered_flanks: pandas.DataFrame: DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data(). It should contain the columns ‘Inclusion Flanking Sequence’ and ‘Exclusion Flanking Sequence’ for comparison.
flank_sizeint, optional: Size of the flanking region to analyze. Default is 5. This should be less than half the length of the shortest sequence to ensure proper comparison.

ptm_pose.analyze.flank_analysis.compare_inclusion_motifs(flanking_sequences, elm_classes=None)[source]#

Given a DataFrame containing flanking sequences with changes and a DataFrame containing ELM class information, identify motifs that are found in the inclusion and exclusion events, identifying motifs unique to each case. This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).

Parameters:

flanking_sequences: pandas.DataFrame: DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data()
elm_classes: pandas.DataFrame: DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv). Recommended to download this file and input it manually, but will download from ELM otherwise

Returns:

flanking_sequences: pandas.DataFrame: DataFrame containing flanking sequences with changes and motifs found in the inclusion and exclusion events

ptm_pose.analyze.flank_analysis.findAlteredPositions(seq1, seq2, flank_size=5)[source]#

Given two sequences, identify the location of positions that have changed

Parameters:

seq1, seq2: str: sequences to compare (order does not matter)
flank_size: int: size of the flanking sequences (default is 5). This is used to make sure the provided sequences are the correct length

Returns:

altered_positions: list: list of positions that have changed
residue_change: list: list of residues that have changed associated with that position
flank_side: str: indicates which side of the flanking sequence the change has occurred (N-term, C-term, or Both)

ptm_pose.analyze.flank_analysis.find_motifs(seq, elm_classes)[source]#

Given a sequence and a dataframe containinn ELM class information, identify motifs that can be found in the provided sequence using the RegEx expression provided by ELM (PTMs not considered). This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).

Parameters:

seq: str: sequence to search for motifs
elm_classes: pandas.DataFrame: DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv)

ptm_pose.analyze.flank_analysis.getSequenceIdentity(seq1, seq2)[source]#

Given two flanking sequences, calculate the sequence identity between them using Biopython and parameters definded by Pillman et al. BMC Bioinformatics 2011

Parameters:

seq1, seq2: str: flanking sequence

Returns:

normalized_score: float: normalized score of sequence similarity between flanking sequences (calculated similarity/max possible similarity)

ptm_pose.analyze.flank_analysis.identify_change_to_specific_motif(altered_flanks, elm_motif_name, elm_classes=None, modification_class=None, residues=None, dPSI_col=None, **kwargs)[source]#

Given the altered_flanks dataframe, identify PTM flanking sequences that match different ELM motifs in the inclusion and exclusion isoform. This function allows for filtering based on specific ELM motifs, modification classes, and residues of interest. It also allows for additional filtering through kwargs.

Parameters:

altered_flanks: pandas.DataFrame: DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data() and analyzed with compare_flanking_sequences(). It should contain the columns ‘Inclusion Flanking Sequence’ and ‘Exclusion Flanking Sequence’ for comparison.
elm_motif_name: str: Name of ELM motif to look for in inclusion and exclusion isoforms
elm_classes: pandas.DataFrame, optional: DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv). If not provided, the function will download it automatically.
modification_classstr, optional: Specific modification class to filter for. This is useful if you are looking for a specific type of PTM (e.g., phosphorylation, acetylation). Default is None, which will not filter based on modification class.
residuesstr: Specific residue to focus on. If none will look at all residues
dPSI_colstr, optional: column containing deltaPSI information
kwargsadditional keyword arguments: additional keyword arguments to pass to the filter_ptms function

ptm_pose.analyze.flank_analysis.plot_alterations_matrix(altered_flanks, modification_class=None, residue=None, title='', ax=None, **kwargs)[source]#

Given the altered flanking sequences dataframe, plot a matrix showing the positions of altered residues for specific proteins, as well as the specific change

Parameters:

altered_flanks: pd.DataFrame: Dataframe with altered flanking sequences, and annotated information added with analyze.compare_flanking_sequences
modification_class: str: Specific modification class to plot. Default is None, which will plot all modification classes.
residue: str: Specific residue to plot. Default is None, which will plot all residues.
title: str: Title of the plot. Default is ‘’ (no title).
ax: matplotlib.Axes: Axis to plot on. If None, will create new figure. Default is None.
kwargs: additional keyword arguments: Additional keyword arguments to pass to the filter_ptms function. This allows for filtering based on specific criteria (e.g., dPSI, evidence, etc.).

ptm_pose.analyze.flank_analysis.plot_location_of_altered_flanking_residues(altered_flanks, figsize=(4, 3), modification_class=None, residue=None, **kwargs)[source]#

Plot the number of PTMs with altered residues as specific positions relative to the PTM site. This includes the specific position of the residue (-5 to +5 from PTM site) and the specific side of the PTM site that is altered (N-term or C-term)

Parameters:

altered_flanks: pd.DataFrame: Dataframe with altered flanking sequences, and annotated information added with analyze.compare_flanking_sequences
figsize: tuple: Size of the figure. Default is (4,3).
modification_class: str: Specific modification class to plot. Default is None, which will plot all modification classes.
residue: str: Specific residue to plot. Default is None, which will plot all residues.
kwargs: additional keyword arguments: Additional keyword arguments to pass to the filter_ptms function. This allows for filtering based on specific criteria (e.g., dPSI, evidence, etc.).

ptm_pose.analyze.flank_analysis.plot_sequence_differences(inclusion_seq, exclusion_seq, dpsi=None, flank_size=5, figsize=(3, 1))[source]#

Given the flanking sequences for a PTM resulting from a specific splice event, plot the differences between the two sequences, coloring the changing residues in red. If dPSI is also provided, will add an arrow to the plot indicating the direction of change

Parameters:

inclusion_seq: str: Sequence of the inclusion isoform (with spliced region included)
exclusion_seq: str: Sequence of the exclusion isoform (with spliced region excluded)
dpsi: float: Change in PSI for the specific splice event. Default is None, which will not add an arrow to the plot.
flank_size: int: Size of flanking region to plot. Default is 5. This must be less than half the length of the shortest sequence.
figsize: tuple: Size of the figure. Default is (3,1).

PTM-POSE Reference

Contents

PTM-POSE Reference#

Configuration#

PTM Projection#

Flanking Sequences#

Annotating PTMs#

Analyze Modules#

Summaries#

Filtering PTMs and Events#

Annotations#

Protein Interactions#

Enzyme Regulation#

Flanking Sequences#