PTM-POSE Reference#
Configuration#
- ptm_pose.pose_config.download_translator(save=False)[source]#
Using rest API from UniProt, download mapping information between UniProt IDs, Gene names, and Ensembl Gene IDs. This information is used to convert between different gene identifiers and UniProt IDs
- Parameters:
- savebool, optional
Whether to save the translator file locally. The default is False.
PTM Projection#
- ptm_pose.project.find_ptms_in_region(ptm_coordinates, chromosome, strand, start, end, gene=None, coordinate_type='hg38')[source]#
Given an genomic region in either hg38 or hg19 coordinates (such as the region encoding an exon of interest), identify PTMs that are mapped to that region. If so, return the exon number. If none are found, return np.nan.
- Parameters:
- chromosome: str
chromosome where region is located
- strand: int
DNA strand for region is found on (1 for forward, -1 for reverse)
- start: int
start position of region on the chromosome/strand (should always be less than end)
- end: int
end position of region on the chromosome/strand (should always be greater than start)
- coordinate_type: str
indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
- Returns:
- ptms_in_region: pandas.DataFrame
dataframe containing all PTMs found in the region. If no PTMs are found, returns np.nan.
- ptm_pose.project.project_ptms_onto_MATS(SE_events=None, A5SS_events=None, A3SS_events=None, RI_events=None, MXE_events=None, coordinate_type='hg38', identify_flanking_sequences=False, dPSI_col='meanDeltaPSI', sig_col='FDR', extra_cols=None, separate_modification_types=False, PROCESSES=1, ptm_coordinates=None, **kwargs)[source]#
Given splice quantification from the MATS algorithm, annotate with PTMs that are found in the differentially included regions.
- Parameters:
- ptm_coordinates: pandas.DataFrame
dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs
- SE_events: pandas.DataFrame
dataframe containing skipped exon event information from MATS
- A5SS_events: pandas.DataFrame
dataframe containing 5’ alternative splice site event information from MATS
- A3SS_events: pandas.DataFrame
dataframe containing 3’ alternative splice site event information from MATS
- RI_events: pandas.DataFrame
dataframe containing retained intron event information from MATS
- MXE_events: pandas.DataFrame
dataframe containing mutually exclusive exon event information from MATS
- coordinate_type: str
indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
- dPSI_col: str
Column name indicating delta PSI value. Default is ‘meanDeltaPSI’.
- sig_col: str
Column name indicating significance of the event. Default is ‘FDR’.
- extra_cols: list
List of column names for additional information to add to the results. Default is None.
- separate_modification_types: bool
Indicate whether residues with multiple modifications (i.e. phosphorylation and acetylation) should be treated as separate PTMs and be placed in unique rows of the output dataframe. Default is False.
- PROCESSES: int
Number of processes to use for multiprocessing. Default is 1.
- **kwargs: additional keyword arguments
Additional keyword arguments to pass to the find_ptms_in_many_regions function, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- ptm_pose.project.project_ptms_onto_SpliceSeq(psi_data, splicegraph, gene_col='symbol', dPSI_col=None, sig_col=None, extra_cols=None, coordinate_type='hg19', separate_modification_types=False, identify_flanking_sequences=False, flank_size=5, ptm_coordinates=None, PROCESSES=1, **kwargs)[source]#
Given splice event quantification from SpliceSeq (such as what can be downloaded from TCGASpliceSeq), annotate with PTMs that are found in the differentially included regions.
- Parameters:
- psi_data: pandas.DataFrame
dataframe containing splice event quantification from SpliceSeq. Must contain the following columns: ‘symbol’, ‘exons’, ‘splice_type’.
- splicegraph: pandas.DataFrame
dataframe containing exon information from the splicegraph used during splice event quantification. Must contain the following columns: ‘Symbol’, ‘Exon’, ‘Chr_Start’, ‘Chr_Stop’.
- gene_col: str
column name in psi_data that contains the gene name. Default is ‘symbol’.
- dPSI_col: str
column name in psi_data that contains the delta PSI value for the splice event. Default is None, which will not include this information in the output.
- sig_col: str
column name in psi_data that contains the significance value for the splice event. Default is None, which will not include this information in the output.
- extra_cols: list
list of additional columns to include in the output dataframe. Default is None, which will not include any additional columns.
- coordinate_type: str
indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg19’.
- separate_modification_types: bool
Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is True.
- identify_flanking_sequences: bool
Indicate whether to identify and return the flanking sequences for the splice events. Default is False.
- flank_size: int
Size of the flanking sequence to extract from the splice event. Default is 5, which will extract 5 bases upstream and downstream of the splice event. Only relevant if identify_flanking_sequences is True.
- PROCESSES: int
Number of processes to use for multiprocessing. Default is 1 (single processing).
- **kwargs: additional keyword arguments
Additional keyword arguments to pass to the find_ptms_in_many_regions function, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- ptm_pose.project.project_ptms_onto_splice_events(splice_data, annotate_original_df=True, chromosome_col='chr', strand_col='strand', region_start_col='exonStart_0base', region_end_col='exonEnd', dPSI_col=None, sig_col=None, event_id_col=None, gene_col=None, extra_cols=None, separate_modification_types=False, coordinate_type='hg38', start_coordinate_system='1-based', end_coordinate_system='1-based', taskbar_label=None, ptm_coordinates=None, PROCESSES=1, **kwargs)[source]#
Given splice event quantification data, project PTMs onto the regions impacted by the splice events. Assumes that the splice event data will have chromosome, strand, and genomic start/end positions for the regions of interest, and each row of the splice_event_data corresponds to a unique region.
Important note: PTM-POSE relies on Ensembl based coordinates (1-based), so if the coordinates are 0-based, make sure to indicate using the start_coordinate_system and end_coordinate_system parameters. For example, rMATS uses 0-based for the start coordinates, but 1-based for the end coordinates. In this case, set start_coordinate_system = ‘0-based’ and end_coordinate_system = ‘1-based’.
- Parameters:
- splice_data: pandas.DataFrame
dataframe containing splice event information, including chromosome, strand, and genomic location of regions of interest
- ptm_coordinates: pandas.DataFrame
dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs. If none, it will pull from the config file.
- chromosome_col: str
column name in splice_data that contains chromosome information. Default is ‘chr’. Expects it to be a str with only the chromosome number: ‘Y’, ‘1’, ‘2’, etc.
- strand_col: str
column name in splice_data that contains strand information. Default is ‘strand’. Expects it to be a str with ‘+’ or ‘-’, or integers as 1 or -1. Will convert to integers automatically if string format is provided.
- region_start_col: str
column name in splice_data that contains the start position of the region of interest. Default is ‘exonStart_0base’.
- region_end_col: str
column name in splice_data that contains the end position of the region of interest. Default is ‘exonEnd’.
- event_id_col: str
column name in splice_data that contains the unique identifier for the splice event. If provided, will be used to annotate the ptm information with the specific splice event ID. Default is None.
- gene_col: str
column name in splice_data that contains the gene name. If provided, will be used to make sure the projected PTMs stem from the same gene (some cases where genomic coordiantes overlap between distinct genes). Default is None.
- dPSI_col: str
column name in splice_data that contains the delta PSI value for the splice event. Default is None, which will not include this information in the output
- sig_col: str
column name in splice_data that contains the significance value for the splice event. Default is None, which will not include this information in the output.
- extra_cols: list
list of additional columns to include in the output dataframe. Default is None, which will not include any additional columns.
- coordinate_type: str
indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
- start_coordinate_system: str
indicates the coordinate system used for the start position. Either ‘0-based’ or ‘1-based’. Default is ‘1-based’.
- end_coordinate_system: str
indicates the coordinate system used for the end position. Either ‘0-based’ or ‘1-based’. Default is ‘1-based’.
- separate_modification_types: bool
Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is True.
- taskbar_label: str
Label to display in the tqdm progress bar. Default is None, which will automatically state “Projecting PTMs onto regions using —– coordinates”.
- PROCESSES: int
Number of processes to use for multiprocessing. Default is 1 (single processing)
- **kwargs: additional keyword arguments
Additional keyword arguments to pass to the find_ptms_in_many_regions function, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- Returns:
- spliced_ptm_info: pandas.DataFrame
Contains the PTMs identified across the different splice events
- splice_data: pandas.DataFrame
dataframe containing the original splice data with an additional column ‘PTMs’ that contains the PTMs found in the region of interest, in the format of ‘SiteNumber(ModificationType)’. If no PTMs are found, the value will be np.nan.
Flanking Sequences#
- ptm_pose.flanking_sequences.get_flanking_changes(ptm_coordinates, chromosome, strand, first_flank_region, spliced_region, second_flank_region, gene=None, dPSI=None, sig=None, event_id=None, flank_size=5, coordinate_type='hg38', lowercase_mod=True, order_by='Coordinates')[source]#
Given flanking and spliced regions associated with a splice event, identify PTMs that have potential to have an altered flanking sequence depending on whether spliced region is included or excluded (if PTM is close to splice boundary). For these PTMs, extract the flanking sequences associated with the inclusion and exclusion cases and translate into amino acid sequences. If the PTM is not associated with a codon that codes for the expected amino acid, the PTM will be excluded from the results.
Important note: It is assumed that all region coordinates are based on a 1-based coordinate system, not 0-based, consistent with Ensembl. If using a 0-based system, please adjust the coordinates accordingly prior to running this function
- Parameters:
- ptm_coordinatespandas.DataFrame
DataFrame containing PTM coordinate information for identify PTMs in the flanking regions
- chromosomestr
Chromosome associated with the splice event
- strandint
Strand associated with the splice event (1 for forward, -1 for negative)
- first_flank_regionlist
List containing the start and stop locations of the first flanking region (first is currently defined based on location the genome not coding sequence)
- spliced_regionlist
List containing the start and stop locations of the spliced region
- second_flank_regionlist
List containing the start and stop locations of the second flanking region (second is currently defined based on location the genome not coding sequence)
- event_idstr, optional
Event ID associated with the splice event, by default None
- flank_sizeint, optional
Number of amino acids to include flanking the PTM, by default 7
- coordinate_typestr, optional
Coordinate system used for the regions, by default ‘hg38’. Other options is hg19.
- lowercase_modbool, optional
Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
- order_bystr, optional
Whether the first, spliced and second regions are defined by their genomic coordinates (first has smallest coordinate, spliced next, then second), or if they are defined by their translation (first the first when translated, etc.)
- Returns:
- pandas.DataFrame
DataFrame containing the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases
- ptm_pose.flanking_sequences.get_flanking_changes_from_rMATS(ptm_coordinates=None, SE_events=None, A5SS_events=None, A3SS_events=None, RI_events=None, coordinate_type='hg38', dPSI_col='meanDeltaPSI', sig_col='FDR', extra_cols=None, **kwargs)[source]#
Given splice events identified rMATS extract quantified PTMs that are nearby the splice boundary (potential for flanking sequence to be altered). Coordinate information of individual exons should be found in splicegraph. You can also provide columns with specific psi or significance information. Extra cols not in these categories can be provided with extra_cols parameter.
Only use this function if you do not care about differentially included sites, otherwise you can use the project module set identify_flanking_sequences = True (project.project_ptms_onto_MATS(identify_flanking_sequences = True))
- Parameters:
- ptm_coordinates: pandas.DataFrame
dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs. If none, will use the PTM coordinates from the pose_config file.
- SE_events: pandas.DataFrame
dataframe containing skipped exon event information from MATS
- A5SS_events: pandas.DataFrame
dataframe containing 5’ alternative splice site event information from MATS
- A3SS_events: pandas.DataFrame
dataframe containing 3’ alternative splice site event information from MATS
- RI_events: pandas.DataFrame
dataframe containing retained intron event information from MATS
- MXE_events: pandas.DataFrame
dataframe containing mutually exclusive exon event information from MATS
- coordinate_type: str
indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
- dPSI_col: str
Column name indicating delta PSI value. Default is ‘meanDeltaPSI’.
- sig_col: str
Column name indicating significance of the event. Default is ‘FDR’.
- extra_cols: list
List of column names for additional information to add to the results. Default is None.
- **kwargs: additional keyword arguments
Additional keyword arguments, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- ptm_pose.flanking_sequences.get_flanking_changes_from_splice_data(splice_data, ptm_coordinates=None, chromosome_col=None, strand_col=None, first_flank_start_col=None, first_flank_end_col=None, spliced_region_start_col=None, spliced_region_end_col=None, second_flank_start_col=None, second_flank_end_col=None, dPSI_col=None, sig_col=None, event_id_col=None, gene_col=None, extra_cols=None, flank_size=5, coordinate_type='hg38', start_coordinate_system='1-based', end_coordinate_system='1-based', lowercase_mod=True, **kwargs)[source]#
Given a DataFrame containing information about splice events, extract the flanking sequences associated with the PTMs in the flanking regions if there is potential for this to be altered. The DataFrame should contain columns for the chromosome, strand, start and stop locations of the first flanking region, spliced region, and second flanking region. The DataFrame should also contain a column for the event ID associated with the splice event. If the DataFrame does not contain the necessary columns, the function will raise an error.
- Parameters:
- splice_datapandas.DataFrame
DataFrame containing information about splice events
- ptm_coordinatespandas.DataFrame
DataFrame containing PTM coordinate information for identify PTMs in the flanking regions
- chromosome_colstr, optional
Column name indicating chromosome, by default None
- strand_colstr, optional
Column name indicating strand, by default None
- first_flank_start_colstr, optional
Column name indicating start location of the first flanking region, by default None
- first_flank_end_colstr, optional
Column name indicating end location of the first flanking region, by default None
- spliced_region_start_colstr, optional
Column name indicating start location of the spliced region, by default None
- spliced_region_end_colstr, optional
Column name indicating end location of the spliced region, by default None
- second_flank_start_colstr, optional
Column name indicating start location of the second flanking region, by default None
- second_flank_end_colstr, optional
Column name indicating end location of the second flanking region, by default None
- event_id_colstr, optional
Column name indicating event ID, by default None
- gene_colstr, optional
Column name indicating gene name, by default None
- extra_colslist, optional
List of additional columns to include in the output DataFrame, by default None
- flank_sizeint, optional
Number of amino acids to include flanking the PTM, by default 7
- coordinate_typestr, optional
Coordinate system used for the regions, by default ‘hg38’. Other options is hg19.
- lowercase_modbool, optional
Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
- start_coordinate_systemstr, optional
Coordinate system used for the start locations of the regions, by default ‘1-based’. Other option is ‘0-based’.
- end_coordinate_systemstr, optional
Coordinate system used for the end locations of the regions, by default ‘1-based’. Other option is ‘0-based’.
- kwargskeyword arguments, optional
Additional keyword arguments to pass to the find_ptms_in_many_regions function, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- Returns:
- list
List containing DataFrames with the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases
- ptm_pose.flanking_sequences.get_flanking_changes_from_splicegraph(psi_data, splicegraph, ptm_coordinates=None, dPSI_col=None, sig_col=None, event_id_col=None, extra_cols=None, gene_col='symbol', flank_size=5, coordinate_type='hg19', **kwargs)[source]#
Given a DataFrame containing information about splice events obtained from SpliceSeq and the corresponding splicegraph, extract the flanking sequences of PTMs that are nearby the splice boundary (potential for flanking sequence to be altered). Coordinate information of individual exons should be found in splicegraph. You can also provide columns with specific psi or significance information. Extra cols not in these categories can be provided with extra_cols parameter.
- Parameters:
- psi_datapandas.DataFrame
DataFrame containing information about splice events obtained from SpliceSeq
- splicegraphpandas.DataFrame
DataFrame containing information about individual exons and their coordinates
- ptm_coordinatespandas.DataFrame
DataFrame containing PTM coordinate information for identify PTMs in the flanking regions
- dPSI_colstr, optional
Column name indicating delta PSI value, by default None
- sig_colstr, optional
Column name indicating significance of the event, by default None
- event_id_colstr, optional
Column name indicating event ID, by default None
- extra_colslist, optional
List of column names for additional information to add to the results, by default None
- gene_colstr, optional
Column name indicating gene symbol of spliced gene, by default ‘symbol’
- flank_sizeint, optional
Number of amino acids to include flanking the PTM, by default 5
- coordinate_typestr, optional
Coordinate system used for the regions, by default ‘hg19’. Other options is hg38.
- **kwargs: additional keyword arguments
Additional keyword arguments, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- Returns:
- altered_flankspandas.DataFrame
DataFrame containing the PTMs associated with the flanking regions that are altered, and the flanking sequences that arise depending on whether the flanking sequence is included or not
Annotating PTMs#
- ptm_pose.annotate.add_ELM_interactions(ptms, file=None, report_success=True)[source]#
Given a spliced ptms or altered flanks dataframe from the project module, add ELM interaction data to the dataframe
- Parameters:
- ptms: pandas.DataFrame
Contains the PTMs identified across the different splice events, either differentially included events, or altered flanking sequences
- file: str
Path to the ELM data file. If not provided, the data will be downloaded directly from the ELM website
- report_success: bool
If True, will print out the number of PTMs identified in the dataset that have ELM interaction information
- Returns:
- ptms: pandas.DataFrame
Contains the PTMs identified across the different splice events with additional columns for ELM interaction data
- ptm_pose.annotate.add_ELM_matched_motifs(ptms, flank_size=7, file=None, report_success=True)[source]#
Given spliced ptms or altered flanks dataframes, compare the canonical flanking sequences of each PTM to motifs recorded in the ELM database. If a match is found, the ELM motif will be recorded in the ELM:Motif Matches column
- Parameters:
- ptms: pandas.DataFrame
Contains the PTMs identified across the different splice events, either differentially included events, or altered flanking sequences
- flank_size: int
Number of residues to include on either side of the PTM for the motif search. Default is 7
- file: str
Path to the ELM data file. If not provided, the data will be downloaded directly from the ELM website
- report_success: bool
If True, will print out the number of PTMs identified in the dataset that have ELM motif data
- ptm_pose.annotate.add_custom_annotation(ptms, annotation_data, source_name, annotation_type, annotation_col, accession_col='UniProtKB Accession', residue_col='Residue', position_col='PTM Position in Isoform')[source]#
Add custom annotation data to ptms or altered flanking sequence dataframes
- Parameters:
- annotation_data: pandas.DataFrame
Dataframe containing the annotation data to be added to the ptms dataframe. Must contain columns for UniProtKB Accession, Residue, PTM Position in Isoform, and the annotation data to be added
- source_name: str
Name of the source of the annotation data, will be used to label the columns in the ptms dataframe
- annotation_type: str
Type of annotation data being added, will be used to label the columns in the ptms dataframe
- annotation_col: str
Column name in the annotation data that contains the annotation data to be added to the ptms dataframe
- accession_col: str
Column name in the annotation data that contains the UniProtKB Accession information. Default is ‘UniProtKB Accession’
- residue_col: str
Column name in the annotation data that contains the residue information
- position_col: str
Column name in the annotation data that contains the PTM position information
- Returns:
- ptms: pandas.DataFrame
Contains the PTMs identified across the different splice events with an additional column for the custom annotation data
- ptm_pose.annotate.add_omnipath_data(ptms, min_sources=1, min_references=1, convert_to_gene_name=True, replace_old_annotations=True, report_success=True)[source]#
Given spliced ptms or altered flanks dataframe, append enzyme-substrate interactions recorded in OmniPath database. These will be split between ‘Writer’ enzymes, or enzymes that add the modification (OmniPath:Writer Enzyme), and ‘Eraser’ enzymes, or enzymes that remove the modification (OmniPath:Eraser Enzyme). Note, we do not consider the ‘post translational modification’ or ‘cleavage’ entries for this purpose.
- Parameters:
- ptmspandas.DataFrame
Spliced PTMs or altered flanks dataframe.
- min_sourcesint
Minimum number of sources (i.e. database) for enzyme-substrate interaction. Default is 1, or all entries.
- min_referencesint
Minimum number of references (i.e. publications) for enzyme-substrate interaction. Default is 1, or all entries.
- convert_to_gene_namebool
Whether to convert enzyme names from UniProt IDs to gene names using pose_config.uniprot_to_genename. Default is True.
- report_successbool
Whether to report success message. Default is True.
- ptm_pose.annotate.annotate_ptms(ptms, annot_type='All', phosphositeplus=True, ptmsigdb=True, ptmcode=True, ptmint=True, omnipath=True, regphos=True, depod=True, elm=False, interactions_to_combine='All', enzymes_to_combine='All', combine_similar=True, report_success=True, **kwargs)[source]#
Given spliced ptm data, add annotations from various databases. The annotations that can be added are the following:
PhosphoSitePlus: regulatory site data (file must be provided), kinase-substrate data (file must be provided), and disease association data (file must be provided) ELM: interaction data (can be downloaded automatically or provided as a file), motif matches (elm class data can be downloaded automatically or provided as a file) PTMInt: interaction data (will be downloaded automatically) PTMcode: intraprotein interactions (can be downloaded automatically or provided as a file), interprotein interactions (can be downloaded automatically or provided as a file) DEPOD: phosphatase-substrate data (will be downloaded automatically) RegPhos: kinase-substrate data (will be downloaded automatically)
- Parameters:
- ptms: pd.DataFrame
Spliced PTM data from project module
- psp_regulatorybool
- interactions_to_combine: list
List of databases to combine interaction data from. Default is [‘PTMcode’, ‘PhosphoSitePlus’, ‘RegPhos’, ‘PTMInt’]
- kinases_to_combine: list
List of databases to combine kinase-substrate data from. Default is [‘PhosphoSitePlus’, ‘RegPhos’]
- combine_similar: bool
Whether to combine annotations of similar information (kinase, interactions, etc) from multiple databases into another column labeled as ‘Combined’. Default is True
- ptm_pose.annotate.append_from_gmt(ptms, database=None, annot_type=None, gmt_df=None, column_name=None, **kwargs)[source]#
Given a gmt annotation file format, add the annotations to the ptms dataframe
- Parameters:
- ptmspd.DataFrame
dataframe containing ptm information, which can be the spliced_ptms or altered_flanks dataframe generated during projection
- databasestr
Name of the database for the annotation. Used to identify proper annotation if gmt_df not provided
- annot_typestr
Type of annotation to append to the ptms dataframe
- gmt_dfpd.DataFrame
If using custom gmt file, provide the dataframe loaded from the GMT file. This will override the database and annot_type parameters if provided.
- column_namestr or None
Name of the column to use for the annotations in the ptms dataframe. If None, will use a default name based on the database and annot_type. Default is None.
- **kwargsadditional keyword arguments
Passes additional keyword arguments to annotation specific functions. For example, you could pass min_sources for the construct_omnipath_gmt() function
- ptm_pose.annotate.check_file(fname, expected_extension='.tsv')[source]#
Given a file name, check if the file exists and has the expected extension. If the file does not exist or has the wrong extension, raise an error.
- Parameters:
- fname: str
File name to check
- expected_extension: str
Expected file extension. Default is ‘.tsv’
- ptm_pose.annotate.check_gmt_file(gmt_file, database, annot_type, automatic_download=False, odir=None, **kwargs)[source]#
Given a gmt file path, check to make sure it exists. If it doesn’t, either raise error or download and save a gmt file in the provided directory.
- Parameters:
- gmt_filestr
file path to gmt file
- databasestr
name of database associated with gmt file
- annot_typestr
type of annotation to check for. This is used to provide more specific error messages
- automatic download: bool
whether to automatically download data and process into gmt file if it does not exist and can be done. Default is false
- odirstr or None
location to save annotations, if automatic download is true
- kwargsadditional keyword arguments
Passes additional keyword arguments to annotation specific functions. For example, you could pass min_sources for the construct_omnipath_gmt() function
- ptm_pose.annotate.combine_enzyme_data(ptms, enzyme_databases=['PhosphoSitePlus', 'RegPhos', 'OmniPath', 'DEPOD'], regphos_conversion={'ABL1(ABL)': 'ABL1', 'CDC2': 'CDK1', 'CK2A1': 'CSNK2A1', 'ERK1(MAPK3)': 'MAPK3', 'ERK2(MAPK1)': 'MAPK1', 'JNK2(MAPK9)': 'MAPK9', 'PKACA': 'PRKACA'})[source]#
Given spliced ptm information, combine enzyme-substrate data from multiple databases (currently support PhosphoSitePlus, RegPhos, OmniPath, DEPOD, and iKiP downloaded from PTMsigDB), assuming that the enzyme data from these resources has already been added to the spliced ptm data. The combined kinase data will be added as a new column labeled ‘Combined:Writer Enzyme’ and ‘Combined:Eraser Enzyme’
- Parameters:
- ptms: pd.DataFrame
Spliced PTM data from project module
- enzyme_databases: list
List of databases to combine enzyme data from. Currently support PhosphoSitePlus, RegPhos, OmniPath, and DEPOD
- regphos_conversion: dict
Allows conversion of RegPhos names to matching names in PhosphoSitePlus.
- Returns:
- ptms: pd.DataFrame
PTM data with combined kinase data added
- ptm_pose.annotate.combine_interaction_data(ptms, interaction_databases=['PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD'], include_enzyme_interactions=True)[source]#
Given annotated spliced ptm data, extract interaction data from various databases and combine into a single dataframe. This will include the interacting protein, the type of interaction, and the source of the interaction data
- Parameters:
- ptms: pd.DataFrame
Dataframe containing PTM data and associated interaction annotations from various databases
- interaction_databases: list
List of databases to extract interaction data from. Options include ‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’. These should already have annotation columns in the ptms dataframe, otherwise they will be ignored. For kinase-substrate interactions, if combined column is present, will use that instead of individual databases
- include_enzyme_interactions: bool
If True, will include kinase-substrate and phosphatase interactions in the output dataframe
- Returns:
- interact_data: list
List of dataframes containing PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES), and the source of the interaction data
- ptm_pose.annotate.construct_DEPOD_gmt_file(odir=None, overwrite=False, max_retries=5, delay=10)[source]#
Download and process DEPOD data to create a GMT file for PTM-POSE. DEPOD contains information on dephosphorylation sites and their corresponding substrates.
- Parameters:
- odirstr, optional
Output directory for the GMT file. If not provided, it will default to the ‘Resource_Files/Annotations/DEPOD’ directory within the PTM-POSE package directory.
- overwritebool, optional
If True, will overwrite any existing GMT file in the output directory. If False and the GMT file already exists, the function will skip processing and print a message.
- max_retriesint, optional
Number of times to try downloading data from DEPOD
- delayint, optional
Delay in seconds between download attempts. Default is 10 seconds.
- ptm_pose.annotate.construct_PTMInt_gmt_file(file=None, odir=None, overwrite=False, max_retries=5, delay=10)[source]#
Download and process PTMInt interaction data to create gmt files for PTM-POSE
- Parameters:
- filestr, optional
Path to the PTMInt data file. If not provided, the data will be downloaded directly from the PTMInt website. Default is None.
- odirstr, optional
Output directory for the gmt file. If not provided, will default to the PTM-POSE resource directory for annotations. Default is None.
- overwritebool, optional
If True, will overwrite any existing gmt files in the output directory. If False, will skip the creation of the gmt file if it already exists. Default is False.
- max_retriesint, optional
Number of times to retry downloading the PTMInt data if the download fails. Default is 5.
- delayint, optional
Amount of time to wait (in seconds) before retrying the download if it fails. Default is 10 seconds.
- ptm_pose.annotate.construct_PTMcode_interprotein_gmt_file(file=None, odir=None, overwrite=False, max_retries=5, delay=10)[source]#
Given the PTMcode interprotein interaction data, convert to readily usable format with PTM-POSE in gmt file format
- file: str
Path to the PTMcode interprotein interaction data file. If not provided, the data will be downloaded directly from the PTMcode website
- odirstr
Output directory for the gmt file. If not provided, will default to the PTM-POSE resource directory for annotations
- overwritebool, optional
If True, will overwrite any existing gmt files in the output directory. If False, will skip the creation of the gmt file if it already exists. Default is False.
- max_retriesint, optional
Number of times to retry downloading the PTMcode data if the initial attempt fails. Default is 5.
- delayint, optional
Number of seconds to wait between retries if the download fails. Default is 10 seconds.
- ptm_pose.annotate.construct_PTMsigDB_gmt_files(file, odir=None, overwrite=False, process_PSP_data=True)[source]#
Given the PTMsigDB xlsx file, convert to readily usable format with PTM-POSE in gmt file format. This will also process the PhosphoSitePlus data in PTMsigDB if requested.
- Parameters:
- filestr
PTMsigDB excel file path. This file can be downloaded from the PTMsigDB website.
- odirstr
Output directory for the gmt files. If None, will default to the PTM-POSE resource directory.
- overwritebool, optional
If True, will overwrite any existing gmt files in the output directory. Default is False.
- process_PSP_databool, optional
If True, will process the PhosphoSitePlus data included in the PTMsigDB file, but only if not already found in odir. Default is True.
- ptm_pose.annotate.construct_PhosphoSitePlus_gmt_files(regulatory_site_file=None, kinase_substrate_file=None, disease_association_file=None, odir=None, overwrite=False)[source]#
Given three PhosphoSitePlus annotation files, convert to readily usable format with PTM-POSE in gmt file format
- Parameters:
- regulatory_site_file: str or None
Path to the PhosphoSitePlus regulatory site file (gzipped). If None, will skip creating function annotations.
- kinase_substrate_file: str or None
Path to the PhosphoSitePlus kinase-substrate file (gzipped). If None, will skip creating kinase-substrate annotations.
- disease_association_file: str or None
Path to the PhosphoSitePlus disease association file (gzipped). If None, will skip creating disease association annotations.
- odirstr or None
Path to the output directory where the GMT files will be saved. If None, will save to the default resource directory for PhosphoSitePlus annotations.
- overwritebool
If True, will overwrite existing GMT files if they already exist. If False, will skip creating the GMT files if they already exist. Default is False.
- ptm_pose.annotate.construct_RegPhos_gmt_file(file=None, odir=None, overwrite=False)[source]#
- filestr
RegPhos text file path. This file can be downloaded from the RegPhos website. If None, the function will raise an error.
- odirstr
Output directory for the gmt files. If None, will default to the PTM-POSE resource directory.
- overwritebool, optional
If True, will overwrite any existing gmt files in the output directory. Default is False.
- ptm_pose.annotate.construct_annotation_dict_from_gmt(gmt_df, key_type='annotation')[source]#
Given a gmt annotation file format, construct a dictionary mapping each item to its annotations, with either the annotation as key or PTM as the key
- ptm_pose.annotate.construct_custom_gmt_file(annotation_df, database, annot_type, annot_col, accession_col='UniProtKB Accession', residue_col='Residue', position_col='PTM Position in Isofrom', odir=None, **kwargs)[source]#
Function for constructing a gmt file for annotations not currently provided by PTM-POSE. Ideally, these annotations should be partially processed to have the same format as PTM-POSE annotations. For example, they should have columns for UniProtKB Accession, Residue, PTM Position in Isoform, and the annotation data to be added.
- Parameters:
- annotation_df: pandas.DataFrame
Dataframe containing the annotation data to be added to the ptms dataframe. Must contain columns for UniProtKB Accession, Residue, PTM Position in Isoform, and the annotation data to be added.
- databasestr
Name of the database for the annotation. This will be used to create the output directory and file name.
- annot_typestr
Type of annotation data being added. This will be used to create the output file name and description.
- annot_colstr
Column name in the annotation data that contains the annotation data to be added to the ptms dataframe. This will be used as the annotation column in the output GMT file.
- accession_colstr
Column name in the annotation data that contains the UniProtKB Accession information. Default is ‘UniProtKB Accession’.
- residue_colstr
Column name in the annotation data that contains the residue information. Default is ‘Residue’.
- position_colstr
Column name in the annotation data that contains the PTM position information. Default is ‘PTM Position in Isoform’.
- odirstr or None
Path to the output directory where the GMT file will be saved. If None, will save to the default resource directory for annotations. Default is None.
- kwargsadditional keyword arguments
additional keywords to pass to construct_gmt_df function. This can include parameters such as annotation_separator, description, and compressed.
- ptm_pose.annotate.construct_gmt_df(df, annotation_col, description=nan, annotation_separator=None, odir=None, fname=None, compressed=True)[source]#
Given annotation data, construct a dataframe in the gmt file format. Save if odir and fname are provided
- Parameters:
- dfpd.DataFrame
Dataframe containing the annotation data to be converted to GMT format. Must contain columns for UniProtKB Accession, Residue, PTM Position in Isoform, and the annotation data to be added.
- annotation_colstr
Column name in the dataframe that contains the annotation data to be added to the GMT file. This will be used as the annotation column in the output GMT file.
- descriptionstr or np.nan
description to add to description column
- annotation_separatorstr or None
what separator to use for splitting annotations in the annotation_col. If None, will not split annotations. Default is None.
- odirstr or None
file path to output directory where the GMT file will be saved. If None, will not save. Default is None.
- fnamestr or None:
name of output file. If None, will use the annotation_col as the file name. Default is None.
- compressedbool
whether to save gmt file in gzip format. Default is True.
- ptm_pose.annotate.construct_omnipath_gmt_file(min_sources=1, min_references=1, convert_to_gene_name=True, odir=None)[source]#
Download enzyme-substrate interactions from the OmniPath database. The data will be filtered based on the number of sources and references specified. The resulting data will be split into two categories: ‘Writer’ enzymes, which add the modification, and ‘Eraser’ enzymes, which remove the modification. The output will be saved as GMT files in resource files directory.
- ptm_pose.annotate.convert_PSP_label_to_UniProt(label)[source]#
Given a label for an interacting protein from PhosphoSitePlus, convert to UniProtKB accession. Required as PhosphoSitePlus interactions are recorded in various ways that aren’t necessarily consistent with other databases (i.e. not always gene name)
- Parameters:
- label: str
Label for interacting protein from PhosphoSitePlus
- ptm_pose.annotate.extract_ids_PTMcode(df, col='## Protein1')[source]#
Many different ID forms are used in PTMcode, but we rely on UniProt IDs. This function is used to convert between Ensembl Gene IDs to UniProt IDs
- ptm_pose.annotate.extract_interaction_details(interaction, column='PhosphoSitePlus:Interactions')[source]#
Given an interaction string from a specific database, extract the type of interaction and the interacting protein. This is required as different databases format their interaction strings differently.
- ptm_pose.annotate.extract_positions_from_DEPOD(x)[source]#
Given string object consisting of multiple modifications in the form of ‘Residue-Position’ separated by ‘, ‘, extract the residue and position. Ignore any excess details in the string.
- Parameters:
- xstr
dephosphosite entry from DEPOD data
- Returns:
- new_x :str
ptm residue and position in format that PTM-POSE recognizes
- ptm_pose.annotate.get_available_gmt_annotations(format='dict')[source]#
Get the annotations available in resource files in GMT format. Can be outputted as either a dictionary or pandas DataFrame
- Parameters:
- format: str
Format to output the available annotations. Options are ‘dict’ or ‘dataframe’
- ptm_pose.annotate.process_database_annotations(database='PhosphoSitePlus', annot_type='Function', key_type='annotation', collapsed=False, resource_dir=None, automatic_download=False, **kwargs)[source]#
Given a database and annotation type, find and process the annotations into a dictionary mapping each PTM to its annotations, or vice versa
- Parameters:
- database: str
source of annotation
- annot_typestr
type of annotation to retrieve
- key_typestr
whether the annotation or ptm should be the key of the output dictionary. Default is annotation
- collapsedbool
whether to combine annotations for similar types into a single annotation. For example, ‘cell growth, induced’ and ‘cell growth, inhibited’ would be simplified to ‘cell growth’. Default is False.
- resource_dirstr or None
location of annotations. By default, this will look for annotations in PTM-POSE resource directory
- automatic_download: bool
Whether to automatically download annotations that are not yet present in resource files directory
- kwargsadditional keyword arguments
Passes additional keyword arguments to annotation specific functions. For example, you could pass min_sources for the construct_omnipath_gmt() function
- ptm_pose.annotate.simplify_annotation(annotation, sep=',')[source]#
Given an annotation, remove additional information such as whether or not a function is increasing or decreasing. For example, ‘cell growth, induced’ would be simplified to ‘cell growth’
- Parameters:
- annotation: str
Annotation to simplify
- sep: str
Separator that splits the core annotation from additional detail. Default is ‘,’. Assumes the first element is the core annotation.
- Returns:
- annotation: str
Simplified annotation
- ptm_pose.annotate.unify_interaction_data(ptms, interaction_col, name_dict={})[source]#
Given spliced ptm data and a column containing interaction data, extract the interacting protein, type of interaction, and convert to UniProtKB accession. This will be added as a new column labeled ‘Interacting ID’
- Parameters:
- ptms: pd.DataFrame
Dataframe containing PTM data
- interaction_col: str
column containing interaction information from a specific database
- name_dict: dict
dictionary to convert names within given database to UniProt IDs. For cases when name is not necessarily one of the gene names listed in UniProt
- Returns:
- interact: pd.DataFrame
Contains PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES)
Analyze Modules#
Summaries#
- ptm_pose.analyze.summarize.combine_outputs(spliced_ptms, altered_flanks, report_removed_annotations=True, include_stop_codon_introduction=False, remove_conflicting=True, **kwargs)[source]#
Given the spliced_ptms (differentially included) and altered_flanks (altered flanking sequences) dataframes obtained from project and flanking_sequences modules, combine the two into a single dataframe that categorizes each PTM by the impact on the PTM site
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
- altered_flanks: pd.DataFrame
Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases
- include_stop_codon_introduction: bool
Whether to include PTMs that introduce stop codons in the altered flanks. Default is False.
- remove_conflicting: bool
Whether to remove PTMs that are both included and excluded across different splicing events. Default is True.
- kwargs: dict
Additional keyword arguments to pass to the function, will be passed to helpers.filter_ptms if filtering is desired. Will automatically filter out insignificant events if not provided
- ptm_pose.analyze.summarize.get_modification_class_data(ptms, mod_class)[source]#
Given ptm dataframe and a specific modification class, return a dataframe with only the PTMs of that class
- Parameters:
- ptmspd.DataFrame
Dataframe with ptm information, such as the spliced_ptms or altered_flanks dataframe obtained during projection
- mod_classstr
The modification class to filter by, e.g. ‘Phosphorylation’, ‘Acetylation’, etc.
- ptm_pose.analyze.summarize.get_modification_counts(ptms, **kwargs)[source]#
Given PTM data (either spliced ptms, altered flanks, or combined data), return the counts of each modification class
- Parameters:
- ptms: pd.DataFrame
Dataframe with PTMs projected onto splicing events or with altered flanking sequences
- Returns:
- modification_counts: pd.Series
Series with the counts of each modification class
- ptm_pose.analyze.summarize.plot_modification_breakdown(spliced_ptms=None, altered_flanks=None, colors=[(0.00392156862745098, 0.45098039215686275, 0.6980392156862745), (0.8705882352941177, 0.5607843137254902, 0.0196078431372549), (0.00784313725490196, 0.6196078431372549, 0.45098039215686275), (0.8352941176470589, 0.3686274509803922, 0.0), (0.8, 0.47058823529411764, 0.7372549019607844), (0.792156862745098, 0.5686274509803921, 0.3803921568627451), (0.984313725490196, 0.6862745098039216, 0.8941176470588236), (0.5803921568627451, 0.5803921568627451, 0.5803921568627451), (0.9254901960784314, 0.8823529411764706, 0.2), (0.33725490196078434, 0.7058823529411765, 0.9137254901960784)], ax=None, **kwargs)[source]#
Plot the number of PTMs that are differentially included or have altered flanking sequences, separated by PTM type
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with PTMs that are differentially included
- altered_flanks: pd.DataFrame
Dataframe with PTMs that have altered flanking sequences
- colors: list
List of colors to use for the bar plot (first two will be used). Default is seaborn colorblind palette.
- ax: matplotlib.Axes
Axis to plot on. If None, will create new figure. Default is None.
- kwargs: dict
Additional keyword arguments to pass to the function, will be passed to helpers.filter_ptms if filtering is desired. Will automatically filter out insignificant events by min_dpsi and significance if the columns are present
Filtering PTMs and Events#
- ptm_pose.analyze.filter.plot_filter_impact(ptms, output_type='count', topn=10, ax=None, **kwargs)[source]#
Given a dataframe of PTMs and a set of filter arguments to be passed to helpers.filter_ptms, this function will plot the number or fraction of PTMs that are retained after filtering for each modification type
- Parameters:
- ptmspd.DataFrame
Dataframe containing PTM data with a column ‘Modification Class’ that contains the type of modification (e.g. phosphorylation, acetylation, etc.)
- output_typestr, optional
Type of output to plot, either ‘count’ or ‘fraction’. The default is ‘count’.
- topnint, optional
The number of top modification classes to plot. The default is 10.
- axmatplotlib.axes.Axes, optional
The axes to plot on. If None, a new figure and axes will be created. The default is None.
- **kwargskeyword arguments
Additional keyword arguments to be passed to the filter_ptms function (e.g. min_studies, min_compendia, etc.). These will be extracted and checked for validity.
Annotations#
- ptm_pose.analyze.annotations.annotation_enrichment(ptms, database='PhosphoSitePlus', annot_type='Function', background_type='all', collapse_on_similar=False, mod_class=None, alpha=0.05, min_dpsi=0.1, **kwargs)[source]#
Given spliced ptm information (differential inclusion, altered flanking sequences, or both), calculate the enrichment of specific annotations in the dataset using a hypergeometric test. Background data can be provided/constructed in a few ways:
Use annotations from the entire phosphoproteome (background_type = ‘all’)
Use the alpha and min_dpsi parameter to construct a foreground that only includes significantly spliced PTMs, and use the entire provided spliced_ptms dataframe as the background. This will allow you to compare the enrichment of specific annotations in the significantly spliced PTMs compared to the entire dataset. Will do this automatically if alpha or min_dpsi is provided.
- Parameters:
- ptms: pd.DataFrame
Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
- database: str
database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, ‘RegPhos’, ‘PTMsigDB’. Default is ‘PhosphoSitePlus’.
- annot_type: str
Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
- background_type: str
how to construct the background data. Options include ‘pregenerated’ (default) and ‘significance’. If ‘significance’ is selected, the alpha and min_dpsi parameters must be provided. Otherwise, will use whole proteome in the ptm_coordinates dataframe as the background.
- collapse_on_similar: bool
Whether to collapse similar annotations (for example, increasing and decreasing functions) into a single category. Default is False.
- mod_class: str
modification class to subset, if any
- alpha: float
significance threshold to use to subset foreground PTMs. Default is None.
- min_dpsi: float
minimum delta PSI value to use to subset foreground PTMs. Default is None.
- kwargs: additional keyword arguments
Additional keyword arguments to pass to the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- ptm_pose.analyze.annotations.draw_pie(dist, xpos, ypos, size, colors, edgecolor=None, type='donut', ax=None)[source]#
Draws pies individually, as if points on a scatter plot. This function was taken from this stack overflow post: https://stackoverflow.com/questions/56337732/how-to-plot-scatter-pie-chart-using-matplotlib
- Parameters:
- dist: list
list of values to be represented as pie slices for a single point
- xpos: float
x position of pie chart in the scatter plot
- ypos: float
y position of pie chart in the scatter plot
- size: float
size of pie chart
- colors: list
list of colors to use for pie slices
- ax: matplotlib.Axes
axis to plot on, if None, will create new figure
- ptm_pose.analyze.annotations.gene_set_enrichment(spliced_ptms=None, altered_flanks=None, sig_col='Significance', dpsi_col='dPSI', alpha=0.05, min_dpsi=None, gene_sets=['GO_Biological_Process_2023', 'Reactome_2022'], background=None, return_sig_only=True, max_retries=5, delay=10, **kwargs)[source]#
Given spliced_ptms and/or altered_flanks dataframes (or the dataframes combined from combine_outputs()), perform gene set enrichment analysis using the enrichr API
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with differentially included PTMs projected onto splicing events and with annotations appended from various databases. Default is None (will not be considered in analysis). If combined dataframe is provided, this dataframe will be ignored.
- altered_flanks: pd.DataFrame
Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases. Default is None (will not be considered). If combined dataframe is provided, this dataframe will be ignored.
- combined: pd.DataFrame
Combined dataframe with spliced_ptms and altered_flanks dataframes. Default is None. If provided, spliced_ptms and altered_flanks dataframes will be ignored.
- gene_sets: list
List of gene sets to use in enrichment analysis. Default is [‘KEGG_2021_Human’, ‘GO_Biological_Process_2023’, ‘GO_Cellular_Component_2023’, ‘GO_Molecular_Function_2023’,’Reactome_2022’]. Look at gseapy and enrichr documentation for other available gene sets
- background: list
List of genes to use as background in enrichment analysis. Default is None (all genes in the gene set database will be used).
- return_sig_only: bool
Whether to return only significantly enriched gene sets. Default is True.
- max_retries: int
Number of times to retry downloading gene set enrichment data from enrichr API. Default is 5.
- delay: int
Number of seconds to wait between retries. Default is 10.
- **kwargs: additional keyword arguments
Additional keyword arguments to pass to the combine_outputs() function from the summarize module. These will be used to filter the spliced_ptms and altered_flanks dataframes before performing gene set enrichment analysis. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the combine_outputs() function for more options.
- Returns:
- results: pd.DataFrame
Dataframe with gene set enrichment results from enrichr API
- ptm_pose.analyze.annotations.get_available_annotations(ptms)[source]#
Given a PTM dataframe, indicate the annotations that are available for analysis and indicate whether they have already been appended to the PTM dataset
- Parameters:
- ptmspd.DataFrame
contains PTM information and may have annotations already appended, such as spliced_ptms and altered_flanks dataframes generated during projection
- Returns:
- available_annotspd.DataFrame
DataFrame indicating the available annotation types and their sources, as well as whether they have been appended to the PTM data.
- ptm_pose.analyze.annotations.get_background_annotation_counts(database='PhosphoSitePlus', annot_type='Function', **kwargs)[source]#
Given a database and annotation type, retrieve the counts of PTMs associated with the requested annotation type across all PTMs in the ptm_coordinates dataframe used for projection
- Parameters:
- databasestr
Source of annotation. Default is PhosphoSitePlus
- annot_typestr
Type of annotation that can be found in indicated database. Default is ‘Function’. Other options include ‘Process’, ‘Disease’, ‘Enzyme’, ‘Interactions’, etc.
- kwargs: additional keyword arguments
Additional keyword arguments to pass to the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- ptm_pose.analyze.annotations.get_enrichment_inputs(ptms, annot_type='Function', database='PhosphoSitePlus', background_type='all', collapse_on_similar=False, mod_class=None, alpha=0.05, min_dpsi=0.1, **kwargs)[source]#
Given the spliced ptms, altered_flanks, or combined PTMs dataframe, identify the number of PTMs corresponding to specific annotations in the foreground (PTMs impacted by splicing) and the background (all PTMs in the proteome or all PTMs in dataset not impacted by splicing). This information can be used to calculate the enrichment of specific annotations among PTMs impacted by splicing. Several options are provided for constructing the background data: all (based on entire proteome in the ptm_coordinates dataframe) or significance (foreground PTMs are extracted from provided spliced PTMs based on significance and minimum delta PSI)
- Parameters:
- ptms: pd.DataFrame
Dataframe with PTMs projected onto splicing events and with annotations appended from various databases. This can be either the spliced_ptms, altered_flanks, or combined dataframe.
- annot_typestr
type of annotation to pull the annotations from. Default is ‘Function’.
- databasestr
source of annotations. Default is ‘PhosphoSitePlus’.
- background_typestr
Type of background to construct. Options are either ‘all’ (all PTMs in proteome) or ‘significance’ (only PTMs in dataset). Note that significance option assumes that PTMs have not already been filtered for significance.
- collapse_on_similarbool
Whether to collapse similar annotations (for example, “cell growth, increased” and “cell growth, decreased”) into a single category. Default is False.
- mod_classstr
Type of modification to perform enrichment for
- min_dpsi: float
Minimum change in PSI required to return a PTM as associated with the annotation. Default is 0.1. This can be used to filter out PTMs that are not significantly spliced.
- alphafloat
Significance threshold to use to filter PTMs based on their significance. Default is 0.05. This can be used to filter out PTMs that are not significantly spliced.
- kwargs: additional keyword arguments
Additional keyword arguments to pass to the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- ptm_pose.analyze.annotations.get_ptm_annotations(ptms, annot_type='Function', database='PhosphoSitePlus', collapse_on_similar=False, min_dpsi=0.1, alpha=0.05, **kwargs)[source]#
Given spliced ptm information obtained from project and annotate modules, grab PTMs in spliced ptms associated with specific PTM modules
- Parameters:
- spliced_ptms: pd.DataFrame
PTMs projected onto splicing events and with annotations appended from various databases
- annot_type: str
Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
- database: str
database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, or ‘PTMInt’. ELM and PTMInt data will automatically be downloaded, but due to download restrictions, PhosphoSitePlus data must be manually downloaded and annotated in the spliced_ptms data using functions from the annotate module. Default is ‘PhosphoSitePlus’.
- collapse_on_similarbool
Whether to collapse similar annotations (for example, “cell growth, increased” and “cell growth, decreased”) into a single category. Default is False.
- min_dpsi: float
Minimum change in PSI required to return a PTM as associated with the annotation. Default is 0.1. This can be used to filter out PTMs that are not significantly spliced.
- alphafloat
Significance threshold to use to filter PTMs based on their significance. Default is 0.05. This can be used to filter out PTMs that are not significantly spliced.
- kwargs: additional keyword arguments
Additional keyword arguments to pass to the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- Returns:
- annotationspd.DataFrame
Individual PTM information of PTMs that have been associated with the requested annotation type.
- annotation_countspd.Series or pd.DataFrame
Number of PTMs associated with each annotation of the requested annotation type. If dPSI col is provided or impact col is present, will output annotation counts for each type of impact (‘Included’, ‘Excluded’, ‘Altered Flank’) separately.
- ptm_pose.analyze.annotations.plot_EnrichR_pies(enrichr_results, top_terms=None, terms_to_plot=None, colors=None, edgecolor=None, row_height=0.3, type='circle', ax=None)[source]#
Given PTM-specific EnrichR results, plot EnrichR score for the provided terms, with each self point represented as a pie chart indicating the fraction of genes in the group with PTMs
- Parameters:
- ptm_results: pd.selfFrame
selfFrame containing PTM-specific results from EnrichR analysis
- num_to_plot: int
number of terms to plot, if None, will plot all terms. Ignored if specific terms are provided in terms to plot list
- terms_to_plot: list
list of terms to plot
- colors: list
list of colors to use for pie slices. Default is None, which will use seaborn colorblind palette
- edgecolor: str
color to use for edge of pie slices. Default is None, which will use the same color as the pie slice
- row_height: float
height of each row in the plot. Default is 0.3.
- type: str
type of pie chart to plot. Default is ‘circle’. Options include ‘circle’ and ‘donut’ (hole in center).
- ax: matplotlib.Axes
axis to plot on, if None, will create new figure
- ptm_pose.analyze.annotations.plot_annotation_counts(spliced_ptms=None, altered_flanks=None, database='PhosphoSitePlus', annot_type='Function', collapse_on_similar=True, colors=None, fontsize=10, top_terms=5, legend=False, legend_loc=(1.05, 0.5), title=None, title_type='database', ax=None, **kwargs)[source]#
Given a dataframe with PTM annotations added, plot the top annotations associated with the PTMs
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with differentially included PTMs
- altered_flanks: pd.DataFrame
Dataframe with PTMs associated with altered flanking sequences
- database: str
Database to use for annotations. Default is ‘PhosphoSitePlus’.
- annot_type: str
Type of annotation to plot. Default is ‘Function’.
- collapse_on_similar: bool
Whether to collapse similar annotations into a single category. Default is True.
- colors: list
List of colors to use for the bar plot. Default is None.
- top_terms: int
Number of top terms to plot. Default is 5.
- legend: bool
Whether to show the legend. Default is True.
- legend_loc: tuple
Location of the legend. Default is None, which will place the legend in the upper right corner.
- ax: matplotlib.Axes
Axis to plot on. If None, will create new figure. Default is None.
- title_type: str
Type of title to use for the plot. Default is ‘database’. Options include ‘database’ and ‘detailed’.
- title: str
Title to use for the plot. Default is None.
- fontsize: int
Font size for the plot. Default is 10.
- legend_loc: tuple
Location of the legend. Default is None, which will place the legend to the right of the plot.
- **kwargs: additional keyword arguments
Additional keyword arguments, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- ptm_pose.analyze.annotations.plot_available_annotations(ptms, only_annotations_in_data=False, show_all_ptm_count=False, ax=None, **kwargs)[source]#
Given a dataframe with ptm annotations added, show the number of PTMs associated with each annotation type
- Parameters:
- ptms: pd.DataFrame
Dataframe with PTMs and annotations added
- only_annotations_in_databool
Only plot annotations that are already appended to the dataset
- show_all_ptm_count: bool
Whether to show the total number of PTMs in the dataset. Default is True.
- ax: matplotlib.Axes
Axis to plot on. If None, will create new figure. Default is None.
Protein Interactions#
- ptm_pose.analyze.interactions.get_edge_colors(interaction_graph, network_data, defaultedgecolor='gray', color_edges_by='Database', database_color_dict={'Multiple': 'purple', 'PSP/RegPhos': 'red', 'PTMInt': 'gold', 'PTMcode': 'blue', 'PhosphoSitePlus': 'green'})[source]#
Get the edge colors to use for a provided networkx graph, either plotting all edges the same color or coloring them based on the database they are from.
- Parameters:
- interaction_graph: nx.Graph
networkx graph containing the interaction network
- network_data: pd.DataFrame
specific network edge data that contains information on which database the interaction is from and any other relevant information (such as regulation change)
- defaultedgecolor‘str’
Default color to use for edges if no specific database color is found. Default is ‘gray’.
- color_edges_bystr
How to color the edges. Options are ‘Database’ to color by the database they are from, or ‘Same’ to color all edges the same color. Default is ‘Database’.
- database_color_dictdict
Colors to use for specific databases
- ptm_pose.analyze.interactions.get_interaction_stats(interaction_graph)[source]#
Given the networkx interaction graph, calculate various network centrality measures to identify the most relevant PTMs or genes in the network
- ptm_pose.analyze.interactions.plot_interaction_network(interaction_graph, network_data, network_stats=None, modified_color='red', modified_node_size=10, interacting_color='lightblue', interacting_node_size=1, defaultedgecolor='gray', color_edges_by='Same', seed=200, legend_fontsize=8, ax=None, proteins_to_label=None, labelcolor='black', legend=True)[source]#
Given the interaction graph and network data outputted from analyze.protein_interactions, plot the interaction network, signifying which proteins or ptms are altered by splicing and the specific regulation change that occurs. by default, will only label proteins
- Parameters:
- interaction_graph: nx.Graph
NetworkX graph object representing the interaction network, created from analyze.get_interaction_network
- network_data: pd.DataFrame
Dataframe containing details about specifici protein interactions (including which protein contains the spliced PTMs)
- network_stats: pd.DataFrame
Dataframe containing network statistics for each protein in the interaction network, obtained from analyze.get_interaction_stats(). Default is None, which will not label any proteins in the network.
- modified_color: str
Color to use for proteins that are spliced. Default is ‘red’.
- modified_node_size: int
Size of nodes that are spliced. Default is 10.
- interacting_color: str
Color to use for proteins that are not spliced. Default is ‘lightblue’.
- interacting_node_size: int
Size of nodes that are not spliced. Default is 1.
- edgecolor: str
Color to use for edges in the network. Default is ‘gray’.
- seed: int
Seed to use for spring layout of network. Default is 200.
- ax: matplotlib.Axes
Axis to plot on. If None, will create new figure. Default is None.
- proteins_to_label: list, int, or str
Specific proteins to label in the network. If list, will label all proteins in the list. If int, will label the top N proteins by degree centrality. If str, will label the specific protein. Default is None, which will not label any proteins in the network.
- labelcolor: str
Color to use for labels. Default is ‘black’.
- ptm_pose.analyze.interactions.plot_network_centrality(network_stats, network_data=None, centrality_measure='Degree', top_N=10, modified_color='red', interacting_color='black', ax=None)[source]#
Given the network statistics data obtained from analyze.get_interaction_stats(), plot the top N proteins in the protein interaction network based on centrality measure (Degree, Betweenness, or Closeness)
- Parameters:
- network_stats: pd.DataFrame
Dataframe containing network statistics for each protein in the interaction network, obtained from analyze.get_interaction_stats()
- network_data: pd.DataFrame
Dataframe containing information on which proteins are spliced and how they are altered. Default is None, which will plot all proteins the same color (interacting_color)
- centrality_measure: str
Centrality measure to use for plotting. Default is ‘Degree’. Options include ‘Degree’, ‘Degree Centrality’, ‘Betweenness Centrality’, ‘Closeness Centrality’.
- top_N: int
Number of top proteins to plot. Default is 10.
- modified_color: str
Color to use for proteins that are spliced. Default is ‘red’.
- interacting_color: str
Color to use for proteins that are not spliced. Default is ‘black’.
- ax: matplotlib.Axes
Axis to plot on. If None, will create new figure. Default is None.
- class ptm_pose.analyze.interactions.protein_interactions(spliced_ptms, include_enzyme_interactions=True, interaction_databases=['PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD', 'OmniPath'], **kwargs)[source]#
Class to assess interactions facilitated by PTMs in splicing network
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
- include_enzyme_interactions: bool
Whether to include interactions with enzymes in the network, such as kinase-substrate interactions. Default is True
- interaction_databases: list
List of databases whose information to include in the network. Default is [‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’, ‘ELM’]
- **kwargs: additional keyword arguments
Additional keyword arguments, which will be fed into the filter_ptms() function from the helper module. These will be used to filter ptms with lower evidence. For example, if you want to filter PTMs based on the number of MS observations, you can add ‘min_MS_observations = 2’ to the kwargs. This will filter out any PTMs that have less than 2 MS observations. See the filter_ptms() function for more options.
- Attributes:
- interaction_graph: nx.Graph
NetworkX graph object representing the interaction network, created from analyze.get_interaction_network
- network_data: pd.DataFrame
Dataframe containing details about specifici protein interactions (including which protein contains the spliced PTMs)
- network_stats: pd.DataFrame
Dataframe containing network statistics for each protein in the interaction network, obtained from analyze.get_interaction_stats(). Default is None, which will not label any proteins in the network.
Methods
compare_to_nease
(nease_edges)Given the network edges generated by NEASE, compare the network edges generated by NEASE to the network edges generated by the PTM-driven interactions
get_interaction_network
([node_type])Given the spliced PTM data, extract interaction information and construct a dataframe containing all possible interactions driven by PTMs, either centered around specific PTMs or specific genes.
Given the networkx interaction graph, calculate various network centrality measures to identify the most relevant PTMs or genes in the network
get_protein_interaction_network
(protein)Given a specific protein, return the network data for that protein
plot_interaction_network
([modified_color, ...])Given the interactiong graph and network data outputted from analyze.get_interaction_network, plot the interaction network, signifying which proteins or ptms are altered by splicing and the specific regulation change that occurs.
plot_nease_comparison
([ax, nease_edges])Given the comparison data generated by compare_to_nease, plot the number of edges identified by NEASE and PTM-POSE
plot_network_centrality
([...])Plot the centrality measure for the top N proteins in the interaction network based on a specified centrality measure.
summarize_protein_network
(protein)Given a protein of interest, summarize the network data for that protein
- compare_to_nease(nease_edges)[source]#
Given the network edges generated by NEASE, compare the network edges generated by NEASE to the network edges generated by the PTM-driven interactions
- Parameters:
- nease_edgespd.DataFrame
Interactions found by NEASE, which is the output of nease.get_edges() function after running NEASE (see nease_runner module for example)
- Returns:
- nease_comparisonpd.DataFrame
Dataframe containing the all edges found by NEASE and PTM-POSE. This will include edges that are unique to NEASE, unique to PTM-POSE, and common between the two.
- get_interaction_network(node_type='Gene')[source]#
Given the spliced PTM data, extract interaction information and construct a dataframe containing all possible interactions driven by PTMs, either centered around specific PTMs or specific genes.
- Parameters:
- node_type: str
What to define interactions by. Can either be by ‘PTM’, which will consider each PTM as a separate node, or by ‘Gene’, which will aggregate information across all PTMs of a single gene into a single node. Default is ‘Gene’
- get_interaction_stats()[source]#
Given the networkx interaction graph, calculate various network centrality measures to identify the most relevant PTMs or genes in the network
- get_protein_interaction_network(protein)[source]#
Given a specific protein, return the network data for that protein
- Parameters:
- protein: str
Gene name of the protein of interest
- Returns:
- protein_network: pd.DataFrame
Dataframe containing network data for the protein of interest
- plot_interaction_network(modified_color='red', modified_node_size=10, interacting_color='lightblue', interacting_node_size=5, defaultedgecolor='gray', color_edges_by='Same', seed=200, ax=None, proteins_to_label=None, labelcolor='black', legend=True, include_nease_comparison=False)[source]#
Given the interactiong graph and network data outputted from analyze.get_interaction_network, plot the interaction network, signifying which proteins or ptms are altered by splicing and the specific regulation change that occurs. by default, will only label proteins
- Parameters:
- modified_color: str
Color to use for nodes that are modified by splicing. Default is ‘red’
- modified_node_size: int
Size of nodes that are modified by splicing. Default is 10
- interacting_color: str
Color to use for nodes that interact with modified nodes. Default is ‘lightblue’
- interacting_node_size: int
Size of nodes that interact with modified nodes. Default is 5
- defaultedgecolor: str
Color to use for edges in the network. Default is ‘gray’. Can choose to color by database by providing a dictionary with database names as keys and colors as values or by specifying ‘database’ as color
- color_edges_by: str
How to color edges in the network. Default is ‘Same’, which will color all edges the same color. Can also specify ‘Database’ to color edges based on the database they are from. If using ‘Database’, please provide a dictionary with database names as keys and colors as values in defaultedgecolor parameter.
- seed: int
Seed to use for random number generator. Default is 200
- ax: matplotlib.pyplot.Axes
Axes object to plot the network. Default is None
- plot_nease_comparison(ax=None, nease_edges=None)[source]#
Given the comparison data generated by compare_to_nease, plot the number of edges identified by NEASE and PTM-POSE
- Parameters:
- axmatplotlib.pyplot.Axes
axes to plot on
- nease_edgespd.DataFrame, optional
Interactions found by NEASE, which is the output of nease.get_edges() function after running NEASE (see nease_runner module for example). Only needed if you have not run comparison_to_nease() previously
- plot_network_centrality(centrality_measure='Degree', top_N=10, modified_color='red', interacting_color='black', ax=None)[source]#
Plot the centrality measure for the top N proteins in the interaction network based on a specified centrality measure. This will help identify the most relevant PTMs/genes in the network.
- Parameters:
- centrality_measure: str
How to calculate centrality. Options include ‘Degree’, ‘Degree Centrality’, ‘Betweenness Centrality’, ‘Closeness Centrality’, and ‘Eigenvector Centrality’. Default is ‘Degree’.
- top_N: int
Number of top proteins to plot based on the centrality measure. Default is 10.
- modified_color: str
Color to use for proteins that are spliced. Default is ‘red’.
- interacting_color: str
Color to use for proteins that are not spliced. Default is ‘black’.
- ax: matplotlib.pyplot.Axes
Axes object to plot the centrality bar plot. If None, a new figure will be created. Default is None.
Enzyme Regulation#
- ptm_pose.analyze.enzyme.compare_KL_for_sequence(inclusion_seq, exclusion_seq, dpsi=None, comparison_type='percentile')[source]#
Given two sequences, compare the kinase library scores, percentiles, or ranks for each sequence. Optionally, provide a dPSI value to calculate the relative change in preference for each kinase.
- Parameters:
- inclusion_seqstr
sequence to score for inclusion preference, with modification lowercased
- exclusion_seqstr
sequence to score for exclusion preference, with modification lowercased
- dpsifloat
dPSI value for the PTM event, which will be used to calculate the relative change in preference for each kinase (score difference * dPSI). Default is None.
- comparison_typestr
type of comparison to perform. Can be ‘percentile’, ‘score’, or ‘rank’. Default is ‘percentile’.
- ptm_pose.analyze.enzyme.get_all_KL_scores(seq_data, seq_col, kin_type=['ser_thr', 'tyrosine'], score_type='percentiles')[source]#
Given a dataset with flanking sequences, score each flanking sequence
- Parameters:
- seq_datapandas dataframe
processed dataframe containing flanking sequences to score
- seq_colstr
column in seq_data containing the flanking sequences
- kin_typelist
list of kinase types to score. Can be ‘ser_thr’ or ‘ST’ for serine/threonine kinases, or ‘tyrosine’ or ‘Y’ for tyrosine kinases. Default is [‘ser_thr’, ‘tyrosine’].
- score_typestr
type of score to calculate. Can be ‘percentile’, ‘score’, or ‘rank’. Default is ‘percentile’.
- Returns:
- merged_datadict
dictionary containing the merged dataframes for each kinase type with KinaseLibrary scores
Flanking Sequences#
- ptm_pose.analyze.flanking_sequences.compare_inclusion_motifs(flanking_sequences, elm_classes=None)[source]#
Given a DataFrame containing flanking sequences with changes and a DataFrame containing ELM class information, identify motifs that are found in the inclusion and exclusion events, identifying motifs unique to each case. This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).
- Parameters:
- flanking_sequences: pandas.DataFrame
DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data()
- elm_classes: pandas.DataFrame
DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv). Recommended to download this file and input it manually, but will download from ELM otherwise
- Returns:
- flanking_sequences: pandas.DataFrame
DataFrame containing flanking sequences with changes and motifs found in the inclusion and exclusion events
- ptm_pose.analyze.flanking_sequences.findAlteredPositions(seq1, seq2, flank_size=5)[source]#
Given two sequences, identify the location of positions that have changed
- Parameters:
- seq1, seq2: str
sequences to compare (order does not matter)
- flank_size: int
size of the flanking sequences (default is 5). This is used to make sure the provided sequences are the correct length
- Returns:
- altered_positions: list
list of positions that have changed
- residue_change: list
list of residues that have changed associated with that position
- flank_side: str
indicates which side of the flanking sequence the change has occurred (N-term, C-term, or Both)
- ptm_pose.analyze.flanking_sequences.find_motifs(seq, elm_classes)[source]#
Given a sequence and a dataframe containinn ELM class information, identify motifs that can be found in the provided sequence using the RegEx expression provided by ELM (PTMs not considered). This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (i.e. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of elm (http://elm.eu.org/elms/elms_index.tsv).
- Parameters:
- seq: str
sequence to search for motifs
- elm_classes: pandas.DataFrame
DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv)
- ptm_pose.analyze.flanking_sequences.getSequenceIdentity(seq1, seq2)[source]#
Given two flanking sequences, calculate the sequence identity between them using Biopython and parameters definded by Pillman et al. BMC Bioinformatics 2011
- Parameters:
- seq1, seq2: str
flanking sequence
- Returns:
- normalized_score: float
normalized score of sequence similarity between flanking sequences (calculated similarity/max possible similarity)
- ptm_pose.analyze.flanking_sequences.plot_alterations_matrix(altered_flanks, modification_class=None, residue=None, title='', ax=None)[source]#
Given the altered flanking sequences dataframe, plot a matrix showing the positions of altered residues for specific proteins, as well as the specific change
- Parameters:
- altered_flanks: pd.DataFrame
Dataframe with altered flanking sequences, and annotated information added with analyze.compare_flanking_sequences
- modification_class: str
Specific modification class to plot. Default is None, which will plot all modification classes.
- residue: str
Specific residue to plot. Default is None, which will plot all residues.
- title: str
Title of the plot. Default is ‘’ (no title).
- ax: matplotlib.Axes
Axis to plot on. If None, will create new figure. Default is None.
- ptm_pose.analyze.flanking_sequences.plot_location_of_altered_flanking_residues(altered_flanks, figsize=(4, 3), modification_class=None, residue=None)[source]#
Plot the number of PTMs with altered residues as specific positions relative to the PTM site. This includes the specific position of the residue (-5 to +5 from PTM site) and the specific side of the PTM site that is altered (N-term or C-term)
- Parameters:
- altered_flanks: pd.DataFrame
Dataframe with altered flanking sequences, and annotated information added with analyze.compare_flanking_sequences
- figsize: tuple
Size of the figure. Default is (4,3).
- modification_class: str
Specific modification class to plot. Default is None, which will plot all modification classes.
- residue: str
Specific residue to plot. Default is None, which will plot all residues.
- ptm_pose.analyze.flanking_sequences.plot_sequence_differences(inclusion_seq, exclusion_seq, dpsi=None, flank_size=5, figsize=(3, 1))[source]#
Given the flanking sequences for a PTM resulting from a specific splice event, plot the differences between the two sequences, coloring the changing residues in red. If dPSI is also provided, will add an arrow to the plot indicating the direction of change
- Parameters:
- inclusion_seq: str
Sequence of the inclusion isoform (with spliced region included)
- exclusion_seq: str
Sequence of the exclusion isoform (with spliced region excluded)
- dpsi: float
Change in PSI for the specific splice event. Default is None, which will not add an arrow to the plot.
- flank_size: int
Size of flanking region to plot. Default is 5. This must be less than half the length of the shortest sequence.
- figsize: tuple
Size of the figure. Default is (3,1).