PTM-POSE Reference#
Configuration#
- ptm_pose.pose_config.download_ptm_coordinates(save=False, max_retries=5, delay=10)[source]#
Download the ptm_coordinates dataframe from GitHub Large File Storage (LFS). By default, the file is not saved locally because of its large size (saving is not forced on users, though it is highly encouraged). Set save to True to keep a local copy.
- Parameters:
- save: bool, optional
Whether to save the file locally into Resource Files directory. The default is False.
- max_retries: int, optional
Number of times to attempt to download the file. The default is 5.
- delay: int, optional
Time to wait between download attempts. The default is 10.
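The retry behaviour controlled by max_retries and delay can be pictured with the sketch below. It is not the package's implementation: `fetch_with_retries` and its `fetch` and `sleep` arguments are hypothetical names, and the delay is assumed to be in seconds.

```python
import time


def fetch_with_retries(fetch, max_retries=5, delay=10, sleep=time.sleep):
    """Call fetch() until it succeeds or max_retries attempts have been made.

    Hypothetical helper: `fetch` is any zero-argument callable that raises on
    failure, and `sleep` is injectable so the wait can be skipped in tests.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as err:  # a real downloader would catch narrower errors
            last_error = err
            if attempt < max_retries:
                sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"download failed after {max_retries} attempts") from last_error
```

Waiting only between attempts (not after the final one) keeps the total wait at `(max_retries - 1) * delay` in the worst case.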
PTM Projection#
- ptm_pose.project.project_ptms_onto_MATS(ptm_coordinates=None, SE_events=None, fiveASS_events=None, threeASS_events=None, RI_events=None, MXE_events=None, coordinate_type='hg38', identify_flanking_sequences=False, dPSI_col='meanDeltaPSI', sig_col='FDR', separate_modification_types=False, PROCESSES=1)[source]#
Given splice quantification from the MATS algorithm, annotate with PTMs that are found in the differentially included regions.
- Parameters:
- ptm_coordinates: pandas.DataFrame
dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs
- SE_events: pandas.DataFrame
dataframe containing skipped exon event information from MATS
- fiveASS_events: pandas.DataFrame
dataframe containing 5’ alternative splice site event information from MATS
- threeASS_events: pandas.DataFrame
dataframe containing 3’ alternative splice site event information from MATS
- RI_events: pandas.DataFrame
dataframe containing retained intron event information from MATS
- MXE_events: pandas.DataFrame
dataframe containing mutually exclusive exon event information from MATS
- coordinate_type: str
indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
- identify_flanking_sequences: bool
Indicate whether to look for altered flanking sequences from spliced events, in addition to those directly in the spliced region. Default is False. (not yet active)
- PROCESSES: int
Number of processes to use for multiprocessing. Default is 1.
- ptm_pose.project.project_ptms_onto_splice_events(splice_data, ptm_coordinates=None, annotate_original_df=True, chromosome_col='chr', strand_col='strand', region_start_col='exonStart_0base', region_end_col='exonEnd', dPSI_col=None, sig_col=None, event_id_col=None, gene_col=None, extra_cols=None, separate_modification_types=False, coordinate_type='hg38', taskbar_label=None, PROCESSES=1)[source]#
Given splice event quantification data, project PTMs onto the regions impacted by the splice events. Assumes that the splice event data will have chromosome, strand, and genomic start/end positions for the regions of interest, and that each row of splice_data corresponds to a unique region.
- Parameters:
- splice_data: pandas.DataFrame
dataframe containing splice event information, including chromosome, strand, and genomic location of regions of interest
- ptm_coordinates: pandas.DataFrame
dataframe containing PTM information, including chromosome, strand, and genomic location of PTMs. If none, it will pull from the config file.
- chromosome_col: str
column name in splice_data that contains chromosome information. Default is ‘chr’. Expects it to be a str with only the chromosome number: ‘Y’, ‘1’, ‘2’, etc.
- strand_col: str
column name in splice_data that contains strand information. Default is ‘strand’. Expects it to be a str with ‘+’ or ‘-’, or integers as 1 or -1. Will convert to integers automatically if string format is provided.
- region_start_col: str
column name in splice_data that contains the start position of the region of interest. Default is ‘exonStart_0base’.
- region_end_col: str
column name in splice_data that contains the end position of the region of interest. Default is ‘exonEnd’.
- event_id_col: str
column name in splice_data that contains the unique identifier for the splice event. If provided, will be used to annotate the ptm information with the specific splice event ID. Default is None.
- gene_col: str
column name in splice_data that contains the gene name. If provided, will be used to make sure the projected PTMs stem from the same gene (in some cases, genomic coordinates overlap between distinct genes). Default is None.
- dPSI_col: str
column name in splice_data that contains the delta PSI value for the splice event. Default is None, which will not include this information in the output.
- sig_col: str
column name in splice_data that contains the significance value for the splice event. Default is None, which will not include this information in the output.
- extra_cols: list
list of additional columns to include in the output dataframe. Default is None, which will not include any additional columns.
- coordinate_type: str
indicates the coordinate system used for the start and end positions. Either hg38 or hg19. Default is ‘hg38’.
- separate_modification_types: bool
Indicate whether to store PTM sites with multiple modification types as multiple rows. For example, if a site at K100 was both an acetylation and methylation site, these will be separated into unique rows with the same site number but different modification types. Default is False.
- taskbar_label: str
Label to display in the tqdm progress bar. Default is None, which will automatically state “Projecting PTMs onto regions using —– coordinates”.
- PROCESSES: int
Number of processes to use for multiprocessing. Default is 1 (single processing).
- Returns:
- spliced_ptm_info: pandas.DataFrame
Contains the PTMs identified across the different splice events
- splice_data: pandas.DataFrame
dataframe containing the original splice data with an additional column ‘PTMs’ that contains the PTMs found in the region of interest, in the format of ‘SiteNumber(ModificationType)’. If no PTMs are found, the value will be np.nan.
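Conceptually, projection is an interval lookup: a PTM belongs to a region when the chromosome and strand match and its genomic location falls between the region's start and end. The sketch below illustrates that idea with plain dictionaries rather than DataFrames. The function names, dictionary keys, inclusive boundaries, and the ';' separator are all assumptions made for illustration, and `None` stands in for np.nan.

```python
def normalize_strand(strand):
    """Map '+'/'-' (or 1/-1) to the integers 1/-1."""
    if strand in ("+", 1):
        return 1
    if strand in ("-", -1):
        return -1
    raise ValueError(f"unrecognized strand: {strand!r}")


def project_sites_onto_regions(sites, regions):
    """For each region, label the PTM sites located inside it.

    Returns one entry per region: a 'SiteNumber(ModificationType)' string
    (';'-joined if several sites match) or None when no site is found.
    """
    labels = []
    for region in regions:
        start, end = sorted((region["start"], region["end"]))
        strand = normalize_strand(region["strand"])
        found = [
            f"{site['site']}({site['mod']})"
            for site in sites
            if site["chr"] == region["chr"]
            and normalize_strand(site["strand"]) == strand
            and start <= site["loc"] <= end  # inclusive bounds (assumption)
        ]
        labels.append(";".join(found) if found else None)
    return labels
```

A linear scan like this is fine for a handful of regions. For genome-scale data, a vectorized or indexed lookup would be the more practical choice.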
Flanking Sequences#
- ptm_pose.flanking_sequences.extract_region_from_splicegraph(splicegraph, region_id)[source]#
Given a region id and the splicegraph from SpliceSeq, extract the chromosome, strand, and start and stop locations of that exon. Start and stop are forced to be in ascending order, which is not necessarily true from the splice graph (i.e. start > stop for negative strand exons). This is done to make the region extraction consistent with the rest of the codebase.
- Parameters:
- splicegraph: pandas.DataFrame
SpliceSeq splicegraph dataframe, with region_id as index
- region_id: str
Region ID to extract information from, in the format of ‘GeneName_ExonNumber’
- Returns:
- list
List containing the chromosome, strand (1 for forward, -1 for negative), start, and stop locations of the region
- ptm_pose.flanking_sequences.get_flanking_changes(ptm_coordinates, chromosome, strand, first_flank_region, spliced_region, second_flank_region, gene=None, dPSI=None, sig=None, event_id=None, flank_size=5, coordinate_type='hg38', lowercase_mod=True, order_by='Coordinates')[source]#
Currently has been tested with MATS splicing events.
Given flanking and spliced regions associated with a splice event, identify PTMs that have potential to have an altered flanking sequence depending on whether spliced region is included or excluded (if PTM is close to splice boundary). For these PTMs, extract the flanking sequences associated with the inclusion and exclusion cases and translate into amino acid sequences. If the PTM is not associated with a codon that codes for the expected amino acid, the PTM will be excluded from the results.
- Parameters:
- ptm_coordinates: pandas.DataFrame
DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions
- chromosome: str
Chromosome associated with the splice event
- strand: int
Strand associated with the splice event (1 for forward, -1 for negative)
- first_flank_region: list
List containing the start and stop locations of the first flanking region (first is currently defined based on location in the genome, not coding sequence)
- spliced_region: list
List containing the start and stop locations of the spliced region
- second_flank_region: list
List containing the start and stop locations of the second flanking region (second is currently defined based on location in the genome, not coding sequence)
- event_id: str, optional
Event ID associated with the splice event, by default None
- flank_size: int, optional
Number of amino acids to include flanking the PTM, by default 5
- coordinate_type: str, optional
Coordinate system used for the regions, by default ‘hg38’. The other option is hg19.
- lowercase_mod: bool, optional
Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
- order_by: str, optional
Whether the first, spliced, and second regions are defined by their genomic coordinates (first has the smallest coordinate, spliced next, then second) or by their translation order (first is translated first, and so on), by default ‘Coordinates’
- Returns:
- pandas.DataFrame
DataFrame containing the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases
- ptm_pose.flanking_sequences.get_flanking_changes_from_splice_data(splice_data, ptm_coordinates=None, chromosome_col=None, strand_col=None, first_flank_start_col=None, first_flank_end_col=None, spliced_region_start_col=None, spliced_region_end_col=None, second_flank_start_col=None, second_flank_end_col=None, dPSI_col=None, sig_col=None, event_id_col=None, gene_col=None, flank_size=5, coordinate_type='hg38', lowercase_mod=True)[source]#
Given a DataFrame containing information about splice events, extract the flanking sequences associated with the PTMs in the flanking regions if there is potential for this to be altered. The DataFrame should contain columns for the chromosome, strand, start and stop locations of the first flanking region, spliced region, and second flanking region. The DataFrame should also contain a column for the event ID associated with the splice event. If the DataFrame does not contain the necessary columns, the function will raise an error.
- Parameters:
- splice_data: pandas.DataFrame
DataFrame containing information about splice events
- ptm_coordinates: pandas.DataFrame
DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions
- chromosome_col: str, optional
Column name indicating chromosome, by default None
- strand_col: str, optional
Column name indicating strand, by default None
- first_flank_start_col: str, optional
Column name indicating start location of the first flanking region, by default None
- first_flank_end_col: str, optional
Column name indicating end location of the first flanking region, by default None
- spliced_region_start_col: str, optional
Column name indicating start location of the spliced region, by default None
- spliced_region_end_col: str, optional
Column name indicating end location of the spliced region, by default None
- second_flank_start_col: str, optional
Column name indicating start location of the second flanking region, by default None
- second_flank_end_col: str, optional
Column name indicating end location of the second flanking region, by default None
- event_id_col: str, optional
Column name indicating event ID, by default None
- flank_size: int, optional
Number of amino acids to include flanking the PTM, by default 5
- coordinate_type: str, optional
Coordinate system used for the regions, by default ‘hg38’. The other option is hg19.
- lowercase_mod: bool, optional
Whether to lowercase the amino acid associated with the PTM in returned flanking sequences, by default True
- Returns:
- list
List containing DataFrames with the PTMs associated with the flanking regions and the amino acid sequences of the flanking regions in the inclusion and exclusion cases
- ptm_pose.flanking_sequences.get_flanking_changes_from_splicegraph(psi_data, splicegraph, ptm_coordinates=None, dPSI_col=None, sig_col=None, event_id_col=None, extra_cols=None, gene_col='symbol', flank_size=5, coordinate_type='hg19')[source]#
Given a DataFrame containing information about splice events obtained from SpliceSeq and the corresponding splicegraph, extract the flanking sequences of PTMs that are near the splice boundary (where the flanking sequence could be altered). Coordinate information for individual exons should be found in splicegraph. You can also provide columns with specific PSI or significance information. Extra columns not in these categories can be provided with the extra_cols parameter.
- Parameters:
- psi_data: pandas.DataFrame
DataFrame containing information about splice events obtained from SpliceSeq
- splicegraph: pandas.DataFrame
DataFrame containing information about individual exons and their coordinates
- ptm_coordinates: pandas.DataFrame
DataFrame containing PTM coordinate information for identifying PTMs in the flanking regions
- dPSI_col: str, optional
Column name indicating delta PSI value, by default None
- sig_col: str, optional
Column name indicating significance of the event, by default None
- event_id_col: str, optional
Column name indicating event ID, by default None
- extra_cols: list, optional
List of column names for additional information to add to the results, by default None
- gene_col: str, optional
Column name indicating gene symbol of spliced gene, by default ‘symbol’
- flank_size: int, optional
Number of amino acids to include flanking the PTM, by default 5
- coordinate_type: str, optional
Coordinate system used for the regions, by default ‘hg19’. The other option is hg38.
- Returns:
- altered_flanks: pandas.DataFrame
DataFrame containing the PTMs associated with the flanking regions that are altered, and the flanking sequences that arise depending on whether the flanking sequence is included or not
- ptm_pose.flanking_sequences.get_flanking_sequence(ptm_loc, seq, ptm_residue, flank_size=5, lowercase_mod=True, full_flanking_seq=False)[source]#
Given a PTM location in a sequence of DNA, extract the flanking sequence around the PTM location and translate into the amino acid sequence. If the sequence is not the correct length, the function will attempt to extract the flanking sequence with spaces to account for missing parts if full_flanking_seq is not True. If the sequence is still not the correct length, the function will raise an error. Codons that are not in the standard codon table will be translated as ‘X’ (unknown), and stop codons will be translated as ‘*’.
- Parameters:
- ptm_loc: int
Location of the first base pair associated with PTM in the DNA sequence
- seq: str
DNA sequence containing the PTM
- ptm_residue: str
Amino acid residue associated with the PTM
- flank_size: int, optional
Number of amino acids to include flanking the PTM, by default 5
- lowercase_mod: bool, optional
Whether to lowercase the amino acid associated with the PTM, by default True
- full_flanking_seq: bool, optional
Whether to require the flanking sequence to be the correct length, by default False
- Returns:
- str
Amino acid sequence of the flanking sequence around the PTM if translation was successful, otherwise np.nan
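To make the translation step concrete, here is a minimal sketch of extracting and translating a flanking window around a PTM codon. It is an illustration under stated assumptions, not the package's code: `flanking_sequence` is a hypothetical name, ptm_loc is assumed to be a 0-based index of the first base of the PTM codon, `None` stands in for np.nan, and the space-padding behaviour for short sequences is left out.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def flanking_sequence(ptm_loc, seq, ptm_residue, flank_size=5, lowercase_mod=True):
    """Translate flank_size codons on each side of the PTM codon."""
    start = ptm_loc - 3 * flank_size
    end = ptm_loc + 3 * (flank_size + 1)
    if start < 0 or end > len(seq):
        return None  # not enough sequence for a full window
    window = seq[start:end].upper()
    # Unknown codons become 'X'; stop codons are '*' in the table itself.
    protein = "".join(
        CODON_TABLE.get(window[i:i + 3], "X") for i in range(0, len(window), 3)
    )
    if protein[flank_size] != ptm_residue:
        return None  # codon does not encode the expected residue
    if lowercase_mod:
        protein = (
            protein[:flank_size]
            + protein[flank_size].lower()
            + protein[flank_size + 1:]
        )
    return protein
```

For example, with `seq = "GCTGCTTCTGGTGGT"`, `flanking_sequence(6, seq, "S", flank_size=2)` translates the five codons around the serine codon and lowercases the modified residue.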
- ptm_pose.flanking_sequences.get_ptm_locs_in_spliced_sequences(ptm_loc_in_flank, first_flank_seq, spliced_seq, second_flank_seq, strand, which_flank='First', order_by='Coordinates')[source]#
Given the location of a PTM in a flanking sequence, extract the location of the PTM in the Inclusion Flanking Sequence and the Exclusion Flanking Sequence associated with a given splice event. Inclusion Flanking Sequence will include the skipped exon region, retained intron, or longer alternative splice site depending on event type. The PTM location should be associated with where the PTM is located relative to spliced region (before = ‘First’, after = ‘Second’).
- Parameters:
- ptm_loc_in_flank: int
Location of the PTM in the flanking sequence it is found (either first or second)
- first_flank_seq: str
Flanking exon sequence before the spliced region
- spliced_seq: str
Spliced region sequence
- second_flank_seq: str
Flanking exon sequence after the spliced region
- which_flank: str, optional
Which flank the PTM is associated with, by default ‘First’
- order_by: str, optional
Whether the first, spliced, and second regions are defined by their genomic coordinates (first has the smallest coordinate, spliced next, then second) or by their translation order (first is translated first, and so on), by default ‘Coordinates’
- Returns:
- tuple
Tuple containing the PTM location in the Inclusion Flanking Sequence and the Exclusion Flanking Sequence
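The relationship between the two returned locations can be sketched as follows. This is a simplified illustration, assuming the three sequences are already ordered by translation (so strand handling is not needed) and that locations are 0-based string indices. The function name is hypothetical.

```python
def ptm_locs_in_spliced_sequences(ptm_loc_in_flank, first_flank_seq, spliced_seq,
                                  second_flank_seq, which_flank="First"):
    """Return (inclusion_loc, exclusion_loc) for a PTM in one of the flanks.

    Inclusion sequence = first flank + spliced region + second flank.
    Exclusion sequence = first flank + second flank.
    """
    if which_flank == "First":
        if not 0 <= ptm_loc_in_flank < len(first_flank_seq):
            raise ValueError("PTM location is outside the first flank")
        # Nothing precedes the first flank, so the location is unchanged.
        inclusion_loc = exclusion_loc = ptm_loc_in_flank
    elif which_flank == "Second":
        if not 0 <= ptm_loc_in_flank < len(second_flank_seq):
            raise ValueError("PTM location is outside the second flank")
        exclusion_loc = len(first_flank_seq) + ptm_loc_in_flank
        inclusion_loc = exclusion_loc + len(spliced_seq)
    else:
        raise ValueError("which_flank must be 'First' or 'Second'")
    return inclusion_loc, exclusion_loc
```

For a PTM in the second flank, the two locations differ by exactly the length of the spliced region, which is why its flanking sequence can change between the inclusion and exclusion cases.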
- ptm_pose.flanking_sequences.get_ptms_in_splicegraph_flank(gene_name, chromosome, strand, flank_region_start, flank_region_end, coordinate_type='hg19', which_flank='First', flank_size=5)[source]#
- ptm_pose.flanking_sequences.get_spliceseq_event_regions(gene_name, from_exon, spliced_exons, to_exon, splicegraph)[source]#
Given all exons associated with a splicegraph event, obtain the coordinates associated with the flanking exons and the spliced region. The spliced region is defined as the exons that are associated with psi values, while flanking regions include the “from” and “to” exons that indicate the adjacent, unspliced exons.
- Parameters:
- gene_name: str
Gene name associated with the splice event
- from_exon: int
Exon number associated with the first flanking exon
- spliced_exons: str
Exon numbers associated with the spliced region, separated by colons for each unique exon
- to_exon: int
Exon number associated with the second flanking exon
- splicegraph: pandas.DataFrame
DataFrame containing information about individual exons and their coordinates
- Returns:
- tuple
Tuple containing the genomic coordinates of the first flanking region, spliced regions, and second flanking region
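As a rough illustration of how the exon arguments map to coordinates, the sketch below looks up each exon by a ‘GeneName_ExonNumber’ key and splits the colon-separated spliced exons. A plain dictionary stands in for the splicegraph DataFrame, and `event_regions` and `exon_coords` are hypothetical names. The gene and coordinates in the usage example are made up.

```python
def event_regions(gene_name, from_exon, spliced_exons, to_exon, exon_coords):
    """Return (first_flank, spliced_regions, second_flank) coordinates.

    exon_coords maps 'GeneName_ExonNumber' to a (start, stop) pair. Each
    pair is returned in ascending order, regardless of strand.
    """
    def lookup(exon):
        start, stop = exon_coords[f"{gene_name}_{exon}"]
        return sorted((start, stop))

    first_flank = lookup(from_exon)
    spliced_regions = [lookup(exon) for exon in spliced_exons.split(":")]
    second_flank = lookup(to_exon)
    return first_flank, spliced_regions, second_flank
```

Usage, with invented coordinates:

```python
coords = {"GENE_1": (100, 200), "GENE_2": (400, 300), "GENE_3": (500, 600)}
event_regions("GENE", 1, "2", 3, coords)
```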
- ptm_pose.flanking_sequences.get_spliceseq_flank_loc(ptm, strand, from_region_coords, to_region_coords, coordinate_type='hg19')[source]#
Given ptm information for identifying flanking sequences from splicegraph information, extract the relative location of the ptm in the flanking region (where it is located in translation of the flanking region).
- Parameters:
- ptm: pandas.Series
Series containing PTM information
- strand: int
Strand associated with the splice event (1 for forward, -1 for negative)
- from_region_coords: list
List containing the chromosome, strand, start, and stop locations of the first flanking region
- to_region_coords: list
List containing the chromosome, strand, start, and stop locations of the second flanking region
- Returns:
- int
Relative location of the PTM in the flanking region
- ptm_pose.flanking_sequences.translate_flanking_sequence(seq, flank_size=7, full_flanking_seq=True, lowercase_mod=True, first_flank_length=None, stop_codon_symbol='*', unknown_codon_symbol='X')[source]#
Given a DNA sequence, translate the sequence into an amino acid sequence. If the sequence is not the correct length, the function will attempt to extract the flanking sequence with spaces to account for missing parts if full_flanking_seq is not True. If the sequence is still not the correct length, the function will raise an error. Codons that are not in the standard codon table will be translated as ‘X’ (unknown), and stop codons will be translated as ‘*’.
- Parameters:
- seq: str
DNA sequence to translate
- flank_size: int, optional
Number of amino acids to include flanking the PTM, by default 7
- full_flanking_seq: bool, optional
Whether to require the flanking sequence to be the correct length, by default True
- lowercase_mod: bool, optional
Whether to lowercase the amino acid associated with the PTM, by default True
- first_flank_length: int, optional
Length of the flanking sequence in front of the PTM, by default None. If full_flanking_seq is False and sequence is not the correct length, this is required.
- stop_codon_symbol: str, optional
Symbol to use for stop codons, by default ‘*’
- unknown_codon_symbol: str, optional
Symbol to use for unknown codons, by default ‘X’
- Returns:
- str
Amino acid sequence of the flanking sequence if translation was successful, otherwise np.nan
Annotating PTMs#
- ptm_pose.annotate.add_ELM_interactions(spliced_ptms, file=None, report_success=True)[source]#
Given a spliced ptms dataframe from the project module, add ELM interaction data to the dataframe
- ptm_pose.annotate.add_PSP_disease_association(spliced_ptms, file='Disease-associated_sites.gz', report_success=True)[source]#
Process disease association data from PhosphoSitePlus (Disease-associated_sites.gz), and add to spliced_ptms dataframe from project_ptms_onto_splice_events() function
- Parameters:
- file: str
Path to the PhosphoSitePlus Disease-associated_sites.gz file. Should be downloaded from PhosphoSitePlus in the zipped format
- Returns:
- spliced_ptms: pandas.DataFrame
Contains the PTMs identified across the different splice events with an additional column indicating the diseases associated with that site
- ptm_pose.annotate.add_PSP_kinase_substrate_data(spliced_ptms, file='Kinase_Substrate_Dataset.gz', report_success=True)[source]#
Add kinase substrate data from PhosphoSitePlus (Kinase_Substrate_Dataset.gz) to spliced_ptms dataframe from project_ptms_onto_splice_events() function
- Parameters:
- file: str
Path to the PhosphoSitePlus Kinase_Substrate_Dataset.gz file. Should be downloaded from PhosphoSitePlus in the zipped format
- Returns:
- spliced_ptms: pandas.DataFrame
Contains the PTMs identified across the different splice events with an additional column indicating the kinases known to phosphorylate that site (not relevant to non-phosphorylation PTMs)
- ptm_pose.annotate.add_PSP_regulatory_site_data(spliced_ptms, file='Regulatory_sites.gz', report_success=True)[source]#
Add functional information from PhosphoSitePlus (Regulatory_sites.gz) to spliced_ptms dataframe from project_ptms_onto_splice_events() function
- Parameters:
- file: str
Path to the PhosphoSitePlus Regulatory_sites.gz file. Should be downloaded from PhosphoSitePlus in the zipped format
- Returns:
- spliced_ptms: pandas.DataFrame
Contains the PTMs identified across the different splice events with additional columns for regulatory site information, including domains, biological process, functions, and protein interactions associated with the PTMs
- ptm_pose.annotate.add_PTMInt_data(spliced_ptms, file=None, report_success=True)[source]#
Given spliced_ptms data from project module, add PTMInt interaction data, which will include the protein that is being interacted with, whether it enhances or inhibits binding, and the localization of the interaction. This will be added as a new column labeled PTMInt:Interactions and each entry will be formatted like ‘Protein->Effect|Localization’. If there are multiple interactions, they will be separated by a semicolon.
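Entries in the ‘Protein->Effect|Localization’ format can be split back into their parts with a few string operations. The helper below is a hypothetical convenience, not part of the package, and the protein, effect, and localization values in the usage example are invented.

```python
def parse_ptmint_interactions(entry):
    """Split a 'Protein->Effect|Localization;...' string into dictionaries."""
    interactions = []
    for item in entry.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate empty pieces, e.g. a trailing semicolon
        protein, _, rest = item.partition("->")
        effect, _, localization = rest.partition("|")
        interactions.append(
            {"protein": protein, "effect": effect, "localization": localization}
        )
    return interactions
```

Usage:

```python
parse_ptmint_interactions("MDM2->Inhibit|Nucleus;YWHAZ->Enhance|Cytoplasm")
```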
- ptm_pose.annotate.add_annotation(spliced_ptms, database='PhosphoSitePlus', annotation_type='Function', file=None, check_existing=False)[source]#
Given a desired database and annotation type, add the corresponding annotation data to the spliced ptm dataframe
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe containing PTM data
- database: str
Database to extract annotation data from. Options include ‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’
- annotation_type: str
Type of annotation to extract. Options include ‘Function’, ‘Process’, ‘Interactions’, ‘Disease’, ‘Kinase’, ‘Phosphatase’, but the available options depend on the specific database (run analyze.get_annotation_categories() to see them)
- file: str
File path to annotation data. If None, will download from online source, except for PhosphoSitePlus (due to licensing restrictions)
- ptm_pose.annotate.add_custom_annotation(spliced_ptms, annotation_data, source_name, annotation_type, annotation_col, accession_col='UniProtKB Accession', residue_col='Residue', position_col='PTM Position in Canonical Isoform')[source]#
Add custom annotation data to spliced_ptms or altered flanking sequence dataframes
- Parameters:
- annotation_data: pandas.DataFrame
Dataframe containing the annotation data to be added to the spliced_ptms dataframe. Must contain columns for UniProtKB Accession, Residue, PTM Position in Canonical Isoform, and the annotation data to be added
- source_name: str
Name of the source of the annotation data, will be used to label the columns in the spliced_ptms dataframe
- annotation_type: str
Type of annotation data being added, will be used to label the columns in the spliced_ptms dataframe
- annotation_col: str
Column name in the annotation data that contains the annotation data to be added to the spliced_ptms dataframe
- Returns:
- spliced_ptms: pandas.DataFrame
Contains the PTMs identified across the different splice events with an additional column for the custom annotation data
- ptm_pose.annotate.annotate_ptms(spliced_ptms, psp_regulatory_site_file=None, psp_ks_file=None, psp_disease_file=None, elm_interactions=False, elm_motifs=False, PTMint=False, PTMcode_interprotein=False, DEPOD=False, RegPhos=False, ptmsigdb_file=None, interactions_to_combine=['PTMcode', 'PhosphoSitePlus', 'RegPhos', 'PTMInt'], kinases_to_combine=['PhosphoSitePlus', 'RegPhos'], combine_similar=True)[source]#
Given spliced ptm data, add annotations from various databases. The annotations that can be added are the following:
- PhosphoSitePlus
  - regulatory site data (file must be provided)
  - kinase-substrate data (file must be provided)
  - disease association data (file must be provided)
- ELM
  - interaction data (can be downloaded automatically or provided as a file)
  - motif matches (elm class data can be downloaded automatically or provided as a file)
- PTMInt
  - interaction data (will be downloaded automatically)
- PTMcode
  - intraprotein interactions (can be downloaded automatically or provided as a file)
  - interprotein interactions (can be downloaded automatically or provided as a file)
- DEPOD
  - phosphatase-substrate data (will be downloaded automatically)
- RegPhos
  - kinase-substrate data (will be downloaded automatically)
- Parameters:
- spliced_ptms: pd.DataFrame
Spliced PTM data from project module
- psp_regulatory_site_file: str
File path to PhosphoSitePlus regulatory site data
- psp_ks_file: str
File path to PhosphoSitePlus kinase-substrate data
- psp_disease_file: str
File path to PhosphoSitePlus disease association data
- elm_interactions: bool or str
If True, download ELM interaction data automatically. If str, provide file path to ELM interaction data
- elm_motifs: bool or str
If True, download ELM motif data automatically. If str, provide file path to ELM motif data
- PTMint: bool
If True, download PTMInt data automatically
- PTMcode_intraprotein: bool or str
If True, download PTMcode intraprotein data automatically. If str, provide file path to PTMcode intraprotein data
- PTMcode_interprotein: bool or str
If True, download PTMcode interprotein data automatically. If str, provide file path to PTMcode interprotein data
- DEPOD: bool
If True, download DEPOD data automatically
- RegPhos: bool
If True, download RegPhos data automatically
- ptmsigdb_file: str
File path to PTMsigDB data
- interactions_to_combine: list
List of databases to combine interaction data from. Default is [‘PTMcode’, ‘PhosphoSitePlus’, ‘RegPhos’, ‘PTMInt’]
- kinases_to_combine: list
List of databases to combine kinase-substrate data from. Default is [‘PhosphoSitePlus’, ‘RegPhos’]
- combine_similar: bool
Whether to combine annotations of similar information (kinase, interactions, etc) from multiple databases into another column labeled as ‘Combined’. Default is True
- ptm_pose.annotate.check_file(fname, expected_extension='.tsv')[source]#
Given a file name, check if the file exists and has the expected extension. If the file does not exist or has the wrong extension, raise an error.
- Parameters:
- fname: str
File name to check
- expected_extension: str
Expected file extension. Default is ‘.tsv’
- ptm_pose.annotate.combine_KS_data(spliced_ptms, ks_databases=['PhosphoSitePlus', 'RegPhos'], regphos_conversion={'ABL1(ABL)': 'ABL1', 'CDC2': 'CDK1', 'CK2A1': 'CSNK2A1', 'ERK1(MAPK3)': 'MAPK3', 'ERK2(MAPK1)': 'MAPK1', 'JNK2(MAPK9)': 'MAPK9', 'PKACA': 'PRKACA'})[source]#
Given spliced ptm information, combine kinase-substrate data from multiple databases (currently support PhosphoSitePlus and RegPhos), assuming that the kinase data from these resources has already been added to the spliced ptm data. The combined kinase data will be added as a new column labeled ‘Combined:Kinase’
- Parameters:
- spliced_ptms: pd.DataFrame
Spliced PTM data from project module
- ks_databases: list
List of databases to combine kinase data from. Currently support PhosphoSitePlus and RegPhos
- regphos_conversion: dict
Allows conversion of RegPhos names to matching names in PhosphoSitePlus.
- Returns:
- spliced_ptms: pd.DataFrame
Spliced PTM data with combined kinase data added
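The role of regphos_conversion is easiest to see on a single site: RegPhos kinase names are mapped to their PhosphoSitePlus equivalents before the two sets are merged, so the same kinase is not counted twice. The sketch below assumes kinases arrive as lists of names and are joined with ';'. Both the function name and the separator are illustrative assumptions.

```python
# Default mapping copied from the combine_KS_data signature.
REGPHOS_CONVERSION = {
    "ABL1(ABL)": "ABL1", "CDC2": "CDK1", "CK2A1": "CSNK2A1",
    "ERK1(MAPK3)": "MAPK3", "ERK2(MAPK1)": "MAPK1",
    "JNK2(MAPK9)": "MAPK9", "PKACA": "PRKACA",
}


def combine_kinases(psp_kinases, regphos_kinases, conversion=REGPHOS_CONVERSION):
    """Merge kinase names from both sources into one de-duplicated string."""
    combined = set(psp_kinases)
    # Names without a conversion entry are kept unchanged.
    combined.update(conversion.get(name, name) for name in regphos_kinases)
    return ";".join(sorted(combined))
```

Here `combine_kinases(["CDK1"], ["CDC2"])` yields a single kinase, because ‘CDC2’ is converted to ‘CDK1’ before merging.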
- ptm_pose.annotate.combine_interaction_data(spliced_ptms, interaction_databases=['PhosphoSitePlus', 'PTMcode', 'PTMInt', 'RegPhos', 'DEPOD', 'ELM'], include_enzyme_interactions=True)[source]#
Given annotated spliced ptm data, extract interaction data from various databases and combine into a single dataframe. This will include the interacting protein, the type of interaction, and the source of the interaction data
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe containing PTM data and associated interaction annotations from various databases
- interaction_databases: list
List of databases to extract interaction data from. Options include ‘PhosphoSitePlus’, ‘PTMcode’, ‘PTMInt’, ‘RegPhos’, ‘DEPOD’. These should already have annotation columns in the spliced_ptms dataframe, otherwise they will be ignored. For kinase-substrate interactions, if combined column is present, will use that instead of individual databases
- include_enzyme_interactions: bool
If True, will include kinase-substrate and phosphatase interactions in the output dataframe
- Returns:
- interact_data: list
List of dataframes containing PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES), and the source of the interaction data
- ptm_pose.annotate.convert_PSP_label_to_UniProt(label)[source]#
Given a label for an interacting protein from PhosphoSitePlus, convert to UniProtKB accession. Required as PhosphoSitePlus interactions are recorded in various ways that aren’t necessarily consistent with other databases (i.e. not always gene name)
- Parameters:
- label: str
Label for interacting protein from PhosphoSitePlus
- ptm_pose.annotate.extract_positions_from_DEPOD(x)[source]#
Given string object consisting of multiple modifications in the form of ‘Residue-Position’ separated by ‘, ‘, extract the residue and position. Ignore any excess details in the string.
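One way to picture this parsing is a regular expression applied to each comma-separated item, keeping only the leading ‘Residue-Position’ part. The exact residue notation used by DEPOD is not stated here, so the input in the usage example and the `extract_positions` name are assumptions.

```python
import re


def extract_positions(x):
    """Return (residue, position) pairs from a 'Residue-Position, ...' string."""
    sites = []
    for item in x.split(", "):
        # Match letters, a hyphen, then digits at the start of the item.
        # Anything after the position is ignored.
        match = re.match(r"\s*([A-Za-z]+)-(\d+)", item)
        if match:
            sites.append((match.group(1), int(match.group(2))))
    return sites
```

Usage:

```python
extract_positions("Ser-15, Thr-20 (in vitro), unknown")
```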
- ptm_pose.annotate.unify_interaction_data(spliced_ptms, interaction_col, name_dict={})[source]#
Given spliced ptm data and a column containing interaction data, extract the interacting protein, type of interaction, and convert to UniProtKB accession. This will be added as a new column labeled ‘Interacting ID’
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe containing PTM data
- interaction_col: str
column containing interaction information from a specific database
- name_dict: dict
dictionary to convert names within given database to UniProt IDs. For cases when name is not necessarily one of the gene names listed in UniProt
- Returns:
- interact: pd.DataFrame
Contains PTMs and their interacting proteins, the type of influence the PTM has on the interaction (DISRUPTS, INDUCES, or REGULATES)
Analysis#
- ptm_pose.analyze.annotation_enrichment(spliced_ptms, database='PhosphoSitePlus', annotation_type='Function', background_type='pregenerated', collapse_on_similar=False, mod_class=None, alpha=None, min_dPSI=None, annotation_file=None, save_background=False)[source]#
In progress, needs to be tested
Given spliced ptm information (differential inclusion, altered flanking sequences, or both), calculate the enrichment of specific annotations in the dataset using a hypergeometric test. Background data can be provided/constructed in a few ways:
Use preconstructed background data for the annotation of interest, which considers the entire proteome present in the ptm_coordinates dataframe. While this is the default, it may not be the most accurate representation of your data, so you may alternatively decide to use the other options, which will be more specific to your context.
Use the alpha and min_dPSI parameters to construct a foreground that only includes significantly spliced PTMs, and use the entire provided spliced_ptms dataframe as the background. This allows you to compare the enrichment of specific annotations in the significantly spliced PTMs against the entire dataset. This is done automatically if alpha or min_dPSI is provided.
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
- database: str
database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, ‘RegPhos’, ‘PTMsigDB’. Default is ‘PhosphoSitePlus’.
- annotation_type: str
Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
- background_type: str
how to construct the background data. Options include ‘pregenerated’ (default) and ‘significance’. If ‘significance’ is selected, the alpha and min_dPSI parameters must be provided. Otherwise, will use whole proteome in the ptm_coordinates dataframe as the background.
- collapse_on_similar: bool
Whether to collapse similar annotations (for example, increasing and decreasing functions) into a single category. Default is False.
- mod_class: str
modification class to subset, if any
- alpha: float
significance threshold to use to subset foreground PTMs. Default is None.
- min_dPSI: float
minimum delta PSI value to use to subset foreground PTMs. Default is None.
- annotation_file: str
file to use to annotate custom background data. Default is None.
- save_background: bool
Whether to save the background data constructed from the ptm_coordinates dataframe into Resource_Files within package. Default is False.
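As a rough illustration of the hypergeometric test mentioned above, the following sketch computes an upper-tail p-value from four counts using only the standard library. The function name and the variable names `k`, `n`, `K`, `N` are assumptions for this example; how the package computes the statistic internally is not stated here.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """Probability of seeing at least k annotated PTMs in the foreground.

    k: annotated PTMs in the foreground
    n: total PTMs in the foreground
    K: annotated PTMs in the background
    N: total PTMs in the background
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the hypergeometric probability mass for k, k+1, ..., upper.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Toy numbers: 10 background PTMs, 5 annotated; foreground of 2, both annotated.
print(hypergeom_enrichment_p(k=2, n=2, K=5, N=10))
```

With the ‘significance’ background, the foreground would be the significantly spliced PTMs and the background the full spliced_ptms dataframe; with ‘pregenerated’, the background counts would come from the whole proteome.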
- ptm_pose.analyze.combine_outputs(spliced_ptms, altered_flanks, mod_class=None, include_stop_codon_introduction=False, remove_conflicting=True)[source]#
Given the spliced_ptms (differentially included) and altered_flanks (altered flanking sequences) dataframes obtained from project and flanking_sequences modules, combine the two into a single dataframe that categorizes each PTM by the impact on the PTM site
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with PTMs projected onto splicing events and with annotations appended from various databases
- altered_flanks: pd.DataFrame
Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases
- mod_class: str
modification class to subset, if any
- include_stop_codon_introduction: bool
Whether to include PTMs that introduce stop codons in the altered flanks. Default is False.
- remove_conflicting: bool
Whether to remove PTMs that are both included and excluded across different splicing events. Default is True.
- ptm_pose.analyze.compare_inclusion_motifs(flanking_sequences, elm_classes=None)[source]#
Given a DataFrame containing flanking sequences with changes and a DataFrame containing ELM class information, identify motifs that are found in the inclusion and exclusion events, identifying motifs unique to each case. This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (e.g. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of ELM (http://elm.eu.org/elms/elms_index.tsv).
- Parameters:
- flanking_sequences: pandas.DataFrame
DataFrame containing flanking sequences with changes, obtained from get_flanking_changes_from_splice_data()
- elm_classes: pandas.DataFrame
DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv). Recommended to download this file and input it manually, but will download from ELM otherwise
- Returns:
- flanking_sequences: pandas.DataFrame
DataFrame containing flanking sequences with changes and motifs found in the inclusion and exclusion events
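Identifying motifs unique to the inclusion or exclusion case amounts to a set comparison. The sketch below assumes the motif identifiers for each sequence have already been found; the function name, the output keys, and the motif names are hypothetical and are not taken from the package or from ELM.

```python
def compare_motif_sets(inclusion_motifs, exclusion_motifs):
    """Split motif identifiers into inclusion-only, exclusion-only, and shared."""
    inclusion, exclusion = set(inclusion_motifs), set(exclusion_motifs)
    return {
        "inclusion_only": sorted(inclusion - exclusion),
        "exclusion_only": sorted(exclusion - inclusion),
        "shared": sorted(inclusion & exclusion),
    }

# Invented identifiers, used only to show the comparison.
result = compare_motif_sets(["TOY_A", "TOY_B"], ["TOY_B", "TOY_C"])
print(result)
```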
- ptm_pose.analyze.edit_sequence_for_kinase_library(seq)[source]#
Convert a flanking sequence to the version accepted by the kinase library (modified residue denoted by an asterisk)
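One possible form of this conversion is sketched below. It assumes that the modified residue is written in lowercase in the flanking sequence and that only s, t, and y are marked; both are assumptions for this illustration, and the helper name is hypothetical.

```python
import re

def mark_modified_residue(seq):
    """Append '*' to each lowercase s, t, or y (assumed to mark the modified site)."""
    return re.sub(r"[sty]", lambda m: m.group(0) + "*", seq)

# Invented flanking sequence with a lowercase serine at the centre.
print(mark_modified_residue("PSVEPPLsQETFSDL"))
```

Uppercase S, T, and Y are left untouched, so only the residue flagged as modified receives the asterisk.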
- ptm_pose.analyze.findAlteredPositions(seq1, seq2, flank_size=5)[source]#
Given two sequences, identify the location of positions that have changed
- Parameters:
- seq1, seq2: str
sequences to compare (order does not matter)
- flank_size: int
size of the flanking sequences (default is 5). This is used to make sure the provided sequences are the correct length
- Returns:
- altered_positions: list
list of positions that have changed
- residue_change: list
list of residues that have changed associated with that position
- flank_side: str
indicates which side of the flanking sequence the change has occurred (N-term, C-term, or Both)
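The comparison can be sketched as a position-by-position scan. Several conventions here are assumptions rather than documented behaviour: sequences are taken to be `2 * flank_size + 1` residues long, positions are reported relative to the central residue, and changes are written as ‘X->Y’. The function name is hypothetical.

```python
def find_altered_positions(seq1, seq2, flank_size=5):
    """Return (positions, residue changes, flank side) for two flanking sequences."""
    expected = 2 * flank_size + 1  # assumed length: flank + site + flank
    if len(seq1) != expected or len(seq2) != expected:
        raise ValueError(f"sequences must be {expected} residues long")

    positions, changes = [], []
    for i, (a, b) in enumerate(zip(seq1.upper(), seq2.upper())):
        if a != b:
            positions.append(i - flank_size)  # negative = N-terminal side
            changes.append(f"{a}->{b}")

    on_n_side = any(p < 0 for p in positions)
    on_c_side = any(p > 0 for p in positions)
    if on_n_side and on_c_side:
        side = "Both"
    elif on_n_side:
        side = "N-term"
    elif on_c_side:
        side = "C-term"
    else:
        side = None  # no change in either flank
    return positions, changes, side

# Invented sequences that differ in the last two residues.
print(find_altered_positions("PEPTIsKLMNQ", "PEPTIsKLMAA"))
```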
- ptm_pose.analyze.find_motifs(seq, elm_classes)[source]#
Given a sequence and a dataframe containing ELM class information, identify motifs that can be found in the provided sequence using the regular expression provided by ELM (PTMs not considered). This does not take into account the position of the motif in the sequence or additional information that might validate any potential interaction (e.g. structural information that would indicate whether the motif is accessible or not). ELM class information can be downloaded from the download page of ELM (http://elm.eu.org/elms/elms_index.tsv).
- Parameters:
- seq: str
sequence to search for motifs
- elm_classes: pandas.DataFrame
DataFrame containing ELM class information (ELMIdentifier, Regex, etc.), downloaded directly from ELM (http://elm.eu.org/elms/elms_index.tsv)
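A regular-expression search of this kind can be sketched as follows. The class table is replaced by a small dictionary of invented patterns (not real ELM entries), and the function name is hypothetical; the sequence is upper-cased so that a lowercase marker for the modified residue does not affect matching.

```python
import re

# Toy stand-in for the ELM class table: identifier -> regular expression.
# These patterns are invented for illustration only.
toy_classes = {
    "TOY_PxxP": r"P..P",
    "TOY_RxxS": r"R..S",
    "TOY_Cterm_L": r"L$",
}

def find_matching_classes(seq, classes):
    """Return identifiers whose pattern matches anywhere in the sequence."""
    seq = seq.upper()  # PTMs are not considered, so case is ignored
    return [name for name, pattern in classes.items() if re.search(pattern, seq)]

print(find_matching_classes("AAPTsPRKLSA", toy_classes))
```

As noted above, a match only says the pattern occurs in the sequence; it says nothing about where it sits relative to the PTM or whether the motif is accessible.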
- ptm_pose.analyze.gene_set_enrichment(spliced_ptms=None, altered_flanks=None, combined=None, alpha=0.05, min_dPSI=None, gene_sets=['KEGG_2021_Human', 'GO_Biological_Process_2023', 'GO_Cellular_Component_2023', 'GO_Molecular_Function_2023', 'Reactome_2022'], background=None, return_sig_only=True, max_retries=5, delay=10)[source]#
Given spliced_ptms and/or altered_flanks dataframes (or the dataframe combined from combine_outputs()), perform gene set enrichment analysis using the Enrichr API
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with differentially included PTMs projected onto splicing events and with annotations appended from various databases. Default is None (will not be considered in analysis). If combined dataframe is provided, this dataframe will be ignored.
- altered_flanks: pd.DataFrame
Dataframe with PTMs associated with altered flanking sequences and with annotations appended from various databases. Default is None (will not be considered). If combined dataframe is provided, this dataframe will be ignored.
- combined: pd.DataFrame
Combined dataframe with spliced_ptms and altered_flanks dataframes. Default is None. If provided, spliced_ptms and altered_flanks dataframes will be ignored.
- gene_sets: list
List of gene sets to use in enrichment analysis. Default is [‘KEGG_2021_Human’, ‘GO_Biological_Process_2023’, ‘GO_Cellular_Component_2023’, ‘GO_Molecular_Function_2023’, ‘Reactome_2022’]. See the gseapy and Enrichr documentation for other available gene sets
- background: list
List of genes to use as background in enrichment analysis. Default is None (all genes in the gene set database will be used).
- return_sig_only: bool
Whether to return only significantly enriched gene sets. Default is True.
- max_retries: int
Number of times to retry downloading gene set enrichment data from enrichr API. Default is 5.
- delay: int
Number of seconds to wait between retries. Default is 10.
- Returns:
- results: pd.DataFrame
Dataframe with gene set enrichment results from enrichr API
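The max_retries and delay parameters describe a retry loop around the network request. A generic sketch of such a loop is shown below; `call_with_retries` and `flaky_request` are hypothetical names for this illustration, and the actual request code in the package may differ.

```python
import time

def call_with_retries(func, max_retries=5, delay=10):
    """Call func(), retrying up to max_retries times with delay seconds between tries."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < max_retries:
                time.sleep(delay)
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error

# Stand-in for a request that fails twice before succeeding.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "results"

# delay=0 keeps the demonstration fast.
print(call_with_retries(flaky_request, max_retries=5, delay=0))
```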
- ptm_pose.analyze.getSequenceIdentity(seq1, seq2)[source]#
Given two flanking sequences, calculate the sequence identity between them using Biopython and parameters defined by Pillman et al. BMC Bioinformatics 2011
- Parameters:
- seq1, seq2: str
flanking sequence
- Returns:
- normalized_score: float
normalized score of sequence similarity between flanking sequences (calculated similarity/max possible similarity)
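The normalization (calculated similarity divided by the maximum possible similarity) can be illustrated with a deliberately simplified scoring scheme. The sketch below does not use Biopython or the Pillman et al. parameters: it swaps in an ungapped, position-wise comparison with a hypothetical score of 1 per match and 0 per mismatch, purely to show how the ratio is formed.

```python
def normalized_similarity(seq1, seq2, match=1.0, mismatch=0.0):
    """Ungapped similarity of two equal-length sequences, scaled to the best possible score."""
    if not seq1 or len(seq1) != len(seq2):
        raise ValueError("sequences must be non-empty and of equal length")
    score = sum(match if a == b else mismatch
                for a, b in zip(seq1.upper(), seq2.upper()))
    max_score = match * len(seq1)  # score of a sequence compared with itself
    return score / max_score

# Invented flanking sequences differing at two of eleven positions.
print(normalized_similarity("PEPTIsKLMNQ", "PEPTIsKLMAA"))
```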
- ptm_pose.analyze.get_annotation_categories(spliced_ptms)[source]#
Given spliced ptm information, return the available annotation categories that have been appended to dataframe
- Parameters:
- spliced_ptms: pd.DataFrame
PTMs projected onto splicing events and with annotations appended from various databases
- Returns:
- annot_categories: pd.DataFrame
Dataframe that indicates the available databases, annotations from each database, and column associated with that annotation
- ptm_pose.analyze.get_annotation_col(spliced_ptms, annotation_type='Function', database='PhosphoSitePlus')[source]#
Given the database of interest and annotation type, return the annotation column that will be found in an annotated spliced_ptms dataframe
- Parameters:
- spliced_ptms: pd.DataFrame
Dataframe with PTM annotations added from annotate module
- annotation_type: str
Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
- database: str
database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, ‘PTMInt’, ‘PTMcode’, ‘DEPOD’, and ‘RegPhos’. Default is ‘PhosphoSitePlus’.
- Returns:
- annotation_col: str
Column name in spliced_ptms dataframe that contains the requested annotation
- ptm_pose.analyze.get_enrichment_inputs(spliced_ptms, annotation_type='Function', database='PhosphoSitePlus', background_type='pregenerated', background=None, collapse_on_similar=False, mod_class=None, alpha=0.05, min_dPSI=0, annotation_file=None, save_background=False)[source]#
Given the spliced ptms, altered_flanks, or combined PTMs dataframe, identify the number of PTMs corresponding to specific annotations in the foreground (PTMs impacted by splicing) and the background (all PTMs in the proteome or all PTMs in dataset not impacted by splicing). This information can be used to calculate the enrichment of specific annotations among PTMs impacted by splicing. Several options are provided for constructing the background data: pregenerated (based on entire proteome in the ptm_coordinates dataframe) or significance (foreground PTMs are extracted from provided spliced PTMs based on significance and minimum delta PSI)
- Parameters:
- spliced_ptms: pd.DataFrame
- ptm_pose.analyze.get_modification_counts(ptms)[source]#
Given PTM data (either spliced ptms, altered flanks, or combined data), return the counts of each modification class
- Parameters:
- ptms: pd.DataFrame
Dataframe with PTMs projected onto splicing events or with altered flanking sequences
- Returns:
- modification_counts: pd.Series
Series with the counts of each modification class
- ptm_pose.analyze.get_ptm_annotations(spliced_ptms, annotation_type='Function', database='PhosphoSitePlus', mod_class=None, collapse_on_similar=False, dPSI_col=None, sig_col=None)[source]#
Given spliced ptm information obtained from project and annotate modules, grab PTMs in spliced ptms associated with specific PTM modules
- Parameters:
- spliced_ptms: pd.DataFrame
PTMs projected onto splicing events and with annotations appended from various databases
- annotation_type: str
Type of annotation to pull from spliced_ptms dataframe. Available information depends on the selected database. Default is ‘Function’.
- database: str
database from which PTMs are pulled. Options include ‘PhosphoSitePlus’, ‘ELM’, or ‘PTMInt’. ELM and PTMInt data will automatically be downloaded, but due to download restrictions, PhosphoSitePlus data must be manually downloaded and annotated in the spliced_ptms data using functions from the annotate module. Default is ‘PhosphoSitePlus’.
- mod_class: str
modification class to subset
- ptm_pose.analyze.simplify_annotation(annotation, sep=',')[source]#
Given an annotation, remove additional information such as whether or not a function is increasing or decreasing. For example, ‘cell growth, induced’ would be simplified to ‘cell growth’
- Parameters:
- annotation: str
Annotation to simplify
- sep: str
Separator that splits the core annotation from additional detail. Default is ‘,’. Assumes the first element is the core annotation.
- Returns:
- annotation: str
Simplified annotation
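The simplification described above can be sketched as a split on the separator that keeps the first element. The function name is hypothetical and this is an illustration of the described behaviour, not the package's source.

```python
def simplify(annotation, sep=","):
    """Keep the core annotation, assumed to be the text before the first separator."""
    return annotation.split(sep)[0].strip()

print(simplify("cell growth, induced"))  # prints: cell growth
```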
Plotting#