KSTAR Reference
The “Config” Module
- kstar.config.check_configuration()#
Verify that all necessary files are downloadable and findable
- kstar.config.find_available_networks(phospho_type)#
Find available network hashes in the current network directory, and return dictionary with information about them
- Returns:
- available_networks: dict
Dictionary of all available networks, in the format network hash: network information dictionary
- kstar.config.install_network_files(target_dir=None)#
Retrieves the network files that accompany this release from FigShare and unzips them to the specified directory.
- Parameters:
- target_dir: str, optional
Directory to install network files to. If None, defaults to a location within the package ({KSTAR_DIR}/NETWORKS/)
- kstar.config.update_configuration(network_dir=None, y_network_name=None, st_network_name=None, save_random_experiments=None, use_pregenerated_random_activities=None, save_new_random_activities=None, custom_pregenerated_activities_dir=None)#
Update configuration parameters in current iteration and save to configuration file.
- Parameters:
- use_pregenerated_random_activities: bool, optional
Whether to use pregenerated random activities when possible, by default None
- save_new_random_activities: bool, optional
Whether to save new random activities when they are generated, by default None
- custom_pregenerated_activities_dir: str, optional
Directory to save newly generated random activities for future use, by default None
- network_dir: str, optional
Directory containing the kinase-substrate networks, by default None (which assumes it is located in the kstar directory)
- y_network_name: str, optional
Unique identifier of the tyrosine network to use by default.
- st_network_name: str, optional
Unique identifier of the serine/threonine network to use by default.
- kstar.config.update_network_directory(network_dir=None, y_network_name=None, st_network_name=None)#
Update the location of the network files, and verify that all necessary files are located in the directory
- Parameters:
- network_dir: string
path to where network files are located
- y_network_name: string
name of the tyrosine network to use
- st_network_name: string
name of the serine/threonine network to use
The “Prune” Module
The “Pruner” Class
- class kstar.prune.Pruner(network, network_name, phospho_type='Y', acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], logger=None, network_dir=None)#
Pruning Algorithm used for KSTAR.
- Parameters:
- network: pandas df
weighted kinase-site prediction network where there is an accession, site, kinase, and score column
- network_name: str
name to use when saving pruned networks
- logger: None or logging.Logger
logger used for pruning. Will create a new logger if None is provided
- phospho_type: str
phospho_type(s) to use when building pruned networks
- acc_col: str
the name of the column containing UniProt accession IDs for each substrate in the weighted network
- site_col: str
the name of the column containing the residue type and location of each substrate in the weighted network (Y1268, S44, etc.)
- nonweight_cols: list
indicates the non-weight-containing columns in the network (these are removed in the final processed network, as they are not needed). If None, any non-numeric columns are detected automatically and removed.
- network_dir: str
location to save the final pruned networks. Will use default network directory from config if None is provided.
Methods
assess_work_dir(): Report how many networks are currently in the work directory
build_multiple_compendia_networks(...[, ...]): Builds multiple compendia-limited networks
build_multiple_networks(kinase_size, ...[, ...]): Basic network generation that only takes score into account when determining the sites a kinase connects to
build_pruned_network(network, kinase_size, ...): Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to
calculate_compendia_sizes(kinase_size): Calculates the number of sites per compendia size that a kinase should connect to, using the same ratios of compendia sizes as found in compendia
checkParameters(kinase_size, site_limit): Given the site_limit and kinase_size parameters to be used during pruning, raise errors if not feasible, and raise warnings if the value is higher than we would recommend (>40% of the maximum kinase_size value)
clean_work_dir(): Remove all files in existing work directory
compendia_pruned_network(compendia_sizes, ...): Builds a compendia-pruned network that takes into account compendia size limits per kinase
getMaximumKinaseSize(site_limit): Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)
getRecommendedKinaseSize(site_limit): Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size
pregenerate_random_activities([PROCESSES]): Pregenerate random activities based on the generated networks
report_info(txt): Both log and print information during pruning
report_warning(txt): Both log and print warnings during pruning
run(kinase_size, site_limit[, num_networks, ...]): Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks
save_networks([network_file_used, network_desc]): Save the pruned networks generated by 'build_multiple_networks' or 'build_multiple_compendia_networks' as a pickle to be loaded by KSTAR
save_run_information([network_file_used, ...]): Save information about the generation of networks during run_pruning, including the parameters used for generation.
- assess_work_dir()#
Report how many networks are currently in the work directory
- build_multiple_compendia_networks(kinase_size, site_limit, num_networks, PROCESSES=1)#
Builds multiple compendia-limited networks
- Parameters:
- kinase_size: int
number of sites each kinase should connect to
- site_limit: int
upper limit of number of kinases a site can connect to
- num_networks: int
number of networks to build
- network_id: str
id to use for each network in dictionary
- Returns:
- pruned_networks: dict
key: <network_id>_<i>, value: pruned network
- build_multiple_networks(kinase_size, site_limit, num_networks, PROCESSES=1)#
Basic Network Generation - only takes into account score when determining sites a kinase connects to
- build_pruned_network(network, kinase_size, site_limit)#
Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to
- Parameters:
- network: pandas DataFrame
network to build pruned network on
- kinase_size: int
number of sites each kinase should connect to
- site_limit: int
upper limit of number of kinases a site can connect to
- Returns:
- pruned_network: pandas DataFrame
subset of network that has been pruned
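The two constraints described for build_pruned_network (a fixed number of sites per kinase, and a cap on kinases per site) can be illustrated with a small standalone sketch. This is not KSTAR's implementation: the function name, the dict-of-dicts input, and the score-weighted random draw are all assumptions made for illustration.

```python
import random


def build_pruned_network_sketch(scores, kinase_size, site_limit, seed=0):
    """Illustrative pruning: scores maps kinase -> {site: weight}.

    Returns kinase -> set of chosen sites, where every kinase gets
    kinase_size sites and no site is used by more than site_limit kinases.
    """
    rng = random.Random(seed)
    site_counts = {}                       # how many kinases use each site
    pruned = {kinase: set() for kinase in scores}
    # Round-robin so that no kinase exhausts the high-scoring sites first.
    for _ in range(kinase_size):
        for kinase, site_weights in scores.items():
            candidates = [
                site for site, weight in site_weights.items()
                if weight > 0
                and site not in pruned[kinase]
                and site_counts.get(site, 0) < site_limit
            ]
            if not candidates:
                raise ValueError(f"no available sites left for {kinase}")
            weights = [site_weights[site] for site in candidates]
            chosen = rng.choices(candidates, weights=weights, k=1)[0]
            pruned[kinase].add(chosen)
            site_counts[chosen] = site_counts.get(chosen, 0) + 1
    return pruned


# Hypothetical toy network: 3 kinases scored against 6 sites.
toy_scores = {k: {f"S{i}": 1.0 + i for i in range(6)} for k in ("K1", "K2", "K3")}
toy_pruned = build_pruned_network_sketch(toy_scores, kinase_size=2, site_limit=1)
```

With site_limit=1 the toy example forces every site to be assigned to exactly one kinase, which makes the interaction between the two parameters easy to see.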
- calculate_compendia_sizes(kinase_size)#
Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia
- Parameters:
- kinase_size: int
number of sites each kinase should connect to
- Returns:
- sizes: dict
key: compendia size, value: number of sites each kinase should pull from the given compendia size
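As a rough illustration of splitting kinase_size across compendia sizes in proportion to how often each size occurs, the sketch below uses largest-remainder rounding so the parts always sum to kinase_size. The function name, the input counts, and the rounding rule are assumptions; KSTAR may round differently.

```python
def allocate_by_ratio(compendia_counts, kinase_size):
    """Split kinase_size across compendia sizes proportionally to their counts.

    compendia_counts maps a compendia size to the number of sites with that size.
    """
    total = sum(compendia_counts.values())
    raw = {size: kinase_size * count / total
           for size, count in compendia_counts.items()}
    sizes = {size: int(value) for size, value in raw.items()}   # floor of each share
    leftover = kinase_size - sum(sizes.values())
    # Hand the remaining slots to the shares with the largest fractional part.
    by_fraction = sorted(raw, key=lambda size: raw[size] - sizes[size], reverse=True)
    for size in by_fraction[:leftover]:
        sizes[size] += 1
    return sizes


# Hypothetical counts of sites per compendia size.
example_sizes = allocate_by_ratio({0: 500, 1: 300, 2: 200}, kinase_size=10)
```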
- checkParameters(kinase_size, site_limit)#
Given the site_limit and kinase_size parameters to be used during pruning, raise errors if not feasible, and raise warnings if value is higher than we would recommend (>40% of the maximum kinase_size value)
- Parameters:
- kinase_size: int
Parameter used in pruning: indicates the number of substrates each kinase will be connected to
- site_limit: int
Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network
- Returns:
- Nothing, will only raise errors/warnings if parameters are not feasible
- clean_work_dir()#
Remove all files in existing work directory
- compendia_pruned_network(compendia_sizes, site_limit, odir)#
Builds a compendia-pruned network that takes into account compendia size limits per kinase
- Parameters:
- compendia_sizes: dict
key: compendia size, value: number of sites to connect to kinase
- site_limit: int
upper limit of number of kinases a site can connect to
- Returns:
- pruned_network: pandas DataFrame
subset of network that has been pruned according to compendia ratios
- getMaximumKinaseSize(site_limit)#
Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)
Theoretical maximum exists when each substrate hits the maximum site_limit
- Parameters:
- site_limit: int
Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network
- Returns:
- theoretical_max_ksize: int
largest possible value that ‘kinase_size’ parameter can have without throwing any errors
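One way to read "theoretical maximum exists when each substrate hits the maximum site_limit" is that the total number of available connections is the number of sites times site_limit, shared equally among kinases. The sketch below encodes that reading together with the 40% warning described for checkParameters. The formula, function names, and arguments are assumptions, not the library's code.

```python
import warnings


def theoretical_max_kinase_size(num_sites, num_kinases, site_limit):
    """Upper bound on kinase_size if every site is used site_limit times."""
    return (num_sites * site_limit) // num_kinases


def check_parameters_sketch(kinase_size, site_limit, num_sites, num_kinases,
                            recommended_fraction=0.4):
    """Raise if kinase_size is infeasible, warn if above the recommended fraction."""
    max_size = theoretical_max_kinase_size(num_sites, num_kinases, site_limit)
    if kinase_size > max_size:
        raise ValueError(
            f"kinase_size={kinase_size} exceeds theoretical maximum {max_size}")
    if kinase_size > recommended_fraction * max_size:
        warnings.warn(
            f"kinase_size={kinase_size} is above {recommended_fraction:.0%} "
            f"of the theoretical maximum {max_size}")
    return max_size


# Hypothetical network: 1000 sites, 50 kinases, each site usable by 10 kinases.
max_size = check_parameters_sketch(kinase_size=50, site_limit=10,
                                   num_sites=1000, num_kinases=50)
```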
- getRecommendedKinaseSize(site_limit)#
Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size
Theoretical maximum exists when each substrate hits the maximum site_limit
- Parameters:
- site_limit: int
Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network
- Returns:
- Nothing, prints theoretical maximum of kinase size and the recommended values for the parameter given the site_limit
- pregenerate_random_activities(PROCESSES=1)#
Pregenerate random activities based on the generated networks
- Parameters:
- PROCESSES: int
number of processes to use for multiprocessing
- report_info(txt)#
Both log and print information during pruning
- report_warning(txt)#
Both log and print warnings during pruning
- run(kinase_size, site_limit, num_networks=50, use_compendia=True, generate_activities=True, network_file_used=None, network_desc=None, restart=False, PROCESSES=1)#
Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks
- save_networks(network_file_used=None, network_desc=None)#
Save the pruned networks generated by the ‘build_multiple_networks’ or ‘build_multiple_compendia_networks’ as a pickle to be loaded by KSTAR
- save_run_information(network_file_used=None, network_desc=None)#
Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.
- Parameters:
- network_file_used: str, optional
file path of the weighted network file used during pruning
- network_desc: str, optional
description of the network used during pruning. Recommended, but not required
Functions to Perform Pruning
- kstar.prune.run_pruning(weighted_network, network_name, odir, phospho_type, kinase_size, site_limit, num_networks, use_compendia=True, generate_activities=True, network_file_used=None, network_desc=None, restart=False, logger=None, acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], PROCESSES=1)#
Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks
- Parameters:
- weighted_network: pandas DataFrame
weighted kinase-site prediction network where there is an accession, site, kinase, and score column
- network_name: str
name to use when saving pruned networks
- odir: str
location to save the final pruned networks. Will use default network directory from config if None is provided.
- phospho_type: str
phospho_type(s) to use when building pruned networks
The “ExperimentMapper” class
- class kstar.mapping.ExperimentMapper(experiment, columns, odir='./', name='experiment', window=7, data_columns=None, logger=None, sequences=None, compendia=None)#
Given an experiment object and reference sequences, map the phosphorylation sites to the common reference.
- Parameters:
- experiment: pandas dataframe
Pandas dataframe of an experiment that has a reference accession and a peptide column and/or a site column. The peptide column (preferred) should be upper case, with lower case indicating the site of phosphorylation. The site column should be in the format S/T/Y<pos>, e.g. Y15 or S345.
- columns: dict
Dictionary with mappings of the experiment dataframe column names for the required names ‘accession_id’, ‘peptide’, or ‘site’. One of ‘peptide’ or ‘site’ is required.
- name: str
Name of experiment, used for logging and output file names
- odir: str
Output directory where mapped data and logs will be saved
- logger: Logger object
used for logging when peptides cannot be matched and when a site location changes. If None, a logger will be created in the output directory.
- sequences: dict
Dictionary of sequences. Key : accession. Value : protein sequence. Default is imported from kstar.config
- compendia: pd.DataFrame
Human phosphoproteome compendia, mapped to KinPred and annotated with number of compendia. Default is imported from kstar.config
- window: int
The number of amino acids on the N- and C-terminal sides of the central phosphorylation site to include when mapping a site. Default is 7.
- data_columns: list, or empty
The list of data columns to use. If this is empty, any column whose name starts with ‘data:’ is used as a data column. Default is None.
- Attributes:
- experiment: pandas dataframe
mapped experiment, which for each peptide now contains the mapped accession, site, peptide, number of compendia, and compendia type
- sequences: dict
Dictionary of sequences passed into the class
- compendia: pandas dataframe
compendia dataframe passed into the class
- data_columns: list
indicates which columns will be used as data
Methods
align_sites([window]): Map the peptides/sites to the common sequence reference, then remove and report sites that do not align as expected.
get_experiment(): Return the mapped experiment dataframe
get_number_missed_peptides(): Returns number of missed peptides
get_number_missed_sites(): Returns number of missed sites
get_reason_for_unmapped(): Returns dataframe of unmapped sites with reasons for being unmapped
get_sequence(accession): Gets the sequence that matches the given accession
save_experiment([return_stats, ...]): Given a completed mapping process, save the resulting experiment and reporting files (if desired) to the output directory.
set_data_columns(data_columns): Identifies which columns in the experiment should be used as data columns.
- align_sites(window=7)#
Map the peptides/sites to the common sequence reference, then remove and report sites that do not align as expected. Operates on the experiment dataframe of the class. Example call: expMapper.align_sites(window=7).
- Parameters:
- window: int
The number of amino acids on the N- and C-terminal sides of the central phosphorylation site to include when mapping a site.
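To make the peptide convention (lower case marks the phosphorylated residue) and the window parameter concrete, here is a small standalone sketch. The helper names, the toy sequence, and the "_" padding at protein termini are assumptions for illustration and are not taken from the library.

```python
def phosphosites_from_peptide(peptide, sequence):
    """Return sites such as 'S13' for each lower-case s/t/y in the peptide."""
    start = sequence.find(peptide.upper())      # 0-based offset of the peptide
    if start == -1:
        return []                               # peptide not found in reference
    return [
        f"{residue.upper()}{start + offset + 1}"    # positions are 1-based
        for offset, residue in enumerate(peptide)
        if residue in "sty"
    ]


def site_window(sequence, position, window=7, pad="_"):
    """Return the +/- window residues around a 1-based position, site in lower case."""
    index = position - 1
    left = sequence[max(0, index - window):index]
    right = sequence[index + 1:index + 1 + window]
    # Pad so the result always has length 2 * window + 1 (assumed convention).
    return left.rjust(window, pad) + sequence[index].lower() + right.ljust(window, pad)


# Hypothetical reference sequence and peptide.
toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"
toy_sites = phosphosites_from_peptide("AKQRQIsFVK", toy_sequence)
toy_window = site_window(toy_sequence, 13, window=7)
```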
- get_experiment()#
Return the mapped experiment dataframe
- get_number_missed_peptides()#
Returns number of missed peptides
- get_number_missed_sites()#
Returns number of missed sites
- get_reason_for_unmapped()#
Returns dataframe of unmapped sites with reasons for being unmapped
- Returns:
- errors: pandas Series
Series with counts of each error type
- perc: pandas Series
Series with percentage of each error type
- get_sequence(accession)#
Gets the sequence that matches the given accession
- save_experiment(return_stats=True, return_lost_sites=True)#
Given a completed mapping process, save the resulting experiment and reporting files (if desired) to the output directory.
- Parameters:
- return_stats: bool
Whether to save a mapping statistics file. Default is True.
- return_lost_sites: bool
Whether to save a csv file containing any sites/peptides that were removed during the mapping process. Default is True.
- set_data_columns(data_columns)#
Identifies which columns in the experiment should be used as data columns. If data_columns is provided, then ‘data:’ is added to the front of each name and the experiment dataframe columns are renamed. Otherwise, the function looks for columns with ‘data:’ in front and assigns these to the data_columns attribute.
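The two branches of the ‘data:’ convention can be sketched on a plain list of column names. The function below is a hypothetical stand-in that works on names only; it does not touch a dataframe and is not the library's method.

```python
def resolve_data_columns(columns, data_columns=None, prefix="data:"):
    """Return (column names after renaming, names treated as data columns)."""
    if data_columns:
        # Explicit list given: prefix each listed column.
        rename = {name: prefix + name for name in data_columns}
        renamed = [rename.get(name, name) for name in columns]
        return renamed, [rename[name] for name in data_columns]
    # Nothing given: pick up columns that already carry the prefix.
    return list(columns), [name for name in columns if name.startswith(prefix)]


# Hypothetical column names.
renamed, data_cols = resolve_data_columns(
    ["accession", "peptide", "t0", "t5"], data_columns=["t0", "t5"])
```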
Functions for Activity Calculation
The “KinaseActivity” class
- class kstar.calculate.KinaseActivity(evidence, odir, name='experiment', data_columns=None, phospho_type='Y', kinases=None, network_dir=None, logger=None, network_name=None, seed=None)#
KinaseActivity estimates the activity of kinases in an experiment using the hypergeometric distribution, which compares the number of protein sites found to be active in the evidence with the number of protein sites attributed to a kinase in a provided network.
- Parameters:
- evidence: pandas df
a dataframe that contains, at minimum, the data columns to use as evidence in the analysis plus the KSTAR_ACCESSION and KSTAR_SITE columns
- odir: string
output directory where results will be saved
- name: string
name of the experiment, used to label output files. Default is ‘experiment’
- kinases: list or None
list of kinases to predict activity for. If None, will use all kinases found in the provided networks
- network_dir: string or None
directory where pruned KSTAR networks are located. If None, will use config.NETWORK_DIR. If network files were downloaded with config.install_network_files(), this directory should already be set and does not need to be provided.
- network_name: string or None
name of the network to use. If None, will use the default network name from config based on phospho_type
- data_columns: list
list of the columns containing the abundance values, which will be used to determine which sites will be used as evidence for activity prediction in each sample
- phospho_type: string, either ‘Y’ or ‘ST’
indicates the phospho modification of interest
- logger: Logger object or None
keeps track of kstar analysis, including any errors that occur. If None, a new logger will be created automatically
- min_dataset_size_for_pregenerated: int
minimum dataset size required to use pregenerated random activities (by number of sites used as evidence). Default is 150
- max_diff_from_pregenerated: float
maximum percent difference between dataset size and pregenerated random activity size to use pregenerated data. Default is 0.20 (i.e. 20%)
- seed: int or None
random seed to use for random number generation. If None, seed will be set to current time
- Attributes:
- Upon Initialization
- evidence: pandas dataframe
inputted dataset used for kinase activity calculation
- networks: dict
dictionary of pruned kinase substrate networks, with keys as network ids and values as pandas dataframes
- data_columns: list
list of columns containing abundance values, which will be used to determine which sites will be used as evidence. If the data_columns parameter was None, this list includes every column in evidence prefixed by ‘data:’
- logger: Logger object
keeps track of kstar analysis, including any errors that occur
- aggregate: string
the type of aggregation to use when determining binary evidence, either ‘count’ or ‘mean’. Default is ‘count’.
- run_date: string
indicates the date that kinase activity object was initialized
- random_seed: int
random seed used for activity calculation. Only relevant if not using pregenerated random activities
- network_info: dict
metadata about the loaded networks
- network_hash: string
unique identifier for the loaded networks
- kinases: list
list of kinases to predict activity for
- After Hypergeometric Calculations
- real_enrichment: pandas dataframe
p-values obtained for all pruned networks indicating statistical enrichment of a kinase’s substrates for each network, based on hypergeometric tests
- activities: pandas dataframe
median p-values obtained from the real_enrichment object for each experiment/kinase
- agg_activities: pandas dataframe
- After Random Enrichment Calculation
- random_experiments: pandas dataframe
contains information about the sites randomly sampled for each random experiment. Will only be saved if save_random_experiments=True.
- random_enrichment: KinaseActivity object
KinaseActivity object containing random activities predicted from each of the random experiments
- data_columns_from_scratch: list
list of data columns which generated random activities from scratch
- data_columns_with_pregenerated: list
list of data columns which generated random activities from pregenerated random activities
- After Mann Whitney Analysis
- activities_mann_whitney: pandas dataframe
p-values obtained from comparing the real distribution of p-values to the distribution of p-values from random datasets, based on the Mann-Whitney U test
- fpr_mann_whitney: pandas dataframe
false positive rates for predicted kinase activities
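The hypergeometric test behind real_enrichment, and the median step behind activities, can be written out with the standard library. This is a sketch of the statistics only: the mapping of KSTAR quantities onto the arguments (network size, sites attributed to a kinase, evidence size, overlap) is an assumption, and all numbers are made up.

```python
from math import comb
from statistics import median


def hypergeom_p_value(overlap, network_sites, kinase_sites, evidence_sites):
    """P(X >= overlap) when drawing evidence_sites from network_sites,
    of which kinase_sites are attributed to the kinase."""
    upper = min(kinase_sites, evidence_sites)
    total = comb(network_sites, evidence_sites)
    favorable = sum(
        comb(kinase_sites, i) * comb(network_sites - kinase_sites, evidence_sites - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / total


# Hypothetical overlaps for one kinase in three pruned networks.
per_network_p = [hypergeom_p_value(k, 10, 4, 3) for k in (1, 2, 3)]
# The attribute description above says activities holds the median p-value.
activity = median(per_network_p)
```

In practice a library routine such as SciPy's hypergeometric survival function would normally be used instead of summing binomial coefficients by hand; the explicit sum is shown only to make the calculation visible.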
Methods
calculate_kinase_activities([agg, ...]): Calculates the combined activity of experiments, using a threshold value to determine whether an experiment sees a site. Use 'mean' as agg to aggregate on values (NA values are dropped from consideration) or 'count' as agg to count a site as present if it is not NA.
check_data_columns([min_evidence_size]): Checks data columns to make sure each column is in evidence and that evidence filtered on that data column has at least one point of evidence (or the minimum set by min_evidence_size).
create_binary_evidence([agg, threshold, ...]): Returns a binary evidence dataframe according to the parameters passed in for aggregating duplicates and for deciding whether a site is included as evidence
get_allowable_threshold([greater, agg, ...]): Determine the minimum/maximum threshold that still results in all data columns having evidence
get_param_dict([params_to_ignore]): Get a dictionary of important parameters needed to reinstantiate the KSTAR object
get_random_activities([...]): Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.
get_run_information_content(): Retrieve network information from RUN_INFORMATION.txt based on phospho_type.
make_dotplot([include_evidence_sizes]): Create a dotplot of the kinase activity results
make_summary_pdf([regenerate_plots]): Create a summary PDF of the kinase activity results
recommend_threshold([desired_evidence_size, ...]): Recommend a threshold, one based on desired evidence size and one based on maximum average Jaccard similarity between samples.
set_data_columns([data_columns]): Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, data_columns is set to all columns that start with data:
test_threshold(threshold[, agg, greater, ...]): Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).
test_threshold_range(min_threshold, ...[, ...]): Given a range of threshold values, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment) and Jaccard similarity between samples at each threshold.
- calculate_kinase_activities(agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, PROCESSES=1)#
Calculates the combined activity of experiments, using a threshold value to determine whether an experiment sees a site. To aggregate on values, use ‘mean’ as agg; mean aggregation drops NA values from consideration. To aggregate on counts, use ‘count’ as agg; a site is present if it is not NA.
- Parameters:
- data_columns: list
columns that represent experimental results; if None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns
- threshold: float
threshold value used to filter rows
- evidence_size: int or None
the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample (or lowest if greater=False).
- agg: {‘count’, ‘mean’}
method to use when aggregating duplicate substrate-sites. ‘count’ counts the non-NaN values across multiple representations of a site. ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide. NA values are dropped from consideration.
- greater: bool
whether to keep sites that have a numerical value >=threshold (True, default) or <=threshold (False)
- min_evidence_size: int
minimum number of sites required for a data column to be considered for activity calculation
- PROCESSES: int
number of processes to use for multiprocessing
- check_data_columns(min_evidence_size=0)#
Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence (or minimum set by min_evidence_size). Removes all columns that do not meet criteria
- Parameters:
- min_evidence_size: int
minimum number of sites required for a data column to be considered for activity calculation
- create_binary_evidence(agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, drop_empty_columns=True)#
Returns a binary evidence data frame according to the parameters passed in for method for aggregating duplicates and considering whether a site is included as evidence or not
- Parameters:
- threshold: float
threshold value used to filter rows
- evidence_size: None or int
the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample.
- agg: {‘count’, ‘mean’}
method to use when aggregating duplicate substrate-sites. ‘count’ counts the non-NaN values across multiple representations of a site. ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide. NA values are dropped from consideration.
- greater: bool
whether to keep sites that have a numerical value >=threshold (True, default) or <=threshold (False)
- min_evidence_size: int
minimum number of sites required for a data column to be considered for activity calculation
- drop_empty_columns: bool
whether to drop data columns with fewer than min_evidence_size sites
- Returns:
- evidence_binary: pd.DataFrame
Matches the evidence dataframe of the KinaseActivity object, but with 0 or 1 indicating whether a site is included. Duplicate sites are collapsed and rows that are never used are removed.
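The interaction of agg, threshold, and greater can be sketched for a single sample with plain Python. The function and its row format are hypothetical, and the evidence_size and min_evidence_size options are left out to keep the example short.

```python
from statistics import mean


def binary_evidence(rows, threshold=1.0, agg="mean", greater=True):
    """rows: list of (accession, site, value or None) for one sample.

    Returns {(accession, site): 0 or 1} after collapsing duplicate sites.
    """
    grouped = {}
    for accession, site, value in rows:
        grouped.setdefault((accession, site), []).append(value)

    evidence = {}
    for key, values in grouped.items():
        observed = [v for v in values if v is not None]   # drop missing values
        if not observed:
            evidence[key] = 0
            continue
        # 'mean' averages the observed values; 'count' counts them.
        score = mean(observed) if agg == "mean" else len(observed)
        keep = score >= threshold if greater else score <= threshold
        evidence[key] = int(keep)
    return evidence


# Hypothetical measurements, with one site reported twice.
toy_rows = [("P1", "Y10", 2.0), ("P1", "Y10", 0.5),
            ("P2", "S5", None), ("P3", "T7", 0.2)]
by_mean = binary_evidence(toy_rows, threshold=1.0, agg="mean")
by_count = binary_evidence(toy_rows, threshold=1.0, agg="count")
```

With ‘mean’ only the duplicated site passes the threshold (its mean is 1.25), whereas with ‘count’ every site that has at least one observed value is kept.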
- get_allowable_threshold(greater=True, agg='mean', min_evidence_size=20, allow_column_loss=False)#
Determine the minimum/maximum threshold that still results in all data columns having evidence
- Parameters:
- greater: bool
whether to use sites greater (True) or less (False) than the threshold
- agg: str
how to combine sites with multiple instances in experiment
- min_evidence_size: int
minimum number of sites required for a data column to be considered for activity calculation
- Returns:
- allowable threshold: float
maximum or minimum threshold that still results in all data columns having evidence (or at least one if min_evidence_size = None)
- get_param_dict(params_to_ignore=['network_sizes', 'pregenerated_experiments_path', 'mann_whitney'])#
Get a dictionary of important parameters needed to reinstantiate the KSTAR object
- get_random_activities(num_random_experiments=150, use_pregenerated_random_activities=None, default_pregen_only=False, save_new_random_activities=None, custom_pregenerated_path=None, save_random_experiments=None, require_pregenerated=False, max_diff_from_pregenerated=0.25, min_dataset_size_for_pregenerated=150, PROCESSES=1)#
Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.
- Parameters:
- num_random_experiments: int, optional
Number of random experiments to generate, by default 150.
- use_pregenerated_random_activities: bool, optional
Whether to use pre-generated data, by default None and will use configuration value.
- default_pregen_only: bool, optional
Whether to only use the default pregenerated data found in the network directory folder, by default False.
- save_new_random_activities: bool, optional
Whether to save new pregenerated data, by default None and will use configuration value.
- custom_pregenerated_path: str, optional
Directory to save new precomputed data, by default None and will use configuration value.
- save_random_experiments: bool, optional
Whether to save the generated random experiments, by default None and will use configuration value.
- require_pregenerated: bool, optional
Whether to require using pre-generated data for all datasets, by default False. This will ensure fast run times, but may result in some datasets not being processed if they do not have matching pre-generated data (most commonly due to smaller samples).
- max_diff_from_pregenerated: float, optional
Maximum allowed difference in size between the dataset and pregenerated data to use pregenerated data, by default 0.25.
- min_dataset_size_for_pregenerated: int, optional
Minimum dataset size required to use pregenerated data, by default 150.
- PROCESSES: int, optional
Number of processes to use for parallel computation, by default 1.
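The two size-related parameters suggest a simple eligibility check before pregenerated activities are reused. The sketch below is one plausible reading: the helper name, the list of available pregenerated sizes, and measuring the difference relative to the dataset size are all assumptions.

```python
def pick_pregenerated_size(dataset_size, pregenerated_sizes,
                           max_diff=0.25, min_dataset_size=150):
    """Return the closest pregenerated size that is usable, or None."""
    if dataset_size < min_dataset_size:
        return None                         # dataset too small for pregenerated data
    closest = min(pregenerated_sizes,
                  key=lambda size: abs(size - dataset_size),
                  default=None)
    if closest is None:
        return None                         # nothing pregenerated at all
    relative_diff = abs(closest - dataset_size) / dataset_size
    return closest if relative_diff <= max_diff else None


# Hypothetical pregenerated evidence sizes.
usable = pick_pregenerated_size(180, [100, 200, 400])
```

A None result corresponds to the case where random activities would have to be generated from scratch, or where the dataset would be skipped if require_pregenerated=True.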
- get_run_information_content()#
Retrieve network information from RUN_INFORMATION.txt based on phospho_type.
Reads the RUN_INFORMATION.txt file from the appropriate network directory based on the phospho_type (‘Y’ or ‘ST’). The file contains network configuration details including unique ID, date, network specifications, and compendia counts.
- Returns:
- Contents of RUN_INFORMATION.txt if found.
- ‘RUN_INFORMATION.txt file not found.’ if the file doesn’t exist.
- make_dotplot(include_evidence_sizes=True, **kwargs)#
Create a dotplot of the kinase activity results
- Parameters:
- include_evidence_sizes: bool
Whether to include evidence sizes in the dotplot
- **kwargs
Additional keyword arguments to pass to the DotPlot initialization and make_complete_dotplot methods
- make_summary_pdf(regenerate_plots=False)#
Create a summary PDF of the kinase activity results
- Parameters:
- regenerate_plots: bool
Whether to regenerate plots even if they already exist
- recommend_threshold(desired_evidence_size=None, max_similarity=0.7, consider_size=True, consider_similarity=True, min_threshold=-inf, max_threshold=inf, step=0.1, pick_best_size_by='median', pick_best_similarity_by='max', greater=True, agg='mean', min_evidence_size=20, allow_column_loss=False)#
Recommend a threshold, one based on desired evidence size and one based on maximum average Jaccard similarity between samples. Will report the characteristics of the resulting evidences for both thresholds
- Parameters:
- desired_evidence_size: int
target evidence size to use when recommending threshold
- max_similarity: float
maximum average Jaccard similarity between samples to use when recommending threshold. Default is 0.7
- consider_size: bool
whether to consider evidence size when recommending threshold
- consider_similarity: bool
whether to consider similarity between data columns when recommending threshold
- min_threshold: float
minimum threshold to consider when recommending threshold. Must be provided if greater = True. Default is -infinity
- max_threshold: float
maximum threshold to consider when recommending threshold. Must be provided if greater = False. Default is infinity
- step: float
step size to use when iterating through thresholds
- pick_best_size_by: str
method to use when aggregating evidence size values across samples, recommended to be either ‘min’, ‘max’, or ‘median’
- pick_best_similarity_by: str
method to use when aggregating Jaccard similarity values across samples, recommended to be either ‘max’ or ‘median’
- greater: bool
whether to use sites greater (True) or less (False) than the threshold
- agg: str
how to combine sites with multiple instances in experiment
- min_evidence_size: int
minimum number of sites required for a data column to be considered for activity calculation
- allow_column_loss: bool
whether to allow some data columns to be lost when recommending threshold based on size. If False, will raise an error if min/max thresholds provided result in loss of any data columns
- Returns:
- float
recommended threshold value
- set_data_columns(data_columns=None)#
Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, data_columns is set to all columns that start with data:
Checks all set columns to make sure they are valid after filtering evidence
- test_threshold(threshold, agg='mean', greater=True, plot=False, return_evidence_sizes=False, min_evidence_size=0)#
Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).
- Parameters:
- threshold: float
cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.
- agg: str
how to combine sites with multiple instances in experiment
- greater: bool
whether to use sites greater (True) or less (False) than the threshold
- plot: bool
whether to plot a histogram of the evidence sizes used and heatmap of Jaccard similarity between samples
- return_evidence_sizes: bool
indicates whether to return the evidence sizes for all samples or not
- min_evidence_size: int
minimum number of sites required for a data column to be considered for activity calculation
- Returns:
- Outputs the minimum, maximum, and median evidence sizes across all samples. May return evidence sizes of all samples as pandas series
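To make the idea of evidence size concrete, here is a minimal pure-Python sketch. It is not KSTAR's implementation: it assumes a toy experiment stored as a dict of sample name to (site, value) pairs, and the helper names `aggregate_mean` and `evidence_sizes` are hypothetical. Duplicate sites are combined by their mean (analogous to agg='mean'), and a site counts as evidence when its value is strictly greater than the threshold.

```python
from statistics import mean, median

def aggregate_mean(measurements):
    """Combine repeated measurements of the same site by their mean."""
    grouped = {}
    for site, value in measurements:
        grouped.setdefault(site, []).append(value)
    return {site: mean(values) for site, values in grouped.items()}

def evidence_sizes(experiment, threshold, greater=True):
    """Count, per sample, the sites whose aggregated value passes the threshold."""
    sizes = {}
    for sample, measurements in experiment.items():
        site_values = aggregate_mean(measurements)
        if greater:
            kept = [s for s, v in site_values.items() if v > threshold]
        else:
            kept = [s for s, v in site_values.items() if v < threshold]
        sizes[sample] = len(kept)
    return sizes

# Toy data (hypothetical site names); P1_Y10 is measured twice in the control.
toy_experiment = {
    "data:control": [("P1_Y10", 0.5), ("P1_Y10", 2.5), ("P2_Y33", 0.2), ("P3_Y7", 3.0)],
    "data:treated": [("P1_Y10", 0.1), ("P2_Y33", 4.0), ("P3_Y7", 1.2), ("P4_Y99", 2.2)],
}

sizes = evidence_sizes(toy_experiment, threshold=1.0)
# Minimum, maximum, and median evidence size across samples
summary = (min(sizes.values()), max(sizes.values()), median(sizes.values()))
print(sizes, summary)
```

Raising the threshold shrinks every sample's evidence set, so the three summary values give a quick view of whether a candidate threshold leaves enough sites per sample.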
- test_threshold_range(min_threshold, max_threshold, step=0.1, agg='mean', greater=True, min_evidence_size=0, desired_evidence_size=None, show_recommended=False)#
Given a range of threshold values, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment) and Jaccard similarity between samples at each threshold
- Parameters:
- min_threshold: float
minimum cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.
- max_threshold: float
maximum cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.
- step: float
step size to use when iterating through threshold range
- agg: str
how to combine sites with multiple instances in experiment
- greater: bool
whether to use sites greater (True) or less (False) than the threshold
- min_evidence_size: int
minimum number of sites required for a data column to be considered for activity calculation
- desired_evidence_size: int or None
target evidence size to use for plotting. If None, will use 150 for phospho_type ‘Y’ and 1500 for phospho_type ‘ST’
- show_recommended: bool
whether to show recommended evidence size and similarity lines on the plots
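The trade-off explored by a threshold sweep can be sketched in plain Python. The code below is an illustration under stated assumptions, not the library's code: samples are dicts of site to value, the functions `evidence_set`, `jaccard`, and `sweep` are hypothetical, and the similarity reported at each threshold is the maximum pairwise Jaccard index.

```python
from itertools import combinations

def evidence_set(site_values, threshold):
    """Sites with a value strictly greater than the threshold."""
    return {site for site, value in site_values.items() if value > threshold}

def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sweep(samples, min_threshold, max_threshold, step):
    """Return (threshold, evidence sizes, max pairwise Jaccard) for each step."""
    results = []
    n_steps = int(round((max_threshold - min_threshold) / step))
    for i in range(n_steps + 1):
        threshold = round(min_threshold + i * step, 10)
        sets = {name: evidence_set(values, threshold) for name, values in samples.items()}
        sizes = {name: len(s) for name, s in sets.items()}
        sims = [jaccard(sets[a], sets[b]) for a, b in combinations(sets, 2)]
        results.append((threshold, sizes, max(sims) if sims else 0.0))
    return results

# Toy samples with hypothetical site names
samples = {
    "A": {"s1": 0.4, "s2": 1.6, "s3": 2.4, "s4": 0.9},
    "B": {"s1": 1.2, "s2": 1.8, "s3": 0.3, "s4": 2.6},
}
rows = sweep(samples, 0.0, 2.0, 1.0)
for threshold, sizes, similarity in rows:
    print(threshold, sizes, similarity)
```

In this toy case a threshold of 0 keeps every site in both samples (similarity 1.0), while stricter thresholds make the evidence sets both smaller and more distinct.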
Master Functions for Running KSTAR Pipeline#
- kstar.calculate.Mann_Whitney_analysis(kinact_dict, PROCESSES=1)#
For a kinact_dict where random generation and activity calculation have already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a p-value, given observations of a random bootstrapping analysis
- Parameters:
- kinact_dict: dictionary
A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’
- PROCESSES: int
number of processes to use for parallel computation, by default 1.
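The two quantities named above can be illustrated with a small pure-Python sketch. This is a conceptual example only and makes no claim about how KSTAR computes them internally: `mann_whitney_u` and `empirical_fpr` are hypothetical helpers, the U statistic is computed by direct pair counting, and the false positive rate is taken as the fraction of random results at least as extreme as the observed one.

```python
def mann_whitney_u(x, y):
    """Number of (x_i, y_j) pairs with x_i < y_j, counting ties as 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

def empirical_fpr(observed, random_values):
    """Fraction of random values that are less than or equal to the observed value."""
    return sum(1 for r in random_values if r <= observed) / len(random_values)

# p-values for one kinase across networks: real data vs. a random experiment (toy numbers)
real_pvals = [0.01, 0.02, 0.03]
random_pvals = [0.2, 0.4, 0.02, 0.6]
u = mann_whitney_u(real_pvals, random_pvals)

# Observed significance compared with significances from random experiments (toy numbers)
fpr = empirical_fpr(0.05, [0.5, 0.01, 0.3, 0.8, 0.04])
print(u, fpr)
```

A U value close to its maximum, `len(x) * len(y)`, means the real p-values are almost always smaller than the random ones.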
- kstar.calculate.enrichment_analysis(experiment, odir, name='experiment', phospho_types=['Y', 'ST'], data_columns=None, agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, allow_column_loss=True, kinases=None, PROCESSES=1, **kwargs)#
Function to establish a kstar KinaseActivity object from an experiment with an activity log, add the networks, then calculate, aggregate, and summarize the hypergeometric enrichment into a final activity object. Should be followed by randomized_analysis, then Mann_Whitney_analysis.
- Parameters:
- experiment: pandas df
experiment dataframe that has been mapped, includes KSTAR_SITE, KSTAR_ACCESSION, etc.
- odirstr
path to where you would like logger and output saved
- namestr
name to use for outputs
- phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}
Which substrate/kinase type to run activity for: Both [‘Y’, ‘ST’] (default), Tyrosine [‘Y’], or Serine/Threonine [‘ST’]
- data_columnslist
columns that represent experimental results; if None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns
- agg{‘count’, ‘mean’}
method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and adds if values are non-NaN; ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide.
NA values are dropped from consideration.
- thresholdfloat or dict
threshold value used to filter rows. If provided as a dictionary, keys should be ‘Y’ and/or ‘ST’ with float values for each phospho_type.
- evidence_sizeint or dict
size of evidence to use for filtering. If provided as a dictionary, keys should be ‘Y’ and/or ‘ST’ with int values for each phospho_type. Will override threshold if both provided.
- min_evidence_sizeint
minimum size of evidence to run kinase activity on. Default 0, meaning any data column with at least one site will be run on
- greater: Boolean
whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)
- PROCESSESint
number of processes to use for parallel computation, by default 1.
- **kwargs
Additional keyword arguments to pass to the KinaseActivity class
- Returns:
- kinactDict: dictionary of Kinase Activity Objects
Outer keys are the phosphoTypes run (‘Y’ and ‘ST’). Includes the activities dictionary (see calculate_kinase_activities), aggregation of activities across networks (see aggregate activities), and activity summary (see summarize_activities)
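As background for the hypergeometric enrichment mentioned above, the following sketch computes an upper-tail hypergeometric p-value with the standard library. It illustrates the statistical test only: the function name `hypergeom_upper_tail` and the mapping of its arguments to network sites, kinase substrates, and evidence sites are assumptions for illustration, not a description of KSTAR's internals.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total

# Toy numbers: 10 sites in the network, 4 are substrates of the kinase,
# 3 sites were selected as evidence, and 2 of those are substrates.
p_value = hypergeom_upper_tail(k=2, population=10, successes=4, draws=3)
print(round(p_value, 4))
```

A small p-value indicates that the evidence set contains more of the kinase's substrates than expected by chance.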
- kstar.calculate.randomized_analysis(kinact_dict, **kwargs)#
Perform randomized analysis on kinase activity data.
- Parameters:
- kinact_dictdict
Dictionary containing kinase activity data.
- kwargskeyword arguments
Additional keyword arguments for random activity generation passed to KinaseActivity.get_random_activities method. These can include:
- num_random_experimentsint, optional
Number of random experiments to generate, by default 150.
- use_pregen_databool, optional
Whether to use pre-generated data, by default False.
- max_diff_from_pregeneratedfloat, optional
Maximum fractional difference allowed from pre-generated data, by default 0.25.
- min_dataset_size_for_pregeneratedint, optional
Minimum dataset size to use pre-generated data, by default 150.
- default_pregen_onlybool, optional
Whether to only use default pre-generated data (and not any activities in custom path), by default False.
- require_pregeneratedbool, optional
Whether to require pre-generated data, by default False. This will ensure fast performance, but may result in some data columns being dropped
- custom_pregenerated_pathstr, optional
Directory to save new precomputed data, by default None.
- save_random_experimentsbool, optional
Whether to save the generated random experiments, by default None.
- save_new_random_activitiesbool, optional
Whether to save new precomputed data, by default None.
- PROCESSESint, optional
Number of processes to use for parallel computation, by default 1.
- Returns:
- None
- kstar.calculate.run_kstar_analysis(experiment, odir, name='experiment', phospho_types=['Y', 'ST'], data_columns=None, threshold=1.0, evidence_size=None, greater=True, save_output=True, PROCESSES=1, **kwargs)#
Given a mapped experiment, run the KSTAR analysis pipeline.
- Parameters:
- experiment: DataFrame
Mapped experiment data
- odir: string
Output directory
- name: string
Name of the experiment
- phospho_types: list
List of phospho types to analyze
- network_dir: string
Directory containing network data
- data_columns: list
Columns to use from the data
- agg: string
Aggregation method
- threshold: float
Threshold for analysis
- evidence_size: int
Size of evidence
- greater: bool
Whether to use greater comparison
- PROCESSES: int
Number of processes to use
- **kwargs
Additional keyword arguments for enrichment_analysis, randomized_analysis, and save_kstar functions.
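The stages documented above can also be chained by hand. The sketch below uses only the function names and arguments listed in this reference; the wrapper `run_pipeline_stepwise` is hypothetical, and it assumes that Mann_Whitney_analysis updates the kinact objects in place, since no return value is documented. The import is placed inside the function so the sketch can be defined even where kstar is not installed.

```python
def run_pipeline_stepwise(experiment, odir, name="experiment", processes=1):
    """Run enrichment, randomization, and Mann-Whitney in order, then save."""
    # Lazy import: requires the kstar package at call time only.
    from kstar import calculate

    # Step 1: hypergeometric enrichment on tyrosine sites above the threshold
    kinact_dict = calculate.enrichment_analysis(
        experiment, odir, name=name, phospho_types=["Y"],
        threshold=1.0, greater=True, PROCESSES=processes,
    )
    # Step 2: random experiments (documented as returning None)
    calculate.randomized_analysis(
        kinact_dict, num_random_experiments=150, PROCESSES=processes
    )
    # Step 3: compare real and random results (assumed to update kinact_dict in place)
    calculate.Mann_Whitney_analysis(kinact_dict, PROCESSES=processes)
    # Step 4: save the minimal set of files
    calculate.save_kstar(kinact_dict, name, odir)
    return kinact_dict
```

Running the stages separately is mainly useful when one stage needs different settings or when intermediate objects should be inspected; otherwise run_kstar_analysis covers the same sequence in one call.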
Functions for Saving and Loading KSTAR results#
- kstar.calculate.from_kstar(name, odir, ftype='tsv')#
Given the name and output directory of a saved kstar analysis, load the parameters and minimum dataframes needed for reinstantiating a kinact object. This minimum list will allow you to repeat normalization or Mann-Whitney at a different false positive rate threshold and plot results.
- Parameters:
- name: string
The name used when saving activities and mapped data
- odir: string
Output directory of saved files and parameter pickle
- kstar.calculate.from_kstar_nextflow(name, odir, log=None)#
Given the name and output directory of a saved kstar analysis from the nextflow pipeline, load the results into a new kinact object with the minimum dataframes required for analysis (binary experiment, hypergeometric activities, normalized activities, Mann-Whitney activities)
- Parameters:
- name: string
The name used when saving activities and mapped data
- odir: string
Output directory of saved files
- log: logger
logger used when loading nextflow data into kinase activity object. If not provided, new logger will be created.
- kstar.calculate.save_kstar(kinact_dict, name, odir, minimal=True, ftype='tsv', param_format='json')#
Having performed kinase activities (run_kstar_analysis), save each of the important dataframes, minimizing the memory storage needed to get back to a rebuilt version for plotting results and analysis. For each phospho_type in the kinact_dict, at a minimum, this will save the binarized evidence, Mann-Whitney activities and fpr dataframes, and parameters used during the run. If you would like to save all files (hypergeometric and random enrichment intermediate files), set minimal = False.
- Parameters:
- kinact_dict: dictionary of Kinase Activity Objects
Outer keys are the phosphoTypes run (‘Y’ and ‘ST’). Includes the activities dictionary (see calculate_kinase_activities), aggregation of activities across networks (see aggregate activities), and activity summary (see summarize_activities)
- name: string
The name to use when saving activities
- odir: string
Output directory to save files and pickle to
- minimal: bool
Whether to save only minimal files or all intermediate files
- ftype: {‘tsv’, ‘csv’}
Format to save dataframes in, either tsv or csv
- param_format: {‘pickle’, ‘json’}
Format to save parameter dictionary in, either pickle or json. Json is recommended for easier human readability
- Returns:
- Nothing
Plotting/Analysis Functions#
The “DotPlot” class#
- class kstar.plot.DotPlot(values, fpr, alpha=0.05, inclusive_alpha=True, binary_sig=True, dotsize=5, colormap={0: '#6b838f', 1: '#FF3300'}, facecolor='white', legend_title='-log10(p-value)', size_number=5, size_color='gray', color_title='Significant', markersize=10, legend_distance=1.0, figsize=(4, 8), title=None, xlabel=True, ylabel=True, x_label_dict=None, kinase_dict=None)#
The DotPlot class is used for plotting dotplots, with the option to add clustering and context plots. The size of each dot is based on the values dataframe: the area of the dot is the value * dotsize
- Parameters:
- values: pandas DataFrame instance
values to plot
- fprpandas DataFrame instance
false positive rates associated with values being plotted
- alpha: float, optional
fpr value that defines the significance cutoff to use when plotting. default : 0.05
- inclusive_alpha: boolean
whether to include the alpha (significance <= alpha), or not (significance < alpha). default: True
- binary_sig: boolean, optional
indicates whether to color dots by binary significance (True) or by a color hue that changes with the fpr value (False). default : True
- dotsizefloat, optional
multiplier to use for scaling size of dots
- colormapdict, optional
maps color values to actual color to use in plotting default : {0: ‘#6b838f’, 1: ‘#FF3300’}
- labelmap: dict
maps labels of colors, default is to indicate FPR cutoff in legend default : None
- facecolorcolor, optional
Background color of dotplot default : ‘white’
- legend_titlestr, optional
Legend Title for dot sizes, default is ‘-log10(p-value)’
- size_numberint, optional
Number of dots to attempt to generate for dot size legend
- size_colorcolor, optional
Size Legend Color to use
- color_titlestr, optional
Legend Title for the Color Legend
- markersizeint, optional
Size of dots for Color Legend
- legend_distanceint, optional
relative distance to place legends
- figsizetuple, optional
size of dotplot figure
- titlestr, optional
Title of dotplot
- xlabelbool, optional
Show xlabel on graph if True
- ylabelbool, optional
Show ylabel on graph if True
- x_label_dict: dict, optional
Mapping dictionary of labels as they appear in values dataframe (keys) to how they should appear on plot (values)
- kinase_dict: dict, optional
Mapping dictionary of kinase names as they appear in values dataframe (keys) to how they should appear on plot (values)
- Attributes:
- values: pandas dataframe
a copy of the original values dataframe
- fpr: pandas dataframe
a copy of the original fpr dataframe
- alpha: float
cutoff used for significance, default 0.05
- inclusive_alpha: boolean
whether to include the alpha (significance <= alpha), or not (significance < alpha)
- significance: pandas dataframe
indicates whether a particular kinase’s activity is significant, where fpr <= alpha is significant, otherwise it is insignificant
- colors: pandas dataframe
dataframe indicating the color to use when plotting: either a copy of the fpr or significance dataframe
- binary_sig: boolean
indicates whether coloring will be done based on binary significance or fpr values. Default True
- labelmap: dict
indicates how to label each significance color
- figsize: tuple
size of the outputted figure, which is overridden if axes is provided for dotplot
- title: string
title of the dotplot
- xlabel: boolean
indicates whether to plot x-axis labels
- ylabel: boolean
indicates whether to plot y-axis labels
- colormap: dict
colors to be used when plotting
- facecolor: string
background color of dotplot
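The relationship between fpr, alpha, inclusive_alpha, colormap, and dotsize described above can be sketched without any plotting library. This is an illustration of the documented rules using nested dicts in place of dataframes; the helper `significance_table` and the toy kinase values are hypothetical.

```python
def significance_table(fpr, alpha=0.05, inclusive_alpha=True):
    """Return 1 where fpr passes the cutoff and 0 otherwise, per kinase and sample."""
    table = {}
    for kinase, row in fpr.items():
        table[kinase] = {
            sample: int(value <= alpha) if inclusive_alpha else int(value < alpha)
            for sample, value in row.items()
        }
    return table

colormap = {0: "#6b838f", 1: "#FF3300"}  # default colormap from the signature above
dotsize = 5                               # default dot size multiplier

# Toy inputs: false positive rates and the values that set dot sizes
fpr = {"SRC": {"s1": 0.05, "s2": 0.20}, "ABL1": {"s1": 0.01, "s2": 0.05}}
values = {"SRC": {"s1": 2.0, "s2": 0.5}, "ABL1": {"s1": 3.0, "s2": 1.5}}

inclusive = significance_table(fpr)                          # fpr <= alpha
exclusive = significance_table(fpr, inclusive_alpha=False)   # fpr < alpha
colors = {k: {s: colormap[flag] for s, flag in row.items()} for k, row in inclusive.items()}
areas = {k: {s: v * dotsize for s, v in row.items()} for k, row in values.items()}
print(inclusive, exclusive, colors, areas)
```

The only difference between the two tables is how entries exactly equal to alpha are treated, which matters when fpr values are computed on a coarse grid.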
Methods
cluster(ax[, method, metric, orientation, ...])Performs hierarchical clustering on data and plots result to provided Axes.
context(ax, info, id_column, context_columns)Context plot is generated and returned.
dotplot([ax, orientation, size_legend, ...])Generates the dotplot plot, where size is determined by values dataframe and color is determined by significant dataframe
drop_kinases(kinase_list)Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object.
drop_kinases_with_no_significance()Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant
evidence_count(ax, binary_evidence[, ...])Add bars to dotplot indicating the total number of sites used as evidence in activity calculation
make_complete_dotplot([kinases_to_plot, ...])Master function for creating a comprehensive dotplot visualization, which automatically creates any necessary subplots
set_colors([labelmap])Set colors for the plot based on significance or false positive rate.
set_column_labels
set_index_labels
set_values
setup_figure
- cluster(ax, method='single', metric='euclidean', orientation='top', color_threshold=-inf)#
Performs hierarchical clustering on data and plots result to provided Axes. result and significant dataframes are ordered according to clustering
- Parameters:
- axmatplotlib Axes instance
Axes to plot dendrogram to
- methodstr, optional
The linkage algorithm to use.
- metricstr or function, optional
The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.
- orientationstr, optional
The direction to plot the dendrogram, which can be any of the following strings: ‘top’: Plots the root at the top, and plot descendent links going downwards. (default). ‘bottom’: Plots the root at the bottom, and plot descendent links going upwards. ‘left’: Plots the root at the left, and plot descendent links going right. ‘right’: Plots the root at the right, and plot descendent links going left.
- context(ax, info, id_column, context_columns, dotsize=200, markersize=20, orientation='left', color_palette='colorblind', margin=0.2, make_legend=True, **kwargs)#
Context plot is generated and returned. The context plot contains the categorical data used for describing the data.
- Parameters:
- axmatplotlib axis
where to map subtype information to
- infopandas df
Dataframe where context information is pulled from
- id_column: str
Column used to map the subtype information to
- context_columnslist
list of columns to pull context information from
- dotsizeint, optional
size of context dots
- markersize: int, optional
size of legend markers
- orientationstr, optional
orientation of the context plot, which determines where legends are placed. Options: ‘left’, ‘right’, ‘top’, ‘bottom’
- color_palettestr, optional
seaborn color palette to use
- margin: float, optional
margin
- make_legendbool, optional
whether to create legend for context colors
- dotplot(ax=None, orientation='left', size_legend=True, color_legend=True, max_size=None, **kwargs)#
Generates the dotplot plot, where size is determined by values dataframe and color is determined by significant dataframe
- Parameters:
- axmatplotlib Axes instance, optional
axes dotplot will be plotted on. If None then new plot generated
- orientationstr, optional
orientation to place legends, either ‘left’ or ‘right’
- size_legendbool, optional
whether to include size legend (indicates meaning of dot size/activity)
- color_legendbool, optional
whether to include color legend (indicates significance)
- max_sizeint, optional
maximum size value to use when generating size legend. If None, automatic legend generated
- Returns:
- axmatplotlib Axes instance
Axes containing the dotplot
- drop_kinases(kinase_list)#
Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object. Removal is in place
- Parameters:
- kinase_list: list
list of kinase names to remove
- drop_kinases_with_no_significance()#
Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant
- evidence_count(ax, binary_evidence, plot_type='bars', phospho_type=None, dot_size=1, include_recommendations=False, ideal_min=None, recommended_min=None, dot_colors=None, bar_line_colors=None)#
Add bars to dotplot indicating the total number of sites used as evidence in activity calculation
- Parameters:
- ax: axes object
where to plot the bars
- binary_evidence: pandas dataframe
binarized dataframe produced during activity calculation (threshold applied to original experiment)
- make_complete_dotplot(kinases_to_plot=None, cluster_samples=False, cluster_kinases=False, sort_kinases_by=None, sort_samples_by=None, binary_evidence=None, context=None, significant_kinases_only=True, show_xtick_labels=True, **kwargs)#
Master function for creating a comprehensive dotplot visualization, which automatically creates any necessary subplots
- Parameters:
- kinases_to_plotlist or None, optional
List of kinases to include in the plot. If None, all kinases are included.
- cluster_samplesbool, optional
Whether to cluster samples in the plot.
- cluster_kinasesbool, optional
Whether to cluster kinases in the plot.
- significant_kinases_onlybool, optional
Whether to include only significant kinases in the plot.
- sort_samples_bystr or None, optional
Kinase Column to sort samples by in the plot based on kinase activities. If cluster_samples=True, this will be ignored.
- sort_kinases_bystr or None, optional
Sample Column to sort kinases by in the plot based on kinase activities. If cluster_kinases=True, this will be ignored.
- binary_evidencepd.DataFrame or None
Binary evidence dataframe from KSTAR analysis. If provided, will calculate the number of sites used as evidence in each sample and plot this.
- contextpd.DataFrame or None, optional
Context dataframe providing additional sample information for plotting. If provided, must include an ‘id_column’ for unique sample identifiers and list ‘context_columns’ for context information.
- show_xtick_labelsbool, optional
Whether to show x-axis tick labels in the dotplot.
- **kwargs
Additional keyword arguments passed to plotting functions, like matplotlib.pyplot.scatter, DotPlot.context, DotPlot.dotplot, DotPlot.cluster, and DotPlot.evidence_count
- set_colors(labelmap=None)#
Set colors for the plot based on significance or false positive rate.
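A typical way to combine the constructor and make_complete_dotplot, based only on the signatures listed above, might look like the following. The wrapper `plot_activities` is hypothetical, `values` and `fpr` are assumed to be pandas DataFrames with matching rows and columns, and the import is deferred so the sketch can be defined without kstar installed.

```python
def plot_activities(values, fpr, binary_evidence=None, alpha=0.05):
    """Build a DotPlot and draw the complete figure with default subplots."""
    # Lazy import: requires the kstar package at call time only.
    from kstar import plot

    dots = plot.DotPlot(values, fpr, alpha=alpha, figsize=(4, 8))
    # Passing binary_evidence adds the evidence-size panel described above.
    dots.make_complete_dotplot(
        binary_evidence=binary_evidence,
        significant_kinases_only=True,
    )
    return dots
```

Returning the DotPlot object keeps it available for later calls such as drop_kinases or set_colors.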
The “KSTAR_PDF” class#
- class kstar.plot.KSTAR_PDF(activities, fpr, odir, name, binarized_experiment, param_dict)#
Class to generate a PDF report from KSTAR analysis results, built on the fpdf2 module
- Parameters:
- activitiespandas DataFrame
DataFrame of mann whitney kinase activities
- fprpandas DataFrame
DataFrame of false positive rates corresponding to activities
- odirstr
Output directory for saving the PDF report
- namestr
Name of the experiment/run, used for file naming
- binarized_experimentpandas DataFrame
Binarized experiment indicating which sites were used as evidence in each column
- param_dictdict
Dictionary of parameters used in the KSTAR run
- Attributes:
- MARKDOWN_LINK_COLOR
accept_page_breakWhenever a page break condition is met, this @property method is called, and the break is issued or not depending on the returned value.
- char_spacing
char_vposReturn vertical character position relative to line.
- current_font
- current_font_is_set_on_page
- dash_pattern
default_page_dimensionsReturn a pair (width, height) in the unit specified to FPDF constructor
denom_liftReturn lift factor for denominator text.
denom_scaleReturn scale factor for denominator text.
- draw_color
emphasisThe current text emphasis: bold, italics, underline and/or strikethrough.
ephEffective page height: the page height minus its vertical margins.
epwEffective page width: the page width minus its horizontal margins.
- fill_color
- font_family
- font_size
- font_size_pt
- font_stretching
- font_style
- fonts
- is_ttf_font
- line_width
nom_liftReturn lift factor for nominator text.
nom_scaleReturn scale factor for nominator text.
- output_intents
- page_layout
- page_mode
pages_countReturns the total pages of the document, at the time it is called.
- strikethrough
sub_liftReturn lift factor for subscript text.
sub_scaleReturn scale factor for subscript text.
sup_liftReturn lift factor for superscript text.
sup_scaleReturn scale factor for superscript text.
- text_color
- text_mode
- text_shaping
- underline
Methods
HTML2FPDF_CLASSalias of HTML2FPDF
add_action(action, x, y, w, h, **kwargs)Puts an Action annotation on a rectangular area of the page.
add_font([family, style, fname, ...])Imports a TrueType or OpenType font and makes it available for later calls to the FPDF.set_font() method.
add_link([y, x, page, zoom, name])Creates a new internal link and returns its identifier.
add_output_intent(subtype[, ...])Adds desired Output Intent to the Output Intents array:
add_page([orientation, format, same, ...])Adds a new page to the document.
add_text_markup_annotation(type, text, ...)Adds a text markup annotation on some quadrilateral areas of the page.
alias_nb_pages([alias])Defines an alias for the total number of pages.
arc(x, y, a, start_angle, end_angle[, b, ...])Outputs an arc.
bezier(point_list[, closed, style])Outputs a quadratic or cubic Bézier curve, defined by three or four coordinates.
cell([w, h, text, border, ln, align, fill, ...])Prints a cell (rectangular area) with optional borders, background color and character string.
circle(x, y, radius[, style])Outputs a circle.
code39(text, x, y[, w, h])Barcode 3of9
Generate a standard activity dotplot for use in the PDF report
dashed_line(x1, y1, x2, y2[, dash_length, ...])Draw a dashed line between two points.
dotplot_page([regenerate_plots])Create a PDF page that includes the KSTAR dotplot figure and information on where to find the figure and underlying data in the output directory
draw_path(path[, debug_stream])Add a pre-constructed path to the document.
draw_vector_glyph(path, font)Add a pre-constructed path to the document.
drawing_context([debug_stream])Create a context for drawing paths on the current page.
ellipse(x, y, w, h[, style])Outputs an ellipse.
elliptic_clip(x, y, w, h)Context manager that defines an elliptic crop zone, useful to render only part of an image.
embed_file([file_path, bytes, basename, ...])Embed a file into the PDF as an attachment (and, for PDF/A-3 or PDF/A-4f, as an Associated File).
evidence_count_plot(data_columns)Creates a barplot showing the number of sites used as evidence in each column of the experiment
evidence_overlap_plot(data_columns)Creates a heatmap showing the Jaccard index of evidence overlap between columns in the experiment
evidence_page([regenerate_plots])Create a PDF page that includes the total number of sites used as evidence for each column and the jaccard similarity of evidence between columns
file_attachment_annotation(file_path, x, y)Puts a file attachment annotation on a rectangular area of the page.
file_id()This method can be overridden in inherited classes in order to define a custom file identifier.
font_face()Return a fpdf.fonts.FontFace instance representing a subset of properties of this GraphicsState.
footer()Override the footer method to add a page number at the bottom center of each page.
free_text_annotation(text[, x, y, w, h])Puts a free text annotation on a rectangular area of the page.
generate([regenerate_plots])Generates the PDF report by creating each page in sequence and saving the final PDF to the output directory
get_fallback_font(char[, style])Returns which fallback font has the requested glyph.
get_named_destination(name)Retrieves a named destination by its name and creates a link to it.
get_page_label()Return the current page fpdf.output.PDFPageLabel.
get_string_width(s[, normalized, markdown])Returns the length of a string in user unit.
get_x()Returns the abscissa of the current position.
get_y()Returns the ordinate of the current position.
glyph_drawing_context()Create a context for drawing paths for type 3 font glyphs, without writing on the current page.
header()Header to be implemented in your own inherited class
highlight(text[, type, color, modification_time])Context manager that adds a single highlight annotation based on the text lines inserted inside its indented block.
image(name[, x, y, w, h, type, link, title, ...])Put an image on the page.
ink_annotation(coords[, text, color, ...])Adds add an ink annotation on the page.
insert_toc_placeholder(render_toc_function)Configure Table Of Contents rendering at the end of the document generation, and reserve some vertical space right now in order to insert it.
interleaved2of5(text, x, y[, w, h])Barcode I2of5 (numeric), adds a 0 if odd length
line(x1, y1, x2, y2)Draw a line between two points.
link(x, y, w, h, link[, alt_text])Puts a link annotation on a rectangular area of the page.
ln([h])Line Feed.
local_context(**kwargs)Creates a local graphics state, which won't affect the surrounding code.
mirror(origin, angle)Method to perform a reflection transformation over a given mirror line.
multi_cell(w[, h, text, border, align, ...])This method allows printing text with line breaks.
new_path([x, y, paint_rule, debug_stream])Create a path for appending lines and curves to.
normalize_text(text)Check that text input is in the correct format/encoding
offset_rendering()All rendering performed in this context is made on a dummy FPDF object.
output([name, linearize, output_producer_class])Output PDF to some destination.
page_no()Get the current page number
polygon(point_list[, fill, style])Outputs a polygon defined by three or more points.
polyline(point_list[, fill, polygon, style])Draws lines between two or more points.
preload_image(name[, dims])Read an image and load it into memory.
rect(x, y, w, h[, style, round_corners, ...])Outputs a rectangle.
rect_clip(x, y, w, h)Context manager that defines a rectangular crop zone, useful to render only part of an image.
regular_polygon(x, y, numSides, polyWidth[, ...])Outputs a regular polygon with n sides It can be rotated Style can also be applied (fill, border...)
rotate(angle[, x, y])
rotation(angle[, x, y])Method to perform a rotation around a given center.
round_clip(x, y, r)Context manager that defines a circular crop zone, useful to render only part of an image.
set_author(author)Defines the author of the document.
set_auto_page_break(auto[, margin])Set auto page break mode, and optionally the bottom margin that triggers it.
set_char_spacing(spacing)Sets horizontal character spacing.
set_compression(compress)Activates or deactivates page compression.
set_creation_date([date])Sets Creation of Date time, or current time if None given.
set_creator(creator)Defines the creator of the document.
set_dash_pattern([dash, gap, phase])Set the current dash pattern for lines and curves.
set_display_mode(zoom[, layout])Defines the way the document is to be displayed by the viewer.
set_doc_option(opt, value)Defines a document option.
set_draw_color(r[, g, b])Defines the color used for all stroking operations (lines, rectangles and cell borders).
set_encryption(owner_password[, ...])Activate encryption of the document content.
set_fallback_fonts(fallback_fonts[, exact_match])Allows you to specify a list of fonts to be used if any character is not available on the font currently set.
set_fill_color(r[, g, b])Defines the color used for all filling operations (filled rectangles and cell backgrounds).
set_font([family, style, size])Sets the font used to print character strings.
set_font_size(size)Configure the font size in points
set_image_filter(image_filter)Args:
set_keywords(keywords)Associate keywords with the document
set_lang(lang)A language identifier specifying the natural language for all text in the document except where overridden by language specifications for structure elements or marked content.
set_left_margin(margin)Sets the document left margin.
set_line_width(width)Defines the line width of all stroking operations (lines, rectangles and cell borders).
set_link([link, y, x, page, zoom, name])Defines the page and position a link points to.
set_margin(margin)Sets the document right, left, top & bottom margins to the same value.
set_margins(left, top[, right])Sets the document left, top & optionally right margins to the same value.
set_page_background(background)Sets a background color or image to be drawn every time FPDF.add_page() is called, or removes a previously set background.
set_page_label([label_style, label_prefix, ...])Enable fpdf.output.PDFPageLabel to be inserted on every page.
set_producer(producer)Producer of document
set_right_margin(margin)Sets the document right margin.
set_section_title_styles(level0[, level1, ...])Defines a style for section titles.
set_stretching(stretching)Sets horizontal font stretching.
set_subject(subject)Defines the subject of the document.
set_text_color(r[, g, b])Defines the color used for text.
set_text_shaping([use_shaping_engine, ...])Enable or disable text shaping engine when rendering text.
set_title(title)Defines the title of the document.
set_top_margin(margin)Sets the document top margin.
set_x(x)Defines the abscissa of the current position.
set_xy(x, y)Defines the abscissa and ordinate of the current position.
set_y(y)Moves the current abscissa back to the left margin and sets the ordinate.
sign(key, cert[, extra_certs, hashalgo, ...])Args:
sign_pkcs12(pkcs_filepath[, password, ...])Args:
skew([ax, ay, x, y])Method to perform a skew transformation originating from a given center.
solid_arc(x, y, a, start_angle, end_angle[, ...])Outputs a solid arc.
star(x, y, r_in, r_out, corners[, ...])Outputs a regular star with n corners.
start_section(name[, level, strict])Start a section in the document outline.
summary_page()Create a PDF page that indicates the parameters used in the KSTAR run and the key kinases identified for each column
table(data[, header, column_widths, row_height])Builds a table in the PDF
text(x, y[, text])Prints a character string.
text_annotation(x, y, text[, w, h, name])Puts a text annotation on a rectangular area of the page.
text_columns([text, img, img_fill_width, ...])Establish a layout with multiple columns to fill with text.
- Args:
text (str, optional): A first piece of text to insert.
ncols (int, optional): The number of columns to create. (Default: 1)
gutter (float, optional): The distance between the columns. (Default: 10)
balance (bool, optional): Specify whether multiple columns should end at approximately the same height, if they don't fill the page. (Default: False)
text_align (Align or str, optional): The alignment of the text within the region. (Default: "LEFT")
line_height (float, optional): A multiplier relative to the font size changing the vertical space occupied by a line of text. (Default: 1.0)
l_margin (float, optional): Override the current left page margin.
r_margin (float, optional): Override the current right page margin.
print_sh (bool, optional): Treat a soft-hyphen (\u00ad) as a printable character, instead of a line breaking opportunity. (Default: False)
wrapmode (fpdf.enums.WrapMode, optional): "WORD" for word based line wrapping, "CHAR" for character based line wrapping. (Default: "WORD")
skip_leading_spaces (bool, optional): On each line, any space characters at the beginning will be skipped if True. (Default: False)
top_kinases_table()Constructs a table of the top 5 most active significant kinases per sample and adds it to the PDF page
unbreakable()Ensures that all rendering performed in this context appears on a single page by performing a page break beforehand if need be.
use_font_face(font_face)Sets the provided fpdf.fonts.FontFace in a local context, then restores the font settings to what they were initially.
use_pattern(shading)Create a context for using a shading pattern on the current page.
will_page_break(height)Lets you know if adding an element will trigger a page break, based on its height and the current ordinate (y position).
write([h, text, link, print_sh, wrapmode])Prints text from the current position.
write_html(text, *args, **kwargs)Parse HTML and convert it to PDF.
add_highlight
clear_text_region
is_current_text_region
mapping_page
preload_glyph_image
register_text_region
set_xmp_metadata
use_text_style
x_by_align
- create_dotplot()#
Generate a standard activity dotplot for use in the PDF report
- dotplot_page(regenerate_plots=False)#
Create a PDF page that includes the KSTAR dotplot figure and information on where to find the figure and underlying data in the output directory
- Parameters:
- regenerate_plotsbool, optional
Whether to regenerate the dotplot figure even if it already exists in the output directory
- evidence_count_plot(data_columns)#
Creates a barplot showing the number of sites used as evidence in each column of the experiment
- Parameters:
- data_columnslist
List of column names in the experiment to include in the plot
- evidence_overlap_plot(data_columns)#
Creates a heatmap showing the Jaccard index of evidence overlap between columns in the experiment
- Parameters:
- data_columnslist
List of column names in the experiment to include in the plot
- evidence_page(regenerate_plots=False)#
Create a PDF page that includes the total number of sites used as evidence for each column and the Jaccard similarity of evidence between columns
- Parameters:
- regenerate_plotsbool, optional
Whether to regenerate the evidence plots even if they already exist in the output directory
- footer()#
Override the footer method to add a page number at the bottom center of each page.
- generate(regenerate_plots=False)#
Generates the PDF report by creating each page in sequence and saving the final PDF to the output directory
- Parameters:
- regenerate_plotsbool, optional
Whether to regenerate all plots even if they already exist in the output directory
- summary_page()#
Create a PDF page that indicates the parameters used in the KSTAR run and the key kinases identified for each column
- table(data, header=None, column_widths=40, row_height=5)#
Builds a table in the PDF
- Parameters:
- datapandas DataFrame
DataFrame containing the data to be included in the table
- headerlist, optional
List of header names for the table columns. If None, uses DataFrame column names.
- column_widthsint or list, optional
Width of each column in the table. If an integer is provided, all columns will have the same width. If a list is provided, it should contain the width for each column.
- row_heightint, optional
Height of each row in the table.
- top_kinases_table()#
Constructs a table of the top 5 most active significant kinases per sample and adds it to the PDF page
Downstream Analysis Modules#
- kstar.analysis.interactions.getSubstrateInfluence(networks, kinase, substrate_subset=None)#
Given the pruned networks and a kinase of interest, return the number of networks in which each substrate is connected to that kinase (the ‘substrate influence’ on that kinase’s activity prediction). If a subset of substrates is provided, only that subset is analyzed.
- Parameters:
- networks: dict
dictionary storing all pruned networks used in activity calculation
- kinase: str
name of the kinase of interest: should match the name found in provided networks
- substrate_subset: list
subset of substrates to analyze, indicated by ‘{KSTAR_ACCESSION}_{KSTAR_SITE}’. If None, will return a series containing info on all substrates with at least one connection to the given kinase
- Returns:
- Pandas series indicating the number of networks in which each substrate is connected to the indicated kinase, sorted from the most connections (highest influence) to the least (lowest influence). Sites with no connection will not be included.
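The counting described for getSubstrateInfluence can be sketched as follows. This is a minimal illustration and not KSTAR's implementation: it assumes each pruned network is a plain list of (kinase, substrate) pairs rather than a DataFrame, `substrate_influence` is a hypothetical helper name, and the network contents are toy data.

```python
from collections import Counter

def substrate_influence(networks, kinase, substrate_subset=None):
    """Count, for each substrate, how many networks link it to `kinase`.

    `networks` is assumed to be {network_name: [(kinase, substrate), ...]},
    a simplified stand-in for the pruned-network DataFrames.
    """
    counts = Counter()
    for edges in networks.values():
        # Use a set so a substrate is counted at most once per network.
        linked = {sub for kin, sub in edges if kin == kinase}
        if substrate_subset is not None:
            linked &= set(substrate_subset)
        counts.update(linked)
    # Highest influence first; substrates with no connection never appear.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy networks (illustrative only).
networks = {
    "net0": [("EGFR", "P00533_Y1197"), ("EGFR", "P04626_Y1248")],
    "net1": [("EGFR", "P00533_Y1197"), ("SRC", "P04626_Y1248")],
    "net2": [("EGFR", "P00533_Y1197")],
}
influence = substrate_influence(networks, "EGFR")
```

A substrate that appears in every network for a kinase contributes to every activity calculation for that kinase, which is why the count is read as its influence.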
- kstar.analysis.interactions.getSubstrateInfluence_inExperiment(networks, binary_evidence, kinase, data_cols=None)#
Given the binary evidence used for activity prediction, identify which sites are found across the most networks for a given kinase and each sample.
- Parameters:
- networks: dictionary
dictionary containing all 50 pruned networks used for activity prediction
- binary_evidence: pandas dataframe
binarized dataset (using the same threshold/criteria as the one used for activity prediction)
- kinase: str
name of the kinase to probe
- data_cols: list or None
name of the data columns in binary_evidence to probe. If None, will analyze all columns with ‘data:’ at the start of the column name.
- kstar.analysis.coverage.averageUniqueSubstrates_KSTAR(networks=None)#
Calculate the average number of unique substrates covered by each KSTAR pruned network
- Parameters:
- mod_types: list
list containing which networks to calculate average for. Either [‘Y’], [‘ST’], or [‘Y’,’ST’]
- Returns:
- averageSub: dict
indicates the average number of substrates across all pruned networks for indicated modification types
- kstar.analysis.coverage.experimentCoverage(experiment, networks, mod='Y', exp_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'], net_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'])#
Given an experiment, determine how many of the sites observed in the experiment can be captured by a kinase-substrate network (function was designed for KSTAR pruned networks, but should work with any ks-network that indicates UniProt ID and site number)
- Parameters:
- experiment: pandas dataframe
phosphoproteomic experiment, ideally that has been mapped to KinPred by KSTAR already
- network: pandas dataframe
binarized kinase-substrate network (unweighted), ideally having been mapped to KinPred/KSTAR already
- exp_cols: list
list indicating the columns in experiment dataframe that contain uniprot id and site number
- net_cols: list
list indicating the columns in network dataframe that contain the uniprot id and site number
- Returns:
- fraction_of_sites_covered: dict
indicates the fraction of phosphorylation sites observed in experiment that are also found within the kinase-substrate network, for each modification type (tyrosine, serine/threonine).
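The coverage fraction can be illustrated with plain set arithmetic. The sketch below assumes sites are given as (accession, site) tuples instead of DataFrame columns; `fraction_covered` is a hypothetical name and the sites are toy data.

```python
def fraction_covered(experiment_sites, network_sites):
    """Fraction of observed sites that also appear in the network."""
    observed = set(experiment_sites)
    if not observed:
        return 0.0  # nothing observed, so nothing can be covered
    return len(observed & set(network_sites)) / len(observed)

# Toy (accession, site) pairs, illustrative only.
experiment_sites = [
    ("P00533", "Y1197"),
    ("P00533", "Y1110"),
    ("P04626", "Y1248"),
    ("Q9Y6K9", "Y10"),
]
network_sites = [("P00533", "Y1197"), ("P04626", "Y1248"), ("P12931", "Y419")]
coverage = fraction_covered(experiment_sites, network_sites)
```

The denominator is the number of distinct experimental sites, so network sites that were never observed do not affect the result.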
- kstar.analysis.coverage.getStudyBiasDistribution_InExperiment(binary_experiment, ax=None, figsize=(4, 3), return_dist=False)#
Plot the distribution of study bias within a single phosphoproteomic experiment
- Parameters:
- mapped_experiment: pandas dataframe
phosphoproteomic experiment that has been mapped by KSTAR (contains ‘KSTAR_SITE’,’KSTAR_ACCESSION’, and ‘KSTAR_NUM_COMPENDIA’ columns)
- ax: matplotlib axes object
axis to plot the distribution on. If None, will create subplot
- figsize: tuple
size of matplotlib figure. Default is (4,3)
- return_dist: bool
whether you would like to also return the distribution values. Default is False.
- Returns:
- Histogram plotting the distribution of study bias found in the provided experiment, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.
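The data behind such a histogram is a tally of how many sites are recorded in each number of compendia. A minimal sketch, assuming the ‘KSTAR_NUM_COMPENDIA’ values have already been extracted into a list (`compendia_distribution` is a hypothetical name, and no plotting is done here):

```python
from collections import Counter

def compendia_distribution(num_compendia):
    """Map number of compendia -> number of sites recorded in that many."""
    return dict(sorted(Counter(num_compendia).items()))

# Toy values standing in for a 'KSTAR_NUM_COMPENDIA' column.
dist = compendia_distribution([1, 3, 5, 1, 2, 5, 5])
```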
- kstar.analysis.coverage.getStudyBiasDistribution_InPhosphoproteome(mod_type='Y', ax=None, figsize=(4, 3), return_dist=False)#
Plot the distribution of study bias across the reference phosphoproteome
- Parameters:
- mod_type: str
indicates which modification type, tyrosine (‘Y’) or serine/threonine (‘ST’), you would like to plot. Default is ‘Y’
- ax: matplotlib axes object
axis to plot the distribution on. If None, will create subplot
- figsize: tuple
size of matplotlib figure. Default is (4,3)
- return_dist: bool
whether you would like to also return the distribution values. Default is False.
- Returns:
- Histogram plotting the distribution of study bias found in overall phosphoproteome, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.
- kstar.analysis.coverage.getStudyBiasDistribution_InSample(binary_experiment, data_column, ax=None, figsize=(4, 3), return_dist=False)#
Plot the distribution of study bias within a single sample (data_column) of a phosphoproteomic experiment
- Parameters:
- mapped_experiment: pandas dataframe
phosphoproteomic experiment that has been mapped by KSTAR (contains ‘KSTAR_SITE’,’KSTAR_ACCESSION’, and ‘KSTAR_NUM_COMPENDIA’ columns)
- ax: matplotlib axes object
axis to plot the distribution on. If None, will create subplot
- figsize: tuple
size of matplotlib figure. Default is (4,3)
- return_dist: bool
whether you would like to also return the distribution values. Default is False.
- Returns:
- Histogram plotting the distribution of study bias found in the provided experiment, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.
- kstar.analysis.coverage.numUniqueSubstrates(networks, acc_col='KSTAR_ACCESSION', site_col='KSTAR_SITE')#
Given a KSTAR network(s), return the number of unique substrates within the network (across all kinases). If a dictionary of multiple pruned networks is provided, will calculate the total number of unique substrates across ALL networks.
- Parameters:
- network: pandas dataframe or dict of pandas dataframes
pruned KSTAR network, or dictionary containing multiple pruned networks
- acc_col: str
name of column in network dataframe which indicates UniProt ID of substrates
- site_col: str
name of column in network dataframe which indicates residue and site number (e.g. Y1197)
- Returns:
- Number of unique substrates within network(s)
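Counting unique substrates amounts to collecting distinct (accession, site) pairs, ignoring which kinase each row belongs to. The sketch below assumes rows are (accession, site, kinase) tuples rather than DataFrame rows; `num_unique_substrates` is a hypothetical name and the rows are toy data.

```python
def num_unique_substrates(networks):
    """Count distinct (accession, site) pairs.

    `networks` is either a list of (accession, site, kinase) rows or a
    dict of such lists; a dict is pooled across ALL of its networks.
    """
    if isinstance(networks, dict):
        rows = [row for net in networks.values() for row in net]
    else:
        rows = list(networks)
    return len({(acc, site) for acc, site, _kinase in rows})

# Toy networks, illustrative only.
single = [
    ("P00533", "Y1197", "EGFR"),
    ("P00533", "Y1197", "SRC"),   # same substrate, different kinase
    ("P04626", "Y1248", "EGFR"),
]
multi = {
    "net0": single,
    "net1": [("P12931", "Y419", "SRC"), ("P04626", "Y1248", "ERBB2")],
}
```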
- kstar.analysis.coverage.sampleCoverage(binary_experiment, data_col, networks, mod='Y', exp_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'], net_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'])#
Given a sample within an experiment, determine how many of the sites observed in the experiment can be captured by KSTAR pruned networks. Essentially the same as experimentCoverage(), but restricts experiment sites to those used as evidence for a given sample
- Parameters:
- binary_experiment: pandas dataframe
binarized phosphoproteomic experiment, with each 1 indicating that site was observed in sample. Ideally has been mapped to KinPred by KSTAR already
- data_col: str
column name of the sample of interest
- network: pandas dataframe
binarized kinase-substrate network (unweighted), ideally having been mapped to KinPred/KSTAR already
- exp_cols: list
list indicating the columns in experiment dataframe that contain uniprot id and site number
- net_cols: list
list indicating the columns in network dataframe that contain the uniprot id and site number
- Returns:
- fraction_of_sites_covered: dict
indicates the fraction of phosphorylation sites observed in sample that are also found within the kinase-substrate network, for each modification type (tyrosine, serine/threonine).
- kstar.analysis.kinase_MI.kinase_mutual_information(network, kinase_column='KSTAR_KINASE', accession_column='KSTAR_ACCESSION', site_column='KSTAR_SITE', substrate_list=None)#
Finds the mutual information shared between kinases based on the substrates they phosphorylate. Mutual information is defined as the intersection of substrates between two kinases, where a substrate is identified by its accession and site, e.g. P54760_Y596. Normalization divides the size of the intersection by the size of the union of the two kinases' substrate sets; this is the Jaccard index (JI). The Jaccard distance can be calculated as 1 - JI.
- Parameters:
- networkpandas dataframe or dictionary of pandas dataframe
The network to analyze for mutual kinase information. Can send a dictionary of multiple pandas dataframes and this will average the MI across all networks in dictionary
- kinase_columnstr
Column in network that contains kinase information
- substrate_columnstr
Column in network that contains substrate information
- substrate_listlist
Optional; by default no subset is used. Restricts the MI calculation within the network(s) to the evidence given in the substrate list (entries must match the substrate_column of the network passed in)
- Returns:
- heatmappandas dataframe
Number of substrates that overlap between kinases
- normalizedpandas dataframe
Mutual information normalized to the Jaccard index: size of the intersection of two kinase networks / size of the union of the two kinase networks.
- heatlist or heatdict: list or dictionary of lists
intersection of kinase networks. If a single network it is a list. If multiple networks it is a dict of lists with keys the same as the network name
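The overlap and normalization described for kinase_mutual_information can be sketched for a single network as below. This is an illustration under stated assumptions, not the library's code: edges are plain (kinase, substrate) tuples, `kinase_jaccard` is a hypothetical name, and the substrates other than P54760_Y596 are placeholder labels.

```python
from itertools import combinations

def kinase_jaccard(edges):
    """Return pairwise overlap counts and Jaccard indices between kinases.

    `edges` is an iterable of (kinase, substrate) pairs, where a substrate
    is an 'ACCESSION_SITE' string such as 'P54760_Y596'.
    """
    substrates = {}
    for kinase, substrate in edges:
        substrates.setdefault(kinase, set()).add(substrate)

    overlap, jaccard = {}, {}
    for a, b in combinations(sorted(substrates), 2):
        shared = substrates[a] & substrates[b]
        union = substrates[a] | substrates[b]
        overlap[(a, b)] = len(shared)
        # Every kinase has at least one substrate, so the union is non-empty.
        jaccard[(a, b)] = len(shared) / len(union)
    return overlap, jaccard

# Toy network, illustrative only.
edges = [
    ("EPHB4", "P54760_Y596"), ("EPHB4", "SUB_A"),
    ("SRC", "P54760_Y596"), ("SRC", "SUB_B"), ("SRC", "SUB_C"),
    ("ABL1", "SUB_D"),
]
overlap, jaccard = kinase_jaccard(edges)
```

Here EPHB4 and SRC share one of four distinct substrates, giving a Jaccard index of 0.25 and a Jaccard distance of 0.75.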
- kstar.analysis.kinase_MI.plot_kinase_heatmap(heatmap, use_mask=True, annotate=False)#
Plots Kinase network heatmap
- Parameters:
- heatmappandas dataframe
Network Heatmap to plot (must be square matrix)
- info_type: str
Indicates what type of information is included in heatmap variable. Default is mutual information, equivalent to the normalized matrix obtained from kinase_mutual_information function
- use_maskbool
If true a mask is applied to the heatmap
- annotatebool
If true then numbers are annotated into each heatmap square
Dataset Processing Functions#
Other Helper Functions#
- kstar.helpers.agg_jaccard(jaccard_matrix, agg='max')#
Given a jaccard similarity matrix between samples, calculate the aggregate jaccard similarity excluding self-comparisons
- Parameters:
- jaccard_matrix: pd.DataFrame
jaccard similarity matrix between samples, created using jaci_matrix_between_samples()
- agg: str
aggregation method to use, either ‘max’ or ‘mean’
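Excluding self-comparisons means dropping the diagonal of the similarity matrix before aggregating. The sketch below assumes the matrix is a nested dict instead of a DataFrame and that one aggregate value per sample is wanted; both are assumptions, and `aggregate_jaccard` is a hypothetical name.

```python
def aggregate_jaccard(matrix, agg="max"):
    """Aggregate each sample's similarity to every *other* sample.

    `matrix` is assumed to be a nested dict {sample: {sample: jaccard}},
    standing in for a square similarity DataFrame.
    """
    if agg not in ("max", "mean"):
        raise ValueError("agg must be 'max' or 'mean'")
    result = {}
    for sample, row in matrix.items():
        # Drop the diagonal entry (a sample compared with itself).
        others = [value for other, value in row.items() if other != sample]
        if not others:
            continue  # a single-sample matrix has nothing to aggregate
        result[sample] = max(others) if agg == "max" else sum(others) / len(others)
    return result

# Toy similarity matrix, illustrative only.
matrix = {
    "A": {"A": 1.0, "B": 0.5, "C": 0.2},
    "B": {"A": 0.5, "B": 1.0, "C": 0.4},
    "C": {"A": 0.2, "B": 0.4, "C": 1.0},
}
max_similarity = aggregate_jaccard(matrix, agg="max")
```

Without the diagonal filter, `max` would always return 1.0 and `mean` would be inflated for every sample.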
- kstar.helpers.calculate_jaccard_by_binary(set1, set2)#
Compares two binary arrays and calculates the Jaccard index between them (based on number of matches)
- kstar.helpers.calculate_jaccard_by_sets(set1, set2)#
Compares two sets and calculates the Jaccard index between them
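The two Jaccard variants give the same answer when the binary arrays encode set membership over a shared ordering of items. The sketch below assumes that "matches" means positions where both arrays are 1, which is one reading of the description and may differ from the library's behavior; the function names are hypothetical.

```python
def jaccard_sets(set1, set2):
    """|intersection| / |union| of two sets (0.0 if both are empty)."""
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

def jaccard_binary(vec1, vec2):
    """Jaccard index of two equal-length 0/1 sequences.

    Assumes a 'match' is a position where both entries are 1.
    """
    both = sum(1 for a, b in zip(vec1, vec2) if a and b)
    either = sum(1 for a, b in zip(vec1, vec2) if a or b)
    return both / either if either else 0.0

# The same membership expressed both ways, over items a, b, c, d.
by_sets = jaccard_sets({"a", "b", "c"}, {"b", "c", "d"})
by_binary = jaccard_binary([1, 1, 1, 0], [0, 1, 1, 1])
```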
- kstar.helpers.convert_acc_to_uniprot(df, acc_col_name, acc_col_type, acc_uni_name)#
Given an experimental dataframe (df) with an accession column (acc_col_name) that is not UniProt, use UniProt to append an accession column of UniProt IDs
- Parameters:
- df: pandas.DataFrame
Dataframe with at least a column of accession of interest
- acc_col_name: string
name of column to convert FROM
- acc_col_type: string
Uniprot string designation of the accession type to convert FROM, see https://www.uniprot.org/help/api_idmapping
- acc_uni_name:
name of new column
- Returns:
- appended_df: pandas.DataFrame
Input dataframe with an appended acc column of uniprot IDs
- kstar.helpers.get_logger(name, filename)#
Finds and returns logger if it exists. Creates new logger if log file does not exist
- Parameters:
- namestr
log name
- filenamestr
location to store log file
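A find-or-create logger of this kind can be built on the standard `logging` module. The sketch below is an assumption about the general pattern, not the library's code: it decides whether to attach a file handler by checking the logger's existing handlers, and `get_or_create_logger` and the format string are hypothetical.

```python
import logging

def get_or_create_logger(name, filename):
    """Return the logger called `name`, attaching a file handler once."""
    logger = logging.getLogger(name)  # same object for the same name
    if not logger.handlers:
        # First request for this logger: send its records to `filename`.
        handler = logging.FileHandler(filename)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Checking `logger.handlers` keeps repeated calls from attaching duplicate handlers, which would otherwise write each message to the file several times.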
- kstar.helpers.jaci_matrix_between_samples(evidence, samples=None)#
Computes the similarity of evidence between samples based on the Jaccard index of phosphopeptide identities
- Parameters:
- evidence: pd.DataFrame
evidence dataframe, preferably one that has been binarized
- samples: a list of sample columns
- Returns:
- jaccard_matrix: pd.DataFrame
a dataframe showing the similarity of phosphopeptide identities between samples
- kstar.helpers.parse_network_information(network_directory, file_type='txt')#
Parse the RUN_INFORMATION.txt file from network pruning run and extract its data.
- Args:
file_path (str): Path to the RUN_INFORMATION.txt file.
- Returns:
dict: A dictionary containing the parsed data.
- kstar.helpers.process_fasta_file(fasta_file)#
For configuration, to convert the global fasta sequence file into a sequence dictionary that can be used in mapping
- Parameters:
- fasta_filestr
file location of fasta file
- Returns:
- sequencesdict
{acc : sequence} dictionary generated from fasta file
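Building an {acc : sequence} dictionary from FASTA text can be sketched as below. The sketch assumes UniProt-style headers of the form `>db|ACCESSION|NAME ...` and falls back to the first header token otherwise; `parse_fasta` is a hypothetical name, it takes an iterable of lines (an open file handle also works), and the sequences are toy fragments.

```python
def parse_fasta(lines):
    """Build {accession: sequence} from an iterable of FASTA lines."""
    chunks = {}
    acc = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts = header.split("|")
            if len(parts) >= 3:
                acc = parts[1]  # UniProt style: >sp|ACCESSION|NAME ...
            else:
                tokens = header.split()
                acc = tokens[0] if tokens else ""
            chunks[acc] = []
        elif acc is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks[acc].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}

# Toy FASTA content, illustrative only.
fasta_lines = [
    ">sp|P00533|EGFR_HUMAN Epidermal growth factor receptor",
    "MRPSGTAGAA",
    "LLALLAALCP",
    ">Q1 toy entry",
    "ACDE",
]
sequences = parse_fasta(fasta_lines)
```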
- kstar.helpers.string_to_boolean(string)#
Converts string to boolean
- Parameters:
- stringstr
input string
- Returns:
- resultbool
output boolean
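String-to-boolean conversion is typically needed when reading settings from a text configuration file. The accepted spellings below are an assumption, not the library's documented behavior, and `to_boolean` is a hypothetical name.

```python
def to_boolean(text):
    """Convert common true/false spellings to a bool (case-insensitive)."""
    normalized = text.strip().lower()
    if normalized in {"true", "t", "yes", "y", "1"}:
        return True
    if normalized in {"false", "f", "no", "n", "0"}:
        return False
    # Refuse to guess: bool("False") would silently be True.
    raise ValueError(f"cannot interpret {text!r} as a boolean")
```

An explicit lookup is used because Python's built-in `bool()` treats every non-empty string, including "False", as True.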