KSTAR Reference


The “Config” Module#

kstar.config.check_configuration()#: Verify that all necessary files are downloadable and findable

kstar.config.find_available_networks(phospho_type)#

Find available network hashes in the current network directory, and return dictionary with information about them

Returns:

available_networks: dict: dictionary containing all available networks, in the format network hash -> network information dictionary

kstar.config.install_network_files(target_dir=None)#

Retrieves the network files that accompany this release from FigShare and unzips them into the specified directory.

Parameters:

target_dir: str, optional: Directory to install network files to. If None, defaults to the package location ({KSTAR_DIR}/NETWORKS/)

kstar.config.update_configuration(network_dir=None, y_network_name=None, st_network_name=None, save_random_experiments=None, use_pregenerated_random_activities=None, save_new_random_activities=None, custom_pregenerated_activities_dir=None)#

Update configuration parameters in current iteration and save to configuration file.

Parameters:

use_pregenerated_random_activities: bool, optional: Whether to use pregenerated random activities when possible, by default None
save_new_random_activities: bool, optional: Whether to save new random activities when they are generated, by default False
custom_pregenerated_activities_dir: str, optional: Directory to save newly generated random activities for future use, by default None
network_dir: str, optional: Directory containing the kinase-substrate networks, by default None (which assumes it is located in the kstar directory)
y_network_hash: str, optional: Unique identifier of the tyrosine network to use by default.
st_network_hash: str, optional: Unique identifier of the serine/threonine network to use by default.

kstar.config.update_network_directory(network_dir=None, y_network_name=None, st_network_name=None)#

Update the location of the network files, and verify that all necessary files are located in the directory

Parameters:

network_dir: string: path to where network files are located
y_network_name: string: name of the tyrosine network to use
st_network_name: string: name of the serine/threonine network to use

The “Prune” Module#

The “Pruner” Class#

class kstar.prune.Pruner(network, network_name, phospho_type='Y', acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], logger=None, network_dir=None)#

Pruning Algorithm used for KSTAR.

Parameters:

network: pandas df

weighted kinase-site prediction network where there is an accession, site, kinase, and score column

network_name: str

name to use when saving pruned networks

logger: None or logging.Logger

logger used for pruning. Will create a new logger if None is provided

phospho_type: str

phospho_type(s) to use when building pruned networks

acc_col: str

the name of the column containing Uniprot Accession IDs for each substrate in the weighted network

site_col: str

the name of the column containing the residue type and location of each substrate in the weighted network (Y1268, S44, etc.)

nonweight_cols: list

indicates the non-weight containing columns in the network (these will be removed in the final processed network, as they are not needed). If None, will automatically look for any non-numeric columns and remove them.

network_dir: str

location to save the final pruned networks. Will use default network directory from config if None is provided.

Methods

`assess_work_dir`()	Report how many networks are currently in the work directory
`build_multiple_compendia_networks`(...[, ...])	Builds multiple compendia-limited networks
`build_multiple_networks`(kinase_size, ...[, ...])	Basic Network Generation - only takes into account score when determining sites a kinase connects to
`build_pruned_network`(network, kinase_size, ...)	Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to
`calculate_compendia_sizes`(kinase_size)	Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia
`checkParameters`(kinase_size, site_limit)	Given the site_limit and kinase_size parameters to be used during pruning, raise errors if not feasible, and raise warnings if value is higher than we would recommend (>40% of the maximum kinase_size value)
`clean_work_dir`()	Remove all files in existing work directory
`compendia_pruned_network`(compendia_sizes, ...)	Builds a compendia-pruned network that takes into account compendia size limits per kinase
`getMaximumKinaseSize`(site_limit)	Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)
`getRecommendedKinaseSize`(site_limit)	Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size
`pregenerate_random_activities`([PROCESSES])	Docstring for pregenerate_random_activities
`report_info`(txt)	Both log and print information during pruning
`report_warning`(txt)	Both log and print warnings during pruning
`run`(kinase_size, site_limit[, num_networks, ...])	Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks
`save_networks`([network_file_used, network_desc])	Save the pruned networks generated by the 'build_multiple_networks' or 'build_multiple_compendia_networks' as a pickle to be loaded by KSTAR
`save_run_information`([network_file_used, ...])	Save information about the generation of networks during run_pruning, including the parameters used for generation.

assess_work_dir()#: Report how many networks are currently in the work directory

build_multiple_compendia_networks(kinase_size, site_limit, num_networks, PROCESSES=1)#

Builds multiple compendia-limited networks

Parameters:

kinase_size: int: number of sites each kinase should connect to
site_limit: int: upper limit of number of kinases a site can connect to
num_networks: int: number of networks to build
network_id: str: id to use for each network in dictionary

Returns:

pruned_networks: dict: key: <network_id>_<i>, value: pruned network

build_multiple_networks(kinase_size, site_limit, num_networks, PROCESSES=1)#: Basic Network Generation - only takes into account score when determining sites a kinase connects to

build_pruned_network(network, kinase_size, site_limit)#

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

Parameters:

network: pandas DataFrame: network to build pruned network on
kinase_size: int: number of sites each kinase should connect to
site_limit: int: upper limit of number of kinases a site can connect to

Returns:

pruned network: pandas DataFrame: subset of network that has been pruned
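The two constraints described above (a fixed number of sites per kinase, and an upper limit on kinases per site) can be illustrated with a simplified greedy sketch. This is not KSTAR's pruning algorithm: the function name `greedy_prune`, the nested-dictionary input, and the greedy highest-score-first selection are all assumptions made for illustration.

```python
from collections import defaultdict


def greedy_prune(scores, kinase_size, site_limit):
    """Toy pruning (hypothetical helper): scores maps kinase -> {site: score}."""
    site_load = defaultdict(int)  # number of kinases already connected to each site
    pruned = {}
    for kinase in sorted(scores):
        # Rank this kinase's candidate sites from highest to lowest score.
        ranked = sorted(scores[kinase].items(), key=lambda kv: (-kv[1], kv[0]))
        chosen = []
        for site, _ in ranked:
            if len(chosen) == kinase_size:
                break
            if site_load[site] < site_limit:  # respect the per-site upper limit
                chosen.append(site)
                site_load[site] += 1
        if len(chosen) < kinase_size:
            raise ValueError(f"{kinase}: only {len(chosen)} sites available")
        pruned[kinase] = chosen
    return pruned


scores = {
    "ABL1": {"P1_Y10": 0.9, "P2_Y20": 0.8, "P3_Y30": 0.1},
    "SRC": {"P1_Y10": 0.7, "P2_Y20": 0.6, "P3_Y30": 0.5},
    "EGFR": {"P1_Y10": 0.95, "P2_Y20": 0.2, "P3_Y30": 0.3},
}
pruned = greedy_prune(scores, kinase_size=1, site_limit=1)
print(pruned)
```

With `site_limit=1`, a site claimed by one kinase is no longer available to the next, so later kinases fall back to lower-scoring sites.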

calculate_compendia_sizes(kinase_size)#

Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia

Parameters:

kinase_size: int: number of sites each kinase should connect to

Returns:

sizes: dict: key: compendia size, value: number of sites each kinase should pull from the given compendia size
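One way to read "same ratios of compendia sizes as found in compendia" is a proportional split of kinase_size across the compendia-size bins. The sketch below assumes that reading; the function name `proportional_sizes`, the example counts, and the largest-remainder rounding rule are illustrative assumptions rather than KSTAR's implementation.

```python
def proportional_sizes(compendia_counts, kinase_size):
    """Split kinase_size across compendia bins in proportion to bin counts."""
    total = sum(compendia_counts.values())
    exact = {k: kinase_size * n / total for k, n in compendia_counts.items()}
    sizes = {k: int(v) for k, v in exact.items()}  # floor each share
    leftover = kinase_size - sum(sizes.values())
    # Give the remaining sites to the bins with the largest fractional parts
    # (ties go to the smaller compendia size because the sort is stable).
    by_remainder = sorted(
        sorted(exact), key=lambda k: exact[k] - sizes[k], reverse=True
    )
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes


# Hypothetical counts: number of sites observed in 0, 1, or 2 compendia.
counts = {0: 500, 1: 300, 2: 200}
sizes = proportional_sizes(counts, kinase_size=25)
print(sizes)
```

The rounding step guarantees that the per-bin sizes still sum to kinase_size.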

checkParameters(kinase_size, site_limit)#

Given the site_limit and kinase_size parameters to be used during pruning, raise errors if not feasible, and raise warnings if value is higher than we would recommend (>40% of the maximum kinase_size value)

Parameters:

kinase_size: int: Parameter used in pruning: indicates the number of substrates each kinase will be connected to
site_limit: int: Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:

Nothing, will only raise errors/warnings if parameters are not feasible
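The error/warning behavior described for checkParameters can be sketched as follows. This is a stand-alone illustration, not the method itself: the function `check_parameters` and its `max_kinase_size` argument are hypothetical, and the maximum is passed in directly rather than derived from a network.

```python
import warnings


def check_parameters(kinase_size, max_kinase_size, warn_fraction=0.4):
    """Raise if kinase_size is infeasible; warn if it exceeds the recommended share."""
    if kinase_size > max_kinase_size:
        raise ValueError(
            f"kinase_size={kinase_size} exceeds the maximum of {max_kinase_size}"
        )
    if kinase_size > warn_fraction * max_kinase_size:
        warnings.warn(
            f"kinase_size={kinase_size} is above {warn_fraction:.0%} of the maximum"
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_parameters(kinase_size=500, max_kinase_size=1000)  # feasible, but warns
    check_parameters(kinase_size=300, max_kinase_size=1000)  # silent

try:
    check_parameters(kinase_size=1200, max_kinase_size=1000)
    infeasible = False
except ValueError:
    infeasible = True
```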

clean_work_dir()#: Remove all files in existing work directory

compendia_pruned_network(compendia_sizes, site_limit, odir)#

Builds a compendia-pruned network that takes into account compendia size limits per kinase

Parameters:

compendia_sizes: dict: key: compendia size, value: number of sites to connect to kinase
site_limit: int: upper limit of number of kinases a site can connect to

Returns:

pruned_network: pandas DataFrame: subset of network that has been pruned according to compendia ratios

getMaximumKinaseSize(site_limit)#

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)

Theoretical maximum exists when each substrate hits the maximum site_limit

Parameters:

site_limit: int: Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:

theoretical_max_ksize: int: largest possible value that ‘kinase_size’ parameter can have without throwing any errors
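If every site is connected to exactly site_limit kinases, the total number of kinase-site edges is the number of sites times site_limit, shared evenly among the kinases. The sketch below computes that bound under this reading; the formula and the function name `theoretical_max_kinase_size` are assumptions inferred from the description, not taken from the KSTAR source.

```python
def theoretical_max_kinase_size(num_sites, num_kinases, site_limit):
    """Upper bound on kinase_size when every site is used by site_limit kinases."""
    total_edges = num_sites * site_limit  # every site at its limit
    return total_edges // num_kinases     # split evenly across kinases


max_size = theoretical_max_kinase_size(num_sites=20000, num_kinases=50, site_limit=10)
print(max_size)
```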

getRecommendedKinaseSize(site_limit)#

Theoretical maximum exists when each substrate hits the maximum site_limit

Parameters:

site_limit: int: Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:

Nothing, prints theoretical maximum of kinase size and the recommended values for the parameter given the site_limit

pregenerate_random_activities(PROCESSES=1)#

Docstring for pregenerate_random_activities

Parameters:: self – Description

report_info(txt)#: Both log and print information during pruning

report_warning(txt)#: Both log and print warnings during pruning

run(kinase_size, site_limit, num_networks=50, use_compendia=True, generate_activities=True, network_file_used=None, network_desc=None, restart=False, PROCESSES=1)#: Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks

save_networks(network_file_used=None, network_desc=None)#: Save the pruned networks generated by the ‘build_multiple_networks’ or ‘build_multiple_compendia_networks’ as a pickle to be loaded by KSTAR

save_run_information(network_file_used=None, network_desc=None)#

Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.

Parameters:

network_file_used: str, optional: file path of the weighted network file used during pruning
network_desc: str, optional: description of the network used during pruning. Recommended, but not required

Functions to Perform Pruning#

kstar.prune.run_pruning(weighted_network, network_name, odir, phospho_type, kinase_size, site_limit, num_networks, use_compendia=True, generate_activities=True, network_file_used=None, network_desc=None, restart=False, logger=None, acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], PROCESSES=1)#

Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks

Parameters:

weighted_network: pandas DataFrame: weighted kinase-site prediction network where there is an accession, site, kinase, and score column
network_name: str: name to use when saving pruned networks
odir: str: location to save the final pruned networks. Will use default network directory from config if None is provided.
phospho_type: str: phospho_type(s) to use when building pruned networks

The “ExperimentMapper” class#

class kstar.mapping.ExperimentMapper(experiment, columns, odir='./', name='experiment', window=7, data_columns=None, logger=None, sequences=None, compendia=None)#

Given an experiment object and reference sequences, map the phosphorylation sites to the common reference.

Parameters:

experiment: pandas dataframe: Pandas dataframe of an experiment that has a reference accession, a peptide column and/or a site column. The peptide column should be upper case, with lower case indicating the site of phosphorylation; this is the preferred input. The site column should be in the format S/T/Y<pos>, e.g. Y15 or S345
columns: dict: Dictionary with mappings of the experiment dataframe column names for the required names ‘accession_id’, ‘peptide’, or ‘site’. One of ‘peptide’ or ‘site’ is required.
name: str: Name of experiment, used for logging and output file names
odir: str: Output directory where mapped data and logs will be saved
logger: Logger object: used for logging when peptides cannot be matched and when a site location changes. If None, a logger will be created in the output directory.
sequences: dict: Dictionary of sequences. Key : accession. Value : protein sequence. Default is imported from kstar.config
compendia: pd.DataFrame: Human phosphoproteome compendia, mapped to KinPred and annotated with number of compendia. Default is imported from kstar.config
window: int: The number of amino acids on the N- and C-terminal sides of the central phosphorylation site to include when mapping a site. Default is 7.
data_columns: list, or empty: The list of data columns to use. If this is empty, the mapper will look for any columns that start with ‘data:’ and use those. Default is None.
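The two accepted input conventions (a site string such as Y15, or a peptide with the phosphorylated residue in lower case) can be parsed as in the sketch below. The helper names `parse_site` and `phospho_offsets` are hypothetical and are not part of kstar.mapping.

```python
import re

# Matches sites in the S/T/Y<pos> format, e.g. Y15 or S345.
SITE_PATTERN = re.compile(r"^([STY])(\d+)$")


def parse_site(site):
    """Split a site string into (residue, 1-based position)."""
    match = SITE_PATTERN.match(site)
    if match is None:
        raise ValueError(f"site {site!r} is not in the S/T/Y<pos> format")
    return match.group(1), int(match.group(2))


def phospho_offsets(peptide):
    """Return 0-based offsets of lower-case s/t/y residues within a peptide."""
    return [i for i, aa in enumerate(peptide) if aa in "sty"]


print(parse_site("Y15"))
print(phospho_offsets("AEDGTPyVK"))
```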

Attributes:

experiment: pandas dataframe: mapped experiment, which for each peptide now contains the mapped accession, site, peptide, number of compendia, and compendia type
sequences: dict: Dictionary of sequences passed into the class
compendia: pandas dataframe: compendia dataframe passed into the class
data_columns: list: indicates which columns will be used as data

Methods

`align_sites`([window])	Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected.
`get_experiment`()	Return the mapped experiment dataframe
`get_number_missed_peptides`()	Returns number of missed peptides
`get_number_missed_sites`()	Returns number of missed sites
`get_reason_for_unmapped`()	Returns dataframe of unmapped sites with reasons for being unmapped
`get_sequence`(accession)	Gets the sequence that matches the given accession
`save_experiment`([return_stats, ...])	Given a completed mapping process, save the resulting experiment and reporting files (if desired) to the output directory.
`set_data_columns`(data_columns)	Identifies which columns in the experiment should be used as data columns.

align_sites(window=7)#

Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected. Operates on the experiment dataframe of the class, e.g. expMapper.align_sites(window=7).

Parameters:

window: int: The number of amino acids on the N- and C-terminal sides of the central phosphorylation site to include when mapping a site.
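To make the window parameter concrete, the sketch below extracts the residues on either side of a site from a protein sequence. The function `site_window`, the example sequence, and the use of "_" to pad windows that run past a terminus are assumptions for illustration; the padding convention used by KSTAR is not stated here.

```python
def site_window(sequence, position, window=7, pad="_"):
    """Return the residues within `window` of a 1-based position, padded at the termini."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    left = sequence[max(0, index - window):index]
    right = sequence[index + 1:index + 1 + window]
    return left.rjust(window, pad) + sequence[index] + right.ljust(window, pad)


seq = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up 22-residue sequence
print(site_window(seq, 5, window=3))   # Y5, fully inside the sequence
print(site_window(seq, 20, window=3))  # S20, window runs past the C-terminus
```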

get_experiment()#: Return the mapped experiment dataframe

get_number_missed_peptides()#: Returns number of missed peptides

get_number_missed_sites()#: Returns number of missed sites

get_reason_for_unmapped()#

Returns dataframe of unmapped sites with reasons for being unmapped

Returns:

errors: pandas Series: Series with counts of each error type
perc: pandas Series: Series with percentage of each error type

get_sequence(accession)#: Gets the sequence that matches the given accession

save_experiment(return_stats=True, return_lost_sites=True)#

Given a completed mapping process, save the resulting experiment and reporting files (if desired) to the output directory.

Parameters:

return_stats: bool: Whether to save a mapping statistics file. Default is True.
return_lost_sites: bool: Whether to save csv file containing any sites/peptides that were removed during the mapping process. Default is True.

set_data_columns(data_columns)#: Identifies which columns in the experiment should be used as data columns. If data_columns is provided, then ‘data:’ is added to the front of each name and the experiment dataframe columns are renamed. Otherwise, the function will look for columns with ‘data:’ in front and set these as the data_columns attribute.
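The two branches of this behavior can be sketched on a plain list of column names. The function `resolve_data_columns` is a hypothetical stand-in that only computes the rename mapping and resulting names; it does not touch a dataframe and is not the method's actual code.

```python
def resolve_data_columns(all_columns, data_columns=None, prefix="data:"):
    """Return (rename_map, resolved_names) for the two cases described above."""
    if data_columns:
        # Columns were named explicitly: add the prefix to each one.
        rename_map = {col: prefix + col for col in data_columns}
        return rename_map, list(rename_map.values())
    # Nothing provided: pick up columns that already carry the prefix.
    detected = [col for col in all_columns if col.startswith(prefix)]
    return {}, detected


columns = ["accession", "peptide", "t0", "t5", "data:t10"]
rename_map, resolved = resolve_data_columns(columns, ["t0", "t5"])
_, detected = resolve_data_columns(columns)
print(rename_map, resolved, detected)
```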

Functions for Activity Calculation#

The “KinaseActivity” class#

class kstar.calculate.KinaseActivity(evidence, odir, name='experiment', data_columns=None, phospho_type='Y', kinases=None, network_dir=None, logger=None, network_name=None, seed=None)#

Kinase Activity calculates the estimated activity of kinases given an experiment using hypergeometric distribution. Hypergeometric distribution examines the number of protein sites found to be active in evidence compared to the number of protein sites attributed to a kinase on a provided network.
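As a rough illustration of the hypergeometric test, the sketch below computes the probability of seeing at least k of a kinase's sites among the evidence by chance. The mapping of variables (M as all sites in a network, n as sites connected to the kinase, N as sites in the evidence, k as their overlap) is one plausible reading of the description above, and `hypergeom_sf` is a hypothetical helper written with the standard library rather than KSTAR's own code.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n successes, N draws)."""
    upper = min(n, N)
    total = comb(M, N)
    favorable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favorable / total


# Toy numbers: 20 sites in the network, 5 belong to the kinase,
# 6 sites observed as evidence, 3 of which belong to the kinase.
p = hypergeom_sf(k=3, M=20, n=5, N=6)
print(round(p, 4))
```

A smaller p-value indicates that the kinase's substrates are more enriched in the evidence than expected by chance.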

Parameters:

evidence: pandas df: a dataframe that contains (at minimum, but can have more) data columns as evidence to use in analysis and KSTAR_ACCESSION and KSTAR_SITE
odir: string: output directory where results will be saved
name: string: name of the experiment, used to label output files. Default is ‘experiment’
kinases: list or None: list of kinases to predict activity for. If None, will use all kinases found in the provided networks
network_dir: string or None: directory where pruned KSTAR networks are located. If None, will use config.NETWORK_DIR. If network files were downloaded with config.install_network_files(), this directory should already be set and does not need to be provided.
network_name: string or None: name of the network to use. If None, will use the default network name from config based on phospho_type
data_columns: list: list of the columns containing the abundance values, which will be used to determine which sites will be used as evidence for activity prediction in each sample
phospho_type: string, either ‘Y’ or ‘ST’: indicates the phospho modification of interest
logger: Logger object or None: keeps track of kstar analysis, including any errors that occur. If None, a new logger will be created automatically
min_dataset_size_for_pregenerated: int: minimum dataset size required to use pregenerated random activities (by number of sites used as evidence). Default is 150
max_diff_from_pregenerated: float: maximum percent difference between dataset size and pregenerated random activity size to use pregenerated data. Default is 0.20 (i.e. 20%)
seed: int or None: random seed to use for random number generation. If None, seed will be set to current time

Attributes:

Upon Initialization:
evidence: pandas dataframe: inputted dataset used for kinase activity calculation
networks: dict: dictionary of pruned kinase substrate networks, with keys as network ids and values as pandas dataframes
data_columns: list: list of columns containing abundance values, which will be used to determine which sites will be used as evidence. If the data_columns parameter was None, this list includes every column in evidence prefixed by ‘data:’
logger: Logger object: keeps track of kstar analysis, including any errors that occur
aggregate: string: the type of aggregation to use when determining binary evidence, either ‘count’ or ‘mean’. Default is ‘count’.
run_date: string: indicates the date that kinase activity object was initialized
random_seed: int: random seed used for activity calculation. Only relevant if not using pregenerated random activities
network_info: dict: metadata about the loaded networks
network_hash: string: unique identifier for the loaded networks
kinases: list: list of kinases to predict activity for
After Hypergeometric Calculations:
real_enrichment: pandas dataframe: p-values obtained for all pruned networks indicating statistical enrichment of a kinase’s substrates for each network, based on hypergeometric tests
activities: pandas dataframe: median p-values obtained from the real_enrichment object for each experiment/kinase
agg_activities: pandas dataframe
After Random Enrichment Calculation:
random_experiments: pandas dataframe: contains information about the sites randomly sampled for each random experiment. Will only be saved if save_random_experiments=True.
random_enrichment: KinaseActivity object: KinaseActivity object containing random activities predicted from each of the random experiments
data_columns_from_scratch: list: list of data columns which generated random activities from scratch
data_columns_with_pregenerated: list: list of data columns which generated random activities from pregenerated random activities
After Mann Whitney Analysis:
activities_mann_whitney: pandas dataframe: p-values obtained from comparing the real distribution of p-values to the distribution of p-values from random datasets, based on the Mann Whitney U-test
fpr_mann_whitney: pandas dataframe: false positive rates for predicted kinase activities
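The comparison behind the Mann Whitney attributes can be illustrated with the U statistic itself: the number of (real, random) pairs in which the real p-value is the smaller one, with ties counted as one half. The function `mann_whitney_u` and the example p-values are made up for illustration; how KSTAR turns this comparison into final p-values and false positive rates is not shown here.

```python
def mann_whitney_u(x, y):
    """U statistic for x: pairs where x_i < y_j, counting ties as one half."""
    return sum((xi < yj) + 0.5 * (xi == yj) for xi in x for yj in y)


# Hypothetical p-values for one kinase across three pruned networks.
real_pvalues = [0.01, 0.02, 0.03]
random_pvalues = [0.2, 0.02, 0.5]

u_stat = mann_whitney_u(real_pvalues, random_pvalues)
max_u = len(real_pvalues) * len(random_pvalues)
print(u_stat, max_u)
```

A U statistic close to its maximum means the real p-values are consistently smaller than those from random experiments.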

Methods

`calculate_kinase_activities`([agg, ...])	Calculates the combined activity of experiments, using a threshold value to determine whether an experiment sees a site or not. Use 'mean' as agg to aggregate by value (NA values are dropped from consideration), or 'count' as agg to aggregate by presence (a site is present if not NA)
`check_data_columns`([min_evidence_size])	Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence (or minimum set by min_evidence_size).
`create_binary_evidence`([agg, threshold, ...])	Returns a binary evidence data frame according to the parameters passed in for method for aggregating duplicates and considering whether a site is included as evidence or not
`get_allowable_threshold`([greater, agg, ...])	Determine the minimum/maximum threshold that still results in all data columns having evidence
`get_param_dict`([params_to_ignore])	Get a dictionary of important parameters needed to reinstantiate the KSTAR object
`get_random_activities`([...])	Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.
`get_run_information_content`()	Retrieve network information from RUN_INFORMATION.txt based on phospho_type.
`make_dotplot`([include_evidence_sizes])	Create a dotplot of the kinase activity results
`make_summary_pdf`([regenerate_plots])	Create a summary PDF of the kinase activity results
`recommend_threshold`([desired_evidence_size, ...])	Recommend a threshold, one based on desired evidence size and one based on maximum average Jaccard similarity between samples.
`set_data_columns`([data_columns])	Sets the data columns to use in the kinase activity calculation If data_columns is None or an empty list then set data_columns to be all columns that start with data:
`test_threshold`(threshold[, agg, greater, ...])	Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).
`test_threshold_range`(min_threshold, ...[, ...])	Given a range of threshold values, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment) and Jaccard similarity between samples at each threshold.

calculate_kinase_activities(agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, PROCESSES=1)#

Calculates the combined activity of experiments, using a threshold value to determine whether an experiment sees a site or not.

To aggregate by value, use ‘mean’ as agg; mean aggregation drops NA values from consideration.

To aggregate by count, use ‘count’ as agg; a site is present if its value is not NA.

Parameters:

data_columns: list: columns that represent experimental results; if None, takes the columns that start with ‘data:’ in the experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns
threshold: float: threshold value used to filter rows
evidence_size: int or None: the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample (or lowest if greater=False).
agg: {‘count’, ‘mean’}: method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations by counting the non-NaN values; ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide. NA values are dropped from consideration.
greater: bool: whether to keep sites that have a numerical value >= threshold (True, default) or <= threshold (False)
min_evidence_size: int: minimum number of sites required for a data column to be considered for activity calculation
PROCESSES: int: number of processes to use for multiprocessing
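The evidence_size option replaces the threshold with a fixed number of sites per sample. A minimal sketch of that selection for a single sample is shown below; the function `top_n_sites` and the example values are hypothetical, and tie handling is left to Python's stable sort rather than reflecting KSTAR's behavior.

```python
def top_n_sites(values, n, greater=True):
    """Pick the n sites with the highest (or lowest) values in one sample."""
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=greater)
    return {site for site, _ in ranked[:n]}


sample = {"s1": 2.0, "s2": 0.5, "s3": 1.5, "s4": 3.0}
highest = top_n_sites(sample, n=2)                # greatest abundance
lowest = top_n_sites(sample, n=2, greater=False)  # lowest abundance
print(sorted(highest), sorted(lowest))
```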

check_data_columns(min_evidence_size=0)#

Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence (or minimum set by min_evidence_size). Removes all columns that do not meet criteria

Parameters:

min_evidence_size: int: minimum number of sites required for a data column to be considered for activity calculation

create_binary_evidence(agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, drop_empty_columns=True)#

Returns a binary evidence data frame according to the parameters passed in for method for aggregating duplicates and considering whether a site is included as evidence or not

Parameters:

threshold: float: threshold value used to filter rows
evidence_size: None or int: the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample.
agg: {‘count’, ‘mean’}: method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations by counting the non-NaN values; ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide. NA values are dropped from consideration.
greater: bool: whether to keep sites that have a numerical value >= threshold (True, default) or <= threshold (False)
min_evidence_size: int: minimum number of sites required for a data column to be considered for activity calculation
drop_empty_columns: bool: whether to drop data columns with fewer than min_evidence_size sites

Returns:

evidence_binary: pd.DataFrame: Matches the evidence dataframe of the kinact object, but with 0 or 1 if a site is included or not. This is uniquified and rows that are never used are removed.
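The aggregation and thresholding steps can be sketched for a single data column using plain Python. The helper `binary_evidence`, its list-of-tuples input, and the way the count is compared against the threshold are assumptions for illustration; the real method works on the evidence dataframe and also removes rows that are never used.

```python
from statistics import mean


def binary_evidence(rows, threshold=1.0, agg="mean", greater=True):
    """rows: list of (site_id, value or None). Returns {site_id: 0 or 1}."""
    grouped = {}
    for site, value in rows:
        grouped.setdefault(site, [])
        if value is not None:  # NA values are dropped from consideration
            grouped[site].append(value)
    result = {}
    for site, values in grouped.items():
        if not values:
            result[site] = 0  # no usable measurement for this site
            continue
        score = len(values) if agg == "count" else mean(values)
        keep = score >= threshold if greater else score <= threshold
        result[site] = int(keep)
    return result


rows = [
    ("P1_Y10", 2.0), ("P1_Y10", 0.5),  # duplicate site, mean is 1.25
    ("P2_S5", None),                   # only NA values
    ("P3_T7", 0.4), ("P3_T7", None),   # one usable value below threshold
]
by_mean = binary_evidence(rows, threshold=1.0, agg="mean")
by_count = binary_evidence(rows, threshold=1.0, agg="count")
print(by_mean, by_count)
```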

get_allowable_threshold(greater=True, agg='mean', min_evidence_size=20, allow_column_loss=False)#

Determine the minimum/maximum threshold that still results in all data columns having evidence

Parameters:

greater: bool: whether to use sites greater (True) or less (False) than the threshold
agg: str: how to combine sites with multiple instances in experiment
min_evidence_size: int: minimum number of sites required for a data column to be considered for activity calculation

Returns:

allowable threshold: float: maximum or minimum threshold that still results in all data columns having evidence (or at least one if min_evidence_size = None)

get_param_dict(params_to_ignore=['network_sizes', 'pregenerated_experiments_path', 'mann_whitney'])#: Get a dictionary of important parameters needed to reinstantiate the KSTAR object

get_random_activities(num_random_experiments=150, use_pregenerated_random_activities=None, default_pregen_only=False, save_new_random_activities=None, custom_pregenerated_path=None, save_random_experiments=None, require_pregenerated=False, max_diff_from_pregenerated=0.25, min_dataset_size_for_pregenerated=150, PROCESSES=1)#

Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.

Parameters:

num_random_experiments: int, optional: Number of random experiments to generate, by default 150.
use_pregenerated_random_activities: bool, optional: Whether to use pre-generated data, by default None and will use configuration value.
default_pregen_only: bool, optional: Whether to only use the default pregenerated data found in the network directory folder, by default False.
save_new_random_activities: bool, optional: Whether to save new pregenerated data, by default None and will use configuration value.
custom_pregenerated_path: str, optional: Directory to save new precomputed data, by default None and will use configuration value.
save_random_experiments: bool, optional: Whether to save the generated random experiments, by default None and will use configuration value.
require_pregenerated: bool, optional: Whether to require using pre-generated data for all datasets, by default False. This will ensure fast run times, but may result in some datasets not being processed if they do not have matching pre-generated data (most commonly due to smaller samples).
max_diff_from_pregenerated: float, optional: Maximum allowed difference in size between the dataset and pregenerated data to use pregenerated data, by default 0.25.
min_dataset_size_for_pregenerated: int, optional: Minimum dataset size required to use pregenerated data, by default 150.
PROCESSES: int, optional: Number of processes to use for parallel computation, by default 1.
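The interaction between min_dataset_size_for_pregenerated and max_diff_from_pregenerated can be sketched as a simple eligibility check. The function `can_use_pregenerated` is hypothetical, and measuring the difference relative to the pregenerated size is an assumption; the exact rule KSTAR applies may differ.

```python
def can_use_pregenerated(dataset_size, pregenerated_size, max_diff=0.25, min_size=150):
    """Decide whether pregenerated random activities are a close enough match."""
    if dataset_size < min_size:
        return False  # dataset too small for pregenerated data
    relative_diff = abs(dataset_size - pregenerated_size) / pregenerated_size
    return relative_diff <= max_diff


close_match = can_use_pregenerated(200, 250)  # 20% difference
too_small = can_use_pregenerated(100, 100)    # below the minimum size
too_far = can_use_pregenerated(400, 250)      # 60% difference
print(close_match, too_small, too_far)
```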

get_run_information_content()#

Retrieve network information from RUN_INFORMATION.txt based on phospho_type.

Reads the RUN_INFORMATION.txt file from the appropriate network directory based on the phospho_type (‘Y’ or ‘ST’). The file contains network configuration details including unique ID, date, network specifications, and compendia counts.

Returns:

Contents of RUN_INFORMATION.txt if found.
‘RUN_INFORMATION.txt file not found.’ if the file doesn’t exist.

make_dotplot(include_evidence_sizes=True, **kwargs)#

Create a dotplot of the kinase activity results

Parameters:

include_evidence_sizes: bool: Whether to include evidence sizes in the dotplot
**kwargs: Additional keyword arguments to pass to the DotPlot initialization and make_complete_dotplot methods

make_summary_pdf(regenerate_plots=False)#

Create a summary PDF of the kinase activity results

Parameters:

regenerate_plots: bool: Whether to regenerate plots even if they already exist

recommend_threshold(desired_evidence_size=None, max_similarity=0.7, consider_size=True, consider_similarity=True, min_threshold=-inf, max_threshold=inf, step=0.1, pick_best_size_by='median', pick_best_similarity_by='max', greater=True, agg='mean', min_evidence_size=20, allow_column_loss=False)#

Recommend a threshold, one based on desired evidence size and one based on maximum average Jaccard similarity between samples. Will report the characteristics of the resulting evidences for both thresholds

Parameters:

desired_evidence_size: int: target evidence size to use when recommending threshold
max_similarity: float: maximum average Jaccard similarity between samples to use when recommending threshold. Default is 0.7
consider_size: bool: whether to consider evidence size when recommending threshold
consider_similarity: bool: whether to consider similarity between data columns when recommending threshold
min_threshold: float: minimum threshold to consider when recommending threshold. Must be provided if greater = True. Default is -infinity
max_threshold: float: maximum threshold to consider when recommending threshold. Must be provided if greater = False. Default is infinity
step: float: step size to use when iterating through thresholds
pick_best_size_by: str: method to use when aggregating evidence size values across samples, recommended to be either ‘min’, ‘max’, or ‘median’
pick_best_similarity_by: str: method to use when aggregating Jaccard similarity values across samples, recommended to be either ‘max’ or ‘median’
greater: bool: whether to use sites greater (True) or less (False) than the threshold
agg: str: how to combine sites with multiple instances in experiment
min_evidence_size: int: minimum number of sites required for a data column to be considered for activity calculation
allow_column_loss: bool: whether to allow some data columns to be lost when recommending threshold based on size. If False, will raise an error if min/max thresholds provided result in loss of any data columns

Returns:

float: recommended threshold value
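
The size-based part of this recommendation can be pictured as a scan over candidate thresholds. The sketch below is not the KSTAR implementation. It is a minimal pure-Python illustration, assuming each sample is a plain list of site quantifications, that the comparison is inclusive (>=), and that the best candidate is the one whose median evidence size is closest to the desired size. The helper names `evidence_size` and `scan_thresholds` are hypothetical.

```python
from statistics import median


def evidence_size(values, threshold, greater=True):
    """Count the sites in one sample that pass the threshold (hypothetical helper)."""
    if greater:
        return sum(1 for v in values if v >= threshold)
    return sum(1 for v in values if v <= threshold)


def scan_thresholds(samples, desired_size, min_threshold, max_threshold, step=0.1):
    """Return the candidate threshold whose median evidence size is closest to desired_size."""
    best_threshold, best_gap = None, None
    n_steps = int(round((max_threshold - min_threshold) / step))
    for i in range(n_steps + 1):
        threshold = round(min_threshold + i * step, 10)
        sizes = [evidence_size(vals, threshold) for vals in samples.values()]
        gap = abs(median(sizes) - desired_size)
        # Keep the first threshold that achieves the smallest gap.
        if best_gap is None or gap < best_gap:
            best_threshold, best_gap = threshold, gap
    return best_threshold


# Toy data: two data columns with six quantified sites each.
samples = {
    "data:control": [0.2, 0.8, 1.1, 1.5, 2.0, 2.4],
    "data:treated": [0.1, 0.9, 1.2, 1.9, 2.2, 3.0],
}
recommended = scan_thresholds(
    samples, desired_size=3, min_threshold=0.0, max_threshold=3.0, step=0.5
)
print(recommended)
```

With real data the scan would also need to respect min_evidence_size and allow_column_loss, which this sketch leaves out.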

set_data_columns(data_columns=None)#

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, data_columns is set to all columns that start with ‘data:’.

Checks all set columns to make sure columns are valid after filtering evidence

test_threshold(threshold, agg='mean', greater=True, plot=False, return_evidence_sizes=False, min_evidence_size=0)#

Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).

Parameters:

threshold: float: cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.
agg: str: how to combine sites with multiple instances in experiment
greater: bool: whether to use sites greater (True) or less (False) than the threshold
plot: bool: whether to plot a histogram of the evidence sizes used and heatmap of Jaccard similarity between samples
return_evidence_sizes: bool: indicates whether to return the evidence sizes for all samples or not
min_evidence_size: int: minimum number of sites required for a data column to be considered for activity calculation

Returns:

Outputs the minimum, maximum, and median evidence sizes across all samples. May return evidence sizes of all samples as pandas series
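
To make the interaction between agg and the threshold concrete, here is a minimal pure-Python sketch. It follows the ‘mean’ and ‘count’ descriptions given for enrichment_analysis, but it is not the library's code: the data layout (lists of `(site, value)` pairs per sample), the inclusive comparison, and the helper names `aggregate_sites` and `evidence_sizes` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median


def aggregate_sites(rows, agg="mean"):
    """Combine duplicate sites; missing values (None) are dropped first."""
    grouped = defaultdict(list)
    for site, value in rows:
        if value is not None:
            grouped[site].append(value)
    if agg == "mean":
        return {site: mean(vals) for site, vals in grouped.items()}
    if agg == "count":
        return {site: len(vals) for site, vals in grouped.items()}
    raise ValueError(f"unsupported agg: {agg!r}")


def evidence_sizes(samples, threshold, agg="mean", greater=True):
    """Number of sites per sample that pass the threshold after aggregation."""
    sizes = {}
    for name, rows in samples.items():
        site_values = aggregate_sites(rows, agg=agg)
        if greater:
            sizes[name] = sum(1 for v in site_values.values() if v >= threshold)
        else:
            sizes[name] = sum(1 for v in site_values.values() if v <= threshold)
    return sizes


# Toy data: EGFR_Y1092 is measured twice in the control column.
samples = {
    "data:control": [("EGFR_Y1092", 2.0), ("EGFR_Y1092", 1.0),
                     ("ERBB2_Y1248", 0.5), ("SRC_Y419", None)],
    "data:treated": [("EGFR_Y1092", 3.0), ("ERBB2_Y1248", 1.5), ("SRC_Y419", 2.5)],
    "data:late": [("EGFR_Y1092", 0.5), ("ERBB2_Y1248", 0.25), ("SRC_Y419", 1.0)],
}
sizes = evidence_sizes(samples, threshold=1.0)
summary = (min(sizes.values()), max(sizes.values()), median(sizes.values()))
print(sizes, summary)
```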

test_threshold_range(min_threshold, max_threshold, step=0.1, agg='mean', greater=True, min_evidence_size=0, desired_evidence_size=None, show_recommended=False)#

Given a range of threshold values, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment) and Jaccard similarity between samples at each threshold

Parameters:

min_threshold: float: minimum cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.
max_threshold: float: maximum cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.
step: float: step size to use when iterating through threshold range
agg: str: how to combine sites with multiple instances in experiment
greater: bool: whether to use sites greater (True) or less (False) than the threshold
min_evidence_size: int: minimum number of sites required for a data column to be considered for activity calculation
desired_evidence_size: int or None: target evidence size to use for plotting. If None, will use 150 for phospho_type ‘Y’ and 1500 for phospho_type ‘ST’
show_recommended: bool: whether to show recommended evidence size and similarity lines on the plots
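
The Jaccard similarity reported at each threshold is the size of the intersection of two samples' evidence sets divided by the size of their union. The following sketch shows that calculation over a small threshold range. It assumes each sample is a dict of site to quantification and averages the pairwise similarities; the function names are hypothetical and the plotting done by test_threshold_range is not reproduced.

```python
from itertools import combinations
from statistics import mean


def evidence_set(site_values, threshold, greater=True):
    """Sites in one sample that pass the threshold."""
    if greater:
        return {site for site, v in site_values.items() if v >= threshold}
    return {site for site, v in site_values.items() if v <= threshold}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def summarize_range(samples, thresholds, greater=True):
    """Map each threshold to (evidence size per sample, mean pairwise Jaccard similarity)."""
    summary = {}
    for threshold in thresholds:
        sets = {name: evidence_set(vals, threshold, greater) for name, vals in samples.items()}
        sizes = {name: len(s) for name, s in sets.items()}
        similarities = [jaccard(sets[a], sets[b]) for a, b in combinations(sets, 2)]
        summary[threshold] = (sizes, mean(similarities))
    return summary


samples = {
    "data:A": {"s1": 2.0, "s2": 1.5, "s3": 0.2, "s4": 1.1},
    "data:B": {"s1": 1.8, "s2": 0.4, "s3": 1.3, "s4": 1.2},
    "data:C": {"s1": 0.1, "s2": 1.9, "s3": 0.3, "s4": 2.5},
}
summary = summarize_range(samples, thresholds=[1.0, 2.0])
for threshold, (sizes, similarity) in summary.items():
    print(threshold, sizes, round(similarity, 3))
```

Stricter thresholds shrink the evidence sets, so sizes fall as the threshold rises, while similarity depends on which sites remain shared between samples.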

Master Functions for Running KSTAR Pipeline#

kstar.calculate.Mann_Whitney_analysis(kinact_dict, PROCESSES=1)#

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a p-value, given observations from a random bootstrapping analysis

Parameters:

kinact_dict: dictionary: A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’
PROCESSES: int: number of processes to use for parallel computation, by default 1.
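
As a rough illustration of the two quantities named above, the sketch below computes a Mann-Whitney U statistic by direct pair counting and an empirical false positive rate from a set of null values. It is a simplified stand-in rather than the KSTAR code: it reports only the statistic (a library routine such as scipy.stats.mannwhitneyu would also supply a p-value), and the definition of the empirical rate used here is an assumption.

```python
def u_statistic_less(real, random_values):
    """Count (real, random) pairs where the real p-value is smaller; ties count as half."""
    u = 0.0
    for r in real:
        for q in random_values:
            if r < q:
                u += 1.0
            elif r == q:
                u += 0.5
    return u


def empirical_fpr(observed, null_values):
    """Fraction of null values at least as small as the observed value (assumed definition)."""
    return sum(1 for v in null_values if v <= observed) / len(null_values)


# Toy p-values for one kinase across networks: real data vs. random data.
real_pvalues = [0.001, 0.01, 0.02]
random_pvalues = [0.3, 0.015, 0.6, 0.8]
u = u_statistic_less(real_pvalues, random_pvalues)
print(u, len(real_pvalues) * len(random_pvalues))

# Toy significance values obtained from random experiments.
null_significance = [0.2, 0.04, 0.5, 0.9, 0.01]
fpr = empirical_fpr(0.05, null_significance)
print(fpr)
```

A U value close to the total number of pairs indicates that the real p-values are consistently smaller than the random ones.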

kstar.calculate.enrichment_analysis(experiment, odir, name='experiment', phospho_types=['Y', 'ST'], data_columns=None, agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, allow_column_loss=True, kinases=None, PROCESSES=1, **kwargs)#

Function to establish a kstar KinaseActivity object from an experiment with an activity log, add the networks, and calculate, aggregate, and summarize the hypergeometric enrichment into a final activity object. Should be followed by randomized_analysis, then Mann_Whitney_analysis.

Parameters:

experiment: pandas df: experiment dataframe that has been mapped, includes KSTAR_SITE, KSTAR_ACCESSION, etc.
odir: str: path to where you would like logger and output saved
name: str: name to use for outputs
phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}: Which substrate/kinase-type to run activity for: Both [‘Y’, ‘ST’] (default), Tyrosine [‘Y’], or Serine/Threonine [‘ST’]
data_columns: list: columns that represent experimental results. If None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns
agg: {‘count’, ‘mean’}: method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and adds if values are non-NaN. ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide. NA values are dropped from consideration.
threshold: float or dict: threshold value used to filter rows. If provided as a dictionary, keys should be ‘Y’ and/or ‘ST’ with float values for each phospho_type.
evidence_size: int or dict: size of evidence to use for filtering. If provided as a dictionary, keys should be ‘Y’ and/or ‘ST’ with int values for each phospho_type. Will override threshold if both are provided.
min_evidence_size: int: minimum size of evidence to run kinase activity on. Default 0, meaning any data column with at least one site will be run on
greater: bool: whether to keep sites that have a numerical value >=threshold (True, default) or <=threshold (False)
PROCESSES: int: number of processes to use for parallel computation, by default 1.
**kwargs: Additional keyword arguments to pass to the KinaseActivity class

Returns:

kinactDict: dictionary of Kinase Activity Objects: Outer keys are the phospho types run (‘Y’ and/or ‘ST’). Each object includes the activities dictionary (see calculate_kinase_activities), the aggregation of activities across networks (see aggregate activities), and the activity summary (see summarize_activities).
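
For readers unfamiliar with hypergeometric enrichment, the following sketch computes a generic upper-tail hypergeometric probability with the standard library. How KSTAR defines the population, the kinase's substrate set, and the evidence for each network is not shown in this reference, so the variable meanings in the example are assumptions and the function name `hypergeom_sf` is hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a
    population containing `successes` marked items."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Assumed toy setting: 20 sites in a network, 5 of them substrates of a kinase,
# 4 sites observed as evidence, 2 of which are substrates of that kinase.
p_value = hypergeom_sf(k=2, population=20, successes=5, draws=4)
print(round(p_value, 4))
```

A small tail probability means the evidence contains more of the kinase's substrates than random draws of the same size would be expected to.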

kstar.calculate.randomized_analysis(kinact_dict, **kwargs)#

Perform randomized analysis on kinase activity data.

Parameters:

kinact_dict: dict: Dictionary containing kinase activity data.
**kwargs: Additional keyword arguments for random activity generation, passed to the KinaseActivity.get_random_activities method. These can include:

num_random_experiments: int, optional: Number of random experiments to generate, by default 150.
use_pregen_data: bool, optional: Whether to use pre-generated data, by default False.
max_diff_from_pregenerated: float, optional: Maximum fractional difference allowed from pre-generated data, by default 0.25.
min_dataset_size_for_pregenerated: int, optional: Minimum dataset size to use pre-generated data, by default 150.
default_pregen_only: bool, optional: Whether to only use default pre-generated data (and not any activities in custom path), by default False.
require_pregenerated: bool, optional: Whether to require pre-generated data, by default False. This will ensure fast performance, but may result in some data columns being dropped
custom_pregenerated_path: str, optional: Directory to save new precomputed data, by default None.
save_random_experiments: bool, optional: Whether to save the generated random experiments, by default None.
save_new_random_activities: bool, optional: Whether to save new precomputed data, by default None.
PROCESSES: int, optional: Number of processes to use for parallel computation, by default 1.

Returns:

None
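
One way to read max_diff_from_pregenerated and min_dataset_size_for_pregenerated together is as an eligibility check on each data column. The sketch below encodes that reading. It is an interpretation, not the library's logic: the function name `can_use_pregenerated`, the choice of the column's evidence size as the denominator of the fractional difference, and the inclusive comparisons are all assumptions.

```python
def can_use_pregenerated(
    evidence_size,
    pregenerated_size,
    max_diff_from_pregenerated=0.25,
    min_dataset_size_for_pregenerated=150,
):
    """Decide whether a pre-generated random dataset is close enough in size
    to stand in for freshly generated random experiments (assumed rule)."""
    if evidence_size < min_dataset_size_for_pregenerated:
        return False  # assumed: small datasets always get fresh random experiments
    fractional_diff = abs(evidence_size - pregenerated_size) / evidence_size
    return fractional_diff <= max_diff_from_pregenerated


checks = {
    "close match": can_use_pregenerated(200, 180),   # 10% difference
    "too different": can_use_pregenerated(200, 100), # 50% difference
    "too small": can_use_pregenerated(100, 100),     # below the minimum size
}
print(checks)
```

Under this reading, setting require_pregenerated would drop the columns for which the check fails instead of generating new random experiments for them.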

kstar.calculate.run_kstar_analysis(experiment, odir, name='experiment', phospho_types=['Y', 'ST'], data_columns=None, threshold=1.0, evidence_size=None, greater=True, save_output=True, PROCESSES=1, **kwargs)#

Given a mapped experiment, run the KSTAR analysis pipeline.

Parameters:

experiment: DataFrame: Mapped experiment data
odir: string: Output directory
name: string: Name of the experiment
phospho_types: list: List of phospho types to analyze
network_dir: string: Directory containing network data
data_columns: list: Columns to use from the data
agg: string: Aggregation method
threshold: float: Threshold for analysis
evidence_size: int: Size of evidence
greater: bool: Whether to use greater comparison
PROCESSES: int: Number of processes to use
**kwargs: Additional keyword arguments for enrichment_analysis, randomized_analysis, and save_kstar functions.

Functions for Saving and Loading KSTAR results#

kstar.calculate.from_kstar(name, odir, ftype='tsv')#

Given the name and output directory of a saved kstar analysis, load the parameters and minimum dataframes needed for reinstantiating a kinact object. This minimum list will allow you to repeat normalization or Mann-Whitney at a different false positive rate threshold and plot results.

Parameters:

name: string: The name used when saving activities and mapped data
odir: string: Output directory of saved files and parameter pickle

kstar.calculate.from_kstar_nextflow(name, odir, log=None)#

Given the name and output directory of a saved kstar analysis from the nextflow pipeline, load the results into a new kinact object with the minimum dataframes required for analysis (binary experiment, hypergeometric activities, normalized activities, Mann-Whitney activities)

Parameters:

name: string: The name used when saving activities and mapped data
odir: string: Output directory of saved files
log: logger: logger used when loading nextflow data into kinase activity object. If not provided, new logger will be created.

kstar.calculate.save_kstar(kinact_dict, name, odir, minimal=True, ftype='tsv', param_format='json')#

Having calculated kinase activities (run_kstar_analysis), save each of the important dataframes, minimizing the storage needed to get back to a rebuilt version for plotting results and analysis. For each phospho_type in the kinact_dict, at a minimum, this will save the binarized evidence, Mann-Whitney activities and fpr dataframes, and parameters used during the run. If you would like to save all files (hypergeometric and random enrichment intermediate files), set minimal = False.

Parameters:

kinact_dict: dictionary of Kinase Activity Objects: Outer keys are the phospho types run (‘Y’ and/or ‘ST’). Each object includes the activities dictionary (see calculate_kinase_activities), the aggregation of activities across networks (see aggregate activities), and the activity summary (see summarize_activities).
name: string: The name to use when saving activities
odir: string: Output directory to save files and pickle to
minimal: bool: Whether to save only minimal files or all intermediate files
ftype: {‘tsv’, ‘csv’}: Format to save dataframes in, either tsv or csv
param_format: {‘pickle’, ‘json’}: Format to save parameter dictionary in, either pickle or json. Json is recommended for easier human readability

Returns:

Nothing
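
The recommendation of JSON over pickle for param_format comes down to the file being plain text that can be opened and checked by eye. The sketch below round-trips a small parameter dictionary through JSON using only the standard library. The file name and the dictionary keys are hypothetical and do not reflect the layout that save_kstar actually writes.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical run parameters; the real parameter dictionary may differ.
params = {
    "phospho_type": "Y",
    "threshold": 1.0,
    "greater": True,
    "agg": "mean",
    "data_columns": ["data:control", "data:treated"],
}

with tempfile.TemporaryDirectory() as odir:
    param_path = Path(odir) / "experiment_params.json"  # hypothetical file name
    param_path.write_text(json.dumps(params, indent=2))
    saved_text = param_path.read_text()
    reloaded = json.loads(saved_text)

print(saved_text)
```

JSON only covers basic types such as strings, numbers, booleans, lists, and dicts, so any parameter that is a richer Python object would need converting before it can be saved this way.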

Plotting/Analysis Functions#

The “DotPlot” class#

class kstar.plot.DotPlot(values, fpr, alpha=0.05, inclusive_alpha=True, binary_sig=True, dotsize=5, colormap={0: '#6b838f', 1: '#FF3300'}, facecolor='white', legend_title='-log10(p-value)', size_number=5, size_color='gray', color_title='Significant', markersize=10, legend_distance=1.0, figsize=(4, 8), title=None, xlabel=True, ylabel=True, x_label_dict=None, kinase_dict=None)#

The DotPlot class is used for plotting dotplots, with the option to add clustering and context plots. The size of each dot is based on the values dataframe, where the area of the dot is the value * dotsize

Parameters:

values: pandas DataFrame instance: values to plot
fpr: pandas DataFrame instance: false positive rates associated with values being plotted
alpha: float, optional: fpr value that defines the significance cutoff to use when plotting. default: 0.05
inclusive_alpha: boolean: whether to include the alpha (significance <= alpha), or not (significance < alpha). default: True
binary_sig: boolean, optional: indicates whether to plot fpr with binary significance or as a change in color hue. default: True
dotsize: float, optional: multiplier to use for scaling size of dots
colormap: dict, optional: maps color values to actual color to use in plotting. default: {0: ‘#6b838f’, 1: ‘#FF3300’}
labelmap: maps labels of colors; the default is to indicate the FPR cutoff in the legend. default: None
facecolor: color, optional: Background color of dotplot. default: ‘white’
legend_title: str, optional: Legend title for dot sizes. default: ‘-log10(p-value)’
size_number: int, optional: Number of dots to attempt to generate for dot size legend
size_color: color, optional: Size Legend Color to use
color_title: str, optional: Legend Title for the Color Legend
markersize: int, optional: Size of dots for Color Legend
legend_distance: int, optional: relative distance to place legends
figsize: tuple, optional: size of dotplot figure
title: str, optional: Title of dotplot
xlabel: bool, optional: Show xlabel on graph if True
ylabel: bool, optional: Show ylabel on graph if True
x_label_dict: dict, optional: Mapping dictionary of labels as they appear in values dataframe (keys) to how they should appear on plot (values)
kinase_dict: dict, optional: Mapping dictionary of kinase names as they appear in values dataframe (keys) to how they should appear on plot (values)

Attributes:

values: pandas dataframe: a copy of the original values dataframe
fpr: pandas dataframe: a copy of the original fpr dataframe
alpha: float: cutoff used for significance, default 0.05
inclusive_alpha: boolean: whether to include the alpha (significance <= alpha), or not (significance < alpha)
significance: pandas dataframe: indicates whether a particular kinase's activity is significant, where fpr <= alpha is significant, otherwise it is insignificant
colors: pandas dataframe: dataframe indicating the color to use when plotting: either a copy of the fpr or significance dataframe
binary_sig: boolean: indicates whether coloring will be done based on binary significance or fpr values. Default True
labelmap: dict: indicates how to label each significance color
figsize: tuple: size of the outputted figure, which is overridden if axes is provided for dotplot
title: string: title of the dotplot
xlabel: boolean: indicates whether to plot x-axis labels
ylabel: boolean: indicates whether to plot y-axis labels
colormap: dict: colors to be used when plotting
facecolor: string: background color of dotplot
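
The relationship between fpr, alpha, inclusive_alpha, colormap, and dotsize can be sketched without any plotting. The example below uses nested dicts in place of pandas dataframes and is only an illustration of the rules stated above (significance is fpr <= alpha, or fpr < alpha when inclusive_alpha is False; dot area is value * dotsize). The function names are hypothetical.

```python
def significance(fpr, alpha=0.05, inclusive_alpha=True):
    """Return 1 where the fpr passes the cutoff and 0 otherwise (kinase -> sample -> flag)."""
    def passes(value):
        return value <= alpha if inclusive_alpha else value < alpha
    return {k: {s: int(passes(v)) for s, v in row.items()} for k, row in fpr.items()}


def dot_areas(values, dotsize=5):
    """Dot area for each kinase/sample pair: value * dotsize."""
    return {k: {s: v * dotsize for s, v in row.items()} for k, row in values.items()}


colormap = {0: "#6b838f", 1: "#FF3300"}  # default colormap from the signature
fpr = {
    "EGFR": {"control": 0.05, "treated": 0.01},
    "SRC": {"control": 0.40, "treated": 0.20},
}
values = {
    "EGFR": {"control": 2.0, "treated": 4.0},
    "SRC": {"control": 0.5, "treated": 1.0},
}

sig = significance(fpr)
colors = {k: {s: colormap[flag] for s, flag in row.items()} for k, row in sig.items()}
areas = dot_areas(values)
print(sig, colors, areas)
```

Note the boundary case: with inclusive_alpha=True an fpr exactly equal to alpha (EGFR in the control column) is drawn as significant, and with inclusive_alpha=False it is not.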

Methods

`cluster`(ax[, method, metric, orientation, ...])	Performs hierarchical clustering on data and plots result to provided Axes.
`context`(ax, info, id_column, context_columns)	Context plot is generated and returned.
`dotplot`([ax, orientation, size_legend, ...])	Generates the dotplot plot, where size is determined by values dataframe and color is determined by significant dataframe
`drop_kinases`(kinase_list)	Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object.
`drop_kinases_with_no_significance`()	Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant
`evidence_count`(ax, binary_evidence[, ...])	Add bars to dotplot indicating the total number of sites used as evidence in activity calculation
`make_complete_dotplot`([kinases_to_plot, ...])	Master function for creating a comprehensive dotplot visualization, which automatically creates any necessary subplots
`set_colors`([labelmap])	Set colors for the plot based on significance or false positive rate.

set_column_labels
set_index_labels
set_values
setup_figure

cluster(ax, method='single', metric='euclidean', orientation='top', color_threshold=-inf)#

Performs hierarchical clustering on data and plots result to provided Axes. result and significant dataframes are ordered according to clustering

Parameters:

ax: matplotlib Axes instance: Axes to plot dendrogram to
method: str, optional: The linkage algorithm to use.
metric: str or function, optional: The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.
orientation: str, optional: The direction to plot the dendrogram, which can be any of the following strings: ‘top’: Plots the root at the top, and plots descendant links going downwards (default). ‘bottom’: Plots the root at the bottom, and plots descendant links going upwards. ‘left’: Plots the root at the left, and plots descendant links going right. ‘right’: Plots the root at the right, and plots descendant links going left.

context(ax, info, id_column, context_columns, dotsize=200, markersize=20, orientation='left', color_palette='colorblind', margin=0.2, make_legend=True, **kwargs)#

Context plot is generated and returned. The context plot contains the categorical data used for describing the data.

Parameters:

ax: matplotlib axis: where to map subtype information to
info: pandas df: Dataframe where context information is pulled from
id_column: str: Column used to map the subtype information to
context_columns: list: list of columns to pull context information from
dotsize: int, optional: size of context dots
markersize: int, optional: size of legend markers
orientation: str, optional: orientation to plot context plots to; determines where legends are placed. Options: left, right, top, bottom
color_palette: str, optional: seaborn color palette to use
margin: float, optional: margin
make_legend: bool, optional: whether to create legend for context colors

dotplot(ax=None, orientation='left', size_legend=True, color_legend=True, max_size=None, **kwargs)#

Generates the dotplot plot, where size is determined by values dataframe and color is determined by significant dataframe

Parameters:

ax: matplotlib Axes instance, optional: axes dotplot will be plotted on. If None, a new plot is generated
orientation: str, optional: orientation to place legends, either ‘left’ or ‘right’
size_legend: bool, optional: whether to include size legend (indicates meaning of dot size/activity)
color_legend: bool, optional: whether to include color legend (indicates significance)
max_size: int, optional: maximum size value to use when generating size legend. If None, an automatic legend is generated

Returns:

ax: matplotlib Axes instance: Axes containing the dotplot

drop_kinases(kinase_list)#

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object. Removal is in place

Parameters:

kinase_list: list: list of kinase names to remove

drop_kinases_with_no_significance()#: Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant
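
The filtering rule behind this method can be illustrated with plain dicts: a kinase is kept only if it is significant in at least one sample. This is a sketch of the idea, not the class's code. It assumes fpr <= alpha defines significance, uses a hypothetical function name, and returns a new dict, whereas the method itself modifies the DotPlot's values in place.

```python
def drop_never_significant(values, fpr, alpha=0.05):
    """Keep only kinases whose fpr passes alpha in at least one sample."""
    keep = [k for k, row in fpr.items() if any(v <= alpha for v in row.values())]
    return {k: values[k] for k in keep}


values = {
    "EGFR": {"control": 2.0, "treated": 4.0},
    "SRC": {"control": 0.5, "treated": 1.0},
    "ABL1": {"control": 1.5, "treated": 0.2},
}
fpr = {
    "EGFR": {"control": 0.30, "treated": 0.01},
    "SRC": {"control": 0.40, "treated": 0.20},
    "ABL1": {"control": 0.05, "treated": 0.60},
}
filtered = drop_never_significant(values, fpr)
print(sorted(filtered))
```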

evidence_count(ax, binary_evidence, plot_type='bars', phospho_type=None, dot_size=1, include_recommendations=False, ideal_min=None, recommended_min=None, dot_colors=None, bar_line_colors=None)#

Add bars to dotplot indicating the total number of sites used as evidence in activity calculation

Parameters:

ax: axes object: where to plot the bars
binary_evidence: pandas dataframe: binarized dataframe produced during activity calculation (threshold applied to original experiment)

make_complete_dotplot(kinases_to_plot=None, cluster_samples=False, cluster_kinases=False, sort_kinases_by=None, sort_samples_by=None, binary_evidence=None, context=None, significant_kinases_only=True, show_xtick_labels=True, **kwargs)#

Master function for creating a comprehensive dotplot visualization, which automatically creates any necessary subplots

Parameters:

kinases_to_plot: list or None, optional: List of kinases to include in the plot. If None, all kinases are included.
cluster_samples: bool, optional: Whether to cluster samples in the plot.
cluster_kinases: bool, optional: Whether to cluster kinases in the plot.
significant_kinases_only: bool, optional: Whether to include only significant kinases in the plot.
sort_samples_by: str or None, optional: Kinase column to sort samples by in the plot based on kinase activities. If cluster_samples=True, this will be ignored.
sort_kinases_by: str or None, optional: Sample column to sort kinases by in the plot based on kinase activities. If cluster_kinases=True, this will be ignored.
binary_evidence: pd.DataFrame or None: Binary evidence dataframe from KSTAR analysis. If provided, will calculate the number of sites used as evidence in each sample and plot this.
context: pd.DataFrame or None, optional: Context dataframe providing additional sample information for plotting. If provided, must include an ‘id_column’ for unique sample identifiers and list ‘context_columns’ for context information.
show_xtick_labels: bool, optional: Whether to show x-axis tick labels in the dotplot.
**kwargs: Additional keyword arguments passed to plotting functions, like matplotlib.pyplot.scatter, DotPlot.context, DotPlot.dotplot, DotPlot.cluster, and DotPlot.evidence_count

set_colors(labelmap=None)#: Set colors for the plot based on significance or false positive rate.

The “KSTAR_PDF” class#

class kstar.plot.KSTAR_PDF(activities, fpr, odir, name, binarized_experiment, param_dict)#

Class to generate a PDF report from KSTAR analysis results, built on the fpdf2 module

Parameters:

activities: pandas DataFrame: DataFrame of Mann-Whitney kinase activities
fpr: pandas DataFrame: DataFrame of false positive rates corresponding to activities
odir: str: Output directory for saving the PDF report
name: str: Name of the experiment/run, used for file naming
binarized_experiment: pandas DataFrame: Binarized experiment indicating which sites were used as evidence in each column
param_dict: dict: Dictionary of parameters used in the KSTAR run

Attributes:

MARKDOWN_LINK_COLOR
accept_page_break: Whenever a page break condition is met, this @property method is called, and the break is issued or not depending on the returned value.
char_spacing
char_vpos: Return vertical character position relative to line.
current_font
current_font_is_set_on_page
dash_pattern
default_page_dimensions: Return a pair (width, height) in the unit specified to FPDF constructor
denom_lift: Return lift factor for denominator text.
denom_scale: Return scale factor for denominator text.
draw_color
emphasis: The current text emphasis: bold, italics, underline and/or strikethrough.
eph: Effective page height: the page height minus its vertical margins.
epw: Effective page width: the page width minus its horizontal margins.
fill_color
font_family
font_size
font_size_pt
font_stretching
font_style
fonts
is_ttf_font
line_width
nom_lift: Return lift factor for nominator text.
nom_scale: Return scale factor for nominator text.
output_intents
page_layout
page_mode
pages_count: Returns the total pages of the document, at the time it is called.
strikethrough
sub_lift: Return lift factor for subscript text.
sub_scale: Return scale factor for subscript text.
sup_lift: Return lift factor for superscript text.
sup_scale: Return scale factor for superscript text.
text_color
text_mode
text_shaping
underline

Methods

`HTML2FPDF_CLASS`	alias of `HTML2FPDF`
`add_action`(action, x, y, w, h, **kwargs)	Puts an Action annotation on a rectangular area of the page.
`add_font`([family, style, fname, ...])	Imports a TrueType or OpenType font and makes it available for later calls to the FPDF.set_font() method.
`add_link`([y, x, page, zoom, name])	Creates a new internal link and returns its identifier.
`add_output_intent`(subtype[, ...])	Adds desired Output Intent to the Output Intents array:
`add_page`([orientation, format, same, ...])	Adds a new page to the document.
`add_text_markup_annotation`(type, text, ...)	Adds a text markup annotation on some quadrilateral areas of the page.
`alias_nb_pages`([alias])	Defines an alias for the total number of pages.
`arc`(x, y, a, start_angle, end_angle[, b, ...])	Outputs an arc.
`bezier`(point_list[, closed, style])	Outputs a quadratic or cubic Bézier curve, defined by three or four coordinates.
`cell`([w, h, text, border, ln, align, fill, ...])	Prints a cell (rectangular area) with optional borders, background color and character string.
`circle`(x, y, radius[, style])	Outputs a circle.
`code39`(text, x, y[, w, h])	Barcode 3of9
`create_dotplot`()	Generate a standard activity dotplot for use in the PDF report
`dashed_line`(x1, y1, x2, y2[, dash_length, ...])	Draw a dashed line between two points.
`dotplot_page`([regenerate_plots])	Create a PDF page that includes the KSTAR dotplot figure and information on where to find the figure and underlying data in the output directory
`draw_path`(path[, debug_stream])	Add a pre-constructed path to the document.
`draw_vector_glyph`(path, font)	Add a pre-constructed path to the document.
`drawing_context`([debug_stream])	Create a context for drawing paths on the current page.
`ellipse`(x, y, w, h[, style])	Outputs an ellipse.
`elliptic_clip`(x, y, w, h)	Context manager that defines an elliptic crop zone, useful to render only part of an image.
`embed_file`([file_path, bytes, basename, ...])	Embed a file into the PDF as an attachment (and, for PDF/A-3 or PDF/A-4f, as an Associated File).
`evidence_count_plot`(data_columns)	Creates a barplot showing the number of sites used as evidence in each column of the experiment
`evidence_overlap_plot`(data_columns)	Creates a heatmap showing the Jaccard index of evidence overlap between columns in the experiment
`evidence_page`([regenerate_plots])	Create a PDF page that includes the total number of sites used as evidence for each column and the jaccard similarity of evidence between columns
`file_attachment_annotation`(file_path, x, y)	Puts a file attachment annotation on a rectangular area of the page.
`file_id`()	This method can be overridden in inherited classes in order to define a custom file identifier.
`font_face`()	Return a fpdf.fonts.FontFace instance representing a subset of properties of this GraphicsState.
`footer`()	Override the footer method to add a page number at the bottom center of each page.
`free_text_annotation`(text[, x, y, w, h])	Puts a free text annotation on a rectangular area of the page.
`generate`([regenerate_plots])	Generates the PDF report by creating each page in sequence and saving the final PDF to the output directory
`get_fallback_font`(char[, style])	Returns which fallback font has the requested glyph.
`get_named_destination`(name)	Retrieves a named destination by its name and creates a link to it.
`get_page_label`()	Return the current page fpdf.output.PDFPageLabel.
`get_string_width`(s[, normalized, markdown])	Returns the length of a string in user unit.
`get_x`()	Returns the abscissa of the current position.
`get_y`()	Returns the ordinate of the current position.
`glyph_drawing_context`()	Create a context for drawing paths for type 3 font glyphs, without writing on the current page.
`header`()	Header to be implemented in your own inherited class
`highlight`(text[, type, color, modification_time])	Context manager that adds a single highlight annotation based on the text lines inserted inside its indented block.
`image`(name[, x, y, w, h, type, link, title, ...])	Put an image on the page.
`ink_annotation`(coords[, text, color, ...])	Adds an ink annotation on the page.
`insert_toc_placeholder`(render_toc_function)	Configure Table Of Contents rendering at the end of the document generation, and reserve some vertical space right now in order to insert it.
`interleaved2of5`(text, x, y[, w, h])	Barcode I2of5 (numeric), adds a 0 if odd length
`line`(x1, y1, x2, y2)	Draw a line between two points.
`link`(x, y, w, h, link[, alt_text])	Puts a link annotation on a rectangular area of the page.
`ln`([h])	Line Feed.
`local_context`(**kwargs)	Creates a local graphics state, which won't affect the surrounding code.
`mirror`(origin, angle)	Method to perform a reflection transformation over a given mirror line.
`multi_cell`(w[, h, text, border, align, ...])	This method allows printing text with line breaks.
`new_path`([x, y, paint_rule, debug_stream])	Create a path for appending lines and curves to.
`normalize_text`(text)	Check that text input is in the correct format/encoding
`offset_rendering`()	All rendering performed in this context is made on a dummy FPDF object.
`output`([name, linearize, output_producer_class])	Output PDF to some destination.
`page_no`()	Get the current page number
`polygon`(point_list[, fill, style])	Outputs a polygon defined by three or more points.
`polyline`(point_list[, fill, polygon, style])	Draws lines between two or more points.
`preload_image`(name[, dims])	Read an image and load it into memory.
`rect`(x, y, w, h[, style, round_corners, ...])	Outputs a rectangle.
`rect_clip`(x, y, w, h)	Context manager that defines a rectangular crop zone, useful to render only part of an image.
`regular_polygon`(x, y, numSides, polyWidth[, ...])	Outputs a regular polygon with n sides It can be rotated Style can also be applied (fill, border...)
`rotate`(angle[, x, y])
`rotation`(angle[, x, y])	Method to perform a rotation around a given center.
`round_clip`(x, y, r)	Context manager that defines a circular crop zone, useful to render only part of an image.
`set_author`(author)	Defines the author of the document.
`set_auto_page_break`(auto[, margin])	Set auto page break mode, and optionally the bottom margin that triggers it.
`set_char_spacing`(spacing)	Sets horizontal character spacing.
`set_compression`(compress)	Activates or deactivates page compression.
`set_creation_date`([date])	Sets Creation of Date time, or current time if None given.
`set_creator`(creator)	Defines the creator of the document.
`set_dash_pattern`([dash, gap, phase])	Set the current dash pattern for lines and curves.
`set_display_mode`(zoom[, layout])	Defines the way the document is to be displayed by the viewer.
`set_doc_option`(opt, value)	Defines a document option.
`set_draw_color`(r[, g, b])	Defines the color used for all stroking operations (lines, rectangles and cell borders).
`set_encryption`(owner_password[, ...])	Activate encryption of the document content.
`set_fallback_fonts`(fallback_fonts[, exact_match])	Allows you to specify a list of fonts to be used if any character is not available on the font currently set.
`set_fill_color`(r[, g, b])	Defines the color used for all filling operations (filled rectangles and cell backgrounds).
`set_font`([family, style, size])	Sets the font used to print character strings.
`set_font_size`(size)	Configure the font size in points
`set_image_filter`(image_filter)	Args:
`set_keywords`(keywords)	Associate keywords with the document
`set_lang`(lang)	A language identifier specifying the natural language for all text in the document except where overridden by language specifications for structure elements or marked content.
`set_left_margin`(margin)	Sets the document left margin.
`set_line_width`(width)	Defines the line width of all stroking operations (lines, rectangles and cell borders).
`set_link`([link, y, x, page, zoom, name])	Defines the page and position a link points to.
`set_margin`(margin)	Sets the document right, left, top & bottom margins to the same value.
`set_margins`(left, top[, right])	Sets the document left, top & optionally right margins to the same value.
`set_page_background`(background)	Sets a background color or image to be drawn every time FPDF.add_page() is called, or removes a previously set background.
`set_page_label`([label_style, label_prefix, ...])	Enable fpdf.output.PDFPageLabel to be inserted on every page.
`set_producer`(producer)	Producer of document
`set_right_margin`(margin)	Sets the document right margin.
`set_section_title_styles`(level0[, level1, ...])	Defines a style for section titles.
`set_stretching`(stretching)	Sets horizontal font stretching.
`set_subject`(subject)	Defines the subject of the document.
`set_text_color`(r[, g, b])	Defines the color used for text.
`set_text_shaping`([use_shaping_engine, ...])	Enable or disable text shaping engine when rendering text.
`set_title`(title)	Defines the title of the document.
`set_top_margin`(margin)	Sets the document top margin.
`set_x`(x)	Defines the abscissa of the current position.
`set_xy`(x, y)	Defines the abscissa and ordinate of the current position.
`set_y`(y)	Moves the current abscissa back to the left margin and sets the ordinate.
`sign`(key, cert[, extra_certs, hashalgo, ...])	Args:
`sign_pkcs12`(pkcs_filepath[, password, ...])	Args:
`skew`([ax, ay, x, y])	Method to perform a skew transformation originating from a given center.
`solid_arc`(x, y, a, start_angle, end_angle[, ...])	Outputs a solid arc.
`star`(x, y, r_in, r_out, corners[, ...])	Outputs a regular star with n corners.
`start_section`(name[, level, strict])	Start a section in the document outline.
`summary_page`()	Create a PDF page that indicates the parameters used in the KSTAR run and the key kinases identified for each column
`table`(data[, header, column_widths, row_height])	Builds a table in the PDF
`text`(x, y[, text])	Prints a character string.
`text_annotation`(x, y, text[, w, h, name])	Puts a text annotation on a rectangular area of the page.
`text_columns`([text, img, img_fill_width, ...])	Establish a layout with multiple columns to fill with text. Args: text (str, optional): A first piece of text to insert. ncols (int, optional): the number of columns to create. (Default: 1). gutter (float, optional): The distance between the columns. (Default: 10). balance (bool, optional): Specify whether multiple columns should end at approximately the same height, if they don't fill the page. (Default: False). text_align (Align or str, optional): The alignment of the text within the region. (Default: "LEFT"). line_height (float, optional): A multiplier relative to the font size changing the vertical space occupied by a line of text. (Default: 1.0). l_margin (float, optional): Override the current left page margin. r_margin (float, optional): Override the current right page margin. print_sh (bool, optional): Treat a soft-hyphen (\u00ad) as a printable character, instead of a line breaking opportunity. (Default: False). wrapmode (fpdf.enums.WrapMode, optional): "WORD" for word based line wrapping, "CHAR" for character based line wrapping. (Default: "WORD"). skip_leading_spaces (bool, optional): On each line, any space characters at the beginning will be skipped if True. (Default: False).
`top_kinases_table`()	Constructs a table of the top 5 most active significant kinases per sample and adds it to the PDF page
`unbreakable`()	Ensures that all rendering performed in this context appears on a single page by performing a page break beforehand if need be.
`use_font_face`(font_face)	Sets the provided fpdf.fonts.FontFace in a local context, then restores the font settings to what they were initially.
`use_pattern`(shading)	Create a context for using a shading pattern on the current page.
`will_page_break`(height)	Lets you know if adding an element will trigger a page break, based on its height and the current ordinate (y position).
`write`([h, text, link, print_sh, wrapmode])	Prints text from the current position.
`write_html`(text, *args, **kwargs)	Parse HTML and convert it to PDF.

add_highlight
clear_text_region
is_current_text_region
mapping_page
preload_glyph_image
register_text_region
set_xmp_metadata
use_text_style
x_by_align

create_dotplot()#: Generate a standard activity dotplot for use in the PDF report

dotplot_page(regenerate_plots=False)#

Create a PDF page that includes the KSTAR dotplot figure and information on where to find the figure and underlying data in the output directory

Parameters:

regenerate_plots: bool, optional: Whether to regenerate the dotplot figure even if it already exists in the output directory

evidence_count_plot(data_columns)#

Creates a barplot showing the number of sites used as evidence in each column of the experiment

Parameters:

data_columns: list: List of column names in the experiment to include in the plot

evidence_overlap_plot(data_columns)#

Creates a heatmap showing the Jaccard index of evidence overlap between columns in the experiment

Parameters:

data_columns: list: List of column names in the experiment to include in the plot

evidence_page(regenerate_plots=False)#

Create a PDF page that includes the total number of sites used as evidence for each column and the Jaccard similarity of evidence between columns

Parameters:

regenerate_plots: bool, optional: Whether to regenerate the evidence plots even if they already exist in the output directory

footer()#: Override the footer method to add a page number at the bottom center of each page.

generate(regenerate_plots=False)#

Generates the PDF report by creating each page in sequence and saving the final PDF to the output directory

Parameters:

regenerate_plots: bool, optional: Whether to regenerate all plots even if they already exist in the output directory

summary_page()#: Create a PDF page that indicates the parameters used in the KSTAR run and the key kinases identified for each column

table(data, header=None, column_widths=40, row_height=5)#

Builds a table in the PDF

Parameters:

data: pandas DataFrame: DataFrame containing the data to be included in the table
header: list, optional: List of header names for the table columns. If None, uses DataFrame column names.
column_widths: int or list, optional: Width of each column in the table. If an integer is provided, all columns will have the same width. If a list is provided, it should contain the width for each column.
row_height: int, optional: Height of each row in the table.
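
The int-or-list behaviour described for column_widths can be illustrated with a small normalization step. This is only a sketch of the idea, not the code used by KSTAR_PDF; the helper name resolve_column_widths is hypothetical.

```python
def resolve_column_widths(column_widths, n_columns):
    """Return one width per column from either a single int or a list of widths."""
    if isinstance(column_widths, int):
        # A single integer applies the same width to every column.
        return [column_widths] * n_columns
    widths = list(column_widths)
    if len(widths) != n_columns:
        raise ValueError(
            f"expected {n_columns} column widths, got {len(widths)}"
        )
    return widths


uniform = resolve_column_widths(40, 3)
custom = resolve_column_widths([60, 30, 30], 3)
```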

top_kinases_table()#: Constructs a table of the top 5 most active significant kinases per sample and adds it to the PDF page

Downstream Analysis Modules#

kstar.analysis.interactions.getSubstrateInfluence(networks, kinase, substrate_subset=None)#

Given the pruned networks and a kinase of interest, return the number of networks in which each substrate is connected to that kinase (the ‘substrate influence’ on that kinase’s activity prediction). If a subset of substrates is provided, only that subset is analyzed.

Parameters:

networks: dict: dictionary storing all pruned networks used in activity calculation
kinase: str: name of the kinase of interest: should match the name found in provided networks
substrate_subset: list: subset of substrates to analyze, indicated by ‘{KSTAR_ACCESSION}_{KSTAR_SITE}’. If None, will return a series containing info on all substrates with at least one connection to the given kinase

Returns:

Pandas series indicating the number of networks in which each substrate is connected to the indicated kinase, sorted from the most connections (highest influence) to the least (lowest influence). Sites with no connection will not be included.
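
The counting behind ‘substrate influence’ can be sketched without the library. The following is a minimal illustration, not KSTAR's implementation: it assumes each pruned network is represented as a list of (kinase, accession, site) tuples, and the toy data and the helper name substrate_influence are hypothetical.

```python
from collections import Counter


def substrate_influence(networks, kinase, substrate_subset=None):
    """Count, per substrate, how many networks connect it to `kinase`."""
    counts = Counter()
    for edges in networks.values():
        # Use a set so a substrate counts at most once per network.
        substrates = {f"{acc}_{site}" for k, acc, site in edges if k == kinase}
        if substrate_subset is not None:
            substrates &= set(substrate_subset)
        counts.update(substrates)
    # Sort from most to least connections; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy networks (assumed structure): two pruned networks with a few edges each.
toy_networks = {
    "net0": [("EGFR", "P00533", "Y1197"), ("EGFR", "P04626", "Y1248")],
    "net1": [("EGFR", "P00533", "Y1197"), ("SRC", "P04626", "Y1248")],
}
influence = substrate_influence(toy_networks, "EGFR")
```

Substrates with no connection to the kinase never enter the counter, which mirrors the note that such sites are not included in the result.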

kstar.analysis.interactions.getSubstrateInfluence_inExperiment(networks, binary_evidence, kinase, data_cols=None)#

Given the binary evidence used for activity prediction, identify which sites are found across the most networks for a given kinase and each sample.

Parameters:

networks: dictionary: dictionary containing all 50 pruned networks used for activity prediction
binary_evidence: pandas dataframe: binarized dataset (using the same threshold/criteria as the one used for activity prediction)
kinase: str: name of the kinase to probe
data_cols: list or None: name of the data columns in binary_evidence to probe. If None, will analyze all columns with ‘data:’ at the start of the column name.

kstar.analysis.coverage.averageUniqueSubstrates_KSTAR(networks=None)#

Calculate the average number of unique substrates covered by each KSTAR pruned network

Parameters:

mod_types: list: list containing which networks to calculate average for. Either [‘Y’], [‘ST’], or [‘Y’,’ST’]

Returns:

averageSub: dict: indicates the average number of substrates across all pruned networks for indicated modification types

kstar.analysis.coverage.experimentCoverage(experiment, networks, mod='Y', exp_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'], net_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'])#

Given an experiment, determine how many of the sites observed in the experiment can be captured by a kinase-substrate network (the function was designed for KSTAR pruned networks, but should work with any kinase-substrate network that indicates UniProt ID and site number)

Parameters:

experiment: pandas dataframe: phosphoproteomic experiment, ideally one that has already been mapped to KinPred by KSTAR
networks: pandas dataframe: binarized kinase-substrate network (unweighted), ideally having been mapped to KinPred/KSTAR already
exp_cols: list: list indicating the columns in the experiment dataframe that contain the UniProt ID and site number
net_cols: list: list indicating the columns in the network dataframe that contain the UniProt ID and site number

Returns:

fraction_of_sites_covered: dict: indicates the fraction of phosphorylation sites observed in experiment that are also found within the kinase-substrate network, for each modification type (tyrosine, serine/threonine).
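
The coverage fraction itself is a set intersection over (accession, site) pairs. The sketch below is an assumption-laden illustration rather than the library's code: the helper name fraction_covered and the toy site lists are hypothetical, and returning 0.0 for an empty experiment is a convention chosen here.

```python
def fraction_covered(experiment_sites, network_sites):
    """Fraction of unique experiment sites that also appear in the network."""
    experiment_sites = set(experiment_sites)
    if not experiment_sites:
        # Assumed convention: an empty experiment has zero coverage.
        return 0.0
    shared = experiment_sites & set(network_sites)
    return len(shared) / len(experiment_sites)


# Sites are (accession, site) pairs, mirroring KSTAR_ACCESSION and KSTAR_SITE.
experiment = [
    ("P00533", "Y1197"),
    ("P04626", "Y1248"),
    ("P12931", "Y419"),
    ("P06241", "Y420"),
]
network = [("P00533", "Y1197"), ("P12931", "Y419"), ("P00519", "Y393")]
coverage = fraction_covered(experiment, network)  # 2 of 4 sites are covered
```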

kstar.analysis.coverage.getStudyBiasDistribution_InExperiment(binary_experiment, ax=None, figsize=(4, 3), return_dist=False)#

Plot the distribution of study bias within a single phosphoproteomic experiment

Parameters:

binary_experiment: pandas dataframe: phosphoproteomic experiment that has been mapped by KSTAR (contains ‘KSTAR_SITE’,’KSTAR_ACCESSION’, and ‘KSTAR_NUM_COMPENDIA’ columns)
ax: matplotlib axes object: axis to plot the distribution on. If None, will create subplot
figsize: tuple: size of matplotlib figure. Default is (4,3)
return_dist: bool: whether you would like to also return the distribution values. Default is False.

Returns:

Histogram plotting the distribution of study bias found in the provided experiment, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.
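
The data behind such a histogram is a tally of how many sites are recorded in each number of compendia. A minimal sketch, assuming the per-site compendia counts (the ‘KSTAR_NUM_COMPENDIA’ values) have already been extracted into a plain list; the helper name study_bias_distribution and the toy values are hypothetical.

```python
from collections import Counter


def study_bias_distribution(num_compendia):
    """Map each compendia count to the number of sites recorded in that many compendia."""
    return dict(sorted(Counter(num_compendia).items()))


# Toy values: one entry per phosphorylation site observed in the experiment.
compendia_per_site = [1, 1, 2, 5, 5, 5, 3]
distribution = study_bias_distribution(compendia_per_site)
```

The resulting keys and values can be passed directly to a bar plot to reproduce the shape of the histogram.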

kstar.analysis.coverage.getStudyBiasDistribution_InPhosphoproteome(mod_type='Y', ax=None, figsize=(4, 3), return_dist=False)#

Plot the distribution of study bias across the reference phosphoproteome

Parameters:

mod_type: str: indicates which modification type, tyrosine (‘Y’) or serine/threonine (‘ST’), you would like to plot. Default is ‘Y’
ax: matplotlib axes object: axis to plot the distribution on. If None, will create subplot
figsize: tuple: size of matplotlib figure. Default is (4,3)
return_dist: bool: whether you would like to also return the distribution values. Default is False.

Returns:

Histogram plotting the distribution of study bias found in overall phosphoproteome, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.

kstar.analysis.coverage.getStudyBiasDistribution_InSample(binary_experiment, data_column, ax=None, figsize=(4, 3), return_dist=False)#

Plot the distribution of study bias within a single sample of a phosphoproteomic experiment

Parameters:

binary_experiment: pandas dataframe: phosphoproteomic experiment that has been mapped by KSTAR (contains ‘KSTAR_SITE’,’KSTAR_ACCESSION’, and ‘KSTAR_NUM_COMPENDIA’ columns)
data_column: str: name of the column in binary_experiment that identifies the sample of interest
ax: matplotlib axes object: axis to plot the distribution on. If None, will create subplot
figsize: tuple: size of matplotlib figure. Default is (4,3)
return_dist: bool: whether you would like to also return the distribution values. Default is False.

Returns:

Histogram plotting the distribution of study bias found in the provided experiment, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.

kstar.analysis.coverage.numUniqueSubstrates(networks, acc_col='KSTAR_ACCESSION', site_col='KSTAR_SITE')#

Given a KSTAR network(s), return the number of unique substrates within the network (across all kinases). If a dictionary of multiple pruned networks is provided, will calculate the total number of unique substrates across ALL networks.

Parameters:

networks: pandas dataframe or dict of pandas dataframes: pruned KSTAR network, or dictionary containing multiple pruned networks
acc_col: str: name of column in network dataframe which indicates UniProt ID of substrates
site_col: str: name of column in network dataframe which indicates residue and site number (e.g., Y1197)

Returns:

Number of unique substrates within network(s)
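
Counting unique substrates across one or several networks amounts to taking the union of (accession, site) pairs. The sketch below illustrates that logic under the assumption that a network is a list of (kinase, accession, site) tuples; it is not the library's implementation, and the helper name num_unique_substrates and the toy edges are hypothetical.

```python
def num_unique_substrates(networks):
    """Count unique (accession, site) pairs in one network or a dict of networks."""
    if isinstance(networks, dict):
        edge_lists = networks.values()
    else:
        edge_lists = [networks]
    unique = set()
    for edges in edge_lists:
        # The kinase is ignored: a site shared by two kinases counts once.
        unique.update((acc, site) for _kinase, acc, site in edges)
    return len(unique)


net_a = [
    ("EGFR", "P00533", "Y1197"),
    ("SRC", "P00533", "Y1197"),
    ("SRC", "P12931", "Y419"),
]
net_b = [("EGFR", "P04626", "Y1248"), ("EGFR", "P00533", "Y1197")]

single = num_unique_substrates(net_a)
combined = num_unique_substrates({"net_a": net_a, "net_b": net_b})
```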

kstar.analysis.coverage.sampleCoverage(binary_experiment, data_col, networks, mod='Y', exp_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'], net_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'])#

Given a sample within an experiment, determine how many of the sites observed in the experiment can be captured by KSTAR pruned networks. Essentially the same as experimentCoverage(), but restricts experiment sites to those used as evidence for a given sample

Parameters:

binary_experiment: pandas dataframe: binarized phosphoproteomic experiment, with each 1 indicating that site was observed in sample. Ideally has been mapped to KinPred by KSTAR already
data_col: str: column name of the sample of interest
networks: pandas dataframe: binarized kinase-substrate network (unweighted), ideally having been mapped to KinPred/KSTAR already
exp_cols: list: list indicating the columns in the experiment dataframe that contain the UniProt ID and site number
net_cols: list: list indicating the columns in the network dataframe that contain the UniProt ID and site number

Returns:

fraction_of_sites_covered: dict: indicates the fraction of phosphorylation sites observed in sample that are also found within the kinase-substrate network, for each modification type (tyrosine, serine/threonine).

kstar.analysis.kinase_MI.kinase_mutual_information(network, kinase_column='KSTAR_KINASE', accession_column='KSTAR_ACCESSION', site_column='KSTAR_SITE', substrate_list=None)#

Finds the mutual information shared between kinases based on the substrates they phosphorylate. Mutual information is defined as the intersection of substrates between two kinases, where a substrate is defined by its accession and site (e.g., P54760_Y596). Normalization is performed by dividing the size of the intersection by the size of the union of the two kinases’ substrates; this is the Jaccard index (JI). The Jaccard distance can be calculated as 1 - JI.

Parameters:

network: pandas dataframe or dictionary of pandas dataframes: The network to analyze for mutual kinase information. If a dictionary of multiple pandas dataframes is provided, the MI is averaged across all networks in the dictionary
kinase_column: str: Column in network that contains kinase information
substrate_column: str: Column in network that contains substrate information
substrate_list: list, optional: Subset of substrates to use. If provided, the MI is calculated within the network(s) for only the substrates in this list (entries must match substrate_column of the network passed in). By default, no subset is used.

Returns:

heatmap: pandas dataframe: Number of substrates that overlap between kinases
normalized: pandas dataframe: Mutual information normalized into the Jaccard index: size of intersection of two kinase networks / size of union of two kinase networks.
heatlist or heatdict: list or dictionary of lists: intersection of kinase networks. If a single network it is a list. If multiple networks it is a dict of lists with keys the same as the network name
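
The overlap count and its Jaccard normalization can be sketched for a single network as follows. This is an illustration under stated assumptions, not the function's actual code: the network is taken to be a list of (kinase, accession, site) tuples, and the helper name kinase_jaccard and the toy edges are hypothetical.

```python
from itertools import combinations


def kinase_jaccard(edges):
    """Return {(kinase_a, kinase_b): (overlap, jaccard_index)} for each kinase pair."""
    substrates = {}
    for kinase, acc, site in edges:
        # A substrate is identified by accession and site, e.g. P54760_Y596.
        substrates.setdefault(kinase, set()).add(f"{acc}_{site}")
    result = {}
    for a, b in combinations(sorted(substrates), 2):
        shared = substrates[a] & substrates[b]
        union = substrates[a] | substrates[b]
        # The union is never empty: every kinase here has at least one substrate.
        result[(a, b)] = (len(shared), len(shared) / len(union))
    return result


toy_edges = [
    ("EGFR", "P00533", "Y1197"),
    ("EGFR", "P04626", "Y1248"),
    ("SRC", "P04626", "Y1248"),
    ("SRC", "P12931", "Y419"),
    ("ABL1", "P00519", "Y393"),
]
pairs = kinase_jaccard(toy_edges)
```

For a dictionary of networks, the description above suggests repeating this per network and averaging; that averaging step is left out of the sketch.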

kstar.analysis.kinase_MI.plot_kinase_heatmap(heatmap, use_mask=True, annotate=False)#

Plots Kinase network heatmap

Parameters:

heatmap: pandas dataframe: Network heatmap to plot (must be a square matrix)
info_type: str: Indicates what type of information is included in the heatmap variable. Default is mutual information, equivalent to the normalized matrix obtained from the kinase_mutual_information function
use_mask: bool: If True, a mask is applied to the heatmap
annotate: bool: If True, numbers are annotated into each heatmap square

Dataset Processing Functions#

Other Helper Functions#

kstar.helpers.agg_jaccard(jaccard_matrix, agg='max')#

Given a jaccard similarity matrix between samples, calculate the aggregate jaccard similarity excluding self-comparisons

Parameters:

jaccard_matrix: pd.DataFrame: jaccard similarity matrix between samples, created using jaci_matrix_between_samples()
agg: str: aggregation method to use, either ‘max’ or ‘mean’
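
Excluding self-comparisons means ignoring the diagonal of the matrix before aggregating. The sketch below shows one way to do this with a nested dictionary standing in for the DataFrame. Whether the library returns one value per sample or a single overall value is not stated here, so the per-sample output, the helper name aggregate_jaccard and the toy matrix are all assumptions.

```python
from statistics import mean


def aggregate_jaccard(matrix, agg="max"):
    """Aggregate each sample's similarity to all other samples (diagonal excluded)."""
    if agg not in ("max", "mean"):
        raise ValueError("agg must be 'max' or 'mean'")
    reducer = max if agg == "max" else mean
    # Assumes at least two samples, so every row has an off-diagonal entry.
    return {
        sample: reducer(value for other, value in row.items() if other != sample)
        for sample, row in matrix.items()
    }


toy_matrix = {
    "A": {"A": 1.0, "B": 0.5, "C": 0.25},
    "B": {"A": 0.5, "B": 1.0, "C": 0.75},
    "C": {"A": 0.25, "B": 0.75, "C": 1.0},
}
highest = aggregate_jaccard(toy_matrix, agg="max")
average = aggregate_jaccard(toy_matrix, agg="mean")
```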

kstar.helpers.calculate_jaccard_by_binary(set1, set2)#: Compares two binary arrays and calculates the Jaccard index between them (based on number of matches)

kstar.helpers.calculate_jaccard_by_sets(set1, set2)#: Compares two sets and calculates the Jaccard index between them

kstar.helpers.convert_acc_to_uniprot(df, acc_col_name, acc_col_type, acc_uni_name)#

Given an experimental dataframe (df) with an accession column (acc_col_name) that is not UniProt, use UniProt to append an accession column of UniProt IDs

Parameters:

df: pandas.DataFrame: Dataframe with at least a column of accession of interest
acc_col_name: string: name of column to convert FROM
acc_col_type: string: Uniprot string designation of the accession type to convert FROM, see https://www.uniprot.org/help/api_idmapping
acc_uni_name: string: name of the new column that will hold the UniProt IDs

Returns:

appended_df: pandas.DataFrame: Input dataframe with an appended acc column of uniprot IDs

kstar.helpers.get_logger(name, filename)#

Finds and returns logger if it exists. Creates new logger if log file does not exist

Parameters:

name: str: log name
filename: str: location to store log file
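
A get-or-create logger of this kind is commonly built on the standard logging module. The sketch below shows one such pattern and is not the code in kstar.helpers: it reuses a logger that already has a handler instead of checking for the log file, and the helper name get_file_logger and the message format are assumptions.

```python
import logging


def get_file_logger(name, filename):
    """Return the logger called `name`, attaching a file handler on first use."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        # First request for this name: write records to `filename`.
        handler = logging.FileHandler(filename)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Because logging.getLogger returns the same object for the same name, calling the helper twice does not duplicate handlers or log lines.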

kstar.helpers.jaci_matrix_between_samples(evidence, samples=None)#

Calculates the similarity of evidence between samples, based on the Jaccard index of phosphopeptide identities

Parameters:

evidence: pd.DataFrame: evidence dataframe, preferably one that has been binarized
samples: list, optional: list of sample columns

Returns:

jaccard_matrix: pd.DataFrame: a dataframe showing the similarity of phosphopeptide identities between samples
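
A pairwise Jaccard matrix over binarized evidence can be sketched in plain Python. This is only an illustration: the evidence table is represented as a dictionary of sample name to a list of 0/1 values (one per site), the helper name jaccard_matrix and the toy data are hypothetical, and scoring two empty samples as 0.0 is an assumed convention.

```python
def jaccard_matrix(evidence, samples=None):
    """Pairwise Jaccard index between samples of a binarized evidence table."""
    samples = list(evidence) if samples is None else list(samples)
    # Record which row indices (sites) are observed in each sample.
    observed = {
        s: {i for i, value in enumerate(evidence[s]) if value} for s in samples
    }
    matrix = {}
    for a in samples:
        matrix[a] = {}
        for b in samples:
            union = observed[a] | observed[b]
            shared = observed[a] & observed[b]
            matrix[a][b] = len(shared) / len(union) if union else 0.0
    return matrix


toy_evidence = {
    "data:control": [1, 1, 0, 1],
    "data:treated": [1, 0, 1, 1],
}
similarity = jaccard_matrix(toy_evidence)
```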

kstar.helpers.parse_network_information(network_directory, file_type='txt')#

Parse the RUN_INFORMATION.txt file from network pruning run and extract its data.

Parameters:

file_path: str: Path to the RUN_INFORMATION.txt file.

Returns:

dict: A dictionary containing the parsed data.

kstar.helpers.process_fasta_file(fasta_file)#

For configuration, to convert the global fasta sequence file into a sequence dictionary that can be used in mapping

Parameters:

fasta_file: str: file location of fasta file

Returns:

sequences: dict: {acc : sequence} dictionary generated from fasta file
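
Building an {acc : sequence} dictionary from FASTA text can be sketched as below. This is not the library's parser: it works on an iterable of lines rather than a file path, it assumes UniProt-style headers of the form db|ACCESSION|ENTRY_NAME and otherwise falls back to the first word of the header, and the helper name parse_fasta is hypothetical.

```python
def parse_fasta(lines):
    """Build an {accession: sequence} dict from FASTA-formatted lines."""
    chunks = {}
    acc = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts = header.split("|")
            if len(parts) >= 3:
                # Assumed UniProt-style header: db|ACCESSION|ENTRY_NAME ...
                acc = parts[1]
            else:
                acc = (header.split() or [""])[0]
            chunks[acc] = []
        elif acc is not None:
            # Sequence lines are accumulated and joined once at the end.
            chunks[acc].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


toy_lines = [
    ">sp|P00533|EGFR_HUMAN toy record",
    "MRPSGTAGAA",
    "LLALLAALCP",
    "",
    ">Q1 toy record without pipes",
    "ACDE",
]
sequences = parse_fasta(toy_lines)
```

An open file handle can be passed in place of the list, since iterating over a text file yields its lines.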

kstar.helpers.string_to_boolean(string)#

Converts string to boolean

Parameters:

string: str: input string

Returns:

result: bool: output boolean
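
Which spellings the function accepts is not documented here. The sketch below shows one plausible conversion for illustration only; the accepted values, the error on unrecognized input and the helper name to_boolean are all assumptions.

```python
def to_boolean(text):
    """Convert a string such as 'True' or 'no' to a bool (assumed spellings)."""
    normalized = text.strip().lower()
    if normalized in {"true", "t", "yes", "y", "1"}:
        return True
    if normalized in {"false", "f", "no", "n", "0"}:
        return False
    # Fail loudly rather than silently treating unknown text as False.
    raise ValueError(f"Cannot interpret {text!r} as a boolean")
```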
