KSTAR

The “Config” Module

kstar.config.install_resource_files()[source]

Retrieves the RESOURCE_FILES that accompany this version release from FigShare and unzips them into the correct resource file directory.

kstar.config.install_network_files(target_dir=None)[source]

Retrieves the network files that accompany this version release from FigShare and unzips them into the specified directory.

kstar.config.update_network_directory(directory, create_pickles=True, KSTAR_DIR='/Users/zxa7aw/Documents/KSTAR/KSTAR_documentation_update/KSTAR_documentation-master', NETWORK_DIR='./NETWORKS/NetworKIN')[source]

Update the location of the network files and verify that all necessary files are located in that directory.

Parameters:
directory: string

path to where network files are located

kstar.config.create_network_pickles(phosphoTypes=['Y', 'ST'], network_directory='./NETWORKS/NetworKIN')[source]

Given the network files declared in globals, create pickles of the kstar network objects that can then be quickly loaded in analysis. Assumes that the network structure has two folders, Y and ST, under the NETWORK_DIR global variable and that all .csv files in those directories should be loaded into a network pickle.
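
A minimal sketch of a first-time setup using these helpers; the directory path is illustrative, and it is assumed that install_network_files and update_network_directory point at the same location:

    from kstar import config

    # Download the companion resource and network files from FigShare
    config.install_resource_files()
    config.install_network_files(target_dir='/path/to/NETWORKS')

    # Point KSTAR at the downloaded networks and build the network pickles
    config.update_network_directory('/path/to/NETWORKS', create_pickles=True)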

The “Prune” Module

The “Pruner” Class

class kstar.prune.Pruner(network, logger, phospho_type='Y', acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'])[source]

Pruning Algorithm used for KSTAR.

Parameters:
network : pandas df

weighted kinase-site prediction network where there is an accession, site, kinase, and score column

logger

logger used for pruning

phospho_type : str

phospho_type(s) to use when building pruned networks

acc_col : str

the name of the column containing Uniprot Accession IDs for each substrate in the weighted network

site_col : str

the name of the column containing the residue type and location of each substrate in the weighted network (Y1268, S44, etc.)

nonweight_cols : list

indicates the non-weight-containing columns in the network (these will be removed from the final processed network, as they are not needed). If None, will automatically look for any non-numeric columns and remove them.

Methods

build_multiple_compendia_networks(...[, ...])

Builds multiple compendia-limited networks

build_multiple_networks(kinase_size, ...[, ...])

Basic Network Generation - only takes into account score when determining sites a kinase connects to

build_pruned_network(network, kinase_size, ...)

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

calculate_compendia_sizes(kinase_size)

Calculates the number of sites per compendia size that a kinase should connect to, using the same ratios of compendia sizes as found in the compendia

checkParameters(kinase_size, site_limit)

Given the site_limit and kinase_size parameters to be used during pruning, raise errors if they are not feasible, and raise warnings if the value is higher than we would recommend (>40% of the maximum kinase_size value)

compendia_pruned_network(compendia_sizes, ...)

Builds a compendia-pruned network that takes into account compendia size limits per kinase

getMaximumKinaseSize(site_limit)

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)

getRecommendedKinaseSize(site_limit)

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size

save_networks(network_file)

Save the pruned networks generated by the 'build_multiple_networks' or 'build_multiple_compendia_networks' as a pickle to be loaded by KSTAR

save_run_information()

Save information about the generation of networks during run_pruning, including the parameters used for generation.
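
As a rough sketch, the parameter-checking helpers can be used to choose a feasible kinase_size before pruning; the weighted network dataframe, the logger names, and the example values for kinase_size and site_limit are assumptions:

    from kstar import helpers
    from kstar.prune import Pruner

    log = helpers.get_logger('pruning_log', 'pruning.log')
    pruner = Pruner(weighted_network, log, phospho_type='Y')  # weighted_network: pandas DataFrame of kinase-site scores

    pruner.getRecommendedKinaseSize(site_limit=10)            # prints the theoretical maximum and a recommended range
    pruner.checkParameters(kinase_size=2000, site_limit=10)   # raises errors/warnings if the values are not feasible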

build_pruned_network(network, kinase_size, site_limit)[source]

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

Parameters:
network : pandas DataFrame

network to build pruned network on

kinase_size: int

number of sites each kinase should connect to

site_limit : int

upper limit of number of kinases a site can connect to

Returns:
pruned_network : pandas DataFrame

subset of network that has been pruned

compendia_pruned_network(compendia_sizes, site_limit, odir)[source]

Builds a compendia-pruned network that takes into account compendia size limits per kinase

Parameters:
compendia_sizes : dict

key : compendia size; value : number of sites to connect to kinase

site_limit : int

upper limit of number of kinases a site can connect to

Returns:
pruned_network : pandas DataFrame

subset of network that has been pruned according to compendia ratios

calculate_compendia_sizes(kinase_size)[source]

Calculates the number of sites per compendia size that a kinase should connect to, using the same ratios of compendia sizes as found in the compendia

Parameters:
kinase_size: int

number of sites each kinase should connect to

Returns:
sizes : dict

key : compendia size; value : number of sites each kinase should pull from given compendia size

build_multiple_compendia_networks(kinase_size, site_limit, num_networks, network_id, odir, PROCESSES=1)[source]

Builds multiple compendia-limited networks

Parameters:
kinase_size: int

number of sites each kinase should connect to

site_limit : int

upper limit of number of kinases a site can connect to

num_networks: int

number of networks to build

network_id : str

id to use for each network in dictionary

Returns:
pruned_networks : dict

key : <network_id>_<i>; value : pruned network

build_multiple_networks(kinase_size, site_limit, num_networks, network_id, odir, PROCESSES=1)[source]

Basic Network Generation - only takes into account score when determining sites a kinase connects to

getMaximumKinaseSize(site_limit)[source]

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)

Theoretical maximum exists when each substrate hits the maximum site_limit

Parameters:
site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
theoretical_max_ksize: int

largest possible value that ‘kinase_size’ parameter can have without throwing any errors

getRecommendedKinaseSize(site_limit)[source]

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size

Theoretical maximum exists when each substrate hits the maximum site_limit

Parameters:
site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
Nothing; prints the theoretical maximum kinase_size and the recommended values for the parameter given the site_limit.

checkParameters(kinase_size, site_limit)[source]

Given the site_limit and kinase_size parameters to be used during pruning, raise errors if they are not feasible, and raise warnings if the value is higher than we would recommend (>40% of the maximum kinase_size value)

Parameters:
kinase_size: int

Parameter used in pruning: indicates the number of substrates each kinase will be connected to

site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
Nothing; will only raise errors/warnings if parameters are not feasible.

save_networks(network_file)[source]

Save the pruned networks generated by the ‘build_multiple_networks’ or ‘build_multiple_compendia_networks’ as a pickle to be loaded by KSTAR

save_run_information()[source]

Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.

Functions to Perform Pruning

kstar.prune.run_pruning(network, log, phospho_type, kinase_size, site_limit, num_networks, network_id, odir, use_compendia=True, acc_col='substrate_acc', site_col='site', netcols_todrop=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], PROCESSES=1)[source]

Generate pruned networks from a weighted kinase-substrate graph and log run information

Parameters:
network: pandas dataframe

kinase substrate network matrix, with values indicating weight of kinase-substrate relationship

log: logger

logger to document the pruning process from start to finish

use_compendia: string

whether to use compendia ratios to build network

phospho_type: string

phospho type (‘Y’, ‘ST’, …)

kinase_size: int

number of sites a kinase connects to

site_limit: int

upper limit on the number of kinases a site can connect to

num_networks: int

number of networks to generate

network_id: string

name of network to use in building dictionary

odir: string

output directory for results

Returns:
pruner: Prune object

Pruner object that contains the number of pruned networks indicated by the num_networks parameter
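
A hedged end-to-end sketch of generating and saving pruned networks with run_pruning and save_pruning; the input file name and the parameter values are illustrative:

    import pandas as pd
    from kstar import helpers, prune

    weighted_network = pd.read_csv('weighted_kinase_site_network_Y.tsv', sep='\t')  # illustrative input file
    log = helpers.get_logger('prune_Y', 'prune_Y.log')

    pruner = prune.run_pruning(weighted_network, log, phospho_type='Y',
                               kinase_size=2000, site_limit=10, num_networks=50,
                               network_id='nkin', odir='pruning_results',
                               use_compendia=True, PROCESSES=4)

    prune.save_pruning(phospho_type='Y', network_id='nkin', kinase_size=2000,
                       site_limit=10, use_compendia=True, odir='pruning_results', log=log)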

kstar.prune.save_pruning(phospho_type, network_id, kinase_size, site_limit, use_compendia, odir, log)[source]

Save the pruned networks generated by run_pruning function as a pickle to be loaded by KSTAR

Parameters:
phospho_type: string

type of phosphomodification the networks were generated for (either 'Y' or 'ST')

network_id: string

name of network used to build dictionary

kinase_size: int

number of sites a kinase connects to

site_limit: int

upper limit on the number of kinases a site can connect to

use_compendia: string

whether compendia was used for ratios to build networks

odir: string

output directory for results

log: logger

logger to document pruning process from start to finish

Returns:
Nothing
kstar.prune.save_run_information(results, use_compendia, pruner, unique_id)[source]

Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.

Parameters:
results:

object that stores all parameters used in the pruning process

use_compendia: string

whether compendia was used for ratios to build network

pruner: Prune object

output of the run_pruning() function

Returns:
Nothing

The “ExperimentMapper” class

class kstar.mapping.ExperimentMapper(experiment, columns, logger, sequences=None, compendia=None, window=7, data_columns=None)[source]

Given an experiment object and reference sequences, map the phosphorylation sites to the common reference.

Parameters:
name : str

Name of experiment. Used for logging

experiment: pandas dataframe

Pandas dataframe of an experiment that has a reference accession, a peptide column, and/or a site column. The peptide column should be upper case, with lower case indicating the site of phosphorylation (this is the preferred input). The site column should be in the format S/T/Y<pos>, e.g. Y15 or S345.

columns: dict

Dictionary with mappings of the experiment dataframe column names for the required names ‘accession_id’, ‘peptide’, or ‘site’. One of ‘peptide’ or ‘site’ is required.

logger: Logger object

used for logging when peptides cannot be matched and when a site location changes

sequences: dict

Dictionary of sequences. Key : accession. Value : protein sequence. Default is imported from kstar.config

compendia: pd.DataFrame

Human phosphoproteome compendia, mapped to KinPred and annotated with number of compendia. Default is imported from kstar.config

window : int

The number of amino acids to the N- and C-terminal sides of the central phosphorylation site to include when mapping a site. Default is 7.

data_columns: list, or None

The list of data columns to use. If this is None, the mapper will look for any columns whose names start with 'data:' and use those. Default is None.

Attributes:
experiment: pandas dataframe

mapped experiment, which for each peptide now contains the mapped accession, site, peptide, number of compendia, and compendia type

sequences: dict

Dictionary of sequences passed into the class

compendia: pandas dataframe

compendia dataframe passed into the class

data_columns: list

indicates which columns will be used as data

Methods

align_sites([window])

Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected.

get_experiment()

Return the mapped experiment dataframe

get_sequence(accession)

Gets the sequence that matches the given accession

set_data_columns(data_columns)

Identifies which columns in the experiment should be used as data columns.

set_data_columns(data_columns)[source]

Identifies which columns in the experiment should be used as data columns. If data_columns is provided, then 'data:' is added to the front of each name and the experiment dataframe is renamed. Otherwise, the function will look for columns with 'data:' in front and add these to the data_columns attribute.

get_experiment()[source]

Return the mapped experiment dataframe

get_sequence(accession)[source]

Gets the sequence that matches the given accession

align_sites(window=7)[source]

Map the peptides/sites to the common sequence reference, and remove and report errors for sites that do not align as expected (e.g. expMapper.align_sites(window=7)). Operates on the experiment dataframe of the class.

Parameters:
window: int

The number of amino acids to the N- and C-terminal sides of the central phosphorylation site to include when mapping a site.
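
A minimal sketch of mapping an experiment onto the KSTAR reference; the input file and the experiment column names in the columns dictionary are assumptions, and calling align_sites explicitly may be redundant if it is already run during construction:

    import pandas as pd
    from kstar import helpers, mapping

    experiment = pd.read_csv('phosphoproteomics.tsv', sep='\t')  # illustrative input file
    log = helpers.get_logger('mapping_log', 'mapping.log')

    # Map the experiment's own column names onto the required 'accession_id', 'peptide', and/or 'site' keys
    columns = {'accession_id': 'Protein', 'peptide': 'Peptide'}

    mapper = mapping.ExperimentMapper(experiment, columns, log)
    mapper.align_sites(window=7)
    mapped_experiment = mapper.get_experiment()  # now carries KSTAR_ACCESSION, KSTAR_SITE, etc.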

The “KinaseActivity” class

class kstar.calculate.KinaseActivity(evidence, logger, data_columns=None, phospho_type='Y')[source]

Kinase Activity calculates the estimated activity of kinases given an experiment, using the hypergeometric distribution. The hypergeometric test compares the number of protein sites found to be active in the evidence to the number of protein sites attributed to a kinase in a provided network.

Parameters:
evidence : pandas df

a dataframe that contains, at minimum (but can have more), the data columns used as evidence in the analysis, plus KSTAR_ACCESSION and KSTAR_SITE

data_columns: list

list of the columns containing the abundance values, which will be used to determine which sites will be used as evidence for activity prediction in each sample

logger : Logger object

keeps track of kstar analysis, including any errors that occur

phospho_type: string, either ‘Y’ or ‘ST’

indicates the phospho modification of interest

Attributes:

Upon Initialization

evidence: pandas dataframe

the inputted evidence dataframe

data_columns: list

list of columns containing abundance values, which will be used to determine which sites will be used as evidence. If the inputted data_columns parameter was None, this list includes any column in evidence prefixed by 'data:'

logger : Logger object

keeps track of kstar analysis, including any errors that occur

phospho_type: string

indicates the phosphomodification of interest

network_directory: string

directory where kinase substrate networks can be downloaded, as indicated in config.py

normalized: bool

indicates whether normalization analysis has been performed

aggregate: string

the type of aggregation to use when determining binary evidence, either ‘count’ or ‘mean’. Default is ‘count’.

threshold: float

cutoff to use when determining what sites to use for each experiment

greater: bool

indicates whether sites with greater or lower abundances than the threshold will be used

run_date: string

indicates the date that the kinase activity object was initialized

After Hypergeometric Calculations

real_enrichment: pandas dataframe

p-values obtained for all pruned networks indicating statistical enrichment of a kinase’s substrates for each network, based on hypergeometric tests

activities: pandas dataframe

median p-values obtained from the real_enrichment object for each experiment/kinase

agg_activities: pandas dataframe

After Random Enrichment Calculation

random_experiments: pandas dataframe

contains information about the sites randomly sampled for each random experiment

random_kinact: KinaseActivity object

KinaseActivity object containing random activities predicted from each of the random experiments

After Mann Whitney Analysis

activities_mann_whitney: pandas dataframe

p-values obtained from comparing the real distribution of p-values to the distribution of p-values from random datasets, based on the Mann Whitney U-test

fpr_mann_whitney: pandas dataframe

false positive rates for predicted kinase activities

Methods

add_network(network_id, network[, network_size])

Add network to be analyzed

add_pregenerated_to_random_enrichment()

Combine pre-generated random activities with random enrichment, sort based on the "data" column, and reorganize the combined DataFrame based on the original column order in self.data_columns.

aggregate_activities([activities])

Aggregate network activity using median for all activities

calculate_Mann_Whitney_activities_sig(log[, ...])

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used.

calculate_kinase_activities([agg, ...])

Calculates combined activity of experiments using a threshold value to determine whether an experiment sees a site or not; use 'mean' as agg to aggregate values (mean aggregation drops NA values from consideration) or 'count' to treat a site as present if not NA

calculate_random_activities(logger[, ...])

Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.

check_data_columns()

Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence.

create_binary_evidence([agg, threshold, ...])

Returns a binary evidence data frame according to the parameters passed in for aggregating duplicates and for deciding whether a site is included as evidence or not

find_pvalue_limits(data_columns[, agg, ...])

For each data column and network find the lowest p-value achievable and how many seen sites are required to get to that limit. Assumptions - kinase size in network is same for all kinases.

getFilteredCompendia([selection_type])

Get phosphorylation sites binned based on selection type

get_compendia_distribution(...[, selection_type])

Get the compendia distribution for each data column.

get_run_date()

return date that kinase activities were run

get_run_information_content()

Retrieve network information from RUN_INFORMATION.txt based on phospho_type.

load_pregenerated_random_activities(...)

Load pre-generated random activities for the given datasets.

network_check_for_pregeneration()

Check if the network hash matches a pre-generated network in pregen_experiments and verifies RUN_INFORMATION.txt within the hash subdirectory.

parse_network_information(file_path)

Parse the RUN_INFORMATION.txt file and extract its data.

save_new_precomputed_random_enrichment(...)

Save the new precomputed random enrichment activities to a file.

set_data_columns([data_columns])

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, then data_columns is set to all columns that start with 'data:'

set_evidence(evidence)

Evidence to use in analysis

summarize_activities([activities, method, ...])

Builds a single combined dataframe from the provided activities such that each piece of evidence is given a single column.

test_threshold(threshold[, agg, greater, ...])

Given a threshold value, calculate the distribution of evidence sizes (i.e. the number of sites used in prediction for each sample).

add_networks_batch
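
KinaseActivity objects are normally created for you by enrichment_analysis (see the master functions below); the brief sketch that follows constructs one directly only to illustrate the interface and the threshold check, with mapped_experiment and the threshold value as assumptions:

    from kstar import helpers, calculate

    log = helpers.get_logger('activity_log', 'activity.log')
    kinact = calculate.KinaseActivity(mapped_experiment, log, phospho_type='Y')  # mapped_experiment from ExperimentMapper

    # Inspect how many sites would be used as evidence at a given abundance cutoff
    kinact.test_threshold(threshold=1.0, agg='mean', greater=True, plot=True)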

check_data_columns()[source]

Checks data columns to make sure each column is in evidence and that evidence filtered on that data column has at least one point of evidence. Removes all columns that do not meet these criteria.

set_data_columns(data_columns=None)[source]

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, then data_columns is set to all columns that start with 'data:'.

Checks all set columns to make sure the columns are valid after filtering evidence

test_threshold(threshold, agg='mean', greater=True, plot=False, return_evidence_sizes=False)[source]

Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).

Parameters:
threshold: float

cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.

agg: str

how to combine sites with multiple instances in experiment

greater: bool

whether to use sites greater (True) or less (False) than the threshold

plot: bool

whether to plot a histogram of the evidence sizes used

return_evidence_sizes: bool

indicates whether to return the evidence sizes for all samples or not

Returns:
Outputs the minimum, maximum, and median evidence sizes across all samples. May return the evidence sizes of all samples as a pandas Series.

parse_network_information(file_path)[source]

Parse the RUN_INFORMATION.txt file and extract its data.

Args:

file_path (str): Path to the RUN_INFORMATION.txt file.

Returns:

dict: A dictionary containing the parsed data.

network_check_for_pregeneration()[source]

Check if the network hash matches a pre-generated network in pregen_experiments and verifies RUN_INFORMATION.txt within the hash subdirectory.

Returns:

bool: True if the data matches, False otherwise.

get_compendia_distribution(with_pregenerated_evidence, data_columns, selection_type='KSTAR_NUM_COMPENDIA_CLASS')[source]

Get the compendia distribution for each data column.

Parameters:
with_pregenerated_evidence : pandas DataFrame

KSTAR mapped experimental dataframe that has been binarized by kstar_activity generation.

data_columns : list

Columns that represent experimental results.

selection_type : str, optional

The type of compendia selection, by default ‘KSTAR_NUM_COMPENDIA_CLASS’.

Returns:
dict

Dictionary containing the compendia distribution for each data column.

calculate_random_activities(logger, num_random_experiments=150, use_pregen_data=None, save_new_precompute=None, pregenerated_experiments_path=None, directory_for_save_precompute=None, network_hash=None, save_random_experiments=None, PROCESSES=1)[source]

Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.

Parameters:
logger : Logger object

Logger to record the progress and any issues during the randomization pipeline.

num_random_experiments : int, optional

Number of random experiments to generate, by default 150.

use_pregen_data : bool, optional

Whether to use pre-generated data, by default None.

save_new_precompute : bool, optional

Whether to save new precomputed data, by default None.

pregenerated_experiments_path : str, optional

Path to the directory containing pre-generated experiments, by default None.

directory_for_save_precompute : str, optional

Directory to save new precomputed data, by default None.

network_hash : str, optional

Hash of the network used, by default None.

save_random_experiments : bool, optional

Whether to save the generated random experiments, by default None.

PROCESSES : int, optional

Number of processes to use for parallel computation, by default 1.

Returns:
None
load_pregenerated_random_activities(with_pregenerated_evidence, with_pregenerated, pregen_activities_list)[source]

Load pre-generated random activities for the given datasets.

This function processes datasets that have pre-generated random experiments. It identifies the appropriate pre-generated file based on the size of the dataset and appends the activities to the provided list.

Parameters:
with_pregenerated_evidence : pandas.DataFrame

DataFrame containing the evidence for the datasets with pre-generated random experiments.

with_pregenerated : list

List of dataset names that have pre-generated random experiments.

random_activities_list : list

List to which the concatenated activities of each dataset will be appended.

Returns:
None
add_pregenerated_to_random_enrichment()[source]

Combine pre-generated random activities with random enrichment, sort based on the “data” column, and reorganize the combined DataFrame based on the original column order in self.data_columns.

If use_pregen_data is True and data_columns_from_scratch is None, uses only pre-generated activities. If use_pregen_data is True and data_columns_from_scratch exists, combines both pre-generated and newly calculated random activities. If use_pregen_data is False, uses only newly calculated random activities.

Returns:
None

Updates self.random_enrichment with the combined and sorted activities

save_new_precomputed_random_enrichment(activities_list_df, col)[source]

Save the new precomputed random enrichment activities to a file.

This function saves the provided DataFrame of random enrichment activities to a file, using the specified column name.

Parameters:
activities_list_df : pandas.DataFrame

DataFrame containing the random enrichment activities to be saved.

col : str

Column name to be used for saving the activities.

Returns:
None
get_run_information_content()[source]

Retrieve network information from RUN_INFORMATION.txt based on phospho_type.

Reads the RUN_INFORMATION.txt file from the appropriate network directory based on the phospho_type (‘Y’ or ‘ST’). The file contains network configuration details including unique ID, date, network specifications, and compendia counts.

Returns:
str

Contents of RUN_INFORMATION.txt if found. ‘RUN_INFORMATION.txt file not found.’ if the file doesn’t exist.

Raises:
ValueError

If phospho_type is not ‘Y’ or ‘ST’.

add_network(network_id, network, network_size=None)[source]

Add network to be analyzed

Parameters:
network_id : str

name of the network

network : pandas DataFrame

network with columns substrate_id, site, kinase_id

get_run_date()[source]

return date that kinase activities were run

set_evidence(evidence)[source]

Evidence to use in analysis

Parameters:
evidence : pandas DataFrame

substrate sites with activity seen. columns : dict for column mapping

substrate : Uniprot ID (P12345); site : phosphorylation site (Y123)

create_binary_evidence(agg='mean', threshold=1.0, evidence_size=None, greater=True)[source]

Returns a binary evidence data frame according to the parameters passed in for aggregating duplicates and for deciding whether a site is included as evidence or not

Parameters:
threshold : float

threshold value used to filter rows

evidence_size: None or int

the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample.

agg : {'count', 'mean'}

method to use when aggregating duplicate substrate-sites. 'count' combines multiple representations and adds if values are non-NaN; 'mean' uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns:
evidence_binary : pd.DataFrame

Matches the evidence dataframe of the kinact object, but with 0 or 1 if a site is included or not. This is uniquified and rows that are never used are removed.
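
As an illustration of these options, the sketch below builds binary evidence either with a fixed abundance threshold or by keeping a fixed number of the most abundant sites per sample; kinact and the specific values are assumptions:

    # Threshold-based evidence: keep sites with abundance >= 1.0 in each sample
    binary_evidence = kinact.create_binary_evidence(agg='mean', threshold=1.0, greater=True)

    # Top-N evidence: evidence_size overrides the threshold and keeps the 100 most abundant sites per sample
    binary_evidence = kinact.create_binary_evidence(agg='mean', evidence_size=100)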

calculate_kinase_activities(agg='mean', threshold=1.0, evidence_size=None, greater=True, PROCESSES=1)[source]

Calculates combined activity of experiments, using a threshold value to determine whether an experiment sees a site or not.

To aggregate values, use 'mean' as agg; mean aggregation drops NA values from consideration.

To use counts, use 'count' as agg; a site is considered present if it is not NA.

Parameters:
data_columns : list

columns that represent experimental results. If None, takes the columns that start with 'data:' in the experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns

threshold : float

threshold value used to filter rows

agg : {'count', 'mean'}

method to use when aggregating duplicate substrate-sites. 'count' combines multiple representations and adds if values are non-NaN; 'mean' uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns:
activities : dict

key : experiment; value : pd DataFrame

network : network name, from networks key; kinase : kinase examined; frequency : number of times kinase was seen in subgraph of evidence and network; kinase_activity : hypergeometric kinase activity

summarize_activities(activities=None, method='median_activity', normalized=False)[source]

Builds a single combined dataframe from the provided activities such that each piece of evidence is given a single column. Values are based on the method selected. The method must be a column in the activities dataframes.

Parameters:
activities : dict

hypergeometric activities that have previously been summarized by network. key : experiment name; value : hypergeometric activity

method : str

The column in the hypergeometric activity to use for summarizing data

Returns:
activity_summary : pandas DataFrame

aggregate_activities(activities=None)[source]

Aggregate network activity using median for all activities

Parameters:

activities : dict

key : experiment; value : kinase activity result

Returns:
summaries : dict

key : experiment; value : summarized kinase activities across networks

find_pvalue_limits(data_columns, agg='count', threshold=1.0)[source]

For each data column and network, find the lowest p-value achievable and how many seen sites are required to get to that limit. Assumptions:

  • kinase size in network is the same for all kinases

Parameters:
data_columns : list

what columns in evidence to compare

agg : str

aggregate function - what function to use for determining if a site is present

count : use when using activity_count; mean : use when using activity_threshold

threshold : float

threshold to use in determining if site present in evidence

Returns:
all_limits : pandas DataFrame

p-value limits of each column for each network. Columns:

evidence : evidence data column; network : network being compared; kinase : kinase being evaluated; evidence_size : size of evidence; limit_size : number of sites to get a non-zero p-value; p-value : p-value generated

limit_summary : pandas DataFrame

summary of all_limits, taking the average over evidence

calculate_Mann_Whitney_activities_sig(log, number_sig_trials=100, PROCESSES=1)[source]

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a pvalue, given observations of a random bootstrapping analysis

Parameters:
kinact_dict: dictionary

A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’

log: logger

Logger for logging activity messages

phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}

Which substrate/kinase type to run activity for: Both ['Y', 'ST'] (default), Tyrosine ['Y'], or Serine/Threonine ['ST']

number_sig_trials: int

Maximum number of significant trials to run

Returns:
getFilteredCompendia(selection_type='KSTAR_NUM_COMPENDIA_CLASS')[source]

Get phosphorylation sites binned based on selection type

The “DotPlot” class

class kstar.plot.DotPlot(values, fpr, alpha=0.05, inclusive_alpha=True, binary_sig=True, dotsize=5, colormap={0: '#6b838f', 1: '#FF3300'}, facecolor='white', labelmap=None, legend_title='p-value', size_number=5, size_color='gray', color_title='Significant', markersize=10, legend_distance=1.0, figsize=(20, 4), title=None, xlabel=True, ylabel=True, x_label_dict=None, kinase_dict=None)[source]

The DotPlot class is used for plotting dotplots, with the option to add clustering and context plots. The size of the dots is based on the values dataframe, where the area of each dot is the value * dotsize.

Parameters:
values: pandas DataFrame instance

values to plot

fpr : pandas DataFrame instance

false positive rates associated with values being plotted

alpha: float, optional

fpr value that defines the significance cutoff to use when plotting. default : 0.05

inclusive_alpha: boolean

whether to include the alpha (significance <= alpha), or not (significance < alpha). default: True

binary_sig: boolean, optional

indicates whether to plot fpr with binary significance or as a changing color hue. default : True

dotsize : float, optional

multiplier to use for scaling size of dots

colormap : dict, optional

maps color values to actual color to use in plotting default : {0: ‘#6b838f’, 1: ‘#FF3300’}

labelmap : dict, optional

maps labels of colors; default is to indicate the FPR cutoff in the legend. default : None

facecolor : color, optional

Background color of dotplot default : ‘white’

legend_title : str, optional

Legend title for dot sizes, default is 'p-value'

size_number : int, optional

Number of dots to attempt to generate for dot size legend

size_color : color, optional

Size Legend Color to use

color_title : str, optional

Legend Title for the Color Legend

markersize : int, optional

Size of dots for Color Legend

legend_distance : int, optional

relative distance to place legends

figsize : tuple, optional

size of dotplot figure

title : str, optional

Title of dotplot

xlabel : bool, optional

Show xlabel on graph if True

ylabel : bool, optional

Show ylabel on graph if True

x_label_dict: dict, optional

Mapping dictionary of labels as they appear in values dataframe (keys) to how they should appear on plot (values)

kinase_dict: dict, optional

Mapping dictionary of kinase names as they appear in values dataframe (keys) to how they should appear on plot (values)

Attributes:
values: pandas dataframe

a copy of the original values dataframe

fpr: pandas dataframe

a copy of the original fpr dataframe

alpha: float

cutoff used for significance, default 0.05

inclusive_alpha: boolean

whether to include the alpha (significance <= alpha), or not (significance < alpha)

significance: pandas dataframe

indicates whether a particular kinase's activity is significant, where fpr <= alpha is significant; otherwise it is insignificant

colors: pandas dataframe

dataframe indicating the color to use when plotting: either a copy of the fpr or significance dataframe

binary_sig: boolean

indicates whether coloring will be done based on binary significance or fpr values. Default True

labelmap: dict

indicates how to label each significance color

figsize: tuple

size of the outputted figure, which is overridden if axes is provided for dotplot

title: string

title of the dotplot

xlabel: boolean

indicates whether to plot x-axis labels

ylabel: boolean

indicates whether to plot y-axis labels

colormap: dict

colors to be used when plotting

facecolor: string

background color of dotplot

Methods

cluster(ax[, method, metric, orientation, ...])

Performs hierarchical clustering on data and plots result to provided Axes.

context(ax, info, id_column, context_columns)

Context plot is generated and returned.

dotplot([ax, orientation, size_legend, ...])

Generates the dotplot, where size is determined by the values dataframe and color is determined by the significance dataframe

drop_kinases(kinase_list)

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object.

drop_kinases_with_no_significance()

Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant

evidence_count(ax, binary_evidence[, ...])

Add bars to dotplot indicating the total number of sites used as evidence in activity calculation

dotplot(ax=None, orientation='left', size_legend=True, color_legend=True, max_size=None)[source]

Generates the dotplot, where size is determined by the values dataframe and color is determined by the significance dataframe

Parameters:
ax : matplotlib Axes instance, optional

axes the dotplot will be plotted on. If None, then a new plot is generated

cluster(ax, method='single', metric='euclidean', orientation='top', color_threshold=-inf)[source]

Performs hierarchical clustering on the data and plots the result to the provided Axes. The result and significance dataframes are ordered according to the clustering.

Parameters:
ax : matplotlib Axes instance

Axes to plot the dendrogram to

method : str, optional

The linkage algorithm to use.

metric : str or function, optional

The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.

orientation : str, optional

The direction to plot the dendrogram, which can be any of the following strings: ‘top’: Plots the root at the top, and plot descendent links going downwards. (default). ‘bottom’: Plots the root at the bottom, and plot descendent links going upwards. ‘left’: Plots the root at the left, and plot descendent links going right. ‘right’: Plots the root at the right, and plot descendent links going left.

drop_kinases_with_no_significance()[source]

Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant

drop_kinases(kinase_list)[source]

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object. Removal is in place

Parameters:
kinase_list: list

list of kinase names to remove

context(ax, info, id_column, context_columns, dotsize=200, markersize=20, orientation='left', color_palette='colorblind', margin=0.2, make_legend=True)[source]

Context plot is generated and returned. The context plot contains the categorical data used for describing the data.

Parameters:
ax : matplotlib axis

where to map subtype information to

info : pandas df

Dataframe where context information is pulled from

id_column: str

Column used to map the subtype information to

context_columns : list

list of columns to pull context information from

dotsize : int, optional

size of context dots

markersize: int, optional

size of legend markers

orientation : str, optional

orientation to plot context plots to - determines where legends are placed. options : left, right, top, bottom

color_palette : str, optional

seaborn color palette to use

margin: float, optional

margin

make_legend : bool, optional

whether to create legend for context colors

evidence_count(ax, binary_evidence, plot_type='bars', phospho_type=None, dot_size=1, include_recommendations=True, ideal_min=None, recommended_min=None, dot_colors=None, bar_line_colors=None)[source]

Add bars to dotplot indicating the total number of sites used as evidence in activity calculation

Parameters:
ax: axes object

where to plot the bars

binary_evidence: pandas dataframe

binarized dataframe produced during activity calculation (threshold applied to original experiment)
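
A hedged sketch of plotting final results with this class, assuming a KinaseActivity object (kinact) on which Mann Whitney analysis has already been run; transforming the p-values with -log10 so that more significant kinases get larger dots is a common convention, but is an assumption here:

    import numpy as np
    import matplotlib.pyplot as plt
    from kstar import plot

    values = -np.log10(kinact.activities_mann_whitney)  # larger dot = more significant activity
    fpr = kinact.fpr_mann_whitney

    dots = plot.DotPlot(values, fpr, alpha=0.05, figsize=(4, 10))
    fig, ax = plt.subplots(figsize=(4, 10))
    dots.dotplot(ax=ax)
    plt.show()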

Supporting Functions

Master Functions for Running KSTAR Pipeline

kstar.calculate.enrichment_analysis(experiment, log, networks, phospho_types=['Y', 'ST'], data_columns=None, agg='mean', threshold=1.0, evidence_size=None, greater=True, PROCESSES=1)[source]

Function to establish a KSTAR KinaseActivity object from an experiment with an activity log, add the networks, and calculate, aggregate, and summarize the hypergeometric enrichment into a final activity object. Should be followed by randomized_analysis, then Mann_Whitney_analysis.

Parameters:
experiment: pandas df

experiment dataframe that has been mapped, includes KSTAR_SITE, KSTAR_ACCESSION, etc.

log: logger object

Log to write activity log error and update to

networks: dictionary of dictionaries

Outer dictionary keys are 'Y' and 'ST'. Establish a network by loading a pickle of the desired networks. See the helpers and config file for this. If downloaded from FigShare, then the GLOBAL network pickles in the config file can be loaded. For example: networks['Y'] = pickle.load(open(config.NETWORK_Y_PICKLE, "rb"))

phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}

Which substrate/kinase type to run activity for: Both ['Y', 'ST'] (default), Tyrosine ['Y'], or Serine/Threonine ['ST']

data_columns : list

columns that represent experimental results. If None, takes the columns that start with 'data:' in the experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns

agg : {'count', 'mean'}

method to use when aggregating duplicate substrate-sites. 'count' combines multiple representations and adds if values are non-NaN; 'mean' uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

threshold : float

threshold value used to filter rows

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns:
kinactDict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run ('Y' and 'ST'). Includes the activities dictionary (see calculate_kinase_activities), the aggregation of activities across networks (see aggregate_activities), and the activity summary (see summarize_activities).
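
A sketch of running the hypergeometric enrichment step using the pickled networks shipped with KSTAR, as suggested above; mapped_experiment (from ExperimentMapper) and the parameter values are assumptions:

    import pickle
    from kstar import config, calculate, helpers

    log = helpers.get_logger('kstar_log', 'kstar.log')

    # Load the pruned tyrosine-kinase networks from the companion pickle
    networks = {'Y': pickle.load(open(config.NETWORK_Y_PICKLE, 'rb'))}

    kinact_dict = calculate.enrichment_analysis(mapped_experiment, log, networks,
                                                phospho_types=['Y'], agg='mean',
                                                threshold=1.0, greater=True, PROCESSES=4)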

kstar.calculate.randomized_analysis(kinact_dict, log, num_random_experiments=150, use_pregen_data=False, save_new_precompute=False, pregenerated_experiments_path=None, directory_for_save_precompute=None, network_hash=None, save_random_experiments=None, PROCESSES=1)[source]

Perform randomized analysis on kinase activity data.

Parameters:
kinact_dict : dict

Dictionary containing kinase activity data.

log : Logger object

Logger to record the progress and any issues during the randomization pipeline.

num_random_experiments : int, optional

Number of random experiments to generate, by default 150.

use_pregen_data : bool, optional

Whether to use pre-generated data, by default False.

save_new_precompute : bool, optional

Whether to save new precomputed data, by default None.

pregenerated_experiments_path : str, optional

Path to the directory containing pre-generated experiments, by default None.

directory_for_save_precompute : str, optional

Directory to save new precomputed data, by default None.

network_hash : str, optional

Hash of the network used, by default None.

save_random_experiments : bool, optional

Whether to save the generated random experiments, by default None.

PROCESSES : int, optional

Number of processes to use for parallel computation, by default 1.

Returns:
None
kstar.calculate.Mann_Whitney_analysis(kinact_dict, log, number_sig_trials=100, PROCESSES=1)[source]

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a pvalue, given observations of a random bootstrapping analysis

Parameters:
kinact_dict: dictionary

A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’

log: logger

Logger for logging activity messages

number_sig_trials: int

Maximum number of significant trials to run
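
Continuing the sketch from enrichment_analysis above, the randomized and Mann Whitney steps operate in place on the same kinact_dict; the parameter values are illustrative:

    calculate.randomized_analysis(kinact_dict, log, num_random_experiments=150, PROCESSES=4)
    calculate.Mann_Whitney_analysis(kinact_dict, log, number_sig_trials=100, PROCESSES=4)

    # Final activities and false positive rates are now available per phospho type
    activities = kinact_dict['Y'].activities_mann_whitney
    fpr = kinact_dict['Y'].fpr_mann_whitney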

Functions for Saving and Loading KSTAR results

kstar.calculate.save_kstar(kinact_dict, name, odir, PICKLE=True)[source]

Having performed the kinase activity calculations (run_kstar_analysis), save each of the important dataframes to files and the final pickle. Saves activities, aggregated_activities, and summarized_activities as tab-separated files, and saves a pickle file of the dictionary.

Parameters:
kinact_dict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run ('Y' and 'ST'). Includes the activities dictionary (see calculate_kinase_activities), the aggregation of activities across networks (see aggregate_activities), and the activity summary (see summarize_activities).

name: string

The name to use when saving activities

odir: string

Output directory to save files and pickle to

PICKLE: boolean

Whether to save the entire pickle file

Returns:
Nothing
kstar.calculate.save_kstar_slim(kinact_dict, name, odir)[source]

Having performed the kinase activity calculations (run_kstar_analysis), save each of the important dataframes, minimizing the memory storage needed to get back to a rebuilt version for plotting results and analysis. For each phospho_type in the kinact_dict, this will save three .tsv files for every activities analysis run, two additional if random analysis was run, and two more if Mann Whitney based analysis was run. It also creates a readme file of the parameter values used.

Parameters:
kinact_dict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run ('Y' and 'ST'). Includes the activities dictionary (see calculate_kinase_activities), the aggregation of activities across networks (see aggregate_activities), and the activity summary (see summarize_activities).

name: string

The name to use when saving activities

odir: string

Output directory to save files and pickle to

Returns:
Nothing
kstar.calculate.from_kstar_slim(name, odir, log)[source]

Given the name and output directory of a saved KSTAR analysis, load the parameters and the minimum dataframes needed for reinstantiating a kinact object. This minimum list will allow you to repeat normalization or Mann Whitney analysis at a different false positive rate threshold and plot results.

Parameters:
name: string

The name used when saving activities and mapped data

odir: string

Output directory of saved files and parameter pickle

log: logger

Logger for logging activity messages
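
A short sketch of saving a slim set of result files and reloading them later; the name and output directory are illustrative, and the return value of from_kstar_slim is assumed to be the rebuilt kinase activity object(s):

    calculate.save_kstar_slim(kinact_dict, name='my_experiment', odir='kstar_results')

    # Later, rebuild the kinase activity object(s) from the saved files
    kinact_dict = calculate.from_kstar_slim(name='my_experiment', odir='kstar_results', log=log)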

kstar.calculate.from_kstar_nextflow(name, odir, log=None)[source]

Given the name and output directory of a saved KSTAR analysis from the nextflow pipeline, load the results into a new kinact object with the minimum dataframes required for analysis (binary experiment, hypergeometric activities, normalized activities, mann whitney activities)

Parameters:
name: string

The name used when saving activities and mapped data

odir: string

Output directory of saved files

log: logger

logger used when loading nextflow data into kinase activity object. If not provided, new logger will be created.

Other Helper Functions

kstar.helpers.process_fasta_file(fasta_file)[source]

For configuration, to convert the global fasta sequence file into a sequence dictionary that can be used in mapping

Parameters:
fasta_file : str

file location of fasta file

Returns:
sequences : dict

{acc : sequence} dictionary generated from fasta file
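
A minimal sketch of building the sequence dictionary from a reference proteome, where the fasta path is illustrative:

    from kstar import helpers

    # {accession : sequence} dictionary, suitable for passing to ExperimentMapper as sequences
    sequences = helpers.process_fasta_file('reference_proteome.fasta')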

kstar.helpers.get_logger(name, filename)[source]

Finds and returns the logger if it exists. Creates a new logger if the log file does not exist.

Parameters:
name : str

log name

filename : str

location to store log file
kstar.helpers.string_to_boolean(string)[source]

Converts string to boolean

Parameters:
string : str

input string

Returns:
result : bool

output boolean

kstar.helpers.convert_acc_to_uniprot(df, acc_col_name, acc_col_type, acc_uni_name)[source]

Given an experimental dataframe (df) with an accession column (acc_col_name) that is not uniprot, use uniprot to append an accession column of uniprot IDs

Parameters:
df: pandas.DataFrame

Dataframe with at least a column of accession of interest

acc_col_name: string

name of column to convert FROM

acc_col_type: string

Uniprot string designation of the accession type to convert FROM, see https://www.uniprot.org/help/api_idmapping

acc_uni_name: string

name of new column

Returns:
appended_df: pandas.DataFrame

Input dataframe with an appended acc column of uniprot IDs
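
A hedged sketch of adding a uniprot accession column to an experiment keyed by RefSeq protein identifiers; the column names and the accession type string are illustrative (see the UniProt ID mapping documentation for valid types):

    from kstar import helpers

    appended_df = helpers.convert_acc_to_uniprot(experiment,
                                                 acc_col_name='RefSeq_Protein',
                                                 acc_col_type='P_REFSEQ_AC',
                                                 acc_uni_name='uniprot_accession')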