KSTAR

The “Config” Module

kstar.config.install_resource_files()[source]

Retrieves the RESOURCE_FILES that accompany this version release from FigShare and unzips them into the correct resource file directory.

kstar.config.install_network_files(target_dir=None)[source]

Retrieves the network files that accompany this version release from FigShare and unzips them into the specified directory.

kstar.config.update_network_directory(directory, create_pickles=True, KSTAR_DIR='/Users/zxa7aw/Documents/KSTAR/KSTAR_documentation_update/KSTAR_documentation-master', NETWORK_DIR='./NETWORKS/NetworKIN')[source]

Update the location of the network files and verify that all necessary files are located in that directory.

Parameters:
directory: string

path to where network files are located

kstar.config.create_network_pickles(phosphoTypes=['Y', 'ST'], network_directory='./NETWORKS/NetworKIN')[source]

Given the network files declared in globals, create pickles of the kstar network objects that can then be quickly loaded in analysis. Assumes that the network structure has two folders, Y and ST, under the NETWORK_DIR global variable and that all .csv files in those directories should be loaded into a network pickle.
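
A minimal sketch of a first-time setup using these helpers; the directory path is illustrative, and it is assumed that install_network_files and update_network_directory point at the same location:

    from kstar import config

    # Download the companion resource and network files from FigShare
    config.install_resource_files()
    config.install_network_files(target_dir='/path/to/NETWORKS')

    # Point KSTAR at the downloaded networks and build the network pickles
    config.update_network_directory('/path/to/NETWORKS', create_pickles=True)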

The “Prune” Module

The “Pruner” Class

class kstar.prune.Pruner(network, logger, phospho_type='Y', acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'])[source]

Pruning Algorithm used for KSTAR.

Parameters:
network : pandas df

weighted kinase-site prediction network where there is an accession, site, kinase, and score column

logger

logger used for pruning

phospho_type : str

phospho_type(s) to use when building pruned networks

acc_col : str

the name of the column containing Uniprot Accession IDs for each substrate in the weighted network

site_col : str

the name of the column containing the residue type and location of each substrate in the weighted network (Y1268, S44, etc.)

nonweight_cols : list

indicates the non-weight-containing columns in the network (these will be removed from the final processed network, as they are not needed). If None, will automatically look for any non-numeric columns and remove them.

Methods

build_multiple_compendia_networks(...[, ...])

Builds multiple compendia-limited networks

build_multiple_networks(kinase_size, ...[, ...])

Basic Network Generation - only takes into account score when determining sites a kinase connects to

build_pruned_network(network, kinase_size, ...)

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

calculate_compendia_sizes(kinase_size)

Calculates the number of sites per compendia size that a kinase should connect to, using the same ratios of compendia sizes as found in the compendia

checkParameters(kinase_size, site_limit)

Given the site_limit and kinase_size parameters to be used during pruning, raise errors if they are not feasible, and raise warnings if the value is higher than we would recommend (>40% of the maximum kinase_size value)

compendia_pruned_network(compendia_sizes, ...)

Builds a compendia-pruned network that takes into account compendia size limits per kinase

getMaximumKinaseSize(site_limit)

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)

getRecommendedKinaseSize(site_limit)

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size

save_networks(network_file)

Save the pruned networks generated by the 'build_multiple_networks' or 'build_multiple_compendia_networks' as a pickle to be loaded by KSTAR

save_run_information()

Save information about the generation of networks during run_pruning, including the parameters used for generation.
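
As a rough sketch, the parameter-checking helpers can be used to choose a feasible kinase_size before pruning; the weighted network dataframe, the logger names, and the example values for kinase_size and site_limit are assumptions:

    from kstar import helpers
    from kstar.prune import Pruner

    log = helpers.get_logger('pruning_log', 'pruning.log')
    pruner = Pruner(weighted_network, log, phospho_type='Y')  # weighted_network: pandas DataFrame of kinase-site scores

    pruner.getRecommendedKinaseSize(site_limit=10)            # prints the theoretical maximum and a recommended range
    pruner.checkParameters(kinase_size=2000, site_limit=10)   # raises errors/warnings if the values are not feasible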

build_pruned_network(network, kinase_size, site_limit)[source]

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

Parameters:
network : pandas DataFrame

network to build pruned network on

kinase_size: int

number of sites each kinase should connect to

site_limit : int

upper limit of number of kinases a site can connect to

Returns:
pruned_network : pandas DataFrame

subset of network that has been pruned

compendia_pruned_network(compendia_sizes, site_limit, odir)[source]

Builds a compendia-pruned network that takes into account compendia size limits per kinase

Parameters:
compendia_sizes : dict

key : compendia size; value : number of sites to connect to kinase

site_limit : int

upper limit of number of kinases a site can connect to

Returns:
pruned_network : pandas DataFrame

subset of network that has been pruned according to compendia ratios

calculate_compendia_sizes(kinase_size)[source]

Calculates the number of sites per compendia size that a kinase should connect to, using the same ratios of compendia sizes as found in the compendia

Parameters:
kinase_size: int

number of sites each kinase should connect to

Returns:
sizes : dict

key : compendia size; value : number of sites each kinase should pull from given compendia size

build_multiple_compendia_networks(kinase_size, site_limit, num_networks, network_id, odir, PROCESSES=1)[source]

Builds multiple compendia-limited networks

Parameters:
kinase_size: int

number of sites each kinase should connect to

site_limit : int

upper limit of number of kinases a site can connect to

num_networks: int

number of networks to build

network_id : str

id to use for each network in dictionary

Returns:
pruned_networks : dict

key : <network_id>_<i>; value : pruned network

build_multiple_networks(kinase_size, site_limit, num_networks, network_id, odir, PROCESSES=1)[source]

Basic Network Generation - only takes into account score when determining sites a kinase connects to

getMaximumKinaseSize(site_limit)[source]

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)

Theoretical maximum exists when each substrate hits the maximum site_limit

Parameters:
site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
theoretical_max_ksize: int

largest possible value that ‘kinase_size’ parameter can have without throwing any errors

getRecommendedKinaseSize(site_limit)[source]

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size

Theoretical maximum exists when each substrate hits the maximum site_limit

Parameters:
site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
Nothing; prints the theoretical maximum kinase_size and the recommended values for the parameter given the site_limit.

checkParameters(kinase_size, site_limit)[source]

Given the site_limit and kinase_size parameters to be used during pruning, raise errors if they are not feasible, and raise warnings if the value is higher than we would recommend (>40% of the maximum kinase_size value)

Parameters:
kinase_size: int

Parameter used in pruning: indicates the number of substrates each kinase will be connected to

site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
Nothing; will only raise errors/warnings if parameters are not feasible.

save_networks(network_file)[source]

Save the pruned networks generated by the ‘build_multiple_networks’ or ‘build_multiple_compendia_networks’ as a pickle to be loaded by KSTAR

save_run_information()[source]

Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.

Functions to Perform Pruning

kstar.prune.run_pruning(network, log, phospho_type, kinase_size, site_limit, num_networks, network_id, odir, use_compendia=True, acc_col='substrate_acc', site_col='site', netcols_todrop=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], PROCESSES=1)[source]

Generate pruned networks from a weighted kinase-substrate graph and log run information

Parameters:
network: pandas dataframe

kinase substrate network matrix, with values indicating weight of kinase-substrate relationship

log: logger

logger to document the pruning process from start to finish

use_compendia: string

whether to use compendia ratios to build network

phospho_type: string

phospho type (‘Y’, ‘ST’, …)

kinase_size: int

number of sites a kinase connects to

site_limit: int

upper limit on the number of kinases a site can connect to

num_networks: int

number of networks to generate

network_id: string

name of network to use in building dictionary

odir: string

output directory for results

Returns:
pruner: Prune object

Pruner object that contains the number of pruned networks indicated by the num_networks parameter
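
A hedged end-to-end sketch of generating and saving pruned networks with run_pruning and save_pruning; the input file name and the parameter values are illustrative:

    import pandas as pd
    from kstar import helpers, prune

    weighted_network = pd.read_csv('weighted_kinase_site_network_Y.tsv', sep='\t')  # illustrative input file
    log = helpers.get_logger('prune_Y', 'prune_Y.log')

    pruner = prune.run_pruning(weighted_network, log, phospho_type='Y',
                               kinase_size=2000, site_limit=10, num_networks=50,
                               network_id='nkin', odir='pruning_results',
                               use_compendia=True, PROCESSES=4)

    prune.save_pruning(phospho_type='Y', network_id='nkin', kinase_size=2000,
                       site_limit=10, use_compendia=True, odir='pruning_results', log=log)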

kstar.prune.save_pruning(phospho_type, network_id, kinase_size, site_limit, use_compendia, odir, log)[source]

Save the pruned networks generated by run_pruning function as a pickle to be loaded by KSTAR

Parameters:
phospho_type: string

type of phosphomodification the networks were generated for (either 'Y' or 'ST')

network_id: string

name of network used to build dictionary

kinase_size: int

number of sites a kinase connects to

site_limit: int

upper limit on the number of kinases a site can connect to

use_compendia: string

whether compendia was used for ratios to build networks

odir: string

output directory for results

log: logger

logger to document pruning process from start to finish

Returns:
Nothing
kstar.prune.save_run_information(results, use_compendia, pruner, unique_id)[source]

Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.

Parameters:
results:

object that stores all parameters used in the pruning process

use_compendia: string

whether compendia was used for ratios to build network

pruner: Prune object

output of the run_pruning() function

Returns:
Nothing

The “ExperimentMapper” class

class kstar.mapping.ExperimentMapper(experiment, columns, logger, sequences=None, compendia=None, window=7, data_columns=None)[source]

Given an experiment object and reference sequences, map the phosphorylation sites to the common reference.

Parameters:
name : str

Name of experiment. Used for logging

experiment: pandas dataframe

Pandas dataframe of an experiment that has a reference accession, a peptide column, and/or a site column. The peptide column should be upper case, with lower case indicating the site of phosphorylation (this is the preferred input). The site column should be in the format S/T/Y<pos>, e.g. Y15 or S345.

columns: dict

Dictionary with mappings of the experiment dataframe column names for the required names ‘accession_id’, ‘peptide’, or ‘site’. One of ‘peptide’ or ‘site’ is required.

logger: Logger object

used for logging when peptides cannot be matched and when a site location changes

sequences: dict

Dictionary of sequences. Key : accession. Value : protein sequence. Default is imported from kstar.config

compendia: pd.DataFrame

Human phosphoproteome compendia, mapped to KinPred and annotated with number of compendia. Default is imported from kstar.config

window : int

The number of amino acids to the N- and C-terminal sides of the central phosphorylation site to include when mapping a site. Default is 7.

data_columns: list, or None

The list of data columns to use. If this is None, the mapper will look for any columns whose names start with 'data:' and use those. Default is None.

Attributes:
experiment: pandas dataframe

mapped experiment, which for each peptide now contains the mapped accession, site, peptide, number of compendia, and compendia type

sequences: dict

Dictionary of sequences passed into the class

compendia: pandas dataframe

compendia dataframe passed into the class

data_columns: list

indicates which columns will be used as data

Methods

align_sites([window])

Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected.

get_experiment()

Return the mapped experiment dataframe

get_sequence(accession)

Gets the sequence that matches the given accession

set_data_columns(data_columns)

Identifies which columns in the experiment should be used as data columns.

set_data_columns(data_columns)[source]

Identifies which columns in the experiment should be used as data columns. If data_columns is provided, then 'data:' is added to the front of each name and the experiment dataframe is renamed. Otherwise, the function will look for columns with 'data:' in front and add these to the data_columns attribute.

get_experiment()[source]

Return the mapped experiment dataframe

get_sequence(accession)[source]

Gets the sequence that matches the given accession

align_sites(window=7)[source]

Map the peptides/sites to the common sequence reference, and remove and report errors for sites that do not align as expected (e.g. expMapper.align_sites(window=7)). Operates on the experiment dataframe of the class.

Parameters:
window: int

The number of amino acids to the N- and C-terminal sides of the central phosphorylation site to include when mapping a site.
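
A minimal sketch of mapping an experiment onto the KSTAR reference; the input file and the experiment column names in the columns dictionary are assumptions, and calling align_sites explicitly may be redundant if it is already run during construction:

    import pandas as pd
    from kstar import helpers, mapping

    experiment = pd.read_csv('phosphoproteomics.tsv', sep='\t')  # illustrative input file
    log = helpers.get_logger('mapping_log', 'mapping.log')

    # Map the experiment's own column names onto the required 'accession_id', 'peptide', and/or 'site' keys
    columns = {'accession_id': 'Protein', 'peptide': 'Peptide'}

    mapper = mapping.ExperimentMapper(experiment, columns, log)
    mapper.align_sites(window=7)
    mapped_experiment = mapper.get_experiment()  # now carries KSTAR_ACCESSION, KSTAR_SITE, etc.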

The “KinaseActivity” class

class kstar.calculate.KinaseActivity(evidence, logger, data_columns=None, phospho_type='Y')[source]

Kinase Activity calculates the estimated activity of kinases given an experiment, using the hypergeometric distribution. The hypergeometric test compares the number of protein sites found to be active in the evidence to the number of protein sites attributed to a kinase in a provided network.

Parameters:
evidence : pandas df

a dataframe that contains, at minimum (but can have more), the data columns used as evidence in the analysis, plus KSTAR_ACCESSION and KSTAR_SITE

data_columns: list

list of the columns containing the abundance values, which will be used to determine which sites will be used as evidence for activity prediction in each sample

logger : Logger object

keeps track of kstar analysis, including any errors that occur

phospho_type: string, either ‘Y’ or ‘ST’

indicates the phospho modification of interest

Attributes:

Upon Initialization

evidence: pandas dataframe

the inputted evidence dataframe

data_columns: list

list of columns containing abundance values, which will be used to determine which sites will be used as evidence. If the inputted data_columns parameter was None, this list includes any column in evidence prefixed by 'data:'

logger : Logger object

keeps track of kstar analysis, including any errors that occur

phospho_type: string

indicates the phosphomodification of interest

network_directory: string

directory where kinase substrate networks can be downloaded, as indicated in config.py

normalized: bool

indicates whether normalization analysis has been performed

aggregate: string

the type of aggregation to use when determining binary evidence, either ‘count’ or ‘mean’. Default is ‘count’.

threshold: float

cutoff to use when determining what sites to use for each experiment

greater: bool

indicates whether sites with greater or lower abundances than the threshold will be used

run_date: string

indicates the date that the kinase activity object was initialized

After Hypergeometric Calculations

real_enrichment: pandas dataframe

p-values obtained for all pruned networks indicating statistical enrichment of a kinase’s substrates for each network, based on hypergeometric tests

activities: pandas dataframe

median p-values obtained from the real_enrichment object for each experiment/kinase

agg_activities: pandas dataframe

After Random Enrichment Calculation

random_experiments: pandas dataframe

contains information about the sites randomly sampled for each random experiment

random_kinact: KinaseActivity object

KinaseActivity object containing random activities predicted from each of the random experiments

After Mann Whitney Analysis

activities_mann_whitney: pandas dataframe

p-values obtained from comparing the real distribution of p-values to the distribution of p-values from random datasets, based on the Mann Whitney U-test

fpr_mann_whitney: pandas dataframe

false positive rates for predicted kinase activities

Methods

add_network(network_id, network[, network_size])

Add network to be analyzed

add_pregenerated_to_random_enrichment()

Combine pre-generated random activities with random enrichment, sort based on the "data" column, and reorganize the combined DataFrame based on the original column order in self.data_columns.

aggregate_activities([activities])

Aggregate network activity using median for all activities

calculate_Mann_Whitney_activities_sig(log[, ...])

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used.

calculate_kinase_activities([agg, ...])

Calculates combined activity of experiments using a threshold value to determine whether an experiment sees a site or not; use 'mean' as agg to aggregate values (mean aggregation drops NA values from consideration) or 'count' to treat a site as present if not NA

calculate_random_activities(logger[, ...])

Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.

check_data_columns()

Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence.

create_binary_evidence([agg, threshold, ...])

Returns a binary evidence data frame according to the parameters passed in for aggregating duplicates and for deciding whether a site is included as evidence or not

find_pvalue_limits(data_columns[, agg, ...])

For each data column and network find the lowest p-value achievable and how many seen sites are required to get to that limit. Assumptions - kinase size in network is same for all kinases.

getFilteredCompendia([selection_type])

Get phosphorylation sites binned based on selection type

get_compendia_distribution(...[, selection_type])

Get the compendia distribution for each data column.

get_run_date()

return date that kinase activities were run

get_run_information_content()

Retrieve network information from RUN_INFORMATION.txt based on phospho_type.

load_pregenerated_random_activities(...)

Load pre-generated random activities for the given datasets.

network_check_for_pregeneration()

Check if the network hash matches a pre-generated network in pregen_experiments and verifies RUN_INFORMATION.txt within the hash subdirectory.

parse_network_information(file_path)

Parse the RUN_INFORMATION.txt file and extract its data.

save_new_precomputed_random_enrichment(...)

Save the new precomputed random enrichment activities to a file.

set_data_columns([data_columns])

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, then data_columns is set to all columns that start with 'data:'

set_evidence(evidence)

Evidence to use in analysis

summarize_activities([activities, method, ...])

Builds a single combined dataframe from the provided activities such that each piece of evidence is given a single column.

test_threshold(threshold[, agg, greater, ...])

Given a threshold value, calculate the distribution of evidence sizes (i.e. the number of sites used in prediction for each sample).

add_networks_batch
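
KinaseActivity objects are normally created for you by enrichment_analysis (see the master functions below); the brief sketch that follows constructs one directly only to illustrate the interface and the threshold check, with mapped_experiment and the threshold value as assumptions:

    from kstar import helpers, calculate

    log = helpers.get_logger('activity_log', 'activity.log')
    kinact = calculate.KinaseActivity(mapped_experiment, log, phospho_type='Y')  # mapped_experiment from ExperimentMapper

    # Inspect how many sites would be used as evidence at a given abundance cutoff
    kinact.test_threshold(threshold=1.0, agg='mean', greater=True, plot=True)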

check_data_columns()[source]

Checks data columns to make sure each column is in evidence and that evidence filtered on that data column has at least one point of evidence. Removes all columns that do not meet these criteria.

set_data_columns(data_columns=None)[source]

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, then data_columns is set to all columns that start with 'data:'.

Checks all set columns to make sure the columns are valid after filtering evidence

test_threshold(threshold, agg='mean', greater=True, plot=False, return_evidence_sizes=False)[source]

Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).

Parameters:
threshold: float

cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.

agg: str

how to combine sites with multiple instances in experiment

greater: bool

whether to use sites greater (True) or less (False) than the threshold

plot: bool

whether to plot a histogram of the evidence sizes used

return_evidence_sizes: bool

indicates whether to return the evidence sizes for all samples or not

Returns:
Outputs the minimum, maximum, and median evidence sizes across all samples. May return the evidence sizes of all samples as a pandas Series.

parse_network_information(file_path)[source]

Parse the RUN_INFORMATION.txt file and extract its data.

Args:

file_path (str): Path to the RUN_INFORMATION.txt file.

Returns:

dict: A dictionary containing the parsed data.

network_check_for_pregeneration()[source]

Check if the network hash matches a pre-generated network in pregen_experiments and verifies RUN_INFORMATION.txt within the hash subdirectory.

Returns:

bool: True if the data matches, False otherwise.

get_compendia_distribution(with_pregenerated_evidence, data_columns, selection_type='KSTAR_NUM_COMPENDIA_CLASS')[source]

Get the compendia distribution for each data column.

Parameters:
with_pregenerated_evidence : pandas DataFrame

KSTAR mapped experimental dataframe that has been binarized by kstar_activity generation.

data_columns : list

Columns that represent experimental results.

selection_type : str, optional

The type of compendia selection, by default ‘KSTAR_NUM_COMPENDIA_CLASS’.

Returns:
dict

Dictionary containing the compendia distribution for each data column.

calculate_random_activities(logger, num_random_experiments=150, use_pregen_data=None, save_new_precompute=None, pregenerated_experiments_path=None, directory_for_save_precompute=None, network_hash=None, save_random_experiments=None, PROCESSES=1)[source]

Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.

Parameters:
logger : Logger object

Logger to record the progress and any issues during the randomization pipeline.

num_random_experiments : int, optional

Number of random experiments to generate, by default 150.

use_pregen_data : bool, optional

Whether to use pre-generated data, by default None.

save_new_precompute : bool, optional

Whether to save new precomputed data, by default None.

pregenerated_experiments_path : str, optional

Path to the directory containing pre-generated experiments, by default None.

directory_for_save_precompute : str, optional

Directory to save new precomputed data, by default None.

network_hash : str, optional

Hash of the network used, by default None.

save_random_experiments : bool, optional

Whether to save the generated random experiments, by default None.

PROCESSES : int, optional

Number of processes to use for parallel computation, by default 1.

Returns:
None
load_pregenerated_random_activities(with_pregenerated_evidence, with_pregenerated, pregen_activities_list)[source]

Load pre-generated random activities for the given datasets.

This function processes datasets that have pre-generated random experiments. It identifies the appropriate pre-generated file based on the size of the dataset and appends the activities to the provided list.

Parameters:
with_pregenerated_evidence : pandas.DataFrame

DataFrame containing the evidence for the datasets with pre-generated random experiments.

with_pregenerated : list

List of dataset names that have pre-generated random experiments.

random_activities_list : list

List to which the concatenated activities of each dataset will be appended.

Returns:
None
add_pregenerated_to_random_enrichment()[source]

Combine pre-generated random activities with random enrichment, sort based on the “data” column, and reorganize the combined DataFrame based on the original column order in self.data_columns.

If use_pregen_data is True and data_columns_from_scratch is None, uses only pre-generated activities. If use_pregen_data is True and data_columns_from_scratch exists, combines both pre-generated and newly calculated random activities. If use_pregen_data is False, uses only newly calculated random activities.

Returns:
None

Updates self.random_enrichment with the combined and sorted activities

save_new_precomputed_random_enrichment(activities_list_df, col)[source]

Save the new precomputed random enrichment activities to a file.

This function saves the provided DataFrame of random enrichment activities to a file, using the specified column name.

Parameters:
activities_list_df : pandas.DataFrame

DataFrame containing the random enrichment activities to be saved.

col : str

Column name to be used for saving the activities.

Returns:
None
get_run_information_content()[source]

Retrieve network information from RUN_INFORMATION.txt based on phospho_type.

Reads the RUN_INFORMATION.txt file from the appropriate network directory based on the phospho_type (‘Y’ or ‘ST’). The file contains network configuration details including unique ID, date, network specifications, and compendia counts.

Returns:
str

Contents of RUN_INFORMATION.txt if found. ‘RUN_INFORMATION.txt file not found.’ if the file doesn’t exist.

Raises:
ValueError

If phospho_type is not ‘Y’ or ‘ST’.

add_network(network_id, network, network_size=None)[source]

Add network to be analyzed

Parameters:
network_id : str

name of the network

network : pandas DataFrame

network with columns substrate_id, site, kinase_id

get_run_date()[source]

return date that kinase activities were run

set_evidence(evidence)[source]

Evidence to use in analysis

Parameters:
evidence : pandas DataFrame

substrate sites with activity seen. columns : dict for column mapping

substrate : Uniprot ID (P12345); site : phosphorylation site (Y123)

create_binary_evidence(agg='mean', threshold=1.0, evidence_size=None, greater=True)[source]

Returns a binary evidence data frame according to the parameters passed in for aggregating duplicates and for deciding whether a site is included as evidence or not

Parameters:
threshold : float

threshold value used to filter rows

evidence_size: None or int

the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample.

agg : {'count', 'mean'}

method to use when aggregating duplicate substrate-sites. 'count' combines multiple representations and adds if values are non-NaN; 'mean' uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns:
evidence_binary : pd.DataFrame

Matches the evidence dataframe of the kinact object, but with 0 or 1 if a site is included or not. This is uniquified and rows that are never used are removed.
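
As an illustration of these options, the sketch below builds binary evidence either with a fixed abundance threshold or by keeping a fixed number of the most abundant sites per sample; kinact and the specific values are assumptions:

    # Threshold-based evidence: keep sites with abundance >= 1.0 in each sample
    binary_evidence = kinact.create_binary_evidence(agg='mean', threshold=1.0, greater=True)

    # Top-N evidence: evidence_size overrides the threshold and keeps the 100 most abundant sites per sample
    binary_evidence = kinact.create_binary_evidence(agg='mean', evidence_size=100)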

calculate_kinase_activities(agg='mean', threshold=1.0, evidence_size=None, greater=True, PROCESSES=1)[source]

Calculates combined activity of experiments, using a threshold value to determine whether an experiment sees a site or not.

To aggregate values, use 'mean' as agg; mean aggregation drops NA values from consideration.

To use counts, use 'count' as agg; a site is considered present if it is not NA.

Parameters:
data_columns : list

columns that represent experimental results. If None, takes the columns that start with 'data:' in the experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns

threshold : float

threshold value used to filter rows

agg : {'count', 'mean'}

method to use when aggregating duplicate substrate-sites. 'count' combines multiple representations and adds if values are non-NaN; 'mean' uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns:
activities : dict

key : experiment; value : pd DataFrame

network : network name, from networks key; kinase : kinase examined; frequency : number of times kinase was seen in subgraph of evidence and network; kinase_activity : hypergeometric kinase activity

summarize_activities(activities=None, method='median_activity', normalized=False)[source]

Builds a single combined dataframe from the provided activities such that each piece of evidence is given a single column. Values are based on the method selected. The method must be a column in the activities dataframes.

Parameters:
activities : dict

hypergeometric activities that have previously been summarized by network. key : experiment name; value : hypergeometric activity

method : str

The column in the hypergeometric activity to use for summarizing data

Returns:
activity_summary : pandas DataFrame

aggregate_activities(activities=None)[source]

Aggregate network activity using median for all activities

Parameters:

activities : dict

key : experiment; value : kinase activity result

Returns:
summaries : dict

key : experiment; value : summarized kinase activities across networks

find_pvalue_limits(data_columns, agg='count', threshold=1.0)[source]

For each data column and network, find the lowest p-value achievable and how many seen sites are required to get to that limit. Assumptions:

  • kinase size in network is the same for all kinases

Parameters:
data_columns : list

what columns in evidence to compare

agg : str

aggregate function - what function to use for determining if a site is present

count : use when using activity_count; mean : use when using activity_threshold

threshold : float

threshold to use in determining if site present in evidence

Returns:
all_limits : pandas DataFrame

p-value limits of each column for each network. Columns:

evidence : evidence data column; network : network being compared; kinase : kinase being evaluated; evidence_size : size of evidence; limit_size : number of sites to get a non-zero p-value; p-value : p-value generated

limit_summary : pandas DataFrame

summary of all_limits, taking the average over evidence

calculate_Mann_Whitney_activities_sig(log, number_sig_trials=100, PROCESSES=1)[source]

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a pvalue, given observations of a random bootstrapping analysis

Parameters:
kinact_dict: dictionary

A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’

log: logger

Logger for logging activity messages

phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}

Which substrate/kinase type to run activity for: Both ['Y', 'ST'] (default), Tyrosine ['Y'], or Serine/Threonine ['ST']

number_sig_trials: int

Maximum number of significant trials to run

Returns:
getFilteredCompendia(selection_type='KSTAR_NUM_COMPENDIA_CLASS')[source]

Get phosphorylation sites binned based on selection type

The “DotPlot” class

class kstar.plot.DotPlot(values, fpr, alpha=0.05, inclusive_alpha=True, binary_sig=True, dotsize=5, colormap={0: '#6b838f', 1: '#FF3300'}, facecolor='white', labelmap=None, legend_title='p-value', size_number=5, size_color='gray', color_title='Significant', markersize=10, legend_distance=1.0, figsize=(20, 4), title=None, xlabel=True, ylabel=True, x_label_dict=None, kinase_dict=None)[source]

The DotPlot class is used for plotting dotplots, with the option to add clustering and context plots. The size of the dots is based on the values dataframe, where the area of each dot is the value * dotsize.

Parameters:
values: pandas DataFrame instance

values to plot

fpr : pandas DataFrame instance

false positive rates associated with values being plotted

alpha: float, optional

fpr value that defines the significance cutoff to use when plotting. default : 0.05

inclusive_alpha: boolean

whether to include the alpha (significance <= alpha), or not (significance < alpha). default: True

binary_sig: boolean, optional

indicates whether to plot fpr with binary significance or as a changing color hue. default : True

dotsize : float, optional

multiplier to use for scaling size of dots

colormap : dict, optional

maps color values to actual color to use in plotting default : {0: ‘#6b838f’, 1: ‘#FF3300’}

labelmap : dict, optional

maps labels of colors; default is to indicate the FPR cutoff in the legend. default : None

facecolor : color, optional

Background color of dotplot default : ‘white’

legend_title : str, optional

Legend title for dot sizes, default is 'p-value'

size_number : int, optional

Number of dots to attempt to generate for dot size legend

size_color : color, optional

Size Legend Color to use

color_title : str, optional

Legend Title for the Color Legend

markersize : int, optional

Size of dots for Color Legend

legend_distance : int, optional

relative distance to place legends

figsize : tuple, optional

size of dotplot figure

title : str, optional

Title of dotplot

xlabel : bool, optional

Show xlabel on graph if True

ylabel : bool, optional

Show ylabel on graph if True

x_label_dict: dict, optional

Mapping dictionary of labels as they appear in values dataframe (keys) to how they should appear on plot (values)

kinase_dict: dict, optional

Mapping dictionary of kinase names as they appear in values dataframe (keys) to how they should appear on plot (values)

Attributes:
values: pandas dataframe

a copy of the original values dataframe

fpr: pandas dataframe

a copy of the original fpr dataframe

alpha: float

cutoff used for significance, default 0.05

inclusive_alpha: boolean

whether to include the alpha (significance <= alpha), or not (significance < alpha)

significance: pandas dataframe

indicates whether a particular kinase's activity is significant, where fpr <= alpha is significant; otherwise it is insignificant

colors: pandas dataframe

dataframe indicating the color to use when plotting: either a copy of the fpr or significance dataframe

binary_sig: boolean

indicates whether coloring will be done based on binary significance or fpr values. Default True

labelmap: dict

indicates how to label each significance color

figsize: tuple

size of the outputted figure, which is overridden if axes is provided for dotplot

title: string

title of the dotplot

xlabel: boolean

indicates whether to plot x-axis labels

ylabel: boolean

indicates whether to plot y-axis labels

colormap: dict

colors to be used when plotting

facecolor: string

background color of dotplot

Methods

cluster(ax[, method, metric, orientation, ...])

Performs hierarchical clustering on data and plots result to provided Axes.

context(ax, info, id_column, context_columns)

Context plot is generated and returned.

dotplot([ax, orientation, size_legend, ...])

Generates the dotplot, where size is determined by the values dataframe and color is determined by the significance dataframe

drop_kinases(kinase_list)

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object.

drop_kinases_with_no_significance()

Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant

evidence_count(ax, binary_evidence[, ...])

Add bars to dotplot indicating the total number of sites used as evidence in activity calculation

dotplot(ax=None, orientation='left', size_legend=True, color_legend=True, max_size=None)[source]

Generates the dotplot, where size is determined by the values dataframe and color is determined by the significance dataframe

Parameters:
ax : matplotlib Axes instance, optional

axes the dotplot will be plotted on. If None, then a new plot is generated

cluster(ax, method='single', metric='euclidean', orientation='top', color_threshold=-inf)[source]

Performs hierarchical clustering on the data and plots the result to the provided Axes. The result and significance dataframes are ordered according to the clustering.

Parameters:
ax : matplotlib Axes instance

Axes to plot the dendrogram to

method : str, optional

The linkage algorithm to use.

metric : str or function, optional

The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.

orientation : str, optional

The direction to plot the dendrogram, which can be any of the following strings: ‘top’: Plots the root at the top, and plot descendent links going downwards. (default). ‘bottom’: Plots the root at the bottom, and plot descendent links going upwards. ‘left’: Plots the root at the left, and plot descendent links going right. ‘right’: Plots the root at the right, and plot descendent links going left.

drop_kinases_with_no_significance()[source]

Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant

drop_kinases(kinase_list)[source]

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object. Removal is in place

Parameters:
kinase_list: list

list of kinase names to remove

context(ax, info, id_column, context_columns, dotsize=200, markersize=20, orientation='left', color_palette='colorblind', margin=0.2, make_legend=True)[source]

Context plot is generated and returned. The context plot contains the categorical data used for describing the data.

Parameters:
ax : matplotlib axis

where to map subtype information to

info : pandas df

Dataframe where context information is pulled from

id_column: str

Column used to map the subtype information to

context_columns : list

list of columns to pull context information from

dotsize : int, optional

size of context dots

markersize: int, optional

size of legend markers

orientation : str, optional

orientation to plot context plots to - determines where legends are placed. options : left, right, top, bottom

color_palette : str, optional

seaborn color palette to use

margin: float, optional

margin

make_legend : bool, optional

whether to create legend for context colors

evidence_count(ax, binary_evidence, plot_type='bars', phospho_type=None, dot_size=1, include_recommendations=True, ideal_min=None, recommended_min=None, dot_colors=None, bar_line_colors=None)[source]

Add bars to dotplot indicating the total number of sites used as evidence in activity calculation

Parameters:
ax: axes object

where to plot the bars

binary_evidence: pandas dataframe

binarized dataframe produced during activity calculation (threshold applied to original experiment)
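
A hedged sketch of plotting final results with this class, assuming a KinaseActivity object (kinact) on which Mann Whitney analysis has already been run; transforming the p-values with -log10 so that more significant kinases get larger dots is a common convention, but is an assumption here:

    import numpy as np
    import matplotlib.pyplot as plt
    from kstar import plot

    values = -np.log10(kinact.activities_mann_whitney)  # larger dot = more significant activity
    fpr = kinact.fpr_mann_whitney

    dots = plot.DotPlot(values, fpr, alpha=0.05, figsize=(4, 10))
    fig, ax = plt.subplots(figsize=(4, 10))
    dots.dotplot(ax=ax)
    plt.show()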

Supporting Functions

Master Functions for Running KSTAR Pipeline

kstar.calculate.enrichment_analysis(experiment, log, networks, phospho_types=['Y', 'ST'], data_columns=None, agg='mean', threshold=1.0, evidence_size=None, greater=True, PROCESSES=1)[source]

Function to establish a KSTAR KinaseActivity object from an experiment with an activity log, add the networks, and calculate, aggregate, and summarize the hypergeometric enrichment into a final activity object. Should be followed by randomized_analysis, then Mann_Whitney_analysis.

Parameters:
experiment: pandas df

experiment dataframe that has been mapped, includes KSTAR_SITE, KSTAR_ACCESSION, etc.

log: logger object

Log to write activity log error and update to

networks: dictionary of dictionaries

Outer dictionary keys are 'Y' and 'ST'. Establish a network by loading a pickle of the desired networks. See the helpers and config file for this. If downloaded from FigShare, then the GLOBAL network pickles in the config file can be loaded. For example: networks['Y'] = pickle.load(open(config.NETWORK_Y_PICKLE, "rb"))

phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}

Which substrate/kinase type to run activity for: Both ['Y', 'ST'] (default), Tyrosine ['Y'], or Serine/Threonine ['ST']

data_columns : list

columns that represent experimental results. If None, takes the columns that start with 'data:' in the experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns

agg : {'count', 'mean'}

method to use when aggregating duplicate substrate-sites. 'count' combines multiple representations and adds if values are non-NaN; 'mean' uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

threshold : float

threshold value used to filter rows

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns:
kinactDict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run ('Y' and 'ST'). Includes the activities dictionary (see calculate_kinase_activities), the aggregation of activities across networks (see aggregate_activities), and the activity summary (see summarize_activities).
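
A sketch of running the hypergeometric enrichment step using the pickled networks shipped with KSTAR, as suggested above; mapped_experiment (from ExperimentMapper) and the parameter values are assumptions:

    import pickle
    from kstar import config, calculate, helpers

    log = helpers.get_logger('kstar_log', 'kstar.log')

    # Load the pruned tyrosine-kinase networks from the companion pickle
    networks = {'Y': pickle.load(open(config.NETWORK_Y_PICKLE, 'rb'))}

    kinact_dict = calculate.enrichment_analysis(mapped_experiment, log, networks,
                                                phospho_types=['Y'], agg='mean',
                                                threshold=1.0, greater=True, PROCESSES=4)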

kstar.calculate.randomized_analysis(kinact_dict, log, num_random_experiments=150, use_pregen_data=False, save_new_precompute=False, pregenerated_experiments_path=None, directory_for_save_precompute=None, network_hash=None, save_random_experiments=None, PROCESSES=1)[source]

Perform randomized analysis on kinase activity data.

Parameters:
kinact_dict : dict

Dictionary containing kinase activity data.

log : Logger object

Logger to record the progress and any issues during the randomization pipeline.

num_random_experiments : int, optional

Number of random experiments to generate, by default 150.

use_pregen_data : bool, optional

Whether to use pre-generated data, by default False.

save_new_precompute : bool, optional

Whether to save new precomputed data, by default None.

pregenerated_experiments_path : str, optional

Path to the directory containing pre-generated experiments, by default None.

directory_for_save_precompute : str, optional

Directory to save new precomputed data, by default None.

network_hash : str, optional

Hash of the network used, by default None.

save_random_experiments : bool, optional

Whether to save the generated random experiments, by default None.

PROCESSES : int, optional

Number of processes to use for parallel computation, by default 1.

Returns:
None
kstar.calculate.Mann_Whitney_analysis(kinact_dict, log, number_sig_trials=100, PROCESSES=1)[source]

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a pvalue, given observations of a random bootstrapping analysis

Parameters:
kinact_dict: dictionary

A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’

log: logger

Logger for logging activity messages

number_sig_trials: int

Maximum number of significant trials to run
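
Continuing the sketch from enrichment_analysis above, the randomized and Mann Whitney steps operate in place on the same kinact_dict; the parameter values are illustrative:

    calculate.randomized_analysis(kinact_dict, log, num_random_experiments=150, PROCESSES=4)
    calculate.Mann_Whitney_analysis(kinact_dict, log, number_sig_trials=100, PROCESSES=4)

    # Final activities and false positive rates are now available per phospho type
    activities = kinact_dict['Y'].activities_mann_whitney
    fpr = kinact_dict['Y'].fpr_mann_whitney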

Functions for Saving and Loading KSTAR results

kstar.calculate.save_kstar(kinact_dict, name, odir, PICKLE=True)[source]

Having performed the kinase activity calculations (run_kstar_analysis), save each of the important dataframes to files and the final pickle. Saves activities, aggregated_activities, and summarized_activities as tab-separated files, and saves a pickle file of the dictionary.

Parameters:
kinact_dict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run ('Y' and 'ST'). Includes the activities dictionary (see calculate_kinase_activities), the aggregation of activities across networks (see aggregate_activities), and the activity summary (see summarize_activities).

name: string

The name to use when saving activities

odir: string

Output directory to save files and pickle to

PICKLE: boolean

Whether to save the entire pickle file

Returns:
Nothing
kstar.calculate.save_kstar_slim(kinact_dict, name, odir)[source]

Having performed the kinase activity calculations (run_kstar_analysis), save each of the important dataframes, minimizing the memory storage needed to get back to a rebuilt version for plotting results and analysis. For each phospho_type in the kinact_dict, this will save three .tsv files for every activities analysis run, two additional if random analysis was run, and two more if Mann Whitney based analysis was run. It also creates a readme file of the parameter values used.

Parameters:
kinact_dict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run ('Y' and 'ST'). Includes the activities dictionary (see calculate_kinase_activities), the aggregation of activities across networks (see aggregate_activities), and the activity summary (see summarize_activities).

name: string

The name to use when saving activities

odir: string

Output directory to save files and pickle to

Returns:
Nothing
kstar.calculate.from_kstar_slim(name, odir, log)[source]

Given the name and output directory of a saved KSTAR analysis, load the parameters and the minimum dataframes needed for reinstantiating a kinact object. This minimum list will allow you to repeat normalization or Mann Whitney analysis at a different false positive rate threshold and plot results.

Parameters:
name: string

The name used when saving activities and mapped data

odir: string

Output directory of saved files and parameter pickle

log: logger

Logger for logging activity messages
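
A short sketch of saving a slim set of result files and reloading them later; the name and output directory are illustrative, and the return value of from_kstar_slim is assumed to be the rebuilt kinase activity object(s):

    calculate.save_kstar_slim(kinact_dict, name='my_experiment', odir='kstar_results')

    # Later, rebuild the kinase activity object(s) from the saved files
    kinact_dict = calculate.from_kstar_slim(name='my_experiment', odir='kstar_results', log=log)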

kstar.calculate.from_kstar_nextflow(name, odir, log=None)[source]

Given the name and output directory of a saved KSTAR analysis from the nextflow pipeline, load the results into a new kinact object with the minimum dataframes required for analysis (binary experiment, hypergeometric activities, normalized activities, mann whitney activities)

Parameters:
name: string

The name used when saving activities and mapped data

odir: string

Output directory of saved files

log: logger

logger used when loading nextflow data into kinase activity object. If not provided, new logger will be created.

Other Helper Functions

kstar.helpers.process_fasta_file(fasta_file)[source]

For configuration, to convert the global fasta sequence file into a sequence dictionary that can be used in mapping

Parameters:
fasta_file : str

file location of fasta file

Returns:
sequences : dict

{acc : sequence} dictionary generated from fasta file
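
A minimal sketch of building the sequence dictionary from a reference proteome, where the fasta path is illustrative:

    from kstar import helpers

    # {accession : sequence} dictionary, suitable for passing to ExperimentMapper as sequences
    sequences = helpers.process_fasta_file('reference_proteome.fasta')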

kstar.helpers.get_logger(name, filename)[source]

Finds and returns the logger if it exists. Creates a new logger if the log file does not exist.

Parameters:
name : str

log name

filename : str

location to store log file
kstar.helpers.string_to_boolean(string)[source]

Converts string to boolean

Parameters:
string : str

input string

Returns:
result : bool

output boolean

kstar.helpers.convert_acc_to_uniprot(df, acc_col_name, acc_col_type, acc_uni_name)[source]

Given an experimental dataframe (df) with an accession column (acc_col_name) that is not uniprot, use uniprot to append an accession column of uniprot IDs

Parameters:
df: pandas.DataFrame

Dataframe with at least a column of accession of interest

acc_col_name: string

name of column to convert FROM

acc_col_type: string

Uniprot string designation of the accession type to convert FROM, see https://www.uniprot.org/help/api_idmapping

acc_uni_name: string

name of new column

Returns:
appended_df: pandas.DataFrame

Input dataframe with an appended acc column of uniprot IDs
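
A hedged sketch of adding a uniprot accession column to an experiment keyed by RefSeq protein identifiers; the column names and the accession type string are illustrative (see the UniProt ID mapping documentation for valid types):

    from kstar import helpers

    appended_df = helpers.convert_acc_to_uniprot(experiment,
                                                 acc_col_name='RefSeq_Protein',
                                                 acc_col_type='P_REFSEQ_AC',
                                                 acc_uni_name='uniprot_accession')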