KSTAR
The “Config” Module
- kstar.config.install_resource_files()
Retrieves the RESOURCE_FILES that accompany this version release from FigShare and unzips them to the correct directory for resource files.
- kstar.config.install_network_files(target_dir=None)
Retrieves the network files that accompany this version release from FigShare and unzips them to the specified directory.
- kstar.config.update_network_directory(directory, create_pickles=True, KSTAR_DIR='/Users/zxa7aw/Documents/KSTAR/KSTAR_documentation_update/KSTAR_documentation-master', NETWORK_DIR='./NETWORKS/NetworKIN')
Update the location of the network files, and verify that all necessary files are located in directory
- Parameters:
- directory: string
path to where network files are located
- kstar.config.create_network_pickles(phosphoTypes=['Y', 'ST'], network_directory='./NETWORKS/NetworKIN')
Given network files declared in globals, create pickles of the kstar object that can then be quickly loaded in analysis. Assumes that the network structure has two folders, Y and ST, under the NETWORK_DIR global variable and that all .csv files in those directories should be loaded into a network pickle.
The “Prune” Module
The “Pruner” Class
- class kstar.prune.Pruner(network, logger, phospho_type='Y', acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'])
Pruning Algorithm used for KSTAR.
- Parameters:
- network: pandas df
weighted kinase-site prediction network where there is an accession, site, kinase, and score column
- logger
logger used for pruning
- phospho_type: str
phospho_type(s) to use when building pruned networks
- acc_col: str
the name of the column containing Uniprot Accession IDs for each substrate in the weighted network
- site_col: str
the name of the column containing the residue type and location of each substrate in the weighted network (Y1268, S44, etc.)
- nonweight_cols: list
indicates the columns in the network that do not contain weights (these are removed in the final processed network, as they are not needed). If None, any non-numeric columns are found automatically and removed.
Methods
- build_multiple_compendia_networks(...[, ...]): Builds multiple compendia-limited networks
- build_multiple_networks(kinase_size, ...[, ...]): Basic Network Generation - only takes into account score when determining sites a kinase connects to
- build_pruned_network(network, kinase_size, ...): Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to
- calculate_compendia_sizes(kinase_size): Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia
- checkParameters(kinase_size, site_limit): Given the site_limit and kinase_size parameters to be used during pruning, raise errors if not feasible, and raise warnings if value is higher than we would recommend (>40% of the maximum kinase_size value)
- compendia_pruned_network(compendia_sizes, ...): Builds a compendia-pruned network that takes into account compendia size limits per kinase
- getMaximumKinaseSize(site_limit): Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)
- getRecommendedKinaseSize(site_limit): Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size
- save_networks(network_file): Save the pruned networks generated by the 'build_multiple_networks' or 'build_multiple_compendia_networks' as a pickle to be loaded by KSTAR
Save information about the generation of networks during run_pruning, including the parameters used for generation.
- build_pruned_network(network, kinase_size, site_limit)
Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to
- Parameters:
- network: pandas DataFrame
network to build pruned network on
- kinase_size: int
number of sites each kinase should connect to
- site_limit: int
upper limit of number of kinases a site can connect to
- Returns:
- pruned network: pandas DataFrame
subset of network that has been pruned
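The two constraints described above (a fixed number of sites per kinase, a cap on kinases per site) can be illustrated with a small plain-Python sketch. This is not KSTAR's implementation: the function name `prune_network_sketch`, the nested-dict input, and the greedy highest-weight-first selection are all assumptions made for illustration only.

```python
def prune_network_sketch(scores, kinase_size, site_limit):
    """Greedy stand-in for heuristic pruning (hypothetical, not KSTAR's code).

    scores: dict mapping kinase -> dict of site -> weight.
    Returns a dict mapping kinase -> list of selected sites.
    """
    site_counts = {}  # number of kinases each site is already connected to
    pruned = {}
    for kinase in sorted(scores):
        # Rank this kinase's candidate sites from highest to lowest weight.
        ranked = sorted(scores[kinase], key=lambda s: (-scores[kinase][s], s))
        chosen = []
        for site in ranked:
            if len(chosen) == kinase_size:
                break
            # Skip sites that have already reached site_limit.
            if site_counts.get(site, 0) < site_limit:
                chosen.append(site)
                site_counts[site] = site_counts.get(site, 0) + 1
        if len(chosen) < kinase_size:
            raise ValueError(f"{kinase}: only {len(chosen)} sites available")
        pruned[kinase] = chosen
    return pruned


# Toy weights (made up for this example)
toy_scores = {
    "KIN_A": {"P1_Y10": 0.9, "P2_Y20": 0.8, "P3_Y30": 0.1, "P4_Y40": 0.05},
    "KIN_B": {"P1_Y10": 0.7, "P2_Y20": 0.6, "P3_Y30": 0.5, "P4_Y40": 0.4},
    "KIN_C": {"P1_Y10": 0.9, "P2_Y20": 0.2, "P3_Y30": 0.3, "P4_Y40": 0.25},
}
pruned = prune_network_sketch(toy_scores, kinase_size=2, site_limit=2)
```

In this toy run KIN_C cannot use its best-scoring site because two kinases already claimed it, which is the effect site_limit is meant to have.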
- compendia_pruned_network(compendia_sizes, site_limit, odir)
Builds a compendia-pruned network that takes into account compendia size limits per kinase
- Parameters:
- compendia_sizes: dict
key: compendia size; value: number of sites to connect to kinase
- site_limit: int
upper limit of number of kinases a site can connect to
- Returns:
- pruned_network: pandas DataFrame
subset of network that has been pruned according to compendia ratios
- calculate_compendia_sizes(kinase_size)
Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia
- Parameters:
- kinase_size: int
number of sites each kinase should connect to
- Returns:
- sizes: dict
key: compendia size; value: number of sites each kinase should pull from given compendia size
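Splitting kinase_size across compendia sizes "using the same ratios" amounts to a proportional allocation. The sketch below shows one way to do that with largest-remainder rounding so the parts sum to kinase_size. The function name, the rounding rule, and the class counts are assumptions for illustration; the source does not state how KSTAR rounds.

```python
def allocate_by_ratio(class_counts, kinase_size):
    """Split kinase_size across classes in proportion to class_counts."""
    total = sum(class_counts.values())
    exact = {c: kinase_size * n / total for c, n in class_counts.items()}
    sizes = {c: int(v) for c, v in exact.items()}  # floor of each share
    # Give the leftover sites to the classes with the largest fractional part.
    leftover = kinase_size - sum(sizes.values())
    by_remainder = sorted(exact, key=lambda c: (sizes[c] - exact[c], c))
    for c in by_remainder[:leftover]:
        sizes[c] += 1
    return sizes


# Hypothetical number of sites observed in each compendia size class
class_counts = {0: 500, 1: 300, 2: 200}
sizes = allocate_by_ratio(class_counts, kinase_size=25)
```

The returned dictionary has the same shape as the documented return value: compendia size as key, number of sites to draw from that class as value.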
- build_multiple_compendia_networks(kinase_size, site_limit, num_networks, network_id, odir, PROCESSES=1)
Builds multiple compendia-limited networks
- Parameters:
- kinase_size: int
number of sites each kinase should connect to
- site_limit: int
upper limit of number of kinases a site can connect to
- num_networks: int
number of networks to build
- network_id: str
id to use for each network in dictionary
- Returns:
- pruned_networks: dict
key: <network_id>_<i>; value: pruned network
- build_multiple_networks(kinase_size, site_limit, num_networks, network_id, odir, PROCESSES=1)
Basic Network Generation - only takes into account score when determining sites a kinase connects to
- getMaximumKinaseSize(site_limit)
Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter).
Theoretical maximum exists when each substrate hits the maximum site_limit.
- Parameters:
- site_limit: int
Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network
- Returns:
- theoretical_max_ksize: int
largest possible value that ‘kinase_size’ parameter can have without throwing any errors
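One reading of "theoretical maximum exists when each substrate hits the maximum site_limit" is a simple counting argument: if every site is connected to site_limit kinases, the network holds num_sites * site_limit connections, shared equally among the kinases. The sketch below encodes that reading; the formula and the function name are assumptions, not taken from the KSTAR source.

```python
def theoretical_max_kinase_size(num_sites, num_kinases, site_limit):
    """Upper bound on kinase_size if every site is used site_limit times."""
    total_connections = num_sites * site_limit
    # Each kinase receives the same number of connections, so divide evenly.
    return total_connections // num_kinases


# Hypothetical network dimensions
max_ksize = theoretical_max_kinase_size(num_sites=10000, num_kinases=50, site_limit=10)
```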
- getRecommendedKinaseSize(site_limit)
Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size.
Theoretical maximum exists when each substrate hits the maximum site_limit.
- Parameters:
- site_limit: int
Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network
- Returns:
- Nothing, prints theoretical maximum of kinase size and the recommended values for the parameter given the site_limit
- checkParameters(kinase_size, site_limit)
Given the site_limit and kinase_size parameters to be used during pruning, raise errors if not feasible, and raise warnings if value is higher than we would recommend (>40% of the maximum kinase_size value)
- Parameters:
- kinase_size: int
Parameter used in pruning: indicates the number of substrates each kinase will be connected to
- site_limit: int
Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network
- Returns:
- Nothing, will only raise errors/warnings if parameters are not feasible
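The error/warning behaviour described for checkParameters can be sketched as follows. The 40% cutoff comes from the description above; the function name, the `max_kinase_size` argument (passed in directly here rather than computed from a network), and the message wording are assumptions for illustration.

```python
import warnings


def check_parameters_sketch(kinase_size, max_kinase_size, warn_fraction=0.4):
    """Raise if kinase_size is infeasible; warn if it exceeds the advised share."""
    if kinase_size > max_kinase_size:
        raise ValueError(
            f"kinase_size={kinase_size} exceeds the maximum of {max_kinase_size}"
        )
    if kinase_size > warn_fraction * max_kinase_size:
        warnings.warn(
            f"kinase_size={kinase_size} is above {warn_fraction:.0%} of the maximum"
        )


# Capture the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_parameters_sketch(kinase_size=900, max_kinase_size=2000)
warning_messages = [str(w.message) for w in caught]
```

Like the documented method, the sketch returns nothing; its only effects are the exception or the warning.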
Functions to Perform Pruning
- kstar.prune.run_pruning(network, log, phospho_type, kinase_size, site_limit, num_networks, network_id, odir, use_compendia=True, acc_col='substrate_acc', site_col='site', netcols_todrop=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], PROCESSES=1)
Generate pruned networks from a weighted kinase-substrate graph and log run information
- Parameters:
- network: pandas dataframe
kinase substrate network matrix, with values indicating weight of kinase-substrate relationship
- log: logger
logger to document the pruning process from start to finish
- use_compendia: bool
whether to use compendia ratios to build network
- phospho_type: string
phospho type (‘Y’, ‘ST’, …)
- kinase_size: int
number of sites a kinase connects to
- site_limit: int
upper limit of number of kinases a site can connect to
- num_networks: int
number of networks to generate
- network_id: string
name of network to use in building dictionary
- odir: string
output directory for results
- Returns:
- pruner: Pruner object
Pruner object that contains the number of pruned networks indicated by the num_networks parameter
- kstar.prune.save_pruning(phospho_type, network_id, kinase_size, site_limit, use_compendia, odir, log)
Save the pruned networks generated by run_pruning function as a pickle to be loaded by KSTAR
- Parameters:
- phospho_type: string
type of phosphomodification the networks were generated for (either ‘Y’ or ‘ST’)
- network_id: string
name of network used to build dictionary
- kinase_size: int
number of sites a kinase connects to
- site_limit: int
upper limit of number of kinases a site can connect to
- use_compendia: string
whether compendia was used for ratios to build networks
- odir: string
output directory for results
- log: logger
logger to document pruning process from start to finish
- Returns:
- Nothing
- kstar.prune.save_run_information(results, use_compendia, pruner, unique_id)
Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.
- Parameters:
- results:
object that stores all parameters used in the pruning process
- use_compendia: string
whether compendia was used for ratios to build network
- pruner: Pruner object
output of the run_pruning() function
- Returns:
- Nothing
The “ExperimentMapper” class
- class kstar.mapping.ExperimentMapper(experiment, columns, logger, sequences=None, compendia=None, window=7, data_columns=None)
Given an experiment object and reference sequences, map the phosphorylation sites to the common reference.
- Parameters:
- name: str
Name of experiment. Used for logging
- experiment: pandas dataframe
Pandas dataframe of an experiment that has a reference accession, a peptide column and/or a site column. The peptide column should be upper case, with lower case indicating the site of phosphorylation (this is the preferred input). The site column should be in the format S/T/Y<pos>, e.g. Y15 or S345.
- columns: dict
Dictionary with mappings of the experiment dataframe column names for the required names ‘accession_id’, ‘peptide’, or ‘site’. One of ‘peptide’ or ‘site’ is required.
- logger: Logger object
used for logging when peptides cannot be matched and when a site location changes
- sequences: dict
Dictionary of sequences. Key : accession. Value : protein sequence. Default is imported from kstar.config
- compendia: pd.DataFrame
Human phosphoproteome compendia, mapped to KinPred and annotated with number of compendia. Default is imported from kstar.config
- window: int
The number of amino acids on the N- and C-terminal sides of the central phosphorylation site to use when mapping a site. Default is 7.
- data_columns: list, or empty
The list of data columns to use. If this is empty, the mapper will look for columns that start with ‘data:’ and use those. Default is None.
- Attributes:
- experiment: pandas dataframe
mapped experiment, which for each peptide now contains the mapped accession, site, peptide, number of compendia, and compendia type
- sequences: dict
Dictionary of sequences passed into the class
- compendia: pandas dataframe
compendia dataframe passed into the class
- data_columns: list
indicates which columns will be used as data
Methods
- align_sites([window]): Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected.
Return the mapped experiment dataframe
- get_sequence(accession): Gets the sequence that matches the given accession
- set_data_columns(data_columns): Identifies which columns in the experiment should be used as data columns.
- set_data_columns(data_columns)
Identifies which columns in the experiment should be used as data columns. If data_columns is provided, then ‘data:’ is added to the front of each name and the experiment dataframe columns are renamed. Otherwise, the function looks for columns with ‘data:’ in front and assigns these to the data_columns attribute.
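The ‘data:’ prefix convention described here can be sketched without pandas. The helper below is hypothetical (its name and its return shape are not part of KSTAR); it only illustrates the two branches: prefix the supplied names, or discover names that already carry the prefix.

```python
def resolve_data_columns(all_columns, data_columns=None):
    """Return (rename_map, data_columns) following the 'data:' prefix convention."""
    if data_columns:
        # Names were supplied: prefix each one and report the rename to apply.
        rename_map = {col: f"data:{col}" for col in data_columns}
        return rename_map, list(rename_map.values())
    # Nothing supplied: pick up columns that already start with 'data:'.
    found = [col for col in all_columns if col.startswith("data:")]
    return {}, found


# Hypothetical column names
renames, named = resolve_data_columns(["acc", "pep", "t0", "t5"], ["t0", "t5"])
_, discovered = resolve_data_columns(["acc", "pep", "data:t0", "data:t5"])
```

With a pandas dataframe, a rename map of this shape could be passed to `DataFrame.rename(columns=...)`.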
- align_sites(window=7)
Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected. Operates on the experiment dataframe of the class. Example call: expMapper.align_sites(window=7).
- Parameters:
- window: int
The number of amino acids on the N- and C-terminal sides of the central phosphorylation site to use when mapping a site.
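To make the window parameter concrete, the sketch below extracts the residues on either side of a site from a reference sequence and rejects sites whose residue does not match the sequence. It is an illustration only: the function name, the `_` padding character for sites near a terminus, and the toy sequence are assumptions, not KSTAR behaviour.

```python
def site_window(sequence, site, window=7, pad="_"):
    """Return the +/- window peptide around a site such as 'Y15', or None on mismatch."""
    residue, position = site[0], int(site[1:])
    idx = position - 1  # sites are 1-based, Python strings are 0-based
    if idx < 0 or idx >= len(sequence) or sequence[idx] != residue:
        return None  # the site does not align with the reference sequence
    left = sequence[max(0, idx - window):idx].rjust(window, pad)
    right = sequence[idx + 1:idx + 1 + window].ljust(window, pad)
    # Lower case marks the phosphorylated residue, as in the peptide column format.
    return left + residue.lower() + right


toy_sequence = "MSTAYKLLPQRSTVWY"  # made-up sequence
centered = site_window(toy_sequence, "Y5", window=3)
near_terminus = site_window(toy_sequence, "S2", window=3)
mismatch = site_window(toy_sequence, "Y3", window=3)
```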
The “KinaseActivity” class
- class kstar.calculate.KinaseActivity(evidence, logger, data_columns=None, phospho_type='Y')
Kinase Activity calculates the estimated activity of kinases given an experiment using the hypergeometric distribution. The hypergeometric distribution examines the number of protein sites found to be active in evidence compared to the number of protein sites attributed to a kinase on a provided network.
- Parameters:
- evidence: pandas df
a dataframe that contains (at minimum, but can have more) data columns as evidence to use in analysis and KSTAR_ACCESSION and KSTAR_SITE
- data_columns: list
list of the columns containing the abundance values, which will be used to determine which sites will be used as evidence for activity prediction in each sample
- logger: Logger object
keeps track of kstar analysis, including any errors that occur
- phospho_type: string, either ‘Y’ or ‘ST’
indicates the phospho modification of interest
- Attributes:
Upon Initialization:
- evidence: pandas dataframe
inputted evidence column
- data_columns: list
list of columns containing abundance values, which will be used to determine which sites will be used as evidence. If the inputted data_columns parameter was None, this list includes every column in evidence prefixed by ‘data:’
- logger: Logger object
keeps track of kstar analysis, including any errors that occur
- phospho_type: string
indicates the phospho modification of interest
- network_directory: string
directory where kinase substrate networks can be downloaded, as indicated in config.py
- normalized: bool
indicates whether normalization analysis has been performed
- aggregate: string
the type of aggregation to use when determining binary evidence, either ‘count’ or ‘mean’. Default is ‘count’.
- threshold: float
cutoff to use when determining what sites to use for each experiment
- greater: bool
indicates whether sites with greater or lower abundances than the threshold will be used
- run_data: string
indicates the date that kinase activity object was initialized
After Hypergeometric Calculations:
- real_enrichment: pandas dataframe
p-values obtained for all pruned networks indicating statistical enrichment of a kinase’s substrates for each network, based on hypergeometric tests
- activities: pandas dataframe
median p-values obtained from the real_enrichment object for each experiment/kinase
- agg_activities: pandas dataframe
After Random Enrichment Calculation:
- random_experiments: pandas dataframe
contains information about the sites randomly sampled for each random experiment
- random_kinact: KinaseActivity object
KinaseActivity object containing random activities predicted from each of the random experiments
After Mann Whitney Analysis:
- activities_mann_whitney: pandas dataframe
p-values obtained from comparing the real distribution of p-values to the distribution of p-values from random datasets, based on the Mann Whitney U-test
- fpr_mann_whitney: pandas dataframe
false positive rates for predicted kinase activities
Methods
- add_network(network_id, network[, network_size]): Add network to be analyzed
- add_pregenerated_to_random_enrichment(): Combine pre-generated random activities with random enrichment, sort based on the "data" column, and reorganize the combined DataFrame based on the original column order in self.data_columns.
- aggregate_activities([activities]): Aggregate network activity using median for all activities
- calculate_Mann_Whitney_activities_sig(log[, ...]): For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used.
- calculate_kinase_activities([agg, ...]): Calculates combined activity of experiments, using a threshold value to determine if an experiment sees a site or not. To use values, use 'mean' as agg (mean aggregation drops NA values from consideration). To use counts, use 'count' as agg (a site is present if not NA).
- calculate_random_activities(logger[, ...]): Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.
- check_data_columns(): Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence.
- create_binary_evidence([agg, threshold, ...]): Returns a binary evidence data frame according to the parameters passed in for method for aggregating duplicates and considering whether a site is included as evidence or not
- find_pvalue_limits(data_columns[, agg, ...]): For each data column and network find the lowest p-value achievable and how many seen sites are required to get to that limit. Assumption: kinase size in network is the same for all kinases.
- getFilteredCompendia([selection_type]): Get phosphorylation sites binned based on selection type
- get_compendia_distribution(...[, selection_type]): Get the compendia distribution for each data column.
return date that kinase activities were run
- get_run_information_content(): Retrieve network information from RUN_INFORMATION.txt based on phospho_type.
- load_pregenerated_random_activities(...): Load pre-generated random activities for the given datasets.
- network_check_for_pregeneration(): Check if the network hash matches a pre-generated network in pregen_experiments and verifies RUN_INFORMATION.txt within the hash subdirectory.
- parse_network_information(file_path): Parse the RUN_INFORMATION.txt file and extract its data.
- save_new_precomputed_random_enrichment(activities_list_df, col): Save the new precomputed random enrichment activities to a file.
- set_data_columns([data_columns]): Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, then set data_columns to be all columns that start with data:
- set_evidence(evidence): Evidence to use in analysis
- summarize_activities([activities, method, ...]): Builds a single combined dataframe from the provided activities such that each piece of evidence is given a single column.
- test_threshold(threshold[, agg, greater, ...]): Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).
- add_networks_batch
- check_data_columns()
Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence. Removes all columns that do not meet criteria.
- set_data_columns(data_columns=None)
Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, then set data_columns to be all columns that start with data:
Checks all set columns to make sure columns are valid after filtering evidence
- test_threshold(threshold, agg='mean', greater=True, plot=False, return_evidence_sizes=False)
Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).
- Parameters:
- threshold: float
cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.
- agg: str
how to combine sites with multiple instances in experiment
- greater: bool
whether to use sites greater (True) or less (False) than the threshold
- plot: bool
whether to plot a histogram of the evidence sizes used
- return_evidence_sizes: bool
indicates whether to return the evidence sizes for all samples or not
- Returns:
- Outputs the minimum, maximum, and median evidence sizes across all samples. May return evidence sizes of all samples as pandas series
- parse_network_information(file_path)
Parse the RUN_INFORMATION.txt file and extract its data.
- Args:
file_path (str): Path to the RUN_INFORMATION.txt file.
- Returns:
dict: A dictionary containing the parsed data.
- network_check_for_pregeneration()
Check if the network hash matches a pre-generated network in pregen_experiments and verifies RUN_INFORMATION.txt within the hash subdirectory.
- Returns:
bool: True if the data matches, False otherwise.
- get_compendia_distribution(with_pregenerated_evidence, data_columns, selection_type='KSTAR_NUM_COMPENDIA_CLASS')
Get the compendia distribution for each data column.
- Parameters:
- with_pregenerated_evidence: pandas DataFrame
KSTAR mapped experimental dataframe that has been binarized by kstar_activity generation.
- data_columns: list
Columns that represent experimental results.
- selection_type: str, optional
The type of compendia selection, by default ‘KSTAR_NUM_COMPENDIA_CLASS’.
- Returns:
- dict
Dictionary containing the compendia distribution for each data column.
- calculate_random_activities(logger, num_random_experiments=150, use_pregen_data=None, save_new_precompute=None, pregenerated_experiments_path=None, directory_for_save_precompute=None, network_hash=None, save_random_experiments=None, PROCESSES=1)
Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.
- Parameters:
- logger: Logger object
Logger to record the progress and any issues during the randomization pipeline.
- num_random_experiments: int, optional
Number of random experiments to generate, by default 150.
- use_pregen_data: bool, optional
Whether to use pre-generated data, by default None.
- save_new_precompute: bool, optional
Whether to save new precomputed data, by default None.
- pregenerated_experiments_path: str, optional
Path to the directory containing pre-generated experiments, by default None.
- directory_for_save_precompute: str, optional
Directory to save new precomputed data, by default None.
- network_hash: str, optional
Hash of the network used, by default None.
- save_random_experiments: bool, optional
Whether to save the generated random experiments, by default None.
- PROCESSES: int, optional
Number of processes to use for parallel computation, by default 1.
- Returns:
- None
- load_pregenerated_random_activities(with_pregenerated_evidence, with_pregenerated, pregen_activities_list)
Load pre-generated random activities for the given datasets.
This function processes datasets that have pre-generated random experiments. It identifies the appropriate pre-generated file based on the size of the dataset and appends the activities to the provided list.
- Parameters:
- with_pregenerated_evidence: pandas.DataFrame
DataFrame containing the evidence for the datasets with pre-generated random experiments.
- with_pregenerated: list
List of dataset names that have pre-generated random experiments.
- pregen_activities_list: list
List to which the concatenated activities of each dataset will be appended.
- Returns:
- None
- add_pregenerated_to_random_enrichment()
Combine pre-generated random activities with random enrichment, sort based on the “data” column, and reorganize the combined DataFrame based on the original column order in self.data_columns.
If use_pregen_data is True and data_columns_from_scratch is None, uses only pre-generated activities. If use_pregen_data is True and data_columns_from_scratch exists, combines both pre-generated and newly calculated random activities. If use_pregen_data is False, uses only newly calculated random activities.
- Returns:
- None
Updates self.random_enrichment with the combined and sorted activities
- save_new_precomputed_random_enrichment(activities_list_df, col)
Save the new precomputed random enrichment activities to a file.
This function saves the provided DataFrame of random enrichment activities to a file, using the specified column name.
- Parameters:
- activities_list_df: pandas.DataFrame
DataFrame containing the random enrichment activities to be saved.
- col: str
Column name to be used for saving the activities.
- Returns:
- None
- get_run_information_content()
Retrieve network information from RUN_INFORMATION.txt based on phospho_type.
Reads the RUN_INFORMATION.txt file from the appropriate network directory based on the phospho_type (‘Y’ or ‘ST’). The file contains network configuration details including unique ID, date, network specifications, and compendia counts.
- Returns:
- str
Contents of RUN_INFORMATION.txt if found. ‘RUN_INFORMATION.txt file not found.’ if the file doesn’t exist.
- Raises:
- ValueError
If phospho_type is not ‘Y’ or ‘ST’.
- add_network(network_id, network, network_size=None)
Add network to be analyzed
- Parameters:
- network_id: str
name of the network
- network: pandas DataFrame
network with columns substrate_id, site, kinase_id
- set_evidence(evidence)
Evidence to use in analysis
- Parameters:
- evidence: pandas DataFrame
substrate sites with activity seen. Columns (dict for column mapping):
substrate: Uniprot ID (P12345)
site: phosphorylation site (Y123)
- create_binary_evidence(agg='mean', threshold=1.0, evidence_size=None, greater=True)
Returns a binary evidence data frame according to the parameters passed in: the method for aggregating duplicates and the rule for whether a site is included as evidence or not
- Parameters:
- threshold: float
threshold value used to filter rows
- evidence_size: None or int
the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample.
- agg: {‘count’, ‘mean’}
method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and counts the non-NaN values. ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide.
NA values are dropped from consideration.
- greater: Boolean
whether to keep sites that have a numerical value >=threshold (True, default) or <=threshold (False)
- Returns:
- evidence_binary: pd.DataFrame
Matches the evidence dataframe of the kinact object, but with 0 or 1 if a site is included or not. This is uniquified and rows that are never used are removed.
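The interaction of agg, threshold, and greater for a single sample can be sketched in plain Python. This is a simplified illustration, not KSTAR's implementation: the function name and the tuple-based input are assumptions, the evidence_size option is omitted, and sites scoring 0 are kept here for clarity.

```python
from statistics import mean


def binary_evidence_sketch(rows, agg="mean", threshold=1.0, greater=True):
    """rows: iterable of (accession, site, value); value may be None if missing."""
    if agg not in ("mean", "count"):
        raise ValueError("agg must be 'mean' or 'count'")
    grouped = {}
    for accession, site, value in rows:
        if value is not None:  # missing values are dropped from consideration
            grouped.setdefault((accession, site), []).append(value)
    binary = {}
    for key, values in grouped.items():
        # Collapse duplicate measurements of the same site into one number.
        combined = mean(values) if agg == "mean" else len(values)
        keep = combined >= threshold if greater else combined <= threshold
        binary[key] = int(keep)
    return binary


# Toy measurements for one sample (made up)
rows = [("P1", "Y10", 2.0), ("P1", "Y10", 1.0), ("P2", "S5", 0.4), ("P3", "T7", None)]
by_mean = binary_evidence_sketch(rows, agg="mean", threshold=1.0)
by_count = binary_evidence_sketch(rows, agg="count", threshold=1.0)
```

The site measured at 0.4 fails the mean-based threshold but passes the count-based one, since ‘count’ only asks whether a non-missing value exists.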
- calculate_kinase_activities(agg='mean', threshold=1.0, evidence_size=None, greater=True, PROCESSES=1)
Calculates combined activity of experiments, using a threshold value to determine if an experiment sees a site or not.
To use values, use ‘mean’ as agg; mean aggregation drops NA values from consideration.
To use counts, use ‘count’ as agg; a site is present if not NA.
- Parameters:
- data_columns: list
columns that represent experimental result, if None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list, if seeking to calculate on fewer than all available data columns
- threshold: float
threshold value used to filter rows
- agg: {‘count’, ‘mean’}
method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and counts the non-NaN values. ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide.
NA values are dropped from consideration.
- greater: Boolean
whether to keep sites that have a numerical value >=threshold (True, default) or <=threshold (False)
- Returns:
- activities: dict
key: experiment; value: pd DataFrame with columns:
network: network name, from networks key
kinase: kinase examined
frequency: number of times kinase was seen in subgraph of evidence and network
kinase_activity: hypergeometric kinase activity
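The hypergeometric test behind kinase_activity can be written out with the standard library. The sketch below computes the upper-tail probability P(X >= k); the argument names and the mapping to network quantities (M = sites in the network, n = sites connected to the kinase, N = evidence size, k = overlap) are an interpretation for illustration, not KSTAR's code.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when drawing N sites from M, of which n belong to the kinase."""
    upper = min(n, N)
    total = comb(M, N)
    # Sum the probability mass for every overlap of size k or larger.
    favorable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favorable / total


# Toy numbers: 10 sites in the network, 4 linked to the kinase,
# 3 sites observed as evidence, 2 of them linked to the kinase.
p_value = hypergeom_upper_tail(k=2, M=10, n=4, N=3)
```

If SciPy is available, `scipy.stats.hypergeom.sf(k - 1, M, n, N)` should give the same quantity.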
- summarize_activities(activities=None, method='median_activity', normalized=False)
Builds a single combined dataframe from the provided activities such that each piece of evidence is given a single column. Values are based on the method selected. The method must be a column in the activities.
- Parameters:
- activities: dict
hypergeometric activities that have previously been summarized by network. key: experiment name; value: hypergeometric activity
- method: str
The column in the hypergeometric activity to use for summarizing data
- Returns:
- activity_summary: pandas DataFrame
- aggregate_activities(activities=None)
Aggregate network activity using median for all activities
- Parameters:
- activities: dict
key: experiment; value: kinase activity result
- Returns:
- summaries: dict
key: experiment; value: summarized kinase activities across networks
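Median aggregation across pruned networks can be sketched for one experiment as follows. The nested-dict layout, the function name, and the kinase names and p-values are assumptions made for illustration.

```python
from statistics import median


def aggregate_by_median(network_pvalues):
    """network_pvalues: dict network_id -> dict kinase -> p-value."""
    per_kinase = {}
    for pvalues in network_pvalues.values():
        for kinase, p in pvalues.items():
            per_kinase.setdefault(kinase, []).append(p)
    # One summary value per kinase: the median over all networks.
    return {kinase: median(ps) for kinase, ps in per_kinase.items()}


# Toy p-values from three pruned networks (made up)
toy_networks = {
    "net_0": {"ABL1": 0.01, "SRC": 0.5},
    "net_1": {"ABL1": 0.03, "SRC": 0.2},
    "net_2": {"ABL1": 0.02, "SRC": 0.9},
}
summary = aggregate_by_median(toy_networks)
```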
- find_pvalue_limits(data_columns, agg='count', threshold=1.0)
For each data column and network find the lowest p-value achievable and how many seen sites are required to get to that limit.
Assumption: kinase size in network is the same for all kinases.
- Parameters:
- data_columns: list
what columns in evidence to compare
- agg: str
aggregate function - what function to use for determining if site is present.
count: use when using activity_count; mean: use when using activity_threshold
- threshold: float
threshold to use in determining if site present in evidence
- Returns:
- all_limits: pandas DataFrame
p-value limits of each column for each network. Columns:
evidence: evidence data column
network: network being compared
kinase: kinase being evaluated
evidence_size: size of evidence
limit_size: number of sites to get non-zero p-value
p-value: p-value generated
- limit_summary: pandas DataFrame
summary of all_limits, obtained by averaging by evidence
- calculate_Mann_Whitney_activities_sig(log, number_sig_trials=100, PROCESSES=1)
For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a pvalue, given observations of a random bootstrapping analysis
- Parameters:
- kinact_dict: dictionary
A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’
- log: logger
Logger for logging activity messages
- phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}
Which substrate/kinase-type to run activity for: Both [‘Y’, ‘ST’] (default), Tyrosine [‘Y’], or Serine/Threonine [‘ST’]
- number_sig_trials: int
Maximum number of significant trials to run
- Returns:
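The comparison of real and random p-value distributions can be illustrated with a hand-rolled Mann-Whitney U statistic and a normal approximation. This is a sketch under stated assumptions, not KSTAR's procedure: the function names are hypothetical, U is defined here as the number of (real, random) pairs in which the real p-value is smaller, no tie or continuity correction is applied, and the empirical false positive rate shown is one plausible reading of "false positive rate for a pvalue".

```python
from math import erfc, sqrt


def u_real_below_random(real_ps, random_ps):
    """Count pairs where the real p-value is smaller; ties count one half."""
    u = 0.0
    for r in real_ps:
        for q in random_ps:
            if r < q:
                u += 1.0
            elif r == q:
                u += 0.5
    return u


def one_sided_pvalue(u, n_real, n_random):
    """Normal approximation for 'real p-values tend to be smaller than random'."""
    mean_u = n_real * n_random / 2
    sd_u = sqrt(n_real * n_random * (n_real + n_random + 1) / 12)
    z = (u - mean_u) / sd_u
    return 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal


def empirical_fpr(observed_p, random_trial_ps):
    """Fraction of random trials with a p-value at least as small as observed."""
    return sum(p <= observed_p for p in random_trial_ps) / len(random_trial_ps)


# Toy p-values across networks (made up)
real = [0.01, 0.02, 0.03]
random_ = [0.2, 0.4, 0.6, 0.8]
u_stat = u_real_below_random(real, random_)
p_mw = one_sided_pvalue(u_stat, len(real), len(random_))
fpr = empirical_fpr(p_mw, [0.5, 0.01, 0.3, 0.9])
```

For real analyses, `scipy.stats.mannwhitneyu` provides exact and tie-corrected p-values and is the safer choice over this approximation.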
The “DotPlot” class
- class kstar.plot.DotPlot(values, fpr, alpha=0.05, inclusive_alpha=True, binary_sig=True, dotsize=5, colormap={0: '#6b838f', 1: '#FF3300'}, facecolor='white', labelmap=None, legend_title='p-value', size_number=5, size_color='gray', color_title='Significant', markersize=10, legend_distance=1.0, figsize=(20, 4), title=None, xlabel=True, ylabel=True, x_label_dict=None, kinase_dict=None)
The DotPlot class is used for plotting dotplots, with the option to add clustering and context plots. The size of the dots is based on the values dataframe, where the area of each dot is the value * dotsize.
- Parameters:
- values: pandas DataFrame instance
values to plot
- fpr: pandas DataFrame instance
false positive rates associated with values being plotted
- alpha: float, optional
fpr value that defines the significance cutoff to use when plotting. default: 0.05
- inclusive_alpha: boolean
whether to include the alpha (significance <= alpha), or not (significance < alpha). default: True
- binary_sig: boolean, optional
indicates whether to plot fpr with binary significance or as a change in color hue. default: True
- dotsize: float, optional
multiplier to use for scaling size of dots
- colormap: dict, optional
maps color values to actual color to use in plotting. default: {0: ‘#6b838f’, 1: ‘#FF3300’}
- labelmap: dict, optional
maps labels of colors, default is to indicate FPR cutoff in legend. default: None
- facecolor: color, optional
Background color of dotplot. default: ‘white’
- legend_title: str, optional
Legend title for dot sizes, default is ‘p-value’
- size_number: int, optional
Number of dots to attempt to generate for dot size legend
- size_color: color, optional
Color to use for the size legend
- color_title: str, optional
Legend title for the color legend
- markersize: int, optional
Size of dots for the color legend
- legend_distance: int, optional
relative distance to place legends
- figsize: tuple, optional
size of dotplot figure
- title: str, optional
Title of dotplot
- xlabel: bool, optional
Show xlabel on graph if True
- ylabel: bool, optional
Show ylabel on graph if True
- x_label_dict: dict, optional
Mapping dictionary of labels as they appear in values dataframe (keys) to how they should appear on plot (values)
- kinase_dict: dict, optional
Mapping dictionary of kinase names as they appear in values dataframe (keys) to how they should appear on plot (values)
- Attributes:
- values: pandas dataframe
a copy of the original values dataframe
- fpr: pandas dataframe
a copy of the original fpr dataframe
- alpha: float
cutoff used for significance, default 0.05
- inclusive_alpha: boolean
whether to include the alpha (significance <= alpha), or not (significance < alpha)
- significance: pandas dataframe
indicates whether a particular kinase's activity is significant, where fpr <= alpha is significant, otherwise it is insignificant
- colors: pandas dataframe
dataframe indicating the color to use when plotting: either a copy of the fpr or significance dataframe
- binary_sig: boolean
indicates whether coloring will be done based on binary significance or fpr values. Default True
- labelmap: dict
indicates how to label each significance color
- figsize: tuple
size of the outputted figure, which is overridden if axes is provided for dotplot
- title: string
title of the dotplot
- xlabel: boolean
indicates whether to plot x-axis labels
- ylabel: boolean
indicates whether to plot y-axis labels
- colormap: dict
colors to be used when plotting
- facecolor: string
background color of dotplot
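The relationship between fpr, alpha, inclusive_alpha, colormap and dotsize described above can be illustrated with a small standalone sketch. It does not use the DotPlot class itself; plain dictionaries stand in for the dataframes, the helper names (significance_mask, dot_area) and the kinase values are hypothetical, and the real class may compute these quantities differently.

```python
def significance_mask(fpr, alpha=0.05, inclusive_alpha=True):
    """Return 1 where an FPR value counts as significant, else 0.

    Hypothetical helper: mirrors the documented rule that significance is
    fpr <= alpha when inclusive_alpha is True and fpr < alpha otherwise.
    """
    if inclusive_alpha:
        return {kinase: int(value <= alpha) for kinase, value in fpr.items()}
    return {kinase: int(value < alpha) for kinase, value in fpr.items()}


def dot_area(value, dotsize=5):
    """Dot area is the plotted value multiplied by the dotsize multiplier."""
    return value * dotsize


# Made-up false positive rates for three kinases.
fpr = {"ABL1": 0.05, "SRC": 0.20, "EGFR": 0.01}

# Default colormap from the DotPlot signature: 0 = not significant, 1 = significant.
colormap = {0: "#6b838f", 1: "#FF3300"}

sig = significance_mask(fpr, alpha=0.05, inclusive_alpha=True)
colors = {kinase: colormap[flag] for kinase, flag in sig.items()}
```

With inclusive_alpha=False, a kinase whose fpr equals alpha exactly (ABL1 in this made-up data) would no longer be marked significant.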
Methods
cluster(ax[, method, metric, orientation, ...]): Performs hierarchical clustering on data and plots result to provided Axes.
context(ax, info, id_column, context_columns): Context plot is generated and returned.
dotplot([ax, orientation, size_legend, ...]): Generates the dotplot, where size is determined by the values dataframe and color is determined by the significance dataframe.
drop_kinases(kinase_list): Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object.
drop_kinases_with_no_significance(): Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant.
evidence_count(ax, binary_evidence[, ...]): Add bars to dotplot indicating the total number of sites used as evidence in activity calculation.
- dotplot(ax=None, orientation='left', size_legend=True, color_legend=True, max_size=None)[source]¶
Generates the dotplot, where size is determined by the values dataframe and color is determined by the significance dataframe
- Parameters:
- ax: matplotlib Axes instance, optional
axes the dotplot will be plotted on. If None, a new plot is generated
- cluster(ax, method='single', metric='euclidean', orientation='top', color_threshold=-inf)[source]¶
Performs hierarchical clustering on data and plots result to provided Axes. result and significant dataframes are ordered according to clustering
- Parameters:
- ax: matplotlib Axes instance
Axes to plot dendrogram to
- method: str, optional
The linkage algorithm to use.
- metric: str or function, optional
The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.
- orientation: str, optional
The direction to plot the dendrogram, which can be any of the following strings: ‘top’: plots the root at the top, with descendant links going downwards (default); ‘bottom’: plots the root at the bottom, with descendant links going upwards; ‘left’: plots the root at the left, with descendant links going right; ‘right’: plots the root at the right, with descendant links going left.
- drop_kinases_with_no_significance()[source]¶
Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant
- drop_kinases(kinase_list)[source]¶
Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object. Removal is in place
- Parameters:
- kinase_list: list
list of kinase names to remove
- context(ax, info, id_column, context_columns, dotsize=200, markersize=20, orientation='left', color_palette='colorblind', margin=0.2, make_legend=True)[source]¶
Context plot is generated and returned. The context plot contains the categorical data used for describing the data.
- Parameters:
- ax: matplotlib axis
where to map subtype information to
- info: pandas df
Dataframe where context information is pulled from
- id_column: str
Column used to map the subtype information to
- context_columns: list
list of columns to pull context information from
- dotsize: int, optional
size of context dots
- markersize: int, optional
size of legend markers
- orientation: str, optional
orientation to plot context plots to; determines where legends are placed. options: left, right, top, bottom
- color_palette: str, optional
seaborn color palette to use
- margin: float, optional
margin
- make_legend: bool, optional
whether to create legend for context colors
- evidence_count(ax, binary_evidence, plot_type='bars', phospho_type=None, dot_size=1, include_recommendations=True, ideal_min=None, recommended_min=None, dot_colors=None, bar_line_colors=None)[source]¶
Add bars to dotplot indicating the total number of sites used as evidence in activity calculation
- Parameters:
- ax: axes object
where to plot the bars
- binary_evidence: pandas dataframe
binarized dataframe produced during activity calculation (threshold applied to original experiment)
Supporting Functions¶
Master Functions for Running KSTAR Pipeline¶
- kstar.calculate.enrichment_analysis(experiment, log, networks, phospho_types=['Y', 'ST'], data_columns=None, agg='mean', threshold=1.0, evidence_size=None, greater=True, PROCESSES=1)[source]¶
Function to establish a kstar KinaseActivity object from an experiment with an activity log, add the networks, then calculate, aggregate, and summarize the hypergeometric enrichment into a final activity object. Should be followed by randomized_analysis, then Mann_Whitney_analysis.
- Parameters:
- experiment: pandas df
experiment dataframe that has been mapped, includes KSTAR_SITE, KSTAR_ACCESSION, etc.
- log: logger object
Log to write activity log errors and updates to
- networks: dictionary of dictionaries
Outer dictionary keys are ‘Y’ and ‘ST’. Establish a network by loading a pickle of desired networks. See the helpers and config file for this. If downloaded from FigShare, then the GLOBAL network pickles in the config file can be loaded. For example: networks['Y'] = pickle.load(open(config.NETWORK_Y_PICKLE, "rb"))
- phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}
Which substrate/kinase-type to run activity for: Both [‘Y’, ‘ST’] (default), Tyrosine [‘Y’], or Serine/Threonine [‘ST’]
- data_columns: list
columns that represent experimental results. If None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns
- agg: {‘count’, ‘mean’}
method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and adds if values are non-NaN; ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide.
NA values are dropped from consideration.
- threshold: float
threshold value used to filter rows
- greater: Boolean
whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)
- Returns:
- kinactDict: dictionary of Kinase Activity Objects
Outer keys are the phosphoTypes run, ‘Y’ and ‘ST’. Includes the activities dictionary (see calculate_kinase_activities), aggregation of activities across networks (see aggregate activities), and activity summary (see summarize_activities).
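As a rough illustration of the threshold, greater and hypergeometric enrichment ideas described for enrichment_analysis, the sketch below uses only the Python standard library. It is a generic hypergeometric tail test, not the kstar implementation: the helper names (binarize, hypergeom_sf), the site labels and the numbers are all hypothetical, and the exact parametrisation used inside kstar may differ.

```python
from math import comb


def binarize(values, threshold=1.0, greater=True):
    """Mark each site as evidence (True) if it passes the threshold.

    greater=True keeps values >= threshold; greater=False keeps values <= threshold.
    """
    if greater:
        return {site: value >= threshold for site, value in values.items()}
    return {site: value <= threshold for site, value in values.items()}


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for a hypergeometric draw.

    M: total number of sites in the background
    n: number of those sites connected to the kinase in the network
    N: number of sites observed as evidence
    k: number of evidence sites that are connected to the kinase
    """
    upper = min(n, N)
    total = comb(M, N)
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return tail / total


# Made-up quantification values for three sites in one data column.
values = {"P1_Y10": 2.5, "P2_Y20": 0.4, "P3_Y30": 1.0}
evidence = binarize(values, threshold=1.0, greater=True)

# Made-up counts: 10 background sites, 4 kinase substrates, 3 evidence sites, 2 overlapping.
p_value = hypergeom_sf(k=2, M=10, n=4, N=3)
```

A small p_value here would suggest that the evidence sites overlap the kinase's substrates more often than expected by chance.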
- kstar.calculate.randomized_analysis(kinact_dict, log, num_random_experiments=150, use_pregen_data=False, save_new_precompute=False, pregenerated_experiments_path=None, directory_for_save_precompute=None, network_hash=None, save_random_experiments=None, PROCESSES=1)[source]¶
Perform randomized analysis on kinase activity data.
- Parameters:
- kinact_dict: dict
Dictionary containing kinase activity data.
- log: Logger object
Logger to record the progress and any issues during the randomization pipeline.
- num_random_experiments: int, optional
Number of random experiments to generate, by default 150.
- use_pregen_data: bool, optional
Whether to use pre-generated data, by default False.
- save_new_precompute: bool, optional
Whether to save new precomputed data, by default False.
- pregenerated_experiments_path: str, optional
Path to the directory containing pre-generated experiments, by default None.
- directory_for_save_precompute: str, optional
Directory to save new precomputed data, by default None.
- network_hash: str, optional
Hash of the network used, by default None.
- save_random_experiments: bool, optional
Whether to save the generated random experiments, by default None.
- PROCESSES: int, optional
Number of processes to use for parallel computation, by default 1.
- Returns:
- None
- kstar.calculate.Mann_Whitney_analysis(kinact_dict, log, number_sig_trials=100, PROCESSES=1)[source]¶
For a kinact_dict where random generation and activity have already been run for the phospho_types of interest, this calculates the Mann-Whitney U test comparing the array of p-values for real data to those of random data, across the number of networks used. It also calculates the false positive rate for a p-value, given observations from a random bootstrapping analysis.
- Parameters:
- kinact_dict: dictionary
A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’
- log: logger
Logger for logging activity messages
- number_sig_trials: int
Maximum number of significant trials to run
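The two quantities mentioned in this description, a Mann-Whitney U comparison of real versus random p-values and a false positive rate estimated from random experiments, can be sketched in plain Python as follows. This is a generic illustration under stated assumptions, not the kstar code: the function names and the p-values are hypothetical, the U statistic here counts pairs in which the real value is smaller, and the FPR is taken as the fraction of random p-values at or below an observed one.

```python
def mann_whitney_u(real, random):
    """U statistic counting (real, random) pairs where the real value is smaller.

    Ties contribute half a count. Other conventions for U exist.
    """
    u = 0.0
    for a in real:
        for b in random:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def empirical_fpr(observed_p, random_ps):
    """Fraction of random-experiment p-values at or below the observed p-value."""
    hits = sum(1 for p in random_ps if p <= observed_p)
    return hits / len(random_ps)


# Made-up p-values for one kinase across networks (real) and from random experiments.
real_ps = [0.01, 0.02, 0.03]
random_ps = [0.2, 0.5, 0.02, 0.9]

u_stat = mann_whitney_u(real_ps, random_ps)
fpr = empirical_fpr(0.05, random_ps)
```

A U statistic close to len(real_ps) * len(random_ps) indicates that the real p-values are consistently smaller than the random ones.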
Functions for Saving and Loading KSTAR results¶
- kstar.calculate.save_kstar(kinact_dict, name, odir, PICKLE=True)[source]¶
Having performed kinase activities (run_kstar_analysis), save each of the important dataframes to files, along with the final pickle. Saves activities, aggregated_activities, and summarized_activities as tab-separated files, and saves a pickle file of the dictionary.
- Parameters:
- kinact_dict: dictionary of Kinase Activity Objects
Outer keys are the phosphoTypes run, ‘Y’ and ‘ST’. Includes the activities dictionary (see calculate_kinase_activities), aggregation of activities across networks (see aggregate activities), and activity summary (see summarize_activities).
- name: string
The name to use when saving activities
- odir: string
Output directory to save files and pickle to
- PICKLE: boolean
Whether to save the entire pickle file
- Returns:
- Nothing
- kstar.calculate.save_kstar_slim(kinact_dict, name, odir)[source]¶
Having performed kinase activities (run_kstar_analysis), save each of the important dataframes, minimizing the memory storage needed to get back to a rebuilt version for plotting results and analysis. For each phospho_type in the kinact_dict, this will save three .tsv files for every activities analysis run, two additional if random analysis was run, and two more if Mann-Whitney based analysis was run. It also creates a readme file of the parameter values used.
- Parameters:
- kinact_dict: dictionary of Kinase Activity Objects
Outer keys are the phosphoTypes run, ‘Y’ and ‘ST’. Includes the activities dictionary (see calculate_kinase_activities), aggregation of activities across networks (see aggregate activities), and activity summary (see summarize_activities).
- name: string
The name to use when saving activities
- odir: string
Output directory to save files and pickle to
- Returns:
- Nothing
- kstar.calculate.from_kstar_slim(name, odir, log)[source]¶
Given the name and output directory of a saved kstar analysis, load the parameters and minimum dataframes needed for reinstantiating a kinact object. This minimum list will allow you to repeat normalization or Mann-Whitney at a different false positive rate threshold and plot results.
- Parameters:
- name: string
The name used when saving activities and mapped data
- odir: string
Output directory of saved files and parameter pickle
- log: logger
Logger for logging activity messages
- kstar.calculate.from_kstar_nextflow(name, odir, log=None)[source]¶
Given the name and output directory of a saved kstar analysis from the nextflow pipeline, load the results into a new kinact object with the minimum dataframes required for analysis (binary experiment, hypergeometric activities, normalized activities, Mann-Whitney activities)
- Parameters:
- name: string
The name used when saving activities and mapped data
- odir: string
Output directory of saved files
- log: logger
logger used when loading nextflow data into kinase activity object. If not provided, new logger will be created.
Other Helper Functions¶
- kstar.helpers.process_fasta_file(fasta_file)[source]¶
Used during configuration to convert the global FASTA sequence file into a sequence dictionary that can be used in mapping
- Parameters:
- fasta_file: str
file location of fasta file
- Returns:
- sequences: dict
{acc : sequence} dictionary generated from fasta file
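To make the {acc : sequence} return shape concrete, here is a minimal standalone FASTA parser. It is a sketch rather than the kstar.helpers implementation: the function name parse_fasta is hypothetical, it assumes UniProt-style headers of the form ">db|ACCESSION|NAME ..." (falling back to the first word of the header otherwise), and the example sequences are made up.

```python
def parse_fasta(lines):
    """Build an {accession: sequence} dict from an iterable of FASTA lines."""
    sequences = {}
    acc = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            parts = header.split("|")
            fields = header.split()
            # Assumption: UniProt-style header, accession is the second '|' field.
            if len(parts) >= 3:
                acc = parts[1]
            else:
                acc = fields[0] if fields else header
            sequences[acc] = ""
        elif acc is not None:
            # Sequence lines may be wrapped, so concatenate them.
            sequences[acc] += line
    return sequences


# Made-up records; the second accession is purely illustrative.
example_lines = [
    ">sp|P00533|EGFR_HUMAN Epidermal growth factor receptor",
    "MRPSG",
    "TAGAA",
    ">Q9XYZ1 hypothetical entry",
    "MKV",
]
sequences = parse_fasta(example_lines)
```

To parse a file on disk, the same function can be given an open file handle, since iterating over a file yields its lines.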
- kstar.helpers.get_logger(name, filename)[source]¶
Finds and returns logger if it exists. Creates new logger if log file does not exist
- Parameters:
- name: str
log name
- filename: str
location to store log file
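The find-or-create behaviour described here resembles a common pattern with the standard logging module, sketched below. This is not the kstar.helpers source: the function name get_logger_sketch, the log format and the INFO level are assumptions, and the sketch decides whether to attach a file handler by checking the logger's existing handlers rather than checking whether the log file exists.

```python
import logging


def get_logger_sketch(name, filename):
    """Return the logger called `name`, attaching a file handler only once."""
    logger = logging.getLogger(name)  # returns the same object for the same name
    if not logger.handlers:
        handler = logging.FileHandler(filename)
        # Assumed format; the real helper may use a different one.
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Calling the function twice with the same name returns the same logger object, so messages are not duplicated in the log file.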
- kstar.helpers.string_to_boolean(string)[source]¶
Converts string to boolean
- Parameters:
- string: str
input string
- Returns:
- result: bool
output boolean
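The documentation does not state which strings are accepted, so the following is only a plausible sketch of a string-to-boolean conversion. The function name, the sets of accepted spellings and the choice to raise ValueError on unrecognised input are all assumptions and may not match kstar.helpers.string_to_boolean.

```python
def string_to_boolean_sketch(string):
    """Convert a string such as 'True' or 'no' to a bool (case-insensitive)."""
    # Assumed spellings; the real helper may accept a different set.
    truthy = {"true", "t", "yes", "y", "1"}
    falsy = {"false", "f", "no", "n", "0"}
    normalized = string.strip().lower()
    if normalized in truthy:
        return True
    if normalized in falsy:
        return False
    raise ValueError(f"Cannot interpret {string!r} as a boolean")
```

Raising on unknown input avoids silently treating a typo as False, which matters when the value comes from a command line or a configuration file.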
- kstar.helpers.convert_acc_to_uniprot(df, acc_col_name, acc_col_type, acc_uni_name)[source]¶
Given an experimental dataframe (df) with an accession column (acc_col_name) that is not UniProt, use UniProt to append an accession column of UniProt IDs
- Parameters:
- df: pandas.DataFrame
Dataframe with at least a column of accession of interest
- acc_col_name: string
name of column to convert FROM
- acc_col_type: string
Uniprot string designation of the accession type to convert FROM, see https://www.uniprot.org/help/api_idmapping
- acc_uni_name: string
name of new column
- Returns:
- appended_df: pandas.DataFrame
Input dataframe with an appended acc column of uniprot IDs