KSTAR

The “Config” Module

kstar.config.create_network_pickles(phosphoTypes=['Y', 'ST'], network_directory='./NETWORKS/NetworKIN')[source]

Given network files declared in globals, create pickles of the kstar object that can then be quickly loaded in analysis. Assumes that the network structure has two folders, Y and ST, under the NETWORK_DIR global variable and that all .csv files in those directories should be loaded into a network pickle.

kstar.config.install_resource_files()[source]

Retrieves the RESOURCE_FILES that accompany this version release from FigShare and unzips them to the correct directory for resource files.

kstar.config.update_network_directory(directory, create_pickles=True, KSTAR_DIR='/home/srcrowl/miniconda3/envs/documentation/lib/python3.10/site-packages', NETWORK_DIR='./NETWORKS/NetworKIN')[source]

Update the location of the network files, and verify that all necessary files are located in directory.

Parameters
directory: string

path to where network files are located

The “Prune” Module

The “Pruner” Class

class kstar.prune.Pruner(network, logger, phospho_type='Y')[source]

Pruning Algorithm used for KSTAR.

Parameters
network: pandas df

kinase-site prediction network where there is an accession, site, kinase, and score column

logger

logger used for pruning

phospho_type: str

phospho_type(s) to use when building pruned networks

columns: dict

relevant columns in network

Methods

build_multiple_compendia_networks(...[, ...])

Builds multiple compendia-limited networks

build_pruned_network(network, kinase_size, ...)

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

calculate_compendia_sizes(kinase_size)

Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia

compendia_pruned_network(compendia_sizes, ...)

Builds a compendia-pruned network that takes into account compendia size limits per kinase

build_multiple_compendia_networks(kinase_size, site_limit, num_networks, network_id, odir, PROCESSES=1)[source]

Builds multiple compendia-limited networks

Parameters
kinase_size: int

number of sites each kinase should connect to

site_limit: int

upper limit of number of kinases a site can connect to

num_networks: int

number of networks to build

network_id: str

id to use for each network in dictionary

Returns
pruned_networks: dict

key: <network_id>_<i>; value: pruned network

build_pruned_network(network, kinase_size, site_limit)[source]

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

Parameters
network: pandas DataFrame

network to build pruned network on

kinase_size: int

number of sites each kinase should connect to

site_limit: int

upper limit of number of kinases a site can connect to

Returns
pruned network: pandas DataFrame

subset of network that has been pruned
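The two constraints described above (a fixed number of sites per kinase, and a cap on kinases per site) can be illustrated with a minimal greedy sketch. This is not KSTAR's pruning heuristic, which may select sites differently; the function name `prune_network_sketch` and the `(site, kinase, score)` tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict

def prune_network_sketch(edges, kinase_size, site_limit):
    """Greedy illustration only: edges is a list of (site, kinase, score) tuples."""
    by_kinase = defaultdict(list)
    for site, kinase, score in edges:
        by_kinase[kinase].append((score, site))

    site_degree = defaultdict(int)  # number of kinases already attached to each site
    pruned = []
    for kinase in sorted(by_kinase):
        kept = 0
        # Visit this kinase's candidate sites from highest to lowest score.
        for score, site in sorted(by_kinase[kinase], reverse=True):
            if kept == kinase_size:
                break
            if site_degree[site] < site_limit:
                pruned.append((site, kinase, score))
                site_degree[site] += 1
                kept += 1
    return pruned

# Hypothetical toy network with two kinases competing for three sites.
edges = [
    ("P1_Y10", "SRC", 0.9), ("P2_Y20", "SRC", 0.8), ("P3_Y30", "SRC", 0.1),
    ("P1_Y10", "ABL1", 0.7), ("P2_Y20", "ABL1", 0.6), ("P3_Y30", "ABL1", 0.5),
]
pruned = prune_network_sketch(edges, kinase_size=2, site_limit=1)
```

With `site_limit=1`, a kinase processed later may end up with fewer than `kinase_size` sites; a greedy pass like this one is order-dependent, which is one reason a real heuristic may need to be more careful.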

calculate_compendia_sizes(kinase_size)[source]

Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia

Parameters
kinase_size: int

number of sites each kinase should connect to

Returns
sizes: dict

key: compendia size; value: number of sites each kinase should pull from the given compendia size
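Splitting `kinase_size` across compendia sizes in proportion to how often each size occurs can be sketched as follows. The rounding rule (largest remainder) and the name `allocate_by_ratio` are assumptions for illustration; the source does not state how KSTAR rounds the ratios.

```python
def allocate_by_ratio(compendia_counts, kinase_size):
    """Split kinase_size across compendia sizes in proportion to their counts."""
    total = sum(compendia_counts.values())
    exact = {k: kinase_size * n / total for k, n in compendia_counts.items()}
    sizes = {k: int(v) for k, v in exact.items()}  # round each share down first
    leftover = kinase_size - sum(sizes.values())
    # Hand the remaining slots to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda k: exact[k] - sizes[k], reverse=True)
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes

# Hypothetical counts of sites per compendia size.
sizes = allocate_by_ratio({0: 5, 1: 3, 2: 2}, kinase_size=7)
```

The returned dictionary always sums to `kinase_size`, which is the property a per-kinase allocation needs.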

compendia_pruned_network(compendia_sizes, site_limit, odir)[source]

Builds a compendia-pruned network that takes into account compendia size limits per kinase

Parameters
compendia_sizes: dict

key: compendia size; value: number of sites to connect to kinase

site_limit: int

upper limit of number of kinases a site can connect to

Returns
pruned_network: pandas DataFrame

subset of network that has been pruned according to compendia ratios

Functions to Perform Pruning

kstar.prune.run_pruning(network, log, use_compendia, phospho_type, kinase_size, site_limit, num_networks, network_id, odir, PROCESSES=1)[source]

Generate pruned networks from a weighted kinase-substrate graph and log run information

Parameters
network: pandas dataframe

kinase substrate network matrix, with values indicating weight of kinase-substrate relationship

log: logger

logger to document the pruning process from start to finish

use_compendia: string

whether to use compendia ratios to build network

phospho_type: string

phospho type (‘Y’, ‘ST’, …)

kinase_size: int

number of sites a kinase connects to

site_limit: int

upper limit on the number of kinases a site can connect to

num_networks: int

number of networks to generate

network_id: string

name of network to use in building dictionary

odir: string

output directory for results

Returns
pruner: Prune object

prune object that contains the number of pruned networks indicated by the num_networks parameter

kstar.prune.save_pruning(phospho_type, network_id, kinase_size, site_limit, use_compendia, odir, log)[source]

Save the pruned networks generated by run_pruning function as a pickle to be loaded by KSTAR

Parameters
phospho_type: string

type of phosphomodification the networks were generated for (either ‘Y’ or ‘ST’)

network_id: string

name of network used to build dictionary

kinase_size: int

number of sites a kinase connects to

site_limit: int

upper limit on the number of kinases a site can connect to

use_compendia: string

whether compendia was used for ratios to build networks

odir: string

output directory for results

log: logger

logger to document pruning process from start to finish

Returns
Nothing

kstar.prune.save_run_information(results, use_compendia, pruner)[source]

Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.

Parameters
results:

object that stores all parameters used in the pruning process

use_compendia: string

whether compendia was used for ratios to build network

pruner: Prune object

output of the run_pruning() function

Returns
Nothing

The “ExperimentMapper” class

class kstar.mapping.ExperimentMapper(experiment, columns, logger, sequences=None, compendia=None, window=7, data_columns=None)[source]

Given an experiment object and reference sequences, map the phosphorylation sites to the common reference.

Parameters
name: str

Name of experiment. Used for logging

experiment: pandas dataframe

Pandas dataframe of an experiment that has a reference accession, a peptide column and/or a site column. The peptide column should be upper case, with lower case indicating the site of phosphorylation; this format is preferred. The site column should be in the format S/T/Y<pos>, e.g. Y15 or S345.

columns: dict

Dictionary with mappings of the experiment dataframe column names for the required names ‘accession_id’, ‘peptide’, or ‘site’. One of ‘peptide’ or ‘site’ is required.

logger: Logger object

used for logging when peptides cannot be matched and when a site location changes

sequences: dict

Dictionary of sequences. Key : accession. Value : protein sequence. Default is imported from kstar.config

compendia: pd.DataFrame

Human phosphoproteome compendia, mapped to KinPred and annotated with number of compendia. Default is imported from kstar.config

window: int

The number of amino acids to the N- and C-terminal sides of the central phosphorylated residue to map a site to. Default is 7.

data_columns: list, or empty

The list of data columns to use. If this is empty, the mapper looks for columns whose names start with ‘data:’ and uses those. Default is None.

Attributes
experiment: pandas dataframe

mapped experiment, which for each peptide now contains the mapped accession, site, peptide, number of compendia, and compendia type

sequences: dict

Dictionary of sequences passed into the class

compendia: pandas dataframe

compendia dataframe passed into the class

data_columns: list

indicates which columns will be used as data

Methods

align_sites([window])

Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected.

get_experiment()

Return the mapped experiment dataframe

get_sequence(accession)

Gets the sequence that matches the given accession

set_data_columns(data_columns)

Identifies which columns in the experiment should be used as data columns.

align_sites(window=7)[source]

Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected. Called as expMapper.align_sites(window=7), it operates on the experiment dataframe of the class.

Parameters
window: int

The number of amino acids to the N- and C-terminal sides of the central phosphorylated residue to map a site to.
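To make the `window` parameter concrete, here is a minimal sketch of extracting the flanking residues around a site and of locating a site from a peptide whose phosphorylated residue is lower case. The helper names `site_window` and `site_from_peptide`, and the choice to pad short flanks with `-`, are assumptions for illustration and are not taken from the KSTAR source.

```python
def site_window(sequence, position, window=7, pad="-"):
    """Return the residues within `window` positions of a 1-based site position."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("site position lies outside the sequence")
    left = sequence[max(0, index - window):index]
    right = sequence[index + 1:index + 1 + window]
    # Pad so the site always sits at the centre of a 2 * window + 1 string.
    return left.rjust(window, pad) + sequence[index] + right.ljust(window, pad)

def site_from_peptide(peptide):
    """Find the first lower-case residue, which marks the phosphorylated site."""
    for offset, residue in enumerate(peptide):
        if residue.islower():
            return offset, residue.upper()
    return None

# Hypothetical sequence with a tyrosine at position 5.
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
flank = site_window(sequence, 5, window=7)
hit = site_from_peptide("MKTAyIAK")
```

With `window=7` the result is always 15 characters long, so sites near either terminus remain comparable to sites in the middle of a protein.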

get_experiment()[source]

Return the mapped experiment dataframe

get_sequence(accession)[source]

Gets the sequence that matches the given accession

set_data_columns(data_columns)[source]

Identifies which columns in the experiment should be used as data columns. If data_columns is provided, ‘data:’ is added to the front of each name and the experiment dataframe columns are renamed accordingly. Otherwise, the function looks for columns with ‘data:’ in front and assigns these to the data_columns attribute.
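The two branches described for set_data_columns can be sketched on plain lists of column names. `resolve_data_columns` is a hypothetical stand-in written for illustration; it returns a rename map instead of modifying a dataframe.

```python
def resolve_data_columns(all_columns, data_columns=None, prefix="data:"):
    """Return (rename_map, data_columns) following the 'data:' naming convention."""
    if data_columns:
        # Explicit selection: prefix each chosen column and record the rename.
        rename_map = {col: prefix + col for col in data_columns}
        return rename_map, list(rename_map.values())
    # No selection: pick up every column that already carries the prefix.
    return {}, [col for col in all_columns if col.startswith(prefix)]

# Hypothetical column names.
explicit = resolve_data_columns(["acc", "pep", "t0", "t5"], ["t0", "t5"])
implicit = resolve_data_columns(["acc", "pep", "data:t0"])
```

With a pandas dataframe, the rename map could be passed to `DataFrame.rename(columns=rename_map)`.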

The “KinaseActivity” class

class kstar.calculate.KinaseActivity(evidence, logger, data_columns=None, phospho_type='Y')[source]

KinaseActivity calculates the estimated activity of kinases in an experiment using the hypergeometric distribution. The hypergeometric test compares the number of protein sites found to be active in the evidence with the number of protein sites attributed to a kinase in a provided network.

Parameters
evidence: pandas df

a dataframe that contains, at minimum, the data columns to use as evidence in the analysis plus the KSTAR_ACCESSION and KSTAR_SITE columns; additional columns are allowed

data_columns: list

list of the columns containing the abundance values, which will be used to determine which sites will be used as evidence for activity prediction in each sample

logger: Logger object

keeps track of kstar analysis, including any errors that occur

phospho_type: string, either ‘Y’ or ‘ST’

indicates the phospho modification of interest

Attributes
——————-
Upon Initialization
——————-
evidence: pandas dataframe

the evidence dataframe passed in

data_columns: list

list of columns containing abundance values, which will be used to determine which sites will be used as evidence. If the data_columns parameter was None, this list includes every column in evidence prefixed by ‘data:’

logger: Logger object

keeps track of kstar analysis, including any errors that occur

phospho_type: string

indicated phosphomod of interest

network_directory: string

directory where kinase substrate networks can be downloaded, as indicated in config.py

normalized: bool

indicates whether normalization analysis has been performed

aggregate: string

the type of aggregation to use when determining binary evidence, either ‘count’ or ‘mean’. Default is ‘count’.

threshold: float

cutoff to use when determining what sites to use for each experiment

greater: bool

indicates whether sites with greater or lower abundances than the threshold will be used

run_date: string

indicates the date that kinase activity object was initialized

———————————
After Hypergeometric Calculations
———————————
activities_list: pandas dataframe

p-values obtained for all pruned networks indicating statistical enrichment of a kinase’s substrates for each network, based on hypergeometric tests

activities: pandas dataframe

median p-values obtained from the activities_list object for each experiment/kinase

agg_activities: pandas dataframe
———————————–
After Random Enrichment Calculation
———————————–
random_experiments: pandas dataframe

contains information about the sites randomly sampled for each random experiment

random_kinact: KinaseActivity object

KinaseActivity object containing random activities predicted from each of the random experiments

—————————
After Mann Whitney Analysis
—————————
activities_mann_whitney: pandas dataframe

p-values obtained from comparing the real distribution of p-values to the distribution of p-values from random datasets, based on the Mann Whitney U-test

fpr_mann_whitney: pandas dataframe

false positive rates for predicted kinase activities

Methods

calculate_Mann_Whitney_activities_sig(log[, ...])

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used.

calculate_kinase_activities([agg, ...])

Calculates the combined activity of experiments, using a threshold value to determine whether an experiment sees a site or not. To use values, pass 'mean' as agg (mean aggregation drops NA values from consideration); to use counts, pass 'count' as agg (a site is present if its value is not NA).

calculate_random_activities(logger[, ...])

Generate random experiments and calculate the kinase activities for these random experiments

check_data_columns()

Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence.

create_binary_evidence([agg, threshold, greater])

Returns a binary evidence data frame according to the parameters passed in for method for aggregating duplicates and considering whether a site is included as evidence or not

get_run_date()

return date that kinase activities were run

set_data_columns([data_columns])

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, data_columns is set to all columns that start with ‘data:’.

calculate_Mann_Whitney_activities_sig(log, number_sig_trials=100, PROCESSES=1)[source]

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a p-value, given observations from a random bootstrapping analysis.

Parameters
kinact_dict: dictionary

A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’

log: logger

Logger for logging activity messages

phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}

Which substrate/kinase type to run activity for: both [‘Y’, ‘ST’] (default), tyrosine [‘Y’], or serine/threonine [‘ST’]

number_sig_trials: int

Maximum number of significant trials to run

Returns

calculate_kinase_activities(agg='mean', threshold=1.0, greater=True, PROCESSES=1)[source]

Calculates the combined activity of experiments, using a threshold value to determine whether an experiment sees a site or not. To use values, pass ‘mean’ as agg; mean aggregation drops NA values from consideration. To use counts, pass ‘count’ as agg; a site is present if its value is not NA.

Parameters
data_columns: list

columns that represent experimental results. If None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns

threshold: float

threshold value used to filter rows

agg: {‘count’, ‘mean’}

method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and counts those whose values are non-NaN. ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns
activities: dict

key: experiment; value: pd DataFrame with the columns

network: network name, from networks key
kinase: kinase examined
frequency: number of times kinase was seen in subgraph of evidence and network
kinase_activity: hypergeometric kinase activity
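The hypergeometric enrichment behind kinase_activity can be sketched with the standard library alone. This is a generic upper-tail hypergeometric test, not KSTAR's implementation; the function name and the choice of which counts form the population are assumptions for illustration.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, kinase_sites, observed, overlap):
    """P(X >= overlap) when `observed` sites are drawn without replacement from
    `population` sites, of which `kinase_sites` are substrates of the kinase."""
    upper = min(kinase_sites, observed)
    total = comb(population, observed)
    tail = sum(
        comb(kinase_sites, k) * comb(population - kinase_sites, observed - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 10 sites in the network, 4 linked to the kinase,
# 3 sites seen in the experiment, 2 of which are linked to the kinase.
p_value = hypergeom_enrichment_pvalue(10, 4, 3, 2)
```

A small p-value means the experiment contains more of the kinase's substrates than random sampling would be expected to produce. In practice, `scipy.stats.hypergeom.sf` computes the same tail probability.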

calculate_random_activities(logger, num_random_experiments=150, PROCESSES=1)[source]

Generate random experiments and calculate the kinase activities for these random experiments

check_data_columns()[source]

Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence. Removes all columns that do not meet criteria

create_binary_evidence(agg='mean', threshold=1.0, greater=True)[source]

Returns a binary evidence data frame according to the parameters passed in for method for aggregating duplicates and considering whether a site is included as evidence or not

Parameters
threshold: float

threshold value used to filter rows

agg: {‘count’, ‘mean’}

method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and counts those whose values are non-NaN. ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns
evidence_binary: pd.DataFrame

Matches the evidence dataframe of the kinact object, but with 0 or 1 if a site is included or not. This is uniquified and rows that are never used are removed.
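The interaction of agg, threshold, and greater can be illustrated on plain tuples. This sketch handles a single data column and assumes that sites with no non-NA value are simply left out; `binary_evidence_sketch` is a hypothetical name, and KSTAR itself works on dataframes.

```python
from collections import defaultdict
from math import isnan

def binary_evidence_sketch(rows, agg="mean", threshold=1.0, greater=True):
    """rows: iterable of (accession, site, value); value may be None or NaN."""
    grouped = defaultdict(list)
    for accession, site, value in rows:
        if value is not None and not isnan(value):
            grouped[(accession, site)].append(value)  # NA values are dropped

    evidence = {}
    for key, values in grouped.items():
        if agg == "mean":
            score = sum(values) / len(values)
        elif agg == "count":
            score = len(values)
        else:
            raise ValueError("agg must be 'count' or 'mean'")
        keep = score >= threshold if greater else score <= threshold
        evidence[key] = int(keep)
    return evidence

# Hypothetical measurements, including a duplicated site and a missing value.
rows = [
    ("P1", "Y10", 2.0), ("P1", "Y10", 0.5),
    ("P2", "S5", float("nan")), ("P3", "T7", 0.2),
]
by_mean = binary_evidence_sketch(rows, agg="mean", threshold=1.0)
by_count = binary_evidence_sketch(rows, agg="count", threshold=1.0)
```

Here the duplicated site averages to 1.25 and passes the threshold under ‘mean’, while under ‘count’ every site with at least one measured value passes.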

get_run_date()[source]

return date that kinase activities were run

set_data_columns(data_columns=None)[source]

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, data_columns is set to all columns that start with ‘data:’.

Checks all set columns to make sure columns are valid after filtering evidence.

The “DotPlot” class

class kstar.plot.DotPlot(values, fpr, alpha=0.05, inclusive_alpha=True, binary_sig=True, dotsize=5, colormap={0: '#6b838f', 1: '#FF3300'}, facecolor='white', labelmap=None, legend_title='p-value', size_number=5, size_color='gray', color_title='Significant', markersize=10, legend_distance=1.0, figsize=(20, 4), title=None, xlabel=True, ylabel=True, x_label_dict=None, kinase_dict=None)[source]

The DotPlot class is used for plotting dotplots, with the option to add clustering and context plots. The size of the dots is based on the values dataframe, where the area of each dot is value * dotsize.

Parameters
values: pandas DataFrame instance

values to plot

fpr: pandas DataFrame instance

false positive rates associated with values being plotted

alpha: float, optional

fpr value that defines the significance cutoff to use when plotting. default: 0.05

inclusive_alpha: boolean

whether to include the alpha (significance <= alpha), or not (significance < alpha). default: True

binary_sig: boolean, optional

indicates whether to plot fpr with binary significance or as a change in color hue. default: True

dotsize: float, optional

multiplier to use for scaling size of dots

colormap: dict, optional

maps color values to actual color to use in plotting. default: {0: ‘#6b838f’, 1: ‘#FF3300’}

labelmap: dict, optional

maps labels of colors; the default is to indicate the FPR cutoff in the legend. default: None

facecolor: color, optional

Background color of dotplot. default: ‘white’

legend_title: str, optional

Legend title for dot sizes. default: ‘p-value’

size_number: int, optional

Number of dots to attempt to generate for dot size legend

size_color: color, optional

Size Legend Color to use

color_title: str, optional

Legend Title for the Color Legend

markersize: int, optional

Size of dots for Color Legend

legend_distance: int, optional

relative distance to place legends

figsize: tuple, optional

size of dotplot figure

title: str, optional

Title of dotplot

xlabel: bool, optional

Show xlabel on graph if True

ylabel: bool, optional

Show ylabel on graph if True

x_label_dict: dict, optional

Mapping dictionary of labels as they appear in values dataframe (keys) to how they should appear on plot (values)

kinase_dict: dict, optional

Mapping dictionary of kinase names as they appear in values dataframe (keys) to how they should appear on plot (values)

Attributes
values: pandas dataframe

a copy of the original values dataframe

fpr: pandas dataframe

a copy of the original fpr dataframe

alpha: float

cutoff used for significance, default 0.05

inclusive_alpha: boolean

whether to include the alpha (significance <= alpha), or not (significance < alpha)

significance: pandas dataframe

indicates whether a particular kinase’s activity is significant, where fpr <= alpha is significant, otherwise it is insignificant

colors: pandas dataframe

dataframe indicating the color to use when plotting: either a copy of the fpr or significance dataframe

binary_sig: boolean

indicates whether coloring will be done based on binary significance or fpr values. Default True

labelmap: dict

indicates how to label each significance color

figsize: tuple

size of the outputted figure, which is overridden if axes is provided for dotplot

title: string

title of the dotplot

xlabel: boolean

indicates whether to plot x-axis labels

ylabel: boolean

indicates whether to plot y-axis labels

colormap: dict

colors to be used when plotting

facecolor: string

background color of dotplot
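The relationship between the fpr, alpha, inclusive_alpha, significance, and colormap attributes can be sketched with nested dictionaries. The helper `significance_sketch` and the toy values are assumptions for illustration; DotPlot itself operates on pandas dataframes.

```python
def significance_sketch(fpr, alpha=0.05, inclusive_alpha=True):
    """fpr: {kinase: {sample: fpr value}} -> same shape with 1 (significant) or 0."""
    def is_significant(value):
        return value <= alpha if inclusive_alpha else value < alpha
    return {
        kinase: {sample: int(is_significant(v)) for sample, v in row.items()}
        for kinase, row in fpr.items()
    }

colormap = {0: "#6b838f", 1: "#FF3300"}  # defaults from the DotPlot signature

# Hypothetical false positive rates for two kinases in two samples.
fpr = {
    "SRC": {"ctrl": 0.05, "treated": 0.01},
    "ABL1": {"ctrl": 0.40, "treated": 0.06},
}
significance = significance_sketch(fpr)
colors = {
    kinase: {sample: colormap[flag] for sample, flag in row.items()}
    for kinase, row in significance.items()
}
# Kinases that are significant in at least one sample.
ever_significant = [k for k, row in significance.items() if any(row.values())]
```

The last line mirrors the idea behind drop_kinases_with_no_significance: a kinase that is never significant contributes only gray dots and can be removed before plotting.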

Methods

cluster(ax[, method, metric, orientation, ...])

Performs hierarchical clustering on data and plots result to provided Axes.

context(ax, info, id_column, context_columns)

Context plot is generated and returned.

dotplot([ax, orientation, size_legend, ...])

Generates the dotplot plot, where size is determined by values dataframe and color is determined by significant dataframe

drop_kinases(kinase_list)

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object.

drop_kinases_with_no_significance()

Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant

cluster(ax, method='single', metric='euclidean', orientation='top', color_threshold=-inf)[source]

Performs hierarchical clustering on data and plots result to provided Axes. The result and significant dataframes are ordered according to clustering.

Parameters
ax: matplotlib Axes instance

Axes to plot dendrogram to

method: str, optional

The linkage algorithm to use.

metric: str or function, optional

The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.

orientation: str, optional

The direction to plot the dendrogram, which can be any of the following strings:

‘top’: plots the root at the top, with descendant links going downwards (default).
‘bottom’: plots the root at the bottom, with descendant links going upwards.
‘left’: plots the root at the left, with descendant links going right.
‘right’: plots the root at the right, with descendant links going left.

context(ax, info, id_column, context_columns, dotsize=200, markersize=20, orientation='left', color_palette='colorblind', margin=0.2, make_legend=True)[source]

Context plot is generated and returned. The context plot contains the categorical data used for describing the data.

Parameters
ax: matplotlib axis

where to map subtype information to

info: pandas df

Dataframe where context information is pulled from

id_column: str

Column used to map the subtype information to

context_columns: list

list of columns to pull context information from

dotsize: int, optional

size of context dots

markersize: int, optional

size of legend markers

orientation: str, optional

orientation to plot context plots to; determines where legends are placed. options: left, right, top, bottom

color_palette: str, optional

seaborn color palette to use

margin: float, optional

margin

make_legend: bool, optional

whether to create legend for context colors

dotplot(ax=None, orientation='left', size_legend=True, color_legend=True, max_size=None)[source]

Generates the dotplot plot, where size is determined by values dataframe and color is determined by significant dataframe

Parameters
ax: matplotlib Axes instance, optional

axes dotplot will be plotted on. If None then new plot generated

drop_kinases(kinase_list)[source]

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object. Removal is in place.

Parameters
kinase_list: list

list of kinase names to remove

drop_kinases_with_no_significance()[source]

Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant

Supporting Functions

Master Functions for Running KSTAR Pipeline

kstar.calculate.Mann_Whitney_analysis(kinact_dict, log, number_sig_trials=100, PROCESSES=1)[source]

For a kinact_dict, where random generation and activity has already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a p-value, given observations from a random bootstrapping analysis.

Parameters
kinact_dict: dictionary

A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’

log: logger

Logger for logging activity messages

number_sig_trials: int

Maximum number of significant trials to run
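The two quantities named above, a Mann-Whitney U comparison of real versus random p-values and an empirical false positive rate, can be sketched in pure Python. Both helpers are hypothetical illustrations of the general statistics; how KSTAR turns them into its reported values is not specified here.

```python
def mann_whitney_u(real, random):
    """U statistic: number of (real, random) pairs in which the real value is
    smaller, with ties counted as one half."""
    u = 0.0
    for r in real:
        for q in random:
            if r < q:
                u += 1.0
            elif r == q:
                u += 0.5
    return u

def empirical_fpr(observed, null_values):
    """Fraction of null (random-experiment) values at least as small as observed."""
    return sum(v <= observed for v in null_values) / len(null_values)

# Hypothetical p-values across pruned networks for one kinase and one sample.
u_stat = mann_whitney_u([0.01, 0.02], [0.5, 0.6, 0.02])
fpr_estimate = empirical_fpr(0.03, [0.01, 0.2, 0.5, 0.9])
```

A U statistic close to `len(real) * len(random)` indicates that the real p-values are consistently smaller than the random ones. For real analyses, `scipy.stats.mannwhitneyu` returns both the statistic and a p-value.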

kstar.calculate.enrichment_analysis(experiment, log, networks, phospho_types=['Y', 'ST'], data_columns=None, agg='mean', threshold=1.0, greater=True, PROCESSES=1)[source]

Function to establish a kstar KinaseActivity object from an experiment with an activity log, add the networks, then calculate, aggregate, and summarize the hypergeometric enrichment into a final activity object. Should be followed by randomized_analysis, then Mann_Whitney_analysis.

Parameters
experiment: pandas df

experiment dataframe that has been mapped, includes KSTAR_SITE, KSTAR_ACCESSION, etc.

log: logger object

Log to write activity errors and updates to

networks: dictionary of dictionaries

Outer dictionary keys are ‘Y’ and ‘ST’. Establish a network by loading a pickle of desired networks. See the helpers and config file for this. If downloaded from FigShare, the global network pickles in the config file can be loaded. For example: networks['Y'] = pickle.load(open(config.NETWORK_Y_PICKLE, "rb"))

phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}

Which substrate/kinase type to run activity for: both [‘Y’, ‘ST’] (default), tyrosine [‘Y’], or serine/threonine [‘ST’]

data_columns: list

columns that represent experimental results. If None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns

agg: {‘count’, ‘mean’}

method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and counts those whose values are non-NaN. ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

threshold: float

threshold value used to filter rows

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

Returns
kinactDict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run, ‘Y’ and ‘ST’. Includes:

the activities dictionary (see calculate_kinase_activities)
aggregation of activities across networks (see aggregate activities)
activity summary (see summarize_activities)

kstar.calculate.randomized_analysis(kinact_dict, log, num_random_experiments=150, PROCESSES=1)[source]

Creates random experiments, drawn from the human phosphoproteome, according to the distribution of the number of compendia that each data column in the experiment has for num_random_experiments. Kinase activity calculation is then run on every random experiment.

Parameters
kinact_dict: KinaseActivities dictionary

Has keys [‘Y’] and/or [‘ST’] and values that are KinaseActivity objects. These objects are modified to add normalization

log: logger

Logger for logging activity messages

num_random_experiments: int

Number of random experiments, for each data column, to create and run activity from
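The documented order of the master functions (enrichment, then randomization, then Mann-Whitney) can be sketched as a single wrapper. The wrapper name `run_kstar_pipeline` is hypothetical; the calls use only the signatures listed in this reference, and the sketch assumes that randomized_analysis and Mann_Whitney_analysis update the KinaseActivity objects in place, since no return value is documented for them.

```python
def run_kstar_pipeline(experiment, log, networks, name, odir):
    """Chain the documented master functions in their documented order."""
    # Imported lazily so this sketch can be defined without KSTAR installed.
    from kstar import calculate

    # Step 1: hypergeometric enrichment on the mapped experiment (tyrosine only here).
    kinact_dict = calculate.enrichment_analysis(
        experiment, log, networks,
        phospho_types=["Y"], agg="mean", threshold=1.0, greater=True, PROCESSES=1,
    )
    # Step 2: random experiments drawn from the phosphoproteome (assumed in place).
    calculate.randomized_analysis(kinact_dict, log, num_random_experiments=150, PROCESSES=1)
    # Step 3: compare real and random p-value distributions (assumed in place).
    calculate.Mann_Whitney_analysis(kinact_dict, log, number_sig_trials=100, PROCESSES=1)
    # Step 4: save the minimal set of files needed to rebuild the results.
    calculate.save_kstar_slim(kinact_dict, name, odir)
    return kinact_dict
```

Here `experiment` is expected to be an already mapped dataframe and `networks` a dictionary with a ‘Y’ entry, as described under enrichment_analysis.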

Functions for Saving and Loading KSTAR results

kstar.calculate.from_kstar_nextflow(name, odir, log=None)[source]

Given the name and output directory of a saved kstar analysis from the nextflow pipeline, load the results into a new kinact object with the minimum dataframes required for analysis (binary experiment, hypergeometric activities, normalized activities, mann whitney activities)

Parameters
name: string

The name used when saving activities and mapped data

odir: string

Output directory of saved files

log: logger

logger used when loading nextflow data into kinase activity object. If not provided, new logger will be created.

kstar.calculate.from_kstar_slim(name, odir, log)[source]

Given the name and output directory of a saved kstar analysis, load the parameters and minimum dataframes needed for reinstantiating a kinact object. This minimum list will allow you to repeat normalization or mann whitney at a different false positive rate threshold and plot results.

Parameters
name: string

The name used when saving activities and mapped data

odir: string

Output directory of saved files and parameter pickle

log: logger

Logger for logging activity messages

kstar.calculate.save_kstar(kinact_dict, name, odir, PICKLE=True)[source]

Having performed kinase activities (run_kstar_analyis), save each of the important dataframes to files and the final pickle. Saves activities, aggregated_activities, and summarized_activities tab-separated files, and saves a pickle file of the dictionary.

Parameters
kinact_dict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run, ‘Y’ and ‘ST’. Includes:

the activities dictionary (see calculate_kinase_activities)
aggregation of activities across networks (see aggregate activities)
activity summary (see summarize_activities)

name: string

The name to use when saving activities

odir: string

Output directory to save files and pickle to

PICKLE: boolean

Whether to save the entire pickle file

Returns
Nothing

kstar.calculate.save_kstar_slim(kinact_dict, name, odir)[source]

Having performed kinase activities (run_kstar_analyis), save each of the important dataframes, minimizing the memory storage needed to get back to a rebuilt version for plotting results and analysis. For each phospho_type in the kinact_dict, this will save three .tsv files for every activities analysis run, two additional if random analysis was run, and two more if Mann Whitney based analysis was run. It also creates a readme file of the parameter values used

Parameters
kinact_dict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run, ‘Y’ and ‘ST’. Includes:

the activities dictionary (see calculate_kinase_activities)
aggregation of activities across networks (see aggregate activities)
activity summary (see summarize_activities)

name: string

The name to use when saving activities

odir: string

Output directory to save files and pickle to

Returns
Nothing

Other Helper Functions

kstar.helpers.convert_acc_to_uniprot(df, acc_col_name, acc_col_type, acc_uni_name)[source]

Given an experimental dataframe (df) with an accession column (acc_col_name) that is not UniProt, use UniProt to append an accession column of UniProt IDs

Parameters
df: pandas.DataFrame

Dataframe with at least a column of accession of interest

acc_col_name: string

name of column to convert FROM

acc_col_type: string

Uniprot string designation of the accession type to convert FROM, see https://www.uniprot.org/help/api_idmapping

acc_uni_name:

name of new column

Returns
appended_df: pandas.DataFrame

Input dataframe with an appended acc column of uniprot IDs

kstar.helpers.get_logger(name, filename)[source]

Finds and returns logger if it exists. Creates new logger if log file does not exist

Parameters
name: str

log name

filename: str

location to store log file

kstar.helpers.process_fasta_file(fasta_file)[source]

For configuration, to convert the global fasta sequence file into a sequence dictionary that can be used in mapping

Parameters
fasta_file: str

file location of fasta file

Returns
sequences: dict

{acc : sequence} dictionary generated from fasta file
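A minimal sketch of turning FASTA content into an {acc : sequence} dictionary is shown below. It is not the KSTAR implementation: `parse_fasta_lines` is a hypothetical helper that takes an iterable of lines and assumes UniProt-style headers of the form `>sp|P12345|NAME_HUMAN`, falling back to the first header token otherwise.

```python
def parse_fasta_lines(lines):
    """Build {accession: sequence} from FASTA lines (hypothetical helper)."""
    sequences = {}
    accession = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            header = tokens[0] if tokens else ""
            parts = header.split("|")
            # Assumed UniProt-style header 'sp|P12345|NAME'; otherwise use the first token.
            accession = parts[1] if len(parts) >= 3 else header
            sequences[accession] = ""
        elif accession is not None:
            # Sequence lines may wrap, so append to the current record.
            sequences[accession] += line
    return sequences

# Hypothetical FASTA content with one wrapped record and one plain header.
fasta_lines = [
    ">sp|P12931|SRC_HUMAN Proto-oncogene tyrosine-protein kinase Src",
    "MGSNKSKPKD",
    "ASQRRRSLEP",
    ">Q9TEST",
    "MKT",
]
sequences = parse_fasta_lines(fasta_lines)
```

To read from disk, the same function can be given an open file handle, for example `parse_fasta_lines(open(fasta_file))`.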

kstar.helpers.string_to_boolean(string)[source]

Converts string to boolean

Parameters
string: str

input string

Returns
result: bool

output boolean