KSTAR Reference

The “Config” Module#

kstar.config.check_configuration()#

Verify that all necessary files are downloadable and findable

kstar.config.find_available_networks(phospho_type)#

Find available network hashes in the current network directory, and return dictionary with information about them

Returns:
available_networksdict

dictionary containing all available networks, in the format -> Network hash : network information dictionary

kstar.config.install_network_files(target_dir=None)#

Retrieves the network files that accompany this release from FigShare and unzips them into the specified directory.

Parameters:
target_dirstr, optional

Directory to install network files to. If None, defaults to within package location ({KSTAR_DIR}/NETWORKS/)

kstar.config.update_configuration(network_dir=None, y_network_name=None, st_network_name=None, save_random_experiments=None, use_pregenerated_random_activities=None, save_new_random_activities=None, custom_pregenerated_activities_dir=None)#

Update configuration parameters in current iteration and save to configuration file.

Parameters:
use_pregenerated_random_activitiesbool, optional

Whether to use pregenerated random activities when possible, by default None

save_new_random_activitiesbool, optional

Whether to save new random activities when they are generated, by default None

custom_pregenerated_activities_dirstr, optional

Directory to save newly generated random activities for future use, by default None

network_dirstr, optional

Directory containing the kinase-substrate networks, by default None (which assumes it is located in kstar directory)

y_network_name: str, optional

Unique identifier of the tyrosine network to use by default.

st_network_name: str, optional

Unique identifier of the serine/threonine network to use by default.

kstar.config.update_network_directory(network_dir=None, y_network_name=None, st_network_name=None)#

Update the location of the network files, and verify that all necessary files are located in the directory

Parameters:
network_dir: string

path to where network files are located

y_network_name: string

name of the tyrosine network to use

st_network_name: string

name of the serine/threonine network to use

The “Prune” Module#

The “Pruner” Class#

class kstar.prune.Pruner(network, network_name, phospho_type='Y', acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], logger=None, network_dir=None)#

Pruning Algorithm used for KSTAR.

Parameters:
networkpandas df

weighted kinase-site prediction network where there is an accession, site, kinase, and score column

network_namestr

name to use when saving pruned networks

loggerNone or logging.logger

logger used for pruning. Will create a new logger if None is provided

phospho_typestr

phospho_type(s) to use when building pruned networks

acc_colstr

the name of the column containing Uniprot Accession IDs for each substrate in the weighted network

site_colstr

the name of the column containing the residue type and location of each substrate in the weighted network (Y1268, S44, etc.)

nonweight_colslist

indicates the non-weight containing columns in the network (these will be removed in the final processed network, as they are not needed). If None, will automatically look for any non-numeric columns and remove them.

network_dirstr

location to save the final pruned networks. Will use default network directory from config if None is provided.
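The site_col format described above (a residue letter followed by a position, such as Y1268 or S44) can be illustrated with a small parser. This is a minimal sketch and not part of KSTAR: the names `SITE_PATTERN`, `parse_site`, and `matches_phospho_type` are hypothetical, and it assumes that phospho_type is a string of allowed residue letters such as 'Y' or 'ST'.

```python
import re

# Residue letter (S, T, or Y) followed by a 1-based position, e.g. "Y1268".
SITE_PATTERN = re.compile(r"^([STY])(\d+)$")


def parse_site(site):
    """Split a site label such as 'Y1268' into (residue, position)."""
    match = SITE_PATTERN.match(site.strip())
    if match is None:
        raise ValueError(f"Unrecognized site label: {site!r}")
    return match.group(1), int(match.group(2))


def matches_phospho_type(site, phospho_type):
    """Return True if the site's residue is one of the letters in phospho_type."""
    residue, _ = parse_site(site)
    return residue in phospho_type


print(parse_site("Y1268"))                # ('Y', 1268)
print(matches_phospho_type("S44", "ST"))  # True
```

A check like this could be used to confirm that a weighted network only contains sites of the modification type being pruned.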

Methods

assess_work_dir()

Report how many networks are currently in the work directory

build_multiple_compendia_networks(...[, ...])

Builds multiple compendia-limited networks

build_multiple_networks(kinase_size, ...[, ...])

Basic Network Generation - only takes into account score when determining sites a kinase connects to

build_pruned_network(network, kinase_size, ...)

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

calculate_compendia_sizes(kinase_size)

Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia

checkParameters(kinase_size, site_limit)

Given the site_limit and kinase_size parameters to be used during pruning, raise errors if not feasible, and raise warnings if value is higher than we would recommend (>40% of the maximum kinase_size value)

clean_work_dir()

Remove all files in existing work directory

compendia_pruned_network(compendia_sizes, ...)

Builds a compendia-pruned network that takes into account compendia size limits per kinase

getMaximumKinaseSize(site_limit)

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)

getRecommendedKinaseSize(site_limit)

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size

pregenerate_random_activities([PROCESSES])

Pregenerate random activities based on the generated networks

report_info(txt)

Both log and print information during pruning

report_warning(txt)

Both log and print warnings during pruning

run(kinase_size, site_limit[, num_networks, ...])

Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks

save_networks([network_file_used, network_desc])

Save the pruned networks generated by the 'build_multiple_networks' or 'build_multiple_compendia_networks' as a pickle to be loaded by KSTAR

save_run_information([network_file_used, ...])

Save information about the generation of networks during run_pruning, including the parameters used for generation.

assess_work_dir()#

Report how many networks are currently in the work directory

build_multiple_compendia_networks(kinase_size, site_limit, num_networks, PROCESSES=1)#

Builds multiple compendia-limited networks

Parameters:
kinase_size: int

number of sites each kinase should connect to

site_limit :int

upper limit of number of kinases a site can connect to

num_networks: int

number of networks to build

network_idstr

id to use for each network in dictionary

Returns:
pruned_networksdict

key : <network_id>_<i> value : pruned network

build_multiple_networks(kinase_size, site_limit, num_networks, PROCESSES=1)#

Basic Network Generation - only takes into account score when determining sites a kinase connects to

build_pruned_network(network, kinase_size, site_limit)#

Builds a heuristic pruned network where each kinase has a specified number of connected sites and each site has an upper limit to the number of kinases it can connect to

Parameters:
networkpandas DataFrame

network to build pruned network on

kinase_size: int

number of sites each kinase should connect to

site_limit :int

upper limit of number of kinases a site can connect to

Returns:
pruned networkpandas DataFrame

subset of network that has been pruned
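The two constraints described for build_pruned_network (a fixed number of sites per kinase, and an upper limit on kinases per site) can be illustrated with a deterministic greedy pass. This is a simplified sketch, not KSTAR's heuristic: the function name `prune_network`, the nested-dictionary input, and the strategy of always taking the highest-scoring available site are all assumptions made for illustration.

```python
def prune_network(scores, kinase_size, site_limit):
    """Greedy illustration of the kinase_size / site_limit constraints.

    scores: dict mapping kinase -> dict mapping site -> score.
    Returns a dict mapping kinase -> list of selected sites.
    """
    site_counts = {}  # number of kinases already connected to each site
    pruned = {}
    for kinase in sorted(scores):
        # Rank this kinase's sites by descending score (ties broken by name).
        ranked = sorted(scores[kinase].items(), key=lambda kv: (-kv[1], kv[0]))
        chosen = []
        for site, _ in ranked:
            if len(chosen) == kinase_size:
                break
            if site_counts.get(site, 0) < site_limit:
                chosen.append(site)
                site_counts[site] = site_counts.get(site, 0) + 1
        if len(chosen) < kinase_size:
            raise ValueError(f"Could not assign {kinase_size} sites to {kinase}")
        pruned[kinase] = chosen
    return pruned


scores = {
    "K1": {"s1": 0.9, "s2": 0.8, "s3": 0.1},
    "K2": {"s1": 0.7, "s2": 0.6, "s3": 0.5},
}
print(prune_network(scores, kinase_size=1, site_limit=1))
# {'K1': ['s1'], 'K2': ['s2']}
```

In this toy input K2 cannot take s1 because K1 already used it and site_limit is 1, so it falls back to its next best site. Asking for kinase_size=2 with site_limit=1 fails, which mirrors why infeasible parameter combinations need to be checked before pruning.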

calculate_compendia_sizes(kinase_size)#

Calculates the number of sites per compendia size that a kinase should connect to using same ratios of compendia sizes as found in compendia

Parameters:
kinase_size: int

number of sites each kinase should connect to

Returns:
sizesdict

key : compendia size value : number of sites each kinase should pull from given compendia size
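Splitting kinase_size across compendia sizes "using the same ratios as found in compendia" amounts to a proportional allocation. The sketch below shows one way to do this with largest-remainder rounding so that the parts sum to kinase_size. The function name `allocate_by_ratio`, the input counts, and the rounding rule are assumptions; the source does not state how KSTAR rounds.

```python
def allocate_by_ratio(compendia_counts, kinase_size):
    """Split kinase_size across compendia sizes in proportion to their counts.

    compendia_counts: dict mapping compendia size -> number of sites with that size.
    Returns a dict mapping compendia size -> number of sites per kinase.
    """
    total = sum(compendia_counts.values())
    exact = {k: kinase_size * v / total for k, v in compendia_counts.items()}
    sizes = {k: int(x) for k, x in exact.items()}  # round down first
    leftover = kinase_size - sum(sizes.values())
    # Hand the remaining slots to the entries with the largest fractional part.
    by_remainder = sorted(exact, key=lambda k: (sizes[k] - exact[k], k))
    for k in by_remainder[:leftover]:
        sizes[k] += 1
    return sizes


counts = {0: 500, 1: 300, 2: 200}  # hypothetical site counts per compendia size
print(allocate_by_ratio(counts, kinase_size=10))  # {0: 5, 1: 3, 2: 2}
```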

checkParameters(kinase_size, site_limit)#

Given the site_limit and kinase_size parameters to be used during pruning, raise errors if not feasible, and raise warnings if value is higher than we would recommend (>40% of the maximum kinase_size value)

Parameters:
kinase_size: int

Parameter used in pruning: indicates the number of substrates each kinase will be connected to

site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
Nothing, will only raise errors/warnings if parameters are not feasible

clean_work_dir()#

Remove all files in existing work directory

compendia_pruned_network(compendia_sizes, site_limit, odir)#

Builds a compendia-pruned network that takes into account compendia size limits per kinase

Parameters:
compendia_sizesdict

key : compendia size value : number of sites to connect to kinase

site_limitint

upper limit of number of kinases a site can connect to

Returns:
pruned_networkpandas DataFrame

subset of network that has been pruned according to compendia ratios

getMaximumKinaseSize(site_limit)#

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter)

Theoretical maximum exists when each substrate hits the maximum site_limit

Parameters:
site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
theoretical_max_ksize: int

largest possible value that ‘kinase_size’ parameter can have without throwing any errors
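One reading of "theoretical maximum exists when each substrate hits the maximum site_limit" is a simple edge count: if every site is connected to site_limit kinases, the network holds num_sites × site_limit edges, shared evenly among the kinases. The sketch below encodes that reading together with the 40% warning level mentioned under checkParameters. The function names and the exact formula are assumptions for illustration, not KSTAR's implementation.

```python
def theoretical_max_kinase_size(num_sites, num_kinases, site_limit):
    """Upper bound on kinase_size if every site is used by site_limit kinases."""
    total_edges = num_sites * site_limit
    return total_edges // num_kinases


def exceeds_recommended(kinase_size, max_size, warn_fraction=0.4):
    """Raise if kinase_size is infeasible; return True if it deserves a warning."""
    if kinase_size > max_size:
        raise ValueError(f"kinase_size {kinase_size} exceeds the maximum {max_size}")
    return kinase_size > warn_fraction * max_size


max_size = theoretical_max_kinase_size(num_sites=10000, num_kinases=50, site_limit=10)
print(max_size)                             # 2000
print(exceeds_recommended(1000, max_size))  # True: above 40% of the maximum
print(exceeds_recommended(500, max_size))   # False
```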

getRecommendedKinaseSize(site_limit)#

Given a network and site_limit (maximum number of kinases a phosphorylation site can provide evidence to), will calculate the theoretical maximum number of connections each kinase can have (kinase_size parameter) and recommend a range of values for kinase_size

Theoretical maximum exists when each substrate hits the maximum site_limit

Parameters:
site_limit: int

Parameter used in pruning: indicates the maximum number of kinases a phosphorylation site can be connected to in the final pruned network

Returns:
Nothing, prints theoretical maximum of kinase size and the recommended values for the parameter given the site_limit

pregenerate_random_activities(PROCESSES=1)#

Pregenerate random activities based on the generated networks.

report_info(txt)#

Both log and print information during pruning

report_warning(txt)#

Both log and print warnings during pruning

run(kinase_size, site_limit, num_networks=50, use_compendia=True, generate_activities=True, network_file_used=None, network_desc=None, restart=False, PROCESSES=1)#

Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks

save_networks(network_file_used=None, network_desc=None)#

Save the pruned networks generated by the ‘build_multiple_networks’ or ‘build_multiple_compendia_networks’ as a pickle to be loaded by KSTAR

save_run_information(network_file_used=None, network_desc=None)#

Save information about the generation of networks during run_pruning, including the parameters used for generation. Primarily used when running bash script.

Parameters:
network_file_usedstr, optional

file path of the weighted network file used during pruning

network_descstr, optional

description of the network used during pruning. Recommended, but not required

Functions to Perform Pruning#

kstar.prune.run_pruning(weighted_network, network_name, odir, phospho_type, kinase_size, site_limit, num_networks, use_compendia=True, generate_activities=True, network_file_used=None, network_desc=None, restart=False, logger=None, acc_col='substrate_acc', site_col='site', nonweight_cols=['substrate_acc', 'site', 'substrate_id', 'substrate_name', 'pep'], PROCESSES=1)#

Run the pruning algorithm from start to finish, including pregenerating random activities based on generated networks

Parameters:
weighted_networkpandas DataFrame

weighted kinase-site prediction network where there is an accession, site, kinase, and score column

network_namestr

name to use when saving pruned networks

odirstr

location to save the final pruned networks. Will use default network directory from config if None is provided.

phospho_typestr

phospho_type(s) to use when building pruned networks

The “ExperimentMapper” class#

class kstar.mapping.ExperimentMapper(experiment, columns, odir='./', name='experiment', window=7, data_columns=None, logger=None, sequences=None, compendia=None)#

Given an experiment object and reference sequences, map the phosphorylation sites to the common reference.

Parameters:
experiment: pandas dataframe

Pandas dataframe of an experiment that has a reference accession, a peptide column and/or a site column. The peptide column should be upper case, with lower case indicating the site of phosphorylation (this is the preferred input). The site column should be in the format S/T/Y<pos>, e.g. Y15 or S345

columns: dict

Dictionary with mappings of the experiment dataframe column names for the required names ‘accession_id’, ‘peptide’, or ‘site’. One of ‘peptide’ or ‘site’ is required.

name: str

Name of experiment, used for logging and output file names

odir: str

Output directory where mapped data and logs will be saved

logger: Logger object

used for logging when peptides cannot be matched and when a site location changes. If None, a logger will be created in the output directory.

sequences: dict

Dictionary of sequences. Key : accession. Value : protein sequence. Default is imported from kstar.config

compendia: pd.DataFrame

Human phosphoproteome compendia, mapped to KinPred and annotated with number of compendia. Default is imported from kstar.config

windowint

The number of amino acids on the N- and C-terminal sides of the central phosphorylation site to include when mapping a site. Default is 7.

data_columns: list, or empty

The list of data columns to use. If this is empty, the mapper will look for any columns that start with ‘data:’ and use those. Default is None.

Attributes:
experiment: pandas dataframe

mapped experiment, which for each peptide now contains the mapped accession, site, peptide, number of compendia, and compendia type

sequences: dict

Dictionary of sequences passed into the class

compendia: pandas dataframe

compendia dataframe passed into the class

data_columns: list

indicates which columns will be used as data

Methods

align_sites([window])

Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected.

get_experiment()

Return the mapped experiment dataframe

get_number_missed_peptides()

Returns number of missed peptides

get_number_missed_sites()

Returns number of missed sites

get_reason_for_unmapped()

Returns dataframe of unmapped sites with reasons for being unmapped

get_sequence(accession)

Gets the sequence that matches the given accession

save_experiment([return_stats, ...])

Given a completed mapping process, save the resulting experiment and reporting files (if desired) to the output directory.

set_data_columns(data_columns)

Identifies which columns in the experiment should be used as data columns.

align_sites(window=7)#

Map the peptide/sites to the common sequence reference and remove and report errors for sites that do not align as expected. Operates on the experiment dataframe of the class. Example call: expMapper.align_sites(window=7).

Parameters:
window: int

The number of amino acids on the N- and C-terminal sides of the central phosphorylation site to include when mapping a site.
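The peptide convention used by the mapper (upper-case sequence with the phosphorylated residue in lower case) and the window parameter can be illustrated as follows. This is a minimal sketch, not KSTAR's mapping code: the function `locate_phosphosite` is hypothetical, it handles only peptides with exactly one lower-case residue, and padding short flanks with '-' is an assumption.

```python
def locate_phosphosite(peptide, sequence, window=7):
    """Map a peptide with one lower-case phosphosite onto a protein sequence.

    Returns (site_label, flanking_window), or None if the peptide is not found.
    """
    lower_positions = [i for i, aa in enumerate(peptide) if aa.islower()]
    if len(lower_positions) != 1:
        raise ValueError("expected exactly one lower-case residue in the peptide")
    offset = lower_positions[0]

    start = sequence.find(peptide.upper())
    if start == -1:
        return None  # peptide does not occur in the reference sequence

    pos = start + offset  # 0-based index of the phosphosite in the protein
    residue = sequence[pos]
    left = sequence[max(0, pos - window):pos].rjust(window, "-")
    right = sequence[pos + 1:pos + 1 + window].ljust(window, "-")
    return f"{residue}{pos + 1}", left + residue.lower() + right


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # toy sequence, not a real protein
print(locate_phosphosite("QRQIsFVK", sequence))
# ('S13', 'IAKQRQIsFVKSHFS')
```

With window=7 the returned string is 15 residues long: seven on each side of the central site.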

get_experiment()#

Return the mapped experiment dataframe

get_number_missed_peptides()#

Returns number of missed peptides

get_number_missed_sites()#

Returns number of missed sites

get_reason_for_unmapped()#

Returns dataframe of unmapped sites with reasons for being unmapped

Returns:
errorspandas Series

Series with counts of each error type

percpandas Series

Series with percentage of each error type

get_sequence(accession)#

Gets the sequence that matches the given accession

save_experiment(return_stats=True, return_lost_sites=True)#

Given a completed mapping process, save the resulting experiment and reporting files (if desired) to the output directory.

Parameters:
return_statsbool

Whether to save a mapping statistics file. Default is True.

return_lost_sitesbool

Whether to save csv file containing any sites/peptides that were removed during the mapping process. Default is True.

set_data_columns(data_columns)#

Identifies which columns in the experiment should be used as data columns. If data_columns is provided, then ‘data:’ is added to the front and the experiment dataframe columns are renamed. Otherwise, the function will look for columns with ‘data:’ in front and assign these to the data_columns attribute.
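The two branches described for set_data_columns can be sketched on a plain list of column names. The helper `resolve_data_columns` below is hypothetical and works on lists rather than on a dataframe; it only illustrates the ‘data:’ prefix convention.

```python
def resolve_data_columns(all_columns, data_columns=None):
    """Return (column names after renaming, list of data columns).

    If data_columns is given, each listed column gets a 'data:' prefix.
    Otherwise, columns that already start with 'data:' are collected.
    """
    if data_columns:
        rename = {col: f"data:{col}" for col in data_columns}
        renamed = [rename.get(col, col) for col in all_columns]
        return renamed, list(rename.values())
    found = [col for col in all_columns if col.startswith("data:")]
    return list(all_columns), found


print(resolve_data_columns(["acc", "pep", "t0", "t5"], ["t0", "t5"]))
# (['acc', 'pep', 'data:t0', 'data:t5'], ['data:t0', 'data:t5'])
print(resolve_data_columns(["acc", "data:t0"]))
# (['acc', 'data:t0'], ['data:t0'])
```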

Functions for Activity Calculation#

The “KinaseActivity” class#

class kstar.calculate.KinaseActivity(evidence, odir, name='experiment', data_columns=None, phospho_type='Y', kinases=None, network_dir=None, logger=None, network_name=None, seed=None)#

KinaseActivity calculates the estimated activity of kinases given an experiment using the hypergeometric distribution. The hypergeometric test compares the number of protein sites found to be active in the evidence with the number of protein sites attributed to a kinase in a provided network.
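The enrichment idea can be made concrete with an upper-tail hypergeometric probability: given M sites in a network, n of which are attributed to a kinase, and N sites observed as evidence, how likely is an overlap of at least k sites by chance? The sketch below uses only the standard library. The function name `hypergeom_sf` and the choice of which quantity plays which role are assumptions for illustration; the source does not spell out how KSTAR parameterizes the test.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n successes, N draws)."""
    upper = min(n, N)
    total = comb(M, N)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    favorable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favorable / total


# Toy numbers: 1000 sites in the network, 50 attributed to the kinase,
# 100 sites observed in the sample, 12 of which belong to the kinase.
p_value = hypergeom_sf(k=12, M=1000, n=50, N=100)
print(f"{p_value:.3g}")
```

A small p-value here means the kinase's substrates are over-represented among the observed sites relative to chance.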

Parameters:
evidencepandas df

a dataframe that contains (at minimum, but can have more) data columns as evidence to use in analysis, plus the KSTAR_ACCESSION and KSTAR_SITE columns

odirstring

output directory where results will be saved

namestring

name of the experiment, used to label output files. Default is ‘experiment’

kinaseslist or None

list of kinases to predict activity for. If None, will use all kinases found in the provided networks

network_dirstring or None

directory where pruned KSTAR networks are located. If None, will use config.NETWORK_DIR. If network files were downloaded with config.install_network_files(), this directory should already be set and does not need to be provided.

network_namestring or None

name of the network to use. If None, will use the default network name from config based on phospho_type

data_columns: list

list of the columns containing the abundance values, which will be used to determine which sites will be used as evidence for activity prediction in each sample

phospho_type: string, either ‘Y’ or ‘ST’

indicates the phospho modification of interest

loggerLogger object or None

keeps track of kstar analysis, including any errors that occur. If None, a new logger will be created automatically

min_dataset_size_for_pregenerated: int

minimum dataset size required to use pregenerated random activities (by number of sites used as evidence). Default is 150

max_diff_from_pregenerated: float

maximum percent difference between dataset size and pregenerated random activity size to use pregenerated data. Default is 0.20 (i.e. 20%)

seedint or None

random seed to use for random number generation. If None, seed will be set to current time

Attributes:
——————-
Upon Initialization
——————-
evidence: pandas dataframe

inputted dataset used for kinase activity calculation

networks: dict

dictionary of pruned kinase substrate networks, with keys as network ids and values as pandas dataframes

data_columns: list

list of columns containing abundance values, which will be used to determine which sites will be used as evidence. If the inputted data_columns parameter was None, this list includes every column in evidence prefixed by ‘data:’

loggerLogger object

keeps track of kstar analysis, including any errors that occur

aggregate: string

the type of aggregation to use when determining binary evidence, either ‘count’ or ‘mean’. Default is ‘count’.

run_date: string

indicates the date that kinase activity object was initialized

random_seed: int

random seed used for activity calculation. Only relevant if not using pregenerated random activities

network_info: dict

metadata about the loaded networks

network_hash: string

unique identifier for the loaded networks

kinases: list

list of kinases to predict activity for

———————————
After Hypergeometric Calculations
———————————
real_enrichment: pandas dataframe

p-values obtained for all pruned networks indicating statistical enrichment of a kinase’s substrates for each network, based on hypergeometric tests

activities: pandas dataframe

median p-values obtained from the real_enrichment object for each experiment/kinase

agg_activities: pandas dataframe
———————————–
After Random Enrichment Calculation
———————————–
random_experiments: pandas dataframe

contains information about the sites randomly sampled for each random experiment. Will only be saved if save_random_experiments=True.

random_enrichment: KinaseActivity object

KinaseActivity object containing random activities predicted from each of the random experiments

data_columns_from_scratch: list

list of data columns which generated random activities from scratch

data_columns_with_pregenerated: list

list of data columns which generated random activities from pregenerated random activities

—————————
After Mann Whitney Analysis
—————————
activities_mann_whitney: pandas dataframe

p-values obtained from comparing the real distribution of p-values to the distribution of p-values from random datasets, based on the Mann-Whitney U test

fpr_mann_whitney: pandas dataframe

false positive rates for predicted kinase activities
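Two of the attributes above summarize distributions of p-values: activities takes the median across pruned networks, and activities_mann_whitney compares the real p-values with those from random datasets. The sketch below illustrates both steps with the standard library. The function `mann_whitney_less` is hypothetical; it uses a normal approximation without tie correction and tests whether the first sample tends to be smaller, which is an assumption about the direction KSTAR tests.

```python
from math import erfc, sqrt
from statistics import median


def mann_whitney_less(x, y):
    """One-sided p-value that values in x tend to be smaller than values in y."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by tied values
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    return 0.5 * erfc(-z / sqrt(2))  # standard normal CDF evaluated at z


real = [0.001, 0.002, 0.003, 0.004, 0.005]  # toy p-values across pruned networks
random_ = [0.2, 0.3, 0.4, 0.5, 0.6]         # toy p-values from random experiments
print(median(real))                          # 0.003
print(mann_whitney_less(real, random_) < 0.01)  # True
```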

Methods

calculate_kinase_activities([agg, ...])

Calculates the combined activity of experiments, using a threshold value to determine whether an experiment sees a site or not. Use agg='mean' to threshold on values (NA values are dropped from consideration), or agg='count' to treat a site as present if its value is not NA.

check_data_columns([min_evidence_size])

Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence (or minimum set by min_evidence_size).

create_binary_evidence([agg, threshold, ...])

Returns a binary evidence data frame according to the parameters passed in for method for aggregating duplicates and considering whether a site is included as evidence or not

get_allowable_threshold([greater, agg, ...])

Determine the minimum/maximum threshold that still results in all data columns having evidence

get_param_dict([params_to_ignore])

Get a dictionary of important parameters needed to reinstantiate the KSTAR object

get_random_activities([...])

Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.

get_run_information_content()

Retrieve network information from RUN_INFORMATION.txt based on phospho_type.

make_dotplot([include_evidence_sizes])

Create a dotplot of the kinase activity results

make_summary_pdf([regenerate_plots])

Create a summary PDF of the kinase activity results

recommend_threshold([desired_evidence_size, ...])

Recommend a threshold, one based on desired evidence size and one based on maximum average Jaccard similarity between samples.

set_data_columns([data_columns])

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, then data_columns is set to all columns that start with data:

test_threshold(threshold[, agg, greater, ...])

Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).

test_threshold_range(min_threshold, ...[, ...])

Given a range of threshold values, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment) and Jaccard similarity between samples at each threshold.

calculate_kinase_activities(agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, PROCESSES=1)#

Calculates the combined activity of experiments, using a threshold value to determine whether an experiment sees a site or not.

To use values, pass ‘mean’ as agg; mean aggregation drops NA values from consideration.

To use counts, pass ‘count’ as agg; a site is counted as present if its value is not NA.

Parameters:
data_columnslist

columns that represent experimental results. If None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns

thresholdfloat

threshold value used to filter rows

evidence_sizeint or None

the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample (or lowest if greater=False).

agg{‘count’, ‘mean’}

method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and counts the values that are non-NaN; ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide. NA values are dropped from consideration.

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

min_evidence_sizeint

minimum number of sites required for a data column to be considered for activity calculation

PROCESSESint

number of processes to use for multiprocessing
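When evidence_size is given, the threshold is replaced by a top-N selection within each sample. The sketch below shows that selection for one sample; the function `top_n_sites` and the dictionary input are hypothetical and do not reflect KSTAR's internal data structures.

```python
def top_n_sites(abundance, evidence_size, greater=True):
    """Pick the evidence_size sites with the greatest (or lowest) abundance.

    abundance: dict mapping site -> value, with None marking a missing value.
    """
    observed = {site: value for site, value in abundance.items() if value is not None}
    ranked = sorted(observed, key=lambda site: (observed[site], site), reverse=greater)
    return set(ranked[:evidence_size])


sample = {"A_Y10": 5.0, "B_Y20": 1.0, "C_Y30": None, "D_Y40": 3.0}
print(sorted(top_n_sites(sample, evidence_size=2)))                 # ['A_Y10', 'D_Y40']
print(sorted(top_n_sites(sample, evidence_size=2, greater=False)))  # ['B_Y20', 'D_Y40']
```

Fixing the evidence size this way gives every sample the same number of sites, which a single abundance threshold does not guarantee.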

check_data_columns(min_evidence_size=0)#

Checks data columns to make sure column is in evidence and that evidence filtered on that data column has at least one point of evidence (or minimum set by min_evidence_size). Removes all columns that do not meet criteria

Parameters:
min_evidence_sizeint

minimum number of sites required for a data column to be considered for activity calculation

create_binary_evidence(agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, drop_empty_columns=True)#

Returns a binary evidence data frame according to the parameters passed in for method for aggregating duplicates and considering whether a site is included as evidence or not

Parameters:
thresholdfloat

threshold value used to filter rows

evidence_size: None or int

the number of sites to use for prediction for each sample. If a value is provided, this will override the threshold, and will instead obtain the N sites with the greatest abundance within each sample.

agg{‘count’, ‘mean’}

method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and counts the values that are non-NaN; ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide. NA values are dropped from consideration.

greater: Boolean

whether to keep sites that have a numerical value >=threshold (TRUE, default) or <=threshold (FALSE)

min_evidence_sizeint

minimum number of sites required for a data column to be considered for activity calculation

drop_empty_columnsbool

whether to drop data columns with fewer than min_evidence_size sites

Returns:
evidence_binarypd.DataFrame

Matches the evidence dataframe of the kinact object, but with 0 or 1 if a site is included or not. This is uniquified and rows that are never used are removed.
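The interplay of agg, threshold, and greater can be sketched for a single sample as follows. This is an illustration only: `binary_evidence` is a hypothetical helper working on (site, value) pairs, and unlike the method described above it keeps unused sites with a 0 instead of removing them.

```python
from statistics import mean


def binary_evidence(rows, agg="mean", threshold=1.0, greater=True):
    """Collapse duplicate sites and mark each as used (1) or not used (0).

    rows: list of (site, value) pairs, with None marking a missing value.
    """
    grouped = {}
    for site, value in rows:
        grouped.setdefault(site, [])
        if value is not None:  # missing values are dropped from consideration
            grouped[site].append(value)

    evidence = {}
    for site, values in grouped.items():
        if not values:
            evidence[site] = 0
            continue
        aggregated = len(values) if agg == "count" else mean(values)
        keep = aggregated >= threshold if greater else aggregated <= threshold
        evidence[site] = int(keep)
    return evidence


rows = [("P1_Y10", 2.0), ("P1_Y10", 0.5), ("P2_Y5", None), ("P3_Y7", 0.4)]
print(binary_evidence(rows, agg="mean", threshold=1.0))
# {'P1_Y10': 1, 'P2_Y5': 0, 'P3_Y7': 0}
print(binary_evidence(rows, agg="count", threshold=1.0))
# {'P1_Y10': 1, 'P2_Y5': 0, 'P3_Y7': 1}
```

With ‘mean’, P3_Y7 falls below the threshold; with ‘count’, it is included because it has one non-missing measurement.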

get_allowable_threshold(greater=True, agg='mean', min_evidence_size=20, allow_column_loss=False)#

Determine the minimum/maximum threshold that still results in all data columns having evidence

Parameters:
greater: bool

whether to use sites greater (True) or less (False) than the threshold

agg: str

how to combine sites with multiple instances in experiment

min_evidence_size: int

minimum number of sites required for a data column to be considered for activity calculation

Returns:
allowable threshold: float

maximum or minimum threshold that still results in all data columns having evidence (or at least one if min_evidence_size = None)
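The allowable threshold can be reasoned about per sample: with greater=True, the largest threshold that still leaves k sites in a sample is that sample's k-th largest value, and the threshold that works for every sample is the smallest of those. The sketch below encodes this reasoning. The function `allowable_threshold` is hypothetical and assumes inclusive comparisons (>= or <=), as described for the greater parameter elsewhere in this reference.

```python
def allowable_threshold(columns, min_evidence_size=20, greater=True):
    """Most extreme threshold that keeps at least min_evidence_size sites per sample.

    columns: dict mapping sample name -> list of non-missing values.
    """
    k = min_evidence_size or 1  # None is treated as "at least one site"
    limits = []
    for name, values in columns.items():
        if len(values) < k:
            raise ValueError(f"{name} has fewer than {k} values")
        ranked = sorted(values, reverse=greater)
        limits.append(ranked[k - 1])  # k-th largest (or k-th smallest) value
    return min(limits) if greater else max(limits)


columns = {"a": [5, 4, 3, 2, 1], "b": [9, 8, 2, 1, 0]}
print(allowable_threshold(columns, min_evidence_size=3))                 # 2
print(allowable_threshold(columns, min_evidence_size=3, greater=False))  # 3
```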

get_param_dict(params_to_ignore=['network_sizes', 'pregenerated_experiments_path', 'mann_whitney'])#

Get a dictionary of important parameters needed to reinstantiate the KSTAR object

get_random_activities(num_random_experiments=150, use_pregenerated_random_activities=None, default_pregen_only=False, save_new_random_activities=None, custom_pregenerated_path=None, save_random_experiments=None, require_pregenerated=False, max_diff_from_pregenerated=0.25, min_dataset_size_for_pregenerated=150, PROCESSES=1)#

Generate random experiments and calculate kinase activities. Either uses pre-generated activity lists or generates new random experiments based on the provided parameters.

Parameters:
num_random_experimentsint, optional

Number of random experiments to generate, by default 150.

use_pregenerated_random_activitiesbool, optional

Whether to use pre-generated data, by default None and will use configuration value.

default_pregen_onlybool, optional

Whether to only use the default pregenerated data found in the network directory folder, by default False.

save_new_random_activitiesbool, optional

Whether to save new pregenerated data, by default None and will use configuration value

custom_pregenerated_pathstr, optional

Directory to save new precomputed data, by default None and will use configuration value.

save_random_experimentsbool, optional

Whether to save the generated random experiments, by default None and will use configuration value.

require_pregeneratedbool, optional

Whether to require using pre-generated data for all datasets, by default False. This will ensure fast run times, but may result in some datasets not being processed if they do not have matching pre-generated data (most commonly due to smaller samples).

max_diff_from_pregeneratedfloat, optional

Maximum allowed difference in size between the dataset and pregenerated data to use pregenerated data, by default 0.25.

min_dataset_size_for_pregeneratedint, optional

Minimum dataset size required to use pregenerated data, by default 150.

PROCESSESint, optional

Number of processes to use for parallel computation, by default 1.
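The two size-related parameters, min_dataset_size_for_pregenerated and max_diff_from_pregenerated, suggest a matching rule along the following lines. This is a sketch under stated assumptions: the function `closest_pregenerated_size` is hypothetical, and measuring the difference relative to the dataset size (and not the pregenerated size) is a guess, since the source only says "difference in size".

```python
def closest_pregenerated_size(
    dataset_size,
    pregenerated_sizes,
    max_diff_from_pregenerated=0.25,
    min_dataset_size_for_pregenerated=150,
):
    """Return the pregenerated size to reuse, or None if nothing is close enough."""
    if dataset_size < min_dataset_size_for_pregenerated:
        return None  # dataset too small to use pregenerated activities
    best = min(pregenerated_sizes, key=lambda size: abs(size - dataset_size), default=None)
    if best is None:
        return None
    relative_diff = abs(best - dataset_size) / dataset_size
    return best if relative_diff <= max_diff_from_pregenerated else None


available = [200, 500, 1000]  # hypothetical sizes with pregenerated activities
print(closest_pregenerated_size(400, available))   # 500
print(closest_pregenerated_size(100, available))   # None (below the minimum size)
print(closest_pregenerated_size(2000, available))  # None (more than 25% away)
```

A dataset that returns None here would fall back to generating random experiments from scratch, or be skipped if require_pregenerated is True.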

get_run_information_content()#

Retrieve network information from RUN_INFORMATION.txt based on phospho_type.

Reads the RUN_INFORMATION.txt file from the appropriate network directory based on the phospho_type (‘Y’ or ‘ST’). The file contains network configuration details including unique ID, date, network specifications, and compendia counts.

Returns:
Contents of RUN_INFORMATION.txt if found, or ‘RUN_INFORMATION.txt file not found.’ if the file doesn’t exist.

make_dotplot(include_evidence_sizes=True, **kwargs)#

Create a dotplot of the kinase activity results

Parameters:
include_evidence_sizesbool

Whether to include evidence sizes in the dotplot

**kwargs

Additional keyword arguments to pass to the DotPlot initialization and make_complete_dotplot methods

make_summary_pdf(regenerate_plots=False)#

Create a summary PDF of the kinase activity results

Parameters:
regenerate_plotsbool

Whether to regenerate plots even if they already exist

recommend_threshold(desired_evidence_size=None, max_similarity=0.7, consider_size=True, consider_similarity=True, min_threshold=-inf, max_threshold=inf, step=0.1, pick_best_size_by='median', pick_best_similarity_by='max', greater=True, agg='mean', min_evidence_size=20, allow_column_loss=False)#

Recommend a threshold, one based on desired evidence size and one based on maximum average Jaccard similarity between samples. Will report the characteristics of the resulting evidences for both thresholds

Parameters:
desired_evidence_size: int

target evidence size to use when recommending threshold

max_similarity: float

maximum average Jaccard similarity between samples to use when recommending threshold. Default is 0.7

consider_size: bool

whether to consider evidence size when recommending threshold

consider_similarity: bool

whether to consider similarity between data columns when recommending threshold

min_threshold: float

minimum threshold to consider when recommending threshold. Must be provided if greater = True. Default is -infinity

max_threshold: float

maximum threshold to consider when recommending threshold. Must be provided if greater = False. Default is infinity

step: float

step size to use when iterating through thresholds

pick_best_size_by: str

method to use when aggregating evidence size values across samples, recommended to be either ‘min’, ‘max’, or ‘median’

pick_best_similarity_by: str

method to use when aggregating Jaccard similarity values across samples, recommended to be either ‘max’ or ‘median’

greater: bool

whether to use sites greater (True) or less (False) than the threshold

agg: str

how to combine sites with multiple instances in experiment

min_evidence_size: int

minimum number of sites required for a data column to be considered for activity calculation

allow_column_loss: bool

whether to allow some data columns to be lost when recommending threshold based on size. If False, will raise an error if min/max thresholds provided result in loss of any data columns

Returns:
float

recommended threshold value
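The two selection criteria described above can be sketched in plain Python. This is only an illustration and not KSTAR's implementation: the toy `data` dictionary, the helper names (`evidence_sets`, `jaccard`, `recommend_by_size`, `recommend_by_similarity`), and the rule of taking the lowest threshold that satisfies the similarity limit are all assumptions made for the example.

```python
from itertools import combinations
from statistics import median

# Toy input (hypothetical): sample name -> {site: quantification}
data = {
    "data:A": {"s1": 0.2, "s2": 1.5, "s3": 2.4, "s4": 3.1},
    "data:B": {"s1": 1.1, "s2": 0.4, "s3": 2.6, "s4": 3.3},
}
thresholds = [0.0, 1.0, 2.0, 3.0]


def evidence_sets(data, threshold, greater=True):
    """Return, per sample, the set of sites that pass the threshold."""
    if greater:
        passes = lambda v: v > threshold
    else:
        passes = lambda v: v < threshold
    return {
        sample: {site for site, v in sites.items() if passes(v)}
        for sample, sites in data.items()
    }


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_by_size(data, thresholds, desired_size):
    """Pick the threshold whose median evidence size is closest to desired_size."""
    def gap(t):
        sizes = [len(ev) for ev in evidence_sets(data, t).values()]
        return abs(median(sizes) - desired_size)
    return min(thresholds, key=gap)


def recommend_by_similarity(data, thresholds, max_similarity):
    """Pick the lowest threshold whose largest pairwise Jaccard is <= max_similarity."""
    for t in sorted(thresholds):
        ev = evidence_sets(data, t)
        sims = [jaccard(ev[a], ev[b]) for a, b in combinations(ev, 2)]
        if max(sims, default=0.0) <= max_similarity:
            return t
    return None  # no threshold in the range satisfies the limit


size_based = recommend_by_size(data, thresholds, desired_size=3)
similarity_based = recommend_by_similarity(data, thresholds, max_similarity=0.7)
```

In this sketch the median and the maximum play the roles of `pick_best_size_by='median'` and `pick_best_similarity_by='max'`; how KSTAR combines the two criteria into the single returned value is not shown here.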

set_data_columns(data_columns=None)#

Sets the data columns to use in the kinase activity calculation. If data_columns is None or an empty list, data_columns is set to all columns that start with ‘data:’.

Checks all set columns to make sure they are valid after filtering evidence.
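The default column selection can be sketched as follows. This is a minimal illustration under assumptions: the helper name `default_data_columns` is hypothetical, and the validity check shown (that each requested column exists) is a stand-in for whatever checks KSTAR actually performs.

```python
def default_data_columns(columns, data_columns=None):
    """Return the requested columns, or every 'data:' column if none are given."""
    if not data_columns:  # covers both None and an empty list
        return [c for c in columns if c.startswith("data:")]
    # Assumed validity check: every requested column must be present
    missing = [c for c in data_columns if c not in columns]
    if missing:
        raise ValueError(f"columns not found in experiment: {missing}")
    return list(data_columns)


columns = ["KSTAR_ACCESSION", "KSTAR_SITE", "data:control", "data:treated"]
selected = default_data_columns(columns)
subset = default_data_columns(columns, ["data:treated"])
```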

test_threshold(threshold, agg='mean', greater=True, plot=False, return_evidence_sizes=False, min_evidence_size=0)#

Given a threshold value, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment).

Parameters:
threshold: float

cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.

agg: str

how to combine sites with multiple instances in experiment

greater: bool

whether to use sites greater (True) or less (False) than the threshold

plot: bool

whether to plot a histogram of the evidence sizes used and heatmap of Jaccard similarity between samples

return_evidence_sizes: bool

indicates whether to return the evidence sizes for all samples or not

min_evidence_size: int

minimum number of sites required for a data column to be considered for activity calculation

Returns:
Outputs the minimum, maximum, and median evidence sizes across all samples. May return evidence sizes of all samples as pandas series
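As a rough illustration of what an evidence-size summary looks like for a single threshold, the sketch below counts passing sites per sample and reports the minimum, maximum, and median. The toy `data` dictionary and the helper name `evidence_sizes` are assumptions for the example; KSTAR itself works on the experiment dataframe.

```python
from statistics import median

# Toy input (hypothetical): sample name -> {site: quantification}
data = {
    "data:A": {"s1": 0.2, "s2": 1.5, "s3": 2.4},
    "data:B": {"s1": 1.1, "s2": 0.4, "s3": 0.9},
    "data:C": {"s1": 0.1, "s2": 0.3, "s3": 0.2},
}


def evidence_sizes(data, threshold, greater=True, min_evidence_size=0):
    """Count, per sample, the sites that pass the threshold."""
    sizes = {}
    for sample, sites in data.items():
        if greater:
            n = sum(1 for v in sites.values() if v > threshold)
        else:
            n = sum(1 for v in sites.values() if v < threshold)
        if n >= min_evidence_size:  # samples with too little evidence are left out
            sizes[sample] = n
    return sizes


sizes = evidence_sizes(data, threshold=1.0, min_evidence_size=1)
summary = (min(sizes.values()), max(sizes.values()), median(sizes.values()))
```

Here `data:C` has no site above the threshold, so it is excluded once `min_evidence_size=1` is applied.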
test_threshold_range(min_threshold, max_threshold, step=0.1, agg='mean', greater=True, min_evidence_size=0, desired_evidence_size=None, show_recommended=False)#

Given a range of threshold values, calculate the distribution of evidence sizes (i.e. number of sites used in prediction for each sample in the experiment) and Jaccard similarity between samples at each threshold

Parameters:
min_threshold: float

minimum cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.

max_threshold: float

maximum cutoff for inclusion as evidence for prediction. If greater = True, sites with quantification greater than the threshold are used as evidence.

step: float

step size to use when iterating through threshold range

agg: str

how to combine sites with multiple instances in experiment

greater: bool

whether to use sites greater (True) or less (False) than the threshold

min_evidence_size: int

minimum number of sites required for a data column to be considered for activity calculation

desired_evidence_size: int or None

target evidence size to use for plotting. If None, will use 150 for phospho_type ‘Y’ and 1500 for phospho_type ‘ST’

show_recommended: bool

whether to show recommended evidence size and similarity lines on the plots

Master Functions for Running KSTAR Pipeline#

kstar.calculate.Mann_Whitney_analysis(kinact_dict, PROCESSES=1)#

For a kinact_dict, where random generation and activity have already been run for the phospho_types of interest, this will calculate the Mann-Whitney U test for comparing the array of p-values for real data to those of random data, across the number of networks used. It will also calculate the false positive rate for a p-value, given observations of a random bootstrapping analysis.

Parameters:
kinact_dict: dictionary

A dictionary of kinact objects, with keys ‘Y’ and/or ‘ST’

PROCESSES: int

number of processes to use for parallel computation, by default 1.
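The two quantities named above can be illustrated with a small pure-Python sketch. It is not KSTAR's code: the function names are hypothetical, the U statistic is computed by direct pair counting, and the false positive rate is taken here to be the fraction of random results at least as small as the observed one, which is an assumption about the definition.

```python
def mann_whitney_u(x, y):
    """Count (x_i, y_j) pairs with x_i < y_j, scoring ties as 0.5."""
    total = 0.0
    for a in x:
        for b in y:
            if a < b:
                total += 1.0
            elif a == b:
                total += 0.5
    return total


def empirical_fpr(observed_p, random_ps):
    """Fraction of random p-values that are <= the observed p-value."""
    return sum(1 for p in random_ps if p <= observed_p) / len(random_ps)


# Hypothetical per-network p-values for one kinase in one sample
real_p = [0.01, 0.02, 0.03]
random_p = [0.2, 0.02, 0.5, 0.4]

u_stat = mann_whitney_u(real_p, random_p)  # at most len(real_p) * len(random_p)
fpr = empirical_fpr(0.05, [0.2, 0.01, 0.5, 0.4])
```

A U value close to the number of pairs means the real p-values are consistently smaller than the random ones. Turning U into a significance value requires the full test, for example from a statistics library.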

kstar.calculate.enrichment_analysis(experiment, odir, name='experiment', phospho_types=['Y', 'ST'], data_columns=None, agg='mean', threshold=1.0, evidence_size=None, greater=True, min_evidence_size=0, allow_column_loss=True, kinases=None, PROCESSES=1, **kwargs)#

Function to establish a kstar KinaseActivity object from an experiment with an activity log, add the networks, and then calculate, aggregate, and summarize the hypergeometric enrichment into a final activity object. Should be followed by randomized_analysis, then Mann_Whitney_analysis.

Parameters:
experiment: pandas df

experiment dataframe that has been mapped, includes KSTAR_SITE, KSTAR_ACCESSION, etc.

odirstr

path to where you would like logger and output saved

namestr

name to use for outputs

phospho_types: {[‘Y’, ‘ST’], [‘Y’], [‘ST’]}

Which substrate/kinase-type to run activity for: Both [‘Y’, ‘ST’] (default), Tyrosine [‘Y’], or Serine/Threonine [‘ST’]

data_columnslist

columns that represent experimental results. If None, takes the columns that start with ‘data:’ in experiment. Pass this value in as a list if seeking to calculate on fewer than all available data columns

agg{‘count’, ‘mean’}

method to use when aggregating duplicate substrate-sites. ‘count’ combines multiple representations and adds if values are non-NaN; ‘mean’ uses the mean value of numerical data from multiple representations of the same peptide.

NA values are dropped from consideration.

thresholdfloat or dict

threshold value used to filter rows. If provided as a dictionary, keys should be ‘Y’ and/or ‘ST’ with float values for each phospho_type.

evidence_sizeint or dict

size of evidence to use for filtering. If provided as a dictionary, keys should be ‘Y’ and/or ‘ST’ with int values for each phospho_type. Will override threshold if both are provided.

min_evidence_sizeint

minimum size of evidence to run kinase activity on. Default 0, meaning any data column with at least one site will be run on

greater: Boolean

whether to keep sites that have a numerical value >=threshold (True, default) or <=threshold (False)

PROCESSESint

number of processes to use for parallel computation, by default 1.

**kwargs

Additional keyword arguments to pass to the KinaseActivity class

Returns:
kinactDict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run (‘Y’ and ‘ST’). Includes the activities dictionary (see calculate_kinase_activities), aggregation of activities across networks (see aggregate activities), and the activity summary (see summarize_activities).
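The call order described for enrichment_analysis (followed by randomized_analysis, then Mann_Whitney_analysis) can be sketched as below. The wrapper name `run_three_steps` is hypothetical, and the `calc` argument is assumed to be the `kstar.calculate` module. It is passed in as a parameter so the sketch can be exercised with a stand-in object when KSTAR is not installed. Only parameters listed in the signatures above are used.

```python
def run_three_steps(calc, experiment, odir, name="experiment", processes=1):
    """Run enrichment, randomization, then the Mann-Whitney step, in that order."""
    kinact_dict = calc.enrichment_analysis(
        experiment,
        odir,
        name=name,
        phospho_types=["Y"],
        threshold=1.0,
        PROCESSES=processes,
    )
    # Documented to return None; the kinact objects are updated in place
    calc.randomized_analysis(kinact_dict, PROCESSES=processes)
    calc.Mann_Whitney_analysis(kinact_dict, PROCESSES=processes)
    return kinact_dict


class _StandIn:
    """Records call order; replaces kstar.calculate for demonstration only."""

    def __init__(self):
        self.calls = []

    def enrichment_analysis(self, experiment, odir, **kwargs):
        self.calls.append("enrichment_analysis")
        return {"Y": "kinact placeholder"}

    def randomized_analysis(self, kinact_dict, **kwargs):
        self.calls.append("randomized_analysis")

    def Mann_Whitney_analysis(self, kinact_dict, PROCESSES=1):
        self.calls.append("Mann_Whitney_analysis")


stand_in = _StandIn()
result = run_three_steps(stand_in, experiment=None, odir="out")
```

With KSTAR installed, the same wrapper would presumably be called as `run_three_steps(kstar.calculate, experiment, odir)` with a mapped experiment dataframe.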

kstar.calculate.randomized_analysis(kinact_dict, **kwargs)#

Perform randomized analysis on kinase activity data.

Parameters:
kinact_dictdict

Dictionary containing kinase activity data.

kwargskeyword arguments

Additional keyword arguments for random activity generation passed to KinaseActivity.get_random_activities method.

These can include:

num_random_experiments: int, optional

Number of random experiments to generate, by default 150.

use_pregen_databool, optional

Whether to use pre-generated data, by default False.

max_diff_from_pregeneratedfloat, optional

Maximum fractional difference allowed from pre-generated data, by default 0.25.

min_dataset_size_for_pregeneratedint, optional

Minimum dataset size to use pre-generated data, by default 150.

default_pregen_onlybool, optional

Whether to only use default pre-generated data (and not any activities in custom path), by default False.

require_pregeneratedbool, optional

Whether to require pre-generated data, by default False. This will ensure fast performance, but may result in some data columns being dropped

custom_pregenerated_pathstr, optional

Directory to save new precomputed data, by default None.

save_random_experimentsbool, optional

Whether to save the generated random experiments, by default None.

save_new_random_activitiesbool, optional

Whether to save new precomputed data, by default None.

PROCESSESint, optional

Number of processes to use for parallel computation, by default 1.

Returns:
None
kstar.calculate.run_kstar_analysis(experiment, odir, name='experiment', phospho_types=['Y', 'ST'], data_columns=None, threshold=1.0, evidence_size=None, greater=True, save_output=True, PROCESSES=1, **kwargs)#

Given a mapped experiment, run the KSTAR analysis pipeline.

Parameters:
experiment: DataFrame

Mapped experiment data

odir: string

Output directory

name: string

Name of the experiment

phospho_types: list

List of phospho types to analyze

network_dir: string

Directory containing network data

data_columns: list

Columns to use from the data

agg: string

Aggregation method

threshold: float

Threshold for analysis

evidence_size: int

Size of evidence

greater: bool

Whether to use greater comparison

PROCESSES: int

Number of processes to use

**kwargs

Additional keyword arguments for enrichment_analysis, randomized_analysis, and save_kstar functions.

Functions for Saving and Loading KSTAR results#

kstar.calculate.from_kstar(name, odir, ftype='tsv')#

Given the name and output directory of a saved kstar analysis, load the parameters and minimum dataframes needed for reinstantiating a kinact object. This minimum list will allow you to repeat normalization or Mann-Whitney at a different false positive rate threshold and plot results.

Parameters:
name: string

The name used when saving activities and mapped data

odir: string

Output directory of saved files and parameter pickle

kstar.calculate.from_kstar_nextflow(name, odir, log=None)#

Given the name and output directory of a saved kstar analysis from the nextflow pipeline, load the results into a new kinact object with the minimum dataframes required for analysis (binary experiment, hypergeometric activities, normalized activities, Mann-Whitney activities)

Parameters:
name: string

The name used when saving activities and mapped data

odir: string

Output directory of saved files

log: logger

logger used when loading nextflow data into kinase activity object. If not provided, new logger will be created.

kstar.calculate.save_kstar(kinact_dict, name, odir, minimal=True, ftype='tsv', param_format='json')#

Having calculated kinase activities (run_kstar_analysis), save each of the important dataframes, minimizing the storage needed to get back to a rebuilt version for plotting results and analysis. For each phospho_type in the kinact_dict, at a minimum, this will save the binarized evidence, Mann-Whitney activities and fpr dataframes, and parameters used during the run. If you would like to save all files (hypergeometric and random enrichment intermediate files), set minimal = False.

Parameters:
kinact_dict: dictionary of Kinase Activity Objects

Outer keys are the phosphoTypes run (‘Y’ and ‘ST’). Includes the activities dictionary (see calculate_kinase_activities), aggregation of activities across networks (see aggregate activities), and the activity summary (see summarize_activities).

name: string

The name to use when saving activities

odir: string

Output directory to save files and pickle to

minimal: bool

Whether to save only minimal files or all intermediate files

ftype: {‘tsv’, ‘csv’}

Format to save dataframes in, either tsv or csv

param_format: {‘pickle’, ‘json’}

Format to save parameter dictionary in, either pickle or json. Json is recommended for easier human readability

Returns:
Nothing
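The difference between the two `param_format` options can be illustrated with the standard library alone. This is a sketch under assumptions, not KSTAR's saving code: the helper name `save_params` and the file names it writes are hypothetical.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_params(params, odir, name, param_format="json"):
    """Write a parameter dictionary as JSON (human readable) or pickle (binary)."""
    odir = Path(odir)
    odir.mkdir(parents=True, exist_ok=True)
    if param_format == "json":
        path = odir / f"{name}_params.json"  # hypothetical file name
        path.write_text(json.dumps(params, indent=2))
    elif param_format == "pickle":
        path = odir / f"{name}_params.p"  # hypothetical file name
        with open(path, "wb") as handle:
            pickle.dump(params, handle)
    else:
        raise ValueError("param_format must be 'json' or 'pickle'")
    return path


# Round trip in a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    saved = save_params({"threshold": 1.0, "greater": True}, tmp, "experiment")
    reloaded = json.loads(saved.read_text())
```

JSON output can be opened in any text editor, which matches the readability recommendation above. Pickle can store arbitrary Python objects but is not human readable.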

Plotting/Analysis Functions#

The “DotPlot” class#

class kstar.plot.DotPlot(values, fpr, alpha=0.05, inclusive_alpha=True, binary_sig=True, dotsize=5, colormap={0: '#6b838f', 1: '#FF3300'}, facecolor='white', legend_title='-log10(p-value)', size_number=5, size_color='gray', color_title='Significant', markersize=10, legend_distance=1.0, figsize=(4, 8), title=None, xlabel=True, ylabel=True, x_label_dict=None, kinase_dict=None)#

The DotPlot class is used for plotting dotplots, with the option to add clustering and context plots. The size of the dots is based on the values dataframe, where the area of each dot is value * dotsize.

Parameters:
values: pandas DataFrame instance

values to plot

fprpandas DataFrame instance

false positive rates associated with values being plotted

alpha: float, optional

fpr value that defines the significance cutoff to use when plotting. default : 0.05

inclusive_alpha: boolean

whether to include the alpha (significance <= alpha), or not (significance < alpha). default: True

binary_sig: boolean, optional

indicates whether to plot fpr with binary significance or as a change in color hue. default : True

dotsizefloat, optional

multiplier to use for scaling size of dots

colormapdict, optional

maps color values to the actual colors to use in plotting. default : {0: ‘#6b838f’, 1: ‘#FF3300’}

labelmap: dict

maps labels of colors; the default is to indicate the FPR cutoff in the legend. default : None

facecolorcolor, optional

Background color of dotplot default : ‘white’

legend_titlestr, optional

Legend Title for dot sizes, default is ‘-log10(p-value)’

size_numberint, optional

Number of dots to attempt to generate for dot size legend

size_colorcolor, optional

Size Legend Color to use

color_titlestr, optional

Legend Title for the Color Legend

markersizeint, optional

Size of dots for Color Legend

legend_distanceint, optional

relative distance to place legends

figsizetuple, optional

size of dotplot figure

titlestr, optional

Title of dotplot

xlabelbool, optional

Show xlabel on graph if True

ylabelbool, optional

Show ylabel on graph if True

x_label_dict: dict, optional

Mapping dictionary of labels as they appear in values dataframe (keys) to how they should appear on plot (values)

kinase_dict: dict, optional

Mapping dictionary of kinase names as they appear in values dataframe (keys) to how they should appear on plot (values)

Attributes:
values: pandas dataframe

a copy of the original values dataframe

fpr: pandas dataframe

a copy of the original fpr dataframe

alpha: float

cutoff used for significance, default 0.05

inclusive_alpha: boolean

whether to include the alpha (significance <= alpha), or not (significance < alpha)

significance: pandas dataframe

indicates whether a particular kinase's activity is significant, where fpr <= alpha is significant (or fpr < alpha if inclusive_alpha is False), otherwise it is insignificant

colors: pandas dataframe

dataframe indicating the color to use when plotting: either a copy of the fpr or significance dataframe

binary_sig: boolean

indicates whether coloring will be done based on binary significance or fpr values. Default True

labelmap: dict

indicates how to label each significance color

figsize: tuple

size of the outputted figure, which is overridden if axes is provided for dotplot

title: string

title of the dotplot

xlabel: boolean

indicates whether to plot x-axis labels

ylabel: boolean

indicates whether to plot y-axis labels

colormap: dict

colors to be used when plotting

facecolor: string

background color of dotplot
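How the significance, colors, and dot sizes relate to one another can be sketched without any plotting library. The helper names and the toy `fpr` dictionary are assumptions for illustration; the real class works on pandas dataframes and draws with matplotlib.

```python
# Default colormap from the class signature: 0 = not significant, 1 = significant
colormap = {0: "#6b838f", 1: "#FF3300"}

# Toy false positive rates (hypothetical): kinase -> {sample: fpr}
fpr = {
    "ABL1": {"data:A": 0.05, "data:B": 0.30},
    "SRC": {"data:A": 0.01, "data:B": 0.04},
}


def significance(fpr, alpha=0.05, inclusive_alpha=True):
    """Binarize fpr values: 1 where significant, 0 otherwise."""
    def is_sig(v):
        return v <= alpha if inclusive_alpha else v < alpha
    return {
        kinase: {sample: int(is_sig(v)) for sample, v in row.items()}
        for kinase, row in fpr.items()
    }


def dot_style(value, sig, dotsize=5):
    """Dot area scales with the plotted value; color follows significance."""
    return {"size": value * dotsize, "color": colormap[sig]}


sig = significance(fpr)
strict = significance(fpr, inclusive_alpha=False)
style = dot_style(2.0, sig["SRC"]["data:A"])
```

The ABL1 value of exactly 0.05 is the case where `inclusive_alpha` matters: it counts as significant with the default setting and as not significant when `inclusive_alpha=False`.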

Methods

cluster(ax[, method, metric, orientation, ...])

Performs hierarchical clustering on data and plots result to provided Axes.

context(ax, info, id_column, context_columns)

Context plot is generated and returned.

dotplot([ax, orientation, size_legend, ...])

Generates the dotplot plot, where size is determined by values dataframe and color is determined by significant dataframe

drop_kinases(kinase_list)

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object.

drop_kinases_with_no_significance()

Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant

evidence_count(ax, binary_evidence[, ...])

Add bars to dotplot indicating the total number of sites used as evidence in activity calculation

make_complete_dotplot([kinases_to_plot, ...])

Master function for creating a comprehensive dotplot visualization, which automatically creates any necessary subplots

set_colors([labelmap])

Set colors for the plot based on significance or false positive rate.

set_column_labels

set_index_labels

set_values

setup_figure

cluster(ax, method='single', metric='euclidean', orientation='top', color_threshold=-inf)#

Performs hierarchical clustering on data and plots the result to the provided Axes. The result and significant dataframes are ordered according to the clustering.

Parameters:
axmatplotlib Axes instance

Axes to plot dendrogram to

methodstr, optional

The linkage algorithm to use.

metricstr or function, optional

The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.

orientationstr, optional

The direction to plot the dendrogram, which can be any of the following strings: ‘top’: Plots the root at the top, and plot descendent links going downwards. (default). ‘bottom’: Plots the root at the bottom, and plot descendent links going upwards. ‘left’: Plots the root at the left, and plot descendent links going right. ‘right’: Plots the root at the right, and plot descendent links going left.

context(ax, info, id_column, context_columns, dotsize=200, markersize=20, orientation='left', color_palette='colorblind', margin=0.2, make_legend=True, **kwargs)#

Context plot is generated and returned. The context plot contains the categorical data used for describing the data.

Parameters:
axmatplotlib axis

where to map subtype information to

infopandas df

Dataframe where context information is pulled from

id_column: str

Column used to map the subtype information to

context_columnslist

list of columns to pull context information from

dotsizeint, optional

size of context dots

markersize: int, optional

size of legend markers

orientationstr, optional

orientation to plot context plots to, which determines where legends are placed. options : left, right, top, bottom

color_palettestr, optional

seaborn color palette to use

margin: float, optional

margin

make_legendbool, optional

whether to create legend for context colors

dotplot(ax=None, orientation='left', size_legend=True, color_legend=True, max_size=None, **kwargs)#

Generates the dotplot plot, where size is determined by values dataframe and color is determined by significant dataframe

Parameters:
axmatplotlib Axes instance, optional

axes dotplot will be plotted on. If None then new plot generated

orientationstr, optional

orientation to place legends, either ‘left’ or ‘right’

size_legendbool, optional

whether to include size legend (indicates meaning of dot size/activity)

color_legendbool, optional

whether to include color legend (indicates significance)

max_sizeint, optional

maximum size value to use when generating size legend. If None, automatic legend generated

Returns:
axmatplotlib Axes instance

Axes containing the dotplot

drop_kinases(kinase_list)#

Given a list of kinases, drop these from the dot.values dataframe in all future plotting of this object. Removal is in place.

Parameters:
kinase_list: list

list of kinase names to remove

drop_kinases_with_no_significance()#

Drop kinases from the values dataframe (inplace) when plotting if they are never observed as significant
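The in-place filtering can be pictured with plain dictionaries. This is an illustrative sketch only: the function name and the dictionary layout are assumptions, whereas the real method operates on the object's pandas dataframes.

```python
def drop_never_significant(values, significance):
    """Remove, in place, kinases with no significant sample."""
    for kinase in list(values):  # copy the keys so deletion is safe while looping
        if not any(significance[kinase].values()):
            del values[kinase]


# Toy inputs (hypothetical): kinase -> {sample: value}
values = {
    "ABL1": {"data:A": 0.4, "data:B": 0.2},
    "SRC": {"data:A": 2.1, "data:B": 1.7},
}
significance = {
    "ABL1": {"data:A": 0, "data:B": 0},
    "SRC": {"data:A": 1, "data:B": 0},
}

drop_never_significant(values, significance)
```

SRC is kept because it is significant in at least one sample, while ABL1 is removed.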

evidence_count(ax, binary_evidence, plot_type='bars', phospho_type=None, dot_size=1, include_recommendations=False, ideal_min=None, recommended_min=None, dot_colors=None, bar_line_colors=None)#

Add bars to dotplot indicating the total number of sites used as evidence in activity calculation

Parameters:
ax: axes object

where to plot the bars

binary_evidence: pandas dataframe

binarized dataframe produced during activity calculation (threshold applied to original experiment)

make_complete_dotplot(kinases_to_plot=None, cluster_samples=False, cluster_kinases=False, sort_kinases_by=None, sort_samples_by=None, binary_evidence=None, context=None, significant_kinases_only=True, show_xtick_labels=True, **kwargs)#

Master function for creating a comprehensive dotplot visualization, which automatically creates any necessary subplots

Parameters:
kinases_to_plotlist or None, optional

List of kinases to include in the plot. If None, all kinases are included.

cluster_samplesbool, optional

Whether to cluster samples in the plot.

cluster_kinasesbool, optional

Whether to cluster kinases in the plot.

significant_kinases_onlybool, optional

Whether to include only significant kinases in the plot.

sort_samples_bystr or None, optional

Kinase Column to sort samples by in the plot based on kinase activities. If cluster_samples=True, this will be ignored.

sort_kinases_bystr or None, optional

Sample Column to sort kinases by in the plot based on kinase activities. If cluster_kinases=True, this will be ignored.

binary_evidencepd.DataFrame or None

Binary evidence dataframe from KSTAR analysis. If provided, will calculate the number of sites used as evidence in each sample and plot this.

contextpd.DataFrame or None, optional

Context dataframe providing additional sample information for plotting. If provided, must include an ‘id_column’ for unique sample identifiers and list ‘context_columns’ for context information.

show_xtick_labelsbool, optional

Whether to show x-axis tick labels in the dotplot.

**kwargs

Additional keyword arguments passed to plotting functions, like matplotlib.pyplot.scatter, DotPlot.context, DotPlot.dotplot, DotPlot.cluster, and DotPlot.evidence_count

set_colors(labelmap=None)#

Set colors for the plot based on significance or false positive rate.

The “KSTAR_PDF” class#

class kstar.plot.KSTAR_PDF(activities, fpr, odir, name, binarized_experiment, param_dict)#

Class to generate a PDF report from KSTAR analysis results, built on the fpdf2 module

Parameters:
activitiespandas DataFrame

DataFrame of mann whitney kinase activities

fprpandas DataFrame

DataFrame of false positive rates corresponding to activities

odirstr

Output directory for saving the PDF report

namestr

Name of the experiment/run, used for file naming

binarized_experimentpandas DataFrame

Binarized experiment indicating which sites were used as evidence in each column

param_dictdict

Dictionary of parameters used in the KSTAR run

Attributes:
MARKDOWN_LINK_COLOR
accept_page_break

Whenever a page break condition is met, this @property method is called, and the break is issued or not depending on the returned value.

char_spacing
char_vpos

Return vertical character position relative to line.

current_font
current_font_is_set_on_page
dash_pattern
default_page_dimensions

Return a pair (width, height) in the unit specified to FPDF constructor

denom_lift

Return lift factor for denominator text.

denom_scale

Return scale factor for denominator text.

draw_color
emphasis

The current text emphasis: bold, italics, underline and/or strikethrough.

eph

Effective page height: the page height minus its vertical margins.

epw

Effective page width: the page width minus its horizontal margins.

fill_color
font_family
font_size
font_size_pt
font_stretching
font_style
fonts
is_ttf_font
line_width
nom_lift

Return lift factor for nominator text.

nom_scale

Return scale factor for nominator text.

output_intents
page_layout
page_mode
pages_count

Returns the total pages of the document, at the time it is called.

strikethrough
sub_lift

Return lift factor for subscript text.

sub_scale

Return scale factor for subscript text.

sup_lift

Return lift factor for superscript text.

sup_scale

Return scale factor for superscript text.

text_color
text_mode
text_shaping
underline

Methods

HTML2FPDF_CLASS

alias of HTML2FPDF

add_action(action, x, y, w, h, **kwargs)

Puts an Action annotation on a rectangular area of the page.

add_font([family, style, fname, ...])

Imports a TrueType or OpenType font and makes it available for later calls to the FPDF.set_font() method.

add_link([y, x, page, zoom, name])

Creates a new internal link and returns its identifier.

add_output_intent(subtype[, ...])

Adds desired Output Intent to the Output Intents array:

add_page([orientation, format, same, ...])

Adds a new page to the document.

add_text_markup_annotation(type, text, ...)

Adds a text markup annotation on some quadrilateral areas of the page.

alias_nb_pages([alias])

Defines an alias for the total number of pages.

arc(x, y, a, start_angle, end_angle[, b, ...])

Outputs an arc.

bezier(point_list[, closed, style])

Outputs a quadratic or cubic Bézier curve, defined by three or four coordinates.

cell([w, h, text, border, ln, align, fill, ...])

Prints a cell (rectangular area) with optional borders, background color and character string.

circle(x, y, radius[, style])

Outputs a circle.

code39(text, x, y[, w, h])

Barcode 3of9

create_dotplot()

Generate a standard activity dotplot for use in the PDF report

dashed_line(x1, y1, x2, y2[, dash_length, ...])

Draw a dashed line between two points.

dotplot_page([regenerate_plots])

Create a PDF page that includes the KSTAR dotplot figure and information on where to find the figure and underlying data in the output directory

draw_path(path[, debug_stream])

Add a pre-constructed path to the document.

draw_vector_glyph(path, font)

Add a pre-constructed path to the document.

drawing_context([debug_stream])

Create a context for drawing paths on the current page.

ellipse(x, y, w, h[, style])

Outputs an ellipse.

elliptic_clip(x, y, w, h)

Context manager that defines an elliptic crop zone, useful to render only part of an image.

embed_file([file_path, bytes, basename, ...])

Embed a file into the PDF as an attachment (and, for PDF/A-3 or PDF/A-4f, as an Associated File).

evidence_count_plot(data_columns)

Creates a barplot showing the number of sites used as evidence in each column of the experiment

evidence_overlap_plot(data_columns)

Creates a heatmap showing the Jaccard index of evidence overlap between columns in the experiment

evidence_page([regenerate_plots])

Create a PDF page that includes the total number of sites used as evidence for each column and the jaccard similarity of evidence between columns

file_attachment_annotation(file_path, x, y)

Puts a file attachment annotation on a rectangular area of the page.

file_id()

This method can be overridden in inherited classes in order to define a custom file identifier.

font_face()

Return a fpdf.fonts.FontFace instance representing a subset of properties of this GraphicsState.

footer()

Override the footer method to add a page number at the bottom center of each page.

free_text_annotation(text[, x, y, w, h])

Puts a free text annotation on a rectangular area of the page.

generate([regenerate_plots])

Generates the PDF report by creating each page in sequence and saving the final PDF to the output directory

get_fallback_font(char[, style])

Returns which fallback font has the requested glyph.

get_named_destination(name)

Retrieves a named destination by its name and creates a link to it.

get_page_label()

Return the current page fpdf.output.PDFPageLabel.

get_string_width(s[, normalized, markdown])

Returns the length of a string in user unit.

get_x()

Returns the abscissa of the current position.

get_y()

Returns the ordinate of the current position.

glyph_drawing_context()

Create a context for drawing paths for type 3 font glyphs, without writing on the current page.

header()

Header to be implemented in your own inherited class

highlight(text[, type, color, modification_time])

Context manager that adds a single highlight annotation based on the text lines inserted inside its indented block.

image(name[, x, y, w, h, type, link, title, ...])

Put an image on the page.

ink_annotation(coords[, text, color, ...])

Adds an ink annotation on the page.

insert_toc_placeholder(render_toc_function)

Configure Table Of Contents rendering at the end of the document generation, and reserve some vertical space right now in order to insert it.

interleaved2of5(text, x, y[, w, h])

Barcode I2of5 (numeric), adds a 0 if odd length

line(x1, y1, x2, y2)

Draw a line between two points.

link(x, y, w, h, link[, alt_text])

Puts a link annotation on a rectangular area of the page.

ln([h])

Line Feed.

local_context(**kwargs)

Creates a local graphics state, which won't affect the surrounding code.

mirror(origin, angle)

Method to perform a reflection transformation over a given mirror line.

multi_cell(w[, h, text, border, align, ...])

This method allows printing text with line breaks.

new_path([x, y, paint_rule, debug_stream])

Create a path for appending lines and curves to.

normalize_text(text)

Check that text input is in the correct format/encoding

offset_rendering()

All rendering performed in this context is made on a dummy FPDF object.

output([name, linearize, output_producer_class])

Output PDF to some destination.

page_no()

Get the current page number

polygon(point_list[, fill, style])

Outputs a polygon defined by three or more points.

polyline(point_list[, fill, polygon, style])

Draws lines between two or more points.

preload_image(name[, dims])

Read an image and load it into memory.

rect(x, y, w, h[, style, round_corners, ...])

Outputs a rectangle.

rect_clip(x, y, w, h)

Context manager that defines a rectangular crop zone, useful to render only part of an image.

regular_polygon(x, y, numSides, polyWidth[, ...])

Outputs a regular polygon with n sides. It can be rotated. Style can also be applied (fill, border...)

rotate(angle[, x, y])

rotation(angle[, x, y])

Method to perform a rotation around a given center.

round_clip(x, y, r)

Context manager that defines a circular crop zone, useful to render only part of an image.

set_author(author)

Defines the author of the document.

set_auto_page_break(auto[, margin])

Set auto page break mode, and optionally the bottom margin that triggers it.

set_char_spacing(spacing)

Sets horizontal character spacing.

set_compression(compress)

Activates or deactivates page compression.

set_creation_date([date])

Sets the creation date and time, or the current time if None is given.

set_creator(creator)

Defines the creator of the document.

set_dash_pattern([dash, gap, phase])

Set the current dash pattern for lines and curves.

set_display_mode(zoom[, layout])

Defines the way the document is to be displayed by the viewer.

set_doc_option(opt, value)

Defines a document option.

set_draw_color(r[, g, b])

Defines the color used for all stroking operations (lines, rectangles and cell borders).

set_encryption(owner_password[, ...])

Activate encryption of the document content.

set_fallback_fonts(fallback_fonts[, exact_match])

Allows you to specify a list of fonts to be used if any character is not available on the font currently set.

set_fill_color(r[, g, b])

Defines the color used for all filling operations (filled rectangles and cell backgrounds).

set_font([family, style, size])

Sets the font used to print character strings.

set_font_size(size)

Configure the font size in points

set_image_filter(image_filter)

Args:

set_keywords(keywords)

Associate keywords with the document

set_lang(lang)

A language identifier specifying the natural language for all text in the document except where overridden by language specifications for structure elements or marked content.

set_left_margin(margin)

Sets the document left margin.

set_line_width(width)

Defines the line width of all stroking operations (lines, rectangles and cell borders).

set_link([link, y, x, page, zoom, name])

Defines the page and position a link points to.

set_margin(margin)

Sets the document right, left, top & bottom margins to the same value.

set_margins(left, top[, right])

Sets the document left, top & optionally right margins to the same value.

set_page_background(background)

Sets a background color or image to be drawn every time FPDF.add_page() is called, or removes a previously set background.

set_page_label([label_style, label_prefix, ...])

Enable fpdf.output.PDFPageLabel to be inserted on every page.

set_producer(producer)

Producer of document

set_right_margin(margin)

Sets the document right margin.

set_section_title_styles(level0[, level1, ...])

Defines a style for section titles.

set_stretching(stretching)

Sets horizontal font stretching.

set_subject(subject)

Defines the subject of the document.

set_text_color(r[, g, b])

Defines the color used for text.

set_text_shaping([use_shaping_engine, ...])

Enable or disable text shaping engine when rendering text.

set_title(title)

Defines the title of the document.

set_top_margin(margin)

Sets the document top margin.

set_x(x)

Defines the abscissa of the current position.

set_xy(x, y)

Defines the abscissa and ordinate of the current position.

set_y(y)

Moves the current abscissa back to the left margin and sets the ordinate.

sign(key, cert[, extra_certs, hashalgo, ...])

Args:

sign_pkcs12(pkcs_filepath[, password, ...])

Args:

skew([ax, ay, x, y])

Method to perform a skew transformation originating from a given center.

solid_arc(x, y, a, start_angle, end_angle[, ...])

Outputs a solid arc.

star(x, y, r_in, r_out, corners[, ...])

Outputs a regular star with n corners.

start_section(name[, level, strict])

Start a section in the document outline.

summary_page()

Create a PDF page that indicates the parameters used in the KSTAR run and the key kinases identified for each column

table(data[, header, column_widths, row_height])

Builds a table in the PDF

text(x, y[, text])

Prints a character string.

text_annotation(x, y, text[, w, h, name])

Puts a text annotation on a rectangular area of the page.

text_columns([text, img, img_fill_width, ...])

Establish a layout with multiple columns to fill with text.

Args:

text (str, optional): A first piece of text to insert.
ncols (int, optional): The number of columns to create. (Default: 1)
gutter (float, optional): The distance between the columns. (Default: 10)
balance (bool, optional): Specify whether multiple columns should end at approximately the same height, if they don't fill the page. (Default: False)
text_align (Align or str, optional): The alignment of the text within the region. (Default: "LEFT")
line_height (float, optional): A multiplier relative to the font size changing the vertical space occupied by a line of text. (Default: 1.0)
l_margin (float, optional): Override the current left page margin.
r_margin (float, optional): Override the current right page margin.
print_sh (bool, optional): Treat a soft-hyphen (\u00ad) as a printable character, instead of a line breaking opportunity. (Default: False)
wrapmode (fpdf.enums.WrapMode, optional): "WORD" for word based line wrapping, "CHAR" for character based line wrapping. (Default: "WORD")
skip_leading_spaces (bool, optional): On each line, any space characters at the beginning will be skipped if True. (Default: False)

top_kinases_table()

Constructs a table of the top 5 most active significant kinases per sample and adds it to the PDF page

unbreakable()

Ensures that all rendering performed in this context appears on a single page, by performing a page break beforehand if need be.

use_font_face(font_face)

Sets the provided fpdf.fonts.FontFace in a local context, then restores the font settings to what they were initially.

use_pattern(shading)

Create a context for using a shading pattern on the current page.

will_page_break(height)

Lets you know if adding an element will trigger a page break, based on its height and the current ordinate (y position).

write([h, text, link, print_sh, wrapmode])

Prints text from the current position.

write_html(text, *args, **kwargs)

Parse HTML and convert it to PDF.

add_highlight

clear_text_region

is_current_text_region

mapping_page

preload_glyph_image

register_text_region

set_xmp_metadata

use_text_style

x_by_align

create_dotplot()#

Generate a standard activity dotplot for use in the PDF report

dotplot_page(regenerate_plots=False)#

Create a PDF page that includes the KSTAR dotplot figure and information on where to find the figure and underlying data in the output directory

Parameters:
regenerate_plots: bool, optional

Whether to regenerate the dotplot figure even if it already exists in the output directory

evidence_count_plot(data_columns)#

Creates a barplot showing the number of sites used as evidence in each column of the experiment

Parameters:
data_columns: list

List of column names in the experiment to include in the plot

evidence_overlap_plot(data_columns)#

Creates a heatmap showing the Jaccard index of evidence overlap between columns in the experiment

Parameters:
data_columns: list

List of column names in the experiment to include in the plot

evidence_page(regenerate_plots=False)#

Create a PDF page that includes the total number of sites used as evidence for each column and the Jaccard similarity of evidence between columns

Parameters:
regenerate_plots: bool, optional

Whether to regenerate the evidence plots even if they already exist in the output directory

footer()#

Override the footer method to add a page number at the bottom center of each page.

generate(regenerate_plots=False)#

Generates the PDF report by creating each page in sequence and saving the final PDF to the output directory

Parameters:
regenerate_plots: bool, optional

Whether to regenerate all plots even if they already exist in the output directory

summary_page()#

Create a PDF page that indicates the parameters used in the KSTAR run and the key kinases identified for each column

table(data, header=None, column_widths=40, row_height=5)#

Builds a table in the PDF

Parameters:
data: pandas DataFrame

DataFrame containing the data to be included in the table

header: list, optional

List of header names for the table columns. If None, uses DataFrame column names.

column_widths: int or list, optional

Width of each column in the table. If an integer is provided, all columns will have the same width. If a list is provided, it should contain the width for each column.

row_height: int, optional

Height of each row in the table.

top_kinases_table()#

Constructs a table of the top 5 most active significant kinases per sample and adds it to the PDF page

Downstream Analysis Modules#

kstar.analysis.interactions.getSubstrateInfluence(networks, kinase, substrate_subset=None)#

Given the pruned networks and a kinase of interest, return the number of networks in which each substrate is connected to that kinase (the ‘substrate influence’ on that kinase’s activity prediction). If a subset of substrates is provided, only that subset is analyzed.

Parameters:
networks: dict

dictionary storing all pruned networks used in activity calculation

kinase: str

name of the kinase of interest: should match the name found in provided networks

substrate_subset: list

subset of substrates to analyze, indicated by ‘{KSTAR_ACCESSION}_{KSTAR_SITE}’. If none, will return a series containing info on all substrates with at least one connection to the given kinase

Returns:
Pandas series indicating the number of networks in which each substrate is connected to the indicated kinase, sorted from the most connections (highest influence) to the least (lowest influence). Sites with no connection will not be included.
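
The counting described above can be sketched with plain pandas. This is a minimal illustration, not the KSTAR implementation: the helper name `substrate_influence` is hypothetical, and it assumes each pruned network is a dataframe with ‘KSTAR_ACCESSION’, ‘KSTAR_SITE’ and ‘KSTAR_KINASE’ columns.

```python
import pandas as pd

def substrate_influence(networks, kinase, substrate_subset=None):
    """Count how many pruned networks connect each substrate to `kinase`.

    Hypothetical helper; assumes KSTAR-style column names in every network.
    """
    per_network = []
    for net in networks.values():
        edges = net[net["KSTAR_KINASE"] == kinase]
        # Label each site as "{accession}_{site}" and count it once per network
        labels = (edges["KSTAR_ACCESSION"] + "_" + edges["KSTAR_SITE"]).drop_duplicates()
        per_network.append(labels)
    if not per_network:
        return pd.Series(dtype="int64")
    # value_counts() returns the substrates sorted from most to fewest networks
    counts = pd.concat(per_network).value_counts()
    if substrate_subset is not None:
        counts = counts[counts.index.isin(substrate_subset)]
    return counts

# Toy data: two tiny "pruned networks" (made-up accessions)
networks = {
    "net0": pd.DataFrame({
        "KSTAR_ACCESSION": ["P1", "P2"],
        "KSTAR_SITE": ["Y10", "Y20"],
        "KSTAR_KINASE": ["EGFR", "EGFR"],
    }),
    "net1": pd.DataFrame({
        "KSTAR_ACCESSION": ["P1", "P3"],
        "KSTAR_SITE": ["Y10", "Y30"],
        "KSTAR_KINASE": ["EGFR", "SRC"],
    }),
}
influence = substrate_influence(networks, "EGFR")
```

In this toy example, P1_Y10 is linked to EGFR in both networks and P2_Y20 in one, while P3_Y30 is absent because it is only connected to SRC.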

kstar.analysis.interactions.getSubstrateInfluence_inExperiment(networks, binary_evidence, kinase, data_cols=None)#

Given the binary evidence used for activity prediction, identify which sites are found across the most networks for a given kinase and each sample.

Parameters:
networks: dictionary

dictionary containing all 50 pruned networks used for activity prediction

binary_evidence: pandas dataframe

binarized dataset (using the same threshold/criteria as the one used for activity prediction)

kinase: str

name of the kinase to probe

data_cols: list or None

name of the data columns in binary_evidence to probe. If None, will analyze all columns with ‘data:’ at the start of the column name.

kstar.analysis.coverage.averageUniqueSubstrates_KSTAR(networks=None)#

Calculate the average number of unique substrates covered by each KSTAR pruned network

Parameters:
mod_types: list

list containing which networks to calculate average for. Either [‘Y’], [‘ST’], or [‘Y’, ‘ST’]

Returns:
averageSub: dict

indicates the average number of substrates across all pruned networks for indicated modification types

kstar.analysis.coverage.experimentCoverage(experiment, networks, mod='Y', exp_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'], net_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'])#

Given an experiment, determine how many of the sites observed in the experiment can be captured by a kinase-substrate network (function was designed for KSTAR pruned networks, but should work with any ks-network that indicates UniProt ID and site number)

Parameters:
experiment: pandas dataframe

phosphoproteomic experiment, ideally one that has already been mapped to KinPred by KSTAR

networks: pandas dataframe

binarized kinase-substrate network (unweighted), ideally having been mapped to KinPred/KSTAR already

exp_cols: list

list indicating the columns in experiment dataframe that contain uniprot id and site number

net_cols: list

list indicating the columns in network dataframe that contain the uniprot id and site number

Returns:
fraction_of_sites_covered: dict

indicates the fraction of phosphorylation sites observed in experiment that are also found within the kinase-substrate network, for each modification type (tyrosine, serine/threonine).
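
As a rough illustration of the coverage calculation, the sketch below computes the fraction of experiment sites found in a single network. It is an assumption-laden simplification: the helper name `fraction_covered` is hypothetical, it handles one network dataframe and returns a single float, whereas the documented function returns a dictionary per modification type.

```python
import pandas as pd

def fraction_covered(experiment, network, cols=("KSTAR_ACCESSION", "KSTAR_SITE")):
    """Fraction of unique experiment sites that also appear in the network.

    Hypothetical helper; `cols` names the accession and site columns,
    assumed here to be identical in both dataframes.
    """
    cols = list(cols)
    exp_sites = experiment[cols].drop_duplicates()
    net_sites = network[cols].drop_duplicates()
    if exp_sites.empty:
        return 0.0
    # Inner merge keeps only sites present in both the experiment and the network
    covered = exp_sites.merge(net_sites, on=cols, how="inner")
    return len(covered) / len(exp_sites)

# Toy data with made-up accessions
experiment = pd.DataFrame({
    "KSTAR_ACCESSION": ["P1", "P1", "P2", "P3"],
    "KSTAR_SITE": ["Y10", "Y99", "Y20", "Y30"],
})
network = pd.DataFrame({
    "KSTAR_ACCESSION": ["P1", "P2", "P9"],
    "KSTAR_SITE": ["Y10", "Y20", "Y5"],
    "KSTAR_KINASE": ["EGFR", "SRC", "ABL1"],
})
coverage = fraction_covered(experiment, network)
```

Two of the four experiment sites (P1_Y10 and P2_Y20) occur in the toy network, so half of the experiment is covered.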

kstar.analysis.coverage.getStudyBiasDistribution_InExperiment(binary_experiment, ax=None, figsize=(4, 3), return_dist=False)#

Plot the distribution of study bias within a single phosphoproteomic experiment

Parameters:
binary_experiment: pandas dataframe

phosphoproteomic experiment that has been mapped by KSTAR (contains ‘KSTAR_SITE’,’KSTAR_ACCESSION’, and ‘KSTAR_NUM_COMPENDIA’ columns)

ax: matplotlib axes object

axis to plot the distribution on. If none, will create subplot

figsize: tuple

size of matplotlib figure. Default is (4,3)

return_dist: bool

whether you would like to also return the distribution values. Default is False.

Returns:
Histogram plotting the distribution of study bias found in the provided experiment, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.
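
The distribution behind such a histogram can be sketched without any plotting. This is a minimal example under stated assumptions: the helper name `study_bias_distribution` is hypothetical, and the experiment is assumed to contain the ‘KSTAR_ACCESSION’, ‘KSTAR_SITE’ and ‘KSTAR_NUM_COMPENDIA’ columns mentioned above.

```python
import pandas as pd

def study_bias_distribution(mapped_experiment):
    """Number of unique sites recorded in each number of compendia.

    Hypothetical helper; counts every (accession, site) pair once.
    """
    sites = mapped_experiment.drop_duplicates(subset=["KSTAR_ACCESSION", "KSTAR_SITE"])
    # Index: number of compendia; value: how many sites have that number
    return sites["KSTAR_NUM_COMPENDIA"].value_counts().sort_index()

# Toy data: the last two rows describe the same site
mapped = pd.DataFrame({
    "KSTAR_ACCESSION": ["P1", "P1", "P2", "P3", "P3"],
    "KSTAR_SITE": ["Y10", "Y99", "Y20", "Y30", "Y30"],
    "KSTAR_NUM_COMPENDIA": [1, 1, 3, 5, 5],
})
dist = study_bias_distribution(mapped)
```

The resulting series could then be drawn as a bar chart or histogram on a matplotlib axis if a figure is needed.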

kstar.analysis.coverage.getStudyBiasDistribution_InPhosphoproteome(mod_type='Y', ax=None, figsize=(4, 3), return_dist=False)#

Plot the distribution of study bias across the reference phosphoproteome

Parameters:
mod_type: str

indicates which modification type, tyrosine (‘Y’) or serine/threonine (‘ST’), you would like to plot. Default is ‘Y’

ax: matplotlib axes object

axis to plot the distribution on. If none, will create subplot

figsize: tuple

size of matplotlib figure. Default is (4,3)

return_dist: bool

whether you would like to also return the distribution values. Default is False.

Returns:
Histogram plotting the distribution of study bias found in overall phosphoproteome, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.

kstar.analysis.coverage.getStudyBiasDistribution_InSample(binary_experiment, data_column, ax=None, figsize=(4, 3), return_dist=False)#

Plot the distribution of study bias within a single sample of a phosphoproteomic experiment

Parameters:
binary_experiment: pandas dataframe

phosphoproteomic experiment that has been mapped by KSTAR (contains ‘KSTAR_SITE’,’KSTAR_ACCESSION’, and ‘KSTAR_NUM_COMPENDIA’ columns)

ax: matplotlib axes object

axis to plot the distribution on. If none, will create subplot

figsize: tuple

size of matplotlib figure. Default is (4,3)

return_dist: bool

whether you would like to also return the distribution values. Default is False.

Returns:
Histogram plotting the distribution of study bias found in the provided experiment, as defined by the number of compendia a phosphorylation site is recorded in. If return_dist = True, will also return a series object containing the same data as the histogram.

kstar.analysis.coverage.numUniqueSubstrates(networks, acc_col='KSTAR_ACCESSION', site_col='KSTAR_SITE')#

Given a KSTAR network(s), return the number of unique substrates within the network (across all kinases). If a dictionary of multiple pruned networks is provided, will calculate the total number of unique substrates across ALL networks.

Parameters:
networks: pandas dataframe or dict of pandas dataframes

pruned KSTAR network, or dictionary containing multiple pruned networks

acc_col: str

name of column in network dataframe which indicates UniProt ID of substrates

site_col: str

name of column in network dataframe which indicates residue and site number (e.g. Y1197)

Returns:
Number of unique substrates within network(s)
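
One way to picture this count is to pool all networks and drop duplicate (accession, site) pairs. The sketch below is illustrative only; the helper name `count_unique_substrates` is hypothetical and the toy accessions are made up.

```python
import pandas as pd

def count_unique_substrates(networks, acc_col="KSTAR_ACCESSION", site_col="KSTAR_SITE"):
    """Count unique (accession, site) pairs in one network or a dict of networks."""
    if isinstance(networks, dict):
        # Pool every pruned network before removing duplicates
        combined = pd.concat(list(networks.values()), ignore_index=True)
    else:
        combined = networks
    return len(combined[[acc_col, site_col]].drop_duplicates())

net0 = pd.DataFrame({"KSTAR_ACCESSION": ["P1", "P2"], "KSTAR_SITE": ["Y10", "Y20"]})
net1 = pd.DataFrame({"KSTAR_ACCESSION": ["P1", "P3"], "KSTAR_SITE": ["Y10", "Y30"]})

n_single = count_unique_substrates(net0)
n_pooled = count_unique_substrates({"net0": net0, "net1": net1})
```

Here the single network holds two substrates, and pooling both networks yields three because P1_Y10 is shared.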

kstar.analysis.coverage.sampleCoverage(binary_experiment, data_col, networks, mod='Y', exp_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'], net_cols=['KSTAR_ACCESSION', 'KSTAR_SITE'])#

Given a sample within an experiment, determine how many of the sites observed in the experiment can be captured by KSTAR pruned networks. Essentially the same as experimentCoverage(), but restricts experiment sites to those used as evidence for a given sample

Parameters:
binary_experiment: pandas dataframe

binarized phosphoproteomic experiment, with each 1 indicating that site was observed in sample. Ideally has been mapped to KinPred by KSTAR already

data_col: str

column name of the sample of interest

networks: pandas dataframe

binarized kinase-substrate network (unweighted), ideally having been mapped to KinPred/KSTAR already

exp_cols: list

list indicating the columns in experiment dataframe that contain uniprot id and site number

net_cols: list

list indicating the columns in network dataframe that contain the uniprot id and site number

Returns:
fraction_of_sites_covered: dict

indicates the fraction of phosphorylation sites observed in sample that are also found within the kinase-substrate network, for each modification type (tyrosine, serine/threonine).

kstar.analysis.kinase_MI.kinase_mutual_information(network, kinase_column='KSTAR_KINASE', accession_column='KSTAR_ACCESSION', site_column='KSTAR_SITE', substrate_list=None)#

Finds the mutual information shared between kinases based on the substrates they phosphorylate. Mutual information is defined as the intersection of substrates between two kinases, where a substrate is defined by its accession and site, e.g. P54760_Y596. Normalization is performed by dividing the size of the intersection by the size of the union of the two kinases’ substrates, which is the Jaccard index (JI). The Jaccard distance can be calculated as 1 - JI.

Parameters:
network: pandas dataframe or dictionary of pandas dataframes

The network to analyze for mutual kinase information. If a dictionary of multiple pandas dataframes is provided, the MI is averaged across all networks in the dictionary.

kinase_column: str

Column in network that contains kinase information

substrate_column: str

Column in network that contains substrate information

substrate_list: list

Optional; by default no subset list is used. If provided, the MI within the network(s) is calculated for only the evidence given in the list (entries must match the substrate_column of the network passed in)

Returns:
heatmap: pandas dataframe

Number of substrates that overlap between kinases

normalized: pandas dataframe

Mutual information normalized to the Jaccard index: size of the intersection of two kinase networks / size of the union of the two kinase networks.

heatlist or heatdict: list or dictionary of lists

intersection of kinase networks. If a single network it is a list. If multiple networks it is a dict of lists with keys the same as the network name
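
The overlap and Jaccard normalization can be illustrated with plain Python sets. This is a simplified sketch rather than the library's code: the helper name `kinase_jaccard` is hypothetical, and the network is reduced to a list of (kinase, substrate) pairs instead of a dataframe.

```python
from itertools import combinations

def kinase_jaccard(edges):
    """Pairwise substrate overlap and Jaccard index between kinases.

    `edges` is an iterable of (kinase, substrate) pairs, where a substrate
    is an "{accession}_{site}" label. Hypothetical helper.
    """
    substrates = {}
    for kinase, substrate in edges:
        substrates.setdefault(kinase, set()).add(substrate)

    overlap, jaccard = {}, {}
    for a, b in combinations(sorted(substrates), 2):
        shared = substrates[a] & substrates[b]
        union = substrates[a] | substrates[b]
        overlap[(a, b)] = len(shared)
        # Each kinase has at least one substrate, so the union is never empty
        jaccard[(a, b)] = len(shared) / len(union)
    return overlap, jaccard

# Toy network with made-up accessions
edges = [
    ("EGFR", "P1_Y10"),
    ("EGFR", "P2_Y20"),
    ("SRC", "P1_Y10"),
    ("SRC", "P3_Y30"),
]
overlap, jaccard = kinase_jaccard(edges)
```

EGFR and SRC share one substrate out of three distinct ones, so their overlap is 1 and their Jaccard index is 1/3 (Jaccard distance 2/3).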

kstar.analysis.kinase_MI.plot_kinase_heatmap(heatmap, use_mask=True, annotate=False)#

Plots Kinase network heatmap

Parameters:
heatmap: pandas dataframe

Network Heatmap to plot (must be square matrix)

info_type: str

Indicates what type of information is included in the heatmap variable. Default is mutual information, equivalent to the normalized matrix obtained from the kinase_mutual_information function

use_mask: bool

If true a mask is applied to the heatmap

annotate: bool

If true then numbers are annotated into each heatmap square

Dataset Processing Functions#

Other Helper Functions#

kstar.helpers.agg_jaccard(jaccard_matrix, agg='max')#

Given a jaccard similarity matrix between samples, calculate the aggregate jaccard similarity excluding self-comparisons

Parameters:
jaccard_matrix: pd.DataFrame

jaccard similarity matrix between samples, created using jaci_matrix_between_samples()

agg: str

aggregation method to use, either ‘max’ or ‘mean’
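
A possible reading of this aggregation is sketched below: the diagonal (each sample compared with itself) is masked out, then the remaining values are aggregated per sample. Whether the real function aggregates per sample or over the whole matrix is not stated here, so the per-sample behaviour and the helper name `aggregate_jaccard` are assumptions.

```python
import numpy as np
import pandas as pd

def aggregate_jaccard(jaccard_matrix, agg="max"):
    """Aggregate each sample's similarity to all other samples.

    Hypothetical helper; self-comparisons on the diagonal are ignored.
    """
    values = jaccard_matrix.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(values, np.nan)  # exclude self-comparisons
    masked = pd.DataFrame(values, index=jaccard_matrix.index, columns=jaccard_matrix.columns)
    if agg == "max":
        return masked.max(axis=1)  # NaN entries are skipped by default
    if agg == "mean":
        return masked.mean(axis=1)
    raise ValueError("agg must be 'max' or 'mean'")

samples = ["A", "B", "C"]
matrix = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.4],
     [0.2, 0.4, 1.0]],
    index=samples, columns=samples,
)
max_similarity = aggregate_jaccard(matrix, agg="max")
mean_similarity = aggregate_jaccard(matrix, agg="mean")
```

Without the diagonal mask, every sample's maximum would trivially be 1.0, which is why self-comparisons are removed first.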

kstar.helpers.calculate_jaccard_by_binary(set1, set2)#

Compares two binary arrays and calculates the Jaccard index between them (based on number of matches)

kstar.helpers.calculate_jaccard_by_sets(set1, set2)#

Compares two sets and calculates the Jaccard index between them
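
Both Jaccard variants reduce to the same ratio, the size of the intersection divided by the size of the union. The functions below are a minimal sketch with hypothetical names; returning 0.0 for two empty inputs is an assumption, not documented behaviour.

```python
def jaccard_by_sets(set1, set2):
    """Jaccard index of two collections of hashable items."""
    set1, set2 = set(set1), set(set2)
    union = set1 | set2
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(set1 & set2) / len(union)

def jaccard_by_binary(vec1, vec2):
    """Jaccard index of two equal-length binary (0/1) sequences."""
    both = sum(1 for a, b in zip(vec1, vec2) if a and b)    # positions set in both
    either = sum(1 for a, b in zip(vec1, vec2) if a or b)   # positions set in at least one
    return both / either if either else 0.0

set_score = jaccard_by_sets({"a", "b", "c"}, {"b", "c", "d"})
binary_score = jaccard_by_binary([1, 1, 0, 0], [1, 0, 1, 0])
```

The set example shares two of four distinct items, and the binary example shares one of three occupied positions.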

kstar.helpers.convert_acc_to_uniprot(df, acc_col_name, acc_col_type, acc_uni_name)#

Given an experimental dataframe (df) with an accession column (acc_col_name) that is not UniProt, use UniProt to append an accession column of UniProt IDs

Parameters:
df: pandas.DataFrame

Dataframe with at least a column of accession of interest

acc_col_name: string

name of column to convert FROM

acc_col_type: string

Uniprot string designation of the accession type to convert FROM, see https://www.uniprot.org/help/api_idmapping

acc_uni_name: string

name of new column

Returns:
appended_df: pandas.DataFrame

Input dataframe with an appended acc column of uniprot IDs

kstar.helpers.get_logger(name, filename)#

Finds and returns logger if it exists. Creates new logger if log file does not exist

Parameters:
name: str

log name

filename: str

location to store log file

kstar.helpers.jaci_matrix_between_samples(evidence, samples=None)#

This function computes the similarity of evidence between samples based on the Jaccard index of phosphopeptide identities

Parameters:
evidence: pd.DataFrame

evidence dataframe, preferably one that has been binarized

samples: list

a list of sample columns

Returns:
jaccard_matrix: pd.DataFrame

a dataframe showing the similarity of phosphopeptide identities between samples
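
The sample-by-sample matrix can be sketched as follows. This is an illustration under assumptions: the helper name `jaccard_matrix` is hypothetical, the evidence is assumed to be binarized with one row per phosphopeptide, and when no samples are given it falls back to columns whose names start with ‘data:’, following the convention mentioned for data columns elsewhere in this reference.

```python
import pandas as pd

def jaccard_matrix(evidence, samples=None):
    """Pairwise Jaccard index of observed sites between sample columns."""
    if samples is None:
        samples = [c for c in evidence.columns if str(c).startswith("data:")]
    # For every sample, collect the row labels where the site was observed (value 1)
    observed = {s: set(evidence.index[evidence[s] == 1]) for s in samples}
    matrix = pd.DataFrame(index=samples, columns=samples, dtype=float)
    for a in samples:
        for b in samples:
            union = observed[a] | observed[b]
            shared = observed[a] & observed[b]
            matrix.loc[a, b] = len(shared) / len(union) if union else 0.0
    return matrix

# Toy binarized evidence: four sites, two samples
evidence = pd.DataFrame({
    "data:A": [1, 1, 0, 1],
    "data:B": [1, 0, 1, 1],
})
similarity = jaccard_matrix(evidence)
```

The two toy samples share two of four observed sites, giving an off-diagonal value of 0.5 and a diagonal of 1.0.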

kstar.helpers.parse_network_information(network_directory, file_type='txt')#

Parse the RUN_INFORMATION.txt file from network pruning run and extract its data.

Args:

file_path (str): Path to the RUN_INFORMATION.txt file.

Returns:

dict: A dictionary containing the parsed data.

kstar.helpers.process_fasta_file(fasta_file)#

Used during configuration to convert the global FASTA sequence file into a sequence dictionary that can be used in mapping

Parameters:
fasta_file: str

file location of fasta file

Returns:
sequences: dict

{acc : sequence} dictionary generated from fasta file
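
For readers unfamiliar with the FASTA layout, the conversion to an {acc : sequence} dictionary might look like the sketch below. It is not the KSTAR implementation: the helper name `read_fasta` is hypothetical, and the rule for extracting the accession (the middle field of a UniProt-style ">db|ACCESSION|NAME" header, otherwise the first word) is an assumption.

```python
def read_fasta(fasta_file):
    """Parse a FASTA file into an {accession: sequence} dictionary."""
    chunks = {}
    acc = None
    with open(fasta_file) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                parts = header.split("|")
                fields = header.split()
                # Assumption: UniProt-style headers carry the accession in the second field
                acc = parts[1] if len(parts) >= 3 else (fields[0] if fields else header)
                chunks[acc] = []
            elif acc is not None:
                # Sequence lines may be wrapped, so collect them until the next header
                chunks[acc].append(line)
    return {name: "".join(pieces) for name, pieces in chunks.items()}
```

Calling `read_fasta` on a file containing a `>sp|P00533|EGFR_HUMAN` record would, under these assumptions, key that sequence by `P00533`.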

kstar.helpers.string_to_boolean(string)#

Converts string to boolean

Parameters:
string: str

input string

Returns:
result: bool

output boolean
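
The set of strings the library accepts is not listed here, so the following is only a sketch of how such a conversion is commonly written. The helper name `to_bool` and the accepted spellings are assumptions.

```python
def to_bool(text):
    """Convert a configuration-style string to a boolean.

    Hypothetical helper; the accepted spellings are an assumption.
    """
    normalized = text.strip().lower()
    if normalized in {"true", "t", "yes", "y", "1"}:
        return True
    if normalized in {"false", "f", "no", "n", "0"}:
        return False
    raise ValueError(f"cannot interpret {text!r} as a boolean")

flag_on = to_bool(" True ")
flag_off = to_bool("no")
```

Raising an error on unrecognized input, rather than silently returning False, makes typos in a configuration file visible early.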