KSTAR in Parallel

The base implementation of KSTAR runs on a single processor, which can be slow for datasets with many sites and/or many samples. KSTAR can also be run in parallel, either within your Python environment or using a software package called nextflow. See the following sections for more details.

Option 1: Running KSTAR using Multiprocessing

Within Python, several KSTAR functions can be run in parallel using the multiprocessing module. For these functions, switching to multiprocessing only requires changing the ‘PROCESSES’ parameter from 1 (the default) to the number of processes you would like to run in parallel.

# Activity Calculation
kinact_dict = calculate.run_kstar_analysis(experiment, activity_log, networks, PROCESSES = 4)

# Normalization
calculate.normalize_analysis(kinact_dict, activity_log, num_random_experiments, target_alpha, PROCESSES = 4)

# Mann Whitney Calculation
calculate.Mann_Whitney_analysis(kinact_dict, activity_log, number_sig_trials = 100, PROCESSES = 4)

While this strategy helps to improve the speed of analysis, it can be very memory intensive for large datasets. For large tyrosine datasets and most serine/threonine datasets, we recommend running KSTAR using nextflow, described in the following section.
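To illustrate what a PROCESSES-style switch generally does, here is a minimal sketch built on Python's multiprocessing.Pool. It is not KSTAR's internal code: score_network and run_over_networks are hypothetical names, and the toy networks are made up for the example. The comments also point at one common reason this approach is memory hungry.

```python
from multiprocessing import Pool


def score_network(network):
    """Hypothetical stand-in for per-network work: count sites per kinase."""
    return {kinase: len(sites) for kinase, sites in network.items()}


def run_over_networks(networks, PROCESSES=1):
    """Apply score_network to every network, optionally in parallel."""
    if PROCESSES <= 1:
        # Serial path: one network at a time in the current process.
        return [score_network(network) for network in networks]
    # Parallel path: each network is pickled and sent to a worker process,
    # so several copies of the data can be held in memory at once.
    with Pool(processes=PROCESSES) as pool:
        return pool.map(score_network, networks)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_networks = [
        {"ABL1": ["P1_Y10", "P2_Y20"], "SRC": ["P3_Y30"]},
        {"ABL1": ["P1_Y10"], "SRC": ["P3_Y30", "P4_Y40"]},
    ]
    print(run_over_networks(toy_networks, PROCESSES=2))
```

Both paths return results in the same order as the input, so switching PROCESSES should only change the run time and memory use, not the output.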

Option 2: Running Large Datasets with Nextflow

While our standard implementation of KSTAR can be run on most phosphotyrosine datasets and some small phosphoserine/threonine datasets, the memory and time costs are often too high for large datasets. For these cases, we have built a highly parallel version of KSTAR with the nextflow software package. The remainder of this section details how to install and run KSTAR with nextflow.

Requirements

The nextflow pipeline runs inside either Docker or Singularity containers. These are compatible with POSIX operating systems (Linux, macOS, etc.). Nextflow can also be run on a Windows machine through WSL2; Microsoft provides instructions for installing and running WSL2 here: WSL Documentation. For other details about nextflow, please see their documentation here: Nextflow Documentation

This implementation is best suited for high performance computing environments. We suggest a minimum of 8 CPU cores and 16 GB of available memory.

Installation

Before implementing KSTAR with nextflow, Docker will need to be installed. The nextflow implementation utilizes a docker container of the KSTAR algorithm. For details on how to install Docker, visit Install Docker.

First, we recommend downloading KSTAR from GitHub into an easily accessible folder:

git clone https://github.com/NaegleLab/KSTAR.git

Next, set up a conda virtual environment and install nextflow:

conda create -n kstar
conda activate kstar
conda install -c bioconda nextflow

Setting up KSTAR for Nextflow

Download Resource Files

As with the standard implementation, the nextflow implementation of KSTAR requires a reference proteome and phosphoproteome. These can be downloaded by calling the config.install_resource_files() function in the Python interpreter (started from the main KSTAR directory). The function installs the necessary files to the default location, which is the RESOURCE_FILES directory in the repository.

Downloading or Generating Networks used in KSTAR

As with the standard implementation of KSTAR, you will need to obtain KSTAR pruned networks that will be used in activity calculation. You can download KSTAR networks from the Network FigShare. Unlike the standard implementation, network pickles do not need to be generated, as nextflow will operate on the individual network files.

If choosing to generate your own networks, this should be done outside of the nextflow environment as described in the tutorial. You will need to make sure the directory structure where these networks are located is correct: ‘{resource_directory}/{network_directory}/{phospho_event}/INDIVIDUAL_NETWORKS’, where elements in brackets are given as parameters during a nextflow run. Each individual pruned network should be a .tsv file located in the above directory.
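Before starting a run, it can help to confirm that the network folder matches the layout described above. The sketch below builds the expected path and lists the .tsv files in it. The function names are hypothetical, and stripping a leading slash from network_directory is an assumption made so that a value such as /NETWORKS/NetworKIN still joins underneath the resource directory.

```python
from pathlib import Path


def individual_network_dir(resource_directory, network_directory, phospho_event):
    """Build the folder that should hold the individual pruned networks."""
    if phospho_event not in ("Y", "ST"):
        raise ValueError("phospho_event must be 'Y' or 'ST'")
    # Assumption: a leading slash in network_directory is meant relative to
    # the resource directory, so strip it before joining the pieces.
    relative = str(network_directory).lstrip("/")
    return Path(resource_directory) / relative / phospho_event / "INDIVIDUAL_NETWORKS"


def list_network_files(resource_directory, network_directory, phospho_event):
    """Return the sorted .tsv network files, or raise if none are found."""
    folder = individual_network_dir(resource_directory, network_directory, phospho_event)
    files = sorted(folder.glob("*.tsv"))
    if not files:
        raise FileNotFoundError(f"no .tsv networks found in {folder}")
    return files
```

Calling list_network_files("../RESOURCE_FILES", "NETWORKS/NetworKIN", "Y") from the nextflow directory would then either return the network files or fail with a message naming the folder that was searched.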

In either case, make sure to place the networks in the same directory as the resource files (the RESOURCE_FILES folder created when running config.install_resource_files()).

Mapping datasets

Datasets must be mapped prior to running KSTAR with nextflow, because the nextflow implementation is strictly for activity prediction. The mapped data should be a .tsv file that contains, at a minimum, the data columns and the KSTAR_ACCESSION, KSTAR_PEPTIDE, and KSTAR_SITE columns.
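A quick header check can catch an unmapped file before a long run starts. The following sketch reads only the first row of a tab-separated file and reports which required columns are missing and which columns carry the ‘data:’ prefix. The function names are hypothetical, and the check covers column names only, not the contents of the columns.

```python
import csv

REQUIRED_COLUMNS = ("KSTAR_ACCESSION", "KSTAR_PEPTIDE", "KSTAR_SITE")


def check_mapped_header(header):
    """Return (missing required columns, data columns) for a header row."""
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    data_columns = [col for col in header if col.startswith("data:")]
    return missing, data_columns


def check_mapped_file(path):
    """Read only the header row of a tab-separated file and check it."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return check_mapped_header(header)
```

An empty list of missing columns together with at least one data column suggests the file has the expected shape.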

Running nextflow

Once all the above requirements are satisfied, KSTAR can be run using a simple bash script within the nextflow directory that looks like the one below:

export KSTAR_DIR=/repo_loc/KSTAR/nextflow
cd $KSTAR_DIR

nextflow run main.nf -profile Y \
--name example \
--phospho_event Y \
--outdir ./results \
--experiment_file data/example_data_mapped.tsv \
--resource_directory ../RESOURCE_FILES \
--network_directory /NETWORKS/NetworKIN \
--data_columns data:Column1,data:Column2,data:Column3,data:Column4 \
--num_random_experiments 150 \
--threshold 0.5 \
--activity_aggregate mean \
--fpr_alpha 0.05 \
--number_of_sig_trials 100

The nextflow run line of the bash script starts the pipeline, and its -profile option indicates which type of phosphorylation is of interest (either pY or pST). The remaining lines are parameters accepted by the pipeline, which either indicate where to find/deposit files or are KSTAR parameters. See the table below for a description of the accepted parameters:
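For those who prefer to launch the run from Python, the same command can be assembled from a dictionary. This is a sketch under assumptions: build_nextflow_command is a hypothetical helper, not part of KSTAR or nextflow, and it only builds the argument list. Because a dictionary cannot hold the same key twice, each parameter is passed only once.

```python
def build_nextflow_command(profile, params, nextflow_options=()):
    """Return the argument list for `nextflow run main.nf`."""
    command = ["nextflow", "run", "main.nf", "-profile", profile]
    for key, value in params.items():
        # Pipeline parameters take a double dash and one value each.
        command += [f"--{key}", str(value)]
    # Options for nextflow itself (single dash) are appended last.
    command += list(nextflow_options)
    return command


# Values copied from the bash script above.
example_params = {
    "name": "example",
    "phospho_event": "Y",
    "experiment_file": "data/example_data_mapped.tsv",
    "outdir": "./results",
    "resource_directory": "../RESOURCE_FILES",
    "num_random_experiments": 150,
    "threshold": 0.5,
}
print(" ".join(build_nextflow_command("Y", example_params)))
```

The returned list can be handed to subprocess.run(command, cwd=..., check=True), with cwd set to the nextflow directory of the repository.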

Available Parameters
Parameter	Description	Required	Default
experiment_file	full filename of the experiment	Yes	None
name	the name of the experiment. This will be used to name folders/files outputted by KSTAR	No	kstar
phospho_event	Type of phosphorylation network to analyze. Either Y or ST.	Yes	FALSE
num_random_experiments	The number of random experiments to generate for use in normalization and Mann Whitney tests	Yes	150
number_of_sig_trials	The number of significance trials to perform for the Mann Whitney tests	Yes	FALSE
resource_directory	directory where all necessary resources can be found. Should contain the network directory	Yes	FALSE
network_directory	directory where all networks to be used in the analysis can be found	No	NETWORKS/NetworKIN
outdir	directory where results should be deposited	No	./results
threshold	The cutoff value to use for determining whether a site is used as evidence for a particular sample	No	1
data_columns	list of pre-mapped data columns to analyze. Must follow these rules: 1) names cannot include parentheses, spaces, or commas 2) different data columns are separated by a comma with no space in between 3) if left blank or false, all columns that start with ‘data:’ are analyzed.	No	FALSE
add_data_before	boolean, where if true, ‘data:’ is added in front of each column name	No	FALSE
activity_aggregate	aggregate for binarizing the experiments. Choices include count, mean, median, max, min	No	count
fpr_alpha	Desired false positive rate	No	0.05
greater	Boolean value that indicates whether sites above or below the threshold are used as evidence. If True, sites greater than the threshold will be used as evidence	No	TRUE
chunk_size	chunk size to use in hypergeometric calculation, adjust this to help with memory issues	No	10
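The naming rules for data_columns are easy to break by accident, so a small check before building the command may be useful. The sketch below rejects names containing parentheses, spaces, or commas and joins the rest with commas and no spaces. format_data_columns is a hypothetical helper, not part of the pipeline.

```python
FORBIDDEN_CHARACTERS = set("() ,")


def format_data_columns(columns):
    """Join column names into the comma-separated form used by data_columns."""
    for column in columns:
        bad = sorted(FORBIDDEN_CHARACTERS.intersection(column))
        if bad:
            raise ValueError(f"{column!r} contains forbidden characters: {bad}")
    # No space after the comma, as the parameter description requires.
    return ",".join(columns)
```

For example, format_data_columns(["data:Column1", "data:Column2"]) gives a string that can be passed directly as the value of data_columns, while a name such as "data:Time (min)" raises a ValueError.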

Output of a nextflow run

Each run with nextflow will deposit all of its results in the folder indicated by the outdir parameter. There will be separate folders for tyrosine and serine/threonine results if both types of runs are performed. Within the out directory, there will be the following subfolders, most of which mirror what is saved in the pickle object in the base KSTAR algorithm:

Nextflow Run Outputs
Subfolder	Description	Files within folder
random_hypergeometric_activity	Contains the predicted activities for each of the random experiments	example_random_activities.tsv, example_aggregated_activities.tsv
pipeline_info	Contains all information about the nextflow run, including run time, memory used, and cpus used	pipeline_dag.svg, execution_trace.txt, execution_timeline.html, execution_report.html
normalized_activity	Contains activity and fpr predictions after original predictions have been normalized relative to the median p-values obtained from the random experiments	example_normalized_aggregate_activity.tsv, example_normalized_activities.tsv
mann_whitney	Contains activity and fpr predictions obtained via the Mann Whitney statistical test, comparing the experiment p-values to the random experiments. These are the activity/fpr values we recommend using for analysis	example_mann_whitney_fpr.tsv, example_mann_whitney_activities.tsv, mann_whitney_combined
individual_experiments	This contains all of the random datasets generated for use in getting both the normalized and Mann Whitney activities	folders containing results for each individual experiment/sample
hypergeometric_activity	Contains activity predictions prior to accounting for the expected p-values due to random chance	example_aggregated_activities.tsv, example_activities_list.tsv, example_activities.tsv
binary_experiment	Contains the mapped data after the evidence columns have been binarized based on the given threshold	example_binarized_experiment.tsv
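Since the Mann Whitney files are the recommended results, a short helper to locate them after a run may be convenient. This sketch searches the output directory recursively, which is an assumption made so that it works whether or not a run adds an extra folder level. find_mann_whitney_results is a hypothetical name, and the file names follow the pattern shown in the table above.

```python
from pathlib import Path


def find_mann_whitney_results(outdir, name):
    """Map result kind to the matching paths found anywhere under outdir."""
    outdir = Path(outdir)
    wanted = {
        "activities": f"{name}_mann_whitney_activities.tsv",
        "fpr": f"{name}_mann_whitney_fpr.tsv",
    }
    # rglob searches all subfolders, so the exact nesting does not matter.
    return {kind: sorted(outdir.rglob(filename)) for kind, filename in wanted.items()}
```

An empty list for either key means the corresponding file was not found, which usually indicates that the run did not finish or that name does not match the value given to the pipeline.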

Running nextflow on Singularity

It is also possible to implement the above using singularity containers instead of docker containers. If singularity is already installed, only a small addition at the bottom of the bash script is needed to use singularity instead of docker.

export KSTAR_DIR=/repo_loc/KSTAR/nextflow
cd $KSTAR_DIR

nextflow run main.nf -profile Y \
--name example \
--phospho_event Y \
--outdir ./results \
--experiment_file data/example_data_mapped.tsv \
--resource_directory ../RESOURCE_FILES \
--network_directory /NETWORKS/NetworKIN \
--data_columns data:Column1,data:Column2,data:Column3,data:Column4 \
--num_random_experiments 150 \
--threshold 0.5 \
--activity_aggregate mean \
--fpr_alpha 0.05 \
--number_of_sig_trials 100 \
-with-singularity \
-without-docker