KSTAR in Parallel

The base implementation of KSTAR uses only a single processor, which can be time-consuming, particularly for datasets with many sites and/or many samples. KSTAR can also be run in parallel, either within your Python environment or using a workflow software package called nextflow. See the following sections for more details.

Option 1: Running KSTAR using Multiprocessing

Within Python, several KSTAR functions can be run in parallel using the multiprocessing module. For these functions, switching to multiprocessing only requires changing the ‘PROCESSES’ parameter from 1 (the default) to the number of processes you would like to run in parallel.

# Activity Calculation
kinact_dict = calculate.run_kstar_analysis(experiment, activity_log, networks, PROCESSES = 4)

# Normalization
calculate.normalize_analysis(kinact_dict, activity_log, num_random_experiments, target_alpha, PROCESSES = 4)

# Mann Whitney Calculation
calculate.Mann_Whitney_analysis(kinact_dict, activity_log, number_sig_trials = 100, PROCESSES = 4)

While this strategy helps to improve the speed of analysis, it can be very memory intensive for large datasets. For large tyrosine datasets and most serine/threonine datasets, we recommend running KSTAR using nextflow, described in the following section.
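When choosing a value for PROCESSES, it can help to cap it at the number of cores actually available. The sketch below shows one way to do that. The helper choose_processes and its reserve argument are hypothetical and not part of KSTAR; the only KSTAR-specific name is the PROCESSES parameter described above.

```python
import os


def choose_processes(requested, available_cpus=None, reserve=1):
    """Cap a requested PROCESSES value at the cores available.

    `reserve` cores are left free for the rest of the system
    (an assumption for this sketch, not a KSTAR recommendation).
    """
    if available_cpus is None:
        # os.cpu_count() can return None when the count is unknown
        available_cpus = os.cpu_count() or 1
    usable = max(1, available_cpus - reserve)
    return max(1, min(requested, usable))


# Value that could then be passed as PROCESSES to the KSTAR functions above
PROCESSES = choose_processes(4)
```

Because memory use tends to grow with the number of worker processes, a smaller value than the cap may still be the better choice for large datasets.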

Option 2: Running Large Datasets with Nextflow

While our standard implementation of KSTAR can be run on most phosphotyrosine datasets and some small phosphoserine/threonine datasets, its memory and time costs are often too high for large datasets. For these cases, we have implemented a highly parallel version of KSTAR using the nextflow software package. The remainder of this section details how to install and run KSTAR with nextflow.

Requirements

The nextflow pipeline runs KSTAR inside either Docker or Singularity containers. These are compatible with POSIX operating systems (Linux, OS X, etc.). Nextflow can also be run on a Windows machine through WSL2; Microsoft provides instructions for installing and running WSL2 here: WSL Documentation. For other details about nextflow, please see their documentation here: Nextflow Documentation

This implementation is best suited for high-performance computing environments. We suggest a minimum of 8 CPU cores and 16 GB of available memory.
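A quick way to compare a machine against these suggested minimums is sketched below. The function names are hypothetical, and the memory query relies on os.sysconf, which is only available on POSIX systems. Note that it reports total physical memory, which may be more than the memory actually available to a job.

```python
import os

MIN_CPUS = 8        # suggested minimum number of CPU cores
MIN_MEMORY_GB = 16  # suggested minimum memory, in GB


def total_memory_gb():
    """Total physical memory in GB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing on Windows, and some names are not
        # defined on every platform
        return None
    return page_size * page_count / 1024**3


def meets_suggested_minimum(cpus, memory_gb):
    """True if both the core count and the memory reach the suggested minimums."""
    return cpus >= MIN_CPUS and memory_gb is not None and memory_gb >= MIN_MEMORY_GB


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    print(cpus, total_memory_gb(), meets_suggested_minimum(cpus, total_memory_gb()))
```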

Installation

Before running KSTAR with nextflow, Docker needs to be installed, because the nextflow implementation uses a Docker container of the KSTAR algorithm. For details on how to install Docker, visit Install Docker.

First, we recommend downloading KSTAR from GitHub into an easily accessible folder:

git clone https://github.com/NaegleLab/KSTAR.git

Next, set up a conda virtual environment and install nextflow:

conda create -n kstar
conda activate kstar
conda install -c bioconda nextflow

Setting up KSTAR for Nextflow

Download Resource Files

As with the standard implementation, the nextflow implementation of KSTAR requires a reference proteome and phosphoproteome. These can be downloaded using the config.install_resource_files() function within the python interpreter (from the main KSTAR directory). It will install the necessary files to the default location, which is the RESOURCE_FILES directory in the repository.

Downloading or Generating Networks used in KSTAR

As with the standard implementation of KSTAR, you will need to obtain KSTAR pruned networks that will be used in activity calculation. You can download KSTAR networks from the Network FigShare. Unlike the standard implementation, network pickles do not need to be generated, as nextflow will operate on the individual network files.

If choosing to generate your own networks, this should be done outside of the nextflow environment as described in the tutorial. You will need to make sure the directory structure where these networks are located is correct: ‘{resource_directory}/{network_directory}/{phospho_event}/INDIVIDUAL_NETWORKS’, where elements in brackets are given as parameters during a nextflow run. Each individual pruned network should be a .tsv file located in the above directory.
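The expected layout can be checked before starting a run. The sketch below builds the path from the three parameters named above and lists the .tsv network files it finds. The helper names are hypothetical and not part of KSTAR.

```python
from pathlib import Path


def individual_network_dir(resource_directory, network_directory, phospho_event):
    """Build {resource_directory}/{network_directory}/{phospho_event}/INDIVIDUAL_NETWORKS."""
    # Strip a leading slash so the network directory is always treated as
    # relative to the resource directory (pathlib would otherwise discard
    # the resource directory when joining an absolute path).
    relative_network = str(network_directory).lstrip("/")
    return Path(resource_directory) / relative_network / phospho_event / "INDIVIDUAL_NETWORKS"


def list_network_files(resource_directory, network_directory, phospho_event):
    """Return the sorted .tsv files in the individual networks directory."""
    net_dir = individual_network_dir(resource_directory, network_directory, phospho_event)
    if not net_dir.is_dir():
        raise FileNotFoundError(f"Expected network directory not found: {net_dir}")
    return sorted(net_dir.glob("*.tsv"))
```

For example, list_network_files("../RESOURCE_FILES", "NETWORKS/NetworKIN", "Y") would look in ../RESOURCE_FILES/NETWORKS/NetworKIN/Y/INDIVIDUAL_NETWORKS.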

In either case, make sure to place the networks in the same directory as the resource files (the RESOURCE_FILES folder created when running config.install_resource_files()).

Mapping datasets

Datasets must be mapped prior to running KSTAR with nextflow, because the nextflow implementation is strictly for activity prediction. The mapped data should be a .tsv file that contains, at a minimum, the data columns and the KSTAR_ACCESSION, KSTAR_PEPTIDE, and KSTAR_SITE columns.
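Before launching a run, the header of the mapped file can be checked for these columns. The sketch below uses only the standard library. The function names are hypothetical, and the ‘data:’ prefix for data columns is an assumption based on the naming used by the data_columns parameter.

```python
import csv

REQUIRED_COLUMNS = ("KSTAR_ACCESSION", "KSTAR_PEPTIDE", "KSTAR_SITE")


def read_header(tsv_path):
    """Return the column names from the first line of a tab-separated file."""
    with open(tsv_path, newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))


def check_mapped_header(header, data_prefix="data:"):
    """Return (missing required columns, data columns found) for a header."""
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    data_columns = [col for col in header if col.startswith(data_prefix)]
    return missing, data_columns
```

A file is ready to use when the first returned list is empty and the second contains the columns you intend to analyze.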

Running nextflow

Once all the above requirements are satisfied, KSTAR can be run using a simple bash script within the nextflow directory that looks like the one below:

# Location of the nextflow directory inside the cloned KSTAR repository
export KSTAR_DIR=/repo_loc/KSTAR/nextflow
cd "$KSTAR_DIR"

nextflow run main.nf -profile Y \
--name example \
--phospho_event Y \
--experiment_file data/example_data_mapped.tsv \
--outdir ./results \
--resource_directory ../RESOURCE_FILES \
--network_directory /NETWORKS/NetworKIN \
--data_columns data:Column1,data:Column2,data:Column3,data:Column4 \
--num_random_experiments 150 \
--threshold 0.5 \
--activity_aggregate mean \
--fpr_alpha 0.05 \
--number_of_sig_trials 100

The nextflow run line of the bash script starts the pipeline, and its -profile option indicates which type of phosphorylation is of interest (either pY or pST). The remaining lines are parameters passed to the pipeline, which either indicate where to find/deposit files or set KSTAR parameters. See the table below for a description of the accepted parameters:

Available Parameters

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| experiment_file | Full filename of the experiment | Yes | None |
| name | The name of the experiment. This will be used to name folders/files outputted by KSTAR | No | kstar |
| phospho_event | Type of phosphorylation network to analyze. Either Y or ST. | Yes | FALSE |
| num_random_experiments | The number of random experiments to generate for use in normalization and Mann Whitney tests | Yes | 150 |
| number_of_sig_trials | The number of significance trials to perform for the Mann Whitney tests | Yes | FALSE |
| resource_directory | Directory where all necessary resources can be found. Should contain the network directory | Yes | FALSE |
| network_directory | Directory where all networks to be used in the analysis can be found | No | NETWORKS/NetworKIN |
| outdir | Directory where results should be deposited | No | ./results |
| threshold | The cutoff value to use for determining whether a site is used as evidence for a particular sample | No | 1 |
| data_columns | List of pre-mapped data columns to analyze. Must follow these rules: 1) names cannot include parentheses, spaces, or commas 2) different data columns are split by a comma with no space in between 3) if left blank or false, all columns that start with ‘data:’ are analyzed. | No | FALSE |
| add_data_before | Boolean, where if true, ‘data:’ is added in front of each column name | No | FALSE |
| activity_aggregate | Aggregate for binarizing the experiments. Choices include count, mean, median, max, min | No | count |
| fpr_alpha | Desired false positive rate | No | 0.05 |
| greater | Boolean value that indicates whether sites above or below the threshold are used as evidence. If True, sites greater than the threshold will be used as evidence | No | TRUE |
| chunk_size | Chunk size to use in hypergeometric calculation. Adjust this to help with memory issues | No | 10 |
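The naming rules for data_columns are easy to violate by accident, so it can be useful to build the value programmatically. The sketch below joins column names with commas and rejects names containing parentheses, spaces, or commas. The helper format_data_columns is hypothetical and not part of KSTAR or the pipeline.

```python
FORBIDDEN_CHARACTERS = set("(), ")  # parentheses, commas, and spaces


def format_data_columns(columns):
    """Join column names into the comma-separated form used by --data_columns.

    Raises ValueError if any name contains a forbidden character.
    """
    for name in columns:
        bad = sorted(FORBIDDEN_CHARACTERS.intersection(name))
        if bad:
            raise ValueError(f"Column name {name!r} contains forbidden characters: {bad}")
    # Columns are separated by a comma with no space in between
    return ",".join(columns)
```

For example, format_data_columns(["data:Column1", "data:Column2"]) returns the string data:Column1,data:Column2, while a name such as "data:Column 1" raises an error.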

Output of a nextflow run

Each nextflow run deposits all of its results in the folder indicated by the outdir parameter. There will be separate folders for tyrosine and serine/threonine results if both types of runs are performed. Within the output directory, there will be the following subfolders, most of which mirror what is saved in the pickle object in the base KSTAR algorithm:

Nextflow Run Outputs

| Subfolder | Description | Files within folder |
| --- | --- | --- |
| random_hypergeometric_activity | Contains the predicted activities for each of the random experiments | example_random_activities.tsv, example_aggregated_activities.tsv |
| pipeline_info | Contains all information about the nextflow run, including run time, memory used, and cpus used | pipeline_dag.svg, execution_trace.txt, execution_timeline.html, execution_report.html |
| normalized_activity | Contains activity and fpr predictions after original predictions have been normalized relative to the median p-values obtained from the random experiments | example_normalized_aggregate_activity.tsv, example_normalized_activities.tsv |
| mann_whitney | Contains activity and fpr predictions obtained via the Mann Whitney statistical test, comparing the experiment p-values to the random experiments. These are the activity/fpr values we recommend using for analysis | example_mann_whitney_fpr.tsv, example_mann_whitney_activities.tsv, mann_whitney_combined |
| individual_experiments | This contains all of the random datasets generated for use in getting both the normalized and Mann Whitney activities | folders containing results for each individual experiment/sample |
| hypergeometric_activity | Contains activity predictions prior to accounting for the expected p-values due to random chance | example_aggregated_activities.tsv, example_activities_list.tsv, example_activities.tsv |
| binary_experiment | Contains the mapped data after the evidence columns have been binarized based on the given threshold | example_binarized_experiment.tsv |
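Since the Mann Whitney results are the recommended values, a short sketch for locating and reading them follows. It assumes that the file prefix (example in the listing above) matches the name parameter, and that results_dir is the folder that directly contains the subfolders listed above. The function names are hypothetical, and nothing is assumed about the column layout of the files.

```python
import csv
from pathlib import Path


def mann_whitney_paths(results_dir, name):
    """Return the expected Mann Whitney activity and fpr file paths."""
    folder = Path(results_dir) / "mann_whitney"
    return {
        "activities": folder / f"{name}_mann_whitney_activities.tsv",
        "fpr": folder / f"{name}_mann_whitney_fpr.tsv",
    }


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, read_tsv(mann_whitney_paths("./results", "example")["fpr"]) would load the fpr table, provided the run has completed and the folder layout matches the assumption above.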

Running nextflow on Singularity

It is also possible to run the above using Singularity containers instead of Docker containers. If Singularity is already installed, only a small addition at the bottom of the bash script is needed to use Singularity instead of Docker.

# Location of the nextflow directory inside the cloned KSTAR repository
export KSTAR_DIR=/repo_loc/KSTAR/nextflow
cd "$KSTAR_DIR"

nextflow run main.nf -profile Y \
--name example \
--phospho_event Y \
--experiment_file data/example_data_mapped.tsv \
--outdir ./results \
--resource_directory ../RESOURCE_FILES \
--network_directory /NETWORKS/NetworKIN \
--data_columns data:Column1,data:Column2,data:Column3,data:Column4 \
--num_random_experiments 150 \
--threshold 0.5 \
--activity_aggregate mean \
--fpr_alpha 0.05 \
--number_of_sig_trials 100 \
-with-singularity \
-without-docker
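Because the Docker and Singularity scripts differ only in their trailing flags, the command can also be assembled programmatically. The sketch below builds the argument list in Python using only the options shown in the scripts above. The helper build_nextflow_command is hypothetical and not part of KSTAR.

```python
def build_nextflow_command(profile, params, use_singularity=False):
    """Build the argument list for a KSTAR nextflow run.

    `params` maps pipeline parameter names (without dashes) to values.
    """
    command = ["nextflow", "run", "main.nf", "-profile", profile]
    for key, value in params.items():
        # Pipeline parameters take a double dash; nextflow's own options
        # (such as -profile) take a single dash
        command += [f"--{key}", str(value)]
    if use_singularity:
        command += ["-with-singularity", "-without-docker"]
    return command


example_params = {
    "name": "example",
    "phospho_event": "Y",
    "experiment_file": "data/example_data_mapped.tsv",
    "outdir": "./results",
    "resource_directory": "../RESOURCE_FILES",
    "num_random_experiments": 150,
    "number_of_sig_trials": 100,
}
command = build_nextflow_command("Y", example_params, use_singularity=True)
```

The resulting list can be passed to subprocess.run(command, cwd=..., check=True), with cwd set to the nextflow directory of the repository. Passing a list rather than a single string avoids shell quoting problems with file paths.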