KSTAR in Parallel
=================

The base implementation of KSTAR runs on a single processor. This can be time consuming, particularly for datasets with many sites and/or many samples. KSTAR can also be run as a parallel process, either within your Python environment or using a software package called nextflow. See the following sections for more details.

Option 1: Running KSTAR using Multiprocessing
---------------------------------------------

Within Python, several KSTAR functions can be run in parallel using the multiprocessing module. For these functions, switching to multiprocessing only requires changing the ``PROCESSES`` parameter from 1 (default) to the number of processes you would like to run in parallel.

.. code-block:: python

    # Assumes KSTAR is installed and importable as the kstar package, and that
    # experiment, activity_log, and networks were set up as in the standard workflow
    from kstar import calculate

    # Activity Calculation
    kinact_dict = calculate.run_kstar_analysis(experiment, activity_log, networks, PROCESSES = 4)

    # Normalization
    calculate.normalize_analysis(kinact_dict, activity_log, num_random_experiments, target_alpha, PROCESSES = 4)

    # Mann Whitney Calculation
    calculate.Mann_Whitney_analysis(kinact_dict, activity_log, number_sig_trials = 100, PROCESSES = 4)

While this strategy improves the speed of analysis, it can be very memory intensive for large datasets. For large tyrosine datasets and most serine/threonine datasets, we recommend running KSTAR using nextflow, described in the following section.

Option 2: Running Large Datasets with Nextflow
----------------------------------------------

While our standard implementation of KSTAR can be run on most phosphotyrosine datasets and some small phosphoserine/threonine datasets, the memory and time costs are often too high for large datasets. For these cases, we provide a highly parallel version of KSTAR built with the nextflow software package. The remainder of this section details how to install and run KSTAR with nextflow.


Requirements
^^^^^^^^^^^^

The nextflow pipeline takes advantage of either Docker or Singularity containers. These are compatible with POSIX operating systems (Linux, OS X, etc.). However, nextflow can also be run on a Windows machine using WSL2. Microsoft provides instructions for installing and running WSL2 here: `WSL Documentation `_. For other details about nextflow, please see their documentation here: `Nextflow Documentation `_

This implementation is best suited for high performance computing environments. We suggest a minimum of 8 CPU cores and 16 GB of available memory.

Installation
^^^^^^^^^^^^

Before running KSTAR with nextflow, Docker will need to be installed, because the nextflow implementation uses a Docker container of the KSTAR algorithm. For details on how to install Docker, visit `Install Docker `_.

First, it is recommended that you download KSTAR from GitHub into an easily accessible folder:

.. code-block:: bash

    git clone https://github.com/NaegleLab/KSTAR.git

Next, set up a conda virtual environment and install nextflow:

.. code-block:: bash

    conda create -n kstar
    conda activate kstar
    conda install -c bioconda nextflow

Setting up KSTAR for Nextflow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Download Resource Files
"""""""""""""""""""""""

As with the standard implementation, the nextflow implementation of KSTAR requires a reference proteome and phosphoproteome. These can be downloaded using the ``config.install_resource_files()`` function within the Python interpreter (from the main KSTAR directory). It will install the necessary files to the default location, which is the RESOURCE_FILES directory in the repository.

Downloading or Generating Networks used in KSTAR
""""""""""""""""""""""""""""""""""""""""""""""""

As with the standard implementation of KSTAR, you will need to obtain the pruned KSTAR networks that will be used in activity calculation. You can download KSTAR networks from the `Network FigShare `_.
Unlike the standard implementation, network pickles do not need to be generated, as nextflow operates on the individual network files. If you choose to generate your own networks, this should be done outside of the nextflow environment as described in the tutorial. You will need to make sure the networks are located in the correct directory structure: ``{resource_directory}/{network_directory}/{phospho_event}/INDIVIDUAL_NETWORKS``, where elements in brackets are given as parameters during a nextflow run. Each individual pruned network should be a .tsv file located in the above directory. In either case, make sure to place the networks in the same directory as the resource files (the RESOURCE_FILES folder created when running ``config.install_resource_files()``).

Mapping datasets
""""""""""""""""

Datasets must be mapped prior to running KSTAR with nextflow, because the nextflow implementation is strictly for activity prediction. The mapped data should be a .tsv file that contains, at a minimum, the data columns and the ``KSTAR_ACCESSION``, ``KSTAR_PEPTIDE``, and ``KSTAR_SITE`` columns.

Running nextflow
^^^^^^^^^^^^^^^^

Once all the above requirements are satisfied, KSTAR can be run using a simple bash script within the nextflow directory that looks like the one below:

.. code-block:: bash

    # Point to the nextflow folder of the cloned KSTAR repository
    # (no spaces are allowed around "=" in a bash assignment)
    export KSTAR_DIR=/repo_loc/KSTAR/nextflow
    cd "$KSTAR_DIR"

    nextflow run main.nf -profile Y \
        --name example \
        --phospho_event Y \
        --outdir ./results \
        --experiment_file data/example_data_mapped.tsv \
        --resource_directory ../RESOURCE_FILES \
        --network_directory /NETWORKS/NetworKIN \
        --data_columns data:Column1,data:Column2,data:Column3,data:Column4 \
        --num_random_experiments 150 \
        --threshold 0.5 \
        --activity_aggregate mean \
        --fpr_alpha 0.05 \
        --number_of_sig_trials 100

The first line of the ``nextflow run`` command starts the pipeline and indicates which type of phosphorylation event is of interest (either pY or pST).
The remaining lines are parameters accepted by nextflow, which either indicate where to find or deposit files or are KSTAR parameters. The table below describes the parameters accepted by nextflow:

.. csv-table:: Available Parameters
    :file: ./parameter_table.csv
    :widths: 20, 50, 10, 20
    :header-rows: 1

Output of a nextflow run
""""""""""""""""""""""""

Each nextflow run deposits all of its results in the folder indicated by the ``outdir`` parameter. There will be separate folders for tyrosine and serine/threonine results if both types of runs are performed. Within the output directory, there will be the following subfolders, most of which mirror what is saved in the pickle object in the base KSTAR algorithm:

.. csv-table:: Nextflow Run Outputs
    :file: ./output_table.csv
    :widths: 20, 100, 10
    :header-rows: 1

Running nextflow on Singularity
"""""""""""""""""""""""""""""""

It is also possible to run the above using Singularity containers instead of Docker containers. If Singularity is already installed, only a small addition at the bottom of the bash script is needed to use Singularity instead of Docker.

.. code-block:: bash

    # Point to the nextflow folder of the cloned KSTAR repository
    export KSTAR_DIR=/repo_loc/KSTAR/nextflow
    cd "$KSTAR_DIR"

    nextflow run main.nf -profile Y \
        --name example \
        --phospho_event Y \
        --outdir ./results \
        --experiment_file data/example_data_mapped.tsv \
        --resource_directory ../RESOURCE_FILES \
        --network_directory /NETWORKS/NetworKIN \
        --data_columns data:Column1,data:Column2,data:Column3,data:Column4 \
        --num_random_experiments 150 \
        --threshold 0.5 \
        --activity_aggregate mean \
        --fpr_alpha 0.05 \
        --number_of_sig_trials 100 \
        -with-singularity -without-docker
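
Before launching either the Docker or the Singularity run, it can help to confirm that the inputs described in the setup sections are in place. The following is a minimal sketch and is not part of KSTAR itself: the function names ``missing_columns`` and ``list_networks`` are hypothetical, and stripping the leading slash from the network directory is an assumption about how the pipeline joins it to the resource directory. It checks that a mapped file has the three required KSTAR columns and lists the .tsv files found under the expected network path.

.. code-block:: python

    import csv
    from pathlib import Path

    # Columns the mapped dataset must contain, per the "Mapping datasets" section
    REQUIRED_COLUMNS = {"KSTAR_ACCESSION", "KSTAR_PEPTIDE", "KSTAR_SITE"}


    def missing_columns(experiment_file):
        """Return the required KSTAR columns absent from a mapped .tsv header."""
        with open(experiment_file, newline="") as handle:
            # Only the header row is needed; an empty file yields an empty header
            header = next(csv.reader(handle, delimiter="\t"), [])
        return sorted(REQUIRED_COLUMNS - set(header))


    def list_networks(resource_directory, network_directory, phospho_event):
        """List individual network .tsv files under the expected directory layout."""
        # Assumption: network_directory is interpreted relative to
        # resource_directory, so any leading slash (as in /NETWORKS/NetworKIN)
        # is stripped before joining.
        network_path = (
            Path(resource_directory)
            / network_directory.strip("/")
            / phospho_event
            / "INDIVIDUAL_NETWORKS"
        )
        return sorted(network_path.glob("*.tsv"))

An empty list from ``missing_columns`` means all three required columns are present. An empty list from ``list_networks`` suggests that the directory layout or the ``resource_directory``, ``network_directory``, and ``phospho_event`` values need another look.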