Chapter 2 How to run RIMA
“Please make sure you adhere to NIH and HIPAA security standards when uploading controlled-access data to a cloud or high-performance computing environment.”
2.1 Install RIMA and Set Up the Running Environment
To run RIMA, you will create a working directory in which you will install the pipeline code and the other required software.
Running machine requirements (a quick way to check them is shown after this list)
- Cores and memory – We usually run RIMA on a machine with at least 64 GB of memory to ensure a smooth read alignment run using STAR.
- Storage – You will need to reserve enough storage for the following:
  1. Reference files and pipeline code – 65 GB
  2. Raw data – we recommend reserving five times the size of your data files. For example, if your fastq files are 50 GB each, allot 250 GB per fastq file. Zipped fastq files are recommended.
  3. Required software – reserve ~20 GB for the installation of miniconda, mamba, and snakemake.
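Before installing anything, you can quickly confirm that your machine meets these requirements with standard Linux commands (shown here as a convenience; the exact output format varies by system):
nproc    # number of available CPU cores
free -h  # total and available memory
df -h .  # free disk space on the current filesystem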
Prepare a working directory
mkdir RIMA
Get the pipeline code
Download the RIMA_pipeline folder to your working directory.
git clone https://github.com/liulab-dfci/RIMA_pipeline.git
Install miniconda
Download and configure your conda environment. Follow the prompts on the installer screens.
#Note: You will need to install miniconda3 into the RIMA directory. This means you need to override the default directory when miniconda3 is installed.
#If you have conda installed elsewhere, you will still need a local copy in your RIMA directory in order for the environment installation shell program to work. (see below)
#go to your RIMA directory if you are not there already
cd RIMA
#obtain the Miniconda3 code and install in the RIMA directory
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# !Don't forget to override the default directory for miniconda3!
# Install miniconda3 into your RIMA directory.
conda list #to see whether you have successfully installed conda
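If you want to confirm that the directory override worked, you can also call the newly installed binary by its full path (this assumes miniconda3 was installed inside the RIMA directory, as recommended above):
./miniconda3/bin/conda --version  # prints the version of the conda you just installed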
Set up the running environment
The RIMA environment can be set up by running a shell command in which you specify the platform you are using – Amazon Web Services (AWS) or Google Cloud Platform (GCP). This process may take ten to fifteen minutes. You will need to answer two questions during the process – one at the beginning of the program and one near the end – to give permission to install miniconda and snakemake.
#NOTE: before running the environment shell command, make sure to set the path for CONDA_ROOT
export CONDA_ROOT=/{path_to_your_RIMA_directory}/RIMA/miniconda3
export PATH=/{path_to_your_RIMA_directory}/RIMA/miniconda3/bin:$PATH
Run the shell program:
#shell command format: specify the platform with -p (AWS or GCP)
bash ./RIMA_environment.sh -p {platform --AWS, GCP}
#example for AWS
bash ./RIMA_environment.sh -p AWS
Activate the RIMA environment
Once the conda environments have been installed, the main environment “RIMA” can be activated as shown below.
#set the path again if this is a new login to your shell
export CONDA_ROOT=/{path_to_your_RIMA_directory}/RIMA/miniconda3
export PATH=/{path_to_your_RIMA_directory}/RIMA/miniconda3/bin:$PATH
# activate the RIMA conda environment
source activate RIMA
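To confirm that the environment is active, you can list your environments and check that RIMA is the one marked with an asterisk (a quick sanity check; snakemake is expected to be available once the RIMA environment is active):
conda env list       # the active environment is marked with *
snakemake --version  # should print a version number from within the RIMA environment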
2.2 Download pre-built references
A pre-built RIMA reference folder can be downloaded using the code below.
If you want to prepare a customized reference, you can follow this tutorial to build your own reference.
The following link contains the hg38 reference downloaded from the Genomic Data Commons, using the version 22 index and annotation files.
cd RIMA
wget http://cistrome.org/~lyang/ref.tar.gz
# unzip the reference
tar -zxvf ref.tar.gz
# remove the reference zip file to save some space (optional)
rm ref.tar.gz
Optional step
If you downloaded your RIMA reference folder into a directory other than RIMA, you have to update the symbolic link for ref_files under the RIMA_pipeline folder. For example, if you downloaded the RIMA reference folder into /mnt/tutorial/ref_files, use the ln command to update the symbolic link:
cd RIMA
# remove the current link of ref_files
rm ref_files
# create a new symbolic link to your reference folder
ln -s /mnt/tutorial/ref_files
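After updating the link, a quick check (assuming the pre-built reference layout referenced in ref.yaml below) confirms that the new path resolves:
ls -l ref_files                # the link should now point to /mnt/tutorial/ref_files
ls ref_files/GRCh38.d1.vd1.fa  # an expected reference file should be reachable through the link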
2.2.1 Video tutorial
2.3 Prepare input files
Metasheet.csv
Metasheet.csv is a comma-delimited file that resides in the RIMA_pipeline folder. It should contain phenotypic information about your samples that can be used for downstream analysis, and it must include two required columns: SampleName and Group.
For this tutorial, RIMA uses demo data from Zhao, J., Chen, A.X., Gartrell, R.D. et al. Immune and genomic correlates of response to anti-PD-1 immunotherapy in glioblastoma. Nat Med 25, 462–469 (2019). https://doi.org/10.1038/s41591-019-0349-y . We selected 12 samples from this trial, including 6 treatment-naive samples (3 responders and 3 non-responders) for the cohort-level comparison.
- SampleName: Sequencing sample id.
- PatName: Patient id.
- Group: Group label used for the cohort-level comparison. ‘NA’ indicates samples that should not be included in the comparison. In the example below, we compare responders (R) and non-responders (NR) using only treatment-naive (Pre) patients.
You can add columns to the metasheet in order to compare other phenotypes, e.g., columns for Timing, Age, Sex, etc. A quick way to check the format of your metasheet is shown after the example below.
SampleName,PatName,Group,Age,Tissue,Gender,Timing,Responder,TumorLocation,OngoingTreatment,PFS,Survival,OS,SampleId,syn_batch
SRR8281218,P20,NR,63,Brain,male,Pre,NR,temporal,no,110,1,278,4790-NL-AS,1
SRR8281219,P20,NA,63,Brain,male,Post,NR,temporal,no,110,1,278,4975-NL-AS,1
SRR8281226,P53,NR,70,Brain,male,Pre,NR,frontal,Yes,83,1,337,3981-NL-AS,2
SRR8281236,P53,NA,70,Brain,male,Post,NR,frontal,Yes,83,1,337,4760-D1,2
SRR8281233,P56,NR,54,Brain,female,Pre,NR,Parietal,no,83,1,83,4341-D3,3
SRR8281230,P56,NA,54,Brain,female,Post,NR,Parietal,no,83,1,83,4680-A1,3
SRR8281244,P100,NA,31,Brain,male,Post,R,Frontal-crossesmidline,yes,151,0,615,4956-NL-AS,2
SRR8281245,P100,R,31,Brain,male,Pre,R,Frontal-crossesmidline,yes,151,0,615,4595-G1,2
SRR8281243,P101,R,32,Brain,male,Pre,R,frontal,no,0,1,414,3542-NL-AS,2
SRR8281238,P102,NA,64,Brain,female,Post,R,Temporal-anterior,yes,519,0,539,5094-NL-AS,1
SRR8281251,P101,NA,32,Brain,male,Post,R,frontal,no,0,1,414,4943-A1,2
SRR8281250,P102,R,64,Brain,female,Pre,R,Temporal-anterior,yes,519,0,539,4443-NL-AS,1
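Before running the pipeline, you can do a rough format check on your metasheet with standard shell tools (a simple check that assumes no quoted fields containing commas, and that Group is the third column as in the example above):
awk -F',' '{print NF}' metasheet.csv | sort -u  # every row should report the same number of fields
cut -d',' -f3 metasheet.csv | sort | uniq -c    # distribution of Group labels (third column here)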
Config.yaml
In the RIMA_pipeline folder, we have prepared a config.yaml as a template to run the pipeline. The config file is divided into three sections:
- Fixed and user-defined parameters.
- Cohort-level analysis parameters.
- Samples list.
You should provide these parameters with column names that match the columns in metasheet.csv.
Below is an example of a config.yaml file. You may use the patient name as the sample name, but make sure that sample names match the metasheet. Currently, only fastq files (including fastq.gz) are accepted as input. A sketch for generating the samples list automatically from a directory of fastq files is shown after the example.
#########Fixed and user-defined parameters################
metasheet: metasheet.csv # Meta info
ref: ref.yaml # Reference config
assembly: hg38
cancer_type: GBM #TCGA cancer type abbreviations
rseqc_ref: house_keeping #Option: 'house_keeping' or 'false'.
#By default, a subset of housekeeping genes is used by RSeQC to assess alignment quality.
#This reduces the amount of time needed to run RSeQC.
mate: [1,2] #paired-end fastq format, we recommend naming paired-end reads with _1.fq.gz and _2.fq.gz
#########Cohort level analysis parameters################
design: Group # Condition on which to do comparison (as set up in metasheet.csv)
Treatment: R # Treatment group used in DESeq2, corresponding to positive log fold change
Control: NR # Control group used in DESeq2, corresponding to negative log fold change
batch: syn_batch # Options: 'false' or a column name from the metasheet.csv.
# If set to a column name in the metasheet.csv, the column name will be used for batch effect analysis (limma).
# It will also be used as a covariate for differential analysis (DESeq2) to account for batch effect.
pre_treated: false # Option: true or false.
# If set to false, patients are treatment naive.
# If set to true, patients have received some form of therapy prior to the current study.
#########Samples list##############
samples:
  SRR8281218:
    - /mnt/zhao_trial/data/SRR8281218_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281218_2.fastq.gz
  SRR8281219:
    - /mnt/zhao_trial/data/SRR8281219_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281219_2.fastq.gz
  SRR8281226:
    - /mnt/zhao_trial/data/SRR8281226_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281226_2.fastq.gz
  SRR8281236:
    - /mnt/zhao_trial/data/SRR8281236_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281236_2.fastq.gz
  SRR8281230:
    - /mnt/zhao_trial/data/SRR8281230_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281230_2.fastq.gz
  SRR8281233:
    - /mnt/zhao_trial/data/SRR8281233_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281233_2.fastq.gz
  SRR8281244:
    - /mnt/zhao_trial/data/SRR8281244_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281244_2.fastq.gz
  SRR8281245:
    - /mnt/zhao_trial/data/SRR8281245_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281245_2.fastq.gz
  SRR8281243:
    - /mnt/zhao_trial/data/SRR8281243_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281243_2.fastq.gz
  SRR8281251:
    - /mnt/zhao_trial/data/SRR8281251_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281251_2.fastq.gz
  SRR8281238:
    - /mnt/zhao_trial/data/SRR8281238_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281238_2.fastq.gz
  SRR8281250:
    - /mnt/zhao_trial/data/SRR8281250_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281250_2.fastq.gz
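If you have many samples, you can generate the samples block automatically from a directory of paired fastq files rather than typing it by hand. The sketch below assumes the _1.fastq.gz/_2.fastq.gz naming convention used above; the data directory is only an example, so adjust it to your own path and append the output to config.yaml after checking it:
DATA=/mnt/zhao_trial/data   # example data directory; replace with your own
echo "samples:"
for r1 in "$DATA"/*_1.fastq.gz; do
  sample=$(basename "$r1" _1.fastq.gz)
  echo "  ${sample}:"
  echo "    - $r1"
  echo "    - ${r1%_1.fastq.gz}_2.fastq.gz"
done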
execution.yaml
Use execution.yaml to control which modules to run in RIMA. The preprocess module outputs are required by the optional downstream analysis modules. Details of each module will be introduced in the next chapters, and an example of enabling a downstream module after preprocessing is shown after the file listing.
##Note: The preprocess individual and cohort modules are necessary to obtain alignment and quality results.
##Run the remaining modules only after these two modules.
preprocess_individual: true
preprocess_cohort: true
##Optional modules
##Note: The below modules are specialized modules, each dealing with specific targets.
##Make sure to run both the individual and cohort parts of each module
##to get all results.
##Individual runs
mutation_individual: false
immune_response_individual: false
immune_repertoire_individual: false
microbiome_individual: false
neoantigen_individual: false
##Cohort runs
differential_expression_cohort: false
immune_infiltration_cohort: false
mutation_cohort: false
immune_response_cohort: false
immune_repertoire_cohort: false
microbiome_cohort: false
neoantigen_cohort: false
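For example, once the two preprocess modules have completed, you might enable a single downstream module by flipping its flag in execution.yaml and then re-running snakemake as described in section 2.4. A minimal excerpt (the module names are those listed above):
preprocess_individual: true
preprocess_cohort: true
differential_expression_cohort: true   # enable one downstream cohort module; leave the others false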
ref.yaml
The ref.yaml file provides the paths for all reference files. If you downloaded the pre-built reference files, you should not need to change the ref.yaml file. A quick way to verify that the reference paths resolve is shown after the listing.
#REF.yaml - File to contain paths to static reference files that RNA-Seq requires
# for its analysis.
# NOTE: these are already pre-filled for hg38
# NOTE: organized by assemblies
#
# !VERY IMPORTANT: you can OVERRIDE any of these values in config.yaml!
hg38:
  ###annotation path
  fasta_path: ./ref_files/GRCh38.d1.vd1.fa
  gtf_path: ./ref_files/gencode.annotation.gtf
  bed_path: ./ref_files/RSeQC_bed/refseqGenes.bed
  housekeeping_bed_path: ./ref_files/RSeQC_bed/housekeeping_refseqGenes.bed
  trust4_reper_path: ./ref_files/TRUST4/hg38_bcrtcr.fa
  trust4_IMGT_path: ./ref_files/TRUST4/human_IMGT+C.fa
  ###index path
  star_index: ./ref_files/star_index
  star_fusion_index: ./ref_files/fusion_index/ctat_genome_lib_build_dir
  salmon_index: ./ref_files/salmon_index/transcripts.fa
  msisensor_index: ./ref_files/msisensor_index/hg38/microsatellite.list
  centrifuge_index: ./ref_files/centrifuge_index/p_compressed+h+v
  annotation_pyprada: ./ref_files/pyPRADA/gencode.canonical.gene.exons.tab.txt
  ###tool path
  msisensor2_path: ./ref_files/msisensor2
  trust4_path: ./ref_files/TRUST4
  arcasHLA_path: ./ref_files/arcasHLA
  prada_path: ./ref_files/pyPRADA/pyPRADA_1.2
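If you want to verify that the pre-built reference unpacked correctly, a rough check is to pull the ./ref_files paths out of ref.yaml and test that each one exists. This is only a convenience sketch; run it from the folder that contains ref.yaml and the ref_files link, and adjust it if you customized any paths:
# report any reference path listed in ref.yaml that is missing on disk
grep -oE '\./ref_files[^[:space:]]*' ref.yaml | sort -u | while read -r p; do
  [ -e "$p" ] || echo "missing: $p"
done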
2.3.1 Video tutorial
2.4 Running RIMA
Check the path of your conda env
conda env list
# conda environments:
#
base * /home/ubuntu/miniconda3
centrifuge_env /home/ubuntu/miniconda3/envs/centrifuge_env
fusion_env /home/ubuntu/miniconda3/envs/fusion_env
rnaseq /home/ubuntu/miniconda3/envs/rnaseq
rseqc_env /home/ubuntu/miniconda3/envs/rseqc_env
stat_perl_r /home/ubuntu/miniconda3/envs/stat_perl_r
Check the pipeline with a dry run to ensure correct script and data usage.
snakemake -s RIMA.snakefile -np
Submit the job.
Alignment and some of the other RIMA modules take several hours to run, so it is recommended that you run RIMA in the background using a command such as nohup, as shown below.
nohup time snakemake -p -s RIMA.snakefile -j 4 > RIMA.out &
Note: The -j argument sets the number of cores for parallel runs (e.g., ‘-j 4’ can run 4 jobs in parallel at the same time). The -p argument prints the shell command executed by each rule. The output log (RIMA.out here) records the run information; you may run one module at a time to obtain a separate output log for each module.
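While the pipeline runs in the background, you can follow its progress in the log file and confirm that the job is still alive (standard shell commands, shown as a convenience):
tail -f RIMA.out      # follow the run log (Ctrl-C stops following, not the pipeline)
pgrep -af snakemake   # confirm the snakemake process is still running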
2.5 Output
Output folders are generated in a folder called “analysis”. The output structure of each module will be introduced in the next chapters.
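Once a run finishes, you can get a quick overview of what was produced (the exact subfolders depend on which modules you enabled; they are described in the next chapters):
ls analysis/                        # top-level output folders
find analysis -maxdepth 2 -type d   # module subfolders, two levels deep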