Chapter 11 Customize your own reference
RIMA provides a pre-built set of references using GDC hg38 and v22 GENCODE annotation. This set of references can be downloaded as described in chapter 2.2.
We have also pre-built a set of reference files using v27 GENCODE annotation. This set of pre-built references can be downloaded from http://cistrome.org/~lyang/ref_v27.tar.gz using the same instructions provided in chapter 2.
If you wish to build a different set of references, follow the instructions below.
11.1 Reference fasta
Download the human GDC hg38 fasta file from the GDC website.
11.2 Gene annotation file (gtf)
Download the human gtf annotation file from the GENCODE website.
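As a rough illustration of fetching a specific GENCODE release, the sketch below builds the download URL from a release number. The helper name gencode_gtf_url is hypothetical, and the EBI FTP directory layout is an assumption that should be checked against the GENCODE website before use. The block only prints the URL; the download itself is left as a comment.

```shell
#!/usr/bin/env bash
# Sketch: build the download URL for a GENCODE human GTF from a release number.
# ASSUMPTION: the EBI FTP layout used below; confirm it on the GENCODE website.
gencode_gtf_url() {
    local release="$1"   # release number without the "v", e.g. 27 for GENCODE v27
    echo "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_${release}/gencode.v${release}.annotation.gtf.gz"
}

# Print the URL for v27 so it can be reviewed first.
gencode_gtf_url 27

# To download and decompress (not run here):
#   wget "$(gencode_gtf_url 27)" && gunzip gencode.v27.annotation.gtf.gz
```

After decompression the file name matches the gencode.v27.annotation.gtf used in the STAR command in section 11.3.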
11.3 Build STAR index
Build the STAR index using the following code. Make sure to change the file names to indicate which gencode version you are using.
conda activate rna
## STAR Version: STAR_2.6.1d
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ./ref_files/v27_index --genomeFastaFiles GRCh38.d1.vd1.CIDC.fa --sjdbGTFfile gencode.v27.annotation.gtf
...
00:04:54 ..... started STAR run
00:04:54 ... starting to generate Genome files
00:05:57 ... starting to sort Suffix Array. This may take a long time...
00:06:11 ... sorting Suffix Array chunks and saving them to disk...
00:17:43 ... loading chunks from disk, packing SA...
00:19:31 ... finished generating suffix array
00:19:31 ... generating Suffix Array index
00:23:20 ... completed Suffix Array index
00:23:20 ..... processing annotations GTF
00:23:35 ..... inserting junctions into the genome indices
00:26:49 ... writing Genome to disk ...
00:27:06 ... writing Suffix Array to disk ...
00:28:53 ... writing SAindex to disk
00:29:05 ..... finished successfully
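Because the index directory and the gtf file name both carry the GENCODE version, one way to keep them consistent is to derive both from a single variable. The sketch below is only an illustration: the helper name star_index_cmd is hypothetical, and it assumes the gtf file follows the gencode.<version>.annotation.gtf naming used above. It prints the command as a dry run instead of launching STAR.

```shell
#!/usr/bin/env bash
# Sketch: derive the STAR index directory and gtf file name from one version string.
# ASSUMPTION: the gtf is named gencode.<version>.annotation.gtf, as in this chapter.
star_index_cmd() {
    local version="$1"                           # e.g. v27
    local fasta="${2:-GRCh38.d1.vd1.CIDC.fa}"    # reference fasta used in this chapter
    local threads="${3:-16}"
    echo "STAR --runThreadN ${threads} --runMode genomeGenerate" \
         "--genomeDir ./ref_files/${version}_index" \
         "--genomeFastaFiles ${fasta}" \
         "--sjdbGTFfile gencode.${version}.annotation.gtf"
}

# Dry run: print the command for review before executing it.
star_index_cmd v27

# To run it for real (not run here), create the output directory first:
#   mkdir -p ./ref_files/v27_index && eval "$(star_index_cmd v27)"
```

Calling star_index_cmd v22 instead prints the same command with v22_index and gencode.v22.annotation.gtf, so the two names cannot drift apart.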
11.4 RSeQC reference files
Download the human annotation bed files, including the whole-genome bed file and the housekeeping bed file, from the RSeQC page on the SourceForge website.
11.5 Build salmon index
conda activate rna
## salmon Version: salmon 1.1.0
salmon index -t GRCh38.d1.vd1.CIDC.fa -i salmon_index
...
index ["salmon_index"] did not previously exist . . . creating it
[jLog] [info] building index
[jointLog] [info] [Step 1 of 4] : counting k-mers
[jointLog] [info] Replaced 164,553,847 non-ATCG nucleotides
[jointLog] [info] Clipped poly-A tails from 0 transcripts
[jointLog] [info] Building rank-select dictionary and saving to disk
[jointLog] [info] done
Elapsed time: 0.191866s
[jointLog] [info] Writing sequence data to file . . .
[jointLog] [info] done
Elapsed time: 1.91244s
[jointLog] [info] Building 64-bit suffix array (length of generalized text is 3,088,286,426)
[jointLog] [info] Building suffix array . . .
success
saving to disk . . . done
Elapsed time: 18.3072s
done
Elapsed time: 703.843s
11.5.1 GMT file for gene set analysis
The GMT file is downloaded from the BROAD release page. The GMT file we currently use is “c2.cp.kegg.v6.1.symbols.gmt”.
11.6 STAR-Fusion genome resource lib
The genome resource lib is downloaded from the BROAD release page. The lib we currently use is GRCh38_v22_CTAT_lib.
You can also prepare your own genome resource lib for use with STAR-Fusion. For more details, read:
11.7 Centrifuge index
The human Centrifuge index is downloaded from the Centrifuge website. The index we currently use is p_compressed+h+v, which includes the human genome, prokaryotic genomes, and viral genomes.
You can also build your own custom Centrifuge index. For more details, read:
11.8 TRUST4 reference files
TRUST4 requires two reference files:
1. hg38_bcrtcr.fa, a fasta file of TCR and BCR genomic sequences; and
2. human_IMGT+C.fa, a reference database sequence file containing annotation information.
These reference files can be downloaded directly from the TRUST4 GitHub repository.
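Before running the pipeline with a custom reference set, it can help to confirm that the downloaded files are actually in place. The sketch below is one possible check, not part of RIMA: the helper name missing_refs and the ./ref_files/TRUST4 directory are hypothetical and should be adapted to your own layout.

```shell
#!/usr/bin/env bash
# Sketch: report which expected reference files are missing or empty in a directory.
# ASSUMPTION: missing_refs and the ./ref_files/TRUST4 path are illustrative only.
missing_refs() {
    local dir="$1"; shift
    local f missing=0
    for f in "$@"; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "${dir}/${f}" ]; then
            echo "missing or empty: ${dir}/${f}"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the two TRUST4 reference files named above.
missing_refs ./ref_files/TRUST4 hg38_bcrtcr.fa human_IMGT+C.fa \
    || echo "Download the files listed above before running TRUST4."
```

The function returns a non-zero status when any file is absent, so it can also gate a larger setup script.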