Chapter 11 Customize your own reference

RIMA provides a pre-built set of references using GDC hg38 and v22 GENCODE annotation. This set of references can be downloaded as described in chapter 2.2.

We have also pre-built a set of reference files using v27 GENCODE annotation. This set of pre-built references can be downloaded from http://cistrome.org/~lyang/ref_v27.tar.gz using the same instructions provided in chapter 2.

If you wish to build a different set of references, follow the steps below.

11.1 Reference fasta

Download the human GDC hg38 fasta file from the GDC website.

11.2 Gene annotation file (gtf)

Download the human GTF annotation file from the GENCODE website.
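Before building any index, it can help to confirm that the GTF uses the same sequence names as the fasta, since a mismatch (for example `chr1` versus `1`) leaves the annotation unusable. The following is a minimal sketch; the function name `missing_contigs` is hypothetical and the file names in the usage comment are taken from the commands in this chapter.

```shell
# Print every sequence name that appears in the GTF but not in the fasta.
# Usage (file names assumed from this chapter):
#   missing_contigs GRCh38.d1.vd1.CIDC.fa gencode.v27.annotation.gtf
missing_contigs() {
    fasta=$1
    gtf=$2
    awk '
        # First file (fasta): remember each header name up to the first whitespace.
        NR == FNR { if (/^>/) { sub(/^>/, "", $1); seen[$1] = 1 } next }
        # Second file (GTF): skip comment lines.
        /^#/ { next }
        # Report each unknown sequence name once, in order of first appearance.
        !($1 in seen) && !($1 in reported) { reported[$1] = 1; print $1 }
    ' "$fasta" "$gtf"
}
```

An empty output means every sequence named in the GTF is present in the fasta.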

11.3 Build STAR index

Build the STAR index using the following code. Change the file and directory names to match the GENCODE version you are using.

conda activate rna

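## STAR Version: STAR_2.6.1d
## Create the output directory first; older STAR releases expect it to exist.
mkdir -p ./ref_files/v27_index
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ./ref_files/v27_index --genomeFastaFiles GRCh38.d1.vd1.CIDC.fa --sjdbGTFfile gencode.v27.annotation.gtf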
...
00:04:54 ..... started STAR run
00:04:54 ... starting to generate Genome files
00:05:57 ... starting to sort Suffix Array. This may take a long time...
00:06:11 ... sorting Suffix Array chunks and saving them to disk...
00:17:43 ... loading chunks from disk, packing SA...
00:19:31 ... finished generating suffix array
00:19:31 ... generating Suffix Array index
00:23:20 ... completed Suffix Array index
00:23:20 ..... processing annotations GTF
00:23:35 ..... inserting junctions into the genome indices
00:26:49 ... writing Genome to disk ...
00:27:06 ... writing Suffix Array to disk ...
00:28:53 ... writing SAindex to disk
00:29:05 ..... finished successfully
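Once the log reports "finished successfully", a quick look at the output directory can catch a truncated or interrupted build. The sketch below checks only a few of the files STAR writes (Genome, SA, SAindex, genomeParameters.txt); the function name `check_star_index` is hypothetical, and the list is not an exhaustive inventory of the index.

```shell
# Return 0 if the directory holds non-empty core STAR index files, 1 otherwise.
# Usage (directory name assumed from the command above):
#   check_star_index ./ref_files/v27_index
check_star_index() {
    index_dir=$1
    status=0
    for f in Genome SA SAindex genomeParameters.txt; do
        # -s is true only when the file exists and has a size greater than zero.
        if [ ! -s "$index_dir/$f" ]; then
            echo "missing or empty: $index_dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```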

11.4 RSeQC reference files

We download the human annotation BED files, including the whole-genome BED file and the housekeeping-gene BED file, from the RSeQC page on SourceForge.

./ref_files/refseqGenes.bed
./ref_files/housekeeping_refseqGenes.bed
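RSeQC expects gene models in 12-column BED format, so a downloaded or custom file can be screened before it is used. This is a minimal sketch; the function name `bed_bad_lines` is hypothetical, and it assumes tab-separated fields with optional `track`, `browser`, or `#` header lines.

```shell
# Print the number of data lines that do not have exactly 12 tab-separated fields.
# Usage (path taken from the list above):
#   bed_bad_lines ./ref_files/refseqGenes.bed
bed_bad_lines() {
    awk -F'\t' '
        # Skip optional header and comment lines.
        /^(track|browser|#)/ { next }
        NF != 12 { bad++ }
        # bad + 0 prints 0 rather than an empty string when no line failed.
        END { print bad + 0 }
    ' "$1"
}
```

A result of 0 suggests the file is structurally a BED12 file; it does not check that the chromosome names match the fasta.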

11.5 Build salmon index

conda activate rna

## salmon Version: salmon 1.1.0
salmon index -t GRCh38.d1.vd1.CIDC.fa -i salmon_index

...
index ["salmon_index"] did not previously exist  . . . creating it
[jLog] [info] building index 
[jointLog] [info] [Step 1 of 4] : counting k-mers
[jointLog] [info] Replaced 164,553,847 non-ATCG nucleotides
[jointLog] [info] Clipped poly-A tails from 0 transcripts
[jointLog] [info] Building rank-select dictionary and saving to disk
[jointLog] [info] done
Elapsed time: 0.191866s
[jointLog] [info] Writing sequence data to file . . .
[jointLog] [info] done
Elapsed time: 1.91244s
[jointLog] [info] Building 64-bit suffix array (length of generalized text is 3,088,286,426)
[jointLog] [info] Building suffix array . . .
success
saving to disk . . . done
Elapsed time: 18.3072s
done
Elapsed time: 703.843s

11.5.1 GMT file for gene set analysis

The GMT file is downloaded from the Broad Institute release page. The GMT file we currently use is “c2.cp.kegg.v6.1.symbols.gmt”.
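A GMT file is tab-delimited, with one gene set per line: the set name, a description, then the member genes. If you substitute a different GMT file, a short summary makes it easy to confirm it parsed as expected. The sketch below is illustrative only, and the function name `gmt_summary` is hypothetical.

```shell
# Print each gene set name followed by the number of genes it contains.
# Usage:
#   gmt_summary c2.cp.kegg.v6.1.symbols.gmt
gmt_summary() {
    awk -F'\t' '
        # Fields 1 and 2 are the set name and description; the rest are genes.
        NF >= 3 { print $1 "\t" (NF - 2) }
    ' "$1"
}
```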

11.6 STAR-Fusion genome resource lib

The genome resource lib is downloaded from the Broad Institute release page. The lib we currently use is GRCh38_v22_CTAT_lib.

You can also prepare your own genome resource lib for use with STAR-Fusion. For more details, read:

11.7 Centrifuge index

The human Centrifuge index is downloaded from the Centrifuge website. The index we currently use is p_compressed+h+v, which includes the human genome, prokaryotic genomes, and viral genomes.

You can also build your own custom Centrifuge index. For more details, read:

11.8 TRUST4 reference files

TRUST4 reference files include (1) a fasta file of TCR and BCR genomic sequences and (2) a reference database sequence file containing annotation information.

hg38_bcrtcr.fa
human_IMGT+C.fa

These reference files can be downloaded directly from the TRUST4 GitHub repository.