Chapter 3 High throughput sequencing
3.1 Three generations of sequencing technologies
First generation sequencing is Sanger sequencing. It is the technology that was used to obtain the first human genome sequence.
Second generation sequencing is also called next generation sequencing (NGS) and marks the start of high throughput sequencing. It is what scientists use most often nowadays, and Illumina is the market leader. Most of the rest of this course will cover the analysis of second generation sequencing data.
Third generation sequencing is single-molecule sequencing. Several technologies, such as PacBio and Oxford Nanopore, are still under active development, although none has yet matched the market share of second generation platforms.
3.2 FASTQ and FastQC
NGS generates FASTQ files, which store each read's sequence together with a quality score for every base. FastQC is a computational tool for evaluating the quality of your NGS data.
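To make the file format concrete, here is a minimal sketch of reading a FASTQ record and decoding its quality string. It assumes four lines per record (no line wrapping) and the Phred+33 encoding; the read itself is made up for illustration, and the function names read_fastq and phred_scores are our own, not part of any tool.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(c) - offset for c in qual]


# A made-up record: four high-quality bases followed by one low-quality 'N'.
example = StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = phred_scores(qual)
mean_quality = sum(scores) / len(scores)
print(name, seq, scores, mean_quality)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 40 means roughly one error in 10,000 base calls. Quality reports summarize scores like these across all reads.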
3.3 Early sequence alignment (1 with 1)
In the early days (1970s), scientists were not worried about having to align too many sequences. They wanted to find the best alignment between two sequences. Many bioinformatics courses start with these pairwise alignment algorithms, although they are not the main focus of our course. We included two videos in case you are interested.
The Needleman-Wunsch algorithm (1970) is the earliest algorithm to find the optimal global alignment between two sequences and score their similarity.
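For those who prefer code to videos, here is a minimal sketch of global alignment by dynamic programming. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; real applications often use substitution matrices and separate gap-open and gap-extend penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("ACGT", "AGT")
print(nw_score)
print(nw_a)
print(nw_b)
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the two sequence lengths. That is fine for two genes but not for millions of reads.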
When two sequences are long, and only a portion of them can align well with each other, the Smith-Waterman algorithm can find the best local sequence alignment. It is still considered the most accurate approach, because it is guaranteed to find the optimal local alignment under a given scoring scheme, although it is slow.
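The local variant can be sketched in the same dynamic programming style. Two things change compared with global alignment: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell instead of the corner. The scoring values below (match +2, mismatch -1, gap -2) and the example sequences are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal local alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a fresh local alignment
                score[i - 1][j - 1] + sub,  # extend diagonally
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score reaches zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Only the middle portion of these two made-up sequences is similar.
sw_score, sw_a, sw_b = smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC")
print(sw_score)
print(sw_a)
print(sw_b)
```

Every cell of the table is still filled in, which is why the method is exact but slow on long sequences.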
3.4 Sequence search algorithms (1 with many)
With more and more sequences available in public databases in the 1980s, scientists wanted to know whether a newly sequenced string had already been deposited in a public database. The fast search algorithm BLAST was therefore developed. It uses one sequence as the query to find similar sequences in a database.
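The reason a database search can be much faster than aligning the query against every entry is that short exact word matches (seeds) can be looked up in a precomputed index. The sketch below shows only this seeding idea on a made-up toy database; it is not BLAST itself, which goes on to extend seeds into alignments and to assess their statistical significance. The names build_kmer_index and find_seed_hits, and the word size k=4, are assumptions for illustration.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map every k-mer to the list of (sequence_name, position) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seed_hits(query, index, k):
    """Count, per database sequence, how many query k-mers have an exact match."""
    hits = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for name, _dpos in index.get(query[qpos:qpos + k], []):
            hits[name] += 1
    return dict(hits)


# Toy database with made-up sequences.
database = {
    "seqA": "ACGTACGTGACCTGA",
    "seqB": "TTTTTTTTTTTTTTT",
    "seqC": "GGGACCTTTGGG",
}
index = build_kmer_index(database, k=4)
seed_hits = find_seed_hits("GACCTGA", index, k=4)
print(seed_hits)
```

Only sequences with seed hits (here seqA and seqC) would need a slower, more careful alignment. Sequences with no hits, such as seqB, are skipped entirely.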
3.5 Burrows-Wheeler Aligner (many with many)
With NGS, scientists need much faster search (aka mapping) algorithms in order to align millions of short reads to the reference genome. One of the most widely used is the Burrows-Wheeler Aligner, or BWA.
In order to understand BWA, we first need to introduce the Burrows-Wheeler transform (BWT) and LF mapping.
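As a small worked example, the transform and LF mapping can be sketched as follows. The transform sorts all rotations of the text (with a '$' end marker added) and keeps the last column. LF mapping says that the i-th occurrence of a character in the last column is the same text character as its i-th occurrence in the first column, which is enough to walk backwards through the text and recover it. Sorting full rotations is only practical for tiny inputs; the function names are our own.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (small inputs only)."""
    text = text + "$"  # '$' sorts before letters and marks the end of the text
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def lf_mapping(last):
    """For each row, give the row whose first-column character is this row's last-column character."""
    counts = {}
    for c in last:
        counts[c] = counts.get(c, 0) + 1
    # first_row[c] = first row of the sorted first column that starts with c
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    seen = {}
    lf = []
    for c in last:
        lf.append(first_row[c] + seen.get(c, 0))  # the k-th c in last maps to the k-th c in first
        seen[c] = seen.get(c, 0) + 1
    return lf


def inverse_bwt(last):
    """Recover the original text by following LF mapping from the '$' row."""
    lf = lf_mapping(last)
    row = 0  # row 0 is the rotation starting with '$'
    chars = []
    for _ in range(len(last) - 1):
        chars.append(last[row])  # the character preceding the current one in the text
        row = lf[row]
    return "".join(reversed(chars))


transformed = bwt("banana")
print(transformed)
print(inverse_bwt(transformed))
```

For "banana" the last column is "annb$aa". Identical characters tend to cluster together, which is why the transform was first used for compression, and the same LF property is what makes searching possible.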
The basic idea of Burrows-Wheeler alignment
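The core of this idea is backward search: the read is matched against the index one character at a time, from its last character to its first, and each step narrows a range of rows in the sorted rotation matrix. The sketch below handles exact matches only and stores full, uncompressed tables; real aligners sample these tables to save memory and also tolerate mismatches and gaps. The function names and the toy reference are assumptions for illustration.

```python
def build_fm_index(text):
    """Build a suffix array, the C table and the Occ table for text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # simple but slow suffix array
    last = "".join(text[i - 1] for i in sa)                # BWT: character before each suffix
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in the text that sort before ch
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in last[:i]
    occ = {ch: [0] * (len(last) + 1) for ch in alphabet}
    for i, last_char in enumerate(last):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == last_char else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return the sorted 0-based positions where pattern occurs exactly."""
    lo, hi = 0, len(sa)  # half-open range of rows that match the suffix seen so far
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []  # no row starts with the current suffix of the pattern
    return sorted(sa[row] for row in range(lo, hi))


reference = "ACGTACGTTACG"  # made-up toy reference
sa, c_table, occ = build_fm_index(reference)
print(backward_search("ACG", sa, c_table, occ))
print(backward_search("TTA", sa, c_table, occ))
print(backward_search("GGG", sa, c_table, occ))
```

The number of steps depends on the length of the read, not on the length of the reference, which is what makes mapping millions of reads to a genome feasible once the index has been built.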
3.6 Alignment output
NGS raw data is in FASTQ format. Alignment gives you SAM (Sequence Alignment/Map) files or BAM files (the binary version of SAM), which contain the sequence information from the FASTQ file together with the mapping locations. The BED format is the simplest, keeping mainly the genomic coordinates of each alignment, although there is information loss.
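To see what the information loss looks like, here is a minimal sketch that converts one SAM alignment line into a six-column BED line. It uses only the mandatory SAM fields, assumes a single-end read, and puts the mapping quality in the BED score column, which is one common convention but still an assumption here. The SAM line is made up, and the function name sam_to_bed is our own.

```python
import re


def sam_to_bed(sam_line):
    """Convert one SAM alignment line to a 6-column BED line, or None if unmapped."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, chrom = fields[0], int(fields[1]), fields[2]
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    if flag & 4 or chrom == "*":
        return None  # flag bit 0x4 means the read is unmapped
    # Only the M, D, N, = and X operations consume bases on the reference.
    ref_len = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in "MDN=X"
    )
    start = pos - 1          # SAM positions are 1-based, BED starts are 0-based
    end = start + ref_len    # BED ends are exclusive
    strand = "-" if flag & 16 else "+"  # flag bit 0x10 means reverse strand
    return "\t".join([chrom, str(start), str(end), name, mapq, strand])


# A made-up alignment: reverse strand, with a deletion and an insertion.
sam_line = "read1\t16\tchr1\t100\t60\t5M2D3M1I2M\t*\t0\t0\tACGTACGTACG\tIIIIIIIIIII"
bed_line = sam_to_bed(sam_line)
print(bed_line)
```

The BED line keeps where the read landed and on which strand, but the read sequence, the base qualities and the exact pattern of insertions and deletions in the CIGAR string are gone.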