Secondary Analysis Flashcards
What is a read?
An inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment, typically 150 bp in length.
What is de novo sequence assembly?
Assembling short nucleotide sequences (reads) into longer contiguous sequences without the use of a reference genome.
What are mapped reads?
Those reads from the sequenced sample that align directly to a single region (set of loci) on the reference genome.
What are unmapped reads?
Those reads that map nowhere on the reference genome.
What is BLAST?
Basic Local Alignment Search Tool. It is an algorithm and program for comparing primary biological sequence information. A BLAST search lets a researcher compare a protein or nucleotide sequence (called the query) with a library or database of sequences and identify database sequences that resemble the query above a certain threshold.
What is a FASTA file?
A text file for representing nucleotide or amino acid sequences where nucleotides or amino acids are represented by single-letter codes.
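As an illustration of the format described above, here is a minimal sketch of a FASTA reader. It assumes the common layout in which each record starts with a `>` header line followed by one or more sequence lines; the function name `parse_fasta` and the example records are hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: take the first word after '>' as the record ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[current_id] += line
    return records


# Made-up example: one nucleotide record and one amino acid record.
example = """>seq1 hypothetical nucleotide record
ACGTACGT
TTGA
>seq2
MKVL
"""
fasta_records = parse_fasta(example.splitlines())
print(fasta_records)
```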
What is sequence alignment?
A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity.
What is a consensus sequence (canonical sequence)?
It’s the calculated sequence of most frequent residues (nucleotide or amino acid) found at each position in a sequence alignment.
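The calculation above can be sketched in a few lines. This is a simplified example that assumes the sequences are already aligned to equal length and that ties are broken in favor of the residue seen first in the column; the function name `consensus` is hypothetical.

```python
from collections import Counter
from typing import List


def consensus(aligned: List[str]) -> str:
    """Return the most frequent residue at each alignment column."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned):
        # most_common(1) returns the residue with the highest count;
        # equal counts keep first-seen order.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


# Made-up alignment of three short sequences.
print(consensus(["ACGT", "ACGA", "TCGT"]))
```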
What is a sequence motif?
A nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function.
What is a FASTQ file?
A text file for representing a biological sequence and its corresponding quality scores.
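To make the pairing of sequence and quality concrete, here is a minimal sketch of a FASTQ reader. It assumes the usual four-line record layout (header starting with `@`, sequence, `+` separator, quality string) and the Phred+33 ASCII encoding; other encodings exist, and the function name `parse_fastq` and the example read are hypothetical.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each 4-line record."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (l.strip() for l in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores


# Made-up single-read example.
fastq_lines = ["@read1", "ACGT", "+", "II5!"]
fastq_records = list(parse_fastq(fastq_lines))
print(fastq_records)
```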
What is a SAM file?
Sequence Alignment/Map, a text-based format originally for storing biological sequences aligned to a reference sequence. It has since been extended to also represent unmapped sequences.
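A short sketch of how one alignment line of such a file can be read. It relies on the 11 mandatory tab-separated fields of the SAM format and on two bits of the FLAG field (0x4 for an unmapped read, 0x400 for a read marked as a duplicate); the function name `parse_sam_line` and the two example lines are hypothetical.

```python
from typing import Any, Dict

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Parse the mandatory fields of one SAM alignment line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record: Dict[str, Any] = dict(zip(SAM_FIELDS, cols[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Decode two FLAG bits that matter for pre-processing.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_duplicate"] = bool(record["FLAG"] & 0x400)
    return record


# Made-up lines: one mapped read and one unmapped read.
mapped = parse_sam_line("read1\t99\tchr1\t100\t60\t4M\t=\t250\t154\tACGT\tIIII")
unmapped = parse_sam_line("read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII")
print(mapped["RNAME"], mapped["POS"], mapped["is_unmapped"])
print(unmapped["is_unmapped"])
```

In practice a dedicated library would be used for SAM, BAM, and CRAM files; the sketch only shows what the text columns mean.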
What is a BAM file?
It’s a binary compressed format equivalent to text-based SAM format.
What is a CRAM file?
Compressed Reference-oriented Alignment Map. A compressed columnar file for storing biological sequences aligned to a reference sequence.
What is a library?
It’s the DNA product extracted from biological samples and prepared for sequencing.
What is cDNA?
Copy DNA or complementary DNA. It is DNA synthesized from a specific mRNA template in a reaction catalyzed by the enzyme reverse transcriptase. While genomic DNA contains both coding and non-coding sequences, including introns, cDNA made from mature mRNA contains no introns.
What is a read group?
A set of reads that are generated from a single run of a sequencing instrument.
What is a lane?
The basic independent run of a high-throughput sequencing machine.
What is multiplexing in next generation sequencing?
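Sequencing multiple libraries together, in the same lane or across several lanes, with each library tagged by an index sequence (barcode) so that its reads can be separated afterwards.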
What are the steps in pre-processing of the raw sequence data?
- map raw unmapped reads to reference genome
- mark duplicates
- recalibrate base quality scores
What is paired-end reading?
In paired-end reading, the sequencer starts at one end of the fragment, reads in that direction up to the specified read length, and then starts another round of reading from the opposite end of the fragment.
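A toy sketch of the idea above, assuming an error-free sequencer and a fragment given as its forward-strand sequence: the first read is the start of the fragment, and the second read is the start of its reverse complement, which corresponds to reading inward from the opposite end. The function names and the example fragment are hypothetical.

```python
from typing import Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> Tuple[str, str]:
    """Simulate one read from each end of a fragment."""
    read1 = fragment[:read_length]                       # from one end
    read2 = reverse_complement(fragment)[:read_length]   # from the opposite end
    return read1, read2


# Made-up 10 bp fragment read with 4 bp reads.
print(paired_end_reads("AACCGGTTAC", 4))
```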
What is a reference genome?
A synthetic single-stranded representation of a common genome sequence, intended to provide a common coordinate framework for all genomic analyses.
What is mapping reads to reference?
This is the first processing step, where each read pair is mapped to the reference genome.
What tools are involved in mapping reads to reference in GATK?
BWA, MergeBamAlignment (Picard)
What is done in the mark duplicates step?
For each sample, read pairs that are likely to have originated from duplicates of the same original DNA fragment through some artifactual process are identified.
All but one pair are marked within each set of duplicates, and later, variant discovery ignores the marked pairs.
Then the reads are sorted into coordinate-order for the next step of the pre-processing.
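The marking logic described above can be sketched as follows. This is a heavily simplified illustration, not the algorithm used by MarkDuplicates: it assumes read pairs are duplicates when they share chromosome, both start positions, and strand, and it assumes the pair with the highest summed base quality is the one left unmarked. The dictionary keys and the example pairs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def mark_duplicates(pairs: List[Dict]) -> Set[str]:
    """Return the names of read pairs to mark as duplicates."""
    groups = defaultdict(list)
    for pair in pairs:
        # Assumption: identical coordinates and strand imply a duplicate set.
        key = (pair["chrom"], pair["start"], pair["mate_start"], pair["strand"])
        groups[key].append(pair)
    duplicates: Set[str] = set()
    for members in groups.values():
        # Keep one representative per set and mark all the others.
        best = max(members, key=lambda p: p["qual_sum"])
        for pair in members:
            if pair is not best:
                duplicates.add(pair["name"])
    return duplicates


# Made-up read pairs: "a" and "b" share coordinates, "c" does not.
example_pairs = [
    {"name": "a", "chrom": "chr1", "start": 100, "mate_start": 300, "strand": "+", "qual_sum": 500},
    {"name": "b", "chrom": "chr1", "start": 100, "mate_start": 300, "strand": "+", "qual_sum": 620},
    {"name": "c", "chrom": "chr1", "start": 150, "mate_start": 350, "strand": "+", "qual_sum": 400},
]
print(mark_duplicates(example_pairs))
```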
What tools are used for the mark duplicates step?
MarkDuplicatesSpark
or
MarkDuplicates (Picard) + SortSam
What is done in the base recalibration step?
This step detects and corrects for patterns of systematic errors in the base quality scores using machine learning.
New BAM files are produced in this step.
What are base quality scores?
Per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time.
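Base quality scores are conventionally reported on the Phred scale, where a score Q corresponds to an error probability p through Q = -10 * log10(p). A small sketch of the conversion in both directions (the function names are hypothetical):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Q30 means roughly a 1 in 1000 chance that the base call is wrong.
print(phred_to_error_prob(30))
print(error_prob_to_phred(0.01))
```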
What are base quality scores used for?
They’re used in variant calling algorithms.
What tools are used in the base recalibration step in GATK?
BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)
What is a haplotype?
A physical grouping of genomic variants (or polymorphisms) that tend to be inherited together. A specific haplotype typically reflects a unique combination of variants that reside near each other on a chromosome.