Secondary Analysis Flashcards

1
Q

What is a read?

A

An inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment, typically 150 bp in length.

2
Q

What is de novo sequence assembly?

A

Assembling of short nucleotide sequences into longer ones without the use of a reference genome.

3
Q

What are mapped reads?

A

Those reads from the sequenced sample that align directly to a single region (set of loci) on the reference genome.

4
Q

What are unmapped reads?

A

Those reads that map nowhere on the reference genome.

5
Q

What is BLAST?

A

Basic Local Alignment Search Tool. It is an algorithm and program for comparing primary biological sequence information. A BLAST search lets a researcher compare a protein or nucleotide sequence (called a query) with a library or database of sequences and identify database sequences that resemble the query above a certain threshold.

6
Q

What is a FASTA file?

A

A text file for representing nucleotide or amino acid sequences where nucleotides or amino acids are represented by single-letter codes.
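
A minimal sketch of reading this format, assuming the common layout in which each record starts with a `>` header line followed by one or more sequence lines. The function name `parse_fasta` and the sample records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # header line: '>' followed by a description
            records[header] = []
        elif header is not None:
            records[header].append(line)  # a sequence may wrap over several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical records: one nucleotide sequence, one amino acid sequence.
example = """>seq1 hypothetical example
ACGTACGT
ACGT
>seq2
MKV
"""
print(parse_fasta(example))
```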

7
Q

What is sequence alignment?

A

A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity.

8
Q

What is a consensus sequence (canonical sequence)?

A

It’s the calculated sequence of most frequent residues (nucleotide or amino acid) found at each position in a sequence alignment.
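
One way to compute this, as a rough sketch: take the most frequent residue in each column of the alignment. The function name `consensus` and the toy alignment are illustrative assumptions, and ties are broken arbitrarily here (by first occurrence), which real tools may handle differently.

```python
from collections import Counter


def consensus(aligned):
    """Return the most frequent residue per column of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*aligned) walks the alignment column by column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


# Toy alignment (hypothetical data).
alignment = ["ACGTA", "ACGAA", "TCGTA"]
print(consensus(alignment))  # ACGTA
```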

9
Q

What is a sequence motif?

A

A nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function.
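
Motifs are often written as patterns and searched for with regular expressions. The sketch below uses the N-glycosylation site pattern (N, then any residue except P, then S or T, then any residue except P) as an example; the function name `find_motif` and the protein string are hypothetical.

```python
import re

# N-glycosylation site pattern written as a regular expression.
N_GLYCOSYLATION = r"N[^P][ST][^P]"


def find_motif(sequence, pattern):
    """Return (start, match) for every occurrence, including overlapping ones."""
    # The lookahead (?=...) lets overlapping matches be reported.
    return [(m.start(), m.group(1)) for m in re.finditer(f"(?=({pattern}))", sequence)]


protein = "MNGSAANPTKNLTQ"  # made-up sequence
print(find_motif(protein, N_GLYCOSYLATION))  # [(1, 'NGSA'), (10, 'NLTQ')]
```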

10
Q

What is a FASTQ file?

A

A text file for representing a biological sequence and its corresponding quality scores.
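
A small sketch of one record, assuming the usual four-line layout (`@` header, sequence, `+` separator, quality string) and Phred+33 quality encoding. The function name `parse_fastq_record` and the sample read are made up.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (name, sequence, quality scores)."""
    header, seq, plus, qual = [line.strip() for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so score = ASCII code minus 33.
    scores = [ord(ch) - 33 for ch in qual]
    return header[1:], seq, scores


record = ["@read1", "GATTACA", "+", "IIII#5?"]  # hypothetical read
print(parse_fastq_record(record))
```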

11
Q

What is a SAM file?

A

Sequence Alignment/Map, a text-based format originally for storing biological sequences aligned to a reference sequence. It has since been extended to also represent unmapped sequences.
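
As a sketch of the format, each alignment line has eleven mandatory tab-separated fields, and bit 0x4 of the FLAG field marks an unmapped read. The function name `parse_sam_line` and both example lines are made up for illustration; optional tag fields are ignored here.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
UNMAPPED_FLAG = 0x4  # FLAG bit meaning "segment unmapped"


def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line into a dict."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["FLAG"] & UNMAPPED_FLAG)
    return record


# Hypothetical alignment lines: one mapped read, one unmapped read.
mapped = "read1\t99\tchr1\t100\t60\t7M\t=\t250\t157\tGATTACA\tIIIIIII"
unmapped = "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
for line in (mapped, unmapped):
    rec = parse_sam_line(line)
    print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["is_unmapped"])
```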

12
Q

What is a BAM file?

A

It’s the compressed binary equivalent of the text-based SAM format.

13
Q

What is a CRAM file?

A

Compressed Reference-oriented Alignment Map. A compressed columnar file for storing biological sequences aligned to a reference sequence.

14
Q

What is a library?

A

It’s the DNA product extracted from biological samples and prepared for sequencing.

15
Q

What is cDNA?

A

Copy DNA or complementary DNA. It is synthetic DNA that has been reverse-transcribed from a specific mRNA using the enzyme reverse transcriptase. While genomic DNA contains both coding and non-coding sequences, cDNA made from mature mRNA contains only the spliced transcript sequence (exons, without introns).

16
Q

What is a read group?

A

A set of reads that are generated from a single run of a sequencing instrument.

17
Q

What is a lane?

A

The basic independent run of a high-throughput sequencing machine.

18
Q

What is multiplexing in next generation sequencing?

A

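Pooling multiple libraries, each tagged with its own barcode (index), and sequencing them together in the same lane or run. Reads are assigned back to their library afterwards by barcode (demultiplexing).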

19
Q

What are the steps in pre-processing of the raw sequence data?

A
  1. map raw unmapped reads to reference genome
  2. mark duplicates
  3. recalibrate base quality scores
20
Q

What is paired-end reading?

A

In paired-end reading the sequencer starts at one end of the fragment, reads in that direction up to the specified read length, and then starts another round of reading from the opposite end of the fragment.
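
A toy illustration of the idea, assuming an error-free fragment: read 1 is the start of the fragment, and read 2 is the start of its reverse complement (that is, read from the opposite end on the other strand). The names `reverse_complement` and `paired_end_reads` and the fragment are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Simulate an idealized read pair from the two ends of one fragment."""
    read1 = fragment[:read_length]                      # from one end
    read2 = reverse_complement(fragment)[:read_length]  # from the opposite end
    return read1, read2


fragment = "ACGTTGCAAGGCTA"  # hypothetical fragment
print(paired_end_reads(fragment, 5))  # ('ACGTT', 'TAGCC')
```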

21
Q

What is a reference genome?

A

A synthetic, single-stranded representation of a common genome sequence, intended to provide a common coordinate framework for all genomic analyses.

22
Q

What is mapping reads to reference?

A

This is the first processing step, where each read pair is mapped to the reference genome.

23
Q

What tools are involved in mapping reads to reference in GATK?

A

BWA, MergeBamAlignment (Picard)

24
Q

What is done in the mark duplicates step?

A

For each sample, read pairs are identified that likely originated from duplicates of the same original DNA fragment through some artifactual process.
Within each set of duplicates, all but one pair are marked, and variant discovery later ignores the marked pairs.
The reads are then sorted into coordinate order for the next pre-processing step.
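
The marking logic can be sketched roughly as follows. This is a heavily simplified illustration, not how any particular tool is implemented: it assumes pairs count as duplicates when they share chromosome, both start positions, and orientation, and it keeps the pair with the highest score. The function name, dictionary keys, and data are all hypothetical.

```python
from collections import defaultdict


def mark_duplicates(pairs):
    """Return the names of read pairs marked as duplicates (simplified sketch)."""
    groups = defaultdict(list)
    for p in pairs:
        # Assumed duplicate key: same chromosome, both starts, same orientation.
        key = (p["chrom"], p["start1"], p["start2"], p["orientation"])
        groups[key].append(p)
    marked = set()
    for group in groups.values():
        best = max(group, key=lambda p: p["score"])  # the one pair left unmarked
        marked.update(p["name"] for p in group if p is not best)
    return marked


# Made-up read pairs; 'score' stands in for something like summed base quality.
pairs = [
    {"name": "a", "chrom": "chr1", "start1": 100, "start2": 350, "orientation": "FR", "score": 300},
    {"name": "b", "chrom": "chr1", "start1": 100, "start2": 350, "orientation": "FR", "score": 280},
    {"name": "c", "chrom": "chr1", "start1": 120, "start2": 400, "orientation": "FR", "score": 290},
]
print(mark_duplicates(pairs))  # {'b'}
```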

25
Q

What tools are used for the mark duplicates step?

A

MarkDuplicatesSpark
or
MarkDuplicates (Picard) + SortSam

26
Q

What is done in the base recalibration step?

A

This step detects and corrects for patterns of systematic errors in the base quality scores using machine learning.
New BAM files are produced in this step.

27
Q

What are base quality scores?

A

Per-base estimates of error emitted by the sequencing machines; they express how confident the machine was that it called the correct base each time.
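
These scores are conventionally reported on the Phred scale, where Q = -10 * log10(p) and p is the estimated probability that the base call is wrong. A short sketch of the conversion, with illustrative function names:

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the estimated probability of a wrong call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


# Q10 is about 1 error in 10, Q20 about 1 in 100, Q30 about 1 in 1000.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```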

28
Q

What are base quality scores used for?

A

They’re used in variant calling algorithms.

29
Q

What tools are used in the base recalibration step in GATK?

A

BaseRecalibrator, ApplyBQSR, AnalyzeCovariates (optional)

30
Q

What is a haplotype?

A

A physical grouping of genomic variants (or polymorphisms) that tend to be inherited together. A specific haplotype typically reflects a unique combination of variants that reside near each other on a chromosome.