Sequencing and Assembly Flashcards
Name some of the techniques for DNA sequencing
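- First-generation sequencing (Sanger sequencing: a PCR-like reaction with chain-terminating, fluorescently labelled nucleotides).
- Second-generation sequencing (massively parallel short-read sequencing on a flow cell, e.g. Illumina).
- Third-generation sequencing (long-read, single-molecule methods: PacBio SMRT sequencing, Oxford Nanopore).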
What is the difference between FASTA and FASTQ files?
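A FASTA file is a text-based file containing only sequences, each preceded by an identifier line that starts with '>'. It can hold nucleotide or amino acid sequences, for example assembled contigs, genes, or proteins.
A FASTQ file is a text-based file that typically holds the raw reads from a sequencer. Each read takes four lines: an identifier (starting with '@'), the sequence, a separator line ('+'), and a quality score for each nucleotide in the sequence. The per-base quality scores are the extra information that a FASTA file lacks.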
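To make the difference concrete, here is a minimal Python sketch that reads FASTQ records and writes them out as FASTA. It assumes strict four-line records and the common Phred+33 quality encoding; the function names `parse_fastq` and `to_fasta` are made up for this card, not taken from any library.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality scores) for each four-line FASTQ record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        assert header.startswith("@") and sep.startswith("+")
        # Assumption: Phred+33 encoding, so score = ASCII code of the character minus 33
        scores = [ord(c) - 33 for c in qual]
        yield header[1:], seq, scores


def to_fasta(records):
    """Keep only identifier and sequence; the quality scores are dropped."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in records)


example = "@read1\nACGT\n+\nII5!\n"
records = list(parse_fastq(example))
fasta = to_fasta(records)
```

Converting FASTQ to FASTA loses the quality line, which is why the conversion only goes one way.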
Explain the terms reference-based and de novo assembly, sequencing coverage, and sequencing depth
Reference-based assembly aligns the reads (query sequences) to a reference genome from the same or a closely related species (an 'answer key', if you will).
De novo assembly aligns the reads to each other, without a reference (like a puzzle).
Sequencing coverage is how much of the genome is covered by at least one read (whether we are missing parts or not).
Sequencing depth is how many reads cover each nucleotide of the genome (overlap), usually reported as an average over the genome.
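A small worked example may help keep the two terms apart. The sketch below assumes the reads are already aligned and given as half-open (start, end) positions on a genome of known length; `coverage_and_depth` is a hypothetical helper written for this card.

```python
def coverage_and_depth(genome_length, reads):
    """reads: list of (start, end) alignment positions, end not included."""
    depth = [0] * genome_length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    # Coverage: fraction of positions hit by at least one read
    covered = sum(1 for d in depth if d > 0)
    breadth = covered / genome_length
    # Depth: average number of reads stacked on each position
    mean_depth = sum(depth) / genome_length
    return breadth, mean_depth


breadth, mean_depth = coverage_and_depth(10, [(0, 4), (2, 6), (3, 8)])
```

Here the last two positions have no reads, so coverage is 80% even though the average depth is above 1.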
Explain the Overlap Layout Consensus (OLC) approach to assembling a genome
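When assembling reads into contigs and scaffolds, there are many possible ways to combine the reads. To find the 'right one', OLC works in three steps. Overlap: every read is compared with the other reads; the reads are represented as nodes and the overlaps between them as edges, or lines. Layout: a path that visits each node exactly once (a Hamiltonian path) gives the order of the reads. Consensus: the ordered, overlapping reads are merged, and the most common base at each position gives the final sequence.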
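As a toy illustration of OLC, the sketch below builds the overlap graph and then simply tries every read order to find a path through all nodes. This brute force is only workable for a handful of reads, and it assumes distinct, error-free reads; real assemblers use heuristics. The names `overlap` and `assemble_olc` and the `min_len` threshold are choices made for this example.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble_olc(reads, min_len=3):
    # Overlap: one edge per ordered pair of reads, weighted by overlap length
    edges = {(a, b): overlap(a, b, min_len)
             for a in reads for b in reads if a != b}
    # Layout: brute-force search for a path that visits every read exactly once
    best = None
    for order in permutations(reads):
        steps = [edges[(x, y)] for x, y in zip(order, order[1:])]
        if all(steps) and (best is None or sum(steps) > sum(best[1])):
            best = (order, steps)
    if best is None:
        return None
    # Consensus (simplified): reads are assumed error-free, so just merge along the path
    order, steps = best
    contig = order[0]
    for read, n in zip(order[1:], steps):
        contig += read[n:]
    return contig


reads = ["GGCTAT", "ATGGCT", "CTATCA"]
contig = assemble_olc(reads)
```

The number of possible orders grows factorially with the number of reads, which is one reason OLC gets expensive for large short-read data sets.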
Explain how de Bruijn graphs work
De Bruijn graphs look a lot like those of OLC. However, in de Bruijn graphs the reads are first broken into k-mers, all of the same length k. Each k-mer is represented as an edge, or line, of the graph, while the nodes are the (k-1)-mers that neighbouring k-mers share (the overlaps). The goal differs from OLC: instead of visiting each node exactly once, we look for a path that crosses each line/edge exactly once (an Eulerian path). Nodes may be visited more than once.
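A minimal sketch of the idea: build the graph with k-mers as edges between (k-1)-mer nodes, then walk every edge exactly once using Hierholzer's algorithm. It assumes error-free reads and keeps one edge per distinct k-mer, which is a simplification (real assemblers also track how often each k-mer occurs). The function names are made up for this card.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the nodes its outgoing k-mer edges point to."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # simplification: one edge per distinct k-mer
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a walk that uses every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one
    start = next((n for n in graph if len(graph[n]) - in_deg[n] == 1),
                 next(iter(graph)))
    remaining = {n: list(v) for n, v in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """First node, then the last letter of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


graph = de_bruijn(["ATGCG", "GCGCA"], k=3)
path = eulerian_path(graph)
sequence = spell(path)
```

In this example the node GC is passed twice while each k-mer edge is used once, which is the difference from the node-based path in OLC.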
What are some different approaches to determining genes in an assembly/sequence?
Ab initio (expectation), homology-based (reference sequences), Hidden Markov Model (probability based on training data)
Explain the process of the ab initio, homology-based, and HMM approaches to determining genes
Ab initio prediction is based on the expected features of a gene: regulatory elements, promoters, and start and stop codons, in the order they are expected to appear along the sequence.
The homology-based approach compares the sequence with one or more known reference sequences and uses the matching regions to determine genes.
An HMM is a probability model trained on known sequences; it is used to determine the probability that a part of the sequence codes for a gene or not.
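As a very reduced illustration of the ab initio idea, the sketch below looks for only one expected feature: an open reading frame that begins with a start codon (ATG) and ends at the first in-frame stop codon. It scans the forward strand only and ignores promoters, introns, and the reverse strand, so it is a toy and not a real gene predictor. `find_orfs` and `min_codons` are names chosen for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, orf) for forward-strand ORFs: ATG ... first in-frame stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it is long enough (stop codon included in the count)
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


orfs = find_orfs("CCATGAAACCCTAAGG")
```

A homology-based or HMM-based tool would then be used to judge whether such a candidate actually looks like a known gene.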