alignment_assembly Flashcards
What is the alignment process in bioinformatics?
The alignment process compares sequences (DNA, RNA, or protein) to identify regions of similarity, helping to understand functional, structural, and evolutionary relationships.
What is a bit score?
A normalized score that reflects the significance of the alignment between a query sequence and a database sequence, independent of database size; higher scores indicate more significant alignments.
What is an E-value?
The Expect value describes the number of hits one can expect to see by chance when searching a database of a particular size; lower E-values indicate more significant matches.
How does the length of the query sequence affect the E-value?
Shorter sequences tend to have higher E-values because they are more likely to appear in the database by random chance.
How does database size influence the E-value?
A larger database increases the chance of finding a match with the same score purely by chance, so the same alignment receives a higher (less significant) E-value.
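The link between bit score, search-space size, and E-value can be sketched with the standard relation E = m * n * 2^(-S'), where m is the query length, n is the total database length, and S' is the bit score. This is a simplified illustration: the function name `expected_hits` is hypothetical, and the effective-length corrections that BLAST applies in practice are ignored.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2^(-S').

    Assumption: raw lengths are used instead of BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same query and bit score, database ten times larger.
small_db = expected_hits(bit_score=40.0, query_len=100, db_len=1_000_000)
large_db = expected_hits(bit_score=40.0, query_len=100, db_len=10_000_000)

print(f"E-value, small database: {small_db:.2e}")
print(f"E-value, large database: {large_db:.2e}")
```

With the bit score held fixed, the E-value grows in proportion to the database size, and each additional bit of score halves it.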
What is BLASTN used for?
Comparing nucleotide sequences against nucleotide databases.
What is BLASTP used for?
Comparing protein sequences against protein databases.
What is BLASTX used for?
Comparing a nucleotide sequence translated into all six reading frames against a protein database.
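To make "all six reading frames" concrete, the sketch below lists the three forward frames and the three frames on the reverse complement of a nucleotide sequence. It only extracts the codon-aligned nucleotide strings and does not translate them; the function names are hypothetical and only unambiguous A/C/G/T bases are assumed.

```python
# Complement table for unambiguous DNA bases (assumption: no IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def six_frames(seq: str) -> dict:
    """Return the six reading frames, each trimmed to whole codons."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        fwd = seq[offset:]
        rev = rc[offset:]
        # Drop trailing bases that do not form a complete codon.
        frames[f"+{offset + 1}"] = fwd[: len(fwd) - len(fwd) % 3]
        frames[f"-{offset + 1}"] = rev[: len(rev) - len(rev) % 3]
    return frames


for name, frame in sorted(six_frames("ATGGCCTAA").items()):
    print(name, frame)
```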
What is TBLASTN used for?
Comparing protein sequences against a nucleotide database translated in all six reading frames.
What is TBLASTX used for?
Comparing nucleotide sequences translated into all six reading frames against a nucleotide database that is also translated into all six reading frames.
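The five BLAST programs above can be summarized as a lookup on query type and database type. The sketch below is a study aid only; the function `choose_blast` and its `compare_as_protein` flag are hypothetical names, not part of any BLAST interface.

```python
# (query type, database type, compare translated sequences?) -> program
_BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", False): "blastx",
    ("protein", "nucleotide", False): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_blast(query_type: str, db_type: str, compare_as_protein: bool = False) -> str:
    """Pick a BLAST program from the sequence types involved.

    compare_as_protein only matters for nucleotide-vs-nucleotide searches,
    where it selects tblastx (both sides translated) instead of blastn.
    """
    key = (query_type, db_type, compare_as_protein)
    if key not in _BLAST_PROGRAMS:
        raise ValueError(f"No BLAST program for combination {key}")
    return _BLAST_PROGRAMS[key]


print(choose_blast("nucleotide", "protein"))
print(choose_blast("nucleotide", "nucleotide", compare_as_protein=True))
```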
What factors should you consider when choosing alignment software?
Type of sequences/experiment, sequencing platform, planned further analysis, and computational infrastructure.
Why is it important to know your sequencing platform when choosing alignment software?
Different platforms (e.g., Illumina, Ion Torrent, PacBio) have unique read lengths and error types that affect compatibility with mapping tools.
How can downstream analysis influence your choice of alignment software?
Downstream tools must accept the alignment types and output formats that the aligner reports, so choose an aligner whose output the planned analysis can use.
Why is computational infrastructure important in selecting alignment software?
Some tools may require significant computational power or memory resources; knowing your available resources helps in making an appropriate choice.
Aspect
Short Read Alignment
Read Length and Characteristics
Typically range from 36 to 600 bp; generated by platforms like Illumina; cost-effective, high-quality data but can complicate assembly of complex genomes.
Alignment Speed and Efficiency
Generally faster to align per read due to smaller size, but large datasets can still require substantial computational resources; very short reads are more likely to map ambiguously, which lowers the alignment rate.
Error Rates
Generally lower error rates, suitable for applications requiring high accuracy, such as variant calling.
Applications
Commonly used in RNA-Seq, targeted resequencing, and SNP detection due to high throughput and accuracy.
Alignment Tools
Tools like BWA, Bowtie2, and STAR are designed specifically for short-read data.
Aspect
Description
Genome Assembly
The process of reconstructing a complete genome from short DNA sequences, known as reads.
Reads
The smallest unit of sequencing data obtained from next-generation sequencing (NGS); short sequences of DNA, typically ranging from 150 to 600 base pairs.
Contigs
Contiguous sequences of DNA assembled from overlapping reads; represent longer stretches of the genome without gaps, usually ranging from hundreds of bases to several kilobases.
Scaffolds
Longer sequences constructed by linking contigs together using additional information about their relative positions and orientations; may contain gaps where the exact sequence is unknown.
K-mer
A substring of length k from a longer DNA sequence; used in assembly algorithms to identify overlaps between reads and help construct contigs.
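As a small illustration of the definition, the sketch below lists every k-mer in a sequence by sliding a window of length k one base at a time. The function name `kmers` is a hypothetical helper.

```python
def kmers(seq: str, k: int) -> list:
    """Return all overlapping substrings of length k, in order."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 k-mers (none if k > n).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(kmers("ATGGCG", 3))
```

A sequence of length n yields n - k + 1 k-mers, so a longer k gives fewer k-mers per read.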
De Bruijn Graph
A data structure representing overlaps between k-mers; each k-mer is a vertex, and edges connect vertices that overlap by k-1 bases; allows efficient assembly of contigs.
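The description above can be sketched as follows: every distinct k-mer from the reads becomes a vertex, and a directed edge joins two k-mers when the last k-1 bases of the first equal the first k-1 bases of the second. This is a minimal teaching sketch under that definition, with the hypothetical function name `build_de_bruijn`; real assemblers also handle reverse complements, sequencing errors, and coverage, which are ignored here.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer (vertex) to the k-mers it overlaps by k-1 bases."""
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])

    # Index vertices by their (k-1)-base prefix for fast successor lookup.
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)

    # Successors of a vertex: k-mers whose prefix equals its (k-1)-base suffix.
    return {node: sorted(by_prefix.get(node[1:], [])) for node in sorted(nodes)}


graph = build_de_bruijn(["ATGGC", "GGCGT"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Following the edges from ATG spells out ATGGCGT, a contig that joins the two overlapping reads.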
Assembling Software
Tools available for genome assembly with different strengths.
Velvet
A popular de novo assembler using a de Bruijn graph approach to assemble short reads into contigs.
Edena
Designed for assembling short reads; can handle large datasets and produce high-quality assemblies.
SSAKE
Focuses on creating contigs from paired-end reads, improving accuracy by utilizing information from both ends of the read pairs.
K-mer Selection
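Choosing k in k-mer analysis is crucial: a smaller k tends to produce many short contigs because short k-mers are more often repeated across the genome, while a larger k can produce fewer, longer contigs but needs higher coverage so that adjacent k-mers are still observed.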
Good Assembly Definition
Characterized by longer and fewer contigs, high percentage of contigs ≥ 1 kb, and high total number of bases in contigs.
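The criteria above can be checked with a few summary numbers computed from a list of contig lengths. The sketch below is illustrative; the function name `assembly_summary` and the example lengths are made up for the demonstration.

```python
def assembly_summary(contig_lengths, min_len=1000):
    """Summarize an assembly from its contig lengths (in bases)."""
    n = len(contig_lengths)
    long_contigs = sum(1 for length in contig_lengths if length >= min_len)
    return {
        "num_contigs": n,
        "total_bases": sum(contig_lengths),
        "longest_contig": max(contig_lengths, default=0),
        # Percentage of contigs at least min_len bases long (>= 1 kb by default).
        "pct_contigs_ge_min_len": 100.0 * long_contigs / n if n else 0.0,
    }


# Hypothetical contig lengths, for illustration only.
print(assembly_summary([500, 1000, 2500, 6000]))
```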
Methods to Increase Assembly Quality
Strategies to enhance the quality of genome assembly: high coverage and appropriate GC content.
High Coverage
Increasing coverage depth can reduce gaps between contigs and lead to longer contigs overall.
GC Content
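Extreme GC content (very high or very low) tends to cause uneven sequencing coverage; regions with balanced GC content are usually sequenced more evenly, which improves accuracy and assembly quality.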
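GC content is simply the fraction of bases that are G or C. A minimal sketch, assuming ambiguous bases such as N should be left out of the denominator (the function name `gc_fraction` is hypothetical):

```python
def gc_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0  # no unambiguous bases to measure
    return (seq.count("G") + seq.count("C")) / counted


print(gc_fraction("ATGC"))
print(gc_fraction("GGCCNN"))
```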
Mate pair
A pair of reads sequenced from the two ends of the same insert fragment.
N50 of the assembly
A weighted median length statistic: 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.
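The N50 definition can be worked through in code: sort the contig lengths from longest to shortest, add them up, and report the length at which the running total first reaches half of the assembly. The function name `n50` and the example lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop once the longest contigs cover at least half the assembly.
        if running * 2 >= total:
            return length
    return 0  # empty input


# Total is 54; the three longest contigs (10 + 9 + 8 = 27) reach half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```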
Factors that may affect the assembly
- Coverage
- % GC
- Genomic content (repetitive regions, transposons, etc.)
Effect of coverage
Coverage = (N × L) / G
N = total number of reads
L = read length
G = genome size
Higher coverage generally leads to better assembly quality.
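A worked example of the coverage formula, with made-up numbers: one million reads of 150 bp from a 5 Mb genome. The helper names `coverage` and `reads_needed` are hypothetical; the second one rearranges the same formula to estimate how many reads a target coverage requires.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Coverage = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Rearranged formula: N = coverage * G / L, rounded up."""
    return math.ceil(target_coverage * genome_size / read_length)


# Illustrative values only: 1,000,000 reads x 150 bp over a 5,000,000 bp genome.
print(coverage(1_000_000, 150, 5_000_000))
print(reads_needed(30, 150, 5_000_000))
```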
velveth and velvetg
velveth: converts reads to k-mers
velvetg: assembles k-mers into contigs
Velveth helps you construct the dataset for the following program, velvetg, and indicates to the system what each sequence file represents.
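The two-step workflow can be sketched by building the two command lines in order. This assumes single-end short reads in FASTQ format; the output directory `assembly_k31`, the file `reads.fastq`, and k = 31 are hypothetical examples, and the helper `velvet_commands` only prints the commands rather than running them.

```python
import shlex


def velvet_commands(out_dir: str, k: int, reads_fastq: str):
    """Build the velveth and velvetg command lines for one short-read file."""
    # Step 1: velveth hashes the reads into k-mers and records the file type.
    hash_cmd = ["velveth", out_dir, str(k), "-fastq", "-short", reads_fastq]
    # Step 2: velvetg builds the graph in the same directory and writes contigs.
    graph_cmd = ["velvetg", out_dir]
    return hash_cmd, graph_cmd


# Hypothetical directory and file names, for illustration only.
hash_cmd, graph_cmd = velvet_commands("assembly_k31", 31, "reads.fastq")
print(shlex.join(hash_cmd))
print(shlex.join(graph_cmd))
```

Both programs operate on the same output directory, which is how velvetg finds the dataset that velveth prepared.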