alignment_assembly Flashcards by Yunfei Shang

What is the alignment process in bioinformatics?

The alignment process compares sequences (DNA, RNA, or protein) to identify regions of similarity, helping to understand functional, structural, and evolutionary relationships.

What is a bit score?

A raw alignment score normalized using the statistical parameters of the scoring system, so that scores from different searches are comparable; it is independent of database size, and higher bit scores indicate more significant alignments.

What is an E-value?

The Expect value is the number of hits with an equal or better score that one can expect to see by chance when searching a database of a particular size; lower E-values indicate more significant matches.

How does the length of the query sequence affect the E-value?

Shorter sequences tend to have higher E-values because they are more likely to appear in the database by random chance.

How does database size influence the E-value?

A larger database increases the likelihood of finding a match with a given score purely by chance, so the same alignment score yields a higher (less significant) E-value.
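
As a rough illustration of the two cards above, the sketch below uses the standard Karlin-Altschul relations, bit score S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'), where m is the query length and n is the database length. The function names and the numbers are hypothetical, and real BLAST uses effective (edge-corrected) lengths, so treat this as an approximation rather than BLAST's exact calculation.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw score S to a bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Hypothetical values chosen only for illustration.
small_db = e_value(bits=50.0, query_len=200, db_len=1_000_000_000)
large_db = e_value(bits=50.0, query_len=200, db_len=2_000_000_000)

# Same bit score, twice the database size: the E-value doubles.
print(small_db, large_db)
```

The bit score itself does not change with database size; only the E-value derived from it does.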

What is BLASTN used for?

Comparing nucleotide sequences against nucleotide databases.

What is BLASTP used for?

Comparing protein sequences against protein databases.

What is BLASTX used for?

Comparing a nucleotide sequence translated into all six reading frames against a protein database.

What is TBLASTN used for?

Comparing protein sequences against a nucleotide database translated in all six reading frames.

What is TBLASTX used for?

Comparing nucleotide sequences translated into all six reading frames against a nucleotide database that is also translated in all six reading frames.
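
To make "all six reading frames" concrete, here is a minimal sketch of six-frame translation: three offsets on the forward strand and three on the reverse complement. It assumes the standard genetic code and plain A/C/G/T input; the function names are illustrative and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq: str) -> list[str]:
    """Frames +1, +2, +3 followed by frames -1, -2, -3."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


print(six_frame_translations("ATGGCC"))
```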

What factors should you consider when choosing alignment software?

Type of sequences/experiment, sequencing platform, planned further analysis, and computational infrastructure.

Why is it important to know your sequencing platform when choosing alignment software?

Different platforms (e.g., Illumina, Ion Torrent, PacBio) have unique read lengths and error types that affect compatibility with mapping tools.

How can downstream analysis influence your choice of alignment software?

The alignment types and file formats reported by the aligner must be compatible with the tools used for subsequent analysis.

Why is computational infrastructure important in selecting alignment software?

Some tools may require significant computational power or memory resources; knowing your available resources helps in making an appropriate choice.

Aspect

Short Read Alignment

Read Length and Characteristics

Typically range from 36 to 600 bp; generated by platforms like Illumina; cost-effective, high-quality data but can complicate assembly of complex genomes.

Alignment Speed and Efficiency

Generally faster to align due to smaller size; may require more computational resources for large datasets; lower alignment rate for short reads.

Error Rates

Generally lower error rates, suitable for applications requiring high accuracy, such as variant calling.

Applications

Commonly used in RNA-Seq, targeted resequencing, and SNP detection due to high throughput and accuracy.

Alignment Tools

Tools like BWA, Bowtie2, and STAR are designed specifically for short-read data.

Aspect

Description

Genome Assembly

The process of reconstructing a complete genome from short DNA sequences, known as reads.

Reads

The smallest unit of sequencing data obtained from next-generation sequencing (NGS); short sequences of DNA, typically ranging from 150 to 600 base pairs.

Contigs

Contiguous sequences of DNA assembled from overlapping reads; represent longer stretches of the genome without gaps, usually ranging from hundreds of bases to several kilobases.

Scaffolds

Longer sequences constructed by linking contigs together using additional information about their relative positions and orientations; may contain gaps where the exact sequence is unknown.

K-mer

A substring of length k from a longer DNA sequence; used in assembly algorithms to identify overlaps between reads and help construct contigs.

De Bruijn Graph

A data structure representing overlaps between k-mers; each k-mer is a vertex, and edges connect vertices that overlap by k-1 bases; allows efficient assembly of contigs.
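
The k-mer and de Bruijn graph cards can be sketched in a few lines. This follows the definition given above (each distinct k-mer is a vertex, with an edge wherever the last k-1 bases of one k-mer equal the first k-1 bases of another); the function names are illustrative, and real assemblers add error correction, reverse-complement handling, and graph simplification that are omitted here.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> list[str]:
    """Return every substring of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each distinct k-mer to the k-mers it overlaps by k-1 bases."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))

    # Index k-mers by their first k-1 bases so successors are easy to look up.
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)

    return {node: sorted(by_prefix.get(node[1:], [])) for node in sorted(nodes)}


graph = de_bruijn_graph(["ATGGC", "GGCA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Following the edges from ATG through TGG and GGC to GCA spells out the contig ATGGCA, which neither read contains on its own.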

Assembling Software

Tools available for genome assembly with different strengths.

Velvet

A popular de novo assembler using a de Bruijn graph approach to assemble short reads into contigs.

Edena

Designed for assembling short reads; can handle large datasets and produce high-quality assemblies.

SSAKE

Focuses on creating contigs from paired-end reads, improving accuracy by utilizing information from both ends of the read pairs.

K-mer Selection

Choosing the value of k is crucial; a smaller k may lead to many small contigs, while a larger k can produce fewer, longer contigs, provided coverage is high enough for reads to overlap by k-1 bases.

Good Assembly Definition

Characterized by longer and fewer contigs, a high percentage of contigs ≥ 1 kb, and a high total number of bases in contigs. High coverage and appropriate GC content also contribute to a good assembly.

Methods to Increase Assembly Quality

Strategies to enhance the quality of genome assembly.

High Coverage

Increasing coverage depth can reduce gaps between contigs and lead to longer contigs overall.

GC Content

Maintaining appropriate GC content can improve sequencing accuracy and assembly quality.
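
For reference, GC content is simply the fraction of bases that are G or C. A minimal sketch follows; the function name is illustrative, and ambiguous bases such as N are counted in the length but not as G or C, which is an assumption of this example.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


print(gc_content("ATGCGC"))
```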

Mate pair

A pair of reads from the two ends of the same insert fragment.

N50 of the assembly

A weighted median length statistic: 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.
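
The definition can be checked with a short calculation: sort the contig lengths from longest to shortest and add them up until the running total reaches half of the assembly size. This is a minimal sketch with an illustrative function name and made-up contig lengths.

```python
def n50(lengths: list[int]) -> int:
    """Return the length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths: total 20, so half is 10; 6 + 5 = 11 reaches it.
print(n50([2, 3, 4, 5, 6]))
```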

Factors that may affect the assembly

- Coverage
- % GC
- Genomic content (repetitive regions, transposons, etc.)

Effect of coverage

Coverage = N × L / G, where N is the total number of reads, L is the read length, and G is the genome size. Higher coverage generally leads to better assembly quality.
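
A worked example of the formula, using made-up numbers: one million reads of 150 bp over a 5 Mb genome gives 30x coverage. The function name is illustrative.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average sequencing depth: N * L / G."""
    return num_reads * read_length / genome_size


# Hypothetical run: 1,000,000 reads of 150 bp over a 5 Mb genome.
print(coverage(num_reads=1_000_000, read_length=150, genome_size=5_000_000))
```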

velveth and velvetg

velveth: converts reads to k-mers. It builds the dataset that velvetg needs and tells the system what each sequence file represents. velvetg: assembles the k-mers into contigs.
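
The two-step workflow can be sketched as a pair of command lines, built here in Python without being run. The output directory, k value, and read file name are hypothetical, and the example assumes single-end short reads in FASTQ format; consult the Velvet manual for the options that match your data.

```python
def velvet_commands(out_dir: str, k: int, reads_file: str) -> tuple[list[str], list[str]]:
    """Build (but do not run) the velveth and velvetg command lines."""
    # Step 1: velveth hashes the reads into k-mers and records the file type.
    velveth = ["velveth", out_dir, str(k), "-fastq", "-short", reads_file]
    # Step 2: velvetg builds the graph from that directory and writes the contigs.
    velvetg = ["velvetg", out_dir]
    return velveth, velvetg


# Hypothetical paths and k, for illustration only.
velveth_cmd, velvetg_cmd = velvet_commands("assembly_out", 31, "reads.fq")
print(" ".join(velveth_cmd))
print(" ".join(velvetg_cmd))
```

With Velvet installed, each list could be passed to subprocess.run to execute the step.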

alignment_assembly Flashcards

(41 cards)