alignment_assembly Flashcards

1
Q

What is the alignment process in bioinformatics?

A

The alignment process compares sequences (DNA, RNA, or protein) to identify regions of similarity, helping to understand functional, structural, and evolutionary relationships.
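
As one concrete way to compare two sequences, here is a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch approach). The function name and the scores (match +1, mismatch -1, gap -1) are illustrative assumptions, not values from these cards.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best global alignment of a and b (illustrative scoring)."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATACA"))
```

The sketch returns only the score; recovering the aligned strings would need a traceback over the full table.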

2
Q

What is a bit score?

A

A normalized form of the raw alignment score that reflects the significance of the alignment between a query sequence and a database sequence, independent of the scoring system and of database size; higher bit scores indicate more significant alignments.
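
The normalization is usually written as S' = (λS − ln K) / ln 2, where S is the raw score and λ and K are statistical parameters of the scoring system. A minimal sketch follows; the function name is hypothetical and the parameter values in the example call are illustrative placeholders, not values from these cards.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw score S into a bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameter values only; real values depend on the scoring system.
print(round(bit_score(100, lam=0.267, k=0.041), 1))
```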

3
Q

What is an E-value?

A

The Expect value describes the number of hits one can expect to see by chance when searching a database of a particular size; lower E-values indicate more significant matches.
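
A common simplified form of this relationship is E = m × n × 2^(−S'), where m is the query length, n is the total database length, and S' is the bit score. The sketch below assumes that form and ignores the effective-length corrections that real search tools apply; the function name is hypothetical.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """E = m * n * 2**(-S'): expected number of chance hits scoring at least S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


# A 10-bit hit for a 1024-residue query against a 1-residue "database".
print(expect_value(10, query_len=1024, db_len=1))
```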

4
Q

How does the length of the query sequence affect the E-value?

A

Shorter query sequences tend to have higher E-values because a short match is more likely to occur in the database by random chance.

5
Q

How does database size influence the E-value?

A

A larger database offers more opportunities for a match with the same score to occur by chance, so the E-value for a given score increases with database size.
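
Under the simplifying assumption that the E-value is directly proportional to the total database length, a hit can be rescaled from one database size to another. The helper below is a hypothetical sketch of that proportionality, not a feature of any particular tool.

```python
def rescale_e_value(e_value: float, old_db_len: int, new_db_len: int) -> float:
    """Rescale an E-value, assuming it grows linearly with database length."""
    return e_value * new_db_len / old_db_len


# The same hit looks 100 times less significant in a database 100 times larger.
print(rescale_e_value(1e-5, old_db_len=10**6, new_db_len=10**8))
```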

6
Q

What is BLASTN used for?

A

Comparing nucleotide sequences against nucleotide databases.

7
Q

What is BLASTP used for?

A

Comparing protein sequences against protein databases.

8
Q

What is BLASTX used for?

A

Comparing a nucleotide sequence translated into all six reading frames against a protein database.
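
To make "all six reading frames" concrete, here is a minimal sketch that translates a DNA string in three forward frames and three frames of the reverse complement. It assumes the standard genetic code and uppercase A/C/G/T input; all names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order line up with AMINO_ACIDS.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate the complete codons of seq; unrecognized codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(dna: str) -> dict[str, str]:
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse_complement[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
```

Stop codons are shown as `*`; a search tool would then compare each translated frame against the protein database.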

9
Q

What is TBLASTN used for?

A

Comparing protein sequences against a nucleotide database translated in all six reading frames.

10
Q

What is TBLASTX used for?

A

Comparing nucleotide sequences translated in all six reading frames against a nucleotide database that is also translated in all six reading frames.

11
Q

What factors should you consider when choosing alignment software?

A

Type of sequences/experiment, sequencing platform, planned further analysis, and computational infrastructure.

12
Q

Why is it important to know your sequencing platform when choosing alignment software?

A

Different platforms (e.g., Illumina, Ion Torrent, PacBio) have unique read lengths and error types that affect compatibility with mapping tools.

13
Q

How can downstream analysis influence your choice of alignment software?

A

The tools used for further analysis must be compatible with the alignment types and output formats the aligner reports, so choose an aligner whose output those tools can read.

14
Q

Why is computational infrastructure important in selecting alignment software?

A

Some tools may require significant computational power or memory resources; knowing your available resources helps in making an appropriate choice.

15
Q

Aspect

A

Short Read Alignment

16
Q

Read Length and Characteristics

A

Typically range from 36 to 600 bp; generated by platforms like Illumina; cost-effective, high-quality data but can complicate assembly of complex genomes.

17
Q

Alignment Speed and Efficiency

A

Each read is generally faster to align because it is short, but large datasets may still require substantial computational resources; short reads can also have a lower alignment rate.

18
Q

Error Rates

A

Generally lower error rates, suitable for applications requiring high accuracy, such as variant calling.

19
Q

Applications

A

Commonly used in RNA-Seq, targeted resequencing, and SNP detection due to high throughput and accuracy.

20
Q

Alignment Tools

A

Tools like BWA, Bowtie2, and STAR are designed specifically for short-read data.

21
Q

Aspect

A

Description

22
Q

Genome Assembly

A

The process of reconstructing a complete genome from short DNA sequences, known as reads.

23
Q

Reads

A

The smallest unit of sequencing data obtained from next-generation sequencing (NGS); short sequences of DNA, typically ranging from 150 to 600 base pairs.

24
Q

Contigs

A

Contiguous sequences of DNA assembled from overlapping reads; represent longer stretches of the genome without gaps, usually ranging from hundreds of bases to several kilobases.

25
Q

Scaffolds

A

Longer sequences constructed by linking contigs together using additional information about their relative positions and orientations; may contain gaps where the exact sequence is unknown.

26
Q

K-mer

A

A substring of length k from a longer DNA sequence; used in assembly algorithms to identify overlaps between reads and help construct contigs.
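
As a small illustration, the k-mers of a sequence can be listed with a sliding window. The function name is hypothetical.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ATGCG", 3))
```

A sequence of length L yields L − k + 1 k-mers, or none when it is shorter than k.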

27
Q

De Bruijn Graph

A

A data structure representing overlaps between k-mers; each k-mer is a vertex, and edges connect vertices that overlap by k-1 bases; allows efficient assembly of contigs.
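
The description above can be sketched as follows, using the same formulation as the card: every distinct k-mer is a vertex, and an edge runs from one k-mer to another when the last k-1 bases of the first equal the first k-1 bases of the second. The function name is hypothetical, and this toy version ignores reverse complements, sequencing errors, and k-mer counts.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the k-mers that follow it with a (k-1)-base overlap."""
    nodes = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    # Index k-mers by their first k-1 bases so successors can be looked up directly.
    by_prefix = defaultdict(set)
    for kmer in nodes:
        by_prefix[kmer[:-1]].add(kmer)
    return {kmer: set(by_prefix.get(kmer[1:], set())) for kmer in nodes}


graph = de_bruijn_graph(["ATGC", "TGCA"], k=3)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

Walking an unbranched path through such a graph and merging the overlapping k-mers spells out a contig. Some assemblers instead treat (k-1)-mers as vertices and k-mers as edges, which is an equivalent way to represent the same overlaps.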

28
Q

Assembling Software

A

Tools available for genome assembly with different strengths.

29
Q

Velvet

A

A popular de novo assembler using a de Bruijn graph approach to assemble short reads into contigs.

30
Q

Edena

A

Designed for assembling short reads; can handle large datasets and produce high-quality assemblies.

31
Q

SSAKE

A

Focuses on creating contigs from paired-end reads, improving accuracy by utilizing information from both ends of the read pairs.

32
Q

K-mer Selection

A

The choice of k in k-mer analysis is crucial: a smaller k may lead to many small contigs, while a larger k can produce fewer, longer contigs, provided coverage is high enough for the k-mers to still overlap.

33
Q

Good Assembly Definition

A

Characterized by longer and fewer contigs, a high percentage of contigs ≥ 1 kb, and a high total number of bases in contigs.
It should also have high coverage and maintain the appropriate GC content.
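
The contig-based criteria above can be computed from a list of contig lengths. This is a minimal sketch with hypothetical names; the 1 kb threshold comes from the card, everything else is an assumption.

```python
def assembly_summary(contig_lengths: list[int], min_len: int = 1000) -> dict[str, float]:
    """Summarize contig count, total bases, longest contig, and % of contigs >= min_len."""
    count = len(contig_lengths)
    long_contigs = sum(1 for length in contig_lengths if length >= min_len)
    return {
        "num_contigs": count,
        "total_bases": sum(contig_lengths),
        "longest_contig": max(contig_lengths, default=0),
        "pct_contigs_ge_min_len": 100.0 * long_contigs / count if count else 0.0,
    }


print(assembly_summary([500, 1000, 2500, 6000]))
```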

34
Q

Methods to Increase Assembly Quality

A

Strategies to enhance the quality of genome assembly.

35
Q

High Coverage

A

Increasing coverage depth can reduce gaps between contigs and lead to longer contigs overall.

36
Q

GC Content

A

Maintaining appropriate GC content can improve sequencing accuracy and assembly quality.
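
GC content itself is simple to measure, which makes it easy to compare an assembly against the value expected for the organism. A minimal sketch, assuming ambiguous bases such as N should be left out of the denominator (the function name is hypothetical):

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


print(gc_content("ATGCGC"))
```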

37
Q

Mate pair

A

A pair of reads sequenced from the two ends of the same insert fragment.

38
Q

N50 of the assembly

A

A weighted median length statistic: 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than the N50 value.
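
One way to compute this is to sort the lengths from longest to shortest and accumulate them until half of the total is reached. A minimal sketch with a hypothetical function name:

```python
def n50(lengths: list[int]) -> int:
    """Length of the contig at which the running total reaches half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


print(n50([80, 70, 50, 40, 30, 20]))
```

In the example the total is 290, and the two longest contigs (80 + 70 = 150) already cover half of it, so the N50 is 70.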

39
Q

Factors that may affect the assembly

A
  • Coverage
  • % GC
  • Genomic content (repetitive regions, transposons, etc.)
40
Q

Effect of coverage

A

Coverage = (N × L) / G
N = total number of reads
L = read length
G = genome size

Higher coverage generally leads to better assembly quality.
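
A worked example of the formula, plus its rearrangement to estimate how many reads a target coverage needs. The function names and the numbers in the example are illustrative assumptions.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Coverage = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Rearranged formula: N = coverage * G / L, rounded up to whole reads."""
    return math.ceil(target_coverage * genome_size / read_length)


# One million 150 bp reads over a 5 Mb genome.
print(coverage(1_000_000, 150, 5_000_000))
print(reads_needed(30, 150, 5_000_000))
```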

41
Q

velveth and velvetg

A

velveth: converts reads to k-mers
velvetg: assembles the k-mers into contigs

velveth helps you construct the dataset for the following program, velvetg, and indicates to the system what each sequence file represents.
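
The two steps can be sketched as a pair of command lines built in Python. The directory name, file name, and k value are hypothetical, and the options shown (`-fastq`, `-short`, `-cov_cutoff auto`) are one plausible choice for single-end short reads; check the Velvet manual for the options that match your data. Velvet expects an odd k.

```python
import shutil
import subprocess


def velvet_commands(out_dir: str, reads_fastq: str, k: int = 31) -> list[list[str]]:
    """Build the velveth and velvetg command lines without running them."""
    # velveth: hash the reads into k-mers and declare the file format and read type.
    velveth = ["velveth", out_dir, str(k), "-fastq", "-short", reads_fastq]
    # velvetg: build the graph from the velveth output and write the contigs.
    velvetg = ["velvetg", out_dir, "-cov_cutoff", "auto"]
    return [velveth, velvetg]


def run_velvet(out_dir: str, reads_fastq: str, k: int = 31) -> None:
    """Run both steps in order; requires Velvet to be installed and on PATH."""
    if shutil.which("velveth") is None or shutil.which("velvetg") is None:
        raise RuntimeError("Velvet executables not found on PATH")
    for command in velvet_commands(out_dir, reads_fastq, k):
        subprocess.run(command, check=True)


for command in velvet_commands("assembly_k31", "reads.fq"):
    print(" ".join(command))
```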