Genomics and bioinformatics. Flashcards

Question

Why would you want to resequence a genome? (3 reasons)

Answer 1

1. Identify sequence variants. 2. Identify transcribed regions. 3. Quantify levels of transcription (RNA-seq).

Answer 2

Not all amino acids are equivalent.

Answer 3

Equivalent amino acids (i.e. same size and charge) are given less of a negative score.

Answer 4

PAM70 and BLOSUM62.

Answer 5

Margaret Dayhoff.

Answer 6

1572 mutations and 71 protein families.

Answer 7

An attempt is made to align the sequences across their entire length.

Answer 8

Equivalent.

Answer 9

Needleman and Wunsch.
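
As an illustration of the Needleman and Wunsch approach, here is a minimal sketch of the dynamic programming recurrence that scores a global alignment. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, not values from the deck; a real protein aligner would look up scores in a substitution matrix such as BLOSUM62.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    # The first row is all gaps: aligning nothing against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    # Global alignment: the score is read from the final cell only.
    return prev[-1]
```

Because every position of both sequences must be included, the score is taken from the last cell of the matrix rather than from the best cell anywhere.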

Answer 10

Can align subsequences of the full sequences; this results in fewer gaps.

Answer 11

Smith and Waterman.
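
A minimal sketch of the Smith and Waterman idea, under assumed scoring values (match +2, mismatch -1, gap -2) and a hypothetical function name. It differs from a global alignment in two ways: cell scores are never allowed to fall below zero, and the result is the best cell anywhere in the matrix.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row is zero: a local alignment can start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets the alignment restart after a poor region.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

With these settings, two sequences that share only the core "ACG" score 6 (three matches), regardless of the unrelated flanking bases.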

Answer 12

Dot plots.

Answer 13

Diagonal line (TR to BL) intersected by a diagonal line going the opposite direction, followed by the original diagonal line.
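
A dot plot can be sketched in a few lines. This is a simplified, assumed version that marks single identical characters; practical tools usually compare windows of several residues to reduce noise, and for DNA an inversion would be detected against the reverse complement. The function name is hypothetical.

```python
def dot_plot(seq_x, seq_y):
    """Return text rows with '*' wherever a residue of seq_y equals one of seq_x.

    seq_x runs along the columns and seq_y down the rows.
    """
    return [
        "".join("*" if y == x else "." for x in seq_x)
        for y in seq_y
    ]


if __name__ == "__main__":
    # Identical sequences give one unbroken diagonal.
    print("\n".join(dot_plot("ACGT", "ACGT")))
    print()
    # A reversed sequence gives a diagonal running in the opposite direction.
    print("\n".join(dot_plot("ACGT", "TGCA")))
```

A segment that is reversed in one sequence therefore shows up as a stretch of the diagonal running the opposite way, flanked by the original diagonal on either side.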

Answer 14

Alignment of three or more sequences.

Answer 15

As the number of possible alignments increases exponentially with every sequence that is added.

Answer 16

Heuristic.

Answer 17

An initial examination of a guide tree which gives an indication of the relationship between the species.

Answer 18

Most closely related sequences.

Answer 19

As a single sequence.

Answer 20

Clustal W, T-Coffee, MUSCLE and MAFFT.

Answer 21

Programmes that infer phylogenetic trees.

Answer 22

Basic Local Alignment Search Tool.

Answer 23

Searching a sequence database to rapidly identify sequences similar to a query sequence.

Answer 24

Heuristic.

Answer 25

It is not guaranteed to identify the best alignment (although it usually does).

Answer 26

Smith and Waterman, although it is not an algorithm.

Answer 27

3 amino acids / 11 nucleotides by default.

Answer 28

Regions of low complexity (e.g. regions that are made from a single nucleotide/ same amino acid) as these are relatively uninteresting regions.

Answer 29

The BLAST database is searched for seed alignments that match the words in the query; these are then extended.
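
The seeding step can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it only finds exact word matches, whereas protein BLAST also accepts similar "neighbourhood" words, and the extension step is omitted. The function names are hypothetical, and the default word size of 11 follows the nucleotide default given in Answer 27.

```python
def query_words(query, word_size=11):
    """Split the query into every overlapping word of length word_size."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def seed_hits(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    # Index every word of the subject so each query word is a dictionary lookup.
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    return [
        (i, j)
        for i, word in enumerate(query_words(query, word_size))
        for j in index.get(word, [])
    ]
```

Each returned pair is a candidate seed; a full search would extend it in both directions and keep the extension only while the alignment score stays high enough.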

Answer 30

One or more HSPs (high-scoring segment pairs). These are local alignments with a score above a particular threshold.

Answer 31

FASTQ files.

Answer 32

1. Sequence data. 2. Measure of the sequence quality.

Answer 33

Begin with an @ sign and contain a unique read identifier.
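
To make the record layout concrete, here is a minimal sketch that parses one FASTQ record, assuming the common four-line layout (header, sequence, a separator line starting with +, quality string). The function name and the returned dictionary keys are choices made for this example.

```python
def parse_fastq_record(lines):
    """Parse the four lines of one FASTQ record into a dictionary."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("FASTQ header line must begin with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line of a FASTQ record must begin with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality string must be the same length")
    return {
        # The read identifier is the text after '@', up to the first whitespace.
        "id": header[1:].split()[0],
        "sequence": sequence,
        "quality": quality,
    }
```

The length check reflects the fact that there is exactly one quality character per base.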

Answer 34

Q = -10 × log10(P), where P is the probability of an error.

Answer 35

1 in 10 chance that the base is wrong.

Answer 36

1 in 100 chance that the base is wrong.

Answer 37

1 in 1000 chance that the base is wrong.
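
The three values above follow from the Phred relationship Q = -10 × log10(P). A short sketch of the conversion in both directions (function names are hypothetical):

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability that the base is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_probability(q))
```

Every increase of 10 in the score corresponds to a tenfold drop in the chance of an incorrect base call.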

Answer 38

In FASTQ files as a series of letters, digits and punctuation. 33 is added to each score and the result is converted to an ASCII character.
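
A minimal sketch of this encoding, assuming the offset of 33 described above (function names are hypothetical):

```python
def encode_qualities(scores, offset=33):
    """Turn a list of Phred scores into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string back into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]


if __name__ == "__main__":
    print(decode_qualities("!+5?I"))
```

With this offset a score of 0 is written as "!" and a score of 40 as "I", which is why punctuation marks tend to indicate low quality and letters high quality.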

Answer 39

Letters. Punctuation shows a bad quality score.

Answer 40

1. Whole sequence unknown. 2. Coverage bias. 3. Sequencing errors. 4. Repeats. 5. Multiple replicons. 6. Contamination. 7. Circular genomes.

Answer 41

As there is no obvious start and end and the whole thing overlaps.

Answer 42

Two samples can be loaded into one tube.

Answer 43

When they are longer than the inserts, as you do not know the ends. This could be solved by long-read technologies.

Answer 44

PacBio, Oxford Nanopore.

Answer 45

Resolving bubbles in de Bruijn graphs.
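
To show where a bubble comes from, here is a minimal sketch that builds a de Bruijn graph from reads. It assumes the common construction in which nodes are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix; the function name is hypothetical, and real assemblers also track k-mer counts and reverse complements.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the sorted list of nodes it leads to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its first k-1 bases to its last k-1 bases.
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


if __name__ == "__main__":
    # Two reads that differ by a single base: the paths split at "AC"
    # and rejoin at "TA", forming a bubble.
    print(de_bruijn_graph(["ACGTA", "ACCTA"], 3))
```

A node with more than one successor, whose branches later rejoin, is the kind of bubble that a sequencing error or a real variant produces.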

Answer 46

The longest route.

Answer 47

Coverage levels and paired reads.

Answer 48

There is a break in the assembly.

Answer 49

It gets simpler.

Answer 50

The whole genome is essentially a repeat.

Answer 51

As the genetic code is a triplet code.

Answer 52

Through the presence of long ORFs.
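
A minimal sketch of an ORF scan, assuming a DNA sequence containing only A, C, G and T, ATG as the only start codon, and the three standard stop codons. It checks all six reading frames (three on each strand). The function names are hypothetical, and real gene finders add further evidence such as codon usage.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield ORFs (ATG through the stop codon) in one forward reading frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield seq[start:i + 3]
            start = None


def longest_orf(seq):
    """Return the longest ORF found in any of the six reading frames."""
    candidates = [
        orf
        for strand in (seq, reverse_complement(seq))
        for frame in range(3)
        for orf in orfs_in_frame(strand, frame)
    ]
    return max(candidates, key=len, default="")
```

In random sequence a stop codon turns up by chance fairly often, so an unusually long run without one is a useful hint that the region is coding.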

Answer 53

AUG, Pribnow box, Shine-Dalgarno sequence.

Answer 54

Due to bias in codon usage.

Answer 55

Glimmer: finds the longest ORFs in the genome, identifies patterns in them, and uses these patterns to identify shorter genes.

Answer 56

As they contain no long uninterrupted ORFs and contain introns.

Answer 57

Kozak sequence.

Answer 58

Terminator consensus and the poly(A) signal.

Answer 59

Donor and acceptor sites.

Answer 60

Alternative splicing patterns.

Answer 61

To identify genes in eukaryotes despite them having alternative splicing patterns. Accuracy of this software is limited.

Answer 62

EST sequences or RNA seq data.

Answer 63

Homologous genes can be found using BLAST and the function of the gene can be inferred.

Answer 64

HMMer, Pfam.

Answer 65

tRNA, rRNA, ncRNA genes and common repeat elements.

Answer 66

Prokka, MAKER

Answer 67

That the k-mers can be stored in memory, so they require less RAM.

Answer 68

Rare k-mers.

Answer 69

Paired-end reads.

Answer 70

Investigate genetic variation within a population of a species.

Answer 71

1. Single gene disorders. 2. Complex gene disorders. 3. Cancer.

Answer 72

Personalised medicine and genome editing methods such as CRISPR/Cas9.

Answer 73

RNA-seq and ChIP-seq.

Answer 74

FASTQ files.

Answer 75

1. BWA (Burrows-Wheeler Aligner). 2. Bowtie 2.

Answer 76

Each read is compared with a short index on the reference genome to identify short identical matches called seed sequences. The alignments from the seed sequences are extended to include the rest of the read.

Answer 77

An alignment scoring system.

Answer 78

1. Number of matches. 2. Number of mismatches. 3. Base qualities.

Answer 79

Mismatches at lower-quality bases are penalised less than mismatches at higher-quality bases.
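
A toy illustration of quality-aware scoring. The penalty range (2 to 6), the quality cap of 40, the match bonus and the function names are all assumptions made for this sketch; they are not taken from the deck, and actual mappers define their own schemes.

```python
def mismatch_penalty(quality, max_penalty=6, min_penalty=2):
    """Scale the mismatch penalty with base quality (hypothetical scheme)."""
    capped = min(quality, 40)
    # Quality 0 gives min_penalty; quality 40 or above gives max_penalty.
    return min_penalty + (max_penalty - min_penalty) * capped // 40


def score_alignment(read, reference, qualities, match_bonus=1):
    """Score an ungapped alignment of a read against a reference segment."""
    score = 0
    for read_base, ref_base, quality in zip(read, reference, qualities):
        if read_base == ref_base:
            score += match_bonus
        else:
            score -= mismatch_penalty(quality)
    return score
```

The reasoning is that a mismatch at a low-quality base is more likely to be a sequencing error than evidence that the read belongs elsewhere, so it should count less against the alignment.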

Answer 80

Paired reads. The best mapping position also takes these into account though.

Answer 81

The confidence that the read is derived from that position in the genome.

Answer 82

Ambiguously mapped reads, e.g. reads from repeat regions.

Answer 83

BAM files.

Answer 84

Details of which position on which chromosome the read is mapped to.

Answer 85

The number of reads that overlap a particular position.
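
A minimal sketch of computing depth at every position, assuming each mapped read is represented as a 0-based, half-open (start, end) interval on a single reference sequence. That representation and the function name are assumptions for this example; in practice the intervals would come from a BAM file.

```python
def depth_of_coverage(reads, genome_length):
    """Return a list giving the number of reads overlapping each position."""
    # Record +1 where a read starts and -1 just past where it ends,
    # then take a running total across the genome.
    changes = [0] * (genome_length + 1)
    for start, end in reads:
        changes[start] += 1
        changes[end] -= 1
    depth = []
    running = 0
    for delta in changes[:genome_length]:
        running += delta
        depth.append(running)
    return depth


if __name__ == "__main__":
    print(depth_of_coverage([(0, 4), (2, 6), (3, 5)], 8))
```

Plotting the returned list against position gives the kind of coverage chart described in the next answer.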

Answer 86

A bar chart showing the variation in depth of coverage.

Answer 87

Distinguish between errors and real differences between the sequences. A 50× read coverage allows you to make the assumption that any errors are random.

Answer 88

GATK and SAMtools.

Answer 89

They use a probabilistic model to distinguish real homozygous or heterozygous variants from sequencing errors.
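
A simplified sketch of the idea behind such a model, not the method used by any particular tool. It assumes a diploid site, independent reads and a single fixed error rate of 1%, and compares binomial likelihoods for three genotypes. The function names and the error rate are assumptions for this example.

```python
from math import comb


def genotype_likelihoods(depth, alt_reads, error_rate=0.01):
    """Binomial likelihood of the observed read counts under each genotype."""
    # Expected fraction of reads showing the alternative allele per genotype.
    alt_fraction = {
        "hom_ref": error_rate,      # alternative reads arise only from errors
        "het": 0.5,                 # half the reads come from each allele
        "hom_alt": 1 - error_rate,  # reference reads arise only from errors
    }
    return {
        genotype: comb(depth, alt_reads)
        * fraction ** alt_reads
        * (1 - fraction) ** (depth - alt_reads)
        for genotype, fraction in alt_fraction.items()
    }


def most_likely_genotype(depth, alt_reads, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(depth, alt_reads, error_rate)
    return max(likelihoods, key=likelihoods.get)
```

At 50× coverage a single alternative read is best explained as an error, whereas 24 alternative reads out of 50 point to a heterozygous variant.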

Answer 90

Predicts the effect of a SNP.

Answer 91

It does not change an encoded protein.

Answer 92

Removes commonly found SNPs from the dataset. This is because common variants are less likely to be clinically relevant.

Answer 93

Affected and unaffected individuals. Ideally controls are related or are very genetically similar.

Answer 94

PCR and Sanger.

Answer 95

When the sequencer outputs the wrong base call.

Answer 96

A mapper placing the read in the incorrect position of the genome.

Answer 97

The sample containing DNA with a different sequence from a different source.

Answer 98

Reads from a sample being mislabelled.

Answer 99

Larger scale chromosome rearrangements.

Answer 100

1. Insertions. 2. Deletions. 3. Duplications/copy number variants. 4. Inversions. 5. Translocations.

Answer 101

Structural variants in resequencing.

Answer 102

Structural variants in resequencing.

Answer 103

Deletions in the reference.

Answer 104

Assemblies?

Answer 105

Regions associated with a particular phenotype by statistical association.

Answer 106

Variants that are over represented in some cases.

Answer 107

Manhattan plots.

Answer 108

Correlation doesn't always mean causation. Further studies are often needed to establish molecular mechanisms underlying the phenotype.

(154 cards)