Genomics and bioinformatics. Flashcards

1
Q

What is the definition of bioinformatics?

A

The science of collecting and analysing complex biological data such as the genetic code.

2
Q

Computing costs fall roughly in line with Moore’s law, but after the development of NGS the cost of genome sequencing fell much faster. How did the cost of sequencing one megabase change from 2001 to 2014?

A

2001 $10k

2014 $0.10

3
Q

What is the bioinformatics gap?

A

The problem of storing and analysing all the genetic data produced by NGS.

4
Q

What two sequencing techniques are often used in health care?

A

Illumina HiSeq and MiSeq.

5
Q

What sort of sequencing produces ‘big data’?

A

NGS.

6
Q

Is Illumina HiSeq stronger on read length or read depth?

A

Read depth: it produces 8 billion reads of 150 bp in high-output mode.

7
Q

What method of sequencing now takes up roughly 80% of the market?

A

Illumina.

8
Q

What format does a biological data file have to be in?

A

Plain text (.txt). It must not be saved in a word-processor format.

9
Q

What is the most common file format for biological data?

A

FASTA.

10
Q

What is found on the title line of a FASTA file?

A

A ‘>’ character, followed by an identifier and a description.

11
Q

How many characters are conventionally placed on each sequence line in a FASTA file?

A

60 (a convention rather than a hard limit; some tools wrap at 70 or 80).

12
Q

When sequences are compared, what can they be described as having?

A

A % identity.

13
Q

What is a padding character?

A

A ‘-’ character inserted to form a ‘gap’, which maximises the sequence identity.
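
As a rough illustration of how padding characters feed into a % identity, here is a minimal sketch. The function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption; other conventions divide by the shorter sequence or by ungapped columns only.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over all alignment columns; '-' is the padding character."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if both residues agree and neither is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


# One padding character lets the remaining six columns line up.
identity = percent_identity("GATTACA", "GA-TACA")
print(f"{identity:.1f}% identity")
```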

14
Q

What is sequence alignment used for?

A

Used to identify homology between sequences with a common ancestor.

15
Q

When using - to fill in gaps (padding character) what do you need to assume?

A

That each sequence is equivalent.

16
Q

What can sequence alignment be used to do other than work out a % identity (2 things)?

A
  1. Assemble short reads into contigs.
  2. Map reads to a reference genome (resequencing).

17
Q

What can resequencing allow you to identify?

A

Sequence variants and transcribed regions of the genome, allowing you to quantify transcripts and the level of transcription through RNA-seq.

18
Q

Would x-- or -x- be more likely?

A

x--, as it can be caused by a single event. Both are still possible, however.

19
Q

Scoring systems are used in sequence alignments. Matches are given a (+) score. What two events give a (-) score?

A

Gaps and mismatches.

20
Q

How do the penalties given to longer gaps differ slightly from the penalties for shorter gaps?

A

In a longer gap the first ‘-’ gets the full penalty for opening the gap, while each following ‘-’ gets a slightly smaller penalty because it only extends the gap.
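
This scheme is usually called an affine gap penalty. A minimal sketch follows; the function name and the penalty values (-10 to open, -1 to extend) are illustrative assumptions, not values taken from the course.

```python
def gap_penalty(length: int, gap_open: int = -10, gap_extend: int = -1) -> int:
    """Affine penalty: the first '-' pays gap_open, each later '-' pays gap_extend."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One gap of length 2 is penalised less than two separate gaps of length 1.
one_long_gap = gap_penalty(2)
two_short_gaps = gap_penalty(1) + gap_penalty(1)
print(one_long_gap, two_short_gaps)
```

Under these assumed values a single two-column gap scores better than two separate one-column gaps, which matches the idea that one event is more likely than two.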

21
Q

What are gaps also called in sequence alignment?

A

INDELs.

22
Q

What does calculating identities for sequences allow you to do?

A

Determine evolutionary relationships.

23
Q

What do sequence alignments allow you to create?

A

Contigs.

24
Q

Sequence alignments can be used to map reads to a reference genome. What is this process also called?

A

Resequencing.

25
Q

Why would you want to resequence a genome? (3 reasons)

A
  1. Identify sequence variants.
  2. Identify transcribed regions.
  3. Quantify levels of transcription (RNA-seq).
26
Q

Alignment is more difficult with protein sequences. Why?

A

Not all amino acids are equivalent.

27
Q

Amino acids are not all equivalent when it comes to making alignments. How does this affect the scoring?

A

Mismatches between similar amino acids (e.g. similar size and charge) are given less of a negative score.

28
Q

What are two common scoring matrices?

A

PAM70 and BLOSUM62.

29
Q

Who produced the PAM70 matrix and can be classed as the first bioinformatician?

A

Margaret Dayhoff.

30
Q

What percentage identity can be used on the PAM70 matrix?

A

30%.

31
Q

How many mutations and protein families did Margaret Dayhoff use to make the PAM70 matrix?

A

1571 mutations and 71 protein families.

32
Q

Alignments can be global. What does this mean?

A

An attempt is made to align the sequences across their entire length.

33
Q

For a global alignment to occur what do both sequences need to be?

A

Equivalent.

34
Q

Who proposed the first global alignment algorithm?

A

Needleman and Wunsch.
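
To make the idea of a global alignment concrete, here is a minimal sketch of a Needleman and Wunsch style dynamic programme that returns only the best score. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, and real tools use substitution matrices and affine gaps.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score: both sequences are aligned end to end."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATACA"))
```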

35
Q

How can local alignments be used to maximise alignment scores?

A

They can search subsequences of the full sequences, which results in fewer gaps.

36
Q

Who proposed the first local alignment algorithm based on the Needleman and Wunsch global alignment algorithm?

A

Smith and Waterman.
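
A minimal sketch of the local variant: compared with a global alignment, cell scores are never allowed to drop below zero and the answer is the best cell anywhere in the table. The scoring values (match +2, mismatch -1, gap -2) are illustrative assumptions.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between any pair of subsequences."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart wherever similarity begins.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Only the shared core 'GATTACA' contributes; the unrelated flanks are ignored.
print(smith_waterman_score("TTTGATTACATTT", "CCCGATTACACCC"))
```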

37
Q

What is a simple way to visualise sequence alignments?

A

Dot plots.
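
A minimal text-only sketch of a dot plot: one sequence runs along the columns, the other down the rows, and a mark is placed wherever the residues are identical. The function name is hypothetical, and the direction of the diagonal depends on how the axes are drawn; printed this way, identical sequences give a top-left to bottom-right diagonal.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[str]:
    """One string per row of seq_y; '*' marks identical residues, '.' otherwise."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


for row in dot_plot("GATTACA", "GATTACA"):
    print(row)
```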

38
Q

How would inversions be visualised in dot plots?

A

A diagonal line (TR to BL) interrupted by a diagonal segment running in the opposite direction, after which the original diagonal line resumes.

39
Q

What is the definition of a multiple alignment?

A

Alignment of three or more sequences.

40
Q

Why are multiple alignments considerably harder than pair alignments?

A

Because the number of possible alignments increases exponentially with every sequence that is added.

41
Q

What kind of method is applied to reduce the number of alignments examined in a multiple alignment?

A

A heuristic.

42
Q

What does a multiple alignment involve initially?

A

Construction of an initial guide tree, which gives an indication of the relationships between the sequences.

43
Q

What sequences are aligned first in a multiple alignment?

A

The most closely related sequences.

44
Q

Once the most closely related sequences have been aligned in a multiple alignment, how are they treated?

A

As a single sequence.

45
Q

What are four programs commonly used for multiple alignments?

A

ClustalW, T-Coffee, MUSCLE and MAFFT.

46
Q

What programs use multiple alignments as their input?

A

Programs that infer phylogenetic trees.

47
Q

What does ‘BLAST’ stand for?

A

Basic Local Alignment Search Tool.

48
Q

What is BLAST a widely used method for?

A

Searching a sequence database to rapidly identify sequences similar to a query sequence.

49
Q

Is BLAST a heuristic or an exact algorithm?

A

Heuristic.

50
Q

BLAST is heuristic. What does this mean?

A

It is not guaranteed to identify the best match (although it usually does).

51
Q

What is BLAST basically a fast version of?

A

Smith and Waterman, although BLAST is a heuristic rather than an exact algorithm.

52
Q

What does BLAST break the query sequence into?

A

Words of 3 amino acids or 11 nucleotides by default.

53
Q

What regions are often excluded from BLAST searches and why?

A

Regions of low complexity (e.g. runs of a single nucleotide or the same amino acid), as these are relatively uninteresting regions.

54
Q

How do BLAST searches work?

A

The BLAST database is searched for seed alignments that match the words in the query; these seeds are then extended.
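
The word and seed steps can be sketched as follows. This is a simplification that only finds exact word matches and skips the extension step; the function names are hypothetical, and real BLAST protein searches also accept similar "neighbourhood" words.

```python
from collections import defaultdict


def words(seq: str, size: int) -> list[tuple[int, str]]:
    """All overlapping words of the given size, with their start positions."""
    return [(i, seq[i:i + size]) for i in range(len(seq) - size + 1)]


def find_seeds(query: str, subject: str, size: int = 3) -> list[tuple[int, int]]:
    """(query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    index = defaultdict(list)
    for pos, word in words(subject, size):
        index[word].append(pos)
    return [(q_pos, s_pos)
            for q_pos, word in words(query, size)
            for s_pos in index.get(word, [])]


# Each seed would then be extended in both directions to form a local alignment.
print(find_seeds("MKVL", "AAMKVLAA"))
```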

55
Q

What does each BLAST hit have?

A

One or more HSPs (high-scoring segment pairs). These are local alignments with a score above a particular threshold.

56
Q

What are the basic files produced by DNA sequencers called?

A

FASTQ files.

57
Q

What two things do Fastq files include?

A
  1. Sequence data.
  2. A measure of the sequence quality.

58
Q

What two features does a FASTQ header line have?

A

Begin with an @ sign and contain a unique read identifier.

59
Q

During DNA sequencing each base is given a quality score. What can this score range between?

A

1-99.

60
Q

During DNA sequencing each base is given a quality score ranging from 1-99. What does the score normally max out at?

A

60.

61
Q

How are the quality scores assigned to each base calculated?

A

Q = -10 × log10(P), where P is the probability of an error.

62
Q

What does a quality score of 10 mean?

A

1 in 10 chance that the base is wrong.

63
Q

What does a quality score of 20 mean?

A

1 in 100 chance that the base is wrong.

64
Q

What does a quality score of 30 mean?

A

1 in 1000 chance that the base is wrong.

65
Q

What do separator lines in Fastq files contain?

A

+.

66
Q

How are quality scores stored?

A

In FASTQ files, as a series of letters, digits and punctuation. 33 is added to each score and the result is converted to the corresponding ASCII character.
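
A minimal sketch of reversing that encoding and turning a score back into an error probability. The function names are hypothetical; the offset of 33 is the one stated on this card.

```python
def decode_qualities(quality_line: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string back into numeric quality scores."""
    return [ord(character) - offset for character in quality_line]


def error_probability(quality: int) -> float:
    """Invert Q = -10 * log10(P) to recover the chance that the base is wrong."""
    return 10 ** (-quality / 10)


# Punctuation such as '!' decodes to a low score, letters such as 'I' to a high one.
for score in decode_qualities("!+5?I"):
    print(score, error_probability(score))
```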

67
Q

Do letters or punctuation show a good quality score?

A

Letters. Punctuation shows a bad quality score.

68
Q

After paired-end Illumina sequencing, how many FASTQ files do you get?

A

2.

69
Q

What are 7 problems of whole genome shotgun sequencing?

A
  1. Whole sequence unknown.
  2. Coverage bias.
  3. Sequencing errors.
  4. Repeats.
  5. Multiple replicons.
  6. Contamination.
  7. Circular genomes.
70
Q

Why are circular genomes a problem in whole genome sequencing?

A

As there is no obvious start and end and the whole thing overlaps.

71
Q

What is the biggest issue for shotgun sequencing?

A

Repeats.

72
Q

How can contamination be a problem in sequencing?

A

Two samples can be loaded into one tube.

73
Q

When can repeats not be sequenced and what could solve this?

A

When they are longer than the inserts, as you do not know where the ends are. This could be solved by long-read technologies.

74
Q

Name two examples of long read technologies.

A

PacBio and Oxford Nanopore.

75
Q

What is a common approach to assemble short reads while taking into account sequencing errors and repeats?

A

De Bruijn graphs.

76
Q

What do the bubbles in a de Bruijn graph represent?

A

Repeats.

77
Q

What is one of the key functions of genome assembly software?

A

Resolving bubbles in de Bruijn graphs.

78
Q

What route is taken in a de Bruijn graph?

A

The longest route.

79
Q

What are the fixed-length chunks of sequence used to build a de Bruijn graph called?

A

k-mers.
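
A minimal sketch of how k-mers become a de Bruijn graph: every k-mer contributes an edge from its first k-1 bases to its last k-1 bases. The function name and the toy reads are assumptions for illustration; real assemblers also track coverage and handle reverse complements.

```python
from collections import defaultdict


def de_bruijn_edges(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each (k-1)-mer node to the nodes reachable through one k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping reads collapse into one path: AC -> CG -> GT -> TA.
print(de_bruijn_edges(["ACGT", "CGTA"], k=3))
```

A node with more than one successor is where the path through the graph becomes ambiguous.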

80
Q

What additional information does genome sequencing software use to resolve bubbles on a de Bruijn graph?

A

Coverage levels and paired reads.

81
Q

What happens if a bubble on a de Bruijn graph cannot be resolved?

A

There is a break in the assembly.

82
Q

As a read gets longer, what happens to a de Bruijn graph?

A

It gets simpler.

83
Q

What does a circular de Bruijn graph mean?

A

The whole genome is essentially a repeat.

84
Q

How many reading frames are there in double-stranded DNA?

A

6.

85
Q

Why are there 3 reading frames per DNA strand?

A

As the genetic code is a triplet code.
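
A minimal sketch of where the number six comes from: three possible starting offsets on each of the two strands. The function names are hypothetical and the input is assumed to contain only uppercase A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> list[list[str]]:
    """Codons for offsets 0, 1, 2 on the given strand, then on the opposite strand."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            shifted = strand[offset:]
            # Step in threes and drop any incomplete codon at the end.
            frames.append([shifted[i:i + 3] for i in range(0, len(shifted) - 2, 3)])
    return frames


for codons in six_frames("ATGAAATAG"):
    print(codons)
```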

86
Q

Are stop codons AT or GC rich?

A

AT.

87
Q

What BLAST program is used to look up nucleic acids in the nucleic acid database?

A

blastn.

88
Q

What BLAST program is used to look up conceptual protein translations in the conceptual protein translation database?

A

tblastx.

89
Q

What BLAST program is used to look up conceptual protein translations in the peptide protein database?

A

blastx.

90
Q

What BLAST program is used to look up peptide proteins in the conceptual protein translation database?

A

tblastn.

91
Q

What BLAST program is used to look up peptide proteins in the peptide protein database?

A

blastp.

92
Q

How can large genes be identified in bacterial/archaeal species?

A

Through the presence of long ORFs.

93
Q

Is it easier to identify long genes in bacterial/archaeal species in GC-rich or AT-rich genomes?

A

AT.

94
Q

What three conserved sequences can indicate the start of a gene in bacterial/archaeal species?

A

AUG, the Pribnow box and the Shine-Dalgarno sequence.

95
Q

Why do coding sequences within genes often have a characteristic base composition?

A

Due to bias in codon usage.

96
Q

Name a gene finding program and explain the basic approach that it uses.

A

Glimmer. It finds the longest ORFs in the genome, identifies sequence patterns in them, and uses these patterns to identify shorter genes.

97
Q

Why is it harder to find eukaryotic genes in the genome?

A

As their coding sequence is split by introns, so they do not contain a single continuous ORF.

98
Q

What can identify the start of a sequence in eukaryotic genes?

A

Kozak sequence.

99
Q

What two consensus sequences can identify the end of a eukaryotic gene?

A

Terminator consensus and the poly(A) signal.

100
Q

What can indicate the presence of an intron in a eukaryote?

A

Donor and acceptor sites.

101
Q

Why is gene prediction in eukaryotes hit and miss?

A

Alternative splicing patterns.

102
Q

What is the role of the software GeneMark and is it a good piece of software?

A

To identify genes in eukaryotes despite them having alternative splicing patterns. Accuracy of this software is limited.

103
Q

What experimental data can be used to improve gene prediction in eukaryotes?

A

EST sequences or RNA-seq data.

104
Q

Once a gene has been predicted how is its function often identified?

A

Homologous genes can be found using BLAST and the function of the gene can be inferred.

105
Q

Common domains in predicted protein sequences can be identified using software such as _______ against databases such as _______.

A

HMMER, Pfam.

106
Q

HMMER is a software package used to predict common domains within protein sequences. What other functional elements can be identified with similar software packages?

A

tRNA, rRNA, ncRNA genes and common repeat elements.

107
Q

Pipelines such as ______ in bacteria/archaea and _____ in eukaryotes combine gene-finding and gene-annotation approaches, allowing rapid gene prediction and functional annotation.

A

Prokka, MAKER

108
Q

Apart from resolving repeat regions, what is an advantage of using de Bruijn graphs?

A

The k-mers can be stored efficiently in memory, so less RAM is required.

109
Q

Removal of what can correct sequencing errors?

A

Rare k-mers.
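
A minimal sketch of the idea: k-mers that appear only once or twice across many reads are more likely to come from a sequencing error than from the genome. The function names, the toy reads and the `min_count` threshold are assumptions for illustration.

```python
from collections import Counter


def kmer_counts(reads: list[str], k: int) -> Counter:
    """Count every overlapping k-mer across all reads."""
    return Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))


def rare_kmers(reads: list[str], k: int, min_count: int = 2) -> set[str]:
    """k-mers seen fewer than min_count times, i.e. candidates for removal."""
    return {kmer for kmer, count in kmer_counts(reads, k).items() if count < min_count}


# The last read carries a single wrong base (G -> C), creating three rare 3-mers.
reads = ["ACGTA", "ACGTA", "ACGTA", "ACCTA"]
print(sorted(rare_kmers(reads, k=3)))
```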

110
Q

What can be used to resolve bubbles on de Bruijn graphs?

A

Paired-end reads.

111
Q

What does resequencing a genome allow you to do?

A

Investigate genetic variation within a population of a species.

112
Q

What three things can be studied through resequencing the human genome?

A
  1. Single gene disorders.
  2. Complex gene disorders.
  3. Cancer.
113
Q

Resequencing can identify variants. What could this possibly be useful for?

A

Personalised medicine and genome editing methods such as CRISPR/Cas9.

114
Q

What two functional genome technologies also rely on sequencing?

A

RNA-seq and ChIP-seq.

115
Q

How are Illumina sequencing reads stored?

A

Fastq files.

116
Q

Many programs are available which will map short reads to a reference genome. What are the two most common?

A
  1. BWA (Burrows-Wheeler Aligner).
  2. Bowtie 2.

117
Q

What read-mapping software allows a reference genome to be stored in computer memory and searched efficiently?

A

BWA.

118
Q

How much RAM does a computer need to search and store the reference genome?

A

2 GB.

119
Q

How are short identical matches found when comparing a sample to a reference genome?

A

Each read is compared against an index of the reference genome to identify short identical matches called seed sequences. The alignments from the seed sequences are then extended to include the rest of the read.

120
Q

What is used to identify the best mapping position?

A

An alignment scoring system.

121
Q

What is included in the alignment scoring system?

A
  1. Number of matches
  2. Number of mismatches
  3. Base qualities
122
Q

Not all mismatches are equally penalised in the alignment scoring system. How is this the case?

A

Mismatches at lower-quality bases are penalised less than mismatches at higher-quality bases.

123
Q

What is often mapped independently?

A

Paired reads. The best mapping position also takes these into account though.

124
Q

Each mapped read is given a mapping score. What does this score show?

A

The confidence that the read is derived from that position in the genome.

125
Q

What reads have low mapping scores?

A

Ambiguously mapped reads, e.g. reads from repeat regions.

126
Q

What files are mapped reads usually stored in?

A

BAM files.

127
Q

Mapped reads are stored in BAM files. What information is contained in these?

A

Details of which position on which chromosome the read is mapped to.

128
Q

What is the ‘depth of coverage’?

A

The number of reads that overlap a particular position.

129
Q

What is a ‘pile up plot’?

A

A bar chart showing the variation in depth of coverage.
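
A minimal sketch of computing depth of coverage and drawing a crude text pile-up. It assumes each mapped read is given as a hypothetical (start, end) pair with 0-based start and exclusive end; in practice these positions would come from a BAM file.

```python
def depth_of_coverage(read_intervals: list[tuple[int, int]], genome_length: int) -> list[int]:
    """Number of reads overlapping each position of the reference."""
    depth = [0] * genome_length
    for start, end in read_intervals:
        # Clip reads to the reference so out-of-range coordinates are ignored.
        for position in range(max(start, 0), min(end, genome_length)):
            depth[position] += 1
    return depth


coverage = depth_of_coverage([(0, 4), (2, 6), (3, 5)], genome_length=8)
for position, depth in enumerate(coverage):
    print(position, "#" * depth)
```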

130
Q

What would 1x coverage make it impossible to do, and what coverage should be used to overcome this?

A

Distinguish between errors and real differences between the sequences. 50x read coverage allows you to make the assumption that any errors are random.

131
Q

What programs can be used to identify common SNPs and small Indels?

A

GATK and SAMtools.

132
Q

GATK and SAMtools can be used to identify common SNPs. How do these programs work?

A

They use a probabilistic model to distinguish real homozygous or heterozygous variants from sequencing errors.

133
Q

What is the role of the program SnpEff?

A

Predicts the effect of a SNP.

134
Q

If a SNP is synonymous what does this mean?

A

It does not change an encoded protein.

135
Q

What is dbSNP used for?

A

It is a database of known SNPs, used to remove commonly found SNPs from the dataset. This is because common variants are less likely to be clinically relevant.

136
Q

What controls need to be included when looking at SNPs?

A

Affected and unaffected individuals. Ideally the controls are related or genetically very similar.

137
Q

How can SNPs be validated?

A

PCR and Sanger sequencing.

138
Q

What can be the cause of random or systemic error in sequencing?

A

When the sequencer outputs the wrong base call.

139
Q

What does a mapping error involve?

A

A mapper placing the read in the incorrect position of the genome.

140
Q

What does sample contamination in sequencing involve?

A

The sample containing DNA with a different sequence from a different source.

141
Q

What does sequence contamination in sequencing involve?

A

Reads from a sample being mislabelled.

142
Q

Errors in sequencing can come from both sequences. True or false?

A

True.

143
Q

Do all changes in nucleotide sequences result in a SNP?

A

No.

144
Q

What are structural variants a result of?

A

Larger scale chromosome rearrangements.

145
Q

Structural variants are a result of larger scale chromosome rearrangements. Name 5 examples of these.

A
  1. Insertions.
  2. Deletions.
  3. Duplications/copy number variants.
  4. Inversions.
  5. Translocations.
146
Q

What can coverage depth allow you to detect?

A

Structural variants in resequencing.

147
Q

What can read pairs be used to detect?

A

Structural variants in resequencing.

148
Q

What can split reads detect?

A

Deletions in the reference.

149
Q

What can show novel insertions?

A

Assemblies.

150
Q

What is perhaps the simplest form of functional genomics?

A

GWAS.

151
Q

What regions of the genome do GWAS identify?

A

Regions associated with a particular phenotype by statistical association.

152
Q

By examining lots of cases in a GWAS, what can be identified?

A

Variants that are over-represented among the cases.

153
Q

What type of plot is used to show the data set from a GWAS?

A

Manhattan plots.

154
Q

Why can GWAS be misleading?

A

Correlation doesn’t always mean causation. Further studies are often needed to establish molecular mechanisms underlying the phenotype.