Genomics and bioinformatics. Flashcards
What is the definition of bioinformatics?
The science of collecting and analysing complex biological data such as the genetic code.
Most technology costs fall roughly in line with Moore’s law, but the cost of genome sequencing fell far faster than that after the development of NGS. What was the cost decrease from 2001 to 2014 for one megabase?
2001 $10k
2014 $0.10
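The size of that drop can be compared with a Moore's-law pace in a few lines. This is only a sketch: the two costs are the figures on the card, and the two-year halving time used for Moore's law is an assumption made for illustration, not something stated on the card.

```python
cost_2001 = 10_000.0  # USD per megabase, figure from the card
cost_2014 = 0.10      # USD per megabase, figure from the card
years = 2014 - 2001


def fold_decrease(start_cost, end_cost):
    """How many times cheaper the end cost is than the start cost."""
    return start_cost / end_cost


def moore_fold(elapsed_years, halving_time=2.0):
    """Fold decrease expected if cost halved every `halving_time` years.

    The two-year halving time is an assumed illustration of Moore's law.
    """
    return 2 ** (elapsed_years / halving_time)


observed = fold_decrease(cost_2001, cost_2014)
expected = moore_fold(years)
print(f"Observed: about {observed:,.0f}-fold cheaper")
print(f"Moore's-law pace: about {expected:,.0f}-fold cheaper")
```

Under these assumptions the observed drop is roughly a thousand times larger than the Moore's-law pace.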
What is the bioinformatics gap?
The problem of storing and analysing all the genetic data produced by NGS.
What two sequencing techniques are often used in health care?
Illumina HiSeq and MiSeq.
What sort of sequencing produces ‘big data’?
NGS.
Does Illumina HiSeq have a better read length or read depth?
Read depth: it produces 8 billion reads of 150 bp in high-output mode.
What method of sequencing now takes up roughly 80% of the market?
Illumina.
What format does a biological file have to be in?
A plain text (.txt) file, NOT a word-processor document.
What is the most common file format for biological data?
FASTA.
What is found on the title line of a FASTA file?
A > character, followed by an identifier and a description.
How many characters can be on each line in a FASTA file?
60, by convention (lines are commonly wrapped at 60 to 80 characters).
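A minimal sketch of writing and reading a FASTA record with the layout described above: a title line starting with >, then sequence lines wrapped at 60 characters. The function names and the dictionary layout are hypothetical, chosen only for this example.

```python
def format_fasta(identifier, description, sequence, width=60):
    """Build one FASTA record: a '>' title line, then wrapped sequence lines."""
    header = f">{identifier} {description}".rstrip()
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body) + "\n"


def parse_fasta(text):
    """Parse FASTA text into a list of dicts (id, description, sequence)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Title line: identifier up to the first space, then description.
            identifier, _, description = line[1:].partition(" ")
            records.append(
                {"id": identifier, "description": description, "sequence": ""}
            )
        elif records:
            # Sequence line: append to the most recent record.
            records[-1]["sequence"] += line
    return records


record_text = format_fasta("seq1", "example sequence", "ACGT" * 40)
print(record_text)
print(parse_fasta(record_text)[0]["id"])
```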
When sequences are compared what can they be described to have?
A % identity.
What is a padding character?
A - inserted to form a ‘gap’, which maximises the sequence identity.
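As a rough illustration, % identity for two sequences that have already been aligned and padded with - can be computed as below. Definitions of % identity differ in what they use as the denominator; this sketch assumes the number of alignment columns, which is one common choice rather than the only one.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue.

    Assumes both inputs are already aligned (equal length, padded with `gap`).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences have a padding character.
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# 4 identical columns out of 6 (one mismatch, one gap).
print(round(percent_identity("ACGT-A", "ACCTTA"), 1))
```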
What is sequence alignment used for?
Used to identify homology between sequences with a common ancestor.
When using - to fill in gaps (padding character) what do you need to assume?
That each sequence is equivalent.
What can sequence alignment be used to do other than work out a % identity (2 things)?
1. Assemble short reads into contigs.
2. Map reads to a reference genome (resequencing).
What can resequencing allow you to identify?
Sequence variants and transcribed regions of the genome, allowing you to quantify transcripts and the level of transcription through RNA-seq.
Would x-- or -x- be more likely?
x--, as its gap is caused by a single event (one gap of length two rather than two separate gaps). Both are still possible, however.
Scoring systems are used in sequence alignments. Matches are given a (+) score. What two events give a (-) score?
Gaps and mismatches.
How do the penalties given to longer gaps differ slightly from the penalties for shorter gaps?
In a longer gap the first - receives the full gap penalty, while each following - receives a slightly smaller penalty because it only extends the gap.
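A sketch of this scoring scheme applied to an existing alignment. The numeric values (match +1, mismatch -1, gap opening -5, gap extension -1) are assumptions picked for illustration; real tools use their own defaults.

```python
def score_alignment(aligned_a, aligned_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two aligned sequences with separate gap-open and gap-extend penalties."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap_a = in_gap_b = False  # are we currently inside a gap in a or in b?
    for a, b in zip(aligned_a, aligned_b):
        if a == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif b == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if a == b else mismatch
            in_gap_a = in_gap_b = False
    return score


# One gap of length two: 3 matches, one opening, one extension.
print(score_alignment("ACGTA", "A--TA"))  # 3 - 5 - 1 = -3
# Two separate gaps of length one: 3 matches, two openings.
print(score_alignment("ACGTA", "A-G-A"))  # 3 - 5 - 5 = -7
```

With these assumed values a single long gap scores better than two short ones, which matches the idea that one insertion or deletion event is more likely than two.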
What are gaps also called in sequence alignment?
INDELs.
What does calculating identities for sequences allow you to do?
Determine evolutionary relationships.
What do sequence alignments allow you to create?
Contigs.
Sequence alignments can be used to map reads to a reference genome. What is this process also called?
Resequencing.
Why would you want to resequence a genome? (3 reasons)
- Identify sequence variants.
- Identify transcribed regions.
- Quantify levels of transcription (RNA-seq).
Alignment is more difficult with protein sequences. Why?
Not all amino acids are equivalent.
Amino acids are not all equivalent when it comes to making alignments. How does this affect the scoring?
Substitutions between similar amino acids (e.g. similar size and charge) are given a smaller negative score.
What are two common scoring matrices?
PAM70 and BLOSUM62.
Who produced the PAM70 matrix and can be classed as the first bioinformatician?
Margaret Dayhoff.
What percentage identity can be used with the PAM70 matrix?
30%.
How many mutations and protein families did Margaret Dayhoff use to make the PAM70 matrix?
1571 mutations and 71 protein families.
Alignments can be global. What does this mean?
An attempt is made to align the sequences across their entire length.
For a global alignment to occur what do both sequences need to be?
Equivalent.
Who proposed the first global alignment algorithm?
Needleman and Wunsch.
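A compact sketch of a Needleman-Wunsch style global alignment. It assumes a simple scoring scheme (match +1, mismatch -1, a flat gap penalty of -2) chosen for illustration, and it returns just one of the possibly several best alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```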
How can local alignments be used to maximise alignment scores?
They search sub-regions of the full sequences rather than their whole length, which results in fewer gaps.
Who proposed the first local alignment algorithm based on the Needleman and Wunsch global alignment algorithm?
Smith and Waterman.
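A sketch of a Smith-Waterman style local alignment. It uses the same kind of table as a global alignment, but cell scores are never allowed to fall below zero and the traceback starts from the highest-scoring cell. The scoring values (match +2, mismatch -1, gap -2) are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best_score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere (the 'local' part).
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Only the shared core region is reported; the unrelated flanks are ignored.
print(smith_waterman("CCCCGATTACACCCC", "TTGATTACATT"))
```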
What is a simple way to visualise sequence alignments?
Dot plots.
How would inversions be visualised in dot plots?
A diagonal line (TR to BL) interrupted by a diagonal segment running in the opposite direction, after which the original diagonal line resumes.
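A text-only sketch of a dot plot. It uses a toy alphabet in which every letter is unique so the pattern is easy to see, and it treats an inversion as a simple reversal of a segment. Both are simplifications; real dot-plot tools usually compare windows of residues rather than single characters.

```python
def dot_plot(seq_x, seq_y):
    """Return one text row per character of seq_y.

    A '*' marks positions where seq_y[row] equals seq_x[column]; '.' otherwise.
    """
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


original = "ABCDEF"
inverted = "ABEDCF"  # the middle segment CDE has been reversed

for row in dot_plot(original, inverted):
    print(row)
```

In the printed grid the matching run changes direction across the reversed segment and then returns to its original direction.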
What is the definition of a multiple alignment?
Alignment of three or more sequences.
Why are multiple alignments considerably harder than pair alignments?
Because the number of possible alignments increases exponentially with every sequence that is added.
What method is applied to reduce the number of alignments examined in a multiple alignment?
A heuristic method.
What does a multiple alignment involve initially?
An initial examination of a guide tree, which gives an indication of the relationships between the sequences.
What sequences are aligned first in a multiple alignment?
The most closely related sequences.
Once the most closely related sequences have been aligned in a multiple alignment, how are they treated?
As a single sequence.
What are four commonly used programmes used in multiple alignments?
Clustal W, T-Coffee, MUSCLE and MAFFT.
What programmes use multiple alignments for their inputs?
Programmes that infer phylogenetic trees.
What does ‘BLAST’ stand for?
Basic Local Alignment Search Tool.
What is BLAST a widely used method for?
Searching a sequence database to rapidly identify sequences similar to a query sequence.
Is BLAST a heuristic or an exact algorithm?
Heuristic.
BLAST is heuristic. What does this mean?
It is not guaranteed to find the best-matching sequence (although it usually does).
What is BLAST basically a fast version of?
Smith and Waterman, although BLAST is a heuristic rather than an exact algorithm.
What does BLAST break the sequence into?
Words of 3 amino acids or 11 nucleotides by default.
What regions are often excluded from BLAST searches and why?
Regions of low complexity (e.g. runs of a single nucleotide or of the same amino acid), as these are relatively uninteresting regions.
How do BLAST searches work?
The database is searched for seed matches to the words in the query; these seeds are then extended into alignments.
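The seed-and-extend idea can be sketched as follows. This is a heavily simplified illustration, not BLAST itself: it only finds exact word matches and extends them while characters are identical, whereas BLAST scores matches and decides when to stop extending using score thresholds. All function names here are hypothetical.

```python
from collections import defaultdict


def words(sequence, size):
    """All overlapping words of length `size`, with their start positions."""
    return [(i, sequence[i:i + size]) for i in range(len(sequence) - size + 1)]


def find_seeds(query, subject, size=3):
    """Positions (query_pos, subject_pos) where a query word occurs in the subject."""
    index = defaultdict(list)
    for pos, word in words(subject, size):
        index[word].append(pos)
    return [(q_pos, s_pos)
            for q_pos, word in words(query, size)
            for s_pos in index.get(word, [])]


def extend_seed(query, subject, q_pos, s_pos, size=3):
    """Extend a seed left and right without gaps while characters match exactly."""
    left = 0
    while (q_pos - left > 0 and s_pos - left > 0
           and query[q_pos - left - 1] == subject[s_pos - left - 1]):
        left += 1
    right = size
    while (q_pos + right < len(query) and s_pos + right < len(subject)
           and query[q_pos + right] == subject[s_pos + right]):
        right += 1
    return query[q_pos - left:q_pos + right]


query, subject = "MKVLAAG", "XXKVLAAZZ"
seeds = find_seeds(query, subject)
print(seeds)
print(extend_seed(query, subject, *seeds[0]))
```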
What does each BLAST hit have?
One or more HSPs (high-scoring segment pairs). These are local alignments with a score above a particular threshold.
What are the basic files produced by DNA sequencers called?
FASTQ files.
What two things do FASTQ files include?
1. Sequence data.
2. A measure of the sequence quality.
What two features does a FASTQ header line have?
It begins with an @ sign and contains a unique read identifier.
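A minimal sketch of reading FASTQ records. It assumes the common four-line layout (header, sequence, a line starting with +, quality string) and assumes quality characters are encoded as ASCII code minus 33; neither detail is stated on the cards, and older files may use a different offset.

```python
def parse_fastq(text, offset=33):
    """Parse four-line FASTQ records into dicts (id, sequence, qualities).

    `offset` is the assumed ASCII offset of the quality encoding.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    records = []
    # Step through the file four lines at a time; an incomplete tail is ignored.
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        read_id = header[1:].partition(" ")[0]
        records.append({
            "id": read_id,
            "sequence": sequence,
            "qualities": [ord(ch) - offset for ch in quality],
        })
    return records


example = "@read1 lane=1\nACGT\n+\nII5!\n"
for record in parse_fastq(example):
    print(record["id"], record["sequence"], record["qualities"])
```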
During DNA sequencing each base is given a quality score. What can this score range between?
1-99.
During DNA sequencing each base is given a quality score ranging from 1-99. What does the score normally max out at?
60.
How are the quality scores assigned to each base calculated?
Q = -10 log10(P), where P is the probability that the base call is an error.
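A short sketch of converting between an error probability and a quality score, using the standard Phred relationship Q = -10 log10(P). The function names are hypothetical.

```python
import math


def phred_from_error(p_error):
    """Quality score for a given probability that the base call is wrong."""
    return -10 * math.log10(p_error)


def error_from_phred(quality):
    """Probability that the base call is wrong for a given quality score."""
    return 10 ** (-quality / 10)


for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```

Each increase of 10 in the score corresponds to a tenfold drop in the error probability, so a score of 30 means roughly one error in a thousand base calls.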