Genomics and bioinformatics. Flashcards
What is the definition of bioinformatics?
The science of collecting and analysing complex biological data such as the genetic code.
Most technology costs fall in line with Moore’s law, but the cost of genome sequencing fell much faster after the development of NGS. What was the decrease in the cost of sequencing one megabase from 2001 to 2014?
2001 $10k
2014 $0.10
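A quick worked check of the figures on this card, as a Python sketch. The two costs are taken from the card; the "halving every 2 years" comparison is an assumed, simplified reading of Moore's law used only for scale.

```python
# Figures from the card above: cost in US dollars to sequence one megabase.
cost_2001 = 10_000.00
cost_2014 = 0.10

fold_decrease = cost_2001 / cost_2014

# Assumption for comparison only: a Moore's-law-style halving every 2 years.
years = 2014 - 2001
moore_fold = 2 ** (years / 2)

print(f"Observed decrease: about {fold_decrease:,.0f}-fold")
print(f"Halving every 2 years would give about {moore_fold:,.0f}-fold")
```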
What is the bioinformatics gap?
The problem of storing and analysing all the genetic data produced by NGS.
What two sequencing techniques are often used in health care?
Illumina HiSeq and MiSeq.
What sort of sequencing produces ‘big data’?
NGS.
Does Illumina HiSeq have a better read length or read depth?
Read depth: it produces 8 billion reads of 150 bp in high output mode.
What method of sequencing now takes up roughly 80% of the market?
Illumina.
What format does a biological file have to be in?
A plain text (.txt) file, NOT a word processor document.
What is the most common file format for biological data?
FASTA.
What is found on the title line of a FASTA file?
> , identifier, description.
How many characters can be on each line in a FASTA file?
60.
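As a rough illustration of the FASTA layout described in the cards above, the sketch below writes and reads one record in Python. The function names `format_fasta` and `parse_fasta` and the example record are made up for illustration, and the 60-character width is the figure from the card, treated here as a default rather than a strict rule.

```python
def format_fasta(identifier, description, sequence, width=60):
    """Return one FASTA record: a '>' title line, then wrapped sequence lines."""
    title = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([title] + lines) + "\n"


def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Title line: the first word is the identifier, the rest is the description.
            parts = line[1:].split(maxsplit=1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            records.append([identifier, description, ""])
        elif records:
            # Sequence line: append to the most recent record.
            records[-1][2] += line
    return [tuple(r) for r in records]


record = format_fasta("seq1", "hypothetical example", "ACGT" * 40)
print(record)
```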
When sequences are compared what can they be described to have?
A % identity.
What is a padding character?
A - that is inserted to form a ‘gap’; this maximises the sequence identity.
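A minimal Python sketch of how a padding character changes % identity. The function name `percent_identity` and the toy sequences are illustrative, and identity is assumed here to be counted over all alignment columns (other definitions exist).

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the columns of a pairwise alignment.

    Both strings must already be aligned (same length, '-' for gaps).
    Assumption: identity is counted over all alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Gap placed at the end: the sequences are out of register after the deletion.
print(percent_identity("ACGTACGT", "ACTACGT-"))
# Gap placed at the deletion: the register is restored and identity rises.
print(percent_identity("ACGTACGT", "AC-TACGT"))
```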
What is sequence alignment used for?
Used to identify homology between sequences with a common ancestor.
When using - to fill in gaps (padding character) what do you need to assume?
That each sequence is equivalent.
What can sequence alignment be used to do other than work out a % identity (2 things)?
- Assemble short reads into contigs.
- Map reads to a reference genome (resequencing).
What can resequencing allow you to identify?
Sequence variants and transcribed regions of the genome, allowing you to quantify transcripts and the level of transcription through RNA-seq.
Would x-- or -x- be more likely?
x--, as it is caused by a single event. Both are still possible, however.
Scoring systems are used in sequence alignments. Matches are given a (+) score. What two events give a (-) score?
Gaps and mismatches.
How do the penalties given to longer gaps differ slightly from the penalties for shorter gaps?
In a longer gap the first (-) gets the full gap penalty, while each following (-) can get a slightly smaller penalty as it only extends the gap.
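The scoring cards above can be sketched in Python as follows. The function name `score_alignment` and the penalty values are arbitrary choices for illustration, not values from any particular program, and consecutive gap columns are treated as one gap regardless of which sequence they are in.

```python
def score_alignment(aligned_a, aligned_b, match=1, mismatch=-1,
                    gap_open=-3, gap_extend=-1):
    """Score a pairwise alignment: matches add, mismatches and gaps subtract.

    The first '-' of a gap costs gap_open; each following '-' costs the
    smaller gap_extend, because it only extends the existing gap.
    """
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# One gap of length two (a single event) ...
print(score_alignment("ACGTTA", "AC--TA"))
# ... scores better than two separate gaps of length one.
print(score_alignment("ACGTTA", "A-GT-A"))
```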
What are gaps also called in sequence alignment?
INDELs.
What does calculating identities for sequences allow you to do?
Determine evolutionary relationships.
What do sequence alignments allow you to create?
Contigs.
Sequence alignments can be used to map reads to a reference genome. What is this process also called?
Resequencing.
Why would you want to resequence a genome? (3 reasons)
- Identify sequence variants.
- Identify transcribed regions.
- Quantify levels of transcription (RNA-seq).
Alignment is more difficult with protein sequences. Why?
Not all amino acids are equivalent.
Amino acids are not all equivalent when it comes to making alignments. How does this affect the scoring?
Equivalent amino acids (i.e. same size and charge) are given less of a negative score.
What are two common scoring matrices?
PAM70 and BLOSUM62.
Who produced the PAM70 matrix and can be classed as the first bioinformatician?
Margaret Dayhoff.
What percentage identity can be used on the PAM70 matrix?
30%.
How many mutations and protein families did Margaret Dayhoff use to make the PAM70 matrix?
1571 mutations and 71 protein families.
Alignments can be global. What does this mean?
An attempt is made to align the sequences across their entire length.
For a global alignment to occur what do both sequences need to be?
Equivalent.
Who proposed the first global alignment algorithm?
Needleman and Wunsch.
How can local alignments be used to maximise alignment scores?
They search subsequences of the full sequences rather than their whole length, which results in fewer gaps.
Who proposed the first local alignment algorithm based on the Needleman and Wunsch global alignment algorithm?
Smith and Waterman.
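To make the global versus local distinction concrete, here is a small Python sketch of the dynamic programming idea behind both approaches. It only returns the best score (no traceback), uses a simple linear gap penalty, and the function name `alignment_score` and the scoring values are illustrative assumptions.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False fills the table in the Needleman-Wunsch (global) style.
    local=True uses the Smith-Waterman (local) variant: cell scores never
    drop below zero and the best cell anywhere in the table is reported.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: gaps at the start of either sequence are penalised.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,  # align the two residues
                       table[i - 1][j] + gap,       # gap in b
                       table[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


print(alignment_score("AAAAACGT", "ACGT"))              # global
print(alignment_score("AAAAACGT", "ACGT", local=True))  # local
```

In this toy case the global score is dragged down by the four gaps needed to span the longer sequence, while the local score only reflects the matching ACGT region.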
What is a simple way to visualise sequence alignments?
Dot plots.
How would inversions be visualised in dot plots?
Diagonal line (TR to BL) intersected by a diagonal line going the opposite direction, followed by the original diagonal line.
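A dot plot can be sketched in a few lines of Python as a text grid. The function name `dot_plot` is illustrative, and real tools usually compare short windows rather than single residues to reduce noise.

```python
def dot_plot(seq_x, seq_y):
    """Return a text dot plot: '*' where the residues match, '.' elsewhere.

    seq_x runs along the columns and seq_y down the rows.
    """
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


# Two identical sequences give a single unbroken diagonal.
for row in dot_plot("ACGT", "ACGT"):
    print(row)
```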
What is the definition of a multiple alignment?
Alignment of three or more sequences.
Why are multiple alignments considerably harder than pair alignments?
Because the number of possible alignments increases exponentially with every sequence that is added.
What method is applied to reduce the number of sequences examined in a multiple alignment?
Heuristic.
What does a multiple alignment involve originally?
An initial examination of a guide tree which gives an indication of the relationship between the species.
What sequences are aligned first in a multiple alignment?
The most closely related sequences.
Once the most closely related sequences have been aligned in a multiple alignment how are they treated?
As a single sequence.
What are four commonly used programmes used in multiple alignments?
Clustal W, T-Coffee, MUSCLE and MAFFT.
What programmes use multiple alignments for their inputs?
Programmes that infer phylogenetic trees.
What does ‘BLAST’ stand for?
Basic Local Alignment Search Tool.
What is BLAST a widely used method for?
Searching a sequence database to rapidly identify sequences similar to a query sequence.
Is BLAST heuristic or an algorithm?
Heuristic.
BLAST is heuristic. What does this mean?
It is not guaranteed to identify the best sequence (although it usually does.)
What is BLAST basically a fast version of?
Smith and Waterman, although it is a heuristic rather than an exact algorithm.
What does BLAST break the sequence into?
Words of 3 amino acids or 11 nucleotides by default.
What regions are often excluded from BLAST searches and why?
Regions of low complexity (e.g. regions that are made from a single nucleotide/ same amino acid) as these are relatively uninteresting regions.
How do BLAST searches work?
The BLAST database is searched for seed alignments that match the words in the query; these are then extended.
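The word and seed idea from the cards above can be sketched in Python like this. It is a simplification: only exact word matches are found and no extension step is shown, and the function name `find_seeds` and the toy sequences are made up for illustration.

```python
from collections import defaultdict


def find_seeds(query, subject, word_size=3):
    """Return (query_position, subject_position) pairs for exact word matches."""
    # Index every word of the subject (database) sequence by its position.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)
    # Look up each word of the query in that index.
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


print(find_seeds("MKVLA", "GGMKVLAGG"))
```

Seeds that fall on the same diagonal (the same difference between the two positions) are the natural candidates to extend into a longer local alignment.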
What does each BLAST hit have?
One or more HSPs (high-scoring segment pairs). These are local alignments with a score above a particular threshold.
What are the basic files produced by DNA sequencers called?
Fastq files.
What two things do Fastq files include?
- Sequence data.
- A measure of the sequence quality.
What two features does a Fastq header line have?
Begin with an @ sign and contain a unique read identifier.
During DNA sequencing each base is given a quality score. What can this score range between?
1-99.
During DNA sequencing each base is given a quality score ranging from 1-99. What does the score normally max out at?
60.
How are the quality scores assigned to each base calculated?
Q = -10 × log10(P), where P is the probability of an error.
What does a quality score of 10 mean?
1 in 10 chance that the base is wrong.
What does a quality score of 20 mean?
1 in 100 chance that the base is wrong.
What does a quality score of 30 mean?
1 in 1000 chance that the base is wrong.
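The three quality score cards above follow from the Phred relationship Q = -10 × log10(P). A short Python sketch, with illustrative function names:

```python
import math


def phred_quality(error_probability):
    """Quality score Q for a given probability that the base call is wrong."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Probability that the base call is wrong for a given quality score Q."""
    return 10 ** (-quality / 10)


for q in (10, 20, 30):
    print(f"Q{q}: 1 in {round(1 / error_probability(q))} chance the base is wrong")
```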
What do separator lines in Fastq files contain?
+.
How are quality scores stored?
In Fastq files as a series of letters, digits and punctuation. 33 is added to each score and the result is stored as the ASCII character with that code.
Do letters or punctuation show a good quality score?
Letters. Punctuation shows a bad quality score.
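A minimal Python sketch of reading one Fastq record and decoding its quality line, assuming the +33 offset described above. The function names and the example read are hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a Fastq quality line back into numeric quality scores."""
    return [ord(character) - offset for character in quality_string]


def parse_fastq_record(lines):
    """Split a four-line Fastq record into (read_id, sequence, qualities)."""
    header, sequence, separator, quality = lines
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a Fastq record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lines differ in length")
    return header[1:], sequence, decode_qualities(quality)


# Hypothetical read: punctuation encodes low scores, letters encode high ones.
print(parse_fastq_record(["@read1", "ACGTA", "+", "!+5?I"]))
```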
After Illumina sequencing how many Fastq reads do you get?
2.
What are 7 problems of whole genome shotgun sequencing?
- Whole sequence unknown.
- Coverage bias.
- Sequencing errors.
- Repeats.
- Multiple replicons.
- Contamination.
- Circular genomes.
Why are circular genomes a problem in whole genome sequencing?
As there is no obvious start and end and the whole thing overlaps.
What is the biggest issue for shotgun sequencing?
Repeats.
How can contamination be a problem in sequencing?
Two samples can be loaded into one tube.
When can repeats not be sequenced and what could solve this?
When they are longer than the inserts, as the ends are then unknown. This could be solved by long read technologies.
Name two examples of long read technologies.
PacBio and Oxford Nanopore.
What is a common approach to assemble short reads while taking into account sequencing errors and repeats?
deBruijn.
What do the bubbles in a deBruijn graph represent?
Repeats.
What is one of the key functions of sequencing software?
Resolving bubbles in deBruijn graphs.
What route is taken in a deBruijn graph?
The longest route.
What are the fixed length chunks of varying size on a deBruijn graph called?
k-mers.
What additional information does genome sequencing software use to resolve bubbles on a deBruijn graph?
Coverage levels and paired reads.
What happens if a bubble on a deBruijn graph cannot be solved?
There is a break in the assembly.
As a read gets longer what happens to a deBruijn graph?
It gets simpler.
What does a circular deBruijn graph mean?
The whole genome is essentially a repeat.
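A small Python sketch of how a deBruijn graph can be built from reads. Here each k-mer becomes an edge between its first and last (k-1) bases, which is one common construction; the function name `de_bruijn_graph` and the toy reads are illustrative, and real assemblers add error correction and bubble resolution on top.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping reads collapse into one unbranched path.
print(de_bruijn_graph(["ACGTC", "CGTCA"], k=3))
# Reads that disagree at one base create a branch at node 'CG'.
print(de_bruijn_graph(["ACGTC", "ACGAC"], k=3))
```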
How many reading frames are there in double-stranded DNA?
6.
Why are there 3 reading frames per DNA strand?
As the genetic code is a triplet code.
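The count of six can be seen by splitting a sequence into codons at each of the three offsets on both strands. A Python sketch, with illustrative function names and a toy sequence:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Split a DNA sequence into codons for all six reading frames."""
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Step through the strand three bases at a time from this offset.
            codons = [sequence[i:i + 3]
                      for i in range(offset, len(sequence) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGAAATAG").items():
    print(name, codons)
```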
Are stop codons AT or GC rich?
AT.
What BLAST program is used to look up nucleic acids in the nucleic acid database?
blastn.
What BLAST program is used to look up conceptual protein translations in the conceptual protein translation database?
tblastx.
What BLAST program is used to look up conceptual protein translations in the peptide protein database?
blastx.
What BLAST program is used to look up peptide proteins in the conceptual protein translation database?
tblastn.
What BLAST program is used to look up peptide proteins in the peptide protein database?
blastp.
How can large genes be identified in bacterial/archaeal species?
Through the presence of long ORFs.
Is it easier to identify long genes in bacterial/archaeal species in GC or AT rich genomes?
AT.
What three conserved sequences can indicate the start of a gene in bacterial/ archaeal species?
AUG start codon, Pribnow box, Shine-Dalgarno sequence.
Why do coding sequences within genes often have a characteristic base composition?
Due to bias in codon usage.
Name a gene finding program and explain the basic approach that it uses.
Glimmer: it identifies patterns in the genome to find the longest ORFs and uses these to identify shorter genes.
Why is it harder to find eukaryotic genes in the genome?
As they do not contain long uninterrupted ORFs, because the coding sequence is broken up by introns.
What can identify the start of a sequence in eukaryotic genes?
Kozak sequence.
What two consensus sequences can identify the end of a eukaryotic gene?
Terminator consensus and the poly(A) signal.
What can indicate the presence of an intron in a eukaryote?
Donor and acceptor sites.
Why is gene prediction in eukaryotes hit and miss?
Alternative splicing patterns.
What is the role of the software GeneMark and is it a good piece of software?
To identify genes in eukaryotes despite them having alternative splicing patterns. Accuracy of this software is limited.
What experimental data can be used to improve gene prediction in eukaryotes?
EST sequences or RNA seq data.
Once a gene has been predicted how is its function often identified?
Homologous genes can be found using BLAST and the function of the gene can be inferred.
Common domains in predicted protein sequences can be identified using software such as _______ against databases such as _______.
HMMer, Pfam.
HMMer is a software package used to predict common domains within protein sequences. What other functional elements can be identified with similar software packages?
tRNA, rRNA, ncRNA genes and common repeat elements.
Pipelines such as ______ in bacteria/ archaea and _____ in eukaryotes combine gene finding and gene annotating approaches rapidly allowing for the prediction of genes and a functional annotation.
Prokka, MAKER
Apart from resolving repeat regions, what is an advantage of using deBruijn graphs?
That the k-mers can be stored in memory, so they require less RAM.
Removal of what can correct sequencing errors?
Rare k-mers.
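A hedged Python sketch of the rare k-mer idea: k-mers seen only once are treated as likely sequencing errors and dropped. The function names, the toy reads and the `min_count` threshold of 2 are illustrative assumptions; real tools choose the cut-off from the observed coverage.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_count=2):
    """Keep only k-mers seen at least min_count times; rarer ones are dropped."""
    return {kmer for kmer, n in count_kmers(reads, k).items() if n >= min_count}


# Three correct copies of a read and one copy with a single base error (G to C).
reads = ["ACGTA", "ACGTA", "ACGTA", "ACCTA"]
print(sorted(solid_kmers(reads, k=3)))
```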
What can be used to solve bubbles on deBruijn graphs?
Paired-end reads.
What does resequencing a genome allow you to do?
Investigate genetic variation within a population of a species.
What three things can be studied through resequencing the human genome?
- Single gene disorders.
- Complex gene disorders.
- Cancer.
Resequencing can identify variants. What could this possibly be useful for?
Personalised medicine and genome editing methods such as CRISPR/Cas9.
What two functional genome technologies also rely on sequencing?
RNA-seq and ChIP-seq.
How are Illumina sequencing reads stored?
Fastq files.
Many programs are available which will map short reads to a reference genome. What are the two most common?
- BWA (Burrows Wheeler Aligner).
- Bowtie 2.
What genome reference software allows a reference genome to be stored in a computer memory and efficiently searched?
BWA.
How much RAM does a computer need to search and store the reference genome?
2GB.
How are short identical matches found when comparing a sample to a reference genome?
Each read is compared with an index of the reference genome to identify short identical matches called seed sequences. The alignments from the seed sequences are then extended to include the rest of the read.
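The seed-and-extend steps above can be sketched in Python as follows. This is a toy version: it uses a plain string search in place of a real genome index, takes the seed from the start of the read only, and extends without gaps. The function name `map_read` and the sequences are illustrative.

```python
def map_read(read, reference, seed_length=4):
    """Return (position, mismatches) for the best ungapped placement, or None.

    Assumption: the seed is the first seed_length bases of the read and must
    match the reference exactly; the rest of the read is then compared base
    by base from each seed hit.
    """
    seed = read[:seed_length]
    best = None
    start = reference.find(seed)
    while start != -1:
        window = reference[start:start + len(read)]
        if len(window) == len(read):
            # Extend the seed across the whole read and count mismatches.
            mismatches = sum(1 for r, g in zip(read, window) if r != g)
            if best is None or mismatches < best[1]:
                best = (start, mismatches)
        start = reference.find(seed, start + 1)
    return best


# The seed ACGT occurs twice; only the second hit extends without mismatches.
print(map_read("ACGTCA", "TTACGTGGACGTCATT"))
```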
What is used to identify the best mapping position?
An alignment scoring system.
What is included in the alignment scoring system?
- Number of matches
- Number of mismatches
- Base qualities
Not all mismatches are equally penalised in the alignment scoring system. How is this the case?
Mismatches at lower quality bases are penalised less than mismatches at higher quality bases.
What is often mapped independently?
Paired reads. The best mapping position also takes these into account though.
Each mapped read is given a mapping score. What does this score show?
The confidence that the read is derived from that position in the genome.
What reads have low mapping scores?
Ambiguously mapped reads, e.g. reads from repeat regions
What files are mapped reads usually stored in?
BAM files.
Mapped reads are stored in BAM files. What information is contained in these?
Details of which position on which chromosome the read is mapped to.
What is the ‘depth of coverage’?
The number of reads that overlap a particular position.
What is a ‘pile up plot’?
A bar chart showing the variation in depth of coverage.
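Depth of coverage and a crude text pile up plot can be sketched in Python like this. The function name `depth_of_coverage`, the (start, length) read representation and the toy reads are assumptions for illustration; real tools read these positions from BAM files.

```python
def depth_of_coverage(reference_length, mapped_reads):
    """Count how many reads overlap each reference position.

    mapped_reads is a list of (start, length) pairs with 0-based starts.
    """
    depth = [0] * reference_length
    for start, length in mapped_reads:
        for position in range(start, min(start + length, reference_length)):
            depth[position] += 1
    return depth


# Three hypothetical 5 bp reads mapped to a 10 bp reference.
depth = depth_of_coverage(10, [(0, 5), (2, 5), (4, 5)])
for position, count in enumerate(depth):
    print(f"{position:>2} {'#' * count}")
```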
What would 1 times coverage make it impossible to do, and what coverage should be used to overcome this?
Distinguish between sequencing errors and real differences between the sequences. A 50 times read coverage allows you to assume that any errors are random.
What programs can be used to identify common SNPs and small Indels?
GATK and SAMtools.
GATK and SAMtools can be used to identify common SNPs. How do these programs work?
They use a probabilistic model to distinguish real homozygous or heterozygous variants from sequencing errors.
What is the role of the program SnpEff?
Predicts the effect of a SNP.
If a SNP is synonymous what does this mean?
It does not change an encoded protein.
What is dbSNP used for?
It is a database of known SNPs that is used to remove commonly found SNPs from the dataset. This is because common variants are less likely to be clinically relevant.
What controls need to be included when looking at SNPs?
Affected and non affected individuals. Ideally controls are related or are very genetically similar.
How can SNPs be validated?
PCR and Sanger.
What can be the cause of random or systematic error in sequencing?
When the sequencer outputs the wrong base call.
What does a mapping error involve?
A mapper placing the read in the incorrect position of the genome.
What does sample contamination in sequencing involve?
The sample containing DNA with a different sequence from a different source.
What does sequence contamination in sequencing involve?
Reads from a sample being mislabelled.
Errors in sequencing can come from both sequences. True or false?
True.
Do all changes in nucleotide sequences result in a SNP?
No.
What are structural variants a result of?
Larger scale chromosome rearrangements.
Structural variants are a result of larger scale chromosome rearrangements. Name 5 examples of these?
- Insertions.
- Deletions.
- Duplications/copy number variants.
- Inversions.
- Translocations.
What can coverage depth allow you to detect?
Structural variants in resequencing.
What can read pairs be used to detect?
Structural variants in resequencing.
What can split reads detect?
Deletions in the reference.
What can show novel insertions?
Assemblies.
What is perhaps the simplest form of functional genomics?
GWAS.
What regions of the genome do GWAS identify?
Regions associated with a particular phenotype by statistical association.
Through examining lots of cases in a GWAS, what can be identified?
Variants that are over represented in some cases.
What type of plot is used to show the data set for a GWAS?
Manhattan plots.
Why can GWAS be misleading?
Correlation doesn’t always mean causation. Further studies are often needed to establish molecular mechanisms underlying the phenotype.