Lecture 6- An Introduction to clinical bioinformatics Flashcards
What is bioinformatics?
-combination of statistics, biology and computer science
How is the genome sequenced for bioinformatics?
- shred them, read each piece and reconstruct the story
- clinical bioinformatics=
- sequencing= shred the DNA, sequence the pieces, put them back together, then analyse where the typos are
- with enough repeats, some pieces are really easy to get
- some parts of the genome are really easy to get
- some are repetitive, so it is hard to find where they go
- some pieces missing
What are some of the challenges in clinical bioinformatics?
-Bioinformatic (statistical and computational) challenges arise in:
– the sequencing
– the reference
– the analysis
– the interpretation
-Understanding these challenges and incorporating them into the analysis is the job of bioinformatics
-Making it fit for routine health care, supporting diagnostics and clinicians is the job of clinical bioinformatics
What is clinical bioinformatics about?
-using bioinformatics in a clinical setting and making it proven, predictable and standardised to use
What are some of the areas that input into bioinformatics?
-patients, diagnostics, ethics, pathology, IT, IP, hospitals, sequencing laboratories, programmers, clinicians, geneticists
What is sequenced in bioinformatics?
-the exomes
-Variants in the exome or coding regions are most interpretable
-we sequence the exome= the coding part, the bits that we can understand
-coding bits are about 1% of the genome
1. Sequencing: break the DNA into bits (the exome)
-sequence the reads (several)
-the reads are stored in a FASTQ file
-the file also tells you about the quality of the data
2. Raw data: the FASTQ file is the raw sequencing file – bases and qualities
3. Align the reads (BAM file): align the reads to the reference genome
-BAM file= aligned reads
-reads mapped to the genome
4. Data cleaning and control: quality control and reproducibility are imperative for clinical analysis and must be done on a large scale. Remove:
– Reads that align to multiple positions
– Reads with poor quality
– Duplicated reads
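The three removal rules above can be sketched in a few lines. This is only an illustration: the `Read` record, the `clean_reads` helper and both thresholds are hypothetical, and real pipelines work on BAM files with dedicated tools. The sketch assumes that reads aligning to multiple positions show up with a low mapping quality, and it treats reads with the same chromosome, start position and strand as duplicates, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical, simplified view of one aligned read."""
    name: str
    chrom: str
    pos: int            # leftmost aligned position
    strand: str         # "+" or "-"
    mapq: int           # mapping quality (assumed low for multi-mapped reads)
    mean_base_q: float  # mean Phred base quality of the read


def clean_reads(reads, min_mapq=20, min_base_q=20.0):
    """Drop ambiguously mapped, poor-quality and duplicated reads.

    Thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    seen = set()
    for r in reads:
        if r.mapq < min_mapq:            # likely aligns to multiple positions
            continue
        if r.mean_base_q < min_base_q:   # poor base quality
            continue
        key = (r.chrom, r.pos, r.strand)
        if key in seen:                  # same start and strand: duplicate
            continue
        seen.add(key)
        kept.append(r)
    return kept


example_reads = [
    Read("r1", "chr1", 100, "+", 60, 35.0),
    Read("r2", "chr1", 100, "+", 60, 34.0),  # duplicate of r1
    Read("r3", "chr1", 200, "+", 0, 36.0),   # ambiguous mapping
    Read("r4", "chr1", 300, "-", 60, 10.0),  # poor quality
    Read("r5", "chr1", 100, "-", 60, 30.0),  # same position, other strand
]

print([r.name for r in clean_reads(example_reads)])
```

The first read seen at a position is the one kept; a real duplicate-marking step would normally keep the best-quality copy instead.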
Are exons covered evenly?
-no
-Many reasons for varying coverage – limits clinical utility
Coverage of 20 reads (20X) is considered ‘usable’ or sufficient
There are many poorly covered exons (<20X)
There are many gaps in exons (0X)
Coverage is often related to:
– Mappability & GC
– Repetitive regions & blackspots
– Capture “baits”
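To make the 20X and 0X ideas concrete, here is a minimal sketch that summarises per-base read depth over one exon. The function name `exon_coverage_summary`, the exon labels and the depth values are made up for illustration; in practice the depths would come from the aligned reads.

```python
def exon_coverage_summary(depths, min_depth=20):
    """Summarise per-base read depths across a single exon.

    depths: list of read depths, one value per base of the exon.
    min_depth: depth treated as 'usable' (20X in these notes).
    """
    if not depths:
        raise ValueError("exon has no bases")
    n = len(depths)
    return {
        "mean_depth": sum(depths) / n,
        "frac_at_min_depth": sum(d >= min_depth for d in depths) / n,
        "gap_bases": sum(d == 0 for d in depths),  # 0X positions
    }


# Hypothetical exons with made-up depths
exons = {
    "EXON_A": [30, 32, 35, 31],
    "EXON_B": [25, 18, 0, 0],
}

for name, depths in exons.items():
    summary = exon_coverage_summary(depths)
    well_covered = summary["frac_at_min_depth"] == 1.0
    print(name, summary, "ok" if well_covered else "poorly covered")
```

Reporting the fraction of bases at 20X, rather than only the mean, shows when a well-covered exon still hides a gap.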
Are large deletions and amplifications challenging with exomes?
SNP arrays are still useful for large deletions & amplifications
Coverage varies across the genome
Coverage varies between genomes by chance
Gaps between exons make detecting structural variants difficult
–this is what they look for= deletions in the sequence, e.g. in intellectual disability and cancer
-where there is a coverage gap: is this special to this person? It may be evidence that there is a deletion in that person
= difficult to say, as coverage varies within and between people
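One way to ask "is this coverage drop special to this person?" is to compare the patient's depth per exon with a set of other samples. The sketch below is an assumption-laden illustration, not a method from the lecture: it scales each sample by its total depth, takes the cohort median as the expected value, and flags exons whose ratio falls under a hypothetical cut-off of 0.7. All names and numbers are invented.

```python
from statistics import median


def normalise(depths):
    """Scale exon depths by the sample total to correct for sequencing depth."""
    total = sum(depths.values())
    return {exon: d / total for exon, d in depths.items()}


def coverage_ratios(patient, cohort):
    """Ratio of patient to cohort-median normalised depth, per exon."""
    p = normalise(patient)
    cohort_norm = [normalise(c) for c in cohort]
    ratios = {}
    for exon in p:
        expected = median(c[exon] for c in cohort_norm)
        ratios[exon] = p[exon] / expected if expected > 0 else float("nan")
    return ratios


# Made-up mean depths for four exons in three reference samples
cohort = [
    {"E1": 100, "E2": 100, "E3": 100, "E4": 100},
    {"E1": 110, "E2": 90, "E3": 100, "E4": 100},
    {"E1": 95, "E2": 105, "E3": 100, "E4": 100},
]
# Patient with roughly half the expected depth over E3
patient = {"E1": 100, "E2": 100, "E3": 50, "E4": 100}

ratios = coverage_ratios(patient, cohort)
candidates = [exon for exon, r in ratios.items() if r < 0.7]  # assumed cut-off
print(ratios)
print("possible deletion:", candidates)
```

With real data the spread between samples is much noisier than in this toy cohort, which is why such calls are hard and why SNP arrays remain useful.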
What does local re-alignment do?
-Local re-alignment to detect small indels
-Errors can be removed by fixing up alignments
-Different conventions for ‘calling’ or defining indels make comparing them between genomes difficult
-Commonly used variant callers (e.g. GATK) are poor at detecting indels longer than 10-20 bp
What are the quality scores (Phred)?
- In our pipeline, variant genotype quality must be Q>5 (error probability below ~32%, i.e. ~68% accuracy)
- Sequencing base qualities
- Most sequencing errors are near the end of the read
- The scores in the plot above are summarised across all the reads (100 bp) that are output from the sequencing machine for a single sample
- This profile should be consistent between sequencing experiments
- this is due to the machine: it accumulates errors as it reads along the bases, so quality is good at the beginning of a read and then deteriorates
- Mapping qualities
- Variant call qualities
- Calling variants (vcf file)
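A Phred score Q relates to the error probability p by Q = -10 log10(p), so Q20 means a 1 in 100 chance of error. The sketch below converts scores to probabilities and decodes a FASTQ quality string. It assumes the common Phred+33 encoding (ASCII code minus 33); older files may use a different offset. The function names and the example string are illustrative.

```python
def phred_to_error_prob(q):
    """Error probability for a Phred score: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_fastq_qualities(qual_string, offset=33):
    """Decode a FASTQ quality line into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual_string]


# Made-up quality line: good at the start of the read, worse at the end
scores = decode_fastq_qualities("II5+#")
print(scores)

for q in (5, 20, 30):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {1 - p:.2%}")
```

Q5 comes out at an error probability of roughly 0.32, which shows how permissive a Q>5 threshold is.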
What is variant quality?
Is it a real variant, what to look out for, what standards should be set?
Quality score
– base qualities, coverage
– number of reads with the variant
Sequence context
– beginning or end of reads
– near indels or homopolymers
– Blacklisted, GC content, strand bias
– Seen before in the cohort (technical artifact)
Set permissive standards and risk getting false positives?
Or, set conservative standards and risk more false negatives?
What is variant annotation?
-What is the variant, has it been seen before and what might its effect be?
Chromosome, position, gene
Transcript, codon change, amino acid change
Variant quality, in-silico effect predictions, conservation
Observed frequency in population and disease databases
The variant is interpreted by its effect on the protein made from each transcript associated with the gene. We can choose the transcript expected to be affected most, but...
It also matters how confident you are that the transcript is relevant;
for this we rely on transcript annotations (RefSeq or Ensembl) and
typically choose the transcript commonly used in the literature and disease specific databases
What are the types of coding variants?
-SNV is a single nucleotide variant
Can be synonymous or non-synonymous (missense)
-Indel variant is an insertion or deletion of nucleotides
Can be in-frame or out-of-frame (frameshift)
-These variants can affect splice sites, and can create (or remove) a start codon or a stop codon (nonsense)
-Frameshifts usually introduce a downstream stop codon, are called protein truncating variants (PTV) and can cause loss of function (LOF)
-up= the references
-synonymous mutation= the codon changed but the amino acid did not
-non-synonymous or missense variant= the amino acid changed
-frameshift= the reading frame is a window of 3 bases, so deleting 2 bases shifts everything after it= big change!
-stop codon introduced= nonsense
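The variant types above can be checked with the standard genetic code. This is a minimal sketch: `classify_snv` compares a reference codon with the changed codon, and `classify_indel` only looks at whether the length change is a multiple of 3. Both helper names are illustrative. Splice-site and start-codon effects need transcript context and are not handled here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G; "*" = stop
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_snv(ref_codon, alt_codon):
    """Classify a single-base change inside a codon."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # codon changed, amino acid did not
    if alt_aa == "*":
        return "nonsense"     # stop codon introduced
    if ref_aa == "*":
        return "stop-loss"    # stop codon removed
    return "missense"         # amino acid changed


def classify_indel(ref_len, alt_len):
    """Classify an indel by its effect on the reading frame."""
    change = alt_len - ref_len
    if change == 0:
        raise ValueError("not an indel: lengths are equal")
    return "in-frame" if change % 3 == 0 else "frameshift"


print(classify_snv("GAA", "GAG"))  # Glu -> Glu
print(classify_snv("GAA", "GTA"))  # Glu -> Val
print(classify_snv("TGG", "TGA"))  # Trp -> stop
print(classify_indel(3, 1))        # 2 bases deleted
print(classify_indel(1, 4))        # 3 bases inserted
```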
What is variant filtering and prioritisation?
-It is important to make the output manageable for clinical use
• Coding regions of genes related to condition (phenotype)
• Rare and novel SNVs and in-frame indels
• Non-synonymous or missense (SNVs)
• Nonsense, stop-gain (SNVs)
• Frameshift indels
• Splice sites
• Conserved sites
Filter variants that are repeatedly observed across patients with very different conditions that were sequenced within a laboratory as these are likely to be artifacts (arise due to technical reasons)
Avoid incidental findings by filtering to pre-defined genes for different conditions; the ‘test’ for each condition
-there is a lot of genetic code, so you filter to make the analysis possible, removing common variants etc.
-pick genes related to the condition
-also filter out technical issues of the particular lab
-Prioritisation makes it manageable
Look at the genes most likely to fit the phenotype
and look at the most promising variants first
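The filtering and prioritisation steps can be sketched as a small function. Everything here is hypothetical: the field names (`gene`, `consequence`, `pop_af`, `lab_fraction`), the gene labels, the thresholds and the ranking are assumptions chosen for illustration, not the pipeline described in the lecture.

```python
# Lower rank = looked at first. Consequences not listed (e.g. synonymous)
# are filtered out. The ordering is an illustrative assumption.
CONSEQUENCE_RANK = {
    "frameshift": 0,
    "nonsense": 0,
    "splice_site": 1,
    "missense": 2,
    "inframe_indel": 2,
}


def filter_and_prioritise(variants, gene_panel,
                          max_pop_af=0.01, max_lab_fraction=0.1):
    """Keep rare, potentially damaging variants in the panel genes.

    pop_af: frequency in a population database (0.0 if never seen).
    lab_fraction: fraction of the lab's unrelated patients carrying the
    variant; high values suggest a technical artifact.
    """
    kept = []
    for v in variants:
        if v["gene"] not in gene_panel:           # avoids incidental findings
            continue
        if v["consequence"] not in CONSEQUENCE_RANK:
            continue
        if v["pop_af"] > max_pop_af:              # too common
            continue
        if v["lab_fraction"] > max_lab_fraction:  # likely lab artifact
            continue
        kept.append(v)
    # Most promising first: most severe consequence, then rarest
    return sorted(
        kept, key=lambda v: (CONSEQUENCE_RANK[v["consequence"]], v["pop_af"])
    )


def _v(vid, gene, consequence, pop_af, lab_fraction):
    """Build one made-up variant record."""
    return {"id": vid, "gene": gene, "consequence": consequence,
            "pop_af": pop_af, "lab_fraction": lab_fraction}


variants = [
    _v("v1", "GENE_A", "missense", 0.0001, 0.0),
    _v("v2", "GENE_A", "synonymous", 0.0, 0.0),
    _v("v3", "GENE_B", "frameshift", 0.0, 0.0),  # gene not in the panel
    _v("v4", "GENE_A", "frameshift", 0.0, 0.6),  # seen in 60% of lab samples
    _v("v5", "GENE_C", "nonsense", 0.0, 0.0),
    _v("v6", "GENE_A", "missense", 0.2, 0.0),    # common in the population
]

panel = {"GENE_A", "GENE_C"}
print([v["id"] for v in filter_and_prioritise(variants, panel)])
```

Restricting to a pre-defined panel is what keeps the out-of-panel frameshift (v3) from being reported, even though it looks severe.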
What regions of DNA are important?
- the regions that do not vary between species are usually very important
- conservation is a useful reference when judging whether a variant at that site is likely to matter