Lecture 6 - An Introduction to Clinical Bioinformatics Flashcards

1
Q

What is bioinformatics?

A

-combination of statistics, biology and computer science

2
Q

How is the genome sequenced for bioinformatics?

A
  • analogy: shred copies of a book, read each piece and reconstruct the story
  • in clinical bioinformatics, sequencing works the same way: shred the DNA, sequence the fragments, put them back together, then analyse where the "typos" are
  • with enough repeated reads, some pieces are really easy to place
  • some parts of the genome are really easy to get
  • some parts are repetitive, so it is hard to find where they go
  • some pieces are missing
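The shred-and-reconstruct idea can be illustrated with a toy sketch. The reference string, read length and step size below are invented for illustration, and exact string matching stands in for a real aligner. The aim is only to show why reads from a repetitive region are hard to place.

```python
def shred(sequence, read_length, step):
    """Cut a sequence into overlapping reads (a toy stand-in for sequencing)."""
    return [sequence[i:i + read_length]
            for i in range(0, len(sequence) - read_length + 1, step)]


def find_positions(read, reference):
    """Return every start position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions


# Hypothetical reference: unique flanks around an AT repeat.
reference = "ACGTTGCA" + "ATATATATAT" + "GGCTAACG"
reads = shred(reference, read_length=6, step=2)

# Map each distinct read to all the places it could have come from.
placements = {read: find_positions(read, reference) for read in reads}
unique_reads = [r for r, pos in placements.items() if len(pos) == 1]
ambiguous_reads = [r for r, pos in placements.items() if len(pos) > 1]

print("easy to place:", unique_reads)
print("hard to place:", ambiguous_reads)
```

Reads drawn from the unique flanks match a single position, while the read from the repeat matches several positions, so the sequence alone cannot say where it belongs.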
3
Q

What are some of the challenges in clinical bioinformatics?

A

-Bioinformatic (statistical and computational) challenges arise in:
– the sequencing
– the reference
– the analysis
– the interpretation
-Understanding these challenges and incorporating them into the analysis is the job of bioinformatics
-Making it fit for routine health care, supporting diagnostics and clinicians is the job of clinical bioinformatics

4
Q

What is clinical bioinformatics about?

A

-using bioinformatics in a clinical setting and making it proven, predictable and standardised to use

5
Q

What are some of the areas that input into bioinformatics?

A

-patients, diagnostics, ethics, pathology, IT, IP, hospitals, sequencing laboratories, programmers, clinicians, geneticists

6
Q

What is sequenced in bioinformatics?

A

-the exome
-Variants in the exome (the coding regions) are the most interpretable
-we sequence the exome, the coding part of the genome, because these are the bits that we can understand
-coding regions are about 1% of the genome
1. Sequencing: break the DNA into bits (the exome fragments) and sequence the reads (several)
2. Raw data: the FASTQ file is the raw sequencing file; it stores the bases and their qualities, so it tells you about the quality of the data
3. Align the reads (BAM file): align the reads to the genome; the BAM file holds the aligned reads, i.e. reads mapped to the genome
4. Data cleaning and control: quality control and reproducibility are imperative for clinical analysis and must be done on a large scale. Remove:
– Reads that align to multiple positions
– Reads with poor quality
– Duplicated reads
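The raw-data and cleaning steps can be sketched in a few lines. This is a minimal sketch under stated assumptions: four-line FASTQ records, Phred+33 quality encoding, a hypothetical mean-quality cut-off of 20, and duplicates defined as identical sequences. Real pipelines mark duplicates and multi-mapped reads after alignment with dedicated tools, which is not shown here.

```python
def parse_fastq(text):
    """Yield (name, bases, qualities) from FASTQ text with 4 lines per record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        name = lines[i][1:]                          # drop the leading '@'
        bases = lines[i + 1]
        quals = [ord(c) - 33 for c in lines[i + 3]]  # Phred+33 encoding (assumed)
        yield name, bases, quals


def clean_reads(records, min_mean_quality=20):
    """Drop poor-quality reads and reads whose sequence was already seen."""
    seen = set()
    kept = []
    for name, bases, quals in records:
        if sum(quals) / len(quals) < min_mean_quality:
            continue                                 # poor-quality read
        if bases in seen:
            continue                                 # duplicated read (toy definition)
        seen.add(bases)
        kept.append(name)
    return kept


# Invented example data: read2 duplicates read1, read3 has very low quality.
sample_fastq = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGT
+
IIIIIIII
@read3
TTGGCCAA
+
########
"""

records = list(parse_fastq(sample_fastq))
kept = clean_reads(records)
print(kept)
```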

7
Q

Are exons covered evenly?

A

-no
-There are many reasons for varying coverage, which limits clinical utility
-Coverage of 20 reads (20X) is considered ‘usable’ or sufficient
-There are many poorly covered exons (<20X)
-There are many gaps in exons (0X)
-Coverage is often related to:
– Mappability & GC content
– Repetitive regions & blackspots
– Capture “baits”
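The 20X and 0X thresholds above can be applied to per-base depths as in the sketch below. The depth values are invented, and the function name and output keys are illustrative rather than taken from any particular tool.

```python
def coverage_summary(depths, usable=20):
    """Summarise per-base read depth for one exon."""
    n = len(depths)
    return {
        "mean_depth": sum(depths) / n,
        "fraction_usable": sum(d >= usable for d in depths) / n,  # bases at >= 20X
        "gap_bases": sum(d == 0 for d in depths),                 # bases at 0X
    }


# Hypothetical read depth at each base of a 10-base exon.
exon_depths = [35, 32, 28, 22, 19, 12, 0, 0, 5, 27]
summary = coverage_summary(exon_depths)
print(summary)
```

Here the mean depth looks close to usable, yet only half of the bases reach 20X and two bases have no reads at all, which is why a per-base summary is more informative than an average.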

8
Q

Are large deletions and amplifications challenging with exomes?

A

-yes; SNP arrays are still useful for large deletions & amplifications
-Coverage varies across the genome
-Coverage varies between genomes by chance
-Gaps between exons make detecting structural variants difficult
-Deletions in the sequence are what is found in intellectual disability and cancer
-When there is a coverage gap, is it specific to this person? It may be evidence of a deletion in that person, but it is difficult to say because coverage varies within and between people

9
Q

What does local re-alignment do?

A

-Local re-alignment is used to detect small indels
-Errors can be removed by fixing up the alignments
-Different conventions for ‘calling’ or defining indels make comparing them between genomes difficult
-Commonly used variant callers (e.g. GATK) are poor at detecting indels >10-20 bp

10
Q

What are the quality scores (Phred)?

A
  • Sequencing base qualities
  • Most sequencing errors are near the end of the read: the machine accumulates errors as it reads the bases, so quality is good at the beginning and then deteriorates
  • The scores in the plot (not reproduced here) are summarised across all the reads (100 bp) that are output from the sequencing machine for a single sample
  • This profile should be consistent between sequencing experiments
  • Mapping qualities
  • Variant call qualities, from calling variants (VCF file)
  • In our pipeline, variant genotype quality must be Q>5 (Q5 corresponds to roughly a 30% chance that the call is wrong)
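A Phred score Q is defined from the error probability P as Q = -10 * log10(P). The short sketch below converts in both directions. The function names are illustrative.

```python
import math


def phred_to_error(q):
    """Probability that a call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability p."""
    return -10 * math.log10(p)


for q in (5, 10, 20, 30):
    p = phred_to_error(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {1 - p:.2%}")
```

By this definition Q20 means a 1 in 100 chance of error and Q30 a 1 in 1000 chance, while Q5 leaves about a 32% chance that the call is wrong.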
11
Q

What is variant quality?

A

Is it a real variant? What should you look out for, and what standards should be set?
Quality score
– base qualities, coverage
– number of reads with the variant
Sequence context
– beginning or end of reads
– near indels or homopolymers
– Blacklisted, GC content, strand bias
– Seen before in the cohort (technical artifact)
Set permissive standards and risk getting false positives?
Or, set conservative standards and risk more false negatives?

12
Q

What is variant annotation?

A

-What is the variant, has it been seen before and what might its effect be?
Chromosome, position, gene
Transcript, codon change, amino acid change
Variant quality, in-silico effect predictions, conservation
Observed frequency in population and disease databases
The variant is interpreted by its effect on the protein produced from a transcript of the gene. We can choose the transcript expected to be affected most, but it also matters how confident you are that the transcript is relevant; for this we rely on transcript annotations (RefSeq or Ensembl) and typically choose the transcript commonly used in the literature and disease-specific databases

13
Q

What are the types of coding variants?

A
-An SNV is a single nucleotide variant; it can be synonymous or non-synonymous (missense)
-An indel variant is an insertion or deletion of nucleotides; it can be in-frame or out-of-frame (frameshift)
-These variants can affect splice sites, and can create (or remove) a start codon or a stop codon (nonsense)
-Frameshifts usually introduce a downstream stop codon; they are called protein truncating variants (PTV) and can cause loss of function (LOF)
-top row (in the slide) = the reference sequence
-synonymous variant = the codon changes but the amino acid does not
-non-synonymous or missense variant = the codon change alters the amino acid
-frameshift = the reading frame is a window of 3 bases, so deleting 2 bases shifts every downstream codon = big change!
-stop codon introduced = nonsense
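These definitions can be expressed as a small classifier. It is a sketch only: it uses the standard genetic code, takes whole codons as input, and ignores splice sites and start codons. The "stop-loss" label and the function names are additions for illustration and are not from the lecture.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snv(ref_codon, alt_codon):
    """Classify a single-nucleotide change by its effect on the encoded amino acid."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"   # codon changed, amino acid did not
    if alt_aa == "*":
        return "nonsense"     # stop codon introduced
    if ref_aa == "*":
        return "stop-loss"    # stop codon removed (label added for this sketch)
    return "missense"         # amino acid changed


def classify_indel(length_change):
    """In-frame if the inserted or deleted length is a multiple of 3, else frameshift."""
    return "in-frame" if length_change % 3 == 0 else "frameshift"


print(classify_snv("GAA", "GAG"))  # Glu -> Glu
print(classify_snv("GAA", "GTA"))  # Glu -> Val
print(classify_snv("GAA", "TAA"))  # Glu -> stop
print(classify_indel(-2))          # deletion of 2 bases
print(classify_indel(3))           # insertion of 3 bases
```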
14
Q

What is variant filtering and prioritisation?

A

-It is important to make the output manageable for clinical use. Focus on:
• Coding regions of genes related to the condition (phenotype)
• Rare and novel SNVs and in-frame indels
• Non-synonymous or missense (SNVs)
• Nonsense, stop-gain (SNVs)
• Frameshift indels
• Splice sites
• Conserved sites
-Filter out variants that are repeatedly observed across patients with very different conditions who were sequenced within the same laboratory, as these are likely to be artifacts (they arise for technical reasons)
-Avoid incidental findings by filtering to pre-defined genes for different conditions; this is the ‘test’ for each condition
-There is a lot of genetic code, so you filter to make the analysis possible and remove the common variants, etc.
-Pick genes related to the condition
-Also filter out technical issues of the particular lab
-Prioritisation makes it manageable: look at the genes most likely to fit the phenotype, and look at the most promising variants first
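The filtering and prioritisation steps can be sketched as follows. Everything specific here is an assumption for illustration: the gene names, the 1% rarity cut-off, the record fields, and the severity ranking used to decide which variants to look at first.

```python
RARE_FREQUENCY = 0.01  # assumed cut-off for "rare"; not specified in the lecture

# Assumed ranking: lower number = look at it first.
SEVERITY_RANK = {
    "nonsense": 0,
    "frameshift": 0,
    "splice_site": 1,
    "missense": 2,
    "inframe_indel": 2,
}


def filter_and_prioritise(variants, gene_panel, lab_artifacts):
    """Keep rare, potentially damaging variants in panel genes, most promising first."""
    kept = [
        v for v in variants
        if v["gene"] in gene_panel                        # pre-defined genes for the condition
        and v["pop_freq"] < RARE_FREQUENCY                # remove the common variants
        and v["consequence"] in SEVERITY_RANK             # drops synonymous changes
        and (v["gene"], v["pos"]) not in lab_artifacts    # recurrent lab-specific artifacts
    ]
    return sorted(kept, key=lambda v: (SEVERITY_RANK[v["consequence"]], v["pop_freq"]))


# Invented variants in hypothetical genes.
variants = [
    {"gene": "GENE_A", "pos": 101, "consequence": "missense", "pop_freq": 0.0002},
    {"gene": "GENE_A", "pos": 250, "consequence": "synonymous", "pop_freq": 0.0},
    {"gene": "GENE_B", "pos": 77, "consequence": "frameshift", "pop_freq": 0.0},
    {"gene": "GENE_B", "pos": 90, "consequence": "missense", "pop_freq": 0.12},
    {"gene": "GENE_C", "pos": 5, "consequence": "nonsense", "pop_freq": 0.0},
    {"gene": "GENE_A", "pos": 300, "consequence": "nonsense", "pop_freq": 0.0},
]
gene_panel = {"GENE_A", "GENE_B"}
lab_artifacts = {("GENE_A", 300)}

shortlist = filter_and_prioritise(variants, gene_panel, lab_artifacts)
for v in shortlist:
    print(v["gene"], v["pos"], v["consequence"])
```

Restricting to the gene panel is also what keeps incidental findings, such as the nonsense variant in the off-panel gene, out of the report.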

15
Q

What regions of DNA are important?

A
  • the regions that do not vary between species (conserved regions) are usually very important
  • conservation is a useful reference when judging whether a variant is likely to matter
16
Q

What is problematic about rare variants?

A
  • many of us carry a lot of very rare variants
  • 3000 of them are still predicted to be damaging even though the person is fine, which is problematic

17
Q

What is the issue with variable coverage in population databases?

A
  • Exome Aggregation Consortium (ExAC) has >65,000 individuals and provides coverage information about each gene
  • Can’t accurately assess variant frequency in regions that are poorly covered
18
Q

What is the in-silico prediction accuracy?

A

-Many false positives and false negatives
-Reported mutations are sometimes predicted to have no effect on protein function
-In-silico prediction methods are not independent: different methods often use similar information as a basis for their approach, so don’t rely on consensus
-They rely on automatic alignments that are often difficult to check and/or manually adjust
Most approaches use some of the following aspects:
• Homology of related sequences
• Conservation of the variant site across homologous aligned sequences (the alignment must be correct and informative)
• Amino acid biochemical properties
• Protein structure information (often only partial information, and only for some genes)
Well known methods include: SIFT, PolyPhen2, GERP, Align-GVGD, PhyloP, Condel, CADD, MutationTaster, ...

19
Q

What are the homology based methods?

A
  • The functional significance of an amino acid substitution at different sites in a protein is reflected by the level of evolutionary conservation
  • Variants at highly conserved sites are more likely to be deleterious
  • Variants that disrupt areas of known function are likely to be deleterious
  • But there are limitations to relying on conservation to predict the effect of a variant
20
Q

What are the limitations of conservation?

A

-Poor information content: very little information about conservation; more species are needed
-Poor quality: the variant site appears not to be conserved, but a poorly aligned sequence shouldn’t be included
-Poor stability: methods can be sensitive to the alignment and can give very different predictions if, for example, the length of the aligned sequences changes or a sequence is omitted

21
Q

How do you classify the variants?

A

-Classification of variants into 5 classes
5. Pathogenic
4. Likely pathogenic
4a and 4b
3. Variant of unknown significance
3a, 3b, 3c
2. Likely benign
1. Benign
Primary analysis – sequencing related informatics
Secondary analysis – alignment, variant calling (pipeline) etc.
Tertiary analysis – variant interpretation and classification