Lecture 6- An Introduction to clinical bioinformatics Flashcards

Question 1

Q

What is bioinformatics?

Answer

A

-combination of statistics, biology and computer science

Question 2

Q

How is the genome sequenced for bioinformatics?

Answer

A

shred them, read each piece and reconstruct the story
clinical bioinformatics=
sequencing= shred the DNA, sequence the pieces, put them back together, then analyse where the typos are
with enough repeats, some pieces are really easy to get
some parts of the genome are really easy to get
some are repetitive and it is hard to find where they go
some pieces are missing

Question 3

Q

What are some of the challenges in clinical bioinformatics?

Answer

A

-Bioinformatic (statistical and computational) challenges arise in:
– the sequencing
– the reference
– the analysis
– the interpretation
-Understanding these challenges and incorporating them into the analysis is the job of bioinformatics
-Making it fit for routine health care, supporting diagnostics and clinicians is the job of clinical bioinformatics

Question 4

Q

What is clinical bioinformatics about?

Answer

A

-using bioinformatics in a clinical setting and making it proven, predictable and standardised to use

Question 5

Q

What are some of the areas that input into bioinformatics?

Answer

A

-patients, diagnostics, ethics, pathology, IT, IP, hospitals, sequencing laboratories, programmers, clinicians, geneticists

Question 6

Q

What is sequenced in bioinformatics?

Answer

A

-the exome
-Variants in the exome (the coding regions) are the most interpretable
-we sequence the exome= the coding part= the bits that we can understand
-coding bits are about 1% of the genome
1. Sequencing: break the DNA into bits (the exome fragments)
-sequence the reads (several)
-stored in a fastq file
-tells you about the quality of the data
2. Raw data: the fastq file is the raw sequencing file – bases and qualities
3. Align the reads (bam file): align the reads to the genome
-bam file= aligned reads
-reads mapped to the genome
4. Data cleaning and control: quality control and reproducibility are imperative for clinical analysis and must be done on a large scale; remove:
– Reads that align to multiple positions
– Reads with poor quality
– Duplicated reads
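
The cleaning step can be sketched in a few lines of Python working on plain SAM text (the text form of a bam file). This is only an illustration, not the lecture's pipeline: the function names and the MAPQ cut-off of 20 are assumptions, low mapping quality is used as a stand-in for "aligns to multiple positions" (how aligners report this varies), and the QC-fail flag is used as a stand-in for "poor quality". The flag bits themselves (0x100, 0x200, 0x400) come from the SAM format.

```python
# SAM FLAG bits (defined by the SAM format)
SECONDARY = 0x100   # secondary alignment of a read that maps elsewhere too
QC_FAIL = 0x200     # read failed platform/vendor quality checks
DUPLICATE = 0x400   # PCR or optical duplicate

def keep_read(sam_line, min_mapq=20):
    """Return True if one SAM alignment line survives cleaning.

    min_mapq=20 is an assumed threshold, not a value from the lecture.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: FLAG
    mapq = int(fields[4])   # column 5: MAPQ
    if flag & (SECONDARY | QC_FAIL | DUPLICATE):
        return False
    return mapq >= min_mapq

def clean(sam_lines, min_mapq=20):
    """Drop header lines (starting with '@') and reads that fail the rules."""
    return [line for line in sam_lines
            if not line.startswith("@") and keep_read(line, min_mapq)]

# Hypothetical reads: r1 is fine, r2 is a duplicate, r3 has MAPQ 0, r4 is secondary
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t256\tchr2\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
kept_names = [line.split("\t")[0] for line in clean(example)]
```

In practice this is done with dedicated tools on the bam file itself; the point here is only that each of the three "remove" rules maps onto something recorded per read.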

Question 7

Q

Are exons covered evenly?

Answer

A

-no
-Many reasons for varying coverage – limits clinical utility
Coverage of 20 reads is considered ‘usable’ or sufficient (20X)
There are many poorly covered exons (<20X)
There are many gaps in exons (0X)
Coverage is often related to:
– Mappability & GC
– Repetitive regions & blackspots
– Capture “baits”
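
A minimal sketch of how exons could be sorted into the three groups above (gap, poorly covered, usable), plus a GC-content helper since GC is one of the listed drivers of coverage. Summarising an exon by its mean depth is an assumption made here for simplicity; a lab might instead use the minimum depth or the fraction of bases at 20X or more. The function names are hypothetical.

```python
def classify_exon(depths, usable=20):
    """Label an exon from its per-base read depths.

    depths: list of read depths, one per base of the exon.
    usable: depth regarded as sufficient (20X in the notes above).
    """
    if not depths or max(depths) == 0:
        return "gap (0X)"
    mean_depth = sum(depths) / len(depths)
    if mean_depth >= usable:
        return f"usable (>={usable}X)"
    return f"poor (<{usable}X)"

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

For example, `classify_exon([25, 30, 22])` falls in the usable group, while `classify_exon([0, 0, 0])` is reported as a gap.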

Question 8

Q

Are large deletions and amplifications challenging with exomes?

Answer

A

-SNP arrays are still useful for large deletions & amplifications
-Coverage varies across the genome
-Coverage varies between genomes by chance
-Gaps between exons make detecting structural variants difficult
-deletions in the sequence are what is found in intellectual disability and cancer
-where there is a coverage gap, is it special to this person? it may be evidence that there is a deletion in that person
-difficult to say, as coverage varies within and between people
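
One way to picture the "is this gap special to this person?" question is to compare a sample's depth over an exon with the same exon in other samples. The sketch below is an assumption-laden illustration, not a method from the lecture: it assumes depths have already been normalised for how deeply each sample was sequenced, and the function name is hypothetical. Losing one of two copies would halve the depth, giving a log2 ratio near -1, but chance variation can produce similar values.

```python
import math
from statistics import median

def log2_ratio(sample_depth, cohort_depths):
    """log2 of a sample's normalised exon depth over the cohort median depth."""
    reference = median(cohort_depths)
    if reference <= 0:
        raise ValueError("cohort has no coverage here, so no comparison is possible")
    if sample_depth == 0:
        return float("-inf")  # no reads at all in this sample
    return math.log2(sample_depth / reference)
```

If the cohort itself has little or no coverage over the exon, the comparison says nothing, which is the same problem as the coverage gaps described above.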

Question 9

Q

What does local re-alignment do?

Answer

A

-Local re-alignment to detect small indels
-Errors can be removed by fixing up alignments
-Different conventions for ‘calling’ or defining indels make comparing them between genomes difficult
Commonly used variant callers (GATK) are poor at detecting indels >10-20bp

Question 10

Q

What are the quality scores (Phred)?

Answer

A

In our pipeline, variant genotype quality must be Q>5 (a Phred score of 5 is a ~30% chance of error, so ~70% accuracy)
Sequencing base qualities
Most sequencing errors are near the end of the read
The scores in the plot above are summarised across all the reads (100bps) that are output from the sequencing machine for a single sample
This profile should be consistent between sequencing experiments
-this is due to the machine: it accumulates errors as it runs through the bases, so quality is good at the beginning and then deteriorates
Mapping qualities
Variant call qualities
Calling variants (vcf file)
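
Phred scores turn an error probability P into Q = -10 x log10(P), so Q10 is a 1 in 10 chance of error, Q20 is 1 in 100 and Q30 is 1 in 1000. A small sketch of the conversion, and of decoding the quality line of a fastq record, follows. The function names are hypothetical, and the offset of 33 assumes the common Phred+33 encoding; older files may use a different offset.

```python
def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def decode_fastq_quality(qual_string, offset=33):
    """Convert a fastq quality string into a list of Phred scores.

    offset=33 assumes Phred+33 encoding.
    """
    return [ord(char) - offset for char in qual_string]

def mean_error(qual_string, offset=33):
    """Average error probability across a read (empty read gives 0.0)."""
    scores = decode_fastq_quality(qual_string, offset)
    if not scores:
        return 0.0
    return sum(phred_to_error(q) for q in scores) / len(scores)
```

A score of 5 corresponds to an error probability of about 0.32, which is why a threshold of Q>5 is a very permissive one.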

Question 11

Q

What is variant quality?

Answer

A

Is it a real variant, what to look out for, what standards should be set?
Quality score
– base qualities, coverage
– number of reads with the variant
Sequence context
– beginning or end of reads
– near indels or homopolymers
– Blacklisted, GC content, strand bias
– Seen before in the cohort (technical artifact)
Set permissive standards and risk getting false positives?
Or, set conservative standards and risk more false negatives?
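
The permissive versus conservative trade-off can be made concrete with a toy example. Everything below is hypothetical: the calls, their quality scores and which ones are "real" are made up purely to show how moving one threshold trades false positives for false negatives.

```python
# Hypothetical candidate calls: (quality score, is the variant real?)
calls = [(5, False), (12, True), (18, False), (25, True), (40, True), (60, True)]

def evaluate(candidate_calls, min_quality):
    """Count the errors made when keeping only calls at or above min_quality."""
    kept = [real for quality, real in candidate_calls if quality >= min_quality]
    dropped = [real for quality, real in candidate_calls if quality < min_quality]
    return {
        "false_positives": kept.count(False),   # artifacts that were kept
        "false_negatives": dropped.count(True), # real variants thrown away
    }

permissive = evaluate(calls, min_quality=10)
conservative = evaluate(calls, min_quality=30)
```

With these made-up calls the permissive threshold keeps one artifact and loses no real variants, while the conservative threshold keeps no artifacts but loses two real variants.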

Question 12

Q

What is variant annotation?

Answer

A

-What is the variant, has it been seen before and what might its effect be?
Chromosome, position, gene
Transcript, codon change, amino acid change
Variant quality, in-silico effect predictions, conservation
Observed frequency in population and disease databases
The variant is interpreted by its effect on the protein of each transcript associated with the gene; we can choose the transcript expected to be affected most, but...
It also matters how confident you are that the transcript is relevant;
for this we rely on transcript annotations (RefSeq or Ensembl) and
typically choose the transcript commonly used in the literature and disease-specific databases

Question 13

Q

What are the types of coding variants?

Answer

A

-SNV is a single nucleotide variant
Can be synonymous or non-synonymous (missense)
-Indel variant is an insertion or deletion of nucleotides
Can be in-frame or out-of-frame (frameshift)
-These variants can affect splice sites, and can create (or remove) a start codon or a stop codon (nonsense)
-Frameshifts usually introduce a downstream stop codon, are called protein truncating variants (PTV) and can cause loss of function (LOF)
-up= the references
-synonymous mutation= the codon changes but the amino acid doesn’t change
-non-synonymous or missense variant
-frameshift= the frame is a window of 3 bases; if you delete 2 bases, the window shifts= big change!
-stop codon introduced= nonsense
Question 14

Q

What is variant filtering and prioritisation?

Answer

A

-It is important to make the output manageable for clinical use
• Coding regions of genes related to the condition (phenotype)
• Rare and novel SNVs and in-frame indels
• Non-synonymous or missense (SNVs)
• Nonsense, stop-gain (SNVs)
• Frameshift indels
• Splice sites
• Conserved sites
-Filter out variants that are repeatedly observed across patients with very different conditions sequenced within the same laboratory, as these are likely to be artifacts (they arise for technical reasons)
-Avoid incidental findings by filtering to pre-defined genes for each condition; this gene list is the ‘test’ for that condition
-there is a lot of genetic code, so you filter to make it analysable and remove the common variants, etc.
-pick genes related to the condition
-also filter out technical issues of the particular lab
-Prioritisation makes it manageable
Look at the genes most likely to fit the phenotype
and look at the most promising variants first
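
A rough sketch of filtering followed by prioritisation is shown below. All of it is illustrative: the field names, the thresholds (1% population frequency, seen in at most 5 other patients in the lab) and the ranking of consequences are assumptions, not values from the lecture.

```python
# Assumed ranking: lower number = looked at first
CONSEQUENCE_RANK = {
    "frameshift": 0,
    "nonsense": 0,
    "splice_site": 1,
    "missense": 2,
    "inframe_indel": 2,
}

def filter_and_prioritise(variants, gene_panel,
                          max_pop_freq=0.01, max_cohort_count=5):
    """Keep rare, potentially damaging variants in the panel genes, best first.

    variants: list of dicts with keys 'id', 'gene', 'consequence',
              'pop_freq' (population frequency) and 'cohort_count'
              (times seen in unrelated patients from the same lab).
    gene_panel: set of genes pre-defined for the condition (the 'test').
    """
    kept = [
        v for v in variants
        if v["gene"] in gene_panel                  # avoids incidental findings
        and v["consequence"] in CONSEQUENCE_RANK    # drops synonymous etc.
        and v["pop_freq"] <= max_pop_freq           # rare or novel only
        and v["cohort_count"] <= max_cohort_count   # likely lab artifacts removed
    ]
    return sorted(kept, key=lambda v: (CONSEQUENCE_RANK[v["consequence"]],
                                       v["pop_freq"]))

# Hypothetical variants in hypothetical genes
candidates = [
    {"id": "A", "gene": "GENE_A", "consequence": "missense", "pop_freq": 0.0005, "cohort_count": 0},
    {"id": "B", "gene": "GENE_A", "consequence": "frameshift", "pop_freq": 0.0, "cohort_count": 1},
    {"id": "C", "gene": "GENE_B", "consequence": "nonsense", "pop_freq": 0.0, "cohort_count": 0},
    {"id": "D", "gene": "GENE_A", "consequence": "synonymous", "pop_freq": 0.0, "cohort_count": 0},
    {"id": "E", "gene": "GENE_A", "consequence": "missense", "pop_freq": 0.2, "cohort_count": 0},
    {"id": "F", "gene": "GENE_A", "consequence": "missense", "pop_freq": 0.0, "cohort_count": 40},
]
shortlist = [v["id"] for v in filter_and_prioritise(candidates, {"GENE_A"})]
```

Here only B (frameshift) and A (rare missense) survive, in that order; C is outside the panel, D is synonymous, E is too common and F looks like a lab artifact.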

Question 15

Q

What regions of DNA are important?

Answer

A

the regions that do not vary between species are usually very important
conservation is a useful reference point when assessing a variant

Question 16

Q

What is a bit problematic with the rare variants?

Answer

A

-a lot of us have a lot of very rare variants
- 3000 are still predicted to be damaging but the person is fine, which is problematic

Question 17

Q

What is the issue with variable coverage in population databases?

Answer

A

Exome Aggregation Consortium (ExAC) has >65,000 individuals and provides coverage information about each gene
Can’t accurately assess variant frequency in regions that are poorly covered

Question 18

Q

What is the in-silico prediction accuracy?

Answer

A

-Many false positives and false negatives
-Reported mutations predicted to have no effect on protein function
-In-silico prediction methods not independent
Different in-silico prediction methods often use similar information as the basis for their approach, so they aren’t independent; don’t rely on consensus
They rely on automatic alignments that are often difficult to check and/or manually adjust
Most approaches use some of the following aspects:
• Homology of related sequences
• Conservation of the variant site across homologous aligned sequences (the alignment must be correct and informative)
• Amino acid biochemical properties
• Protein structure information (often only partial information, and only for some genes)
Well known methods include: SIFT, PolyPhen2, GERP, Align-GVGD, PhyloP, Condel, CADD, MutationTaster, ...

Question 19

Q

What are the homology based methods?

Answer

A

The functional significance of an amino acid substitution at different sites in a protein is reflected by the level of evolutionary conservation
Variants at highly conserved sites are more likely to be deleterious
Variants that disrupt areas of known function are likely to be deleterious
But there are limitations to relying on conservation to predict the effect of a variant

Question 20

Q

What are the conservation limits?

Answer

A

-Poor information content
Very little information about conservation; more species are needed
-Poor quality
The variant site appears not to be conserved, but a poorly aligned sequence shouldn’t be included
-Poor stability: methods can be sensitive to the alignment
A method can give very different predictions if, for example, the length of the aligned sequences changes or a sequence is omitted
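
The stability problem can be seen even with a very crude conservation measure. The sketch below is not how SIFT, GERP or the other tools score conservation; it simply takes the fraction of aligned sequences that share the most common residue at a position, with a hypothetical function name and a made-up alignment, to show how omitting a single sequence changes the answer.

```python
from collections import Counter

def column_conservation(aligned_seqs, pos):
    """Fraction of non-gap sequences sharing the most common residue at pos."""
    residues = [seq[pos] for seq in aligned_seqs if seq[pos] != "-"]
    if not residues:
        return 0.0  # nothing aligned here, so no information
    most_common_count = Counter(residues).most_common(1)[0][1]
    return most_common_count / len(residues)

# Made-up alignment of four species; position 2 differs in the third sequence
alignment = ["MKTL", "MKTL", "MKSL", "MRTL"]
with_all = column_conservation(alignment, 2)
without_third = column_conservation(alignment[:2] + alignment[3:], 2)
```

With all four sequences the site looks 75% conserved; leave out the third sequence (for example because it is poorly aligned) and the same site looks perfectly conserved.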

Question 21

Q

How do you classify the variants?

Answer

A

-Classification of variants into 5 classes
5. Pathogenic
4. Likely pathogenic
4a and 4b
3. Variant of unknown significance
3a, 3b, 3c
2. Likely benign
1. Benign
Primary analysis – sequencing related informatics
Secondary analysis – alignment, variant calling (pipeline) etc.
Tertiary analysis – variant interpretation and classification

(21 cards)