(Week 2) [T4-2] Precision Medicine Bioinformatics Flashcards
Sequence alignment/comparison accounts for >90% of bioinformatics. To align sequences, we must consider 3 scenarios at each position [1]. A naïve algorithm would be, for example, [2]. Sequence alignment can be [3] or [4].
[1] Nucleotide/amino-acid match, mismatch (substitution), gap.
[2] Finding the minimum number of operations needed to transform one sequence into the other (edit distance). But this assumes that substitutions, insertions and deletions are all equally probable, which is not true.
[3] Global. Attempting to maximize the alignment score while matching the complete sequences (Needleman-Wunsch).
[4] Local. Attempting to maximize the alignment score in local regions (Smith-Waterman).
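A minimal Python sketch of [2], [3] and [4], for illustration only. The function names and the scoring values (match=1, mismatch=-1, gap=-2) are assumptions chosen for the example, not values from the course. It returns scores only, without the traceback that would recover the alignment itself.

```python
def edit_distance(a: str, b: str) -> int:
    """[2] Minimum number of substitutions, insertions and deletions
    needed to turn a into b (every operation costs 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[-1]


def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """[3] Global score (Needleman-Wunsch) when local=False,
    [4] best local score (Smith-Waterman) when local=True.
    Scoring values are illustrative assumptions."""
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i, ca in enumerate(a, start=1):
        curr = [0 if local else i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)   # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


print(edit_distance("kitten", "sitting"))
print(alignment_score("ACGT", "AGT"))
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))
```

Unlike the edit distance, the scored versions let a gap cost more than a mismatch, which is one simple way to reflect that the three scenarios in [1] are not equally probable.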
What is BLAST, and what does BLAST give us as a result?
BLAST is the Google of molecular biology research. It is a heuristic local alignment algorithm designed for database search. It is not guaranteed to produce an optimal alignment.
To interpret a BLAST result, we have: sequence identity (the fraction of aligned positions that are identical), the expectation value or E-value (the number of hits with a score at least this good expected by chance; lower is more significant) and the length of the aligned segments.
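A rough Python sketch of two of these quantities. It assumes identity is counted as identical columns over the full alignment length (gaps included), and that the E-value follows the Karlin-Altschul form E = K * m * n * exp(-lambda * S). The function names are made up for the example, and K and lambda must be supplied by the caller, since they depend on the scoring system; this is not BLAST's own code.

```python
import math


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns holding the same residue.
    Both strings must already be aligned, with '-' marking gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def e_value(raw_score: float, query_len: int, db_len: int,
            k: float, lam: float) -> float:
    """Expected number of chance hits scoring at least raw_score,
    assuming E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Toy alignment: 4 identical columns out of 6.
print(round(percent_identity("ACGT-A", "ACCTGA"), 2))
```

The formula makes the usual reading visible: for the same score, a larger database gives a larger E-value, so the hit is less significant.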
In genome assembly, explain the differences between ‘De novo assembly’ and ‘Mapping assembly’.
- De novo assembly: like putting a puzzle together without looking at the box.
Generic Feature Format (GFF) is the text file format for describing genome features, used for de novo genome assemblies.
- Mapping assembly: like putting a puzzle together on top of the box. Orders of magnitude quicker and simpler than de novo assembly.
Variant Call Format (VCF) is the text format for storing gene sequence variations, used for mapping assemblies.
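Both formats are tab-separated text, so a single record can be split into named fields. A minimal Python sketch follows; the column names are the standard GFF3 columns and the fixed VCF columns, while the helper name and both example records are invented for illustration. Header and comment lines (starting with "#") are not handled.

```python
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_record(line: str, columns: list) -> dict:
    """Split one tab-separated data line into a {column: value} dict.
    All values stay strings; extra trailing fields are ignored."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(columns, fields))


# Made-up records, for illustration only.
gff_line = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001"
vcf_line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20"

feature = parse_record(gff_line, GFF3_COLUMNS)
variant = parse_record(vcf_line, VCF_COLUMNS)
print(feature["type"], feature["start"], feature["end"])
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"])
```

A GFF record says where a feature is (a gene from 1000 to 2000); a VCF record says how a sample differs from the reference at one position (A replaced by G).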
What is ‘De novo annotation’?
De novo annotation: identifying relevant DNA sequences in the genome (transcripts, protein-coding regions, etc.).
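As a toy illustration of finding protein-coding regions, the Python sketch below scans the forward strand for open reading frames (ATG up to the first in-frame stop codon). This is a deliberate simplification and an assumption of the example: real annotation tools also use the reverse strand, splicing signals and statistical gene models. The function name and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list:
    """Return sorted (start, end) 0-based half-open intervals of
    forward-strand ORFs. min_codons counts codons before the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return sorted(orfs)


dna = "CCATGAAATAGCC"
for begin, end in find_orfs(dna):
    print(begin, end, dna[begin:end])
```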
Talk a little bit about genomic data access.
Genomic data cannot be anonymised, cannot be placed into central repositories and cannot leave the country.
Genomic data is critical for research.
Data federation (multiple databases function as one).
Data is deposited at the national/institutional level. Queries for variants are allowed.
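A toy Python sketch of the federation idea: each node keeps its own data, and a query only asks every node whether a given variant is present. The class, function and node names are hypothetical, and a real federation would add networking, authentication and access control.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


@dataclass
class Node:
    """One national/institutional database. Its variants never leave it."""
    name: str
    variants: Set[Variant] = field(default_factory=set)

    def has_variant(self, variant: Variant) -> bool:
        # Only a yes/no answer is returned, never the underlying records.
        return variant in self.variants


def federated_query(nodes: List[Node], variant: Variant) -> Dict[str, bool]:
    """Ask every node the same question, as if they were one database."""
    return {node.name: node.has_variant(variant) for node in nodes}


# Hypothetical nodes and variants, for illustration only.
nodes = [
    Node("hospital_a", {("chr1", 12345, "A", "G")}),
    Node("national_biobank", {("chr2", 500, "C", "T")}),
]
answers = federated_query(nodes, ("chr1", 12345, "A", "G"))
print(answers)
```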
Repeat the question, and answer: what are the main types of genomic variation data?
- Single Nucleotide Polymorphisms (SNPs). Variation at a single nucleotide, caused by DNA polymerase errors or induced by mutagens. Linked to genetic disorders, cancer and infectious diseases.
- Copy Number Variants (CNVs). A region of the genome that is present in more than one copy. Linked to malaria and AIDS.
- Structural variants. Chromosome-level rearrangements such as deletions, inversions, etc.