Bioinformatics Lecture 1 Flashcards
DNA sequencing machine
DNA fragments (reads) come out in random order
metagenomics
genetic data recovered directly from environmental sources
includes microbiome
approach / goals of bioinformatics
store, process, analyse, model, predict
biomarkers
measurable characteristics informative about a biological state
can be e.g. genes or metabolites
Kaplan-Meier curve
for progressive diseases
handles incomplete (censored) observations
estimates / predicts survival
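A minimal sketch of the Kaplan-Meier estimate, assuming each patient has a follow-up time and a flag saying whether the event was observed (1) or the observation was censored (0). The function name and the toy data are made up for illustration.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each observed event time.

    times  : follow-up time per patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = list(zip(times, events))
    survival = 1.0
    curve = []
    # Only times at which an event was actually observed change the curve.
    for t in sorted({ti for ti, ei in data if ei == 1}):
        # Patients still under observation just before time t.
        at_risk = sum(1 for ti, _ in data if ti >= t)
        # Events observed exactly at time t.
        observed = sum(1 for ti, ei in data if ti == t and ei == 1)
        survival *= 1 - observed / at_risk
        curve.append((t, survival))
    return curve


# Toy cohort: five patients, two of them censored (event flag 0).
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 0, 1, 1, 0]
print([(t, round(s, 2)) for t, s in kaplan_meier(toy_times, toy_events)])
```

Censored patients still count as "at risk" until they leave the study, which is why incomplete observations are not simply thrown away.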
sources of data in bioinformatics
clinical data
imaging
non-high throughput data
high throughput data
high throughput profiling
automated process
outputs many different types of biological data
Alzheimer’s progression
pathology is present one or two decades before symptom onset
treatment only effective if early
FTD
frontotemporal dementia
rarer than Alzheimer's
little known about etiology
tau and TDP-43 proteins play a role
mass spectrometry
measures the mass-to-charge ratio of ions
used to identify proteins
gene networks
co-regulated / co-expressed genes
entire pathway goes up or down
individual differences in protein expression
some are noise
some are natural variation unrelated to disease
some influenced by e.g. what people ate before
gene set enrichment analysis
method to test whether predefined sets of genes or proteins are overrepresented in a large dataset
identifying genes that are regulated together
often related to disease phenotypes
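A minimal sketch of an overrepresentation test, assuming a hypergeometric model: given how many genes were measured, how many belong to a gene set, and how many were flagged as changed, how surprising is the overlap? Note that this is the simpler over-representation analysis; the rank-based GSEA algorithm itself works differently. The function name and all numbers are made up for illustration.

```python
from math import comb


def enrichment_p_value(total_genes, set_size, selected, overlap):
    """P(overlap or more set members among the selected genes) by chance.

    total_genes : number of genes measured
    set_size    : number of those genes that belong to the gene set
    selected    : number of genes flagged as changed (e.g. up-regulated)
    overlap     : number of flagged genes that are in the gene set
    """
    upper = min(set_size, selected)
    # Hypergeometric upper tail: sum over all overlaps at least this large.
    favourable = sum(
        comb(set_size, i) * comb(total_genes - set_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_genes, selected)


# Toy numbers: 20,000 genes, a pathway of 100 genes, 500 genes flagged,
# 10 of the flagged genes are in the pathway (2.5 expected by chance).
print(enrichment_p_value(20000, 100, 500, 10))
```

A small p-value suggests the pathway as a whole is affected, which matches the idea that co-regulated genes go up or down together.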
finding out functions of proteins
either Wikipedia
or a biobank
or BLAST
BLAST
extremely widely used
output: homologous protein sequences
aligned to the query (input) sequence
homology search
are there proteins with similar sequences?
often evolutionarily related
ancestral and descendant sequences
ancestral sequences
from evolutionary ancestor
usually unknown
phylogenetic tree
see where variants cluster and how far they are from the ancestor
sequence alignment in evolution
shows which sequences are conserved in evolution
BLAST parameters
e-value
substitution matrix
gap penalties
word size
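A minimal sketch of what the word size parameter controls, assuming the usual seeding idea: the query is cut into overlapping words of a fixed length, and these words are used to find candidate matches before any full alignment is attempted. The function name, the word size of 3 and the sequence are illustrative only.

```python
def query_words(sequence, word_size=3):
    """Return all overlapping words of length word_size in the sequence."""
    return [
        sequence[i:i + word_size]
        for i in range(len(sequence) - word_size + 1)
    ]


# Toy protein fragment, not a real query.
print(query_words("MKVLA", word_size=3))
```

A smaller word size gives more seeds (more sensitive, slower); a larger one gives fewer seeds (faster, less sensitive).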
defining similarity
scored by the alignment score
sum of the scores for matches and mismatches
gap penalties are subtracted at the end
= raw score, which is normalised to give the bit score
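A minimal sketch of scoring an existing gapped alignment, assuming a toy scheme with a flat match and mismatch score and a gap penalty with separate open and extend costs. Real searches take the match and mismatch scores from a substitution matrix; every value below is hypothetical. The result is the raw score, which is then normalised to the bit score.

```python
def raw_alignment_score(seq_a, seq_b, match=2, mismatch=-1,
                        gap_open=5, gap_extend=1):
    """Score two aligned sequences of equal length ('-' marks a gap)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            # First gap position pays gap_open, later ones gap_extend.
            # Simplification: adjacent gaps in either sequence count as one.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# 4 matches, 1 mismatch, 1 gap position: 4*2 - 1 - 5
print(raw_alignment_score("ACGT-A", "ACCTTA"))
```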
e-value
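converts the bit score into a statistical measure: the number of equally good hits expected by chance
0.01 means about 1 such chance hit expected per 100 searches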
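A minimal sketch of how a raw score becomes a bit score and then an e-value, using the standard Karlin-Altschul relations S' = (lambda * S - ln K) / ln 2 and E = m * n * 2^(-S'), where m is the query length and n the database length. The values of lambda and K depend on the scoring system; the ones below are placeholders for illustration, not real parameters.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw alignment score using the parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2 ** (-bits)


# Placeholder parameters (assumptions, not taken from any real matrix).
lam, k = 0.3, 0.1
bits = bit_score(100, lam, k)
print(bits, e_value(bits, query_length=250, database_length=10_000_000))
```

Because the search space m * n is part of the formula, the same bit score gives a larger e-value in a larger database.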
PSI-BLAST
to find very distantly related homologs
first does a normal blast
then iteratively searches
hits can enter and drop out between iterations
creates an evolutionary conservation profile
in form of a position specific scoring matrix
PSSM
position specific scoring matrix
made after every iteration in PSI-BLAST
used as the scoring function (instead of the substitution matrix)
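A minimal sketch of building a position specific scoring matrix from a set of aligned hits, assuming a uniform background over the 20 amino acids and a flat pseudocount. Real tools use sequence weighting and more careful pseudocounts; the function names and the toy alignment are made up for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_sequences, pseudocount=1.0):
    """Return one {amino_acid: log-odds score} dict per alignment column."""
    background = 1 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for pos in range(len(aligned_sequences[0])):
        # Collect the residues in this column, skipping gaps.
        column = [s[pos] for s in aligned_sequences if s[pos] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, sequence):
    """Score an ungapped sequence position by position against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


toy_alignment = ["MKV", "MRV", "MKI"]
toy_pssm = build_pssm(toy_alignment)
print(score_with_pssm(toy_pssm, "MKV"), score_with_pssm(toy_pssm, "AAA"))
```

Unlike a substitution matrix, the score for a residue here depends on the position: a conserved column rewards its usual residue strongly and penalises everything else.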
master-slave alignment
used in both BLAST and PSI-BLAST
not the same as multiple sequence alignment
when does iteration stop
technically when no new sequences within the e-value threshold are found
in practice often capped at five iterations to avoid spurious findings
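A minimal sketch of the stopping rule, assuming a hypothetical `search` callable that takes the currently included sequences and returns a dict of hit id to e-value. The threshold, the cap and the toy search are illustrative stand-ins, not a real PSI-BLAST interface.

```python
def run_iterations(search, e_threshold=0.005, max_iterations=5):
    """Iterate until no new hits pass the threshold or the cap is reached.

    Returns (included_ids, number_of_iterations_run).
    """
    included = frozenset()
    for iteration in range(1, max_iterations + 1):
        hits = search(included)
        passing = frozenset(
            seq_id for seq_id, e in hits.items() if e <= e_threshold
        )
        if not (passing - included):
            # Converged: nothing new came in within the threshold.
            return passing, iteration
        # Hits that no longer pass are dropped; new ones are added.
        included = passing
    return included, max_iterations


def toy_search(included):
    """Hypothetical search: a richer profile finds more distant hits."""
    hits = {"seqA": 1e-30, "seqB": 1e-10}
    if "seqB" in included:
        hits["seqC"] = 1e-4
    if "seqC" in included:
        hits["seqD"] = 0.5  # above the threshold, never included
    return hits


print(run_iterations(toy_search))
```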
number of genes
20,000
only about 1.5% of the genome is protein-coding
number of proteins
20,000
number of amino acids
20
cells in the body
37 trillion
number of base pairs
3 billion
number of chromosomes
23 pairs
typical patient cohort size
20 to 500