Intro Flashcards
Data
Sequence Data:
Text strings using a limited alphabet: DNA/RNA (4 letters), amino acids (20 letters).
Can look at genetic variations
Can measure gene expression (RNA sequencing)
Measurement Data:
Many different variables
All genes, many proteins, many metabolites
DNA, genes, proteins, metabolites
DNA (genome: genomics): double helix packaged into chromosomes; 46 chromosomes in 23 pairs, one of each pair from each parent (in humans !! plants/animals do not all have the same number). Each cell contains 2 copies of each chromosome, except eggs and sperm, which contain only 1.
- Transcription -> RNA
RNA (transcriptome: transcriptomics)
- Translation -> proteins (the amino acid chain folds into a specific structure for the protein)
Proteins: proteomics
- Catalyze metabolic reactions -> metabolites
Metabolites: metabolomics
Sequence data: genomics and transcriptomics
Numerical data: transcriptomics, proteomics and metabolomics
Eukaryotic / Prokaryotic
Eukaryotic: nucleus, chromosomes
Prokaryotic: no nucleus, simple
Proteins
Cell signaling, catalyze metabolic reactions, transport metabolites, antibodies,…
-> they bind to specific sites and can cause chemical changes. Their shape determines which binding site they fit
Metabolites
Chemical compounds in the body
endogenous: produced by the organism
exogenous: from outside the organism
Genome
Set of all DNA contained in a cell
Double-stranded DNA: the strands are complementary, so either strand can be reconstructed from the other
Eukaryote: linear DNA
Prokaryote: circular DNA
Viral: variable
Mitochondrial DNA is only inherited from the mother
DNA
4 bases/nucleotides
Complementary strands
Adenine, Guanine, Thymine, Cytosine
A-T and G-C
A-T is weaker because it has only 2 hydrogen bonds, compared with the 3 bonds of G-C
The coding strand is the DNA strand whose sequence corresponds to the mRNA (with T in place of U), which is translated into amino acids
The non-coding (template) strand is the one transcribed to get the mRNA
Transcription: each template base is paired with its complementary base, with U used in place of T
Transcription in nucleus, translation outside nucleus
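A minimal sketch of the coding/template strand relationship described above. The function names are made up for illustration, and the sketch pairs bases position by position, ignoring strand direction (5'/3'), which these notes do not cover.

```python
# Base pairing within DNA (A-T, G-C).
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Base pairing from a DNA template to RNA: U is used instead of T.
RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def template_strand(coding):
    """Position-wise complement of the coding strand (direction ignored)."""
    return "".join(DNA_COMPLEMENT[base] for base in coding)


def transcribe(template):
    """Build the mRNA by pairing each template base with its RNA complement."""
    return "".join(RNA_PAIRING[base] for base in template)


coding = "ATGGCC"
template = template_strand(coding)
mrna = transcribe(template)
print(template)  # TACCGG
print(mrna)      # AUGGCC
```

The mRNA ends up identical to the coding strand with every T replaced by U, which is why the coding strand is the one that "corresponds" to the mRNA.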
Sequence Statistics
Standard format: FASTA
DNA made of A,C,G,T
Element s_i is the nucleotide at position i
s(3:6): nucleotides 3 to 6
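A small sketch of reading a FASTA record and taking a subsequence. It assumes the usual FASTA layout (a header line starting with ">" followed by sequence lines) and that s(3:6) is 1-based and inclusive; `read_fasta`, `subseq` and the toy record are illustrative names, not part of any library.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line    # sequence may span several lines
    return records


def subseq(s, i, j):
    """Nucleotides i..j, 1-based and inclusive, mirroring the s(i:j) notation."""
    return s[i - 1:j]


example = """>toy_sequence
ACGTTG
CA
"""
s = read_fasta(example)["toy_sequence"]
print(s)                # ACGTTGCA
print(subseq(s, 3, 6))  # GTTG
```

Python slices are 0-based and exclude the end index, hence the `i - 1` in `subseq`.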
Multinomial Sequence model
Nucleotides are independent
Sequence generated randomly from a probability distribution
P = (PA,PC,PG,PT)
Probability of a sequence: multiply the probabilities of its nucleotides together; position doesn't change anything!
The model doesn't fit real sequences: density plots show the frequencies changing by region (no fixed probabilities), so the nucleotides are not independent draws from one distribution! Otherwise the plots would be flat.
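The multinomial sequence probability can be sketched as below. The probability values in `p` are invented for the example, and `multinomial_prob` is a hypothetical helper name.

```python
import math


def multinomial_prob(seq, p):
    """Probability of seq when every position is drawn independently from p."""
    return math.prod(p[base] for base in seq)


# Made-up nucleotide probabilities (they must sum to 1).
p = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

print(multinomial_prob("ACGT", p))
print(multinomial_prob("TGCA", p))
```

Both calls multiply the same four factors (0.3, 0.2, 0.2, 0.3), so reordering the nucleotides leaves the probability unchanged up to floating-point rounding. That is exactly the "position doesn't change anything" property.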
Nucleotide / A-T, C-G density plot
Nucleotide: shows the fraction of each nucleotide at each point
A-T, C-G: shows the fraction of each pair; can discuss GC content (methylation)
Window size makes it more or less smooth/noisy
Makes it possible to check whether the distribution changes across the sequence
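A sketch of the sliding-window calculation behind a C-G density plot. The function name, the `step` parameter and the toy sequence are assumptions for illustration; plotting is left out.

```python
def gc_windows(seq, window, step=1):
    """Fraction of G and C in each window of the given size along seq."""
    fractions = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions.append((chunk.count("G") + chunk.count("C")) / window)
    return fractions


seq = "ATATATGCGCGC"
print(gc_windows(seq, window=4, step=4))  # [0.0, 0.5, 1.0]
```

A larger `window` averages over more positions and gives a smoother curve; a smaller one follows local changes more closely but is noisier.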
Markov Sequence Models
Different states our sequence can be in
Probability of changing state; each state has a probability for each nucleotide occurring next
Transition matrix: how likely to move to another state from the current state
T= from A/C/G/T (current states: rows) to A/C/G/T (next state: columns)
Probability of a sequence => follow the sequence through the transition matrix and multiply all the probabilities together
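A minimal sketch of the sequence probability under a first-order Markov model where the states are the nucleotides themselves. The transition values are invented, and the uniform `initial` distribution for the first nucleotide is an assumption that the notes do not specify.

```python
def markov_sequence_prob(seq, initial, transition):
    """P(seq) = P(first base) * product of transition[current][next]."""
    if not seq:
        return 1.0
    prob = initial[seq[0]]
    for current, nxt in zip(seq, seq[1:]):
        prob *= transition[current][nxt]
    return prob


# Assumed starting distribution: every nucleotide equally likely.
initial = {base: 0.25 for base in "ACGT"}

# Made-up transition matrix: outer key = current state (row),
# inner key = next state (column). Each row sums to 1.
transition = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.1, "C": 0.4, "G": 0.1, "T": 0.4},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "T": {"A": 0.3, "C": 0.3, "G": 0.2, "T": 0.2},
}

# 0.25 (start in A) * 0.2 (A -> C) * 0.4 (C -> T)
print(markov_sequence_prob("ACT", initial, transition))
```

Unlike the multinomial model, the order of the nucleotides matters here, because each factor depends on the preceding base.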
K-mer frequency
Dimer: nucleotide word of length 2
Trimer: nucleotide word of length 3
K-mer: nucleotide word of length k
Can study whether some k-mers are more common than others
Frequency matrix (for dimers: rows are the first nucleotide, columns are the second nucleotide, and each entry is the frequency of that dimer)
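A sketch of k-mer frequency counting with overlapping windows. `kmer_frequencies` is an illustrative name, and the result is a dict rather than a matrix to keep the example short.

```python
from collections import Counter


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer that occurs in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


freqs = kmer_frequencies("ACGACG", 2)
print(freqs)  # {'AC': 0.4, 'CG': 0.4, 'GA': 0.2}
```

For dimers, the entry `freqs["AC"]` corresponds to row A, column C of the frequency matrix; dimers that never occur are simply absent from the dict.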
Odds ratio
Compare observed and expected frequency
For a dimer xy:
Odds ratio = Observed P(xy) / [P(x) * P(y)], where P(x) * P(y) is the frequency expected if x and y were independent
Need the frequency matrix of dimers and the frequency of each nucleotide (summing a nucleotide's row in the dimer frequency matrix gives the frequency of that individual nucleotide)
<1: less than expected
>1: more than expected
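The odds ratio can be sketched as follows. This version counts single nucleotides directly from the sequence rather than summing rows of the dimer matrix (the two differ slightly at the sequence ends), and it assumes both bases of the dimer occur in the sequence. `dimer_odds_ratio` is a hypothetical helper name.

```python
from collections import Counter


def dimer_odds_ratio(seq, dimer):
    """Observed dimer frequency divided by the product of its base frequencies."""
    n = len(seq)
    base_counts = Counter(seq)
    dimer_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    observed = dimer_counts[dimer] / (n - 1)
    # Expected frequency if the two positions were independent.
    # Assumes both bases occur in seq, otherwise this is zero.
    expected = (base_counts[dimer[0]] / n) * (base_counts[dimer[1]] / n)
    return observed / expected


seq = "ACACACAC"
print(dimer_odds_ratio(seq, "AC"))  # greater than 1: over-represented
print(dimer_odds_ratio(seq, "AA"))  # 0.0: never observed
```

In the toy sequence, A is always followed by C, so "AC" appears far more often than independence would predict, while "AA" never appears at all.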
Nucleotide alphabet
ACGT
N: can be any base
R/Y/M: ambiguity codes for a subset of bases: R = A or G (purine), Y = C or T (pyrimidine), M = A or C
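A short sketch of how ambiguity codes can be used to match a pattern against a concrete sequence. Only the codes listed in these notes are included (the full IUPAC alphabet has more), and `matches` is an illustrative helper.

```python
# Each code maps to the set of concrete bases it may stand for.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "M": "AC",
    "N": "ACGT",  # any base
}


def matches(pattern, seq):
    """True if every base in seq is allowed by the code at the same position."""
    return len(pattern) == len(seq) and all(
        base in AMBIGUITY[code] for code, base in zip(pattern, seq)
    )


print(matches("RYN", "ACG"))  # True
print(matches("RYN", "CCG"))  # False: C is not a purine
```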
Sequence Alignment
Predict function: align a sequence of unknown function to a sequence with known function
Sequence divergence: mutations
Gene finding: compare genomes of different species to locate genes. Most genes are conserved with a high similarity, rest is mutations.