Introduction Flashcards
Phylogenetics
Evolutionary process over millions of years
Population Genetics
Evolution within a species, focusing on genetic variation among individuals in a population
Evolution will be treated as a mathematical process (in this class)
mathematical process
Biological evolutionary processes operate on several scales
cellular (in the body) / somatic
Tens/hundreds of thousands of years
Millions of years
Addressing most real-world problems in data science requires using a mix of toolkits
Tools like MATLAB and Python
Central Dogma in Biology
DNA -> RNA -> Protein
These are the major classes of polymers
Proteins are not used to go back to creating DNA or RNA.
RNA is reverse-transcribed to DNA in special cases (e.g., retroviruses)
Genetic sequence is an ideal natural representation
Linking biology and data
DNA is represented by the “alphabet”
A, C, G, T
RNA is represented by the “alphabet”
A, C, G, U
Proteins have an alphabet of 20 letters (sometimes more) representing Amino Acids that make up proteins
Proteins
Transcription
DNA to RNA (1 to 1 mapping)
Much of the DNA in many genomes (e.g., 75% in humans and 15% in bacteria)
is not part of “genes”. These regions are not transcribed.
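The one-to-one DNA-to-RNA mapping can be sketched in a few lines. This is a minimal illustration, not part of the original notes: it assumes the input is the coding (sense) strand, so the RNA has the same letters with T replaced by U. The function name `transcribe` is hypothetical.

```python
DNA_ALPHABET = set("ACGT")


def transcribe(dna: str) -> str:
    """Return the RNA spelled like the coding strand, with T replaced by U.

    Assumption: `dna` is the coding (sense) strand. In the cell, RNA
    polymerase reads the complementary template strand instead.
    """
    dna = dna.upper()
    unknown = set(dna) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"not DNA letters: {sorted(unknown)}")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```

Every DNA letter maps to exactly one RNA letter, which is why the sequence length is unchanged.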
Translation
Proteins are synthesized using the information in mRNA. Proteins are the building blocks of living cells
Different cells “express” the RNA/protein of each gene
at various levels and in multiple forms (splicing)
How does the 4-letter DNA/RNA alphabet code for 20-letter (amino acid) proteins?
Three DNA/RNA letters in a row, called a codon, code for one amino acid
There are 4^3 = 64 codons but only 20 amino acids, so the code is redundant: several codons map to the same amino acid
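The codon arithmetic can be checked with a short sketch. It builds the standard genetic code from its usual compact listing (bases in U, C, A, G order, `*` marking a stop codon) and translates an mRNA string. The names `CODON_TABLE` and `translate` are illustrative, and the reading frame is assumed to start at the first letter.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, codons ordered UUU, UUC, UUA, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon: end of the protein
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE))                     # 64
print(len(set(AMINO_ACIDS) - {"*"}))        # 20
print(translate("AUGGCCUAA"))               # MA
```

The 64 codons cover 20 amino acids plus the stop signal, so most amino acids have more than one codon.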
Less than 2% of the genome is protein-coding
Exons: protein coding regions
Stop codons: gene boundaries
Introns: regions between exons. Introns are transcribed but not translated
DNA Replication / Mutation
Parts of the molecule occasionally change in the new copy. These events are called mutations
DNA Replication / Substitution
Most common form of mutation where one letter changes to another letter
Transition: substitution between A and G or between C and T
Transversion: other cases
Transitions are more likely than transversions. This has to do with the chemistry of DNA, plus properties of the redundant genetic code
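The transition/transversion rule can be written as a small classifier. This is a sketch under the assumption that two sequences are already aligned and of equal length. The purine (A, G) and pyrimidine (C, T) grouping is standard chemistry rather than something stated in these notes, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old: str, new: str) -> str:
    """Label a single-letter change as a transition or a transversion."""
    if old == new:
        raise ValueError("not a substitution: letters are identical")
    pair = {old, new}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"      # A<->G or C<->T
    return "transversion"        # every other change


def count_substitutions(seq_a: str, seq_b: str) -> dict:
    """Count both kinds of substitution between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a, seq_b):
        if a != b:
            counts[classify_substitution(a, b)] += 1
    return counts


print(count_substitutions("ACGT", "GCTA"))
# {'transition': 1, 'transversion': 2}
```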
Indels
insertions or deletions (common)
Complex types like inversion, gene duplication, gene transfer, segmental duplication, rearrangement, etc.
Complex types of changes
Sequence Evolution
Just as various organisms look similar, they also have similar genetic material.
For example, chimps and humans are 95% similar
Each mutation starts with one individual from a “species”
Through time, mutations may survive to future generations and may eventually become “fixed”, so that all/most individuals in that population carry the mutation
One reason for fixation is natural selection. Another mechanism is genetic drift: random chance leading to fixation of changes. Others include sexual selection
Time plus many mutations can eventually generate a new organism
Sequence evolution
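Fixation by random chance alone (genetic drift) can be illustrated with a small simulation. The sketch below uses a Wright-Fisher-style resampling scheme, which is a standard textbook model swapped in here and not one named in these notes. The population size, starting copy number, and function name are made-up illustration values, and no selection is modeled.

```python
import random


def simulate_drift(pop_size: int, start_copies: int, rng: random.Random):
    """Follow one neutral mutation until it is fixed or lost.

    Each generation, every individual in the new population inherits the
    mutation with probability equal to its current frequency.
    Returns (fixed, generations).
    """
    copies = start_copies
    generations = 0
    while 0 < copies < pop_size:
        freq = copies / pop_size
        copies = sum(1 for _ in range(pop_size) if rng.random() < freq)
        generations += 1
    return copies == pop_size, generations


rng = random.Random(0)  # fixed seed so the run is repeatable
runs = [simulate_drift(20, 1, rng) for _ in range(500)]
fixed_fraction = sum(fixed for fixed, _ in runs) / len(runs)
print(f"fraction of new mutations that reached fixation: {fixed_fraction:.3f}")
```

Most new mutations are lost within a few generations, and only a small fraction spread to the whole population. For a neutral mutation in this kind of model, theory puts that fraction near 1/pop_size.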
Evolutionary Trees show relationships through evolutionary time
Phylogeny
Tree topology (branching structure of the tree)
Nodes can represent species, viruses, different genes in the genome of one or several species, or even languages
Internal nodes typically correspond to extinct species/genes/etc. Leaves correspond to extant species/etc.
Each edge indicates that the parent node evolved into the child node. Leaves below an internal node are its evolutionary descendants
Branch Length
shows some notion of time or amount of change between nodes
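Topology and branch lengths can be held in a simple recursive data structure. The sketch below is one possible representation, not a standard library API. The `Node` class, the taxon names A, B, C, and the branch lengths are all hypothetical illustration values.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    branch_length: float = 0.0            # length of the edge from the parent
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves_below(node: Node) -> List[str]:
    """Names of all leaves (extant taxa) descending from `node`."""
    if node.is_leaf():
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves_below(child))
    return names


def root_to_leaf_lengths(node: Node, so_far: float = 0.0) -> Dict[str, float]:
    """Total branch length from the root down to each leaf."""
    total = so_far + node.branch_length
    if node.is_leaf():
        return {node.name: total}
    lengths = {}
    for child in node.children:
        lengths.update(root_to_leaf_lengths(child, total))
    return lengths


# Made-up tree: ((A:1, B:1):2, C:3)
ancestor_ab = Node("ancestor_AB", 2.0, [Node("A", 1.0), Node("B", 1.0)])
root = Node("root", 0.0, [ancestor_ab, Node("C", 3.0)])

print(leaves_below(ancestor_ab))      # ['A', 'B']
print(root_to_leaf_lengths(root))     # {'A': 3.0, 'B': 3.0, 'C': 3.0}
```

The list of children encodes the topology, while `branch_length` carries the notion of time or amount of change along each edge.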
How do we study biological data?
Define clean optimization problems. Ex. Align sequences by optimizing similarity between matched positions
Build mathematical models. A generative model can create sequence data “just like” what we see in reality. These are typically statistical models
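Alignment as an optimization problem can be sketched with a classic dynamic-programming approach (Needleman-Wunsch global alignment). That algorithm is not named in these notes and is used here only as an example of a "clean" formulation. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = global_align("ACGT", "AGT")
print(best)    # 2
print(top)     # ACGT
print(bottom)  # A-GT
```

The objective being maximized is the total similarity over matched positions, with gaps penalized. Changing the scoring values changes which alignment counts as optimal.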
How do we build models?
Seek to capture the mechanisms behind the actual processes that generated the data
Or set the mechanism aside and build descriptive models that seek to emulate patterns seen in the data, regardless of mechanism.
Where do we get the data and their patterns? This requires a reference data set to train from, and is associated with machine learning
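A mechanistic generative model can be as small as a simulator of the substitution process. The sketch below is a toy under loud assumptions. Every site mutates independently with the same per-generation probability, and a mutated site changes to any other base with equal probability. The parameter values and the name `evolve` are hypothetical, and indels are ignored.

```python
import random

BASES = "ACGT"


def evolve(seq: str, mutation_prob: float, generations: int, rng: random.Random) -> str:
    """Generate a descendant sequence by applying random substitutions.

    Assumption: sites are independent and all substitutions are equally likely.
    """
    sites = list(seq)
    for _ in range(generations):
        for i, base in enumerate(sites):
            if rng.random() < mutation_prob:
                sites[i] = rng.choice([b for b in BASES if b != base])
    return "".join(sites)


rng = random.Random(42)  # fixed seed so the run is repeatable
ancestor = "ACGTACGTACGTACGTACGT"
descendant = evolve(ancestor, mutation_prob=0.01, generations=50, rng=rng)
differences = sum(1 for x, y in zip(ancestor, descendant) if x != y)
print(descendant, differences)
```

A descriptive model would instead be fitted to a reference data set without claiming that this per-site process is how the sequences actually arose.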
Models are always wrong, but some are useful (George Box)
Whether mechanistic or not, models are simplifying representations of reality
See Geman and Geman 2016 SCIENCE, for debate between camps