Genome Diversity in Space - Before Midterm Flashcards

1
Q

Comparative Genomics

A
  • a field of biological research in which the genomes of different organisms are compared in order to understand phenotypic differences between them and infer evolutionary relationships and processes
  • it starts with the phenotype, and then tries to find the genetic source of those phenotypic differences
2
Q

Key features of the human genome

A
  • The human genome is only about 1.5% genes
  • We have about 20,000 protein-coding genes and 22,000 genes for non-coding RNA, such as rRNA, tRNA, and short non-coding RNA
    • An important type of short, non-coding RNA is micro-RNA (miRNA)
  • 24% of the genome is introns and other non-coding DNA
  • 12% of the genome is pseudogenes: genes that used to code for something but no longer do because they are “broken”; they’re just relics
  • 43% of the genome is interspersed repeats
  • 20% of the genome is short tandem repeats and other things
  • Collectively, the interspersed repeats and tandem repeats are considered “junk” DNA
3
Q

miRNA

A
  • microRNA found in the human genome (and probably in most other eukaryotes)
  • a type of short, non-coding RNA (not an mRNA)
  • miRNA silences genes by guiding RISC (the RNA-induced silencing complex, which contains the protein Argonaute) to mRNAs that are complementary to the miRNA; Argonaute then cuts up that mRNA
4
Q

Interspersed repeats

A
  • One of the two types of repeats in the human genome
  • Make up 43% of the human genome
  • They are spread out throughout the chromosome
  • Are mobile genetic elements
  • Are copied at random throughout the genome
  • The types of interspersed repeats are LINEs, SINEs, LTRs, and DNA transposons
5
Q

Tandem repeats

A
  • One of the two types of repeats in the human genome
  • Are tandem; meaning right after one another in the genome
  • Typically are derived from copying errors
  • They have different names depending on their size
    • satellite: 5-200 bp repeats
    • mini-satellite: ≤ 25 bp repeats
    • micro-satellite: ≤ 13 bp repeats
    • dinucleotide: e.g. “AT” repeats
    • trinucleotide: found in Huntington’s disease
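A minimal sketch of spotting tandem repeats in a sequence, assuming we already know which motif to look for. The function name `longest_tandem_run` and the toy sequence are made up for illustration; the "CAG" motif is used because that is the trinucleotide that expands in Huntington's disease (not stated on the card itself).

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    # (?:motif)+ matches one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Made-up sequence: a CAG trinucleotide run followed by an AT dinucleotide run.
toy = "GGCAGCAGCAGCAGTTATATAT"
print(longest_tandem_run(toy, "CAG"))
print(longest_tandem_run(toy, "AT"))
```

Real repeat finders also have to discover the motif and tolerate imperfect copies; this only counts exact copies of a known motif.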
6
Q

LINEs

A
  • Long interspersed nuclear elements
  • A type of interspersed repeat (as the name suggests)
  • LINEs, like LTRs, code for a protein that helps them jump around the genome
  • This protein they code for is a “copy and paste” enzyme, meaning the enzyme copies its own gene that made it and inserts it somewhere else in the genome
  • They have recognition elements on either side to help the enzyme they code for recognize themselves
7
Q

LTRs

A
  • Long terminal repeats
  • A type of interspersed repeat
  • Like LINEs, they code for a protein that helps them jump around the genome
  • This protein they code for is a “copy and paste” enzyme, meaning the enzyme copies its own gene that made it and inserts it somewhere else in the genome
  • They have recognition elements on either side to help the enzyme they code for recognize themselves
8
Q

SINEs

A
  • Short interspersed nuclear elements
  • A type of interspersed repeat
  • They don’t code for their own protein to help them move about the genome, but instead use the copy and paste enzyme of LINEs to move about
9
Q

DNA Transposons

A
  • A type of interspersed repeat
  • Code for a “cut and paste” enzyme which cuts the gene out of the DNA and inserts it in a different location in the genome, so no copies are made, but the gene is moved
10
Q

Eukaryotic DNA Structure Overview

A
  • DNA takes the form of linear DNA during replication, but it is normally in chromosomal structures when not replicating
  • To form a chromosome, linear DNA wraps around histones, which then coil around themselves to be more dense, and then coil even more to become denser and denser, forming chromosomes
11
Q

Prokaryotic DNA Structure Overview

A
  • Prokaryotes have their DNA in a circular, double stranded structure
  • This circular structure condenses into a super-coil, and then condenses even more around a protein core to make a weird structure (see Notes on 10/14 for picture)
  • These structures are called nucleoids (I think; or maybe the structures come together to make nucleoids?)
  • This circular chromosome structure is a prokaryote’s primary genome, but many prokaryotes also have plasmids on top of this
    • Plasmids are useful because they allow for horizontal gene transfer
  • Since prokaryotes have their main circular genomes and also plasmids, there are different names for referring to the different “genomes”
    • core genome: the set of essential genes ALL members of a particular prokaryotic species have
    • accessory genome: the set of ‘extra’ genes that individual members of a particular prokaryotic species may or may not have in their genome
    • pan genome: the core genome AND the ENTIRE accessory genome for a species
  • NOTE: the core genome isn’t always just the big chromosome; it can also contain plasmids that ALL individuals in that species have, and then the accessory genome will be extra plasmids some individuals in that species may have that others don’t
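The core/accessory/pan genome definitions map neatly onto set operations. Here is a small sketch, assuming each strain is represented as a set of gene names; the function name `summarize_genomes`, the strain names, and the gene names are all hypothetical.

```python
def summarize_genomes(strains: dict[str, set[str]]) -> dict[str, set[str]]:
    """Split the genes of several strains of one species into core, accessory and pan genome."""
    gene_sets = list(strains.values())
    core = set.intersection(*gene_sets)   # genes every strain has
    pan = set.union(*gene_sets)           # every gene seen in any strain
    accessory = pan - core                # genes only some strains have
    return {"core": core, "accessory": accessory, "pan": pan}


# Hypothetical strains and genes, just to show the bookkeeping.
strains = {
    "strain_1": {"dnaA", "rpoB", "gyrA", "ampR"},
    "strain_2": {"dnaA", "rpoB", "gyrA"},
    "strain_3": {"dnaA", "rpoB", "gyrA", "tetA"},
}
summary = summarize_genomes(strains)
print(sorted(summary["core"]))
print(sorted(summary["accessory"]))
```

Note that nothing here depends on whether a gene sits on the chromosome or on a plasmid, which matches the NOTE above: a plasmid gene that every strain carries would land in the core set.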
12
Q

Horizontal Gene Transfer

A
  • Passing DNA to contemporaries as opposed to offspring
  • Plasmids allow bacteria to do horizontal gene transfer
13
Q

Key features of Prokaryotic Genome

A
  • Over 90% of the genome is coding
  • Prokaryotes hardly ever have introns
  • Their genes sometimes overlap, because they sometimes use both strands in the same region
  • Genes with related functions are often organized into operons
14
Q

Diversity between Domains

A
  • Gene diversity: high in both bacteria and archaea, and low in eukaryotes
  • Introns: not present in bacteria or archaea; present in eukaryotes
  • Repeats: only about 1% in bacteria and archaea; around 60% in eukaryotes
  • Structure: Bacteria and archaea both have a chromosomal structure and plasmids, with the chromosomal structure making up a nucleoid; eukaryotes have chromosomes that exist in the nucleus
  • Organization: bacteria and archaea have operons, eukaryotes do not
  • Comparative genomics data actually suggest that eukaryotes and archaea are more similar to each other than either is to bacteria
15
Q

Similarities between Eukaryotes and Archaea

A
  • Their ribosomal proteins are more similar in sequence to each other’s than to those of bacteria
  • Their RNA polymerases are more similar in sequence to each other’s than to those of bacteria
  • Their DNA replication enzymes are more similar in sequence to each other’s than to those of bacteria
16
Q

Homologs

A
  • Genes descended from a common ancestor
  • Looking at homologs is a way to qualitatively compare two genes
  • There are two types of homologs:
    • orthologs: homologs in different species that evolved from a common ancestor via speciation
    • paralogs: homologs related by gene duplication within a species
  • When deciding if a pair of genes are orthologs or paralogs, ask if the event that resulted in these different versions of the gene was duplication or speciation
    • Orthologs are genes separated by speciation
    • Paralogs are genes separated by duplication
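  • A common shortcut is that homologs are orthologs if they are in different species and paralogs if they are in the same species, but this isn’t always right: what matters is the event that separated them, so two genes in different species are still paralogs if they trace back to a duplication that happened before the species split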
17
Q

Analogs

A
  • Genes that appear similar, but are the result of parallel (convergent) evolution in different lineages
  • An example is giant pandas and red pandas
    • They both show pseudogenization (inactivation) of TAS1R1, the umami taste receptor gene
    • This was achieved in different ways, however
18
Q

Phylogenetics

A
  • A way to quantitatively compare two genes
  • Involves looking at the similarities between genes, proteins, etc and then mapping species out based on the similarities
  • Molecular phylogenies are based on sequences of DNA or the proteins they encode
  • Morphological phylogenies are based on physical characteristics
  • We can make phylogenies quantitative by taking a gene that has a homolog in all the species we are looking at, and comparing the sequences in an objective way
    • We can look at identity, which is the percentage of bases or amino acids that are identical (the opposite is “distance”; distance = number of differences)
      • When looking at percent identity, we use Hamming distance, which is the number of positions where the bases are different
    • We can look at similarity, which is the percentage of amino acids that are similar (the opposite is “dissimilarity”)
19
Q

Hamming Distance and Distance Matrix

A
  • the number of positions where the bases are different when comparing gene homologs between species
  • is used when making phylogenies
  • The first step in measuring Hamming distance is to line up (align) all the sequences for the different organisms, then measure the Hamming distance between pairs
  • A good way to keep track of Hamming distances is to make a distance matrix, which contains all pairwise distances
  • Organisms with shorter distances tend to have shared a common ancestor not too long ago, whereas those with longer distances probably shared a common ancestor further back in time, since more time could explain why there are more differences (mutations)
  • Once you have a distance matrix, it is usually helpful to turn it into a tree via hierarchical clustering
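A short sketch of the Hamming distance, percent identity, and pairwise distance matrix described above, assuming the sequences are already aligned and the same length. The function names and the three toy sequences (labelled human, chimp, mouse) are made up for illustration, not real gene data.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of aligned positions where the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions that are identical."""
    return 100.0 * (len(a) - hamming(a, b)) / len(a)


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], int]:
    """All pairwise Hamming distances, keyed by (name_1, name_2)."""
    return {(p, q): hamming(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}


# Toy aligned sequences (invented).
seqs = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACGGAT",
}
for pair, d in distance_matrix(seqs).items():
    print(pair, d)
```

With these toy sequences the human/chimp pair has the smallest distance, which is the pair a clustering step would join first.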
20
Q

Hierarchical clustering

A
  • A method for converting a distance matrix into a phylogenetic tree
  • Hierarchical clustering joins neighbors in sequence space
  • First, find the pair with the smallest distance, then convert that pair into a cluster
  • Recalculate the Hamming distances, averaging the distances from each member of the cluster to each of the other organisms
  • Find the clusters/organisms with the smallest Hamming distance, and repeat (this may result in continuing to add to the original cluster, or it may involve creating new clusters)
  • Continue until you are down to only two clusters or organisms left
  • To make the phylogenetic tree, make branches based on the step-wise clustering, where the ends of each branch are the two organisms/clusters joined in that step
  • See notes from 10/16
  • Do some hierarchical clustering problems
  • Hierarchical clustering is just the first step in quantitatively comparing organisms; you then want to minimize mutation steps by testing out different trees
  • There are also corrections that can be made to adjust the distances to biological reality
    • A DNA substitution correction can be made to adjust distance for variation in mutation rate
    • A protein substitution correction can be made to adjust distance for similarity in proteins
  • The gene(s) you choose for your hierarchical clustering matter, because different genes evolve at different rates depending on how strongly that gene is under selection
  • Faster evolving genes will give lower percent identities and larger Hamming distances, while slower evolving genes will give higher percent identities and smaller Hamming distances
  • When scientists make these phylogenies, they typically select a number of genes and do alignments on all of them, making a super-gene alignment
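The clustering steps on this card can be sketched as below. This is a minimal version that assumes the "averaging" step means the average over all member-to-member distances (average linkage); the function name `average_linkage_tree`, the organism labels A-D, and the distances are invented. It also performs the final join of the last two clusters so that a single tree comes out.

```python
from itertools import combinations


def average_linkage_tree(names, dist):
    """Build a nested-tuple tree from pairwise distances.

    names: list of organism names
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    """
    # Each cluster is (tree, leaves): the nested tuple so far and its member names.
    clusters = [(n, (n,)) for n in names]

    def cluster_distance(c1, c2):
        # Average distance over every pair of members, one from each cluster.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters/organisms.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Invented distances between four organisms.
raw = {("A", "B"): 1, ("A", "C"): 4, ("B", "C"): 4,
       ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 3}
dist = {frozenset(k): v for k, v in raw.items()}
print(average_linkage_tree(["A", "B", "C", "D"], dist))
```

The nested tuples record the order of joins: the innermost pairs were clustered first. Branch lengths and the substitution corrections mentioned above are left out.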
21
Q

Synteny

A
  • The conservation of gene order and location between genomes
  • Over evolutionary time, there can be rearrangements/reshuffling
  • Some of the things that can affect synteny are big rearrangements (groups of genes moving together), small rearrangements (jumps, typically only one gene), gene gain due to duplication, gene loss, etc
22
Q

Co-linear blocks

A
  • Continuous regions in the genome without rearrangement
  • Can be used to help visualize synteny
  • Can mark these blocks and track them in Circos plots
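One rough way to find co-linear blocks, assuming each genome is given as an ordered list of shared gene names. The function name `colinear_blocks` and the gene labels are hypothetical, and this sketch ignores inverted blocks and duplicated genes, which real synteny tools handle.

```python
def colinear_blocks(order_a, order_b):
    """Split genome A's gene order into runs that are also adjacent and in the same order in genome B."""
    position_in_b = {gene: i for i, gene in enumerate(order_b)}
    # Only genes present in both genomes can be part of a shared block.
    shared = [g for g in order_a if g in position_in_b]
    blocks = []
    for gene in shared:
        # Extend the current block if this gene directly follows the previous one in B too.
        if blocks and position_in_b[gene] == position_in_b[blocks[-1][-1]] + 1:
            blocks[-1].append(gene)
        else:
            blocks.append([gene])
    return blocks


# Genome B has the same genes as A, but one big chunk has moved to the front.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g4", "g5", "g6", "g1", "g2", "g3"]
print(colinear_blocks(genome_a, genome_b))
```

Each returned block is a stretch with no rearrangement inside it; the number of blocks gives a crude feel for how much reshuffling separates the two genomes.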
23
Q

Circos plots

A
  • Plots used to visualize synteny
  • Show genomes as concentric rings, sometimes with co-linear blocks, and sometimes with genes by themselves
  • They can be drawn in various ways
  • One way is to have one genome on the outer circle and the other on the inner circle, and color the different genes/co-linear blocks to show how similar they are between the two genomes
  • Another way is to have just one ring, with one genome on one half and the other genome on the other half, “mirroring” the first half, with lines drawn connecting homologs to show how they may have moved about in the genome
  • The fewer rearrangements there are between organisms, the more closely related they probably are, since rearrangements take time to accumulate, so many rearrangements indicate a longer distance between the species
24
Q

Phylogenomics

A
  • Deals with whole genome comparisons between species, whereas phylogenetics only deals with single gene comparisons
  • Feature frequency profiles are a strategy used in phylogenomics
25
Q

Feature Frequency Profile

A
  • A phylogenomic approach to compare sequences throughout the genome in spite of differences in structure
  • Feature Frequency profiles involve taking small chunks of bases about 200bp long and counting how many times they appear in both genomes
  • This allows us to look for stretches of similar sequences (not necessarily full genes) found anywhere in the genome
  • We can build phylogenies off of these profiles
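A toy sketch of the feature-counting idea, assuming a "feature" is every overlapping chunk of length `k` in the genome. The function names are made up, and using a plain Euclidean distance between profiles is a simplification chosen here for illustration, not necessarily what published feature frequency profile methods use. The card mentions chunks of about 200 bp; the example uses a much smaller `k` so the toy sequences stay readable.

```python
import math
from collections import Counter


def feature_profile(genome: str, k: int) -> dict[str, float]:
    """Relative frequency of every overlapping length-k chunk in the genome."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return {feature: c / total for feature, c in counts.items()}


def profile_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Euclidean distance between two profiles (an assumed, simple choice of distance)."""
    features = set(p) | set(q)
    return math.sqrt(sum((p.get(f, 0.0) - q.get(f, 0.0)) ** 2 for f in features))


# Invented sequences: b is a rearranged version of a, c is unrelated.
a = "ACGTACGTTTGACC"
b = "TTGACCACGTACGT"
c = "GGGGGCCCCCAAAA"
pa, pb, pc = (feature_profile(s, 3) for s in (a, b, c))
print(profile_distance(pa, pb), profile_distance(pa, pc))
```

Because the profile only counts how often each chunk appears and not where, moving a block of sequence changes the profile very little, which is the "in spite of differences in structure" point above. The pairwise profile distances could then feed a distance matrix.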
26
Q

Standing genetic variation

A
  • Variations among individuals in a species
  • Typically are structural variants or point mutations (SNPs)
27
Q

Structural variations

A
  • Structural variants are mutations
  • A few examples of structural variants are:
    • Rearrangements
    • Duplications
28
Q

SNP

A
  • Single nucleotide polymorphism
  • Also called “point mutation”
  • Occur when a single nucleotide is changed, deleted or inserted
  • When someone says “Humans have a SNP at position X,” what they mean is that at this position, some people may have one base while some may have another
  • A SNP refers to the position of the mutation, not the particular mutation that occurred, like a G to an A
  • There are different types of SNPs depending on what they do and where they are located
  • SNPs in coding regions can be
    • synonymous: preserves amino acid
    • non-synonymous: doesn’t preserve amino acid, and can be broken down further into missense or nonsense
      • missense: a non-synonymous mutation in which the amino acid is changed
      • nonsense: a non-synonymous mutation which creates a stop codon
  • SNPs in non coding regions can potentially affect regulation, or they can have no effect at all
  • SNPs in non-coding regions of the genome and synonymous SNPs are often called “silent” because they usually don’t have an effect, but in some cases they can
  • For example, if a synonymous mutation creates a codon read by a less prevalent tRNA, that can lead to slower translation, or if there is a SNP in the binding region of an activator, that may lead to less protein being made
  • About 22% of our genome “contains” SNPs, but polymorphisms are rare at many of these positions, so the genomes of any two given individuals are about 99.9% identical
  • A SNP can be a common variant, in which the minor allele is seen at a frequency of 1-50%, or a rare variant, in which the minor allele is seen less than 1% of the time (at least I think it’s defined by the minor allele)
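The synonymous / missense / nonsense split for coding SNPs can be checked mechanically with a codon table. This sketch builds the standard genetic code from a compact string; the function name `classify_coding_snp` and its arguments are hypothetical, and it only handles a single base substitution inside one codon.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base substitution at `position` (0, 1 or 2) of `codon`."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"   # same amino acid
    if after == "*":
        return "nonsense"     # new stop codon
    return "missense"         # different amino acid


print(classify_coding_snp("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_coding_snp("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
print(classify_coding_snp("CAG", 0, "T"))  # CAG (Gln) -> TAG (stop)
```

A change that turns a stop codon into an amino acid would be reported as "missense" here; that case isn't covered on the card, so the sketch doesn't give it its own label.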
29
Q

Re-sequencing

A
  • Can also be used to find SNPs, by matching an individual’s genome to a reference genome
30
Q

MAF

A
  • Minor allele frequency
  • The frequency of the second most common allele for a SNP
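A quick sketch of computing MAF from a sample, assuming each diploid individual's genotype at the SNP is written as a two-letter string such as "AG". The function name and the five genotypes are made up.

```python
from collections import Counter


def minor_allele_frequency(genotypes: list[str]) -> float:
    """Frequency of the second most common allele among all allele copies in the sample."""
    # Every individual contributes two allele copies.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    ranked = counts.most_common()
    if len(ranked) < 2:
        return 0.0  # only one allele observed, so the position is not polymorphic here
    return ranked[1][1] / sum(counts.values())


# Five invented individuals: 6 copies of A and 4 copies of G.
print(minor_allele_frequency(["AA", "AG", "GG", "AA", "AG"]))
```

Here G is the minor allele, and 4 of the 10 allele copies are G, so this SNP would count as a common variant under the 1-50% rule of thumb.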

31
Q

Genotype in Relation to SNPs

A
  • The combination of alleles a person has at a SNP
32
Q

Alleles

A
  • particular variants available at a particular SNP locus
33
Q

Human Migration and SNPs

A
  • Diversity was present in the beginning when we all still lived in Africa before migrating elsewhere
  • Random subsampling occurred as parts of the main population moved away, since the subpopulations didn’t take all the diversity with them
  • As early humans were introduced to new environments, random mutations that weren’t necessarily beneficial before arose and became more prevalent as people adapted to their environments
  • Early humans also received SNPs via interbreeding with other hominid groups, like Neanderthals and Denisovans
  • Today, about 85% of variation can be found anywhere in the world, in any person
  • Only about 15% of variation is regional
  • There is actually no evidence for biological race: there can be more differences between people within a single group (“race”) than between individuals in different groups.
34
Q

GWAS

A
  • Genome Wide Association Studies
  • GWAS is used to find which SNP variants (alleles) are associated with a particular disease
  • Takes a control group and a disease group and compares their SNPs
  • One challenge of GWAS is that there are different types of traits, and not every trait is due to just one gene; it could be due to other genes and environmental factors as well
  • The goal of GWAS is to quantify how much a gene affects a phenotype
  • A GWAS study is carried out as follows:
    • Compute the odds ratio for every SNP in the genome to ID the ones with the biggest effect on the disease
    • Perform statistics on every SNP to see if the result is significant or just due to noisy data; the stats you do will give you a p-value
    • The results are then often reported in Manhattan plots, which show every SNP in the genome along with information on whether a variant at that SNP is associated with the trait
  • GWAS can help us:
    • Determine molecular causes of a disease to identify a target for treatments
    • Determine what particular version of a disease a person has to assist personalized medicine
    • Identify susceptibility for a disease to motivate lifestyle change
    • See how sensitive a person is to a drug by looking at their pharmacogenetics
    • Bring us one step closer to solving the phenotype equation
35
Q

Types of Traits

A
  • Mendelian traits: traits caused by a single gene
  • Multifactorial traits: traits caused by a combination of genes and the environment/lifestyle factors
  • Polygenic traits: traits influenced by more than one gene
36
Q

Penetrance

A
  • The likelihood that a phenotype will appear when a particular genotype is present
  • Penetrance ranges from 0 to 1, with 1 being “complete” penetrance, meaning every time you have that genotype you have that phenotype, and less than 1 meaning the gene contributes to the phenotype in varying degrees
37
Q

odds-ratio

A
  • Used in GWAS studies; is a way to quantify penetrance
  • Basically compares two odds: the odds of having versus not having the disease if you carry a particular SNP variant, and the same odds if you carry a different SNP variant
  • The question you’re trying to answer is: how much more likely are you to have this disease if you have this one SNP variant than if you have this other SNP variant
  • To calculate odds-ratio, you first have to calculate the odds
  • For each SNP variant, calculate the odds of getting the disease by dividing the number of individuals with that SNP who have the disease by the number of individuals with that SNP who do not have the disease
  • To calculate the actual odds ratio, divide the odds of getting the disease with the SNP variant you’re interested in over the odds of getting the disease with the other SNP variant
  • If the odds ratio is >1, the SNP variant of interest is a risk factor, meaning a person is more likely to get the disease with that SNP variant
  • If the odds ratio is <1, the SNP variant of interest is protective, meaning individuals with it are less likely to get the disease than if they didn’t have it
  • If the odds ratio is 1, that means the variant doesn’t have any effect on the disease
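The odds-ratio recipe above, worked through with invented counts. The function names and the numbers (100 people per variant) are hypothetical; only the arithmetic follows the card.

```python
def odds(diseased: int, healthy: int) -> float:
    """Odds of disease for one SNP variant: people with the disease divided by people without."""
    return diseased / healthy


def odds_ratio(diseased_with, healthy_with, diseased_other, healthy_other) -> float:
    """Odds for the variant of interest divided by odds for the other variant."""
    return odds(diseased_with, healthy_with) / odds(diseased_other, healthy_other)


def interpret(ratio: float) -> str:
    if ratio > 1:
        return "risk factor"
    if ratio < 1:
        return "protective"
    return "no effect"


# Invented study: variant of interest 60 diseased / 40 healthy, other variant 30 diseased / 70 healthy.
ratio = odds_ratio(60, 40, 30, 70)
print(round(ratio, 2), interpret(ratio))
```

With these counts the odds are 1.5 versus about 0.43, so the odds ratio comes out at about 3.5 and the variant of interest would be read as a risk factor.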
38
Q

Odds Ratio vs Increased risk

A
  • When given an odds ratio, it’s important to also consider the incidence rate to get the full picture
  • incidence rate: general rate of occurrence
  • For example, if the odds ratio for SNP variant 1 is 2.2, but the average lifetime risk of developing the disease is 0.1%, then someone with SNP variant 1 only has about a 0.22% chance of developing the disease; still pretty small
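The example's arithmetic can be checked in a few lines. This sketch assumes the average lifetime risk can stand in as the baseline risk for people without the variant; the function name is made up. Strictly, an odds ratio multiplies odds rather than probabilities, but for a rare disease the two are nearly the same, which is why simply multiplying 0.1% by 2.2 works as a shortcut.

```python
def risk_with_variant(baseline_risk: float, odds_ratio: float) -> float:
    """Convert a baseline risk and an odds ratio into the risk for carriers of the variant."""
    baseline_odds = baseline_risk / (1 - baseline_risk)  # probability -> odds
    variant_odds = odds_ratio * baseline_odds            # apply the odds ratio
    return variant_odds / (1 + variant_odds)             # odds -> probability


risk = risk_with_variant(0.001, 2.2)  # 0.1% baseline risk, odds ratio 2.2
print(f"{risk:.4%}")
```

The result is very close to 0.22%, matching the card: more than double the baseline, but still a small absolute risk.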
39
Q

p-value

A
  • Used in GWAS studies to generate Manhattan plots
  • Indicates significance: it is the probability of seeing an association at least this strong purely by chance if there were no real association; the lower the p-value, the more the data supports significance
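The cards don't say which statistical test is used, so as one common possibility, here is a sketch of a chi-square test on the 2x2 table of variant versus disease status, using only the standard library. The function name and the counts are hypothetical, and no continuity correction is applied.

```python
import math


def chi_square_p_value(a: int, b: int, c: int, d: int) -> float:
    """p-value for a 2x2 table (1 degree of freedom, no continuity correction).

    a, b: diseased / healthy people carrying the variant of interest
    c, d: diseased / healthy people carrying the other variant
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


print(chi_square_p_value(60, 40, 30, 70))  # strong association in invented data
print(chi_square_p_value(50, 50, 50, 50))  # no association at all
```

The second call returns 1.0 because the two variants have identical disease rates, so there is nothing for chance to explain.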
40
Q

Manhattan Plots

A
  • GWAS results are often reported as these
  • They show every SNP in the genome along with information on whether a variant at that SNP is associated with the trait
  • If the y-axis of the Manhattan plot is -log10(p), the higher the dot is along the y-axis, the smaller the p-value, and so the more significant the association
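A tiny sketch of the y-axis transformation, with a made-up function name. The 5e-8 cutoff used below is a widely used genome-wide significance convention and is an addition here, not something from the cards.

```python
import math


def manhattan_height(p_value: float) -> float:
    """Height of a SNP's dot on a Manhattan plot whose y-axis is -log10(p)."""
    return -math.log10(p_value)


GENOME_WIDE_THRESHOLD = 5e-8  # assumed conventional cutoff, not from the notes

for p in (0.05, 1e-4, 1e-8):
    height = manhattan_height(p)
    flag = "above threshold line" if p < GENOME_WIDE_THRESHOLD else "below threshold line"
    print(f"p={p:g}  height={height:.1f}  {flag}")
```

Smaller p-values give taller dots, so the "skyscrapers" in the plot are the SNPs most strongly associated with the trait.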
41
Q

SNP chips

A
  • To save money and resources, GWAS relies on genotyping, not sequencing, meaning we only see parts of the genome and not the full genome
  • Genotyping uses “SNP chips”
    • The genome is first fragmented, then fluorescent labels are attached to the fragments and undigested genomic DNA is removed
    • The DNA fragments are then run over a SNP chip
    • The chip has little regions on it with “probe” DNAs that are complementary to every variant of every SNP the researchers are testing for
    • A DNA fragment can bind to every variant probe for its SNP, since the probes differ by only 1 nucleotide, but it binds most strongly to the perfectly matching variant, so that region shines brighter due to the fluorescence and researchers can tell which variant you have
    • We are diploid, however, meaning there is the possibility of heterozygosity in which an individual can have two different variants of the same SNP. In this case, their regions will probably glow about the same intensity
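A rough sketch of how the brightness logic could turn into a genotype call, assuming each SNP has two probe regions (for variants "A" and "B") and we compare their fluorescence. The function name, the intensity values, and the `het_band` cutoff are all hypothetical; real genotyping software uses more careful, calibrated clustering.

```python
def call_genotype(intensity_a: float, intensity_b: float, het_band: float = 0.2) -> str:
    """Call AA, AB or BB from the fluorescence of the two variant probes.

    het_band is an assumed tolerance: if variant A's share of the signal is
    within het_band of 50%, the two regions glow about equally and we call AB.
    """
    total = intensity_a + intensity_b
    if total == 0:
        return "no call"
    share_a = intensity_a / total
    if abs(share_a - 0.5) <= het_band:
        return "AB"                           # heterozygous: similar brightness
    return "AA" if share_a > 0.5 else "BB"    # homozygous: one region much brighter


print(call_genotype(950, 60))   # A region much brighter
print(call_genotype(500, 480))  # about the same brightness
print(call_genotype(40, 900))   # B region much brighter
```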
42
Q

Haplotype blocks

A
  • A haplotype is a combination of genotypes across a region
  • A haplotype block is a combination of nearby genotypes that tend to get inherited together; DNA tends to get inherited in “chunks”
  • Sequencing companies don’t check for all SNPs, only some of them, so what they do for other SNPs is they “guess” by using haplotype blocks
  • The SNP the chip actually picks up is called the tagSNP; if we know of certain nearby SNPs (inferred SNPs) whose variants tend to get inherited with the particular tagSNP variant the individual has, the company can say the individual is likely to have these inferred SNP variants, even though they weren’t tested for
  • This is basically looking at the haplotype block for that particular tagSNP variant and saying that it’s likely you also have the other variants in that block
  • This process of using tagSNPs to predict inferred SNPs is called “imputation”
  • A problem with imputation and haplotype blocks is that if the tagSNP is associated with inferred SNPs that vary by ancestry, the inferred SNP variants the individual is told they probably have might not be correct; for example, the individual may have haplotype z when the database is based on haplotypes x and y, and because the individual has the haplotype y tagSNP variant, they’re told they have the inferred SNP variants of haplotype y when they really have the haplotype z variants
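A toy sketch of imputation as a lookup, including the failure case described in the last bullet. The reference panel, SNP names, and alleles are entirely invented; real imputation uses statistical models over large reference panels rather than a one-to-one table.

```python
# Hypothetical reference panel: tagSNP allele -> variants at the untested (inferred) SNPs.
REFERENCE_HAPLOTYPES = {
    "A": {"snp2": "C", "snp3": "T"},  # haplotype x
    "G": {"snp2": "T", "snp3": "G"},  # haplotype y
}


def impute(tag_allele: str, panel=REFERENCE_HAPLOTYPES) -> dict[str, str]:
    """Guess the inferred SNP variants from the tagSNP variant alone."""
    return dict(panel.get(tag_allele, {}))


def imputation_errors(guessed: dict[str, str], truth: dict[str, str]) -> list[str]:
    """Names of the SNPs where the guess differs from what the person really carries."""
    return [snp for snp, allele in guessed.items() if truth.get(snp) != allele]


# Someone with haplotype z: same tagSNP allele as haplotype y, but a different snp2 variant.
true_variants_z = {"snp2": "C", "snp3": "G"}
guess = impute("G")
print(guess, imputation_errors(guess, true_variants_z))
```

Because haplotype z is missing from the panel, the person is reported as haplotype y and the snp2 guess is wrong, which is the ancestry problem in miniature.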
43
Q

Are GWAS biased?

A
  • GWAS aren’t inherently biased, but the disproportionate use of tagSNPs from European-ancestry haplotype blocks makes them biased when doing imputation
44
Q

Missing Heritability Problem

A
  • Heritability is the fraction of variation in a trait that is explained by genetics; it is related to penetrance, but the two are not the same thing: incomplete penetrance means genotype alone doesn’t fully determine the phenotype, not that the trait isn’t heritable
  • Not all heritability is accounted for; researchers think the missing heritability is due to SNPs that are too rare to be seen in GWAS
45
Q

Limitations of GWAS

A
  • TagSNPs used on SNP chips only indicate region; we have to do additional work to figure out if a SNP variant is causative
  • TagSNPs may not predict the same inferred SNPs in different ancestry groups (haplotype block problem)
  • SNP chips still seem to be missing genetic contributions to phenotype (missing heritability problem)
46
Q

The different levels of Biodiversity

A
  • ecosystem diversity: how many different ecosystems there are
  • species diversity: how many different species there are
  • genetic diversity: diversity within a species
47
Q

Why low Genetic Diversity is Bad

A

1) Reduced adaptive potential
- There is less variety for natural selection to act on if there is an environmental change
- This leads to a downward spiral; genetic erosion makes a species more susceptible to environmental problems, which makes them more susceptible to genetic erosion
2) Inbreeding depression
- Reduced ability to survive and reproduce as a result of breeding by closely related individuals
- Can lead to a loss of heterozygosity, which can be especially bad if the trait that is left standing is deleterious

48
Q

How genomics can help diversity and conservation biology

A

1) We can measure diversity by going out into the field, taking DNA from various organisms in the same species and population
- This can help us provide surveillance on the diversity (and thus susceptibility to extinction) of a species
- It can tell us if/where we should place wildlife corridors or culverts to allow species to cross over/under manmade features like freeways so the populations can interbreed with one another to increase diversity
- Can tell us if we should transport individuals from one population to another to increase diversity
- Can tell us if we should go as far as captive breeding
2) Helps us understand diversity
- Can tell us what parts of the genome are most important for survival/adaptation