module 3: beyond genome sequencing Flashcards
why bother with NGS if the human genome has already been sequenced?
1) Clinical setting - gives info about potential disease causing mutations
2) Phylogenetic studies
3) Compare sequences between populations to detect variability
4) Keep up with the changes of the genome
1) What is whole-exome and targeted sequencing?
2) How does it work?
1) Sequencing the exonic region (sequences retained in mature mRNA) of the genome! It is important for CLINICAL RESEARCH.
2) Same as whole genome sequencing but differs in library preparation - after fragmenting DNA, special beads bind to exonic sequences only.
Whole-exome and targeted sequencing _______ sequencing power by reducing the _______ area covered.
It lays the framework for causes of _________.
increase; genomic
autism (complex disease with many factors contributing to it)
*most disease causing mutations disrupt gene expression; affect coding regions.
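Worked example (a rough sketch): average coverage is approximately (number of reads x read length) / size of the region sequenced, so shrinking the target raises depth for the same run. The read count, read length, and both target sizes below are illustrative assumptions, not figures from this module.

```python
def mean_coverage(n_reads, read_length, target_size):
    """Average depth = total sequenced bases / size of the region covered."""
    return n_reads * read_length / target_size

# Hypothetical sequencing run: 100 million reads of 150 bp each.
N_READS = 100_000_000
READ_LENGTH = 150

# Assumed target sizes (illustrative only).
WHOLE_GENOME_BP = 3_000_000_000  # roughly one haploid human genome
EXOME_BP = 30_000_000            # assumed to be about 1% of the genome

genome_depth = mean_coverage(N_READS, READ_LENGTH, WHOLE_GENOME_BP)
exome_depth = mean_coverage(N_READS, READ_LENGTH, EXOME_BP)

print(f"whole genome: {genome_depth:.0f}x")
print(f"exome only:   {exome_depth:.0f}x")
```

Under these assumptions the same run gives 100 times more depth over the exome than over the whole genome, because the target is 100 times smaller.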
_________ platforms can detect modified bases.
Nanopore
*has lots of potential for epigenetic studies
(T/F) NGS is not important after sequencing one genome as a single genome can represent the genetic diversity of our species.
False!
NGS helps CATALOGUE human genome diversity: NO SINGLE GENOME CAN REPRESENT THE GENETIC DIVERSITY OF OUR SPECIES.
Which statement is true?
1) The reference genome for humans is mosaic (many different genomes).
2) 0.6% of NT differs between any 2 individuals.
Both are true!
1) What was the first project to catalog human diversity?
2) What was the “All of Us” project?
1) The 1000 Genomes Project, which ran from 2008 to 2015. It sequenced 2,504 individual genomes from 26 different populations (POPULATION-level sequencing).
2) It aimed to sequence the genomes of 1 million people living in the United States to accelerate research and improve health.
Why is cataloging human genetic diversity important?
1) helps us see where our genomes differ and how these can affect our phenotypes.
2) can help us learn how our genetics can influence our response to certain drugs and our susceptibility to different diseases.
Human Pangenome Reference Consortium looked at 47 phased diploid genomes.
1) What does phased diploid genome mean?
2) Why is this important?
1) Separated maternal chromosomes from paternal chromosomes.
2) A standard reference provides a single consensus sequence for each pair of homologous chromosomes, so it does not capture the differences between the maternal and paternal copies.
This will better represent the diversity of the human genome. Each genome carries a certain number of DNA variants.
1) Define the term mutation.
2) Mutations are caused by:
1) Permanent change in the DNA sequence compared to what is predominant in the population
2) Endogenous (un-repaired DNA damage) and exogenous sources
Give examples of exogenous and endogenous sources of mutations.
Exogenous:
- ionizing radiation (DNA breaks)
- UV rays (thymine dimers, issues with DNA replication)
- chemicals (deamination, oxidation of bases)
Endogenous:
- oxidation of bases (G -> T)
- errors of DNA pol
- mis-repaired ds/ss DNA breaks
- loss of a purine/pyrimidine (abasic site)
- cytosine deamination (gives uracil)
*most endogenous sources are repaired normally but when they are not repaired, it can lead to mutations!
(T/F) Mutations can range from a single bp to millions of bp. They are also inevitable.
True!
We can not stop mutations. They can be good or bad depending on where they occur and their nature.
(T/F) Each cell loses 500 purines per day and 10,000 pyrimidines per day.
False!
Each cell loses about 10,000 purines per day and about 500 pyrimidines per day.
These are normally repaired but if not, errors arise during DNA replication.
1) Define genetic VARIATION.
2) What are the two types of variations? Briefly describe each.
Variation: mutations that result in ALTERNATIVE forms of DNA (established in a population).
Common variation: minor allele (least common allele) frequency of at least 1% in the population
Rare variation: minor allele frequency of <1% in the population
*not a strict rule
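Worked example (a minimal sketch): classifying a bi-allelic site as common or rare from its minor allele frequency, using the 1% cut-off from the card above. The allele counts are made up for illustration.

```python
def minor_allele_frequency(allele_counts):
    """Frequency of the least common allele at a bi-allelic site."""
    total = sum(allele_counts.values())
    return min(allele_counts.values()) / total

def classify_variant(maf, threshold=0.01):
    """'common' if MAF is at least the threshold (1% by default), else 'rare'."""
    return "common" if maf >= threshold else "rare"

# Hypothetical allele counts from 1000 diploid people (2000 chromosomes).
site_a = {"C": 1800, "T": 200}  # minor allele T
site_b = {"G": 1995, "A": 5}    # minor allele A

for name, counts in (("site_a", site_a), ("site_b", site_b)):
    maf = minor_allele_frequency(counts)
    print(name, f"MAF={maf:.4f}", classify_variant(maf))
```

The threshold is a parameter because, as the card notes, 1% is a convention and not a strict rule.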
Define allele.
Allele refers to one of two or more versions of a DNA sequence at a given location.
For any genomic location, we have two alleles (maternal vs paternal).
You can be HOMOZYGOUS or HETEROZYGOUS for alleles.
(T/F) If you are homozygous at one locus, chances are you are homozygous at all loci.
False!
You can be homozygous for some loci but heterozygous for others.
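Worked example (a small sketch): zygosity is decided locus by locus, so one person can be homozygous at some loci and heterozygous at others. The "maternal|paternal" genotype notation and the three loci are assumptions made for illustration.

```python
def zygosity(genotype):
    """Classify a diploid genotype written as 'maternal|paternal', e.g. 'C|T'."""
    maternal, paternal = genotype.split("|")
    return "homozygous" if maternal == paternal else "heterozygous"

# Hypothetical genotypes for one person at three loci.
genotypes = {"locus_1": "A|A", "locus_2": "C|T", "locus_3": "G|G"}

calls = {locus: zygosity(gt) for locus, gt in genotypes.items()}
print(calls)
```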
What are the four types of genomic variants?
1) Single NT polymorphisms
2) Insertion-Deletions (INDELs or DIPs)
3) Simple sequence repeats (SSRs)
4) Copy number variants (CNVs)
Single nucleotide polymorphisms, also known as SNPs are the ____ common genomic variants (1 in every ____ NTs).
There are about _______ SNPs in the human genome.
What are the causes of SNPs?
Most; 300
10 million
Causes of SNPs are the same as the causes of mutations: replication errors, radiation, oxidation, and other endogenous and exogenous sources.
*SNPs are point mutations
Briefly describe where SNPs can occur within the genome.
Which location has the most visible impact?
SNPs can occur in the:
1) Coding
2) Non-coding (introns - could affect mRNA splicing, TFs binding, and stability of mRNA esp if it is in the 3’ UTR)
3) Intergenic (regulatory regions between genes - can affect transcription)
Most visible impacts are seen when SNPs are present in the coding and intergenic regions.
There are two types of coding SNPs.
Differentiate them.
1) Synonymous: NO CHANGE in the amino acid, thus no impact on the protein
2) Non-synonymous: CHANGES the amino acid.
There are two types of non-synonymous: MISSENSE or NONSENSE (intro of premature STOP codon).
*if the SNP is NONSENSE (premature STOP codon), the mRNA usually gets degraded before it can be translated (nonsense-mediated decay).
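Worked example (a sketch, assuming the standard genetic code): the three coding SNP classes can be told apart by translating the reference and the altered codon. The function name and the example codons are illustrative choices, not part of the module.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_coding_snp(ref_codon, alt_codon):
    """Label a single-codon change as synonymous, missense, or nonsense."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop to stop)
    if alt_aa == "*":
        return "nonsense"     # new premature STOP codon
    return "missense"         # different amino acid (stop-loss is not handled separately)

print(classify_coding_snp("GAA", "GAG"))  # Glu -> Glu
print(classify_coding_snp("GAG", "GTG"))  # Glu -> Val
print(classify_coding_snp("TGG", "TGA"))  # Trp -> STOP
```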
What is the difference between a causative and a correlated SNP?
Causative SNP: SNP alters protein function, leading to dysfunction in the organism. The SNP causes the observed phenotype.
Correlated SNP: SNP is not within a coding region but is inherited with a mutation that causes a disease. The SNP does not cause the observed phenotype.
(T/F) Most SNPs have a significant impact on the health and development of humans.
False!
Most SNPs have no observable effect unless they affect a coding/regulatory region.
What is the human germline mutation rate for SNPs?
Knowing the human germline mutation rate for SNPs and that we have a lot of SNPs in our genome, what does this tell us?
About 1 in 100 million NTs is substituted per generation (~1.2x10^-8 per site per generation). This means about 30 NEW POINT MUTATIONS arise per generation in each egg/sperm.
We find a lot of SNPs in our genome, and we know that each event that creates an SNP is rare. This tells us that most SNPs are VERY OLD mutational events inherited from a COMMON ANCESTOR.
We can compare the SNPs across various genomes to trace the origins of the human species!
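Worked example (back-of-envelope arithmetic): where the "about 30 new point mutations" figure comes from. The haploid genome size of roughly 3 billion bp is an assumption not stated in the card.

```python
HAPLOID_GENOME_BP = 3.0e9  # assumed approximate size of one haploid human genome

# Rate as quoted in the card, and the rounded "1 in 100 million" version.
QUOTED_RATE = 1.2e-8   # substitutions per site per generation
ROUNDED_RATE = 1.0e-8  # 1 in 100 million

# Expected new substitutions in one egg or sperm = rate x number of sites.
expected_quoted = QUOTED_RATE * HAPLOID_GENOME_BP
expected_rounded = ROUNDED_RATE * HAPLOID_GENOME_BP

print(f"with 1.2e-8 per site: about {expected_quoted:.0f} new mutations per gamete")
print(f"with 1.0e-8 per site: about {expected_rounded:.0f} new mutations per gamete")
```

Both versions land in the tens of new substitutions per gamete, which is tiny next to the millions of SNPs each genome carries. That gap is why most SNPs must be old and inherited rather than newly arisen.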
Most SNPs are __-allelic.
Bi-allelic
This means that most SNPs come in one of two varieties. For example, a locus can have either A or T but not G or C.
This is because the germline mutation rate is so LOW! That exact position has to be mutated more than once to be tri-allelic and more, which is very rare.
An example of an SNP is in the gene ABCC11 which encodes a membrane transporter in sweat glands.
The minor allele (__) encodes for dry earwax and no body odour found in _________.
The major allele (__) encodes for wet earwax and normal body odour found in _______ and _____.
TT; East Asians
CC; Europeans and Africans
*this is an example of A SINGLE SNP having a profound change.
1) What is the second most common form of genetic variation in the human genome; at what rate?
2) What is the size of these? Which ones are more common?
3) What are they caused by?
1) Insertion-deletions (INDELs or DIPs) are the second most common form. They occur 1 per 10kb of DNA.
2) INDELs or DIPs can be 1 to 10,000bp in length. The shorter ones (1, 2, 3 bp) are most common.
3) These are caused by errors in DNA replication, recombination or repair.
*depending on where they occur, they can have an effect.
Coding region INDELs can lead to catastrophic phenotypes.
Briefly describe the two types of coding region INDELs.
1) Frameshift: changes the reading frame
2) Non-frameshift: multiple of three NTs added or deleted, leading to no change in the reading frame.
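Worked example (a minimal sketch): whether a coding INDEL shifts the reading frame depends only on whether the number of NTs added or removed is a multiple of three. The reference/alternate allele strings below are made up for illustration.

```python
def classify_coding_indel(ref, alt):
    """Classify an INDEL by the net number of NTs inserted or deleted."""
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        raise ValueError("ref and alt have the same length; not an INDEL")
    # A multiple of three adds/removes whole codons and keeps the frame.
    return "non-frameshift" if length_change % 3 == 0 else "frameshift"

print(classify_coding_indel("A", "ATG"))    # 2 NTs inserted
print(classify_coding_indel("A", "ATGC"))   # 3 NTs inserted
print(classify_coding_indel("ATGCA", "A"))  # 4 NTs deleted
```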
Simple sequence repeats (SSRs) are also known as ________.
They account for __% of the total DNA.
___-___ base pairs repeated in tandem (up to 100 times).
Frequency of once in every ___ of DNA.
The germline mutation rate is ____ per locus per gamete.
microsatellites
3%
1-6bp
30kb
10^-3
List these statements as either true or false.
1) For SSRs 2 and 3 base pairs being repeated is the most common. It is the NUMBER OF REPEATS that is variable between people.
2) SSRs are not as common as SNPs or INDELs and they affect less DNA.
3) SSRs are more polymorphic than SNPs as they change more frequently. They also are not BI-allelic. However, the rate of new formation for SSRs is still low enough that it usually doesn’t change within a few generations of a family.
1) True!
2) False. Though SSRs are not as common as SNPs or INDELs, they can affect more DNA.
3) True!
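Worked example (a sketch, not a production tool): finding SSRs as a 1-6 bp unit repeated in tandem, using a regular expression with a backreference. The minimum of 4 copies and the toy sequence are assumptions made for illustration.

```python
import re

def find_ssrs(sequence, min_copies=4):
    """Return (unit, copies, start) for each tandem repeat of a 1-6 bp unit."""
    # ([ACGT]{1,6}?) captures the shortest possible unit;
    # \1{n,} then requires at least n more copies of that same unit.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((unit, copies, match.start()))
    return hits

# Toy sequence: five copies of AC, then five copies of CAG.
toy = "TTG" + "AC" * 5 + "GGT" + "CAG" * 5 + "TTA"
print(find_ssrs(toy))
```

Two people could carry the same unit at the same locus and differ only in the copies value, which is the variable part described in statement 1.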
Give an example of a disease caused by SSRs and answer the following questions regarding it.
1) What is the SSR involved?
2) What is the correlation between number of repeats and age of onset?
An example is Huntington’s disease (HD), an autosomal dominant neurodegenerative disease with no cure.
1) It is a polyglutamine (polyQ) disease caused by trinucleotide expansion of CAG (codes for glutamine) within the HTT gene.
2) The number of repeats is inversely correlated with the age of onset (more repeats = earlier onset).
*families with a history of HD can show an earlier and earlier onset of the disease with each generation as the triNT repeats can expand.
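Worked example (a toy sketch): counting the longest uninterrupted run of CAG repeats in a sequence. The two sequences are invented for illustration and are not real HTT sequence; the repeat numbers carry no clinical meaning here.

```python
def longest_cag_run(sequence):
    """Number of repeat units in the longest uninterrupted CAG tract."""
    best = 0
    i = 0
    while i <= len(sequence) - 3:
        run = 0
        j = i
        # Step through the sequence three NTs at a time while we keep seeing CAG.
        while sequence[j:j + 3] == "CAG":
            run += 1
            j += 3
        best = max(best, run)
        # Skip past the run we just counted, or move one NT forward.
        i = j if run else i + 1
    return best

# Invented alleles: a shorter tract and an expanded tract.
short_allele = "ATG" + "CAG" * 10 + "CCGCCA"
expanded_allele = "ATG" + "CAG" * 45 + "CCGCCA"

print(longest_cag_run(short_allele), longest_cag_run(expanded_allele))
```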
1) What are copy number variants?
2) How important are they?
3) How can they be detected?
1) DNA segment > 1kb that is present in variable copy number compared to a reference genome.
2) They are as important as SNPs, if not more so. Though they are not as abundant as the other variant types, they affect more DNA!
3) They can be detected using comparative genomic hybridization (CGH).
(T/F) We have strictly two copies of each gene in our genome!
False!!
This is not accurate - we can have more than two.
For example, the copy number of individual olfactory receptor genes is EXTREMELY VARIABLE! A person can have anywhere from none to 6 copies of a given gene.
Why do we need to sequence more genomes when we already have the first human genome sequence from 2001?
The first human genome sequence was a PATCHWORK of DNA sequences from different individuals. It LACKS diversity.
For a more detailed view of human genetic variation, we need to compare DNA from many different individuals. Thus, we have to sequence as many genomes as possible.
(T/F) In Craig Venter’s genome, the majority of the differences from the reference sequence came from the large-copy-number variants.
True!
*there is no such thing as WT or reference genome! lots of variations found in different genomes.
What was the 1000 Genomes Project?
It was a population-level sequencing project that wanted a deep catalogue of human genetic variation.
They wanted to find genetic variants with a frequency greater than 1% (common variants).
They sequenced over 2500 samples.
Which continent has the highest variant sites per genome? Why?
Africa has the highest number of variant sites per genome (most are SNPs). African populations are the oldest and thus have had more time to accumulate these variants in their genomes.
Other continents have fewer variant sites per genome due to the founder effect.
This occurs when a small group of migrants leaves a population and moves elsewhere, failing to capture all of the diversity of the original population. This group then interbreeds, which further limits diversity.
Thus other populations such as Europe, East Asia, and South Asia have lower genetic diversity.
(T/F) On average, there are 4-5 million variant sites per genome.
True!
(T/F) The majority of the SNPs in our genome are not common and are restricted to a subpopulation.
False!
The majority of the SNPs in our genome (about three-quarters) are common - they are VERY OLD mutational events.
The less common SNPs are restricted to a sub-population.
Besides the 1000 genome project, there are two other population-level sequencing projects underway.
Briefly describe each.
1) Exome aggregation consortium (ExAC) catalogued genetic variation in the PROTEIN-CODING REGIONS of the genome - looked only for SNPs.
2) Genome Aggregation Database (gnomAD) took information from ExAC and added further whole-genome and exome sequencing. It looked for SNPs and LARGER VARIANTS (deletions, duplications, inversions, etc).
The gnomAD project had an average coverage of ____.
It was representative of the general ____ population and did not include severe ______ diseases.
32x
adult; Mendelian
Briefly answer the following questions regarding the results of gnomAD:
1) What types of variants were most commonly found among the structural variants?
2) What is the approximate number of structural variants identified in each individual, as opposed to previous estimations?
3) What were the characteristic traits of the SVs identified?
1) The majority of structural variants found were DELETIONS (CNV) and INSERTIONS (non-CNV).
2) Each individual had approximately 7000 structural variants, DOUBLE what was previously identified.
3) Most of the Structural Variants are SMALL and RARE.
We have many COMPLEX variants (more than 1 type of change in the DNA).
Like with SNPs, there were more structural variants found in the ________ population.
Unlike SNPs, the majority of the structural variants are _______.
_________ are variants that occur only once in a population (or the genomes studied). Over ____% of the SVs were found only once.
African
Rare (90%)
Singletons; 50*
*shows how rare SVs are as there were many singletons
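Worked example (a minimal sketch): counting singletons and rare variants from per-variant allele counts. The variant names, counts, and cohort size are all invented for illustration and do not reproduce the gnomAD results.

```python
# Hypothetical cohort: 1000 diploid people = 2000 chromosomes.
TOTAL_CHROMOSOMES = 2000

# Hypothetical allele counts: chromosomes carrying each structural variant.
sv_allele_counts = {
    "DEL_1": 1,
    "DEL_2": 340,
    "INS_1": 1,
    "DUP_1": 2,
    "INV_1": 1,
    "DEL_3": 1,
}

# Singleton: seen exactly once in the genomes studied.
singletons = [sv for sv, count in sv_allele_counts.items() if count == 1]

# Rare: allele frequency below 1%.
rare = [sv for sv, count in sv_allele_counts.items()
        if count / TOTAL_CHROMOSOMES < 0.01]

singleton_fraction = len(singletons) / len(sv_allele_counts)
rare_fraction = len(rare) / len(sv_allele_counts)

print(f"singletons: {len(singletons)} of {len(sv_allele_counts)}")
print(f"rare:       {len(rare)} of {len(sv_allele_counts)}")
```

Every singleton is also rare by definition, so the singleton fraction can never exceed the rare fraction.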