Lecture 8 - Non-linear plots Flashcards
change in expression is also known as
variation in expression
Based on this image, assume that the distance (variation) between the naive and transplant 2H on the y-axis is 24% and the x-axis distance (variation) is 15%. What would the overall variation between the naive and transplant 2H be?
By the Pythagorean theorem, about 28.3% (√(24² + 15²) = √801 ≈ 28.3)
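The arithmetic on this card can be checked with a short Python sketch. It assumes, as the card does, that the two axes are perpendicular, so the two distances combine like the sides of a right triangle.

```python
import math

# Distances (percent variation) along each axis, taken from the card above.
y_variation = 24.0
x_variation = 15.0

# Straight-line (Euclidean) distance across both axes.
overall_variation = math.hypot(x_variation, y_variation)
print(round(overall_variation, 1))
```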
Here, we see that PC1 and PC2 account for _____% of all of the variation. The 3rd, 4th, etc. PCs would have to add up to be less than _____%, so we would be safe to assume that the differences we see in this plot would be _______ (likely to/unlikely to) reflect the differences between the tissue types
a) 20.3+68.1 = 88.4%
b) 100 - 88.4 = 11.6%
c) likely
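As a quick check of parts a) and b), the sketch below adds the two percentages from the card and subtracts the total from 100. The variable names are illustrative only.

```python
# Percent of variation explained by each plotted component (values from the card).
pc1_percent = 20.3
pc2_percent = 68.1

shown = pc1_percent + pc2_percent   # variation captured by the 2-D plot
remaining = 100.0 - shown           # upper bound shared by PC3, PC4, ...
print(f"shown: {shown:.1f}%, remaining: {remaining:.1f}%")
```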
Match the following with the colours shown based on the PCA results
1) 75%
2) 39%
3) 63%
yellow = 2
green = 1
pink = 3
*based this on which ones appear the most distinct from each other
If a PCA has 3 dimensions, the plot will contain a y-axis, x-axis, and a ______. Each axis represents a different _____ (gene/cell)
z-axis, cell
answer the following pertaining to PCA
a) PCA stands for?
b) what type of approach does it have
c) Reduces dimensions by _____ on the variance in each dimension (minimizes/maximizes)
d) identifies key ____ that influence tissue types (genes/cells)
e) What type of biological processes does it identify?
a) principal component analysis
b) linear + unsupervised
c) maximizes
d) genes
e) differentiation
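To see what "maximizes the variance" means in practice, here is a minimal sketch of PCA for the two-dimensional case only, using the closed-form eigenvalues of a 2x2 covariance matrix. The function name `pca_2d_explained` and the sample values are hypothetical; real analyses use a library and many more dimensions.

```python
import math

def pca_2d_explained(points):
    """Return the fraction of total variance captured by PC1 and by PC2."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix; each is the variance along one PC.
    centre = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = centre + spread, centre - spread
    total = lam1 + lam2
    return lam1 / total, lam2 / total

# Hypothetical pairs of measurements that fall almost on a straight line.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
pc1, pc2 = pca_2d_explained(samples)
print(f"PC1: {pc1:.1%}, PC2: {pc2:.1%}")
```

Because the hypothetical points lie close to one line, PC1 captures nearly all of the variation, which is the situation where a PCA plot axis with a high percentage is most informative.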
PCA is an unsupervised linear approach.
a) unsupervised?
b) linear?
a) it means that no group labels (e.g. tissue type) are given to the method; it compares all the samples with each other at once rather than one against another individually
b) it measures the distance/variance in gene expression along straight lines, not curves
why does scRNA-seq not just use PCA?
it requires comparing multiple different cell types and gene expression levels at different times, which is too complex for a linear method like PCA
match the following to
a) linear
b) nonlinear
a) A
b) B + D
which of the following are similar
I. PC
II. tSNE
III. UMAP
a) I and III
b) II and III
c) I and II
d) I, II, and III
e) none, they are all distinct
b
match the following
a) PC
b) tSNE
a) A
b) B
Non-linear diffusion models
a) what does it emphasize in the data?
b) useful for ______ of continuous processes such as ________
c) “Each dimension highlights the heterogeneity of a different cell population” –> what does this mean?
d) used for ______ and _______
e) typically rely on the ____ of dimensions first (addition/reduction)
a) transitions –> seeing a big difference in the spaces between the clusters of cells
b) visualization, differentiation
c) each dimension shows the variation (heterogeneity) between the different subpopulations (clusters)
d) exploration and visualization
e) reduction
T or F - the number used on the axis of tSNE plots are arbitrary
T - This plot is just meant for visualization of differences in expression
which plot uses percentage variation as its axis?
PC
T or F - Non-linear diffusion models such as tSNE and UMAP are used to help with exploration, visualization, and for determining events
F - not used for determining events; it cannot state whether one population on the plot is derived from another population
non-linear diffusion models are not used for determining events - what does this mean?
it means that it cannot tell you whether one of the populations (clusters) shown is derived from another population or not
t-SNE
t-distributed stochastic neighbour embedding
T or F - while PCA is unsupervised and linear, t-SNE is unsupervised and nonlinear
T
T or F - while PCA is unsupervised and linear, t-SNE is supervised and nonlinear
F - tSNE is also unsupervised
t-SNE calculates a similarity measure between a pair of instances in the high dimensional space and in the low dimension space
a) high dimensional space?
b) low dimensional space?
a) gene-by-gene comparison
b) PCA
T or F - genes that are found to be similar to each other have a higher cost
T
Which of the following would result in a negative cost
a) similar genes
b) distinct genes
b
What allows tSNE to exaggerate differences between cell populations and overlook potential connections between populations?
the cost function
T or F - in tSNE you will never get the same image twice
T - t-SNE is stochastic, so each run (without a fixed random seed) gives a different image
Which diffusion plot can help you determine closer relationships between adjacent groups and distant groups?
a) PCA
b) tSNE
c) UMAP
d) tSNE and UMAP
e) PCA and tSNE
c
UMAP?
uniform manifold approximation and projection
a) ________ (PCA/UMAP/tSNE) gives the best approximation of the underlying topology
b) what does underlying topology refer to here?
a) UMAP
b) the real biological relationships between the cells based on their similarities and differences in expression
As we all know PCA and tSNE are unsupervised but what is UMAP
force-directed
Molecular specification is visualized with a force-directed layout, in which each cell is represented by a coloured point at each ___________
developmental stage
T or F - UMAP can be used to get information about the biology of the cells due to its force-directed layout algorithm
F - it gives the underlying topology and lets you compare different UMAPs, but not the biology, because the image changes every time
T or F - in UMAPS, the distance btw clusters represents a closer relationship between those genes
F
Match the following with the colour
a) Differential splicing between pop
b) finding markers of cell types
c) identification of genes that drive a process
d) allelic expression patterns
e) frequency of cell type in the pop
a) pink
b) blue
c) yellow
d) orange
e) green
T or F - UMAPs and tSNE are great visualizations of gene expression in cells but cannot tell you too much about the biology
T
a) high dimensional space?
b) low dimensional space?
a) a space that contains a high number of dimensions (cell/genes), making it difficult to visualize
b) a space in which the number of dimensions (cells/genes) has been reduced to make it easier to visualize
FDR - name and define
false discovery rate: the proportion of results that were reported as significant but were actually false positives (not truly significant)
How can a false discovery rate occur?
It can happen due to a certain gene out of the thousands appearing different (significant) by random chance when normally it would not be different/significant
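The "random chance" in this answer can be illustrated with a small simulation. The sketch assumes a hypothetical experiment with 10,000 genes, none of which truly differs; in that case p-values are spread uniformly between 0 and 1, so roughly 5% fall below a 0.05 cut-off anyway.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_genes = 10_000   # hypothetical number of genes, none truly different
alpha = 0.05       # significance cut-off

# When no real difference exists, p-values are spread uniformly between 0 and 1.
p_values = [random.random() for _ in range(n_genes)]

# Every gene below the cut-off here is a false positive by construction.
false_positives = sum(p < alpha for p in p_values)
print(f"{false_positives} of {n_genes} genes look significant by chance")
```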
a) Which bar is the gene that is considered significant?
b) Which bar is the gene that is considered insignificant?
c) what dictates the significance of a gene
d) this is known as a ________ experiment because______
a) green
b) blue
c) difference in expression
d) Perfect, there are no false discovery rates occurring (no false positives)
a) What kind of experiment is this showing?
b) What kind of plot is this?
c) Are these results common? Why?
a) perfect
b) volcano
c) no, as there are no false positives (FDR) present, which tend to occur
a) Which cells have high expression of that gene?
b) Which cells have low expression of that gene?
c) Which cells are insignificant?
d) What is the difference btw no significance in expression and low expression of that gene?
a) red
b) blue
c) grey
d) No significance means that the expression of that gene does not change between the experimental and control groups, whereas with low expression it does change (it is lower)
Volcano plots: represent the _________ in gene expression and the ________ of that change.
changes, significance
T or F - This is a demonstration of a perfect experiment volcano plot
T
T or F - This is a demonstration of a perfect experiment volcano plot
F - less difference btw the significant and non-significant gene expressions
Define the following wrt volcano plots
a) log2 fold change
b) -log10 (p-value)
a) the magnitude of change in gene expression between conditions (control + exp)
b) the statistical significance (how consistent is that difference in gene expression)
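The two axis values can be computed for a single gene as in the sketch below. The function name `volcano_coordinates` and the expression values are hypothetical, and the sketch takes the p-value as given rather than computing it from replicates.

```python
import math

def volcano_coordinates(control_mean, treated_mean, p_value):
    """Return (log2 fold change, -log10 p-value) for one gene."""
    # x-axis: magnitude and direction of the change between conditions.
    log2_fold_change = math.log2(treated_mean / control_mean)
    # y-axis: smaller p-values give larger values, so they plot higher up.
    neg_log10_p = -math.log10(p_value)
    return log2_fold_change, neg_log10_p

# Hypothetical gene: mean expression rises from 10 to 40 units with p = 0.001.
x, y = volcano_coordinates(control_mean=10.0, treated_mean=40.0, p_value=0.001)
print(x, y)
```

A four-fold increase gives a log2 fold change of about 2, and a four-fold decrease would give about -2, which is why up- and down-regulated genes sit symmetrically on either side of zero.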
Match the following
a) log2 fold change
b) -log10 (p-value)
a) yellow
b) green
T or F - larger p-value in a volcano plot means a higher significance
F - smaller p-value
T or F - A smaller p-value in a volcano plot means a higher significance
T
T or F - A smaller p-value in a PC plot means a higher significance
F - p-values are associated with volcano plots, not PCA plots
T or F - cells that have a high p-value also tend to have a high FDR
T
T or F - Cells that have a low p-value also tend to have a high FDR
F - high p-value ~ high FDR
Why do values in a volcano plot that have a high p-value also have a high FDR?
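a high p-value means the observed difference in expression between the two conditions could easily have arisen by chance, so calling it significant would likely be a false positive (FDR measures the rate of false positives); the FDR-adjusted value is never smaller than the raw p-value, so it is high as well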
a) The ________ is a measure of how likely you are to get this genetic data if no real difference existed. (FDR/p-value)
b) A _______ p-value indicates that there is a small chance of getting this data if no real difference existed. (Small,large)
c) A ________ is when you get a significant difference where, in reality, none exists
d) The _____ adjusts p-values in a way that limits the number of false positives reported as significant
e) So, choosing a cut off of 0.05 means there is a _____% chance that we make the _______ decision (right/wrong)
f) _______ are the name given to the adjusted _______ found using an optimised ______ approach.
a) p-values
b) small
c) false positive
d) FDR or false discovery rate
e) 5, wrong
f) q-value, p-value, FDR
a) Is this plot adjusted or not? How do you know?
b) What does adjusted mean here?
a) No, because the p-value is shown, not the FDR
b) whether the plot pts have been corrected for potential false positives
a) is this graph showing a difference btw experiments or not
b) which axis represents the p-values?
a) no difference
b) x-axis
a) Is this graph showing a difference btw experiments or not
b) Which axis represents the p-values?
a) yes difference
b) x-axis
describe where the false positives are
they are all the bars/bins that are below the red line on the left (highlighted)
There are roughly 450 above and below the line but without the control how many would there be? Why?
900 (450 x 2) –> the control removes the values below the line which represent the false positives
what is the purpose of the Benjamini-Hochberg method?
to figure out where the line is when trying to separate false from true positives wrt variance
The Benjamini-Hochberg method
1. Order the p-values from _________ to _________ (smallest to largest/largest to smallest)
2. ______ the p-values
3. the _______ FDR adjusted p-value and the ______ p-value are the same (smallest/largest)
4. the next _______ (smallest/largest) adjusted p-value is the ________ (smaller/larger) of the two options
- smallest to largest
- rank
- largest, largest
- largest, smaller
What is the formula for the Benjamini-Hochberg method?
the current p-value x (total # of p-values/p-value rank)
Calculate the adjusted p-value for the 8th-ranked p-value
- adj p-value = (.71)(10/8) = .8875
- prev adj p-value = .9
- .8875<.9 therefore, the 8th ranked adjusted p-value = .8875
Calculate the adjusted p-value for the 5th-ranked p-value
- adj p-value = (.41)(10/5) = .82
- prev adj p-value = .85
- .82<.85 therefore, the 5th ranked adjusted p-value = .82
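The steps and the two worked cards above can be sketched in Python as follows. The function name `benjamini_hochberg` is illustrative, and the list of ten raw p-values is an assumption chosen only so that 0.41 sits at rank 5 and 0.71 at rank 8, as in the cards.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    total = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(total), key=lambda i: p_values[i])
    adjusted = [0.0] * total
    previous = 1.0  # adjusted values can never exceed 1
    # Walk from the largest p-value (rank = total) down to the smallest (rank 1).
    for rank in range(total, 0, -1):
        index = order[rank - 1]
        candidate = p_values[index] * total / rank
        # Keep the smaller of the previous adjusted value and the new candidate.
        previous = min(previous, candidate)
        adjusted[index] = previous
    return adjusted

# Illustrative list (an assumption): 0.41 is at rank 5 and 0.71 at rank 8.
raw = [0.01, 0.11, 0.21, 0.31, 0.41, 0.51, 0.61, 0.71, 0.81, 0.91]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.2f} -> adjusted p = {q:.4f}")
```

Starting from the largest rank makes the "smaller of the two options" rule a single running minimum, and the largest adjusted value automatically equals the largest raw p-value.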
a) fill in the highlighted parts
- yellow (false positive/false negative)
- green (significant, not significant)
b) explain why this statement is true
a) false positive, not significant
b) the adjusted p-value that indicates the significance of the variation between the two experiments is very large in comparison to the set alpha (not given), which means it is not likely to be significant
T or F - larger p-values mean a lower level of significance
T
a) this shows the _____ p-values (raw/adjusted)
b) if alpha = 0.05 which of the circled values would not be considered significant?
c) for the red values the rows represent ______ and the columns represent ______
a) raw
b) 0.14, .64, and 0.71 (all are above 0.05)
c) ranks, bins
What is the circle representing here?
the false positives (the values that were under 0.05 when the data was raw but are now over 0.05 after the adjustments have been made)
Order the following adjusted p-values from most to least significant
a) .69
b) .004
c) .059
d) .1
e) .5
.004 > .059 > .1 > .5 > .69 (the greater-than signs represent the significance, not the value itself)
a) green (a difference/no difference)
b) yellow (true/false/either)
c) Where did they get the value in the circle from?
a) no difference
b) false
c) it represents the number of bins (bars)
a) green (a difference/no difference)
b) yellow (true/false/either)
c) Why is only the green bin being indicated here?
a) a difference
b) either
c) it is the only one that contains p-values less than 0.05, which are significant
T or F - The Benjamini-Hochberg method is used to eliminate false positives
F - it only reduces them, it does not eliminate them
T or F - The Benjamini-Hochberg method is used to reduce the number of false positives
T
Why should a plot show the FDR values rather than the p-values?
raw p-values let through more false positives, whereas FDR values are adjusted p-values that reduce the number of false positives, making them more reliable
a) What are 2 techniques used to generate genomic data?
b) What are the 4 plots used to organize data?
c) What is used to correct false positives
d) Where is the data deposited?
a) RNA-seq (bulk) and scRNA-seq (single cell)
b) PCA, volcano, UMAP, tSNE
c) FDR - false discovery rate
d) GEO - gene expression omnibus
define GEO
gene expression omnibus - a public database used to store and share gene expression info and other genomic data
Define GSEA
Gene Set Enrichment Analysis - an analysis that asks whether a gene set (a set of genes that have something in common) is enriched among the genes that change in expression
What are 4 things that genes can have in common with one another?
1. chromosomal regions - the genes' locations relative to each other on a chromosome
2. gene ontology - classifying the genes based on biological process, molecular function, and cellular component
3. pathways
4. gene sets - using pre-published info about the gene to learn more about it
define GO
gene ontology: the act of classifying genes to describe their roles in cellular activities
What are the three ways to classify genes (wrt GO)?
- biological process
- molecular function
- cellular component
Match the following term to its definition
a) Molecular function
b) cellular component
c) biological process
1. The big picture as a result of multiple molecular activities. Examples: DNA repair, Wnt signal transduction, etc.
2. The activity performed by the gene product. The word “activity” is usually appended to avoid confusion with the gene name. Example: adenylate cyclase activity
3. Location relative to cellular structures. Examples: mitochondria, ribosome, cell wall, etc.
a) 2
b) 3
c) 1
a) Molecular Function: The activity performed by the gene product. The word “________” is usually appended to avoid confusion with the gene name. Example: adenylate cyclase activity
activity
* __________: The activity performed by the gene product. The word “activity” is usually appended to avoid confusion with the gene name. Example: adenylate cyclase activity
* ____________: Location relative to cellular structures. Examples: mitochondria, ribosome, cell wall, etc.
* ______________: The big picture as a result of multiple molecular activities. Examples: DNA repair, Wnt signal transduction, etc.
Molecular function, cellular component, biological process
The GO vocabulary is designed to be species agnostic, and includes terms applicable to prokaryotes and eukaryotes, as well as single and multicellular organisms. What does species-agnostic mean here?
it means that the GO terms are used universally to describe the functions and processes of genes across all species
In an example of GO annotation, human “_________” can be described as:
* molecular function: oxidoreductase activity
* cellular component: mitochondrial intermembrane space
* biological process: oxidative phosphorylation
cytochrome c
Fill in the highlights
blue: gene set enrichment analysis
yellow: a single gene
green: DAVID
in the zebrafish analysis, it was observed that the gene expression was _______ during the conventional experiment, _______ during the germ free, and ________ during the metabolite treatment (high/low)
high, low, high
KEGG is a database resource for understanding ______-level functions (low/high) and utilities of the _________ system, such as the cell, the organism, and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other ______-throughput experimental technologies.
high, biological, high
T or F - KEGG is produced through computer algorithms
F - it is manually drawn
T or F - KEGG is a manually drawn pathway map
T
a) What is this experiment?
b) is there a lot of change in gene expression shown
c) Which of the following would indicate the amount of time observed for this type of change in gene expression
1- changes between 0 and 1h
2- changes between 0 and 6h
a) changes in gene expression related to the dehydration resistance of a dehydrated pear
b) no
c) 1
a) What is this experiment?
b) is there a lot of change in gene expression shown
c) Which of the following would indicate the amount of time observed for this type of change in gene expression
1- changes between 0 and 1h
2- changes between 0 and 6h
a) changes in gene expression related to the dehydration resistance of a dehydrated pear
b) yes
c) 2