MBG - Midterm 2 Flashcards
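Define TPM
transcripts per million –> a normalized measure of how abundant each gene's transcripts are in an RNA-seq sample (read counts corrected for gene length and scaled so the sample totals one million)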
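A minimal sketch of how TPM can be computed, assuming raw read counts per gene and gene lengths in kilobases. The counts and lengths below are made-up illustration values, not real measurements, and the function name `tpm` is hypothetical.

```python
def tpm(read_counts, gene_lengths_kb):
    """Convert raw read counts to transcripts per million (TPM)."""
    # Step 1: divide each count by the gene length (kb) to get reads per kilobase.
    rpk = {gene: read_counts[gene] / gene_lengths_kb[gene] for gene in read_counts}
    # Step 2: rescale so that all values in the sample add up to one million.
    scale = sum(rpk.values()) / 1_000_000
    return {gene: value / scale for gene, value in rpk.items()}


# Invented example values (assumption, for illustration only).
counts = {"NANOG": 200, "SOX2": 300, "MYL2": 500}
lengths_kb = {"NANOG": 2.0, "SOX2": 3.0, "MYL2": 10.0}
tpm_values = tpm(counts, lengths_kb)
```

MYL2 has the most raw reads here but the lowest TPM, because its reads are spread over a longer gene.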
what does an RNA-seq Heat map describe?
the changes in developmental gene expression over 14 days in a cell line
what is a cell line?
a population of cells grown in vitro (in a lab) that divide indefinitely
pluripotency genes are ________ expressed in pluripotent cells (highly, lowly)
highly
hierarchical clustering analysis?
the act of grouping genes or samples based on similarities
what is the significance/purpose of single cell RNA-Seq
to follow the progression of a single stem cell's gene expression over time
single cell RNA-Seq allows you to follow a stem cell within an organism as it develops - T or F
F - it only allows the observation of gene expression within the cell, not the development of the cell itself
what are the 3 genes needed (ultimate markers) to produce pluripotent stem cells?
1) NANOG
2) SOX2
3) OCT4
why is overlap needed in single cell RNA-Seq?
it allows you to see the transition in the cells properly (easier to visualize progress over time wrt gene expression)
what are barcodes
a special sequence that attaches to each mRNA, allowing us to identify the DNA fragment being observed
are barcodes only used for DNA fragments? If not, what else can use a barcode?
No - they can be used to identify the cell as well
a) how many identifiers/barcodes are needed for single cell RNA-seq?
b) what are each of them identifying?
a) 2
b) the target cell and the target mRNA in that cell (fragment)
UMI define
Unique molecular identifier - a short unique sequence on each DNA fragment (cDNA) that tags the individual mRNA molecule it was copied from
distinguish the following
a) cell barcode
b) UMI
a) unique seq that identifies the target cell
b) unique sequence that identifies the individual mRNA molecule that the cDNA was made from
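A rough sketch of how the two identifiers work together, assuming each sequenced read has already been reduced to a (cell barcode, UMI, gene) triple. The barcode and UMI strings are invented and far shorter than real ones. Reads that share all three values are treated as copies of one original mRNA molecule and counted once.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count unique UMIs for every (cell barcode, gene) pair."""
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        # A set keeps each UMI once, so repeated copies collapse to one molecule.
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


# Invented reads (assumption, for illustration only).
reads = [
    ("AAAC", "TTG", "SOX2"),
    ("AAAC", "TTG", "SOX2"),  # same cell, same UMI, same gene: a duplicate copy
    ("AAAC", "CGA", "SOX2"),  # same cell, new UMI: a second SOX2 molecule
    ("GGGT", "TTG", "SOX2"),  # different cell barcode: counted for another cell
    ("GGGT", "ACC", "NANOG"),
]
molecule_counts = count_molecules(reads)
```

The cell barcode decides which cell a read belongs to, and the UMI decides whether it is a new molecule or a repeat.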
Describe single cell capture, including general steps
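the act of pairing each cell with a barcoded bead (the bead carries the cell barcode and the UMIs). General steps: 1) cells and barcoded beads are mixed into oil so that each droplet ideally holds one cell and one bead; 2) the cell is lysed inside the droplet; 3) the released mRNA hybridizes to the barcoded sequences on the bead
what is the main significant difference btw single-cell analysis and bulk analysis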
single-cell can differentiate between the cells while bulk just looks at the overall average of all the cells
Fill in the blank for RNA-seq (bulk)
a) analyzes ___ cells at once (each/all)
b) _____ levels of mRNA (high/low)
c) _____ sequence depth (high/low)
d) _____ transcripts (rare/common)
e) ______ detect rare alternative splice forms (can/cannot)
f) _____ detect subpopulations (can/cannot)
a) all
b) high
c) high
d) rare
e) can
f) cannot
Fill in the blank for scRNA-seq (single-cell)
a) analyzes ___ cells at once (each/all)
b) _____ levels of mRNA (high/low)
c) _____ sequence depth (high/low)
d) _____ variability btw cells (high/low)
e) ______ to detect rare transcripts (easy/difficult)
f) _____ to detect rare splice variants (easy/difficult)
a) each
b) low
c) low
d) high
e) difficult
f) difficult
how many cells are needed for ScRNA-Seq?
1000-10000
ScRNA-Seq?
single cell RNA-Seq
RNA-Seq has a ______ (lower/higher) number of dimensions relative to ScRNA-seq
lower
Define sequencing depth. Describe the traits of each
a) high
b) low
the number of times a given part of the DNA is sequenced
a) can detect lower expressed genes better, making them able to detect rare transcripts
b) cheaper but may miss lower expressed genes and thus the rare transcripts
rare transcripts?
transcripts that exist but in very low concentrations (very low expression from gene)
ScRNA-seq differentiates between the different cell ____
types
how many genes are in the human genome?
~22 000
If there are about 22,000 genes in the human genome, why would the read count not go all the way up to 22000 wrt doing scRNA-Seq?
because it only detects genes that are being expressed, and not all genes are expressed at any given time
What is the purpose of principal component analysis?
used to reduce the number of dimensions of the RNA-seq data so it is easier to interpret and understand
Why does scRNA-seq tend to have a lot of false negatives? define the following
a) doublets
b) empty
c) lysed
because there will be many droplets that don't contain exactly one barcoded bead and one cell
a) 2 cells in one droplet
b) no cells nor beads in a droplet
c) cell was lysed before encapsulated
if doing a scRNA-seq, let’s say there was a high number of mitochondrial reads. What does this mean?
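This means that a lot of droplets contain lysed (damaged) cells: the cytoplasmic mRNA leaks out through the broken membrane while the mitochondrial RNA stays inside the mitochondria, so mitochondrial reads make up an unusually large fraction of the total (false negatives)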
Fill in the blank using the following terms: lysed cells, doublets, empty droplets
a) The left tail is showing _______
b) the right tail is showing _______
a) lysed cells
b) doublets
Match the following for scRNA-seq quality control
1. Finding outlier peaks in number of genes
2. Finding the fraction of mitochondrial reads
3. Finding the outlier peaks in the count depth
a) doublets
b) empty droplets
c) Lysed cells
1a, 2c, 3b
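The matching above can be sketched as a simple per-cell check. This is only an illustration: the thresholds (`min_counts`, `max_genes`, `max_mito_fraction`) and the example cells are invented assumptions, and real cutoffs are chosen by looking at the distributions in each dataset.

```python
def qc_flags(cell, min_counts=500, max_genes=6000, max_mito_fraction=0.2):
    """Return a list of quality-control warnings for one cell/droplet."""
    flags = []
    total = cell["total_counts"]
    # Very low count depth suggests a droplet with no real cell in it.
    if total < min_counts:
        flags.append("possible empty droplet")
    # An unusually high number of detected genes suggests two cells in one droplet.
    if cell["n_genes"] > max_genes:
        flags.append("possible doublet")
    # A high fraction of mitochondrial reads suggests a lysed/damaged cell.
    mito_fraction = cell["mito_counts"] / total if total > 0 else 0.0
    if mito_fraction > max_mito_fraction:
        flags.append("possible lysed cell")
    return flags


# Invented example cells (assumption, for illustration only).
good = {"total_counts": 5000, "n_genes": 2000, "mito_counts": 250}
empty = {"total_counts": 50, "n_genes": 30, "mito_counts": 2}
doublet = {"total_counts": 20000, "n_genes": 7500, "mito_counts": 800}
lysed = {"total_counts": 1500, "n_genes": 600, "mito_counts": 900}
```

Calling `qc_flags(lysed)` flags the cell because 900 of its 1500 counts (60%) are mitochondrial.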
What do these stand for?
a) hpf
b) dpf
a) hours post fertilization
b) days post fertilization
What type of cells are epiblasts?
stem cells
Which of the following are associated with the formation of the gonads (3 of these)
a) epiblast
b) neural anterior
c) neural mid
d) neural posterior
e) neural crest
f) epidermal
g) endoderm
h) Mesoderm
i) Germline
j) apoptotic-like
f + h + i
MYL2
a) What type of gene is it?
b) What type of cell does it come from?
a) cardiac myosin light chain gene
b) cardiac cell
define gene ontology
standardized system for describing gene functions, biological processes, and cellular locations across different species. It organizes genes into categories based on what they do, where they act, and what biological processes they’re involved in, making it easier to interpret large-scale gene expression data
What is a t-SNE plot?
a type of visualization that reduces high-dimensional data (like gene expression across thousands of genes) into 2D or 3D, grouping similar data points close together to reveal clusters or patterns
t-SNE plots are graphs with a meaningful x and y axis - T or F
F - they are just plots of cell clusters whose genes are either expressed or not, depending on the time at which the plot is captured. They are meant to be compared with other stages or other samples; no axis is required
What is the purpose of batch correction?
to ensure that the differences between genomic/transcriptomic experiments are just biological differences (not technical artifacts causing variations) –> removes unwanted variables
What is batch effect
unwanted variations in data between genomic experiments that are due to technical artifacts instead of biological differences –> confounds the analysis making it less accurate
What is used to fix/minimize batch effect?
batch correction
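A deliberately oversimplified sketch of the idea behind batch correction, for one gene: subtract each batch's own mean so that a shift shared by a whole batch is removed. This is an illustrative assumption, not how dedicated batch-correction tools work in detail, and it would also erase real biological differences if the batches contain different cell types. The values and batch labels are invented.

```python
from collections import defaultdict


def center_by_batch(values, batches):
    """Subtract the per-batch mean from every value of one gene."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, batch in zip(values, batches):
        sums[batch] += value
        counts[batch] += 1
    means = {batch: sums[batch] / counts[batch] for batch in sums}
    # After centering, every batch has a mean of zero for this gene.
    return [value - means[batch] for value, batch in zip(values, batches)]


# Invented example: the second harvest reads about 4 units higher overall.
expression = [5.0, 7.0, 9.0, 11.0]
batch_labels = ["day1", "day1", "day2", "day2"]
corrected = center_by_batch(expression, batch_labels)
```

After centering, the day1 and day2 cells overlap instead of forming two groups separated only by harvesting time.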
What causes batch effects + Give 4 examples that would cause this
They occur when cells are handled in distinct groups
a) different chips –> loading samples on a physical chip
b) different seq lanes –> in flow cell
c) different harvesting times
d) different handlers/experimenters –> difference in tech used to harvest samples
Which image is the no batch vs the batch correction
left = no batch –> differences due to harvesting times
right = batch correction
Theoretically, how many expressed genes can a human cell contain? Why is this theoretical?
22000 - because you will never have all genes within the cell expressed at the same time
“A single cell could have over 15 000 dimensions.”
a) If a single human cell contains 22000 genes, why is it saying 15000?
b) What is it referring to by ‘dimensions’ in this context?
a) because while there are 22000 genes in a single human cell, not all of them will be expressed at the same time, nor will all of them be informative if expressed. Realistically only about 15000 will be expressed and informative
b) dimensions = expressed genes in this context
T or F - 15000 dimensions are used to show data instead of 22000 because this is easier to visualize the varying expression levels
F - 15000 dimensions is very difficult to visualize and instead it is reduced down to 2/3 dimensions. Plus 22000 dimensions is not likely to occur in one go
what is the purpose of reducing the dimensionality of data
to make it easier to visualize the informative genes
a) genes that are expressed with ____ variability are the most informative (high/low)
b) HVG stands for ____
c) _______ to _______ HVGs are usually selected for downstream analysis
a) high
b) highly variable genes
c) 1000 to 15000
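A small sketch of picking HVGs, assuming the simplest possible rule: rank genes by the variance of their expression across cells and keep the top few. Real pipelines usually also adjust for each gene's mean expression, which this sketch leaves out. The gene names and values are invented.

```python
def top_variable_genes(expression, n_top):
    """Return the n_top genes with the highest variance across cells.

    expression maps a gene name to a list of values, one per cell.
    """
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    ranked = sorted(expression, key=lambda gene: variance(expression[gene]), reverse=True)
    return ranked[:n_top]


# Invented expression values across four cells (assumption, for illustration only).
expression = {
    "gene1": [0.0, 10.0, 0.0, 10.0],  # on in some cells, off in others
    "gene4": [5.0, 5.0, 5.0, 5.0],    # identical everywhere: not informative
    "gene5": [4.0, 5.0, 4.0, 5.0],    # barely changes
    "gene9": [1.0, 8.0, 2.0, 9.0],    # varies a lot between cells
}
hvgs = top_variable_genes(expression, 2)
```

Genes that look the same in every cell cannot help tell cells apart, so they drop out of the selection.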
define dimensionality
the number of genes being measured (expression level) per cell
On this plot identify
a) black genes
b) the y-axis
c) the x-axis
a) highly variable genes
b) y-axis = expression value (p-value)
c) x-axis = fold change (difference in gene expression between 2 treatment groups)
Based on this image which genes are informative/ non informative? explain why.
the black dots (HVGs) are informative because they vary enough between cells to distinguish them, vs the gray dots, which are non-informative as they are too similar to one another to distinguish
What is PCA and what does it do?
principal component analysis: reduces the number of dimensions (genes measured per cell), but allows us to incorporate the variability
match the following based on the image given
1. 0 uM
2. 10uM
3. 5uM
a) blue
b) red
c) yellow
1a
2c
3b
a) what is this PCA actually showing
b) comment on the middle yellow sample
a) the variation between the different samples
b) there is huge variation within the yellow sample (shown by the large gap), which makes the yellow population as a whole less distinct from the other populations (red and blue)
What are the 3 types of dimensionality reducing maps?
PCA, t-SNE, and UMAP
PCA
a) linear or nonlinear relationship?
b) main goal?
c) can you differentiate populations of cells
a) linear
b) reduce dimensions and preserve variance
c) yes
what are the two most common maps when looking at data sets (scRNA seq)
t-SNE and UMAP
t-SNE
a) linear or nonlinear relationship?
b) main goal?
c) can you differentiate populations of cells
a) nonlinear
b) to preserve the local structure of the data (keep similar pts clustered)
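c) yes - similar cells cluster together, but the distances between clusters do not measure how much the populations vary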
UMAP
a) linear or nonlinear relationship?
b) main goal?
c) can you differentiate populations of cells
a) nonlinear
b) preserve local and global structure of the data - clusters similar data pts together
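c) yes - similar cells cluster together, but the distances between clusters do not measure how much the populations vary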
Which of the following gives an indication of variability between data pts?
a) PC
b) t-SNE
c) UMAP
d) t-SNE and UMAP
e) none of the above
a
“UMAP has no directionality” what does this mean?
there is no direction with regard to variance (how much certain data pts vary); thus the clusters just represent similarity through biological factors, and their distance from each other does not indicate how much they vary from each other (no directionality of amount of variation)
What do the percentages on the axes of a PC plot refer to?
the percentage of the total variability in the data that each principal component captures
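A hedged sketch of where those axis percentages come from, using only two genes so the maths can be done by hand: the eigenvalues of the 2x2 covariance matrix are the variances along PC1 and PC2, and each percentage is one eigenvalue divided by their sum. The expression values are invented, and real PCA is run on thousands of genes with a numerical library.

```python
import math


def explained_variance_2d(x, y):
    """Percent of total variance captured by PC1 and PC2 for two genes."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Closed-form eigenvalues of the covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    spread = math.sqrt((var_x - var_y) ** 2 + 4 * cov_xy ** 2)
    eig1 = (var_x + var_y + spread) / 2  # variance along PC1 (largest)
    eig2 = (var_x + var_y - spread) / 2  # variance along PC2
    total = eig1 + eig2  # assumes the data are not all identical (total > 0)
    return 100 * eig1 / total, 100 * eig2 / total


# Invented expression of two genes across five cells (assumption).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 1.0, 4.0, 3.0, 6.0]
pc1_percent, pc2_percent = explained_variance_2d(gene_a, gene_b)
```

Because the two genes rise and fall together, most of the variation lies along one direction, so PC1 gets by far the larger percentage.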
fill in the blank
a) _____ is used to figure out the variability between gene expression
b) ______ is used to differentiate the cell populations (local)
a) PC
b) t-SNE
define volcano plot
a plot that represents the changes in gene expression and the significance of those changes
This is a ______ plot. Explain the following components shown
a) Yes vs No
b) UP vs down
volcano
a) yes means the change is significant, while no means it is occurring but not significantly
b) up = upregulation of that gene's expression and down = downregulation of that gene's expression
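A minimal sketch of how the Yes/No and Up/Down labels on a volcano plot could be assigned, assuming each gene has a log2 fold change and a p-value. The cutoffs (`lfc_cutoff`, `p_cutoff`) are commonly used defaults chosen here as assumptions, not values taken from the course figure.

```python
import math


def classify_gene(log2_fold_change, p_value, lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (significant, direction) labels for one gene."""
    significant = p_value < p_cutoff and abs(log2_fold_change) >= lfc_cutoff
    if log2_fold_change > 0:
        direction = "up"
    elif log2_fold_change < 0:
        direction = "down"
    else:
        direction = "none"
    return ("yes" if significant else "no", direction)


def volcano_point(log2_fold_change, p_value):
    """Coordinates on the plot: x = log2 fold change, y = -log10(p-value)."""
    return (log2_fold_change, -math.log10(p_value))
```

Taking -log10 of the p-value puts the most significant genes at the top, so strongly changed and significant genes end up in the upper left (down) and upper right (up) corners.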
what is another way to say that a gene's expression level is significant?
that it is informative
a) what is this showing?
b) what does the branching off represent
a) UMAP
b) the differentiation of the cells
what does pluripotent mean?
it means the cells have yet to differentiate (they all look the same) and can still become any cell type in the body
answer the following based on this image, which shows bulk RNA-seq of zebrafish under different treatments
a) what do CV, GF, and ZM stand for
b) how many samples for each treatment group
c) based on the expression level, what can you conclude about the zebrafish
a) CV = conventional, GF = germ free, ZM = metabolites added back
b) 3
c) GF shows significantly less expression of the gene; in CV, samples 1 and 3 are highly expressed while sample 2 is less so, showing that expression of the gene is normally highly variable
based on this image answer the following
a) what type of graph/plot is this?
b) is there directionality of variability present?
c) if b is yes, then which sample has the most variability
a) PC
b) yes
c) red
Label the following plots
A - PCA
B - t-SNE
C - UMAP
What is this?
a Hierarchical cluster
based on this table, match the following gene pairs to the correlation values that make the most sense
a) 1
b) 0.8
a) green (gene H and N) and blue (genes G and L)
b) yellow (genes C and F)
define Pearson's correlation coefficient
a measure of the linear relationship btw two continuous variables - how well the pattern of one variable can predict the pattern of another variable
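A short sketch of the calculation, assuming two equal-length lists of expression values. The three profiles below are invented so that one pair gives a correlation of exactly 1 and another gives 0.8.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # How the two variables move together around their means.
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # How much each variable moves on its own.
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


# Invented expression profiles (assumption, for illustration only).
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]  # always exactly double profile_a
profile_c = [1.0, 3.0, 2.0, 4.0]  # same overall trend, less tidy

r_ab = pearson(profile_a, profile_b)
r_ac = pearson(profile_a, profile_c)
```

A value near 1 means the two patterns rise and fall together, a value near 0 means no linear relationship, and a value near -1 means one goes up as the other goes down.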
What is used to create a hierarchical analysis?
Pearson's correlation coefficient
based on the image
a) fill in the blanks
b) what does each colour represent in this context
a) yellow - lower, green - higher
b) different samples
a) intravariability
b) intervariability
a) variation observed within a single individual/group
b) variation observed between different groups
based on this image which is demonstrating the following
a) intravariability
b) intervariability
c) describe the relationship for both
d) which ones has the most variation
e) which one is better for identification of the cells why?
c) right: linear relationship; left: random (no correlation)
d) left
e) left –> more informative because being able to differentiate the expression levels depending on the cell tells you what kind of gene it likely is
a) which genes are the most informative
b) which genes are the least informative
c) why
a) gene 1 and gene 9
b) genes 4-7
c) genes 1 and 9 are more distinct in their expression and thus it is easier to determine what type of genes they are
No longer looking at _____. we are looking at _____. The cells’ positions result from the genes that are _______ between them.
genes, cells, variably expressed
define Eigenvectors
vectors that point in the direction of maximum variance in the data when performing PCA. Each eigenvector corresponds to a principal component (PC) and captures the largest possible variation in the data along that axis.
Define what a PCA does
Takes a dataset with a lot of dimensions (genes measured per cell) and flattens it to 2 or 3 dimensions so we can look at it more clearly
Just by looking at the raw data, you can tell that cell 1 and cell 2 have _______ expression profiles. (similar/distinct)
That is, their gene expression has a _______ correlation. (low/high)
similar, high
Select all that apply wrt the following image
a) high correlation
b) low correlation
c) cells are similar
d) cells are distinct
a + c
Select all that apply wrt the following image
a) high correlation
b) low correlation
c) cells are similar
d) cells are distinct
b and d
Which group(s) are of high value, and which group(s) are of low value? Why?
high value: 1 + 4 –> show the most variation between the two cells, making it easier to analyze the gene significance
low value: 2 + 3 –> show the least variation between the two cells making it harder to analyze the gene significance
Fill in the blanks
green: 2, 1
yellow: less, similar
blue: 1, 2
What is the formula for calculating the PC score for a cell?
PC score for a cell = sum of (gene expression x gene loading) across all genes
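The formula above can be sketched directly. The loadings and expression values are invented for illustration; in practice the loadings come out of the PCA itself.

```python
def pc_score(expression, loadings):
    """PC score = sum of (gene expression x gene loading) across all genes."""
    return sum(expression[gene] * loadings[gene] for gene in loadings)


# Invented loadings for PC1 (assumption): gene1 and gene9 pull in opposite directions,
# gene4 hardly contributes.
loadings_pc1 = {"gene1": 0.7, "gene9": -0.7, "gene4": 0.1}

# Invented expression values for two cells (assumption).
cell_1 = {"gene1": 10.0, "gene9": 0.0, "gene4": 5.0}
cell_2 = {"gene1": 0.0, "gene9": 10.0, "gene4": 5.0}

score_1 = pc_score(cell_1, loadings_pc1)
score_2 = pc_score(cell_2, loadings_pc1)
```

The two cells land at opposite ends of PC1 (about 7.5 and -6.5) because they differ in the heavily loaded genes, while gene4, expressed equally in both, adds the same small amount to each score.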
what is gene loading also known as?
Eigenvectors
T or F - the gene loading/Eigenvectors are given to you before you calculate the PC score
T
Which of the following will have the most variation
a) PC1
b) PC2
c) UMAP
d) t-SNE
e) a and b
a
a) green = ? (most/2nd most)
b) yellow = ? (most/2nd most)
c) Note that each one of these points represents a ______. (cell/gene)
d) The position of that point is a reflection of the ________ of all of its ______ that vary relative to every other ______.
a) 2nd most
b) most
c) cell
d) expression, genes, cell
Fill in the blanks/answer the question
a) Genes with the _________ (smallest/largest) variation btw cells will have the ________ (least/most) influence on the principal components
b) what causes the most variation between cells?
c) PC__ (1/2) captures the most variation in the data
d) PC__ (1/2) captures the 2nd most variation in the data
e) Cells with _______ (similar/distinct) transcription patterns will cluster together
a) largest
b) genes that are highly expressed in cell 1 but not expressed in cell 2
c) 1
d) 2
e) similar
T or F - PC1 captures the most variation in the dataset, while PC2 captures the least variation in the dataset
F - PC2 captures the 2nd most, not the least
T or F - based on this image, the variation between clusters x and y is equal to the variation between clusters z and y
F - the variation in PC1 is greater than in PC2; thus the variation between clusters y and z is higher than between clusters x and y