Guest Lecture: Big Data Flashcards
What is big data in health?
Big data in health encompasses high volume, high diversity biological, clinical, environmental, and lifestyle information collected from single individuals to large cohorts, in relation to their health and wellness status, at one or several time points.
EX: electronic health records (EHR)
mammography
gene data
How did the emergence of big data change where and how we collect data?
- clinical trials
- EHR
- patient registries and databases
- multidimensional data from genomics, epigenomics, transcriptomics, proteomics, metabolomics, and microbiomics (OMICS)
- medical imaging
More recently:
- social media
- socioeconomic or behavioural indicators
- mobile applications
- environmental monitoring
What are OMICS?
genomics/epigenomics/transcriptomics - large-scale study of genes/epigenetic modifications/RNA
proteomics - large-scale study of proteins
metabolomics - large-scale study of metabolites
microbiomics - large-scale study of the genes of microbiota
Why do we need to integrate big data in health science?
“Big data in health can be used to improve the efficiency and effectiveness of prediction and prevention strategies or of medical interventions, health services, and health policies.”
How can we make sense of and use big data?
Machine learning (“black box”) –> clinical trial
Biology based model –> use big data to understand the association/mechanism in the biological system –> clinical trial
What is machine learning?
“Machine learning is the science (and art) of programming computers so they can learn from data”
What is the hierarchy of evidence?
From weakest to strongest evidence:
- animal and lab studies
- case report or case series
- case-control studies
- cohort studies
- randomized controlled trials
- systematic review and meta-analysis
Where are fabp2 and fabp6 located?
in the small intestine
What is the objective of Yiheng’s study?
To analyze sex-specific gene expression programs using Fabp gene-disrupted mice
What is a microarray?
A DNA microarray (also commonly known as a DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface, used to survey the sample for target cDNA sequences, which anneal to the spots
What is the p value?
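p value:
o probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true
o significance level (the cut-off, e.g. 0.05): probability of rejecting the null hypothesis when the null is true
o (a p value below the cut-off suggests the two groups really differ, rather than the difference being due to chance)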
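To make the idea of a p value concrete, here is a minimal sketch using an exact permutation test. This technique is swapped in for illustration only and is not the test used in the lecture. The two groups of numbers and the function name `permutation_p_value` are hypothetical. The null hypothesis is that the group labels do not matter, so the code counts how many relabelings give a difference in means at least as large as the one observed.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Enumerate every way of relabeling the pooled values into two groups
    # of the original sizes (this is what "the null is true" means here).
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical measurements for two groups of three samples each.
p = permutation_p_value([1, 2, 3], [4, 5, 6])
print(p)
```

Only 2 of the 20 possible relabelings are as extreme as the observed split, so the p value is 2/20 = 0.1, which is above a 0.05 cut-off despite the groups not overlapping.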
What are multiple testing issues?
Omics data are high-dimensional –> hundreds to hundreds of thousands of variables
• Lots of hypothesis tests
• Running t-tests on microarray data can mean more than 20000 separate hypothesis tests (one per gene).
• If we use a standard p value cut-off of 0.05, we would expect about 1000 (20000*0.05) genes to be called “significant” by chance alone, even if no gene truly differs between groups
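The 20000*0.05 arithmetic can be checked with a small simulation. This is a sketch under one assumption: when the null hypothesis is true and the test statistic is continuous, p values are uniformly distributed between 0 and 1, so random numbers stand in for the p values of 20000 genes with no real effect. The function name and the seed are illustrative.

```python
import random


def count_false_positives(n_tests, alpha, seed=0):
    """Count tests called 'significant' when every null hypothesis is true."""
    rng = random.Random(seed)
    # Assumption: under a true null, each p value is uniform on [0, 1).
    p_values = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in p_values)


n_tests = 20000
alpha = 0.05
expected = n_tests * alpha
observed = count_false_positives(n_tests, alpha)
print(f"expected by chance: {expected:.0f}, simulated: {observed}")
```

The simulated count varies with the seed but stays close to 1000, which is why an uncorrected 0.05 cut-off is not usable on its own for omics data.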
What are the methods for multiple testing correction?
Family Wise Error Rate (FWER) - e.g. Bonferroni
False Discovery Rate (FDR) - e.g. Benjamini-Hochberg
Explain the Family Wise Error Rate (FWER) - e.g. Bonferroni
Use a corrected cut-off: “corrected p.value = p.value/n” (p.value = 0.05; n = number of genes in the list)
For example, if 20,000 genes are tested at a time, the highest accepted individual p value is 0.05/20,000 = 0.0000025, which makes the correction very stringent
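The Bonferroni cut-off is a single division, sketched below. The function names and the four example p values are hypothetical and only illustrate the rule.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p value cut-off that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p value that passes the Bonferroni-corrected cut-off."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# The lecture's example: 20,000 genes tested at alpha = 0.05.
cutoff = bonferroni_cutoff(0.05, 20000)
print(cutoff)

# Hypothetical p values for four tests; the cut-off here is 0.05 / 4 = 0.0125.
flags = bonferroni_significant([0.00001, 0.02, 0.04, 0.5])
print(flags)
```

Note that 0.02 and 0.04 would pass an uncorrected 0.05 cut-off but fail after correction, which is the stringency the card refers to.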
Explain the False Discovery Rate (FDR)
An FDR of 0.05 means that 5% of the genes called significant are expected to be false positives
For example, if 100 genes are identified as DE genes, about 5 of them are expected to be false positives.
By controlling the FDR, we can control the expected proportion of “discoveries” (rejected H0) that are false (incorrect rejections)
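A minimal sketch of the Benjamini-Hochberg step-up procedure, the FDR method named earlier in these cards, is shown below. The function name and the example p values are hypothetical. The rule: sort the m p values, find the largest rank k whose p value is at most (k/m) * q, and reject the null for every test ranked k or lower.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null is rejected at FDR q."""
    m = len(p_values)
    # Indices of the p values, ordered from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p values for ten genes.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example, q=0.05))
```

In this example only the first two genes are rejected. A Bonferroni cut-off of 0.05/10 = 0.005 would keep only the first, which shows why FDR control is the less stringent choice.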