Guest Lecture: Big Data Flashcards

Question 1

Q

What is big data in health?

Answer

A

Big data in health encompasses high volume, high diversity, biological, clinical, environmental, and lifestyle information collected from single individuals to large cohorts, in relation to their health and wellness status, at one or several time points.

EX: electronic health records (EHR)
mammography
gene data

Question 2

Q

How did the emergence of big data change where and how we collect data?

Answer

A

clinical trials
EHR
patient registries and databases
multidimensional data from genomics, epigenomics, transcriptomics, proteomics, metabolomics, and microbiomics (OMICS)
medical imaging

more recently

social media
socioeconomic or behavioural indicators
mobile applications
environmental monitoring

Question 3

Q

What are OMICS?

Answer

A

genomics/epigenomics/transcriptomics - large-scale study of genes/epigenetic modifications/RNA

proteomics - large-scale study of proteins

metabolomics - large-scale study of metabolites

microbiomics - large-scale study of the genes of microbiota

Question 4

Q

Why do we need to integrate big data in health science?

Answer

A

“Big data in health can be used to improve the efficiency and effectiveness of prediction and prevention strategies or of medical interventions, health services, and health policies.”

Question 5

Q

How can we make sense of and use big data?

Answer

A

Machine learning (“black box”) –> clinical trial

Biology-based model –> use big data to understand the association/mechanism in the biological system –> clinical trial

Question 6

Q

What is machine learning?

Answer

A

“Machine learning is the science (and art) of programming computers so they can learn from data”

Question 7

Q

What is the hierarchy of evidence?

Answer

A

animal and lab studies 
case report or case series 
case control studies
cohort studies 
randomized controlled trials 
systematic review 
meta-analysis

Question 8

Q

Where are Fabp2 and Fabp6 located?

Answer

A

in the small intestine

Question 9

Q

What is the objective of Yiheng’s study?

Answer

A

Analysis of sex-specific gene expression programs using Fabp gene-disrupted mice

Question 10

Q

What is a microarray?

Answer

A

A DNA microarray (also commonly known as a DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface, used to survey the sample by annealing to target cDNA sequences

Question 11

Q

What is the p value?

Answer

A

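p value:
• probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true
• it is compared against the significance level (e.g. 0.05), which is the probability of rejecting the null hypothesis when the null is true
• (a small p value suggests the difference exists because the two groups are really different rather than due to chance)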
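
One way to make the p value concrete is a permutation test: shuffle the group labels many times and count how often the shuffled difference is at least as large as the observed one. The sketch below is only an illustration; the function name and the expression values are hypothetical and are not taken from the lecture.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation p value for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values breaks any real link between
        # group label and value, which is what the null hypothesis assumes.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1

    # Adding 1 to both counts keeps the estimate from being exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical expression values for one gene in two groups of mice.
wild_type = [5.1, 4.9, 5.3, 5.0, 5.2]
knockout = [6.0, 6.2, 5.9, 6.1, 6.3]

p = permutation_p_value(wild_type, knockout)
print(f"p = {p:.4f}")
```

A p value below the chosen cut-off (for example 0.05) means a difference this large would rarely arise from shuffling alone.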

Question 12

Q

What are multiple testing issues?

Answer

A

Omics data are high-dimensional –> 100s to 100,000s of variables
• Lots of hypothesis tests
• Performing t-tests on the microarray data might result in performing more than 20000 separate hypothesis tests.
• If we use a standard p value cut-off of 0.05, we would expect about 1000 (20000*0.05) genes to be called “significant” by chance alone
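
The arithmetic above can be checked with a small simulation. When no gene truly differs, p values are spread uniformly between 0 and 1, so the sketch below draws 20000 random p values and counts how many fall under 0.05. The variable names and the random seed are illustrative assumptions.

```python
import random

n_tests = 20000  # number of genes tested, as in the card above
alpha = 0.05     # standard p value cut-off

# Expected number of false positives if no gene truly differs.
expected_false_positives = n_tests * alpha

# Simulate p values under the null hypothesis (uniform on [0, 1)).
rng = random.Random(42)
null_p_values = [rng.random() for _ in range(n_tests)]
observed_false_positives = sum(p < alpha for p in null_p_values)

print("expected:", expected_false_positives)
print("simulated:", observed_false_positives)
```

The simulated count lands close to the expected value, even though every "significant" gene here is a false positive by construction.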

Question 13

Q

What are the ways of multiple testing correction?

Answer

A

Family Wise Error Rate (FWER) - e.g. Bonferroni

False Discovery Rate (FDR) - e.g. Benjamini-Hochberg

Question 14

Q

Explain the Family Wise Error Rate (FWER) - e.g. Bonferroni

Answer

A

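FWER is the probability of making at least one false positive across all tests.

Bonferroni: use “corrected p value cut-off = p value cut-off / n” (p value cut-off = 0.05; n = number of genes in the list)

For example, when testing 20,000 genes at a time, the highest accepted individual p value is 0.05/20,000 = 0.0000025, which makes the correction very stringent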
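
A minimal sketch of the Bonferroni rule, assuming a cut-off of 0.05; the helper names and the four example p values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p value cut-off after Bonferroni correction."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p value that passes the Bonferroni-corrected cut-off."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# The 20,000-gene case from the card above.
threshold = bonferroni_threshold(0.05, 20000)
print(threshold)

# Hypothetical p values for a list of 4 genes (cut-off = 0.05 / 4 = 0.0125).
example_p = [0.0001, 0.004, 0.03, 0.2]
flags = bonferroni_significant(example_p)
print(flags)
```

Note that 0.03 would pass an uncorrected 0.05 cut-off but fails once the correction is applied.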

Question 15

Q

Explain the False Discovery Rate (FDR)

Answer

A

An FDR of 0.05 means that 5% of the significant genes are expected to be false positives

For example, if 100 genes are identified as DE genes, about 5 of them are expected to be false positives.

By controlling the FDR, we can control the expected proportion of “discoveries” (rejected H0) that are false (incorrect rejections)
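
The Benjamini-Hochberg procedure named on the earlier card is a common way to control the FDR. It sorts the p values, finds the largest rank k whose p value is at most (k / m) * FDR, and rejects H0 for every gene up to that rank. The sketch below is a plain-Python illustration with hypothetical p values, not the implementation used in the lecture.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return True for each test whose H0 is rejected at the given FDR."""
    m = len(p_values)
    # Indices of the p values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank k such that p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank

    # Reject H0 for all tests ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p values for 8 genes.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(example_p))
```

For these 8 values, Bonferroni (cut-off 0.05 / 8 = 0.00625) would keep only the first gene, while Benjamini-Hochberg keeps the first two, which illustrates why FDR control is less stringent.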

Question 16

Q

What are functional annotations?

Answer

A

“the process of collecting information about and describing a gene’s biological identity—its various aliases, molecular function, biological role(s), subcellular location, and its expression domains”

Question 17

Q

What are KEGG pathways?

Answer

A

Kyoto Encyclopedia of Genes and Genomes

Question 18

Q

What is personalized nutrition?

Answer

A

Trying to integrate all the information that could influence nutritional response, e.g. bacteria and how they can affect our nutrient metabolism.
Uses information from the microbiome, blood, and questionnaires on family history, lifestyle, anthropometrics, and a food diary to generate functions and equations in the “black box” that predict how these individuals will respond to a certain diet.

Question 19

Q

What is GWAS?

Answer

A

Genome-wide association studies

Question 20

Q

What gene is FGF21?

Answer

A

Fibroblast growth factor 21 (sweet preference test)