Lecture 28- Big data Flashcards

1
Q

Outline the biological data science pyramid

A
  1. Big data
  2. Information
  3. Knowledge
  4. Insight
2
Q

What populations is big data gathered from?

A
  1. DNA, RNA, proteins
  2. Cells
  3. Tissue samples
  4. Organisms
3
Q

What does big data involve?

A
  1. Gathering large volumes of data
  2. Substantial variation within the data
  3. Integrative analysis of different types of big data reveals interactions between variables
4
Q

What are the 4 main types of big data?

A
  1. Transcriptomic
  2. Genomic
  3. Proteomic
  4. Epigenomic
5
Q

What sort of knowledge is generated using big data in biology?

A
  1. Developmental
  2. Physiological
  3. Drug safety and efficacy
  4. Epidemiology
  5. Understanding past events and predicting future risks
6
Q

What is the aim of transcriptomic analyses of developmental processes, drug treatments and environmental factors?

A

To define the functional consequences of a specific mutation, drug treatment or other environmental change on expression of every gene

7
Q

How can transcriptomic data be generated and compared?

A
  • Generated by sequencing complementary DNA copies of every mRNA
  • Compare the mRNA populations in 2 or more biological samples to identify genes whose expression differs, where the difference is likely to be caused by an actual biological difference between the samples
8
Q

What was the experimental strategy used for transcriptomic analysis (see notes)?

A
  1. mRNA is extracted
  2. mRNA is converted to cDNA and each cDNA molecule is sequenced on an Illumina next-generation sequencer
  3. Numbers of independent molecules corresponding to each specific mRNA are counted in silico
  4. cDNA counts = mRNA expression level
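
The counting step above can be sketched in a few lines. This is a minimal illustration, assuming each sequenced cDNA molecule has already been assigned to a gene; the gene names and the counts-per-million scaling are hypothetical additions, not part of the notes.

```python
from collections import Counter

# Hypothetical input: one gene name per sequenced cDNA molecule,
# i.e. reads that have already been assigned to a gene.
reads = ["GENE_A", "GENE_B", "GENE_A", "GENE_C", "GENE_A", "GENE_B"]


def count_reads(assigned_reads):
    """Count how many cDNA molecules were assigned to each gene."""
    return Counter(assigned_reads)


def counts_per_million(counts):
    """Scale raw counts by library size so samples of different depth compare."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


counts = count_reads(reads)
cpm = counts_per_million(counts)
```

Here the raw count per gene stands in for the mRNA expression level; the scaling step is one common way to compare samples sequenced to different depths.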
9
Q

What data analysis is performed for the transcriptomic analysis (see notes)?

A
  1. Genes exhibiting differential expression (DE) in the compared sample types are identified, ranked and presented in a data frame
  2. Volcano plot: gene expression levels are plotted as log2 fold changes vs p-values
  3. Gene ontology and biological pathway algorithms are used to identify and visualise biological pathways involving the genes on the DE list
  4. Functional biological consequences of the DE genes are inferred
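
The volcano plot coordinates in step 2 can be computed as below. This is a sketch under stated assumptions: the gene names, counts, p-values and the two cutoffs are hypothetical, and the p-values are taken as given rather than calculated.

```python
import math

# Hypothetical DE results: gene -> (mean count control, mean count treated, p-value)
results = {
    "GENE_A": (100.0, 800.0, 1e-6),
    "GENE_B": (200.0, 210.0, 0.6),
    "GENE_C": (400.0, 50.0, 1e-4),
}


def volcano_point(control, treated, p_value):
    """Return (log2 fold change, -log10 p): the x and y of a volcano plot."""
    return math.log2(treated / control), -math.log10(p_value)


def classify(log2_fc, neg_log10_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene using illustrative cutoffs (assumed, not from the notes)."""
    if neg_log10_p < -math.log10(p_cutoff) or abs(log2_fc) < fc_cutoff:
        return "not significant"
    return "up" if log2_fc > 0 else "down"


labels = {}
for gene, (control, treated, p) in results.items():
    x, y = volcano_point(control, treated, p)
    labels[gene] = classify(x, y)
```

Genes that land far from zero on the x axis and high on the y axis are the ones a volcano plot highlights as candidates for the DE list.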
10
Q

What were the aim, procedure and analysis of the RNAseq experiment comparing human skin cells treated with and without the glucocorticoid clobetasol propionate?

A

Aim: to identify genes that are regulated by the glucocorticoid in human skin cells

  • RNA count data were collected for each sample and a differential expression gene list was compiled
  • A volcano plot identified genes exhibiting statistically significant changes in transcript abundance caused by exposure to glucocorticoid
  • Statistical significance tends to rise as the fold change increases
  • The more robust the change in gene expression, the more statistically significant the data
11
Q

What does a more robust gene expression mean to the significance of the data?

A

The more robust the gene expression is, the more the data becomes statistically significant

12
Q

Explain the RNAseq experiment that investigated the effect of hypoxia

A
  1. Compare cells cultured in normoxic and hypoxic conditions
  2. RNAseq identified genes that are upregulated and downregulated in response to hypoxic conditions
  3. Hypoxia modulates the transcription of many hundreds of genes in human cells
  4. A volcano plot identified the most robustly upregulated and downregulated genes that exhibited significant changes in transcript abundance caused by hypoxia
  5. Information analysed using gene ontology analysis
13
Q

What biological processes modulated by hypoxia-regulated genes did PANTHER gene ontology analysis identify?

A

Hypoxia induced genes: main roles in metabolism, development and transcription

Hypoxia repressed genes: main roles in metabolism, development, transcription and mRNA processing/splicing

14
Q

What sort of analysis can identify biological processes modulated by hypoxia-regulated genes?

A

PANTHER gene ontology analysis
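
Gene ontology tools typically ask whether a term appears on the gene list more often than chance would predict. The sketch below uses a generic hypergeometric over-representation test as an illustration; PANTHER's own statistical procedure may differ, and all the numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, term_genes, de_genes, overlap):
    """P(X >= overlap) when de_genes are drawn at random from total_genes,
    of which term_genes are annotated with the GO term."""
    max_overlap = min(term_genes, de_genes)
    numerator = sum(
        comb(term_genes, k) * comb(total_genes - term_genes, de_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return numerator / comb(total_genes, de_genes)


# Hypothetical numbers: 20,000 genes in total, 400 annotated to one term,
# 300 genes on the DE list, 20 of which carry the annotation.
p = hypergeom_enrichment_p(20_000, 400, 300, 20)
```

With these numbers about 6 annotated genes would be expected by chance, so observing 20 gives a small p-value and the term would be reported as enriched.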

15
Q

What gene expression is modulated by hypoxia and how was this identified?

A
  1. After RNAseq and gene ontology analysis, RNAi can further identify biological processes
  2. RNAi can inactivate the function of a TF which plays a role in responding to hypoxia
  3. REST is a transcriptional repressor that has been identified
  4. Hypoxia-repressed genes require the function of REST
  5. REST represses the expression of some hypoxia-responsive genes
16
Q

Which 3 tools can be used to identify the genes and biological processes affected by hypoxia/used in transcriptomic analysis?

A
  1. RNAseq
  2. Gene ontology analysis
  3. RNAi
17
Q

What types of analysis can be used to identify genetic causes of disease?

A

Genomic sequencing

18
Q

Why is it impossible to identify ‘the gene’ which causes a disease?

A

Because there are often many genes involved

19
Q

What results and conclusions can be gained from a Manhattan plot?

A

Can summarise results of GWAS analysis for SNPs associated with a particular disease

Can conclude strong disease associated SNPs on particular chromosomes
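
A Manhattan plot places each SNP by chromosome and position on the x axis and by -log10(p) on the y axis. The sketch below prepares those points and lists the SNPs that pass a threshold. The SNP identifiers, positions and p-values are hypothetical, and the 5e-8 cutoff is a commonly used genome-wide convention rather than something stated in the notes.

```python
import math

# Hypothetical GWAS results: (SNP id, chromosome, position, p-value)
gwas = [
    ("rs_example_1", 1, 1_000_000, 0.2),
    ("rs_example_2", 6, 32_000_000, 3e-12),
    ("rs_example_3", 6, 32_500_000, 4e-9),
    ("rs_example_4", 12, 5_000_000, 1e-5),
]

GENOME_WIDE_P = 5e-8  # conventional threshold, assumed here


def manhattan_points(results):
    """Return (chromosome, position, -log10 p) tuples in plotting order."""
    points = [(chrom, pos, -math.log10(p)) for _, chrom, pos, p in results]
    return sorted(points)


def significant_by_chromosome(results, threshold=GENOME_WIDE_P):
    """Group SNPs that pass the threshold by the chromosome they sit on."""
    hits = {}
    for snp, chrom, _, p in results:
        if p < threshold:
            hits.setdefault(chrom, []).append(snp)
    return hits


points = manhattan_points(gwas)
hits = significant_by_chromosome(gwas)
```

In this toy data both passing SNPs sit on chromosome 6, which is the kind of cluster that shows up as a tall peak on the plot.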

20
Q

What is the use of SNPs?

A

SNPs, which are distributed randomly across the genome, have been used to map areas of the genome that are more frequently associated with specific chronic disease phenotypes

21
Q

What types of analysis can be combined to understand disease mechanisms?

A

Genomic analysis of disease-risk-associated SNPs combined with transcriptomic analysis

22
Q

What is the aim when combining genomic analysis with transcriptomic analysis to understand disease mechanisms?

A

Aim: to identify genetic variation that influences human disease risk via regulation of gene transcription in disease relevant cell types

23
Q

What is the experimental strategy when combining genomic analysis with transcriptomic analysis to understand disease mechanisms?

A
  1. Whole genome sequence data is obtained for large numbers of patients and controls. Disease-associated SNPs are identified in a GWAS analysis.
  2. mRNA samples from patient and control cells are converted to cDNA and sequenced on an Illumina NGS machine.
  3. Genes linked to SNP(s) whose mRNA expression levels correlate with disease-associated SNP allele(s) are identified and ranked.
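
Step 3 can be illustrated by correlating the number of risk alleles each person carries with the expression of each linked gene, then ranking genes by the strength of that correlation. This is a minimal sketch with hypothetical genotypes and expression values; a real analysis would use a dedicated statistical model and correct for multiple testing.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: risk-allele count (0, 1 or 2) for six individuals,
# and expression of two genes linked to the SNP in the same individuals.
genotypes = [0, 0, 1, 1, 2, 2]
expression = {
    "GENE_A": [10.0, 11.0, 15.0, 16.0, 20.0, 21.0],
    "GENE_B": [12.0, 9.0, 11.0, 10.0, 12.0, 9.0],
}

# Rank genes by the absolute strength of the allele/expression correlation.
ranked = sorted(
    ((gene, pearson(genotypes, values)) for gene, values in expression.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

Here GENE_A rises with each additional risk allele and ranks first, while GENE_B shows no relationship with genotype.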
24
Q

What is the data analysis when combining genomic analysis with transcriptomic analysis to understand disease mechanisms?

A

Gene Ontology and Biological Pathway algorithms are used to identify, categorise and visualize biological pathways involving genes whose expression is correlated with disease-associated SNPs.

A Manhattan plot maps DNA sequence variant alleles associated with coeliac disease at genome-scale

25
Q

What is the aim when combining human disease risk-associated genetic variant data with gene expression data at large scale?

A

To find SNPs in regulatory sequences linked to genes that show expression changes. These SNPs may identify specific transcription factor binding sites where the SNP allele enhances or reduces transcription factor binding and therefore affects RNA transcription

26
Q

What is identified when combining human disease risk-associated genetic variant data with gene expression data at large scale?

A
  1. Identifies the linked genes whose expression is regulated
  2. Identifies the cell types in which the genetic variants are active
  3. Identifies SNPs that are most strongly associated with disease risk and linked to genes whose expression level varies according to the SNP allele
27
Q

What is one type of epigenomic data?

A

DNA methylation data

28
Q

What is DNA methylation data and how is it identified?

A

Genome-wide DNA methylation of CpG dinucleotides within cis-regulatory regions of genes is a hallmark of transcriptional repression, a form of epigenetic regulation

Methyl groups lie on the inner surface of the major groove of the DNA double helix, available for interactions with enzymes and methyl-CpG binding proteins

High densities of CpGs (CpG islands) are found in gene promoters

CpG islands are unmethylated in transcriptionally active promoters and are methylated in transcriptionally silent promoters
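
The idea of a high density of CpGs can be made concrete by scoring a sequence for GC content and for the ratio of observed to expected CpG dinucleotides. This is a sketch only: the two example sequences are made up, and the scoring shown is a widely used sequence-based heuristic, not a method described in the notes. It says nothing about methylation status, which has to be measured experimentally.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq):
    """Observed CpG count divided by the count expected from C and G frequency."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


# Hypothetical toy sequences: one CpG-dense, one with no CpG at all.
cpg_rich = "CGCGCGCG"
cpg_poor = "ATATCATG"

rich_score = cpg_observed_expected(cpg_rich)
poor_score = cpg_observed_expected(cpg_poor)
```

A ratio well above what base composition alone predicts, together with high GC content, is the usual signature of a CpG island.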

29
Q

DNA methylation data allows for what when integrated into a big data analysis?

A

Allows further understanding of the molecular mechanisms of disease and of developmental processes

30
Q

What does epigenomic data allow the identification of?

A

How the pattern of chromatin modification distribution across the genome differs between cell types

31
Q

How is proteomic data generated?

A
  1. Generated using mass spectrometry to determine the amino acid sequences of protein subunits within functional protein complexes
  2. Specific antibody is used to purify the protein complex and the components of the complex are sequenced using mass spectrometry
  3. The proteins within the complex can then be identified by comparing the peptide sequences from the mass spectrometry analysis with a database of known sequences
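
The database comparison in step 3 can be pictured as looking up each peptide in a set of known protein sequences. The sketch below uses exact substring matching on a hypothetical two-protein database; real search software matches measured spectra against predicted ones and scores the matches, so this only illustrates the lookup idea.

```python
# Hypothetical toy database of known protein sequences (one-letter codes).
database = {
    "PROTEIN_X": "MKTAYIAKQRQISFVKSHFSRQ",
    "PROTEIN_Y": "MADEEKLPPGWEKRMSRSSGRV",
}

# Hypothetical peptide sequences recovered from the purified complex.
peptides = ["AYIAKQR", "LPPGWEK", "QISFVK"]


def identify_proteins(peptide_list, known_sequences):
    """Map each protein name to the peptides found within its sequence."""
    matches = {}
    for peptide in peptide_list:
        for name, sequence in known_sequences.items():
            if peptide in sequence:
                matches.setdefault(name, []).append(peptide)
    return matches


identified = identify_proteins(peptides, database)
```

Proteins supported by several independent peptides are generally treated as more confident identifications than those supported by one.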
32
Q

Outline how proteins are analysed using mass spectrometry

A
  1. Mass spectrometric analysis of proteins in the synaptosomal complex immunopurified from mouse hippocampus using a transgenic line expressing a TAP-tagged version of the synapse-specific protein PSD95
  2. The process could be repeated using TAP-tagged versions of other specific proteins to determine whether interactions are robust and whether proteins are strongly associated with each other
33
Q

What does proteomic data identify?

A

Reveals information about protein-protein interactions and the dynamics of these interactions across time and in space

34
Q

Give 2 examples of population-scale big data projects

A
  1. Genomics England 100,000 Genomes Project
  2. The UK Biobank

35
Q

What is the aim of the Genomics England 100,000 Genomes Project?

A
  1. Whole genome sequencing to diagnose genetic causes of rare diseases and to identify mutational signatures of cancer: population-scale human genetics
  2. The project aims to provide molecular genetic diagnoses of the genetic causes of rare diseases
36
Q

What is the process involved in the Genomics England 100,000 Genomes Project?

A
  1. The process involved interviewing patients, securing consent, taking blood to make genomic DNA preps and fully sequencing their genomes
  2. Next, the patient's genome is compared with the genomes of other cases to identify patient-specific genetic variation, allowing a molecular diagnosis to be made
37
Q

What has been identified through the Genomics England 100,000 Genomes Project and why is this helpful?

A

Whole genome sequencing in the 100,000 Genomes Project provided a genetic diagnosis for 1138 of 7065 phenotyped patients, identifying 95 strong Mendelian associations between genes and rare diseases

This is helpful in the study and treatments (including new drug targets) for rare diseases

38
Q

What is epidemiology?

A

Identifies exposures and predispositions that affect health, and aims to reduce disease burden by public health interventions that reduce exposures

39
Q

What is pathobiology?

A

Seeks to understand how interactions between exposures and predispositions govern health, and aims to reduce disease burden through more effective diagnosis and treatment

40
Q

What is the aim of the UK Biobank?

A

Aims to identify, measure, monitor, confirm and potentially predict health risks

41
Q

What is the UK Biobank an integrated database of?

A
  1. Demographic/socioeconomic data
  2. Electronic health records (NHS)
  3. Physical activity monitoring
  4. Anatomical data
  5. Physiological data
  6. Biochemical samples/data
  7. Genomic samples/data