Big Data Flashcards

1
Q

what is big data?

A

refers to data sets too large or complex to process using traditional data processing methods
- large volumes of data, often comprising multiple data types
- there is substantial variation within the data which is complex to analyse
- integrative analysis of different types of big data reveals interactions between variables

2
Q

who analyses big data?

A
  • computational methods and advanced statistics are used by bioinformaticians to analyse data
3
Q

are big data experiments hypothesis-based or hypothesis-generating?

A

they are unbiased and hypothesis-generating
- they have huge power for discovery
- no need to choose and exclude markers in advance

4
Q

where can big data be generated from?

A
  • DNA, RNA, protein molecules
  • cells, tissues
  • organisms
5
Q

what are OMICs in big data?

A
  1. genomics (DNA) and transcriptomics (RNA) - rely on sequencing of nucleic acids
    - short read (Illumina) and long read (PacBio, Nanopore) sequencing
    - RNA-seq
  2. Proteomics and metabolomics
    - mass spectrometry
  3. epigenomics
    - ChIP-Seq, chromatin conformation
6
Q

how can microscopy be used to generate big data?

A
  • high throughput imaging
  • fluorescent tagging in live cells
  • fixed cell staining
  • automated image analysis (machine learning/AI)
7
Q

what big data can microscopy generate?

A
  • cell shape/cell type
  • subcellular protein localisation
  • cell differentiation
  • cell contractility and migration -> wound healing, sclerosis, metastasis
  • infection status
  • response to drugs
8
Q

how can big data on human physiology/health be generated?

A
  • activity tracking
  • questionnaires
  • blood samples
  • whole body imaging
  • electronic health records
9
Q

what knowledge does big data contribute to biology?

A
  • Development
  • Physiology
  • Drug safety and efficacy
  • Epidemiology – identifies relationships between environmental exposures / genetic predispositions and disease risk -> reduce exposures
  • Disease pathobiology – understand how interactions between exposures and predispositions affect health -> more effective diagnosis and treatment
  • Understanding of past events and prediction of future risks
10
Q

what is transcriptomics?

A

studies gene expression by measuring mRNA
- determines the functional consequences of a condition on the expression of every gene in the tissue, organ or cell type of interest, or at a particular developmental stage

comparisons may be:
- wildtype vs mutant
- treated vs untreated
- untreated vs environmental change

11
Q

what is an experimental strategy in transcriptomics? what steps does it involve

A
  1. Extract mRNA from whole tissue or cell population, convert to cDNA
  2. Prepare a sequencing ‘library’ containing all cDNA molecules in each biological sample
  3. Sequence on an Illumina Next Generation Sequencing (NGS) machine.
  4. Run a series of computational steps (‘pipeline’ = quality control and normalising/standardising the data) and make statistical comparisons
  5. cDNA counts reflect mRNA expression level

identify genes exhibiting differential expression in the compared cell types
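
Steps 4 and 5 can be illustrated with a minimal Python sketch. The gene names, the counts and the choice of counts-per-million normalisation are illustrative assumptions, not part of the original cards; real pipelines use dedicated statistical packages and replicate samples.

```python
import math

# Hypothetical raw cDNA read counts per gene (made-up numbers).
wildtype = {"geneA": 100, "geneB": 300, "geneC": 600}
mutant = {"geneA": 400, "geneB": 300, "geneC": 300}

def counts_per_million(counts):
    """Scale counts so samples with different sequencing depth are comparable."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }

wt_cpm = counts_per_million(wildtype)
mut_cpm = counts_per_million(mutant)
lfc = log2_fold_change(wt_cpm, mut_cpm)

for gene, value in lfc.items():
    print(f"{gene}: log2 fold change = {value:+.2f}")
```

A positive value means higher expression in the mutant, a negative value lower expression, and a value near zero no change.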

12
Q

what plot can be used to display big data on transcriptomics?

A

volcano plot
- each dot represents a gene
- x-axis: fold-change, i.e. how much gene expression increases or decreases
- y-axis: statistical significance of the difference in gene expression
- red dots = downregulated genes
- green dots = upregulated genes
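
The coordinates of each dot can be sketched in Python as below. The y-axis value is commonly plotted as -log10 of the p-value; the gene names, values and both cut-offs are assumptions chosen for illustration.

```python
import math

# Hypothetical results: gene -> (log2 fold change, p-value). Made-up numbers.
results = {
    "geneA": (2.5, 0.0001),
    "geneB": (-1.8, 0.002),
    "geneC": (0.3, 0.40),
    "geneD": (1.6, 0.20),
}

# Assumed cut-offs; appropriate values depend on the study.
LFC_CUTOFF = 1.0
P_CUTOFF = 0.05

def volcano_point(log2_fc, p_value):
    """Return (x, y, label) for one gene on a volcano plot."""
    y = -math.log10(p_value)  # smaller p-value -> higher on the plot
    if p_value < P_CUTOFF and log2_fc >= LFC_CUTOFF:
        label = "upregulated"
    elif p_value < P_CUTOFF and log2_fc <= -LFC_CUTOFF:
        label = "downregulated"
    else:
        label = "not significant"
    return log2_fc, y, label

points = {gene: volcano_point(fc, p) for gene, (fc, p) in results.items()}
for gene, (x, y, label) in points.items():
    print(f"{gene}: x={x:+.1f}, y={y:.2f}, {label}")
```

Note that geneD has a large fold-change but is not labelled as upregulated, because its difference is not statistically significant.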

13
Q

what methods can help to interpret the consequences of gene expression changes?

A

gene ontology and biological pathway algorithms:
- These algorithms can be run on the data to interpret the consequences of gene expression changes
- Differentially expressed genes are fed into algorithms which extract information from databases about the functions of those genes and summarise it
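
One common way such algorithms summarise function is an over-representation test. The sketch below uses a hypergeometric test, which is an assumption about the method (the cards do not name one), and all numbers are made up.

```python
from math import comb

def enrichment_p_value(genome_size, term_size, de_size, overlap):
    """P(at least `overlap` term genes among the DE genes) under random sampling."""
    total = comb(genome_size, de_size)
    upper = min(term_size, de_size)
    tail = sum(
        comb(term_size, k) * comb(genome_size - term_size, de_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Made-up example: 20,000 genes, 200 annotated with the term,
# 100 differentially expressed (DE) genes, 10 of which carry the term.
p = enrichment_p_value(20_000, 200, 100, 10)
print(f"enrichment p-value = {p:.3g}")
```

By chance only about 1 of the 100 DE genes would carry the term, so observing 10 gives a very small p-value and the term would be reported as enriched.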

14
Q

how can the transcriptome of 100-10,000s of individual cells be collected?

A

single cell RNA-seq:
1. Dissect tissue, treat with enzymes
2. Single cell suspension – contains a mixture of cell types from tissue
3. Prepare libraries and sequence the transcriptome of every cell

15
Q

what plot can be used to display the transcriptome of thousands of individual cells? what do these plots give insights into?

A

UMAP plots:
- Each dot is a cell
- Close = similar, far away = more different
- Each colour marks ‘clusters’ of similar cells

Potential insights into:
- Which genes are expressed by particular cells
- Cell type-specific gene expression changes
- Cell lineage/differentiation trajectories
- Tissue composition changes
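
The idea that "close = similar" can be illustrated without UMAP itself. The sketch below is not UMAP: it only computes plain Euclidean distances between made-up expression profiles, the kind of similarity that a 2D embedding tries to preserve. Cell names and values are hypothetical.

```python
import math

# Made-up expression of three marker genes in three cells.
cells = {
    "neuron_1": [9.0, 0.5, 0.2],
    "neuron_2": [8.5, 0.7, 0.1],
    "muscle_1": [0.3, 7.8, 6.9],
}

def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.dist(a, b)

same_type = distance(cells["neuron_1"], cells["neuron_2"])
different_type = distance(cells["neuron_1"], cells["muscle_1"])
print(f"neuron_1 vs neuron_2: {same_type:.2f}")
print(f"neuron_1 vs muscle_1: {different_type:.2f}")
```

Cells of the same type have similar profiles and a small distance, so they would be expected to fall in the same cluster.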

16
Q

how can genetic causes of disease/disease-associated genes be identified?

A

Genome-Wide Association Studies (GWAS) can identify genes affecting disease risk:
- humans have ~3x10^7 single nucleotide polymorphisms (SNPs) distributed randomly across the genome
- some people may have a different nucleotide in a certain position compared to others

GWAS identify SNP alleles that are found more frequently in patients (cases) than in healthy individuals (controls)
- high scoring SNPs are thus associated with the disease and may play causative roles in the disease process
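
The case-versus-control comparison for a single SNP can be sketched as a chi-square test on allele counts. This is a simplified illustration with made-up counts; the choice of an allelic chi-square test is an assumption, and real GWAS typically use regression models with covariates.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution with 1 d.f., via erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """Odds of carrying the alternative allele in cases relative to controls."""
    return (case_alt * control_ref) / (case_ref * control_alt)

# Made-up allele counts for one SNP (each person contributes two alleles).
chi2, p = allelic_chi_square(600, 1400, 450, 1550)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, OR = {odds_ratio(600, 1400, 450, 1550):.2f}")
```

An odds ratio above 1 means the allele is more frequent in cases; the p-value is the "score" that is repeated for every SNP across the genome.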

17
Q

how are GWAS results presented?

A

Manhattan plots:
- these map DNA sequence variants associated with a disease at genome-scale
- strong disease-associated SNPs are outliers
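
The values plotted on a Manhattan plot can be sketched as follows: genomic position on the x-axis and -log10 of the p-value on the y-axis. The SNP identifiers and p-values are made up, and the 5e-8 threshold is a widely used convention that is assumed here rather than taken from the cards.

```python
import math

# Widely used genome-wide significance threshold; treated here as an assumption.
GENOME_WIDE_THRESHOLD = 5e-8

# Made-up results: (SNP id, chromosome, position, p-value).
snps = [
    ("snp1", 1, 1_200_000, 0.30),
    ("snp2", 1, 5_600_000, 2e-9),
    ("snp3", 2, 800_000, 1e-4),
    ("snp4", 2, 9_900_000, 4e-12),
]

def manhattan_points(results):
    """Return (chromosome, position, -log10 p) sorted in genome order."""
    ordered = sorted(results, key=lambda r: (r[1], r[2]))
    return [(chrom, pos, -math.log10(p)) for _, chrom, pos, p in ordered]

points = manhattan_points(snps)
hits = [snp_id for snp_id, _, _, p in snps if p < GENOME_WIDE_THRESHOLD]
print(points)
print("genome-wide significant:", hits)
```

SNPs that pass the threshold are the tall outliers that stand above the rest of the plot.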

18
Q

why must we be careful when interpreting disease-associated SNPs?

A
  1. The SNP does not necessarily affect the closest gene; it may affect a regulatory gene instead
  2. The disease-associated SNP is not always responsible for the increased/decreased disease risk. The SNP identified may in fact be close to the actual SNP that causes the altered disease risk
    - known as linkage disequilibrium

further investigation is required to understand SNP disease-association
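
Linkage disequilibrium between two SNPs is often quantified as r^2. The sketch below computes it from made-up haplotypes coded 0/1; the data and the function name are hypothetical.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs from (allele_at_snp1, allele_at_snp2) pairs coded 0/1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Made-up haplotypes: the two alleles almost always travel together.
linked = [(1, 1)] * 45 + [(0, 0)] * 45 + [(1, 0)] * 5 + [(0, 1)] * 5
# Made-up haplotypes: the alleles are inherited independently.
unlinked = [(1, 1)] * 25 + [(0, 0)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25

print(f"linked r^2 = {ld_r_squared(linked):.2f}")
print(f"unlinked r^2 = {ld_r_squared(unlinked):.2f}")
```

When r^2 is high, a SNP that merely sits near the causal variant shows almost the same disease association, which is why the top-scoring SNP cannot be assumed to be causal.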

19
Q

what can combining GWAS results with gene expression data achieve?

A
  • Identify the gene(s) whose expression levels are linked to the SNP allele
  • Identify the cell type(s) in which the genetic variant(s) have functional consequences
  • Reveal how those variants might regulate gene expression

Big data integration reveals and refines insights into the biological process
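
Linking a SNP allele to expression levels can be sketched as a simple regression of expression on the number of copies of the allele. The data are made up and least-squares regression is an assumed method, shown only to make the idea concrete.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data: copies of the risk allele (0, 1 or 2) per person
# and expression of a nearby gene in the same people.
genotype = [0, 0, 1, 1, 2, 2]
expression = [10.0, 12.0, 15.0, 17.0, 20.0, 22.0]

slope, intercept = linear_fit(genotype, expression)
print(f"expression changes by {slope:.1f} units per risk allele")
```

A clearly non-zero slope suggests the variant is linked to that gene's expression; repeating the fit per cell type would indicate where the variant has functional consequences.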

20
Q

give an example of a population-scale big data project

A

The 100,000 genomes project:
- Whole genome sequencing (WGS) to improve diagnosis of rare diseases and cancer care in the NHS through personalised medicine
- Data available to researchers
- 100,000 Genomes: 16.1% of rare disease patients received a molecular diagnosis

21
Q

what is the UK biobank?

A

a prospective cohort study of 500,000 UK adults aged 40-69 at recruitment:
- Monitored over time: years/decades
- An integrated database for population-scale studies of health and disease, combining genetics, deep phenotyping, and electronic medical records:
- Demographic / socioeconomic
- Electronic health records (NHS)
- Physical activity monitoring
- Anatomical, Physiological, Biochemical, Genomic

Doctors can then use these past records to help with diagnosis – identify biomarkers of disease

22
Q

why is big data important in the social gradient of health?

A

There is a social gradient in health, affecting Total and Healthy Life Expectancy:
- In England, poor neighbourhoods have a greater burden of ill-health than wealthy ones
- COVID-19 has had a proportionally higher impact on the most deprived areas of England

Big data is essential to understand how genetic predispositions, environmental exposures and social factors lead to disease