Lecture 28 Big Data Flashcards

1
Q

Define big data

A

Big data is a generic term for datasets that are so large or complex that traditional data processing applications (e.g. a desktop computer, a small server or statistical tools normally used at a small scale) are inadequate for mining them.

2
Q

What are the 3 Vs of big data?

A
  • High volume
  • High velocity
  • High variety
3
Q

Which 3 more Vs have recently been added?

A
  • High variability
  • High veracity (variation in quality)
  • High value (which comes with complexity)
4
Q

Big data is a combination of what sorts of data?

A
  • unstructured
  • semi-structured
  • structured
5
Q

Why is data collected?

A

It is collected so that it can be mined and used to build predictive models and other advanced analytics applications.

6
Q

What is structured data?

A

Data that can be catalogued

7
Q

What is unstructured data?

A

Behavioural or ambiguous data

8
Q

Is big data associated with any specific volume of data?

A

No. Big data deployments can involve terabytes (TB), petabytes (PB) and even exabytes (EB) of data, captured over time.
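
To give a feel for the scale of these units, here is a minimal sketch that converts between them. It assumes decimal (SI) units, where 1 TB = 10^12 bytes; the names `UNITS`, `to_bytes` and `convert` are illustrative and do not come from the lecture.

```python
# Decimal (SI) byte units. Binary units (TiB, PiB, EiB) would use
# powers of 1024 instead; that is a different convention.
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18}


def to_bytes(amount, unit):
    """Convert an amount given in TB, PB or EB to bytes."""
    return amount * UNITS[unit]


def convert(amount, from_unit, to_unit):
    """Convert an amount between TB, PB and EB."""
    return to_bytes(amount, from_unit) / UNITS[to_unit]


one_exabyte_in_tb = convert(1, "EB", "TB")  # one EB is a million TB
```

Each step up (TB to PB to EB) is a factor of 1,000 under this convention.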

9
Q

Why is big data important to companies?

A
  • use it to improve operations
  • provide better customer service
  • create personalized marketing campaigns
  • make faster decisions
  • make more-informed decisions
  • become more customer-centric
10
Q

Examples of big data?

A
  • Business transaction systems
  • Customer databases
  • Medical records
  • Internet clickstream logs
  • Mobile applications
  • Social networks
11
Q

Examples of big data in SCIENCE?

A
  • Scientific research repositories
  • Machine-generated data
  • Clinical records, e.g. lifestyle data, not just medical records
12
Q

How is data left?

A

The data may be left in its raw form on big servers, or preprocessed using data mining tools or data preparation software so that it can be analysed (e.g. Google/Amazon).

13
Q

How is Big Data used in Life Science?

A
  • allows identification of risk factors in disease
  • helps diagnose illnesses and conditions in individual patients
14
Q

What is Big Data derived from?

A

Big data is derived from genomics, transcriptomics and epigenomics (omics) data from many individuals. It is also derived from electronic health records, social media, the web and other sources, and provides healthcare organisations and government agencies with up-to-the-minute information on infectious disease threats or outbreaks.

15
Q

How is big data being used in the COVID-19 pandemic?

A

AI and big data are playing a key role in modelling, as well as in making predictions about the effect of the measures enforced and the science of the virus itself.

16
Q

What has poor predictive power on its own?

A

Genomics

17
Q

What else is used to increase the predictive power?

A

Lifestyle and environmental predictors

18
Q

Describe the abdominal aortic aneurysm example of Big Data

A
  1. The genome is taken alongside lifestyle and physiology data.
  2. Their genes, how these genes are activated, mutations and health records are used to build HEAL, a machine learning framework.
  3. This is carried out on many people to produce a prediction about the predisposition of individuals to the disease.
  4. Predisposition (genome) and lifestyle are balanced against each other before management of health is directed.
  5. Genes and their specific pathways are identified.

Eventually, the data is arranged to show the factors associated with risk. Red: genome; blue: eco; yellow: mixture. The closer the number is to 1, the better the model is at predicting. It is evident that the genome has low predictive power compared to lifestyle in this case.
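
The card does not name the score that is "closer to 1". One common measure of predictive power with that property is the area under the ROC curve (AUC), so the sketch below assumes AUC purely for illustration. The data are made up, and `genome_score` and `lifestyle_score` are hypothetical risk scores, not values from the study.

```python
def auc(scores, labels):
    """Probability that a randomly chosen case scores higher than a
    randomly chosen control (ties count as half a win)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy, made-up data: 1 = has the disease, 0 = does not.
labels = [1, 1, 1, 0, 0, 0]
genome_score = [0.6, 0.4, 0.5, 0.5, 0.3, 0.7]     # hypothetical
lifestyle_score = [0.9, 0.5, 0.8, 0.4, 0.2, 0.6]  # hypothetical

genome_auc = auc(genome_score, labels)
lifestyle_auc = auc(lifestyle_score, labels)
```

On this toy data the genome-only score gives 0.5, which is no better than chance. The lifestyle score gives about 0.89, mirroring the pattern the card describes.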

19
Q

Describe the challenges of Big Data in life science

A
  • Data analysis
  • Data curation i.e. making it as high quality as possible
  • Search engines, i.e. making them powerful enough to search across all the data points
  • Data sharing, i.e. many different labs need to share data to build up big data
  • Data storage and transfer
  • Data visualisation
  • Information privacy
20
Q

Advantage of big data

A

High predictive power

21
Q

Why does big data lead to more confident decision making?

A

High accuracy

22
Q

In biology what has occurred as a result of high-throughput genomics?

A

Life scientists are starting to grapple with massive datasets, encountering big data challenges

23
Q

Define data science

A

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data

24
Q

Define machine learning

A

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

25
Q

Fill in the blanks: Analysing the large amount of big data with local infrastructure is impossible. The data is then moved to the ____ for analysis and storage. Data ____ is becoming crucial for ____ data.

A

cloud
storage
biological

26
Q

Why is big data used in data science?

A

It helps life science achieve the paradigm of personalised medicine and contain outbreaks in a timely manner, by defining new vaccines or using approved drugs to mitigate the impact on human health.

27
Q

Why is machine learning used?

A
  • Current data is complex, and we need models that are able to learn patterns and associations in the data autonomously, e.g. via Artificial Intelligence and Machine Learning applications.
  • To establish the concept of data privacy and sets of data governance that ensure individuals (data subjects) are specifically protected.
28
Q

Challenge of ML in life science at data acquisition/analysis level?

A

• How do we extrapolate the information from the data to train and then test models?
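
One common way to "train and then test" is to hold back part of the data, so the model is evaluated on samples it never saw during training. The sketch below is a minimal illustration in plain Python, not the lecture's method. The function name `train_test_split`, the 25% test fraction and the sample IDs are all assumptions.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split off a held-out test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A fixed seed makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


samples = [f"patient_{i}" for i in range(20)]  # hypothetical sample IDs
train, test = train_test_split(samples)
```

The model would be fitted on `train` only, and `test` kept aside to estimate how well it generalises.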

29
Q

Challenge of ML in life science at data modelling level?

A

• How do we use existing models in applications to life science?

30
Q

Challenge of ML in life science at deploy and maintain level?

A

• How do we interpret the data and maintain the ability of the models to learn from the data?

31
Q

Examples of challenges in ML in biomedical science

A
  • Combine gene-level analyses with pathway-based methods to generate a comprehensive profile of the functional modules that govern biological processes (instead of looking at a single gene).
  • We want to use high-throughput data to build models of data integration, to predict at the systems level (abdominal aortic aneurysm example).
  • Design therapeutic interventions and/or identify genomic predispositions to disease at the individual level (personalised medicine example).
32
Q

Name the 6 principles in Data Science

A

1) data acquisition
2) data preparation
3) data analysis
4) data modelling
5) visualization
6) deploy and maintain

33
Q

What do we encounter in the principles of data science?

A

A number of bottlenecks, e.g. in data acquisition (due to the need to share/transfer data), in data analysis and data modelling (due to the need to call upon different types of systems that might not be accessible) and in deploy/maintain (due to the need to interpret the output within the context and system).

34
Q

Describe the life cycle of a Data Science Project

A
  1. Define your research questions
    - Multiple people discuss this in a multi-disciplinary setting.
  2. Data acquisition
    - Experimental data
    - Online repositories
    - Other labs
    - Web servers
    - Metadata
    - Images
  3. Transformation: Data cleaning
    - Data that is unstructured or semi-structured needs to be converted into structured data for it to be analysed
    - Missing data points, duplications and wrong annotations are identified and dealt with
  4. Exploratory data analysis
    - Define and refine the selection of features for the model
  5. Data modelling
  6. Visualisation and communication
  7. Deploy and maintain
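
The transformation step (step 3) can be sketched as follows: semi-structured records are converted into rows with a fixed set of columns, duplicates are dropped and rows with missing values are set aside. This is a minimal illustration under stated assumptions. The input lines, the column names in `COLUMNS` and the helpers `to_row` and `clean` are hypothetical and not taken from the lecture.

```python
import json

# Hypothetical semi-structured input: one JSON object per line, with
# inconsistent key casing, a duplicate record and missing values.
raw_lines = [
    '{"id": "S1", "age": 61, "smoker": "yes"}',
    '{"id": "S2", "age": null, "smoker": "no"}',
    '{"id": "S1", "age": 61, "smoker": "yes"}',
    '{"id": "S3", "Age": 54}',
]

COLUMNS = ("id", "age", "smoker")  # the structured target schema


def to_row(line):
    """Parse one JSON line into a tuple with a fixed set of columns."""
    record = {key.lower(): value for key, value in json.loads(line).items()}
    return tuple(record.get(col) for col in COLUMNS)


def clean(lines):
    """Return (complete_rows, incomplete_rows) with duplicates dropped."""
    seen, complete, incomplete = set(), [], []
    for row in map(to_row, lines):
        if row in seen:
            continue  # skip exact duplicates
        seen.add(row)
        # Rows with any missing value are set aside for review.
        (incomplete if None in row else complete).append(row)
    return complete, incomplete


complete, incomplete = clean(raw_lines)
```

Setting incomplete rows aside rather than deleting them is a design choice. Whether to drop, impute or re-annotate them depends on the research question defined in step 1.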