Lecture 28 Big Data Flashcards by Rosie Rudin

Define big data

Big data is a generic term for datasets that are so large or complex that traditional data processing tools (e.g. a desktop computer, a small server, or statistical tools normally used at small scale) are inadequate for mining them.

What are the 3 Vs of big data?

High volume
High velocity
High variety

Which 3 more Vs have recently been added?

High variability
High veracity (variation in quality)
High value (which comes with complexity)

Big data is a combination of which sorts of data?

unstructured
semi-structured
structured

Why is data collected?

It is collected so that it can be mined and used to build predictive models and other advanced analytics applications.

What is structured data?

Data that can be catalogued

What is unstructured data?

Behavioural or ambiguous data
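
As a rough illustration of the three sorts of data named in these cards, the sketch below holds similar information in structured, semi-structured and unstructured form. The field names, the example values and the `classify` heuristic are all hypothetical and are not taken from the lecture.

```python
import json

# Structured: fixed fields, one flat value per field (easy to catalogue).
structured_row = {"patient_id": 1, "age": 54, "smoker": True}

# Semi-structured: self-describing (JSON here), but fields can be nested or optional.
semi_structured = json.loads(
    '{"patient_id": 1, "visits": [{"date": "2020-03-01", "note": "cough"}]}'
)

# Unstructured: free text with no predefined fields.
unstructured = "Patient reports a persistent cough and says he smokes most days."


def classify(item):
    """Very rough heuristic, for illustration only."""
    if isinstance(item, str):
        return "unstructured"
    if isinstance(item, dict) and all(
        not isinstance(value, (dict, list)) for value in item.values()
    ):
        return "structured"
    return "semi-structured"


for example in (structured_row, semi_structured, unstructured):
    print(classify(example))
```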

Is big data associated with any specific volume of data?

No. Big data deployments can involve terabytes (TB), petabytes (PB) and even exabytes (EB) of data, captured over time.

Why is big data important to companies?

Companies can use it to:
improve operations
provide better customer service
create personalized marketing campaigns
make faster decisions
make more-informed decisions
become more customer-centric

Examples of big data?

Business transaction systems
Customer databases
Medical records
Internet clickstream logs
Mobile applications
Social networks

Examples of big data in SCIENCE?

Scientific research repositories
Machine-generated data
Clinical records, e.g. lifestyle data and not just medical records

In what form is the data left?

The data may be left in its raw form on large servers, or preprocessed with data mining tools or data preparation software so that it is ready to be analysed, e.g. Google/Amazon

How is Big Data used in Life Science?

Allows identification of risk factors in disease
Helps diagnose illnesses and conditions in individual patients

What is Big Data derived from?

Big data is derived from the genomics, transcriptomics and epigenomics (omics) data of many individuals. It is also derived from electronic health records, social media, the web and other sources, which provide healthcare organisations and government agencies with up-to-the-minute information on infectious disease threats or outbreaks.

How is big data being used in the COVID-19 pandemic?

AI and big data are playing a key role in modelling and in making predictions, both for the effect of the measures enforced and for the science of the virus itself.

What has poor predictive power on its own?

Genomics

What else is used to increase the predictive power?

Lifestyle and environmental predictions

Describe the abdominal aortic aneurysm example of Big Data

The genome is taken alongside lifestyle and physiology.
Each person's genes, how those genes are activated, their mutations and their health records are used to build HEAL, a machine learning framework.
This is carried out on many people to produce a prediction about the predisposition of individuals to the disease.
Predisposition (genome) and lifestyle are balanced against each other before management of health is directed.
Genes and their specific pathways are identified.

Eventually, the data is arranged to show the factors associated with risk: red for genome, blue for eco, yellow for a mixture. The closer the number is to 1, the better the model is at predicting. It is evident that the genome has low predictive power compared to lifestyle in this case.
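
The card does not say which score is "closer to 1". A common choice for this kind of risk model is the area under the ROC curve (AUC), so the sketch below assumes that metric purely for illustration. The scores and outcomes are made-up toy values, not results from the lecture or from any study.

```python
def auc(scores, outcomes):
    """Fraction of (case, control) pairs in which the case gets the higher
    score; ties count as half. 1.0 is perfect, 0.5 is no better than chance."""
    cases = [s for s, o in zip(scores, outcomes) if o == 1]
    controls = [s for s, o in zip(scores, outcomes) if o == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data: 1 = developed the disease, 0 = did not (hypothetical values).
outcomes = [1, 1, 1, 0, 0, 0]
model_a_scores = [0.6, 0.4, 0.5, 0.5, 0.3, 0.6]  # weakly informative model
model_b_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.5]  # more informative model

print(auc(model_a_scores, outcomes))
print(auc(model_b_scores, outcomes))
```

Under this assumption, the model whose value is nearer to 1 ranks people who develop the disease above those who do not more often.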

Describe the challenges of Big Data in life science

Data analysis
Data curation i.e. making it as high quality as possible
Search engines i.e. making them powerful enough to search across the full set of data points
Data sharing i.e. many different labs need to share data to build up big data
Data storage and transfer
Data visualisation
Information privacy

Advantage of big data

High predictive power

Why does big data lead to more confident decision making?

High accuracy

In biology what has occurred as a result of high-throughput genomics?

Life scientists are starting to grapple with massive datasets, encountering big data challenges

Define data science

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data

Define machine learning

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.
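
A minimal sketch of "relying on patterns instead of explicit instructions": rather than hand-coding a cut-off, the code below learns one average value per class from labelled examples and assigns new values to the nearest average. The function names, the measurements and the labels are hypothetical, and a nearest-centroid rule is only one of many possible ML methods.

```python
def fit_centroids(samples, labels):
    """Learn one mean value (centroid) per class from labelled examples."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose learned centroid is closest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


# Made-up training data: one measurement per person and a known label.
samples = [1.0, 1.2, 0.9, 3.1, 2.9, 3.3]
labels = ["low risk", "low risk", "low risk",
          "high risk", "high risk", "high risk"]

centroids = fit_centroids(samples, labels)
print(predict(centroids, 1.1))
print(predict(centroids, 3.0))
```

No threshold is written into the code. The decision boundary comes entirely from the training examples, so different data would give a different rule.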

Analysing the large amount of big data with local infrastructure is impossible. The data is then moved to the ..... for analysis and storage. Data ..... is becoming crucial for ..... data.

cloud, storage, biological

Why is big data used in data science?

It helps the life sciences to achieve the paradigm of personalised medicine and to contain outbreaks in a timely manner by defining new vaccines or using approved drugs to mitigate the impact on human health

Why is machine learning used?

* Current data is complex and we need models that are able to learn patterns and associations in the data autonomously e.g. via Artificial Intelligence and Machine Learning applications.
* Establish the concept of privacy for data, with sets of data governance rules to ensure that individuals (data subjects) are specifically protected.

Challenge of ML in life science at data acquisition/analysis level?

• How do we extrapolate the information from the data to train and then test models?

Challenge of ML in life science at data modelling level?

• How do we use existing models in applications to life science?

Challenge of ML in life science at deploy and maintain level?

• How do we interpret the data and maintain the ability of the models to learn from the data?

Examples of challenges in ML in biomedical science

* Combine gene-level analyses with pathway-based methods to generate a comprehensive profile of the functional modules that govern biological processes (instead of looking at a single gene).
* Use high-throughput data to build models of data integration, to predict at the systems level (abdominal aortic aneurysm example).
* Design therapeutic interventions and/or identify genomic predispositions to disease at the individual level (personalised medicine example).

Name the 6 principles in Data Science

1) data acquisition
2) data preparation
3) data analysis
4) data modelling
5) visualization
6) deploy and maintain

What do we encounter in the principles of data science?

A number of bottlenecks e.g. data acquisition (due to the need to share/transfer data), data analysis and data modelling (due to the need to call upon different types of systems that might not be accessible) and with deploy/maintain (due to the need to interpret the output within the context and system).

Describe the life cycle of a Data Science Project

1. Define your research questions
   - Multiple people discuss this in a multi-disciplinary setting.
2. Data acquisition
   - Experimental data
   - Online repositories
   - Other labs
   - Web servers
   - Metadata
   - Images
3. Transformation: data cleaning
   - Data that is unstructured or semi-structured needs to be converted into structured data before it can be analysed
   - Missing data points, duplications and wrong annotations are checked
4. Exploratory data analysis
   - Define and refine the selection of features for the model
5. Data modelling
6. Visualisation and communication
7. Deploy and maintain
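
The data cleaning step mentions missing data points, duplications and wrong annotations. A minimal sketch of what that could look like is given below. The record layout (`sample_id`, `gene`, `expression`), the `clean_records` helper and the values are hypothetical and chosen only for illustration; real pipelines typically use dedicated data preparation tools.

```python
def clean_records(records, required=("sample_id", "gene", "expression")):
    """Drop records with missing required fields and repeated
    sample/gene pairs, and normalise the gene annotation."""
    seen = set()
    cleaned = []
    for rec in records:
        # Missing data points: skip records lacking any required field.
        if any(rec.get(field) is None for field in required):
            continue
        # Inconsistent annotations: treat 'brca1 ' and 'BRCA1' as the same gene.
        gene = str(rec["gene"]).strip().upper()
        key = (rec["sample_id"], gene)
        # Duplications: keep only the first record for each sample/gene pair.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "sample_id": rec["sample_id"],
            "gene": gene,
            "expression": float(rec["expression"]),
        })
    return cleaned


# Made-up raw records with typical problems.
raw = [
    {"sample_id": "S1", "gene": "BRCA1", "expression": "2.5"},
    {"sample_id": "S1", "gene": "brca1 ", "expression": "2.5"},  # duplicate
    {"sample_id": "S2", "gene": "TP53", "expression": None},     # missing value
    {"sample_id": "S3", "gene": "TP53", "expression": 1.0},
]

cleaned = clean_records(raw)
print(cleaned)
```

The output is a list of uniformly shaped rows, which is the structured form that the later analysis and modelling steps expect.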

(34 cards)