Week 1: Big Data Introduction Flashcards
Where does machine-generated data come from? (3 types)
- Internet of things
- smart devices
- sensing
Where does people-generated data come from? (4 examples) (BOSE)
- Blogging
- Online photo/video sharing
- Social networks
- Emails
Describe big data generated by people (3 formats)
- Typically unstructured and text heavy
- Multiple data formats:
- Web pages
- images
- XML
Describe big data generated by organisations (4 points)
- Typically structured
- stored in RDBMS
- Often siloed by department
- Limited infrastructure to share and integrate this data
Define: big data
- Big data is high volume, high velocity and/or high variety information assets
- require new forms of processing
- to enable enhanced decision making
- insight discovery
- process optimisation.
Characteristics of big data volume
- Refers to the vast amount of data that is generated every second, minute, hour and day
- the dimension of big data and its exponential growth
Characteristics of big data velocity
- Refers to the speed at which data is being generated
- increasing speed at which the data needs to be stored and analysed
- Processing of data in real-time to match its production rate
Characteristics of big data variety
- Refers to the ever-increasing range of forms that big data can come in, such as
- text
- images
- voice
- spatial data
Characteristics of big data veracity
Limited value if data is not accurate
Refers to the quality of the data, free from
- biases
- noise
- abnormalities
Data must be:
- accurate
- trustworthy
- reliable
- and have context within analysis
5 steps in the data science process
- Acquire
- Prepare
- Analyse
- Report
- Act
What are 3 ways we can acquire data?
- Finding the right data sources
- Conventional data may exist in RDBMS
- NoSQL storage system can be used to manage variety of data types in big data
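As a minimal sketch of the conventional case, structured data can be acquired straight from an RDBMS with SQL. The table, columns and values below are made up for illustration, using Python's built-in sqlite3:

```python
import sqlite3

# Hypothetical acquisition step: conventional structured data often
# lives in an RDBMS, so acquiring it can be a simple SQL query.
# The "sales" table and its values are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 40.0)])

# Pull the rows we need into Python for the later "Prepare" step.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('north', 160.0), ('south', 80.0)]
conn.close()
```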
What does preparing data involve?
- Preliminary investigations
- Correlations,
- general trends,
- outliers
- Summary statistics
- Mean
- Median
- Range
- std dev
- Visualisation
- Histogram
- scatter plots
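The summary statistics above can be sketched with Python's standard library; the sample values here are made up:

```python
import statistics

# Hypothetical sample of one numeric feature.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)            # 5.0
median = statistics.median(values)        # 4.5
value_range = max(values) - min(values)   # 7.0
std_dev = statistics.pstdev(values)       # population std dev: 2.0
```

For a quick look at distributions (histograms) and relationships (scatter plots), libraries such as matplotlib or pandas are the usual next step.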
Why do we pre-process data?
- To clean the data and address data quality issues
Why is domain knowledge important?
To make informed decisions on how to handle incomplete or incorrect data
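A minimal sketch of that decision in plain Python, with a made-up sensor series where None marks a missing reading; whether to drop or impute is exactly the kind of call that needs domain knowledge:

```python
# Hypothetical incomplete data: None marks a missing reading.
readings = [21.5, None, 22.0, None, 23.5]

# Option 1: drop incomplete records entirely.
dropped = [r for r in readings if r is not None]

# Option 2: impute missing values with the mean of the observed ones.
observed_mean = sum(dropped) / len(dropped)
imputed = [r if r is not None else observed_mean for r in readings]
```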
Methods of pre-processing data
- Scaling
- Transformation
- Feature selection
- Dimensionality reduction
- Data manipulation
- One-hot encoding (categorical data)
- Normalise