Week 1: Big Data Introduction Flashcards

1
Q

Where does machine generated data come from? (3 types)

A
  1. Internet of Things
  2. Smart devices
  3. Sensing
2
Q

Where does people generated data come from? (4 examples) (BOSE)

A
  • Blogging
  • Online photo/video sharing
  • Social networks
  • Emails
3
Q

Describe big data generated by people (4 formats)

A
  • Typically unstructured and text heavy
  • Multiple data formats:
  1. Web pages
  2. images
  3. PDF
  4. XML
4
Q

Describe big data generated by organisations (3 points)

A
  • Typically structured and stored in RDBMS
  • Often siloed by department
  • Limited infrastructure to share and integrate this data
5
Q

Define: big data

A
  • Big data is high-volume, high-velocity and/or high-variety information assets
  • that require new forms of processing
  • to enable enhanced decision making,
  • insight discovery
  • and process optimisation.
6
Q

Characteristics of big data volume

A
  • Refers to the vast amount of data that is generated every second, minute, hour and day
  • The dimension of big data that relates to its size and exponential growth
7
Q

Characteristics of big data velocity

A
  • Refers to the speed at which data is being generated
  • The increasing speed at which the data needs to be stored and analysed
  • Processing of data in real-time to match its production rate
8
Q

Characteristics of big data variety

A
  • Refers to the ever-increasing range of forms that big data can come in, such as:
  • text
  • images
  • voice
  • spatial data
9
Q

Characteristics of big data veracity

A

Limited value if data is not accurate

Refers to the quality of the data, which should be free from:

  • biases
  • noise
  • abnormalities

Data must be:

  • accurate
  • trustworthy
  • reliable
  • and have context within analysis
10
Q

5 steps in the data science process

A
  • Acquire
  • Prepare
  • Analyse
  • Report
  • Act
11
Q

What are 3 points to consider when acquiring data?

A
  1. Finding the right data sources
  2. Conventional data may exist in RDBMS
  3. NoSQL storage systems can be used to manage the variety of data types in big data
12
Q

Prepare data

A
  • Preliminary investigations: correlations, general trends, outliers
  • Summary statistics: mean, median, range, standard deviation
  • Visualisation: histograms, scatter plots
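
The summary statistics on this card can be sketched in a few lines of Python. This is only an illustration using the standard library `statistics` module; the `readings` values and the `summarise` helper are made up for the example and do not come from the course material.

```python
import statistics

# Hypothetical sample values, made up purely for illustration.
readings = [12.0, 15.5, 14.2, 13.8, 41.0, 14.9, 13.1]


def summarise(values):
    """Return the four summary statistics named on the card."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
    }


summary = summarise(readings)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these made-up values the mean sits well above the median, which is the kind of gap that can hint at an outlier (here 41.0) worth a closer look.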
13
Q

Why do we pre-process data?

A
  • to clean the data to address data quality issues
14
Q

Why is domain knowledge important?

A

To make informed decisions on how to handle incomplete or incorrect data

15
Q

Methods of pre-processing data

A
  • Scaling
  • Transformation
  • Feature selection
  • Dimensionality reduction
  • Data manipulation
  • One hot encoding (categorical data)
  • Normalise
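
Two of the methods above, scaling and one hot encoding, can be sketched in plain Python. This is a minimal illustration, not the course's own implementation: it assumes min-max scaling to [0, 1] as one common form of scaling, and the helper names and sample columns are hypothetical.

```python
def min_max_scale(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: there is no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Map each categorical label to a 0/1 vector, one position per category."""
    categories = sorted(set(labels))
    encoded = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, encoded


# Hypothetical columns, made up for illustration.
ages = [20, 30, 40]
colours = ["red", "green", "red"]

scaled_ages = min_max_scale(ages)
categories, encoded_colours = one_hot_encode(colours)
print(scaled_ages)
print(categories, encoded_colours)
```

Sorting the categories keeps the column order stable between runs, so the same label always maps to the same position.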
16
Q

Techniques for analysing data

A
  • Classification
  • Regression
  • Clustering
  • Association analysis
  • Graph analysis
17
Q

Report

A

To report the insights we gained from our analysis

18
Q

Reporting visualisation tools

A
  • Python
  • R
  • D3
  • Leaflet
  • Tableau
19
Q

Act

A
  • To determine what action or actions should be taken based on the gained insights
  • Does additional analysis need to be performed?
  • What data should be revisited?
20
Q

Varieties of big data

A
  • Structural variety - format and models
  • Media variety - medium of delivery
  • Semantic variety - how to interpret and operate on the data
  • Availability variations - real-time vs intermittent
21
Q

Where does machine generated data come from? (3 examples)

A
  1. Network logs
  2. Equipment logs
  3. Call detail logs