Big Data Management Flashcards

Question 1

Q

Characterization (3Vs)

Answer

A

Variety: Different forms of data
Volume: Petabytes of data
Velocity: Real-time data

Question 2

Q

Big Data Analysis Pipeline

Answer

A

Data acquisition: Select important data to be stored
Information extraction & cleaning: Pull out required information from underlying sources
Data integration & aggregation & representation: Full integration is not always possible. The origin of derived data often cannot be tracked, and choosing a storage layout is complex
Modeling & analysis: Big data discloses hidden patterns and knowledge. Simple models can show the big picture
Interpretation: Annotate base data and discuss interpretation of metadata

Question 3

Q

Data Lake requirements

Answer

A

Secure
Scalable
Reliable
Throughput
Low Latency
Store details
Store native format
All sources

Question 4

Q

Advantages of the cloud

Answer

A

Cost
Extensibility
Reliability
Workload
Sharing

Question 5

Q

Disadvantages of the cloud

Answer

A

Custom software
Networking
Maintenance
Security
Parallelization not always possible

Question 6

Q

Three-Tier Server

Answer

A

Presentation ➔ Logic ➔ Data

Question 7

Q

Cloud design goals

Answer

A

Transparent
Flexible
Reliable
Performant
Scalable

Question 8

Q

Fallacies of cloud

Answer

A

Network is reliable
Latency is zero
Bandwidth is infinite
Network is secure
Topology doesn’t change
There is one administrator
Transport cost is zero
The network is homogeneous

Question 9

Q

Cloud characteristics

Answer

A

Dynamic
Massively scalable
Multi-tenant
Self-service
Usage-based pricing model
IP-based architecture

Question 10

Q

Google File System

Answer

A

Store files as chunks across chunk servers, replicate each chunk, master node manages metadata and coordinates access
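
A toy sketch of the chunk-and-replicate idea, not the real GFS protocol. The class name `Master`, the constants `CHUNK_SIZE` and `REPLICAS`, the round-robin placement, and the server names are all assumptions made for illustration. For simplicity the data passes through the master here, whereas in GFS the master only handles metadata and clients exchange data with chunk servers directly.

```python
CHUNK_SIZE = 4   # bytes; toy value for illustration only
REPLICAS = 2     # toy replication factor


class Master:
    """Keeps metadata: which servers hold which chunk of which file."""

    def __init__(self, servers):
        self.servers = servers   # server name -> dict of stored chunks
        self.locations = {}      # (filename, chunk index) -> [server names]

    def write(self, filename, data):
        names = sorted(self.servers)
        for offset in range(0, len(data), CHUNK_SIZE):
            index = offset // CHUNK_SIZE
            chunk = data[offset:offset + CHUNK_SIZE]
            # Round-robin placement on REPLICAS distinct servers
            # (assumes REPLICAS <= number of servers).
            targets = [names[(index + r) % len(names)] for r in range(REPLICAS)]
            for name in targets:
                self.servers[name][(filename, index)] = chunk
            self.locations[(filename, index)] = targets

    def read(self, filename, failed=()):
        parts = []
        index = 0
        while (filename, index) in self.locations:
            # Use any replica that is not on a failed server.
            alive = [s for s in self.locations[(filename, index)] if s not in failed]
            parts.append(self.servers[alive[0]][(filename, index)])
            index += 1
        return b"".join(parts)


master = Master({"cs1": {}, "cs2": {}, "cs3": {}})
master.write("log.txt", b"hello big data")
restored = master.read("log.txt", failed={"cs1"})
```

Because every chunk lives on two servers, the file can still be read in full when one server is marked as failed.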

Question 11

Q

MapReduce

Answer

A

Map: extract data as key/value pairs
Shuffle: group by key
Reduce: reduce each group to a result
Input is split so that mapping runs in parallel
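
The steps above can be sketched as a word count on a single machine. This is a minimal illustration, not a distributed framework. The function names and the sample input are assumptions, and a thread pool stands in for the parallel workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]


def group_by_key(pairs):
    """Shuffle step: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each split is mapped independently, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_words, chunks))
    pairs = [pair for part in mapped for pair in part]
    return reduce_counts(group_by_key(pairs))


splits = ["big data big", "data lake"]
counts = word_count(splits)
```

Here `counts` maps each word to its total number of occurrences across both splits.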

Question 12

Q

ACID

Answer

A

Atomicity
Consistency
Isolation
Durability
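
A small sketch of atomicity and consistency using Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are hypothetical. The `CHECK` constraint plays the role of the consistency rule, and the `with conn:` block rolls back the whole transaction when a statement fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the second update violates the CHECK constraint.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves both balances unchanged, even though the credit to the destination account had already been executed inside the transaction.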

Question 13

Q

CAP

Answer

A

Consistency
Availability
Partition-tolerance
➔ Not all three can be guaranteed at the same time

Question 14

Q

BASE

Answer

A

Basically Available
Soft state
Eventual Consistency

Question 15

Q

Types of NoSQL storage

Answer

A

Key/Value
Wide-column
Document database
Graph database

Question 16

Q

Steps of machine learning

Answer

A

Data ➔ Preprocessing ➔ Feature extraction ➔ Learning ➔ Testing ➔ Analysis

Question 17

Q

Decision tree

Answer

A

Built greedily, top-down

- At each node, choose the attribute with the highest information gain
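
A minimal sketch of the attribute choice at a single node, assuming entropy-based information gain as the measure of "information value". The helper names and the tiny weather-style data set are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """Greedy step: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
split = best_attribute(rows, ["outlook", "windy"], "play")
```

In this data `outlook` separates the classes perfectly while `windy` carries no information, so the greedy step selects `outlook`. A full tree builder would repeat the same choice recursively on each resulting subset.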

Question 18

Q

K-means clustering

Answer

A

Minimize the variance within clusters

- Choose random start points (centroids), assign each data point to the nearest centroid, recalculate the centroids, iterate until the assignments stop changing
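
The loop described above can be sketched in plain Python. This is an illustrative version, not a tuned implementation. The function names, the fixed `seed`, the `max_iter` cap, and the sample points are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random start points
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the index of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the clustering has converged
        assignment = new_assignment
        # Recalculate each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignment


data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
```

With these two well-separated groups, the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random start points.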