Big Data Management Flashcards

1
Q

Characterization of big data (3 Vs)

A

Variety: Different forms of data
Volume: Petabytes of data
Velocity: Real-time data

2
Q

Big Data Analysis Pipeline

A
  1. Data acquisition: Select the important data to be stored
  2. Information extraction & cleaning: Pull the required information out of the underlying sources
  3. Data integration, aggregation & representation: Full integration is not always possible. Problems: the origin of derived data cannot be tracked, and selecting the storage is complex
  4. Modeling & analysis: Big data discloses hidden patterns and knowledge. The big picture shows simple models
  5. Interpretation: Annotate base data and discuss the interpretation of metadata
3
Q

Data Lake requirements

A
  1. Secure
  2. Scalable
  3. Reliable
  4. Throughput
  5. Low Latency
  6. Store details
  7. Store native format
  8. All sources
4
Q

Advantages of the cloud

A
  • Cost
  • Extensibility
  • Reliability
  • Workload
  • Sharing
5
Q

Disadvantages of the cloud

A
  • Custom software
  • Networking
  • Maintenance
  • Security
  • Parallelization not always possible
6
Q

Three-Tier Server

A

Presentation ➔ Logic ➔ Data

7
Q

Cloud design goals

A
  • Transparent
  • Flexible
  • Reliable
  • Performant
  • Scalable
8
Q

Fallacies of the cloud (fallacies of distributed computing)

A
  1. Network is reliable
  2. Latency is zero
  3. Bandwidth is infinite
  4. Network is secure
  5. Topology doesn’t change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous
9
Q

Cloud characteristics

A
  1. Dynamic
  2. Massively scalable
  3. Multi-tenant
  4. Self-service
  5. Per-usage based pricing model
  6. IP-based architecture
10
Q

Google File System

A

Files are split into chunks that are stored across chunk servers; each chunk is replicated; a master node keeps the metadata and controls access
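
The chunk/replica/master idea can be sketched in a few lines. This is a toy model, not the actual GFS design: the 4-byte chunk size, the round-robin placement, and the names `Master`, `write_file`, and `read_file` are assumptions made for illustration. Only the three roles (chunks, replicas, a metadata-holding master) come from the card.

```python
CHUNK_SIZE = 4   # bytes per chunk; tiny on purpose (assumption for the demo)
REPLICAS = 3     # copies kept of every chunk (assumption for the demo)


def split_into_chunks(data, size=CHUNK_SIZE):
    """Cut a byte string into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Master:
    """Holds only metadata: which chunk servers store which chunk."""

    def __init__(self, server_names):
        self.server_names = list(server_names)
        self.locations = {}  # (filename, chunk index) -> list of server names

    def allocate(self, filename, index):
        # Hypothetical round-robin placement onto REPLICAS distinct servers.
        n = len(self.server_names)
        targets = [self.server_names[(index + r) % n] for r in range(REPLICAS)]
        self.locations[(filename, index)] = targets
        return targets

    def locate(self, filename, index):
        return self.locations[(filename, index)]


def write_file(master, servers, filename, data):
    """Ask the master where each chunk goes, then store it on every replica."""
    for index, chunk in enumerate(split_into_chunks(data)):
        for name in master.allocate(filename, index):
            servers[name][(filename, index)] = chunk


def read_file(master, servers, filename, failed=()):
    """Read each chunk from the first replica that is not marked as failed."""
    chunks = []
    index = 0
    while (filename, index) in master.locations:
        live = [s for s in master.locate(filename, index) if s not in failed]
        chunks.append(servers[live[0]][(filename, index)])
        index += 1
    return b"".join(chunks)


servers = {name: {} for name in ["cs0", "cs1", "cs2", "cs3"]}
master = Master(servers)
write_file(master, servers, "log.txt", b"hello chunked world")
restored = read_file(master, servers, "log.txt", failed={"cs0"})
```

Because every chunk lives on three of the four servers, the file can still be read when one server is treated as down.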

11
Q

Map Reduce

A
  • Split data and perform mapping in parallel
  • Extract data as key/value pairs
  • Group by key
  • Reduce groups
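
A word count is the usual way to see these steps together. The sketch below runs on one machine and uses a thread pool to stand in for parallel mappers; the function names and the sample input are assumptions for illustration, not part of any MapReduce framework.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split):
    """Map: emit one (word, 1) pair per word in this input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_splits):
    """Group by key: collect all values emitted for the same word."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits):
    # Splits are independent of each other, so the map calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, splits))
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big", "data lake", "big lake"])
# counts == {"big": 3, "data": 2, "lake": 2}
```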
12
Q

ACID

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability
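
Atomicity is the easiest of the four properties to demonstrate. The sketch below uses Python's built-in `sqlite3` module; the `accounts` table, the `transfer` helper, and the balances are made up for the example. The second transfer violates a CHECK constraint halfway through, so the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, source, target, amount):
    """Move money between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source))
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances == {"alice": 70, "bob": 80}
```

The credit to bob in the failed transfer is undone together with the rejected debit, which is the "all or nothing" behavior atomicity asks for.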
13
Q

CAP

A
  • Consistency
  • Availability
  • Partition-tolerance
    ➔ Not all three can be guaranteed at the same time
14
Q

BASE

A
  • Basically Available
  • Soft state
  • Eventual Consistency
15
Q

Types of NoSQL storage

A
  • Key/Value
  • Wide-column
  • Document database
  • Graph database
16
Q

Steps of machine learning

A

Data ➔ Preprocessing ➔ Feature extraction ➔ Learning ➔ Testing ➔ Analysis

17
Q

Decision tree

A
  • Built greedily, top-down
  • At each node, choose the attribute with the highest information gain
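
The attribute choice at a single node can be sketched as follows, assuming "information value" means entropy-based information gain as used in ID3. The helper names and the four-row toy dataset are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """Greedy step: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
best = best_attribute(rows, ["outlook", "windy"], "play")
# best == "outlook": it separates the classes perfectly (gain of 1 bit),
# while "windy" tells us nothing about "play" (gain of 0 bits).
```

A full tree builder would repeat this step recursively on each subset until the labels are pure or no attributes remain.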

18
Q

K-means clustering

A
  • Goal: minimal variance within clusters
  • Pick random start points, assign each data point to the nearest start point, recalculate the start points as cluster means, iterate
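
The loop described on this card can be sketched in plain Python. The function names, the fixed seed, the stopping rule (stop when assignments no longer change), and the sample points are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random start points
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = mean(members)
    return centroids, assignment


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignment = k_means(points, k=2)
```

The result depends on the random start points, so real implementations typically run the algorithm several times and keep the clustering with the lowest within-cluster variance.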