IDT lecture 4 Flashcards

1
Q

Drawbacks of older file-system-based DBs (before the 1970s)

A
  • redundancy
  • inconsistencies
  • data isolation
  • integrity
  • atomicity of updates
  • concurrent access by multiple users
  • security problems

The solution to these problems was the creation of the RDBMS, i.e. the RELATIONAL DBMS.

2
Q

BIG DATA

A

Information assets that require NEW forms of processing.

3
Q

The Vs of BIG DATA

A

Volume: amount of generated and stored data

Velocity: the speed/rate at which the data is generated, collected, processed

Variety: different types of data available (unstructured, semi-structured)

Veracity: quality of captured data. Truthful/reliable data

Value: inherent wealth embedded in the data.

Visualization: display the data

Volatility: everything changes, data changes

Vulnerability: new security concerns

4
Q

BIG DATA analytics: the compromise when processing big data

A

You need to compromise because big data cannot be processed the way an RDBMS processes data.

People look for patterns in the data, for top answers, etc.

5
Q

Interactive Processing

A

Algorithms that pause the process, wait for user input, and then continue.

System users are asked to help during the processing, and their answers are used as part of the algorithm.
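A minimal sketch of the idea in Python (the classification task, threshold, and margin are invented for illustration):

```python
# Minimal sketch of interactive processing (task, threshold, and margin
# are invented). The algorithm handles clear cases automatically, but
# stops and waits for user input on ambiguous ones.
def classify(records, threshold=0.5, margin=0.1):
    results = []
    for name, score in records:
        if abs(score - threshold) < margin:
            # Pause the process, ask the user, then continue; the
            # answer becomes part of the algorithm's output.
            answer = input(f"Is '{name}' (score {score}) relevant? [y/n] ")
            results.append((name, answer.strip().lower() == "y"))
        else:
            results.append((name, score > threshold))
    return results

print(classify([("a", 0.9), ("b", 0.52), ("c", 0.1)]))
```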

6
Q

Approximate processing

A

Use a representative sample instead of the whole population.

  • gives an approximate output, not an exact answer
  • Einstein photos example
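A minimal sampling sketch in Python (the data here is synthetic):

```python
# Minimal sketch of approximate processing: estimate the average over a
# huge dataset from a small random sample instead of scanning the whole
# population. The population is synthetic.
import random

population = [random.gauss(100, 15) for _ in range(1_000_000)]

sample = random.sample(population, 1_000)        # representative sample
approx_avg = sum(sample) / len(sample)           # approximate answer
exact_avg = sum(population) / len(population)    # exact answer, for comparison

print(f"approximate: {approx_avg:.2f}, exact: {exact_avg:.2f}")
```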
7
Q

Crowdsourcing processing

A

Difficult tasks or opinions are given to a group of people.

Humans are asked, e.g., about the relation between profiles, for a small compensation per reply. Ex: Amazon Mechanical Turk.

8
Q

Progressive processing

A

You have limited time/resources to give an answer.

Results are shown as soon as they are available (as opposed to SQL, where you have to wait for the query to finish).
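A minimal Python sketch (synthetic data; the reporting interval is invented). Partial results are yielded while the scan is still running:

```python
# Minimal sketch of progressive processing: emit the current best answer
# periodically instead of waiting for the full scan to finish.
import random

def progressive_max(stream, report_every=1000):
    best = None
    for i, value in enumerate(stream, start=1):
        if best is None or value > best:
            best = value
        if i % report_every == 0:
            yield i, best          # partial result, available immediately
    yield i, best                  # final answer

data = (random.random() for _ in range(10_500))
for seen, best_so_far in progressive_max(data):
    print(f"after {seen} items: max so far = {best_so_far:.4f}")
```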

9
Q

Incremental processing

A

Data updates are frequent, which makes previous results obsolete.

Update the existing processing info instead of recomputing from scratch.

This method improves the answer as it gets more information.
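A minimal Python sketch of the idea: a running mean that folds each new batch into the existing aggregate instead of rescanning everything:

```python
# Minimal sketch of incremental processing: when new data arrives,
# update the existing result rather than recomputing over all data.
class RunningMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, batch):
        # Fold a new batch into the aggregate; cost depends only on the
        # batch size, not on how much data was processed before.
        self.count += len(batch)
        self.total += sum(batch)
        return self.total / self.count

mean = RunningMean()
print(mean.update([10, 20, 30]))   # 20.0
print(mean.update([40]))           # 25.0 -- answer improves with more data
```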

10
Q

Scalability in data management for traditional DBs

A

Traditional dbs:

  • SQL only (a constraint)
  • efficiency limited by server capacity

Scaling can be done by:

  • adding more hardware
  • creating better algorithms
11
Q

Solution for scalability of relational data (distributed DBs):

A

Distributed DBs (servers in different locations):

  • add more dbms & partition the data
  • efficiency limited by servers, network
  • scaling: add more/better servers, a faster network
12
Q

Massively parallel processing platforms

A

Move everything to the same place (as opposed to distributed DBs)
- connect computers over LAN and make development, parallelization and robustness easy
- functionality:
generic data-intensive computing

Scaling: buy more or better computers

13
Q

Cloud

A

Massively parallel processing platforms running over rented hardware.

Innovation: Elasticity, standardization

Based on the elasticity of demand (fluctuations), resources in the cloud are adjusted.

Elasticity can be automatically adjusted

Scaling: it’s magic!

14
Q

BIG DATA models

A

Store, Manage and Process by harnessing large clusters of commodity nodes

  • MapReduce family: simpler, more constrained
    ex: Hadoop
  • 2nd gen: enables more complex processing and data, optimization opportunities
    ex: pySpark (see the sketch below)
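As a taste of the 2nd-generation model, a minimal PySpark word count sketch (assumes a local Spark installation; input.txt is a hypothetical file):

```python
# Minimal PySpark word count sketch; assumes a local Spark install and
# a hypothetical "input.txt" in the working directory.
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())   # one record per word
            .map(lambda word: (word, 1))          # key-value pairs
            .reduceByKey(lambda a, b: a + b))     # aggregate per key
print(counts.collect())
sc.stop()
```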
15
Q

Aspects of data-intensive systems

A
  • data storage
  • needle in the haystack
  • scalability (most important)
16
Q

Architectural choices to consider when working with big data

A
  • storage layer
  • programming model and execution engine
  • scheduling
  • optimizations
  • fault tolerance
  • load balancing
17
Q

The Hadoop Ecosystem

A

Hadoop is a family of systems.

Most important (for this course):

Object storage: HDFS -> stores the data (bottom layer)

  • Table storage: HCatalog, HBase
  • Computation: MapReduce
  • Programming languages: Pig (dataflow); Hive (SQL)
18
Q

HDFS: requirements for the storage layer of Hadoop

A

Scalability: just add more data nodes
Efficiency: everything is read from hard disk (HD)
Simplicity: no need to know where each block is stored
Fault tolerance: failures do not lead to loss of data

19
Q

HDFS how it works:

A

Files are partitioned into blocks.

Blocks are then distributed and replicated across nodes.
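A minimal Python sketch of the idea (node names and the rotation policy are invented; real HDFS placement is more sophisticated):

```python
# Minimal sketch: split a file into 64 MB blocks, then assign each
# block to `replication` distinct data nodes.
BLOCK_SIZE = 64 * 1024 * 1024               # Hadoop's default block size
NODES = ["data1", "data2", "data3", "data4"]

def place_blocks(file_size, replication=3):
    num_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    return {
        block: [NODES[(block + r) % len(NODES)] for r in range(replication)]
        for block in range(num_blocks)
    }

print(place_blocks(200 * 1024 * 1024))      # a 200 MB file -> 4 blocks
```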

20
Q

Types of nodes in HDFS and their main functionality

A

Name nodes: keep the location of blocks

Secondary name nodes: backup nodes

Data nodes: keep the actual blocks

21
Q

Default size (in MB) of blocks in Hadoop

A

64 MB
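For example, with the 64 MB default, a 200 MB file is split into ceil(200 / 64) = 4 blocks: three full 64 MB blocks plus one 8 MB block.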

22
Q

Failed data nodes

A

Name nodes and data nodes communicate using a “heartbeat” (like a ping), which tells the name node whether a data node is still available: data nodes send the name node a heartbeat at regular intervals to show that everything is fine.

On failure, the name node removes the failed data nodes from its index.

Lost partitions are re-replicated to the remaining data nodes.
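A minimal Python sketch of heartbeat-based failure detection (the timeout value and node names are invented):

```python
# Minimal sketch: the name node tracks the last heartbeat per data node
# and drops nodes that have gone silent, so their blocks can be
# re-replicated on the remaining nodes.
import time

TIMEOUT = 10.0    # seconds of silence before a node is presumed failed

last_heartbeat = {"data1": time.time(),        # healthy
                  "data2": time.time() - 60}   # silent for a minute

def detect_failures(now=None):
    now = time.time() if now is None else now
    failed = [n for n, t in last_heartbeat.items() if now - t > TIMEOUT]
    for n in failed:
        del last_heartbeat[n]   # remove failed node from the index
    return failed               # caller re-replicates the lost blocks

print(detect_failures())        # ['data2']
```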

23
Q

Big Data Analytics (IBM)

A

Driven by AI, IoT, social media, and mobile devices.

  • data sources are becoming more complex than those for traditional data

We want to:

  • deliver deeper insights
  • predict future outcomes
  • better and faster decision making
  • power innovative apps
24
Q

Analytics: MapReduce

A
  • a programming paradigm (language) for creating code that supports the following:

Easy scale-out

Fault tolerance: about 1 in 1,000 off-the-shelf computers will fail

It is built into Hadoop on top of HDFS.

Code your analytics logic within:

  • MAP FUNCTION: local processing
  • REDUCE FUNCTION: aggregation
25
Q

Example MapReduce:

A

Huge file -> split into multiple parts -> Map -> Reduce -> RESULT
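A minimal plain-Python sketch of this flow, using word count as the example (real Hadoop jobs express the same map and reduce functions through its APIs):

```python
# Minimal sketch of the MapReduce flow: split the input, run map() on
# each part, shuffle the pairs by key, then reduce() per key.
from collections import defaultdict

def map_fn(chunk):                      # MAP: local processing
    return [(word, 1) for word in chunk.split()]

def reduce_fn(key, values):             # REDUCE: aggregation
    return key, sum(values)

chunks = ["big data is big", "data is data"]   # "huge file", pre-split

shuffled = defaultdict(list)            # group intermediate pairs by key
for chunk in chunks:
    for key, value in map_fn(chunk):
        shuffled[key].append(value)

result = [reduce_fn(k, vs) for k, vs in shuffled.items()]
print(result)   # [('big', 2), ('data', 3), ('is', 2)]
```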

26
Q

Data model for big data/Hadoop

A

Basic unit of info: KEY - VALUE pair

Get the data, then translate/convert it to key - value pairs.

This conversion makes it easy to work with various data types: relational, structured, unstructured, etc.

After converting the data to key-value pairs, feed the pairs through MapReduce.
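A minimal Python sketch (the record formats are invented): two very different records end up in the same uniform representation:

```python
# Minimal sketch of the key-value data model: convert heterogeneous
# records into (key, value) pairs before feeding them to MapReduce.
relational_row = ("alice", 31, "NL")        # structured, from an RDBMS
log_line = "2024-01-01 alice login"         # unstructured text

pairs = []
name, age, country = relational_row
pairs.append((name, {"age": age, "country": country}))

date, user, event = log_line.split()
pairs.append((user, {"event": event, "date": date}))

print(pairs)   # one uniform key-value representation for both sources
```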

27
Q

Master-Slave architecture

A

The master controls the slaves.

Master: name node, job tracker, secondary name node

Slave: task tracker, data nodes

(see slide for diagram)

28
Q

MapReduce weaknesses:

A
  • cannot define complex processes
  • batch mode only, acyclic
  • difficult to optimize
  • no distributed memory (file-based)
29
Q

What is the purpose of replication ?

How does replication of blocks help?

A

Replication = n means that each individual block of data will be replicated and stored on ‘n’ nodes.

The purpose of replication is to help with fault tolerance and load balancing.

30
Q

Load balancing = ?

A

Load balancing means that the blocks of data should be distributed in a balanced way across the nodes.

You should have similar loads on all nodes (no heavy lifting on some nodes while others have too little load).
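A minimal Python sketch of one balanced placement policy (invented for illustration; real HDFS balancing is more involved):

```python
# Minimal sketch of load balancing: each block's replicas go to the
# currently least-loaded nodes, so block counts stay similar everywhere.
def balanced_placement(num_blocks, nodes, replication=3):
    load = {n: 0 for n in nodes}
    placement = {}
    for block in range(num_blocks):
        # Pick the `replication` nodes holding the fewest blocks so far.
        targets = sorted(nodes, key=lambda n: load[n])[:replication]
        for n in targets:
            load[n] += 1
        placement[block] = targets
    return placement, load

placement, load = balanced_placement(8, ["n1", "n2", "n3", "n4"])
print(load)    # {'n1': 6, 'n2': 6, 'n3': 6, 'n4': 6} -- similar loads
```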