Lecture 4, 5 and 6 Flashcards

1
Q

What is data normalization?

A

Validates and improves a logical design so that it satisfies certain constraints
- Decomposes relations with anomalies to produce smaller, well-structured relations

2
Q

What is the goal of data normalization?

A
  • Goal is to avoid anomalies:
    1. Insertion anomaly
       a. Adding new rows forces the user to create duplicate data
    2. Deletion anomaly
       a. Deleting rows may cause a loss of data that is still needed elsewhere
    3. Modification anomaly
       a. Changing data forces changes to other rows because of duplication
3
Q

What are well-structured relations?

A
  • relations that contain minimal data redundancy and allow users to insert, delete, and
    update rows without causing data inconsistencies
4
Q

What is the first normal form?

A

No multivalued attributes

  • Steps:
    • Ensure that every attribute value is atomic
  • But in the relational world one only works with relations in 1NF
  • So there is no need to actually do anything
5
Q

What is 2nd normal form?

A
  • 1NF + remove partial functional dependencies
  • Create a new relation for each primary key attribute found in the old relation
  • Move the nonkey attributes that depend only on this primary key attribute from the old relation to the new relation (see the example below)
6
Q

What is 3rd normal form?

A

2NF + remove transitive dependencies
- Steps:
  - Create a new relation for each nonkey attribute that is a determinant in a relation:
    - Make that attribute the key of the new relation
    - Move all dependent attributes to the new relation
    - Keep the determinant attribute in the old relation to serve as a foreign key
  (see the example below)
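A hypothetical illustration (again with made-up relation names): in Employee(EmpID, Name, DeptID, DeptName), DeptName depends on the nonkey determinant DeptID, a transitive dependency. 3NF decomposes it into Employee(EmpID, Name, DeptID) and Department(DeptID, DeptName), keeping DeptID in Employee as a foreign key.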

7
Q

What challenges arise from the application setting?

A

o Data characteristics
o System and resources
o Time restrictions

8
Q

What are the challenges of data management?

A

• Veracity
o Structured data with known semantics and quality
o Dealing with high levels of profile noise
• Volume
o Very large number of profiles
• Variety
o Large volumes of semi-structured, unstructured or highly heterogeneous structured data
• Velocity
o Increasing rate at which data becomes available

9
Q

Properties of traditional databases

A
Constrained functionality: SQL only
Efficiency limited by server capacity
- Memory
- CPU
- HDD
- Network
Scaling can be done by
- Adding more hardware
- Creating better algorithms
- But there are still limits
10
Q

Properties of distributed databases

A

Innovation
- Add more DBMS and partition the data
Constrained functionality
- Answer SQL queries
Efficiency limited by #servers, network
API offers location transparency
- User/application always sees a single machine
- User/application does not care about data location
Scaling: add more/better servers, faster network

11
Q

Properties of massively parallel processing platforms

A

Innovation
- Connect computers (nodes) over LAN
- make development, parallelization and robustness easy
Functionality
- Generic data-intensive computing
Efficiency relies on network, #computers & algorithms
API offers location & parallelism transparency
- Developers do not know where data is stored or how the code will be parallelized
Scaling: add more and better computers

12
Q

Properties of the cloud

A

Massively parallel processing platforms running on rented hardware
Innovation
- Elasticity, standardization
- e.g. a university requires few resources during the holidays, while Amazon requires a lot of resources → elasticity
- Elasticity can be adjusted automatically
API offers location and parallelism transparency
Scaling: It’s magic!

13
Q

Five characteristics of big data

A
Volume
- quantity of generated and stored data
Velocity
- speed at which the data is processed and stored
Variety
- Type and nature of the data

Variability
- inconsistency of the data set
Veracity
- quality of captured data

14
Q

Architectural choices to consider:

A
  • Storage layer
  • Programming model & execution engine
  • Scheduling
  • Optimizations
  • Fault tolerance
  • Load balancing
15
Q

Requirements of storage layer

A
  • Scalability: handle the ever-increasing data sizes
  • Efficiency: fast accesses to data
  • Simplicity: hide complexity from the developers
  • Fault-tolerance: failures do not lead to loss of data

• Developers are NOT reading from or writing to the files explicitly
• The Distributed File System handles I/O transparently
o Several DFSs are already available:
 Hadoop Distributed File System
 Google File System
 Cosmos File System

16
Q

What is HDFS?

A

• Files partitioned into blocks
• Blocks distributed and replicated across nodes
• Three types of nodes in HDFS, each with one functionality:
o Name nodes: Keep the locations of blocks
o Secondary name nodes: backup nodes
o Data nodes: keep the actual blocks

17
Q

What happens with a failed data node?

A
  • Name node and data nodes communicate using heartbeats
  • A heartbeat is the signal sent by a data node to the name node at regular intervals to indicate that it is still present and working
  • On failure, the name node removes the failed data node from the index
  • Lost partitions are re-replicated to the remaining data nodes
18
Q

Properties of HDFS

A

• Scalability: Handle the ever-increasing data sizes
o Just add more data nodes
• Efficiency: Fast accesses to data
o Everything read from hard disk (requires I/O)
• Simplicity: Hide complexity from the developers
o No need to know where each block is stored
• Fault-tolerance: Failures do not lead to loss of data
o Administrator can control replication
o If failures are not widespread, no data is lost

19
Q

What is big data analytics?

A

• Driven by artificial intelligence, mobile devices, social media and the Internet of Things (IoT)
• Data sources are becoming more complex than those for traditional data
o e.g., Web applications allow user-generated data

In order to:
• Deliver deeper insights
• Power innovative data applications
• Make better and faster decisions
• Predict future outcomes
• Enhance business intelligence
20
Q

Types of analytics:

A

Traditional computation
- Exact and complete answers over the whole data collection
Approximate
- Use a representative sample instead of the entire input data collection
- Give approximate output, not exact answers
- Answers are given within guarantees
Progressive
- Efficiently process given the limited time and/or computational resources that are currently available
Incremental
- The rate of data updates is often high, which quickly makes previous results obsolete
- Update existing processing information
- Allow leveraging new evidence from the updates to fix previous inconsistencies or complete the information

21
Q

What is MapReduce?

A

A programming paradigm (~language) for the creation of code that supports the following:
- Easy scale-out
- Parallelism & location transparency
- Simple to code and learn
- Fault tolerance
  - Among 1000s of off-the-shelf computers, one WILL fail
- Constrain the user to simple constructs!

22
Q

What is the data model of MapReduce?

A
  • Basic unit of information: the key-value pair
  • Translate data to key-value pairs
  • Thus it can work on various data types (structured, unstructured, etc.)
  • Then pass the pairs through MapReduce
23
Q

What is the programming model of MapReduce?

A

Model based on different functions
- Primary ones: the Map function and the Reduce function
- Map(key, value):
  - Invoked for every split of the input data
  - Value corresponds to the records (lines) in the split
- Reduce(key, list(values)):
  - Invoked for every unique key emitted by Map
  - list(values) corresponds to all values emitted from ALL mappers for this key
- Combine(key, list(values)):
  - Locally merges the values for each key at each node to reduce the number of cross-node messages
  - No guarantee that it will actually be executed!
  - Typically invoked after a fixed-memory buffer is full
(A small word-count sketch follows below.)
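As an illustration only (not from the slides), a word-count sketch of the Map and Reduce functions in plain Python, with the framework's shuffle/grouping step simulated by a dictionary:

  # Hypothetical word-count sketch; a real framework would run map_fn and
  # reduce_fn distributed over many nodes.
  from collections import defaultdict

  def map_fn(key, value):
      # key: split id, value: one line of the split; emit (word, 1) pairs
      return [(word, 1) for word in value.split()]

  def reduce_fn(key, values):
      # key: a word, values: all counts emitted by ALL mappers for this word
      return (key, sum(values))

  lines = ["the quick brown fox", "the lazy dog"]
  groups = defaultdict(list)
  for i, line in enumerate(lines):
      for k, v in map_fn(i, line):      # map phase
          groups[k].append(v)           # shuffle: group values by key
  print([reduce_fn(k, vs) for k, vs in groups.items()])
  # [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]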

24
Q

Downsides of MapReduce

A
MapReduce is not a panacea
MapReduce is simple but weak for some requirements:
•	Cannot define complex processes
•	Batch mode, acyclic, not iterative
•	Everything file-based, no distributed memory 
•	Difficult to optimize
Not good in:
•	Iterative processes, e.g., clustering
•	Real-time answers, e.g., streams 
•	Graph queries, e.g., shortest path
25
Q

More problems with MapReduce

A

Problems with MapReduce:
- MapReduce is a major step backwards (DeWitt 2008)
Performance: Extensive I/O (input/output)
- Everything is a file stored in the HDFS
- Data access too slow
- RAM is not used sufficiently
Programming model: limited expressiveness
- e.g. iterations, cyclic processes
- Code is difficult to optimize
- SQL: several optimization methodologies

26
Q

What is Spark's dataflow paradigm?

A
  • Models an algorithm as a directed graph with the data flowing between operations
  • Construction goals:
    • Improve expressiveness and extensibility
    • Make coding easier: strive for high-level code
    • Enable additional optimizations
    • Improve performance by utilizing the hardware better (RAM)
  • Representative examples:
    • Spark, Apache, Dryad, Pregel
27
Q

What are Spark's architectural choices?

A
Architectural choices of Spark
•	Storage layer
o	Resilient Distributed Datasets (RDDs) 
o	Datasets and data sources
o	Input files still stored in HDFS
•	Programming model and execution engine
28
Q

RDD storage layer requirements

A
  • Scalability: handle the ever-increasing data sizes
  • Efficiency: fast accesses to data
  • Simplicity: hide complexity from the developers
  • Fault-tolerance: failures do not lead to loss of data
  • Fast RAM for hot data: recent data stored in RAM
29
Q

What is an RDD?

A

• R: resilient
o Recovers from failures
• D: distributed
o Parts are placed on different computers
• D: dataset
o A collection of data
o Array, table, data frame, etc.
Distributed, fault-tolerant collections of elements that can be processed in parallel

Resilient Distributed Datasets
• Created by (see the sketch below)
o Loading data from stable storage, e.g., from HDFS
o Manipulating existing RDDs
• Core properties
o Immutable, i.e., read-only, cannot change
o Distributed
o Lazily evaluated
 Joining multiple orders together (restaurant example)
 Intuition → optimization
o Cacheable → by default, stored in memory!
o Replicated
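A minimal PySpark sketch of the two ways to create an RDD (the HDFS path and app name are placeholders):

  # Illustrative only; assumes a Spark installation is available.
  from pyspark import SparkContext

  sc = SparkContext(appName="rdd-demo")
  lines  = sc.textFile("hdfs:///data/orders.txt")   # created from stable storage
  errors = lines.filter(lambda l: "ERROR" in l)     # created from an existing RDD
  errors.cache()                                    # cacheable: keep it in memory
  print(errors.count())                             # action that triggers the work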

30
Q

What does an RDD contain?

A
•	Details about the data
o	I.e., data location or the actual data
• Lineage information (its "history")
o Dependencies on other RDDs
o	Functions/transformations for recreating a lost split of an RDD from a previous RDD!
•	Examples:
o	RDD2 = RDD1.filter(...) 
o	RDD3 = RDD2.transform(...)
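Continuing the sketch above (reusing the SparkContext sc), the lineage Spark keeps for recreating a lost split can be inspected; the exact output format varies by Spark version:

  rdd1 = sc.textFile("hdfs:///data/orders.txt")   # placeholder path
  rdd2 = rdd1.filter(lambda l: "pizza" in l)
  rdd3 = rdd2.map(lambda l: l.lower())
  print(rdd3.toDebugString())   # shows the chain of dependencies (the "history")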
31
Q

Why the Spark programming model?

A

Why?

  • MapReduce is simple but weak
    • Cannot define complex processes
    • Batch mode, acyclic, not iterative
    • Everything is file-based, no distributed memory
    • Procedural → difficult to optimize
  • Spark
    • Processing expressed as a directed acyclic graph (DAG)
32
Q

Spark: Dataflow and RDDs

A

Spark development is RDD-centric

  • In the future, dataset-centric
  • RDDs enable operations:
    • Transformations (lazy operations)
      • e.g. map, filter, flatMap, joins
    • Actions
      • e.g. count, collect
  • Chain RDD transformations to implement the required functionality
33
Q

What are transformations in Spark?

A

Most used:
- map transformation
  - Returns a new RDD formed by passing each element of the source through the given function
- filter transformation
  - Returns a new RDD formed by keeping those elements of the source for which the given function returns TRUE
- flatMap transformation
  - Similar to map
  - But each input can be mapped to 0 or more output elements
  - e.g. a name flatMapped into first name and last name as separate output elements
- reduceByKey transformation
  - Processes elements that are (K, V) pairs (key, value)
  - Creates another set of (K, V) pairs where the values for each key are aggregated using the given function
- groupByKey transformation
  - Processes elements that are (K, V) pairs
  - For each key K it creates an iterable containing all values for that key
(See the PySpark sketch below.)
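A minimal PySpark illustration of these transformations, reusing the SparkContext sc from the earlier sketch (the sample data is made up):

  words = sc.parallelize(["brown fox", "lazy dog"])
  pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

  mapped   = words.map(lambda s: s.upper())         # collect() -> ["BROWN FOX", "LAZY DOG"]
  filtered = words.filter(lambda s: "fox" in s)     # collect() -> ["brown fox"]
  flat     = words.flatMap(lambda s: s.split())     # collect() -> ["brown", "fox", "lazy", "dog"]
  summed   = pairs.reduceByKey(lambda a, b: a + b)  # collect() -> [("a", 4), ("b", 2)]
  grouped  = pairs.groupByKey()                     # collect() -> ("a", <values 1, 3>), ("b", <values 2>)

All of these are lazy; nothing is computed until an action such as collect() is applied to one of them.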

34
Q

Transformations in Spark

A

Create a new RDD from an existing one

  • All transformations in Spark are lazy
  • Do not compute results right away
  • Computed only when an action requires a result
35
Q

Actions and transformations on RDDs are fully parallelizable

A
  • Synchronization required only on shuffling
36
Q

Lazy evaluation in Spark

A
  • Spark = static rule-based optimizations
  • Exploits lazy evaluation of transformations
  • The actual computation starts only when an action is called, e.g., collect() (see the sketch below)
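A hypothetical sketch of lazy evaluation, again reusing the SparkContext sc (the file path is a placeholder):

  logs    = sc.textFile("hdfs:///data/app.log")
  errors  = logs.filter(lambda l: "ERROR" in l)   # transformation: only recorded in the lineage
  lengths = errors.map(len)                       # transformation: only recorded in the lineage
  result  = lengths.collect()                     # action: the whole chain is computed here
  print(result)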