Lecture 4, 5 and 6 Flashcards

1
Q

What is data normalization?

A

Validates and improves a logical design so that it satisfies certain constraints
- Decomposes relations with anomalies to produce smaller, well-structured relations

2
Q

What is the goal of data normalization?

A
  • Goal is to avoid anomalies:
    1. Insertion anomaly
       a. Adding new rows forces the user to create duplicate data
    2. Deletion anomaly
       a. Deleting rows may cause a loss of data that is still needed elsewhere
    3. Modification anomaly
       a. Changing data forces changes to other rows because of duplication
3
Q

What are well-structured relations?

A
  • relations that contain minimal data redundancy and allow users to insert, delete, and
    update rows without causing data inconsistencies
4
Q

What is the first normal form?

A

No multivalued attributes

  • Steps:
    • Ensure that every attribute value is atomic
  • But in the relational world one only works with relations in 1NF
  • So there is no need to actually do anything
5
Q

What is 2nd normal form?

A
  • 1NF + remove partial functional dependencies
  • Create a new relation for each primary key attribute found in the old relation
  • Move the nonkey attributes that depend only on this primary key attribute from the old relation to the new relation (see the example below)
6
Q

What is 3rd normal form?

A

2NF + remove transitive dependencies
- Steps:
  - Create a new relation for each nonkey attribute that is a determinant in a relation:
    - Make that attribute the key of the new relation
    - Move all dependent attributes to the new relation
    - Keep the determinant attribute in the old relation to serve as a foreign key
  (see the example below)
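A hypothetical illustration (again with made-up relation names): in Employee(EmpID, Name, DeptID, DeptName), DeptName depends on the nonkey determinant DeptID, a transitive dependency. 3NF decomposes it into Employee(EmpID, Name, DeptID) and Department(DeptID, DeptName), keeping DeptID in Employee as a foreign key.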

7
Q

What challenges arise from the application setting?

A

o Data characteristics
o System and resources
o Time restrictions

8
Q

What are the challenges of data management?

A

• Veracity
o Structured data with known semantics and quality
o Dealing with high levels of profile noise
• Volume
o Very large number of profiles
• Variety
o Large volumes of semi-structured, unstructured or highly heterogeneous structured data
• Velocity
o Increasing rate at which data becomes available

9
Q

Properties of traditional databases

A
Constrained functionality: SQL only
Efficiency limited by server capacity
- Memory
- CPU
- HDD
- Network
Scaling can be done by
- Adding more hardware
- Creating better algorithms
- But there are still limits
10
Q

Properties of distributed databases

A

Innovation
- Add more DBMS and partition the data
Constrained functionality
- Answer SQL queries
Efficiency limited by #servers, network
API offers location transparency
- User/application always sees a single machine
- User/application does not care about data location
Scaling: add more/better servers, faster network

11
Q

Properties of massively parallel processing platforms

A

Innovation
- Connect computers (nodes) over LAN
- make development, parallelization and robustness easy
Functionality
- Generic data-intensive computing
Efficiency relies on network, #computers & algorithms
API offers location & parallelism transparency
- Developers do not know where data is stored or how the code will be parallelized
Scaling: add more and better computers

12
Q

Properties of the cloud

A

Massively parallel processing platforms running on rented hardware
Innovation
- Elasticity, standardization
- e.g. a university requires few resources during the holidays, while Amazon requires a lot of resources → elasticity
- Elasticity can be adjusted automatically
API offers location and parallelism transparency
Scaling: It’s magic!

13
Q

Five characteristics of big data

A
Volume
- quantity of generated and stored data
Velocity
- speed at which the data is processed and stored
Variety
- Type and nature of the data

Variability
- inconsistency of the data set
Veracity
- quality of captured data

14
Q

Architectural choices to consider:

A
  • Storage layer
  • Programming model & execution engine
  • Scheduling
  • Optimizations
  • Fault tolerance
  • Load balancing
15
Q

Requirements of storage layer

A
  • Scalability: handle the ever-increasing data sizes
  • Efficiency: fast accesses to data
  • Simplicity: hide complexity from the developers
  • Fault-tolerance: failures do not lead to loss of data

• Developers are NOT reading from or writing to the files explicitly
• The Distributed File System handles I/O transparently
o Several DFSs are already available:
 Hadoop Distributed File System
 Google File System
 Cosmos File System

16
Q

What is HDFS?

A

• Files partitioned into blocks
• Blocks distributed and replicated across nodes
• Three types of nodes in HDFS, each with one functionality:
o Name nodes: Keep the locations of blocks
o Secondary name nodes: backup nodes
o Data nodes: keep the actual blocks

17
Q

What happens with a failed data node?

A
  • Name node and data nodes communicate using heartbeats
  • A heartbeat is the signal sent by a data node to the name node at regular intervals to indicate that it is still present and working
  • On failure, the name node removes the failed data node from the index
  • Lost partitions are re-replicated to the remaining data nodes
18
Q

Properties of HDFS

A

• Scalability: Handle the ever-increasing data sizes
o Just add more data nodes
• Efficiency: Fast accesses to data
o Everything read from hard disk (requires I/O)
• Simplicity: Hide complexity from the developers
o No need to know where each block is stored
• Fault-tolerance: Failures do not lead to loss of data
o Administrator can control replication
o If failures are not widespread, no data is lost

19
Q

What is big data analytics?

A

• Driven by artificial intelligence, mobile devices, social media and the Internet of Things (IoT)
• Data sources are becoming more complex than those for traditional data
o e.g., Web applications allow user-generated data

In order to:
• Deliver deeper insights
• Power innovative data applications
• Make better and faster decisions
• Predict future outcomes
• Enhance business intelligence
20
Q

Types of analytics:

A

Traditional computation
- Exact and complete answers over the whole data collection
Approximate
- Use a representative sample instead of the entire input data collection
- Give approximate output, not exact answers
- Answers are given within guarantees
Progressive
- Efficiently process given the limited time and/or computational resources that are currently available
Incremental
- The rate of data updates is often high, which quickly makes previous results obsolete
- Update existing processing information
- Allow leveraging new evidence from the updates to fix previous inconsistencies or complete the information

21
Q

What is MapReduce?

A

A programming paradigm (~language) for the creation of code that supports the following:
- Easy scale-out
- Parallelism & location transparency
- Simple to code and learn
- Fault tolerance
  - Among 1000s of off-the-shelf computers, one WILL fail
- Constrain the user to simple constructs!

22
Q

What is the data model of MapReduce?

A
  • Basic unit of information: the key-value pair
  • Translate data to key-value pairs
  • Thus it can work on various data types (structured, unstructured, etc.)
  • Then pass the pairs through MapReduce
23
Q

What is the programming model of MapReduce?

A

Model based on different functions
- Primary ones: the Map function and the Reduce function
- Map(key, value):
  - Invoked for every split of the input data
  - Value corresponds to the records (lines) in the split
- Reduce(key, list(values)):
  - Invoked for every unique key emitted by Map
  - list(values) corresponds to all values emitted from ALL mappers for this key
- Combine(key, list(values)):
  - Locally merges the values for each key at each node to reduce the number of cross-node messages
  - No guarantee that it will actually be executed!
  - Typically invoked after a fixed-memory buffer is full
(A small word-count sketch follows below.)
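As an illustration only (not from the slides), a word-count sketch of the Map and Reduce functions in plain Python, with the framework's shuffle/grouping step simulated by a dictionary:

  # Hypothetical word-count sketch; a real framework would run map_fn and
  # reduce_fn distributed over many nodes.
  from collections import defaultdict

  def map_fn(key, value):
      # key: split id, value: one line of the split; emit (word, 1) pairs
      return [(word, 1) for word in value.split()]

  def reduce_fn(key, values):
      # key: a word, values: all counts emitted by ALL mappers for this word
      return (key, sum(values))

  lines = ["the quick brown fox", "the lazy dog"]
  groups = defaultdict(list)
  for i, line in enumerate(lines):
      for k, v in map_fn(i, line):      # map phase
          groups[k].append(v)           # shuffle: group values by key
  print([reduce_fn(k, vs) for k, vs in groups.items()])
  # [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]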

24
Q

Downsides of MapReduce

A
MapReduce is not a panacea
MapReduce is simple but weak for some requirements:
•	Cannot define complex processes
•	Batch mode, acyclic, not iterative
•	Everything file-based, no distributed memory 
•	Difficult to optimize
Not good in:
•	Iterative processes, e.g., clustering
•	Real-time answers, e.g., streams 
•	Graph queries, e.g., shortest path
25
Q

More problems with MapReduce

A

Problems with MapReduce:
- MapReduce is a major step backwards (DeWitt 2008)
Performance: Extensive I/O (input/output)
- Everything is a file stored in the HDFS
- Data access too slow
- RAM is not used sufficiently
Programming model: limited expressiveness
- e.g. iterations, cyclic processes
- Code is difficult to optimize
- SQL: several optimization methodologies

26
Q

What is Spark's dataflow paradigm?

A
  • Models an algorithm as a directed graph with the data flowing between operations
  • Construction goals:
    • Improve expressiveness and extensibility
    • Make coding easier: strive for high-level code
    • Enable additional optimizations
    • Improve performance by utilizing the hardware better (RAM)
  • Representative examples:
    • Spark, Apache, Dryad, Pregel
27
Q

What are Spark's architectural choices?

A
Architectural choices of Spark
•	Storage layer
o	Resilient Distributed Datasets (RDDs) 
o	Datasets and data sources
o	Input files still stored in HDFS
•	Programming model and execution engine
28
Q

RDD storage layer requirements

A
  • Scalability: handle the ever-increasing data sizes
  • Efficiency: fast accesses to data
  • Simplicity: hide complexity from the developers
  • Fault-tolerance: failures do not lead to loss of data
  • Fast RAM for hot data: recent data stored in RAM
29
Q

What is an RDD?

A

• R: resilient
o Recovers from failures
• D: distributed
o Parts are placed on different computers
• D: dataset
o A collection of data
o Array, table, data frame, etc.
Distributed, fault-tolerant collections of elements that can be processed in parallel

Resilient Distributed Datasets
• Created by (see the sketch below)
o Loading data from stable storage, e.g., from HDFS
o Manipulating existing RDDs
• Core properties
o Immutable, i.e., read-only, cannot change
o Distributed
o Lazily evaluated
 Joining multiple orders together (restaurant example)
 Intuition → optimization
o Cacheable → by default, stored in memory!
o Replicated
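A minimal PySpark sketch of the two ways to create an RDD (the HDFS path and app name are placeholders):

  # Illustrative only; assumes a Spark installation is available.
  from pyspark import SparkContext

  sc = SparkContext(appName="rdd-demo")
  lines  = sc.textFile("hdfs:///data/orders.txt")   # created from stable storage
  errors = lines.filter(lambda l: "ERROR" in l)     # created from an existing RDD
  errors.cache()                                    # cacheable: keep it in memory
  print(errors.count())                             # action that triggers the work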

30
Q

What does an RDD contain?

A
•	Details about the data
o	I.e., data location or the actual data
• Lineage information (its "history")
o Dependencies on other RDDs
o	Functions/transformations for recreating a lost split of an RDD from a previous RDD!
•	Examples:
o	RDD2 = RDD1.filter(...) 
o	RDD3 = RDD2.transform(...)
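Continuing the sketch above (reusing the SparkContext sc), the lineage Spark keeps for recreating a lost split can be inspected; the exact output format varies by Spark version:

  rdd1 = sc.textFile("hdfs:///data/orders.txt")   # placeholder path
  rdd2 = rdd1.filter(lambda l: "pizza" in l)
  rdd3 = rdd2.map(lambda l: l.lower())
  print(rdd3.toDebugString())   # shows the chain of dependencies (the "history")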
31
Q

Why the Spark programming model?

A

Why?

  • MapReduce is simple but weak
    • Cannot define complex processes
    • Batch mode, acyclic, not iterative
    • Everything is file-based, no distributed memory
    • Procedural → difficult to optimize
  • Spark
    • Processing expressed as a directed acyclic graph (DAG)
32
Q

Spark: Dataflow and RDDs

A

Spark development is RDD-centric

  • In the future, dataset-centric
  • RDDs enable operations:
    • Transformations (lazy operations)
      • e.g. map, filter, flatMap, joins
    • Actions
      • e.g. count, collect
  • Chain RDD transformations to implement the required functionality
33
Q

What are transformations in Spark?

A

Most used:
- map transformation
  - Returns a new RDD formed by passing each element of the source through the given function
- filter transformation
  - Returns a new RDD formed by keeping those elements of the source for which the given function returns TRUE
- flatMap transformation
  - Similar to map
  - But each input can be mapped to 0 or more output elements
  - e.g. a name flatMapped into first name and last name as separate output elements
- reduceByKey transformation
  - Processes elements that are (K, V) pairs (key, value)
  - Creates another set of (K, V) pairs where the values for each key are aggregated using the given function
- groupByKey transformation
  - Processes elements that are (K, V) pairs
  - For each key K it creates an iterable containing all values for that key
(See the PySpark sketch below.)
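A minimal PySpark illustration of these transformations, reusing the SparkContext sc from the earlier sketch (the sample data is made up):

  words = sc.parallelize(["brown fox", "lazy dog"])
  pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

  mapped   = words.map(lambda s: s.upper())         # collect() -> ["BROWN FOX", "LAZY DOG"]
  filtered = words.filter(lambda s: "fox" in s)     # collect() -> ["brown fox"]
  flat     = words.flatMap(lambda s: s.split())     # collect() -> ["brown", "fox", "lazy", "dog"]
  summed   = pairs.reduceByKey(lambda a, b: a + b)  # collect() -> [("a", 4), ("b", 2)]
  grouped  = pairs.groupByKey()                     # collect() -> ("a", <values 1, 3>), ("b", <values 2>)

All of these are lazy; nothing is computed until an action such as collect() is applied to one of them.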

34
Q

Transformations in Spark

A

Create a new RDD from an existing one

  • All transformations in Spark are lazy
  • Do not compute results right away
  • Computed only when an action requires a result
35
Q

Actions and transformations on RDDs are fully parallelizable

A
  • Synchronization required only on shuffling
36
Q

Lazy evaluation in Spark

A
  • Spark = static rule-based optimizations
  • Exploits lazy evaluation of transformations
  • The actual computation starts only when an action is called, e.g., collect() (see the sketch below)
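A hypothetical sketch of lazy evaluation, again reusing the SparkContext sc (the file path is a placeholder):

  logs    = sc.textFile("hdfs:///data/app.log")
  errors  = logs.filter(lambda l: "ERROR" in l)   # transformation: only recorded in the lineage
  lengths = errors.map(len)                       # transformation: only recorded in the lineage
  result  = lengths.collect()                     # action: the whole chain is computed here
  print(result)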