Lecture 4, 5 and 6 Flashcards
What is data normalization?
- Validates and improves a logical design so that it satisfies certain constraints
- Decomposes relations with anomalies to produce smaller, well-structured relations
What is the goal of data normalization?
- The goal is to avoid the following anomalies (see the sketch after this list):
1. Insertion anomaly
a. Adding new rows forces user to create duplicate data
2. Deletion anomaly
a. Deleting rows may cause a loss of data that is needed elsewhere
3. Modification anomaly
a. Changing data forces changes to other rows because of duplication
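A minimal SQL sketch of a relation that exhibits all three anomalies; the table and column names (EmployeeCourse, EmpID, CourseTitle, ...) are hypothetical and only illustrate the idea:

-- Hypothetical denormalized relation: employee data and course data in one table
CREATE TABLE EmployeeCourse (
    EmpID         INT,
    EmpName       VARCHAR(50),   -- repeated for every course the employee takes
    CourseID      INT,
    CourseTitle   VARCHAR(50),   -- repeated for every employee taking the course
    DateCompleted DATE,
    PRIMARY KEY (EmpID, CourseID)
);
-- Insertion anomaly: a new course cannot be stored until some employee takes it,
-- and every insert duplicates EmpName and CourseTitle values
-- Deletion anomaly: removing the last row for a course also loses its title
DELETE FROM EmployeeCourse WHERE CourseID = 10;
-- Modification anomaly: renaming a course must touch many rows; missing one row
-- leaves the data inconsistent
UPDATE EmployeeCourse SET CourseTitle = 'Databases II' WHERE CourseID = 20;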
What are well-structured relations?
- Relations that contain minimal data redundancy and allow users to insert, delete, and
update rows without causing data inconsistencies
What is the first normal form?
No multivalued attributes
- Steps:
- Ensure that every attribute value is atomic
- But, in the relational world one only works with relations that are already in 1NF
- So, in practice there is nothing to do (a minimal SQL sketch follows below)
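A minimal sketch in SQL of bringing a relation into 1NF, assuming a hypothetical Customer table whose PhoneNumbers column holds a comma-separated list:

-- Not in 1NF: PhoneNumbers stores a comma-separated list, i.e. a non-atomic value
-- CREATE TABLE Customer (CustID INT PRIMARY KEY, Name VARCHAR(50), PhoneNumbers VARCHAR(200));

-- 1NF version: every attribute value is atomic; the repeating group gets its own relation
CREATE TABLE Customer (
    CustID INT PRIMARY KEY,
    Name   VARCHAR(50)
);
CREATE TABLE CustomerPhone (
    CustID INT REFERENCES Customer(CustID),
    Phone  VARCHAR(20),
    PRIMARY KEY (CustID, Phone)
);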
What is 2nd normal form?
- 1NF + remove partial functional dependencies
- Steps:
- Create a new relation for each primary key attribute found in the old relation
- Move the nonkey attributes that depend only on this primary key attribute from
the old relation to the new relation (see the sketch below)
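A minimal SQL sketch of the 2NF steps, assuming a hypothetical Training relation with composite key (EmpID, CourseID) in which EmpName depends only on EmpID (a partial dependency):

-- Before: PRIMARY KEY (EmpID, CourseID); EmpName depends only on EmpID
-- CREATE TABLE Training (EmpID INT, CourseID INT, EmpName VARCHAR(50), Grade CHAR(1), PRIMARY KEY (EmpID, CourseID));

-- New relation keyed on the part of the key that EmpName depends on
CREATE TABLE Employee (
    EmpID   INT PRIMARY KEY,
    EmpName VARCHAR(50)
);
-- Old relation keeps only the attributes that depend on the whole key
CREATE TABLE Training (
    EmpID    INT REFERENCES Employee(EmpID),
    CourseID INT,
    Grade    CHAR(1),
    PRIMARY KEY (EmpID, CourseID)
);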
What is 3rd normal form?
- 2NF + remove transitive dependencies
- Steps:
- Create a new relation for each nonkey attribute that is a determinant in a
relation:
- make that attribute the key
- Move all dependent attributes to new relation
- Keep the determinant attribute in the old relation to serve as a foreign key (see the sketch below)
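A minimal SQL sketch of the 3NF steps, assuming a hypothetical Employee relation with the transitive dependency EmpID → DeptID → DeptName:

-- Before: DeptName depends on DeptID, which depends on the key EmpID (transitive)
-- CREATE TABLE Employee (EmpID INT PRIMARY KEY, EmpName VARCHAR(50), DeptID INT, DeptName VARCHAR(50));

-- The determinant DeptID becomes the key of a new relation; DeptName moves with it
CREATE TABLE Department (
    DeptID   INT PRIMARY KEY,
    DeptName VARCHAR(50)
);
-- DeptID stays in the old relation and serves as a foreign key
CREATE TABLE Employee (
    EmpID   INT PRIMARY KEY,
    EmpName VARCHAR(50),
    DeptID  INT REFERENCES Department(DeptID)
);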
What challenges arise from the application setting?
o Data characteristics
o System and resources
o Time restrictions
What are the challenges of data management?
• Veracity
o Traditionally, structured data with known semantics and quality
o Now, dealing with high levels of noise (e.g. profile noise)
• Volume
o Very large number of profiles
• Variety
o Large volumes of semi-structured, unstructured or highly heterogeneous structured data
• Velocity
o Increasing rate at which new data arrives and must be processed
Properties of traditional databases
Constrained functionality: SQL only
Efficiency limited by server capacity
- Memory
- CPU
- HDD
- Network
Scaling can be done by
- Adding more hardware
- Creating better algorithms
- But there are still limits
Properties of distributed databases
Innovation
- Add more DBMS instances and partition the data across them (see the partitioning sketch after this card)
Constrained functionality
- Answer SQL queries
Efficiency limited by #servers, network
API offers location transparency
- User/application always sees a single machine
- User/application does not need to care about data location
Scaling: add more/better servers, faster network
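A minimal sketch of partitioning the data, using PostgreSQL-style declarative hash partitioning with made-up table and column names; in a real distributed database the partitions would live on different servers behind the location-transparent API:

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      NUMERIC,
    PRIMARY KEY (order_id, customer_id)   -- must include the partition key
) PARTITION BY HASH (customer_id);

CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 2, REMAINDER 1);

-- Queries still address the single logical table; the system routes them to partitions
SELECT SUM(amount) FROM orders WHERE customer_id = 42;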
Properties of massively parallel processing platforms:
Innovation
- Connect computers (nodes) over LAN
- make development, parallelization and robustness easy
Functionality
- Generic data-intensive computing
Efficiency relies on network, #computers & algorithms
API offers location & parallelism transparency
- Developers do not need to know where data is stored or how the code will be parallelized
Scaling: add more and better computers
Properties of the cloud
Massively parallel processing platforms running on rented hardware
- Innovation
- Elasticity, standardization
- e.g. a university requires few resources during the holidays, while Amazon
requires a lot of resources → elasticity
Elasticity can be automatically adjusted
API offers location and parallelism transparency
Scaling: It’s magic!
Five characteristics of big data
Volume
- quantity of generated and stored data
Velocity
- speed at which the data is processed and stored
Variety
- type and nature of the data
Variability
- inconsistency of the data set
Veracity
- quality of captured data
Architectural choices to consider:
- Storage layer
- Programming model & execution engine
- Scheduling
- Optimizations
- Fault tolerance
- Load balancing
Requirements of storage layer
- Scalability: handle the ever-increasing data sizes
- Efficiency: fast accesses to data
- Simplicity: hide complexity from the developers
- Fault-tolerance: failures do not lead to loss of data
• Developers are NOT reading from or writing to the files explicitly
• Distributed File System handles IO transparently
o Several DFS are already available, e.g.:
- Hadoop Distributed File System
- Google File System
- Cosmos File System