IDT lecture 4 Flashcards
Drawbacks of older file-system-based databases (pre-1970s)
- redundancy
- inconsistencies
- data isolation
- integrity
- atomicity of updates
- concurrent access by multiple users
- security problems
The solution for these problems was the creation of the RDBMS… the RELATIONAL dbms
BIG DATA
Information assets that require NEW forms of processing.
The Vs of BIG DATA
Volume: amount of generated and stored data
Velocity: the speed/rate at which the data is generated, collected, processed
Variety: different types of data available (unstructured, semi structured)
Veracity: quality of captured data. Truthful/reliable data
Value: inherent wealth embedded in the data.
Visualization: display the data
Volatility: everything changes, data changes
Vulnerability: new security concerns
BIG DATA analytics: involves compromises when processing big data collections.
You need to compromise because we cannot process it like an RDBMS.
People look for patterns in the data, look for the top answers, etc.
Interactive Processing
Algorithms that stop the process, wait for user input, and then continue.
System users are asked to help during processing, and their answers are considered as part of the algorithm.
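A minimal Python sketch of the idea (looks_similar is a made-up stand-in for a real matcher): the algorithm pauses, asks the user, and continues with the answer.

    import difflib

    def looks_similar(a, b):
        # crude string-similarity heuristic, a stand-in for a real matcher
        return difflib.SequenceMatcher(None, a, b).ratio() > 0.8

    def deduplicate(records):
        unique = []
        for record in records:
            for kept in unique:
                if looks_similar(record, kept):
                    # stop the process, wait for user input, then continue
                    if input(f"Are '{record}' and '{kept}' the same? (y/n) ") == "y":
                        break  # user confirmed a duplicate: drop the record
            else:
                unique.append(record)  # no confirmed duplicate found
        return unique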
Approximate processing
use representative sample instead of whole population
- gives an approximate output, not an exact answer
- example: Einstein photos
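A minimal sketch in Python: estimate a statistic from a random sample instead of scanning the whole dataset.

    import random

    def approximate_mean(population, sample_size=1000):
        # use a representative sample instead of the whole population
        sample = random.sample(population, min(sample_size, len(population)))
        return sum(sample) / len(sample)  # approximate, not exact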
Crowdsourcing processing
Difficult tasks or opinion questions are given to a group of people.
Humans are asked about the relation between profiles for a small compensation per reply. Example: Amazon Mechanical Turk.
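A tiny sketch of the aggregation side, assuming string labels: replies from many workers are combined by majority vote.

    from collections import Counter

    def aggregate_crowd_answers(answers):
        # combine the paid replies from many workers by majority vote
        vote, count = Counter(answers).most_common(1)[0]
        return vote

    # aggregate_crowd_answers(["same", "same", "different"]) -> "same"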
Progressive processing
You have limited time/resources to give an answer.
Results are shown as soon as they are available (as opposed to SQL, where you must wait for the query to finish).
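A minimal sketch with a Python generator: partial answers are yielded the moment they improve, instead of only at the end.

    def progressive_max(stream):
        # show results as soon as they are available: yield the best
        # answer seen so far whenever it improves
        best = None
        for item in stream:
            if best is None or item > best:
                best = item
                yield best  # a partial (progressively better) result

    # for answer in progressive_max([3, 1, 7, 5, 9]): print(answer)  # 3, 7, 9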
Incremental processing
Data updates are frequent, making previous results obsolete.
Update the existing results instead of recomputing from scratch.
This method improves the answer as more information arrives.
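A minimal sketch: keep a result that is cheap to update per new value, so frequent data updates never force a full recomputation.

    class RunningMean:
        # incrementally maintained result: updating is O(1) per new value
        def __init__(self):
            self.total = 0.0
            self.count = 0

        def update(self, value):
            self.total += value
            self.count += 1

        def mean(self):
            return self.total / self.count if self.count else 0.0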
Scalability in Data Management for traditional dbs
Traditional dbs:
- SQL only (a constraint)
- efficiency limited by server capacity
Scaling can be done by:
- adding more hardware
- creating better algorithms
Solution for scalability of relational data (distributed dbs):
Distributed dbs (servers in different locations):
- add more DBMSs & partition the data (see the hash-partitioning sketch below)
- efficiency limited by the servers and the network
- scaling: add more/better servers, a faster network
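A minimal sketch of hash partitioning (the server count of 4 is arbitrary): each key is routed to exactly one DBMS server.

    import zlib

    def partition_for(key, num_servers=4):
        # deterministically map a row key to one of the DBMS servers
        return zlib.crc32(key.encode()) % num_servers

    # partition_for("user_42") always routes to the same server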
Massively parallel processing platforms
Move everything to the same place (as opposed to distributed dbs)
- connect computers over a LAN and make development, parallelization and robustness easy
- functionality:
generic data-intensive computing
Scaling: buy more or better computers
Cloud
Massively parallel processing platforms running over rented hardware.
Innovation: Elasticity, standardization
Based on the elasticity of demand (fluctuations), cloud resources are adjusted.
Elasticity can be adjusted automatically.
Scaling: it’s magic!
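Behind the "magic", a toy sketch of elasticity (capacity_per_worker is an assumed figure): the number of rented machines tracks demand.

    import math

    def desired_workers(current_load, capacity_per_worker=100):
        # elasticity: automatically scale rented resources with demand
        return max(1, math.ceil(current_load / capacity_per_worker))

    # desired_workers(350) -> 4; when demand drops, machines are released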
BIG DATA models
Store, Manage and Process by harnessing large clusters of commodity nodes
- MapReduce family: simpler, more constrained
ex: Hadoop
- 2nd gen: enables more complex processing and data, optimization opportunities
ex: PySpark (word-count sketch below)
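The classic word-count example, sketched in PySpark (the input path is assumed): the map steps split lines into (word, 1) pairs, the reduce step sums per word.

    from pyspark import SparkContext

    sc = SparkContext("local", "wordcount")
    counts = (sc.textFile("input.txt")                  # assumed input file
                .flatMap(lambda line: line.split())     # map: line -> words
                .map(lambda word: (word, 1))            # map: word -> (word, 1)
                .reduceByKey(lambda a, b: a + b))       # reduce: sum per word
    print(counts.take(10))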
Aspects of data intensive systems
- data storage
- needle in the haystack
- scalability (most important)