Introduction to Big Data Flashcards
What is Big Data?
Datasets so large, fast-growing, or varied that traditional storage and processing tools cannot handle them efficiently.
What are the three factors to consider when designating something as Big Data?
Its volume, variety, and velocity (the rate at which data is generated and must be processed).
What is variety?
The range of data formats involved. Much Big Data is unstructured or semi-structured (text, images, logs, video) rather than fitting neatly into relational tables.
What are the problems with Big Data?
Storage, Computational Efficiency, Data loss, Cost.
What are the traditional solutions for Big Data?
Relational Database Management systems (RDBMS), Grid Computing, RAID Systems.
What is RDBMS and what issues does it have?
Relational Database Management Systems (MySQL, Oracle, etc.). They have scalability issues (as the data gets bigger, so does the computational time), and they are designed to handle structured data.
**RDBMS are not horizontally scalable (you cannot improve performance simply by adding more machines to a cluster).
What is grid computing and what are some of its drawbacks?
Running a program in parallel across many networked machines, each computing on a portion of the data. It works well for compute-intensive tasks on low data volumes, but moving large volumes of data between nodes becomes a bottleneck.
**It also requires experience with low-level programming, so it is not suitable for mainstream use.
What is RAID and what are some of the drawbacks?
Redundant Array of Independent Disks (RAID) systems were not designed to scale: as volume increases, so does cost. Although vendors have tried to market them as scalable systems, those efforts have largely failed.
What is Hadoop?
A framework for distributed computing.
What are the two main components of Hadoop?
Hadoop Distributed File System (HDFS) -> storage solution
MapReduce -> Computation Solution
What does the Hadoop Distributed File System (HDFS) do?
Takes care of all your distributed storage complexities.
- Splitting your data into blocks
- Replicating each block to more than one node
- Keeping track of which block is stored on which node
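The three responsibilities above can be sketched as a toy model (this is not the real HDFS API; the block size, node names, and replication factor here are illustrative, whereas HDFS defaults to 128 MB blocks and a replication factor of 3):

```python
# Toy model of HDFS bookkeeping: split data into fixed-size blocks,
# place each block on several distinct nodes, and record the
# block-to-node mapping (the metadata a namenode would keep).
from itertools import cycle

BLOCK_SIZE = 4          # bytes; tiny for demonstration
REPLICATION = 2         # copies kept of each block
NODES = ["node1", "node2", "node3"]

def split_into_blocks(data, size=BLOCK_SIZE):
    """Split the raw bytes into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    ring = cycle(range(len(nodes)))
    for block_id in range(len(blocks)):
        start = next(ring)
        placement[block_id] = [nodes[(start + r) % len(nodes)]
                               for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data!")
placement = place_blocks(blocks)
print(len(blocks), placement[0])  # 4 ['node1', 'node2']
```

Losing any single node is survivable because every block also lives on another node; the mapping tells the system where to find the surviving copy.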
What is MapReduce?
A programming model implemented by Hadoop that takes care of all the computational complexities.
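The model can be sketched in plain Python with the classic word-count example (this is pure Python, not the Hadoop API): a map phase emits key-value pairs, the framework groups pairs by key (the shuffle), and a reduce phase aggregates each group.

```python
# Minimal word count in the MapReduce style: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data beats opinion"]
pairs = [p for doc in docs for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```

In real Hadoop the map and reduce functions run on many nodes at once, close to the blocks HDFS stores; the framework handles the shuffle, scheduling, and failure recovery.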
What was Hadoop built to work on?
Commodity hardware.
**needs a machine that has a processor, hard disk and RAM.
What do you need to deal with Big Data?
A distributed computing platform.