Week 2: Foundation For Big Data Systems Flashcards
1
Q
What is a distributed file system?
A
- A physically distributed implementation of the traditional file system
- Lets users manipulate, organise, and share data seamlessly
- Works regardless of where the data actually resides on the network
2
Q
What are 2 benefits of data replication?
A
- Makes the system more fault tolerant
- Helps scale access to the data across many users
3
Q
What 3 things do distributed file systems provide?
A
- Data scalability
- Fault tolerance
- High concurrency through partitioning and replication of data on many nodes
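A rough sketch of how partitioning and replication give fault tolerance: the file is split into fixed-size blocks, and each block is copied to several nodes. The function names, node names, and the round-robin placement below are assumptions made for illustration; they do not describe how any particular file system places data.

```python
def place_blocks(data: bytes, block_size: int, nodes: list[str], replicas: int) -> dict[int, list[str]]:
    """Partition data into blocks and assign each block to `replicas` distinct nodes."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # Ceiling division: a final partial block still counts as a block.
    num_blocks = (len(data) + block_size - 1) // block_size
    placement = {}
    for block in range(num_blocks):
        # Hypothetical round-robin placement: consecutive nodes hold the copies.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replicas)]
    return placement


def readable_blocks(placement: dict[int, list[str]], failed: set[str]) -> list[int]:
    """Return the blocks that still have at least one copy on a healthy node."""
    return [b for b, holders in placement.items() if any(n not in failed for n in holders)]


placement = place_blocks(b"x" * 10, block_size=4, nodes=["n1", "n2", "n3"], replicas=2)
# With 2 copies per block, losing one node leaves every block readable.
survivors = readable_blocks(placement, failed={"n2"})
```

Because every block lives on more than one node, several readers can also be served from different copies at the same time, which is where the concurrency benefit comes from.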
4
Q
What is the frequency of updates in big data systems?
A
- Data is written once
- Updates are maintained as additional data sets over time
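The write-once idea can be illustrated with a small sketch: earlier data sets are never modified, and the current state is derived by reading all data sets in order. The record layout and the `current_view` helper are hypothetical names chosen for this example.

```python
def current_view(batches: list[dict[str, int]]) -> dict[str, int]:
    """Derive the latest value per key by reading the batches oldest to newest."""
    view: dict[str, int] = {}
    for batch in batches:
        # Later batches override earlier values in the view only;
        # the stored batches themselves are left untouched.
        view.update(batch)
    return view


batches = [{"a": 1, "b": 2}]      # original data set, written once
batches.append({"b": 3})          # an update arrives as an additional data set
view = current_view(batches)
```

The original batch still holds the old value of `b`, so the history of the data stays available alongside the latest view.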
5
Q
Parallel computing
A
- Computation that requires more than one computing node or parallel processing
6
Q
commodity clusters
A
- Affordable parallel computers with an average number of computing nodes
7
Q
3 commodity cluster cons
A
- Not as powerful as traditional parallel computers
- Often built out of less specialised nodes
- Higher potential for partial failures
8
Q
What is Apache Hadoop
A
A framework that allows distributed processing of large data sets
9
Q
What are the 3 parts of the Hadoop ecosystem?
A
- Hadoop Distributed File System
- Hadoop YARN
- Hadoop MapReduce
10
Q
What is Hadoop HDFS?
A
A distributed file system that provides high-throughput access to application data
11
Q
What is Hadoop YARN?
A
A framework for job scheduling and cluster resource management
12
Q
What is Hadoop MapReduce?
A
A YARN-based system for parallel processing of large data sets
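To make the map and reduce steps concrete, here is a minimal single-machine sketch of the MapReduce programming model using word count. It is plain Python, not the Hadoop API; the function names are assumptions for illustration, and a real job would run the map and reduce tasks in parallel across cluster nodes.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big systems"])
```

Each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, which is what allows the work to be spread over many nodes.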