Big Data- Lec 2 (Hadoop) Flashcards
what are the 2 components that Hadoop is built on?
HDFS and MapReduce
t/f
Hadoop is a centralized file system. All data is stored in one location making it easier to gather information.
false - Hadoop is a distributed file system; its purpose is to promote sharing of dispersed files.
- t/f
A distributed file system is something where users are able to see everything at their end. ( Back end and front end are the same).
- Which of the following does a DFS not provide?
a. distributed storage
b. fault tolerance
c. low throughput data access
- F - users don't see the distribution at their level (back end and front end differ)
- c - a DFS provides HIGH throughput data access
t/f - hadoop infrastructure
- Hadoop is powerful but not scalable. it can only handle a few compute nodes and terabytes of information
- Hadoop is very cost-effective since it uses low cost commodity hardware
- hadoop is efficient since it distributes the data and processes it individually but never at the same time.
- F - it is scalable; it can handle thousands of compute nodes and PBs of info
- T
- F - yes, data is distributed, but it's efficient because the data is processed in parallel
t/f
- hadoop is an open source system
- hadoop provides analysis of only structured data
- compute nodes in hadoop can offer local computation and storage
- Apache Spark is a software library that allows us to do distributed processing of small datasets across computer 'clusters'
- T
- F - both structured and unstructured
- T
- F - Apache HADOOP is the library, and the data sets are huge (big data)
which is true about hadoop?
- The Hadoop framework includes master and slave nodes for HDFS and MapReduce
- master node for HDFS is job-tracker and for MapReduce it's name node
- slave node for HDFS is task-tracker and for MapReduce it's data node
- Hadoop is made up of only one component in its platform, which is Hive.
- T
- F - master node:
  - HDFS = name node
  - MapReduce = job tracker
- F - slave node:
  - HDFS = data node
  - MapReduce = task tracker
- F - it has many components (Hive, HBase, Spark, etc.)
which of the following is not correct about hadoop stack?
- Pig is a component of Hadoop used for system management and maintenance. It helps to grab or kill applications.
- Hive is a data warehouse-like tool that is built on top of MapReduce.
- Hadoop's main framework revolves around HDFS and MapReduce.
- ZooKeeper allows you to automate distributed data flow using a special language and do parallel computation.
- F - system management is done by ZooKeeper
- F - it's built on top of HDFS
- T
- F - Pig allows you to do this using a language called Pig Latin
T/F
- HDFS is responsible for distributed processing across computer clusters
- To transfer data, you can use Sqoop
- To process streaming data (that is very fast), you use HBase
- Hive is used to summarize data and to run SQL-like queries
- To deal with large tables in a distributed database, use Apache Spark as it can handle large datasets very quickly.
- F - MapReduce
- T
- F - use Spark
- T
- F - for large tables, use HBase
Which of the following is not true about what Hadoop is used for:
a. helps us deliver mashup services (GPS data, clickstream data)
b. helps give context for data (friends networks, social graphs)
c. creates a central data warehouse for our data for easy access
d. keeps apps running through edit logs, query logs
e. aggregates data-exhaust (messages, posts, blogs, etc.)
c- not one of the uses
Describe Hadoop's server roles
Hadoop has:
- distributed data processing = MapReduce
  - master = job tracker
  - slave = task tracker
- distributed data storage = HDFS
  - master = name node
  - slave = data node
- masters - manage and track all interactions
- slaves - actually do the work
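The master/slave split above can be sketched with a toy word count in plain Python. This is an illustration of the MapReduce programming model, not the real Hadoop API: in a real cluster the job tracker would hand map tasks to task trackers, and a shuffle step would group pairs before the reduce.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for each word in one input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reducer: after the shuffle, sum the counts for each key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data hadoop", "hadoop hdfs"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))  # {'big': 1, 'data': 1, 'hadoop': 2, 'hdfs': 1}
```

The point of the split is that `map_phase` can run on many slave nodes at once (one per block of input), which is why Hadoop is efficient despite the data being distributed.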
which of the following is not an assumption / goal of HDFS?
- assumes there will always be a second copy and thus minimizes replications to avoid redundancy
- hard failure - always possible
- we will always be dealing with large data sets, and HDFS's goal is to handle them
- hdfs wants to provide streaming data access which is best for batch processing
- follows a simple coherency model - write once, read many (best for batch access, not transactional data)
- portability across different hardware and software platforms.
- F - this is not an assumption or a goal of HDFS (HDFS replicates blocks for fault tolerance rather than minimizing copies)
________ refers to the overall performance of the system
a. optimization
b. throughput access
b
A distributed file system that provides high throughput access
a. Hadoop
b. HDFS
b - HDFS
what is not true about hdfs?
a. it breaks data into small blocks (128 MB default)
b. blocks are stored on the name node
c. blocks are replicated to other nodes to accomplish fault tolerance
d. the name node keeps track of where all the blocks are kept through metadata and by indexing each block
b - false because blocks are stored on data nodes
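The block-splitting and replication idea can be sketched in a few lines of Python. This is a toy model, not HDFS itself: the defaults (128 MB blocks, replication factor 3) come from the card, while the round-robin placement is a made-up simplification of the real rack-aware placement policy.

```python
BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    # Number of blocks = ceiling of file size / block size.
    return -(-file_size_mb // block_size_mb)

def place_replicas(block_id, data_nodes, replication=REPLICATION):
    # Toy placement: round-robin over data nodes. Real HDFS is
    # rack-aware and picks nodes on different racks for safety.
    return [data_nodes[(block_id + i) % len(data_nodes)]
            for i in range(replication)]

print(split_into_blocks(500))                        # 4 blocks for a 500 MB file
print(place_replicas(0, ["dn1", "dn2", "dn3", "dn4"]))  # ['dn1', 'dn2', 'dn3']
```

Note that the 3 copies of each block live on data nodes; the name node only holds the metadata saying where they are.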
Describe the rack awareness script
The rack awareness script is responsible for keeping track of which racks hold which nodes and whether they are working.
________ is responsible for holding the status of the data nodes inside a rack; it flags faulty data nodes and helps racks talk to each other.
switch
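The rack awareness idea can be sketched as a simple lookup table. This is a toy model: Hadoop actually calls an administrator-supplied script to map a node's address to a rack ID, and the addresses and rack names below are invented for illustration.

```python
# Hypothetical node-to-rack mapping (in real Hadoop this comes from
# an admin-provided topology script, not a hard-coded dict).
RACK_MAP = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}

def resolve_rack(node):
    # Unknown nodes fall back to a default rack, as Hadoop does.
    return RACK_MAP.get(node, "/default-rack")

def on_different_racks(node_a, node_b):
    # Replica placement prefers copies on different racks so a
    # whole-rack failure (e.g. a dead switch) loses at most one copy.
    return resolve_rack(node_a) != resolve_rack(node_b)

print(resolve_rack("10.0.1.11"))                      # /rack1
print(on_different_racks("10.0.1.11", "10.0.2.21"))   # True
```

This is why the name node wants rack information: it uses it to spread a block's replicas across racks, not just across data nodes.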