Midterm Flashcards

1
Q

What are the data types associated with Big Data?

A

Structured (tabular data), semi-structured (XML files), and unstructured (text, audio,
video, images) data are all associated with Big Data.

2
Q

Which statement best describes small data?

A

Small data is available in limited quantities that humans can easily interpret with little or no digital processing.

3
Q

Which of the following capabilities are quantifiable advantages of distributed processing?

A
  • You can add and remove execution nodes as and when required, significantly reducing infrastructure costs.
  • Since problem instructions are executed on separate execution nodes, memory and processing requirements are low even while processing large volumes of data.
  • Parallel processing can process Big Data in a fraction of the time compared to linear processing.
  • Parallel processing handles errors logically, without impacting other nodes.
4
Q

Which of these statements describes Big Data?

A
  • Data is generated in huge volumes and can be structured, semi-structured, or unstructured.
  • Big data arrives continuously at enormous speed from multiple sources.
  • Big data is mostly located in storage within enterprises and data centers.
5
Q

Which of the following capabilities are quantifiable advantages of parallel processing?

A

Parallel processing can process Big Data in a fraction of the time compared to linear processing.

6
Q

What is vertical scaling and horizontal scaling?

A
  • Vertical scaling improves the current system: a bigger computer.
  • Horizontal scaling adds more systems: more computers.
7
Q

What is Big data? Why does it matter?

A
  • Everything we do increasingly leaves a digital trace (or data) that we can collect and analyze to become smarter. Big Data refers to this entire process, not just the data itself.
  • It matters because it’s the future. Data is being collected all around us all the time and we can use it to improve our lives.
8
Q

Which of the following statements about Hadoop are true?

A
  • A Hadoop cluster is a collection of computers working together at the same time to perform tasks.
  • Hadoop allows for running applications on clusters.
  • It processes massive amounts of data in distributed file systems that are linked together.
  • It is a set of open-source programs and procedures that can be used as the framework for Big Data operations.
9
Q

MapReduce is a programming model used in Hadoop for processing Big Data. It is also a processing technique for what?

A

Distributed computing

10
Q

Which of the following key features of HDFS ensure against data loss?

A

Replication

11
Q

What are the components of a Hadoop 1 Architecture (before 2014)?

A

HDFS and MapReduce

12
Q

All of the following accurately describe Hadoop, except:

A
  • Open source
  • Java based
  • Real Time
  • Distributed Computing Approach

Answer:
- Real Time

13
Q

Which of the following is a component of Hadoop?

A
  • YARN
  • HDFS
  • MapReduce
14
Q

Namenode keeps metadata in?

A

HDFS

15
Q

Which of the following is a data processing engine for Hadoop Framework?

A

MapReduce

16
Q

In which language can you code in Hadoop?

A

Java

17
Q

Hadoop can be deployed on commodity servers, which provides low-cost processing and storage of huge volumes of unstructured data.

A

True

18
Q

Which of the following manages the resources among all the applications running in a Hadoop cluster?

A

YARN

19
Q

What are the main Hadoop components in Hadoop 2 and Hadoop 3? What functions do they perform?

A
  • YARN - Cluster Management
  • HDFS - Manages the storage of data
  • MapReduce - Framework to process data.
20
Q

What are the modes in which Hadoop can run? What are the differences among them?

A
  • Standalone - Hadoop runs with a single node. HDFS and YARN do not run.
  • Pseudo-Distributed - Hadoop runs on a single machine with separate JVMs for its processes. All 3 components run in this mode.
  • Full-Distributed - a full-fledged setup with Hadoop running on a cluster of machines. The cluster can be made up of Linux servers or a cloud service like AWS/Azure.
21
Q

What happens if the block on Hadoop HDFS is corrupted?

A

Each block is replicated, and the replicas are stored on different data nodes. The replica locations are also stored in the name node, so a corrupted block can be recovered from a healthy replica.

22
Q

What is the difference between NameNode and DataNode in Hadoop?

A
  • NameNode is the master node. It stores the directory structure and the metadata for all the files.
  • DataNodes are the other nodes; the data is physically stored on these nodes.
23
Q

Although the Hadoop Framework is implemented in Java, MapReduce applications need not be written in which of the following?

A
  • Python
  • Java
  • None of the Above
  • C++

Answer:
- None of the Above

24
Q

The number of maps is usually driven by the total size of ____?

A

The inputs (the number of map tasks is correlated with the input size)

25
Q

The minimum amount of data that HDFS can read or write is called a ____?

A

Block

26
Q

HDFS works in a ____ fashion?

A

Master-worker

27
Q

_____ is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer.

A

Hadoop Streaming

28
Q

Point out the wrong statement:

A
  • The outputs of the map-tasks go directly to the local File System.
  • The MapReduce framework does not sort the map-outputs before sending them to the reduce tasks. (it does sort)
  • It is legal to set the number of reduce-tasks to zero if no reduction is desired.

Answer:
- The MapReduce framework does not sort the map-outputs before sending them to the reduce tasks.

29
Q

What is Hadoop streaming API?

A

The Hadoop Streaming API allows other languages to be used to write Hadoop programs. Standard input and output are used to communicate with the program. Running a streaming job requires invoking the Hadoop streaming JAR file.
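As a local sketch (not Hadoop's actual API), a streaming-style word-count mapper and reducer could look like this in Python; the `sorted()` call stands in for the framework's shuffle/sort, and the input lines are made up for illustration:

```python
def mapper(lines):
    # A streaming mapper would print "word\t1" to stdout for every word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    # Hadoop sorts mapper output by key before the reducer sees it,
    # so identical words arrive on consecutive lines and can be tallied.
    current, count = None, 0
    for pair in sorted_pairs:
        word, n = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    # Local dry run; on a cluster these would be separate mapper/reducer
    # scripts passed to the Hadoop streaming JAR.
    mapped = sorted(mapper(["big data big cluster", "big data"]))
    print(list(reducer(mapped)))  # ['big\t3', 'cluster\t1', 'data\t2']
```

On a real cluster the two functions would live in separate executable scripts named on the streaming JAR's command line, communicating only via stdin/stdout as the card describes.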

30
Q

Benefits of YARN:

A
  • Scalability - YARN can manage thousands of nodes and clusters.
  • Compatibility - YARN supports existing MapReduce applications without disruption. It is compatible with Hadoop 1.0.
  • Cluster utilization - resources are allocated dynamically, so the cluster is used more efficiently.
  • Multi-tenancy - it allows multiple engines to access the same cluster.
31
Q

What are disadvantages of MapReduce programming model?

A
  • It has more lines of code, since it is low-level programming.
    - More development effort is involved.
  • It is not flexible: a job has one or more mappers and zero or more reducers, and can only be expressed within the MapReduce framework.
32
Q

Shuffle/sort

A

Always happens

33
Q

Sort/merge

A

Always happens. The sort step takes the output from each node, combines it, and puts the keys in alphabetical order; the merge step tallies each word. The reduce step then puts all the tallies together.
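The sort/merge tally described above can be simulated in a few lines of Python (hypothetical per-node mapper outputs; `groupby` stands in for the merge):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical mapper outputs from two nodes.
node1 = [("data", 1), ("big", 1), ("data", 1)]
node2 = [("big", 1), ("hadoop", 1)]

# Sort: merge all partitions and order alphabetically by key.
merged = sorted(node1 + node2, key=itemgetter(0))

# Merge/reduce: tally the counts for each word.
totals = {word: sum(n for _, n in group)
          for word, group in groupby(merged, key=itemgetter(0))}
print(totals)  # {'big': 2, 'data': 2, 'hadoop': 1}
```

`groupby` only groups adjacent equal keys, which is exactly why the sort must happen first, as the card notes.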

34
Q

What happens if the block on Hadoop HDFS is corrupted?

A

If a block is corrupted, HDFS relies on the replicas stored on other data nodes; these replicas prevent a corrupted block from failing the complete program. Furthermore, a bigger replication factor can be set to make storage more failure-resistant.

35
Q

Hadoop has a default strategy for storing 3 replicas:

A
  • The first replica is placed on one rack; the second replica is on a second rack; the third replica is on the same rack as the second but on a different node.
  • It gives a good balance between redundancy and bandwidth.
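A toy illustration of this placement rule (assumed rack and node names; not HDFS's actual placement code):

```python
def place_replicas(racks, local_rack):
    # racks: dict mapping rack name -> list of node names.
    # First replica on the writer's (local) rack.
    first = racks[local_rack][0]
    # Second replica on a different rack.
    other_rack = next(r for r in racks if r != local_rack)
    second = racks[other_rack][0]
    # Third replica on that same second rack, but a different node.
    third = next(n for n in racks[other_rack] if n != second)
    return [first, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(racks, "rack1"))  # ['n1', 'n3', 'n4']
```

Two racks hold all three copies, so one rack failure never loses the block, yet only one copy crosses the inter-rack link during the write - the redundancy/bandwidth balance the card mentions.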
36
Q

The name node is….

A

the heart of HDFS. It stores all the metadata of files, such as name, owner, and permissions, and it knows which data nodes the blocks of a file and their replicas are stored on. If the name node fails, all the files are lost because there is no way to reconstruct them.

37
Q

Hadoop is normally deployed on a group of machines called clusters.

A
  • Each machine in the cluster is a node.
  • One of those nodes is the master node; it manages the overall file system.
  • The name node stores the directory structure and the metadata for all files.
  • The other nodes are called data nodes; the data is physically stored on these nodes.
  • The default block size is 128 MB.
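With that 128 MB default, the number of blocks for a file is a simple ceiling division (a quick arithmetic sketch; the last block may be smaller than the rest):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def num_blocks(file_size_mb):
    # A file is split into ceil(size / block_size) blocks.
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(num_blocks(300))  # 3 blocks: 128 + 128 + 44 MB
print(num_blocks(128))  # 1 block
```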
38
Q

There are 3 scheduling policies available:

A
  • FIFO scheduler
  • Capacity scheduler
  • Fair scheduler
39
Q

FIFO Scheduler

A

FIFO scheduler - resources are allocated on First In First Out basis
- Requests made first are satisfied first before resources are allocated to others.
- On a cluster shared by many applications, FIFO will cause huge wait times. For this reason, the FIFO scheduler is rarely used.
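The wait-time problem can be seen in a toy simulation (hypothetical job names and durations; one job runs to completion at a time, in arrival order):

```python
from collections import deque

# (job name, duration) pairs, in submission order.
jobs = deque([("long_job", 100), ("short_job", 1)])

clock, finish_times = 0, {}
while jobs:
    name, duration = jobs.popleft()  # first in, first out
    clock += duration
    finish_times[name] = clock

print(finish_times)  # {'long_job': 100, 'short_job': 101}
```

A one-unit job finishes at time 101 simply because a long job arrived first - the starvation that the capacity and fair schedulers are designed to avoid.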

40
Q

Capacity Scheduler

A

Capacity scheduler - capacity is distributed to different queues.
- Each queue is allocated a share of the cluster resources
- Jobs can be submitted to a specific queue. Within a queue, FIFO scheduling is followed
- Allows small jobs to complete without getting stuck due to larger ones. The cluster might be under-utilized since capacity is reserved for queues.

41
Q

Fair Scheduler

A

Fair scheduler - resources are proportionally allocated among all running jobs.
- There is no wait time: each job gets a proportional share as soon as it is submitted, and shares are rebalanced as jobs arrive and finish.

42
Q

How do MapReduce, YARN, and HDFS work together?

A
  1. Users define map and reduce tasks (or other tasks) to run on the cluster.
  2. A job is triggered on the cluster.
  3. YARN figures out where and how to run the job, and the result is stored in HDFS.
43
Q

YARN performs these 3 steps using 2 services:

A
  • Resource manager - there is 1 resource manager per Hadoop cluster (which can consist of hundreds or thousands of computing nodes).
  • Node manager - a node manager runs on each data node in the cluster (all the data nodes).
44
Q

Resource Manager

A
  • The resource manager service runs on a single node - usually the same node as the HDFS name node.
  • It launches tasks submitted to YARN and arbitrates the available resources on the cluster among competing applications.
  • It optimizes cluster utilization based on constraints such as capacity guarantees, fairness, and SLAs.
  • It has a pluggable scheduler policy which determines which application runs first.
45
Q

Node manager

A
  • There is one node manager per node.
  • It is responsible for launching and managing containers on that node.
  • It coordinates with the resource manager in order to perform its tasks.
  • It monitors resources and logs and tracks the health of the node - everything related to the one node it is in charge of.
46
Q

A job is submitted to YARN

A
  • YARN has one resource manager and a node manager running on each computing node. A job is first submitted to the Resource Manager.
  • The RM schedules the job based on the constraints specified and the capacity available.
  • The RM then finds a Node Manager on one of the nodes to launch an Application Master process.
47
Q

Application Master Process

A

A process running within a container. It negotiates resources from the RM and works with the node managers to execute and monitor containers. Moving the responsibility for monitoring and executing tasks to the application master is what allowed YARN to scale. There is one application master per application, and it performs its task with the resources allocated to its containers; the number of application masters depends on how many jobs are running in the cluster.

48
Q

The location constraint

A
  • Data locality.
  • To conserve network bandwidth, YARN tries to perform the computation on the same node where the data resides. If no resources are available on that data node, YARN first waits some time for resources to free up. If that fails, the container is requested on the same rack as the concerned data node.