spark1 Flashcards

1
Q

Spark is best suited for _____ data.
* Real-time
* Virtual
* Structured
* All of the above

A

All of the above

2
Q

Which of the following is a feature of Apache Spark?
* Speed
* Supports multiple languages
* Advanced Analytics
* All of the above

A

All of the above

3
Q

What does Spark Engine do?
* Scheduling
* Distributing data across cluster
* Monitoring data across cluster
* All of the above

A

All of the above

4
Q

RDD can NOT be created from data stored on?
* LocalFS
* Oracle
* S3
* HDFS

A

Oracle

5
Q

For resource management, Spark can use?
* YARN
* Mesos
* Standalone cluster manager
* All of the above

A

All of the above

6
Q

Fault Tolerance in RDD is achieved using?
* Immutable nature of RDD
* DAG (Directed Acyclic Graph) or data lineage
* Both A&B
* Neither A nor B

A

Both A&B

7
Q

What is transformation in Spark RDD?
* Takes an RDD as input and produces one or more RDDs as output
* Returns the final results of RDD computations
* The way to send results from executors to the driver
* None of the above

A

Takes an RDD as input and produces one or more RDDs as output

8
Q

Which of the following is a feature of Spark RDD?
* In-memory computation
* Lazy evaluations
* Fault Tolerance
* All of the mentioned

A

All of the mentioned

9
Q

Four main components built on top of Spark Core

A
  • Spark ML
  • Spark SQL
  • Spark Streaming
  • GraphX
10
Q

Describe Spark ML

A

Spark ML provides simple APIs for running machine learning functions (classification, clustering, regression) and for building execution pipelines.

11
Q

Describe Spark SQL

A

Spark module for working with structured data

12
Q

Describe Spark Streaming

A

Large-scale, near-real-time stream processing framework

13
Q

Describe Spark GraphX

A

Spark API for graph-parallel computation; it includes:
- a growing collection of graph algorithms
- builders to simplify graph analytics tasks

14
Q

Features of Hive

A

good abstraction
declarative language
less error-prone
easier to learn & analyze
compiles to Java MapReduce code

15
Q

Key components of the Hive architecture

A

metastore
Thrift server
driver
HiveQL
Hive CLI

16
Q

Different modes of execution in Apache Pig

A
17
Q

4 Pig vs SQL

A
18
Q

3 key components of HBase

A

HBase RegionServer
HBase Master
ZooKeeper

19
Q

6 HBase vs DBMS

A
20
Q

Role of ZooKeeper in the HBase architecture

A

Managing the state and configuration of the HBase cluster; providing distributed coordination, leader election, synchronization, and locking services.

21
Q

4 How ZooKeeper achieves consistency, and how it achieves performance

A
22
Q

__ is a distributed graph processing framework on top of Spark.
* MLlib
* Spark streaming
* GraphX
* None of the above

A

GraphX

23
Q

Spark can run up to 100x faster than MapReduce due to?
* In-memory computing
* Development in Scala
* Stream processing
* Spark SQL

A

In-memory computing

24
Q

Creating an RDD

A

load from an external dataset (e.g. HDFS, S3, local file system)
create an RDD from another RDD (via a transformation)
parallelize an existing collection in the driver program

25
Q

Transformation operation examples

A

map, filter, join

26
Q

Action operation examples

A

count, collect, reduce, save (e.g. saveAsTextFile)

27
Q

Lazy evaluation

A

Transformations are not computed right away; they are only recorded, and run when an action needs the result.
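
Sketch (plain Python, not the Spark API): the hypothetical `LazyDataset` class below only illustrates the idea. Its `map` and `filter` record work without running it, and `collect` plays the role of an action that triggers the computation.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # pending transformations, in order

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: the recorded pipeline is executed now.
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every element that actually gets processed

def traced_double(x):
    calls.append(x)
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(traced_double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # nothing has run yet
result = ds.collect()              # the pipeline runs here
```

Until `collect()` is called, `traced_double` is never invoked; only the recipe exists.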

28
Q

RDD fault tolerance

A

no replication in memory; lost partitions are recomputed from lineage

29
Q

Lineage graph

A

maintains the dependencies between RDDs
on failure, goes back to the closest disk-based RDD and recomputes from there
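
Sketch (plain Python, not the Spark API): a hypothetical `Node` class that keeps only a parent pointer and the function that derives it. When the in-memory copy is lost, the result is rebuilt by walking back to the nearest node that still has data. Here `source` stands in for data on durable storage.

```python
class Node:
    """Toy lineage node: remembers its parent and how to derive itself from it."""

    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent
        self.fn = fn
        self.source = source  # stands in for data on durable storage
        self.cached = None    # stands in for partitions held in memory

    def compute(self):
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            data = list(self.source)              # re-read from "disk"
        else:
            data = self.fn(self.parent.compute())  # replay the lineage step
        self.cached = data
        return data


base = Node(source=[1, 2, 3])
doubled = Node(parent=base, fn=lambda xs: [x * 2 for x in xs])
total = Node(parent=doubled, fn=lambda xs: [sum(xs)])

first = total.compute()

# Simulate losing the in-memory results of the derived nodes.
doubled.cached = None
total.cached = None
recovered = total.compute()  # rebuilt from `base` through the recorded steps
```

No copy of `doubled` or `total` was kept anywhere; the recorded dependencies are enough to rebuild them.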

30
Q

Representation of an RDD

A

data part: _____
metadata information: _____

31
Q

data part: _____
metadata information: _____

A

data part: multiple partitions
metadata information: dependencies on parent RDDs

32
Q

Narrow dependency examples

A

filter, map

33
Q

Wide dependency examples

A

join, grouping
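
Sketch (plain Python, not the Spark API) contrasting the narrow and wide cases. Partitions are modeled as lists; `shuffle_by_key` and its byte-sum partitioner are hypothetical helpers chosen only to keep the example deterministic.

```python
# Two input partitions of (key, value) records.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow (map-like): each output partition depends on exactly one input partition.
narrow = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide (grouping): an output partition may need records from every input partition.
def shuffle_by_key(parts, num_out):
    out = [dict() for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            idx = sum(k.encode()) % num_out  # toy deterministic partitioner
            out[idx].setdefault(k, []).append(v)
    return out

wide = shuffle_by_key(partitions, 2)
```

The values for key `"a"` start in different input partitions and end up together, which is why grouping and joins require moving data between partitions.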

34
Q

DAG

A

directed acyclic graph

35
Q

Spark is faster than Hadoop MapReduce, especially for

A

iterative algorithms

36
Q

Spark needs more _____ than Hadoop

A

memory

37
Q

map() operates on the _____

A

entire record

38
Q

mapValues() operates on the _____

A

second component of the record (the value of a key-value pair; the key is left unchanged)
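
Sketch (plain Python, not the Spark API): `map_records` and `map_values` are hypothetical helpers that mimic the difference on a list of key-value pairs.

```python
records = [("apple", 3), ("pear", 5)]

def map_records(fn, recs):
    # fn receives the entire record (the whole (key, value) tuple).
    return [fn(r) for r in recs]

def map_values(fn, recs):
    # fn receives only the value; the key is passed through untouched.
    return [(k, fn(v)) for k, v in recs]

swapped = map_records(lambda kv: (kv[1], kv[0]), records)
doubled = map_values(lambda v: v * 2, records)
```

Because the value-only variant cannot change keys, the records stay grouped by key exactly as before.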

39
Q

cache() = persist(_____)

A

persist(MEMORY_ONLY)
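
Sketch (plain Python, not the Spark API) of why caching matters: the hypothetical `Dataset` class recomputes its data on every action unless `cache()` was requested, in which case the first result is kept in memory.

```python
class Dataset:
    """Toy dataset: recomputes on every action unless cache() was requested."""

    def __init__(self, compute):
        self._compute = compute   # function that (re)builds the data
        self._use_cache = False
        self._stored = None
        self._has_stored = False
        self.runs = 0             # how many times the data was computed

    def cache(self):
        # Only marks the dataset; nothing is computed at this point.
        self._use_cache = True
        return self

    def collect(self):
        if self._has_stored:
            return self._stored   # served from memory
        self.runs += 1
        data = self._compute()
        if self._use_cache:
            self._stored, self._has_stored = data, True
        return data


def squares():
    return [x * x for x in range(5)]

plain = Dataset(squares)
plain.collect()
plain.collect()       # computed a second time

cached = Dataset(squares).cache()
cached.collect()
cached.collect()      # reuses the stored result
```

This is the pattern that benefits iterative algorithms, which reuse the same dataset across many passes.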

40
Q

Benefits of Spark SQL

A

can execute SQL operations
optimizations like those in an RDBMS
built on top of, and extends, RDDs
easy integration with other Spark libraries

41
Q

Spark Streaming requirements

A

scalable to large clusters
second-scale latencies
simple programming model
integrated with batch & interactive processing
efficient fault tolerance

42
Q

Spark Streaming motivation

A

many important applications process large streams of live data

they require large clusters to handle the workload

they require latencies of a few seconds