spark1 Flashcards by Nehal Alnono

Spark is best suited for _____ data.
* Real-time
* Virtual
* Structured
* All of the above

All of the above

Which of the following is a feature of Apache Spark?
* Speed
* Supports multiple languages
* Advanced Analytics
* All of the above

All of the above

What does Spark Engine do?
* Scheduling
* Distributing data across cluster
* Monitoring data across cluster
* All of the above

All of the above

RDD can NOT be created from data stored on?
* LocalFS
* Oracle
* S3
* HDFS

Oracle

For resource management, Spark can use?
* YARN
* Mesos
* Standalone cluster manager
* All of the above

All of the above

Fault Tolerance in RDD is achieved using?
* Immutable nature of RDD
* DAG (Directed Acyclic Graph) or data lineage
* Both A&B
* Neither A nor B

Both A&B

What is transformation in Spark RDD?
* Takes an RDD as input and produces one or more RDDs as output
* Returns the final results of RDD computations
* The way to send results from executors to the driver
* None of the above

Takes an RDD as input and produces one or more RDDs as output

Which of the following is a feature of Spark RDD?
* In-memory computation
* Lazy evaluations
* Fault Tolerance
* All of the mentioned

All of the mentioned

Four main components built on top of Spark Core

Spark ML
Spark SQL
Spark streaming
Spark GraphX

Describe Spark ML

Spark ML provides simple APIs for running machine learning functions (classification, clustering, regression) and for creating execution pipelines

Describe Spark SQL

Spark module for working with structured data

Describe Spark Streaming

Large-scale, near-real-time stream processing framework

Describe Spark GraphX

Spark API for graph-parallel computation; it includes:
- a growing collection of graph algorithms
- builders to simplify graph analytics tasks

Features of Hive

good abstraction
declarative language
less error-prone
easier to learn & analyze
compiles to Java MapReduce code

Four key components of the Hive architecture

metastore
Thrift server
driver
HiveQL
Hive CLI

Different modes of execution in Apache Pig

4 Pig vs SQL

3 Key components of HBase

HBase RegionServer
HBase Master
ZooKeeper

6 HBase vs DBMS

Role of ZooKeeper in the HBase architecture

Managing the state and configuration of the HBase cluster, and providing distributed coordination, leader election, synchronization, and locking services.

4 How ZooKeeper achieves consistency, and how it achieves performance

__ is a distributed graph processing framework on top of Spark.
* MLlib
* Spark streaming
* GraphX
* None of the above

GraphX

Spark can be up to 100x faster than MapReduce due to?
* In-memory computing
* Development in Scala
* Stream processing
* Spark SQL

In-memory computing

creating RDD

load from an external dataset (e.g., local FS, S3, HDFS)
create an RDD from another RDD
parallelize a collection held in the driver program
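
As a rough illustration of the last option, parallelizing a collection amounts to splitting a driver-side list into partitions. The sketch below is plain Python, not Spark; `parallelize_toy` and `num_slices` are hypothetical names used only for illustration. In PySpark the corresponding call is `sc.parallelize(...)`.

```python
def parallelize_toy(data, num_slices):
    """Split a local list into num_slices contiguous partitions (toy model)."""
    n = len(data)
    # Partition i covers indices [i*n//num_slices, (i+1)*n//num_slices).
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


partitions = parallelize_toy(list(range(10)), 3)
print(partitions)
```

Each inner list stands in for one partition that a cluster would process on a separate executor.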

transformation operation examples

map, filter, join

action operations examples

count, collect, reduce, save
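
To make the idea of an action concrete, here is a small plain-Python sketch (not Spark code) of how count, collect, and reduce return a final result to the driver from partitioned data. The `partitions` list is a made-up example input.

```python
from functools import reduce

# Hypothetical partitioned dataset: one inner list per partition.
partitions = [[1, 2, 3], [4, 5], [6]]

# count: sum the sizes of all partitions.
count = sum(len(p) for p in partitions)

# collect: bring every record back into a single driver-side list.
collected = [x for p in partitions for x in p]

# reduce: combine within each partition first, then combine the partial results.
partials = [reduce(lambda a, b: a + b, p) for p in partitions]
total = reduce(lambda a, b: a + b, partials)

print(count, collected, total)
```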

lazy evaluation

not computed right away
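
A minimal sketch of what "not computed right away" means, written in plain Python rather than Spark. The class `LazyDataset` and its methods are hypothetical stand-ins: transformations only record work, and the action `collect` is what finally runs it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, source, steps=()):
        self.source = source   # list holding the input records
        self.steps = steps     # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self.source, self.steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded chain actually executed.
        out = []
        for record in self.source:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


calls = []

def square(x):
    calls.append(x)   # side effect lets us observe when work happens
    return x * x


ds = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has run yet
result = ds.collect()              # triggers the computation
print(calls_before_action, result)
```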

rdd fault tolerance

no replication in memory; lost data is recomputed using lineage

lineage graph

maintains the dependencies between RDDs; on failure, go back to the closest disk-based RDD and recompute from there
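
The recovery idea can be sketched in plain Python (this is a toy model, not Spark internals). The class `Node` and its fields are hypothetical: each node remembers its parent and the function that derived it, so lost in-memory data can be rebuilt by walking back to the disk-based source.

```python
class Node:
    """Toy lineage node: remembers its parent and how it was derived."""

    def __init__(self, parent=None, fn=None, disk_data=None):
        self.parent = parent        # dependency on the parent dataset
        self.fn = fn                # function applied to each parent record
        self.disk_data = disk_data  # set only for the disk-based source
        self.memory = None          # in-memory copy; may be lost on failure

    def compute(self):
        if self.memory is not None:      # still in memory: reuse it
            return self.memory
        if self.disk_data is not None:   # reached the disk-based source
            return list(self.disk_data)
        # Otherwise walk back through the lineage and re-derive the data.
        self.memory = [self.fn(x) for x in self.parent.compute()]
        return self.memory


source = Node(disk_data=[1, 2, 3])
doubled = Node(parent=source, fn=lambda x: x * 2)
plus_one = Node(parent=doubled, fn=lambda x: x + 1)

first = plus_one.compute()
doubled.memory = None            # simulate losing the in-memory data
plus_one.memory = None
recovered = plus_one.compute()   # rebuilt from the source via lineage
print(first, recovered)
```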

representation of RDD

data part: _____
metadata information: _____

data part: multiple partitions
metadata information: dependencies on parent RDDs

narrow dependency ex

filter, map

wide dependency ex

join, grouping
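
The difference between the two dependency types can be sketched in plain Python (a toy model, not Spark). The sample `partitions` data and the `ord`-based partitioner in `group_by_key` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (key, value) records.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow dependency: each output partition depends on exactly one input
# partition, so every partition can be processed independently.
narrow = [[(k, v * 10) for k, v in part] for part in partitions]


# Wide dependency: grouping by key needs records from all input
# partitions, so records must first be shuffled by key.
def group_by_key(parts, num_out):
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            out[ord(k) % num_out][k].append(v)   # toy key-based partitioner
    return [dict(d) for d in out]


wide = group_by_key(partitions, 2)
print(narrow)
print(wide)
```

Note that the values for key "a" start in different input partitions and end up together in one output partition, which is what makes the dependency wide.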

DAG

directed acyclic graph

Spark is faster than Hadoop, especially for

iterative algorithms

Spark needs more _____ than Hadoop

memory

map() operates on _____

the entire record

mapValues() operates on _____

the second component of the record (the value)
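
A plain-Python sketch of the contrast between the two operations (not Spark code; `records` and the helper `map_values` are hypothetical). In PySpark the corresponding methods are `rdd.map(f)` and `rdd.mapValues(f)`.

```python
# Hypothetical (key, value) records.
records = [("apple", 3), ("pear", 5)]

# map-style: the function sees the entire (key, value) record,
# so it may change the key as well as the value.
mapped = [(k.upper(), v + 1) for (k, v) in records]


# mapValues-style: the function sees only the value; keys are untouched.
def map_values(pairs, fn):
    return [(k, fn(v)) for k, v in pairs]


values_mapped = map_values(records, lambda v: v + 1)
print(mapped)
print(values_mapped)
```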

cache() = persist(_____)

persist(MEMORY_ONLY)
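
Why keeping a dataset in memory helps can be shown with a small plain-Python toy (not Spark's implementation). The class `CachedDataset` and the function `expensive` are hypothetical; the counter shows how often the data is recomputed across repeated actions.

```python
class CachedDataset:
    """Toy dataset whose contents come from an expensive function."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self.keep_in_memory = False   # loosely analogous to MEMORY_ONLY
        self.stored = None

    def cache(self):
        # Mark the dataset for in-memory reuse; nothing is computed here.
        self.keep_in_memory = True
        return self

    def collect(self):
        if self.stored is not None:
            return self.stored        # reuse the in-memory copy
        data = self.compute_fn()
        if self.keep_in_memory:
            self.stored = data        # keep it for later actions
        return data


runs = {"n": 0}

def expensive():
    runs["n"] += 1                    # count how often we recompute
    return [x * x for x in range(4)]


uncached = CachedDataset(expensive)
uncached.collect()
uncached.collect()
runs_without_cache = runs["n"]        # recomputed on every action

runs["n"] = 0
cached = CachedDataset(expensive).cache()
cached.collect()
cached.collect()
runs_with_cache = runs["n"]           # computed once, then reused
print(runs_without_cache, runs_with_cache)
```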

benefits of spark SQL

can execute SQL, with operator optimizations like an RDBMS
built on top of, and extends, RDDs
easy integration with other Spark libraries

spark streaming requirements

scalable to large clusters
second-scale latencies
simple programming model
integrated with batch & interactive processing
efficient fault tolerance

spark streaming motivation

many important applications process large streams of live data
they require large clusters to handle the workload
they require latencies of a few seconds

spark1 Flashcards

(42 cards)