3. Big Data Technologies Flashcards

1
Q

What is the Hadoop ecosystem, and how does it work?

A

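The Hadoop ecosystem is a set of open-source tools built around Apache Hadoop for the distributed storage and processing of large data sets across clusters of commodity machines. Its core components are HDFS for storage, YARN for resource management, and MapReduce for batch processing; tools such as Hive, HBase, and Pig build on top of them. Work is split into tasks that run in parallel on the nodes holding the data, using simple programming models.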
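
As a rough illustration of the "simple programming model" behind MapReduce, here is a single-process word count in plain Python. It is only a sketch: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this card and are not Hadoop APIs, and a real job would run the phases on many machines.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all counts for one word into a single total.
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data big clusters", "data moves to compute"])
print(counts)
```

The programmer writes only the map and reduce functions; splitting the input, grouping by key, and retrying failed tasks are left to the framework.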

2
Q

How does Apache Spark differ from MapReduce?

A

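Apache Spark keeps intermediate results in memory where possible, while MapReduce writes them to disk between the map and reduce stages of every job. This makes Spark considerably faster for iterative and interactive workloads; the gap is smaller for single-pass batch jobs.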

3
Q

Explain how Spark’s DAG (Directed Acyclic Graph) works.

A

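Spark builds a DAG of the transformations applied to a dataset and executes nothing until an action is called. The DAG scheduler then splits the graph into stages at shuffle boundaries and runs each stage as parallel tasks, which allows optimizations such as pipelining narrow transformations. The same lineage provides fault tolerance: a lost partition is recomputed from its parent data rather than restored from a copy.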
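
The idea of recording a plan first and executing it only on an action can be sketched with a toy class. `LazyDataset` is a hypothetical stand-in written for this card, not Spark's RDD API, and it models only a linear chain of steps (the simplest possible DAG).

```python
class LazyDataset:
    """Toy stand-in for an RDD: records its lineage instead of computing eagerly."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)  # lineage: ordered (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new plan, computes nothing.
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new plan, computes nothing.
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def collect(self):
        # Action: only now is the recorded plan executed, one pass per record.
        out = []
        for record in self.source:
            keep = True
            for kind, fn in self.steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out

plan = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = plan.collect()
print(result)
```

Because `plan` holds the full recipe, calling `collect()` again rebuilds the result from the source, which mirrors how a lost partition can be recomputed from lineage.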

4
Q

What is HDFS, and how does it achieve fault tolerance?

A

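HDFS (Hadoop Distributed File System) splits large files into blocks and stores them across the DataNodes of a cluster, while a NameNode tracks where each block lives. It achieves fault tolerance by replicating each block, three times by default, so that when a node fails the data is read from another replica and the missing copies are re-created elsewhere.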
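
A small simulation can show why replication keeps data readable after node failures. This is a simplified sketch: the helpers `place_blocks` and `readable_blocks` and the node names are invented for illustration, and the round-robin placement ignores the rack awareness that real HDFS uses.

```python
def place_blocks(num_blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes, round-robin.
    # Assumption: real HDFS placement is rack-aware; this is a simplification.
    placement = {}
    for block in range(num_blocks):
        start = block % len(nodes)
        placement[block] = [nodes[(start + i) % len(nodes)] for i in range(replication)]
    return placement

def readable_blocks(placement, failed):
    # A block stays readable while at least one replica sits on a live node.
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    ]

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
placement = place_blocks(num_blocks=8, nodes=nodes)

# With three replicas per block, any two node failures leave every block readable.
survivors = readable_blocks(placement, failed={"dn1", "dn2"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

Losing all three nodes that hold a block's replicas does lose that block, which is why HDFS re-replicates promptly after a failure.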

5
Q

How do you optimize Spark jobs for performance?

A

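Common ways to optimize Spark jobs: use efficient columnar formats such as Parquet or ORC; cache or persist data that is reused across actions; reduce shuffles, for example with broadcast joins or by preferring reduceByKey to groupByKey; choose a sensible partition count; and tune configuration such as executor memory, executor cores, and spark.sql.shuffle.partitions.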

6
Q

Explain the difference between RDD, DataFrame, and Dataset in Spark.

A

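RDD (Resilient Distributed Dataset) is the low-level abstraction: an immutable distributed collection of objects with no schema and no query optimization. A DataFrame is a higher-level abstraction for structured data organized into named columns and optimized by the Catalyst optimizer, but its column references are checked only at run time. A Dataset combines the benefits of both, offering the same optimizations with compile-time type safety; the typed Dataset API is available in Scala and Java only.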

7
Q

How does partitioning work in Spark?

A

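Spark splits a dataset into partitions, each processed by one task, so the partition count determines how much of the work can run in parallel. Key-based data is assigned to partitions by a partitioner (hash or range), and repartition() or coalesce() changes the count. Too few partitions underuse the cluster; too many add scheduling overhead.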
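
Hash partitioning can be sketched in a few lines of plain Python. The helpers `partition_for` and `partition_records` are hypothetical names for this card; the sketch uses CRC32 as its hash, which is an assumption made for determinism and not the hash function Spark itself uses.

```python
import zlib

def partition_for(key, num_partitions):
    # Deterministic hash of the key, reduced modulo the partition count.
    # (Python's built-in hash() of a str is salted per process, so CRC32 is used.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_records(records, num_partitions):
    # Route every (key, value) pair to the partition chosen by its key.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions

records = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
partitions = partition_records(records, num_partitions=3)

for index, part in enumerate(partitions):
    print(index, part)
```

All records with the same key land in the same partition, so a per-key aggregation can then run on each partition independently.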

8
Q

What is the role of Apache Kafka in a data pipeline?

A

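Apache Kafka is a distributed messaging and event streaming system that sits between the producers and consumers of data in a pipeline. Producers append messages to topics, which are partitioned and replicated across brokers; consumers read at their own pace by tracking offsets. This decouples the systems, buffers bursts of data, and enables real-time stream processing.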
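
The log-and-offset model can be illustrated with a toy in-memory class. `TopicPartition` here is a hypothetical class written for this card, not the Kafka client API; for brevity it stores each group's offset inside the log object, whereas real consumers commit offsets to the cluster.

```python
class TopicPartition:
    """Toy append-only log; each consumer group tracks its own read offset."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, message):
        # Append to the end of the log and return the message's offset.
        self.log.append(message)
        return len(self.log) - 1

    def consume(self, group, max_messages=10):
        # Read from where this group left off; messages are not deleted.
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

topic = TopicPartition()
for event in ["click", "view", "purchase"]:
    topic.produce(event)

analytics_batch = topic.consume("analytics", max_messages=2)
billing_batch = topic.consume("billing")
analytics_rest = topic.consume("analytics")
print(analytics_batch, billing_batch, analytics_rest)
```

Because reading does not remove messages, several independent consumers can process the same stream, each at its own speed.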

9
Q

How does data shuffling impact Spark performance?

A

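A shuffle redistributes data across partitions, which wide transformations such as groupByKey, join, and repartition require. It is expensive because records are serialized, written to disk, and sent over the network between executors, so shuffles often dominate a job’s run time. Reducing the amount of data shuffled, for example by filtering early or aggregating before the shuffle, is a main lever for performance.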
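
To see why aggregating before a shuffle helps, the toy below counts how many records would cross the network with and without local pre-aggregation (the idea behind preferring reduceByKey to groupByKey). The function names and the sample data are made up for this sketch, and it counts records only; it does not model serialization or disk I/O.

```python
def shuffled_without_combine(partitions):
    # Every (key, value) pair is sent across the network to its reducer.
    return sum(len(part) for part in partitions)

def shuffled_with_combine(partitions):
    # Each partition pre-aggregates locally, so it sends one pair per distinct key.
    return sum(len({key for key, _ in part}) for part in partitions)

partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("b", 1)],
]

naive = shuffled_without_combine(partitions)
combined = shuffled_with_combine(partitions)
print(naive, "records shuffled without combining,", combined, "with combining")
```

The saving grows with the number of repeated keys per partition, so it is largest for low-cardinality keys.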

10
Q

What are the advantages of using Parquet or ORC formats for big data?

A

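Parquet and ORC are columnar file formats. Storing values column by column enables efficient compression and encoding schemes, such as dictionary and run-length encoding, which reduce storage costs. Query performance also improves because a query reads only the columns it needs and can skip blocks of data using stored min/max statistics.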
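
The benefit of a columnar layout can be sketched without any file format library. The example below is an in-memory illustration only: `run_length_encode` and the sample rows are invented for this card, and it does not read or write real Parquet or ORC files.

```python
def run_length_encode(values):
    # Collapse consecutive repeats into [value, count] pairs.
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

rows = [
    {"country": "US", "amount": 10},
    {"country": "US", "amount": 20},
    {"country": "US", "amount": 5},
    {"country": "DE", "amount": 7},
    {"country": "DE", "amount": 3},
]

# Columnar layout: one list per column instead of one record per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Similar values sit next to each other, so the column compresses well.
encoded_country = run_length_encode(columns["country"])

# A query on one column touches only that column's data.
total_amount = sum(columns["amount"])
print(encoded_country, total_amount)
```

Five country values shrink to two pairs, and the sum never touches the country column at all.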
