Overview 2 Flashcards

1
Q

Tell me about yourself (VERY common - you should have a 2-3 sentence answer ready for this: Where did you graduate? What did you study? What work have you done so far?)

A

I graduated with a degree in Computer Science from [University Name]. Over the past few years, I’ve worked on projects involving data analysis, machine learning, and big data processing. I have hands-on experience with technologies like Spark, Kafka, and Hadoop, and I’m excited to continue building my skills in this field.

2
Q

How do you handle team conflict?

A

I believe in open communication and understanding all perspectives. I listen carefully to the concerns of team members, seek common ground, and facilitate a solution that works for everyone. It’s important to remain calm and professional, even when disagreements arise.

3
Q

Do you see yourself as a leader of a team? (there’s no wrong answer to this besides something like ‘yes I’m better than everyone’ or ‘no I hate people’)

A

I believe I can step up as a leader when needed. I value collaboration and support my teammates in achieving goals. I see leadership as guiding and motivating others, rather than being in control.

4
Q

Tell me something interesting about your schooling.

A

During my schooling, I worked on a research project where we analyzed large datasets using Spark. It was interesting because we learned how to optimize data workflows, which helped me understand the importance of performance tuning in distributed systems.

5
Q

What conflicts/challenges happened in any of your 3 projects?

A

In one of my projects, we faced challenges related to data inconsistencies and integration. To overcome this, we developed a robust data validation pipeline to ensure accuracy across the system.

6
Q

Are you willing to learn different programming languages?

A

Absolutely! I’m always eager to expand my skill set. I believe that learning new languages can open up different perspectives on solving problems.

7
Q

Describe what part of your most recent project you worked on and how you went about designing it.

A

In my most recent project, I worked on designing a scalable data pipeline using Apache Kafka and Spark. I focused on optimizing data ingestion and processing by using partitioning strategies in Kafka and Spark’s built-in transformations.

8
Q

Describe a moment where you failed.

A

In a previous project, I underestimated the complexity of data transformation. This caused delays, but I learned the importance of thorough planning, especially when working with large datasets.

9
Q

What makes you excited or passionate about Big Data?

A

I’m passionate about Big Data because it has the power to provide insights that drive decisions. The ability to work with vast datasets and derive meaningful results is what excites me most.

10
Q

How to count the number of times a word appears in a text file? (was required to write code to solve this)

A

You can count the number of times a word appears by reading the file, splitting its contents into words, and counting the matches; for a single word, text.count('word') on the split list is enough. A fuller sketch follows.
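A minimal sketch, assuming the filename file.txt and the target word 'word' are placeholders:

from collections import Counter

with open('file.txt', 'r') as f:
    words = f.read().lower().split()   # read the file and split on whitespace

counts = Counter(words)                # maps word -> number of occurrences
print(counts['word'])                  # occurrences of one specific word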

11
Q

What is an RDD?

A

An RDD (Resilient Distributed Dataset) is a fundamental data structure in Apache Spark. It is an immutable distributed collection of objects that can be processed in parallel.

12
Q

How to create a DataFrame in Spark?

A

You can create a DataFrame in Spark using the following code: df = spark.read.csv('file.csv', header=True, inferSchema=True).

13
Q

How to create an RDD?

A

You can create an RDD by parallelizing a collection: rdd = sc.parallelize([1, 2, 3, 4]).

14
Q

How to create the schema of a DataFrame in Spark?

A

You can define a schema using StructType and StructField from pyspark.sql.types and pass it to the reader, as sketched below.
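A minimal sketch, assuming an active SparkSession named spark and a placeholder file.csv:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StringType(), True),   # nullable string column
    StructField('age', IntegerType(), True),   # nullable integer column
])

df = spark.read.schema(schema).csv('file.csv')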

15
Q

What are the different types of joins in Spark?

A

Spark supports inner, full outer, left outer, right outer, left semi, left anti, and cross joins.

16
Q

How to do a join?

A

To perform a join, you can use the .join() function: df1.join(df2, df1.id == df2.id). By default this is an inner join; a third argument such as 'left_outer' selects the join type.

17
Q

How did you use Kafka in your producer program?

A

In my producer program, I used Kafka’s KafkaProducer to publish messages to a topic. I configured it with the Kafka broker details and serialized messages to be sent.
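A minimal sketch, assuming the kafka-python library, a local broker on localhost:9092, and a hypothetical topic named events:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),  # serialize dicts to JSON bytes
)

producer.send('events', {'user_id': 42, 'action': 'click'})   # publish one message to the topic
producer.flush()                                              # block until buffered messages are delivered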

18
Q

How did you use Kafka in your consumer program?

A

In the consumer program, I used Kafka’s KafkaConsumer to subscribe to a topic and consume messages in real-time. I also set up appropriate deserialization for message formats.
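A matching sketch, again assuming kafka-python and the hypothetical events topic:

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'events',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda b: json.loads(b.decode('utf-8')),  # turn JSON bytes back into dicts
    auto_offset_reset='earliest',                                # start from the beginning if no offset exists
)

for message in consumer:          # blocks and yields messages as they arrive
    print(message.value)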

19
Q

If your Kafka consumer stops running, what would you do?

A

If the Kafka consumer stops, I would first check for issues like connection problems, message deserialization errors, or resource constraints. I’d review logs and restart the consumer as needed.

20
Q

SQL Query for Top Sales Employees by Department

A

SELECT department, employee_name, SUM(sales) AS total_sales FROM sales_table GROUP BY department, employee_name ORDER BY total_sales DESC LIMIT 5 returns the top five employees overall; to get the top employee within each department, rank with a window function, as sketched below.
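A sketch using a window function, assuming the same hypothetical sales_table with department, employee_name, and sales columns:

SELECT department, employee_name, total_sales
FROM (
    SELECT department,
           employee_name,
           SUM(sales) AS total_sales,
           RANK() OVER (PARTITION BY department ORDER BY SUM(sales) DESC) AS sales_rank
    FROM sales_table
    GROUP BY department, employee_name
) ranked
WHERE sales_rank = 1;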

21
Q

How did you use Hive in your projects?

A

I used Hive to store and query large datasets in a distributed environment. It allowed me to perform SQL-like queries on data stored in HDFS, which was very useful for data analysis.

22
Q

How did you use HDFS?

A

In my projects, I used HDFS to store large volumes of data, which allowed for scalable and distributed processing. I interacted with it using Spark and Hadoop tools for data reading and writing.

23
Q

What do we use spark-submit for?

A

We use spark-submit to submit Spark applications to a cluster for execution. It handles resource allocation and job execution across multiple nodes.

24
Q

How to read a CSV file in Spark.

A

You can read a CSV file in Spark with spark.read.csv('file.csv', header=True, inferSchema=True).

25
Q

How do we use Kafka in our projects?

A

In my projects, we use Kafka as a messaging system for real-time data streaming. It helps with ingesting data into the system and enables communication between different services.

26
Q

Describe the most recent project.

A

In my most recent project, we built a real-time data pipeline using Kafka, Spark, and HDFS. We ingested data from various sources, processed it with Spark, and stored the results in HDFS for further analysis.

27
Q

What is your tech stack/ what projects have you worked on?

A

My tech stack includes Spark, Hadoop, Kafka, Hive, and Python. I’ve worked on projects involving real-time data processing, ETL pipelines, and data warehousing.

28
Q

Transformations and Actions in Spark

A

Transformations in Spark are lazy operations, such as map() and filter(), which create a new RDD. Actions are operations like collect() and count() that trigger execution.
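A small PySpark sketch, assuming an existing SparkContext named sc:

rdd = sc.parallelize([1, 2, 3, 4, 5])

doubled = rdd.map(lambda x: x * 2)       # transformation: lazy, nothing runs yet
large = doubled.filter(lambda x: x > 4)  # transformation: still lazy

print(large.count())                     # action: triggers execution of the whole chain
print(large.collect())                   # action: returns the results to the driver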

29
Q

Different Hive table types? (Managed vs External)

A

Managed (internal) tables are fully controlled by Hive: it manages both the metadata and the data in its warehouse directory, and dropping the table deletes the data. External tables only have their metadata managed by Hive; the data lives at a location you specify (e.g. in HDFS), and dropping the table leaves the underlying files in place.
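A HiveQL sketch, with hypothetical table and path names:

-- Managed table: Hive owns the data under its warehouse directory
CREATE TABLE employees_managed (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive only tracks metadata; data stays at the given location
CREATE EXTERNAL TABLE employees_external (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employees';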

30
Q

How to load a CSV from HDFS to Hive Table?

A

You can load data from a CSV file in HDFS to a Hive table using the following command: LOAD DATA INPATH '/path/to/csv' INTO TABLE tablename.

31
Q

Reversing a string in Python.

A

You can reverse a string in Python with slicing: reversed_string = string[::-1].

32
Q

Right Join in SQL

A

A right join in SQL returns all rows from the right table and matched rows from the left table. If there’s no match, NULL values are returned for the left table columns.

33
Q

HDFS commands to load into Hive table and copy to HDFS.*

A

You can load data into a Hive table with: LOAD DATA INPATH '/path/to/data' INTO TABLE table_name. To copy data to HDFS: hadoop fs -copyFromLocal /local/path /hdfs/path.

34
Q

Transformations vs Actions?

A

Transformations are operations that return a new RDD and are lazily evaluated. Actions trigger the execution of the pipeline and return a result.

35
Q

What is the map function?

A

The map() function applies a given function to each element in an RDD and returns a new RDD with the transformed elements.

36
Q

What do you know about Spark?*

A

Apache Spark is an open-source, distributed computing system that provides high-speed processing of large datasets. It supports in-memory processing, which makes it faster than Hadoop MapReduce.

37
Q

What is a producer?

A

A producer in Kafka is responsible for sending data to topics in a Kafka cluster. It pushes data streams into Kafka for consumers to read.

38
Q

What are joins in SQL?

A

Joins in SQL combine rows from two or more tables based on a related column. Types of joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.

39
Q

What is Hadoop?

A

Hadoop is an open-source framework for processing and storing large datasets in a distributed computing environment. Its core components are HDFS (Hadoop Distributed File System) for storage, MapReduce for processing, and YARN for resource management.

40
Q

What is HDFS?

A

HDFS (Hadoop Distributed File System) is the storage system of Hadoop that enables distributed storage of large datasets across multiple nodes.

41
Q

Explain MapReduce.*

A

MapReduce is a programming model for processing large datasets in parallel. The map step transforms input records into intermediate key-value pairs, the framework shuffles and groups the pairs by key, and the reduce step aggregates the values for each key.
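A word-count sketch in the style of Hadoop Streaming, where a Python mapper and a Python reducer read from stdin and write to stdout (script names are hypothetical):

# mapper.py - emit a (word, 1) pair for every word in the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - input arrives sorted by key, so counts can be summed per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split('\t')
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")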

42
Q

Count the characters in a string in Python.

A

You can count characters in a string with len(string).

43
Q

How do you rename a column in a Spark dataframe?

A

You can rename a column using the .withColumnRenamed() method: df = df.withColumnRenamed('old_name', 'new_name').

44
Q

How do you read from a text file in Spark? How do you include headers from a CSV file?

A

You can read a text file in Spark with: df = spark.read.text('file.txt'). For CSV with headers: df = spark.read.option('header', 'true').csv('file.csv').

45
Q

How would you set the names of columns?*

A

You can set the names of columns by passing a list of column names: df = df.toDF('col1', 'col2', 'col3').

46
Q

How would you drop a specific column if it exists?

A

You can drop a column with the .drop() method: df = df.drop('column_name'). In PySpark, .drop() silently ignores a column that isn't present, but you can also check for it explicitly first, as sketched below.
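A minimal sketch of the explicit check (column_name is a placeholder):

if 'column_name' in df.columns:    # df.columns is a plain Python list of column names
    df = df.drop('column_name')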

47
Q

Difference between Hive and SQL?

A

Hive is a data warehouse built on top of Hadoop, allowing SQL-like queries on large datasets stored in HDFS. SQL, on the other hand, is a language for managing relational databases.

48
Q

What is partitioning and bucketing in Hive?

A

Partitioning in Hive splits a table into separate directories based on the values of one or more partition columns, so queries that filter on those columns can skip whole partitions. Bucketing splits the data within a table (or partition) into a fixed number of files based on a hash of a column, which helps with sampling and joins.
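A HiveQL sketch with hypothetical column names:

CREATE TABLE sales (
    order_id    INT,
    customer_id INT,
    amount      DOUBLE
)
PARTITIONED BY (sale_year INT)              -- one directory per year
CLUSTERED BY (customer_id) INTO 32 BUCKETS  -- hash customer_id into 32 files
STORED AS ORC;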

49
Q

Spark vs Hive?

A

Spark is a distributed processing engine that can process data stored in HDFS, while Hive is a data warehouse system built on top of Hadoop for querying large datasets using a SQL-like interface.

50
Q

How do you add a column to a DF?

A

You can add a column to a DataFrame using .withColumn(): df = df.withColumn('new_column', expr).
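A minimal sketch, assuming hypothetical price and quantity columns:

from pyspark.sql.functions import col, lit

df = df.withColumn('total', col('price') * col('quantity'))  # derived column from an expression
df = df.withColumn('source', lit('batch'))                   # constant-valued column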

51
Q

Difference between an RDD/DS/DF?

A

RDD (Resilient Distributed Dataset) is Spark's low-level, untyped distributed collection. A DataFrame (DF) is a higher-level abstraction with named columns and a schema, which lets Spark optimize queries. A Dataset (DS) is like a DataFrame but adds compile-time type safety; it is available in Scala and Java, and in Python a DataFrame is effectively a Dataset of Row.

52
Q

How to check if a column already exists in a DF?*

A

In PySpark, df.columns is a plain Python list, so you can check with 'column_name' in df.columns (in Scala you would use df.columns.contains("column_name")).

53
Q

Explain Hive Structure?*

A

Hive organizes data into databases, tables, partitions, and buckets. Table and partition metadata is kept in the metastore, while the data itself lives in HDFS (or another file system). Hive uses a schema-on-read approach, meaning the schema is applied when the data is read rather than when it is loaded.

54
Q

How to submit a job to Spark and what parameters are needed?*

A

You submit a job using spark-submit. The key parameters include --master (where to run, e.g. yarn or local), --deploy-mode, the main class for JVM applications (--class), resource settings such as executor memory and cores, and the application JAR or Python file followed by its arguments.
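An example command for a hypothetical PySpark job my_job.py on YARN (paths and resource sizes are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_job.py /data/input /data/output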

55
Q

What is lazy evaluation?

A

Lazy evaluation in Spark means transformations are not executed until an action is called. This allows Spark to optimize the execution plan.

56
Q

How Spark SQL works in the background?*

A

Spark SQL parses a query into a logical plan, the Catalyst optimizer rewrites it and chooses a physical plan, and the Tungsten engine handles memory management and code generation; the final plan is then executed as distributed operations across the cluster.

57
Q

Difference between Drop/Truncate?

A

DROP removes a table and its data from the database, while TRUNCATE removes all rows but leaves the table structure intact.

58
Q

Difference between Group By/Sort By?*

A

GROUP BY groups rows that share a value so aggregates can be computed per group. SORT BY (in Hive) sorts the data within each reducer, so the output is only partially ordered; ORDER BY is what imposes a total order.

59
Q

What is the difference between a Map and a Set?

A

A Map stores key-value pairs, while a Set stores unique values without duplicates.

60
Q

What different file types are you familiar with?

A

I am familiar with text files, CSV, JSON, Parquet, Avro, and ORC files.

61
Q

How would you write and read a JSON in Python?

A

You can write a JSON file with: json.dump(data, file), and read it with: data = json.load(file).
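A minimal sketch, assuming a hypothetical data.json file:

import json

data = {'name': 'Alice', 'age': 30}

with open('data.json', 'w') as f:   # write (serialize) to a file
    json.dump(data, f)

with open('data.json', 'r') as f:   # read (deserialize) back into a dict
    data = json.load(f)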

62
Q

How would you count the number of occurrences in a string in Python?

A

You can count occurrences with: string.count('substring').

63
Q

What is DAG in Spark?

A

DAG (Directed Acyclic Graph) in Spark represents the sequence of computations to be performed, where the vertices are RDDs and the edges are the transformations that produce them; Spark uses the DAG to split a job into stages and schedule execution.

64
Q

What are the components of KAFKA?

A

Kafka consists of producers, brokers, consumers, topics (divided into partitions), and ZooKeeper for managing cluster metadata (newer Kafka versions can replace ZooKeeper with KRaft).

65
Q

How would you start KAFKA, create a topic, consumer, and producer?

A

In a ZooKeeper-based setup, first start ZooKeeper, then the broker:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic:
bin/kafka-topics.sh --create --topic topic_name --bootstrap-server localhost:9092
Start a console producer:
bin/kafka-console-producer.sh --topic topic_name --bootstrap-server localhost:9092
Start a console consumer:
bin/kafka-console-consumer.sh --topic topic_name --bootstrap-server localhost:9092