Overview 2 Flashcards

1
Q

Tell me about yourself (VERY common - you should have a 2-3 sentence answer ready for this: Where did you graduate? What did you study? What work have you done so far?)

A

I graduated with a degree in Computer Science from [University Name]. Over the past few years, I’ve worked on projects involving data analysis, machine learning, and big data processing. I have hands-on experience with technologies like Spark, Kafka, and Hadoop, and I’m excited to continue building my skills in this field.

2
Q

How do you handle team conflict?

A

I believe in open communication and understanding all perspectives. I listen carefully to the concerns of team members, seek common ground, and facilitate a solution that works for everyone. It’s important to remain calm and professional, even when disagreements arise.

3
Q

Do you see yourself as a leader of a team? (there’s no wrong answer to this besides something like ‘yes I’m better than everyone’ or ‘no I hate people’)

A

I believe I can step up as a leader when needed. I value collaboration and support my teammates in achieving goals. I see leadership as guiding and motivating others, rather than being in control.

4
Q

Tell me something interesting about your schooling.

A

During my schooling, I worked on a research project where we analyzed large datasets using Spark. It was interesting because we learned how to optimize data workflows, which helped me understand the importance of performance tuning in distributed systems.

5
Q

What conflicts/challenges happened in any of your 3 projects?

A

In one of my projects, we faced challenges related to data inconsistencies and integration. To overcome this, we developed a robust data validation pipeline to ensure accuracy across the system.

6
Q

Are you willing to learn different programming languages?

A

Absolutely! I’m always eager to expand my skill set. I believe that learning new languages can open up different perspectives on solving problems.

7
Q

Describe what part of your most recent project you worked on and how you went about designing it.

A

In my most recent project, I worked on designing a scalable data pipeline using Apache Kafka and Spark. I focused on optimizing data ingestion and processing by using partitioning strategies in Kafka and Spark’s built-in transformations.

8
Q

Describe a moment where you failed.

A

In a previous project, I underestimated the complexity of data transformation. This caused delays, but I learned the importance of thorough planning, especially when working with large datasets.

9
Q

What makes you excited or passionate about Big Data?

A

I’m passionate about Big Data because it has the power to provide insights that drive decisions. The ability to work with vast datasets and derive meaningful results is what excites me most.

10
Q

How to count the number of times a word appears in a text file? (was required to write code to solve this)

A

You can count the number of times a word appears by reading the file, splitting its contents into words, and counting the matches; for a single word, text.count('word') on the split list is enough. A fuller sketch follows.
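A minimal sketch, assuming the filename file.txt and the target word 'word' are placeholders:

from collections import Counter

with open('file.txt', 'r') as f:
    words = f.read().lower().split()   # read the file and split on whitespace

counts = Counter(words)                # maps word -> number of occurrences
print(counts['word'])                  # occurrences of one specific word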

11
Q

What is an RDD?

A

An RDD (Resilient Distributed Dataset) is a fundamental data structure in Apache Spark. It is an immutable distributed collection of objects that can be processed in parallel.

12
Q

How to create a DataFrame in Spark?

A

You can create a DataFrame in Spark using the following code: df = spark.read.csv('file.csv', header=True, inferSchema=True).

13
Q

How to create an RDD?

A

You can create an RDD by parallelizing a collection: rdd = sc.parallelize([1, 2, 3, 4]).

14
Q

How to create the schema of a DataFrame in Spark?

A

You can define a schema using StructType and StructField from pyspark.sql.types and pass it to the reader, as sketched below.
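A minimal sketch, assuming an active SparkSession named spark and a placeholder file.csv:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('name', StringType(), True),   # nullable string column
    StructField('age', IntegerType(), True),   # nullable integer column
])

df = spark.read.schema(schema).csv('file.csv')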

15
Q

What are the different types of joins in Spark?

A

Spark supports inner, full outer, left outer, right outer, left semi, left anti, and cross joins.

16
Q

How to do a join?

A

To perform a join, you can use the .join() function: df1.join(df2, df1.id == df2.id). By default this is an inner join; a third argument such as 'left_outer' selects the join type.

17
Q

How did you use Kafka in your producer program?

A

In my producer program, I used Kafka’s KafkaProducer to publish messages to a topic. I configured it with the Kafka broker details and serialized messages to be sent.
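A minimal sketch, assuming the kafka-python library, a local broker on localhost:9092, and a hypothetical topic named events:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),  # serialize dicts to JSON bytes
)

producer.send('events', {'user_id': 42, 'action': 'click'})   # publish one message to the topic
producer.flush()                                              # block until buffered messages are delivered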

18
Q

How did you use Kafka in your consumer program?

A

In the consumer program, I used Kafka’s KafkaConsumer to subscribe to a topic and consume messages in real-time. I also set up appropriate deserialization for message formats.
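A matching sketch, again assuming kafka-python and the hypothetical events topic:

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'events',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda b: json.loads(b.decode('utf-8')),  # turn JSON bytes back into dicts
    auto_offset_reset='earliest',                                # start from the beginning if no offset exists
)

for message in consumer:          # blocks and yields messages as they arrive
    print(message.value)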

19
Q

If your Kafka consumer stops running, what would you do?

A

If the Kafka consumer stops, I would first check for issues like connection problems, message deserialization errors, or resource constraints. I’d review logs and restart the consumer as needed.

20
Q

SQL Query for Top Sales Employees by Department

A

SELECT department, employee_name, SUM(sales) AS total_sales FROM sales_table GROUP BY department, employee_name ORDER BY total_sales DESC LIMIT 5 returns the top five employees overall; to get the top employee within each department, rank with a window function, as sketched below.
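A sketch using a window function, assuming the same hypothetical sales_table with department, employee_name, and sales columns:

SELECT department, employee_name, total_sales
FROM (
    SELECT department,
           employee_name,
           SUM(sales) AS total_sales,
           RANK() OVER (PARTITION BY department ORDER BY SUM(sales) DESC) AS sales_rank
    FROM sales_table
    GROUP BY department, employee_name
) ranked
WHERE sales_rank = 1;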

21
Q

How did you use Hive in your projects?

A

I used Hive to store and query large datasets in a distributed environment. It allowed me to perform SQL-like queries on data stored in HDFS, which was very useful for data analysis.

22
Q

How did you use HDFS?

A

In my projects, I used HDFS to store large volumes of data, which allowed for scalable and distributed processing. I interacted with it using Spark and Hadoop tools for data reading and writing.

23
Q

What do we use spark-submit for?

A

We use spark-submit to submit Spark applications to a cluster for execution. It handles resource allocation and job execution across multiple nodes.

24
Q

How to read a CSV file in Spark.

A

You can read a CSV file in Spark with spark.read.csv('file.csv', header=True, inferSchema=True).

25
Q

How do we use Kafka in our projects?

A

In my projects, we use Kafka as a messaging system for real-time data streaming. It helps with ingesting data into the system and enables communication between different services.

26
Q

Describe the most recent project.

A

In my most recent project, we built a real-time data pipeline using Kafka, Spark, and HDFS. We ingested data from various sources, processed it with Spark, and stored the results in HDFS for further analysis.

27
Q

What is your tech stack/ what projects have you worked on?

A

My tech stack includes Spark, Hadoop, Kafka, Hive, and Python. I’ve worked on projects involving real-time data processing, ETL pipelines, and data warehousing.

28
Q

Transformations and Actions in Spark

A

Transformations in Spark are lazy operations, such as map() and filter(), which create a new RDD. Actions are operations like collect() and count() that trigger execution.
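A small PySpark sketch, assuming an existing SparkContext named sc:

rdd = sc.parallelize([1, 2, 3, 4, 5])

doubled = rdd.map(lambda x: x * 2)       # transformation: lazy, nothing runs yet
large = doubled.filter(lambda x: x > 4)  # transformation: still lazy

print(large.count())                     # action: triggers execution of the whole chain
print(large.collect())                   # action: returns the results to the driver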

29
Q

Different Hive table types? (Managed vs External)

A

Managed (internal) tables are fully controlled by Hive: it manages both the metadata and the data in its warehouse directory, and dropping the table deletes the data. External tables only have their metadata managed by Hive; the data lives at a location you specify (e.g. in HDFS), and dropping the table leaves the underlying files in place.
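A HiveQL sketch, with hypothetical table and path names:

-- Managed table: Hive owns the data under its warehouse directory
CREATE TABLE employees_managed (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive only tracks metadata; data stays at the given location
CREATE EXTERNAL TABLE employees_external (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employees';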

30
Q

How to load a CSV from HDFS to Hive Table?

A

You can load data from a CSV file in HDFS to a Hive table using the following command: LOAD DATA INPATH '/path/to/csv' INTO TABLE tablename.

31
Q

Reversing a string in Python.

A

You can reverse a string in Python with slicing: reversed_string = string[::-1].

32
Q

Right Join in SQL

A

A right join in SQL returns all rows from the right table and matched rows from the left table. If there’s no match, NULL values are returned for the left table columns.

33
Q

HDFS commands to load into Hive table and copy to HDFS.*

A

You can load data into a Hive table with: LOAD DATA INPATH '/path/to/data' INTO TABLE table_name. To copy data to HDFS: hadoop fs -copyFromLocal /local/path /hdfs/path.

34
Q

Transformations vs Actions?

A

Transformations are operations that return a new RDD and are lazily evaluated. Actions trigger the execution of the pipeline and return a result.

35
Q

What is the map function?

A

The map() function applies a given function to each element in an RDD and returns a new RDD with the transformed elements.

36
Q

What do you know about Spark?*

A

Apache Spark is an open-source, distributed computing system that provides high-speed processing of large datasets. It supports in-memory processing, which makes it faster than Hadoop MapReduce.

37
Q

What is a producer?

A

A producer in Kafka is responsible for sending data to topics in a Kafka cluster. It pushes data streams into Kafka for consumers to read.

38
Q

What are joins in SQL?

A

Joins in SQL combine rows from two or more tables based on a related column. Types of joins include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN.

39
Q

What is Hadoop?

A

Hadoop is an open-source framework for processing and storing large datasets in a distributed computing environment. Its core components are HDFS (Hadoop Distributed File System) for storage, MapReduce for processing, and YARN for resource management.

40
Q

What is HDFS?

A

HDFS (Hadoop Distributed File System) is the storage system of Hadoop that enables distributed storage of large datasets across multiple nodes.

41
Q

Explain MapReduce.*

A

MapReduce is a programming model for processing large datasets in parallel. The map step transforms input records into intermediate key-value pairs, the framework shuffles and groups the pairs by key, and the reduce step aggregates the values for each key.
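A word-count sketch in the style of Hadoop Streaming, where a Python mapper and a Python reducer read from stdin and write to stdout (script names are hypothetical):

# mapper.py - emit a (word, 1) pair for every word in the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - input arrives sorted by key, so counts can be summed per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split('\t')
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")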

42
Q

Count the characters in a string in Python.

A

You can count characters in a string with len(string).

43
Q

How do you rename a column in a Spark dataframe?

A

You can rename a column using the .withColumnRenamed() method: df = df.withColumnRenamed('old_name', 'new_name').

44
Q

How do you read from a text file in Spark? How do you include headers from a CSV file?

A

You can read a text file in Spark with: df = spark.read.text('file.txt'). For CSV with headers: df = spark.read.option('header', 'true').csv('file.csv').

45
Q

How would you set the names of columns?*

A

You can set the names of columns by passing a list of column names: df = df.toDF('col1', 'col2', 'col3').

46
Q

How would you drop a specific column if it exists?

A

You can drop a column with the .drop() method: df = df.drop('column_name'). In PySpark, .drop() silently ignores a column that isn't present, but you can also check for it explicitly first, as sketched below.
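A minimal sketch of the explicit check (column_name is a placeholder):

if 'column_name' in df.columns:    # df.columns is a plain Python list of column names
    df = df.drop('column_name')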

47
Q

Difference between Hive and SQL?

A

Hive is a data warehouse built on top of Hadoop, allowing SQL-like queries on large datasets stored in HDFS. SQL, on the other hand, is a language for managing relational databases.

48
Q

What is partitioning and bucketing in Hive?

A

Partitioning in Hive splits a table into separate directories based on the values of one or more partition columns, so queries that filter on those columns can skip whole partitions. Bucketing splits the data within a table (or partition) into a fixed number of files based on a hash of a column, which helps with sampling and joins.
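A HiveQL sketch with hypothetical column names:

CREATE TABLE sales (
    order_id    INT,
    customer_id INT,
    amount      DOUBLE
)
PARTITIONED BY (sale_year INT)              -- one directory per year
CLUSTERED BY (customer_id) INTO 32 BUCKETS  -- hash customer_id into 32 files
STORED AS ORC;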

49
Q

Spark vs Hive?

A

Spark is a distributed processing engine that can process data stored in HDFS, while Hive is a data warehouse system built on top of Hadoop for querying large datasets using a SQL-like interface.

50
Q

How do you add a column to a DF?

A

You can add a column to a DataFrame using .withColumn(): df = df.withColumn('new_column', expr).
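A minimal sketch, assuming hypothetical price and quantity columns:

from pyspark.sql.functions import col, lit

df = df.withColumn('total', col('price') * col('quantity'))  # derived column from an expression
df = df.withColumn('source', lit('batch'))                   # constant-valued column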

51
Q

Difference between an RDD/DS/DF?

A

RDD (Resilient Distributed Dataset) is Spark's low-level, untyped distributed collection. A DataFrame (DF) is a higher-level abstraction with named columns and a schema, which lets Spark optimize queries. A Dataset (DS) is like a DataFrame but adds compile-time type safety; it is available in Scala and Java, and in Python a DataFrame is effectively a Dataset of Row.

52
Q

How to check if a column already exists in a DF?*

A

In PySpark, df.columns is a plain Python list, so you can check with 'column_name' in df.columns (in Scala you would use df.columns.contains("column_name")).

53
Q

Explain Hive Structure?*

A

Hive organizes data into databases, tables, partitions, and buckets. Table and partition metadata is kept in the metastore, while the data itself lives in HDFS (or another file system). Hive uses a schema-on-read approach, meaning the schema is applied when the data is read rather than when it is loaded.

54
Q

How to submit a job to Spark and what parameters are needed?*

A

You submit a job using spark-submit. The key parameters include --master (where to run, e.g. yarn or local), --deploy-mode, the main class for JVM applications (--class), resource settings such as executor memory and cores, and the application JAR or Python file followed by its arguments.
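An example command for a hypothetical PySpark job my_job.py on YARN (paths and resource sizes are placeholders):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_job.py /data/input /data/output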

55
Q

What is lazy evaluation?

A

Lazy evaluation in Spark means transformations are not executed until an action is called. This allows Spark to optimize the execution plan.

56
Q

How Spark SQL works in the background?*

A

Spark SQL parses a query into a logical plan, the Catalyst optimizer rewrites it and chooses a physical plan, and the Tungsten engine handles memory management and code generation; the final plan is then executed as distributed operations across the cluster.

57
Q

Difference between Drop/Truncate?

A

DROP removes a table and its data from the database, while TRUNCATE removes all rows but leaves the table structure intact.

58
Q

Difference between Group By/Sort By?*

A

GROUP BY groups rows that share a value so aggregates can be computed per group. SORT BY (in Hive) sorts the data within each reducer, so the output is only partially ordered; ORDER BY is what imposes a total order.

59
Q

What is the difference between a Map and a Set?

A

A Map stores key-value pairs, while a Set stores unique values without duplicates.

60
Q

What different file types are you familiar with?

A

I am familiar with text files, CSV, JSON, Parquet, Avro, and ORC files.

61
Q

How would you write and read a JSON in Python?

A

You can write a JSON file with: json.dump(data, file), and read it with: data = json.load(file).
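A minimal sketch, assuming a hypothetical data.json file:

import json

data = {'name': 'Alice', 'age': 30}

with open('data.json', 'w') as f:   # write (serialize) to a file
    json.dump(data, f)

with open('data.json', 'r') as f:   # read (deserialize) back into a dict
    data = json.load(f)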

62
Q

How would you count the number of occurrences in a string in Python?

A

You can count occurrences with: string.count('substring').

63
Q

What is DAG in Spark?

A

DAG (Directed Acyclic Graph) in Spark represents the sequence of computations to be performed, where the vertices are RDDs and the edges are the transformations that produce them; Spark uses the DAG to split a job into stages and schedule execution.

64
Q

What are the components of KAFKA?

A

Kafka consists of producers, brokers, consumers, topics (divided into partitions), and ZooKeeper for managing cluster metadata (newer Kafka versions can replace ZooKeeper with KRaft).

65
Q

How would you start KAFKA, create a topic, consumer, and producer?

A

In a ZooKeeper-based setup, first start ZooKeeper, then the broker:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic:
bin/kafka-topics.sh --create --topic topic_name --bootstrap-server localhost:9092
Start a console producer:
bin/kafka-console-producer.sh --topic topic_name --bootstrap-server localhost:9092
Start a console consumer:
bin/kafka-console-consumer.sh --topic topic_name --bootstrap-server localhost:9092