Questions Flashcards
What are the differences between star schema and snowflake schema in data warehousing?
Star schema and snowflake schema are two types of dimensional modeling techniques used in data warehousing. Star schema has a central fact table that connects to multiple dimension tables, each representing a single attribute or entity. Snowflake schema is a variation of star schema that normalizes the dimension tables into multiple levels of hierarchy. Star schema is simpler and faster to query, but snowflake schema is more normalized and reduces data redundancy.
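The fact-table/dimension-table layout of a star schema can be sketched with an in-memory SQLite database (table and column names here are hypothetical, chosen only for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: one row per product / store attribute set.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT)")
# Central fact table: foreign keys to every dimension plus a measure.
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, store_id INTEGER, amount REAL)")

cur.execute("INSERT INTO dim_product VALUES (1, 'widget')")
cur.execute("INSERT INTO dim_store VALUES (10, 'Berlin')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 10, 5.0), (1, 10, 7.5)])

# The classic star-schema query: join the fact table to its dimensions
# and aggregate. A snowflake schema would add further joins from each
# dimension out to its normalized sub-tables.
cur.execute("""
    SELECT p.name, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_store   s ON f.store_id   = s.store_id
    GROUP BY p.name, s.city
""")
rows = cur.fetchall()
print(rows)  # [('widget', 'Berlin', 12.5)]
```

The single level of joins is why star schemas are simpler to query; normalizing `dim_product` into, say, product and category tables is what turns this into a snowflake.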
What are the pros and cons of Hadoop for big data processing?
Hadoop is an open-source framework for distributed storage and processing of large-scale data sets. Pros: it can handle various data types, scale easily, and be cost-effective. Cons: it has a steep learning curve, high latency, and high maintenance costs for administration and security.

What are some common data quality issues?
- Missing, incomplete, or incorrect data
- Duplicate or redundant data
- Inconsistent or incompatible data formats or standards
- Outdated or irrelevant data
- Data that does not comply with business rules or regulations
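Several of these issues can be detected with a simple profiling pass. A minimal sketch in plain Python, using hypothetical customer records and field names:

```python
from collections import Counter

# Hypothetical records; the fields and values are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "country": "DE"},
    {"id": 2, "email": None,            "country": "de"},  # missing email, inconsistent case
    {"id": 3, "email": "c@example.com", "country": "US"},
    {"id": 3, "email": "c@example.com", "country": "US"},  # exact duplicate
]

# Missing or incomplete data: rows where any field is None or empty.
missing = [r for r in records if any(v in (None, "") for v in r.values())]

# Duplicate or redundant data: ids that appear more than once.
id_counts = Counter(r["id"] for r in records)
duplicates = [i for i, n in id_counts.items() if n > 1]

# Inconsistent formats: country codes not in the canonical upper case.
inconsistent = [r for r in records if r["country"] != r["country"].upper()]

print(len(missing), duplicates, len(inconsistent))  # 1 [3] 1
```

Real data profiling tools apply the same idea (per-column null counts, uniqueness checks, format conformance) at scale.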
How to handle Data Quality issues?
- Define the data quality criteria and metrics
- Perform data profiling and auditing to identify the issues
- Implement data cleansing and validation techniques
- Monitor and report the data quality status and improvement
- Establish data governance and security policies
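A cleansing-and-validation step from the list above might look like the following sketch. The specific rules (email must contain "@", country is a two-letter upper-case code) are hypothetical stand-ins for real business rules:

```python
def clean(record):
    """Cleansing: normalize formats (here, trim and upper-case the country code)."""
    record = dict(record)
    if record.get("country"):
        record["country"] = record["country"].strip().upper()
    return record

def is_valid(record):
    """Validation: check the record against the defined quality criteria."""
    email = record.get("email") or ""
    country = record.get("country") or ""
    return "@" in email and len(country) == 2 and country.isalpha()

raw = [
    {"email": "a@example.com", "country": " de "},  # cleanable
    {"email": "not-an-email",  "country": "US"},    # fails validation
    {"email": None,            "country": "US"},    # missing value
]

cleaned = [clean(r) for r in raw]
valid = [r for r in cleaned if is_valid(r)]
rejected = [r for r in cleaned if not is_valid(r)]

# The valid/rejected counts feed the monitoring and reporting step.
print(len(valid), len(rejected))  # 1 2
```

Rejected records would typically be quarantined and reported rather than silently dropped, which is where the governance policies come in.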
What are some of the benefits of working with cloud-based data platforms?
- Scalability: Cloud platforms can easily adjust to the changing data volume and demand.
- Availability: Cloud platforms can provide high availability and reliability by replicating the data across multiple locations and servers.
- Cost-effectiveness: Cloud platforms can reduce the upfront and maintenance costs of owning and operating physical infrastructure and software.
What are some of the challenges working with cloud-based data platforms?
- Integration: Cloud platforms can have integration challenges with existing on-premise systems or other cloud providers.
- Hidden costs: usage-based pricing for compute, storage, and data egress can be hard to predict.
- Vendor lock-in: migrating data and workloads away from a provider's proprietary infrastructure can be difficult and expensive.
What is Spark Driver?
The process that runs the main() method of the Spark application and creates the SparkSession object. It is responsible for coordinating the execution of tasks across the Spark cluster.
What is SparkSession?
The entry point to the Spark application. It provides access to the Spark functionality, such as creating and manipulating RDDs, DataFrames, Datasets, and Spark SQL.
What is Spark Cluster Manager?
The component that manages the allocation and release of resources across the Spark cluster. It can be one of the following: Standalone, YARN, Mesos, or Kubernetes.
What is Spark Executor?
The process that runs on each worker node in the cluster and executes the tasks assigned by the driver. It also stores the data in memory or disk.
What is Spark Task?
The unit of work that is sent to the executor by the driver. It is a computation on a partition of data.
What is Spark Job?
A parallel computation that consists of multiple tasks that are triggered by an action on an RDD, DataFrame, or Dataset.
What is Spark Stage?
A set of tasks within a job that can be executed in parallel. A stage is divided by shuffle boundaries, which are operations that require data movement across executors.
What is Spark RDD?
Resilient Distributed Dataset: The original and low-level abstraction in Spark. It is an immutable collection of objects that can be partitioned across the cluster and operated on in parallel. It supports two types of operations: transformations and actions. It provides fault-tolerance by maintaining lineage information. It does not have any schema or optimization information.
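The transformation/action distinction can be illustrated without a Spark cluster. This is a pure-Python analogy, not actual Spark code: generators build up a lazy pipeline the way RDD transformations do, and nothing executes until an "action" consumes the result.

```python
# Pure-Python analogy (not PySpark): generators are lazy like RDD
# transformations; no work happens until an action consumes them.
data = range(1, 6)                       # source data, 1..5

doubled = (x * 2 for x in data)          # "transformation": lazy, nothing runs yet
big_only = (x for x in doubled if x > 4) # another transformation, still lazy

total = sum(big_only)                    # "action": triggers the whole pipeline
print(total)  # 6 + 8 + 10 = 24
```

In real Spark, the chain of lazy transformations is what the lineage graph records, so a lost partition can be recomputed by replaying the transformations from the source.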
What is Spark DataFrame?
A higher-level abstraction in Spark that is similar to a table in a relational database. It is a distributed collection of rows organized into named columns. It supports both SQL and domain-specific language (DSL) queries. It provides fault-tolerance by maintaining lineage information. It has a schema and optimization information that can be used by the Catalyst optimizer to improve performance.