Questions Flashcards
What are the differences between star schema and snowflake schema in data warehousing?
Star schema and snowflake schema are two types of dimensional modeling techniques used in data warehousing. Star schema has a central fact table that connects to multiple dimension tables, each representing a single attribute or entity. Snowflake schema is a variation of star schema that normalizes the dimension tables into multiple levels of hierarchy. Star schema is simpler and faster to query, but snowflake schema is more normalized and reduces data redundancy.
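The fact-table/dimension-table layout of a star schema can be sketched with an in-memory SQLite database (table and column names here are hypothetical, chosen only for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: one row per product / store attribute set.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT)")
# Central fact table: foreign keys to every dimension plus a measure.
cur.execute("CREATE TABLE fact_sales (product_id INTEGER, store_id INTEGER, amount REAL)")

cur.execute("INSERT INTO dim_product VALUES (1, 'widget')")
cur.execute("INSERT INTO dim_store VALUES (10, 'Berlin')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 10, 5.0), (1, 10, 7.5)])

# The classic star-schema query: join the fact table to its dimensions
# and aggregate. A snowflake schema would add further joins from each
# dimension out to its normalized sub-tables.
cur.execute("""
    SELECT p.name, s.city, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_store   s ON f.store_id   = s.store_id
    GROUP BY p.name, s.city
""")
rows = cur.fetchall()
print(rows)  # [('widget', 'Berlin', 12.5)]
```

The single level of joins is why star schemas are simpler to query; normalizing `dim_product` into, say, product and category tables is what turns this into a snowflake.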
What are the pros and cons of Hadoop for big data processing?
Hadoop is an open-source framework for distributed storage and processing of large-scale data sets. Pros: it can handle various data types, scale easily, and be cost-effective. Cons: it has a steep learning curve, high latency, and high maintenance costs for administration and security.

What are some common data quality issues?
- Missing, incomplete, or incorrect data
- Duplicate or redundant data
- Inconsistent or incompatible data formats or standards
- Outdated or irrelevant data
- Data that does not comply with business rules or regulations
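Several of these issues can be detected with a simple profiling pass. A minimal sketch in plain Python, using hypothetical customer records and field names:

```python
from collections import Counter

# Hypothetical records; the fields and values are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "country": "DE"},
    {"id": 2, "email": None,            "country": "de"},  # missing email, inconsistent case
    {"id": 3, "email": "c@example.com", "country": "US"},
    {"id": 3, "email": "c@example.com", "country": "US"},  # exact duplicate
]

# Missing or incomplete data: rows where any field is None or empty.
missing = [r for r in records if any(v in (None, "") for v in r.values())]

# Duplicate or redundant data: ids that appear more than once.
id_counts = Counter(r["id"] for r in records)
duplicates = [i for i, n in id_counts.items() if n > 1]

# Inconsistent formats: country codes not in the canonical upper case.
inconsistent = [r for r in records if r["country"] != r["country"].upper()]

print(len(missing), duplicates, len(inconsistent))  # 1 [3] 1
```

Real data profiling tools apply the same idea (per-column null counts, uniqueness checks, format conformance) at scale.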
How to handle Data Quality issues?
- Define the data quality criteria and metrics
- Perform data profiling and auditing to identify the issues
- Implement data cleansing and validation techniques
- Monitor and report the data quality status and improvement
- Establish data governance and security policies
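A cleansing-and-validation step from the list above might look like the following sketch. The specific rules (email must contain "@", country is a two-letter upper-case code) are hypothetical stand-ins for real business rules:

```python
def clean(record):
    """Cleansing: normalize formats (here, trim and upper-case the country code)."""
    record = dict(record)
    if record.get("country"):
        record["country"] = record["country"].strip().upper()
    return record

def is_valid(record):
    """Validation: check the record against the defined quality criteria."""
    email = record.get("email") or ""
    country = record.get("country") or ""
    return "@" in email and len(country) == 2 and country.isalpha()

raw = [
    {"email": "a@example.com", "country": " de "},  # cleanable
    {"email": "not-an-email",  "country": "US"},    # fails validation
    {"email": None,            "country": "US"},    # missing value
]

cleaned = [clean(r) for r in raw]
valid = [r for r in cleaned if is_valid(r)]
rejected = [r for r in cleaned if not is_valid(r)]

# The valid/rejected counts feed the monitoring and reporting step.
print(len(valid), len(rejected))  # 1 2
```

Rejected records would typically be quarantined and reported rather than silently dropped, which is where the governance policies come in.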
What are some of the benefits of working with cloud-based data platforms?
- Scalability: Cloud platforms can easily adjust to the changing data volume and demand.
- Availability: Cloud platforms can provide high availability and reliability by replicating the data across multiple locations and servers.
- Cost-effectiveness: Cloud platforms can reduce the upfront and maintenance costs of owning and operating physical infrastructure and software.
What are some of the challenges working with cloud-based data platforms?
- Integration: Cloud platforms can have integration challenges with existing on-premise systems or other cloud providers.
- Hidden costs: usage-based pricing for compute, storage, and data egress can be hard to predict.
- Vendor lock-in: migrating data and workloads away from a provider's proprietary infrastructure can be difficult and expensive.
What is Spark Driver?
The process that runs the main() method of the Spark application and creates the SparkSession object. It is responsible for coordinating the execution of tasks across the Spark cluster.
What is SparkSession?
The entry point to the Spark application. It provides access to the Spark functionality, such as creating and manipulating RDDs, DataFrames, Datasets, and Spark SQL.
What is Spark Cluster Manager?
The component that manages the allocation and release of resources across the Spark cluster. It can be one of the following: Standalone, YARN, Mesos, or Kubernetes.
What is Spark Executor?
The process that runs on each worker node in the cluster and executes the tasks assigned by the driver. It also stores the data in memory or disk.
What is Spark Task?
The unit of work that is sent to the executor by the driver. It is a computation on a partition of data.
What is Spark Job?
A parallel computation that consists of multiple tasks that are triggered by an action on an RDD, DataFrame, or Dataset.
What is Spark Stage?
A set of tasks within a job that can be executed in parallel. A stage is divided by shuffle boundaries, which are operations that require data movement across executors.
What is Spark RDD?
Resilient Distributed Dataset: The original and low-level abstraction in Spark. It is an immutable collection of objects that can be partitioned across the cluster and operated on in parallel. It supports two types of operations: transformations and actions. It provides fault-tolerance by maintaining lineage information. It does not have any schema or optimization information.
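The transformation/action distinction can be illustrated without a Spark cluster. This is a pure-Python analogy, not actual Spark code: generators build up a lazy pipeline the way RDD transformations do, and nothing executes until an "action" consumes the result.

```python
# Pure-Python analogy (not PySpark): generators are lazy like RDD
# transformations; no work happens until an action consumes them.
data = range(1, 6)                       # source data, 1..5

doubled = (x * 2 for x in data)          # "transformation": lazy, nothing runs yet
big_only = (x for x in doubled if x > 4) # another transformation, still lazy

total = sum(big_only)                    # "action": triggers the whole pipeline
print(total)  # 6 + 8 + 10 = 24
```

In real Spark, the chain of lazy transformations is what the lineage graph records, so a lost partition can be recomputed by replaying the transformations from the source.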
What is Spark DataFrame?
A higher-level abstraction in Spark that is similar to a table in a relational database. It is a distributed collection of rows organized into named columns. It supports both SQL and domain-specific language (DSL) queries. It provides fault-tolerance by maintaining lineage information. It has a schema and optimization information that can be used by the Catalyst optimizer to improve performance.