Overview 1 Flashcards

1
Q

Difference between Tuple & List

A

A tuple is immutable, whereas a list is mutable. Tuples are generally used for fixed data, while lists are used for dynamic data.
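A minimal sketch of the difference:

```python
nums_list = [1, 2, 3]
nums_list[0] = 99      # fine: lists are mutable
nums_tuple = (1, 2, 3)
# nums_tuple[0] = 99   # raises TypeError: tuples are immutable
```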

2
Q

Remove duplicates from a sorted list

A

Use set() for an unordered result, e.g. list(set(sorted_list)), but note that sets do not preserve order. To keep the sorted order, deduplicate with dict.fromkeys() or itertools.groupby().
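A minimal sketch of the order-preserving options:

```python
from itertools import groupby

sorted_list = [1, 1, 2, 3, 3, 3, 4]

# dicts preserve insertion order (Python 3.7+)
print(list(dict.fromkeys(sorted_list)))          # [1, 2, 3, 4]

# groupby() groups equal adjacent elements, so sorted input dedupes cleanly
print([key for key, _ in groupby(sorted_list)])  # [1, 2, 3, 4]
```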

3
Q

In the given string, print each literal followed by its number of occurrences in the string.*

A

Example code:

from collections import Counter

s = 'yaaba daaba do'
counts = Counter(s)  # counts every character, including spaces
for char, count in counts.items():
    print(f'{char}: {count}')

4
Q

What is Python Unit test?

A

unittest is Python's built-in framework for testing individual units of code, usually functions or methods, to ensure they work as expected.
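A minimal sketch with the standard unittest module (the add function and test names are illustrative):

```python
import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_positive(self):
        self.assertEqual(add(2, 3), 5)

if __name__ == '__main__':
    unittest.main()  # runs the tests in this module
```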

5
Q

How does String interpolation work in Python?

A

String interpolation in Python can be done using f-strings, format(), or % formatting. Example: f'Hello {name}'.
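A minimal sketch of all three styles:

```python
name = 'Ada'
print(f'Hello {name}')           # f-string (Python 3.6+)
print('Hello {}'.format(name))   # str.format()
print('Hello %s' % name)         # %-formatting
```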

6
Q

Create a list in Python and print the smallest and largest items (min, max). Also reverse and sort the list.

A

Example code:

lst = [5, 3, 8, 1]
print(min(lst), max(lst))  # smallest and largest items: 1 8
lst.sort()                 # sorts in place: [1, 3, 5, 8]
print(lst[::-1])           # reversed copy: [8, 5, 3, 1]

7
Q

What is SQL?

A

SQL (Structured Query Language) is a domain-specific language used for managing and querying data in relational databases.

8
Q

What is an RDBMS (relational database management system)?

A

An RDBMS is a database management system that stores data in tables with relationships between them. Examples include MySQL and PostgreSQL.

9
Q

What are the sublanguages of SQL?

A

The main sublanguages are Data Definition Language (DDL), Data Manipulation Language (DML), Data Control Language (DCL), and Transaction Control Language (TCL).

10
Q

What is multiplicity and how is it implemented?*

A

Multiplicity refers to the number of instances of one entity that can be associated with another entity. It is implemented through foreign keys and relationship constraints in the database schema.

11
Q

How many primary keys can a table have? How many foreign keys?

A

A table can have only one primary key but multiple foreign keys.

12
Q

What are the differences between WHERE and HAVING?* Give an example.

A

WHERE filters rows before grouping and aggregation, while HAVING filters groups after aggregation. Example: SELECT dept_id, COUNT(*) FROM employees WHERE salary > 50000 GROUP BY dept_id HAVING COUNT(*) > 5;

13
Q

List the 4 major joins (inner, outer, left, right)

A

INNER JOIN: returns only rows that match in both tables. FULL OUTER JOIN: returns matched rows plus unmatched rows from both tables. LEFT JOIN: returns all rows from the left table plus matches from the right. RIGHT JOIN: returns all rows from the right table plus matches from the left.

14
Q

Write database script that creates joins between tables and queries them. Write examples here.

A

Example: SELECT * FROM employees INNER JOIN departments ON employees.dept_id = departments.dept_id;

15
Q

What is the difference between an aggregate function and a scalar function? Examples?

A

Aggregate functions perform operations on groups of data, e.g., SUM(), AVG(). Scalar functions operate on single values, e.g., UPPER(), LEN().

16
Q

What is a transaction?*

A

A transaction is a logical unit of work, which can be committed or rolled back. It ensures data integrity.

17
Q

What are the properties of a transaction?

A

The properties of a transaction are Atomicity, Consistency, Isolation, and Durability (ACID).

18
Q

What does ACID stand for?

A

ACID stands for Atomicity, Consistency, Isolation, and Durability.

19
Q

What are the transaction isolation levels and what do they prevent?*

A

From weakest to strongest: Read Uncommitted (permits dirty reads), Read Committed (prevents dirty reads), Repeatable Read (also prevents non-repeatable reads), and Serializable (also prevents phantom reads). Each level controls how much of other transactions' uncommitted or concurrent work is visible.

20
Q

what is serializable isolation level?*

A

Serializable isolation level ensures that transactions are executed in such a way that it’s as if they were run one after another, without interference.

21
Q

What is normalization?

A

Normalization is the process of organizing data in a database to reduce redundancy and improve integrity.

22
Q

What is the CAP Theorem?*

A

The CAP Theorem states that a distributed database can guarantee only two of the following three properties at once:
Consistency: every read receives the most recent write.
Availability: every request receives a response.
Partition Tolerance: the system keeps running even when network failures split the nodes into partitions.

23
Q

List the characteristics of Big Data (Volume, Variety, Velocity & Variability)

A

The 3 Vs are Volume, Velocity, and Variety. Variability refers to the inconsistency of the data flow.

24
Q

What are the 5 Vs of big data?

A

Volume, Velocity, Value, Variety, Veracity.

25
ls -al
`ls -al` lists files and directories in a detailed (long) format, including hidden files.
26
mkdir -p
`mkdir -p` creates parent directories as needed, without error if the directory exists.
27
rm -r? -f -rf
`rm -r` removes directories and files recursively; `-f` forces the removal, and `-rf` combines both options.
28
cat
`cat` is used to display file contents, concatenate files, or create files.
29
how do permissions work?
Permissions in Linux specify which users or groups can read, write, or execute files using `r`, `w`, `x` flags.
30
what does chmod 400 do?
`chmod 400` sets the file permissions to read-only for the owner, and no permissions for others.
31
What was the 'Hadoop Explosion'?
The 'Hadoop Explosion' refers to the rapid adoption and growth of the Hadoop ecosystem for big data processing and storage.
32
What is Cloudera in relation to Hadoop?
Cloudera is a company that provides enterprise-level Hadoop distribution, tools, and support.
33
What is a daemon?
A daemon is a background process that runs continuously in a system to perform a specific task. For example, when running Hadoop, the NameNode and DataNode daemons run in the background.
34
How many blocks will a 200MB file be stored in, in HDFS?
With the default block size of 128MB, a 200MB file is stored in 2 blocks: one 128MB block and one 72MB block.
35
How are DataNodes fault tolerant?
DataNodes are fault-tolerant through data replication across multiple nodes.
36
default number of replications?
The default number of replications in HDFS is 3.
37
What does namenode do?
The NameNode manages the HDFS filesystem namespace and metadata, including file-to-block mapping.
38
secondary and standby namenodes?
A Secondary NameNode assists with checkpointing but does not take over in case of failure. A Standby NameNode is a hot backup that takes over automatically if the active NameNode fails (used in high-availability setups).
39
what is data locality*
Data locality refers to the practice of storing data near the computation to reduce network traffic and improve performance.
40
what is rack awareness
Rack awareness refers to ensuring data is stored in different racks to improve fault tolerance and minimize data loss in case of rack failure.
41
What are heartbeats?
Heartbeats are periodic signals from DataNodes to the NameNode to indicate they are functioning properly.
42
What is Yarn?
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop responsible for managing compute resources in clusters.
43
What is Hive?
Hive is a data warehouse infrastructure built on top of Hadoop for querying and managing large datasets using a SQL-like interface.
44
describe partitioning and bucketing
Partitioning divides a table's data into separate directories based on column values, while bucketing hashes a column into a fixed number of files (buckets) within each partition, allowing faster joins and sampling. See the sketch below.
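A sketch in PySpark (the table and column names are illustrative):

```python
# Assumes an existing SparkSession `spark` and a DataFrame `df`
# with illustrative columns `country` and `user_id`.
(df.write
   .partitionBy('country')          # one directory per country value
   .bucketBy(8, 'user_id')          # hash user_id into 8 fixed buckets
   .sortBy('user_id')
   .saveAsTable('users_bucketed'))  # bucketBy only works with saveAsTable
```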
45
low cardinality for partitioning
Low cardinality means few distinct values. Low-cardinality columns are good candidates for partitioning; partitioning on a high-cardinality column would create an excessive number of small partitions.
46
What is a Hive partition?
A Hive partition is a way of dividing a table into smaller, manageable parts based on the values of one or more columns.
47
What is the Hive metastore?
The Hive metastore stores metadata information about the tables, partitions, and other objects in Hive.
48
How do we specify we're reading from a CSV file?
Specify the file format using `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','` for CSV.
49
What is OLAP vs OLTP?
Both are data processing systems. OLAP (Online Analytical Processing) is optimized for analyzing large volumes of data, while OLTP (Online Transaction Processing) is optimized for handling many small, concurrent transactions.
50
What is a data warehouse?
A data warehouse is a centralized repository for storing historical and current data for analysis and reporting.
51
What are facts and dimensions?*
Facts are quantitative measures (e.g., sales amounts); dimensions are descriptive attributes (e.g., time, region). They are the key components of data warehouse schema design.
52
What are the star and snowflake schemas of tables in a data warehouse?
In a star schema, a central fact table is surrounded by dimension tables. In a snowflake schema, dimension tables are normalized into multiple related tables.
53
What is ETL?
ETL stands for Extract, Transform, Load—processes for moving data from source systems to data warehouses.
54
batch load vs. history load?
Batch load loads data in bulk at scheduled times, while history load specifically adds or updates historical data.
55
Why do we make use of programming paradigms like Functional Programming and Object Oriented Programming?
Different paradigms provide various approaches for solving problems. OOP focuses on objects and inheritance, while functional programming focuses on immutability and pure functions.
56
What are the 4 pillars of OOP
The four pillars are Encapsulation, Abstraction, Inheritance, and Polymorphism. Polymorphism includes overloading (the same method name with different parameter lists) and overriding (a subclass redefining a superclass method).
57
What is a higher order function?
A higher-order function is a function that takes other functions as arguments or returns a function as its result.
58
What is a Lambda?
A Lambda is an anonymous function, usually used for short-term, one-time operations.
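A minimal sketch of both ideas: sorted() is a higher-order function, and the lambda passed as its key is an anonymous function.

```python
words = ['spark', 'hive', 'hdfs']
print(sorted(words, key=lambda w: len(w)))  # ['hive', 'hdfs', 'spark']
```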
59
could I set function equal to value?
Yes, in Python, you can assign a function to a variable (e.g., `f = my_function`).
60
what does it mean functions are first class citizens?
It means functions can be passed around as arguments, returned from other functions, and assigned to variables.
61
What collections do you use?
Common collections in Python include lists, sets, tuples, and dictionaries.
62
methods used in these collections?
Methods include `append()`, `remove()`, and `sort()` for lists; `add()` and `remove()` for sets; `keys()` and `values()` for dictionaries.
63
What is a Thread?
A thread is a lightweight process that runs independently in a program, allowing concurrent execution.
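A minimal sketch with the standard threading module (the function and thread name are illustrative):

```python
import threading

def greet(name):
    print(f'Hello from {name}')

t = threading.Thread(target=greet, args=('worker-1',))
t.start()   # runs greet concurrently with the main thread
t.join()    # wait for the thread to finish
```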
64
What is the difference between an RDD and a dataframe?
RDDs are distributed collections of data, while DataFrames are a higher-level abstraction built on top of RDDs for working with structured data.
65
Are RDDs mutable or immutable? Dataframes?
RDDs and DataFrames are immutable. However, DataFrames provide more optimization and easier handling for structured data.
66
What is an RDD
RDD stands for Resilient Distributed Dataset, a fundamental data structure in Spark, representing an immutable distributed collection of objects.
67
What is a Dataframe
A DataFrame is a distributed collection of data organized into named columns, offering higher-level abstractions than RDDs.
68
What is a DataSet
A DataSet is a strongly-typed, distributed collection of data that combines the features of RDDs and DataFrames.
69
What is the difference between an RDD, DataSet and DataFrame
RDDs are lower-level and provide fine-grained control. DataFrames are higher-level and provide optimizations. DataSets are similar to DataFrames but with compile-time type safety.
70
Which one is better? When would you use one or another?
DataFrames and DataSets are preferred for structured data because of optimizations. RDDs are useful for unstructured data or when fine-grained control is required.
71
Define RDD and how do you sort elements within RDD.
RDD is a distributed collection of data, and you can sort it using `sortBy()` method.
72
create key value pairs with map then sortByKey
Example: `rdd = sc.parallelize(['b', 'a', 'c']).map(lambda x: (x, 1)); rdd.sortByKey().collect()`.
73
Define Actions in spark
Actions trigger computation and return results, such as `collect()`, `reduce()`, `count()`, etc.
74
What are the deploy modes in Apache Spark?
Spark can be deployed in local mode, client mode, or cluster mode (on YARN, Mesos, or Kubernetes).
75
how to map a json file.
Use `spark.read.json()` to read a JSON file into a DataFrame.
76
how to flatten a JSON file
Flatten a nested JSON DataFrame by selecting nested struct fields directly and expanding array columns with `explode()`, as sketched below.
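A sketch, assuming a JSON file where each record has a nested struct and an array field (the file path and column names are illustrative):

```python
from pyspark.sql.functions import col, explode

# Assumes an existing SparkSession `spark`
df = spark.read.json('people.json')
flat = df.select(
    col('name'),
    col('address.city').alias('city'),      # pull a nested struct field up
    explode(col('phones')).alias('phone'),  # one row per array element
)
flat.show()
```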
77
What does RDD stand for? What is an RDD? How are RDDs used?
RDD stands for Resilient Distributed Dataset. It is used for distributed data processing and can be transformed using map, filter, etc.
78
What does it mean to transform an RDD?
To transform an RDD means to apply operations like map, filter, or flatMap to produce a new RDD.
79
wide vs narrow? shuffle?
Wide transformations require a shuffle of data across partitions, while narrow transformations do not. Examples: `map` (narrow), `groupByKey` (wide).
80
What does it mean to cache an RDD?
Caching an RDD (via `cache()` or `persist()`) stores it in memory so that subsequent actions reuse it instead of recomputing it.
81
What is a broadcast variable?
A broadcast variable allows you to efficiently share large read-only data across all nodes.
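A minimal sketch (assumes an existing SparkContext `sc`; the lookup table is illustrative):

```python
# Ship a small read-only dict to every executor once
lookup = sc.broadcast({'US': 'United States', 'DE': 'Germany'})
rdd = sc.parallelize(['US', 'DE', 'US'])
print(rdd.map(lambda code: lookup.value[code]).collect())
```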
82
What is a shuffle in Spark?
A shuffle is a process of redistributing data across partitions, which may be necessary for wide transformations.
83
What’s the difference between cluster mode and client mode on YARN?
In cluster mode, both the driver and executor run on cluster nodes. In client mode, the driver runs on the client machine.
84
What is a Spark Application? Job? Stage? Task?
A Spark Application consists of a SparkContext and may run one or more jobs. A job is divided into stages, and each stage is divided into tasks.
85
What is Spark SQL?
Spark SQL is a module for working with structured data using SQL queries, DataFrames, and Datasets.
86
How does a broadcast join work in Spark?
In a broadcast join, the smaller dataset is broadcast to all nodes, reducing data shuffling and improving performance.
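A minimal sketch (assumes DataFrames `orders`, large, and `countries`, small, sharing an illustrative `country_code` column):

```python
from pyspark.sql.functions import broadcast

# Hint Spark to copy the small table to every node instead of shuffling both sides
joined = orders.join(broadcast(countries), 'country_code')
```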
87
Why are broadcast joins significantly faster than shuffle joins?
Because broadcast joins avoid the cost of shuffling large datasets across nodes.
88
What is the catalyst optimizer?
The Catalyst Optimizer is a query optimizer in Spark SQL that applies various optimization rules to SQL queries.
89
Difference between dataframes and datasets?
DataFrames are untyped and easier to use. Datasets provide type safety with compile-time checking.
90
Are Dataframes lazily evaluated, like RDDs?
Yes, DataFrames are lazily evaluated, meaning computations are triggered only when an action is performed.
91
how do you rename column in dataframe?
Use the `withColumnRenamed()` method, e.g. `df = df.withColumnRenamed('old_name', 'new_name')`.
92
List functions available to us when using DataFrames?
Functions include `select()`, `filter()`, `groupBy()`, `agg()`, `join()`, etc.
93
different file formats in Spark, which is best?
Popular file formats include Parquet, ORC, CSV, JSON, Avro. Parquet and ORC are typically the best for performance in Spark due to columnar storage.
94
What is Parquet?
Parquet is a columnar storage file format optimized for analytics and big data processing.
95
What is ORC?
ORC (Optimized Row Columnar) is a file format for efficient storage and query processing in Hive and Spark.
96
File Types use in Spark?
File types include Parquet, ORC, Avro, JSON, CSV, and Text.
97
By default which file type does .saveAsTable create?
By default, `.saveAsTable` creates tables in the Parquet file format.
98
What is audit, balance, and control?
In the audit-balance-control (ABC) framework, audit captures metadata about each data-processing run, balance reconciles source and target data (e.g., record counts and totals), and control manages process execution and error handling.
99
Give a high-level AWS description
AWS (Amazon Web Services) is a cloud computing platform offering scalable computing, storage, and networking solutions.
100
What kind of services did you use with AWS?
I have used services like EC2, S3, RDS, Lambda, EMR, DynamoDB, and more.
101
Have you worked with S3 buckets?
Yes, S3 is used for storing and retrieving any amount of data at any time.
102
What is the difference between EC2 and EMR?
EC2 is a virtual server that provides computing power, whereas EMR (Elastic MapReduce) is a managed cluster platform for processing big data.
103
Have you worked with Amazon Lambda?
Yes, AWS Lambda is a serverless compute service for running code without provisioning servers.
104
List the Python library used to access AWS services (like S3) - Boto3?
The Python library used to access AWS services is Boto3.
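A minimal sketch with Boto3 (assumes AWS credentials are configured; the bucket and key names are illustrative):

```python
import boto3

s3 = boto3.client('s3')
s3.upload_file('local.csv', 'my-bucket', 'data/local.csv')  # upload a file
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='data/')
for obj in response.get('Contents', []):
    print(obj['Key'])  # list the keys under the prefix
```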