Overview 1 Flashcards

1
Q

Difference between Tuple & List

A

A tuple is immutable, whereas a list is mutable. Tuples are generally used for fixed data, while lists are used for dynamic data.
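A minimal sketch of the difference:

```python
nums_list = [1, 2, 3]
nums_list[0] = 99      # fine: lists are mutable
nums_tuple = (1, 2, 3)
# nums_tuple[0] = 99   # raises TypeError: tuples are immutable
```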

2
Q

Remove duplicates from a sorted list

A

Use set() for an unordered result, e.g. list(set(sorted_list)), but note that sets do not preserve order. To keep the sorted order, deduplicate with dict.fromkeys() or itertools.groupby().
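A minimal sketch of the order-preserving options:

```python
from itertools import groupby

sorted_list = [1, 1, 2, 3, 3, 3, 4]

# dicts preserve insertion order (Python 3.7+)
print(list(dict.fromkeys(sorted_list)))          # [1, 2, 3, 4]

# groupby() groups equal adjacent elements, so sorted input dedupes cleanly
print([key for key, _ in groupby(sorted_list)])  # [1, 2, 3, 4]
```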

3
Q

In the given string, print each literal followed by its number of occurrences in the string.*

A

Example code:

from collections import Counter

s = 'yaaba daaba do'
counts = Counter(s)  # counts every character, including spaces
for char, count in counts.items():
    print(f'{char}: {count}')

4
Q

What is Python Unit test?

A

unittest is Python's built-in framework for testing individual units of code, usually functions or methods, to ensure they work as expected.
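A minimal sketch with the standard unittest module (the add function and test names are illustrative):

```python
import unittest

def add(a, b):
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_positive(self):
        self.assertEqual(add(2, 3), 5)

if __name__ == '__main__':
    unittest.main()  # runs the tests in this module
```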

5
Q

How does String interpolation work in Python?

A

String interpolation in Python can be done using f-strings, format(), or % formatting. Example: f'Hello {name}'.
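A minimal sketch of all three styles:

```python
name = 'Ada'
print(f'Hello {name}')           # f-string (Python 3.6+)
print('Hello {}'.format(name))   # str.format()
print('Hello %s' % name)         # %-formatting
```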

6
Q

Create a list in Python and print the smallest and largest items (min, max). Also reverse and sort the list.

A

Example code:

lst = [5, 3, 8, 1]
print(min(lst), max(lst))  # smallest and largest items: 1 8
lst.sort()                 # sorts in place: [1, 3, 5, 8]
print(lst[::-1])           # reversed copy: [8, 5, 3, 1]

7
Q

What is SQL?

A

SQL (Structured Query Language) is a domain-specific language used for managing and querying data in relational databases.

8
Q

What is an RDBMS (relational database management system)?

A

An RDBMS is a database management system that stores data in tables with relationships between them. Examples include MySQL and PostgreSQL.

9
Q

What are the sublanguages of SQL?

A

The main sublanguages are Data Definition Language (DDL), Data Manipulation Language (DML), Data Control Language (DCL), and Transaction Control Language (TCL).

10
Q

What is multiplicity and how is it implemented?*

A

Multiplicity refers to the number of instances of one entity that can be associated with another entity. It is implemented through foreign keys and relationship constraints in the database schema.

11
Q

How many primary keys can a table have? How many foreign keys?

A

A table can have only one primary key but multiple foreign keys.

12
Q

What are the differences between WHERE and HAVING?* Give an example.

A

WHERE filters rows before grouping and aggregation, while HAVING filters groups after aggregation. Example: SELECT dept_id, COUNT(*) FROM employees WHERE salary > 50000 GROUP BY dept_id HAVING COUNT(*) > 5;

13
Q

List the 4 major joins (inner, outer, left, right)

A

INNER JOIN: returns only rows that match in both tables. FULL OUTER JOIN: returns matched rows plus unmatched rows from both tables. LEFT JOIN: returns all rows from the left table plus matches from the right. RIGHT JOIN: returns all rows from the right table plus matches from the left.

14
Q

Write database script that creates joins between tables and queries them. Write examples here.

A

Example: SELECT * FROM employees INNER JOIN departments ON employees.dept_id = departments.dept_id;

15
Q

What is the difference between an aggregate function and a scalar function? Examples?

A

Aggregate functions perform operations on groups of data, e.g., SUM(), AVG(). Scalar functions operate on single values, e.g., UPPER(), LEN().

16
Q

What is a transaction?*

A

A transaction is a logical unit of work, which can be committed or rolled back. It ensures data integrity.

17
Q

What are the properties of a transaction?

A

The properties of a transaction are Atomicity, Consistency, Isolation, and Durability (ACID).

18
Q

What does ACID stand for?

A

ACID stands for Atomicity, Consistency, Isolation, and Durability.

19
Q

What are the transaction isolation levels and what do they prevent?*

A

From weakest to strongest: Read Uncommitted (permits dirty reads), Read Committed (prevents dirty reads), Repeatable Read (also prevents non-repeatable reads), and Serializable (also prevents phantom reads). Each level controls how much of other transactions' uncommitted or concurrent work is visible.

20
Q

what is serializable isolation level?*

A

Serializable isolation level ensures that transactions are executed in such a way that it’s as if they were run one after another, without interference.

21
Q

What is normalization?

A

Normalization is the process of organizing data in a database to reduce redundancy and improve integrity.

22
Q

What is the CAP Theorem?*

A

The CAP Theorem states that a distributed database can guarantee only two of the following three properties at once:
Consistency: every read receives the most recent write.
Availability: every request receives a response.
Partition Tolerance: the system keeps running even when network failures split the nodes into partitions.

23
Q

List the characteristics of Big Data (Volume, Variety, Velocity & Variability)

A

The 3 Vs are Volume, Velocity, and Variety. Variability refers to the inconsistency of the data flow.

24
Q

What are the 5 Vs of big data?

A

Volume, Velocity, Value, Variety, Veracity.

25
ls -al
`ls -al` lists files and directories in a detailed (long) format, including hidden files.
26
mkdir -p
`mkdir -p` creates parent directories as needed, without error if the directory exists.
27
rm -r? -f -rf
`rm -r` removes directories and files recursively; `-f` forces the removal, and `-rf` combines both options.
28
cat
`cat` is used to display file contents, concatenate files, or create files.
29
how do permissions work?
Permissions in Linux specify which users or groups can read, write, or execute files using `r`, `w`, `x` flags.
30
what does chmod 400 do?
`chmod 400` sets the file permissions to read-only for the owner, and no permissions for others.
31
What was the 'Hadoop Explosion'?
The 'Hadoop Explosion' refers to the rapid adoption and growth of the Hadoop ecosystem for big data processing and storage.
32
What is Cloudera in relation to Hadoop?
Cloudera is a company that provides enterprise-level Hadoop distribution, tools, and support.
33
What is a daemon?
A daemon is a background process that runs continuously in a system to perform a specific task. For example, when running Hadoop, the NameNode and DataNode daemons run in the background.
34
How many blocks will a 200MB file be stored in, in HDFS?
With the default block size of 128MB, a 200MB file is stored in 2 blocks: one 128MB block and one 72MB block.
35
How are DataNodes fault tolerant?
DataNodes are fault-tolerant through data replication across multiple nodes.
36
default number of replications?
The default number of replications in HDFS is 3.
37
What does namenode do?
The NameNode manages the HDFS filesystem namespace and metadata, including file-to-block mapping.
38
secondary and standby namenodes?
A Secondary NameNode assists with checkpointing but does not take over in case of failure. A Standby NameNode is a hot backup that takes over automatically if the active NameNode fails (used in high-availability setups).
39
what is data locality*
Data locality refers to the practice of storing data near the computation to reduce network traffic and improve performance.
40
what is rack awareness
Rack awareness refers to ensuring data is stored in different racks to improve fault tolerance and minimize data loss in case of rack failure.
41
What are heartbeats?
Heartbeats are periodic signals from DataNodes to the NameNode to indicate they are functioning properly.
42
What is Yarn?
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop responsible for managing compute resources in clusters.
43
What is Hive?
Hive is a data warehouse infrastructure built on top of Hadoop for querying and managing large datasets using a SQL-like interface.
44
describe partitioning and bucketing
Partitioning divides a table's data into separate directories based on column values, while bucketing hashes a column into a fixed number of files (buckets) within each partition, allowing faster joins and sampling. See the sketch below.
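A sketch in PySpark (the table and column names are illustrative):

```python
# Assumes an existing SparkSession `spark` and a DataFrame `df`
# with illustrative columns `country` and `user_id`.
(df.write
   .partitionBy('country')          # one directory per country value
   .bucketBy(8, 'user_id')          # hash user_id into 8 fixed buckets
   .sortBy('user_id')
   .saveAsTable('users_bucketed'))  # bucketBy only works with saveAsTable
```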
45
low cardinality for partitioning
Low cardinality means few distinct values. Low-cardinality columns are good candidates for partitioning; partitioning on a high-cardinality column would create an excessive number of small partitions.
46
What is a Hive partition?
A Hive partition is a way of dividing a table into smaller, manageable parts based on the values of one or more columns.
47
What is the Hive metastore?
The Hive metastore stores metadata information about the tables, partitions, and other objects in Hive.
48
How do we specify we're reading from a CSV file?
Specify the file format using `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','` for CSV.
49
What is OLAP vs OLTP?
Both are data processing systems. OLAP (Online Analytical Processing) is optimized for analyzing large volumes of data, while OLTP (Online Transaction Processing) is optimized for handling many small, concurrent transactions.
50
What is a data warehouse?
A data warehouse is a centralized repository for storing historical and current data for analysis and reporting.
51
What are facts and dimensions?*
Facts are quantitative measures (e.g., sales amounts); dimensions are descriptive attributes (e.g., time, region). They are the key components of data warehouse schema design.
52
What are the star and snowflake schemas of tables in a data warehouse?
In a star schema, a central fact table is surrounded by dimension tables. In a snowflake schema, dimension tables are normalized into multiple related tables.
53
What is ETL?
ETL stands for Extract, Transform, Load—processes for moving data from source systems to data warehouses.
54
batch load vs. history load?
Batch load loads data in bulk at scheduled times, while history load specifically adds or updates historical data.
55
Why do we make use of programming paradigms like Functional Programming and Object Oriented Programming?
Different paradigms provide various approaches for solving problems. OOP focuses on objects and inheritance, while functional programming focuses on immutability and pure functions.
56
What are the 4 pillars of OOP
The four pillars are Encapsulation, Abstraction, Inheritance, and Polymorphism. Polymorphism includes overloading (the same method name with different parameter lists) and overriding (a subclass redefining a superclass method).
57
What is a higher order function?
A higher-order function is a function that takes other functions as arguments or returns a function as its result.
58
What is a Lambda?
A Lambda is an anonymous function, usually used for short-term, one-time operations.
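A minimal sketch of both ideas: sorted() is a higher-order function, and the lambda passed as its key is an anonymous function.

```python
words = ['spark', 'hive', 'hdfs']
print(sorted(words, key=lambda w: len(w)))  # ['hive', 'hdfs', 'spark']
```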
59
could I set function equal to value?
Yes, in Python, you can assign a function to a variable (e.g., `f = my_function`).
60
what does it mean functions are first class citizens?
It means functions can be passed around as arguments, returned from other functions, and assigned to variables.
61
What collections do you use?
Common collections in Python include lists, sets, tuples, and dictionaries.
62
methods used in these collections?
Methods include `append()`, `remove()`, and `sort()` for lists; `add()` and `remove()` for sets; `keys()` and `values()` for dictionaries.
63
What is a Thread?
A thread is a lightweight process that runs independently in a program, allowing concurrent execution.
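A minimal sketch with the standard threading module (the function and thread name are illustrative):

```python
import threading

def greet(name):
    print(f'Hello from {name}')

t = threading.Thread(target=greet, args=('worker-1',))
t.start()   # runs greet concurrently with the main thread
t.join()    # wait for the thread to finish
```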
64
What is the difference between an RDD and a dataframe?
RDDs are distributed collections of data, while DataFrames are a higher-level abstraction built on top of RDDs for working with structured data.
65
Are RDDs mutable or immutable? Dataframes?
RDDs and DataFrames are immutable. However, DataFrames provide more optimization and easier handling for structured data.
66
What is an RDD
RDD stands for Resilient Distributed Dataset, a fundamental data structure in Spark, representing an immutable distributed collection of objects.
67
What is a Dataframe
A DataFrame is a distributed collection of data organized into named columns, offering higher-level abstractions than RDDs.
68
What is a DataSet
A DataSet is a strongly-typed, distributed collection of data that combines the features of RDDs and DataFrames.
69
What is the difference between an RDD, DataSet and DataFrame
RDDs are lower-level and provide fine-grained control. DataFrames are higher-level and provide optimizations. DataSets are similar to DataFrames but with compile-time type safety.
70
Which one is better? When would you use one or another?
DataFrames and DataSets are preferred for structured data because of optimizations. RDDs are useful for unstructured data or when fine-grained control is required.
71
Define RDD and how do you sort elements within RDD.
RDD is a distributed collection of data, and you can sort it using `sortBy()` method.
72
create key value pairs with map then sortByKey
Example: `rdd = sc.parallelize(['b', 'a', 'c']).map(lambda x: (x, 1)); rdd.sortByKey().collect()`.
73
Define Actions in spark
Actions trigger computation and return results, such as `collect()`, `reduce()`, `count()`, etc.
74
What are the deploy modes in Apache Spark?
Spark can be deployed in local mode, client mode, or cluster mode (on YARN, Mesos, or Kubernetes).
75
how to map a json file.
Use `spark.read.json()` to read a JSON file into a DataFrame.
76
how to flatten a JSON file
Flatten a nested JSON DataFrame by selecting nested struct fields directly and expanding array columns with `explode()`, as sketched below.
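A sketch, assuming a JSON file where each record has a nested struct and an array field (the file path and column names are illustrative):

```python
from pyspark.sql.functions import col, explode

# Assumes an existing SparkSession `spark`
df = spark.read.json('people.json')
flat = df.select(
    col('name'),
    col('address.city').alias('city'),      # pull a nested struct field up
    explode(col('phones')).alias('phone'),  # one row per array element
)
flat.show()
```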
77
What does RDD stand for? What is an RDD? How are RDDs used?
RDD stands for Resilient Distributed Dataset. It is used for distributed data processing and can be transformed using map, filter, etc.
78
What does it mean to transform an RDD?
To transform an RDD means to apply operations like map, filter, or flatMap to produce a new RDD.
79
wide vs narrow? shuffle?
Wide transformations require a shuffle of data across partitions, while narrow transformations do not. Examples: `map` (narrow), `groupByKey` (wide).
80
What does it mean to cache an RDD?
Caching an RDD (via `cache()` or `persist()`) stores it in memory so that subsequent actions reuse it instead of recomputing it.
81
What is a broadcast variable?
A broadcast variable allows you to efficiently share large read-only data across all nodes.
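A minimal sketch (assumes an existing SparkContext `sc`; the lookup table is illustrative):

```python
# Ship a small read-only dict to every executor once
lookup = sc.broadcast({'US': 'United States', 'DE': 'Germany'})
rdd = sc.parallelize(['US', 'DE', 'US'])
print(rdd.map(lambda code: lookup.value[code]).collect())
```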
82
What is a shuffle in Spark?
A shuffle is a process of redistributing data across partitions, which may be necessary for wide transformations.
83
What’s the difference between cluster mode and client mode on YARN?
In cluster mode, both the driver and executor run on cluster nodes. In client mode, the driver runs on the client machine.
84
What is a Spark Application? Job? Stage? Task?
A Spark Application consists of a SparkContext and may run one or more jobs. A job is divided into stages, and each stage is divided into tasks.
85
What is Spark SQL?
Spark SQL is a module for working with structured data using SQL queries, DataFrames, and Datasets.
86
How does a broadcast join work in Spark?
In a broadcast join, the smaller dataset is broadcast to all nodes, reducing data shuffling and improving performance.
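A minimal sketch (assumes DataFrames `orders`, large, and `countries`, small, sharing an illustrative `country_code` column):

```python
from pyspark.sql.functions import broadcast

# Hint Spark to copy the small table to every node instead of shuffling both sides
joined = orders.join(broadcast(countries), 'country_code')
```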
87
Why are broadcast joins significantly faster than shuffle joins?
Because broadcast joins avoid the cost of shuffling large datasets across nodes.
88
What is the catalyst optimizer?
The Catalyst Optimizer is a query optimizer in Spark SQL that applies various optimization rules to SQL queries.
89
Difference between dataframes and datasets?
DataFrames are untyped and easier to use. Datasets provide type safety with compile-time checking.
90
Are Dataframes lazily evaluated, like RDDs?
Yes, DataFrames are lazily evaluated, meaning computations are triggered only when an action is performed.
91
how do you rename column in dataframe?
Use the `withColumnRenamed()` method, e.g. `df = df.withColumnRenamed('old_name', 'new_name')`.
92
List functions available to us when using DataFrames?
Functions include `select()`, `filter()`, `groupBy()`, `agg()`, `join()`, etc.
93
different file formats in Spark, which is best?
Popular file formats include Parquet, ORC, CSV, JSON, Avro. Parquet and ORC are typically the best for performance in Spark due to columnar storage.
94
What is Parquet?
Parquet is a columnar storage file format optimized for analytics and big data processing.
95
What is ORC?
ORC (Optimized Row Columnar) is a file format for efficient storage and query processing in Hive and Spark.
96
File Types use in Spark?
File types include Parquet, ORC, Avro, JSON, CSV, and Text.
97
By default which file type does .saveAsTable create?
By default, `.saveAsTable` creates tables in the Parquet file format.
98
What is audit, balance, and control?
In the audit-balance-control (ABC) framework, audit captures metadata about each data-processing run, balance reconciles source and target data (e.g., record counts and totals), and control manages process execution and error handling.
99
Give a high-level AWS description
AWS (Amazon Web Services) is a cloud computing platform offering scalable computing, storage, and networking solutions.
100
What kind of services did you use with AWS?
I have used services like EC2, S3, RDS, Lambda, EMR, DynamoDB, and more.
101
Have you worked with S3 buckets?
Yes, S3 is used for storing and retrieving any amount of data at any time.
102
What is the difference between EC2 and EMR?
EC2 is a virtual server that provides computing power, whereas EMR (Elastic MapReduce) is a managed cluster platform for processing big data.
103
Have you worked with Amazon Lambda?
Yes, AWS Lambda is a serverless compute service for running code without provisioning servers.
104
List the Python library used to access AWS services (like S3) - Boto3?
The Python library used to access AWS services is Boto3.
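A minimal sketch with Boto3 (assumes AWS credentials are configured; the bucket and key names are illustrative):

```python
import boto3

s3 = boto3.client('s3')
s3.upload_file('local.csv', 'my-bucket', 'data/local.csv')  # upload a file
response = s3.list_objects_v2(Bucket='my-bucket', Prefix='data/')
for obj in response.get('Contents', []):
    print(obj['Key'])  # list the keys under the prefix
```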