EndGame Flashcards

1
Q

What is a UNIQUE constraint?

A

The UNIQUE constraint ensures that all values in a column are different. You can have more than one unique constraint per table.
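For example, a minimal sketch (table and column names are illustrative):

CREATE TABLE users (
    id INT PRIMARY KEY,
    email VARCHAR(255) UNIQUE,   -- no two rows may share an email
    username VARCHAR(50) UNIQUE  -- a second UNIQUE constraint on the same table
);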

2
Q

What is a View? Materialized View?

A

A view is a virtual table based on the result set of a SQL statement. Its results are not stored; the defining query runs every time the view is used, so you always see up-to-date data at the cost of recomputing the result on each access. A materialized view is a table that is stored on disk, which allows faster processing than a plain view because the results are precomputed (though they can become stale until the materialized view is refreshed).

3
Q

How do you create a View?

A

CREATE VIEW view_name AS
SELECT column1, column2
FROM table_name
WHERE condition;

4
Q

What are primary and foreign keys?

A

Primary keys are used to uniquely identify a record in a table. Each table can have only one primary key, and its values must be unique and not null. Foreign keys are used to establish relationships between two tables: a foreign key in one table refers to the primary key in another table.
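A minimal sketch of both key types (table and column names are illustrative):

CREATE TABLE departments (
    dept_id INT PRIMARY KEY,
    dept_name VARCHAR(100)
);

CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    dept_id INT,
    FOREIGN KEY (dept_id) REFERENCES departments(dept_id)  -- refers to the other table's primary key
);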

5
Q

What is the order of operations in a SQL statement?

A

FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT
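For example, in the query below (names are illustrative), the comments number the clauses in logical evaluation order:

SELECT dept, AVG(salary) AS avg_salary  -- 5. SELECT
FROM employees                          -- 1. FROM
WHERE active = 1                        -- 2. WHERE
GROUP BY dept                           -- 3. GROUP BY
HAVING AVG(salary) > 50000              -- 4. HAVING
ORDER BY avg_salary DESC                -- 6. ORDER BY
LIMIT 10;                               -- 7. LIMIT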

6
Q

What are the sublanguages of SQL?

A

DDL - Data Definition Language (CREATE, ALTER, DROP)
DML - Data Manipulation Language (INSERT, UPDATE, DELETE)
DCL - Data Control Language (GRANT, REVOKE)
DQL - Data Query Language (SELECT)
TCL - Transaction Control Language (COMMIT, ROLLBACK, SAVEPOINT)
SCL - Session Control Language

7
Q

What is cardinality?

A

Cardinality refers to the uniqueness of values in a column. A column with high cardinality contains mostly distinct values with few or no duplicates; a column with low cardinality contains many duplicate values.

8
Q

What is a candidate key?

A

A candidate key is a key that uniquely identifies a record in a table. It is similar to a primary key, the difference being that a table can have multiple candidate keys while only one of them is chosen as the primary key. The remaining candidate keys can be enforced using the UNIQUE constraint.

9
Q

What is a composite key?

A

A composite key is made of two or more columns in a table that together uniquely identify each row. To declare a composite key, use PRIMARY KEY (col1, col2) in your table creation.
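A minimal sketch (names are illustrative):

CREATE TABLE order_items (
    order_id INT,
    product_id INT,
    quantity INT,
    PRIMARY KEY (order_id, product_id)  -- composite key: the pair of values must be unique
);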

10
Q

What are the different constraints on a column?

A

NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, DEFAULT, CHECK, and CREATE INDEX (strictly speaking, an index is created alongside the table rather than being a column constraint).

11
Q

What is an ERD?

A

ERD stands for entity relationship diagram, and it is a diagram that shows the relationships between entities. Entities are represented as tables in a database, and the ERD shows relationships based on certain attributes (data) of those entities.

12
Q

What is the difference between the Where clause and the Having clause?

A

The WHERE clause is used to filter records before any grouping is made, while the HAVING clause is used to filter groups after GROUP BY. WHERE can be thought of as a pre-filter and HAVING as a post-filter.
Also, the WHERE clause cannot be used with aggregate functions, but the HAVING clause can.
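A minimal sketch (names are illustrative):

SELECT dept, COUNT(*) AS headcount
FROM employees
WHERE active = 1      -- pre-filter: applied to rows before grouping
GROUP BY dept
HAVING COUNT(*) > 5;  -- post-filter: applied to groups, may use aggregates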

13
Q

What is the difference between Group By and Order By?

A

GROUP BY is used to group rows that have the same values, while ORDER BY is used to sort the result set by one or more columns in ascending or descending order.

14
Q

What is the LIKE clause?

A

The LIKE clause is used for pattern matching. It is a logical operator that checks whether a string matches a particular pattern, using the wildcards % (any sequence of characters) and _ (a single character).

15
Q

What is a Subquery?

A

A subquery is a query that appears inside another query statement. Subqueries are often used in the SELECT and WHERE clauses.
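A minimal sketch of a subquery in the WHERE clause (names are illustrative):

SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);  -- inner query runs first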

16
Q

What is the difference between aggregate and scalar functions?

A

Aggregate functions operate on a collection of values and return a single value, while scalar functions return a single value based on the input value. Some examples of aggregate functions are AVG(), SUM(), COUNT(), MIN(), and MAX(). Some examples of scalar functions are UCASE(), LCASE(), and ROUND().

17
Q

Explain the different types of joins?

A

Inner Join - joins rows from two tables based on a matching column and returns only the rows that match.

Left Join - joins rows from two tables based on a matching column and returns all rows from the left table, with NULLs for right-table columns where there is no match.
Right Join - the same as a left join, but instead returns all rows from the right table.
Cross Join - generates the paired combination of each row of the first table with each row of the second table (a Cartesian product).
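A minimal sketch of a left join (names are illustrative):

SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;  -- all employees; dept_name is NULL when unmatched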

18
Q

Explain the different types of set operations?

A

Union - combines the result sets of two or more SELECT statements and removes duplicates.
Union All - same as UNION but keeps the duplicates.
Intersect - returns only the rows common to the result sets of the SELECT statements.
Minus - returns all rows from the first SELECT statement that are not returned by the second SELECT.

INTERSECT and MINUS are not supported in MySQL, so for INTERSECT you can use an INNER JOIN and for MINUS you can use a LEFT JOIN with an IS NULL check, as in the sketch below.
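For example, a MINUS emulation in MySQL might look like this (names are illustrative):

SELECT a.id
FROM table_a a
LEFT JOIN table_b b ON a.id = b.id
WHERE b.id IS NULL;  -- rows in table_a with no match in table_b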

19
Q

What is the Merge statement in SQL? How would you use it?

A

The MERGE statement can handle inserts, updates, and deletes in a single statement, without having to write separate logic for each. (It is supported in databases such as SQL Server and Oracle; MySQL does not have MERGE.)

MERGE Products AS TARGET
USING UpdatedProducts AS SOURCE
ON (TARGET.ProductID = SOURCE.ProductID)
-- When records are matched, update them if anything has changed
WHEN MATCHED AND (TARGET.ProductName <> SOURCE.ProductName OR TARGET.Rate <> SOURCE.Rate)
THEN UPDATE SET TARGET.ProductName = SOURCE.ProductName, TARGET.Rate = SOURCE.Rate
-- When no target record matches, insert the incoming record from the source table
WHEN NOT MATCHED BY TARGET
THEN INSERT (ProductID, ProductName, Rate) VALUES (SOURCE.ProductID, SOURCE.ProductName, SOURCE.Rate)
-- When a row exists in the target but not in the source, delete it from the target
WHEN NOT MATCHED BY SOURCE
THEN DELETE;

20
Q

What is referential integrity?

A

It is a database concept used to build and maintain logical relationships between tables in order to avoid logical corruption of data. Referential integrity is usually enforced through the combination of a primary key and a foreign key.

21
Q

What is the difference between Truncate, Drop, and Delete?

A

Delete - a command used to delete records from a table; it is not meant to remove the table itself from the database. You can delete all records (DELETE FROM tablename) or a group of records (DELETE FROM tablename WHERE col = something), and the operation can be rolled back.
Truncate - similar to DELETE in that it removes all records from a table, but it cannot take a WHERE clause. It is faster than DELETE because it does not need to scan every record.
Drop - used to delete the entire table from the database. The table structure is removed along with all of the data in the table.

22
Q

How do you find duplicates in a table?

A

To find duplicates in a table, you can use the GROUP BY clause in conjunction with the COUNT aggregate function and a HAVING clause to check whether the count for a particular column value is greater than 1.
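A minimal sketch (names are illustrative):

SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;  -- emails that appear more than once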

23
Q

How would you find the nth highest salary in a table?

A

Solution 1 (window function; employees is an illustrative table, N is the desired rank):
SELECT DISTINCT salary FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk FROM employees) t WHERE rnk = N;
Solution 2 (MySQL LIMIT, with offset N-1):
SELECT DISTINCT salary FROM employees ORDER BY salary DESC LIMIT N-1, 1;

24
Q

What is the main difference between SQL and MySQL?

A

SQL is a language for querying databases, while MySQL is an open-source database product. SQL is used for accessing, updating, and maintaining data in a database; MySQL is an RDBMS that allows users to keep the data in a database organized.

25
Q

What are ranking functions?

A

dense_rank():
This function assigns a rank to each row within a partition without gaps. Ranks are assigned in a consecutive manner: if there is a tie between values, the tied rows receive the same rank, and the next rank value is one greater than the previous rank.
rank():
This function assigns a rank to each row within a partition with gaps. Ranks are assigned in a non-consecutive manner: tied rows receive the same rank, and the next rank value is the previous rank plus the number of peers (duplicates).
percent_rank():
This function returns the relative rank of each row as a value between 0 and 1, computed as (rank - 1) / (number of rows in the partition - 1).
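A minimal sketch showing how the two ranking functions differ on tied values (names are illustrative):

SELECT salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,  -- e.g. 1, 2, 2, 4 (gap after the tie)
       DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk  -- e.g. 1, 2, 2, 3 (no gap)
FROM employees;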

26
Q

How do you add a column to an existing table?

A

You can use the ALTER TABLE statement, which lets you add, drop, and modify columns in a table.
ALTER TABLE table_name
ADD column_name datatype;

27
Q

What is a COMMIT?

A

COMMIT is a transaction control statement used to permanently save the changes made by the SQL statements in the current transaction to the database.
BEGIN TRANSACTION;
sequence of operations;
COMMIT TRANSACTION;

28
Q

Does a view contain any data?

A

Views do not contain or store any data; they only return a snapshot of the underlying tables based on the view's defining criteria.

29
Q

What is a stored procedure?

A

A stored procedure is prepared SQL code that you can save so it can be reused over and over again. If you have a SQL query that you write repeatedly, save it as a stored procedure and then just call it to execute it.
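A minimal sketch in MySQL syntax (procedure and table names are illustrative):

DELIMITER //
CREATE PROCEDURE GetActiveUsers()
BEGIN
    SELECT * FROM users WHERE active = 1;
END //
DELIMITER ;

CALL GetActiveUsers();  -- execute the saved procedure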

30
Q

How can you find distinct rows without using the distinct keyword?

A

You can use GROUP BY on all of the selected columns, or UNION the table with itself (UNION removes duplicate rows).
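A minimal sketch of both approaches (names are illustrative):

SELECT col1, col2 FROM t GROUP BY col1, col2;
-- or: UNION removes duplicate rows when a table is unioned with itself
SELECT col1, col2 FROM t UNION SELECT col1, col2 FROM t;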

31
Q

What is Hadoop?

A

Hadoop is an open-source framework that utilizes a cluster of networked computers to store and process large datasets. By using a cluster of multiple computers, Hadoop is able to analyze large datasets in parallel more quickly.

32
Q

What are the benefits of Apache Spark?

A

Speed - Spark can be 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk.
Ease of Use - Spark has easy-to-use APIs for operating on large datasets. This includes a collection of over 100 operators for transforming data and familiar data frame APIs for manipulating semi-structured data.
Unified Engine - Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.

33
Q

What is Spark?

A

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.

34
Q

What is HDFS?

A

HDFS stands for Hadoop Distributed File System, and it is the storage component of the Hadoop framework. HDFS is highly fault tolerant and designed to run on commodity hardware.

35
Q

What are the NameNode and DataNodes in HDFS?

A

HDFS has what is known as a master/slave architecture. An HDFS cluster consists of a single NameNode, also known as the master daemon. The NameNode manages the metadata of the actual data in the cluster and regulates access to the files by clients. The DataNodes are known as the slave daemons, and they are where the data is stored in HDFS. DataNodes handle reads, writes, and processing, and upon instruction from the NameNode they perform creation, replication, and deletion of data blocks.

36
Q

What are the characteristics of HDFS?

A

Fault tolerant - the Hadoop framework divides data into blocks and then creates multiple copies of each block on different machines in the cluster.
Scalable - whenever requirements increase, you can scale the cluster. Two scalability mechanisms are available in HDFS: vertical and horizontal scalability.
High Availability - in unfavorable situations such as the failure of a node, a user can still access their data from the other nodes, because duplicate copies of the blocks are present on other nodes in the HDFS cluster.

37
Q

What is YARN?

A

YARN stands for Yet Another Resource Negotiator, and it is the resource management component of Hadoop. YARN consists of a ResourceManager and NodeManagers in a master/slave architecture. The ResourceManager is the master daemon and is in charge of resource allocation for all the applications in the system, while the NodeManagers are the slave daemons that manage the bundles of resources called containers running on their machines and report back to the ResourceManager.

38
Q

What is MapReduce?

A

MapReduce is the processing component of Hadoop and offers massive scalability across hundreds or thousands of servers in a Hadoop cluster. The term “MapReduce” refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. The parallel processing on multiple machines greatly increases the speed of handling even petabytes of data.

39
Q

What are the differences between Spark and Hadoop?

A

Performance - in terms of performance, Spark is a lot faster than Hadoop because it utilizes RAM rather than reading and writing to disk.
Cost - Hadoop runs at a lower cost since it relies on any disk storage type for data processing. Spark runs at a higher cost because it relies on in-memory computations for real-time data processing, which requires it to use high quantities of RAM to spin up nodes.
Processing - though both platforms process data in a distributed environment, Hadoop is ideal for batch processing and linear data processing, while Spark is ideal for real-time processing and processing live unstructured data streams.

40
Q

What are managed tables vs external tables?

A

An external table is a table whose data is stored outside of the Hive warehouse; Hive does not manage its storage, so dropping the table removes only the metadata, not the data. Managed tables are Hive-owned tables where the entire lifecycle of the table's data is managed and controlled by Hive.
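A minimal HiveQL sketch (table name and location path are illustrative):

CREATE EXTERNAL TABLE logs (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/logs';  -- Hive reads the files in place and does not own them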

41
Q

Partitions vs Buckets?

A

With partitioning you can organize a large table into smaller datasets based on the values of a column. This organizes the data in a logical fashion, and when we query the partitioned table, Hive can skip all but the relevant sub-directories.
With bucketing, Hive tables or partitions are subdivided into buckets based on the hash of a column in the table, giving the data extra structure that can be used for more efficient queries.
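A minimal HiveQL sketch combining both (names are illustrative):

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)   -- one sub-directory per date value
CLUSTERED BY (id) INTO 32 BUCKETS;  -- hash of id determines the bucket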

42
Q

Pros and cons of partitioning?

A

Pros:

  • It distributes execution load horizontally.
  • Queries over a small slice of the data execute much faster. For example, searching the population of Vatican City returns quickly because only that partition is scanned instead of the entire world population.

Cons:

  • There is the possibility of creating too many small partitions, i.e. too many directories.
  • Partitioning is effective for low-volume data, but queries such as GROUP BY over a high volume of data still take a long time to execute.

43
Q

Pros and cons of bucketing?

A

Pros:
It provides faster query responses, like partitioning.
Because each bucket holds a roughly equal volume of data, map-side joins are quicker.

Cons:
You can define the number of buckets during table creation, but loading an equal volume of data into each bucket has to be done manually by the programmer.

44
Q

What are ORDER BY and SORT BY in Hive?

A

Hive's SORT BY and ORDER BY clauses are both used to fetch data in sorted order. The SORT BY clause sorts the data using multiple reducers, each sorting its own output, so the overall result is only partially sorted. ORDER BY sorts all of the data together using a single reducer, guaranteeing a total order.

Hence, ORDER BY takes a long time to execute on a large number of inputs.

45
Q

What are some Hive file formats?

A

Text File - the default Hive storage format. You can use the text format to interchange data with other client applications, and it is very common across most applications. Data is stored in lines, with each line being a record, and each line is terminated by a newline character (\n).
ORC File - ORC stands for Optimized Row Columnar. The ORC file format provides a highly efficient way to store data in a Hive table and was designed to overcome the limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data from large tables.
Parquet - Parquet is a column-oriented binary file format that is highly efficient for large-scale queries, especially queries scanning particular columns within a table. Parquet tables support compression codecs such as Snappy and gzip (currently Snappy by default).

46
Q

What is Parquet?

A

Columnar storage like Apache Parquet is designed for efficiency compared to row-based formats like CSV. When querying columnar storage, you can skip over the non-relevant data very quickly, so aggregation queries are less time-consuming than in row-oriented databases. This way of storing data translates into hardware savings and minimized latency for accessing data.

47
Q

What is an RDD?

A

RDD stands for Resilient Distributed Dataset. It is an immutable collection of elements of your data, partitioned across multiple nodes in the cluster, that can be operated on in parallel.

48
Q

What is a Dataframe?

A

Dataframes are like RDDs in that they are an immutable collection of data, but the data is instead organized into named columns, similar to a table in an RDBMS.

49
Q

What is a DataSet?

A

A Dataset is similar to a DataFrame, with the difference that a Dataset is also strongly typed; the type is dictated by a case class that you define in Scala.

50
Q

What are transformations? Examples?

A

Transformations take an RDD as input and produce one or more new RDDs as output; they are lazily evaluated. Examples: map, flatMap, filter, distinct.

51
Q

What are actions? Examples?

A

Actions take an RDD as input and trigger the actual computation, returning a result to the driver or writing it to storage. Examples: count, collect, take, foreach.
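A minimal Scala sketch showing transformations followed by an action (assumes a SparkContext named sc; values are illustrative):

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
val odds = rdd.filter(_ % 2 == 1).map(_ * 10)  // transformations: lazy, nothing executes yet
val n = odds.count()                           // action: triggers the computation, returns 3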

52
Q

What is a case class?

A

A Scala case class is like a regular class, except it is good for modeling immutable data. It is also useful in pattern matching; such a class has a default apply() method that handles object construction. A case class's constructor parameters are all vals by default, which means its instances are immutable.
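A minimal Scala sketch:

case class Person(name: String, age: Int)  // all fields are immutable vals
val p = Person("Ada", 36)                  // default apply() constructs the object, no `new` needed
p match {
  case Person(n, _) => println(n)          // pattern matching works out of the box
}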

53
Q

What is a broadcast variable? Benefits?

A

Broadcast variables are read-only shared variables that are cached and available on all nodes in the cluster. Using broadcast variables can improve performance by reducing the network traffic and data serialization required to execute a Spark application: because the variable is cached on every node, we do not need to ship the data to each node every time it is used.
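A minimal Scala sketch (assumes a SparkContext named sc; the lookup map is illustrative):

val countryNames = sc.broadcast(Map("US" -> "United States", "FR" -> "France"))  // cached on every node
val codes = sc.parallelize(Seq("US", "FR", "US"))
val names = codes.map(c => countryNames.value.getOrElse(c, "unknown")).collect()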

54
Q

What are accumulators?

A

Spark accumulators are shared variables that are only "added" to through an associative and commutative operation, and are used to perform counter or sum operations. Built-in accumulator types include Long, Double, and Collection.

val accum = sc.longAccumulator("SumAccumulator")
sc.parallelize(Array(1, 2, 3)).foreach(x => accum.add(x))

55
Q

What are client and cluster modes?

A

In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.

56
Q

When to use cluster and client mode?

A

If we submit an application from a machine that is far from the worker machines (for instance, submitting locally from a laptop), it is common to use cluster mode to minimize network latency between the driver and the executors.
Client mode is used mainly for testing and debugging purposes.

57
Q

What are repartition and coalesce?

A

repartition() is used to increase or decrease the number of RDD, DataFrame, or Dataset partitions, whereas coalesce() is used only to decrease the number of partitions, in a more efficient way. repartition() is an expensive operation because it performs a full shuffle of the data across partitions (using a hash partitioner by default), so try to minimize repartitioning as much as possible. coalesce() avoids a full shuffle: instead of creating all new partitions, it merges data into existing partitions.
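A minimal Scala sketch (assuming df is an existing DataFrame):

val df10 = df.repartition(10)  // full shuffle into 10 partitions
val df2  = df10.coalesce(2)    // merges into 2 partitions without a full shuffle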

58
Q

What is a DAG?

A

DAG stands for directed acyclic graph: a set of vertices and edges where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs. Spark automatically forms a DAG of the logical flow of operations, which helps minimize data shuffling, reduces the duration of computations, and increases the efficiency of the process.

59
Q

What is Spark submit? What are some common options when using spark-submit?

A

The spark-submit command is a utility to run or submit a Spark or PySpark application program (or job) to the cluster by specifying options and configurations along with a jar file.
Some common options are:
--deploy-mode, with cluster or client mode
--master, specifying the cluster manager you want to run on: yarn, standalone, local
--executor-cores, --executor-memory, and --driver-memory for cores and memory
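A typical invocation might look like this (class name, jar, and resource sizes are illustrative):

spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 4g --executor-cores 2 \
  --class com.example.Main app.jar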

60
Q

How do you load a file into an RDD, DataFrame, or Dataset?

A

RDD - val rdd = spark.sparkContext.textFile(path)

Other formats via the DataFrameReader: spark.read.csv(path), spark.read.text(path), spark.read.textFile(path), spark.read.parquet(path), spark.read.orc(path)

61
Q

Caching vs persisting?

A

cache() stores the data in memory only (the default storage level for RDDs); it is useful for reusing an RDD and for reducing the cost of recovery when an executor fails. With persist() you get to choose the storage level (memory, disk, serialized, replicated, or combinations).

62
Q

How to create DataFrames?

A

rdd.toDF()
spark.createDataFrame(rdd).toDF(columns:_*)
spark.read (csv, json, parquet, orc, ...)
from Hive tables or via spark.sql

63
Q

How to create an RDD?

A

sc.parallelize(collection, numPartitions)
spark.read followed by .rdd on the resulting DataFrame
from an existing RDD via transformations

64
Q

How to create a Dataset?

A

.toDS on a collection or RDD (requires import spark.implicits._)

.as[CaseClass] on a DataFrame
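A minimal Scala sketch tying the three together (assumes a SparkSession named spark):

import spark.implicits._

case class Person(name: String, age: Int)

val rdd = spark.sparkContext.parallelize(Seq(Person("Ada", 36)))  // RDD
val df  = rdd.toDF()                                              // DataFrame with named columns
val ds  = df.as[Person]                                           // strongly typed Dataset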