Advanced SQL Flashcards

1
Q

Generally, numeric column types in various SQL databases do not support commas or currency symbols. True/False?

A

True

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

How can we change data type of a column?

A

Use CAST or CONVERT to change the data type

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the syntax for casting data types?

A

CAST(column_name AS integer)
column_name::integer

SELECT
CAST(funding_total_usd AS varchar) AS funding_total_usd_string,
founded_at_clean::varchar AS founded_at_string
FROM tutorial.crunchbase_companies_clean_date

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

When you perform arithmetic on dates (such as subtracting one date from another), what is the data type of the result?

A

The results are often stored as the interval data type—a series of integers that represent a period of time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What does the below code do?
SELECT companies.permalink,
companies.founded_at_clean,
companies.founded_at_clean::timestamp +
INTERVAL ‘1 week’ AS plus_one_week
FROM tutorial.crunchbase_companies_clean_date companies
WHERE founded_at_clean IS NOT NULL

A

The interval is defined using plain-English terms like ‘10 seconds’ or ‘5 months’. using the INTERVAL function.
Also note that adding or subtracting a date column and an interval column results in another date column as in the above query.
The code adds a week to each date in the column “companies.founded_at_clean”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What does below code do?
SELECT companies.permalink,
companies.founded_at_clean,
NOW() - companies.founded_at_clean::timestamp AS founded_time_ago
FROM tutorial.crunchbase_companies_clean_date companies
WHERE founded_at_clean IS NOT NULL

A

You can add the current time (at the time you run the query) into your code using the NOW()function

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Does the below code work?

COUNT ( CASE WHEN acquisitions.acquired_at_cleaned-INTERVAL ‘3 years’ > companies.founded_at_clean::timestamp>= acquisitions.acquired_at_cleaned- INTERVAL ‘5 years’ THEN 1 ELSE NULL END ) AS between_3_to_5

(Using two interval data types for comparison)

A

No, it returns a syntax error. Using interval data types, work as below (one-sided)
COUNT (CASE WHEN acquisitions.acquired_at_cleaned <= companies.founded_at_clean::timestamp + INTERVAL ‘3 years’
THEN 1 ELSE NULL END) AS acquired_3_yrs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is data wrangling?

A

data wrangling (or munging) is the process of programmatically transforming data into a format that makes it easier to work with.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Does using a particular function, in many cases, change the data to the appropriate type in SQL? Explain your answer

A

Yes. LEFT, RIGHT, and TRIM are all used to select only certain elements of strings, but using them to select elements of a number or date will treat them as strings for the purpose of the function.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What does the left function do in SQL? how is the syntax?

A

You can use LEFT to pull a certain number of characters from the left side of a string and present them as a separate string. The syntax is LEFT (string, number of characters).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What does the below code do?

SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date
FROM tutorial.sf_crime_incidents_2014_01

A

The code, extracts only the first 10 characters on the left of the entries in date column

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What does the below code do?
SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, 17) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01

A

It extracts 10 characters from the left side and creates a new column
and then extracts 17 characters from the right side and creates another column

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What does the length function do?

A

The LENGTH function returns the length of a string.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What does the below code do?
SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, LENGTH(date) - 11) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01

A

it’s the same as:
SELECT incidnt_num,
date,
LEFT(date, 10) AS cleaned_date,
RIGHT(date, 17) AS cleaned_time
FROM tutorial.sf_crime_incidents_2014_01

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

When using functions within other functions, it’s important to remember that the outermost functions will be evaluated first, followed by the functions encapsulated by them. True/False?

A

False. When using functions within other functions, it’s important to remember that the 🎆 innermost 🎆 functions will be evaluated first, followed by the functions that encapsulate them.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What does the TRIM function do? How is its syntax?

A

The TRIM function is used to remove characters from the beginning and end of a string.

SELECT location,
TRIM(both ‘()’ FROM location)
FROM tutorial.sf_crime_incidents_2014_01

The TRIM function takes 3 arguments.
First, you have to specify whether you want to remove characters from the beginning (‘leading’), the end (‘trailing’), or both (‘both’, as used above).

Next you must specify all characters to be trimmed. Any characters included in the single quotes will be removed from both beginning, end, or both sides of the string.

Finally, you must specify the text you want to trim using FROM.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What does the function POSITION do? How is its syntax?

A

POSITION allows you to specify a substring, then returns a numerical value equal to the character number (counting from left) where that substring first appears in the target string.
For example, the following query will return the position of the character ‘A’ (case-sensitive) where it first appears in the descript field

SELECT incidnt_num,
descript,
POSITION(‘A’ IN descript) AS a_position
FROM tutorial.sf_crime_incidents_2014_01

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Is POSITION function case-sensitive?

A

Yes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is another function that does the same thing as POSITION? how is its syntax?

A

You can also use the STRPOS function to achieve the same results as POSITION—just replace IN with a comma and switch the order of the string and substring:

SELECT incidnt_num,
descript,
STRPOS(descript, ‘A’) AS a_position
FROM tutorial.sf_crime_incidents_2014_01

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Is STRPOS case-sensitive?

A

yes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What do you do If you want to look for a character regardless of its case?

A

You can make your entire string a single case string by using the UPPER or LOWER functions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What does SUBSTR do? What is its syntax?

A

LEFT and RIGHT both create substrings of a specified length, but they only do so starting from the sides of an existing string. If you want to start in the middle of a string, you can use SUBSTR.
The syntax is SUBSTR(string, starting character position, # of characters)

SELECT incidnt_num,
date,
SUBSTR(date, 4, 2) AS day
FROM tutorial.sf_crime_incidents_2014_01

This returns 01/31/2014 08:00:00 AM +0000 as 31 (note that positions start from 1 not 0, and the starting position is inclusive)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Do indices in SQL start from 1 or 0?

A

1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What does CONCAT do? How is its syntax?

A

You can combine strings from several columns together (and with hard-coded values) using CONCAT. Simply order the values you want to concatenate and separate them with commas. If you want to hard-code values, enclose them in single quotes. Here’s an example

SELECT incidnt_num,
day_of_week,
LEFT(date, 10) AS cleaned_date,
CONCAT(day_of_week, ‘, ‘, LEFT(date, 10)) AS day_and_date
FROM tutorial.sf_crime_incidents_2014_01

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

How can we use || instead of CONCAT?

A

Alternatively, you can use two pipe characters (||) to perform the same concatenation:

SELECT incidnt_num,
day_of_week,
LEFT(date, 10) AS cleaned_date,
day_of_week || ‘, ‘ || LEFT(date, 10) AS day_and_date
FROM tutorial.sf_crime_incidents_2014_01

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What is the use of LOWER, UPPER functions? How are their syntaxes?

A

They change the strings to upper or lower case.

SELECT incidnt_num,
address,
UPPER(address) AS address_upper,
LOWER(address) AS address_lower
FROM tutorial.sf_crime_incidents_2014_01

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What are 3 reasons for scrambled dates?

A

📌The data was manipulated in Excel at some point, and the dates were changed to MM/DD/YYYY format or another format that is not compliant with SQL’s strict standards.

📌The data was manually entered by someone who use whatever formatting convention he/she was most familiar with.

📌The date uses text (Jan, Feb, etc.) instead of numbers to record months.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is the difference between timestamp and date data types in SQL?

A

timestamp includes additional precision (hours, minutes, seconds)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Assume we have a date or timestamp field, how can we extract different parts of it? Demonstrate with code, for a date column named cleaned_date

A

Use EXTRACT to pull the pieces apart one-by-one:

SELECT cleaned_date,
EXTRACT(‘year’ FROM cleaned_date) AS year,
EXTRACT(‘month’ FROM cleaned_date) AS month,
EXTRACT(‘day’ FROM cleaned_date) AS day,
EXTRACT(‘hour’ FROM cleaned_date) AS hour,
EXTRACT(‘minute’ FROM cleaned_date) AS minute,
EXTRACT(‘second’ FROM cleaned_date) AS second,
EXTRACT(‘decade’ FROM cleaned_date) AS decade,
EXTRACT(‘dow’ FROM cleaned_date) AS day_of_week
FROM tutorial.sf_crime_incidents_cleandate

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

How can we round dates to the beginning of a certain unit of measurement? For example, we want to round dates based on year, so that every date in the table for each year becomes January 1st of that year?

A

You can also round dates to the nearest unit of measurement. This is particularly useful if you don’t care about an individual date, but do care about the week (or month, or quarter) that it occurred in. The DATE_TRUNC function rounds a date to whatever precision you specify. The value displayed is the first value in that period. So when you DATE_TRUNC by year, any value in that year will be listed as January 1st of that year:

SELECT cleaned_date,
DATE_TRUNC(‘year’ , cleaned_date) AS year,
DATE_TRUNC(‘month’ , cleaned_date) AS month,
DATE_TRUNC(‘week’ , cleaned_date) AS week,
DATE_TRUNC(‘day’ , cleaned_date) AS day,
DATE_TRUNC(‘hour’ , cleaned_date) AS hour,
DATE_TRUNC(‘minute’ , cleaned_date) AS minute,
DATE_TRUNC(‘second’ , cleaned_date) AS second,
DATE_TRUNC(‘decade’ , cleaned_date) AS decade
FROM tutorial.sf_crime_incidents_cleandate

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

How can we get today’s date or time using SQL? Does it require a FROM clause?

A

You can instruct your query to pull the local date and time at the time the query is run using any number of functions. Interestingly, you can run them 📍 without 📍a FROM clause:

SELECT CURRENT_DATE AS date,
CURRENT_TIME AS time,
CURRENT_TIMESTAMP AS timestamp,
LOCALTIME AS localtime,
LOCALTIMESTAMP AS localtimestamp,
NOW() AS now

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

How can we make current time appear in a different time zone in SQL?

A

You can make a time appear in a different time zone using AT TIME ZONE:

SELECT CURRENT_TIME AS time,
CURRENT_TIME AT TIME ZONE ‘PST’ AS time_pst

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

How can we fill null values in SQL?

A

Occasionally, you will end up with a dataset that has some nulls that you’d prefer to contain actual values. This happens frequently in numerical data (displaying nulls as 0 is often preferable), and when performing outer joins that result in some unmatched rows. In cases like this, you can use COALESCE to replace the null values:

SELECT incidnt_num,
descript,
COALESCE(descript, ‘No Description’)
FROM tutorial.sf_crime_incidents_cleandate
ORDER BY descript DESC

34
Q

What are sub-queries?

A

Subqueries (also known as inner queries or nested queries) are a tool for performing operations in multiple steps. For example, if you wanted to take the sums of several columns, then average all of those values, you’d need to do each aggregation in a distinct step.

Subqueries can be used in several places within a query, but it’s easiest to start with the FROM statement. Here’s an example of a basic subquery:

SELECT sub.*
FROM (
SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE day_of_week = ‘Friday’
) sub
WHERE sub.resolution = ‘NONE’

35
Q

Once the inner query runs, the outer query will run using the results from the inner query as its underlying table. True/False

A

True

36
Q

Subqueries are not required to have names, which are added after parentheses the same way you would add an alias to a normal table. True/False?

A

False. they do need an alias

37
Q

Can we use sub-queries in conditional logic ( for example … where date=(sub-query)? If yes, how?

A

You can use subqueries in conditional logic (in conjunction with WHERE, JOIN/ON, or CASE). The following query returns all the entries from the earliest date in the dataset (theoretically—the poor formatting of the date column actually makes it return the value that sorts first alphabetically):

SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE Date = (SELECT MIN(date)
FROM tutorial.sf_crime_incidents_2014_01
)

38
Q

What does this code do?

SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE Date IN (SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date
LIMIT 5
)

A

It chooses 5 first dates for the conditional logic
Note🧨 that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the IN case) rather than as a table.

39
Q

What is the difference between the below codes?

SELECT *
FROM tutorial.sf_crime_incidents_2014_01
WHERE Date IN (SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date
LIMIT 5
)

SELECT *
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN ( SELECT date
FROM tutorial.sf_crime_incidents_2014_01
ORDER BY date
LIMIT 5
) sub
ON incidents.date = sub.date

A

Nothing, they both do the same thing.

40
Q

What is the advantage of using sub-query joins over using them in conditional logic like where clause?

A

Sub-query joins can be particularly useful when combined with aggregations. When you join, the requirements for your sub-query output aren’t as stringent as when you use the WHERE clause. For example, your inner query can output multiple results. (multiple columns)

41
Q

What does below code do?

SELECT incidents.,
sub.count AS total_incidents_in_category
FROM tutorial.sf_crime_incidents_2014_01 incidents
JOIN (
SELECT category,
COUNT(
) AS count
FROM tutorial.sf_crime_incidents_2014_01
GROUP BY 1
ORDER BY 2
LIMIT 3
) sub
ON sub.category = incidents.category

A

First we start by analyzing the inner query, it returns the categories with the least count. When joined by the whole table, we can see the rows belonging to those categories. It prevents the need for writing multiple queries.

42
Q

Why is this code:

SELECT COALESCE(acquisitions.month, investments.month) AS month,
acquisitions.companies_acquired,
investments.companies_rec_investment
FROM (
SELECT acquired_month AS month,
COUNT(DISTINCT company_permalink) AS companies_acquired
FROM tutorial.crunchbase_acquisitions
GROUP BY 1
) acquisitions

FULL JOIN (
SELECT funded_month AS month,
COUNT(DISTINCT company_permalink) AS companies_rec_investment
FROM tutorial.crunchbase_investments
GROUP BY 1
)investments

ON acquisitions.month = investments.month  ORDER BY 1 DESC

Better than this code:🎭

SELECT COALESCE(acquisitions.acquired_month, investments.funded_month) AS month,
COUNT(DISTINCT acquisitions.company_permalink) AS companies_acquired,
COUNT(DISTINCT investments.company_permalink) AS investments
FROM tutorial.crunchbase_acquisitions acquisitions
FULL JOIN tutorial.crunchbase_investments investments
ON acquisitions.acquired_month = investments.funded_month
GROUP BY 1

A

Full joining two already big datasets and then following it with count(distinct) terribly slows down the process. In cases like this, it’s better to create two sub-queries and performing count(distinct) there an then joining them.

43
Q

What does the below code do?

SELECT COUNT(*) AS total_rows
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1

     UNION ALL

    SELECT *
      FROM tutorial.crunchbase_investments_part2
   ) sub
A

It’s certainly not uncommon for a dataset to come split into several parts, especially if the data passed through Excel at any point (Excel can only handle ~1M rows per spreadsheet). The two tables used above can be thought of as different parts of the same dataset—what you’d almost certainly like to do is perform operations on the entire combined dataset rather than on the individual parts. You can do this by using a sub-query

44
Q

COUNT(DISTINCT) can take a very long time to process if the dataset is big. True/False?

A

True

45
Q

It’s certainly not uncommon for a dataset to come split into several parts, especially if the data passed through Excel at any point, what you’d almost certainly like to do is perform operations on the entire combined dataset rather than on the individual parts. You can do this by using a …

A

subquery

SELECT COUNT(*) AS total_rows
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1

     UNION ALL

    SELECT *
      FROM tutorial.crunchbase_investments_part2
   ) sub
46
Q

It’s certainly not uncommon for a dataset to come split into several parts, especially if the data passed through Excel at any point, what you’d almost certainly like to do is perform operations on the entire combined dataset rather than on the individual parts. You can do this by using a …

A

subquery

SELECT COUNT(*) AS total_rows
FROM (
SELECT *
FROM tutorial.crunchbase_investments_part1

     UNION ALL

    SELECT *
      FROM tutorial.crunchbase_investments_part2
   ) sub
47
Q

What does a window function do?

A

A window function performs a calculation across a set of table rows that are somehow related to the current row.

48
Q

What is the difference between window functions and aggregation functions?

A

Unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.

49
Q

What does the below code do? and can we change the places of partition by and order by?
SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’

A

First question: The above query uses window functions and groups and orders the query by start_terminal. Within each value of start_terminal, it is ordered by start_time, and the running total sums across the current row and all previous rows of duration_seconds.
Second question: No

50
Q

What does the below code do?

SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER (PARTITION BY start_terminal) AS start_terminal_sum,
(duration_seconds/SUM(duration_seconds) OVER (PARTITION BY start_terminal))*100 AS pct_of_total_time
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
ORDER BY 1, 4 DESC

A

shows the duration of each ride as a percentage of the total time accrued by riders from each start_terminal

51
Q

When using window functions, you can apply the same aggregates that you would under normal circumstances, what are they?

A

SUM, COUNT, and AVG.

52
Q

What does ORDER BY do in window functions?

A

It simply orders by the designated column(s) the same way the ORDER BY clause would, except that it treats every partition as separate.

53
Q

Can we say we use window function when we want to expand each group and not have just one outcome?

A

Yes. Unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.

54
Q

What does the below code do?

SELECT end_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY end_terminal ORDER BY duration_seconds DESC)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’

A

Shows a running total of the duration of bike rides grouped by end_terminal, and with ride duration sorted in descending order.

55
Q

What does ROW_NUMBER() do?

A

ROW_NUMBER() does just what it sounds like—displays the number of a given row. It starts are 1 and numbers the rows according to the ORDER BY part of the window statement. ROW_NUMBER() does not require you to specify a variable within the parentheses

56
Q

What is the difference between rank() and row_number()?

A

RANK() is slightly different from ROW_NUMBER(). If you order by start_time, for example, it might be the case that some terminals have rides with two identical start times. In this case, they are given the same rank, whereas ROW_NUMBER() gives them different numbers.

57
Q

What’s the difference between rank() and rank_dense()?

A

RANK() would give the identical rows a rank of 2, then skip ranks 3 and 4, so the next result would be 5
DENSE_RANK() would still give all the identical rows a rank of 2, but the following row would be 3—no ranks would be skipped.

58
Q

What does the below code do?

SELECT *
FROM (
SELECT start_terminal,
start_time,
duration_seconds AS trip_time,
RANK() OVER (PARTITION BY start_terminal ORDER BY duration_seconds DESC) AS rank
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
) sub
WHERE sub.rank <= 5

A

Shows the 5 longest rides from each starting terminal, ordered by terminal, and longest to shortest rides within each terminal. Limit to rides that occurred before Jan. 8, 2012.

59
Q

What does the below code do?

SELECT start_terminal,
duration_seconds,
NTILE(4) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quartile,
NTILE(5) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quintile,
NTILE(100) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
ORDER BY start_terminal, duration_seconds

A

Computes in which quantile each duration second is. NTILE(n) is for calculating quantiles

60
Q

What happens if we use NTILE(100) on a group with 20 members?

A

It won’t work properly, it just counts each number from 1 to 20, the n in NTILE(n) should be smaller than the number of members in a group.
If you’re working with very small windows, keep this in mind and consider using quartiles or similarly small bands.

61
Q

What does the below code do?
SELECT
duration_seconds,
NTILE(100) OVER
(ORDER BY duration_seconds)
AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’

A

Shows only the duration of the trip and the percentile into which that duration falls (across the entire dataset—not partitioned by terminal).

62
Q

Considering the below code:
SELECT duration_seconds,
SUM(duration_seconds) OVER (ORDER BY duration_seconds) AS running_total
FROM tutorial.dc_bikeshare_q1_2012
Is the output going to show the same number for all the rows with the same duration_seconds? If yes, what is this number showing?

A

Yes, it shows the sum of the rows with the same number for duration second.
It’ll be something like:

0 0
0 0
1 2
1 2
2 126
2 126
2 126
2 126
2 126
2 126
2 126

63
Q

What do LAG and LEAD window functions do?

A

LAG pulls from previous rows and LEAD pulls from following rows.
LAG shifts one down
LEAD shifts one up

64
Q

What does the below code do?
SELECT start_terminal,
duration_seconds,
LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lag,
LEAD(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds) AS lead
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
ORDER BY start_terminal, duration_seconds

A

LAG shifts the duration_seconds in each start_terminal group one cell down
LEAD shifts the duration_seconds in each start_terminal group one cell up
returns something like this:

31000 74 277
31000 277 74 291

65
Q

After using LAG or LEAD window functions, we end up with null values in some rows, how can we get rid of them?

A

If you’d like to make the results a bit cleaner, you can wrap it in an outer query to remove nulls.

SELECT *
FROM (
SELECT start_terminal,
duration_seconds,
duration_seconds -LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
ORDER BY start_terminal, duration_seconds
) sub
WHERE sub.difference IS NOT NULL

66
Q

What is window alias?

A

If you’re planning to write several window functions in to the same query, using the same window, you can create an alias.

67
Q

How can we rewrite the below code using window alias?

SELECT start_terminal,
duration_seconds,
NTILE(4) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quartile,
NTILE(5) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS quintile,
NTILE(100) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
ORDER BY start_terminal, duration_seconds

A

SELECT start_terminal,
duration_seconds,
NTILE(4) OVER ntile_window AS quartile,
NTILE(5) OVER ntile_window AS quintile,
NTILE(100) OVER ntile_window AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
WINDOW ntile_window AS
(PARTITION BY start_terminal ORDER BY duration_seconds)
ORDER BY start_terminal, duration_seconds

68
Q

The WINDOW clause, if included, should always come after the WHERE clause. True/False?

A

True

69
Q

What are the window functions introduced in Mode?

A

SUM, COUNT, and AVG
ROW_NUMBER()
RANK() and DENSE_RANK()
NTILE
LAG and LEAD
Defining a window alias

70
Q

What are 3 of the controllable and 2 of the uncontrollable high-level things that will affect the number of calculations you need to make, and therefore your queries’ runtime?

A

Table size ✨: If your query hits one or more tables with millions of rows or more, it could affect performance.

Joins ✨: If your query joins two tables in a way that substantially increases the row count of the result set, your query is likely to be slow.

Aggregations ✨: Combining multiple rows to produce a result requires more computation than simply retrieving those rows.

Query runtime is also dependent on some things that you can’t really control related to the database itself:

Other users running queries ✨: The more queries running concurrently on a database, the more the database must process at a given time and the slower everything will run. It can be especially bad if others are running particularly resource-intensive queries that fulfill some of the above criteria.

Database software and optimization ✨: This is something you probably can’t control, but if you know the system you’re using, you can work within its bounds to make your queries more efficient.

71
Q

Filtering the data to include only the observations you need can dramatically improve query speed. True/False

A

True

72
Q

Why is it better to do exploratory analysis on a subset of data?

A

Keep in mind that you can always perform exploratory analysis on a subset of data, refine your work into a final query, then remove the limitation and run your work across the entire dataset. Mahsa: This helps with saving time and resources. This is why Mode enforces a LIMIT clause by default

73
Q

What should we do if we want to limit the dataset before performing the count?
(Since limiting happens after aggregation functions normally)

A

If you want to limit the dataset before performing the count (to speed things up), try doing it in a subquery:
Note 🧨: Using LIMIT this will dramatically alter your results, so you should use it to TEST query logic, but not to get actual results.

SELECT COUNT(*)
FROM (
SELECT *
FROM benn.sample_event_table
LIMIT 100
) sub

74
Q

What should you try, when your query joins two tables in a way that substantially increases the row count of the result set, hence it runs slowly?

A

It’s better to reduce table sizes before joining them. Meaning we can use sub-queries to filter tables, then join them, like below:

SELECT teams.conference,
sub.*
FROM (
SELECT players.school_name,
COUNT(*) AS players
FROM benn.college_football_players players
GROUP BY 1
) sub
JOIN benn.college_football_teams teams
ON teams.school_name = sub.school_name

Instead of

SELECT teams.conference AS conference,
players.school_name,
COUNT(1) AS players
FROM benn.college_football_players players
JOIN benn.college_football_teams teams
ON teams.school_name = players.school_name
GROUP BY 1,2

75
Q

What does EXPLAIN clause do?

A

You can add EXPLAIN at the beginning of any (working) query to get a sense of how long it will take. It’s not perfectly accurate, but it’s a useful tool. Try running this:

EXPLAIN
SELECT *
FROM benn.sample_event_table
WHERE event_date >= ‘2014-03-01’
AND event_date < ‘2014-04-01’
LIMIT 100

To clarify, this is most useful if you run EXPLAIN on a query, modify the steps that are expensive, then run EXPLAIN again to see if the cost is reduced. Finally, the LIMIT clause is executed last and is really cheap to run (24.65 vs 147.87 for the WHERE clause).

76
Q

Why do we pivot data?

A

presentation or charting

77
Q

How do we start making pivot tables in SQL?

A

It can be helpful to create the sub-query and select all columns from it before starting to make transformations, then use CASE statement to create new columns

78
Q

What does the below code do?

SELECT years.*,
earthquakes.magnitude,
CASE year
WHEN 2000 THEN year_2000
WHEN 2001 THEN year_2001
WHEN 2002 THEN year_2002
WHEN 2003 THEN year_2003
WHEN 2004 THEN year_2004
WHEN 2005 THEN year_2005
WHEN 2006 THEN year_2006
WHEN 2007 THEN year_2007
WHEN 2008 THEN year_2008
WHEN 2009 THEN year_2009
WHEN 2010 THEN year_2010
WHEN 2011 THEN year_2011
WHEN 2012 THEN year_2012
ELSE NULL END
AS number_of_earthquakes
FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years

A

The first thing to do here is to create a table that lists all the columns from the original table as rows in a new table.

SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years

Once you’ve got this, you can cross join it with the worldwide_earthquakes table to create an expanded view:

SELECT years.,
earthquakes.

FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years

Notice that each row in the worldwide_earthquakes is replicated 13 times. The last thing to do is to fix this using a CASE statement that pulls data from the correct column in the worldwide_earthquakes table given the value in the year column.

SELECT years.*,
earthquakes.magnitude,
CASE year
WHEN 2000 THEN year_2000
WHEN 2001 THEN year_2001
WHEN 2002 THEN year_2002
WHEN 2003 THEN year_2003
WHEN 2004 THEN year_2004
WHEN 2005 THEN year_2005
WHEN 2006 THEN year_2006
WHEN 2007 THEN year_2007
WHEN 2008 THEN year_2008
WHEN 2009 THEN year_2009
WHEN 2010 THEN year_2010
WHEN 2011 THEN year_2011
WHEN 2012 THEN year_2012
ELSE NULL END
AS number_of_earthquakes
FROM tutorial.worldwide_earthquakes earthquakes
CROSS JOIN (
SELECT year
FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
(2007),(2008),(2009),(2010),(2011),(2012)) v(year)
) years

79
Q

What does the below code do?

SELECT years.*,
earthquakes.magnitude,
CASE year
WHEN 2000 THEN year_2000
WHEN 2001 THEN year_2001
WHEN 2002 THEN year_2002
WHEN 2003 THEN year_2003
WHEN 2004 THEN year_2004
WHEN 2005 THEN year_2005
WHEN 2006 THEN year_2006
WHEN 2007 THEN year_2007
WHEN 2008 THEN year_2008
WHEN 2009 THEN year_2009
WHEN 2010 THEN year_2010
WHEN 2011 THEN year_2011
WHEN 2012 THEN year_2012
ELSE NULL END
AS number_of_earthquakes

A

“CASE year” goes down the column “year” and where it has value of 2000, it’s replaced with the corresponding value of that row in “year_2000” column (which is the number of earthquakes in year_2000) and so on till the end, then it’s given a column name and a new column is created that for each year and each magnitude shows the number of earthquakes. It’s like below and it’s for Pivoting columns to rows in the dataset:
year magnitude # of earthquakes
2000 8.0 to 9.9 1
2001 8.0 to 9.9 1
2002 8.0 to 9.9 0
2003 8.0 to 9.9 1
2004 8.0 to 9.9 2

80
Q

What does below code do? what does EXISTS clause do?
SELECT SupplierName
FROM Suppliers
WHERE EXISTS (SELECT ProductName FROM Products WHERE Products.SupplierID = Suppliers.supplierID AND Price < 20);

A

The SQL statement returns TRUE and lists the suppliers with a product price less than 20.

The EXISTS operator is used to test for the existence of any record in a subquery.

The EXISTS operator returns TRUE if the subquery returns one or more records.