Advanced SQL Flashcards

Question

How can we use || instead of CONCAT?

Answer 1

Alternatively, you can use two pipe characters (||) to perform the same concatenation:

    SELECT incidnt_num,
           day_of_week,
           LEFT(date, 10) AS cleaned_date,
           day_of_week || ', ' || LEFT(date, 10) AS day_and_date
      FROM tutorial.sf_crime_incidents_2014_01
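
As a runnable illustration of the || operator, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database. The incidents table and its two rows are hypothetical, and SUBSTR(date, 1, 10) replaces LEFT(date, 10) because SQLite has no LEFT function.

```python
import sqlite3

# Hypothetical two-row stand-in for tutorial.sf_crime_incidents_2014_01.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (incidnt_num INTEGER, day_of_week TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?)",
    [
        (1, "Friday", "01/31/2014 08:00:00 AM +0000"),
        (2, "Monday", "01/27/2014 08:00:00 AM +0000"),
    ],
)

# || joins the day name, a literal separator, and the first 10 characters
# of the date string (SUBSTR stands in for LEFT in SQLite).
concat_rows = conn.execute(
    """
    SELECT incidnt_num,
           day_of_week || ', ' || SUBSTR(date, 1, 10) AS day_and_date
      FROM incidents
     ORDER BY incidnt_num
    """
).fetchall()

for row in concat_rows:
    print(row)
```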

Answer 2

They change the strings to upper or lower case.

    SELECT incidnt_num,
           address,
           UPPER(address) AS address_upper,
           LOWER(address) AS address_lower
      FROM tutorial.sf_crime_incidents_2014_01

Answer 3

📌 The data was manipulated in Excel at some point, and the dates were changed to MM/DD/YYYY format or another format that is not compliant with SQL's strict standards.
📌 The data was manually entered by someone who used whatever formatting convention they were most familiar with.
📌 The date uses text (Jan, Feb, etc.) instead of numbers to record months.

Answer 4

timestamp includes additional precision (hours, minutes, seconds)

Answer 5

Use EXTRACT to pull the pieces apart one-by-one:

    SELECT cleaned_date,
           EXTRACT('year'   FROM cleaned_date) AS year,
           EXTRACT('month'  FROM cleaned_date) AS month,
           EXTRACT('day'    FROM cleaned_date) AS day,
           EXTRACT('hour'   FROM cleaned_date) AS hour,
           EXTRACT('minute' FROM cleaned_date) AS minute,
           EXTRACT('second' FROM cleaned_date) AS second,
           EXTRACT('decade' FROM cleaned_date) AS decade,
           EXTRACT('dow'    FROM cleaned_date) AS day_of_week
      FROM tutorial.sf_crime_incidents_cleandate

Answer 6

You can also round dates down to the start of a unit of measurement. This is particularly useful if you don't care about an individual date, but do care about the week (or month, or quarter) that it occurred in. The DATE_TRUNC function truncates a date to whatever precision you specify; it never rounds up. The value displayed is the first value in that period, so when you DATE_TRUNC by year, any value in that year will be listed as January 1st of that year:

    SELECT cleaned_date,
           DATE_TRUNC('year',   cleaned_date) AS year,
           DATE_TRUNC('month',  cleaned_date) AS month,
           DATE_TRUNC('week',   cleaned_date) AS week,
           DATE_TRUNC('day',    cleaned_date) AS day,
           DATE_TRUNC('hour',   cleaned_date) AS hour,
           DATE_TRUNC('minute', cleaned_date) AS minute,
           DATE_TRUNC('second', cleaned_date) AS second,
           DATE_TRUNC('decade', cleaned_date) AS decade
      FROM tutorial.sf_crime_incidents_cleandate

Answer 7

You can instruct your query to pull the local date and time at the time the query is run using any number of functions. Interestingly, you can run them 📍 without 📍 a FROM clause:

    SELECT CURRENT_DATE AS date,
           CURRENT_TIME AS time,
           CURRENT_TIMESTAMP AS timestamp,
           LOCALTIME AS localtime,
           LOCALTIMESTAMP AS localtimestamp,
           NOW() AS now

Answer 8

You can make a time appear in a different time zone using AT TIME ZONE:

    SELECT CURRENT_TIME AS time,
           CURRENT_TIME AT TIME ZONE 'PST' AS time_pst

Answer 9

Occasionally, you will end up with a dataset that has some nulls that you'd prefer to contain actual values. This happens frequently in numerical data (displaying nulls as 0 is often preferable), and when performing outer joins that result in some unmatched rows. In cases like this, you can use COALESCE to replace the null values:

    SELECT incidnt_num,
           descript,
           COALESCE(descript, 'No Description')
      FROM tutorial.sf_crime_incidents_cleandate
     ORDER BY descript DESC
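
To see COALESCE replace a null, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The incidents table, its rows, and the descript_filled alias are hypothetical.

```python
import sqlite3

# Hypothetical three-row table; row 2 has a NULL description.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (incidnt_num INTEGER, descript TEXT)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?)",
    [(1, "GRAND THEFT"), (2, None), (3, "BATTERY")],
)

# COALESCE returns its first non-NULL argument, so the fallback text is
# used only where descript is NULL.
coalesce_rows = conn.execute(
    """
    SELECT incidnt_num,
           COALESCE(descript, 'No Description') AS descript_filled
      FROM incidents
     ORDER BY incidnt_num
    """
).fetchall()

print(coalesce_rows)
```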

Answer 10

Subqueries (also known as inner queries or nested queries) are a tool for performing operations in multiple steps. For example, if you wanted to take the sums of several columns, then average all of those values, you'd need to do each aggregation in a distinct step. Subqueries can be used in several places within a query, but it's easiest to start with the FROM clause. Here's an example of a basic subquery:

    SELECT sub.*
      FROM (
            SELECT *
              FROM tutorial.sf_crime_incidents_2014_01
             WHERE day_of_week = 'Friday'
           ) sub
     WHERE sub.resolution = 'NONE'
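
The two-step evaluation can be tried on toy data. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in database; the incidents table and its four rows are hypothetical.

```python
import sqlite3

# Hypothetical stand-in for tutorial.sf_crime_incidents_2014_01.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (incidnt_num INTEGER, day_of_week TEXT, resolution TEXT)"
)
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?)",
    [
        (1, "Friday", "NONE"),
        (2, "Friday", "ARREST, BOOKED"),
        (3, "Monday", "NONE"),
        (4, "Friday", "NONE"),
    ],
)

# The inner query keeps only Fridays; the outer query then filters that
# intermediate result (aliased as sub) on resolution.
friday_unresolved = conn.execute(
    """
    SELECT sub.*
      FROM (
            SELECT *
              FROM incidents
             WHERE day_of_week = 'Friday'
           ) sub
     WHERE sub.resolution = 'NONE'
     ORDER BY sub.incidnt_num
    """
).fetchall()

print(friday_unresolved)
```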

Answer 11

False. They do need an alias.

Answer 12

You can use subqueries in conditional logic (in conjunction with WHERE, JOIN/ON, or CASE). The following query returns all the entries from the earliest date in the dataset (theoretically; the poor formatting of the date column actually makes it return the value that sorts first alphabetically):

    SELECT *
      FROM tutorial.sf_crime_incidents_2014_01
     WHERE date = (SELECT MIN(date)
                     FROM tutorial.sf_crime_incidents_2014_01
                  )

Answer 13

It selects the first 5 dates for the conditional logic.

Note 🧨 that you should not include an alias when you write a subquery in a conditional statement. This is because the subquery is treated as an individual value (or set of values in the IN case) rather than as a table.

Answer 14

Nothing, they both do the same thing.

Answer 15

Sub-query joins can be particularly useful when combined with aggregations. When you join, the requirements for your sub-query output aren't as stringent as when you use the WHERE clause. For example, your inner query can output multiple results (multiple columns).

Answer 16

Start by analyzing the inner query: it returns the categories with the lowest counts. When it is joined to the whole table, we can see the rows belonging to those categories. This avoids the need to write multiple queries.

Answer 17

Full joining two already large datasets and then running COUNT(DISTINCT) on the result slows the query down badly. In cases like this, it's better to create two sub-queries, perform COUNT(DISTINCT) inside each, and then join the results.

Answer 18

It's certainly not uncommon for a dataset to come split into several parts, especially if the data passed through Excel at any point (Excel can only handle ~1M rows per spreadsheet). The two tables used above can be thought of as different parts of the same dataset. What you'd almost certainly like to do is perform operations on the entire combined dataset rather than on the individual parts. You can do this by using a sub-query.

Answer 19

Subquery:

    SELECT COUNT(*) AS total_rows
      FROM (
            SELECT *
              FROM tutorial.crunchbase_investments_part1

             UNION ALL

            SELECT *
              FROM tutorial.crunchbase_investments_part2
           ) sub

Answer 20

Subquery:

    SELECT COUNT(*) AS total_rows
      FROM (
            SELECT *
              FROM tutorial.crunchbase_investments_part1

             UNION ALL

            SELECT *
              FROM tutorial.crunchbase_investments_part2
           ) sub
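
Counting over a UNION ALL subquery can be checked on toy data. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in database; the investments_part1 and investments_part2 tables and their rows are hypothetical.

```python
import sqlite3

# Two hypothetical parts of one dataset; ("Acme", 100) appears in both.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investments_part1 (company TEXT, amount INTEGER)")
conn.execute("CREATE TABLE investments_part2 (company TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO investments_part1 VALUES (?, ?)",
    [("Acme", 100), ("Globex", 250)],
)
conn.executemany(
    "INSERT INTO investments_part2 VALUES (?, ?)",
    [("Acme", 100), ("Initech", 75), ("Umbrella", 300)],
)

# UNION ALL stacks the parts without removing duplicates, and the outer
# query counts the combined rows.
total_rows = conn.execute(
    """
    SELECT COUNT(*) AS total_rows
      FROM (
            SELECT * FROM investments_part1
             UNION ALL
            SELECT * FROM investments_part2
           ) sub
    """
).fetchone()[0]

print(total_rows)
```

Because UNION ALL keeps the repeated ("Acme", 100) row, the count equals the sum of the two tables' row counts.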

Answer 21

A window function performs a calculation across a set of table rows that are somehow related to the current row.

Answer 22

Unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.

Answer 23

First question: The above query uses a window function that partitions the rows by start_terminal. Within each value of start_terminal, rows are ordered by start_time, and the running total sums duration_seconds across the current row and all previous rows.

Second question: No
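
The query this card refers to is not reproduced here, so the following is only a sketch consistent with the description: a running total partitioned by start_terminal and ordered by start_time. It uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer); the rides table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical stand-in for the bike share rides table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rides (start_terminal INTEGER, start_time TEXT, duration_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [
        (31000, "2012-01-01 10:00:00", 100),
        (31000, "2012-01-02 09:00:00", 50),
        (31001, "2012-01-01 08:00:00", 200),
        (31001, "2012-01-03 08:00:00", 25),
    ],
)

# The sum restarts for each start_terminal and accumulates in start_time order.
running_rows = conn.execute(
    """
    SELECT start_terminal,
           duration_seconds,
           SUM(duration_seconds) OVER
             (PARTITION BY start_terminal ORDER BY start_time) AS running_total
      FROM rides
     ORDER BY start_terminal, start_time
    """
).fetchall()

for row in running_rows:
    print(row)
```

All four input rows come back as four output rows; the window function adds a column without collapsing rows.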

Answer 24

Shows the duration of each ride as a percentage of the total time accrued by riders from each start_terminal.

Answer 25

SUM, COUNT, and AVG.

Answer 26

It simply orders by the designated column(s) the same way the ORDER BY clause would, except that it treats every partition as separate.

Answer 27

Yes. Unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities.

Answer 28

Shows a running total of the duration of bike rides grouped by end_terminal, and with ride duration sorted in descending order.

Answer 29

ROW_NUMBER() does just what it sounds like: it displays the number of a given row. It starts at 1 and numbers the rows according to the ORDER BY part of the window statement. ROW_NUMBER() does not require you to specify a variable within the parentheses.

Answer 30

RANK() is slightly different from ROW_NUMBER(). If you order by start_time, for example, it might be the case that some terminals have rides with two identical start times. In this case, they are given the same rank, whereas ROW_NUMBER() gives them different numbers.

Answer 31

RANK() would give the identical rows a rank of 2, then skip ranks 3 and 4, so the next result would be 5.

DENSE_RANK() would still give all the identical rows a rank of 2, but the following row would be 3; no ranks would be skipped.
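
The three numbering functions can be compared side by side on a tie. This is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25 or newer); the rides table is hypothetical and holds three rides with the same start_time.

```python
import sqlite3

# Hypothetical rides: one early ride, three tied rides, one late ride.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (start_time TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?)",
    [("08:00",), ("09:00",), ("09:00",), ("09:00",), ("10:00",)],
)

# All three functions share the same ordering; only tie handling differs.
rank_rows = conn.execute(
    """
    SELECT start_time,
           ROW_NUMBER() OVER (ORDER BY start_time) AS row_num,
           RANK()       OVER (ORDER BY start_time) AS rank,
           DENSE_RANK() OVER (ORDER BY start_time) AS dense_rank
      FROM rides
     ORDER BY start_time, row_num
    """
).fetchall()

for row in rank_rows:
    print(row)
```

The tied rows get row numbers 2, 3 and 4 but share rank 2; the last row is rank 5 under RANK() and 3 under DENSE_RANK().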

Answer 32

Shows the 5 longest rides from each starting terminal, ordered by terminal, and longest to shortest rides within each terminal. Limit to rides that occurred before Jan. 8, 2012.

Answer 33

Computes which quantile each duration_seconds value falls into. NTILE(n) is the function for calculating quantiles.

Answer 34

It won't work properly: it just numbers the rows from 1 up to 20. The n in NTILE(n) should be smaller than the number of members in a group. If you're working with very small windows, keep this in mind and consider using quartiles or similarly small bands.
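
The small-window behaviour is easy to reproduce. This is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25 or newer) on a hypothetical three-row rides table; the column aliases halves and twentieths are illustrative.

```python
import sqlite3

# Hypothetical window with only three rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (duration_seconds INTEGER)")
conn.executemany("INSERT INTO rides VALUES (?)", [(10,), (20,), (30,)])

# NTILE(2) still forms two buckets; NTILE(20) has more buckets than rows,
# so each row simply lands in its own bucket.
ntile_rows = conn.execute(
    """
    SELECT duration_seconds,
           NTILE(2)  OVER (ORDER BY duration_seconds) AS halves,
           NTILE(20) OVER (ORDER BY duration_seconds) AS twentieths
      FROM rides
     ORDER BY duration_seconds
    """
).fetchall()

for row in ntile_rows:
    print(row)
```

With three rows, the twentieths column is just 1, 2, 3, which is a row count rather than a meaningful quantile.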

Answer 35

Shows only the duration of the trip and the percentile into which that duration falls (across the entire dataset—not partitioned by terminal).

Answer 36

Yes, it shows the sum of the rows with the same duration_seconds value. It'll be something like:

    ...
    0    0
    0    0
    1    2
    1    2
    2    126
    2    126
    2    126
    2    126
    2    126
    2    126
    2    126
    ...

Answer 37

LAG pulls from previous rows and LEAD pulls from following rows. LAG shifts the values one row down; LEAD shifts them one row up.

Answer 38

LAG shifts the duration_seconds in each start_terminal group one cell down.
LEAD shifts the duration_seconds in each start_terminal group one cell up.

It returns something like this (the first row of a group has no LAG value):

    start_terminal   duration_seconds   lag   lead
    31000            74                       277
    31000            277                74    291
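
The shifting can be reproduced with the three durations from the sample rows. This is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25 or newer); the rides table is a hypothetical stand-in, and missing neighbours come back as None.

```python
import sqlite3

# Hypothetical rides for a single terminal, using the card's sample durations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (start_terminal INTEGER, duration_seconds INTEGER)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [(31000, 277), (31000, 74), (31000, 291)],
)

# LAG reads the previous row in the window, LEAD reads the next one.
lag_lead_rows = conn.execute(
    """
    SELECT start_terminal,
           duration_seconds,
           LAG(duration_seconds, 1) OVER
             (PARTITION BY start_terminal ORDER BY duration_seconds) AS lag,
           LEAD(duration_seconds, 1) OVER
             (PARTITION BY start_terminal ORDER BY duration_seconds) AS lead
      FROM rides
     ORDER BY start_terminal, duration_seconds
    """
).fetchall()

for row in lag_lead_rows:
    print(row)
```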

Answer 39

If you'd like to make the results a bit cleaner, you can wrap it in an outer query to remove nulls.

    SELECT *
      FROM (
            SELECT start_terminal,
                   duration_seconds,
                   duration_seconds - LAG(duration_seconds, 1) OVER
                     (PARTITION BY start_terminal ORDER BY duration_seconds)
                     AS difference
              FROM tutorial.dc_bikeshare_q1_2012
             WHERE start_time < '2012-01-08'
             ORDER BY start_terminal, duration_seconds
           ) sub
     WHERE sub.difference IS NOT NULL

Answer 40

If you're planning to write several window functions into the same query, using the same window, you can create an alias.

Answer 41

    SELECT start_terminal,
           duration_seconds,
           NTILE(4) OVER ntile_window AS quartile,
           NTILE(5) OVER ntile_window AS quintile,
           NTILE(100) OVER ntile_window AS percentile
      FROM tutorial.dc_bikeshare_q1_2012
     WHERE start_time < '2012-01-08'
    WINDOW ntile_window AS
             (PARTITION BY start_terminal ORDER BY duration_seconds)
     ORDER BY start_terminal, duration_seconds
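
A window alias can be tried on toy data. This is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25 or newer); the rides table, the alias ride_window, and the output columns half and position are hypothetical.

```python
import sqlite3

# Hypothetical rides for two terminals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (start_terminal INTEGER, duration_seconds INTEGER)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [(31000, 10), (31000, 20), (31000, 30), (31000, 40), (31001, 5), (31001, 15)],
)

# The WINDOW clause defines the window once; both functions reuse it by name.
alias_rows = conn.execute(
    """
    SELECT start_terminal,
           duration_seconds,
           NTILE(2)     OVER ride_window AS half,
           ROW_NUMBER() OVER ride_window AS position
      FROM rides
    WINDOW ride_window AS
             (PARTITION BY start_terminal ORDER BY duration_seconds)
     ORDER BY start_terminal, duration_seconds
    """
).fetchall()

for row in alias_rows:
    print(row)
```

Changing the partition or ordering then means editing one definition instead of every OVER clause.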

Answer 42

SUM, COUNT, and AVG
ROW_NUMBER()
RANK() and DENSE_RANK()
NTILE
LAG and LEAD
Defining a window alias

Answer 43

Table size ✨: If your query hits one or more tables with millions of rows or more, it could affect performance.
Joins ✨: If your query joins two tables in a way that substantially increases the row count of the result set, your query is likely to be slow.
Aggregations ✨: Combining multiple rows to produce a result requires more computation than simply retrieving those rows.

Query runtime is also dependent on some things that you can't really control related to the database itself:

Other users running queries ✨: The more queries running concurrently on a database, the more the database must process at a given time and the slower everything will run. It can be especially bad if others are running particularly resource-intensive queries that fulfill some of the above criteria.
Database software and optimization ✨: This is something you probably can't control, but if you know the system you're using, you can work within its bounds to make your queries more efficient.

Answer 44

Keep in mind that you can always perform exploratory analysis on a subset of data, refine your work into a final query, then remove the limitation and run your work across the entire dataset.

Mahsa: This helps with saving time and resources. This is why Mode enforces a LIMIT clause by default.

Answer 45

If you want to limit the dataset before performing the count (to speed things up), try doing it in a subquery:

    SELECT COUNT(*)
      FROM (
            SELECT *
              FROM benn.sample_event_table
             LIMIT 100
           ) sub

Note 🧨: Using LIMIT this way will dramatically alter your results, so you should use it to TEST query logic, but not to get actual results.

Answer 46

It's better to reduce table sizes before joining them. Meaning we can use sub-queries to filter tables, then join them, like below:

    SELECT teams.conference,
           sub.*
      FROM (
            SELECT players.school_name,
                   COUNT(*) AS players
              FROM benn.college_football_players players
             GROUP BY 1
           ) sub
      JOIN benn.college_football_teams teams
        ON teams.school_name = sub.school_name

Instead of:

    SELECT teams.conference AS conference,
           players.school_name,
           COUNT(1) AS players
      FROM benn.college_football_players players
      JOIN benn.college_football_teams teams
        ON teams.school_name = players.school_name
     GROUP BY 1,2
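
The two query shapes can be compared on toy data to confirm they return the same counts. This is a minimal sketch using Python's built-in sqlite3 module; the players and teams tables, their rows, and the player_count alias are hypothetical, and on data this small it shows equivalence only, not any speed difference.

```python
import sqlite3

# Hypothetical stand-ins for the college football tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (school_name TEXT, player_name TEXT)")
conn.execute("CREATE TABLE teams (school_name TEXT, conference TEXT)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?)",
    [("Alabama", "Player A"), ("Alabama", "Player B"), ("Oregon", "Player C")],
)
conn.executemany(
    "INSERT INTO teams VALUES (?, ?)",
    [("Alabama", "SEC"), ("Oregon", "Pac-12")],
)

# Aggregate first: the join only sees one row per school.
aggregate_then_join = conn.execute(
    """
    SELECT teams.conference, sub.school_name, sub.player_count
      FROM (
            SELECT school_name, COUNT(*) AS player_count
              FROM players
             GROUP BY school_name
           ) sub
      JOIN teams
        ON teams.school_name = sub.school_name
     ORDER BY sub.school_name
    """
).fetchall()

# Join first: every player row is joined before being counted.
join_then_aggregate = conn.execute(
    """
    SELECT teams.conference, players.school_name, COUNT(1) AS player_count
      FROM players
      JOIN teams
        ON teams.school_name = players.school_name
     GROUP BY teams.conference, players.school_name
     ORDER BY players.school_name
    """
).fetchall()

print(aggregate_then_join)
print(join_then_aggregate)
```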

Answer 47

You can add EXPLAIN at the beginning of any (working) query to get a sense of how long it will take. It's not perfectly accurate, but it's a useful tool. Try running this:

    EXPLAIN
    SELECT *
      FROM benn.sample_event_table
     WHERE event_date >= '2014-03-01'
       AND event_date < '2014-04-01'
     LIMIT 100

To clarify, this is most useful if you run EXPLAIN on a query, modify the steps that are expensive, then run EXPLAIN again to see if the cost is reduced. Finally, the LIMIT clause is executed last and is really cheap to run (24.65 vs 147.87 for the WHERE clause).

Answer 48

presentation or charting

Answer 49

It can be helpful to create the sub-query and select all columns from it before starting to make transformations, then use a CASE statement to create new columns.

Answer 50

The first thing to do here is to create a table that lists all the columns from the original table as rows in a new table.

    SELECT year
      FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
                   (2007),(2008),(2009),(2010),(2011),(2012)) v(year)

Once you've got this, you can cross join it with the worldwide_earthquakes table to create an expanded view:

    SELECT years.*,
           earthquakes.*
      FROM tutorial.worldwide_earthquakes earthquakes
     CROSS JOIN (
           SELECT year
             FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
                          (2007),(2008),(2009),(2010),(2011),(2012)) v(year)
           ) years

Notice that each row in the worldwide_earthquakes table is replicated 13 times. The last thing to do is to fix this using a CASE statement that pulls data from the correct column in the worldwide_earthquakes table given the value in the year column.

    SELECT years.*,
           earthquakes.magnitude,
           CASE year
             WHEN 2000 THEN year_2000
             WHEN 2001 THEN year_2001
             WHEN 2002 THEN year_2002
             WHEN 2003 THEN year_2003
             WHEN 2004 THEN year_2004
             WHEN 2005 THEN year_2005
             WHEN 2006 THEN year_2006
             WHEN 2007 THEN year_2007
             WHEN 2008 THEN year_2008
             WHEN 2009 THEN year_2009
             WHEN 2010 THEN year_2010
             WHEN 2011 THEN year_2011
             WHEN 2012 THEN year_2012
             ELSE NULL END
             AS number_of_earthquakes
      FROM tutorial.worldwide_earthquakes earthquakes
     CROSS JOIN (
           SELECT year
             FROM (VALUES (2000),(2001),(2002),(2003),(2004),(2005),(2006),
                          (2007),(2008),(2009),(2010),(2011),(2012)) v(year)
           ) years
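
The cross join plus CASE pattern can be run end to end on a cut-down table. This is a minimal sketch using Python's built-in sqlite3 module; the earthquakes table has only three hypothetical year columns and one row, and the year list is built with SELECT ... UNION ALL because SQLite does not accept the v(year) column alias after VALUES.

```python
import sqlite3

# Hypothetical wide table: one row per magnitude band, one column per year.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE earthquakes "
    "(magnitude TEXT, year_2000 INTEGER, year_2001 INTEGER, year_2002 INTEGER)"
)
conn.execute("INSERT INTO earthquakes VALUES ('8.0 to 9.9', 1, 1, 0)")

# The cross join repeats each wide row once per year; CASE then picks the
# matching year_* column for that copy of the row.
long_rows = conn.execute(
    """
    SELECT years.year,
           earthquakes.magnitude,
           CASE years.year
             WHEN 2000 THEN year_2000
             WHEN 2001 THEN year_2001
             WHEN 2002 THEN year_2002
             ELSE NULL END AS number_of_earthquakes
      FROM earthquakes
     CROSS JOIN (
           SELECT 2000 AS year
            UNION ALL SELECT 2001
            UNION ALL SELECT 2002
           ) years
     ORDER BY years.year
    """
).fetchall()

for row in long_rows:
    print(row)
```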

Answer 51

"CASE year" goes down the column "year" and where it has value of 2000, it's replaced with the corresponding value of that row in "year_2000" column (which is the number of earthquakes in year_2000) and so on till the end, then it's given a column name and a new column is created that for each year and each magnitude shows the number of earthquakes. It's like below and it's for Pivoting columns to rows in the dataset: year magnitude # of earthquakes 2000 8.0 to 9.9 1 2001 8.0 to 9.9 1 2002 8.0 to 9.9 0 2003 8.0 to 9.9 1 2004 8.0 to 9.9 2

Answer 52

The SQL statement lists the suppliers that have at least one product with a price less than 20. The EXISTS operator is used to test for the existence of any record in a subquery. It returns TRUE if the subquery returns one or more records, and it is evaluated once for each row of the outer query.
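
The statement this card refers to is not reproduced here, so the following is only a sketch of a query that fits the description. It uses Python's built-in sqlite3 module; the suppliers and products tables, their columns, and their rows are hypothetical.

```python
import sqlite3

# Hypothetical suppliers and products; only Acme sells something under 20.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suppliers (supplier_id INTEGER, supplier_name TEXT)")
conn.execute(
    "CREATE TABLE products (supplier_id INTEGER, product_name TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO suppliers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Tea", 18.0), (1, "Coffee", 25.0), (2, "Syrup", 40.0)],
)

# For each supplier, EXISTS is TRUE when the correlated subquery finds at
# least one of that supplier's products priced below 20.
cheap_suppliers = conn.execute(
    """
    SELECT supplier_name
      FROM suppliers
     WHERE EXISTS (
           SELECT product_name
             FROM products
            WHERE products.supplier_id = suppliers.supplier_id
              AND price < 20
           )
     ORDER BY supplier_name
    """
).fetchall()

print(cheap_suppliers)
```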

(80 cards)