SQL Flashcards
Calculate median in SQL
with t1 as (
select
row_number() as rn,
count() over () as cnt,
round(count() over () / 2) as mid
from table
)
select * from t1
where rn = mid
or (cnt % 2 = 0 and rn = mid + 1)
– Pull where rn = md, so the middle number. This works if dataset is odd.
– If dataset is even, pull where the rn is mid AND another row mid + 1
Tips for Optimizing a Query
- Minimize full table scans
- Filter data
- Use CTE’s
- Temp tables
- Avoid joins
- User inner (limits result set)
- No select *
- Avoid distinct / union
- Materialize frequently used queries
Get difference between two times
timestampsidff(
second,
time1,
time2
)
Get min/max between two or more columns
- least()
- greatest()
What type of language is SQL
Declarative (as opposed to procedural)
You describe the results you want - NOT how to do it
Group strings
group_contact()
list_agg()
Scalar function to add strings together
concat()
Result of FULL OUTER JOIN where join keys are null
All from A and all from B but NOT where A and B intersect
Result of FULL OUTER JOIN
All from A and B and intersection
count(*) vs count(1) vs count(col)
- count(*): All rows including nulls
- count(1): All rows including nulls
- count(col): All rows EXCEPT nulls
Find start and end of continuous ranges
1
2
3
7
8
with t1 as (
select
log_id,
log_id - row_number() over (order by log_id asc) as grp
from logs
)
select
min(log_id) as start_id,
max(log_id) as end_id
from t1
group by grp
Recursive function in SQL
with recursive t1 as (
select 1 as col1 union all select col1 + 1 from t1 where col1 < 10
)
select * from t1
Partial Dependencies
When non-key is dependent on part of a candidate key
studentid | projectno | studentname | projectname
studentname and projectname are both dependent on the ids in the table.
Problem with this is if we don’t have any students assigned to a project, then we can’t get the project name from anywhere
Pros of 3NF
- Fast writes
- Better organization
- Reduction of redundant data
2nd Normal Form Rules
- Be in 1NF
- Have no partial dependencies
Fan-out Trap
Two 1-to-many joins
Where to use an Index
- Typically on columns in WHERE and JOIN clauses
- Column that appears in max number of queries
When would Indexing fail
- Have too many indexes
- If index column is hardly queried
- Too many DML actions
Nested Joins
For every value in OUTER, look for it in INNER
Can be extremely slow
Cache
Temporarily storing an expressive operation to speed up results
Repeatable Read Isolation Level
Can read within same transaction without anything changing
Read Uncommitted Isolation Level
Can read uncommitted data (dirty reads)
Read Committed Isolation Level
Each read gets latest data committed
Purpose of database normalization
- Eliminating redundant data
- Ensure data dependencies make logical sense
- Prevent insert, update and removal anomalies
Chasm Trap
When 2 many-to-one joins converge on a single table. Pathway between is broke
Merge Join
Sort table on join columns
For each value in OUTER, iterate over INNER to find, but, stop when reach a value greater than what we’re searching for (since it’s in order)
Deletion Anomaly
Deletion of unwanted data causes loss of desired data
If there are partial dependencies (i.e. teacher phone number in classes table), having no teachers with classes means we no longer have their phone number
1st Normal Form Rules
- Each cell has a single value
- Column should be of all same type
- Unique column names
Update Anomaly
Miss updating all records in a table
Hash Join
Apply a hash function to both INNER and OUTER tables to create hash tables. Can then lookup each value in O(1)
Takes more time as the tables need to be hashed first
Cardinality
- Uniqueness of data
- Low: More duplicates
- High: More unique
Dirty reads
When a transaction reads data that is written to a concurrently uncommitted transaction
3NF Cons
- Highly normalized
- Many joins
- Setup high
Boyce-Codd Normal Form
- Non-prime attributes must depend on all candidate keys (super keys)
i.e. If a key is NOT the primary key, but it could have been
3rd Normal From Rules
- Must be in 2NF
- No transative functional dependencies (when a non-primary key depends on another non-primary key)
4th Normal Form
- Should not contain any multi-valued dependencies
- Looks like a cross join
A multi-valued dependency occurs when a single attribute determines more than one other attribute, and these attributes are independent of each other. For example, if a table stores the hobbies and skills of employees, and each employee can have multiple hobbies and skills, then there is a multi-valued dependency between the employee ID and the hobbies and skills. To achieve 4NF, we need to split the table into two separate tables, one for hobbies and one for skills, and link them with a foreign key.
5th Normal Form
- Cannot look like a join of tables
- Basically, a table is in 5th normal form if it cannot be made into any smaller tables with different keys (most tables can obviously be made into smaller tables with the same key!).
Fifth normal form (5NF) is a level of database normalization that ensures that a table has no join dependencies. A join dependency occurs when a table can be decomposed into two or more smaller tables, and then reconstructed by joining them on their primary keys, without losing any information. For example, if a table stores the products, colors, and sizes that a company sells, and each product can have multiple colors and sizes, then there is a join dependency between the product ID, the color, and the size. To achieve 5NF, we need to split the table into three separate tables, one for products, one for colors, and one for sizes, and link them with a composite key. The benefit of 5NF is that it ensures that the database is fully normalized and has no redundancy or anomalies. The drawback of 5NF is that it may create too many tables and relationships, and make the database design more difficult to understand and maintain.
Window functions RANGE vs ROW
Use ROWS when you want to define your window in terms of a specific number of rows, regardless of their values.
Use RANGE when you want to define your window based on the values in the ORDER BY column. But be cautious about non-uniform value distributions or duplicates!