Module 12: Databases and SQL basics Flashcards

You may prefer our related Brainscape-certified flashcards:
1
Q

What are relational databases

A

The difference between SQL and noSQL databases are defined by the kinds of workloads they can handle. As workloads change, companies select the right tool for the right job.

In a relational database management system, data is stored in tables made up columns which are variables and rows which are observations. A table consists of a dataset which is both the rows and columns.

Relational databases are organized using schemas – this is essentially the blueprint of the tables, columns, and relationships between tables. A common schema is called the star schema which is a fact table in the middle that lists central facts usually quantitively like sales, and they are surrounded by dimension tables which further contextualize the fact table by providing things like dates in one or product type sold in another.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Keys

A

The attributes used to identify rows are known as keys, which shouldn’t be confused with indexes as indexes are not part of a table but are used for faster referencing. Keys are otherwise the logical links between tables.

A primary key is a value in a column that usually just increments by one in the table. This is to show the unique record.

A foreign key is a unique record that identifies a row of another table. The purpose of this is to reconstruct records split across tables via operations called joins.

Database relationships are either 1-to-1 this implies a unique link such that two matching records do not match with any other records, one-to-many in that a record can match many records in another table, and many-to-many in that a record can match any number of records in another table and vice-versa.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Relational vs non-relational

A

Here are the specific differences:
Relational Databases Non-Relational Databases
Data organized in tables and columns “NoSQL” databases like MongoDB, don’t require designing data relations (no need for a data model)
Have a concept of primary and foreign keys Used as a “data store” for data than can change daily
Mature technology (ORACLE, SQL Server, MySQL, etc.) Used in big data and real-time web applications
Work well for data with predefined structure Many varieties such as key-value, document, and graph

Great for analytics (e.g. select customers who regularly spend in a certain category)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is SQL

A

Standard query language (SQL) is essential for accessing, manipulating, and managing databases, used for updating, creating, deleting databases. This is similar to the process of using access permissions to manage files.

Much of the data scientists work involves data retrieval and data preparation for analysis – as such, SQL is required knowledge for manipulation and retrieval of data. What can SQL do:
- Retrieve data which can be used for further analysis
- Joining tables and views to create new tables and views
- Creating new data records/tables/views
- Deleting data/tables/views
- Updating existing data/records/tables/views
- Setting permissions
- Creating stored procedures for automation and workflows

Its functions can be broken up into three sub-languages:
- A language for querying (immutable operations)
- A language for specifying a schema (declarations)
- A language for making changes (mutable operations)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Most common SQL query operation - SELECT, FROM, WHERE

A
  • SELECT [grab the columns needed]
  • FROM [table/views]
  • WHERE [apply relevant conditions/filters]

So, how do we explore the database? How do we know what tables and attributes we have available to us? Normally, this information can be found using a command like SHOW db_name or SELECT * FROM schema in the database management system (DBMS).

”””
Querying the schema table, sqlite_master. We do this if we know nothing about the database.

NOTE: LIMIT 5 only returns 5 rows of the query output.
NOTE: SELECT * selects all columns. We use this because we have no idea what the attributes
are called for sqlite_master.
“””
c.execute(“SELECT * FROM sqlite_master LIMIT 5;”)
output = c.fetchall()
output

Output:

[(‘table’,
‘Album’,
‘Album’,
2,
‘CREATE TABLE [Album]\n(\n [AlbumId] INTEGER NOT NULL,\n [Title] NVARCHAR(160) NOT NULL,\n [ArtistId] INTEGER NOT NULL,\n CONSTRAINT [PK_Album] PRIMARY KEY ([AlbumId]),\n FOREIGN KEY ([ArtistId]) REFERENCES [Artist] ([ArtistId]) \n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\n)’),
(‘table’,
‘Artist’,
‘Artist’,
3,
‘CREATE TABLE [Artist]\n(\n [ArtistId] INTEGER NOT NULL,\n [Name] NVARCHAR(120),\n CONSTRAINT [PK_Artist] PRIMARY KEY ([ArtistId])\n)’),
(‘table’,
‘Customer’,
‘Customer’,
4,
‘CREATE TABLE [Customer]\n(\n [CustomerId] INTEGER NOT NULL,\n [FirstName] NVARCHAR(40) NOT NULL,\n [LastName] NVARCHAR(20) NOT NULL,\n [Company] NVARCHAR(80),\n [Address] NVARCHAR(70),\n [City] NVARCHAR(40),\n [State] NVARCHAR(40),\n [Country] NVARCHAR(40),\n [PostalCode] NVARCHAR(10),\n [Phone] NVARCHAR(24),\n [Fax] NVARCHAR(24),\n [Email] NVARCHAR(60) NOT NULL,\n [SupportRepId] INTEGER,\n CONSTRAINT [PK_Customer] PRIMARY KEY ([CustomerId]),\n FOREIGN KEY ([SupportRepId]) REFERENCES [Employee] ([EmployeeId]) \n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\n)’),
(‘table’,
‘Employee’,
‘Employee’,
7,
‘CREATE TABLE [Employee]\n(\n [EmployeeId] INTEGER NOT NULL,\n [LastName] NVARCHAR(20) NOT NULL,\n [FirstName] NVARCHAR(20) NOT NULL,\n [Title] NVARCHAR(30),\n [ReportsTo] INTEGER,\n [BirthDate] DATETIME,\n [HireDate] DATETIME,\n [Address] NVARCHAR(70),\n [City] NVARCHAR(40),\n [State] NVARCHAR(40),\n [Country] NVARCHAR(40),\n [PostalCode] NVARCHAR(10),\n [Phone] NVARCHAR(24),\n [Fax] NVARCHAR(24),\n [Email] NVARCHAR(60),\n CONSTRAINT [PK_Employee] PRIMARY KEY ([EmployeeId]),\n FOREIGN KEY ([ReportsTo]) REFERENCES [Employee] ([EmployeeId]) \n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\n)’),
(‘table’,
‘Genre’,
‘Genre’,
9,
‘CREATE TABLE [Genre]\n(\n [GenreId] INTEGER NOT NULL,\n [Name] NVARCHAR(120),\n CONSTRAINT [PK_Genre] PRIMARY KEY ([GenreId])\n)’)]

All data is returned as a list of tuples / rows / records. The rows contain the following information in the following tuple.

(‘table’,
‘Album’,
‘Album’,
2,
‘CREATE TABLE [Album]\n(\n [AlbumId] INTEGER NOT NULL,\n [Title] NVARCHAR(160) NOT NULL,\n [ArtistId] INTEGER NOT NULL,\n CONSTRAINT [PK_Album] PRIMARY KEY ([AlbumId]),\n FOREIGN KEY ([ArtistId]) REFERENCES [Artist] ([ArtistId]) \n\t\tON DELETE NO ACTION ON UPDATE NO ACTION\n)’)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Example of SELECT FROM WHERE

A

This was a demonstration of join-select-project – using SELECT FROM WHERE

We can do more than this though – suppose that a company wanted to report on top performers using albums released per artists – we already have the two relevant tables then we can:

  1. Calculate performance metrics by tallying the number of albums by artist
  2. Rank by performance
  3. Retrieve the top-k performers

This can all be done with one query:

c.execute(“ “ +
# NOTE: SELECT is executed last. So COUNT is in the context of a grouping, not the table
“SELECT Artist.Name, COUNT(Album.AlbumID) AS performance “ +
“FROM Album, Artist “ +
“WHERE Album.ArtistID = Artist.ArtistID “ +
# Grouping by artists means we partition the table into containers for each artist
“GROUP BY Artist.ArtistID “ +
# Each of those containers has a count aggregate computed.
# Note: this is actually unnecessary for the query, we’re just being thorough.
“HAVING COUNT(Album.AlbumID)” +
# Each container is sorted (not the row contents of the containers, but the actual containers/groupings)
“ORDER BY COUNT(Album.AlbumID) DESC “ +
# We then limit by groupings / containers
“LIMIT 10 “+
# Remember: SELECT is executed last. So it is in the context of containers / groupings.
# Technically, Groupings always exist. By default, all records will be in one grouping unless
# otherwise specified.
“;”)

c.fetchall()

output:

[(‘Iron Maiden’, 21),
(‘Led Zeppelin’, 14),
(‘Deep Purple’, 11),
(‘Metallica’, 10),
(‘U2’, 10),
(‘Ozzy Osbourne’, 6),
(‘Pearl Jam’, 5),
(‘Various Artists’, 4),
(‘Faith No More’, 4),
(‘Foo Fighters’, 4)]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly