Week 2: Data Extraction Flashcards
How do you get the unique values in a list?
list.unique()
What is the degree of a relation?
The number of columns
How do you get column sums in python?
df.sum()
What is the requirement of attribute values in a relation?
attribute values are normally required to be atomic: indivisible (not containing more than one value)
What are the 2 ways to create dataframes in python pandas?
data = {‘State’: [‘Ohio, ‘Ohio’, ‘Ohio’, ‘Nevada’, ‘Nevada’, ’Nevada'], ‘Year’: [2000, 2001, 2002, 2001, 2002, 2003], ‘Population’: [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]} df = pd.DataFrame(data)
data = [[‘Ohio’, 2000, 1.5], [‘Ohio’, 2001, 1.7], [‘Ohio’, 2002, 3.6], [Nevada, 2001, 2.4], [‘Nevada, 2002, 2.9], [‘Nevada’, 2003, 3.2]] cols = [‘State’, ‘Year’, ‘Population’] df = pd.DataFrame(data, columns = cols)
How do you get a vector containing the column names in python?
df.columns
How do you add a new column to a dataframe in python?
df[‘newcolumn’] = …
How do you do a left outer join in python?
pd.merge(df1, df2, on=‘attribute’, how = ‘left’)
How do you find mismatches of strings?
WHERE … NOT LIKE …
How do you match an exact substring and the beginning of a string?
‘Intro%’ #matches any string beginning with “Intro”
‘%Comp%’ #matches any string containing “Comp” as a substring
What are the distributive laws?
A ^ (BvC) = (A^B) v (A^C) A v (B^C) = (AvB) ^ (AvC)
What is the basic form of a select from where statement?
SELECT desired attributes
FROM one or more tables
WHERE condition about tuples of the tables
How do you concatenate dataframes row-wise in python?
pd.concat([df1, df2])
What does 5> unknown return?
Unknown
What are 2 ways to extract a single column by name in python?
df.director_name
df[“director_name”]
How do you extract the first 5 rows and column 2 in python?
df.iloc[0:5,1]
How do you force the elimination of duplicates in an SQL query?
SELECT DISTINCT
What are the comparison signs in SQL?
=, <> (not equal), , <=, >=
How do you get the first and last rows of a dataframe in python?
df. head(n) #get first n rows
df. tail(n) #get last n rows
What is a tuple comparison?
WHERE (instructor.ID, dept_name) = (teaches.ID, ‘Biology’)
How do you display the number of columns in python?
print(len(df.columns))
How do you extract multiple columns by name in python?
df[[‘actor’, ‘facebook_likes’, ‘content_rating’]]
What is a superkey of R?
K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation
What are the absorptive laws?
A v (A^B) = A A ^ (AvB) = A
What is a self join?
Self join is where a table joins with itself on two columns with different names but storing data from the same domain. distinguish copies by renaming the relations
How do you display the number of rows in python?
print(len(df))
What is the cardinality of a relation?
The number of rows
How do you select all attributes in an SQL query?
SELECT *
How do you rename an attribute in SQL? (2 ways)
using “as”
e.g. SELECT ID, name, salary/12.0 as monthly_salary
FROM instructor
or simply write salary/12.0 monthly_salary
do not need the “as”
What happens to unknown values in the where clause?
treated as false if it evaluates to unknown
What is the relational algebra symbol for renaming?
rho (p)
What are the advantages of composing relational algebra operators?
- the output of a relational algebra (RA) is a relation that can be input for another RA operator
- don’t need heaps of functions for each line of code
How do you order the results of an SQL query?
What order does it put them in?
SELECT *
FROM instructor
WHERE dept_name = ‘Physics’
ORDER BY name
-puts results in ascending order
What aggregate functions can be used in SQL? (5)
- avg: average value
- min: minimum value
- max: maximum value
- sum: sum of values
- count: number of values
What is the double negation law?
~(~A) = A
What are the complement laws?
Av ~A = T
A^ ~A = F
What are the 4 boolean laws including true and false
A^F = F AvF = A A^T = A AvT = T
How do you get a true/false vector for whether the value is null in a column in python?
df[‘column’].isnull()
What is the result of SELECT ‘A’ FROM instructor
returns a relation with one column and N rows(the number of tuples in instructor) where each row will contain the value ‘A’
How do you delete a row in python?
df.drop([‘row1’])
What is a left outer join and its code in SQL?
All attributes from left table are listed. Matching attributes from the right table are added where they are available, null when not available
SELECT *
FROM course
LEFT OUTER JOIN prereq
ON course.course_id = prereq.course_id
How do you add values to the end of a list?
list.append(c)
How can you do a full outer join in python?
pd.merge(df1, df2, on=‘attribute’, how = ‘outer’)
When do you use WHERE vs HAVING?
Having must be used after a GROUP BY clasue
What is a full outer join and its code in SQL?
takes all attributes from left and right tables, fills in all matching attributes and puts null where not available
SELECT *
FROM course
FULL OUTER JOIN prereq
ON course.course_id = prereq.course_id
How do you find the average salary of instructors in each department
SELECT dept_name, AVG(salary) AS avg_salary
FROM instructor
GROUP BY dept_name
What does a cartesian product do?
Gets all the tuples from relation R and pair them with all the tuples from relation S, end up with huge table
What do you use in the where clause to meet more than one criteria?
SELECT ID, name
FROM student
WHERE dept_name = ‘Math’ AND tot_cred >= 24
can also use OR
What are the three boolean operators?
- and (^): retrieves records that satisfy both conditions
- or (v): retrieves records that satisfy one of the conditions
- not (~): retrieves records that don’t satisfy the condition
How do you get the summary statistics of dataframe columns in python?
df.describe()
What is an SQL query for getting the name of instructors who have the same salary, avoids pairs like (miller, miller) and produces pairs in alphabetical order. From the relation instructor
SELECT T.name AS N1, S.name AS N2
FROM instructor AS T, instructor AS S
WHERE T.salary = S.salary AND T.name < S.name
What are 3 ways to extract the rows with the condition that content rating = PG and the columns with names facebook likes and budget or index 3 and 5?
df.loc[df[‘content_rating’]==‘PG_13’, [‘facebook_likes’, ‘budget’]]
#or
df.iloc[(df[‘content_rating]==‘PG_13’).values,[3,5]]
#or
df[df.content_rating == ‘PG_13’][[‘facebook_likes’, ‘budget’]]
How do you match a string in the where clause?
WHERE name LIKE ‘%Van der%’
How do you use union, intersection and set difference in relational algebra?
Can do the union, intersection or set difference (-) of full relational algebra statements - but not include them within the statements
What does a natural join do?
set of all combinations of tuples in R and S that are equal on their common attribute names. and includes all other attributes from both relations filled in
What is a right outer join and its code in SQL?
All attributes from the right table are listed. Matching attributes from the left table are added where they are available, null when not available
SELECT *
FROM course
RIGHT OUTER JOIN prereq
ON course.course_id = prereq.course_id
What is data manipulation language?
Language for accessing and manipulating the data in a model
DML is used for adding, deleting and modifying the records in a database
What is data definition language?
- used for defining the database schema
- DDL is used to create and modify database objects such as tables, indexes and users
How do you check for null values using SQL?
WHERE salary IS NULL
What is the arity of a relation?
The number of columns
How do you match a string of 3 characters and a string of at least 3 characters?
‘___’ #matches any string of exactly three characters
‘___%’ #matches any string of at least three characters
What are the 5 basic types in SQL?
- char(n) or character(n): character string with fixed length n
- varchar(n): variable length character string
- int or integer: integer of any length
- numeric(p,d): fixed point number with p digits and d of the digits are to the right of the decimal
- float(n): floating point number with precision at least n digits
What is another way to write
Where dept_name = ‘Comp Sci’ or dept_name = ‘Biology’
WHERE dept_name IN (‘Comp sci’, ‘Biology’)
How do you concatenate dataframes column-wise in python?
pd.concat([df1, df2], axis=1)
What is the form of relational algebra statement for the general sql query:
select A1, sum (A3)
from r1, r2
where P
group by A1, A2
Pi(A1, sum(A3)) (A1, A2 G sum(A3) (sigma P (r1, r2)))
How can you extract row 5 in python by index (2) and rowname (1)?
df. iloc[5]
df. iloc[5, :]
df. loc[‘rowname’]
What is a candidate key of R?
Superkey K is a candidate key if K is minimal
How do you do a natural join in SQL?
SELECT *
FROM course
NATURAL JOIN prereq
ON course.course_id = prereq.course_id
What is the primary key of R?
A selected candidate key used to represent the relation
What are the unknowns combined with the and operator?
- true AND unknown = unknown
- false AND unknown = false
- unknown AND unknown = unknown
How do you transpose a dataframe in python?
df.T
How do you get the index and data types of a dataframe in python?
df.info()
What are de morgans laws?
~(A^B) = ~A v ~B ~(AvB) = ~A ^ ~B
How do you find the difference of two SQL queries?
(SELECT course_id FROM section WHERE sem = ‘Fall’ AND year = 2009)
EXCEPT
(SELECT course_id FROM section WHERE sem = ‘Spring’ AND year = 2010)
How would you match the string “100%”?
LIKE ‘100\%’ ESCAPE ‘'
How do you get the minimum of column 1 in python?
df.columns[1].min()
What type of subqueries can be nested in each of the select, from and where clauses?
SELECT A1, A2, …, An
FROM r1, r2, …, rn
WHERE P
as follows:
- Ai can be replaced by a sub query that generates a single value
- ri can be replaced by any valid subquery
- P can be replaced with an expression of the form: B (subquery)
How can you extract multiple columns by index in python?
df[df.columns[[5,7,9]]][:] #extracts columns 5,7,9
What are the 3 main components of data models?
-structures: e.g. rows and columns, nodes and edges, key-value pairs
-constraints: e.g. all rows must have same number of columns, age must be positive, c
* operations: e.g. return the values of record x, find rows where the column “name” is “Jordan”
What is a foreign key of R?
column or set of columns in table whose values correspond to the primary key in another table
How can you find the names of all instructors with salary between $90,000 and $100,000 (this includes the values equal to)
SELECT name
FROM instructor
WHERE salary BETWEEN 90000 and 100000
What are the 3 database levels of abstraction?
- Physical level: describes how a record is stored
- Logical level: describes what data stored in the database and the relationships among the data e.g. name: string, salary: real
- View level: application programs which hide details of data types. Views can also hide information for security/privacy purposes
What is the relational algebra symbol for grouping and aggregation?
G or gamma
Code to list all departments along with the number of instructors in each department in SQL? where there is an instructor relation and department relation
SELECT dept_name,
(SELECT COUNT(*) FROM instructor WHERE department.dept_name = instructor.dept_name) AS num_instructors
FROM department
How do you group by a particular attribute and count how many in each group in python?
df.groupby([‘column’]).count()
What are the unknowns combined with the or operator?
- unknown OR true = true
- unknown OR false = unknown
- unknown OR unknown = unknown
What is a subquery in SQL?
A subquery is a SELECT-FROM-WHERE expression that is nested within another query
What is A^A and AvA?
A^A = A AvA = A
What is a natural join and an inner join?
takes only the attributes that are common in both tables
Inner join keeps both the columns in the output even though they will be the same, while natural join will only keep one common column
How do you get the count for each column in python?
df.count()
What is the result in python of adding two dataframes df1+df2
Where columns match, it will add the data in that column from each df, e.g. “BIOBIO” because strings will concatenate. Where columns don’t match, will include all columns from both dataframes but entries will all be NaN
How do you find the average salary of instructors in each department which has average salary greater than 80000
SELECT dept_name, AVG(salary) AS avg_salary
FROM instructor
GROUP BY dept_name
HAVING avg_salary > 80000
What is the result of
SELECT *
FROM instructor, teaches
generates every possible instructor - teaches pair, with all attributes from both relations/ common attributes in the resulting table are renamed using the relation name
cartesian product
What is a query?
a statement requesting the retrieval of information
What are the 2 parts that a relation is made up of?
- Schema: specifies name of relation, plus name and type of each column
- Instance: a table with rows and columns. rows = records = tuples, column = attributes = fields = features (instance is a snapshot of the schema at a point in time)
How do you do an inner join in SQL?
SELECT *
FROM course
JOIN prereq
ON course.course_id = prereq.course_id
INNER JOIN is also the same as JOIN
What are the 6 main relational algebra operators?
Union (U): total set of records in both relations, only records each record once even if it is in both
intersection: common records
difference (-): records in one table but not the other
selection (sigma): a specific set of rows
projection (pi): a specific set of columns. Duplicate tuples are removed
join: the natural join of two tables
How do you delete a column in python? (2 ways)
del df[‘column1’]
df.drop(‘column1’, axis=1)
What is the result of SELECT ‘542’
results in a relation with one column and one row with value ‘542’