Week 3: Advanced SQL and Data Integration Flashcards

1
Q

What is the check clause?

A

defines constraints on the values of a particular attributes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is jaccard bag similarity?

A

Jaccard similarity but counts repetition of elements
Max is 0.5

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What are personalisation and customisation in terms of data?

A
  • personalisation: content adapted to user
  • Customisation: structure adapted to user
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the official definition of a lossless decomposition?

A

let R a relation schema (with constraints). A decomposition R1, R2, … Rn of R is called lossless iff. for each valid relation (instance) r(R): r = piR1(r) join piR2(r) join …

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What functional dependencies can be derived from armstrongs axoims? (4)

A

if X -> Y and X -> W, then X -> YW
if X -> YW, then X -> Y and X -> W
if X -> Y and WY -> Z, then XW -> Z
If X is a candidate key, then X -> Y for all Y

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How do you create a view in SQL?

A

CREATE VIEW YoungActiveStudents (name, grade)
AS SELECT S.name, E.grade
FROM Students S, Enrolled E
Where S.Sid = E.Sid and S.age <21

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Why is entity resolution useful? (3)

A

improves data quality and integrity, fosters re-use of existing data sources, optimises space

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the standard blocking algorithm?

A
  1. Select the most appropriate attribute name(s) with respect to noise and distinctiveness
  2. Transform the corresponding value(s) into a blocking key (BK)
  3. For each BK, create one block that contains all entities having this BK in their transformation
    * works as a hash function - blocks on the equality of BKs
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the similarity of two signatures?

A

the fraction of the hash functions in which they agree (for which signatures have the same value)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is entity integrity?

A

Means the primary key cannot be null

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

How do you delete an attribute from an SQL table?

A

ALTER TABLE Students
DROP firstYear

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is a legal instance of a relation?

A

one that satisfies all specified ICs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is a view in SQL and what is it used for?

A

A view is just a relation, but we store a definition, rather than a set of tuples
Views can be used to present necessary information (or a summary) while hiding details in underlying relations

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is locality sensitive hashing?

A

generate from the collection of all elements (signatures) a small list of candidate pairs: pairs of elements whose similarity must be evaluated

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is a similarity metric for documents?

A

Represent document as a set of its k-shingles
Use jaccard set similarity

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are the three ways of enforcing referential integrity in SQL?

A
  • NO ACTION - delete/update is rejected
  • CASCADE - make the same changes to all tuples that refer to the updated/deleted tuple
  • SET NULL / SET DEFAULT - sets foreign key value of referencing tuple to NULL or a default value
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is the purpose of integrity constraints?

A

Integrity constraints guard against accidental damage to the database, by ensuring that authorised changes do not result in a loss of data consistency

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is a candidate key in terms of closure?

A

The minimal set of attributes for which their closure is all the set of attributes in a relation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

How do you find the similarity of two sets(documents) from the shingle matrix?

A

no. of rows where both columns are 1 / no. of rows where either column is 1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

How do you add a check into an SQL table?

A

CREATE TABLE section( semester VARCHAR(6) CHECK(semester IN (‘Fall’, ‘Winter’, ‘Spring’, ‘Summer’)), year NUMERIC(4,0) CHECK(year>1990))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is functional dependence?

A
  • the values of a set of attributes X determine the values of another set of attributes Y
  • Denoted by X -> Y (X determines Y)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What are covering constraints?

A

instances of the children of an entity include all instances of their parent (ie cover it)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What is the key idea of hashing?

A

“hash”each column C to a small signature h(C), such that:

  1. H(C) is small enough that the signature fits in RAM
  2. sim(C1,C2) is the same as the “similarity” of signatures h(C1) and h(C2)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What is the gap distance?

A
  • overcome limitation of edit distance with shortened strings (abbreviations)
  • Considered two extra operations: open gap and extend gap (with small cost)

Cost = insert + open + extend

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

What is jaccard similarity?

A

-The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
sim(C1,C2) = |C1 intersection C2| / |C1 union C2|

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

How do you perform min-hashing?

A
  • start with a permutation for each row of the boolean matrix
  • get the first occurring row in the permuted order that correlates to a 1 in the document and store this as the new column value in the Signature matrix
  • repeat for many permutations
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What is jaccard distance?

A

d(C1,C2) = 1 - sim(C1,C2)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What is jaro-similarity?

A

JaroSim(S1, S2) = (1/3)*(C/|S1| + C/|S2| + (C-T)/C)

C=common/matching characters in S1 and S2
T=transpositions/2 (transposition=matching but different sequence order)

characters are considered matching when they are the same and not further than [max(|S1|, |S2|)/2]-1 apart

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

What are reasons for discrepancies between entities? (4)

A

Text variations, local knowledge, evolving nature of data, new functionalities

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

How do you check if a decomposition is lossless using functional dependencies?

A

For relational schema R(XYZ), the following holds: If X -> Y then the decomposition R1(XY) R2(XZ) is lossless.

31
Q

What is the normalisation theory for designing a good database? (4)

A
  • 1 table for each entity
  • 1 table for each relationship
  • Each cell contains a single value
  • If you have big relations, decompose them
32
Q

What is a conditional functional dependency? How is it expressed?

A

A functional dependency that holds only under some contraint

([country = UK, zip] -> street)

33
Q

How do you set the method of enforcing referential integrity when creating an SQL table?

A

CREATE TABLE Enrolled(sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY(sid,cid), FOREIGN KEY(sid) REFERNCES Students
ON DELETE CASCADE
ON UPDATE SET DEFAULT)

34
Q

What is the property of the minhash function?

A

Pr[h_pi(C1) = h_pi(C2)] = sim(C1, C2)
the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets

35
Q

What is the closure of a set of functional dependencies?

A

The set of functional dependencies logically implied by F is the closure of F, denoted F+

ie. to find the closure B+, find all the attributes that can be determined starting with B

36
Q

what is visual integration architecture / on demand integration?

A

-leave the data in the sources
-When a query comes in:
* Determine their relevant sources to the query
* Break down the query into sub-queries for the sources
* Get the answers from the sources, filter them if needed and combine them appropriately

37
Q

What are overlap constraints?

A

specify that the children of an entity do/don’t overlap

38
Q

What are Armstrongs axioms? (3)

A
  • If X subset of Y, then Y -> X (reflexivity)
  • If X -> Y, then AX -> AY (augmentation) (A is a set of attributes)
  • If X -> Y and Y -> W, then X -> W (transitivity)
39
Q

What is a k-shingle / k-gram?

A

-a sequence of k-tokens (characters or words) that appear in the document
E.g. for document D1 = abcab, Set of 2-shingles S(D1) = {ab, bc, ca}

40
Q

What is the goal of hashing?

A

find similar columns while computing small signatures

41
Q

What is the purpose of functional dependencies?

A

functional dependencies allow us to express constraints that cannot be expressed using superkeys

42
Q

How do you change the value in an attribute for all or some tuples?

A

UPDATE Students
SET grade = grade*1.2
WHERE …

43
Q

How do you define a minhash function?

A

Define a minhash function h_pi(C) = the index of the first(in the permuted order pi) row which column C has value 1

44
Q

How do you add a row into an SQL table? (2 ways depending on the order)

A

INSERT INTO Students
VALUES (53688, ‘Smith’, ‘smith@ee’, 18, 3.2)

INSERT INTO Students(name, Sid, login, age, gpa)
VALUES (‘Smith’, 53688, ‘smith@ee’, 18, 3.2)

45
Q

What happens if you enter 3 characters into an attribute with CHAR(5)?

A

it will store the three characters followed by 2 spaces

46
Q

What is a candidate key in terms of functional dependence?

A

K is a candidate key for R iff. K -> R AND for any X proper subset of K, X -/-> R

47
Q

How do you set a foreign key when creating a table in SQL?

A

CREATE TABLE Works_In(ssn CHAR(11), did INTEGER, since DATE,
PRIMARY KEY (ssn, did),
FOREIGN KEY (ssn)
REFERENCES Employees)

48
Q

What is lossless decomposition?

A

means when you decompose a table and then join you get the original table back

49
Q

How can you delete a view in SQL?

A

DROP VIEW

50
Q

What are the 4 types of integrity constraints?

A
  • Primary key
  • Foreign key (referential integrity)
  • Value-based
  • Tuple-based
51
Q

What is a tuple based check?

A

CHECK(semester IN (‘Fall’, ‘Winter’, ‘Spring’, ‘Summer’) AND (year>1990))

52
Q

What is the process of shingling?

A
  • get set of shingles
  • represent as a matrix where rows are shingles and columns are documents
  • matrix has 1 if the shingle is in the document
53
Q

How do you create a table in SQL?

A

CREATE TABLE Students(sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa REAL)

54
Q

What is an alternate key?

A

a key that could be the primary key but is not classified as the primary key

55
Q

What is an integrity constraint?

A

a constraint is a relationship among data elements that the DBMS is required to enforce

56
Q

What is a complex check clause?

A

CHECK(time_slot_id IN (SELECT time_slot_id FROM time_slot))

57
Q

What is a superkey in terms of functional dependence?

A

K is a superkey for relation R iff. K -> R

58
Q

What is an information integration system

A
  • a set of local databases, each with a local schema and a local instance
  • A global integrated schema
  • A set of mappings between the global and local schema
59
Q

How do you delete an entire table including schema in SQL?

A

DROP TABLE Students

60
Q

How do you delete the contents of a table but keep the relation in SQL?

A

DELETE FROM Students

61
Q

How do you declare that an attribute can’t be null?

A

name VARCHAR(20) NOT NULL

62
Q

How do you delete certain rows from a table in SQL?

A

DELETE FROM Students
WHERE …

63
Q

What are the 3 rules to identify lossless decomposition?

A

Attribute(R1) UNION Attribute(R2) = Attribute(R)

Attribute(R1) INTERSECTION Attribute(R2) is not empty

Attribute(R1) INTERSECTION Attribute(R2) ->
Attribute(R1) or Attribute(R1) INTERSECTION Attribute(R2) -> Attribute(R2)

64
Q

How do you add an empty column to an SQL table?

A

ALTER TABLE Students
ADD COLUMN firstYear integer

65
Q

What is the phonetic algorithm/sound encoding algorithm?

A
  • algorithm that indexes names by their sounds when pronounced in English
  • consists of the first letter of the name followed by three numbers
  • Remove all W, H
  • B,F,P,V encoded as 1
  • C,G,J,K,Q,S,X,Z encoded as 2
  • D,T encoded as 3
  • L encoded as 4
  • M,N encoded as 5
  • R encoded as 6
  • Remove vowels
66
Q

How do you perform locality sensitive hashing?

A

Divide the signature matrix into b bands, consisting of r rows. When rows are the same in the band, they are hashed to the same bucket. They have b opportunities to be hashed to the same bucket
-signatures hashed to the same bucket are compared

67
Q

What is jaro-winkler similarity?

A

-extension of jaro-similarity that gives higher weight to matching prefix

Jw(S1, S2) = JaroSim + P*L*(1-JaroSim)

  • P is a scaling factor, 0.1 by default
  • L is length of common prefix up to max 4
68
Q

What is the levenshtein edit distance?

A
  • number of operations to convert from 1st string to 2nd string
  • delete and insert character with cost 1
  • substitute character with cost 2
69
Q

How do you set a primary key when creating a table in SQL? (2 ways)

A

CREATE TABLE test (sid INTEGER PRIMARY KEY, name VARCHAR(30), major VARCHAR(30))

CREATE TABLE test (sid INTEGER, name VARCHAR(30), major VARCHAR(30), PRIMARY KEY (sid))

70
Q

What happens as the signatures get longer?

A

Smaller expected error

71
Q

What are the 3 steps for finding similar documents?

A
  1. Shingling: convert documents to sets
  2. Min-hashing: convert large sets to short signatures, while preserving similarity
  3. Locality-sensitive hashing: focus on pairs of signatures likely to be from similar documents (candidate pairs)
72
Q

What are entity hierarchies?

A

organise a group of entity sets into a parent/child hierarchy

73
Q

What is entity resolution?

A

identify the different structures/records that model the same real-world object