Week 3: Advanced SQL and Data Integration Flashcards
What is the check clause?
defines constraints on the values of a particular attributes
What is jaccard bag similarity?
Jaccard similarity but counts repetition of elements
Max is 0.5
What are personalisation and customisation in terms of data?
- personalisation: content adapted to user
- Customisation: structure adapted to user
What is the official definition of a lossless decomposition?
let R a relation schema (with constraints). A decomposition R1, R2, … Rn of R is called lossless iff. for each valid relation (instance) r(R): r = piR1(r) join piR2(r) join …
What functional dependencies can be derived from armstrongs axoims? (4)
if X -> Y and X -> W, then X -> YW
if X -> YW, then X -> Y and X -> W
if X -> Y and WY -> Z, then XW -> Z
If X is a candidate key, then X -> Y for all Y
How do you create a view in SQL?
CREATE VIEW YoungActiveStudents (name, grade)
AS SELECT S.name, E.grade
FROM Students S, Enrolled E
Where S.Sid = E.Sid and S.age <21
Why is entity resolution useful? (3)
improves data quality and integrity, fosters re-use of existing data sources, optimises space
What is the standard blocking algorithm?
- Select the most appropriate attribute name(s) with respect to noise and distinctiveness
- Transform the corresponding value(s) into a blocking key (BK)
- For each BK, create one block that contains all entities having this BK in their transformation
* works as a hash function - blocks on the equality of BKs
What is the similarity of two signatures?
the fraction of the hash functions in which they agree (for which signatures have the same value)
What is entity integrity?
Means the primary key cannot be null
How do you delete an attribute from an SQL table?
ALTER TABLE Students
DROP firstYear
What is a legal instance of a relation?
one that satisfies all specified ICs
What is a view in SQL and what is it used for?
A view is just a relation, but we store a definition, rather than a set of tuples
Views can be used to present necessary information (or a summary) while hiding details in underlying relations
What is locality sensitive hashing?
generate from the collection of all elements (signatures) a small list of candidate pairs: pairs of elements whose similarity must be evaluated
What is a similarity metric for documents?
Represent document as a set of its k-shingles
Use jaccard set similarity
What are the three ways of enforcing referential integrity in SQL?
- NO ACTION - delete/update is rejected
- CASCADE - make the same changes to all tuples that refer to the updated/deleted tuple
- SET NULL / SET DEFAULT - sets foreign key value of referencing tuple to NULL or a default value
What is the purpose of integrity constraints?
Integrity constraints guard against accidental damage to the database, by ensuring that authorised changes do not result in a loss of data consistency
What is a candidate key in terms of closure?
The minimal set of attributes for which their closure is all the set of attributes in a relation
How do you find the similarity of two sets(documents) from the shingle matrix?
no. of rows where both columns are 1 / no. of rows where either column is 1
How do you add a check into an SQL table?
CREATE TABLE section( semester VARCHAR(6) CHECK(semester IN (‘Fall’, ‘Winter’, ‘Spring’, ‘Summer’)), year NUMERIC(4,0) CHECK(year>1990))
What is functional dependence?
- the values of a set of attributes X determine the values of another set of attributes Y
- Denoted by X -> Y (X determines Y)
What are covering constraints?
instances of the children of an entity include all instances of their parent (ie cover it)
What is the key idea of hashing?
“hash”each column C to a small signature h(C), such that:
- H(C) is small enough that the signature fits in RAM
- sim(C1,C2) is the same as the “similarity” of signatures h(C1) and h(C2)
What is the gap distance?
- overcome limitation of edit distance with shortened strings (abbreviations)
- Considered two extra operations: open gap and extend gap (with small cost)
Cost = insert + open + extend
What is jaccard similarity?
-The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
sim(C1,C2) = |C1 intersection C2| / |C1 union C2|
How do you perform min-hashing?
- start with a permutation for each row of the boolean matrix
- get the first occurring row in the permuted order that correlates to a 1 in the document and store this as the new column value in the Signature matrix
- repeat for many permutations
What is jaccard distance?
d(C1,C2) = 1 - sim(C1,C2)
What is jaro-similarity?
JaroSim(S1, S2) = (1/3)*(C/|S1| + C/|S2| + (C-T)/C)
C=common/matching characters in S1 and S2
T=transpositions/2 (transposition=matching but different sequence order)
characters are considered matching when they are the same and not further than [max(|S1|, |S2|)/2]-1 apart
What are reasons for discrepancies between entities? (4)
Text variations, local knowledge, evolving nature of data, new functionalities
How do you check if a decomposition is lossless using functional dependencies?
For relational schema R(XYZ), the following holds: If X -> Y then the decomposition R1(XY) R2(XZ) is lossless.
What is the normalisation theory for designing a good database? (4)
- 1 table for each entity
- 1 table for each relationship
- Each cell contains a single value
- If you have big relations, decompose them
What is a conditional functional dependency? How is it expressed?
A functional dependency that holds only under some contraint
([country = UK, zip] -> street)
How do you set the method of enforcing referential integrity when creating an SQL table?
CREATE TABLE Enrolled(sid CHAR(20), cid CHAR(20), grade CHAR(2), PRIMARY KEY(sid,cid), FOREIGN KEY(sid) REFERNCES Students
ON DELETE CASCADE
ON UPDATE SET DEFAULT)
What is the property of the minhash function?
Pr[h_pi(C1) = h_pi(C2)] = sim(C1, C2)
the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets
What is the closure of a set of functional dependencies?
The set of functional dependencies logically implied by F is the closure of F, denoted F+
ie. to find the closure B+, find all the attributes that can be determined starting with B
what is visual integration architecture / on demand integration?
-leave the data in the sources
-When a query comes in:
* Determine their relevant sources to the query
* Break down the query into sub-queries for the sources
* Get the answers from the sources, filter them if needed and combine them appropriately
What are overlap constraints?
specify that the children of an entity do/don’t overlap
What are Armstrongs axioms? (3)
- If X subset of Y, then Y -> X (reflexivity)
- If X -> Y, then AX -> AY (augmentation) (A is a set of attributes)
- If X -> Y and Y -> W, then X -> W (transitivity)
What is a k-shingle / k-gram?
-a sequence of k-tokens (characters or words) that appear in the document
E.g. for document D1 = abcab, Set of 2-shingles S(D1) = {ab, bc, ca}
What is the goal of hashing?
find similar columns while computing small signatures
What is the purpose of functional dependencies?
functional dependencies allow us to express constraints that cannot be expressed using superkeys
How do you change the value in an attribute for all or some tuples?
UPDATE Students
SET grade = grade*1.2
WHERE …
How do you define a minhash function?
Define a minhash function h_pi(C) = the index of the first(in the permuted order pi) row which column C has value 1
How do you add a row into an SQL table? (2 ways depending on the order)
INSERT INTO Students
VALUES (53688, ‘Smith’, ‘smith@ee’, 18, 3.2)
INSERT INTO Students(name, Sid, login, age, gpa)
VALUES (‘Smith’, 53688, ‘smith@ee’, 18, 3.2)
What happens if you enter 3 characters into an attribute with CHAR(5)?
it will store the three characters followed by 2 spaces
What is a candidate key in terms of functional dependence?
K is a candidate key for R iff. K -> R AND for any X proper subset of K, X -/-> R
How do you set a foreign key when creating a table in SQL?
CREATE TABLE Works_In(ssn CHAR(11), did INTEGER, since DATE,
PRIMARY KEY (ssn, did),
FOREIGN KEY (ssn)
REFERENCES Employees)
What is lossless decomposition?
means when you decompose a table and then join you get the original table back
How can you delete a view in SQL?
DROP VIEW
What are the 4 types of integrity constraints?
- Primary key
- Foreign key (referential integrity)
- Value-based
- Tuple-based
What is a tuple based check?
CHECK(semester IN (‘Fall’, ‘Winter’, ‘Spring’, ‘Summer’) AND (year>1990))
What is the process of shingling?
- get set of shingles
- represent as a matrix where rows are shingles and columns are documents
- matrix has 1 if the shingle is in the document
How do you create a table in SQL?
CREATE TABLE Students(sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa REAL)
What is an alternate key?
a key that could be the primary key but is not classified as the primary key
What is an integrity constraint?
a constraint is a relationship among data elements that the DBMS is required to enforce
What is a complex check clause?
CHECK(time_slot_id IN (SELECT time_slot_id FROM time_slot))
What is a superkey in terms of functional dependence?
K is a superkey for relation R iff. K -> R
What is an information integration system
- a set of local databases, each with a local schema and a local instance
- A global integrated schema
- A set of mappings between the global and local schema
How do you delete an entire table including schema in SQL?
DROP TABLE Students
How do you delete the contents of a table but keep the relation in SQL?
DELETE FROM Students
How do you declare that an attribute can’t be null?
name VARCHAR(20) NOT NULL
How do you delete certain rows from a table in SQL?
DELETE FROM Students
WHERE …
What are the 3 rules to identify lossless decomposition?
Attribute(R1) UNION Attribute(R2) = Attribute(R)
Attribute(R1) INTERSECTION Attribute(R2) is not empty
Attribute(R1) INTERSECTION Attribute(R2) ->
Attribute(R1) or Attribute(R1) INTERSECTION Attribute(R2) -> Attribute(R2)
How do you add an empty column to an SQL table?
ALTER TABLE Students
ADD COLUMN firstYear integer
What is the phonetic algorithm/sound encoding algorithm?
- algorithm that indexes names by their sounds when pronounced in English
- consists of the first letter of the name followed by three numbers
- Remove all W, H
- B,F,P,V encoded as 1
- C,G,J,K,Q,S,X,Z encoded as 2
- D,T encoded as 3
- L encoded as 4
- M,N encoded as 5
- R encoded as 6
- Remove vowels
How do you perform locality sensitive hashing?
Divide the signature matrix into b bands, consisting of r rows. When rows are the same in the band, they are hashed to the same bucket. They have b opportunities to be hashed to the same bucket
-signatures hashed to the same bucket are compared
What is jaro-winkler similarity?
-extension of jaro-similarity that gives higher weight to matching prefix
Jw(S1, S2) = JaroSim + P*L*(1-JaroSim)
- P is a scaling factor, 0.1 by default
- L is length of common prefix up to max 4
What is the levenshtein edit distance?
- number of operations to convert from 1st string to 2nd string
- delete and insert character with cost 1
- substitute character with cost 2
How do you set a primary key when creating a table in SQL? (2 ways)
CREATE TABLE test (sid INTEGER PRIMARY KEY, name VARCHAR(30), major VARCHAR(30))
CREATE TABLE test (sid INTEGER, name VARCHAR(30), major VARCHAR(30), PRIMARY KEY (sid))
What happens as the signatures get longer?
Smaller expected error
What are the 3 steps for finding similar documents?
- Shingling: convert documents to sets
- Min-hashing: convert large sets to short signatures, while preserving similarity
- Locality-sensitive hashing: focus on pairs of signatures likely to be from similar documents (candidate pairs)
What are entity hierarchies?
organise a group of entity sets into a parent/child hierarchy
What is entity resolution?
identify the different structures/records that model the same real-world object