Week 3: Advanced SQL and Data Integration Flashcards
What is the check clause?
defines constraints on the values of a particular attributes
What is jaccard bag similarity?
Jaccard similarity but counts repetition of elements
Max is 0.5
What are personalisation and customisation in terms of data?
- personalisation: content adapted to user
- Customisation: structure adapted to user
What is the official definition of a lossless decomposition?
let R a relation schema (with constraints). A decomposition R1, R2, … Rn of R is called lossless iff. for each valid relation (instance) r(R): r = piR1(r) join piR2(r) join …
What functional dependencies can be derived from armstrongs axoims? (4)
if X -> Y and X -> W, then X -> YW
if X -> YW, then X -> Y and X -> W
if X -> Y and WY -> Z, then XW -> Z
If X is a candidate key, then X -> Y for all Y
How do you create a view in SQL?
CREATE VIEW YoungActiveStudents (name, grade)
AS SELECT S.name, E.grade
FROM Students S, Enrolled E
Where S.Sid = E.Sid and S.age <21
Why is entity resolution useful? (3)
improves data quality and integrity, fosters re-use of existing data sources, optimises space
What is the standard blocking algorithm?
- Select the most appropriate attribute name(s) with respect to noise and distinctiveness
- Transform the corresponding value(s) into a blocking key (BK)
- For each BK, create one block that contains all entities having this BK in their transformation
* works as a hash function - blocks on the equality of BKs
What is the similarity of two signatures?
the fraction of the hash functions in which they agree (for which signatures have the same value)
What is entity integrity?
Means the primary key cannot be null
How do you delete an attribute from an SQL table?
ALTER TABLE Students
DROP firstYear
What is a legal instance of a relation?
one that satisfies all specified ICs
What is a view in SQL and what is it used for?
A view is just a relation, but we store a definition, rather than a set of tuples
Views can be used to present necessary information (or a summary) while hiding details in underlying relations
What is locality sensitive hashing?
generate from the collection of all elements (signatures) a small list of candidate pairs: pairs of elements whose similarity must be evaluated
What is a similarity metric for documents?
Represent document as a set of its k-shingles
Use jaccard set similarity
What are the three ways of enforcing referential integrity in SQL?
- NO ACTION - delete/update is rejected
- CASCADE - make the same changes to all tuples that refer to the updated/deleted tuple
- SET NULL / SET DEFAULT - sets foreign key value of referencing tuple to NULL or a default value
What is the purpose of integrity constraints?
Integrity constraints guard against accidental damage to the database, by ensuring that authorised changes do not result in a loss of data consistency
What is a candidate key in terms of closure?
The minimal set of attributes for which their closure is all the set of attributes in a relation
How do you find the similarity of two sets(documents) from the shingle matrix?
no. of rows where both columns are 1 / no. of rows where either column is 1
How do you add a check into an SQL table?
CREATE TABLE section( semester VARCHAR(6) CHECK(semester IN (‘Fall’, ‘Winter’, ‘Spring’, ‘Summer’)), year NUMERIC(4,0) CHECK(year>1990))
What is functional dependence?
- the values of a set of attributes X determine the values of another set of attributes Y
- Denoted by X -> Y (X determines Y)
What are covering constraints?
instances of the children of an entity include all instances of their parent (ie cover it)
What is the key idea of hashing?
“hash”each column C to a small signature h(C), such that:
- H(C) is small enough that the signature fits in RAM
- sim(C1,C2) is the same as the “similarity” of signatures h(C1) and h(C2)
What is the gap distance?
- overcome limitation of edit distance with shortened strings (abbreviations)
- Considered two extra operations: open gap and extend gap (with small cost)
Cost = insert + open + extend
What is jaccard similarity?
-The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
sim(C1,C2) = |C1 intersection C2| / |C1 union C2|
How do you perform min-hashing?
- start with a permutation for each row of the boolean matrix
- get the first occurring row in the permuted order that correlates to a 1 in the document and store this as the new column value in the Signature matrix
- repeat for many permutations
What is jaccard distance?
d(C1,C2) = 1 - sim(C1,C2)
What is jaro-similarity?
JaroSim(S1, S2) = (1/3)*(C/|S1| + C/|S2| + (C-T)/C)
C=common/matching characters in S1 and S2
T=transpositions/2 (transposition=matching but different sequence order)
characters are considered matching when they are the same and not further than [max(|S1|, |S2|)/2]-1 apart
What are reasons for discrepancies between entities? (4)
Text variations, local knowledge, evolving nature of data, new functionalities