Reader Week 1 Flashcards
How is privacy a continuum?
Almost any publication of real-life data or results based on analysis of such data
infringes to some degree on privacy. Hence, an absolute guarantee of complete privacy
is impossible. Rather, privacy is a continuum: some releases of data (or results from
analyses based thereon) are more harmful to privacy than others.
What is differential privacy, the first objective?
In this course, you will learn about a solid mathematical framework called differential
privacy (DP) that helps you quantify the loss in privacy by releasing data or results into
the public domain. Moreover, you will learn about algorithms that actually give you
control over the amount of privacy that is lost by such a release. These algorithms,
in essence, simply add some noise to the data.
What do the algorithms do?
Add noise to the data
What is the second objective of this course?
Intuitively, it should be clear that such perturbations (the noise added by these
algorithms) will affect parameter estimates and their uncertainty (e.g., when estimating
the effect of years of schooling on income later in life). Hence, the second objective of
this course will be to find ways to estimate parameters consistently, and to quantify
their variance and perform hypothesis tests correctly, when using such algorithms.
What is a set?
A set is a collection of elements.
What is each element considered to be?
Each element is considered to be distinct (e.g., different
numbers on the line of real numbers, different points in time, different individuals,
different colours).
What is the difference between countably finite, countably infinite and uncountable sets?
Countable sets
can be written down as a comma-separated list between curly brackets, where each item
in that list constitutes an element: S = {e1 , e2 , …}. The order in which these countable
elements are listed does not change the set.
In probability theory, countable sets often
arise when considering discrete data (e.g., when modelling count data) and uncountable
sets typically arise when considering continuous data.
The set of natural numbers N = {0, 1, 2, 3, …} is countably infinite.
The set of real numbers R = (−∞, ∞) is uncountable.
How is an empty set denoted?
The empty set is denoted by ∅ or {}. Set T is said to be a subset of S, which is denoted
by T ⊆ S, if each element in T is also found in S.
When is set T a subset and when a strict subset?
If T ⊆ S is such that S has at least one element that is not found in T , then T is said
to be a strict subset of S, which is denoted by T ⊂ S.
How are intersection, union, and set difference defined?
For sets A and B, A ∩ B denotes the intersection of A and B, which results in a set
of elements that are found in both A and B; A ∪ B denotes the union of A and B,
which results in a set of elements that are found either in A, in B, or in both; and A \ B
(alternative notation A − B) denotes the set difference, i.e., the set of elements in A that are not in B.
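The subset and set-operation definitions above can be tried out with Python's built-in set type. This is only a small sketch; the concrete elements are made up for illustration, and the names A, B, S, and T simply mirror the notation used in these cards.

```python
# Hypothetical example sets; only the names A, B, S, T come from the cards.
A = {1, 2, 3, 4}
B = {3, 4, 5}

intersection = A & B   # A ∩ B: elements found in both A and B
union = A | B          # A ∪ B: elements found in A, in B, or in both
difference = A - B     # A \ B: elements in A that are not in B

# The order in which elements are listed does not change the set.
same_set = {1, 2, 3} == {3, 2, 1}

S = {1, 2, 3}
T = {1, 2}
is_subset = T <= S             # T ⊆ S: each element of T is also in S
is_strict_subset = T < S       # T ⊂ S: additionally, S has an element not in T
s_strict_subset_of_s = S < S   # S ⊆ S holds, but S ⊂ S does not
empty_is_subset = set() <= S   # the empty set is a subset of every set
```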
What is a multiset?
A multiset generalises the idea of a countable set. In a multiset, countable element ej
occurs nj times, where nj is a nonnegative integer. Multisets are also referred to as bags.
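One way to picture a multiset is Python's collections.Counter, which stores for each element ej its count nj. This is a sketch under that assumption; the colour values are invented for illustration.

```python
from collections import Counter

# A bag in which "red" occurs three times and "blue" once (made-up data).
bag = Counter(["red", "blue", "red", "red"])

n_red = bag["red"]      # multiplicity of an element that occurs several times
n_green = bag["green"]  # an element that does not occur has multiplicity 0

# Dropping the multiplicities leaves an ordinary set of distinct elements.
distinct_elements = set(bag)
```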
What is a database?
A database is a collection of tables (e.g., a database
called Sales may comprise tables called Customers, Orders, ProductsInOrders,
Products).
What is a table?
A table is a collection of records (e.g., all customers in the Customers
table). A table is often also referred to as a relation.
What is an attribute?
In a given table, we wish to record
one or more properties of interest for all records (e.g., the address of each customer).
Such properties are called attributes.
Ideally, attributes should be measured in the same way across records (e.g., we do not want age in months for some records and age in
years for other records in the same table).
Attributes are sometimes also referred to as
variables or fields.
What is a record?
A record can be visualised as a row in a given table with data for each of its attributes
(e.g., all data for a given customer in the Customers table). Entry, tuple, and
observation typically mean the same as record. In a given record, in principle, the value
for a particular attribute may be missing (e.g., denoted by NA, NULL, or ⊥).
What is a key?
Many tables contain one or more keys. A key is an attribute (or a combination of
attributes) for which the values enable us to uniquely identify each record (e.g., custID
to identify each customer in the Customers table). Often, artificially generated numbers
or strings (also known as identifiers; IDs) are used as a key (e.g., BSN, SSN, customer
ID, student no.).
What is the difference between tables that permit duplicate tuples and others that do not?
Some tables (or transformations thereof) permit duplicate tuples and other tables
do not. In case duplicates are permitted, a table is a so-called bag or multiset of tuples.
In a multiset, a given tuple t can occur n = 0, 1, 2, … times. In case duplicates are not
permitted, a table is a so-called set of tuples. In a set, a given tuple t can occur either
one time or not at all in the given table. In this course, we consider the general case in
which tables are treated as multisets.
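The difference between the two views can be seen in a small sketch using Python's sqlite3 module. The table R and its contents are hypothetical; the table is created without a key, so duplicate tuples are permitted.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
# No key is declared, so the same tuple may be stored more than once.
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(1, "x"), (1, "x"), (2, "y")])

rows = conn.execute("SELECT A, B FROM R").fetchall()
conn.close()

as_multiset = Counter(rows)  # multiset view: tuple (1, "x") is counted twice
as_set = set(rows)           # set view: each tuple occurs at most once
```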
What is a database schema?
Table 1 shows an example of this 2D representation of table R(A, B, C ) in database
X . The notation R(A, B, C ) is called a relation schema: it specifies the name of the
table (here: R), followed by a comma-separated list between parentheses, indicating the
names of the attributes (here: A, B, and C ). The set of all relation schemas in a given
database is called the database schema.
What are some assumptions made in this course?
In this course, we assume a given database to be about a specific instance of each
table in that database (and not all possible instances). The distinction between all
possible instances of a given table and a specific instance of that table, is analogous to a
set (e.g., the set of all real numbers, denoted by R) and a particular element from that
set (e.g., 19 ∈ R).
Also, unless otherwise stated, when we talk about two different databases D and D′ ,
we mean databases with the same database schema (i.e., both have the same relation
schemas), but with different instances: so the structure of the data is the same, the only
difference lies in which particular rows are or are not found in the tables of D and D′ .
What does this SQL statement do?
SELECT *, A+B AS sumAB, C-D AS difCD
FROM R
WHERE E>10;
This query tells the RDBMS to do the following: from table R, only consider the tuples
for which the value of attribute E exceeds 10, and for those tuples report the value of
all attributes separately (denoted by *), the sum of the values of attributes A and B,
and the difference between C and D. In the output, the sum of A and B is referred to
as sumAB and the difference between C and D is referred to as difCD: they have been
assigned a so-called alias by using the AS keyword.
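To see the query in action, it can be run against a small in-memory SQLite database from Python. This is a sketch only: the three rows in R are invented, and SQLite is just one possible RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B INTEGER, C INTEGER, D INTEGER, E INTEGER)")
# Made-up rows; only the last two have E > 10.
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?, ?, ?)",
    [(1, 2, 10, 4, 5), (3, 4, 9, 1, 11), (5, 6, 8, 8, 20)],
)

cursor = conn.execute("SELECT *, A+B AS sumAB, C-D AS difCD FROM R WHERE E>10;")
columns = [d[0] for d in cursor.description]  # output names, including the aliases
rows = cursor.fetchall()
conn.close()
```

The output has the five original attributes followed by the two aliased columns, and the row with E = 5 is filtered out.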
What does this join SQL statement do?
SELECT orderID, address
FROM Orders, Customers
WHERE Orders.custID=Customers.custID;
Assuming the Orders and Customers tables both have an attribute called custID, where
custID in Orders tells us which unique customer that order belongs to, the above
query effectively matches each order to a unique customer via the condition after the WHERE
keyword. That condition states that custID from the two relations must match. This
kind of matching is called a join. Once the tables have been joined, we specify after
the SELECT keyword that the RDBMS should tell us the identifier of the order (i.e.,
orderID) and the address of the corresponding customer (i.e., address).
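A minimal sketch of this join, run on SQLite from Python. The customers, addresses, and orders are invented for illustration. The second query uses the explicit JOIN ... ON syntax, which expresses the same matching condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (custID INTEGER PRIMARY KEY, address TEXT)")
conn.execute("CREATE TABLE Orders (orderID INTEGER PRIMARY KEY, custID INTEGER)")
# Made-up data: customer 1 has two orders, customer 2 has one.
conn.executemany("INSERT INTO Customers VALUES (?, ?)", [(1, "1 Main St"), (2, "2 Side Rd")])
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [(100, 1), (101, 2), (102, 1)])

# The query from the card: the join condition is given after WHERE.
implicit_join = conn.execute(
    "SELECT orderID, address FROM Orders, Customers "
    "WHERE Orders.custID=Customers.custID;"
).fetchall()

# Same matching, written with explicit JOIN ... ON.
explicit_join = conn.execute(
    "SELECT orderID, address FROM Orders "
    "JOIN Customers ON Orders.custID = Customers.custID;"
).fetchall()
conn.close()
```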
What does this statement do?
SELECT COUNT(*) AS n, SUM(A) AS sumA
FROM R;
Another important aspect of the SELECT statement is that it allows us to do basic
aggregations of our data: we can calculate means, sums, and counts.
This statement tells the RDBMS to count the number of rows in R and to calculate the
sum of attribute A across rows.
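A small sketch of these aggregations on SQLite, with invented values for A. The AVG column and its alias meanA are additions for illustration, showing how a mean could be requested in the same statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER)")
conn.executemany("INSERT INTO R VALUES (?)", [(2,), (3,), (7,)])  # made-up values

# Without GROUP BY, the aggregates are computed over all rows and one row is returned.
n, sumA, meanA = conn.execute(
    "SELECT COUNT(*) AS n, SUM(A) AS sumA, AVG(A) AS meanA FROM R;"
).fetchone()
conn.close()
```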
What does this do?
SELECT A, SUM(B) AS sumB
FROM R
GROUP BY A;
Suppose that for each unique
value of attribute A, we wish to calculate the sum of attribute B across all tuples with that
particular value for A, and then report both that value of A and the corresponding sum
of B. Each unique value of A then constitutes a group. Such a query can be performed
using the GROUP BY keyword.
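The grouping can be illustrated on SQLite with a few invented rows. In this sketch the result is collected into a dictionary that maps each unique value of A to its sum of B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER)")
# Made-up data with three groups: "a", "b", and "c".
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", -5), ("b", 3), ("c", 0)],
)

rows = conn.execute("SELECT A, SUM(B) AS sumB FROM R GROUP BY A;").fetchall()
conn.close()

sum_by_group = dict(rows)  # one output row per unique value of A
```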
What does this do:
SELECT A, SUM(B) AS sumB
FROM R
GROUP BY A
HAVING sumB>0;
Now consider the case where we want to filter out groups that do not meet certain
criteria (e.g., filter out groups for which the sample size is too low). This can be readily
done using the HAVING keyword.
This code has almost the same objective as the code in the previous example. However,
in this case, for each unique value of A (i.e., each group), the results for that group are
only reported if sumB exceeds zero in that group.
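The effect of HAVING can be seen by running the query with and without it on the same invented data. This sketch uses SQLite, which accepts the alias sumB inside HAVING. Some other systems (PostgreSQL, for example) require the aggregate to be repeated, as in HAVING SUM(B)>0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER)")
# Made-up data: the group sums are a -> 3, b -> -2, c -> 0.
conn.executemany(
    "INSERT INTO R VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", -5), ("b", 3), ("c", 0)],
)

all_groups = dict(
    conn.execute("SELECT A, SUM(B) AS sumB FROM R GROUP BY A;").fetchall()
)
kept_groups = dict(
    conn.execute("SELECT A, SUM(B) AS sumB FROM R GROUP BY A HAVING sumB>0;").fetchall()
)
conn.close()
```

Note that WHERE filters individual tuples before grouping, whereas HAVING filters whole groups after the aggregate has been computed.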
What is relational algebra?
The previously discussed SQL
statements can be written mathematically as a short sequence of appropriate operators applied to multisets. This notation is referred to as relational algebra. For instance,
the result of query SELECT A FROM R WHERE B > 10 can be written in this algebra as
y = π_A(σ_{B>10}(R)). Here, we generalise this to the following notation: y = q(D),
where q should be thought of as a function that applies multiset operators to the tables
in database D, and y is the name we assign to the output.
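The notation y = q(D) can be mimicked in a few lines of Python. This is a rough sketch, not a full relational algebra: the helper functions sigma and pi, the representation of a table as a list of dictionaries, and the rows in R are all assumptions made for illustration.

```python
def sigma(predicate, table):
    """Selection: keep the tuples for which the predicate holds."""
    return [t for t in table if predicate(t)]


def pi(attributes, table):
    """Projection: keep only the named attributes.

    Duplicates are retained, in line with the multiset view of tables.
    """
    return [tuple(t[a] for a in attributes) for t in table]


# Hypothetical database D with a single table R(A, B).
D = {"R": [{"A": 1, "B": 5}, {"A": 2, "B": 20}, {"A": 2, "B": 30}]}


def q(database):
    """SELECT A FROM R WHERE B > 10, written as a projection of a selection."""
    return pi(["A"], sigma(lambda t: t["B"] > 10, database["R"]))


y = q(D)  # the value 2 appears twice because duplicates are kept
```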
What is a data curator?
Say there is a data curator with access to a database which holds records about individuals.
This data curator is in charge of what happens to the database, i.e., what
analyses are performed on it.
What is the crux of Differential Privacy (DP)?
The crux of DP is then as follows: the
curator wants to offer a way for users (e.g., researchers) to be able to learn about the
properties of the set of individuals as a whole, without learning too much about any
particular individual.
What will DP ensure?
So DP will ensure that results from a query of the database will not be affected
too much by whether or not any particular individual is present in the data. In other
words, under DP, the way the outcome of a query behaves should hardly change when
a single individual is removed or added to the database. We talk here about behaviour
of the outcome, rather than a specific outcome, because DP is all about adding a bit of
randomness to the results of queries.
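To make "behaviour of the outcome" concrete, here is a rough sketch of a count query with random noise added, run on a database D and on a database D_prime from which one individual has been removed. The cards only say that noise is added; the choice of Laplace noise, the scale value, the function names, and the individuals are all assumptions for illustration, not the course's prescribed algorithm.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(database, scale, rng):
    """Count the records and add random noise (an assumed mechanism)."""
    return len(database) + laplace_noise(scale, rng)


rng = random.Random(0)               # fixed seed so the sketch is reproducible
D = ["ann", "bob", "cem", "dee"]     # hypothetical individuals
D_prime = D[:-1]                     # same schema, one individual removed

# A single noisy answer is random; what matters is how the answers are distributed.
n_draws = 10_000
answers_D = [noisy_count(D, 1.0, rng) for _ in range(n_draws)]
answers_D_prime = [noisy_count(D_prime, 1.0, rng) for _ in range(n_draws)]

mean_D = sum(answers_D) / n_draws
mean_D_prime = sum(answers_D_prime) / n_draws
```

The two sets of answers overlap heavily, so one noisy answer reveals little about whether the removed individual was present, while the averages still sit close to the true counts of 4 and 3.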