Reader Week 1 Flashcards

1
Q

How is privacy a continuum?

A

almost any publication of real-life data or results based on analysis of such data
infringes to some degree on privacy. Hence, an absolute guarantee to complete privacy
is impossible. Rather, privacy is a continuum: some releases of data (or results from
analyses based thereon) are more harmful to privacy than others.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is differential privacy, the first objective?

A

In this course, you will learn about a solid mathematical framework called differential
privacy (DP) that helps you quantify the loss in privacy by releasing data or results into
the public domain. Moreover, you will learn about algorithms that actually give you
control over the amount of privacy that is lost by such a release. These algorithms,
in essence, simply add some noise to the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What do the algorithms do?

A

Add noise to the data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the second objective of this course?

A

Intuitively, it should be clear that such
perturbations will affect parameter estimates and their uncertainty (e.g., when estimating
the effect of years of schooling on income later in life). Hence, the second objective of this course will be to find ways to estimate parameters consistently, and to quantify
their variance and perform hypothesis tests correctly, when using such algorithms.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a set?

A

A set is a collection of elements.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is each element considered to be?

A

Each element is considered to be distinct (e.g., different
numbers on the line of real numbers, different points in time, different individuals,
different colours).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is the difference between countably finite, countably infinite and uncountable sets?

A

Countable sets
can be written down as a comma-separated list between curly brackets, where each item
in that list constitutes an element: S = {e1 , e2 , …}. The order in which these countable
elements are listed does not change the set.

In probability theory, countable sets often
arise when considering discrete data (e.g., when modelling count data) and uncountable
sets typically arise when considering continuous data.

The set of natural numbers N = {0, 1, 2, 3, …} is countably infinite.

The set of real numbers R = (−∞, ∞) is uncountable.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How is an empty set denoted?

A

The empty set is denoted by ∅ or {}. Set T is said to be a subset of S, which is denoted
by T ⊆ S, if each element in T is also found in S.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

When is set T a subset and when a strict subset?

A

If T ⊆ S is such that S has at least one element that is not found in T , then T is said
to be a strict subset of S, which is denoted by T ⊂ S.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How are intersection, union defined?

A

For sets A and B, A ∩ B denotes the intersection of A and B, which results in a set
of elements that are found in both A and B; A ∪ B denotes the union of A and B,
which results in a set of elements that are found either in A, in B, or in both; and A \ B (in A not B)
(alternative notation A − B) denotes the set of elements in A that are not in B.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a multiset?

A

A multiset generalises the idea of a countable set. In a multiset, countable element ej
occurs nj times, where nj is a nonnegative integer. Multisets are also referred to as bags.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is a database?

A

A database is a collection of tables (e.g., a database
called Sales may comprise tables called Customers, Orders, ProductsInOrders,
Products).W

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is a table?

A

A table is a collection of records (e.g., all customers in the Customers
table). A table is often also referred to as a relation.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is an attribute?

A

In a given table, we wish to record
one or more properties of interest for all records (e.g., the address of each customer).
Such properties are called attributes.

Ideally, attributes should be measured in the same way across records (e.g., we do not want age in months for some records and age in
years for other records in the same table).

Attributes are sometimes also referred to as
variables or fields.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is a record?

A

A record can be visualised as a row in a given table with data for each of its at-
tributes (e.g., all data for a given customer in the Customers table). Entry, tuple, and
observation typically mean the same as record. In a given record, in principle, the value
for a particular attribute may be missing (e.g., denoted by NA, NULL, or ⊥).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is a key?

A

Many tables contain one or more keys. A key is an attribute (or a combination of
attributes) for which the values enable us to uniquely identify each record (e.g., custID
to identify each customer in the Customers table). Often, artificially generated numbers
or strings (also known as identifiers; IDs) are used as a key (e.g., BSN, SSN, customer
ID, student no.).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is the difference between tables that permit duplicate tuples and others that do not?

A

Some tables (or transformations thereof) permit duplicate tuples and other tables
do not. In case duplicates are permitted, a table is a so-called bag or multiset of tuples.
In a multiset, a given tuple t can occur n = 0, 1, 2 … times. In case duplicates are not
permitted, a table is a so-called set of tuples. In a set, a given tuple t can occur either
one time or not at all in the given table. In this course, we consider the general case in
which tables are treated as multisets.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is a database scheme?

A

Table 1 shows an example of this 2D representation of table R(A, B, C ) in database
X . The notation R(A, B, C ) is called a relation schema: it specifies the name of the
table (here: R), followed by a comma-seperated list between parentheses, indicating the
names of the attributes (here: A, B, and C ). The set of all relation schemas in a given
database is called the database schema.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What are some assumptions made in this course?

A

In this course, we assume a given database to be about a specific instance of each
table in that database (and not all possible instances). The distinction between all
possible instances of a given table and a specific instance of that table, is analogous to a
set (e.g., the set of all real numbers, denoted by R) and a particular element from that
set (e.g., 19 ∈ R).

Also, unless otherwise stated, when we talk about two different databases D and D′ ,
we mean databases with the same database schema (i.e., both have the same relation
schemas), but with different instances: so the structure of the data is the same, the only
difference lies in which particular rows are or are not found in the tables of D and D′ .

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What does this SQL statement:

SELECT *, A+B AS sumAB, C-D AS difCD
FROM R
WHERE E>10;

A

This query tells the RDMBS to do the following: from table R, only consider the tuples
for which the value of attribute E exceeds 10, and for those tuples report the value of
all attributes separately (denoted by *), the sum of the values of attributes A and B,
and the difference between C and D. In the output, the sum of A and B is referred to
as sumAB and the difference between C and D is referred to as difCD: they have been
assigned a so-called alias by using the AS keyword.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What does this join SQL statement do?

SELECT orderID, address
FROM Orders, Customers
WHERE Orders.custID=Customers.custID;

A

Assuming the Orders and Customers both have an attribute called custID, where
custID in Orders tells us which unique customer that order belongs to, the above
query effectively matches each order to a unique customer via the condition after WHERE
keyword. That condition states that custID from the two relations must match. This
kind of matching is called a join. Once the tables have been joined, we specify after
the SELECT keyword that the RDBMS should tell us the identifier of the order (i.e.,
orderID) and the address of the corresponding customer (i.e., address).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What does this statement do?

SELECT COUNT(*) AS n, SUM(A) AS sumA
FROM R;

A

Another important aspect of the SELECT statement is that it allows us to do basic
aggregations of our data: we can calculate means, sums, and counts.
This statement tells the RDMS to count the number of rows in R and to calculate the
the sum of attribute A across rows.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What does this do?

SELECT A, SUM(B) AS sumB
FROM R
GROUP BY A;

A

Suppose that for each unique
value of attribute A, we wish to calculate the sum of attribute B across all tuples with that
particular value for A, and then report both that value of A and the corresponding sum
of B. Each unique value of A then constitutes a group. Such a query can be perfomed
using the GROUP BY keyword

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

What does this do:

SELECT A, SUM(B) AS sumB
FROM R
GROUP BY A
HAVING sumB>0;

A

Now consider the case where we want to filter out groups that do not meet certain
criteria (e.g., filter out groups for which the sample size is too low). This can be readily
done using the HAVING keyword

This code has almost the same objective as the code in the previous example. However,
in this case, for each unique value of A (i.e., each group), the results for that group are
only reported if sumB exceeds zero in that group

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

What is relational algebra?

A

The previously discussed SQL
statements can be written mathematically as a short sequence of appropriate operators applied to multisets. This notation is referred to as relational algebra. For instance,
the result of query SELECT A FROM R WHERE B > 10 can be written in this algebra as
y = πA (σB>10 (R)). Here, we generalise this to the following notation: y = q (D),
where q should be thought of as a function that applies multiset operators to the tables
in database D, and y is the name we assign to the output.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

What is a data curator?

A

Say there is a data curator with access to a database which holds records about in-
dividuals. This data curator is in charge of what happens to the the database, i.e. what
analyses are performed on it.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

What is the crux of Differential Privacy (DP)?

A

The crux of DP is then as follows: the
curator wants to offer a way for users (e.g., researchers) to be able to learn about the
properties of the set of individuals as a whole, without learning too much about any
particular individual.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What will DP ensure?

A

So DP will ensure that results from a query of the database, will not be affected
too much by whether or not any particular individual is present in the data. In other
words, under DP, the way the outcome of a query behaves should hardly change when
a single individual is removed or added to the database. We talk here about behaviour
of the outcome, rather than a specific outcome, because DP is all about adding a bit of
randomness to the results of queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

So how is the promise of DP achieved?

How can the data curator ensure that the
outcomes of analyses on the database do not depend too much on any particular indi-
vidual?

A

To make an analysis DP, we will see that you will have to introduce randomness
into it, such that there is no longer a deterministic relationship between the database
and the outcomes of analyses on the database.

30
Q

How can that randomness be introduced?

A

Intuitively, this ensures that, based on the outcome of the analysis, someone querying
the database cannot be entirely sure who is and who is not part of the true database.
Hence, instead of returning the true outcome of the analysis, noise must be injected. If
this noise is adequately calibrated, then the results of queries will be very similar when
the entire database is used or when one person is left out. In that case, the mechanism
(aka process or algorithm) that answers a query with appropriate noise is called DP.

31
Q

What are the two main settings of DP?

A

(i) The offline model, where the
data curator constructs a synthetic database in a DP manner, which can then be made
public once and for all, and which researchers can use freely.

(ii) The online model,
where researchers do not get direct access to the database, but instead they can request
the data curator to answer queries on the database, which the data curator will do in a
DP manner. The online setting will be the default in this course, but we will also touch
upon the offline setting later in the course.

32
Q

Why does anonymization not work as a strategy to protect privacy?

A

The first stategy you might think of when wanting to protect privacy of individuals in
a database is anonymization. Anonymization entails removing personally identifiable
information from the data. Surely, if it is not stated which records in the database belong to which individual,
then publishing the dataset will not lead to leakage of sensitive information? Unfor-
tunately, this reasoning turns out to be wrong, because we may be able to re-identify
individuals using a so-called linkage attack: by matching the anonymized database to
other publicly available non-anonymized datasets, an attacker might be able to pin down
which records of the ‘anonymized’ database correspond to which individuals.

A solution can be to hide more information, such as the date
of birth and the ZIP code. But how do we know what information should be removed and
what information can be kept in the database? This question is very difficult to answer,
because it requires knowing what alternative data potential attackers have access to,
now and in the future.

“Data cannot be fully anonymized and remain useful”. In other words, if you want to be
absolutely sure of complete anonymization, you must strip the dataset of all its valuable
information, which then defeats the purpose of publishing the data at all.

Notice also that exact re-identification is not the only problem. It can already be
undesirable to reveal someone’s membership of the dataset without necessarily identifying
which particular record corresponds to them. For instance, consider a collection of
medical records of a clinic specialised in treating a particular type of medical condition.
If an attacker finds out that someone is a member of the dataset, then the attacker
immediately knows that this individual has the medical condition, which is a major
privacy violation.

33
Q

What is k-anonymity?

A

A common privacy notion from the scientific privacy literature is so-called k-anonymity.
This notion of privacy is widespread, for example, in the medical world. The idea of this
approach is quite simple: a dataset is said to be k-anonymous if every combination of
values for the columns containing identifying information appears for at least k different
records in the dataset.

More generally, when considering k-anonimity, one first divides the attributes into three
distinct categories: identifiers (e.g., patientID), quasi-identifiers (e.g., year of birth
and ZIP code), and other attributes (e.g., reason for visit). Obviously, the identifiers
are removed. In that case, a dataset is said to be k-anonymous, if there are at least k
different individuals with exact same values for all quasi-identifiers. Thus, k-anonymity
ensures that each individual in the database is indistinguishable from at least k − 1 other
individuals, in terms of their quasi-identifiers.

Unfortunately, what is qualified as ‘(quasi)-identifying information’ is often very sub-
jective. In fact, even the outcome of interest may serve as an identifier. This subjective partitioning of attributes is one
of the major drawbacks of k-anonymity.

34
Q

What are multiple important drawbacks

A

We can conclude that the notion of k-anonymity has multiple important drawbacks:
(i) the choice of quasi-identifiers is often debatable, (ii) a k-anonymous dataset still
allows an attacker to ‘narrow down the options’ of which records correspond to a target
individual, which can already be considered a violation of privacy, and (iii) it is not clear
what a good choice for k is.

35
Q

Suppose that you are still interested in k-anonymity despite these important draw-
backs. As a given dataset is very unlikely to be k-anonymous, even for k = 2, you will
have to make changes to your dataset to make sure that you achieve k-anonymity. So
how does this work?

A

There are two techniques that can be combined to transform a
dataset to a k-anonymous one: generalization and suppression.

36
Q

What is generalization?

A

Generalization works as follows: you simply make values of certain quasi-identifiers
less specific. For example, instead of reporting the ZIP code, you may choose to report
the first two numbers of the ZIP code.
Or instead of reporting the precise year of
birth, you may choose to report the range in which a date of birth falls. Broader
ranges will mean more individuals with similar quasi-identifier values, but also less precise
information for those who analyse the dataset.

37
Q

What is suppression?

A

Suppression works as follows: sometimes there can be certain outliers in the dataset,
which will make it very hard to achieve k-anonymity.

This is what suppression is about: removing individuals who are too
different from the rest of the sample to be able to obtain a usable k-anonymous dataset
based on generalization.

38
Q

What does best mean in making the best way to make a given dataset k-anonymous?

A

With ‘best’ we mean making
the dataset k-anonymous, while keeping the data as precise as possible. Determining the optimal way of making a dataset k-anonymous is, in fact, an NP-hard problem, but
greedy algorithms exist.

Despite its drawbacks, this privacy notion is intuitive and simple to understand, and
as a result quite widely used in practice, building on a body of literature on how to
make a data release k-anonymous. As we have seen, though, there are multiple major
disadvantages to this approach to protecting privacy—hence, in this course, we consider
the far more stringent approach of DP.

39
Q

What is the query-based approachL: differencing attacks?

A

What if, instead,
the data curator agrees to answer (certain types of) queries about the database? By
not making the data directly available to the public, you might think that the privacy of
the individuals in the database is protected. It turns out, however, that this approach is
also inadequate, as will be illustrated by the following example.

To see how we can actually leverage such statistics, consider an attacker who wants to
compromise the privacy of individual i. This attackers happens to know i participated in
the study, and, moreover, knows the ZIP code and age of that individual.

By stratifying the data in this manner, the attacker has figured out his target consumes
alcohol! Ok, so why do we care? Well, consider there are also fields about consumption
of illegal substances… Sidestepping any discussion about the ethics and morality of the
consumption of substances, this is obviously a gross violation of privacy (and potentially
a legal risk for participants).
Now, one could say: make the system such that it refuses queries that involve groups
that comprise only one underlying observation in any given group.

At this point, you may suggest a more refined approach: there must be at least m
underlying observations per group that is reported in the output. If that requirement is
violated, our RDBMS should reject the query. But even if we set m quite stringently
(e.g., m = 100), a summary statistic based on groups with at least size m can still reveal
quite a lot.

By comparing the results of these two queries, the attacker can tell there are only two
individuals with that ZIP code and age in the data, and that both consume alcohol.
Hence, he again knows for sure that his target consumes alcohol!
A solution to this problem could be to let the system audit the sequence of queries
and responses, and to refuse answering a query if answering it, given the previous queries,
would violate privacy. For example if a query is submitted that could lead to a differencing
attack when combined with one of the previous query responses. This sounds good in
theory, but there are two problems with this approach. The first is that refusing to provide
the answer for a particular query, can itself say something about the query response, and
therefore violate privacy. The second is that auditing a sequence of queries can be
computationally infeasible, especially if the number of different potential queries is very
large.

40
Q

What does this example reveal about the problem of privacy preservation?

A

This example reveals how profound the problem of privacy preservation is, even when
reporting mere summary statistics. At the same time, the example also provides us with
a tool to approach this problem: what is the worst kind of possible attack that we can
conceive, short of hacking the system, and, under such an attack, how can we preserve
privacy to what degree by using various mechanisms? That will be the main question in
this course!

41
Q

What makes DP special?

A
  1. DP wards off attacks
  2. DP allows us to quantify privacy
  3. DP allows for composition
42
Q

What is DP wards off attacks?

A

most previously existing techniques for preserving privacy,
relied on assumptions about possible attacks that people with malicious intent may
try to carry out.
For instance, for anonymization to adequately protect privacy, one must make assumptions about alternative data that attackers might have
access to. Such assumptions will often be too strong and there is simply no way
of verifying whether they are correct. Besides, you need assumptions about what
the intent of a possible attacker might be. Are they trying to re-identify one or
more individuals, or are they just trying to find out whether the data of a particular
individual is in the database at all?

DP does not have this flaw: (i) it works regardless of any auxiliary information
that the attacker might have, and (ii) it protects any kind of information about
the individuals in the dataset, so their membership of the dataset and also their
attributes. Regarding the first point, DP can be shown to be closed under post-
processing, in the sense that any transformation of the output of a DP mechanism,
even when using auxiliary data, will still be DP.

Regarding the second point, in essence, DP protects the privacy of the individuals
in the dataset in the worst-case scenario, namely the scenario in which the attacker
knows all the data in the dataset except for that of their target individual. Intu-
itively, when the privacy of that target individual is protected in this worst-case
scenario, it will also be protected in any other scenario where the attacker has less
information about the rest of the data.

43
Q

What is DP allows us to quantify privacy?

A

an advantage of DP with respect to other
notions of privacy is that it automatically comes with a quantification of privacy.

Instead of either having privacy or not, in the DP framework the concept of ‘pri-
vacy’ takes values on a continuum.

The definition of DP namely contains a parameter ε that determines how much
privacy is potentially lost when using the DP mechanism. This parameter essen-
tially determines the trade-off between privacy and accuracy, where a higher ε
implies more potential privacy loss. Essentially, ε quantifies the highest possible
information gain that an attacker might have from seeing the output of the DP
mechanism. In the coming weeks, this interpretation will be discussed in more
detail. This measure of privacy loss allows for making comparisons among DP
techniques, such as: for a particular privacy loss, which technique leads to a better
accuracy?

44
Q

What is DP allows for composition?

A

importantly, because DP allows us to quantify
privacy loss, it also allows for analysing the total privacy loss when multiple DP
calculations are done on a certain database. It turns out to be fairly easy to
quantify the total privacy loss of a combination of multiple DP techniques. In
practice, organizations or researchers may want to do many different analyses on
a certain database. By the so-called composition property of DP, it is possible
to keep an eye on how much privacy is at risk when doing multiple analyses.
Moreover, the composition property allows us to build and analyse sophisticated
DP algorithms consisting of simpler DP mechanisms.

45
Q

For a PMF to be valid?

A

Pr (X = x) = f (x)

(1) all probabilities lie between
zero and one (i.e., 0 ≤ f (x) ≤ 1 for all x ∈ R) and (2) the probabilities add up to one

46
Q

What is CDF?

A

The cumulative distribution function (CDF) F (x) of RV X is such that
Pr (X ≤ x) = F (x)

47
Q

For a PDF to be valid?

A

Under these definitions, for a continuous RV X , we have that
Z a
Pr (X ≤ a) = F (a) =
integral from a -infinity and a f (x)dx.

For a PDF to be valid, it also needs to satisfy two criteria: (1) the density is nonnegative
everywhere (i.e., f (x) ≥ 0 for all x ∈ R) and (2) the density integrates to one (i.e. integral from -inf to inf f (x)dx = 1). A PDF is permitted to have discontinuities.

48
Q

What is the expectation and some properties of expectation X?

A

The expectation of a function g (X ) of discrete RV X is denoted by E [g (X )] and is
defined as follows:
X
E [g (X )] = sum of x in X of g (x) · f (x)

  1. E [g (X )] itself is not an RV: it is a number.
  2. E [g (X ) + h(Y )] = E [g (X )] + E [h(Y )] for RVs X and Y , and their transforma-
    tions according to functions g and h respectively.
  3. E [a + b · g (X )] = a + b · E [g (X )] for constants a and b.
49
Q

What is the variance?

A

Var (X ) = E (X − E [X ])2 = E X 2 − (E [X ])2

Var (a + b · X ) = b 2 · Var (X )

50
Q

What is the joint distribution of two random variables?

A

joint PMF. Thus,
Pr (X = x, Y = y ) = fX ,Y (x, y )

joint CDF
Pr (X ≤ x, Y ≤ y ) = FX ,Y (x, y )

joint pdf
F_(X,Y) (a,b) = integral -inf to a and integral -inf to b of f_(X,Y) (x,y) dydx

51
Q

What is the marginal distribution?

A

pmf of X and Y
f_X (x) = sum of y in Y of f_X,Y (x,y)
f_Y (y) = sum of x in X of f_X,Y (x, y)

Same for the marginal pdfs but then instead of the sum you take the integral

52
Q

What if X and Y are independent ?

A

f_X ,Y (x, y ) = f_X (x) · f_Y (y )

53
Q

What is the covariance?

A

page 29

54
Q

What is the conditional distribution?

A

page 30

55
Q

What is the conditional expectation?

A

page 30,31

56
Q

What is Bayes theorem for two discrete random variables?

A

page 31-33

57
Q

What is one continuous and one discrete random variable?

A

page 33-35

58
Q

Bernoulli distribution?

A

page 35

59
Q

Binomial distribution?

A

page 35,36

60
Q

What is the continuous uniform distribution?

A

page 36q

61
Q

What is the Gaussian distribution?

A

page 37,38

62
Q

What is the Laplace distribution?

A

page 38W

63
Q

What is the mixture distribution?

A

page 38-39

64
Q

What is average of iid draws?

A

page 39

65
Q

What is the expectation and variance of the sample mean?

A

page 40

66
Q

What is the consistency and asymptotic distribution?

A

page 40

67
Q

What is the standard error?

A

page 40,41

68
Q

What is hypothesis testing?

A

page 41,42

69
Q

What is an example of a perturbed estimator?

A

page 42

70
Q

What is the relative efficiency of two estimators?

A

page 43