Reader Week 1 Flashcards

Question

What is relational algebra?

Answer 1

The previously discussed SQL statements can be written mathematically as a short sequence of appropriate operators applied to multisets. This notation is referred to as relational algebra. For instance, the result of the query SELECT A FROM R WHERE B > 10 can be written in this algebra as $y = \pi_A(\sigma_{B > 10}(R))$. Here, we generalise this to the following notation: $y = q(D)$, where q should be thought of as a function that applies multiset operators to the tables in database D, and y is the name we assign to the output.
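To make the operator view concrete, the following is a minimal Python sketch of selection and projection on a multiset, represented here as a list of rows. The table contents and the helper names `select`, `project` and `q` are illustrative assumptions, not part of the reader.

```python
from collections import Counter

def select(rows, predicate):
    """sigma: keep the rows that satisfy the predicate (duplicates are kept)."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """pi: keep only the named columns (duplicates are kept, as in a multiset)."""
    return [tuple(row[c] for c in columns) for row in rows]

# Hypothetical table R with columns A and B.
R = [
    {"A": "x", "B": 5},
    {"A": "y", "B": 12},
    {"A": "y", "B": 30},
]

def q(D):
    """q(D) = pi_A(sigma_{B > 10}(R)) for the table R stored in database D."""
    return project(select(D["R"], lambda row: row["B"] > 10), ["A"])

y = q({"R": R})
print(Counter(y))  # multiset view of the result: the value 'y' appears twice
```

Note that the projection keeps both copies of the value, which mirrors the multiset semantics of SELECT without DISTINCT.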

Answer 2

Say there is a data curator with access to a database which holds records about individuals. This data curator is in charge of what happens to the database, i.e. what analyses are performed on it.

Answer 3

The crux of DP is then as follows: the curator wants to offer a way for users (e.g., researchers) to be able to learn about the properties of the set of individuals as a whole, without learning too much about any particular individual.

Answer 4

So DP will ensure that the results of a query on the database will not be affected too much by whether or not any particular individual is present in the data. In other words, under DP, the way the outcome of a query behaves should hardly change when a single individual is removed from or added to the database. We talk here about the behaviour of the outcome, rather than a specific outcome, because DP is all about adding a bit of randomness to the results of queries.

Answer 5

To make an analysis DP, we will see that you will have to introduce randomness into it, such that there is no longer a deterministic relationship between the database and the outcomes of analyses on the database.

Answer 6

Intuitively, this ensures that, based on the outcome of the analysis, someone querying the database cannot be entirely sure who is and who is not part of the true database. Hence, instead of returning the true outcome of the analysis, noise must be injected. If this noise is adequately calibrated, then the results of queries will be very similar when the entire database is used or when one person is left out. In that case, the mechanism (aka process or algorithm) that answers a query with appropriate noise is called DP.
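As an illustration of injecting calibrated noise, here is a minimal Python sketch that answers a counting query with Laplace noise. The Laplace mechanism is not named in this card; it is used here as one standard way to add noise, and the noise scale 1/epsilon assumes that adding or removing one individual changes the count by at most 1. The records, the function names and the choice of epsilon are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Draw from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with rate 1/scale
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(rows, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Assumption: adding or removing one individual changes a count by at most 1.
    """
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def is_over_40(row):
    return row["age"] > 40

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
database = [{"age": a} for a in (23, 35, 41, 58, 62)]  # hypothetical records
without_one = database[:-1]  # the same database with one person left out

# The two noisy answers are drawn from similar distributions, so a single
# answer does not reveal with certainty whether the last person is present.
print(noisy_count(database, is_over_40, epsilon=0.5, rng=rng))
print(noisy_count(without_one, is_over_40, epsilon=0.5, rng=rng))
```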

Answer 7

(i) The offline model, where the data curator constructs a synthetic database in a DP manner, which can then be made public once and for all, and which researchers can use freely. (ii) The online model, where researchers do not get direct access to the database, but instead they can request the data curator to answer queries on the database, which the data curator will do in a DP manner. The online setting will be the default in this course, but we will also touch upon the offline setting later in the course.

Answer 8

The first strategy you might think of when wanting to protect the privacy of individuals in a database is anonymization. Anonymization entails removing personally identifiable information from the data. Surely, if it is not stated which records in the database belong to which individual, then publishing the dataset will not lead to leakage of sensitive information? Unfortunately, this reasoning turns out to be wrong, because we may be able to re-identify individuals using a so-called linkage attack: by matching the anonymized database to other publicly available non-anonymized datasets, an attacker might be able to pin down which records of the ‘anonymized’ database correspond to which individuals. A solution can be to hide more information, such as the date of birth and the ZIP code. But how do we know what information should be removed and what information can be kept in the database? This question is very difficult to answer, because it requires knowing what alternative data potential attackers have access to, now and in the future. “Data cannot be fully anonymized and remain useful”. In other words, if you want to be absolutely sure of complete anonymization, you must strip the dataset of all its valuable information, which then defeats the purpose of publishing the data at all. Notice also that exact re-identification is not the only problem. It can already be undesirable to reveal someone’s membership of the dataset without necessarily identifying which particular record corresponds to them. For instance, consider a collection of medical records of a clinic specialised in treating a particular type of medical condition. If an attacker finds out that someone is a member of the dataset, then the attacker immediately knows that this individual has the medical condition, which is a major privacy violation.
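A linkage attack can be sketched in a few lines of Python. The two small datasets, the column names and the `link` helper below are all hypothetical and only illustrate the matching step described above.

```python
# Hypothetical 'anonymized' release: names removed, other columns kept.
anonymized = [
    {"zip": "1011", "birth_year": 1985, "diagnosis": "asthma"},
    {"zip": "1012", "birth_year": 1990, "diagnosis": "diabetes"},
    {"zip": "1011", "birth_year": 1972, "diagnosis": "migraine"},
]

# Hypothetical public dataset that still contains names.
public = [
    {"name": "Alice", "zip": "1011", "birth_year": 1985},
    {"name": "Bob", "zip": "1012", "birth_year": 1990},
]

def link(anonymized_rows, public_rows, keys):
    """Match records on shared attributes; unique matches are re-identified."""
    matches = {}
    for person in public_rows:
        candidates = [
            row for row in anonymized_rows
            if all(row[k] == person[k] for k in keys)
        ]
        if len(candidates) == 1:  # exactly one record fits this person
            matches[person["name"]] = candidates[0]
    return matches

reidentified = link(anonymized, public, keys=("zip", "birth_year"))
for name, record in reidentified.items():
    print(name, "->", record["diagnosis"])
```

In this toy example, ZIP code and year of birth together are enough to single out both named individuals, even though the released table contains no names.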

Answer 9

A common privacy notion from the scientific privacy literature is so-called k-anonymity. This notion of privacy is widespread, for example, in the medical world. The idea of this approach is quite simple: a dataset is said to be k-anonymous if every combination of values for the columns containing identifying information appears for at least k different records in the dataset. More generally, when considering k-anonymity, one first divides the attributes into three distinct categories: identifiers (e.g., patientID), quasi-identifiers (e.g., year of birth and ZIP code), and other attributes (e.g., reason for visit). Obviously, the identifiers are removed. In that case, a dataset is said to be k-anonymous if there are at least k different individuals with exactly the same values for all quasi-identifiers. Thus, k-anonymity ensures that each individual in the database is indistinguishable from at least k − 1 other individuals, in terms of their quasi-identifiers. Unfortunately, what qualifies as ‘(quasi-)identifying information’ is often very subjective. In fact, even the outcome of interest may serve as an identifier. This subjective partitioning of attributes is one of the major drawbacks of k-anonymity.
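The definition can be checked mechanically. The sketch below computes the largest k for which a table is k-anonymous, assuming one row per individual and a non-empty table. The records and the function name `anonymity_level` are hypothetical.

```python
from collections import Counter

def anonymity_level(rows, quasi_identifiers):
    """Return the largest k for which the rows are k-anonymous.

    This is the size of the smallest group of rows that share the same
    values on all quasi-identifiers.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical records; identifiers such as patientID are already removed.
rows = [
    {"birth_year": 1985, "zip": "1011", "reason": "flu"},
    {"birth_year": 1985, "zip": "1011", "reason": "check-up"},
    {"birth_year": 1990, "zip": "1012", "reason": "flu"},
    {"birth_year": 1990, "zip": "1012", "reason": "injury"},
    {"birth_year": 1990, "zip": "1012", "reason": "flu"},
]

k = anonymity_level(rows, ["birth_year", "zip"])
print(k)  # the smallest group has two rows, so the table is 2-anonymous
```

The result depends entirely on which columns are passed as quasi-identifiers, which is exactly the subjective choice criticised above.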

Answer 10

We can conclude that the notion of k-anonymity has multiple important drawbacks: (i) the choice of quasi-identifiers is often debatable, (ii) a k-anonymous dataset still allows an attacker to ‘narrow down the options’ of which records correspond to a target individual, which can already be considered a violation of privacy, and (iii) it is not clear what a good choice for k is.

Answer 11

There are two techniques that can be combined to transform a dataset to a k-anonymous one: generalization and suppression.

Answer 12

Generalization works as follows: you simply make the values of certain quasi-identifiers less specific. For example, instead of reporting the full ZIP code, you may choose to report only its first two digits. Or instead of reporting the precise year of birth, you may choose to report the range in which the year of birth falls. Broader ranges will mean more individuals with identical quasi-identifier values, but also less precise information for those who analyse the dataset.
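A minimal sketch of generalization in Python, assuming string ZIP codes and integer birth years. The function name, the masking character and the default bin width are illustrative choices, not prescribed by the reader.

```python
def generalize(row, zip_digits=2, year_bin=10):
    """Return a copy of the row with coarser quasi-identifiers."""
    start = (row["birth_year"] // year_bin) * year_bin
    return {
        **row,
        # Keep only the leading digits of the ZIP code and mask the rest.
        "zip": row["zip"][:zip_digits] + "*" * (len(row["zip"]) - zip_digits),
        # Replace the exact year by the range it falls into.
        "birth_year": f"{start}-{start + year_bin - 1}",
    }

row = {"zip": "1011", "birth_year": 1985, "reason": "flu"}  # hypothetical
print(generalize(row))
# {'zip': '10**', 'birth_year': '1980-1989', 'reason': 'flu'}
```

Increasing `year_bin` or lowering `zip_digits` makes groups larger at the cost of precision, which is the trade-off described above.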

Answer 13

Suppression works as follows: sometimes there can be certain outliers in the dataset, which will make it very hard to achieve k-anonymity. This is what suppression is about: removing individuals who are too different from the rest of the sample to be able to obtain a usable k-anonymous dataset based on generalization.
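Suppression can be sketched as a filter applied after generalization: rows whose quasi-identifier group is still smaller than k are dropped. The data and the helper name `suppress` below are hypothetical.

```python
from collections import Counter

def suppress(rows, quasi_identifiers, k):
    """Drop every row whose quasi-identifier group has fewer than k members."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    sizes = Counter(key(row) for row in rows)
    return [row for row in rows if sizes[key(row)] >= k]

# Hypothetical generalized records with one outlier.
rows = [
    {"zip": "10**", "birth_year": "1980-1989", "reason": "flu"},
    {"zip": "10**", "birth_year": "1980-1989", "reason": "injury"},
    {"zip": "10**", "birth_year": "1980-1989", "reason": "check-up"},
    {"zip": "97**", "birth_year": "1920-1929", "reason": "flu"},  # outlier
]

kept = suppress(rows, ["zip", "birth_year"], k=2)
print(len(kept))  # the outlier is removed, three rows remain
```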

Answer 14

With ‘best’ we mean making the dataset k-anonymous, while keeping the data as precise as possible. Determining the optimal way of making a dataset k-anonymous is, in fact, an NP-hard problem, but greedy algorithms exist. Despite its drawbacks, this privacy notion is intuitive and simple to understand, and as a result quite widely used in practice, building on a body of literature on how to make a data release k-anonymous. As we have seen, though, there are multiple major disadvantages to this approach to protecting privacy—hence, in this course, we consider the far more stringent approach of DP.

Answer 15

What if, instead, the data curator agrees to answer (certain types of) queries about the database? By not making the data directly available to the public, you might think that the privacy of the individuals in the database is protected. It turns out, however, that this approach is also inadequate, as will be illustrated by the following example. To see how we can actually leverage such statistics, consider an attacker who wants to compromise the privacy of individual i. This attacker happens to know that i participated in the study, and, moreover, knows the ZIP code and age of that individual. By stratifying the data in this manner, the attacker has figured out that his target consumes alcohol! OK, so why do we care? Well, consider there are also fields about consumption of illegal substances... Sidestepping any discussion about the ethics and morality of the consumption of substances, this is obviously a gross violation of privacy (and potentially a legal risk for participants). Now, one could say: make the system such that it refuses queries that involve groups that comprise only one underlying observation. At this point, you may suggest a more refined approach: there must be at least m underlying observations per group that is reported in the output. If that requirement is violated, our RDBMS should reject the query. But even if we set m quite stringently (e.g., m = 100), a summary statistic based on groups of size at least m can still reveal quite a lot. By comparing the results of these two queries, the attacker can tell there are only two individuals with that ZIP code and age in the data, and that both consume alcohol. Hence, he again knows for sure that his target consumes alcohol! A solution to this problem could be to let the system audit the sequence of queries and responses, and to refuse to answer a query if answering it, given the previous queries, would violate privacy. 
For example, the system would refuse a query that could lead to a differencing attack when combined with one of the previous query responses. This sounds good in theory, but there are two problems with this approach. The first is that refusing to provide the answer to a particular query can itself say something about the query response, and therefore violate privacy. The second is that auditing a sequence of queries can be computationally infeasible, especially if the number of different potential queries is very large.
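The differencing attack can be illustrated with a small Python sketch. The reader's own example queries are not reproduced on this card, so the study data, the query interface `guarded_query` and the two queries below are hypothetical. Each query on its own covers far more than m observations, yet their difference isolates the target's group.

```python
def guarded_query(rows, predicate, m):
    """Return (group size, number of alcohol consumers), or refuse small groups."""
    group = [row for row in rows if predicate(row)]
    if len(group) < m:
        raise ValueError("query refused: fewer than m underlying observations")
    return len(group), sum(row["alcohol"] for row in group)

# Hypothetical study data: 200 background participants plus two people
# who share the target's ZIP code and age.
rows = [
    {"zip": "2000", "age": 30 + i % 40, "alcohol": i % 2 == 0}
    for i in range(200)
]
rows += [{"zip": "1011", "age": 29, "alcohol": True} for _ in range(2)]

m = 100

def everyone(row):
    return True

def all_but_target_group(row):
    return not (row["zip"] == "1011" and row["age"] == 29)

n_all, drinkers_all = guarded_query(rows, everyone, m)
n_rest, drinkers_rest = guarded_query(rows, all_but_target_group, m)

# Subtracting the two permitted answers reveals the small group.
group_size = n_all - n_rest
group_drinkers = drinkers_all - drinkers_rest
print(group_size, group_drinkers)  # 2 2: both people in the group consume alcohol
```

A direct query on the target's group would be refused, because it contains only two observations, but the two permitted queries leak the same information.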

Answer 16

This example reveals how profound the problem of privacy preservation is, even when reporting mere summary statistics. At the same time, the example also provides us with a tool to approach this problem: what is the worst kind of possible attack that we can conceive, short of hacking the system, and, under such an attack, how can we preserve privacy to what degree by using various mechanisms? That will be the main question in this course!

Answer 17

1. DP wards off attacks
2. DP allows us to quantify privacy
3. DP allows for composition

Answer 18

Most previously existing techniques for preserving privacy relied on assumptions about possible attacks that people with malicious intent may try to carry out. For instance, for anonymization to adequately protect privacy, one must make assumptions about alternative data that attackers might have access to. Such assumptions will often be too strong and there is simply no way of verifying whether they are correct. Besides, you need assumptions about what the intent of a possible attacker might be. Are they trying to re-identify one or more individuals, or are they just trying to find out whether the data of a particular individual is in the database at all? DP does not have this flaw: (i) it works regardless of any auxiliary information that the attacker might have, and (ii) it protects any kind of information about the individuals in the dataset, that is, both their membership of the dataset and their attributes. Regarding the first point, DP can be shown to be closed under post-processing, in the sense that any transformation of the output of a DP mechanism, even when using auxiliary data, will still be DP. Regarding the second point, in essence, DP protects the privacy of the individuals in the dataset in the worst-case scenario, namely the scenario in which the attacker knows all the data in the dataset except for that of their target individual. Intuitively, when the privacy of that target individual is protected in this worst-case scenario, it will also be protected in any other scenario where the attacker has less information about the rest of the data.

Answer 19

An advantage of DP with respect to other notions of privacy is that it automatically comes with a quantification of privacy. Instead of either having privacy or not, in the DP framework the concept of ‘privacy’ takes values on a continuum. Specifically, the definition of DP contains a parameter ε that determines how much privacy is potentially lost when using the DP mechanism. This parameter essentially determines the trade-off between privacy and accuracy, where a higher ε implies more potential privacy loss. Essentially, ε quantifies the highest possible information gain that an attacker might have from seeing the output of the DP mechanism. In the coming weeks, this interpretation will be discussed in more detail. This measure of privacy loss allows for making comparisons among DP techniques, such as: for a particular privacy loss, which technique leads to a better accuracy?
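For reference, the role of ε can be seen in the standard definition of ε-DP from the literature, stated here as a sketch. The symbols are assumptions not introduced on this card: $\mathcal{M}$ is the randomized mechanism, $D$ and $D'$ are two databases that differ in the data of a single individual, and $S$ is any set of possible outputs.

```latex
\Pr\bigl[\mathcal{M}(D) \in S\bigr] \;\le\; e^{\varepsilon} \cdot \Pr\bigl[\mathcal{M}(D') \in S\bigr]
\quad \text{for all such } D, D' \text{ and all } S .
```

For small ε the factor $e^{\varepsilon}$ is close to one, so the output distributions with and without the individual are nearly the same. A larger ε allows them to differ more, which corresponds to more potential privacy loss.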

Answer 20

Importantly, because DP allows us to quantify privacy loss, it also allows for analysing the total privacy loss when multiple DP calculations are done on a certain database. It turns out to be fairly easy to quantify the total privacy loss of a combination of multiple DP techniques. In practice, organizations or researchers may want to do many different analyses on a certain database. By the so-called composition property of DP, it is possible to keep an eye on how much privacy is at risk when doing multiple analyses. Moreover, the composition property allows us to build and analyse sophisticated DP algorithms consisting of simpler DP mechanisms.
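One simple way to keep track of the total privacy loss is a running budget. The sketch below assumes the basic sequential composition bound, under which the ε values of the individual analyses add up. Tighter composition results exist but are not used here. The class name and the budget values are hypothetical.

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record an epsilon-DP analysis, or refuse it if the budget would be exceeded."""
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.spent

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.25)  # first analysis
budget.charge(0.5)   # second analysis
print(budget.spent)  # 0.75
```

A further analysis with ε = 0.5 would be refused in this sketch, because the total would exceed the budget of 1.0.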

Answer 21

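The probability mass function (PMF) $f$ of a discrete RV $X$ is such that $\Pr(X = x) = f(x)$. For a PMF to be valid, it needs to satisfy two criteria: (1) all probabilities lie between zero and one (i.e., $0 \le f(x) \le 1$ for all $x \in \mathbb{R}$) and (2) the probabilities add up to one (i.e., $\sum_x f(x) = 1$).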

Answer 22

The cumulative distribution function (CDF) $F(x)$ of RV $X$ is such that $\Pr(X \le x) = F(x)$.

Answer 23

Under these definitions, for a continuous RV $X$, we have that $\Pr(X \le a) = F(a) = \int_{-\infty}^{a} f(x)\,dx$. For a PDF to be valid, it also needs to satisfy two criteria: (1) the density is nonnegative everywhere (i.e., $f(x) \ge 0$ for all $x \in \mathbb{R}$) and (2) the density integrates to one (i.e., $\int_{-\infty}^{\infty} f(x)\,dx = 1$). A PDF is permitted to have discontinuities.

Answer 24

The expectation of a function $g(X)$ of discrete RV $X$ is denoted by $E[g(X)]$ and is defined as follows: $E[g(X)] = \sum_{x \in \mathcal{X}} g(x) \cdot f(x)$.
1. $E[g(X)]$ itself is not an RV: it is a number.
2. $E[g(X) + h(Y)] = E[g(X)] + E[h(Y)]$ for RVs $X$ and $Y$, and their transformations according to functions $g$ and $h$ respectively.
3. $E[a + b \cdot g(X)] = a + b \cdot E[g(X)]$ for constants $a$ and $b$.

Answer 25

$\mathrm{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2$
$\mathrm{Var}(a + b \cdot X) = b^2 \cdot \mathrm{Var}(X)$
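These identities can be checked numerically on a small discrete RV. The sketch below uses a fair six-sided die and exact fractions. The PMF representation as a dictionary and the function names are illustrative assumptions.

```python
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * f(x), for a PMF given as {x: f(x)}."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2."""
    return expectation(pmf, lambda x: x * x) - expectation(pmf) ** 2

die = {x: Fraction(1, 6) for x in range(1, 7)}  # fair six-sided die
assert sum(die.values()) == 1                    # the PMF is valid

mean = expectation(die)
# The definition E[(X - E[X])^2] gives the same value as the shortcut formula.
by_definition = expectation(die, lambda x: (x - mean) ** 2)

a, b = 3, 2
shifted = {a + b * x: p for x, p in die.items()}  # PMF of a + b * X

print(mean, variance(die))                          # 7/2 35/12
print(by_definition == variance(die))               # True
print(variance(shifted) == b ** 2 * variance(die))  # True
```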

Answer 26

Joint PMF: $\Pr(X = x, Y = y) = f_{X,Y}(x, y)$.
Joint CDF: $\Pr(X \le x, Y \le y) = F_{X,Y}(x, y)$.
Joint PDF: $F_{X,Y}(a, b) = \int_{-\infty}^{a} \int_{-\infty}^{b} f_{X,Y}(x, y)\,dy\,dx$.

Answer 27

Marginal PMFs of $X$ and $Y$: $f_X(x) = \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y)$ and $f_Y(y) = \sum_{x \in \mathcal{X}} f_{X,Y}(x, y)$. The same holds for the marginal PDFs, but with the sum replaced by an integral.

Answer 28

$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)$
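The factorization can be tested on small joint PMFs by first summing out each variable to obtain the marginals. The sketch below assumes a joint PMF stored as a dictionary keyed by (x, y). The two example distributions and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def marginals(joint):
    """Sum a joint PMF {(x, y): probability} over the other variable."""
    f_x, f_y = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in joint.items():
        f_x[x] += p
        f_y[y] += p
    return dict(f_x), dict(f_y)

def is_independent(joint):
    """Check f_{X,Y}(x, y) == f_X(x) * f_Y(y) for every pair (x, y)."""
    f_x, f_y = marginals(joint)
    return all(
        joint.get((x, y), 0) == f_x[x] * f_y[y]
        for x in f_x
        for y in f_y
    )

# Two fair coin flips (independent) versus two copies of one flip (dependent).
two_flips = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}
same_flip = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

print(is_independent(two_flips), is_independent(same_flip))  # True False
```

Both examples have the same marginals (each outcome has probability 1/2), which shows that the marginals alone do not determine whether two RVs are independent.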

Answer 29

page 30,31

Answer 30

page 31-33

Answer 31

page 33-35

Answer 32

page 35,36

Answer 33

page 37,38

Answer 34

page 38-39

Answer 35

page 40,41

Answer 36

page 41,42


(70 cards)