Questions Week 1 Lecture 2 Flashcards

1
Q

What are the common privacy-preserving methods that are often inadequate?

A
  1. anonymization
  2. k-anonymity
  3. query-based approach
2
Q

What is anonymization?

A

A simple strategy to try to protect privacy of individuals in a
database is anonymization.
▶ This entails removing personally identifiable information from the
data (e.g., names, e-mail addresses, telephone numbers)

3
Q

What is a linkage attack?

A

By matching the ‘anonymized’ database to other non-anonymized
data, an attacker could find out that certain records in the
‘anonymized’ database correspond to certain individuals.

4
Q

What is the problem with anonymization?

A

Re-identification

Surely, publication of an anonymized dataset cannot lead to
leakage of private information?
▶ Unfortunately, this is a naive thought, because of the possibility of
re-identification.
▶ Namely, there could be a linkage attack:
By matching the ‘anonymized’ database to other non-anonymized
data, an attacker could find out that certain records in the
‘anonymized’ database correspond to certain individuals.
▶ Often, seemingly non-personally identifiable attributes turn out to
be personally identifiable in combination with other attributes
(e.g., ZIP code and date of birth).

Note: Attacks like this only work under certain restrictions, and an
attacker usually cannot know whether re-identification was successful, but
the mere possibility of these kinds of attacks is still a violation of privacy.
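The linkage attack above can be sketched in a few lines. This is a minimal illustration with entirely hypothetical data and attribute names: an 'anonymized' medical table (names removed) is joined with a public voter roll on the quasi-identifiers ZIP code and date of birth.

```python
# Sketch of a linkage attack (all data hypothetical): names were removed
# from the medical records, but ZIP code and date of birth remain.

anonymized_records = [
    {"zip": "02138", "dob": "1945-07-31", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1960-01-15", "diagnosis": "asthma"},
]

# A public, non-anonymized dataset the attacker has access to.
voter_roll = [
    {"name": "Alice", "zip": "02138", "dob": "1945-07-31"},
    {"name": "Bob", "zip": "02141", "dob": "1971-03-02"},
]

def linkage_attack(anon, public):
    """Re-identify records by matching on the quasi-identifiers."""
    matches = []
    for record in anon:
        for person in public:
            if (record["zip"], record["dob"]) == (person["zip"], person["dob"]):
                matches.append((person["name"], record["diagnosis"]))
    return matches

print(linkage_attack(anonymized_records, voter_roll))
# → [('Alice', 'hypertension')]: a unique match reveals Alice's diagnosis
```

The attacker never needed the removed names; the combination (ZIP, date of birth) was identifying enough on its own.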

5
Q

Does it help to hide more information to anonymize data?

A

Just removing obvious identifiers is apparently not enough to
guarantee privacy. Shouldn’t we then simply hide more
information? (e.g., ZIP code, date of birth, etc.)
▶ Deciding what information should be removed requires knowing
what alternative data potential attackers have access to, now
and in the future
▶ Clearly, it is very difficult (maybe even impossible?) to make a
well-informed judgement about this.
▶ That even seemingly strong anonymization can be insufficient is
demonstrated convincingly by the next example: the Netflix
Prize.

6
Q

What are thus concluding remarks about anonymization?

A

The Netflix Prize example shows that not just demographic data (e.g.,
ZIP code, date of birth) can be used for linkage attacks.
▶ So it is difficult to ensure that anonymization truly protects privacy
of the individuals in the data set.
▶ You could say: Data cannot be fully anonymized and remain
useful
▶ Notice: exact re-identification is not the only problem; e.g., revealing
someone’s membership of the database can already be undesirable.

7
Q

What is k-anonymity?

A

A common privacy notion from the scientific privacy literature is
k-anonymity.
▶ This privacy notion is intuitive and simple to understand, making it
popular in practice.
▶ Let’s call attributes that contain ‘identifying information’
quasi-identifiers (do you see why this is potentially problematic?)
▶ Then k-anonymity ensures that each individual in the database is
indistinguishable, based on their quasi-identifiers, from at least k − 1
others:
every combination of quasi-identifier values that occurs in the data
should occur at least k times.
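The definition translates directly into a check: count how often each combination of quasi-identifier values occurs, and require every count to be at least k. A minimal sketch, with hypothetical attribute names:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True iff every occurring combination of quasi-identifier values
    appears at least k times in the data set."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in counts.values())

# Hypothetical data: ZIP code and birth-year range are the quasi-identifiers.
rows = [
    {"zip": "021**", "birth_year": "1940-1949", "diagnosis": "hypertension"},
    {"zip": "021**", "birth_year": "1940-1949", "diagnosis": "asthma"},
    {"zip": "021**", "birth_year": "1960-1969", "diagnosis": "flu"},
]

print(is_k_anonymous(rows, ["zip", "birth_year"], 2))
# → False: the combination ('021**', '1960-1969') occurs only once
```

Note that the sensitive attribute (here, the diagnosis) plays no role in the check; k-anonymity only constrains the quasi-identifiers.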

8
Q

What does k-anonymity bring us?

A

▶ If a data set is k-anonymous for large k, a re-identification attack
will not be successful:
at best, an attacker may find out that a certain individual corresponds
to one of at least k rows in the data set
▶ On the other hand, even this can potentially already be problematic

9
Q

What are the important drawbacks of k-anonymity?

A

(i) Choice of quasi-identifiers relies on assumptions about auxiliary
data attackers have access to.
(ii) A k-anonymous data set may still allow attackers to ‘narrow down
the options’ of which records correspond to a target individual.
(iii) It is not clear how to choose k.

10
Q

Why is the fact that it is not clear how to choose k a drawback?

A

There is no clear rule or guideline for the choice of k.
It is not straightforward to analyze how much privacy is put at risk for
different values of k, so in practice k is chosen in an ad hoc manner.

11
Q

How to achieve k-anonymity?

A

▶ A given dataset is unlikely to be k-anonymous, even for k = 2, so
you will have to make changes to achieve k-anonymity.
▶ There are two techniques that can be combined to transform a
data set to a k-anonymous one:

  • Generalization: make values of quasi-identifiers less precise. E.g.:
    ■ Report only first two numbers of ZIP code
    ■ Report in which range date of birth falls
    ■ etc.
  • Suppression: delete individuals who are too different from the rest of
    the sample for generalization alone to yield a useful k-anonymous
    data set
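The two techniques can be combined in a simple pipeline: first generalize the quasi-identifiers, then suppress any rows whose generalized combination is still rare. A minimal sketch, assuming hypothetical attribute names and a fixed generalization scheme (first two ZIP digits, birth decade):

```python
from collections import Counter

def generalize(row):
    """Generalization: keep only the first two digits of the ZIP code
    and the decade of the birth year."""
    return {
        "zip": row["zip"][:2] + "***",
        "birth_year": (row["birth_year"] // 10) * 10,
    }

def make_k_anonymous(rows, k):
    """Generalize all rows, then suppress rows whose quasi-identifier
    combination still occurs fewer than k times."""
    generalized = [generalize(r) for r in rows]
    counts = Counter((g["zip"], g["birth_year"]) for g in generalized)
    return [g for g in generalized if counts[(g["zip"], g["birth_year"])] >= k]

rows = [
    {"zip": "02138", "birth_year": 1945},
    {"zip": "02139", "birth_year": 1947},
    {"zip": "90210", "birth_year": 1983},  # too different from the rest
]

print(make_k_anonymous(rows, k=2))
# the two similar rows survive as ('02***', 1940); the outlier is suppressed
```

Real implementations must instead search over many possible generalization schemes to keep the data as precise as possible, which is where the difficulty discussed next comes from.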
12
Q

Why is achieving k-anonymity complicated in practice?

A

In practice, it is often complicated to figure out the best way to
make a data set k-anonymous.
▶ Here best means making the data set k-anonymous while keeping
the data as precise as possible.
▶ Determining the optimal way of making a data set k-anonymous is
NP-hard, but approximation algorithms exist.
▶ In conclusion: k-anonymity has multiple important drawbacks

13
Q

What is the query-based approach?

A

Anonymization and k-anonymity have important shortcomings:
perhaps we should not aim at publishing an altered version of the
database.
▶ What if, instead, the data curator agrees to answer queries about the
database, without making the data itself public?
▶ You might think privacy of individuals is protected by not making
the data directly available to the public, but it turns out that this
approach is again inadequate.
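The inadequacy comes from differencing attacks: two innocent-looking aggregate queries can be subtracted to isolate one individual's data. A minimal sketch with hypothetical salary data:

```python
# Differencing attack sketch (hypothetical data): the curator only
# answers aggregate (sum) queries, never individual records.

database = {"Alice": 52_000, "Bob": 61_000, "Carol": 58_000}

def query_sum(db, exclude=None):
    """An aggregate query the curator is willing to answer:
    the sum of salaries, optionally over a restricted subset."""
    return sum(salary for name, salary in db.items() if name != exclude)

total = query_sum(database)                            # sum over everyone
without_target = query_sum(database, exclude="Alice")  # everyone but Alice

print(total - without_target)
# → 52000: Alice's exact salary, recovered from two aggregate answers
```

Neither query on its own reveals anything about Alice; it is the combination that does, which is what makes such attacks hard to anticipate.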

14
Q

How to neutralize differencing attacks?

A

▶ You might argue: the system should simply audit the sequence of
queries and responses, and refuse to answer a query if answering it
would violate privacy.
▶ There are two problems with this approach:
1. Refusing to answer a query can itself say something about the query
response.
2. Auditing a sequence of queries is often computationally infeasible.
▶ In conclusion: due to differencing attacks, the query-based approach
does not solve our problem either.

▶ However, this example has provided us with a tool to approach
privacy preservation. Namely, the main question of this course will be:
What is the worst kind of attack that we can conceive, and
how can we preserve privacy under such attacks?

15
Q

What are the key advantages of DP?

A
  1. It is no longer necessary to model possible attacks
  2. It allows for quantifying privacy
  3. It allows for composition
16
Q

What does the first advantage of DP mean?

A

▶ Most previously existing techniques for preserving privacy (e.g.,
anonymization) rely on certain assumptions. In particular:
(i) What alternative data attackers have access to.
(ii) What the intent of a possible attacker might be.
▶ DP does not need these assumptions:
(i) It works regardless of any auxiliary information that the attacker might
have.
(ii) It protects any kind of information about the individuals in the data
set, i.e., both their membership of the data set and their attributes.
▶ Regarding point (ii): DP protects privacy in the worst-case
scenario, i.e. the scenario in which the attacker knows all the data
in the data set except for that of their target individual.

17
Q

What is the second advantage of DP?

A

An advantage of DP compared to other notions of privacy is that
it comes with a quantification of privacy: the parameter ε.
▶ It determines the trade-off between privacy and accuracy: higher
ε implies more potential privacy loss, but usually higher accuracy.
▶ In essence, ε quantifies the highest possible information gain that
an attacker might have due to the output of the DP mechanism.
▶ This enables us to compare different DP mechanisms. E.g.:
For a given privacy loss, which mechanism leads to a better
accuracy?

18
Q

What is the third advantage of DP?

A

▶ As DP allows us to quantify privacy loss, it also allows for
analysing the total privacy loss when multiple DP calculations are
done on a database.
▶ Quantifying the privacy loss of a composition of DP mechanisms
turns out to be straightforward, which is useful because:
(i) Organizations or researchers may want to do many different analyses
on a certain database.
(ii) It allows us to build and analyse sophisticated DP algorithms
consisting of simpler DP mechanisms.
▶ By the composition property of DP, it is possible to keep an eye on
how much privacy is at risk in both these settings
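In the simplest form of this property (basic sequential composition; tighter variants exist), running mechanisms with parameters ε₁, …, ε_k on the same database costs at most ε₁ + … + ε_k in total, so a curator can track a privacy budget. A minimal sketch:

```python
# Basic (sequential) composition: the total privacy loss of several DP
# analyses on the same database is at most the sum of their epsilons.

def total_epsilon(epsilons):
    """Privacy budget consumed by a sequence of DP analyses."""
    return sum(epsilons)

TOTAL_BUDGET = 0.5           # hypothetical budget set by the curator

spent = total_epsilon([0.1, 0.1, 0.1])  # three analyses at eps = 0.1 each
remaining = TOTAL_BUDGET - spent        # 0.2 of the budget is left
print(spent, remaining)
```

This is what makes both settings in the card tractable: many analyses simply draw down one budget, and a sophisticated algorithm built from simpler DP mechanisms inherits the sum of their parameters as its own guarantee.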