Questions Week 1 Lecture 2 Flashcards
What are the common privacy-preserving methods that are often inadequate?
- anonymization
- k-anonymity
- query-based approach
What is anonymization?
A simple strategy to try to protect privacy of individuals in a
database is anonymization.
▶ This entails removing personally identifiable information from the
data (e.g., names, e-mail addresses, telephone numbers)
What is a linkage attack?
By matching the ‘anonymized’ database to other non-anonymized
data, an attacker could find out that certain records in the
‘anonymized’ database correspond to certain individuals.
What is the problem with anonymization?
Re-identification
Surely, publication of an anonymized dataset cannot lead to
leakage of private information?
▶ Unfortunately, this is a naive thought, because of the possibility of
re-identification.
▶ There could namely be a linkage attack:
By matching the ‘anonymized’ database to other non-anonymized
data, an attacker could find out that certain records in the
‘anonymized’ database correspond to certain individuals.
▶ Often, seemingly non-personally identifiable attributes, turn out to
be personally identifiable in combination with other attributes
(e.g., ZIP code and date of birth).
Note: Attacks like this only work under certain restrictions, and an
attacker usually cannot know whether re-identification was successful, but
the possibility of these kinds of attacks is still a violation of privacy.
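As a toy illustration (all names, tables, and attribute values here are hypothetical), a linkage attack is essentially a join of the 'anonymized' data with auxiliary data on shared quasi-identifiers such as ZIP code and date of birth:

```python
# Hypothetical 'anonymized' medical table (names removed) and a public
# auxiliary data set. ZIP code + date of birth act as quasi-identifiers.
anonymized = [
    {"zip": "1011", "dob": "1985-03-12", "diagnosis": "diabetes"},
    {"zip": "2517", "dob": "1990-07-01", "diagnosis": "asthma"},
]
public = [
    {"name": "Alice", "zip": "1011", "dob": "1985-03-12"},
    {"name": "Bob",   "zip": "3011", "dob": "1972-11-30"},
]

def linkage_attack(anon, pub, keys=("zip", "dob")):
    """Match records across the two tables on the quasi-identifiers in `keys`."""
    matches = []
    for a in anon:
        for p in pub:
            if all(a[k] == p[k] for k in keys):
                matches.append((p["name"], a["diagnosis"]))
    return matches

print(linkage_attack(anonymized, public))  # Alice is re-identified with her diagnosis
```

Even though no record in the 'anonymized' table contains a name, the combination of quasi-identifiers is unique enough to recover one.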
Does it help to hide more information to anonymize data?
Just removing obvious identifiers is apparently not enough to
guarantee privacy. Shouldn’t we then simply hide more
information? (e.g., ZIP code, date of birth, etc.)
▶ Deciding what information should be removed requires knowing
what alternative data potential attackers have access to, now
and in the future.
▶ Clearly, it is very difficult (maybe even impossible?) to make a
well-informed judgement about this.
▶ That even seemingly strong anonymization can be insufficient is
demonstrated convincingly by the next example: the Netflix
Prize.
What are the concluding remarks about anonymization?
The Netflix Prize example shows that not only demographic data (e.g.,
ZIP code, date of birth) can be used for linkage attacks.
▶ So it is difficult to ensure that anonymization truly protects privacy
of the individuals in the data set.
▶ You could say: Data cannot be fully anonymized and remain
useful
▶ Notice: exact re-identification is not the only problem; e.g., revealing
someone’s membership of the database can also be undesirable.
What is k-anonymity?
A common privacy notion from the scientific privacy literature is
k-anonymity.
▶ This privacy notion is intuitive and simple to understand, making it
popular in practice.
▶ Let’s call attributes that contain ‘identifying information’
quasi-identifiers (do you see why this is potentially problematic?)
▶ Then k-anonymity ensures that each individual in the database is
indistinguishable, based on their quasi-identifiers, from at least k − 1
others.
In other words: all combinations of values of quasi-identifiers should
occur at least k times.
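The definition above can be checked mechanically: count how often each combination of quasi-identifier values occurs. A minimal sketch (the attribute names and toy records are made up for illustration):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True iff every combination of quasi-identifier values
    occurs at least k times in the data set."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical data with generalized ZIP code and birth decade as quasi-identifiers
rows = [
    {"zip": "10**", "decade": "1980s", "diagnosis": "flu"},
    {"zip": "10**", "decade": "1980s", "diagnosis": "diabetes"},
    {"zip": "25**", "decade": "1990s", "diagnosis": "asthma"},
]
print(is_k_anonymous(rows, ["zip", "decade"], 2))  # False: the last row is unique
```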
What does k-anonymity bring us?
▶ If a data set is k-anonymous for large k, a re-identification attack
will not be successful:
at best, an attacker may find out that a certain individual corresponds
to one of k rows in the data set
▶ On the other hand, this can potentially already be problematic
What are the important drawbacks of k-anonymity?
(i) Choice of quasi-identifiers relies on assumptions about auxiliary
data attackers have access to.
(ii) A k-anonymous data set may still allow attackers to ‘narrow down
the options’ of which records correspond to a target individual.
(iii) It is not clear how to choose k
Why is the fact that it is not clear how to choose k a drawback?
There is no clear rule or guideline for the choice of k.
It is not straightforward to analyze how much privacy is put at risk for
different values of k, so in practice k is chosen in an ad hoc manner.
How to achieve k-anonymity?
▶ A given dataset is unlikely to be k-anonymous, even for k = 2, so
you will have to make changes to achieve k-anonymity.
▶ There are two techniques that can be combined to transform a
data set to a k-anonymous one:
- Generalization: make values of quasi-identifiers less precise. E.g.:
■ Report only the first two digits of the ZIP code
■ Report in which range the date of birth falls
■ etc.
- Suppression: delete individuals who are too different from the rest of
the sample to obtain a useful k-anonymous data set based on
generalization
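A minimal sketch combining both techniques. The specific generalization choices here, truncating ZIP codes to two digits and bucketing dates of birth by decade, are illustrative assumptions, not a prescribed recipe:

```python
from collections import Counter

def generalize(row):
    """Generalization: coarsen the quasi-identifiers.
    (Hypothetical choices: 2-digit ZIP prefix, birth decade.)"""
    return {
        "zip": row["zip"][:2] + "**",
        "decade": row["dob"][:3] + "0s",
    }

def make_k_anonymous(rows, k):
    gen = [generalize(r) for r in rows]
    counts = Counter(tuple(sorted(g.items())) for g in gen)
    # Suppression: drop rows whose generalized combination is still too rare
    return [g for g in gen if counts[tuple(sorted(g.items()))] >= k]

rows = [
    {"zip": "1011", "dob": "1985-03-12"},
    {"zip": "1093", "dob": "1988-07-01"},
    {"zip": "2517", "dob": "1990-11-30"},
]
result = make_k_anonymous(rows, 2)  # third row is suppressed
```

Note the trade-off this makes explicit: the first two rows survive only in coarsened form, and the third row is lost entirely.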
Why is achieving k-anonymity complicated in practice?
In practice, it is often complicated to figure out the best way to
make a data set k-anonymous.
▶ Here, ‘best’ means making the data set k-anonymous while keeping
the data as precise as possible.
▶ Determining the optimal way of making a data set k-anonymous is
NP-hard, but approximation algorithms exist.
▶ In conclusion: k-anonymity has multiple important drawbacks
What is the query-based approach?
Anonymization and k-anonymity have important shortcomings:
perhaps we should not aim at publishing an altered version of the
database.
▶ What if, instead, the data curator agrees to answer queries about the
database, without making the data itself public?
▶ You might think privacy of individuals is protected by not making
the data directly available to the public, but it turns out that this
approach is again inadequate.
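One reason is the differencing attack: the attacker poses two aggregate queries whose answer sets differ in exactly one individual, and subtracts the answers. A toy sketch (the data and names are hypothetical):

```python
# Hypothetical salary table; the attacker knows Alice is in the database.
salaries = {"Alice": 70000, "Bob": 52000, "Carol": 61000}

def query_sum(db, exclude=None):
    """An innocent-looking aggregate query the curator is willing to answer."""
    return sum(v for name, v in db.items() if name != exclude)

# Two individually harmless-looking answers...
total = query_sum(salaries)                       # sum over everyone
without_alice = query_sum(salaries, exclude="Alice")  # sum over everyone but Alice

# ...whose difference reveals Alice's exact salary.
alice_salary = total - without_alice
print(alice_salary)  # 70000
```

Neither query on its own reveals anything about a single individual; only their combination does, which is why auditing individual queries is so hard.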
How to neutralize differencing attacks?
▶ You might argue: the system should simply audit the sequence of
queries and responses, and refuse answering a query if answering it
violates privacy.
▶ There are two problems with this approach:
1. Refusing to answer a query can itself say something about the query
response.
2. Auditing a sequence of queries is often computationally infeasible.
▶ In conclusion: the query-based approach does not solve our
problem either due to differencing attacks.
▶ However, this example has provided us with a tool for approaching
privacy preservation. Namely, the main question of this course will be:
What is the worst kind of attack that we can conceive, and
how can we preserve privacy under such attacks?
What are the key advantages of DP?
- It is no longer necessary to model possible attacks
- It allows for quantifying privacy
- It allows for composition
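To illustrate the 'quantifying privacy' point, here is a sketch of the standard Laplace mechanism (a canonical DP construction, not taken from these cards): noise with scale sensitivity/ε is added to each query answer, so the single parameter ε quantifies the privacy–accuracy trade-off, and it makes differencing attacks unreliable:

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return the query answer plus Laplace(0, sensitivity/epsilon) noise.
    Smaller epsilon means more noise, i.e., stronger (quantified) privacy."""
    scale = sensitivity / epsilon
    # A Laplace sample is the difference of two i.i.d. exponential samples
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_answer + noise

# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
noisy_count = laplace_mechanism(42, sensitivity=1, epsilon=0.5)
```

Because each noisy answer hides any single individual's contribution, subtracting two such answers no longer pins down one person's exact value.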