4. Identity and Anonymity Flashcards

1
Q

Forms of identity

A
  • identified individual
  • pseudonym (we can detach our online presence from the actual person; however, this kind of privacy can be illusory, as it is often possible to identify the actual person behind the pseudonym)
  • anonymity (the weakest form of identity; with truly anonymous data, we not only do not know the individual the data is about, we cannot even tell if two data items are about the same individual)
    The differences can easily be seen using a formal definition. Assume we have a set of data items D = {d1, …, dn} and an identity function I(d) that gives us information on whom the data item d is about. If we can say that, for a known individual i, I(d) = i, then I(d) is an identified individual. If we can say that I(dj) = I(dk) (the two data items are about the same individual), but we do not know who that individual is, then I(dk) is a pseudonym. If we cannot make either statement (identified individual or pseudonym), then the data is anonymous.
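A minimal sketch of the three cases in code (the identifiers and the set of known individuals are invented for illustration):

```python
# Toy illustration of I(d): each data item carries either a real-world
# identity, a linkable pseudonym, or no identifier at all.
KNOWN_INDIVIDUALS = {"alice@example.com", "bob@example.com"}

def classify(identifier):
    """Return the form of identity for a data item's identifier."""
    if identifier in KNOWN_INDIVIDUALS:
        return "identified"      # I(d) = i for a known individual i
    if identifier is not None:
        return "pseudonymous"    # I(dj) = I(dk) can hold, but i is unknown
    return "anonymous"           # neither statement can be made

print(classify("alice@example.com"))  # identified
print(classify("user_8842"))          # pseudonymous
print(classify(None))                 # anonymous
```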
2
Q

Reasons for a system to know a person’s identity

A
  • Access control
  • Attribution (the ability to prove who performed an action)
  • Enhance user experience (pseudonym might be enough here)
3
Q

Representing an identity

A
  • external, easy-to-remember identifiers (e.g., name and date of birth, or address)
  • user-specified identifiers (user IDs)
  • externally created identifiers (e.g., an email address)
  • systems created for that purpose. The X.500 standard provides a flexible framework for storing and maintaining identifying information, as do commercial systems such as Microsoft Passport or Google Wallet. Cryptographic certificates and public-key infrastructure (see Chapter 3) also provide mechanisms to verify identity. These systems generally combine representations of identity with other identity-related information (name, address) and can provide authentication mechanisms
  • biometrics
4
Q

X.500

A

X.500 is a series of computer networking standards used to develop the equivalent of an electronic directory, very similar in concept to a physical telephone directory. Its purpose is to centralize an organization’s contacts so that anyone within (and sometimes outside) the organization who has Internet access can look up other people in the same organization by name or department. Several large institutions and multinational corporations have implemented X.500.

5
Q

Authentication categories

A

Authentication is used to ensure that an individual performing an action matches the expected identity. Authentication can be accomplished by a variety of mechanisms, each with advantages and drawbacks. These mechanisms fall into four main categories:
What you know—secret knowledge held only by the individual corresponding to the identity
What you have—authentication requires an object possessed by the individual
Where you are—the location matches the expected location
What you are—biometric data from the individual
Authentication methods typically involve authentication information held by the user, complementation information held by the server/host, and an authentication function that takes a piece of authentication information and a piece of complementation information and determines whether they do or do not match.
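A minimal sketch of this three-part model: the user holds authentication information (a password), the server holds complementation information (a salted hash), and an authentication function decides whether they match. The password and parameters below are illustrative; a real system would store per-user salts and use a vetted library.

```python
import hashlib
import hmac
import os

def make_complementation_info(password: str, salt: bytes) -> bytes:
    # What the server stores: a slow, salted one-way hash of the password.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

# Enrollment: the server keeps only salt + hash, never the password itself.
salt = os.urandom(16)
stored = make_complementation_info("correct horse battery", salt)

def authenticate(supplied: str) -> bool:
    # The authentication function: does the authentication information
    # (the supplied password) match the stored complementation information?
    candidate = make_complementation_info(supplied, salt)
    return hmac.compare_digest(candidate, stored)

print(authenticate("correct horse battery"))  # True
print(authenticate("wrong guess"))            # False
```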

6
Q

Man-in-the-middle attack

A

A simple example of a password attack performed directly through the system is the man-in-the-middle attack, in which a computer program intercepts traffic and reads the password contained in the intercept. To combat this attack, passwords are typically protected with hashing. Instead of presenting the password (authentication information) itself, the client presents a one-way hash of the password, and the system stores only the hash (complementary information). As it is extremely difficult to recover the password from the hash, this prevents the man in the middle, or an intruder who has gained access to the system, from obtaining a user’s password.

7
Q

Passwords/PINs - what you know approach to authentication

A

One of the most common approaches to authenticating a user is through passwords or PINs. This is an example of what you know authentication: It is assumed that only the proper individual knows the password. Passwords can provide a high level of assurance that the correct individual is being identified, but when used improperly, they can easily be broken.
Attacks on password-based authentication fall into two categories: attacks on the password itself (e.g., guessing short passwords) and password attacks performed directly through the system (e.g., a man-in-the-middle attack).

8
Q

Replay attack

A

While the man in the middle may not know the password, he only needs to replay the hash of the password to gain access; this is called a replay attack. This kind of attack is easily combated through system design. Challenge response authentication issues a unique challenge for each authentication: The response must be correct for each challenge. With a hashed password, the challenge is an encryption key sent by the system. The user application uses the key to encrypt the hash of the password; this is compared with the system’s encryption of the stored value of the hashed password. Each authentication uses a different key, and thus a replay attack fails because the replayed password (response) is not encrypted with the current key (challenge).
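A sketch of the scheme described above, with the challenge used as an HMAC key over the password hash (the password and challenge sizes are illustrative):

```python
import hashlib
import hmac
import os

# The server stores only the hash of the password, never the password.
stored_hash = hashlib.sha256(b"hunter2").digest()

def issue_challenge() -> bytes:
    return os.urandom(16)  # a fresh key for every authentication attempt

def client_response(password: str, challenge: bytes) -> bytes:
    # The client keys an HMAC with the challenge over its password hash.
    pw_hash = hashlib.sha256(password.encode()).digest()
    return hmac.new(challenge, pw_hash, hashlib.sha256).digest()

def server_verify(response: bytes, challenge: bytes) -> bool:
    expected = hmac.new(challenge, stored_hash, hashlib.sha256).digest()
    return hmac.compare_digest(response, expected)

c1 = issue_challenge()
r1 = client_response("hunter2", c1)
print(server_verify(r1, c1))  # True: fresh response to the current challenge
c2 = issue_challenge()
print(server_verify(r1, c2))  # False: replaying r1 fails against a new challenge
```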

9
Q

What you have approach to authentication / Devices

A

The what you have approach to authentication typically uses computing devices. Identification badges or smart cards can be used; these require that the computing terminal have the ability to read the computing device. A convenient approach is to embed a radio frequency identification (RFID) chip in the device; this does require a reader, but the user doesn’t actually have to swipe the card. This particular technology introduces a privacy risk in that a malicious actor with a remote RFID reader can detect when the user is nearby, even though they are not actually trying to authenticate. If the actor can read the RFID card, then they may be able to “become” that individual through a replay attack; more advanced RFID approaches use a challenge-response approach to mitigate this attack.
Devices also exist that don’t require special hardware at the client’s terminal. These are typically in the form of small devices that display a changing PIN; the timing and sequence of PINs are known to the system. The user can type the PIN being displayed by the device just like a password, and the system checks to see if the given PIN matches what the device should be displaying.
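Such changing-PIN devices are typically time-based one-time password generators. A simplified sketch in the spirit of RFC 6238 (the shared secret is illustrative; real deployments also tolerate clock skew):

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, at_time=None, step=30, digits=6) -> str:
    # Both the device and the server derive the PIN from a shared secret
    # and the current 30-second time window.
    counter = int((time.time() if at_time is None else at_time) // step)
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return f"{code:0{digits}d}"

def verify_pin(secret: bytes, typed: str) -> bool:
    # Server-side check: does the typed PIN match what the device shows now?
    return hmac.compare_digest(typed, totp(secret))
```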
Lastly, the computing device may be the computer the person uses to access the system (e.g., a home computer, laptop, smartphone). The system stores the IP address of the device or uses browser cookies to store a unique key on the machine; this allows the system to check whether the attempt to authenticate comes from a device previously used. Since the user already has the device, this requires no additional hardware.
Device-based authentication becomes problematic when devices are lost or stolen—until the loss is recognized and reported, access to the system may be compromised. As a result, these systems are typically combined with passwords or some other form of authentication so that the lost device alone cannot be used to gain access.

10
Q

Where you are based authentication / Location

A

Location-based authentication is typically used in corporate networks. Access to corporate resources is limited to computers physically located in the company. This requires an attacker to gain physical access as well as defeat other authentication (such as passwords), making unauthorized access far more difficult. Of course, this also prevents legitimate use from outside the network, requiring the use of virtual private networks (VPNs). A VPN provides an encrypted link to the corporate network, and typically requires a high standard of authentication to make up for the loss of location-based authentication.
Note that location-based authentication can be used in other ways as well. Credit card issuers may reject transactions at unfamiliar locations unless the customer has provided advance notice of travel. While this may seem invasive from a privacy point of view, such location information will likely be made available anyway—for example, from the credit card use or the IP address used to connect to the system, so little additional information is disclosed when providing a list of authorized locations, such as a travel itinerary.
While location is useful, it should almost always be viewed as a secondary form of authentication, used to provide stronger evidence that the primary form of authentication is correct.

11
Q

What you are authentication / Biometrics

A

What you are as a form of authentication is growing increasingly popular. Notebook computers are available with fingerprint readers, and cameras and microphones are becoming standard equipment on many devices. Fingerprints, face and voice recognition, and other biometric methods for authentication are becoming increasingly available. This brings advantages, but also raises privacy issues.
First, systems using biometric data must protect that data. If a user’s password is compromised, the user can change it—but cannot be asked to change a face or fingerprint. As with passwords, careful system design is needed to ensure that an attacker cannot obtain or spoof the biometric data.
Use of biometric data raises inherent privacy concerns. While passwords can be associated with a pseudonym, a fingerprint is inherently identifying, and a pseudonymous account using a fingerprint for authentication should probably be considered individually identifiable. There may also be cultural issues; some users may be reluctant to have a photograph taken or to display their face for use in biometric authentication.
A second type of biometrics is based on behavior—for example, typing rate or patterns of mouse movement. While these give only a degree of assurance, they provide the opportunity for continuous authentication. Once a user authenticates to the system, the behavior in using the system can be used to ensure that the user hasn’t walked away and someone else has stepped in to use the account.

12
Q

MFA

A

The idea behind multifactor authentication is to require two different mechanisms, coming from two of the above categories (what you know, what you have, where you are, what you are). A common example is the use of a device (often an individual’s cell phone) in addition to a password.
Good implementation of two-factor authentication can make many types of attacks, such as man-in-the-middle attacks, more difficult. The key is that the two factors should proceed through independent channels, such as a password combined with a one-time temporary security code sent via a text message (SMS). While this does not eliminate attacks, an attacker must now compromise two independent systems. Conversely, forms of two-factor authentication that draw both factors from the same category (such as a password and security questions) are much less effective; a targeted attack to use personal information to guess a password will likely acquire the personal information needed to answer the security questions as well.

13
Q

Authentication

A

Authentication is the means by which a system knows that the identity matches the individual who is actually using the system. There are several approaches to authentication. Often, these can be used in combination, significantly decreasing the risk of a successful attack or attempt at impersonating the user.
Authentication can be separated from the systems requiring it. This is a feature of single sign-on systems: authentication is performed by a service that provides a time-stamped cryptographic token to the user’s system (e.g., web browser). This token can be provided to other systems, which can decide whether the source of authentication (secured using a digital certificate), the recency of authentication and the user identified by the token satisfy their access policy, without requiring a separate authentication.
Authentication must balance assuring the accuracy of an individual’s identity and the usability of the system. While authentication needs to be strong enough to protect personal information, excessive use of technology to perform authentication can reduce the practical effectiveness of the system and create new privacy issues by collecting sensitive personal information needed to implement complex authentication mechanisms.

14
Q

Radio Frequency Identification (RFID)

A

Radio frequency identification (RFID) is a technology similar in theory to barcode identification. It is the wireless, non-contact use of radio frequency electromagnetic fields to transfer data, for the purpose of automatically identifying and tracking tags attached to objects.
The tags contain electronically stored information. Some tags are powered and read at short range by magnetic fields. Others are powered by a local power source such as a battery; still others have no battery but collect energy from the interrogating EM field and act as a passive transponder to emit microwaves or UHF radio waves.

15
Q

TAILS

A

Tails (The Amnesic Incognito Live System) is a live operating system that you can start on almost any computer from a USB stick. It aims to preserve your privacy and anonymity by routing all your internet traffic through the Tor network and by providing a host of other privacy-centric features.

16
Q

Personal data definition (GDPR)

A

GDPR defines personal data as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.” This results in almost any data derived from a person being deemed “personal data,” even if the tie to specific individuals is unclear.

17
Q

Pseudonymity under HIPAA

A

The need to treat pseudonymous data collections with particular care is recognized in law. The U.S. HIPAA Privacy Rule does not apply to anonymous data. However, it makes a special provision for “limited data sets,” which are not individually identified; individual identifiers must be replaced with a number. This gives a pseudonymous dataset. Limited datasets can be shared under a data use agreement, but they may not be publicly released. HIPAA also provides a specific example of pseudonymity. A de-identified dataset may include:
A code or other means of record identification to allow information de-identified under this section to be re-identified by the covered entity, provided that: (1) the code or other means of record identification is not derived from or related to information about the individual and is not otherwise capable of being translated so as to identify the individual; and (2) the covered entity does not use or disclose the code or other means of record identification for any other purpose, and does not disclose the mechanism for re-identification.

The challenge for IT professionals is identifying these various definitions and managing them in a consistent and legally compliant manner across their systems.
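A sketch of this provision (field names are invented): direct identifiers are replaced with random codes that are not derived from the individual, and the re-identification map is kept separate by the covered entity.

```python
import secrets

def pseudonymize(records):
    """Replace the 'name' field with a random code; return data + key map."""
    code_map = {}   # held only by the covered entity, never disclosed
    limited = []
    for rec in records:
        name = rec["name"]
        if name not in code_map:
            # secrets.token_hex is random, so the code is not derived from
            # or related to any information about the individual.
            code_map[name] = secrets.token_hex(8)
        out = {k: v for k, v in rec.items() if k != "name"}
        out["code"] = code_map[name]
        limited.append(out)
    return limited, code_map

data = [{"name": "Jan Kowalski", "diagnosis": "flu"},
        {"name": "Jan Kowalski", "diagnosis": "asthma"}]
released, key = pseudonymize(data)
# Both records share one code (pseudonymity), but neither carries a name.
```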

18
Q

Browser fingerprinting

A

Browser fingerprinting gathers information about the user to distinguish them from millions of others online. It provides enough specific attributes about a device and its settings that the user can be reliably identified out of a crowd, even the extremely large crowd of millions of internet users and billions of devices. In fact, device fingerprinting can identify users with 90 to 99 percent accuracy.
In an effort to personalize or customize the user experience, websites can track the user with browser cookies or other techniques. Advanced techniques can uniquely identify most browsers by using data reported to the web server, including the client operating system, browser plug-ins and system fonts. While tracking can be beneficial to the individual, it also turns anonymous exchanges into pseudonymous exchanges—and this can make data individually identifiable.
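A toy sketch of how such attributes combine into a single identifier (the attribute names are illustrative; real fingerprinting uses many more signals, such as canvas rendering and screen geometry):

```python
import hashlib

def browser_fingerprint(attributes: dict) -> str:
    # Canonicalize the reported attributes, then hash them into one identifier.
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

fp1 = browser_fingerprint({"os": "Linux", "browser": "Firefox 128",
                           "fonts": "Arial,DejaVu", "timezone": "UTC+1"})
fp2 = browser_fingerprint({"os": "Linux", "browser": "Firefox 128",
                           "fonts": "Arial,DejaVu,Noto", "timezone": "UTC+1"})
# One extra installed font is enough to produce a different identifier.
```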

19
Q

Anonymisation

A

Anonymization techniques attempt to ensure that data is not identifiable. This is a challenging problem: For each anonymization technique, there are attacks showing conditions under which data can be re-identified. Though these attacks sometimes make very strong assumptions, it is clear that any attempt at anonymizing data faces risks. However, these risks are likely to be much smaller than the risk of misuse of identifiable data, through either insider misuse or external attackers gaining access to data.

20
Q

Strong, weak/quasi identifiers

A

Some information is clearly identifying—for example, identifying numbers such as a national identification, passport or credit card number. These are referred to as strong identifiers. Names can be strong identifiers, but common names may not be uniquely identifying. This is why names are typically used in combination with other information (e.g., birth date, address) to identify individuals. Identifiers that must be used in combination with other information to determine identity are referred to as weak identifiers. A related concept is quasi-identifiers: data that can be combined with external knowledge to link data to an individual.

21
Q

Anonymisation techniques

A

Anonymization techniques hide identity in a variety of ways. The simplest approach is suppression: removing identifying values from a record. Names and identifying numbers are typically handled through suppression.
Some types of data are amenable to generalization, which is performed by replacing a data element with a more general element; for example, by removing the day and month from a birth date or removing a street from a postal address and leaving only the city, state or province name. Replacing a date of birth by just the year of birth substantially reduces the risk that an individual can be identified but still leaves valuable information for use in data analysis.
A third approach is noise addition. By replacing actual data values with other values that are selected from the same class of data, the risk of identification is lowered. The addition is often aimed at preserving statistical properties of the data, while disrupting future attempts to identify individuals from the data. In many ways, the protection obtained by noise addition is similar to generalization.
Rounding can also be used as a form of generalization (e.g., to the nearest integer, or nearest 10); controlled rounding ensures that rounding is done in a way that preserves column summations. Where data must be suppressed, data imputation can be used to replace the suppressed values with plausible data without risking privacy. Another technique is value swapping: switching values between records in ways that preserve most statistics but no longer give correct information about individuals.
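A compact sketch combining three of these techniques on an invented record: suppression of the name, generalization of birth date and postcode, and rounding of a salary.

```python
def anonymize(record: dict) -> dict:
    out = dict(record)
    out.pop("name", None)                          # suppression
    out["birth_year"] = out.pop("birth_date")[:4]  # generalization: date -> year
    out["region"] = out.pop("postcode")[:2]        # generalization: coarser area
    out["salary"] = round(out["salary"], -3)       # rounding to nearest 1,000
    return out

row = {"name": "Anna Nowak", "birth_date": "1984-06-17",
       "postcode": "00-950", "salary": 67_450}
print(anonymize(row))
```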

22
Q

Data imputation

A

Data imputation is a method for retaining the majority of the dataset’s data and information by substituting missing data with a different value.

23
Q

Microdata

A

Microdata are unit-level data obtained from sample surveys, censuses, and administrative systems. They provide information about characteristics of individual people or entities such as households, business enterprises, facilities, farms or even geographical areas such as villages or towns.

24
Q

HIPAA safe harbor rules to generalize personal data

A

The only clear legal answer to the question of what makes data individually identifiable is contained in the HIPAA safe harbor de-identification rules, which require the removal of specified identifiers of the patient and of the patient’s relatives, household members and employers.
These specify the removal or generalization of 18 types of data. Names, identifying numbers (e.g., telephone number, insurance ID) and several other data types must be suppressed. Dates must be generalized to a year, and addresses to the first three digits of the postal code (or made more general if this does not yield a region containing at least 20,000 people). Furthermore, age must be top-coded: all ages greater than 89 must simply be reported as >89. If these steps have been taken, and there is no other reason to believe that the data is identifiable, then the data can be considered de-identified and no longer subject to the HIPAA Privacy Rule.
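A simplified sketch of the date, ZIP and age rules (it omits the other identifier types and the 20,000-person population check, which would require census data):

```python
def safe_harbor_fields(record: dict) -> dict:
    # Dates generalized to the year; ZIP to its first three digits;
    # ages over 89 top-coded as ">89".
    age = record["age"]
    return {
        "birth_year": record["birth_date"][:4],
        "zip3": record["zip"][:3],
        "age": ">89" if age > 89 else age,
    }

print(safe_harbor_fields({"birth_date": "1932-02-09", "zip": "94110", "age": 92}))
# {'birth_year': '1932', 'zip3': '941', 'age': '>89'}
```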

25
Q

k-anonymity

A

The concept of k-anonymity was introduced into information security and privacy back in 1998. It’s built on the idea that by combining sets of data with similar attributes, identifying information about any one of the individuals contributing to that data can be obscured. k-Anonymization is often referred to as the power of “hiding in the crowd.” Individuals’ data is pooled in a larger group, meaning information in the group could correspond to any single member, thus masking the identity of the individual or individuals in question.
The k in k-anonymity refers to a variable — think of the classic ‘x’ in your high school algebra class. In this case, k refers to the number of times each combination of values appears in a data set. If k=2, the data is said to be 2-anonymous. This means the data points have been generalized enough that there are at least two of every combination of values in the data set. For example, if a data set features the locations and ages for a group of individuals, the data would need to be generalized to the point that each age/location pair appears at least twice. Note, however, that k-anonymity does not provide an absolute guarantee of privacy protection.
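The defining check is easy to state in code. A sketch with invented column names: every combination of quasi-identifier values must occur at least k times.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

rows = [
    {"age": "30-39", "city": "Krakow", "diagnosis": "flu"},
    {"age": "30-39", "city": "Krakow", "diagnosis": "asthma"},
    {"age": "40-49", "city": "Warsaw", "diagnosis": "flu"},
    {"age": "40-49", "city": "Warsaw", "diagnosis": "flu"},
]
print(is_k_anonymous(rows, ["age", "city"], k=2))               # True
print(is_k_anonymous(rows, ["age", "city", "diagnosis"], k=2))  # False
```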

26
Q

l-diversity

A

l-Diversity extends k-anonymity by further requiring that there be at least l distinct sensitive values in each group of k records. This prevents a privacy breach in which all k records share the same sensitive value: there are at least l possible occupations for an individual, even if we know which group of k people they belong to.
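A sketch of the corresponding check (column names invented): within each group of identical quasi-identifier values, count the distinct sensitive values.

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

rows = [
    {"age": "30-39", "occupation": "nurse"},
    {"age": "30-39", "occupation": "teacher"},
    {"age": "40-49", "occupation": "miner"},
    {"age": "40-49", "occupation": "miner"},
]
# The 40-49 group has only one occupation, so the set is not 2-diverse.
print(is_l_diverse(rows, ["age"], "occupation", l=2))  # False
```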

27
Q

t-closeness

A

t-Closeness extends this further by ensuring that the distribution of sensitive values in each group of k records is sufficiently close to the overall distribution of those values.

28
Q

Database reconstruction (from aggregate data)

A

With a large enough set of aggregates, database reconstruction becomes possible: building a dataset that would generate the published aggregate statistics. In many cases, it can be shown that such a reconstructed database is unique, or at least that many of the individual records in it are unique (i.e., they must exist in any dataset that generated those aggregates). While this data is not identified, it is essentially a microdata set and subject to the same re-identification attacks.

29
Q

Differential privacy

A

Differential privacy adds noise to data to protect the privacy of individuals. This noise can also reduce the utility of the data, making it less accurate or useful for certain types of analysis; this trade-off can be difficult to manage and requires careful balancing of privacy and utility. The most widely accepted definition for noise addition at this time is differential privacy.

The idea behind differential privacy is to add sufficient noise to the aggregates to hide the impact of any one individual. The key idea is to compare the difference in the aggregate result between two databases that differ by one individual. Differential privacy requires that the added noise be large relative to that difference, for any two databases and any individual.

Differential privacy deals with a key challenge in the release of aggregates: Even though it may be safe to release two aggregate values (e.g., two tables) independently, given both, is it possible to re-identify individuals from these tables? The answer may be yes. A simple example is releasing the total payroll of a company and the total payroll of the company exclusive of the CEO. While neither figure by itself reveals individual salaries, given both it is easy to determine the CEO’s salary.
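A sketch of the standard Laplace mechanism for a counting query, which has sensitivity 1 (adding or removing one person changes a count by at most 1). The epsilon value is illustrative.

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two independent Exp(1) draws is Laplace-distributed.
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count(true_count: int, epsilon: float) -> float:
    # Counting queries have sensitivity 1: one individual changes the
    # result by at most 1, so noise is drawn from Laplace(1/epsilon).
    return true_count + laplace_noise(1.0 / epsilon)

# e.g., a noisy count of patients with some condition; smaller epsilon
# means more noise and stronger privacy.
noisy = dp_count(1000, epsilon=0.5)
```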

30
Q

Differential identifiability

A

Differential identifiability is a reformulation of differential privacy that limits the confidence that any particular individual has contributed to the aggregate value.

31
Q

Aggregate data

A

Instead of publishing de-identified, individual-level records, one can publish aggregate statistics derived from the data. On the face of it, this would eliminate privacy concerns. Unfortunately, it is often possible to determine individual values from such statistics. Though the techniques are still evolving, releasing data aggregates rather than microdata often provides significantly better privacy protection while still meeting the needs of data analysis.

32
Q

Local differential privacy

A

Local differential privacy allows individuals to add noise before sending data to a server; the server then computes the aggregate from the already-noisy data. A classic example of client-side noise addition is randomized response, where those surveyed randomly choose to either answer correctly or provide a random answer. Any individual response is suspect (it could easily be just a random choice), but it is possible to construct aggregate results that have provable correctness bounds. Randomized response can provide differential privacy.
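A sketch of the classic coin-flip version of randomized response, plus the unbiased estimator that recovers the population rate from the noisy answers:

```python
import random

def randomized_response(truth: bool) -> bool:
    # First flip: answer truthfully? Otherwise, second flip: answer at random.
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_true_rate(responses) -> float:
    # P(yes) = 0.5 * p + 0.25, so p = 2 * (observed_yes_rate - 0.25).
    yes_rate = sum(responses) / len(responses)
    return 2 * (yes_rate - 0.25)

# Simulate a survey where 30% of people truthfully hold the attribute.
population = [random.random() < 0.3 for _ in range(100_000)]
answers = [randomized_response(t) for t in population]
print(round(estimate_true_rate(answers), 2))  # close to 0.3
```

No single answer reveals anything about its respondent with confidence, yet the aggregate estimate converges to the true rate.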

33
Q

Local sensitivity

A

Local sensitivity is a measure of the impact of one individual on the result of a query on a particular dataset.

34
Q

Client-side techniques to enhance anonymity

A

Client-side techniques can also enhance anonymity. For example, proxy servers can hide the IP address of a request by replacing it with that of the proxy server. Techniques such as onion routing and Crowds further extend this notion of proxies by hiding IP addresses even from the proxy server. Tor is a practical example of such a system: a peer-to-peer network where each request is routed to another peer, which routes it to another peer, and so on until a final peer makes the actual request. Encryption is used to ensure that only the first peer knows where the request came from, and only the last peer knows the server to which the request is being routed.
This hides only the IP address. Most internet traffic contains considerably more identifying information. For example, a typical HTTP request contains information on the browser, last page visited, type of machine and so on. This can make such a request identifiable even if the IP address is not known. Private web search is a browser plug-in that strips such information from the request. This leaves only the search text itself, but as we have seen with the AOL query log disclosure, even this may be sufficient to identify an individual. Tools have been developed to generate “cover queries”: fake query traffic that disguises the actual request.

Excerpt From
IAPP_T_TB_Introduction-to-Privacy-for-Technology_1.1
This material may be protected by copyright.