Module 4 Flashcards
Technical Measures and Privacy-Enhancing Technologies
Identity
The link between a piece of information and the individual or individuals associated with it; it captures what we know about who that individual is. In the language of data, identifiers are codes or strings used to represent an individual, device or browser. The more precise the identifier, the stronger it is. Strong identifiers are typically unique numbers (SSN, credit card number, etc.), while weak identifiers tend to be more general and may belong to more than one individual (zip code, area code, etc.). Note that the strength of an identifier can also be affected by context.
Quasi-identifiers
Data that can be combined with external knowledge, such as publicly available information, to identify an individual.
Deidentification
A technique used to prevent an individual’s identity from being connected to their personal information.
Pseudonymization
Replacing individual identifiers with numbers, letters, symbols or a combination of these, such that data points are not directly associated with a specific individual. Note: as long as the mapping between the original values and the pseudonyms is documented, the original data can be restored.
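Example (a minimal sketch, not a production design): the records, the field name "name" and the "P0001"-style pseudonym format below are all made up for illustration. It shows the point of the note above: because the lookup table is kept, the original data can be restored.

```python
import itertools

def pseudonymize(records, id_field):
    """Replace id_field with a pseudonym; return the new records and the lookup table."""
    counter = itertools.count(1)
    lookup = {}  # original identifier -> pseudonym (hypothetical "P0001" format)
    result = []
    for rec in records:
        original = rec[id_field]
        if original not in lookup:
            lookup[original] = f"P{next(counter):04d}"
        result.append({**rec, id_field: lookup[original]})
    return result, lookup

def restore(records, id_field, lookup):
    """Reverse the pseudonymization using the documented lookup table."""
    reverse = {pseudonym: original for original, pseudonym in lookup.items()}
    return [{**rec, id_field: reverse[rec[id_field]]} for rec in records]

# Made-up sample data
patients = [
    {"name": "Alice Smith", "diagnosis": "asthma"},
    {"name": "Bob Jones", "diagnosis": "diabetes"},
    {"name": "Alice Smith", "diagnosis": "flu"},
]
pseudo, table = pseudonymize(patients, "name")
restored = restore(pseudo, "name", table)
```

The same person always receives the same pseudonym, so records can still be linked to each other without exposing the name. Whoever holds the lookup table can undo the process, which is why the table itself needs protecting.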
Anonymization
This method completely removes or alters personal information so that it’s impossible (or extremely difficult) to trace the data back to an individual. There’s no way to reverse it or find out who the data belongs to, even if you have more information.
Example: If a research study deletes all identifying information about participants (like names, addresses, or unique IDs) and generalizes data (e.g., “a 35-year-old male” instead of specific details), then even the researchers cannot figure out who the participants are.
Tokens
Random values used as stand-ins for meaningful data. Tokenization is a system of deidentifying data that substitutes these tokens for the original values.
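Example (a rough sketch under stated assumptions): the TokenVault class and its method names are hypothetical, and the card number is a made-up test value. The idea shown is that the token is random, so it carries no information about the value it stands in for; only the vault can map it back.

```python
import secrets

class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # 16 random hex characters
        while token in self._token_to_value:  # extremely unlikely collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        return self._token_to_value[token]

vault = TokenVault()
card = "4111 1111 1111 1111"  # made-up test value
token = vault.tokenize(card)
```

Downstream systems would store and pass around only the token, while the vault stays in a tightly controlled environment.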
K-anonymity
It’s built on the idea that by combining sets of data with similar attributes, identifying information about any one of the individuals contributing to that data can be obscured. k-Anonymization is often referred to as the power of “hiding in the crowd.” Individuals’ data is pooled in a larger group, meaning information in the group could correspond to any single member, thus masking the identity of the individual or individuals in question.
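Example (a small sketch, assuming made-up records whose age and zip code have already been generalized): the value of k for a dataset is the size of the smallest group of records that share the same quasi-identifier values. The function name and field names are illustrative only.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    if not records:
        return 0
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return min(groups.values())

# Made-up, already generalized records
rows = [
    {"age": "30-39", "zip": "021**", "condition": "flu"},
    {"age": "30-39", "zip": "021**", "condition": "asthma"},
    {"age": "30-39", "zip": "021**", "condition": "flu"},
    {"age": "40-49", "zip": "100**", "condition": "diabetes"},
    {"age": "40-49", "zip": "100**", "condition": "flu"},
]
k = k_anonymity(rows, ["age", "zip"])
```

Here every record is indistinguishable from at least one other record on age and zip, so each individual "hides in a crowd" of at least two.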
l-diversity
A property of a dataset and an extension of k-anonymity that measures the diversity of sensitive values for each column in which they occur. Each group of records that share the same quasi-identifier values must contain at least l distinct values of the sensitive attribute, so that knowing which group an individual belongs to does not reveal that value.
t-closeness
An extension of l-diversity that requires the distribution of sensitive values within each group to be close to their distribution in the whole dataset; the distance between the two distributions must not exceed a threshold t.
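Example (a sketch under stated assumptions): the records and function names are made up. The first function reports l, the smallest number of distinct sensitive values found in any group. The second reports the largest distance between a group's distribution of sensitive values and the overall distribution; the dataset satisfies t-closeness for any t at or above that number. The distance used here is half the sum of absolute differences, one simple choice for categorical values, and is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def group_sensitive(records, quasi_identifiers, sensitive):
    """Collect the sensitive values of each group of identical quasi-identifiers."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in quasi_identifiers)].append(rec[sensitive])
    return groups

def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any group."""
    groups = group_sensitive(records, quasi_identifiers, sensitive)
    return min(len(set(values)) for values in groups.values())

def max_distance(records, quasi_identifiers, sensitive):
    """Largest distance between a group's distribution and the overall one."""
    overall = Counter(rec[sensitive] for rec in records)
    total = len(records)
    worst = 0.0
    for values in group_sensitive(records, quasi_identifiers, sensitive).values():
        local = Counter(values)
        dist = 0.5 * sum(abs(local[v] / len(values) - overall[v] / total) for v in overall)
        worst = max(worst, dist)
    return worst

# Made-up, already generalized records
rows = [
    {"age": "30-39", "zip": "021**", "condition": "flu"},
    {"age": "30-39", "zip": "021**", "condition": "asthma"},
    {"age": "30-39", "zip": "021**", "condition": "flu"},
    {"age": "40-49", "zip": "100**", "condition": "diabetes"},
    {"age": "40-49", "zip": "100**", "condition": "flu"},
]
l = l_diversity(rows, ["age", "zip"], "condition")
t = max_distance(rows, ["age", "zip"], "condition")
```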
Aggregation
Information is expressed in a summary form that reduces the value and quality of data as well as the connection between the data and the individual it belongs to.
Frequency versus magnitude data
When reviewing aggregate data, you must first determine whether it is frequency data or magnitude data. Frequency Data: This tells you how often something happens, that is, how many times an event occurs. It is simply a count. Example: Imagine you’re looking at data from a school. If you want to know how many students got an “A” grade in math, frequency data would show that 30 students received an “A.” It just counts how many times the grade “A” appeared.
Magnitude Data: This measures how large or intense something is. It tells you about the size, amount, or level of something, not just how many times it happens. Example: If you’re looking at the total sales for a store, magnitude data would show how much money was made (e.g., $50,000 in sales last month). It’s about the total value, not how many transactions occurred.
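Example (a tiny sketch with made-up grades and sale amounts): frequency data comes from counting occurrences, while magnitude data comes from adding up amounts.

```python
from collections import Counter

def frequency(values, target):
    """Frequency data: how many times target occurs."""
    return Counter(values)[target]

def magnitude(amounts):
    """Magnitude data: the total size of what was measured."""
    return sum(amounts)

grades = ["A", "B", "A", "C", "A"]   # made-up grades
sales = [120.0, 80.5, 299.5]         # made-up sale amounts in dollars

a_count = frequency(grades, "A")     # counts students, ignores everything else
total_sales = magnitude(sales)       # adds up dollars, not transactions
```

Note that the sales list holds three transactions (a frequency) worth 500.0 dollars in total (a magnitude); the same data can yield either kind of aggregate.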
Noise addition through differential privacy
When data is aggregated, personal identifiers are removed from the data set being shared. However, it is still possible to reverse engineer the data to discover the underlying identifiers that were used to create the aggregation (by using auxiliary information, for example). One way to prevent reverse engineering is to “blur” the data points by using noise addition through differential privacy. The goal is to ensure that the aggregated data is still useful, while also making it nonspecific enough to avoid revealing the underlying identifiers. This is done by using an algorithm to generate values that remain meaningful and yet are nonspecific.
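Example (a rough sketch, not a vetted implementation): one common way to add such noise to a count is to draw it from a Laplace distribution. The parameter names epsilon (the privacy budget) and sensitivity (how much one person can change the count) are not from the text above; they are assumptions of this sketch, as are the function names.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(true_count, epsilon, rng=None, sensitivity=1.0):
    """Return the count blurred with Laplace noise of scale sensitivity / epsilon."""
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Made-up example: 30 students received an "A"
rng = random.Random(0)  # seeded only so the sketch is repeatable
released = noisy_count(30, epsilon=1.0, rng=rng)
```

A smaller epsilon means more noise and stronger blurring; a larger epsilon keeps the released value closer to the true count. Averaged over many releases the noise cancels out, which is why the aggregate stays useful.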
Differential Identifiability
While the algorithm used in differential privacy ensures that reverse engineering does not result in privacy violations, there is no clear guideline on how much noise to add before the quality of the aggregate value becomes poor. Differential identifiability improves on differential privacy by setting parameters (based on the individual identification’s contribution) for the algorithm to generate noise.
Encryption
The scrambling of information so that it can be read only by those with authorized access.
Algorithms
Mathematical procedures applied to a block of data to encrypt or decrypt it.
Keys
A small piece of data that controls an algorithm’s execution and is required to encrypt and decrypt a message.
Symmetric encryption
A way of keeping information secret by using a single special code, called a key, to both lock (encrypt) and unlock (decrypt) the information. Think of it like a padlock where the same key locks and unlocks it. The advantage of this type of encryption is that it is fast and efficient compared to asymmetric encryption.
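Example (a toy sketch for illustration only): a one-time pad, where a random key as long as the message is combined with it using XOR. It is used here only because it fits in a few lines and shows the same key locking and unlocking; real systems use vetted ciphers such as AES. The function names and the message are made up.

```python
import secrets

def generate_key(length):
    """Create a random key with as many bytes as the message."""
    return secrets.token_bytes(length)

def xor_bytes(data, key):
    """XOR each data byte with the matching key byte (encrypts and decrypts)."""
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"meet at noon"
key = generate_key(len(message))        # the single shared secret
ciphertext = xor_bytes(message, key)    # lock with the key
recovered = xor_bytes(ciphertext, key)  # unlock with the same key
```

Both parties must already hold the same key, which is the practical difficulty with symmetric encryption: the key has to be shared securely first.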
Asymmetric encryption
A way of keeping information secure by using two different keys: one for locking (encrypting) the information and another for unlocking (decrypting) it. Think of it like a mailbox: anyone can put a letter in (encrypt), but only the person with the key can open the mailbox and read the letter (decrypt). How it works:
Public Key (Locking/Encrypting): You have a key that you share with everyone. This key is used by others to lock up information they want to send you. It’s like the open slot of a mailbox where anyone can drop a letter in.
Private Key (Unlocking/Decrypting): You have another key that you keep secret and don’t share with anyone. This key is used to unlock the information and read it. It’s like the key that opens the mailbox, allowing you to take out and read the letters.
The advantage is that it is highly secure; the drawback is that it is slower than symmetric encryption.
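Example (a toy sketch for illustration only): textbook RSA with deliberately tiny primes, to show a public key locking a message that only the private key can unlock. The numbers are far too small to be secure and there is no padding; real systems rely on vetted libraries and much larger keys. All variable and function names are made up, and the code assumes Python 3.8 or later for the modular inverse.

```python
# Key generation with tiny, insecure primes
p, q = 61, 53
n = p * q                   # modulus, part of both keys
phi = (p - 1) * (q - 1)
e = 17                      # public exponent
d = pow(e, -1, phi)         # private exponent: modular inverse of e

public_key = (e, n)         # shared with everyone (the mailbox slot)
private_key = (d, n)        # kept secret (the mailbox key)

def encrypt(message_number, key):
    """Lock a number smaller than n with the public key."""
    exponent, modulus = key
    return pow(message_number, exponent, modulus)

def decrypt(cipher_number, key):
    """Unlock the number with the private key."""
    exponent, modulus = key
    return pow(cipher_number, exponent, modulus)

message = 65                # a message encoded as a number below n
ciphertext = encrypt(message, public_key)
recovered = decrypt(ciphertext, private_key)
```

Anyone who knows the public key can encrypt, but recovering the message requires the private exponent d, which never has to be shared.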
Application encryption
File-level or document-based encryption; provides built-in encryption that is applied throughout a program.