chapter 3 Flashcards
what is the goal of anonymization
Balancing Data Privacy and Data Utility to make data less specific while retaining its usefulness
original database goes through _________ to become published database
anonymization
what are some anonymisation techniques
Attribute Suppression
Character Masking
Generalisation
Swapping
Data Perturbation
Synthetic Data
Data aggregation
K-anonymity
Pseudonymization
what is attribute suppression
- Removal of an entire part of data (a column in databases or spreadsheets) in a dataset
- Used when an attribute is not required in the anonymised dataset
- Strongest type of anonymisation technique
what is an example of attribute suppression
Example: Data consists of test scores
- Recipient only needs to analyse test scores with respect to trainers
- The “student” attribute is removed
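A minimal sketch of the test score example, assuming the dataset is a list of dictionaries. The record values and the helper name `suppress_attribute` are hypothetical.

```python
# Hypothetical records: student, trainer, and test score.
records = [
    {"student": "John", "trainer": "Tina", "score": 87},
    {"student": "Yong", "trainer": "Tina", "score": 56},
    {"student": "Ming", "trainer": "Henry", "score": 92},
]


def suppress_attribute(rows, attribute):
    """Return copies of the rows with one attribute (column) removed."""
    return [{k: v for k, v in row.items() if k != attribute} for row in rows]


# The recipient only needs trainer and score, so "student" is dropped.
anonymised = suppress_attribute(records, "student")
```

The original records are left untouched; only the published copy loses the column.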
what is character masking
- Characters of a data value are masked by using a symbol, e.g. “*” or “x”
- Used when hiding part of a string of characters is sufficient to provide the anonymity required
- Depending on the attribute type, the mask replaces either a fixed number of characters or a variable number of characters
what is an example of character masking
Example: online grocery store conducting a study of its delivery demand from historical data
- Last 4 digits of the postal codes are masked
- Leaves the first 2 digits, which correspond to the “sector code”
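A minimal sketch of the postal code example, assuming 6-digit postal codes stored as strings. The function name `mask_postal_code` and the sample codes are hypothetical.

```python
def mask_postal_code(code, keep=2, symbol="x"):
    """Keep the first `keep` characters and mask the rest with `symbol`."""
    return code[:keep] + symbol * (len(code) - keep)


# Hypothetical postal codes; only the 2-digit sector code stays visible.
masked = [mask_postal_code(c) for c in ["100111", "200222", "300333"]]
```

This masks a fixed number of characters. For attributes of varying length, such as email addresses, the number of masked characters would have to vary instead.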
what is generalisation
- Reduction in the precision of data, e.g., converting a person’s age into a range of values
- Used where values can be generalised into a range, and still be useful
- Data ranges that are too large may mean too much modification; data ranges that are too small may make it too easy to re-identify individuals
example of generalisation
Example: Dataset contains person name, age in years, and residential address
- Age ranges of 10 years, starting with a range <20 years, and ending with a range >60 years
- Remove the block/house number and retain only the road name in Address
- If an address appears in only 1 record, it is still too unique, so that record has to be removed from the data
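A minimal sketch of the two generalisation steps above. The exact band boundaries (where ages 20 and 60 fall) and the address form "<number> <road name>" are assumptions, and both function names are hypothetical.

```python
def generalise_age(age):
    """Map an age in years to a band: <20, 20-29, ..., 50-60, >60."""
    if age < 20:
        return "<20"
    if age > 60:
        return ">60"
    if age >= 50:
        return "50-60"  # assumption: 60 is folded into the last 10-year band
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def generalise_address(address):
    """Drop a leading block/house number, keeping only the road name."""
    number, _, road = address.partition(" ")
    # Only strip the first token if it actually contains a digit.
    return road if any(ch.isdigit() for ch in number) else address
```

For instance, `generalise_age(45)` gives `"40-49"` and `generalise_address("12 Example Road")` gives `"Example Road"`.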
what is swapping
- Rearrangement of data in the dataset such that the individual attribute values are represented, but do not correspond to the original records
- Used when subsequent analysis only needs to look at aggregated data, not relationships between attributes
- Not all attributes (columns) need to be swapped; depending on the situation, only attributes containing values that are relatively identifiable need to be swapped
what is an example of swapping
Example: Dataset contains information about customer records for a business organisation
- All values for all attributes have been swapped
- If the purpose of the anonymised dataset is to study the relationships between job profile and consumption patterns, other methods of anonymisation may be more suitable, e.g. generalisation
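A minimal sketch of swapping, assuming each chosen column is shuffled independently across records. The customer data and the function name `swap_columns` are hypothetical.

```python
import random


def swap_columns(rows, attributes, seed=0):
    """Shuffle the values of each listed attribute independently across rows."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    swapped = [dict(row) for row in rows]
    for attr in attributes:
        values = [row[attr] for row in rows]
        rng.shuffle(values)
        for row, value in zip(swapped, values):
            row[attr] = value
    return swapped


# Hypothetical customer records.
customers = [
    {"job": "Teacher", "birth_year": 1980, "spend": 120},
    {"job": "Nurse", "birth_year": 1992, "spend": 340},
    {"job": "Driver", "birth_year": 1975, "spend": 60},
]
swapped = swap_columns(customers, ["job", "birth_year", "spend"])
```

Each column keeps the same set of values, so totals and counts are unchanged, but the link between job and spend within a row is no longer reliable.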
what is Data Perturbation
- The values from the original dataset are modified to be slightly different
- Used for quasi-identifiers (typically numbers and dates), and should not be used where data accuracy is crucial
- The degree of perturbation should be proportionate to the range of values of the attribute
what is an example of data perturbation
Example: Rounding off the values of the numeric columns to base 3 or base 5 (i.e. to the nearest multiple of 3 or 5), depending on the range of values of the attribute.
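A minimal sketch of base rounding, assuming "base 5" means rounding to the nearest multiple of 5. The function name `round_to_base` and the sample values are hypothetical.

```python
def round_to_base(value, base):
    """Round a number to the nearest multiple of `base`."""
    return base * round(value / base)


# Hypothetical values: a wider range uses base 5, a narrower range base 3.
heights_cm = [round_to_base(h, 5) for h in [163, 178, 171]]
ages = [round_to_base(a, 3) for a in [22, 25, 31]]
```

A larger base changes each value more, which is why the base should be chosen in proportion to the range of the attribute.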
what is synthetic data
- Data that is artificially or programmatically created often with the help of algorithms, rather than being generated by actual events
- Captures the underlying structure and displays the same statistical distributions as the original data
- Used for a wide range of activities, including as test data for new products and in AI model training, while maintaining data privacy
example of synthetic data
Example: An office facility providing “hot-desking” facilities keeps records of the times that users start and end using its facilities.
- They would like synthetic data for 1 day, to perform simulation testing on a new facility allocation
- Synthetic data is created based on the statistics derived from the original data
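A minimal sketch of the hot-desking example, assuming start time and duration can each be approximated by a normal distribution fitted to the original records. The records, the distribution choice, and the function name `synthesise` are all assumptions for illustration.

```python
import random
import statistics

# Hypothetical original records: (start_hour, end_hour) per desk use.
original = [(9.0, 12.5), (8.5, 17.0), (13.0, 18.0), (10.0, 11.5), (9.5, 16.0)]


def synthesise(records, n, seed=0):
    """Generate n artificial (start, end) pairs from summary statistics."""
    rng = random.Random(seed)
    starts = [s for s, _ in records]
    durations = [e - s for s, e in records]
    mu_s, sd_s = statistics.mean(starts), statistics.stdev(starts)
    mu_d, sd_d = statistics.mean(durations), statistics.stdev(durations)
    synthetic = []
    for _ in range(n):
        # Clamp so every synthetic booking stays within a single day.
        start = min(max(rng.gauss(mu_s, sd_s), 0.0), 23.0)
        duration = max(rng.gauss(mu_d, sd_d), 0.25)
        end = min(start + duration, 24.0)
        synthetic.append((round(start, 2), round(end, 2)))
    return synthetic


one_day = synthesise(original, n=20)
```

None of the generated rows corresponds to a real user; only the mean and spread of the original data carry over.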
what is data aggregation
- Converting a dataset from a list of records to summarised values
- Used when individual records are not required and aggregated data is sufficient for the purpose
- If the aggregated data includes a single record in any of the categories, it could be easy for someone with some additional knowledge to identify an individual; hence, aggregation may need to be applied in combination with suppression
what is an example of data aggregation
Example: A charity organisation has records of the donations made, as well as some information about the donors. Aggregated data is assessed to be sufficient to perform data analysis
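A minimal sketch of aggregation combined with suppression, assuming donations are grouped by an income band and any group with fewer than `min_count` records is dropped. The data, field names, and function name `aggregate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical donation records.
donations = [
    {"donor": "A", "income_band": "low", "amount": 50},
    {"donor": "B", "income_band": "low", "amount": 100},
    {"donor": "C", "income_band": "mid", "amount": 200},
    {"donor": "D", "income_band": "mid", "amount": 300},
    {"donor": "E", "income_band": "high", "amount": 5000},
]


def aggregate(rows, group_key, value_key, min_count=2):
    """Summarise rows per group, suppressing groups that are too small."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        group: {"count": len(values), "total": sum(values)}
        for group, values in groups.items()
        if len(values) >= min_count
    }


summary = aggregate(donations, "income_band", "amount")
```

The "high" band contains a single donor, so publishing its total would reveal that donor's exact donation; it is therefore left out of the summary.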
what is K-anonymity
- A property of a dataset that is usually used to describe the dataset’s level of anonymity
- Protects against re-identification, and is often described as a ‘hiding in the crowd’ guarantee
- k in k-anonymity refers to the minimum number of times each combination of quasi-identifier values appears in a dataset
- If k = 3, the data is said to be 3-anonymous; the higher the value of ‘k’, the harder it is for individuals to be identified
what is an example of k-anonymity
Example: Research needs to be done on the types of disease
- Name, Postcode, Age, and Gender are attributes that could be used to identify an individual
- Data anonymised to achieve k-anonymity of k = 3, i.e. at most a 1/3 chance of identifying an individual
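A minimal sketch of measuring k, assuming names have already been removed and postcode, age, and gender have been generalised. The records and the function name `k_anonymity` are hypothetical.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values()) if counts else 0


# Hypothetical records after generalisation.
patients = [
    {"postcode": "10xxxx", "age": "20-29", "gender": "F", "disease": "Flu"},
    {"postcode": "10xxxx", "age": "20-29", "gender": "F", "disease": "Asthma"},
    {"postcode": "10xxxx", "age": "20-29", "gender": "F", "disease": "Flu"},
    {"postcode": "20xxxx", "age": "40-49", "gender": "M", "disease": "Diabetes"},
    {"postcode": "20xxxx", "age": "40-49", "gender": "M", "disease": "Flu"},
    {"postcode": "20xxxx", "age": "40-49", "gender": "M", "disease": "Asthma"},
]

k = k_anonymity(patients, ["postcode", "age", "gender"])
max_reidentification_chance = 1 / k
```

Every combination of quasi-identifiers appears 3 times here, so k is 3 and the chance of picking out one individual from the quasi-identifiers alone is at most 1/3.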
what is Pseudonymization
- Replacement of identifying data with made up values, which are unique, and should have no relationship to the original values
- Used when the data values need to be uniquely distinguished
- Persistent pseudonyms allow linkage across different datasets
- May need to follow the structure or data type of the original value, simply to look more similar to the original attribute
what is an example of pseudonymization
Example: Dataset contains the names of persons who obtained their driving licences, along with other information
- the names were replaced with pseudonyms
- Useful for cross-dataset linking and where the original data structure is needed, but does not by itself comply with personal data protection regulations if applied only to explicit identifiers
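A minimal sketch of persistent pseudonyms, assuming random hexadecimal tokens are acceptable as made-up values. The class name `Pseudonymiser` and the sample names are hypothetical.

```python
import secrets


class Pseudonymiser:
    """Assign each original value a random, unique, persistent pseudonym."""

    def __init__(self):
        # Mapping table: must be stored securely and kept out of the release.
        self._mapping = {}

    def pseudonym(self, value):
        if value not in self._mapping:
            candidate = secrets.token_hex(4)
            # Regenerate in the unlikely event of a collision.
            while candidate in self._mapping.values():
                candidate = secrets.token_hex(4)
            self._mapping[value] = candidate
        return self._mapping[value]


p = Pseudonymiser()
names = ["Joe Phang", "Zack Lim", "Joe Phang"]
pseudonyms = [p.pseudonym(name) for name in names]
```

The same name always maps to the same token, which is what allows linkage across datasets. Because the tokens are random, they have no relationship to the original values.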
what are the 2 phases in the anonymisation methodology
Anonymisation Preparation Phase
Anonymisation Execution Phase
what are the 4 steps in the anonymisation preparation phase
determine the release model
determine the reidentification risk threshold
classify the data attributes
remove unused data attributes
what does determine the release model mean ?
- Refers to how the anonymised dataset will be released
- Public or Non-Public release
what does Determine re-identification risk threshold mean ?
- Data anonymity increases as Risk Threshold increases
- Data Utility decreases as Risk Threshold increases
what is risk threshold
The risk threshold is a parameter that determines the desired level of privacy protection in a dataset, balancing the trade-off between data anonymity and data utility.
what does classify the data attributes mean ?
- Classification affects how the attributes will subsequently be processed
- Explicit/quasi identifiers, sensitive data
why should attributes not required in the dataset be removed/suppressed ?
Attributes not required in the anonymized dataset should be suppressed to reduce the risk of re-identification, protect individuals’ privacy, and minimize the potential for unintended data leakage or misuse.
what is step 4 in the Anonymization Preparation Phase
Remove unused data attributes: Attributes that are not required in the anonymized dataset should be suppressed
define data anonymization
Data anonymization is the irreversible process of transforming a dataset to conceal individuals’ identities and sensitive information while preserving its structure and utility for research and analysis
what are the 4 steps in the anonymization execution phase
Anonymise identifiers
Evaluate the solution
Determine controls required
Document anonymisation process
what does anonymise identifiers mean ?
- Apply relevant anonymization techniques
- Different techniques are applicable for different types of identifiers
what does evaluate the solution mean
- Examine the anonymised dataset to assess if there is sufficient data anonymity and utility
what does it mean to determine the controls required
- Technical controls, including access control, authentication, encryption
- Non-technical controls, incl. legal, company processes
what does it mean to document the anonymisation process
- Details of the anonymisation process, parameters used and controls should be clearly recorded for future reference
- Facilitates maintenance