chapter 3 Flashcards
what is the goal of anonymization
Balancing Data Privacy and Data Utility to make data less specific while retaining its usefulness
original database goes through _________ to become published database
anonymization
what are some anonymisation techniques
Attribute Suppression
Character Masking
Generalisation
Swapping
Data Perturbation
Synthetic Data
Data aggregation
K-anonymity
Pseudonymization
what is attribute suppression
- Removal of an entire part of data (a column in databases or spreadsheets) in a dataset
- Used when an attribute is not required in the anonymised dataset
- Strongest type of anonymisation technique
what is an example of attribute suppression
Example: Data consists of test scores
- Recipient only needs to analyse test scores with respect to trainers
- The “student” attribute is removed
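A minimal sketch of the test-score example, assuming the dataset is held as a list of dictionaries. The function name `suppress_attribute` and the sample rows are hypothetical and only illustrate dropping a whole column.

```python
def suppress_attribute(records, attribute):
    """Return copies of the records with one attribute (column) removed."""
    return [{k: v for k, v in row.items() if k != attribute} for row in records]

# Hypothetical sample data: student, trainer, and test score.
records = [
    {"student": "John", "trainer": "Tina", "score": 87},
    {"student": "Yong", "trainer": "Tina", "score": 56},
    {"student": "Ming", "trainer": "Henry", "score": 92},
]

# The recipient only needs scores per trainer, so "student" is dropped.
anonymised = suppress_attribute(records, "student")
print(anonymised)
```

The original list is left untouched; only the published copy loses the column.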
what is character masking
- Characters of a data value are masked using a symbol, e.g. “*” or “x”
- Used when hiding part of a string of characters is sufficient to provide the anonymity required
- Depending on the attribute type, the mask replaces either a fixed number of characters or a variable number of characters
what is an example of character masking
Example: online grocery store conducting a study of its delivery demand from historical data
- The last 4 digits of the postal codes are masked
- This leaves the first 2 digits, which correspond to the “sector code”
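The postal-code masking can be sketched as below, assuming 6-digit postal codes stored as strings. The function name `mask_postal_code` and the sample value are hypothetical.

```python
def mask_postal_code(postal_code, keep=2, symbol="x"):
    """Keep the first `keep` characters and mask the rest with `symbol`."""
    return postal_code[:keep] + symbol * (len(postal_code) - keep)

# Hypothetical 6-digit postal code: the first 2 digits stay visible.
print(mask_postal_code("100111"))  # prints 10xxxx
```

Here a fixed number of characters is kept and everything after it is masked, so the number of masked characters varies with the length of the value.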
what is generalisation
- Reduction in the precision of data, e.g. converting a person’s age into a range of values
- Used where values can be generalised into a range and still be useful
- Data ranges that are too large may mean too much modification; data ranges that are too small may make it too easy to re-identify individuals
example of generalisation
Example: Dataset contains person name, age in years, and residential address
* Age ranges of 10 years, starting with a range <20 years and ending with a range >60 years
* Remove the block/house number and retain only the road name in Address
- If an address appears in only 1 record, it is still too unique after generalisation, so that record has to be removed from the data
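A rough sketch of both generalisation steps. The exact range boundaries are an assumption (the card does not say which range an age of exactly 20 or 60 falls into; here 20 goes into the lowest range and 60 into 51-60), and the address pattern assumes a hypothetical format where the block/house number comes first.

```python
import re

def generalise_age(age):
    """Map an exact age to a 10-year range (boundary handling is assumed)."""
    if age <= 20:
        return "<=20"
    if age > 60:
        return ">60"
    lower = ((age - 1) // 10) * 10 + 1  # 21, 31, 41 or 51
    return f"{lower}-{lower + 9}"

def generalise_address(address):
    """Drop a leading block/house number, keeping only the road name."""
    # Assumed format: optional "Blk", then a number such as "123" or "5A".
    return re.sub(r"^(?:blk\s+)?\d+[a-z]?\s+", "", address, flags=re.IGNORECASE)

print(generalise_age(34))                           # prints 31-40
print(generalise_address("123 Ang Mo Kio Ave 3"))   # prints Ang Mo Kio Ave 3
```

After generalising, any road name that still appears in only one record would need to be handled separately, for example by removing that record.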
what is swapping
- Rearrangement of data in the dataset such that the individual attribute values are represented, but do not correspond to the original records
- Used when subsequent analysis only needs to look at aggregated data, not relationships between attributes
- Not all attributes (columns) need to be swapped; depending on the situation, only attributes containing values that are relatively identifiable need to be swapped
what is an example of swapping
Example: Dataset contains information about customer records for a business organisation
- All values for all attributes have been swapped
- If the purpose of the anonymised dataset is to study the relationships between job profile and consumption patterns, other methods of anonymisation may be more suitable, e.g. generalisation
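Swapping can be sketched as shuffling each chosen column independently, so every value still appears in the dataset but no longer lines up with its original record. The customer rows, the attribute names, and the helper `swap_attribute` are all hypothetical.

```python
import random

def swap_attribute(records, attribute, rng):
    """Shuffle one attribute's values across records; other attributes stay put."""
    values = [row[attribute] for row in records]
    rng.shuffle(values)
    return [{**row, attribute: value} for row, value in zip(records, values)]

# Hypothetical customer records.
customers = [
    {"job": "Teacher", "birth_year": 1985, "orders": 12},
    {"job": "Engineer", "birth_year": 1990, "orders": 3},
    {"job": "Nurse", "birth_year": 1978, "orders": 7},
    {"job": "Chef", "birth_year": 1995, "orders": 20},
]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
swapped = customers
for attribute in ("job", "birth_year"):  # only the more identifiable columns
    swapped = swap_attribute(swapped, attribute, rng)
print(swapped)
```

Column-level totals and counts are preserved, but the link between job and orders is broken, which is why swapping does not suit analyses of relationships between attributes. A random shuffle can leave some values in their original row, so a real implementation may want to check for that.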
what is Data Perturbation
- The values from the original dataset are modified to be slightly different
- This is used for quasi-identifiers and typically for numbers and dates, and should not be used where data accuracy is crucial
- The degree of perturbation should be proportionate to the range of values of the attribute
what is an example of data perturbation
Example: Rounding the values of the numeric columns to the nearest multiple of a base (e.g. base 3 or base 5), with the base chosen depending on the range of values of the attribute.
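Rounding to a base means replacing each value with the nearest multiple of that base. A minimal sketch follows; the function name `round_to_base` and the sample height and weight values are hypothetical.

```python
def round_to_base(value, base):
    """Round `value` to the nearest multiple of `base`."""
    return base * round(value / base)

# Hypothetical values: a wider range gets the larger base.
print(round_to_base(177, 5))  # height in cm, prints 175
print(round_to_base(50, 3))   # weight in kg, prints 51
```

A larger base changes values more, so it suits attributes with a wide range, while a small base suits attributes whose values sit close together.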
what is synthetic data
- Data that is artificially or programmatically created often with the help of algorithms, rather than being generated by actual events
- Captures the underlying structure and displays the same statistical distributions as the original data
- Used for a wide range of activities, including as test data for new products and in AI model training, while maintaining data privacy
example of synthetic data
Example: An office facility providing “hot-desking” facilities keeps records of the times that users start and end using its facilities.
- They would like synthetic data for 1 day, to perform simulation testing on a new facility allocation
- Synthetic data is created based on the statistics derived from the original data
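One very simplified way to sketch this: derive the mean and standard deviation of start times and durations from the original records, then draw new records from those statistics. The sample data, the function name `make_synthetic`, and the assumption that start time and duration are independent and roughly normally distributed are all illustrative; a real generator would model the data more carefully.

```python
import random
import statistics

# Hypothetical original records: (start, end) in minutes since midnight.
original = [(540, 1020), (555, 1080), (600, 960), (510, 1050), (570, 990)]

def make_synthetic(original, n, rng):
    """Draw n artificial (start, end) records from simple summary statistics."""
    starts = [start for start, _ in original]
    durations = [end - start for start, end in original]
    mean_start, sd_start = statistics.mean(starts), statistics.stdev(starts)
    mean_dur, sd_dur = statistics.mean(durations), statistics.stdev(durations)
    synthetic = []
    for _ in range(n):
        start = round(rng.gauss(mean_start, sd_start))
        duration = max(1, round(rng.gauss(mean_dur, sd_dur)))  # at least 1 minute
        synthetic.append((start, start + duration))
    return synthetic

# One simulated day of 100 users; the seed only makes the sketch repeatable.
synthetic = make_synthetic(original, n=100, rng=random.Random(0))
print(synthetic[:3])
```

None of the generated records corresponds to an actual user, yet the start times and durations follow roughly the same distribution as the original data.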