chapter 3 Flashcards by Smashing Gaming vulture

what is the goal of anonymization

Balancing Data Privacy and Data Utility to make data less specific while retaining its usefulness

original database goes through _________ to become published database

anonymization

what are some anonymisation techniques

Attribute Suppression

Character Masking

Generalisation

Swapping

Data Perturbation

Synthetic Data

Data aggregation

K-anonymity

Pseudonymization

what is attribute suppression

Removal of an entire part of data (column in databases or spreadsheets) in a dataset
Used when an attribute is not required in the anonymised dataset
Strongest type of anonymisation technique

what is an example of attribute suppression

Example: Data consists of test scores

Recipient only needs to analyse test scores with respect to trainers
The “student” attribute is removed
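
A minimal sketch of the example above, assuming the data is held as a list of dictionaries. The column names, values and the helper `suppress_attribute` are hypothetical and only illustrate dropping one column.

```python
# Hypothetical records: column names and values are illustrative only.
records = [
    {"student": "Alice", "trainer": "Tina", "test_score": 78},
    {"student": "Ben", "trainer": "Tina", "test_score": 85},
    {"student": "Chloe", "trainer": "Raj", "test_score": 91},
]


def suppress_attribute(rows, attribute):
    """Return copies of the rows without the given attribute (column)."""
    return [{k: v for k, v in row.items() if k != attribute} for row in rows]


# The recipient only needs scores per trainer, so "student" is removed.
published = suppress_attribute(records, "student")
```

The original `records` list is left unchanged, so the unsuppressed data can be kept under separate controls.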

what is character masking

Characters of a data value are masked by using a symbol, e.g. “*” or “x”
Used when hiding part of a string of characters is sufficient to provide the anonymity required
Depending on the attribute type, the mask replaces either a fixed number of characters or a variable number of characters

what is an example of character masking

Example: online grocery store conducting a study of its delivery demand from historical data

The last 4 digits of the postal codes are masked, leaving the first 2 digits, which correspond to the “sector code”
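
A small sketch of this kind of masking, assuming 6-digit postal codes stored as strings. The function name `mask_postal_code`, its parameters and the sample code are hypothetical.

```python
def mask_postal_code(code, keep=2, symbol="x"):
    """Keep the first `keep` characters and mask the rest with `symbol`."""
    return code[:keep] + symbol * (len(code) - keep)


# Hypothetical 6-digit postal code: the first 2 digits are kept.
masked = mask_postal_code("100111")
```

Here the number of masked characters varies with the length of the input; a fixed-length mask would instead always emit the same number of symbols.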

what is generalisation

Reduction in the precision of data, e.g., converting a person’s age into a range of values
Used where values can be generalised into a range and still be useful
Data ranges that are too large may mean too much modification, while data ranges that are too small may make it too easy to re-identify individuals

example of generalisation

Example: Dataset contains person name, age in years, and residential address
* Age is generalised into ranges of 10 years, starting with a range of <20 years and ending with a range of >60 years
* The block/house number is removed and only the road name is retained in Address

Also, if an address appears in only one record in the data, it is too unique, so that record has to be removed from the data
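
The age part of this example can be sketched as follows. The function name `generalise_age` is hypothetical, and placing an age of exactly 60 in the top band is an assumption, since the card only names the ranges <20 and >60.

```python
def generalise_age(age):
    """Map an exact age to a 10-year band (band edges are an assumption)."""
    if age < 20:
        return "<20"
    if age >= 60:
        # Assumption: 60 itself is grouped with the top band.
        return ">=60"
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


bands = [generalise_age(a) for a in (17, 20, 34, 59, 60, 72)]
```

Wider bands lower the re-identification risk but also lower the precision available for analysis.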

what is swapping

Rearrangement of data in the dataset such that the individual attribute values are represented, but do not correspond to the original records
Used when subsequent analysis only needs to look at aggregated data, not relationships between attributes
Not all attributes (columns) need to be swapped; depending on the situation, only attributes containing values that are relatively identifiable need to be swapped

what is an example of swapping

Example: Dataset contains information about customer records for a business organisation

All values for all attributes have been swapped
If the purpose of the anonymised dataset is to study the relationships between job profile and consumption patterns, other methods of anonymisation may be more suitable, e.g. generalisation
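
A rough sketch of swapping, assuming each chosen column is shuffled independently across records. The customer records, column names and the helper `swap_attributes` are hypothetical.

```python
import random


def swap_attributes(rows, attributes, seed=None):
    """Shuffle each listed attribute independently across the records."""
    rng = random.Random(seed)
    swapped = [dict(row) for row in rows]  # copy so the input is untouched
    for attribute in attributes:
        values = [row[attribute] for row in rows]
        rng.shuffle(values)
        for row, value in zip(swapped, values):
            row[attribute] = value
    return swapped


# Hypothetical customer records.
customers = [
    {"job": "teacher", "spend": 120},
    {"job": "engineer", "spend": 340},
    {"job": "nurse", "spend": 90},
    {"job": "chef", "spend": 210},
]
swapped = swap_attributes(customers, ["job", "spend"], seed=1)
```

Each column keeps the same set of values, so per-column totals and counts are unchanged, but the link between job and spend within a record is no longer reliable.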

what is Data Perturbation

The values from the original dataset are modified to be slightly different
This is used for quasi-identifiers, typically numbers and dates, and should not be used where data accuracy is crucial
The degree of perturbation should be proportionate to the range of values of the attribute

what is an example of data perturbation

Rounding the values of the numeric columns to the nearest multiple of a base, either 3 or 5, depending on the range of values of the attribute.
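
Base rounding can be sketched in a line of Python, reading "base 5" as "nearest multiple of 5". The function name `round_to_base` and the sample values are hypothetical.

```python
def round_to_base(value, base):
    """Round a number to the nearest multiple of `base`.

    Exact ties follow Python's built-in round() (round half to even).
    """
    return base * round(value / base)


# Hypothetical values: a wider-ranging attribute uses base 5,
# a narrower-ranging attribute uses base 3.
rounded_base_5 = [round_to_base(v, 5) for v in (37, 38, 161)]
rounded_base_3 = [round_to_base(v, 3) for v in (37, 10, 23)]
```

A larger base perturbs values more, which is why the base is chosen in proportion to the attribute's range.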

what is synthetic data

Data that is artificially or programmatically created, often with the help of algorithms, rather than being generated by actual events
Captures the underlying structure and displays the same statistical distributions as the original data
Used for a wide range of activities, including as test data for new products and in AI model training, while maintaining data privacy

example of synthetic data

Example: An office facility providing “hot-desking” keeps records of the times at which users start and end using its facilities.

They would like synthetic data for 1 day, to perform simulation testing on a new facility allocation
Synthetic data is created based on the statistics derived from the original data
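
One possible way to sketch this, assuming start times and durations are roughly normally distributed (a simplification, not something the card states). The sample sessions, the clamping limits and the helper `synthesise_day` are hypothetical.

```python
import random
import statistics

# Hypothetical original sessions as (start_hour, end_hour) pairs.
original = [(9.0, 17.5), (8.5, 12.0), (13.0, 18.0), (9.5, 16.0), (10.0, 15.5)]


def synthesise_day(sessions, n, seed=None):
    """Draw n synthetic sessions from the mean and spread of the originals."""
    rng = random.Random(seed)
    starts = [start for start, _ in sessions]
    durations = [end - start for start, end in sessions]
    mu_s, sd_s = statistics.mean(starts), statistics.stdev(starts)
    mu_d, sd_d = statistics.mean(durations), statistics.stdev(durations)
    synthetic = []
    for _ in range(n):
        # Clamp to a valid day; the limits are arbitrary assumptions.
        start = min(max(rng.gauss(mu_s, sd_s), 0.0), 23.0)
        duration = max(rng.gauss(mu_d, sd_d), 0.25)
        end = min(start + duration, 24.0)
        synthetic.append((round(start, 2), round(end, 2)))
    return synthetic


synthetic = synthesise_day(original, 20, seed=42)
```

None of the synthetic sessions corresponds to a real user, yet their start times and durations follow the statistics of the original records.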

what is data aggregation

Converting a dataset from a list of records to summarised values
Used when individual records are not required and aggregated data is sufficient for the purpose
If the aggregated data includes a single record in any of the categories, it could be easy for someone with some additional knowledge to identify an individual; hence, aggregation may need to be applied in combination with suppression

what is an example of data aggregation

Example: charity organisation has records of the donations made, as well as some information about the donors. Aggregated data is assessed to be sufficient to perform data analysis
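
A minimal sketch of aggregation combined with suppression of small groups. The income bands, amounts, the `min_count` threshold and the helper `aggregate_donations` are hypothetical.

```python
from collections import defaultdict

# Hypothetical donation records as (donor_income_band, amount) pairs.
donations = [
    ("<50k", 100), ("<50k", 50),
    ("50k-100k", 200), ("50k-100k", 300), ("50k-100k", 250),
    (">100k", 1000),
]


def aggregate_donations(rows, min_count=2):
    """Summarise donations per band, suppressing bands with too few donors."""
    groups = defaultdict(list)
    for band, amount in rows:
        groups[band].append(amount)
    return {
        band: {"donors": len(amounts), "total": sum(amounts)}
        for band, amounts in groups.items()
        if len(amounts) >= min_count  # single-record groups are dropped
    }


summary = aggregate_donations(donations)
```

The ">100k" band holds a single donor, so it is suppressed rather than published as a group of one.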

what is K-anonymity

A property of a dataset that is usually used to describe the dataset’s level of anonymity
Protects against re-identification, and is often described as a ‘hiding in the crowd’ guarantee
k in k-anonymity refers to the minimum number of times each combination of quasi-identifier values appears in a dataset
If k = 3, the data is said to be 3-anonymous; the higher the value of ‘k’, the harder it is for individuals to be identified

what is an example of k-anonymity

Example: Research needs to be done on the types of disease

Name, Postcode, Age, and Gender are attributes that could be used to identify an individual
Data is anonymised to achieve k-anonymity with k = 3, so that there is at most a 1/3 chance of identifying an individual
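
The value of k for a dataset can be checked with a short sketch like the one below. The records, the generalised values and the helper `k_anonymity` are hypothetical; names are assumed to have been suppressed already.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group sharing the same
    combination of quasi-identifier values (0 for an empty dataset)."""
    if not rows:
        return 0
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


# Hypothetical records after masking postcodes and generalising ages.
patients = [
    {"postcode": "10xxxx", "age": "20-29", "gender": "F", "disease": "flu"},
    {"postcode": "10xxxx", "age": "20-29", "gender": "F", "disease": "asthma"},
    {"postcode": "10xxxx", "age": "20-29", "gender": "F", "disease": "flu"},
    {"postcode": "56xxxx", "age": "40-49", "gender": "M", "disease": "diabetes"},
    {"postcode": "56xxxx", "age": "40-49", "gender": "M", "disease": "flu"},
    {"postcode": "56xxxx", "age": "40-49", "gender": "M", "disease": "asthma"},
]
k = k_anonymity(patients, ["postcode", "age", "gender"])
```

With k = 3, any individual is indistinguishable from at least two others on the quasi-identifiers, which is where the 1/3 bound comes from.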

what is Pseudonymization

Replacement of identifying data with made-up values, which are unique and should have no relationship to the original values
Used when the data values need to be uniquely distinguished
Persistent pseudonyms allow linkage across different datasets
May need to follow the structure or data type of the original value, simply to look more similar to the original attribute

what is an example of pseudonymization

Example: names of persons who obtained their driving licenses and other information

the names were replaced with pseudonyms

Useful for cross-dataset linking and where the original data structure is needed, but does not by itself comply with personal data protection regulations if applied only to explicit identifiers
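
A sketch of persistent pseudonyms, assuming random tokens are acceptable as replacement values. The names and the helper `pseudonymise` are hypothetical; the mapping table it returns would itself need to be protected or destroyed.

```python
import secrets


def pseudonymise(values, mapping=None):
    """Replace each value with a random, unique pseudonym.

    Passing the same `mapping` again keeps pseudonyms persistent,
    which allows linkage across datasets.
    """
    mapping = {} if mapping is None else mapping
    used = set(mapping.values())
    result = []
    for value in values:
        if value not in mapping:
            pseudonym = secrets.token_hex(4)
            while pseudonym in used:  # regenerate on the rare collision
                pseudonym = secrets.token_hex(4)
            mapping[value] = pseudonym
            used.add(pseudonym)
        result.append(mapping[value])
    return result, mapping


# Hypothetical licence-holder names.
licence_names, table = pseudonymise(["Joe Tan", "Mary Lim", "Joe Tan"])
other_dataset, table = pseudonymise(["Mary Lim"], mapping=table)
```

The tokens are random rather than derived from the names, so they carry no relationship to the original values.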

what are the 2 phases in the anonymisation methodology

Anonymisation Preparation Phase
Anonymisation Execution Phase

what are the 4 steps in the anonymisation preparation phase

determine the release model

determine the reidentification risk threshold

classify the data attributes

remove unused data attributes

what does determine the release model mean?

Refers to how the anonymised dataset will be released
Public or Non-Public release

what does determine the re-identification risk threshold mean?

* Data anonymity increases as Risk Threshold increases
* Data Utility decreases as Risk Threshold increases

what is risk threshold

The risk threshold is a parameter that determines the desired level of privacy protection in a dataset, balancing the trade-off between data anonymity and data utility.

what does classify the data attributes mean?

* Classification affects how the attributes will subsequently be processed
* Explicit/quasi identifiers, sensitive data

why should attributes not required in the dataset be removed/suppressed?

Attributes not required in the anonymized dataset should be suppressed to reduce the risk of re-identification, protect individuals' privacy, and minimize the potential for unintended data leakage or misuse.

what is step 4 in the Anonymization Preparation Phase

Remove unused data attributes: Attributes that are not required in the anonymized dataset should be suppressed

define data anonymization

Data anonymization is the irreversible process of transforming a dataset to conceal individuals' identities and sensitive information while preserving its structure and utility for research and analysis

what are the 4 steps in the anonymization execution phase

Anonymise identifiers
Evaluate the solution
Determine controls required
Document anonymisation process

what does anonymise identifiers mean?

* Apply relevant anonymisation techniques
* Different techniques are applicable to different types of identifiers

what does evaluate the solution mean

* Examine the anonymised dataset to assess if there is sufficient data anonymity and utility

what does it mean to determine the controls required

* Technical controls, including access control, authentication, encryption
* Non-technical controls, including legal and company processes

what does it mean to document the anonymisation process

* Details of the anonymisation process, parameters used and controls should be clearly recorded for future reference
* Facilitates maintenance

(35 cards)