Privacy + Data Sci/Machine Learning Basics Flashcards
Test 1 Prep
What is Artificial Intelligence?
enabling computers with intelligence to solve complex problems
e.g.
robots
chatbots
online gaming
voice assistants
What is machine learning?
extract knowledge from data to learn from that data and make predictions
E.g. recommendations
search algorithms
classification
What is Data Science?
Business and Problem Solving using descriptive and predictive analysis
e.g.
retail trends
financial analysis
transit development
According to the Harvard Business Review, what is data science about?
Data science is about infrastructure, testing, using machine learning for decision making, and data products.
Use data analysis to get insights
What can DS work help reveal?
hidden impacts
What are the 5 stages of the Data Science Lifecycle?
Problem definition
Data Collection/curation
Data Analysis
Context
Decision Making
What questions do we try to answer in problem definition/formulation?
What problem are we trying to solve?
What do we want to find out that we don’t know now?
What are our assumptions and hypotheses?
How will we measure success?
What are the 2 types of questions we ask in problem definition/formulation?
Exploratory –> Relationship between different elements/variables/features
Descriptive –> describe statistically, or through data, the situation
Helpful in determining what data you need
What questions do we answer in Data Collection and Curation?
What data do we need?
How do we collect that data?
Do we have enough data?
Does the data contain what we are looking for?
What is the scope of the data?
Is the data clean?
What do we do in the Data Analysis step?
from simple statistical understanding to complex models for prediction or inference.
- statistical summaries;
- discover patterns;
- machine learning algorithms
- supervised vs unsupervised vs semi-supervised learning.
- Classification - predict a label (discrete)
- Regression - predict a continuous value
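A minimal sketch of the difference between classification and regression, using only the standard library. The data, the 1-nearest-neighbour rule, and the function names are illustrative assumptions, not part of the course material.

```python
# Toy data (made up for illustration): hours studied per student.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
passed = ["no", "no", "yes", "yes", "yes"]  # discrete label -> classification
score = [52.0, 58.0, 65.0, 70.0, 78.0]      # continuous value -> regression


def classify_1nn(x):
    """Classification: return the label of the nearest training point."""
    nearest = min(range(len(hours)), key=lambda i: abs(hours[i] - x))
    return passed[nearest]


def fit_line(xs, ys):
    """Regression: ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


slope, intercept = fit_line(hours, score)
label = classify_1nn(3.6)                    # a discrete label
predicted_score = slope * 3.6 + intercept    # a continuous value
```

Both models are supervised: they learn from examples where the answer (label or score) is already known.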
What kinds of questions do we answer in the data analysis step?
Are we able to answer our question?
Do we have the right form of data and answer?
Are we allowed to use this data?
Are there biases in the data or result?
Can we explain the results?
What kind of questions do we answer in the context step?
What world or context are we in?
Does this analysis generalize to other data?
What do we do in the context step of the data science lifecycle?
Assess whether inferences carry over from samples to populations
Understand likelihood of predictions with intervals
Analyze Correlation vs Causation
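One way to picture "sample to population" inference with intervals is a rough 95% interval for a mean. This is only a sketch: the sample values are made up, and the 1.96 multiplier assumes a normal approximation that is loose for small samples.

```python
import math
import statistics

# Hypothetical sample of commute times in minutes (made-up numbers).
sample = [31.0, 28.5, 35.2, 30.1, 29.8, 33.4, 27.9, 32.6, 30.7, 34.0]


def mean_interval(values, z=1.96):
    """Approximate interval for the population mean: mean +/- z * standard error."""
    m = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return m - z * standard_error, m + z * standard_error


low, high = mean_interval(sample)
```

The interval describes uncertainty about the population mean given this sample; it says nothing about whether one variable causes another.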
What are examples of correlation?
discovery of patterns
passive data collection
What are examples of causation?
randomized experiments as needed
active data - interactive data collection
What do we do in the decision making step of the data science lifecycle?
What is the data telling us?
Does it answer the right question?
Do we trust our conclusions and decisions?
Will this be valid tomorrow?
How to act on the analysis?
Where should we consider privacy?
Input : data collection/sharing
Analysis: how data handled, shared, algorithms
Output: how analysis is presented, how models are published
What are the types of information disclosure?
Attribute disclosure
Identity Disclosure
Membership Disclosure
What is Attribute Disclosure?
Disclosure of some information about a known person
e.g. healthcare data: diagnosis, test results, etc.
Disclosure is through linkage with other data
What is identity disclosure?
Identification of a person
De-anonymization
Re-identification
Reconstruction of dataset
What is membership disclosure?
Membership in a dataset revealed
Identity and attribute disclosure
What is anonymization?
A common tool used to claim privacy, especially in data sharing.
Personally Identifiable Information (PII) is masked or hidden.
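A minimal sketch of masking, assuming a hypothetical record layout where `name` and `phone` are the explicit identifiers. All values are invented.

```python
# Hypothetical records; every value here is made up.
records = [
    {"name": "Alice Example", "phone": "555-0101", "zip": "12345", "diagnosis": "flu"},
    {"name": "Bob Example", "phone": "555-0102", "zip": "12346", "diagnosis": "asthma"},
]

# Assumption: these are the columns treated as explicit identifiers.
EXPLICIT_IDENTIFIERS = {"name", "phone"}


def mask(record):
    """Return a copy of the record with explicit identifiers hidden."""
    return {
        key: ("***" if key in EXPLICIT_IDENTIFIERS else value)
        for key, value in record.items()
    }


masked = [mask(r) for r in records]
```

Note that `zip` and `diagnosis` survive the masking, which is why masking alone is often not enough to claim privacy.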
What is PII?
Personally Identifiable Information
Personal information includes any factual or subjective information, recorded or not, about an identifiable individual,
e.g. age, name, ID numbers, income, ethnic origin, or blood type.
What is a dataset?
database: the form in which data is presented.
a table, “matrix”.
rows correspond to data points (depending on the data, a row is about an individual if the data is about people, an event for sensor data, or a transaction for bank data).
columns correspond to features or attributes
What is the target?
feature/attribute we are trying to predict
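As a small illustration of the table view, here is a made-up dataset split into features and target. The column names and values are assumptions for the example only.

```python
# Columns are features/attributes; each row is one data point (here, a person).
columns = ["age", "income", "owns_home"]
rows = [
    [34, 52000, "yes"],
    [27, 39000, "no"],
    [45, 81000, "yes"],
]

target = "owns_home"            # the attribute we want to predict
t = columns.index(target)

# X holds the remaining features, y holds the target value for each row.
X = [[value for j, value in enumerate(row) if j != t] for row in rows]
y = [row[t] for row in rows]
```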
How can demographics uniquely identify individuals?
A single attribute: if the frequency of a particular value of an attribute is low.
More than one attribute: combinations of attributes can
combine to occur even less frequently
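The effect of combining attributes can be sketched by counting how many rows are unique under each set of attributes. The six people below are invented, and `unique_fraction` is a hypothetical helper.

```python
from collections import Counter

# Made-up demographic rows.
people = [
    {"zip": "02138", "birth_year": 1960, "sex": "F"},
    {"zip": "02138", "birth_year": 1960, "sex": "M"},
    {"zip": "02138", "birth_year": 1975, "sex": "F"},
    {"zip": "02139", "birth_year": 1975, "sex": "F"},
    {"zip": "02139", "birth_year": 1975, "sex": "M"},
    {"zip": "02139", "birth_year": 1960, "sex": "M"},
]


def unique_fraction(rows, attrs):
    """Fraction of rows whose combination of values on attrs occurs exactly once."""
    keys = [tuple(r[a] for a in attrs) for r in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


only_sex = unique_fraction(people, ["sex"])
zip_and_year = unique_fraction(people, ["zip", "birth_year"])
all_three = unique_fraction(people, ["zip", "birth_year", "sex"])
```

In this toy table no one is unique by sex alone, a third are unique by ZIP plus birth year, and everyone is unique once all three are combined.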
What did Sweeney’s experiment show?
Re-identification of individuals is possible both when only a single dataset is shared and when multiple datasets are shared, even if one of them is anonymized.
What is anonymous data according to Sweeney?
a record by itself cannot be linked to an individual
What is an explicit identifier according to Sweeney?
with no additional information the person can be directly found ({name, address} or {name, phone})
What is de-identified data according to Sweeney?
all explicit identifiers removed, generalized, or replaced (name, address, phone)
What is a quasi-identifier according to Sweeney?
a set of data elements that are not explicit identifiers but that, in combination, are associated uniquely or almost uniquely with an individual
How do linkage attacks work?
If an anonymized dataset is released publicly, through the notion of unique combinations and linking with another public dataset with identifiable information, we can re-identify individuals.
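A linkage attack can be sketched as a join on quasi-identifiers. Both tables below are invented, and the choice of `zip`, `birth_year`, and `sex` as the quasi-identifier is an assumption for the example.

```python
QUASI_IDENTIFIER = ("zip", "birth_year", "sex")

# "Anonymized" release: names removed, quasi-identifiers kept (made-up data).
medical = [
    {"zip": "02138", "birth_year": 1960, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "flu"},
    {"zip": "02139", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
]

# Public dataset that still carries names (made-up data).
voters = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1960, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1975, "sex": "M"},
    {"name": "Carl Example", "zip": "02139", "birth_year": 1975, "sex": "M"},
]


def link(anonymized, public):
    """Re-identify anonymized rows whose quasi-identifier matches exactly one public row."""
    names_by_key = {}
    for person in public:
        key = tuple(person[a] for a in QUASI_IDENTIFIER)
        names_by_key.setdefault(key, []).append(person["name"])
    matches = []
    for row in anonymized:
        names = names_by_key.get(tuple(row[a] for a in QUASI_IDENTIFIER), [])
        if len(names) == 1:  # a unique combination pins down one person
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(medical, voters)
```

Only the first medical row is re-identified here: the second matches two voters, and the third matches none.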
What is direct linking?
direct linking refers to the process of identifying an individual by correlating anonymized data with another dataset that contains identifying information.
relies on explicit attributes
What is linking through similarity?
Linking through similarity refers to the process of re-identifying individuals in an anonymized dataset by correlating patterns or attributes with another dataset that contains identifiable information.
relies on statistical similarity
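Linking through similarity can be sketched as picking the anonymized record that best matches what is publicly known about a person. The ratings data and the simple agreement score are assumptions for illustration; real attacks use more robust similarity measures.

```python
# Anonymized item ratings keyed by pseudonym (made-up data).
anonymized = {
    "user_17": {"A": 5, "B": 1, "C": 4, "D": 2},
    "user_42": {"A": 2, "B": 5, "C": 1},
    "user_58": {"B": 1, "C": 3, "E": 5},
}

# Ratings a known person has posted publicly (made-up data).
public_profile = {"A": 5, "C": 4, "D": 2}


def similarity(a, b):
    """Fraction of items rated by both on which the two ratings agree."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(1 for item in shared if a[item] == b[item]) / len(shared)


scores = {user: similarity(r, public_profile) for user, r in anonymized.items()}
best_match = max(scores, key=scores.get)
```

No explicit attribute is shared between the two sources; the link comes only from how closely the patterns agree.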
What is k-anonymity?
The information (identifier or quasi-identifier) contained for each individual in
the released dataset cannot be distinguished from at least k − 1 individuals
whose information is also in the released dataset.
Every combination of quasi-identifier values present in the released table must appear in at least k records.
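The definition can be checked by grouping rows on the quasi-identifier and taking the smallest group size. The generalized table and the helper name `k_of` are hypothetical.

```python
from collections import Counter

# Released table with generalized quasi-identifiers (made-up data).
released = [
    {"zip": "0213*", "age": "30-39", "diagnosis": "flu"},
    {"zip": "0213*", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "0214*", "age": "40-49", "diagnosis": "flu"},
    {"zip": "0214*", "age": "40-49", "diagnosis": "diabetes"},
]


def k_of(rows, quasi_identifier):
    """Largest k for which the table is k-anonymous: the smallest group size."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in rows)
    return min(counts.values())


k = k_of(released, ["zip", "age"])
is_2_anonymous = k >= 2
is_3_anonymous = k >= 3
```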
Why can’t k-anonymity be guaranteed?
Real-world datasets are very sparse
if we project the data into lower dimensions, we lose information
the resulting datasets provide low utility
independent releases can be linked to infer information
difficult to achieve (finding an optimal k-anonymization is NP-hard)
What are the goals of k-anonymity?
membership disclosure is protected
sensitive attribute protected
identity disclosure protected
What is a homogeneity attack?
k-anonymity can create groups that leak information due to lack of diversity in the sensitive attribute.
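A sketch of the attack from the attacker's side, assuming the attacker already knows the target's generalized ZIP and age range. The table is invented and 2-anonymous on (`zip`, `age`).

```python
# 2-anonymous release (made-up data).
released = [
    {"zip": "130**", "age": "20-29", "diagnosis": "heart disease"},
    {"zip": "130**", "age": "20-29", "diagnosis": "heart disease"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "cancer"},
]


def infer_diagnosis(table, zip_code, age):
    """Return the diagnosis if everyone in the target's group shares it, else None."""
    candidates = {
        r["diagnosis"] for r in table if r["zip"] == zip_code and r["age"] == age
    }
    return candidates.pop() if len(candidates) == 1 else None


leaked = infer_diagnosis(released, "130**", "20-29")
not_leaked = infer_diagnosis(released, "148**", "30-39")
```

The attacker never learns which row belongs to the target, yet still learns the sensitive value for the first group.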
What is a background knowledge attack?
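An attacker uses outside knowledge about a person to rule out sensitive values within that person's group; k-anonymity does not protect against attacks based on background knowledge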
What is l-diversity?
Let a q∗-block be a set of tuples such that its non-sensitive values generalize to q∗.
A q∗-block is ℓ-diverse if it contains ℓ “well represented” values for the sensitive attribute S. A table is ℓ-diverse, if every q∗-block in it is ℓ-diverse.
An equivalence class is said to have ℓ-diversity if there are at least ℓ well-represented values for the sensitive attribute. A table is said to have ℓ-diversity if every equivalence class of the table has ℓ-diversity
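A minimal check, assuming the simplest reading of "well represented": at least ℓ distinct sensitive values per equivalence class. Other readings (for example entropy-based) exist and are not shown. The table and the helper name `distinct_l` are hypothetical.

```python
from collections import defaultdict

# Released table (made-up data) with two equivalence classes.
released = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "cancer"},
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "cancer"},
    {"zip": "148**", "age": "30-39", "diagnosis": "asthma"},
]


def distinct_l(table, quasi_identifier, sensitive):
    """Largest l for distinct l-diversity: the fewest distinct sensitive values in any class."""
    values_by_class = defaultdict(set)
    for row in table:
        key = tuple(row[a] for a in quasi_identifier)
        values_by_class[key].add(row[sensitive])
    return min(len(values) for values in values_by_class.values())


l_value = distinct_l(released, ["zip", "age"], "diagnosis")
```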
What are the limitations of l-diversity?
May be difficult and unnecessary to achieve
insufficient to prevent attribute disclosure, as shown with the following two potential attacks: similarity attack and skewness attack
What is t-closeness?
An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.
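A sketch of the check, assuming total variation distance as the distance between distributions. This is a simplifying assumption: the definition above leaves the distance open, and other choices such as Earth Mover's Distance are common. The table is invented.

```python
from collections import Counter, defaultdict

# Released table (made-up data) with two equivalence classes.
released = [
    {"group": "A", "diagnosis": "flu"},
    {"group": "A", "diagnosis": "flu"},
    {"group": "A", "diagnosis": "flu"},
    {"group": "B", "diagnosis": "flu"},
    {"group": "B", "diagnosis": "cancer"},
    {"group": "B", "diagnosis": "cancer"},
]


def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}


def total_variation(p, q):
    """Half the sum of absolute differences between two distributions."""
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))


def max_distance(table, class_key, sensitive):
    """Largest distance between any class's distribution and the whole table's."""
    overall = distribution([r[sensitive] for r in table])
    by_class = defaultdict(list)
    for r in table:
        by_class[r[class_key]].append(r[sensitive])
    return max(total_variation(distribution(v), overall) for v in by_class.values())


worst = max_distance(released, "group", "diagnosis")
satisfies_t_04 = worst <= 0.4
satisfies_t_02 = worst <= 0.2
```

Under this distance the table has t-closeness for any threshold t at or above the worst class distance, and fails for smaller t.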