Big Data Analytics Management Flashcards

1
Q

What is the definition of Data Science?

A

Data science is a set of fundamental principles that guide the extraction of knowledge from data.

2
Q

What is the definition of Data Mining?

A

Data mining is the extraction of knowledge from data, via technologies that incorporate these principles.

3
Q

What is the definition of Data-Driven Decision-Making?

A

Data-Driven Decision-Making refers to the practice of basing decisions on the analysis of data, rather than purely intuition.

4
Q

Tasks in data mining:

A
  1. Classification and class probability estimation
  2. Regression (“Value estimation”)
  3. Similarity matching
  4. Clustering
  5. Co-occurrence grouping (market-basket analysis)
  6. Profiling
  7. Link prediction
  8. Data reduction
  9. Causal modeling
5
Q

Describe classification and class probability estimation task

A

It attempts to predict, for each individual in a population, which of a set of classes this individual belongs to.

  • Classification would give definitive output: will respond, will not respond.
  • Class probability estimation would give output with probability that the individual belongs to that class.
6
Q

Describe regression task

A

Regression attempts to predict, for each individual, the numerical value of some variable for that individual. Example: “How much will a given customer use a service?”

7
Q

Regression vs. Classification?

A

Classification predicts WHETHER something will happen, whereas regression predicts HOW MUCH something will happen.

8
Q

Describe similarity matching task

A

Similarity matching attempts to IDENTIFY SIMILAR individuals based on data known about them. Example: finding companies that are similar to the ones you are already serving.

9
Q

Describe clustering task

A

Clustering attempts to GROUP individuals in a population together by their similarity, but not driven by any specific purpose. Example: “Do our customers form natural groups or segments?”

10
Q

Describe co-occurrence grouping task

A

It attempts to find ASSOCIATIONS between entities based on transactions involving them. Example: “What items are commonly purchased together?”

11
Q

Clustering vs. co-occurrence?

A

While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.

12
Q

Describe profiling task

A

Profiling attempts to characterize the typical behavior of an individual, group, or population. Example: “What is the typical cell phone usage of this customer segment?”

13
Q

Describe link prediction task

A

Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Example: “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?”

14
Q

Describe data reduction task

A

Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences).

15
Q

Describe causal modeling task

A

Causal modeling attempts to help us understand what events actually influence others. Example: “Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway?” A business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.

16
Q

Conditions for supervised learning:

A
  1. It has to have a specific target;
  2. There must be data on the target.
17
Q

Define label

A

The value for the target variable for an individual.

18
Q

Supervised vs. unsupervised tasks

A

Supervised:

  • Classification;
  • Regression;
  • Causal modeling.

Unsupervised:

  • Clustering;
  • Co-occurrence grouping;
  • Profiling.

Both:

  • Matching;
  • Link prediction;
  • Data reduction.
19
Q

Second stage of CRISP process - Data Understanding

A
  • The critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited.
  • We need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.
20
Q

Third stage of CRISP process - Data Preparation

A

Data preparation phase often proceeds along with data understanding, in which the data are manipulated and converted into forms that yield better results.

21
Q

Define data leak

A

A data leak is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.

22
Q

Fifth stage of CRISP process - Evaluation

A

The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on.

23
Q

Sixth stage of CRISP process - Deployment

A

In deployment the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.

24
Q

Data Mining vs. Software Development

A

Data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem.

25
Q

What are informative attributes?

A

An attribute is informative when it reduces uncertainty about something we care about (such as the target variable); information is a quantity that reduces uncertainty about something.

26
Q

Define predictive model

A

Predictive model is a formula for estimating the unknown value of interest: the target.

27
Q

Define supervised learning

A

Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.

29
Q

Creation of models from data is known as

A

Model induction. Induction is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths).

30
Q

Define training data

A

The input data for the induction algorithm, used for inducing the model. They are also called labeled data because the value for the target variable (the label) is known.

31
Q

Define Big Data

A

Data that does not fit on a single computer and needs multiple computers to process it

32
Q

Define Artificial Intelligence

A

Entity that can mimic human behavior

33
Q

Define data engineering

A

Methods to handle and prepare data

34
Q

What characterizes Big Data?

A

VOLUME

  • Data at Rest
  • Terabytes to exabytes of existing data to process

VELOCITY

  • Data in Motion
  • Streaming data, milliseconds to seconds to respond

VARIETY

  • Data in Many Forms
  • Structured, unstructured, text, multimedia

VERACITY

  • Data in Doubt
  • Uncertainty due to data inconsistency, incompleteness, ambiguities, latency, deception, and model approximations.
35
Q

What type of data has been increasing rapidly?

A

Unstructured data - text, video, audio

36
Q

What is the issue with the growth of unstructured data?

A

Traditional systems have been designed for transactions, not unstructured data.

37
Q

How did Google solve the problem of unstructured data?

A
  • Google goes through every page, scans its contents, and keeps keywords ready, which creates an index (and they update it every day)
  • Traditional architecture is not enough to process this data
  • Solution: cluster architecture. However, every day 900 machines die and need to be replaced
38
Q

Google challenges: How to distribute computation across multiple machines in a resource-efficient way? How to ensure that computations that are running are not lost when machines die? How to ensure that data is not lost when machines die?

A

Solution: the Google File System - redundant storage of massive amounts of data on cheap and unreliable computers - and the MapReduce distributed computing paradigm.

39
Q

What is Hadoop?

A

Open source software framework for distributed storage and distributed processing that replicated Google’s MapReduce model.

40
Q

Define MapReduce

A

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

The MapReduce model consists of two main stages:

  1. Map: input data is split into discrete chunks to be processed
  2. Reduce: output of the map phase is aggregated to produce the desired result

The simple nature of the programming model lends itself to efficient and
large-scale implementations across thousands of cheap nodes (computers).
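
To make the two stages concrete, here is a minimal pure-Python sketch of the classic word-count example, simulating the map, shuffle, and reduce steps in a single process (Hadoop would distribute these steps across the nodes of a cluster); the toy documents are made up.

```python
from collections import defaultdict

documents = ["big data needs big clusters", "data lives on clusters"]

# Map: split the input into chunks and emit (word, 1) pairs
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the values for each key to produce the desired result
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 2, ...}
```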

41
Q

Data Mining Process

A
  1. Data Engineering and Processing (big data technologies - Google File System, Hadoop)
  2. Data Science -> automated DDD
42
Q

Types of Analytics:

A

Increasing in value and difficulty:

  1. Descriptive - what happened? [Reports]
  2. Diagnostic - why did it happen? [Queries, statistical analysis]
  3. Predictive - what will happen? [Forecasts, machine learning]
  4. Diagnostic + Predictive = Prescriptive - how can we make it happen? [Optimization, planning]
43
Q

Predictive analytics vs. Diagnostic analytics

A

Predictive analytics does not answer the question WHY; it just predicts. Diagnostic analytics tries to answer WHY something happened.

44
Q

Why is diagnostic analytics harder than predictive analytics?

A

Determining correlation is easier than establishing causation (which requires a randomized controlled experiment).

45
Q

How much human input is needed in each type of analytics?

A
  • Descriptive/diagnostic analytics provide insight into the data, so that one can better understand what data to collect and store and provide insight into ways to improve future models.
  • Predictive analytics is building a model to predict when something will happen.
  • Prescriptive analytics automates action to be taken based on prediction.
46
Q

Tasks of Descriptive/Diagnostic analytics:

A

Data visualization; Clustering; Co-occurrence grouping

47
Q

Tasks of Prescriptive analytics:

A
  • Uplift modeling - predict how individuals behave contingent on the action performed upon them
  • Automation - determine optimal action based on predicted reaction of individuals
48
Q

What does it mean for groups to be pure? (in terms of selecting meaningful attributes)

A

Homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure.

49
Q

Complications while putting people into segments:

A
  • Attributes rarely split a group perfectly.
  • Not all attributes are binary.
  • Some attributes take on numeric values (continuous or integer)
50
Q

Define entropy

A

Entropy is a measure of disorder that can be applied to a set, such as one of our individual segments. Consider that we have a set of properties of members of the set, and each member has one and only one of the properties. In supervised segmentation, the member properties will correspond to the values of the target variable. Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest. So, for example, a mixed up segment with lots of write-offs and lots of non-write-offs would have high entropy.

51
Q

What is natural measure of impurity for numeric values?

A

Variance

52
Q

Overview of technologies on a typical Business Intelligence (BI) stack

A
53
Q

How did Netflix create competitive advantage with data?

A

They had data on what people liked and they personalized/customized/adapted their movies to customers’ preferences

54
Q

How can SafeSize leverage their data to move to B2C space?

A
  • They stored customers’ data as a ‘shoe profile’;
  • They sold their service to e-commerce shoe stores as a widget.
55
Q

3 ways to collect data from IMDB:

A
  • Manually
  • Manually download a file (which someone else created)
  • Automatically, by pretending you are a human browsing the web site (web scraping) or by using an API
56
Q

Web scraping can be done using:

A
  • A modern programming language which offers complete flexibility but requires more effort to implement;
  • Specialized tools which allow faster implementation but provide less flexibility and make it harder to replicate data collection.
57
Q

Web scraping steps using a programming language:

A
  1. Request a web page
  2. Parse the HTML
  3. Filter and transform data to desired format
  4. Save data.
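
A minimal sketch of these four steps in Python, assuming the requests and BeautifulSoup (bs4) libraries are installed; the URL and the h3.title selector are hypothetical placeholders, not taken from the course.

```python
import csv
import requests
from bs4 import BeautifulSoup

# 1. Request a web page (hypothetical URL)
response = requests.get("https://example.com/movies")

# 2. Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# 3. Filter and transform data to the desired format (here: a list of titles)
titles = [tag.get_text(strip=True) for tag in soup.find_all("h3", class_="title")]

# 4. Save data
with open("titles.csv", "w", newline="") as f:
    csv.writer(f).writerows([[title] for title in titles])
```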
58
Q

Web scraping using a dedicated tool:

A

import.io is a simple tool that tries to infer what is interesting on a website; webscraper.io gives you more flexibility.

59
Q

Webscraper.io steps

A
  1. Define a starting page
  2. Define category links
  3. For each individual category or product page determine which information to collect, determine which links to follow.
60
Q

Why is web scraping not ideal?

A
  • Many sites do not allow gathering information automatically.
    • A site detects whether you are human based on the frequency of requests, cookies, the Robots Exclusion Protocol (stored in robots.txt), and other trackers.
  • Not all information is public (you can access protected information using authentication and an API)
61
Q

What is the purpose of Robots Exclusion Protocol?

A

It tells everyone who is allowed to crawl their page.

62
Q

Why do people not want their website to be crawled?

A

Data can be costly to acquire, so companies don’t want their data to be easily copied by others.

63
Q

Three steps for API access:

A

It is an official way of accessing information automatically.

  1. Get an API key
  2. Query an API endpoint using the API key - an API usually provides multiple endpoints or functions (most recent movies, most popular…)
  3. Process the response
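
A minimal sketch of the three steps using Python's requests library; the endpoint, parameter names, and response structure are hypothetical and depend on the actual API provider.

```python
import requests

# 1. Get an API key from the provider (placeholder here)
API_KEY = "YOUR_API_KEY"

# 2. Query an API endpoint using the API key (hypothetical endpoint)
response = requests.get(
    "https://api.example.com/movies/popular",
    params={"api_key": API_KEY, "page": 1},
)

# 3. Process the response (many APIs return JSON)
data = response.json()
for movie in data.get("results", []):
    print(movie.get("title"))
```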
64
Q

Twitter provides two types of API:

A
  • Representational State Transfer (REST) APIs: Used for singular queries for one term
  • Streaming API: Continuously get the tweets
65
Q

Data structure (volume/velocity)

A
  • Cross-sectional - data that (almost) never changes;
    • e.g. city names, birth date
  • Transactional - one observation represents one transaction;
    • e.g. a website visit
  • Panel - one observation represents one individual during a time period
    • e.g. monthly bill.
66
Q

Data structure: Tidy data

A

Tidy data is on a single table according to the following rules:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
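
A small hypothetical pandas example: a “messy” table with one column per year is reshaped so that each variable (country, year, cases) gets its own column and each observation its own row; the figures are made up.

```python
import pandas as pd

# Messy: the years 1999 and 2000 are column headers, i.e. values stored as variables
messy = pd.DataFrame({
    "country": ["A", "B"],
    "1999": [745, 37737],
    "2000": [2666, 80488],
})

# Tidy: one column per variable, one row per (country, year) observation
tidy = messy.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
```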
67
Q

Types of data (Variety)

A
  • Structured
  • Unstructured
68
Q

Structured data (Variety)

A

QUALITATIVE/CATEGORICAL DATA

  • Nominal;
  • Ordinal (satisfaction level);

QUANTITATIVE DATA

  • Discrete (countable number);
  • Continuous (interval value).
69
Q

Types of unstructured data (Variety):

A
  • Text-based documents (tweets, webpages)
  • Images, videos, audio
70
Q

What are the methods to transform unstructured data into structured data?

A
  • Topic Modeling (text);
  • Sentiment Analysis (text);
  • Feature extraction (image/video/sound).
71
Q

How can data quality be affected? (Veracity)

A
  • Missing data
  • Measurement error
72
Q

Two ways of missing data:

A
  • Missing observations
  • Missing values in some observations
73
Q

Why is it important to know why data is missing?

A

You need to know the reason why it is missing because it will inform whether it is a problem or not. Is data missing at random or not?

74
Q

What if data are missing at random?

A

It is fine. If data are missing at random, the remaining observations are still a representative sample of the population. Solution: listwise deletion i.e. delete all observations that do not have values for all variables in the analysis.

75
Q

What if data are missing not at random?

A

It is a problem! The remaining observations are not a representative sample of population.

76
Q

Selection bias (Veracity)

A
  • Selection bias occurs when the sampling procedure is not random and thus the sample is not representative of the population.
  • Self-selection - some members of the population are more likely to be included in the sample because of their characteristics.
77
Q

Selection bias (Veracity)

A

Selection bias occurs when the sampling procedure is not random and thus the sample is not representative of the population.

78
Q

Two types of selection bias (Veracity)

A
  • Self-selection - some members of the population are more likely to be included in the sample because of their characteristics.
  • Attrition - some observations may be less likely to be present in the sample due to time constraints
79
Q

Define measurement error (Veracity)

A

Measurement error occurs when the data is collected with errors that are not random.

80
Q

Types of measurement error

A
  • Recall bias - respondents recall some events more vividly than others (child deaths by gun vs swimming pools);
  • Sensitive questions - respondents may not report data accurately (wages, health conditions);
  • Faulty equipment - equipment that exhibits systematic measurement error.
81
Q

What does disorder in terms of entropy represent?

A

Disorder corresponds to how mixed (impure) the segment is with respect to properties of interest.

82
Q

Entropy equation:

A

entropy = - p1 log2(p1) - p2 log2(p2) - …

Each pi is the probability of property i within the set, ranging from pi = 1 when all members of the set have property i, to pi = 0 when no members of the set have property i.
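
A minimal Python version of this formula, using log base 2 so that a pure segment has entropy 0 and a 50/50 two-class segment has entropy 1:

```python
import math

def entropy(proportions):
    """Entropy of a set given the proportion p_i of each property (the p_i sum to 1)."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))        # 0.0 -> pure segment, minimum disorder
print(entropy([0.5, 0.5]))   # 1.0 -> maximal disorder for two properties
print(entropy([0.7, 0.3]))   # about 0.88 -> mixed, e.g. write-offs vs. non-write-offs
```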

83
Q

Define information gain (IG)

A

It measures how much an attribute improves entropy over the whole segmentation it creates.

84
Q

How to address the issue in classification when the probability may be overly optimistic with small samples?

A

We use Laplace correction. Its purpose is to moderate the influence of leaves with only a few instances.
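
A small sketch of one common form of the correction for a binary leaf, where n is the number of leaf instances of the class of interest and m the number of instances of the other class (the (n + 1) / (n + m + 2) formula):

```python
def laplace_estimate(n, m):
    """Laplace-corrected class probability for a leaf: (n + 1) / (n + m + 2)."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw estimate 2/2 = 1.0 is overly optimistic
print(laplace_estimate(2, 0))    # 0.75, moderated toward 0.5

# With more evidence the correction has less influence
print(laplace_estimate(20, 0))   # about 0.95
```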

85
Q

How do we identify informative attributes?

A

We measure the attribute on the basis of information gain, which is based on a purity measure called entropy, another is variance reduction (for numeric target).

86
Q

How do we segment data by progressive attribute selection?

A

We use tree induction technique.

87
Q

What is tree induction technique?

A

Tree induction recursively finds informative attributes for subsets of the data. In so doing it segments the space of instances into similar regions. The partitioning is “supervised” in that it tries to find segments that give increasingly precise information about the quantity to be predicted, the target. The resulting tree-structured model partitions the space of all possible instances into a set of segments with different predicted values for the target.

88
Q

Define parametric learning

A

The data miner specifies the form of the model and the attributes; the goal of the data mining is to tune the parameters so that the model fits the data as well as possible.

89
Q

SVM error function is known as

A

Hinge loss. The penalty for a misclassified point is proportional to the distance from the decision boundary, so if possible the SVM will make only “small” errors.

90
Q

Learning objective session 3: Understand the stages of a predictive modeling process

A
  1. Define target
  2. Collect data
  3. Build a model (set of rules or a mathematical formula)
  4. Predict outcomes
91
Q

Learning objective session 3: Understand the main concepts and principles of predictive modeling, including the concepts of target variable, supervised segmentation, entropy and information gain.

A
  • Target variable (label) - the value you’re trying to predict;
  • Supervised segmentation is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.
  • Entropy is a measure of disorder (surprise). It tells how impure the segment is with regards to the properties of interest.
  • Information gain measures how much an attribute decreases entropy over the whole segmentation it creates.
92
Q

Learning objective session 3: Know the basic metrics used to evaluate a predictive model and related concepts, including confusion matrix, accuracy and error rate.

A
  • Accuracy is the proportion of correct decisions made by the classifier.
  • Error rate is the proportion of wrong decisions made by the classifier.
  • Confusion matrix is a table that is often used to describe the performance of a classification model (TP, TN, FP, FN) on a set of test data for which the true values are known.
93
Q

Unsupervised methods (there is no specific target variable)

A
  • Affinity grouping - associations, market-basket analysis (Which items are commonly purchased together?)
  • Similarity matching (Which other companies are similar to ours?)
  • Clustering (Do my customers form natural groups?)
  • Sentiment analysis (What is the sentiment of my users?)
94
Q

Supervised methods (there is a specific target variable)

A
  • Predictive modeling
    • Will a specific customer default? Which accounts will be defrauded?
  • Causal modeling
    • How much would client X spend if I gave her a discount?
95
Q

Predictive vs. diagnostic

A

They pursue different goals:

Predictive modeling is the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations.

Example: How much will client X spend?

Explanatory modeling is the use of statistical models for explaining how the world works (by testing causal explanations).

Example: How much would a discount change client X’s spending?

96
Q

Why empirical explanation and empirical prediction differ?

A
  • Explanatory models are based on underlying causal relationships between theoretical constructs while
  • predictive models rely on associations between measurable variables.
  • Explanatory modeling seeks to minimize model bias (i.e. specification error) to obtain the most accurate representation of the underlying theoretical model,
  • predictive modeling seeks to minimize the combination of model bias and sampling variance (how much does the model change with new data).
97
Q

Define predictive modeling

A

It is a method for estimating an unknown value of interest, which is called target.

98
Q

Process of predictive modeling

A
  1. Define (quantifiable) target
  2. Collect data - data on same or related phenomenon
  3. Build a model - a set of rules or a mathematical formula that allow establishing a prediction.
  4. Predict outcomes - the model can be applied to any customer. It gives us a prediction of the target variable.
99
Q

Two types of predictive modeling

A

REGRESSION

Attempts to estimate or predict the numerical value of some variable for an individual.

Mathematical formula:

  • Linear regression
  • Logistic regression

Rule-based formula:

  • Regression trees

CLASSIFICATION

Attempts to predict which of a (small) set of classes an individual belongs to. Mathematical formula:

  • Logistic regression
  • Support Vector Machines

Rule-based formula

  • Classification trees
100
Q

Define linear regression

A

Linear regression is an approach for modeling the relationship between a dependent variable and one or more explanatory variables.

The estimators B0, B1, B2 are obtained by minimizing the sum of squared errors.

It is used when you are trying to predict a numerical variable.
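
A minimal scikit-learn sketch on made-up data (hypothetical “effort” hours predicting a numerical grade), estimating B0 and B1 by minimizing the sum of squared errors:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

effort = np.array([[2], [4], [6], [8], [10]])   # explanatory variable
grade = np.array([52, 61, 70, 78, 90])          # numerical target

model = LinearRegression().fit(effort, grade)
print(model.intercept_, model.coef_)  # estimated B0 and B1
print(model.predict([[7]]))           # predicted grade for 7 hours of effort
print(model.score(effort, grade))     # R squared on the training data
```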

101
Q

Define logistic regression

A

If the dependent variable takes values between 0 and 1, we can use logistic regression to model its relationship with one or more explanatory variables.

F () is a function with values between 0 and 1.

P(Pass) = f(b0 + b1 x effort)
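
A minimal scikit-learn sketch of the hypothetical pass/effort example; the fitted probability equals the logistic (sigmoid) function applied to b0 + b1 x effort:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

effort = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])  # hours of effort
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])                  # 1 = passed, 0 = failed

model = LogisticRegression().fit(effort, passed)

# P(Pass) as estimated by the model for 5 hours of effort
print(model.predict_proba([[5]])[0, 1])

# The same probability computed explicitly: f(b0 + b1 * effort), with f the sigmoid
z = model.intercept_[0] + model.coef_[0, 0] * 5
print(1 / (1 + np.exp(-z)))
```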

102
Q

What does R squared mean in linear regression?

A

It indicates how much of the total variation in the dependent variable is explained by the model (in the example, by effort). The bigger it is, the better, because a larger percentage of the variation is explained by the model.

103
Q

Steps of classification

A

  1. Define target - will prospect X buy life insurance?
  2. Collect data - gather list of prospects with demographic information
  3. Build a model - logistic regression, classification trees
  4. Predict outcomes
104
Q

When can logistic regression be used for classification?

A

Logistic regression can be used for classification when:

  • Target variable is binary.
  • The outcome of a model can be interpreted as probability.
105
Q

When should you stop segmentation?

A

Stop segmentation when at least one of the conditions is met:

  • All elements of a segment belong to the same class
  • The maximum allowed tree depth is reached
  • Using more attributes does not “help”
106
Q

How to choose at each step which of the attributes to use to segment the population?

A

Resulting groups have to be as pure as possible - homogeneous w.r.t. the target variable.

107
Q

Define entropy

A

How much information is necessary to represent an event with X possible outcomes?

log2(X); with p = 1/X per outcome, this is log2(1/p).

Entropy measures the general disorder of a set - how unpredictable it is.

108
Q

What is information gain?

A
  • Information gain (IG) measures the change in entropy due to any amount of new information being added.
  • Information gain measures how much an attribute decreases entropy over the whole segmentation it creates.

IG = entropy(parent) - [p(c1) * entropy(c1) + p(c2) * entropy(c2) + …]
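
A minimal, self-contained Python sketch of this formula for a two-way split; the parent segment with write-offs ("w") and non-write-offs ("n") is made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = entropy(parent) - sum of p(child) * entropy(child) over the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["w"] * 7 + ["n"] * 5                      # mixed parent segment
children = [["w"] * 6 + ["n"], ["w"] + ["n"] * 4]   # split by some attribute
print(information_gain(parent, children))           # about 0.33: the split purifies the groups
```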

109
Q

How to evaluate a model?

A

Accuracy = number of correct decisions made/total number of decisions.

Error rate = 1 - accuracy

110
Q

Confusion matrix

A
  • True Positives (TP) - actual positives correctly predicted as positive.
  • True Negatives (TN) - actual negatives correctly predicted as negative.
  • False Positives (FP) - negatives incorrectly predicted as positive.
  • False Negatives (FN) - positives incorrectly predicted as negative.
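
A small sketch with scikit-learn's metrics on made-up true and predicted labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# confusion_matrix returns counts with actual classes as rows and predicted classes as columns
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                       # 3 TP, 3 TN, 1 FP, 1 FN

accuracy = accuracy_score(y_true, y_pred)   # (TP + TN) / total
print(accuracy, 1 - accuracy)               # accuracy 0.75, error rate 0.25
```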
111
Q

Learning objective session 4: Understand the concepts of generalization and overfit

A
  • Generalization is the property of a model whereby model applies to data that were not used to build the model.
  • Overfit is the tendency to tailor models to the training data, at the expense of generalization to previously unseen data points.
112
Q

Learning objective session 4: Understand the concepts of holdout data and cross-validation

A
  • Holdout data (or test set) is the data that was not used to teach the model - it was set aside so the created model could be evaluated
  • Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing. It computes the average and standard deviation from k folds.
113
Q

Learning objective: Be able to interpret the performance of a model by looking at different measures, such as fitting curves, learning curves, and ROC curves.

A
114
Q

Learning objective session 4: Be able to evaluate a model using the Expected Value framework.

A

The expected value framework is an analytical tool that is extremely helpful in organizing thinking about data-analytic problems.

Combines:

  • Structure of the problem
  • Elements of the analysis that can be extracted from the data
  • Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)

The benefit/cost matrix summarizes the benefits and costs of each potential outcome, always comparing with a base scenario. It does not really matter which base scenario we choose, as long as all comparisons are with the same scenario.
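
A minimal sketch of the framework for a hypothetical targeted offer: the model supplies the probability of response, while the benefit and cost of each outcome (relative to the base scenario of not targeting) come from business knowledge; all numbers are made up.

```python
# Benefit/cost of each outcome, compared with the base scenario of not targeting
benefit_if_response = 99.0    # profit if the targeted customer responds
cost_if_no_response = -1.0    # cost of the offer if the customer does not respond

def expected_value(p_response):
    """Expected value of targeting a customer with the given response probability."""
    return p_response * benefit_if_response + (1 - p_response) * cost_if_no_response

# Target a customer only when the expected value of doing so is positive
for p in [0.005, 0.01, 0.05]:
    print(p, expected_value(p), expected_value(p) > 0)   # break-even at p = 0.01
```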

115
Q

What is the problem with ‘table model’?

A

Does not predict the future but just fits the data perfectly, as it memorizes the training data and performs no generalization.

116
Q

Define overfitting

A

It is the tendency to tailor model to the training data, at the expense of generalization to previously unseen data points.

117
Q

Trade-off between overfitting and generalization

A
  • If we allow ourselves enough flexibility in searching, we will find patterns
    • Unfortunately, these patterns may be just chance occurrences in the data
  • We are interested in patterns that generalize, i.e., that predict well for instances that we have not yet observed
118
Q

What if by accident the test set is particularly easy/hard?

A

Solution: cross-validation.

  • Cross validation is a more sophisticated training and testing procedure.
  • Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing.
  • Simple estimate of the generalization performance, but also some statistics on the estimated performance (mean, variance, . . . )
119
Q

What if standard deviation is large, while the average is good?

A

You need to assess what is better - good average or small standard deviation.

120
Q

How to find which variables are the most important for the model?

A
  • (Weighted) Sum of information gain in each split a variable is used (tree-based models)
  • Difference in model performance with and without using that variable (all models)
121
Q

Characteristics of decision trees

A
  • Trees create a segmentation of the data
  • Each node in the tree contains a test of an attribute
  • Each path eventually terminates at a leaf
  • Each leaf corresponds to a segment, and the attributes and values along the path give the characteristics.
  • Each leaf contains a value for the target variable
122
Q

The Data Mining Process: building and using a predictive model

A
123
Q

Features of entropy

A
  • P1,P2, …, Pn are the proportions of classes 1,2, …, n in the data
  • Disorder corresponds to how mixed (impure) a segment is
  • Entropy is zero at minimum disorder (all members belong to the same class)
  • Entropy is one at maximal disorder (members equally distributed among classes)
124
Q

Confusion matrix formulas

A
125
Q

Components of scaling with a traditional database

A
  1. Scaling with a queue - you create a queue for requests so that frequent requests don’t crash the system.
  2. Scaling by sharding the database - you split the write load across multiple machines - horizontal partitioning/sharding.
    1. It introduces fault and corruption issues
126
Q

Desired (8) properties of a big data system

A
  • Robustness and fault tolerance
  • Low latency reads and updates
  • Scalability
  • Generalization
  • Extensibility
  • Ad hoc queries
  • Minimal maintenance
  • Debuggability
127
Q

Problems with fully incremental architectures

A
  • Operational complexity
    • Compaction is an intensive operation - a lot of coordination. Many things could go wrong.
  • Extreme complexity of achieving eventual consistency
    • Consistency and availability don’t go together
  • Lack of human-fault tolerance: an incremental system is constantly modifying the state it keeps in the database, which means a mistake can also modify the state in the database.
128
Q

Expected value framework

A

The expected value framework is an analytical tool that is extremely helpful in organizing thinking about data-analytic problems.
Combines:

  • Structure of the problem
  • Elements of the analysis that can be extracted from the data
  • Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)
129
Q

Define generalization

A

Generalization is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.

130
Q

Define overfitting

A

Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.

131
Q

What is holdout data?

A

Holdout data is data used for validating a model and not used for training a model.

Performance is evaluated based on accuracy in the test data -> holdout accuracy.

Holdout accuracy is an estimate of generalization accuracy.

132
Q

Tree induction commonly uses two techniques to avoid overfitting:

A
  1. Stop growing the tree before it gets too complex
  2. Grow the tree until it is too large, then ‘prune’ it back, reducing its size (and complexity).
133
Q

How is similarity analysis used in business?

A
  • Ad retrieval
  • Customer classification
  • Customer clustering
  • Competitor analysis
134
Q

Similarity as distance between neighbors

A

You measure the distance between the attributes. Distance = Pythagoras (Euclidean distance).
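
A minimal NumPy sketch of that idea: the Euclidean (“Pythagorean”) distance between two individuals described by made-up numeric attributes (in practice the attributes would usually be normalized first).

```python
import numpy as np

# Two customers described by (age, income in thousands, years as customer)
customer_a = np.array([35.0, 54.0, 3.0])
customer_b = np.array([42.0, 60.0, 7.0])

# Euclidean distance: square root of the sum of squared attribute differences
distance = np.sqrt(np.sum((customer_a - customer_b) ** 2))
print(distance)  # equivalently: np.linalg.norm(customer_a - customer_b)
```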

135
Q

Define lift

A

The lift of the co-occurrence of A and B is the probability that we actually see the two together, compared to the probability that we would see the two together if they were unrelated to (independent of) each other.

how much more frequently does this association occur than we would expect by chance?

136
Q

Define leverage

A

Leverage measures how much more likely than chance a discovered association is, as a difference rather than a ratio: instead of dividing the probability of seeing the two items together by the probability expected if they were independent (as lift does), leverage takes the difference between these two quantities.
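
A minimal sketch computing both measures from hypothetical market-basket counts:

```python
n_transactions = 1000
n_a, n_b, n_ab = 200, 150, 60   # transactions containing A, B, and both

p_a = n_a / n_transactions
p_b = n_b / n_transactions
p_ab = n_ab / n_transactions

lift = p_ab / (p_a * p_b)       # ratio of observed co-occurrence to that expected under independence
leverage = p_ab - p_a * p_b     # difference between the same two quantities

print(lift)      # 2.0 -> A and B appear together twice as often as chance would suggest
print(leverage)  # 0.03
```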

137
Q

Learning objective session 5: What are the challenges of creating big data applications?

A
  1. Scaling
  2. Complexity
  3. Fault-tolerance
  4. Data-corruption
138
Q

Learning objective session 5: What are the motivations for the lambda architecture?

A
  1. Robustness and fault tolerance
  2. Low latency
  3. Minimal maintenance
  4. Ad hoc queries
139
Q

Learning objective session 5: What are the best practices in designing big data applications? How to store data? How to guarantee consistency and resilience?

A
140
Q

Learning objective session 5: What are the main features of Hadoop HDFS?

A
  • Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers
    • They scale by adding more machines to the cluster
    • Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible
  • The operations you can do with a distributed filesystem are often more limited than you can do with a regular filesystem
  • For instance, you may not be able to write to the middle of a file or even modify a file at all after creation

How do distributed file systems work?

  • All files are broken into blocks (usually 64 to 256 MB)
  • These blocks are replicated (typically 3 copies) among the HDFS servers (datanodes)
  • The namenode provides a lookup service for clients accessing the data and ensures the nodes are correctly replicated across the cluster
141
Q

Learning objective session 5: What are the main features of MapReduce?

A

The MapReduce model consists of two main stages:

  • Map: input data is split into discrete chunks to be processed
  • Reduce: output of the map phase is aggregated to produce the desired result

The simple nature of the programming model lends itself to efficient and large-scale implementations across thousands of cheap nodes (computers).

142
Q

Learning objective session 5: be familiar with specific big data tools and be able to position them in the lambda architecture

A

?

143
Q

The Life Cycle of a Competitive industry: how does the number of firms change over the stages of the product lifecycle?

A
144
Q

What are the problems with client /server data architecture model?

A
  • The analytics application is struggling to keep up with the traffic - too many requests for the database
  • You start sharding the database; however, it is messy, takes time, and is prone to errors
  • Fault tolerance decreases, as a failure of any one of the database machines takes part of the system down
  • Data corruption issues: there is no place to store unchangeable data, so a mistake can corrupt the original data.
145
Q

What are the (2) desired properties of a Big Data system?

A

The desired properties of Big Data systems are related both to complexity
and scalability.

  • Complexity is generally used to characterize something with many parts, where those parts interact with each other in multiple ways
  • Scalability is the ability to maintain performance in the face of increasing data or load by adding resources to the system

A Big Data system must perform well, be resource-efficient, and it must be
easy to reason about

146
Q

Desired (4) properties of a Big Data system

A
  1. Robustness and fault tolerance
    1. Duplicated data
    2. Concurrency
  2. Low latency
  3. Minimal maintenance
    1. Anticipating when to add machines to scale,
    2. keeping processes up and running
    3. debugging
  4. Ad hoc queries
    1. Being able to mine a dataset arbitrarily gives opportunities for business optimization and new applications.
147
Q

What are the functions and properties of batch layer in lambda architecture?

A
  • Manages the master dataset – an immutable, append-only set of raw data
  • Pre-computes arbitrary query functions – called batch views
  • Runs in a loop and continuously recomputes the batch views from scratch
  • Very simple to use and understand
  • Scales by adding new machines.
148
Q

What are the advantages of storing data in raw format?

A
  • Data is always true (or correct): all records are always correct; no need to go back and re-write existing records; you can simply append new data
  • You can always go back to the data and perform queries you did not anticipate when building the system

Data should be stored in raw format, should be immutable and should be
kept forever

149
Q

What are the features and properties of speed layer in lambda architecture?

A
  • Accommodates all requests that are subject to low latency requirements
  • Its goal is to ensure new data is represented in query functions as quickly as needed for the application requirements
  • Similar to the batch layer in that it produces views based on data it receives
    • One big difference is that the speed layer only looks at recent data, whereas the batch layer looks at all the data at once
  • Does incremental computation instead of the recomputation done in the batch layer
150
Q

What are the features and properties of serving layer in lambda architecture?

A
  • Indexes batch views so that they can be queried with low latency
  • The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it
  • When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available
  • It does not need to support specific record updates
    • This is a very important point, as random writes cause most of the complexity in databases
151
Q

Distributed File Systems

A

Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers

  • They scale by adding more machines to the cluster
  • Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible

The operations you can do with a distributed filesystem are often more limited than you can do with a regular filesystem

  • For instance, you may not be able to write to the middle of a file or even modify a file at all after creation
152
Q

Google File System - Motivation & Assumptions

A

Motivation

Google needed a good distributed file system

  • Redundant storage of massive amounts of data on cheap and unreliable computers

Why not use an existing file system?

  • Google’s problems were different from anyone else’s
    • Different workload and design priorities
  • Google File System is designed for Google apps and workload
  • Google apps are designed for Google File System

Assumptions

  • High component failure rates
    • Inexpensive commodity components fail all the time
  • “Modest” number of HUGE files
    • Just a few million
    • Each is 100MB or larger; multi-GB files typical
  • Files are write-once, mostly appended to
  • Large streaming reads
153
Q

Define BigTable

A

BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

154
Q

What is the Hadoop Distributed File System (HDFS)?

A

The Hadoop Distributed File System is the open source alternative to the Google File System

  • Commodity hardware
  • Tolerant to failure
155
Q

How do distributed file systems work?

A

  • All files are broken into blocks (usually 64 to 256 MB)
  • These blocks are replicated (typically 3 copies) among the servers (datanodes)
  • The namenode provides a lookup service for clients accessing the data and ensures the blocks are correctly replicated across the cluster
156
Q

What is Hadoop MapReduce?

A
  • Hadoop MapReduce is a distributed computing paradigm originally pioneered by Google
  • Used to process data in the batch layer
157
Q

What is Split-Apply-Combine Approach?

A

Many data analysis problems involve the application of a split-apply-combine strategy:

  • Split: Break up a big problem into manageable pieces;
  • Apply: Operate on each piece independently;
  • Combine: Put all the pieces back together.
158
Q

Key benefits of MapReduce

A
  • Simplicity: Developers can write applications in their language of choice, such as Java, C++ or Python
  • Scalability: MapReduce can process very large amounts of data, stored in HDFS on one cluster
  • Speed: Parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes
  • Recovery: MapReduce takes care of failures. If a machine with one copy of the data is unavailable, another machine has a copy of the same key/value pair, which can be used to solve the same sub-task.
  • Minimal data motion: MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces network traffic and contributes to Hadoop’s processing speed
159
Q

Limitations of MapReduce

A

MapReduce is a very powerful and flexible tool that allows performing almost any data transformation task. However it has some limitations:

  • MapReduce is designed specifically for batch processing
  • Low level framework (hard to use)

New tools have been developed to simplify the use of MapReduce

  • Apache HIVE (similar to SQL)
  • Apache Pig (script language)
160
Q

What are elastic clouds?

A
  • Elastic clouds allow you to rent hardware on demand rather than own your own hardware in your own location.
  • Elastic clouds let you increase or decrease the size of your cluster nearly instantaneously, so if you have a big job you want to run, you can allocate the hardware temporarily.
  • Elastic clouds dramatically simplify system administration. They also provide additional storage and hardware allocation options that can significantly drive down the price of your infrastructure.

Examples of suppliers:

  • Microsoft Azure
  • Amazon Web Services (AWS)
  • Digital Ocean
161
Q

Area under the ROC curve (AUC)

A
  • The area under the ROC curve (depicted in gray) is the probability that the model will rank a randomly chosen positive case higher than a negative case
  • AUC is useful when a single number is needed to summarize performance, or when nothing is known about the operating conditions
162
Q

Underfitting vs. Overfitting

A
  • Underfitting: A model that is too simple does not fit the data well (high bias)
    • e.g., fitting a quadratic function with a linear model
  • Overfitting: A model that is too complex fits the data too well (high variance)
    • e.g., fitting a quadratic function with a 3rd degree function
163
Q

Bias vs. Variance in underfitting and overfitting

A
  • Bias: a model that underfits is wrong on average (high bias) but is not highly affected by slightly different training data
  • Variance: a model that overfits is right on average, but is highly sensitive to specific training data
164
Q

Bias-Variance tradeoff

A
  • When trying to find the optimal model we are in fact trying to find the optimal tradeoff between bias and variance;
  • We can reduce variance by putting many models together and aggregating their outcomes.

More complexity generally gives us lower bias but higher variance, while lower variance models tend to have higher bias.

165
Q

What are ensemble methods?

A

Ensemble methods use multiple algorithms to obtain better predictive performance than could be obtained from any of the algorithms by itself

Using multiple algorithms usually increases model performance by:

  • reducing variance: models are less dependent on the specific training data

Examples:

  • Bagging (or bootstrap aggregation) creates multiple data sets from the original training data by bootstrapping - re-sampling with repetition. It runs several models and aggregates their output with a voting system
  • Random Forest combines bagging with random selection of features (or
    predictors)
  • Boosting applies classifiers sequentially, assigning higher weights to observations that have been mis-classified by the previous methods
166
Q

Explain the trade-off between overfitting and generalization

A
  • If we allow ourselves enough flexibility in searching, we will find patterns
    • Unfortunately, these patterns may be just chance occurrences in the data…
  • We are interested in patterns that generalize, i.e., that predict well for instances that we have not yet observed
167
Q

What is a fitting graph?

A

A fitting graph shows the accuracy (or error rate) of a model as a function
of model complexity.

Generally, there will be more overfitting as one allows the model to be
more complex.

168
Q

What is model complexity?

A

Complexity is a measure of the flexibility of a model.

  • If the model is a mathematical function, complexity is measured by the number of parameters
  • If the model is a tree, complexity is measured by the number of nodes
169
Q

Why does overfitting cause a model to become worse?

A

As a model gets more complex, it is allowed to pick up harmful spurious
correlations.

  • These correlations do not represent characteristics of the population in general
  • They may become harmful when they produce incorrect generalizations in the model
170
Q

Define cross-validation

A
  • Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing
  • Simple estimate of the generalization performance, but also some statistics on the estimated performance (mean, variance, . . . )
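
A small scikit-learn sketch: 5-fold cross-validation of a decision tree on the built-in iris data, reporting the mean and standard deviation of holdout accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5 splits: each fold serves once as the test set while the others train the model
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)                        # holdout accuracy of each fold
print(scores.mean(), scores.std())   # estimated generalization performance and its spread
```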
171
Q

What is a learning curve?

A

A learning curve is a plot of the generalization performance (testing data) against the amount of training data

  • Generalization performance improves as more training data are available
  • Steep initially, but then marginal advantage of more data decreases
172
Q

How do you calculate True Positive Rate and False Positve Rate?

A

  • True Positive Rate (TPR) = TP / (TP + FN)
  • False Positive Rate (FPR) = FP / (FP + TN)
173
Q

What is Receiver Operating Characteristic (ROC) curve?

A
  • The ROC graph shows the entire space of performance possibilities for a given model, independent of class balance
  • Plots a classifier’s false positive rate on the x axis against its true positive rate on the y axis
  • It depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives):
    • (0, 0) is the strategy of never issuing a positive classification
    • (1, 1) is the strategy of always issuing a positive classification
    • The line linking (0,0) to (1,1) is the strategy of guessing randomly
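
A small scikit-learn sketch: fit a logistic regression on a synthetic binary problem, then compute the ROC curve points and the AUC from the predicted class probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points of the ROC curve (x: FPR, y: TPR)
print(roc_auc_score(y_test, scores))              # area under the ROC curve
```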
174
Q

Potential sources of bias on models

A
  • Algorithm is wrong
  • Data is biased
  • People are biased
175
Q

Self-advertising

A
  • There is a trade-off:
    • Get money for ads
    • Or do self-advertising

Thus, you need to know how likely the person is to convert in order to figure out the expected value of the self-ad.

Target variable

  • Conversion within a week

Data to use:

  • Historical data

What are informative attributes for selection?

  • Income
  • Age
  • Device
  • Status (working/non-working)
  • Number of friends using spotify
  • Number of hours listened
  • Number of skips
  • Made a step towards buying
  • Clicking on premium options
176
Q

Define uplift modelling

A

Uplift modelling identifies individuals that are most likely to respond favorably to an action.

177
Q

Predictive modelling vs. uplift modelling

A
  • Predictive modelling:
    • Will a targeted customer buy?
    • Will I buy Spotify premium?
    • focus only on distinguishing between customers that buy if they are targeted versus those that do not buy
  • Uplift modelling:
    • Will the customer buy ONLY if targeted?
    • Are the self-ads the reason why I buy Spotify Premium?
    • further distinguishes different behaviors among those that do not get targeted
178
Q

How to build an uplift model?

A
  • The core complication with uplift modeling lies in the fact that we cannot measure the uplift for an individual, because we cannot simultaneously target and not target a single person.
    • We can overcome this by randomly assigning similar-looking people to different treatments and assessing the differences in their behavior.
179
Q

Two ways to build an uplift model

A
  • Differential approach
  • Two model approach
180
Q

Two model approach for creating uplift model

A
  1. Choose a target variable
  2. Run two predictive models:
    1. Experimental group
    2. Control group
  3. Calculate difference in predicted outcomes across models
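
A minimal sketch of the two-model approach on synthetic data, assuming a feature matrix X, a random treatment flag (1 = targeted, 0 = control) and a binary outcome y (1 = bought); everything here is simulated for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                   # customer features
treated = rng.integers(0, 2, size=n)          # random assignment to experimental/control group
# Simulated outcome: baseline propensity plus an extra effect for treated customers
p_buy = 1 / (1 + np.exp(-(0.5 * X[:, 0] + 0.8 * treated)))
y = rng.binomial(1, p_buy)

# 1.-2. Fit one predictive model per group on the same target
model_treated = LogisticRegression().fit(X[treated == 1], y[treated == 1])
model_control = LogisticRegression().fit(X[treated == 0], y[treated == 0])

# 3. Uplift = predicted outcome if targeted minus predicted outcome if not targeted
new_customers = rng.normal(size=(5, 3))
uplift = (model_treated.predict_proba(new_customers)[:, 1]
          - model_control.predict_proba(new_customers)[:, 1])
print(uplift)
```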
181
Q

What are the problems with two model approach for building uplift model?

A
  • Each model is trained to minimize the difference in expected customer value within a leaf, not to minimize the differences in uplift.
  • It does not mean that you’re going to identify those who will have the highest uplift.
182
Q

Differential approach in creating uplift models

A
  1. Define uplift as the target variable
  2. Run one predictive model with both treatment and control groups
    1. At each split, minimize variations in uplift, not in expected value
183
Q

Two model vs. differential approach in uplift modeling

A

Two model approach

Each predictive model finds splits to optimize expected life-time value

  • Best split is the split that minimizes variation of life-time value within each group

Differential Approach

Uplift models find splits to optimize difference in treatment effect

  • Best split is the split that minimizes variation of treatment effects within each group (or that maximizes the variance of treatment effects across groups)
184
Q

When uplift modeling is worthwhile

A
  1. Existence of a valid control group: if there is no adequate control group, it is not possible to create an uplift model
  2. Negative effects: uplift models usually have a much better performance when some customers react negatively to the intervention
  3. Negatively correlated outcomes: when the outcome is negatively correlated with the incremental impact of a marketing activity, the benefit of uplift modeling may be larger.
185
Q

What are the problems with predictive modeling?

A
  • We can use predictive models for predicting outcomes based on individual attributes
    • However, models based only on observational data do not inform how users would react to a specific intervention
  • There is no distinction between individual attributes, which are mostly immutable, and the causal part of the model
    • Did the customers upgrade because they saw the ad, or were they going to upgrade anyways?
186
Q

What are uses for clustering?

A

  • Product recommendations
    • Different types of similarity can be used!
  • Customer segments
  • Personality types
  • Store and warehouse layout
  • Text mining
  • Reducing problem complexity and enhancing interpretation

187
Q

How to evaluate how good is a given clustering?

A

Distortion is measured for each cluster as the sum of the (squared) distances between each point and its cluster centroid; lower total distortion indicates a tighter clustering. However, while k-means clustering converges to a stable solution, the actual solution depends on the initial choice of centroids.
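
A minimal sketch, assuming scikit-learn, of computing the total distortion for a k-means solution; the data X and the choice of k are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # illustrative data
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Distortion: sum of squared distances from each point to its cluster centroid
# (scikit-learn exposes the same quantity as km.inertia_).
distortion = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(distortion, km.inertia_)
```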

188
Q

How to choose a value of k for clustering?

A

The elbow method plots the within-group sum of squares against k and chooses the k at which the curve "bends", i.e. where adding more clusters no longer decreases the sum of squares by much.
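
A minimal sketch of the elbow method, assuming scikit-learn; the data X and the range of k are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # illustrative data

wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # within-group sum of squares for this k

# Plot or inspect wss against k and pick the k where the decrease levels off.
for k, value in zip(range(1, 11), wss):
    print(k, round(value, 1))
```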

189
Q

Interpreting clusters ‘characteristically’ through labelling

A

We can interpret clusters by looking at a typical cluster member or typical characteristic(s). Essentially, showing the cluster centroid.

190
Q

How do you interpret TF-IDF?

A

Inverse Document Frequency (IDF) says that the fewer documents a term occurs in, the more significant it is likely to be for the documents it does occur in.

Term Frequency (TF) counts how often the term appears within a document; the document counts across the corpus give the IDF values. TF-IDF multiplies the document-level TF by the corpus-level IDF.

The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words are simply more frequent in general.
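
A minimal sketch computing TF and IDF by hand on a toy corpus (the plain textbook variant; libraries such as scikit-learn add smoothing and normalization on top of this).

```python
import math
from collections import Counter

corpus = [
    "data science extracts knowledge from data",
    "big data systems store data",
    "clustering groups similar customers",
]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    # Term frequency: how often the term occurs within this document
    return Counter(doc)[term]

def idf(term, docs):
    # Inverse document frequency: terms occurring in fewer documents get a higher weight
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("data", docs[0], docs))        # common across the corpus -> lower weight
print(tfidf("clustering", docs[2], docs))  # rare across the corpus -> higher weight
```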

191
Q

Six types of ethical concerns raised by algorithms

A

Epistemic concerns

  1. Inconclusive evidence
  2. Inscrutable evidence
  3. Misguided evidence

Normative concerns

  1. Unfair outcomes
  2. Transformative effects
  3. Traceability
192
Q

Ethical concern raised by algorithms: Inconclusive evidence

A

Correlation does not imply causation. Algorithms produce knowledge that is yet uncertain and has not been proven.

Leads to -> Unjustified actions

193
Q

Ethical concern raised by algorithms: Inscrutable evidence

A
  • The connection between the data and conclusion is not accessible.

A lack of knowledge regarding the data being used (e.g. its scope, provenance, and quality), and, more importantly, the inherent difficulty of interpreting how each of the many data points used by a machine-learning algorithm contributes to the conclusion it generates, cause practical as well as principled limitations.

Leads to -> Opacity

194
Q

Ethical concern raised by algorithms: Misguided evidence

A
  • “Garbage in, garbage out” - input data is biased or incomplete.
  • The output of an algorithm incorporates the values and assumptions that are present in its input data. In this way, the output can never exceed the input (e.g. it cannot become more objective)
  • Leads to -> Bias
195
Q

Ethical concern raised by algorithms: Unfair outcomes

A
  • Should data-driven discrimination be allowed?
  • The decisions and actions resulting from the outcome of an algorithm should be examined according to ethical criteria and principles considering the ‘fairness’ of the decision or action (including its effects)

Leads to -> Discrimination

196
Q

Ethical concern raised by algorithms: Transformative effects

A
  • Autonomous decision-making can be questionable and yet appear ethically neutral because it does not seem to cause any obvious harm. This is because algorithms can affect how we conceptualise the world, and modify its social and political organisation.

Leads to -> Challenges for autonomy and informational privacy

197
Q

Ethical concern raised by algorithms: Traceability

A
  • How do you assign responsibility for an algorithm and its decisions?

Leads to -> Moral responsibility
198
Q

Ethical challenges that stem from ethical concerns of algorithms

A
  • Unjustified actions
    • Actions taken on the basis of inductive correlations have real impact on human interests independent of their validity.
  • Opacity
    • Lack of accessibility
    • Lack of comprehensibility
    • Information asymmetry
    • Even if people wanted to, they would not be able to explain how it works - algorithms can be too complex
  • Bias
    • Embedded social bias
    • Technical bias (constraints)
    • Emergent bias
  • Discrimination
  • Autonomy
    • Personalisation algorithms tread a fine line between supporting and controlling decisions by filtering which information is presented to the user based upon an in-depth understanding of preferences, behaviours, and perhaps vulnerabilities to influence
    • Deciding which information is relevant is subjective
    • Personalisation algorithms reduce the diversity of information users encounter by excluding content deemed irrelevant or contradictory to the user’s beliefs
  • Informational privacy
    • While there are laws (e.g. the GDPR) that protect the data of identifiable individuals, you can still be clustered into a group that you don't want to be identified with.
  • Moral responsibility
    • Black box - so nobody’s responsible for the algorithm.
199
Q

Four dimensions of data

A
200
Q

Course objective: Describe the main steps of the Cross-Industry Standard Process for Data Mining (CRISP-DM).

A

Steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment

201
Q

Course objective: Distinguish among the four different Data Science methods covered in class.

A
202
Q

Course objective: Using the frameworks discussed in class, recognize the ethical dilemmas in collecting and analyzing the Big Data.

A

Frameworks of ethical dilemmas:

  • Six types of ethical concerns raised by algorithms (Mittelstadt et al. 2016)
  • Transparency about 5 aspects
  • Algorithmic opacity in 3 ways
  • Classifier discrimination in every CRISP-DM cycle
  • Different types and origins of bias
  • Assertion-based framework
203
Q

Course objective: Evaluate the ethical position of a firm in a specific data collection situation.

A


204
Q

How do you calculate sensitivity (with confusion matrix)?

A

Sensitivity (the true positive rate, or recall) is calculated as the number of correct positive predictions (true positives) divided by the total number of actual positives: TP / (TP + FN).
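
A minimal sketch, assuming scikit-learn's confusion-matrix layout [[TN, FP], [FN, TP]]; the labels are illustrative.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # correct positive predictions / all actual positives
print(sensitivity)            # 0.8 for this toy example
```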

205
Q

What are the two ways to control for complexity in co-occurrence modelling?

A
  • Support: add a constraint on what share of the transactions a rule must cover.
    • For example, require that rules apply to at least some minimum percentage of the data—say, at least 0.01% of all transactions.
  • Confidence: add a constraint on the strength of the association.
    • For example, require the strength (confidence) to be above some threshold, such as 5%.
206
Q

What is support in co-occurrence modeling?

A

Support of association is an indication of how frequently the items appear in the data.

207
Q

What is confidence in co-occurrence modeling?

A

The confidence (or strength) of an association rule A -> B is the probability that B occurs given that A occurs, i.e. p(B|A).

208
Q

What is lift in terms of co-occurrence modeling?

A

Lift answers the question - how much more frequently does this association occur than we would expect by chance?

The lift of the co-occurrence of A and B is the probability that we actually see the two together, compared to the probability that we would see the two together if they were unrelated to (independent of) each other. As with other uses of lift we’ve seen, a lift greater than one is the factor by which seeing A “boosts” the likelihood of seeing B as well.

209
Q

What is leverage in terms of co-occurrence modeling?

A

Leverage is the difference between the observed probability of the items being purchased together and the probability expected if they were purchased independently of each other: p(A, B) - p(A)p(B).
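
A minimal sketch pulling the measures from the last few cards together for a rule A -> B, computed over a toy list of transactions (item names are illustrative).

```python
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer"},
    {"chips", "salsa"},
    {"beer", "chips"},
]
n = len(transactions)

def p(items):
    # Fraction of transactions that contain all of the given items
    return sum(1 for t in transactions if items <= t) / n

A, B = {"beer"}, {"chips"}
support    = p(A | B)                  # how frequently A and B appear together
confidence = p(A | B) / p(A)           # p(B|A): strength of the rule A -> B
lift       = p(A | B) / (p(A) * p(B))  # boost relative to independence
leverage   = p(A | B) - p(A) * p(B)    # observed minus expected co-occurrence

print(support, confidence, round(lift, 3), round(leverage, 3))
```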

210
Q

Three ways a model makes an error:

A
  • Inherent randomness
    • Prediction is not 'deterministic' - there is no promise that people will behave according to our model.
  • Bias
    • No matter how much data is given, the model will never achieve maximum accuracy (unless your model takes ALL factors into account)
  • Variance
    • Model accuracy varies across different training sets
211
Q

How can a company sustain a competitive advantage with Data Science?

A
  • Formidable historical advantage
  • Unique intellectual property
    • Novel techniques for mining the data
  • Unique intangible collateral assets
    • Implementation of the model
    • Company culture regarding implementing Data Science solutions (e.g. culture of experimentation)
  • Superior data scientists
    • You need at least one superstar data scientist to be able to evaluate the quality of the prospective hires
  • Superior data science management
    • Understand the business needs
    • Be able to communicate to techies and suits
    • Coordinate models with business constraints and costs
    • Anticipate outcomes of data science projects
    • They need to do this within the culture of a particular firm
212
Q

Alternative ways to attract data science human capital:

A
  • Engage academic data scientists (pay for their PhD)
  • Take top-notch data scientists as scientific advisors
  • Hire a third-party to conduct the data science
213
Q

What are difficulties in analyzing text?

A
  • Messy
  • Stopwords
  • Similar words for the same thing
  • Stemming
  • Which words are relevant?
214
Q

How do the batch and serving layers satisfy almost all desired properties?

A

Robustness and fault tolerance: the serving layer uses replication under the hood to ensure availability when servers go down, and the layers are human-fault tolerant because when a mistake is made you can fix your algorithm or remove the bad data and recompute the views from scratch.
Scalability: both the batch and serving layers are easily scalable. They're both fully distributed systems, and scaling them is as easy as adding new machines.
Generalization: the architecture described is as general as it gets. You can compute and update arbitrary views of an arbitrary dataset.
Extensibility: adding a new view is as easy as adding a new function of the master dataset. Because the master dataset can contain arbitrary data, new types of data can be easily added.
Ad hoc queries: the batch layer supports ad hoc queries innately. All the data is conveniently available in one location.
Minimal maintenance: the main component to maintain in this system is Hadoop. Hadoop requires some administration knowledge, but it's fairly straightforward to operate.
Debuggability: in traditional databases, an output can replace the original input. In the batch and serving layers, the input is the master dataset and the output is the views. Likewise, you have the inputs and outputs for all the intermediate steps. Having the inputs and outputs gives you all the information you need to debug when something goes wrong.

215
Q

Does a serving layer support random writes?

A
  • No.
  • This is very important, as random writes cause most of the complexity in databases. By not supporting random writes, serving-layer databases are much simpler.
  • That simplicity makes them robust, predictable, easy to configure, and easy to operate.
216
Q

What is the main difference between batch and speed layer?

A

Speed layer only looks at the most recent data.

217
Q

5 categories of Big Data technologies:

A

Batch computation systems: high throughput, high latency systems. Batch computation systems can do nearly arbitrary computations, but they may take hours or days to do so. The only batch computation system we'll use is Hadoop.
Serialization frameworks: provide tools and libraries for using objects between languages. They can serialize an object into a byte array from any language, and then deserialize that byte array into an object in any language. Serialization frameworks provide a Schema Definition Language for defining objects and their fields, and they provide mechanisms to safely version objects so that a schema can be evolved without invalidating existing objects.
Random-access NoSQL databases: there are many of these databases. They sacrifice the full expressiveness of SQL and instead specialize in certain kinds of operations. They all have different semantics and are meant to be used for specific purposes. They're not meant to be used for arbitrary data warehousing.
Messaging/queuing systems: provide a way to send and consume messages between processes in a fault-tolerant and asynchronous manner.
Realtime computation systems: high throughput, low latency, stream-processing systems. They can't do the range of computations a batch-processing system can, but they process messages extremely quickly.

218
Q

Define data normalization

A

Data normalization refers to storing data in a structured manner to minimize redundancy and promote consistency.

219
Q

3 issues with KNN modeling:

A

I. Intelligibility: the lack of an explicit, interpretable model may pose a problem in some areas. There are two aspects: the justification of a decision and the intelligibility of an entire model. With k-NN, it is easy to describe how a single instance is decided: the set of neighbors participating in the decision can be presented, along with their contributions. What is difficult to explain more deeply is what knowledge has been mined from the data. Also, visualization is possible with two dimensions, but not with many dimensions.
II. Dimensionality and domain knowledge: numeric attributes may have vastly different ranges, and unless they are scaled appropriately the effect of one attribute with a wide range can swamp the effect of another with a much smaller range. There is also a problem with having too many attributes, or many that are irrelevant to the similarity judgement. Problems with many (often irrelevant) attributes are said to be high dimensional and suffer from the curse of dimensionality: the prediction is confused by the presence of too many irrelevant attributes. To address this, one can conduct feature selection (determining which features should be included in the data mining model).
III. Computational efficiency: the main computational cost of a nearest neighbor method is in the prediction/classification step, when the database must be queried to find the nearest neighbor of a new instance.
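
A minimal sketch, assuming scikit-learn, of addressing point II by scaling attributes before k-NN so that a wide-range attribute does not swamp the others; the data and parameters are illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: the first attribute has a much wider range than the second.
X = np.array([[100000, 1.2], [120000, 0.8], [30000, 1.1], [35000, 0.7]])
y = np.array([1, 1, 0, 0])

# Scaling puts the attributes on comparable ranges before distances are computed.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
print(knn.predict([[110000, 1.0]]))
```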