Big Data Analytics Management Flashcards

1
Q

What is the definition of Data Science?

A

Data science is a set of fundamental principles that guide the extraction of knowledge from data.

2
Q

What is the definition of Data Mining?

A

Data mining is the extraction of knowledge from data, via technologies that incorporate these principles.

3
Q

What is the definition of Data-Driven Decision-Making?

A

Data-Driven Decision-Making refers to the practice of basing decisions on the analysis of data, rather than purely intuition.

4
Q

Tasks in data mining:

A
  1. Classification and class probability estimation
  2. Regression (“Value estimation”)
  3. Similarity matching
  4. Clustering
  5. Co-occurrence grouping (market-basket analysis)
  6. Profiling
  7. Link prediction
  8. Data reduction
  9. Causal modeling
5
Q

Describe classification and class probability estimation task

A

It attempts to predict, for each individual in a population, which of a set of classes this individual belongs to.

  • Classification would give definitive output: will respond, will not respond.
  • Class probability estimation would give output with probability that the individual belongs to that class.
6
Q

Describe regression task

A

Regression attempts to predict, for each individual, the numerical value of some variable for that individual. Example: “How much will a given customer use a service?”

7
Q

Regression vs. Classification?

A

Classification predicts WHETHER something will happen, whereas regression predicts HOW MUCH something will happen.

8
Q

Describe similarity matching task

A

Similarity matching attempts to IDENTIFY SIMILAR individuals based on data known about them. Example: finding companies that are similar to the ones you are already serving.

9
Q

Describe clustering task

A

Clustering attempts to GROUP individuals in a population together by their similarity, but not driven by any specific purpose. Example: “Do our customers form natural groups or segments?”

10
Q

Describe co-occurrence grouping task

A

It attempts to find ASSOCIATIONS between entities based on transactions involving them. Example: “What items are commonly purchased together?”

11
Q

Clustering vs. co-occurrence?

A

While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together in transactions.

12
Q

Describe profiling task

A

Profiling attempts to characterize the typical behavior of an individual, group, or population. Example: “What is the typical cell phone usage of this customer segment?”

13
Q

Describe link prediction task

A

Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link. Example: “Since you and Karen share 10 friends, maybe you’d like to be Karen’s friend?”

14
Q

Describe data reduction task

A

Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set. For example, a massive dataset on consumer movie-viewing preferences may be reduced to a much smaller dataset revealing the consumer taste preferences that are latent in the viewing data (for example, viewer genre preferences).

15
Q

Describe causal modeling task

A

Causal modeling attempts to help us understand what events actually influence others. Example: “Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying those consumers who would have purchased anyway?” A business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.

16
Q

Conditions for supervised learning:

A
  1. It has to have a specific target;
  2. There must be data on the target.
17
Q

Define label

A

The value for the target variable for an individual.

18
Q

Supervised vs. unsupervised tasks

A

Supervised:

  • Classification;
  • Regression;
  • Causal modeling.

Unsupervised:

  • Clustering;
  • Co-occurrence grouping;
  • Profiling.

Both:

  • Matching;
  • Link prediction;
  • Data reduction.
19
Q

Second stage of CRISP process - Data Understanding

A
  • The critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited.
  • We need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.
20
Q

Third stage of CRISP process - Data Preparation

A

The data preparation phase often proceeds along with data understanding; in it, the data are manipulated and converted into forms that yield better results.

21
Q

Define data leak

A

A data leak is a situation where a variable collected in historical data gives information on the target variable—information that appears in historical data but is not actually available when the decision has to be made.

22
Q

Fifth stage of CRISP process - Evaluation

A

The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on.

23
Q

Sixth stage of CRISP process - Deployment

A

In deployment the results of data mining—and increasingly the data mining techniques themselves—are put into real use in order to realize some return on investment.

24
Q

Data Mining vs. Software Development

A

Data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem.

25
What are informative attributes?
Informative attributes are attributes that provide information about the quantity we want to predict; information is a quantity that reduces uncertainty about something.
26
Define predictive model
Predictive model is a formula for estimating the unknown value of interest: the target.
27
Define supervised learning
Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.
29
Creation of models from data is known as
Model induction. Induction is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths).
30
Define training data
The input data for the induction algorithm, used for inducing the model. They are also called labeled data because the value for the target variable (the label) is known.
31
Define Big Data (l)
Data which does not fit into a single computer and needs multiple computers to process it.
32
Define Artificial Intelligence
Entity that can mimic human behavior
33
Define data engineering
Methods to handle and prepare data
34
What characterizes Big Data?
* **VOLUME** - Data at Rest - terabytes to exabytes of existing data to process
* **VELOCITY** - Data in Motion - streaming data, milliseconds to seconds to respond
* **VARIETY** - Data in Many Forms - structured, unstructured, text, multimedia
* **VERACITY** - Data in Doubt - uncertainty due to data inconsistency, incompleteness, ambiguities, latency, deception, model approximations.
35
What type of data has been increasing rapidly?
Unstructured data - text, video, audio
36
What is the issue with the growth of unstructured data?
Traditional systems have been designed for transactions, not unstructured data.
37
How Google solved the problem of unstructured data?
* Google goes through every page, scans its contents and has keywords ready, which creates an index (updated every day)
* Traditional architecture is not enough to process this data
* **Solution:** cluster architecture. However, every day 900 machines die and need to be replaced.
38
Google challenges: How to distribute computation across multiple machines in a resource-efficient way? How to ensure that computations that are running are not lost when machines die? How to ensure that data is not lost when machines die?
Solution:
* Google File System - redundant storage of massive amounts of data on cheap and unreliable computers
* MapReduce - distributed computing paradigm
39
What is Hadoop?
Open source software framework for distributed storage and distributed processing that replicated Google's MapReduce model.
40
Define MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The MapReduce model consists of two main stages:
1. **Map:** input data is split into discrete chunks to be processed
2. **Reduce:** output of the map phase is aggregated to produce the desired result
The simple nature of the programming model lends itself to **efficient** and **large-scale implementations** across thousands of cheap nodes (computers).
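As a rough illustration (a plain-Python sketch of the two stages, not Hadoop's actual API), word count is the classic example:

```python
# Conceptual sketch of the map/shuffle/reduce stages in plain Python: word count.
from collections import defaultdict

documents = ["big data big value", "data in motion"]

# Map: emit (key, value) pairs for each chunk of input
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group into the desired result
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'value': 1, 'in': 1, 'motion': 1}
```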
41
Data Mining Process
1. Data Engineering and Processing (big data technologies - Google File System, Hadoop)
2. Data Science -> automated data-driven decision-making (DDD)
42
Types of Analytics:
Increasing in value and difficulty:
1. **Descriptive** - what happened? [Reports]
2. **Diagnostic** - why did it happen? [Queries, statistical analysis]
3. **Predictive** - what will happen? [Forecasts, machine learning]
4. Diagnostic + Predictive = **Prescriptive** - how can we make it happen? [Optimization, planning]
43
Predictive analytics vs. Diagnostic analytics
Predictive analytics doesn't answer the question WHY; it just predicts. Diagnostic analytics tries to answer WHY something happened.
44
Why is diagnostic analytics harder than predictive analytics?
Determining correlation is easier than establishing causation (which requires a randomized controlled experiment).
45
How much human input is needed in each type of analytics?
* _Descriptive/diagnostic analytics_ provide insight into the data, so that one can better understand what data to collect and store, and provide insight into ways to improve future models.
* _Predictive analytics_ is building a model to predict when something will happen.
* _Prescriptive analytics_ automates the action to be taken based on the prediction.
46
Tasks of Descriptive/Diagnostic analytics:
Data visualization; Clustering; Co-occurrence grouping
47
Tasks of Prescriptive analytics:
* Uplift modeling - predict how individuals behave contingent on the action performed upon them * Automation - determine optimal action based on predicted reaction of individuals
48
What does it mean for groups to be pure? (in terms of selecting meaningful attributes)
Homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure.
49
Complications while putting people into segments:
* Attributes rarely split a group perfectly. * Not all attributes are binary. * Some attributes take on numeric values (continuous or integer)
50
Define entropy
**Entropy** is a measure of disorder that can be applied to a set, such as one of our individual segments. Consider that we have a set of properties of members of the set, and each member has one and only one of the properties. In supervised segmentation, the member properties will correspond to the values of the target variable. Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest. So, for example, a mixed up segment with lots of write-offs and lots of non-write-offs would have high entropy.
51
What is natural measure of impurity for numeric values?
Variance
52
Overview of technologies on a typical Business Intelligence (BI) stack
53
How did Netflix create competitive advantage with data?
They had data on what people liked and they personalized/customized/adapted their movies to customers' preferences
54
How can SafeSize leverage their data to move to B2C space?
* They stored customers' data as a 'shoe profile'; * They sold their service to e-commerce shoe stores as a widget.
55
3 ways on how to collect data from IMDB:
* Manually
* Manually download the file (which someone created)
* Pretending you are a human browsing a web site (web scraping) - API
56
Web scraping can be done using:
* A modern **programming** language which offers complete flexibility but requires more effort to implement; * **Specialized tools** which allow faster implementation but provide less flexibility and make it harder to replicate data collection.
57
Web scraping steps using a programming language:
1. Request a web page 2. Parse the HTML 3. Filter and transform data to desired format 4. Save data.
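A minimal, hypothetical sketch of these four steps in Python; the URL and CSS selector are made up, and it assumes the requests and beautifulsoup4 packages are installed:

```python
# Hypothetical web-scraping sketch: request, parse, filter/transform, save.
import csv
import requests
from bs4 import BeautifulSoup

# 1. Request a web page (placeholder URL)
response = requests.get("https://example.com/movies", timeout=10)

# 2. Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# 3. Filter and transform data to the desired format (placeholder selector)
titles = [tag.get_text(strip=True) for tag in soup.select("h2.title")]

# 4. Save data
with open("titles.csv", "w", newline="") as f:
    csv.writer(f).writerows([[t] for t in titles])
```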
58
Web scraping using a dedicated tool:
* import.io is a simple tool that tries to infer what is interesting on a website;
* webscraper.io gives you more flexibility.
59
Webscraper.io steps
1. Define a starting page 2. Define category links 3. For each individual category or product page determine which information to collect, determine which links to follow.
60
Why is web scraping not ideal?
* Many sites do not allow gathering information automatically.
* Sites detect whether you are human based on frequent requests, cookies, the Robots Exclusion Protocol (stored in robots.txt), and other trackers.
* Not all information is public (you can use authentication to access protected information, or an API).
61
What is the purpose of Robots Exclusion Protocol?
It tells everyone who is allowed to crawl their page.
62
Why don't people want their website to be crawled?
Data can be costly to acquire, so companies don't want to be found.
63
Three steps for API access:
It is an official way of accessing information automatically.
1. Get an API key
2. Query an API endpoint using the API key - an API usually provides multiple endpoints or functions (most recent movies, most popular...)
3. Process the response
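A hedged sketch of the three steps in Python; the endpoint, parameters and key below are placeholders, not any real service's documented API:

```python
# Illustrative API-access sketch with the requests library.
import requests

API_KEY = "YOUR_API_KEY"          # 1. Get an API key (placeholder)

response = requests.get(          # 2. Query an endpoint using the key
    "https://api.example.com/movies/popular",   # hypothetical endpoint
    params={"api_key": API_KEY, "page": 1},
    timeout=10,
)

data = response.json()            # 3. Process the response
for movie in data.get("results", []):
    print(movie.get("title"))
```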
64
Twitter provides two types of API:
* **Representational State Transfer (REST) APIs:** Used for singular queries for one term * **Streaming API:** Continuously get the tweets
65
Data structure (volume/velocity)
* **Cross-sectional** - data that (almost) never changes; e.g. city names, birth date
* **Transactional** - one observation represents one transaction; e.g. a website visit
* **Panel** - one observation represents one individual during a time period; e.g. a monthly bill.
66
Data structure: Tidy data
Tidy data is in a single table according to the following rules:
1. Each _variable_ must have its _own column_.
2. Each _observation_ must have its _own row_.
3. Each _value_ must have its _own cell_.
67
Types of data (Variety)
* Structured * Unstructured
68
Structured data (Variety)
QUALITATIVE/CATEGORICAL DATA
- Nominal;
- Ordinal (e.g. satisfaction level).
QUANTITATIVE DATA
- Discrete (countable number);
- Continuous (interval value).
69
Types of unstructured data (Variety):
* Text-based documents (tweets, webpages) * Images, videos, audio
70
What are the methods to transform unstructured data into structured data?
* Topic Modeling (text); * Sentiment Analysis (text); * Feature extraction (image/video/sound).
71
How can data quality be affected by (Veracity)?
* Missing data * Measurement error
72
Two ways of missing data:
* Missing observations * Missing values in some observations
73
Why is it important to know why data is missing?
You need to know the reason why it is missing because it will inform whether it is a problem or not. Is data missing at random or not?
74
What if data are missing at random?
It is fine. If data are missing at random, the remaining observations are still a representative sample of the population. Solution: listwise deletion i.e. delete all observations that do not have values for all variables in the analysis.
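For illustration, a small pandas sketch of listwise deletion on toy data (assuming the values really are missing at random):

```python
# Listwise deletion: drop every observation with a missing value.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 40, np.nan, 31],
    "income": [30000, np.nan, 45000, 52000],
})

complete_cases = df.dropna()   # keep only rows with values for all variables
print(complete_cases)
```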
75
What if data are missing not at random?
It is a problem! The remaining observations are not a representative sample of population.
76
Selection bias (Veracity)
* **Selection bias** occurs when the sampling procedure is not random and thus the sample is not representative of the population. * **Self-selection** - some members of the population are more likely to be included in the sample because of their characteristics.
77
Selection bias (Veracity)
Selection bias occurs when the sampling procedure is not random and thus the sample is not representative of the population.
78
Two types of selection bias (Veracity)
* **Self-selection** - some members of the population are more likely to be included in the sample because of their characteristics.
* **Attrition** - some observations may be less likely to remain in the sample over time (dropout).
79
Define measurement error (Veracity)
Measurement error occurs when the data are collected with errors that are not random.
80
Types of measurement error
* **Recall bias** - respondents recall some events more vividly than others (child deaths by gun vs swimming pools); * **Sensitive questions** - respondents may not report data accurately (wages, health conditions); * **Faulty equipment** - equipment that exhibits systematic measurement error.
81
What does disorder in terms of entropy represent?
Disorder corresponds to how mixed (impure) the segment is with respect to properties of interest.
82
Entropy equation:
entropy = - p1 log(p1) - p2 log(p2) - ... (log base 2). Each pi is the probability of property i within the set, ranging from pi = 1 when all members of the set have property i, to pi = 0 when no members of the set have property i.
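A small worked example of the formula in Python (using log base 2, as elsewhere in this deck):

```python
# Entropy of a segment given the class proportions p1, p2, ...
import math

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))        # 0.0  -> pure segment, minimum disorder
print(entropy([0.5, 0.5]))   # 1.0  -> maximally mixed two-class segment
print(entropy([0.9, 0.1]))   # ~0.47 -> mostly pure segment
```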
83
Define information gain (IG)
It measures how much an attribute improves (decreases) entropy over the whole segmentation it creates.
84
How to address the issue in classification when the probability may be overly optimistic with small samples?
We use Laplace correction. Its purpose is to moderate the influence of leaves with only a few instances.
85
How do we identify informative attributes?
We measure the attribute on the basis of information gain, which is based on a purity measure called entropy; another purity measure is variance reduction (for a numeric target).
86
How do we segment data by progressive attribute selection?
We use tree induction technique.
87
What is tree induction technique?
Tree induction recursively finds informative attributes for subsets of the data. In so doing it segments the space of instances into similar regions. The partitioning is “supervised” in that it tries to find segments that give increasingly precise information about the quantity to be predicted, the target. The resulting tree-structured model partitions the space of all possible instances into a set of segments with different predicted values for the target.
88
Define parametric learning
The data miner specifies the form of the model and the attributes; the goal of the data mining is to tune the parameters so that the model fits the data as well as possible.
89
SVM error function is known as
Hinge loss. The penalty for a misclassified point is proportional to the distance from the decision boundary, so if possible the SVM will make only “small” errors.
90
Learning objective session 3: Understand the stages of a predictive modeling process
1. Define target 2. Collect data 3. Build a model (set of rules or a mathematical formula) 4. Predict outcomes
91
Learning objective session 3: Understand the main concepts and principles of predictive modeling, including the concepts of target variable, supervised segmentation, entropy and information gain.
* **Target variable** (label) - the value you're trying to predict.
* **Supervised segmentation** is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable. The model estimates the value of the target variable as a function (possibly a probabilistic function) of the features.
* **Entropy** is a measure of disorder (surprise). It tells how impure the segment is with regard to the properties of interest.
* **Information gain** measures how much an attribute decreases entropy over the whole segmentation it creates.
92
Learning objective session 3: Know the basic metrics used to evaluate a predictive model and related concepts, including confusion matrix, accuracy and error rate.
* **Accuracy** is the proportion of correct decisions made by the classifier.
* **Error rate** is the proportion of wrong decisions made by the classifier.
* **Confusion matrix** is a table that is often used to describe the performance of a classification model (TP, TN, FP, FN) on a set of test data for which the true values are known.
93
Unsupervised methods (there is no specific target variable)
* **Affinity grouping** - associations, market-basket analysis (Which items are commonly purchased together?) * **Similarity matching** (Which other companies are similar to ours?) * **Clustering** (Do my customers form natural groups?) * **Sentiment analysis** (What is the sentiment of my users?)
94
Supervised methods (there is a specific target variable)
* **Predictive modeling**
  * Will a specific customer default? Which accounts will be defrauded?
* **Causal modeling**
  * How much would client X spend if I gave her a discount?
95
Predictive vs. diagnostic
They pursue different goals: **Predictive modeling** is the process of applying a statistical model or data mining algorithm to data for the purpose of _predicting new or future observations._ _Example:_ How much will client X spend? **Explanatory modeling** is the use of statistical models for explaining how the world works (by testing _causal explanations)_. _Example:_ How much would a discount change client X's spending?
96
Why empirical explanation and empirical prediction differ?
* **Explanatory models** are based on _underlying causal relationships between theoretical constructs_ while * **predictive models** rely on _associations between measurable variables._ * **Explanatory modeling** seeks to _minimize model bias_ (i.e. specification error) to obtain the most accurate representation of the underlying theoretical model, * **predictive modeling** seeks to _minimize the combination of model bias and sampling variance_ (how much does the model change with new data).
97
Define predictive modeling
It is a method for estimating an unknown value of interest, which is called target.
98
Process of predictive modeling
1. Define a (quantifiable) target
2. Collect data - data on the same or a related phenomenon
3. Build a model - a set of rules or a mathematical formula that allows us to establish a prediction
4. Predict outcomes - the model can be applied to any customer; it gives us a prediction of the target variable.
99
Two types of predictive modeling
REGRESSION
Attempts to estimate or predict the numerical value of some variable for an individual.
Mathematical formula:
* Linear regression
* Logistic regression
Rule-based formula:
* Regression trees

CLASSIFICATION
Attempts to predict which of a (small) set of classes an individual belongs to.
Mathematical formula:
* Logistic regression
* Support Vector Machines
Rule-based formula:
* Classification trees
100
Define linear regression
**Linear regression** is an approach for modeling the relationship between a _dependent variable_ and one or more _explanatory variables_. The estimators B0, B1, B2 are obtained by _minimizing the sum of squared errors._ It is used when you are trying to predict a numerical variable.
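An illustrative scikit-learn sketch with made-up data (not from the course), showing how B0 and B1 are estimated by least squares and then used for prediction:

```python
# Fit a linear regression by minimizing the sum of squared errors.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # explanatory variable (e.g. effort)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent (numerical) variable

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)       # estimated B0 and B1
print(model.predict([[6]]))                # predicted value for a new individual
```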
101
Define logistic regression
If the dependent variable takes values between 0 and 1, we can use **logistic regression** to model its relationship with one or more **explanatory variables.** f() is a function with values between 0 and 1, e.g. P(Pass) = f(b0 + b1 × effort).
102
What does R squared mean in linear regression?
It tells how much of the total variation in the dependent variable is explained by the model; the remaining variation is due to everything besides the explanatory variables (effort, in the example above). The bigger it is, the better, because a larger percentage of the variation is explained by the model.
103
Steps of classification
1. **Define target** - will prospect X buy life insurance?
2. **Collect data** - gather a list of prospects with demographic information
3. **Build a model** - logistic regression, classification trees
4. **Predict outcomes**
104
When can logistic regression be used for classification?
Logistic regression can be used for classification when: * Target variable is binary. * The outcome of a model can be interpreted as probability.
105
When should you stop segmentation?
Stop segmentation when at least one of the conditions is met:
* All elements of a segment belong to the same class
* The maximum allowed tree depth is reached
* Using more attributes does not "help"
106
How to choose at each step which of the attributes to use to segment the population?
Resulting groups have to be as pure as possible - homogeneous w.r.t. the target variable.
107
Define entropy
How much information is necessary to represent an event with X equally likely possible outcomes? log2(X) bits; with p = 1/X, this equals log2(1/p). **Entropy** measures the general disorder of a set - how unpredictable it is.
108
What is information gain?
* **Information gain** (IG) measures the change in entropy due to any amount of new information being added.
* Information gain measures how much an attribute decreases entropy over the whole segmentation it creates.
IG = entropy(parent) - [p(c1) * entropy(c1) + p(c2) * entropy(c2) + ...]
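A worked example of the formula in Python, with hypothetical counts for a binary split:

```python
# Information gain of a candidate split (parent: 10 positives, 10 negatives).
import math

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

parent = entropy([10/20, 10/20])              # = 1.0 (maximally mixed)
child1 = entropy([9/12, 3/12])                # left child: 9 pos, 3 neg
child2 = entropy([1/8, 7/8])                  # right child: 1 pos, 7 neg

ig = parent - (12/20 * child1 + 8/20 * child2)
print(round(ig, 3))  # information gain of this candidate split
```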
109
How to evaluate a model?
**Accuracy** = number of correct decisions made/total number of decisions. **Error rate** = 1 - accuracy
110
Confusion matrix
* True Positives (TP) - actual positives correctly predicted as positive. * True Negatives (TN) - actual negatives correctly predicted as negative. * False Positives (FP) - negatives incorrectly predicted as positive. * False Negatives (FN) - positives incorrectly predicted as negative.
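A small sketch computing the related metrics (accuracy, error rate, true positive rate, false positive rate) from hypothetical confusion-matrix counts:

```python
# Basic classification metrics from confusion-matrix counts (made-up numbers).
TP, TN, FP, FN = 60, 25, 10, 5

accuracy   = (TP + TN) / (TP + TN + FP + FN)
error_rate = 1 - accuracy
tpr = TP / (TP + FN)   # true positive rate (sensitivity / recall)
fpr = FP / (FP + TN)   # false positive rate

print(accuracy, error_rate, tpr, fpr)
```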
111
Learning objective session 4: Understand the concepts of generalization and overfit
* **Generalization** is the property of a model whereby _model applies to data that were not used to build the model._ * **Overfit** is the tendency to _tailor models to the training data_, at the expense of generalization to previously unseen data points.
112
Learning objective session 4: Understand the concepts of holdout data and cross-validation
* **Holdout data** (or test set) is the data that was not used to teach the model - it was set aside so the created model could be evaluated * **Cross-validation** computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing. It computes the average and standard deviation from k folds.
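An illustrative scikit-learn sketch of k-fold cross-validation on synthetic data (5 folds, reporting the mean and standard deviation of holdout accuracy):

```python
# k-fold cross-validation: repeated train/test splits over all the data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(max_depth=4), X, y, cv=5)
print(scores.mean(), scores.std())  # estimate of generalization accuracy
```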
113
Learning objective: Be able to interpret the performance of a model by looking at different measures, such as **fitting curves**, **learning curves**, and **ROC curves.**
114
Learning objective session 4: Be able to _evaluate a model_ using the **Expected Value framework.**
The **expected value framework** is an analytical tool that is extremely helpful in organizing thinking about data-analytic problems. It combines:
* Structure of the problem
* Elements of the analysis that can be extracted from the data
* Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)
The **benefit/cost matrix** summarizes the benefits and costs of each potential outcome, always comparing with a base scenario. It does not really matter which base scenario we choose, as long as all comparisons are with the same scenario.
115
What is the problem with 'table model'?
Does not predict the future but just fits the data perfectly, as it memorizes the training data and performs no generalization.
116
Define overfitting
It is the tendency to tailor the model to the training data, at the expense of generalization to previously unseen data points.
117
_Trade-off_ between **overfitting** and **generalization**
  * If we allow ourselves enough flexibility in searching, we will find patterns
  * Unfortunately, these patterns may be just chance occurrences in the data
  * We are interested in patterns that generalize, i.e., that predict well for instances that we have not yet observed
118
What if by accident the test set is particularly easy/hard?
**Solution:** cross-validation. * Cross validation is a more sophisticated training and testing procedure. * Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing. * Simple estimate of the generalization performance, but also some statistics on the estimated performance (mean, variance, . . . )
119
What if standard deviation is large, while the average is good?
You need to assess what is better - good average or small standard deviation.
120
How to find which variables are the most important for the model?
* (Weighted) Sum of information gain in each split a variable is used (tree-based models) * Difference in model performance with and without using that variable (all models)
121
Characteristics of decision trees
* Trees create a segmentation of the data
* Each node in the tree contains a test of an attribute
* Each path eventually terminates at a leaf
* Each leaf corresponds to a segment, and the attributes and values along the path give the characteristics
* Each leaf contains a value for the target variable
122
The Data Mining Process: _building_ and _using_ a predictive model
123
Features of entropy
* P1, P2, ..., Pn are the proportions of classes 1, 2, ..., n in the data
* Disorder corresponds to how mixed (impure) a segment is
* Entropy is **zero at minimum disorder** (all members belong to the same class)
* Entropy is **one at maximal disorder** (members equally distributed among the classes)
124
Confusion matrix formulas
125
Components of scalling with a traditional database
1. **Scaling with a queue** - you create a queue for requests so that frequent requests don't crash the system.
2. **Scaling by sharding the database** - you split the write load across multiple machines (horizontal partitioning/sharding). This starts to cause fault and corruption issues.
126
Desired (8) properties of a big data system
* Robustness and fault tolerance
* Low latency reads and updates
* Scalability
* Generalization
* Extensibility
* Ad hoc queries
* Minimal maintenance
* Debuggability
127
Problems with fully incremental architectures
* **Operational complexity**
  * *Compaction* is an intensive operation requiring a lot of coordination; many things could go wrong.
* **Extreme complexity of achieving eventual consistency**
  * Consistency and availability don't go together.
* **Lack of human-fault tolerance:** an incremental system is constantly modifying the state it keeps in the database, which means a mistake can also modify the state in the database.
128
Expected value framework
The expected value framework is an analytical tool that is extremely helpful in organizing thinking about data-analytic problems. It combines:
* Structure of the problem
* Elements of the analysis that can be extracted from the data
* Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)
129
Define generalization
**Generalization** is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.
130
Define overfitting
**Overfitting** is the tendency of data mining procedures to _tailor models to the training data_, at the expense of generalization to previously unseen data points.
131
What is holdout data?
**Holdout data** is data used for validating a model and not used for training a model. Performance is evaluated based on accuracy in the test data -> **holdout accuracy.** Holdout accuracy is an estimate of **generalization accuracy.**
132
Tree induction commonly uses two techniques to avoid overfitting:
1. Stop growing the tree before it gets too complex 2. Grow the tree until it is too large, then 'prune' it back, reducing its size (and complexity).
133
How is similarity analysis used in business?
- Ad retrieval - Customer classification - Customer clustering - Competitor analysis
134
Similarity as distance between neighbors
You measure the distance between the attributes. Distance = Pythagorean (Euclidean) distance.
135
Define lift
The **lift of the co-occurrence** of A and B is the probability that we actually see the two together, compared to the probability that we would see the two together if they were unrelated to (independent of) each other. how much more frequently does this association occur than we would expect by chance?
136
Define leverage
**Leverage** measures how much more likely than chance a discovered association is: instead of the ratio of these quantities (as in lift), it looks at their difference.
137
Learning objective session 5: What are the challenges of creating big data applications?
1. Scaling 2. Complexity 3. Fault-tolerance 4. Data-corruption
138
Learning objective session 5: What are the motivations for the lambda architecture?
1. Robustness and fault tolerance
2. Low latency
3. Minimal maintenance
4. Ad hoc queries
139
Learning objective session 5: What are the best practices in designing big data applications? How to store data? How to guarantee consistency and resilience?
140
Learning objective session 5: What are the main features of Hadoop HDFS?
* Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers
* They scale by adding more machines to the cluster
* Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible
* The operations you can do with a distributed filesystem are often more limited than with a regular filesystem; for instance, you may not be able to write to the middle of a file or even modify a file at all after creation
How do distributed file systems work?
* All files are **broken into blocks** (usually 64 to 256 MB)
* These blocks are **replicated** (typically 3 copies) among the HDFS servers (datanodes)
* The namenode provides a **lookup service** for clients accessing the data and ensures the blocks are correctly replicated across the cluster
141
Learning objective session 5: What are the main features of MapReduce?
The MapReduce model consists of two main stages:
* **Map:** input data is split into discrete chunks to be processed
* **Reduce:** output of the map phase is aggregated to produce the desired result
The simple nature of the programming model lends itself to efficient and large-scale implementations across thousands of cheap nodes (computers).
142
Learning objective session 5: be familiar with specific big data tools and be able to position them in the lambda architecture
?
143
The Life Cycle of a Competitive industry: how does the number of firms change over the stages of the product lifecycle?
144
What are the problems with client /server data architecture model?
* The analytics application is **struggling to keep up with the traffic** - too many requests for the database
* You start sharding (hashing) the database; however, it is messy, takes time and is **prone to errors**
* **Fault tolerance** decreases, as any fix requires taking one of the databases down
* **Data corruption issues.** There is no place to store unchangeable (immutable) data, thus you corrupt the original file.
145
What are the (2) desired properties of a Big Data system?
The desired properties of Big Data systems are related both to _complexity_ and _scalability_.
* **Complexity** is generally used to characterize something with many parts, where those parts interact with each other in multiple ways
* **Scalability** is the ability to maintain performance in the face of increasing data or load by adding resources to the system
A Big Data system must _perform well_, be _resource-efficient_, and be **easy to reason about**.
146
Desired (4) properties of a Big Data system
1. **Robustness and fault tolerance**
  1. Duplicated data
  2. Concurrency
2. **Low latency**
3. **Minimal maintenance**
  1. Anticipating when to add machines to scale
  2. Keeping processes up and running
  3. Debugging
4. **Ad hoc queries**
  1. Being able to mine a dataset arbitrarily gives opportunities for business optimization and new applications.
147
What are the functions and properties of batch layer in lambda architecture?
* Manages the **master dataset** – an immutable, append-only set of raw data * Pre-computes arbitrary query functions – called **batch views** * Runs in a loop and continuously recomputes the batch views from scratch * Very simple to use and understand * Scales by adding new machines.
148
What are the advantages of storing data in raw format?
* Data is always true (or correct): all records are always correct; no need to go back and re-write existing records; you can simply append new data * You can always go back to the data and **perform queries you did not anticipate** when building the system Data should be stored in **raw format**, should be **immutable** and should be **kept forever**
149
What are the features and properties of speed layer in lambda architecture?
* Accommodates all requests that are subject to _low latency requirements_ * Its goal is to ensure new data is represented in query functions as quickly as needed for the application requirements * Similar to the batch layer in that it produces views based on data it receives * One big difference is that the speed layer _only looks at recent data,_ whereas the batch layer looks at all the data at once * Does _incremental computation_ instead of the recomputation done in the batch layer
150
What are the features and properties of serving layer in lambda architecture?
* Indexes **batch views** so that they can be queried with low latency * The serving layer is a specialized distributed database that loads in a batch view and makes it possible to do random reads on it * When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available * It does not need to support specific record updates * This is a very important point, as random writes cause most of the complexity in databases
151
Distributed File Systems
Distributed file systems are quite similar to the file systems of your computer, except they spread their storage across a cluster of computers * They scale by adding more machines to the cluster * Designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible The operations you can do with a distributed filesystem are often more limited than you can do with a regular filesystem * For instance, you may not be able to write to the middle of a file or even modify a file at all after creation
152
Google File System - Motivation & Assumptions
**Motivation**
Google needed a good distributed file system:
* Redundant storage of massive amounts of data on cheap and unreliable computers
Why not use an existing file system?
* Google’s problems were different from anyone else’s - different workload and design priorities
* Google File System is designed for Google apps and workload
* Google apps are designed for Google File System
**Assumptions**
* High component failure rates - inexpensive commodity components fail all the time
* "Modest" number of HUGE files - just a few million; each is 100MB or larger, multi-GB files typical
* Files are write-once, mostly appended to
* Large streaming reads
153
Define BigTable
**BigTable** is a _distributed storage system_ for managing _structured data_ that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
154
What is Hadoop File System (HDFS)?
Hadoop File System is the open source alternative to Google File System
* Commodity hardware
* Tolerant to failure
155
How distributed file systems work?
156
What is Hadoop MapReduce?
* Hadoop MapReduce is a distributed computing paradigm originally pioneered by Google * Used to process data in the batch layer
157
What is Split-Apply-Combine Approach?
Many data analysis problems involve the application of a split-apply-combine strategy: * **Split:** Break up a big problem into manageable pieces; * **Apply:** Operate on each piece independently; * **Combine:** Put all the pieces back together.
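A toy pandas sketch of split-apply-combine (hypothetical data), which is the same idea MapReduce applies at cluster scale:

```python
# Split by group, apply an aggregation to each piece, combine into one result.
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "spend":   [10, 30, 5, 7, 9],
})

print(df.groupby("segment")["spend"].mean())
```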
158
Key benefits of MapReduce
* **Simplicity:** developers can write applications in their language of choice, such as Java, C++ or Python
* **Scalability:** MapReduce can process very large amounts of data, stored in HDFS on one cluster
* **Speed:** parallel processing means that MapReduce can take problems that used to take days to solve and solve them in hours or minutes
* **Recovery:** MapReduce takes care of failures. If a machine with one copy of the data is unavailable, another machine has a copy of the same key/value pair, which can be used to solve the same sub-task
* **Minimal data motion:** MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces network traffic and contributes to Hadoop’s processing speed
159
Limitations of MapReduce
MapReduce is a very powerful and flexible tool that allows performing almost any data transformation task. However, it has some limitations:
* MapReduce is designed specifically for batch processing
* It is a low-level framework (hard to use)
New tools have been developed to simplify the use of MapReduce:
* Apache Hive (similar to SQL)
* Apache Pig (script language)
160
What are elastic clouds?
* Elastic clouds allow you to **rent hardware on demand** rather than own your own hardware in your own location.
* Elastic clouds let you **increase or decrease the size of your cluster nearly instantaneously**, so if you have a big job you want to run, you can allocate the hardware temporarily.
* Elastic clouds dramatically **simplify system administration.** They also provide additional storage and hardware allocation options that can significantly drive down the price of your infrastructure.
Examples of suppliers:
* Microsoft Azure
* Amazon Web Services (AWS)
* Digital Ocean
161
Area under the ROC curve (AUC)
* The area under the ROC curve (depicted in gray) is the probability that the model will rank a randomly chosen positive case higher than a negative case * AUC is useful when a single number is needed to summarize performance, or when nothing is known about the operating conditions
162
Underfitting vs. Overfitting
* **Underfitting:** a model that is too simple does not fit the data well (high bias), e.g., fitting a quadratic function with a linear model
* **Overfitting:** a model that is too complex fits the data too well (high variance), e.g., fitting a quadratic function with a 3rd-degree function
163
Bias vs. Variance in underfitting and overfitting
* **Bias:** a model that underfits is wrong on average (high bias) but is not highly affected by slightly different training data
* **Variance:** a model that overfits is right on average, but is highly sensitive to the specific training data
164
Bias-Variance tradeoff
* When trying to find the optimal model we are in fact trying to find the optimal trade-off between bias and variance;
* We can reduce variance by putting many models together and aggregating their outcomes. More complexity generally gives us lower bias but higher variance, while lower-variance models tend to have higher bias.
165
What are ensemble methods?
**Ensemble methods** use multiple algorithms to obtain better predictive performance than could be obtained from any of the algorithms by itself.
Using **multiple algorithms** usually increases model performance by:
* **reducing variance:** models are less dependent on the specific training data
Examples:
* **Bagging** (or bootstrap aggregation) creates multiple data sets from _the original training data by bootstrapping_ (re-sampling with replacement), runs several models and _aggregates the output with a voting system_
* **Random Forest** combines bagging with random selection of features (or predictors)
* **Boosting** applies classifiers sequentially, assigning higher weights to observations that have been misclassified by the previous methods
166
Explain the trade-off between **overfitting** and **generalization**
* If we allow ourselves enough flexibility in searching, we will find patterns
* Unfortunately, these patterns may be just chance occurrences in the data...
* We are interested in patterns that generalize, i.e., that predict well for instances that we have not yet observed
167
What is a fitting graph?
A **fitting graph** shows the accuracy (or error rate) of a model as a function of model complexity. Generally, there will be more overfitting as one allows the model to be more complex.
168
What is model complexity?
**Complexity** is a measure of the flexibility of a model. * If the model is a _mathematical function_, complexity is measured by the **number of parameters** * If the model is a _tree_, complexity is measured by the **number of nodes**
169
Why is overfitting causing a model to become worse?
As a model gets more complex, it is allowed to pick up harmful **spurious correlations.** * These correlations do _not represent characteristics of the population in general_ * They may become harmful when they _produce incorrect generalizations_ in the model
170
Define cross-validation
* **Cross-validation** computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing * Simple estimate of the generalization performance, but also some statistics on the estimated performance (mean, variance, . . . )
171
What is a learning curve?
A **learning curve** is a plot of the generalization performance (testing data) against the amount of training data * Generalization performance improves as more training data are available * Steep initially, but then marginal advantage of more data decreases
172
How do you calculate True Positive Rate and False Positive Rate?
173
What is Receiver Operating Characteristic (ROC) curve?
* The ROC graph shows the entire space of performance possibilities for a given model, independent of class balance
* It plots the classifier's false positive rate on the x axis against its true positive rate on the y axis
* It depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives):
  * (0, 0) is the strategy of never issuing a positive classification
  * (1, 1) is the strategy of always issuing a positive classification
  * The line linking (0, 0) to (1, 1) is the strategy of guessing randomly
174
Potential sources of bias on models
* Algorithm is wrong * Data is biased * People are biased
175
Self-advertising
There is a trade-off:
* Get money for ads, or
* Do self-advertising.
Thus, you need to know how likely the person is to convert, to figure out the expected value of the self-ad.
Target variable: conversion within a week
Data to use: historical data
Informative attributes for selection:
- Income
- Age
- Device
- Status (working/non-working)
- Number of friends using Spotify
- Number of hours listened
- Number of skips
- Made a step towards buying
- Clicking on premium options
176
Define uplift modelling
**Uplift modelling** identifies _individuals that are most likely to respond favorably_ to an action.
177
Predictive modelling vs. uplift modelling
* Predictive modelling:
  * Will a targeted customer buy? Will I buy Spotify Premium?
  * Focuses only on distinguishing between customers that buy if they are targeted versus those that do not buy
* Uplift modelling:
  * Will the customer buy ONLY if targeted? Are the self-ads the reason why I buy Spotify Premium?
  * Further distinguishes different behaviors among those that do not get targeted
178
How to build an uplift model?
* The **core complication with uplift modeling** lies in the fact that we _cannot measure the uplift for an individual_, because we cannot simultaneously target and not target a single person.
* We can overcome this by **randomly assigning similar-looking people to different treatments** and assessing the differences in their behavior.
179
Two ways to build an uplift model
- Differential approach - Two model approach
180
Two model approach for creating uplift model
1. Choose a target variable 2. Run two predictive models: 1. Experimental group 2. Control group 3. Calculate difference in predicted outcomes across models
181
What are the problems with two model approach for building uplift model?
* Each model is trained to minimize differences in expected customer value within a leaf, not to minimize the differences in uplift.
* This means you are not necessarily going to identify those who will have the highest uplift.
182
Differential approach in creating uplift models
1. Define uplift as the target variable
2. Run one predictive model with both treatment and control groups
  * At each split, minimize variation in uplift, not in expected value
183
Two model vs. differential approach in uplift modeling
**Two model approach:** each predictive model finds splits to optimize expected life-time value
* The best split is the one that minimizes variation of life-time value within each group
**Differential approach:** uplift models find splits to optimize the difference in treatment effect
* The best split is the one that minimizes variation of treatment effects within each group (or that maximizes the variance of treatment effects across groups)
184
When is uplift modeling worthwhile?
1. **Existence of a valid control group** - if there is no adequate control group, it is not possible to create an uplift model
2. **Negative effects** - uplift models usually have a much better performance when some customers react negatively to the intervention
3. **Negatively correlated outcomes** - when the outcome is negatively correlated with the incremental impact of a marketing activity, the benefit of uplift modeling may be larger.
185
What are the problems with predictive modeling?
* We can use predictive models for predicting outcomes based on individual attributes * However, models based only on observational data do not inform how users would react to a specific intervention * There is no distinction between individual attributes, which are mostly immutable, and the causal part of the model * Did the customers upgrade because they saw the ad, or were they going to upgrade anyways?
186
What are uses for clustering?
* Product recommendations - different types of similarity can be used!
* Customer segments
* Personality types
* Store and warehouse layout
* Text mining
* Reducing problem complexity and enhancing interpretation
187
How to evaluate how good is a given clustering?
**Distortion** is a measure computed for each cluster from the distance between each point and its cluster centroid. However, while k-means clustering converges to a stable solution, the actual solution depends on the starting centroid choice.
188
How to choose a value of k for clustering?
The elbow method is used: increase k and look at the point where the within-group sum of squares stops decreasing substantially (the "elbow").
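An illustrative scikit-learn sketch of the elbow method on synthetic data; KMeans' `inertia_` attribute is its within-cluster sum of squares:

```python
# Inspect the within-cluster sum of squares (inertia) for increasing k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # look for the 'elbow' where the drop flattens
```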
189
Interpreting clusters 'characteristically' through labelling
We can interpret clusters by looking at a _typical cluster member_ or _typical characteristic_(s). Essentially, showing the cluster centroid.
190
How do you interpret TF-IDF?
Inverse Document Frequency (IDF) says that the fewer documents a term occurs in, the more significant it is likely to be to the documents it does occur in. It is combined with Term Frequency (TF): the term counts within a document form the TF values for each term, and the document counts across the corpus form the IDF values. TF-IDF thus compares a document's TF with the IDF over the entire corpus: the TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
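An illustrative scikit-learn sketch on a tiny made-up corpus, showing how rarer terms receive higher IDF weights:

```python
# TF-IDF weighting of a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "big data analytics",
    "data mining and analytics",
    "customer churn prediction",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # document-term matrix of TF-IDF values
# Terms appearing in fewer documents (e.g. 'churn') get a higher IDF weight.
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_.round(2))))
```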
191
Six types of ethical concerns raised by algorithms
Epistemic concerns:
1. Inconclusive evidence
2. Inscrutable evidence
3. Misguided evidence
Normative concerns:
4. Unfair outcomes
5. Transformative effects
6. *Traceability*
192
Ethical concern raised by algorithms: Inconclusive evidence
Correlation does not imply causation. Algorithms produce knowledge that is yet uncertain and has not been proven. Leads to -> Unjustified actions
193
Ethical concern raised by algorithms: Inscrutable evidence
* The connection between the data and conclusion is not accessible. A lack of knowledge regarding the data being used (e.g. relating to their scope, provenance and quality), but more importantly also the inherent difficulty in the interpretation of how each of the many data-points used by a machine-learning algorithm contribute to the conclusion it generates, cause practical as well as principled limitations. Leads to -> Opacity
194
Ethical concern raised by algorithms: Misguided evidence
* "Garbage in, garbage out" - input data is biased or incomplete. * The output of an alogorithm incorporates the values and assumptions that are presented in the inpute data of the algorithm. In thisway, the output can never exceed the input (e.g. cannot become more objective) * Leads to -\> Bias
195
Ethical concern raised by algorithms: Unfair outcomes
* Should data-driven discrimination be allowed?
* The decisions and actions resulting from the outcome of an algorithm should be examined according to ethical criteria and principles considering the 'fairness' of the decision or action (including its effects).
Leads to -> Discrimination
196
Ethical concern raised by algorithms: Transformative effects
* Autonomous decision-making can be questionable and yet appear ethically neutral because it does not seem to cause any obvious harm. This is because algorithms can affect how we conceptualise the world, and modify its social and political organisation.
Leads to -> Challenges for autonomy and informational privacy
197
Ethical concern raised by algorithms: Traceability
* How do you assign responsibility of an algorithm?
198
Ethical challenges that stem from ethical concerns of algorithms
* **Unjustified actions**
  * Actions taken on the basis of inductive correlations have real impact on human interests, independent of their validity.
* **Opacity**
  * Lack of accessibility, lack of comprehensibility, information asymmetry
  * Even if people wanted to, they would not be able to explain how it works - algorithms can be too complex
* **Bias**
  * Embedded social bias, technical bias (constraints), emergent bias
* **Discrimination**
* **Autonomy**
  * Personalisation algorithms tread a fine line between supporting and controlling decisions by filtering which information is presented to the user, based upon in-depth understanding of preferences, behaviours, and perhaps vulnerabilities to influence
  * Deciding which information is relevant is subjective
  * Personalisation algorithms reduce the diversity of information users encounter by excluding content deemed irrelevant or contradictory to the user’s beliefs
* **Informational privacy**
  * While there are laws (GDPR) which protect the data of identifiable individuals, you can still be clustered into a group which you don't want to be identified with.
* **Moral responsibility**
  * Black box - so nobody's responsible for the algorithm.
199
Four dimensions of data
200
Course objective: Describe the main steps of the Cross-Industry Process for Data Mining (CRISP-DM).
Steps: business understanding, data understanding, data preparation, data modeling, evaluation, and deployment
201
Course objective: Distinguish among the four different Data Science methods covered in class.
202
Course objective: Using the frameworks discussed in class, recognize the ethical dilemmas in collecting and analyzing the Big Data.
Frameworks of ethical dilemmas:
* Six types of ethical concerns raised by algorithms (Mittelstadt et al. 2016)
* Transparency about 5 aspects
* Algorithmic opacity in 3 ways
* Classifier discrimination in every CRISP-DM cycle
* Different types and origins of bias
* Assertion-based framework
203
Course objective: Evaluate the ethical position of a firm in a specific data collection situation.
.
204
How do you calculate sensitivity (with confusion matrix)?
Sensitivity is calculated as the number of correct positive predictions (true positives) divided by the total number of actual positives (true positives + false negatives).
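As a quick illustration, a short Python sketch with made-up confusion-matrix counts:

```python
# Sensitivity from confusion-matrix counts; the numbers are invented.
TP = 40  # true positives: actual positives predicted as positive
FN = 10  # false negatives: actual positives predicted as negative
sensitivity = TP / (TP + FN)
print(sensitivity)  # 0.8 -> 80% of the actual positives were correctly predicted
```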
205
What are the two ways to control for complexity in co-occurrence modelling?
* **Support:** Add a rule that specifies what share of the occurrences the model should address.
  * For example, place a constraint that such rules must apply to some minimum percentage of the data; let's say we require rules to apply to at least 0.01% of all transactions.
* **Confidence:** Add a rule that determines the strength of the association.
  * For example, we might require the strength to be above some threshold, such as 5%.
206
What is *support* in co-occurrence modeling?
The **support** of an association is an indication of how frequently the items appear together in the data.
207
What is *confidence* in co-occurance modeling?
The probability that B occurs when A occurs, p(B|A), is in association mining called the *confidence* or strength of the rule.
208
What is *lift* in terms of co-occurrence modeling?
**Lift** answers the question - how much more frequently does this association occur than we would expect by chance? The lift of the co-occurrence of A and B is the probability that we actually see the two together, compared to the probability that we would see the two together if they were unrelated to (independent of) each other. As with other uses of lift we’ve seen, a lift greater than one is the factor by which seeing A “boosts” the likelihood of seeing B as well.
209
What is *leverage* in terms of co-occurrence modeling?
**Leverage** is the difference between the probability of seeing the items purchased together and the probability of seeing them purchased together if they were independent of each other.
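To tie the four measures together, here is a rough Python sketch computing support, confidence, lift, and leverage for a rule A -> B from a toy transaction list (items and numbers are made up):

```python
# Association metrics for a rule A -> B on an invented transaction list.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"chips", "salsa"},
    {"beer"},
    {"bread", "milk"},
]

def p(items, transactions):
    # Probability that a transaction contains all of the given items
    return sum(1 for t in transactions if items <= t) / len(transactions)

A, B = {"beer"}, {"chips"}
support    = p(A | B, transactions)                                # how often A and B co-occur
confidence = p(A | B, transactions) / p(A, transactions)           # p(B|A)
lift       = p(A | B, transactions) / (p(A, transactions) * p(B, transactions))
leverage   = p(A | B, transactions) - p(A, transactions) * p(B, transactions)

print(support, confidence, lift, leverage)
```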
210
Three ways a model makes an error:
* Inherent randomness
  * The prediction is not 'deterministic': there is no promise that people will behave according to our model.
* Bias
  * No matter how much data is given, the model will never achieve maximum accuracy (unless your model takes ALL factors into account).
* Variance
  * Model accuracy varies across different training sets.
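As a loose illustration of the variance component, the sketch below (using scikit-learn, with invented data) fits the same model class on several random training samples and shows that its prediction for one fixed point moves around:

```python
# Variance illustration: same model class, different training samples,
# different predictions for the same new point.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample_training_set(n=50):
    X = rng.uniform(0, 10, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=n)  # noise = inherent randomness
    return X, y

x_new = np.array([[5.0]])
predictions = []
for _ in range(5):
    X, y = sample_training_set()
    model = DecisionTreeRegressor().fit(X, y)
    predictions.append(model.predict(x_new)[0])

print(predictions)  # the spread of these values across training sets reflects variance
```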
211
How can a company sustain a competitive advantage with Data Science?
* **Formidable historical advantage**
* **Unique intellectual property**
  * Novel techniques for mining the data
* **Unique intangible collateral assets**
  * Implementation of the model
  * Company culture regarding implementing Data Science solutions (e.g. a culture of experimentation)
* **Superior data scientists**
  * You need at least one superstar data scientist to be able to evaluate the quality of prospective hires
* **Superior data science management**
  * Understand the business needs
  * Be able to communicate with techies and suits
  * Coordinate models with business constraints and costs
  * Anticipate the outcomes of data science projects
  * They need to do this within the culture of a particular firm
212
Alternative ways to attract data science human capital:
* Engage academic data scientists (pay for their PhD)
* Take on top-notch data scientists as scientific advisors
* Hire a third party to conduct the data science
213
What are difficulties in analyzing text?
* Messy
* Stopwords
* Similar words for the same thing
* Stemming
* Which words are relevant?
214
How do the batch and serving layers satisfy almost all of the desired properties?
● **Robustness and fault tolerance:** the batch and serving layers use replication under the hood to stay available when servers go down, and they are human-fault tolerant because when a mistake is made you can fix your algorithm or remove the bad data and recompute the views from scratch.
● **Scalability:** both the batch and serving layers are easily scalable. They're both fully distributed systems, and scaling them is as easy as adding new machines.
● **Generalization:** the architecture described is as general as it gets. You can compute and update arbitrary views of an arbitrary dataset.
● **Extensibility:** adding a new view is as easy as adding a new function of the master dataset. Because the master dataset can contain arbitrary data, new types of data can be easily added.
● **Ad hoc queries:** the batch layer supports ad hoc queries innately. All the data is conveniently available in one location.
● **Minimal maintenance:** the main component to maintain in this system is Hadoop. Hadoop requires some administration knowledge, but it's fairly straightforward to operate.
● **Debuggability:** in traditional databases, an output can replace the original input. In the batch and serving layers, the input is the master dataset and the output is the views. Likewise, you have the inputs and outputs for all the intermediate steps. Having the inputs and outputs gives you all the information you need to debug when something goes wrong.
215
Does a serving layer support random writes?
* No.
* This is very important, as random writes cause most of the complexity in databases. By not supporting random writes, serving-layer databases are simpler.
* That simplicity makes them robust, predictable, easy to configure, and easy to operate.
216
What is the main difference between batch and speed layer?
Speed layer only looks at the most recent data.
217
5 categories of Big Data technologies:
● **Batch computation systems:** high-throughput, high-latency systems. Batch computation systems can do nearly arbitrary computations, but they may take hours or days to do so. The only batch computation system we'll use is Hadoop.
● **Serialization frameworks:** provide tools and libraries for using objects between languages. They can serialize an object into a byte array from any language, and then deserialize that byte array into an object in any language. Serialization frameworks provide a Schema Definition Language for defining objects and their fields, and they provide mechanisms to safely version objects so that a schema can be evolved without invalidating existing objects.
● **Random-access NoSQL databases:** there are many of these databases. They sacrifice the full expressiveness of SQL and instead specialize in certain kinds of operations. They all have different semantics and are meant to be used for specific purposes. They're not meant to be used for arbitrary data warehousing.
● **Messaging/queuing systems:** provide a way to send and consume messages between processes in a fault-tolerant and asynchronous manner.
● **Realtime computation systems:** high-throughput, low-latency, stream-processing systems. They can't do the range of computations a batch-processing system can, but they process messages extremely quickly.
218
Define data normalization
**Data normalization** refers to storing data in a structured manner to minimize redundancy and promote consistency.
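A tiny illustrative sketch in Python (the table contents are invented): the denormalized form repeats customer details on every order, while the normalized form stores them once and references them by key.

```python
# Denormalized: customer details duplicated on every order (redundancy, update anomalies)
denormalized_orders = [
    {"order_id": 1, "customer": "Acme", "customer_city": "Oslo", "amount": 100},
    {"order_id": 2, "customer": "Acme", "customer_city": "Oslo", "amount": 250},
]

# Normalized: customer details stored once, orders reference them by key
customers = {"C1": {"name": "Acme", "city": "Oslo"}}
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 100},
    {"order_id": 2, "customer_id": "C1", "amount": 250},
]
```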
219
3 issues with KNN modeling:
I. **Intelligibility:** the lack of an explicit, interpretable model may pose a problem in some areas. There are two aspects: the justification of a decision and the intelligibility of an entire model. With k-NN, it is easy to describe how a single instance is decided: the set of neighbors participating in the decision can be presented, along with their contributions. What is difficult to explain more deeply is what knowledge has been mined from the data. Also, visualization is possible with two dimensions, but not with many dimensions. II. **Dimensionality and domain knowledge:** numeric attributes may have vastly different ranges, and unless they are scaled appropriately the effect of one attribute with a wide range can swamp the effect of another with a much smaller range. There is also a problem with having to many attributes, or many that are irrelevant to similarity judgement. Some problems are said to be high dimensional (including irrelevant variables) and they suffer from the curse of dimensionality. The prediction is then confused by the presence of too many irrelevant attributes. To solve this, one can conduct feature selection (determining which features should be included in the data mining model). III. **Computational efficiency:** the main computational cost of a nearest neighbor method is in the prediction/classification step, when the database must be queried to find the nearest neighbor of a new instance.