Big data Flashcards

1
Q

What are the FOUR dimensions of Big Data?

A
  • Volume: refers to the quantity of available data
  • Velocity: refers to the rate at which the data is recorded/collected
  • Veracity: refers to quality and applicability of data
  • Variety: refers to the different type of available data
2
Q

What characterizes big data/how is it defined?

A

Big Data is the Information asset characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value

Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making

3
Q

What are the drivers behind big data?

A
  • Increased data volumes being captured and stored
  • Rapid acceleration of data growth
  • Increased data volumes pushed into the network
  • Growing variation in types of data assets for analysis
  • Alternate and unsynchronized methods for facilitating data delivery
  • Rising demand for real-time integration of analytical results
4
Q

What is NoSQL?

A

“Not only SQL” –> alternate model for data management

  • Provides a variety of methods for managing information to best suit specific business process needs, such as in-memory data management, columnar layouts to speed query response, and graph databases
5
Q

What is MPP? and how is it related to Big Data?

A

Massively Parallel Processing

–> A type of computer, utilizing high-bandwidth networks and massive I/O devices

RELATION TO BD:
- Big data is smarter, since it couples clusters of hardware components with open source tools and technology

6
Q

What five aspects will a corporation considering incorporating Big Data need to consider?

A

• Feasibility: is the enterprise aligned in a way that allows new and emerging technologies to be brought into the organization?
• Reasonability: will the resource requirements exceed the capability of the existing or planned environment?
• Value: do the results warrant the investment?
• Integrability: are there any constraints or impediments within the organization from a technical, social, or political perspective?
• Sustainability: are the costs associated with maintenance, configuration, skills maintenance, and adjustments to the level of agility sustainable?

7
Q

Name the 7 types of people needed for implementing Big Data?

A

1) Business evangelist –> understands the current limitations of the existing tech infrastructure
2) Technical evangelist –> understands the emerging tech and the science behind it
3) Business analyst –> engages the business process owners and identifies quantifiable measures
4) Big Data application architect –> experienced in high-performance computing
5) Application developer –> identifies the technical resources with the right set of skills for programming
6) Program manager –> experienced in project management
7) Data scientist –> experienced in coding and statistics/AI

8
Q

What is the Big Data framework? and what key components does it consist of?

A

Overall picture of the Big Data landscape, consists of:

  • Infrastructure (e.g. SAP and SQL)
  • Analytics (e.g. Google analytics)
  • Applications (e.g. Human capital, legal, security)
  • Cross-infrastructure analytics (Google, Microsoft, Oracle)
  • Open source (e.g. RStudio, …)
  • Data sources and APIs (Application Programming Interfaces) (e.g. Garmin, Apple)
9
Q

What is API?

A

Application programming interface

is a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact. Additionally, APIs are used when programming graphical user interface (GUI) components

  • Enables software components to talk to each other
10
Q

Which is better, row- or column-oriented data?

A

Column-oriented data, since storing each column separately reduces latency

Access performance: row-oriented storage is not good for many simultaneous queries (as opposed to column-oriented)

Speed of aggregation: much faster in column-oriented data

Suitability to compression: column data is better suited for compression, decreasing storage needs

Data load speed: faster in column-oriented storage; since each column is stored separately, you can load in parallel using multiple threads

11
Q

Hardware versus software?

A

Go to slide 36 and 37 and discuss

12
Q

Name the four tools and techniques?

A

Processing capability
- Often several interconnected nodes, allowing tasks to be run simultaneously (MULTITHREADING)

Storage of data

Memory
- Holds the data in the node currently running

Network
- Communication infrastructure between the nodes

13
Q

What types of architectural clusters exist? And what are the two OVERALL types?

A

Slide 42 and 43:

OVERALL: centralized and decentralized

  • Fully connected network topology
  • Mesh network topology
  • Star network topology
  • Common bus topology
  • Ring network topology
14
Q

What does the general architecture distinguish between? and what are their roles?

A

Management of computing resources
- oversees the pool of processing nodes, assign tasks and monitors activity

Management of data/storage
- oversees the data storage pool and distributes datasets across the collection of storage resources

15
Q

What are the three important layers of Hadoop?

A
  • Hadoop Distributed File System (HDFS)
  • MapReduce
  • YARN: a new generation framework for job scheduling and cluster management
16
Q

What are the main functions of HDFS?

A
  • Attempts to enable the storage of LARGE files, by distributing the data among a pool of data nodes
  • Monitoring of communication between nodes and masters
  • Rebalancing of data from one block to another if free capacity is available
  • Managing integrity using checksums/digital signatures
  • Metadata replication to protect against corruption
  • Snapshots/copying of data to establish check-points
17
Q

What are the four advantages of using HDFS?

A

1) decreasing the cost of specialty large-scale storage systems
2) providing the ability to rely on commodity components
3) enabling the ability to deploy using cloud-based services
4) reducing system management costs

18
Q

What is MapReduce?

A
  • It is a software framework
  • Used to write applications which process vast amounts of data in parallel on large clusters
  • It is fault-tolerant
  • Combines both data and computational independence
    (both data and computations can be distributed across nodes, which enables strong parallelization)
19
Q

What are the two steps in MapReduce?

A

Map: describes the computation/analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs

Reduce: the sets of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results

Example: count the number of occurrences of each word in a corpus:

key: the word
value: the number of times the word is counted
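A minimal single-machine sketch of this word-count logic in R (illustration only, not Hadoop itself; the toy documents are made up): the map step emits one (word, 1) pair per occurrence, and the reduce step sums the values per key.

# Map: emit one (key = word, value = 1) pair per word occurrence
docs  <- c("big data is big", "data needs data tools")
words <- unlist(strsplit(tolower(docs), "\\s+"))
pairs <- data.frame(key = words, value = 1)

# Shuffle + Reduce: group the pairs by key and sum their values
reduce_out <- aggregate(value ~ key, data = pairs, FUN = sum)
reduce_out   # key = the word, value = its number of occurrences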

20
Q

What is parallelization?

A

the act of designing a computer program or system to process data in parallel. Normally, computer programs compute data serially: they solve one problem, and then the next, then the next. If a computer program or system is parallelized, it breaks a problem down into smaller pieces that can each independently be solved at the same time by discrete computing resources

21
Q

What are the four use cases for big data?

A

Counting; document indexing, filtering, aggregation

Scanning; sorting, text analysis, pattern recognition

Modeling; analysis and prediction

Storing; rapid access to stored large datasets

22
Q

What is data mining?

A

The art and science of discovering knowledge, insights and patterns in data

  • e.g. predicting the winning chances of a sports team
  • or identifying friends and foes in warfare
  • or forecasting rainfall patterns in a region

It helps us recognize the hidden value in data

23
Q

Describe the typical process of data mining?

A
  1. Understand the application domain
  2. Identify data sources and select target data
  3. Pre-process: cleaning, attribute selection
  4. Data mining to extract patterns or models
  5. Post-process: identifying interesting or useful patterns
  6. Incorporate patterns in real world tasks

OR:

Data input –> data consolidation –> data cleaning –> data transformation –> data reduction –> well-formed data

24
Q

In terms of data mining, what does ETL stand for?

A

Extract, transform, load

25
Q

What are some of the common mistakes of data-mining?

A
  1. Selecting the wrong problem for data mining
  2. Not leaving sufficient time for data acquisition, selection and preparation
  3. Looking only at aggregated results and not at individual records/predictions
  4. Being sloppy about keeping track of the data mining procedure and results
  5. Ignoring suspicious (good or bad) findings and quickly moving on
  6. Running mining algorithms repeatedly and blindly, without thinking about the next stage
  7. Naively believing everything you are told about the data
  8. Naively believing everything you are told about your own data mining analysis
26
Q

Describe the Harvard Business Review matrix

A

Two parameters;

1) Not useful –> Very useful
2) Time-consuming –> not time-consuming

Upper-left: Plan
Upper-right: learn
Lower-left: ignore
Lower-right: browse

BEWARE –> VERY bad matrix, cannot be used

27
Q

Why do we have an error term in a statistical linear model?

A

Because the model is not deterministic (perfect)

Error terms are assumed to be normally distributed!

28
Q

What method is usually used to estimate a linear relationship?

A

Ordinary Least Squares (OLS)

29
Q

How do we define the slope, i.e. B1?

A

Cov(x,y) / Var(x)

30
Q

How do we define the intercept?

A

B0 = ȳ - B1·x̄ (using the sample means of y and x)

31
Q

What is the estimated variance of error?

A

SSE / (n - 2)

SSE: sum of squared errors, a.k.a. the sum of squared residuals (SSR)

32
Q

How do we find R^2 and what is its nickname?

A

Coefficient of determination

R² = ESS / TSS, or equivalently 1 - RSS/TSS

i.e. the explained sum of squares over the total sum of squares
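A quick base-R check of the last few cards' formulas on made-up data (lm() is used only to confirm the hand computation):

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)              # true intercept 2, true slope 3

b1 <- cov(x, y) / var(x)                 # slope: Cov(x,y) / Var(x)
b0 <- mean(y) - b1 * mean(x)             # intercept: ybar - b1 * xbar

res <- y - (b0 + b1 * x)
sse <- sum(res^2)
s2  <- sse / (length(y) - 2)             # estimated error variance: SSE / (n - 2)
r2  <- 1 - sse / sum((y - mean(y))^2)    # R^2 = 1 - RSS/TSS

c(b0, b1); coef(lm(y ~ x))               # hand-computed vs lm()
c(r2, summary(lm(y ~ x))$r.squared)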

33
Q

What are the four principal assumptions that justify the use of a linear model?

A

1) Linearity and additivity
- The slope of the line does not depend on the values of the other variables
- The effects of the independent variables are additive in estimating the dependent variable

2) Statistical independence of the errors
3) Homoscedasticity: constant variance of errors
4) Normality of the error distribution

34
Q

What happens when linearity and additivity are violated in the OLS?

A
  • NOT good
  • Use another model that is not linear:
    either a logarithmic model (if all data points are positive) or a polynomial model
35
Q

What happens when independence of the error terms is violated in the OLS?

A

Diagnosis: look at a residual time series plot
- Often a concern in panel data/longitudinal data

  • For minor cases of positive serial correlation: add the lagged independent variable as a predictor
  • For minor cases of negative serial correlation: solve by differencing
  • If serial correlation is large –> RESTRUCTURE THE MODEL, perhaps standardize all variables
36
Q

What happens when homoscedasticity is violated using the OLS?

A

Diagnosis; plot the residuals versus the predicted values
- You get imprecise predictions and confidence intervals

  • If errors are increasing over time, CIs for out-of-sample predictions are unrealistically narrow

SOLUTION: robust standard errors or transformation of model

37
Q

What happens when normality is violated using the OLS?

A

Diagnosis: a normal probability (Q-Q) plot or a Shapiro-Wilk test
- Violation makes it very hard to determine whether a coefficient is significantly different from zero and to provide CIs

BUT: if your goal is only to estimate the values of the coefficients, this is NOT a problem; it only matters when you have to make predictions

38
Q

What model can we use, in the case the outcome is not normal (Gaussian)?

A

Generalized Linear Model

39
Q

What is a Generalized Linear Model (GLM)?

A

Characterized by THREE components:

1) Random: associated with the dependent variable and its probability distribution
2) systematic: identifies the selected covariates through a linear predictor
3) link function: identifies the function of E[Y] such that it is equal to the systematic component

40
Q

Normally, will a binary response be derived from a linear relation?

A

No; it will typically be non-linear –> could use logistic regression

41
Q

What shape does the logistic regression have?

A

S-shaped (sigmoid)

Because the predicted probability approaches, but almost never reaches, 0 or 1
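A small sketch of fitting that S-curve with glm() on simulated data (the variable names are illustrative only):

set.seed(42)
x <- seq(-4, 4, length.out = 200)
p <- 1 / (1 + exp(-(0.5 + 2 * x)))            # true S-shaped probability
y <- rbinom(length(x), size = 1, prob = p)    # binary outcome

fit <- glm(y ~ x, family = binomial)          # logistic regression (a GLM with logit link)
plot(x, fitted(fit), type = "l",
     ylab = "Predicted probability")          # fitted values trace the S-curve, staying inside (0, 1)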

42
Q

What is panel data and what is the associated model?

A

Panel (aka longitudinal) data: repeated measures of the same subjects –> Panel Regression

  • AVOID GLM and LM, since the independence assumption does not hold anymore
43
Q

What are the pros of a Panel regression?

A
  • account for sample heterogeneity and compute individual-specific estimates
  • suitable to study the dynamics
  • minimise bias due to aggregation (like averaging over time)
  • enable controlling for unobserved variables like cultural factors or differences in business practices across companies (i.e. national policies, federal regulations, international agreements, etc.)
  • enable more complex hierarchical models (Generalized Additive Models, GAM)
44
Q

What are the cons of a Panel regression?

A
  • data collection (i.e. sampling design, coverage)
  • unwanted correlation (i.e. same country but different measures)
  • analyses are much more complex
45
Q

What are the two possible approaches to a panel regression, and how do they work?

A

Fixed effects (FE) model –> explores the relationship between predictor and outcome variables within a subject

Random effects (RE) model –> Assumes that the variation across subjects is random and uncorrelated with the predictors

46
Q

What are the two assumptions of the FE model?

A

1) The subject's error term may be correlated with the predictors –> FE removes the effect of time-invariant characteristics, so we can estimate the NET effect of the predictors on the outcome
2) The time-invariant characteristics are individual-specific, so they are NOT correlated with those of other individuals

CAUTION: lots of dummy variables and an increased risk of multicollinearity

47
Q

When do we use the RE model and what are the assumptions behind it?

A

Used when differences across subjects have some influence on the dependent variable

Assumption: the subject's error term is uncorrelated with the predictors –> therefore, you can use time-invariant variables as EXPLANATORY variables

CAUTION: RE requires REALLY strong assumptions, and FE is often more convincing. The choice between RE and FE can be tested with the Hausman test
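A sketch of both approaches, assuming the plm package is installed; panel_df, firm, year and the x variables are placeholders, not names from the course:

library(plm)

fe <- plm(y ~ x1 + x2, data = panel_df,
          index = c("firm", "year"), model = "within")   # fixed effects
re <- plm(y ~ x1 + x2, data = panel_df,
          index = c("firm", "year"), model = "random")   # random effects

phtest(fe, re)   # Hausman test: a small p-value favours the FE model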

48
Q

What are the between and within-entity errors in the RE model?

A

Slide 102:

Between-entity error = u_it
Within-entity error = e_it
49
Q

What are the two overall types of classifications for clustering?

A

1) Clustering (Unsupervised learning)
- No prior classification, algorithm explores all possible combinations

2) Supervised learning
- We do have prior knowledge on the data; we know the labels, or categories etc.
- Training a computer to learn a new system we created
- Based on data, therefore, the more data, the better the system

50
Q

What are the three partitions of unsupervised learning in clustering, and what are the sub-categories?

A

Hard clustering

1) Hierarchical methods (agglomerative and divisive)
2) Partitioning methods (non-hierarchical)

Soft clustering
3) Fuzzy methods

More complex methods

4) Density-based methods
5) Model-based methods

51
Q

In regards to clustering, explain the concept of (statistical) distance

A
  • Needed for every classification method
  • Distance defines the DISsimilarity between two points
  • numerous distance measures exist –> think of the dissimilarity matrix (slide 118)
52
Q

Name two examples of measures of statistical distance

A

1) Pearson correlation distance

2) Eisen cosine correlation distance

53
Q

How do hierarchical clustering methods work? Which two types exist? How are they visualized?

A

1) Agglomerative: each observation starts as its own cluster. Iteratively, the most similar clusters (leaves) are merged until only one single cluster remains (the root)
2) Divisive: the INVERSE of the agglomerative approach. It begins with the root, and the most heterogeneous clusters are successively divided until each observation forms its own cluster

Thus, there is no need to define the number of cluster groups beforehand

Visualization; Dendrogram
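A minimal agglomerative example in base R on the built-in USArrests data:

d  <- dist(scale(USArrests))            # dissimilarity matrix (Euclidean, on scaled data)
hc <- hclust(d, method = "complete")    # agglomerative hierarchical clustering
plot(hc)                                # dendrogram
groups <- cutree(hc, k = 4)             # the number of clusters is chosen afterwards
table(groups)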

54
Q

What are the pros and cons of the hierarchical method of clustering?

A

Pros:
• No a-priori information about the number of clusters required
• Easy to implement
• Very replicable

Cons:
• Not very efficient: O(n² log n)
• Based on a dissimilarity matrix which has to be chosen in advance
• No objective function is directly minimised
• The dendrogram is not the best tool to choose the optimum number of clusters
• Hard to treat non-convex shapes

55
Q

How does the partitioning method of clustering work?

A
  • Simplest method
  • Requires predefined number of clusters
  • iterative methods, geometry based

3 most famous types:

1) K-means: each cluster is represented by the center of the cluster
2) K-medoids or PAM: each cluster is represented by one of the points in the cluster
3) CLARA (Clustering LARge Applications): Suitable for large datasets
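A minimal k-means example in base R on the built-in USArrests data (the choice of k here is arbitrary, for illustration):

set.seed(123)
X  <- scale(USArrests)                      # numeric data only; scaling avoids unit dominance
km <- kmeans(X, centers = 4, nstart = 25)   # k must be specified in advance
km$centers                                  # each cluster is represented by its centre
table(km$cluster)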

56
Q

What are the pros and cons of the partitioning method of clustering?

A

Pros:
• k-means is relatively efficient: O(tkn), with k, t ≪ n
• easy to implement and understand
• totally replicable

Cons:
• PAM does not scale well for large datasets
• Applicable only when mean is defined (i.e. no categorical data)
• Need to specify k in advance
• k-means is unable to handle noisy data and outliers. PAM does better
• Not suitable to discover clusters with non-convex shapes

57
Q

How do we check whether data can actually be clustered (has a clustering tendency) or whether we are just synthesizing clusters?

A
  • Use the Hopkins statistic; it measures the probability that your dataset is uniformly distributed
  • I.e. it tests the spatial randomness of the data

Mechanics:

  • Based on each observation's distance to its nearest neighbor
  • This is compared with the distance from uniformly random points to their nearest real neighbors

Formula:
Average dist. from random points to their nearest real neighbor / (that same average + average dist. from real points to their nearest neighbor)
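A base-R sketch of this statistic (the function name is ours; packages such as factoextra ship a ready-made version):

hopkins_stat <- function(X, m = nrow(X) %/% 10) {
  D <- as.matrix(dist(X)); diag(D) <- Inf
  idx <- sample(nrow(X), m)
  w <- apply(D[idx, , drop = FALSE], 1, min)     # real points -> nearest real neighbour
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  u <- apply(U, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))  # random points -> nearest real point
  sum(u) / (sum(u) + sum(w))   # about 0.5 for uniform data, close to 1 for clustered data
}
hopkins_stat(scale(USArrests))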

58
Q

Three ways to find the optimal number of clusters?

A

Plenty of methods exist, however we focus only on 3:

1) Silhouette
2) Gap statistic
3) Within(-cluster) sum of squares

59
Q

Name the four overall methods of supervised learning (classification)

A

1) Heuristic approach
- k-NN (nearest neighbor)

2) Model-based approach
- Linear discriminant analysis
- Quadratic discriminant analysis
- Logistic
- Naïve Bayes

3) Binary decision
- Classification and regression trees (CART)
- Random forest

4) Optimisation based
- Support Vector Machine (SVM)
- Neural Networks

60
Q

What is Naïve Bayes, and how does it work?

A
  • A probabilistic machine learning algorithm

- Based on two components: conditional probability and Bayes' rule

61
Q

What is Bayes rule?

A

P(Y | X) = ( P(X | Y) * P(Y) ) / P(X)
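A tiny numeric illustration in R (the numbers are invented): the probability that an e-mail is spam given that the word "offer" appears.

p_spam      <- 0.20   # P(Y): prior probability of spam
p_word_spam <- 0.60   # P(X | Y): "offer" appears given spam
p_word      <- 0.25   # P(X): "offer" appears overall

p_spam_word <- p_word_spam * p_spam / p_word   # P(Y | X) by Bayes rule
p_spam_word                                    # 0.48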

62
Q

What clustering method/classifier is the best?

A

4 criteria:

1) Accuracy: the empirical accuracy computed on the same dataset as the one used to learn the classifier
2) AccuracyCV: the mean of the accuracies obtained by the resampling scheme
3) AccuracyInf: a lower bound on the mean accuracy, obtained as follows: (slide 167)
4) AccuracyPAC: a highly probable bound on the accuracy, obtained simply by subtracting the standard deviation

63
Q

What are the two fundamental concepts of model evaluation?

A

Generalization –> the property that a model can be applied to unused data (used for predictions) –> the model generalizes beyond the training data

Overfitting –> When a model is tailored to the training data at the expense of generalization to test data
- Always present to some extent, thus you have a trade-off between model complexity and overfitting

64
Q

How does one go about assessing overfitting?

A
  • Assessment based on holdout data: comparing the predicted values of the model with hidden true values (holdout data)
  • See the graph on slide 184 –> you should not expect perfect overlap, but rather some gap between performance on your training set and on your holdout set
  • Training error is a decreasing function of complexity

CAUTION: this is ONLY a way to get a feeling for the generalization, since it provides only a single estimate

65
Q

What are the drivers behind overfitting?

A

TWO:

1) too many features/independent variables
2) too complex functions (^2, ^3, ^4) (cut these out)

66
Q

What model is more prone to suffer from overfitting?

A

Logistic model, especially compared to the Support Vector Machine (SVM) model.

Flower example: the logistic model is very sensitive to outliers, which makes the model bad/overfitted

67
Q

What method is superior to holdout data, when evaluating model performance?

A

Cross-validation (CV)

68
Q

What is cross-validation and how does it work?

A
  • CV performs multiple splits and systematically swaps out samples for testing
  • Number of partitions is k –> called “folds” (around 5 folds normally)
  • CV then iterates over training and test sets k times; e.g. the first iteration uses folds 1-4 for training and fold 5 as holdout, the second uses folds 2-5 for training and fold 1 as holdout
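A base-R sketch of 5-fold CV for a linear model on made-up data:

set.seed(1)
n  <- 100
df <- data.frame(x = rnorm(n)); df$y <- 1 + 2 * df$x + rnorm(n)

k     <- 5
folds <- sample(rep(1:k, length.out = n))   # assign every row to one of the k folds
errs  <- numeric(k)

for (i in 1:k) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(y ~ x, data = train)          # train on the k - 1 remaining folds
  errs[i] <- mean((test$y - predict(fit, newdata = test))^2)  # test on the held-out fold
}
mean(errs)   # cross-validated mean squared error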
69
Q

What is a general method to avoid overfitting?

A

Ensure that the data is STRICTLY independent of model building

  • This allows for independent estimates of model accuracy and directly compare multiple models
70
Q

What is a general strategy of parameter optimization?

A

Regularization (or penalization)

- Don’t optimize the fit to the data, optimize the combination of FIT and SIMPLICITY

71
Q

In regards to model evaluation of optimizing parameters, how does regularization work?

A

2 Steps:
1) Find a set of parameters that maximizes some objective function, which indicates how well the model fits the data

2) Incorporate a PENALTY function, that assesses the importance of adding another parameter to the model

Famous models:
• Ridge regression: L2-norm –> sum of squares of the weights
• LASSO regression: L1-norm –> sum of absolute values (a sort of automatic feature selection)
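A sketch of both penalties, assuming the glmnet package is installed (x and y are made-up data):

library(glmnet)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- x[, 1] - 2 * x[, 2] + rnorm(100)

ridge <- cv.glmnet(x, y, alpha = 0)   # L2 penalty: shrinks all weights, keeps them non-zero
lasso <- cv.glmnet(x, y, alpha = 1)   # L1 penalty: drives some weights to exactly zero
coef(lasso, s = "lambda.min")         # the zeroed coefficients act as automatic feature selection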

72
Q

In regards to model evaluation and model fitting, what is a learning curve?

A
  • Generalization performance typically improves as the size of the training data increases; the visualization of generalization performance plotted against the amount of training data is referred to as a learning curve

IMPORTANT: (slide 197)
• The learning curve shows the performance on test data only, plotted against the size of the training data
• A fitting graph shows the generalization performance and the performance on the training data, plotted against model complexity. Here the size of the training data is fixed

73
Q

How is accuracy measured given a classifier for model evaluation?

A

Accuracy = 1 - error rate = # correct decisions / # total decisions

Problem: we can have different kinds of correct and incorrect decisions –> confusion matrix

74
Q

How does a confusion matrix work?

A

If positive versus negative –> 2 x 2 matrix, with observed classes in the columns (O(1) and O(0)) and predicted classes in the rows (P(1) and P(0))

  • The main diagonal counts the correct predictors.

However:

1) When we have O(1) and P(0), we have False negatives
2) when we have O(0) and P(1) we have False positives

75
Q

Briefly explain how a confusion matrix is important in unbalanced scenarios

A

When the general (class) distribution is skewed, plain accuracy becomes misleading

  • That is, the two FALSE cases can end up being very unevenly distributed
  • The anti-diagonal (the number of false decisions) is not uniformly distributed across the matrix
  • In this case you want to count the two kinds of false predictions separately, since they have different costs
76
Q

How can we fix the problem of unbalanced scenarios in a confusion matrix?

A

We can use expected value to assign different probabilities to each of the false scenarios, based on previous data –> that is, we build a machine learning model that helps our model understand when it has made a mistake/false prediction

  • Expected value can be used for two purposes:
    1) Framing the usage of the classifier
    2) Framing the evaluation of the classifier
77
Q

What are some of the evaluation metrics expected value can use to frame the evaluation of a classifier?

A

EXAMPLE; targeted marketing for upselling

• A false positive (FP) occurs when we classify a consumer as a likely responder and therefore target her, but she does not respond.

• A false negative (FN) is a consumer who was predicted not to be a likely responder (so was not offered the product), but would have bought it if offered.

  • A true positive (TP) is a consumer who is offered the product and buys it.
  • A true negative (TN) is a consumer who was not offered a deal and who would not have bought it even if it had been offered.
  • most of the metrics used to evaluate a model are summaries of the confusion matrix
78
Q

What are specificity and sensitivity in terms of a binary classification test?

A

Sensitivity = True Positive Rate (the proportion of positive responses that are correctly identified)
TP / (TP + FN)

Specificity = True Negative Rate (the proportion of negative responses that are correctly identified)
TN / (TN + FP)
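A base-R illustration with invented predictions:

observed  <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 1)
predicted <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 0)

cm <- table(Predicted = predicted, Observed = observed)   # 2 x 2 confusion matrix
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

c(accuracy    = (TP + TN) / sum(cm),
  sensitivity = TP / (TP + FN),    # true positive rate
  specificity = TN / (TN + FP))    # true negative rate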

79
Q

What is needed in R to make a plot?

A

1) data.frame

2) OR a rectangular representation of the data

80
Q

What is a coordinate system?

A

Definition: a set of position scales and their relative geometric arrangement

Typically two axes crossing each other, enabling a graphical representation of data points.

NOTE: the axes do not have to be straight lines intersecting at right angles; one could be a circle and the other run radially…

81
Q

What is the most widely used coordinate system?

A

The 2d Cartesian coordinate system

82
Q

What happens in a Cartesian coordinate system, when you change the units of your data?

A

As long as the transformation is linear, the Cartesian coordinate system is INVARIANT –> graphical output will be the same, even if figures/placing of gridlines are not identical

Example: Fahrenheit to Celsius (slide 236)

83
Q

What are examples of non-linear coordinates, and when are they used?

A

When the spacing between data values is not linear and you need to adjust the scale

Examples: (slide 237)

  • Logarithmic (log) transformation (the most common)
84
Q

When is the log scale coordinate system preferred?

A

When multiplication/division is involved, since these become addition/subtraction on a log scale

85
Q

What is a polar coordinate system, and when is it useful?

A

One axis specifies the angle around the origin and the other the radial distance from the origin (picture a crosshair with concentric circles around it, each circle corresponding to a different value)

Useful for data of a periodic nature –> e.g. temperature in a region varying by month –> shows the circular relationship, since the starting value joins the ending value

Slide 239-240

86
Q

What is a geospatial coordinate system?

A

Based on maps

REMEMBER: the Earth is a sphere, so do not use Cartesian coordinates; use a map projection instead:
- e.g. Lambert equal-area or Transverse Mercator

87
Q

What plots are useful representing amounts?

A

Bar plots

88
Q

What plots are useful representing distributions?

A

Box plots, violin plots, stacked histograms, overlapping densities

89
Q

What plots are useful representing proportions?

A

Mosaic plots, multiple pie charts, stacked bars, parallel sets

90
Q

What plots are useful representing relationships?

A

Line-graphs, connected scatter plot, smooth line graphs, correlogram

91
Q

What plots are useful representing uncertainty?

A

Error bars, confidence bands on regressions

92
Q

What is the difference between structured and unstructured data?

A

Structured data is comprised of clearly defined data types whose pattern makes them easily searchable.

Unstructured data is everything else: data that is usually not as easily searchable, including formats like audio, video, social media postings and text.

93
Q

What is natural language processing?

A

Definition: the application of computational techniques to the analysis and synthesis of natural language and speech, by representing texts mathematically

–> Can go from word counts to a neural network

94
Q

What is the source called in NLP?

A

The corpus

95
Q

What is the difference between computational linguistics and NLP?

A

Computational Linguistics is a more theoretical field that develops computational methods to answer the scientific questions from the point of view of linguists

Natural Language Processing is dedicated to give solutions to engineering problems related to natural language, focusing on the people

96
Q

What is the typical workflow in an NLP setting?

A

Collection of docs –> pre-processing –> exploratory research –> representation of relevant features in a usable vector space –> apply it to a model

97
Q

What are the 7 problems in NLP?

A

1) Ambiguity –> words can have several meanings depending on context
2) Synonymy –> we can express the same idea with different terms ("fine" as an example)
3) Syntax –> language structure, based on rules; you can reorder sentences
4) Coreference –> "the "firm" is (…), "it" is therefore"
5) Normalization versus information –> CONSULTing, CONSULTant; we lose some information by normalizing
6) Representation –> word embedding (man –> woman, king –> queen), the transformation of words into numeric vectors
7) Style –> e.g. sarcasm, the way of saying things

98
Q

In terms of textual ambiguity what does signifier and signified mean?

A

Signifier = the way we represent the information –> with a word

Signified = mental concept, the meaning of that information

Ambiguity then happens because of the difference between the two

99
Q

What is RegEx?

A

Regular expressions: a pattern-matching syntax used to clean textual data by finding and removing/replacing the patterns we do not want in there.

Start by removing the most evident stuff you don't want (replacing it with a space, for example), and continue this process until you have a human-readable document.
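A small sketch of such a cleaning loop in R using gsub() (the raw string and the patterns are examples only):

raw <- "<p>Revenue grew   12%!!  Visit https://example.com &amp; see more.</p>"

txt <- gsub("<[^>]+>", " ", raw)           # strip HTML tags
txt <- gsub("http\\S+", " ", txt)          # strip URLs
txt <- gsub("&[a-z]+;", " ", txt)          # strip HTML entities such as &amp;
txt <- gsub("[^a-zA-Z0-9% ]", " ", txt)    # drop remaining unwanted characters
txt <- gsub("\\s+", " ", trimws(txt))      # collapse repeated whitespace
txt                                        # "Revenue grew 12% Visit see more"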

100
Q

In the bag of words NLP technique, how does the corpus work?

A

Consists of N documents (D), each of which consists of a number of words (w)

101
Q

What are some of the limitations of the bag of words NLP technique?

A

Does not consider

  • grammar
  • word order
  • sentence structure
  • punctuation
  • any relationship between words
102
Q

What is inverse document frequency? and what are the two considerations you need to take into account, when using it?

A

A measure of whether a given term is common or rare in a corpus.

TWO considerations:

  • Should not be too rare –> not meaningful for a cluster
  • Should not be too common –> if too common, it doesn’t distinguish anything
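A base-R sketch of document frequency and IDF over a toy, already-tokenised corpus:

docs <- list(c("stock", "price", "up"),
             c("stock", "falls"),
             c("rain", "expected"))

N     <- length(docs)
terms <- unique(unlist(docs))
df    <- sapply(terms, function(t) sum(sapply(docs, function(d) t %in% d)))
idf   <- log(N / df)    # common terms get a low IDF, rare terms a high IDF
round(idf, 2)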
103
Q

In relation to NLP, what is keyness?

A

A measure associated with features that occur differentially across different categories

  • It gives the distinguishing features of the corpora

Example with cat and dog being “keyness” words:
• Corpus A: cat 52; dog 17; cow 31
• Corpus B: cat 9; dog 40; cow 31

104
Q

In NLP, what is lexical dispersion?

A

Visualization of the occurrences of a term in context
We want an informative measure which communicates where the term has been used in the text

(Presidents example: who said "future", and when in their speeches) => x-ray plot

105
Q

Two methods of processing the text in NLP?

A

1) Lemmatization: Better and good are in the same lemma –> a process by which inflected forms of words are grouped together to be analyzed as a single aspect
2) Stemming: similar, but instead you break each word down to its stem/base root form

106
Q

What do we use Syntactic Parsing for in NLP?

A
  • involves the analysis of words in the sentence for grammar and their arrangement in a manner that shows the relationships among the words

Important attributes of text syntactics:

  • Dependency grammar
  • Part of speech tags
107
Q

What are the three ways we can achieve syntactic parsing?

A
  • Part-of-Speech tagging (POS) (noun, verb, adjective)
  • Dependency parsing (analyzes the grammar of a sentence, linking head words with the words that modify them)
  • Named Entity Recognition (NER) (identifies persons and names/organizations)
108
Q

What is meant by tone/sentiment and how can we measure it?

A

Positive versus negative tone –> used on KNOWN data categories

Two ways to model it:

  • Lexicon-based (pure word counting)
  • Deep Learning (more sophisticated)
109
Q

How does the lexicon approach work in terms of computing the narrative tone of a text?

A
  • One relies on hash tables, containing categories of words, like positive and negative
  • Then by matching these lexica with the corpus, you get an indication of negative versus positive tone
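A minimal sketch of that counting logic (the word lists are tiny stand-ins for a real lexicon):

positive <- c("good", "great", "strong", "gain")
negative <- c("bad", "weak", "loss", "decline")

tokens <- c("strong", "growth", "but", "a", "loss", "in", "q4", "was", "bad")

pos  <- sum(tokens %in% positive)
neg  <- sum(tokens %in% negative)
tone <- (pos - neg) / length(tokens)   # simple net-tone score for the document
tone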
110
Q

What is topic modeling?

A
  • Detection of hidden patterns or topics in the corpus

Three models:

  • LSA
  • Latent Dirichlet Allocation (LDA)
  • CTM etc.
111
Q

How does an LDA model work?

A
  • generative statistical model which allows us to explain observations through unobserved characteristics
  • The intuition is that each document is viewed as a mixture of a given set of topics
  • Each topic is then viewed as a mixture of a set of words
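A sketch assuming the topicmodels package and a document-term matrix dtm built earlier in the workflow (e.g. from the bag-of-words step); the number of topics here is an arbitrary example:

library(topicmodels)

lda <- LDA(dtm, k = 4, control = list(seed = 123))   # 4 topics, chosen in advance
terms(lda, 10)    # top 10 words per topic (each topic is a mixture of words)
topics(lda)       # most likely topic per document (each document is a mixture of topics)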
112
Q

In the LDA, how does one establish how many topics are optimal?

A

A NOT very rigorous way: PERPLEXITY: detecting the optimal model by computing its incremental power (look at the graphs on slide 314)

A MORE rigorous way: defining a chi-squared test

113
Q

How is AI, ML and DL related?

A

AI is the outermost circle; inside AI we have ML, and inside ML we have DL.

114
Q

Definition of AI

A

“The effort to automate intellectual tasks normally performed by humans”

115
Q

What is symbolic AI?

A

Reach human-level AI by coding a sufficiently large set of explicit rules for manipulating knowledge

116
Q

How can we think of the change of the paradigm from symbolic AI to Machine Learning (ML)?

A

Classical programming:

  • INPUTS: Rules + data
  • OUTPUT: Answer

Machine learning:

  • INPUTS: Data + Answers (from classical)
  • Output: rules (that feed in again)
117
Q

How do we characterize an ML system and how is it different from statistics?

A
  • It is trained rather than programmed (it relies on a large number of examples, from which it derives rules)
  • Unlike statistics, ML is able to manage large and complex datasets for which classical statistical approaches, such as Bayesian analysis, would be impractical
  • As a result, ML, and especially deep learning, exhibits comparatively little mathematical theory and is engineering oriented
  • It’s a hands-on discipline in which ideas are proven empirically more often than theoretically
118
Q

What three things are needed to do ML? And how does this help defining ML?

A

1) Input data points: e.g. images
2) Examples of the expected output: e.g. images of cats and dogs
3) A way to measure the performance of the algorithm: LEARNING through FEEDBACK

ML searches for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal

119
Q

What is the overarching central problem in ML?

A

To meaningfully transform data, to learn useful representations of the input data

120
Q

What is deep learning, and what does "deep" refer to? How does this relate to a Neural Network?

A
  • Subfield of ML, where the learning process is taken even further to successive layers of increasingly meaningful representations

DEEP/DEPTH = # of layers that contribute to a model

In DL the layered representation is almost always learned through models called NEURAL NETWORKS

121
Q

What is the definition of a Neural Network?

A

“…a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.”

122
Q

Explain what is meant by a neuron in a neural network, and how each individual neuron works

A

Neuron is basically a node, which can be inside either a hidden layer or an output layer

Works:
Input arrives along the axon of a previous neuron –> synapse (weight) –> carried by the dendrite (w*x) to –> the cell body, which sums all inputs and adds a bias –> the result is passed through the activation function –> and sent out along the output axon

123
Q

What is a layer and what is it constituted of?

A

Layers can be either input or hidden or output –> the more layers the deeper the learning

Layers are made out of interconnected nodes/neurons

124
Q

How do we determine how many layers there are inside a NNET?

A

Input layer (DOES NOT COUNT)

Hidden layers (just count amount of hidden layers)

Output layer (just one?)

125
Q

How do we find # of neurons, # of biases, # of weights and # of parameters in a NNET?

A

Neurons; count the number of neurons present in each hidden layer and output layer

Biases; ONE bias per neuron

Weights; depends on how many inputs each neuron receives: # of inputs * # of neurons, summed layer by layer = # of weights

Parameters; biases + weights
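A tiny illustration of these counting rules for a dense network; the layer sizes are just an example:

sizes <- c(3, 4, 2, 1)   # 3 inputs, hidden layers of 4 and 2 neurons, 1 output (input layer not counted)

neurons <- sum(sizes[-1])                           # hidden + output neurons -> 7
biases  <- neurons                                  # one bias per neuron -> 7
weights <- sum(sizes[-length(sizes)] * sizes[-1])   # inputs * neurons, per layer -> 3*4 + 4*2 + 2*1 = 22
params  <- weights + biases                         # -> 29
c(neurons = neurons, biases = biases, weights = weights, parameters = params)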

126
Q

How does DL and NNET work in terms of image recognition?

A

The more layers, the more filtered, the more different each layer becomes from the original photo, but also more informative about the output

DL is considered a multistage way to learn data representations

127
Q

How does parametrization and evaluation work in a NNET?

A

An NNET is parametrized by its weights –> the goal is then to find the correct value of these

128
Q

What is the loss function in a NNET?

A

A function that controls the output of a neural network by measuring how far the output is from what we expect

Function that compares predictions with true targets

129
Q

What is backpropagation in NNET?

A

The loss score obtained from the loss function is fed into an OPTIMIZER, which implements the feedback mechanism called backpropagation

  • This makes adjustments by updating the weights, i.e. “training” the network
130
Q

What is considered the gears of a NNET? and how do they work?

A

Tensors or tensor operations –> generalizations of the vectors and matrices we used to play with in linear algebra

  • They are just geometric transformations of input data
131
Q

What is the overall purpose of the Activation Function in a NNET?

A

To check the output variable Y of the neuron, and decide whether other connections should consider this neuron as activated or not

132
Q

What 5 Activation Functions have we been discussing?

A

1) Step function (1 or 0, based on a threshold)
2) Linear function (a continuous line, providing a range of activations)
3) Sigmoid function (a smooth, step-function-like curve; output between 0 and 1)
4) Tanh function (sister to the Sigmoid, range from -1 to 1)
5) ReLU (Rectified Linear Unit): a non-linear function that looks like a linear one –> often the preferred choice
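The five functions written out in R for comparison (tanh() is built in; the others are one-liners):

step    <- function(x, threshold = 0) ifelse(x >= threshold, 1, 0)
linear  <- function(x, a = 1) a * x
sigmoid <- function(x) 1 / (1 + exp(-x))   # smooth, output in (0, 1)
relu    <- function(x) pmax(0, x)          # 0 for negative x, identity otherwise

curve(sigmoid, from = -5, to = 5)          # visualise the S-shape
curve(relu,    from = -5, to = 5)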

133
Q

NNET: Pros and cons of using a STEP activation function?

A
  • Easy
  • Binary activation (either 0 or 1)
  • Does NOT support multiple neurons
134
Q

NNET: Pros and cons of using a LINEAR activation function?

A
  • Support multiple neurons
  • Gradient is constant (no change in variations because of linearity)
  • If NN is with many layers, the activation function in the last layer is just a linear combination of the first one (low complexity)
135
Q

NNET: Pros and cons of using a SIGMOID activation function?

A
  • Non-linear combinations
  • Support for multiple activations
  • Output range is from 0 to 1 (like the logistic, therefore GOOD for probability)
  • offers VERY SMALL GRADIENTS at boundaries
136
Q

NNET: What is GRADIENTS within the activation function?

A

Gradients measure changes in variation –> the smoother the curve, the lower the gradients

137
Q

NNET: Pros and cons of using a TANH activation function?

A
  • Output range can be -1 to 1
  • Stronger gradients than the Sigmoid
  • Small gradients at boundaries though
138
Q

NNET: Pros and cons of using a ReLU activation function?

A
  • Only gives an output if x is positive, and 0 otherwise
  • A bit like a stepwise linear function, but should be thought of as a combination of a linear and a non-linear function
  • NO upper boundary, as compared to other logistic-like models
  • Enables sparse activations (i.e. some neurons do not activate)
  • The gradient for x < 0 does not exist (is zero), so there is no response from those neurons
  • It IS LESS COMPUTATIONALLY EXPENSIVE than Tanh and Sigmoid, which makes it very much PREFERRED
139
Q

How does one choose an activation function?

A

No predefined rules:

  • If you know the function you are trying to replicate, use an activation function similar to that one
  • Sigmoids are good for classifiers
  • When there is NO PRIOR KNOWLEDGE –> ReLU is a good choice
140
Q

Why do we need time discretization in NNETs?

A

t-1, t, t+1

Fundamental since it significantly simplifies the mathematical implementation of NNETs

141
Q

Describe the role of the bias in Neural Networks

A
  • The bias node is ALWAYS on
  • Can be thought of as the INTERCEPT in a regression model
  • If a NNET does not have a bias node in a given layer, it will not be able to produce an output in the next layer that differs from 0 in a linear scale (or any transformation of 0 when passed to the a.f.)
142
Q

When does a NNET consist of “dense layers”?

A

When ALL layers and nodes are connected

143
Q

Name the THREE types of NNETs

A

1) Feedforward Neural Networks –> information moves ONLY in one direction (no cycles)
Can be either
- SINGLE-Layer perceptron
- Multi-Layer Perceptron (MLP)

2) Convolutional Neural Networks
3) Recurrent Neural Networks –> has internal memory and cycling ability

144
Q

NNET; how does the single-layer perceptron Feedforward network work?

A

ONE output layer that receives inputs and provides an output (input layer does not count)

145
Q

NNET; How does the multiple-layer perceptron Feedforward network work?

A
  • Consists of multiple layers of computational units
  • Each neuron is then connected to the others in a feedforward manner
  • Sigmoid is typically used as a.f.
  • They can learn non-linear representations
146
Q

NNET; How does the Convolutional Neural Network (CNN) work?

A
  • Evolution of the MLP
  • Only a limited region (the receptive field) responds to a given stimulus and then activates
  • Requires minimum preprocessing
  • Very widely used in image recognition
147
Q

What are the three types of layers used in a Convolutional Neural Network (CNN)?

A

1) Convolutional layer –> takes two signals and produces a third, extracting features of an image e.g.

2) Pooling layers –> placed in between convolutional layers, to summarize.
- This layer reduces computational costs by reducing number of weights (parameters) up to 75%
- Also controls for overfitting and enables generalization

3) Dropout layers –> chop off irrelevant connections to avoid:
- Unnecessary computations and overfitting
- Done by randomly setting node activations to 0, but only during the learning stage on the training set
- They are set to 0 because the network's goal is to be redundant, so that it can provide the right classification even without all nodes activated

148
Q

What is a Recurrent Neural Network?

A
  • Neural Network with feedback loops
  • Involves backpropagation
  • Applied to sequential tasks like handwriting and speech recognition
149
Q

What is web scraping?

A

Web scraping is a technique for converting the data present in unstructured format (HTML tags) over the web to the structured format which can easily be accessed and used. This means that you are going to build a data.frame or a corpus at the end of the day
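A minimal sketch assuming the rvest package; the URL and the CSS selector are placeholders, not a real site:

library(rvest)

page      <- read_html("https://example.com/articles")
nodes     <- html_elements(page, "h2.title")        # pick the elements of interest via a CSS selector
headlines <- html_text(nodes, trim = TRUE)          # extract their text

data.frame(headline = headlines)                    # the end goal: a structured data.frame / corpus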

150
Q

What is the vanishing gradient problem? And in which activation functions is it present?

A
  • The problem of vanishing gradients arises due to the nature of the backpropagation optimization
  • Gradients tend to get smaller and smaller as we keep on moving backward
  • Implies that neurons in earlier layers learn very slowly compared to neurons in the last layers
  • The vanishing gradient problem results in a decrease in the prediction accuracy of the model and makes the model take a long time to train.