Big Data Flashcards
What are the FOUR dimensions of Big Data?
- Volume: refers to the quantity of available data
- Velocity: refers to the rate at which the data is recorded/collected
- Veracity: refers to the quality and applicability of the data
- Variety: refers to the different types of available data
What characterizes big data/how is it defined?
Big Data is an information asset characterized by such high Volume, Velocity and Variety that it requires specific technology and analytical methods for its transformation into value.
Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
What are the drivers behind big data?
- Increased data volumes being captured and stored
- Rapid acceleration of data growth
- Increased data volumes pushed into the network
- Growing variation in types of data assets for analysis
- Alternate and unsynchronized methods for facilitating data delivery
- Rising demand for real-time integration of analytical results
What is NoSQL?
“Not only SQL” –> alternate model for data management
- Provides a variety of methods for managing information to best suit specific business process needs, such as in-memory data management, columnar layouts to speed query response, and graph databases
What is MPP? and how is it related to Big Data?
Massively Parallel Processing
–> A type of computing architecture that couples many processors, utilizing high-bandwidth networks and massive I/O devices
RELATION TO BD:
- Big data approaches go a step further, coupling clusters of commodity hardware components with open source tools and technology
What five aspects will a corporation considering incorporating Big Data need to consider?
• Feasibility: Is the enterprise aligned in a way that allows for new and emerging technologies to be brought into
the organization?
• Reasonability: will the resource requirements exceed the capability of the existing or planned environment?
• Value: do the results warrant the investment?
• Integrability: any constraints or impediments within the organization from a technical, social, or political
perspective?
• Sustainability: are costs associated with maintenance, configuration, skills maintenance, and adjustments to the
level of agility sustainable?
Name the 7 types of people needed for implementing Big Data?
1) Business evangelist –> understands the current limitations of the existing tech infrastructure
2) Technical evangelist –> understands the emerging tech and the science behind it
3) Business analyst –> engages the business process owners and identifies measures to quantify
4) Big Data application architect –> experienced in high-performance computing
5) Application developer –> identifies the technical resources with the right set of skills for programming
6) Program manager –> experienced in project management
7) Data scientist –> experienced in coding and statistics/AI
What is the Big Data framework? and what key components does it consist of?
Overall picture of the Big Data landscape, consists of:
- Infrastructure (e.g. SAP and SQL)
- Analytics (e.g. Google analytics)
- Applications (e.g. Human capital, legal, security)
- Cross-infrastructure analytics (Google, Microsoft, Oracle)
- Open source (e.g. R-studio)
- Data Sources and API(Application Programming Interface)s (Garmin, Apple)
What is API?
Application programming interface
is a set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact. Additionally, APIs are used when programming graphical user interface (GUI) components
- Enables software components to talk to each other
Which is better, row- or column-oriented data?
Column-oriented data, since storing each column separately reduces latency
- Access performance: row-oriented storage handles many simultaneous queries poorly (as opposed to column-oriented)
- Speed of aggregation: much faster with column-oriented data
- Suitability to compression: column-oriented data is better suited for compression, decreasing storage needs
- Data load speed: faster with column-oriented storage; since each column is stored separately, you can load in parallel using multiple threads
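As a rough illustration (a toy Python sketch, not a real storage engine; the table and values are invented), the same data held row-wise vs column-wise, and why a column aggregate touches less data:

```python
# Row-oriented: one record per entry; an aggregate over "price"
# must walk every record and pick the field out of each row.
rows = [
    {"id": 1, "product": "A", "price": 10.0},
    {"id": 2, "product": "B", "price": 12.5},
    {"id": 3, "product": "A", "price": 9.0},
]
total_row_oriented = sum(r["price"] for r in rows)

# Column-oriented: each column is stored contiguously, so the same
# aggregate only reads the "price" column (and it compresses better,
# since values of one type/domain sit next to each other).
columns = {
    "id": [1, 2, 3],
    "product": ["A", "B", "A"],
    "price": [10.0, 12.5, 9.0],
}
total_column_oriented = sum(columns["price"])

assert total_row_oriented == total_column_oriented
print(total_column_oriented)  # 31.5
```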
Hardware versus software?
Go to slide 36 and 37 and discuss
Name the four tools and techniques?
Processing capability
- Often several interconnected nodes, allowing tasks to be run simultaneously (multithreading)
Storage of data
Memory
- Holds the data for the tasks currently running on the node
Network
- Communication infrastructure between the nodes
What types of architectural cluster topologies exist? And what are the two OVERALL types?
Slide 42 and 43:
OVERALL: centralized and decentralized
- Fully connected network topology
- Mesh network topology
- Star network topology
- Common bus topology
- Ring network topology
What does the general architecture distinguish between? and what are their roles?
Management of computing resources
- oversees the pool of processing nodes, assigns tasks and monitors activity
Management of data/storage
- oversees the data storage pool and distributes datasets across the collection of storage resources
What are the three important layers of Hadoop?
- Hadoop Distributed File System (HDFS)
- MapReduce
- YARN: a new generation framework for job scheduling and cluster management
What are the main functions of HDFS?
- Attempts to enable the storage of LARGE files, by distributing the data among a pool of data nodes
- Monitoring of communication between nodes and masters
- Rebalancing of data from one block to another if free capacity is available
- Managing integrity using checksums/digital signatures
- Metadata replication to protect against corruption
- Snapshots/copying of data to establish check-points
What are the four advantages of using HDFS?
1) decreasing the cost of specialty large-scale storage systems
2) providing the ability to rely on commodity components
3) enabling the ability to deploy using cloud-based services
4) reducing system management costs
What is MapReduce?
- It is a software framework
- Used to write applications which process vast amounts of data in parallel on large clusters
- It is fault-tolerant
- Combines both data and computational independence
(both data and computations can be distributed across nodes, which enables strong parallelization)
What are the two steps in MapReduce?
Map: Describes the computation analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
Reduce: the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results
Example; count the number of occurrences of a word in a corpus:
key: is the word
value: is the number of times the word is counted
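A minimal single-process sketch of the word-count example in Python (no Hadoop involved; the function names and toy corpus are just illustrative):

```python
from collections import defaultdict

corpus = ["big data is big", "data is valuable"]

# Map: each document -> list of intermediate (word, 1) key/value pairs
def map_phase(document):
    return [(word, 1) for word in document.split()]

# Shuffle/group: collect all values emitted for the same key
grouped = defaultdict(list)
for doc in corpus:
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce: combine the values of each intermediate key into the result
def reduce_phase(key, values):
    return key, sum(values)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```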
What is parallelization?
the act of designing a computer program or system to process data in parallel. Normally, computer programs compute data serially: they solve one problem, and then the next, then the next. If a computer program or system is parallelized, it breaks a problem down into smaller pieces that can each independently be solved at the same time by discrete computing resources
What are the four use cases for big data?
Counting; document indexing, filtering, aggregation
Scanning; sorting, text analysis, pattern recognition
Modeling; analysis and prediction
Storing; rapid access to stored large datasets
What is data mining?
The art and science of discovering knowledge, insights and patterns in data
- e.g. predicting the winning chances of a sports team
- or identifying friends and foes in warfare
- or forecasting rainfall patterns in a region
It helps recognize the hidden value in data
Describe the typical process of data mining?
- Understand the application domain
- Identify data sources and select target data
- Pre-process: cleaning, attribute selection
- Data mining to extract patterns or models
- Post-process: identifying interesting or useful patterns
- Incorporate patterns in real world tasks
OR:
Data input –> data consolidation –> data cleaning –> data transformation –> data reduction –> well-formed data
In terms of data mining, what does ETL stand for?
Extract, transform, load
What are some of the common mistakes of data-mining?
- Selecting the wrong problem for data mining
- Not leaving sufficient time for data acquisition, selection and preparation
- Looking only at aggregated results and not at individual records/predictions
- Being sloppy about keeping track of the data mining procedure and results
- Ignoring suspicious (good or bad) findings and quickly moving on
- Running mining algorithms repeatedly and blindly, without thinking about the next stage
- Naively believing everything you are told about the data
- Naively believing everything you are told about your own data mining analysis
Describe the Harvard Business Review matrix
Two parameters;
1) Not useful –> Very useful
2) Time-consuming –> not time-consuming
Upper-left: Plan
Upper-right: learn
Lower-left: ignore
Lower-right: browse
BEWARE –> VERY bad matrix, cannot be used
Why do we have an error term in a statistical linear model?
Because the model is not deterministic (perfect)
Error terms are normally distributed!
What method is usually used to estimate a linear relationship?
Ordinary Least Square (OLS)
How do we define the slope, i.e. B1?
B1 = Cov(x,y) / Var(x)
How do we define the intercept?
B0 = ȳ - B1·x̄ (using the sample means of y and x)
What is the estimated variance of error?
SSE / (n - 2):
SSE: sum of squared errors, a.k.a. the sum of squared residuals (RSS/SSR)
How do we find R^2 and what is its nickname?
Coefficient of determination
ESS / TSS or (1 - RSS/TSS)
Or: the explained sum of squares over the total sum of squares
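A small numpy sketch (with made-up data) tying the last few cards together: the slope, intercept, estimated error variance and R² computed directly from the formulas above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope = Cov(x,y)/Var(x)
b0 = y.mean() - b1 * x.mean()                          # intercept = mean(y) - slope*mean(x)

residuals = y - (b0 + b1 * x)
sse = np.sum(residuals ** 2)                           # sum of squared errors
sigma2_hat = sse / (len(x) - 2)                        # estimated error variance = SSE/(n-2)
r2 = 1 - sse / np.sum((y - y.mean()) ** 2)             # R^2 = 1 - RSS/TSS

print(b1, b0, sigma2_hat, r2)
```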
What are the four principal assumptions that justify the use of a linear model?
1) Linearity and additivity
- The slope of the line does not depend on the values of the other variables
- The effects of the independent variables are additive in estimating the dependent variable
2) Statistical independence of the errors
3) Homoscedasticity: constant variance of errors
4) Normality of the error distribution
What happens when linearity and additivity is violated in the OLS?
- NOT good
- Use another model that is not linear
Either a logarithmic transformation (if all data points are positive) or a polynomial model
What happens when independence of the error terms is violated in the OLS?
Diagnosis: look at a residual time series plot
- Often a concern in panel data/longitudinal data
- If there are minor cases of positive serial correlation, add the lagged independent variable as a predictor
- If there are minor cases of negative serial correlation, solve by differencing
- If large serial correlation –> RESTRUCTURE THE MODEL, maybe standardize all variables
What happens when homoscedasticity is violated using the OLS?
Diagnosis; plot the residuals versus the predicted values
- You get imprecise predictions and confidence intervals
- If errors are increasing over time, CIs for out-of-sample predictions are unrealistically narrow
SOLUTION: robust standard errors or transformation of model
What happens when normality is violated using the OLS?
Diagnosis; Plot of normality or a Shapiro-Wilk test
- Makes it very hard to determine if a coefficient is significantly different from zero and to provide CIs
BUT: if your goal is only to estimate the values of the coefficients, THIS IS NOT A PROBLEM; it only matters if you have to do predictions
What model can we use, in the case the outcome is not normal (Gaussian)?
Generalized Linear Model
What is a Generalized Linear Model (GLM)?
Characterized by THREE components:
1) Random: associated with the dependent variable and its probability distribution
2) systematic: identifies the selected covariates through a linear predictor
3) link function: identifies the function of E[Y] such that it is equal to the systematic component
Normally, will a binary response be derived from a linear relation?
No; it will typically be non-linear –> could use logistic regression
What shape does the logistic regression have?
S-curved
Since the predicted probability approaches, but almost never reaches, 0 or 1
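A sketch of a logistic regression fitted as a GLM (binomial random component, logit link), assuming statsmodels is available; the simulated data and coefficients are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=200)
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))        # true S-shaped probability
y = rng.binomial(1, p)                        # binary response

X = sm.add_constant(x)                        # intercept + predictor
model = sm.GLM(y, X, family=sm.families.Binomial())  # logit link is the default
result = model.fit()
print(result.params)                          # roughly recovers 0.5 and 2.0
```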
What is panel data and what is the associated model?
Panel (a.k.a. longitudinal) data: repeated measures of the same subjects –> Panel Regression
- AVOID GLM and LM, since the independence assumption does not hold anymore
What are the pros of a Panel regression?
- account for sample heterogeneity and compute individual-specific estimates
- suitable to study the dynamics
- minimise bias due to aggregation (like average over time)
- enable the control for unobserved variables like cultural factors or difference in business practices across companies (i.e. national policies, federal regulations, international agreements, etc.)
- enable more complex hierarchical models (Generalized Additive Models, GAM)
What are the cons of a Panel regression?
- data collection (i.e. sampling design, coverage)
- unwanted correlation (i.e. same country but different measures)
- analyses are much more complex
What are the two possible approaches to a panel regression, and how do they work?
Fixed effects (FE) model –> explores the relationship between predictor and outcome variables within a subject
Random effects (RE) model –> Assumes that the variation across subjects is random and uncorrelated with the predictors
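A hand-rolled sketch of the fixed-effects idea (the "within" transformation) on a made-up toy panel; this is one common way to compute the FE estimate, not necessarily how a given software package does it:

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "x":    [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "y":    [2.0, 4.1, 5.9, 7.0, 9.2, 11.1, 3.5, 5.4, 7.6],
})

# Within transformation: demean x and y inside each firm, so that
# time-invariant firm characteristics drop out of the equation
demeaned = panel[["x", "y"]] - panel.groupby("firm")[["x", "y"]].transform("mean")

# Plain OLS on the demeaned data gives the fixed-effects (within) slope,
# estimated only from variation within each firm
beta_fe = np.cov(demeaned["x"], demeaned["y"], ddof=1)[0, 1] / np.var(demeaned["x"], ddof=1)
print(beta_fe)
```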
What are the two assumptions of the FE model?
1) Correlation between the subject's error term and the predictors –> FE removes the effect of time-invariant characteristics, so we can estimate the NET effect of the predictors on the outcome
2) All the time-invariant characteristics are individual-specific, so they are NOT correlated with those of other individuals
CAUTIONS: lots of dummy variables and an increased risk of multicollinearity
When do we use the RE model and what is the assumptions behind?
Used when differences across subjects have some influence on the dependent variable
Assumption: The subject’s error term is uncorrelated with the predictors –> therefore, you can use time-invariant variables as EXPLANATORY variables
CAUTIONS: RE needs REALLY strong assumptions, and FE is often more convincing. The choice between RE and FE can be tested with the Hausman test
What are the between and within-entity errors in the RE model?
Slide 102
Between-entity error: u_it; within-entity error: e_it
What are the two overall types of classifications for clustering?
1) Clustering (Unsupervised learning)
- No prior classification, algorithm explores all possible combinations
2) Supervised learning
- We do have prior knowledge on the data; we know the labels, or categories etc.
- Training a computer to learn a new system we created
- Based on data, therefore, the more data, the better the system
What are the three partitions of unsupervised learning in clustering, and what are the sub-categories?
Hard clustering
1) Hierarchical methods (agglomerative and divisive)
2) Partitioning methods (non-hierarchical)
Soft clustering
3) Fuzzy methods
More complex methods
4) Density-based methods
5) Model-based methods
In regards to clustering, explain the concept of (statistical) distance
- Needed for every classification method
- Distance defines the DISsimilarity between two points
- Numerous distance measures exist –> think of the dissimilarity matrix (slide 118)
Name two examples of measures of statistical distance
1) Pearson correlation distance
2) Eisen cosine correlation distance
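A small numpy sketch of the Pearson correlation distance, d(a,b) = 1 - cor(a,b), on invented observations, building the kind of dissimilarity matrix mentioned above:

```python
import numpy as np

X = np.array([
    [1.0, 2.0, 3.0, 4.0],   # obs 0
    [2.0, 4.0, 6.0, 8.0],   # obs 1: same profile as obs 0, different scale
    [4.0, 3.0, 2.0, 1.0],   # obs 2: opposite trend
])

n = X.shape[0]
dist = np.zeros((n, n))                      # dissimilarity matrix
for i in range(n):
    for j in range(n):
        dist[i, j] = 1 - np.corrcoef(X[i], X[j])[0, 1]

print(np.round(dist, 2))
# obs 0 vs 1 -> 0.0 (perfectly correlated), obs 0 vs 2 -> 2.0 (anti-correlated)
```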
How does hierarchical clustering methods work? and which two types exist? How is visualized?
1) Agglomerative: each observation starts as its own cluster. Iteratively, the most similar clusters (leaves) are merged until a single cluster remains (the root)
2) Divisive: the INVERSE of agglomerative. It begins with the root, and the most heterogeneous clusters are subsequently divided until each observation forms its own cluster
Thus, there is no need to define the number of cluster groups beforehand
Visualization; Dendrogram
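A minimal agglomerative-clustering sketch using scipy's hierarchical clustering on toy 2-D points (the data and parameter choices are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (5, 2)),   # one cloud around (0, 0)
                    rng.normal(3, 0.3, (5, 2))])  # one cloud around (3, 3)

Z = linkage(points, method="ward")               # agglomerative merge history
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)                                    # e.g. [1 1 1 1 1 2 2 2 2 2]

# dendrogram(Z)  # visualize the merge tree (requires matplotlib)
```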
What are the pros and cons of the hierarchical method of clustering?
Pros:
• No a-priori information about the number of clusters required
• Easy to implement
• Very replicable
Cons:
• Not very efficient: O(n² log n)
• Based on a dissimilarity matrix, which has to be chosen in advance
• No objective function is directly minimised
• The dendrogram is not the best tool to choose the optimum number of clusters
• Hard to treat non-convex shapes
How does the partitioning method of clustering work?
- Simplest method
- Requires predefined number of clusters
- iterative methods, geometry based
3 most famous types:
1) K-means: each cluster is represented by the center of the cluster
2) K-medoids or PAM: each cluster is represented by one of the points in the cluster
3) CLARA (Clustering LARge Applications): Suitable for large datasets
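A short k-means sketch with scikit-learn on toy data (cluster locations and parameters are invented), showing that k must be fixed in advance and that each cluster is summarized by its centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(km.cluster_centers_)            # roughly (0, 0) and (5, 5)
print(km.labels_[:5], km.labels_[-5:])
```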
What are the pros and cons of the partitioning method of clustering?
Pros:
• k-means is relatively efficient: O(tkn), with k, t ≪ n
• easy to implement and understand
• totally replicable
Cons:
• PAM does not scale well for large datasets
• Applicable only when mean is defined (i.e. no categorical data)
• Need to specify k in advance
• k-means is unable to handle noisy data and outliers. PAM does better
• Not suitable to discover clusters with non-convex shapes
How do we check whether the data has a clustering tendency, or whether we are just synthesising clusters?
- Use the Hopkins statistic test; it measures the probability that your dataset is uniformly distributed
- i.e. it tests the spatial randomness of the data
Mechanics:
- Based on each observation's distance to its nearest neighbor
- This distance is then compared to a random sample point's distance to its nearest real neighbor
Formula:
Avg. dist. to random neighbor / (avg. dist. to random neighbor + avg. dist. to real neighbor)
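A rough sketch of the Hopkins statistic as described above (the function name, sample size m and sampling choices are my own; implementations differ in the details). Values near 0.5 suggest random data; values near 1 suggest clustering tendency:

```python
import numpy as np

def hopkins(X, m=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample = X[rng.choice(n, size=m, replace=False)]
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))

    def nn_dist(points, exclude_self):
        d = np.linalg.norm(points[:, None, :] - X[None, :, :], axis=2)
        if exclude_self:
            d[d == 0] = np.inf          # ignore each point's distance to itself
        return d.min(axis=1)

    u = nn_dist(uniform, exclude_self=False)  # random point -> nearest real point
    w = nn_dist(sample, exclude_self=True)    # real point -> nearest real neighbour
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2))])
print(hopkins(X))   # clearly above 0.5 for this clustered toy data
```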
Three ways to find the optimal number of clusters?
Plenty of methods exist, however we focus only on 3:
1) Silhouette
2) Gap-statistics
3) Within sum of squares
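A sketch of the silhouette approach to choosing k, assuming scikit-learn is available; the toy data and the range of k tried are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.4, (30, 2)),
               rng.normal(4, 0.4, (30, 2)),
               rng.normal((0, 4), 0.4, (30, 2))])

# Fit k-means for several k and keep the k with the highest average silhouette
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)   # expected: 3
```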
Name the four overall methods of supervised learning
1) Heuristic approach
- k-NN (nearest neighbor)
2) Model-based approach
- Linear discriminant analysis
- Quadratic discriminant analysis
- Logistic
- Naïve Bayes
3) Binary decision
- Classification and regression trees (CART)
- Random forest
4) Optimisation based
- Support Vector Machine (SVM)
- Neural Networks
What is Naïve Bayes, and how does it work?
- A probabilistic machine learning algorithm
- Based on two components: conditional probability and Bayes' rule
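A minimal Naïve Bayes sketch using scikit-learn's GaussianNB on invented data (any Naïve Bayes variant would illustrate the same Bayes-rule idea):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# The classifier applies Bayes' rule, P(class | features) ∝ P(features | class) * P(class),
# with the "naive" assumption that features are conditionally independent given the class.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # class 0
               rng.normal(3, 1, (50, 2))])     # class 1
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.2, -0.1], [2.8, 3.1]]))    # -> [0 1]
print(clf.predict_proba([[1.5, 1.5]]).round(2))  # posterior class probabilities
```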