Exam Practice Flashcards
What is Big Data?
Four pillars:
1) Information
2) Technology
3) Impact
4) Methods
Big Data is the Information Asset
What dimensions underlie Big Data?
1) Volume: quantity of available data
2) Velocity: rate at which data is collected/recorded
3) Veracity: quality and applicability of data
4) Variety: different types of data available
What is Gartner’s description of Big Data?
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making
Why do we call it “Big” Data?
Because the data, and the resources required to process it, exceed the capabilities of traditional computing environments
What are the drivers of Big Data?
Non-exhaustive list:
- Increased data volumes
- Rapid acceleration of data growth
- Growing variation in data types for analysis
- Alternate and unsynchronized methods for facilitating data delivery
- Rising demand for real-time integration of analytical results
What is the process for piloting technologies to determine their feasibility and business value and for engaging business sponsors and socializing the benefits of a selected technique?
1) channel the energy and effort of test-driving big data technologies
2) determine whether those technologies add value
3) devise a communication strategy for sharing the message with the right people within the organisation
What must happen to bring big data analytics into the organization’s system development life cycle to enable its use?
1) develop tactics for technologists, data management professionals and business stakeholders to work together
2) migrate the Big Data projects into the production environment in a controlled and managed way
How to assess value of Big Data?
1) feasibility: does the organisational setup permit new and emerging technologies?
2) reasonability: are the resource requirements within capacity?
3) value: do the results warrant the investment?
4) integrability: any impediments within the organisation?
5) sustainability: maintenance costs manageable?
What is Hadoop? And mention the three important layers.
Apache Hadoop is a collection of open-source software utilities for distributed storage and processing of Big Data using the MapReduce programming model
Important layers:
1) Hadoop Distributed File System (HDFS)
2) MapReduce
3) YARN: job scheduling and cluster management
How can organisations plan to support Big Data?
Get the people right (Business Evangelists, Technical Evangelists, Business Analysts, Big Data Application Architect, Application Developer, Program Manager, Data Scientists)
What is parallel computing?
Type of computation where many calculations are carried out simultaneously. Problems can be broken into pieces and solved at the same time.
Parallelism has long been employed in high-performance computing (multi-core processors).
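A minimal Python sketch of this idea (a hypothetical example, not from the course material): the work is split into independent pieces that worker processes compute simultaneously.
```python
# Parallelism sketch: split one problem into independent pieces and
# compute them at the same time using a pool of worker processes.
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # four worker processes
        results = pool.map(square, range(10))    # pieces solved in parallel
    print(results)                               # [0, 1, 4, 9, ...]
```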
What is distributed computing?
Model in which components of a software system are shared among multiple computers to improve efficiency and performance.
For example, in the typical distribution using the 3-tier model, user interface processing is performed in the PC at the user’s location, business processing is done in a remote computer, and database access and processing is conducted in another computer that provides centralized access for many business processes. Typically, this kind of distributed computing uses the client/server communications model.
What is the Big Data Landscape 2016?
- Infrastructure (e.g. Hadoop)
- Analytics (e.g. Statistical Computing)
- Applications (e.g. Sales and Marketing)
- Cross-Infrastructure/Analytics (e.g. Google)
- Open Source
- Data Sources (e.g. Apple Health)
What is the Big Data Landscape of 2019?
- Infrastructure
- Analytics and Machine Learning
- Applications - Enterprise
- Applications - Industry
- Cross-Infrastructure/Analytics
- Open Source
- Data Sources
- Data Resources
What is the Big Data framework?
Analytical applications that combine the means for developing and implementing algorithms, which must access, consume and manage data
What is encompassed in a technological ecosystem?
- Scalable storage
- Computing platform
- Data management environment
- Application development framework
- Scalable analytics
- Project management processes and tools
Describe row-oriented data
The entire record must be read to access the required attributes
Traditional database systems employ a row-oriented layout: the values of each row are laid out consecutively in memory
Describe column-oriented data
Values are stored column by column, one column per variable
Columns can be stored and accessed separately
Reduced latency when accessing data, compared to the row-oriented layout
What are the key differences between row- and column-oriented data?
Four dimensions of comparison:
1) Access performance: column faster than row
2) Speed of joins and aggregation: column has less access latency than row
3) Suitability to compression: with column-oriented storage you can compress data to decrease storage needs while maintaining high performance; it is difficult to use compression to improve performance with row-oriented storage
4) Data load speed: in row-oriented storage, all of a record’s values are stored together, which prevents parallel loading; in column-oriented storage the data can be segregated, allowing columns to be loaded in parallel (e.g. on multiple cores) with a separate thread working on each column
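A toy Python sketch of the two layouts (the field names are made up for illustration): the same three records stored row-wise and column-wise, and how an aggregation over one attribute touches the data in each case.
```python
# Row-oriented: each record's values are stored together, so reading
# one attribute still touches every full record.
rows = [
    {"id": 1, "city": "Oslo",   "sales": 120},
    {"id": 2, "city": "Aarhus", "sales": 95},
    {"id": 3, "city": "Malmo",  "sales": 143},
]
total_row = sum(r["sales"] for r in rows)      # scans whole records

# Column-oriented: each attribute is stored separately, so an
# aggregation over one column reads only that column.
columns = {
    "id":    [1, 2, 3],
    "city":  ["Oslo", "Aarhus", "Malmo"],
    "sales": [120, 95, 143],
}
total_col = sum(columns["sales"])              # reads a single column
```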
Describe tools and techniques for data management.
1) Processing capability: processing nodes often incorporate multiple cores so that tasks can run simultaneously
2) Memory: holds the data currently being processed on the node and generally has an upper limit per node
3) Storage: provides persistence of data and it is the place where datasets and databases are kept ready to be accessed
4) Network: this is the communication infrastructure between nodes and allows for information exchange
What is a cluster in data architecture?
A collection of interconnected nodes
Mention the cluster architecture types from class.
- Fully connected network topology (all-to-all)
- Common bus topology (sequence, one-to-next)
- Mesh network topology (some-to-some)
- Star network topology (one-to-many)
- Ring network topology (neighbor-to-neighbor)
Describe in detail the three layers of Hadoop.
1) HDFS
Enables storage of large files by distributing the data among a pool of data nodes. An HDFS file appears to be a single file, even though it is split into blocks (“chunks”) that are stored on individual data nodes. HDFS provides a level of fault-tolerance through data replication.
2) MapReduce
Used to write applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes). It is also fault-tolerant.
3) YARN
Handles job scheduling and management of cluster resources for the applications running on Hadoop.
What is the value proposition of HDFS in Hadoop?
1) Decreasing the cost of specialty large-scale storage systems
2) Providing the ability to rely on commodity components
3) Enabling deployment using cloud-based services
4) Reducing system management costs
Describe the MapReduce framework from Hadoop, and give an example of where MapReduce can be used.
Two steps:
1) Map: describes the computation or analysis applied to a set of input key/value pairs to produce a set of intermediate key/value pairs
2) Reduce: the set of values associated with the intermediate key/value pairs output by the Map operation are combined to provide the results
- MapReduce is a series of basic operations applied in a sequence to small chunks of many datasets
- Combines both data and computational independence
- It is fault-tolerant
- Can be used to count the number of occurrences of each word in a corpus (breaking down each document, each paragraph, each sentence; see slide 22 in Lecture 2)
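A minimal Python sketch of the word-count example (not Hadoop itself, just the map, shuffle and reduce pattern applied in memory to a tiny made-up corpus):
```python
# Word count in the MapReduce style: a map step emits (word, 1) pairs,
# the pairs are grouped by key, and a reduce step sums the counts.
from collections import defaultdict

documents = ["big data is big", "data demands processing"]

def map_step(doc):
    return [(word, 1) for word in doc.split()]

def reduce_step(word, counts):
    return word, sum(counts)

# Shuffle/group the intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_step(doc):
        grouped[word].append(count)

result = dict(reduce_step(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'demands': 1, 'processing': 1}
```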
What is cloud computing?
Cloud computing is the on-demand delivery of computing resources (storage, processing, services) over the internet, so data is stored and processed on remote, shared infrastructure rather than on local machines
What is data mining?
An art and science of discovering knowledge, insights and patterns in data
Why data mining?
- To recognize hidden value in data
- To effectively gather quality data and efficiently process it
Outline steps in a typical data mining process.
1) Understand the application domain
2) Identify data sources and select target data
3) Pre-process: cleaning, attribute selection
4) Data mining to extract patterns or models
5) Post-process: identifying interesting or useful patterns
6) Incorporate the process into real-world tasks
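A minimal illustration of steps 3 to 5 in Python, assuming scikit-learn and its built-in iris dataset as a stand-in for a real application domain:
```python
# Pre-process (scale), mine (fit a classifier), then inspect the result.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)                      # pre-processing
model = DecisionTreeClassifier().fit(scaler.transform(X_train), y_train)

preds = model.predict(scaler.transform(X_test))             # apply the mined model
print(accuracy_score(y_test, preds))                        # post-process: evaluate
```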
Name common mistakes around data mining.
1) Wrong problem for mining
2) Not having sufficient time for data acquisition
3) Focusing only on aggregated results
4) Being sloppy with the procedure
5) Ignoring suspicious findings
6) Running mining algorithms repeatedly and blindly
7) Naively believing in the data
What method do we use for estimating a linear relationship statistically?
Ordinary Least Squares
We use OLS because it minimizes the sum of squared errors (residuals)
What is pseudo out-of-sampling testing?
Split the dataset in two, estimate the model on one part, and then make predictions on the other part
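A small Python sketch of the idea, assuming statsmodels and simulated data: the model is fitted on the first part of the sample and evaluated on the held-out part.
```python
# Pseudo out-of-sample test: estimate on one part, predict on the rest.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=200)

X = sm.add_constant(x)
model = sm.OLS(y[:150], X[:150]).fit()            # estimate on the first part
pred = model.predict(X[150:])                     # predict on the held-out part
rmse = np.sqrt(np.mean((y[150:] - pred) ** 2))    # out-of-sample error
print(rmse)
```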
How do we compute beta in a single variable OLS?
How do we compute the intercept in a single variable OLS?
beta = Cov[X,Y] / Var[X]
intercept = E[Y] - beta * E[X] (plug the sample means into the fitted line)
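A quick numerical check of these formulas on simulated data (a numpy sketch; the true coefficients are chosen arbitrarily):
```python
# Closed-form single-variable OLS: slope from the covariance/variance
# ratio, intercept from plugging in the sample means.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.2, size=100)

beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - beta * x.mean()
print(beta, intercept)   # close to 0.8 and 1.5
```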
What are the model assumptions behind OLS?
1) Linearity and additivity
2) Statistical independence of errors
3) Homoscedasticity: constant variance of the errors
4) Normality of the error distribution
How to diagnose violation of linearity?
Plot observed vs. predicted values
How to diagnose violation of independence (concern in time series)?
Run a Durbin-Watson test on the residuals and check whether they exhibit autocorrelation
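A short sketch using statsmodels' durbin_watson on the residuals of a fitted OLS model, with simulated data; a statistic near 2 indicates no first-order autocorrelation.
```python
# Durbin-Watson statistic on OLS residuals.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))   # ~2 here, since the errors are independent
```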
What statistical tests exist for assessing normality in OLS?
Shapiro-Wilk, Kolmogorov-Smirnov, Jarque-Bera, Anderson-Darling
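A sketch applying these tests with scipy.stats, using a simulated sample as a stand-in for model residuals; small p-values suggest a departure from normality.
```python
# The four normality tests applied to the same sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
resid = rng.normal(size=200)                     # stand-in for OLS residuals
z = (resid - resid.mean()) / resid.std()         # standardize for the KS test

print(stats.shapiro(resid))                      # Shapiro-Wilk
print(stats.kstest(z, "norm"))                   # Kolmogorov-Smirnov
print(stats.jarque_bera(resid))                  # Jarque-Bera
print(stats.anderson(resid, dist="norm"))        # Anderson-Darling
```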
Why do we need a Generalized Linear Model (GLM)? And what are its three components?
We need a GLM when the response variable does not follow a normal (Gaussian) distribution, so OLS is not appropriate.
Composed of;
1) Random component: the dependent variable and its probability distribution
2) Systematic component: the selected covariates, combined through a linear predictor
3) Link function: the function g applied to E[Y] so that g(E[Y]) equals the systematic component (e.g. a log link)
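As a sketch, a Poisson GLM with the default log link in statsmodels, on simulated data (coefficient values chosen arbitrarily); the three components are marked in the comments.
```python
# GLM sketch: Poisson random component, linear predictor, log link.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(200, 2)))              # systematic component
mu = np.exp(X @ np.array([0.2, 0.5, -0.3]))                 # inverse of the log link
y = rng.poisson(lam=mu)                                     # random component

glm = sm.GLM(y, X, family=sm.families.Poisson())            # log link is the default
print(glm.fit().params)
```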
What is the logistic regression? (Mention nature of response variable and derive if needed)
Binary response variable. The linear predictor is passed through the logistic (sigmoid) function, P(Y = 1 | X) = 1 / (1 + exp(-X*beta)); equivalently, the log-odds are linear: log(p / (1 - p)) = X*beta.
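A minimal sketch with statsmodels' Logit on simulated data (the true coefficients are chosen arbitrarily):
```python
# Logistic regression: binary outcomes generated through the sigmoid.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(300, 1)))
p = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.2]))))   # P(Y = 1 | X)
y = rng.binomial(1, p)                                # binary response

print(sm.Logit(y, X).fit(disp=0).params)              # roughly recovers [-0.5, 1.2]
```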
Mention key pros and cons of the panel regression model.
Pros:
- suitable for studying dynamics
- minimize bias
- time fixed effects to control for unobserved variables
Cons:
- data collection
- unwanted correlation
- complexity
Explain the difference between Random Effects and Fixed Effects Panel Regression.
FE: removes (controls for) each subject’s time-invariant characteristics, so estimation uses only within-subject variation
RE: assumes variation across subjects is random and uncorrelated with the predictors
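A minimal fixed-effects sketch using the within (demeaning) transformation with pandas and statsmodels on simulated data; dedicated panel libraries exist, but demeaning makes the FE idea explicit.
```python
# Fixed effects via the "within" transformation: demeaning by subject
# removes each subject's time-invariant level before running OLS.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(50), 10),   # 50 subjects, 10 periods each
    "x": rng.normal(size=500),
})
subject_effect = np.repeat(rng.normal(size=50), 10)          # time-invariant trait
df["y"] = 2.0 * df["x"] + subject_effect + rng.normal(size=500)

demeaned = df[["x", "y"]] - df.groupby("subject")[["x", "y"]].transform("mean")
fe = sm.OLS(demeaned["y"], demeaned["x"]).fit()              # no constant after demeaning
print(fe.params)                                             # close to 2.0
```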
What types of classification of data exist?
Supervised learning: a classifier is trained on labeled examples and then applied to new data
Clustering (unsupervised): no labels are available; groups are discovered from scratch every time
Mention five families of clustering methods.
1) Hierarchical methods (agglomerative and divisive)
2) Partitioning methods (non-hierarchical)
3) Fuzzy methods
4) Density-based methods
5) Model-based methods
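A short scikit-learn sketch contrasting one hierarchical and one partitioning method on the same simulated two-cluster data:
```python
# Agglomerative (hierarchical) vs. k-means (partitioning) clustering.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),   # cluster around (0, 0)
               rng.normal(3, 0.5, size=(50, 2))])  # cluster around (3, 3)

print(AgglomerativeClustering(n_clusters=2).fit_predict(X))  # hierarchical
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))        # partitioning
```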