Class 1-12 Flashcards

Question 1

Q

What are the aims of numerical summaries of discrete variables?

Answer

A

Aim is to describe the distribution of the variable.
The question to address is: What are the relative frequencies of the different categories? Which categories are common and which are rare?
Since a categorical variable takes a finite number of possible values, the simplest thing to do is tabulate the number of occurrences of each type.
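
The tabulation described above can be sketched in a few lines. This is a minimal illustration; the `eye_colors` sample is made up for the example and is not part of the course material.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (eye colour of 8 people).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "hazel"]

counts = Counter(eye_colors)          # number of occurrences per category
total = sum(counts.values())
relative = {category: n / total for category, n in counts.items()}

# Print categories from most common to rarest with their relative frequency.
for category, n in counts.most_common():
    print(f"{category:<6} {n:>2}  {relative[category]:.3f}")
```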

Question 2

Q

What are the aims of numerical summaries of continuous variables?

Answer

A

Aim is to summarize the data in terms of its distribution.

* It is common to start with some descriptive statistics to get a feeling for the data.

Question 3

Q

What is the standard deviation?

Answer

A

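• Is a measure of how spread out numbers are; it is the square root of the Variance.
• Variance is the average of the squared differences from the Mean (for a sample, the sum is divided by n-1 rather than n).
a) Calculate the Mean (the simple average of the numbers).
b) Then for each number: subtract the Mean and square the result (the squared difference).
c) Sum up those squared differences and divide by (n-1) to get the sample Variance.
d) Take the square root of the Variance to get the standard deviation.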
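
The steps above can be followed by hand in code. This is a small sketch with made-up `values`; the cross-check uses Python's `statistics` module, which also divides by n-1 for `variance` and `stdev`.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements

mean = sum(values) / len(values)                     # a) the Mean
squared_diffs = [(x - mean) ** 2 for x in values]    # b) squared differences
variance = sum(squared_diffs) / (len(values) - 1)    # c) divide by n-1
std_dev = math.sqrt(variance)                        # square root of the Variance

# The standard library computes the same sample statistics.
matches_library = (
    math.isclose(variance, statistics.variance(values))
    and math.isclose(std_dev, statistics.stdev(values))
)
print(mean, variance, std_dev, matches_library)
```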

Question 4

Q

What is exploratory data analysis? (EDA)

Answer

A

• is the process of analyzing and visualizing the data to get a better understanding of the data and glean insight from it.

Question 5

Q

How Does Exploratory Data Analysis Differ from Summary Analysis?

Answer

A

Summary:
A summary analysis is a numeric reduction of a historical data set.
Quite passive and focused on the past.

Exploratory:
Aims to gain insight into the engineering/scientific process behind the data.
Active and focused on the future.

Question 6

Q

What is “variation”?

Answer

A

Is the tendency of the values of a variable to change from measurement to measurement.
• Measuring any continuous variable twice will usually give two different results.
• Categorical variables can vary if you measure across different subjects (e.g., eye colors of people), or different times (e.g., the energy levels).
• Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualize the distribution of the variable’s values.

Question 7

Q

What is a “Histogram”?

Answer

A

A histogram is similar to a bar plot. It divides the values of a continuous variable into non-overlapping intervals for the sake of display (= binning) and shows how many values fall into each interval.
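
Binning can be sketched as follows. This is a minimal version that assumes equal-width bins; the helper name `histogram_counts` and the `heights` data are invented for the example.

```python
def histogram_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width, non-overlapping intervals on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in values:
        if not (low <= x <= high):
            continue                                  # skip values outside the range
        # A value equal to `high` is placed in the last bin.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

heights = [150, 152, 158, 161, 163, 167, 170, 171, 178, 190]   # hypothetical data
counts = histogram_counts(heights, low=150, high=190, n_bins=4)

# Text rendering: one row of '#' per bin.
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```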

Question 8

Q

What is a “Density Curve”?

Answer

A

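The y-axis represents the probability density at each value (not the probability itself), scaled such that the area under the curve equals one. The probability of observing a value within an interval is the area under the curve over that interval.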

Question 9

Q

What is a “Box Plot”?

Answer

A

Graphical representation of the five-number summary
• Depicts quartiles (i.e., the 25%, 50%, and 75% quantiles), minimum, maximum and outliers (if present).
• Conveys the shape of the data distribution, the presence of extreme values, and the ability to compare with other variables using the same scale
• Excellent tool for screening data, determining thresholds for variables and developing working hypotheses.
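
The numbers behind a box plot can be computed directly. The sketch below rests on two assumptions that the card does not state: quartiles use the "inclusive" interpolation method, and outliers are flagged with the common 1.5 × IQR rule. The sample data is made up.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Flag values further than k * IQR outside the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 40]   # hypothetical sample with one extreme value
summary = five_number_summary(data)
outliers = iqr_outliers(data)
print(summary, outliers)
```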

Question 10

Q

What is a Normal Distribution and Why Should You Care?

Answer

A

Many statistical methods are based on the properties of a normal distribution.
Applying certain methods to data that are not normally distributed can give misleading or incorrect results.
Most methods that assume normality are robust enough for all data except data that deviate strongly from normality.

Question 11

Q

What are attributes of the “Gaussian Distribution”?

Answer

A

• Has the following properties

Gaussian distributions are symmetric around their mean.
The mean, median, and mode of a Gaussian distribution are equal.
The area under the curve is equal to 1.0.
Gaussian distributions are denser in the center and less dense in the tails.
Gaussian distributions are defined by two parameters, the mean and the standard deviation.
Approximately 68% of the area under the curve is within one standard deviation of the mean.
Approximately 95% of the area of a Gaussian distribution is within two standard deviations of the mean.
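
The 68% and 95% figures can be checked numerically. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper `area_within` is a name introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k):
    """Area under the curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(f"within 1 sd: {area_within(1):.4f}")
print(f"within 2 sd: {area_within(2):.4f}")
# Symmetry around the mean: half of the area lies below it.
print(f"below the mean: {standard_normal.cdf(0.0):.1f}")
```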

Question 12

Q

What is a “Scatterplot”?

Answer

A

For two continuous variables, the most common visualization technique is the scatterplot, which plots each observation as a point, with one variable mapped to the x-axis and the other to the y-axis.

Question 13

Q

When can we make use of visualization tools?

Answer

A

as the first step when dealing with a new task (visual exploration)
when analyzing models’ performance
when sharing insights and reporting results

Question 14

Q

What is the iterative process of EDA?

Answer

A

generate questions about the data
search for answers by visualizing, transforming, and modeling the data
use new knowledge to ask better or new questions

Question 15

Q

Define “Data Science”

Answer

A

• deals with large volumes of complex data from multiple sources
• aims to develop methods, tools, or services capable of
a. ingesting such data
b. generating semiautomated decision-support systems

Question 16

Q

What is “Descriptive Analytics”?

Answer

A

goal: understand the past and present
tools: summary statistics, correlations, visualizations

Question 17

Q

What is “Predictive Analytics”?

Answer

A

goal: detect patterns in the historic data to predict what will happen
tools: statistical modeling and machine learning

Question 18

Q

What is “Prescriptive Analytics”?

Answer

A

goal: extend predictive analytics, i.e., data is used to determine (prescribe) the best course of action
tools: optimization, heuristic search

Question 19

Q

Goal of a model

Answer

A

The goal of a model is to provide a simple, low-dimensional summary of a dataset. Ideally it:
• captures true “signals” i.e., patterns generated by the phenomenon of interest
• ignores “noise” i.e., random variation that we are not interested in

Question 20

Q

What are supervised models?

Answer

A

generate predictions by approximating the observable relationship between the data input and output
• use labeled data, i.e., we have prior knowledge of the values of our target variable
• example: regression

Question 21

Q

What are unsupervised models?

Answer

A

a.k.a. “data discovery” models
• do not have labeled outputs
• help to discover interesting relationships within the data, i.e., infer the natural structure present within a set of data points
• example: clustering

Question 22

Q

Predictive tasks/problems:

Answer

A

classification of an instance to one of the categories based on its features
regression - prediction of a numerical response variable based on other features

Question 23

Q

Descriptive tasks/problems:

Answer

A

clustering - identifying partitions of observations based on the features of these observations so that the members within the groups are more similar to each other than those in the other groups
anomaly detection - search for observations that are “greatly dissimilar” to the rest of the sample or to some group of instances

Question 24

Q

What is linear regression?

Answer

A

• represents a method for the regression task/problem (prediction of a numeric outcome)
• allows modeling an output/response variable y as a linear additive function of input variables x1, …, xn: y = β0 + β1x1 + β2x2 + … + βnxn
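
For a single input variable, the coefficients β0 and β1 can be estimated as sketched below. The card does not say how the coefficients are found; ordinary least squares is assumed here, and the function name and data are made up for the example.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate beta0 and beta1 in y = beta0 + beta1 * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

# Hypothetical data generated from y = 1 + 2x without noise.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
beta0, beta1 = fit_simple_linear_regression(xs, ys)
prediction = beta0 + beta1 * 6    # predicted response for a new input x = 6
print(beta0, beta1, prediction)
```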

Question 25

Q

What is logistic regression?

Answer

A

Allows modeling the outcome of a binary variable
• probability that an event of interest (class) happens
Uses a link function (transformation) to limit the outcome to values between 0 and 1
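
The transformation that keeps the outcome between 0 and 1 can be illustrated with the logistic (sigmoid) function. This is a sketch for one input variable; the coefficient values are invented for the example and not fitted to any data.

```python
import math

def logistic(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, beta0, beta1):
    """Probability of the event of interest for input x (one-variable model)."""
    return logistic(beta0 + beta1 * x)

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -3.0, 1.5
for x in [0.0, 2.0, 4.0]:
    print(x, round(predict_probability(x, beta0, beta1), 3))
```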

Question 26

Q

How is accuracy calculated?

Answer

A

accuracy = correctly_classified_instances / total_number_of_instances

Question 27

Q

What is a classification error?

Answer

A

classification_error = 1 − accuracy
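
Accuracy and classification error can be computed as sketched below. The label vectors `y_true` and `y_pred` are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Share of instances whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical true and predicted classes for 8 instances.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
classification_error = 1 - acc
print(acc, classification_error)
```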

Question 28

Q

What is overfitting?

Answer

A

Overfitting refers to modeling every minor variation in the input.
Note: It is way more likely that minor variation is noise than true signal!

Question 29

Q

Signs of overfitting?

Answer

A

performs well on the training set (low error)
produces high error on the test set (previously unseen data)

Question 30

Q

Causes of overfitting?

Answer

A

high dimensionality (a large set of predictors)
fitting a model to achieve minimal errors on a training set
use of nonlinear methods (e.g., tree-based methods)

Question 31

Q

What is bias?

Answer

A

Bias is error due to simplifying assumptions (underfitting)
• a linear model is applied to non-linear data
• too little data is used - lacking details

Question 32

Q

What is variance?

Answer

A

Variance reflects how much the model changes as the training set changes (overfitting).
• model is trained a lot on a noisy dataset
• complex models like decision trees are applied

Question 33

Q

What is regularization?

Answer

A

Regularization
• is the manifestation of the bias-variance trade-off
• represents alterations to the estimation process
• i.e., an additional objective is introduced by adding a complexity penalty:
1. low training error or good fit (initial one)
2. low complexity (new)

Question 34

Q

What is L1 (Lasso) regression?

Answer

A

LASSO results in sparser models due to a different way of setting upper bounds on the coefficients
• L1-norm forces coefficients to take on 0 values
a. can be used for feature selection
b. mitigates the issue of multicollinearity
• L2-norm cannot (it shrinks coefficients toward 0 without setting them exactly to 0)

Question 35

Q

What is L2 (Ridge) regression?

Answer

A

Ridge penalizes very large coefficients more strongly
• penalizes the sum of squared coefficients

Question 36

Q

What is elastic net regression?

Answer

A

Elastic-Net is a combination of LASSO- and Ridge-penalties.
• emerged from the critique of LASSO
• has the strengths of LASSO and Ridge regression
• implemented in the glmnet package in R
• the penalty can be represented as
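
The penalty formula itself is missing from this card. As an illustration only, the sketch below assumes one common parameterization, the one documented for glmnet: λ · ((1 − α)/2 · Σβ² + α · Σ|β|), where α = 1 gives the LASSO penalty and α = 0 the Ridge penalty. The function name and coefficient values are hypothetical.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """Assumed form: lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|))."""
    l1 = sum(abs(b) for b in coefficients)       # LASSO part (L1-norm)
    l2_squared = sum(b * b for b in coefficients)  # Ridge part (squared L2-norm)
    return lam * ((1 - alpha) / 2 * l2_squared + alpha * l1)

coefficients = [3.0, -4.0, 0.0]   # hypothetical model coefficients

lasso_only = elastic_net_penalty(coefficients, lam=2.0, alpha=1.0)
ridge_only = elastic_net_penalty(coefficients, lam=2.0, alpha=0.0)
mixed = elastic_net_penalty(coefficients, lam=2.0, alpha=0.5)
print(lasso_only, ridge_only, mixed)
```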

Question 37

Q

What is the TPR (True Positive Rate)?

Answer

A

Also called sensitivity: True Positive / (True Positive + False Negative)

Question 38

Q

What is the FPR (False Positive Rate)?

Answer

A

Is calculated as 1 - specificity, i.e., 1 - True Negative / (True Negative + False Positive)
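
TPR and FPR can be computed from the four confusion-matrix counts as sketched below. The labels are made up for the example, with class 1 treated as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN), treating class 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn

# Hypothetical true and predicted classes for 8 instances.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr = tp / (tp + fn)            # sensitivity
specificity = tn / (tn + fp)
fpr = 1 - specificity           # equivalently fp / (fp + tn)
print(tpr, fpr)
```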

Question 39

Q

What is AUC (Area under the ROC curve) ?

Answer

A

AUC summarizes ROC in a single number.
The higher the AUC, the better the model (closer to the optimum)
AUC of a good classifier is well above 0.5 (0.5 corresponds to random guessing)
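
One way to compute AUC without drawing the ROC curve uses its rank interpretation: AUC equals the share of (positive, negative) pairs in which the positive instance gets the higher score, with ties counted as half. This is a sketch with made-up labels and scores; the quadratic pairwise loop is chosen for readability, not speed.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores (higher means more likely positive).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_from_scores(labels, scores)
print(round(auc, 3))
```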

Question 40

Q

What is a decision tree?

Answer

A

easily interpretable (compared to other methods, that can be regarded as “black boxes”)
can be used for classification and regression
works by partitioning the data into smaller, more homogeneous groups
measures the impurity (chaos in the system)
makes such a split that minimizes the impurity in the resulting partitions
splitting process continues within the newly created partitions until no further improvement is possible (recursive partitioning)

Question 41

Q

What is entropy?

Answer

A

Entropy is the degree of chaos in the system. The higher the entropy, the less ordered the system is.
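
As an illustration, entropy can be quantified with the Shannon formula, −Σ p·log2(p) over the class proportions p. This formula is an assumption here, since the card gives only the informal definition. The sketch also shows how a decision tree could score a split by the drop in entropy; the labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in one partition."""
    total = len(labels)
    proportions = [n / total for n in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]     # maximally mixed: high entropy
left, right = ["yes", "yes"], ["no", "no"]   # pure partitions: zero entropy

print(entropy(parent), entropy(left), information_gain(parent, left, right))
```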

Question 42

Q

What are stopping criteria (also called Decision Tree hyper parameters)?

Answer

A

Larger values of maxdepth will produce larger trees, thus smaller bias but larger variance.
Larger values of minsplit mean more data points per node, thus larger bias and smaller variance.
• minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted
Larger values of minbucket mean more data points per terminal node, thus larger bias and smaller variance.
• minbucket: the minimum number of observations in any terminal or leaf node