Session 3 Flashcards

Question

when is recall more important?

Answer 1

if costs of false negatives are higher (corona, promotional mailing, loan defaults)

Answer 2

if costs of false positives are higher (search engine results, spam filters)
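The two cards above can be made concrete with a small calculation. This is a minimal sketch, assuming Answer 2 refers to precision; the function name and the counts are made up for illustration. Recall penalizes false negatives (missed cases), precision penalizes false positives (false alarms).

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    # precision: share of predicted positives that are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: share of actual positives that the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical screening test: 80 hits, 40 false alarms, 20 missed cases.
precision, recall = precision_recall(tp=80, fp=40, fn=20)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```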

Answer 3

- simple split
- k-fold cross-validation (rotation estimation)
- leave-one-out
- bootstrapping
- jackknifing

Answer 4

- split the data into 2 mutually exclusive sets called training (70%) and testing (30%)
- for ANN, make three subsets (training: 60%, validation: 20%, testing: 20%)
- the model is developed on the training data and then validated on the held-out testing data
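A simple split could be sketched as follows. The function name `simple_split`, the fixed seed, and the placeholder records are assumptions for illustration, not part of the cards.

```python
import random

def simple_split(rows, train_frac=0.7, seed=0):
    """Shuffle a copy of rows and cut it into two mutually exclusive sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))  # placeholder records
train, test = simple_split(data)
print(len(train), len(test))
```

For the three-way ANN split, the same idea applies with two cut points instead of one.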

Answer 5

- split the data into k mutually exclusive subsets (e.g., 10)
- use each subset once for testing and the remaining k-1 subsets for training
- repeat the train-and-test experiment k times
- aggregate the test results to estimate the true prediction accuracy
- never train and test on the same data
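The k-fold procedure above might look like this in code. It is a sketch under assumptions: `k_fold_indices`, `cross_validate`, and the `evaluate` callback are hypothetical names, and the data is assumed to be shuffled beforehand.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]  # k mutually exclusive subsets
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n, k, evaluate):
    """Average the score from evaluate(train_idx, test_idx) over the k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n, k)]
    return sum(scores) / len(scores)

# Show which rows land in the test set on each of the k rounds.
for train_idx, test_idx in k_fold_indices(n=10, k=5):
    print("test on", test_idx, "train on", len(train_idx), "rows")
```

Leave-one-out is the special case k = n, where each test set holds a single observation.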

Answer 6

assumes a linear relationship between both variables

Answer 7

train the model's β so that the differences between actual and predicted Y become minimal --> once trained, plug in a value for X and you get a predicted value for Y
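For one predictor, the training step has a closed-form least-squares solution. The sketch below assumes ordinary least squares on made-up data; `fit_line` and `predict` are illustrative names.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data, roughly y = 2x
b0, b1 = fit_line(xs, ys)

def predict(x):
    """Plug a new X into the trained line to get a predicted Y."""
    return b0 + b1 * x

print(b0, b1, predict(6))
```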

Answer 8

how well do the predictions of the regression model actually match the data? (ranges from 0 to 1; 1 is best) --> if 1: the model perfectly explains all variability in the data
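Assuming this card describes R², it can be computed from the residuals as one minus the ratio of unexplained to total variation. The function name and the numbers below are illustrative only.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    # SS_res: squared residuals (actual minus predicted)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # SS_tot: squared deviations of the actual values from their mean
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

actual = [3, 5, 7, 9]
predicted = [2.5, 5.5, 6.5, 9.5]  # made-up model output
r2 = r_squared(actual, predicted)
print(r2)
```

The 0 to 1 range holds for a least-squares fit with an intercept evaluated on its own training data; on new data the value can fall below 0.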

Answer 9

differences between actual and predicted values

Answer 10

precision of estimated coefficients (lower = better)

Answer 11

tests if the coefficients (β0 and β1) are statistically significant --> if less than 0.05, it is significant (meaning: are the results actually meaningful or just due to chance?)

Answer 12

tests if one coefficient is significant
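Assuming the cards above refer to the standard error and the t-test, the two are linked: the t statistic of a coefficient is the estimate divided by its standard error. The sketch below does this for the slope of a one-predictor regression on made-up data; the function name is hypothetical, and turning the t statistic into a p-value would additionally need the t distribution, which is left out here.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (b1, standard error of b1, t statistic) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = ss_res / (n - 2)        # residual variance, n - 2 degrees of freedom
    se_b1 = math.sqrt(sigma2 / sxx)  # lower = more precise estimate
    return b1, se_b1, b1 / se_b1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data
b1, se_b1, t_stat = slope_t_statistic(xs, ys)
print(b1, se_b1, t_stat)
```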

Answer 13

overall significance of your model with all variables (if F is large and p is low --> significant)

Answer 14

the variable is used in its original form. For example, if you’re using income measured in dollars as is, that’s a level variable.

Answer 15

the natural logarithm of the variable is used. Taking the log often helps interpret relationships as percentage changes and makes data with large ranges or skewness easier to analyze. For example, using the log of income allows you to interpret results in terms of proportional changes rather than absolute values.

Answer 16

- level-level (y is level, x is level)
- level-log (y is level, x is log)
- log-level (y is log, x is level)
- log-log (y is log, x is log)
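Written out for a single predictor, the four specifications and their usual textbook interpretations look like this. The symbols $\beta_0$, $\beta_1$, and $\varepsilon$ are assumed notation, and the percentage readings are approximations that hold for small changes.

```latex
\begin{align*}
\text{level-level:} \quad & y = \beta_0 + \beta_1 x + \varepsilon
  && \Delta y = \beta_1 \, \Delta x \\
\text{level-log:} \quad & y = \beta_0 + \beta_1 \ln x + \varepsilon
  && \Delta y \approx (\beta_1 / 100) \, \%\Delta x \\
\text{log-level:} \quad & \ln y = \beta_0 + \beta_1 x + \varepsilon
  && \%\Delta y \approx 100 \, \beta_1 \, \Delta x \\
\text{log-log:} \quad & \ln y = \beta_0 + \beta_1 \ln x + \varepsilon
  && \%\Delta y \approx \beta_1 \, \%\Delta x
\end{align*}
```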

Answer 17

1. linearity (a change in X is associated with a proportional change in Y)
2. independence: the residuals (errors: differences between observed and predicted values) are independent, i.e., the residual of one observation is not influenced by the residuals of any other observation
3. homoscedasticity (the residuals have constant variance)

Answer 18

- probabilities could be less than 0 or greater than 1
- there could be heteroscedasticity, so t and F statistics as well as standard errors are not generally valid --> we might reject a hypothesis although it is true
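The first problem is easy to reproduce, assuming the card is about fitting a straight line to a 0/1 outcome. The sketch below uses made-up data and an illustrative `fit_line` helper; the fitted "probabilities" leave the 0 to 1 range at the edges.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b1 * mean_x, b1

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]  # binary outcome, made-up data
b0, b1 = fit_line(xs, ys)

# Fitted values at a low, a middle, and a high x.
fitted = [b0 + b1 * x for x in (0, 3.5, 8)]
print(fitted)  # first value is below 0, last value is above 1
```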

Answer 19

the spread of the errors (scatter = differences between predicted and actual values) changes with the value of the independent variable --> e.g., you are testing how much people spend based on their income, but in real life the spread in spending among low-income people is much smaller than among high-income people (some spend a lot, some spend little)

Answer 20

here we can only have the outcomes 1 or 0 for the independent variable (amenities) -> price would not be 0.50€ but >1 and <1

Answer 21

it is the probability (0-1) that the dependent variable (y) is 1

Answer 22

it is the intercept: baseline or starting value, representing the log odds of y=1 when all independent variables (x) are 0.

Answer 23

This measures the change in the log odds of y=1 for a one-unit increase in x.

Answer 24

gives the odds ratio, which tells you how the odds change with a one-unit increase in x
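The last four cards fit together in a few lines, assuming they describe a logistic regression with one predictor and that the odds ratio is obtained as exp(β1). The coefficients below are hypothetical and the function names are illustrative.

```python
import math

def logistic_probability(b0, b1, x):
    """P(y = 1 | x) for a logistic model with log odds b0 + b1 * x."""
    log_odds = b0 + b1 * x
    return 1 / (1 + math.exp(-log_odds))

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

b0, b1 = -1.0, 0.5            # hypothetical intercept and slope
odds_ratio = math.exp(b1)     # factor by which the odds change per one-unit increase in x

p0 = logistic_probability(b0, b1, 0)  # at x = 0 the log odds equal the intercept
p1 = logistic_probability(b0, b1, 1)
print(p0, p1, odds(p1) / odds(p0), odds_ratio)
```

The ratio of the odds at x = 1 to the odds at x = 0 matches `odds_ratio`, while the change in probability itself depends on where x starts.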

(52 cards)