Session 3 Flashcards

1
Q

what are the two most common processes for data mining?

A
  1. CRISP-DM
  2. SEMMA
2
Q

what is CRISP-DM? (6 steps)

A
  1. business understanding
  2. data understanding
  3. data preparation
  4. model building
  5. testing and evaluation
  6. deployment
3
Q

what is important in the business and data understanding stages?

A
  • nature of data may differ (different sources)
  • data is securely uploaded, stored and analysed
  • it is important to understand the data: what is likely affected by what, and what each variable means (means, medians, distributions, etc.)

Business: understand context, business needs and applications
Data: find patterns, trends and knowledge without complex models

4
Q

what four steps are happening in the data preparation step?

A
  1. data consolidation
  2. data cleaning
  3. data transformation
  4. data reduction
5
Q

what happens during data consolidation?

A

collect, select and integrate data (same formats for variables)

6
Q

what happens during data cleaning?

A

impute missing values (e.g., fill in averages)
reduce noise (errors, outliers, etc.)
eliminate duplicates
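
The cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the records, the field names `id` and `price`, and the choice of the mean as the fill value are made up for the example and are not from the course material.

```python
from statistics import mean

# Hypothetical listing records; None marks a missing price.
rows = [
    {"id": 1, "price": 100.0},
    {"id": 2, "price": None},
    {"id": 3, "price": 140.0},
    {"id": 3, "price": 140.0},  # exact duplicate of the previous row
]

# 1. Eliminate duplicates, keeping the first occurrence of each record.
seen = set()
unique_rows = []
for row in rows:
    key = (row["id"], row["price"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# 2. Impute missing prices with the average of the observed prices.
observed = [r["price"] for r in unique_rows if r["price"] is not None]
average_price = mean(observed)
cleaned = [
    {**r, "price": average_price if r["price"] is None else r["price"]}
    for r in unique_rows
]

print(cleaned)
```

Noise reduction (for example capping outliers) would be a further step and depends on what counts as an error in the data at hand.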

7
Q

what happens during data transformation?

A

normalize data (one variable ranges 0-10000, others 0-4: match the scales to prevent disproportionate influence)
discretize data (data from 0-100, make 4 buckets instead -> we did this with price)
create new attributes (ratio, amenities, etc.)
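
A minimal sketch of normalization and discretization, assuming min-max scaling and fixed bucket edges. The function names, the sample values and the edges 25/50/75 are hypothetical and chosen only to mirror the ranges mentioned above.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1.

    Assumes the values are not all identical (otherwise high - low is 0).
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def discretize(value, edges):
    """Return the index of the bucket that value falls into.

    edges holds the upper bounds of every bucket except the last one.
    """
    for bucket, upper in enumerate(edges):
        if value < upper:
            return bucket
    return len(edges)


# Two variables on very different scales end up on the same 0-1 scale.
large_scale = [0, 2500, 10000]
small_scale = [0, 1, 4]
normalized_large = min_max_normalize(large_scale)
normalized_small = min_max_normalize(small_scale)

# Values between 0 and 100 are placed into 4 buckets.
prices = [10, 30, 55, 99]
buckets = [discretize(p, edges=[25, 50, 75]) for p in prices]

print(normalized_large, normalized_small, buckets)
```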

8
Q

what happens during data reduction?

A

reduce dimensions of data (e.g., group US states into regions)
reduce volume (less important as computers become more capable)
balance data (not all 0s or all 1s; we did this with TVs or bathrooms)

9
Q

what are the three types of patterns that data mining is supposed to find?

A
  1. prediction
  2. association
  3. segmentation
10
Q

what is prediction? (what three types can be done)

A

clear trends to allow prediction:

11
Q

what are the two learning types?

A

supervised
unsupervised

12
Q

what is supervised learning?

A

known classes -> the output you want is known and the algorithm is trained on a labelled historical data set (input data has corresponding output labels / target labels)
-> the trained model should minimize the differences between predicted and true outputs

13
Q

what is unsupervised learning?

A

no known classes or labels -> the algorithm tries to find groupings on its own based on patterns and structures in the data (e.g., clustering)

14
Q

what is a regression?

A

based on historical data in a scatterplot, we estimate a line that minimizes the deviation between the observations and this regression line
-> the regression line can then be used for prediction

15
Q

what are the two variables in a regression?

A

dependent: left-hand side of the equation (y axis), e.g., price
independent: right-hand side of the equation (x axis), e.g., number of amenities

16
Q

what is a classification?

A
  • used to analyse historical data to automatically generate a model that can predict future behaviour
  • in contrast to regression, the output variable is categorical (nominal or ordinal): not an arbitrary price, but e.g. either 0 or 1
17
Q

what are the four classification techniques?

A
  1. logistic regression
  2. decision trees (ID3)
  3. artificial neural networks
  4. support vector machines, Bayesian classifiers
18
Q

what is important terminology in data science/ mining?
features/attributes
target variable/ attribute/ label
Bias

A

features / attributes
target variable / attribute / label
bias = intercept

19
Q

what is the confusion matrix?

A

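tells us how good the prediction is by tabulating predicted against actual classes (true positives, false positives, true negatives, false negatives)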

20
Q

what is accuracy?

A

% of all predictions that are correct (true positives and true negatives)

21
Q

what is recall (sensitivity, hit rate)?

A

% of actual positive cases that are predicted correctly

-> how many of the actual positives were predicted as positive?

22
Q

what is the true negative rate (specificity)?

A

% of actual negative cases that are predicted correctly

23
Q

what is the false alarm rate?

A

% of actual negative cases that are wrongly predicted as positive

24
Q

what is precision?

A

% of actual positives among all predicted positives

–> how many of the predicted positives are actually positives
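
The metrics from the last few cards can all be read off the four cells of the confusion matrix. The sketch below uses made-up labels (1 = positive, 0 = negative) and hypothetical variable names. The formulas are the standard definitions.

```python
# Hypothetical actual and predicted classes for 10 cases.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)    # share of all predictions that are correct
recall = tp / (tp + fn)              # sensitivity / hit rate
specificity = tn / (tn + fp)         # true negative rate
false_alarm_rate = fp / (fp + tn)    # equals 1 - specificity
precision = tp / (tp + fp)           # actual positives among predicted positives

print(tp, fn, fp, tn)
print(accuracy, recall, specificity, false_alarm_rate, precision)
```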

25
Q

when is recall more important?

A

if costs of false negatives are higher (e.g., COVID-19 tests, promotional mailings, loan defaults)

26
Q

when is precision more important?

A

if costs of false positives are higher
(search engine results, spam filters)

27
Q

what are five validation techniques?

A
  • simple split
  • k-fold cross-validation (rotation estimation)
  • leave-one-out
  • bootstrapping
  • jackknifing
28
Q

what is the simple split?

A
  • split data into 2 mutually exclusive sets called training (70%) and testing (30%)
  • for ANN make three subsets (training: 60%, validation: 20%, and testing: 20%)
  • with the training data we develop the model, which is then validated with the held-out testing data
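
A minimal sketch of the simple split, assuming the records are shuffled first so that both sets are random samples. The function name `simple_split`, the seed and the toy data are hypothetical.

```python
import random


def simple_split(records, train_share=0.7, seed=0):
    """Shuffle the records and cut them into two mutually exclusive sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


data = list(range(100))  # stand-in for 100 observations
train, test = simple_split(data)

print(len(train), len(test))
```

The model would be developed on `train` only and evaluated once on `test`.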
29
Q

what is the k-fold validation?

A
  • we split the data into k mutually exclusive (ME) subsets (e.g., 10)
  • we use each subset once for testing and the rest for training
  • the training and testing experiment is repeated k times
  • aggregate the test results for a true estimate of prediction accuracy
  • never train and test on the same data
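
The rotation described above can be sketched as an index generator. This is an illustration, not a library API: `k_fold_splits` is a hypothetical helper, and in practice the data would usually be shuffled before the folds are formed.

```python
def k_fold_splits(n_records, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_records))
    folds = [indices[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i, test_fold in enumerate(folds):
        # Every fold except the current one is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_splits(n_records=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A model would be fitted on each `train_idx`, scored on the matching `test_idx`, and the k scores averaged.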
30
Q

what is a linear regression?

A

assumes a linear relationship between the two variables

31
Q

what is the goal of a linear regression model?

A

train the model's β coefficients so that the differences between actual and predicted Y become minimal -> once trained, plug in a value for X and you get a predicted value for Y
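
For a single X, the coefficients that minimize the squared differences have a closed form (ordinary least squares). The sketch below assumes that method. The data and the names `fit_line`, `amenities` and `price` are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                # slope
    beta0 = mean_y - beta1 * mean_x  # intercept
    return beta0, beta1


# Hypothetical observations: number of amenities and price.
amenities = [1, 2, 3, 4, 5]
price = [52, 61, 68, 81, 88]

beta0, beta1 = fit_line(amenities, price)
predicted_price = beta0 + beta1 * 6  # plug in a new X to get a predicted Y

print(beta0, beta1, predicted_price)
```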

32
Q

what is r squared?

A

how well do the predictions of the regression model actually match the data? (ranges from 0-1) (1 is best)
-> if 1: the model perfectly explains all variability in the data
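
One common way to compute R squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition. The numbers are invented and `r_squared` is a hypothetical helper.

```python
def r_squared(actual, predicted):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [52, 61, 68, 81, 88]
perfect_predictions = list(actual)        # every prediction matches exactly
rough_predictions = [50, 60, 70, 80, 90]  # close, but not exact

r2_perfect = r_squared(actual, perfect_predictions)
r2_rough = r_squared(actual, rough_predictions)

print(r2_perfect, r2_rough)
```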

33
Q

what are residuals?

A

differences between actual and predicted values

34
Q

what is the standard error?

A

precision of estimated coefficients (lower = better)

35
Q

what is a p-value?

A

tests whether the coefficients (β1 and β0) are statistically significant
-> if less than 0.05 it is significant (meaning: are the results actually meaningful or just due to chance?)

36
Q

what is the t statistic?

A

tests whether a single coefficient is significant
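
For a simple regression with one X, the t statistic of the slope is the estimated coefficient divided by its standard error. The sketch below uses the standard textbook formulas with made-up data. `slope_t_statistic` is a hypothetical helper, and turning t into a p-value would additionally need the t distribution with n - 2 degrees of freedom, which is left out here.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (beta1, standard_error, t) for the slope of a one-variable regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    beta0 = mean_y - beta1 * mean_x
    residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    residual_variance = sum(r ** 2 for r in residuals) / (n - 2)
    standard_error = math.sqrt(residual_variance / sxx)
    return beta1, standard_error, beta1 / standard_error


slope, standard_error, t_value = slope_t_statistic(
    [1, 2, 3, 4, 5], [52, 61, 68, 81, 88]
)
print(slope, standard_error, t_value)
```

A small standard error relative to the coefficient gives a large t value, which is what makes the coefficient significant.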

37
Q

what is the F statistic?

A

overall significance of your model with all variables (if F is large and p is low -> significant)

38
Q

what does it mean when a variable is level?

A

the variable is used in its original form. For example, if you’re using income measured in dollars as is, that’s a level variable.

39
Q

what does it mean when a variable is log?

A

the natural logarithm of the variable is used. Taking the log often helps interpret relationships as percentage changes and makes data with large ranges or skewness easier to analyze. For example, using the log of income allows you to interpret results in terms of proportional changes rather than absolute values.

40
Q

what are the four possible models?

A

level-level (y is level, x is level)
level-log (y is level, x is log)
log-level (y is log, x is level)
log-log (y is log, x is log)
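
The usual textbook reading of the three log variants can be checked numerically. The sketch below is an illustration only: all coefficient values and function names are invented, and the interpretations in the comments are approximations that hold for small changes.

```python
import math


def level_log(x, beta0=10.0, beta1=200.0):
    """y = beta0 + beta1 * log(x)"""
    return beta0 + beta1 * math.log(x)


def log_level(x, beta0=1.0, beta1=0.05):
    """log(y) = beta0 + beta1 * x, solved for y"""
    return math.exp(beta0 + beta1 * x)


def log_log(x, beta0=1.0, beta1=0.5):
    """log(y) = beta0 + beta1 * log(x), solved for y"""
    return math.exp(beta0 + beta1 * math.log(x))


# level-log: 1% more x changes y by about beta1 / 100 units (here about 2).
change_level_log = level_log(101.0) - level_log(100.0)

# log-level: one more unit of x changes y by about 100 * beta1 percent (here about 5%).
pct_log_level = (log_level(11.0) / log_level(10.0) - 1) * 100

# log-log: 1% more x changes y by about beta1 percent (here about 0.5%).
pct_log_log = (log_log(101.0) / log_log(100.0) - 1) * 100

print(change_level_log, pct_log_level, pct_log_log)
```

In the level-level case no transformation is involved: one more unit of x changes y by beta1 units.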

41
Q

what is a level-level function?

A
42
Q

what is a level-log function?

A
43
Q

what is a log-level function?

A
44
Q

what is a log-log function?

A
45
Q

what are the assumptions in a linear regression model?

A
  1. linearity (a change in X is associated with a proportional change in Y)
  2. independence: the residuals (errors: differences between observed and predicted values) are independent = for one observation, they are not influenced by the residuals of any other observation
  3. homoscedasticity (the spread of the residuals is constant across all values of X)
46
Q

What are drawbacks of linear regression?

A
  • predicted probabilities could be less than 0 or greater than 1
  • there could be heteroscedasticity, in which case t and F statistics as well as standard errors are not generally valid
    -> we might reject a hypothesis although it is true
47
Q

what is heteroscedasticity?

A

the spread of errors (scatter = difference between predicted and actual values) changes depending on the value of the independent variable
-> e.g., you are testing how much people spend based on their income, but in real life the spread of what people with low income spend is much smaller than that of high-income people (some spend a lot, and some little)

48
Q

what is the formula of a logistic regression?

A

here the outcome (the dependent variable) can only be 1 or 0
-> price would not be a value such as 0.50€ but one of two classes (1 or 0)

49
Q

what is f(y) in the formula for logistic regression?

A

it is the probability (0-1) that the dependent variable (y) is 1
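
In the standard form of logistic regression, this probability comes from the logistic (sigmoid) function applied to β0 + β1·x. The sketch below assumes that standard form with a single x. The coefficient values and the function name are made up.

```python
import math


def probability_of_one(x, beta0, beta1):
    """Logistic function: P(y = 1 | x) = 1 / (1 + e^-(beta0 + beta1 * x))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


# Hypothetical coefficients.
beta0, beta1 = -2.0, 0.5

p_at_0 = probability_of_one(0.0, beta0, beta1)    # only the intercept matters here
p_at_4 = probability_of_one(4.0, beta0, beta1)    # beta0 + beta1 * 4 = 0, so p = 0.5
p_at_20 = probability_of_one(20.0, beta0, beta1)  # approaches, but never exceeds, 1

print(p_at_0, p_at_4, p_at_20)
```

Unlike a linear regression, the output always stays strictly between 0 and 1.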

50
Q

what is β0 in the logistic regression formula?

A

it is the intercept:
baseline or starting value, representing the log odds of y=1 when all independent variables (x) are 0.

51
Q

what is the slope in the logistic regression formula?

A

This measures the change in the log odds of y=1 for a one-unit increase in x.

52
Q

what does e^β1 mean?

A

gives the odds ratio, which tells you how the odds change with a one-unit increase in x
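
The odds-ratio reading can be verified numerically: dividing the odds at x + 1 by the odds at x gives e^β1, regardless of the starting x. The coefficients below are hypothetical, and the model is assumed to be the standard one-variable logistic regression.

```python
import math

# Hypothetical coefficients of a fitted logistic regression.
beta0, beta1 = -2.0, 0.5


def odds(x):
    """Odds of y = 1, i.e. p / (1 - p), at a given x."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return p / (1.0 - p)


odds_ratio = odds(3.0) / odds(2.0)  # effect of a one-unit increase in x
expected = math.exp(beta1)          # e^beta1, about 1.65

print(odds_ratio, expected)
```

With these numbers, each additional unit of x multiplies the odds of y = 1 by roughly 1.65.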