Session 3 Flashcards
what are the two most common processes for data mining?
- CRISP-DM
- SEMMA
what is CRISP-DM? (6 steps)
- business understanding
- data understanding
- data preparation
- model building
- testing and evaluation
- deployment
what is important in the business and data understanding stages?
- nature of data may differ (different sources)
- data must be securely uploaded, stored and analysed
- it is important to understand the data, to be able to realize what is likely impacted by what and what meaning the data has (means, medians, distributions etc.)
Business: understand content and business needs and applications
Data: find patterns, trends and knowledge without complex models
what four steps are happening in the data preparation step?
- data consolidation
- data cleaning
- data transformation
- data reduction
what happens during data consolidation?
collect, select and integrate data (same formats for variables)
what happens during data cleaning?
impute missing values (take averages)
reduce noise (errors, outliers etc.)
eliminate duplicates
what happens during data transformation?
normalize data (one variable ranges 0-10000, others 0-4; match scales to prevent disproportionate influence)
discretize data (data from 0-100, make 4 buckets instead -> we did this with price)
create new attributes (ratio, amenities etc.)
what happens during data reduction?
reduce dimensions of the data (e.g., aggregate US states into regions)
reduce volume (less important as computers become more capable)
balance data (not all 0s or all 1s; we did this with TVs or bathrooms)
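A minimal pandas sketch of the data preparation steps above, assuming a made-up Airbnb-style listings table (all column names and values are illustrative, not the course dataset):

```python
import pandas as pd

# Hypothetical listings data; column names and values are illustrative only.
df = pd.DataFrame({
    "price":     [120.0, None, 85.0, 300.0, 95.0, 95.0],
    "amenities": [10, 4, 7, 25, 3, 3],
    "bathrooms": [1, 1, 2, 3, 1, 1],
})

# data cleaning: impute missing values (here: mean) and eliminate duplicates
df["price"] = df["price"].fillna(df["price"].mean())
df = df.drop_duplicates()

# data transformation: normalize to a common 0-1 scale so variables with
# large ranges do not dominate
df["amenities_norm"] = (df["amenities"] - df["amenities"].min()) / (
    df["amenities"].max() - df["amenities"].min()
)

# data transformation: discretize price into 4 buckets (as done with price)
df["price_bucket"] = pd.qcut(df["price"], q=4, labels=False, duplicates="drop")

# data transformation: create a new attribute (a ratio)
df["amenities_per_bathroom"] = df["amenities"] / df["bathrooms"]

print(df)
```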
what are the three types of patterns that data mining is supposed to find?
- prediction
- association
- segmentation
what is prediction? (what three types can be done)
clear trends in the data allow prediction:
what are the two learning types?
supervised
un-supervised
what is supervised learning?
known classes -> the output you want is known and the algorithm is trained on a labelled historic data set (input data has corresponding output labels / target labels)
–> the trained model should minimize the differences between predicted and true outputs
what is unsupervised learning?
no known classes/labels –> the algorithm tries to find patterns and structures in the data on its own (e.g., clustering)
what is a regression?
based on historical data in a scatterplot, we estimate a line that minimizes the deviation between observation and this regression line
-> regression line can then be used as a prediction
what are the two variables in a regression?
dependent variable: left-hand side (y axis), e.g., price
independent variable: right-hand side (x axis), e.g., number of amenities
what is a classification?
- used to analyse historical data to automatically generate a model that can predict future behaviour
- in contrast to regression, the output variable here is categorical (nominal or ordinal): not a continuous value like a price, but e.g. either 0 or 1
what are the four classification techniques?
- logistic regression
- decision trees (ID3)
- artificial neural networks
- support vector machines, Bayesian classifiers
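A minimal scikit-learn sketch of one of these techniques, a decision tree classifier (note scikit-learn implements CART-style trees rather than ID3; the data below is made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up labelled data: [amenities, bathrooms] -> expensive listing (1) or not (0).
X_train = [[3, 1], [5, 1], [12, 2], [20, 3], [4, 1], [15, 2]]
y_train = [0, 0, 1, 1, 0, 1]

# Supervised learning: the tree is trained on inputs with known target labels.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Classify a new, unseen observation.
print(clf.predict([[10, 2]]))
```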
what is important terminology in data science/ mining?
features / attributes
target variable / attribute / label
bias (= intercept)
what is the confusion matrix?
a table comparing predicted vs. actual classes (true/false positives and negatives); it tells us how good the prediction is
what is accuracy?
% of all predictions that are correct (true positives + true negatives over all cases)
what is recall (sensitivity, hit rate)?
% of actual positive cases that are correctly predicted as positive
–> how many of the actual positives were found?
what is the true negative rate (specificity)?
% of actual negative cases that are correctly predicted as negative
what is the false alarm rate?
% of actual negative cases that are wrongly predicted as positive (= 1 - specificity)
what is precision?
% of true positives among all positive predictions
–> how many of the predicted positives are actually positive?
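A minimal sketch, with made-up counts, of how these metrics are computed from a confusion matrix:

```python
# Made-up confusion-matrix counts.
TP, FN = 40, 10   # actual positives: predicted correctly / missed
TN, FP = 35, 15   # actual negatives: predicted correctly / falsely flagged

accuracy         = (TP + TN) / (TP + TN + FP + FN)  # share of correct predictions
recall           = TP / (TP + FN)                   # sensitivity / hit rate
specificity      = TN / (TN + FP)                   # true negative rate
false_alarm_rate = FP / (FP + TN)                   # = 1 - specificity
precision        = TP / (TP + FP)                   # correct among predicted positives

print(accuracy, recall, specificity, false_alarm_rate, precision)
```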
when is recall more important?
if costs of false negatives are higher (corona, promotional mailing, loan defaults)
when is precision more important?
if costs of false positives are higher
(search engine results, spam filters)
what are five validation techniques?
- simple split
- k fold validation / cross validation ( rotation estimation)
- leave-one-out
- bootstrapping
- jackknifing
what is the simple split?
- split data into 2 mutually exclusive sets called training (70%) and testing (30%)
- for ANN make three subsets (training: 60%, validation: 20%, and testing: 20%)
- with the training data we develop the model, which is then validated with the held-out testing data
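A minimal sketch of a simple split using scikit-learn's train_test_split (data is made up):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]     # made-up feature values
y = [i % 2 for i in range(100)]   # made-up binary labels

# Simple split: 70% training, 30% testing, mutually exclusive.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 70 30
```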
what is the k-fold validation?
- we split the data into k mutually exclusive subsets (e.g., k = 10)
- each subset is used once for testing while the rest is used for training
- the test-and-train experiment is repeated k times
- the k test results are aggregated for a true estimation of prediction accuracy
- never train and test on the same data
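A minimal sketch of 10-fold cross validation using scikit-learn's cross_val_score, with a logistic regression as the model (data is made up):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = [[i] for i in range(100)]   # made-up feature values
y = [0] * 50 + [1] * 50         # made-up binary labels

# k = 10: each fold is used once for testing while the other 9 are used
# for training; the 10 test scores are then aggregated.
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores.mean())
```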
what is a linear regression?
assumes a linear relationship between the two variables
what is the goal of a linear regression model?
train the model’s ß so that the differences between actual and predicted Y become minimal –> once trained, plug in a value for X and you get a predicted value for Y
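A minimal sketch of training a linear regression with scikit-learn on made-up price/amenities data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: price (Y) explained by number of amenities (X).
X = np.array([[3], [5], [8], [12], [20]])
y = np.array([80, 95, 120, 150, 210])

# Fitting picks the ß values that minimize the squared differences
# between actual and predicted Y.
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # ß0 and ß1
print(model.predict([[10]]))          # plug in X to get a predicted Y
```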
what is r squared?
how well do the predictions of the regression model actually match the data? (ranges from 0-1) (1 is best)
–> if 1: the model perfectly explains all variability in the data
what are residuals?
differences between actual and predicted values
what is the standard error?
precision of estimated coefficients (lower = better)
what is a p-value?
tests if the coefficients are statistically significant (ß1 and ß0)
–> if less than 0.05 it is significant (means: are the results actually meaningful or just by chance?)
what is the t statistic?
tests if a single coefficient is significant
what is the F statistic?
tests the overall significance of your model with all variables (if F is large and p is low –> significant)
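A minimal sketch, assuming the statsmodels package and made-up data, that produces R-squared, standard errors, t statistics, p-values, and the F statistic in one summary:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: price explained by number of amenities.
amenities = np.array([3, 5, 8, 12, 20, 7, 15, 4])
price     = np.array([80, 95, 120, 150, 210, 110, 170, 90])

X = sm.add_constant(amenities)      # adds the intercept (ß0)
results = sm.OLS(price, X).fit()

# The summary reports R-squared, standard errors, t statistics,
# per-coefficient p-values, and the overall F statistic.
print(results.summary())
print(results.rsquared, results.pvalues, results.fvalue)
```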
what does it mean when a variable is level?
the variable is used in its original form. For example, if you’re using income measured in dollars as is, that’s a level variable.
what does it mean when a variable is log?
the natural logarithm of the variable is used. Taking the log often helps interpret relationships as percentage changes and makes data with large ranges or skewness easier to analyze. For example, using the log of income allows you to interpret results in terms of proportional changes rather than absolute values.
what are the four possible models ?
level-level (y is level, x is level)
level-log (y is level, x is log)
log-level (y is log, x is level)
log-log (y is log, x is log)
what is a level-level function?
y = ß0 + ß1x -> a one-unit increase in x changes y by ß1 units
what is a level-log function?
y = ß0 + ß1·ln(x) -> a 1% increase in x changes y by about ß1/100 units
what is a log-level function?
ln(y) = ß0 + ß1x -> a one-unit increase in x changes y by about (100·ß1)%
what is a log-log function?
ln(y) = ß0 + ß1·ln(x) -> a 1% increase in x changes y by about ß1% (elasticity)
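A minimal sketch of fitting a log-log specification with statsmodels (made-up data; both variables must be strictly positive to take logs):

```python
import numpy as np
import statsmodels.api as sm

# Made-up data; both variables are strictly positive so logs are defined.
amenities = np.array([3, 5, 8, 12, 20, 7, 15, 4])
price     = np.array([80, 95, 120, 150, 210, 110, 170, 90])

# log-log model: ln(price) = ß0 + ß1 * ln(amenities)
# ß1 is an elasticity: a 1% increase in amenities is associated with
# roughly a ß1% change in price.
X = sm.add_constant(np.log(amenities))
results = sm.OLS(np.log(price), X).fit()
print(results.params)
```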
what are the assumptions in a linear regression model?
- linearity (a change in X is associated with a proportional change in Y)
- independence: the residuals (errors: differences between observed and predicted values) are independent = for one observation, they are not influenced by the residuals of any other observation
- homoscedasticity: the spread of the residuals is constant across values of the independent variable
What are drawbacks of linear regression?
- predicted probabilities could be less than 0 or greater than 1 (when a binary outcome is modelled with linear regression)
- there could be heteroscedasticity; in that case t and F statistics as well as standard errors are not generally valid
–> we might reject a hypothesis although it is true
what is heteroscedasticity?
the spread of errors (scatter = difference between predicted and actual values) changes depending on the value of the independent variable
–> e.g., you are testing how much people spend based on their income, but in real life the spread of what people with low income spend is much lower than that of high income people (some spend much, and some little)
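A minimal simulation sketch (numpy only, made-up income/spending data) illustrating how heteroscedastic residuals show a larger spread for high values of the independent variable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated income (independent variable) and spending, where the error
# spread grows with income -> heteroscedasticity.
income = rng.uniform(20_000, 120_000, size=500)
spending = 0.3 * income + rng.normal(0, 0.05 * income)

# Fit a simple linear regression and inspect the residuals.
slope, intercept = np.polyfit(income, spending, deg=1)
residuals = spending - (intercept + slope * income)

# Residual spread in the low- vs. high-income half of the sample:
low = income < np.median(income)
print(residuals[low].std(), residuals[~low].std())  # the high-income spread is larger
```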
what is the formular of a logistic regression?
here the dependent variable (y) can only take the outcomes 1 or 0 (e.g., has a certain amenity or not)
-> the prediction is not a continuous value like a price of 0.50€ but a probability between 0 and 1:
f(y) = 1 / (1 + e^-(ß0 + ß1x))
what is f(y) in the formular for logistic regression?
it is the probability (0-1) that the dependent variable (y) is 1
what is ß0 in the logistic regression formular?
it is the intercept:
baseline or starting value, representing the log odds of y=1 when all independent variables (x) are 0.
what is the slope in the logistic regression formular?
This measures the change in the log odds of y=1 for a one-unit increase in x.
what does e^ß1 mean?
gives the odds ratio, which tells you how the odds change with a one-unit increase in x
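A minimal sketch, assuming statsmodels and made-up booking data, that fits a logistic regression, evaluates f(y) at a given x, and computes the odds ratio e^ß1:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: is a listing booked (1/0) depending on its number of amenities?
amenities = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
booked    = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X = sm.add_constant(amenities)
result = sm.Logit(booked, X).fit(disp=0)
beta0, beta1 = result.params

print(1 / (1 + np.exp(-(beta0 + beta1 * 6))))  # f(y): probability that y = 1 at x = 6
print(np.exp(beta1))                           # e^ß1: odds ratio per one-unit increase in x
```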