Exam 2 Flashcards
(34 cards)
Why we use exploratory modeling
to obtain the best-fit model using all observations
Why we use predictive modeling
to predict outcomes for new data; observations are split into a training set and a validation set
What are training sets used for
create the model
what are validation sets used for
evaluate accuracy of the model
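A minimal Python sketch of the training/validation split described in the cards above. The 80/20 ratio, the fixed seed, and the function name split_observations are assumptions for illustration only.

```python
import random

def split_observations(observations, train_fraction=0.8, seed=0):
    """Shuffle the observations and split them into training and validation sets."""
    shuffled = list(observations)          # copy so the input is left unchanged
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cutoff = int(len(shuffled) * train_fraction)
    # First part creates the model, the rest evaluates its accuracy.
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical data: ten observations labeled 0..9.
training_set, validation_set = split_observations(range(10))
```

Every observation ends up in exactly one of the two sets, so the model is never evaluated on data it was created from.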
predictors
the input variables used to estimate the target
target
the variable we are trying to estimate; comparing the estimates with the actual values tests the model
regression
determining the relationship between a variable and one or more other variables
linear regression
given a set of observations, determine the equation of a line that can be used to describe the dataset
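One common way to determine that line is ordinary least squares. The sketch below assumes a single predictor; the function name fit_line and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes the x values are not all equal
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```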
types of error
error = difference between the actual value and the estimated value
mean error = average of the errors
mean square error (MSE) = average of the squared errors
root mean square error (RMSE) = square root of the MSE
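A short sketch of these error measures in Python. It assumes each error is computed as actual minus estimated (the sign convention is an assumption), and the function name error_metrics and the sample values are hypothetical.

```python
import math

def error_metrics(actual, estimated):
    """Return (mean error, MSE, RMSE) for paired actual and estimated values."""
    errors = [a - e for a, e in zip(actual, estimated)]  # assumed sign: actual - estimated
    mean_error = sum(errors) / len(errors)
    mse = sum(err ** 2 for err in errors) / len(errors)  # average of squared errors
    rmse = math.sqrt(mse)                                # back in the target's units
    return mean_error, mse, rmse

# Hypothetical values: the errors are 1, 0 and -2.
mean_error, mse, rmse = error_metrics([3, 5, 7], [2, 5, 9])
```

Positive and negative errors can cancel in the mean error, which is why the squared versions are usually the ones compared between models.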
How do you know from r-square if a model is accurate?
the closer r-square is to 1, the better the model fits the data
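For reference, a sketch of how r-square can be computed, assuming the usual definition of 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the sample values are hypothetical.

```python
def r_squared(actual, estimated):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # assumes actual values vary
    return 1 - ss_res / ss_tot

# Estimates that match the actual values exactly.
perfect = r_squared([1, 2, 3], [1, 2, 3])
# Estimates that always guess the mean of the actual values.
rough = r_squared([1, 2, 3], [2, 2, 2])
```

Under this definition a perfect fit scores 1, and a model that only ever predicts the mean scores 0.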
validation set
used to test a model
why do we split validation and training sets
so the model can learn from one part of the data (training set) and be tested on data it has not seen (validation set)
class
category for data
when would we want to use a class?
to identify a label for the data points
what is the max of k
the size of the training set (the number of training observations)
What is data normalization
rescaling features to a common range so they are comparable. It is important with KNN because the distance squares the difference in each feature, so a feature with a large range would otherwise dominate
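A sketch of why feature scale matters for KNN distances, assuming min-max rescaling to the range 0 to 1 (one common choice, not necessarily the one used in class). The function name min_max_normalize and the income and age values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features on very different scales.
incomes = [30000, 60000, 90000]
ages = [20, 40, 60]

# Raw squared differences between the first two rows: income swamps age.
raw_income_gap = (incomes[0] - incomes[1]) ** 2
raw_age_gap = (ages[0] - ages[1]) ** 2

# After rescaling, both features contribute on the same scale.
scaled_incomes = min_max_normalize(incomes)
scaled_ages = min_max_normalize(ages)
```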
regression model vs classification model
regression = numeric outcomes
classification = categorical outcomes
what type of data can we use with KNN
numeric data; categorical data must first be prepared by converting it to numbers (like we did with the car example in class)
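One possible way to prepare categorical data for KNN is one-hot encoding, sketched below. This is an assumption about the preparation step, not a record of the in-class car example; the function name one_hot and the body-style values are hypothetical.

```python
def one_hot(values):
    """Turn each category into a list of 0/1 flags, one flag per distinct category."""
    categories = sorted(set(values))  # fixed column order: alphabetical
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical categorical feature; the columns are sedan, suv, truck.
encoded = one_hot(["sedan", "truck", "sedan", "suv"])
```

Each row is now numeric, so it can take part in a distance calculation.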
true or false: scatter plots are good for showing change over time
false
how much data do you use to find k (what is the max value of k)?
set aside 80% of the total data as the training set; the max value of k is the size of that training set
KNN and Bayes differences
KNN is based on the Euclidean distance between numeric data; Bayes is based on probabilities computed from categorical data
why do we use categorical data for Bayes
because it is probability based: the probabilities come from counting how often each category occurs
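A small sketch of the probability-based idea, assuming a single categorical feature and Bayes' rule applied to simple counts. The function name class_probabilities and the weather records are hypothetical.

```python
from collections import Counter

def class_probabilities(records, feature_value):
    """records: list of (feature_value, class_label) pairs.
    Returns P(class | feature_value) for each class using counts and Bayes' rule."""
    class_counts = Counter(label for _, label in records)
    total = len(records)
    scores = {}
    for label, count in class_counts.items():
        prior = count / total  # P(class)
        matches = sum(1 for value, lab in records
                      if lab == label and value == feature_value)
        likelihood = matches / count  # P(feature_value | class)
        scores[label] = prior * likelihood
    evidence = sum(scores.values())  # assumes feature_value appears in the records
    return {label: score / evidence for label, score in scores.items()}

# Hypothetical records: weather category and the class we want to predict.
records = ([("sunny", "play")] * 3 + [("rainy", "play")] +
           [("sunny", "stay")] + [("rainy", "stay")] * 3)
probabilities = class_probabilities(records, "sunny")
```

Everything here is a count of categories, which is why categorical data is a natural fit.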
why do we use numerical data for KNN
to measure the Euclidean distance between the data points
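A minimal KNN classification sketch built on Euclidean distance, assuming a simple majority vote among the k nearest training points. The function name knn_classify and the sample points are hypothetical; math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """training: list of (features, label) pairs.
    Returns the majority label among the k training points nearest to query."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two numeric features per point, two classes.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
label = knn_classify(training, (1.1, 1.0), k=3)
```

Because the neighbors are drawn from the training points, k cannot usefully exceed the number of training observations.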
what is the best chart for comparing 2 things
bar chart