Exam 2 Flashcards
Why we use exploratory modeling
obtain the best fit model from all observations
Why we use predictive modeling
to predict outcomes for new data; split observations into a training set and a validation set so the model is checked on data it was not built from
What are training sets used for
create the model
what are validation sets used for
evaluate accuracy of the model
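A minimal sketch of splitting observations into a training set and a validation set. The 80% training fraction, the fixed seed, and the function name `train_validation_split` are assumptions for illustration, not a required method.

```python
import random

def train_validation_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows, then cut them into (training, validation) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy so the caller's list is not changed
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

observations = list(range(10))     # stand-in for 10 observations
training, validation = train_validation_split(observations)
print(len(training), len(validation))
```

The training rows are used to create the model; the validation rows are held back and only used to evaluate it.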
predictors
the input variables (features) used to estimate the target
target
the outcome variable the model is trying to estimate
regression
determining the relationship between a variable and one or more other variables
linear regression
given a set of observations, determine the equation of a line that can be used to describe the dataset
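One common way to find that line is ordinary least squares. The sketch below assumes a single predictor and uses made-up points; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]      # predictor values (made up)
ys = [3, 5, 7, 9]      # target values (made up, exactly y = 2x + 1)
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```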
types of error
error = difference between the actual value and the estimated (predicted) value
mean error (ME) = average of the errors
mean square error (MSE) = average of the squared errors
root mean square error (RMSE) = square root of the MSE
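A short worked example of these error measures. The numbers are made up, and the sign convention (actual minus predicted) is an assumption; some courses use predicted minus actual.

```python
import math

actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

# error for each observation (assumed convention: actual - predicted)
errors = [a - p for a, p in zip(actual, predicted)]

mean_error = sum(errors) / len(errors)            # positive and negative errors can cancel
mse = sum(e ** 2 for e in errors) / len(errors)   # squaring removes the sign
rmse = math.sqrt(mse)                             # back in the same units as the target

print(errors, mean_error, mse, rmse)
```

Because positive and negative errors cancel in the mean error, MSE and RMSE are usually the more useful measures of accuracy.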
How do you know if r-square indicates a good fit?
the closer it is to 1, the more of the variation in the target the model explains
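A small sketch of how r-square can be computed, assuming the usual definition 1 - SS_res / SS_tot. The values are made up for illustration.

```python
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

mean_actual = sum(actual) / len(actual)

# variation the model failed to explain
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# total variation of the target around its mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```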
validation set
used to test a model
why do we split validation and training sets
so the model can learn from one part of the data (training) and be tested on data it has not seen (validation)
class
category for data
when would we want to use a class?
to identify a label for the data points
what is the max of k
the number of observations in the training set
What is data normalization
rescaling features to a common range (for example 0 to 1). It is important with KNN because the distance squares the differences in features, so a feature with a large range would dominate
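A minimal sketch of rescaling one feature to the range 0 to 1 (min-max scaling), which is one common form of normalization used before KNN. The function name and the sample values are assumptions for illustration.

```python
def min_max_normalize(values):
    """Rescale a list of numbers so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # every value is the same, so there is no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

prices = [10000, 20000, 30000]   # large-range feature (made up)
scaled = min_max_normalize(prices)
print(scaled)
```

After scaling, a feature measured in the tens of thousands no longer outweighs a feature measured in single digits when distances are computed.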
regression model vs classification model
regression = numeric outcomes
classification = categorical outcomes
what type of data can we use with KNN
numeric data; categorical features must be prepared first by converting them to numbers (like we did with car example in class)
scatter plots are good to show change over time: true or false?
false (scatter plots show the relationship between 2 variables)
how much data do you use to find K (max value of k?)
set aside 80% of the total data as the training set; k can be at most the number of training observations
Knn and Bayes differences
KNN classifies using Euclidean distance between numeric features; Bayes classifies using probabilities computed from categorical data
why do we use categorical for bayes
Bayes is probability based: it counts how often each category value occurs within each class
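A rough sketch of a Bayes classifier on categorical data, built only from counts. It assumes features are independent within each class and uses no smoothing; the weather rows and the helper names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """rows is a list of (features, label); returns the counts the classifier needs."""
    class_counts = Counter(label for _, label in rows)
    # feature_counts[(label, i)][value] = times feature i had that value in that class
    feature_counts = defaultdict(Counter)
    for features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict(model, features):
    """Pick the class with the highest P(class) * product of P(value | class)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                                   # P(class)
        for i, value in enumerate(features):
            score *= feature_counts[(label, i)][value] / count  # P(value | class)
        scores[label] = score
    return max(scores, key=scores.get)

rows = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("rainy", "cool"), "yes"),
]
model = train_naive_bayes(rows)
prediction = predict(model, ("rainy", "mild"))
print(prediction)
```

Without smoothing, a value never seen in a class gives that class a probability of zero, which is one reason the size and quality of the training data matter.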
why do we use numerical for knn
to measure the euclidean distance between the data
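A minimal KNN sketch: find the k training points closest to a new point and take a majority vote on their classes. The points, labels, and the name `knn_predict` are assumptions for illustration; `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(training, query, k):
    """training is a list of (features, label); returns the majority label of the k nearest."""
    # sort training rows by Euclidean distance to the query point, keep the closest k
    neighbors = sorted(training, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]
prediction = knn_predict(training, (1.2, 1.5), k=3)
print(prediction)
```

Note that k cannot be larger than the number of training rows, since there are no more neighbors to vote.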
what is the best chart for comparing 2 things
bar chart
what is the best chart for finding proportions
pie chart
shows the relationship between 2 variables
scatter plot
what chart is used for change over time
line plot
one hot encoding
each category becomes a string of 0s with a single 1, e.g. for 3 categories: 100, 010, 001
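A small sketch of one hot encoding a single categorical value. The color categories and the `one_hot` helper are made up for illustration.

```python
def one_hot(value, categories):
    """Return a list with a 1 in the position of the matching category and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

categories = ["red", "green", "blue"]
encoded = one_hot("blue", categories)
print(encoded)
```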
euclidean distance equation
d = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
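Euclidean distance is the square root of the sum of squared differences between two points' features. A minimal sketch, with the function name and sample points assumed for illustration:

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared differences, feature by feature."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

distance = euclidean_distance((0, 0), (3, 4))   # the 3-4-5 right triangle
print(distance)
```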
what kind of data do we use for bayes classifiers
categorical, but the features must be converted to numeric codes before they are given to the classifier
What affects the accuracy of a Bayes classifier?
the assumption that features are independent, the quality and size of the training data, feature relevance, distribution of features, class imbalance, parameter estimation, data preprocessing, outliers
What is data imbalance, and why does that matter for Bayes classifiers?
when some classes have far more observations than others; it leads to biased predictions and unreliable probability estimates for minority classes
What is the relationship between skewness in a histogram and a box plot?
skew left = long tail on the left, mean is below median
symmetric = mean = median
skew right = long tail on the right, mean is above median
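A quick check of how the mean and median compare under skew, using made-up datasets. In a right-skewed dataset the long tail pulls the mean above the median; in a left-skewed dataset it pulls the mean below the median.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]         # one large value forms a right tail
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]    # one small value forms a left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, statistics.mean(data), statistics.median(data))
```

On a box plot the same skew shows up as a longer whisker on the side of the tail.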