Chapter 4 Quiz Flashcards
number of predictors or input variables used by the model
dimensionality
average, stdev, min, max, median, count
summary statistics
gives info on scale, types of values, extremes, central values, skew, dispersion
summary statistics
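The summary statistics listed above can be computed with Python's standard library; a minimal sketch using made-up sample values:

```python
import statistics

# Hypothetical sample: weekly sales counts (illustrative values only)
sales = [12, 15, 9, 22, 18, 15, 30]

summary = {
    "count": len(sales),
    "average": statistics.mean(sales),
    "stdev": statistics.stdev(sales),   # sample standard deviation
    "min": min(sales),
    "max": max(sales),
    "median": statistics.median(sales),
}
print(summary)
```

Comparing the average to the median, and the min/max to the central values, hints at skew and extremes in the data.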
presence of two or more predictors sharing the same linear relationship with the outcome variable
multicollinearity
how to avoid multicollinearity?
remove variables that are strongly correlated with other predictors in the correlation matrix
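Spotting candidates to remove comes down to computing pairwise correlations. A minimal sketch with a hand-rolled Pearson correlation and hypothetical predictors (the ~0.9 cutoff is a common rule of thumb, not a fixed rule):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical predictors: x2 is nearly a linear copy of x1
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9]   # roughly 2 * x1

r12 = pearson(x1, x2)
# |r12| is close to 1, so one of the pair is a candidate for removal
```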
combining close or similar categories is done through what?
expert knowledge and common sense
a few new variables formed as weighted linear combinations of the original variables, retaining the majority of the information of the full original set
principal component analysis
what subsets is PCA most valuable with?
subsets of variables that are on the same scale and highly correlated with one another
the line along which the variance of the data is maximal; equivalently, the line that minimizes the sum of squared perpendicular distances from the points
first principal component
the direction perpendicular to the first component (z1) that captures the second-largest amount of variability
second principal component
what does PCA allow you to see?
the structure of the data and the weight (loading) of each variable in the components
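For two variables, the PCA quantities on these cards can be worked out by hand from the 2x2 covariance matrix. A minimal pure-Python sketch with made-up, strongly correlated data (the closed-form eigenvalue/angle formulas below only apply to the 2x2 case):

```python
import math

# Hypothetical two variables on the same scale, strongly correlated
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]   # roughly 2 * xs

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Eigenvalues of [[sxx, sxy], [sxy, syy]]: variances along PC1 and PC2
mean_eig = (sxx + syy) / 2
delta = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = mean_eig + delta, mean_eig - delta

# Share of total variance captured by the first component
explained = lam1 / (lam1 + lam2)

# Loadings of PC1: since ys ~ 2 * xs, the weight ratio is about 2:1
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
w = (math.cos(theta), math.sin(theta))
```

With highly correlated variables, `explained` comes out near 1, which is exactly why PCA is most valuable on such subsets: one component stands in for both variables.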
what models can be used to help combine or remove categorical variables?
classification and regression trees
how to convert a binned categorical variable into a numerical one?
use the midpoint of each category's range
what does having too many dimensions cause?
sparsity
what does PCA want to do?
minimize the sum of squared perpendicular distances from the points to the line (equivalently, capture the most variance)
L1 (lasso) penalty; encourages sparsity, so many coefficients become exactly zero
regularization
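The sparsity effect of the L1 penalty comes from soft-thresholding: coefficients are shrunk toward zero, and small ones become exactly zero. A minimal sketch of that operator, not a full lasso solver; the coefficients and the penalty value 0.5 are made up:

```python
def soft_threshold(beta, lam):
    """L1 proximal operator: shrink beta toward zero by lam;
    anything with magnitude <= lam becomes exactly zero."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

coefs = [2.5, -0.3, 0.1, -1.8, 0.05]
shrunk = [soft_threshold(b, 0.5) for b in coefs]
# the three small coefficients are driven exactly to zero -> sparsity
```

Contrast this with an L2 (ridge) penalty, which shrinks coefficients proportionally but never sets them exactly to zero.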
how heavy the tails of the distribution are
kurtosis
how symmetric the data is
skewness
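Both moments can be computed directly from standardized deviations. A minimal sketch using the population (biased) formulas and made-up samples; note the kurtosis here is not excess kurtosis (a normal distribution gives 3, not 0):

```python
def skewness(xs):
    """Third standardized moment: 0 for symmetric data, > 0 for a right tail."""
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 3 for x in xs) / n

def kurtosis(xs):
    """Fourth standardized moment: larger values mean heavier tails."""
    n = len(xs)
    m = sum(xs) / n
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / s) ** 4 for x in xs) / n

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]   # one large value drags out the right tail
```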