Practice Machine Learning Flashcards
2019-01-16
caret package in R
The caret package (short for Classification And Regression Training) contains numerous functions that streamline the model training process for complex regression and classification problems, drawing on the rich set of models available in R.
The package utilizes a number of R packages but tries not to load them all at package start-up (by removing formal package dependencies, the package startup time can be greatly decreased). The package “suggests” field includes 30 packages. caret loads packages as needed and assumes that they are installed.
function createDataPartition
Creates stratified random splits of a data set.
stratification
Ensures that the random sampling is done in a way that each class is properly represented in both the training and test sets.
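A base-R sketch of what a stratified split looks like, using the built-in iris data (this mimics the behaviour of createDataPartition rather than calling it, so it runs without caret installed):

```r
# Stratified 80/20 split: sample row indices within each class so
# that every Species keeps the same share in the training set.
set.seed(1)
p <- 0.8  # fraction of each class that goes into training

train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(idx) sample(idx, floor(p * length(idx)))))

training <- iris[train_idx, ]
testing  <- iris[-train_idx, ]

table(training$Species)  # 40 of each class: proportions preserved
```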
partition
- a wall, screen, or piece of glass used to separate one area from another in a room or vehicle
- the process of dividing a country into two or more separate countries
nR04
the variable nR04 is the number of 4-membered rings in a compound
zero-variance predictor
A predictor that takes a single unique value across the samples and therefore carries no information. For example, a simple split of the data into a training and test set caused three descriptors to have a single unique value in the training set.
function nearZeroVar
The function nearZeroVar can be used to identify near zero-variance predictors in a data set: predictors with a large frequency ratio (count of the most common value divided by the count of the second most common) and a low percentage of unique values. It returns the column numbers of the predictors that violate these two conditions.
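A base-R sketch of those two conditions (the near_zero_var helper below is hypothetical, written to illustrate the idea; its defaults are chosen to match nearZeroVar's documented freqCut = 95/5 and uniqueCut = 10):

```r
# Flag columns whose most common value dominates (high frequency
# ratio) and which take few distinct values (low percent unique).
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  which(vapply(df, function(x) {
    tab <- sort(table(x), decreasing = TRUE)
    freq_ratio <- if (length(tab) > 1) tab[1] / tab[2] else Inf
    pct_unique <- 100 * length(unique(x)) / length(x)
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1)))
}

# Toy data: x2 is nearly constant, x1 is not
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = c(rep(0, 99), 1))
near_zero_var(d)  # flags column 2 only
```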
multicollinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
predictor
An independent variable, sometimes called an experimental or predictor variable, is a variable that is being manipulated in an experiment in order to observe the effect on a dependent variable, sometimes called an outcome variable.
VIF (variance inflation factor)
In linear models, the traditional method for reducing multicollinearity is to identify the offending predictors. For each variable, this statistic measures the increase in the variance of the model parameter estimate in comparison to the optimal situation (i.e., an orthogonal design).
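A hand-rolled sketch of the statistic (the car package's vif() is the usual tool; this version just illustrates the definition VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on all the others):

```r
# VIF for each column of a data frame of predictors: regress the
# column on the remaining columns and invert 1 - R^2.
vif <- function(X) {
  vapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ ., data = X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

set.seed(1)
x1 <- rnorm(200)
X <- data.frame(x1 = x1,
                x2 = x1 + rnorm(200, sd = 0.1),  # nearly collinear with x1
                x3 = rnorm(200))                 # unrelated to the others
v <- vif(X)
round(v, 1)  # x1 and x2 are heavily inflated; x3 sits near the optimum of 1
```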
PCA
Principal component analysis can reduce the number of variables while maintaining accuracy.
Principal component analysis is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
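In base R this transformation is prcomp; the components it returns are uncorrelated by construction:

```r
# PCA on the four numeric iris measurements: center and scale,
# then rotate into orthogonal principal components.
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pc)            # proportion of variance explained per component
scores <- pc$x[, 1:2]  # keep the first two components as new predictors
round(cor(scores), 3)  # off-diagonal correlation is zero
```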
orthogonal
Relating to or composed of right angles; in statistics, orthogonal variables are uncorrelated.
simple regression model
The idea is to fit a line to a set of data. The fitted line consists of a set of estimated coefficients, one for each predictor. For a new observation, we multiply its covariate values by the coefficients estimated with our prediction model, and the result is the prediction for that new value.
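In R that whole recipe is lm() plus predict(); a minimal example on the built-in mtcars data:

```r
# Fit a line (mpg as a function of weight), then apply the
# estimated coefficients to a new predictor value.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)  # intercept and slope

new_car <- data.frame(wt = 3.0)
predict(fit, newdata = new_car)

# Same prediction done by hand: intercept + slope * new value
sum(coef(fit) * c(1, 3.0))
```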
RMSE
Root mean squared error
The root-mean-square deviation or root-mean-square error is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed.
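Computed directly in base R it is a one-liner:

```r
# RMSE: square root of the mean squared difference between
# observed values and model predictions.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

fit <- lm(mpg ~ wt, data = mtcars)
rmse(mtcars$mpg, predict(fit))  # in the same units as mpg
```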
Feature selection with caret package in R
Three methods:
- Remove redundant features: findCorrelation
- Rank features by importance: Learning Vector Quantization (LVQ), decision trees
- Select features: Recursive Feature Elimination (RFE), random forest
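A base-R sketch of the first method (the drop_correlated helper below is hypothetical; caret's findCorrelation is smarter about which member of a correlated pair to drop, considering mean absolute correlations):

```r
# Drop one predictor from every pair whose absolute pairwise
# correlation exceeds a cutoff, keeping the earlier column.
drop_correlated <- function(df, cutoff = 0.9) {
  cors <- abs(cor(df))
  drop <- c()
  for (j in seq_len(ncol(cors))) {
    for (i in seq_len(j - 1)) {
      if (cors[i, j] > cutoff && !(i %in% drop) && !(j %in% drop)) {
        drop <- c(drop, j)
      }
    }
  }
  drop
}

set.seed(1)
x1 <- rnorm(100)
d <- data.frame(x1 = x1,
                x2 = x1 + rnorm(100, sd = 0.01),  # redundant copy of x1
                x3 = rnorm(100))
drop_correlated(d)  # drops column 2, which is nearly identical to x1
```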