PCA, LLE, t-SNE, RandomForest, XGBoost Flashcards
What are the key ideas behind PCA?
Principal Component Analysis is used in exploratory data analysis and for building predictive models. The data is projected onto the axis along which it has the greatest variance (the first principal component); a second axis, orthogonal to the first, is then added in the direction of the next-largest variance, and so on. The original data is thus approximated by a representation with far fewer dimensions that still summarizes it well. In practice the components are found as the eigenvectors of the data's covariance matrix, ordered by their eigenvalues.
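A minimal sketch using scikit-learn's PCA; the random data matrix here is purely for illustration, and the choice of two components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features (made-up data)

pca = PCA(n_components=2)              # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)       # project the data onto the principal components

print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
print(pca.components_)                 # the principal axes (eigenvectors of the covariance matrix)
```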
What are the key ideas behind LLE?
Locally Linear Embedding.
• For each object Xi, find a few neighbouring objects.
• Measure the distances between Xi and these neighbours.
• Find points Yi in a low-dimensional space that preserve all of these mutual distances, which reduces to a very simple optimization problem (see the sketch after this list).
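A minimal sketch with scikit-learn's LocallyLinearEmbedding; the swiss-roll dataset and the neighbour count are illustrative choices, not part of the card.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=10,   # how many neighbours define each local patch
                             n_components=2)   # dimensionality of the embedding space
Y = lle.fit_transform(X)                       # low-dimensional coordinates Yi
print(Y.shape)                                 # (1000, 2)
```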
What are the key ideas behind t-SNE?
t-distributed Stochastic Neighbour Embedding differs from PCA by being a nonlinear method, so it can handle datasets whose structure is not linearly separable. It models pairwise similarities between points as probabilities and places the points in a low-dimensional map so that neighbours in the original space remain neighbours.
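A minimal sketch with scikit-learn's TSNE; the digits dataset and the perplexity value are just convenient illustrations.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

tsne = TSNE(n_components=2,      # embed into 2D, typically for visualization
            perplexity=30,       # roughly the effective number of neighbours per point
            random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)          # (1797, 2)
```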
What are the key ideas behind RandomForest?
Random forests build an ensemble of decision trees, each trained on a bootstrap sample of the data and typically considering only a random subset of features at each split. A point is classified by majority vote: the class predicted by the most trees wins (see the sketch below).
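A hand-rolled sketch of that mechanism (bootstrap samples plus majority vote), only to show the idea; in practice you would use RandomForestClassifier directly. The tree count and dataset are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):                                    # grow 25 trees
    idx = rng.integers(0, len(X), size=len(X))         # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt") # random feature subset at each split
    trees.append(tree.fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])        # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y).mean())                          # training accuracy of the ensemble
```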
What are the most important parameters of RandomForest? What do they mean/Affect?
- n_estimators: number of trees in the forest
- max_features: maximum number of features considered when splitting a node
- max_depth: maximum number of levels in each decision tree
- min_samples_split: minimum number of data points a node must hold before it can be split
- min_samples_leaf: minimum number of data points allowed in a leaf node
- bootstrap: how data points are sampled for each tree (with or without replacement)
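A minimal sketch showing these parameters on scikit-learn's RandomForestClassifier; the values are arbitrary examples, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

clf = RandomForestClassifier(
    n_estimators=200,        # number of trees in the forest
    max_features="sqrt",     # features considered when looking for the best split
    max_depth=10,            # maximum depth of each tree
    min_samples_split=4,     # samples required to split an internal node
    min_samples_leaf=2,      # samples required at a leaf
    bootstrap=True,          # sample the training set with replacement
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```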
What is an OOB accuracy estimate? How can it be performed?
Out-of-bag (OOB) estimation is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating (bagging). In bagging, each tree is trained on a bootstrap sample drawn with replacement from the original dataset; the instances left out of that sample form the tree's out-of-bag set and act as a built-in validation set.
It is performed as follows:
- Find all models (or trees, in the case of a random forest) that were not trained on the OOB instance.
- Take the majority vote of these models' predictions for the OOB instance and compare it with the instance's true value.
- Aggregate the error over all OOB instances to obtain the OOB error (equivalently, the OOB accuracy) estimate.
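A minimal sketch: with bootstrap sampling enabled, scikit-learn computes the OOB accuracy automatically via oob_score=True. The dataset and tree count are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

clf = RandomForestClassifier(n_estimators=200,
                             bootstrap=True,   # OOB estimates require bootstrap sampling
                             oob_score=True,   # score each sample only with trees that never saw it
                             random_state=0)
clf.fit(X, y)
print(clf.oob_score_)   # OOB accuracy estimate, no separate validation set needed
```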
What is the overall idea of XGBoost?
Extreme Gradient Boosting is used for supervised learning problems. Boosting means building a strong learner from a number of weak ones trained sequentially: in gradient boosting, each new tree is fitted to correct the errors (the gradient of the loss) of the ensemble built so far, so the examples the current model gets wrong receive more attention from the next tree. The individual weak predictors are then combined into a stronger, more precise model. XGBoost can work on regression, classification, ranking, and user-defined prediction problems.
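A minimal sketch with the xgboost Python package (assumed to be installed alongside scikit-learn); the dataset and parameter values are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=300,    # number of boosting rounds (trees added sequentially)
                      learning_rate=0.1,   # shrinks each tree's correction of the current errors
                      max_depth=3)         # depth of the individual weak learners
model.fit(X_train, y_train)
print(model.score(X_test, y_test))         # accuracy on held-out data
```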