10 Statistical Techniques Flashcards
What is a random forest?
- very similar to bagging.
- each tree is trained on a bootstrap sample of your training set.
- faster, because each tree learns only from a random subset of features.
- in bagging, you give each tree the full set of features.
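The difference between the two can be sketched in a few lines. This is a minimal illustration rather than a full forest: the helper `draw_tree_inputs` and its parameters are hypothetical, and it picks one feature subset per tree as the card describes (many implementations instead re-draw the subset at every split).

```python
import random

def draw_tree_inputs(n_rows, n_features, max_features, rng):
    """Return (row_indices, feature_indices) for one tree.

    Hypothetical helper, for illustration only.
    """
    # Bootstrap: sample rows WITH replacement, same size as the training set.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random forest: restrict the tree to a random subset of the features.
    # Bagging corresponds to max_features == n_features (all features).
    features = sorted(rng.sample(range(n_features), max_features))
    return rows, features

rng = random.Random(0)
rf_rows, rf_features = draw_tree_inputs(10, 6, max_features=2, rng=rng)
bag_rows, bag_features = draw_tree_inputs(10, 6, max_features=6, rng=rng)
print("random forest tree sees features:", rf_features)
print("bagged tree sees features:", bag_features)
```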
“A data scientist is a person who is better at statistics than any _____ and better at programming than any _____.”
programmer, statistician
The two best-known techniques for shrinking the coefficient estimates towards zero
ridge regression and the lasso
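How the two penalties shrink a coefficient differs, and a simplified special case makes this visible. The sketch below assumes an identity design matrix (one observation per coefficient) and the objectives sum((y_j - b_j)^2) + lam * sum(b_j^2) for ridge and sum((y_j - b_j)^2) + lam * sum(|b_j|) for the lasso; the function names are illustrative.

```python
def ridge_shrink(beta_ols, lam):
    # Ridge scales every coefficient by the same factor; it never reaches exactly 0.
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    # Lasso soft-thresholds: move toward 0 by lam/2, and clip small values to exactly 0.
    half = lam / 2.0
    if beta_ols > half:
        return beta_ols - half
    if beta_ols < -half:
        return beta_ols + half
    return 0.0

for beta in (2.0, 0.3, -2.0):
    print(beta, ridge_shrink(beta, 1.0), lasso_shrink(beta, 1.0))
```

Under these assumptions the lasso can set a coefficient to exactly zero, which is why it also performs variable selection, while ridge only shrinks proportionally.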
What is bootstrapping in resampling?
sampling with replacement from the original data, and taking the “not chosen” data points as test cases. We can repeat this several times and use the average score as an estimate of our model's performance.
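The procedure can be sketched as follows. This is a minimal sketch: the function names, the toy data (made up for illustration), and the choice of "model" (the training mean) and score (mean absolute error) are all assumptions.

```python
import random

def bootstrap_split(n, rng):
    # Draw n indices with replacement; indices never drawn form the test set.
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def bootstrap_score(data, fit, score, n_rounds=100, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        train_idx, test_idx = bootstrap_split(len(data), rng)
        if not test_idx:  # rare: every point was drawn, nothing left to test on
            continue
        model = fit([data[i] for i in train_idx])
        scores.append(score(model, [data[i] for i in test_idx]))
    if not scores:
        raise ValueError("no round produced a non-empty test set")
    return sum(scores) / len(scores)

# Toy example: the "model" is the training mean, the score is mean absolute error.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]
fit_mean = lambda rows: sum(rows) / len(rows)
mae = lambda model, rows: sum(abs(r - model) for r in rows) / len(rows)
estimate = bootstrap_score(data, fit_mean, mae)
print("average out-of-sample error:", estimate)
```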
Techniques for nonlinear models
- step function
- piecewise function
- spline
- generalized additive model
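As an illustration of the first item, a step function cuts the range of x into bins and predicts a constant (here the mean of y) within each bin. A minimal sketch, with hypothetical function names and made-up toy data:

```python
from bisect import bisect_right

def fit_step_function(xs, ys, cutpoints):
    """Return the mean of y in each bin defined by sorted cutpoints."""
    sums = [0.0] * (len(cutpoints) + 1)
    counts = [0] * (len(cutpoints) + 1)
    for x, y in zip(xs, ys):
        k = bisect_right(cutpoints, x)  # index of the bin containing x
        sums[k] += y
        counts[k] += 1
    # Empty bins get None because there is nothing to average.
    return [s / c if c else None for s, c in zip(sums, counts)]

def predict_step(cutpoints, levels, x):
    return levels[bisect_right(cutpoints, x)]

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, 5, 5, 9, 9]
cutpoints = [2, 4]  # bins: x < 2, 2 <= x < 4, x >= 4
levels = fit_step_function(xs, ys, cutpoints)
print(levels, predict_step(cutpoints, levels, 3.5))
```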
What is Classification?
a data mining technique that assigns categories to a collection of data in order to aid in more accurate predictions and analysis.
Do body weight, calorie intake, fat intake, and participant age have an influence on heart attacks (Yes vs No)?
Logistic Regression
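Logistic regression fits this kind of question because the outcome is binary: it models the probability of "Yes" as a sigmoid of a weighted sum of the predictors. A minimal sketch fitted by plain gradient descent follows; the function names, learning rate, and the single-predictor toy data are made up for illustration and are not heart attack data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(xs, ys, lr=0.1, n_iter=5000):
    """xs: list of feature vectors, ys: list of 0/1 labels."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y  # gradient of the log-loss
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data: larger x makes the outcome 1 more likely (not perfectly separable).
xs = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
ys = [0, 0, 1, 0, 1, 1]
w, b = fit_logistic(xs, ys)
print("P(y=1 | x=2):", predict_proba(w, b, [2.0]))
```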
Types of Dimensionality reduction?
1) Principal Components Regression (unsupervised)
2) Partial least squares (PLS) (supervised)
How does discriminant analysis work?
it uses Bayes’ theorem to estimate the probability that a new observation belongs to each class, and assigns it to the most probable class.
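A minimal one-dimensional sketch of that Bayes step, assuming each class has a Gaussian density with its own mean and a shared variance (the setting of linear discriminant analysis). The class names, priors, and means below are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_posteriors(x, priors, means, var):
    # Bayes' theorem: posterior is proportional to prior times class density.
    joint = {k: priors[k] * gaussian_pdf(x, means[k], var) for k in priors}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

priors = {"A": 0.5, "B": 0.5}
means = {"A": 0.0, "B": 2.0}
post = class_posteriors(0.0, priors, means, var=1.0)
print(post, "-> predicted class:", max(post, key=post.get))
```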
Three Tree-Based Methods
1) Bagging
2) Boosting
3) Random forest
Why study Statistical Learning?
It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
What is the advantage of Boosting?
By combining many different narrowly tuned models and varying how each one is weighted, boosting balances their individual strengths and pitfalls and gives good predictive power over a wider range of input data.
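One concrete boosting algorithm is AdaBoost, used here purely as an illustration (the card does not name a specific algorithm). The sketch below combines one-dimensional threshold rules ("stumps") on made-up toy data; all function names are hypothetical.

```python
import math

def stump_predict(thr, polarity, x):
    return polarity if x >= thr else -polarity

def best_stump(xs, ys, weights):
    # Pick the threshold and polarity with the lowest weighted error.
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(thr, polarity, x) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)  # this model's vote weight
        ensemble.append((alpha, thr, pol))
        # Up-weight misclassified points so the next stump focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(thr, pol, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    vote = sum(alpha * stump_predict(thr, pol, x) for alpha, thr, pol in ensemble)
    return 1 if vote >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]  # no single threshold separates these labels
ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single stump classifies this toy set correctly, but the weighted vote of three stumps does.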
Types of Subset Selection?
1) Best-Subset Selection
2) Forward Stepwise Selection
3) Backward Stepwise Selection
4) Hybrid Methods
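Forward stepwise selection, for example, starts from the empty model and greedily adds the feature that improves the score most. In the sketch below, `score` is a hypothetical stand-in for whatever criterion is used (for example cross-validated error or adjusted R^2), and the same score is used to compare model sizes, which is a simplification. The feature names and gains are made up.

```python
def forward_stepwise(features, score):
    """Greedy search: return the best (subset, score) seen along the path."""
    selected = []
    remaining = list(features)
    path = [(tuple(selected), score(selected))]
    while remaining:
        # Try adding each remaining feature; keep the one that scores best.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        path.append((tuple(selected), score(selected)))
    return max(path, key=lambda item: item[1])

# Toy score: each feature has a made-up "gain", minus a penalty per feature used.
gains = {"age": 3.0, "weight": 2.0, "noise": 0.1}
toy_score = lambda subset: sum(gains[f] for f in subset) - 0.5 * len(subset)

best_subset, best_score = forward_stepwise(gains.keys(), toy_score)
print(best_subset, best_score)
```

With p features this evaluates on the order of p^2 candidate models, compared with 2^p for best-subset selection.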
What is a Support Vector Machine (SVM)?
a classifier that finds the hyperplane (an (n-1)-dimensional subspace of an n-dimensional space) that best separates two classes of points with the maximum margin.
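To make "maximum margin" concrete, the sketch below computes the margin of a given hyperplane w·x + b = 0, that is, the distance from the hyperplane to the closest point. It does not train an SVM; it only compares two hand-picked candidate hyperplanes on made-up points, and the function name is illustrative.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance of any point to the hyperplane w.x + b = 0.

    Labels are +1 / -1; a negative result means some point is misclassified.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
               for x, y in zip(points, labels))

points = [(0, 0), (0, 1), (2, 0), (2, 1)]
labels = [-1, -1, 1, 1]

centered = geometric_margin((1, 0), -1.0, points, labels)   # line x = 1
off_center = geometric_margin((1, 0), -0.5, points, labels)  # line x = 0.5
print(centered, off_center)
```

Both lines separate the classes, but the one midway between them has the larger margin, which is the one an SVM prefers.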
What is Principal Component Analysis (PCA)?
producing a low-dimensional representation of the dataset by identifying a set of linear combinations of features that have maximum variance and are mutually uncorrelated.
- helps in understanding latent interactions between the variables in an unsupervised setting.
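A minimal sketch of finding the first principal component, the linear combination with maximum variance, using power iteration on the covariance matrix. Power iteration is one simple way to get the leading eigenvector and is chosen here only to keep the example dependency-free; the function name and toy data are made up.

```python
import math

def first_principal_component(rows, n_iter=200):
    n, d = len(rows), len(rows[0])
    # Center each feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(n_iter):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Toy data lying close to the line y = x.
rows = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.0)]
pc1 = first_principal_component(rows)
print(pc1)  # both weights are close to 0.7, i.e. the direction of the line y = x
```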