Unit 5: Random Forests and Ensemble Learning Flashcards
What is a Random Forest, and how does it function?
Random Forest: An ensemble learning method that uses multiple decision trees to improve accuracy.
Functionality:
- Bagging: Each tree is trained on a bootstrap sample of the data (sampled with replacement), producing diverse decision trees.
- Majority Voting: The final prediction is based on the majority vote across all trees.
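To make the mechanics concrete, here is a minimal sketch of training a Random Forest with scikit-learn; the dataset and parameter values are illustrative, not prescribed by these flashcards.

```python
# Minimal Random Forest sketch (illustrative dataset and parameters).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of bootstrapped trees; each tree votes on the class.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# predict()/score() use the majority vote across all trees.
print("Test accuracy:", forest.score(X_test, y_test))
```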
What are the advantages of using Random Forest over a single decision tree?
Advantages:
- Reduced Overfitting: Random Forest mitigates overfitting compared to a single decision tree.
- Improved Accuracy: Combining multiple trees yields a more robust model.
- Feature Importance: Provides insight into how much each feature contributes to predictions.
- Importance Calculation: Based on the decrease in accuracy when a feature's values are permuted.
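A minimal sketch of the permutation-importance calculation with scikit-learn's `permutation_importance`, which shuffles one feature at a time and reports the resulting score drop; the dataset and settings are illustrative.

```python
# Permutation importance sketch (illustrative dataset and settings).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each feature n_repeats times and measure the mean accuracy drop.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
for name, score in zip(load_iris().feature_names, result.importances_mean):
    print(f"{name}: {score:.3f}")
```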
Explain different ensemble learning techniques and their applications.
Ensemble Learning Techniques:
- Bagging (Bootstrap Aggregating): Reduces variance by averaging predictions from multiple models trained on bootstrap samples (e.g., Random Forest).
- Boosting: Trains models sequentially, with each model correcting the errors of the previous one (e.g., AdaBoost, Gradient Boosting).
- Stacking: Combines the predictions of multiple base models using another model (a meta-model) to improve performance.
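All three techniques are available in scikit-learn; the sketch below compares them on one dataset. The estimator choices and parameters are assumptions for illustration only.

```python
# Sketch of bagging, boosting, and stacking (illustrative estimator choices).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Bagging: many trees on bootstrap samples, predictions combined by vote.
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    # Boosting: models trained sequentially, each reweighting prior errors.
    "boosting": AdaBoostClassifier(n_estimators=50),
    # Stacking: a logistic-regression meta-model combines base predictions.
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier()),
                    ("lr", LogisticRegression(max_iter=5000))],
        final_estimator=LogisticRegression(max_iter=5000),
    ),
}

for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```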
What are common evaluation metrics for assessing model performance?
Evaluation Metrics:
- Accuracy: The ratio of correctly predicted instances to total instances.
  Equation: Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
  (where TP, TN, FP, and FN are True Positives, True Negatives, False Positives, and False Negatives, respectively)
- Precision: The ratio of correctly predicted positive observations to the total predicted positives.
  Equation: Precision = \frac{TP}{TP + FP}
- Recall (Sensitivity): The ratio of correctly predicted positive observations to all actual positives.
  Equation: Recall = \frac{TP}{TP + FN}
- F1 Score: The harmonic mean of precision and recall.
  Equation: F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
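A quick arithmetic sketch of all four metrics computed directly from confusion-matrix counts; the counts themselves are made up for illustration.

```python
# Computing the four metrics from confusion-matrix counts (made-up numbers).
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 85/100 = 0.85
precision = tp / (tp + fp)                          # 40/45 ≈ 0.889
recall = tp / (tp + fn)                             # 40/50 = 0.80
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.842

print(accuracy, precision, recall, f1)
```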
Mean (Average):
Mean = \frac{\sum_{i=1}^{n} x_i}{n}
Where x_i are the data points and n is the number of data points.
Median:
For an ordered dataset, the median is the middle value. If there is an even number of observations:
Median = \frac{x_{n/2} + x_{n/2+1}}{2}
Mode:
The most frequently occurring value in a dataset.
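A minimal sketch of all three central-tendency measures using Python's standard library; the sample data is invented.

```python
# Mean, median, and mode with the standard library (invented data).
import statistics

data = [2, 3, 3, 5, 7, 8, 9, 9, 9, 12]

print("mean:", statistics.mean(data))      # sum(data) / len(data) = 6.7
print("median:", statistics.median(data))  # mean of 5th and 6th values = 7.5
print("mode:", statistics.mode(data))      # most frequent value = 9
```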
Linear Regression Equation:
y = mx + b
Where y is the predicted value, m is the slope, x is the feature, and b is the y-intercept.
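A small sketch fitting m and b by least squares with NumPy; the data points are invented for illustration.

```python
# Fitting y = m*x + b by least squares (invented data points).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# np.polyfit with degree 1 returns [slope, intercept].
m, b = np.polyfit(x, y, 1)
print(f"y = {m:.2f}x + {b:.2f}")  # roughly y = 1.95x + 0.15
```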
Confusion Matrix: Useful for evaluating classification models.
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
Where: TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.
Precision and Recall:
Precision = \frac{TP}{TP + FP}
Recall (Sensitivity): Recall = \frac{TP}{TP + FN}
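A sketch of extracting these counts from scikit-learn's `confusion_matrix`; the label vectors are made up.

```python
# Reading TP/TN/FP/FN from a scikit-learn confusion matrix (made-up labels).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
# Precision = tp / (tp + fp); Recall = tp / (tp + fn)
```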
Unit 3: Neural Networks
Activation Function (Sigmoid):
\sigma(x) = \frac{1}{1 + e^{-x}}
Mean Squared Error (MSE) Loss Function:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Where y_i is the true value and \hat{y}_i is the predicted value.
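Both formulas are one-liners in NumPy; the sketch below uses illustrative inputs.

```python
# Sigmoid activation and MSE loss in NumPy (illustrative values).
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x)); squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

print(sigmoid(0.0))                                     # 0.5
print(mse(np.array([1.0, 2.0]), np.array([1.5, 2.5])))  # 0.25
```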
Unit 4: Support Vector Machines and Flexible Discriminants
SVM Decision Function:
f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b
Where: \alpha_i are the learned weights, y_i are the class labels, K is the kernel function, and b is the bias term.
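A sketch of evaluating this decision function by hand with an RBF kernel; the support vectors, labels, dual weights, and gamma value are all made up for illustration.

```python
# Evaluating the SVM decision function with an RBF kernel (made-up support
# vectors, labels, and dual weights; gamma is an assumed kernel parameter).
import numpy as np

def rbf_kernel(xi, x, gamma=0.5):
    """K(xi, x) = exp(-gamma * ||xi - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def decision_function(support_vectors, labels, alphas, b, x):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; its sign gives the class."""
    return sum(a * y * rbf_kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
labels = np.array([1, -1, 1])
alphas = np.array([0.7, 0.9, 0.2])
b = 0.1

print(decision_function(support_vectors, labels, alphas, b,
                        np.array([1.0, 1.0])))
```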
Unit 5: Random Forests and Ensemble Learning
Entropy for Information Gain:
H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)
Where p_i is the proportion of class i in set S.
Gini Index:
Gini(S) = 1 - \sum_{i=1}^{c} p_i^2
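A short sketch computing both impurity measures from class proportions; the example split is made up.

```python
# Entropy and Gini index from class proportions (made-up split).
import numpy as np

def entropy(p):
    """H(S) = -sum(p_i * log2(p_i)); 0 * log(0) is treated as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini(S) = 1 - sum(p_i^2)."""
    return 1.0 - np.sum(p ** 2)

p = np.array([0.5, 0.5])  # a perfectly mixed binary split
print(entropy(p))         # 1.0 (maximum for two classes)
print(gini(p))            # 0.5 (maximum for two classes)
```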
General Statistical Concepts
Central Limit Theorem: States that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the population's distribution.
Hypothesis Testing: Test statistic:
z = \frac{\bar{x} - \mu}{s / \sqrt{n}}
Where: \bar{x} = sample mean, \mu = population mean, s = sample standard deviation, n = sample size.
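A quick worked sketch of the z statistic; the sample summary numbers are invented.

```python
# Computing the z test statistic (invented sample summary numbers).
import math

x_bar = 52.0  # sample mean
mu = 50.0     # hypothesized population mean
s = 8.0       # sample standard deviation
n = 64        # sample size

z = (x_bar - mu) / (s / math.sqrt(n))
print(z)  # (52 - 50) / (8 / 8) = 2.0
```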
Variance:
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
Where \bar{x} is the mean.
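A one-line sketch of the sample variance in NumPy; the data is made up, and `ddof=1` selects the n − 1 denominator from the formula above.

```python
# Sample variance with NumPy (made-up data; ddof=1 gives the n-1 denominator).
import numpy as np

data = np.array([4.0, 7.0, 9.0, 12.0])
print(np.var(data, ddof=1))  # sum((x - 8)^2) / 3 = (16+1+1+16)/3 ≈ 11.333
```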