Unit 2 Flashcards
Decision Tree Learning
Definition: Supervised Machine Learning method where data is repeatedly split into subsets based on feature values.
Entities: Decision nodes (splits) and leaves (decisions/outcomes).
Types: Classification Trees (categorical outcomes), Regression Trees (continuous outcomes).
Representing Concepts as Decision Trees
Tree Structure: Root node, leaf nodes, splitting, branches/subtrees.
Building Trees: Using CART algorithm (Classification and Regression Tree).
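A minimal sketch of fitting a CART tree with scikit-learn (assumed installed; the toy dataset is illustrative, not from the notes):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: two binary features; the label depends only on the first feature.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

# CART uses an impurity measure (Gini by default) to pick each split.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)
print(tree.predict([[1, 1], [0, 0]]))
```

Because the label is perfectly determined by the first feature, CART finds that split at the root and classifies both queries correctly.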
Recursive Induction of Decision Trees
Recursive Partitioning: Statistical method that repeatedly splits the data into subgroups based on predictor variables.
Decision Tree Example: Predicting passenger survival on the Titanic based on passenger variables.
Picking the Best Splitting Attribute
Information Gain: Measure for choosing the feature that provides the best split.
Ex: Classifying people at a theatre based on attributes.
Entropy and Information Gain
Entropy: Measure of randomness and disorder in information.
Information Gain: Reduction in entropy, calculated for each attribute.
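The two measures above can be computed directly; a small sketch (data values are illustrative):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, splits):
    """Reduction in entropy when `labels` is partitioned into `splits`."""
    n = len(labels)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(labels) - weighted

parent = ["yes", "yes", "no", "no"]
print(entropy(parent))  # 1.0 bit: maximum disorder for two equally likely classes
# A split that perfectly separates the classes removes all entropy:
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

The splitting attribute chosen at each node is the one with the highest information gain over the remaining data.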
Computational Complexities of ML Models
Ex: K Nearest Neighbour, Logistic Regression, SVM, Decision Tree, Random Forest, Naive Bayes.
Time and Space Complexity: Big O Notation.
Occam’s Razor
Principle: The simplest explanation is likely the correct one.
Application in ML: Balancing model complexity for accurate predictions.
Overfitting in ML
Definition: Model captures noise in the training data, reducing its accuracy on new data.
Reasons: Uncleaned data, high variance, inadequate training data, model complexity.
Noisy Data and Pruning
Noisy Data: Corrupted or distorted data with a low signal-to-noise ratio.
Pruning: Technique that removes non-critical parts of a decision tree, reducing its size and overfitting.
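One common approach is cost-complexity pruning, exposed in scikit-learn via the `ccp_alpha` parameter; a sketch on the Iris dataset (assumes scikit-learn is installed, and `ccp_alpha=0.05` is an illustrative value):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unpruned tree grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# A larger ccp_alpha prunes more aggressively, yielding a smaller tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.05).fit(X, y)

print(full.tree_.node_count, pruned.tree_.node_count)
```

The pruned tree trades a little training accuracy for a simpler structure that is less likely to fit noise.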
Experimental Evaluation of Learning Algorithms
Hypothesis Accuracy: Estimating accuracy using statistical methods.
Factors: Choice of test set, likelihood of the estimates, and evaluation strategy when data is limited.
Comparing Learning Algorithms and Cross-Validation
Factors: Time complexity, space complexity, sample complexity, unbiased data, online/offline algorithms, parallelizability, parametricity.
Cross-Validation Types: Holdout method, K-fold cross validation, Leave-p-out, Leave-one-out.
Learning Curves and Statistical Hypothesis Testing
Learning Curves: Plots showing progress over training experience.
Hypothesis Testing: Confirming observations using sample data and statistical tests.
Random Forest Complexity
Training Time: O(n * log(n) * d * k), where n is the number of samples, d the number of features, and k the number of decision trees.
Run-time: O(depth of tree * k).
Space Complexity: O(depth of tree * k).
Naive Bayes Complexity
Training Time: O(n*d), where n is the number of training samples and d the number of features.
Run-time: O(c*d), where c is the number of classes, since feature likelihoods are evaluated for each class.
Occam’s Razor in Model Selection
Model Selection: Choosing the appropriate model for a machine learning problem.
Balance: Achieving a balance between model simplicity and accuracy.
Limitations of Cross-Validation
Computational Resources: Cross-validation can be computationally expensive.
Unseen Data: Cross-validation estimates may not reflect performance on truly unseen data whose distribution differs from the training set.
Learning Curve Types
Diminishing-Returns Curve: Rapid progression initially, slows over time.
Increasing-Returns Curve: Progression accelerates over time.
Increasing-Decreasing Returns Curve (S-curve): Combination of both.
Complex Learning Curve: Varied progression patterns.
Hypothesis Testing in ML
Purpose: Confirming observations about the population using sample data.
Null Hypothesis: Assumes no significant difference.
Alternate Hypothesis: Assumes a significant difference.
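A common use in ML is testing whether two models' cross-validation scores differ significantly; a sketch using a two-sample t-test from SciPy (the fold accuracies below are illustrative, not real results):

```python
from scipy import stats

# Per-fold accuracies of two models (illustrative numbers).
acc_model_a = [0.81, 0.79, 0.83, 0.80, 0.82]
acc_model_b = [0.75, 0.74, 0.77, 0.76, 0.75]

# Null hypothesis: no significant difference between the mean accuracies.
t_stat, p_value = stats.ttest_ind(acc_model_a, acc_model_b)

# A small p-value means we reject the null in favour of the alternate hypothesis.
print(p_value < 0.05)
```

Here the gap between the two score lists is large relative to their spread, so the null hypothesis is rejected.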
Types of Cross-Validation Methods
Holdout Method: Basic, dividing dataset into training and testing.
K-fold Cross-Validation: Improved holdout method with k subsets.
Leave-p-out Cross-validation: Exhaustive method leaving p data points out.
Leave-one-out Cross Validation: Simplified version, p equals one.
Applications: Evaluating and selecting ML models.
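The k-fold method above can be run in a few lines with scikit-learn (assumed installed); a sketch on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: each of the 5 subsets serves once as the test set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(len(scores), scores.mean())
```

Averaging the fold scores gives a more stable accuracy estimate than a single holdout split.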
Bias-Variance Tradeoff
Balance: The tradeoff between bias (inflexibility) and variance (sensitivity to noise).
High Bias: Model is too simple, may underfit.
High Variance: Model is too complex, may overfit.
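The two failure modes can be seen by varying tree depth; a sketch with scikit-learn (assumed installed) on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# High bias: a depth-1 stump is too simple to fit even the training data.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

# High variance: an unconstrained tree memorizes the training set.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print(shallow.score(X_tr, y_tr), deep.score(X_tr, y_tr))
```

The unconstrained tree reaches perfect training accuracy, a hallmark of overfitting, while the stump underfits; a good depth sits between the two.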