Linear Reg Flashcards

1
Q

How do we evaluate a regression model?

A

• Given N examples, i.e., pairs (\mathbf{x}_i, y_i), linear regression computes a model f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b
• So that for each point, \hat{y}_i = f(\mathbf{x}_i) \approx y_i
• We evaluate the model by computing the Residual Sum of Squares (RSS): \text{RSS} = \sum_{i=1}^N (y_i - \hat{y}_i)^2

The goal of linear regression is thus to find the weights that minimize RSS
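A minimal sketch of the RSS computation, assuming NumPy and hand-picked toy data and weights:

```python
import numpy as np

# Toy data: N examples with a single feature x and target y (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

# A candidate linear model y_hat = w * x + b (weights picked by hand here)
w, b = 2.0, 0.0
y_hat = w * x + b

# Residual Sum of Squares: squared differences between targets and predictions
rss = np.sum((y - y_hat) ** 2)
print(rss)  # linear regression searches for the (w, b) that minimize this value
```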

2
Q

What are the assumptions made for linear regression?

A

• Linearity
– When applying linear regression, the prediction is a linear combination of the inputs

• Normality
– The target outcome follows a normal distribution

• Homoscedasticity
– The variance of the error terms is assumed to be constant over the entire feature space

• Independence
– Each instance is independent of the others

• Absence of Multicollinearity
– There are no strongly correlated features

3
Q

What is the coefficient of determination R squared in linear regression? What does it indicate?

A

• Total sum of squares: \text{TSS} = \sum_{i=1}^N (y_i - \bar{y})^2
• Coefficient of determination: R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}
• R^2 measures how well the regression line approximates the real data points. When R^2 is 1, the regression line perfectly fits the data.

• R^2 increases with the number of features even if they do not convey any information about the target
• Therefore, it is usually better to use the adjusted R^2 = 1 - (1 - R^2)\frac{N-1}{N-p-1} (with N examples and p features), which penalizes uninformative features
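A small sketch of R^2 and adjusted R^2, assuming NumPy and made-up predictions from a one-feature model:

```python
import numpy as np

def r2_scores(y, y_hat, p):
    """Return (R^2, adjusted R^2) for predictions y_hat of a model with p features."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)  # penalizes extra features
    return r2, adj_r2

# Made-up targets and predictions from a one-feature model
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_hat = np.array([2.2, 3.8, 6.1, 7.9, 10.3])
print(r2_scores(y, y_hat, p=1))
```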

4
Q

How do we evaluate a model?

A

• Models should be evaluated using data that have not been used to build the model itself
• Example: would it be feasible to evaluate students using exactly the same problems solved in class?
• The available data must be split between training and test
–Training data will be used to build the model
–Test data will be used to evaluate the model performance
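A minimal sketch of this split, assuming scikit-learn and a synthetic regression dataset:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the available examples
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out 25% of the data for testing; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # build the model on training data
print(model.score(X_test, y_test))                # evaluate (R^2) on the unseen test data
```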

5
Q

What is cross-validation?

A

• First step
– Data is split into k subsets of equal size
• Second step
–Each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation and avoids overlapping test sets
• Often the subsets are stratified before cross-validation is performed
• The error estimates are averaged to yield an overall error estimate

• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten? Experiments have shown that this is the best choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• Ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)
• Other approaches appear to be robust, e.g., 5x2 cross-validation
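A sketch of stratified, repeated ten-fold cross-validation, assuming scikit-learn and one of its bundled datasets:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified 10-fold CV repeated 10 times: 100 accuracy estimates that get averaged
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(scores.mean(), scores.std())  # overall error estimate and its variability
```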

6
Q

What is overfitting?

A

Very good performance on the training set (the model precisely fits patterns present in the training data)
Terrible performance on the test set (those patterns were just noise and are no longer present)

7
Q

Why and how do regularizations such as Ridge and Lasso work?

A

• Ridge and Lasso add a penalty on the magnitude of the weights to the least-squares cost, shrinking the weights and thus reducing overfitting (at the price of some bias)
– Ridge (L2) adds \alpha \sum_j w_j^2, which shrinks all weights towards zero
– Lasso (L1) adds \alpha \sum_j |w_j|, which can drive some weights exactly to zero and therefore also performs feature selection
• The hyperparameter \alpha controls the strength of the penalty: the larger \alpha, the smaller the weights
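A minimal sketch contrasting ordinary least squares with Ridge and Lasso, assuming scikit-learn; the alpha values are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # L2 penalty: shrinks all weights
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: drives some weights to exactly zero

print(abs(ols.coef_).max(), abs(ridge.coef_).max())  # compare weight magnitudes
print((lasso.coef_ == 0).sum())                      # number of weights lasso zeroed out
```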
8
Q

How can we analyze the effect of regularizations on models?

A

We could plot the weight values before and after the application of the regularizations.

We could also analyze the effect of the regularizations as the alpha value changes, by plotting the weight values against the variation of alpha.

9
Q

What are the strategies to evaluate the best alpha value?

A

• To select the best value of α we cannot use the test set since it is going to be used for evaluating the final model (which uses α)

• Need to reserve part of the training data to evaluate possible candidate values of α and to select the best one

• If we have enough data, we can extract a validation set from the training data which will be used to select α

• If we don’t have enough data, we should select α by applying k-fold cross-validation over the training data choosing the α corresponding to the lowest average cost over the k folds
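A sketch of the k-fold strategy using scikit-learn's GridSearchCV over the training data only; the candidate α values are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=30, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate alpha values are compared with 5-fold CV on the training data only
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)           # alpha with the best average score over the folds
print(search.score(X_test, y_test))  # final evaluation on the untouched test set
```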

10
Q

What are some of the metrics used to evaluate classification models?

A

• Accuracy
–Classifier accuracy in predicting the correct class labels

• Speed
–Time to construct the model (training time)
–Time to use the model to label unseen data

• Other Criteria
–Robustness in handling noise
– Scalability
– Interpretability

11
Q

What are linear classifiers? How do they work?

A

Linear classifiers are algorithms used in machine learning to classify data points by separating them into different classes using a linear decision boundary. They work by finding a hyperplane (a line in 2D, a plane in 3D, or a higher-dimensional equivalent) that best divides the data points of different classes.

Key Components:

1.	Linear Decision Boundary: The boundary is defined by a linear equation of the form:

f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b

where:
• \mathbf{x} is the input feature vector.
• \mathbf{w} is the weight vector that defines the orientation of the hyperplane.
• b is the bias term that shifts the hyperplane.
2. Classification Rule:
• A data point is classified based on which side of the hyperplane it lies. For binary classification:

\text{Class 1 if } f(\mathbf{x}) \geq 0, \text{ otherwise Class 2}.
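A tiny sketch of this decision rule, assuming NumPy and hand-picked weights:

```python
import numpy as np

w = np.array([1.5, -2.0])  # weight vector: orientation of the hyperplane (made up)
b = 0.5                    # bias: shifts the hyperplane

def predict(x):
    """Class 1 if the point lies on the non-negative side of the hyperplane, else Class 2."""
    return 1 if w @ x + b >= 0 else 2

print(predict(np.array([2.0, 1.0])))  # 1.5*2 - 2.0*1 + 0.5 = 1.5 >= 0 -> Class 1
print(predict(np.array([0.0, 2.0])))  # 1.5*0 - 2.0*2 + 0.5 = -3.5 < 0 -> Class 2
```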

How Linear Classifiers Work:

1.	Training: The algorithm adjusts the weights ( \mathbf{w} ) and bias ( b ) during training using labeled data so that the hyperplane best separates the classes.
•	Algorithms like Perceptron, Support Vector Machine (SVM), or optimization techniques like Gradient Descent are used for this purpose.
2.	Prediction: For a new input, the model calculates  f(\mathbf{x})  and determines the class based on the sign or value of  f(\mathbf{x}) .
3.	Evaluation: The performance of the classifier is measured using metrics like accuracy, precision, recall, and others.

Common Examples of Linear Classifiers:

1.	Logistic Regression: Models the probability of a binary outcome and uses a logistic function.
2.	Support Vector Machines (Linear Kernel): Maximizes the margin between classes while finding the optimal hyperplane.
3.	Perceptron Algorithm: A simple linear classifier that adjusts weights iteratively.

Limitations:

•	Not Suitable for Non-linear Data: Linear classifiers cannot model complex relationships or datasets where classes are not linearly separable.
•	Sensitive to Feature Scaling: The performance depends heavily on how features are scaled.

Extensions for Non-linear Data:

•	Kernel methods (e.g., in SVMs) or feature transformations (e.g., polynomial features) can help handle non-linear data while still using a linear classifier approach.
12
Q

Detail the logistic regression technique

A

Logistic regression is a supervised learning technique used for binary classification problems, where the output variable can take one of two possible values (e.g., yes/no, 0/1, spam/not spam). Unlike linear regression, logistic regression predicts the probability that a given input belongs to a particular class, mapping the output to a range between 0 and 1 using a sigmoid function.

  1. Key Concepts

Model Equation

Logistic regression uses the following model:

P(y=1 | \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)

where:
• \mathbf{x} : Input feature vector.
• \mathbf{w} : Weight vector (coefficients).
• b : Bias (intercept).
• \sigma(z) : Sigmoid function defined as:

\sigma(z) = \frac{1}{1 + e^{-z}}

The sigmoid function maps any real-valued number to the range [0, 1].

Decision Boundary

To classify data points, logistic regression uses a threshold (e.g., 0.5):
• If P(y=1 | \mathbf{x}) \geq 0.5 , classify as class 1.
• Otherwise, classify as class 0.

The decision boundary is a linear hyperplane, defined by:

\mathbf{w}^T\mathbf{x} + b = 0

  2. Training Process

Log-Likelihood Function

The model is trained by maximizing the likelihood of the observed data. The likelihood for a dataset with n samples is:

L(\mathbf{w}, b) = \prod_{i=1}^n P(y_i | \mathbf{x}_i)

Taking the logarithm (log-likelihood) simplifies computation:

\log L(\mathbf{w}, b) = \sum_{i=1}^n \Big[ y_i \log P(y_i | \mathbf{x}_i) + (1 - y_i) \log (1 - P(y_i | \mathbf{x}_i)) \Big]

Optimization

The log-likelihood function is maximized to find the optimal weights ( \mathbf{w} ) and bias ( b ):
1. Gradient Descent or variants like Stochastic Gradient Descent (SGD) are commonly used to optimize the parameters.
2. The gradients of the log-likelihood with respect to the parameters are computed to update them iteratively:

\mathbf{w} \gets \mathbf{w} + \eta \nabla_{\mathbf{w}} \log L

where \eta is the learning rate.
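A compact sketch of this optimization loop (gradient ascent on the mean log-likelihood), assuming NumPy and a made-up toy dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy binary labels

w, b = np.zeros(2), 0.0
eta = 0.1  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    p = sigmoid(X @ w + b)           # P(y=1 | x) under the current parameters
    grad_w = X.T @ (y - p) / len(y)  # gradient of the mean log-likelihood w.r.t. w
    grad_b = np.mean(y - p)
    w += eta * grad_w                # ascend the log-likelihood
    b += eta * grad_b

print(w, b, np.mean((p >= 0.5) == y))  # learned parameters and training accuracy
```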

  3. Advantages
    • Probabilistic Output: Predicts probabilities, which makes it interpretable and useful in risk-based decision-making.
    • Efficient: Works well for linearly separable datasets and is computationally efficient.
    • Feature Importance: The learned weights ( \mathbf{w} ) provide insights into feature importance.
  4. Limitations
    • Linear Decision Boundary: Cannot handle non-linear relationships unless features are transformed.
    • Imbalanced Data: Can perform poorly if one class dominates the dataset. Techniques like class weighting or oversampling are needed.
    • Outliers: Sensitive to outliers, which can significantly affect the decision boundary.
  5. Extensions

Multinomial Logistic Regression:

For multi-class classification, logistic regression can be extended using the softmax function, which generalizes the sigmoid function to multiple classes.

P(y = k | \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T \mathbf{x}}}

Regularized Logistic Regression:

Adding regularization terms helps prevent overfitting:
• L1 Regularization: Adds \lambda \sum |w_i| (LASSO).
• L2 Regularization: Adds \lambda \sum w_i^2 (Ridge).

  6. Applications
    • Medical Diagnosis: Predicting the presence of a disease (e.g., diabetes).
    • Spam Filtering: Classifying emails as spam or not spam.
    • Customer Churn Prediction: Identifying customers likely to leave a service.
    • Credit Scoring: Determining the likelihood of loan default.

By mapping probabilities to binary outcomes with a linear decision boundary, logistic regression is both a simple yet powerful classification tool.

13
Q

Define the one-versus-the-rest multi-class classification technique

A

• For each class, it creates one classifier that predicts the target class against all the others

• Given three classes A, B, C, it computes three models
– One that predicts A against B and C
– One that predicts B against A and C, and
– One that predicts C against A and B

• Then, given an example, all three classifiers are applied and the label with the highest probability is returned

• Alternative approaches include minimizing the multinomial loss fit across the entire probability distribution

14
Q

How can we use logistic regression for multiclass classification?

A

Logistic regression can be extended to handle multiclass classification problems (where the output has more than two classes) using two main approaches: One-vs-Rest (OvR) and Multinomial Logistic Regression (Softmax Regression). Here’s how they work:

  1. One-vs-Rest (OvR) Approach

In this method, logistic regression is applied multiple times, once for each class. For a problem with K classes, the approach works as follows:
1. Binary Classifiers: Train K binary logistic regression classifiers, where each classifier distinguishes one class from the rest (e.g., “Class 1 vs. Not Class 1,” “Class 2 vs. Not Class 2,” and so on).
2. Prediction:
• For a new input, each classifier predicts a probability for its respective class.
• The class with the highest probability is assigned as the final prediction:

\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} P(y=k | \mathbf{x})

Advantages of OvR:

•	Simple to implement using binary logistic regression.
•	Efficient for problems with a small number of classes.

Limitations of OvR:

•	Can be computationally expensive for large numbers of classes (since  K  models are trained).
•	May not perform as well if the classes are highly imbalanced.
  2. Multinomial Logistic Regression (Softmax Regression)

This is the direct extension of logistic regression for multiclass classification, where a single model predicts the probabilities for all K classes simultaneously. It uses the softmax function to ensure the output probabilities for all classes sum to 1.

Model

For a dataset with K classes, the probability of a data point \mathbf{x} belonging to class k is given by:

P(y = k | \mathbf{x}) = \frac{\exp(\mathbf{w}_k^T \mathbf{x} + b_k)}{\sum_{j=1}^K \exp(\mathbf{w}_j^T \mathbf{x} + b_j)}

where:
• \mathbf{w}_k and b_k are the weight vector and bias for class k .
• The denominator normalizes the probabilities.
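A small sketch of these softmax probabilities, assuming NumPy; the weights, biases, and input are illustrative:

```python
import numpy as np

def softmax_probs(x, W, b):
    """P(y = k | x) for every class k, given per-class weight rows W[k] and biases b[k]."""
    scores = W @ x + b                    # one linear score per class
    scores = scores - scores.max()        # numerical stability; probabilities unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # denominator normalizes over all K classes

W = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])  # K=3 classes, 2 features (made up)
b = np.array([0.0, 0.1, -0.1])
x = np.array([2.0, 1.0])

p = softmax_probs(x, W, b)
print(p, p.sum(), p.argmax())  # probabilities sum to 1; argmax gives the predicted class
```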

Decision Rule

The predicted class is the one with the highest probability:

\hat{y} = \arg\max_{k \in \{1, 2, \dots, K\}} P(y = k | \mathbf{x})

Training

The model is trained by maximizing the log-likelihood for all classes. For n samples, the log-likelihood is:

\log L = \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}(y_i = k) \log P(y_i = k | \mathbf{x}_i)

where \mathbf{1}(y_i = k) is an indicator function (1 if y_i = k , 0 otherwise).

Optimization is done using methods like Gradient Descent or Stochastic Gradient Descent.

Advantages of Softmax Regression:

•	Single model handles all classes.
•	Provides probabilistic outputs for all classes.
•	Works well for balanced and separable datasets.

Limitations of Softmax Regression:

•	Computationally expensive for datasets with a large number of classes.
•	Assumes linear separability in the feature space.
  3. Regularization

To prevent overfitting, regularization can be applied to both approaches:
• L1 Regularization (LASSO): Encourages sparsity in weights.
• L2 Regularization (Ridge): Penalizes large weights to improve generalization.

  4. Comparison of OvR and Softmax

Feature | OvR | Softmax Regression
Number of Models | K binary models | 1 multinomial model
Training Complexity | Linear in K | More complex (joint training for all classes)
Output | Class probabilities for each binary model | Probabilities for all classes in one step
Use Case | Few classes, simpler datasets | Balanced and larger datasets

  5. Applications
    • Image Classification: Recognizing objects (e.g., dog, cat, car) in images.
    • Document Classification: Classifying documents into categories (e.g., sports, technology, politics).
    • Medical Diagnosis: Predicting types of diseases or conditions.

By choosing between OvR and Softmax Regression based on the dataset and problem requirements, logistic regression becomes a versatile tool for multiclass classification tasks.

15
Q

Define the confusion matrix and its attributes. What is the importance of distinguishing the different types of errors?

A

Confusion Matrix: Definition

A confusion matrix is a tool used to evaluate the performance of a classification model. It provides a summary of the predictions made by the model compared to the actual labels in the dataset. It breaks down the outcomes into four categories: True Positives, True Negatives, False Positives, and False Negatives, which give insight into the types of errors the model makes.

Attributes of the Confusion Matrix

1.	True Positives (TP):
•	Instances where the model correctly predicts the positive class.
•	For example, the model predicts “disease present” when the disease is indeed present.
2.	True Negatives (TN):
•	Instances where the model correctly predicts the negative class.
•	For example, the model predicts “no disease” when there is no disease.
3.	False Positives (FP):
•	Instances where the model incorrectly predicts the positive class.
•	For example, the model predicts “disease present” when there is no disease.
•	This is also known as a Type I error or a “false alarm.”
4.	False Negatives (FN):
•	Instances where the model incorrectly predicts the negative class.
•	For example, the model predicts “no disease” when the disease is present.
•	This is also known as a Type II error or a “miss.”

Importance of Distinguishing Different Types of Errors

1.	Context-Specific Impact:
•	The severity of false positives and false negatives depends on the application.
•	In medical diagnosis, a false negative (missing a disease) may be life-threatening, while a false positive (incorrectly diagnosing a disease) may cause unnecessary anxiety and tests.
2.	Decision-Making:
•	By understanding the types of errors, we can adjust the model to minimize the more critical error type. For example, in fraud detection, reducing false negatives (undetected fraud) is often more important than reducing false positives (flagging legitimate transactions as fraud).
3.	Model Evaluation:
•	Metrics like precision, recall, and F1-score depend on these error types. For instance, precision focuses on minimizing false positives, while recall emphasizes reducing false negatives.
4.	Imbalanced Datasets:
•	In datasets with imbalanced classes (e.g., rare diseases), accuracy alone can be misleading. Distinguishing errors helps ensure the model is evaluated based on how well it handles the minority class.
5.	Real-World Implications:
•	Understanding and balancing the trade-offs between false positives and false negatives ensures the model’s outputs align with the desired outcomes in practical scenarios.

By analyzing the confusion matrix, we can fine-tune a model to achieve a balance that best fits the specific goals and constraints of the application.
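A short sketch that builds a confusion matrix and reads off its four cells, assuming scikit-learn and hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = "disease present", 0 = "no disease"
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# With labels=[0, 1] the binary matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```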

16
Q

How can we use the confusion matrix to calculate the accuracy of a model? Why is that not enough? In which situations is accuracy not effective/useful?

A

Using the Confusion Matrix to Calculate Accuracy

The accuracy of a model measures the proportion of correct predictions (both true positives and true negatives) out of the total predictions. It can be calculated from the confusion matrix using the formula:

\text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Predictions (TP + TN + FP + FN)}}

In simple terms, it is the ratio of correctly classified instances (both positive and negative) to the total number of instances in the dataset.
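A one-line sketch of the formula, using illustrative counts:

```python
# Counts read off a hypothetical confusion matrix
tp, tn, fp, fn = 30, 50, 10, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions over all predictions
print(accuracy)  # 0.8
```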

Why Accuracy Is Not Always Enough

Although accuracy is intuitive and easy to calculate, it does not always provide a complete picture of model performance. This is because:
1. Class Imbalance:
• In datasets where one class dominates (e.g., 95% of samples belong to Class A and only 5% to Class B), a model that always predicts Class A will achieve 95% accuracy but will fail completely at identifying Class B.
• In such cases, accuracy is misleading because it does not account for the model’s ability to correctly classify the minority class.
2. No Insight into Error Types:
• Accuracy does not distinguish between false positives (Type I errors) and false negatives (Type II errors). For certain applications, one type of error might be far more critical than the other.
• Example: In cancer detection, missing a cancer case (false negative) is more serious than falsely diagnosing cancer (false positive).
3. Lack of Granularity:
• Accuracy is a single metric and does not provide insights into specific aspects of the model’s performance, such as precision, recall, or the trade-offs between them.
4. Overfitting and Bias:
• High accuracy might indicate overfitting to the training data or bias in the dataset, where the model memorizes patterns instead of generalizing well.

Situations Where Accuracy Is Not Effective

1.	Imbalanced Datasets:
•	Example: In fraud detection, where only 1% of transactions are fraudulent, a model predicting all transactions as “non-fraudulent” will have 99% accuracy but will fail to detect any fraud cases.
2.	High Cost of Specific Errors:
•	Example: In medical diagnosis, missing a disease (false negative) might have serious consequences, even if the model achieves high accuracy overall.
3.	Multi-Class Problems:
•	In multi-class classification, accuracy alone does not reveal which classes are being misclassified and whether certain classes are disproportionately affected.
4.	Anomalies and Rare Events:
•	Example: In cybersecurity, detecting rare attacks is crucial, and a high accuracy model might fail to identify these rare cases effectively.

Better Alternatives to Accuracy

When accuracy is not effective, other metrics derived from the confusion matrix are more informative:
1. Precision:
• Focuses on the reliability of positive predictions ( \frac{TP}{TP + FP} ).
• Useful when false positives are costly (e.g., spam filtering).
2. Recall (Sensitivity):
• Measures the ability to identify all actual positives ( \frac{TP}{TP + FN} ).
• Useful when false negatives are costly (e.g., medical diagnosis).
3. F1-Score:
• Combines precision and recall into a single metric ( 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} ).
• Useful for imbalanced datasets.
4. Specificity:
• Measures the ability to identify actual negatives ( \frac{TN}{TN + FP} ).
• Important when false positives need to be minimized.
5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve):
• Evaluates the trade-off between true positive and false positive rates across different thresholds.

By considering these metrics alongside accuracy, we gain a more comprehensive understanding of model performance, especially in critical or imbalanced scenarios.

17
Q

How can we use a cost matrix along with the confusion matrix to better evaluate a model?

A

Using a Cost Matrix with a Confusion Matrix

A cost matrix is a tool used to quantify the cost or impact of different types of errors (false positives and false negatives) and correct predictions (true positives and true negatives). By combining it with a confusion matrix, we can evaluate a model’s performance more realistically, considering the actual costs or consequences of its predictions.

How a Cost Matrix Works

A cost matrix assigns a numerical value (cost) to each outcome in the confusion matrix:
• True Positives (TP): Often assigned a reward or zero cost.
• True Negatives (TN): Often assigned a reward or zero cost.
• False Positives (FP): Associated with the cost of a Type I error.
• False Negatives (FN): Associated with the cost of a Type II error.

Example Cost Matrix for Binary Classification:

Actual \ Predicted | Positive | Negative
Positive (Actual) | 0 (Reward) | FN Cost
Negative (Actual) | FP Cost | 0 (Reward)

Steps to Use a Cost Matrix with a Confusion Matrix

1.	Calculate the Confusion Matrix:
•	Derive the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for the model’s predictions.
2.	Define the Cost Matrix:
•	Assign costs based on the problem’s context. For example:
•	In fraud detection: Cost of missing a fraud (FN) is much higher than falsely flagging a legitimate transaction (FP).
•	In medical diagnosis: Missing a disease (FN) may have life-threatening consequences, while misdiagnosing a healthy patient (FP) may result in unnecessary tests.
3.	Compute the Total Cost:
•	Multiply the confusion matrix values by the corresponding costs from the cost matrix.
•	Calculate the total cost using the formula:

\text{Total Cost} = (TP \cdot C_{TP}) + (FP \cdot C_{FP}) + (FN \cdot C_{FN}) + (TN \cdot C_{TN})

•	Where  C_{TP}, C_{FP}, C_{FN}, C_{TN}  are the costs from the cost matrix.
4.	Evaluate the Model:
•	Compare the total costs of different models to identify the one that minimizes the overall cost, rather than solely relying on metrics like accuracy.

Why a Cost Matrix Improves Model Evaluation

1.	Realistic Decision-Making:
•	Incorporates the real-world consequences of errors, making the evaluation more aligned with the application’s requirements.
•	Example: In fraud detection, the cost of missing fraud is higher than the cost of flagging a legitimate transaction.
2.	Prioritization of Errors:
•	Helps prioritize reducing specific errors (false positives or false negatives) based on their impact.
3.	Balancing Class Imbalances:
•	Adjusts for the unequal importance of classes, especially in datasets with rare but critical events (e.g., fraud, diseases).
4.	Guides Threshold Selection:
•	A cost-sensitive approach can help choose an optimal decision threshold to minimize overall costs.

Example

Scenario: Medical Diagnosis

•	TP (Correctly detects disease): Cost = $0 (Reward for correct detection).
•	FP (Healthy person diagnosed with disease): Cost = $100 (Cost of unnecessary tests).
•	FN (Missed disease): Cost = $10,000 (Cost of untreated disease).
•	TN (Correctly identifies healthy): Cost = $0.

Suppose the confusion matrix for a model is:
• TP = 90, FP = 10, FN = 5, TN = 95.

Cost Computation:

\text{Total Cost} = (90 \cdot 0) + (10 \cdot 100) + (5 \cdot 10,000) + (95 \cdot 0)

\text{Total Cost} = 0 + 1,000 + 50,000 + 0 = 51,000

This cost-driven evaluation reveals the high penalty for false negatives, emphasizing the need for a model with higher recall.
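The same computation as a small sketch, with the counts and costs copied from the example above:

```python
# Confusion-matrix counts from the example above
counts = {"TP": 90, "FP": 10, "FN": 5, "TN": 95}

# Cost matrix: dollars per outcome (correct predictions modeled as zero cost)
costs = {"TP": 0, "FP": 100, "FN": 10_000, "TN": 0}

total_cost = sum(counts[k] * costs[k] for k in counts)
print(total_cost)  # 51000, dominated by the five false negatives
```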

When to Use a Cost Matrix

1.	Applications with High-Stakes Errors:
•	Fraud detection, medical diagnosis, credit risk analysis, cybersecurity.
2.	Imbalanced Datasets:
•	When class distribution is skewed, and some errors (e.g., false negatives) are more critical than others.
3.	Cost-Sensitive Decision Making:
•	In scenarios where the focus is on minimizing the overall cost rather than maximizing general metrics like accuracy.

By using a cost matrix, we shift from generic model evaluation to cost-sensitive optimization, enabling better alignment with real-world objectives.

18
Q

Define the precision and recall metrics

A

• Alternatives to accuracy, introduced in the area of information retrieval and search engines

• Precision
– In the information retrieval context, precision represents the percentage of the documents shown as results that are actually good (relevant)
– Percentage of items classified as positive that are actually positive: TP / (TP + FP)

• Recall
– Percentage of positive examples that are classified as positive: TP / (TP + FN)
– In the information retrieval context, recall represents the percentage of good documents shown with respect to all the existing good ones.

19
Q

What is the F1 metric? How can we calculate it?

A

F1 Metric: Definition

The F1 metric (or F1-score) is a measure of a model’s accuracy that considers both precision and recall. It is the harmonic mean of precision and recall, providing a single metric that balances the trade-off between these two measures. The F1-score is especially useful in cases of imbalanced datasets, where accuracy might be misleading.
• Precision: The proportion of true positive predictions out of all positive predictions made by the model.
• Recall (Sensitivity): The proportion of true positive predictions out of all actual positive instances.

The F1-score is calculated to give equal weight to precision and recall, making it effective when both metrics are important.

Formula to Calculate F1-Score

The F1-score is computed using the formula:

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Where:
• \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
• \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}

Steps to Calculate F1-Score

1.	Determine Precision:
•	Count the true positives (TP) and false positives (FP).
•	Calculate precision using the formula:  \text{Precision} = \frac{TP}{TP + FP} .
2.	Determine Recall:
•	Count the true positives (TP) and false negatives (FN).
•	Calculate recall using the formula:  \text{Recall} = \frac{TP}{TP + FN} .
3.	Calculate the F1-Score:
•	Use the precision and recall values in the F1 formula:

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Why Use the F1-Score?

1.	Balances Precision and Recall:
•	When both false positives and false negatives are critical, the F1-score provides a balanced evaluation.
2.	Useful for Imbalanced Datasets:
•	Accuracy may appear high if the majority class dominates, but the F1-score reflects the model’s performance for minority classes by focusing on TP, FP, and FN.
3.	Handles Trade-Offs:
•	High precision and low recall (or vice versa) result in a low F1-score, emphasizing the importance of balancing the two.

Example Scenario

Consider a binary classification model:
• The model predicts 50 positives.
• Of these, 30 are true positives (TP) and 20 are false positives (FP).
• There are 70 actual positives in the dataset, so there are 40 false negatives (FN).

Step 1: Calculate Precision

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{30}{30 + 20} = 0.6

Step 2: Calculate Recall

\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{30}{30 + 40} = 0.4286

Step 3: Calculate F1-Score

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.6 \cdot 0.4286}{0.6 + 0.4286} \approx 0.5
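The same worked example as a short sketch:

```python
tp, fp, fn = 30, 20, 40  # counts from the scenario above

precision = tp / (tp + fp)                          # 0.6
recall = tp / (tp + fn)                             # ~0.4286
f1 = 2 * precision * recall / (precision + recall)  # ~0.5
print(round(precision, 4), round(recall, 4), round(f1, 4))
```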

Limitations of F1-Score

•	It does not account for true negatives (TN), so it might not fully reflect the model’s overall performance, particularly when TN is important.
•	The F1-score assumes equal importance for precision and recall. If one is more critical, other metrics like the weighted F1-score or a customized cost function might be more suitable.

When to Use the F1-Score

•	When dealing with imbalanced datasets.
•	When both false positives and false negatives are significant but need to be balanced.
•	In applications like fraud detection, medical diagnosis, or spam filtering, where one type of error might dominate but balancing both errors is essential.
20
Q

What are the interpretations for the weights according to the attribute type?

A

• Numerical variables
– Increasing the numerical feature by one unit changes the estimated outcome by its weight

• Binary variables
– Changing the variable’s value modifies the outcome by the variable’s weight

• Nominal variables
– They are generally transformed using one-hot encoding, thus the values are mapped into binary variables

• Intercept
– The interpretation of this weight makes the most sense when the values have been normalized (standardized)
– In this case, the intercept reflects the predicted outcome when all the variables are at mean value

21
Q

What is class imbalance? How could we solve it?

A

Class Imbalance

In many data sets there are a disproportionate number of instances that belong to different classes

In health-care applications, we expect to observe a smaller number of subjects who are positively diagnosed.

In credit card fraud detection, fraudulent transactions are greatly outnumbered by legitimate transactions.

Strategies for Imbalance Datasets

• A basic approach for creating balanced training sets is to generate a sample of training instances where the rare class has adequate representation.
• Two types of sampling methods to enhance the representation of the minority class: undersampling and oversampling

• Undersampling
–The frequency of the majority class is reduced to match the frequency of the minority class
– However, some of the useful negative examples may not be chosen for training, therefore, resulting in an inferior classification model.

Oversampling:

• Examples of the minority class are artificially created to make them equal in proportion to the number of negative instances (e.g., by duplicating existing examples or creating new ones)
• Duplicating a positive instance is analogous to doubling its weight during the training stage. The same effect can be achieved by assigning higher weights to positive instances than to negative instances (an approach that can be used, for example, with logistic regression, ANN, and SVM).
• Duplicated examples have an artificially lower variance compared with their true distribution in the overall data. This can bias the classifier to the specific distribution of training instances, which may not be representative of the distribution of test instances, leading to poor generalizability.

22
Q

What is the SMOTE technique?

A

• To overcome the limitations of oversampling by duplication, we can generate synthetic positive instances in the neighborhood of existing positive instances.

• Synthetic Minority Oversampling Technique (SMOTE)
– First determine the k-nearest positive neighbors of every positive instance x
– Then generate a synthetic positive instance at some intermediate point along the line segment joining x to one of its randomly chosen k-nearest neighbors, x_k.
– Repeat the process until the desired number of positive instances is reached

• SMOTE generates new positive instances in the convex hull of the existing positive class. Hence, it does not improve the representation of the positive class outside the boundary of existing positive instances
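A minimal NumPy sketch of the SMOTE idea described above (illustrative only; libraries such as imbalanced-learn provide production implementations):

```python
import numpy as np

def smote_like(X_pos, n_new, k=5, rng=None):
    """Generate n_new synthetic minority instances by interpolating towards k-nearest neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_pos))
        x = X_pos[i]
        dists = np.linalg.norm(X_pos - x, axis=1)  # distances to every positive instance
        neighbors = np.argsort(dists)[1:k + 1]     # k nearest positives, excluding x itself
        x_k = X_pos[rng.choice(neighbors)]         # one randomly chosen neighbor
        lam = rng.random()                         # random point on the segment [x, x_k]
        synthetic.append(x + lam * (x_k - x))
    return np.array(synthetic)

X_pos = np.random.default_rng(1).normal(size=(20, 2))  # toy minority-class instances
print(smote_like(X_pos, n_new=30).shape)               # (30, 2)
```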

23
Q

How to compare the relative performance among competing models?

• Suppose we have two models
–Model MA with an accuracy = 82% computed using 10-fold cross-validation
–Model MB with an accuracy = 80% computed using 10-fold cross-validation
• How much confidence can we place on accuracy of MA and MB?
• Can we say MA is better than MB?
• Can the performance difference be the result of random fluctuations in the test set?

A

How do we know that the difference in performance is not just due to chance?

We compute the odds of it! Apply the t-test and compute the p-value

The p-value represents the probability that the reported difference is due to chance

24
Q

What is the general idea when applying the Student's t-test to two models?

A

• First decide on a confidence level, for example, 95%
– Corresponds to false discovery (false positive) rate: 𝛂 = 5%
– How frequently you are willing to declare difference when there is none

• Apply k-fold cross-validation to each model
– Obtaining k evaluations for each algorithm over same folds

• Apply Student’s t-test and compute p-value to determine whether reported difference is statistically significant
– If p-value > 𝛂 then the difference is not significant (we can claim nothing)
– If p-value < 𝛂 then the difference is significant (claim one is better than the other)
– Note that the t-test can be paired or unpaired
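A sketch of the paired test on per-fold accuracies, assuming SciPy; the fold scores are made up:

```python
from scipy import stats

# Hypothetical accuracies of two models over the same 10 cross-validation folds
acc_a = [0.83, 0.81, 0.85, 0.80, 0.82, 0.84, 0.79, 0.83, 0.82, 0.81]
acc_b = [0.80, 0.79, 0.82, 0.78, 0.80, 0.81, 0.77, 0.80, 0.79, 0.80]

alpha = 0.05  # 95% confidence level
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)  # paired t-test over the same folds
print(p_value, "significant" if p_value < alpha else "not significant")
```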

25
Q

What about multiple hypothesis testing? How could we do this?

A

Bonferroni Correction

• Assume that individual tests are independent.
• Divide the desired p-value threshold by the number of tests performed.
• Example (20 tests, desired threshold 0.05)
– The threshold is set to 0.05/20 = 0.0025
– P(making a mistake on a single test) = 0.0025
– P(not making a mistake on a single test) = 0.9975
– P(not making any mistake over the 20 tests) = 0.9975^20 = 0.9512
– P(making at least one mistake) = 1 - 0.9512 = 0.0488

Non Parametric Tests

• They do not make any assumption about the distribution of the variable in the population

• Mann-Whitney U Test
–Nonparametric equivalent of the independent t-test

• Wilcoxon matched-pairs signed rank test
–Used to compare two related groups

26
Q

What are the impacts of the classifier threshold on precision and recall?

A

We can use the threshold to optimize our precision and recall:
a higher threshold increases precision and lowers recall;
a lower threshold decreases precision and increases recall.

• Suppose we use a near one threshold to classify positive examples

• Then, we will classify as positives only examples for which we are very confident (this is a pessimistic classifier)

• Precision will be high
– In fact, we are not likely to produce many false positives

• Recall will be low
– In fact, we are likely to produce many false negatives

• Suppose we use a near zero threshold to classify positive examples

• Then, we will classify everything as positives (this is an optimistic classifier)

• Precision will be low as we are going to generate the maximum number of false positives (everything is positive!)

• Recall will be high since by classifying everything as positive we are going to generate the minimum number of false negatives

27
Q

How can we determine the best classification threshold?

A

• Plot precision as a function of recall for varying threshold values
• The best classifier would be the one that always has a precision equal to one (but this never happens)
• More in general, classifiers will show curves of different shapes
• How to decide among multiple classifiers?
– Use the area under the curve (the nearer to one, the better)
– Use the F1 measure

Determining the Best Classification Threshold Using a Precision-Recall Curve and F1-Score

A classification threshold is a value that determines the point at which a model classifies predictions as positive or negative. Adjusting this threshold directly affects the precision and recall of the model, and consequently, the F1-score. The best threshold balances these metrics based on the application’s requirements.

Steps to Determine the Best Threshold

  1. Generate the Precision-Recall Curve
    • A precision-recall curve plots precision (y-axis) against recall (x-axis) for various threshold values.
    • Lower thresholds increase recall but may reduce precision, while higher thresholds increase precision but reduce recall.
  2. Calculate F1-Score for Each Threshold
    • For each threshold on the precision-recall curve:
    • Compute precision: \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
    • Compute recall: \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
    • Compute the F1-score:

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

  3. Select the Threshold with the Highest F1-Score
    • The threshold with the maximum F1-score represents the optimal trade-off between precision and recall.
    • This threshold is ideal when precision and recall are equally important.

Visualization Approach

•	Plot the precision-recall curve and annotate it with F1-scores for key thresholds.
•	Mark the threshold corresponding to the highest F1-score on the curve.
•	This helps in visually understanding how threshold adjustments impact precision, recall, and F1.

Why Use the F1-Score?

•	The F1-score balances precision and recall, making it a suitable metric for imbalanced datasets.
•	When both false positives and false negatives are important, the F1-score helps identify a threshold that minimizes both error types.

When Precision or Recall is More Critical

•	If precision is more important (e.g., minimizing false positives in spam detection), select a threshold that maximizes precision, even if it lowers recall.
•	If recall is more important (e.g., detecting all cases of disease), select a threshold that maximizes recall, even if precision is slightly reduced.

Limitations

•	The optimal threshold based on F1-score might not be suitable if precision and recall have unequal importance. In such cases, weighted metrics or a cost-sensitive approach might be more appropriate.
•	Real-world applications may require domain-specific adjustments to thresholds beyond what the F1-score alone suggests.

By calculating the F1-score for various thresholds on the precision-recall curve, you can identify the best classification threshold for balancing precision and recall effectively, ensuring the model performs optimally for its intended use case.
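A sketch of this procedure, assuming scikit-learn, that picks the threshold maximizing F1 on a held-out set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # avoid division by zero

best = np.argmax(f1[:-1])          # the last precision/recall pair has no threshold
print(thresholds[best], f1[best])  # threshold with the best precision/recall trade-off
```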

28
Q

What are the ROC curves? How could we use them to compare models?

A

ROC Curves: Definition

A Receiver Operating Characteristic (ROC) curve is a graphical representation of a classification model’s performance across various decision thresholds. It plots:
• True Positive Rate (TPR) (y-axis), also known as Recall or Sensitivity:

\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}

•	False Positive Rate (FPR) (x-axis):

\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}}

The curve shows the trade-off between sensitivity and specificity as the classification threshold is adjusted.

Key Features of the ROC Curve

1.	Diagonal Line (Baseline):
•	Represents a random classifier. A curve close to this line indicates poor performance.
2.	Perfect Model:
•	A perfect model reaches the top-left corner of the graph, indicating  \text{TPR} = 1  and  \text{FPR} = 0 .
3.	Area Under the Curve (AUC):
•	The AUC-ROC score quantifies the overall performance of the model. It ranges from 0 to 1:
•	1.0: Perfect model.
•	0.5: Random guess.
•	< 0.5: Worse than random.

Using ROC Curves to Compare Models

  1. Compare AUC Scores
    • Models with a higher AUC score generally perform better across all thresholds.
    • Example: If Model A has an AUC of 0.90 and Model B has an AUC of 0.75, Model A is better at distinguishing between classes.
  2. Visual Analysis
    • Compare the shapes of the ROC curves:
    • A steeper curve near the top-left corner indicates better performance at higher TPRs and lower FPRs.
    • A flatter curve closer to the diagonal line suggests poor discrimination ability.
  3. Focus on Specific Regions
    • Depending on the application, some thresholds might matter more than others:
    • In medical diagnosis, prioritize the part of the curve where FPR is low, as false positives can be costly.
    • In fraud detection, focus on higher TPR regions to detect as many true positives as possible.
  4. Threshold Selection
    • Use the ROC curve to select a threshold that balances TPR and FPR according to the application’s needs.

Advantages of ROC Curves

1.	Threshold Independence:
•	ROC curves evaluate performance across all thresholds, offering a holistic view of the model.
2.	Class Imbalance Resilience:
•	Unlike accuracy, ROC curves are not affected by imbalanced datasets since they focus on TPR and FPR.
3.	Comparative Analysis:
•	Easily compare multiple models’ discrimination abilities in the same plot.

Limitations of ROC Curves

1.	Not Suitable for Imbalanced Datasets:
•	In highly imbalanced datasets, FPR might appear low due to the abundance of true negatives, making the curve misleading.
•	In such cases, a Precision-Recall (PR) curve is often preferred.
2.	Application-Specific Metrics:
•	ROC curves focus on TPR and FPR but might not reflect application-specific costs or priorities (e.g., false negatives being more critical than false positives).

Example Scenario: Comparing Models

Case:

You have three models (A, B, and C) for a binary classification task. Plot their ROC curves:
1. Model A:
• AUC = 0.95 (steep curve near the top-left corner).
• Excellent at distinguishing between positive and negative classes.
2. Model B:
• AUC = 0.80 (moderate curve).
• Performs well but less reliably than Model A.
3. Model C:
• AUC = 0.60 (curve close to the diagonal line).
• Barely better than random guessing.

Decision:

•	Select Model A for the best performance across thresholds.
•	If your use case prioritizes specific thresholds (e.g., low FPR), examine the corresponding regions of the ROC curves.

By using ROC curves and their AUC scores, you can compare models, analyze trade-offs, and select the best model for your specific application.
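A sketch comparing two models by AUC, assuming scikit-learn; the models and data are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("logistic", LogisticRegression(max_iter=1000)),
                    ("shallow tree", DecisionTreeClassifier(max_depth=3, random_state=0))]:
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    print(name, roc_auc_score(y_test, scores))  # higher AUC = better ranking of positives
```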

29
Q

What does the no free lunch theorem stipulate?

A

• If the goal is to obtain good generalization performance, there are no context-independent or usage-independent reasons to favor one classification method over another

• If one algorithm seems to outperform another in a certain situation, it is a consequence of its fit to the particular problem, not the general superiority of the algorithm

• When confronting a new problem, this theorem suggests that we should focus on the aspects that matter most
–Prior information
–Data distribution
–Amount of training data
–Cost or reward

30
Q

What is the k-nearest neighbor method? How does this method relate to instance-based methods? What are the differences between them?

A

The k-Nearest Neighbor (k-NN) is a simple, yet powerful, supervised machine learning algorithm often used for classification and regression tasks. It classifies a data point based on the majority class of its nearest neighbors or predicts a value by averaging the values of its nearest neighbors.

How k-NN Works:

1.	Training Phase:
•	k-NN does not perform explicit training. It simply stores the training data.
•	This is why it is considered a lazy learning algorithm.
2.	Prediction Phase:
•	When given a new data point, the algorithm computes the distance between the point and all points in the training set.
•	It selects the k nearest neighbors (commonly using Euclidean distance, but other metrics like Manhattan or Minkowski can be used).
•	For classification, the predicted class is determined by majority voting among the neighbors.
•	For regression, the prediction is often the average (or weighted average) of the neighbors’ values.
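A small sketch, assuming scikit-learn's KNeighborsClassifier and the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" just stores the data; the distance computations happen at prediction time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy via majority vote of the 5 nearest neighbors
```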

Relation to Instance-Based Methods:

•	Instance-based methods are a family of machine learning techniques where the model “learns” by storing the training data and making predictions based on these stored instances.
•	k-NN is a classic example of an instance-based method because it relies entirely on stored training instances to make predictions, rather than building an explicit model or deriving parameters.

Differences Between k-NN and Instance-Based Methods (Broadly):

While k-NN is a specific implementation of instance-based learning, other instance-based methods may differ in the following ways:

Aspect | k-NN | Other Instance-Based Methods
Specificity | A specific algorithm within instance-based methods. | A broader category including other methods like RBF networks or case-based reasoning.
Distance Function | Typically uses Euclidean or similar distance metrics. | May use more complex similarity measures depending on the method.
Prediction Strategy | Majority voting (classification) or averaging (regression). | May employ weighting, kernel functions, or heuristics.
Memory Usage | Stores all training data, often leading to high memory requirements. | Some methods may condense or preprocess the instances for efficiency.
Adaptation | Predictions rely on all data points near a query. | Might use only specific "prototypical" instances or adapt based on context.

Key Takeaways:

•	k-NN is a specific type of instance-based learning method.
•	All k-NN methods are instance-based, but not all instance-based methods are k-NN.
•	Instance-based methods include a range of algorithms that rely on stored instances for predictions, sometimes with enhancements to address k-NN’s drawbacks, such as sensitivity to irrelevant features or large memory requirements.
31
Q

What is the impact of the number of neighbors chosen in the k-nearest neighbor method?

A

If k is too small, classification might be sensitive to noise points
If k is too large, neighborhood may include quite dissimilar examples

32
Q

What is the cost of applying k-nearest neighbor? What are the methods to improve this cost?

A

• Basic Approach
–Linear scan of the data
–Classification time for a single instance depends on the number of data points and the number of variables: O(nd) for n instances of d variables
–This becomes prohibitive when the training set is large

• Nearest-neighbor search can be sped up by using
–KD-Trees
– Ball-Trees

33
Q

Define the KD-Tree method. Discuss its effectiveness

A

Split the space hierarchically using a tree generated from the data
To find the neighbor of a specific example, navigate the tree using the example

Effectiveness of KD-trees

• Search complexity depends on the depth of the tree
• It is the logarithm of the number of nodes for a balanced tree, O(log(n))
• Occasional rebalancing of the tree may be needed; randomizing the order of the data is another option
• But the amount of backtracking required depends on the quality of the tree
• Some nodes cover roughly square regions (good) while others are skinny (bad)

34
Q

If we try logistic regression and the k-nearest neighbor method, and the results of the first attempt are bad but the ones from the second are good, what does that say about the nature of the problem?

A

It suggests that the quantity we are trying to predict has local properties

35
Q

What does Naive Bayes classifier do? What is its general idea? What assumption does it make?

A

• What’s the probability of the class given an example?
• An example is represented as a tuple of attributes
• Given the target y (identifying the class value for the instance) we are looking for the class with the highest probability for x

• Naïve Bayes classifiers assume that attributes are statistically independent. Thus, the evidence splits into parts that are independent

• Training
–Count the frequency of tuples (xi,y) for each attribute value xi and each class value y
–Use the counts to compute estimates for the class probability P(y) and the conditional probability P(xi|y)
• Testing
– Given an example x, compute the most likely class as \hat{y} = \arg\max_{y} P(y) \prod_{i} P(x_i | y)

• Two assumptions
–Attributes are equally important
–Attribute are statistically independent
• Statistically independent means
–That knowing the value of one attribute xj says nothing about the value of another xi if the
class y is known, that is, P(xi|xj,y) = P(xi|y)
–Independence assumption is almost never correct! But the scheme works well in practice
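A tiny counting-based sketch of these training and testing steps on made-up categorical data (Laplace smoothing omitted for brevity):

```python
from collections import Counter, defaultdict

# Toy categorical dataset: each example is (attribute tuple, class label)
data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("overcast", "hot"), "yes"),
        (("rainy", "cool"), "yes"), (("sunny", "cool"), "yes")]

# Training: count class frequencies and (attribute value, class) frequencies
class_counts = Counter(y for _, y in data)
cond_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
for x, y in data:
    for i, v in enumerate(x):
        cond_counts[(i, y)][v] += 1

def predict(x):
    """Testing: argmax over classes of P(y) * prod_i P(x_i | y), estimated from counts."""
    best, best_score = None, -1.0
    for y, cy in class_counts.items():
        score = cy / len(data)  # P(y)
        for i, v in enumerate(x):
            score *= cond_counts[(i, y)][v] / cy  # P(x_i | y)
        if score > best_score:
            best, best_score = y, score
    return best

print(predict(("sunny", "hot")))  # "no" wins for this toy data
```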

36
Q

Given the example in DM3 image calculate the probabilities of the

A

Answer in image DM3

37
Q

What is the zero frequency problem in Naive Bayes classifiers? How could we solve it?

A

• What if an attribute value does not occur with every class value? (for instance, “Outlook = overcast” for class “no”)

• The corresponding probability will be zero, and posteriori probability will also be zero! (No matter how likely the other values are!)

• Typical remedy is to add 1 to count for every (attribute value, class) pair
–Process called smoothing.
–Adding 1 is called a Laplace estimator

• Resulting probabilities will never be zero! It also stabilizes probability estimates

38
Q

How does Naive Bayes deal with missing values?

A

• During training, instance is not included in frequency count for attribute value-class combination

• During testing, the attribute will be omitted from calculation

39
Q

How does Naive Bayes deals with numeric attributes?

A

• So far, we applied Naïve Bayes to categorical data.
• What if some (or all) of the attributes are numeric?
• Two options:
– Discretize the data to make it binary or discrete
– Compute a probability density for each class, either by assuming a parametric form for the distribution and estimating its parameters (e.g., assume attribute values for the class follow a Gaussian distribution), or by directly estimating the probability density from the data (e.g., use kernel smoothing to estimate the density of values along the axis for the class)

40
Q

Calculate the probability of … in image DM4

A

Answer in DM4

41
Q

What are Bayesian Belief Networks? How do they work?

A

• Bayesian Belief Networks (BBN) provide a graphical representation of probabilistic relationships among a set of random variables

• Describe the probability distribution governing a set of variables by specifying
– Conditional independence assumptions that apply on subsets of the variables
–A set of conditional probabilities

• Two key elements
– A directed acyclic graph, encoding the dependence relationships among the variables
– The network topology imposes conditions regarding the variables’ conditional independence
– A probability table associating each node with its immediate parent nodes

42
Q

Calculate the example in image DM5

A

Answer in DM5