Short Answer Prep Flashcards
What is the purpose of Multiple Linear Regression (MLR)?
MLR models the relationship between a label and multiple features, allowing us to analyze the combined effect of the features on the label.
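A minimal sketch of fitting an MLR model, assuming scikit-learn is available; the features (age, years of education) and all values are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: two features (e.g., age and years of education)
# predicting a numeric label (e.g., income). Values are made up.
X = np.array([[25, 12], [32, 16], [47, 14], [51, 18], [38, 16]])
y = np.array([30000, 52000, 58000, 80000, 61000])

model = LinearRegression().fit(X, y)
print(model.coef_)                # one coefficient per feature
print(model.intercept_)           # baseline when all features are zero
print(model.predict([[40, 15]]))  # prediction for a new observation
```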
Why is Linearity an important assumption in MLR?
MLR assumes the relationship between each predictor and the outcome is linear; if the true relationship is nonlinear, the model will systematically mis-estimate it.
Why must errors in MLR be normally distributed?
Normally distributed residuals are required for valid statistical inference (e.g., t-tests and confidence intervals on the coefficients); when this assumption fails, p-values and intervals can be misleading.
What is Homoskedasticity in MLR, and why is it important?
Homoskedasticity means that residual variance remains constant across predictor values, so the model's errors are equally spread at every level of the features; unequal variance (heteroskedasticity) makes standard errors unreliable.
What does No Multicollinearity mean in MLR?
Predictors should not be highly correlated; multicollinearity can make it difficult to determine each feature’s unique impact, leading to unreliable coefficients.
Why is Independence of Errors an important assumption?
Errors should not be autocorrelated; that is, they should show no patterns or dependencies across observations (a common problem in time-series data). Correlated errors make standard errors and significance tests unreliable.
How can missing data impact the assumptions of MLR?
Missing data can violate MLR assumptions: depending on how values are missing or imputed, it can introduce dependencies among predictors (multicollinearity), reduce the model's reliability, and bias the results.
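As a rough illustration of how missing values are commonly handled before fitting MLR, here is a sketch using pandas; the column names and values are made up:

```python
import pandas as pd

# Illustrative DataFrame with missing values (columns are made up).
df = pd.DataFrame({"age": [25, None, 47, 51],
                   "income": [30000, 52000, None, 80000]})

dropped = df.dropna()                             # drop incomplete rows
imputed = df.fillna(df.mean(numeric_only=True))   # impute with column means
print(dropped, imputed, sep="\n")
```

Note that mean imputation can itself distort variances and correlations, which is one way missing data feeds back into the assumptions above.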
Why might age be a suitable predictor in MLR, but credit card numbers are not?
Age is a meaningful, measurable quantity that can relate linearly to an outcome, while credit card numbers are arbitrary identifiers with no predictive relevance or linear relationship to the outcome.
What is the most common distance measure used in k-NN?
Euclidean distance, which measures closeness by calculating the square root of the sum of squared differences between feature pairs.
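A minimal sketch of that calculation in plain Python (nothing beyond the standard library assumed):

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared differences between feature pairs.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0 (a 3-4-5 triangle)
```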
How does k-NN classify a new instance?
By assigning the class that is most common among its k-nearest neighbors based on the distance measure.
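A bare-bones sketch of the voting step, with a tiny hand-made training set (the points and labels are illustrative):

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    # Same Euclidean distance helper as in the sketch above.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, new_point, k):
    # train: list of (feature_vector, class_label) pairs.
    # Take the k training points closest to new_point, then majority-vote.
    nearest = sorted(train, key=lambda pair: euclidean_distance(pair[0], new_point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([5.0, 5.0], "B"), ([5.1, 4.9], "B")]
print(knn_classify(train, [1.1, 1.0], k=3))  # "A": two of the three nearest are A
```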
What is the effect of choosing a smaller k (e.g., k=2) in k-NN?
A smaller k can lead to overfitting, since the prediction depends on very few neighbors and is highly sensitive to noise; an even value such as k=2 can also produce tied votes.
What does it mean that k-NN is a “lazy” algorithm?
k-NN doesn’t build a model upfront but instead classifies instances directly from the training data, making it computationally intensive at prediction time.
What distinguishes k-means from k-medoids clustering?
k-means uses an average point as a cluster center, which might not exist in the dataset, while k-medoids uses an actual data point, making it more robust to outliers.
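A small sketch contrasting the two kinds of cluster center on a one-dimensional cluster with an outlier; the numbers are made up and this is not a full clustering algorithm:

```python
import numpy as np

# Illustrative one-dimensional cluster with an outlier.
cluster = np.array([[1.0], [1.2], [1.1], [9.0]])  # 9.0 is the outlier

# k-means center: the mean, which need not be an actual data point.
mean_center = cluster.mean(axis=0)  # ~[3.08], dragged toward the outlier

# k-medoids center: the actual point minimizing total distance to the others.
dist = np.abs(cluster - cluster.T)           # pairwise distance matrix
medoid = cluster[dist.sum(axis=1).argmin()]  # [1.2] or [1.1], a real point

print(mean_center, medoid)
```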
Describe the key assumptions of Multiple Linear Regression (MLR) and explain why age could be a meaningful predictor while credit card numbers are not. Additionally, discuss how missing data can impact MLR assumptions.
Briefly Introduce MLR Assumptions
Summarize that MLR assumes certain conditions to ensure accurate and unbiased predictions.
Explain Each Assumption
Linearity: Predictor and outcome relationship should be linear.
Normal Distribution of Errors: Residuals should follow a normal distribution so that significance tests and confidence intervals are valid.
Homoskedasticity: Consistent variance of residuals across all predictor levels.
Independence of Errors: Errors should not display patterns or dependencies.
No Multicollinearity: Predictors should be independent of one another (common diagnostic checks are sketched after this list).
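Where tools are available, several of these assumptions can be spot-checked numerically. A sketch assuming statsmodels and scipy are installed, with synthetic data standing in for a real dataset:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Synthetic data for illustration; in practice use your own features and label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

Xc = sm.add_constant(X)
model = sm.OLS(y, Xc).fit()

# Normal distribution of errors: Shapiro-Wilk test on the residuals.
print(stats.shapiro(model.resid))  # large p-value: no evidence against normality

# No multicollinearity: variance inflation factors (VIF well above 10 is a red flag).
print([variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])])

# Independence of errors: Durbin-Watson statistic (near 2 suggests no autocorrelation).
print(durbin_watson(model.resid))
```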
Importance of Suitable Predictors (Age vs. Credit Card Number)
Age: Age is a numerical predictor with meaningful variation, making it suitable for MLR as it can have a linear relationship with an outcome (e.g., income).
Credit Card Number: Credit card numbers are identifiers, lacking a measurable or linear relationship to most outcomes, and thus provide no predictive relevance.
Impact of Missing Data on MLR Assumptions
Missing data can violate assumptions, particularly no multicollinearity and independence of errors. It can introduce dependencies among predictors, reduce the model's robustness, and lead to biased coefficients and unreliable predictions.
Explain how k-Nearest Neighbors (k-NN) uses distance measures for classification and discuss the effect of the choice of k on model performance.
Introduce Distance Measure in k-NN
Explain that k-NN often uses Euclidean distance, calculated by taking the square root of the sum of squared differences between feature pairs. This measures how “close” an instance is to others in multi-dimensional space.
Describe k-NN Classification
k-NN classifies a new data point by looking at the class labels of its k-nearest neighbors and assigning it the most common class.
Effect of Choosing k on Model Performance
Small k (e.g., k=2): Can lead to overfitting, as it is highly sensitive to minor variations.
Large k (e.g., k=20): Generalizes better and is less sensitive to noise, reducing the risk of overfitting but potentially underfitting if k is too large (a small sweep over k is sketched below).
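To see this trade-off concretely, one could sweep k on a toy dataset; a sketch assuming scikit-learn, with synthetic data (exact scores will vary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data; results depend on the dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for k in (1, 2, 5, 20, 50):
    clf = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"k={k:>2}  mean CV accuracy={score:.3f}")
# Typically: very small k overfits noise, very large k underfits.
```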