04 - ML Flashcards

1
Q

Are classification and logistic regression the same?

A

No, they are not the same. Classification is a task performed by many supervised learning algorithms, for example decision trees, random forests, etc. Logistic regression is one of those algorithms.

2
Q

Is it possible to create models in Machine learning?

A

Machine learning is mainly about creating mathematical relations between features of a dataset. These mathematical relations are called models.

3
Q

How was linear regression done prior to machine learning?

A

Long before computers, it was done by hand by mathematicians and statisticians on very small datasets. With the availability of computers we now have high computational power, so today we can fit linear regression models on practically any amount of data.

4
Q

What was fearless about ML?

A

It’s a reference to the fact that ML practitioners usually don’t spend much time evaluating the statistical validity of a method; they prefer to just create a model and evaluate its performance, rather than reasoning about the statistical validity of the process.

5
Q

If we don’t understand how a model’s intricacies work, how can we be 100% sure that it’s a successful model?

A

A mathematical interpretation of how the model works is not always possible. When an algorithm is designed, its mathematical steps are derived; when the model is applied, it is not always possible, or necessary, to trace those steps. It is the performance of the model on unseen data that decides its success.

6
Q

Do more data points bring new insights?

A

Yes. With larger datasets, insights are supported by more evidence and hence are more trustworthy. The same goes for the quality of model training: with more data, models generally train better.

7
Q

Why is the model interpretation important?

A

Model interpretation is important because it gives an understanding of how the model reached its final results/predictions. It also gives an indication of whether changes should be made to help the model perform better.

8
Q

Does prediction imply a causal relationship exists?

A

No, prediction does not necessarily imply a causal relationship. We may train a machine learning model on variables that do not affect each other in real life and still obtain good predictions from the trained model.

9
Q

What are the features of the machine learning model?

A

Features are the attributes or variables that are used in a machine learning model. Generally, there are independent and dependent features. Independent features are the input features of the model while the dependent feature is the output variable.

10
Q

Is risk similar to variance?

A

For a machine learning model, the higher the error, the greater the chance of failure on unseen data, and this is a measure of the model’s risk. So risk can be interpreted as the error associated with the model, and minimizing risk is equivalent to minimizing error or variance.

11
Q

What are overfitting and underfitting?

A

Overfitting is the case where a machine learning model follows the noise in the data too closely; it becomes too complex to make good predictions on unseen data. Underfitting is the opposite case, where the model is too simple to capture the patterns in the data, and hence it is also unable to generalize to unseen data.

12
Q

What does interpolation mean?

A

Interpolation is a statistical method by which related known values are used to estimate an unknown value. It is useful in treating missing values in machine learning.
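
To make this concrete, here is a minimal sketch (with made-up numbers) using pandas, whose Series.interpolate method fills missing values by linear interpolation between known neighbors:

```python
import numpy as np
import pandas as pd

# A series with two missing readings (hypothetical values).
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# Each NaN is estimated from the straight line between its known neighbors.
print(s.interpolate(method="linear"))  # positions 2 and 4 become 3.0 and 5.0
```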

13
Q

Why do we square the residual?

A

It is for mathematical convenience. If we didn’t square them, the positive and negative residuals would cancel each other out, and the total error could come out as zero or close to zero even for a poor fit, which would be misleading.
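
A tiny numeric illustration (made-up numbers): the raw residuals of a clearly imperfect fit sum to zero, while squaring them exposes the error:

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([3.0, 3.0, 7.0, 7.0])  # an obviously imperfect fit

residuals = y_true - y_pred               # [-1, 1, -1, 1]
print(residuals.sum())                    # 0.0 -- errors cancel out
print((residuals ** 2).sum())             # 4.0 -- squaring exposes the error
```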

14
Q

What is the limit of the error made by a regression model?

A

There is no predefined limit on the error of a linear regression model. The only condition is that the sum of squared error terms should be minimized.

15
Q

Can we use either normalization or standardization?

A

We use either standardization or normalization because it brings all the variables to a uniform scale. If we don’t do that, there is a chance that a less important variable will be given more priority simply because of its larger scale.

16
Q

Is it acceptable if the slope of a variable is very small in the model?

A

The slope of a variable tells us how the dependent variable changes for a unit increase in that independent variable while keeping the other variables constant. The larger the slope, the larger the change, and vice versa. A very small slope indicates that the feature has little influence on the target variable, and such a feature can be removed.

17
Q

Is it always good to add more variables that could potentially have an influence on the outcome?

A

Yes, it is good to add more variables that influence the outcome. It makes the predictions more generic and reliable on unseen data.

18
Q

Is there a way to validate the independence of error terms?

A

Yes; hypothesis testing is used to validate the independence of error terms. For example, the Durbin-Watson test checks for autocorrelation in the residuals.

19
Q

What do we mean by “linear predictor”?

A

A linear predictor means a linear relationship between the output and the input variables. The term linear refers to the coefficients/parameters of the model. Any equation that does not satisfy linearity in the parameters is a non-linear relation.

20
Q

Don’t we divide the TSS by ‘n’ so we have something somehow ‘independent’ from the number of samples we take?

A

Dividing the total sum of squares by n would not change anything, because the residual sum of squares would also be divided by n. Since R-squared considers the ratio between the two, the n cancels out and we do not need to divide TSS by n.

21
Q

Can we interpret R² as a measure to understand how linear our dataset is?

A

No, R² is used to measure how much of the variance in the data our model can capture. The higher the value, the better the performance of the model.

22
Q

How to choose between Normalization and Standardization?

A

Normalization typically means that the range of values is rescaled to lie between 0 and 1, while standardization typically means that each value is rescaled to measure how many standard deviations it is from its mean.
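
A minimal sketch (made-up numbers) using scikit-learn’s MinMaxScaler and StandardScaler to show the two rescalings side by side:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

print(MinMaxScaler().fit_transform(X).ravel())
# [0.   0.25 0.5  0.75 1.  ]   -- normalized to the [0, 1] range

print(StandardScaler().fit_transform(X).ravel())
# [-1.41 -0.71 0. 0.71 1.41]   -- standard deviations from the mean
```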

23
Q

What is an acceptable R-squared in the real world?

A

The higher the value of R-squared, the better the fit of the model to the given data, because a higher R-squared implies a lower residual sum of squares. What counts as acceptable in practice depends on the problem and the domain.

24
Q

Do we call it R squared because the square is mathematically useful somewhere else?

A

R is the correlation between the predicted values and the observed values of Y. R square is the square of this coefficient and indicates the percentage of variation explained by your regression line out of the total variation.

25
Q

Can we discriminate the features before the training? like finding the correlation between features against labels?

A

Yes, we can compute the correlation between the variables, and if any two variables are highly correlated we drop one of them, as it is redundant and adds no new information to the model. In addition, exploratory data analysis lets us remove variables based on many other criteria.

26
Q

Do we need to test for multicollinearity in case of more variables?

A

Yes, we do need to check for multicollinearity before we develop a model, and we drop the redundant variables. If a variable has a VIF (variance inflation factor) value greater than 5, we generally drop it.
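
A hedged sketch (made-up data) using statsmodels’ variance_inflation_factor: a feature that nearly duplicates another shows a very large VIF:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical data: x3 is almost a copy of x1, so both should show huge VIFs.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
X["x3"] = X["x1"] + rng.normal(scale=0.01, size=100)

Xc = add_constant(X)  # VIF is conventionally computed with an intercept present
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x3 show very large VIFs; by the usual rule (VIF > 5), drop one
```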

27
Q

Is Linear Regression a good base model to compare other methods?

A

Yes. It is always good to develop linear regression as a baseline model for continuous target variables, and then move on to more complicated algorithms such as boosting, neural nets, etc.

28
Q

How to estimate the parameters in the model?

A

To estimate the parameters of the model, we minimize the sum of squared errors, treating the parameters as the variables of the optimization. Applying an optimization algorithm (or, for linear regression, solving in closed form) gives the estimates.
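
A minimal sketch (made-up data): for linear regression, the sum of squared errors is minimized in closed form by the normal equations:

```python
import numpy as np

# Toy data generated from y = 2x + 1 plus noise (hypothetical numbers).
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=50)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
theta = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations: minimizes the SSE
print(theta)                                # approximately [1.0, 2.0]
```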

29
Q

Do we iterate to reduce the distance from the true value?

A

Yes, if a model is not performing well in the first go, it means it is not close enough to the true line. In such a case we do iterate by making changes in the variables of the model, in the data, or hyper-parameters of the model.

30
Q

How reliable is the prediction if the user provides a clean dataset to run the training?

A

Having clean data is good for a machine learning model, but it is not the only factor that decides the model’s reliability. Along with clean data, the features selected for the model and its performance on unseen data are what make the model reliable.

31
Q

How do we know if there is bias in our estimation and how to measure it?

A

Bias can be interpreted as the training error of the model. If the training error is very high, then the model is biased: it is too simple to perform reliably on unseen data.

32
Q

Is there a book or website that you recommend on today’s concepts?

A

You can refer to ISLR (An Introduction to Statistical Learning) by Gareth James and co-authors.

33
Q

Are confidence interval and confidence band the same?

A

They are two different representations of the same thing. A confidence band is the lines on a probability plot or fitted line plot that depict the upper and lower confidence bounds for all points on a fitted line within the range of data. On a fitted line plot, the confidence interval for the mean response of a specified predictor value is the points on the confidence bands directly above and below the predictor value.

34
Q

Can you describe a little bit about the difference between MLE and OLS? Especially under what kind of context we might favor one over the other?

A

Ordinary least squares (OLS), also called linear least squares, is a method for estimating the unknown parameters of a linear regression model. Maximum likelihood estimation (MLE) is a more general method for estimating the parameters of a statistical model and fitting it to data. For linear regression with normally distributed errors the two give the same estimates; MLE is favored when the model or the noise distribution goes beyond that setting.

35
Q

Does R-squared increase with the addition of new features?

A

Yes. For ordinary least squares with an intercept, R-squared never decreases when new features are added, even if they are uninformative. This is why adjusted R-squared, which penalizes extra features, is often preferred.

36
Q

Can we use logarithmic regression in machine learning?

A

Yes, if the data fits better to a logarithmic equation, then it can be fit that way. Machine learning cares about whichever relation fits the given data best.

37
Q

Is standardization the same as scaling?

A

Standardizing data expresses each data point as the number of standard deviations it lies from the mean of the data. Standardization is just one technique for performing feature scaling.

38
Q

How do we know that linear regression is the wrong choice for a problem?

A

If the trained linear regression model is not performing well on unseen data after all possible modifications, then it is not a good choice for that specific problem, and one should choose from other algorithms suitable for regression.

39
Q

PCA gives us coefficients to multiply with the variables (X) to find the Y, is it some sort of regression?

A

Even though PCA does that, it is not a supervised learning method and hence it does not do any sort of regression. The computations of PCA are different from that of linear regression.

40
Q

Can you use PCA to reduce dimension for your dataset for regression?

A

The main purpose of PCA is to reduce the dimension (number of features) of a given dataset. It is applied mainly when the number of features is very high. PCA itself does not perform regression, although the reduced components can be used as inputs to a regression model.
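
A minimal sketch (hypothetical data) with scikit-learn: reduce 20 features to 5 components and check how much variance is retained:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data: 100 samples, 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)             # 100 x 5 matrix of components
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```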

41
Q

We use R-squared in supervised learning as well as in unsupervised PCA, but are there any differences in applying in these 2 categories?

A

In PCA, we minimize the sum of squared residuals of the points’ projections, and the fraction of the total variance accounted for by the first k principal components plays the role of the R² of the projection. In supervised learning, R-squared measures how well the regression model fits the observed data; a higher R-squared indicates a better fit.

42
Q

What are the differences between classification and clustering?

A

The difference between classification and clustering is that classification is used in supervised learning techniques where predefined labels are assigned to instances. On the other hand, clustering is an unsupervised learning technique where similar instances are grouped based on their features or properties; there are no predefined labels assigned to instances.

43
Q

Is ML always a black box?

A

It’s not always a black box. There is generally a scale of interpretability: some ML algorithms, like linear regression and decision trees, are highly interpretable, while others, like neural networks, are not very interpretable.

44
Q

How is ML different from a model?

A

Machine learning algorithms are procedures that are implemented in code and are run on data. Machine learning models are output by algorithms and are comprised of model data and a prediction algorithm.

45
Q

What is the difference between linear and non-linear regression?

A

Nonlinear regression is a form of regression analysis in which the data are modeled as a nonlinear function of the model parameters, whereas in linear regression we have a linear combination of the parameters.

46
Q

What does the common notion that “Theory doesn’t always explain the success” mean?

A

There can be instances where a model gives very good performance on some data even though the theory behind the algorithm does not obviously suit that data. For example, deep neural networks work very well on large datasets, but it is not obvious from mathematical theory why they work.

47
Q

When you were comparing building a model to just making a prediction, were you referring to having interpretable coefficients?

A

No. If the purpose of the model is simply to make predictions and provide insights into the problem, then it is not important for the coefficients to be interpretable. On the other hand, if we want the model to be interpretable, then it is of utmost importance that we know which parameters we are using, and they must be interpretable to us.

48
Q

How accurate does an ML algorithm need to be to be considered a correct model?

A

A data scientist’s job is to strive for as high accuracy as possible, but there is no pre-stated benchmark to consider a model ‘correct’. It depends on the problem at hand and business requirements.

49
Q

What is the difference between Vector data records and vector components?

A

A vector data record is an input array in which the value at each position signifies a particular quantity. For example, if I want to express length and breadth in an array with the format [L, B], then [5, 2] is a vector data record. Vector components refer to the individual values within that vector; for example, 5 and 2 are the two components in the previous example.

50
Q

Do the time and situations in which the advertisements were run matter?

A

Yes, it might be the case but in our scenario, we are more concerned with the overall effects of each kind of advertisement we have had so far.

51
Q

What’s the idea behind degrees of freedom?

A

In machine learning, the degrees of freedom refer to the number of parameters in the model. In linear regression, it refers to the number of coefficients. If there are more degrees of freedom (model parameters) in machine learning, then the model is expected to overfit the training dataset.

52
Q

What is the degree of freedom for regression?

A

For a simple linear regression with one variable, there are two degrees of freedom, the slope, and the intercept. If there are multiple variables, the number of degrees of freedom increases as the number of slope parameters used to describe the line in the n-dimensional space increases. So, if there are n variables, then there are n+1 degrees of freedom (intercept + slopes).

53
Q

Are Theta and X_i two vector components that give X at given I?

A

Theta is the vector of slope parameters, and X stands for our data. The dot product of these two vectors gives us the predicted value of y (the target variable) at a given x.

54
Q

Is residual vertical? or orthogonal to the line?

A

A residual is the difference between the actual value of the target variable and the predicted value. Hence, it is measured vertically, not orthogonally to the line.

55
Q

Does theta represent a different slope?

A

Theta is the vector denoting the slope parameters. Theta has components Theta_0, Theta_1, ..., Theta_m.

56
Q

Why did we add the 1’s into the X vector? Was that to help with some of the math we’re doing here somehow?

A

We add a 1 to the vector of X’s so that, when the theta vector is multiplied with the X vector, the intercept term is included automatically.
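
A tiny sketch (made-up numbers) of why the leading 1 works:

```python
import numpy as np

theta = np.array([3.0, 2.0])   # [intercept, slope]
x = np.array([1.0, 5.0])       # leading 1 stands in for the intercept term

print(theta @ x)               # 3.0 * 1 + 2.0 * 5 = 13.0 = intercept + slope * x
```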

57
Q

What happens when the relationship is nonlinear?

A

Linear Regression only tries to find the “line of best fit” i.e. the best possible line that explains the pattern. We are assuming that there is a linear relationship between the independent and dependent variables. If a non-linear pattern is better suited, then more complex models can be built.

58
Q

How often do we actually have linear relationships?

A

It depends on the problem at hand. Linear relationships are more interpretable than any other model, but as the complexity of the problem increases, a linear relationship between the independent and dependent variables is observed less often.

59
Q

From the estimation - would that mean if there is no advertising on TV, radio, or news, there is already a sale of 2.94?

A

Yes, the intercept of the model implies that there are already sales of 2.94 even if there is no advertising on TV, radio, and/or newspaper.

60
Q

Do we need to perform a hypothesis test for theta_0?

A

Yes, it is advised we find confidence intervals and perform a hypothesis test on each coefficient to check whether that variable is significant for our model or not.

61
Q

If the newspaper does not help in estimating sales, can we remove it?

A

Yes, the evaluation metric Adjusted R-squared gives good intuition about the usefulness of each feature contributing to the final prediction.

62
Q

There may be 10 independent variables for a target variable. How do we choose the right independent variables for the target variable?

A

We need to check whether all ten of these independent variables contribute significantly to the target variable. If we want to reduce the dimension of the feature space, we can use PCA for dimensionality reduction.

63
Q

Will more samples result in the slopes of the lines converging?

A

Yes, the more the sample data, the closer we are to a perfect prediction graph. So the slopes for our line start to converge towards their optimal value.

64
Q

What is W_i in the structural model?

A

W_i stands for noise which is normally distributed and independent.

65
Q

Do we only need to assume W_i is Gaussian or we should assume all variables are jointly Gaussian?

A

We only assume W_i as normal. There is no such restriction for the independent and dependent variables.

66
Q

What is Wi and what if it’s not normal?

A

Wi is the error of regression. It should be normally distributed. This is the basic assumption which should not be violated. Maximum likelihood estimation does not work if this assumption doesn’t hold true, in which case we should not use linear regression.

67
Q

Can we use prior information? For example, maybe we know a coefficient must be positive in our model

A

Yes, prior information can be used. We can do this by using a Bayesian model with prior probabilities that encode the constraint.

68
Q

Is there another distribution apart from the normal one that can be used for the maximum likelihood method?

A

In principle, you can do that, but you won’t get OLS as a result. 200+ years ago, Laplace (who more or less started this line of study) used the Laplace distribution, which later turned out not to work as well as the normal distribution.

69
Q

In some experimental data analysis, I got a negative R^2. Is it possible at all?

A

Yes. The baseline model for linear regression is the model that predicts the average value of the target variable for any value of X. A negative R2 implies that the model is doing worse than this baseline.
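
A small sketch (made-up numbers) with scikit-learn’s r2_score: predictions worse than the mean baseline produce a negative R²:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [4.0, 3.0, 2.0, 1.0]    # worse than always predicting the mean (2.5)

print(r2_score(y_true, y_pred))  # -3.0
```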

70
Q

What is a good correlation to say that there is a strong correlation?

A

It depends on the problem at hand but generally, a value greater than 0.7 is considered a strong correlation.

71
Q

What is the range for the value of R squared?

A

R-Squared indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. For a model evaluated on its own training data (with an intercept), the value of R-squared lies between 0 and 1; on unseen data it can even be negative, as noted above.

72
Q

Does more fitting have a bigger R square?

A

In overfitting, we would likely get a high R-Squared on the training data but a relatively small R-Squared on the test/validation data.

73
Q

Should we standardize the data before calculating regression coefficients?

A

It is useful to standardize data to make the computation easier and more stable, but the coefficients then become harder to interpret in the original units.

74
Q

Is the variation due to the randomness in the sample, or in the population itself?

A

Variation exists both in-sample and in population. We try to capture the true statistics of the population using the sample statistics.

75
Q

If we do not reject the null hypothesis, do we eliminate the corresponding variable and rerun the regression?

A

Yes, we need to re-run regression after eliminating insignificant variables.

76
Q

Wouldn’t it be more logical to make the null hypothesis theta > zero?

A

In that case, it would be difficult to conclude with rejection as there are many possible values we can test for.

77
Q

Is there a fundamental difference between a z-test and a Wald test?

A

No, they are essentially the same tests.

78
Q

Is there a Bayesian equivalent to this method?

A

Yes, there is a Bayesian variation for almost everything.

79
Q

Why don’t we use a t-test for coefficients?

A

We use a t-test, instead of a z-test, to check the significance of independent variables because the population variance is unknown.

80
Q

Do we perform dimensionality reduction before performing supervised learning?

A

Yes, it is generally a good idea when you have a lot of variables and the interpretation of variables is not required.

81
Q

For use in science, say one is trying to determine a model with known inputs and outputs but unknown interactions. If we fit a model to data that has been collected over years of experimentation from multiple events/experiments, what is the best test of how well the model fits?

A

The most important part of research is to keep iterating experiments with small changes to seek the best performance we can achieve. As the professor showed, different equations fit the data differently and yield different accuracies. So no single method can be called ‘best’; it is about identifying which test is the most relevant in a given situation.

82
Q

What is a confounding phenomenon?

A

A confounding phenomenon is when an unknown variable influences the dependent variable and the independent variable. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations.
Note that if a variable/parameter has a strong positive/negative correlation with any of the other input variables and also with the output, it does not necessarily indicate confounding, it can also be a coincidence that is only observed in the sample data and might not be the case in the population data.

83
Q

To distinguish if it’s causality or just a correlation you need to compare all the values to each other?

A

Correlation just represents the association between two variables, while causality expresses how one variable depends on the other. No statistical test can establish a causal relation. The only way is to run controlled experiments; since controlled experiments are not always possible, economists also use natural experiments.
Note that correlation does not imply causality. It is possible for two variables to be associated with each other without one of them causing the observed behavior in the other.

84
Q

If a model works better for some ranges of Xs, would it be valid to have multiple linear regressions for specific ranges of X?

A

Yes, but the linearity assumption is then broken. One way is to use spline (piecewise) linear regression, or to try other non-linear algorithms.

85
Q

Why is it called “supervised” learning?

A

A supervised machine learning algorithm is trained on input data that has been labeled for a particular output. The name “supervised” learning originates from the idea that training this type of algorithm is like having a teacher supervise the whole process.

86
Q

What is the difference between classification and logistic regression?

A

Logistic regression is one of the algorithms used to perform classification. There are other algorithms that can be used to solve a classification problem, such as k Nearest Neighbors (kNN), Decision Trees and Neural Networks.

87
Q

What if the response variable is a level: gold, silver, brown? Is it a classification problem?

A

A classification problem is when the target variable is categorical. Here, Gold, silver, and brown are three categories, so it is a classification problem.
But if the categories are ordinal, then we also use regression to solve the problem. Consider an example where the ‘ratings’ is the target variable and it has values from 1 to 10, with 1 representing very poor and 10 being excellent. These are discrete values but are ordinal, i.e., have some order involved. It can be solved using regression as well.

88
Q

What are some examples of contexts in which ML and traditional statistical models differ?

A

Machine Learning models and Statistical models are just two flavors of the same thing. ML models are generally algorithmic, and they help in prediction. We can fearlessly apply machine learning algorithms on a dataset to get predictions, while statistical models tend to be more mathematical and help to generate statistical insights from the data, rather than only trying to make predictions on it.

89
Q

What does it mean when we say that when it comes to Linear Regression, the two paths (the Machine Learning path and the Statistics/Probability path) are the same?

A

In Linear Regression, while using the statistical path we use ordinary least squares to get the final result, while in linear regression using machine learning, we try to optimize the error term using a gradient descent optimization algorithm to get the final result. But it doesn’t matter how you approach the problem - you will get similar results.
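
A hedged numeric check (made-up data) that the two paths agree: the normal-equation (OLS) solution and plain gradient descent on the mean squared error converge to the same coefficients:

```python
import numpy as np

# Toy data from y = 2x + 1 plus noise (hypothetical numbers).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 2 * x + 1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column

# Statistics path: ordinary least squares via the normal equations.
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# ML path: plain gradient descent on the mean squared error.
theta_gd = np.zeros(2)
learning_rate = 0.5
for _ in range(5000):
    grad = 2 * X.T @ (X @ theta_gd - y) / len(y)   # gradient of the MSE
    theta_gd -= learning_rate * grad

print(theta_ols, theta_gd)   # both come out roughly [1.0, 2.0]
```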

90
Q

Is estimator the same as likelihood?

A

No. An estimator is a statistic used to estimate an unknown parameter and likelihood is the probability of getting a result for a given value of the parameters.

91
Q

What is meant by noisy data?

A

Noisy data is meaningless data that cannot be understood and interpreted correctly. For example, negative or zero values for height variables can be considered as noisy data.

92
Q

What is the purpose of the estimator?

A

The purpose of the estimator is to find estimates of the coefficients (beta_0, beta_1, ...) that in some sense define the “best” fit line for the data.

93
Q

Can a distribution have multiple parameters (thetas)?

A

Yes, we allow theta to be a vector with multiple components that together characterize the distribution, e.g., the mean and standard deviation of a Normal distribution.

94
Q

Does the estimator include outliers?

A

The estimator itself is just a rule and is defined independently of any particular dataset, but when it is applied to data that contain outliers, those outliers enter the computation. The main aim of estimators is to estimate parameters based on the data.

95
Q

Is there a minimum number of data records required to build an estimator?

A

No, not necessarily, but a good amount of data helps to improve the quality of the estimates. You can estimate even if you have just one measurement, but it might be a poor estimate.

96
Q

Let’s say we are interested more in the outliers and understanding the outliers more. Would that be less Machine Learning and more of a Statistical Testing/Modelling based approach?

A

Outliers are extreme values that deviate from the other observations in the data. Identifying outliers is important for the following reasons: 1. An outlier may indicate bad data; for example, the data may have been coded incorrectly or an experiment may not have been run correctly. 2. Outliers may occur due to random variation or may indicate something scientifically interesting. Identifying outliers is part of exploratory data analysis (EDA) rather than of applying machine learning algorithms or statistical testing.

97
Q

At what stage do you consider computational capacity before the model building or after?

A

The computational capacity mainly depends on the size of the data and the complexity of the algorithm. Generally, computations are not a big concern for simple algorithms like linear regression, but if we want to reduce the computational time, we can build the model on a sample of the original data instead of full data.

98
Q

What is “g” and how is it different from g*?

A

The notation ‘g’ denotes the function/estimator that takes the Xs as input and predicts the Ys. ‘g’ is the general estimator, and g* is the optimal estimator.

99
Q

What is overfitting?

A

Overfitting occurs when a statistical model fits exactly against its training data including the noise in the data. When this happens, the algorithm cannot perform accurately against unseen data, defeating its purpose. The generalization of a model to new data is ultimately what allows us to use machine learning algorithms to make predictions.

100
Q

Is eliminating outliers necessary in Linear Regression?

A

The slope of the regression line will change due to outliers in most cases. So linear regression is sensitive to outliers, and it is good to eliminate outliers before building the linear regression model.

101
Q

What is the degree of freedom in the context of regression?

A

In machine learning, the degrees of freedom refer to the number of parameters in the model. In linear regression, it refers to the number of coefficients. If there are more degrees of freedom (model parameters) in machine learning, then the model is expected to overfit the training dataset.

102
Q

Why are we trying to minimize theta - the vector of coefficients?

A

We are finding the theta that minimizes the sum of squares of residuals because we need the predicted value to be as close as possible to the true value.

103
Q

Wouldn’t the number of coefficients be the same as the number of independent variables?

A

Nearly, because we also have one additional coefficient called the intercept of regression. The intercept (often labeled as constant or theta_0) is the mean of the dependent variable when you set all of the independent variables in your model to zero. Having an intercept gives our model the freedom to capture ALL the linear patterns while a model with no intercept can capture only those patterns that pass through the origin.

104
Q

What is the difference between Simple Linear Regression and Multiple Linear Regression?

A

Simple linear regression has one independent variable and one dependent variable. Multiple regression is performed when we have more than one independent variable, but only one dependent variable.

105
Q

How do we decide which variables are important in Regression?

A

You can’t compare the regular regression coefficients because they use different scales. Fit the regression model using the standardized independent variables and compare the standardized coefficients. Because they all use the same scale, you can compare them directly. Standardized coefficients signify the mean change of the dependent variable given a one standard deviation shift in an independent variable.
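
A minimal sketch (made-up data) with scikit-learn: after standardizing, two features that matter equally get comparable coefficients even though their raw scales differ by a factor of 100:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: x1 lives on a much larger scale than x2.
rng = np.random.default_rng(1)
x1 = rng.normal(0, 100, size=200)
x2 = rng.normal(0, 1, size=200)
y = 0.01 * x1 + 1.0 * x2 + rng.normal(scale=0.1, size=200)

X_std = StandardScaler().fit_transform(np.column_stack([x1, x2]))
model = LinearRegression().fit(X_std, y)
print(model.coef_)  # roughly [1.0, 1.0]: equal importance once scales match
```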

106
Q

What if we have data points that are distributed randomly around a sin() wave, the line will be a wrong solution, right? How do we create more complicated prediction functions?

A

Linear regression does not work well in these cases as it simply provides the line of best fit, and cannot capture complex non-linear relationships in the data. So we would need to use more sophisticated algorithms capable of detecting non-linear patterns in the data, such as neural networks.

108
Q

Is there any test that determines the p-value for each coefficient (theta)?

A

The ‘t-tests’ are used to conduct hypothesis tests on the regression coefficients (thetas) obtained in linear regression. The p-value for each variable tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. We can use these coefficient p-values to determine which variables are significant in the regression model.
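
A brief sketch (made-up data) with statsmodels’ OLS: an informative feature gets a near-zero p-value while a pure-noise feature does not:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: x1 drives y, x2 is pure noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 3 * x1 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
result = sm.OLS(y, X).fit()
print(result.pvalues)  # p-value for x1 is ~0; for x2 it is large (insignificant)
```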

109
Q

What are Ȳ (Y bar) and Ŷ (Y hat)?

A

Y bar is the mean of all the Yi (the bar symbol indicates the mean), and Y hat is the value predicted by the model.

110
Q

What is the total sum of squares in regression?

A

The total sum of squares, denoted TSS, is the sum of the squared differences between the observed dependent variable and its mean. You can think of it as the dispersion of the observed values around the mean (Y bar).

111
Q

Are RSS and Wi the same in regression?

A

No. RSS, the residual sum of squares, is the sum of the squared differences between the observed values and the values predicted by the model; think of it as a measure of how well our line fits the data. Wi is the error term for a single data point, i.e., the difference between that data point and the regression line, so RSS aggregates the squared versions of these individual errors.

112
Q

What is R squared, and what does it mean when the value is 1?

A

R2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An R2 of 1 indicates that the regression predictions perfectly fit the data.
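
A compact sketch (made-up numbers) computing R² directly from the TSS and RSS defined in the neighboring cards:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # observed values
y_hat = np.array([1.1, 1.9, 3.2, 3.9, 5.1])    # model predictions

tss = np.sum((y - y.mean()) ** 2)   # dispersion around Y bar
rss = np.sum((y - y_hat) ** 2)      # unexplained dispersion around Y hat
print(1 - rss / tss)                # R squared, ~0.992 for this good fit
```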

113
Q

What do the diagonal elements represent in the covariance matrix?

A

The covariance matrix is a symmetric matrix that shows the covariance of each pair of variables. A diagonal element shows the covariance of a variable with itself, which is nothing but the variance of that variable. So each diagonal element represents the variance of the respective variable.

114
Q

Is it possible or common to see the p-value high but the CI band not including the value zero? And vice versa?

A

No, the two things are equivalent. If the confidence interval includes zero, that means there is no statistically meaningful or statistically significant proof that the variable helps to predict the target variable. It is the same as saying the p-value is high (>0.05). Similarly, when CI doesn’t capture zero the p-value will be low.

115
Q

How do we decide when to use linear regression or other methods?

A

Linear regression is a basic model and it can be the first model to try for a regression problem. The linear regression model is widely used in many situations before attempting non-linear and more complicated models. It is the most accomplished theoretical model and helps in interpretability; some key concepts can also be explained well using the linear regression model.
In practice, we use multiple models of different kinds and the algorithm that gives the best results depends on the data and the problem on hand.

116
Q

In Lasso regularization, if one theta is 0, do we need to remove that variable and redo theta calculation for the remaining variables?

A

Lasso regularization can be used for feature selection. If any variable coefficient (theta) is zero, we can remove that variable. We can build the model with the remaining variables which have non-zero coefficient values.
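
A short sketch (made-up data) with scikit-learn’s Lasso: the coefficients of irrelevant features shrink to exactly zero, which is how Lasso performs feature selection:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: only the first 2 of 5 features actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # coefficients of the 3 irrelevant features shrink to ~0
```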