Regression Flashcards
Is it possible to create models in Machine learning?
Machine learning is mainly about creating mathematical relations between features of a dataset. These mathematical relations are called models.
How was linear regression done prior to machine learning?
Long ago, it was done by hand by statisticians, and only on small datasets. With the availability of computers and their high computational power, we can now fit linear regression models on datasets of practically any size.
What was fearless about ML?
It refers to the observation that ML practitioners usually don’t spend much time evaluating the statistical validity of a method; they prefer to build a model and judge it by its performance rather than by the statistical soundness of the process.
If we don’t understand how a model’s intricacies work, how can we be 100% sure that it’s a successful model?
A mathematical interpretation of how a model works is not always possible. When an algorithm is designed, its mathematical steps are derived; when a model is applied, it is neither always possible nor always necessary to trace those steps. It is the model’s performance on unseen data that decides its success.
Do more data points bring new insights?
Yes. With more data, insights are supported by more evidence and are therefore more trustworthy. The same applies to model training: with more data, models generally train better.
Why is the model interpretation important?
Model interpretation is important because it explains how the model reached its final results or predictions. It also gives an indication of whether changes need to be made to improve the model’s performance.
What are the features of a machine learning model?
Features are the attributes or variables that are used in a machine learning model. Generally, there are independent and dependent features. Independent features are the input features of the model while the dependent feature is the output variable.
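As a minimal sketch, separating independent features from the dependent feature in pandas might look like this (the dataset and column names are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset: 'area' and 'bedrooms' are independent features,
# 'price' is the dependent feature (the target we want to predict).
df = pd.DataFrame({
    "area":     [1200, 1500, 900, 2000],
    "bedrooms": [2, 3, 1, 4],
    "price":    [200000, 260000, 150000, 340000],
})

X = df.drop(columns="price")  # independent features (model inputs)
y = df["price"]               # dependent feature (model output)
```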
Is risk similar to variance?
For a machine learning model, the higher the error, the greater the chance of failure on unseen data, so error is a measure of the model’s risk. Risk can therefore be interpreted as the error associated with the model, and minimizing risk is equivalent to minimizing error or variance.
What are overfitting and underfitting?
Overfitting occurs when a machine learning model follows the noise in the training data too closely; it becomes too complex and makes poor predictions on unseen data. Underfitting is the opposite case, where the model is too simple to capture the patterns in the data and is therefore also unable to generalize to unseen data.
What does interpolation mean?
Interpolation is a statistical method by which related known values are used to estimate an unknown value. It is useful in treating missing values in machine learning.
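As a quick sketch, pandas provides linear interpolation for filling missing values (the series below is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])

# Linear interpolation estimates each missing value from its known neighbours.
filled = s.interpolate(method="linear")
print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```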
Why do we square the residual?
It is for mathematical reasons: if we don’t square the residuals, the positive and negative values cancel each other out, and the total error can come out as zero or close to zero even when the fit is poor.
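A small numeric sketch makes the cancellation problem concrete (the residual values below are made up):

```python
import numpy as np

residuals = np.array([2.0, -2.0, 3.0, -3.0])

print(residuals.sum())         # 0.0  -> raw errors cancel, hiding the misfit
print((residuals ** 2).sum())  # 26.0 -> squared errors expose it
```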
What is the limit of the error made by a regression model?
There is no predefined limit on the errors made by a linear regression model. The only condition is that the sum of squared error terms should be minimized.
Can we use either normalization or standardization?
We use either standardization or normalization because it brings all the variables to a uniform scale. If we don’t, there is a chance that a less important variable will be given more weight simply because of its larger scale.
Is it acceptable if the slope of a variable is very small in the model?
The slope of a variable tells us how much the dependent variable changes for a unit increase in that independent variable, holding the other variables constant. The larger the slope, the larger the change, and vice versa. A very small slope indicates that the feature has little influence on the target variable and can be removed.
Is it always good to add more variables that could potentially influence the outcome?
Yes, it is good to add more variables that influence the outcome. It makes the predictions more generic and reliable on unseen data.
Is there a way to validate the independence of error terms?
Yes, hypothesis testing can be used for this. For example, the Durbin-Watson test checks the residuals for autocorrelation, which would violate the independence assumption.
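A minimal sketch of the Durbin-Watson check with statsmodels (values near 2 suggest no first-order autocorrelation; the data here is synthetic):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# Fit an OLS model and test its residuals for autocorrelation.
model = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(model.resid))  # ~2 indicates independent (uncorrelated) errors
```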
What do we mean by “linear predictor”?
A linear predictor means a linear relationship between the output and the input variables. The term “linear” refers to the coefficients/parameters of the model: any equation that is not linear in its parameters is a non-linear relation. For example, y = b0 + b1·x + b2·x² is still a linear model because it is linear in the coefficients b0, b1, and b2.
Don’t we divide the TSS by ‘n’ so we have something somehow ‘independent’ from the number of samples we take?
Dividing the total sum of squares by n would not change anything, because the residual sum of squares would be divided by n as well. Since R² is computed from the ratio 1 − RSS/TSS, the factor of 1/n cancels out.
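A minimal sketch of the cancellation, with made-up numbers:

```python
import numpy as np

y      = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.8])

rss = ((y - y_pred) ** 2).sum()    # residual sum of squares
tss = ((y - y.mean()) ** 2).sum()  # total sum of squares

# R² is identical whether or not both sums are divided by n.
print(1 - rss / tss)
print(1 - (rss / len(y)) / (tss / len(y)))
```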
Can we interpret R² as a measure to understand how linear our dataset is?
No. R² measures how much of the variance in the data the model can capture; the higher the value, the better the model’s performance.
How to choose between Normalization and Standardization?
Normalization typically means rescaling values to lie between 0 and 1, while standardization typically means rescaling values to express how many standard deviations each one is from the mean.
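A minimal sklearn sketch of both transforms on a made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # normalization: values mapped into [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # standardization: std. deviations from the mean
```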
What is an acceptable R-squared in the real world?
The higher the R² value, the better the model fits the given data, because a higher R² implies a lower residual sum of squares. There is no universal threshold: what counts as acceptable depends on the domain and on how noisy the data is.
Do we call it R squared because the square is mathematically useful somewhere else?
R is the correlation between the predicted values and the observed values of Y. R² is the square of this coefficient, and it indicates the percentage of the total variation in Y that is explained by the regression line.
Can we discriminate the features before the training? like finding the correlation between features against labels?
Yes, we can compute the correlations between the variables, and if any two features are highly correlated we drop one of them, since it is redundant and adds no new information to the model. Beyond that, exploratory data analysis lets us remove variables based on many other criteria.
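For example, with pandas one can inspect feature-to-feature and feature-to-label correlations (the frame below is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = df["x1"] * 0.95 + rng.normal(scale=0.1, size=100)  # nearly redundant with x1
df["label"] = 3 * df["x1"] + rng.normal(size=100)

print(df.corr())  # a high x1-x2 correlation suggests dropping one of them
```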
Do we need to test for multicollinearity in case of more variables?
Yes, we need to check for multicollinearity before developing a model, and we drop the redundant variables. A common rule of thumb is to drop a variable whose variance inflation factor (VIF) is greater than 5.
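A minimal sketch of the VIF check with statsmodels, on synthetic data:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x3": rng.normal(size=100)})
X["x2"] = X["x1"] * 0.9 + rng.normal(scale=0.1, size=100)  # collinear with x1

# A VIF above 5 for a column is a common signal to drop it.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```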
Is Linear Regression a good base model to compare other methods?
Yes. It is always good to develop a baseline model, such as linear regression for a continuous target, and then move to more complicated algorithms such as boosting or neural networks.
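A minimal sketch of such a comparison in sklearn, on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Score the complex model against the linear baseline with cross-validated R².
for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean())
```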
How to estimate the parameters in the model?
To estimate the parameters, we minimize the sum of squared errors, treating the parameters as the variables of the optimization. Applying an optimization algorithm (or, for ordinary least squares, a closed-form solution) gives the estimates.
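A minimal NumPy sketch on synthetic data, using the standard least-squares solver to minimize the sum of squared errors:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 4.0 + 3.0 * x[:, 0] + rng.normal(scale=0.5, size=100)

X = np.column_stack([np.ones(len(x)), x])     # add an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared errors
print(beta)                                   # ~[4.0, 3.0]
```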
Do we iterate to reduce the distance from the true value?
Yes. If a model does not perform well on the first attempt, it is not close enough to the true relationship. In that case we iterate by changing the model’s variables, the data, or the model’s hyperparameters.
How reliable is the prediction if the user provides a clean dataset to run the training?
Having clean data is good for a machine learning model, but it is not the only factor that determines reliability. Along with clean data, the choice of features and the model’s performance on unseen data are what make a model reliable.
How do we know if there is bias in our estimation and how to measure it?
Bias can be interpreted as the training error of the model. If the training error is very high, the model is biased: it is too simple to perform reliably on unseen data.
Is there a book or website that you recommend on today’s concepts?
You can refer to ISLR (An Introduction to Statistical Learning) by Gareth James and co-authors.
Are confidence interval and confidence band the same?
They are two different representations of the same thing. A confidence band is the lines on a probability plot or fitted line plot that depict the upper and lower confidence bounds for all points on a fitted line within the range of data. On a fitted line plot, the confidence interval for the mean response of a specified predictor value is the points on the confidence bands directly above and below the predictor value.
Can you describe a little bit about the difference between MLE and OLS? Especially under what kind of context we might favor one over the other?
Ordinary least squares (OLS), also called linear least squares, is a method for estimating the unknown parameters of a linear regression model. Maximum likelihood estimation (MLE) is a more general method for estimating the parameters of any statistical model. For a linear model with normally distributed errors the two coincide: maximizing the likelihood is equivalent to minimizing the sum of squared errors. MLE is favored when you are willing to assume a specific (especially non-Gaussian) error distribution, while OLS requires no distributional assumption.
Does R-square error increase with the addition of new features?
Yes. R² never decreases when new features are added, even if the features are uninformative. For this reason, adjusted R², which penalizes the number of features, is often reported instead; a sketch is below.
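A minimal sketch of the adjusted R² computation (the function name and sample numbers are illustrative):

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R² penalizes features that do not improve the fit."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

print(adjusted_r2(0.80, n_samples=100, n_features=5))   # ~0.789
print(adjusted_r2(0.80, n_samples=100, n_features=30))  # ~0.713, noticeably lower
```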
Can we use logarithmic regression in machine learning?
Yes. If the data fits a logarithmic equation better, the model can be fit on that. Machine learning is concerned with whichever relationship fits the given data best.
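A sketch of fitting a logarithmic relationship by transforming the input and reusing linear regression (the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=200)
y = 2.0 + 5.0 * np.log(x) + rng.normal(scale=0.2, size=200)

# The model is linear in its parameters even though y depends on log(x).
model = LinearRegression().fit(np.log(x).reshape(-1, 1), y)
print(model.intercept_, model.coef_)  # ~2.0 and ~5.0
```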
Is standardization the same as scaling?
Standardizing data expresses each point as the number of standard deviations it lies from the mean. Standardization is just one technique for performing feature scaling, so scaling is the broader term.
How do we know that linear regression is the wrong choice for a problem?
If the trained linear regression model does not perform well on unseen data even after all possible modifications, then it is not a good choice for that specific problem, and one should choose another algorithm suitable for regression.
PCA gives us coefficients to multiply with the variables (X) to find the Y, is it some sort of regression?
Even though PCA does that, it is an unsupervised method and does not perform any sort of regression; its computations are different from those of linear regression.
Can you use PCA to reduce dimension for your dataset for regression?
The main purpose of PCA is to reduce the number of features in a dataset, and it is applied mainly when that number is very high. PCA itself is not a regression method, but its output components can be used as the inputs to a regression model (this is known as principal component regression).
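A minimal sketch of that pipeline in sklearn, on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

# PCA reduces 20 features to 5 components; linear regression then fits on those.
model = make_pipeline(PCA(n_components=5), LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # R² of the principal component regression
```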
Is ML always a black box?
It’s not always a black box. It’s generally a scale of interpretability, some ML algorithms like Linear Regression and Decision Trees are highly interpretable, others like Neural Networks are not very interpretable.
How is ML different from a model?
Machine learning algorithms are procedures that are implemented in code and run on data. Machine learning models are the output of those algorithms and consist of model data plus a prediction procedure.
What is the difference between linear and non-linear regression?
Nonlinear regression is a form of regression analysis in which the data is modeled as a nonlinear combination of the model parameters. In linear regression, by contrast, the model is a linear combination of the parameters.
What does the common notion that “Theory doesn’t always explain the success” mean?
There can be instances where a model performs very well on some data even though the theory behind the algorithm does not predict that it should. For example, deep neural networks work very well on large datasets, but it is not obvious from mathematical theory why they work.