05 - ML Flashcards
(134 cards)
What is the difference between noise and variance?
While collecting data, the data acquisition medium (either human or machine) may make errors. Such errors are called noise. Variance measures the variation of data points from the mean of the data.
What is heteroscedasticity? What is generally the shape of heteroscedastic data?
Heteroscedasticity is the phenomenon of the data points having different variance along the regression line. Heteroscedastic data generally has an irregular, cone-like (fan) shape when the residuals are plotted against the fitted values.
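As a small illustration (synthetic data, made up for this card), a minimal sketch of plotting residuals against x to reveal the cone shape:

```python
# Minimal sketch: simulate heteroscedastic data and look at the residual spread.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 2 * x + rng.normal(0, 0.5 * x)      # error variance grows with x -> heteroscedastic

slope, intercept = np.polyfit(x, y, 1)  # simple regression fit
residuals = y - (slope * x + intercept)

plt.scatter(x, residuals, s=10)         # residual spread widens as x grows (cone/fan shape)
plt.axhline(0, color="red")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```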
How does the weight help with making a prediction?
Once the weights are learned, the relationship between the dependent and independent variables is fixed; this expression is called the model. To predict for unseen data, the values of the independent variables are passed into the model to obtain the dependent variable.
Do the coefficients stand for the weights in the model?
Yes, the coefficients established after the training process are the weights for each variable in the model. They play a vital role in making the prediction.
What is the difference between endogeneity and multicollinearity?
Endogeneity is the phenomenon of correlation existing between an independent variable and the error term of the model, while multicollinearity is correlation among the independent variables themselves. They are two different concepts.
What is the difference between correlation and causation?
Correlation between two features only ensures that a relation (strong or weak) exists; it does not tell whether one feature gives rise to the other. For example, age and education can be correlated, but neither originates the other. Causation means that one feature causes the other; for example, poverty causing starvation is a causal effect.
Is it necessary to run PCA to resolve multicollinearity?
Not necessarily. PCA is used to reduce the number of features when their count is very high; this process is called dimensionality reduction. To get rid of multicollinearity, one of the two correlated features can simply be dropped for each pair of highly correlated variables.
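A minimal sketch (synthetic column names) of the simpler remedy mentioned above, dropping one variable from a highly correlated pair instead of running PCA:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=100)  # nearly duplicates x1
df["x3"] = rng.normal(size=100)

print(df.corr().abs())                # x1 and x2 show a correlation close to 1
df_reduced = df.drop(columns=["x2"])  # keep only one variable from the correlated pair
```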
What is the maximum likelihood function?
The likelihood function is the joint probability of all the observed events in a sample, viewed as a function of the model parameters. When this function is maximized to obtain the parameters of the model, the procedure is called maximum likelihood estimation. The likelihood is a function of the parameters, not of the variables of the data.
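An illustrative sketch (synthetic normal sample, assumed distribution) of maximum likelihood estimation by numerically minimizing the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
sample = rng.normal(loc=5.0, scale=2.0, size=500)

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:                 # keep the scale parameter valid
        return np.inf
    return -np.sum(norm.logpdf(sample, loc=mu, scale=sigma))

# The likelihood is a function of the parameters (mu, sigma), not of the data.
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
print(result.x)                    # close to the true values (5, 2)
```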
Can autocorrelation cause endogeneity in the data?
Yes, it can. Autocorrelation is correlation between a variable and a lagged version of itself. In such a case the error term may end up correlated with the autocorrelated variable (for example, when a lagged dependent variable is used as a regressor), which is endogeneity.
How do you detect endogeneity, and how can it be mitigated?
To detect endogeneity, one can collect the residuals (error terms) and check whether they are correlated with the independent variables; if a variable is correlated with the error terms, it is endogenous. This can also be tested by visualization: if a variable appears related to the error terms, it can be declared endogenous. One possible way of mitigating endogeneity is to encode categorical variables; encoding creates additional variables, which may remove the correlation between the variable and the error term.
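A rough sketch of the residual check described above, on synthetic data with a deliberately omitted variable (all names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
omitted = 0.8 * x + rng.normal(size=300)        # variable left out of the model
y = 1.5 * x + 2.0 * omitted + rng.normal(size=300)

slope, intercept = np.polyfit(x, y, 1)          # mis-specified model: y ~ x only
residuals = y - (slope * x + intercept)

# OLS residuals are orthogonal to included regressors by construction, so the
# check is done against a candidate/omitted variable: a clearly non-zero
# correlation here hints at endogeneity caused by the omitted variable.
print(np.corrcoef(omitted, residuals)[0, 1])
```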
Is there an overlap between the two terms - heteroscedasticity and endogeneity?
No, they are two different phenomena. Heteroscedasticity is the data points having different variance along the best-fit regression line, while endogeneity is correlation existing between an independent variable and the error terms.
In regression, is it necessary to have a categorical variable?
No. A regression problem is not required to have categorical variables; it can have only continuous variables, only categorical variables, or a mix of the two.
Should any boolean or binomial data always be converted to a 1/0?
When processing data in Python, it usually needs to be converted to a numeric data type; doing so lets us perform mathematical operations on it. For this reason, it is good practice to convert a binary independent variable to 0 and 1 using any of the encoding methods.
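A minimal sketch of turning a binary column into 0/1 with pandas (the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["yes", "no", "yes", "no"]})

df["smoker_flag"] = (df["smoker"] == "yes").astype(int)     # 1 for "yes", 0 for "no"
df["smoker_flag2"] = df["smoker"].map({"yes": 1, "no": 0})  # equivalent explicit mapping
print(df)
```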
Would log(x2) and x2 have a high enough correlation to cause multicollinearity issues?
Log(x2) is a transformed, monotonically increasing version of x2. Creating such a column may cause multicollinearity, because x2 and its log transform will generally be strongly correlated.
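A quick check on synthetic positive data that a variable and its log transform tend to be strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(4)
x2 = rng.uniform(1, 100, size=1000)
print(np.corrcoef(x2, np.log(x2))[0, 1])   # typically around 0.9 for this range
```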
Why do we need a non-linear function added to the model?
It is not always possible to get a linear relationship between the dependent and independent variables. In real life, most of the time a non-linear relation captures the actual relation between the variables better. Due to this, we need non-linear functions added to the model.
Would the product of two variables be an interaction between the features?
Yes, it is an interaction between the two variables. When two variables are multiplied, a new feature is generated as the effect of that interaction.
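A tiny sketch of creating such an interaction feature (illustrative feature names):

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 12, 8], "quantity": [100, 80, 150]})
df["price_x_quantity"] = df["price"] * df["quantity"]   # interaction term
print(df)
```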
Can you take the log of Y and still have linear regression?
Yes, it is possible. Linearity in linear regression refers to the parameters: each coefficient appears with a power of exactly 1 and is not wrapped inside a non-linear function. Linearity is judged with respect to the parameters, not the variables, so log(y) and the independent variables can still have a linear regression relationship.
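A minimal sketch (synthetic data where y grows roughly exponentially in x) of regressing log(y) with an ordinary linear fit:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.1, 5, 200)
y = np.exp(0.7 * x + rng.normal(scale=0.1, size=x.size))

# Still linear in the coefficients, even though the target is log-transformed.
slope, intercept = np.polyfit(x, np.log(y), 1)
print(slope, intercept)            # slope recovered close to 0.7
```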
Are rescaling, normalizing, and standardizing different?
Normalizing and standardizing are two methods of rescaling/scaling. In standardizing we measure how many standard deviations each data point lies away from the mean of the data, while in normalizing we bring the data back into a fixed range of numbers (for example, 0 to 1).
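A side-by-side sketch of the two methods using scikit-learn (toy values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = np.array([[1.0], [5.0], [10.0], [20.0]])

standardized = StandardScaler().fit_transform(data)   # z-scores: (x - mean) / std
normalized = MinMaxScaler().fit_transform(data)       # squeezed into the [0, 1] range

print(standardized.ravel())
print(normalized.ravel())
```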
Can you provide an example of the weighted least square algorithm? How do we pick the weight?
The weighted least squares algorithm is a method of finding the parameters of a model. It can be applied with any algorithm that uses it as the cost function, for example linear regression. The weights should, ideally, be equal to the reciprocal of the variance of each measurement; a different weight is associated with each record.
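A small example (synthetic data, assumed known error variances) of weighted least squares with statsmodels, where each record is weighted by the reciprocal of its error variance:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = np.linspace(1, 10, 100)
error_sd = 0.5 * x                               # noise grows with x
y = 3.0 + 2.0 * x + rng.normal(scale=error_sd)

X = sm.add_constant(x)
weights = 1.0 / error_sd**2                      # weight = 1 / variance of each record
wls_fit = sm.WLS(y, X, weights=weights).fit()
print(wls_fit.params)                            # close to the true (3, 2)
```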
If adding more variables reduces endogeneity, then how to reduce heteroscedasticity?
One of the prominent methods is weighted least squares analysis. Giving each observation a different weight (typically the reciprocal of its error variance) counteracts heteroscedasticity.
Can you give some examples where scaling/ normalizing is necessary and where the regression will work just fine without transforming the data first?
Scaling and normalizing are used when different features of the dataset are on different scales; for example, if a dataset contains weight in kilograms and height in metres, the features have different scales and it is advisable to scale the data. Linear regression usually works fine without transforming the data when there is no large scale difference between the features.
Will the addition of more variables cause overfitting?
Adding new variables makes the model more complex, so during training it may start to capture the noise in the data. This can cause overfitting of the model.
If we take the log of Y, can we still account for outliers in our prediction model?
It is possible that even in the transformed data (after a log transformation) some points lie far away from the main herd of data points, and they can be outliers. Transformation is not a guaranteed solution to outliers.
Can a regression model include both continuous and categorical variables at the same time?
Yes, a regression model can include both categorical and continuous variables at the same time. The only requirement is that the categories in the categorical data be converted into numbers (encoded) so that the model can be fit properly.
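A sketch of fitting a regression with one continuous and one categorical feature by one-hot encoding the categories first (column names and values are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "area": [50, 80, 120, 65, 95],
    "city": ["A", "B", "A", "C", "B"],
    "rent": [500, 900, 1100, 600, 1000],
})

X = pd.get_dummies(df[["area", "city"]], columns=["city"], drop_first=True)
model = LinearRegression().fit(X, df["rent"])
print(model.coef_, model.intercept_)
```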