DS 1 Flashcards
What is the basic assumption of Naive Bayes classifier?
Features are conditionally independent of each other, given the class label.
Advantages of Naive Bayes Classifier?
- Works well with a large number of features (because of the independence assumption)
- Works well with large training datasets
- Converges fast during training (all it needs are class priors and per-feature conditional probabilities, which do not change and can be stored ahead of time)
- Performs well with categorical features (one-hot encode them: n-1 dummy variables for n categories, each set to 1 or 0 depending on whether that category is present in a row; see the sketch below)
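A minimal sketch of the categorical-feature point, assuming scikit-learn and pandas; the column names and data are made up:

```python
# One-hot encoded categorical features fed to a Naive Bayes classifier.
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size":  ["S", "M", "M", "L"],
    "label": [1, 0, 1, 0],
})

# drop_first=True yields n-1 dummy variables for n categories, as on the card.
X = pd.get_dummies(df[["color", "size"]], drop_first=True)
y = df["label"]

clf = BernoulliNB().fit(X, y)   # BernoulliNB suits 0/1 dummy features
print(clf.predict(X))           # predictions for the training rows
```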
Disadvantages of Naive Bayes classifier?
- Correlated features hurt performance, because the independence assumption is violated and the shared evidence gets counted more than once
Is feature scaling required in NBC?
No, because it is based on probabilities rather than on a distance metric.
Can NBC handle missing values?
It can. An attribute with a missing value is ignored during training, and if an attribute is missing when classifying, it is simply left out of the probability calculation (see the sketch below).
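To make the skipping behavior concrete, here is a hand-rolled toy categorical Naive Bayes; this is not a library feature (scikit-learn's estimators do not skip missing values this way), and all names and data are illustrative:

```python
# Toy categorical Naive Bayes that ignores missing (None) attribute values,
# both when counting frequencies and when scoring a new row.
from collections import Counter, defaultdict

def train(rows, labels):
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (class, feature_index) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            if v is not None:                 # missing value: skip this attribute
                counts[(y, i)][v] += 1
    return priors, counts

def predict(row, priors, counts):
    def score(y):
        p = priors[y] / sum(priors.values())
        for i, v in enumerate(row):
            if v is None:                     # missing at prediction time: leave out
                continue
            c = counts[(y, i)]
            p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)  # Laplace smoothing
        return p
    return max(priors, key=score)

rows = [("red", "S"), ("blue", None), ("red", "M")]
labels = [1, 0, 1]
priors, counts = train(rows, labels)
print(predict(("red", None), priors, counts))  # -> 1
```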
Impact of outliers in NBC
Fairly robust, since predictions come from probability estimates (frequency counts) rather than distances.
Problem statements using NBC?
- Sentiment analysis
- Document categorization
- Spam classification (see the sketch below)
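A minimal spam-classification sketch, assuming scikit-learn; the example texts and labels are made up:

```python
# Bag-of-words Naive Bayes for spam classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10am tomorrow",
         "free money, click here", "lunch with the team"]
labels = [1, 0, 1, 0]                           # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # -> [1]
```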
What is concept of linear regression?
Given one or more independent features and a dependent feature, we fit a best-fit line to the data. This line follows the straight-line equation y = mx + c. The best-fit line is the one that minimizes a cost function such as the mean squared error, which is typically done with gradient descent (repeatedly stepping in the direction of the negative gradient), as in the sketch below.
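A minimal NumPy sketch of that loop, fitting y = mx + c by gradient descent on the mean squared error; the learning rate and iteration count are illustrative:

```python
# Gradient-descent fit of y = m*x + c by minimizing mean squared error.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, 50)       # noisy line, true m=3, c=2

m, c, lr = 0.0, 0.0, 0.01                  # illustrative learning rate
for _ in range(2000):
    err = (m * x + c) - y
    # gradients of MSE = mean((pred - y)^2) with respect to m and c
    dm = 2 * np.mean(err * x)
    dc = 2 * np.mean(err)
    m -= lr * dm                           # step in the negative gradient direction
    c -= lr * dc

print(round(m, 2), round(c, 2))            # close to 3 and 2
```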
Underfitting?
Underfitting occurs when a model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data.
High bias
Low variance
Overfitting?
Overfitting occurs when a statistical model fits its training data too closely. The model then cannot perform accurately on unseen data, defeating its purpose.
Low Bias
High Variance
What is bias and variance?
Bias is error from overly simple assumptions about the data; it shows up as high error on the training set (underfitting). Variance is sensitivity to fluctuations in the training data; it shows up as low training error but high test error (overfitting). See the sketch below.
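A sketch tying this back to underfitting and overfitting, assuming scikit-learn: a degree-1 fit to noisy quadratic data underfits (high error on both splits), while a degree-15 fit overfits (low training error, higher test error):

```python
# Under- vs overfitting: train/test MSE for low- and high-degree fits.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 60).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1, 60)          # noisy quadratic

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=1)

for degree in (1, 15):                              # underfit vs overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(x_tr)),   # training error
          mean_squared_error(y_te, model.predict(x_te)))   # test error
```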
What is multicollinearity?
When two independent features are highly correlated (correlation of roughly 0.85 or higher) and the dependent variable is also correlated with them.
It is then not useful to keep both features.
Find highly correlated features with a correlation heatmap (works for a small number of features).
For many features, use ridge or lasso regression: the penalty shrinks the coefficients of correlated features (see the sketch below).
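A sketch of both diagnostics, assuming pandas and scikit-learn; the synthetic data is made up so that x2 is nearly a copy of x1:

```python
# Spotting correlated features via the correlation matrix, then using Ridge.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
X = pd.DataFrame({"x1": x1, "x2": x2})
y = 2 * x1 + rng.normal(size=200)

print(X.corr())                              # x1/x2 correlation near 1
# seaborn.heatmap(X.corr()) would visualize this for a small feature set

ridge = Ridge(alpha=1.0).fit(X, y)           # penalty shrinks large coefficients
print(ridge.coef_)                           # weight is shared across x1 and x2
```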
Basic assumptions of linear regression?
There are four assumptions associated with a linear regression model (a residual-check sketch follows the list):
- Linearity: The relationship between X and the mean of Y is linear.
- Homoscedasticity: The variance of residual is the same for any value of X.
- Independence: Observations are independent of each other.
- Normality: For any fixed value of X, Y is normally distributed.
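A quick residual-check sketch for the normality and homoscedasticity assumptions, assuming SciPy and scikit-learn; the data is synthetic:

```python
# Residual diagnostics for a fitted linear regression.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (100, 1))
y = 1.5 * X.ravel() + rng.normal(0, 1, 100)

resid = y - LinearRegression().fit(X, y).predict(X)

# Normality of residuals: Shapiro-Wilk (p > 0.05 -> no evidence against it)
print(stats.shapiro(resid).pvalue)
# Homoscedasticity: residual spread should not trend with X (plot resid vs X,
# or compare the variance in the lower and upper halves of X)
print(resid[X.ravel() < 5].var(), resid[X.ravel() >= 5].var())
```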
Advantages of linear regression?
- Works exceptionally well when the relationship between features and target is close to linear
- Converges quickly and is easy to train
- Overfitting can be handled with dimensionality reduction and regularization
Disadvantages of linear regression?
- A lot of feature engineering is sometimes required
- Correlated independent features (multicollinearity) can hurt performance
- It is quite prone to noise and overfitting