DS 1 Flashcards

1
Q

What is the basic assumption of the Naive Bayes classifier?

A

Features are conditionally independent of each other, given the class: P(x1, …, xn | y) = P(x1 | y) × … × P(xn | y)

2
Q

Advantages of Naive Bayes Classifier?

A
  1. Works well with a large number of features (because of the independence assumption)
  2. Works well with a large training dataset
  3. Converges fast when training a model (all it needs are the prior and conditional probabilities, which do not change and can be stored ahead of time)
  4. Performs well with categorical features (use one-hot encoding: n-1 dummy variables for n categories, assigning 1 or 0 depending on whether that category is present in a row; see the sketch below)
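A minimal sketch of that one-hot encoding step using pandas (the column name and categories are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical feature with 3 categories
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# drop_first=True yields n-1 dummy columns for n categories
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)
# Each row gets a 1 in the column for its category
# (the dropped first category is encoded as all zeros)
```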
3
Q

Disadvantages of Naive Bayes classifier?

A
  1. Correlated features hurt performance, since they violate the independence assumption
4
Q

Is feature scaling required in NBC?

A

No, because it is based on probabilities rather than on a distance metric.

5
Q

Can NBC handle missing values?

A

It can. An attribute with a missing value is ignored during training, and if an attribute value is missing during classification, that attribute is simply left out of the probability calculation.

6
Q

Impact of outliers in NBC?

A

Fairly robust, since predictions are based on probabilities rather than on distances.

7
Q

Problem statements using NBC?

A
  1. Sentiment analysis
  2. Document categorization
  3. Spam classification
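A minimal spam-classification sketch with scikit-learn's MultinomialNB; the tiny corpus and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus: 1 = spam, 0 = ham
texts = ["win money now", "meeting at noon", "free money offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)          # word-count features
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["free money now"])))  # likely [1] (spam)
```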
8
Q

What is the concept of linear regression?

A

Given one or more independent features and a dependent feature, we fit a best-fit line to the data. This line follows the equation of a straight line, y = mx + c. The best-fit line is selected by minimizing a cost function (such as mean squared error), which can be minimized with a gradient descent algorithm (repeatedly moving in the direction of the negative gradient).
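A minimal sketch of that fitting loop in NumPy for a single feature; the learning rate and iteration count are illustrative:

```python
import numpy as np

# Toy data: y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

m, c = 0.0, 0.0   # slope and intercept of y = mx + c
lr = 0.01         # learning rate (illustrative)
for _ in range(2000):
    y_hat = m * x + c
    # Gradients of mean squared error with respect to m and c
    grad_m = -2 * np.mean(x * (y - y_hat))
    grad_c = -2 * np.mean(y - y_hat)
    # Step in the direction of the negative gradient
    m -= lr * grad_m
    c -= lr * grad_c

print(m, c)  # should approach 2 and 1
```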

9
Q

Underfitting?

A

Underfitting is a scenario where a model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data.
High bias
Low variance

10
Q

Overfitting?

A

Overfitting occurs when a statistical model fits its training data too closely, including its noise. The model then cannot perform accurately on unseen data, defeating its purpose.
Low Bias
High Variance

11
Q

What is bias and variance?

A
Bias ≈ error on the training data: how far the model's predictions are from the truth even on data it has seen (a symptom of an overly simple model).
Variance ≈ error on test data beyond the training error: how much the model's predictions change across different training sets (a symptom of an overly complex model).
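A minimal sketch of this rule of thumb with scikit-learn (the data and model are illustrative): high training error suggests high bias, while a large train/test gap suggests high variance:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unpruned tree: near-zero training error (low bias)
# but a large train/test gap (high variance)
model = DecisionTreeRegressor().fit(X_tr, y_tr)
print("train MSE:", mean_squared_error(y_tr, model.predict(X_tr)))
print("test MSE: ", mean_squared_error(y_te, model.predict(X_te)))
```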
12
Q

What is multicollinearity?

A

When two independent features are highly correlated (roughly 0.85 or above) and the dependent variable is also correlated with both.
It is not useful to keep both features.
Find which features are highly correlated using a correlation heatmap (practical with a smaller number of features; see the sketch below).
With many features, use ridge or lasso regression, which penalizes large coefficients and so dampens the effect of correlated features.
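A minimal sketch of the heatmap check with pandas and seaborn; the features are synthetic, with x2 deliberately built to correlate with x1:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.2, size=200),  # highly correlated with x1
    "x3": rng.normal(size=200),                         # independent
})

# Pairwise correlations; values near +/-0.85 or beyond flag multicollinearity
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```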

13
Q

Basic assumptions of linear regression?

A

There are four assumptions associated with a linear regression model:

  1. Linearity: The relationship between X and the mean of Y is linear.
  2. Homoscedasticity: The variance of residual is the same for any value of X.
  3. Independence: Observations are independent of each other.
  4. Normality: For any fixed value of X, Y is normally distributed.
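A minimal sketch of checking two of these assumptions (homoscedasticity and normality) through the residuals of a fitted model; the data are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, 200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Homoscedasticity: residuals vs. predictions should show no funnel shape
plt.scatter(model.predict(X), residuals)
plt.axhline(0, color="red")
plt.xlabel("predicted")
plt.ylabel("residual")
plt.show()

# Normality: the residual histogram should look roughly bell-shaped
plt.hist(residuals, bins=20)
plt.show()
```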
14
Q

Advantages of linear regression?

A
  1. Works exceptionally well when the features and target are linearly related
  2. Converges quite fast and is easy to train
  3. Overfitting can be handled using dimensionality reduction and regularization
15
Q

Disadvantages of linear regression?

A
  1. Sometimes a lot of feature engineering is required
  2. If the independent features are correlated, it may affect performance
  3. It is often quite prone to noise and overfitting
16
Q

Is feature scaling required in Linear regression?

A

Yes, especially when training with gradient descent or using regularization, so that all features contribute on a comparable scale.

17
Q

Impact of missing values and outliers in LR?

A

It is affected by both: missing values must be imputed or dropped, and outliers can pull the fitted line substantially, since squared error amplifies large residuals.

18
Q

What are the advantages of SVM?

A
  1. Works quite well when there is a clear margin of separation between classes
  2. Memory efficient, since the decision function uses only a subset of the training points (the support vectors)
  3. More effective in high-dimensional spaces
  4. Complex, non-linear problems can be tackled using SVM kernels
  5. Works well even with structured or semi-structured data
19
Q

What are disadvantages of SVM?

A
  1. Not suitable for large training datasets (training time scales poorly with sample count)
  2. Performs poorly when the target classes overlap
  3. Underperforms when the number of features exceeds the number of training samples
20
Q

Is feature scaling required for SVM?

A

Yes. SVM works with distances between points, so unscaled features can dominate the margin.

21
Q

Is SVM sensitive to missing values?

A

Yes. There is no easy accommodation of missing covariate information; missing values must be handled beforehand.

22
Q

Is SVM sensitive to outliers?

A

Yes, the presence of even a few outliers can lead to serious misclassification.

23
Q

How do we solve underfitting and overfitting in SVM?

A

We use a soft margin instead of a hard one: some points are allowed inside the margin (or on the wrong side), though these are still penalized. The penalty strength (the C parameter) controls the trade-off and keeps the model from overfitting; see the sketch below.
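A minimal sketch of that trade-off with scikit-learn's SVC (the dataset is synthetic): a small C gives a softer margin and more regularization, while a large C approaches a hard margin and can overfit:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in (0.01, 1, 100):
    # Smaller C -> more points tolerated inside the margin (softer margin)
    clf = SVC(kernel="rbf", C=C).fit(X_tr, y_tr)
    print(f"C={C}: train={clf.score(X_tr, y_tr):.2f}, test={clf.score(X_te, y_te):.2f}")
```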

24
Q

SVM use cases?

A

  1. Text and hypertext categorization
  2. Handwriting recognition

25
Q

What are the basic assumptions of logistic regression?

A

A linear relationship between the independent features and the log odds of the target: log(p / (1 - p)) = b0 + b1*x1 + … + bn*xn
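A minimal NumPy sketch of that relationship (the coefficients and feature values are made up): the log odds are linear in the features, and the sigmoid maps them back to a probability:

```python
import numpy as np

# Hypothetical coefficients and one observation with two features
b0, b1, b2 = -1.0, 0.5, 2.0
x1, x2 = 3.0, -0.5

log_odds = b0 + b1 * x1 + b2 * x2   # linear in the features
p = 1 / (1 + np.exp(-log_odds))     # sigmoid: log odds -> probability

print(log_odds, p)
print(np.log(p / (1 - p)))          # recovers the linear term
```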

26
Q

What are advantages of logistic regression?

A
  1. easy to understand, implement and train
  2. no assumptions about distributions of classes in feature space
  3. good accuracy for many simple datasets and performs well when the dataset is linearly separable
  4. less inclined to overfit
  5. can be extended to multiclass problems
27
Q

What are the disadvantages of logistic regression?

A
  1. Sometimes a lot of feature engineering is required
  2. If independent features are correlated, it may affect performance
  3. Quite prone to noise and overfitting
  4. If the number of observations is less than the number of features, logistic regression should not be used
  5. Cannot solve non-linear problems, because it has a linear decision surface
28
Q

Is feature scaling required in Logistic regression?

A

Yes, especially when training with gradient descent or using regularization.

29
Q

Is logistic regression sensitive to missing values and outliers?

A

Missing values: yes

Outliers: not much; the sigmoid function tapers the effect of outliers.