DS 1 Flashcards
What is the basic assumption of Naive Bayes classifier?
Features are conditionally independent of each other, given the class label.
Advantages of Naive Bayes Classifier?
- Works well with a large number of features (because of the independence assumption)
- Works well with large training datasets
- Converges fast during training (all it needs are class priors and per-feature conditional probabilities, which do not change and can be stored ahead of time)
- Performs well with categorical features (one-hot encode them: n-1 dummy variables for n categories, each set to 1 or 0 depending on whether that category is present in a row; see the sketch below)
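A minimal sketch of the categorical-feature point, assuming scikit-learn and pandas; the column names and data are made up:

```python
# One-hot encoded categorical features fed to a Naive Bayes classifier.
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size":  ["S", "M", "M", "L"],
    "label": [1, 0, 1, 0],
})

# drop_first=True yields n-1 dummy variables for n categories, as on the card.
X = pd.get_dummies(df[["color", "size"]], drop_first=True)
y = df["label"]

clf = BernoulliNB().fit(X, y)   # BernoulliNB suits 0/1 dummy features
print(clf.predict(X))           # predictions for the training rows
```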
Disadvantages of Naive Bayes classifier?
- Correlated features hurt performance, because the independence assumption is violated and the shared evidence gets counted more than once
Is feature scaling required in NBC?
No, because it is based on probabilities rather than on a distance metric.
Can NBC handle missing values?
It can. An attribute with a missing value is ignored during training, and if an attribute is missing when classifying, it is simply left out of the probability calculation (see the sketch below).
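To make the skipping behavior concrete, here is a hand-rolled toy categorical Naive Bayes; this is not a library feature (scikit-learn's estimators do not skip missing values this way), and all names and data are illustrative:

```python
# Toy categorical Naive Bayes that ignores missing (None) attribute values,
# both when counting frequencies and when scoring a new row.
from collections import Counter, defaultdict

def train(rows, labels):
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (class, feature_index) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            if v is not None:                 # missing value: skip this attribute
                counts[(y, i)][v] += 1
    return priors, counts

def predict(row, priors, counts):
    def score(y):
        p = priors[y] / sum(priors.values())
        for i, v in enumerate(row):
            if v is None:                     # missing at prediction time: leave out
                continue
            c = counts[(y, i)]
            p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)  # Laplace smoothing
        return p
    return max(priors, key=score)

rows = [("red", "S"), ("blue", None), ("red", "M")]
labels = [1, 0, 1]
priors, counts = train(rows, labels)
print(predict(("red", None), priors, counts))  # -> 1
```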
Impact of outliers in NBC
Fairly robust, since predictions come from probability estimates (frequency counts) rather than distances.
Problem statements using NBC?
- Sentiment analysis
- Document categorization
- Spam classification (see the sketch below)
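A minimal spam-classification sketch, assuming scikit-learn; the example texts and labels are made up:

```python
# Bag-of-words Naive Bayes for spam classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10am tomorrow",
         "free money, click here", "lunch with the team"]
labels = [1, 0, 1, 0]                           # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # -> [1]
```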
What is concept of linear regression?
Given one or more independent features and a dependent feature, we fit a best-fit line to the data. This line follows the straight-line equation y = mx + c. The best-fit line is the one that minimizes a cost function such as the mean squared error, which is typically done with gradient descent (repeatedly stepping in the direction of the negative gradient), as in the sketch below.
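A minimal NumPy sketch of that loop, fitting y = mx + c by gradient descent on the mean squared error; the learning rate and iteration count are illustrative:

```python
# Gradient-descent fit of y = m*x + c by minimizing mean squared error.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, 50)       # noisy line, true m=3, c=2

m, c, lr = 0.0, 0.0, 0.01                  # illustrative learning rate
for _ in range(2000):
    err = (m * x + c) - y
    # gradients of MSE = mean((pred - y)^2) with respect to m and c
    dm = 2 * np.mean(err * x)
    dc = 2 * np.mean(err)
    m -= lr * dm                           # step in the negative gradient direction
    c -= lr * dc

print(round(m, 2), round(c, 2))            # close to 3 and 2
```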
Underfitting?
Underfitting occurs when a model is unable to capture the relationship between the input and output variables accurately, generating a high error rate on both the training set and unseen data.
High bias
Low variance
Overfitting?
Overfitting occurs when a statistical model fits its training data too closely. The model then cannot perform accurately on unseen data, defeating its purpose.
Low Bias
High Variance
What is bias and variance?
Bias is error from overly simple assumptions about the data; it shows up as high error on the training set (underfitting). Variance is sensitivity to fluctuations in the training data; it shows up as low training error but high test error (overfitting). See the sketch below.
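A sketch tying this back to underfitting and overfitting, assuming scikit-learn: a degree-1 fit to noisy quadratic data underfits (high error on both splits), while a degree-15 fit overfits (low training error, higher test error):

```python
# Under- vs overfitting: train/test MSE for low- and high-degree fits.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 60).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1, 60)          # noisy quadratic

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=1)

for degree in (1, 15):                              # underfit vs overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(x_tr)),   # training error
          mean_squared_error(y_te, model.predict(x_te)))   # test error
```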
What is multicollinearity?
When two independent features are highly correlated (correlation of roughly 0.85 or higher) and the dependent variable is also correlated with them.
It is then not useful to keep both features.
Find highly correlated features with a correlation heatmap (works for a small number of features).
For many features, use ridge or lasso regression: the penalty shrinks the coefficients of correlated features (see the sketch below).
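A sketch of both diagnostics, assuming pandas and scikit-learn; the synthetic data is made up so that x2 is nearly a copy of x1:

```python
# Spotting correlated features via the correlation matrix, then using Ridge.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
X = pd.DataFrame({"x1": x1, "x2": x2})
y = 2 * x1 + rng.normal(size=200)

print(X.corr())                              # x1/x2 correlation near 1
# seaborn.heatmap(X.corr()) would visualize this for a small feature set

ridge = Ridge(alpha=1.0).fit(X, y)           # penalty shrinks large coefficients
print(ridge.coef_)                           # weight is shared across x1 and x2
```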
Basic assumptions of linear regression?
There are four assumptions associated with a linear regression model (a residual-check sketch follows the list):
- Linearity: The relationship between X and the mean of Y is linear.
- Homoscedasticity: The variance of residual is the same for any value of X.
- Independence: Observations are independent of each other.
- Normality: For any fixed value of X, Y is normally distributed.
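A quick residual-check sketch for the normality and homoscedasticity assumptions, assuming SciPy and scikit-learn; the data is synthetic:

```python
# Residual diagnostics for a fitted linear regression.
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (100, 1))
y = 1.5 * X.ravel() + rng.normal(0, 1, 100)

resid = y - LinearRegression().fit(X, y).predict(X)

# Normality of residuals: Shapiro-Wilk (p > 0.05 -> no evidence against it)
print(stats.shapiro(resid).pvalue)
# Homoscedasticity: residual spread should not trend with X (plot resid vs X,
# or compare the variance in the lower and upper halves of X)
print(resid[X.ravel() < 5].var(), resid[X.ravel() >= 5].var())
```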
Advantages of linear regression?
- Works exceptionally well when the relationship between features and target is close to linear
- Converges quickly and is easy to train
- Overfitting can be handled with dimensionality reduction and regularization
Disadvantages of linear regression?
- A lot of feature engineering is sometimes required
- Correlated independent features (multicollinearity) can hurt performance
- It is quite prone to noise and overfitting