Miscellaneous Flashcards

1
Q

You are given a data set consisting of variables with missing values. How will you deal with them?

A

Evaluate if missing values are missing randomly or systematically.

Quickest way: If the dataset is large and the values are missing at random, we can simply remove the rows with missing values.

For smaller datasets, we can impute missing values with the mean, median, or mode of the rest of the data using a pandas DataFrame in Python:
df.fillna(df.mean())

Another option is KNN imputation, which works for numeric or categorical values: KNN fills in each missing value from the k closest rows.
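
A minimal sketch of the two quick options above (dropping rows and mean imputation), assuming pandas and NumPy are installed. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "income": [50.0, 60.0, np.nan, 80.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill each missing value with its column mean
# (swap in df.median() to impute with the median instead).
imputed = df.fillna(df.mean())
```

For the KNN option, scikit-learn provides `KNNImputer` in `sklearn.impute` for numeric data.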

2
Q

What is dimensionality reduction?

A

Dimensionality reduction refers to the process of converting a data set with many dimensions (fields) into one with fewer dimensions that conveys similar information more concisely.
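
As one illustration, here is a rough sketch of principal component analysis (PCA), a common dimensionality reduction technique that the card itself does not name. It assumes NumPy is installed; the data is synthetic and built so that the third field is redundant.

```python
import numpy as np

# Hypothetical data: 100 samples with 3 fields, where the third field
# is just the sum of the first two (so it carries no new information).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1]])

# PCA via SVD: center the data, then project onto the top-k directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
X_reduced = X_centered @ Vt[:k].T  # shape (100, 2)

# Share of the total variance kept by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
```

Because the third field is a combination of the other two, two components keep essentially all of the variance here.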

3
Q

What are the benefits of dimensionality reduction?

A

  • Compresses data, which reduces storage space
  • Reduces computation time
  • Removes redundant features

4
Q

How should you maintain a deployed model?

A

MECR:

  1. Monitor - continuously monitor the model's accuracy, especially when changes are made
  2. Evaluate - the current model's evaluation metrics are calculated to determine whether a new algorithm is needed
  3. Compare - new models are compared to each other to determine which one performs best
  4. Rebuild - the best-performing model is rebuilt on the current state of the data

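
The steps above can be sketched as a single maintenance pass. This is only a toy outline: `accuracy`, `maintain`, the `threshold` value, and the two lambda "models" are hypothetical names invented for illustration, and the rebuild step is left out.

```python
def accuracy(model, data):
    """Fraction of (x, y) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def maintain(deployed, candidates, recent_data, threshold=0.9):
    """Return the model to keep serving after one maintenance pass."""
    # Monitor + Evaluate: score the deployed model on recent data.
    if accuracy(deployed, recent_data) >= threshold:
        return deployed
    # Compare: pick the best of the deployed model and the candidates.
    # Rebuild (retraining the winner on current data) is not shown here.
    return max([deployed, *candidates], key=lambda m: accuracy(m, recent_data))

# Hypothetical toy setup: the label is 1 when x is positive.
recent = [(-2, 0), (-1, 0), (1, 1), (2, 1)]
old_model = lambda x: 1           # always predicts 1
new_model = lambda x: int(x > 0)  # matches the labels
chosen = maintain(old_model, [new_model], recent)
```
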
5
Q

What does it mean for data to be stationary?

A

The mean, variance, and covariance should not be a function of time.

  1. Mean - should be constant over time. If the series is not stationary, the mean drifts over time (for example, it trends upward).
  2. Variance - should be constant over time (homoscedasticity). If the series is not stationary, the spread of the data varies over time.
  3. Covariance - the covariance between values a fixed lag apart should not depend on time. If the series is not stationary, the observations bunch closer together in some periods and spread further apart in others.

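
A rough way to see the first condition is to compare the mean of the first and second half of a series. The sketch below uses only the standard library and synthetic data; `half_stats` is a hypothetical helper, and this is an informal check rather than a formal statistical test.

```python
import random
import statistics

def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return ((statistics.mean(first), statistics.pvariance(first)),
            (statistics.mean(second), statistics.pvariance(second)))

random.seed(0)
noise = [random.gauss(0, 1) for _ in range(1000)]    # stationary
trend = [0.01 * t + e for t, e in enumerate(noise)]  # mean drifts upward

(noise_m1, _), (noise_m2, _) = half_stats(noise)
(trend_m1, _), (trend_m2, _) = half_stats(trend)
noise_shift = abs(noise_m2 - noise_m1)  # small: the mean stays put
trend_shift = abs(trend_m2 - trend_m1)  # large: roughly 0.01 * 500
```
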
6
Q

‘People who bought this also bought’ recommendations seen on Amazon are a result of which algorithm?

A

A recommendation system built with collaborative filtering.

7
Q

What is collaborative filtering?

A

Predicts what might interest a person based on the preferences of other users: purchase history, ratings, selections, etc.
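
A minimal sketch of user-based collaborative filtering, using only the standard library. The users, items, and ratings are made up, and scoring by a similarity-weighted sum of ratings is one simple choice among several.

```python
import math

# Hypothetical ratings: user -> {item: rating}. Names are illustrative.
ratings = {
    "ann":  {"book": 5, "lamp": 4, "mug": 1},
    "bob":  {"book": 5, "lamp": 5, "pen": 4},
    "cara": {"mug": 5, "pen": 1, "sock": 5},
}

def cosine(a, b):
    """Cosine similarity between two users; unrated items count as 0."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user):
    """Return the unrated item with the highest similarity-weighted score."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return max(scores, key=scores.get)

best = recommend("ann")
```

Here ann's tastes are much closer to bob's than to cara's, so bob's rating of the pen outweighs cara's rating of the sock.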

8
Q

What is a Generative Adversarial Network (GAN)?

A

2 components:

  1. Generator
  2. Discriminator

Suppose a wine shop purchases wine from dealers and resells it later, but some dealers sell fake wine. The shop owner should be able to distinguish between fake and authentic wine. The forger will try different techniques to sell fake wine and keep the ones that get past the shop owner’s check. The shop owner would probably get feedback from wine experts that some of the wine is not original, and would then have to improve how he determines whether a wine is fake or authentic.
The forger’s goal is to create wines that are indistinguishable from the authentic ones, while the shop owner’s goal is to tell accurately whether a wine is real.

  • There is a noise vector coming into the forger who is generating fake wine.
  • Here the forger acts as a Generator.
  • The shop owner acts as a Discriminator.
  • The Discriminator gets two inputs; one is the fake wine, while the other is the real authentic wine.

The shop owner has to figure out whether it is real or fake.

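The generator is a network (often a CNN for image tasks) that keeps producing images that come closer and closer in appearance to the real ones. The discriminator tries to tell the real images from the fake ones. The two are trained against each other: the discriminator learns to separate real from fake images, and the generator learns to produce images the discriminator can no longer tell apart from real ones.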
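
The contest between the two networks is usually written as a minimax objective. The symbols are not from the card and are assumed here: x is a real sample, z is the noise vector, G(z) is the generator's output, and D(·) is the discriminator's estimated probability that its input is real.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator pushes V up by scoring real samples high and generated ones low; the generator pushes V down by making D(G(z)) large.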

9
Q

What is a use case for GAN?

A

Photo editing - for example, Photoshop-style tools that generate or fill in realistic image content.

10
Q

You are given a dataset on cancer detection. You have built a classification model with 96% accuracy. Are you happy with the model performance and what do you do?

A

No.
Cancer detection data sets are usually imbalanced, and accuracy should not be used as the metric on imbalanced data.
96% accuracy still means 4% of patients are misclassified, and early diagnosis is crucial in cancer detection. Consider type 1 and type 2 errors in the context of the data: an unnecessary extra diagnostic test (false positive) versus a missed diagnosis (false negative).
We should use sensitivity (true positive rate), specificity (true negative rate), and the F measure to determine the class-wise performance of the classifier.
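
To see why 96% accuracy can be misleading, here is a small sketch that computes these metrics from confusion-matrix counts. The counts are hypothetical: 1000 patients, 40 of whom have cancer, and a model that predicts "no cancer" for everyone.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return accuracy, sensitivity, specificity, f1

# Hypothetical counts: the model never predicts the positive class.
acc, sens, spec, f1 = metrics(tp=0, fp=0, tn=960, fn=40)
```

With these counts the accuracy is 0.96 while the sensitivity and F1 are both 0: the model misses every cancer case.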
