Machine Learning Interview Flashcards

1
Q

What are heuristics and why are they used?

A

Shortcuts to solutions: rules-based approaches that give an approximate solution instead of an exact one.

They are used because some problems either can’t be solved exactly or require too much time or processing power to be practical for the problem at hand, e.g. you don’t want a chess program taking forever to assess the best move.

Nearest neighbour heuristic –> Ask the computer to find the closest city that hasn’t been visited yet by the salesperson and make that the next stop. It doesn’t consider future moves, so it’s not the most effective.

Alpha-beta pruning (games) –> Runs through many possible next moves, but stops exploring a move as soon as it is determined to be worse than a previously considered one
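The nearest neighbour heuristic can be sketched in a few lines of Python (the city names and coordinates below are made up for illustration):

```python
import math

def nearest_neighbour_tour(cities, start):
    """Greedy tour: always visit the closest unvisited city next."""
    unvisited = {name: xy for name, xy in cities.items() if name != start}
    tour, current = [start], cities[start]
    while unvisited:
        # Pick the closest remaining city -- future moves are ignored.
        name = min(unvisited, key=lambda n: math.dist(current, unvisited[n]))
        tour.append(name)
        current = unvisited.pop(name)
    return tour

cities = {"A": (0, 0), "B": (1, 0), "C": (5, 0), "D": (1, 1)}
print(nearest_neighbour_tour(cities, "A"))  # -> ['A', 'B', 'D', 'C']
```

Fast, but greedy: a stop that looks good in the short term can make the overall tour far from optimal, which is exactly the trade-off heuristics accept.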

2
Q

What is AI?

A

A broad category that involves teaching computers to solve problems using algorithms. These systems mimic human intelligence.

This can be done by processing a set of complex rules or by training machine learning models.

3
Q

What are the challenges with ML?

A
  • Lack of understanding: Setting expectations that AI is not a straightforward process & educating stakeholders
  • High expectations: Communicating often
  • Lack of data
  • Wrong problem: Identifying real machine learning problems that solve a real-world business problem
  • Lack of governance: Monitoring/retraining models at regular intervals and setting up related governance processes
4
Q

What is ML?

A

Machine learning is about training a machine (a set of mathematical models) on a historical dataset so that the machine can make predictions on unseen data.

The key characteristic of a machine learning system is that its performance can improve with new data (experience).

5
Q

What is deep learning?

A

Deep learning problems form a subset of machine learning problems.

Deep learning represents the aspect of machine learning that mimics the human brain to learn from a dataset and predict outcomes on unseen data.

In deep learning models, the features are learnt automatically by the optimisation algorithm.

6
Q

When should one use deep learning?

A
  1. Complex problems –> Deep learning can solve complex problems that require discovering hidden patterns in the data and/or a deep understanding of relationships between a large number of interdependent variables.
  2. Success definition –> Deep learning increases the difficulty for humans to understand and interpret the model. These models capture very complex situations with high non-linearity and interactions between inputs, so they might not be the best choice if explainability is a concern.
  3. Resource availability –> A complicated and expensive endeavour. Models are slow to train and require a lot of computational power; GPUs are pretty much always a requirement.
  4. Availability of quality labelled data –> Millions of labelled data points may be required for a classification task, making these projects more difficult.
7
Q

How do you identify whether a problem is a machine learning one?

A
  • It is not easy to identify a finite set of rules from which one can determine the output for numerical or classification problems (e.g. email classification)
  • Although a finite set of rules can be identified, the rules change so fast that it is difficult to deploy solution changes to production (e.g. dynamic pricing for airlines)
  • Whether the solution requires a large volume of data for testing/quality assurance (QA) (e.g. chess or reconciliations)
  • Whether the solution improves with a greater variety of data (e.g. reconciliations)
8
Q

What are the different kinds of ML problems?

A

Three most common…

  1. Supervised learning problems: These are problems where the output labels or actual values related to the response variable (a variable that needs to be predicted) are available. The machine is trained using both the data and the related output value. Later, the machine makes the prediction on an unseen dataset.

Supervised learning problems can be categorized into the following different types:
- Regression (Predict the numerical value given the data set)
- Classification (Predict the class or the label of the dataset)

  2. Unsupervised learning problems: These are problems where output values or labels are not present. Clustering is one common type of unsupervised learning problem. The machine learns clusters within the given dataset.
  3. Reinforcement learning problems: Given the environment, the machine learns to perform the most optimal action based on feedback it gets by performing an action in a simulation or training environment. Some of the key aspects of reinforcement learning include environment, current state, action, future state, reward, etc.
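A minimal stdlib-only sketch contrasting supervised classification with unsupervised clustering (all data points below are toy values made up for illustration):

```python
import math

# Supervised (classification): labels are given; predict the label of an
# unseen point with a simple 1-nearest-neighbour rule.
train = [((1.0, 1.0), "spam"), ((1.2, 0.9), "spam"), ((5.0, 5.0), "ham")]

def predict(point):
    _, label = min(train, key=lambda ex: math.dist(point, ex[0]))
    return label

print(predict((1.1, 1.0)))  # -> spam

# Unsupervised (clustering): no labels; group points purely by proximity.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8)]
clusters = {}
for p in points:
    for centre in clusters:          # join the first centre within distance 1
        if math.dist(p, centre) < 1.0:
            clusters[centre].append(p)
            break
    else:                            # otherwise start a new cluster
        clusters[p] = [p]
print(len(clusters))  # -> 2 clusters discovered without any labels
```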
9
Q

What is feature engineering and what role do product managers play?

A

Feature engineering is one of the key stages of the machine learning model development lifecycle. It can be defined as the process of identifying the most important features that can be used to train a machine learning model that generalizes well to an unseen dataset (the larger population). A product manager needs to clearly understand the concept of features.

Feature engineering comprises the following tasks:

  • Identifying raw features which can be obtained from the dataset
  • Identifying derived features which can be obtained using the raw data set
  • Extracting features from the existing features
  • Selecting the most important features from features obtained in the above stages

As a product manager, you play a key role in helping data scientists identify raw and derived features. The rest is the work of the data scientist.
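A small sketch of raw vs. derived features, using a hypothetical order record (the field names and values are made up for illustration):

```python
from datetime import date

# Raw features straight from the dataset (hypothetical order record).
order = {"order_date": date(2023, 3, 18), "items": 4, "total": 96.0}

# Derived features computed from the raw ones -- the kind of domain input
# a product manager is well placed to suggest.
features = {
    "avg_item_price": order["total"] / order["items"],
    "is_weekend": order["order_date"].weekday() >= 5,  # Sat=5, Sun=6
    "is_bulk_order": order["items"] >= 3,
}
print(features)  # avg_item_price=24.0, is_weekend=True, is_bulk_order=True
```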

10
Q

What is your approach towards model governance/monitoring?

A

Model performance can be classified into three categories, namely, the green zone, the yellow zone, and the red zone. One needs to identify thresholds for putting the model performance in the green, yellow, and red zones. Depending on which zone the model’s performance falls into, action such as retraining is scheduled.

  • Green Zone: If model performance is above a particular threshold, say, 85-90%, the model can be said to be in the green zone. One may not need to do anything.
  • Yellow Zone: If the model performance lies between the red zone threshold (say, 60-70%) and the green zone threshold, the model falls in the yellow zone and requires scrutiny.
  • Red Zone: If the model performance is less than a particular threshold, say, 60%, the model gets scheduled to be retrained.
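The zoning logic above can be sketched as a simple function (the thresholds are illustrative, echoing the rough bands in the answer, not standard values):

```python
def performance_zone(score, green=0.85, red=0.65):
    """Map a model performance score to a monitoring zone."""
    if score >= green:
        return "green"   # no action needed
    if score >= red:
        return "yellow"  # requires scrutiny
    return "red"         # schedule retraining

print(performance_zone(0.91))  # -> green
print(performance_zone(0.72))  # -> yellow
print(performance_zone(0.55))  # -> red
```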
11
Q

What is accuracy and how to best use it?

How do you use other metrics?

A

Accuracy = total correct predictions / total predictions

Measures the overall proportion of correct predictions made by the model.

Can be misleading if the dataset is unbalanced (majority comprised of one of the labels). Ok to use if the dataset is balanced.

We use precision when we want each prediction of 1 to be as correct as possible, and we use recall when we want our model to spot as many real 1s as possible.
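A quick sketch showing how accuracy is computed and why it misleads on an unbalanced dataset (toy labels for illustration):

```python
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Unbalanced dataset: 95 negatives, 5 positives. A model that always
# predicts 0 looks great on accuracy while missing every positive.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(accuracy(y_true, y_pred))  # -> 0.95, yet every real 1 was missed
```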

12
Q

Why is testing ML projects challenging?

A
  • Edge cases
  • Live data can change
  • Outcome itself is dependent not only on code but training data
  • Model performance is something else to test for
  • Model behaviour must be covered to ensure the main known constraints and behaviours/patterns are met
13
Q

How do you handle data science uncertainty when planning?

A
  • Contract with stakeholder (examples from working with ops)
  • Educate the stakeholder about unpredictability
  • Encourage trying a few different approaches (if resources allow)
14
Q

How do you help the team manage and prioritise BAU model improvements?

A
  • Need to be metrics and objectives driven
  • Logged and managed together
  • If possible, bringing the stakeholder group into the process, aka workshops
15
Q

What is precision and how to best use it?

Give an example.

A

Precision = true positive / total predicted positive (true and false positive)
Precision is a good measure when the cost of getting a prediction wrong is much higher than the cost of missing out on a right prediction, e.g. email spam detection. If precision is low, there are too many false positives and important emails go to the spam folder.
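Precision computed directly from its definition (toy labels made up for illustration):

```python
def precision(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp)

# Spam filter (1 = spam): two real spams caught, one legit email wrongly flagged.
y_true = [1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0]
print(precision(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67
```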

16
Q

What is recall and how to best use it?

Give an example.

A

Recall = true positive / total actual positive (true positive and false negative)
The model should capture all examples of the class.
Best used when the cost of missing a prediction is much higher than the cost of a wrong prediction, e.g. when a fraudulent bank transaction is predicted as non-fraudulent, when sick patients are classified as healthy, or when airport detectors miss actual bombs/dangerous items. High coverage!
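Recall computed directly from its definition (toy labels made up for illustration):

```python
def recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

# Fraud detection (1 = fraudulent): three real frauds, one slips through.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0]
print(recall(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67 -- a third of frauds missed
```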

17
Q

What is an F1 score and how to best use it?

A

F1-Score = 2 x (precision*recall)/(precision + recall) –> seeking a balance between precision and recall

Best used when the data is unbalanced.
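A small sketch of the F1 formula, showing how one weak side drags the score down:

```python
def f1_score(precision, recall):
    # Harmonic mean: the score collapses if either input is low.
    return 2 * (precision * recall) / (precision + recall)

print(f1_score(0.9, 0.9))  # ≈ 0.90 -- balanced model
print(f1_score(0.9, 0.1))  # ≈ 0.18 -- high precision cannot rescue low recall
```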

18
Q

What is the difference between supervised and unsupervised methods?

Can you provide some examples?

A

Supervised: Need labels, answers questions in pre-defined categories (e.g. email classification - classifying the email exactly)

Unsupervised: No need for labels, good for exploring, can visualise well (e.g. figuring out how many email classes there are/exploration)

19
Q

Why is versioning important?

A

Versioning applies to the model, the data, and the code.

  1. Helps a data scientist keep track of their changes and ultimately pick the right model
  2. Governance and regulatory requirements
20
Q

What product metrics would you use for a chatbot?

A

User engagement: qualitative and quantitative

Performance: Model metrics and response times

Stability: SLA, bugs

21
Q

How would you design for feedback capture? Can you give an example?

(Does no feedback mean positive feedback?)

A

Ensuring that the design of the UI makes it very clear to the user what action they are taking next so that they don’t click without a clear intent.

I would user test this multiple times to ensure the behaviour is ingrained.

This gives data science teams reassurance on the other end.

Example: Reconciliations accept or decline. Decline made a difference; we had to stop people from progressing until they clicked something.