Machine Learning Interview Flashcards
What are heuristics and why are they used?
Shortcuts to solutions. Rules-based approaches. An approximate solution instead of an exact solution.
They are used because some problems either can’t be solved exactly or require too much time or processing power to be worth solving exactly, e.g. you don’t want a robot taking forever to assess the best move in chess.
Nearest neighbour heuristic –> Ask the computer to figure out the closest city that hasn’t been visited yet by the salesperson and make that the next stop. Doesn’t consider future moves, so it’s not the most effective.
Alpha-beta pruning (games) –> explores possible next moves, but stops evaluating a branch as soon as it is proven worse than a previously considered move
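The nearest neighbour heuristic can be sketched in a few lines. This is an illustrative sketch, not a canonical implementation; the coordinates and function name are invented for the example:

```python
import math

def nearest_neighbour_tour(cities, start=0):
    """Greedy TSP heuristic: always visit the closest unvisited city.

    `cities` is a list of (x, y) coordinates; returns the visit order.
    Fast, but ignores future moves, so the tour is rarely optimal.
    """
    unvisited = set(range(len(cities))) - {start}
    tour = [start]
    while unvisited:
        last = cities[tour[-1]]
        # Pick the unvisited city with the smallest straight-line distance.
        nxt = min(unvisited, key=lambda i: math.dist(last, cities[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Four cities on a line: the greedy tour simply walks left to right.
print(nearest_neighbour_tour([(0, 0), (1, 0), (2, 0), (3, 0)]))  # [0, 1, 2, 3]
```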
What is AI?
A broad category that involves teaching computers to solve problems using algorithms. They mimic human intelligence.
This can be done using complex rule-based processing or by training machine learning models.
What are the challenges with ML?
- Lack of understanding: Setting expectations that AI is not a straightforward process & educating stakeholders
- High expectations: Communicating often
- Lack of data
- Wrong problem: Identifying real machine learning problems that solve a real-world business problem
- Lack of governance: Monitoring/retraining models at regular intervals and setting up related governance processes
What is ML?
Machine learning is about training a machine (a set of mathematical models) on a historical dataset so that the machine can make predictions on unseen data.
The key characteristic of machine learning systems is that their performance improves with new data (experience).
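The train-on-history, predict-on-unseen idea can be made concrete with a tiny least-squares line fit (the "hours studied vs. exam score" numbers are made up for illustration):

```python
# A minimal illustration of training on historical data and predicting
# unseen data: fit y = w*x + b by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares estimates for slope and intercept.
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# "Historical dataset": hours studied vs. exam score (made-up numbers).
w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# Predict on an unseen data point: 5 hours studied.
print(w * 5 + b)  # 11.0
```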
What is deep learning?
Deep learning problems form a subset of machine learning problems.
Deep learning is the branch of machine learning that mimics the human brain to learn from a dataset and predict outcomes on unseen data.
In deep learning models, the features are learned automatically by the optimisation algorithm.
When should one use deep learning?
- Complex problems –> Being able to solve complex problems that require discovering hidden patterns in the data and/or a deep understanding of relationships between a large number of interdependent variables.
- Success definition –> Deep learning increases the difficulty for humans to understand and interpret the model. Deep models capture very complex situations with high non-linearity and interactions between inputs, so they might not be the best choice if explainability is a concern.
- Resource availability –> Complicated and expensive endeavor. Models are slow to train and require a lot of computational power. GPUs are pretty much always a requirement.
- Availability of quality labelled data –> Classification tasks can require millions of labelled data points, which makes deep learning a more difficult endeavour.
How do you identify whether a problem is a machine learning one?
- It is not easy to identify a finite set of rules based on which one can determine output related to numerical problems or classification problems (e.g. email classification)
- A finite set of rules can be identified, but the rules change so fast that it is difficult to deploy solution changes in production (e.g. dynamic pricing for airlines)
- Whether the solution requires a large volume of data for testing/quality assurance (QA) (e.g. chess or reconciliations)
- Whether the solution improves as the variety of data improves (e.g. reconciliations)
What are the different kinds of ML problems?
Three most common…
- Supervised learning problems: These are problems where the output labels or actual values related to the response variable (a variable that needs to be predicted) are available. The machine is trained using both the data and the related output value. Later, the machine makes the prediction on an unseen dataset.
Supervised learning problems can be categorized into the following different types:
- Regression (Predict the numerical value given the data set)
- Classification (Predict the class or the label of the dataset)
- Unsupervised learning problems: These are problems where output values or labels are not present. Clustering is one common type of unsupervised learning problem. The machine learns the clusters in the data given the dataset.
- Reinforcement learning problems: Given the environment, the machine learns to perform the most optimal action based on feedback it gets by performing an action in a simulation or training environment. Some of the key aspects of reinforcement learning include environment, current state, action, future state, reward, etc.
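The supervised/unsupervised distinction can be made concrete with a toy sketch (data and function names invented for illustration): a 1-nearest-neighbour classifier uses labels at training time, while a simple two-cluster split receives no labels at all.

```python
# Supervised: labels accompany the training data.
def predict_1nn(train_x, train_y, query):
    """Classify `query` with the label of its nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[nearest]

# Unsupervised: no labels; group points around two means (one k-means-style
# assignment/update step, repeated until it stabilises).
def two_means(points, iters=10):
    a, b = min(points), max(points)  # initial cluster centres
    for _ in range(iters):
        group_a = [p for p in points if abs(p - a) <= abs(p - b)]
        group_b = [p for p in points if abs(p - a) > abs(p - b)]
        a = sum(group_a) / len(group_a)
        b = sum(group_b) / len(group_b)
    return sorted(group_a), sorted(group_b)

print(predict_1nn([1.0, 2.0, 8.0], ["low", "low", "high"], 7.5))  # high
print(two_means([1.0, 1.2, 0.8, 7.9, 8.1, 8.0]))
```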
What is feature engineering and what role do product managers play?
Feature engineering is one of the key stages of the machine learning model development lifecycle. It can be defined as the process of identifying the most important features that can be used to train a machine learning model, which generalizes well to an unseen dataset (the larger population). As a product manager, you need to clearly understand the concept of features.
Feature engineering comprises the following tasks:
- Identifying raw features which can be obtained from the dataset
- Identifying derived features which can be obtained using the raw data set
- Extracting features from the existing features
- Selecting the most important features from features obtained in the above stages
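The raw-vs-derived distinction above can be sketched as follows. The transaction fields here are invented for illustration:

```python
from datetime import date

def engineer_features(txn):
    """Build model features from a raw transaction record.

    Raw features are taken directly from the record; derived features
    are computed from the raw ones.
    """
    return {
        # Raw features: read straight off the dataset.
        "amount": txn["amount"],
        "n_items": txn["n_items"],
        # Derived features: computed from the raw fields.
        "avg_item_price": txn["amount"] / txn["n_items"],
        "is_weekend": txn["date"].weekday() >= 5,
    }

txn = {"amount": 30.0, "n_items": 3, "date": date(2024, 1, 6)}  # a Saturday
print(engineer_features(txn))
```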
As a product manager, you play a key role in helping data scientists identify raw features and derived features. The rest is the work of the data scientist.
What is your approach towards model governance/monitoring?
Model performance can be classified into three categories, namely, the green zone, the yellow zone, and the red zone. One needs to identify thresholds for putting the model performance in the green, yellow, and red zones. Depending on which zone the model’s performance falls into, the model is scheduled for retraining.
- Green Zone: If model performance is above a particular threshold, say, 85-90%, the model can be said to be in the green zone. One may not need to do anything.
- Yellow Zone: If the model performance is between, say, 60-70% and the green zone threshold, the model falls in the yellow zone and requires scrutiny.
- Red Zone: If the model performance is less than a particular threshold, say, 60%, the model gets scheduled to be retrained.
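The zone logic above reduces to a simple threshold check. The 60%/85% defaults here are the example figures from the card, not fixed values:

```python
def performance_zone(accuracy, green_threshold=0.85, red_threshold=0.60):
    """Map a model performance score to a governance zone.

    Green: no action needed; yellow: requires scrutiny;
    red: schedule the model for retraining.
    """
    if accuracy >= green_threshold:
        return "green"
    if accuracy >= red_threshold:
        return "yellow"
    return "red"

print(performance_zone(0.92))  # green
print(performance_zone(0.70))  # yellow
print(performance_zone(0.55))  # red
```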
What is accuracy and how to best use it?
How do you use other metrics?
Accuracy = total correct predictions / total predictions
Measures the overall proportion of predictions the model got right.
Can be misleading if the dataset is unbalanced (the majority belongs to one label). Fine to use if the dataset is balanced.
We use precision when we want predictions of 1 to be as correct as possible, and we use recall when we want our model to spot as many real 1s as possible.
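A small confusion-matrix example (numbers invented) shows why accuracy misleads on imbalanced data: a model that predicts everything as 0 still scores high accuracy while catching no positives.

```python
def metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Imbalanced data: 9 negatives, 1 positive; the model misses the positive.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(metrics(y_true, y_pred))  # (0.9, 0.0, 0.0) -- 90% accurate, 0% recall
```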
Why is testing ML projects challenging?
- Edge cases
- Live data can change
- The outcome depends not only on code but also on training data
- Model performance is something else to test for
- Model behaviour must be covered to ensure the main known constraints and behaviours/patterns are met
How do you handle data science uncertainty when planning?
- Contract with stakeholder (examples from working with ops)
- Educate the stakeholder about unpredictability
- Encourage trying a few different approaches (if resources allow)
How do you help the team manage and prioritise BAU model improvements?
- Need to be metrics and objectives driven
- Logged and managed together
- If possible, bringing the stakeholder group into the process, aka workshops
What is precision and how to best use it?
Give an example.
Precision = true positive / total predicted positive (true and false positive)
Precision is a good measure when the cost of a wrong positive prediction is much higher than the cost of missing a right one, e.g. email spam detection. If precision is low, there are too many false positives and important emails go to the spam folder.
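Continuing the spam example with made-up labels (1 = spam): each false positive is a legitimate email sent to the spam folder, and every extra one drags precision down.

```python
def precision(y_true, y_pred):
    """true positives / all predicted positives (1 = spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    return tp / predicted_pos if predicted_pos else 0.0

# 4 emails flagged as spam, but only 3 really are: one important email
# would land in the spam folder.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]
print(precision(y_true, y_pred))  # 0.75
```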