Machine Learning Flashcards
source: https://www.edureka.co/blog/interview-questions/machine-learning-interview-questions/
A/B Testing
A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It is used to compare two models that use different predictor variables in order to check which one fits a given sample of data best.
Consider a scenario where you’ve created two models (using different predictor variables) that can be used to recommend products for an e-commerce platform.
A/B Testing can be used to compare these two models to check which one best recommends products to a customer.
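To make this concrete, here is a minimal Python sketch of an A/B test on two recommendation models, assuming SciPy is available. The conversion counts are made-up illustration values, and a two-proportion z-test is just one common way to run the comparison.

from math import sqrt
from scipy.stats import norm

conv_a, n_a = 120, 1000   # conversions among visitors shown model A's recommendations
conv_b, n_b = 150, 1000   # conversions among visitors shown model B's recommendations

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under the null hypothesis
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                                # two-proportion z statistic
p_value = 2 * norm.sf(abs(z))                       # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")            # a small p-value suggests the models differ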
Bagging vs Boosting
Bagging:
- Trains models in parallel, each on a bootstrap sample of the training data
- Combines predictions by averaging or voting, which mainly reduces variance
- Example: Random Forest
Boosting:
- Trains models sequentially, with each model focusing on the errors of the previous ones
- Mainly reduces bias
- Examples: AdaBoost, Gradient Boosting
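A minimal scikit-learn sketch of the contrast, using the default decision-tree base learners and a bundled toy dataset purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(n_estimators=50, random_state=0)    # parallel models on bootstrap samples
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)  # sequential models on re-weighted samples

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())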
Classification vs Regression
Classification:
- Predicting discrete class/label
- Binary and multi-class classification
Regression:
- Predicting a continuous quantity
- Regression with multiple input variables is called multiple regression; regression with multiple output variables is called multivariate regression
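A minimal scikit-learn sketch of the difference, using bundled toy datasets purely for illustration:

from sklearn.datasets import load_iris, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression

X_cls, y_cls = load_iris(return_X_y=True)            # target is a discrete class label
clf = LogisticRegression(max_iter=1000).fit(X_cls, y_cls)
print("predicted class:", clf.predict(X_cls[:1]))    # e.g. 0, 1 or 2

X_reg, y_reg = load_diabetes(return_X_y=True)        # target is a continuous quantity
reg = LinearRegression().fit(X_reg, y_reg)
print("predicted value:", reg.predict(X_reg[:1]))    # a real number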
Cluster Sampling
It is the process of randomly selecting intact groups (clusters) that share similar characteristics from a defined population.
Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
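A rough Python sketch of cluster sampling on a synthetic table; the column names and numbers are hypothetical:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "cluster_id": rng.integers(0, 10, size=200),   # 10 intact groups in the population
    "value": rng.normal(size=200),
})

# Randomly pick whole clusters, then keep every element of the chosen clusters.
chosen = rng.choice(df["cluster_id"].unique(), size=3, replace=False)
sample = df[df["cluster_id"].isin(chosen)]
print(sorted(sample["cluster_id"].unique()))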
Collinearity and Multicollinearity
Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression are highly correlated with each other.
Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are highly inter-correlated.
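One common way to check for these problems is the correlation matrix plus the variance inflation factor (VIF). A small sketch, assuming pandas and statsmodels are installed and using synthetic predictors:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)    # nearly a copy of x1
x3 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(X.corr())                                     # pairwise correlations (collinearity)

Xc = sm.add_constant(X)                             # include an intercept column
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))   # large VIF flags multicollinearity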
Confusion Matrix
A confusion matrix (or error matrix) is a table used to summarize the performance of a classification algorithm by comparing predicted labels against actual labels.
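A minimal scikit-learn sketch with made-up labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted class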
Gini Impurity vs Entropy in a Decision Tree
- Gini Impurity and Entropy are the metrics used for deciding how to split a Decision Tree.
Gini Impurity is the probability of a randomly chosen sample being classified incorrectly if you randomly pick a label according to the distribution in the branch.
Entropy is a measurement of the lack of information (impurity) in a branch. You calculate the Information Gain (the decrease in entropy) produced by making a split; choosing the split with the highest gain reduces the uncertainty about the output label.
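A small sketch of the two measures computed from a node's class probabilities (plain NumPy, illustrative values only):

import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)            # chance of mislabeling a random sample

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                           # avoid log(0)
    return -np.sum(p * np.log2(p))         # measured in bits

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # maximum impurity for two classes
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # a pure node scores 0 on both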
How Decision Tree node is split
- Measures such as the Gini Index and Entropy can be used to decide which variable gives the best split at a Decision Tree node (including the root node).
- We can calculate Gini as follows:
- Calculate the Gini impurity for each sub-node using 1 - (p^2 + q^2), where p and q are the probabilities of success and failure in that node.
- Calculate the Gini for the split as the weighted average of the sub-node impurities, and prefer the split with the lowest value.
- Entropy is the measure of impurity or randomness in the data.
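A rough sketch of scoring a candidate binary split by its weighted Gini impurity; the class counts are made-up illustration values:

def gini_impurity(p, q):
    return 1.0 - (p ** 2 + q ** 2)        # 1 - (p^2 + q^2)

# Parent node: 10 positives, 10 negatives -> impurity 0.5.
parent = gini_impurity(0.5, 0.5)

# Candidate split: left child (8 pos, 2 neg), right child (2 pos, 8 neg).
left, right = gini_impurity(0.8, 0.2), gini_impurity(0.2, 0.8)
weighted = (10 / 20) * left + (10 / 20) * right

print(parent, weighted)   # 0.5 vs 0.32: the split lowers impurity, so it is a good one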
Entropy Vs Information Gain
Entropy is an indicator of how messy (impure) your data is. It decreases as you move closer to the leaf nodes.
Information Gain is the decrease in entropy after the dataset is split on an attribute. It keeps increasing as you move closer to the leaf nodes.
Eigenvectors and Eigenvalues
Eigenvectors: Eigenvectors are vectors whose direction remains unchanged when a linear transformation is applied to them; they are only scaled.
Eigenvalues: An Eigenvalue is the scalar by which its Eigenvector is scaled under that transformation (A·v = λ·v).
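A minimal NumPy sketch: applying the matrix to an eigenvector only scales it by the eigenvalue.

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

v = eigenvectors[:, 0]                            # first eigenvector (a column)
print(np.allclose(A @ v, eigenvalues[0] * v))     # True: direction unchanged, only scaled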
Ensemble learning
Ensemble learning is a technique that is used to create multiple Machine Learning models, which are then combined to produce more accurate results. A general Machine Learning model is built by using the entire training data set.
However, in Ensemble Learning the training data set is split into multiple subsets, wherein each subset is used to build a separate model. After the models are trained, they are then combined to predict an outcome in such a way that the variance in the output is reduced.
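A minimal scikit-learn sketch of combining models, using a VotingClassifier on a bundled toy dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))   # majority vote across the three models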
True Positive, False Positive, False Negative, True Negative
- True Positive: the model predicts the positive class and the actual class is positive.
- False Positive: the model predicts the positive class but the actual class is negative (Type I error).
- False Negative: the model predicts the negative class but the actual class is positive (Type II error).
- True Negative: the model predicts the negative class and the actual class is negative.
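For a binary problem, scikit-learn's confusion_matrix can be unpacked directly into these four counts (made-up labels for illustration):

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)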
Better False Positives vs False Negatives?
It depends on the question as well as on the domain in which we are trying to solve the problem.
- If you’re using Machine Learning in the domain of medical testing, then a false negative is very risky, since the report will not show any health problem when a person is actually unwell.
- Similarly, if Machine Learning is used in spam detection, then a false positive is very risky because the algorithm may classify an important email as spam.
Inductive vs Deductive learning
Inductive learning is the process of using specific observations (examples) to draw general conclusions.
Deductive learning is the process of applying general conclusions or rules to specific cases to derive new observations.
KNN vs K-Means
KNN:
- Supervised Learning model/technique
- Classification or regression
- K is the number of nearest neighbors used to make a prediction
K-Means:
- Unsupervised Learning model/technique
- Clustering (or grouping)
- K is the number of clusters to identify/learn from the data
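A minimal scikit-learn sketch of the two side by side, on a bundled toy dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)              # supervised: uses the labels y
print(knn.predict(X[:3]))

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # unsupervised: labels never seen
print(kmeans.labels_[:3])                                        # cluster IDs, not class labels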
Python libraries for Data Analysis
- NumPy
- SciPy
- Pandas
- Scikit-learn
- Matplotlib
- Seaborn
- Bokeh
Deep Learning vs Machine Learning
- Machine Learning is all about algorithms that parse data, learn from that data, and then apply what they’ve learned to make informed decisions.
- Deep Learning is a form of machine learning that is inspired by the structure of the human brain and is particularly effective in feature detection.
Types of Machine Learning
- Supervised Learning - uses labeled data
- Unsupervised Learning - uses unlabeled data
- Reinforcement Learning - actions oriented, uses rewards and penalties system