Machine Learning Flashcards

1
Q

What is overfitting?

A

Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.
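
A minimal scikit-learn sketch of this idea, assuming a synthetic dataset (the data and the max_depth value are illustrative, not part of the original card):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree tends to memorize the training data (overfitting)
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting depth acts like pruning and usually generalizes better
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print(deep.score(X_train, y_train), deep.score(X_test, y_test))
print(shallow.score(X_train, y_train), shallow.score(X_test, y_test))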

2
Q

What is underfitting?

A

Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.

Underfitting is often not discussed, as it is easy to detect given a good performance metric. The remedy is to move on and try alternative machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.

3
Q

How to detect overfitting?

A

K-fold cross-validation is one of the most popular techniques for assessing the accuracy of a model.

In k-fold cross-validation, the data is split into k equally sized subsets, also called “folds.” One of the k folds acts as the test set, also known as the holdout or validation set, and the remaining folds are used to train the model. This process repeats until each fold has acted as the holdout fold. After each evaluation, a score is retained, and when all iterations have completed, the scores are averaged to assess the performance of the overall model. A large gap between the training scores and the holdout scores is a telltale sign of overfitting.
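
A hedged sketch of how this looks in scikit-learn: comparing the averaged training and validation scores from cross-validation is one way to spot overfitting (the dataset and model are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                        cv=5, return_train_score=True)
# A large gap between the two averages suggests overfitting
print(scores['train_score'].mean(), scores['test_score'].mean())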

4
Q

How to avoid overfitting?

A

Early stopping:
Pause training before the model starts learning the noise within the training data. This approach risks halting the training process too soon, leading to underfitting. Finding the “sweet spot” between underfitting and overfitting is the ultimate goal here (a sketch appears after this list).

Train with more data:
This can increase the accuracy of the model by providing more opportunities to parse out the dominant relationship between the input and output variables. That said, this is only effective when clean, relevant data is injected into the model; otherwise, you could just continue to add complexity to the model, causing it to overfit.

Data augmentation:
While it is better to inject clean, relevant data into your training data, sometimes noisy data is added to make a model more stable. However, this method should be done sparingly.

Feature selection:
When you build a model, you’ll have a number of parameters or features that are used to predict a given outcome, but many times, these features can be redundant to others. Feature selection is the process of identifying the most important ones within the training data and then eliminating the irrelevant or redundant ones. This is commonly mistaken for dimensionality reduction, but it is different. However, both methods help to simplify your model to establish the dominant trend in the data.

Regularization:
If overfitting occurs when a model is too complex, it makes sense for us to reduce the number of features. But what if we don’t know which inputs to eliminate during the feature selection process? If we don’t know which features to remove from our model, regularization methods can be particularly helpful. Regularization applies a “penalty” to the input parameters with the larger coefficients, which subsequently limits the amount of variance in the model. While there are a number of regularization methods, such as L1 (lasso) regularization, L2 (ridge) regularization, and dropout, they all seek to identify and reduce the noise within the data.

Ensemble methods:
Ensemble learning methods are made up of a set of classifiers—e.g. decision trees—and their predictions are aggregated to identify the most popular result. The most well-known ensemble methods are bagging and boosting. In bagging, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once. After several data samples are generated, these models are then trained independently, and depending on the type of task—i.e. regression or classification—the average or majority of those predictions yield a more accurate estimate. This is commonly used to reduce variance within a noisy dataset.
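
As a hedged illustration of the early-stopping idea above, here is a minimal sketch with scikit-learn's SGDClassifier (the dataset and parameter values are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
# Hold out 10% of the training data and stop once the validation score
# has not improved for 5 consecutive epochs
clf = SGDClassifier(early_stopping=True, validation_fraction=0.1,
                    n_iter_no_change=5, random_state=0)
clf.fit(X, y)
print(clf.n_iter_)   # number of epochs actually run before stopping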

5
Q

What are bias and variance? What is the trade-off between them?

A
  • Bias: An error due to bias is the distance between the predictions of a model and the true values. With this type of error, the model pays too little attention to the training data, oversimplifies, and doesn’t learn the underlying patterns; it learns the wrong relations by not taking all the features into account.
  • Variance: Variability of model predictions for a given data point, or a value that tells us the spread of our predictions. With this type of error, the model pays too much attention to the training data, to the point of memorizing it instead of learning from it. A model with high variance fails to generalize to data it hasn’t seen before.

The bias-variance trade-off is about balancing the two and finding a sweet spot between error due to bias and error due to variance (minimize variance + bias^2).

6
Q

What is Regularization?

A

Regularization is the process that regularizes or shrinks the coefficients towards zero. In simple words, regularization discourages learning a more complex or flexible model, to prevent overfitting.

7
Q

Explain the difference between Lasso and Ridge regularization.

A

Lasso (L1) - The penalty function is defined by the sum of the absolute values of the coefficients.

Ridge (L2) - The penalty function is defined by the sum of the squares of the coefficients.
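
A minimal scikit-learn sketch (the synthetic data and alpha values are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: can drive some coefficients exactly to zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients but keeps them nonzero
print(lasso.coef_)
print(ridge.coef_)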

8
Q

What is ElasticNet?

A

ElasticNet is a hybrid of Lasso and Ridge in which both the absolute-value (L1) penalty and the squared (L2) penalty are included; their mix is regulated by an additional coefficient, l1_ratio.
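
A minimal sketch with scikit-learn's ElasticNet (the data and parameter values are illustrative assumptions):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
# l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge; 0.5 mixes them equally
model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(model.coef_)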

9
Q

What is k-means algorithm?

A

The k-means algorithm is an iterative algorithm that tries to partition the dataset into k pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group (a sketch follows the steps below).

  1. Select k – the number of clusters.
  2. Select k random points from the data as initial centroids.
  3. Measure the distance from the first data point to the k chosen points.
  4. Assign the first point to the nearest cluster based on the minimum distance.
  5. Repeat steps 3–4 for all points.
  6. Calculate the mean of each cluster and use it as the new centroid.
  7. Repeat steps 3–6 until the clusters no longer change.
  8. Calculate the variance of each cluster and add them up to find the total variation.
  9. Repeat steps 1–8 with other random starting points, as many times as you want, and keep the clustering with the lowest total variation.
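
A minimal scikit-learn sketch of these steps (the blob data and k=3 are illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# n_init controls how many random initializations are tried (step 9 above)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])       # cluster assignment of each point
print(km.cluster_centers_)   # final centroids (cluster means)
print(km.inertia_)           # within-cluster sum of squares (total variation)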
10
Q

How to decide k in k-means algorithm?

A

Use the elbow method.
The goal is to minimize the distance between points within a cluster and maximize the distance between clusters.
Minimize the within-cluster sum of squares (WCSS); note that a WCSS of zero means k equals the number of samples, which is useless.
1. Try different k values and compute the total variation (WCSS) for each.
2. Plot WCSS against k and identify the best k at the elbow of the curve (see the sketch below).
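
A minimal elbow-method sketch (synthetic data; the range of k values is an illustrative assumption):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares for this k
plt.plot(list(ks), wcss, marker='o')
plt.xlabel('k')
plt.ylabel('WCSS')
plt.show()   # look for the "elbow" where the curve flattens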

11
Q

What are the pros and cons of k-means?

A
Pros:
Simple to understand
Fast to cluster
Widely available
Easy to implement
Always produces a result (also a con)

Cons (remedy):
We need to pick k (elbow method)
Sensitive to initialization (k-means++: run an initial algorithm to pick the most appropriate seed points)
Sensitive to outliers (remove outliers)
Produces spherical solutions
13
Q

What is k-nearest neighbors algorithm?

A

Supervised classification algorithm (it can also be used for non-linear regression).

  1. Get the training data, already classified into groups.
  2. For a new data point, find the classification of the k nearest points.
  3. Classify the new point based on the majority vote among those k points.
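
A minimal scikit-learn sketch (the data and k=5 are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)   # the 5 nearest neighbours vote
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))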
14
Q

How to optimize k in the k-nearest neighbors algorithm?

A

Select k by minimizing the error on a held-out test or validation set (for example, via cross-validation).
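
One hedged way to do this with scikit-learn is a cross-validated grid search over n_neighbors (the data and search range are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': list(range(1, 31))},
                      cv=5)
search.fit(X, y)
print(search.best_params_)   # k with the best cross-validated score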

15
Q

Explain pros and cons of KNN

A

Pros
Almost no assumptions
Simple and easy to implement (only need k and a distance function)
A good value of k makes it robust to noise
KNN learns a non-linear decision boundary
There is no training required
Can be used for both classification and regression

Cons
Inefficient (need to calculate the distance to all n points to classify a new one)
Does not perform well in high dimensions
Does not handle categorical features well
A lower k is susceptible to outliers

16
Q

Explain the differences between KNN and K-means.

A

K-means: an unsupervised algorithm that groups similar data points together.

KNN: a supervised algorithm that classifies a new data point based on the classes of its k nearest neighbors.

17
Q

What is linear regression?

A

Method of modeling a dependent variable as a linear function of one or more independent variables, fitted using the least-squares method. The central question: is there a significant linear relationship between the independent variable(s) and the dependent variable?

    Y = c + mx, where c is the intercept and m is the slope
Residual = y_real - y_predicted
Correlation
Standard error
TSS – total sum of squares: Σ(y_real - mean(y_real))^2
RSS – residual sum of squares: Σ(y_real - y_predicted)^2
ESS – explained sum of squares: Σ(y_predicted - mean(y_real))^2
TSS = RSS + ESS
R^2 = 1 - RSS/TSS = ESS/TSS (anything above about 0.3 suggests a meaningful relationship)
F-test for the overall regression
Degrees of freedom = number of observations - number of coefficients (independent variables) - 1, i.e. DF = n - k - 1
t-test and p-value for each coefficient
Confidence interval: if zero is not included in the coefficient's interval, that is good; we have evidence of a linear relationship.
18
Q

What are the assumptions for linear regression?

A
  1. There is a linear relationship between the independent and dependent variables.
  2. Residual errors (residuals) are normally distributed and independent of each other.
  3. There is no correlation between multiple independent variables (no multicollinearity).
  4. Homoscedasticity – the variance around the regression line is the same for all values of the predictor variable.
19
Q

What is time-series analysis?

A

• Use of past univariate values to predict the future.
• Univariate - only one y value changing with time (e.g. stock price for the last 30 days).
• The interval between observations should be exactly the same.
• Components of time-series data:
o Trend
o Seasonal
o White noise
o Residual
• Must be stationary:
o Variance and covariance of the series are time invariant.

20
Q

How to measure accuracy in classification?

A

Accuracy = # of correct/Total # of predictions

Accuracy = (TP + TN)/(TP+TN+FP+FN)

21
Q

How to calculate Precision?

A

Precision is the fraction of all predicted positive results that are true positives.

Precision = TP/(TP+FP)

22
Q

How to calculate Recall?

A

Recall is the fraction of actual positives that are correctly identified as positive.

TP / (TP + FN)

23
Q

What is F1 score?

A

The F1 score is the harmonic mean of recall and precision.

2RP/(R+P)
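
A minimal sketch with scikit-learn's metrics (the toy labels are illustrative assumptions):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(accuracy_score(y_true, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # 2RP / (R + P)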

24
Q

What is stationary time series?

A

Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series, such as the mean or the variance of the observations, are consistent over time.

When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.

Classical time series analysis and forecasting methods are concerned with making non-stationary time series data stationary by identifying and removing trends and removing seasonal effects.

If you have a clear trend and seasonality in your time series, then model these components, remove them from the observations, and train models on the residuals.

25
Q

How to check whether a time series is stationary?

A

3 methods:

  1. Look at Plots: You can review a time series plot of your data and visually check if there are any obvious trends or seasonality.
  2. Summary Statistics: You can review the summary statistics for your data for seasons or random partitions and check for obvious or significant differences.
  3. Statistical Tests: You can use statistical tests to check if the expectations of stationarity are met or have been violated.
26
Q

Augmented Dickey-Fuller test

A

The Augmented Dickey-Fuller test is a type of statistical test called a unit root test. It uses an autoregressive model and optimizes an information criterion across multiple different lag values.

    • Null Hypothesis (H0): If failed to be rejected, it suggests the time series has a unit root, meaning it is non-stationary. It has some time dependent structure.
    • Alternate Hypothesis (H1): The null hypothesis is rejected; it suggests the time series does not have a unit root, meaning it is stationary. It does not have time-dependent structure.

p-value > 0.05: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
p-value <= 0.05: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.

For the airline passengers dataset used below, the test statistic is positive, meaning we are much less likely to reject the null hypothesis (the series looks non-stationary).

from pandas import read_csv
from statsmodels.tsa.stattools import adfuller

# Load the monthly airline passengers series (first column as the index)
series = read_csv('international-airline-passengers.csv', header=0, index_col=0).squeeze('columns')
X = series.values

# Run the Augmented Dickey-Fuller unit root test
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))
ADF Statistic: 0.815369
p-value: 0.991880
Critical Values:
	5%: -2.884
	1%: -3.482
	10%: -2.579
27
Q

What is Autoregression (AR)?

A

The autoregression (AR) method models the next step in the sequence as a linear function of the observations at prior time steps.

The notation for the model involves specifying the order of the model p as a parameter to the AR function, e.g. AR(p). For example, AR(1) is a first-order autoregression model.

The method is suitable for univariate time series without trend and seasonal components.
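
A minimal statsmodels sketch (random data, purely to show the API; an illustrative assumption, not a meaningful series):

from random import random
from statsmodels.tsa.ar_model import AutoReg

data = [random() for _ in range(100)]
model = AutoReg(data, lags=1)   # AR(1): one lagged observation
model_fit = model.fit()
yhat = model_fit.predict(len(data), len(data))   # one-step out-of-sample forecast
print(yhat)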

28
Q

What is Moving Average (MA)?

A

The moving average (MA) method models the next step in the sequence as a linear function of the residual errors from a mean process at prior time steps.

A moving average model is different from calculating the moving average of the time series.

The notation for the model involves specifying the order of the model q as a parameter to the MA function, e.g. MA(q). For example, MA(1) is a first-order moving average model.

The method is suitable for univariate time series without trend and seasonal components.

29
Q

What is Autoregressive Moving Average (ARMA)?

A

The Autoregressive Moving Average (ARMA) method models the next step in the sequence as a linear function of the observations and residual errors at prior time steps.

It combines both Autoregression (AR) and Moving Average (MA) models.

The notation for the model involves specifying the order for the AR(p) and MA(q) models as parameters to an ARMA function, e.g. ARMA(p, q). An ARIMA model can be used to develop AR or MA models.

The method is suitable for univariate time series without trend and seasonal components.

30
Q

What is Autoregressive Integrated Moving Average (ARIMA)?

A

The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and residual errors at prior time steps.

It combines both Autoregression (AR) and Moving Average (MA) models as well as a differencing pre-processing step of the sequence to make the sequence stationary, called integration (I).

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function, e.g. ARIMA(p, d, q). An ARIMA model can also be used to develop AR, MA, and ARMA models.

The method is suitable for univariate time series with a trend and without seasonal components.
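
A minimal statsmodels sketch (a contrived trending series, purely to show the API; an illustrative assumption):

from random import random
from statsmodels.tsa.arima.model import ARIMA

data = [x + random() for x in range(1, 100)]
model = ARIMA(data, order=(1, 1, 1))   # AR(1), first differencing, MA(1)
model_fit = model.fit()
yhat = model_fit.predict(len(data), len(data))   # one-step out-of-sample forecast
print(yhat)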

31
Q

What is Seasonal Autoregressive Integrated Moving-Average (SARIMA)?

A

The Seasonal Autoregressive Integrated Moving Average (SARIMA) method models the next step in the sequence as a linear function of the differenced observations, errors, differenced seasonal observations, and seasonal errors at prior time steps.

It combines the ARIMA model with the ability to perform the same autoregression, differencing, and moving average modeling at the seasonal level.

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function and AR(P), I(D), MA(Q) and m parameters at the seasonal level, e.g. SARIMA(p, d, q)(P, D, Q)m where “m” is the number of time steps in each season (the seasonal period). A SARIMA model can be used to develop AR, MA, ARMA and ARIMA models.

The method is suitable for univariate time series with trend and/or seasonal components.
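
A minimal statsmodels sketch using SARIMAX (contrived data; the orders are illustrative assumptions):

from random import random
from statsmodels.tsa.statespace.sarimax import SARIMAX

data = [x + random() for x in range(1, 100)]
# (p, d, q) non-seasonal order and (P, D, Q, m) seasonal order, here with m = 12
model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
model_fit = model.fit(disp=False)
yhat = model_fit.predict(len(data), len(data))
print(yhat)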

32
Q

What are the types of ML learning models?

A

Supervised learning

Unsupervised learning

Reinforcement learning

33
Q

Is high variance in data good or bad?

A

Higher variance directly means that the data spread is big and the feature has a wide variety of values. Usually, unusually high variance in a feature is seen as a sign of lower quality, since it can reflect noise or outliers rather than useful signal.

34
Q

What is difference between regularization and normalization?

A

Normalization is a feature scaling method: it adjusts the data, whereas regularization adjusts the prediction function. With normalization, the data is typically rescaled to the 0-1 range.

Normalization scales the different data columns so they have comparable statistics such as range, max, and min.

Regularization imposes some control on model complexity by rewarding simpler fitting functions over complex ones.

35
Q

What is the difference between normalization and standardization?

A

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

x’ = (x- x_min) /(x_max-x_min)

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

x’ = (x - mu)/sigma

36
Q

List the most popular distributions along with a scenario in which you would use each.

A
  1. Uniform - constant probability - rolling a fair die
  2. Binomial - probability of one of two outcomes across repeated trials - tossing a coin
  3. Normal - values of the variable are distributed normally - height
  4. Poisson - number of events occurring in a fixed interval - calls arriving per hour
  5. Exponential - amount of time until a specific event - decay of battery life
37
Q

What is logistic regression?

A

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is binary.

Binary Logistic Regression Major Assumptions

    • The dependent variable should be dichotomous in nature (e.g., presence vs. absence).
    • There should be no outliers in the data, which can be assessed by converting the continuous predictors to standardized scores and removing values below -3.29 or greater than 3.29.
    • There should be no high correlations (multicollinearity) among the predictors.

log(p/(1-p))=b0+b1x1+b2x2+..
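
A minimal scikit-learn sketch (synthetic data as an illustrative assumption):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)   # b0 and b1..bk in the log-odds equation above
print(clf.predict_proba(X[:3]))    # predicted probabilities p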

38
Q

What is cross validation?

A

Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it using the complementary subset of the dataset.

The three steps involved in cross-validation are as follows:
1. Reserve some portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.

39
Q

What are different cross validation techniques?

A

K-fold - Divide data into k groups (usually 10), leave one set out for validation

Stratified K-fold - Divide data into k groups (usually 10) making equal representation of target types, leave one set out for validation

Leave one out - Same as K-fold, just keep only one data point (row) out for validation

Bootstrapping - Resample the data with replacement (some rows appear more than once, others are left out of a given sample).

Random Search CV - Randomized search on hyper parameters.

Grid Search CV - Define a search space as a grid of hyperparameter values and evaluate every position in the grid.
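
A hedged sketch of the corresponding scikit-learn splitters (the toy data is an illustrative assumption):

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

print(KFold(n_splits=5).get_n_splits(X))   # 5 folds, each used once for validation
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    pass                                   # folds preserve the class proportions of y
print(LeaveOneOut().get_n_splits(X))       # one split per row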

40
Q

What is K-fold cross validation? Explain pros and cons.

A

In k-fold cross-validation, the original dataset is equally partitioned into k subparts or folds. Out of the k-folds or groups, for each iteration, one group is selected as validation data, and the remaining (k-1) groups are selected as training data.

Pros:
The model has low bias
Low time complexity
The entire dataset is utilized for both training and validation.

Cons:
Not suitable for an imbalanced dataset.

41
Q

What is stratified k-fold cross-validation? Explain pros and cons.

A

The dataset is partitioned into k groups or folds such that the validation data has an equal representation of each target class label.

Pros:
Works well for an imbalanced dataset.

Cons:
Not suitable for time series datasets.

42
Q

What is Leave-one-out cross-validation? Explain pros and cons

A

For a dataset having n rows, 1st row is selected for validation, and the rest (n-1) rows are used to train the model. For the next iteration, the 2nd row is selected for validation and rest to train the model. Similarly, the process is repeated until n steps or the desired number of operations.

Pros:
Simple, easy to understand, and implement.

Cons:
The model may lead to a low bias.
The computation time required is high.

43
Q

What is bootstrap cross validation?

A

Bootstrap sampling is a resampling technique that involves random sampling with replacement. The word resample literally means ‘sample again’, implying that a bootstrap sample is generated by sampling with replacement from the ‘original’ sample.
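
A minimal sketch with scikit-learn's resample utility (the toy data is an illustrative assumption):

from sklearn.utils import resample

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
boot = resample(data, replace=True, n_samples=len(data), random_state=0)
oob = [x for x in data if x not in boot]   # rows never drawn in this bootstrap sample
print(boot)
print(oob)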

44
Q

How to detect outliers?

A
  1. Boxplot - points plotted beyond the whiskers are flagged as outliers.
  2. Z-score - points with a Z-score < -3 or > 3 are considered outliers.
  3. Interquartile range (IQR) - data points that lie more than 1.5 times the IQR above Q3 or below Q1 are outliers (see the sketch below).
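
A minimal NumPy sketch of the Z-score and IQR rules (the data, with one injected outlier, is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 100), 120)   # 120 is an injected outlier

z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])                      # Z-score rule: |z| > 3

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])   # IQR rule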
45
Q

How to handle outliers?

A
  1. Drop the outlier records.
  2. Cap the outliers by reducing the range.
  3. Assign a new value - impute using the mean, mode, or a linear model.
  4. Transform the variable (e.g. take the log, convert to a percentage, etc.).
46
Q

What is a decision tree? What are its advantages and disadvantages?

A

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

Advantages:
•	Simple to understand and to interpret. Trees can be visualized.
• Requires little data preparation: no data normalization, dummy variables, or removal of blank values needed. However, it can't handle missing values.
• The cost of using the tree (i.e. making predictions) is logarithmic in the number of data points used to train it.
•	Able to handle both numerical and categorical data.
•	Able to handle multi-class problems.
• Uses a white-box model; results are explainable.
• Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model.
•	Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

Disadvantages
• Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
• Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
• Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.
• The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
• There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
• Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

47
Q

In decision trees, how is the optimum split determined?

A

Gini impurity index, entropy, or information gain is calculated to find the optimum split.

48
Q

What is Gini and Entropy?

A
Entropy
Entropy is a measurement of the impurity or randomness in the data points.
Entropy = –Σ_j p_j · log2(p_j)
Range 0 - 1
Computationally a bit expensive due to the log

Information gain
Information gain measures the reduction in uncertainty after splitting the dataset on a particular feature.

Information Gain = Entropy before splitting - Entropy after splitting
Tends to favor smaller partitions

Gini
Gini measures the probability of misclassifying a randomly chosen data point if it were labeled according to the class distribution in the node.
Gini Index = 1 – Σ_j p_j^2
Range 0 - 1
Simple to calculate
Favors larger partitions
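
A minimal NumPy sketch of both measures (the toy labels are an illustrative assumption):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

labels = [0, 0, 1, 1, 1, 1]
print(entropy(labels))   # about 0.918 for a 2:4 split
print(gini(labels))      # about 0.444 for a 2:4 split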
49
Q

What is pruning?

A

Removing a subtree that is redundant and not a useful split and replacing it with a leaf node. This helps to reduce overfitting.

Two types:
Pre-pruning: also known as Early Stopping Rule, is the method where the subtree construction is halted at a particular node after evaluation of some measure such as Gini impurity or information gain.

Post-pruning:
post-pruning means to prune after the tree is built. You grow the tree entirely using your decision tree algorithm and then you prune the subtrees in the tree in a bottom-up fashion using Gini Impurity or Information Gain.

50
Q

What are pruning algorithms?

A

Pruning by information gain: Pre or post pruning method. Check whether information gain at a particular node is greater than minimum gain.

Reduced Error Pruning (REP): A node is pruned if the resulting pruned tree performs no worse than the original tree on the validation set.

Cost-complexity pruning
The tree score is based on the residual sum of squares (RSS) for the subtree plus a tree complexity penalty that is a function of the number of leaves in the subtree. The tree complexity penalty compensates for the difference in the number of leaves. Numerically, the tree score is defined as follows:

Tree Score = RSS + αT, where α is the complexity parameter and T is the number of leaves in the subtree.
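
A hedged sketch of cost-complexity pruning in scikit-learn (synthetic data; the choice of alpha is an illustrative assumption):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)   # candidate alpha values; larger alpha -> heavier pruning
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0).fit(X, y)
print(pruned.get_n_leaves())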

51
Q

What is Random Forest?

A

Random forest is a Supervised Machine Learning Algorithm that is used widely in Classification and Regression problems. It builds decision trees on different samples and takes their majority vote for classification and average in case of regression.

Step 1: In Random forest n number of random records are taken from the data set having k number of records.

Step 2: Individual decision trees are constructed for each sample.

Step 3: Each decision tree will generate an output.

Step 4: Final output is considered based on Majority Voting or Averaging for Classification and regression respectively.
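
A minimal scikit-learn sketch of these steps (the synthetic data and n_estimators value are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
# 100 trees, each trained on a bootstrap sample of the rows (steps 1-2 above)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:5]))   # majority vote across the trees (step 4)
print(rf.score(X, y))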

52
Q

What is Bagging and Boosting?

A
  1. Bagging – It creates different training subsets from the sample training data with replacement, and the final output is based on majority voting. For example, Random Forest.
  2. Boosting – It combines weak learners into a strong learner by creating sequential models such that the final model has the highest accuracy. For example, AdaBoost, XGBoost.
53
Q

What are some important features of Random forest?

A
  1. Diversity- Not all attributes/variables/features are considered while making an individual tree, each tree is different.
  2. Immune to the curse of dimensionality- Since each tree does not consider all the features, the feature space is reduced.
  3. Parallelization-Each tree is created independently out of different data and attributes. This means that we can make full use of the CPU to build random forests.
  4. Train-test split - In a random forest we don’t have to segregate the data into train and test sets, because there will always be about a third of the data (the out-of-bag samples) that is not seen by a given decision tree.
  5. Stability- Stability arises because the result is based on majority voting/ averaging.
54
Q

What are the differences between Decision Trees and Random Forests

A
Decision trees:
  1. Normally suffer from overfitting if allowed to grow without any control.
  2. A single decision tree is faster in computation.
  3. When a dataset with features is taken as input, a decision tree formulates a set of rules to make predictions.

Random forests:
  1. Created from random subsets of the data, with the final output based on averaging or majority ranking, so the problem of overfitting is taken care of.
  2. Comparatively slower.
  3. Randomly selects observations and features, builds multiple decision trees, and averages or votes over the results; it doesn’t rely on a single set of rules.
55
Q

What is a recommender system?

A

Practically, recommender systems encompass a class of techniques and algorithms which are able to suggest “relevant” items to users. Ideally, the suggested items are as relevant to the user as possible, so that the user can engage with those items: YouTube videos, news articles, online products, and so on.

Items are ranked according to their relevancy, and the most relevant ones are shown to the user. The relevancy is something that the recommender system must determine and is mainly based on historical data.

Recommender systems are generally divided into two main categories:

Content Based
Collaborative Filtering, which is further split into:
- Model Based
- Memory Based

56
Q

What is Collaborative Filtering Recommender System?

A

Methods that are solely based on the past interactions between users and the target items. Thus, the input to a collaborative filtering system will be all historical data of user interactions with target items. This data is typically stored in a matrix where the rows are the users, and the columns are the items.

The core idea behind such systems is that the historical data of the users should be enough to make a prediction; i.e., we don’t need anything more than that historical data, no extra push from the user, no presently trending information, etc.

Based on the users’ historical data, the likes and dislikes of each item, the system tries to predict how the user would rate a new item which they haven’t rated yet. The predictions themselves are based on the past ratings of other users, whose ratings, and therefore supposed preferences, are similar to the active user’s.

57
Q

What is memory-based recommender?

A

Memory-based methods are the most simplistic as they use no model whatsoever. They assume that predictions can be made on pure “memory” of past data and usually just employ a simple distance-measurement approach, like nearest neighbor.
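
A deliberately simplistic NumPy sketch of the memory-based idea (the toy rating matrix is an illustrative assumption, and unrated items are naively treated as 0):

import numpy as np

# Toy user-item rating matrix: rows are users, columns are items, 0 means unrated
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0                                              # predict for user 0
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = 0                                        # ignore self-similarity
item = 2                                                # an item user 0 has not rated
pred = np.dot(sims, R[:, item]) / sims.sum()            # similarity-weighted average
print(pred)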

58
Q

What is model-based recommender?

A

Model-based approaches, on the other hand, always assume some kind of underlying model and basically try to make sure that whatever predictions come out will fit the model well.

As an example, let’s say we have a matrix of users-to-preferred lunch item where all of the users are Americans who love cheeseburgers (they are phenomenal). A memory-based method will only look at what the user has eaten over the past month, without considering that mini-fact of them being cheeseburger loving Americans. A model-based method, on the other hand, will ensure that the predictions always lean a bit more towards being a cheeseburger, since the underlying model assumption is that most people in the dataset should love cheeseburgers!

59
Q

What is Content-based recommender?

A

In contrast to collaborative filtering, content-based approaches will use additional information about the user and / or items to make predictions.

A content-based system might consider the age, sex, occupation, and other personal user factors when making predictions. It’s much easier to predict that a person wouldn’t like a video about skateboarding if we know the user’s age is 87!

Thus, content-based methods are more similar to classical machine learning, in the sense that we will build features based on user and item data and use that to help us make predictions. Our system input is then the features of the user and the features of the item. Our system output is the prediction of whether or not the user would like or dislike the item.

60
Q

What is SVM?

A

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
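
A minimal scikit-learn sketch (synthetic data; the kernel and C value are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
clf = SVC(kernel='rbf', C=1.0).fit(X, y)   # C trades margin width against misclassification
print(clf.n_support_)                      # number of support vectors per class
print(clf.score(X, y))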

61
Q

What are the advantages and disadvantages of SVM?

A

Advantages:
SVM works relatively well when there is a clear margin of separation between classes.
SVM is more effective in high dimensional spaces.
SVM is effective in cases where the number of dimensions is greater than the number of samples.
SVM is relatively memory efficient

Disadvantages:
SVM algorithm is not suitable for large data sets.
SVM does not perform very well when the data set has more noise i.e. target classes are overlapping.
In cases where the number of features for each data point exceeds the number of training data samples, the SVM will underperform.
As the support vector classifier works by putting data points above and below the classifying hyperplane, there is no direct probabilistic explanation for the classification.
Does not support categorical data directly (you need to one-hot encode it).

62
Q

What are the different encodings for categorical data?

A

Ordinal Encoding - each unique category value is assigned an integer value.
One-hot encoding - Each category gets its own binary variable. For categorical variables where no ordinal relationship exists, integer encoding may not be enough at best, or misleading to the model at worst.
Dummy Variable Encoding -Represent C categories with C-1 binary variables.
Hashing trick.

Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.
Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.
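
A minimal sketch of the first three encodings with pandas and scikit-learn (the toy frame is an illustrative assumption):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'large', 'medium'],
                   'color': ['red', 'blue', 'red']})
# Ordinal encoding: each unique category becomes an integer
print(OrdinalEncoder().fit_transform(df[['size']]))
# One-hot encoding: one binary column per category
print(OneHotEncoder().fit_transform(df[['color']]).toarray())
# Dummy encoding: C-1 binary columns via drop_first
print(pd.get_dummies(df['color'], drop_first=True))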
63
Q

What is data wrangling?

A

Data wrangling—also called data cleaning, data remediation, or data munging—refers to a variety of processes designed to transform raw data into more readily used formats. The exact methods differ from project to project depending on the data you’re leveraging and the goal you’re trying to achieve.

Some examples of data wrangling include:

  • Merging multiple data sources into a single dataset for analysis
  • Identifying gaps in data (for example, empty cells in a spreadsheet) and either filling or deleting them
  • Deleting data that’s either unnecessary or irrelevant to the project you’re working on
  • Identifying extreme outliers in data and either explaining the discrepancies or removing them so that analysis can take place
64
Q

What are data wrangling steps?

A
  1. Discovery: During discovery, you may identify trends or patterns in the data, along with obvious issues, such as missing or incomplete values that need to be addressed. This is an important step, as it will inform every activity that comes afterward.
  2. Structuring: Raw data is typically unusable in its raw state because it’s either incomplete or misformatted for its intended application. Data structuring is the process of taking raw data and transforming it to be more readily leveraged. The form your data takes will depend on the analytical model you use to interpret it.
  3. Cleaning: Data cleaning is the process of removing inherent errors in data that might distort your analysis or render it less valuable. Cleaning can come in different forms, including deleting empty cells or rows, removing outliers, and standardizing inputs. The goal of data cleaning is to ensure there are no errors (or as few as possible) that could influence your final analysis
  4. Enriching: Once you understand your existing data and have transformed it into a more usable state, you must determine whether you have all of the data necessary for the project at hand. If not, you may choose to enrich or augment your data by incorporating values from other datasets. For this reason, it’s important to understand what other data is available for use.
  5. Validating: Data validation refers to the process of verifying that your data is both consistent and of a high enough quality. During validation, you may discover issues you need to resolve or conclude that your data is ready to be analyzed. Validation is typically achieved through various automated processes and requires programming.
  6. Publishing: Once your data has been validated, you can publish it. This involves making it available to others within your organization for analysis. The format you use to share the information—such as a written report or electronic file—will depend on your data and the organization’s goals.