Machine Learning Flashcards
What is overfitting?
Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.
What is underfitting?
Underfitting refers to a model that can neither model the training data nor generalize to new data.
An underfit machine learning model is not a suitable model, and it is easy to spot: it performs poorly even on the training data.
Underfitting is often not discussed because it is easy to detect given a good performance metric. The remedy is to move on and try alternative machine learning algorithms. Nevertheless, it provides a good contrast to the problem of overfitting.
How to detect overfitting?
K-fold cross-validation is one of the most popular techniques for assessing the accuracy of a model.
In k-fold cross-validation, the data is split into k equally sized subsets, also called "folds." One of the k folds acts as the test set, also known as the holdout set or validation set, and the remaining folds are used to train the model. This process repeats until each of the folds has acted as the holdout fold. After each evaluation, a score is retained, and when all iterations have completed, the scores are averaged to assess the performance of the overall model.
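As a rough illustration, here is a minimal scikit-learn sketch of 5-fold cross-validation (the dataset, the logistic-regression model, and k=5 are arbitrary choices for the example):

```python
# Minimal k-fold cross-validation sketch (illustrative dataset and model).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score splits the data into 5 folds, trains on 4, scores on the
# held-out fold, and repeats until every fold has been the holdout once.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```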
How to avoid overfitting?
Early stopping:
Pause training before the model starts learning the noise in the training data. This approach risks halting the training process too soon, leading to underfitting. Finding the “sweet spot” between underfitting and overfitting is the ultimate goal here.
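One way to sketch this idea, assuming a scikit-learn gradient-boosting model (the dataset and hyperparameter values are illustrative; other libraries expose early stopping through callbacks instead):

```python
# Early-stopping sketch: stop adding trees once the internal validation
# score stops improving (all values below are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,        # upper bound on boosting iterations
    validation_fraction=0.1, # hold out 10% of the training data internally
    n_iter_no_change=10,     # stop after 10 iterations without improvement
    random_state=0,
)
model.fit(X, y)
print("iterations actually used:", model.n_estimators_)
```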
Train with more data:
This can increase the accuracy of the model by providing more opportunities to parse out the dominant relationship between the input and output variables. That said, this method is more effective when clean, relevant data is injected into the model. Otherwise, you could just continue to add more complexity to the model, causing it to overfit.
Data augmentation:
While it is better to inject clean, relevant data into your training data, sometimes noisy data is added to make a model more stable. However, this method should be done sparingly.
Feature selection:
When you build a model, you’ll have a number of parameters or features that are used to predict a given outcome, but many times, these features can be redundant to others. Feature selection is the process of identifying the most important ones within the training data and then eliminating the irrelevant or redundant ones. This is commonly mistaken for dimensionality reduction, but it is different. However, both methods help to simplify your model to establish the dominant trend in the data.
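A minimal sketch of univariate feature selection, assuming scikit-learn's SelectKBest (the synthetic dataset and the choice of k=10 are only for illustration):

```python
# Keep only the k features with the strongest univariate relationship to y.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=25, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (500, 25) -> (500, 10)
```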
Regularization:
If overfitting occurs when a model is too complex, it makes sense for us to reduce the number of features. But what if we don’t know which inputs to eliminate during the feature selection process? If we don’t know which features to remove from our model, regularization methods can be particularly helpful. Regularization applies a “penalty” to the input parameters with the larger coefficients, which subsequently limits the amount of variance in the model. While there are a number of regularization methods, such as L1 regularization, Lasso regularization, and dropout, they all seek to identify and reduce the noise within the data.
Ensemble methods:
Ensemble learning methods are made up of a set of classifiers—e.g. decision trees—and their predictions are aggregated to identify the most popular result. The most well-known ensemble methods are bagging and boosting. In bagging, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once. After several data samples are generated, these models are then trained independently, and depending on the type of task—i.e. regression or classification—the average or majority of those predictions yield a more accurate estimate. This is commonly used to reduce variance within a noisy dataset.
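A minimal bagging sketch, assuming scikit-learn's BaggingClassifier over decision trees (the dataset and the number of estimators are illustrative assumptions):

```python
# Bagging: each tree sees a bootstrap sample (drawn with replacement);
# the ensemble prediction is the majority vote across trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # sample the training set with replacement
    random_state=0,
)
print("mean CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```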
What are bias and variance? What is the trade-off between them?
- Bias: Error due to bias is the distance between a model's predictions and the true values. With this type of error, the model pays little attention to the training data, oversimplifies, and does not learn the underlying patterns. The model learns the wrong relations by not taking all the features into account.
- Variance: Variability of the model's prediction for a given data point; it tells us the spread of the predictions. With this type of error, the model pays so much attention to the training data that it memorizes it instead of learning from it. A model with high variance fails to generalize to data it has not seen before.
The bias-variance trade-off is about balancing the two and finding a sweet spot between error due to bias and error due to variance (minimize variance + bias^2).
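For squared-error loss this trade-off can be written out explicitly. A sketch of the standard decomposition, where f is the true function, \hat{f} the learned model, and \sigma^2 the irreducible noise variance:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```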
What is Regularization?
Regularization is the process that shrinks the coefficients towards zero. In simple words, regularization discourages learning a more complex or flexible model, in order to prevent overfitting.
Explain the difference between Lasso and Ridge regularization.
Lasso (L1) - The penalty function is defined by the sum of the absolute values of the coefficients.
Ridge (L2) - The penalty function is defined by the sum of the squares of the coefficients.
What is ElasticNet?
ElasticNet is a hybrid of Lasso and Ridge, where both the absolute-value penalty and the squared penalty are included; their mix is regulated by an additional coefficient, l1_ratio.
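A minimal comparison sketch, assuming scikit-learn's Ridge, Lasso, and ElasticNet estimators (the regression data and the alpha/l1_ratio values are illustrative assumptions):

```python
# Compare the three penalties; Lasso tends to drive some coefficients
# exactly to zero, Ridge only shrinks them, ElasticNet mixes both.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                    # L2: squared penalty
lasso = Lasso(alpha=1.0).fit(X, y)                    # L1: absolute penalty
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # blend of L1 and L2

print("zero coefficients -",
      "ridge:", (ridge.coef_ == 0).sum(),
      "lasso:", (lasso.coef_ == 0).sum(),
      "elasticnet:", (enet.coef_ == 0).sum())
```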
What is k-means algorithm?
The k-means algorithm is an iterative algorithm that tries to partition the dataset into k pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.
1. Select k, the number of clusters.
2. Select k random points from the data as the initial cluster centres.
3. Measure the distance from a data point to each of the k centres.
4. Assign the point to the nearest cluster based on the minimum distance.
5. Repeat steps 3-4 for all points.
6. Calculate the mean of each cluster and use it as the new cluster centre.
7. Repeat steps 3-6 until the clusters no longer change.
8. Calculate the variance of each cluster and add them up to get the total variation.
9. Repeat steps 2-8 with different random starting points as many times as you want, and keep the clustering with the lowest total variation.
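A minimal sketch of this procedure, assuming scikit-learn's KMeans (the synthetic blob data and k=3 are illustrative assumptions):

```python
# KMeans with n_init=10 repeats the whole procedure from 10 random
# initializations and keeps the run with the lowest total within-cluster
# variation (inertia).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("first 10 cluster labels:", kmeans.labels_[:10])
print("total within-cluster variation:", kmeans.inertia_)
```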
How to decide k in k-means algorithm?
Use the elbow method.
The goal is to minimize the distance between points within a cluster and maximize the distance between clusters.
Minimize the within-cluster sum of squares, WCSS (but a WCSS of zero means k = the number of samples, which is useless).
1. Try different values of k and find the total variation for each.
2. Plot the total variation against k and identify the best k from the location of the elbow.
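A minimal elbow-method sketch, assuming scikit-learn and matplotlib (the blob data and the range of candidate k values are illustrative assumptions):

```python
# Fit k-means for several k and plot WCSS (inertia) against k;
# the "elbow" of the curve suggests a reasonable k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```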
What are the pros and cons of k-means?
Pros:
- Simple to understand
- Fast to cluster
- Widely available
- Easy to implement
- Always creates a result (which is also a con)
Cons (with remedies):
- We need to pick k (remedy: elbow method)
- Sensitive to initialization (remedy: k-means++, which runs an initial algorithm to pick the most appropriate seed points)
- Sensitive to outliers (remedy: remove outliers)
- Produces only spherical solutions
What is k-nearest neighbors algorithm?
A supervised classification algorithm (it can also be used for nonlinear regression).
- Get the training data, already classified into groups.
- Given a new data point, find the classes of its k nearest points.
- Classify the new point based on the majority vote among those k points.
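A minimal KNN classification sketch, assuming scikit-learn (the iris dataset and k=5 are illustrative assumptions):

```python
# Classify each test point by majority vote among its 5 nearest training points.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```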
How to optimize k in k - nearest neighbors algorithm?
Select k by minimizing the error on a held-out test/validation set, for example via cross-validation.
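A minimal tuning sketch, assuming scikit-learn's GridSearchCV (the dataset and the candidate range of k are illustrative assumptions):

```python
# Cross-validate every candidate k and keep the one with the best CV score.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 31))},
                      cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print("best CV accuracy:", search.best_score_)
```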
Explain pros and cons of KNN
Pros
Almost no assumptions
Simple and easy to implement (only need k and distance function)
A good value of k makes it robust to noise
KNN learns a non-linear decision boundary
There is no training required
Can be used as classification and regression
Cons
Inefficient (need to calculate distance to all n points to classify)
Does not perform well in high dimensions.
Does not handle categorical features well.
A low k is susceptible to outliers.
Explain the differences between KNN and K-means.
K-means: an unsupervised algorithm that groups similar data points together.
KNN: a supervised algorithm that classifies a new data point based on the classes of its k nearest neighbors.
What is linear regression?
A method of modeling a dependent variable based on an independent variable using a linear equation fit with the least-squares method. It asks: is there a significant linear relationship between the independent variable and the dependent variable?
Y = C + mX (C: intercept, m: slope)
Residual = y_real - y_pred
Also reported: correlation, standard error
TSS (total sum of squares) = Σ(y_real - mean(y_real))^2
RSS (residual sum of squares) = Σ(y_real - y_pred)^2
ESS (explained sum of squares) = Σ(y_pred - mean(y_real))^2
TSS = RSS + ESS
R^2 = 1 - RSS/TSS = ESS/TSS; anything above about 0.3 indicates a reasonable correlation
F-test for the overall significance of the model
Degrees of freedom: DF = n - k - 1 (n observations, k independent variables/coefficients)
t-test and p-value for each coefficient
Confidence interval: if zero is not included in the interval, the relationship is significant, i.e. we have a linear relationship
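A minimal sketch of fitting a linear regression and reading off these quantities, assuming statsmodels (the generated data, the true coefficients, and the noise level are illustrative assumptions):

```python
# Fit y = C + m*x by ordinary least squares and inspect R^2, the F-test,
# per-coefficient t-tests/p-values, and confidence intervals.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)  # C = 2, m = 3 (assumed)

X = sm.add_constant(x)       # adds the intercept column
model = sm.OLS(y, X).fit()

print("R^2:", model.rsquared)                        # 1 - RSS/TSS
print("F-stat:", model.fvalue, "p:", model.f_pvalue)
print(model.summary())       # coefficients, t-tests, p-values, conf. intervals
```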
What are the assumptions for linear regression?
- Linear relationship between the independent and dependent variables
- Residual errors (residuals) are normally distributed and independent of each other
- There is no correlation between the independent variables
- Homoscedasticity – Variance around the regression line is the same for all values of the predictor variable
What is time-series analysis?
• Use of past univariate values to predict the future.
• Univariate - only one y value changing with time (e.g. a stock price over the last 30 days).
• The interval between observations should be exactly the same.
• Components of TSA data
o Trend
o Seasonal
o White noise
o Residual
• Must be stationary
o Variance and covariance of the series are time invariant.
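A minimal sketch of splitting a series into these components, assuming statsmodels' seasonal_decompose (the synthetic monthly series, its trend, and its yearly seasonality are illustrative assumptions):

```python
# Decompose a monthly series into trend + seasonal + residual components.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")      # monthly points
values = (np.linspace(10, 20, 48)                             # upward trend
          + 2 * np.sin(2 * np.pi * np.arange(48) / 12)        # yearly cycle
          + np.random.default_rng(0).normal(0, 0.5, 48))      # noise
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())
```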
How to measure accuracy in classification?
Accuracy = # of correct/Total # of predictions
Accuracy = (TP + TN)/(TP+TN+FP+FN)
How to calculate Precision?
Precision is the fraction of predicted positives that are true positives.
Precision = TP/(TP+FP)
How to calculate Recall?
Recall is the fraction of actual positives that are correctly predicted as positive (true positives).
TP / (TP + FN)
What is F1 score?
The F1 score is the harmonic mean of recall (R) and precision (P).
F1 = 2RP/(R+P)
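A minimal sketch computing these metrics with scikit-learn (the true and predicted labels below are made-up assumptions):

```python
# Accuracy, precision, recall, and F1 from a pair of label vectors.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("accuracy:", accuracy_score(y_true, y_pred))    # (TP + TN) / all
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("f1:", f1_score(y_true, y_pred))                # 2RP / (R + P)
```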
What is stationary time series?
Time series are stationary if they do not have trend or seasonal effects. Summary statistics calculated on the time series, like the mean or the variance of the observations, are consistent over time.
When a time series is stationary, it can be easier to model. Statistical modeling methods assume or require the time series to be stationary to be effective.
Classical time series analysis and forecasting methods are concerned with making non-stationary time series data stationary by identifying and removing trends and removing seasonal effects.
If you have clear trend and seasonality in your time series, then model these components, remove them from observations, then train models on the residuals.
How to check time series is stationary?
3 methods:
- Look at Plots: You can review a time series plot of your data and visually check if there are any obvious trends or seasonality.
- Summary Statistics: You can review the summary statistics for your data for seasons or random partitions and check for obvious or significant differences.
- Statistical Tests: You can use statistical tests to check if the expectations of stationarity are met or have been violated.
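A minimal sketch of the statistical-test approach, assuming the augmented Dickey-Fuller test from statsmodels (the random-walk series below is an illustrative assumption):

```python
# ADF test: the null hypothesis is that the series has a unit root
# (i.e. is non-stationary); a small p-value lets us reject it.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))  # a random walk is non-stationary

stat, p_value, *_ = adfuller(series)
print("ADF statistic:", stat)
print("p-value:", p_value)  # large p-value -> cannot reject non-stationarity
```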