Section 3.2: Grubbs' Outlier Test, P-Values from Linear Regression, Decision Trees, & Multiple Regression Flashcards

1
Q

Grubbs’ test is used to find a single outlier in a normally distributed data set. The test determines whether the minimum value or the maximum value in the set is an outlier.

Cautions:

A

The test is only used to find a single outlier in normally distributed data (excluding the potential outlier). If you think that your data set has more than one outlier, use the generalized extreme studentized deviate test or Tietjen-Moore test instead.

Using this test on non-normal distributions will give false results.

Run a test for normality (like the Shapiro-Wilk test) before running Grubbs’ test. If you find your data set isn’t normally distributed, try removing the potential outlier from the data set and running the normality test again. If your data still isn’t normal, don’t run this test.

2
Q

Running Grubbs’ Test

A

The test is a deceptively simple one to run. It checks for outliers by looking for the maximum of the absolute differences between the values and the mean. Basically, the steps are:

Find the G test statistic.

Find the G Critical Value.

Compare the test statistic to the G critical value.

Reject the point as an outlier if the test statistic is greater than the critical value.

The formulas differ slightly depending on whether you want to check for an outlier at one end of the data (a one-tailed test) or at both ends at the same time (a two-tailed test). For simplicity, I’d recommend running a one-tailed test first, as (a) it is an easier equation to work by hand and (b) it simplifies the decision to reject (or keep) a single minimum or maximum point.

3
Q

Find the G Test Statistic

A

Step 1: Order the data points from smallest to largest.

Step 2: Find the mean (x̄) and standard deviation of the data set.

Step 3: Calculate the G test statistic using one of the following equations:

The Grubbs’ test statistic for a two-tailed test is:

G = max |Yi − ȳ| / s, taken over i = 1, …, N

Where:
ȳ is the sample mean,
s is the sample standard deviation.

A left-tailed test uses the test statistic:

G = (ȳ − Ymin) / s

Where Ymin is the minimum value.

For a right-tailed test, use:

G = (Ymax − ȳ) / s

Where Ymax is the maximum value.

4
Q

Find the G Critical Value.

A

Several tables exist for finding the critical value for Grubbs’ test, listing G critical values for a range of sample sizes and alpha levels. When looking up G critical values, make sure you’re using the right table for your test (i.e. one-tailed or two-tailed).
[G critical value table not reproduced here.]
Manually, you can find the G critical value with a formula.
Reject the point as an outlier if:

G > ((N − 1) / √N) · √( t² / (N − 2 + t²) )

Where:
t = tα/(2N), N−2 is the upper critical value of a t-distribution with N − 2 degrees of freedom at significance level α/(2N).

For a one-tailed test, replace α/(2N) with α/N.

5
Q

Accept or Reject the Outlier

A

Compare your G test statistic to the G critical value:
Gtest < Gcritical: keep the point in the data set; it is not an outlier.
Gtest > Gcritical: reject the point as an outlier.
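
A minimal Python sketch of the whole procedure for a right-tailed (maximum-value) test, assuming NumPy and SciPy are available and using made-up data:

import numpy as np
from scipy import stats

def grubbs_max(data, alpha=0.05):
    # One-sided (right-tailed) Grubbs' test for the maximum value.
    # Returns the G statistic, the G critical value, and whether the
    # maximum should be rejected as an outlier. Assumes approximate normality.
    y = np.asarray(data, dtype=float)
    n = len(y)
    g = (y.max() - y.mean()) / y.std(ddof=1)   # right-tailed G statistic

    # Upper critical value of a t-distribution with N-2 degrees of freedom
    # at significance level alpha/N (the one-tailed form of the formula above).
    t_crit = stats.t.ppf(1 - alpha / n, n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return g, g_crit, g > g_crit

sample = [10.2, 10.4, 10.1, 10.3, 10.2, 10.0, 10.5, 19.3]   # 19.3 looks suspicious
print(grubbs_max(sample, alpha=0.05))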

6
Q

The P-value is the probability of obtaining results at least as extreme as those observed in a statistical hypothesis test, assuming the null hypothesis is correct.

It is often used as an alternative to fixed rejection points: it is the smallest level of significance at which the null hypothesis would be rejected.

A

In R, when you apply the summary() function to a fitted model, significance stars (asterisks) appear beside each P-value. The more stars beside a P-value, the more significant that variable is.

7
Q

How to interpret P-values for linear models?

A

The P-value, as you know, gives the probability associated with a hypothesis test. In a regression model, the P-value for each independent variable tests the null hypothesis that there is no correlation between that independent variable and the dependent variable. This also helps determine whether the relationship observed in the sample exists in the larger population.

If the P-value for a variable is less than the significance level (usually 0.05), the sample data provide evidence of a relationship between that variable and the dependent variable.

The significance level is the probability of rejecting the null hypothesis when it is true.

8
Q

P >= 0.1

A

Absence of evidence against null hypothesis, data consistent with null hypothesis

9
Q

0.05 <= P < 0.1

A

Weak evidence against the null hypothesis in favour of the alternative

10
Q

0.01 <= P < 0.05

A

Strong evidence against null hypothesis in favour of the alternative

11
Q

P < 0.001

A

Very strong evidence against the null hypothesis in favour of the alternative

12
Q

Why should the P-value be less than 0.05?

A

A significance level of 0.05 indicates a 5% risk of concluding that a difference exists between the variables when there is no actual difference. In other words, if the P-value for a variable is less than your significance level, then the sample data provide enough evidence to reject the null hypothesis for the entire population. Conversely, a P-value greater than 0.05 indicates weak evidence, and you fail to reject the null hypothesis.

13
Q

Hypothesis testing is a statistical procedure used to test whether your results are valid.

In our example, we are testing whether the true coefficient of Average_Pulse and the intercept are equal to zero.

A hypothesis test has two statements: the null hypothesis and the alternative hypothesis.

The null hypothesis is written H0 for short.
The alternative hypothesis is written HA for short.

A

Mathematically written:

H0: Average_Pulse = 0
HA: Average_Pulse ≠ 0
H0: Intercept = 0
HA: Intercept ≠ 0
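
A minimal sketch of how these hypotheses are tested in practice, assuming Python with pandas and statsmodels and using hypothetical Average_Pulse/Calorie_Burnage values:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for the fitness example.
data = pd.DataFrame({
    "Average_Pulse":   [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 270, 265, 290, 285, 305, 300, 325, 330],
})

# Fit Calorie_Burnage as a linear function of Average_Pulse.
model = smf.ols("Calorie_Burnage ~ Average_Pulse", data=data).fit()

# The summary reports each coefficient with its p-value (column "P>|t|").
# A small p-value is evidence against H0: coefficient = 0.
print(model.summary())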

14
Q

The null hypothesis can either be rejected or not.

If we reject the null hypothesis, we conclude that there exists a relationship between Average_Pulse and Calorie_Burnage. The P-value is used for this conclusion.

A common threshold for the P-value is 0.05.

Note: A threshold of 0.05 means that, 5% of the time, we will falsely reject the null hypothesis; in other words, we accept a 5% chance of falsely concluding that a relationship exists.

If the P-value is lower than 0.05, we can reject the null hypothesis and conclude that there exists a relationship between the variables.

However, the P-value of Average_Pulse is 0.824, so we cannot conclude a relationship between Average_Pulse and Calorie_Burnage.

A

A P-value of 0.824 means that, if the true coefficient of Average_Pulse were zero, we would expect an estimate at least this extreme about 82.4% of the time, so the data give essentially no evidence against the null hypothesis.

The intercept is used to adjust the regression function’s ability to predict more precisely. It is therefore uncommon to interpret the P-value of the intercept.

15
Q

Regression analysis is a form of inferential statistics. The p values in regression help determine whether the relationships that you observe in your sample also exist in the larger population. The linear regression p value for each independent variable tests the null hypothesis that the variable has no correlation with the dependent variable.

A

If there is no correlation, there is no association between the changes in the independent variable and the shifts in the dependent variable. In other words, there is insufficient evidence to conclude that there is an effect at the population level.

16
Q

If the p-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there is a non-zero correlation. Changes in the independent variable are associated with changes in the dependent variable at the population level. This variable is statistically significant and probably a worthwhile addition to your regression model.

A

On the other hand, when a p value in regression is greater than the significance level, it indicates there is insufficient evidence in your sample to conclude that a non-zero correlation exists.

17
Q

What does the coefficient mean? The sign of a linear regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

A

The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.

18
Q

The linear regression coefficients in your statistical output are estimates of the actual population parameters. To obtain unbiased coefficient estimates that have the minimum variance, and to be able to trust the p-values, your model must satisfy the seven classical assumptions of OLS linear regression.

A

Statisticians consider linear regression coefficients to be an unstandardized effect size because they indicate the strength of the relationship between variables using values that retain the natural units of the dependent variable. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

19
Q

Decision Trees (DTs)

A

non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

20
Q

Advantages of decision trees

A

Simple to understand and to interpret. Trees can be visualized.

Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.

The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See algorithms for more information.

Able to handle multi-output problems.

Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.

Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

21
Q

Disadvantages of decision trees

A

Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.

Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations as seen in the above figure. Therefore, they are not good at extrapolation.

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

22
Q

DecisionTreeClassifier

A

a class capable of performing multi-class classification on a dataset.

As with other classifiers, DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense, of shape (n_samples, n_features) holding the training samples, and an array Y of integer values, shape (n_samples,), holding the class labels for the training samples:

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)

After being fitted, the model can then be used to predict the class of samples:

>>> clf.predict([[2., 2.]])
array([1])

In case there are multiple classes with the same and highest probability, the classifier will predict the class with the lowest index amongst those classes.

As an alternative to outputting a specific class, the probability of each class can be predicted, which is the fraction of training samples of the class in a leaf:

>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])

23
Q

DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, …, K-1]) classification.

Using the Iris dataset, we can construct a tree as follows:

A

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)

Once trained, you can plot the tree with the plot_tree function:

>>> tree.plot_tree(clf)

24
Q

The export_graphviz exporter also supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these plots inline automatically:

>>> import graphviz
>>> dot_data = tree.export_graphviz(clf, out_file=None,
...                                 feature_names=iris.feature_names,
...                                 class_names=iris.target_names,
...                                 filled=True, rounded=True,
...                                 special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph

A

Alternatively, the tree can also be exported in textual format with the function export_text. This method doesn’t require the installation of external libraries and is more compact:

>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.tree import export_text
>>> iris = load_iris()
>>> decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
>>> decision_tree = decision_tree.fit(iris.data, iris.target)
>>> r = export_text(decision_tree, feature_names=iris['feature_names'])
>>> print(r)
|--- petal width (cm) <= 0.80
|   |--- class: 0
|--- petal width (cm) >  0.80
|   |--- petal width (cm) <= 1.75
|   |   |--- class: 1
|   |--- petal width (cm) >  1.75
|   |   |--- class: 2

25
Q

Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.

As in the classification setting, the fit method will take as argument arrays X and y, only that in this case y is expected to have floating point values instead of integer values:

A

>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([0.5])

26
Q

A multi-output problem is a supervised learning problem with several outputs to predict, that is when Y is a 2d array of shape (n_samples, n_outputs).

When there is no correlation between the outputs, a very simple way to solve this kind of problem is to build n independent models, i.e. one for each output, and then to use those models to independently predict each one of the n outputs. However, because it is likely that the output values related to the same input are themselves correlated, an often better way is to build a single model capable of predicting simultaneously all n outputs. First, it requires lower training time since only a single estimator is built. Second, the generalization accuracy of the resulting estimator may often be increased.

A

With regard to decision trees, this strategy can readily be used to support multi-output problems. This requires the following changes:

Store n output values in leaves, instead of 1;

Use splitting criteria that compute the average reduction across all n outputs.

This module offers support for multi-output problems by implementing this strategy in both DecisionTreeClassifier and DecisionTreeRegressor. If a decision tree is fit on an output array Y of shape (n_samples, n_outputs) then the resulting estimator will:

Output n_output values upon predict;

Output a list of n_output arrays of class probabilities upon predict_proba.

The use of multi-output trees for regression is demonstrated in Multi-output Decision Tree Regression. In this example, the input X is a single real value and the outputs Y are the sine and cosine of X.
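
A brief sketch of a multi-output regression tree, assuming scikit-learn and NumPy and using synthetic data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)       # a single real-valued input
Y = np.column_stack([np.sin(X).ravel(),        # two outputs: sine and cosine of X
                     np.cos(X).ravel()])

reg = DecisionTreeRegressor(max_depth=5).fit(X, Y)
print(reg.predict([[1.0]]))                    # both outputs predicted for one sample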

27
Q

Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.

A

Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.
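
For example, a dimensionality-reduction step can be chained in front of the tree; a rough sketch assuming scikit-learn and synthetic high-dimensional data:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: many features, few of them informative.
X, y = make_classification(n_samples=200, n_features=100, n_informative=5, random_state=0)

# Project onto a handful of components before growing a shallow tree.
clf = make_pipeline(PCA(n_components=5), DecisionTreeClassifier(max_depth=3, random_state=0))
clf.fit(X, y)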

28
Q

Understanding the decision tree structure will help in gaining more insights about how the decision tree makes predictions, which is important for understanding the important features in the data.

Visualize your tree as you are training by using the export function. Use max_depth=3 as an initial tree depth to get a feel for how the tree is fitting to your data, and then increase the depth.

Remember that the number of samples required to populate the tree doubles for each additional level the tree grows to. Use max_depth to control the size of the tree to prevent overfitting.

A

Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data. Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a float number can be used as percentage in these two parameters. While min_samples_split can create arbitrarily small leaves, min_samples_leaf guarantees that each leaf has a minimum size, avoiding low-variance, over-fit leaf nodes in regression problems. For classification with few classes, min_samples_leaf=1 is often the best choice.

Note that min_samples_split considers samples directly and independent of sample_weight, if provided (e.g. a node with m weighted samples is still treated as having exactly m samples). Consider min_weight_fraction_leaf or min_impurity_decrease if accounting for sample weights is required at splits.
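
A hedged starting configuration based on these tips, assuming scikit-learn and the Iris data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Shallow tree with a minimum leaf size, as suggested above, to limit overfitting.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)
print(export_text(clf))   # inspect the fitted structure before trying deeper trees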

29
Q

Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.

A

If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensure that leaf nodes contain at least a fraction of the overall sum of the sample weights.
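
As a small sketch (assuming scikit-learn and made-up imbalanced data), class_weight="balanced" gives each class the same total sample weight, which is the normalization described above, and min_weight_fraction_leaf then prunes on those weights:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: 90 samples of class 0, 10 samples of class 1.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.array([0] * 90 + [1] * 10)

# "balanced" re-weights samples inversely to class frequencies,
# so the tree is less biased toward the dominant class.
clf = DecisionTreeClassifier(class_weight="balanced",
                             min_weight_fraction_leaf=0.05,
                             random_state=0).fit(X, y)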

30
Q

All decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset will be made.

A

If the input matrix X is very sparse, it is recommended to convert to sparse csc_matrix before calling fit and sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples.
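
A short sketch of the sparse-input recommendation, assuming SciPy and scikit-learn and a mostly-zero toy matrix:

import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature matrix that is mostly zeros.
X_dense = np.zeros((1000, 500), dtype=np.float32)
X_dense[::10, 0] = 1.0
y = (X_dense[:, 0] > 0).astype(int)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(csc_matrix(X_dense), y)            # csc format recommended before fit
pred = clf.predict(csr_matrix(X_dense))    # csr format recommended before predict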

31
Q

What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented in scikit-learn?

ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalize to unseen data.

A

C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy of the rule improves without it.

C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.

CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.

scikit-learn uses an optimized version of the CART algorithm; however, the scikit-learn implementation does not support categorical variables for now.

32
Q

Decision tree learning employs a divide-and-conquer strategy, conducting a greedy search to identify the optimal split points within a tree. This process of splitting is then repeated in a top-down, recursive manner until all, or the majority of, records have been classified under specific class labels. Whether or not all data points are classified as homogeneous sets is largely dependent on the complexity of the decision tree. Smaller trees are more easily able to attain pure leaf nodes—i.e. data points in a single class. However, as a tree grows in size, it becomes increasingly difficult to maintain this purity, and it usually results in too little data falling within a given subtree. When this occurs, it is known as data fragmentation, and it can often lead to overfitting. As a result, decision tree learning has a preference for small trees, which is consistent with the principle of parsimony in Occam’s Razor; that is, “entities should not be multiplied beyond necessity.” Said differently, decision trees should add complexity only if necessary, as the simplest explanation is often the best.

A

To reduce complexity and prevent overfitting, pruning is usually employed; this is a process, which removes branches that split on features with low importance. The model’s fit can then be evaluated through the process of cross-validation. Another way that decision trees can maintain their accuracy is by forming an ensemble via a random forest algorithm; this classifier predicts more accurate results, particularly when the individual trees are uncorrelated with each other.

33
Q

Types of decision trees

A

Hunt’s algorithm, which was developed in the 1960s to model human learning in Psychology, forms the foundation of many popular decision tree algorithms, such as the following:

  • ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate candidate splits; Quinlan published research on the algorithm in 1986.
  • C4.5: This algorithm is considered a later iteration of ID3, also developed by Quinlan. It can use information gain or gain ratios to evaluate split points within the decision trees.
  • CART: The term CART is an abbreviation for “classification and regression trees” and was introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal attribute to split on. Gini impurity measures how often a randomly chosen record would be misclassified; when evaluating splits with Gini impurity, a lower value is preferable.
34
Q

Recursive binary splitting

A

In this procedure all the features are considered and different split points are tried and tested using a cost function. The split with the best cost (or lowest cost) is selected.

Consider the earlier example of a tree learned from the Titanic dataset. In the first split, at the root, all attributes/features are considered and the training data is divided into groups based on this split. We have 3 features, so we will have 3 candidate splits. Now we calculate how much accuracy each split will cost us, using a cost function. The split that costs least is chosen, which in our example is the sex of the passenger. This algorithm is recursive in nature, as the groups formed can be sub-divided using the same strategy. Because of this procedure, the algorithm is also known as a greedy algorithm: at each step it takes the split that lowers the cost the most. This makes the root node the best predictor/classifier.

35
Q

Cost of a split

A

Let’s take a closer look at the cost functions used for classification and regression. In both cases the cost function tries to find the most homogeneous branches, i.e. branches whose groups have similar responses. This makes sense: we can be more sure that a test data input will follow a certain path.

Regression: cost = Σ (y − prediction)²

Let’s say we are predicting the price of houses. The decision tree will start splitting by considering each feature in the training data. The mean of the responses of the training data inputs in a particular group is used as the prediction for that group. The above function is applied to all data points and the cost is calculated for all candidate splits. Again, the split with the lowest cost is chosen.

Classification: G = Σ pk (1 − pk)

A Gini score gives an idea of how good a split is by how mixed the response classes are in the groups created by the split. Here, pk is the proportion of inputs of class k present in a particular group. Perfect class purity occurs when a group contains only inputs from a single class, in which case each pk is either 1 or 0 and G = 0, whereas a node with a 50-50 split of classes in a group has the worst purity: for binary classification pk = 0.5 and G = 0.5.
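
A small sketch of the two cost functions in plain Python/NumPy, with made-up groups:

import numpy as np

def regression_cost(y_group):
    # Sum of squared differences from the group mean (the group's prediction).
    y = np.asarray(y_group, dtype=float)
    return float(np.sum((y - y.mean()) ** 2))

def gini(labels):
    # Gini score G = sum over classes of pk * (1 - pk) for one group of labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

print(regression_cost([200000, 250000, 300000]))  # house prices in one candidate group
print(gini([1, 1, 1, 1]))    # pure group   -> 0.0
print(gini([0, 1, 0, 1]))    # 50-50 split  -> 0.5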

36
Q

When to stop splitting?

A

You might ask: when should we stop growing a tree? A problem usually has a large set of features, which results in a large number of splits and, in turn, a huge tree. Such trees are complex and can lead to overfitting. So we need to know when to stop. One way of doing this is to set a minimum number of training inputs to use on each leaf. For example, we can require a minimum of 10 passengers to reach a decision (died or survived), and ignore any leaf that takes fewer than 10 passengers. Another way is to set the maximum depth of your model. Maximum depth refers to the length of the longest path from the root to a leaf.

37
Q

Pruning

A

The performance of a tree can be further increased by pruning. It involves removing branches that make use of features having low importance. This way, we reduce the complexity of the tree and thus increase its predictive power by reducing overfitting.

Pruning can start at either the root or the leaves. The simplest method starts at the leaves and replaces each node with its most popular class; the change is kept if it doesn’t hurt accuracy. This is also called reduced error pruning. More sophisticated pruning methods can be used, such as cost complexity pruning, where a learning parameter (alpha) is used to weigh whether nodes can be removed based on the size of the sub-tree. This is also known as weakest link pruning.
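
In scikit-learn, cost complexity pruning is exposed through the ccp_alpha parameter; a brief sketch assuming the Iris data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Enumerate candidate alpha values for cost complexity (weakest link) pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# A larger ccp_alpha prunes more aggressively, giving a smaller tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(pruned.get_n_leaves())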

38
Q

Advantages of CART

A

Simple to understand, interpret, visualize.
Decision trees implicitly perform variable screening or feature selection.
Can handle both numerical and categorical data. Can also handle multi-output problems.
Decision trees require relatively little effort from users for data preparation.
Nonlinear relationships between parameters do not affect tree performance.

39
Q

Disadvantages of CART

A

Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
Greedy algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree.
This covers the basics to bring you up to speed on decision tree learning. An improvement over single decision trees is made using the technique of boosting. A popular library for implementing these algorithms is Scikit-Learn. It has a wonderful API that can get your model up and running with just a few lines of code in Python.

40
Q

regression models

A

used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

41
Q

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

A

How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).

The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

42
Q

Homogeneity of variance (homoscedasticity)

A

the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

43
Q

Independence of observations

A

the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

44
Q

Normality

A

The data follows a normal distribution.

45
Q

Linearity

A

the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.

46
Q

How to perform a multiple linear regression
Multiple linear regression formula
The formula for a multiple linear regression is:

y = \beta_0 + \beta_1 X_1 + … + \beta_n X_n + \epsilon

A

y = the predicted value of the dependent variable
B_0 = the y-intercept (value of y when all other parameters are set to 0)
B_1X_1 = the regression coefficient (B_1) of the first independent variable (X_1) (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value)
… = do the same for however many independent variables you are testing
B_nX_n = the regression coefficient of the last independent variable
\epsilon = model error (a.k.a. how much variation there is in our estimate of y)
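
A hedged sketch of fitting such a model, assuming statsmodels and made-up crop data with two independent variables:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: crop yield as a function of rainfall and temperature.
data = pd.DataFrame({
    "yield_":   [3.1, 3.4, 3.8, 4.0, 4.4, 4.1, 4.8, 5.0],
    "rainfall": [40, 45, 55, 60, 70, 65, 80, 85],
    "temp":     [15, 16, 17, 18, 18, 19, 20, 21],
})

# yield_ = B_0 + B_1 * rainfall + B_2 * temp + error
model = smf.ols("yield_ ~ rainfall + temp", data=data).fit()
print(model.params)    # estimated B_0, B_1, B_2
print(model.pvalues)   # p-value for each coefficient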

47
Q

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

A

The regression coefficients that lead to the smallest overall model error.
The t statistic of the overall model.
The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

48
Q

Presenting the results

A

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

49
Q

Regression analysis is a common statistical method used in finance and investing. Linear regression is one of the most common techniques of regression analysis. Multiple regression is a broader class of regressions that encompasses linear and nonlinear regressions with multiple explanatory variables.

Regression as a tool helps pool data together to help people and companies make informed decisions. There are different variables at play in regression, including a dependent variable—the main variable that you’re trying to understand—and an independent variable—factors that may have an impact on the dependent variable.

In order to make regression analysis work, you must collect all the relevant data. It can be presented on a graph, with an x-axis and a y-axis.

There are several main reasons people use regression analysis:

A

To predict future economic conditions, trends, or values

To determine the relationship between two or more variables

To understand how one variable changes when another changes

There are many different kinds of regression analysis. For the purpose of this article, we will look at two: linear regression and multiple regression.

50
Q

Linear regression

A

Also called simple regression, linear regression establishes the relationship between two variables. Linear regression is graphically depicted using a straight line with the slope defining how the change in one variable impacts a change in the other. The y-intercept of a linear regression relationship represents the value of one variable when the value of the other is 0.

In linear regression, every dependent value has a single corresponding independent variable that drives its value. For example, in the linear regression formula of y = 3x + 7, there is only one possible outcome of ‘y’ if ‘x’ is defined as 2.

If the relationship between two variables does not follow a straight line, nonlinear regression may be used instead. Linear and nonlinear regression are similar in that both track a particular response from a set of variables. As the relationship between the variables becomes more complex, nonlinear models have greater flexibility and capability of depicting the non-constant slope.

51
Q

Multiple regression

A

For complex connections between data, the relationship might be explained by more than one variable. In this case, an analyst uses multiple regression which attempts to explain a dependent variable using more than one independent variable.

There are two main uses for multiple regression analysis. The first is to determine the dependent variable based on multiple independent variables. For example, you may be interested in determining what a crop yield will be based on temperature, rainfall, and other independent variables. The second is to determine how strong the relationship is between each variable. For example, you may be interested in knowing how a crop yield will change if rainfall increases or the temperature decreases.

Multiple regression assumes there is not a strong relationship between each independent variable. It also assumes there is a correlation between each independent variable and the single dependent variable. Each of these relationships is weighted to ensure more impactful independent variables drive the dependent value by adding a unique regression coefficient to each independent variable.

52
Q

Linear regression vs multiple regression

A

Consider an analyst who wishes to establish a relationship between the daily change in a company’s stock prices and the daily change in trading volume. Using linear regression, the analyst can attempt to determine the relationship between the two variables:

Daily Change in Stock Price = (Coefficient)(Daily Change in Trading Volume) + (y-intercept)

If the stock price increases $0.10 before any trades occur and increases $0.01 for every share sold, the linear regression outcome is:

Daily Change in Stock Price = ($0.01)(Daily Change in Trading Volume) + $0.10

However, the analyst realizes there are several other factors to consider including the company’s P/E ratio, dividends, and prevailing inflation rate. The analyst can perform multiple regression to determine which—and how strongly—each of these variables impacts the stock price:

Daily Change in Stock Price = (Coefficient)(Daily Change in Trading Volume) + (Coefficient)(Company’s P/E Ratio) + (Coefficient)(Dividend) + (Coefficient)(Inflation Rate)

53
Q

Is multiple linear regression better than simple linear regression?

A

Multiple linear regression is a more specific calculation than simple linear regression. For straightforward relationships, simple linear regression may easily capture the relationship between the two variables. For more complex relationships requiring more consideration, multiple linear regression is often better.

54
Q

When should you use multiple linear regression

A

Multiple linear regression should be used when multiple independent variables determine the outcome of a single dependent variable. This is often the case when forecasting more complex relationships.

55
Q

How do you interpret multiple regression?

A

A multiple regression formula has multiple slopes (one for each variable) and one y-intercept. It is interpreted the same as a simple linear regression formula except there are multiple variables that all impact the slope of the relationship.

56
Q

Understanding multiple linear regression

A

Simple linear regression enables statisticians to predict the value of one variable using the available information about another variable. Linear regression attempts to establish the relationship between the two variables along a straight line.

Multiple regression is a type of regression where the dependent variable shows a linear relationship with two or more independent variables. It can also be non-linear, where the dependent and independent variables do not follow a straight line.

Both linear and non-linear regression track a particular response using two or more variables graphically. However, non-linear regression is usually difficult to execute since it is created from assumptions derived from trial and error.

57
Q

A linear relationship between the dependent and independent variables

A

The first assumption of multiple linear regression is that there is a linear relationship between the dependent variable and each of the independent variables. The best way to check the linear relationships is to create scatterplots and then visually inspect the scatterplots for linearity. If the relationship displayed in the scatterplot is not linear, then the analyst will need to run a non-linear regression or transform the data using statistical software, such as SPSS.

58
Q

The independent variables are not highly correlated with each other

A

The data should not show multicollinearity, which occurs when the independent variables (explanatory variables) are highly correlated. When independent variables show multicollinearity, there will be problems figuring out the specific variable that contributes to the variance in the dependent variable. The best method to test for the assumption is the Variance Inflation Factor method.
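
A brief sketch of computing VIFs, assuming statsmodels and pandas and a hypothetical predictor table; values well above roughly 5-10 are commonly taken to flag multicollinearity:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors.
X = pd.DataFrame({
    "rainfall": [40, 45, 55, 60, 70, 65, 80, 85],
    "temp":     [15, 16, 17, 18, 18, 19, 20, 21],
})
X = add_constant(X)   # VIFs are usually computed with an intercept column included

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)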

59
Q

The variance of the residuals is constant

A

Multiple linear regression assumes that the amount of error in the residuals is similar at each point of the linear model. This scenario is known as homoscedasticity. When analyzing the data, the analyst should plot the standardized residuals against the predicted values to determine if the points are distributed fairly across all the values of independent variables. To test the assumption, the data can be plotted on a scatterplot or by using statistical software to produce a scatterplot that includes the entire model.

60
Q

Independence of observation

A

The model assumes that the observations are independent of one another. Simply put, the model assumes that the values of the residuals are independent. To test this assumption, we use the Durbin-Watson statistic.

The statistic ranges from 0 to 4: values below 2 suggest positive autocorrelation, values above 2 suggest negative autocorrelation, and the mid-point value of 2 indicates no autocorrelation.
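
A small self-contained sketch, assuming statsmodels and NumPy with simulated data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated response with independent errors (hypothetical data).
rng = np.random.RandomState(0)
x = np.arange(50, dtype=float)
y = 2.0 * x + rng.normal(scale=5.0, size=50)

results = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(results.resid))   # close to 2 -> little evidence of autocorrelation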

61
Q

Multivariate normality

A

Multivariate normality occurs when residuals are normally distributed. To test this assumption, look at how the values of residuals are distributed. It can also be tested using two main methods, i.e., a histogram with a superimposed normal curve or the Normal Probability Plot method.