Section 3.2: Grubbs' Outlier Test, P-Values from Linear Regression, Decision Trees, & Multiple Regression Flashcards

1
Q

Grubbs’ test is used to find a single outlier in a normally distributed data set. The test determines whether the minimum value or the maximum value in the set is an outlier.

Cautions:

A

The test is only used to find a single outlier in normally distributed data (excluding the potential outlier). If you think that your data set has more than one outlier, use the generalized extreme studentized deviate test or Tietjen-Moore test instead.

Using this test on non-normal distributions will give false results.

Run a test for normality (like the Shapiro-Wilk test) before running Grubbs’ test. If you find your data set isn’t normally distributed, try removing the potential outlier from the data set and running the normality test again. If your data still isn’t normal, don’t run this test.

2
Q

Running Grubbs’ Test

A

The test is a deceptively simple one to run. It checks for outliers by looking for the maximum of the absolute differences between the values and the mean. Basically, the steps are:

Find the G test statistic.

Find the G Critical Value.

Compare the test statistic to the G critical value.

Reject the point as an outlier if the test statistic is greater than the critical value.

The formulas differ slightly depending on whether you want to check for an outlier at one end of the data (a one-tailed test) or at both ends at the same time (a two-tailed test). For simplicity, I’d recommend running a one-tailed test first, as (a) it is an easier equation to work by hand and (b) it simplifies the decision to reject (or keep) a single minimum or maximum point.

3
Q

Find the G Test Statistic

A

Step 1: Order the data points from smallest to largest.

Step 2: Find the mean (x̄) and standard deviation of the data set.

Step 3: Calculate the G test statistic using one of the following equations:

The Grubbs’ test statistic for a two-tailed test is:

G = max |Yi − ȳ| / s, taken over i = 1, …, N

Where:
ȳ is the sample mean,
s is the sample standard deviation.

A left-tailed test uses the test statistic:

G = (ȳ − Ymin) / s

Where Ymin is the minimum value.

For a right-tailed test, use:

G = (Ymax − ȳ) / s

Where Ymax is the maximum value.

4
Q

Find the G Critical Value.

A

Several tables exist for finding the critical value for Grubbs’ test, listing G critical values for a range of sample sizes and alpha levels. When looking up G critical values, make sure you’re using the right table for your test (i.e. one-tailed or two-tailed).
[G critical value table not reproduced here.]
Manually, you can find the G critical value with a formula.
Reject the point as an outlier if:

G > ((N − 1) / √N) · √( t² / (N − 2 + t²) )

Where:
t = tα/(2N), N−2 is the upper critical value of a t-distribution with N − 2 degrees of freedom at significance level α/(2N).

For a one-tailed test, replace α/(2N) with α/N.

5
Q

Accept or Reject the Outlier

A

Compare your G test statistic to the G critical value:
Gtest < Gcritical: keep the point in the data set; it is not an outlier.
Gtest > Gcritical: reject the point as an outlier.
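
A minimal Python sketch of the whole procedure for a right-tailed (maximum-value) test, assuming NumPy and SciPy are available and using made-up data:

import numpy as np
from scipy import stats

def grubbs_max(data, alpha=0.05):
    # One-sided (right-tailed) Grubbs' test for the maximum value.
    # Returns the G statistic, the G critical value, and whether the
    # maximum should be rejected as an outlier. Assumes approximate normality.
    y = np.asarray(data, dtype=float)
    n = len(y)
    g = (y.max() - y.mean()) / y.std(ddof=1)   # right-tailed G statistic

    # Upper critical value of a t-distribution with N-2 degrees of freedom
    # at significance level alpha/N (the one-tailed form of the formula above).
    t_crit = stats.t.ppf(1 - alpha / n, n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

    return g, g_crit, g > g_crit

sample = [10.2, 10.4, 10.1, 10.3, 10.2, 10.0, 10.5, 19.3]   # 19.3 looks suspicious
print(grubbs_max(sample, alpha=0.05))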

6
Q

The P-value is the probability of obtaining results at least as extreme as those observed in a statistical hypothesis test, assuming the null hypothesis is correct.

It is often used as an alternative to fixed rejection points: it is the smallest level of significance at which the null hypothesis would be rejected.

A

In R, when you apply the summary() function to a fitted model, significance stars (asterisks) appear beside each P-value. The more stars beside a P-value, the more significant that variable is.

7
Q

How to interpret P-values for linear models?

A

The P-value, as you know, gives the probability associated with a hypothesis test. In a regression model, the P-value for each independent variable tests the null hypothesis that there is no correlation between that independent variable and the dependent variable. This also helps determine whether the relationship observed in the sample exists in the larger population.

If the P-value for a variable is less than the significance level (usually 0.05), the sample data provide evidence of a relationship between that variable and the dependent variable.

The significance level is the probability of rejecting the null hypothesis when it is true.

8
Q

P >= 0.1

A

Absence of evidence against null hypothesis, data consistent with null hypothesis

9
Q

0.05 <= P < 0.1

A

Weak evidence against the null hypothesis in favour of the alternative

10
Q

0.01 <= P < 0.05

A

Strong evidence against null hypothesis in favour of the alternative

11
Q

P < 0.001

A

Very strong evidence against the null hypothesis in favour of the alternative

12
Q

Why should the P-value be less than 0.05?

A

A significance level of 0.05 indicates a 5% risk of concluding that a difference exists between the variables when there is no actual difference. In other words, if the P-value for a variable is less than your significance level, then the sample data provide enough evidence to reject the null hypothesis for the entire population. Conversely, a P-value greater than 0.05 indicates weak evidence, and you fail to reject the null hypothesis.

13
Q

Hypothesis testing is a statistical procedure used to test whether your results are valid.

In our example, we are testing whether the true coefficient of Average_Pulse and the intercept are equal to zero.

A hypothesis test has two statements: the null hypothesis and the alternative hypothesis.

The null hypothesis is written H0 for short.
The alternative hypothesis is written HA for short.

A

Mathematically written:

H0: Average_Pulse = 0
HA: Average_Pulse ≠ 0
H0: Intercept = 0
HA: Intercept ≠ 0
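
A minimal sketch of how these hypotheses are tested in practice, assuming Python with pandas and statsmodels and using hypothetical Average_Pulse/Calorie_Burnage values:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for the fitness example.
data = pd.DataFrame({
    "Average_Pulse":   [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 270, 265, 290, 285, 305, 300, 325, 330],
})

# Fit Calorie_Burnage as a linear function of Average_Pulse.
model = smf.ols("Calorie_Burnage ~ Average_Pulse", data=data).fit()

# The summary reports each coefficient with its p-value (column "P>|t|").
# A small p-value is evidence against H0: coefficient = 0.
print(model.summary())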

14
Q

The null hypothesis can either be rejected or not.

If we reject the null hypothesis, we conclude that there exists a relationship between Average_Pulse and Calorie_Burnage. The P-value is used for this conclusion.

A common threshold for the P-value is 0.05.

Note: A threshold of 0.05 means that, 5% of the time, we will falsely reject the null hypothesis; in other words, we accept a 5% chance of falsely concluding that a relationship exists.

If the P-value is lower than 0.05, we can reject the null hypothesis and conclude that there exists a relationship between the variables.

However, the P-value of Average_Pulse is 0.824, so we cannot conclude a relationship between Average_Pulse and Calorie_Burnage.

A

A P-value of 0.824 means that, if the true coefficient of Average_Pulse were zero, we would expect an estimate at least this extreme about 82.4% of the time, so the data give essentially no evidence against the null hypothesis.

The intercept is used to adjust the regression function’s ability to predict more precisely. It is therefore uncommon to interpret the P-value of the intercept.

15
Q

Regression analysis is a form of inferential statistics. The p values in regression help determine whether the relationships that you observe in your sample also exist in the larger population. The linear regression p value for each independent variable tests the null hypothesis that the variable has no correlation with the dependent variable.

A

If there is no correlation, there is no association between the changes in the independent variable and the shifts in the dependent variable. In other words, there is insufficient evidence to conclude that there is an effect at the population level.

16
Q

If the p-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there is a non-zero correlation. Changes in the independent variable are associated with changes in the dependent variable at the population level. This variable is statistically significant and probably a worthwhile addition to your regression model.

A

On the other hand, when a p value in regression is greater than the significance level, it indicates there is insufficient evidence in your sample to conclude that a non-zero correlation exists.

17
Q

What does the coefficient mean? The sign of a linear regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

A

The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.

18
Q

The linear regression coefficients in your statistical output are estimates of the actual population parameters. To obtain unbiased coefficient estimates that have the minimum variance, and to be able to trust the p-values, your model must satisfy the seven classical assumptions of OLS linear regression.

A

Statisticians consider linear regression coefficients to be an unstandardized effect size because they indicate the strength of the relationship between variables using values that retain the natural units of the dependent variable. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

19
Q

Decision Trees (DTs)

A

non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

20
Q

Advantages of decision trees

A

Simple to understand and to interpret. Trees can be visualized.

Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.

The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See algorithms for more information.

Able to handle multi-output problems.

Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.

Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

21
Q

Disadvantages of decision trees

A

Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.

Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.

Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations as seen in the above figure. Therefore, they are not good at extrapolation.

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.

Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

22
Q

DecisionTreeClassifier

A

a class capable of performing multi-class classification on a dataset.

As with other classifiers, DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense, of shape (n_samples, n_features) holding the training samples, and an array Y of integer values, shape (n_samples,), holding the class labels for the training samples:

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)

After being fitted, the model can then be used to predict the class of samples:

>>> clf.predict([[2., 2.]])
array([1])

In case there are multiple classes with the same and highest probability, the classifier will predict the class with the lowest index amongst those classes.

As an alternative to outputting a specific class, the probability of each class can be predicted, which is the fraction of training samples of the class in a leaf:

>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])

23
Q

DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, …, K-1]) classification.

Using the Iris dataset, we can construct a tree as follows:

A

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)

Once trained, you can plot the tree with the plot_tree function:

>>> tree.plot_tree(clf)

24
Q

The export_graphviz exporter also supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these plots inline automatically:

>>> import graphviz
>>> dot_data = tree.export_graphviz(clf, out_file=None,
...                                 feature_names=iris.feature_names,
...                                 class_names=iris.target_names,
...                                 filled=True, rounded=True,
...                                 special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph

A

Alternatively, the tree can also be exported in textual format with the function export_text. This method doesn’t require the installation of external libraries and is more compact:

>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.tree import export_text
>>> iris = load_iris()
>>> decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
>>> decision_tree = decision_tree.fit(iris.data, iris.target)
>>> r = export_text(decision_tree, feature_names=iris['feature_names'])
>>> print(r)
|--- petal width (cm) <= 0.80
|   |--- class: 0
|--- petal width (cm) >  0.80
|   |--- petal width (cm) <= 1.75
|   |   |--- class: 1
|   |--- petal width (cm) >  1.75
|   |   |--- class: 2

25
Q

Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.

As in the classification setting, the fit method will take as argument arrays X and y, only that in this case y is expected to have floating point values instead of integer values:

A

>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([0.5])

26
Q

A multi-output problem is a supervised learning problem with several outputs to predict, that is when Y is a 2d array of shape (n_samples, n_outputs).

When there is no correlation between the outputs, a very simple way to solve this kind of problem is to build n independent models, i.e. one for each output, and then to use those models to independently predict each one of the n outputs. However, because it is likely that the output values related to the same input are themselves correlated, an often better way is to build a single model capable of predicting simultaneously all n outputs. First, it requires lower training time since only a single estimator is built. Second, the generalization accuracy of the resulting estimator may often be increased.

A

With regard to decision trees, this strategy can readily be used to support multi-output problems. This requires the following changes:

Store n output values in leaves, instead of 1;

Use splitting criteria that compute the average reduction across all n outputs.

This module offers support for multi-output problems by implementing this strategy in both DecisionTreeClassifier and DecisionTreeRegressor. If a decision tree is fit on an output array Y of shape (n_samples, n_outputs) then the resulting estimator will:

Output n_output values upon predict;

Output a list of n_output arrays of class probabilities upon predict_proba.

The use of multi-output trees for regression is demonstrated in Multi-output Decision Tree Regression. In this example, the input X is a single real value and the outputs Y are the sine and cosine of X.
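
A brief sketch of a multi-output regression tree, assuming scikit-learn and NumPy and using synthetic data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)       # a single real-valued input
Y = np.column_stack([np.sin(X).ravel(),        # two outputs: sine and cosine of X
                     np.cos(X).ravel()])

reg = DecisionTreeRegressor(max_depth=5).fit(X, Y)
print(reg.predict([[1.0]]))                    # both outputs predicted for one sample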

27
Q

Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit.

A

Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative.
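
For example, a dimensionality-reduction step can be chained in front of the tree; a rough sketch assuming scikit-learn and synthetic high-dimensional data:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: many features, few of them informative.
X, y = make_classification(n_samples=200, n_features=100, n_informative=5, random_state=0)

# Project onto a handful of components before growing a shallow tree.
clf = make_pipeline(PCA(n_components=5), DecisionTreeClassifier(max_depth=3, random_state=0))
clf.fit(X, y)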

28
Q

Understanding the decision tree structure will help in gaining more insights about how the decision tree makes predictions, which is important for understanding the important features in the data.

Visualize your tree as you are training by using the export function. Use max_depth=3 as an initial tree depth to get a feel for how the tree is fitting to your data, and then increase the depth.

Remember that the number of samples required to populate the tree doubles for each additional level the tree grows to. Use max_depth to control the size of the tree to prevent overfitting.

A

Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data. Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a float number can be used as percentage in these two parameters. While min_samples_split can create arbitrarily small leaves, min_samples_leaf guarantees that each leaf has a minimum size, avoiding low-variance, over-fit leaf nodes in regression problems. For classification with few classes, min_samples_leaf=1 is often the best choice.

Note that min_samples_split considers samples directly and independent of sample_weight, if provided (e.g. a node with m weighted samples is still treated as having exactly m samples). Consider min_weight_fraction_leaf or min_impurity_decrease if accounting for sample weights is required at splits.
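
A hedged starting configuration based on these tips, assuming scikit-learn and the Iris data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Shallow tree with a minimum leaf size, as suggested above, to limit overfitting.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)
print(export_text(clf))   # inspect the fitted structure before trying deeper trees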

29
Q

Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.

A

If the samples are weighted, it will be easier to optimize the tree structure using weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensure that leaf nodes contain at least a fraction of the overall sum of the sample weights.
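
As a small sketch (assuming scikit-learn and made-up imbalanced data), class_weight="balanced" gives each class the same total sample weight, which is the normalization described above, and min_weight_fraction_leaf then prunes on those weights:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: 90 samples of class 0, 10 samples of class 1.
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.array([0] * 90 + [1] * 10)

# "balanced" re-weights samples inversely to class frequencies,
# so the tree is less biased toward the dominant class.
clf = DecisionTreeClassifier(class_weight="balanced",
                             min_weight_fraction_leaf=0.05,
                             random_state=0).fit(X, y)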

30
Q

All decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset will be made.

A

If the input matrix X is very sparse, it is recommended to convert to sparse csc_matrix before calling fit and sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples.
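
A short sketch of the sparse-input recommendation, assuming SciPy and scikit-learn and a mostly-zero toy matrix:

import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature matrix that is mostly zeros.
X_dense = np.zeros((1000, 500), dtype=np.float32)
X_dense[::10, 0] = 1.0
y = (X_dense[:, 0] > 0).astype(int)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(csc_matrix(X_dense), y)            # csc format recommended before fit
pred = clf.predict(csr_matrix(X_dense))    # csr format recommended before predict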

31
Q

What are all the various decision tree algorithms and how do they differ from each other? Which one is implemented in scikit-learn?

ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalize to unseen data.

A

C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy of the rule improves without it.

C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.

CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.

scikit-learn uses an optimized version of the CART algorithm; however, the scikit-learn implementation does not support categorical variables for now.

32
Q

Decision tree learning employs a divide-and-conquer strategy, conducting a greedy search to identify the optimal split points within a tree. This process of splitting is then repeated in a top-down, recursive manner until all, or the majority of, records have been classified under specific class labels. Whether or not all data points are classified as homogeneous sets is largely dependent on the complexity of the decision tree. Smaller trees are more easily able to attain pure leaf nodes—i.e. data points in a single class. However, as a tree grows in size, it becomes increasingly difficult to maintain this purity, and it usually results in too little data falling within a given subtree. When this occurs, it is known as data fragmentation, and it can often lead to overfitting. As a result, decision tree learning has a preference for small trees, which is consistent with the principle of parsimony in Occam’s Razor; that is, “entities should not be multiplied beyond necessity.” Said differently, decision trees should add complexity only if necessary, as the simplest explanation is often the best.

A

To reduce complexity and prevent overfitting, pruning is usually employed; this is a process, which removes branches that split on features with low importance. The model’s fit can then be evaluated through the process of cross-validation. Another way that decision trees can maintain their accuracy is by forming an ensemble via a random forest algorithm; this classifier predicts more accurate results, particularly when the individual trees are uncorrelated with each other.

33
Q

Types of decision trees

A

Hunt’s algorithm, which was developed in the 1960s to model human learning in Psychology, forms the foundation of many popular decision tree algorithms, such as the following:

  • ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate candidate splits; Quinlan published research on the algorithm in 1986.
  • C4.5: This algorithm is considered a later iteration of ID3, also developed by Quinlan. It can use information gain or gain ratios to evaluate split points within the decision trees.
  • CART: The term CART is an abbreviation for “classification and regression trees” and was introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal attribute to split on. Gini impurity measures how often a randomly chosen record would be misclassified; when evaluating splits with Gini impurity, a lower value is preferable.
34
Q

Recursive binary splitting

A

In this procedure all the features are considered and different split points are tried and tested using a cost function. The split with the best cost (or lowest cost) is selected.

Consider the earlier example of a tree learned from the Titanic dataset. In the first split, at the root, all attributes/features are considered and the training data is divided into groups based on this split. We have 3 features, so we will have 3 candidate splits. Now we calculate how much accuracy each split will cost us, using a cost function. The split that costs least is chosen, which in our example is the sex of the passenger. This algorithm is recursive in nature, as the groups formed can be sub-divided using the same strategy. Because of this procedure, the algorithm is also known as a greedy algorithm: at each step it takes the split that lowers the cost the most. This makes the root node the best predictor/classifier.

35
Q

Cost of a split

A

Let’s take a closer look at the cost functions used for classification and regression. In both cases the cost function tries to find the most homogeneous branches, i.e. branches whose groups have similar responses. This makes sense: we can be more sure that a test data input will follow a certain path.

Regression: cost = Σ (y − prediction)²

Let’s say we are predicting the price of houses. The decision tree will start splitting by considering each feature in the training data. The mean of the responses of the training data inputs in a particular group is used as the prediction for that group. The above function is applied to all data points and the cost is calculated for all candidate splits. Again, the split with the lowest cost is chosen.

Classification: G = Σ pk (1 − pk)

A Gini score gives an idea of how good a split is by how mixed the response classes are in the groups created by the split. Here, pk is the proportion of inputs of class k present in a particular group. Perfect class purity occurs when a group contains only inputs from a single class, in which case each pk is either 1 or 0 and G = 0, whereas a node with a 50-50 split of classes in a group has the worst purity: for binary classification pk = 0.5 and G = 0.5.
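
A small sketch of the two cost functions in plain Python/NumPy, with made-up groups:

import numpy as np

def regression_cost(y_group):
    # Sum of squared differences from the group mean (the group's prediction).
    y = np.asarray(y_group, dtype=float)
    return float(np.sum((y - y.mean()) ** 2))

def gini(labels):
    # Gini score G = sum over classes of pk * (1 - pk) for one group of labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

print(regression_cost([200000, 250000, 300000]))  # house prices in one candidate group
print(gini([1, 1, 1, 1]))    # pure group   -> 0.0
print(gini([0, 1, 0, 1]))    # 50-50 split  -> 0.5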

36
Q

When to stop splitting?

A

You might ask: when should we stop growing a tree? A problem usually has a large set of features, which results in a large number of splits and, in turn, a huge tree. Such trees are complex and can lead to overfitting. So we need to know when to stop. One way of doing this is to set a minimum number of training inputs to use on each leaf. For example, we can require a minimum of 10 passengers to reach a decision (died or survived), and ignore any leaf that takes fewer than 10 passengers. Another way is to set the maximum depth of your model. Maximum depth refers to the length of the longest path from the root to a leaf.

37
Q

Pruning

A

The performance of a tree can be further increased by pruning. It involves removing branches that make use of features having low importance. This way, we reduce the complexity of the tree and thus increase its predictive power by reducing overfitting.

Pruning can start at either the root or the leaves. The simplest method starts at the leaves and replaces each node with its most popular class; the change is kept if it doesn’t hurt accuracy. This is also called reduced error pruning. More sophisticated pruning methods can be used, such as cost complexity pruning, where a learning parameter (alpha) is used to weigh whether nodes can be removed based on the size of the sub-tree. This is also known as weakest link pruning.
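
In scikit-learn, cost complexity pruning is exposed through the ccp_alpha parameter; a brief sketch assuming the Iris data:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Enumerate candidate alpha values for cost complexity (weakest link) pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# A larger ccp_alpha prunes more aggressively, giving a smaller tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(pruned.get_n_leaves())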

38
Q

Advantages of CART

A

Simple to understand, interpret, visualize.
Decision trees implicitly perform variable screening or feature selection.
Can handle both numerical and categorical data. Can also handle multi-output problems.
Decision trees require relatively little effort from users for data preparation.
Nonlinear relationships between parameters do not affect tree performance.

39
Q

Disadvantages of CART

A

Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting.
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, which needs to be lowered by methods like bagging and boosting.
Greedy algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree.
This covers the basics to bring you up to speed on decision tree learning. An improvement over single decision trees is made using the technique of boosting. A popular library for implementing these algorithms is Scikit-Learn. It has a wonderful API that can get your model up and running with just a few lines of code in Python.

40
Q

regression models

A

used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

41
Q

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:

A

How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).

The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

42
Q

Homogeneity of variance (homoscedasticity)

A

the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

43
Q

Independence of observations

A

the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

44
Q

Normality

A

The data follows a normal distribution.

45
Q

Linearity

A

the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.

46
Q

How to perform a multiple linear regression
Multiple linear regression formula
The formula for a multiple linear regression is:

y = \beta_0 + \beta_1 X_1 + … + \beta_n X_n + \epsilon

A

y = the predicted value of the dependent variable
B_0 = the y-intercept (value of y when all other parameters are set to 0)
B_1X_1 = the regression coefficient (B_1) of the first independent variable (X_1) (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value)
… = do the same for however many independent variables you are testing
B_nX_n = the regression coefficient of the last independent variable
\epsilon = model error (a.k.a. how much variation there is in our estimate of y)
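
A hedged sketch of fitting such a model, assuming statsmodels and made-up crop data with two independent variables:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: crop yield as a function of rainfall and temperature.
data = pd.DataFrame({
    "yield_":   [3.1, 3.4, 3.8, 4.0, 4.4, 4.1, 4.8, 5.0],
    "rainfall": [40, 45, 55, 60, 70, 65, 80, 85],
    "temp":     [15, 16, 17, 18, 18, 19, 20, 21],
})

# yield_ = B_0 + B_1 * rainfall + B_2 * temp + error
model = smf.ols("yield_ ~ rainfall + temp", data=data).fit()
print(model.params)    # estimated B_0, B_1, B_2
print(model.pvalues)   # p-value for each coefficient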

47
Q

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

A

The regression coefficients that lead to the smallest overall model error.
The t statistic of the overall model.
The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

48
Q

Presenting the results

A

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

49
Q

Regression analysis is a common statistical method used in finance and investing. Linear regression is one of the most common techniques of regression analysis. Multiple regression is a broader class of regressions that encompasses linear and nonlinear regressions with multiple explanatory variables.

Regression as a tool helps pool data together to help people and companies make informed decisions. There are different variables at play in regression, including a dependent variable—the main variable that you’re trying to understand—and an independent variable—factors that may have an impact on the dependent variable.

In order to make regression analysis work, you must collect all the relevant data. It can be presented on a graph, with an x-axis and a y-axis.

There are several main reasons people use regression analysis:

A

To predict future economic conditions, trends, or values

To determine the relationship between two or more variables

To understand how one variable changes when another changes

There are many different kinds of regression analysis. For the purpose of this article, we will look at two: linear regression and multiple regression.

50
Q

Linear regression

A

Also called simple regression, linear regression establishes the relationship between two variables. Linear regression is graphically depicted using a straight line with the slope defining how the change in one variable impacts a change in the other. The y-intercept of a linear regression relationship represents the value of one variable when the value of the other is 0.

In linear regression, every dependent value has a single corresponding independent variable that drives its value. For example, in the linear regression formula of y = 3x + 7, there is only one possible outcome of ‘y’ if ‘x’ is defined as 2.

If the relationship between two variables does not follow a straight line, nonlinear regression may be used instead. Linear and nonlinear regression are similar in that both track a particular response from a set of variables. As the relationship between the variables becomes more complex, nonlinear models have greater flexibility and capability of depicting the non-constant slope.

51
Q

Multiple regression

A

For complex connections between data, the relationship might be explained by more than one variable. In this case, an analyst uses multiple regression which attempts to explain a dependent variable using more than one independent variable.

There are two main uses for multiple regression analysis. The first is to determine the dependent variable based on multiple independent variables. For example, you may be interested in determining what a crop yield will be based on temperature, rainfall, and other independent variables. The second is to determine how strong the relationship is between each variable. For example, you may be interested in knowing how a crop yield will change if rainfall increases or the temperature decreases.

Multiple regression assumes there is not a strong relationship between each independent variable. It also assumes there is a correlation between each independent variable and the single dependent variable. Each of these relationships is weighted to ensure more impactful independent variables drive the dependent value by adding a unique regression coefficient to each independent variable.

52
Q

Linear regression vs multiple regression

A

Consider an analyst who wishes to establish a relationship between the daily change in a company’s stock prices and the daily change in trading volume. Using linear regression, the analyst can attempt to determine the relationship between the two variables:

Daily Change in Stock Price = (Coefficient)(Daily Change in Trading Volume) + (y-intercept)

If the stock price increases $0.10 before any trades occur and increases $0.01 for every share sold, the linear regression outcome is:

Daily Change in Stock Price = ($0.01)(Daily Change in Trading Volume) + $0.10

However, the analyst realizes there are several other factors to consider including the company’s P/E ratio, dividends, and prevailing inflation rate. The analyst can perform multiple regression to determine which—and how strongly—each of these variables impacts the stock price:

Daily Change in Stock Price = (Coefficient)(Daily Change in Trading Volume) + (Coefficient)(Company’s P/E Ratio) + (Coefficient)(Dividend) + (Coefficient)(Inflation Rate)

53
Q

Is multiple linear regression better than simple linear regression?

A

Multiple linear regression is a more specific calculation than simple linear regression. For straightforward relationships, simple linear regression may easily capture the relationship between the two variables. For more complex relationships requiring more consideration, multiple linear regression is often better.

54
Q

When should you use multiple linear regression

A

Multiple linear regression should be used when multiple independent variables determine the outcome of a single dependent variable. This is often the case when forecasting more complex relationships.

55
Q

How do you interpret multiple regression?

A

A multiple regression formula has multiple slopes (one for each variable) and one y-intercept. It is interpreted the same as a simple linear regression formula except there are multiple variables that all impact the slope of the relationship.

56
Q

Understanding multiple linear regression

A

Simple linear regression enables statisticians to predict the value of one variable using the available information about another variable. Linear regression attempts to establish the relationship between the two variables along a straight line.

Multiple regression is a type of regression where the dependent variable shows a linear relationship with two or more independent variables. It can also be non-linear, where the dependent and independent variables do not follow a straight line.

Both linear and non-linear regression track a particular response using two or more variables graphically. However, non-linear regression is usually difficult to execute since it is created from assumptions derived from trial and error.

57
Q

A linear relationship between the dependent and independent variables

A

The first assumption of multiple linear regression is that there is a linear relationship between the dependent variable and each of the independent variables. The best way to check the linear relationships is to create scatterplots and then visually inspect the scatterplots for linearity. If the relationship displayed in the scatterplot is not linear, then the analyst will need to run a non-linear regression or transform the data using statistical software, such as SPSS.

58
Q

The independent variables are not highly correlated with each other

A

The data should not show multicollinearity, which occurs when the independent variables (explanatory variables) are highly correlated. When independent variables show multicollinearity, there will be problems figuring out the specific variable that contributes to the variance in the dependent variable. The best method to test for the assumption is the Variance Inflation Factor method.
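
A brief sketch of computing VIFs, assuming statsmodels and pandas and a hypothetical predictor table; values well above roughly 5-10 are commonly taken to flag multicollinearity:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors.
X = pd.DataFrame({
    "rainfall": [40, 45, 55, 60, 70, 65, 80, 85],
    "temp":     [15, 16, 17, 18, 18, 19, 20, 21],
})
X = add_constant(X)   # VIFs are usually computed with an intercept column included

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)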

59
Q

The variance of the residuals is constant

A

Multiple linear regression assumes that the amount of error in the residuals is similar at each point of the linear model. This scenario is known as homoscedasticity. When analyzing the data, the analyst should plot the standardized residuals against the predicted values to determine if the points are distributed fairly across all the values of independent variables. To test the assumption, the data can be plotted on a scatterplot or by using statistical software to produce a scatterplot that includes the entire model.

60
Q

Independence of observation

A

The model assumes that the observations are independent of one another. Simply put, the model assumes that the values of the residuals are independent. To test this assumption, we use the Durbin-Watson statistic.

The statistic ranges from 0 to 4: values below 2 suggest positive autocorrelation, values above 2 suggest negative autocorrelation, and the mid-point value of 2 indicates no autocorrelation.
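
A small self-contained sketch, assuming statsmodels and NumPy with simulated data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Simulated response with independent errors (hypothetical data).
rng = np.random.RandomState(0)
x = np.arange(50, dtype=float)
y = 2.0 * x + rng.normal(scale=5.0, size=50)

results = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(results.resid))   # close to 2 -> little evidence of autocorrelation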

61
Q

Multivariate normality

A

Multivariate normality occurs when residuals are normally distributed. To test this assumption, look at how the values of residuals are distributed. It can also be tested using two main methods, i.e., a histogram with a superimposed normal curve or the Normal Probability Plot method.