Sect 3.2 - Grubbs' outlier test, p-values from linear regression, decision trees, & multiple regression Flashcards
Grubbs’ test is used to find a single outlier in a normally distributed data set. The test finds if a minimum value or a maximum value is an outlier.
Cautions:
The test is only used to find a single outlier in normally distributed data (excluding the potential outlier). If you think that your data set has more than one outlier, use the generalized extreme studentized deviate test or Tietjen-Moore test instead.
Using this test on non-normal distributions will give false results.
Run a test for normality (like the Shapiro-Wilk test) before running Grubbs’ test. If you find your data set isn’t normally distributed, try removing the potential outlier from the data set and running the normality test again. If your data still isn’t normal, don’t run this test.
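As a rough sketch of that normality check, here is what a Shapiro-Wilk test looks like in Python with SciPy; the data values are hypothetical and only illustrate the workflow.

from scipy.stats import shapiro

# Hypothetical sample with one suspiciously large value
data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 15.6]

stat, p = shapiro(data)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")

# If p is small (e.g. < 0.05), the data look non-normal; try removing the
# suspect point and re-running the check before deciding whether Grubbs' test applies.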
Running Grubbs’ Test
The test is a deceptively simple one to run. It checks for outliers by looking for the maximum of the absolute differences between the values and the mean. Basically, the steps are:
Find the G test statistic.
Find the G Critical Value.
Compare the test statistic to the G critical value.
Reject the point as an outlier if the test statistic is greater than the critical value.
The formulas differ slightly depending on whether you want to check for an outlier in one end of the data (a one-tailed test) or in both ends at the same time (a two-tailed test). For simplicity, I'd recommend starting with a one-tailed test: (a) the equation is easier to work by hand, and (b) it simplifies the decision to reject (or keep) a single minimum or maximum point.
Find the G Test Statistic
Step 1: Order the data points from smallest to largest.
Step 2: Find the mean (x̄) and standard deviation of the data set.
Step 3: Calculate the G test statistic using one of the following equations:
The Grubbs’ test statistic for a two-tailed test is:
G = max{i = 1, …, N} |Yi − ȳ| / s
Where:
ȳ is the sample mean,
s = sample standard deviation.
A left-tailed test uses the test statistic:
G = (ȳ − Ymin) / s
Where Ymin is the minimum value.
For a right-tailed test, use:
G = (Ymax − ȳ) / s
Where Ymax is the maximum value.
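A minimal sketch of Step 3 in Python with NumPy, using the same hypothetical sample as above; note the sample standard deviation uses N − 1 in the denominator (ddof=1).

import numpy as np

data = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 15.6])  # hypothetical sample

mean = data.mean()
s = data.std(ddof=1)  # sample standard deviation

G_two_tailed = np.max(np.abs(data - mean)) / s  # two-tailed statistic
G_left = (mean - data.min()) / s                # left-tailed (minimum as candidate outlier)
G_right = (data.max() - mean) / s               # right-tailed (maximum as candidate outlier)

print(G_two_tailed, G_left, G_right)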
Find the G Critical Value.
Several tables exist for finding the critical value for Grubbs' test. The one below is a partial table of G critical values for several alpha levels; full tables are available in standard statistics references. When looking up G critical values, make sure you're using the right table for your test (i.e. one-tailed or two-tailed).
[G critical value table]
Manually, you can find the G critical value with a formula.
G > ((N − 1) / √N) · √( t²α/(2N),N−2 / (N − 2 + t²α/(2N),N−2) )
Where:
tα/(2N),N−2 is the upper critical value of a t-distribution with N-2 degrees of freedom.
For a one-tailed test, replace α/(2N) with α/N.
Accept or Reject the Outlier
Compare your G test statistic to the G critical value:
Gtest < Gcritical: keep the point in the data set; it is not an outlier.
Gtest > Gcritical: reject the point as an outlier.
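Putting the last two steps together, here is a hedged sketch of the two-tailed critical value and the accept/reject decision, using scipy.stats.t for the t quantile and continuing the hypothetical sample above.

import numpy as np
from scipy.stats import t

data = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 15.6])  # hypothetical sample
N = len(data)
alpha = 0.05

G = np.max(np.abs(data - data.mean())) / data.std(ddof=1)  # two-tailed G statistic

# Upper critical value of the t-distribution with N - 2 degrees of freedom
# at significance alpha/(2N); use alpha/N instead for a one-tailed test.
t_crit = t.ppf(1 - alpha / (2 * N), N - 2)
G_crit = ((N - 1) / np.sqrt(N)) * np.sqrt(t_crit**2 / (N - 2 + t_crit**2))

if G > G_crit:
    print(f"G = {G:.3f} > G_crit = {G_crit:.3f}: reject the point as an outlier")
else:
    print(f"G = {G:.3f} <= G_crit = {G_crit:.3f}: keep the point")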
The p-value is the probability, computed in a statistical hypothesis test, of obtaining results at least as extreme as those observed, taking the null hypothesis to be correct.
It is often used as an alternative to fixed rejection points: it is the smallest significance level at which the null hypothesis would be rejected.
In R, when you apply the summary() function to the model, asterisks (stars) appear beside each p-value. The more stars beside the p-value, the more significant the variable is.
How to interpret P-values for linear models?
The p-value, as noted above, gives the probability from the hypothesis test. In a regression model, the p-value for each independent variable tests the null hypothesis that there is no correlation between that independent variable and the dependent variable; it also helps determine whether the relationship observed in the sample exists in the larger population.
If the p-value is less than the significance level (usually 0.05), the variable is statistically significant: the sample provides enough evidence of a relationship between that variable and the dependent variable.
The significance level is the probability of rejecting the null hypothesis when it is true.
P >= 0.1: absence of evidence against the null hypothesis; data consistent with the null hypothesis
0.05 <= P < 0.1: weak evidence against the null hypothesis in favour of the alternative
0.01 <= P < 0.05: strong evidence against the null hypothesis in favour of the alternative
P < 0.001: very strong evidence against the null hypothesis in favour of the alternative
Why should the P-value be less than 0.05?
A significance level of 0.05 indicates a 5% risk of concluding that a difference exists between the variables when there is no actual difference. In other words, if the p-value for a variable is less than your significance level, the sample data provide enough evidence to reject the null hypothesis for the entire population. Conversely, a p-value greater than 0.05 indicates weak evidence, and we fail to reject the null hypothesis.
Hypothesis testing is a statistical procedure to test if your results are valid.
In our example, we are testing if the true coefficient of Average_Pulse and the intercept is equal to zero.
Hypothesis test has two statements. The null hypothesis and the alternative hypothesis.
The null hypothesis can be shortly written as H0
The alternative hypothesis can be shortly written as HA
Mathematically written:
H0: Average_Pulse = 0
HA: Average_Pulse ≠ 0
H0: Intercept = 0
HA: Intercept ≠ 0
The null hypothesis can either be rejected or not.
If we reject the null hypothesis, we conclude that a relationship exists between Average_Pulse and Calorie_Burnage. The P-value is used for this conclusion.
A common threshold of the P-value is 0.05.
Note: A threshold of 0.05 means that, when the null hypothesis is true, we will falsely reject it 5% of the time. In other words, we accept a 5% risk of falsely concluding that a relationship exists.
If the P-value is lower than 0.05, we can reject the null hypothesis and conclude that a relationship exists between the variables.
However, the P-value of Average_Pulse is 0.824. So, we cannot conclude a relationship between Average_Pulse and Calorie_Burnage.
A p-value of 0.824 means that, if the true coefficient of Average_Pulse were zero, we would see a result at least this extreme about 82.4% of the time, so the data provide essentially no evidence against the null hypothesis.
The intercept is used to adjust the regression function’s ability to predict more precisely. It is therefore uncommon to interpret the P-value of the intercept.
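A hedged sketch of this hypothesis test in Python with statsmodels; the Average_Pulse / Calorie_Burnage values below are made up for illustration (the real dataset is not reproduced here), so the resulting p-values will not match the 0.824 reported above.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the Average_Pulse / Calorie_Burnage data
df = pd.DataFrame({
    "Average_Pulse":   [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [260, 245, 280, 265, 300, 275, 310, 285, 320, 305],
})

model = smf.ols("Calorie_Burnage ~ Average_Pulse", data=df).fit()

# The "P>|t|" column of the summary holds the p-values for the intercept
# and for Average_Pulse; compare them to the 0.05 threshold.
print(model.summary())
print(model.pvalues)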
Regression analysis is a form of inferential statistics. The p values in regression help determine whether the relationships that you observe in your sample also exist in the larger population. The linear regression p value for each independent variable tests the null hypothesis that the variable has no correlation with the dependent variable.
If there is no correlation, there is no association between the changes in the independent variable and the shifts in the dependent variable. In other words, there is insufficient evidence to conclude that there is an effect at the population level.
If the p-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there is a non-zero correlation. Changes in the independent variable are associated with changes in the dependent variable at the population level. This variable is statistically significant and probably a worthwhile addition to your regression model.
On the other hand, when a p value in regression is greater than the significance level, it indicates there is insufficient evidence in your sample to conclude that a non-zero correlation exists.
What does the coefficient mean? The sign of a linear regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.
The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.
The linear regression coefficients in your statistical output are estimates of the actual population parameters. To obtain unbiased coefficient estimates that have the minimum variance, and to be able to trust the p-values, your model must satisfy the seven classical assumptions of OLS linear regression.
Statisticians consider linear regression coefficients to be an unstandardized effect size because they indicate the strength of the relationship between variables using values that retain the natural units of the dependent variable. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.
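To illustrate the "holding the other variables constant" interpretation, here is a small hedged sketch of a multiple regression with two made-up predictors; the variable names and the data-generating process are hypothetical.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "hours_exercise": rng.uniform(0, 10, n),
    "avg_pulse": rng.uniform(70, 130, n),
})
# Hypothetical data-generating process: positive effect of exercise, small effect of pulse, plus noise
df["calories"] = 200 + 15 * df["hours_exercise"] + 0.5 * df["avg_pulse"] + rng.normal(0, 20, n)

model = smf.ols("calories ~ hours_exercise + avg_pulse", data=df).fit()

# Each coefficient estimates the change in mean calories for a one-unit increase
# in that predictor while the other predictor is held constant; the p-value in
# the same row tests the null hypothesis that the coefficient is zero.
print(model.params)
print(model.pvalues)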
Decision Trees (DTs)
A non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
Advantages of decision trees
Simple to understand and to interpret. Trees can be visualized.
Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.
Able to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See algorithms for more information.
Able to handle multi-output problems.
Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.
Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.
Disadvantages of decision trees
Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem (see the sketch after this list).
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
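As referenced above, here is a minimal sketch of the complexity controls scikit-learn exposes for limiting overfitting (max_depth, min_samples_leaf, and cost-complexity pruning via ccp_alpha); the specific parameter values are arbitrary and only illustrate the idea.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Restrict depth and leaf size, and apply cost-complexity pruning, to reduce overfitting
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, ccp_alpha=0.01, random_state=0)
clf.fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))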
DecisionTreeClassifier
A class capable of performing multi-class classification on a dataset.
As with other classifiers, DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense, of shape (n_samples, n_features) holding the training samples, and an array Y of integer values, shape (n_samples,), holding the class labels for the training samples:
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
After being fitted, the model can then be used to predict the class of samples:
>>> clf.predict([[2., 2.]])
array([1])
In case that there are multiple classes with the same and highest probability, the classifier will predict the class with the lowest index amongst those classes.
As an alternative to outputting a specific class, the probability of each class can be predicted, which is the fraction of training samples of the class in a leaf:
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])
DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, …, K-1]) classification.
Using the Iris dataset, we can construct a tree as follows:
>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)
Once trained, you can plot the tree with the plot_tree function:
>>> tree.plot_tree(clf)
The export_graphviz exporter also supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these plots inline automatically:
>>> import graphviz
>>> dot_data = tree.export_graphviz(clf, out_file=None,
...                                 feature_names=iris.feature_names,
...                                 class_names=iris.target_names,
...                                 filled=True, rounded=True,
...                                 special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph
Alternatively, the tree can also be exported in textual format with the function export_text. This method doesn’t require the installation of external libraries and is more compact:
>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.tree import export_text
>>> iris = load_iris()
>>> decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
>>> decision_tree = decision_tree.fit(iris.data, iris.target)
>>> r = export_text(decision_tree, feature_names=iris['feature_names'])
>>> print(r)
|— petal width (cm) <= 0.80
| |— class: 0
|— petal width (cm) > 0.80
| |— petal width (cm) <= 1.75
| | |— class: 1
| |— petal width (cm) > 1.75
| | |— class: 2