Module 7: Descriptive statistics Flashcards
Different types of data, numerical and categorical
After data cleaning and preparation tasks are complete, a good next step is to check descriptive statistics. These measures summarize the features and help in understanding the main characteristics of the data.
Before that, though, let's discuss the different types of data – this matters because which algorithms can be used depends on the type of data:
1. Numerical or quantitative
a. Discrete: numerical values obtained by counting – for example, the number of petals on a flower
b. Continuous: data points obtained by measuring – the height of a student or the temperature of the water
2. Categorical or qualitative
a. Ordinal: values that can be ordered or ranked – natural order of hot, medium, cold
b. Nominal: categorical data that doesn’t have an order – a list of Canadian provinces
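In pandas, this distinction can be made explicit with the categorical dtype. A small sketch using the hot/medium/cold and provinces examples above (the ordered=True flag is what marks a variable as ordinal):

```python
import pandas as pd

# Ordinal: categories with a natural order (cold < medium < hot)
temps = pd.Series(["hot", "cold", "medium", "hot"],
                  dtype=pd.CategoricalDtype(categories=["cold", "medium", "hot"],
                                            ordered=True))
print(temps.min())            # cold – ordering comparisons are meaningful

# Nominal: categories with no inherent order
provinces = pd.Series(["Ontario", "Quebec", "Alberta"], dtype="category")
print(provinces.cat.ordered)  # False – min/max would not be meaningful
```

Marking ordinal data as ordered lets pandas sort and compare the categories by their natural order rather than alphabetically.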
Summary statistics: measures of location - mean, median, mode
Measures of location:
- Mean: the average of the values
- Median: the middle value in a sorted list of values
o Not affected by outliers in the dataset, unlike the mean
- Mode: the most frequently occurring value
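As a quick illustration of how these measures react differently to an outlier (the numbers here are made up for the example):

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 100])   # 100 is an outlier
print(values.mean())     # 21.6 – pulled far upward by the outlier
print(values.median())   # 2.0 – unaffected by the outlier
print(values.mode()[0])  # 2   – the most frequently occurring value
```

The median and mode stay near the bulk of the data, while the single outlier drags the mean well above every typical value.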
import pandas as pd
import numpy as np
iris = pd.read_csv('iris.data', sep=',',
                   header=None,  # the data file does not contain a header
                   names=['sepal length', 'sepal width', 'petal length', 'petal width', 'class']  # names of columns
                   )
iris.head()
   sepal length  sepal width  petal length  petal width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Let’s group the data by the class of the flower and calculate the measures of location for each class and for each variable
grouping data by class
iris_grouped = iris.groupby('class')
calculating mean (average) for each class:
iris_grouped.mean()
                 sepal length  sepal width  petal length  petal width
class
Iris-setosa             5.006        3.418         1.464        0.244
Iris-versicolor         5.936        2.770         4.260        1.326
Iris-virginica          6.588        2.974         5.552        2.026
calculating median for each class:
iris_grouped.median()
                 sepal length  sepal width  petal length  petal width
class
Iris-setosa               5.0          3.4          1.50          0.2
Iris-versicolor           5.9          2.8          4.35          1.3
Iris-virginica            6.5          3.0          5.55          2.0
Note: when we compare the mean and median values above, they are close to each other – this is a good indication that the data is distributed symmetrically around the mean
Summary statistics: measures of spread - variance, standard deviation
- Variance: is a measure of the variability in the data – it measures how far values are spread out from the mean – roughly the average squared distance from the mean
- Standard deviation: the square root of the variance; it describes how close the typical data point is to the mean – if the data is normally distributed, about 68% of the data will be within one standard deviation of the mean and about 95% will be within two standard deviations
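The variance definition above can be checked by hand against pandas (the data values are made up for the example; note that pandas divides by n − 1, the sample variance, by default):

```python
import pandas as pd

data = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = data.mean()                                       # 5.0
# Variance: roughly the average squared distance from the mean;
# pandas uses the sample variance, dividing by n - 1
var_manual = ((data - mean) ** 2).sum() / (len(data) - 1)
print(var_manual)    # 4.571...
print(data.var())    # the same value from pandas
print(data.std())    # square root of the variance, ~2.138
```

The manual computation and data.var() agree, and data.std() is exactly the square root of the variance.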
Variance:
iris_grouped.var()
                 sepal length  sepal width  petal length  petal width
class
Iris-setosa          0.124249     0.145180      0.030106     0.011494
Iris-versicolor      0.266433     0.098469      0.220816     0.039106
Iris-virginica       0.404343     0.104004      0.304588     0.075433
Standard deviation:
iris_grouped.std()
                 sepal length  sepal width  petal length  petal width
class
Iris-setosa          0.352490     0.381024      0.173511     0.107210
Iris-versicolor      0.516171     0.313798      0.469911     0.197753
Iris-virginica       0.635880     0.322497      0.551895     0.274650
Data Distribution
A frequency distribution is usually represented as a list, table, or graph.
- In a table, it shows the number of values that fall within certain data intervals
In this example, we are researching house prices in a certain part of the city. We collected data for the houses on the market and placed it in the table below. To make the analysis easier, we grouped the prices into bins and counted how many houses fall within each bin.
Price range, $ Number of houses
0-200,000 2
200,001 - 300,000 10
300,001 - 400,000 15
400,001 - 500,000 25
500,001 - 600,000 30
600,001 - 700,000 25
700,001 - 800,000 15
800,001 - 900,000 10
900,001 - 1,000,000 2
This is a frequency table; if we plot this data as a bar graph, we get a frequency distribution graph, also called a histogram. Histograms provide a view of data density and are very convenient for describing the shape of the data distribution. An example is below; first we create a DataFrame.
houses = pd.DataFrame({
    'price, thousands': [200, 300, 400, 500, 600, 700, 800, 900, 1000],
    'num of houses': [2, 10, 15, 25, 30, 25, 15, 10, 2]
})
houses.head()
   price, thousands  num of houses
0               200              2
1               300             10
2               400             15
3               500             25
4               600             30
We can now plot the frequency distribution graph. The hist() function creates a histogram from raw data by building the bins and calculating the frequencies itself; since we have already binned the data, in this case we draw a bar plot instead. For plotting, we will use the visualization library matplotlib.
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('seaborn-whitegrid')
plt.rcParams["figure.figsize"] = (7, 7)
houses.plot(x='price, thousands', y='num of houses', kind='bar', color='blue', width=0.9)
plt.show()
The plot shows that the distribution seems symmetric and bell-shaped. This is an example of a normal distribution. Half of the data will fall to the left of the mean; half will fall to the right. Let’s explore the mean, median, and SD for this dataset:
houses['num of houses'].mean()
14.88888888888889
houses['num of houses'].median()
15.0
houses['num of houses'].std()
10.080233683358294
If the mean and median were different, the shape of the curve would shift to have a left or right skew.
If the SD were larger, the curve would be wider; if it were smaller, the curve would be narrower.
We discussed above that SD is a measure used to quantify the amount of variability in the data: a low SD means little variability, with the data points tightly clustered around the mean; a high SD shows that the values are spread out over a wider range.
Here is an example:
mu = 100
sigma = 30  # changed standard deviation from 15 to 30
x = mu + sigma * np.random.randn(10000)
num_bins = 50  # number of bins to create
n, bins, patches = plt.hist(x, num_bins, density=True, facecolor='green', alpha=0.8)
plt.xlabel('X')
plt.axis([0, 200, 0, 0.03])
plt.ylabel('Frequencies')
plt.title("Sample of Normal Distribution, " + r"$\mu=100,\ \sigma=30$")
plt.show()
You can see how wide the distribution is here, showing that there is a larger SD.
The graph below summarizes measures of the normal distribution – the histogram demonstrates the empirical rule:
This histogram shows that for the normal distribution, values less than one standard deviation from the mean account for 68.27% of the dataset, about 95% of the data is within two standard deviations of the mean, and 99.7% is within three standard deviations.
- Remember that whether a given SD is good or bad depends entirely on the context of the data:
o If producing something takes 60 minutes on average with a SD of 2 minutes, then about 99.7% of production runs will fall between 54 and 66 minutes (the mean ± 3 SD). You likely don't want a large SD in business production times
o A high SD shows a large spread; you might see this when looking at house prices in the GTA. This isn't necessarily bad – it is simply expected
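The empirical rule above can be checked directly by sampling from a normal distribution with numpy (the mean of 100, SD of 15, sample size, and seed here are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
# draw a large sample from a normal distribution with mean 100, SD 15
sample = rng.normal(loc=100, scale=15, size=100_000)

# fraction of the sample within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(sample - 100) < k * 15)
    print(f"within {k} SD: {frac:.3f}")   # roughly 0.683, 0.954, 0.997
```

With a large enough sample, the observed fractions land very close to the theoretical 68.27%, 95%, and 99.7%.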
Measures of shape - skewness and kurtosis
There are two measures that describe the shape of a distribution:
1) Skewness is a measure of the asymmetry of a distribution. When a distribution trails off to the right as a tail on the right-hand side of the graph, the shape is said to be right-skewed or positively skewed – when a distribution trails off to the left, we say it is left-skewed or negatively skewed
2) Kurtosis is a measure of the tailedness of the distribution – it is a good indicator of outliers in the data. The normal distribution has a kurtosis of 3; a distribution with lower kurtosis produces fewer outliers than the normal, and one with higher kurtosis produces more
iris_grouped.skew()
sepal length sepal width petal length petal width
class
Iris-setosa 0.120087 0.107053 0.071846 1.197243
Iris-versicolor 0.105378 -0.362845 -0.606508 -0.031180
Iris-virginica 0.118015 0.365949 0.549445 -0.129477
As you can see, setosa is right-skewed (positively skewed) since all of its values are above 0; versicolor is mostly left-skewed (negatively skewed); and virginica is also positively skewed except for petal width
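Kurtosis can be computed the same way with kurt(). One caveat worth knowing: pandas reports excess kurtosis (Fisher's definition), which subtracts 3, so normally distributed data scores near 0 rather than 3. A sketch with synthetic data (the sample sizes, seed, and choice of a t-distribution for the heavy-tailed example are arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# pandas' kurt() reports *excess* kurtosis (Fisher's definition),
# so a normal distribution scores near 0, not 3
normal_sample = pd.Series(rng.normal(size=100_000))
print(normal_sample.kurt())   # close to 0

# a heavier-tailed distribution (more outliers) scores well above 0
heavy_tailed = pd.Series(rng.standard_t(df=5, size=100_000))
print(heavy_tailed.kurt())    # clearly positive
```

So when comparing against the "normal distribution has a kurtosis of 3" rule, remember that pandas' output is already shifted down by 3.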
What are correlations, positive, negative, correlation coefficients
The correlation coefficient always takes a value between -1 and 1 and describes the strength of the linear relationship between two variables.
- Two variables can vary together – covariance measures this joint variability
The sign of the covariance shows the direction of the linear relationship between the variables:
- Positive: the variables tend to move in the same direction
- Negative: the variables tend to move in opposite directions
Since the magnitude of covariance is not easily interpretable, the correlation coefficient is used as a measure:
- If the linear relationship is strong and negative, the correlation coefficient R will be near -1
- If there is no apparent linear relationship, the correlation coefficient R will be near 0
- If the linear relationship is strong and positive, the correlation coefficient R will be near +1
iris['sepal length'].corr(iris['petal length'])
0.8717541573048718
There seems to be a strong positive correlation between sepal length and petal length.
If we want to calculate correlations within groups:
Use the corrwith() function which allows us to calculate pairwise correlation between the columns of two DataFrame objects
correlations = (iris[['sepal length', 'class']]
    .groupby('class')
    .corrwith(iris['petal length'])
    .rename(columns={'sepal length': 'Corr Coef'}))
correlations
Corr Coef
class
Iris-setosa 0.263874
Iris-versicolor 0.754049
Iris-virginica 0.864225
Visualization in tandem with quantitative analysis is important because it reveals whether the relationship is actually linear. In the example below, all correlation coefficients are the same, but only one relationship is linear – you would only know that by plotting the data.
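The classic demonstration of this is Anscombe's quartet. The minimal sketch below makes a related point with made-up data: a perfect but nonlinear relationship can still yield a correlation coefficient near zero, which is why plotting matters.

```python
import numpy as np
import pandas as pd

# x is symmetric around 0, and y is completely determined by x
x = pd.Series(np.linspace(-3, 3, 101))
y = x ** 2                 # a perfect, but nonlinear, relationship

print(x.corr(y))           # near 0: no *linear* relationship at all
```

Despite y being a deterministic function of x, the Pearson correlation is essentially zero because it only measures linear association.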