Lesson 15 Statistics Flashcards
Import the packages for math, statistics, NumPy, SciPy and pandas
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd
Create a list inserting a nan between 2.5 and 4
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
What are the three different ways of getting a nan value?
- float('nan')
- math.nan
- np.nan
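All three spellings give the same float value. A minimal sketch of how nan behaves in comparisons, using only the standard library (np.nan is left out to keep the example dependency-free; the variable names are illustrative):

```python
import math

a = float('nan')
b = math.nan

# nan never compares equal to anything, including itself
nan_equals_itself = (a == a)

# math.isnan is the reliable way to test for nan
both_are_nan = math.isnan(a) and math.isnan(b)

print(nan_equals_itself, both_are_nan)
```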
Create np.ndarray and pd.Series objects that correspond to x and x_with_nan from the following lists:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)
Find the mean using Python's built-in statistics module.
mean_ = statistics.mean(x)
mean_
What is another function to calculate the mean?
mean_ = statistics.fmean(x)
mean_
What value will the mean return if there are nan values present?
nan
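A short sketch of this behaviour, assuming the x_with_nan list from the earlier card. Filtering with math.isnan is one possible way to get a numeric result without NumPy; the names clean and mean_clean are illustrative:

```python
import math
import statistics

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Both functions propagate nan instead of raising an error
mean_with_nan = statistics.mean(x_with_nan)
fmean_with_nan = statistics.fmean(x_with_nan)

# Drop nan values first if a numeric result is wanted
clean = [v for v in x_with_nan if not math.isnan(v)]
mean_clean = statistics.mean(clean)

print(mean_with_nan, fmean_with_nan, mean_clean)
```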
How do you calculate the mean with NumPy?
mean_ = np.mean(y)
mean_
Write the code to calculate the mean but ignore any nan values.
np.nanmean(y_with_nan)
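For comparison, a small sketch showing np.mean and np.nanmean side by side on the same array (the names plain and ignoring are illustrative):

```python
import math
import numpy as np

y_with_nan = np.array([8.0, 1, 2.5, math.nan, 4, 28.0])

plain = np.mean(y_with_nan)        # nan propagates into the result
ignoring = np.nanmean(y_with_nan)  # nan entries are skipped

print(plain, ignoring)
```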
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
y, z, w = np.array(x), pd.Series(x), np.array(w)
wmean = np.average(y, weights=w)
print(wmean)
Calculate the weighted mean of a NumPy array or a pandas Series
wmean = np.average(y, weights=w)  # NumPy array
wmean = np.average(z, weights=w)  # pandas Series
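The weighted mean is the sum of each value times its weight, divided by the sum of the weights. A pure-Python sketch of that formula, using the same x and w as the cards above (wmean_manual is an illustrative name):

```python
import math

x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]

# weighted mean = sum(w_i * x_i) / sum(w_i)
wmean_manual = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

print(wmean_manual)
```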
Calculate the harmonic mean using statistics library
hmean = statistics.harmonic_mean(x)
What happens if you input the following for a harmonic mean?
- a nan value
- 0
- a negative number
- nan value: returns nan
- 0: returns 0
- negative number: raises statistics.StatisticsError
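A sketch that checks the three cases directly. The input lists are made up for illustration:

```python
import math
import statistics

# A nan anywhere in the data gives nan
with_nan = statistics.harmonic_mean([1.0, math.nan, 2.0])

# A zero anywhere in the data gives 0
with_zero = statistics.harmonic_mean([1.0, 0, 2.0])

# A negative value raises StatisticsError
try:
    statistics.harmonic_mean([1.0, -2.0, 2.0])
    negative_raised = False
except statistics.StatisticsError:
    negative_raised = True

print(with_nan, with_zero, negative_raised)
```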
Calculate the geometric mean
gmean = statistics.geometric_mean(x)
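The geometric mean is the n-th root of the product of n values, which is the same as the exponential of the mean of the logs. A sketch comparing the library call with that definition (gmean_manual is an illustrative name):

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

gmean = statistics.geometric_mean(x)

# Equivalent definition: exp of the arithmetic mean of the logarithms
logs = [math.log(v) for v in x]
gmean_manual = math.exp(statistics.fmean(logs))

print(gmean, gmean_manual)
```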
What is the main difference between the behaviour of the mean and the median?
The main difference between the behavior of the mean and median is related to dataset outliers or extremes. The mean is heavily affected by outliers, but the median only depends on outliers either slightly or not at all.
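A sketch of that effect, treating 28.0 as the outlier in the list used on these cards. Removing it moves the mean a lot and the median only a little:

```python
import statistics

x = [1, 2.5, 4, 8.0, 28.0]

mean_all = statistics.mean(x)               # 8.7
median_all = statistics.median(x)           # 4

# Drop the outlier 28.0 and recompute
mean_trimmed = statistics.mean(x[:-1])      # 3.875
median_trimmed = statistics.median(x[:-1])  # 3.25

print(mean_all, median_all, mean_trimmed, median_trimmed)
```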
x is [1, 2.5, 4, 8.0, 28.0]
Find the median of the list x
median_ = statistics.median(x)
x is [1, 2.5, 4, 8.0, 28.0]. Slice the list so you remove the 28.0 and find the median.
median_ = statistics.median(x[:-1])
If the number of elements is even there are two middle values: find the lower median value from this list:
x is [1, 2.5, 4, 8.0, 28.0]
statistics.median_low(x[:-1])
If the number of elements is even there are two middle values: find the higher median value from this list:
x is [1, 2.5, 4, 8.0, 28.0]
statistics.median_high(x[:-1])
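A sketch putting the three median functions side by side on the even-length slice (the name even is illustrative):

```python
import statistics

x = [1, 2.5, 4, 8.0, 28.0]
even = x[:-1]  # [1, 2.5, 4, 8.0], so the two middle values are 2.5 and 4

low = statistics.median_low(even)    # 2.5, the lower middle value
high = statistics.median_high(even)  # 4, the higher middle value
mid = statistics.median(even)        # 3.25, the average of the two

print(low, high, mid)
```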
Calculate the mode returning a single value.
mode_ = statistics.mode(u)
Calculate the mode returning all modes
mode_ = statistics.multimode(u)
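The list u is not defined anywhere in this set, so the sketch below uses made-up sample data for u and v to show the difference between the two functions. It assumes Python 3.8 or later, where statistics.multimode is available:

```python
import statistics

# Illustrative data only, not taken from the cards
u = [2, 3, 2, 8, 12]
v = [12, 15, 12, 15, 21, 15, 12]

mode_u = statistics.mode(u)             # single most common value
multimode_u = statistics.multimode(u)   # list with one entry
multimode_v = statistics.multimode(v)   # 12 and 15 both appear three times

print(mode_u, multimode_u, multimode_v)
```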
Calculate the mode using the following series (finish the code):
u, v, w = pd.Series(u), pd.Series(v), pd.Series(
u, v, w = pd.Series(u), pd.Series(v), pd.Series([2, 2, math.nan])
Calculate the variance
var_ = statistics.variance(x)
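statistics.variance computes the sample variance, which divides the sum of squared deviations by n - 1. A sketch of that formula next to the library call (var_manual is an illustrative name):

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

n = len(x)
mean_ = sum(x) / n

# sample variance: squared deviations from the mean, divided by n - 1
var_manual = sum((v - mean_) ** 2 for v in x) / (n - 1)
var_lib = statistics.variance(x)

print(var_manual, var_lib)
```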
Calculate the variance using NumPy
var_ = np.var(y, ddof=1)
OR
var_ = y.var(ddof=1)
Calculate the variance, ignoring nan values
np.nanvar(y_with_nan, ddof=1)
Calculate variance with pandas (it skips nan values by default).
z_with_nan.var(ddof=1)
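A sketch of how nan values affect each variance call, assuming the x_with_nan data from the earlier cards. The skipna=False call is included only to show what pandas does when the default is switched off:

```python
import math
import numpy as np
import pandas as pd

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
y_with_nan = np.array(x_with_nan)
z_with_nan = pd.Series(x_with_nan)

var_plain = np.var(y_with_nan, ddof=1)            # nan propagates
var_nan_ignored = np.nanvar(y_with_nan, ddof=1)   # nan entries skipped
var_pandas = z_with_nan.var(ddof=1)               # skipna=True is the default
var_pandas_strict = z_with_nan.var(ddof=1, skipna=False)  # nan propagates

print(var_plain, var_nan_ignored, var_pandas, var_pandas_strict)
```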
Calculate the standard deviation
std_ = statistics.stdev(x)
Use numpy to calculate standard deviation
np.std(y, ddof=1)
OR
y.std(ddof=1)
Use this list to show the sample 25th and 75th percentiles.
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
statistics.quantiles(x, n=4, method='inclusive')
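With n=4 the function returns three cut points: the 25th, 50th and 75th percentiles. A sketch comparing the 'inclusive' method with the default 'exclusive' method on the same list; the two treat the ends of the data differently, so the outer quartiles can differ:

```python
import statistics

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

# Treats the data as including the population minimum and maximum
q_inclusive = statistics.quantiles(x, n=4, method='inclusive')

# Default method; allows for values more extreme than those in the sample
q_exclusive = statistics.quantiles(x, n=4)

print(q_inclusive)
print(q_exclusive)
```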
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
y = np.array(x)
In the given array y, find the 5th percentile and the 95th percentile
# 5th percentile
np.percentile(y, 5)
# 95th percentile
np.percentile(y, 95)
Find the percentiles in an array with nan values
np.nanpercentile(y_with_nan, [25, 50, 75])
Make a correlation matrix to show the correlation coefficients from the following arrays:
np.array([14.2, 16.4, 15.2, 22.6, 17.2])
np.array([215, 325, 332, 445, 408])
corr_matrix = np.corrcoef(np.array([14.2, 16.4, 15.2, 22.6, 17.2]), np.array([215, 325, 332, 445, 408]))
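np.corrcoef returns a 2x2 matrix of Pearson correlation coefficients, not covariances: the diagonal is 1 and the two off-diagonal entries hold the coefficient between the arrays. A sketch that reads the coefficient out and checks it against covariance divided by the product of the standard deviations (the names a, b, r and r_from_cov are illustrative):

```python
import numpy as np

a = np.array([14.2, 16.4, 15.2, 22.6, 17.2])
b = np.array([215, 325, 332, 445, 408])

corr = np.corrcoef(a, b)
r = corr[0, 1]  # same value as corr[1, 0]

# Pearson r = cov(a, b) / (std(a) * std(b)), using sample statistics
cov = np.cov(a, b)  # np.cov uses ddof=1 by default
r_from_cov = cov[0, 1] / (a.std(ddof=1) * b.std(ddof=1))

print(corr)
print(r, r_from_cov)
```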