Data Analytics Theory Flashcards
Which of the mean, mode and median are resistant to outliers?
The mean is very sensitive to the presence of outliers. The median and mode are very resistant to outliers.
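A quick way to see this, as a minimal sketch using Python's standard `statistics` module: the sample values and the added extreme observation of 100 are made up for illustration.

```python
from statistics import mean, median, mode

sample = [2, 3, 3, 4, 5]
with_outlier = sample + [100]   # hypothetical extreme observation

# Mean, median and mode before adding the outlier.
print(mean(sample), median(sample), mode(sample))                      # 3.4 3 3
# After adding it, only the mean moves substantially.
print(mean(with_outlier), median(with_outlier), mode(with_outlier))    # 19.5 3.5 3
```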
True or false - the median is calculated differently depending on whether the sample contains an even or odd number of observations?
True
What are the steps for determining the mean?
The sum of all sample values (Xi) divided by the number of samples.
What are the steps for determining the median?
Order the sample values in ascending order. For odd n, the median is the value at position (n+1)/2. For even n, the median is the average of the values at positions n/2 and (n+2)/2.
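The steps above can be sketched as follows. The function name `median_by_position` is illustrative, and the only subtlety is converting the 1-based positions used on the card into Python's 0-based list indices.

```python
def median_by_position(values):
    """Median via the card's 1-based positions in the sorted sample."""
    ordered = sorted(values)              # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # odd n: value at position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the values at positions n / 2 and (n + 2) / 2
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2

print(median_by_position([7, 1, 5]))      # 5
print(median_by_position([7, 1, 5, 3]))   # 4.0
```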
What are the steps for determining the mode?
Create a frequency table, find the highest frequency, then read off which value that frequency belongs to.
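A minimal sketch of those steps, using `collections.Counter` as the frequency table. The helper name `modes` is illustrative; it returns every value that shares the highest frequency, since a dataset can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)              # frequency table: value -> frequency
    highest = max(counts.values())        # the highest frequency
    return sorted(v for v, f in counts.items() if f == highest)

print(modes([2, 3, 3, 4, 5]))   # [3]
print(modes([1, 1, 2, 2, 3]))   # [1, 2]
```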
What is the mode?
The observation that occurs most frequently in the dataset.
Why should we not look at measures of centrality in isolation?
Comparing the measures of centrality between datasets may indicate that they are similar when in reality they have different amounts of dispersion.
What are the measures of centrality?
Mean, mode, median
What do measures of variability describe?
Measures of variability describe how dispersed observations in the univariate dataset are. They describe whether observations are tightly clustered or spread out.
What are the measures of variability?
Variance and standard deviation. Range (though very sensitive to outliers). Five number summary provides basic information about variability.
What are synonyms of the mean?
Arithmetic mean or average
What is the formula for calculating the mean?
x̄ = (X1 + X2 + ... + Xn) / n, i.e. the sum of all sample values Xi divided by the number of samples n.
What is the mean?
The mean is considered to be the central (typical) measurement of a collection of observations.
What is the formula for calculating the standard deviation?
[See flashcard]
What is the formula for calculating the variance?
[See flashcard]
What is the variance?
The average squared distance of each observation from the mean. Measured in units squared.
What units is the variance measured in?
Units squared
What is the standard deviation?
The square root of the variance - it is useful for judging how close the observations are to the mean. Measured in the same units/same scale as the observations in the numerical variable.
What units is the standard deviation measured in?
Same units/scale as the observations in the numerical variable.
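The cards do not say whether the squared distances are averaged over n or over n - 1, so this sketch shows both conventions from Python's `statistics` module. The data are made up so that the mean is exactly 5.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

observations = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample, mean = 5

# Population convention: divide the summed squared distances by n.
print(pvariance(observations), pstdev(observations))
# Sample convention: divide by n - 1 instead (slightly larger result).
print(variance(observations), stdev(observations))

# The population variance written out step by step.
centre = mean(observations)
by_hand = sum((x - centre) ** 2 for x in observations) / len(observations)
print(by_hand)   # 4.0
```

The standard deviation is in the same units as the observations, while the variance is in those units squared.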
How much of the data is usually within one standard deviation from the mean?
About 68% (for approximately normally distributed data)
How much of the data is usually within two standard deviations from the mean?
About 95% (for approximately normally distributed data)
What are order statistics?
Statistics based on sorted (ranked) data
Define a quantile.
The value computed from a sorted collection of numerical measurements (in ascending order) that indicates an observation’s rank when compared to all other present observations. It can take a value between 0 and 1.
What does the 0.5th quantile mean?
This is the median value, below which half (50%) of the measurements lie.
What values can a quantile take?
Between 0 and 1.
What values can a percentile take?
Between 0 and 100.
What is the relationship between a quantile and percentile?
The percentile is the quantile expressed on a “percent scale” of 0 to 100, i.e. the p-th quantile is the (100 x p)-th percentile. For example, the 0.25th quantile is the 25th percentile.
Define percentile.
The percentile is the quantile expressed on a “percent scale” of 0 to 100, i.e. the p-th quantile is the (100 x p)-th percentile. The Pth percentile is the cutoff point at or below which at least P percent of the observations in the dataset lie.
What does the 80th percentile represent?
The 80th percentile is the cutoff point which indicates that 80% of observations in the dataset may be found at this point or below.
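One way to turn this definition into code is the nearest-rank convention: take the smallest sorted value such that at least P percent of the observations are at or below it. This is only one of several conventions (statistical software often interpolates between neighbouring values instead), and the function name and data are illustrative.

```python
import math

def percentile_nearest_rank(values, p):
    """Smallest observation with at least p percent of the data at or below it."""
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(values)                       # order statistics need sorted data
    rank = math.ceil(p * len(ordered) / 100)       # 1-based rank of the cutoff
    return ordered[rank - 1]

scores = [15, 20, 35, 40, 50, 55, 60, 70, 80, 95]  # hypothetical data, n = 10
print(percentile_nearest_rank(scores, 80))   # 70 (8 of the 10 values are <= 70)
print(percentile_nearest_rank(scores, 25))   # 35
```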
What are quartiles?
Quartiles are three cut off points that divide the dataset into four equal groups (Q1, Q2, Q3)
Define the first quartile.
Q1 = 0.25th quantile = 25th percentile. This is the middle value between the smallest observation and the median. Ie it is the median of the lower half of the dataset.
Define the second quartile.
Q2 = 0.5th quantile = 50th percentile. This is the median of the dataset (the value which splits the dataset in half).
Define the third quartile.
Q3 = 0.75th quantile = 75th percentile. This is the middle value between the median and the highest observation in the dataset. Ie it is the median of the upper half of the dataset.
Define the range.
The range is the difference between the smallest and largest observations in a numerical variable. It is extremely sensitive to outliers and therefore not very useful as a general measure of dispersion in the data.
Why is the range not very useful as a general measure of dispersion in the data?
It is extremely sensitive to outliers - its calculation involves the use of extreme values.
What is the five number summary?
This provides basic information about variability in the dataset. It consists of the 0th percentile (minimum), 25th percentile (Q1), 50th percentile (Q2), 75th percentile (Q3) and 100th percentile (maximum). Ie it is the quartiles plus the maximum and minimum values.
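A minimal sketch of the five number summary, using the "median of the lower half / median of the upper half" description of Q1 and Q3 from the quartile cards. Leaving the middle value out of both halves when n is odd is an assumption; other conventions exist and give slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    lower_half = ordered[: n // 2]             # observations below the middle
    upper_half = ordered[n // 2 + n % 2 :]     # observations above it (middle value skipped if n is odd)
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

print(five_number_summary([1, 3, 4, 6, 7, 8, 10, 12]))   # (1, 3.5, 6.5, 9.0, 12)
print(five_number_summary([1, 2, 3, 4, 5, 6, 7]))        # (1, 2, 4, 6, 7)
```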
What is the interquartile range?
The interquartile range (IQR) measures the width of the “middle 50 percent” of the data. It is the range of values between Q1 (0.25 quantile) and Q3 (0.75 quantile). It is very resistant to outliers as it doesn’t consider the extremes where outliers are present.
Why is the IQR resistant to outliers?
The IQR measures the range across the middle 50% of the data, and therefore unlike the range it doesn’t consider the extremes where the outliers are present.
What is the first step to carry out before determining order statistics?
Sort the data in ascending order.
What is covariance?
Covariance measures joint variability, i.e. it quantifies how two random variables vary together.
What are the possible outcomes for covariance and what does each mean?
Cov(x, y) = 0 - there is no linear relationship between numerical variables x and y.
Cov(x, y) > 0 - there is a positive linear relationship between numerical variables x and y (as x increases, y increases and vice versa).
Cov(x, y) < 0 - there is a negative linear relationship between numerical variables x and y (as x increases, y decreases and vice versa).
What does a positive linear relationship mean?
R > 0 - as x increases, y increases and vice versa
What does a negative linear relationship mean?
R < 0 - as x increases, y decreases and vice versa
Does correlation or covariance measure how strong a relationship is?
Correlation
Why does calculating the covariance not tell us how strong a relationship is?
Covariance can tell us whether there is a linear relationship between two variables and what its direction is, but not how strong it is: its value depends on the units of the variables, so there is no fixed scale to compare it to.
What type of variable can covariance and correlation be calculated for?
Numerical variables.
What is the problem with covariance?
We cannot quantify the strength of the linear relationship between two variables, because there are no upper or lower limits on the values the covariance can take.
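The unit dependence is easy to demonstrate. This sketch assumes the sample covariance (dividing by n - 1), which the cards do not specify, and uses made-up data: the same x values recorded in metres and in centimetres give covariances that differ by a factor of 100, although the relationship itself is unchanged.

```python
def covariance(xs, ys):
    """Sample covariance (n - 1 in the denominator) of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x_metres = [1, 2, 3, 4]                        # hypothetical measurements
x_centimetres = [100 * x for x in x_metres]    # same measurements, different unit
y = [2, 4, 5, 9]

print(covariance(x_metres, y))        # positive: x and y increase together
print(covariance(x_centimetres, y))   # 100 times larger, same relationship
print(covariance(x_metres, y[::-1]))  # negative once y is reversed
```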
What does correlation measure?
The direction and strength of an association between two variables. It is used to interpret the covariance.
What coefficient do we use for correlation?
Pearson’s product-moment correlation coefficient, written ρxy (rho xy).
What are the interpretations of the absolute strength of the Pearson’s product-moment correlation coefficient?
There are guidelines available to interpret the value of rho.
|rho| = 0.0 – no linear relationship
0.0 < |rho| <= 0.19 – very weak L.R.
0.20 <= |rho| <= 0.39 – weak L.R.
0.40 <= |rho| <= 0.59 – moderate L.R.
0.60 <= |rho| <= 0.79 – strong L.R.
0.80 <= |rho| < 1.0 – very strong L.R.
|rho| = 1.0 – perfect L.R.
What are the basic interpretations of the Pearson’s product-moment correlation coefficient?
If rho = 1, there is a perfect positive linear relationship between variables x and y.
If 0 < rho < 1, there is a positive linear relationship between x and y. The closer to 1 the stronger it is.
If rho = -1, there is a perfect negative linear relationship between x and y.
If -1 < rho < 0, there is a negative linear relationship between x and y. The closer to -1 the stronger it is.
If rho = 0, there is no linear relationship between x and y.
What values can Pearson’s product-moment correlation coefficient take on?
Rho is between -1 and 1.
Why are we able to say how strong the relationship is using Pearson’s product-moment correlation coefficient?
It is scaled between - 1 and 1.
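As a minimal sketch, the coefficient can be computed as the covariance divided by the product of the two standard deviations, which is what fixes its scale to the interval from -1 to 1. The function names and data are illustrative, and `strength_label` simply encodes the guideline bands listed on the earlier card (treating each band as running up to the start of the next).

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient of two samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 values each")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

def strength_label(rho):
    """Map |rho| to the guideline wording from the flashcards."""
    size = abs(rho)
    if size == 0:
        return "no linear relationship"
    for upper, label in [(0.2, "very weak"), (0.4, "weak"), (0.6, "moderate"),
                         (0.8, "strong"), (1.0, "very strong")]:
        if size < upper:
            return label
    return "perfect"

xs = [1, 2, 3, 4]          # hypothetical data
ys = [2, 4, 5, 9]
rho = pearson_r(xs, ys)
print(round(rho, 3), strength_label(rho))
# Rescaling x (e.g. metres to centimetres) leaves rho unchanged.
print(round(pearson_r([100 * x for x in xs], ys), 3))
```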
What is a frequency table?
A statistical technique used to get more insight into the properties of categorical variables.
What are the columns of a frequency table?
1 - category
2 - frequency column (F) - the number of occurrences of each category. Will total to n
3 - relative frequency (RF) - the proportion of occurrences of each category (F/n). The sum of all relative frequencies when written as proportions must be equal to 1.
4 - percentages (P) - proportions multiplied by 100. The sum of this column must equal 100.
What does the relative frequency column of a frequency table sum to?
1
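The four columns can be sketched with `collections.Counter`. The function name and the colour data are made up for illustration.

```python
from collections import Counter

def frequency_table(observations):
    """Return rows of (category, F, RF, P), most frequent category first."""
    n = len(observations)
    counts = Counter(observations)
    return [(category, f, f / n, 100 * f / n)
            for category, f in counts.most_common()]

colours = ["red", "blue", "red", "green", "red", "blue", "red", "green"]
table = frequency_table(colours)
for category, f, rf, p in table:
    print(f"{category:<6} F={f}  RF={rf:.2f}  P={p:.0f}%")

# Column checks: F totals n, RF totals 1, P totals 100.
print(sum(row[1] for row in table), sum(row[2] for row in table),
      sum(row[3] for row in table))
```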
Why are frequency tables useful?
They help us to summarise large amounts of data and display this information clearly. We can see the most/least common categories and can calculate proportions.
What are contingency tables used for?
A contingency table summarises data for two categorical variables (table of counts by category). Each value in the table represents the number of times a particular combination of variable outcomes occurred.
What is the relationship between a frequency table and contingency table?
Both tables are used to summarise information on categorical variables. A frequency table summarises information on a single categorical variable, whereas a contingency table summarises the data for two categorical variables.
What kind of tool can be used to answer questions like “what proportion of spam emails contains text without numbers?”
Two categorical variables - contingency table
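A minimal sketch of how a contingency table answers that question. The eight emails below are invented, and the category labels ("none", "small", "big" for the amount of numbers in the text) are assumptions made for the example.

```python
from collections import Counter

# Each observation is a pair: (email type, numbers in the text).
emails = [
    ("spam", "none"), ("spam", "small"), ("spam", "none"), ("spam", "big"),
    ("not spam", "small"), ("not spam", "big"),
    ("not spam", "none"), ("not spam", "small"),
]

# Contingency table: count of each combination of the two variables.
table = Counter(emails)

spam_total = sum(count for (label, _), count in table.items() if label == "spam")
spam_without_numbers = table[("spam", "none")]
proportion = spam_without_numbers / spam_total
print(proportion)   # 0.5
```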
What are bar charts used to visualise?
Categorical variables. This can be represented as frequency or proportion.
How are categorical variables visualised?
Bar charts - this can be by frequency or proportion.
What are the different axes of a bar chart?
The x-axis represents the different symbols (categories) of a categorical variable. The y-axis represents the frequency or proportion of the occurrence of each category.
What is a mosaic plot?
A graphical representation of the information in a contingency table. It is similar to a bar plot.
How many variables can a mosaic plot represent?
A mosaic plot can be used to visualise one or two categorical variables from a contingency table.
How do mosaic plots represent the number of observations?
Mosaic plots use the area of each box to represent the number of observations in that box.
What is used to visualise contingency tables?
A mosaic plot
How does a two-variable mosaic plot represent the two variables?
One variable (x) is used to create an initial one-variable mosaic plot, where the area of each bar represents the number of observations in that category of x. The second variable (y) is represented by splitting each bar proportionally according to the fractions of y.
What types of variables are plotted on a scatterplot?
Numerical variables
What is a scatterplot?
A plot that provides a case-by-case view of data for two numerical variables.
What are scatterplots useful for?
Scatterplots are helpful in quickly spotting associations between two numerical variables.
What is a box plot?
A visualisation technique used for explaining important features of the distribution of the target numerical variable. It provides insight into centrality, spread, skewness and possible outliers.
What does a box plot show?
Centrality (median), spread (quartiles), skewness and possible outliers.
Do the whiskers of a box plot represent the full range?
No, the whiskers may not capture the maximum and minimum values. How the whiskers are determined depends on the software package used, e.g. a common convention extends them to the most extreme observations within 1.5 x IQR of Q1 and Q3.
What can box plots be useful for?
Identifying outliers.
If a box plot shows a lot of outliers above the maximum whisker (high positive) what does this indicate about the skew of the data?
Right-skewed
If a box plot shows a lot of outliers below the minimum whisker what does this indicate about the skew of the data?
Left-skewed
How can you identify suspected outliers on a box plot?
Suspected outliers are the observations beyond the maximum reach of the whiskers.
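A minimal sketch of flagging suspected outliers, assuming the 1.5 x IQR whisker rule mentioned earlier and quartiles taken as medians of the lower and upper halves. Both choices are conventions that vary between software packages; the function name and data are illustrative.

```python
from statistics import median

def suspected_outliers(values, k=1.5):
    """Observations lying more than k * IQR below Q1 or above Q3."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])             # median of the lower half
    q3 = median(ordered[n // 2 + n % 2 :])     # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [x for x in ordered if x < low_fence or x > high_fence]

data = [2, 3, 4, 5, 5, 6, 7, 8, 30]            # hypothetical sample
print(suspected_outliers(data))   # [30]
```

Here Q1 = 3.5 and Q3 = 7.5, so IQR = 4 and the upper fence is 7.5 + 6 = 13.5; only 30 lies beyond it.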
What is an outlier?
An outlier is an observation that appears extreme relative to the rest of the data
Why is it important to look for outliers?
- To identify a strong skew in the distribution
- To identify data collection or entry errors
- To get an insight into interesting properties of the data
What are side-by-side box plots used for?
Side-by-side box plots are a traditional tool for comparing numerical observations across categories. They are particularly useful for comparing centrality and spread of numerical observations between categories.
What visualisation technique can you use for exploring the distribution of numerical and categorical variables together?
Side-by-side box plots
What measures are side-by-side box plots particularly useful for?
Comparison of centrality and spread of numerical observations between categories.
How should you answer questions describing graphs?
- Describe what you see
- Relate this to the question (ie what does this mean in real life)
- Support with figures from the graph
What are histograms?
Histograms are plots that are used for describing the shape of the data distribution of the target numerical variable. They also provide a view of the data density of the target numerical variable (higher bars represent where data is more common).
What kind of data type is plotted in a histogram?
Numerical
What kind of visualisation describes the shape of the data distribution of a numerical variable?
Histogram
What does a higher bar in a histogram represent?
Where the data are relatively more common.
What kind of visualisation describes the data density of a numerical variable?
Histogram - where higher bars represent where the data are relatively more common.
What are the similarities of bar charts and histograms?
They use bars to represent frequencies / they both measure frequencies.
What are the differences of bar charts and histograms?
- Histograms are used for displaying distributions of numerical variables while bar charts are used for categorical variables.
- Both measure frequencies, but in histograms, observations first need to be “binned”
What is a “bin” in a histogram?
A defined interval (used to group individual numerical values). The number of observations that fall within each interval are counted and this frequency is used to determine the height of the bar for that interval.
Why does bin width matter when plotting histograms?
The chosen bin width can alter the story that the histogram is telling. Increasing the bin width may reduce the number of modes that are visible.
What are the steps of constructing a histogram?
1 - define the bins and bin sizes (software may determine this)
2 - once defined, count how many observations fall into each interval
3 - plot
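The first two steps can be sketched as below, with a simple text plot standing in for step 3. Equal-width bins that include their left edge and exclude their right edge are an assumption, as are the function name and the data. Running it with two different bin widths also shows how the choice of width changes the picture.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations per bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)      # which interval v falls into
        if 0 <= index < n_bins:                # values outside all bins are ignored
            counts[index] += 1
    return counts

data = [1, 2, 2, 3, 5, 6, 6, 7, 9]             # hypothetical observations

narrow = histogram_counts(data, start=0, width=2, n_bins=5)
wide = histogram_counts(data, start=0, width=5, n_bins=2)
print(narrow)   # [1, 3, 1, 3, 1]  two peaks are visible
print(wide)     # [4, 5]           the two peaks are no longer visible

for i, count in enumerate(narrow):             # step 3: a crude text plot
    print(f"[{2 * i:>2}, {2 * i + 2:>2}) " + "#" * count)
```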
How is the mode represented in a histogram?
The mode is represented by a prominent peak in the distribution.
What can histograms show?
Histograms can show how many and what the modes of a distribution are.
- Unimodal / bimodal / multimodal
Describe a right-skewed distribution.
When data trails off to the right ie observations are clustered on the left of the axis and there is a long tail to the right.
Describe a left-skewed distribution.
When data trails off to the left ie observations are clustered on the right of the axis and there is a long tail to the left.
If observations are clustered on the left of a histogram and there is a long tail to the right - what kind of skew is this?
Right-skewed
If observations are clustered on the right of a histogram and there is a long tail to the left - what kind of skew is this?
Left-skewed
How do you describe a dataset that shows roughly equal trailing off in both directions?
Symmetric
What is a symmetric distribution?
A dataset that shows roughly equal trailing off in both directions.
Why is it important to check if data is normally distributed?
A lot of statistical inference relies on data being normally distributed.
If the distribution of a dataset is symmetric, what measures should you use to describe the centre and spread?
Mean and standard deviation
What kind of distribution is best described by the mean and standard deviation?
Symmetric
If the distribution of a dataset is skewed, what measures should you use to describe the centre and spread?
Median and IQR - they are robust to outliers.
What is the relationship between the median, mean and mode of a symmetric distribution?
mean ~ median ~ mode
What is the relationship between the median, mean and mode of a right-skewed distribution?
mode < median < mean
What is the relationship between the median, mean and mode of a left-skewed distribution?
mean < median < mode
Why does mean ~ median ~ mode not hold for skewed data?
The mean is pulled in the direction of the tail, towards the extremes. The mode is pulled in the opposite direction (where the data is clustered)
If data is right-skewed, what kind of transformations can result in new samples which are less skewed?
y = sqrt(x)
y = ln(x)
y = -1/x
Listed in order of increasing strength: use the later transformations for more severe skew.
If data is left-skewed, what kind of transformations can result in new samples which are less skewed?
y = x^2
y = x^3
Listed in order of increasing strength: use the later transformation for more severe skew.
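To see the right-skew transformations at work, this sketch applies each one to a made-up right-skewed sample and compares a moment-based skewness coefficient (positive for right skew, negative for left skew). The coefficient is not part of the cards and is used here only as a yardstick; ln(x) and -1/x require strictly positive values.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 4, 10, 50]      # hypothetical sample with a long right tail

transforms = {
    "raw": lambda x: x,
    "sqrt(x)": math.sqrt,
    "ln(x)": math.log,
    "-1/x": lambda x: -1 / x,
}
results = {name: skewness([f(x) for x in right_skewed])
           for name, f in transforms.items()}

for name, value in results.items():
    print(f"{name:<8} skewness = {value:.2f}")
```

On this sample each stronger transformation lowers the coefficient further, and a transformation that is too strong for the data can overshoot into left skew.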
Why is the bin width choice important?
Depending on bin size, the story the graph tells can change. If the bin size is too wide, it may mislead you into thinking that the data is normally distributed.
What are the options to have on the y-axis of a histogram?
Absolute frequency or relative frequency (F/n)
What is the difference in shape of the relative histogram in relation to the absolute frequency histogram?
They have the same shape. The difference is the scale of the y-axis: in the relative frequency histogram the bar heights (relative frequencies) add up to 1.
How do you calculate the relative frequency?
The absolute frequency divided by the total number of observations.
When do we use the relative frequency histogram over the absolute frequency histogram?
Use the relative frequency histogram when we want to investigate whether the proportion is less than or greater than a certain value. Ie we want to look at proportion rather than frequency.
How do you answer a question to determine how many people fall in a certain interval, if the bin widths are too big to answer this accurately?
We can’t determine an exact answer with these bin widths; we can only estimate. To answer accurately we need a histogram with narrower bins.
If you keep changing the bin widths of a histogram to become smaller, what happens?
The histogram forms a smoother curve, approaching the density curve.
What is a density curve?
A density curve is a smoothed version of the relative frequency histogram. It is used for the visualisation of continuous variables or very large populations. It also represents a probability density function. The area under the curve is equal to 1.
What kind of variable is visualised in a density curve or probability density function?
A continuous variable.
What does the area under a density curve represent?
The area corresponds to measuring probabilities. The total area is equal to 1. Similar to the bars in a relative frequency diagram.
From a probability density curve, what is the probability that x = a particular value from the continuous distribution?
The probability that x is equal to some particular value from the continuous distribution is ALWAYS equal to 0. This happens because a single point on the density curve has a width of 0, so the area under the curve above that point is 0.
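A minimal sketch of probabilities as areas, using the standard normal density from Python's `statistics.NormalDist` as an assumed example of a density curve. The area over an interval is the difference of the cumulative distribution at its two ends, and it shrinks to 0 as the interval narrows to a single point.

```python
from statistics import NormalDist

density = NormalDist(mu=0, sigma=1)   # assumed example of a continuous distribution

def probability_between(a, b):
    """Area under the density curve between a and b."""
    return density.cdf(b) - density.cdf(a)

print(probability_between(0.9, 1.1))     # a small but positive area
print(probability_between(0.99, 1.01))   # narrower interval, smaller area
print(probability_between(1.0, 1.0))     # 0.0: a single point has no width
```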
Which distribution is the most common?
The normal curve or normal distribution.
What are the properties of the normal distribution?
- It is a unimodal, bell-shaped curve that is symmetric around its mean
- Mean, mode and median are equal
- It is determined by two parameters (mu and sigma), usually denoted as N(mu, sigma)
- The area under the normal curve is 1
What parameters determine the normal distribution?
Mu and sigma - N(mu, sigma)
What is the standard normal distribution?
A normal distribution where mu = 0 and sigma = 1, represented as N(0,1)
Which normal distribution is represented by N(mu = 0, sigma = 1)?
The standard normal distribution
We want our dataset to be normally distributed, but in practice our data comes from lots of different types of distributions, with lots of different influencing parameters. How do we account for this?
Standardise the dataset, i.e. transform it onto the scale of the standard normal distribution. This enables us to refer to the standardised tables.
What parameters determine the shape of the normal distribution?
Mu (mean) - the centre of the curve, changing mu shifts the curve left / right
Sigma (standard deviation) - the width of the curve. Changing sigma stretches or constricts the curve
Which rule describes how many observations lie within different numbers of standard deviations from the mean in the normal distribution?
68-95-99.7 Rule
- 68% of observations lie within 1 SD of the mean in the normal distribution
- 95% of observations lie within 2 SDs
- 99.7% of observations lie within 3 SDs
What percentage of observations lie within 1/2/3 SDs of the mean in the normal distribution?
68%, 95%, 99.7%
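The rule can be checked numerically with `statistics.NormalDist`. The helper name `share_within` and the example parameters (mean 100, SD 15) are illustrative; the percentages are the same for any mu and sigma.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Proportion of a N(mu, sigma) distribution within k SDs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# The same proportions hold for other parameters, e.g. mean 100 and SD 15.
print(round(share_within(2, mu=100, sigma=15), 4))   # 0.9545
```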
How do we analyse normally distributed data?
We should convert the available observations into standard deviation units, i.e. measure their distances from the mean in SDs.
To perform this conversion we use the standardisation technique called the Z-score.
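A minimal sketch of the conversion, assuming the usual definition z = (x - mu) / sigma with known mu and sigma. The function name and the example values (mean 100, SD 15) are made up for illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Distance of x from the mean, measured in standard deviations."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

print(z_score(130, mu=100, sigma=15))   # 2.0  (two SDs above the mean)
print(z_score(85, mu=100, sigma=15))    # -1.0 (one SD below the mean)

# Once standardised, the standard normal distribution N(0, 1) applies:
# proportion of observations expected below x = 130.
print(round(NormalDist(0, 1).cdf(z_score(130, 100, 15)), 4))   # 0.9772
```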