Topic 2 - Descriptive Statistics Flashcards

1
Q

How many general ways are there to describe data (descriptive statistics) numerically for 1 variable?

A

2 ways

2
Q

How can you describe data numerically?

A

1) Measures of Central Tendency- arithmetic mean, median, mode
2) Measures of Variation or Dispersion- range, interquartile range, variance, standard deviation

3
Q

How do you denote the population arithmetic mean?

A

The population mean is denoted by the Greek letter mu (μ)
See image in notes

4
Q

How do you denote the sample arithmetic mean?

A

x bar (x with horizontal line over the top)

5
Q

How do you denote the population size?

A

Capital/uppercase N

6
Q

How do you denote the sample size?

A

Lowercase n

7
Q

How do you calculate the median?

A

The midpoint (middle value) of the data once it is sorted in ascending order

8
Q

What is a negative of the mean as a measure of central tendency?

A

It is very sensitive to and is affected by outliers/extreme values

9
Q

What is a negative of the median as a measure of central tendency?

A

It ignores every value except the middle one(s)- it doesn’t take into account how far above or below the middle the other values lie
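
A small numerical check can make the contrast between the mean and the median concrete. The sketch below uses Python's standard `statistics` module on made-up values (the data and variable names are illustrative assumptions, not taken from the notes) and adds a single extreme value to see how each measure reacts.

```python
import statistics

# Hypothetical data, chosen only for illustration
data = [10, 12, 11, 13, 12]
with_outlier = data + [100]  # add one extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print(mean_before, mean_after)      # the mean jumps sharply
print(median_before, median_after)  # the median stays at 12
```

The mean moves a long way because the outlier enters the sum directly, while the median only looks at the middle of the ordered data.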

10
Q

How do you calculate the median when there is an even number of ordered data items?

A

Take the mean of the middle 2 values, i.e. the halfway point between them
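
The odd and even cases can be sketched in a few lines of Python. The function name `median` and the sample values are illustrative assumptions; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Return the middle value of the sorted data, or the mean of the
    two middle values when the number of items is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the middle 2

odd_median = median([7, 1, 5])       # sorted: 1, 5, 7 -> 5
even_median = median([8, 2, 6, 4])   # sorted: 2, 4, 6, 8 -> (4 + 6) / 2 = 5.0
```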

11
Q

What are the advantages and disadvantages of the mode?

A

Advantages:
- not affected by extreme values
- can be used for either numerical (quantitative) or categorical (qualitative) data

Disadvantages:
- there may be no mode (if all data items occur equally often) or there may be several modes (if more than one data item occurs most frequently)
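
These cases can be tried out with `statistics.multimode` from Python's standard library (available from Python 3.8). The data below is made up for illustration.

```python
from statistics import multimode

# Hypothetical data, for illustration only
one_mode = multimode(["red", "blue", "red", "green"])  # categorical data works too
two_modes = multimode([1, 2, 2, 3, 3])                 # two values tie for most frequent
all_tied = multimode([1, 2, 3])                        # every value occurs exactly once

print(one_mode)   # ['red']
print(two_modes)  # [2, 3]
print(all_tied)   # [1, 2, 3]
```

Note that `multimode` returns every value when all of them are tied; under the convention on this card that situation is described as having no mode.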

12
Q

What do measures of central tendency show?

A

Single value that attempts to describe a set of data by identifying the typical value within that set of data

13
Q

What do measures of variation or dispersion show?

A

How spread out the data is- how variable it is- the dispersion of the data

14
Q

What are the disadvantages of the range as a measure of dispersion/variation?

A

1) It ignores the way in which data is distributed e.g. the range won’t take into account whether there is an even distribution of data among all data items or whether data is concentrated in the low, middle or high end- it is only concerned with the lowest and highest value in the data set

2) It is also sensitive to outliers/extreme values- one extreme value will have a massive impact on the range

15
Q

What are quartiles?

A

Quartiles split the ranked data into 4 segments with an equal number of values per segment

The point which marks the end of the 1st quartile is known as the lower quartile (Q1)
The point which marks the end of the 2nd quartile is known as the median (Q2)
The point which marks the end of the 3rd quartile is known as the upper quartile (Q3)

16
Q

How do you calculate the interquartile range and what are its advantages?

A

Interquartile range (IQR) = 3rd quartile (Q3) – 1st quartile (Q1)

Advantages:
- can eliminate some outlier problems as it only takes into account the middle 50% of the data- the interquartile range is therefore unlikely to be affected by outliers/extreme values
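
A rough sketch of this calculation, using `statistics.quantiles` from Python's standard library (Python 3.8+). Quartile conventions differ between textbooks and software, so the method used in the module may give slightly different Q1 and Q3 values; the helper name `iqr` and the data are assumptions for illustration.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range Q3 - Q1, using the default method of statistics.quantiles."""
    q1, _q2, q3 = quantiles(values, n=4)
    return q3 - q1

data = list(range(1, 12))           # 1, 2, ..., 11
with_outlier = data[:-1] + [1000]   # replace the largest value with an extreme one

range_plain = max(data) - min(data)
range_outlier = max(with_outlier) - min(with_outlier)

print(range_plain, iqr(data))             # range 10, IQR 6.0
print(range_outlier, iqr(with_outlier))   # range 999, IQR 6.0
```

The single extreme value changes the range from 10 to 999 but leaves the interquartile range untouched.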

17
Q

Define population variance

A

The population variance is the exact average (exact because you are taking data directly from the population) of the squared deviations of values from the mean

The population variance is a PARAMETER of the population

18
Q

How do you calculate the population variance and the population standard deviation?

A

Population variance is denoted by lowercase sigma squared (σ^2)- see image in notes

Population variance (σ^2) = the sum of : [value of data x - population mean mu (μ)]^2 / population size N

NOTE that σ^2 is the population variance itself, so the formula above gives you the variance directly
Taking the square root of the population variance gives the standard deviation of the population, which is denoted by sigma on its own (σ)
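
As a minimal sketch of the formula, the function below computes the population variance by hand and takes its square root for the standard deviation. The function name and the data are illustrative assumptions; `statistics.pvariance` and `statistics.pstdev` in the standard library give the same results.

```python
import math
import statistics

def population_variance(values):
    """sigma^2: the mean of the squared deviations from the population mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

# Hypothetical values, treated here as a complete population
population = [2, 4, 4, 4, 5, 5, 7, 9]

sigma_squared = population_variance(population)  # squared deviations sum to 32; 32 / 8 = 4.0
sigma = math.sqrt(sigma_squared)                 # 2.0
```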

19
Q

Define sample variance

A

The sample variance is the approximate average of the squared deviations of values from the sample mean (approximate because it is calculated from a sample which is supposed to represent the population, and therefore isn’t calculated from the population directly)

The sample variance is a STATISTIC of the sample

20
Q

How do you calculate the sample variance and the sample standard deviation?

A

Sample variance (s^2) = the sum of : [value of data x - sample mean x bar]^2 / [sample size n - 1]

Note- the sample variance is denoted by lowercase ‘s^2’- REMEMBER s^2 is the sample variance itself, and its square root, ‘s’ on its own, is the standard deviation of the sample
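
The only change from the population version is the divisor. A minimal sketch, with an assumed function name and made-up data (the standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor):

```python
import math

def sample_variance(values):
    """s^2: the sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least 2 values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Hypothetical values, treated here as a sample drawn from a larger population
sample = [2, 4, 4, 4, 5, 5, 7, 9]

s_squared = sample_variance(sample)  # 32 / 7, about 4.57
s = math.sqrt(s_squared)             # about 2.14
```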

21
Q

Why do you divide by (n - 1) in the sample variance and not by n?

A

Because the observed values fall, on average, closer to the sample mean than to the population mean, the standard deviation which is calculated using deviations from the sample mean underestimates the desired standard deviation of the population

Using “n-1” instead of “n” as the divisor corrects for that by making the result a little bit bigger

Note that the correction has a larger proportional effect when “n” is small than when it is large, which is what we want because when “n” is larger the sample mean is likely to be a good estimator of the population mean

Note- we divide by [n - 1] rather than by n (the sample counterpart of the N used in the population variance) so that the sample variance is an unbiased estimator of the population variance, i.e. the average of the sample variances over all possible samples equals the population variance
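
The "average over all possible samples" idea can be checked by brute force on a tiny made-up population. The sketch below assumes samples of size 2 drawn with replacement (so every ordered pair is equally likely); the population values and variable names are illustrative.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3, 4]               # hypothetical tiny population
sigma_squared = pvariance(population)   # true population variance

# Every ordered sample of size 2, drawn with replacement (16 in total)
samples = list(product(population, repeat=2))

# Average of the sample variances using the n - 1 divisor
avg_with_n_minus_1 = mean(variance(s) for s in samples)
# Average of the same quantities using n as the divisor instead
avg_with_n = mean(pvariance(s) for s in samples)

print(sigma_squared, avg_with_n_minus_1, avg_with_n)
```

With the n - 1 divisor the average matches the population variance, while dividing by n gives a value that is too small on average.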

SEE IF MAKES SENSE AFTER LECTURE 2 LIVE AND ADD HERE ACCORDINGLY

22
Q

What are the units of the population or sample standard deviation?

A

The units are the same as the original data e.g. if in litres then the standard deviation for the sample and the population is also in litres

23
Q

What does the standard deviation measure?

A

Measures the variation/scatter of data around the mean

24
Q

What is the significance of a large or small standard deviation?

A

A small standard deviation means that the data is generally quite close to the mean- this type of data can be shown by a narrow, steep peak

A large standard deviation means that the data is generally quite far from the mean- this type of data can be shown by a wide, flat peak

See image in notes

25
Q

What are the advantages of variance and standard deviation as measures of dispersion?

A

Each value in the data set is used in the calculation

Values far from the mean are given extra weight because the deviations from the mean are squared- the further a value is from the mean, the larger the impact it will have on the standard deviation, which is good as it takes into account all values, including extreme values far from the mean

26
Q

What is an upside-down U-shaped data distribution called in this module and what is this the same as?

A

They call it a bell shaped distribution
SAME AS normal distribution- JUST ANOTHER NAME FOR IT

27
Q

In a bell shaped distribution, what proportion of the values in the population or sample lies within 1 standard deviation on either side of the mean (either mu or x bar depending on whether you’re looking at a population or sample distribution)?

A

Contains about 68% of the values in the population or the sample
See image in notes

28
Q

In a bell shaped distribution, what proportion of the values in the population or sample lies within 2 standard deviations on either side of the mean (either mu or x bar depending on whether you’re looking at a population or sample distribution)?

A

Contains about 95% of the values in the population or the sample
See image in notes

29
Q

In a bell shaped distribution, what proportion of the values in the population or sample lies within 3 standard deviations on either side of the mean (either mu or x bar depending on whether you’re looking at a population or sample distribution)?

A

Contains about 99.7% of the values in the population or the sample
See image in notes
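
The 68%, 95% and 99.7% figures can be reproduced from the normal distribution itself. This sketch uses `statistics.NormalDist` from Python's standard library (Python 3.8+); the helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(k, round(share * 100, 1))   # about 68.3, 95.4 and 99.7 percent
```

The same proportions hold for any mean and standard deviation, because the calculation only depends on how many standard deviations away from the mean you look.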

30
Q

What is a z-Score and what is the significance of it?

A

A z-score shows the position of an observation relative to the mean of the distribution- it indicates the number of standard deviations a value is from the mean

A z-score greater than zero indicates that the value is greater than the mean

A z-score less than zero indicates that the value is less than the mean

A z-score of zero indicates that the value is equal to the mean

CAN BE applied to equation for z-Score given in the next few flashcards

31
Q

How do you calculate the z-Score for the population?

A

z-Score (z) = [observed value x - population mean (mu)] / population standard deviation (sigma)

32
Q

How do you calculate the z-Score for a sample?

A

z-Score (z) = [observed value x - sample mean (x bar)] / sample standard deviation (s)
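
The population and sample formulas have the same shape, so one small function covers both: pass in mu and sigma for a population, or x bar and s for a sample. The function name and the numbers below are illustrative assumptions.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev

# Hypothetical distribution with mean 5 and standard deviation 2
above = z_score(9, mean=5, std_dev=2)     # 2.0: two standard deviations above the mean
at_mean = z_score(5, mean=5, std_dev=2)   # 0.0: exactly at the mean
below = z_score(2, mean=5, std_dev=2)     # -1.5: one and a half standard deviations below
```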

33
Q

How is the z-Score denoted?

A

By a lowercase ‘z’

34
Q

Can the standard deviation be negative?

A

No- the standard deviation is never negative (it is zero only if every value is identical, otherwise it is positive)

35
Q

What can we say about the z-Score formula?

A

If the observed x value is greater than the mean, then the numerator will be positive and therefore the z-Score value will also be positive

If the observed x value is less than the mean, then the numerator will be negative and therefore the z-Score value will also be negative

This is because the denominator (the standard deviation) is always positive, so the sign of the z-Score is set by the numerator alone

36
Q

What is significant about a z-Score value of more than 1?

A

REMEMBER the z-Score value shows how many standard deviations a value is from the mean

A z-Score of 1.4, for example, is significant: recall that 1 standard deviation on either side of the mean contains about 68% of all values in a bell-shaped distribution, so a z-Score of 1.4 means that the observed value lies above the middle 68% and you can conclude that it is well above average

37
Q

How and when would you calculate the weighted mean?

A

Weighted mean (denoted uppercase X bar) = multiply each x value by its weight and add the results (w1x1 + w2x2 + …), then divide by the sum of the weights (w1 + w2 + …)

Used when the data is already grouped into n classes, with w values in the corresponding class
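
A minimal sketch of the weighted mean, assuming the values and their weights arrive as two lists of equal length (the function name and the numbers are made up for illustration):

```python
def weighted_mean(values, weights):
    """(w1*x1 + w2*x2 + ...) / (w1 + w2 + ...)"""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Hypothetical grouped data: 1 observation of 70, 2 of 80 and 2 of 90
values = [70, 80, 90]
weights = [1, 2, 2]

result = weighted_mean(values, weights)   # (70 + 160 + 180) / 5 = 82.0
```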

38
Q

When measuring the relationship between 2 variables, how many descriptive statistics could we use?

A

2

39
Q

When measuring the relationship between 2 variables, what descriptive statistics could we use?

A

1) Covariance
2) Correlation Coefficient

40
Q

What is covariance and what does it signify?

A

A measure of the direction of a linear relationship between 2 variables
- a positive covariance means that there is a positive relationship between two variables- both variables move in the same direction
- a negative covariance means that there is a negative relationship between two variables- the variables move in opposite directions

41
Q

What is the correlation coefficient?

A

A measure of the direction AND THE strength of a linear relationship between 2 variables
- derived from covariance

42
Q

So what is the difference between covariance and correlation coefficient and which is more informative?

A

Correlation coefficient is more informative as it tells us not only about the direction, but also the strength of a linear relationship between 2 variables

43
Q

How do you calculate population covariance?

A

Cov (x,y) = sigma with small x y in bottom right corner = the sum of : [(an observed value x - the population mean for x, denoted mu with small x in bottom right corner) * (the corresponding value of y - the population mean for y, denoted mu with small y in bottom right corner)] / N

44
Q

How do you calculate sample covariance?

A

Cov (x,y) = s with small x y in bottom right corner = the sum of : [(an observed value x - the sample mean for x, denoted x bar) * (the corresponding value of y - the sample mean for y, denoted y bar)] / (n - 1)
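
Both covariance formulas sum the products of paired deviations and differ only in the divisor. The sketch below handles both with a flag; the function name, the `sample` parameter and the data are assumptions for illustration.

```python
def covariance(xs, ys, sample=True):
    """Sum of (x - mean of x) * (y - mean of y) over all pairs,
    divided by n - 1 for a sample or by N for a population."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least 2 paired observations")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    total = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sample_cov = covariance(xs, ys, sample=True)        # 6 / 4 = 1.5
population_cov = covariance(xs, ys, sample=False)   # 6 / 5 = 1.2
```

Both results are positive, which matches the fact that y tends to rise as x rises in this data.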

45
Q

What is the significance of both sample and population covariance and how can it be applied?

A

If both the value of x and the corresponding value of y are below their respective means then both brackets in the numerator will be negative, which gives a positive product when multiplied and so points towards a positive relationship between the variables

If one of the values is less than its mean but the other is greater, this gives a negative product in the numerator, which points towards a negative relationship between the 2 variables

If both the value of x and the corresponding value of y are above their respective means then both brackets in the numerator will be positive, which gives a positive product when multiplied and so points towards a positive relationship between the variables

The covariance adds all of these products together, so if the negative products outweigh the positive ones the covariance is negative, meaning an overall negative relationship between the variables, and if the positive products outweigh the negative ones the covariance is positive, meaning an overall positive relationship

46
Q

So what is the link between covariance and direction?

A

MAKE SURE you use key word DIRECTION

A covariance greater than 0 suggests that x and y move in the SAME direction

A covariance less than 0 suggests x and y move in OPPOSITE directions

A covariance equal to 0 suggests x and y have no linear relationship with each other (this alone does not prove that they are independent)
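
One caveat about the zero case is worth checking numerically: covariance only picks up linear relationships. In the made-up example below, y is completely determined by x (y = x squared), yet the covariance comes out as 0 because the positive and negative products cancel.

```python
# Hypothetical data: y depends entirely on x, but not in a linear way
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]   # 4, 1, 0, 1, 4

n = len(xs)
x_mean = sum(xs) / n        # 0.0
y_mean = sum(ys) / n        # 2.0

# Sample covariance: sum of paired deviation products, divided by n - 1
cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)
print(cov)                  # 0.0, even though y is a function of x
```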

47
Q

What is one thing to remember about covariance?

A

No causal effect is implied- there might be a relationship between 2 variables but this does not mean that an increase/decrease in one variable causes an increase/decrease in the other variable

48
Q

How do you calculate the population correlation coefficient?

A

Population correlation coefficient (denoted by the Greek letter rho, ρ, which looks like a lowercase p) = population covariance (sigma with small xy in bottom right corner) / [population standard deviation of x (sigma with small x in bottom right corner) * population standard deviation of y (sigma with small y in bottom right corner)]

49
Q

How do you calculate the sample correlation coefficient?

A

Sample correlation coefficient (denoted by lowercase r) = sample covariance (s with small xy in bottom right corner) / [sample standard deviation of x (s with small x in bottom right corner) * sample standard deviation of y (s with small y in bottom right corner)]
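
A minimal sketch of the sample formula, dividing the sample covariance by the product of the two sample standard deviations. The function name and data are illustrative assumptions. The second call rescales x (as if changing its units) to show that r is unaffected.

```python
import math

def sample_correlation(xs, ys):
    """r = s_xy / (s_x * s_y), with every term using the n - 1 divisor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least 2 paired observations of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = sample_correlation(xs, ys)                            # about 0.77
r_rescaled = sample_correlation([100 * x for x in xs], ys)  # same value: r has no units
```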

50
Q

What are the features of the descriptive statistic correlation coefficient?

A

1) It is unit free
2) It ranges between -1 and 1
3) The closer it is to -1, the stronger the negative linear relationship
4) The closer it is to 1, the stronger the positive linear relationship
5) The closer to 0, the weaker any linear relationship is
6) A correlation coefficient of 0 means no linear relationship exists between the 2 variables
7) A correlation coefficient of +1 indicates a perfect positive linear relationship
8) A correlation coefficient of -1 indicates a perfect negative linear relationship

51
Q

So how does the correlation coefficient tell us both about the direction AND strength of the relationship between 2 variables?

A

Firstly, if the correlation coefficient is negative then this shows that x and y tend to move in the opposite direction (negative relationship) and if positive then this shows that x and y tend to move in the same direction (positive relationship)

The closeness to -1 and 1 tells us about the strength of the correlation/relationship

52
Q

What must you remember about correlation coefficient?

A

It is not the slope of the line
A correlation coefficient of 1 means that all points lie on a straight line with a positive slope, with no scatter of the data points around that line
SEE NOTES IMAGES FOR THIS AND FOR ALL PREVIOUS FLASHCARDS
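
The point that the correlation coefficient is not the slope can be checked with made-up data: two perfectly straight lines with very different slopes both give a coefficient of 1. The function name `correlation` and the lines chosen are assumptions for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient: covariance divided by the product of the standard deviations.
    The n - 1 divisors cancel, so only the sums of deviations are needed."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

xs = [1, 2, 3, 4, 5]
steep = [2 * x + 1 for x in xs]     # slope 2
shallow = [0.5 * x for x in xs]     # slope 0.5
falling = [-3 * x for x in xs]      # slope -3

r_steep = correlation(xs, steep)       # 1 (up to floating-point rounding)
r_shallow = correlation(xs, shallow)   # 1 as well, despite the smaller slope
r_falling = correlation(xs, falling)   # -1: a perfect negative linear relationship
```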

53
Q

What is important to remember about positive and negative skew and the mean, median and mode?

A

A

54
Q

If you have 2 samples and the one sample size is a lot larger than the other, will the mean (average) differ?

A

A

55
Q

If you have 2 samples and the one sample size is a lot larger than the other, will the variance differ?

A

A

56
Q

If you have 2 samples and the one sample size is a lot larger than the other, will the range differ?

A

A