Exploring Data - Topic 3: Numerical Summaries Flashcards

1
Q

What are the major features of a dataset that can be summarised numerically?

A

Minimum
Maximum
Centre (mean, median)
Spread (standard deviation, range, IQR)

2
Q

What are numerical summaries? Advantages and disadvantages?

A

Numerical summaries are a collection of measures that try to describe as much as possible about the data set in as few numbers as possible.

An advantage is that they allow easy communication and comparison.

A disadvantage is that a lot of information is lost in the process.

3
Q

What is the mean?

A

It is the average of ALL the data: the point at which the data balances.

I.e. the gaps above the mean and the gaps below the mean cancel each other out.

4
Q

How do you calculate mean?

A

Mean = Sum of the data / size of the data
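
The formula above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `mean` and the sample readings are made up for the example.

```python
def mean(values):
    """Arithmetic mean: sum of the data divided by the size of the data."""
    if not values:
        raise ValueError("the mean of empty data is undefined")
    return sum(values) / len(values)


readings = [2, 4, 4, 5, 10]  # hypothetical data
average = mean(readings)     # 25 / 5 = 5.0
```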

5
Q

What is the median?

A

It is the middle datapoint when the data is ordered from smallest to largest

6
Q

How do you calculate median?

A

For odd-sized data: It is the middle value of the ordered data, i.e. the value at position (n+1)/2

For even-sized data: It is the average of the 2 middle values of the ordered data
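
As a rough sketch of the two cases, assuming the data is a plain Python list (the function name `median` and the example values are illustrative only):

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: index mid (0-based) is position (n + 1) / 2 when counting from 1.
        return ordered[mid]
    # Even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([7, 1, 3])      # ordered: 1, 3, 7 -> 3
even_median = median([7, 1, 3, 5])  # ordered: 1, 3, 5, 7 -> (3 + 5) / 2 = 4.0
```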

7
Q

What situations would a median be preferred over a mean?

A

In datasets with strong skew or outliers. The mean is pulled towards extreme values, so a few outliers can move it a long way from where most of the data sits.

The median depends only on the middle of the ordered data, so it gives a better measure of the centre in these cases.
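
A small illustration of this, using Python's standard `statistics` module and made-up numbers: replacing one value with an extreme one shifts the mean a lot but leaves the median unchanged.

```python
import statistics

typical = [10, 11, 12, 13, 14]        # hypothetical data
with_outlier = [10, 11, 12, 13, 140]  # same data, last value replaced by an outlier

mean_before = statistics.mean(typical)         # 60 / 5 = 12
mean_after = statistics.mean(with_outlier)     # 186 / 5 = 37.2
median_before = statistics.median(typical)     # 12
median_after = statistics.median(with_outlier) # still 12
```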

8
Q

What does robustness/robust mean? Give an example

A

A summary is robust when it is not strongly affected by outliers, which makes it a good summary for skewed data

E.g. the median is said to be robust: it is a good summary for skewed data because it isn’t affected by outliers

9
Q

What is a left skewed dataset?

A

Also known as a negatively skewed dataset: the long tail extends towards the negative (smaller) side.

Here, we expect the mean to be smaller than the median

10
Q

What is a right skewed dataset?

A

Also known as a positively skewed dataset: the long tail extends towards the positive (larger) side.

Here, we expect the median to be smaller than the mean

11
Q

Which measure of central tendency is best for describing the centre of a skewed distribution?

A

Median for a skewed distribution

Mean for a symmetrical distribution

12
Q

Can there be cases when neither the mean nor the median is helpful in describing the centre?

A

Yes. This typically occurs with bimodal data, where both the mean and the median can fall in the gap between the two peaks

13
Q

What are limitations of the mean?

A

Sensitive to outliers, which can pull the mean away from the typical value

Potential for misinterpretation

Market could have changed

14
Q

What is a limitation of the median?

A

It does not use the actual values of all the data points, only their order

15
Q

What is the wrong way to measure spread?

A

We could calculate the gap between each data value and the mean, and then average these gaps. However, this average is always zero: the mean is the balancing point of the data, so the positive and negative gaps cancel exactly
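
A quick check of this with made-up numbers (the variable names are illustrative only):

```python
data = [2, 4, 4, 5, 10]          # hypothetical data
mean = sum(data) / len(data)     # 5.0

# Gap between each data value and the mean; negative below, positive above.
gaps = [x - mean for x in data]  # [-3.0, -1.0, -1.0, 0.0, 5.0]

# The positive and negative gaps cancel, so their average is zero.
average_gap = sum(gaps) / len(gaps)  # 0.0
```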

16
Q

Thus, what is the new, better way to measure spread?

A

A better alternative to averaging the gaps (each data point minus the mean) is the standard deviation, which is the RMS (root mean square) of the gaps

The RMS measures the average size of a set of numbers, regardless of sign

17
Q

What are the steps to RMS?

A

Square the numbers (here, each data point minus the mean), take the mean of the squares, then take the square root
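
The three steps can be sketched as follows. The helper name `rms` and the data are assumptions for illustration; applied to the gaps from the mean, the result is the population standard deviation.

```python
import math


def rms(numbers):
    """Root mean square: square, then mean, then root."""
    squares = [x * x for x in numbers]          # 1. square each number
    mean_square = sum(squares) / len(squares)   # 2. mean of the squares
    return math.sqrt(mean_square)               # 3. square root


data = [2, 4, 4, 5, 10]                   # hypothetical data
mean = sum(data) / len(data)              # 5.0
gaps = [x - mean for x in data]           # each data point minus the mean
spread = rms(gaps)                        # sqrt((9 + 1 + 1 + 0 + 25) / 5) = sqrt(7.2)
```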

18
Q

What does the population standard deviation measure, and what is the equation for it?

A

It measures the spread of a whole population about its mean

Root ((Sum of (x - average)^2) / N)

19
Q

What does the sample standard deviation measure, and what is the equation for it?

A

It measures the spread of a sample about its mean, dividing by N-1 rather than N

Root ((Sum of (x - average)^2) / (N-1))
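
Python's standard `statistics` module provides both versions: `pstdev` divides by N and `stdev` divides by N-1. The data below is made up for illustration.

```python
import statistics

data = [2, 4, 4, 5, 10]  # hypothetical data; squared gaps from the mean sum to 36

population_sd = statistics.pstdev(data)  # sqrt(36 / 5), divides by N
sample_sd = statistics.stdev(data)       # sqrt(36 / 4) = 3.0, divides by N - 1
```

Because the sample version divides by a smaller number, it is always slightly larger than the population version on the same data.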

20
Q

How do you know when to use the sample s.d. and when to use the population s.d.?

A

Use the population SD when the data contains every value the research question is about; use the sample SD when the data is only part of that wider group

It depends on what the research is trying to find out

For example, assume ‘data’ = Newtown property prices during April - June 2017

If we are only interested in ‘Newtown property prices during April - June 2017’, then the ‘data’ is the whole population

If we want to draw conclusions about property prices in general, then the ‘data’ is considered a sample

21
Q

What percentage of the data lies within 1, 2 and 3 s.d. of the mean?

A

For approximately normal (bell-shaped) data:

About 68% of the data lies within 1 s.d. of the mean

About 95% of the data lies within 2 s.d. of the mean

About 99.7% of the data lies within 3 s.d. of the mean
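
These percentages hold for a normal (bell-shaped) distribution, which is an assumption of this sketch. They can be checked with `statistics.NormalDist` from the Python standard library; the helper name `share_within` is illustrative only.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within k s.d. of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: round(share_within(k), 4) for k in (1, 2, 3)}
# {1: 0.6827, 2: 0.9545, 3: 0.9973}
```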

22
Q

What are standard units?

A

They are also known as z scores, and measure how many standard deviations a score is above or below the mean

To compare 2 data points from different datasets, we can convert both to standard units and compare those

23
Q

How is a standard unit (z score) calculated?

A

(Data point - mean) / SD
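
A minimal sketch of the calculation, used to compare two scores from different tests. The function name `z_score`, the test scores, means and SDs are all made up for the example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    if sd == 0:
        raise ValueError("z score is undefined when the SD is zero")
    return (value - mean) / sd


# Hypothetical results: 80 on a maths test, 85 on an English test.
maths_z = z_score(80, mean=70, sd=5)     # (80 - 70) / 5 = 2.0
english_z = z_score(85, mean=80, sd=10)  # (85 - 80) / 10 = 0.5
```

Although the English score is higher in raw terms, the maths score is further above its own mean in standard units.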

24
Q

What is the Interquartile range (IQR)?

A

It is the range of the middle 50% of the data. More formally, IQR = Q3 - Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile

25
Q

What are quartiles?

A

Quartiles divide the ordered data into quarters: Q1 (25th percentile), Q2 (the median) and Q3 (75th percentile)

26
Q

What is the mathematical rule for identifying an outlier?

A

A data point is an outlier if it lies below the lower threshold or above the upper threshold. Each threshold is a distance of 1.5IQR beyond Q1 or Q3

Lower threshold = Q1 - 1.5IQR

Upper threshold = Q3 + 1.5IQR
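
The thresholds can be sketched as below. Textbooks and software use slightly different conventions for computing quartiles; this example assumes the "inclusive" method of `statistics.quantiles` from the Python standard library, and the data is made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")  # 3.25, 5.5, 7.75
iqr = q3 - q1                                                     # 4.5

lower_threshold = q1 - 1.5 * iqr  # 3.25 - 6.75 = -3.5
upper_threshold = q3 + 1.5 * iqr  # 7.75 + 6.75 = 14.5

outliers = [x for x in data if x < lower_threshold or x > upper_threshold]  # [50]
```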

27
Q

What are the pairs of measures of central tendency and variability that we report together?

A

Mean and standard deviation

Median and IQR

28
Q

What is coefficient of variation? (CV)

A

It measures spread relative to the mean, so it can be used to compare dispersion between distinct series of data, even when they have different units or scales

29
Q

What is the formula for coefficient of variation?

A

SD / Mean
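
A short sketch of the formula. The card does not say whether the sample or population SD is meant; this example assumes the sample SD (`statistics.stdev`), and the function name and data are made up.

```python
import statistics


def coefficient_of_variation(values):
    """SD divided by the mean (sample SD is an assumption of this sketch)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


heights_cm = [150, 160, 170]  # hypothetical: mean 160, SD 10
weights_kg = [50, 60, 70]     # hypothetical: mean 60, SD 10

cv_heights = coefficient_of_variation(heights_cm)  # 10 / 160 = 0.0625
cv_weights = coefficient_of_variation(weights_kg)  # 10 / 60, roughly 0.167
```

Both series have the same SD, but the weights vary more relative to their mean, which is what the CV captures.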

30
Q

What are examples of what CV is used for?

A

Analytical chemistry

Engineering + physics

Economics for determining volatility of securities