Topic 3: Numerical Summaries Flashcards

Question 1

Q

What are the main features of numerical summaries?

Answer

A

Max & min
Centre (mean, median)
Spread (standard deviation, range, interquartile range)

Question 2

Q

Why do we need to pair centre features with spread?

Answer

A

Centre alone can mislead: two datasets can share the same mean or median yet be spread out very differently, so reporting only the centre invites hasty assumptions about the dataset.
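
As a small illustration, the sketch below compares two made-up datasets that share the same mean and median but differ greatly in spread. The values and variable names are invented for this example.

```python
from statistics import mean, median, pstdev

# Two made-up datasets with identical centres.
tight = [49, 50, 51]
wide = [0, 50, 100]

for name, data in (("tight", tight), ("wide", wide)):
    # pstdev divides by n, i.e. it is the RMS of the gaps from the mean.
    print(name, mean(data), median(data), round(pstdev(data), 2))
```

Both datasets report a centre of 50, yet only the spread reveals how different they are.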

Question 3

Q

Why do we need numerical summaries?

Answer

A

Numerical summaries reduce the whole dataset to one or a few numbers. A lot of information is lost, but communication and comparison become much easier.

Question 4

Q

What is the mean and how is it calculated?

Answer

A

The mean is the balancing point of the dataset and takes every data point into account. Readings above and below the mean cancel each other out: their gaps from the mean sum to zero.

Mean = sum/size
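
A minimal sketch of the formula, using made-up readings. The function name `mean_of` is just an illustration. It also checks the balancing-point idea: the gaps from the mean add up to zero.

```python
def mean_of(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)

readings = [2, 4, 4, 6, 9]            # made-up sample data
m = mean_of(readings)                 # 25 / 5
gaps = [x - m for x in readings]      # signed distances from the mean
print(m, sum(gaps))
```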

Question 5

Q

What is the median and how is it calculated?

Answer

A

The median is the middle point of the sorted dataset. It depends only on the 1 or 2 central points.

If the dataset has an odd number of readings, the median is the unique middle value.
If the dataset has an even number of readings, any value between the 2 middle points is a median (by convention, take their average).
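
The odd and even cases can be sketched as follows. This assumes the usual convention of averaging the 2 middle points; the function name `median_of` is illustrative.

```python
def median_of(values):
    """Middle value of the sorted data; average of the 2 middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: unique middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the 2 middle values

odd_case = median_of([7, 1, 3])       # sorted: 1, 3, 7
even_case = median_of([1, 3, 5, 7])   # middle points are 3 and 5
```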

Question 6

Q

When should you use the mean vs the median?

Answer

A

The mean suits fairly symmetric data.
The median suits skewed data and data with large outliers.
If the distribution is bimodal, neither one is a suitable summary of the centre.

Question 7

Q

What is the standard deviation and how is it calculated?

Answer

A

The standard deviation (SD) measures how spread out the data is around the mean.

SD = RMS of gaps from the mean = sqrt[mean of (gaps from the mean)^2]
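
The RMS formula can be sketched step by step on made-up readings. Note that it divides by n (often called the population SD); some software defaults to dividing by n - 1 instead (the sample SD), so results may differ slightly.

```python
import math

def population_sd(values):
    """RMS of the gaps from the mean (divides by n, not n - 1)."""
    n = len(values)
    m = sum(values) / n
    squared_gaps = [(x - m) ** 2 for x in values]   # square each gap
    return math.sqrt(sum(squared_gaps) / n)         # mean of the squares, then root

readings = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data with mean 5
sd = population_sd(readings)          # squared gaps sum to 32, 32 / 8 = 4, sqrt(4) = 2
```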

Question 8

Q

What percentage of the data lies within 1 SD, 2 SD, and 3 SD of the mean?

Answer

A

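For roughly normal (bell-shaped) data:
Within 1 SD of the mean: about 68%
Within 2 SD of the mean: about 95%
Within 3 SD of the mean: about 99.7%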
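
These percentages are the usual rule of thumb for data that follows a normal curve. Assuming an exactly normal distribution, they can be checked with the standard library's `NormalDist`; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

Real datasets are rarely exactly normal, so the observed percentages will only be approximately these values.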

Question 9

Q

What is the IQR and how is it calculated?
What does it represent in a boxplot?

Answer

A

The IQR is the interquartile range: the range of the middle 50% of the data.

IQR = Q3 - Q1 = 75th percentile - 25th percentile

The IQR is the length of the box in a boxplot.
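
A short sketch of the calculation on made-up data. Quartile conventions differ slightly between textbooks and software; this example assumes the "inclusive" method of Python's `statistics.quantiles`, which may not match the convention used in your course.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up data

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)
```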

Question 10

Q

How does the mean compare with the median for different shapes of data?

Answer

A

In symmetric data, the mean is close to the median.

In left-skewed data, the small values in the tail drag the mean down
-> mean < median

In right-skewed data, the large values in the tail drag the mean up
-> mean > median
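
The three cases can be illustrated with small made-up datasets (the values are invented purely to show the direction of the effect):

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
left_skewed = [1, 17, 18, 18, 19, 19, 20]   # long tail of small values
right_skewed = [1, 2, 2, 3, 3, 4, 20]       # long tail of large values

for name, data in (("symmetric", symmetric),
                   ("left skewed", left_skewed),
                   ("right skewed", right_skewed)):
    print(name, "mean:", mean(data), "median:", median(data))
```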

Question 11

Q

What is the difference between using (mean, SD) and (median, IQR)?

Answer

A

(Median, IQR) is more robust than (mean, SD): it is barely affected by outliers, so it suits skewed data better.
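
A sketch of what robustness means in practice: replacing one value in a made-up dataset with an extreme outlier shifts the mean and SD substantially, while the median and IQR stay put. The `summarise` helper is hypothetical, and the quartiles assume the "inclusive" method of `statistics.quantiles`.

```python
from statistics import mean, median, pstdev, quantiles

def summarise(values):
    """Return both pairs of summaries for comparison."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"mean": mean(values), "sd": pstdev(values),
            "median": median(values), "iqr": q3 - q1}

clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = [10, 12, 13, 14, 15, 16, 180]   # largest value replaced by an outlier

before = summarise(clean)
after = summarise(with_outlier)
print(before)
print(after)
```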

Question 12

Q

What are standard units and how are they calculated?

Answer

A

Standard units measure how many SDs a data point lies above or below the mean.

Standard units = (data point - mean)/SD
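
A minimal sketch of the conversion, assuming the SD is the RMS of the gaps from the mean (dividing by n). The readings are made up and the function name is illustrative.

```python
import math

def standard_units(values):
    """Convert each data point to (data point - mean) / SD."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / n)
    if sd == 0:
        raise ValueError("standard units are undefined when all values are equal")
    return [(x - m) / sd for x in values]

readings = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, SD 2
z = standard_units(readings)          # 2 is 1.5 SDs below the mean, 9 is 2 SDs above
```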

Question 13

Q

What is the coefficient of variation and how is it calculated?

Answer

A

The coefficient of variation is a relative measure of spread: it combines the SD and the mean into 1 summary by expressing the SD as a fraction of the mean.

CoV = SD/mean
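
A sketch of the formula on made-up data, again assuming the SD divides by n. Because the CoV is a ratio, it does not depend on the units of measurement: the same heights in centimetres and in metres give the same value.

```python
import math

def coefficient_of_variation(values):
    """SD divided by the mean (only meaningful when the mean is not zero)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / n)
    return sd / m

readings = [2, 4, 4, 4, 5, 5, 7, 9]           # mean 5, SD 2
cov = coefficient_of_variation(readings)      # 2 / 5

heights_cm = [150, 160, 170, 180, 190]        # made-up heights
heights_m = [h / 100 for h in heights_cm]     # same heights in metres
```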

Question 14

Q

What features are included in a boxplot?

Answer

A

Q2: median
Q1, Q3: 25th and 75th percentiles (the ends of the box)
Lower threshold: LT = Q1 - 1.5*IQR
Upper threshold: UT = Q3 + 1.5*IQR
Data points lying outside the thresholds are outliers.
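
The boxplot features can be computed directly from the formulas above. This is a sketch on made-up data; the function name `boxplot_summary` is hypothetical, and the quartiles assume the "inclusive" method of `statistics.quantiles`, which may differ slightly from other software.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Quartiles, thresholds, and outliers as used in a boxplot."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr   # lower threshold LT
    upper = q3 + 1.5 * iqr   # upper threshold UT
    outliers = [x for x in values if x < lower or x > upper]
    return {"q1": q1, "median": q2, "q3": q3,
            "lower": lower, "upper": upper, "outliers": outliers}

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```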

Question 15

Q

Describe the difference between quantiles and quartiles.

Answer

A

A set of q-quantiles consists of (q-1) cut points that divide the data into q equal-size groups (in terms of the percentage of data).

Quartiles are the 4-quantiles: Q1, Q2, and Q3 divide the data into quarters.
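
A quick check of the counting with Python's `statistics.quantiles`, which returns n - 1 cut points that split the data into n groups. The data here is simply the integers 1 to 100, chosen so the groups come out exactly equal.

```python
from bisect import bisect_right
from statistics import quantiles

data = list(range(1, 101))             # made-up data: 1, 2, ..., 100

quartile_cuts = quantiles(data, n=4)   # 3 cut points -> 4 groups
decile_cuts = quantiles(data, n=10)    # 9 cut points -> 10 groups

# Count how many data points fall into each quartile group.
edges = [0] + [bisect_right(data, c) for c in quartile_cuts] + [len(data)]
group_sizes = [b - a for a, b in zip(edges, edges[1:])]
print(quartile_cuts, group_sizes)
```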

Question 16

Q

What are some steps in data wrangling?

Answer

A

Sourcing: checking the reliability, integrity, and original source of the data

Scraping: extracting data from any source (web scraping: from websites)

Cleaning and tidying: producing neat datasets

Question 17

Q

What makes a dataset neat?

Answer

A

Each variable is a column.
Each subject/observation is a row.
Each type of observational unit forms a table.
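
As an illustration of these rules, the sketch below reshapes a hypothetical table of test scores so that each row holds a single observation. The student names, column names, and the `to_tidy` helper are all invented for this example.

```python
# Hypothetical "wide" table: one row per student, one column per test.
wide_rows = [
    {"student": "Ana", "test1": 70, "test2": 80},
    {"student": "Ben", "test1": 65, "test2": 90},
]

def to_tidy(rows, id_column, value_columns):
    """Reshape so each row is one observation: (student, test, score)."""
    tidy = []
    for row in rows:
        for column in value_columns:
            tidy.append({id_column: row[id_column],
                         "test": column,
                         "score": row[column]})
    return tidy

tidy_rows = to_tidy(wide_rows, "student", ["test1", "test2"])
for row in tidy_rows:
    print(row)
```

After reshaping, the variables (student, test, score) are the columns and each score is its own row.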

(17 cards)