Statistics - Summarising and presenting data Flashcards

1
Q

What is the purpose of statistics?

A
  • To summarise and present the information contained in a data set
  • To handle and quantify variation and uncertainty in the data, to help to infer what they tell us about the underlying theory of interest.
2
Q

What are the 5 main summary measures of any numerical data?

A

Mean, Median, Mode, range, and inter-quartile range (IQR)

3
Q

How do you calculate Mean?

A

Add all the values together and divide by how many values there are
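
As a small worked example of the rule above, here is a Python sketch (Python rather than STATA, purely for illustration). The list of ages is made up and not taken from any dataset.

```python
# Hypothetical sample of ages; the values are illustrative only.
ages = [23, 25, 31, 35, 41]

# Mean = sum of the values divided by the number of values.
mean_age = sum(ages) / len(ages)

print(mean_age)  # 31.0
```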

4
Q

How do you calculate Median?

A

The median is the middle value. Arrange all of the values in size order and locate the middle value.

If there are 2 middle values (an even number of values), the median is the mean of those two values: add them together and divide by 2.
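
The two cases (odd and even number of values) can be sketched in Python as follows. The function name `median` and the sample values are illustrative assumptions, not part of the course material.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)          # arrange the values in size order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([41, 23, 35, 25, 31])        # sorted: 23 25 31 35 41
even_case = median([41, 23, 35, 25, 31, 29])   # sorted: 23 25 29 31 35 41

print(odd_case)   # 31
print(even_case)  # 30.0
```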

5
Q

How do you calculate inter-quartile range (IQR)?

A

Inter-quartile range (IQR) is the difference between the 75th and 25th percentiles of the data.

The quartiles (Q1, Q2 and Q3) divide the rank-ordered data into 4 equal parts:
- Q1 / lower quartile / 25%
- Q2 / the median / 50%
- Q3 / upper quartile / 75%

IQR = Q3 - Q1
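
A hedged Python sketch of IQR = Q3 - Q1, using the standard library's `statistics.quantiles` with its default rule. The data are made up, and other packages (STATA included) may use a slightly different percentile rule, so quartiles for small samples will not always match exactly.

```python
from statistics import quantiles

# Hypothetical data, already in rank order for readability.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]

# n=4 splits the ordered data into four equal parts and returns Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)

iqr = q3 - q1

print(q1, q2, q3)  # 4.0 8.0 12.0
print(iqr)         # 8.0
```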

6
Q

How do you calculate range?

A

Range = largest value - smallest value

7
Q

How do you calculate Mode?

A

Mode is the number or value which is repeated most often among all of the values.
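
A short Python illustration, assuming made-up values. `statistics.multimode` is used because a dataset can have more than one mode when values tie for most frequent.

```python
from statistics import multimode

# Hypothetical values; 7 appears most often.
single_mode = multimode([5, 6, 6, 7, 7, 7, 8])

# When two values tie for most frequent, both are modes.
tied_modes = multimode([1, 1, 2, 2, 3])

print(single_mode)  # [7]
print(tied_modes)   # [1, 2]
```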

8
Q

What is standard deviation?

A

Standard deviation is the square root of the variance

Standard deviation (Std. Dev.) = √(variance)

9
Q

How do you calculate variance?

A

To calculate the variance of a dataset, find the distance of each value from the mean, square each distance, add the squared distances together, and divide the total by the number of values.

Squaring makes every distance positive, so that values below the mean (negative distances) do not cancel out values above it.

Variance = sum of squared distances from the mean / number of values. (For a sample, statistical packages usually divide by the number of values minus 1 instead.)
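
The calculation can be sketched step by step in Python. The values are made up; the sketch also shows the n - 1 (sample) divisor that statistical packages commonly report, and the standard deviation as the square root of the variance.

```python
import math

# Hypothetical values chosen so the arithmetic is easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

mean = sum(values) / n                                   # 5.0

# Squared distance of each value from the mean.
squared_distances = [(x - mean) ** 2 for x in values]    # total is 32.0

population_variance = sum(squared_distances) / n         # divide by n
sample_variance = sum(squared_distances) / (n - 1)       # divide by n - 1

# Standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)

print(population_variance)  # 4.0
print(population_sd)        # 2.0
```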

10
Q

STATA can be used to run statistical tests once it is given a dataset and the relevant variables and commands are entered. TRUE or FALSE?

A

TRUE

11
Q

Can STATA statistical software calculate mean, standard deviation, range, mode, median, and variance?

A

Yes it can, but you should still know how to calculate them all yourself.

12
Q

When the variable ‘Age’ is selected in STATA, what is the command that should be used to calculate summary measures (Obs/Mean/Std. Dev./Min/Max)?

A

summarize Age

13
Q

What command should be used in STATA to obtain more detailed summary measures (quartiles, median etc., rather than just mean, Std. Dev. etc.)?

A

summarize Age, detail

14
Q

If data presents in a graph as either positively or negatively skewed (not normally distributed), is finding the mean and standard deviation an appropriate measure?

A

No, median and inter-quartile range are more appropriate measures for data which is NOT normally distributed.

This is because skewness pulls the mean towards the long tail: in positively skewed data (long tail to the right) the mean is larger than the median, and in negatively skewed data (long tail to the left) the mean is smaller than the median.
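
A hedged Python illustration of how a long tail moves the mean but not the median. Both datasets are made up; the second is simply a mirror image of the first.

```python
from statistics import mean, median

# Hypothetical positively skewed data: one large value forms a right tail.
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]

# Mirror image of the same data, giving a left tail (negative skew).
left_tail = [21 - x for x in right_tail]

print(mean(right_tail), median(right_tail))  # 4.75 3.0   -> mean > median
print(mean(left_tail), median(left_tail))    # 16.25 18.0 -> mean < median
```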

15
Q

If data presents as normally distributed (distribution tail extended equally over both left and right sides) in a graph, is finding the mean and standard deviation an appropriate measure?

A

Yes, finding the mean and standard deviation is an appropriate measure for normally distributed data.

16
Q

In positively skewed data (long tail to the right, with most values on the left of the graph), is the mean larger or smaller than the median?

A

The mean is larger than the median in positively skewed data.

Positively skewed data: mean > median

17
Q

In negatively skewed data (long tail to the left, with most values on the right of the graph), is the mean larger or smaller than the median?

A

In negatively skewed data the mean is smaller than the median.

Negatively skewed data: mean < median

18
Q

Name three main things which presenting data in graphs allows us to easily derive from the data.

A

Graphical representation of data enables us to get a feel for:
1. Typical (central) values and range of values
2. Shape and spread of the distribution of values
3. Interesting patterns and relationships in the data

19
Q

Name two data quality problems that graphical displays (graphs) can reveal.

A

Graphical displays can reveal problems concerning the quality of the data, including:
1. Identifying outlying / erroneous observations
2. Digit preference

20
Q

Name three types of graph used in statistical analysis.

A
  1. Bar charts
  2. Histograms
  3. Line graphs
21
Q

Name two types of tables used in statistical analysis.

A
  1. Frequency tables
  2. Cross tabulations (contingency tables)
22
Q

What is the risk of having too few classes within your data set when using a histogram to present data?

A

If there are too few classes in the data set when using a histogram, it could be difficult to see any interesting patterns when the data is presented.

23
Q

What is the risk associated with having too many classes within your data set when using a histogram to present data?

A

If there are too many classes when presenting data in a histogram, there may be only one observation per class as opposed to a group of observations. The number of observations per class should be no less than 2.
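
The effect of class width can be sketched in Python by counting observations per class. The ages, the class widths and the helper `class_counts` are all illustrative assumptions.

```python
from collections import Counter


def class_counts(values, width):
    """Count observations per class; each class starts at a multiple of width."""
    return Counter((v // width) * width for v in values)


# Hypothetical ages.
ages = [21, 22, 24, 27, 29, 31, 33, 34, 38, 41, 45, 52]

too_few = class_counts(ages, 50)    # 2 classes: the pattern is hidden
too_many = class_counts(ages, 1)    # 12 classes: one observation in each
moderate = class_counts(ages, 10)   # 4 classes: the shape becomes visible

print(dict(too_few))   # {0: 11, 50: 1}
print(dict(moderate))  # {20: 5, 30: 4, 40: 2, 50: 1}
```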

24
Q

The optimal number of classes in a data set that is presented in a histogram ensures that interesting patterns are not unintentionally masked, unlike in the case that there are either too many or too few classes. TRUE or FALSE.

A

TRUE

25
Q

Is continuous data a type of quantitative data or categorical data?

A

Continuous data is a type of quantitative data.

26
Q

Give an example of continuous data.

A

Any of the following:
- Blood pressure
- Age
- Concentration of a pollutant

27
Q

Is discrete data a type of quantitative data or categorical data?

A

Discrete data is a type of quantitative data.

28
Q

Give an example of discrete data.

A

Any of the following:
- Number of children (parity)
- Number of cigarettes per day
- Counts of death in small areas

29
Q

Is ordinal data a type of quantitative data or categorical data?

A

Ordinal data (ordered categories) is a type of categorical data.

30
Q

Give an example of ordinal data (ordered categories).

A

Any of the following:
- Grade of breast cancer
- Disease severity (mild/moderate/severe)
- Social class (I, II, III, IV, V)

31
Q

Is nominal data a type of quantitative data or categorical data?

A

Nominal data (unordered categories) is a type of categorical data.

32
Q

Give an example of nominal data (unordered categories).

A

Any of the following:
- Sex (male/female)
- Exposed/unexposed
- Ethnicity (white/asian/black/other)

33
Q

What are factors?

A

‘Factors’ is the name often given to categorical covariate data.

34
Q

What is dichotomous or binary data?

A

Categorical data which takes on only two distinct values is also known as dichotomous or binary data.

35
Q

Categorical data can often be coded using numerical values. TRUE or FALSE?

A

TRUE

36
Q

Name a disadvantage that can present when using statistical packages to analyse coded categorical data.

A

It is important to declare that the data is categorical before running tests, as statistical packages will often treat numeric data (including coded categorical data) as quantitative unless explicitly declared as categorical.
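
A Python sketch of why this matters, assuming a hypothetical coding of ethnicity (1 = white, 2 = asian, 3 = black, 4 = other). Treating the codes as quantities produces a number with no meaning, whereas treating them as categories gives a frequency table.

```python
from collections import Counter
from statistics import mean

# Hypothetical coding scheme; the codes are labels, not quantities.
ETHNICITY_LABELS = {1: "white", 2: "asian", 3: "black", 4: "other"}
codes = [1, 1, 2, 4, 3, 1, 2, 4]

# Treated as quantitative: a "mean ethnicity" of 2.25 is meaningless.
misleading_mean = mean(codes)

# Treated as categorical: count how often each category occurs.
frequencies = Counter(ETHNICITY_LABELS[c] for c in codes)

print(misleading_mean)    # 2.25
print(dict(frequencies))  # {'white': 3, 'asian': 2, 'other': 2, 'black': 1}
```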

37
Q

Name one limiting factor of continuous observation.

A

One limiting factor of continuous observation is the accuracy of the measurement instrument.

38
Q

It is possible to transform continuous data into categorical data in the case that the amount of detail provided by continuous data is not necessary. TRUE or FALSE?

A

TRUE

E.g. >2.5kg = 0 , and <2.5kg = 1
In a study of the effect of maternal smoking on birthweight, birthweight can be re-coded as shown above.
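
The re-coding can be sketched in Python as follows. The birthweights are made up, and treating a weight of exactly 2.5 kg as 0 is an assumption, since the card does not say which group it falls in.

```python
LOW_BIRTHWEIGHT_KG = 2.5  # cut-off taken from the card


def recode(weight_kg):
    """Return 1 for birthweight below 2.5 kg, otherwise 0.

    Assumption: exactly 2.5 kg is coded 0; the card leaves this case open.
    """
    return 1 if weight_kg < LOW_BIRTHWEIGHT_KG else 0


# Hypothetical birthweights in kg.
birthweights = [3.4, 2.1, 2.9, 2.4, 3.8]
low_birthweight = [recode(w) for w in birthweights]

print(low_birthweight)  # [0, 1, 0, 1, 0]
```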

39
Q

How can transforming data to a different scale sometimes be helpful?

A

It is sometimes helpful to transform data to a different scale to aid interpretation and/or statistical analysis.

40
Q

Name a reason for transforming data.

A

Any of the following:
- To get improved approximation to normality
- To reduce skewness
- To linearise the relationship between two variables
- To make multiplicative relationships additive

41
Q

Name a common transformation.

A

Any of the following:
- Natural logarithm (y = log_e(x), so x = e^y or exp(y), where e = 2.718…)
- Power transformations (y = x, y = x^2, y = x^3, etc.)
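
A hedged Python sketch of the natural logarithm transformation, using made-up values that grow by a constant factor of 10. It shows that equal ratios on the original scale become equal differences on the log scale, and that exp() reverses the transformation.

```python
import math

# Hypothetical values, each 10 times the previous one.
values = [1, 10, 100, 1000]

# Natural logarithm: y = log_e(x).
logged = [math.log(v) for v in values]

# On the log scale, each step is the same size: log_e(10).
differences = [b - a for a, b in zip(logged, logged[1:])]

# exp() undoes the transformation: x = e^y.
back_transformed = [math.exp(y) for y in logged]

print([round(d, 4) for d in differences])  # [2.3026, 2.3026, 2.3026]
```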

42
Q

Which common transformation is the following example?

(y = x , y = x2 , y = x3 , etc.)

A

Power transformation (y = x , y = x2 , y = x3 , etc.)

43
Q

Which common transformation is the following example?

(y = log_e(x), so x = e^y or exp(y), where e = 2.718…)

A

Natural logarithm (y = log_e(x), so x = e^y or exp(y), where e = 2.718…)

44
Q

Name important things to check when displaying data in a spreadsheet - to ensure your data is ready for analysis

A
  • Coding - Check twice that your coding is correct (including typos where you may have entered incorrect information or mistyped a number)
  • Check that relevant research data matches your findings
  • Compare your data with those of similar study cohorts: are they consistent?
  • Decide how you will handle missing values
45
Q

It is necessary to be able to distinguish between different types of data, such as continuous, discrete or categorical. TRUE or FALSE?

A

TRUE

46
Q

The most appropriate way to present data is dependent on the type of data. TRUE or FALSE?

A

TRUE

47
Q

What type of data are frequency tables most appropriate for?

A

Frequency tables are appropriate for all types of data.

48
Q

What are two main tips for creating a good frequency table?

A
  1. For quantitative data, it is important to think carefully about appropriate choice of classes/intervals to group data before display
  2. Keep information in tables to the minimum necessary to convey the message you want to present (significant figures, number of variables/categories)
49
Q

What type of graph is most appropriate for displaying categorical data?

A

Bar charts are appropriate for displaying categorical data.

50
Q

What graphs are most appropriate for displaying quantitative data?

A

Histograms and box plots are appropriate for displaying quantitative data.

51
Q

Which of the following statements is true for positively skewed data?
a) Mean = Median
b) Mean = Mode
c) Median < Mean
d) Median > Mean

A

c) Median < Mean

52
Q

An appropriate summary measure for any skewed data is:
a) Median and interquartile range
b) Mean and variance
c) Mean and mode
d) Mode and standard deviation

A

a) Median and interquartile range

53
Q

The daily count of deaths due to the Covid-19 virus is a/an:
a) Continuous variable
b) Discrete variable
c) Ordered categorical variable
d) Unordered categorical variable

A

b) Discrete variable

54
Q

Disease severity (mild/moderate/severe) is a/an:
a) Continuous variable
b) Discrete variable
c) Unordered categorical variable (nominal variable)
d) Ordered categorical variable (ordinal variable)

A

d) Ordered categorical variable

55
Q

Which of the following is true for negatively skewed data?
a) Median < Mean
b) Mean = Median
c) Median > Mean
d) Mean = Mode

A

c) Median > Mean in negatively skewed data

56
Q

Which of the following statements is true for positively skewed data?
a) Median < Mean
b) Median > Mean
c) Mean = Mode
d) Mean = Median

A

a) Median < Mean in positively skewed data

57
Q

Which of the following statements is true for a normal distribution, with the tails extending equally over both sides?
a) Median and standard deviation are appropriate measures.
b) Median and interquartile range are appropriate measures.
c) Mean and standard deviation are appropriate measures.
d) Mean and interquartile range are appropriate measures.

A

c) Mean and standard deviation are appropriate measures.

58
Q

Which of these subtractions gives the value of the interquartile range for a continuous variable?
a) 75th percentile - 25th percentile
b) Median - mean
c) Largest value - smallest value
d) Mean - median

A

a) 75th percentile - 25th percentile