New Flashcards

1
Q

What methods should you use to summarise ordinal categorical data

A

Median

Interquartile range

2
Q

What is the purpose of a pie chart

A

To show frequencies/proportions/percentages

3
Q

What is purposive sampling

A

Sampling when the researcher uses their expertise to choose a sample that is most useful for the purpose of the research

4
Q

How many and what kind of variable would you use with a means plot

A

One scale (aka continuous) variable and one or two categorical variables

5
Q

What is root cause analysis

A

It is a method used to solve problems by first identifying the root cause of the problem.

6
Q

What methods should you use to summarise continuous normally distributed data

A

Mean

Standard deviation

7
Q

For skewness and kurtosis, what range should the score fall within to show the data is not too far from normal

A

+1 and -1

8
Q

How do the standard error and the margin of error relate

A

As the standard error increases, the margin of error also increases.
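
A minimal sketch of this relationship, assuming a normal-based interval where the margin of error is the critical value z multiplied by the standard error. The function name `margin_of_error` and the example standard errors are illustrative assumptions, not part of the original card.

```python
from statistics import NormalDist


def margin_of_error(standard_error, confidence=0.95):
    """Margin of error for a normal-based interval: z * standard error."""
    # z is the critical value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * standard_error


# Hypothetical standard errors: the larger one gives the larger margin
small = margin_of_error(0.5)
large = margin_of_error(2.0)
print(round(small, 2), round(large, 2))
```

Because z is fixed for a given confidence level, the margin of error scales in direct proportion to the standard error.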

9
Q

What overall method of test would you use when working with a skewed continuous dependent variable

A

Non-parametric test

10
Q

What specific test would you use when comparing three or more measurements on the same subject when the data is not normally distributed

A

Friedman test

11
Q

What is stratified sampling

A

The population is divided into subpopulations (strata) with key differences eg gender, age, and a random sample is then drawn from each stratum
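
One way the idea might be sketched, assuming a random sample of the same fraction is then drawn from every stratum. The function `stratified_sample`, the `gender` field and the 60/40 split are hypothetical and only there for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=None):
    """Group records into strata by key, then sample the same fraction of each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        # Take at least one member so that no stratum is left out
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 60 women and 40 men
people = [{"id": i, "gender": "f" if i < 60 else "m"} for i in range(100)]
chosen = stratified_sample(people, key=lambda p: p["gender"], fraction=0.1, seed=1)
print(len(chosen))
```

Sampling the same fraction from each stratum keeps the sample's make-up in line with the population's.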

12
Q

What is the purpose of a means plot

A

Looks at the combined effect of two categorical variables on the mean of one scale variable

13
Q

What methods are used to determine outliers

A

Standard deviation/ z score

Interquartile range
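
Both approaches can be sketched roughly as follows. The cut-offs used here (a z score beyond 3, or a value more than 1.5 × IQR outside the quartiles) are common rules of thumb assumed for illustration, and the data values are made up.

```python
from statistics import mean, quantiles, stdev


def z_score_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold standard deviations."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical data with one obviously extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(z_score_outliers(data, threshold=2.5), iqr_outliers(data))
```

The IQR rule is less affected by the extreme value itself, because the quartiles barely move when one point is very large, whereas the mean and standard deviation both do.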

14
Q

Generally, when can ordinal data be analysed with parametric tests

A

When there are 7 or more categories and the data is approximately normally distributed

15
Q

Why is mean imputation considered bad

A

It ignores the correlation between features. It also artificially reduces the variance of the data and introduces bias, which lowers the accuracy of the model and makes confidence intervals narrower than they should be.
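
The loss of variance can be seen in a small sketch. The values below are invented, and `None` is assumed to mark a missing entry.

```python
from statistics import mean, variance

# Hypothetical measurements; None marks a missing value
observed = [4.0, 8.0, None, 6.0, None, 10.0, 2.0]

known = [v for v in observed if v is not None]
fill = mean(known)

# Mean imputation: every missing entry gets the same value
imputed = [fill if v is None else v for v in observed]

print(variance(known), variance(imputed))
```

The imputed values sit exactly on the mean, so they add nothing to the spread while increasing the sample size, which is why the variance shrinks.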

16
Q

What specific test would you use when comparing the averages of three or more independent groups when the data is normally distributed

A

One way ANOVA

17
Q

What is the meaning of covariance

A

Covariance is a measure of how two random variables vary together. A positive covariance means the variables tend to increase together, while a negative covariance means one tends to decrease as the other increases.
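
A short sketch of the sample covariance calculation, using the n − 1 denominator. The function name and the study-hours data are hypothetical.

```python
from statistics import mean


def sample_covariance(xs, ys):
    """Average product of paired deviations from the means, with n - 1 denominator."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical data: hours studied and test scores
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

positive = sample_covariance(hours, scores)        # variables rise together
negative = sample_covariance(hours, scores[::-1])  # one falls as the other rises
print(positive, negative)
```

The sign shows the direction of the relationship, but the size depends on the units of both variables, which is why the correlation coefficient is normally used to judge strength.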

18
Q

What is observational data

A

Observational data is data obtained from observational studies, where variables are observed without intervention to see if there is any relationship between them

19
Q

How do you find degrees of freedom

A

The number of independent values that are free to vary, typically the number of observations minus one (n − 1)

20
Q

What overall method of test would you use when working with a normally distributed continuous dependent variable

A

Parametric test

21
Q

What is selection bias

A

Selection bias occurs when individuals or groups are selected in a way that is not random, so the sample does not represent the population

22
Q

What are ordinal variables

A

Categorical variables with an obvious order

Eg most - least likely

23
Q

What are continuous scale variables

A

Variables that can take any value within a range

Eg height

24
Q

What is the purpose of a scatter graph

A

Shows the relationship between two variables and helps detect outliers

25
Q

What is the purpose of a histogram

A

To show the distribution of results

26
Q

What specific test would you use when comparing three or more measurements on the same subject when the data is normally distributed

A

Repeated measures ANOVA

27
Q

What is the relationship between the confidence level and the significance level in statistics?

A

The significance level (α) is the probability of rejecting the null hypothesis when it is actually true.
The confidence level is the proportion of confidence intervals, over repeated sampling, that would contain the true population value.

The two are related by the following formula:

Significance level = 1 − Confidence level

28
Q

How many and what kind of variable would you use with a scatter graph

A

Two scale (aka continuous) variables

29
Q

When would it be better to use the median than the mean to study data

A

When the data is skewed or contains outliers that would pull the mean up or down
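
A quick illustration with made-up values, showing how a single extreme observation moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical salaries in thousands
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]

print(mean(salaries), median(salaries))
print(round(mean(with_outlier), 2), median(with_outlier))
```

Adding the one extreme value shifts the mean from 35 to roughly 95.83, while the median only moves from 35 to 36.5.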

30
Q

What is a survivorship bias

A

Survivorship bias is a sample selection flaw that occurs when a dataset only considers the ‘surviving’ or existing observations and ignores those that have already ceased to exist.

31
Q

What methods should you use to summarise nominal categorical data

A

Mode

32
Q

What are the two types of scale variables

A

Continuous

Discrete

33
Q

What are 5 ways of handling missing data

A

Winsorizing the data
Prediction of missing values
Deletion of rows with missing data
Mean imputation
Median imputation

34
Q

What are the two main types of categorical variables

A

Ordinal

Nominal

35
Q

What is the central limit theorem

A

The central limit theorem states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution
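
A small simulation can make this concrete. The sketch below assumes an exponential population (strongly right skewed, with mean 1 and standard deviation 1); the sample size, number of trials and seed are arbitrary choices.

```python
import random
from statistics import mean, stdev


def sample_means(sample_size, trials, seed=0):
    """Draw repeated samples from a skewed population and return each sample's mean."""
    rng = random.Random(seed)
    return [
        mean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(trials)
    ]


means = sample_means(sample_size=50, trials=2000)

# The sample means cluster around the population mean (1.0), with a spread
# close to the population standard deviation divided by sqrt(sample size)
print(round(mean(means), 2), round(stdev(means), 2))
```

Plotting `means` as a histogram would show a roughly bell-shaped curve even though the underlying population is far from normal.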

36
Q

What are right skewed distributions

A

A right-skewed distribution is one where the right tail is longer than the left one. Here, the mean > median > mode.

37
Q

What specific test would be used for assessing the relationship between two categorical variables when the data is not normally distributed

A

Chi-Squared test

38
Q

What kind of summarising statistics would you get from a pie chart

A

Class percentages

39
Q

How many and what kind of variable would you use with a stacked bar chart

A

Two categorical variables

40
Q

What methods should you use to summarise skewed data or data with influential outliers

A

Median

Interquartile range

41
Q

What is experimental data

A

Experimental data is derived from experimental studies, where certain variables are held constant while others are changed to see what effect this has on the outcome.

42
Q

How many and what kind of variable would you use with a line chart

A

One scale variable measured over time

43
Q

What kind of summarising statistics would you get from a boxplot

A

Median

Interquartile range

44
Q

What kind of summarising statistics would you get from a histogram

A

Mean and standard deviation

45
Q

What are three types of symmetric distribution

A

Uniform distribution
Normal distribution
Binomial distribution (when p = 0.5)

46
Q

What is snowball sampling

A

When participants are hard to access, participants are recruited through other participants

47
Q

What specific test would you use when comparing the averages of two independent groups when the data is normally distributed

A

An independent t test

48
Q

What kind of summarising statistics would you get from a stacked bar chart

A

Percentages within groups

49
Q

What is a long tailed distribution

A

A type of distribution where the tail drops off gradually toward the end of the curve

50
Q

What are left skewed distributions

A

A left-skewed distribution is one where the left tail is longer than the right tail. Here, the mean < median < mode.

51
Q

How many and what kind of variable would you use with a boxplot

A

One scale (aka continuous) variable, optionally split by one categorical variable

52
Q

What is the sampling frame

A

The actual list of individuals that the sample will be drawn from
Ideally it should be the entire target population

53
Q

What is an undercoverage bias

A

The undercoverage bias is a bias that occurs when some members of the population are inadequately represented in the sample.

54
Q

What are 4 types of non-probability (non-random) sampling

A

Convenience sampling
Quota sampling
Judgement sampling
Snowball sampling

55
Q

What is cluster sampling

A

The population is divided into subgroups (clusters) and entire clusters are randomly selected

56
Q

What is Bessel’s correction

A

Bessel’s correction is the use of n − 1 instead of n as the divisor when estimating a population’s variance or standard deviation from a sample. It makes the estimate less biased, thereby providing more accurate results.
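
The effect of the correction can be sketched with Python's standard library, where `pvariance` divides by n and `variance` divides by n − 1. The data values are made up.

```python
from statistics import pvariance, variance

# Hypothetical sample of 8 values with mean 5
sample = [2, 4, 4, 4, 5, 5, 7, 9]

uncorrected = pvariance(sample)  # divides the squared deviations by n
corrected = variance(sample)     # divides by n - 1 (Bessel's correction)

print(uncorrected, round(corrected, 3))
```

The corrected figure is always slightly larger, which offsets the tendency of a sample to understate the spread of the population it came from.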

57
Q

What is meant by mean imputation for missing data

A

Mean imputation is a practice where missing (null) values in a dataset are replaced directly with the mean of the observed values for that variable.

58
Q

What is the purpose of a boxplot

A

To compare the spread of values

59
Q

What are five potential causes of bias in sampling

A

Pre-arranged sample rules are deviated from

People in hard to reach groups are omitted

Selected individuals are replaced with others, for example if they are hard to contact

Low response rates (eg from specific groups)

An out-of-date list is used as the sampling frame (eg if it excludes people who have moved to a new area)

60
Q

What is simple random sampling

A

A sampling method where every member of the population has an equal chance of being selected

61
Q

What is the relationship between mean and median in a normal distribution

A

In a normal distribution, the mean is equal to the median. Comparing a dataset’s mean and median gives a quick check: a large gap suggests the data is skewed, but equal values alone do not prove the distribution is normal

62
Q

What specific test would you use when working with a skewed categorical dependent variable

A

Chi-squared test

63
Q

What are the two overall variable types

A

Categorical

Continuous

64
Q

How many and what kind of variable would you use with a histogram

A

One scale (aka continuous) variable

65
Q

What specific test would you use when comparing the averages of two independent groups when the data is NOT normally distributed

A

Mann-Whitney test

66
Q

What is volunteer sampling

A

Based on ease of access but people volunteer for the sample

67
Q

What are 5 types of selection bias

A

Observer selection
Attrition
Protopathic bias
Time intervals
Sampling bias

68
Q

What specific test would you use when comparing average difference between paired (matched) samples e.g. weight before and after a diet for data that is normally distributed

A

Paired t test

69
Q

In a scatter diagram, what is the vertical distance between a data point and the regression line called?

A

The vertical distance between a data point and the regression line in a scatter diagram is called the residual, or the prediction error.
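
As a rough sketch, a residual can be computed as the observed value minus the value predicted by the regression line. The example assumes an ordinary least-squares line; the function names and data points are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


def residuals(xs, ys):
    """Observed y minus predicted y for every data point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data points
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print([round(r, 2) for r in residuals(xs, ys)])
```

Points above the line have positive residuals and points below it have negative ones; for a least-squares line with an intercept, the residuals sum to zero.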

70
Q

What kind of summarising statistics would you get from a means plot

A

Mean

71
Q

What kind of summarising statistics would you get from a scatter graph

A

Correlation coefficient

72
Q

What kind of summarising statistics would you get from a line chart

A

Means by time point

73
Q

What are discrete scale variables

A

Numerical variables that can only take countable values (typically integers)

Eg number of children

74
Q

What is the purpose of a stacked bar chart

A

To compare proportions within groups

75
Q

What does a p value actually show

A

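The probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true

A value under .05 is conventionally treated as statistically significant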

76
Q

What are three times when outliers would be kept in the data

A

Results are critical
Outliers add meaning to the data
The data is highly skewed

77
Q

What is convenience sampling

A

Using a sample of the most accessible participants

78
Q

What is symmetric distribution

A

Symmetric distribution means that the data on the left side of the median mirrors the data on the right side of the median

79
Q

What does a chi squared test show

A

Goodness of fit - whether the differences between expected and observed numbers are larger than would be expected by chance alone
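
The test statistic itself is simple to compute: sum (observed − expected)² / expected over the categories. The sketch below uses invented counts and compares the result with 5.991, the standard table value for 2 degrees of freedom at the .05 level. It stops short of calculating a p value.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected across all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts across three categories
observed = [50, 30, 20]
expected = [40, 40, 20]

statistic = chi_squared_statistic(observed, expected)

# Critical value for df = 3 - 1 = 2 at the .05 level, from standard tables
CRITICAL_VALUE_DF2_ALPHA_05 = 5.991
significant = statistic > CRITICAL_VALUE_DF2_ALPHA_05
print(statistic, significant)
```

Here the statistic is 5.0, which is below the critical value, so the differences would not be treated as significant at the .05 level.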

80
Q

What specific test would you use when comparing the averages of three or more independent groups when the data is not normally distributed

A

Kruskal-Wallis test

81
Q

What is exploratory data analysis

A

The process of performing investigations on data to understand the data better

Initial investigations are done to determine patterns, spot abnormalities, test hypotheses, and also to check if the assumptions are right.

82
Q

What is systematic sampling

A

A sampling method where every member of the population is given a number and members are selected at regular intervals
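
A minimal sketch of the procedure, assuming a random starting point within the first interval. The function name `systematic_sample` and the population of 100 numbered members are illustrative only.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Pick a random start, then take every k-th member of the numbered list."""
    if not 0 < sample_size <= len(population):
        raise ValueError("sample_size must be between 1 and the population size")
    rng = random.Random(seed)
    interval = len(population) // sample_size  # the sampling interval k
    start = rng.randrange(interval)            # random start within the first interval
    return population[start::interval][:sample_size]


# Hypothetical population numbered 1 to 100; select 10 members
population = list(range(1, 101))
chosen = systematic_sample(population, sample_size=10, seed=42)
print(chosen)
```

With 100 members and a sample of 10, the interval is 10, so every tenth member after the starting point is selected.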

83
Q

What are nominal variables

A

Categorical variables with no clear order

Eg gender, hair colour

84
Q

What overall method of test would you use when working with a normally distributed categorical dependent variable

A

Non-parametric test

85
Q

What test would be used to compare the relationship between two continuous variables when the data is normally distributed

A

Pearson’s r correlation coefficient

86
Q

What is an outlier

A

Outliers are data points that differ substantially from the other observations in the dataset. Depending on the learning process, an outlier can worsen the accuracy of a model and sharply decrease its efficiency.

87
Q

What test would be used to compare the relationship between two continuous variables when the data is not normally distributed

A

Spearman’s rank correlation coefficient

88
Q

What are the two overall types of sampling

A

Random (probability)

Non random (non-probability)

89
Q

What is the purpose of a line chart

A

Displays changes over time

Comparison of groups

90
Q

What is the difference between descriptive and inferential statistics

A

Descriptive statistics: Descriptive statistics are used to summarise a sample of data, using measures like the standard deviation or the mean.

Inferential statistics: Inferential statistics are used to draw conclusions about a population from sample data that is subject to random variation.

91
Q

How is the statistical significance of an insight (idea) assessed

A

Hypothesis testing - the null and alternative hypothesis are stated and the p value is found

92
Q

What are 4 types of probability (random) sampling

A

Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling

93
Q

What specific test would you use when comparing average difference between paired (matched) samples e.g. weight before and after a diet for data that is not normally distributed

A

Wilcoxon signed rank test

94
Q

What is kurtosis

A

Kurtosis describes how heavy the tails of a distribution are compared with a normal distribution. It is effectively a measure of the outliers present in the distribution. A high value of kurtosis indicates that a large number of outliers are present in the data.

95
Q

How many and what kind of variable would you use with a pie chart

A

One categorical variable