New Flashcards

Question 1

Q

What methods should you use to summarise ordinal categorical data

Answer

A

Median

Interquartile range

Question 2

Q

What is the purpose of a pie chart

Answer

A

To show frequencies/proportions/percentages

Question 3

Q

What is purposive sampling

Answer

A

Sampling when the researcher uses their expertise to choose a sample that is most useful for the purpose of the research

Question 4

Q

How many and what kind of variable would you use with a means plot

Answer

A

One scale (aka continuous) variable and two categorical variables

Question 5

Q

What is root cause analysis

Answer

A

It is a method used to solve problems by first identifying the root cause of the problem.

Question 6

Q

What methods should you use to summarise continuous normally distributed data

Answer

A

Mean

Standard deviation

Question 7

Q

For skewness and kurtosis, what numbers should the score be between to show the data is not too skewed or heavy-tailed

Answer

A

+1 and -1

Question 8

Q

How do the standard error and the margin of error relate

Answer

A

As the standard error increases, the margin of error also increases.
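
The link between the two can be sketched in code. This is a minimal illustration that assumes the usual normal-approximation formula (margin of error = critical z value × standard error); the function name `margin_of_error` and the example numbers are hypothetical.

```python
from statistics import NormalDist

def margin_of_error(standard_error: float, confidence: float = 0.95) -> float:
    """Margin of error under a normal approximation: z critical value x standard error."""
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * standard_error

small = margin_of_error(0.5)   # smaller standard error
large = margin_of_error(1.0)   # standard error doubled
print(small, large)            # the margin of error doubles as well
```

For a fixed confidence level, the margin of error is directly proportional to the standard error.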

Question 9

Q

What overall method of test would you use when working with a skewed continuous dependent variable

Answer

A

Non-parametric test

Question 10

Q

What specific test would you use when comparing three or more measurements on the same subject when the data is not normally distributed

Answer

A

Friedman test

Question 11

Q

What is stratified sampling

Answer

A

The population is divided into subpopulations (strata) with key differences, eg gender, age, and a random sample is then drawn from each stratum
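
A minimal sketch of stratified sampling, assuming proportional allocation (the same fraction is drawn at random from every stratum). The function name `stratified_sample`, the `age_group` field, and the made-up population are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(population, key, fraction, seed=0):
    """Draw the same fraction at random from each stratum defined by key()."""
    rng = random.Random(seed)          # seeded only so the example is repeatable
    strata = defaultdict(list)
    for person in population:
        strata[key(person)].append(person)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 25 people aged 40+ and 75 people under 40.
people = [{"id": i, "age_group": "40+" if i % 4 == 0 else "under 40"} for i in range(100)]
chosen = stratified_sample(people, key=lambda p: p["age_group"], fraction=0.2)
print(len(chosen))  # 20 people: 5 from the 40+ stratum and 15 from the under-40 stratum
```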

Question 12

Q

What is the purpose of a means plot

Answer

A

Looks at the combined effect of two categorical variables on the mean of one scale variable

Question 13

Q

What methods are used to determine outliers

Answer

A

Standard deviation / z-score

Interquartile range
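
Both methods can be sketched with the standard library. The thresholds here are assumptions based on common rules of thumb (a z-score cut-off and 1.5 × IQR fences), and the function names and data are hypothetical.

```python
from statistics import mean, stdev, quantiles

def zscore_outliers(data, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) / s > threshold]

def iqr_outliers(data, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 50]
z_flagged = zscore_outliers(values, threshold=2.5)  # looser cut-off for a small sample
iqr_flagged = iqr_outliers(values)
print(z_flagged, iqr_flagged)  # both flag the value 50
```

In small samples an extreme value inflates the standard deviation, so the z-score method can miss outliers that the IQR method catches.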

Question 14

Q

Generally, when can ordinal data be analysed with parametric tests

Answer

A

When there are 7 or more categories and the data is approximately normally distributed

Question 15

Q

Why is mean imputation considered bad

Answer

A

It ignores the correlation between features. It also artificially reduces the variance of the data and introduces bias, which lowers the accuracy of a model and makes confidence intervals too narrow.
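
The loss of variance can be shown with a small sketch. The observed values and the number of missing entries are made up for illustration.

```python
from statistics import mean, variance

observed = [4.0, 6.0, 8.0, 10.0, 12.0]  # values that were actually recorded
n_missing = 3                            # hypothetical number of missing entries

fill = mean(observed)                    # every gap is replaced by the same value
imputed = observed + [fill] * n_missing

var_before = variance(observed)
var_after = variance(imputed)
print(var_before, var_after)  # the variance drops after imputation
```

Every imputed value sits exactly on the mean, so the spread shrinks while the sample size appears larger, which is why the resulting confidence intervals look narrower than they should.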

Question 16

Q

What specific test would you use when comparing the averages of three or more independent groups when the data is normally distributed

Answer

A

One way ANOVA

Question 17

Q

What is the meaning of covariance

Answer

A

Covariance measures how two random variables vary together. It shows the direction of the linear relationship between the pair: positive when they tend to increase together, negative when one tends to decrease as the other increases.
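
A minimal sketch of the sample covariance calculation (dividing by n − 1). The function name and the three made-up variables are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance: average product of deviations from the means, using n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]   # rises as hours rise
errors = [9, 7, 6, 4, 2]       # falls as hours rise

cov_up = sample_covariance(hours, score)
cov_down = sample_covariance(hours, errors)
print(cov_up, cov_down)  # positive, then negative
```

The sign gives the direction of the relationship; the size depends on the units of the variables, which is why the correlation coefficient is used to judge strength.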

Question 18

Q

What is observational data

Answer

A

Observational data is data obtained from observational studies, where variables are observed without intervention to see if there is any correlation between them

Question 19

Q

How do you find degrees of freedom

Answer

A

The number of independent values (observations) minus the number of parameters estimated from them, eg n − 1 for a single sample

Question 20

Q

What overall method of test would you use when working with a normally distributed continuous dependent variable

Answer

A

Parametric test

Question 21

Q

What is selection bias

Answer

A

Selection bias occurs when individuals or groups are selected for a study in a way that is not random, so the sample does not represent the population

Question 22

Q

What are ordinal variables

Answer

A

Categorical variables with an obvious order

Eg most - least likely

Question 23

Q

What are continuous scale variables

Answer

A

Variables that can take any value within a range

Eg height

Question 24

Q

What is the purpose of a scatter graph

Answer

A

Shows the relationship between two variables and helps detect outliers

Question 25

Q

What is the purpose of a histogram

Answer

A

To show the distribution of results

Question 26

Q

What specific test would you use when comparing three or more measurements on the same subject when the data is normally distributed

Answer

A

Repeated measures ANOVA

Question 27

Q

What is the relationship between the confidence level and the significance level in statistics?

Answer

A

The significance level (alpha) is the probability of rejecting the null hypothesis when it is actually true.
The confidence level is the proportion of confidence intervals, computed from repeated samples, that would contain the true population value.

Both significance and confidence level are related by the following formula:

Significance level = 1 − Confidence level

Question 28

Q

How many and what kind of variable would you use with a scatter graph

Answer

A

Two scale (aka continuous) variables

Question 29

Q

When would it be better to use the median than the mean to study data

Answer

A

When there are a lot of outliers that can positively or negatively skew data

Question 30

Q

What is a survivorship bias

Answer

A

The survivorship bias is the flaw of the sample selection that occurs when a dataset only considers the ‘surviving’ or existing observations and fails to consider those observations that have already ceased to exist.

Question 31

Q

What methods should you use to summarise nominal categorical data

Question 32

Q

What are the two types of scale variables

Answer

A

Continuous

Discrete

Question 33

Q

What are 5 ways of handling missing data

Answer

A

Winsorizing the data
Prediction of missing values
Deletion of rows with missing data
Mean/median imputation

Question 34

Q

What are the two main types of categorical variables

Answer

A

Ordinal

Nominal

Question 35

Q

What is the central limit theorem

Answer

A

The central limit theorem states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution
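
A small simulation can illustrate the theorem. This sketch assumes a right-skewed exponential population with mean 1; the function name `sample_means`, the sample sizes, and the seed are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples=2000):
    """Means of repeated samples drawn from a skewed exponential population (mean 1)."""
    return [
        sum(random.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

means_small = sample_means(2)    # tiny samples: the means are still widely spread and skewed
means_large = sample_means(50)   # larger samples: the means cluster tightly around 1

print(round(mean(means_large), 2))
print(round(stdev(means_small), 2), round(stdev(means_large), 2))
```

Plotting a histogram of `means_large` would show a roughly bell-shaped curve even though the population itself is strongly skewed.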

Question 36

Q

What are right skewed distributions

Answer

A

A right-skewed distribution is one where the right tail is longer than the left one. Here, typically, the mean > median > mode.
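
The ordering can be checked on a small made-up dataset with a long right tail (the `incomes` values are hypothetical).

```python
from statistics import mean, median, mode

# Most values are low, with a few large ones stretching the right tail.
incomes = [20, 22, 22, 22, 25, 27, 30, 35, 60, 120]

m, med, mo = mean(incomes), median(incomes), mode(incomes)
print(m, med, mo)  # the mean is pulled furthest towards the tail
```

The large values drag the mean upwards, while the median and mode stay near the bulk of the data.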

Question 37

Q

What specific test would be used for assessing the relationship between two categorical variables when the data is not normally distributed

Answer

A

Chi-Squared test

Question 38

Q

What kind of summarising statistics would you get from a pie chart

Answer

A

Class percentages

Question 39

Q

How many and what kind of variable would you use with a stacked bar chart

Answer

A

Two categorical variables

Question 40

Q

What methods should you use to summarise skewed data or data with influential outliers

Answer

A

Median

Interquartile range

Question 41

Q

What is experimental data

Answer

A

Experimental data is derived from experimental studies, where the researcher changes one or more variables while holding others constant to see the effect on the outcome.

Question 42

Q

How many and what kind of variable would you use with a line chart

Answer

A

One scale variable measured over time

Question 43

Q

What kind of summarising statistics would you get from a boxplot

Answer

A

Median

Interquartile range

Question 44

Q

What kind of summarising statistics would you get from a histogram

Answer

A

Mean and standard deviation

Question 45

Q

What are three types of symmetric distribution

Answer

A

Uniform distribution
Normal distribution
Binomial distribution (when p = 0.5)

Question 46

Q

What is snowball sampling

Answer

A

When participants are hard to access, participants are recruited through other participants

Question 47

Q

What specific test would you use when comparing the averages of two independent groups when the data is normally distributed

Answer

A

An independent t test

Question 48

Q

What kind of summarising statistics would you get from a stacked bar chart

Answer

A

Percentages within groups

Question 49

Q

What is a long tailed distribution

Answer

A

A type of distribution where the tail drops off gradually toward the end of the curve

Question 50

Q

What are left skewed distributions

Answer

A

A left-skewed distribution is one where the left tail is longer than the right tail. Here, typically, the mean < median < mode.

Question 51

Q

How many and what kind of variable would you use with a boxplot

Answer

A

One scale (aka continuous) variable, optionally split by one categorical variable

Question 52

Q

What is the sampling frame

Answer

A

The actual list of individuals that the sample will be drawn from
Ideally it should be the entire target population

Question 53

Q

What is an undercoverage bias

Answer

A

The undercoverage bias is a bias that occurs when some members of the population are inadequately represented in the sample.

Question 54

Q

What are 4 types of non-probability (non-random) sampling

Answer

A

Convenience sampling
Quota sampling
Judgement sampling
Snowball sampling

Question 55

Q

What is cluster sampling

Answer

A

The population is divided into subgroups (clusters) and entire clusters are randomly selected

Question 56

Q

What is Bessel’s correction

Answer

A

Bessel’s correction is the use of n − 1 instead of n in the denominator when estimating a population’s variance or standard deviation from a sample. It makes the estimate less biased, thereby providing more accurate results.
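
The effect of the correction can be seen by computing a variance both ways. The sample values are made up; `pvariance` and `variance` come from Python's standard `statistics` module and are used only as a cross-check.

```python
from statistics import pvariance, variance

sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)
m = sum(sample) / n
sum_sq = sum((x - m) ** 2 for x in sample)  # sum of squared deviations from the mean

biased = sum_sq / n          # divides by n: tends to underestimate the population variance
corrected = sum_sq / (n - 1) # Bessel's correction: divides by n - 1

print(biased, corrected)
print(pvariance(sample), variance(sample))  # the library versions agree with the two above
```

The corrected estimate is always slightly larger, and the gap shrinks as the sample size grows.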

Question 57

Q

What is meant by mean imputation for missing data

Answer

A

Mean imputation is a simple practice where missing (null) values in a variable are replaced directly with the mean of that variable’s observed values.

Question 58

Q

What is the purpose of a boxplot

Answer

A

To compare the spread of values

Question 59

Q

What are five potential causes of bias in sampling

Answer

A

Pre-arranged sample rules are deviated from

People in hard to reach groups are omitted

Selected individuals are replaced with others, for example if they are hard to contact

Low response rates (eg from specific groups)

An out-of-date list is used as the sampling frame (eg if it excludes people who have moved to a new area)

Question 60

Q

What is simple random sampling

Answer

A

A sampling method in which every member of the population has an equal chance of being selected

Question 61

Q

What is the relationship between mean and median in a normal distribution

Answer

A

In a normal distribution, the mean is equal to the median. Comparing a dataset’s mean and median gives a quick check for symmetry, but equal values alone do not prove that the distribution is normal

Question 62

Q

What specific test would you use when working with a skewed categorical dependent variable

Answer

A

Chi-squared test

Question 63

Q

What are the two overall variable types

Answer

A

Categorical

Continuous

Question 64

Q

How many and what kind of variable would you use with a histogram

Answer

A

One scale (aka continuous) variable

Answer 64

A

Mann-Whitney test

Answer 65

A

Based on ease of access but people volunteer for the sample

Answer 66

A

Observer selection 
Attrition 
Protopathic bias
Time intervals 
Sampling bias

Answer 67

A

Paired t test

Answer 68

A

The vertical distance between a data point and the regression line in a scatter diagram is called the residual, or the prediction error.
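
A minimal sketch of how residuals are computed, assuming an ordinary least-squares line fitted by hand. The function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
# Residual = observed value minus the value predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(slope, intercept)
print(residuals)  # positive for points above the line, negative for points below
```

For a least-squares line with an intercept, the residuals always sum to zero.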

Answer 69

A

Correlation coefficient

Answer 70

A

Means by time point

Answer 71

A

Countable numerical variables (whole numbers)

Eg number of children

Answer 72

A

To compare proportions within groups

Answer 73

A

The probability of getting a result at least as extreme as the one observed if the null hypothesis is true, ie if only chance were at work

Generally want it to be under .05

Answer 74

A

Results are critical
Outliers add meaning to the data
The data is highly skewed

Answer 75

A

Using a sample of the most accessible participants

Answer 76

A

Symmetric distribution means that the data on the left side of the median mirrors the data on the right side of the median

Answer 77

A

Goodness of fit - the probability that any differences between expected and observed numbers are due to chance

Answer 78

A

Kruskal-Wallis test

Answer 79

A

The process of performing investigations on data to understand the data better

Initial investigations are done to determine patterns, spot abnormalities, test hypotheses, and also to check if the assumptions are right.

Answer 80

A

A sampling method in which every member of the population is given a number and members are selected at regular intervals
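
A minimal sketch of the idea, assuming a random starting point within the first interval. The function name `systematic_sample`, the step size, and the numbered population are hypothetical.

```python
import random

def systematic_sample(population, step, seed=0):
    """Pick a random start within the first `step` members, then every step-th member."""
    rng = random.Random(seed)      # seeded only so the example is repeatable
    start = rng.randrange(step)
    return population[start::step]

members = list(range(1, 101))      # population numbered 1 to 100
picked = systematic_sample(members, step=10)
print(picked)  # ten members, each 10 places apart
```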

Answer 81

A

Categorical variables with no clear order

Eg gender, hair colour

Answer 82

A

Non-parametric test

Answer 83

A

Pearson’s r correlation coefficient

Answer 84

A

Outliers are data points that vary in a large way when compared to other observations in the dataset. Depending on the learning process, an outlier can worsen the accuracy of a model and decrease its efficiency sharply.

Answer 85

A

Spearman’s rank correlation coefficient

Answer 86

A

Random (probability)

Non random (non-probability)

Answer 87

A

Displays changes over time

Comparison of groups

Answer 88

A

Descriptive statistics: Descriptive statistics summarise a sample of data, using measures such as the mean or the standard deviation.

Inferential statistics: Inferential statistics are used to draw conclusions about a population from sample data that is subject to random variation.

Answer 89

A

Hypothesis testing - the null and alternative hypothesis are stated and the p value is found

Answer 90

A

Simple random sampling
Systematic sampling
Stratified sampling
Clustered sampling

Answer 91

A

Wilcoxon signed rank test

Answer 92

A

Kurtosis describes how heavy the tails of a distribution are compared with a normal distribution. It reflects the extreme values (outliers) present in the distribution. A high value of kurtosis indicates heavy tails, meaning more outliers are present in the data.
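
A minimal sketch of the calculation, assuming the simple moment-based (population) definition of excess kurtosis, where a normal distribution scores 0. Statistical packages often apply small-sample adjustments, so their output can differ slightly. The function name and datasets are hypothetical.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: fourth moment / squared variance, minus 3."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n  # second central moment (variance)
    m4 = sum((x - m) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]                   # evenly spread, no extreme values
heavy = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, -20, 30]      # mostly identical, two extreme values

print(excess_kurtosis(flat))   # negative: lighter tails than a normal distribution
print(excess_kurtosis(heavy))  # positive: heavy tails caused by the outliers
```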

Answer 93

A

One categorical variable