New Flashcards
What methods should you use to summarise ordinal categorical data
Median
Interquartile range
What is the purpose of a pie chart
To show frequencies/proportions/percentages
What is purposive sampling
Sampling when the researcher uses their expertise to choose a sample that is most useful for the purpose of the research
How many and what kind of variable would you use with a means plot
One scale (aka continuous) variable and two categorical variables
What is root cause analysis
It is a method used to solve problems by first identifying the root cause of the problem.
What methods should you use to summarise continuous normally distributed data
Mean
Standard deviation
What range should the kurtosis score fall within to show the data is not too far from a normal distribution
Between -1 and +1
How do the standard error and the margin of error relate
As the standard error increases, the margin of error also increases.
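A minimal sketch of the relationship above, assuming the usual normal-approximation formula (margin of error = z × standard error). The two samples and the 95% confidence level are made up for illustration.

```python
import math
import statistics

def margin_of_error(sample, confidence=0.95):
    """Return (standard error, margin of error) for the sample mean.

    Assumes a normal approximation: margin of error = z * standard error.
    """
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    # Two-sided critical value (about 1.96 for 95% confidence)
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return se, z * se

tight = [10, 11, 9, 10, 10, 11, 9, 10]   # hypothetical low-spread sample
spread = [2, 18, 5, 15, 10, 20, 1, 9]    # hypothetical high-spread sample

se_tight, moe_tight = margin_of_error(tight)
se_spread, moe_spread = margin_of_error(spread)
print(se_tight, moe_tight)
print(se_spread, moe_spread)
```

The sample with the larger standard error also has the larger margin of error, because the margin is just the standard error scaled by a fixed critical value.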
What overall method of test would you use when working with a skewed continuous dependent variable
Non-parametric test
What specific test would you use when comparing three or more measurements on the same subject when the data is not normally distributed
Friedman test
What is stratified sampling
The population is divided into subpopulations (strata) that differ in key ways, eg gender or age, and a random sample is then drawn from each stratum
What is the purpose of a means plot
Looks at the combined effect of two categorical variables on the mean of one scale variable
What methods are used to determine outliers
Standard deviation/ z score
Interquartile range
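Both methods can be sketched in a few lines. The cut-offs used here (more than 3 standard deviations from the mean, and more than 1.5 × IQR beyond the quartiles) are common conventions assumed for illustration, and the data is invented.

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Values whose z score is larger than the threshold in absolute size."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > threshold]

def iqr_outliers(data, k=1.5):
    """Values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

# Hypothetical measurements with one suspicious value (60)
values = [12, 13, 12, 14, 13, 15, 14, 13, 12, 14,
          13, 12, 14, 13, 15, 13, 14, 12, 13, 60]

print(zscore_outliers(values))
print(iqr_outliers(values))
```

The z score method relies on the mean and standard deviation, which are themselves pulled around by outliers, so the IQR method is often the safer choice for small or skewed samples.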
Generally, when can ordinal data be analysed with parametric tests
When there are 7 or more categories and the data is approximately normally distributed
Why is mean imputation considered bad
It ignores the correlations between features. It also artificially lowers the variance of the data and increases bias, which reduces the accuracy of the model and makes confidence intervals too narrow.
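The variance problem can be seen in a small sketch. The data is invented and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical data; None marks a missing value
observed = [4.0, 7.0, None, 10.0, None, 13.0, 16.0]

known = [x for x in observed if x is not None]
fill = statistics.mean(known)                      # mean of the observed values
imputed = [fill if x is None else x for x in observed]

var_known = statistics.variance(known)             # spread of the real observations
var_imputed = statistics.variance(imputed)         # spread after mean imputation

print(var_known, var_imputed)
```

Every imputed value sits exactly on the mean, so it adds to the sample size without adding any spread, and the variance shrinks.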
What specific test would you use when comparing the averages of three or more independent groups when the data is normally distributed
One way ANOVA
What is the meaning of covariance
Covariance measures how two random variables vary together. It shows whether a change in one variable tends to be accompanied by a change in the other, and in which direction.
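A minimal sketch of the sample covariance, using the standard formula (sum of the products of deviations, divided by n - 1). The variable names and numbers are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours_studied = [1, 2, 3, 4, 5]        # hypothetical data
exam_score = [52, 55, 61, 64, 70]      # rises as hours rise
tiredness = [9, 7, 6, 4, 2]            # falls as hours rise

cov_score = sample_covariance(hours_studied, exam_score)
cov_tired = sample_covariance(hours_studied, tiredness)
print(cov_score, cov_tired)
```

A positive covariance means the two variables tend to move in the same direction, and a negative one means they tend to move in opposite directions.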
What is observational data
Observational data is data obtained from observational studies, where variables are observed without any intervention to see if there is any correlation between them
How do you find degrees of freedom
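The number of independent values that are free to vary: the number of observations minus the number of parameters estimated, eg n - 1 for a single sample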
What overall method of test would you use when working with a normally distributed continuous dependent variable
Parametric test
What is selection bias
Selection bias occurs when individuals or groups of data are selected in a way that is not random, so the sample does not represent the population
What are ordinal variables
Categorical variables with an obvious order
Eg most - least likely
What are continuous scale variables
Variables that can take any value
Eg height
What is the purpose of a scatter graph
Shows the relationship between two variables and helps detect outliers
What is the purpose of a histogram
To show the distribution of results
What specific test would you use when comparing three or more measurements on the same subject when the data is normally distributed
Repeated measures ANOVA
What is the relationship between the confidence level and the significance level in statistics?
The significance level (alpha) is the probability of rejecting the null hypothesis when it is actually true.
The confidence level is the proportion of confidence intervals, calculated from repeated samples, that would be expected to contain the true population value.
Both significance and confidence level are related by the following formula:
Significance level = 1 − Confidence level
How many and what kind of variable would you use with a scatter graph
Two scale (aka continuous) variables
When would it be better to use the median than the mean to study data
When there are a lot of outliers that can positively or negatively skew data
What is a survivorship bias
Survivorship bias is a flaw in sample selection that occurs when a dataset only considers the ‘surviving’ or existing observations and fails to consider those observations that have already ceased to exist.
What methods should you use to summarise nominal categorical data
Mode
What are the two types of scale variables
Continuous
Discrete
What are 5 ways of handling missing data
Winsorizing the data
Prediction of missing values
Deletion of rows with missing data
Mean imputation
Median imputation
What are the two main types of categorical variables
Ordinal
Nominal
What is the central limit theorem
The central limit theorem states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution
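A small simulation can illustrate the theorem. This sketch assumes an exponential population (strongly right-skewed) and uses skewness as a rough measure of how far the distribution of sample means is from symmetric. The sample sizes, repeat count and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(sample_size, repeats=2000):
    """Means of `repeats` samples drawn from a right-skewed exponential population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(repeats)
    ]

def skewness(data):
    """Third standardised moment: 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum((x - m) ** 3 for x in data) / (len(data) * sd ** 3)

means_small = sample_means(2)    # tiny samples: means stay clearly skewed
means_large = sample_means(50)   # larger samples: means look much more normal

skew_small = skewness(means_small)
skew_large = skewness(means_large)
print(skew_small, skew_large)
```

The population itself never changes shape. Only the distribution of the sample means becomes more symmetric and bell-shaped as the sample size grows.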
What are right skewed distributions
A right-skewed distribution is one where the right tail is longer than the left one. Typically, the mean > median > mode.
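A quick check of that ordering on invented data with a long right tail (one very large value):

```python
import statistics

# Hypothetical weekly spend with one very large value forming the right tail
spend = [20, 20, 20, 22, 25, 28, 30, 95]

mean = statistics.mean(spend)
median = statistics.median(spend)
mode = statistics.mode(spend)

print(mean, median, mode)
```

The single large value drags the mean upwards, moves the median only slightly, and leaves the mode untouched.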
What specific test would be used for assessing the relationship between two categorical variables when the data is not normally distributed
Chi-squared test
What kind of summarising statistics would you get from a pie chart
Class percentages
How many and what kind of variable would you use with a stacked bar chart
Two categorical variables
What methods should you use to summarise skewed data or data with influential outliers
Median
Interquartile range
What is experimental data
Experimental data is derived from experimental studies, where certain variables are held constant while others are deliberately changed to see what effect this has on the outcome.
How many and what kind of variable would you use with a line chart
One scale variable measured over time
What kind of summarising statistics would you get from a boxplot
Median
Interquartile range
What kind of summarising statistics would you get from a histogram
Mean and standard deviation
What are three types of symmetric distribution
Uniform distribution
Normal distribution
Binomial distribution (when the probability of success is 0.5)
What is snowball sampling
When participants are hard to access, participants are recruited through other participants
What specific test would you use when comparing the averages of two independent groups when the data is normally distributed
An independent t test
What kind of summarising statistics would you get from a stacked bar chart
Percentages within groups
What is a long tailed distribution
A type of distribution where the tail drops off gradually toward the end of the curve
What are left skewed distributions
A left-skewed distribution is one where the left tail is longer than the right tail. Typically, the mean < median < mode.
How many and what kind of variable would you use with a boxplot
One scale (aka continuous) variable, optionally split by one categorical variable
What is the sampling frame
The actual list of individuals that the sample will be drawn from
Ideally it should be the entire target population
What is an undercoverage bias
Undercoverage bias occurs when some members of the population are inadequately represented in the sample.
What are 4 types of non-probability (non-random) sampling
Convenience sampling
Quota sampling
Judgement sampling
Snowball sampling
What is cluster sampling
The population is divided into subgroups (clusters) and entire clusters are randomly selected
What is Bessel’s correction
Bessel’s correction is the use of n - 1 instead of n in the denominator when estimating a population’s variance or standard deviation from a sample. It makes the estimate less biased, thereby providing more accurate results.
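A minimal sketch of the correction, written with a hypothetical helper `variance_with_divisor` so the change in the denominator is visible. The sample values are invented.

```python
import math
import statistics

def variance_with_divisor(data, ddof):
    """Sum of squared deviations divided by (n - ddof).

    ddof=0 divides by n (no correction); ddof=1 applies Bessel's correction.
    """
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - ddof)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

uncorrected_sd = math.sqrt(variance_with_divisor(sample, ddof=0))
corrected_sd = math.sqrt(variance_with_divisor(sample, ddof=1))

print(uncorrected_sd, corrected_sd)
# The standard library uses the same two conventions:
print(statistics.pstdev(sample), statistics.stdev(sample))
```

Dividing by n - 1 always gives a slightly larger value than dividing by n, which compensates for the sample mean sitting closer to the sample points than the true population mean does.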
What is meant by mean imputation for missing data
Mean imputation is a rarely used practice where null values in a dataset are replaced directly with the corresponding mean of the data.
What is the purpose of a boxplot
To compare the spread of values
What are five potential causes of bias in sampling
Pre-arranged sample rules are deviated from
People in hard to reach groups are omitted
Selected individuals are replaced with others, for example if they are hard to contact
Low response rates (eg from specific groups)
An out-of-date list is used as the sampling frame (eg if it excludes people who have moved to a new area)
What is simple random sampling
A sampling method in which every member of the population has an equal chance of being selected
What is the relationship between mean and median in a normal distribution
In a normal distribution, the mean is equal to the median. Comparing a dataset’s mean and median is a quick check for symmetry, but equal values alone do not prove that the distribution is normal
What specific test would you use when working with a skewed categorical dependent variable
Chi-squared test
What are the two overall variable types
Categorical
Scale (continuous or discrete)
How many and what kind of variable would you use with a histogram
One scale (aka continuous) variable
What specific test would you use when comparing the averages of two independent groups when the data is NOT normally distributed
Mann-Whitney test
What is volunteer sampling
Based on ease of access but people volunteer for the sample
What are 5 types of selection bias
Observer selection
Attrition
Protopathic bias
Time intervals
Sampling bias
What specific test would you use when comparing average difference between paired (matched) samples e.g. weight before and after a diet for data that is normally distributed
Paired t test
In a scatter diagram, what is the line that is drawn above or below the regression line called?
It is called the residual, also known as the prediction error. It is the vertical distance between a data point and the regression line, and the point can lie above or below the line.
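Residuals can be computed directly once a line has been fitted. This sketch assumes an ordinary least squares line and uses invented data points.

```python
def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for a straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]               # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

# Residual = observed value minus the value predicted by the line.
# Positive: the point is above the line. Negative: it is below.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(residuals)
```

For a least squares line with an intercept, the residuals always sum to zero (up to rounding), so points above the line are balanced by points below it.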
What kind of summarising statistics would you get from a means plot
Mean
What kind of summarising statistics would you get from a scatter graph
Correlation coefficient
What kind of summarising statistics would you get from a line chart
Means by time point
What are discrete scale variables
Numerical variables that can only take separate, countable values (usually integers)
Eg number of children
What is the purpose of a stacked bar chart
To compare proportions within groups
What does a p value actually show
The probability of getting a result at least as extreme as the one observed if the null hypothesis is true, ie if only chance is at work
Generally you want it to be under .05
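One way to make the p value concrete is a permutation test, sketched here as an illustration rather than as the method any particular package uses. It shuffles the group labels many times and counts how often a difference in means at least as large as the observed one appears when there is no real group effect. The two groups, the number of shuffles and the seed are all invented.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

control = [68, 70, 72, 71, 69, 73, 70, 71]   # hypothetical scores
treated = [75, 78, 74, 77, 79, 76, 78, 75]

observed = abs(statistics.fmean(treated) - statistics.fmean(control))

pooled = control + treated
n_shuffles = 5000
at_least_as_extreme = 0
for _ in range(n_shuffles):
    random.shuffle(pooled)  # break any link between score and group label
    diff = abs(statistics.fmean(pooled[:8]) - statistics.fmean(pooled[8:]))
    if diff >= observed - 1e-12:  # small tolerance for floating point ties
        at_least_as_extreme += 1

# Add 1 to both counts so the estimate is never exactly zero
p_value = (at_least_as_extreme + 1) / (n_shuffles + 1)
print(p_value)
```

A small p value here means that shuffled labels almost never produce a gap as large as the real one, so chance alone is an unlikely explanation for it.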
What are three times when outliers would be kept in the data
Results are critical
Outliers add meaning to the data
The data is highly skewed
What is convenience sampling
Using a sample of the most accessible participants
What is symmetric distribution
A symmetric distribution is one where the data on the left side of the median mirrors the data on the right side of the median
What does a chi squared test show
Goodness of fit - whether the differences between expected and observed numbers are larger than would be expected by chance
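A minimal goodness of fit sketch for a coin toss example with invented counts. The statistic is the sum of (observed - expected)² / expected. The p value shortcut used here, `erfc(sqrt(x / 2))`, is only valid for 1 degree of freedom, which is what two categories give.

```python
import math

def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [60, 40]   # hypothetical: 60 heads and 40 tails in 100 tosses
expected = [50, 50]   # what a fair coin would give on average

statistic = chi_squared_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1

# Upper-tail probability of the chi-squared distribution.
# This closed form only holds when degrees_of_freedom == 1.
p_value = math.erfc(math.sqrt(statistic / 2))

print(statistic, degrees_of_freedom, p_value)
```

With more than two categories the same statistic applies, but the p value would need a chi-squared table or a statistics package.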
What specific test would you use when comparing the averages of three or more independent groups when the data is not normally distributed
Kruskal-Wallis test
What is exploratory data analysis
The process of performing investigations on data to understand the data better
Initial investigations are done to determine patterns, spot abnormalities, test hypotheses, and also to check if the assumptions are right.
What is systematic sampling
A sampling method in which every member of the population is given a number and members are selected at regular intervals
What are nominal variables
Categorical variables with no clear order
Eg gender, hair colour
What overall method of test would you use when working with a normally distributed categorical dependent variable
Non-parametric test
What test would be used to compare the relationship between two continuous variables when the data is normally distributed
Pearson’s r correlation coefficient
What is an outlier
Outliers are data points that differ markedly from the other observations in the dataset. Depending on the learning process, an outlier can worsen the accuracy of a model and decrease its efficiency sharply.
What test would be used to compare the relationship between two continuous variables when the data is not normally distributed
Spearman’s rank correlation coefficient
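The difference between the two coefficients shows up on data that rises steadily but not in a straight line. This sketch computes Spearman's coefficient as Pearson's r applied to ranks (ties get the average rank). The data is invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 upwards; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 400]   # always increasing, with one extreme value

print(pearson_r(x, y), spearman_rho(x, y))
```

Spearman's coefficient only uses the order of the values, so the extreme point has no extra pull, whereas Pearson's r is lowered by it.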
What are the two overall types of sampling
Random (probability)
Non random (non-probability)
What is the purpose of a line chart
Displays changes over time
Comparison of groups
What is the difference between descriptive and inferential statistics
Descriptive statistics: used to summarise a sample of data, eg with the mean or the standard deviation.
Inferential statistics: used to draw conclusions about a population from sample data that is subject to random variation.
How is the statistical significance of an insight (idea) assessed
Hypothesis testing - the null and alternative hypotheses are stated and the p value is found
What are 4 types of probability (random) sampling
Simple random sampling
Systematic sampling
Stratified sampling
Cluster sampling
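The four methods can be sketched on a made-up population of 100 people. The `region` and `school` fields, the sample sizes and the seed are assumptions chosen only to show how the selection step differs.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 people, 2 regions, 10 schools of 10 people
population = [
    {"id": i, "region": "north" if i % 2 == 0 else "south", "school": i // 10}
    for i in range(100)
]

# Simple random sampling: every member has an equal chance
simple = random.sample(population, 10)

# Systematic sampling: random start, then every 10th member
step = len(population) // 10
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: a random sample from within every stratum
strata = {}
for person in population:
    strata.setdefault(person["region"], []).append(person)
stratified = [p for group in strata.values() for p in random.sample(group, 5)]

# Cluster sampling: randomly pick whole clusters and take everyone in them
clusters = {}
for person in population:
    clusters.setdefault(person["school"], []).append(person)
chosen_schools = random.sample(sorted(clusters), 2)
cluster_sample = [p for school in chosen_schools for p in clusters[school]]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

Stratified sampling takes some people from every group, while cluster sampling takes everyone from a few groups.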
What specific test would you use when comparing average difference between paired (matched) samples e.g. weight before and after a diet for data that is not normally distributed
Wilcoxon signed rank test
What is kurtosis
Kurtosis describes how heavy the tails of a distribution are compared with a normal distribution (comparing one tail with the other is skewness). It is effectively a measure of the outliers present in the distribution. A high value of kurtosis indicates that a large number of outliers are present in the data.
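A minimal sketch of excess kurtosis, assuming the simple moment-based formula (fourth central moment divided by the squared variance, minus 3 so that a normal distribution scores 0). Statistical packages often apply small-sample adjustments, so their values can differ slightly. Both datasets are invented.

```python
def excess_kurtosis(data):
    """Fourth central moment / variance squared, minus 3 (normal scores 0)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # population variance
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # no extreme values
with_outliers = [5, 5, 5, 5, 5, 5, 5, 5, -20, 30]     # two extreme values

k_even = excess_kurtosis(evenly_spread)
k_outliers = excess_kurtosis(with_outliers)
print(k_even, k_outliers)
```

The evenly spread data scores below zero (lighter tails than a normal distribution), while the data with two extreme values scores above zero.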
How many and what kind of variable would you use with a pie chart
One categorical variable