Statistics (Elizabeth Wilhelm) Flashcards
Qualitative & Quantitative variables
Qualitative variables are often known as categorical variables because they consist of distinct groups or categories, e.g. gender. You cannot measure them on a numerical scale; you just choose one category.
For qualitative variables, there are nominal and ordinal scales. A nominal scale means that there are categories that cannot be ranked, e.g. diabetic vs. non-diabetic. On an ordinal scale, the categories can be ranked, e.g. stage of breast cancer (I, II, III and IV): it is common knowledge that the higher the stage, the worse for the patient. However, we cannot say that the difference between stage I and II is equal to the difference between stage III and IV.
Quantitative variables consist of numbers, e.g. age. They can be discrete or continuous. Discrete variables only take integer values, e.g. how many children one has; you cannot answer that you have 1.5 children. Continuous variables can take decimal values, so you can measure e.g. weight with more precision.
For quantitative variables, there are interval and ratio scales. On an interval scale there is no true zero: there is no absolute starting point, so you cannot say that one value is twice as much as another. For example, temperature has no true zero; 0 degrees is not a starting point, as temperatures can be negative. Not many quantitative variables correspond to an interval scale. On a ratio scale there is an absolute, true zero, e.g. age, where the "zero" age is the starting point.
Normally distributed or not…?
Only quantitative variables can be normally distributed or not. The definition of a normal distribution is that the values are symmetrically distributed about the mean, which is located in the middle. If a variable is normally distributed, approximately 68% of the values fall within ±1 standard deviation (SD) of the mean.
If the results are not symmetric, the distribution is not normal.
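As a quick check of the 68% rule, here is a minimal sketch in Python using NumPy (the variable, seed and sample size are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    values = rng.normal(loc=170, scale=10, size=100_000)  # e.g. heights in cm

    mean, sd = values.mean(), values.std()
    # Share of observations falling within one SD of the mean
    within_one_sd = np.mean((values >= mean - sd) & (values <= mean + sd))
    print(f"Share within +/- 1 SD: {within_one_sd:.1%}")  # approximately 68%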
Research hypothesis
The research hypothesis lies behind why you choose a particular population, include certain individuals, and select a certain study design; it guides this whole process. It can be expressed as a prior expectation of the results of the study: based on existing knowledge, you think you will obtain a certain result. You have a prediction of the possible outcomes of the study, and you may presume differences between groups on a specific variable you want to compare, or a relationship between variables. Sometimes you know nothing about what the study will show, because it is the first time it has ever been performed. In that case, you make an educated guess based on theory and clinical experience.
Hypothesis in statistical testing
We use them after we have gathered our data, breaking down the research question into more specific questions. For example, the research question could be differences between males and females regarding health. When conducting a statistical test, we break this down further into a concrete, specific question such as "Is there a difference between males and females in how much they exercise or eat?". The questions can be about differences, similarities, or correlations, or sometimes about an explanatory model for the relationship between many variables. Each statistical analysis answers one specific question; you do not answer two or three questions with ONE statistical analysis, but conduct a separate one for each question.
The null hypothesis is phrased as “There is NO difference between….”.
The alternative hypothesis is phrased as "There is A difference between…".
P-value
The result of a statistical analysis is a test statistic, which is transformed into a p-value; you will find the p-values in the tables of an article. The transformation is based on the distribution curve that corresponds to the chosen analysis.
The p-value is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true; in other words, it is the risk of making the wrong decision if you reject a true null hypothesis. The lower the p-value, the stronger the evidence for rejecting the null hypothesis.
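As a minimal sketch of where a p-value comes from, assuming scipy is installed (the two groups are invented data): an independent-samples t-test returns a test statistic and its p-value.

    from scipy import stats

    group_a = [5.1, 4.9, 6.2, 5.8, 5.5, 4.7, 5.9]
    group_b = [6.4, 6.8, 5.9, 7.1, 6.6, 6.2, 7.0]

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    # If p < 0.05 (the chosen significance level), we reject the null hypothesis.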
Type I and II error
A Type I (alpha) error is rejecting the null hypothesis even though the null hypothesis is TRUE: you say that there is a significant effect when there actually isn't. It can only occur when we have obtained a significant result. Keep in mind that in the overall population the null hypothesis is true, yet in our sample study we have concluded that it is not. Alpha is the significance level (often 0.05), i.e. the risk of a Type I error we accept; a result is called significant when the p-value falls below alpha.
A Type II (beta) error is retaining the null hypothesis even though the null hypothesis is FALSE: you say that there is no significant effect when there actually is one. The effect could be significant in the population, yet in the sample study the conclusion is that the result is not significant.
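To see the Type I error rate in action, here is a minimal simulation sketch, assuming numpy and scipy are installed (seed, sample sizes and number of studies are invented): both groups are drawn from the same population, so the null hypothesis is TRUE, and any significant result is a false positive.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=1)
    alpha, n_studies, false_positives = 0.05, 2_000, 0

    for _ in range(n_studies):
        a = rng.normal(0, 1, size=30)
        b = rng.normal(0, 1, size=30)  # same population: no real difference
        if stats.ttest_ind(a, b).pvalue < alpha:
            false_positives += 1

    print(f"Type I error rate: {false_positives / n_studies:.1%}")  # close to 5%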
Power
Power is the probability that a hypothesis test finds an effect if there is an effect to be found. It can also be defined as the probability of NOT making a Type II error, and is calculated as 1 − beta. Power lies between 0% and 100%. Higher power is better for detecting a true effect, but very high power requires more individuals in the study. Regardless, a GOOD POWER IS 80% or higher.
We can only know how high beta is once we have done a power calculation. You should think about power before you gather your data, because you need it to calculate how many individuals your study requires. Not all studies calculate power; those that do mention it in the statistical methods section, which also means that they included the number of individuals needed. If it is not mentioned in the statistical methods, no assumptions can be made. Power can also be calculated after the data are gathered, but that is not the same as calculating it prior to data collection.
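A minimal sketch of an a priori power calculation, assuming statsmodels is installed (the effect size of 0.5 is an invented assumption): how many individuals per group are needed for 80% power at alpha = 0.05?

    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(f"Required sample size per group: {n_per_group:.0f}")  # about 64

Leaving exactly one argument unspecified tells solve_power which quantity to solve for; given a sample size instead, it would return the achieved power.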
Parametric vs. non-parametric analysis
Parametric: for quantitative variables; almost all of these tests require a normal distribution, data sampled randomly and independently, no outliers, and equal variances; e.g. t-test, paired t-test, ANOVA, Pearson correlation, linear regression
Non-parametric: for qualitative variables (but can also be used for quantitative variables that are not normally distributed); does not need to fulfil the requirements of a parametric analysis; e.g. Mann-Whitney, Kruskal-Wallis, Wilcoxon signed-rank test, Spearman, logistic regression
Descriptive analysis
Descriptive statistics have the purpose of describing the data: we want to know the distribution of every variable separately, how many observations fall in each category, etc. We can also split a variable into specific groups, e.g. by gender, and present results per group; either way, we are just describing the data. The frequently used measures are amounts, described with percentages (the variable always has categories here), as well as the median (IQR) and the mean (SD); a sketch of such a summary follows below.
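A minimal sketch of a descriptive summary, assuming pandas is installed (the data frame and values are invented for illustration): counts and percentages for a qualitative variable, mean (SD) and median (IQR) for a quantitative one.

    import pandas as pd

    df = pd.DataFrame({
        "gender": ["F", "M", "F", "F", "M", "F"],
        "age":    [34, 58, 41, 29, 62, 47],
    })

    print(df["gender"].value_counts())                # amount per category
    print(df["gender"].value_counts(normalize=True))  # proportions/percentages

    print(f"mean (SD): {df['age'].mean():.1f} ({df['age'].std():.1f})")
    iqr = df["age"].quantile(0.75) - df["age"].quantile(0.25)
    print(f"median (IQR): {df['age'].median():.1f} ({iqr:.1f})")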
If you have your own data, then you have to use a test to determine if the quantitative variable is normally distributed or not.
Kolmogorov-Smirnov: Use it if you have 50 or more observations in the variable
Shapiro-Wilk analysis: Use it if you have less than 50 observations
If it is NOT normally distributed, you present the quantitative variable using the median (IQR) but if it is normally distributed, then you present the quantitative variable using the mean (standard deviation).
A rule of thumb: if half the mean (mean/2) is lower than the SD, that indicates that the variable is NOT normally distributed; if it is higher than the SD, the variable is likely normally distributed.
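A minimal sketch of this workflow, assuming scipy and numpy are installed (the variable and seed are invented): pick the normality test by sample size, then choose how to present the variable based on the result.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=1)
    variable = rng.normal(70, 12, size=40)  # e.g. weights in kg; n < 50 here

    if len(variable) >= 50:
        # Kolmogorov-Smirnov against a normal with the sample's mean and SD
        stat, p = stats.kstest(variable, "norm",
                               args=(variable.mean(), variable.std()))
    else:
        stat, p = stats.shapiro(variable)   # Shapiro-Wilk for n < 50

    if p < 0.05:
        print("Not normally distributed -> present median (IQR)")
    else:
        print("Normally distributed -> present mean (SD)")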
Bivariate analysis
A bivariate analysis focuses on the analysis of two variables; a code sketch of these tests follows after the lists below.
Non-parametric:
- Mann-Whitney compares two groups
- Chi-squared compares groups in a cross table (contingency table)
- Wilcoxon signed-rank test: two occasions or matched pairs
Parametric:
- T-test compares two groups
- Paired T-test: two occasions or matched pairs
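A minimal sketch, assuming scipy is installed (all data invented): the same two-group comparison with the parametric t-test and its non-parametric counterpart, plus a chi-squared test on a cross table.

    from scipy import stats

    group_a = [12, 15, 11, 14, 13, 16, 12]
    group_b = [18, 17, 20, 16, 19, 21, 18]

    print(stats.ttest_ind(group_a, group_b))     # parametric
    print(stats.mannwhitneyu(group_a, group_b))  # non-parametric

    # Chi-squared on a cross table: rows = gender, columns = diabetic yes/no
    table = [[20, 30],
             [25, 25]]
    chi2, p, dof, expected = stats.chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.3f}")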
Multivariate analysis
A multivariate analysis focuses on the analysis of three or more variables; a code sketch follows after the lists below.
Non-parametric:
- Kruskal-Wallis compares three or more groups.
Parametric:
- ANOVA compares three or more groups.
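A minimal sketch, assuming scipy is installed (invented data): three groups compared with ANOVA (parametric) and Kruskal-Wallis (non-parametric).

    from scipy import stats

    group_1 = [5.2, 5.8, 6.1, 5.5]
    group_2 = [6.9, 7.2, 6.5, 7.0]
    group_3 = [8.1, 7.8, 8.4, 8.0]

    print(stats.f_oneway(group_1, group_2, group_3))  # one-way ANOVA
    print(stats.kruskal(group_1, group_2, group_3))   # Kruskal-Wallis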
Generalised linear models
These focus on two or more variables and are an extension of the general linear model. The dependent variable does not have to be quantitative and can follow distributions other than the normal, and there is a non-linear relationship between the dependent and independent variables. The dependent variable is therefore transformed and related to the linear model through a link function; the transformed dependent variable is then linearly dependent on the independent variables.
Examples: logistic regression (binomial, multinomial, ordinal) and Poisson regression
In baby terms, it’s explaining one variable with two or more variables.
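A minimal sketch of one such model, assuming numpy and statsmodels are installed (the variables, coefficients and outcome are all invented): logistic regression, i.e. a GLM with a binomial distribution and its logit link function, explaining one binary variable with two others.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(seed=1)
    age = rng.uniform(20, 80, size=200)
    bmi = rng.normal(25, 4, size=200)

    # Invented outcome: probability of disease rises with age and BMI
    p = 1 / (1 + np.exp(-(-8 + 0.08 * age + 0.10 * bmi)))
    disease = rng.binomial(1, p)

    X = sm.add_constant(np.column_stack([age, bmi]))
    model = sm.GLM(disease, X, family=sm.families.Binomial())  # logit link is the default
    result = model.fit()
    print(result.summary())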