Statistical Analysis Flashcards
Independent Samples t-test
Definition and Assumptions
Definition:
Determine if there is a significant difference between the means of two independent groups.
Assumptions:
Data from each group are independent.
Data are approximately normally distributed.
The variances of the two groups are approximately equal.
Analysis with t-test
Analysis and Interpretation
compare the p-value with the chosen significance level (usually 0.05).
If the p-value is less than the significance level (e.g., p < 0.05), it suggests that there is a significant difference.
If the p-value is greater than the significance level (e.g., p ≥ 0.05), there is insufficient evidence to conclude a significant difference.
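The steps above can be sketched with Python's standard library alone; the two groups of scores below are made up for illustration:

```python
import statistics

def independent_t(group1, group2):
    """Pooled-variance independent-samples t statistic (equal-variance form)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)  # n-1 denominators
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5  # standard error of the mean difference
    return (m1 - m2) / se, n1 + n2 - 2            # t statistic, degrees of freedom

# Hypothetical scores for two independent groups
a = [5.1, 4.8, 5.6, 5.2, 4.9]
b = [4.2, 4.5, 4.1, 4.4, 4.3]
t, df = independent_t(a, b)
print(t, df)
```

The resulting t is then compared against a critical value with df degrees of freedom; a library such as `scipy.stats.ttest_ind` would also return the p-value directly.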
Paired-Samples t-test (Repeated Measures)
Definition: tests for a significant difference between the means of related groups, where each subject is measured at two or more time points or conditions.
math skills over time: beginning (time 1) and end (time 2)
Mean
Average of a set of numbers, calculated by summing all the values and dividing by the total count.
Calculation: Mean = sum of all values / total count
Median
The middle value of a dataset when arranged in ascending order
Representative of Central Value (especially when the data is skewed or contains outliers)
Mode
The value that appears most frequently in a dataset.
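All three measures are available in Python's built-in statistics module; a quick sketch with a hypothetical dataset:

```python
import statistics

data = [2, 3, 3, 5, 7, 9, 3]  # hypothetical values

mean = statistics.mean(data)      # sum of all values / total count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```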
Histogram
A graph showing how often different values occur in a dataset. It’s like splitting data into groups and counting how many values fall into each group.
Helps us see if data is skewed, has outliers, or follows a specific pattern, like a bell curve for normal distribution.
Helps us understand how data is spread out
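A rough text histogram can be built by binning values with collections.Counter; the data and bin width below are arbitrary choices for illustration:

```python
from collections import Counter

data = [1.2, 2.8, 2.9, 3.1, 3.4, 3.6, 4.0, 4.2, 5.7, 6.3]  # hypothetical values
bin_width = 2.0

# Assign each value to the bin starting at the nearest lower multiple of bin_width
counts = Counter((x // bin_width) * bin_width for x in data)
for start in sorted(counts):
    print(f"[{start}, {start + bin_width}): {'#' * counts[start]}")
```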
Histograms and Normality
A normal distribution looks like a symmetric, bell-shaped curve. It means most data points are in the middle, tapering off towards the ends.
This shape indicates that data is evenly spread around the average, making it easier to predict outcomes.
Normality
data is symmetrically distributed around the mean, with the majority of values clustered near the center and fewer values spread out towards the tails.
assessed visually using histograms, Q-Q plots, or box plots
Many statistical tests, such as t-tests, ANOVA, and regression, rely on the assumption of normality.
When this assumption holds, the results of these tests are more reliable and valid.
Sample Size and Normality
As the sample size increases, the variability of the sampling distribution decreases. Also, as the sample size increases, the shape of the sampling distribution becomes more similar to a normal distribution regardless of the shape of the population (the Central Limit Theorem).
Rule of thumb: at least 30 participants.
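This can be checked by simulation: draw samples from a skewed (exponential) population and watch the spread of the sample means shrink as n grows. A sketch, with trial counts chosen arbitrarily:

```python
import random
import statistics

random.seed(0)  # reproducible run

def sd_of_sample_means(n, trials=2000):
    """Standard deviation of means of n draws from a skewed (exponential) population."""
    means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

small, large = sd_of_sample_means(5), sd_of_sample_means(50)
print(small, large)  # variability of the sampling distribution shrinks as n grows
```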
Q-Q Plot Graph
Q-Q plot compares our data to a perfect “normal” dataset. It plots how our data points stack up against the ideal.
Normality and Q-Q Plot Graph
Picture two sets of dots on a graph. If they make a straight line, our data is “normal”. If they curve or stray, it’s not.
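The pairs of dots can be computed by hand: sort the data, then match each point against the quantile that a normal distribution with the same mean and SD would predict. A sketch using statistics.NormalDist and made-up numbers:

```python
import statistics

data = sorted([4.9, 5.1, 5.0, 5.3, 4.7, 5.2, 4.8, 5.0])  # hypothetical sample
n = len(data)
nd = statistics.NormalDist(statistics.mean(data), statistics.stdev(data))

# Theoretical quantile for each ordered point, using plotting position (i + 0.5) / n
pairs = [(nd.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]
for theoretical, observed in pairs:
    print(f"{theoretical:.2f}  {observed:.2f}")  # close pairs -> roughly normal data
```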
Box Plot
displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum
insights into the variability and central tendency of the data, as well as the presence of outliers across groups or conditions.
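The five-number summary can be computed directly; statistics.quantiles with method="inclusive" gives the quartiles (the data below are hypothetical):

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 12, 15]  # hypothetical scores, already sorted
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, q2, q3, max(data))
print(five_number)  # (minimum, Q1, median, Q3, maximum)
```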
Scatterplot
Displays the relationship between two continuous variables, with each data point representing an observation.
allows for the identification of patterns, trends, or correlations between variables
Bar Chart
representation of categorical data, where the height or length of each bar represents the frequency or proportion of observations in each category.
facilitates comparisons between categories and visualizes differences in frequencies or proportions across groups or conditions.
Line Plot
A graph in which individual data points are connected by straight lines, typically used to show trends or changes over time.
longitudinal data or changes in variables across different time points or conditions
Degrees of Freedom
tell us how much data can vary without messing up our calculations (how free it is).
df = n1 + n2 - 2 for an independent t-test (total sample size minus the number of groups)
crucial in determining the appropriate critical values for hypothesis testing and estimating the variability of sample statistics
Sampling distribution of mean differences
the distribution of the differences in means between two samples that are randomly drawn from the same population.
understand how much variability we might expect in the differences between sample means.
Sampling distribution of mean differences and t-test
compare the means of two independent groups to determine if there is a significant difference between them.
By comparing the observed difference in sample means to the distribution of mean differences from the sampling distribution, we can assess whether the observed difference is statistically significant (if it is larger than the variability expected by chance, it is significant).
P-value
A p-value, or probability value, is a number describing how likely it is that you would obtain your observed results (or more extreme ones) by random chance alone, assuming the null hypothesis is true
quantifies the evidence against a null hypothesis
P-value interpretation
The smaller the p-value, the less likely the results occurred by random chance, and the stronger the evidence that you should reject the null hypothesis.
alpha: a set probability threshold (often 0.05)
A p-value less than or equal to your significance level (typically ≤ 0.05) is statistically significant.
It’s a piece of evidence, not a definitive proof.
Effect size measure
Effect size measures quantify how big the difference or relationship is between variables in a study. They show how meaningful the findings are beyond just whether they’re statistically significant.
Cohen’s d, Pearson’s r, phi coefficient, partial eta-squared
Effect Size
Effect size is like a ruler for measuring how strong a relationship is between things, or how big a difference is between groups in a study.
a way to talk about how important the findings are, so researchers can compare results from different studies easily.
Cohen’s D
effect size measure used to quantify the standardized difference between two group means in a study.
Calculation: d = (mean of group 1 - mean of group 2) / pooled SD
Helps compare the magnitude of differences between groups in various studies, regardless of sample size.
Used with TWO GROUPS (independent t-test)
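A minimal sketch of the formula above, using the same kind of hypothetical two-group scores one might feed an independent t-test:

```python
import statistics

def cohens_d(group1, group2):
    """Cohen's d: standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

a = [5.1, 4.8, 5.6, 5.2, 4.9]  # hypothetical group scores
b = [4.2, 4.5, 4.1, 4.4, 4.3]
d = cohens_d(a, b)
print(d)
```

Common benchmarks read d around 0.2 as small, 0.5 as medium, and 0.8 as large.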
Pearson’s r
Pearson’s r is used to quantify the strength and direction of the linear relationship between two continuous variables.
Range: -1 (perfect negative correlation) to +1 (perfect positive correlation)
Magnitude: Closer to 1 indicates a stronger correlation.
Helps understand how closely related two variables are and whether changes in one variable predict changes in the other.
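Pearson's r can be computed from deviations around each mean; the study-hours/score pairs below are invented for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # co-deviation sum
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]        # hypothetical study hours
scores = [52, 55, 61, 64, 68]  # hypothetical test scores
r = pearson_r(hours, scores)
print(r)  # close to +1: strong positive linear relationship
```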
Displaying Cohen’s D
bar graph: you would have two bars representing the means of each group, with error bars indicating the variability or standard error of the means.
box plot: you would have two boxes representing the distribution of scores in each group, with whiskers indicating the range of the data and possibly outliers.
compare means between two groups, which is what Cohen’s d represents.
Violin Plot
combines a box plot with a kernel density plot to show the distribution of data.
Provides a visual representation of data distribution, including information about central tendency, spread, and multimodality.
Suitable for comparing distributions across different groups or visualizing the distribution of a single variable.
Advantages of Violin Plot
Individual Data Points: shows the distribution and density of data points, providing insights into clusters or gaps.
Outliers: easily noticeable due to the combination of box plot and kernel density plot.
Statistics: Displays key summary statistics such as mean, median, and quartiles, enhancing interpretability.
Disadvantages of Violin Plot
Complexity: visually complex, especially when comparing multiple groups or variables.
Limited for Nominal Data: Less effective for nominal or categorical data compared to continuous data.
Advantages of Box Plots
Efficiently summarizes the distribution of data, including median, quartiles, and outliers.
Useful for identifying skewness, variability, and outliers in the data.
Facilitates easy comparison of distributions between groups or categories.
Disadvantages of Box Plots
May not provide detailed information about the shape of the distribution or individual data points.
Less effective for displaying the density or frequency of data points compared to histograms.
Advantages of Histogram
Provides a visual representation of the frequency distribution of data.
Allows for easy identification of patterns, central tendency, and spread.
Suitable for displaying both continuous and discrete data.
Disadvantages of Histogram
Choice of bin width can influence the appearance and interpretation of the histogram.
May not accurately represent the underlying distribution if the number of bins is not chosen appropriately.
Not as effective for comparing distributions between groups or categories as box plots.
Advantages of Scatter Plot
Visualizes the relationship between two continuous variables.
Allows for the identification of patterns, trends, and correlations in the data.
Useful for detecting outliers and assessing the strength and direction of relationships.
Disadvantages of Scatter Plot
Limited to visualizing relationships between two variables and may not capture more complex patterns.
Requires a large sample size to accurately represent the underlying population distribution.
May be less effective for categorical or ordinal data compared to continuous data.
Relative Frequency
The proportion or percentage of data values that fall into each category or interval.
Relative frequency = frequency of a category or interval / total number of data points.
Provides insights into the distribution of data by showing the proportion of observations in each category or interval.
Cumulative Frequencies
The running total of frequencies as you move through the categories or intervals from the lowest to the highest.
Cumulative frequency of a category or interval = sum of frequencies up to that category or interval.
Helps visualize the accumulation of data values and identify patterns or trends in the distribution.
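Both quantities fall out of a single pass over category counts; the grade data below are hypothetical:

```python
from collections import Counter

grades = ["A", "B", "B", "C", "A", "B", "C", "C", "C", "B"]  # hypothetical categories
counts = Counter(grades)
total = len(grades)

running = 0
for category in sorted(counts):
    freq = counts[category]
    running += freq  # cumulative frequency: running total of counts so far
    print(category, freq, freq / total, running)  # category, freq, relative, cumulative
```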
Continuous Data
Data that can take any value within a certain range and can be measured with precision.
Height, weight, temperature, time.
Can take an infinite number of values within a range.
Often measured using instruments with fine precision.
Can be subdivided into smaller units (e.g., fractions or decimals).
Discrete Data
Data that can only take specific values and cannot be subdivided further.
Number of siblings, number of cars in a parking lot, number of goals scored in a football match.
Can only take distinct, separate values.
Often represented by integers.
Cannot be measured with infinite precision.
Continuous vs Discrete
Continuous data can take any value within a range, while discrete data can only take specific, distinct values.
Continuous data is measured with precision, often using instruments, while discrete data is counted or observed.
Continuous data can be subdivided into smaller units, such as fractions or decimals, while discrete data cannot be further divided.
Bar Chart vs. Histogram for Categorical Groups
Bar Chart: Use for distinct categories to compare frequencies.
Histogram: Use for continuous data to display distribution.
Misleading Bar Chart
Adjusting the scale of the y-axis to exaggerate differences between categories.
Makes differences between categories appear larger or smaller than they actually are, leading to inaccurate interpretations or conclusions.
e.g. A bar chart with a truncated y-axis that starts at a value greater than zero, making small differences between categories appear larger than they are.
Sample Standard Deviation Denominator (n - 1)
It’s the formula for the sample standard deviation, which divides by n - 1 (Bessel’s correction) instead of n.
Because a sample tends to underestimate the spread of the population it came from, dividing by n - 1 corrects this bias and makes the estimate more accurate.
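The two versions correspond to statistics.pstdev (divide by n) and statistics.stdev (divide by n - 1); the sample is hypothetical:

```python
import statistics

sample = [4, 7, 6, 5, 8]  # hypothetical sample drawn from a larger population

biased = statistics.pstdev(sample)   # divides by n (population formula)
unbiased = statistics.stdev(sample)  # divides by n - 1 (Bessel's correction)
print(biased, unbiased)  # the n - 1 version is always a bit larger
```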
Statistical Inference
It’s like making educated guesses about a whole group based on what we see in a smaller group.
This helps us test ideas, draw conclusions about larger groups, and figure out if what we’re seeing is really important or just random.
Probability on Standard Normal Curve
There’s a 50% chance (or 0.5) that something falls below the middle point.
Because the normal curve is perfectly symmetrical, half of the data falls below the middle point, which is why it’s 50%.
Percentage between -1 and +1 SD
Around 68% of the data falls within one standard deviation of the average.
This is a rule we use that says most of our data (about 68%) is within a certain distance from the average, which helps us understand how spread out our data is.
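Both facts can be read off the standard normal CDF via statistics.NormalDist:

```python
import statistics

nd = statistics.NormalDist(0, 1)  # standard normal curve

below_mean = nd.cdf(0)                  # probability of falling below the middle point
within_one_sd = nd.cdf(1) - nd.cdf(-1)  # probability of falling between -1 and +1 SD
print(below_mean, within_one_sd)        # ~0.5 and ~0.68
```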
Chi-Square Test
Statistical test used to determine if there is a significant association between categorical variables.
Assess whether observed frequencies differ significantly from expected frequencies.
Pearson chi-square, likelihood ratio chi-square, Fisher’s exact test.
Variables Suitable for Chi-Square Goodness-of-Fit Test
Used when comparing observed frequencies to expected frequencies within one categorical variable.
Testing whether observed frequencies of blood types in a population match the expected frequencies based on a genetic model.
Chi-Square Goodness-of-Fit Test Calculation
χ² = Σ((O-E)² / E)
where O is observed frequency, E is expected frequency, and Σ represents the summation over all categories of the variable.
Calculate expected frequencies, find the difference between observed and expected frequencies, square the differences, divide by expected frequencies, sum up these values to get the chi-square statistic.
Compare the calculated chi-square value to a critical value from the chi-square distribution to determine significance.
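The formula translates directly into a few lines of Python; the observed counts and model-based expected counts below are invented for illustration:

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical blood-type counts vs. frequencies expected from a genetic model
observed = [44, 42, 10, 4]         # O, A, B, AB
expected = [45.0, 40.0, 11.0, 4.0]
chi2 = chi_square_gof(observed, expected)
print(chi2)  # compare against the critical value with (categories - 1) = 3 df
```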
Follow-up after Significant Chi-square Test
To explore the nature of the significant association found in the chi-square test.
Post-Hoc Analysis: Conduct additional analyses to determine which categories are driving the significant result.
If the chi-square test indicates a significant association between two categorical variables, follow-up analyses such as residual analysis or pairwise comparisons can be performed to identify specific categories contributing to the association.