Final Exam- Pearson's Correlation Flashcards
Why Screen Data?
- Avoiding erroneous conclusions by checking accuracy of data
- Use SPSS (PASW) frequency procedure
- Avoiding missing data (from entry, participants, equipment, etc.)
- Avoiding extreme values (outliers): values so extreme that they distort results
- Meeting assumptions of particular tests
Stem and Leaf Display
Like a grouped frequency distribution without loss of information
- Stem: the intervals on the left
- Leaf: the digits on the right side; each leaf is one score, so the count of leaves shows the frequency
Why does data go missing?
- Measurement Equipment Fails
- Participants do not complete all trials or all items
- Errors occur during data entry
Missing Data
If missing data are not randomly distributed, there can be systematic problems
What do you do with missing data?
- Analyze difference between groups (those with missing and those without)
- Delete cases and/or items
- Estimate missing values using
- Prior knowledge
- Calculating means using available data
- Use regression analyses to predict values
How do we find missing data?
1. Analyze -> Descriptive Statistics -> Frequencies
2. Analyze -> Descriptive Statistics -> Explore
Replacing Missing Data
1. Transform -> Replace Missing Values
2. Choose to replace with the series mean, the mean (or median) of nearby points, or other imputations
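The cards describe SPSS's Replace Missing Values menu; purely as an illustration (not the SPSS procedure itself), series-mean imputation can be sketched in plain Python with hypothetical data:

```python
# Hypothetical sketch of "series mean" imputation, mirroring one option
# in SPSS's Transform -> Replace Missing Values menu.
scores = [4.0, None, 3.5, 5.0, None, 4.5]  # None marks missing entries

observed = [x for x in scores if x is not None]
series_mean = sum(observed) / len(observed)  # mean of the available data

imputed = [series_mean if x is None else x for x in scores]
print(imputed)  # [4.0, 4.25, 3.5, 5.0, 4.25, 4.5]
```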
Causes For Outliers
- Data-Entry Errors were made by the researcher
- The participant is not a member of the population for which the sample is intended
- The participant is simply different from the remainder of the sample
Why are outliers problematic?
- Can have disproportionate influence on results (many tests take squared deviations from mean)
- Statistical Tests are sensitive to outliers
- Can create Type I and Type II errors
How do we identify outliers in SPSS?
- Explore Menu (under Descriptive Statistics) can give you frequencies, highest and lowest scores, boxplots, and stem and leaf plots.
What should you do with outliers?
1. Conduct analyses with and without the outliers
2. Some outliers are of interest (e.g., they can call attention to a poorly worded question)
Are data normal?
Examine both univariate (individual variables) and multivariate (combination of variables) normality
Ways to assess normality
- Skewness: Degree of symmetry of a distribution around the mean
- Kurtosis: Degree of peakedness of distribution
- When the distribution is normal, the values of both equal zero
- Kolmogorov-Smirnov statistic
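A minimal Python sketch of the moment-based skewness and excess-kurtosis statistics described above (both equal zero for a normal distribution), using made-up data:

```python
import statistics

# Moment-based skewness and excess kurtosis; both equal zero
# for a perfectly normal distribution.
def skewness(xs):
    n, m = len(xs), statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    n, m = len(xs), statistics.fmean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))         # 0.0: symmetric around the mean
print(excess_kurtosis(symmetric))  # negative: flatter than normal (platykurtic)
```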
Kolmogorov-Smirnov statistic
Tests the null hypothesis that the population is normally distributed
-Significance of this test indicates non-normal data
Normal distribution
A symmetrical, bell-shaped distribution having half the scores above the mean and half the scores below the mean
- Most of the scores are clustered near the middle of the continuum of observed scores
- Resembles bell shaped curve
Variability
The extent to which scores spread out around the mean
Range
A measure of variability that is computed by subtracting the smallest score from the largest score
Variance
A single number that represents the total amount of variation in a distribution
Standard Deviation
The standard deviation is the square root of the variance. It has important relations to the normal curve.
- Most commonly used measure of dispersion
- Approximately how far on the average a score is from the mean
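These definitions can be checked with Python's standard library (the scores are hypothetical):

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]
var = statistics.pvariance(scores)  # population variance
sd = statistics.pstdev(scores)      # standard deviation = square root of variance
print(var, sd)  # 4 2.0
```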
Skewed Distribution
Most of the scores are clustered on one end of the continuum
- Positively skewed: scores cluster at the lower end of the continuum (higher than zero statistic)
- Negatively skewed: scores cluster at the higher end of the continuum (lower than zero statistic)
Kurtosis
Measure of the degree of peakedness of a distribution
Leptokurtosis
Distribution is too peaked with thin tall (higher than zero statistic)
Platykurtosis
Distribution is too flat with many cases in the tail(s) (lower than zero statistic)
Multimodal shapes
Scores tend to congregate around more than one point
Bimodal shapes
scores are clustered in two places
Trimodal shapes
Scores are clustered in three places
Mode
Most frequently occurring score
Median
Midpoint
- Identifying the value that splits the distribution into two halves, each half having the same number of values.
- Best measure of central tendency when the distribution includes extreme scores because it is less influenced by the extreme scores than is the mean
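A quick illustration of the median's resistance to extreme scores, using made-up data:

```python
import statistics

scores = [10, 12, 13, 14, 90]         # 90 is an extreme score
print(statistics.median(scores))      # 13 -- unaffected by the outlier
print(statistics.mean(scores))        # 27.8 -- pulled upward by 90
```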
Mean
Average
-Most commonly reported measure of central tendency and is determined by dividing the sum of the scores by the number of scores contributing to that sum
Range
Difference between the highest and lowest scores
Interquartile range
The spread of the middle 50% of the scores
- Upper quartile: top 25%
- Lower quartile: bottom 25%
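A sketch of the interquartile range in Python (the `inclusive` quantile method used here is one of several conventions; the data are hypothetical):

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of scores
print(q1, q3, iqr)  # 3.0 7.0 4.0
```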
Box-and-whisker plot
Summarizes the degree of variability with a picture
- “Box”: middle 50% of scores
- “Whiskers”: extend to the highest/lowest score, to 1.5 times the height of the box, or to the 5th and 95th percentiles
- Line in the middle corresponds with median
- Helps identify outliers
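One common boxplot convention flags scores beyond 1.5 times the IQR from the quartiles as outliers; a hypothetical sketch:

```python
import statistics

# Flag scores beyond 1.5 * IQR from the box edges -- one common
# boxplot convention for marking outliers.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in scores if x < low or x > high]
print(outliers)  # [30]
```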
Outliers
Scores that lie far away from the data set
- Can lead to underestimating or overestimating a relationship
Why do outliers occur?
- Sabotage
- Misunderstandings
- Extreme thinking
- Data Entry
- Participant is not part of population from which sample is intended
- Participant is different from rest of sample
How can we address violations of normality assumption?
Data transformations
Data transformations
- Application of mathematical procedures to make the data appear more normal
- Several different types of transformation exist. Appropriate one depends on shape of data.
Linearity
Assumption that there is a straight-line relationship between two variables
-Important because most statistical tests only capture linear relationships
How do we assess linearity?
Residuals
Bivariate Scatterplots
Residuals
Examining the differences between the predicted values and the plotted (actual) values
-Also known as prediction errors
Bivariate Scatterplots
Subjective method of assessing linearity
Homogeneity of Variance
- Variance between groups is similar
- Assessed using Levene’s test
- If significant at the 0.05 level, homogeneity of variance cannot be assumed
- Done using General Linear Model menu
Homoscedasticity (with two continuous variables)
Assumption that the variability in scores for one continuous variable is roughly the same at all values of another continuous variable
Heteroscedasticity
- This violation of the assumption of homoscedasticity can be assessed through the examination of the bivariate scatterplots
- This violation will not prove fatal to an analysis
Line Graph
A graph that is frequently used to depict the results of an experiment. The vertical or y axis is known as the ordinate and the horizontal or x axis is known as the abscissa.
Correlational study
Measurement and determination of the relation between two variables
- Used when data on two variables are available, but the variables can only be measured, not manipulated.
- Cannot determine cause-and-effect
- Correlation Coefficient
- Strength: Number
- Direction: Sign
Pearson Product-Moment Correlation Coefficient (r)
- This type of correlation coefficient is calculated when both the X variable and the Y variable are interval or ratio scale measurements and the data appear to be linear
- Other correlation coefficients can be calculated when one or both of the variables are not interval or ratio scale measurements or when the data do not fall on a straight line.
- Involves two ratio or interval variables
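Pearson's r can be computed directly from its definitional formula (covariance divided by the product of the standard deviations); a self-contained sketch with made-up data:

```python
import math

# Pearson's r from its definitional formula:
# covariance over the product of the standard deviations.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0: perfect positive
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0: perfect negative
```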
Correlation matrix
Used when we have multiple correlations. Summarizes all correlations
Different types of correlation
- Pearson’s Product-Moment Correlation (Pearson’s r)
- Spearman’s Rho
- Coefficient of Determination
Spearman’s Rho
Calculated for ordinal data
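Spearman's rho can be sketched as a rank-based computation; the classic difference-of-ranks formula below assumes no tied scores (the data are hypothetical):

```python
# Spearman's rho via the classic difference-of-ranks formula,
# valid when there are no tied scores.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(xs, ys):
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rho([10, 20, 30, 40], [1, 3, 2, 4]))  # 0.8
```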
Coefficient of Determination
- Result when r is squared
- Indicates proportion of variability in one variable that is associated with another variable
- Multiply the result by 100 to get the percentage of explained variability (or shared variance)
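A one-line illustration of squaring r and converting to a percentage:

```python
r = 0.50
r_squared = r ** 2                 # coefficient of determination
shared_variance = r_squared * 100  # percentage of explained variability
print(r_squared, shared_variance)  # 0.25 25.0
```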
Strengths of r (effect size)
- 0.10 (or -0.10): small or weak
- 0.30 (or -0.30): medium or moderate
- 0.50 (or -0.50): large or strong
Covariance
An association establishes that A covaries with B (the two variables vary together)
Temporal precedence (directionality problem)
Do we know which one came first in time?
Did A -> B
or Did B -> A
If we cannot tell which came first, we cannot infer causation.
Internal validity (third-variable problem)
Is there a C variable that is associated with both A and B, independently?
- If there is a plausible third variable, we cannot infer causation.
Problems with correlation
- Cause and effect?
- Directionality
- Third Variable Problem
Pie Chart
Graphical representation of the percentage allocated to each alternative as a slice of a circle
Bar Graph
A graph in which the frequency for each category of a qualitative variable is represented as a vertical column. The columns of a bar graph do not touch.
Histogram
A graph in which the frequency for each category of a quantitative variable is represented as a vertical column that touches the adjacent column.
Frequency Polygon
A graph that is constructed by placing a dot in the center of each bar of a histogram and then connecting the dots.
Data Analysis for an Experiment Comparing Means
- Getting to know the data
- Summarizing the data
- Using Confidence Intervals to Confirm what the Data Reveal
Measures of central tendency
Mean, median, mode
-indicate the score that the data tend to center around
Measure of dispersion (variability)
Indicate the breadth, or variability, of the distribution
- Range
- Standard deviation
Standard error of the mean
The standard deviation of the theoretical sampling distribution of the mean
-Our ability to estimate the population mean on the basis of a sample depends on the size of the sample and on the variability in the population from which the sample was drawn, as estimated by the sample standard deviation
Estimated standard error of the mean
Typically, we do not know the standard deviation of the population, so we estimate it using the sample standard deviation (s)
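A sketch of the estimated standard error: divide the sample standard deviation (s, computed with n - 1 in the denominator) by the square root of the sample size (the scores are hypothetical):

```python
import math
import statistics

scores = [4, 6, 8, 10]
s = statistics.stdev(scores)      # sample standard deviation (n - 1 denominator)
sem = s / math.sqrt(len(scores))  # estimated standard error of the mean
print(sem)
```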