Week 6: Descriptive Data Analysis Flashcards
What is descriptive statistics?
Descriptive statistics = set of techniques for summarizing and displaying data.
LETS ASSUME…
Quantitative data
Scores on one or more variables, collected from several study participants.
It is important to DESCRIBE each variable individually
DESCRIBE the distribution of a Variable
The Distribution of a Variable
EVERY VARIABLE - has a distribution
= scores are distributed ACROSS the levels of that variable
EXAMPLE
Sample of 100 university students :)
the distribution of the variable “number of siblings” might be such that…
- 10 of them have no siblings
- 30 have one sibling
- 40 have two siblings, and so on.
In the same sample, the distribution of the variable “sex” might be such that…
44 have a score of “male”
56 have a score of “female.”
Describe a frequency table and explain why it is important.
Frequency Tables
One way to DISPLAY the distribution of a variable is in a frequency table.
EXAMPLE
Frequency table showing a hypothetical distribution of scores on the Rosenberg Self-Esteem Scale for a sample of 40 college students.
The first column lists the values of the variable—the possible scores on the Rosenberg scale
The second column lists the frequency of each score.
This table shows that there were three students who had self-esteem scores of 24, five who had self-esteem scores of 23, and so on.
From a frequency table like this, one can quickly see several important aspects of a distribution
- including the RANGE of scores (from 15 to 24)
- the most and least common scores (22 and 17, respectively)
- EXTREME scores that stand out from the rest.
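A frequency table is easy to build in Python with `collections.Counter`. The scores below are hypothetical, chosen only so the top counts match the text (3 students at 24, 5 at 23); they are not the actual table data.

```python
from collections import Counter

# Hypothetical self-esteem scores (illustrative only, not the textbook's table)
scores = [24, 24, 24, 23, 23, 23, 23, 23, 22, 22, 22, 22, 22, 22, 22,
          21, 21, 21, 21, 20, 20, 20, 19, 19, 18, 18, 17, 16, 16, 15]

freq = Counter(scores)

# Levels go from the highest at the top to the lowest at the bottom,
# and only cover the range actually present in the data
for level in sorted(freq, reverse=True):
    print(level, freq[level])
```

The range, the most and least common scores, and any extreme scores can all be read directly off this printout.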
Explain 2 important points about frequency tables…
IMPORTANT - FREQUENCY TABLES
- the levels listed in the first column usually go from the highest at the top to the lowest at the bottom
- and they usually do not extend beyond the highest and lowest scores in the data.
EXAMPLE
ALTHOUGH scores on the Rosenberg scale can VARY from a high of 30 to a low of 0
THE TABLE ONLY includes levels from 24 to 15 because that range includes all the scores in this particular data set.
With MANY different scores across a WIDE range of values, it is often better to create a grouped frequency table,
—– in which the first column lists ranges of values
—– the second column lists the frequency of scores in each range.
Attached table - grouped frequency table showing a hypothetical distribution of simple reaction times for a sample of 20 participants.
GROUPED FREQUENCY TABLE - The ranges must all be of equal width, and there are usually between 5 and 15 of them.
FREQUENCY TABLES Can also be used for categorical variables
Levels = category labels.
The order = somewhat arbitrary
Often listed from the most frequent at the top to the least frequent at the bottom.
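For a categorical variable, `Counter.most_common()` gives exactly this most-frequent-first ordering. The category labels below are made up for illustration.

```python
from collections import Counter

# Hypothetical categorical data: participants' majors (illustrative only)
majors = ["Psychology", "Biology", "Psychology", "English",
          "Psychology", "Biology", "History"]

# most_common() lists levels from most frequent to least frequent
for label, count in Counter(majors).most_common():
    print(label, count)
```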
Describe and explain - HISTOGRAM
HISTOGRAM
= graphical display of a distribution.
It presents the SAME information as — a frequency table
BUT
is quicker and easier to grasp.
When the variable is quantitative = no gap between the bars.
When the variable is categorical, however, there is usually a SMALL gap between them.
SHAPE = Unimodal (one distinct peak)
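A crude text "histogram" can be printed with one bar of `#` characters per level. (In practice you would use a plotting library such as matplotlib's `pyplot.hist`; the data here are hypothetical.)

```python
from collections import Counter

scores = [2, 3, 3, 4, 4, 4, 5, 5, 6]  # hypothetical quantitative data

# One bar per level; for a quantitative variable the bars sit at
# adjacent levels with no gap between them
freq = Counter(scores)
for level in sorted(freq):
    print(f"{level}: " + "#" * freq[level])
```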
Describe and explain - Distribution shapes
Distribution Shapes
When the distribution of a quantitative variable is displayed in a histogram — it has a shape.
The shape of the distribution of self-esteem scores is typical.
There is a peak somewhere near the middle of the distribution and “tails” that taper in either direction from the peak.
IMAGE BELOW…
SHAPE = bimodal
meaning they have two distinct peaks
Distributions can also have more than two distinct peaks, = relatively rare in psychological research.
Symmetrical or skewed?
Describe the shape of positive skew, negative skew, and symmetrical…
- SYMMETRICAL = the left and right tails are mirror images of each other.
- POSITIVELY SKEWED = the longer tail points toward the higher scores (to the right).
- NEGATIVELY SKEWED = the longer tail points toward the lower scores (to the left).
Describe outliers
An OUTLIER is an extreme score that is much higher or lower than the rest of the scores in the distribution.
Sometimes outliers represent truly extreme scores on the variable of interest.
EXAMPLE
On the Beck Depression Inventory
= a single clinically depressed person might be an outlier in a sample of otherwise happy and high-functioning peers.
However, outliers can also represent errors or misunderstandings on the part of the researcher or participant, equipment malfunctions, or similar problems.
Describe and explain - Central Tendency
Central Tendency
The central tendency of a distribution is its middle
the point around which the scores in the distribution tend to cluster.
(Another term for central tendency is average.)
Three most common measures of central tendency:
the mean, the median, and the mode.
The mean of a distribution (symbolized M)
= the sum of the scores divided by the number of scores
M = ΣX / N
Σ (the Greek letter sigma) = is the summation sign
X = the individual scores; ΣX means to sum across the values of the variable
N = represents the number of scores.
The mean…
— provides a good indication of the central tendency of a distribution
— is easily understood by most people.
— has statistical properties that make it useful for inferential statistics.
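The formula M = ΣX / N maps directly onto code. This sketch uses the same data set as the median example below and checks the hand computation against Python's built-in `statistics.mean`.

```python
import math
import statistics

scores = [8, 4, 12, 14, 3, 2, 3]

M = sum(scores) / len(scores)  # M = ΣX / N  →  46 / 7 ≈ 6.57

# The stdlib computes the same value
assert math.isclose(M, statistics.mean(scores))
print(round(M, 2))
```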
Describe median
MEDIAN = is the middle score
Find the median…
8 4 12 14 3 2 3
Rearrange low to high
2 3 3 4 8 12 14
Median = 4
EXAMPLE
If the distribution were…
2 3 3 4 8 12 14 15
Median = 6 (the mean of the two middle scores, 4 and 8)
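Python's `statistics.median` handles both cases: with an odd number of scores it returns the middle score, and with an even number it averages the two middle scores.

```python
import statistics

# Odd N: the single middle score
print(statistics.median([8, 4, 12, 14, 3, 2, 3]))      # → 4

# Even N: the mean of the two middle scores (4 and 8)
print(statistics.median([2, 3, 3, 4, 8, 12, 14, 15]))  # → 6.0
```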
Describe mode and what we can learn from different distribution shapes
MODE = is the most frequent score in a distribution.
The mode is the ONLY measure of central tendency that can also be used for categorical variables
EXAMPLE -
Data set - 2 3 3 4 8 12 14
Mode = 3
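`statistics.mode` finds the most frequent score, and, unlike the mean and median, it also works on categorical data (the sex labels here are hypothetical).

```python
import statistics

print(statistics.mode([2, 3, 3, 4, 8, 12, 14]))        # → 3

# The mode is the only measure of central tendency usable
# for categorical variables
print(statistics.mode(["female", "male", "female"]))   # → female
```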
When distribution = unimodal and symmetrical
the mean, median, and mode will be very close to each other at the peak of the distribution.
BIMODAL AND ASYMMETRICAL distribution
the mean, median, and mode can be quite different.
In a bimodal distribution
the mean and median will tend to be between the peaks,
WHILE the mode will be at the tallest peak.
In a skewed distribution
the mean will DIFFER from the median in the direction of the skew (i.e., the direction of the longer tail).
For highly skewed distributions, the mean can be pulled so far in the direction of the skew that it is no longer a good measure of the central tendency of that distribution.
HIGHLY SKEWED = Researchers prefer MEDIAN
EXAMPLE (for understanding…)
Imagine, for example, a set of four simple reaction times of 200, 250, 280, and 250 milliseconds (ms). The mean is 245 ms.
But the addition of one more score of 5,000 ms—perhaps because the participant was not paying attention—would raise the mean to 1,196 ms.
Not only is this measure of central tendency greater than 80% of the scores in the distribution, but it also does not seem to represent the behavior of anyone in the distribution very well.
This is why researchers often prefer the median for highly skewed distributions (such as distributions of reaction times).
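The reaction-time example can be checked directly: the outlier drags the mean far to the right while the median barely moves.

```python
import statistics

times = [200, 250, 280, 250]        # reaction times in ms
print(statistics.mean(times))        # → 245
print(statistics.median(times))      # → 250.0

times.append(5000)                   # the inattentive participant
print(statistics.mean(times))        # → 1196 (pulled toward the outlier)
print(statistics.median(times))      # → 250  (barely affected)
```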
Describe and explain - Variability
Measures of Variability
The variability of a distribution is the EXTENT to which the scores vary AROUND their central tendency.
REF IMAGE - two distributions, both of which have the SAME central tendency.
The mean, median, and mode of each distribution are 10.
Notice, however, that the two distributions differ in terms of their variability.
The top = LOW variability, with all the scores relatively close to the center.
The bottom one has relatively HIGH variability, with the scores spread across a much greater range.
Describe range
RANGE = difference between the highest and lowest scores in the distribution.
Although the range is easy to compute and understand, it can be misleading when there are outliers.
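The range is a one-liner, and a single outlier (here a hypothetical data-entry error) shows how misleading it can be.

```python
scores = [15, 17, 19, 22, 24]               # hypothetical scores
print(max(scores) - min(scores))             # → 9

scores_with_outlier = scores + [95]          # one data-entry error
print(max(scores_with_outlier) - min(scores_with_outlier))  # → 80
```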
Explain the standard deviation
By far the most common measure of variability is the standard deviation.
The standard deviation of a distribution is, roughly, the average distance between the scores and the mean.
Computing the standard deviation involves a slight complication.
- Specifically, it involves finding the difference between each score and the mean
- Squaring each difference
- finding the mean of these squared differences
- finding the square root of that mean.
Look at table - mentally go through the process of calculating the standard deviation
The computations for the standard deviation are illustrated for a small set of data in Table 12.3.
—- although the differences can be negative, the squared differences are always positive—meaning that the standard deviation is always positive.
—- The mean of the squared differences is itself called the variance. Although the variance is a measure of variability, it generally plays a larger role in inferential statistics than in descriptive statistics.
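The four steps above translate line for line into code. This sketch uses the same small data set as the central-tendency examples and checks the result against `statistics.pstdev` (the population formula, which divides by N).

```python
import math
import statistics

scores = [2, 3, 3, 4, 8, 12, 14]
M = sum(scores) / len(scores)              # the mean

diffs = [x - M for x in scores]            # 1. difference between each score and the mean
sq_diffs = [d ** 2 for d in diffs]         # 2. square each difference (always positive)
variance = sum(sq_diffs) / len(sq_diffs)   # 3. mean of the squared differences (the variance)
sd = math.sqrt(variance)                   # 4. square root of that mean

assert math.isclose(sd, statistics.pstdev(scores))
print(round(sd, 2))                        # → 4.47
```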
Why use N − 1 instead of N
N or N − 1
If you have already taken a statistics course, you may have learned to divide the sum of the squared differences by N − 1 rather than by N when you compute the variance and standard deviation. Why is this?
By definition, the standard deviation is the square root of the mean of the squared differences. This implies dividing the sum of squared differences by N, as in the formula just presented. Computing the standard deviation this way is appropriate when your goal is simply to describe the variability in a sample. And learning it this way emphasizes that the variance is in fact the mean of the squared differences—and the standard deviation is the square root of this mean.
However, most calculators and software packages divide the sum of squared differences by N − 1. This is because the standard deviation of a sample tends to be a bit lower than the standard deviation of the population the sample was selected from.
Dividing the sum of squares by N − 1 corrects for this tendency and results in a better estimate of the population standard deviation.
Because researchers generally think of their data as representing a sample selected from a larger population—and because they are generally interested in drawing conclusions about the population—it makes sense to routinely apply this correction.
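Python's `statistics` module exposes both versions: `pstdev` divides by N (describing the sample as-is), while `stdev` divides by N − 1 (estimating the population standard deviation), so `stdev` is always a bit larger.

```python
import statistics

sample = [2, 3, 3, 4, 8, 12, 14]

print(statistics.pstdev(sample))  # divides by N     (descriptive use)
print(statistics.stdev(sample))   # divides by N − 1 (population estimate)
```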
Describe and explain - percentile rank
In many situations, it is useful to have a way to describe the location of an individual score within its distribution.
Percentile rank of a score = the percentage of scores in the distribution that are LOWER than that score.
EXAMPLE
For any score in the distribution, we can find its percentile rank by counting the number of scores in the distribution that are lower than that score and converting that number to a percentage of the total number of scores.
Notice, for example, that five of the students represented by the data in Table 12.1 had self-esteem scores of 23.
In this distribution, 32 of the 40 scores (80%) are lower than 23.
Thus each of these students has a percentile rank of 80. (It can also be said that they scored “at the 80th percentile.”)
Percentile ranks are often used to report the results of standardized tests of ability or achievement.
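The counting procedure is simple to code. The 40 scores below are hypothetical but arranged so that 32 of them fall below 23, matching the numbers in the text.

```python
def percentile_rank(score, scores):
    """Percentage of scores in the distribution that are LOWER than `score`."""
    lower = sum(1 for s in scores if s < score)
    return 100 * lower / len(scores)

# 40 hypothetical scores; 32 of them are lower than 23
scores = [24] * 3 + [23] * 5 + [22] * 8 + [21] * 8 + [20] * 8 + [18] * 8

print(percentile_rank(23, scores))  # → 80.0 (the 80th percentile)
```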
Another approach is the z score. The z score for a particular individual is the difference between that individual’s score and the mean of the distribution, divided by the standard deviation of the distribution:
z = (X−M)/SD
A z score indicates how far above or below the mean a raw score is, but it expresses this in terms of the standard deviation. For example, in a distribution of intelligence quotient (IQ) scores with a mean of 100 and a standard deviation of 15, an IQ score of 110 would have a z score of (110 − 100) / 15 = +0.67. In other words, a score of 110 is 0.67 standard deviations (approximately two thirds of a standard deviation) above the mean. Similarly, a raw score of 85 would have a z score of (85 − 100) / 15 = −1.00. In other words, a score of 85 is one standard deviation below the mean.
There are several reasons that z scores are important. Again, they provide a way of describing where an individual’s score is located within a distribution and are sometimes used to report the results of standardized tests. They also provide one way of defining outliers. For example, outliers are sometimes defined as scores that have z scores less than −3.00 or greater than +3.00. In other words, they are defined as scores that are more than three standard deviations from the mean. Finally, z scores play an important role in understanding and computing other statistics, as we will see shortly.
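The z-score formula and the |z| > 3 outlier rule are both easy to express directly, using the IQ example from the text.

```python
def z_score(x, mean, sd):
    """z = (X − M) / SD: distance from the mean in standard-deviation units."""
    return (x - mean) / sd

# IQ distribution: mean 100, standard deviation 15
print(round(z_score(110, 100, 15), 2))  # → 0.67 (two thirds of an SD above the mean)
print(z_score(85, 100, 15))             # → -1.0 (one SD below the mean)

def is_outlier(x, mean, sd):
    """One common definition: more than three SDs from the mean."""
    return abs(z_score(x, mean, sd)) > 3

print(is_outlier(160, 100, 15))         # → True (z = 4.0)
```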
Online Descriptive Statistics
Although many researchers use commercially available software such as SPSS and Excel to analyze their data, there are several free online analysis tools that can also be extremely useful. Many allow you to enter or upload your data and then make one click to conduct several descriptive statistical analyses. Among them are the following.
Rice Virtual Lab in Statistics
http://onlinestatbook.com/stat_analysis/index.html
VassarStats
http://faculty.vassar.edu/lowry/VassarStats.html
Bright Stat
http://www.brightstat.com
For a more complete list, see http://statpages.org/index.html
Differences between groups or conditions are usually described in terms of the mean and standard deviation of each group or condition.

For example, Thomas Ollendick and his colleagues conducted a study in which they evaluated two one-session treatments for simple phobias in children (Ollendick et al., 2009)[1]. They randomly assigned children with an intense fear (e.g., to dogs) to one of three conditions. In the exposure condition, the children actually confronted the object of their fear under the guidance of a trained therapist. In the education condition, they learned about phobias and some strategies for coping with them. In the wait-list control condition, they were waiting to receive a treatment after the study was over.

The severity of each child’s phobia was then rated on a 1-to-8 scale by a clinician who did not know which treatment the child had received. (This was one of several dependent variables.) The mean fear rating in the education condition was 4.83 with a standard deviation of 1.52, while the mean fear rating in the exposure condition was 3.47 with a standard deviation of 1.77. The mean fear rating in the control condition was 5.56 with a standard deviation of 1.21. In other words, both treatments worked, but the exposure treatment worked better than the education treatment.

As we have seen, differences between group or condition means can be presented in a bar graph like that in Figure 12.5, where the heights of the bars represent the group or condition means. We will look more closely at creating American Psychological Association (APA)-style bar graphs shortly.