Module 2, Examining Data Flashcards by Rose Dhanoa

Why Examine Data?

to gain an initial sense of the data
to detect data entry errors or data coding errors
to identify outliers
to evaluate research methodology
to determine whether data meet statistical criteria and assumptions

Gain an initial sense of the data

histogram: helps us summarize how many people actually have that score - along the x axis we have score and along y axis we have frequency or count
dejections are variables
make sense of the graphs
we use pictures to make sense of the data
takeaway message: we need to know the actual scale of the axes, otherwise the same data can tell completely different stories; it also helps to compare graphs side by side to see what the actual values are

Detecting data entry errors or data coding errors

reverse coding: 5 becomes 1, 4 becomes 2, 3 stays 3, 2 becomes 4, 1 becomes 5 (the meaning is flipped) - we examine the data to see if we made any mistakes in coding with these types of scales
people respond strongly to negatively balanced items
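
As a minimal sketch of reverse coding, assuming a 1-to-5 scale: the reversed score is (scale minimum + scale maximum) minus the original score. The function name `reverse_code` and the sample responses are made up for illustration. A score outside the scale raises an error, which is one way a data entry error might surface.

```python
def reverse_code(score, scale_min=1, scale_max=5):
    """Return the reverse-coded score on a scale_min..scale_max scale."""
    if not scale_min <= score <= scale_max:
        # A value outside the scale suggests a data entry or coding error
        raise ValueError(f"score {score} is outside {scale_min}-{scale_max}")
    return scale_min + scale_max - score

responses = [5, 4, 3, 2, 1]  # made-up responses to a reverse-scored item
recoded = [reverse_code(r) for r in responses]
print(recoded)  # [1, 2, 3, 4, 5]
```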

To identify outliers

rare, extreme scores that are outside the range of most other scores in the data set
in histograms you can identify outliers (where there are gaps from other ones, there will be an outlier)
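
The idea of spotting an outlier by the gap in a histogram can be sketched with a simple text histogram. The scores below are made up, and treating every empty value between the minimum and maximum as part of a "gap" is an assumption of this sketch, not a formal outlier rule.

```python
from collections import Counter

scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 12]  # made-up scores; 12 sits far from the rest
counts = Counter(scores)

# Print one row per possible score so empty rows (gaps) are visible
for value in range(min(scores), max(scores) + 1):
    print(f"{value:>2} | {'#' * counts[value]}")

# Values with zero frequency between the smallest and largest score
gap_values = [v for v in range(min(scores), max(scores) + 1) if counts[v] == 0]
print(gap_values)  # [7, 8, 9, 10, 11]
```

The run of empty rows from 7 to 11 separates the score of 12 from the other scores, which is the visual cue described above.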

To evaluate research methodology

very similar scores may indicate problems with the measure used
if the scale is not sensitive enough, we cannot tease out the variability; the scale may be too broad
similar scores equal a lack of variability

Examining Data using Tables - Frequency Distribution Tables

summarizes the number and percentage of participants for the different values of the variable
- another way to show what was in a histogram
- frequency informs us on total number of people and how many represent each category or ranking
- percent: based on the total number of people in the study (frequency divided by the total number of people) - represents all who were part of the study, even those who did not report
- valid percent: based on the total number of people who reported on that variable - this changes the value you divide by (you would essentially not include “missing system” in your calculations)
- cumulative percent does not make a lot of sense if it is nominal or categorical - makes more sense if there is some sort of ranking (ordinal)
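
The difference between percent and valid percent can be sketched as follows. The responses are made up, and using `None` to mark a participant who did not report is an assumption of this example.

```python
from collections import Counter

# Made-up responses; None marks a participant who did not report
responses = ["yes", "no", "yes", None, "yes", "no", None, "yes", "no", "yes"]

total_n = len(responses)                       # everyone in the study
valid = [r for r in responses if r is not None]
valid_n = len(valid)                           # only those who reported

table = {}
for value, freq in Counter(valid).items():
    percent = 100 * freq / total_n             # divides by all participants
    valid_percent = 100 * freq / valid_n       # divides by reporters only
    table[value] = (freq, percent, valid_percent)

print(table["yes"])  # (5, 50.0, 62.5)
print(table["no"])   # (3, 30.0, 37.5)
```

Because two participants are missing, the percent column sums to 80 while the valid percent column sums to 100.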

Creating Frequency Distribution Tables

identify all possible values for the variable
determine the frequency of participants who report each value
calculate the percentage for each value
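
The three steps above can be sketched in a few lines. The scores and the assumed 1-to-5 range of possible values are made up for illustration; listing every possible value means a value nobody reported (here, 4) still appears with a frequency of 0.

```python
from collections import Counter

scores = [1, 2, 2, 3, 3, 3, 5, 5, 2, 3]  # made-up scores
possible_values = range(1, 6)             # step 1: all possible values (assumed 1-5)

counts = Counter(scores)                  # step 2: frequency of each value
n = len(scores)

# step 3: percentage = frequency / total number of scores x 100
table = [(v, counts[v], 100 * counts[v] / n) for v in possible_values]

for value, freq, pct in table:
    print(f"{value}  {freq:>2}  {pct:5.1f}%")
```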

Percentage Formula

% = (frequency / total number of scores) x 100
n = total number (look at frequency column for total number of scores)

Looking for Data Problems

frequency tables can identify “problem data”
◦ incorrect entry: e.g. a BMI of 333
◦ restricted range (not much variability)
◦ highly skewed data
◦ missing data (want to figure out why there is missing data)
** if you do not know why the data are problematic, the best option is to remove them, because you do not have the context (context helps you make a decision)
◦ was the data not put in correctly, or is it an outlier? (we do not know in this case)

Cumulative Percent

cumulative percent: take valid percent and add the next one (this matters now that it is ordinal)
- if we want the percentage of those who smoked 5 or fewer, we can read it from the cumulative percent (this would include 3 categories in this case)
- beginning with the first valid percent and adding on from there
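
A running total of the valid percents gives the cumulative percent. In this sketch the category labels and the valid percent values are hypothetical, chosen so that "5 or fewer" spans the first three categories.

```python
from itertools import accumulate

# Hypothetical ordinal categories (cigarettes smoked) and their valid percents
categories = ["0", "1-2", "3-5", "6-10", "11+"]
valid_percent = [40.0, 25.0, 15.0, 12.5, 7.5]

# Start with the first valid percent and keep adding the next one
cumulative = list(accumulate(valid_percent))

for label, vp, cp in zip(categories, valid_percent, cumulative):
    print(f"{label:>5}  {vp:5.1f}  {cp:5.1f}")

# Percentage who smoked 5 or fewer: cumulative percent at the "3-5" row
five_or_fewer = cumulative[categories.index("3-5")]
print(five_or_fewer)  # 80.0
```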

Group Frequency Distribution Table

a table that groups interval or ratio values of a variable into a smaller number of intervals (more manageable to look at visually)
- frequencies and percentages are calculated within the intervals
- can often change this into a histogram - we do a group frequency distribution table as it often helps us create a grouped histogram where the bars themselves will be representative of the range of values

Group frequency distribution table: Real Lower Limit & Real Upper Limit

real lower limit: smallest value of a variable that would be grouped in a particular interval

real upper limit: largest value of a variable that would be grouped into a particular interval

ex. 10-12: 9.5 (RLL) & 12.4 (RUL)
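
The example above can be reproduced with a small helper. This sketch follows the convention used in these notes (lower limit minus 0.5, upper limit plus 0.4), which assumes scores are reported to one decimal place; some textbooks instead give the real upper limit of 10-12 as 12.5. The function name `real_limits` is made up.

```python
def real_limits(lower, upper):
    """Real lower and upper limits of an interval, per the convention in these notes."""
    # Assumption: one decimal place, so 9.5 is the smallest value that rounds
    # up into 10 and 12.4 is the largest value that still rounds down to 12
    return round(lower - 0.5, 1), round(upper + 0.4, 1)

print(real_limits(10, 12))  # (9.5, 12.4)
print(real_limits(16, 18))  # (15.5, 18.4)
```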

What is the RLL & RUL for interval 16-18?

RLL: 15.5
RUL: 18.4

Creating Grouped Frequency Distribution Tables (rules)

variables are grouped into approximately 10 intervals (8-12)
the number of intervals should accurately represent the data
intervals should be of equal size
intervals should not overlap
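
A grouped table that respects the equal-size and no-overlap rules could be built roughly like this. The helper `grouped_frequency`, the whole-number scores, and the interval settings are all assumptions for illustration; only four intervals are used to keep the output short, whereas the rule above suggests about 10.

```python
from collections import Counter

def grouped_frequency(scores, start, width, n_intervals):
    """Group whole-number scores into equal-width, non-overlapping intervals."""
    end = start + width * n_intervals - 1
    if any(s < start or s > end for s in scores):
        raise ValueError("a score falls outside the intervals")
    # Integer division assigns each score to exactly one interval
    counts = Counter((s - start) // width for s in scores)
    n = len(scores)
    rows = []
    for i in range(n_intervals):
        low = start + i * width
        high = low + width - 1
        rows.append((f"{low}-{high}", counts[i], 100 * counts[i] / n))
    return rows

scores = [10, 12, 13, 14, 15, 16, 17, 17, 18, 20]  # made-up scores
rows = grouped_frequency(scores, start=10, width=3, n_intervals=4)
for interval, freq, pct in rows:
    print(f"{interval:>6}  {freq:>2}  {pct:5.1f}%")
```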

Bar Charts

nominal and ordinal data
- use bars to represent the frequency or percentage of values (is very similar to a histogram)
- do not care about the difference between bar chart and histogram
- looks a lot like histogram other than the fact that the bars are not touching
- oftentimes bar charts represent averages, so the bars show means as opposed to frequencies

Pie Charts

nominal or ordinal (only when you have categories)
- represent the percentage of the sample corresponding to the value
- gives you an idea of proportions of a sample

Histograms

interval and ratio data
- use bars to represent the frequency of values
- bars touch - indicates an interval variable
- score along x axis

Frequency Polygons

interval and ratio data
- are line graphs that use data points to represent frequencies
- histograms give more of a smooth shape of the actual distribution as compared to this
- you can smooth it out by creating a line overtop

Inappropriate Conclusions from Figures

same data, but different y-axes!
the scaling of the y axis makes a big impact because it can be misleading causing you to draw wrong conclusion

Modality

values with the highest frequency
with modality you need to see peaks and valleys
unimodal - one value that occurs with the highest frequency (histogram - one highest bar)
bimodal - two values that occur with the highest frequency
multimodal would include three or more peaks
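
As a rough sketch, the values tied for the highest frequency can be found with `statistics.multimode` (Python 3.8+). The helper `describe_modality` and the data are made up. This only counts exact ties for the top frequency, so it is a simplification: judging peaks and valleys in a real histogram still takes a visual check.

```python
from statistics import multimode

def describe_modality(scores):
    """Return the most frequent value(s) and a label based on how many there are."""
    modes = multimode(scores)  # every value tied for the highest frequency
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

print(describe_modality([1, 2, 2, 3]))        # ([2], 'unimodal')
print(describe_modality([1, 2, 2, 3, 3, 4]))  # ([2, 3], 'bimodal')
```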

Symmetry

symmetric distributions have frequencies that change in a similar manner moving away from the mode
skewness (there are degrees of) - there needs to be nuance (slightly positively skewed for example)
symmetry: refers to how values of a variable change in relation to the most common or most frequent occurring values

Symmetry: Asymmetry

asymmetric distributions have outliers that skew the shape of the distribution
frequencies change in a different manner moving away in both directions from the most frequently occurring value
this means that the most frequent values are located at one end rather than in the middle
oftentimes it is not even outliers that create a skew in the distribution, although outliers certainly affect the skew
the portions at the ends of a distribution, where the values have the lowest frequencies, are called the tails of the distribution (a long tail will contain the outliers that skew the shape)
based on the location of the long tail we can determine if the data is positively or negatively skewed

Positively Skewed

data is said to be positively skewed when the long tail is on the right side of the distribution, with the high frequency values clustered on the left

Negatively Skewed

data is said to be negatively skewed when the long tail is on the left side of the distribution, with the high frequency values clustered on the right

Quantifying Skewness

skewness statistic
- positive statistic = positive skew
- negative statistic = negative skew
- 0 = perfectly symmetric distribution (as in a normal distribution)
- the further the skewness statistic is from 0, the more skewed the distribution
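
To see the sign of the statistic in action, here is a sketch that uses the simple moment-based formula (third central moment divided by the second central moment raised to the power 1.5). That choice of formula is an assumption; statistical packages may apply a small-sample correction, so their values can differ slightly, though the sign should agree. The data are made up.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

long_right_tail = [1, 1, 2, 2, 2, 3, 3, 4, 9]  # made-up, tail on the right
long_left_tail = [9, 9, 8, 8, 8, 7, 7, 6, 1]   # mirror image, tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(long_right_tail) > 0)  # True: positively skewed
print(skewness(long_left_tail) < 0)   # True: negatively skewed
print(skewness(symmetric))            # 0.0
```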

Variability

the amount of difference among the scores in the distribution of a variable (flatter distributions have more variability in their data)
- are the scores different from or similar to one another?
- the kurtosis statistic helps us with variability
- normal, peaked or flat: mesokurtic, leptokurtic or platykurtic

Mesokurtic

neither peaked nor flat (medium, middle); the kurtosis statistic would be zero
- has more variability than peaked but less than flat

Leptokurtic

more peaked relative to a normal distribution (memory aid: two kangaroos back to back are pretty peaked, and kangaroos leap)
- very little variability (for example, all athletes achieved similar scores on the beep test)

Platykurtic

flatter distribution relative to a normal distribution (memory aid: a platypus is pretty flat)
- the frequency of data is spread across the values of the variable

Quantifying Kurtosis

kurtosis statistic
- positive statistic = indicates leptokurtic distribution
- negative statistic = indicates platykurtic distribution
- 0 = perfectly normal distribution
- the further the kurtosis statistic is from 0, the more likely the distribution is to be not normal
- the degree is really important: a distribution may not be as leptokurtic or platykurtic as we think
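
The sign convention can be illustrated with a sketch of excess kurtosis (fourth central moment divided by the squared second central moment, minus 3, so that a normal distribution scores 0). The uncorrected moment formula is an assumption here; software packages may use an adjusted version and report somewhat different values. The data are made up.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

peaked = [1, 5, 5, 5, 5, 5, 5, 5, 5, 9]  # most scores pile up on one value
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # scores spread evenly across values

print(excess_kurtosis(peaked) > 0)  # True: leptokurtic
print(excess_kurtosis(flat) < 0)    # True: platykurtic
```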

The Normal Curve

- unimodal: one value that occurs with the highest frequency
- symmetrical: right side and left side fall away in a similar manner
- neither peaked nor flat (mesokurtic)