Module 2, Examining Data Flashcards
Why Examine Data?
- gain an initial sense of the data
- detecting data entry errors or data coding errors
- to identify outliers
- to evaluate research methodology
- to determine whether data meet statistical criteria and assumptions
- Gain an initial sense of the data
- histogram: summarizes how many people have each score - the x axis shows the score and the y axis shows the frequency (count)
- dejections are variables
- make sense of the graphs
- we use pictures to make sense of the data
- takeaway message: we need to know the actual scale of a graph, otherwise the same data can end up telling completely different stories; it is helpful to compare graphs side by side to see what the actual values are
- Detecting data entry errors or data coding errors
- reverse coding: 5 becomes 1, 4 becomes 2, 3 stays 3, 2 becomes 4, 1 becomes 5 (the meaning is flipped) - we examine data to see if we made any mistakes in coding with these types of scales
- people respond strongly to negatively balanced items
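A minimal sketch of reverse coding, assuming a 1-5 scale as in the example above. The function name reverse_code and the sample responses are made up for illustration. The range check is an extra that also flags values that cannot exist on the scale, which is one kind of data entry error.

```python
def reverse_code(score, low=1, high=5):
    """Flip a score on a low-to-high scale (5 -> 1, 4 -> 2, 3 -> 3, ...)."""
    # A score outside the scale cannot be reverse coded and is probably
    # a data entry or coding error.
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the {low}-{high} scale")
    return low + high - score


# Hypothetical responses to one negatively worded item.
responses = [5, 4, 3, 2, 1]
print([reverse_code(r) for r in responses])  # [1, 2, 3, 4, 5]
```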
- To identify outliers
- rare, extreme scores that are outside the range of most other scores in the data set
- in histograms you can identify outliers: a bar separated from the rest of the scores by a gap suggests an outlier
- To evaluate research methodology
- very similar scores may indicate problems with the measure used
- if the scale is not sensitive enough we cannot tease out the variability (the scale may be too broad)
- similar scores equals a lack of variability
Examining Data using Tables - Frequency Distribution Tables
summarizes the number and percentage of participants for the different values of the variable
- another way to show what was in a histogram
- frequency informs us on total number of people and how many represent each category or ranking
- percent: frequency divided by the total number of people in the study - represents everyone who was part of the study, even those who did not report
- valid percent: frequency divided by the number of people who reported on that variable - changes the value you are dividing by (you would essentially not include “missing system” in your calculations)
- cumulative percent does not make a lot of sense if it is nominal or categorical - makes more sense if there is some sort of ranking (ordinal)
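The difference between percent and valid percent can be sketched as follows. The responses are hypothetical, None stands in for a participant who did not report (the “missing system” row), and percent_columns is an illustrative name, not something from the course software.

```python
from collections import Counter

# Hypothetical responses; None marks a participant who did not report.
responses = ["yes", "no", "yes", None, "yes", "no", None, "yes", "no", "yes"]


def percent_columns(values):
    """Return frequency, percent, and valid percent for each reported value."""
    total = len(values)                        # everyone in the study
    reported = [v for v in values if v is not None]
    valid_total = len(reported)                # only those who reported
    table = {}
    for value, freq in Counter(reported).items():
        table[value] = {
            "frequency": freq,
            "percent": 100 * freq / total,
            "valid_percent": 100 * freq / valid_total,
        }
    return table


table = percent_columns(responses)
for value, row in table.items():
    print(value, row)
```

Here 5 of 10 participants said "yes", so the percent is 50.0, but only 8 reported, so the valid percent is 62.5.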
Creating Frequency Distribution Tables
- identify all possible values for the variable
- determine the frequency of participants who report each value
- calculate the percentage for each value
Percentage Formula
% = (frequency / total number of scores) x 100
n = total number (look at frequency column for total number of scores)
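The three steps and the percentage formula can be sketched in a few lines. The scores are hypothetical, and frequency_table and its possible_values parameter are illustrative names. Passing the possible values explicitly keeps values that nobody reported in the table with a frequency of 0.

```python
from collections import Counter

# Hypothetical scores on a 1-5 scale.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]


def frequency_table(scores, possible_values):
    """Build rows of (value, frequency, percentage)."""
    n = len(scores)            # total number of scores
    counts = Counter(scores)   # frequency of each reported value
    rows = []
    for value in possible_values:
        freq = counts[value]   # Counter gives 0 for values never reported
        rows.append((value, freq, 100 * freq / n))
    return rows


rows = frequency_table(scores, possible_values=range(1, 6))
for value, freq, pct in rows:
    print(value, freq, pct)
```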
Looking for Data Problems
- frequency tables can identify “problem data”
◦ incorrect entry: e.g. BMI - 333
◦ restricted range (not much variability)
◦ highly skewed data
◦ missing data (want to figure out why there is missing data)
** if you do not know why the data are problematic, the best option is to remove them because you do not have the context (context helps make a decision)
◦ was the data not put in correctly or is it an outlier? (we do not know in this case)
Cumulative Percent
cumulative percent: take valid percent and add the next one (this matters now that it is ordinal)
- if we say we want the percentage of those who smoked 5 or fewer we can take cumulative percent and find that (this would include 3 categories in this case)
- beginning with the first valid percent and adding on from there
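A small sketch of the running total, using made-up valid percents for four ordered categories (the real table from the course is not reproduced here). The name cumulative_percents is illustrative.

```python
from itertools import accumulate

# Hypothetical (category, valid percent) rows, ordered from lowest to highest.
rows = [("0", 40.0), ("1-2", 25.0), ("3-5", 20.0), ("6+", 15.0)]


def cumulative_percents(rows):
    """Add a cumulative percent column: each row adds its valid percent
    to the total of all rows before it."""
    running = accumulate(pct for _, pct in rows)
    return [(label, pct, cum) for (label, pct), cum in zip(rows, running)]


result = cumulative_percents(rows)
for label, pct, cum in result:
    print(label, pct, cum)
```

With these numbers, the cumulative percent on the "3-5" row (85.0) is the percentage who smoked 5 or fewer, since it adds up the first three categories.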
Grouped Frequency Distribution Table
a table that groups interval or ratio values of a variable into a smaller number of intervals (more manageable to look at visually)
- frequencies and percentages are calculated within the intervals
- can often be turned into a histogram: a grouped frequency distribution table helps us create a grouped histogram in which each bar represents a range of values
Grouped frequency distribution table: Real Lower Limit & Real Upper Limit
real lower limit: smallest value of a variable that would be grouped in a particular interval
real upper limit: largest value of a variable that would be grouped into a particular interval
ex. 10-12: 9.5 (RLL) & 12.4 (RUL)
What is the RLL & RUL for interval 16-18?
RLL: 15.5
RUL: 18.4
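A sketch of the calculation, following the convention used in these notes for whole-number intervals with values measured to one decimal place: subtract 0.5 from the lower bound and add 0.4 to the upper bound. The function name real_limits is illustrative, and other sources may define the upper limit differently.

```python
def real_limits(lower, upper):
    """Real lower and upper limits of a whole-number interval,
    using the convention from these notes (RLL = lower - 0.5,
    RUL = upper + 0.4)."""
    rll = lower - 0.5
    # round() avoids floating-point noise in the one-decimal result
    rul = round(upper + 0.4, 1)
    return rll, rul


print(real_limits(10, 12))  # (9.5, 12.4)
print(real_limits(16, 18))  # (15.5, 18.4)
```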
Creating Grouped Frequency Distribution Tables (rules)
- variables are grouped into approximately 10 intervals (8-12)
- the number of intervals should accurately represent the data
- intervals should be of equal size
- intervals should not overlap
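A minimal sketch of building a grouped table that follows the equal-size and no-overlap rules, assuming whole-number scores. The data are hypothetical and only 4 intervals are used to keep the output short; a real table would aim for roughly 10. The name grouped_frequency_table and its parameters are illustrative.

```python
# Hypothetical whole-number scores.
values = [10, 11, 13, 14, 14, 16, 17, 18, 19, 21]


def grouped_frequency_table(values, start, width, n_intervals):
    """Group whole-number values into equal-width, non-overlapping intervals
    and return rows of (interval label, frequency, percentage)."""
    n = len(values)
    rows = []
    for i in range(n_intervals):
        low = start + i * width
        high = low + width - 1   # next interval starts at high + 1, so no overlap
        freq = sum(low <= v <= high for v in values)
        rows.append((f"{low}-{high}", freq, 100 * freq / n))
    return rows


grouped = grouped_frequency_table(values, start=10, width=3, n_intervals=4)
for label, freq, pct in grouped:
    print(label, freq, pct)
```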
Bar Charts
nominal and ordinal data
- use bars to represent the frequency or percentage of values (is very similar to a histogram)
- do not care about the difference between bar chart and histogram
- looks a lot like histogram other than the fact that the bars are not touching
- the bars often represent averages (means) rather than frequencies