Chapter 2: exploratory data analysis Flashcards
contrast diagrams and tables
- both tables and diagrams are easy ways to present a lot of information
- tables are better at containing a lot of detail whereas diagrams are better at showing a bigger picture
what must we be wary of in diagrams
- show the data clearly and fairly
- use simplicity in design
- keep the decoding simple
how can we differentiate the type of variable
- ordered?
- scaled? (levels)
- rounding error?
- meaningful zero?, think temperatures or height
A. categorical
B. ordinal
C. discrete numerical
D. continuous numerical
describe categorical data
data which can fall into one of a small number of categories, which are unordered
eg. sex, smoking status
describe ordinal data
data which falls into one of a small number of categories but the categories are ordered in some way
note: can be treated as categorical, but loses info
describe discrete numerical data
- can only be integers
- count data
note: age is continuous
how should data handling occur
- check for common sense
- sig figs, general rule is three
- transformations, eg. log(x) or logit(x)
what are the three main description techniques
- frequency distributions
- cumulative frequency distributions and quantiles
- moment stats (mean, variance…)
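A minimal Python sketch of the moment statistics named above, using only the standard library. The sample values are made up for illustration.

```python
import statistics

# Hypothetical sample, made up for illustration
sample = [2, 4, 4, 5, 7, 8]

mean = statistics.mean(sample)          # first moment: location
variance = statistics.variance(sample)  # sample variance (divides by n - 1)
sd = statistics.stdev(sample)           # square root of the sample variance

print(mean, variance, sd)
```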
which descriptive techniques can be used for categorical data
frequency distribution
which descriptive techniques can be used for ordinal data
frequency distribution and cumulative frequency distribution and quantiles
which descriptive techniques can be used for numerical data
- frequency
- cumulative frequency distributions and quantiles
- moment stats
what is the relationship between sample quantiles and order statistics
the median and quartiles are all examples of sample quantiles, and sample quantiles are calculated from the order statistics (the data sorted into increasing order)
what must be noted between using R and calculating sample quantiles by hand
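values from R can differ slightly from hand calculations, because R's quantile() interpolates between order statistics using its own default rule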
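The difference can be seen in a small Python sketch. It assumes the hand rule places the p-th quantile at position (n + 1)p in the sorted data, which may not be the rule used in this course. R's default rule (type 7) uses position (n - 1)p + 1. The function name and the sample are made up for illustration.

```python
def quantile_from_order_stats(data, p, rule="n+1"):
    """Interpolate between order statistics to get the p-th sample quantile.

    rule="n+1": position (n + 1) * p, an assumed hand-calculation rule.
    rule="n-1": position (n - 1) * p + 1, the default in R's quantile().
    """
    xs = sorted(data)                 # order statistics x(1) <= ... <= x(n)
    n = len(xs)
    h = (n + 1) * p if rule == "n+1" else (n - 1) * p + 1
    h = min(max(h, 1), n)             # keep the position inside 1..n
    lower = int(h)                    # index of the order statistic just below
    frac = h - lower                  # how far to move towards the next one
    if lower >= n:
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])


sample = [1, 2, 3, 4, 5, 6, 7, 8]     # hypothetical data
q1_hand = quantile_from_order_stats(sample, 0.25, rule="n+1")
q1_r = quantile_from_order_stats(sample, 0.25, rule="n-1")
print(q1_hand, q1_r)
```

Both rules agree on the median of this sample but give different lower quartiles.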
describe dot plots including what it is and when not to use it
chart showing each point as a dot, with identical points separated vertically, thus, giving a quick idea of distribution
when not to use?
- continuous data
- too much data at which point you should use a bar graph
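A rough text-only sketch of a dot plot in Python, where identical values are stacked vertically. It assumes small, single-digit integer data so the columns line up. The function name and data are made up for illustration.

```python
from collections import Counter


def dot_plot(values):
    """Return a text dot plot: one 'o' per observation, stacked per value."""
    counts = Counter(values)
    xs = range(min(values), max(values) + 1)
    rows = []
    # Build the plot from the top row down to the row just above the axis
    for level in range(max(counts.values()), 0, -1):
        row = " ".join("o" if counts[x] >= level else " " for x in xs)
        rows.append(row.rstrip())
    rows.append(" ".join(str(x) for x in xs))   # x axis labels
    return "\n".join(rows)


print(dot_plot([1, 2, 2, 3, 3, 3]))
```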
describe bar charts and bar graphs
- vertical bars represent the observed frequency of certain values or categories
- suited for categorical and ordinal and discrete numerical data
- bars should be of equal width and separated from each other so as not to imply continuity
y axis on bar graph:
- observed frequency
- relative frequency
- % frequency
note: the lines on the graph are directly above the x axis value
why should we use relative frequency?
- the values on the y axis sum to one, so we can read off proportions directly
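The three y axis choices can be computed together. This is a minimal Python sketch with made-up blood type data. The function name is hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Map each category to (frequency, relative frequency, % frequency)."""
    counts = Counter(values)
    n = len(values)
    return {k: (c, c / n, 100 * c / n) for k, c in sorted(counts.items())}


blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O"]   # hypothetical data
table = frequency_table(blood_types)
for category, (freq, rel, pct) in table.items():
    print(category, freq, rel, pct)

# The relative frequencies add up to one
total_rel = sum(rel for _, rel, _ in table.values())
```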
describe pie charts
- often used to depict categorical data, however, it relies on the ability to assess area/angle and volume and humans are bad at this
- better alternative is bar chart
describe histograms
- bar graph for continuous data
- if all intervals are of the same width, the heights of the bars can be frequencies or rel. frequencies
- if the intervals are not of the same width, we want the area to be proportional to the relative frequencies
height: rel. frequency/interval width
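A minimal Python sketch of the height rule for unequal intervals. The data and bin edges are made up, and the choice to include the right endpoint only in the last bin is an assumption.

```python
def histogram_heights(data, edges):
    """Bar heights such that each bar's area equals its relative frequency."""
    n = len(data)
    heights = []
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        last = i == len(edges) - 2
        # Bins are [lo, hi), except the last bin which also includes hi
        count = sum(1 for x in data if lo <= x < hi or (last and x == hi))
        heights.append((count / n) / (hi - lo))
    return heights


data = [1, 2, 3, 4, 6, 7, 12, 18]    # hypothetical observations
edges = [0, 5, 10, 20]               # unequal interval widths: 5, 5, 10
heights = histogram_heights(data, edges)

# Total area of the bars is one
area = sum(h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:]))
print(heights, area)
```

The last interval holds the same share of the data as the middle one, but is twice as wide, so its bar is half as tall.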
why do we group into bins?
- each observation of a continuous variable is always distinct, so we group the observations into bins which represent intervals of values
what is cumulative frequency?
relative frequency of observations less than or equal to the number x. this is the sample analogue of the cumulative distribution function, ie. CDF
graph:
- empirical cdf occurs in a step function
- however, with lots of data the steps will get closer and closer together and LOOK like a continuous function, but it is still discrete
sample quantiles can be found easily from cumulative frequency
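A minimal Python sketch of the empirical CDF and of reading a quantile from it. It assumes the quantile is taken as the smallest observation whose cumulative relative frequency reaches p, with no interpolation, which may differ from the rule used in the course. The names and data are made up.

```python
def ecdf(data, x):
    """Relative frequency of observations less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)


def quantile_from_ecdf(data, p):
    """Smallest observation whose cumulative relative frequency is >= p."""
    for v in sorted(data):
        if ecdf(data, v) >= p:
            return v


sample = [3, 1, 4, 1, 5]             # hypothetical data
print(ecdf(sample, 1))               # steps up at each observed value
print(ecdf(sample, 3.5))             # flat between observed values
print(quantile_from_ecdf(sample, 0.5))
```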
what is a sample quantile? what is its relationship to cumulative frequency?
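the sample quantile for a proportion p is the value with a proportion p of the observations at or below it. it is the x value at which the cumulative relative frequency reaches p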
describe box plots
graphical representation of five number summary
we have it to give a simple idea of the location, spread, skewness/symmetry
it does not tell us the mean, SD, sample size and frequency like a histogram does
POSITIVE SKEW
- the tail is longer on the right side
NEGATIVE SKEW
- the tail is longer on the left side
SYMMETRICAL
- the tails are the same length
how can outliers in box plots be dealt with
- limit the tails to range specified on DB
- IQR=Q3-Q1
- data outside the inner fences are indicated by individual points, ie. by a hashtag with the value, or an asterisk (no division sign)
note: outliers are still shown because they may sometimes give an explanation, eg. a person with a rare disease
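A minimal Python sketch of the fence calculation. It assumes the common convention that the inner fences sit 1.5 x IQR beyond the quartiles. The multiplier is not stated in these notes, so treat k = 1.5 as an assumption. The quartile rule is the standard library's "inclusive" method, and the data are made up.

```python
import statistics


def box_plot_summary(data, k=1.5):
    """Quartiles, IQR, inner fences, whisker ends and outliers for a box plot."""
    xs = sorted(data)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr       # k = 1.5 is an assumed convention
    upper_fence = q3 + k * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),   # tails stop at the last point inside the fences
        "outliers": outliers,                  # plotted as individual points
    }


summary = box_plot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])   # hypothetical data
print(summary)
```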
when and why do we use transformations?
when there is a positive skew in data
- we use a log transformation ie. log(x) where x is the initial data
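A minimal Python sketch of the effect of a log transformation on positively skewed data. The skewness measure used here is the moment coefficient (third central moment divided by the 1.5 power of the second), which is an assumption about how skew is quantified. The data are made up.

```python
import math


def skewness(xs):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 4, 8, 16, 32, 64]        # hypothetical data with a long right tail
logged = [math.log(x) for x in raw]   # log(x) where x is the initial data

print(skewness(raw))      # clearly positive
print(skewness(logged))   # essentially zero: the logged values are evenly spaced
```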
how is skewness represented in a graphical perspective
BEWARE! possible point of confusion,
it still follows that positive has a longer right tail, but the peak is higher on the left
negative skew: peak is on the right, but the tail is longer on the left
describe bivariate data
data in which two variables are recorded for each individual
- the variables may be of any type
eg. height and weight, time and alcohol content, gender and blood type, treatment type and level
in the case of bivariate data with two categorical variables,
what to do?
- combine the variables into one variable
- use bar chart for super variable
see notes for visual
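A minimal Python sketch of combining two categorical variables into one super variable and counting its levels, ready for a bar chart. The variables and values are made up for illustration.

```python
from collections import Counter

# Hypothetical paired observations, one entry per individual
sex = ["F", "M", "F", "M", "F"]
smoker = ["yes", "no", "no", "no", "yes"]

# Combine the two variables into a single super variable
combined = [f"{a}/{b}" for a, b in zip(sex, smoker)]

# Frequency of each combined category: the bar heights
counts = Counter(combined)
print(counts)
```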
in the case of bivariate data with one numerical and one categorical variable,
what to do?
- compare the distribution of the numerical variable for each level of the categorical variable
- use parallel box plots or dot plots
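A minimal Python sketch of the comparison behind parallel box plots: split the numerical variable by level of the categorical variable and summarise each group. Only the minimum, median and maximum are computed here to keep it short. The names and data are made up.

```python
import statistics
from collections import defaultdict


def summary_by_group(categories, values):
    """Group the numerical values by category and summarise each group."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    return {
        category: (min(vals), statistics.median(vals), max(vals))
        for category, vals in groups.items()
    }


treatment = ["A", "B", "A", "B", "A", "B"]   # hypothetical categorical variable
level = [5, 9, 7, 8, 6, 10]                  # hypothetical numerical variable
by_group = summary_by_group(treatment, level)
print(by_group)
```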
in the case of bivariate data with two numerical variables,
what to do?
- use a scatter plot or scatter diagram
- each coordinate specifies a point in the Cartesian plane corresponding to an individual
- a line plot may be drawn in the SPECIAL CASE that the x variable has a natural ordering. this is most often time
describe the types of relationships scatter plots allow us to see. what can we do to tell the strength of this relationship
negative relationship: large x and small y, vice versa
positive relationship: large x and large y, vice versa
we measure the strength of the relationship with the correlation coefficient, r, which always satisfies -1 ≤ r ≤ 1
- the larger the magnitude, the stronger the relationship
- the sign reflects the relationship
- if r is plus/minus 1, the two variables are exactly linearly related, ie. y = a + bx (b > 0 when r = 1, b < 0 when r = -1)
- scaling of these graphs must be all the same for consistent comparison
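A minimal Python sketch of the correlation coefficient, written out from the usual sums of squares and cross products rather than taken from a library. The data are made up to show the two extreme cases.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r between paired samples xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]                    # hypothetical data
y_up = [2 + 3 * v for v in x]          # exact line with positive slope
y_down = [10 - 2 * v for v in x]       # exact line with negative slope

r_up = correlation(x, y_up)
r_down = correlation(x, y_down)
print(r_up, r_down)
```

The sign of r matches the direction of the line, and its magnitude reaches 1 only because the points lie exactly on a line.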