Chapter 2: exploratory data analysis Flashcards
contrast diagrams and tables
- both tables and diagrams are easy ways to present a lot of information
- tables are better at containing a lot of detail whereas diagrams are better at showing a bigger picture
what must we keep in mind when drawing diagrams
- show the data clearly and fairly
- use simplicity in design
- keep the decoding simple
how can we differentiate the type of variable
- ordered?
- scaled? (levels)
- rounding error?
- meaningful zero? (e.g. height has a true zero, temperature in °C does not)
A. categorical
B. ordinal
C. discrete numerical
D. continuous numerical
describe categorical data
data which can fall into one of a small number of unordered categories
e.g. sex, smoking status
describe ordinal data
data which falls into one of a small number of categories but the categories are ordered in some way
note: can be treated as categorical, but doing so loses the ordering information
describe discrete numerical data
- can only be integers
- count data
note: age is continuous, even though it is usually recorded in whole years
how should data handling occur
- check for common sense
- significant figures: the general rule is three
- transformations, e.g. log(x) or logit(x)
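example (a minimal Python sketch, not from the course notes): the two transformations named above applied to made-up values. The helper name logit and the sample numbers are assumptions for illustration only.

```python
import math

def logit(p):
    """Log-odds of a proportion p; only defined for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("logit needs 0 < p < 1")
    return math.log(p / (1 - p))

# Made-up right-skewed values: taking logs spaces them evenly.
skewed = [1, 10, 100, 1000]
logged = [math.log10(x) for x in skewed]

# Made-up proportions: logit maps (0, 1) onto the whole real line,
# with 0.5 going to 0 and the scale symmetric about it.
proportions = [0.1, 0.5, 0.9]
logits = [logit(p) for p in proportions]

print(logged)
print(logits)
```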
what are the three main description techniques
- frequency distributions
- cumulative frequency distributions and quantiles
- moment stats (mean, variance…)
which descriptive techniques can be used for categorical data
frequency distribution
which descriptive techniques can be used for ordinal data
frequency distribution and cumulative frequency distribution and quantiles
which descriptive techniques can be used for numerical data
- frequency distributions
- cumulative frequency distributions and quantiles
- moment stats
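example (a hedged Python sketch using only the standard library; the sample is made up): all three description techniques applied to one small discrete numerical sample.

```python
from collections import Counter
from itertools import accumulate
import statistics

# Made-up discrete numerical sample (e.g. counts per subject).
sample = [0, 1, 1, 2, 2, 2, 3, 4]

# 1. Frequency distribution: how often each value occurs.
freq = Counter(sample)
values = sorted(freq)
frequencies = [freq[v] for v in values]

# 2. Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))

# 3. Moment statistics: mean and sample variance (divisor n - 1).
mean = statistics.mean(sample)
variance = statistics.variance(sample)

for v, f, c in zip(values, frequencies, cumulative):
    print(v, f, c)
print(mean, variance)
```

note: for categorical data only step 1 makes sense, and for ordinal data steps 1 and 2, which matches the cards above.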
what is the relationship between sample quantiles and order statistics
the median and quartiles are all examples of sample quantiles, and they are calculated from the order statistics (the sample values sorted into ascending order)
what must be noted between using R and calculating sample quantiles by hand
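R's values can be slightly different, because R's quantile() interpolates between order statistics by default and hand rules often use a different convention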
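example (a Python sketch, not R, and not necessarily the hand rule used in the course): two common quantile conventions applied to the same made-up sample. As far as I know, the "inclusive" method of statistics.quantiles corresponds to R's default (type 7) and "exclusive" to the (n + 1)p rule often used by hand, but treat that mapping as an assumption to check.

```python
import statistics

# Made-up sample of 8 observations, already in ascending order.
sample = [2, 4, 5, 7, 8, 10, 12, 15]

# Position of quantile p taken as (n - 1) * p + 1 in the sorted sample.
inclusive = statistics.quantiles(sample, n=4, method="inclusive")

# Position of quantile p taken as (n + 1) * p in the sorted sample.
exclusive = statistics.quantiles(sample, n=4, method="exclusive")

print("inclusive:", inclusive)
print("exclusive:", exclusive)
```

note: in this sample the median agrees under both conventions, but the lower and upper quartiles do not, which is the kind of small discrepancy the card describes.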
describe dot plots including what it is and when not to use it
a chart showing each data point as a dot, with identical points stacked vertically, giving a quick idea of the distribution
when not to use?
- continuous data
- too much data, at which point you should use a bar graph instead
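example (a rough Python sketch that draws a dot plot as text; the function name dot_plot and the data are made up, and it assumes small single-digit integer values so the columns line up):

```python
from collections import Counter

def dot_plot(data):
    """Return a text dot plot: one column per value, repeats stacked vertically."""
    counts = Counter(data)
    values = range(min(data), max(data) + 1)
    rows = []
    # Build from the top row down; a dot appears where the count reaches that level.
    for level in range(max(counts.values()), 0, -1):
        row = " ".join("o" if counts[v] >= level else " " for v in values)
        rows.append(row.rstrip())
    # Axis labels go underneath the dots.
    rows.append(" ".join(str(v) for v in values))
    return "\n".join(rows)

print(dot_plot([1, 2, 2, 3, 3, 3, 5]))
```

note: with many observations the stacks become too tall to read, which is why the card suggests switching to a bar graph.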
describe bar charts and bargraphs
- vertical bars represent the observed frequency of certain values or categories
- suited for categorical and ordinal and discrete numerical data
- bars should be of equal width and separated from each other so as not to imply continuity
y axis on bar graph:
- observed frequency
- relative frequency
- % frequency
note: in a bargraph, each line is drawn directly above its x axis value
why should we use relative frequency?
- the values on the y axis sum to one, so we can read off proportions directly
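example (a short Python sketch with made-up categorical data): turning observed frequencies into relative and % frequencies for the y axis.

```python
from collections import Counter

# Made-up categorical sample of 8 subjects.
smoking_status = ["never", "current", "never", "former",
                  "never", "current", "never", "former"]

observed = Counter(smoking_status)
n = len(smoking_status)

# Relative frequency = observed frequency / sample size.
relative = {category: count / n for category, count in observed.items()}

# % frequency is the relative frequency scaled by 100.
percent = {category: 100 * r for category, r in relative.items()}

for category in observed:
    print(category, observed[category], relative[category], percent[category])
print("total relative frequency:", sum(relative.values()))
```

note: because the relative frequencies sum to one, two samples of different sizes can be compared on the same scale.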