Chapter 2: exploratory data analysis Flashcards

1
Q

contrast diagrams and tables

A
  • both tables and diagrams are easy ways to present a lot of information
  • tables are better at containing a lot of detail whereas diagrams are better at showing a bigger picture
2
Q

what must we be wary of in diagrams

A
  • show the data clearly and fairly
  • use simplicity in design
  • keep the decoding simple
3
Q

how can we differentiate the type of variable

A
  1. ordered?
  2. scaled? (levels)
  3. rounding error?
  4. meaningful zero? (think temperature versus height)

A. categorical
B. ordinal
C. discrete numerical
D. continuous numerical

4
Q

describe categorical data

A

data that fall into one of a small number of unordered categories

eg. sex, smoking status

5
Q

describe ordinal data

A

data which falls into one of a small number of categories but the categories are ordered in some way

note: ordinal data can be treated as categorical, but this loses the ordering information

6
Q

describe discrete numerical data

A
  • can only be integers
  • count data

note: age is continuous

7
Q

how should data handling occur

A
  1. check the data for common sense
  2. significant figures: the general rule is three
  3. transformations, eg. log(x) or logit(x)
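
A minimal sketch of steps 2 and 3, using only Python's standard library. The helper names `round_sig` and `logit` are hypothetical, and reading the card's "logitx" as the logit transform log(p / (1 - p)) is an assumption.

```python
import math

def round_sig(x, sig=3):
    """Round x to `sig` significant figures (three by default)."""
    if x == 0:
        return 0.0
    # Decimal places needed so that `sig` significant digits survive.
    digits = sig - 1 - math.floor(math.log10(abs(x)))
    return round(x, digits)

def logit(p):
    """Logit transform for a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))

print(round_sig(0.0123456))   # three significant figures
print(round_sig(123456))      # also works for large values
print(math.log(20.0))         # log transform of a positive value
print(logit(0.5))             # logit of one half
```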
8
Q

what are the three main description techniques

A
  1. frequency distributions
  2. cumulative frequency distributions and quantiles
  3. moment stats (mean, variance…)
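
As a rough illustration, the three techniques can be computed for a small sample with Python's standard library. The data and variable names below are made up for this sketch and do not come from the notes.

```python
from collections import Counter
from itertools import accumulate
from statistics import mean, variance

# Hypothetical sample of counts (made-up data for illustration).
sample = [2, 3, 3, 5, 5, 5, 7, 8]

# 1. Frequency distribution: how often each distinct value occurs.
freq = dict(sorted(Counter(sample).items()))

# 2. Cumulative frequency: number of observations <= each value.
cum_freq = dict(zip(freq.keys(), accumulate(freq.values())))

# 3. Moment statistics: sample mean and sample variance (n - 1 divisor).
sample_mean = mean(sample)
sample_var = variance(sample)

print(freq)
print(cum_freq)
print(sample_mean, sample_var)
```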
9
Q

which descriptive techniques can be used for categorical data

A

frequency distribution

10
Q

which descriptive techniques can be used for ordinal data

A

frequency distributions, plus cumulative frequency distributions and quantiles

11
Q

which descriptive techniques can be used for numerical data

A
  1. frequency distributions
  2. cumulative frequency distributions and quantiles
  3. moment stats
12
Q

what is the relationship between sample quantiles and order statistics

A

the median and quartiles are all examples of sample quantiles, and sample quantiles are calculated from the order statistics (the sample values sorted into increasing order)
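
A small sketch of the idea for the median: sort the sample to get the order statistics, then pick (or average) the middle ones. The function name and data are hypothetical.

```python
from statistics import median

def median_from_order_stats(values):
    """Median computed directly from the order statistics x_(1) <= ... <= x_(n)."""
    ordered = sorted(values)              # the order statistics
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]               # odd n: the single middle order statistic
    # even n: average the two middle order statistics
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [7, 1, 4, 9, 3, 8]                 # made-up sample
print(median_from_order_stats(data))
print(median(data))                       # library value, for comparison
```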

13
Q

what must be noted between using R and calculating sample quantiles by hand

A

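R's values can be slightly different from hand calculations, because R's quantile() function supports several interpolation rules (its type argument) and the default need not match the rule used by hand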
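
The same effect can be seen outside R. The sketch below uses Python's `statistics.quantiles`, which offers two conventions ("exclusive" and "inclusive"); it only illustrates that different rules give different quartiles and does not reproduce any particular R setting. The data are made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8]          # made-up sample

# Two common interpolation conventions for the quartiles.
q_excl = quantiles(data, n=4, method="exclusive")
q_incl = quantiles(data, n=4, method="inclusive")

print(q_excl)
print(q_incl)
# The medians agree, but the lower and upper quartiles differ slightly.
```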

14
Q

describe dot plots including what it is and when not to use it

A

a chart showing each observation as a dot, with identical values stacked vertically, giving a quick idea of the distribution

when not to use?

  • continuous data
  • too much data, at which point you should use a bar graph
15
Q

describe bar charts and bar graphs

A
  • vertical bars represent the observed frequency of certain values or categories
  • suited to categorical, ordinal and discrete numerical data
  • bars should be of equal width and separated from each other so as not to imply continuity

y axis on bar graph:

  • observed frequency
  • relative frequency
  • % frequency

note: the lines on the graph are directly above the x axis value

why should we use relative frequency?
- the values on the y axis sum to one, so we can read off proportions
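
A quick sketch of the three y-axis choices for a made-up categorical sample (the data and names are assumptions for illustration):

```python
from collections import Counter

# Made-up categorical observations (smoking status).
status = ["never", "former", "never", "current", "never", "former"]

n = len(status)
counts = Counter(status)                                  # observed frequency
rel_freq = {k: v / n for k, v in counts.items()}          # relative frequency
pct_freq = {k: 100 * v / n for k, v in counts.items()}    # % frequency

print(dict(counts))
print(rel_freq)
print(sum(rel_freq.values()))   # relative frequencies add up to one (up to rounding)
```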

16
Q

describe pie charts

A
  • often used to depict categorical data; however, reading one relies on judging area, angle and volume, and humans are bad at this
  • better alternative is bar chart
17
Q

describe histograms

A
  • bar graph for continuous data
  • if all intervals are of the same width, the heights of the bars can be frequencies or rel. frequencies
  • if the intervals are not of the same width, we want the area to be proportional to the relative frequencies

height = rel. frequency / interval width

why do we group into bins?
- each observation of a continuous variable is always distinct, so we group the observations into bins, which represent intervals of values
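
A minimal sketch of the height rule for unequal bins. The function name, the data and the bin edges are made up, and treating bins as half-open intervals [left, right) is an assumption of this sketch.

```python
def histogram_heights(values, edges):
    """Bar heights = relative frequency / bin width, for bins [edges[i], edges[i+1])."""
    n = len(values)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Count observations falling in the half-open bin [left, right).
        count = sum(1 for v in values if left <= v < right)
        heights.append((count / n) / (right - left))
    return heights

values = [0.5, 1.2, 1.8, 2.5, 3.1, 4.7, 6.0, 9.5]   # made-up continuous data
edges = [0, 2, 4, 10]                               # unequal bin widths: 2, 2, 6

heights = histogram_heights(values, edges)
# Area of each bar = height * width, which recovers the relative frequency.
areas = [h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:])]

print(heights)
print(areas, sum(areas))
```

The first and last bins each hold three of the eight observations, but the last bar is drawn a third as tall because its bin is three times as wide.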

18
Q

what is cumulative frequency?

A

the relative frequency of observations less than or equal to the number x. this is a simple analogue of the cumulative distribution function ie. CDF

graph:

  • the empirical cdf is a step function
  • however, with lots of data the steps get closer and closer together, so it will LOOK like a continuous function but is still discrete

sample quantiles can be found easily from cumulative frequency
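
A short sketch of the empirical cdf as defined above; the function name and data are made up for illustration.

```python
def ecdf(sample, x):
    """Relative frequency of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

sample = [2, 3, 3, 5, 8]                  # made-up data

# The value only jumps at observed data points and is flat in between,
# which is what makes the graph a step function.
for x in (1, 2, 3, 4, 8):
    print(x, ecdf(sample, x))
```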

19
Q

what is a sample quantile? what is its relationship to cumulative frequency?

A

.

20
Q

describe box plots

A

graphical representation of five number summary

we have it to give a simple idea of the location, spread, skewness/symmetry

unlike a histogram, it does not tell us the mean, SD, sample size or frequencies

POSITIVE SKEW
- the tail is longer on the right side

NEGATIVE SKEW
- the tail is longer on the left side

SYMMETRICAL
- the tails are the same length
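
A sketch of the five number summary behind a box plot. The quartile convention used here (Python's "inclusive" method) is an assumption, and the notes may use a different one; the data are made up.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]       # made-up sample with a long right tail

lo, q1, med, q3, hi = five_number_summary(data)
print(lo, q1, med, q3, hi)

# A much longer stretch above Q3 than below Q1 suggests positive skew.
print(hi - q3, q1 - lo)
```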

21
Q

how can outliers in box plots be dealt with

A
  • limit the tails to range specified on DB
  • IQR=Q3-Q1
  • data outside inner fences are indicated by individual points ie. by a hashtag with the value, over an asterisk (no division sign)
  • outliers are still shown because they may sometimes give an explanation, eg. a person with a rare disease
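
A sketch of flagging points outside the inner fences. The card does not give the fence formula, so the common convention Q1 - 1.5 × IQR and Q3 + 1.5 × IQR is assumed here; the quartile method, function name and data are also assumptions.

```python
from statistics import quantiles

def inner_fence_outliers(values, k=1.5):
    """Return the observations lying outside the inner fences.

    Assumes fences at Q1 - k*IQR and Q3 + k*IQR with k = 1.5 (a common
    convention, not confirmed by the notes).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                          # IQR = Q3 - Q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]       # made-up sample
print(inner_fence_outliers(data))         # points to draw individually
```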
22
Q

when and why do we use transformations?

A

when there is a positive skew in data

- we use a log transformation ie. log(x) where x is the initial data
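
A sketch of the effect on a made-up positively skewed sample. Skewness is measured here with the moment coefficient m3 / m2^1.5, which is a choice made for this illustration; the log transform also requires all values to be positive.

```python
import math

def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (positive for a long right tail)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

skewed = [1, 2, 2, 3, 4, 5, 8, 20, 100]   # made-up data with a long right tail
logged = [math.log(x) for x in skewed]    # log(x) applied to the initial data

print(moment_skewness(skewed))
print(moment_skewness(logged))            # closer to zero after the transformation
```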

23
Q

how is skewness represented in a graphical perspective

A

BEWARE! possible point of confusion,

positive skew: the tail is still longer on the right, but the peak is on the left

negative skew: the peak is on the right, but the tail is longer on the left

24
Q

describe bivariate data

A

occurs when two variables are recorded for each individual
- the variables may be of any type

eg. height and weight, time and alcohol content, gender and blood type, treatment type and level

25
Q

for bivariate data with two categorical variables,
what should we do?

A
  • combine the two variables into one "super variable"
  • use a bar chart for the super variable
    see notes for visual
26
Q

for bivariate data with one numerical and one categorical variable,
what should we do?

A
  • compare the distribution of the numerical variable for each level of the categorical variable
  • use parallel box plots or dot plots
27
Q

for bivariate data with two numerical variables,
what should we do?

A
  • use a scatter plot or scatter diagram
  • each coordinate specifies a point in the Cartesian plane corresponding to an individual
  • a line plot may be drawn in the SPECIAL CASE that the x variable has a natural ordering. this is most often time
28
Q

describe the types of relationships scatter plots allow us to see. what can we do to tell the strength of this relationship

A

negative relationship: large x goes with small y, and vice versa
positive relationship: large x goes with large y, and small x with small y

we measure the strength of the relationship with the correlation coefficient, r, which satisfies -1 ≤ r ≤ 1.

  • the larger the magnitude, the stronger the relationship
  • the sign reflects the direction of the relationship
  • if r is plus/minus 1, the two variables are exactly linearly related, ie. y = a + bx
  • the plots must all use the same axis scaling for consistent comparison
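
A sketch of r computed from its usual definition (Pearson's correlation coefficient), checked on made-up data lying exactly on a line. The function name and data are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r between paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
rising = [3 + 2 * x for x in xs]          # y = a + bx with b > 0
falling = [10 - 3 * x for x in xs]        # y = a + bx with b < 0
noisy = [2, 1, 4, 3, 6]                   # positive but not exactly linear

print(pearson_r(xs, rising))
print(pearson_r(xs, falling))
print(pearson_r(xs, noisy))
```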