Chapter 2: exploratory data analysis Flashcards by Keegan C

contrast diagrams and tables

both tables and diagrams are easy ways to present a lot of information
tables are better at containing a lot of detail whereas diagrams are better at showing a bigger picture

what must we be wary of in diagrams

show the data clearly and fairly
use simplicity in design
keep the decoding simple

how can we differentiate the type of variable

ordered?
scaled? (levels)
rounding error?
meaningful zero? think temperatures or height

A. categorical
B. ordinal
C. discrete numerical
D. continuous numerical

describe categorical data

data which can fall into one of a small number of categories, which are unordered

eg. sex, smoking status

describe ordinal data

data which falls into one of a small number of categories but the categories are ordered in some way

note: can be treated as categorical, but this loses the ordering information

describe discrete numerical data

can only be integers
count data

note: age is continuous, even when it is recorded in whole years

how should data handling occur

check for common sense
sig figs, general rule is three
transformations, eg. log(x) or logit(x)

what are the three main description techniques

frequency distributions
cumulative frequency distributions and quantiles
moment stats (mean, variance…)
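
note: a minimal sketch of the three techniques in Python, using only the standard library. the sample values (a count of siblings for ten people) are made up for illustration.

```python
from collections import Counter
from statistics import mean, variance

# Hypothetical sample: number of siblings reported by ten people.
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 1]
n = len(data)

# 1. Frequency distribution: how often each value occurs.
freq = Counter(data)

# 2. Cumulative relative frequency: proportion of observations <= each value.
cum_rel_freq = {}
running = 0
for value in sorted(freq):
    running += freq[value]
    cum_rel_freq[value] = running / n

# 3. Moment statistics: sample mean and sample variance (divisor n - 1).
sample_mean = mean(data)
sample_var = variance(data)

for value in sorted(freq):
    print(value, freq[value], cum_rel_freq[value])
print("mean:", sample_mean, "variance:", round(sample_var, 3))
```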

which descriptive techniques can be used for categorical data

frequency distribution

which descriptive techniques can be used for ordinal data

frequency distributions, cumulative frequency distributions and quantiles

which descriptive techniques can be used for numerical data

frequency distributions
cumulative frequency distributions and quantiles
moment stats

what is the relationship between sample quantiles and order statistics

the median and quartiles are all examples of sample quantiles, which are calculated from the order statistics (the sample sorted into increasing order)
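
note: a small Python sketch of the idea, assuming the usual textbook rule for the median (middle order statistic for odd n, average of the two middle ones for even n). the function name and sample values are made up.

```python
def sample_median(sample):
    """Median computed directly from the order statistics."""
    x = sorted(sample)          # x[0] <= x[1] <= ... are the order statistics
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]           # odd n: the single middle order statistic
    return (x[mid - 1] + x[mid]) / 2   # even n: average the two middle ones

odd_sample = [7, 3, 9, 1, 5]
even_sample = [7, 3, 9, 1]

print(sorted(odd_sample), sample_median(odd_sample))
print(sorted(even_sample), sample_median(even_sample))
```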

what must be noted between using R and calculating sample quantiles by hand

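R's values can be slightly different from hand calculations, because R's quantile() function interpolates between order statistics using its own default rule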
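
note: the same effect can be seen in Python, sketched here with the standard library's statistics.quantiles. its two methods give different quartiles for the same data. as far as I know, the "inclusive" method matches R's default rule (type 7), but treat that mapping as an assumption and check against your own R output. the data are made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8]

# "exclusive": position of the p-th quantile is p * (n + 1).
q_exclusive = quantiles(data, n=4, method="exclusive")

# "inclusive": position is 1 + p * (n - 1), so min and max are the 0% and 100% points.
q_inclusive = quantiles(data, n=4, method="inclusive")

print("exclusive:", q_exclusive)
print("inclusive:", q_inclusive)
```

both rules agree on the median here, but they give different lower and upper quartiles.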

describe dot plots including what it is and when not to use it

chart showing each observation as a dot, with identical values stacked vertically, giving a quick idea of the distribution

when not to use?

continuous data
too much data, at which point you should use a bar graph

describe bar charts and bar graphs

vertical bars represent the observed frequency of certain values or categories
suited for categorical, ordinal and discrete numerical data
bars should be of equal width and separated from each other so as not to imply continuity

y axis on bar graph:

observed frequency
relative frequency
% frequency

note: the lines on the graph are directly above the x axis value

why should we use relative frequency?
- the values on the y axis sum to one, so we can read off proportions
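
note: a quick Python sketch of why relative frequencies are convenient. the blood types below are made-up data, and the text bars are only a rough stand-in for a real bar chart.

```python
from collections import Counter
import math

# Hypothetical categorical sample.
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O"]
n = len(blood_types)

freq = Counter(blood_types)
rel_freq = {category: count / n for category, count in freq.items()}

# Relative frequencies always sum to one, so each bar is a proportion.
total = sum(rel_freq.values())

for category in sorted(rel_freq):
    bar = "#" * freq[category]          # one '#' per observation
    print(f"{category:>2} {bar:<5} {rel_freq[category]:.3f}")
```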

describe pie charts

often used to depict categorical data; however, reading one relies on the ability to assess area, angle and volume, and humans are bad at this
a better alternative is a bar chart

describe histograms

bar graph for continuous data
if all intervals are of the same width, the heights of the bars can be frequencies or rel. frequencies
if the intervals are not of the same width, we want the area of each bar to be proportional to the relative frequency

height: rel. frequency/interval width

why do we group into bins?
- each observation of a continuous variable is always distinct, so we group the observations into bins which represent intervals of values

what is cumulative frequency?

relative frequency of observations less than or equal to the number x. this is a simple analogue of the cumulative distribution function, ie. the CDF

graph:

the empirical CDF is a step function
however, with lots of data the steps get closer and closer together and LOOK like a continuous function, but it is still discrete

sample quantiles can be found easily from cumulative frequency
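
note: a minimal Python sketch of the empirical CDF and of reading a quantile off it. the quantile rule used here (smallest observed value whose cumulative relative frequency reaches p) is only one of several conventions and is an assumption, as are the function names and data.

```python
def ecdf(sample, x):
    """Relative frequency of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def quantile_from_ecdf(sample, p):
    """Smallest observed value whose cumulative relative frequency is at least p."""
    for v in sorted(sample):
        if ecdf(sample, v) >= p:
            return v

data = [2, 4, 4, 5, 7, 9, 10, 12]

print(ecdf(data, 4))       # jumps at the observed value 4
print(ecdf(data, 4.5))     # flat between observations: same as at 4
print(quantile_from_ecdf(data, 0.5))
```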

what is a sample quantile? what is its relationship to cumulative frequency?

describe box plots

graphical representation of the five number summary

it gives a simple idea of the location, spread and skewness/symmetry

it does not tell us the mean, SD, sample size or frequencies like a histogram does

POSITIVE SKEW
- the tail is longer on the right side

NEGATIVE SKEW
- the tail is longer on the left side

SYMMETRICAL
- the tails are the same length
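
note: a rough Python sketch of the five number summary behind a box plot. the data are made up, the quartiles use the "inclusive" rule from the statistics module (other rules give slightly different values), and the skew check is only a crude heuristic comparing the two tail lengths.

```python
from statistics import quantiles

# Hypothetical sample with a long right tail.
data = [1, 2, 2, 3, 3, 4, 5, 8, 14]

q1, med, q3 = quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, med, q3, max(data))

# Crude skew hint: compare the distance from the median to each extreme.
right_tail = max(data) - med
left_tail = med - min(data)
if right_tail > left_tail:
    skew = "positive"
elif right_tail < left_tail:
    skew = "negative"
else:
    skew = "symmetrical"

print(five_number, skew)
```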

how can outliers in box plots be dealt with

limit the tails to the range specified on DB
IQR = Q3 - Q1
data outside the inner fences are indicated by individual points, ie. by a hashtag with the value, over an asterisk (no division sign)

note: outliers are kept on the plot because they may sometimes have an explanation, eg. a person with a rare disease
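
note: a Python sketch of flagging points outside the inner fences. the cards do not state the fence range, so the common convention of 1.5 x IQR beyond the quartiles is assumed here. the data and the quartile rule ("inclusive") are also assumptions.

```python
from statistics import quantiles

data = [1, 2, 2, 3, 3, 4, 5, 8, 14]   # made-up sample

q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# ASSUMPTION: inner fences at 1.5 * IQR beyond the quartiles (a common convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# The tails (whiskers) stop at the most extreme points still inside the fences.
whisker_low, whisker_high = min(inside), max(inside)

print(iqr, (lower_fence, upper_fence), (whisker_low, whisker_high), outliers)
```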

when and why do we use transformations?

when there is a positive skew in the data

- we use a log transformation ie. log(x) where x is the initial data
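
note: a small Python illustration of why the log helps. it compresses large values more than small ones, so a long right tail is pulled in. the data are made up (each value doubles the previous one), which makes the logged values exactly evenly spaced.

```python
import math
from statistics import mean, median

# Hypothetical positively skewed data.
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(x) for x in raw]   # natural log of each observation

# With positive skew the mean sits well above the median.
print("raw:   ", round(mean(raw), 2), median(raw))

# After the log transformation the mean and median coincide (symmetric data).
print("logged:", round(mean(logged), 4), round(median(logged), 4))
```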

how is skewness represented in a graphical perspective

BEWARE! possible point of confusion:

positive skew: the peak is on the left, but the tail is longer on the right

negative skew: the peak is on the right, but the tail is longer on the left

describe bivariate data

occurs when two variables are recorded for each individual
- the variables may be of any type

eg. height and weight, time and alcohol content, gender and blood type, treatment type and level

in the case of bivariate data with two categorical variables, what to do?

- combine the variables into one variable
- use a bar chart for the super variable (see notes for visual)
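
note: a minimal Python sketch of building the super variable, using made-up (gender, blood type) pairs. each distinct pair becomes one category, and its count is the height of one bar.

```python
from collections import Counter

# Hypothetical paired observations: (gender, blood type) for each individual.
pairs = [("F", "A"), ("M", "O"), ("F", "O"), ("M", "A"),
         ("F", "O"), ("M", "O"), ("F", "B"), ("M", "O")]

# Each tuple is one level of the combined "super variable".
combined = Counter(pairs)

for (gender, blood), count in sorted(combined.items()):
    print(f"{gender}/{blood}: {'#' * count}")
```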

in the case of bivariate data with one numerical and one categorical variable, what to do?

- compare the distribution of the numerical variable for each level of the categorical variable
- use parallel box plots or dot plots
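
note: a short Python sketch of the comparison, using made-up (treatment type, level) records. it prints a simple summary per treatment, which is roughly what parallel box plots would show side by side.

```python
from statistics import median

# Hypothetical records: (treatment type, measured level).
records = [("A", 12), ("B", 20), ("A", 15), ("B", 26),
           ("A", 11), ("B", 23), ("A", 14), ("B", 21)]

# Split the numerical variable by each level of the categorical variable.
groups = {}
for treatment, level in records:
    groups.setdefault(treatment, []).append(level)

for treatment in sorted(groups):
    values = groups[treatment]
    print(treatment, "min", min(values), "median", median(values), "max", max(values))
```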

in the case of bivariate data with two numerical variables, what to do?

- use a scatter plot or scatter diagram
- each coordinate pair specifies a point in the Cartesian plane corresponding to an individual
- a line plot may be drawn in the SPECIAL CASE that the x variable has a natural ordering. this is most often time

describe the types of relationships scatter plots allow us to see. what can we do to tell the strength of this relationship

negative relationship: large x goes with small y, and vice versa
positive relationship: large x goes with large y, and vice versa

we measure the strength of the relationship with the correlation coefficient, r, which satisfies -1 ≤ r ≤ 1
- the larger the magnitude, the stronger the relationship
- the sign reflects the direction of the relationship
- if r is plus/minus 1, the two variables are exactly linearly related, ie. y = a + bx
- the scaling of these graphs must all be the same for consistent comparison
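
note: a minimal Python sketch of the correlation coefficient, assuming the usual Pearson formula r = Sxy / sqrt(Sxx * Syy). the function name and the data are made up for illustration.

```python
from math import sqrt, isclose

def pearson_r(xs, ys):
    """Sample correlation coefficient between paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_line_up = [3, 5, 7, 9, 11]     # exactly y = 1 + 2x
y_line_down = [10, 8, 6, 4, 2]   # exactly y = 12 - 2x
y_noisy = [2, 1, 4, 3, 5]        # positive but not perfectly linear

print(pearson_r(x, y_line_up), pearson_r(x, y_line_down), pearson_r(x, y_noisy))
```

the two exact lines give r of plus and minus 1, while the noisy positive relationship gives a value strictly between 0 and 1.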

(28 cards)