intro to data Flashcards
data matrix
table of data
columns : variables
rows : individuals
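A minimal sketch of a data matrix, assuming a made-up table of three people (the names, ages, and blood types are hypothetical, not from the cards). It shows one row as an individual and one column as a variable:

```python
# Hypothetical data matrix: each row is an individual, each column is a variable.
columns = ["name", "age", "blood_type"]
rows = [
    ("Ana", 34, "A"),
    ("Ben", 27, "O"),
    ("Cy", 41, "AB"),
]

# One individual = one row, paired up with the column names.
first_individual = dict(zip(columns, rows[0]))

# One variable = one column, read across all individuals.
age_index = columns.index("age")
ages = [row[age_index] for row in rows]

print(first_individual)
print(ages)
```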
what are variables and individuals
(information given in data)
variables = characteristics
individuals = observational units
quantitative variable
numerical or measurement variable
ex: age, distance
two types: discrete & continuous
quantitative variable
discrete
can only take numerical values with jumps between them,
such as 1, 2, 3, 4
ex: # of plants in a garden, # of dogs in a house
quantitative variable
continuous
can take on any value in an interval
ex: temperature throughout the day
values can include decimals
categorical variable
qualitative variable
places an individual or item into one of several groups or categories, called levels
examples of levels
blood types (A,B, AB, O)
gender
two types of categorical variables
nominal and ordinal
categorical variable
nominal
no natural ordering for the categories
Ex: dog breed, brand of soda
categorical variable
ordinal
have a logical order for categories
ex: size of soda, grade level
what graphs are used to graph categorical data
bar graph and pie chart
box plot qualities
shows the median using dark horizontal line
can’t see the number of modes
what graphs are used to graph quantitative data
dotplots and histograms
dotplots qualities
represents each observation in a data set using a single dot along the x-axis
do well displaying values of a variable in a smaller data set
NOT good at displaying data with too many different values -> lose sense of the overall distribution
histograms qualities
give a good sense of the shape of the distribution
shows the modes
symmetry is visible
distribution
what values does the variable take and how often
modes
# of peaks
unimodal, bimodal, multimodal
symmetry
symmetric
skewed to right (tail on right): most values are lower
skewed to left (tail on left): most values are higher
outliers
observations that lie outside the overall pattern of distribution
^^ must consider reason they exist
population
the entire group we are interested in learning about
sample
subset of individuals that is often a small fraction of the overall population
parameter
the numerical summary for a characteristic of the population (as a whole)
keyword : “All”
statistic
the numerical summary for a characteristic of a sample
keyword: sample
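A rough sketch of parameter vs. statistic, assuming a made-up population of 10,000 ages (the data, the seed, and the sample size of 100 are all assumptions for illustration). The mean of the whole population is the parameter; the mean of the sample is the statistic:

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same numbers on every run

# Hypothetical population: ages of all 10,000 members of some group.
population = [random.randint(18, 90) for _ in range(10_000)]

# A sample: a small subset of the population.
sample = random.sample(population, 100)

parameter = statistics.mean(population)  # summary of the population ("all")
statistic = statistics.mean(sample)      # summary of the sample

print("parameter (population mean):", parameter)
print("statistic (sample mean):", statistic)
```

The two numbers are usually close but not identical, which is why a statistic is used to estimate a parameter.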
which two go together
sample, parameter, statistic, population
statistic and sample
parameter and population
a statistic describes a sample
a parameter describes a population
sample mean
the sum of the observations divided by the # of observations
symbol: x̄ (x bar)
center when distribution is roughly symmetric
sample median
the middle value when data are arranged from smallest to largest
center when distribution is skewed
mean and median sensitivity to extremes
mean = sensitive to extremes
median = resistant to extremes
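A small sketch of that sensitivity, assuming a made-up data set and one added extreme value of 100. The mean moves a lot; the median barely moves:

```python
import statistics

data = [2, 3, 4, 5, 6]        # hypothetical observations
with_extreme = data + [100]   # same data plus one extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_extreme)
median_before = statistics.median(data)
median_after = statistics.median(with_extreme)

# The mean jumps from 4 to 20, while the median only shifts from 4 to 4.5.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```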
what do the values of the mean and median, relative to each other, tell us about the distribution
mean= median
mean>median
mean<median
mean = median : approximately symmetric
mean>median : skewed to right
mean<median : skewed to left
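The comparison above can be sketched as a rule of thumb. The function name `shape_hint` and the 5% tolerance for "approximately equal" are assumptions made for this example, not a standard definition, and a graph should still be checked:

```python
import math
import statistics

def shape_hint(data, rel_tol=0.05):
    """Guess the shape by comparing mean and median (rough rule of thumb only)."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if math.isclose(mean, median, rel_tol=rel_tol):
        return "approximately symmetric"
    if mean > median:
        return "skewed to right"
    return "skewed to left"

# Hypothetical data sets for illustration.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]
left_tail = [1, 17, 18, 18, 19, 19, 19, 20]

print(shape_hint(symmetric))
print(shape_hint(right_tail))
print(shape_hint(left_tail))
```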
range
maximum - minimum
very sensitive to extremes
interquartile range
goes hand in hand with median
used to measure variability when median is the center
IQR = Q3-Q1
describes the variability of the middle 50% of data
NOT sensitive to extreme values
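A minimal sketch comparing range and IQR, assuming made-up data and Python's `statistics.quantiles` with `method="inclusive"`. Textbooks and software use slightly different quartile conventions, so Q1 and Q3 computed by hand may differ a little:

```python
import statistics

def data_range(data):
    """Range = maximum - minimum."""
    return max(data) - min(data)

def iqr(data):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention (an assumption)."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]           # hypothetical observations
with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # largest value replaced by an extreme

# The range changes a lot; the IQR of the middle 50% does not change here.
print("range:", data_range(data), "->", data_range(with_extreme))
print("IQR:  ", iqr(data), "->", iqr(with_extreme))
```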
percentiles
first quartile (Q1) = 25th percentile
third quartile (Q3) = 75th percentile
standard deviation
used as measure of variability when sample mean is the measure of center
tells us how far a typical observation departs from the mean; deviation = observation - mean
sample variance
s^2 = (sum of squared deviations) / (n - 1)
n = number of observations
sample standard deviation equation
s = sqrt( sum of (observation - mean)^2 / (n - 1) )
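A short sketch of the variance and standard deviation calculations done step by step, assuming a made-up data set. The results are compared with Python's built-in `statistics.variance` and `statistics.stdev`, which also divide by n - 1:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
n = len(data)
mean = sum(data) / n             # sample mean (x bar)

# Deviation of each observation from the mean, then squared.
squared_deviations = [(x - mean) ** 2 for x in data]

s2 = sum(squared_deviations) / (n - 1)  # sample variance
s = math.sqrt(s2)                       # sample standard deviation

print("variance by hand:", s2, " library:", statistics.variance(data))
print("sd by hand:      ", s, " library:", statistics.stdev(data))
```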
how do you know if there is more variability when you look at the standard deviation value (s)
when s is larger
what two descriptions go together and when
median
mean
iqr
standard deviation
median and iqr = skewed
mean and sd = symmetric
how to describe distribution of a graph
SOCS
Shape (# of modes; symmetric or skewed?)
Outliers (do they exist?)
Center (mean or median)
Spread/variability (IQR or SD)
suspected outliers in boxplots
points that lie at or below Q1 - 1.5 × IQR or at or above Q3 + 1.5 × IQR
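The rule above can be sketched as follows, assuming made-up data and the "inclusive" quartile convention of `statistics.quantiles` (other conventions can give slightly different fences). The helper name `suspected_outliers` is hypothetical:

```python
import statistics

def suspected_outliers(data):
    """Return (lower fence, upper fence, flagged points) using the 1.5 x IQR rule."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # "At or below" / "at or above" the fences, matching the card's wording.
    flagged = [x for x in data if x <= lower_fence or x >= upper_fence]
    return lower_fence, upper_fence, flagged

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical observations
lower_fence, upper_fence, flagged = suspected_outliers(data)

print("fences:", lower_fence, upper_fence)
print("suspected outliers:", flagged)
```

A flagged point is only a suspected outlier; the reason it exists still has to be considered.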