Chapters 1-3 Flashcards
elements
the entities on which data are collected
variable
a characteristic of interest for the elements; it can take different values for different elements
nominal scale
A scale of measurement in which the data for a variable consist of labels or names that identify an attribute of the element. Can also be numeric when the number stands for a label or name.
ordinal scale
A scale of measurement with the properties of the nominal scale, plus the order or rank of the data is meaningful. Can be non-numeric, or numeric when the number stands for a label or name.
interval scale
A scale of measurement with the properties of the ordinal scale, plus the interval between values is expressed in terms of a fixed unit of measure. Always numeric. The difference between two values is meaningful.
ratio scale
A scale of measurement with the properties of the interval scale, plus the ratio of two values is meaningful. Examples: distance, height, weight, and time. The scale requires a true zero: a value of zero means that nothing exists for the variable.
quantitative data
data that require numeric values indicating how much or how many
cross-sectional data
data collected at the same or approximately the same time
time series data
data collected over several time periods
descriptive statistics
Summaries of data in tabular, graphical, or numerical form
population
the set of all elements of interest in a particular study
sample
a subset of the population
frequency distribution
a tabular summary of data showing the number (frequency) of items in each of several non-overlapping classes
relative frequency distribution
a tabular summary showing the relative frequency for each class. Relative frequency of a class = frequency of the class / n
percent frequency distribution
a tabular summary showing the percent frequency for each class. Percent frequency of a class = relative frequency of the class x 100
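As a minimal sketch of the three distributions above, the following Python snippet counts a small categorical sample. The data values and variable names are made up for illustration and are not from the course material.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (made up for illustration).
data = ["A", "B", "A", "C", "B", "A", "A", "C", "B", "A"]

n = len(data)
frequency = Counter(data)                               # items per class
relative = {c: f / n for c, f in frequency.items()}     # frequency / n
percent = {c: 100 * f / n for c, f in frequency.items()}  # relative frequency x 100

for c in sorted(frequency):
    print(c, frequency[c], relative[c], percent[c])
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any such table.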
bar chart
a graphical representation of a frequency, relative frequency, or percent frequency distribution with two axes (one for the data classes, one for the frequency)
pie chart
a circle representing a frequency, relative frequency, or percent frequency distribution. It is divided into sectors, each sized according to the relative frequency of its class.
histogram
a chart showing quantitative data that have been summarized in a frequency distribution
cumulative frequency distributions
Shows, for each class, the number of items with values less than or equal to the upper class limit of that class.
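A cumulative frequency distribution is a running total of the class frequencies. A minimal sketch in Python, assuming hypothetical class limits and frequencies that are made up for illustration:

```python
from itertools import accumulate

# Hypothetical upper class limits and class frequencies (illustration only).
upper_limits = [14, 19, 24, 29, 34]
frequencies = [4, 8, 5, 2, 1]

# Running total: items with values <= the upper limit of each class.
cumulative = list(accumulate(frequencies))

for limit, cum in zip(upper_limits, cumulative):
    print(f"less than or equal to {limit}: {cum}")
```

The last cumulative frequency always equals n, the total number of observations.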
dot plot
a graphical summary of data in which each data value is shown as a dot above its position on the axis, so the number of dots equals the frequency
stem-and-leaf display
shows both the rank order and the shape of a data set
cross tabulations
a tabular summary of data for two variables
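A cross tabulation can be sketched by counting each combination of the two variables. The records and category names below are hypothetical, made up for illustration.

```python
from collections import Counter

# Hypothetical (quality rating, price range) pairs, made up for illustration.
records = [
    ("good", "low"), ("good", "high"), ("excellent", "high"),
    ("good", "low"), ("excellent", "high"), ("excellent", "low"),
]

crosstab = Counter(records)  # count of each (row, column) combination

rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})
print("columns:", cols)
for r in rows:
    print(r, [crosstab[(r, c)] for c in cols])
```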
Simpson's paradox
the situation in which the relationship between two variables in aggregated data is the reverse of the relationship seen when the data are not aggregated
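The reversal is easiest to see with numbers. The sketch below uses hypothetical counts (successes, trials) for two options A and B in two groups; all figures are made up for illustration.

```python
# Hypothetical counts (successes, trials), made up for illustration only.
counts = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    return successes / trials

# Unaggregated: within each group, A has the higher success rate.
for group, by_option in counts.items():
    print(group, {k: rate(*v) for k, v in by_option.items()})

# Aggregated over the groups, the relation reverses: B has the higher rate.
totals = {}
for option in ("A", "B"):
    s = sum(counts[g][option][0] for g in counts)
    t = sum(counts[g][option][1] for g in counts)
    totals[option] = rate(s, t)
print("aggregated", totals)
```

The reversal arises here because A was mostly tried in the group with low success rates and B mostly in the group with high success rates.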
clustered bar charts
shows the joint distribution of two categorical variables, with the bars for each category placed side by side
stacked bar charts
shows the joint distribution of two categorical variables, with each bar divided into segments, one per category of the second variable
scatter diagram
a graphical representation of the relationship between two quantitative variables. A trend line provides an approximation of the relationship.
percentile
provides information about how the data are spread over the interval from the smallest to the largest value.
“The pth percentile is a value such that at least p percent of the observations are less than or equal to this value and at least (100 - p) percent of the observations are greater than or equal to this value.”
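One common rule that satisfies this definition uses the index i = (p/100) x n on the sorted data. The sketch below assumes that rule; other textbooks and software use slightly different conventions, and the function name and data are hypothetical.

```python
import math

def percentile(values, p):
    """pth percentile via the index rule i = (p/100) * n (one common convention)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    data = sorted(values)
    n = len(data)
    i = p * n / 100
    if i != int(i):
        # i is not an integer: round up and take the value at that position
        return data[math.ceil(i) - 1]
    # i is an integer: average the values at positions i and i + 1
    i = int(i)
    return (data[i - 1] + data[i]) / 2

# Hypothetical data, made up for illustration.
values = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(percentile(values, 25), percentile(values, 50), percentile(values, 75))
```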
quartile
Division points:
First quartile - 25th percentile
Second quartile - 50th percentile (also the median)
Third quartile - 75th percentile
range
Range = largest value - smallest value
Highly affected by extremely high and low values.
interquartile range
IQR = third quartile - first quartile
variance
is a measure of variability that uses all data values and is based on the difference between each data value and the mean
standard deviation
is defined as the positive square root of the variance
coefficient of variation
describes how large the standard deviation is relative to the mean: (standard deviation / mean) x 100
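The three measures above build on each other. A minimal sketch, assuming the sample formulas (division by n - 1, which the cards do not state explicitly) and a hypothetical data set:

```python
import math

# Hypothetical sample, made up for illustration.
data = [2, 4, 4, 6, 9]

n = len(data)
mean = sum(data) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: positive square root of the variance.
std_dev = math.sqrt(variance)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv = std_dev / mean * 100

print(mean, variance, std_dev, cv)
```

For a population rather than a sample, the variance would divide by n instead of n - 1.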
Distributional shape
a histogram of a relative frequency distribution shows the distributional shape, including its skewness
negative, positive, zero skewness
negative skewness - skewed to the left
positive skewness - skewed to the right
zero skewness - symmetrical
z-score
the number of standard deviations that x_i is from the sample mean: z_i = (x_i - mean) / standard deviation. Z-scores are standardized values.
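A short sketch of the computation, assuming the sample standard deviation (n - 1) and a hypothetical data set:

```python
import math

# Hypothetical sample, made up for illustration.
data = [2, 4, 4, 6, 9]

n = len(data)
mean = sum(data) / n
std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# z-score: distance of each value from the mean, in standard deviations.
z_scores = [(x - mean) / std_dev for x in data]

for x, z in zip(data, z_scores):
    print(x, round(z, 2))
```

A negative z-score means the value is below the mean, a positive one that it is above; the z-scores of a data set always sum to zero.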
Chebyshev's theorem
enables us to make statements about the proportion of data values that lie within a specified number of standard deviations of the mean. “At least (1 - 1/z^2) x 100 percent of the data values must be within z standard deviations of the mean, where z is any value greater than 1.”
Chebyshev's theorem - percent
- at least 75% of the data values must be within z = 2 standard deviations of the mean
- at least 89% of the data values must be within z = 3 standard deviations of the mean
- at least 94% of the data values must be within z = 4 standard deviations of the mean
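These percentages follow directly from the formula (1 - 1/z^2) x 100. A small sketch that evaluates it; the function name is hypothetical:

```python
def chebyshev_min_percent(z):
    """Minimum percentage of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return (1 - 1 / z**2) * 100

for z in (2, 3, 4):
    print(z, round(chebyshev_min_percent(z), 2))
```

The exact values for z = 3 and z = 4 are about 88.89 and 93.75, which are usually quoted rounded to 89% and 94%.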
empirical rule
When the data approximately follow a bell-shaped distribution, the empirical rule can be used to determine the approximate percentage of data values that lie within a specified number of standard deviations of the mean.
Empirical rule - percent
For data with bell-shaped distribution:
- approximately 68% of the data values lie within 1 standard deviation of the mean
- approximately 95% of the data values lie within 2 standard deviations of the mean
- almost all data values lie within 3 standard deviations of the mean
outliers
Extreme values in a data set.
Z-scores (standardized values) can be used to identify outliers. In bell-shaped distributions, the empirical rule says that almost all values lie within 3 standard deviations of the mean, so any value with a z-score less than -3 or greater than 3 can be treated as an outlier.
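A minimal sketch of the z-score check, assuming the sample standard deviation (n - 1) and a hypothetical data set with one extreme value:

```python
import math

# Hypothetical sample with one extreme value, made up for illustration.
data = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
        49, 52, 48, 50, 51, 49, 50, 52, 48, 90]

n = len(data)
mean = sum(data) / n
std_dev = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Flag values whose z-score is below -3 or above 3.
outliers = [x for x in data if abs((x - mean) / std_dev) > 3]
print(outliers)
```

An extreme value also inflates the standard deviation, so in very small samples no value can reach a z-score of 3; the check is more useful on larger data sets.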
Five-number summary
Used to summarize data:
1. Smallest value
2. First quartile
3. Median
4. Third quartile
5. Largest value
Box plot
Graphical version of the five-number summary.
1. A box is drawn with the box ends located at first and third quartile
2. A line is drawn across the box at the location of the median
3. Limits are located using the IQR. The limits are 1.5(IQR) below the first quartile and 1.5(IQR) above the third quartile. Data outside these limits are considered outliers.
4. Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest values inside the limits.
5. The outliers are shown with *
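The numbers behind a box plot can be sketched as follows. The data are hypothetical, and the quartiles come from Python's `statistics.quantiles`, whose default method may give slightly different quartiles than a textbook percentile rule on other data sets. The whiskers here stop at the most extreme values inside the limits, which is one common convention.

```python
import statistics

# Hypothetical data, made up for illustration only.
data = sorted([10, 12, 13, 14, 15, 16, 17, 18, 19, 21, 40])

# Quartiles from the standard library (default "exclusive" method).
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Limits: 1.5 * IQR below the first and above the third quartile.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_limit or x > upper_limit]
inside = [x for x in data if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

five_number_summary = (data[0], q1, median, q3, data[-1])
print(five_number_summary, iqr, (lower_limit, upper_limit), outliers)
```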
Covariance
a descriptive measure of the linear association between two variables
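A minimal sketch of the sample covariance, assuming the usual formula: the sum of (x_i - mean of x)(y_i - mean of y), divided by n - 1. The paired data are hypothetical.

```python
# Hypothetical paired observations, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: average product of paired deviations, using n - 1.
covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
print(covariance)
```

A positive value indicates a positive linear association, a negative value a negative one, and a value near zero little or no linear association.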