Describing data Flashcards
Micro data
Data collected on individual units
Macro data
Collected on groups of units
Population
The set of all statistical units of interest. Its size is denoted by N
Sample
A subset of units drawn from the population. Its size is denoted by n
Non-probability sampling
Units are drawn from the population according to the judgement of the researcher
Probability sampling
Units are drawn from the population at random. Because no part of the population is favored, the sample tends to be representative of the population
Inferential process
Drawing conclusions about the entire population from the information contained in the sample.
A collection of techniques that use sample statistics to learn about population parameters
Parameter
Numerical summary of a characteristic computed on the whole population (N units)
Statistic
Numerical summary of a characteristic computed on the sample (n units)
Categorical values
Non-numerical values; they can be either nominal or ordinal
Nominal categorical values
Non-numerical values that cannot be ranked
Ordinal categorical values
Non-numerical values that can be ranked
Numerical values
Number values; they can be either discrete or continuous
Discrete numerical values
Take on a finite number of values, or an infinite but COUNTABLE number of values
Continuous numerical values
Can take any value between two numbers (ex.: height and weight)
How is a frequency distribution table composed (its columns)
LEFT: observed distinct values (classes/groups)
MIDDLE: absolute frequency (the number of observations with that value)
RIGHT: relative frequency (absolute frequency / n)
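A minimal sketch of building the three columns of a frequency table from raw observations. The function name `frequency_table` and the sample data are illustrative assumptions, not part of the flashcards.

```python
from collections import Counter

def frequency_table(observations):
    """Return rows of (value, absolute frequency, relative frequency)."""
    counts = Counter(observations)
    n = len(observations)
    # One row per distinct value, sorted by value.
    return [(value, count, count / n) for value, count in sorted(counts.items())]

# Hypothetical sample of six observations.
table = frequency_table(["B", "A", "B", "C", "B", "A"])
for row in table:
    print(row)
```

The relative frequencies in the last column always add up to 1.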
How can you represent a freq. table (not with intervals)?
Pie or bar chart
How can you represent a freq. table (with intervals)?
Histogram
How can you read a histogram?
HORIZONTAL AXIS: intervals. On each interval there is a bar whose AREA equals the interval's relative frequency
VERTICAL AXIS: interval density (the height of each bar)
How can you calculate an interval density?
relative frequency/ interval length
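As a rough illustration of the density formula, the sketch below computes the bar heights of a histogram with unequal intervals. The intervals and counts are made-up numbers, and `interval_densities` is a hypothetical helper name.

```python
def interval_densities(intervals, counts):
    """Density of each interval = relative frequency / interval length."""
    n = sum(counts)
    densities = []
    for (lower, upper), count in zip(intervals, counts):
        relative_frequency = count / n
        densities.append(relative_frequency / (upper - lower))
    return densities

# Hypothetical grouped data: three intervals with different lengths.
intervals = [(0, 10), (10, 30), (30, 50)]
counts = [5, 10, 5]
densities = interval_densities(intervals, counts)

# Bar area = density * length, which recovers the relative frequency.
areas = [d * (upper - lower) for d, (lower, upper) in zip(densities, intervals)]
```

Here the first two bars have the same height (0.025) even though the second interval holds twice as many observations, because it is also twice as long.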
The higher the number of intervals the ……….. is the degree of detail of the description
higher
Mode
The level or value of a variable that is observed with the highest frequency = the most observed value
What is the only measure of central tendency that can be used for nominal variables?
Mode
Median
The central value of the ordered distribution. It divides the sample in half
How to find the median when n is odd or even?
ODD: the observation in position (n+1)/2 of the ordered data
EVEN: either of the two middle observations (positions n/2 and n/2 + 1), or their arithmetic average
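A small sketch of the odd/even rule. It assumes the "arithmetic average of the two middle observations" convention for even n; the function name `median` is chosen here for illustration.

```python
def median(observations):
    """Median of raw data: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(observations)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # 0-based index n // 2 is the 1-based position (n + 1) / 2.
        return ordered[middle]
    # 1-based positions n/2 and n/2 + 1.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_case = median([7, 1, 5])       # ordered: 1, 5, 7
even_case = median([7, 1, 5, 3])   # ordered: 1, 3, 5, 7
```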
How can we calculate the median from a frequency table?
It is the value at which the cumulative percentage is exactly 50%, or otherwise the first value whose cumulative percentage exceeds 50%
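The rule can be sketched as a scan over the cumulative frequencies. The table below is invented for illustration, and `median_from_frequency_table` is a hypothetical name; the values are assumed to be sorted in ascending order.

```python
def median_from_frequency_table(values, frequencies):
    """Return the first value whose cumulative share reaches at least 50%."""
    n = sum(frequencies)
    cumulative = 0
    for value, frequency in zip(values, frequencies):
        cumulative += frequency
        # Integer comparison avoids floating-point issues at exactly 50%.
        if 2 * cumulative >= n:
            return value

# Hypothetical table: cumulative percentages are 20%, 40%, 80%, 100%.
values = [1, 2, 3, 4]
frequencies = [2, 2, 4, 2]
table_median = median_from_frequency_table(values, frequencies)
```

The value 3 is returned because it is the first one whose cumulative percentage (80%) passes 50%.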
Mean
Arithmetic average of all variable values. ONLY for numerical values
Deviation
The difference between each observed value and the mean. Positive if higher than the mean and negative if lower
How to calculate the mean from freq. distribution tables?
(value1 × frequency1 + value2 × frequency2 + …) / n
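A short sketch of the mean computed from a frequency table, using absolute frequencies so that n is their sum. The table is invented and the function name is an assumption.

```python
def mean_from_frequency_table(values, frequencies):
    """Weighted sum of the distinct values divided by the number of observations."""
    n = sum(frequencies)
    return sum(value * frequency for value, frequency in zip(values, frequencies)) / n

# Hypothetical table: 1 appears twice, 2 twice, 3 four times, 4 twice.
table_mean = mean_from_frequency_table([1, 2, 3, 4], [2, 2, 4, 2])
# (1*2 + 2*2 + 3*4 + 4*2) / 10 = 26 / 10
```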
Do outliers affect the mean? Why?
Yes, because it is computed using ALL the observed values, so an extreme value pulls it toward itself
Do outliers affect the median? Why?
No (or very little), because it depends only on the position of the observations in the ordered data, not on how extreme the largest or smallest values are
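A quick illustration with made-up numbers: replacing the largest observation with an extreme value changes the mean but leaves the median where it was. Only functions from Python's standard `statistics` module are used.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]  # same data, largest value replaced by an outlier

mean_clean, median_clean = mean(clean), median(clean)
mean_outlier, median_outlier = mean(with_outlier), median(with_outlier)

print(mean_clean, median_clean)      # mean and median coincide
print(mean_outlier, median_outlier)  # mean jumps, median is unchanged
```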
What are the two measures of location and for what types of variables can they be used?
Quartiles and percentiles.
For ordinal categorical and numerical values
What are the quartiles and what does each represent?
Q1 = approximately a quarter of the observations are smaller (25th percentile)
Q2 = the median (50th percentile)
Q3 = approximately three quarters of the observations are smaller (75th percentile)
What does the pth percentile represent?
It is the value such that approximately p% of the cases fall below it
What are the characteristics of a boxplot?
Lower edge = Q1
Upper edge = Q3
line = median
Whiskers = extend to the most extreme observations within 1.5 × IQ range below Q1 and above Q3. Any value beyond these limits is an outlier
How can we compute the IQ range?
IQ range = Q3-Q1
What is the 5-number summary?
the most extreme values in the data set (the maximum and minimum values), the lower and upper quartiles, and the median
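The boxplot quantities above can be sketched with the standard library. Quartile conventions differ between textbooks and software; this sketch assumes the "inclusive" method of `statistics.quantiles`, and the data set and helper names are hypothetical.

```python
from statistics import quantiles

def five_number_summary(observations):
    """Return (min, Q1, median, Q3, max)."""
    q1, q2, q3 = quantiles(observations, n=4, method="inclusive")
    return min(observations), q1, q2, q3, max(observations)

def boxplot_outliers(observations):
    """Values lying more than 1.5 * IQ range below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(observations)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in observations if x < lower_fence or x > upper_fence]

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
summary = five_number_summary(data)
outliers = boxplot_outliers(data)
```

For this data Q1 = 3 and Q3 = 7, so the IQ range is 4, the limits are -3 and 13, and only 30 is flagged as an outlier.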
Bell-shaped distribution characteristics
Median-Q1=Q3-median
Q1-Min = max - Q3
Median - Q1 < Q1-min
Mean = median
Symmetric but NOT bell-shaped distribution characteristics
Median-Q1=Q3-median
Q1-Min = max - Q3
Median - Q1 > Q1-min
Mean = median
Skewed-right distribution characteristics
Median-Q1 < Q3-median
Q1-Min < max - Q3
Frequencies are high on low values and low on high values (long right tail)
Skewed-left distribution characteristics
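Median-Q1 > Q3-median
Q1-Min > max - Q3
Frequencies are low on low values and high on high values (long left tail)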
What are the four measures of variability?
Range, IQ range, variance and standard deviation
What do the measures of variability show us?
How the frequencies are distributed across the values and if the units are spread uniformly across the variable
Range
Max - Min value
Why is the range affected by outliers?
Because it only uses extreme values
What does the IQ range measure?
The spread of the 50% central part of the distribution
Variance
The average of the squared deviations
What does the variance measure?
The dispersion or variability of a variable, as the spread around its mean.
It is NEVER negative
It equals 0 only if all values are equal
Standard deviation
The square root of the variance
Coefficient of variation
CV = sd/mean
NEEDED when comparing the variability of two variables that do not have the same mean
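A minimal sketch of variance, standard deviation and coefficient of variation. It follows the "average of the squared deviations" definition, so it divides by n (not by n - 1, as a sample estimator would); the data and function names are illustrative assumptions.

```python
from math import sqrt

def variance(observations):
    """Average of the squared deviations from the mean (divides by n)."""
    n = len(observations)
    mean = sum(observations) / n
    return sum((x - mean) ** 2 for x in observations) / n

def standard_deviation(observations):
    return sqrt(variance(observations))

def coefficient_of_variation(observations):
    mean = sum(observations) / len(observations)
    return standard_deviation(observations) / mean

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5
var = variance(data)
sd = standard_deviation(data)
cv = coefficient_of_variation(data)
```

The squared deviations add up to 32, so the variance is 32 / 8 = 4, the standard deviation is 2 and CV = 2 / 5 = 0.4.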
What does a concentration analysis of a variable assess?
It assesses how far the actual distribution is from the two extremes (perfectly equal distribution and maximum concentration)
What does a highly concentrated variable mean?
That the distribution is very different from the perfectly equal distribution
What does a variable with low concentration mean?
That the distribution is very close to the perfectly equal distribution
For what type of variable is a concentration analysis carried out?
Only for POSITIVE numerical values that have the property of TRANSFERABILITY
Are biometric values, such as weight, transferable values?
No
Are financial values transferable values?
Yes
What is the name of the concentration curve?
The Lorenz curve
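What are the two quantities needed for concentration analysis and how are they computed?
Fi and Qi, computed on the values sorted in ascending order
Fi = i/n (cumulative share of units)
Qi = (sum of the i smallest values) / (sum of all values) (cumulative share of the total amount)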
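A sketch of the points of the Lorenz curve, taking Fi as the cumulative share of units and Qi as the cumulative share of the total amount held by the i smallest units. The incomes are invented and `lorenz_points` is a hypothetical name.

```python
def lorenz_points(values):
    """Return the points (Fi, Qi) of the Lorenz curve, starting from (0, 0)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / n, running / total))
    return points

# Hypothetical incomes of four units.
points = lorenz_points([40, 10, 30, 20])
for f, q in points:
    print(f, q)
```

Because the values are sorted in ascending order, Qi never exceeds Fi, and the curve always ends at (1, 1).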
If Fi and Qi, for each i, are the same, what type of concentration do we have?
Minimum concentration
If Qi=0 for each i except Qn, which is 1, what type of concentration do we have?
Maximum concentration
In the Lorenz curve, the closer the curve is to the horizontal axis the ……….. (greater/smaller) the concentration is
Greater
In the Lorenz curve, the closer the curve is to the vertical axis the ……….. (greater/smaller) the concentration is
Smaller
What are the characteristics of the concentration/Lorenz curve?
Continuous, convex, and it passes through the points (0,0) and (1,1)
Gini index
Denoted by R. R = concentration area / maximum possible area
What is the formula for the maximum area?
(n-1)/(2n)
When is the gini coefficient 1?
Max concentration
When is the gini coefficient 0?
Min concentration
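A rough sketch of the Gini index as the ratio between the concentration area and its maximum (n-1)/(2n). It assumes the concentration area is measured between the diagonal and the piecewise-linear Lorenz curve, so the area under the curve is summed with trapezoids; `gini_index` and the data are illustrative.

```python
def gini_index(values):
    """R = concentration area / maximum possible area."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    area_under_curve = 0.0
    previous_q = 0.0
    running = 0
    for value in ordered:
        running += value
        q = running / total
        # Trapezoid between consecutive Lorenz points; each has width 1/n.
        area_under_curve += (previous_q + q) / 2 * (1 / n)
        previous_q = q
    concentration_area = 0.5 - area_under_curve
    maximum_area = (n - 1) / (2 * n)
    return concentration_area / maximum_area

equal = gini_index([5, 5, 5, 5])         # everyone holds the same amount
extreme = gini_index([0, 0, 0, 10])      # one unit holds everything
in_between = gini_index([10, 20, 30, 40])
```

The first case gives R = 0 (minimum concentration), the second gives R = 1 (maximum concentration), and the third gives R = 1/3.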
What is a bivariate association? And how can it be studied for categorical values?
The study of the association between two variables. It can be done with a cross tab / contingency table
What type of plot can be used to show conditional frequencies?
Stacked bar plot or side-by-side bar plot
When do we have a positive linear association between numerical values?
When high/low values of one variable tend to occur with high/low values of the other variable
When do we have a negative linear association between numerical values?
When high/low values of one variable tend to occur with low/high values of the other variable
With what measures can we assess linear association?
Covariance and Pearson index
How is the Pearson index calculated?
r = Cov(X,Y) / (Sx × Sy)
What does the Pearson index measure?
The direction and strength of the linear association between two numerical variables. It takes on values between -1 and 1
What do the following Pearson correlations indicate?
+1, -1, 0
+1: perfect POSITIVE linear association
-1: perfect NEGATIVE linear association
0: no linear association
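A minimal sketch of covariance and the Pearson index, with both dividing by n so that the standard deviations match the "average of squared deviations" variance. The function names and the three small data sets are assumptions made for illustration.

```python
from math import sqrt

def covariance(xs, ys):
    """Average product of the deviations of X and Y from their means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    """r = Cov(X, Y) / (Sx * Sy)."""
    sx = sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    sy = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # points on a rising line
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # points on a falling line
r_none = pearson([-1, 0, 1], [1, 0, 1])            # V-shaped, no linear trend
```

The third data set shows that r = 0 rules out a linear association only: Y is still clearly related to X, just not along a line.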
If the median is closer to Q3 what is the distribution shape?
Left-skewed
If the median is closer to Q1 what is the distribution shape?
Right-skewed
If the median is in the center (equally distant from Q1 and Q3), what is the distribution shape?
Symmetric (for example, bell-shaped)