Descriptive stats Flashcards
What’s the difference between descriptive and inferential statistics?
Descriptive –> describes sample data using sample statistics
Inferential –> uses sample statistics to learn about population parameters
What is micro data?
Data collected on individuals
What is macro data?
Data collected on a group of units
What is a population?
The set of all statistical units that are the object of interest
What is the sample?
A subset drawn from the population
What is non probability and probability sampling?
Non-probability –> units are drawn from the population according to the judgement of the researcher
Probability –> units are drawn at random from the population, each with a known probability of selection (in simple random sampling, every unit has the same probability of being drawn)
What is the inferential process?
It consists of drawing conclusions about the entire population from the information provided by a sample
What are the two broad variable categories?
Numerical and categorical
What are the subsets of numerical variables?
Discrete and continuous
What are the subsets of categorical variables?
Ordinal and nominal
What are the columns of a frequency distribution table?
Classes/groups; absolute frequencies and relative frequencies
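A minimal sketch of such a table for a categorical variable, assuming the three columns listed above. The function name `frequency_table` and the sample grades are illustrative, not from the source.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (class, absolute frequency, relative frequency)."""
    counts = Counter(values)
    n = len(values)
    # Relative frequency = absolute frequency / number of observations.
    return [(level, count, count / n) for level, count in sorted(counts.items())]

grades = ["B", "A", "C", "B", "B", "A", "C", "B"]  # made-up data
table = frequency_table(grades)
for level, absolute, relative in table:
    print(f"{level}\t{absolute}\t{relative:.3f}")
```

The relative frequencies always add up to 1, which is a quick check that the table is complete.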
How is a histogram composed?
Horizontal axis –> intervals
Bars –> each bar has an area equal to the relative frequency of its interval
Vertical axis –> interval density = relative frequency/interval width
How can we calculate an interval density?
Relative frequency/interval width
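The density formula can be sketched as follows. The interval edges and relative frequencies are made-up values, and `interval_densities` is a hypothetical helper name.

```python
def interval_densities(edges, rel_freqs):
    """Density of each interval = relative frequency / interval width."""
    densities = []
    for i, rel in enumerate(rel_freqs):
        width = edges[i + 1] - edges[i]
        densities.append(rel / width)
    return densities

edges = [0, 10, 20, 50]        # three intervals: 0-10, 10-20, 20-50
rel_freqs = [0.2, 0.5, 0.3]    # relative frequencies, summing to 1
densities = interval_densities(edges, rel_freqs)

# Bar area = density * width, which gives back the relative frequency.
areas = [d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities)]
print(densities)
print(areas)
```

Because the intervals have different widths, the widest interval (20-50) gets the lowest bar even though it does not have the lowest relative frequency.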
How does the number of intervals relate to the accuracy?
The higher the # of intervals, the higher the detail of the description.
What are the three measures of central tendency?
Mode, median and mean
What is the mode?
The level/value of a variable that is observed with the highest frequency
What is the only measure of central tendency that can be used for nominal variables?
Mode
What is the median?
It is the central value of the ordered data. If n is odd, it’s the value in position (n+1)/2; if n is even, it’s the mean of the two central values
What is the mean?
The arithmetic average of the values: (x1 + x2 + ... + xn)/n
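A minimal sketch of both definitions, reading the median by position in the sorted data. The names `median_by_position` and `arithmetic_mean` are illustrative, not from the source.

```python
def arithmetic_mean(values):
    """(x1 + x2 + ... + xn) / n"""
    return sum(values) / len(values)

def median_by_position(values):
    """Central value of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value in 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even n: the mean of the two central values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [2, 4, 4, 5, 10]
even_data = [1, 3, 5, 7]
print(arithmetic_mean(odd_data), median_by_position(odd_data))
print(arithmetic_mean(even_data), median_by_position(even_data))
```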
What is the deviation?
It’s the difference between an observed value and the mean (xi - mean)
What are the properties of deviation?
- It’s positive when the value is higher than the mean, negative when it is lower, and 0 when it equals the mean
- The sum of all deviations is equal to 0
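Both properties can be checked on a small made-up data set. With values that are not exactly representable in floating point, the sum may differ from 0 by a tiny rounding error.

```python
data = [3, 7, 8, 10, 12]                 # made-up values
mean = sum(data) / len(data)             # 40 / 5 = 8.0
deviations = [x - mean for x in data]    # negative below the mean, positive above
total = sum(deviations)
print(deviations, total)
```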
Do outliers (unusually large or small values) have an impact on the median?
No (or very little), because it depends only on the position of the values in the ordered data, not on their size
Do outliers have an impact on the mean?
Yes, because it’s computed using all values
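A small illustration with made-up numbers: replacing the largest value with an extreme one leaves the median unchanged but moves the mean a lot.

```python
import statistics

values = [10, 12, 13, 15, 20]
with_outlier = [10, 12, 13, 15, 2000]   # same data, largest value made extreme

median_before = statistics.median(values)
median_after = statistics.median(with_outlier)
mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)

print(median_before, median_after)   # the central position is unchanged
print(mean_before, mean_after)       # the sum, and so the mean, changes
```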
What are the measures of location? And for which types of data can they be computed?
- Quartiles and percentiles;
- Ordinal categorical and numerical
What are quartiles?
They divide the ordered observations into four equal parts
Q1- 25% of values are smaller than it
Q2- it’s the median
Q3- 75% of the observations are smaller than it.
What is the percentile?
The pth percentile is the value below which p% of the observations fall
What is the five-number summary? And how can it be represented?
Minimum, Q1,Q2,Q3 and maximum
By means of a boxplot
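A sketch of the five-number summary using Python's `statistics.quantiles`. Several quartile conventions exist and can give slightly different values; the `method="inclusive"` choice here is an assumption, not something the source specifies. The name `five_number_summary` is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    # Assumption: the "inclusive" quartile convention; others are possible.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up values
summary = five_number_summary(data)
print(summary)
```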
How is the boxplot composed?
Height –> it’s the IQR (Q3-Q1)
Upper edge –> Q3
Lower edge –> Q1
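Whiskers –> extend from the box to the most extreme observations that lie within 1.5xIQR of Q1 and Q3; observations beyond that are plotted individually as outliers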
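The parts of a boxplot can be computed without drawing anything. This is a sketch assuming the common 1.5xIQR rule and the "inclusive" quartile convention of `statistics.quantiles`; `boxplot_parts` is a hypothetical helper name.

```python
import statistics

def boxplot_parts(values):
    """Compute the box edges, whisker ends and outliers of a boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                        # height of the box
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "box": (q1, q3),
        "iqr": iqr,
        "whiskers": (min(inside), max(inside)),  # most extreme non-outliers
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]   # made-up values, 30 is far from the rest
parts = boxplot_parts(data)
print(parts)
```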
What are the properties of a symmetrical/bell-shaped distribution?
Q1 - min = max - Q3; median - Q1 = Q3 - median;
median - Q1 < Q1 - min; Q3 - median < max - Q3
What are the properties of a symmetric distribution that is not bell-shaped?
Q1 - min = max - Q3; median - Q1 = Q3 - median;
median - Q1 > Q1 - min; Q3 - median > max - Q3
What are the properties of a right-skewed distribution?
Frequencies are high at low values and low at high values (long tail to the right)
Q3 - median > median - Q1
Mean > median
What are the properties of a left-skewed distribution?
Frequencies are high at high values and low at low values (long tail to the left)
Mean < median
Median - Q1 > Q3 - median
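A rough rule-of-thumb check of skewness from the quartiles, with the mean and median compared alongside. The data are made up, `skew_direction` is a hypothetical helper, and the "inclusive" quartile convention is an assumption. A distribution with a long right tail has Q3 further from the median than Q1 is, and the mean above the median; a long left tail gives the opposite.

```python
import statistics

def skew_direction(values):
    """Compare the two halves of the box: Q3 - median versus median - Q1."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lower_half, upper_half = q2 - q1, q3 - q2
    if upper_half > lower_half:
        return "right"
    if upper_half < lower_half:
        return "left"
    return "symmetric"

right_tailed = [1, 2, 2, 3, 3, 4, 6, 9, 15]
left_tailed = [16 - x for x in right_tailed]   # mirror image of the data above

for data in (right_tailed, left_tailed):
    print(skew_direction(data), statistics.mean(data), statistics.median(data))
```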
What are the 4 measures of variability?
Range, IQR, variance and standard deviation
What does the IQR measure?
The spread of the central 50% of the observations
What is the variance?
It’s the average of the squared deviations from the mean. It measures the dispersion of a variable around its mean. It’s never negative, and it is 0 only when all values are equal
What is the coefficient of variation?
CV = s/mean. It expresses the standard deviation as a fraction (or percentage) of the mean and allows the variability of two variables to be compared when they have different means
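The measures of variability and the CV can be sketched with Python's `statistics` module. Here the variance divides by n (`pvariance`), matching "average of the squared deviations"; the sample version that divides by n - 1 is `statistics.variance`. The data and the quartile convention are assumptions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values

value_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                # spread of the central 50%
variance = statistics.pvariance(data)        # mean of squared deviations
std_dev = statistics.pstdev(data)            # square root of the variance
cv = std_dev / statistics.mean(data)         # only meaningful for a positive mean

print(value_range, iqr, variance, std_dev, cv)
```

Multiplying `cv` by 100 gives the standard deviation as a percentage of the mean.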
What does it mean to analyze the concentration of a variable?
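It means assessing how far the actual distribution is from the two extreme cases: minimum concentration (the total is shared equally) and maximum concentration (one unit holds the whole total)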
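What does it mean when a variable is very concentrated?
Its distribution is close to maximum concentration: a few units hold most of the total
What does it mean when a variable has a low concentration?
Its distribution is close to minimum concentration: the total is shared almost equally among the units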
Can a concentration analysis be carried out for variables with negative values?
No
What property must a variable have for a concentration analysis to be carried out?
It needs to be transferable
How is qi distributed in the case of maximum concentration?
q0 = q1 = ... = q(n-1) = 0 and qn = 1
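What does the concentration curve look like in the case of maximum concentration?
It lies on the horizontal axis up to the point ((n-1)/n, 0) and then rises to (1,1); the area between it and the diagonal is (n-1)/(2n)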
How is qi distributed in the case of minimum concentration?
qi = fi for every i, that is, every cumulative share of units holds the same cumulative share of the total
What are the properties of a concentration curve?
- Continuous
- Convex
- Passes through (0,0) and (1,1)
What is the Gini index?
R = concentration area / maximum possible area, where the maximum possible area is (n-1)/(2n)
When are the Pietra and Gini indices equal to zero?
When fi=qi for all i, that is, when it has a minimum concentration
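A minimal sketch of the Gini index as the ratio of two areas, assuming individual (ungrouped) non-negative data with a positive total and at least two units. Here fi is the cumulative share of units and qi the cumulative share of the total; the function names are illustrative, and the area under the curve is computed with trapezoids.

```python
def concentration_points(values):
    """Points (fi, qi) of the concentration curve, starting at (0, 0)."""
    ordered = sorted(values)
    if ordered[0] < 0:
        raise ValueError("concentration analysis needs non-negative values")
    n, total = len(ordered), sum(ordered)
    points, running = [(0.0, 0.0)], 0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / n, running / total))
    return points

def gini_index(values):
    """R = concentration area / maximum possible area (n - 1) / (2n)."""
    points = concentration_points(values)
    n = len(values)
    # Area under the curve, one trapezoid per pair of consecutive points.
    under = sum((x1 - x0) * (y0 + y1) / 2
                for (x0, y0), (x1, y1) in zip(points, points[1:]))
    concentration_area = 0.5 - under   # area between the diagonal and the curve
    return concentration_area / ((n - 1) / (2 * n))

equal_shares = [5, 5, 5, 5]      # minimum concentration
one_holds_all = [0, 0, 0, 20]    # maximum concentration
print(gini_index(equal_shares), gini_index(one_holds_all))
```

The two made-up data sets reproduce the extreme cases: R is 0 when qi = fi for every i, and 1 when a single unit holds the whole total.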
If high values of one variable tend to occur together with high values of the other, what kind of linear association is there?
Positive
If high values of one variable tend to occur together with low values of the other, what kind of linear association is there?
Negative
What is the formula for Pearson’s correlation index?
r = cov(X,Y) / (sx · sy)
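The formula can be sketched directly from its parts. The covariance and the standard deviations below all divide by n; using n - 1 in all three gives the same r, since the divisor cancels. The name `pearson_r` and the data are illustrative, and both variables are assumed to have non-zero variability.

```python
import math

def pearson_r(xs, ys):
    """r = cov(X, Y) / (sx * sy)"""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # high x with high y: positive association
falling = [10, 8, 6, 4, 2]    # high x with low y: negative association
print(pearson_r(xs, rising), pearson_r(xs, falling))
```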