09 Quantitative Methods - Descriptive Statistics Flashcards
Questions on Wagschal, Uwe (1999): Statistik für Politikwissenschaftler. Oldenbourg, München u.a. (reading: chapter 7 and selected parts of chapter 10).
(1) Regarding the measures of centrality:
(1.1) What is the purpose of measures of centrality?
The most commonly used statistical measures are the means, also known as location parameters. A location parameter indicates the point on the axis of a characteristic around which the values of a population lie on average. This location can be measured in different ways, which is why there is a whole range of differently defined location parameters. Mathematically speaking, a mean is a value between the smallest and largest value of a set. Measures of the mean are therefore also referred to as models for describing central tendency.
(1.2) Make sure that you understand the following measures of centrality: mode (Modalwert), median and arithmetic mean.
Mode (xmod): it is the value that appears with the highest frequency in a given dataset. For example, if the dataset is {1, 2, 2, 3, 4, 4, 4, 5}, then the mode is 4, since it appears three times and no other value appears more frequently.
Median (Z): the median is a measure of central tendency that refers to the value separating the higher half from the lower half of a dataset. Specifically, it is the middle value in a sorted list of numbers. To find the median of a dataset, you first need to sort the numbers in ascending or descending order, and then identify the middle value(s) based on the number of items in the dataset. If the dataset has an odd number of items, the median is the middle number. If the dataset has an even number of items, the median is the average of the two middle numbers.
Arithmetic mean (x̄): it is a measure of central tendency that refers to the average value of a set of numbers. It is calculated by adding up all the numbers in the set and then dividing the sum by the total number of items in the set.
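All three measures can be checked with Python's standard library; this sketch reuses the example dataset from the mode definition above.

```python
import statistics

data = [1, 2, 2, 3, 4, 4, 4, 5]

print(statistics.mode(data))    # 4    (appears most frequently)
print(statistics.median(data))  # 3.5  (average of the two middle values 3 and 4)
print(statistics.mean(data))    # 3.125 (sum 25 divided by 8 items)
```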
(1.3) What are the “Lageregeln der Mittelwerte”/ “laws of location of means”?
The distribution is
1. symmetric if x̄ = Z = xmod
2. skewed to the right (long right tail) if x̄ > Z > xmod
3. skewed to the left (long left tail) if x̄ < Z < xmod
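These rules can be illustrated with a small Python sketch (the toy dataset is invented for the example): the single large value stretches the right tail and pulls the arithmetic mean above the median, which in turn lies above the mode.

```python
import statistics

# Invented right-skewed toy data: the value 8 stretches the right tail.
data = [1, 2, 2, 2, 3, 3, 4, 8]

x_bar = statistics.mean(data)    # 3.125
z     = statistics.median(data)  # 2.5
x_mod = statistics.mode(data)    # 2

# Law of location for a right-skewed distribution: x̄ > Z > xmod
assert x_bar > z > x_mod
```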
(2) Regarding the measures of dispersion:
(2.1) What is the purpose of measures of dispersion?
The purpose of measures of dispersion is to provide information about the spread or variability of a dataset. While measures of central tendency such as the mean or median provide information about the typical or central value of a dataset, measures of dispersion provide information about how much the individual values in the dataset deviate from that central value.
Some commonly used measures of dispersion include the range, variance, standard deviation, and interquartile range. These measures can help to identify the presence of outliers, assess the precision of statistical estimates, and make comparisons between different datasets.
In general, a high dispersion indicates that the values in the dataset are spread out over a larger range, while a low dispersion indicates that the values are more tightly clustered around the central value. By taking into account both the central tendency and the dispersion of a dataset, researchers can gain a more complete understanding of the distribution of the data and draw more accurate conclusions about the population from which the data was drawn.
(2.2) Make sure that you understand the following measures of dispersion: range, interquartile range.
Range: The range is the simplest measure of dispersion and is calculated by subtracting the minimum value from the maximum value in the dataset. It represents the total spread of the dataset.
Interquartile Range (IQR): The interquartile range is a measure of the spread of the middle 50% of a dataset. It is calculated by subtracting the 25th percentile (first quartile) from the 75th percentile (third quartile).
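Both measures in a short Python sketch (dataset invented; note that different quartile conventions can give slightly different IQR values):

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11, 15]

data_range = max(data) - min(data)  # 15 - 2 = 13

# Quartiles via the standard library (Python 3.8+); the "inclusive"
# method interpolates between observed data points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # 9.5 - 4.0 = 5.5
print(data_range, iqr)
```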
(2.3) Make sure that you understand the following measures of dispersion: variance, standard deviation, coefficient of variation.
Variance: The variance is a measure of how much the individual values in a dataset deviate from the mean. It is calculated as the average of the squared differences between each value and the mean. When the variance is estimated from a sample, dividing by n yields a biased estimate of the population variance; the corrected sample variance therefore divides by n − 1 (Bessel's correction).
Standard Deviation: The standard deviation is the square root of the variance and provides a measure of the spread of a dataset in the same units as the data. It is the most commonly used measure of dispersion, as it is easy to interpret and widely applicable.
Coefficient of Variation: The coefficient of variation is a measure of relative dispersion, calculated as the ratio of the standard deviation to the mean. It is often used to compare the variability of datasets with different means or units of measurement.
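The three measures side by side, again with an invented dataset:

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

var_pop    = statistics.pvariance(data)  # divisor n (population variance)
var_sample = statistics.variance(data)   # divisor n - 1 (corrected sample variance)

sd = statistics.stdev(data)              # sample standard deviation
cv = sd / statistics.mean(data)          # coefficient of variation (unitless)
print(var_pop, var_sample, sd, cv)
```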
(3) Regarding the measures of shape:
(3.1) What is the purpose of measures of shape?
The purpose of measures of shape is to provide information about the distribution or shape of a dataset. Measures of shape help to describe the pattern or form of the distribution, which can be important in understanding the underlying nature of the data and drawing appropriate conclusions.
Some commonly used measures of shape include skewness and kurtosis, both of which are defined via moments of the distribution. Skewness measures the degree to which a distribution is asymmetrical, with positive skewness indicating a longer tail on the right side of the distribution and negative skewness indicating a longer tail on the left side. Kurtosis measures the degree to which a distribution is peaked or flat: high kurtosis indicates a tall peak and heavy tails, low kurtosis a flatter distribution with lighter tails. Moments are the underlying building blocks here: skewness and kurtosis correspond to the standardized third and fourth central moments, respectively.
Measures of shape are particularly useful in comparing different datasets, identifying outliers, and testing assumptions about the underlying population distribution. By taking into account the shape, central tendency, and dispersion of a dataset, researchers can gain a more complete understanding of the distribution and make more accurate inferences about the population from which the data was drawn.
(3.2) What is skewness (Schiefe)?
Skewness is a measure of the degree of asymmetry of a distribution. It quantifies how much a distribution deviates from a perfectly symmetric, bell-shaped curve.
A distribution can be either positively skewed or negatively skewed. In a positively skewed distribution, the tail is longer on the right-hand side of the curve and the majority of the data is concentrated on the left-hand side of the curve. This is also called a right-skewed distribution. In a negatively skewed distribution, the tail is longer on the left-hand side of the curve and the majority of the data is concentrated on the right-hand side of the curve. This is also called a left-skewed distribution.
Skewness is typically summarized by a skewness coefficient, a standardized measure of the degree of asymmetry. A simple version (Pearson's mode skewness) divides the difference between the mean and the mode by the standard deviation; the moment-based coefficient divides the third central moment by the cube of the standard deviation. A positive coefficient indicates a positively skewed distribution, while a negative coefficient indicates a negatively skewed distribution. A coefficient of zero indicates a perfectly symmetric distribution.
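A minimal sketch of both coefficients, reusing the invented right-skewed dataset from the location-rule example; both come out positive here.

```python
import statistics

def pearson_mode_skewness(data):
    # (mean - mode) / standard deviation, as described above.
    return (statistics.mean(data) - statistics.mode(data)) / statistics.pstdev(data)

def moment_skewness(data):
    # Standardized third central moment (population version).
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum((x - mean) ** 3 for x in data) / (n * sd ** 3)

data = [1, 2, 2, 2, 3, 3, 4, 8]  # right-skewed: both coefficients > 0
print(pearson_mode_skewness(data), moment_skewness(data))
```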
Skewness is an important measure of the shape of a distribution and can be used to identify potential outliers or anomalies in the data. It is also used in various statistical tests and models, including regression analysis and hypothesis testing.
(3.3) What is kurtosis (Wölbung)?
Kurtosis is a measure of the degree of peakedness or flatness of a probability distribution. It is a statistical measure that describes the shape of a distribution, particularly the tails of the distribution. A distribution with high kurtosis has a sharper peak and heavier tails, while a distribution with low kurtosis has a flatter peak and lighter tails.
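A sketch of the moment-based kurtosis coefficient (whether the book reports this raw version or the excess version that subtracts 3 is a detail of the text; both conventions exist):

```python
import statistics

def kurtosis(data):
    # Standardized fourth central moment (population version).
    # A normal distribution has kurtosis 3; "excess kurtosis" subtracts 3.
    n = len(data)
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    return sum((x - mean) ** 4 for x in data) / (n * sd ** 4)

print(kurtosis([1, 2, 2, 2, 3, 3, 4, 8]))  # > 3 here: heavier tails than normal
```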
Questions on chapter 10: This chapter focuses on the measures of correlation.
(1) What is the purpose of the measures of correlation?
In economics and the social sciences, measuring relationships between variables (characteristics) is an important goal. The main interest is whether a relationship exists, how strong it is, and in which direction it runs. The aim is to identify patterns in the joint occurrence of characteristics. Depending on the scale level, different measures are used: contingency measures for nominal-scaled, association measures for ordinal-scaled, and correlation measures for metric-scaled characteristics. The scale level of the characteristics under consideration is therefore of great importance for data analysis. Measures for interval-scaled variables should not be used for nominal-scaled characteristics; conversely, measures for lower scale levels can be calculated for higher scale levels, provided this makes sense. Important measures such as the Pearson correlation coefficient should therefore not be applied at lower scale levels, although this sometimes happens in practice.
(3) Go to section 10.14. Make sure that you understand what Pearson’s r is, what it tells us and how it should be interpreted.
Pearson's r is a measure of the strength and direction of the linear relationship between two continuous (metric) variables. It ranges from -1 to +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship between the variables.
The strength of the correlation can be interpreted by the magnitude of the value of r. Values close to -1 or +1 indicate a strong correlation, while values closer to 0 indicate a weak correlation. The exact interpretation of what constitutes a strong or weak correlation may vary depending on the field of study and the specific research question.
It’s important to note that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other. Other factors or variables may be involved in the relationship.
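A minimal, self-contained sketch of the computation (the two short series are invented for illustration):

```python
import math
import statistics

def pearson_r(x, y):
    # r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²);
    # the sample-size factors cancel, so no division by n is needed.
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(pearson_r(x, y))  # ≈ 0.77: a fairly strong positive linear relationship
```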
(4.1) What is the relation between cause and effect?
Correlation does not equal causation. A strong correlation between two characteristics X and Y may indicate a causal relationship, but this is not necessarily the case. Correlation refers only to a purely numerical relationship between variables, while causation implies a clear cause-and-effect relationship. Causal relationships can be one-sided or mutual, reversible (“if X then Y, and if Y then X”) or irreversible (“if X then Y, but not if Y then X”), as well as synchronous (the influence occurs at the same time) or diachronic (time-delayed). Moreover, a high correlation is not even a necessary condition for a causal relationship: causation can be present even where the measured correlation is low.
(4.2) What three different types of cause-and-effect relationships are there?
1) Random (stochastic) relationships
1) Random (stochastic) relationships: If one wants to explain a social phenomenon, such as voting for a particular political party (the dependent variable), then in a purely stochastic process this voting decision is independent of any other factors such as income or religion; only chance plays a role. This would imply that the voting decision for a party is not predictable. For social scientists, purely stochastic relationships are the worst of all possible “worlds,” as there is ultimately nothing left to explain: the probability of predicting the event Y is 0 in such a case, making prediction impossible.
(4.2) What three different types of cause-and-effect relationships are there?
2) Complete predictability (determinism) of a dependent variable
2) Complete predictability (determinism) of a dependent variable: A deterministic relationship is a relation that always holds in the specified way. In equation form, a deterministic statement is: Y = f(X), which means that Y is a function of X, and if X is true, then Y always holds. In this case, the variable Y depends only on the variable X. A deterministic statement would be: If the research participant belongs to the Catholic Church, they always vote for the CDU/CSU. From a probability theory perspective, this means a probability of 1 for the effect (vote for the CDU/CSU) when the cause (Catholic) is present. Such rigid statements leave no room for the factor of chance or other influencing factors that overlap with the factor of religion. Complex social science relationships can only be inadequately captured with deterministic statements.
(4.2) What three different types of cause-and-effect relationships are there?
3) Probabilistic cause-and-effect relationships
3) Probabilistic cause-and-effect relationships: In a probabilistic statement, one can assign a certain probability of occurrence to the event: “If X, then Y with a probability of p percent.” For example: “A Catholic votes for the CDU/CSU with a probability of 70 percent.” This is the predominant view of causality in the social sciences (Suppes 1970). Formulating causal relationships in terms of probabilities leaves room for other influencing factors.
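The 70-percent example can be read as a conditional probability estimated from a contingency table; a tiny sketch with invented counts:

```python
# Invented counts for the example above: 100 Catholic respondents,
# 70 of whom report voting CDU/CSU.
catholic_total = 100
catholic_cdu = 70

# Probabilistic statement "if Catholic, then CDU/CSU with probability p":
p = catholic_cdu / catholic_total
print(f"P(CDU/CSU | Catholic) = {p:.2f}")  # 0.70
```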