Stats Flashcards

1
Q

Nominal data

A

are classified into mutually exclusive groups or categories and lack intrinsic order. A zoning classification, a social security number, and sex are examples of nominal data. The labels of the categories do not matter and should not imply any order; even if one category is labeled 1 and another 2, those labels can be switched.

2
Q

Ordinal data

A

are ordered categories implying a ranking of the observations. Even though ordinal data may be given numerical values, such as 1, 2, 3, 4, the values themselves are meaningless; only the rank counts. So, even though one might be tempted to infer that 4 is twice 2, this is not correct. Examples of ordinal data are letter grades, suitability for development, and response scales on a survey (e.g., 1 through 5).

3
Q

Interval data

A

is data that has an ordered relationship where the differences between values have a meaningful interpretation. The typical example of interval data is temperature, where the difference between 40 and 30 degrees is the same as between 30 and 20 degrees, but 20 degrees is not twice as cold as 40 degrees.

4
Q

Ratio data

A

is the gold standard of measurement, where both absolute and relative differences have a meaning. The classic example of ratio data is a distance measure, where the difference between 40 and 30 miles is the same as the difference between 30 and 20 miles, and in addition, 40 miles is twice as far as 20 miles.

5
Q

Continuous variables

A

can take an infinite number of values, both positive and negative, with as fine a degree of precision as desired. Most measurements in the physical sciences yield continuous variables.

6
Q

Discrete variables

A

can only take on a finite or countable number of distinct values. An example is a count of events, such as the number of accidents per month. Such counts cannot be negative and only take on integer values, such as 1, 28, or 211.

7
Q

Binary or dichotomous variables

A

only take on two values, typically coded as 0 and 1.

8
Q

Descriptive Statistics

A

describe the characteristics of the distribution of values in a population or in a sample. For example, a descriptive statistic such as the mean could be applied to the age distribution in the population of AICP exam takers, providing a summary measure of central tendency (e.g., “on average, AICP test takers in 2018 are 30 years old”).

9
Q

Inferential Statistics

A

use probability theory to determine characteristics of a population based on observations made on a sample from that population. We infer things about the population based on what is observed in the sample. For example, we could take a sample of 25 test takers and use their average age to say something about the mean age of all the test takers.

10
Q

Distribution

A

is the overall shape of all observed data. It can be listed as an ordered table, or graphically represented by a histogram or density plot. A histogram groups observations in bins represented as a bar chart. A density plot is a smooth curve.
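
To make the binning idea concrete, here is a minimal sketch with NumPy and made-up data (the values and bin count are arbitrary):

```python
import numpy as np

data = [2, 3, 3, 4, 5, 5, 5, 6, 8, 9]   # made-up observations

# a histogram groups the observations into equal-width bins
counts, edges = np.histogram(data, bins=4)
print(counts)   # observations per bin: [3 4 1 2]
print(edges)    # bin boundaries, from min(data) to max(data)
```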

11
Q

Range

A

the difference between the largest and the smallest value.

12
Q

Normal or Gaussian distribution

A

also referred to as the bell curve. This distribution is symmetric and has the additional property that the spread around the mean can be related to the proportion of observations. More specifically, approximately 95% of the observations that follow a normal distribution are within two standard deviations of the mean.
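
As a quick numerical check of that property (a sketch using SciPy's standard normal; the exact figure is closer to 95.45%):

```python
from scipy.stats import norm

# proportion of a standard normal within two standard deviations of the mean
print(norm.cdf(2) - norm.cdf(-2))   # 0.9544...
```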

13
Q

Symmetric distribution

A

is one where an equal number of observations are below and above the mean (e.g., this is the case for the normal distribution).

14
Q

An asymmetric distribution

A

where there are either more observations below the mean or more above the mean is also called skewed.

15
Q

Skewed to the right

A

when the bulk of the values are below the mean. This tends to happen when the distribution is dominated by a few very large values (outliers), which pull the mean up and stretch the right tail.

16
Q

Skewed to the left

A

when the bulk of the values are above the mean. This tends to happen when a few very small values (such as zeros) pull the distribution to the left.

17
Q

Central tendency

A

is a typical or representative value for the distribution of observed values. There are several ways to measure central tendency, including mean, median, and mode. The central tendency can be applied to the population as a whole, or to a sample from the population. In a descriptive sense, it can be applied to any collection of data.

18
Q

Mean

A

is the average of a distribution. It is computed by adding up the values and dividing by the number of observations.
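
A minimal sketch with Python's standard library, using made-up ages:

```python
from statistics import mean

ages = [28, 30, 30, 32, 35]      # made-up sample
print(sum(ages) / len(ages))     # 31.0: add up the values, divide by the count
print(mean(ages))                # the stdlib helper gives the same result
```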

19
Q

Weighted mean

A

gives greater weight to specific entries, or uses representative values for groups of observations. For example, when computing the mean income across a number of counties, the value for each county could be multiplied by the number of people in the county, yielding a population-weighted mean. The mean is appropriate for interval and ratio scaled data, but not for ordinal or nominal data.
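
A sketch of the population-weighted mean described above, with invented county figures:

```python
import numpy as np

income = [50_000, 65_000, 40_000]   # mean income per county (made up)
pop = [10_000, 2_000, 8_000]        # population per county (made up)

# each county's income counts in proportion to its population
print(np.average(income, weights=pop))   # 47500.0
```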

20
Q

Median

A

is the middle value of a ranked distribution. The median is the only suitable measure of central tendency for ordinal data, but it can also be applied to interval and ratio scale data after they are converted to ranked values.
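
A minimal stdlib sketch; by convention, the two middle values are averaged when the number of observations is even:

```python
from statistics import median

print(median([1, 3, 5, 9, 11]))   # 5: the middle value of the ranked list
print(median([1, 3, 5, 9]))       # 4.0: average of the two middle values
```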

21
Q

Mode

A

is the most frequent value in a distribution. There can be more than one mode for a distribution. For example, the modes of [1, 2, 3, 3, 5, 6, 7, 7] are 3 and 7. The mode is the only measure of central tendency that can be used for nominal data, but it can also be applied to interval and ratio scale data.
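
The card's own example, checked with the standard library (multimode returns every mode, not just the first):

```python
from statistics import multimode

print(multimode([1, 2, 3, 3, 5, 6, 7, 7]))   # [3, 7]
```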

22
Q

Standard deviation

A

Square root of the variance. The standard deviation is in the same units as the original variable and is therefore often preferred.

23
Q

Variance

A

the average squared deviation from the mean. A larger variance means a greater spread around the mean (a flatter distribution); a smaller variance means a narrower spread (a spikier distribution).

24
Q

Coefficient of Variation

A

Measures the relative dispersion around the mean: the standard deviation divided by the mean.
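
A minimal sketch tying together the three spread measures from these cards (variance, standard deviation, coefficient of variation), using the stdlib's population variants and made-up data:

```python
from statistics import mean, pstdev, pvariance

x = [4, 8, 6, 5, 3, 7]    # made-up observations

var = pvariance(x)        # average squared deviation from the mean
sd = pstdev(x)            # square root of the variance, in the original units
cv = sd / mean(x)         # relative dispersion around the mean
print(var, sd, cv)
```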

25
Q

Z-score

A

This is a standardization of the original variable obtained by subtracting the mean and dividing by the standard deviation. As a result, the mean of the z-score is 0 and its variance (and standard deviation) is 1. The z-score in effect transforms the original measure into standard deviation units. For example, a z-score of more than 2 means the observation is more than two standard deviations away from the mean, a common rule of thumb for flagging it as an outlier.
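
A minimal NumPy sketch with made-up data; after standardizing, the values have mean 0 and standard deviation 1:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 30.0])     # made up; 30 is unusually large

z = (x - x.mean()) / x.std()                 # subtract mean, divide by std dev
print(z.round(2))                            # the 30 standardizes to about 1.96
print(z.mean().round(2), z.std().round(2))   # mean ~0.0, std 1.0
```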

26
Q

Inter-quartile range or IQR

A

This is the difference in value between the 75th percentile and the 25th percentile, i.e., between the 3/4 and 1/4 cut-off values in a set of ranked values. For example, if we have 20 observations ranked in increasing order, we take the fifth and fifteenth observations and compute the difference between their values. This is the inter-quartile range. The IQR forms the basis for an alternative concept of outliers. Two fences are computed as the first quartile less 1.5 times the IQR and the third quartile plus 1.5 times the IQR. Observations that fall outside these fences are termed outliers. This is visualized in a box plot (also called a box-and-whiskers plot).
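
A sketch of the IQR and the fences with NumPy; the data are made up, with one deliberately extreme value:

```python
import numpy as np

x = [1, 2, 4, 4, 5, 5, 6, 7, 8, 40]    # made up; 40 looks suspicious

q1, q3 = np.percentile(x, [25, 75])    # first and third quartiles
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # the two fences
print([v for v in x if v < lo or v > hi])  # [40] falls outside the fences
```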

27
Q

Statistical inference

A

is the process of drawing conclusions about the characteristics of a distribution from a sample of data. For example, we estimate the mean from a sample of data and make a statement about the value of the population mean.

28
Q

Hypothesis test

A

A statement about a particular characteristic of a population (or several populations). We distinguish between the null hypothesis (H0), i.e., the point of departure or reference, and the alternative hypothesis (H1), or the research hypothesis one wants to find support for by rejecting the null hypothesis.

One starts by setting up a condition that is used as a reference but is not that useful in and of itself. Typically, this consists of setting a characteristic of the distribution (such as the mean) equal to a given value (often zero). A hypothesis test then consists of finding evidence in the data that rejects this statement in the direction of the alternative (typically, an inequality). The statistical evidence only provides support to reject the null hypothesis, never to accept the alternative hypothesis (the latter is just used as a means to help in rejecting the null). An alternative hypothesis can be two-sided (differences in both directions are considered), or one-sided (only differences in one direction are considered, i.e., only larger than or smaller than, but not both).

29
Q

Sampling error

A

provides the connection between the sample and the population. Because a sample does not contain all the information in the population, any statistic computed from the sample will not be identical to the population statistic, but will show variation. That random variation is the sampling error; its distribution over repeated samples is the sampling distribution.

30
Q

Systematic error

A

occurs because our model (or its assumptions) is wrong. It is unrelated to the sample as such.

31
Q

Confidence interval

A

this constitutes a range around the sample statistic that contains the population statistic with a given level of confidence, typically 95% or 99%. So, instead of rejecting the null hypothesis with a given probability, we establish a range around the sample statistic, such as a sample average, that contains the population mean with a given probability. The width of the confidence interval depends critically on the sampling error. If the sampling error is large, there isn't much information in the sample relative to the population, so our statements about the latter will by necessity be vague (a wide confidence interval). On the other hand, with a smaller sampling error, we can make more precise statements. The sampling error is related to the sample size, with a larger sample resulting in a smaller error (as the sample grows larger, it approximates the actual population more closely).
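
A minimal sketch of a 95% confidence interval for a mean, using the normal approximation (1.96 standard errors) and a made-up sample:

```python
import numpy as np

ages = np.array([29, 31, 30, 33, 27, 35, 28, 32,
                 30, 31, 29, 34, 26, 33, 30, 28])   # made-up sample

se = ages.std(ddof=1) / np.sqrt(len(ages))   # standard error of the mean
lo, hi = ages.mean() - 1.96 * se, ages.mean() + 1.96 * se
print(f"95% CI: [{lo:.1f}, {hi:.1f}]")       # narrows as the sample grows
```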

32
Q

T-test

A

typically used to compare the means of two populations based on their sample averages. This is the so-called two-sample t-test (a one-sample t-test compares the sample average to a hypothesized value for the mean). So, the null hypothesis is that the two population means are equal. However, since we do not observe the actual means, but only the sample averages, we can only make a probabilistic statement about the equality. Each of the sample averages has its own sampling distribution. By comparing the two sampling distributions, we can make statements about the null hypothesis.
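
A two-sample t-test sketched with SciPy; the two samples are invented stand-ins for the two populations:

```python
from scipy import stats

a = [30, 32, 29, 35, 31, 33, 30, 34]   # sample from population A (made up)
b = [27, 29, 26, 30, 28, 27, 29, 25]   # sample from population B (made up)

# H0: the two population means are equal
t, p = stats.ttest_ind(a, b)
print(t, p)   # a small p-value is evidence against H0
```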

33
Q

ANOVA or analysis of variance

A

a more complex form of testing the equality of means between groups. The typical application is in a so-called treatment effects analysis where the outcome of a variable is compared between a treatment group and a control group (in medical experiments, this would be the placebo group). For example, we would compare the average speed of cars on a street before (control) and after a street calming infrastructure was put in place (treatment). It is thus similar to the case considered in a t-test, but it allows more complex categorization of the groups. Typically, one classifies the sample into several groups according to categorical variables and compares the mean outcome on a continuous outcome variable.
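
A one-way ANOVA sketch with SciPy, loosely following the traffic-calming example; all speeds are invented:

```python
from scipy import stats

before = [34, 36, 33, 38, 35]    # speeds before treatment (made up)
after = [28, 30, 27, 31, 29]     # speeds after treatment (made up)
control = [35, 34, 36, 33, 37]   # a comparison street left as-is (made up)

# H0: all group means are equal
f, p = stats.f_oneway(before, after, control)
print(f, p)
```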

34
Q

Chi Square test

A

a measure of fit. It is a test that assesses the difference between a sample distribution and a hypothesized distribution. A Chi Square test is often used to test the null hypothesis of independence in a contingency table, i.e., when the observations are grouped according to two categorical variables. The observed proportions are compared to the proportions we would expect if the two classifications were independent.
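
A sketch of the independence test on a small contingency table with SciPy; the counts are invented:

```python
import numpy as np
from scipy import stats

# rows: renters / owners; columns: support / oppose (made-up counts)
table = np.array([[30, 10],
                  [20, 40]])

chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)   # 'expected' holds the counts implied by independence
```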

35
Q

Chi Square distribution

A

a skewed distribution obtained by summing the squares of one or more independent standard normal variables (so, it only takes positive values). Under the null hypothesis, the Chi Square test statistic follows a Chi Square distribution.
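
A quick simulation of that construction for the one-degree-of-freedom case (a sketch; the sample size and seed are arbitrary):

```python
import numpy as np
from scipy import stats

z = stats.norm.rvs(size=100_000, random_state=0)   # standard normal draws
sq = z ** 2                                        # squares follow a Chi Square, df=1
print(np.mean(sq <= stats.chi2.ppf(0.95, df=1)))   # close to 0.95
```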

36
Q

Correlation coefficient

A

Measures the strength of a linear relationship between two variables. Very importantly, this does not imply anything about causation, i.e., whether one variable influences the other. Also, the correlation coefficient only pertains to a linear relationship and can be misleading when the relationship is nonlinear. It is computed by standardizing each of the variables, and its value lies between -1 and +1.
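
A minimal NumPy sketch; corrcoef returns the full correlation matrix, so the off-diagonal entry is the coefficient:

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]            # made up, loosely increasing with x

r = np.corrcoef(x, y)[0, 1]    # always between -1 and +1
print(round(r, 3))
```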

37
Q

Linear regression

A

Models the linear relation between two or more variables. It hypothesizes a linear relationship between a dependent variable (on the left-hand side of the equals sign) and one or more explanatory variables (on the right-hand side of the equals sign).
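
A sketch of a simple two-variable regression with SciPy's linregress; the data are made up:

```python
from scipy import stats

x = [1, 2, 3, 4, 5]               # explanatory variable (made up)
y = [2.1, 3.9, 6.2, 8.0, 9.8]     # dependent variable (made up)

res = stats.linregress(x, y)
# fitted line: y = intercept + slope * x
print(res.slope, res.intercept, res.rvalue)
```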