ST102.1 - Data visualisation and descriptive statistics Flashcards

Question

How do you use summation notation to find sums of sets of observations other than 1 to n?

Answer 1

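Set the lower and upper limits of the summation index to the first and last observations you want to include. For example: $\sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}$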
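
A partial sum like this can be checked with a list slice. This is only a sketch: the data values are invented, n is assumed to be even so that n/2 is a whole number, and because Python lists are 0-indexed, X_i is stored at x[i - 1].

```python
# Hypothetical sample of n = 8 observations (values invented for illustration).
x = [3, 1, 4, 1, 5, 9, 2, 6]
n = len(x)

# Sum of X_2, ..., X_{n/2}. X_i is stored at x[i - 1], so the slice
# x[1 : n // 2] picks out X_2, X_3 and X_4.
partial_sum = sum(x[1 : n // 2])

print(partial_sum)  # 1 + 4 + 1 = 6
```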

Answer 2

The sample mean (‘arithmetic mean’, ‘mean’ or ‘average’) is the most common measure of central tendency.
- The sample mean of a variable X is denoted $\bar{X}$.
- It is the sum of the observations ($\sum X_i$) divided by the number of observations $n$ (the sample size): $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
EXAMPLE. The mean $\bar{X}$ of the numbers 1, 4 and 7 is (1 + 4 + 7)/3 = 4.

Answer 3

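A function f is called a linear operator if it has the two properties:
- f(x + y) = f(x) + f(y) for all x and y;
- f(cx) = c f(x) for all x and all constants c.
Summation is a linear operator, so a sum of differences can be split into a difference of sums.
EXAMPLE. $\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X} = n\bar{X} - n\bar{X} = 0$, since $\sum_{i=1}^{n} X_i = n\bar{X}$ by the definition of the sample mean.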
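
The fact that deviations from the sample mean sum to zero can be checked numerically. A minimal sketch, using the numbers 1, 4 and 7 as illustrative data; with other data, floating-point rounding may leave a tiny non-zero remainder.

```python
# Illustrative data: the numbers 1, 4 and 7.
x = [1, 4, 7]
n = len(x)

mean = sum(x) / n                     # sample mean, 4.0
deviations = [xi - mean for xi in x]  # [-3.0, 0.0, 3.0]
total = sum(deviations)

print(total)  # 0.0
```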

Answer 4

Order statistics are used to calculate the sample median. Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote the sample values of X when ordered from the smallest to the largest. These are known as the order statistics, such that:
- $X_{(1)}$ is the smallest observed value (the minimum) of X. This is not to be confused with $X_1$, which is the first observed value of X; in general $X_{(1)} \neq X_1$.
- $X_{(n)}$ is the largest observed value (the maximum) of X.

Answer 5

The (sample) median, $q_{50}$, of a variable X is the value that is ‘in the middle’ of the ordered sample.
1. If n is odd, then $q_{50} = X_{((n+1)/2)}$.
- EXAMPLE. If n = 3, $q_{50} = X_{(2)}$.
- EXAMPLE. If n = 155, $q_{50} = X_{(78)}$, i.e. the 78th ordered value of X.
2. If n is even, then $q_{50} = (X_{(n/2)} + X_{(n/2+1)})/2$.
- EXAMPLE. If n = 4, $q_{50} = (X_{(2)} + X_{(3)})/2$, i.e. halfway between the 2nd and 3rd ordered values.
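
The two cases can be written out directly from the order statistics. This is a sketch; `sample_median` is a hypothetical helper name, and the comparison with the standard library's `statistics.median` is only a sanity check.

```python
import statistics


def sample_median(values):
    """Median computed from the order statistics X_(1) <= ... <= X_(n)."""
    ordered = sorted(values)  # ordered[k - 1] holds X_(k)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        # n odd: the single middle value X_((n+1)/2)
        return ordered[(n + 1) // 2 - 1]
    # n even: the average of X_(n/2) and X_(n/2+1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_case = sample_median([7, 1, 4])       # X_(2) of 1, 4, 7
even_case = sample_median([10, 1, 7, 4])  # (X_(2) + X_(3)) / 2 of 1, 4, 7, 10

print(odd_case, even_case)  # 4 5.5
print(statistics.median([10, 1, 7, 4]) == even_case)  # True
```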

Answer 6

In general, the mean is affected much more than the median by outliers, i.e. unusually small or large observations.
- Therefore, you should identify outliers early on and investigate them; perhaps there has been a data entry error, which can simply be corrected.
- If deemed genuine outliers, a decision has to be made about whether or not to remove them.
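
A small numerical illustration of this sensitivity, using invented data in which one value is mistyped as 60 instead of 6:

```python
import statistics

# Hypothetical data: the second list has 60 in place of 6,
# as might happen with a data entry error.
clean = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

print(mean_clean, mean_outlier)      # the mean moves from 4 to 14.8
print(median_clean, median_outlier)  # the median stays at 4
```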

Answer 7

Due to its sensitivity to outliers, the mean, more than the median, is pulled toward the long tail of the sample distribution.
- When summarising variables with skewed distributions, it is useful to report both the mean and the median.
1. For a positively skewed distribution, the mean is larger than the median.
2. For a negatively skewed distribution, the mean is smaller than the median.
3. For an exactly symmetric distribution, the mean and median are equal.

Answer 8

The (sample) mode of a variable is the value that has the highest frequency (i.e. appears most often) in the data.

Answer 9

Measures of dispersion describe how spread out the distribution of values of a particular variable is. Dispersion can be measured by several different statistics, such as the range, variance and standard deviation.

Answer 10

The sample variance of a variable X, denoted $S^2$ (or $S_X^2$), is defined as: $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

Answer 11

The sample standard deviation of X, denoted $S$ (or $S_X$), is the positive square root of the sample variance: $S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$
- The standard deviation is more understandable than the variance, because the standard deviation is expressed in the same units as X (rather than the variance, which is expressed in squared units).
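
The sample variance and standard deviation can be computed step by step from their definitions. A minimal sketch with the illustrative numbers 1, 4 and 7; the standard library's `statistics.variance` uses the same n − 1 divisor and is called only as a cross-check.

```python
import math
import statistics

# Illustrative data: the numbers 1, 4 and 7.
x = [1, 4, 7]
n = len(x)
mean = sum(x) / n  # 4.0

# Sum of squared deviations from the mean, divided by n - 1.
sample_variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)  # (9 + 0 + 9) / 2
sample_sd = math.sqrt(sample_variance)

print(sample_variance, sample_sd)                # 9.0 3.0
print(statistics.variance(x) == sample_variance) # True
```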

Answer 12

A useful rule-of-thumb for interpretation is that for many symmetric distributions, such as the ‘normal’ distribution:
- About 2/3 of the observations are between $\bar{X} - S$ and $\bar{X} + S$, that is, within one (sample) standard deviation about the (sample) mean.
- About 95% of the observations are between $\bar{X} - 2S$ and $\bar{X} + 2S$, that is, within two (sample) standard deviations about the (sample) mean.
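
The rule of thumb can be checked by simulation. This is a sketch under assumptions: the data are simulated from a normal distribution with an arbitrary mean of 50 and standard deviation of 10, and `share_within` is a hypothetical helper name.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable

# Simulated (not real) data: 10,000 draws from a normal distribution.
x = [random.gauss(50, 10) for _ in range(10_000)]
mean = statistics.mean(x)
sd = statistics.stdev(x)


def share_within(k):
    """Proportion of observations within k sample standard deviations of the mean."""
    inside = sum(mean - k * sd <= xi <= mean + k * sd for xi in x)
    return inside / len(x)


within_one = share_within(1)
within_two = share_within(2)

# Expect roughly 0.68 (about 2/3) and roughly 0.95.
print(round(within_one, 2), round(within_two, 2))
```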

Answer 13

Standard deviations (and variances) are never negative, and they are zero only if all the Xi observations are the same (that is, there is no variation in the data).

Answer 14

1. The first quartile, $q_{25}$ or $Q_1$, is the value which divides the sample into the smallest 25% of observations and the largest 75%.
2. The median, $q_{50}$, is the value which divides the sample into the smallest 50% of observations and the largest 50%.
3. The third quartile, $q_{75}$ or $Q_3$, gives the 75%–25% split.
4. If we consider other percentage splits, we get other (sample) quantiles (percentiles), $q_c$.
- The extremes in this spirit are the minimum, $X_{(1)}$ (the ‘0% quantile’, so to speak), and the maximum, $X_{(n)}$ (the ‘100% quantile’).
- These are no longer ‘in the middle’ of the sample, but they are more general measures of location of the sample distribution.

Answer 15

1. Range: $X_{(n)} - X_{(1)}$ = maximum − minimum.
- The range is extremely sensitive to outliers, since it depends on nothing but the extremes of the distribution, i.e. the minimum and maximum observations.
2. Interquartile range (IQR): IQR = $q_{75} - q_{25} = Q_3 - Q_1$.
- The IQR focuses on the middle 50% of the distribution, so it is largely insensitive to outliers.
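
The range and IQR can be compared on a small invented sample. This is a sketch under assumptions: software packages use different interpolation conventions for quartiles, so values may differ slightly elsewhere. Here the quartiles come from Python's `statistics.quantiles` with its default method.

```python
import statistics

# Hypothetical sample of 11 observations.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

q25, q50, q75 = statistics.quantiles(x, n=4)  # the three quartile cut points
sample_range = max(x) - min(x)
iqr = q75 - q25

print(q25, q50, q75)      # 3.0 6.0 9.0
print(sample_range, iqr)  # 10 6.0

# Replace the largest value with an extreme one: the range changes a lot,
# while the IQR is unchanged for this sample.
y = x[:-1] + [110]
y_q25, _, y_q75 = statistics.quantiles(y, n=4)
y_range = max(y) - min(y)
y_iqr = y_q75 - y_q25

print(y_range, y_iqr)  # 109 6.0
```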

Answer 16

A boxplot (in full, a box-and-whiskers plot) summarises some key features of a sample distribution using quantiles. The plot comprises the following:
- The line inside the box, which is the median.
- The box, whose edges are the first and third quartiles ($Q_1$ and $Q_3$). Hence the box captures the middle 50% of the data, and the length of the box is the interquartile range.
- The bottom whisker, which extends either to the minimum or up to a length of 1.5 times the interquartile range below the first quartile, whichever is closer to the first quartile.
- The top whisker, which extends either to the maximum or up to a length of 1.5 times the interquartile range above the third quartile, whichever is closer to the third quartile.
- Points beyond 1.5 times the interquartile range below the first quartile or above the third quartile, which are regarded as outliers and plotted as individual points.
Boxplots are useful for comparing how the distribution of a continuous variable varies across different groups, i.e. across different levels of a discrete variable.
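
The quantities behind a boxplot can be computed without drawing anything. This is a sketch on invented data that follows the whisker rule as stated on this card; the quartiles come from `statistics.quantiles` with its default method, and other software may use a different quartile convention. Some plotting software ends each whisker at the most extreme observation inside the 1.5 × IQR limit rather than at the limit itself.

```python
import statistics

# Hypothetical sample with one unusually large value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

q1, median, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1  # length of the box

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whisker ends: the minimum/maximum, or the 1.5 * IQR limit,
# whichever is closer to the box.
bottom_whisker = max(min(x), lower_limit)
top_whisker = min(max(x), upper_limit)

# Observations beyond the limits are plotted individually as outliers.
outliers = [xi for xi in x if xi < lower_limit or xi > upper_limit]

print(q1, median, q3)               # 3.0 6.0 9.0
print(bottom_whisker, top_whisker)  # 1 18.0
print(outliers)                     # [30]
```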

Answer 17

A scatterplot shows the values of two continuous variables against each other, plotted as points in a two-dimensional coordinate system.

Answer 18

A common special case of a scatterplot is a line plot (time series plot), where the variable on the x-axis is time. The points are connected in time order by lines, to show how the variable on the y-axis changes over time.

Answer 19

A (two-way) contingency table (or cross-tabulation) shows the frequencies in the sample of each possible combination of the values of two discrete variables. Such tables often show the percentages within each row or column of the table.
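
A two-way table of frequencies, with row percentages, can be built by counting pairs of values. A minimal sketch: the sample, the group labels "A"/"B" and the answers "yes"/"no" are all invented for illustration.

```python
from collections import Counter

# Hypothetical sample: each observation is a (group, answer) pair.
sample = [
    ("A", "yes"), ("A", "no"), ("A", "yes"),
    ("B", "no"), ("B", "no"), ("B", "yes"), ("B", "no"),
]

counts = Counter(sample)  # frequency of each (group, answer) combination
rows = sorted({group for group, _ in sample})
cols = sorted({answer for _, answer in sample})

# Frequencies: one row per group, one column per answer.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Percentages within each row.
row_pct = {
    r: {c: 100 * table[r][c] / sum(table[r].values()) for c in cols}
    for r in rows
}

for r in rows:
    print(r, table[r], row_pct[r])
```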


(43 cards)