Chapter 2 Flashcards

Question

What values does the normalised Gini index take?

Answer 1

[See flashcard]

Answer 2

[See flashcard]

Answer 3

Perfect homogeneity

Answer 4

Maximum heterogeneity

Answer 5

The relative index of heterogeneity

Answer 6

Rescale by the maximum value (log K) [See flashcard]
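
The rescaling idea can be sketched in code. The cards do not show the formulas, so this sketch assumes the usual textbook definitions: Gini heterogeneity G = 1 - sum(p_i^2) with maximum (K-1)/K, and entropy E = -sum(p_i * log p_i) with maximum log K. The function names and the counts are made up for illustration.

```python
import math

def gini_heterogeneity(counts):
    """G = 1 - sum(p_i^2), where p_i are the relative frequencies (assumed definition)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """E = -sum(p_i * log(p_i)), skipping empty categories (assumed definition)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def normalise(value, maximum):
    """Relative index: rescale by the maximum so the result lies in [0, 1]."""
    return value / maximum

K = 4
equal_counts = [25, 25, 25, 25]   # maximum heterogeneity
single_counts = [100, 0, 0, 0]    # perfect homogeneity

gini_equal = normalise(gini_heterogeneity(equal_counts), (K - 1) / K)
gini_single = normalise(gini_heterogeneity(single_counts), (K - 1) / K)
entropy_equal = normalise(entropy(equal_counts), math.log(K))
entropy_single = normalise(entropy(single_counts), math.log(K))
```

Under these assumptions both relative indices are 0 when all units fall in one category and 1 when the K categories are equally frequent.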

Answer 7

The Gini Coefficient R (a summary index of concentration)

Answer 8

They help understand the concentration of the characteristic among the N quantities

Answer 9

- Minimum concentration: everyone has the same income (equal salary for all), x1 = x2 = ... = xN = x_bar
- Maximum concentration: one unit gets all the income, x1 = x2 = ... = x(N-1) = 0 and xN = N * x_bar

The degree of concentration lies between these two extremes.

Answer 10

[See flashcard]

Answer 11

There are N non-negative quantities measuring a transferable characteristic (eg a fixed amount of income shared among N individuals), placed in increasing (non-decreasing) order.

Answer 12

The cumulative proportion of considered units, up to unit i

Answer 13

The cumulative proportion of the characteristic that belongs to the first i units

Answer 14

The sums run up to N-1; they don't include the final value

Answer 15

Minimum concentration

Answer 16

Maximum concentration
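
A minimal sketch of the concentration coefficient, assuming the common form R = sum(Fi - Qi) / sum(Fi) with both sums running over i = 1 to N-1. The cards do not show the formula, so this form and the function name are assumptions.

```python
def gini_concentration(values):
    """Gini concentration coefficient R from N non-negative quantities.

    Fi = i / N is the cumulative proportion of units up to unit i,
    Qi is the cumulative proportion of the characteristic held by the first i units.
    Assumed formula: R = sum(Fi - Qi) / sum(Fi) for i = 1..N-1.
    """
    xs = sorted(values)            # non-decreasing order
    n = len(xs)
    total = sum(xs)
    if n < 2 or total <= 0:
        raise ValueError("need at least two units and a positive total")
    running = 0.0
    numerator = 0.0
    denominator = 0.0
    for i in range(1, n):          # i = 1..N-1, the final value is excluded
        running += xs[i - 1]
        f_i = i / n
        q_i = running / total
        numerator += f_i - q_i
        denominator += f_i
    return numerator / denominator

r_equal = gini_concentration([5, 5, 5, 5])     # everyone has the same income
r_single = gini_concentration([0, 0, 0, 20])   # one unit holds everything
```

With equal incomes Fi = Qi for every i, so R = 0; when one unit holds everything Qi = 0 for i < N, so R = 1.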

Answer 17

The sheer complexity of the information

Answer 18

- Principal components
- Exploratory factor analysis

Answer 19

A data-reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components.

Answer 20

A collection of methods designed to uncover the latent structure in a given set of variables. It looks for a smaller set of underlying or latent constructs that can explain the relationships among the observed variables. eg a dataset of 24 variables has intercorrelations that can be explained by 4 underlying factors.

Answer 21

Uncorrelated composite variables, used to reduce dimensionality. They aim to retain as much information from the original set of variables as possible. They are linear combinations of the observed variables. The weights used to form the linear composites are chosen to maximise the variance each PC accounts for, while keeping the components uncorrelated.

Answer 22

Factors are assumed to underlie or "cause" the observed variables in exploratory factor analysis, rather than being linear combinations of them. Errors represent the variance in the observed variables unexplained by the factors. The factors and errors aren't directly observable but are inferred from the correlations among the variables. Curved arrows between factors indicate that they are correlated.

Answer 23

A statistical technique that linearly transforms an original set of p correlated variables into a new set of k uncorrelated variables called principal components. These are a substantially smaller set of variables that represent most of the information in the original set - they maximise the variance accounted for in the original p variables.

Answer 24

Decreasing order of importance so that the 1st PC accounts for as much as possible of the variation in the original data.

Answer 25

To see if the first few components account for most of the variation in the original data. If they do, then it is argued that the effective dimensionality of the problem is less than p (the original number of correlated variables).

Answer 26

Reduce the dimensionality of the original data set. A smaller set of uncorrelated variables is much easier to understand and use in further analysis than a larger set of correlated variables.

Answer 27

Simplifies the complexity of the data. Makes it easier to visualise.

Answer 28

Principal components are the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out.

Answer 29

Using eigenvectors and eigenvalues - we can decompose the set of data points into eigenvectors and eigenvalues.

Answer 30

Orthogonal to each other. The eigenvectors have to be able to span the whole [x-y] area. In order to do this most effectively, the directions need to be orthogonal. The eigenvectors then provide a much more useful set of axes to frame the data in.

Answer 31

- Eigenvector: a direction
- Eigenvalue: a number telling us how much variance there is in the data in that direction, ie how spread out the data is along that line

Answer 32

The eigenvectors with the highest eigenvalues

Answer 33

The same number of dimensions that the data set has. The eigenvectors put the data into a new set of dimensions, so these new dimensions have to be equal to the original amount of dimensions.

Answer 34

No, we are just looking at it from a different angle. We are shifting from one set of axes to another. We rearrange the axes to be along the eigenvectors. These new axes are much more intuitive to the shape of the data.

Answer 35

They are more intuitive to the shape of the data. However, the original axes were well defined (we explicitly measure these things), and the new axes are not. There is often a good reason why the new axes represent the data better, but the maths won't tell us why. Data scientists have to work out the meanings of the new axes.

Answer 36

Dimensionality reduction.

Answer 37

Reduces the data down into its basic components, stripping away any unnecessary parts.

Answer 38

The correlation matrix, R. This is the variance-covariance matrix of the standardised variables. We have to standardise the data matrix X (with n rows and p columns) to give the matrix Z, in which each column has mean 0 and variance 1.
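
A short sketch of the standardisation step, assuming NumPy is available. The data matrix is made up for illustration, and the sample (n-1) variance is an assumed convention.

```python
import numpy as np

# Hypothetical data matrix X: n = 6 rows (units), p = 3 columns (variables).
X = np.array([
    [170.0, 65.0, 30.0],
    [160.0, 58.0, 25.0],
    [180.0, 80.0, 41.0],
    [175.0, 72.0, 35.0],
    [165.0, 60.0, 52.0],
    [185.0, 90.0, 28.0],
])
n, p = X.shape

# Standardise each column to mean 0 and (sample) variance 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The variance-covariance matrix of Z is the correlation matrix R of X.
R = Z.T @ Z / (n - 1)
```

R computed this way should match `np.corrcoef(X, rowvar=False)`, with ones on the diagonal.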

Answer 39

It is a vector described by a linear combination of the variables. In matrix terms: Y1 = Z * a1
- Z: the standardised data matrix
- a1: the vector of coefficients (weights)

Answer 40

Chosen to maximise the variance of the variable Y1. The variance of Y1 is maximised when the weights are the eigenvector corresponding to the largest eigenvalue of the correlation matrix.

Answer 41

By the vector of weights a1 such that the variance of Y1 is maximised, under the constraint a1' a1 = 1 (a1 has unit length)

Answer 42

Y2 = Z * a2, where the vector of coefficients is chosen in such a way that the variance of Y2 is maximised, under the constraints a2' a2 = 1 and a2' a1 = 0 (they are perpendicular). It can be shown that a2 is the eigenvector (normalised and orthogonal to a1) corresponding to the second largest eigenvalue of R.

Answer 43

It is the linear combination Yv = Z * av, in which the vector of coefficients av is the eigenvector of R corresponding to the vth largest eigenvalue. This eigenvector is normalised and orthogonal to all the previous eigenvectors.

Answer 44

The vth eigenvalue: Var(Yv) = lambda-v

Answer 45

Cov(Yi, Yj) = 0 for i != j. It must be 0 since the components are uncorrelated (their weight vectors are perpendicular).

Answer 46

A diagonal matrix, where lambda-1 to lambda-k appear along the diagonal and all other values are 0.
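
The properties above can be checked numerically. This is a sketch assuming NumPy, with a made-up data matrix; `np.linalg.eigh` is used because R is symmetric, and the variable names are illustrative.

```python
import numpy as np

# Hypothetical data: 6 units, 3 correlated variables.
X = np.array([
    [170.0, 65.0, 30.0],
    [160.0, 58.0, 25.0],
    [180.0, 80.0, 41.0],
    [175.0, 72.0, 35.0],
    [165.0, 60.0, 52.0],
    [185.0, 90.0, 28.0],
])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

# Eigen-decomposition of the symmetric matrix R (eigh returns ascending order).
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]   # sort into decreasing order of importance
lam = eigenvalues[order]
A = eigenvectors[:, order]              # column v is the weight vector a_v

# Principal component scores: Yv = Z * av for every v at once.
Y = Z @ A

# The covariance matrix of the PCs should be diagonal with the eigenvalues on the diagonal.
cov_Y = np.cov(Y, rowvar=False)
```

Here `A.T @ A` is the identity (normalised, mutually orthogonal weights) and `cov_Y` equals `np.diag(lam)` up to rounding.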

Answer 47

(1/p) * sum(lambda-i), summed over i = 1 to k

eg going from 10 original variables to 3 PCs: the proportion of variability maintained by the 3 PCs is (L1 + L2 + L3) / 10

This expression is a cumulative measure of the quota of variability "reproduced" by the first k components, with respect to the overall variability present in the original data matrix. It can therefore be used as a measure of the importance of the chosen k PCs, in terms of the "quantity of information" maintained in passing from p variables to k components.
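
A worked version of the 10-variables-to-3-PCs example. The eigenvalues below are invented for illustration (they sum to p = 10, as the eigenvalues of a correlation matrix of 10 variables would).

```python
def proportion_explained(eigenvalues, k):
    """Share of total variability kept by the first k PCs: (1/p) * sum of the k largest eigenvalues."""
    p = len(eigenvalues)                      # for a correlation matrix the eigenvalues sum to p
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / p

# Hypothetical eigenvalues for p = 10 standardised variables.
eigs = [4.0, 2.5, 1.5, 0.6, 0.4, 0.3, 0.3, 0.2, 0.1, 0.1]

share_first_three = proportion_explained(eigs, 3)   # (4.0 + 2.5 + 1.5) / 10
```

With these made-up values the first three components keep 80% of the overall variability.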

Answer 48

Consider the general covariance between a PC and the original standardised variables Z. This helps us interpret what the new PCs mean - if PC1 is highly correlated with age, we know it tells us a lot about age.

Answer 49

Corr(Yj, Zi) = sqrt(lambda-j) * a-ij

Standardising does not change correlations, so the linear correlation between the PC Yj and the original variable Xi is the same: Corr(Yj, Xi) = Corr(Yj, Zi) = sqrt(lambda-j) * a-ij

Answer 50

The algebraic sign and value of the coefficient a-ij. This determines the sign and strength of the correlation between the jth PC and the ith original variable.

Answer 51

Portion of variability = sum(lambda-j * a-ij^2), summed over the retained components j = 1 to k. The portion of variability is the sum of the squares of the appropriate correlation terms. This describes the quota of variability of each original variable that is maintained in passing from the original variables to the principal components.
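
A sketch that checks the correlation formula and the per-variable portion of variability numerically, assuming NumPy and a made-up data matrix. The names `loadings` and `kept_by_two` are illustrative.

```python
import numpy as np

# Hypothetical data: 6 units, 3 variables.
X = np.array([
    [170.0, 65.0, 30.0],
    [160.0, 58.0, 25.0],
    [180.0, 80.0, 41.0],
    [175.0, 72.0, 35.0],
    [165.0, 60.0, 52.0],
    [185.0, 90.0, 28.0],
])
p = X.shape[1]
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
lam, A = eigenvalues[order], eigenvectors[:, order]
Y = Z @ A

# loadings[i, j] = sqrt(lambda-j) * a-ij, the correlation between variable i and PC j.
loadings = A * np.sqrt(lam)

# Empirical correlations between the original variables and the PC scores, for comparison.
empirical = np.corrcoef(np.hstack([Z, Y]), rowvar=False)[:p, p:]

# Portion of each variable's variability kept by the first k = 2 PCs (row sum of squares).
kept_by_two = (loadings[:, :2] ** 2).sum(axis=1)
```

If all p components are kept, each row sum of squares equals 1, so no information about any variable is lost.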

Answer 52

We can interpret each PC by referring it mainly to the variables with which it is strongly correlated.

Answer 53

That a few principal components would explain most of the variation in the original variables.

Answer 54

The total number of original variables - the information in the original variables has been redistributed.

Answer 55

Take the eigenvalue for that PC and divide it by the total number of original variables.

Answer 56

The variance accounted for by each principal component

Answer 57

Yv = Z * av, where av is a vector of coefficients: the eigenvector of R corresponding to the vth largest eigenvalue

Answer 58

The correlations of the original variables with the principal components

Answer 59

Corr(Xi, Yj) = sqrt(lambda-j) * a-ij, ie multiply the elements of the eigenvector by the square root of the corresponding eigenvalue.

Answer 60

Indicate that the corresponding original variable was important in defining that particular principal component the element relates to.

Answer 61

They will all be positive

Answer 62

They will be of about the same magnitude. That is if we were to create a principal component score using the elements of the eigenvector, it would essentially be an equally weighted average of each variable.

Answer 63

The largest principal component [in these circumstances]

Answer 64

It has to be orthogonal to the first (and to all other PCs). For this to hold, the sum of the cross products of the elements of one PC's eigenvector with those of another must be zero. So, since the first PC has all positive values, the second must be a mix of positive and negative values. When we have an overall size factor, the succeeding principal components, with their alternating positive and negative signs, are usually interpreted as contrasts.

Answer 65

They explain most of the variation in the set of variables. Frequently, the smaller components are more difficult to interpret as to what they represent.

Answer 66

The row sum of squares indicates how much variance for that variable is accounted for by the retained PCs. Eg if we retain 2 PCs, add together the two corresponding values in the row to see how much variation is maintained (cumulative).

Answer 67

We want to choose enough to adequately represent the data. There are many different criteria suggested to do this:
- Include any component with an eigenvalue (lambda-i) >= 1
- Cattell's scree criterion
- Retain enough PCs to account for at least X% of the overall variability
- Retain enough PCs to account for at least X% of the variation in each variable

Answer 68

The overall variance is equal to p, so the average variance per component is 1. A component worth retaining should account for at least this average, ie have an eigenvalue of at least 1.
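
Two of the retention criteria are easy to sketch in code: the eigenvalue >= 1 rule and the "at least X% of overall variability" rule. The eigenvalues and the 75% target are invented for illustration, and the function names are hypothetical.

```python
def count_by_eigenvalue_rule(eigenvalues, threshold=1.0):
    """Number of components whose eigenvalue is at least the average variance (1 for standardised data)."""
    return sum(1 for lam in eigenvalues if lam >= threshold)

def count_for_target_share(eigenvalues, target):
    """Smallest k such that the first k PCs account for at least `target` of the overall variability."""
    p = len(eigenvalues)                       # overall variance of p standardised variables
    cumulative = 0.0
    for k, lam in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += lam
        if cumulative / p >= target:
            return k
    return p

# Hypothetical eigenvalues for p = 10 standardised variables.
eigs = [4.0, 2.5, 1.5, 0.6, 0.4, 0.3, 0.3, 0.2, 0.1, 0.1]

k_eigenvalue_rule = count_by_eigenvalue_rule(eigs)
k_target_75 = count_for_target_share(eigs, 0.75)
```

The criteria need not agree in general; here both happen to suggest three components.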

Answer 69

Look at the scree plot of the ordered eigenvalues. The components which make up the steepest part of the curve are included, whilst those on the flatter part are discarded. Sometimes the elbow itself is included and sometimes it is not.

Answer 70

The eventual use of the PCs. If they were going to be used as independent variables in a regression analysis, then we might want to retain components such that all the variables are adequately represented by the PCs.

Answer 71

Individual rows

Answer 72

Check for outlying observations, searching for clusters and in general understanding the structure of the data.

Answer 73

- Participants forget to answer one or more questions
- Refuse to answer sensitive questions
- Grow fatigued and fail to complete a long questionnaire
- Study participants miss appointments or drop out
- Recording equipment fails
- Data miscoded
- Data may be lost for reasons you may never be able to ascertain

Answer 74

That you are working with complete data

Answer 75

- Identify the missing data
- Examine the causes of the missing data
- Delete the cases containing missing data, or replace (impute) the missing values with reasonable alternative data values

Answer 76

- Missing completely at random (MCAR)
- Missing at random (MAR)
- Missing not at random (MNAR)

It depends on how the missing data process is related to the underlying hypothetical complete data.

Answer 77

If the missingness of a variable Y is unrelated to either the value of Y or that of other measured variables, ie the observed data points are a simple random sample of the data had the data been complete. Missing cases are no different from non-missing cases. These can be thought of as randomly missing. The only real penalty in failing to account for missing data is loss of power.

Answer 78

- There are random missing questions throughout a survey

Answer 79

When the missingness of a variable Y is unrelated to the value of Y itself after conditioning on other observed values. ie missing data depends on known values and thus it is fully described by variables in the dataset. Accounting for the values which "cause" the missing data will produce unbiased results in an analysis.

Answer 80

- Education survey where the GCSE question was left blank because it didn't make sense for international respondents

Answer 81

When the missingness of a variable Y still depends on the value of Y even given the observed variables. When data is missing in an unmeasured fashion, this is also termed "non-ignorable". Since the missing data depends on events or items which the researcher has not measured, this is a damaging situation: the reason for the missingness can't be inferred from the dataset.

Answer 82

- Can verify whether data are MCAR or not
- Impossible to test the MAR mechanism (with exception)

Answer 83

- An accessible mechanism is one where the cause of missingness can be accounted for: MCAR and most MAR mechanisms
- An inaccessible mechanism is one where the missing data mechanism cannot be measured: nonignorable mechanisms and some MAR mechanisms

Often the missing data mechanism is made up of both accessible and inaccessible factors.

Answer 84

Identify the amount, distribution and pattern of missing data:
- What % of data is missing?
- Are the missing data concentrated in a few variables or widely distributed?
- Do the missing values appear to be random? (this may be seen when plotted)
- Does the data suggest a possible mechanism that's producing the missing values?
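
The first two questions can be answered with a few lines of code. This sketch uses plain Python with `None` standing for a missing value; the survey rows and variable names are invented for illustration.

```python
from collections import Counter

# Hypothetical survey data; None marks a missing value.
rows = [
    {"age": 34,   "income": 52000, "score": 7},
    {"age": 51,   "income": None,  "score": 5},
    {"age": None, "income": None,  "score": 6},
    {"age": 29,   "income": 41000, "score": None},
    {"age": 45,   "income": None,  "score": 8},
]
variables = ["age", "income", "score"]

def missing_share(rows, variable):
    """Fraction of cases in which `variable` is missing."""
    return sum(row[variable] is None for row in rows) / len(rows)

def missing_patterns(rows, variables):
    """Count how often each pattern of missingness (True = missing) occurs."""
    return Counter(tuple(row[v] is None for v in variables) for row in rows)

shares = {v: missing_share(rows, v) for v in variables}
patterns = missing_patterns(rows, variables)
overall_share = sum(shares.values()) / len(variables)
```

In this toy data the missing values are concentrated in `income`, which is the kind of pattern that would prompt a closer look at the mechanism.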

Answer 85

Delete the variables and continue the analysis normally

Answer 86

Limit the analysis to cases with complete data and still get reliable and valid results

Answer 87

Apply multiple imputation methods and arrive at valid conclusions

Answer 88

Turn to specialised methods or collect new data

Answer 89

- Depression study where questions describing depressed mood are omitted by depressed people and older people
- Cancer questionnaire where < 10 was omitted from dataset

Answer 90

- A rational approach for recovering data
- A traditional approach involving deleting missing data or simple data imputation
- A modern approach that is based on simulations to perform the missing data imputation

Answer 91

Using mathematical or logical relationships among variables to attempt to fill in or recover missing values. It may be exact or approximate.
- Eg for variables which are part of an equation, one can be calculated from the others
- Eg date of birth and age etc.
- Eg using numerical answers to answer binary questions
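
The date of birth and age example can be sketched as follows. The record layout and the reference (survey) date are assumptions made for illustration.

```python
from datetime import date

def age_on(date_of_birth, reference_date):
    """Age in completed years on the reference date."""
    years = reference_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

# Hypothetical record with a missing age but a known date of birth.
record = {"date_of_birth": date(1990, 6, 15), "age": None}
survey_date = date(2020, 6, 14)   # assumed date the questionnaire was completed

if record["age"] is None and record["date_of_birth"] is not None:
    record["age"] = age_on(record["date_of_birth"], survey_date)
```

This recovery is exact as long as the date of birth and the survey date are themselves recorded correctly.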

Answer 92

List-wise and pair-wise deletion methods. The advantage of these methods is that they are convenient and are standard options.

Answer 93

A case is dropped from an analysis because it has a missing value in at least one of the specified variables.

Answer 94

An approach which uses cases that contain some missing data, eg to find the correlation between two columns, use the subset of the dataset in which both of those columns are observed
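
The difference between the two deletion methods shows up in how many cases each keeps. A minimal sketch in plain Python, with invented rows and `None` marking a missing value; the function names are hypothetical.

```python
# Hypothetical survey data; None marks a missing value.
rows = [
    {"age": 34,   "income": 52000, "score": 7},
    {"age": 51,   "income": None,  "score": 5},
    {"age": None, "income": None,  "score": 6},
    {"age": 29,   "income": 41000, "score": None},
    {"age": 45,   "income": None,  "score": 8},
]

def listwise(rows, variables):
    """Keep only cases with no missing value in any of the specified variables."""
    return [row for row in rows if all(row[v] is not None for v in variables)]

def pairwise(rows, var_a, var_b):
    """Keep cases in which both variables of this particular pair are observed."""
    return [row for row in rows if row[var_a] is not None and row[var_b] is not None]

complete_cases = listwise(rows, ["age", "income", "score"])
age_score_cases = pairwise(rows, "age", "score")
age_income_cases = pairwise(rows, "age", "income")
```

Pair-wise deletion keeps more data for each statistic, but each pair is based on a different subset of cases.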

Answer 95

A correlation matrix (pair-wise correlations of all variables) that is computed as the first step

Answer 96

MCAR data. If this assumption does not hold, it can produce distorted parameter estimates. It is not recommended unless the proportion of missing data is very small.

Answer 97

- It allows you to use more of your data
- Each computed statistic may be based on a different subset of cases, which can be problematic (things may become irrelevant)

Answer 98

You substitute a value for each missing value. Standard statistical procedures for complete data analysis can then be used with the filled-in data set.
- Eg impute the variable mean of the complete cases
- Eg impute the mean conditional on observed values of other variables

Simple imputation does not reflect the uncertainty about the predictions of the unknown missing values, and the resulting estimated variances of the parameter estimates will be biased toward zero.
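
A small numerical illustration of why mean imputation understates variability. The values are invented, and `None` marks a missing value.

```python
import statistics

# Hypothetical variable with two missing values.
values = [2.0, 4.0, None, 6.0, None, 8.0]

observed = [v for v in values if v is not None]
observed_mean = statistics.mean(observed)

# Simple imputation: replace every missing value with the mean of the observed values.
imputed = [v if v is not None else observed_mean for v in values]

var_observed = statistics.variance(observed)   # sample variance of the 4 observed values
var_imputed = statistics.variance(imputed)     # sample variance of the 6 filled-in values
```

The mean is unchanged, but the imputed values add no spread while increasing n, so the variance of the filled-in data (4.0 here) is smaller than that of the observed values (about 6.67).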

Answer 99

Simulations

Answer 100

An approach to dealing with missing values based on repeated simulations. Instead of filling in a single value for each missing value, replace each missing value with a set of plausible values that represent the uncertainty about the right value to impute. The multiple imputed data sets are then analysed using standard procedures for complete data and combined.

Answer 101

Three:
1 - the missing data are filled in m times to generate m complete data sets
2 - the m complete data sets are analysed by using standard procedures
3 - the results from the m complete data sets are combined for the inference
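
The three phases can be sketched for the simplest case of estimating a mean. Two things here are assumptions not stated on the cards: the combination step uses Rubin's rules (total variance = within + (1 + 1/m) * between), and the imputation step is deliberately simplified, drawing from a normal distribution fitted to the observed values and ignoring the uncertainty in its parameters. The function name and data are hypothetical.

```python
import random
import statistics

def multiply_impute_mean(values, m=5, seed=0):
    """Toy multiple imputation for a mean; returns (pooled, within, between, total)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)

    estimates, variances = [], []
    for _ in range(m):
        # Phase 1: fill in the missing values with plausible random draws (simplified model).
        completed = [v if v is not None else rng.gauss(mu, sigma) for v in values]
        # Phase 2: analyse the completed data set with a standard complete-data procedure.
        estimates.append(statistics.mean(completed))
        variances.append(statistics.variance(completed) / len(completed))

    # Phase 3: combine the m results (Rubin's rules, assumed here).
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, within, between, total

values = [2.0, 4.0, None, 6.0, None, 8.0, 5.0, None, 7.0, 3.0]
pooled, within, between, total = multiply_impute_mean(values, m=5, seed=0)
```

The between-imputation term is what a single imputation cannot provide: it carries the extra uncertainty due to the missing values into the final variance.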

Answer 102

Types of missing data pattern / types of missingness

Answer 103

- Parametric regression method, which assumes multivariate normality
- Nonparametric method, which uses propensity scores

Answer 104

A Markov chain Monte Carlo (MCMC) method that assumes multivariate normality. This creates multiple imputations by using simulations from a Bayesian prediction distribution for normal data. Another way to handle a data set with arbitrary missing data pattern is to use the MCMC approach to impute enough values to make the missing data pattern monotone.

Answer 105

When a variable Y being missing for individual i implies that all subsequent variables are also missing for individual i. You have greater flexibility in your choice of strategies. You can implement a regression model without involving iterations as in MCMC.
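
A sketch of a check for this pattern, assuming each case is a list of values in a fixed variable order with `None` marking a missing value. The function name and data are made up.

```python
def is_monotone(rows):
    """True if, in every row, a missing value is followed only by missing values."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one breaks the monotone pattern.
                return False
    return True

# Hypothetical data: columns are variables Y1, Y2, Y3 in order.
monotone_rows = [
    [1.2, 3.4, 5.6],
    [0.7, 2.1, None],
    [1.9, None, None],
]
arbitrary_rows = [
    [1.2, None, 5.6],   # Y2 is missing but Y3 is observed
    [0.7, 2.1, None],
]

monotone_ok = is_monotone(monotone_rows)
arbitrary_ok = is_monotone(arbitrary_rows)
```

The check depends on the column order, so variables may need to be reordered before a pattern is judged non-monotone.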

(133 cards)