Chapter 2 Flashcards
What does exploratory data analysis do and how?
Takes the available information and analyses it to summarise the whole data set.
It uses descriptive statistical techniques.
How does EDA compare to data mining?
EDA uses only descriptive statistical techniques, whereas data mining uses both descriptive and inferential methods.
What is the purpose of EDA?
To describe the structure and the relationships present in the data, for eventual use in a statistical model
What is univariate exploratory data analysis?
Analysis of individual variables, one at a time.
It is an important step in preliminary data analysis.
What does univariate data analysis usually consist of?
Graphical displays and a series of summary indexes.
What kind of graphical univariate analysis is carried out for qualitative nominal data?
Bar charts and pie charts
What kind of graphical univariate analysis is carried out for ordinal qualitative and discrete quantitative variables?
Frequency diagrams (ie bar charts where order on the horizontal axis is important)
What are the main unidimensional / univariate statistical indexes?
- Measures of location
- Measures of variability
- Measures of heterogeneity
- Measures of concentration
What are the measures of location?
Mean, mode and median
What is the formula for the mean?
[See flashcard]
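The referenced flashcard is not reproduced here. As a minimal sketch, assuming the usual definition of the arithmetic mean (the sum of the observations divided by their number N), the three measures of location can be computed with Python's standard `statistics` module. The sample values are hypothetical.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of the observations divided by N
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

print("mean:", mean, "median:", median, "mode:", mode)
```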
How is univariate data classified?
In terms of a frequency distribution
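A frequency distribution can be sketched as a table of absolute frequencies (counts per level) and relative frequencies (counts divided by N). The observations below are made up for illustration.

```python
from collections import Counter

# Hypothetical observations of a qualitative variable.
observations = ["red", "blue", "red", "green", "red", "blue"]

counts = Counter(observations)   # absolute frequencies n_i
n = len(observations)            # total number of observations N
rel_freq = {level: c / n for level, c in counts.items()}  # p_i = n_i / N

for level in sorted(counts):
    print(level, counts[level], round(rel_freq[level], 3))
```

The relative frequencies sum to 1, and they are the Pi values used by the heterogeneity measures later in this chapter.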
What are the measures of variability?
The range, interquartile range and variance.
What is the range?
The difference between the maximum and minimum observations
What is the interquartile range?
The difference between the third and first quartile
What is the formula for the sample variance?
[See flashcard]
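The referenced flashcard is not reproduced here. The sketch below assumes the common definition of the sample variance (squared deviations from the mean, divided by N-1) and computes it alongside the range and interquartile range. Quartile conventions differ between textbooks and software, so the interquartile range here follows the default method of `statistics.quantiles`, which may not be the one used in the course. The data are hypothetical.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Interquartile range: third quartile minus first quartile
# (quartiles from the default "exclusive" method of statistics.quantiles).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance, assuming the N-1 divisor.
n = len(data)
mean = sum(data) / n
sample_var = sum((x - mean) ** 2 for x in data) / (n - 1)

print("range:", data_range, "IQR:", iqr, "sample variance:", sample_var)
```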
What types of data can measures of location and measures of variability be used for?
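Mainly for quantitative data.
The mean and the measures of variability cannot be used for qualitative data, although the mode can be reported for any variable and the median for ordinal ones.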
How can the dispersion of qualitative data be calculated?
With measures of heterogeneity, which describe how the observations are spread across the categories.
What is null heterogeneity?
When all observations have X equal to the same level (ie they all belong to the same category, so there is no spread).
Pi = 1 for a certain i
Pi = 0 for all other k-1 levels
What is maximum heterogeneity?
When all observations are uniformly distributed among the k levels.
Pi = 1/k for all i
How can you assess the dispersion of qualitative data?
- Gini index of heterogeneity
- Entropy
What is the formula for the Gini index of heterogeneity?
[See flashcard]
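The referenced flashcard is not reproduced here. The usual textbook form of the Gini index of heterogeneity, stated as an assumption, is written in terms of the relative frequencies of the k levels:

```latex
G = 1 - \sum_{i=1}^{k} p_i^{2}
```

Under this form, one level with Pi = 1 gives G = 0, and Pi = 1/k for all i gives G = 1 - k(1/k)^2 = 1 - 1/k.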
What does G = 0 represent?
Perfect homogeneity ie everything belongs to the same category
What does G = 1 - 1/k, or equivalently (k-1)/k, represent?
Maximum heterogeneity - all categories represented equally
How do we normalise the Gini index of heterogeneity?
By dividing by the maximum value
What values does the normalised Gini index take?
0 to 1
What is the formula for the normalised Gini index?
[See flashcard]
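The referenced flashcard is not reproduced here. As a sketch, assuming the index is G = 1 minus the sum of squared relative frequencies and that normalising means dividing by its maximum (k-1)/k, the two extreme cases can be checked numerically. The function names are illustrative, not taken from the source.

```python
def gini_heterogeneity(p):
    """Gini index of heterogeneity for relative frequencies p (assumed form)."""
    return 1 - sum(pi ** 2 for pi in p)

def normalised_gini(p):
    """Divide G by its maximum value (k - 1) / k; requires k >= 2 levels."""
    k = len(p)
    return gini_heterogeneity(p) / ((k - 1) / k)

null_case = [1.0, 0.0, 0.0, 0.0]     # all observations in one level
max_case = [0.25, 0.25, 0.25, 0.25]  # uniform over k = 4 levels

print(gini_heterogeneity(null_case), normalised_gini(null_case))
print(gini_heterogeneity(max_case), normalised_gini(max_case))
```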
What is the formula for Entropy?
[See flashcard]
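The referenced flashcard is not reproduced here. The standard definition of entropy for a qualitative variable with k levels, stated as an assumption, is:

```latex
E = -\sum_{i=1}^{k} p_i \log p_i , \qquad \text{with the convention } 0 \log 0 = 0
```

With this form, a single level with Pi = 1 gives E = 0, and Pi = 1/k for all i gives E = log k.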
What does E = 0 represent?
Perfect homogeneity
What does E = log k represent?
Maximum heterogeneity
What is the normalised index called?
The relative index of heterogeneity
How do you obtain the normalised entropy index?
Rescale by the maximum value (log k)
[See flashcard]
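The referenced flashcard is not reproduced here. A minimal sketch, assuming entropy is minus the sum of Pi times log Pi (with empty levels contributing zero) and that the normalised index divides by log k:

```python
import math

def entropy(p):
    """Entropy of relative frequencies p; levels with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def normalised_entropy(p):
    """Rescale by the maximum value log k; requires k >= 2 levels."""
    return entropy(p) / math.log(len(p))

null_case = [1.0, 0.0, 0.0, 0.0]     # all observations in one level
max_case = [0.25, 0.25, 0.25, 0.25]  # uniform over k = 4 levels

print(entropy(null_case), normalised_entropy(null_case))
print(entropy(max_case), normalised_entropy(max_case))
```

The base of the logarithm does not affect the normalised index, because it cancels in the ratio.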
What are the measures of concentration?
The Gini Coefficient R (a summary index of concentration)
What do measures of concentration describe?
How concentrated a characteristic (eg income) is among the N units, ie how unequally the total is shared out
What are the two extreme situations which can occur for measures of concentration?
Minimum concentration - equal income for all (everyone has the same salary): x1 = x2 = … = xN = x_bar
Maximum concentration - one person gets all the income: x1 = x2 = … = x(N-1) = 0 and xN = N*x_bar
The degree of concentration can lie between these two extremes
What is the equation for the Gini Coefficient R and the relevant steps associated with it?
[See flashcard]
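The referenced flashcard is not reproduced here. A common textbook formulation, stated as an assumption and consistent with the cards on Fi, Qi and the N-1 summation limit, orders the quantities so that x_1 <= x_2 <= ... <= x_N and then computes:

```latex
F_i = \frac{i}{N}, \qquad
Q_i = \frac{x_1 + x_2 + \dots + x_i}{N \bar{x}}, \qquad
R = \frac{\sum_{i=1}^{N-1} (F_i - Q_i)}{\sum_{i=1}^{N-1} F_i}
```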
What are the conditions for the variables investigated for the Gini Coefficient R?
There are N non-negative quantities measuring a transferable characteristic (eg a fixed amount of income shared among N individuals), arranged in non-decreasing order.
What is Fi?
The cumulative proportion of considered units, up to unit i
What is Qi?
The cumulative proportion of the characteristic that belongs to the first i units
What do you need to remember about the sums associated with the Gini concentration index R?
The sums run from i = 1 to N-1, so they do not include the final value (i = N)
What does R = 0 represent?
Minimum concentration
What does R = 1 represent?
Maximum concentration
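The two extremes can be checked with a short sketch. It assumes R is the sum of (Fi - Qi) divided by the sum of Fi, both taken over i = 1 to N-1, with Fi = i/N and Qi the cumulative share of the total held by the first i units. The function name and the income figures are hypothetical.

```python
def gini_concentration(values):
    """Gini concentration coefficient R under the assumed formulation."""
    x = sorted(values)  # non-decreasing order
    n = len(x)
    total = sum(x)
    if n < 2 or x[0] < 0 or total <= 0:
        raise ValueError("need at least two non-negative values with a positive total")

    cumulative = 0.0
    numerator = 0.0
    denominator = 0.0
    for i in range(1, n):          # i = 1 .. N-1, final unit excluded
        cumulative += x[i - 1]
        f_i = i / n                # cumulative proportion of units
        q_i = cumulative / total   # cumulative proportion of the characteristic
        numerator += f_i - q_i
        denominator += f_i
    return numerator / denominator

equal_incomes = [10, 10, 10, 10]  # minimum concentration
one_takes_all = [0, 0, 0, 40]     # maximum concentration
in_between = [1, 2, 3, 4]

print(gini_concentration(equal_incomes))
print(gini_concentration(one_takes_all))
print(gini_concentration(in_between))
```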
What is a challenge of multivariate data?
The sheer complexity of the information
What are the methodologies for exploring and simplifying complex multivariate data?
- Principal components
- Exploratory factor analysis
What is principal components analysis (PCA)?
A data-reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components.
What is exploratory factor analysis (EFA)?
A collection of methods designed to uncover the latent structure in a given set of variables.
It looks for a smaller set of underlying or latent constructs that can explain the relationships among the observed variables.
eg a dataset of 24 variables has intercorrelations that can be explained by 4 underlying factors.
What are principal components?
Uncorrelated composite variables, used to reduce dimensionality.
They aim to retain as much information from the original set of variables as possible.
They are linear combinations of the observed variables. The weights used to form the linear composites are chosen to maximise the variance each PC accounts for, while keeping the components uncorrelated.
What are factors?
Factors are assumed to underlie or “cause” the observed variables in exploratory factor analysis, rather than being linear combinations of them.
Errors represent the variance in the observed variables unexplained by the factors.
The factors and errors aren’t directly observable but are inferred from the correlations among the variables.
In a path diagram of the factor model, curved arrows between factors indicate that they are correlated.
Describe the process of principal component analysis
A statistical technique that linearly transforms an original set of p correlated variables into a new set of k uncorrelated variables called principal components. These are a substantially smaller set of variables that represent most of the information in the original set - they maximise the variance accounted for in the original p variables.
What order are the principal components derived in?
Decreasing order of importance, so that the 1st PC accounts for as much as possible of the variation in the original data, the 2nd for as much as possible of what remains, and so on.
What is the objective of PCA?
To see if the first few components account for most of the variation in the original data. If they do, then it is argued that the effective dimensionality of the problem is less than p (the original number of correlated variables).
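To make this concrete, here is a minimal sketch of PCA on two hypothetical correlated variables, using only the standard library. It assumes the components are taken from the sample covariance matrix (the cards do not say whether the variables are standardised first) and uses the closed-form eigen-decomposition of a symmetric 2x2 matrix, which only works for p = 2.

```python
import math

# Hypothetical data: two correlated variables measured on six units.
x = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

def centre(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def cov(u, v):
    # Sample covariance of two already-centred lists.
    return sum(ui * vi for ui, vi in zip(u, v)) / (len(u) - 1)

xc, yc = centre(x), centre(y)
a, b, c = cov(xc, xc), cov(xc, yc), cov(yc, yc)  # covariance matrix [[a, b], [b, c]]

# Eigenvalues of a symmetric 2x2 matrix in closed form, largest first.
mid = (a + c) / 2
half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
eigenvalues = [mid + half_gap, mid - half_gap]

def unit_eigenvector(lam):
    # (b, lam - a) is an eigenvector for eigenvalue lam when b != 0.
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

weights = [unit_eigenvector(lam) for lam in eigenvalues]

# Component scores: linear combinations of the centred variables.
scores = [[w[0] * xi + w[1] * yi for xi, yi in zip(xc, yc)] for w in weights]

# Share of the total variance accounted for by each component.
explained = [lam / sum(eigenvalues) for lam in eigenvalues]

print("proportion of variance per PC:", [round(e, 3) for e in explained])
print("covariance between PC1 and PC2:", round(cov(scores[0], scores[1]), 10))
```

If the first proportion is close to 1, one component summarises most of the variation and the effective dimensionality is lower than the number of original variables. For more than two variables a linear algebra library would normally do the eigen-decomposition.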
What are the goals of PCA?
Reduce the dimensionality of the original data set
A smaller set of uncorrelated variables is much easier to understand and use in further analysis than a larger set of correlated variables.
What does reducing the dimensionality of the problem do?
Simplifies the complexity of the data.
Makes it easier to visualise.
What do PCs reveal about the structure of the data?
Principal components expose the underlying structure in the data.
They are the directions where there is the most variance, the directions where the data is most spread out.