Chapter 2 Flashcards
What does exploratory data analysis (EDA) do and how?
It takes the available information and analyses it to summarise the whole data set.
It uses descriptive statistical techniques.
How does EDA compare to data mining?
EDA uses only descriptive statistical techniques,
whereas data mining uses both descriptive and inferential methods.
What is the purpose of EDA?
To describe the structure and the relationships present in the data, for eventual use in a statistical model
What is univariate exploratory data analysis?
Analysis of the individual variables, one at a time.
It is an important step in preliminary data analysis.
What does univariate data analysis usually consist of?
Graphical displays and a series of summary indexes.
What kind of graphical univariate analysis is carried out for qualitative nominal data?
Bar charts and pie charts
What kind of graphical univariate analysis is carried out for ordinal qualitative and discrete quantitative variables?
Frequency diagrams (i.e. bar charts where the order on the horizontal axis is important)
What are the main unidimensional / univariate statistical indexes?
- Measures of location
- Measures of variability
- Measures of heterogeneity
- Measures of concentration
What are the measures of location?
Mean, mode and median
What is the formula for the mean?
[See flashcard]
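The flashcard formula is not reproduced here. As a minimal sketch, assuming the usual arithmetic mean (the sum of the observations divided by their number N), the three measures of location can be computed as follows. The data values are made up for illustration.

```python
import statistics

# Hypothetical sample of N = 7 observations (illustrative values only)
data = [2, 3, 3, 5, 7, 8, 14]

# Arithmetic mean: sum of the observations divided by N (assumed formula)
mean = sum(data) / len(data)

# Median: middle value of the sorted observations
median = statistics.median(data)

# Mode: most frequent value
mode = statistics.mode(data)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

The single large value 14 pulls the mean above the median, which is why the median is often preferred when the data contain outliers.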
How is univariate data classified?
In terms of a frequency distribution
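A frequency distribution can be sketched as a table of absolute frequencies n_i and relative frequencies p_i = n_i / N for each level. The observations below are hypothetical, and the names `counts` and `rel_freq` are chosen only for this illustration.

```python
from collections import Counter

# Hypothetical qualitative observations (illustrative only)
observations = ["red", "blue", "red", "green", "red", "blue"]

# Absolute frequencies n_i: how many times each level occurs
counts = Counter(observations)

# Relative frequencies p_i = n_i / N, which sum to 1
n = len(observations)
rel_freq = {level: count / n for level, count in counts.items()}

for level in counts:
    print(level, counts[level], round(rel_freq[level], 3))
```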
What are the measures of variability?
The range, interquartile range and variance.
What is the range?
The difference between the maximum and minimum observations
What is the interquartile range?
The difference between the third and first quartile
What is the formula for the sample variance?
[See flashcard]
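The flashcard formula is not reproduced here. The sketch below assumes the common N - 1 convention for the sample variance (the flashcard may divide by N instead) and uses made-up data. Quartile conventions also differ between textbooks and software; this example relies on the default method of Python's `statistics.quantiles`.

```python
import statistics

# Hypothetical sample (illustrative values only)
data = [2, 3, 3, 5, 7, 8, 14]
n = len(data)

# Range: maximum minus minimum
data_range = max(data) - min(data)

# Interquartile range: third quartile minus first quartile.
# statistics.quantiles(data, n=4) returns the three cut points [Q1, Q2, Q3].
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance, ASSUMING the N - 1 denominator:
# sum of squared deviations from the mean, divided by N - 1
mean = sum(data) / n
sample_variance = sum((x - mean) ** 2 for x in data) / (n - 1)

print("range:", data_range)
print("IQR:", iqr)
print("sample variance:", sample_variance)
```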
What types of data can measures of location and measures of variability be used for?
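Mainly for quantitative data: the mean and the measures of variability need numerical values.
The mode can also be found for qualitative data, and the median for ordinal data, but the other measures cannot be used for qualitative data.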
How can the dispersion of qualitative data be calculated?
With measures of heterogeneity, which describe how the observations are spread across the categories
What is null heterogeneity?
When all observations have X equal to the same level (i.e. they all belong to the same category, so there is no spread).
p_i = 1 for one level i (where p_i is the relative frequency of level i)
p_i = 0 for the other k-1 levels
What is maximum heterogeneity?
When all observations are uniformly distributed among the k levels.
p_i = 1/k for all i
How can you assess the dispersion of qualitative data?
- Gini index of heterogeneity
- Entropy
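The cards do not give a formula for entropy. A minimal sketch, assuming the standard definition E = -sum(p_i * log(p_i)) over the k levels, with terms where p_i = 0 counted as 0, shows how it behaves in the two extreme cases. The function name `entropy` and the example distributions are hypothetical.

```python
import math

def entropy(p):
    """Entropy of relative frequencies p (assumed formula: -sum p_i * log p_i).

    Levels with p_i = 0 contribute nothing to the sum.
    """
    return sum(-pi * math.log(pi) for pi in p if pi > 0)

k = 4

# Null heterogeneity: every observation falls in one level
null_case = [1.0, 0.0, 0.0, 0.0]

# Maximum heterogeneity: observations spread evenly over the k levels
max_case = [1 / k] * k

e_null = entropy(null_case)
e_max = entropy(max_case)

print("null heterogeneity:", e_null)
print("maximum heterogeneity:", e_max, "which equals log(k) =", math.log(k))
```

Under this definition the entropy is 0 for null heterogeneity and log(k) for maximum heterogeneity, so dividing by log(k) gives a relative index between 0 and 1.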
What is the formula for the Gini index of heterogeneity?
[See flashcard]
What does G = 0 represent?
Perfect homogeneity, i.e. everything belongs to the same category
What does G = 1 - 1/k, i.e. (k-1)/k, represent?
Maximum heterogeneity - all categories represented equally
How do we normalise the Gini index of heterogeneity?
By dividing it by its maximum value, (k-1)/k, which gives a relative index between 0 and 1
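The Gini formula itself is on the flashcard and is not reproduced here. As a sketch, assuming the usual form G = 1 - sum(p_i^2), which agrees with the values G = 0 and G = (k-1)/k stated above, the index and its normalised version can be checked on the two extreme cases. The function names and example distributions are hypothetical.

```python
def gini(p):
    """Gini index of heterogeneity (assumed formula: 1 - sum of p_i squared)."""
    return 1 - sum(pi ** 2 for pi in p)

def normalised_gini(p):
    """Gini index divided by its maximum (k - 1) / k; requires k > 1 levels."""
    k = len(p)
    return gini(p) / ((k - 1) / k)

k = 4

# Null heterogeneity: all observations in one category
null_case = [1.0, 0.0, 0.0, 0.0]

# Maximum heterogeneity: p_i = 1/k for every level
max_case = [1 / k] * k

g_null = gini(null_case)
g_max = gini(max_case)
g_max_normalised = normalised_gini(max_case)

print("G, null heterogeneity:", g_null)
print("G, maximum heterogeneity:", g_max, "vs (k-1)/k =", (k - 1) / k)
print("normalised G, maximum heterogeneity:", g_max_normalised)
```

Normalising makes the index comparable across variables with different numbers of levels, since the raw maximum (k-1)/k depends on k.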