Data exploration and classification Flashcards
lecture 14
What is data exploration?
The process of examining data prior to formal, structured data analysis.
What is data classification in ArcGIS?
A tool, based on descriptive statistics, that can be used to explore spatial data.
What does data exploration include in GIS?
- In GIS it involves both spatial and attribute data (how & where?)
- Media used in GIS includes maps (spatial), graphs, and tables.
What is the crime rate like in Gauteng and where is the highest crime found in this province?
Shown by projecting the data on a map together with summary statistics.
What does data visualisation involve?
- Rendering: what to show in a graphic plot and what type of plot to make
- Manipulation: how to operate on individual plots and how to organise multiple plots
What are the fundamental tasks for data exploration?
- Finding patterns
- Posing queries, i.e. exploring data characteristics and data subsets
- Making comparisons, i.e. between variables or data subsets
Q: Which portion of my field produces the highest / lowest yield?
Q2: Why do certain portions of my land produce higher yields?
Q: Which areas of Tanzania are most suitable for growing Pinotage?
Q: How does wildfire susceptibility vary across a nature reserve?
Q: What is the groundwater recharge potential of the Winelands municipality?
Q: How do deforestation rates vary across the Peruvian Amazon?
What types of statistics are used in spatial data exploration?
They can be:
- Descriptive
- Inferential
What are descriptive statistics?
Statistics that provide a statistical summary of a dataset (summary statistic)
1. Measures of central tendency - describe data by identifying its central position.
2. Measures of dispersion
3. Skewness
4. Kurtosis
What are inferential statistics?
Statistics used for generalizing from a sample to a population with a calculated degree of certainty, i.e. for drawing conclusions.
What are measures of central tendency?
Median, mode, mean
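A quick illustration of the three measures, as a minimal sketch using Python's standard `statistics` module. The values are hypothetical and not from the lecture.

```python
import statistics

# Hypothetical attribute values (e.g. yield per plot); not from the lecture.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # arithmetic average of all values
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequently occurring value

print(mean, median, mode)
```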
What are measures of dispersion?
They describe the statistical spread or distribution of a dataset.
They include:
1. Standard deviation
2. Variance
3. Standardised score (z score)
They show the spread of, or trends in, the data and can be used to identify outliers.
What is the standard deviation?
Shows how much variation or “dispersion” exists from the average.
What is the variance?
Measure of how far a set of numbers is spread out: the average squared deviation from the mean. The standard deviation is its square root.
What is the standard score (z score)?
The standardized or z score informs how many standard deviations a
reading is above or below the mean.
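The three dispersion measures can be computed together, as in this minimal sketch. It assumes the population formulas (dividing by n); the cards do not say whether the population or the sample version is meant, and the values are hypothetical.

```python
import statistics

# Hypothetical readings; not from the lecture.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
variance = statistics.pvariance(values)  # population variance (divide by n)
std_dev = statistics.pstdev(values)      # population standard deviation

# z score: how many standard deviations each reading lies from the mean.
z_scores = [(v - mean) / std_dev for v in values]

print(variance, std_dev, z_scores)
```

A reading with a large absolute z score (a common rule of thumb is beyond 2 or 3) is a candidate outlier.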
What is classification?
The process of reducing a large number of individual quantitative values to a smaller number of ordered categories, each of which covers a portion of the original data value range.
What are the different types of classification?
Each classification type divides the data value range in a different way. The methods are used mostly for interval and ratio data:
1. Natural breaks
2. Equal interval classes
3. User defined
4. Quantiles
5. Mean and standard deviation
6. Geometric Interval
What is the fundamental principle of classification?
- Each of the original (unclassed) data values must fall into one of the classes (the classes are exhaustive)
- None of the original data values falls into more than one class (the classes are mutually exclusive)
- Classes are therefore always mutually exclusive and exhaustive.
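One way to guarantee this principle in practice is to define classes by their upper bounds only, so that every value lands in exactly one class. The sketch below is an illustration under that assumption; the function name `assign_class` and the break values are hypothetical.

```python
from bisect import bisect_left

def assign_class(value, upper_bounds):
    """Return the index of the single class that contains value.

    Class i covers values greater than upper_bounds[i - 1] and up to and
    including upper_bounds[i]; the first class has no lower limit here.
    """
    if value > upper_bounds[-1]:
        raise ValueError("value lies above the highest class (not exhaustive)")
    # First index whose upper bound is >= value, so a value equal to a
    # break goes to the lower class and never to two classes at once.
    return bisect_left(upper_bounds, value)

# Hypothetical upper bounds for three classes.
upper_bounds = [10, 20, 30]
class_ids = [assign_class(v, upper_bounds) for v in [0, 10, 11, 20, 30]]
print(class_ids)
```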
Deciding the number of classes:
Rules of thumb:
* Monochrome color schemes: No more than 5 to 7 classes.
* Multi-hue map: No more than 9
Need to consider:
* Communication goal?
* Complexity of Spatial Pattern
* Available Symbol Types
What is quantitative precision?
Communication goal:
* Use a larger number of class intervals.
* Each class will represent a relatively small range of the original data values and will therefore represent those values more precisely.
Trade offs:
* Too much information
* Indistinct symbols
What is immediate graphic impact?
Communication goal:
* Use a smaller number of class intervals.
* Each class will be graphically clear, but will be imprecise quantitatively.
Trade offs:
* Potential for oversimplification
* One class may include wildly varying data values
What is Jenks natural breaks?
- Jenks natural breaks is the default classification method in ArcGIS
- Minimum variation in value within classes.
- Maximum variation in value between classes.
- The method seeks to reduce the variance within classes and maximize the variance between classes.
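The idea of minimizing within-class variation can be shown with a brute-force sketch: try every way of cutting the sorted values into the requested number of classes and keep the split with the smallest total within-class sum of squared deviations. This is an illustration of the criterion only, practical for small datasets, and is not the algorithm ArcGIS uses internally. The function names and values are hypothetical.

```python
from itertools import combinations

def within_class_ss(group):
    """Sum of squared deviations of a class from its own mean."""
    mean = sum(group) / len(group)
    return sum((v - mean) ** 2 for v in group)

def natural_breaks_brute_force(values, n_classes):
    """Exhaustively search for the split with minimum within-class variation."""
    data = sorted(values)
    best_groups, best_cost = None, float("inf")
    # Each combination is a set of cut positions between sorted values.
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        edges = (0, *cuts, len(data))
        groups = [data[a:b] for a, b in zip(edges, edges[1:])]
        cost = sum(within_class_ss(g) for g in groups)
        if cost < best_cost:
            best_groups, best_cost = groups, cost
    return best_groups

# Hypothetical values with three obvious clusters.
groups = natural_breaks_brute_force([1, 2, 3, 10, 11, 12, 50, 51], 3)
print(groups)
```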
What are the advantages of natural breaks?
- Maximizes the similarity of values within each class
- Increases the precision of the map given the number of classes
What are the disadvantages of natural breaks?
- Class breaks often look random
- Need to explain the method
- Method will be difficult to grasp for those lacking a background in statistical methods.
What is equal interval classification?
- Each class represents an equal portion of original data range.
- Also called equal size or equal width classification
Calculation:
1. Determine range of original values {Range = Max – Min}
2. Decide Number of classes, {N}
3. Calculate class width:
{CW = Range / N}
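The calculation above can be sketched in Python as follows. The function name `equal_interval_breaks` and the sample values are hypothetical.

```python
def equal_interval_breaks(values, n_classes):
    """Return the class boundaries for an equal interval classification."""
    lo, hi = min(values), max(values)
    class_width = (hi - lo) / n_classes  # CW = Range / N
    # N classes need N + 1 boundaries, from the minimum to the maximum.
    return [lo + i * class_width for i in range(n_classes + 1)]

# Hypothetical data: range 100, four classes, so each class is 25 wide.
breaks = equal_interval_breaks([0, 12, 37, 58, 100], 4)
print(breaks)
```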
What are the advantages of equal interval?
- Easy to understand, intuitive appeal
- Each class represents an equal range or amount of the original data range
- Good for rectangular data distributions.
What are the disadvantages of equal intervals?
- Rectangular (evenly spread) distributions do not often occur in geographic phenomena
- Not good for skewed data distributions.
What is defined interval classification?
The map author specifies an interval (i.e. the class size) by which to equally divide a range of values.
* Intervals may need to be altered to fit the range of the data
* Different from equal interval, where the user specifies the number of classes
* ArcMap automatically determines the number of classes based on the interval.
Calculation:
1. Set interval size
2. Determine range of original data values:
{Range = Maximum – Minimum}
3. Calculate number of classes:
{N = Range / CW}
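A minimal sketch of this calculation. Rounding the number of classes up and starting the first class at the data minimum are assumptions made for illustration; ArcMap may anchor and round the breaks differently. The function name and values are hypothetical.

```python
import math

def defined_interval_breaks(values, interval):
    """Return (number of classes, class boundaries) for a fixed class size."""
    lo, hi = min(values), max(values)
    # N = Range / CW, rounded up so the last class still covers the maximum.
    n_classes = max(1, math.ceil((hi - lo) / interval))
    breaks = [lo + i * interval for i in range(n_classes + 1)]
    return n_classes, breaks

# Hypothetical data: range 92 with a class size of 25 needs four classes.
n_classes, breaks = defined_interval_breaks([3, 18, 42, 95], 25)
print(n_classes, breaks)
```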
What are the advantages of the defined interval?
- Easy to understand, intuitive appeal
- Each class represents a specified amount
- Good for rectangular data distributions
- Example: Good for data with “assumed” breaks
What are the disadvantages of the defined intervals?
- Not good for skewed data distributions: many classes will be empty and not mapped.
What is quantile classification?
- Places an equal number of cases in each class
- Sets class break points wherever they need to be in order to accomplish this
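A minimal sketch of the idea: sort the values and cut the sorted list by rank so that each class receives the same number of cases. Tied values are not treated specially here, which is a simplification. The function name and values are hypothetical.

```python
def quantile_classes(values, n_classes):
    """Split the sorted values into classes holding equal numbers of cases."""
    data = sorted(values)
    n = len(data)
    classes = []
    for i in range(n_classes):
        start = i * n // n_classes
        end = (i + 1) * n // n_classes
        classes.append(data[start:end])
    return classes

# Hypothetical skewed data: ten cases in five classes, two cases per class.
classes = quantile_classes([1, 2, 4, 7, 50, 60, 70, 800, 900, 1000], 5)
print(classes)
```

Note how irregular the resulting break points are for skewed data: the fourth class spans 70 to 800.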
What are the advantages of quantiles?
- Each class has equal representation on the map
- Intuitive appeal: map readers like to be able to identify the “top 20%” or the “bottom 20%”
- Example: Very useful for ordinal data.
What are the disadvantages of quantiles?
- Very irregular break points unless data have rectangular distribution.
- Breaks can sometimes lead to an over-weighting of the outlier in that class division.
What is Mean and Standard Deviation classification?
Places break points at the mean and at various standard deviation intervals above and below the mean.
Mean: measure of central tendency
Standard deviation: measure of variability
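A minimal sketch of where the breaks fall. It assumes breaks at whole standard deviation steps and the population standard deviation; GIS software may offer other step sizes (such as half a standard deviation). The function name and values are hypothetical.

```python
import statistics

def mean_sd_breaks(values, n_sd=2):
    """Return break points at the mean and at whole SD steps around it."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumed)
    # Breaks at mean - n_sd*SD, ..., mean, ..., mean + n_sd*SD.
    return [mean + k * sd for k in range(-n_sd, n_sd + 1)]

# Hypothetical values with mean 5 and standard deviation 2.
breaks = mean_sd_breaks([2, 4, 4, 4, 5, 5, 7, 9])
print(breaks)
```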
What are the advantages of the mean and SD?
- Shows how much the feature’s attribute value varies from the mean
- Useful to emphasize which observations are above the mean and which observations are below the mean
- Example: Income and education levels
What are the disadvantages of mean and SD?
- Many map readers are not familiar with the concept of the standard deviation
- Not good for skewed data.
What is the geometric interval classification?
- Used for visualizing continuous data that is not distributed normally
- The width of each succeeding class interval is larger than the previous interval by a constant amount.
Calculating the constant amount, CW:
* Decide on the number of classes, N.
* Calculate the range: R = Max - Min
* Solve R = CW + 2CW + . . . + NCW for CW, which gives CW = 2R / (N(N + 1))
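A minimal sketch of this calculation, following the formula on this card, in which class k has a width of k times CW. The geometric interval method implemented in ArcGIS may compute its breaks differently, so treat this as an illustration of the card only. The function name and values are hypothetical.

```python
def increasing_width_breaks(values, n_classes):
    """Return (CW, class boundaries) where class k is k * CW wide."""
    lo, hi = min(values), max(values)
    data_range = hi - lo
    # R = CW * (1 + 2 + ... + N) and 1 + 2 + ... + N = N(N + 1) / 2.
    cw = 2 * data_range / (n_classes * (n_classes + 1))
    breaks = [lo]
    for k in range(1, n_classes + 1):
        breaks.append(breaks[-1] + k * cw)
    return cw, breaks

# Hypothetical data: range 150 and five classes give CW = 10.
cw, breaks = increasing_width_breaks([0, 5, 20, 150], 5)
print(cw, breaks)
```

The narrow classes sit at the low end of the range, which suits data whose values are concentrated there.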
What are the advantages of geometric intervals?
- Uneven, but regular class breaks
- Used for data that contains excessive duplicate values, e.g., 35% of the features have the same value
- Tends to even out class frequencies for skewed distributions while making class widths relatively small in areas where there is high frequency.
What are the disadvantages of geometric intervals?
- Uncommon
- Unequal width classes