Data Quality & Uncertainty Flashcards
Importance of Data Quality
- Automatic tendency to regard outputs as a form of truth
- How reliable are the results/outputs?
- Liability issues
What are the 4 components of data quality?
1) Accuracy
2) Precision
3) Error
4) Uncertainty
Liability
If the work is not done correctly, it can cause problems later
- Ex. a wrong datum caused arrests to be thrown out of court because the boundary the people crossed was not mapped in the correct spot
Data quality: Accuracy
- How closely do the data match the true values or descriptions?
- True for both spatial and attribute accuracy
How do we account for data quality?
- Usually best when personally collected
- Scale, who, what, why
- What was the data intended for, and can it be used for another purpose?
- Spatially, the data look like they should and sit where they should be
- On target but maybe not clustered (accurate without necessarily being precise)
Data quality: Precision
- Scale
- How exactly the data are measured (map sheet vs. lat/long vs. UTM meters)
- Higher level of precision moving from map sheet to meters (see the sketch below)
- Worst case: precise data that are inaccurate
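A minimal sketch of the precision ladder, where extra decimal places stand in for the map sheet -> lat/long -> UTM meters progression (the coordinate values are invented for illustration):

```python
# Hypothetical example: one location reported at increasing precision.
lat, lon = 44.563217, -123.281942  # illustrative coordinates

print(f"coarse (map-sheet level): {lat:.0f}, {lon:.0f}")   # ~100 km
print(f"medium (lat/long, 2 dp):  {lat:.2f}, {lon:.2f}")   # ~1 km
print(f"fine (metre level, 6 dp): {lat:.6f}, {lon:.6f}")   # ~0.1 m, like UTM meters
```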
Data quality: Error
- How far the data are actually from their true values
- Always present to some extent but does not fatally undermine GIS use
- Use statistics to determine whether the data can or cannot be used based on the type/size of error (size -> distance from the true value)
What are the 3 types of Error?
- Gross
- Systematic
- Random
Gross Error
Incredibly inaccurate
- easy to identify
Systematic Error
The exact same error on every piece of data
- X, Y accidentally set as Y, X
- Can be fixed/accounted for
Random Error
Not easy to find
- Could be one data point or attribute incorrectly entered (10.21 entered as 102.1)
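A minimal sketch tying the three error types together. The study-area extent, coordinates, and the 2-standard-deviation threshold below are all invented assumptions, not from the flashcards:

```python
import statistics

# Assumed study-area extent (illustrative values).
X_RANGE = (470_000, 490_000)       # easting extent
Y_RANGE = (4_930_000, 4_940_000)   # northing extent

points = [
    (478_352.1, 4_934_761.4),      # fine
    (4_934_760.9, 478_351.8),      # systematic error: X and Y swapped
    (478_353.0, 4_934_762.2),      # fine
    (9_999_999.0, 4_934_761.0),    # gross error: far outside any plausible extent
]

def in_extent(x, y):
    return X_RANGE[0] <= x <= X_RANGE[1] and Y_RANGE[0] <= y <= Y_RANGE[1]

clean = []
for x, y in points:
    if in_extent(x, y):
        clean.append((x, y))
    elif in_extent(y, x):          # systematic: a consistent swap can be fixed
        clean.append((y, x))
    else:                          # gross: easy to identify, discard or re-survey
        print("gross error dropped:", (x, y))

# Random error is the hard one: flag statistical outliers among the survivors.
xs = [x for x, _ in clean]
mu, sd = statistics.mean(xs), statistics.stdev(xs)
outliers = [p for p in clean if abs(p[0] - mu) > 2 * sd]
print("kept:", clean, "suspect:", outliers)
```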
Data quality: Uncertainty
Doubt due to incomplete knowledge (e.g., someone else collected the data); this is why metadata is essential
- Many issues in GIS have uncertainty underpinning them
- Prevalent in processes/transformations
- Model behaviour (know how the model works, not just what it does; link to documents describing the process and math involved, i.e. white papers)
Sources of Uncertainty
- Measurement
- GIS representation
- Reporting numbers (lines represent roads, but what is the width?)
Data Collection
Quality control at the 1st step
Data Input
Resolution when digitizing
- Boundaries (edge of forest by ownership vs. edge of trees; how many trees dictate a forest?)
Stages for Accounting for Data Quality
Real world - Inherent Uncertainty
Conception - Uncertainty in Conception
Measurement - Uncertainty in measurements
Analysis - Uncertainty in Analysis
- Acceptable values fall within/under a curve to deal with variability (ex. the diameter at breast height of a tree measured by different foresters can be OK if it falls within acceptable values; see the sketch below)
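A minimal sketch of the "acceptable values under a curve" idea, assuming a tolerance of 2 standard deviations around the mean (the DBH values and the tolerance are invented):

```python
import statistics

# Hypothetical example: diameter at breast height (DBH, cm) of the same tree
# measured by five different foresters.
dbh = [32.4, 32.9, 31.8, 32.6, 33.1]

mu, sd = statistics.mean(dbh), statistics.stdev(dbh)
for value in dbh:
    verdict = "acceptable" if abs(value - mu) <= 2 * sd else "outside tolerance"
    print(f"{value:.1f} cm -> {verdict}")
```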
Data quality: Positional Accuracy RMSE
- Square root of the average of the squared discrepancies in position (d) of well-defined points (n), determined from the map and compared to a higher-accuracy surveyed location for each point
- Quantifies the difference between the image (measured) and true (surveyed) positions, and is scale dependent
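Reconstructed as a formula from the description above, where d_i is the positional discrepancy of test point i and n is the number of well-defined test points:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} d_i^{\,2}},
\qquad
d_i = \sqrt{\left(x_i^{\text{map}} - x_i^{\text{survey}}\right)^{2}
          + \left(y_i^{\text{map}} - y_i^{\text{survey}}\right)^{2}}
```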
Fuzzy Sets
Defined by degree of membership
- Builds on Venn diagrams and set theory; AND/OR logic maps to SQL queries
- probability (%) that something belongs to that category
- S-Curve (Venn diagram that acknowledges uncertainty)
- Can have partial membership in a set with yes, no, and maybe
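A minimal sketch of S-curve membership, using a logistic function for the "how many trees dictate a forest?" question above (the midpoint and steepness parameters are invented):

```python
import math

# Hypothetical S-curve membership function: the degree (0..1) to which a stand
# counts as "forest" based on tree density.
def forest_membership(trees_per_ha, midpoint=100.0, steepness=0.05):
    """Logistic S-curve: ~0 = no, ~1 = yes, values in between = maybe."""
    return 1.0 / (1.0 + math.exp(-steepness * (trees_per_ha - midpoint)))

for density in (10, 80, 100, 150, 400):
    print(f"{density:3d} trees/ha -> membership {forest_membership(density):.2f}")
```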
Uncertainty
Degree to which the measured value is estimated to vary from the true value
- Arises from a variety of sources, including limitations on the precision or accuracy of the measuring system
- Often used to describe the degree of accuracy of a measurement
Why would you choose a point, line, or polygon for the data?
Based on the purpose for the data
S-Curve
Venn diagram with the uncertainty acknowledged
Advantages of fuzzy sets
- Acknowledging uncertainty upfront
- Membership can be adjusted if more info becomes available
What is the drawback?
People! (numbers from feelings)
- Probability values can reflect the way individuals state how they feel
- i.e. one person can state 90% surety while another with the same confidence can state 99% surety
- But which is it?
G.C.S
Geographic Coordinate System
- ex. Lat/Long
MAUP
Modifiable Areal Unit Problem
- Classic source of error
- Especially when aggregating spatial units
- Data can be the same but aggregated differently
- Choices of areal units tend to be dominated by what is available rather than what is best (Ex. crime within an administrative boundary can be 5% of properties broken into, but in a concentrated area of the boundary it can be up to 20%; see the sketch below)
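A minimal sketch of MAUP using the break-in example above; the block sizes and counts are invented to reproduce the 5% vs. 20% contrast:

```python
# Hypothetical data: (properties, break_ins) for four blocks inside one
# administrative boundary. One block is a concentrated hotspot.
blocks = [
    (100, 20),   # hotspot block: 20% of properties broken into
    (300, 10),
    (300, 10),
    (300, 10),
]

# Same data, two aggregations:
total_props = sum(p for p, _ in blocks)
total_breaks = sum(b for _, b in blocks)
print(f"whole boundary: {total_breaks / total_props:.0%}")  # 5%

for props, breaks in blocks:
    print(f"per block:      {breaks / props:.0%}")          # 20%, 3%, 3%, 3%
```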
Choropleth & MAUP
A choropleth map will show the entire polygon at 18% when really only a portion of it holds the majority of the incidents
What is a possible fix to account for MAUP?
Metadata!
- Can't fix the problem, but can account for the issue
- Embrace Uncertainty!
- There will always be issues when you did not personally collect the data, so justify with metadata, field checks, and GIS checks
Some basic principles for dealing with uncertainty in GIS
- Use many sources to prevent error from building up from the same datasets
- Metadata
- Checks (Field & GIS)
- Appropriateness of reporting results (be careful how you phrase results; prefer "may" or "could" over claiming causes)
More thoughts on dealing with uncertainty
- Embrace uncertainty!
- Try multiple data sources to prevent error from building up
- Be honest in communicating what you know about the accuracy
- Report what you believe to be true, not what the GIS appears to be saying
- GIS has been slow to deal with or treat errors