Data Analysis Concepts Revision Flashcards
State the five stages of creating structured data
1) Document scanning
2) Text recognition
3) Character encoding
4) Parsing
5) Migration to database
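Stages 3 to 5 can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive pipeline: it assumes stages 1 and 2 have already produced the recognised text, and the sample text, the `people` table and its column names are all made up for the example.

```python
import sqlite3

# Assumption: scanning and text recognition (stages 1 and 2) produced this text.
recognised_text = "id,name\n1,Zoë\n2,Amir\n"

# Stage 3: character encoding - store the characters as bytes in a standard encoding.
encoded = recognised_text.encode("utf-8")

# Stage 4: parsing - decode the bytes and split the text into typed fields.
lines = encoded.decode("utf-8").strip().split("\n")
rows = []
for line in lines[1:]:  # skip the header line
    record_id, name = line.split(",")
    rows.append((int(record_id), name))

# Stage 5: migration to a database (an in-memory SQLite database here).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany("INSERT INTO people VALUES (?, ?)", rows)
stored = connection.execute("SELECT id, name FROM people ORDER BY id").fetchall()
print(stored)
```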
Describe ‘serial’ file organisation
Records are stored in the order in which they were created. As a result, the data is unsorted, or at best in chronological order.
Describe ‘sequential’ file organisation
Data is sorted by a key field, often a primary key.
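The difference between serial and sequential organisation can be shown with a small Python sketch. The records and the use of `customer_id` as the key field are hypothetical, and lists stand in for files.

```python
# Hypothetical records (customer_id, name), listed in the order they were created.
records = [(103, "Patel"), (101, "Jones"), (104, "Smith"), (102, "Evans")]

# Serial organisation: each record is appended as it arrives, so creation order is kept.
serial_file = []
for record in records:
    serial_file.append(record)

# Sequential organisation: the same records held in order of the key field.
sequential_file = sorted(records, key=lambda record: record[0])

print([key for key, _ in serial_file])      # [103, 101, 104, 102]
print([key for key, _ in sequential_file])  # [101, 102, 103, 104]
```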
State and briefly explain four ways to achieve good quality data
1) People and skills - staff have the appropriate knowledge and competency for their role
2) Governance and leadership - policies and procedures are in place to support the process
3) Systems and processes - systems are in place to support validation and verification
4) Data security - data collected is kept secure and is only used by authorised individuals
What does CUVCATR stand for in relation to Data Quality?
C - Completeness
U - Uniqueness
V - Validity
C - Consistency
A - Accuracy
T - Timeliness
R - Relevance
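Three of these dimensions (completeness, uniqueness and validity) are easy to check automatically. The sketch below is only an illustration: the records, the field names and the rule that an age must lie between 0 and 120 are assumptions made for the example.

```python
# Hypothetical customer records; field names and the age rule are assumptions.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 51},
    {"id": 2, "email": "c@example.com", "age": -5},
]

# Completeness: share of rows with no missing (None) values.
complete_rows = [row for row in rows if all(value is not None for value in row.values())]
completeness = len(complete_rows) / len(rows)

# Uniqueness: every id should appear only once.
ids = [row["id"] for row in rows]
duplicate_ids = sorted({i for i in ids if ids.count(i) > 1})

# Validity: values must obey the rule for their field (here, age between 0 and 120).
invalid_ages = [row["age"] for row in rows if not 0 <= row["age"] <= 120]

print(completeness, duplicate_ids, invalid_ages)
```

Here two of the three rows are complete, id 2 is duplicated and the age of -5 is invalid.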
What is the difference between Personal and Identifying information in relation to GDPR?
Personal - any record relating to an individual (e.g. medical records)
Identifying - information that can be used to distinguish an individual from others in a dataset
Define ‘data lineage’
A record of the origin of the data, what happens to it and where it moves over time
Define ‘Interpolation’
The creation of new estimated data points based on pre-existing data points
State and explain the three forms of interpolation
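Linear - The simplest form; joins neighbouring data points with straight lines and makes the fewest assumptions about the data
Polynomial - Fits a curve through the data points, so it captures non-linear patterns
Nearest Neighbour - Does not generate new values; copies the value of the nearest existing data point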
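The three forms can be compared on the same data with a short Python sketch. The data points are made up (they follow y = x squared), the function names are hypothetical, and the polynomial version uses the Lagrange form, which is one common way to fit a polynomial through every known point.

```python
# Hypothetical known data points; here y = x squared.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]

def linear(x):
    """Join the two known points either side of x with a straight line."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        y0, y1 = ys[i], ys[i + 1]
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the known range")

def polynomial(x):
    """Fit one polynomial through every known point (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def nearest_neighbour(x):
    """Copy the y value of the closest known x; no new value is generated."""
    index = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[index]

print(linear(1.25), polynomial(1.25), nearest_neighbour(1.25))
```

At x = 1.25 the linear estimate lies on the straight line between (1, 1) and (2, 4), the polynomial estimate follows the curve, and the nearest neighbour estimate simply repeats the value at x = 1.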
State the difference between the Null and Alternative hypotheses
The null hypothesis states that whatever relationship you are studying is not due to a real effect, but is observed only because of random sampling variation.
The alternative hypothesis states that the effect/relationship you are measuring/observing is due to a real phenomenon.
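One way to see the null hypothesis in action is an exact permutation test, sketched below. The technique is chosen here purely for illustration (the flashcard does not name a test), and the measurements are made up. If the null hypothesis is true, the group labels carry no information, so the observed difference should look typical among all possible relabellings.

```python
from itertools import combinations
from statistics import mean

# Hypothetical measurements for two groups (made-up numbers).
control = [12, 14, 11, 13]
treated = [16, 18, 15, 17]

observed = mean(treated) - mean(control)

# Under the null hypothesis the labels are irrelevant, so every way of splitting
# the pooled values into two groups of four is equally likely.
pooled = control + treated
differences = []
for chosen in combinations(range(len(pooled)), len(treated)):
    group_a = [pooled[i] for i in chosen]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    differences.append(mean(group_a) - mean(group_b))

# One-sided p-value: share of relabellings with a difference at least as large
# as the one actually observed.
p_value = sum(d >= observed for d in differences) / len(differences)
print(observed, p_value)
```

A small p-value means the observed difference would be rare under random relabelling alone, which counts as evidence against the null hypothesis and in favour of the alternative.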
Define ‘Data Architecture’
Data architecture is a collective term describing the systems, policies, rules and standards that aim to standardise the way data is collected, handled, stored and transmitted
State four advantages of using a data architecture
1) Operations on data are done in the same/similar ways
2) Upgrades/maintenance to software or hardware are simplified
3) Accessing and performing operations on data is made easier
4) Encourages people to think of the wider context in which their application/systems live
Define ‘Domain Knowledge’
Knowledge of a specific industry and business
Define ‘Descriptive Analysis’
Analysis that shows ‘what has happened?’. Often involves summary statistics, e.g. mean, count, sum.
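A minimal Python sketch of descriptive analysis, using only the standard library. The daily sales figures are hypothetical.

```python
from statistics import mean, median

# Hypothetical daily sales figures.
sales = [120, 95, 130, 110, 145]

# Summary statistics describe what has already happened in the data.
summary = {
    "count": len(sales),
    "sum": sum(sales),
    "mean": mean(sales),
    "median": median(sales),
    "min": min(sales),
    "max": max(sales),
}
print(summary)
```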
Define ‘Predictive Analysis’
Analysis that uses existing data to project trends and patterns into the future, i.e. ‘what is likely to happen?’.
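A simple form of predictive analysis is to fit a straight-line trend to past values and extend it forward. The sketch below assumes a linear trend is appropriate, uses ordinary least squares, and works on made-up monthly sales; `predict` is a hypothetical helper name.

```python
# Hypothetical monthly sales for months 1 to 4.
months = [1, 2, 3, 4]
sales = [100, 110, 120, 130]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# Ordinary least squares for a straight line: y = intercept + slope * x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / sum(
    (x - mean_x) ** 2 for x in months
)
intercept = mean_y - slope * mean_x

def predict(month):
    """Project the fitted trend line to a given month."""
    return intercept + slope * month

forecast = predict(6)
print(slope, intercept, forecast)
```

With these figures sales rise by 10 per month, so the projection for month 6 is 150. Real data rarely follows a perfect line, so a forecast like this is only as good as the assumption that the trend continues.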