Data Analysis Concepts Revision Flashcards
State the five stages of creating structured data
1) Document scanning
2) Text recognition
3) Character encoding
4) Parsing
5) Migration to database
Describe ‘serial’ file organisation
Data is stored in the order in which it was created. As a result, the data is effectively unordered, or at best in chronological order.
Describe ‘sequential’ file organisation
Data is sorted by a key field, often a primary key.
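As a rough illustration (the record layout and field names are made up), the difference between serial and sequential organisation can be shown in a few lines of Python:

```python
# Hypothetical customer records stored serially, i.e. in creation order.
serial_records = [
    {"customer_id": 42, "name": "Ada"},
    {"customer_id": 7,  "name": "Grace"},
    {"customer_id": 19, "name": "Alan"},
]

# Sequential organisation: the same records sorted on a key field
# (here customer_id plays the role of the primary key).
sequential_records = sorted(serial_records, key=lambda r: r["customer_id"])

for record in sequential_records:
    print(record["customer_id"], record["name"])
```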
State and briefly explain four ways to achieve good quality data
1) People and skills - with appropriate knowledge and competency in their role
2) Governance and Leadership - policies and procedures in place to support the process
3) Systems and processes - systems to support validation and verification
4) Data security - Ensuring data collected is secure and only used by authorised individuals
What does CUVCATR stand for in relation to Data Quality
C - Completeness, U - Uniqueness, V - Validity, C - Consistency, A - Accuracy, T - Timeliness, R - Relevance
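A minimal pandas sketch of checking a few of these dimensions (the dataset and column names are invented for the example):

```python
import pandas as pd

# Hypothetical dataset containing the kinds of problems CUVCATR describes.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "email":    ["a@example.com", None, "b@example.com", "not-an-email"],
    "amount":   [10.0, 25.5, 25.5, -3.0],
})

# Completeness: proportion of non-missing values in each column.
completeness = df.notna().mean()

# Uniqueness: is the supposed identifier duplicated?
duplicate_ids = df["order_id"].duplicated().sum()

# Validity: values that break a simple business rule (negative amounts).
invalid_amounts = (df["amount"] < 0).sum()

print(completeness, duplicate_ids, invalid_amounts, sep="\n")
```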
What is the difference between Personal and Identifying information with relation to GDPR
Personal - any information relating to you as an individual (e.g. doctor's records)
Identifying - information that can be used to pick an individual out from others in a dataset
Define ‘data lineage’
Includes the origin of the data, what happens to it and where it moves over time
Define ‘Interpolation’
The creation of new estimated data points based on pre-existing data points
State and explain the three forms of interpolation
Linear - The simplest form; makes the fewest assumptions about the data
Polynomial - Captures non-linear patterns
Nearest Neighbour - Does not generate new values; it replicates the nearest existing value (all three are sketched below)
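A short sketch of the three forms using NumPy/SciPy (the data points are invented):

```python
import numpy as np
from scipy.interpolate import interp1d

# Known data points (hypothetical hourly temperature readings).
x = np.array([0, 2, 4, 6, 8])
y = np.array([12.0, 14.5, 19.0, 17.5, 13.0])

x_new = np.array([1, 3, 5, 7])  # points to estimate between the known ones

# Linear and nearest-neighbour interpolation.
linear = interp1d(x, y, kind="linear")(x_new)
nearest = interp1d(x, y, kind="nearest")(x_new)

# Polynomial interpolation: a degree-4 polynomial passes through all 5 points.
coeffs = np.polyfit(x, y, deg=len(x) - 1)
polynomial = np.polyval(coeffs, x_new)

print(linear)
print(nearest)
print(polynomial)
```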
State the difference between the Null and Alternative hypotheses
The null hypothesis states that whatever relationship you are studying is not a real effect and is observed only because of random sampling.
The alternative hypothesis states that the effect/relationship you are measuring/observing is due to a real phenomenon.
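For example, a two-sample t-test in SciPy (with made-up sample data) tests the null hypothesis that two group means are equal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)  # e.g. control group
group_b = rng.normal(loc=53, scale=5, size=30)  # e.g. treatment group

# Null hypothesis: the group means are equal (any observed difference is
# just random sampling). Alternative: the means genuinely differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value is evidence against the null hypothesis.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```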
Define ‘Data Architecture’
Data architecture is a collective term describing the systems, policies, rules and standards that aim to standardise the way data is collected, handled, stored and transmitted
State four advantages of using a data architecture
1) Operations on data are done in the same/similar ways
2) Upgrades/maintenance to software or hardware are simplified
3) Accessing and performing operations on data is made easier
4) Encourages people to think of the wider context in which their application/systems live
Define ‘Domain Knowledge’
Knowledge of a specific industry and business
Define ‘Descriptive Analysis’
Analysis that answers ‘what has happened?’. Often involves summary statistics, e.g. mean, count, sum.
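For instance, descriptive analysis of a (hypothetical) sales table often amounts to summary statistics like these:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East"],
    "units":  [120, 95, 180, 160, 75],
})

print(sales["units"].describe())               # count, mean, std, min, max...
print(sales.groupby("region")["units"].sum())  # total units sold per region
```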
Define ‘Predictive Analysis’
Helps project trends and patterns into the future.
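A minimal sketch of projecting a trend forward with a simple straight-line fit (the figures are illustrative only):

```python
import numpy as np

# Hypothetical monthly revenue for months 1-6.
months = np.array([1, 2, 3, 4, 5, 6])
revenue = np.array([100, 108, 115, 123, 131, 140])

# Fit a straight-line trend and project it two months ahead.
slope, intercept = np.polyfit(months, revenue, deg=1)
forecast = slope * np.array([7, 8]) + intercept
print(forecast)
```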
Define ‘Decision Analysis’
Helps understand the ‘map’ of possible outcomes from a decision. Uses known outcomes, risks and machine learning to estimate the most likely outcome.
Define ‘Prescriptive Analysis’
Helps identify the best course of action to take from a decision. Similar to decision analysis, but the possible outcomes are scored against a set of criteria.
Which of the following would you use to gain consensus:
Focus Groups, Interviews, Shadowing, Surveys, Studying Documents
Focus Groups
State three issues that can occur from requirements elicitation
1) Problems with scope - boundaries of systems are ill-defined
2) Problems with understanding - Stakeholders aren’t sure of what is needed
3) Problems of volatility - Requirements change over time
State and briefly describe the four types of database maintenance activity
1) Log File Maintenance - Updating and deleting old logs, and checking access for unusual activity
2) Data Compaction - Removing old/redundant data and replacing with metadata
3) Integrity Check - Checking for vulnerable pieces of data with insufficient backups.
4) Data Warehousing - Bringing structured data together into the database
State five things that make a hypothesis ‘good’
1) MUST be testable and be written in non-ambiguous language
2) MUST at least partly answer the problem statement
3) MUST be used to make at least one clear prediction
4) MUST be based on reliable and relevant information
5) MUST contain a dependent variable (what you measure) and an independent variable (something that is controlled or changed)
What are the advantages of using both structured and unstructured data
- Faster decision making
- More relevant decisions
- Fewer wasted man hours
- More opportunities spotted
- Rich data sources available
What does IPR stand for when referring to data protection rules
Intellectual Property Rights
What does ‘CIA’ mean when referring to a model designed to guide information security policies
Confidentiality, Integrity, Availability
How to overcome problems with requirements elicitation
- Use visualisations to help people understand
- Use consistent language
- Provide guidelines
- Use templates consistently
- Document dependencies between requirements
- Analysis of change (root cause analysis)
Order the following in the order that they would occur as part of an ETL:
Session, Mapping, Source, Target, Workflow
1) Source
2) Target
3) Mapping
4) Session
5) Workflow
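That ordering reflects how the objects are typically built in an ETL tool; outside any particular tool, the extract → transform → load flow itself can be sketched in plain Python (the file and table names are hypothetical):

```python
import csv
import sqlite3

# Extract: read rows from a (hypothetical) source CSV file.
with open("sales_source.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: keep valid rows and normalise the values.
clean = [
    {"region": r["region"].strip().title(), "units": int(r["units"])}
    for r in rows
    if r.get("units", "").isdigit()
]

# Load: write the transformed rows into a target table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (:region, :units)", clean)
conn.commit()
```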
Define CAP theorem and explain the three elements
The theory that a distributed system can deliver only two of three desired characteristics:
Consistency - All clients see the data at the same time
Availability - Any client making a request for data gets a response
Partition Tolerance - The cluster must continue to work despite any number of communication breakdowns.