Data Analysis Concepts Revision Flashcards

1
Q

State the five stages of creating structured data

A

1) Document scanning
2) Text recognition
3) Character encoding
4) Parsing
5) Migration to database

2
Q

Describe ‘serial’ file organisation

A

Records are stored in the order in which they were created. As a result, the data is unsorted, or at best in chronological order.

3
Q

Describe ‘sequential’ file organisation

A

Data is sorted by a key field, often a primary key.
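
As a minimal sketch of the difference between serial and sequential organisation, the example below keeps one copy of some records in creation order and sorts another copy by a key field. The records and the `id` field are hypothetical, chosen only for illustration.

```python
from operator import itemgetter

# Hypothetical records, listed in the order they were created.
records = [
    {"id": 103, "name": "Chen"},
    {"id": 101, "name": "Adams"},
    {"id": 102, "name": "Baker"},
]

# Serial organisation: records stay in creation order.
serial_file = list(records)

# Sequential organisation: records are sorted by a key field (here "id").
sequential_file = sorted(records, key=itemgetter("id"))

serial_ids = [r["id"] for r in serial_file]
sequential_ids = [r["id"] for r in sequential_file]
```

Here `serial_ids` keeps the arrival order, while `sequential_ids` is in ascending key order.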

4
Q

State and briefly explain four ways to achieve good quality data

A

1) People and skills - with appropriate knowledge and competency in their role
2) Governance and Leadership - policies and procedures in place to support the process
3) Systems and processes - systems to support validation and verification
4) Data security - Ensuring data collected is secure and only used by authorised individuals

5
Q

What does CUVCATR stand for in relation to data quality?

A
C - Completeness
U - Uniqueness
V - Validity
C - Consistency
A - Accuracy
T - Timeliness
R - Relevance
6
Q

What is the difference between personal and identifying information in relation to GDPR?

A

Personal - any record relating to an individual (e.g. medical records)
Identifying - information that can be used to distinguish an individual from others in a dataset

7
Q

Define ‘data lineage’

A

Includes the origin of the data, what happens to it and where it moves over time

8
Q

Define ‘Interpolation’

A

The creation of new estimated data points based on pre-existing data points

9
Q

State and explain the three forms of interpolation

A

Linear - The simplest form that makes fewest assumptions about the data
Polynomial - Captures non-linear patterns
Nearest Neighbour - Does not generate new values, replicates the nearest existing values
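
The three forms above can be sketched in plain Python as follows. This is an illustrative sketch, not a production routine: the function names and the sample points (taken from y = x squared) are assumptions, and the polynomial form is implemented here as a Lagrange polynomial through all known points, which is one of several ways to do it.

```python
def linear_interp(x, xs, ys):
    """Straight-line estimate between the two known points that bracket x."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the known range")


def polynomial_interp(x, xs, ys):
    """Lagrange polynomial through all known points, evaluated at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def nearest_interp(x, xs, ys):
    """Copy the y value of the closest known x; no new value is generated."""
    closest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ys[closest]


# Hypothetical known data points, sorted by x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]

linear = linear_interp(1.4, xs, ys)
polynomial = polynomial_interp(1.4, xs, ys)
nearest = nearest_interp(1.4, xs, ys)
```

For the same query point, the linear estimate lies on the straight line between the neighbouring points, the polynomial estimate follows the curve, and the nearest-neighbour result is simply an existing y value.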

10
Q

State the difference between the null and alternative hypotheses

A

The null hypothesis states that whatever relationship you are studying is not due to a real effect, but is observed only because of random sampling.

The alternative hypothesis states that the effect or relationship you are measuring or observing is due to a real phenomenon.
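
One way to make the two hypotheses concrete is a permutation test, sketched below under stated assumptions: the two groups of measurements are made up, and `permutation_p_value` is a hypothetical helper, not part of any library. If the null hypothesis holds, the group labels are arbitrary, so shuffling them should often produce a difference in means as large as the one observed.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate how often random relabelling gives a difference in means
    at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles


# Hypothetical measurements for two groups.
group_a = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]
group_b = [10.2, 10.9, 10.5, 11.1, 10.4, 10.8]

p_value = permutation_p_value(group_a, group_b)
```

A small `p_value` means random sampling alone rarely explains the observed difference, which counts as evidence against the null hypothesis and in favour of the alternative.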

11
Q

Define ‘Data Architecture’

A

Data architecture is a collective term describing the systems, policies, rules and standards that aim to standardise the way data is collected, handled, stored and transmitted

12
Q

State four advantages of using a data architecture

A

1) Operations on data are done in the same/similar ways
2) Upgrades/maintenance to software or hardware are simplified
3) Accessing and performing operations on data is made easier
4) Encourages people to think of the wider context in which their application/systems live

13
Q

Define ‘Domain Knowledge’

A

Knowledge of a specific industry and business

14
Q

Define ‘Descriptive Analysis’

A

Analysis that shows ‘what has happened?’. Often involves summary statistics, e.g. mean, count, sum.
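
A minimal sketch of the summary statistics mentioned above, using a hypothetical list of daily sales figures:

```python
from statistics import mean

# Hypothetical daily sales figures.
daily_sales = [120, 135, 150, 110, 145]

# Descriptive analysis: summarise what has already happened.
summary = {
    "count": len(daily_sales),
    "sum": sum(daily_sales),
    "mean": mean(daily_sales),
}
```

Each value in `summary` describes the past data only; nothing is projected forward.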

15
Q

Define ‘Predictive Analysis’

A

Helps project trends and patterns into the future.

16
Q

Define ‘Decision Analysis’

A

Helps understand the ‘map’ of possibilities that follow from a decision. Uses known outcomes, risks and machine learning to estimate a likely outcome.

17
Q

Define ‘Prescriptive Analysis’

A

Helps identify the best course of action among the possibilities that follow from a decision. Similar to ‘decision’ analysis, but the options are scored against a set of criteria.

18
Q

Which of the following would you use to gain consensus?

Focus Groups
Interviews
Shadowing
Surveys
Studying Documents
A

Focus Groups

19
Q

State three issues that can occur from requirements elicitation

A

1) Problems with scope - Boundaries of systems are ill-defined
2) Problems with understanding - Stakeholders aren’t sure of what is needed
3) Problems with volatility - Requirements change over time

20
Q

State and briefly describe the four types of database maintenance activity

A

1) Log File Maintenance - Updating and deleting old logs, and checking access for unusual activity
2) Data Compaction - Removing old/redundant data and replacing with metadata
3) Integrity Check - Checking for vulnerable pieces of data with insufficient backups.
4) Data Warehousing - Bringing structured data together into the database

21
Q

State five things that make a hypothesis ‘good’

A

1) MUST be testable and be written in non-ambiguous language
2) MUST at least partly answer the problem statement
3) MUST be used to make at least one clear prediction
4) MUST be based on reliable and relevant information
5) MUST contain a dependent variable (what you measure) and an independent variable (something that is controlled or changed)

22
Q

What are the advantages of using both structured and unstructured data?

A
  • Faster decision making
  • More relevant decisions
  • Fewer wasted man hours
  • More opportunities spotted
  • Rich data sources available
23
Q

What does IPR stand for when referring to data protection rules?

A

Intellectual Property Rights

24
Q

What does ‘CIA’ mean when referring to a model designed to guide information security policies?

A

Confidentiality, Integrity, Availability

25
Q

How can problems with requirements elicitation be overcome?

A

Use visualisations to help people understand
Consistent language
Guidelines
Consistent use of templates
Document dependencies between requirements
Analysis of change - root cause analysis

26
Q

Place the following in the order in which they would occur as part of an ETL process:

Session
Mapping
Source
Target
Workflow
A

1) Source
2) Target
3) Mapping
4) Session
5) Workflow

27
Q

Define CAP theorem and explain the three elements

A

The theory that a distributed system can deliver only two of three desired characteristics:

Consistency - All clients see the same data at the same time
Availability - Any client making a request for data gets a response
Partition Tolerance - The cluster must continue to work despite any number of communication breakdowns.