Data Analysis Concepts Revision Flashcards
State the five stages of creating structured data
1) Document scanning
2) Text recognition
3) Character encoding
4) Parsing
5) Migration to database
Describe ‘serial’ file organisation
Data is stored in the order in which it was created. As a result, the data is effectively unordered, or at best in chronological order.
Describe ‘sequential’ file organisation
Data is sorted by a key field, often a primary key.
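As a rough illustration (the record layout and field names are made up), the difference between serial and sequential organisation can be shown in a few lines of Python:

```python
# Hypothetical customer records stored serially, i.e. in creation order.
serial_records = [
    {"customer_id": 42, "name": "Ada"},
    {"customer_id": 7,  "name": "Grace"},
    {"customer_id": 19, "name": "Alan"},
]

# Sequential organisation: the same records sorted on a key field
# (here customer_id plays the role of the primary key).
sequential_records = sorted(serial_records, key=lambda r: r["customer_id"])

for record in sequential_records:
    print(record["customer_id"], record["name"])
```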
State and briefly explain four ways to achieve good quality data
1) People and skills - with appropriate knowledge and competency in their role
2) Governance and Leadership - policies and procedures in place to support the process
3) Systems and processes - systems to support validation and verification
4) Data security - Ensuring data collected is secure and only used by authorised individuals
What does CUVCATR stand for in relation to Data Quality
C - Completeness, U - Uniqueness, V - Validity, C - Consistency, A - Accuracy, T - Timeliness, R - Relevance
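A minimal pandas sketch of checking a few of these dimensions (the dataset and column names are invented for the example):

```python
import pandas as pd

# Hypothetical dataset containing the kinds of problems CUVCATR describes.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "email":    ["a@example.com", None, "b@example.com", "not-an-email"],
    "amount":   [10.0, 25.5, 25.5, -3.0],
})

# Completeness: proportion of non-missing values in each column.
completeness = df.notna().mean()

# Uniqueness: is the supposed identifier duplicated?
duplicate_ids = df["order_id"].duplicated().sum()

# Validity: values that break a simple business rule (negative amounts).
invalid_amounts = (df["amount"] < 0).sum()

print(completeness, duplicate_ids, invalid_amounts, sep="\n")
```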
What is the difference between Personal and Identifying information with relation to GDPR
Personal - any information relating to you as an individual (e.g. doctor's records)
Identifying - information that can be used to pick an individual out from others in a dataset
Define ‘data lineage’
Includes the origin of the data, what happens to it and where it moves over time
Define ‘Interpolation’
The creation of new estimated data points based on pre-existing data points
State and explain the three forms of interpolation
Linear - The simplest form; makes the fewest assumptions about the data
Polynomial - Captures non-linear patterns
Nearest Neighbour - Does not generate new values; it replicates the nearest existing value (all three are sketched below)
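A short sketch of the three forms using NumPy/SciPy (the data points are invented):

```python
import numpy as np
from scipy.interpolate import interp1d

# Known data points (hypothetical hourly temperature readings).
x = np.array([0, 2, 4, 6, 8])
y = np.array([12.0, 14.5, 19.0, 17.5, 13.0])

x_new = np.array([1, 3, 5, 7])  # points to estimate between the known ones

# Linear and nearest-neighbour interpolation.
linear = interp1d(x, y, kind="linear")(x_new)
nearest = interp1d(x, y, kind="nearest")(x_new)

# Polynomial interpolation: a degree-4 polynomial passes through all 5 points.
coeffs = np.polyfit(x, y, deg=len(x) - 1)
polynomial = np.polyval(coeffs, x_new)

print(linear)
print(nearest)
print(polynomial)
```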
State the difference between the Null and Alternative hypotheses
The null hypothesis states that whatever relationship you are studying is not a real effect and is observed only because of random sampling.
The alternative hypothesis states that the effect/relationship you are measuring/observing is due to a real phenomenon.
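For example, a two-sample t-test in SciPy (with made-up sample data) tests the null hypothesis that two group means are equal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)  # e.g. control group
group_b = rng.normal(loc=53, scale=5, size=30)  # e.g. treatment group

# Null hypothesis: the group means are equal (any observed difference is
# just random sampling). Alternative: the means genuinely differ.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value is evidence against the null hypothesis.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```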
Define ‘Data Architecture’
Data architecture is a collective term describing the systems, policies, rules and standards that aim to standardise the way data is collected, handled, stored and transmitted
State four advantages of using a data architecture
1) Operations on data are done in the same/similar ways
2) Upgrades/maintenance to software or hardware are simplified
3) Accessing and performing operations on data is made easier
4) Encourages people to think of the wider context in which their application/systems live
Define ‘Domain Knowledge’
Knowledge of a specific industry and business
Define ‘Descriptive Analysis’
Analysis that answers ‘what has happened?’. Often involves summary statistics, e.g. mean, count, sum.
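For instance, descriptive analysis of a (hypothetical) sales table often amounts to summary statistics like these:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East"],
    "units":  [120, 95, 180, 160, 75],
})

print(sales["units"].describe())               # count, mean, std, min, max...
print(sales.groupby("region")["units"].sum())  # total units sold per region
```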
Define ‘Predictive Analysis’
Helps project trends and patterns into the future.
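A minimal sketch of projecting a trend forward with a simple straight-line fit (the figures are illustrative only):

```python
import numpy as np

# Hypothetical monthly revenue for months 1-6.
months = np.array([1, 2, 3, 4, 5, 6])
revenue = np.array([100, 108, 115, 123, 131, 140])

# Fit a straight-line trend and project it two months ahead.
slope, intercept = np.polyfit(months, revenue, deg=1)
forecast = slope * np.array([7, 8]) + intercept
print(forecast)
```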
Define ‘Decision Analysis’
Helps understand the ‘map’ of possible outcomes from a decision. Uses known outcomes, risks and machine learning to estimate the most likely outcome.
Define ‘Prescriptive Analysis’
Helps identify the best course of action to take from a decision. Similar to decision analysis, but the possible outcomes are scored against a set of criteria.
Which of the following would you use to gain consensus:
Focus Groups, Interviews, Shadowing, Surveys, Studying Documents
Focus Groups
State three issues that can occur from requirements elicitation
1) Problems with scope - boundaries of systems are ill-defined
2) Problems with understanding - Stakeholders aren’t sure of what is needed
3) Problems of volatility - Requirements change over time
State and briefly describe the four types of database maintenance activity
1) Log File Maintenance - Updating and deleting old logs, and checking access for unusual activity
2) Data Compaction - Removing old/redundant data and replacing with metadata
3) Integrity Check - Checking for vulnerable pieces of data with insufficient backups.
4) Data Warehousing - Bringing structured data together into the database
State five things that make a hypothesis ‘good’
1) MUST be testable and be written in non-ambiguous language
2) MUST at least partly answer the problem statement
3) MUST be used to make at least one clear prediction
4) MUST be based on reliable and relevant information
5) MUST contain a dependent variable (what you measure) and an independent variable (something that is controlled or changed)
What are the advantages of using both structured and unstructured data
- Faster decision making
- More relevant decisions
- Fewer wasted man hours
- More opportunities spotted
- Rich data sources available
What does IPR stand for when referring to data protection rules
Intellectual Property Rights
What does ‘CIA’ mean when referring to a model designed to guide information security policies
Confidentiality, Integrity, Availability
How to overcome problems with requirements elicitation
- Use visualisations to help people understand
- Use consistent language
- Provide guidelines
- Use templates consistently
- Document dependencies between requirements
- Analysis of change (root cause analysis)
Order the following in the order that they would occur as part of an ETL:
Session, Mapping, Source, Target, Workflow
1) Source
2) Target
3) Mapping
4) Session
5) Workflow
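That ordering reflects how the objects are typically built in an ETL tool; outside any particular tool, the extract → transform → load flow itself can be sketched in plain Python (the file and table names are hypothetical):

```python
import csv
import sqlite3

# Extract: read rows from a (hypothetical) source CSV file.
with open("sales_source.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: keep valid rows and normalise the values.
clean = [
    {"region": r["region"].strip().title(), "units": int(r["units"])}
    for r in rows
    if r.get("units", "").isdigit()
]

# Load: write the transformed rows into a target table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (:region, :units)", clean)
conn.commit()
```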
Define CAP theorem and explain the three elements
The theory that a distributed system can deliver only two of three desired characteristics:
Consistency - All clients see the data at the same time
Availability - Any client making a request for data gets a response
Partition Tolerance - The cluster must continue to work despite any number of communication breakdowns.