Chapter 1 Flashcards
Data Analysis
Three key forms of data analysis:
- descriptive
- inferential
- predictive
Descriptive Analysis
Descriptive analysis simplifies raw data, making it easier to interpret, especially when dealing with large volumes. It involves summarizing or presenting data in a way that highlights patterns or trends.
However, it does not allow for specific conclusions—its purpose is purely to describe the data.
Examples include summary statistics and graphs.
Measures of Central Tendency
These describe the ‘average’ value of a data set. The most common measures are:
Mean – the arithmetic average
Median – the middle value when data is ordered
Mode – the most frequently occurring value
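A quick sketch in R with made-up values; R has no built-in mode function for data, so the mode is found by tabulating:

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 8, 10)   # illustrative values

mean(x)     # arithmetic average
median(x)   # middle value when the data are ordered

# Mode: tabulate the values and take the most frequent one
as.numeric(names(which.max(table(x))))
```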
Measures of Dispersion
These describe the ‘spread’ of values in a data set. Key measures include:
- Standard deviation – indicates how much values deviate from the mean
- Interquartile range – the range of the middle 50% of data
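The same illustrative values in R, using the base functions sd() and IQR():

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 8, 10)   # illustrative values

sd(x)    # sample standard deviation: typical deviation from the mean
IQR(x)   # interquartile range: spread of the middle 50% of the data
```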
Measures of the Shape of a Distribution
- Skewness – measures the asymmetry of a data set. A skewed distribution has more values concentrated on one side than the other.
- Kurtosis – measures the likelihood of extreme values appearing in the tails of the distribution.
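Base R has no skewness or kurtosis functions; a sketch using one common moment-based definition (add-on packages such as moments use similar formulas):

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 8, 10)   # illustrative values
m <- mean(x)
s <- sqrt(mean((x - m)^2))           # standard deviation (n denominator)

mean((x - m)^3) / s^3   # skewness: negative here, i.e. a longer left tail
mean((x - m)^4) / s^4   # kurtosis: around 3 for a normal distribution;
                        # higher values suggest heavier tails (more extremes)
```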
Empirical Distribution
Empirical means “based on observation.”
An empirical distribution reflects actual collected data, not an assumed theoretical model
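In R, ecdf() builds this step function directly from the observed values:

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 8, 10)   # observed (illustrative) values

Fn <- ecdf(x)   # empirical distribution function based purely on the data
Fn(5)           # proportion of observed values <= 5 (here 4/9)
plot(Fn, main = "Empirical distribution function")
```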
Inferential Analysis
Used when collecting data from an entire population is impractical.
A sample is analyzed to represent the wider population.
Involves estimating parameters and testing hypotheses based on the sample data.
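A hedged sketch in R: a simulated population stands in for one that is impractical to measure in full, and t.test() estimates the population mean from a sample, gives a confidence interval and tests a hypothesis (the numbers are illustrative):

```r
set.seed(42)  # purely so the illustration is repeatable

# Simulated 'population' that would be impractical to measure in full
population <- rnorm(100000, mean = 170, sd = 10)

# Analyse a random sample in place of the whole population
smpl <- sample(population, size = 200)

# Point estimate and 95% confidence interval for the population mean,
# plus a test of the hypothesis that the mean equals 168
t.test(smpl, mu = 168)
```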
Importance of a representative sample
A large, randomly selected sample can accurately reflect the population.
The accuracy of inferential analysis depends on making reasonable assumptions about the population.
Sampling bias can occur if the selection method excludes certain groups (e.g., surveying only urban residents may not reflect national opinions).
Predictive Analysis
Extends inferential analysis to make predictions about future events based on past data
Uses a training set with known attributes (features) to discover potentially predictive relationships
These relationships are tested with a separate test set to assess their strength
Linear Regression
A common example of predictive analysis
Assumes a linear relationship between a dependent variable and an independent variable
The training set is used to determine the slope and intercept of the line
Example: The relationship between a car’s speed (independent/explanatory variable) and braking distance (dependent variable)
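A sketch using R's built-in cars data set, which records car speeds (mph) and stopping distances (ft):

```r
# Built-in data set: 50 observations of speed (mph) and stopping distance (ft)
head(cars)

# Fit a straight line: braking distance explained by speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # intercept and slope estimated from the training data

# Use the fitted line to predict braking distance at new speeds
predict(fit, newdata = data.frame(speed = c(10, 20, 30)))
```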
Independent vs Dependent Variable
Independent variable: The explanatory variable, which is presumed to cause or explain changes in the dependent variable
Dependent variable: The variable being predicted or explained, dependent on changes in the independent variable
The Data Analysis Process
- Define Objectives – Develop a well-defined set of objectives which need to be met by the data analysis, such as summarizing insurance claims by age and gender or predicting an election outcome.
- Identify the data items required
- Collect Data – Gather data from appropriate sources, whether internal (e.g., company records) or external (e.g., government statistics).
- Process and Format Data – E.g. Input data into spreadsheets, databases, or models for analysis.
- Clean Data – Address unusual, missing, or inconsistent values to ensure accuracy (a short sketch follows this list).
- Exploratory Data Analysis – Conduct preliminary analysis, which may include:
- Descriptive analysis – Summarizing central tendency and spread.
- Inferential analysis – Estimating summary parameters of the wider population of data and testing hypotheses.
- Predictive analysis – Making forecasts about future events or other data sets.
- Model the Data
- Communicate Results – Clearly explain the data used, analyses performed, assumptions made, conclusions, and limitations.
- Monitor, Update and Repeat – Data analysis is often repeated over time. For example, insurance companies may reassess claims trends periodically, and polling firms may conduct multiple surveys before an election to track changes in public opinion.
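A minimal sketch of the data-cleaning step, using a hypothetical claims data frame (all names and values are illustrative):

```r
# Hypothetical claims data with a missing value and an implausible age
claims <- data.frame(
  age    = c(34, 51, 199, 46),    # 199 looks like a recording error
  gender = c("F", "M", "F", NA),
  amount = c(1200, 850, 2300, NA)
)

summary(claims)                     # quick check for odd or missing values
claims$age[claims$age > 120] <- NA  # treat implausible ages as missing
na.omit(claims)                     # one option: drop incomplete rows
```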
Sources of bias or inaccuracy in data
- whether the process was manual or automated;
- limitations on the precision of the data recorded;
- whether there was any validation at source;
- if data wasn’t collected automatically, how it was converted to an electronic form.
Simple random sampling
Each member of the population (e.g., each employee in a company-wide survey) has an equal chance of being selected.
Stratified sampling
The population is split into groups (strata) based on specific criteria, and a random sample is then taken from each group.
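A sketch of both approaches in base R, using a hypothetical employee list (the department names are illustrative):

```r
set.seed(1)

# Hypothetical employee list with a department for each person
employees <- data.frame(
  id   = 1:12,
  dept = rep(c("Claims", "Pricing", "IT"), each = 4)
)

# Simple random sampling: every employee equally likely to be chosen
employees[sample(nrow(employees), 4), ]

# Stratified sampling: a random employee from each department (stratum)
by_dept <- split(employees, employees$dept)
do.call(rbind, lapply(by_dept, function(d) d[sample(nrow(d), 1), ]))
```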
Pre-processed data
Data may have undergone pre-processing, such as grouping by geographical area or age band. In the past, this was done to reduce storage and computation needs. While modern computing power has lessened these concerns, data may still be grouped to anonymise it or protect sensitive information.
Types of data based on collection process
- Cross-sectional data involves recording values for each case in the sample at a single moment in time. Example: Recording the amount spent in a supermarket by each member of a loyalty card scheme this week.
- Longitudinal data involves recording values at intervals over time. Example: Recording the amount spent in a supermarket by a particular loyalty card member each week for a year.
- Censored data occurs when the value of a variable is only partially known. Example: In a survival study, if a subject withdraws or survives beyond the study period, only a lower bound for survival time is known (a sketch of how such data might be recorded follows this list).
- Truncated data occurs when measurements on some variables are not recorded and are completely unknown. Example: If we only recorded internet disruptions lasting 5 minutes or longer, shorter disruptions would be missing from the data set.
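A hypothetical sketch of how the censored survival times above might be recorded, using an event indicator to mark observations where only a lower bound is known:

```r
# Hypothetical survival study data; event = 0 marks a censored observation,
# so 'time' is only a lower bound for that subject's survival time
survival_data <- data.frame(
  subject = 1:4,
  time    = c(2.5, 5.0, 5.0, 1.2),  # years observed in the study
  event   = c(1, 0, 0, 1)           # 1 = event observed, 0 = censored
)
survival_data
```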
Big Data
- The term big data is not well defined but generally refers to data with characteristics that make traditional analysis methods impractical. This often involves automatically collected data where characteristics must be inferred rather than pre-defined.
Characteristics of Big Data:
* Size – Big data includes a vast number of individual cases, often with many variables. A high proportion of these may have empty (null) values, leading to sparse data.
* Speed – Data may arrive in real time at a very fast rate, such as sensors taking thousands of measurements per second.
* Variety – Big data often comes from multiple sources with different structures or may be largely unstructured.
* Reliability – The accuracy and reliability of individual data elements can be difficult to confirm and may vary over time (e.g., a sensor going offline).
Examples of Big Data:
* Information held by large online retailers on items viewed, purchased, and recommended for each customer.
* Atmospheric pressure measurements from sensors monitored by a national meteorological organization.
* Data collected by an insurance company from personal activity trackers used by policyholders to monitor exercise, food intake, and sleep.
These four characteristics are worth considering for any data source, not just big data.
Data Security, Privacy and Regulation
In any investigation, it is crucial to consider data security, privacy, and compliance with relevant regulations.
Combining data from different anonymized sources can sometimes lead to individual cases becoming identifiable.
The availability of data on the internet does not necessarily mean it can be freely used. Laws regarding data usage vary across jurisdictions, making this a complex area.
Reproducibility vs. Replication
Reproducibility means that when the results of a statistical analysis are reported, enough information is provided for an independent third party to repeat the analysis and obtain the same results.
Replication refers to repeating an experiment and obtaining the same or consistent results. However, replication can be difficult, expensive, or even impossible in cases where:
- The study is very large.
- The study relies on data collected at great expense or over many years.
- The study examines a unique occurrence (e.g., healthcare standards after a specific event).
Because replication may not always be feasible, reproducibility is often used as an alternative standard. Instead of validating results by fully replicating the study (including collecting new data), validation is achieved by an independent third party reproducing the same results using the original data set.
* Where randomness is used (for example in random forests or neural networks) or where simulation is used, reproducing the results requires the random seed to be set and recorded.
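For example, in R:

```r
set.seed(2024)   # set (and record) the seed before any random generation
rnorm(3)

set.seed(2024)   # the same seed gives identical 'random' output
rnorm(3)
```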
Elements required for reproducibility
Reproducibility typically requires the original data and computer code to be available (or fully specified) so others can repeat the analysis and verify results.
In all but the simplest cases, full documentation is necessary, including:
* A description of each data variable.
* An audit trail detailing decisions made during data cleaning and processing.
* Fully documented code.
While not strictly required for reproducibility, version control helps align drafts of code, documentation, and reports, making changes reversible if needed.
A commonly used tool for version control is git.
Ensuring reproducibility may also involve documenting the software environment, computing architecture, operating system, software toolchain, external dependencies, and version numbers.
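In R, sessionInfo() captures much of this automatically and can be saved with the analysis output:

```r
sessionInfo()   # R version, operating system, loaded packages and versions
```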
Literate Statistical Programming
combines explanations written in plain language with embedded snippets of code in a single document.
* In R, this can be done using R Markdown, which enables the creation of documents containing code, explanations, and outputs.
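A minimal R Markdown sketch (the title and values are illustrative); the plain-language text and the R chunk below it are rendered together into a single document:

````markdown
---
title: "Claims summary"
output: html_document
---

The average claim amount this quarter is calculated below.

```{r}
claims <- c(1200, 850, 2300)   # illustrative values
mean(claims)
```
````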
Manual Processes and Reproducibility
Doing things ‘by hand’ can create problems in reproducing the work. Examples include:
* Manually editing spreadsheets (instead of reading raw data into a programming environment and making changes there).
* Editing tables and figures (rather than ensuring the programming environment creates them exactly as needed).
* Downloading data manually from a website (instead of doing it programmatically; see the sketch after this list).
* Pointing and clicking (choosing operations from an on-screen menu, which is usually not recorded electronically).
The key point is that automated analysis is much easier to reproduce than manual interventions, which may be forgotten or difficult to document clearly.
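A sketch of the scripted alternative to a manual download (the URL is a placeholder, not a real data source):

```r
# Repeatable, documented alternative to downloading a file by hand
url <- "https://example.com/claims_2023.csv"   # placeholder URL
download.file(url, destfile = "claims_2023.csv", mode = "wb")
claims <- read.csv("claims_2023.csv")
```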
The Value of Reproducibility
- Reproducibility is necessary for a full technical review of the work, ensuring the analysis was done correctly and the conclusions are justified by the data and analysis.
- It may be required by external regulators and auditors.
- Reproducible research is easier to extend for investigating the effects of changes to the analysis or incorporating new data.
- Comparing the results of an investigation with a similar past investigation is easier and more reliable if the earlier study was reproducible. Differences between the two can be analyzed with confidence.
- The discipline of reproducible research, with its focus on good documentation and data storage, leads to fewer errors that need to be fixed, enhancing efficiency.
Issues that reproducibility does not address
- Reproducibility does not guarantee the analysis is correct. For example, if an incorrect distribution is assumed, the results may still be reproducible but wrong.
- Reproducibility provides transparency and allows an incorrect analysis to be challenged, but this may come too late if the activities needed for reproducibility are only carried out at the end of the analysis, by which time resources may have been moved to other projects.