Chapter 1 Flashcards
Data Analysis
Three key forms of data analysis:
- descriptive
- inferential
- predictive
Descriptive Analysis
Descriptive analysis simplifies raw data, making it easier to interpret, especially when dealing with large volumes. It involves summarizing or presenting data in a way that highlights patterns or trends.
However, it does not allow for specific conclusions—its purpose is purely to describe the data.
Examples include summary statistics and graphs.
Measures of Central Tendency
These describe the ‘average’ value of a data set. The most common measures are:
Mean – the arithmetic average
Median – the middle value when data is ordered
Mode – the most frequently occurring value
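A quick sketch in R with made-up values; R has no built-in mode function for data, so the mode is found by tabulating:

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 8, 10)   # illustrative values

mean(x)     # arithmetic average
median(x)   # middle value when the data are ordered

# Mode: tabulate the values and take the most frequent one
as.numeric(names(which.max(table(x))))
```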
Measures of Dispersion
These describe the ‘spread’ of values in a data set. Key measures include:
- Standard deviation – indicates how much values deviate from the mean
- Interquartile range – the range of the middle 50% of data
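The same illustrative values in R, using the base functions sd() and IQR():

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 8, 10)   # illustrative values

sd(x)    # sample standard deviation: typical deviation from the mean
IQR(x)   # interquartile range: spread of the middle 50% of the data
```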
Measures of the Shape of a Distribution
- Skewness – measures the asymmetry of a data set. A skewed distribution has more values concentrated on one side than the other.
- Kurtosis – measures the likelihood of extreme values appearing in the tails of the distribution.
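Base R has no skewness or kurtosis functions; a sketch using one common moment-based definition (add-on packages such as moments use similar formulas):

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 8, 10)   # illustrative values
m <- mean(x)
s <- sqrt(mean((x - m)^2))           # standard deviation (n denominator)

mean((x - m)^3) / s^3   # skewness: negative here, i.e. a longer left tail
mean((x - m)^4) / s^4   # kurtosis: around 3 for a normal distribution;
                        # higher values suggest heavier tails (more extremes)
```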
Empirical Distribution
Empirical means “based on observation.”
An empirical distribution reflects actual collected data, not an assumed theoretical model
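In R, ecdf() builds this step function directly from the observed values:

```r
x <- c(2, 3, 3, 5, 7, 8, 8, 8, 10)   # observed (illustrative) values

Fn <- ecdf(x)   # empirical distribution function based purely on the data
Fn(5)           # proportion of observed values <= 5 (here 4/9)
plot(Fn, main = "Empirical distribution function")
```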
Inferential Analysis
Used when collecting data from an entire population is impractical.
A sample is analyzed to represent the wider population.
Involves estimating parameters and testing hypotheses based on the sample data.
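A hedged sketch in R: a simulated population stands in for one that is impractical to measure in full, and t.test() estimates the population mean from a sample, gives a confidence interval and tests a hypothesis (the numbers are illustrative):

```r
set.seed(42)  # purely so the illustration is repeatable

# Simulated 'population' that would be impractical to measure in full
population <- rnorm(100000, mean = 170, sd = 10)

# Analyse a random sample in place of the whole population
smpl <- sample(population, size = 200)

# Point estimate and 95% confidence interval for the population mean,
# plus a test of the hypothesis that the mean equals 168
t.test(smpl, mu = 168)
```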
Importance of a representative sample
A large, randomly selected sample can accurately reflect the population.
The accuracy of inferential analysis depends on making reasonable assumptions about the population.
Sampling bias can occur if the selection method excludes certain groups (e.g., surveying only urban residents may not reflect national opinions).
Predictive Analysis
Extends inferential analysis to make predictions about future events based on past data
Uses a training set with known attributes (features) to discover potentially predictive relationships
These relationships are tested with a separate test set to assess their strength
Linear Regression
A common example of predictive analysis
Assumes a linear relationship between a dependent variable and an independent variable
The training set is used to determine the slope and intercept of the line
Example: The relationship between a car’s speed (independent/explanatory variable) and braking distance (dependent variable)
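A sketch using R's built-in cars data set, which records car speeds (mph) and stopping distances (ft):

```r
# Built-in data set: 50 observations of speed (mph) and stopping distance (ft)
head(cars)

# Fit a straight line: braking distance explained by speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # intercept and slope estimated from the training data

# Use the fitted line to predict braking distance at new speeds
predict(fit, newdata = data.frame(speed = c(10, 20, 30)))
```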
Independent vs Dependent Variable
Independent variable: The explanatory variable, which is presumed to cause or explain changes in the dependent variable
Dependent variable: The variable being predicted or explained, dependent on changes in the independent variable
The Data Analysis Process
- Define Objectives – Develop a well-defined set of objectives which need to be met by the data analysis, such as summarizing insurance claims by age and gender or predicting an election outcome.
- Identify the data items required
- Collect Data – Gather data from appropriate sources, whether internal (e.g., company records) or external (e.g., government statistics).
- Process and Format Data – E.g. Input data into spreadsheets, databases, or models for analysis.
- Clean Data – Address unusual, missing, or inconsistent values to ensure accuracy (a short sketch follows this list).
- Exploratory Data Analysis – Conduct preliminary analysis, which may include:
- Descriptive analysis – Summarizing central tendency and spread.
- Inferential analysis – Estimating summary parameters of the wider population of data and testing hypotheses.
- Predictive analysis – Making forecasts about future events or other data sets.
- Model the Data
- Communicate Results – Clearly explain the data used, analyses performed, assumptions made, conclusions, and limitations.
- Monitor, Update and Repeat – Data analysis is often repeated over time. For example, insurance companies may reassess claims trends periodically, and polling firms may conduct multiple surveys before an election to track changes in public opinion.
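A minimal sketch of the data-cleaning step, using a hypothetical claims data frame (all names and values are illustrative):

```r
# Hypothetical claims data with a missing value and an implausible age
claims <- data.frame(
  age    = c(34, 51, 199, 46),    # 199 looks like a recording error
  gender = c("F", "M", "F", NA),
  amount = c(1200, 850, 2300, NA)
)

summary(claims)                     # quick check for odd or missing values
claims$age[claims$age > 120] <- NA  # treat implausible ages as missing
na.omit(claims)                     # one option: drop incomplete rows
```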
Sources of bias or inaccuracy in data
- whether the process was manual or automated;
- limitations on the precision of the data recorded;
- whether there was any validation at source;
- if data wasn’t collected automatically, how it was converted to an electronic form.
Simple random sampling
Each member of the population (e.g., each employee in a company-wide survey) has an equal chance of being selected.
Stratified sampling
The population is split into groups (strata) based on specific criteria, and a random sample is then taken from each group.
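A sketch of both approaches in base R, using a hypothetical employee list (the department names are illustrative):

```r
set.seed(1)

# Hypothetical employee list with a department for each person
employees <- data.frame(
  id   = 1:12,
  dept = rep(c("Claims", "Pricing", "IT"), each = 4)
)

# Simple random sampling: every employee equally likely to be chosen
employees[sample(nrow(employees), 4), ]

# Stratified sampling: a random employee from each department (stratum)
by_dept <- split(employees, employees$dept)
do.call(rbind, lapply(by_dept, function(d) d[sample(nrow(d), 1), ]))
```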
Pre-processed data
Data may have undergone pre-processing, such as grouping by geographical area or age band. In the past, this was done to reduce storage and computation needs. While modern computing power has lessened these concerns, data may still be grouped to anonymise it or protect sensitive information.
Types of data based on collection process
- Cross-sectional data involves recording values for each case in the sample at a single moment in time. Example: Recording the amount spent in a supermarket by each member of a loyalty card scheme this week.
- Longitudinal data involves recording values at intervals over time. Example: Recording the amount spent in a supermarket by a particular loyalty card member each week for a year.
- Censored data occurs when the value of a variable is only partially known. Example: In a survival study, if a subject withdraws or survives beyond the study period, only a lower bound for survival time is known (a sketch of how such data might be recorded follows this list).
- Truncated data occurs when measurements on some variables are not recorded and are completely unknown. Example: If we only recorded internet disruptions lasting 5 minutes or longer, shorter disruptions would be missing from the data set.
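A hypothetical sketch of how the censored survival times above might be recorded, using an event indicator to mark observations where only a lower bound is known:

```r
# Hypothetical survival study data; event = 0 marks a censored observation,
# so 'time' is only a lower bound for that subject's survival time
survival_data <- data.frame(
  subject = 1:4,
  time    = c(2.5, 5.0, 5.0, 1.2),  # years observed in the study
  event   = c(1, 0, 0, 1)           # 1 = event observed, 0 = censored
)
survival_data
```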
Big Data
- The term big data is not well defined but generally refers to data with characteristics that make traditional analysis methods impractical. This often involves automatically collected data where characteristics must be inferred rather than pre-defined.
Characteristics of Big Data:
* Size – Big data includes a vast number of individual cases, often with many variables. A high proportion of these may have empty (null) values, leading to sparse data.
* Speed – Data may arrive in real time at a very fast rate, such as sensors taking thousands of measurements per second.
* Variety – Big data often comes from multiple sources with different structures or may be largely unstructured.
* Reliability – The accuracy and reliability of individual data elements can be difficult to confirm and may vary over time (e.g., a sensor going offline).
Examples of Big Data:
* Information held by large online retailers on items viewed, purchased, and recommended for each customer.
* Atmospheric pressure measurements from sensors monitored by a national meteorological organization.
* Data collected by an insurance company from personal activity trackers used by policyholders to monitor exercise, food intake, and sleep.
These four characteristics are worth considering for any data source, not just big data.
Data Security, Privacy and Regulation
In any investigation, it is crucial to consider data security, privacy, and compliance with relevant regulations.
Combining data from different anonymized sources can sometimes lead to individual cases becoming identifiable.
The availability of data on the internet does not necessarily mean it can be freely used. Laws regarding data usage vary across jurisdictions, making this a complex area.
Reproducibility vs. Replication
Reproducibility means that when the results of a statistical analysis are reported, enough information is provided for an independent third party to repeat the analysis and obtain the same results.
Replication refers to repeating an experiment and obtaining the same or consistent results. However, replication can be difficult, expensive, or even impossible in cases where:
- The study is very large.
- The study relies on data collected at great expense or over many years.
- The study examines a unique occurrence (e.g., healthcare standards after a specific event).
Because replication may not always be feasible, reproducibility is often used as an alternative standard. Instead of validating results by fully replicating the study (including collecting new data), validation is achieved by an independent third party reproducing the same results using the original data set.
* Where randomness is used (for example in random forests or neural networks) or where simulation is used, reproducing the results requires the random seed to be set and recorded.
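For example, in R:

```r
set.seed(2024)   # set (and record) the seed before any random generation
rnorm(3)

set.seed(2024)   # the same seed gives identical 'random' output
rnorm(3)
```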
Elements required for reproducibility
Reproducibility typically requires the original data and computer code to be available (or fully specified) so others can repeat the analysis and verify results.
In all but the simplest cases, full documentation is necessary, including:
* A description of each data variable.
* An audit trail detailing decisions made during data cleaning and processing.
* Fully documented code.
While not strictly required for reproducibility, version control helps align drafts of code, documentation, and reports, making changes reversible if needed.
A commonly used tool for version control is git.
Ensuring reproducibility may also involve documenting the software environment, computing architecture, operating system, software toolchain, external dependencies, and version numbers.
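In R, sessionInfo() captures much of this automatically and can be saved with the analysis output:

```r
sessionInfo()   # R version, operating system, loaded packages and versions
```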
Literate Statistical Programming
combines explanations written in plain language with embedded snippets of code in a single document.
* In R, this can be done using R Markdown, which enables the creation of documents containing code, explanations, and outputs.
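A minimal R Markdown sketch (the title and values are illustrative); the plain-language text and the R chunk below it are rendered together into a single document:

````markdown
---
title: "Claims summary"
output: html_document
---

The average claim amount this quarter is calculated below.

```{r}
claims <- c(1200, 850, 2300)   # illustrative values
mean(claims)
```
````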
Manual Processes and Reproducibility
Doing things ‘by hand’ can create problems in reproducing the work. Examples include:
* Manually editing spreadsheets (instead of reading raw data into a programming environment and making changes there).
* Editing tables and figures (rather than ensuring the programming environment creates them exactly as needed).
* Downloading data manually from a website (instead of doing it programmatically; see the sketch after this list).
* Pointing and clicking (choosing operations from an on-screen menu, which is usually not recorded electronically).
The key point is that automated analysis is much easier to reproduce than manual interventions, which may be forgotten or difficult to document clearly.
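A sketch of the scripted alternative to a manual download (the URL is a placeholder, not a real data source):

```r
# Repeatable, documented alternative to downloading a file by hand
url <- "https://example.com/claims_2023.csv"   # placeholder URL
download.file(url, destfile = "claims_2023.csv", mode = "wb")
claims <- read.csv("claims_2023.csv")
```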
The Value of Reproducibility
- Reproducibility is necessary for a full technical review of the work, ensuring the analysis was done correctly and the conclusions are justified by the data and analysis.
- It may be required by external regulators and auditors.
- Reproducible research is easier to extend for investigating the effects of changes to the analysis or incorporating new data.
- Comparing the results of an investigation with a similar past investigation is easier and more reliable if the earlier study was reproducible. Differences between the two can be analyzed with confidence.
- The discipline of reproducible research, with its focus on good documentation and data storage, leads to fewer errors that need to be fixed, enhancing efficiency.
Issues that reproducibility does not address
- Reproducibility does not guarantee the analysis is correct. For example, if an incorrect distribution is assumed, the results may still be reproducible but wrong.
- Reproducibility provides transparency and allows an incorrect analysis to be challenged, but this may come too late if the activities needed for reproducibility are only carried out at the end of the analysis, by which time resources may have been moved to other projects.