Data analysis Flashcards
Key forms of data analysis
Descriptive
Inferential
Predictive
Descriptive analysis
Presents data in a simpler format that is more easily understood and interpreted by the user
Describes the data actually presented
Key measures/parameters used in a descriptive analysis
Measure of central tendency
Measure of dispersion
(Also the shape of the (empirical) distribution)
Measures of central tendency
Mean
Median
Mode
Measures of dispersion
Standard deviation
Ranges such as the interquartile range
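A minimal sketch of these descriptive measures using Python's standard `statistics` module; the data values are purely illustrative:

```python
import statistics

# Hypothetical sample of observations (illustrative data only)
data = [12, 15, 15, 18, 20, 22, 25, 30, 31, 40]

# Measures of central tendency
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most frequently occurring value

# Measures of dispersion
std_dev = statistics.stdev(data)  # sample standard deviation

# Interquartile range: upper quartile minus lower quartile
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```

Note that `statistics.quantiles` defaults to the "exclusive" method, so other packages may give slightly different quartile values for the same data.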
Inferential analysis
Gather data in respect of a sample which is used to represent the wider population
Measures/Parameters of inferential analysis
Measure of central tendency
Measure of dispersion
(Hypothesis testing)
Predictive analysis
Extends the principles behind inferential analysis in order for the user to analyse past data and make predictions about future events
How is predictive analysis used to make projections
It uses an existing set of data with known attributes/features (training set) in order to discover potentially predictive relationships.
Those relationships are tested using a different set of data (test set) to assess the strength of those relationships
Typical example of a predictive analysis
Regression analysis
Linear regression
The relationship between a scalar dependent variable and an explanatory or independent variable is assumed to be linear, and the training set is used to determine the slope and intercept of the line
Eg a car’s speed and braking distance
Data Analysis Process
Develop a well-defined set of objectives
Identify the data items required for the analysis
Collection of the data from appropriate sources
Processing and formatting data for analysis
Cleaning data
Exploratory data analysis (descriptive/inferential/predictive)
Modelling the data
Communicating the results
Monitoring the process, update the data and repeat if necessary (actuarial control cycle)
The modelling team throughout the data analysis process
Ensure that any relevant professional guidance has been complied with
Ensure any relevant legal requirements are complied with
Possible issues with the data collection process that the analyst should be aware of
Whether the process was manual or automated
Limitations on the precision of the data collected
Whether there was any validation at source
If data was not collected automatically, how was it converted to an electronic form
Why is randomisation used?
Reduce the effect of bias
Reduce the effect of confounding variables (a variable that influences both the dependent variable and independent variable causing a false association)
Random sampling schemes
Simple random sampling
Stratified sampling
Another sampling method
Simple random sampling
Each item in the sample space has an equal chance of being selected
Stratified sampling
The sample space would first be divided into groups defined by specific criteria, before items are randomly selected from each group
Why would stratified sampling be used instead of random sampling
It aims to overcome the issue with random sampling that a random sample may not fully reflect the characteristics of the population
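A minimal sketch of stratified sampling with Python's standard `random` module; the population of policyholders grouped by region is hypothetical:

```python
import random

random.seed(42)  # fixed seed so the sample is reproducible

# Hypothetical population: members labelled by region (the strata)
population = (
    [("north", i) for i in range(60)]
    + [("south", i) for i in range(30)]
    + [("east", i) for i in range(10)]
)

# First divide the sample space into groups defined by the criterion
strata = {}
for region, member in population:
    strata.setdefault(region, []).append((region, member))

# Then randomly select items from each group, here in proportion
# to the group's size so every stratum is represented
sample_fraction = 0.1
sample = []
for group in strata.values():
    k = max(1, round(len(group) * sample_fraction))
    sample.extend(random.sample(group, k))
```

Unlike simple random sampling, this guarantees the small "east" stratum appears in the sample in the intended proportion.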
A common example of pre-processing
Grouping
Why was grouping used in the past
To reduce the amount of storage space required
To make the number of calculations manageable
Why is data currently grouped
To anonymise the data
To remove the possibility of extracting sensitive (or commercially sensitive) details
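A minimal sketch of grouping as pre-processing; the exact ages and the 10-year band width are illustrative assumptions:

```python
# Hypothetical exact ages collected for an analysis
ages = [23, 31, 35, 47, 52, 58, 64]

def band(age, width=10):
    # Replace an exact age with its 10-year band, e.g. 23 -> "20-29".
    # This reduces the number of distinct values to store and process,
    # and anonymises the data by discarding the exact value.
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

grouped = [band(a) for a in ages]
```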
Other aspects of data which are determined by the collection process which affect the way it is analysed
Cross-sectional data
Longitudinal data
Censored data
Truncated data
Cross-sectional data
Involves recording values of the variables of interest for each case in the sample at a single moment in time
Eg the amount spent by each of the members of a loyalty card scheme this week
Longitudinal data
Involves recording values at intervals over time
Eg the amount spent by a particular member of a loyalty card scheme each week for a year
Censored data
Occurs when the value of a particular variable is only partially known
Eg when a subject in a survival study survives beyond the end of the study - only a lower bound for the survival period is known
Truncated data
Occurs when measurements on some data are not recorded and thus completely unknown
Eg when collecting data on the periods of time users spend on the internet, only periods lasting longer than 10 minutes are recorded; periods shorter than 10 minutes are not recorded at all, so the data is truncated
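The distinction between the two examples above can be sketched in a few lines; the session lengths and the 30-minute study cut-off are illustrative assumptions:

```python
# Hypothetical session lengths in minutes
sessions = [3, 8, 12, 25, 47, 5, 18]

# Truncation: observations at or below the 10-minute threshold are
# never recorded, so they are entirely absent from the data set
truncated = [t for t in sessions if t > 10]

# Censoring: suppose a study ends at 30 minutes; longer sessions are
# partially known - we record only the lower bound and a censoring flag
censored = [(min(t, 30), t > 30) for t in sessions]
```

With truncation the short sessions leave no trace; with censoring every subject stays in the data, but some values are known only as bounds.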
Big data
Not well defined, but is used to describe data with characteristics that make it impossible to apply traditional methods for analysis
Typically, this means automatically collected data with characteristics that have to be inferred (deduced) from the data itself rather than known in advance from the design of an experiment
Properties that can lead to data being classified as big data
Volume/Size
Velocity/Speed
Variety
Veracity/Reliability
Volume/Size (as a property that can lead to data being classified as big data)
Big data may include a very large number of individual cases, and each case may include very many variables, a high proportion of which may be empty (or null) values - leading to sparse data
Velocity/Speed (as a property that can lead to data being classified as big data)
The data to be analysed might be arriving in real time at a very fast rate
Eg from an array of sensors collecting data thousands of times a second
Variety (as a property that can lead to data being classified as big data)
Big data is often composed of elements from many different sources which could have very different structures - or is largely unstructured
Veracity/Reliability (as a property that can lead to data being classified as big data)
Given the volume, velocity and variety of data the reliability of individual data elements might be difficult to ascertain and could vary over time
Sparse data
Data that include a high proportion of null (or empty) values
What may happen when combining different data from anonymised sources
The individual cases may become identifiable
Reproducibility
Refers to the idea that when the results of a statistical analysis are reported, sufficient information is provided so that an independent third party can repeat the analysis and arrive at the same results
Replication
Refers to someone repeating an experiment (from scratch) and arriving at the same (or at least consistent) results
When can replication be hard, expensive or impossible
If the study is big
If the study relies on data collected at great expense or over many years
If the study is of a unique occurrence
Elements required for reproducibility
The original data
The computer code
Full documentation
The random seed to be set (where there is randomness in the statistical or machine learning techniques being used)
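A minimal sketch of why setting the random seed matters for reproducibility; the analysis function is a hypothetical stand-in for any technique involving randomness:

```python
import random

def run_analysis(seed):
    # Setting the seed makes the "random" draws repeatable, so an
    # independent third party re-running the code arrives at
    # exactly the same results
    random.seed(seed)
    sample = [random.random() for _ in range(5)]
    return sum(sample) / len(sample)

# Repeating the analysis with the same recorded seed gives
# identical results; omitting the seed would not
first = run_analysis(2024)
second = run_analysis(2024)
assert first == second
```

In practice the seed is recorded alongside the data, code and documentation so the whole analysis can be reproduced end to end.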
Literate statistical programming
The program includes an explanation of the code in plain language
Why is reproducibility valuable
Reproducibility is necessary for a complete technical work review to ensure the analysis has been correctly carried out and the conclusions are justified by the data and analysis
Reproducibility may be required by external regulators and auditors
Reproducible research is more easily extended to investigate the effect of changes in the analysis or to incorporate new data
It is often desirable to compare the results of an investigation with a similar one carried out in the past;
if the earlier investigation was reported reproducibly, an analysis of the differences between the two can be carried out with confidence
Reproducible research can lead to fewer errors that need correcting in the original work, and hence, greater efficiency
What issues does reproducibility not address?
Reproducibility does not mean the analysis is correct
If the activities involved in reproducibility are only carried out at the end of an analysis, this may be too late for the resulting challenges to be dealt with