Data analysis Flashcards

1
Q

Key forms of data analysis

A

Descriptive
Inferential
Predictive

2
Q

Descriptive analysis

A

Presents data in a simpler format that is more easily understood by the user

Describes the data actually presented

3
Q

Key measures/parameters used in a descriptive analysis

A

Measure of central tendency
Measure of the dispersion

(Also the shape of the (empirical) distribution)

4
Q

Measurements of central tendency

A

Mean
Median
Mode

5
Q

Measurements of the dispersion

A

Standard deviation

Ranges such as the interquartile range
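
These measures can be computed with Python's standard library. A minimal sketch using a hypothetical sample (the values are invented for illustration):

```python
import statistics

# Hypothetical sample values (illustrative only)
data = [12, 15, 15, 18, 20, 22, 25, 30, 45]

# Measures of central tendency
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

# Measures of dispersion
sd = statistics.stdev(data)                   # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1                                 # interquartile range
```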

6
Q

Inferential analysis

A

Gathers data in respect of a sample, which is used to represent the wider population

7
Q

Measures/Parameters of inferential analysis

A

Measure of central tendency
Measure of the dispersion

(Hypothesis testing)

8
Q

Predictive analysis

A

Extends the principles behind inferential analysis to allow the user to analyse past data and make predictions about future events

9
Q

How is predictive analysis used to make projections

A

It uses an existing set of data with known attributes/features (the training set) in order to discover potentially predictive relationships.

These relationships are then tested using a different set of data (the test set) to assess their strength

10
Q

Typical example of a predictive analysis

A

Regression analysis

11
Q

Linear regression

A

The relationship between a scalar dependent variable and an explanatory (or independent) variable is assumed to be linear, and the training set is used to determine the slope and intercept of the line

Eg a car’s speed and braking distance
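
A minimal sketch of fitting the slope and intercept by least squares; the speed and braking-distance values below are invented for illustration:

```python
# Hypothetical training set: car speed (mph) vs braking distance (feet)
speeds = [10, 20, 30, 40, 50]
distances = [15, 32, 47, 58, 78]

n = len(speeds)
mean_x = sum(speeds) / n
mean_y = sum(distances) / n

# Least-squares estimates of slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(speeds, distances))
         / sum((x - mean_x) ** 2 for x in speeds))
intercept = mean_y - slope * mean_x

# Predict the braking distance at a new speed
predicted = intercept + slope * 35
```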

12
Q

Data Analysis Process

A

Develop a well-defined set of objectives

Identify the data items required for the analysis

Collection of the data from appropriate sources

Processing and formatting data for analysis

Cleaning data

Exploratory data analysis (descriptive/inferential/predictive)

Modelling the data

Communicating the results

Monitoring the process, updating the data and repeating the analysis if necessary (actuarial control cycle)

13
Q

Responsibilities of the modelling team throughout the data analysis process

A

Ensure that any relevant professional guidance has been complied with

Ensure any relevant legal requirements are complied with

14
Q

Possible issues with the data collection process that the analyst should be aware of

A

Whether the process was manual or automated

Limitations on the precision of the data collected

Whether there was any validation at source

If data was not collected automatically, how was it converted to an electronic form

15
Q

Why is randomisation used?

A

Reduce the effect of bias

Reduce the effect of confounding variables (a variable that influences both the dependent variable and independent variable causing a false association)

16
Q

Random sampling schemes

A

Simple random sampling

Stratified sampling

Another sampling method

17
Q

Simple random sampling

A

Each item in the sample space has an equal chance of being selected

18
Q

Stratified sampling

A

The sample space would first be divided into groups defined by specific criteria, before items are randomly selected from each group
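
The two schemes can be sketched with Python's standard library; the population of policyholders labelled by region is hypothetical:

```python
import random

random.seed(1)  # fix the seed so the sketch is repeatable

# Hypothetical population: 80 "north" and 20 "south" policyholders
population = [("north", i) for i in range(80)] + [("south", i) for i in range(20)]

# Simple random sampling: every item has an equal chance of selection
simple_sample = random.sample(population, 10)

# Stratified sampling: divide the population into groups (strata) first,
# then randomly select from each group in proportion to its size
strata = {}
for region, item in population:
    strata.setdefault(region, []).append((region, item))

stratified_sample = []
for region, members in strata.items():
    k = round(10 * len(members) / len(population))  # proportional allocation
    stratified_sample.extend(random.sample(members, k))
```

With proportional allocation the stratified sample always contains 8 "north" and 2 "south" cases, whereas a simple random sample may not reflect that 80/20 split.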

19
Q

Why would stratified sampling be used instead of simple random sampling

A

It aims to overcome an issue with simple random sampling: a simple random sample may not fully reflect the characteristics of the population

20
Q

A common example of pre-processing

A

Grouping

21
Q

Why was grouping used in the past

A

To reduce the amount of storage space required

To make the number of calculations manageable

22
Q

Why is data currently grouped

A

To anonymise the data

To remove the possibility of extracting sensitive (or commercially sensitive) details
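
A minimal sketch of grouping as pre-processing, banding hypothetical exact ages so that only band counts (not individual ages) need be stored:

```python
from collections import Counter

# Hypothetical exact ages of individuals (illustrative only)
ages = [23, 27, 31, 34, 38, 41, 45, 52, 58, 63]

def band(age):
    """Map an exact age to a 10-year age band, e.g. 34 -> '30-39'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

# Count cases per band; the exact ages can then be discarded,
# which anonymises the data
grouped = Counter(band(a) for a in ages)
```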

23
Q

Other aspects of the data, determined by the collection process, that affect the way it is analysed

A

Cross-sectional data
Longitudinal data
Censored data
Truncated data

24
Q

Cross-sectional data

A

Involves recording values of the variables of interest for each case in the sample at a single moment in time

Eg the amount spent by each of the members of a loyalty card scheme this week

25
Q

Longitudinal data

A

Involves recording values at intervals over time

Eg the amount spent by a particular member of a loyalty card scheme each week for a year

26
Q

Censored data

A

Occurs when the value of a particular variable is only partially known

Eg when a subject in a survival study survives beyond the end of the study - only a lower bound for the survival period is known

27
Q

Truncated data

A

Occurs when measurements on some data are not recorded and thus completely unknown

Eg when collecting data on the periods of time users spend on the internet, only periods lasting longer than 10 minutes are recorded; periods shorter than 10 minutes are not recorded at all, so the data is truncated
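
The two examples above can be sketched in a few lines; the lifetimes and session durations are invented for illustration:

```python
# Censoring: a hypothetical survival study lasting 10 years.
# Subjects surviving past the study end are only partially known:
# we record the lower bound (10.0) together with a censoring flag.
true_lifetimes = [3.2, 7.5, 12.0, 9.1, 15.4]
study_end = 10.0
censored = [(min(t, study_end), t > study_end) for t in true_lifetimes]

# Truncation: internet sessions of 10 minutes or less are never
# recorded at all, so they are completely absent from the data.
session_minutes = [3, 12, 8, 25, 41, 6]
truncated = [m for m in session_minutes if m > 10]
```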

28
Q

Big data

A

Not well defined, but the term is used to describe data with characteristics that make it impossible to apply traditional methods of analysis

Typically, this means automatically collected data with characteristics that have to be inferred (deduced) from the data itself rather than known in advance from the design of an experiment

29
Q

Properties that can lead to data being classified as big data

A

Volume/Size

Velocity/Speed

Variety

Veracity/Reliability

30
Q

Volume/Size (as a property that can lead to data being classified as big data)

A

Big data may include a very large number of individual cases, and each case may include very many variables, a high proportion of which may be empty (or null) values - leading to sparse data

31
Q

Velocity/Speed (as a property that can lead to data being classified as big data)

A

The data to be analysed might be arriving in real time at a very fast rate

Eg from an array of sensors collecting data thousands of times a second

32
Q

Variety (as a property that can lead to data being classified as big data)

A

Big data is often composed of elements from many different sources which could have very different structures - or is largely unstructured

33
Q

Veracity/Reliability (as a property that can lead to data being classified as big data)

A

Given the volume, velocity and variety of the data, the reliability of individual data elements might be difficult to ascertain, and it could vary over time

34
Q

Sparse data

A

Data that include a high proportion of null (or empty) values

35
Q

What may happen when combining different data from anonymised sources

A

The individual cases may become identifiable, undoing the anonymisation

36
Q

Reproducibility

A

Refers to the idea that when the results of a statistical analysis are reported, sufficient information is provided so that an independent third party can repeat the analysis and arrive at the same results

37
Q

Replication

A

Refers to someone repeating an experiment (from scratch) and arriving at the same (or at least consistent) results

38
Q

When can replication be hard, expensive or impossible

A

If the study is big

If the study relies on data collected at great expense or over many years

If the study is of a unique occurrence

39
Q

Elements required for reproducibility

A

The original data

The computer code

Full documentation

The random seed to be set (where there is randomness in the statistical or machine learning techniques being used)
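
The last element can be sketched in Python: fixing the seed makes random draws repeatable, so an independent third party running the same code obtains exactly the same sample:

```python
import random

# First run: set the seed, then draw three pseudo-random numbers
random.seed(42)
first_run = [random.random() for _ in range(3)]

# Second run: resetting the same seed reproduces the same draws
random.seed(42)
second_run = [random.random() for _ in range(3)]

assert first_run == second_run  # identical results from the same seed
```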

40
Q

Literate statistical programming

A

The program includes an explanation of the code in plain language

41
Q

Why is reproducibility valuable

A

Reproducibility is necessary for a complete technical work review to ensure the analysis has been correctly carried out and the conclusions are justified by the data and analysis

Reproducibility may be required by external regulators and auditors

Reproducible research is more easily extended to investigate the effect of changes in the analysis or to incorporate new data

It is often desirable to compare the results of an investigation with a similar one carried out in the past;
if the earlier investigation was reported reproducibly, an analysis of the differences between the two can be carried out with confidence

Reproducible research can lead to fewer errors that need correcting in the original work, and hence, greater efficiency

42
Q

What issues does reproducibility not address?

A

Reproducibility does not mean the analysis is correct

If the activities involved in reproducibility are only carried out at the end of an analysis, it may be too late to deal with any challenges that arise