Data analysis Flashcards
Key forms of data analysis
Descriptive
Inferential
Predictive
Descriptive analysis
Presents data in a simpler format that is more easily understood and interpreted by the user
Describes the data actually presented
Key measures/parameters used in a descriptive analysis
Measure of central tendency
Measure of dispersion
(Also the shape of the (empirical) distribution)
Measures of central tendency
Mean
Median
Mode
Measures of dispersion
Standard deviation
Ranges such as the interquartile range
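A minimal sketch of these descriptive measures using Python's standard `statistics` module; the data values are purely illustrative:

```python
import statistics

# Hypothetical sample of observations (illustrative data only)
data = [12, 15, 15, 18, 20, 22, 25, 30, 31, 40]

# Measures of central tendency
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the ordered data
mode = statistics.mode(data)      # most frequently occurring value

# Measures of dispersion
std_dev = statistics.stdev(data)  # sample standard deviation

# Interquartile range: upper quartile minus lower quartile
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
```

Note that `statistics.quantiles` defaults to the "exclusive" method, so other packages may give slightly different quartile values for the same data.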
Inferential analysis
Gather data in respect of a sample which is used to represent the wider population
Measures/Parameters of inferential analysis
Measure of central tendency
Measure of dispersion
(Hypothesis testing)
Predictive analysis
Extends the principles behind inferential analysis in order for the user to analyse past data and make predictions about future events
How is predictive analysis used to make projections
It uses an existing set of data with known attributes/features (training set) in order to discover potentially predictive relationships.
Those relationships are tested using a different set of data (test set) to assess the strength of those relationships
Typical example of a predictive analysis
Regression analysis
Linear regression
The relationship between a scalar dependent variable and an explanatory or independent variable is assumed to be linear, and the training set is used to determine the slope and intercept of the line
Eg a car’s speed and braking distance
Data Analysis Process
Develop a well-defined set of objectives
Identify the data items required for the analysis
Collection of the data from appropriate sources
Processing and formatting data for analysis
Cleaning data
Exploratory data analysis (descriptive/inferential/predictive)
Modelling the data
Communicating the results
Monitoring the process, update the data and repeat if necessary (actuarial control cycle)
The modelling team throughout the data analysis process
Ensure that any relevant professional guidance has been complied with
Ensure any relevant legal requirements are complied with
Possible issues with the data collection process that the analyst should be aware of
Whether the process was manual or automated
Limitations on the precision of the data collected
Whether there was any validation at source
If data was not collected automatically, how was it converted to an electronic form
Why is randomisation used?
Reduce the effect of bias
Reduce the effect of confounding variables (a variable that influences both the dependent variable and independent variable causing a false association)
Random sampling schemes
Simple random sampling
Stratified sampling
Another sampling method
Simple random sampling
Each item in the sample space has an equal chance of being selected
Stratified sampling
The sample space would first be divided into groups defined by specific criteria, before items are randomly selected from each group
Why would stratified sampling be used instead of random sampling
It aims to overcome the issue with random sampling that a random sample may not fully reflect the characteristics of the population
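A minimal sketch of stratified sampling with Python's standard `random` module; the population of policyholders grouped by region is hypothetical:

```python
import random

random.seed(42)  # fixed seed so the sample is reproducible

# Hypothetical population: members labelled by region (the strata)
population = (
    [("north", i) for i in range(60)]
    + [("south", i) for i in range(30)]
    + [("east", i) for i in range(10)]
)

# First divide the sample space into groups defined by the criterion
strata = {}
for region, member in population:
    strata.setdefault(region, []).append((region, member))

# Then randomly select items from each group, here in proportion
# to the group's size so every stratum is represented
sample_fraction = 0.1
sample = []
for group in strata.values():
    k = max(1, round(len(group) * sample_fraction))
    sample.extend(random.sample(group, k))
```

Unlike simple random sampling, this guarantees the small "east" stratum appears in the sample in the intended proportion.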
A common example of pre-processing
Grouping
Why was grouping used in the past
To reduce the amount of storage space required
To make the number of calculations manageable
Why is data currently grouped
To anonymise the data
To remove the possibility of extracting sensitive (or commercially sensitive) details
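A minimal sketch of grouping as pre-processing; the exact ages and the 10-year band width are illustrative assumptions:

```python
# Hypothetical exact ages collected for an analysis
ages = [23, 31, 35, 47, 52, 58, 64]

def band(age, width=10):
    # Replace an exact age with its 10-year band, e.g. 23 -> "20-29".
    # This reduces the number of distinct values to store and process,
    # and anonymises the data by discarding the exact value.
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

grouped = [band(a) for a in ages]
```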
Other aspects of data which are determined by the collection process which affect the way it is analysed
Cross-sectional data
Longitudinal data
Censored data
Truncated data
Cross-sectional data
Involves recording values of the variables of interest for each case in the sample at a single moment in time
Eg the amount spent by each of the members of a loyalty card scheme this week
Longitudinal data
Involves recording values at intervals over time
Eg the amount spent by a particular member of a loyalty card scheme each week for a year
Censored data
Occurs when the value of a particular variable is only partially known
Eg when a subject in a survival study survives beyond the end of the study - only a lower bound for the survival period is known
Truncated data
Occurs when measurements on some data are not recorded and thus completely unknown
Eg when collecting data on the periods of time users spend on the internet, only periods lasting longer than 10 minutes are recorded; periods shorter than 10 minutes are not recorded at all, so the data is truncated
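The distinction between the two examples above can be sketched in a few lines; the session lengths and the 30-minute study cut-off are illustrative assumptions:

```python
# Hypothetical session lengths in minutes
sessions = [3, 8, 12, 25, 47, 5, 18]

# Truncation: observations at or below the 10-minute threshold are
# never recorded, so they are entirely absent from the data set
truncated = [t for t in sessions if t > 10]

# Censoring: suppose a study ends at 30 minutes; longer sessions are
# partially known - we record only the lower bound and a censoring flag
censored = [(min(t, 30), t > 30) for t in sessions]
```

With truncation the short sessions leave no trace; with censoring every subject stays in the data, but some values are known only as bounds.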
Big data
Not well defined, but is used to describe data with characteristics that make it impossible to apply traditional methods for analysis
Typically, this means automatically collected data with characteristics that have to be inferred (deduced) from the data itself rather than known in advance from the design of an experiment
Properties that can lead to data being classified as big data
Volume/Size
Velocity/Speed
Variety
Veracity/Reliability
Volume/Size (as a property that can lead to data being classified as big data)
Big data may include a very large number of individual cases, and each case may include very many variables, a high proportion of which may be empty (or null) values - leading to sparse data
Velocity/Speed (as a property that can lead to data being classified as big data)
The data to be analysed might be arriving in real time at a very fast rate
Eg from an array of sensors collecting data thousands of times a second
Variety (as a property that can lead to data being classified as big data)
Big data is often composed of elements from many different sources which could have very different structures - or is largely unstructured
Veracity/Reliability (as a property that can lead to data being classified as big data)
Given the volume, velocity and variety of data the reliability of individual data elements might be difficult to ascertain and could vary over time
Sparse data
Data that include a high proportion of null (or empty) values
What may happen when combining different data from anonymised sources
The individual cases may become identifiable
Reproducibility
Refers to the idea that when the results of a statistical analysis are reported, sufficient information is provided so that an independent third party can repeat the analysis and arrive at the same results
Replication
Refers to someone repeating an experiment (from scratch) and arriving at the same (or at least consistent) results
When can replication be hard, expensive or impossible
If the study is big
If the study relies on data collected at great expense or over many years
If the study is of a unique occurrence
Elements required for reproducibility
The original data
The computer code
Full documentation
The random seed to be set (where there is randomness in the statistical or machine learning techniques being used)
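A minimal sketch of why setting the random seed matters for reproducibility; the analysis function is a hypothetical stand-in for any technique involving randomness:

```python
import random

def run_analysis(seed):
    # Setting the seed makes the "random" draws repeatable, so an
    # independent third party re-running the code arrives at
    # exactly the same results
    random.seed(seed)
    sample = [random.random() for _ in range(5)]
    return sum(sample) / len(sample)

# Repeating the analysis with the same recorded seed gives
# identical results; omitting the seed would not
first = run_analysis(2024)
second = run_analysis(2024)
assert first == second
```

In practice the seed is recorded alongside the data, code and documentation so the whole analysis can be reproduced end to end.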
Literate statistical programming
The program includes an explanation of the code in plain language
Why is reproducibility valuable
Reproducibility is necessary for a complete technical work review to ensure the analysis has been correctly carried out and the conclusions are justified by the data and analysis
Reproducibility may be required by external regulators and auditors
Reproducible research is more easily extended to investigate the effect of changes in the analysis or to incorporate new data
It is often desirable to compare the results of an investigation with a similar one carried out in the past;
if the earlier investigation was reported reproducibly, an analysis of the differences between the two can be carried out with confidence
Reproducible research can lead to fewer errors that need correcting in the original work, and hence, greater efficiency
What issues does reproducibility not address?
Reproducibility does not mean the analysis is correct
If the activities involved in reproducibility are only carried out at the end of an analysis, this may be too late for the resulting challenges to be dealt with