Big Data Flashcards

1
Q

What is big data?

A

Big data is a large or complex dataset that often needs terabytes or petabytes of storage

2
Q

What are the 4 terms used to define the characteristics of big data?

A

Volume
Velocity
Variety
Veracity

3
Q

What are the 3 additional terms regarding data relevance?

A

Variability
Value
Visualisation

4
Q

Volume

A

The computing capacity required to store and analyse data

5
Q

Velocity

A

The speed at which data are created and analysed

6
Q

Variety

A

The types of data sources available (text, images, social media, administrative)

7
Q

Veracity

A

The accuracy and credibility of data

8
Q

Variability

A

The internal consistency of your data

9
Q

Value

A

The costs required to undertake big data analysis should pay dividends for your organisation and their patients

10
Q

Visualisation

A

The use of novel techniques to communicate patterns that would otherwise be lost in massive tables of data

11
Q

Where do big data come from?

A

1) Electronic or health records
2) The internet (IoT: Internet of Things)
3) Research or data repositories
4) Social media

12
Q

What is data linkage?

A

It is the process of matching records from different sources based on key information

13
Q

What is deterministic data linkage?

A

Exact matches based on personal information appearing in all of the datasets that are to be linked. N.B. the matches have to be exact.
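
As a rough illustration of exact matching, the sketch below links two small record sets only when every key field agrees. The field names (`id_hash`, `dob`) and the records themselves are hypothetical, invented for this example, and real linkage work involves cleaning and governance steps not shown here.

```python
def deterministic_link(records_a, records_b, keys):
    """Pair up records whose values agree exactly on every key field."""
    # Index dataset B by its tuple of key values for fast exact lookup.
    index = {}
    for rec in records_b:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)

    linked = []
    for rec in records_a:
        for match in index.get(tuple(rec[k] for k in keys), []):
            linked.append((rec, match))
    return linked


# Hypothetical records: "id_hash" stands in for an encrypted identifier.
hospital = [
    {"id_hash": "a1", "dob": "1990-01-01", "visits": 3},
    {"id_hash": "b2", "dob": "1985-06-15", "visits": 1},
]
pharmacy = [
    {"id_hash": "a1", "dob": "1990-01-01", "scripts": 5},
    {"id_hash": "b2", "dob": "1985-06-16", "scripts": 2},  # dob off by one day
]

links = deterministic_link(hospital, pharmacy, keys=["id_hash", "dob"])
print(len(links))  # only the "a1" pair agrees on both key fields
```

The second pair is not linked because a single differing character in `dob` breaks the exact match, which is the main weakness of this approach when data are entered inconsistently.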

14
Q

What is probabilistic data linkage?

A

Statistical weights are used to calculate the probability that data from different sources refer to the same individual
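
One common way to build such weights is a Fellegi-Sunter style score, sketched below. This is an assumption about the method, not something the card specifies, and the `m` and `u` probabilities and the example records are made-up illustrative values.

```python
import math

# Assumed illustrative probabilities per field (not from the source):
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
FIELD_PROBS = {
    "age": (0.95, 0.05),
    "sex": (0.98, 0.50),
    "residence": (0.90, 0.01),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum log2 agreement/disagreement weights across the fields."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement removes it
    return total


# Hypothetical records for illustration only.
health = {"age": 34, "sex": "F", "residence": "Area 12"}
tracker_same = {"age": 34, "sex": "F", "residence": "Area 12"}
tracker_other = {"age": 34, "sex": "M", "residence": "Area 40"}

print(match_weight(health, tracker_same) > 0)   # evidence for a match
print(match_weight(health, tracker_other) < 0)  # evidence against a match
```

In practice a pair is accepted as a link when its total weight passes a chosen cut-off, and borderline pairs are often sent for manual review.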

15
Q

NHI

A

It is a health number used to track your interactions with the health system

The purpose is so that GPs, pharmacists and DHBs can be reimbursed for their data and services

Increasingly, researchers are using encrypted versions of the NHI to investigate risk and protective factors associated with health outcomes

16
Q

What is the IDI?

A

It is a large research database containing microdata about people and their households

The de-identified data come from a range of government and non-government agencies

17
Q

Benefits of IDI

A

De-identified, linkable data accessed in a data safe haven

The resource is only as good as the data it contains:
- questions about data quality
- selection biases in data

Resident population definitions vary from study to study

18
Q

Some data the IDI has

A

Housing data
People and communities data
Education and training data
Income and work data
Benefits and social services data
Population data
Health data
Justice data

19
Q

Privacy

A

Refers to the ability of a person to control the availability of information about themselves

20
Q

Security

A

Refers to how an agency stores and controls access to the data it holds

21
Q

Confidentiality

A

Refers to the protection of information from and about individuals and organisations, ensuring that the information is not made available or disclosed to unauthorised individuals and entities

22
Q

What are 3 key areas in which big data presents challenges?

A

1) Data governance
2) Data generation
3) Data output

23
Q

Data governance

A

A collection of practices and processes which help to ensure the formal management of data assets within an organisation

Including storage, transferring, sharing and privacy

24
Q

Data generation

A

Data quality is even more important when looking at large volumes of data.
The belief that larger numbers result in a more accurate picture is not necessarily true.

Including capturing, curating, updating and accuracy

25
Q

Data output

A

Including analysis, querying large datasets, and generating meaningful and reliable outputs

26
Q

Possible implications of using big data (7)

A

Data in = data out
Inadvertent discrimination against subpopulations
Possible to conduct what-if scenarios to determine the impacts of policy

Control over “your data”
Need to rethink privacy policies
Bias

27
Q

IMD employment

A

The extent of exclusion of the working-age population from employment.

This is measured by the number of unemployment benefits being paid out in that neighbourhood.

28
Q

IMD income

A

The extent of income deprivation in a data zone, measured by the amount of state-funded financial assistance given to those with insufficient income

29
Q

Crime IMD

A

Measures the risk of victimisation and property damage/loss

Investigates the victim instead of the criminal because the effect on the victim is more relevant to wellbeing. A shortcoming is that it does not measure the location of crime.

30
Q

Housing IMD

A

The proportion of houses in a neighbourhood which are overcrowded and the extent of renting

31
Q

IMD education

A

Captures youth disengagement and the proportion of the working-age population without any formal qualifications.

32
Q

IMD access

A

Measures the cost and inconvenience of travelling to access basic services

33
Q

Definition of deprivation

A

The state of observable and demonstrable disadvantages relative to the local community, society or nation to which an individual or group belongs

34
Q

How is neighbourhood deprivation measured?

A

It is measured by the census through a deficit model.

35
Q

Ecological fallacy

A

Errors that arise from using information about groups to make assumptions about the individuals in those groups
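
A tiny numeric sketch of how this error can arise. The areas and incomes are invented purely for illustration and do not come from any real dataset.

```python
from statistics import mean

# Hypothetical individual incomes (in $000) for two invented areas.
areas = {
    "Area A": [20, 25, 30, 95, 100],
    "Area B": [45, 50, 50, 55, 60],
}

# Group-level summary: the kind of figure an area-based index reports.
area_means = {name: mean(incomes) for name, incomes in areas.items()}

# Individual-level check: how many Area A residents earn less than
# every single resident of Area B?
lowest_in_b = min(areas["Area B"])
a_below_all_of_b = sum(1 for x in areas["Area A"] if x < lowest_in_b)

print(area_means["Area A"] > area_means["Area B"])  # group view: A looks better off
print(a_below_all_of_b, "of", len(areas["Area A"]))  # individual view
```

Area A has the higher average, yet most of its residents are worse off than everyone in Area B, so applying the area-level figure to any one person would be misleading.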

36
Q

The Integrated Data Infrastructure (IDI) is a large research database containing microdata about people and households from different government agencies and other organisations across NZ. Briefly explain why the IDI has been referred to as a deficit data set.

A

To appear in the IDI, you have to have interacted with government agencies, such as hospitals or the police, which people may only interact with if there are problems in their lives

37
Q

You are part of a research team interested in finding out whether people who have a wearable activity tracker are less likely to visit a doctor than those who do not. V3.com are able to provide these data to your research team, with an encrypted NHI, age, sex, ethnicity and usual residence information in health datasets. However, data from the wearable activity trackers are only available with age, sex and usual residence information.

What method of data linkage discussed in class would be required to create your research data set?

A

Probabilistic, as not all of the key personal information is available in the data to be linked: the tracker data lack the encrypted NHI and ethnicity

38
Q

What are 2 reasons why you would recommend using IMD instead of NZDEP?

A

The IMD has 28 indicators, which let us drill down into the drivers of deprivation. The IMD domains can also be used collectively or separately, which cannot be done with NZDEP, so the IMD provides more flexibility in use.

39
Q

Advantages of IMD

A

Uses the IDI, which is more representative than the census

Explores drivers of deprivation, as it consists of specific indicators

Better small-area information, as the average IMD population size is 700

Forms specific solutions for small populations

40
Q

What are the advantages of NZDEP?

A

Weights domains

Widespread and well known to policy makers and analysts

41
Q

Disadvantages of IMD

A

Has not been used much

The quality of data in the IDI is highly variable, and the IMD inherits the disadvantages of the IDI

42
Q

Purpose of the IMD

A

https://www.fmhs.auckland.ac.nz/assets/fmhs/soph/epi/hgd/docs/Final_Brief%20report%20on%20the%20New%20Zealand%20IMD.pdf

43
Q

Endemic disease

A

A disease that is constantly present in a given population

44
Q

Endemic disease outbreak

A

Fluctuations in endemic diseases are to be expected. An outbreak is when occurrence exceeds expected levels.
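
To make "exceeds expected levels" concrete, here is a minimal sketch that flags a count above a baseline threshold. The rule used (mean plus two standard deviations), the function name and the weekly counts are all assumptions for illustration; the card does not say how expected levels are set.

```python
from statistics import mean, stdev


def exceeds_expected(baseline_counts, current_count, n_sd=2.0):
    """Flag a count above baseline mean + n_sd sample standard deviations.

    The mean + 2 SD rule is an assumed, simplified threshold.
    """
    threshold = mean(baseline_counts) + n_sd * stdev(baseline_counts)
    return current_count > threshold


# Hypothetical weekly case counts for an endemic disease.
baseline = [10, 12, 9, 11, 10, 13, 8, 11]

print(exceeds_expected(baseline, 12))  # within normal fluctuation
print(exceeds_expected(baseline, 14))  # above the assumed threshold
```

Surveillance systems typically use more careful baselines, for example adjusting for season, but the idea of comparing against an expected level is the same.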

45
Q

Outbreak definition

A

The occurrence of cases of disease in a community or region where it would not normally be expected or at a much greater level than expected

Outbreak is often used for a smaller, localised increase in disease occurrence; epidemics usually cover larger geographic areas

46
Q

Epidemic

A

Quite similar to an outbreak, but at the level of a country rather than a localised event

Defined as the occurrence of disease at a level greater than would normally be expected. Baseline levels are important, and there is a rapid spread to many people.

47
Q

Definition of pandemic

A

An epidemic that has spread over several countries or continents, usually affecting a large number of people

48
Q

Basic reproduction number

A

The basic reproduction number (R0) of an infection is the number of cases one case generates, on average, over the course of its infectious period.
If R0 is less than 1, the infection will die out.
If R0 is greater than 1, the infection can spread.

49
Q

How do you work out the basic reproductive rate?

A

Probability of infection being transmitted × the rate of contacts in the host population × the duration of infectiousness
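
The product above can be sketched as a small function. The parameter names and the example values (a 5% transmission probability, 10 contacts per day, 4 infectious days) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def basic_reproduction_number(p_transmission, contact_rate, duration):
    """R0 = transmission probability x contact rate x infectious duration.

    contact_rate and duration must use the same time unit
    (for example, contacts per day and days).
    """
    return p_transmission * contact_rate * duration


# Hypothetical values: 5% chance of transmission per contact,
# 10 contacts per day, infectious for 4 days.
r0 = basic_reproduction_number(0.05, 10, 4)

print(round(r0, 2))
print("can spread" if r0 > 1 else "will die out")
```

With these assumed inputs each case produces about two new cases, so the infection can spread; halving any one of the three inputs would bring R0 down to about 1.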

50
Q

Limitations of the basic reproductive rate calculation

A

Assumes everyone is susceptible, when that's not true
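
One standard way to relax this assumption, not stated on the card, is the effective reproduction number: R0 scaled by the fraction of the population that is still susceptible. The sketch below uses hypothetical values.

```python
def effective_reproduction_number(r0, susceptible_fraction):
    """Scale R0 by the share of the population still susceptible."""
    if not 0.0 <= susceptible_fraction <= 1.0:
        raise ValueError("susceptible_fraction must be between 0 and 1")
    return r0 * susceptible_fraction


# Hypothetical example: R0 of 2 with only 40% of people susceptible.
r_eff = effective_reproduction_number(2.0, 0.4)

print(round(r_eff, 2))
print("can spread" if r_eff > 1 else "will die out")
```

Here the infection would spread in a fully susceptible population but not in one where 60% of people are already immune.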

51
Q

What is the SIR model?

A

The population is compartmentalised into different states:

1) Susceptible
2) Infected
3) Removed/recovered

More complex models factor in exposed/not exposed states and vaccination, or are stochastic, multistate or multiagent-based models
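
A minimal deterministic sketch of the three compartments, stepped forward in small time increments. The transmission rate `beta`, removal rate `gamma`, step size and starting numbers are assumed values for illustration, not parameters from the source.

```python
def simulate_sir(s0, i0, removed0, beta, gamma, days, dt=0.1):
    """Simulate a basic SIR model and return one (S, I, R) tuple per day.

    beta:  assumed transmission rate per day
    gamma: assumed removal (recovery) rate per day
    """
    s, i, r = float(s0), float(i0), float(removed0)
    n = s + i + r  # total population stays constant
    history = [(s, i, r)]
    steps_per_day = round(1 / dt)
    for _day in range(days):
        for _step in range(steps_per_day):
            new_infections = beta * s * i / n * dt  # S -> I
            new_removals = gamma * i * dt           # I -> R
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical outbreak: 990 susceptible, 10 infected, beta/gamma = 2.
history = simulate_sir(990, 10, 0, beta=0.5, gamma=0.25, days=100)
peak_infected = max(i for _, i, _ in history)

print(round(peak_infected))    # size of the epidemic peak
print(round(history[-1][0]))   # people never infected by day 100
```

Because the ratio of `beta` to `gamma` is above 1, the number infected rises to a peak and then falls as the susceptible pool shrinks.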
52
Q

Herd immunity

A

A form of indirect protection from infectious disease that occurs when a large percentage of a population has become immune to an infection, thereby providing a measure of protection for individuals who are not immune.
Also referred to as indirect protection or a herd effect
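
How large that percentage needs to be is often approximated as 1 - 1/R0. This formula is not on the card and assumes a well-mixed population with fully protective immunity; the R0 value used below is hypothetical.

```python
def herd_immunity_threshold(r0):
    """Approximate immune fraction needed to stop sustained spread.

    Uses the simple 1 - 1/R0 approximation; returns 0.0 when R0 <= 1
    because the infection dies out without any immunity.
    """
    if r0 <= 1:
        return 0.0
    return 1 - 1 / r0


# Hypothetical infection where each case generates 4 new cases.
threshold = herd_immunity_threshold(4)
print(threshold)  # 0.75, i.e. 75% of the population immune
```

The more transmissible the infection, the closer the threshold gets to 100%.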

53
Q

Strengths and weaknesses of big data in infectious disease epidemiology

A

Strengths:
Opportunity for identifying associations, patterns and trends in data; hypothesis generating. When analysed appropriately, can improve patient care and public health and reduce health care costs.

Weaknesses:
Difficult to manage with traditional hardware and software; data quality and inconsistency issues

54
Q

What information is contained in big data?

A

Data at population, regional or local levels, or spanning different geographical areas
Combining data from multiple sources to explore population health outcomes