Big Data Flashcards
What is big data?
Big data is a large or complex dataset that often needs terabytes or petabytes of storage
What are the 4 terms used to define characteristics of big data?
Volume
Velocity
Variety
Veracity
What are the 3 additional terms regarding data relevance?
Variability
Value
Visualisation
Volume
The computing capacity required to store and analyse data
Velocity
The speed at which data are created and analysed
Variety
The types of data sources available (text, images, social media, administrative)
Veracity
The accuracy and credibility of data
Variability
The internal consistency of your data
Value
The costs required to undertake big data analysis should pay dividends for your organisation and its patients
Visualisation
The use of novel techniques to communicate the patterns that would otherwise be lost in massive tables of data
Where do big data come from?
1) Electronic or health records
2) The internet (IoT: Internet of Things)
3) Research or data repositories
4) Social media
What is data linkage?
It is the process of matching records from different sources based on key information
What is deterministic data linkage?
Exact matches based on personal information appearing in all of the datasets that are to be linked. N.B. the matches have to be exact
What is probabilistic data linkage?
Statistical weights are used to calculate the probability that data from different sources refer to the same individual
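A minimal sketch of the two linkage approaches on toy records may help here. Everything in it is an assumption for illustration: the records, the field names (`age`, `sex`, `area`), the m/u agreement probabilities and the threshold are made up, and the weighting follows the common log-ratio (Fellegi-Sunter style) scheme rather than any method named in these cards.

```python
import math

# Hypothetical toy records; IDs, field names and values are made up.
health = [
    {"id": "H1", "age": 34, "sex": "F", "area": "A100"},
    {"id": "H2", "age": 34, "sex": "M", "area": "A100"},
    {"id": "H3", "age": 58, "sex": "F", "area": "B200"},
]
tracker = [
    {"id": "T1", "age": 34, "sex": "F", "area": "A100"},
    {"id": "T2", "age": 58, "sex": "F", "area": "B201"},  # area recorded differently
]

FIELDS = ("age", "sex", "area")


def deterministic_link(left, right, fields=FIELDS):
    """Keep only pairs that agree exactly on every linkage field."""
    return [(l["id"], r["id"])
            for l in left for r in right
            if all(l[f] == r[f] for f in fields)]


# Assumed agreement probabilities per field (not estimated from real data):
# m = P(field agrees | same person), u = P(field agrees | different people).
M_U = {"age": (0.95, 0.02), "sex": (0.99, 0.50), "area": (0.90, 0.01)}


def match_weight(l, r):
    """Sum log2 agreement or disagreement weights across the fields."""
    weight = 0.0
    for field, (m, u) in M_U.items():
        if l[field] == r[field]:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


def probabilistic_link(left, right, threshold=3.0):
    """Link each right-hand record to its best-scoring left-hand record
    when that score reaches the (assumed) threshold."""
    links = []
    for r in right:
        best = max(left, key=lambda l: match_weight(l, r))
        if match_weight(best, r) >= threshold:
            links.append((best["id"], r["id"]))
    return links


print(deterministic_link(health, tracker))
print(probabilistic_link(health, tracker))
```

With these toy values the deterministic pass links only H1 and T1, while the probabilistic pass also links H3 and T2 despite the mismatched area code.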
NHI (National Health Index)
It is a unique health number used to track your interactions with the health system
Its purpose is so that GPs, pharmacists and DHBs can be reimbursed for their data and services
Increasingly, researchers are using encrypted versions of the NHI to investigate risk and protective factors associated with health outcomes
What is the IDI?
It is a large research database containing microdata about people and their households
The de-identified data come from a range of government and non-government agencies
Benefits of IDI
De-identified, linkable data accessed in a data safe haven
The resource is only as good as the data it contains
Questions about data quality
Selection biases in data
Resident population definitions vary from study to study
Some data the IDI has
Housing data
People and communities data
Education and training data
Income and work data
Benefits and social services data
Population data
Health data
Justice data
Privacy
Refers to the ability of a person to control the availability of information about themselves
Security
Refers to how the agency stores and controls access to data it holds
Confidentiality
Refers to the protection of information from and about individuals and organisations, ensuring that the information is not made available or disclosed to unauthorised individuals and entities
What are 3 key areas in which big data presents challenges?
1) Data governance
2) Data generation
3) Data output
Data governance
A collection of practices and processes which help to ensure the formal management of data assets within an organisation
Including storage, transfer, sharing and privacy
Data generation
Data quality is even more important when looking at large volumes of data
The belief that larger numbers result in a more accurate picture is not necessarily true
Including capturing, curating, updating and accuracy
Data output
Including analysis, querying large datasets, and generating meaningful and reliable outputs
Possible implications of using big data (7)
Data in = data out
Inadvertent discrimination against subpopulations
Possible to conduct what-if scenarios to determine impacts of policy
Control over “your data”
Need to rethink privacy policies
Bias
IMD Employment
The extent of exclusion of the working-age population from employment
This is measured by the number of unemployment benefits being paid out in that neighbourhood
IMD Income
The extent of income deprivation in a data zone by measuring state-funded financial assistance to those with insufficient income
Measured by the amount of state-funded financial support given to these people
IMD Crime
Measures the risk of victimisation and property damage/loss
Investigates the victim instead of the criminal because the effect on victims is more relevant to wellbeing. A shortcoming is that it does not measure the location of crime
IMD Housing
The proportion of houses in a neighbourhood which are overcrowded and the extent of renting
IMD Education
Captures youth disengagement and the proportion of the working-age population without a formal qualification
IMD Access
Measures the cost and inconvenience of travelling to access basic services
Definition of deprivation
The state of observable and demonstrable disadvantages relative to the local community, society or nation to which an individual or group belongs
How is neighbourhood deprivation measured?
It is measured by the census through a deficit model.
Ecological fallacy
Errors that arise from using information about groups to make assumptions about the individuals in said groups
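A small illustration of the ecological fallacy, as a sketch only: the neighbourhood names, the individual deprivation scores and the tolerance of 2 points are all hypothetical. Both groups share the same average, yet that average describes the individuals in one group far better than in the other.

```python
from statistics import mean

# Hypothetical individual deprivation scores (1 = least, 10 = most deprived).
neighbourhoods = {"A": [2, 3, 9, 10], "B": [5, 6, 6, 7]}


def share_misdescribed(scores, tolerance=2):
    """Share of individuals whose own score differs from the group mean
    by more than the (assumed) tolerance."""
    group_mean = mean(scores)
    far = [s for s in scores if abs(s - group_mean) > tolerance]
    return len(far) / len(scores)


for name, scores in neighbourhoods.items():
    print(name, mean(scores), share_misdescribed(scores))
```

Both neighbourhoods average 6, but assuming every resident scores about 6 is wrong for everyone in A and right for everyone in B.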
The Integrated Data Infrastructure (IDI) is a large research database containing microdata about people and households from different government agencies and other organisations across NZ. Briefly explain why the IDI has been referred to as a deficit data set.
To appear in the IDI, you have to have interacted with government agencies such as hospitals or the police, which people may only interact with if there are problems in their lives
You are part of a research team interested in finding out whether people who have a wearable activity tracker are less likely to visit a doctor than those who do not. V3.com are able to provide these data to your research team. The health datasets include an encrypted NHI, age, sex, ethnicity and usual residence information. However, data from the wearable activity trackers are only available with age, sex and usual residence information.
What method of data linkage discussed in class would be required to create your research data set?
Probabilistic, as not all of the key personal information is available in both datasets to be linked; specifically, the encrypted NHI and ethnicity are missing from the tracker data
What are 2 reasons why you would recommend using IMD instead of NZDEP?
1) The IMD has 28 indicators, which let us drill down into the drivers of deprivation
2) The IMD domains can be used collectively or separately, which cannot be done with NZDEP, so the IMD provides more flexibility in use
Advantages of IMD
Uses the IDI, which is more representative than the census
Explores drivers of deprivation, as it consists of specific indicators
Better small area information, as the average IMD population size is 700
Forms specific solutions for small populations
What are the advantages of NZDEP?
Weights domains
Widespread and well known to policy makers and analysts
Disadvantages of IMD
Has not been used much
The quality of data in the IDI is highly variable, and the IMD inherits the disadvantages of the IDI
Purpose of the IMD
https://www.fmhs.auckland.ac.nz/assets/fmhs/soph/epi/hgd/docs/Final_Brief%20report%20on%20the%20New%20Zealand%20IMD.pdf
Endemic disease
A disease that is constantly present in a given population
Endemic disease outbreak
Fluctuations in endemic diseases are to be expected. An outbreak is when occurrence exceeds expected levels
Outbreak definition
The occurrence of cases of disease in a community or region where it would not normally be expected or at a much greater level than expected
Outbreak is often used for a smaller, localised increase in disease occurrence; epidemics usually cover larger geographic areas
Epidemic
Similar to an outbreak, but at the scale of a country rather than a localised event
Defined as the occurrence of disease at a level greater than would normally be expected. Baseline levels are important, and there is rapid spread to many people
Definition of pandemic
An epidemic that has spread over several countries or continents, usually affecting a large number of people
Basic reproduction number
The basic reproduction number (R0) of an infection is the number of cases one case generates, on average, over the course of its infectious period
If R0 is less than 1, the infection will die out
If R0 is greater than 1, the infection can spread
How do you work out the basic reproductive rate?
R0 = probability of infection being transmitted per contact × rate of contacts in the host population × duration of infectiousness
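The product above can be sketched in a few lines of Python. The function names and the input values (5% transmission probability per contact, 10 contacts per day, 4 infectious days) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def basic_reproduction_number(p_transmission, contact_rate, duration):
    """R0 = transmission probability per contact
            x contacts per unit time
            x duration of infectiousness (same time unit)."""
    return p_transmission * contact_rate * duration


def interpret(r0):
    """Apply the R0 < 1 / R0 > 1 rule from the flashcards."""
    if r0 < 1:
        return "infection will die out"
    if r0 > 1:
        return "infection can spread"
    return "infection is at the threshold"


# Hypothetical inputs, for illustration only.
r0 = basic_reproduction_number(p_transmission=0.05, contact_rate=10, duration=4)
print(r0, interpret(r0))
```

With these made-up inputs each case generates two new cases on average, so the infection can spread.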
Limitations of the basic reproductive rate calculation
Assumes everyone is susceptible, which is not true in practice
What is the SIR model?
The population is compartmentalised into different states
1) Susceptible
2) Infected
3) Removed/Recovered
Some of the more complex models factor in exposed/not exposed states and vaccination, or use stochastic, multistate or multi-agent-based approaches
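As a rough sketch of how the three compartments interact, the basic SIR model can be stepped forward in small time increments. The parameter names (`beta` for the transmission rate, `gamma` for the removal rate), the values used and the simple Euler stepping are assumptions for illustration, not something specified in these cards.

```python
def simulate_sir(beta, gamma, s0, i0, r_init, days, dt=0.1):
    """Deterministic SIR model stepped with a simple Euler scheme.

    beta  : assumed transmission rate per day
    gamma : assumed removal (recovery) rate per day
    Returns one (S, I, R) tuple per day, including day 0.
    """
    s, i, r = float(s0), float(i0), float(r_init)
    n = s + i + r                      # total population stays constant
    steps_per_day = round(1 / dt)
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt   # S -> I
            new_removals = gamma * i * dt            # I -> R
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical values: beta / gamma = 2, one initial case in 1000 people.
history = simulate_sir(beta=0.5, gamma=0.25, s0=999, i0=1, r_init=0, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("peak day:", peak_day, "final S, I, R:", history[-1])
```

Under these assumptions the infected compartment rises, peaks and then falls as the pool of susceptible people shrinks.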
Herd immunity
A form of indirect protection from infectious disease that occurs when a large percentage of a population has become immune to an infection, thereby providing a measure of protection for individuals who are not immune
Also referred to as indirect protection or a herd effect
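One way to put a number on "a large percentage" is the standard threshold 1 - 1/R0. This formula is not stated in these cards; it is the textbook result for a simple model that assumes a well-mixed population and immunity that fully blocks transmission, so treat the sketch below as an approximation under those assumptions. The function name is hypothetical.

```python
def herd_immunity_threshold(r0):
    """Approximate share of the population that must be immune so that
    each case generates fewer than one new case (assumes homogeneous mixing)."""
    if r0 <= 1:
        return 0.0   # the infection dies out without any immunity
    return 1 - 1 / r0


for r0 in (0.8, 2, 4):
    print(r0, herd_immunity_threshold(r0))
```

The higher the R0, the larger the immune share needed before the herd effect protects people who are not immune.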
Strengths and weaknesses of big data in infectious disease epidemiology
Strengths:
Opportunity for identifying associations, patterns and trends in data; hypothesis generating. When analysed appropriately, can improve patient care and public health and reduce health care costs
Weaknesses:
Difficult to manage with traditional hardware and software; data quality and inconsistency issues
What information is contained in big data?
Data at population, regional or local levels, or spanning different geographical areas
Combining data from multiple sources to explore population health outcomes