Big data technologies in Health Study Design and Summarising Data Flashcards
What is Big Data?
- Huge volume of data
- Billions of rows, millions of columns
- Complexity of data types and structures
- Relational, unstructured
- Speed of new data creation and growth
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical
architectures and analytics to enable insights that unlock new sources of business value
What are the 3 (4) Vs of Big Data?
The 3 (4) Vs of Big Data:
* Volume - scale of data
* Velocity - analysis of streaming data - updates/changes on the data life cycle
* Variety - different forms of data e.g. posts, patient notes
* (Veracity) - uncertainty of the data
What are the challenges of research with health data?
How do we represent the medical knowledge in data, so that it is:
* Standardised
* Portable
* Computable
Free text means nothing to a computer
* Not searchable
* Not interoperable
* Not computable
Computers need codes, i.e. human input that defines a concept more clearly at the point of entry.
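As a minimal sketch of why codes matter, the Python snippet below compares a free-text search with a lookup on a coded field. The records, the `code` field, and the values `C001` and `C002` are hypothetical placeholders, not entries from any real terminology.

```python
# Hypothetical records: the same diagnosis written three different ways.
records = [
    {"id": 1, "text": "Myocardial infarction", "code": "C001"},
    {"id": 2, "text": "heart attack last year", "code": "C001"},
    {"id": 3, "text": "MI 2019", "code": "C001"},
    {"id": 4, "text": "Type 2 diabetes", "code": "C002"},
]


def find_by_text(records, phrase):
    """Return ids whose free text contains the phrase (case-insensitive)."""
    phrase = phrase.lower()
    return [r["id"] for r in records if phrase in r["text"].lower()]


def find_by_code(records, code):
    """Return ids whose coded field equals the given code."""
    return [r["id"] for r in records if r["code"] == code]


text_hits = find_by_text(records, "myocardial infarction")
code_hits = find_by_code(records, "C001")
print(text_hits)  # [1]
print(code_hits)  # [1, 2, 3]
```

The text search misses two of the three patients because the wording differs, while the coded lookup finds all of them.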
What are the types of bias in big health data?
- Collected data used for multiple purposes
- Patient information may not be complete, accurate, or current.
- Clinicians and insurers have to be aware of this
- Greater attention needs to be paid to the context in which data is recorded in the EHR system.
- Addressing information gaps in Randomised Controlled Trials
- Tracking provenance of data being produced
- Reimbursement bias
- Why record a Body Mass Index (BMI) in a thin person?
- Software bias
- System initiated – UK EHRs don’t allow negative values and <>
- Data errors
- 1% ‘resurrection’ rate in one UK longitudinal study
- Myocardial infarction in code ‘NOT’ in text….
- Different pick lists for terminologies and the use of non-standard representations e.g. BP!
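Data errors such as 'resurrections' can be screened for automatically. The sketch below flags patients who have an event dated after their recorded date of death. The patient ids, dates, and the function name `resurrected_patients` are hypothetical, and the rate is computed over patients with a recorded death, which is an assumption about how such a rate might be defined.

```python
from datetime import date

# Hypothetical data: recorded dates of death and dated clinical events.
deaths = {"p1": date(2015, 3, 1), "p2": date(2018, 7, 9)}
events = [
    ("p1", date(2014, 12, 5)),
    ("p1", date(2016, 1, 20)),  # recorded after the date of death
    ("p2", date(2018, 7, 9)),   # on the day of death, not flagged
    ("p3", date(2020, 2, 2)),   # no death recorded
]


def resurrected_patients(deaths, events):
    """Return sorted ids of patients with any event after their recorded death."""
    flagged = set()
    for patient_id, event_date in events:
        died = deaths.get(patient_id)
        if died is not None and event_date > died:
            flagged.add(patient_id)
    return sorted(flagged)


flagged = resurrected_patients(deaths, events)
rate = len(flagged) / len(deaths)
print(flagged)  # ['p1']
print(rate)     # 0.5
```

A flag is only a prompt for review: some entries after death, such as administrative notes, may be legitimate.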
What is the importance of reproducibility in health research?
The research community is struggling to ensure the transparency and correctness of published research
* The reasons are complex and interlinked (positive bias, intractable analysis, deluge of journals)
What is the concept of Learning Health Systems?
A ‘learning health system’ (LHS) continuously analyses data which is collected as part of routine care to monitor outcomes, identify improvements in care, and implement changes on the basis of evidence.
- Persistent issues with clinical research
- Hard to identify subjects
- Complex, costly CRFs with duplicate data entry
- Funding not cost-effective
- Integrated approach needed between clinical trials and observational studies
- Secondary problem: Diagnostic error
- 60% of litigation claims against GPs (UK/EU/US)
- Failure of Decision Support Systems for Diagnosis
- System increasingly data-driven!
- Fundamentally a cross-disciplinary challenge
What are the functions of a Learning Health System?
Defining functions of a LHS are to:
1. routinely and securely aggregate data from disparate sources
2. convert the data to knowledge
3. disseminate that knowledge, in actionable forms, to everyone who can benefit from it.
What is Classification?
**Classification** – A systematic representation of terms and concepts and the relationship between them.
* The apple is the fruit of the APPLE TREE, which is part of the ROSE family.
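The apple example can be written down as a small data structure. This is only an illustrative sketch: the `relations` mapping and the `trace` function are hypothetical names, and real classifications hold many more concepts and relationship types.

```python
# Each concept maps to (relationship, related concept).
relations = {
    "apple": ("is the fruit of", "apple tree"),
    "apple tree": ("is part of", "rose family"),
}


def trace(concept, relations):
    """Follow relationships upward, returning (concept, relation, target) triples."""
    path = []
    while concept in relations:
        relation, target = relations[concept]
        path.append((concept, relation, target))
        concept = target
    return path


apple_path = trace("apple", relations)
for source, relation, target in apple_path:
    print(source, relation, target)
```

Because the relationships are stored explicitly, a program can answer questions such as "which family does the apple belong to?" without parsing a sentence.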
Possible sources of bias?
Possible sources of bias
1. Health care system bias
a. Reimbursement system, pay for performance (why record BMI of a thin person?)
b. Role of clinician in the health care system; gatekeeping/non-gatekeeping
c. Professional guidelines for recording (UK’s Quality and Outcomes Framework)
d. Ease of access by patients to their records
e. Data sharing between health care providers
2. Practice workload
3. Variations between EHR system functionalities and layout
4. Coding systems and thesauruses
5. Knowledge and education regarding the use of electronic health record systems
6. Data extraction tools
7. Data processing – re-databasing
8. Research dataset preparation
9. Research methodologies
What is Nomenclature?
Nomenclature (vocabulary) – An agreed system of assigned names.
What are the anonymisation techniques for patient data?
- Quantitative
- removing or aggregating variables
- reducing the precision or detailed textual meaning of a variable
- Take special care with relational data, where connections between variables in related datasets can disclose identities
- Take special care with geo-referenced data, where identifying spatial references also have a geographical value
- Qualitative
- Identifiers should not be crudely removed or aggregated, as this can distort the data or even make them unusable
- Pseudonyms, replacement terms or vaguer descriptors should be used instead
- The objective: reasonable level of anonymisation whilst maintaining maximum content.
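The techniques above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the field names, the sample record, the ten-year age bands, and the "Participant N" pseudonym scheme are all hypothetical choices, not a prescribed standard.

```python
def age_band(age, width=10):
    """Aggregate an exact age into a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymise(record, pseudonyms):
    """Return a new record with identifiers removed, aggregated, or replaced."""
    # Replace the name with a stable pseudonym instead of blanking it.
    label = pseudonyms.setdefault(
        record["name"], f"Participant {len(pseudonyms) + 1}"
    )
    return {
        "person": label,
        "age_band": age_band(record["age"]),    # aggregated variable
        "area": record["postcode"].split()[0],  # reduced precision
        "diagnosis": record["diagnosis"],       # content kept for research
        # The direct identifier "nhs_number" is deliberately not copied.
    }


# Hypothetical record with made-up values.
record = {
    "name": "Jane Smith",
    "nhs_number": "0000000000",
    "age": 47,
    "postcode": "AB1 2CD",
    "diagnosis": "asthma",
}
pseudonyms = {}
anonymised = anonymise(record, pseudonyms)
print(anonymised)
```

Keeping the `pseudonyms` mapping means the same person receives the same label across records, so the data stays usable while the name is hidden. The mapping itself then needs to be stored securely.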
Obstacles in Big Data collection
- Restrictive policies on data access
- Lack of standard policy on patient data
- privacy/confidentiality
- No international standardisation on data collection routes
- Licenses for access to data can be expensive
What legislation has been passed?
Data Protection Act 1998
* Provisions for secure processing of identifiable data for medical research
* No definitions of “secure” and “medical research”
* Led to consent-or-anonymise approach
* According to the Information Commissioner’s Office (ICO) anonymisation code
Health and Social Care Act 2001
Section 251 of the NHS Act of 2006
* provisions for allowing linkage of patient-identifiable data
* Applications made to Health Research Authority (HRA)
* NHS Information Centre for Health and Social Care (NHSIC)
* Application assistance
* Trusted third party for data linkage
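The notes do not say how a trusted third party performs linkage. One common approach, assumed here purely for illustration, is for the third party to replace the patient identifier with a keyed hash so that researchers receive linked rows without the identifier itself. The key, the field names, and the sample rows below are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to be held only by the trusted third party.
SECRET_KEY = b"held-only-by-the-trusted-third-party"


def link_id(identifier, key=SECRET_KEY):
    """Turn an identifier into a keyed SHA-256 hash after normalising it."""
    normalised = identifier.replace(" ", "").upper()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def link(dataset_a, dataset_b, key=SECRET_KEY):
    """Join two datasets on the keyed hash and drop the raw identifier."""
    index = {link_id(row["identifier"], key): row for row in dataset_b}
    linked = []
    for row in dataset_a:
        token = link_id(row["identifier"], key)
        match = index.get(token)
        if match is not None:
            merged = {"link_id": token}
            merged.update({k: v for k, v in row.items() if k != "identifier"})
            merged.update({k: v for k, v in match.items() if k != "identifier"})
            linked.append(merged)
    return linked


# Hypothetical extracts from two care settings.
gp = [{"identifier": "ID 001", "bmi": 31.2}, {"identifier": "ID 002", "bmi": 22.4}]
hospital = [{"identifier": "id001", "admissions": 2}]
linked = link(gp, hospital)
print(len(linked))  # 1
```

Only the patient present in both extracts appears in the output, and the raw identifier is replaced by the hash.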
Challenges of Research Data Management
Threats to reproducible science?