ERS43 Health Information In The Era Of Big Data Flashcards
Data, Information, Knowledge
Real world —(Collection Coding)—> Data —(Processing, Analysis, Interpretation, Presentation)—> Information —(Judgement, Conclusions)—> Knowledge —(Politics, Commitment)—> Decision and Action
Key issues with Data
- Validity (reflect reality?)
- Reliability
- Completeness
- Timeliness
- Analysis
- Confidentiality
- Information governance
Data linkage: Shared health records
Advantages
- Timely and accurate information for care
- Reduce duplication of tests / treatment
- Reduce medical errors
- Improve disease surveillance + monitoring of public health
- Gather comprehensive statistics for formulating public health policy
- Efficiency gains / Reduce cost from health expenditure
Routine data
- Continually collected, assembled, made available repeatedly (not one-off)
- Part of data collection system conducted at regular intervals
—> Track ***trends over time
- Information coded according to
—> ***Well-defined protocols, standards to allow comparisons (e.g. with other countries / population / over time)
—> e.g. International Classification of Diseases standards
- Demography
- basic characteristics of population e.g. age, sex, geographical distribution
- ***Census / Population registers
—> conducted every 10 years
—> Gold standard (in terms of completeness)
—> Disadvantage: Self-reported, Under-reporting, Problems with small area estimates, Outdated, Expensive
- Vital statistics
- systematically tabulated information e.g. births, marriages, deaths
- **Birth records
- **Mortality: causes, distribution (by time, person, place)
—> most reliable health data (∵ death is unambiguous)
—> causes often inaccurate / incomplete (hard to determine exact cause)
—> insensitive measure of health —> non-fatal disease burden not reflected
- Morbidity
- prevalence, incidence of diseases
- **Infectious disease notifications
—> notifiable diseases
—> generally adequate for monitoring trends but sometimes incomplete
- **Disease registers (e.g. HK cancer registry)
—> identify a specific group
—> may miss cases due to no contact / non-identification
- ***Impairment, disability, handicap
—> functional status more relevant to patient (compared with disease status)
—> collected only from surveys
- Health services data
- access and supply, utilisation, activity, costs of using health services
- e.g. diagnoses, interventions, procedures, outcomes
- relevant if the condition results in health care use e.g. fracture
- data likely to be ***incomplete, poor quality
- record health service activity rather than outcomes / effectiveness e.g. disease burden
- Health-related characteristics / risk factors e.g. deprivation, living conditions, employment, housing
- data from other agencies: social care, labour, housing etc.
- ***limited use: categories / definitions may be incompatible between different data sets
- incomplete data, poor quality
Specially collected data
- Collected for a particular purpose
—> Fulfil a specific ***time-limited study
- Without intention of regular repetition / adherence to specific standards (outside of study needs)
- Information coded according to
—> ***Task at hand, may not conform to international standards —> difficult to compare with other data
—> e.g. Research, Commissioned studies
Disease registers
- Collect details of all diagnosed cases
- ***Reliable identification of cases: inclusion criteria with a defined population
- ***Continually updated (e.g. recovered, died, moved away)
- ***Expensive to maintain
- Require multiple data sources for case ascertainment + exclude duplication
- ***Useful for incidence rates, survival, remission, trends, making projections
- Linkage to other records e.g. health care events, co-morbidities, medication, spending, lifestyle
Health data in HK
Clinical Iceberg (most people do not seek health services) —> health service data do not capture the whole population disease burden
E.g. HA Clinical Management System
- demographic data
- health service activity
- diagnosis, procedures codes: ICD standards (allow comparison)
- laboratory / pathology results
- radiology imaging + reports
- clinical notes: structured (coded) / semi-structured / unstructured (free text, require mining to extract information)
- medications record
Diagnostic coding for looking at diseases
- Standardised codes of diagnoses for ***accurate comparison (with other places etc.)
- Categorised for analysis
- e.g. Diseases, Disorders, Symptoms, Injuries, Procedures
Examples:
- ICD9, ICD10
- ICPC-2 (primary care), DSM (psychiatry), SNOMED CT (comprehensive clinical terminology, e.g. symptoms, findings, procedures)
- **Use:
1. Epidemiology - clinical burden of disease, risk factors
2. Financing, reimbursement
3. Health service planning / resource allocation
4. Evaluation of services
Limitations:
- Only covers those who have the disease and use services (not the full picture of morbidity)
- Depends on accuracy / completeness of coding
- Differing coding practices in different places (e.g. coding may shift to increase reimbursement)
- Expensive, time-consuming
- Changes in case definitions across time / place
- Historical comparisons require mapping between ICD versions (e.g. ICD-9 to ICD-10)
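The version-mapping limitation can be illustrated with a minimal sketch. It assumes diabetes is identified by the three-character categories 250 in ICD-9 and E10 to E14 in ICD-10 (WHO version); the records, years, and the helper `is_diabetes` are made up for illustration and do not come from any real data set.

```python
# Toy sketch: counting one condition consistently across two ICD versions.
# Assumption: diabetes = category 250 (ICD-9) or E10-E14 (ICD-10, WHO version).
DIABETES_CATEGORIES = {
    "ICD-9": {"250"},
    "ICD-10": {"E10", "E11", "E12", "E13", "E14"},
}

def is_diabetes(code, version):
    """Return True if the code's three-character category is a diabetes category."""
    category = code.split(".")[0]  # "E11.9" -> "E11", "250.0" -> "250"
    return category in DIABETES_CATEGORIES[version]

# Hypothetical records: (year, coding version in use that year, diagnosis code)
records = [
    (1998, "ICD-9", "250.0"),
    (1998, "ICD-9", "401.9"),
    (2005, "ICD-10", "E11.9"),
    (2005, "ICD-10", "I10"),
    (2005, "ICD-10", "E10.1"),
]

counts = {}
for year, version, code in records:
    if is_diabetes(code, version):
        counts[year] = counts.get(year, 0) + 1

print(counts)
```

One ICD-9 category maps to several ICD-10 categories here, so a trend computed without such a mapping would mix a coding change with a real change in disease burden.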
Surveys
- Previous surveys: Local / National
- readily available
- may be authoritative
- may not be generalisable to specific population of interest —> require “modelling” assumption
- ***variable quality: self-reported? representativeness?
- Thematic Household Surveys: chronic disease, insurance, service utilisation
- Behavioural Risk Factor Surveillance System (BRFSS) (by CHP): smoking, alcohol, diet etc.
- Commissioned surveys
- **tailor-made, expensive esp. from scratch
- **more relevant —> ∵ collect specific information of interest
Qualitative data
Local description of environmental / social factors
- may give good understanding / stimulate further research
- ***difficult to assess scale of health impact of identified problems (∵ lack quantitative data)
Important to assess:
- **People’s perception of how health problems affect them
- could identify issues important to people
- qualitative data need careful handling e.g. context, unstable responses if question wording is inconsistent
- e.g. patient feedback
***Summary of Routine data Pros + Cons
Pros:
- ***Readily available, Lower cost (∵ already done)
- Useful for ***initial assessment (baseline of expected levels of health / disease)
- Identify important issues / hypotheses for further research
Cons:
- ***Not up-to-date, Less complete (except Census)
- ***Collected for a different purpose, so may not include the specific variables of interest or report on specific populations
- ***Not reliable e.g. subject to political influence / manipulation
- Data linkage may not be possible (∵ cannot access raw data)
- ***Individual level data inaccessible
Alternative to Routine data: Research studies
- Ecological studies
- Cross-sectional surveys
- Cohort studies
- Other commissioned studies
Application of data: Diabetes
Question: How many diabetics in HK?
Answers:
- Published studies by academia, government (CHP, household surveys), NGOs
- Lab results: HbA1c, OGTT, Fasting glucose etc.
- Diagnosis coding
- Diabetic medication prescriptions
- Self-reported diagnoses
- Attendances at diabetic-specific clinics
- Population denominator (time / person / place): Census, Population projections (determine population at risk to find out prevalence / incidence)
Limitations:
- Completeness (undiagnosed cases, private healthcare system e.g. GP)
- Data linkage (double counting of patients) between data sets
- Matching numerator (no. of cases) and denominator (population at risk) across different time, area, population
- Information governance
- patient confidentiality, data security, consent, ethical approval
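The numerator and denominator issues above can be sketched in a few lines. This is a toy example, assuming each data set can be reduced to a set of patient identifiers that are shared across sources (the identifiers, source names, and population size are hypothetical).

```python
# Toy sketch: prevalence = unique cases / population at risk.
# Assumption: every source uses the same patient identifier, so linkage is a set union.

def estimate_prevalence(case_sources, population_at_risk):
    """Return (number of unique cases, prevalence) after removing double counting."""
    unique_cases = set().union(*case_sources.values())
    return len(unique_cases), len(unique_cases) / population_at_risk

# Hypothetical patient identifiers found in each data source
sources = {
    "diagnosis_codes": {"p1", "p2", "p3"},
    "prescriptions": {"p2", "p3", "p4"},
    "lab_results": {"p3", "p5"},
}

n_cases, prevalence = estimate_prevalence(sources, population_at_risk=50)

# Summing the sources without linkage counts some patients more than once
naive_count = sum(len(ids) for ids in sources.values())

print(n_cases, prevalence, naive_count)
```

Undiagnosed cases and patients seen only in the private sector appear in none of the sets, so even the linked estimate remains a lower bound under these assumptions.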
Worldwide trends in diabetes
- Prevalence ↑
- ∵ population ageing, growth, chronic incurable disease
- Disease burden ↑ (in terms of prevalence / number of people affected)
—> ↑ faster in low / middle income countries (than in high income countries)
—> e.g. higher proportion of deaths
- Most people with diabetes from low / middle income countries
- even though prevalence higher in high income countries
- Incidence beginning to stabilise in high income countries e.g. HK
Diabetes in China, HK
China: Absolute number highest in world, among highest in prevalence
HK: similar prevalence (~11%)
Big data
- Volume
- Variety (e.g. texts, images, numbers)
- Velocity (lots of data collected continuously)
- Veracity (uncertainty of data)
- Value (is data valuable?)
Use of big data
***Predictive analytics
Classical:
- quantitative risk prediction
- based on classical statistical learning
- from more structured data sources
Current:
- **Digitalisation of health-related records and data sharing
—> bigger / more variety of data sets
—> cover more people
- Availability of AI / deep learning to analyse **heterogeneous data sets
—> e.g. strengths of digital imaging over human interpretation
Goals:
- ***Generate new models, predictions
- ***Improve decision making
Decision making
During clinical decisions (esp. difficult decisions)
—> Clinicians should view output as only statistical prediction and maintain suspicion
—> Prediction may be wrong
Statistical performance of Risk prediction models (e.g. QRISK3) —> measured by:
1. **Discrimination (Do patients with the outcome have a higher predicted risk than those without? i.e. can the model tell higher-risk patients apart from everyone else?)
2. **Calibration (Does the predicted risk agree with the observed number of outcomes? i.e. is the predicted risk accurate?)
—> Studies usually have better Discrimination than Calibration
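The two measures can be made concrete with a minimal sketch. It assumes discrimination is summarised by the c-statistic (the proportion of outcome / non-outcome patient pairs in which the patient with the outcome has the higher predicted risk) and calibration by a simple observed-to-expected ratio; the risks and outcomes below are made up and are not from QRISK3 or any real model.

```python
# Toy sketch of discrimination (c-statistic) and calibration (observed / expected).

def c_statistic(risks, outcomes):
    """Share of (outcome, no-outcome) pairs ranked correctly; ties count as half."""
    with_outcome = [r for r, y in zip(risks, outcomes) if y == 1]
    without_outcome = [r for r, y in zip(risks, outcomes) if y == 0]
    concordant = 0.0
    for p in with_outcome:
        for n in without_outcome:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(with_outcome) * len(without_outcome))

def observed_to_expected(risks, outcomes):
    """Observed number of outcomes divided by the number the model predicts."""
    return sum(outcomes) / sum(risks)

# Hypothetical predicted risks and observed outcomes (1 = outcome occurred)
risks = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 0, 1, 0, 1, 1]

# Halving every risk keeps the ranking, so only calibration changes
halved = [r / 2 for r in risks]

print(c_statistic(risks, outcomes), observed_to_expected(risks, outcomes))
print(c_statistic(halved, outcomes), observed_to_expected(halved, outcomes))
```

Rescaling the risks leaves the c-statistic unchanged but moves the observed-to-expected ratio further from 1, which is one way a model can discriminate well while being poorly calibrated.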
Axes of Machine learning and Big data
Traditional clinical studies:
- analyse data from many patients using a statistical model
- low on machine learning spectrum
- Analyse data
Deep learning models:
- top of spectrum
- generative adversarial networks —> can ***generate new images from learning a large database of images
- Analyse data + ***Generate new data (i.e. Predictive models)
Trust “black box”?
- Machine learning algorithms: the more complicated they are, the more powerful they are / the better the results, but the more opaque they become compared with classical statistical models (harder to explain how the output was reached) —> less easy to interpret
- Developers reluctant to report algorithms ∵ proprietary
- ***Infeasible to interpret hidden features (∵ output depends on complex interactions with uninterpreted features in other layers)
Limitations:
1. ***Biases in training data
—> if the training data have inherent biases (and the reasons behind them are not accounted for)
—> AI will preserve the biases
—> generate data with biases
—> data inaccuracy, missingness, selective measurement even though more data is available
—> interpret reported performance with caution
2. ***Privacy of health care data
- compare anonymised data with public information e.g. google image
—> can re-identify anonymised people
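The re-identification risk can be sketched as a simple linkage on quasi-identifiers. This is a toy example, assuming the anonymised data set still carries age, sex, and district, and that a public source lists the same fields alongside names; all records, names, and field choices are hypothetical.

```python
# Toy sketch: linking an "anonymised" data set to public information.
# Assumption: both sources share the quasi-identifiers age, sex, and district.
QUASI_IDENTIFIERS = ("age", "sex", "district")

def linkage_key(record):
    """Build a comparison key from the quasi-identifiers of one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def reidentify(anonymised, public):
    """Return {name: diagnosis} for people matching exactly one anonymised record."""
    by_key = {}
    for rec in anonymised:
        by_key.setdefault(linkage_key(rec), []).append(rec)
    matches = {}
    for person in public:
        candidates = by_key.get(linkage_key(person), [])
        if len(candidates) == 1:  # a unique match reveals the diagnosis
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches

# Hypothetical anonymised health records (names removed)
anonymised = [
    {"age": 34, "sex": "F", "district": "District A", "diagnosis": "diabetes"},
    {"age": 51, "sex": "M", "district": "District B", "diagnosis": "asthma"},
    {"age": 51, "sex": "M", "district": "District B", "diagnosis": "hypertension"},
]

# Hypothetical public information
public = [
    {"name": "Person X", "age": 34, "sex": "F", "district": "District A"},
    {"name": "Person Y", "age": 51, "sex": "M", "district": "District B"},
]

matches = reidentify(anonymised, public)
print(matches)
```

Person Y is not re-identified because two anonymised records share the same key; the fewer people share a combination of attributes, the easier re-identification becomes.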