Week 3: Big Data Science Flashcards
What is data?
There is no such thing as pure, raw data. All data is cooked in some sense. Observation always relies on theory or interpretation of some sort.
We observe data that represent facts or a phenomenon; data are marks that are determined by the facts.
What is big data?
“Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.”
Velocity, volume, variety, scope, resolution, indexicality.
Scope of big and small data
Small data: takes samples of data to make inferences about the whole population
Big data: N=all, collect and use all of the data pertaining to the phenomenon of interest. The scope is comprehensive.
Resolution of big and small data
Small data: low resolution; does not allow fine-grained distinctions between subcategories
Big data: high resolution and often very fine grained
Indexicality of small and big data
Small data: low to no indexicality: the data do not come with (much) metadata
Big data: highly indexical: the metadata specify the context of origin and use.
Google Flu Trends (example): a system to predict outbreaks of the seasonal flu using large-scale computational analyses of Google searches.
Traditionally the flu is monitored through reports from a sample of doctors to a central health agency. This is sample-based and slow: it can take up to two weeks before an outbreak of the flu is detected.
Google Flu Trends took CDC data on the spread of the seasonal flu from 2003-2008 and compared it with the 50 million most common search terms from those years to identify correlations.
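A minimal sketch in Python of the kind of correlation screening described above, under the assumption that we have a table of weekly search-term counts and a weekly CDC flu signal; the names (search_counts, cdc_flu, n_terms) are illustrative and not the actual Google Flu Trends pipeline.

```python
import pandas as pd

# Hypothetical inputs (illustrative, not Google's real data):
#   search_counts: DataFrame indexed by week, one column per search term
#   cdc_flu:       Series indexed by week with CDC influenza-like-illness rates
def top_flu_correlated_terms(search_counts: pd.DataFrame,
                             cdc_flu: pd.Series,
                             n_terms: int = 40) -> pd.Series:
    """Rank search terms by how strongly their weekly volume correlates
    with the CDC flu signal, and keep the best n_terms."""
    aligned = search_counts.join(cdc_flu.rename("flu"), how="inner")
    correlations = aligned.drop(columns="flu").corrwith(aligned["flu"])
    return correlations.sort_values(ascending=False).head(n_terms)
```

The selected terms would then feed a simple model that estimates current flu activity from fresh search volumes; only the screening step is sketched here.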
Volume, velocity, variety, scope, resolution, and indexicality of Google Flu Trends
High volume: using data from the 50 million most common search terms.
High velocity: predictions could be updated continuously, based on a never-ending flow of search data
High variety: relied on unstructured search data
Comprehensive in scope: relied on billions of internet searches in the US
High resolution: outbreaks could potentially be predicted at a very fine-grained scale.
Highly indexical: the data were rich in metadata; individual data points were time-stamped and could be traced back to specific IP addresses.
A new era of empiricism: flu trends + critical response
With big data, science can become “theory-free”. The scientific knowledge of the future will be born from data, unmediated by theory. Big data is the empiricist’s dream come true. The problem of induction has become irrelevant in scientific practice.
Critical response: all observation is ‘theory-laden’. This also holds for datasets that are analysed by a computer algorithm to find patterns.
“observation is always selective. It needs a chosen object” - Popper
It is true that the epistemology of big data science does not conform fully to purely hypothesis-driven approaches.
The end of subject-matter specialists: flu trends + critical response
With big data, we no longer need domain-specific expertise. Most scientists in the future will be domain-general ‘data scientists’.
Critical response: Researchers with domain-specific knowledge are needed to acquire and select data sources, help identify appropriate computational methods, and interpret results based on their domain-specific expertise.
Correlation and prediction trump causation and explanation: flu trends + critical response
With big data, the task of finding causal explanations becomes obsolete. We only need to search for correlations to make accurate predictions.
Critical response: prediction may be all that is needed in commerce, but in science prediction isn’t the only epistemic aim. Scientists often seek causal explanations.
Even with massive amounts of data there’s still the risk of basing one’s prediction on spurious correlations.
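A toy simulation (not from the source; sizes are arbitrary) of why screening huge numbers of candidate predictors makes spurious correlations almost unavoidable: even when every predictor is pure noise, the best of them still correlates noticeably with the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks, n_terms = 100, 50_000   # toy sizes, far smaller than real search logs

flu = rng.normal(size=n_weeks)               # target signal: pure noise
terms = rng.normal(size=(n_terms, n_weeks))  # candidate predictors: also pure noise

# Pearson correlation of every candidate with the target.
flu_c = flu - flu.mean()
terms_c = terms - terms.mean(axis=1, keepdims=True)
corrs = (terms_c @ flu_c) / (np.linalg.norm(terms_c, axis=1) * np.linalg.norm(flu_c))

print(f"Best correlation among {n_terms} unrelated predictors: {corrs.max():.2f}")
# Typically lands around 0.4: a "predictor" found by search alone, with no
# theoretical reason to expect any relationship at all.
```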
Google Flu Trends outcome
Performed well in its first year, but gave wildly inaccurate predictions in later years. In 2013 it overestimated flu cases by about 50%.
The algorithm turned out to be part ‘flu tracker’ and part ‘winter tracker’.
Absence of domain-specific theoretical knowledge: it had been assumed that searches for flu and flu-like symptoms are good indicators of influenza cases, but doctors know this is not the case.
ChatGPT and trust in big data
Your phone’s text prediction engine with superpowers, trained on vast amounts of data.
But: massive amounts of data about the world ≠ understanding of the world.
This introduces risks if we start trusting AI chatbots as sources of information.
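To make the ‘text prediction engine’ idea concrete, here is a deliberately tiny word-level bigram predictor (a toy sketch only; ChatGPT uses a large neural network over tokens, not count tables):

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count which word follows which in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    following = counts.get(word.lower())
    return following.most_common(1)[0][0] if following else None

model = train_bigrams("the flu spreads in winter and the flu peaks in winter")
print(predict_next(model, "the"))  # -> 'flu'
```

However fluent such a predictor becomes, it only reproduces statistical patterns in its training text, which is why large amounts of data about the world do not by themselves amount to understanding of the world.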
Realistic promises of big data
AlphaFold
“Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments into the design of the deep learning algorithm”
Logical empiricism (the ‘big data’ method) vs the scientific method
Logical empiricism:
Observation report → hypothesis/conjecture → prediction/explanation → confirmation
Popper’s falsificationism:
Problem statement → hypothesis/conjecture → prediction/explanation → falsification/refutation.
The hypothetico-deductive method
- Formulate hypothesis/conjecture
- Derive prediction/explanation
- Test