Week 3: Big Data Science Flashcards
What is data?
There is no such thing as pure, raw data. All data is cooked in some sense. Observation always relies on theory or interpretation of some sort.
We observe data that represent facts or a phenomenon; data are marks that are determined by the facts.
What is big data?
“Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.”
Velocity, volume, variety, scope, resolution, indexicality.
Scope of big and small data
Small data: takes samples of data to make inferences about the whole population
Big data: N=all, collect and use all of the data pertaining to the phenomenon of interest. The scope is comprehensive.
Resolution of big and small data
Small data: low resolution; does not allow fine-grained distinctions between subcategories
Big data: high resolution and often very fine grained
Indexicality of small and big data
Small data: low to no indexicality: the data do not come with (much) metadata
Big data: highly indexical: the metadata specify the context of origin and use.
Google Flu Trends (example): a system to predict outbreaks of the seasonal flu using large-scale computational analyses of Google searches.
Traditionally the flu is monitored through reports from a sample of doctors to a central health agency. This is sample-based and slow: it can take up to two weeks before an outbreak of the flu is detected.
Google Flu Trends took CDC data on the spread of the seasonal flu from 2003-2008 and compared it with the 50 million most common search terms from those years to identify correlations.
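A minimal sketch in Python of the kind of correlation screening described above, under the assumption that we have a table of weekly search-term counts and a weekly CDC flu signal; the names (search_counts, cdc_flu, n_terms) are illustrative and not the actual Google Flu Trends pipeline.

```python
import pandas as pd

# Hypothetical inputs (illustrative, not Google's real data):
#   search_counts: DataFrame indexed by week, one column per search term
#   cdc_flu:       Series indexed by week with CDC influenza-like-illness rates
def top_flu_correlated_terms(search_counts: pd.DataFrame,
                             cdc_flu: pd.Series,
                             n_terms: int = 40) -> pd.Series:
    """Rank search terms by how strongly their weekly volume correlates
    with the CDC flu signal, and keep the best n_terms."""
    aligned = search_counts.join(cdc_flu.rename("flu"), how="inner")
    correlations = aligned.drop(columns="flu").corrwith(aligned["flu"])
    return correlations.sort_values(ascending=False).head(n_terms)
```

The selected terms would then feed a simple model that estimates current flu activity from fresh search volumes; only the screening step is sketched here.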
Volume, velocity, variety, scope, resolution, and indexicality of Google Flu Trends
High volume: using data from the 50 million most common search terms.
High velocity: predictions could be updated continuously, based on a never-ending flow of search data
High variety: relied on unstructured search data
Comprehensive in scope: relied on billions of internet searches in the US
High resolution: outbreaks could potentially be predicted at a very fine-grained scale.
Highly indexical: the data were rich in metadata; individual data points were time-stamped and could be traced back to specific IP addresses.
A new era of empiricism: flu trends + critical response
With big data, science can become “theory-free”. The scientific knowledge of the future will be born from data, unmediated by theory. Big data is the empiricist’s dream come true. The problem of induction has become irrelevant in scientific practice.
Critical response: all observation is ‘theory-laden’. This also holds for datasets that are analysed by a computer algorithm to find patterns.
“observation is always selective. It needs a chosen object” - Popper
It is true that the epistemology of big data science does not conform fully to purely hypothesis-driven approaches.
The end of subject-matter specialists: flu trends + critical response
With big data, we no longer need domain-specific expertise. Most scientists in the future will be domain-general ‘data scientists’.
Critical response: Researchers with domain-specific knowledge are needed to acquire and select data sources, help identify appropriate computational methods, and interpret results based on their domain-specific expertise.
Correlation and prediction trump causation and explanation: flu trends + critical response
With big data, the task of finding causal explanations becomes obsolete. We only need to search for correlations to make accurate predictions.
Critical response: prediction may be all that is needed in commerce, but in science prediction isn’t the only epistemic aim. Scientists often seek causal explanations.
Even with massive amounts of data there’s still the risk of basing one’s prediction on spurious correlations.
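A toy simulation (not from the source; sizes are arbitrary) of why screening huge numbers of candidate predictors makes spurious correlations almost unavoidable: even when every predictor is pure noise, the best of them still correlates noticeably with the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks, n_terms = 100, 50_000   # toy sizes, far smaller than real search logs

flu = rng.normal(size=n_weeks)               # target signal: pure noise
terms = rng.normal(size=(n_terms, n_weeks))  # candidate predictors: also pure noise

# Pearson correlation of every candidate with the target.
flu_c = flu - flu.mean()
terms_c = terms - terms.mean(axis=1, keepdims=True)
corrs = (terms_c @ flu_c) / (np.linalg.norm(terms_c, axis=1) * np.linalg.norm(flu_c))

print(f"Best correlation among {n_terms} unrelated predictors: {corrs.max():.2f}")
# Typically lands around 0.4: a "predictor" found by search alone, with no
# theoretical reason to expect any relationship at all.
```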
Google Flu Trends outcome
Performed well in its first year, but gave wildly inaccurate predictions in later years. In 2013 it overestimated flu cases by about 50%.
The algorithm turned out to be part ‘flu tracker’ and part ‘winter tracker’.
Absence of domain-specific theoretical knowledge: it had been assumed that searches for flu and flu-like symptoms are good indicators of influenza cases, but doctors know this is not the case.
ChatGPT and trust in big data
Your phone’s text prediction engine with superpowers, trained on vast amounts of data.
But: massive amounts of data about the world ≠ understanding of the world.
This introduces risks if we start trusting AI chatbots as sources of information.
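To make the ‘text prediction engine’ idea concrete, here is a deliberately tiny word-level bigram predictor (a toy sketch only; ChatGPT uses a large neural network over tokens, not count tables):

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count which word follows which in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    following = counts.get(word.lower())
    return following.most_common(1)[0][0] if following else None

model = train_bigrams("the flu spreads in winter and the flu peaks in winter")
print(predict_next(model, "the"))  # -> 'flu'
```

However fluent such a predictor becomes, it only reproduces statistical patterns in its training text, which is why large amounts of data about the world do not by themselves amount to understanding of the world.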
Realistic promises of big data
AlphaFold
“Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments into the design of the deep learning algorithm”
Logical empiricism (the ‘big data’ method) vs the scientific method
Logical empiricism:
Observation report → hypothesis/conjecture → prediction/explanation → confirmation
Popper’s falsificationism:
Problem statement → hypothesis/conjecture → prediction/explanation → falsification/refutation.
The hypothetico-deductive method
- Formulate hypothesis/conjecture
- Derive prediction/explanation
- Test