Week 3: Big Data Science Flashcards

1
Q

What is data?

A

There is no such thing as pure, raw data. All data is cooked in some sense. Observation always relies on theory or interpretation of some sort.

We observe data that can represent facts/a phenomenon. Data are marks that are determined by the facts.

2
Q

What is big data?

A

“Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.”

Velocity, volume, variety, scope, resolution, indexicality.

3
Q

Scope of big and small data

A

Small data: takes samples of data to make inferences about the whole population

Big data: N=all, collect and use all of the data pertaining to the phenomenon of interest. The scope is comprehensive.

4
Q

Resolution of big and small data

A

Small data: low resolution, does not allow one to make fine-grained distinctions between subcategories

Big data: high resolution and often very fine grained

5
Q

Indexicality of small and big data

A

Small data: low to no indexicality: the data do not come with (much) metadata

Big data: highly indexical: the metadata specify the context of origin and use.

6
Q

Google flu trends: example

A

A system to predict outbreaks of the seasonal flu using large-scale computational analyses of Google searches.

Traditionally the flu is monitored through reports from a sample of doctors to a central health agency. This is sample-based and slow: it can take up to two weeks before an outbreak of the flu is detected.

Google flu trends took data from the CDC on the spread of the seasonal flu from 2003-2008 and compared it with the 50 million most common search terms from those years to identify correlations.
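
Below is a minimal sketch (in Python, with made-up numbers) of the kind of correlation screening described above: correlate each candidate search term's weekly query volume with the CDC's weekly flu counts and keep the strongest correlates. This is not Google's actual pipeline; all variable names and data here are hypothetical.

```python
import numpy as np

# Hypothetical inputs: weekly CDC flu counts (2003-2008) and weekly query
# volumes for a set of candidate search terms (rows = terms, cols = weeks).
rng = np.random.default_rng(0)
n_weeks, n_terms = 260, 1000
cdc_flu = rng.poisson(lam=200, size=n_weeks).astype(float)
query_volume = rng.poisson(lam=50, size=(n_terms, n_weeks)).astype(float)

def pearson_r(x, y):
    """Pearson correlation between two equal-length series."""
    x, y = x - x.mean(), y - y.mean()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# Screen every term against the flu signal and keep the strongest correlates.
scores = np.array([pearson_r(query_volume[i], cdc_flu) for i in range(n_terms)])
top_terms = np.argsort(-np.abs(scores))[:45]   # keep some top set of terms
print("best |r| among candidates:", np.abs(scores).max())
```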

7
Q

volume, velocity, variety, scope, resolution and indexicality of flu trends

A

High volume: using data from the 50 million most common search terms.

High velocity: predictions could be updated continuously, based on a never-ending flow of search data

High variety: relied on unstructured search data

Comprehensive in scope: relied on billions of internet searches in the US

High resolution: outbreaks could potentially be predicted at a very fine-grained scale.

Highly indexical: the data were rich in metadata; individual data points were time-stamped and could be traced back to a specific IP address.

8
Q

A new era of empiricism: flu trends + critical response

A

With big data, science can become “theory-free”. The scientific knowledge of the future will be born from data, unmediated by theory. Big data is the empiricist’s dream come true. The problem of induction has become irrelevant in scientific practice.

Critical response: all observation is ‘theory-laden’. This also holds for datasets that are analysed by a computer algorithm to find patterns.
“Observation is always selective. It needs a chosen object” - Popper

It is true that the epistemology of big data science doesn’t align completely with purely hypothesis-driven approaches.

9
Q

The end of subject matter specialists; flu trends + critical response

A

With big data, we no longer need domain-specific expertise. Most scientists in the future will be domain-general ‘data scientists’.

Critical response: Researchers with domain-specific knowledge are needed to acquire and select data sources, help identify appropriate computational methods, and interpret results based on their domain-specific expertise.

10
Q

Correlation and prediction trump causation and explanation: flu trends + critical response

A

With big data, the task of finding causal explanations becomes obsolete. We only need to search for correlations to make accurate predictions.

Critical response: prediction may be all that is needed in commerce, but in science prediction isn’t the only epistemic aim. Scientists often seek causal explanations.
Even with massive amounts of data there’s still the risk of basing one’s prediction on spurious correlations.
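
A toy simulation (with made-up data) of why this risk grows with the number of variables screened: if enough unrelated candidate variables are correlated against a target, some will correlate strongly by pure chance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_predictors = 100, 50_000   # many candidate variables, all pure noise
target = rng.normal(size=n_obs)
candidates = rng.normal(size=(n_predictors, n_obs))

# Correlate every (unrelated) candidate with the target.
t = target - target.mean()
c = candidates - candidates.mean(axis=1, keepdims=True)
r = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))

# Even though nothing is causally related, the best correlation looks impressive.
print("strongest spurious |r|:", np.abs(r).max())
```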

11
Q

Google flu trends outcome

A

Performed well in its first year, but gave wildly inaccurate predictions in later years. In 2013 it overestimated flu cases by about 50%.

The algorithm turned out to be part ‘flu tracker’ and part ‘winter tracker’.

Absence of domain-specific theoretical knowledge: it had been assumed that searches for flu and flu-like symptoms are good indicators of influenza cases, but doctors know this is not the case.

12
Q

ChatGPT and trust in big data

A

Your phone’s text prediction engine with superpowers, trained on vast amounts of data.

But: massive amounts of data about the world ≠ understanding of the world.

This introduces risks if we start trusting AI chatbots as sources of information.
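
A minimal sketch of the “text prediction” idea: a bigram model that predicts the next word purely from co-occurrence counts in its (tiny, invented) training text, with no representation of what the words mean. Real chatbots use vastly larger neural models, but the contrast between statistics and understanding is the same.

```python
from collections import Counter, defaultdict

# Tiny, made-up training corpus.
corpus = "the flu spreads in winter . the flu causes fever . winter is cold".split()

# Count which word follows which.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the training text."""
    return follows[word].most_common(1)[0][0] if follows[word] else None

print(predict_next("flu"))     # 'spreads' or 'causes', based only on counts
print(predict_next("winter"))  # '.' or 'is' -- no understanding involved
```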

13
Q

Realistic promises of big data

A

AlphaFold

“Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.”

14
Q

Logical empiricism (the method big data science resembles) vs. Popper’s falsificationism

A

Logical empiricism:
Observation report –> hypothesis/conjecture –> prediction/explanation –> confirmation

Popper’s falsificationism:
Problem statement –> hypothesis/conjecture –> prediction/explanation –> falsification/refutation.

15
Q

The hypothetico-deductive method

A

  1. Formulate hypothesis/conjecture
  2. Derive prediction/explanation
  3. Test

16
Q

Genomics: hybrid

A

In genomics, big data studies are a hybrid of data-driven and hypothesis-driven research (DDHD).

Example: genome-wide association studies (GWAS).
Step 1: generate millions of candidate hypotheses about which SNPs play a causal role in the disease.
Step 2: eliminate a large number of candidates through statistical analyses and prior biological knowledge (see the sketch below).
Step 3: adopting a ‘traditional’ molecular approach, try to identify the mechanisms in which the remaining SNPs play a role.
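
A rough sketch of what the statistical elimination in step 2 can look like: test each candidate SNP for association with case/control status and keep only those passing a multiple-testing threshold. The data, counts and threshold are hypothetical, and real GWAS pipelines involve much more (quality control, correction for population structure, etc.); scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
n_snps, n_people = 10_000, 2_000
status = rng.integers(0, 2, size=n_people)                # 0 = control, 1 = case
genotypes = rng.integers(0, 3, size=(n_snps, n_people))   # 0/1/2 copies of the allele

def assoc_pvalue(geno, status):
    """Chi-square test of the genotype-by-status contingency table."""
    table = np.array([[np.sum((geno == g) & (status == s)) for g in (0, 1, 2)]
                      for s in (0, 1)])
    return chi2_contingency(table)[1]

pvals = np.array([assoc_pvalue(genotypes[i], status) for i in range(n_snps)])
hits = np.where(pvals < 0.05 / n_snps)[0]   # Bonferroni-corrected threshold
print(f"{len(hits)} candidate SNPs survive for mechanistic follow-up")
```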

17
Q

Leonelli’s on the meaning of ‘data’

A

“An object can be considered as a datum as long as (1) it is treated as potential evidence for one or more claims about the world, and (2) it is possible to circulate it among individuals/groups.” (Leonelli, 2016)

18
Q

De-contextualization

A

The decoupling of data from the specific features of the local context of their production, to make them ready for sharing in large cross-disciplinary databases.
* This typically involves applying bio-ontology terms.

19
Q

Re-contextualization

A

The adoption of data in new research contexts, where researchers have different goals and expertise.
* Requires having information about their provenance through metadata.
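
A schematic, invented example of what de- and re-contextualization can look like in practice: a local measurement is labelled with a shared bio-ontology term so it can sit in a cross-disciplinary database, while provenance metadata preserve enough of the original context for later reuse. The ontology ID, field names and helper function are illustrative, not from Leonelli.

```python
# A single data point as it might be submitted to a cross-disciplinary database.
record = {
    "value": 4.2,                      # the measurement itself
    "ontology_term": "GO:0006915",     # shared bio-ontology label (illustrative)
    # Provenance metadata: what a later re-user needs to judge fitness for reuse.
    "metadata": {
        "organism": "Mus musculus",
        "lab": "Example Lab",
        "instrument": "flow cytometer",
        "protocol": "protocol-v3 (local)",
        "date": "2015-06-01",
    },
}

# Re-contextualization: a researcher in a different field filters on the shared
# term, then inspects the provenance to decide whether the data fit the new use.
def usable_for(record, required_organism):
    return record["metadata"]["organism"] == required_organism

print(usable_for(record, "Mus musculus"))  # True
```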

20
Q

Summary

A

The rhetoric of big data as introducing a completely new, theory-free way of doing science is overstated. In particular, it overlooks the pitfalls of spurious correlations.

  • That said, big data does show promise in some areas, where it introduces new ways of doing science. Big data science often involves massive computational efforts to inductively eliminate the least plausible hypotheses.
  • Big data don’t just ‘live in the cloud’; they need to get there and get out. Leonelli points us to the centrality of asking questions about how data travel from one research site to another.
21
Q
What, broadly, is meant by Big Data? (See Joeri's prelecture)
A

Big data involves very large volumes of data.
But it is not only about volume. It also concerns the velocity at which data are generated, processed and published; this happens constantly, almost without pause.
High variety is also a characteristic of big data: the data vary greatly in structure and do not come pre-sorted into categories.

22
Q

Scope

A

Small data: Takes samples of data to make inferences about the whole population

Big data: N=all, collect and use all of the data pertaining to the phenomenon of interest. The scope is comprehensive.

23
Q

Resolution (informational depth)

A

Small data: low resolution, does not allow one to make fine-grained distinctions between subcategories.

Big data: High-resolution and often very fine-grained.

24
Q

Indexicality (trait of referring to a particular time, place, or object)

A

Small data: Low to no indexicality – the data does not come with (much) metadata
Big data: High indexicality – the metadata specify the context of origin and use

25
Q

In what way can “big data”, according to Chris Anderson, change the way science is done? Does the view presented in Anderson’s article align best with the positivists’ or with Popper’s view of science?

A

In Anderson’s view, the scientific approach of hypothesizing, modeling, and testing is becoming obsolete in the face of massive amounts of data. He says we can stop looking for models, because petabytes of data allow us to say that correlation is enough.

On Anderson’s view, big data science works like this:
Observations -> finding patterns -> prediction -> confirmation

Anderson’s view is strongly anti-Popperian. Anderson and the positivists are (partly) on the same page, because both take an inductive view of science, whereas the Popperian view of science uses deduction. The inductive step would sit between the observation and pattern-finding steps. But with big data we dispense with the hypothesis/conjecture part of the method, which differs from both Popper and the logical positivists.

26
Q

What does Leonelli mean by the concept of a “data journey”, and what challenges can arise in getting biological data to “travel” between different researchers?

A

Large databases that cater to different individuals/groups, whose aims and methods vary, are circulated widely. This requires the data to be de-contextualized from the specific research context in which they were collected and re-contextualized for use in new research contexts. “Data journey” is a term for this process by which data travel from one group of scientists to another.

Leonelli’s three stages of data travel

Stage 1 - De-contextualization:
- The decoupling of data from the specific features of the local context of their production, to make them ready for sharing in large cross-disciplinary databases.
- This typically involves applying bio-ontology terms

Stage 2 - Re-contextualization:
- The adaptation of data in new research contexts, where researchers have different goals and expertise
- Requires having information about their provenance through metadata

Stage 3 – Re-use

27
Q
Why does Leonelli think that data in large biological databases are “highly selected” (p. 7)? Which factors influence the selection and sharing of particular types of data?
A

Leonelli says that which Big Data are made available through databases for future analysis is determined by social, political, economic and technical factors. Data journeys depend on, for example, data donation policies (privacy laws), the goodwill and resources of specific data producers, and the ethos and visibility of the scientific traditions and environments in which the scientists work (e.g. privately employed biologists may not be allowed to publish their work). They also depend on the availability of well-curated databases, which in turn depends on the visibility of, and value placed upon, those databases by governments and potential funders.