Week 5 - Open Data, Reproducibility and Replicability Flashcards

1
Q

What is verifiability

A

A statement is meaningful when it can be verified empirically

2
Q

What is the induction problem

A

To establish a law like “All swans are white” we must observe all swans

3
Q

What is falsifiability

A

A statement is a valid theory if it makes predictions that can be tested, and can be falsified by a counterexample

4
Q

What is a non-falsifiable theory

A

A theory that no possible observation could refute, e.g. “One day there will be a human that can breathe underwater”

5
Q

What is the desideratum of empirical work

A

In empirical work, we want to connect observations/measurements with a falsifiable hypothesis or theory

6
Q

What is Hypothesis testing

A

We establish a null hypothesis and an alternative hypothesis, so that the claim can be falsified by the data

7
Q

What is transparency

A

In an ideal world, a study is fully transparent about which hypothesis is being tested, what methodology was used, and what results were obtained.

8
Q

What is Open data

A

All data should be available so that other researchers can evaluate the study or reuse its materials

9
Q

Is open data required or not?

A

Required in some academic journals

10
Q

What could be the reasons why data is not shared?

A
  1. No time
  2. No access
  3. Privacy
  4. Proprietary data (companies don't want to share their data)
11
Q

How to combat data not being shared (no open data)

A

Enforcing open data as a journal or peer-review practice.
Open data is necessary but not sufficient to guarantee good research.

12
Q

What is replicability

A

The ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data is collected.

13
Q

What is reproducibility

A

The ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator.

14
Q

What are research artifacts

A

Any concrete object that was used in the execution of a study and that is needed to reproduce the study. Examples:
* Paper/report
* Dataset
* Model
* Software

15
Q

What are the best practices for the paper/report artifact

A

Peer review and checklists

16
Q

What are the best practices for the dataset artifact

A

Data annotation

17
Q

What are the best practices for the model and software artifacts

A

ML best practices

18
Q

What problems can we encounter when writing an ML paper

A
  • The data: the way we set up our data splits may impact performance
  • The network: many current models are DNNs, consisting of many layers with low interpretability
19
Q

How can we combat these problems? Give 2 methods

A
  • Reproducibility from within: things researchers can do to increase the quality of their research.
  • Reproducibility from outside: things reviewers should pay attention to
20
Q

What is generalization

A

Your model’s ability to adapt properly to new, previously unseen data, drawn from the same distribution as the data used to create the model

21
Q

What is overfitting

A

The model performs very well on the training data but poorly on the validation data.

22
Q

what is underfitting

A

The model performs poorly on both the training data and the validation data

23
Q

How to prevent loss hacking

A

Require authors to include loss statistics

24
Q

How to combat underspecification (not enough detail, so the study is not reproducible)

A

Provide the selected train, validation, and test sets, and ideally the code used to create the split. This ensures the report is reproducible.
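
A minimal sketch of such a reproducible split (the function name, seed, and fractions are illustrative, not from the course materials):

```python
import random

def split_dataset(items, seed=42, train_frac=0.8, val_frac=0.1):
    """Deterministically shuffle and split items into train/val/test sets.

    Reporting the seed and fractions (or sharing this code) lets other
    researchers recreate exactly the same split.
    """
    rng = random.Random(seed)      # fixed seed: same shuffle on every run
    shuffled = list(items)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Because the shuffle uses a local `random.Random(seed)`, two runs (or two researchers) with the same seed get identical splits.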

25
Q

How to combat label imbalance (a situation where the distribution of labels in the dataset is skewed)

A

Make sure that the relevant statistical properties of the intended splits are the same.
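
One such statistical property is the label distribution itself; a small check like this (the helper name is illustrative) can confirm that the splits match:

```python
from collections import Counter

def label_fractions(labels):
    """Fraction of each label in a split, used to check that the
    label distribution is (roughly) the same across splits."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

train_labels = ["spam"] * 10 + ["ham"] * 90
test_labels = ["spam"] * 1 + ["ham"] * 9
print(label_fractions(train_labels))  # {'spam': 0.1, 'ham': 0.9}
print(label_fractions(test_labels))   # {'spam': 0.1, 'ham': 0.9}
```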

26
Q

What is cherry-picking

A

Choosing, accidentally or not, particularly favourable seed values and reporting only those results

27
Q

how to combat cherry picking

A

Seed averaging: a simple solution is to average performance results over multiple runs with different seeds
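
A sketch of seed averaging; `evaluate_with_seed` is a toy stand-in for a full train-and-evaluate run, used here only to mimic run-to-run variance:

```python
import statistics

def evaluate_with_seed(seed):
    # Stand-in for a real run: actual code would train the model with
    # this seed and return its validation score.
    return 0.80 + (seed % 5) * 0.01

def seed_averaged(seeds):
    """Report mean and spread over several seeds instead of one lucky run."""
    scores = [evaluate_with_seed(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

mean, sd = seed_averaged([0, 1, 2, 3, 4])
print(round(mean, 2))  # 0.82
```

Reporting the mean together with the standard deviation makes it much harder to (accidentally) cherry-pick one favourable seed.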

28
Q

What could be the problem with a classification metric based on a contingency table

A

The table could be imbalanced: if one class dominates, a simple metric like accuracy can look good even for a trivial classifier.

29
Q

What metrics are preferred when the data in a contingency table is unbalanced?

A

Precision and Recall
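
A toy illustration of why (data and numbers invented for the example): with 5% spam, a degenerate classifier that always predicts "ham" scores high accuracy but has zero recall for spam.

```python
# Heavily imbalanced data: only 5% of messages are spam
y_true = ["spam"] * 5 + ["ham"] * 95
y_pred = ["ham"] * 100  # degenerate "always ham" classifier

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_spam = sum(t == p == "spam" for t, p in zip(y_true, y_pred)) / 5

print(accuracy)     # 0.95
print(recall_spam)  # 0.0
```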

30
Q

give accuracy function

A

Acc = # correct predictions / # items
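
The same formula as a small sketch in code:

```python
def accuracy(y_true, y_pred):
    """Acc = number of correct predictions / number of items."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy(["spam", "ham", "ham", "spam"],
               ["spam", "ham", "spam", "spam"]))  # 0.75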

31
Q

Give Precision function

use ham and spam example

A

Prec = correctly marked as spam / marked as spam
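
A sketch of both precision and recall for the spam example (function name and toy data are illustrative):

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

y_true = ["spam", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "spam", "ham"]
print(precision_recall(y_true, y_pred))  # (0.5, 0.5)
```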

32
Q

How can we mix precision and recall

A

We can define a weighted mix of precision and recall using the F-beta (Fβ) score
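
The standard Fβ formula as a sketch, with beta controlling the weighting:

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta > 1 weights recall more, beta < 1 weights precision more,
    and beta = 1 gives F1, the harmonic mean of P and R."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 1.0))          # ~0.667 (F1)
print(f_beta(0.5, 1.0, beta=2))  # ~0.833 (recall weighted more)
```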

33
Q

What are Review Checklists

A

Review checklists in research methods are systematic tools used to evaluate the quality, rigor, and completeness of a research study. They ensure that all necessary aspects of the research process are addressed, including study design, data collection, analysis, and reporting. These checklists help maintain consistency and transparency, aiding in the replication and validation of research findings

34
Q

What are the two issues why we need review checklists

A

Central issue: in practice, reviewing takes place on a pro bono basis, so little time is available for reviewing.
Reviewing vs. reproduction: there is not always time to reproduce or replicate a given study.

35
Q

What is the name of the system by which authors and reviewers are made aware of best practices

A

Review checklists

36
Q

What are the 5 questions asked in the review checklist

A
  1. General content: contributions, intro, research questions
  2. Scientific artifacts: referenced? licence?
  3. Computational experiments: environment described? detailed results?
  4. Human participants: demographics, recruitment
  5. AI assistants: use of AI
37
Q

What is an academic sin

A

Plagiarism: copying text or data from other researchers and pretending it is your own.

38
Q

What are code licenses

A

Code licenses state the terms under which code may be reused; violating them might lead to lawsuits

39
Q

What is a research proposal

A
  1. Motivation: main idea
  2. Application: thesis, for funding
    There are different types of proposals: reproduce/replicate a study, propose a new framework
40
Q

What is the structure of a research proposal. Give 6 points

A
  1. Background/context
  2. Research question
  3. Contributions
  4. Methodology
  5. Planning: timeline
  6. Resources
41
Q

How to set up a good proposal?

name 3 possible things

A
  1. Get creative
  2. Write idea first
  3. Be SMART
42
Q

Name the SMART attributes

A
  1. Specific
  2. Measurable
  3. Achievable
  4. Relevant
  5. Time-bound