Lesson 1 Flashcards
State three different widespread problems with data. For each problem, give a 1-sentence definition
Faulty: data are recorded and processed by humans, so errors are inevitable.
Incomplete: values may be missing (e.g. unanswered questions or too few participants).
Censored: some values are withheld or cut off and therefore cannot be observed.
Survey-based: the data come from survey responses rather than direct measurement.
Briefly (≤ 3 sentences) explain the term replication crisis.
Replication crisis: many published scientific results cannot be reproduced, which casts doubt on the reliability of the data used in that research.
This raises doubts about the validity and reliability of many published scientific studies
Briefly (≤ 2 sentences) state what is regulated in the Dickey Amendment
The Dickey Amendment (1996) prohibits the CDC from using federal funds to advocate or promote gun control. In practice it largely halted federally funded research on firearm injuries and deaths.
Fill in the blank boxes and the blank arrow descriptions denoted (1), (2), ... in the graph for the data value chain in the table below.
Learn the diagram.
What does the term “GIGO” stand for in data science? Briefly (≤ 2 sentences) explain it
“GIGO” stands for “Garbage In, Garbage Out” in data science. It means that if low-quality or inaccurate data is used as input for a computer program or analysis, the output or results will also be of low quality and accuracy. In other words, the quality of the output depends on the quality of the input data.
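A minimal sketch of GIGO, assuming a made-up list of body heights in cm in which one value was mistakenly entered in mm. The numbers and variable names are illustrative only.

```python
from statistics import mean

# Hypothetical heights in cm; the values are made up for illustration.
clean_heights = [172.0, 168.0, 181.0, 175.0]

# Same data, but the last value was typed in mm instead of cm (garbage in).
faulty_heights = [172.0, 168.0, 181.0, 1750.0]

clean_mean = mean(clean_heights)
faulty_mean = mean(faulty_heights)

# The analysis step (the mean) is identical in both cases;
# only the input quality differs, and the output inherits the error.
print("mean of clean data: ", clean_mean)
print("mean of faulty data:", faulty_mean)
```

A single bad input is enough to make the result useless (garbage out), even though the computation itself is correct.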
For each of the following definitions of random variables, state whether it is a
correct definition of a random variable or not. If you answer “not”, briefly (≤ 2 sentences)
state which rule for random variables has been violated.
a Updown. This variable is −1, if the return of a specific index is < 0, and +1, if the
return of the same index is > 0.
b Continent. This variable is 1, if a certain place is located in Europe, 2 for North
America, 3 for South America, 4 for Asia, 5 for Australia and 6 for Antarctica.
c MaleSon. This variable is 1, if the eldest child of a person is male and 0 if the
eldest child of a person is female.
d FirmSize. This variable is “S” if a company has less than 100 employees and a
turnover of less than 10 million CHF. It is “M”, if only one of the above conditions
is true. It is “L”, if the company has 100 employees or more and a turnover of ≥ 10
million CHF.
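a. Not correct: a return of exactly 0 is missing, so ∪ Bi ≠ Ω.
b. Not correct: Africa is missing, so not every state of nature is covered.
c. Not correct: the case where the person has no children is not covered.
d. Not correct: “S”, “M” and “L” are not real numbers.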
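The two checks behind these answers (every state of nature is covered, and every assigned value is a real number) can be sketched in code. This is only an illustration under assumptions: the sample space is modelled as a small finite list of hypothetical state labels, and `is_valid_random_variable` is a helper name invented here.

```python
from numbers import Real

def is_valid_random_variable(mapping, sample_space):
    """Return True if `mapping` assigns a real number to every state in `sample_space`."""
    for state in sample_space:
        if state not in mapping:
            return False  # a state of nature is not covered
        if not isinstance(mapping[state], Real):
            return False  # the assigned value is not a real number
    return True

# (a) Updown: the state "return is exactly 0" gets no value.
return_states = ["negative", "zero", "positive"]
updown = {"negative": -1, "positive": 1}

# (d) FirmSize: every state is covered, but the values are letters.
firm_states = ["small", "mixed", "large"]
firm_size = {"small": "S", "mixed": "M", "large": "L"}

# A repaired version of (a) that also covers a zero return.
updown_fixed = {"negative": -1, "zero": 0, "positive": 1}

print(is_valid_random_variable(updown, return_states))
print(is_valid_random_variable(firm_size, firm_states))
print(is_valid_random_variable(updown_fixed, return_states))
```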
Why is data useful?
New data set → research and business opportunities
Why is data problematic?
Recorded, processed, transferred and converted by humans
→ inevitable errors
widespread problems
replication crisis
What does it mean that data is (not) “given”?
Data is not always “given”: consider collecting/deriving it yourself.
“Given” means we cannot change it.
E.g. in Y = a + bX you can change a and b to improve the fit, but you cannot change X and Y.
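The Y = a + bX example can be sketched as follows, reading "improve" as "reduce the squared error of the fit", which is an assumption. The observations are made up, and ordinary least squares is used only as one familiar way to choose a and b.

```python
# Hypothetical observations: X and Y are the "given" data and are never modified.
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 7.0, 9.0]

def sum_squared_errors(a, b):
    """Squared distance between the observed Y and the line a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

# Ordinary least squares: pick a and b, leave X and Y untouched.
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / sum(
    (x - mean_x) ** 2 for x in X
)
a = mean_y - b * mean_x

print("a =", a, "b =", b)
print("error with an arbitrary guess (a=0, b=1):", sum_squared_errors(0.0, 1.0))
print("error with the fitted a and b:           ", sum_squared_errors(a, b))
```

Only the parameters move between the two error computations; the data lists are the same objects throughout.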
Explain briefly what “responsible data use” means.
Check data: Compare to other sources, literature, check internal consistency
Reproducibility: keep an audit trail for any data usage
Share data (verification+insight)
Copyright and privacy
What is data?
Data
= collection of realizations of random variables
= collection of measurements of a property of an entity/individual
For each of the following numeric random variables, specify the statistical
type of variable. If you think that a variable could be of two different types, briefly
(1 sentence for each case) state under which circumstances it would be of one type and under
which circumstances it would be of another type.
- Left-Handedness
- Shoe size
- Firm size
- iPhone model
- Left-Handedness: indicator
- Shoe size: categorical, ordered, unclear relation
- Firm size: discrete if measured by number of employees; continuous (ratio) if measured by profit; categorical, ordered but with unclear relation if measured in size classes
- iPhone model: categorical, unordered if we compare variants of one generation (iPhone 12, 12 Pro, 12 Pro Max, 12 Plus); categorical, ordered if we compare generations (iPhone 10, 11, 12, 13, 14, ...)
The type of random variable is necessary for many tasks involving random variables.
Name three such tasks.
Design of data structure or database
Design optimal simulation scheme (e.g. Monte Carlo)
Data visualization
Definition and interpretation of a statistical model
What is a random variable?
Function assigning a real number to every state of nature
Sample space Ω. Set of all possible (relevant) events.
States of nature. Each possible outcome = state of nature, (ωi).
Finite (i ∈ {1, 2, ..., N}) or infinite (i ∈ ℕ) size.
Partitions of Ω. A partition is a collection of subsets P = {B1, ..., Bn} of Ω.
What are the 2 “pizza slicing rules”?
1) Don’t forget a part (or B1 ∪ B2 ∪ · · · ∪ Bn = Ω)
2) Don’t count a part twice (or Bi ∩ Bj = ∅ for i ≠ j)
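Both rules can be checked mechanically for a finite sample space. The sketch below assumes Ω is the set of faces of a die, chosen only for illustration; `is_partition` is a helper name invented here.

```python
from itertools import combinations

def is_partition(blocks, omega):
    """Check both pizza slicing rules for a list of sets `blocks` over the set `omega`."""
    # Rule 1: no part is forgotten (the union of all blocks equals omega).
    covers_everything = set().union(*blocks) == omega
    # Rule 2: no part is counted twice (all blocks are pairwise disjoint).
    no_overlap = all(a.isdisjoint(b) for a, b in combinations(blocks, 2))
    return covers_everything and no_overlap

omega = {1, 2, 3, 4, 5, 6}  # hypothetical sample space: faces of a die

valid_slices = [{1, 2}, {3, 4}, {5, 6}]
forgotten_part = [{1, 2}, {3, 4}]          # 5 and 6 are missing (breaks rule 1)
counted_twice = [{1, 2, 3}, {3, 4, 5, 6}]  # 3 appears in two blocks (breaks rule 2)

print(is_partition(valid_slices, omega))
print(is_partition(forgotten_part, omega))
print(is_partition(counted_twice, omega))
```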