Coali Flashcards
What is Data Science?
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data.
A data scientist specifically focuses on econometrics rather than on predictive statistics (data analytics).
What should a data scientist know?
- Statistics / Mathematics
- Computer Science
- Field-related Knowledge
What is Data Analytics?
Data analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making
What is the difference between data science and data analytics?
A data analyst makes sense of existing data, whereas a data scientist creates new methods and tools to process data for use by analysts.
What is confirmation bias?
Were you driven or inspired by the restaurant’s rating?
Is 4.3/5 a good rating? The answer is: it depends!
- You truly wanted to go to that specific sushi restaurant -> reaction: «4.3 is a GREAT rating!»
- You already thought that the sushi restaurant suggested by your friend was not good enough -> reaction: «Why settle for 4.3 if I can get a 4.5?» or «Look at those 1-star reviews. I don’t like how many there are.»
What is a prerequisite for making good data-driven decisions?
We need to set our goals in advance. Only in this way can numbers and statistics matter for decision-making!
What shall I do if I DO NOT HAVE ANY DATA AT ALL?
(What is the general Framework?)
- Set up a framework (e.g., a theory) before looking at the data.
- Develop a set of testable/falsifiable hypotheses to test the theory
- Carefully operationalize the concepts/measures
- Collect the relevant data
- Estimate your models
- Make your decisions!
Why should we set up a framework before looking at the data?
Setting a decision rule before looking at data is a way to overcome biases.
How would you measure customer satisfaction?
We do not measure satisfaction directly; we operationalize it according to our own definition and our own elements:
- Social media sentiment? (NLP)
- Plain surveys?
- Retention rate / shop visits?
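As an illustration of operationalizing a concept, the sketch below turns the "retention rate" option into a number. It is a minimal sketch under stated assumptions: the function name `retention_rate`, the customer IDs, and the definition itself (share of last period's customers who come back this period) are hypothetical choices, not the only valid ones.

```python
def retention_rate(previous_customers, current_customers):
    """Share of the previous period's customers who also appear in the current period."""
    previous = set(previous_customers)
    if not previous:
        raise ValueError("previous period has no customers")
    returned = previous & set(current_customers)
    return len(returned) / len(previous)


# Hypothetical customer IDs for two consecutive periods
january = ["c1", "c2", "c3", "c4"]
february = ["c2", "c4", "c5"]

rate = retention_rate(january, february)
print(f"Retention rate: {rate:.0%}")
```

A different definition (for example, counting shop visits instead of returning customers) would be an equally legitimate operationalization, which is why the definition should be fixed before the data are inspected.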
What is the added value my job should provide?
- Analyze data descriptively (explore/get inspired)
- Build predictive algorithms (machine learning/AI)
- Make **causal inference** over these results (use statistics)
It is the last point that allows you to translate insights from data into real decision outcomes (strategy).
What is a survey, what types of surveys could we have and what are their differences?
Running a survey is one of the most common ways of collecting information directly from the subjects you are interested in.
1. **Questionnaire**
Standardized questions and answers
2. **Structured Interviews**
Standardized questions, free answers
3. **Unstructured Interviews**
Free questions and answers: exploratory
What are the advantages/disadvantages of the questionnaire?
**Standardized questions/answers** -> we can encode variables more easily and run statistical analyses
Tradeoff: requires a lot of thinking!
Same stimulus to the subjects (homogeneous answers) -> every respondent is replying to the same, standardized, instrument
Time and money effective for our purposes -> online tools help us in collecting many answers with almost zero cost
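The point about encoding standardized answers can be sketched as follows. This is only an illustration: the 5-point agreement scale, its labels, the `SCALE` mapping, and the sample answers are all assumptions made up for the example.

```python
from statistics import mean, median

# Assumed 5-point agreement scale; labels and codes are illustrative
SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def encode(answers):
    """Map standardized answer labels to numeric codes (case-insensitive)."""
    return [SCALE[a.strip().lower()] for a in answers]


# Made-up answers from five respondents
raw_answers = ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"]
codes = encode(raw_answers)

print("codes:", codes)
print("mean:", mean(codes), "median:", median(codes))
```

Because every respondent chooses from the same fixed set of labels, the mapping is unambiguous; free-text answers would first need manual or automated coding before any such summary is possible.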
What are the types of questions and answers we can have in a questionnaire?
Two types of questions:
* Close-ended: questions for which the answer has standardized options, usually expressed in numerical or categorical forms (e.g. numbers, yes/no, scales)
* Open-ended: questions that allow the respondent to supply her own answer
Multiple types of answers:
* Numerical
* Narrative
* Categorical (i.e. multiple choice)
* Scale
What are the main KEY PRINCIPLES to construct a survey?
Be simple: Avoid dialect, jargon, complex syntax, and negations. Tailor the content of the questions to the population you are studying.
Be short: Shorten your questions as much as possible. A longer question can be preferable if it concerns a sensitive issue or a topic requiring extensive reflection.
Number of alternatives: Do not offer too many (or too detailed) answer alternatives (e.g. NOT: How old are you? 1) 18-23; 2) 24-27; 3) 28-30; 4) 31-34 etc.)
Do not take for granted: Do not take some aspects or behaviours for granted (e.g. it is not a given that a firm does R&D or that a consumer buys cookies).
Consider the «do not know» / «not applicable» answer: Unsure respondents should not (always) be forced to answer (e.g. what are the risks in terms of data quality?).
Avoid tendentious questions: Do not «push» the respondent towards a right answer. The respondent must not perceive that a «right» or «wrong» answer exists. Formulate the question so that the socially less desirable answers are also «acceptable».
1) Principle
What is the relation between the question in the questionnaire and the attribute of the theory (i.e. how the question should address the attribute)?
Given an attribute, the QUESTION should be precisely linked to that attribute and its related measurement. Think carefully about what you need to measure and craft the questions accordingly.
2) Principle of the questionnaire. The outcome of the questions should be evaluated against a…
Given an attribute and related measurement, the answers to the QUESTION should be evaluated against a threshold. Remember to set a threshold on each empirical measurement before looking at your data. The THRESHOLD depends on your prior beliefs and on the potential data collection biases (e.g. which sample do I have? How big? Is it representative of my target?)
Threshold: an example
Question (1-7 scale): «When I go back home from work, I cannot stop thinking about what I have done and what I have to do the next day». We obtain a 1-7 score for each respondent. Ultimately we get a distribution of scores for our sample (with its mean, median, SD etc…)
My threshold(s) could be:
To increase the likelihood that 𝑿𝒔 = 𝒚𝒆𝒔 I would expect a sample average greater than 3.5
OR
To increase the likelihood that 𝑿𝒔 = 𝒚𝒆𝒔 I would expect at least 30% of the sample indicating a value greater than 5
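The two candidate thresholds above can be checked mechanically once the responses are in. The sketch below is a minimal illustration: the threshold values come from the example, while the function name `evaluate` and the ten scores are made up for demonstration.

```python
from statistics import mean

# Decision rules fixed BEFORE looking at the data (values from the example above)
MEAN_THRESHOLD = 3.5      # sample average must be greater than this
SHARE_THRESHOLD = 0.30    # at least this share of respondents...
HIGH_SCORE_CUTOFF = 5     # ...must give a score greater than this


def evaluate(scores):
    """Compare 1-7 scale responses against the pre-set thresholds."""
    sample_mean = mean(scores)
    share_high = sum(s > HIGH_SCORE_CUTOFF for s in scores) / len(scores)
    return {
        "mean": sample_mean,
        "share_high": share_high,
        "mean_rule_met": sample_mean > MEAN_THRESHOLD,
        "share_rule_met": share_high >= SHARE_THRESHOLD,
    }


# Made-up 1-7 responses from ten respondents
scores = [6, 2, 7, 4, 3, 6, 5, 1, 6, 4]
result = evaluate(scores)
print(result)
```

Keeping the thresholds as named constants at the top makes it visible that they were chosen independently of the collected scores.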
When and how should the threshold be set for questions in the questionnaire?
What if the sample size of the experiment turns out to be too small?
- Set the threshold BEFORE looking at your data
- Set it according to your beliefs and the sample you expect to collect
- Adapt the threshold if you realize your sample is too small/not representative (do that before looking at the outcome!)
How should the questions in the questionnaire be sequenced (e.g. hard, easy)?
- Start with easy and comfortable questions -> facts rather than opinions
- Most «invasive» or «complicated» questions in the middle
- End with boring or «automatic» ones -> such as demographics
What is an attention check and how can it be used?
- An attention check is nothing more than a «tricky» question that you put somewhere in your questionnaire (at the beginning, at the mid-point, or at the end) to ensure that people are paying the necessary attention
- When you analyze the data, you might want to be sure that excluding participants who failed the attention check is not biasing your sample:
  - For instance, you might want to check whether people who failed the check systematically differ from people who passed it on a number of demographic or otherwise important traits (according to your population framing and theory)
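A first, descriptive version of that comparison could look like the sketch below. Everything in it is hypothetical: the field names `age` and `passed_check`, the function `compare_groups`, and the six respondents. It only compares group means; whether a difference is systematic would still require a proper statistical test.

```python
from statistics import mean

# Hypothetical respondents: one trait (age) plus the attention-check outcome
respondents = [
    {"age": 25, "passed_check": True},
    {"age": 34, "passed_check": True},
    {"age": 41, "passed_check": True},
    {"age": 30, "passed_check": True},
    {"age": 22, "passed_check": False},
    {"age": 28, "passed_check": False},
]


def compare_groups(rows, trait):
    """Describe a numeric trait separately for those who passed and failed the check.

    Assumes both groups are non-empty; mean() raises StatisticsError otherwise.
    """
    passed = [r[trait] for r in rows if r["passed_check"]]
    failed = [r[trait] for r in rows if not r["passed_check"]]
    return {
        "passed_n": len(passed),
        "failed_n": len(failed),
        "passed_mean": mean(passed),
        "failed_mean": mean(failed),
        "difference": mean(passed) - mean(failed),
    }


summary = compare_groups(respondents, "age")
print(summary)
```

The same function can be run once per trait of interest; a large gap on any of them is a signal that dropping the failed respondents may change who the sample represents.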
What are the Pros & Cons of Online Data Collection?
PROs
- Extremely cheap
- Many responses in a relatively short timeframe
- Easy to collect data at multiple data points
- Can easily reach specific populations of individuals
- Highly customizable and complex
CONS
- Selection is not always random (poor control over the sampling procedure)
- Cannot control who replies (level of compliance, attention etc.)