Coali Flashcards

1
Q

What is Data Science?

A

Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data.

A data scientist focuses specifically on econometrics rather than on predictive statistics (data analytics).

2
Q

What should a data scientist know?

A
  • Statistics / Mathematics
  • Computer Science
  • Field-related Knowledge
3
Q

What is Data Analytics?

A

Data analysis is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making

4
Q

What is the difference between data science and data analytics?

A

a data analyst makes sense out of existing data, whereas a data scientist creates new methods and tools to process data for use by analysts

5
Q

What is confirmation bias?

A

Were you driven or inspired by the restaurant’s rating?
Is 4.3/5 a good rating? The answer is: it depends!
— You truly wanted to go to that specific sushi restaurant → reaction: «4.3 is a GREAT rating!»
— You already thought that the sushi restaurant suggested by your friend was not good enough → reaction: «Why settle for 4.3 if I can get a 4.5?» or «Look at those 1-star reviews. I don’t like how many there are.»

6
Q

What is a prerequisite for making good data-driven decisions?

A

We need to set our goals in advance. Only in this way can numbers and statistics matter for decision-making!

7
Q

What shall I do if I DO NOT HAVE ANY DATA AT ALL?

(What is the general Framework?)

A
  1. Set up a framework (e.g., a theory) before looking at the data.
  2. Develop a set of testable/falsifiable hypotheses to test the theory.
  3. Carefully operationalize the concepts/measures.
  4. Collect the relevant data.
  5. Estimate your models.
  6. Make your decisions!
8
Q

Why should we set up a framework before looking at the data?

A

Setting a decision rule before looking at data is a way to overcome biases.

9
Q

How would you measure customer satisfaction?

A

We do not measure satisfaction; rather, we operationalize it according to our own definition and our own elements:

  • Social media sentiment? (NLP)
  • Plain surveys?
  • Retention rate / shop visits?
10
Q

What is the added value my job should provide?

A

— Analyze data descriptively (explore / get inspired)
— Build predictive algorithms (machine learning / AI)
— Make **causal inference** over these results (use statistics)
It is the last point that allows you to translate insights from data to real decision outcomes (strategy).

11
Q

What is a survey, what types of surveys could we have and what are their differences?

A

Running a survey is one of the most common ways of collecting information directly from the subjects you are interested in.

  1. Questionnaire
    Standardized questions and answers

  2. **Structured Interviews**
    Standardized questions, free answers

  3. Unstructured Interviews
    Free questions and answers: exploratory
12
Q

What are the advantages/disadvantages of the questionnaire?

A

**Standardized questions/answers** -> we can encode variables more easily and run statistical analyses
Tradeoff: requires a lot of thinking!

Same stimulus to the subjects (homogeneous answers) -> every respondent is replying to the same, standardized, instrument

Time and money effective for our purposes -> online tools help us in collecting many answers with almost zero cost

13
Q

What are the type of questions and answers we can have in a questionnaire?

A

Two types of questions:
* Close-ended: questions for which the answer has standardized options, usually expressed in numerical or categorical forms (e.g. numbers, yes/no, scales)
* Open-ended: questions that allow the respondent to supply her own answer

Multiple types of answers:
* Numerical
* Narrative
* Categorical (i.e. multiple choice)
* Scale

14
Q

What are the main KEY PRINCIPLES to construct a survey?

A

Be simple: Avoid dialect, jargon, complex syntax, and negations. Tailor the content of the questions to the population you are studying.

Be short: Shorten your questions as much as possible. A longer question may be preferable if it concerns a sensitive issue or a topic requiring extensive reflection.

Number of alternatives: Do not offer too many (or too detailed) answer alternatives (e.g. NOT: How old are you? 1) 18-23; 2) 24-27; 3) 28-30; 4) 31-34, etc.).

Do not take things for granted: Do not assume certain aspects or behaviours (e.g. it is not given that a firm does R&D or that a consumer buys cookies).

Consider the «do not know» / «not applicable» answer: Unsure respondents should not (always) be forced to answer (e.g. what are the risks in terms of data quality?).

Avoid tendentious questions: Do not «push» the respondent towards a right answer. The respondent must not perceive the existence of a “right or wrong” answer. Formulate the question so that socially less desirable answers are also “acceptable”.

15
Q

1) Principle

What is the relation between the question in the questionnaire and the attribute of the theory (i.e. how the question should address the attribute)?

A

Given an attribute of the theory, the QUESTION should be precisely linked to that attribute and to a related measurement. Think carefully about what you need to measure and craft the questions accordingly.

16
Q

2) Principle of the questionnaire: the outcome of the questions should be evaluated against a…

A

Given an attribute and related measurement, the QUESTION should be evaluated against a threshold. Remember to set a threshold on each empirical measurement before looking at your data. The THRESHOLD depends on your prior beliefs and on the potential data collection biases (e.g. which sample do I have? How big? Is it representative of my target?)

Threshold: an example
Question (1-7 scale): «When I go back home from work, I cannot stop thinking about what I have done and what I have to do the next day». We obtain a 1-7 score for each respondent. Ultimately we get a distribution of scores for our sample (with its mean, median, SD, etc.)

My threshold(s) could be:
To increase the likelihood that Xs = yes, I would expect a sample average greater than 3.5

OR

To increase the likelihood that Xs = yes, I would expect at least 30% of the sample indicating a value greater than 5 (see the sketch below)
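A minimal sketch of how such pre-set thresholds could be checked once the data are in, assuming a hypothetical array of 1-7 responses (all numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical 1-7 Likert responses for the work-detachment question
scores = np.array([2, 5, 6, 4, 3, 7, 5, 4, 6, 2, 5, 3])

# Threshold rule 1 (set in advance): sample mean greater than 3.5
rule_mean = scores.mean() > 3.5

# Threshold rule 2 (set in advance): at least 30% of respondents above 5
rule_share = (scores > 5).mean() >= 0.30

print(f"mean = {scores.mean():.2f}, rule 1 satisfied: {rule_mean}")
print(f"share > 5 = {(scores > 5).mean():.0%}, rule 2 satisfied: {rule_share}")
```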

17
Q

When and how should thresholds be set for questions in the questionnaire?

What if the sample size of the experiment turns out to be too small?

A
  • Set the threshold BEFORE looking at your data
  • Set it according to your beliefs and expected sample collected
  • Adapt the threshold if you realize your sample is too small/not representative (do that before looking at the outcome!)
18
Q

What should the sequence of questions in the questionnaire be (e.g. hard vs. easy)?

A
  • Start with easy and comfortable questions -> facts rather than opinions
  • Most «invasive» or «complicated» questions in the middle
  • End with boring or «automatic» ones -> such as demographics
19
Q

What is an attention check and how can it be used?

A
  1. An attention check is nothing more than a «tricky» question that you put somewhere (either at beginning, or at the mid-point or at the end) of your questionnaire to ensure that people are paying the necessary attention
  2. When you analyze the data, you might want to be sure that excluding participants that failed the attention check is not biasing your sample:
    — For instance, you might want to check whether people that failed the check systematically differ from people that passed the test on a number of demographic or important (according to your population framing and theory) traits
20
Q

What are the Pro & Cons of Online Data Collection?

A

PROs
— Extremely cheap
— Many responses in relatively short timeframe
— Easy to collect data in multiple datapoints
— Can easily reach specific populations of individuals
— Highly customizable and complex

CONS
— Selection is not always random (poor control on sampling procedure)
— Cannot control who replies (level of compliance, attention etc…)

21
Q

What are Likert scales and how are they evaluated?

A

It is a type of rating scale (i.e. it is a set of categories) designed to infer information about a qualitative or quantitative attribute.

It assumes that the distance between each answer option is the same.

In a battery of questions, it is typically the sum of the responses to each item

In the end, you want to create a unique scale that is:
* A sum of all the items, if recorded on the same scale (otherwise, standardize) -> the most common practice.
* The average of the items (a little bit more debated in stats, given the ordinal nature of Likert items) -> still done in many applications (see the sketch below).
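A minimal sketch of the two scoring options, assuming a hypothetical respondents-by-items matrix recorded on the same 1-7 scale:

```python
import numpy as np

# Hypothetical responses: 4 respondents x 3 items, all on the same 1-7 scale
items = np.array([
    [5, 6, 4],
    [2, 3, 2],
    [7, 6, 7],
    [4, 4, 5],
])

scale_sum = items.sum(axis=1)    # most common practice: sum of the items
scale_avg = items.mean(axis=1)   # average of the items (more debated for ordinal data)

print(scale_sum)   # [15  7 20 13]
print(scale_avg)   # [5.0  2.33  6.67  4.33]
```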

22
Q

What should a well-designed Likert scale exhibit?

A

A well-designed Likert item exhibits both:

  • Symmetry: means that the item contains equal numbers of positive and negative positions, symmetric with respect to the “neutral” value.
  • Balance: means that the distance between values is the same.
23
Q

What types of moments can we use to analyze Likert scales?

A
  • Median: this is considered by many people the «safest» one, since we are dealing with ordinal data.
  • Mean: despite the debate around it, if we assume the distance between responses can be operationalized as constant, we can use it. This also implies we can use a **t-test** to compare means between groups, for instance.
24
Q

How can a t-test be performed on a Likert scale? Give an example.

A

Let’s consider a simple example involving a Likert scale and how we might analyze the responses using a t-test.

Example Scenario:
Imagine a company has conducted an employee satisfaction survey with one of the questions being:

“How satisfied are you with your current work-life balance?”
The responses are on a 5-point Likert scale:

Very Dissatisfied
Dissatisfied
Neutral
Satisfied
Very Satisfied

Now, let’s say the company wants to compare the satisfaction level between two departments: Department A and Department B.

Data Collection:
After conducting the survey, the company gathers the following data (hypothetical mean scores):

Department A: Mean satisfaction score = 3.8
Department B: Mean satisfaction score = 3.3
The company also calculates the standard deviations and the number of respondents for each department.

Application of the t-Test:
To determine if the difference in mean satisfaction scores between the two departments is statistically significant, the company can use an independent samples t-test.

Assumption:
The company assumes that the interval between the Likert scale points is equal, meaning they treat the data as interval data, not just ordinal. This allows them to calculate means and apply the t-test.

t-Test Calculation:
The t-test will compare the two means (3.8 for Department A and 3.3 for Department B) and take into account the standard deviations and sample sizes of each group to calculate the t-statistic.
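A minimal sketch of the independent-samples t-test described above, assuming hypothetical raw responses whose means match the example (Welch's variant from scipy is used, which does not assume equal variances):

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 Likert responses, treated as interval data
dept_a = np.array([4, 5, 3, 4, 4, 5, 3, 4, 3, 3])   # mean = 3.8
dept_b = np.array([3, 3, 4, 2, 4, 3, 3, 4, 3, 4])   # mean = 3.3

# Independent-samples t-test (Welch's version, not assuming equal variances)
t_stat, p_value = stats.ttest_ind(dept_a, dept_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

If the resulting p-value falls below the significance level chosen in advance, the difference in mean satisfaction between the two departments is treated as statistically significant.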

25
Q

What is the Cronbach’s Alpha?

A

Cronbach’s Alpha is a measure of internal consistency, which is often used as an estimate of the reliability of a psychometric test for a sample of examinees. It assesses how closely related a set of items are as a group. Key points:

Internal Consistency: It refers to the degree to which all items in a test measure the same construct or concept. A high internal consistency means that items that are intended to measure the same general construct yield similar scores.

Scale Reliability: Cronbach’s Alpha provides a measure of the reliability of a scale. Reliability in this context refers to the consistency of the measurement, or the degree to which the scale produces stable and consistent results.

One-Dimensional Concept: The assumption underlying Cronbach’s Alpha is that the items on the scale measure a one-dimensional construct. That is, all items are supposed to reflect only one attribute, quality, or construct. If a scale measures multiple dimensions, Cronbach’s Alpha may not be a suitable measure of reliability, and a factor analysis might be needed to understand the underlying structure.

Interpretation of the Value: The value of Cronbach’s Alpha ranges from 0 to 1. Higher values indicate greater internal consistency. Common benchmarks for interpreting the alpha value are:

α < 0.6: Poor
0.6 ≤ α < 0.7: Questionable
0.7 ≤ α < 0.8: Acceptable
0.8 ≤ α < 0.9: Good
α ≥ 0.9: Excellent
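A minimal sketch of how α can be computed from a respondents-by-items matrix, using the standard formula α = k/(k−1) · (1 − Σ item variances / variance of the summed scale); the data below are hypothetical:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 responses: 6 respondents x 4 items
data = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
])
print(f"alpha = {cronbach_alpha(data):.2f}")
```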

26
Q

What is a Semantic Differential Scale, what is its structure and how many alternatives should have?

A

A Semantic Differential Scale is a type of rating scale used to measure the connotative meaning of objects, events, and concepts. The connotative meaning refers to the subjective, emotional, and cultural associations that people might connect with a word or phrase, as opposed to its denotative meaning, which is its literal or dictionary definition.

Structure: It typically consists of a series of bipolar adjectives (e.g., happy-sad, efficient-inefficient), with respondents asked to rate the item being evaluated on a scale between two extremes.

Odd Number of Alternatives: It often uses an odd number of alternatives so that a neutral or middle point can be identified. This allows respondents who feel ambivalent or neutral about the subject to select a midpoint rather than being forced to lean towards a positive or negative response.

27
Q

What is the difference between Likert Scales and Semantic Differential and what should we choose?

A

Likert scales tend to provide a straightforward method for gauging the level of agreement with specific statements, leading to easily quantifiable data, while semantic differential scales are better suited for capturing the complexity and richness of attitudes and perceptions (with the semantic differential, responses are more subjective as they are based on the respondent’s personal associations with the adjectives).

The choice between the two scales should be based on what you want to measure and what aspects are most important to you. Likert scales might be preferred for direct questions about beliefs or behaviors, while semantic differential scales might be chosen to explore deeper, more nuanced attitudes or the emotional aspects of a concept.

28
Q

What are the types of interviews, what is their purpose?

A

Types of Interviews
Structured Interviews: These involve standardized questions with a logical sequence and formulation, allowing for limited variability in responses.

Semi-Structured Interviews: These are guided by predefined themes and topics, but the interviewer has the flexibility to adapt their strategy based on the interviewee’s real-time feedback.

Unstructured Interviews: These are more like free-flowing conversations with open-ended questions that may not be predetermined.

Deeper Understanding: Interviews are a qualitative research method used to gain a deeper understanding of a problem.

Refine Theory: They are mainly used to refine theories or identify new attributes relevant to the researcher’s area of study.

Crafting Hypotheses: Information gathered from interviews can help in crafting better hypotheses and formulating questions for subsequent quantitative surveys.

Testing Hypotheses: Structured interviews can also be used to test hypotheses.

Exploratory Insight: Exploratory interviews, in particular, are highlighted as an activity that can uncover overlooked insights.

29
Q

How should the questions in an interview be formulated, and what attitude should the interviewer have?

A

Question Design: Questions should be centered on specific topics and phrased to be as neutral as possible to avoid leading questions.

Attitude of Interviewer: The interviewer’s attitude should be neutral, especially when asking for opinions, to encourage honest responses.

Factual Reference: Questions should refer to facts rather than abstract concepts.

30
Q

What are the steps involved in the “Sampling Design”?

A

Defining the Population: Before choosing a sampling technique, you must define the population under study. This includes identifying relevant traits and characteristics of individuals who will be included.

Sampling Strategy: After defining the population, the next step is to choose the most appropriate sampling strategy. There are two broad categories of sampling strategies:
* Probability Sampling: In this approach, every unit in the population has a non-zero probability of being selected in the sample, and you can determine this probability.
* Non-probability Sampling: Here, some units in the population may have a zero probability of selection, or their probability of selection cannot be determined.

The choice between these sampling strategies will influence how representative the sample is of the population and the kinds of conclusions you can draw from your research.

31
Q

What are the probability sampling techniques?

A

Random Sampling:
Each unit in the population has the same probability of being selected.
The selection process is a simple random draw.

Systematic Sampling:
You start with a list of all units in the population and select every nth unit to be included in the sample.
It’s important that there is no pattern in the list that could be related to the outcome under study, as this could bias the results.

Stratified Random Sampling:
The population is divided into subgroups, or “strata,” based on a known characteristic or trait. Simple random sampling is then conducted within each stratum.

Cluster Sampling:
The population is divided into clusters (e.g., classrooms in a school).
In one-stage cluster sampling, you randomly select entire clusters.
In two-stage cluster sampling, you select individual units from within the randomly chosen clusters.
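A minimal sketch contrasting three of these probability sampling schemes on a hypothetical population table (pandas/numpy; the column names and proportions are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population of 1,000 customers with a known trait ("segment")
population = pd.DataFrame({
    "id": range(1000),
    "segment": rng.choice(["A", "B", "C"], size=1000, p=[0.5, 0.3, 0.2]),
})

# Simple random sampling: every unit has the same selection probability
simple = population.sample(n=100, random_state=42)

# Systematic sampling: every n-th unit of the list (assumes no pattern in the list order)
step = len(population) // 100
systematic = population.iloc[::step]

# Stratified random sampling: a simple random draw within each known stratum
stratified = population.groupby("segment").sample(frac=0.1, random_state=42)

print(len(simple), len(systematic), len(stratified))
```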

32
Q

What are the Non-Probability Sampling techniques?

A

Quota Sampling:
The population is divided into subgroups based on known traits.
A predetermined quota of units is selected from each group to match the proportions in the population.
Selection within each group is not random.

Purposive Sampling:
The researcher uses their judgment to choose members of the population for the sample.
It requires good knowledge of the targeted population.
This method is typically used with small sample sizes.

Snowball Sampling:
The sampling process begins with a small group of known individuals who meet the study criteria.
These initial subjects recruit future subjects from among their acquaintances.
This continues until enough data has been collected.

Convenience Sampling:
The researcher selects units that are easiest to access.
It is not considered a robust method due to the high potential for bias.
These non-probability sampling methods are used when it is not feasible to select a sample that accurately represents the entire population, often due to constraints such as time, budget, or the nature of the research question.

33
Q

Which are the sampling errors?

A

Population Specific Error: Occurs when there is bad framing of the population. The population needs to be properly specified to avoid this error.

Sample Frame Error: Happens when the sample is badly specified. The intention might be to sample from population P, but instead, the sample comes from population Q.

Selection Error: This type of error arises when individuals choose themselves to participate in a study. For example, an online survey might only attract those who feel strongly about the topic or who have the time and interest to complete the survey. This can lead to a sample that is not representative of the entire population.

Non-Response Error: Relates to differences between those who respond to the survey and those who do not. It’s important to control for these differences to avoid bias in the results.

34
Q

What does the Frisch-Waugh-Lovell theorem say?

A

The FWL theorem says that the following three estimators of β₁ are equivalent (a numerical check follows below):

  • the OLS estimator obtained by regressing y on x₁ and x₂
  • the OLS estimator obtained by regressing y on x̃₁, where x̃₁ is the residual from the regression of x₁ on x₂
  • the OLS estimator obtained by regressing ỹ on x̃₁, where ỹ is the residual from the regression of y on x₂
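A minimal numerical check of this equivalence on simulated data (statsmodels; the variable names and coefficients are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: y depends on x1 and x2, which are correlated with each other
n = 500
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))

# (1) Full regression of y on x1 and x2
b_full = sm.OLS(y, X).fit().params[1]

# (2) Residualize x1 on x2 (x1_tilde), then regress y on x1_tilde
x1_tilde = sm.OLS(x1, sm.add_constant(x2)).fit().resid
b_partial = sm.OLS(y, x1_tilde).fit().params[0]

# (3) Also residualize y on x2 (y_tilde), then regress y_tilde on x1_tilde
y_tilde = sm.OLS(y, sm.add_constant(x2)).fit().resid
b_double = sm.OLS(y_tilde, x1_tilde).fit().params[0]

print(b_full, b_partial, b_double)  # identical up to floating-point error
```

The three point estimates coincide; the standard errors from the shortcut regressions, however, are not identical to those of the full regression without a degrees-of-freedom adjustment.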
35
Q

What are the 6 assumptions of the OLS?

A
  • Linearity: The relationship between the dependent variable y and the independent variables X is linear.
  • No Perfect Multicollinearity: None of the independent variables is a perfect linear function of any other variables, which ensures that the matrix X has full rank (the rank is equal to the number of predictors K, which is less than or equal to the number of observations N).
  • Random Sampling: The observations are assumed to be a random sample from the population, which supports the idea that the sample represents the population well.
  • Exogeneity of Errors: The conditional expected value of the error term ε, given the independent variables X, is zero. This implies that the error term is uncorrelated with the independent variables.
  • Homoskedasticity: The variance of the error term ε is constant across all levels of the independent variables (no heteroskedasticity).
  • Normality: The distribution of 𝜖 is normal
36
Q

What are the interpretation of Level model, Level-log model, Log-level model, and Log-log model?

A
  • Level model (level-level): interpretation of coefficient as marginal effect. If x rises by one unit, y changes by 𝛽 units.
  • Level-log model: interpretation of coefficient as semi-elasticity. If x rises by 100%, y changes by 𝛽 units.
  • Log-level model: interpretation of coefficient as semi elasticity. If x rises by one unit, y changes by 𝛽%.
  • Log-log model: interpretation of coefficient as elasticity. If x rises by 100%, y changes by 𝛽%.
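For reference, a sketch of the four specifications written out in standard notation, together with the derivative each coefficient corresponds to:

```latex
\begin{aligned}
\text{Level-level:} &\quad y = \beta_0 + \beta_1 x + \varepsilon, & \partial y / \partial x &= \beta_1 \\
\text{Level-log:}   &\quad y = \beta_0 + \beta_1 \ln x + \varepsilon, & \partial y / \partial \ln x &= \beta_1 \\
\text{Log-level:}   &\quad \ln y = \beta_0 + \beta_1 x + \varepsilon, & \partial \ln y / \partial x &= \beta_1 \\
\text{Log-log:}     &\quad \ln y = \beta_0 + \beta_1 \ln x + \varepsilon, & \partial \ln y / \partial \ln x &= \beta_1
\end{aligned}
```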
37
Q

Why can’t we compute OLS with multicollinearity?

A

In principle, there is nothing wrong with including variables in your model that are correlated.

HOWEVER, if the correlation is too high, this may lead to estimation problems.
Technically, the matrix X’X that we compute for the OLS estimator is close to being non-invertible, leading to unreliable estimates with high standard errors and unexpected signs/magnitudes (see the sketch below).
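A minimal sketch of the problem on simulated data: when two regressors are nearly identical, the condition number of X'X explodes and the coefficient estimates become unstable (numpy; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost identical to x1
X = np.column_stack([np.ones(n), x1, x2])

# The closer X'X is to singular, the larger its condition number
print(f"condition number of X'X: {np.linalg.cond(X.T @ X):.2e}")

y = 1 + 2 * x1 + rng.normal(size=n)
beta = np.linalg.solve(X.T @ X, X.T @ y)   # OLS still computes, but the x1/x2 estimates are unstable
print(beta)
```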

38
Q

Explain the 2 types of specification errors:

SUPERFLUOUS VARIABLES and OMITTED VARIABLES

A

SUPERFLUOUS VARIABLES: Your model includes some variables that are “not needed”.
This issue is not a huge problem, however if you include too many variables in your model, you are basically reducing the degrees of freedom and the accuracy of your estimates.
The risk is to over-fit or over-control your regression. The keyword here is to be parsimonious.

OMITTED VARIABLES
Your model does not include variables that should instead be included.
Is this a serious issue? Yes! If you omit variables that have a significant effect on your outcome variable, you are violating assumption 4 -> the error term contains a component that is correlated with the other regressors.
Solution: it is very difficult to argue that you are controlling for all possible confounders in your model: this is why OLS results are hardly interpretable in a causal way.
To solve endogeneity problems there are several ways: either change the estimator or carefully design your empirical strategy.

39
Q

What if we have heteroskedasticity?

A

Assumption 5 about the conditional variance of the errors is not met. The OLS estimator is still consistent and unbiased if assumptions 1 to 4 are met; however, we have problems in estimating the variance → estimates of the standard errors are biased → biased tests!

Solution: use robust estimators of the standard errors (see the sketch below)
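A minimal sketch, assuming simulated data whose error variance grows with x: the point estimates are the same, but the heteroskedasticity-robust (White/HC1) standard errors from statsmodels differ from the classical ones:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

n = 300
x = rng.uniform(1, 10, size=n)
# Heteroskedastic errors: the error standard deviation grows with x
y = 1 + 0.5 * x + rng.normal(scale=0.5 * x, size=n)

X = sm.add_constant(x)

fit_classic = sm.OLS(y, X).fit()                 # classical standard errors (biased here)
fit_robust = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust standard errors

print(fit_classic.bse)
print(fit_robust.bse)
```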

40
Q

What are the adv/cons of Linear Probability Model (i.e. binary regression model and NOT logit/probit)?

A

Advantages
* Very simple to estimate and interpret (the coefficients are marginal changes in the probability of success)
* Inference is identical to the OLS case.

Disadvantages
* The errors 𝜖𝑖 are not normally distributed anymore, and they are heteroskedastic by construction.
* The predicted probabilities can fall outside the [0,1] interval: the LPM routine can give values outside this range, since the linear predictor is not bounded.
* Plus, we assume that the effect of each regressor is linear.

41
Q

What are the adv/cons of Logit and Probit with respect to the linear probability model?

A

Advantages
* We are sure that the predicted probabilities fall within the [0,1] range -> due to the fact that we model using a cumulative distribution function (F).
* We do not impose a linear structure on the marginal effects (more later)

Disadvantages
* We cannot directly interpret the coefficients we obtain as marginal effects (again, more later)
* We have to use Maximum Likelihood Estimators, requiring more computational effort (short overview of ML in a while…); a sketch comparing the LPM and the logit follows below.
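A minimal sketch comparing the two on simulated binary data (statsmodels; all names and numbers are illustrative): the LPM can produce fitted probabilities outside [0,1], while the logit, fitted by maximum likelihood, cannot, and its marginal effects must be computed separately:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 2 * x)))       # true probability of success
y = rng.binomial(1, p)

X = sm.add_constant(x)

lpm = sm.OLS(y, X).fit()                    # linear probability model
logit = sm.Logit(y, X).fit(disp=0)          # logit, estimated by maximum likelihood

# LPM fitted values can fall outside [0, 1]; logit predictions cannot
out_of_range = ((lpm.fittedvalues < 0) | (lpm.fittedvalues > 1)).sum()
print(f"LPM predictions outside [0,1]: {out_of_range}")
print(f"logit prediction range: {logit.predict(X).min():.3f} - {logit.predict(X).max():.3f}")

# Average marginal effect of x from the logit model (not the raw coefficient)
print(logit.get_margeff().summary())
```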

42
Q

What is the difference between Multinomial Models and Ordinal Models?

A

Multinomial Models:
These models are used when there are multiple discrete outcomes with no ordinal properties, meaning the outcomes cannot be logically ordered. An example would be different modes of transportation, like bus, tram, bike, walking, or car.
The choices in the model should be exhaustive and mutually exclusive, ensuring that every possible outcome is covered and that they do not overlap.
Probabilities of all possible outcomes should sum to 1.
Factors affecting the probability of each outcome can be analyzed using multinomial probit or logit models.

Ordinal Models:
These models are appropriate when the dependent variable is ordinal, meaning the categories have a logical order (e.g., Likert scale ratings), but the distance between categories is not assumed to be equal.
Modeling is similar to binary cases but with multiple thresholds based on the number of categories.
A key technical assumption is proportional odds: the relationship between each pair of adjacent outcome groupings is the same across the levels of the dependent variable. For example, the coefficients describing the split «unlikely» vs. «somehow likely» + «likely» are assumed to be the same as those describing «unlikely» + «somehow likely» vs. «likely».

43
Q

What are the (best) methods we can use for CAUSAL INFERENCE and PREDICTION?

A

Causal Inference:
1) Randomized Control Trials (the best!)
2) **Observational studies with appropriate techniques**:
* «Conditioned» regressions (i.e., retrieving causal parameters by conditioning on observables)
* Instrumental variables
* Difference-in-Differences
* Regression Discontinuity

Prediction:
1) Supervised Machine Learning (S-ML)

44
Q

Derive the Potential Outcome Framework

A

Causal Effect Interest: The framework is designed to evaluate the causal effect of a treatment or intervention (denoted as D) on an outcome variable Y.

Treatment Variable:
D is a dummy variable that indicates whether a unit has been treated (1) or not (0).

Potential Outcomes: Each unit has two potential outcomes Yi(0) and Yi(1).

Fundamental Problem: A key problem in causal inference is that we can only observe one of the two potential outcomes for a unit; the other outcome remains counterfactual. We never observe what would have happened to the treated unit had it not been treated and vice versa.

Causal Effect: The causal effect of the treatment for an individual unit is the difference between the two potential outcomes. However, since we can’t observe both outcomes for the same unit, we cannot directly calculate this for an individual.

Generalization to Multiple Units: When we have multiple units, we can consider the Average Treatment Effect (ATE), which is the expected difference in outcomes between the treated and control groups: ATE = E(Yi | Di = 1) - E(Yi | Di = 0).

Application of the ATE: This formula can be applied under certain conditions such as random assignment of treatment, which ensures that the treated and control groups are similar in all aspects except for the treatment.
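A minimal simulation of the framework (numpy; all numbers are made up): both potential outcomes are generated so the "true" ATE is known, then only one outcome per unit is observed, and the difference in means under random assignment recovers the ATE:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 10_000
# Simulated potential outcomes Y(0) and Y(1) for every unit (never both observed in reality)
y0 = rng.normal(loc=10, scale=2, size=n)
y1 = y0 + 3 + rng.normal(scale=1, size=n)    # individual effects average to 3

d = rng.binomial(1, 0.5, size=n)             # random treatment assignment
y_obs = d * y1 + (1 - d) * y0                # only one potential outcome per unit is observed

ate_true = (y1 - y0).mean()                             # feasible only inside a simulation
ate_est = y_obs[d == 1].mean() - y_obs[d == 0].mean()   # difference in means under randomization
print(f"true ATE = {ate_true:.2f}, estimated ATE = {ate_est:.2f}")
```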

45
Q

How randomization is used to address the selection problem?

A

Selection Problem: The issue arises because we observe outcomes only for treated units under treatment conditions and control units under control conditions. We cannot observe what would have happened to each unit had the opposite treatment been applied.

Randomization: Random assignment ensures that the treatment group and control group are statistically equivalent prior to the application of the treatment.

Expectation of Identical Outcomes: By randomly assigning units to treatment or control groups, the expected potential outcomes of both groups are identical to the average potential outcomes in the population.

Rationale: If the control group had been treated, they would, on average, show the same results as the currently treated group due to the random assignment.

In a Perfect Setting: With perfect randomization and full compliance (every participant adheres to their assigned group), estimating an unbiased and consistent Average Treatment Effect (ATE) can be straightforward.

Estimation Methods: In such ideal conditions, comparing means between the two groups (using t-tests, ANOVA, etc.) or employing a simple OLS regression can yield an accurate estimate of the ATE.

46
Q

What is the Conditional Independence Assumption?

A

CIA Defined: Under the CIA, treatment assignment is independent of the potential outcomes conditional on some covariates (X). This implies that once we control for X, the treatment is as good as randomly assigned even in a non-experimental setting.

CIA is crucial when:
* Randomization in an RCT fails, such as when there is self-selection or other threats to validity.
* Conducting observational studies where random assignment is not possible.

47
Q

Illustrate the Bias Variance Trade-off

A

Variance Decreases with Sample Size: As the sample size n increases, the variance of the parameter estimates decreases, improving the model’s estimation.

Effect of Number of Parameters: The variance of the model is also influenced by the number of parameters p included. Increasing p can cause the model to become more variable, potentially leading to overfitting.

Bias-Variance in Practice:
* In low-dimensional settings (with fewer parameters), aiming for unbiased parameter estimates can be advantageous for prediction.
* In high-dimensional settings (with many parameters relative to the sample size), good parameter estimation may not translate to good predictive performance due to the risk of overfitting.

48
Q

Discuss the dilemma of model selection in statistical analysis in the context of high-dimensional features

A

Model Selection Dilemma:
Including too many regressors can lead to overfitting, where the model captures the noise rather than the signal in the data.
Including too few regressors can result in omitted variable bias, where important variables are left out, leading to biased estimates.

High Dimensionality:
The problem of model selection becomes more pronounced in high-dimensional settings, where the number of potential regressors (p) is close to or larger than the number of observations (n).
* If p>n, the model cannot be identified because there are more parameters than data points.
* If p=n, the fit of the model will be perfect but meaningless, as it will simply memorize the data.
* If p<n but p is still large, the model is likely to overfit.

49
Q

Discuss Regularized Regressions and LASSO in particular

A

Goal of Regularized Regression: The objective is to find a model that balances the tradeoff between fitting the training data well (low bias) and maintaining good performance on new, unseen data (low variance). This is achieved by adding a penalty for complexity to the regression model.

Lasso Method: The Lasso (Least Absolute Shrinkage and Selection Operator) is a type of regularized regression that constrains the sum of the absolute values of the regression coefficients. This can result in some coefficients being exactly zero, which means Lasso performs feature selection by excluding some variables from the model.

Regularization Parameter:
* The higher the regularization parameter λ, the greater the penalty on the size of the coefficients.
* The parameter λ can be chosen using cross-validation, analytical solutions (in the context of certain assumptions such as heteroskedasticity), or information criteria (AIC or BIC); a minimal sketch follows below.
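A minimal sketch using scikit-learn's LassoCV on simulated data where only a few of many regressors matter (names and numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)

n, p = 200, 100
X = rng.normal(size=(n, p))
# Approximate sparsity: only the first 3 of 100 regressors actually matter
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(size=n)

# Lasso with the penalty parameter chosen by 5-fold cross-validation
lasso = LassoCV(cv=5).fit(X, y)
print(f"chosen penalty (alpha): {lasso.alpha_:.3f}")
print(f"non-zero coefficients: {(lasso.coef_ != 0).sum()} of {p}")
```

If an information criterion is preferred to cross-validation, scikit-learn's LassoLarsIC (AIC/BIC) is one alternative.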

50
Q

Discuss the Post Double Selection Lasso

A

It is the same idea as the FWL theorem, but using LASSO.

This method combines machine learning (ML) with causal inference to identify the impact of a causal regressor on an outcome.
It involves using Lasso regression to select relevant control variables from a high-dimensional set.

Step-by-Step Approach:
* Usual Procedure: First, regress the outcome on controls using OLS to obtain residuals, then regress the causal variable on controls using OLS to obtain residuals, and finally regress the first set of residuals on the second set.
* Machine Learning Approach: Similar to the usual procedure, but using machine learning algorithms to estimate the residuals (a minimal sketch follows below).
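A minimal sketch of the residual-on-residual (FWL-style) version described above, using LassoCV for both auxiliary regressions on simulated data; this omits refinements such as sample splitting or post-Lasso OLS refits, and all names and numbers are illustrative:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(6)

n, p = 500, 100
X = rng.normal(size=(n, p))                                # high-dimensional controls
d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)           # causal regressor, driven by a few controls
y = 1.5 * d + 2 * X[:, 0] - X[:, 2] + rng.normal(size=n)   # true causal effect of d is 1.5

# Step 1: Lasso of the outcome on the controls -> residuals
y_resid = y - LassoCV(cv=5).fit(X, y).predict(X)

# Step 2: Lasso of the causal variable on the controls -> residuals
d_resid = d - LassoCV(cv=5).fit(X, d).predict(X)

# Step 3: regress the outcome residuals on the treatment residuals (FWL-style)
final = LinearRegression().fit(d_resid.reshape(-1, 1), y_resid)
print(f"estimated effect of d: {final.coef_[0]:.2f}")       # roughly recovers the true effect of 1.5
```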

51
Q

What are the 2 assumptions behind Post Double Selection Lasso?

A
  1. The causal variable of interest is known, and there is no need for variable selection for this particular variable.
  2. **Lasso is our best choice if we believe in approximate sparsity** (approximate sparsity means that we assume that, from the p regressors that we can include, only a few of them matter for the prediction).