General Knowledge Flashcards
Cross-Sectional, Panel Data and Time Series
Cross-sectional data is a sample of different entities, for example firms, households, cities, states, or countries, observed at a single point in time or in a single period.
Time-series data is data for a single entity (a firm, household, city, state, or country) collected at multiple time periods.
Panel data, also called longitudinal data, are data for multiple entities in which each entity is observed at two or more periods.
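A minimal pandas sketch (the cities, years, and prices are made up) of how the three types relate: a panel contains a cross-section for each period and a time series for each entity:

```python
import pandas as pd

# Hypothetical panel: two entities (cities), each observed in two years.
panel = pd.DataFrame({
    "city":  ["Oslo", "Oslo", "Bergen", "Bergen"],
    "year":  [2020, 2021, 2020, 2021],
    "price": [5.1, 5.4, 4.2, 4.5],
})

cross_section = panel[panel["year"] == 2020]  # many entities, one period
time_series = panel[panel["city"] == "Oslo"]  # one entity, many periods
```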
Explain Standard Deviation and Variance
Both standard deviation and variance measure the “spread” of a probability distribution. Variance is measured in squared units, while the standard deviation is the square root of that number.
The standard deviation is the square root of the variance. Variance measures how far observations lie from the mean, expressed in squared units. We usually report the standard deviation since it is easier to interpret: it is in the same units as the data.
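A quick numpy check of the relationship, on an arbitrary sample:

```python
import numpy as np

y = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # arbitrary sample

variance = y.var(ddof=1)  # sample variance, in squared units
std = np.sqrt(variance)   # standard deviation, in the units of y

print(variance)  # ~4.57
print(std)       # ~2.14, easier to interpret: same units as the data
```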
What is the difference between Experimental data and observational data
Experimental data comes from an experiment designed to investigate a causal effect. Observational data is obtained by measuring actual behaviour outside of an experiment.
Sample Space and events
The sample space is the set of all possible outcomes. An event is a subset of the sample space, that is, a set of one or more outcomes. A single sample space can contain many different events.
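A small sketch using one die roll as the example:

```python
sample_space = {1, 2, 3, 4, 5, 6}  # all possible outcomes of one die roll
even = {2, 4, 6}                   # the event "roll an even number"

assert even.issubset(sample_space)  # an event is a subset of the sample space

# For a fair die, the event's probability is its share of the outcomes:
p_even = len(even) / len(sample_space)  # 0.5
```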
Probability Distribution of a random variable
The probability distribution lists all possible values for the variable and the probability that each value will occur. These probabilities sum to 1.
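For example, the distribution of the number of heads in two fair coin flips:

```python
# PMF of M = number of heads in two fair coin flips
pmf = {0: 0.25, 1: 0.50, 2: 0.25}

assert sum(pmf.values()) == 1.0  # the probabilities sum to 1
```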
Cumulative probability of a random variable
The cumulative probability distribution gives the probability that a random variable is less than or equal to a particular value. Probabilities of ranges follow by taking differences of cumulative probabilities.
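A sketch with scipy's standard normal CDF:

```python
from scipy.stats import norm

print(norm.cdf(1.96))                    # P(Z <= 1.96) ~ 0.975

# The probability of a range is a difference of cumulative probabilities:
print(norm.cdf(1.96) - norm.cdf(-1.96))  # P(-1.96 <= Z <= 1.96) ~ 0.95
```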
What is Joint probability and distribution
Joint probability is the probability of two events happening together (think Venn diagram). The joint distribution is the probability that X and Y take on certain values. Let's say that X is 1 when it is raining and 0 when it is not, and Y is 1 when it is more than 10 degrees outside and 0 otherwise. The joint distribution gives the probability of each combination of these two scenarios, four outcomes in total. Each outcome has a probability, and summed together they equal 1.
Marginal probability distribution
Just another name for a variable's own probability distribution. The term is used to distinguish the distribution of Y alone from the joint distribution of Y and another variable.
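A numpy sketch of the rain/temperature example above, with made-up probabilities; the marginals are just the row and column sums of the joint distribution, and the last line previews the conditional distribution defined in the next card:

```python
import numpy as np

# Joint distribution of X (rain: 0/1) and Y (warm: 0/1); the four cells sum to 1.
joint = np.array([[0.30, 0.40],   # X = 0: not raining
                  [0.20, 0.10]])  # X = 1: raining
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)  # marginal of X: [0.7, 0.3]
p_y = joint.sum(axis=0)  # marginal of Y: [0.5, 0.5]

# Conditional distribution of Y given X = 1 (it is raining):
p_y_given_rain = joint[1] / p_x[1]  # [2/3, 1/3]
```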
Conditional Distribution
The distribution of a random variable Y conditional on another variable X taking on a specific value.
Conditional Expectation
The conditional expectation E[Y|X = x] is the mean of Y given that X takes a specific value x, i.e. the mean of the conditional distribution of Y given X.
Law of iterated Expectations
The expectation of Y can be calculated as the weighted average of the conditional expectations of Y given X, weighted by the probability distribution of X: E[Y] = E[E[Y|X]].
Intuitively: the mean height of adults is the weighted average of the mean height for men and the mean height for women, weighted by the proportions of men and women. The mean of Y is the weighted average of the conditional expectation of Y given X.
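A worked version of the height example (the proportions and group means are made up for illustration):

```python
p_male, p_female = 0.5, 0.5            # distribution of X (made-up proportions)
mean_male, mean_female = 178.0, 166.0  # conditional expectations E[Y|X] (made up)

# Law of iterated expectations: E[Y] = sum over x of E[Y|X=x] * P(X=x)
mean_height = mean_male * p_male + mean_female * p_female
print(mean_height)  # 172.0
```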
What is the Standard Error in a regression?
- how accurately your sample data represents the whole population
- how accurately your regression fits the real population
- the typical distance between the regression line and the actual observations
Intuitively: think of a linear regression. For each point, the actual value will typically differ from the predicted value. The standard error of the regression (SER) measures the typical distance between the observed values and the regression line.
Formula is:
SER = sqrt( SSR / (n - 2) )
n - 2 because it corrects a bias from the two regression coefficients that were estimated (B0 and B1)
Note:
- SER measures the spread of the observations around the regression line
- the standard deviation measures the spread of the observations around their mean
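A minimal sketch: simulate some data, fit a line, and compute the SER from the formula above (the simulated noise std of 1.5 is roughly what the SER should recover):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.5, size=100)  # true noise std: 1.5

b1, b0 = np.polyfit(x, y, 1)  # estimated slope (B1) and intercept (B0)
residuals = y - (b0 + b1 * x)
ssr = np.sum(residuals ** 2)  # sum of squared residuals

n = len(y)
ser = np.sqrt(ssr / (n - 2))  # n - 2: two coefficients were estimated
print(ser)                    # roughly 1.5
```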
Kurtosis
Kurtosis measures how much mass the distribution has in its tails, and is therefore a measure of how much of the variance of Y arises from extreme values. An extreme value is called an outlier. The greater the kurtosis of a distribution, the more likely it is to have outliers.
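A quick check with scipy; the t distribution with 5 degrees of freedom is an arbitrary choice here, used only because it has heavy tails (fisher=False reports raw kurtosis, so a normal gives about 3):

```python
from scipy.stats import kurtosis, norm, t

z = norm.rvs(size=100_000, random_state=0)
heavy = t.rvs(df=5, size=100_000, random_state=0)

print(kurtosis(z, fisher=False))      # ~3: the normal benchmark
print(kurtosis(heavy, fisher=False))  # well above 3: more mass in the tails
```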
Skewness
Skewness measures how asymmetric a distribution is, that is, how far it departs from the symmetry of a normal distribution. A normal distribution has a skewness of zero, with equal weight in each tail.
If you are measuring height, you might get a mean of 172 with the tails being equally weighted.
If you are measuring income for people working full-time, few people will have an income under 300K. From 300K to 600K there will probably be a steep increase, and from 600K onwards there will be fewer and fewer people, so the curve becomes less and less steep. This gives a “long tail” on the right side, which is called positive skew, so we say the distribution is positively skewed.
If we have an easy exam and a lot of people get A's or B's, we will have a negative skew. The long tail will be on the left side, increasing slowly until it hits C or B, and from there it goes steeply up.
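A sketch using scipy's skewnorm simply as a convenient way to generate skewed samples (the distribution and its parameter are arbitrary choices):

```python
from scipy.stats import skew, skewnorm

# Right-skewed samples: long tail on the right, like the income example.
right = skewnorm.rvs(a=5, size=100_000, random_state=0)
print(skew(right))   # positive skew

# Mirroring gives a long left tail, like the easy-exam example.
print(skew(-right))  # negative skew
```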
I.I.D
Independent and Identically distributed
Independent: the result of one event has no impact on the other event. If you roll two dice, the result you got on the first die does not affect the result you will get on the second.
Identically distributed: if you flip a coin (heads/tails), each flip gives you a 50/50 chance. The distribution does not change from one draw to the next.
Chi-Squared
DISTRIBUTION:
The distribution is asymmetrical and positively skewed; a chi-squared variable with k degrees of freedom has mean k and variance 2k. It is used for tests on categorical variables, which are variables that fall into one of a fixed set of categories (male vs. female, etc.).
Chi-squared tests can be used when we:
1) Need to estimate how closely an observed distribution matches an expected one
2) Need to estimate if two random variables are independent.
GOODNESS OF FIT:
When you have one categorical variable and you want to compare an observed frequency distribution to a theoretical one. For example, do age and car accidents have a relation?
H0: no relation between age and car accidents
HA: there is a relation between age and car accidents
A chi-squared value greater than our critical value implies that there is a relation between age and car accidents, hence we reject the null hypothesis. It means that there most likely is a relation, but it does not tell us how large that relation is.
The goodness-of-fit test measures how well a sample of data matches the known characteristics of the larger population that the sample is trying to represent. For example, if you flip a coin 100 times, you would expect roughly 50/50 heads/tails. The x^2 statistic tells us how far the actual results are from the theoretical 50/50 model: the further from 50/50, the worse the goodness of fit (and the more likely we are to conclude that this is not a representative coin).
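The coin example as code, using scipy.stats.chisquare (the 58/42 split is made up):

```python
from scipy.stats import chisquare

observed = [58, 42]  # heads/tails in 100 flips (hypothetical counts)
chi2, p = chisquare(f_obs=observed, f_exp=[50, 50])
print(chi2, p)  # chi2 = 2.56, p ~ 0.11: not enough evidence against a fair coin
```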
TEST FOR INDEPENDENCE:
Used with categorical data on two variables when you want to see if there is an association between them.
Does gender have any significance for driving test outcomes? Is there a relation between student gender and course choice? Researchers collect data and compare the frequencies at which male and female students select the different classes. The x^2 test for independence tells us how likely it is that random chance can explain the observed difference.
A p-value smaller than 0.05 (equivalently, a chi-squared value bigger than the critical value) suggests there is some relation between gender and the outcome: reject H0.
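A sketch of the test for independence with scipy.stats.chi2_contingency, on a made-up 2x2 table of counts (gender by driving test outcome):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: gender; columns: pass/fail on the driving test (hypothetical counts).
table = np.array([[45, 15],
                  [35, 25]])

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # if p < 0.05, reject H0 that the variables are independent
```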
Normal Distribution
- bell shape
- both sides equally weighted
- mean of 0 and std of 1 (for the standard normal)
- kurtosis of 3
- no skewness
- symmetrical
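A quick simulation check of these properties for the standard normal:

```python
from scipy.stats import kurtosis, norm, skew

z = norm.rvs(size=100_000, random_state=0)

print(z.mean(), z.std())          # ~0 and ~1
print(skew(z))                    # ~0: symmetric, no skewness
print(kurtosis(z, fisher=False))  # ~3
```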
Student t
Similar to the normal distribution, which has […..]. The difference is that the Student t has heavier tails, or in other words a greater kurtosis. This leads to more of the variance of Y arising from outliers.
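A small tail comparison (5 degrees of freedom is an arbitrary choice):

```python
from scipy.stats import norm, t

# P(|X| > 3) under the standard normal vs. a Student t with 5 df:
print(2 * norm.sf(3))     # ~0.0027
print(2 * t.sf(3, df=5))  # ~0.030: far more mass in the tails
```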
F Distribution + Statistics
F-distribution
- Divide one chi-squared variable by its degrees of freedom, then divide by another independent chi-squared variable divided by its degrees of freedom = F-distribution
- Used especially in analysis of variance (ANOVA)
- A function of the ratio of two independent variables, each of which has a chi-squared distribution (each divided by its degrees of freedom)
F-Statistics:
- Group A, B and C put on 10 mg, 5 mg and placebo.
- Mean Square Between (MSB) = the variance between the group means
- Mean Square Error (MSE) = the average variance within the groups
F-stat = MSB/MSE
- A large F-stat might indicate that the population means are not equal
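A sketch of the drug example with scipy.stats.f_oneway, which computes exactly this MSB/MSE ratio (the group means, spread, and sizes are made up):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
a = rng.normal(12.0, 2.0, size=30)  # 10 mg group (hypothetical outcomes)
b = rng.normal(10.5, 2.0, size=30)  # 5 mg group
c = rng.normal(10.0, 2.0, size=30)  # placebo group

f_stat, p = f_oneway(a, b, c)  # one-way ANOVA: F = MSB / MSE
print(f_stat, p)  # a large F (small p) suggests the means are not all equal
```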
The central limit theorem
- The individual observations can come from (almost) any distribution
- But as the sample size increases, the distribution of the sample mean approaches a normal distribution
- The more observations we add, the closer the sample mean gets to the true population mean
- Higher n gives a better normal approximation (normality) and an estimator closer to the truth (consistency)
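A simulation sketch: sample means of a strongly skewed (exponential) variable look more and more normal as n grows; here their skewness shrinks toward the normal's value of 0:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

for n in (2, 30, 500):
    # 10,000 samples of size n; take the mean of each sample.
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    print(n, skew(means))  # shrinks roughly like 2/sqrt(n) toward 0
```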