Applied Quantitative Methods Flashcards
Variable
A characteristic or measurement that can be determined for each
member of the population. Usually denoted by a capital letter
such as X or Y.
Numerical variable
Takes on numerical values.
Continuous variable
A numerical variable that we measure rather than count, e.g. distance, height, GDP in kr., value of cars sold in kr.
Categorical variable
Also known as qualitative data: observations are sorted into categories (smoker vs. non-smoker, vote yes/no). Any numbers in this type of data are purely for identification purposes (Ronaldo no. 7, Christian Eriksen no. 10).
Population
Collection of persons, things or objects under study.
Sampling
Selecting a subset (or portion) of the population in order to gain information about the population.
Sample
Resulting data from sampling a population.
Statistic
Number that represents a property of the sample (e.g., sample mean, sample variance, etc.)
Parameter
Numerical characteristic of the whole population (e.g.
population mean, population variance, etc.)
Simple Random Sample
Chosen by a process that selects a sample of n objects from a population of size N in such a way that each member of the
population has the same probability of being selected.
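A minimal sketch of drawing a simple random sample with Python's standard library. The population of 100 member IDs and the sample size of 10 are made up for illustration.

```python
import random

# Hypothetical population of N = 100 member IDs (illustrative only).
population = list(range(1, 101))
n = 10

rng = random.Random(42)  # fixed seed so the draw is reproducible

# random.sample draws n distinct members without replacement;
# every member of the population is equally likely to be selected.
sample = rng.sample(population, n)
print(sample)
```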
Sampling Distributions
The population parameter (e.g., mean µ or variance σ²) is a fixed (but
unknown) number.
Each sample from a population, however, has a different value of the mean and
variance. If you draw many samples and calculate the mean (and variance) of each sample, then the sample means (and variances) themselves become a variable, which
can be treated as a random variable with a probability distribution, called the sampling distribution.
Law of large numbers
States that, given a random sample of size n from a population,
the sample mean X̄ will approach the population mean µ as the sample size n
becomes large.
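The idea can be illustrated with a small simulation. This is only a sketch: the population is assumed to be rolls of a fair six-sided die (population mean 3.5), and the helper name sample_mean is made up for this example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility

# Assumed population: a fair six-sided die, whose population mean is 3.5.
population_mean = 3.5

def sample_mean(n):
    """Mean of n independent die rolls."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n))

# The sample mean should drift toward 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 3))
```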
Central Limit Theorem
States that the mean of a random sample drawn from a population with any probability distribution will be approximately normally distributed, given a large enough sample size.
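A rough simulation sketch of the theorem, assuming a strongly skewed population (exponential with mean 1) and a sample size of 50. Both choices are illustrative, not part of the definition.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

def mean_of_exponential_sample(n):
    """Mean of n draws from an exponential distribution with mean 1 (skewed)."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

n = 50
sample_means = [mean_of_exponential_sample(n) for _ in range(5_000)]

# If the sample means are roughly normal, they should centre on 1,
# have a standard deviation near 1/sqrt(n), and about 68% of them
# should lie within one standard deviation of the centre.
center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / len(sample_means)
print(round(center, 3), round(spread, 3), round(within_one_sd, 3))
```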
Acceptance Interval
An interval within which the sample mean has a high probability of occurring, given that we know the population mean and variance. If the sample mean falls within that interval, then we can accept
the conclusion that the random sample came from the population with the known mean and variance.
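One common way to build such an interval is µ ± z·σ/√n, which relies on the sample mean being approximately normal. The sketch below assumes that form; the function name and the numbers (µ = 100, σ = 15, n = 36) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def acceptance_interval(mu, sigma, n, level=0.95):
    """Interval that contains the sample mean with probability `level`,
    assuming the sample mean is approximately normal."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    half_width = z * sigma / sqrt(n)
    return mu - half_width, mu + half_width

low, high = acceptance_interval(mu=100.0, sigma=15.0, n=36)
print(round(low, 2), round(high, 2))
```

A sample mean outside (low, high) would be unusual for a sample of this size from the stated population.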
Distribution of sample proportion
Assume we are dealing with a qualitative (categorical) variable.
For example, we investigate a characteristic (e.g. smoker/non-smoker) and record 1 if an individual has this characteristic and 0 otherwise. The (unknown) proportion of ones in the population is denoted P. Our sample then consists of 0 and 1 values, and the sample proportion p̂ is the share of ones in the sample.
Chi-Square Distribution
If we can assume that the underlying population distribution is
normal, then the sample variance s² and the population variance σ²
are related through the chi-square distribution: (n − 1)s²/σ² follows
a chi-square distribution with n − 1 degrees of freedom.
Student’s t Distribution
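When the population standard deviation σ is unknown, it is replaced by the sample standard deviation (s):
t = (X̄ − µ) / (s / √n)
This random variable follows a member of a family of distributions called Student's t distributions (with n − 1 degrees of freedom, assuming a normal population).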
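A short sketch of computing the t statistic from raw data. The five observations and the hypothesised mean of 5.0 are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / sqrt(n))

sample = [5.1, 4.9, 5.3, 5.0, 5.2]  # hypothetical measurements
t = t_statistic(sample, mu=5.0)
print(round(t, 4))
```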
Sample Size for Population Proportion
Whatever the outcome, p̂(1 − p̂) cannot be bigger than 0.25 (i.e., when the
sample proportion is 0.5).
Thus, the largest possible value for the margin of error, ME, is
ME = 0.5 zα/2 / √n
and solving for n gives the required sample size:
n = 0.25 (zα/2)² / (ME)²
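The sample-size formula can be sketched in a few lines. The function name is hypothetical, and rounding up to the next whole observation is a convention assumed here.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(margin_of_error, confidence=0.95):
    """Worst-case sample size: n = 0.25 * z^2 / ME^2, rounded up."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # z_{alpha/2}
    return ceil(0.25 * z ** 2 / margin_of_error ** 2)

n = sample_size_for_proportion(0.03)  # margin of error of 3 percentage points
print(n)
```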
Null hypothesis and alternative hypothesis
We start with a hypothesis about the parameter, called the null hypothesis,
that holds unless there is strong evidence against it.
If we reject the null hypothesis, then the second hypothesis, named the
alternative hypothesis, is accepted.
P-value
Computing a p-value is the most popular procedure for testing a null hypothesis in statistics.
The p-value is the probability of obtaining a value of the test statistic as extreme
as or more extreme than the value actually observed, assuming the null hypothesis is true.
Equivalently, the p-value is the smallest significance level at which the null hypothesis can be rejected, given the observed sample statistic.
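As an illustration, the sketch below computes a p-value for a test statistic that is assumed to be standard normal under the null hypothesis (a z test). The function name and the "alternative" labels are choices made for this example.

```python
from statistics import NormalDist

def p_value_from_z(z, alternative="two-sided"):
    """p-value for an observed z statistic under a standard normal null."""
    cdf = NormalDist().cdf
    if alternative == "greater":   # H1: parameter is larger
        return 1 - cdf(z)
    if alternative == "less":      # H1: parameter is smaller
        return cdf(z)
    return 2 * (1 - cdf(abs(z)))   # H1: parameter differs in either direction

p = p_value_from_z(1.96)
print(round(p, 4))
```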
Significance level
In practice we need to decide at what p-value we are going to
reject H0.
This decision is made by fixing a so-called α-level in advance, known
as the significance level of the test.
We reject H0 if the p-value is less than or equal to α.
We typically use 5% or 1% significance levels.
Tests of the difference between two population proportions
We consider the situation where we have two qualitative samples and
investigate whether a given property is present or not:
The proportion of population 1 that has the property is Px, which is estimated by p̂x
based on a sample of size nx.
The proportion of population 2 that has the property is Py, which is estimated by p̂y
based on a sample of size ny.
We are interested in the difference Py − Px, which is estimated by d = p̂y − p̂x.
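A sketch of one standard large-sample test of H0: Px = Py, using a pooled proportion for the standard error. The pooled z test is an assumption here (the card itself only defines the estimate d), and the counts are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_x, n_x, successes_y, n_y):
    """Estimate d = p_hat_y - p_hat_x and test H0: Px = Py (two-sided, pooled z test)."""
    p_x = successes_x / n_x
    p_y = successes_y / n_y
    d = p_y - p_x
    # Under H0 both samples share one proportion, estimated by pooling.
    pooled = (successes_x + successes_y) / (n_x + n_y)
    se = sqrt(pooled * (1 - pooled) * (1 / n_x + 1 / n_y))
    z = d / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return d, z, p_value

# Hypothetical data: 40 of 200 in sample x and 60 of 200 in sample y have the property.
d, z, p_value = two_proportion_z(40, 200, 60, 200)
print(round(d, 3), round(z, 3), round(p_value, 4))
```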
Regression
Regressions are typically used to test whether two or more variables are
statistically related.
In basic statistics, regression is used to explore the relationship between two variables.
Cross-sectional data
We can use numerical variables and also qualitative (or
categorical) variables in regression models
A regression model
Studies the relationship between two or more variables.
Bivariate model
Studies the relationship between only two variables, e.g., x and y.
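A minimal sketch of fitting a bivariate model y = intercept + slope·x by ordinary least squares, written out by hand with the standard library. Least squares is assumed as the estimation method, and the data are invented.

```python
from statistics import fmean

def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

x = [1, 2, 3, 4, 5]               # hypothetical explanatory variable
y = [2.1, 3.9, 6.2, 7.8, 10.1]    # hypothetical response
intercept, slope = fit_line(x, y)
print(round(intercept, 3), round(slope, 3))
```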
Multiple regression model
Studies the relationship between more than two variables.
Quadratic functions
Often used in applied economics to capture
decreasing or increasing marginal effects.
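For a quadratic specification y = b0 + b1·x + b2·x², the marginal effect of x is b1 + 2·b2·x, so it changes with the level of x. The sketch below uses hypothetical coefficients with b2 < 0, i.e. a decreasing marginal effect.

```python
def marginal_effect(b1, b2, x):
    """Derivative of b0 + b1*x + b2*x**2 with respect to x."""
    return b1 + 2 * b2 * x

b1, b2 = 0.30, -0.005  # hypothetical estimates, not taken from any real model

# The effect of one more unit of x shrinks as x grows ...
for x in (0, 10, 20):
    print(x, round(marginal_effect(b1, b2, x), 3))

# ... and changes sign at the turning point x* = -b1 / (2 * b2).
turning_point = -b1 / (2 * b2)
print(round(turning_point, 1))
```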
Maximum Likelihood Estimation
The basic idea of Maximum Likelihood Estimation is:
The data we see come from some model.
We know the structure of the model, but not the parameters.
The ML principle:
▶ From all the possible values that the parameters can take, choose the
values that make the observed data most likely (probable).
▶ These are the Maximum Likelihood Estimates (MLE) of the parameters.
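The ML principle can be sketched for the simplest case: 0/1 data assumed to come from a Bernoulli model with unknown success probability p. The data and the grid search are illustrative choices; in practice a numerical optimiser or a closed-form solution would normally be used.

```python
from math import log

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical 0/1 observations (7 ones out of 10)

def log_likelihood(p, data):
    """Log of the probability of observing `data` if each value is 1 with probability p."""
    return sum(log(p) if x == 1 else log(1 - p) for x in data)

# Try many candidate values of p and keep the one that makes the data most likely.
candidates = [i / 1000 for i in range(1, 1000)]
p_hat = max(candidates, key=lambda p: log_likelihood(p, data))
print(p_hat)
```

For this model the maximiser coincides with the sample proportion of ones.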
Print in Python
The print() function is used to output information to the console or terminal.
Head in Python
The head() method is used to view the first few rows of a DataFrame or Series.
Describe in Python
The describe() method provides an overview of the dataset’s numeric and/or categorical features.
Helps detect outliers and understand the spread of the data.
Acts as a quick summary for exploratory data analysis (EDA).
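A small sketch combining print(), head() and describe() in pandas. The DataFrame and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical dataset with one numeric and one categorical column.
df = pd.DataFrame({
    "height_cm": [172, 181, 165, 190, 178, 169],
    "smoker": ["no", "yes", "no", "no", "yes", "no"],
})

first_rows = df.head(3)   # first 3 rows (head() defaults to 5)
summary = df.describe()   # count, mean, std, min, quartiles, max for numeric columns

print(first_rows)
print(summary)
# include="all" also summarises categorical columns (unique, top, freq).
print(df.describe(include="all"))
```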
Legend in Python
Is used to provide labels for the elements in a plot.
Show in Python
plt.show() displays all open figures so that the visualization actually appears.
In some environments, such as scripts or command-line interfaces, plt.show() is necessary to display the plot.
Value_counts in Python
Counts the occurrences of unique values in a Series or DataFrame column. It is a powerful tool for exploratory data analysis (EDA) when working with categorical or numerical data.
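A short sketch of value_counts() on a categorical Series. The vote data are made up.

```python
import pandas as pd

votes = pd.Series(["yes", "no", "yes", "yes", "no", "yes"], name="vote")

counts = votes.value_counts()                 # absolute frequencies, most common first
shares = votes.value_counts(normalize=True)   # relative frequencies instead of counts

print(counts)
print(shares)
```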
Plt.tight_layout in Python
Automatically adjusts subplot parameters (e.g., spacing, padding).
Prevents overlapping of axes titles, labels, or legends.
Optimizes the use of available space in the figure.
Plt.figure in Python
Creates a blank figure to hold subplots or plots.
Lets you customize the size, resolution, and properties of the figure.
Allows for the creation of multiple figures in the same script.
Grid in Python
A grid refers to the lines that divide the plot into sections, helping to make data easier to interpret.
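The plotting cards above (plt.figure, legend, grid, plt.tight_layout, plt.show) can be combined in one small sketch. The data, labels, and the helper name build_figure are invented for illustration.

```python
import matplotlib.pyplot as plt

def build_figure():
    """Create a figure with two labelled lines, a legend and a grid."""
    fig = plt.figure(figsize=(6, 4), dpi=100)  # blank figure: 6x4 inches at 100 dpi
    ax = fig.add_subplot(1, 1, 1)

    x = [0, 1, 2, 3, 4]
    ax.plot(x, [v ** 2 for v in x], label="x squared")
    ax.plot(x, [2 * v for v in x], label="2x")
    ax.set_xlabel("x")
    ax.set_ylabel("value")

    ax.legend()          # uses the label= text given to each plot call
    ax.grid(True)        # draw grid lines to make values easier to read
    plt.tight_layout()   # adjust spacing so labels and titles do not overlap
    return fig, ax

fig, ax = build_figure()

if __name__ == "__main__":
    # Needed in scripts and terminals to actually display the window.
    plt.show()
```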