Nonparametric and semiparametric estimation Flashcards
Basic. What is a non-parametric model?
In a non-parametric model, we assume as little as possible. E.g.,
$$
y =m(x)+\epsilon
$$
Where $m(.)$ is an unspecified function of $x$. This is thus a non-parametric regression model (of the conditional mean of $y$ given $x$).
We need a lot of data to use non-parametric estimation.
A histogram is actually a nonparametric estimator of the density of our variable of interest. If we would like something smoother, we can use a kernel density estimator.
How does estimation of a non-parametric model work?
When estimating a non-parametric regression model, a locally weighted regression line is fitted at each point $x$, using a centered subset that includes the closest $h \times N$ observations, where $h$ is the bandwidth and $N$ is the sample size. The weights decline as we move away from $x$.
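A minimal sketch of this idea (my own illustration, not from the notes): it assumes tricube weights and a locally fitted weighted least-squares line, and all names are hypothetical.

```python
import numpy as np

def local_linear_fit(x, y, x0, h=0.3):
    """Fit a weighted regression line around x0 using the nearest h*N observations."""
    n = len(x)
    k = max(int(np.ceil(h * n)), 2)           # span: number of local observations
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                # the k observations closest to x0
    d = dist[idx] / dist[idx].max()           # scaled distances in [0, 1]
    w = (1 - d**3) ** 3                       # tricube weights: decline away from x0
    X = np.column_stack([np.ones(k), x[idx] - x0])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])  # weighted least squares
    return beta[0]                            # intercept = fitted value m_hat(x0)

# Example: recover a smooth curve from noisy data
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
grid = np.linspace(0.05, 0.95, 10)
m_hat = [local_linear_fit(x, y, x0) for x0 in grid]
```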
What is a kernel?
The kernel estimate is a weighted average of observations within the bandwidth at the current point of evaluation. Data closest to the current point of evaluation are given more weight, as specified by a function called the kernel.
This works like a moving average over our data set (for the density).
How do we think about bandwidth in the context of kernels?
Bandwidth = h
The bandwidth decides how much data we use in our moving average. Using more data will create a smoother estimate.
Choosing a very small bandwidth leads to a jagged density estimate, while a very large bandwidth over-smooths the data.
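A small illustration of the bandwidth effect (my own sketch, using scipy's `gaussian_kde`, where a scalar `bw_method` scales the smoothing bandwidth):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = rng.normal(0, 1, 500)          # sample from a standard normal
grid = np.linspace(-4, 4, 200)

f_small = gaussian_kde(data, bw_method=0.05)(grid)  # tiny bandwidth: jagged estimate
f_large = gaussian_kde(data, bw_method=1.0)(grid)   # huge bandwidth: over-smoothed
f_auto  = gaussian_kde(data)(grid)                  # default (Scott's rule) bandwidth
```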
What is the point of estimating a density function? And what estimators can we use?
The point of these estimators is to estimate the density $f(x_0)$ of $x$ evaluated at some point $x_0$. For this we can use:
- The histogram estimator (like a uniform kernel)
- The Kernel density estimator
Formulate the histogram estimator and describe its parts
$$
\hat f_{\mathrm{HIST}}(x_0) = \frac{1}{2hN} \sum_{i=1}^{N} \mathbf{1}\left(x_0 - h < x_i < x_0 + h\right)
$$
Where $h$ is the bandwidth and $N$ is the sample size. This estimator gives all observations in $x_0 \pm h$ equal weight. This leads to a density estimate that is a step function, even if the underlying density is continuous.
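A direct implementation of this estimator (my own sketch; names are illustrative):

```python
import numpy as np

def hist_estimator(x, x0, h):
    """Histogram (uniform-kernel) density estimate at x0: equal weight to all x_i in x0 +/- h."""
    N = len(x)
    return np.sum(np.abs(x - x0) <= h) / (2 * h * N)

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 1000)
print(hist_estimator(x, x0=0.0, h=0.2))   # should be near the N(0,1) density at 0 (~0.399)
```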
Formulate the general kernel estimator.
$$
\hat f(x_0) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right)
$$
Where $K(·)$ is the kernel function, $h$ the bandwidth, and $N$ the sample size.
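A corresponding sketch of this estimator (my own illustration, assuming a Gaussian kernel):

```python
import numpy as np

def kernel_density(x, x0, h, kernel=lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)):
    """Kernel density estimate at x0: (1 / Nh) * sum_i K((x_i - x0) / h)."""
    return np.mean(kernel((x - x0) / h)) / h

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 1000)
print(kernel_density(x, x0=0.0, h=0.3))   # smooth estimate of the density at 0
```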
What needs to be true for a kernel function?
The kernel function $K(·)$ must be continuous, symmetric around zero, and integrate to unity.
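These properties can be checked numerically; a small sketch (my own example) using the Epanechnikov kernel:

```python
import numpy as np

def epanechnikov(z):
    """Epanechnikov kernel: continuous, symmetric around zero, integrates to one."""
    return 0.75 * (1 - z**2) * (np.abs(z) <= 1)

z = np.linspace(-2, 2, 100001)
dz = z[1] - z[0]
print(epanechnikov(z).sum() * dz)                        # ~1.0: integrates to unity
print(np.allclose(epanechnikov(z), epanechnikov(-z)))    # True: symmetric around zero
```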
What is most important in non-parametric estimation: the kernel or the bandwidth (BW)?
In practice the choice of kernel is not a huge deal; the choice of BW is more important.
What can be said about the mean obtained from the kernel density estimator?
The kernel density estimator is biased, with a bias term $b(x_0)$ that depends on the bandwidth, the curvature of the true density, and the kernel used. The bias disappears asymptotically if $h \to 0$ as $N \to \infty$.
What is the kernel density bias?
To leading order in $h$,
$$
b(x_0) = \frac{h^2}{2} f''(x_0) \int z^2 K(z)\, dz
$$
so the bias grows with the bandwidth $h$ and with the curvature $f''(x_0)$ of the true density, and depends on the kernel through $\int z^2 K(z)\,dz$.
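A quick Monte Carlo illustration (my own sketch, assuming a Gaussian kernel and a standard normal true density) showing that the bias at $x_0 = 0$ shrinks as $h$ shrinks:

```python
import numpy as np
from scipy.stats import norm

x0, N, reps = 0.0, 500, 2000
rng = np.random.default_rng(2)
for h in (1.0, 0.5, 0.1):
    est = np.empty(reps)
    for r in range(reps):
        x = rng.normal(0, 1, N)
        est[r] = np.mean(norm.pdf((x - x0) / h)) / h   # Gaussian-kernel density estimate at x0
    print(h, est.mean() - norm.pdf(x0))                # estimated bias: shrinks with h
```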
What can be said about the variance obtained from the kernel density estimator?
The variance disappears if $Nh \to \infty$: $h$ may go to $0$, but slowly enough that $Nh$ still diverges as $N \to \infty$.
What is the bias-variance trade-off when using kernel density?
The choice of bandwidth $h$ is much more important than the choice of kernel function $K(·)$.
There is a tension between setting $h$ small to reduce bias and setting $h$ large to ensure smoothness. A natural metric to use is some form of the mean-squared error (MSE).
We therefore want a way to choose the bandwidth optimally. This is done by minimizing some function of the integrated squared error (ISE), e.g., $E[\mathrm{ISE}(h)]$, the mean integrated squared error (MISE).
The optimal BW ($h$) goes to zero as $N \to \infty$.
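One concrete example of such a data-driven choice (not necessarily the rule used in the notes) is Silverman's rule of thumb, which minimizes an approximation to the MISE for a Gaussian kernel under a normal reference density, and indeed shrinks as $N$ grows:

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb: h = 0.9 * min(std, IQR / 1.349) * N**(-1/5)."""
    N = len(x)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(x.std(ddof=1), iqr / 1.349) * N ** (-1 / 5)

rng = np.random.default_rng(5)
for N in (100, 1000, 10000):
    print(N, silverman_bandwidth(rng.normal(0, 1, N)))   # optimal h shrinks as N grows
```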
What is non-parametric regression?
Another interesting application of nonparametric methods is the estimation of a regression function:
$$
y_i = m(x_i) + \epsilon_i
$$
Since the functional form of $m(x_i)$ is unspecified, we cannot use OLS. Instead we use a locally weighted sample average:
$$
\hat m(x_0) = \sum_i w_{i0,h}\, y_i
$$
where $w_{i0,h}$ are local weights.
The estimator is unbiased, but for consistency we need $N_0 \to \infty$ (the number of observations local to $x_0$) as $N \to \infty$, so that the variance goes to zero.
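A minimal sketch of such a locally weighted average (my own illustration, assuming Gaussian kernel weights, i.e., a Nadaraya-Watson type estimator):

```python
import numpy as np

def kernel_regression(x, y, x0, h):
    """Locally weighted average m_hat(x0) = sum_i w_i0 * y_i, with kernel weights summing to one."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)    # Gaussian kernel weights, declining away from x0
    w = k / k.sum()                           # normalize the local weights
    return np.sum(w * y)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 300)
grid = np.linspace(0.1, 0.9, 9)
m_hat = [kernel_regression(x, y, x0, h=0.05) for x0 in grid]
print(np.round(m_hat, 2))
```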
What can be said about the bias-variance trade-off in nonparametric regression?
Here we also have a bias-variance tradeoff. As $h$ becomes smaller $\hat m(x_0)$ becomes less biased, as only observations close to $x_0$ are being used, but more variable, as fewer observations are being used.