Lecture 4: Covariance Matrix Estimation Flashcards
Empirical covariance matrix
- p x n data matrix A = [a1 .. an]
- Each row is one asset's log-return time series; each column ai is the vector of returns at observation i
C_hat = (1/n) * Sum_i [(ai - a_bar)(ai - a_bar)^T]
where a_bar = (1/n) * Sum_i ai is the sample mean of the observations
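A minimal numpy sketch of the formula above, using a synthetic p x n matrix of returns (the sizes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 100                        # p assets, n observations (illustrative sizes)
A = rng.standard_normal((p, n))      # columns a_1, .., a_n are return vectors

a_bar = A.mean(axis=1, keepdims=True)   # sample mean a_bar
centered = A - a_bar
C = centered @ centered.T / n           # C_hat = (1/n) Sum (ai - a_bar)(ai - a_bar)^T
# This matches numpy's biased estimator np.cov(A, bias=True)
```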
How do we find an estimate Sigma_hat of the true Sigma based on the data points x1, x2, .., xn?
We should maximize the likelihood
L(Sigma) = Product_i [p(xi; Sigma)]
Changing variables to X = Sigma^-1 and taking the log of the likelihood, the problem becomes
max_X [log det X - Trace(C X)]
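Setting the gradient of the objective, X^-1 - C, to zero gives X = C^-1, i.e. the maximum-likelihood estimate recovers the empirical covariance. A quick numerical check on a synthetic positive definite C (the matrix is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
C = B @ B.T + 3 * np.eye(3)          # a synthetic positive definite "sample covariance"

def objective(X, C):
    """log det X - Trace(C X): the log-likelihood in the variable X = Sigma^-1."""
    sign, logdet = np.linalg.slogdet(X)
    return logdet - np.trace(C @ X)

X_star = np.linalg.inv(C)            # stationary point: gradient X^-1 - C = 0
val_star = objective(X_star, C)
val_near = objective(X_star + 0.01 * np.eye(3), C)   # any perturbation scores lower
```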
What is wrong with the empirical estimate?
- The estimate fails to be positive definite (hence not invertible) when p > n, i.e. when there are more assets than observations.
- It does not handle missing data.
- It is highly sensitive to outliers.
Hence we look for a better estimate.
How do we measure the estimation quality?
- Apply the cross-validation principle:
- Remove 10% of the data at random
- Re-estimate the covariance on the remaining data and record the new estimate
- Repeat, and measure the average "error" between the estimates
How do we measure errors? Introduce a notion of distance between matrices, captured by the Frobenius norm (the square root of the sum of squared entries of the difference).
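The cross-validation recipe above can be sketched as follows, on made-up data (the 10% drop fraction and 20 repetitions are from the notes and an arbitrary choice, respectively):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 300
data = rng.standard_normal((n, p))   # rows are observations (synthetic data)

def frobenius(A, B):
    """Frobenius distance: sqrt of the sum of squared entrywise differences."""
    return np.linalg.norm(A - B, "fro")

full_cov = np.cov(data, rowvar=False, bias=True)
errors = []
for _ in range(20):
    keep = rng.choice(n, size=int(0.9 * n), replace=False)   # drop 10% of the data
    sub_cov = np.cov(data[keep], rowvar=False, bias=True)    # re-estimate on the rest
    errors.append(frobenius(full_cov, sub_cov))
avg_error = float(np.mean(errors))   # average "error" between the estimates
```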
Sparse Graphical Models
Given the prices of many assets, we would like to draw a graph that describes the links between them.
Conditional independence - the random variables xi and xj are conditionally independent if, for xk fixed (k != i, j), the density factors as
p(x) = pi(xi) * pj(xj)
The variables xi and xj are conditionally independent iff the (i, j) element of the precision matrix is zero: (Sigma^-1)ij = 0
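A small illustration of this fact with a hand-picked 3x3 precision matrix: the (1, 3) entry of Sigma^-1 is zero (x1 and x3 conditionally independent given x2), yet the covariance itself is not zero there, since marginal correlation persists under conditional independence.

```python
import numpy as np

# Illustrative precision matrix with (Sigma^-1)_{13} = 0
precision = np.array([[2.0, 0.5, 0.0],
                      [0.5, 2.0, 0.5],
                      [0.0, 0.5, 2.0]])
Sigma = np.linalg.inv(precision)

cov_13 = Sigma[0, 2]                  # nonzero: x1 and x3 are marginally correlated
prec_13 = np.linalg.inv(Sigma)[0, 2]  # (numerically) zero: conditionally independent
```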
Sparse Precision matrix estimation
Add an l1 penalty to the maximum-likelihood estimation problem:
max_X [log det X - Trace(C_hat X) - lambda * ||X||_1]
The penalized problem yields an invertible result even if C_hat is not positive definite, and it remains convex. The l1 penalty encourages a sparse precision matrix.
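One way to solve this penalized problem (not specified in the notes) is proximal gradient descent: take a gradient step on the smooth part, then soft-threshold the off-diagonal entries. A sketch on synthetic data, with illustrative step size, penalty, and iteration count:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth: a tridiagonal (sparse) precision matrix
p, n = 5, 200
true_prec = np.eye(p) + 0.3 * np.diag(np.ones(p - 1), 1) + 0.3 * np.diag(np.ones(p - 1), -1)
samples = rng.multivariate_normal(np.zeros(p), np.linalg.inv(true_prec), size=n)
C_hat = np.cov(samples, rowvar=False, bias=True)

def soft_threshold(M, tau):
    """Entrywise soft-thresholding: the prox operator of tau * ||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def sparse_precision(C, lam=0.1, step=0.05, iters=500):
    """Proximal-gradient sketch for max_X log det X - Trace(C X) - lam*||X||_1,
    penalizing only the off-diagonal entries of X."""
    X = np.eye(C.shape[0])
    for _ in range(iters):
        grad = C - np.linalg.inv(X)          # gradient of the (negated) smooth part
        Y = X - step * grad
        Z = soft_threshold(Y, step * lam)
        np.fill_diagonal(Z, np.diag(Y))      # leave the diagonal unpenalized
        Z = (Z + Z.T) / 2                    # keep the iterate symmetric
        if np.linalg.eigvalsh(Z).min() > 1e-8:   # accept only positive definite steps
            X = Z
    return X

X_hat = sparse_precision(C_hat)
```

By construction the output is symmetric positive definite, hence invertible, regardless of whether C_hat itself is positive definite. For real use, scikit-learn's `GraphicalLasso` solves this same problem with a more refined algorithm.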