
Restricted Maximum Likelihood

Restricted maximum likelihood (REML) was suggested by Thompson (1962) and was described formally by Patterson and Thompson (1971). The term 'restricted' is perhaps not the most appropriate description of the method. The procedure requires that ${\bf y}$ have a multivariate normal distribution, and the method is translation invariant. The maximum likelihood approach automatically keeps the estimates within the allowable parameter space (i.e. zero to plus infinity), and consequently REML is a biased procedure.

The Likelihood Function

The vector ${\bf y}$ is assumed to have a multivariate normal distribution, so its likelihood function is

\begin{displaymath}L({\bf y}) \mbox{ = } (2\pi)^{-.5N} \mid {\bf V} \mid^{-.5}
\exp(-.5({\bf y}-{\bf Xb})'{\bf V}^{-1}({\bf y}-{\bf Xb})). \end{displaymath}

In REML, however, the likelihood function of a set of error contrasts is used, denoted by \( {\bf K}'{\bf y} \), where \( {\bf K}'{\bf X} = 0 \), and ${\bf K}'$ has rank equal to $N-r({\bf X})$. That is,

\begin{displaymath}L({\bf K}'{\bf y}) = (2\pi)^{-.5(N-r({\bf X}))} \mid
{\bf K}'{\bf VK} \mid^{-.5}
\exp(-.5({\bf K}'{\bf y})'({\bf K}'
{\bf VK})^{-1}({\bf K}'{\bf y})). \end{displaymath}

The next step is to take the natural log of the likelihood function to give

\begin{displaymath}L_{1} = -.5(N-r({\bf X}))\ln(2\pi)-.5\ln \mid {\bf K}'{\bf VK}
\mid -.5{\bf y}'{\bf K}({\bf K}'{\bf VK})^{-1}{\bf K}'{\bf y}. \end{displaymath}

Notice that \(-.5(N-r({\bf X}))\ln(2\pi)\) is a constant that does not depend on the unknown variance components or factors in the model, and therefore, can be ignored. Searle (1979) showed that

\begin{displaymath}\ln \mid {\bf K}'{\bf VK} \mid \mbox{ = } \ln \mid {\bf V} \mid
+ \ln \mid {\bf X}'{\bf V}^{-1}{\bf X}
\mid \end{displaymath}

and that

\begin{displaymath}{\bf y}'{\bf K}({\bf K}'{\bf VK})^{-1}{\bf K}'{\bf y} =
({\bf y}-{\bf X}\hat{{\bf b}})'
{\bf V}^{-1}({\bf y}-{\bf X}\hat{{\bf b}}) \end{displaymath}

for any \({\bf K}'\) as long as ${\bf K}'{\bf X}={\bf0}$ and $r({\bf K}')=N-r({\bf X})$, where $\hat{{\bf b}} = ({\bf X}'{\bf V}^{-1}{\bf X})^{-}{\bf X}'{\bf V}^{-1}{\bf y}$. Hence, $L_{1}$ can be re-written as

\begin{displaymath}L_{2} = -.5\ln \mid {\bf V} \mid -.5\ln \mid {\bf X}'
{\bf V}^{-1}{\bf X} \mid
-.5({\bf y}-{\bf X}\hat{{\bf b}})'{\bf V}^{-1}({\bf y}-{\bf X}
\hat{{\bf b}}).\end{displaymath}

Various alternative forms of $L_{2}$ can be derived. Some basic results on determinants are necessary to follow the derivations.

First, if ${\bf A}$ and ${\bf D}$ are square matrices of the same order, then

\begin{displaymath}\mid {\bf AD} \mid \mbox{ = } \mid {\bf A} \mid \mbox{ } \mid
{\bf D} \mid .\end{displaymath}

This can be used to show that

\begin{eqnarray*}\mid {\bf V} \mid & = & \mid {\bf R}+{\bf ZGZ}' \mid \\
& = & \mid {\bf R} \mid \ \mid {\bf I}+{\bf R}^{-1}{\bf ZGZ}' \mid \\
& = & \mid {\bf R} \mid \ \mid {\bf I}+{\bf Z}'{\bf R}^{-1}{\bf ZG} \mid \\
& = & \mid {\bf R} \mid \ \mid {\bf G}^{-1}+{\bf Z}'{\bf R}^{-1}
{\bf Z} \mid \ \mid {\bf G} \mid .
\end{eqnarray*}


Then it follows that

\begin{displaymath}\ln \mid {\bf V} \mid = \ln \mid {\bf R} \mid + \ln \mid
{\bf G} \mid
+ \ln \mid {\bf G}^{-1}+{\bf Z}'{\bf R}^{-1}{\bf Z} \mid. \end{displaymath}

Secondly, for a partitioned matrix with \({\bf A}\) and \({\bf D}\) being square and non-singular,

\begin{displaymath}\left\vert \begin{array}{rr}
{\bf A} & -{\bf B} \\ {\bf Q} & {\bf D} \end{array} \right\vert
\mbox{ = } \mid {\bf A} \mid \mbox{ } \mid {\bf D}+{\bf QA}^{-1}{\bf B} \mid
\mbox{ = } \mid {\bf D} \mid \mbox{ } \mid {\bf A}+{\bf BD}^{-1}{\bf Q}
\mid \end{displaymath}

which can be used to show that if \({\bf A}={\bf I}\) and \({\bf D}={\bf I}\), then

\begin{eqnarray*}\mid {\bf I}+{\bf QB} \mid & = & \mid {\bf I}+{\bf BQ} \mid \\ ...
...f B}'{\bf Q}' \mid \\
& = & \mid {\bf I}+{\bf Q}'{\bf B}' \mid
\end{eqnarray*}


Also, if \(\theta\) is a scalar constant, then

\begin{displaymath}\mid \theta {\bf A} \mid \mbox{ = } \theta^{m} \mid {\bf A} \mid \end{displaymath}

where $m$ is the order of ${\bf A}$. These relationships are used to show that if

\begin{displaymath}{\bf C} \mbox{ = } \left( \begin{array}{ll}
{\bf X}'{\bf R}^{-1}{\bf X} & {\bf X}'{\bf R}^{-1}{\bf Z} \\
{\bf Z}'{\bf R}^{-1}{\bf X} & {\bf Z}'{\bf R}^{-1}{\bf Z}+{\bf G}^{-1}
\end{array} \right) \end{displaymath}

then

\begin{eqnarray*}\mid {\bf C} \mid & = & \mid {\bf X}'{\bf R}^{-1}{\bf X} \mid \
\mid {\bf Z}'{\bf R}^{-1}{\bf Z}+{\bf G}^{-1}
-{\bf Z}'{\bf R}^{-1}{\bf X}({\bf X}'{\bf R}^{-1}{\bf X})^{-}
{\bf X}'{\bf R}^{-1}{\bf Z} \mid \\
& = & \mid {\bf Z}'{\bf R}^{-1}{\bf Z}+{\bf G}^{-1} \mid \
\mid {\bf X}'({\bf R}^{-1}-{\bf R}^{-1}{\bf Z}({\bf Z}'{\bf R}^{-1}{\bf Z}
+{\bf G}^{-1})^{-1}{\bf Z}'{\bf R}^{-1}){\bf X} \mid
\end{eqnarray*}


and letting

\begin{displaymath}{\bf S} \mbox{ = } {\bf R}^{-1}-{\bf R}^{-1}
{\bf X}({\bf X}'{\bf R}^{-1}{\bf X})^{-}{\bf X}'{\bf R}^{-1}, \end{displaymath}

then

\begin{eqnarray*}\mid {\bf C} \mid & = & \mid {\bf X}'{\bf R}^{-1}{\bf X} \mid \
\mid {\bf Z}'{\bf SZ}+{\bf G}^{-1} \mid \\
& = & \mid {\bf Z}'{\bf R}^{-1}{\bf Z}+{\bf G}^{-1} \mid \
\mid {\bf X}'{\bf V}^{-1}{\bf X} \mid .
\end{eqnarray*}


From the last equality, then

\begin{displaymath}\ln \mid {\bf X}'{\bf V}^{-1}{\bf X} \mid = \ln \mid {\bf C} \mid
- \ln \mid {\bf Z}'{\bf R}^{-1}{\bf Z}+{\bf G}^{-1} \mid. \end{displaymath}

Combining the above results gives

\begin{displaymath}L_{2} \mbox{ = } -.5\ln \mid {\bf R} \mid -.5\ln \mid {\bf G} \mid
-.5\ln \mid {\bf C} \mid -.5{\bf y}'{\bf Py} .\end{displaymath}

Now note that

\begin{eqnarray*}\ln \mid {\bf R} \mid & = & \ln \mid {\bf I} \sigma_{0}^{2} \mid
\ = \ N \ln \sigma_{0}^{2} , \\
\ln \mid {\bf G} \mid & = & \sum_{i=1}^{s} q_{i} \ln \sigma_{i}^{2} , \\
{\bf Z}'{\bf SZ}+{\bf G}^{-1} & = & \sigma_{0}^{-2}({\bf Z}'{\bf MZ}+{\bf G}^{-1}
\sigma_{0}^{2}) .
\end{eqnarray*}

where ${\bf M} = {\bf I}-{\bf X}({\bf X}'{\bf X})^{-}{\bf X}'$, assuming ${\bf R}={\bf I}\sigma_{0}^{2}$ and ${\bf G}_{i}={\bf I}\sigma_{i}^{2}$.


Then

\begin{displaymath}\ln \mid {\bf C} \mid = \ln \mid {\bf X}'{\bf X} \mid
-r({\bf X})\ln \sigma_{0}^{2} - q\ln \sigma_{0}^{2}
+ \ln \mid {\bf Z}'{\bf MZ}+{\bf G}^{-1}
\sigma_{0}^{2} \mid ,\end{displaymath}

and finally, the log-likelihood function becomes

\begin{eqnarray*}L_{2} & = & -.5(N-r({\bf X})-q)\ln \sigma_{0}^{2} -.5 \sum_{i=1}^{s}
q_{i} \ln \sigma_{i}^{2} \\
& & -.5\ln \mid {\bf C}^{\star} \mid -.5 {\bf y}'{\bf Py},
\end{eqnarray*}


where

\begin{displaymath}{\bf C}^{\star} \mbox{ = } \left( \begin{array}{ll}
{\bf X}'{\bf X} & {\bf X}'{\bf Z} \\
{\bf Z}'{\bf X} & {\bf Z}'{\bf Z}+{\bf G}^{-1}\sigma_{0}^{2}
\end{array} \right) .\end{displaymath}

Derivatives of the Log Likelihood

Recall that the variance-covariance matrix of ${\bf y}$ is

\begin{displaymath}{\bf V} \mbox{ = } \sum_{i=1}^{s} {\bf Z}_{i}{\bf Z}'_{i}
\sigma_{i}^{2} + {\bf I}\sigma_{0}^{2}
\end{displaymath}

and a projection matrix is

\begin{displaymath}{\bf P} = {\bf V}^{-1} - {\bf V}^{-1}{\bf X} ({\bf X}'{\bf V}^{-1}
{\bf X})^{-} {\bf X}'{\bf V}^{-1} . \end{displaymath}

From Searle (1979), the following results about derivatives hold:
(1)

\begin{displaymath}\frac{\partial {\bf V}^{-1} }{\partial \sigma_{i}^{2} } \mbox{ = }
-{\bf V}^{-1}\frac{\partial {\bf V}}{\partial \sigma_{i}^{2} }
{\bf V}^{-1}
\end{displaymath}

(2)

\begin{displaymath}\frac{\partial \ln \mid {\bf V} \mid}{\partial \sigma_{i}^{2}}
\mbox{ = } tr\left({\bf V}^{-1} \frac{\partial {\bf V}}{\partial \sigma_{i}^{2} }
\right)
\end{displaymath}

(3)

\begin{displaymath}\frac{\partial {\bf P}}{\partial \sigma_{i}^{2} } \mbox{ = }
-{\bf P} \frac{\partial {\bf V}}{\partial \sigma_{i}^{2}} {\bf P} \end{displaymath}

(4)

\begin{displaymath}\frac{\partial {\bf V} }{\partial \sigma_{i}^{2} } \mbox{ = }
{\bf Z}_{i}{\bf Z}'_{i} \end{displaymath}

and
(5)

\begin{displaymath}\frac{\partial \ln \mid {\bf X}'{\bf V}^{-1}{\bf X} \mid }
{\partial \sigma_{i}^{2}} \mbox{ = }
-tr({\bf X}'{\bf V}^{-1}{\bf X})^{-}{\bf X}'{\bf V}^{-1}
\frac{\partial {\bf V} }{\partial
\sigma_{i}^{2} } {\bf V}^{-1}{\bf X}. \end{displaymath}

To derive formulas for estimating the variance components, take the derivative of $L_{2}$ with respect to the unknown components.

\begin{eqnarray*}\frac{\partial L_{2}}{\partial \sigma_{i}^{2}} & = &
-.5 tr{\bf V}^{-1}\frac{\partial {\bf V}}{\partial \sigma_{i}^{2}}
+.5 tr({\bf X}'{\bf V}^{-1}{\bf X})^{-}{\bf X}'{\bf V}^{-1}
\frac{\partial {\bf V}}{\partial \sigma_{i}^{2}}{\bf V}^{-1}{\bf X} \\
& & +.5({\bf y}-{\bf X}\hat{{\bf b}})'{\bf V}^{-1}
\frac{\partial {\bf V}}{\partial \sigma_{i}^{2}}
{\bf V}^{-1}({\bf y}-{\bf X}\hat{{\bf b}})
\end{eqnarray*}


Combine the two terms involving the traces and note that

\begin{displaymath}{\bf V}^{-1}({\bf y}-{\bf X}\hat{{\bf b}}) \mbox{ = } {\bf Py} ,\end{displaymath}

then

\begin{displaymath}\frac{\partial L_{2}}{\partial \sigma_{i}^{2}} =
-.5 tr({\bf P}\frac{\partial {\bf V}}{\partial \sigma_{i}^{2}})
+.5{\bf y}'{\bf P}
\frac{\partial {\bf V}}{\partial \sigma_{i}^{2}}{\bf Py} \end{displaymath}


\begin{displaymath}= -.5 tr{\bf PZ}_{i}{\bf Z}'_{i} +.5{\bf y}'{\bf PZ}_{i}
{\bf Z}'_{i}{\bf Py} \end{displaymath}

for \(i = 1,\dots,s\) or

\begin{displaymath}= -.5 tr{\bf P} + .5{\bf y}'{\bf PPy} \end{displaymath}

for $i=0$, the residual component. Using ${\bf P}$ and the fact that

\begin{displaymath}{\bf V}^{-1} = {\bf R}^{-1} - {\bf R}^{-1}{\bf Z}
({\bf Z}'{\bf R}^{-1}{\bf Z}+{\bf G}^{-1})^{-1}{\bf Z}'{\bf R}^{-1} \end{displaymath}

then

\begin{displaymath}tr{\bf PZ}_{i}{\bf Z}'_{i} = q_{i}/\sigma_{i}^{2} -
tr{\bf C}_{ii}\sigma_{0}^{2}/ \sigma_{i}^{4} \end{displaymath}

and

\begin{displaymath}tr{\bf P} = (N-r({\bf X})) -
\sum_{i=1}^{s}\hat{{\bf u}}'_{i}\hat{{\bf u}}_{i}/\sigma^{2}_{i}. \end{displaymath}

The other terms, ${\bf y}'{\bf PZ}_{i}{\bf Z}'_{i}{\bf Py}$ and ${\bf y}'{\bf PPy}$, were simplified by Henderson (1973), who showed that they could be calculated from solutions to his Mixed Model Equations. To simplify these quadratic forms, notice that

\begin{eqnarray*}{\bf Py} & = & ({\bf V}^{-1} - {\bf V}^{-1}{\bf X}({\bf X}'
{\bf V}^{-1}{\bf X})^{-}{\bf X}'{\bf V}^{-1}){\bf y} \\
& = & {\bf V}^{-1}({\bf y}-{\bf X}\hat{{\bf b}}), \ \ {\rm for} \ \
\hat{{\bf b}} \ = \ ({\bf X}'{\bf V}^{-1}{\bf X})^{-}
{\bf X}'{\bf V}^{-1}{\bf y}.
\end{eqnarray*}


Therefore,

\begin{displaymath}{\bf y}'{\bf PZ}_{i}{\bf Z}'_{i}{\bf Py} = ({\bf y} -
{\bf X}\hat{\bf b})'{\bf V}^{-1}{\bf Z}_{i}{\bf Z}'_{i}
{\bf V}^{-1}({\bf y} - {\bf X}\hat{\bf b}).
\end{displaymath}

Henderson noticed that

\begin{displaymath}\hat{\bf u}_{i} = {\bf G}_{i}{\bf Z}'_{i}{\bf Py} \end{displaymath}

where ${\bf G}_{i}$ is the assumed variance-covariance matrix of ${\bf u}_{i}$ (in this case ${\bf G}_{i} = {\bf I} \ {\sigma}_{i}^{2}$), and $\hat{\bf u}_{i}$ is the BLUP of ${\bf u}_{i}$, i.e. the solution for ${\bf u}_{i}$ from his MME. Then

\begin{eqnarray*}{\bf y}'{\bf PZ}_{i}{\bf Z}'_{i}{\bf Py}
& = & {\bf y}'{\bf PZ}_{i}{\bf G}_{i}{\bf G}_{i}^{-1}{\bf G}_{i}^{-1}
{\bf G}_{i}{\bf Z}'_{i}{\bf Py} \\
& = & \hat{\bf u}'_{i}{\bf G}_{i}^{-1}{\bf G}_{i}^{-1}\hat{\bf u}_{i} \\
& = & \hat{\bf u}'_{i}\hat{\bf u}_{i}/{\sigma}_{i}^{4} \ \
{\rm when} \ {\bf G}_{i} = {\bf I} \ {\sigma}_{i}^{2}. \\
\end{eqnarray*}


Similarly, Henderson showed that

\begin{displaymath}{\bf y}'{\bf PPy} = ({\bf y}'{\bf y} \ - \
\hat{\bf b}'{\bf X}'{\bf y} \ - \ \sum_{i=1}^{s}
\hat{\bf u}_{i}^{'}{\bf Z}'_{i}{\bf y} \ - \ \sum_{i=1}^{s}
\hat{\bf u}_{i}^{'}\hat{\bf u}_{i}{\alpha}_{i})/
{\sigma}^{2}_{0}
\end{displaymath}

where ${\alpha}_{i} = {\sigma}^{2}_{0}/{\sigma}^{2}_{i}.$

Now equate the derivatives to zero to obtain

\begin{displaymath}q_{i}/\sigma_{i}^{2}- tr{\bf C}_{ii}\sigma_{0}^{2}/
\sigma_{i}^{4}
= \hat{{\bf u}}'_{i}\hat{{\bf u}}_{i}/\sigma_{i}^{4} \end{displaymath}

which leads to

\begin{displaymath}\hat{\sigma}_{i}^{2} = (\hat{{\bf u}}'_{i}\hat{{\bf u}}_{i}
+ tr{\bf C}_{ii}\sigma_{0}^{2})/q_{i} \end{displaymath}

and

\begin{displaymath}(N-r({\bf X}))\sigma_{0}^{2}-\sum_{i=1}^{s}\hat{{\bf u}}'_{i}
\hat{{\bf u}}_{i}\alpha_{i} = {\bf y}'{\bf y} -
\hat{{\bf b}}'{\bf X}'{\bf y} -
\sum_{i=1}^{s}\hat{{\bf u}}'_{i}{\bf Z}'_{i}{\bf y} -
\sum_{i=1}^{s}\hat{{\bf u}}'_{i}\hat{{\bf u}}_{i}
\alpha_{i} \end{displaymath}

which leads to

\begin{displaymath}\hat{\sigma}_{0}^{2} = {\bf y}'{\bf Py}/(N-r({\bf X})). \end{displaymath}

Use of the above formulas is known as the EM (Expectation-Maximization) algorithm. The EM algorithm is iterative: one must take the new estimates, re-form the matrix ${\bf V}$, recalculate solutions to Henderson's MME, and obtain new estimates of the variances. The process continues until the change between new and old estimates is smaller than a pre-specified amount, such as .000001%. The main problems with the EM algorithm are that convergence is very slow, if it occurs at all, and that the iterations may diverge rather than converge. The more components there are to estimate, the more iterations are needed to reach convergence, depending upon the starting values of the iterative process.
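As an illustration only, one EM round can be sketched in a few lines of Python for the case ${\bf R}={\bf I}\sigma_{0}^{2}$ and ${\bf G}_{i}={\bf I}\sigma_{i}^{2}$. The function name is hypothetical, and the explicit inverse of the MME coefficient matrix is shown only to make the trace term visible; that inverse is exactly the step that makes EM REML expensive in practice.

\begin{verbatim}
import numpy as np

def em_reml_round(X, Z_list, y, sigma2, sigma2_0):
    # One EM round for y = Xb + sum_i Z_i u_i + e, with R = I*sigma2_0
    # and G_i = I*sigma2[i].  (Illustrative sketch only.)
    N, p = y.shape[0], X.shape[1]
    q = [Zi.shape[1] for Zi in Z_list]
    W = np.hstack([X] + Z_list)
    # MME multiplied through by sigma2_0: add alpha_i = sigma2_0/sigma2_i
    # to the diagonal block of each random factor.
    C = (W.T @ W).astype(float)
    alphas = np.concatenate([np.full(qi, sigma2_0 / s2)
                             for qi, s2 in zip(q, sigma2)])
    C[p:, p:] += np.diag(alphas)
    Cinv = np.linalg.pinv(C)              # the expensive step
    rhs = W.T @ y
    sol = Cinv @ rhs                      # solutions b-hat and u-hat
    # sigma2_0 update: y'Py/(N - r(X)), computed as (y'y - sol'W'y)/(N - r(X)).
    new_sigma2_0 = float(y @ y - sol @ rhs) / (N - np.linalg.matrix_rank(X))
    # sigma2_i updates: (u_i'u_i + tr(C_ii)*sigma2_0)/q_i.
    new_sigma2, pos = [], p
    for qi in q:
        ui = sol[pos:pos + qi]
        Cii = Cinv[pos:pos + qi, pos:pos + qi]
        new_sigma2.append(float(ui @ ui + np.trace(Cii) * new_sigma2_0) / qi)
        pos += qi
    return new_sigma2, new_sigma2_0
\end{verbatim}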

Another major problem is the requirement for $tr{\bf C}_{ii}$, the trace of a submatrix of the inverse of the MME coefficient matrix. In practice, the inverse of the MME is not explicitly computed, and solutions to the MME are obtained by Gauss-Seidel iteration. Thus, there have been many attempts to approximate the value of $tr{\bf C}_{ii}$, and most have not been suitable.

The above problems have led to attempts to discover other computational algorithms for REML. One of those has been Derivative Free REML which is described in the next section.

Derivative Free Method

Derivative Free REML was proposed by Smith and Graser (1986) and Graser, Smith and Tier (1987), and has been expanded upon by Meyer (1987, 1991), who developed a set of programs for computing estimates of variance components for a whole range of univariate and multivariate models. The description given below is a very simplified version of the method, intended to give a basic understanding of the technique.

In essence, one can imagine an $s$-dimensional array containing the values of the likelihood function for every possible set of values of the ratios of the components to the residual variance. The technique is to search this array and find the set of ratios for which the likelihood function is maximized. There is more than one way to conduct this search. Care must be taken to find the 'global' maximum rather than one of possibly many 'local' maxima. At the same time, the number of likelihood evaluations to be computed must also be minimized.

One begins by looking at alternative forms of the log likelihood function given by L2, that is,

\begin{eqnarray*}L_2 & = & -.5( (N-r({\bf X})-q)\ln \sigma_{0}^{2}
+ \sum_{i=1}^{s} q_{i} \ln \sigma_{i}^{2} \\
& & +\ln \mid {\bf C}^{\star} \mid + {\bf y}'{\bf Py}) .
\end{eqnarray*}


Note that

\begin{eqnarray*}q_{i} \ln \sigma_{i}^{2} & = & q_{i} \ln \sigma_{0}^{2}/
\alpha_{i} \\
& = & q_{i}( \ln \sigma_{0}^{2} -\ln \alpha_{i})
\end{eqnarray*}


so that

\begin{eqnarray*}L_{2} & = & -.5( (N-r({\bf X}))\ln \sigma_{0}^{2}
- \sum_{i=1}^{s} q_{i} \ln \alpha_{i} \\
& & +\ln \mid {\bf C}^{\star}
\mid + {\bf y}'{\bf P}{\bf y} ) .
\end{eqnarray*}


The computations are achieved by constructing the following matrix,

\begin{displaymath}\left( \begin{array}{lll}
{\bf X}'{\bf X} & {\bf X}'{\bf Z} & {\bf X}'{\bf y} \\
{\bf Z}'{\bf X} & {\bf Z}'{\bf Z}+{\bf G}^{-1}\sigma_{0}^{2} & {\bf Z}'{\bf y} \\
{\bf y}'{\bf X} & {\bf y}'{\bf Z} & {\bf y}'{\bf y}
\end{array} \right) \mbox{ = } \left( \begin{array}{ll}
{\bf C}^{\star} & {\bf W}'{\bf y} \\ {\bf y}'{\bf W} & {\bf y}'{\bf y}
\end{array} \right), \ \ {\rm where} \ {\bf W} = ({\bf X} \ \ {\bf Z}), \end{displaymath}

then by Gaussian elimination of one row at a time, the sum of the log of the non-zero pivots (using the same ordering for each evaluation of the likelihood) gives

\begin{eqnarray*}\ln \mid {\bf C}^{\star} \mid & \ {\rm and} \ &
{\bf y}'{\bf Py}.
\end{eqnarray*}


Gaussian elimination, using sparse matrix techniques, requires less computing time than inverting the coefficient matrix of the mixed model equations. The ordering of factors within the equations could be critical to the computational process and some experimentation may be necessary to determine the best ordering. The likelihood function can be evaluated without the calculation of solutions to the mixed model equations, without inverting the coefficient matrix of the mixed model equations, and without computing any of the \(\sigma_{i}^{2}\). The formulations for more general models and multiple trait models are more complex, but follow the same ideas.
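A minimal sketch of one likelihood evaluation, for a single random factor so that ${\bf G}^{-1}\sigma_{0}^{2}={\bf I}\alpha$, might look as follows in Python. The function name is hypothetical and the dense elimination is for illustration only; an actual implementation would use sparse matrix techniques as noted above.

\begin{verbatim}
import numpy as np

def dfreml_pieces(X, Z, y, alpha):
    # Build the augmented matrix
    #   [ X'X  X'Z            X'y ]
    #   [ Z'X  Z'Z + I*alpha  Z'y ]
    #   [ y'X  y'Z            y'y ]
    # and eliminate one row at a time.  The sum of logs of the non-zero
    # pivots of the C* block gives ln|C*|; the element remaining after
    # all equations have been absorbed into y'y gives y'Py.
    W = np.hstack([X, Z])
    n = W.shape[1]
    M = np.block([[W.T @ W,            (W.T @ y)[:, None]],
                  [(W.T @ y)[None, :], np.array([[y @ y]])]]).astype(float)
    M[X.shape[1]:n, X.shape[1]:n] += alpha * np.eye(Z.shape[1])
    log_det_Cstar = 0.0
    for k in range(n):
        pivot = M[k, k]
        if abs(pivot) > 1.0e-10:        # skip null pivots if X is not full rank
            log_det_Cstar += np.log(pivot)
            M[k + 1:, :] -= np.outer(M[k + 1:, k] / pivot, M[k, :])
    return log_det_Cstar, M[n, n]       # ln|C*| and y'Py
\end{verbatim}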

The general technique is best described with a small example. Below are ordinary least squares equations for a model with three factors, two of which (A and B) are random. The total sum of squares is ${\bf y}'{\bf y} = 356,000$.


\begin{displaymath}\left( \begin{array}{rrrrrrrrr}
50 & 0 & 5 & 15 & 30 & 5 & 10...
...860 \\ 3140 \\ 700 \\ 1320 \\ 2400 \\ 1160
\end{array} \right) \end{displaymath}

EM Algorithm

To demonstrate the EM algorithm, let \(\alpha_{A} = 10\) and \(\alpha_{B} = 5\) be the starting values of the ratios for factors A and B, respectively, which must be added to the diagonals of the above equations. The solution vector is

\begin{displaymath}\left( \begin{array}{r}
F_{1} \\ F_{2} \\ A_{1} \\ A_{2} \\ A...
...\ 5.1064 \\
2.6402 \\ -2.6433 \\ -5.1034 \end{array} \right) .\end{displaymath}

Then

\begin{displaymath}{\bf y}'({\bf X}\hat{{\bf b}}+{\bf Z}\hat{{\bf u}})
\mbox{ = } 347,871.2661 \end{displaymath}

and from the inverse of the coefficient matrix,

\begin{displaymath}tr{\bf C}_{AA} = .16493, \mbox{ and }
tr{\bf C}_{BB} = .3309886 \end{displaymath}

which give rise to the following estimates,

\begin{eqnarray*}\hat{\sigma}_{0}^{2} & = & (356,000 - 347,871.2661) / 88 \\
&...
... = & (66.0771576+.3309886(92.371976))/4 \\
& = & 24.16280774.
\end{eqnarray*}


New ratios are formed as

\begin{displaymath}\alpha_{A} = 92.371976/7.575855 = 12.192944, \end{displaymath}

and

\begin{displaymath}\alpha_{B} = 92.371976/24.16280774 = 3.822899 \end{displaymath}

and these are used to form the mixed model equations again, followed by new solutions and traces, and so on, until the estimated ratios and the prior values of the ratios are equal. The estimates appear to converge to

\begin{eqnarray*}\hat{\sigma}_{0}^{2} & = & 91.8639, \\
\hat{\sigma}_{A}^{2} & = & 2.5692, \\
\hat{\sigma}_{B}^{2} & = & 30.5190.
\end{eqnarray*}


DF REML

Begin by fixing the value of $\alpha_{B}$ at 10, and let the value of $\alpha_{A}$ take on values of $( 5, \ 10, \ 20, \ 30 )$. Using $L_{2}$ to evaluate the likelihood, the results are as follows:

\begin{displaymath}\begin{array}{rr}
\alpha_{A} & L_{2} \\
5 & -207.4442 \\
10 & -207.1504 \\
20 & -206.9822 \\
30 & -206.9274
\end{array}\end{displaymath}

To find the value of $\alpha_{A}$ that maximizes $L_{2}$ for $\alpha_{B} = 10$, let

\begin{displaymath}{\bf Q} = \left( \begin{array}{rrr}
1 & 5 & 25 \\ 1 & 10 & 100 \\ 1 & 20 & 400 \\ 1 & 30 & 900
\end{array} \right), \ \ \
{\bf Y} = \left( \begin{array}{r}
-207.4442 \\ -207.1504 \\ -206.9822 \\ -206.9274
\end{array} \right) \end{displaymath}

then

\begin{displaymath}\hat{\beta} = ({\bf Q}'{\bf Q})^{-1}{\bf Q}'{\bf Y} =
\left( \begin{array}{c}
-207.6941 \\ .0613844 \\ -.001205457
\end{array} \right) .\end{displaymath}

From this, a prediction equation for $L_{2}$ is

\begin{displaymath}L_{2} \mbox{ = } -207.6941 +.0613844 \alpha_{A} -.001205457
\alpha_{A}^{2} .\end{displaymath}

This equation can be differentiated with respect to $\alpha_{A}$ and then equated to zero to find the value of the ratio that maximizes the likelihood. This gives

\begin{displaymath}\alpha_{A} = .0613844/(2(.001205457)) = 25.4612. \end{displaymath}
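The quadratic approximation step is simple to program. The following Python lines (hypothetical function name) reproduce the calculation above from the four likelihood evaluations.

\begin{verbatim}
import numpy as np

def quadratic_maximizer(ratios, logL):
    # Fit logL = b0 + b1*ratio + b2*ratio^2 by least squares and return
    # the ratio at which the fitted quadratic is maximized, -b1/(2*b2).
    Q = np.column_stack([np.ones(len(ratios)), ratios, np.square(ratios)])
    b0, b1, b2 = np.linalg.lstsq(Q, np.asarray(logL, dtype=float), rcond=None)[0]
    return -b1 / (2.0 * b2)

# The four evaluations over alpha_A with alpha_B fixed at 10:
print(quadratic_maximizer([5, 10, 20, 30],
                          [-207.4442, -207.1504, -206.9822, -206.9274]))
# prints approximately 25.46
\end{verbatim}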

The search for the global maximum continues by keeping $\alpha_{A}$ at 25.4612 and looking at four new values of $\alpha_{B}$, which give the following results.

\begin{displaymath}\begin{array}{rr}
\alpha_{B} & L_{2} \\
2 & -206.2722 \\
3 & -206.2055 \\
4 & -206.2473 \\
5 & -206.3426
\end{array}\end{displaymath}

Applying the quadratic regression to these points gives

\begin{displaymath}\alpha_{B} \mbox{ = } 3.1910. \end{displaymath}

The search continues by alternating between fixing either $\alpha_{A}$ or $\alpha_{B}$ and narrowing the field of search. Keeping $\alpha_{B}$ at 3.19, the possible values for $\alpha_{A}$ might now be 24, 25, 26, and 27.

To ensure that a global maximum has been found, the entire process could be restarted with vastly different starting values for the ratios, such as \(\alpha_{B} = 50\) with values for \(\alpha_{A}\) of 40, 50, 60, and 70.

The more components there are to estimate, the more likelihood evaluations are needed, and the more probable it is that convergence will be to a local maximum rather than to the global maximum. Another problem is that a component might converge towards zero. There has been much work on searching algorithms for DF REML.

Please refer to the literature for the specification of the log likelihood function for particular models and situations. Also, refer to the work of Boldman and Van Vleck (1993), who found a simplification of Meyer's algorithms that reduced computational time by several orders of magnitude. Even so, DF REML has mainly been applied to fairly small data sets and can take considerable time to find estimates even for these. The available software may also not be able to handle particular models, and the user should be aware of these possible problems.

Algorithms to Maximize a Nonlinear Function

Hofer (1998, Variance component estimation in animal breeding: a review, Journal of Animal Breeding and Genetics 115:247-266) gives a general description of the algorithms used with REML. A gradient vector (the change in the log likelihood resulting from a change in the parameters), called ${\bf d}$, determines the direction of the search in the next step. The gradient is modified by a matrix ${\bf M}$ and by a scalar, $c$. The new estimate is given by

\begin{displaymath}\sigma^{(r+1)} = \sigma^{(r)} + c^{(r)}{\bf M}^{(r)}{\bf d}^{(r)}, \end{displaymath}

where

\begin{displaymath}{\bf d}^{(r)} = \partial \ln L_{r}/ \partial \sigma \end{displaymath}

evaluated at $\sigma = \sigma^{(r)}$.
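In code, the general updating step is only one line; the following Python sketch (hypothetical function name) makes the roles of $c$, ${\bf M}$, and ${\bf d}$ explicit.

\begin{verbatim}
import numpy as np

def gradient_update(sigma, d, M, c=1.0):
    # One step of sigma^(r+1) = sigma^(r) + c*M*d, where d is the gradient
    # of ln L evaluated at sigma^(r).  The choice of M distinguishes the
    # algorithms: the inverse observed information (Newton-Raphson), the
    # inverse expected information (Method of Scoring), or the inverse
    # average information (AI REML).
    return np.asarray(sigma, dtype=float) + c * (np.asarray(M) @ np.asarray(d))
\end{verbatim}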

For the Newton-Raphson algorithm, evaluated at $\sigma = \sigma^{(r)}$,

\begin{eqnarray*}c^{(r)} & = & 1, \\
{\bf M}^{(r)} & = & \left( - (\partial^{2}\ln L / \partial \sigma_{i}
\partial \sigma_{j}) \right)^{-1}, \\
\partial^{2}\ln L / \partial \sigma_{i} \partial \sigma_{j} & = &
.5 \ tr\left({\bf P}\frac{\partial {\bf V}}{\partial \sigma_{i}}
{\bf P}\frac{\partial {\bf V}}{\partial \sigma_{j}}\right)
- {\bf y}'{\bf P}\frac{\partial {\bf V}}{\partial \sigma_{i}}
{\bf P}\frac{\partial {\bf V}}{\partial \sigma_{j}}{\bf Py}
\end{eqnarray*}


For the Method of Scoring algorithm, evaluated at $\sigma = \sigma^{(r)}$,

\begin{eqnarray*}c^{(r)} & = & 1, \\
{\bf M}^{(r)} & = & \left( E[ -(\partial^{2}\ln L / \partial \sigma_{i}
\partial \sigma_{j}) ] \right)^{-1}, \\
E[ -(\partial^{2}\ln L / \partial \sigma_{i} \partial \sigma_{j}) ]
& = & .5 \ tr\left({\bf P}\frac{\partial {\bf V}}{\partial \sigma_{i}}
{\bf P}\frac{\partial {\bf V}}{\partial \sigma_{j}}\right)
\end{eqnarray*}


The traces in the Newton-Raphson algorithm and in the Method of Scoring algorithm both involve the inverse elements, in complex forms, of the mixed model equations, which makes them difficult to attack computationally. Johnson and Thompson (1995) suggested averaging the observed and expected information matrices and coined the term Average Information REML, or AI REML. They defined ${\bf M}^{(r)}$ as the inverse of the following matrix, which does not require elements of the inverse of the mixed model equations.

\begin{eqnarray*}{ {\rm AI}(\sigma_{i},\sigma_{j}) } & = &
\frac{1}{2} \left( - (\partial^{2}\ln L / \partial \sigma_{i}
\partial \sigma_{j})
+ E[ -(\partial^{2}\ln L / \partial \sigma_{i} \partial \sigma_{j}) ] \right) \\
& = & \frac{1}{2} \ {\bf y}'{\bf P}\frac{\partial {\bf V}}{\partial \sigma_{i}}
{\bf P}\frac{\partial {\bf V}}{\partial \sigma_{j}}{\bf Py}
\end{eqnarray*}


These last quantities are calculated by augmenting some previous equations as follows:

\begin{displaymath}\left( \begin{array}{lll}
{\bf C}^{\star} & {\bf W}'{\bf y} & {\bf W}'{\bf f}_{i} \\
{\bf y}'{\bf W} & {\bf y}'{\bf y} & {\bf y}'{\bf f}_{i} \\
{\bf f}'_{i}{\bf W} & {\bf f}'_{i}{\bf y} & {\bf f}'_{i}{\bf f}_{i}
\end{array} \right), \end{displaymath}

where

\begin{displaymath}{\bf f}_{i} = \frac{\partial {\bf V}}{\partial \sigma_{i}}
{\bf Py}. \end{displaymath}

There would be one ${\bf f}_{i}$ vector for each variance component to be estimated. All of them may be added to the above matrix at one time, before Gaussian elimination is applied. The calculation of each ${\bf f}_{i}$ involves solutions to the mixed model equations.
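A sketch of the ${\bf f}_{i}$ vectors in Python, assuming ${\bf R}={\bf I}\sigma_{0}^{2}$ and that the MME solutions $\hat{\bf b}$ and $\hat{\bf u}_{i}$ are already available, could be written as follows (illustrative names only). It uses the standard identity ${\bf Py}={\bf V}^{-1}({\bf y}-{\bf X}\hat{\bf b})={\bf R}^{-1}({\bf y}-{\bf X}\hat{\bf b}-{\bf Z}\hat{\bf u})$.

\begin{verbatim}
import numpy as np

def ai_working_vectors(X, Z_list, y, b_hat, u_hat_list, sigma2_0):
    # f_i = (dV/d sigma_i^2) P y, assuming R = I*sigma2_0.
    # P y = V^{-1}(y - X b_hat) = (y - X b_hat - sum_i Z_i u_hat_i)/sigma2_0.
    e_hat = y - X @ b_hat - sum(Zi @ ui for Zi, ui in zip(Z_list, u_hat_list))
    Py = e_hat / sigma2_0
    f = [Zi @ (Zi.T @ Py) for Zi in Z_list]   # dV/d sigma_i^2 = Z_i Z_i'
    f.append(Py)                              # residual: dV/d sigma_0^2 = I
    return f
\end{verbatim}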

Hofer (1998) compared the various algorithms for various models. The AI REML algorithm seemed to be advantageous over the other methods. DF REML was advantageous over EM REML when the number of parameters to be estimated was small, but EM REML became more efficient than DF REML when the number of parameters was large as in multiple trait models. Animal breeders, in general, need to be able to access and utilize REML programs written by experts in order to save time from writing their own programs, and to avoid the duplication of effort that would be needed to get new programs to work correctly and efficiently. However, for some models there may not be programs available to handle them, and then writing your own programs would be necessary.

Bayesian Estimation

These notes are based on Sorensen (1998), from a course on "Gibbs Sampling in Quantitative Genetics" given at AGBU, Armidale, NSW, Australia. Gibbs sampling, used to derive estimates of variance components from the joint posterior distribution, is a very powerful tool for animal models. Computationally, one needs only a program that calculates solutions to Henderson's mixed model equations, very good random number generators, and the computer time to process very large numbers of samples.

Begin with the simple single trait animal model. That is,

\begin{displaymath}{\bf y} = {\bf Xb}+{\bf Za} + {\bf e}. \end{displaymath}

The Bayesian approach is to derive the joint posterior distribution by application of Bayes' theorem. If $\theta$ is a vector of random variables and ${\bf y}$ is the data vector, then

\begin{eqnarray*}p(\theta, {\bf y}) & = & p(\theta) \ p({\bf y} \mid \theta) \\
& = & p({\bf y}) \ p(\theta \mid {\bf y})
\end{eqnarray*}


Re-arranging gives

\begin{eqnarray*}p(\theta \mid {\bf y}) & = & \frac{p(\theta)p({\bf y} \mid \theta)}
{p({\bf y})} \\
& = & {\rm posterior \ probability \ function \ of } \ \theta
\end{eqnarray*}


In terms of the simple animal model, $\theta$ includes ${\bf b}$, ${\bf a}$, $\sigma^{2}_{a}$, and $\sigma^{2}_{e}$. The conditional distribution of ${\bf y}$ given $\theta$ is

\begin{eqnarray*}{\bf y} \mid {\bf b},{\bf a},\sigma^{2}_{a},\sigma^{2}_{e}
& \sim & N({\bf Xb}+{\bf Za}, \ {\bf I}\sigma^{2}_{e}) \\
p({\bf y} \mid {\bf b},{\bf a},\sigma^{2}_{a},\sigma^{2}_{e})
& \propto & (\sigma^{2}_{e})^{-N/2} \exp \left[ -({\bf y}-{\bf Xb}-{\bf Za})'
({\bf y}-{\bf Xb}-{\bf Za})/2\sigma^{2}_{e} \right]
\end{eqnarray*}


Prior distributions need to be assigned to the components of $\theta$, and these need to be multiplied together and then multiplied by the conditional distribution of ${\bf y}$ given $\theta$. For the fixed effects vector, ${\bf b}$, there is little prior knowledge about the values that the elements of that vector might have. This is represented by assuming

\begin{displaymath}p({\bf b}) \propto {\rm constant}. \end{displaymath}

For ${\bf a}$, the vector of additive genetic values, quantitative genetics theory suggests that they follow a normal distribution, i.e.

\begin{eqnarray*}{\bf a} \mid {\bf A}, \sigma^{2}_{a} & \sim & N({\bf0},
{\bf A}\sigma^{2}_{a}) \\
p({\bf a} \mid {\bf A}, \sigma^{2}_{a}) & \propto &
(\sigma^{2}_{a})^{-q/2} \exp \left[
- {\bf a}'{\bf A}^{-1}{\bf a} / 2 \sigma^{2}_{a} \right],
\end{eqnarray*}


where $q$ is the length of ${\bf a}$. A natural estimator of $\sigma^{2}_{a}$ is ${\bf a}'{\bf A}^{-1}{\bf a}/q$, call it $S^{2}_{a}$, which has

\begin{displaymath}S^{2}_{a} \sim \chi^{2}_{q} \sigma^{2}_{a} /q. \end{displaymath}

Multiplying both sides by q and dividing by $\chi^{2}_{q}$ gives

\begin{displaymath}\sigma^{2}_{a} \sim q S^{2}_{a} / \chi^{2}_{q} \end{displaymath}

which is a scaled, inverted Chi-square distribution, written as

\begin{displaymath}p(\sigma^{2}_{a} \mid v_{a},S^{2}_{a}) \propto
(\sigma^{2}_{a})^{-(\frac{v_{a}}{2}+1)} \exp \left( -\frac{v_{a}}{2}
\frac{S^{2}_{a}}{\sigma^{2}_{a}} \right), \end{displaymath}

where $v_{a}$ and $S^{2}_{a}$ are hyperparameters, with $S^{2}_{a}$ being a prior guess about the value of $\sigma^{2}_{a}$ and $v_{a}$ being the degrees of belief in that prior value. Usually $q$ is much larger than $v_{a}$, and therefore the data provide nearly all of the information about $\sigma^{2}_{a}$. Similarly, for the residual variance,

\begin{eqnarray*}p(\sigma^{2}_{e} \mid v_{e},S^{2}_{e}) & \propto &
(\sigma^{2}_{e})^{-(\frac{v_{e}}{2}+1)} \exp \left( -\frac{v_{e}}{2}
\frac{S^{2}_{e}}{\sigma^{2}_{e}} \right). \\
\end{eqnarray*}


Now form the joint posterior distribution as

\begin{displaymath}p({\bf b},{\bf a},\sigma^{2}_{a},\sigma^{2}_{e} \mid {\bf y} )
\propto p({\bf b}) \ p({\bf a} \mid \sigma^{2}_{a}) \ p(\sigma^{2}_{a}) \
p(\sigma^{2}_{e}) \ p({\bf y} \mid {\bf b},{\bf a},\sigma^{2}_{a},
\sigma^{2}_{e}) \end{displaymath}

which can be written as

\begin{displaymath}\propto
(\sigma^{2}_{e})^{-( \frac{N+v_{e}}{2}+1)}
\exp \left[ - \frac{1}{2 \sigma^{2}_{e}}
(({\bf y}-{\bf Xb}-{\bf Za})'({\bf y}-{\bf Xb}-{\bf Za}) + v_{e}S^{2}_{e})
\right] \end{displaymath}


\begin{displaymath}(\sigma^{2}_{a})^{- (\frac{q+v_{a}}{2}+1)} \exp \left[
- \frac{1}{2 \sigma^{2}_{a}} ({\bf a}'{\bf A}^{-1}{\bf a}
+ v_{a}S^{2}_{a}) \right]. \end{displaymath}

In order to implement Gibbs sampling, all of the fully conditional posterior distributions (one for each component of $\theta$ ) need to be derived from the above joint posterior distribution. Let

\begin{eqnarray*}{\bf W} & = & ( {\bf X} \ \ {\bf Z} ), \\
\beta' & = & ( {\bf b}' \ \ {\bf a}' ), \\
\Sigma & = & \left( \begin{array}{ll} {\bf0} & {\bf0} \\
{\bf0} & {\bf A}^{-1}k \end{array} \right), \ \ k = \sigma^{2}_{e}/\sigma^{2}_{a}, \\
{\bf C} & = & {\bf W}'{\bf W} + \Sigma \\
{\bf C}\hat{\beta} & = & {\bf W}'{\bf y}
\end{eqnarray*}


Now adopt a new notation: let

\begin{displaymath}\beta' = ( \beta_{i} \ \ \beta'_{-i} ), \end{displaymath}

where $\beta_{i}$ is a scalar representing just one element of the vector $\beta$, and $\beta_{-i}$ is a vector representing all of the other elements except $\beta_{i}$. Similarly, ${\bf C}$ and ${\bf W}$ can be partitioned in the same manner as

\begin{eqnarray*}{\bf W}' & = & ( {\bf W}_{i} \ \ {\bf W}_{-i} )' \\
{\bf C} & = & \left( \begin{array}{ll}
{\bf C}_{i,i} & {\bf C}_{i,-i} \\ {\bf C}_{-i,i} & {\bf C}_{-i,-i}
\end{array} \right).
\end{eqnarray*}


In general terms, the conditional posterior distribution of $\beta$ is

\begin{eqnarray*}\beta_{i} \mid \beta_{-i}, \sigma^{2}_{a}, \sigma^{2}_{e},{\bf y}
& \sim & N(\hat{\beta}_{i}, C^{-1}_{i,i} \sigma^{2}_{e})
\end{eqnarray*}


where

\begin{displaymath}C_{i,i}\hat{\beta}_{i} = ({\bf W}'_{i}{\bf y} - {\bf C}_{i,-i}
\beta_{-i}). \end{displaymath}

Then

\begin{displaymath}b_{i} \mid {\bf b}_{-i}, {\bf a}, \sigma^{2}_{a}, \sigma^{2}_{e},
{\bf y} \ \sim \ N(\hat{b}_{i}, C^{-1}_{i,i} \sigma^{2}_{e} ), \end{displaymath}

for

\begin{displaymath}C_{i,i} = {\bf x}'_{i}{\bf x}_{i}. \end{displaymath}

Also,

\begin{displaymath}a_{i} \mid {\bf b},{\bf a}_{-i}, \sigma^{2}_{a}, \sigma^{2}_{e},
{\bf y} \ \sim \ N(\hat{a}_{i}, C^{-1}_{i,i} \sigma^{2}_{e}), \end{displaymath}

where

\begin{displaymath}C_{i,i} = ( {\bf z}'_{i}{\bf z}_{i} + A^{i,i}k), \end{displaymath}

for $k=\sigma^{2}_{e}/ \sigma^{2}_{a}$. The conditional posterior distributions for the variances are

\begin{displaymath}\sigma^{2}_{a} \mid {\bf b},{\bf a}, \sigma^{2}_{e},{\bf y} \sim
\tilde{v}_{a} \tilde{S}^{2}_{a} \chi^{-2}_{\tilde{v}_{a}} \end{displaymath}

for $\tilde{v}_{a} = q + v_{a}$, and $\tilde{S}^{2}_{a} = ({\bf a}'{\bf A}^{-1}{\bf a} + v_{a}S^{2}_{a} ) / \tilde{v}_{a}$, and

\begin{displaymath}\sigma^{2}_{e} \mid {\bf b},{\bf a}, \sigma^{2}_{a},{\bf y} \sim
\tilde{v}_{e} \tilde{S}^{2}_{e} \chi^{-2}_{\tilde{v}_{e}} \end{displaymath}

for $\tilde{v}_{e} = N + v_{e}$, and $\tilde{S}^{2}_{e} = ({\bf e}'{\bf e}
+ v_{e}S^{2}_{e} ) / \tilde{v}_{e}$, and ${\bf e} = {\bf y}-{\bf Xb}-{\bf Za}$.
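Both conditional distributions for the variances are scaled inverted Chi-square distributions, so sampling from them requires only a Chi-square random number generator. A minimal Python sketch (hypothetical function name) is given below.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1234)

def sample_variance(sum_sq, v_prior, S2_prior, df_data):
    # Draw from (sum_sq + v_prior*S2_prior)/Chi-square(df_data + v_prior):
    # for sigma2_a use sum_sq = a'A^{-1}a and df_data = q,
    # for sigma2_e use sum_sq = e'e       and df_data = N.
    return (sum_sq + v_prior * S2_prior) / rng.chisquare(df_data + v_prior)
\end{verbatim}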

Computational Scheme

Gibbs sampling is much like Gauss-Seidel iteration, except that

1. when a new solution is calculated, a random amount is added to it, based upon its conditional posterior distribution variance, before proceeding to the next equation, and
2. after all equations have been processed, new values of the variances are calculated and a new variance ratio is determined prior to beginning the next round.
Suppose we have the following MME for five animals:

\begin{displaymath}\left( \begin{array}{rrrrrr}
5 & 1 & 1 & 1 & 1 & 1 \\
1 & ...
...2 \\ 38.5 \\ 48.9 \\ 64.3 \\ 50.5 \\ 36.0 \end{array} \right), \end{displaymath}

where $k=\sigma^{2}_{e}/\sigma^{2}_{a} = 14$, and

\begin{displaymath}{\bf A}^{-1} = \frac{1}{14} \left( \begin{array}{rrrrr}
28 &...
...14 & 36 & -16 \\
0 & -16 & 0 & -16 & 32 \end{array} \right). \end{displaymath}

Let the starting values be $\beta = \left( \begin{array}{rrrrrr} 0 & 0 & 0 & 0 & 0 & 0 \end{array} \right)$, and let $v_{a}=v_{e}=10$, $S^{2}_{e}=93\frac{1}{3}$, and $S^{2}_{a}=6\frac{2}{3}$, so that $k=14$. Let RND represent a random normal deviate, and let CHI(idf) represent a random Chi-square variate with idf degrees of freedom. To begin, let $\sigma^{2}_{e}=S^{2}_{e}$ and $\sigma^{2}_{a}=S^{2}_{a}$. Below are descriptions of the calculations in the first two rounds.

Round 1


\begin{eqnarray*}\hat{\mu} & = & (238.2 - a_{1}-a_{2}-a_{3}-a_{4}-a_{5})/5 \\
...
...{.5} \\
& = & -.9316 + (-.6472)(1.6817) \\
& = & -2.0200 \\
\end{eqnarray*}


Now calculate the residuals and their sum of squares in order to obtain a new residual variance.

\begin{eqnarray*}e_{1} & = & 38.5 - 42.41 -1.9067 = -5.8167 \\
e_{2} & = & 48.9...
... 42.41 + 2.0200 = -4.3900 \\
{\bf e}'{\bf e} & = & 705.1503 \\
\end{eqnarray*}


The new residual variance is

\begin{eqnarray*}\sigma^{2}_{e} & = & ({\bf e}'{\bf e} + v_{e}S^{2}_{e}) / CHI(15) \\
& = & (705.1503 + (10)(93.3333))/17.1321 \\
& = & 95.6382.
\end{eqnarray*}


The additive genetic variance requires calculation of ${\bf a}'{\bf A}^{-1}{\bf a}$ using the a-values obtained above, which gives

\begin{displaymath}{\bf a}'{\bf A}^{-1}{\bf a} = 19.85586. \end{displaymath}

Then

\begin{eqnarray*}\sigma^{2}_{a} & = & ( {\bf a}'{\bf A}^{-1}{\bf a} + v_{a}
S^{2}_{a}) / CHI(15) \\
& = & ( 19.85586 + (10)(6.66667))/10.7341 \\
& = & 8.0605.
\end{eqnarray*}


The new variance ratio becomes

k = 95.6382 / 8.0605 = 11.8650.

Round 2

Round 2 begins by re-forming the MME using the new variance ratio. The equations have changed to

\begin{displaymath}\left( \begin{array}{rrrrrr}
5 & 1 & 1 & 1 & 1 & 1 \\
1 & ...
...2 \\ 38.5 \\ 48.9 \\ 64.3 \\ 50.5 \\ 36.0 \end{array} \right). \end{displaymath}

Now the process is repeated using the last values of $\mu$ and ${\bf a}$ and $\sigma^{2}_{e}$.

\begin{eqnarray*}\hat{\mu} & = & (238.2 - a_{1}-a_{2}-a_{3}-a_{4}-a_{5})/5 \\
...
...\\
a_{5} & = & -2.6253 + (.8184)(1.8442) \\
& = & -1.1160 \\
\end{eqnarray*}


The residuals and their sum of squares are

\begin{eqnarray*}e_{1} & = & 38.5 - 51.41 +1.3999 = -11.5101 \\
e_{2} & = & 48....
...51.41 + 1.1160 = -14.2940 \\
{\bf e}'{\bf e} & = & 627.1630 \\
\end{eqnarray*}


The new residual variance is

\begin{eqnarray*}\sigma^{2}_{e} & = & ({\bf e}'{\bf e} + v_{e}S^{2}_{e}) / CHI(15) \\
& = & (627.1630 + (10)(93.3333))/20.4957 \\
& = & 76.1377.
\end{eqnarray*}


The additive genetic variance is

\begin{eqnarray*}\sigma^{2}_{a} & = & ( {\bf a}'{\bf A}^{-1}{\bf a} + v_{a}
S^{2}_{a}) / CHI(15) \\
& = & ( 36.8306 + (10)(6.66667))/16.6012 \\
& = & 6.2343.
\end{eqnarray*}


The new variance ratio becomes

k = 76.1377 / 6.2343 = 12.2127.

Continue taking samples for thousands of rounds.
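A minimal sketch of the whole scheme for this single-trait animal model is given below in Python, under the assumptions of this section (one random factor, ${\bf R}={\bf I}\sigma^{2}_{e}$). The function and variable names are illustrative, and the dense storage of the MME is for clarity only.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(42)

def gibbs_animal_model(W, y, Ainv, n_fixed, va, ve, S2a, S2e, rounds=5000):
    # W = [X Z]; Ainv = inverse of the numerator relationship matrix A.
    n_random = Ainv.shape[0]
    N = y.shape[0]
    beta = np.zeros(W.shape[1])          # starting values for b and a
    sigma2_e, sigma2_a = S2e, S2a        # starting values for the variances
    WtW = (W.T @ W).astype(float)
    Wty = W.T @ y
    samples = []
    for _ in range(rounds):
        k = sigma2_e / sigma2_a
        C = WtW.copy()
        C[n_fixed:, n_fixed:] += Ainv * k      # re-form MME with new ratio
        # Gauss-Seidel-like pass: each new solution gets a random amount
        # added, based on its conditional posterior variance.
        for i in range(len(beta)):
            rhs = Wty[i] - C[i, :] @ beta + C[i, i] * beta[i]
            beta[i] = (rhs / C[i, i]
                       + rng.standard_normal() * np.sqrt(sigma2_e / C[i, i]))
        a = beta[n_fixed:]
        e = y - W @ beta
        # New variances from their scaled inverted Chi-square conditionals.
        sigma2_e = (e @ e + ve * S2e) / rng.chisquare(N + ve)
        sigma2_a = (a @ Ainv @ a + va * S2a) / rng.chisquare(n_random + va)
        samples.append((sigma2_e, sigma2_a))
    return np.array(samples)
\end{verbatim}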

Burn In Period

Generally, the Gibbs chain of samples does not immediately converge to give samples from the joint posterior distribution. A burn-in period is usually allowed, during which the sampling process moves from the initial values of the parameters to values from the joint posterior distribution. The length of the burn-in period is usually judged by visually inspecting a plot of sample values across rounds. In the example figure below, the burn-in period should likely be 2000 rounds. This figure represents an animal model analysis of over 323,000 records on slightly over 1 million animals. The trait was one of the type traits in the Holstein breed. The samples generated during the burn-in period are typically discarded from further use.


\begin{figure*}
\centerline{
\psrotatefirst
\psfig{figure=figodd.ps,angle=270,height=4.0in,width=5.0in}
\psfull
}
\end{figure*}

Estimates

Each round of Gibbs sampling is dependent on the results of the previous round. Depending on the total number of observations and parameters, one round may be positively correlated with the next twenty to three hundred rounds. Even though samples are correlated, the user can determine the effective number of samples in the total by calculating lag correlations. Suppose a total of 12,000 samples (after removing the burn-in rounds) gave an effective number of samples equal to 500. This implies that approximately every 24th sample is uncorrelated with the previous ones.

An overall estimate of a parameter can be obtained by averaging all of the 12,000 samples (after the burn-in). However, to derive a confidence interval or to plot the distribution of the samples or to calculate the standard deviation of the estimate, the variance of the 500 independent samples should be used.
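One common way of estimating the effective number of samples from the lag correlations is sketched below in Python; other estimators exist, and the truncation rule used here (stop at the first non-positive lag correlation) is only one convention.

\begin{verbatim}
import numpy as np

def effective_sample_size(x):
    # Effective number of samples approx n/(1 + 2*sum of lag correlations).
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode='full')[n - 1:] / n
    rho = acov / acov[0]
    tau = 1.0
    for r in rho[1:]:
        if r <= 0.0:      # truncate at the first non-positive lag correlation
            break
        tau += 2.0 * r
    return n / tau
\end{verbatim}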

Influence of the Priors

In the small example, $v_{a}=v_{e}=10$ whereas $N$ was only 5. Thus, the prior values of the variances received more weight than the information coming from the data. This is probably appropriate for this small example, but if $N$ were 5,000,000, then the influence of the priors would be next to none. The amount of influence of the priors is not directly determined by the ratio of $v_{i}$ to $N$. In the small example, even though $v_{a}/(N+v_{a})=\frac{2}{3}$, the influence of $S^{2}_{a}$ could be greater than $\frac{2}{3}$. Schenkel (1998) applied some methods recommended to him by Gianola in his thesis work.

Long Chain or Many Chains?

Early papers on MCMC (Markov chain Monte Carlo) methods recommended running many chains of samples and then averaging the final values from each chain. This was to ensure independence of the samples. That philosophy seems to have changed to a recommendation of one single long chain. For animal breeding applications this could mean 100,000 samples or more. If a month is needed to run 50,000 samples, then maybe three chains of 50,000 would be preferable. If only an hour is needed for 50,000 samples, then 1,000,000 samples would not be difficult to run. The main conclusion is that it is very difficult to know if or when a Gibbs chain has converged to the joint posterior distribution. This is currently an active area of statistical research.



Larry Schaeffer
2000-03-17