Restricted maximum likelihood (REML) was suggested by Thompson (1962) and described formally by Patterson and Thompson (1971). The term 'restricted' is perhaps not the best description of the method. The procedure requires that the observations have a multivariate normal distribution, and the method is translation invariant. The maximum likelihood approach automatically keeps the estimates within the allowable parameter space (i.e. zero to plus infinity), and therefore REML is a biased procedure.
The Likelihood Function
The likelihood function of the observation vector, which is assumed to have a multivariate normal distribution, is
Various alternative forms of L_{2} can be derived. Some basic results on determinants are necessary to follow the derivations.
First, if two matrices are square and of the same order, then
Secondly, for a general partitioned matrix as below, with both diagonal blocks square and non-singular, then
Combining the above results gives
Derivatives of the Log Likelihood
Recall that the variance-covariance matrix of y is
To derive formulas for estimating the variance components, take the derivative of L_{2} with respect to the unknown components.
The other terms were simplified by Henderson (1973) to show that they could be calculated from solutions to his Mixed Model Equations.
To simplify these quadratic forms notice that
Now equate the derivatives to zero to obtain
Use of the above formulas is known as the EM (Expectation-Maximization) algorithm. Note that the EM algorithm is iterative: one must take the new estimates, re-form the mixed model equations with them, recalculate the solutions, and obtain new estimates of the variances. The process continues until the change between new and old estimates is smaller than a pre-specified amount, such as .000001%. The main problems with the EM algorithm are that convergence, when it occurs, is very slow, and that the iterations may diverge rather than converge. The more components there are to estimate, the greater the number of iterations needed to reach convergence, depending upon the starting values in the iterative process.
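As a sketch of how the EM iteration proceeds, the following illustrates EM REML on a tiny one-way sire model with unrelated sires (so A = I). The data, starting values, iteration limit, and tolerance are all invented for illustration, not taken from the notes.

```python
# A minimal EM-REML sketch for a one-way sire model
#   y_ij = mu + s_i + e_ij,  s_i ~ N(0, ss2),  e_ij ~ N(0, se2),
# with unrelated sires (A = I).  Data and starting values are invented.

def inverse(M):
    """Gauss-Jordan inverse of a small square matrix (list of lists)."""
    n = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        piv = A[i][i]
        A[i] = [x / piv for x in A[i]]
        for r in range(n):
            if r != i:
                f = A[r][i]
                A[r] = [x - f * z for x, z in zip(A[r], A[i])]
    return [row[n:] for row in A]

sire = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]     # sire of each record
y    = [9.0, 10.0, 8.0, 12.0, 11.0, 13.0, 7.0, 6.0, 8.0, 7.0]
N, q = len(y), 3
n_i  = [sire.count(i) for i in range(q)]
yty  = sum(v * v for v in y)
rhs  = [sum(y)] + [sum(v for v, s in zip(y, sire) if s == i) for i in range(q)]

se2, ss2 = 1.0, 0.5                       # starting values (guesses)
for _ in range(2000):
    lam = se2 / ss2                       # ratio added to sire diagonals
    # coefficient matrix of the MME: [X'X X'Z; Z'X Z'Z + I*lam]
    C = [[float(N)] + [float(v) for v in n_i]]
    for i in range(q):
        row = [float(n_i[i])] + [0.0] * q
        row[1 + i] = n_i[i] + lam
        C.append(row)
    Cinv = inverse(C)
    sol = [sum(a * b for a, b in zip(row, rhs)) for row in Cinv]
    s = sol[1:]
    # EM updates (Henderson 1973):
    tr_Caa = sum(Cinv[1 + i][1 + i] for i in range(q))
    ss2_new = (sum(v * v for v in s) + se2 * tr_Caa) / q
    se2_new = (yty - sum(a * b for a, b in zip(sol, rhs))) / (N - 1)
    done = abs(ss2_new - ss2) + abs(se2_new - se2) < 1e-10
    ss2, se2 = ss2_new, se2_new
    if done:
        break
```

Each pass re-forms the coefficient matrix with the new ratio, re-solves, and re-estimates the components, exactly the cycle described above; the slow, linear convergence of these updates is visible if the iteration count is printed.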
Another major problem is that the formulas require a submatrix of the inverse of the mixed model equations. In practice, this inverse is not explicitly derived; solutions to the MME are obtained by Gauss-Seidel iteration. Thus, there have been many attempts to approximate the required quantity, and most have not been suitable.
The above problems have led to attempts to discover other computational algorithms for REML. One of these is Derivative Free REML, described in the next section.
Derivative Free Method
Derivative Free REML was proposed by Smith and Graser (1986) and Graser, Smith and Tier (1987), and has been expanded upon by Meyer (1987, 1991), who has developed a set of programs for computing estimates of variance components for a whole range of univariate and multivariate models. The description given below is a much simplified version of the method, intended to give a basic understanding of the technique.
In essence, one can imagine an s dimensional array containing the values of the likelihood function for every possible set of values of the ratios of the components to the residual variance. The technique is to search this array and find the set of ratios for which the likelihood function is maximized. There is more than one way to conduct this search. Care must be taken to find the 'global' maximum rather than one of possibly many 'local' maxima. At the same time the number of likelihood evaluations to be computed must also be minimized.
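The search pattern can be sketched as follows. The "likelihood" here is a made-up smooth surface standing in for L_{2} (its maximum placed arbitrarily at ratios of 3 and 25), and the multiplicative coordinate search is only one of many possible derivative-free strategies; a real DF REML program would evaluate the REML log likelihood at each point instead.

```python
import math

def logL(ga, gb):
    # Toy unimodal surface standing in for L2; maximum at (3, 25).
    return -(math.log(ga / 3.0)) ** 2 - 2.0 * (math.log(gb / 25.0)) ** 2

def coordinate_search(f, x, step=2.0, tol=1e-4):
    """Log-scale pattern search: try multiplying each ratio up or down
    by 'step'; when no move improves f, shrink the step and refine."""
    best = f(*x)
    while step > 1.0 + tol:
        improved = False
        for i in range(len(x)):
            for mult in (step, 1.0 / step):
                cand = x[:]
                cand[i] *= mult
                val = f(*cand)
                if val > best:
                    best, x, improved = val, cand, True
        if not improved:
            step = step ** 0.5
    return x, best

ratios, value = coordinate_search(logL, [10.0, 10.0])
```

The coarse-to-fine stepping mirrors the example below: scan one ratio while the other is held fixed, move to the best point found, and repeat with a finer grid. Restarting from very different starting ratios is the usual check against stopping at a local maximum.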
One begins by looking at alternative forms of the log likelihood function given by L_{2}, that is,
The general technique is best described with a small example. Below are ordinary least squares equations for a model with three factors, two of which (A and B) are random. The total sum of squares is .
EM Algorithm
To demonstrate the EM algorithm, let starting values of the ratios for factors A and B be chosen; these must be added to the diagonals of the above equations.
The solution vector is
DF REML
Begin by fixing one of the ratios at 10, and let the other take on the values 5, 10, 20, and 30. Using L_{2} to evaluate the likelihood, the results were as follows:
ratio | L_{2} |
5 | -207.4442 |
10 | -207.1504 |
20 | -206.9822 |
30 | -206.9274 |
The search for the global maximum continues by keeping one ratio at 25.4612 and looking at four new values of the other, which give the following results.
ratio | L_{2} |
2 | -206.2722 |
3 | -206.2055 |
4 | -206.2473 |
5 | -206.3426 |
To ensure that a global maximum has been found, the entire process could be restarted with vastly different starting values for the ratios, for example letting one of them take the values 40, 50, 60, and 70.
The more components there are to estimate, the more likelihood evaluations are needed, and the more probable it is that convergence will be to a local maximum rather than the global maximum. Another problem is that a component may converge towards zero. There has been much work on searching algorithms for DF REML.
Please refer to the literature for the specification of the log likelihood function for particular models and situations. Also, refer to the work of Boldman and Van Vleck (1993), who found a simplification of Meyer's algorithms that reduced computational time by several orders of magnitude. Even so, DF REML has generally been applied to fairly small data sets and can take considerable time to find estimates even for these. The available software may also not be able to handle particular models, and the user should be aware of these possible problems.
Algorithms to Maximize a Nonlinear Function
Hofer (1998; Variance component estimation in animal breeding: a review. Journal of Animal Breeding and Genetics 115:247-266) gives a general description of the algorithms used with REML. A gradient (the change in the log likelihood resulting from a change in the parameters) determines the direction of the search in the next step. The gradient is modified by a matrix and by a scalar, c. The new estimate is given by
For the Newton-Raphson algorithm, evaluated at the current estimates,
For the Method of Scoring algorithm, evaluated at the current estimates,
The traces in the Newton-Raphson algorithm and in the Method of Scoring algorithm both involve elements of the inverse of the mixed model equations in complex forms, which makes them computationally difficult. Johnson and Thompson (1995) suggested averaging the observed and expected information matrices, and coined the term Average Information REML, or AI REML. They defined the modifying matrix as the inverse of the following matrix, which does not require elements of the inverse of the mixed model equations.
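The generic update (new estimate = old estimate + c times a matrix times the gradient) can be illustrated on a one-parameter problem: estimating a single variance from a sample with known zero mean by the Method of Scoring. This example is ours, not from the notes; in REML the gradient and information involve the far more complicated MME expressions discussed above.

```python
import random

# Method of Scoring for sigma^2 given x_i ~ N(0, sigma^2):
#   logL  = -n/2 * log(s2) - SS/(2 s2)
#   g     = dlogL/ds2 = -n/(2 s2) + SS/(2 s2^2)      (the gradient)
#   E[-d2logL/ds2^2] = n/(2 s2^2), so M = 2 s2^2 / n, and c = 1.
# Data and starting value are invented for illustration.

random.seed(7)
x = [random.gauss(0.0, 2.0) for _ in range(200)]   # true variance = 4
n = len(x)
SS = sum(v * v for v in x)

s2 = 1.0                                           # starting value
for _ in range(50):
    g = -n / (2.0 * s2) + SS / (2.0 * s2 * s2)     # gradient (score)
    M = 2.0 * s2 * s2 / n                          # inverse expected information
    s2_new = s2 + 1.0 * M * g                      # c = 1
    if abs(s2_new - s2) < 1e-12:
        break
    s2 = s2_new
```

For this toy problem the scoring step lands on SS/n in a single iteration; in realistic multi-parameter REML problems the update must be iterated, and c is sometimes reduced to keep the estimates inside the parameter space.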
Hofer (1998) compared the various algorithms for various models. The AI REML algorithm appeared to be advantageous over the other methods. DF REML was advantageous over EM REML when the number of parameters to be estimated was small, but EM REML became more efficient than DF REML when the number of parameters was large, as in multiple trait models. Animal breeders, in general, need to be able to access and use REML programs written by experts, both to save the time of writing their own programs and to avoid the duplication of effort needed to get new programs working correctly and efficiently. However, for some models there may be no programs available, and then writing one's own becomes necessary.
Gibbs Sampling
These notes are based on Sorensen (1998), from a course on "Gibbs Sampling in Quantitative Genetics" given at AGBU, Armidale, NSW, Australia. Gibbs sampling, which derives estimates of variance components from the joint posterior distribution, appears to be a very powerful approach for animal models. Computationally, one needs only a program that calculates solutions to Henderson's mixed model equations, very good random number generators, and the computer time to process very large numbers of samples.
Begin with the simple single trait animal model. That is,
For the vector of additive genetic values, quantitative genetics theory suggests a normal distribution, i.e.
Now form the joint posterior distribution as
In order to implement Gibbs sampling, all of the fully conditional
posterior distributions (one for each component of
) need
to be derived from the above joint posterior distribution.
Let
In general terms, the conditional posterior distribution of
is
Computational Scheme
Gibbs sampling is much like Gauss-Seidel iteration, except that each new value is sampled from its full conditional posterior distribution rather than being set equal to its conditional mean.
Round 1
Round 2 begins by re-forming the MME using the new variance ratio.
The equations have changed to
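The round-by-round scheme above can be sketched for a tiny one-way sire model with unrelated sires. The data, priors, chain length, and burn-in are all invented for illustration; each round samples the location effects from their full conditionals and then the variances from scaled inverse chi-square conditionals.

```python
import math, random

# Gibbs sampler for y_ij = mu + s_i + e_ij, s_i ~ N(0, ss2), e_ij ~ N(0, se2),
# with scaled inverse chi-square priors on both variances.  All numbers
# here are illustrative assumptions, not from the notes.

random.seed(1)

sire = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y    = [9.0, 10.0, 8.0, 12.0, 11.0, 13.0, 7.0, 6.0, 8.0, 7.0]
N, q = len(y), 3
n_i  = [sire.count(i) for i in range(q)]

va, Sa2 = 4.0, 2.0            # prior degrees of belief and prior value for ss2
ve, Se2 = 4.0, 1.0            # ... and for se2

def chi2(df):
    # chi-square draw via the gamma distribution: chi2(df) = Gamma(df/2, 2)
    return random.gammavariate(df / 2.0, 2.0)

mu, s = 0.0, [0.0] * q
ss2, se2 = 2.0, 1.0           # starting values
keep_ss2, keep_se2 = [], []

rounds, burn_in = 5000, 500
for r in range(rounds):
    # sample mu from its full conditional (flat prior on mu)
    tot = sum(y[j] - s[sire[j]] for j in range(N))
    mu = random.gauss(tot / N, math.sqrt(se2 / N))
    # sample each sire effect from its full conditional
    lam = se2 / ss2
    for i in range(q):
        ti = sum(y[j] - mu for j in range(N) if sire[j] == i)
        d = n_i[i] + lam
        s[i] = random.gauss(ti / d, math.sqrt(se2 / d))
    # sample the variances from scaled inverse chi-square conditionals
    ss2 = (sum(v * v for v in s) + va * Sa2) / chi2(q + va)
    e2 = sum((y[j] - mu - s[sire[j]]) ** 2 for j in range(N))
    se2 = (e2 + ve * Se2) / chi2(N + ve)
    if r >= burn_in:
        keep_ss2.append(ss2)
        keep_se2.append(se2)

post_ss2 = sum(keep_ss2) / len(keep_ss2)
post_se2 = sum(keep_se2) / len(keep_se2)
```

Note the Gauss-Seidel flavour: each effect is updated using the most recent values of all the others, and the new variance ratio re-forms the equations for the next round.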
Burn In Period
Generally, the Gibbs chain does not immediately yield samples from the joint posterior distribution. A burn-in period is usually allowed, during which the sampling process moves from the initial values of the parameters to values from the joint posterior distribution. The length of the burn-in period is usually judged by visually inspecting a plot of sample values across rounds. In the example figure below, the burn-in period would likely be 2,000 rounds. The figure represents an animal model analysis of over 323,000 records on slightly over 1 million animals; the trait was one of the type traits in the Holstein breed. The samples generated during the burn-in period are discarded from further use.
Estimates
Each round of Gibbs sampling depends on the results of the previous round. Depending on the total number of observations and parameters, one round may be positively correlated with the next twenty to three hundred rounds. Even though the samples are correlated, the user can determine the effective number of samples in the total by calculating lag correlations. Suppose a total of 12,000 samples (after removing the burn-in rounds) gave an effective number of samples equal to 500; this implies that roughly every 24^{th} sample is uncorrelated.
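A minimal sketch of computing lag correlations and an effective number of samples follows. The chain here is a simulated autocorrelated (AR(1)) series standing in for real Gibbs samples, and the truncation rule (stop summing once correlations become negligible) is one simple choice among several in use.

```python
import random

# Effective number of samples from lag correlations:
#   N_eff = N / (1 + 2 * sum of the early positive lag correlations).
# The chain is simulated with known autocorrelation 0.9 for illustration.

random.seed(3)
N, rho = 12000, 0.9
chain = [random.gauss(0.0, 1.0)]
for _ in range(N - 1):
    chain.append(rho * chain[-1] + random.gauss(0.0, 1.0))

mean = sum(chain) / N
c0 = sum((v - mean) ** 2 for v in chain) / N   # lag-0 variance

def lag_corr(k):
    cov = sum((chain[t] - mean) * (chain[t + k] - mean)
              for t in range(N - k)) / N
    return cov / c0

acc, k = 0.0, 1
while k < 1000:
    r = lag_corr(k)
    if r < 0.05:          # truncate once correlations become negligible
        break
    acc += r
    k += 1

n_eff = N / (1.0 + 2.0 * acc)
```

For a chain with lag-1 correlation 0.9, the effective number of samples is only a few percent of the total, which is why such long chains are needed in practice.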
An overall estimate of a parameter can be obtained by averaging all 12,000 samples (after the burn-in). However, to derive a confidence interval, to plot the distribution of the samples, or to calculate the standard deviation of the estimate, the variance should be based on the 500 effectively independent samples.
Influence of the Priors
In the small example, v_{a}=v_{e}=10 whereas N was only 5. Thus, the prior values of the variances received more weight than the information coming from the data. This is probably appropriate for this small example, but if N were 5,000,000, then the influence of the priors would be next to none. The amount of influence of the priors is not directly determined by the ratio of v_{i} to N: in the small example, even though the degrees of belief were equal, the influence of S^{2}_{a} could be greater than that of S^{2}_{e}. In his thesis work, Schenkel (1998) applied methods recommended to him by Gianola.
Long Chain or Many Chains?
Early papers on MCMC (Markov chain Monte Carlo) methods recommended running many chains of samples and then averaging the final values from each chain; this was to ensure independence of the samples. That philosophy seems to have changed to a recommendation of one single long chain. For animal breeding applications this could mean 100,000 samples or more. If a month is needed to run 50,000 samples, then perhaps three chains of 50,000 would be preferable; if only an hour is needed for 50,000 samples, then 1,000,000 samples would not be difficult to run. The main conclusion is that it is very difficult to know if or when a Gibbs chain has converged to the joint posterior distribution; this is currently an active area of statistical research.
Larry Schaeffer