This LaTeX document is available as PostScript or as Adobe PDF.

Random Regression Models
L. R. Schaeffer, March 1999
Updated March 2000

All biological creatures grow and perform over their lifetime. Traits that are measured at various times during that life are known as longitudinal data. Examples are body weights, body lengths, milk production, feed intake, fat deposition, and egg production. On a biological basis there could be different genes that turn on or turn off as an animal ages causing changes in physiology and performance. Also, an animal's age can be recorded in years, months, weeks, days, hours, minutes, or seconds, so that, in effect, there could be a continuum or continuous range of points in time when an animal could be observed for a trait. These traits have also been called infinitely dimensional traits.

Take body weight as an example, as given in the table below on gilts.

          Days on Test
Animal      10     20     30     40     50    100
1           42     53     60     72     83    140
2           30     50     58     68     76    122
3           38     44     51     60     70    106
SD         1.6    3.7    3.9    5.0    5.3   13.9

The differences among the three animals increase with days on test as the gilts become heavier. As the mean weight increases, so also the standard deviation of weights increases. The weights over time could be modeled as a mean plus covariates of days on test and days on test squared. Depending on the species and trait, perhaps a cubic or spline function would fit the data better. The point is that the means can be fit by a linear model with a certain number of parameters.

Covariance Functions are a way of modeling the variances and covariances of the weights over time. This model could be different from the model for the actual weights. A random regression model essentially combines both functions into one model. Covariance functions can be written as random regressions where the covariates are standardized expressions of time that go from -1 to +1.

Multiple Trait Approach

The data presented in the previous table have typically been analyzed such that the weights at each day on test are different traits. If t is the day on test, i.e. 10, 20, 30, 40, 50, or 100, then a model for any one of them could be

which is just a simple, single record, animal model. Analyses are usually done so that the genetic and residual variances and covariances are estimated among the six weights. Suppose that an estimate of the genetic variances and covariances was

Suppose the covariance between weights on day 20 and day 70 was needed. The above matrix does not contain that value, but one might be able to get an estimate by interpolation. A solution offered by Kirkpatrick et al. (1991) is to use covariance functions. A covariance function (CF) is a way of modelling variances and covariances of a longitudinal trait. The elements of the genetic covariance matrix are written as a function of time,

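where k is the dimension of the covariance matrix (k=6 here), and qm are standardized time values between -1 and +1. Let tmin=10 and tmax=100 for this example, then for each day on test ti calculate qi as

\[ q_i = -1 + 2\left(\frac{t_i - t_{min}}{t_{max} - t_{min}}\right) \]
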
which gives

ti      10       20       30       40       50      100
qi      -1   -0.778   -0.556   -0.333   -0.111       +1
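As a quick check, the standardization can be sketched in Python. The function name `standardize_time` is an assumption made for illustration; the formula is simply the linear map that sends tmin to -1 and tmax to +1, which reproduces the tabled values.

```python
def standardize_time(t, t_min, t_max):
    """Map t in [t_min, t_max] linearly onto [-1, +1]."""
    return -1.0 + 2.0 * (t - t_min) / (t_max - t_min)


days_on_test = [10, 20, 30, 40, 50, 100]
q = [round(standardize_time(t, 10, 100), 3) for t in days_on_test]
print(q)
```

Any day within the range can be standardized the same way, for example day 70 when a covariance with day 20 is wanted.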

The quantity hij is an element of a matrix , and is an element of another matrix, . The function, , is known as a Legendre polynomial. To calculate the Legendre polynomials, define

then, in general,

These quantities are "normalized" to give

This gives the following series,

and so on. The first six can be put into a matrix, , as

Now define another matrix, , as a matrix containing the polynomials of the six standardized time values. That is,

This gives

which can be used to specify the elements of as

Note that , , and are matrices defined by the Legendre polynomial functions and by the standardized time values and do not depend on the data or values in the matrix . Therefore it is possible to estimate either or ,

and

By using and there is no need for the use of Legendre polynomials, but rather just functions of the standardized time values. Why Legendre polynomials? Because they are defined over the range of -1 to +1, and they are orthogonal. However, there are other kinds of orthogonal polynomials defined over the same range. The Legendre polynomials are probably the easiest to calculate.
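The construction described above can be sketched with NumPy. This is an illustration under stated assumptions, not the author's program: the names `Lam` (coefficients of the normalized polynomials) and `Phi` (polynomials evaluated at the standardized times) are chosen here, and the normalization assumed is the usual one, where the Legendre polynomial of degree n is multiplied by the square root of (2n+1)/2.

```python
import numpy as np
from numpy.polynomial import legendre


def normalized_legendre_coefficients(k):
    """Return a k x k matrix whose column n holds the power-series
    coefficients (constant term first) of the normalized Legendre
    polynomial of degree n."""
    lam = np.zeros((k, k))
    for n in range(k):
        unit = np.zeros(n + 1)
        unit[n] = 1.0  # selects P_n in a Legendre series
        power_coefs = legendre.leg2poly(unit)
        lam[: n + 1, n] = np.sqrt((2 * n + 1) / 2.0) * power_coefs
    return lam


def legendre_covariates(q, k):
    """Rows are standardized times, columns are normalized polynomials."""
    powers = np.vander(np.asarray(q, dtype=float), N=k, increasing=True)
    return powers @ normalized_legendre_coefficients(k)


days = np.array([10, 20, 30, 40, 50, 100], dtype=float)
q = -1.0 + 2.0 * (days - 10.0) / (100.0 - 10.0)
Lam = normalized_legendre_coefficients(6)
Phi = legendre_covariates(q, 6)
print(np.round(Lam[:, :3], 4))  # first three normalized polynomials
```

The first three columns correspond to .7071, 1.2247q, and -.7906 + 2.3717q^2. Because `Phi` is the product of a matrix of powers of q and `Lam`, the same covariance function can be expressed either through the polynomials or directly through powers of the standardized times.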

Either or can be used to calculate the covariance between any two days on test between 10 and 100 days. Suppose we want the covariance between days 20 and 70. Using we need a row of for day 70, and the one for day 20. These are

Then the variances and covariance are

Note that the variance for day 70 seems artificially high. This is likely because there were no estimates of variances and covariances close to day 70. Also, it is unlikely that the covariance between weights at two ages would be negative. This situation shows the need to have time periods evenly represented throughout the range that may be important.
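A full-order covariance function can be sketched as follows. The symbols `G` (covariance matrix among the measured days), `Phi` (normalized Legendre polynomials at those days), and `H` (coefficient matrix, with G = Phi H Phi') are assumed names, and `G_example` is a made-up 3 by 3 matrix used only so that the code runs; it is not the matrix of the example in the text.

```python
import numpy as np
from numpy.polynomial import legendre


def phi_rows(days, t_min, t_max, k):
    """Normalized Legendre polynomials of order k at the given days."""
    q = -1.0 + 2.0 * (np.asarray(days, dtype=float) - t_min) / (t_max - t_min)
    # legvander column n holds the unnormalized P_n(q); rescale each column
    return legendre.legvander(q, k - 1) * np.sqrt((2 * np.arange(k) + 1) / 2.0)


def full_fit_coefficients(G, days, t_min, t_max):
    """Solve G = Phi H Phi' for H when Phi is square and invertible."""
    Phi_inv = np.linalg.inv(phi_rows(days, t_min, t_max, G.shape[0]))
    return Phi_inv @ G @ Phi_inv.T


def cf_covariance(H, day_a, day_b, t_min, t_max):
    """Covariance between two days implied by the covariance function."""
    rows = phi_rows([day_a, day_b], t_min, t_max, H.shape[0])
    return float(rows[0] @ H @ rows[1])


# Hypothetical covariance matrix for days 10, 50, and 100 (illustration only).
G_example = np.array([[4.0, 3.0, 2.0],
                      [3.0, 9.0, 6.0],
                      [2.0, 6.0, 16.0]])
H = full_fit_coefficients(G_example, [10, 50, 100], 10, 100)
print(cf_covariance(H, 20, 70, 10, 100))
```

At the measured days the function returns the original elements exactly; between them it interpolates, which is where sparse coverage of the age range can give implausible values.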

Using we need rows of , which can be obtained by

then the variances and covariances are

Reduced Orders of Fit

Although the order of in the previous example was six and polynomials of standardized ages to the fifth power were used to derive covariance functions, it could be that only squared or cubed ages are needed to adequately describe the elements of . Thus, we want to perform a reduced order of fit. That is, in

we seek an that is rectangular, and a that has smaller order. For with k rows and m<k columns, then

and pre- and post- multiply by the inverse of to determine ,

To illustrate, let m=3, then

and

Also,

The matrix is then

What order of reduced fit is sufficient to explain the variances and covariances in ? Kirkpatrick et al.(1990) suggest looking at the eigenvalues of the matrix from a full rank fit. Below are the values. The sum of all the eigenvalues was 3,865.6681, and also shown is the percentage of that total. Also included are the eigenvalues of and their percentages.

Eigenvalue    Percentage    Eigenvalue     Percentage
3826.5161     .9899         87538.3830     .9859
  22.6809     .0059          1147.1364     .0129
  11.2688     .0029            56.6293     .0006
   3.8368     .0010            26.9076     .0003
   1.3291     .0003            18.1450     .0002
    .0364     .0000              .1774     .0000

Both matrices indicate that the majority of change in elements in is explained by a constant, and perhaps a little by a linear increment. Both suggest that a quadratic function of the polynomials is probably sufficient. Is there a way to statistically test the reduced orders of fit to determine which is sufficient? A goodness of fit statistic is where

and is a vector of the half-stored elements of the matrix , i.e.,

A half-stored matrix of order k has k(k+1)/2 elements. For k=6 there are 21 values. Likewise, is a vector of half stored elements of the matrix . Although this matrix also has 21 values, because has only m<k columns, the number of independent values is m(m+1)/2. For m=3 this number is 6.
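A reduced order of fit and the half-stored vectors can be sketched as below. Several things are assumptions: the names `Phi`, `H`, and `G`, the row-wise upper-triangle ordering of the half-stored vector, and the use of an unweighted sum of squared residuals as the goodness of fit statistic. The matrix `H_true` is made up so that the code is runnable; it does not come from the text.

```python
import numpy as np
from numpy.polynomial import legendre


def phi_reduced(q, m):
    """Normalized Legendre polynomials of order m (k rows, m columns)."""
    q = np.asarray(q, dtype=float)
    return legendre.legvander(q, m - 1) * np.sqrt((2 * np.arange(m) + 1) / 2.0)


def reduced_fit(G, q, m):
    """Least-squares coefficient matrix of order m and the fitted G."""
    Phi = phi_reduced(q, m)
    P = np.linalg.solve(Phi.T @ Phi, Phi.T)  # (Phi'Phi)^{-1} Phi'
    H = P @ G @ P.T
    return H, Phi @ H @ Phi.T


def half_stored(S):
    """Upper triangle of a symmetric matrix, row by row."""
    rows, cols = np.triu_indices(S.shape[0])
    return S[rows, cols]


q = -1.0 + 2.0 * (np.array([10, 20, 30, 40, 50, 100.0]) - 10.0) / 90.0
H_true = np.array([[30.0, 5.0, 1.0],   # hypothetical coefficients
                   [5.0, 4.0, 0.5],
                   [1.0, 0.5, 1.0]])
G = phi_reduced(q, 3) @ H_true @ phi_reduced(q, 3).T
H_fit, G_fit = reduced_fit(G, q, 3)
e = half_stored(G) - half_stored(G_fit)
goodness_of_fit = float(e @ e)
df = 6 * 7 // 2 - 3 * 4 // 2
print(len(half_stored(G)), df, goodness_of_fit)
```

Because this `G` was built from an order 3 function, the reduced fit recovers it and the residuals are essentially zero; with an estimated matrix the residuals measure what the lower order fails to explain.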

The test statistic, , has a Chi-square distribution with k(k+1)/2 - m(m+1)/2 degrees of freedom. In the example with m=3,

and the residuals (differences from the original ) are

so that the goodness of fit statistic is

with 21-6=15 degrees of freedom.

Is a fit of order 3 poorer than a fit of order 5? An F-statistic is possible by taking the difference in the goodness of fit statistics, divided by an estimate of the residual variance. The residual variance is estimated from a fit of order k-1 or in this case of order 5. The goodness of fit statistic for order 5 was 7.8490 with 21-15=6 degrees of freedom. Hence the residual variance is

The F-statistic to test if a fit of order 3 is different from a fit of order 5 is

with (9,6) degrees of freedom. The tabled F-value at the P=.05 level is 4.10. Thus, the difference is significant, and a fit of order 5 is better than a fit of order 3.
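The arithmetic of this comparison can be sketched as follows. The function names are assumptions, and so is the division of the difference by its own degrees of freedom, which a conventional F ratio requires but which the text does not spell out. The value passed as `gof_reduced` is a placeholder, not the statistic from the example.

```python
def residual_variance(gof_full, df_full):
    """Residual variance estimated from the fuller order of fit."""
    return gof_full / df_full


def f_statistic(gof_reduced, df_reduced, gof_full, df_full):
    """F ratio and its degrees of freedom for reduced versus fuller fit."""
    numerator_df = df_reduced - df_full
    numerator = (gof_reduced - gof_full) / numerator_df
    f_value = numerator / residual_variance(gof_full, df_full)
    return f_value, (numerator_df, df_full)


# Order 5 fit from the text: statistic 7.8490 with 21 - 15 = 6 degrees of freedom.
sigma2 = residual_variance(7.8490, 6)

# Order 3 has 21 - 6 = 15 degrees of freedom; 100.0 is a placeholder statistic.
f_value, dfs = f_statistic(100.0, 15, 7.8490, 6)
print(round(sigma2, 4), dfs)
```

The resulting value is compared with the tabled F for (9,6) degrees of freedom.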

CF to RRM

How are covariance functions and a random regression model related to each other? If we know the lower and upper age range for a trait, tmin and tmax, then we know that the genetic variance for any particular age, i, within the age range is given by the covariance function, and implies that there is a different true genetic value of an animal for every age within the age range. For the ith age of animal j, the true genetic value is

where is a vector of regression coefficients for the jth animal, and is a vector of polynomials of standardized age i. The variance of is assumed to be

then

Now notice that the vector of regression coefficients is the same for animal j regardless of the age at which the observation is taken. Using this definition we can write the model equation for an observation taken at age i on animal j as

This is the basic starting point for a random regression model. The variance-covariance matrix of true genetic values follows the correct variances and covariances through the assumed time range, provided that the order of fit for the covariance function is sufficient to define this covariance structure adequately.

Notice that the residual effects may also have a different variance for each age. It is logical to assume that eji can be split into parts, i.e.,

\[ e_{ji} = p_{ji} + te, \]

where pji also follows a covariance function (different from that of aji) assumed to relate environmental effects between observations on the same animal, and where te is a temporary environmental effect which is independent of animal and age and other te effects within and between animals. The pji factor is the same as a permanent environmental effect (repeated records on the same animal) for animal j, but which varies with the age i of the animal.

To write an overall model for several observations per animal at different ages, then

where
must contain a different mean for each discrete age within the pre-defined age range,
are regression coefficients (random) for each animal representing the additive genetic contribution to the trait,
are regression coefficients (random) for each animal representing the permanent environmental effects,
is a vector of temporary environmental effects peculiar to each observation independent of age and animal,
is the design matrix relating observations to the respective age means,
is a matrix of standardized age covariates associated with each record and linked to the appropriate animals, and
is a vector of longitudinal data on animals.
and
where the dimensions of and are equal to the order of fit of the covariance functions. Usually the order of fit is the same for both effects, but theoretically, the genetic and permanent environmental effects might require different orders of fit.
which allows for the possibility that the temporary residual variance could change throughout the age range of the data. This would also pick up any 'leftover' effects from the random regressions if the orders of fit were not sufficient to define the correct covariance structures over the age range.
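For orientation, the overall model described by the definitions above can be written in matrix notation. The symbol names below are assumptions, since the original symbols are not shown; they follow common usage for random regression models, with coefficients ordered within animal.

```latex
\begin{eqnarray*}
\mathbf{y} & = & \mathbf{X}\mathbf{b} + \mathbf{Z}_1\mathbf{a} + \mathbf{Z}_2\mathbf{p} + \mathbf{e}, \\
Var(\mathbf{a}) & = & \mathbf{A} \otimes \mathbf{G}, \\
Var(\mathbf{p}) & = & \mathbf{I} \otimes \mathbf{P}, \\
Var(\mathbf{e}) & = & \mathbf{R} = diag(\sigma^2_{e_i}).
\end{eqnarray*}
```

Here b holds the age means, a and p hold the genetic and permanent environmental regression coefficients, A is the additive relationship matrix, G and P are the covariance matrices of the coefficients (of order equal to the order of fit), and the diagonal R lets the temporary residual variance change with age.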

Depending on the trait, there could be other fixed and random factors in the model. For example, contemporary group effects would be defined as those animals that were housed in the same location and measured on the same day by the same person. There could also be other effects like diet effects that could influence a trait on a given day which might differ from animal to animal and from one measurement to the next for the same animal.

The fixed age means in the model do not assume any particular shape of the trait as it changes with age. However, the age range may be large and thus, there would be a large number of discrete age group effects in the model, and possibly there may not be enough data to estimate the means accurately. Therefore, ages may need to be grouped to give fewer age groups overall. Alternatively, one might assume that a mathematical function of age fits the age means, and therefore only a few parameters need to be estimated. If possible, it is likely better to not assume any mathematical function and to have as many age groups as practically possible.

Simulation of Data

Below are the data structure and pedigrees of four dairy cows. Given is the age at which they were observed for stature during four visits to this one herd.

                   Age at Classification
Cow   Sire   Dam   Visit 1   Visit 2   Visit 3   Visit 4
1     7      5     22        34        47        -
2     7      6     30        42        55        66
3     8      5     28        40        -         -
4     8      1     -         20        33        44

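The model equation will be

\[ y_{jik} = b_0 + b_1 A + b_2 A^2 + CG_j + (a_{i0} + a_{i1} A + a_{i2} A^2) + (p_{i0} + p_{i1} A + p_{i2} A^2) + e_{jik} \]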

where
CGj is a random contemporary group effect which is assumed to follow a normal distribution with mean 0 and variance,
b0, b1, and b2 are fixed regression coefficients on age (A) and age squared, which describe the general relationship between age and stature. The assumed true values for these parameters will be b0=.5, b1=1.2, and b2=-.01, and in this example the ages are not standardized to the range -1 to +1.
ai0, ai1, and ai2 are random regression coefficients for animal i additive genetic effects, assumed to follow a multivariate normal distribution with mean vector null and variance-covariance matrix, , equal to

pi0, pi1, and pi2 are random regression coefficients for animal i permanent environmental effects, assumed to follow a multivariate normal distribution with mean vector null and variance-covariance matrix, , equal to

and ejik is a temporary residual error term assumed to follow a normal distribution with mean 0 and variance,

The simulation process begins by

Step 1.
Generate four contemporary group (visit) effects using a random normal deviate generator and ,

Step 2.
Generate the fixed effects, , where

Step 3.
Compute the Cholesky decomposition of as

For each animal, three values will be generated which define the random regression coefficients for the genetic effects. Animals 5, 6, 7, and 8 are assumed to be base population animals, unrelated to each other. Generate 3 random normal deviates into a vector for each animal and premultiply by the Cholesky factor. The values obtained were

Animal      a0        a1         a2
5         -6.08     .2225    -.002131
6          4.54    -.0822     .000439
7         19.35    -.5466     .005114
8         -4.99     .2624    -.002695

For animal 1 who has parents 7 and 5, first average the random regression coefficients of the parents,

Add to this a Mendelian sampling deviation computed from another vector of new random normal deviates. Animals 2, 3, and 4 are done in a similar manner (we do not need to worry about inbreeding here). The other true genetic values are

Animal      a0        a1         a2
1         -5.15     .3249    -.003547
2         13.47    -.1286     .000704
3         -6.25     .2839    -.002687
4         -6.84     .2415    -.002291
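Step 3 can be sketched in Python as follows. The covariance matrix `G_HYPOTHETICAL` is made up for illustration and is not the matrix used in the text, so the numbers produced will not match the tables. The sketch also assumes non-inbred parents, so that the Mendelian sampling term has half the base covariance.

```python
import numpy as np

# Hypothetical covariance matrix of the three genetic regression coefficients.
G_HYPOTHETICAL = np.array([[4.0, 1.0, 0.5],
                           [1.0, 2.0, 0.3],
                           [0.5, 0.3, 1.0]])


def simulate_coefficients(pedigree, G, rng):
    """pedigree: list of (animal, sire, dam) with parents listed before
    progeny and None for unknown parents. Returns {animal: coefficients}."""
    L = np.linalg.cholesky(G)  # G = L L'
    coefs = {}
    for animal, sire, dam in pedigree:
        w = rng.standard_normal(G.shape[0])
        if sire is None and dam is None:
            coefs[animal] = L @ w  # base population animal
        else:
            parent_average = 0.5 * (coefs[sire] + coefs[dam])
            # Mendelian sampling with covariance 0.5 * G (no inbreeding)
            coefs[animal] = parent_average + np.sqrt(0.5) * (L @ w)
    return coefs


pedigree = [(5, None, None), (6, None, None), (7, None, None), (8, None, None),
            (1, 7, 5), (2, 7, 6), (3, 8, 5), (4, 8, 1)]
rng = np.random.default_rng(1999)
a_true = simulate_coefficients(pedigree, G_HYPOTHETICAL, rng)
print(np.round(a_true[1], 3))
```

Animal 4 is processed last because its dam, animal 1, must already have coefficients. The same routine with an identity pedigree structure would serve for the permanent environmental coefficients of Step 4.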

Step 4.
Compute the Cholesky decomposition of as

For animals with observations, i.e. animals 1, 2, 3, and 4, generate random regressions for permanent environmental effects by generating a 3 by 1 vector of random normal deviates for each animal and premultiplying by the Cholesky factor. The results are

Animal      p0        p1         p2
1          3.67    -.1480     .001328
2          7.59    -.2352     .002204
3         -2.97    -.0034     .000386
4          3.26    -.0935     .000878

Step 5.
Generate individual observations. Just follow the model equation,

where has already been created, , , and have been created, and

The matrix is of order 12 by 12.

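Next create 12 random temporary residual effects by generating 12 random normal deviates and multiplying each by the residual standard deviation. The resulting observations, rounded to the nearest whole number, are given in the following table together with the components that were summed to produce them. The visit, cow, and age of each row follow from the fixed-effect column and the table of ages.

Visit  Cow  Age   Obs.   Fixed     CG        Genetic    PE        Residual
1      1    22    24     22.06     2.5233     .2811     1.0567    -2.2206
1      2    30    44     27.50     2.5233   10.2456     2.5176     1.1294
1      3    28    24     26.26     2.5233    -.4074    -2.7626    -1.3047
2      1    34    36     29.74     2.1462    1.7963      .1732     1.7420
2      2    42    47     33.26     2.1462    9.3107     1.5995     1.0237
2      3    40    42     32.50     2.1462     .8068    -2.4884     8.6908
2      4    20    20     20.50     2.1462   -2.9264     1.7412    -1.1633
3      1    47    39     34.81    -2.0483    2.2850     -.3524     4.2048
3      2    55    41     36.25    -2.0483    8.5266     1.3211    -3.0580
3      4    33    34     29.21    -2.0483   -1.3654     1.1306     6.8494
4      2    66    44     36.14    -2.4207    8.0490     1.6674      .9080
4      4    44    28     33.94    -2.4207    -.6494      .8458    -3.3007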
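Step 5 can be checked with a short script that rebuilds each observation from the values already generated in Steps 1 to 4. All numbers are copied from the text; the assignment of each row to a visit, cow, and age is inferred from the fixed-effect values (for example, .5 + 1.2(22) - .01(22)^2 = 22.06), and the variable names are assumptions.

```python
B = (0.5, 1.2, -0.01)  # fixed regression coefficients b0, b1, b2
CG = {1: 2.5233, 2: 2.1462, 3: -2.0483, 4: -2.4207}  # visit effects
A_COEF = {1: (-5.15, .3249, -.003547), 2: (13.47, -.1286, .000704),
          3: (-6.25, .2839, -.002687), 4: (-6.84, .2415, -.002291)}
P_COEF = {1: (3.67, -.1480, .001328), 2: (7.59, -.2352, .002204),
          3: (-2.97, -.0034, .000386), 4: (3.26, -.0935, .000878)}
# (visit, cow, age, temporary residual)
RECORDS = [(1, 1, 22, -2.2206), (1, 2, 30, 1.1294), (1, 3, 28, -1.3047),
           (2, 1, 34, 1.7420), (2, 2, 42, 1.0237), (2, 3, 40, 8.6908),
           (2, 4, 20, -1.1633), (3, 1, 47, 4.2048), (3, 2, 55, -3.0580),
           (3, 4, 33, 6.8494), (4, 2, 66, .9080), (4, 4, 44, -3.3007)]


def quadratic(coef, age):
    """c0 + c1 * age + c2 * age^2 with unstandardized age."""
    return coef[0] + coef[1] * age + coef[2] * age ** 2


def observation(visit, cow, age, residual):
    """Sum of fixed curve, visit effect, genetic and PE curves, residual."""
    return (quadratic(B, age) + CG[visit] + quadratic(A_COEF[cow], age)
            + quadratic(P_COEF[cow], age) + residual)


y = [int(round(observation(*record))) for record in RECORDS]
print(y)
```

The printed list should agree with the rounded observations in the table, which confirms how the components are combined.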

MME

The mixed model equations that need to be constructed to provide estimated breeding values are as follows:

The entire MME cannot be presented, but parts of the MME are given below.

and are composed of the following four blocks of order 3, for the four animals with records;

The right hand sides of the MME are

and

The solutions to MME are

Let the solutions for the animal additive genetic random regression coefficients be presented in tabular form as follows.

Animal       a0           a1           a2
1        -1.720812     .039069     -.000334
2         9.908001    -.286581      .002556
3        -8.723928     .245155     -.002166
4        -2.972874     .108809     -.001022
5        -5.215457     .139228     -.001218
6         5.243107    -.150764      .001344
7         4.086682    -.120872      .001080
8        -4.114334     .132408     -.001206

Similarly, the solutions for the animal permanent environmental random regression coefficients can be given in tabular form.

Animal       p0           p1           p2
1         -.761270     .012384     -.000093
2         3.536084    -.101685      .000906
3        -2.737505     .073743     -.000643
4         -.037309     .015559     -.000170

From the table of additive genetic solutions, it becomes a problem to decide how to rank the animals. If animals are ranked on the basis of a0, then animal 2 would be the highest (if that was desirable). If ranked on the basis of a1, then animal 3 would be the highest, and if ranked on the basis of a2, then animal 2 would be the highest. To properly rank the animals we need to calculate an EBV for stature at different ages, and then combine these with the appropriate economic weights. Suppose we calculate EBVs for 24, 36, and 48 mo of age. Suppose also that the economic weights were 2, 1, and .5, respectively, for the three EBVs, so that a Total Economic Value can be calculated as

\[ TEV = 2\,EBV(24) + 1\,EBV(36) + .5\,EBV(48) \]

The results are shown in the following table.

Animal   EBV(24)   EBV(36)   EBV(48)     TEV
1         -.98      -.75      -.61      -3.00
2         4.50      2.90      2.04      12.93
3        -4.09     -2.71     -1.95     -11.85
4         -.95      -.38      -.10      -2.33
5        -2.58     -1.78     -1.34      -7.60
6         2.40      1.56      1.10       6.91
7         1.81      1.13       .77       5.14
8        -1.63      -.91      -.54      -4.44

The animal with the highest TEV was animal 2. It is interesting to compare animals 1 and 8. Animal 8 was lower than animal 1 for EBV(24) and EBV(36), but not for EBV(48), so that it is evident that growth patterns in these two animals are genetically different. This is a strength of a random regression model analysis, in that different patterns of growth at different ages can be spotted readily. Rankings of animals change with age. Thus, it is possible to change the pattern of growth to one that is desirable.
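The EBV and TEV calculations can be sketched as follows, using the additive genetic solutions from the table. The sketch assumes that each EBV is the quadratic a0 + a1(age) + a2(age)^2 evaluated at the unstandardized age in months, which agrees with the tabled results up to rounding of the printed solutions; the names are chosen for illustration.

```python
SOLUTIONS = {  # animal: (a0, a1, a2)
    1: (-1.720812, .039069, -.000334), 2: (9.908001, -.286581, .002556),
    3: (-8.723928, .245155, -.002166), 4: (-2.972874, .108809, -.001022),
    5: (-5.215457, .139228, -.001218), 6: (5.243107, -.150764, .001344),
    7: (4.086682, -.120872, .001080), 8: (-4.114334, .132408, -.001206),
}
ECONOMIC_WEIGHTS = {24: 2.0, 36: 1.0, 48: 0.5}  # age in months: weight


def ebv(coef, age):
    """Estimated breeding value for stature at a given age."""
    return coef[0] + coef[1] * age + coef[2] * age ** 2


def total_economic_value(coef):
    """Weighted sum of the EBVs at 24, 36, and 48 months."""
    return sum(w * ebv(coef, age) for age, w in ECONOMIC_WEIGHTS.items())


tev = {animal: total_economic_value(c) for animal, c in SOLUTIONS.items()}
ranking = sorted(tev, key=tev.get, reverse=True)
print(ranking[0], round(tev[ranking[0]], 2))
```

Changing the ages or the economic weights changes the ranking, which is the point of evaluating the curve rather than any single coefficient.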

REML

Estimation of variances and covariances by EM REML is a little more complicated than usual in this model. The residual variance is still estimated as

where

Let the additive genetic random regression coefficients in a previous table be called the matrix of order 8 by 3, and let the table of permanent environmental random regression coefficients be called of order 4 by 3. The inverse of the MME can be represented as

The matrix is of order 24 with 3 rows and columns per animal. For EM REML we need the inverse elements of the MME, for the additive genetic effects, and for the permanent environmental effects. To see this better, partition into the elements corresponding to each animal as

where each submatrix is of order 3 by 3. Now we need to treat each submatrix of as a single entity (like a scalar) and calculate the trace of , which in this case would result in a matrix of order 3 by 3. That is,

Also needed is

Then an estimate of is

For the permanent environmental variance-covariance matrix, is of order 12 with 3 rows and columns for each of the four animals with records, then

and therefore,

As usual, the calculations are iterative and must be repeated until convergence is reached. All results are about twice as large as the original values, possibly due to the particular sample that was generated.
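The block-wise trace used in the EM REML step can be sketched with NumPy. Everything named here is an assumption: `a_inv` stands for the inverse of the relationship matrix (an identity matrix for the permanent environmental effects), `c_block` for the matching block of the inverse of the MME, taken to be already on the variance scale, and `coef` for the table of solutions with one row per animal. The small matrices in the example are made up so that the code runs.

```python
import numpy as np


def block_trace(a_inv, c_block, m):
    """Trace of a_inv times c_block with each m x m submatrix of c_block
    treated as a single entity; the result is an m x m matrix."""
    q = a_inv.shape[0]
    blocks = c_block.reshape(q, m, q, m)  # blocks[i, :, j, :] is C_ij
    # sum over i and j of a_inv[i, j] * C_ji
    return np.einsum("ij,jaib->ab", a_inv, blocks)


def em_update(coef, a_inv, c_block):
    """One EM REML update of the m x m covariance matrix of coefficients."""
    q, m = coef.shape
    quadratic = coef.T @ a_inv @ coef
    return (quadratic + block_trace(a_inv, c_block, m)) / q


# Made-up inputs: 2 animals with 2 coefficients each.
a_inv = np.array([[2.0, -1.0], [-1.0, 2.0]])
c_block = 0.5 * np.eye(4) + 0.1
coef = np.array([[1.0, 0.5], [-1.0, 0.2]])
G_new = em_update(coef, a_inv, c_block)
print(np.round(G_new, 4))
```

In the example of the text the genetic update would use 8 animals with 3 coefficients each, and the permanent environmental update 4 animals with an identity matrix in place of `a_inv`.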

Bayesian Approach - Gibbs Sampling

The conditional distributions from the joint posterior distribution for the solutions to the MME are all normal distributions, and each is sampled separately. Nothing is different from the previous models in this respect. The conditional distributions for

is a scaled inverted Chi-square distribution with hyperparameters ve and S2e,

for the animal additive genetic values has an inverted Wishart distribution,
• Form where S2a is a matrix of order 3 by 3,
• Invert ,
• Decompose into where is a lower triangular matrix, and use as input to
• Generate a new with a Wishart variate generator,
        CALL WISHRT(L,TI,ndf)

where ndf is va+q,
and
for the animal permanent environmental effects also has an inverted Wishart distribution,
• Form where S2p is a matrix of order 3 by 3,
• Invert ,
• Decompose into where is another lower triangular matrix,
• Generate a new with a Wishart variate generator using as input to the routine.
Continue sampling in this manner.
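The sampling steps for the variances can be sketched with NumPy in place of the WISHRT routine. This is a sketch under assumptions, not the routine used in the text: the Wishart variate is built as a sum of outer products (which needs an integer number of degrees of freedom at least as large as the matrix order), the new covariance matrix is taken as the inverse of that variate, and the input values are placeholders.

```python
import numpy as np


def sample_covariance_matrix(coef, a_inv, prior_df, prior_scale, rng):
    """Draw a covariance matrix from its inverted Wishart conditional."""
    q, m = coef.shape
    T = coef.T @ a_inv @ coef + prior_df * prior_scale  # form the sum
    L = np.linalg.cholesky(np.linalg.inv(T))            # invert, decompose
    ndf = prior_df + q
    X = L @ rng.standard_normal((m, ndf))
    wishart_draw = X @ X.T  # Wishart variate with ndf degrees of freedom
    return np.linalg.inv(wishart_draw)


def sample_residual_variance(residuals, prior_df, prior_scale, rng):
    """Draw the residual variance from a scaled inverted Chi-square."""
    ss = float(residuals @ residuals) + prior_df * prior_scale
    return ss / rng.chisquare(len(residuals) + prior_df)


rng = np.random.default_rng(2000)
coef = rng.standard_normal((8, 3))  # placeholder for sampled coefficients
G_draw = sample_covariance_matrix(coef, np.eye(8), prior_df=4,
                                  prior_scale=np.eye(3), rng=rng)
residuals = rng.standard_normal(12)  # placeholder for sampled residuals
sigma2_draw = sample_residual_variance(residuals, prior_df=4,
                                       prior_scale=1.0, rng=rng)
print(np.round(G_draw, 3), round(sigma2_draw, 3))
```

For the permanent environmental effects the same function applies with an identity matrix for `a_inv` and the four animals with records.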


Larry Schaeffer
2000-03-21