next up previous


This LaTeX document is available as postscript or asAdobe PDF.

COMPUTING IDEAS
B. W. Kennedy, 1993

Increasing use is being made of animal models in animal breeding. For example, animal models are being applied routinely for genetic evaluation of pigs in Canada and beef cattle in the United States. Animal models for joint bull and cow evaluation of dairy cattle in both the U.S. and Canada have been developed and implemented. Also, animal models are now being used for estimation of genetic parameters.

One of the impediments to application of animal models has been their computational difficulty. They can be very demanding. However, development of efficient computing strategies and increased computing power at reasonable costs are making the application of animal models to livestock improvement increasingly more practical. In this section we examine briefly some computing strategies for applications of animal models and examine some actual applications to genetic improvement in field populations.

Prediction of Breeding Values

In prediction of breeding values from data from field populations, the number of equations is usually too large to permit an explicit solution and solutions are obtained by iterative procedures. Gauss-Seidel or a modification to it is frequently used. The usual method is to form the equations for the coefficient matrix on an external storage device, such as disk or tape, and read the coefficient matrix for each cycle of iteration. Usually only non-zero elements of the coefficient matrix are stored. Frequently the order of the equations has been reduced through absorption of some effects or through use of a reduced animal model. These procedures, although efficient in terms of use of computer memory, are time consuming because they must access external storage to read the coefficient matrix with each cycle of iteration.

An alternative computing strategy was described by Schaeffer and Kennedy (1986, 3rd World Cong. Genet. Appl. Livest. Prod. XII:382) which requires reading the data and a pedigree file with each cycle of iteration. In most applications, this requires fewer read operations per cycle of iteration and as a result less computer time for solving the equations. The animal model coefficient matrix equations are not set up explicitly, but the structure of the equations is capitalized upon to adjust observations and breeding value estimates as the data and pedigree files are read. Rather than describe the method in general terms, the method is illustrated here for a specific model and application to pig breeding.

Consider the model

\begin{displaymath}y_{ijk} = \mu + h_{i} + l_{ij} + a_{ijk} + e_{ijk} \end{displaymath}

where yijk is an observation, $\mu$ is the population mean, hiis the effect to the ith herd-year-season, lij is a random effect of the ijth litter $\sim({\bf O},{\bf I}\sigma_{l}^{2})$, aijk is the breeding value of the ijkth pig $\sim({\bf O},{\bf A}\sigma_{a}^{2})$ and eijk is the environmental variance $\sim({\bf O},{\bf I}\sigma_{e}^{2})$. The litter term is an environmental covariance among littermates or c2. Let h2 = .2 and c2 = .2 or $\lambda = \sigma_{e}^{2}/
\sigma_{a}^{2} = 3$ and $\gamma = \sigma_{e}^{2}/\sigma_{l}^{2} = 3$.

We have the following data

Herd-Year-Season Litter Animal Sire Dam
1 1 6 1 2
1 1 7 1 2
1 1 8 1 2
2 2 9 1 3
2 2 10 1 3
2 3 11 4 5
2 3 12 4 5
The data are sorted by animal within litter within herd-year-season. Assume animals 1, 2, 3, 4 and 5 are unrelated. Within litter, animals are full-sibs. Litters 1 and 2 are paternal half-sibs. To account for ${\bf A}^{-1}$ in the evaluation process, we create a pedigree file as follows:
        Sire or Dam or  
Type H-Y-S Litter Animal Progeny Mate Record
2 0 0 1 0 0 0
3 0 0 1 6 2 0
3 0 0 1 7 2 0
3 0 0 1 8 2 0
3 0 0 1 9 3 0
3 0 0 1 10 3 0
2 0 0 2 0 0 0
3 0 0 2 6 1 0
3 0 0 2 7 1 0
3 0 0 2 8 1 0
2 0 0 3 0 0 0
3 0 0 3 9 1 0
3 0 0 3 10 1 0
2 0 0 4 0 0 0
3 0 0 4 11 5 0
3 0 0 4 12 5 0
2 0 0 5 0 0 0
3 0 0 5 11 4 0
3 0 0 5 12 4 0
1 1 1 6 1 2 150
1 1 1 7 1 2 144
1 1 1 8 1 2 156

continued ...

        Sire or Dam or  
Type H-Y-S Litter Animal Progeny Mate Record
1 2 2 9 1 3 148
1 2 2 10 1 3 145
1 2 3 11 4 5 140
1 2 3 12 4 5 151
The pedigree file is sorted by type within animal. The first column indicates the type of record. Type 1 is for animals with records. In this case columns 5 and 6 identify the sire and dam of the animal. There is one type 1 record for each animal with a phenotypic record. Type 2 is for parents of animals with records who do not have a record themselves (ancestors). In this case columns 5 and 6 are blank. There is one type 2 record for each ancestor. Type 3 records are for offspring of parents. In this case column 4 identifies the parent, column 5 identifies the offspring and column 6 identifies the mate. There are two type 3 records for each offspring, one for its sire and one for its dam. In some applications, type 3 records are not required as this information can be obtained from the type 1 records.

Let $\hat{\bf h}$, $\hat{\bf l}$ and $\hat{\bf a}$ be vectors of solutions for herd-year-seasons, litters and animals, respectively. Initially,

\begin{displaymath}\hat{\bf h} = \hat{\bf l} = \hat{\bf a} = {\bf0}. \end{displaymath}

For each cycle of iteration, herd solutions are computed as the data are read, one herd at a time, as

\begin{displaymath}\hat{h}_{i} = (y_{i..} - \sum_{j=1}^{q_{i}} n_{ij} \hat{l}_{ij} -
\sum_{j}^{q_{i}} \sum_{k}^{n_{ij}} \hat{a}_{ijk})/n_{i} \end{displaymath}

where qi is the number of litters in the ith herd, niis the number of pigs in the ith herd and nij is the number of pigs in the ijth litter. Previous litter and animal solutions are used. For this example, the first cycle solutions for h1 and h2 are

\begin{eqnarray*}\hat{h}_{1} & = & (450 - 0 - 0)/3 = 150 \\
\hat{h}_{2} & = & (584 - 0 - 0)/4 = 146.
\end{eqnarray*}


Litter solutions are computed as

\begin{displaymath}\hat{l}_{ij} = (y_{ij.} - n_{ij}\hat{h}_{i} - \sum_{k=1}^{n_{ij}}
\hat{a}_{ijk})/(n_{ij} + \gamma). \end{displaymath}

Previous herd and animal solutions are used. For this example, the first cycle solutions for l1, l2 and l3 are

\begin{eqnarray*}\hat{l}_{1} & = & [450 - (3)(150) - 0]/(3 + 3) = 0 \\
\hat{l}_...
...3) = .2 \\
\hat{l}_{3} & = & [291 - (2)(146) - 0]/(2 + 3) = -.2
\end{eqnarray*}


Lastly animal solutions are computed from the pedigree file one animal at a time. A diagonal element (dt) and an adjusted right-hand side (yt*) are accumulated with each pedigree file record for the tth animal. First yt* = 0. Then for a type 1 record

\begin{displaymath}y_{t}^{*} = y_{ijk} - \hat{h}_{i} - \hat{l}_{ij} + (\hat{a}_{s} +
\hat{a}_{d})\lambda \end{displaymath}

and

\begin{displaymath}d_{t} = 1 + 2\lambda \end{displaymath}

where $\hat{h}_{i}$ and $\hat{l}_{ij}$ are current herd-year-season and litter solutions and $\hat{a}_{s}$ and $\hat{a}_{d}$ are current solutions for the sire and dam of the animal. For a type 2 record,

yt* = 0

and

\begin{displaymath}d_{t} = \lambda. \end{displaymath}

And for each type 3 record

\begin{displaymath}y_{t}^{*} = y_{t}^{*} + (\hat{a}_{o} - .5 \hat{a}_{m})\lambda \end{displaymath}

and

\begin{displaymath}d_{t} = d_{t} + .5\lambda \end{displaymath}

are accumulated where $\hat{a}_{o}$ and $\hat{a}_{m}$ are the current solutions for the offspring and the mate producing the offspring. When all pedigree records of an animal have been processed

\begin{displaymath}\hat{a}_{t} = y_{t}^{*}/d_{t}. \end{displaymath}

For this example, there is one type 2 record and five type 3 records for a1 in the pedigree file. For a1

y1* = 0

and

\begin{displaymath}d_{1} = \lambda + 5(.5\lambda) = 3.5\lambda = 10.5 \end{displaymath}

for the first cycle of iteration and $\hat{a}_{1} = 0$. For a12, there is one type 1 record and

\begin{eqnarray*}y_{12}^{*} & = & [151 - 146 - (-.2) + (0 + 0)3] = 5.2 \\
d_{12} & = & 1 + 2(3) = 7 \\
\hat{a}_{12} & = & 5.2/7 = .743.
\end{eqnarray*}


After the first cycle, a second cycle of iteration is performed and so on until convergence is attained. Solutions after cycle 1 and at convergence agree as follows:
  Cycle 1 Convergence
$\hat{h}_{1}$ 150 149.96
$\hat{h}_{2}$ 146 146.00
$\hat{l}_{1}$ 0 0
$\hat{l}_{2}$ .2 .154
$\hat{l}_{3}$ -.2 -.154
$\hat{a}_{1}$ 0 .077
$\hat{a}_{2}$ 0 0
$\hat{a}_{3}$ 0 .077
$\hat{a}_{4}$ 0 -.077
$\hat{a}_{5}$ 0 -.077
$\hat{a}_{6}$ 0 .039
$\hat{a}_{7}$ -.857 -.819
$\hat{a}_{8}$ .857 .896
$\hat{a}_{9}$ .257 .330
$\hat{a}_{10}$ -.171 -.099
$\hat{a}_{11}$ -.829 -.901
$\hat{a}_{12}$ .743 .670
Rate of convergence can be increased by forcing ${\bf l'A}^{-1}\hat{\bf a} = 0$at the end of each cycle of iteration.

Schaeffer and Kennedy (1986) compared this procedure of iterating on the data with genetic evaluation through the usual process of storing the equations for the coefficient matrix on disk. A reduced animal model was used for the latter. The data were 86,385 growth and backfat records on pigs. Iterating on the data took only 43% as much computing time as the usual method for 120 cycles of iteration. Further iterating on the data took fewer cycles of iteration to reach given convergence criteria. The procedure can also be applied to reduced animal models, multiple trait evaluation and variance component estimation. Additional details are in Schaeffer and Kennedy (1986) and Schaeffer and Wilton (1987, Mimeo, Univ. Guelph).

Estimation of Variance Components

Variance component estimation under an animal model can be very demanding computationally. For example, obtaining REML estimates of $\sigma_{a}^{2}$ and $\sigma_{e}^{2}$ requires iterating on

\begin{displaymath}\hat{\sigma}_{e}^{2} = ({\bf y'y} - \tilde{\bf b}{\bf 'X'y} -
\hat{\bf a}{\bf Z'y})/[m-r({\bf X})] \end{displaymath}


\begin{displaymath}\hat{\sigma}_{a}^{2} = [\hat{\bf a}{\bf 'A}^{-1}\hat{\bf a} +
\hat{\sigma}_{e}^{2}{\rm tr}({\bf A}^{-1}{\bf C})]/n \end{displaymath}

where m is the number of records and n is the number of animals. Solution requires obtaining

\begin{displaymath}{\bf C} = ({\bf Z'MZ} + {\bf A}^{-1}\lambda)^{-1} \end{displaymath}

where

\begin{displaymath}{\bf M} = [{\bf I} - {\bf X}({\bf X'X})^{-}{\bf X'}]. \end{displaymath}

${\bf C}$ is the order of n and is difficult to compute if n is large, say greater than 500 animals. Similarly, ${\bf C}$ is required to compute MIVQUE estimates of $\sigma_{a}^{2}$ and $\sigma_{e}^{2}$. Therefore use of animal models for variance component estimation has generally been limited to small problems. Also, most conventional variance component estimation programs assume a defined breeding structure (eg. a sire model) and are not readily suited for application to animal models. There are, nonetheless, some computing alternatives to make application of animal models more practical for large data sets.

Use of An Equivalent Model

Most variance component estimation programs assume that the random effects are independent of each other, eg. ${\rm Var}({\bf a}) = {\bf I}
\sigma_{a}^{2}$. Of course with the animal model, ${\rm Var}({\bf a}) =
{\bf A}\sigma_{a}^{2}$. Meyer (1987, J. Anim. Breed. Genet. 104:163) suggested use of a transformation of ${\bf a}$, ${\bf a}^{*}$, such that ${\rm Var}({\bf a}^{*}) = {\bf I}\sigma_{a}^{2}$.

Recall that ${\bf A} = {\bf TWT'}$ where ${\bf W}$ is a diagonal matrix. This can also be expressed as ${\bf A} = {\bf TQQT'}$ where if qi is the ith diagonal element of ${\bf Q}$ and wi is the ith diagonal element of ${\bf W}$, $q_{i} = \sqrt{w_{i}}$ for all i. Let

\begin{displaymath}{\bf L} = {\bf TQ}, \; {\rm i.e.} \; {\bf A} = {\bf LL'}. \end{displaymath}

Now

\begin{eqnarray*}{\rm Var}({\bf y}) & = & {\bf ZAZ'}\sigma_{a}^{2} + {\bf I}\sig...
...2} \\
& = & {\bf ZLL'Z'}\sigma_{a}^{2} + {\bf I}\sigma_{e}^{2}.
\end{eqnarray*}


Define ${\bf Z}^{*} = {\bf ZL}$ and ${\bf a}^{*} = {\bf L}^{-1}{\bf a}$, then

\begin{displaymath}{\bf y} = {\bf Xb} + {\bf Z}^{*}{\bf a}^{*} + {\bf e} \end{displaymath}

is an equivalent model to

\begin{displaymath}{\bf y} = {\bf Xb} + {\bf Za} + {\bf e} \end{displaymath}

with

\begin{eqnarray*}{\rm Var}({\bf a}^{*}) & = & {\rm Var}({\bf L}^{-1}{\bf a}) = {...
...}{\bf LL'}({\bf L}^{-1})'\sigma_{a}^{2} = {\bf I}\sigma_{a}^{2}.
\end{eqnarray*}


We can solve for

\begin{eqnarray*}\hat{\bf a}^{*} & = & ({\bf L'Z'MZL} + {\bf I}\lambda)^{-1}{\bf L'Z'M'y} \\
& = & {\bf C}^{*}{\bf L'Z'My}
\end{eqnarray*}


for

\begin{displaymath}{\bf M} = {\bf I} - {\bf X}({\bf X'X})^{-}{\bf X'}. \end{displaymath}

REML solutions for $\sigma_{e}^{2}$ and $\sigma_{a}^{2}$ are obtained from

\begin{eqnarray*}\hat{\sigma}_{e}^{2} & = & ({\bf y'My} - \hat{\bf a}^{*}{\bf 'L...
...\hat{\bf a}^{*} +
\hat{\sigma}_{e}^{2} {\rm tr}({\bf C}^{*})]/n.
\end{eqnarray*}


Although the order of ${\bf C}^{*}$ is still n, the number of animals, use of this equivalent model may allow one to use a conventional variance component estimation program designed for random effects that are independently and identically distributed.

Reduced Animal Model

Just as use of a gametic or reduced animal model can reduce the order of the coefficient matrix for prediction of breeding values, it can also do so for estimation of variance components. With a reduced animal model, if fixed effects are absorbed, the order of equations to be inverted is reduced to the number of parents.

Illustrations of variance component estimation under reduced animal models by MIVQUE and REML are in Sorensen and Kennedy (1986, J. Anim. Sci. 68:245) and Henderson (1986, J. Dairy Sci. 69:1394).

Avoiding the Inverse of the Coefficient Matrix

The limiting factor to applications of animal models for variance component estimation has been obtaining the inverse of the coefficient matrix of the order of the number of animals or parents. However, recently Grasser et al. (1987, J. Anim. Sci. 64:1362) have presented a derivative-free REML algorithm for animal models that does not require direct inversion. Their procedure is applicable to relatively large data sets.

The model is

\begin{displaymath}{\bf y} = {\bf Xb} + {\bf Za} + {\bf e} \end{displaymath}

with

\begin{displaymath}{\rm Var}({\bf y}) = {\bf ZAZ'}\sigma_{a}^{2} + {\bf I}\sigma_{e}^{2}. \end{displaymath}

The animal model equations are

\begin{displaymath}\left[ \begin{array}{ll}
{\bf X'X} & {\bf X'Z} \\ {\bf Z'X} &...
...t[ \begin{array}{c}
{\bf X'y} \\ {\bf Z'y} \end{array} \right] \end{displaymath}

or

\begin{displaymath}{\bf Cf} = {\bf r}. \end{displaymath}

If we assume normality, the part of the likelihood independent of fixed effects can be written as

\begin{eqnarray*}L & = & - \frac{1}{2}[{\rm constant} \; + \; {\rm log} \vert {\...
...gma_{a}^{2} + \; {\rm log} \; \vert {\bf C} \vert + {\bf y'Py} ]
\end{eqnarray*}


where

\begin{displaymath}{\bf P} = {\bf V}^{-1} - {\bf V}^{-1}{\bf X}({\bf X'V}^{-1}{\bf X})^{-}
{\bf X'V}^{-1} \end{displaymath}

and m is the number of observations and n is the number of animals. Calculation of ${\bf y'Py}$ and log $\vert{\bf C}\vert$ can be accomplished by augmenting the coefficient matrix as

\begin{displaymath}{\bf M} = \left[ \begin{array}{lll}
{\bf X'X} & {\bf X'Z} & {...
...
{\bf C} & {\bf r} \\ {\bf r'} & {\bf y'y} \end{array} \right] \end{displaymath}

and then absorbing ${\bf C}$ into ${\bf y'y}$ by Gaussion elimination to give ${\bf y'Py}$. This does not require inversion of ${\bf C}$ and requires about $\frac{1}{3}$ of the computer time as would inversion of ${\bf C}$for large data sets. Also, log $\vert{\bf C}\vert$ can be obtained simultaneously.

Solution of $\sigma_{e}^{2}$ is

\begin{displaymath}\hat{\sigma}_{e}^{2} = {\bf y'Py}/[m-r({\bf X})] \end{displaymath}

and

\begin{displaymath}\hat{\sigma}_{a}^{2} = \hat{\sigma}_{e}^{2}/\hat{\lambda}. \end{displaymath}

The process is repeated for a range of $\hat{\lambda}$'s and the log likelihood (L) is evaluated for each. Grasser et al. fitted a quadratic function in $\lambda$ to the log likelihood to predict the maximum of log L as a maximum of this function. They found this procedure gave rapid convergence.

The procedures can be extended to models with more than one random effect, and a general purpose program DFREML for a variety of animal models has been written by K. Meyer. Southwood et al. (1989, J. Dairy Sci. 72:3006) have applied this procedure to simulated data under cytoplasmic ( $\sigma_{c}^{2}$) and additive ( $\sigma_{a}^{2}$) inheritance for records on 5000 cows. Results were

  True Estimated
  Parameter Parameter
$\sigma_{a}^{2}$ .30 .37 $\pm$ .05
$\sigma_{c}^{2}$ .31 .34 $\pm$ .05
$\sigma_{e}^{2}$ .71 .69 $\pm$ .03
h2 .23 .26 $\pm$ .02
c2 .23 .24 $\pm$ .03
The procedure seems to work well with reasonable sampling errors.

Application to Joint Bull and Cow Evaluation

Animal models lend themselves to joint bull and cow evaluation, but this can be difficult computationally. If we assume that repeated records on a cow are the same trait genetically, the model is

\begin{displaymath}{\bf y} = {\bf Xb} + {\bf Za} + {\bf Zp} + {\bf e} \end{displaymath}

with

\begin{displaymath}E({\bf y}) = {\bf Xb} \end{displaymath}

and

\begin{displaymath}{\rm Var}({\bf y}) = {\bf ZAZ'}\sigma_{a}^{2} + {\bf ZZ'}\sigma_{p}^{2}
+ {\bf I}\sigma_{e}^{2} \end{displaymath}

where ${\bf a}$ is the additive genetic effect (breeding value) and ${\bf p}$is the permanent environmental effect. The animal model equations are

\begin{displaymath}\left[ \begin{array}{lll}
{\bf X'X} & {\bf X'Z} & {\bf X'Z} \...
...y}{c} {\bf X'y} \\ {\bf Z'y} \\
{\bf Z'y} \end{array} \right]
\end{displaymath} (1)

for

\begin{displaymath}\lambda = \sigma_{e}^{2}/\sigma_{a}^{2} = (1-r)/h^{2} \end{displaymath}

and

\begin{displaymath}\gamma = \sigma_{e}^{2}/\sigma_{p}^{2} = (1-r)/(r-h^{2}) \end{displaymath}

where r = repeatability.

Cow Evaluation by Intra-Herd Animal Model

Henderson (1975, J. Dairy Sci. 58:1910) proposed an intra-herd animal model for cow evaluation. Essentially the model assumes that

\begin{displaymath}{\bf A} = \left[ \begin{array}{llll}
{\bf A}_{1} & {\bf0} & \...
...\vdots \\
{\bf0} & {\bf0} & & {\bf A}_{t} \end{array} \right] \end{displaymath}

for herds 1, 2, ..., t, i.e. relationships across herds are ignored. However, Henderson proposed incorporating genetic evaluations of sires across herds, computed separately, into the cow evaluations to tie evaluations across herds and allow across herd selection. The method has computational advantages in that herds can be processed independently, one herd at a time.

The animal model equations are set up for a single herd as in (6.1), but additions are made to diagonal elements of the sire equations and the corresponding right hand sides. For each sire, add n(1-r)/(4-h2)to the diagonal of ${\bf A}^{-1}\lambda$ and $(1-r)[4(n-1)h^{2}]/
[h^{2}(4-h^{2})]\hat{a}_{s}$ to the right hand side, where $\hat{a}_{s}$ is the estimated breeding value of the sire from the across herd genetic evaluation system. The value n is the ``effective number of daughters'' for the sire from the across herd system and is computed from the accuracy of sire evaluation as

\begin{displaymath}n = t \; R/(1-R) \end{displaymath}

where R = repeatability (squared correlation between true and estimated breeding value) and

\begin{displaymath}t = (\sigma_{e}^{2} + \sigma_{p}^{2} + .75\sigma_{a}^{2})/.25
\sigma^{2}a. \end{displaymath}

The intra-herd animal model equations are solved in the usual manner.

One cautionary note, the intra-herd system assumes that base population sires and cows were sampled from a common population and that $\hat{a}_{s}$ is expressed relative to the mean of that population. This might not be the case. An expedient solution would be to adjust the estimated breeding value of the sire to the genetic base of the herd, if possible. If one then wanted to choose among cows in different herds, the intra-herd evaluations would have to be adjusted again to a common base. This might be difficult operationally.

The intra-herd model also assumes that unknown ancestors of new cows entering the herd are also from the same base population as the herd. If this is not the case, there is need for some form of grouping to accommodate this. Grouping strategies are considered in the next section. This is difficult for an intra-herd animal model because group size is small.

Inter-Herd Animal Model

Inter-herd animal models for joint bull and cow evaluation nationally have been developed and implemented (Wiggans et al., 1988, J. Dairy Sci. 71, Suppl. 2:54; Robinson and Chesnais, 1988, J. Dairy Sci. 71 Suppl. 2:70). The systems make use of all known relationships among animals and use the computing strategy of Schaeffer and Kennedy (1986).

Previously, joint bull and cow evaluation using an animal model has been done by Westell and Van Vleck (1987, J. Dairy Sci. 70:1006) on data from New York State. They pointed out that proper grouping to account for the fact that unidentified ancestors were not sampled from the same population was important. They used Thompson's (1979, Biometrics 35:339) model as a grouping strategy. Let

\begin{displaymath}y_{ij} = \mu + h_{i} + a_{j} + \sum_{k=1}^{n} q_{jk}g_{k} + p_{j}
+ e_{ij} \end{displaymath}

where gk is the group and qjk is the fractional contribution of the group to the observations. In matrix notation the model is

\begin{displaymath}{\bf y} = {\bf Xb} + {\bf ZQg} + {\bf Za} + {\bf Zp} + {\bf e} \end{displaymath}

where

\begin{displaymath}{\bf Q} = {\bf TQ}^{*}. \end{displaymath}

${\bf T}$ is obtained from ${\bf A} = {\bf TWT'}$ and ${\bf Q}^{*}$assigns unidentified ancestors to groups. The breeding value of an individual is

\begin{displaymath}\hat{\bf a}^{*} = {\bf Q}\hat{\bf g} + \hat{\bf a}. \end{displaymath}

The approach of Westell and Van Vleck (1987) was to define groups as expected year of birth of unidentified ancestors. Then ``phantom parents'' were assigned to animals with one or both parents unknown. The ``phantom parents'' were assumed to have had only one progeny each. Allowance was also made for different selection intensities for sires of bulls, sires of cows, dams of bulls and dams of cows. Within each of the four pathways, the phantom parents were grouped by year of birth of their progeny. In this way each animal may be traced to an appropriate base population.

Use of phantom groups is illustrated for a simpler situation. Consider the following pedigree

2       1       3
  $\searrow$   $\swarrow$   $\searrow$   $\swarrow$  
    4       5    
      $\searrow$   $\swarrow$      
        6        
Parents of 1, 2 and 3 are not identified. The pedigree can be re-written to include phantom parents according to the following list.
Animal Sire Dam
1 P1 P2
2 P3 P4
3 P5 P6
4 1 2
5 1 3
6 5 4
If unidentified males are assigned to one group and unidentified females to a second group the pedigree list is
Animal Sire Dam
1 G1 G2
2 G1 G2
3 G1 G2
4 1 2
5 1 3
6 5 4
For this pedigree

\begin{eqnarray*}{\bf Q} & = & {\bf TQ}^{*} \\
& = & \;\;\;\;\;\; P_{1} \;\;\;\...
... & .5 \\ .5 & .5 \\ .5 & .5 \\
.5 & .5 \end{array} \right] \: .
\end{eqnarray*}


The usual animal model equations are

\begin{displaymath}\left[ \begin{array}{lll}
{\bf X'X} & {\bf X'Z} & {\bf X'ZQ} ...
...{c} {\bf X'y} \\ {\bf Z'y} \\
{\bf Q'Z'y} \end{array} \right] \end{displaymath}

but we want $\hat{\bf a}^{*} = \hat{\bf a} + {\bf Q}\hat{\bf g}$ and Quaas and Pollak (1981, J. Dairy Sci. 64:1868) showed that $\hat{\bf a}^{*}$ can be obtained directly from

\begin{displaymath}\left[ \begin{array}{lll}
{\bf X'X} & {\bf X'Z} & {\bf0} \\
...
...{array}{c} {\bf X'y} \\ {\bf Z'y} \\ {\bf0} \end{array}\right] \end{displaymath}

which is easier to compute. Note that ${\bf a}$ contains the phantom parents ( ${\bf a}_{0}$) as well as the vector of real animals ( ${\bf a}_{1}$). Because ${\bf Z}_{0}={\bf O}$, absorbing the phantom parent equations gives

\begin{displaymath}\left[ \begin{array}{lll}
{\bf X'X} & {\bf X'Z}_{1} & {\bf0} ...
...{\bf X'y} \\ {\bf Z'}_{1}{\bf y} \\ {\bf0} \end{array} \right] \end{displaymath}

The elements of ${\bf P}^{11}$, ${\bf P}^{10}$, ${\bf P}^{01}$ and ${\bf P}^{00}$ can be obtained by rules similar to those of Henderson (1976) for non-inbred populations. The rules for construction of ${\bf P}^{11}$ are the same as those that would be obtained from construction of ${\bf A}^{-1}$ ignoring phantom parents. For ${\bf P}^{10}$ and ${\bf P}^{01}$ add -.5 to the animal by sire phantom group element and -.5 to the animal by dam phantom group element. For ${\bf P}^{00}$ add .25 to the sire and dam phantom group diagonals and .25 to the sire by dam phantom group off diagonals. Further details are in Westell et al. (1988, J. Dairy Sci. 71:1310).

For the example pedigree

\begin{displaymath}\left[ \begin{array}{cc} {\bf P}^{11} & {\bf P}^{10} \\
{\bf...
...5 & -.5 & -.5 & 0 & 0 & 0 & .75 & .75
\end{array} \right] \: . \end{displaymath}

Major Genes
B. W. Kennedy, 1993

Considerable attention has been paid to methods for detecting genes with major effect, but little work has been done on genetic evaluation for traits influenced by major genes. A number of single genes, inherited in a simple Mendelian manner, that can be screened for and identified in the animal have been shown to be associated with quantitative traits of economic importance. However, most of these quantitative traits are likely influenced also by many other genes that have smaller individual effects but large aggregate effects on the trait. Under certain population and breeding structures, these polygenic effects can be confounded with effects of the single gene under investigation, even in the absence of linkage. In the analysis of data to examine effects of single genes on quantitative traits, it is important to disentangle this confounding to obtain unbiased estimates of single gene effects and valid significant levels of tests of hypotheses about these single gene effects. Most studies to date have used least squares analyses to measure single gene effects and have ignored some or all of the effects of the background polygenes.

In this Chapter we consider two situations, where observations on animals can be classified as to genotype without error and where there is error or incertainty in the classification.

Observations Classified as to Genotype Without Error

If animals upon which observations are available can be readily classified as to whether or not they possess the major gene, BLUE of the effect of genotype at the major gene locus and BLUP of additive genetic merit $(\hat{\bf a})$ of animals for the polygenes influencing the trait can be obtained by treating the major gene or transgene as a fixed effect assuming the gene is expressed equally in all animals. If the effect of the major gene on the trait of concern is additive or incompletely dominant, then heterozygotes need to be distinguished from homozygotes.

In this section we consider both randomly mated and selected populations and show where use of an individual animal model provides unbiased estimates and exact tests of hypotheses of single gene effects and where use of ordinary least squares (OLS) analysis, the usual method of analysis, does not.

Methodology

Consider the simple genetic model

yij = gi + aij + eij (2)

where yij is an observation on a quantitative trait on the ijth animal, gi is the effect of the ith observable genotype at a single locus, aij is the additive polygenic effect of the ijth animal $\sim N({\bf0},{\bf A}\sigma_{a}^{2})$, where ${\bf A}$ is a matrix of additive relationships among animals, and eij is a random environmental effect $\sim N({\bf0},{\bf I}\sigma_{e}^{2})$. This is the classic mixed model both in a genetic sense of involving a mixture of polygenic and single locus effects and in a statistical sense of involving both fixed (single locus genotype) and random (polygenic) effects. We are dealing with a special case of the genetic mixed model, however, in that we assume we can identify all individuals with observations on the quantitative trait as to genotype at the single locus of interest. For this model we assume also, for simplicity, no other fixed effects, polygenic effects are additive, and there are no interactions between single gene and polygenic effects, but expansion of the model to accommodate more complicated models is possible.

In matrix notation, the model is

\begin{displaymath}{\bf y} = {\bf Xg} + {\bf Za} + {\bf e}
\end{displaymath} (3)

with (assuming no selection)

\begin{displaymath}E({\bf y}) = {\bf Xg} \end{displaymath}

and

\begin{displaymath}{\rm Var}({\bf y}) = {\bf V} = {\bf ZAZ'}\sigma_{a}^{2} +
{\bf I}\sigma_{e}^{2}.
\end{displaymath} (4)

The matrix ${\bf X}$ assigns observations to single locus genotypes $({\bf g})$ and the matrix ${\bf Z}$ assigns observations to animals. If all animals in ${\bf a}$ have a single record, then ${\bf Z} = {\bf I}$, the identity matrix.

Mixed Model Method

Estimation of effects of single locus genotype under (7.2) and (7.3) are obtained as follows:

$\displaystyle \left[ \begin{array}{c} \hat{\bf g} \\  \hat{\bf a} \end{array} \right]$ = $\displaystyle \left[ \begin{array}{ll} {\bf X'X} & {\bf X'Z} \\  {\bf Z'X} &
{\...
...ight] ^{-1} \left[
\begin{array}{c} {\bf X'y} \\  {\bf Z'y} \end{array} \right]$  
  = $\displaystyle \left[ \begin{array}{ll} {\bf C}_{11} & {\bf C}_{12} \\  {\bf C}_...
...ay} \right] \left[ \begin{array}{c} {\bf X'y} \\
{\bf Z'y} \end{array} \right]$ (5)

where

\begin{displaymath}\lambda = \sigma_{e}^{2}/\sigma_{a}^{2} \end{displaymath}

or as

\begin{displaymath}\hat{\bf g} = ({\bf X'V}^{-1}{\bf X})^{-1} {\bf X'V}^{-1}{\bf y}. \end{displaymath}

Var $(\hat{\bf g})$ is computed as

\begin{displaymath}{\rm Var}(\hat{\bf g}) = {\bf C}_{11}\sigma_{e}^{2}
\end{displaymath} (6)

and an estimate is provided by substituting

\begin{eqnarray*}\hat{\sigma}_{e}^{2} & = & ({\bf y'y} - \hat{\bf g}{\bf 'X'y} -...
...}{\bf y} - \hat{\bf g}{\bf 'X'V}^{-1}{\bf y})/
[N - r({\bf X})],
\end{eqnarray*}


where N = the number of observations and $r({\bf X})$ = the number of genotypes.

The null hypothesis $H_{o}: \; {\bf K'g} = {\bf0}$ can be tested from $F = Q/f \hat{\sigma}_{e}^{2}$ where $f = r({\bf K})$ and

\begin{displaymath}Q = ({\bf K'}\hat{\bf g})'({\bf K'C}_{11}{\bf K})^{-1}({\bf K'}
\hat{\bf g}). \end{displaymath}

The F-test has f numerator and $N - r({\bf X})$ denominator degrees of freedom.

Ordinary Least Squares

Frequently, in analysis of data for single gene effects an operational model of

\begin{displaymath}y_{ij} = g_{i} + \varepsilon_{ij}, \end{displaymath}

where $\varepsilon_{ij} = a_{ij} + e_{ij}$ from (7.1) is assumed, and analysis is by ordinary least squares. In matrix notation, the model is

\begin{displaymath}{\bf y} = {\bf Xg} + {\bf\mbox{\boldmath$\varepsilon$ }} \end{displaymath}

with $E({\bf y})$ and Var$({\bf y})$ as in (7.2) and (7.3).

However, with ordinary least squares, Var$({\bf y})$ is (incorrectly) assumed for computation purposes to be ${\bf I}\sigma_{\varepsilon}^{2}$where $\sigma_{\varepsilon}^{2} = \sigma_{a}^{2} + \sigma_{e}^{2}$ and the correlated error structure from ${\bf ZAZ'}\sigma_{a}^{2}$ is ignored. Under ordinary least squares, estimation of genotype effects is from

\begin{displaymath}\hat{\bf g} = ({\bf X'X})^{-1}{\bf X'y}.
\end{displaymath} (7)

Var $(\hat{\bf g})$ is computed as

\begin{displaymath}{\rm Var}(\hat{\bf g}) = ({\bf X'X})^{-1} \hat{\sigma}_{\varepsilon}^{2}, \end{displaymath}

where

\begin{displaymath}\hat{\sigma}_{\varepsilon}^{2} = ({\bf y'y} - \hat{\bf g}{\bf 'X'y})/
[N - r({\bf X})]. \end{displaymath}

The hypothesis $H_{o}: \; {\bf K'g} = {\bf0}$ is tested by $F = Q/f
\hat{\sigma}_{\varepsilon}^{2}$, where

\begin{displaymath}Q = ({\bf K'}\hat{\bf g})'[{\bf K'}({\bf X'X})^{-1}{\bf K}]^{-1}
({\bf K'}\hat{\bf g}), \end{displaymath}

with f numerator and $N - r({\bf X})$ denominator degrees of freedom.

Random Mating

We now consider properties of the mixed model and ordinary least squares estimates under random mating.

Mixed Model

Under random mating, $E({\bf y}) = {\bf Xg}$ and it is well known that (given $\lambda$),

\begin{displaymath}E(\hat{\bf g}) = {\bf g} \end{displaymath}

and Var $(\hat{\bf g}) = {\bf C}_{11}\sigma_{e}^{2}$ is minimum given $E(\hat{\bf g}) = {\bf g}$, i.e. $\hat{\bf g}$ is the best linear unbiased estimator (BLUE) of ${\bf g}$. If inbreeding occurs, use of ${\bf A}$in the mixed model equations accounts for the decline in additive genetic variance associated with inbreeding.

Also, for

\begin{displaymath}Q = \hat{\bf g}'{\bf C}_{11}^{-1}\hat{\bf g}, \end{displaymath}


\begin{displaymath}Q/f\hat{\sigma}_{e}^{2} \sim F[f,N - r({\bf X})], \end{displaymath}

where f = the number of genotypes, gives an exact and best, in the sense of most powerful, test of the null hypothesis

\begin{displaymath}H_{o}: \; {\bf g} = {\bf0}. \end{displaymath}

Linear functions of ${\bf g}$ can be tested using

\begin{displaymath}Q = ({\bf K'}\hat{\bf g})'({\bf K'C}_{11}{\bf K})^{-1}({\bf K'}
\hat{\bf g}) \end{displaymath}

where

\begin{displaymath}f = r({\bf K}). \end{displaymath}

Use of ${\bf A}$ in the estimation procedure accounts for the correlated structure among observations $({\bf y})$.

Ordinary Least Squares

With analysis by OLS,

\begin{displaymath}E(\hat{\bf g}) = E({\bf g}), \end{displaymath}

that is estimates of single gene effects are unbiased, but Var $(\hat{\bf g}) \neq ({\bf X'X})^{-1}\sigma_{\varepsilon}^{2}$, as would be computed by most OLS statistical software. Rather,

\begin{eqnarray*}{\rm Var}(\hat{\bf g}) & = & {\rm Var}[({\bf X'X})^{-1}{\bf X'y...
... + ({\bf X'X})^{-1}{\bf X'ZAZ'X}
({\bf X'X})^{-1}\sigma_{a}^{2}.
\end{eqnarray*}


Only if ${\bf Z} = {\bf I}$ and ${\bf A} = {\bf I}$, which is the case only if animals are unrelated and there is a single observation per animal, or if $\sigma_{a}^{2} = 0$, that is, there is no genetic variance due to polygenes, does Var $(\hat{\bf g}) = ({\bf X'X})^{-1}\sigma_{\varepsilon}^{2}$. If these conditions are not true $({\bf X'X})^{-1}\sigma_{\varepsilon}^{2}$will not represent true Var $(\hat{\bf g})$.

This also has ramifications for hypothesis tests involving ${\bf g}$. For

\begin{displaymath}Q = ({\bf K'}\hat{\bf g})'[{\bf K'}({\bf X'X})^{-1}{\bf K}]^{-1}
({\bf K'}\hat{\bf g}) \end{displaymath}

and

\begin{displaymath}\hat{\sigma}_{\varepsilon}^{2} = ({\bf y'y} - \hat{\bf g}{\bf 'X'y})/
[N - r({\bf X})], \end{displaymath}


\begin{displaymath}Q/f \hat{\sigma}_{\varepsilon}^{2} \; {\rm is \; not \; distributed \; as \;}
F[f,N - r({\bf X})] \end{displaymath}

under the null hypothesis $H_{o}: \; {\bf K'g} = {\bf0}$, because Q and $f\hat{\sigma}_{\varepsilon}^{2}$ are not independent and do not have the same expectation. Indeed, if the true model is (7.2) and (7.3) it is likely that

\begin{displaymath}E(Q/f)>E(\hat{\sigma}_{\varepsilon}^{2}). \end{displaymath}

It can be shown that

\begin{eqnarray*}E({\bf y'y} - \hat{\bf g}{\bf 'X'y}) & = & E[{\bf y'}({\bf I} -...
...{\bf ZAZ'} -
{\bf X}({\bf X'X})^{-1}{\bf X'ZAZ'}]\sigma_{a}^{2}.
\end{eqnarray*}


Now

\begin{displaymath}{\rm tr}({\bf ZAZ'}) = {\rm tr}({\bf AZ'Z}) = \sum_{k=1}^{n} n_{k}a_{k} \end{displaymath}

where ak is the kth diagonal element of ${\bf A}$ and nk is the number of records on individual k. In the absence of inbreeding $\Sigma n_{k}a_{k} = N$, and with inbreeding $\Sigma n_{k}a_{k} = N +
\Sigma n_{k}F_{k}$, which reduces to $N + \Sigma F_{k}$ when each individual has a single record. The second term involving $\sigma_{a}^{2}$ is

\begin{displaymath}{\rm tr}[{\bf X}({\bf X'X})^{-1}{\bf X'ZAZ'}] = {\rm tr}[{\bf AZ'X}
({\bf X'X})^{-1}{\bf X'Z}]. \end{displaymath}

If ${\bf X}$ and ${\bf Z}$ are ordered by genotype, and when each individual has a single record,

\begin{displaymath}{\bf Z'X}({\bf X'X})^{-1}{\bf X'Z} = \bigoplus_{i=1}^{t} \;
{\bf J}_{i}/n_{i.}, \end{displaymath}

where t is the number of genotypes, $\oplus$ represents a direct sum, ${\bf J}_{i}$ is a matrix with each element equal to one and ni. is the number of individuals with genotype i. Therefore

\begin{eqnarray*}{\rm tr} \left( {\bf A} \; \bigoplus_{i=1}^{t} \; {\bf J}_{i}/n...
...}^{2}\bar{a}_{ii} \\
& = & \sum_{i=1}^{t} \; n_{i.}\bar{a}_{ii}
\end{eqnarray*}


where $\bar{a}_{ii}$ is the average relationship among individuals of genotype i. (Results are similar if we relax the assumption that each individual has a single record, except that individuals contribute to $\bar{a}_{ii}$ in proportion to their number of records, that is $\bar{a}_{ii}$ is an average relationship weighted by number of records contributed by each animal.)

Therefore, the expectation of $\sigma_{\varepsilon}^{2}$, again assuming for simplicity that each individual has a single record, is

\begin{eqnarray*}& & ([N - r({\bf X})]\sigma_{e}^{2} + [N + \sum_{k=1}^{n} F_{k}...
...k} - \Sigma n_{i.}\bar{a}_{ii})
\sigma_{a}^{2}/[N - r({\bf X})].
\end{eqnarray*}


Note that if all individuals are unrelated and there is no inbreeding

\begin{displaymath}(N + \Sigma F_{k} - \Sigma n_{i.}\bar{a}_{ii}) = N - t = N - r({\bf X}) \end{displaymath}

and

\begin{displaymath}E(\hat{\sigma}_{\varepsilon}^{2}) = \sigma_{e}^{2} + \sigma_{a}^{2}, \end{displaymath}

i.e., $\hat{\sigma}_{\varepsilon}^{2}$ is an unbiased estimator of $\sigma_{\varepsilon}^{2}$. However, if animals are related within genotype, $\Sigma n_{i.}\bar{a}_{ii}>t$ and there is a downward bias in $\hat{\sigma}_{\varepsilon}^{2}$, assuming no inbreeding. With inbreeding the direction of the bias in $\hat{\sigma}_{\varepsilon}^{2}$ depends upon whether $\Sigma F_{i} - \Sigma n_{i.}\bar{a}_{ii}$ is greater than or less than -t. A downward bias in $\hat{\sigma}_{\varepsilon}^{2}$, unless accompanied by an equal proportional downward bias in the contribution of $\sigma_{\varepsilon}^{2}$ to Q, will result in an inflated F-test.

Now we examine Q, the numerator of the F-test. Under $H_{o}: \; {\bf K'g} = {\bf0}$,

\begin{displaymath}E(Q) = {\rm tr}([{\bf K'}({\bf X'X})^{-1}{\bf K}]^{-1}{\bf K'...
...^{-1}{\bf K})\sigma_{a}^{2} +
{\rm tr}({\bf I})\sigma_{e}^{2}. \end{displaymath}

Again, if individuals are ordered by genotype and each individual has a single observation,

\begin{displaymath}({\bf X'X})^{-1}{\bf X'ZAZ'X}({\bf X'X})^{-1} = \bar{\bf A}
= \{\bar{a}_{ij}\} \end{displaymath}

for i,j = 1, 2, ...$\; t$. Consider now some hypotheses associated with estimates of genotype effects of interest. We might have two alleles and three genotypes and be interested, for example, in the estimate of the difference between the two homozygous genotypes and the test of the hypothesis that the genotypes are equal. In this instance

\begin{displaymath}{\bf K'} = (1 \;\;\; 0 \;\; -1) \end{displaymath}

and

\begin{displaymath}E(Q) = \sigma_{e}^{2} + (1/n_{1.} + 1/n_{3.})^{-1}(\bar{a}_{11}
+ \bar{a}_{33} - 2\bar{a}_{13}) \sigma_{a}^{2}. \end{displaymath}

If individuals are unrelated

\begin{displaymath}\bar{a}_{11} = 1/n_{1.}, \end{displaymath}


\begin{displaymath}\bar{a}_{33} = 1/n_{3.} \end{displaymath}

and

\begin{displaymath}\bar{a}_{13} = 0, \end{displaymath}

and therefore

\begin{displaymath}E(Q) = \sigma_{e}^{2} + \sigma_{a}^{2} = \sigma_{\varepsilon}^{2} \end{displaymath}

and the F-test is unbiased. However, if individuals are related there is a bias. If individuals within a genotype are more highly related than individuals of different genotypes, as is likely the case, then the bias is upwards, that is

\begin{displaymath}(\bar{a}_{11} + \bar{a}_{33} - 2\bar{a}_{13}) > (1/n_{1.} + 1/n_{3.}). \end{displaymath}

Recall that the denominator of the F-test, $\hat{\sigma}_{\varepsilon}^{2}$, is biased downwards in the presence of relationships within genotype (ignoring inbreeding), and therefore it is likely the F-test will be inflated and give too many ``significant'' results. The estimate of the difference between genotypes, in the absence of selection, is nonetheless unbiased.

Similarly, the test that the heterozygous genotype is equal to the average of the homozygous genotypes (i.e. dominance = 0) can be constructed through ${\bf K'} = (-1 \;\; 2 \; -1)$. With this hypothesis, it can be shown under the null hypothesis that

\begin{displaymath}E(Q) = \sigma_{e}^{2} + (1/n_{1.} + 4/n_{2.} + 1/n_{3.})^{-1}...
... 4\bar{a}_{12} +
2\bar{a}_{13} - 4\bar{a}_{23})\sigma_{a}^{2}. \end{displaymath}

Again, if individuals are unrelated,

\begin{displaymath}E(Q) = \sigma_{e}^{2} + \sigma_{a}^{2} = \sigma_{\varepsilon}^{2}, \end{displaymath}

but if they are related and relationships are greater within genotype than across genotype, then there is likely to be an upward bias in Q and again a likelihood of too many significant results. Similarly, a test of the hypothesis g1 = g2 = g3, which can be constructed from the two previous tests, will likely be biased upwards.

Selection

Again for simplicity assume that the single locus has two allelic forms and three possible genotypes. Assume also that at generation 0 the population is in Hardy-Weinberg and linkage equilibrium and let genotypes g1, g2 and g3 have values of $\alpha$, $\delta$ and $-\alpha$ and frequencies of p02, 2p0q0 and q02, respectively. If ${\bf y}_{0}$ represents phenotypic observations then, before selection,

\begin{displaymath}E({\bf y}_{0}) = {\bf Xg}. \end{displaymath}

We now impose truncation selection on ${\bf y}_{0}$ such that individuals with the highest phenotypic values are selected to produce progeny in generation 1 and those with the lowest phenotypic values do not contribute progeny.

For the jth selected individual of genotype i

E(yijs) = gi + vi + wi,

where

\begin{displaymath}v_{i} = h^{2}E(\bar{y}_{i.s} - \bar{y}_{i.0}) \end{displaymath}

and

\begin{displaymath}w_{i} = (1 - h^{2})E(\bar{y}_{i.s} - \bar{y}_{i.0}) \end{displaymath}

are the expected additive genetic and environmental values of selected individuals of the ith genotype. Because new environmental deviations are generated each generation, wi does not concern us for generation 1, but vi does. For $\alpha \geq \delta \geq -\alpha$, then $v_{1} \leq v_{2} \leq v_{3}$. Intuitively this seems reasonable because on average, if an individual of the poorest single genotype is selected, it is likely to be better than average for additive genetic value of polygenes and vice-versa. In matrix notation, the expected value of the vector of observations above the truncation point is

\begin{displaymath}E({\bf y}_{0_{s}}) = {\bf X}_{s}{\bf g} + {\bf v} + {\bf w} \end{displaymath}

for

\begin{displaymath}{\bf v'} = (v_{1} \;\; v_{2} \;\; v_{3}) \end{displaymath}

and

\begin{displaymath}{\bf w'} = (w_{1} \;\; w_{2} \;\; w_{3}). \end{displaymath}

The values of ${\bf X}_{s}$, ${\bf v}$ and ${\bf w}$ will depend upon initial gene frequencies, the proportion of individuals selected and values of ${\bf g}$, $\sigma_{a}$ and $\sigma_{e}$. If the proportional contributions of genotypes g1, g2 and g3 from generation 0 to generation 1 are p02r, 2p0q0s and q02t, respectively, where r, s and t are the relative fitnesses of the respective genotypes scaled to keep population size constant, then in generation 1 the expected value of an individual of the ith genotype, if matings among selected parents is at random, is

\begin{displaymath}E(y_{11}) = \alpha + p_{0}(p_{0}rv_{1} + q_{0}sv_{2})/p_{1}
\end{displaymath} (8)


\begin{displaymath}E(y_{21}) = \delta + (p_{0}^{2}q_{1}rv_{1} + p_{0}q_{0}sv_{2} +
q_{0}^{2}p_{1}tv_{3})/2p_{1}q_{1}
\end{displaymath} (9)


\begin{displaymath}E(y_{31}) = -\alpha + q_{0}(p_{0}sv_{2} + q_{0}tv_{3})/q_{1}
\end{displaymath} (10)

with

p1 = p02r + p0q0s

and

q1 = p0q0s + q02t.

We now consider the effects of this selection on estimates of ${\bf g}$ from mixed model methods and OLS.

Mixed Model Methods

Properties of mixed model methods when selection has occurred are not fully understood. Use of ${\bf A}$ in (7.4) accommodates the change in variance due to gametic disequilibrium. Henderson (1975) has shown that use of the usual mixed model equations (7.4), ignoring selection, yield BLUE of fixed effects and BLUP of random effects if selection is on a linear function of the data, which he designated as ${\bf L'y}$, ${\bf a}$ and ${\bf e}$ are multivariate normal, which implies an infinitesimal model for the polygenes, ${\bf A}$ and $\lambda$ are known and ${\bf L'X} = {\bf0}$. It is this last restriction that is most troublesome. Henderson has interpreted this to imply that selection must be within levels of fixed effects.

If we assume the effect of the single locus genotype to be fixed, the selection process envisioned here clearly is not within genotype at the single genetic locus and seemingly does not fit Henderson's requirement that ${\bf L'X} = {\bf0}$. With repeated sampling, however, ${\bf L}$is not fixed and will vary from sample to sample. In our sampling scheme, so will ${\bf X}$.

It seems intuitively reasonable that, if the data used for analyses contain all the observations used in the selection decision process, use of (7.4) will provide unbiased estimates of ${\bf g}$, over repeated sampling, even though selection is directly on the observations $({\bf y})$which include the contributions of the effects of genotype at the single genetic locus. However, if data upon which selection was practiced are not included, some bias will result.

Ordinary Least Squares

We now consider properties of OLS estimates of ${\bf g}$ under selection. First consider the situation where only data from the progeny of selected parents are used (generation 1). From (7.7) and (7.9) it is easy to show that the estimate of $\alpha$ as obtained from

\begin{displaymath}\hat{\alpha} = 1/2(\bar{y}_{11} - \bar{y}_{31}), \end{displaymath}

which is the estimate that would be obtained by OLS, is biased by

1/2[p0(p0rv1 + q0sv2)/p1 - q0(p0sv2 + q0tv3)/q1].

Because with directional selection, p1 > p0and q1 < q0 and v1 < v2 < v3, this bias is negative, that is the effect of the single gene is underestimated.

The bias is directly proportional to heritability of the polygenes. Under an additive model the bias depends mostly upon selection intensity, and increases with increasing intensity. With complete dominance, the bias is greatest for small initial frequencies of the favorable gene and tends to zero at high gene frequencies for all intensities of selection. With a recessive model the opposite tends to occur.

If $\delta$ is estimated as

\begin{displaymath}\hat{\delta} = \bar{y}_{21} - 1/2(\bar{y}_{11} + \bar{y}_{31}), \end{displaymath}

it can be easily shown from (7.7), (7.8) and (7.9) that $E(\hat{\delta})
= \delta$ and there is no bias.

If data are pooled over generations 0 and 1, and all data of generation 0 are included, the situation is more complicated. The OLS estimate of g1is

\begin{displaymath}\hat{g}_{1} = (p_{0}^{2}\bar{y}_{10} + p_{1}^{2}\bar{y}_{11})/
(p_{0}^{2} + p_{1}^{2}) \end{displaymath}

and

\begin{eqnarray*}E(\hat{g}_{1}) & = & (p_{0}^{2}\alpha + p_{1}^{2}[\alpha + p_{0...
...+ p_{0}p_{1}(p_{0}rv_{1} + q_{0}sv_{2})/(p_{0}^{2} +
p_{1}^{2}).
\end{eqnarray*}


The latter term represents the bias in $\hat{g}_{1}$. This bias is in the direction of selection, that is, if positive selection is practiced, v1 and v2 will be positive.

Similarly, the expected value of the OLS estimate of g3 is

\begin{displaymath}E(\hat{g}_{3}) = - \alpha + q_{0}q_{1}(p_{0}sv_{2} + q_{0}tv_{3})/
(q_{0}^{2} + q_{1}^{2}). \end{displaymath}

Therefore, if we estimate $\alpha$ from OLS estimates of 1/2(g1 - g3), that is ${\bf K'g}$ for ${\bf K}' = (\frac{1}{2} \;\; 0 \;\; -\frac{1}{2})$, then

\begin{displaymath}E(\hat{\alpha}) = \alpha + p_{0}p_{1}(p_{0}rv_{1} + q_{0}sv_{...
..._{0}q_{1}(p_{0}sv_{2} + q_{0}tv_{3})/2(q_{0}^{2} + q_{1}^{2}).
\end{displaymath} (11)

As selection progresses, p will increase relative to q and will tend to cause a positive bias in $\hat{\alpha}$. The numerator terms in the bias can be expressed as

\begin{displaymath}p_{1}(z_{1} + \frac{1}{2} z_{2}) \end{displaymath}

and

\begin{displaymath}q_{1}(\frac{1}{2} z_{2} + z_{3}) \end{displaymath}

where

z1 = p02rv1,


z2 = 2p0q0sv2

and

z3 = q02tv3.

These terms (zi) are h2 times the genotypic frequency times the probability density of each genotype at its selection truncation point (divided by the proportion of all individuals in generation 0 selected to be parents). The bias in $\hat{\alpha}$ is then

\begin{displaymath}p_{1}(z_{1} + \frac{1}{2} z_{2})/2(p_{0}^{2} + p_{1}^{2})
- q_{1}(\frac{1}{2} z_{2} + z_{3})/2(q_{0}^{2} + q_{1}^{2}). \end{displaymath}

The estimate of $\delta$ requires an estimate of g2. The OLS estimate of g2 is

\begin{displaymath}\hat{g}_{2} = (p_{0}q_{0}\bar{y}_{20} + p_{1}q_{1}\bar{y}_{21})/
(p_{0}q_{0} + p_{1}q_{1}) \end{displaymath}

and

\begin{displaymath}E(\hat{g}_{2}) = \delta + (p_{0}^{2}q_{1}rv_{1} + p_{0}q_{0}sv_{2} +
q_{0}^{2}p_{1}tv_{3})/2(p_{0}q_{0} + p_{1}q_{1}). \end{displaymath}

If we estimate $\delta$ from OLS estimates of $g_{2} - \frac{1}{2}(g_{1}
+ g_{3})$, that is ${\bf K'g}$ for ${\bf K'} = (-\frac{1}{2} \;\; 1 \;\;
-\frac{1}{2})$, then
$\displaystyle E(\hat{\delta})$ = $\displaystyle \delta + (p_{0}^{2}q_{1}rv_{1} + p_{0}q_{0}sv_{2} +
q_{0}^{2}p_{1}tv_{3})/2(p_{0}q_{0} + p_{1}q_{1})$  
    - p0p1(p0rv1 + q0sv2)/2(p02 + p12)  
    - q0q1(p0sv2 + q0tv3)/2(q02 + q12)  
  = $\displaystyle (q_{1}z_{1} + \frac{1}{2} z_{2} + p_{1}z_{3})/2(p_{0}q_{0} + p_{1}q_{1})$  
    $\displaystyle - p_{1}(z_{1} + \frac{1}{2} z_{2})/2(p_{0}^{2} + p_{1}^{2})$  
    $\displaystyle - q_{1}(\frac{1}{2} z_{2} + z_{3})/2(q_{0}^{2} + q_{1}^{2}).$ (12)

Bias in $\hat{\alpha}$ is influenced greatly by selection and is generally positive with intense selection. At low selection intensities bias in $\hat{\alpha}$ is small. Under an additive model at very high selection intensity, bias can be greater than the size of the true effect but decreases proportionally as the magnitude of the true effect increases. Bias is slightly higher at high gene frequency than at low. With dominance and high intensities of selection, bias is more dependent on gene frequency and peaks at frequencies of the desirable gene of between .1 and .2, and then drops off quickly as gene frequency increases. With recessive gene action, bias tends to increase with gene frequency.

As with $\hat{\alpha}$, bias in $\hat{\delta}$ is small with weak selection. With intense selection, direction of the bias depends very much on mode of gene action and gene frequency. Under an additive model, large positive biases in $\hat{\delta}$ occur at low gene freqency and large negative biases occur at high gene frequency if selection is very intense. Otherwise, bias in $\hat{\delta}$ is relatively small. With complete dominance, bias in $\hat{\delta}$ is largely positive, can be high at low gene frequency if selection intensity is high, but tends to zero at intermediate and high gene frequencies. However, with a recessive model, large negative biases in $\hat{\delta}$ can occur at high gene frequency and intense selection, with small positive biases at low gene frequencies. In contrast to the situation in estimation of $\alpha$, bias in $\hat{\delta}$ tends to be proportionally less for genes of moderate effect than genes of large effect.

The preceding is based on the assumption that there is an effect of the single gene (i.e. $\alpha \neq 0$). If $\alpha = \delta = 0$, then v1 = v2 = v3, r = s = t, p1 = p0 and q1 = q0. If we let v* = rv1 = sv2 = tv3 then it is simple to show that (7.7), (7.8) and (7.9), respectively, reduce to E(y11) = E(y21) = E(y31) = v*, that is the expected value of an observation on each genotype at generation 1 is simply the response to selection for the polygenes. As a result, $E(\hat{\alpha})$ of (7.10) reduces to $\alpha$ and $E(\hat{\delta})$ of (7.11) reduces to $\delta$, that is there is no bias in the least squares estimates of $\alpha$ and $\delta$ whether data from both generations are used or data only from generation 1 are used. Although $\hat{\alpha}$ and $\hat{\delta}$ are unbiased when true $\alpha = \delta = 0$, the likelihood of finding spurious ``significant'' effects of $\alpha$ and $\delta$ increases with selection.

The following simulation results illustrate this. Records of animals were simulated over five generations for a population size of 40 (20 males and 20 females). The trait was controlled by a single locus with two alleles (p=q=0.5 in the base population) as well as by polygenes. For the polygenes, $\sigma_{a}^{2}$ was 1 as was the environmental variance ( $\sigma_{e}^{2}$) (i.e. h2=05). Data were analyzed by an animal model according to (7.2) and by least squares (7.6) ignoring ${\bf Za}$ in (7.2).

In the first simulation, five of 20 males were randomly selected for breeding each generation and there was no effect of the single gene (i.e. a=0 and d=0). The results are summarized in the following table.

Estimates of the additive effect of a single gene (a), average F-ratios to test the null hypothesis Ho:a=0 and frequency of rejection of the null hypothesis using an animal model (AM) and ordinary least squares (OLS) when selection is at random and there is no effect of the single gene (a=0).

  $\hat{a}$ F-Ratio Freq. Rejection
Gen. AM OLS AM OLS AM OLS
0 -.01 -.01 1.19 1.19 .08 .08
3 -.01 -.01 1.08 $\:$1.57* .06 $\:$.14*
5 -.00 -.01 1.09 $\:$1.70* .05 $\:$.18*
* P<.05

When selection is at random, both the animal model and least squares gave unbiased estimates of a at all generations. However, as time progressed, least squares produced inflated F-ratios and the null hypothesis Ho:a=0(which was true) was rejected more frequently than the stated type I error rate (P=.05). The animal model rejected the null hypothesis at the specified error rate. In other words, use of least squares which ignores the background polygenes will likely find ``significant'' single gene effects which are not real. Use of an animal model will not.

With selection of the best five of 20 males on phenotype the situation is much worse as illustrated in the following table.

Estimates of the additive effect of a single gene (a), average F-ratios to test the null hypothesis Ho:a=0 and frequency of rejection of the null hypothesis using an animal model (AM) and ordinary least squares (OLS) under selection when there is no effect of the single gene (a=0).

  $\hat{a}$ F-Ratio Freq. Rejection
Gen. AM OLS AM OLS AM OLS
0 -.01 -.01 1.19 1.19 .08 .08
3 -.01 -.04 $\;$.96 $\:$1.60* .05 $\:$.14*
5 -.00 -.01 $\;$.99 $\:$2.58* .05 $\:$.31*
* P<.05

Although least squares is still unbiased, the probability of finding significant results when there was no effect of the single gene was very high and was 0.31 at generation 5. Again the animal model worked well.

Also with selection, if there is a real effect of the single gene its effect will be overestimated by least squares but an animal model will estimate it unbiasedly. In the following simulation, there was an effect of the single gene $(a=1, \; d=1)$ and selection was the best five of 20 males on phenotype.

Estimates of the additive (a) and dominance (d) effects of a single locus using an animal model (AM) and least squares (LS) under selection with complete dominance and a true effect of the single gene of a=1 and d=1.

  $\hat{a}$ $\hat{d}$
Gen. AM OLS AM OLS
0 $\;$.99 .99 1.03 1.03
3 1.00 1.09* 1.02 1.04
5 1.00 1.20* 1.04 1.04
* P<.05

The animal model gave unbiased estimates of both a and d. Although least squares gave unbiased estimates of d, by generation 5 estimates of a were overestimated by 20%.

Observations Classified as to Genotype With Error

If it is not practical or feasible to classify the individual as to major genotype then one could resort to the usual genetic evaluation system whereby the major gene is simply considered as one of the many genes influencing the trait, and its effect is included as part of ${\bf a}$. This, however, is not too satisfactory if the gene has a large effect. Multivariate normality could no longer be assumed legitimately, even prior to selection. The greater the effect of the major gene, the greater will be the departures from normality.

An alternative worth investigating is to consider the observations as arising from a mixture of two or more normally distributed populations. Mixture models, which are distinct from mixed models, were first considered by Pearson (1894) who used the method of moments to estimate parameters of a mixture of two normal densities; the means and variances of each population and the proportion of each population in the mixture. Maximum likelihood procedures have been developed for mixture models. Their application is demanding computationally, perhaps too demanding for application to large data sets but computational requirements are less if a common variance can be assumed for the populations. Applications have been made in the field of human genetics to distinguish between the effects of major genes and polygenes and these have been extended to large animal populations.

In the case of a major gene with two alleles, a mixture of two (complete dominance) or three (incomplete dominance) populations, each normally distributed, could be hypothesized. Evaluation could be according to the model

\begin{displaymath}{\bf y } ={\bf Pg} + {\bf Xb} + {\bf Za} + {\bf e}, \end{displaymath}

where ${\bf g}$ represents the unknown effect of major genotype (g1 = homozygous for the dominant allele (AA), g2 = heterozygous (Aa) and g3 = homozygous for the recessive allele (aa)). If an additive model is assumed, $\hat{\bf g}$ should be restricted such that $\hat{g}_{2} = 0.5(\hat{g}_{1} + \hat{g}_{3}$). What distinguishes this model is that elements of ${\bf g}$ cannot be assigned to ${\bf y}$ with certainty by use of an incidence matrix. Rather $\hat{\bf g}$ and an additional parameter $\hat{\bf p}$, the proportion of the mixture arising from each population, would be estimated jointly with $\hat{\bf a}$, $\hat{\bf b}$ and $\hat{\bf e}$. Using normal distribution theory, one could then assign a vector of probabilities $(\hat{{\bf\mbox{\boldmath$\pi$ }}})$ to each observation for each of the three populations on the basis of the solutions to the equations. The matrix ${\bf P}$ is not a fixed incidence matrix of the usual form but is the collection of $\hat{{\bf\mbox{\boldmath$\pi$ }}}$ for all observations. The value of ${\bf P}$ would be adjusted each iterate. The estimated genetic value for the transgenotype of the animal that made the observation would then be $\hat{{\bf\mbox{\boldmath$\pi$ }}}'\hat{\bf t}$. The solution for $\hat{\bf a}$ could be added to estimate genetic merit over all loci. Development of equations for solution of the parameters is not an easy exercise. One avenue is to take an iterative approach and to estimate ${\bf b}$, ${\bf a}$ and ${\bf e}$ and then ${\bf t}$, ${\bf p}$and ${\bf P}$ in a two-stage procedure per iterate. Also, one would want to make use of elements of the relationship matrix in computing ${\bf P}$if possible. Also, allowance must be made that ${\bf P}$ is not fixed, but is a probability matrix with error.

It is possible to estimate the effects of genotype through an appropriate regression of phenotype on genotype probabilities. This regression is complicated by the fact that genotype status is calculated as a set of probabilities, and is therefore measured with error. In the absence of error, each probability is a certainty with a value of either 0 or 1 and we have the usual fixed incidence matrix. With probabilities of less than one we must account for the error.

The problem is most simply illustrated for just two genotypic classes as in a haploid organism. Simple linear regression of phenotype on probability of belonging to better genotype class is overestimated. This section from Kinghorn et al. (1993) derives an appropriate correction, developed from the ideas of Cochran (1968).

Consider the regression of phenotype (y) on probability of belonging to the better of the two genotype classes (X). Then

\begin{displaymath}y_{k} = \alpha^{*} + \beta^{*}X_{k} + \epsilon_{k}^{*} \end{displaymath}

where $\alpha^{*}$ and $\beta^{*}$ are the errored regression intercept and coefficient and $\epsilon_{k}^{*}$ is the residual for individual k under this model. Xk = xk + ek is an errored measurement of xk, which is the incidence of $\beta$, the true difference between genotype class effects, for individual k. x can only take values of 0 or 1 as genotype class membership is discrete, and is related to y as follows

\begin{displaymath}y_{k} = \alpha + \beta x_{k} + \epsilon_{k} \end{displaymath}

where $\alpha$ is the unbiased regression intercept and $\epsilon_{k}$ is the residual for individual k under this true model.

The objective is to estimate $\beta$ given y and X. This can be done by first evaluating $\beta^{*}$

\begin{displaymath}\beta^{*} = \frac{{\rm Cov}(y,X)}{\sigma_{X}^{2}} \: , \end{displaymath}

where

\begin{eqnarray*}{\rm Cov}(y,X) & = & {\rm Cov}(y,x+e) \\
& = & {\rm Cov}(y,x) ...
...+ {\rm Cov}(y,e) \\
& = & \beta \sigma_{x}^{2} + {\rm Cov}(y,e)
\end{eqnarray*}


It can be shown that

\begin{displaymath}{\rm Cov}(y,e) = E(ye) = 0, \end{displaymath}

such that

\begin{displaymath}{\rm Cov}(y,X) = \beta \sigma_{x}^{2}, \end{displaymath}

giving

\begin{displaymath}\beta^{*} = \frac{{\rm Cov}(y,X)}{\sigma_{X}^{2}} =
\frac{\beta \sigma_{x}^{2}}{\sigma_{X}^{2}} \: , \end{displaymath}

and

\begin{displaymath}\hat{\beta} = \hat{\beta}^{*} \: \frac{\sigma_{X}^{2}}{\sigma_{x}^{2}} \: . \end{displaymath}

With imprecise genotype allocation, $\sigma_{X}^{2}$ is less than $\sigma_{x}^{2}$such that $\beta$ is overestimated by $\hat{\beta}^{*}$.

For diploid organisms with two sites per locus and two alleles segregating (one of these being a putative ``major gene'') there are two degrees of freedom for estimation of genotypic effects. These are taken as the effect of carrying one copy $(\beta_{1})$ and the effect of carrying two copies $(\beta_{2})$ of the major gene. The effect of carrying no copies is thus part of the regression intercept.

The ``errored'' model is now

\begin{displaymath}y_{k} = \alpha^{*} + \sum_{i=1}^{2} \: \beta_{i}^{*}X_{ik} +
\epsilon_{k}^{*}
\end{displaymath} (13)

such that the covariance between phenotype and probability of carrying j major genes (Xjk for individual k) is, following Cochran (1968),

\begin{displaymath}{\rm Cov}(y,X_{j}) = {\rm Cov}(\sum_{i} \beta_{i}^{*}X_{i} + \epsilon^{*},
X_{j}) = \sum_{i} \beta_{i}^{*}\sigma_{ij}^{*}
\end{displaymath} (14)

where

\begin{displaymath}\sigma_{ij}^{*} = {\rm Cov}(X_{i},X_{j}). \end{displaymath}

The ``true'' model is now

\begin{displaymath}y_{k} = \alpha + \sum_{i} \beta_{i}x_{ik} + \epsilon_{k} \end{displaymath}

and this lead to an alternative form for Cov(Y,Xj), first noting that E(yej) = 0, where Xj = xj + ej following the haploid case,
$\displaystyle {\rm Cov}(y,X_{j})$ = $\displaystyle {\rm Cov}(y,x_{j} + e_{j})$  
  = $\displaystyle {\rm Cov}(y,x_{j}) + {\rm Cov}(y,e_{j})$  
  = $\displaystyle {\rm Cov}(\sum_{i} \: \beta_{i}x_{i} + \epsilon, x_{j}) + 0$  
  = $\displaystyle \sum_{i} \: \beta_{i}\sigma_{ij}$ (15)

where $\sigma_{ij} = {\rm Cov}(x_{i},x_{j})$. In matrix notation we have

\begin{displaymath}{\rm Cov}(y,X) = {\bf V_{X}}\hat{\bf\mbox{\boldmath$\beta$ }}^{*} =
{\bf V_{x}}\hat{\bf\mbox{\boldmath$\beta$ }} \end{displaymath}

where ${\bf\mbox{\boldmath$\beta$ }}$ is a column vector containing $\beta_{1}$ and $\beta_{2}$, ${\bf\mbox{\boldmath$\beta$ }}^{*}$ is a column vector containing $\beta_{1}^{*}$ and $\beta_{2}^{*}$ and ${\bf V_{X}}$and ${\bf V_{x}}$ are the variance-covariance matrices of genotype probabilities and true genotypes respectively.

Thus

$\displaystyle \hat{\bf\mbox{\boldmath$\beta$ }}$ = $\displaystyle {\bf V}_{x}^{-1}{\bf V}_{X}
\hat{\bf\mbox{\boldmath$\beta$ }}^{*}$  
  = $\displaystyle {\bf W}^{-1}{\bf\mbox{\boldmath$\beta$ }}^{*}.$ (16)

This is analogous to the univariate case, where

\begin{displaymath}\hat{\beta} = \hat{\beta}^{*} \: \frac{\sigma_{X}^{2}}{\sigma_{x}^{2}} \: . \end{displaymath}

Fitting Polygenic Effects

Consider the model

\begin{displaymath}y = {\bf P}{\bf\mbox{\boldmath$\beta$ }}^{*} + {\bf Xb}^{*} + {\bf Za}^{*}
+ \varepsilon^{*}
\end{displaymath} (17)

where y is a vector of observations, ${\bf\mbox{\boldmath$\beta$ }}^{*}$is a vector of major genotype effects, ${\bf b}^{*}$ is a vector of fixed effects other than major genotype effects, and ${\bf a}^{*}$ is a vector of animals' breeding values as affected by all other loci. Asterisk superscripts denote an effect of measurement error e contained in probability estimates X. ${\bf P}$ is a design matrix containing major genotype probabilities, ${\bf X}$ and ${\bf Z}$ are design matrices relating observations to fixed and random effects.

Substituting ${\bf\mbox{\boldmath$\beta$ }}^{*} = {\bf W}{\bf\mbox{\boldmath
$\beta$ }}$ leads to the mixed model equation

\begin{displaymath}\left[ \begin{array}{lll} {\bf W'P'PW} & {\bf W'P'X} & {\bf W...
...}{c} {\bf W'P'y} \\ {\bf X'y}
\\ {\bf Z'y} \end{array} \right]
\end{displaymath} (18)

where ${\bf A}$ is the numerator relationship matrix and $\lambda =
\sigma_{\varepsilon^{*}}^{2}/\sigma_{a^{*}}^{2}$. Thus the corrections for measurement error can be incorporated within a simultaneous analysis of polygenic effects, to give estimates of genotype effects, given knowledge of ${\bf W}$.

The procedure requires data on the trait of interest plus identity of sire and dam. Data involving missing pedigree information and missing recordings are permitted for genotype probability estimation, and can be used if a sufficiently flexible mixed model method is adopted for fitting polygenic and fixed effects. A prior value for the frequency of the putative major gene is required and starting values for the effects of one and two copies of the major gene ( $\hat{\beta}_{1}$ and $\hat{\beta}_{2}$) are required. These are changed by the analysis and converge to the true values under ideal conditions. A prior estimate of heritability is required for the calculation of $\lambda$.

The first action is to calculate genotype probabilities given starting values of $\hat{\beta}_{1}$ and $\hat{\beta}_{2}$. The heights of the normal distributions at the ith individual's phenotype corrected for estimated breeding value are, in arbitrary units,

\begin{displaymath}h_{ij} = e^{-\frac{1}{2}(y_{i} - X_{i}\hat{b}^{*} - Z_{i}\hat{a}^{*}
- \hat{\beta}_{j})^{2}/\sigma_{w}^{2}} \end{displaymath}

given j = 0, 1 or 2 major genes present in individual i. Phenotype is yi, Xi and Zi are the ith rows of ${\bf X}$ and ${\bf Z}$, respectively, and $\hat{b}^{*}$ and $\hat{a}^{*}$ are the estimated fixed effects and breeding values independent of major locus effects, initially set to zero then recalculated at each passage of this step. The value of $\beta_{0}$ is zero by definition. The variance within each major locus genotype class of phenotype corrected for estimated breeding value is assumed equal between classes for simplicity, and is estimated as

\begin{displaymath}\sigma_{w}^{2} = {\rm Var}(y - \hat{a}^{*}) - \sigma_{b}^{2}, \end{displaymath}

where $\sigma_{b}^{2}$ is variance between genotype classes, estimated assuming Hardy-Weinberg equilibrium as

\begin{displaymath}\sigma_{b}^{2} - p^{2}\hat{\beta}_{2}^{2} + 2pq\hat{\beta}_{1}^{2}
- (p^{2}\hat{\beta}_{2} + 2pq\hat{\beta}_{1})^{2} \end{displaymath}

where p is the prior value allocated to frequency of the major gene and q = 1 - p.

Without using information from relatives, the probabilities for individual i of carrying j copies of the major gene are thus

For j = 2: Prob 2 = p2hi2/(p2hi2 + 2pqhi1 + q2hi0).

For j = 1: Prob 1 = 2pqhi1/(p2hi2 + 2pqhi1 + q2hi0).

For j = 0: Prob 0 = q2hi0/(p2hi2 + 2pqhi1 + q2hi0).

The next step is to fit (7.16) and (7.17) with $\lambda =
\sigma_{\varepsilon^{*}}^{2}/\sigma_{a^{*}}^{2}$. The residual variance is

\begin{displaymath}\sigma_{\varepsilon^{*}}^{2} = \sigma_{y}^{2} - \sigma_{a^{*}}^{2} -
\sigma_{Pw \beta}^{2} \end{displaymath}

where $\sigma_{y}^{2}$ is observed phenotypic variance, $\sigma_{a^{*}}^{2}$is described below, and $Pw \beta$ is predicted major locus merit deregressed for error. The variance of $Pw \beta$ is here taken as equal to $\sigma_{b}^{2}$, described above, which is true when genotype effects have been estimated at their true values. The variance of estimated genetic merits at the major locus, $\sigma_{P \beta}^{2}$, underestimates $\sigma_{Pw \beta}^{2}$ at true genotype effect values. The additive genetic variance is

\begin{displaymath}\sigma_{a^{*}}^{2} = h^{2}(\sigma_{y-Pw \beta}^{2}) = h^{2} (...
...igma_{Pw \beta}^{2}) = h^{2}(\sigma_{y}^{2} - \sigma_{b}^{2}). \end{displaymath}

The elements of matrix ${\bf W}$ are calculated from genotype probabilities as follows

\begin{displaymath}\sigma_{ii} = \overline{{\rm Prob}_{i}}.(1 - \overline{{\rm Prob}_{i}}) \end{displaymath}

and

\begin{displaymath}\sigma_{ij} = -1. \overline{{\rm Prob}_{i}}. \overline{{\rm Prob}_{j}} \end{displaymath}


\begin{displaymath}\sigma_{ii}^{*} = {\rm Var}({\rm Prob}_{i}) \end{displaymath}

and

\begin{displaymath}\sigma_{ij}^{*} = {\rm Cov}({\rm Prob}_{i},{\rm Prob}_{j}) \end{displaymath}

where $\overline{{\rm Prob}_{i}}$ is the mean probability of carrying imajor genes over all animals with a record, and ${\rm Var}({\rm Prob}_{i})$is the variance of the probability of carrying i major genes over all animals with a record. Following this regression, genotype probability calculations are again carried out, and the full cycle repeated sufficient times to give convergence in $\hat{\beta}_{1}$ and $\hat{\beta}_{2}$.

Hoeschele (1988) has given ideas similar to those presented to provide joint estimates of major gene effects, gene frequencies and predictors of additive effects of polygenes. As before the model is

\begin{displaymath}{\bf y} = {\bf Pg} + {\bf Xb} + {\bf Za} + {\bf e} \end{displaymath}

with elements as previously defined. Hoeschele (1988) used a Bayesian approach to derive the following iterative system of nonlinear equations (for cycle of iteration n+1)

\begin{displaymath}\left[ \begin{array}{lll} {\bf D}^{n} & {\bf Q'}^{n}{\bf X} &...
...f Q'}^{n}{\bf y} \\ {\bf X'y} \\ {\bf Z'y} \end{array} \right] \end{displaymath}

where

\begin{displaymath}{\bf D}^{n} = {\rm Diag}\{ \sum_{i=1}^{N} \: \pi_{ij}^{n} \vert {\bf y}\} \end{displaymath}


\begin{displaymath}{\bf Q}^{n} = {\bf P}^{n} \vert {\bf y} \end{displaymath}

for i = 1, 2, ..., N = the number of animals and j = 1, 2, ..., m = the number of genotypes.

The elements of ${\bf Q}^{n}$ are the posterior probabilities conditional on the data that the ith individual has major genotype j assuming that ${\bf t}$, ${\bf b}$ and ${\bf a}$ are known. Approximations are presented to estimate ${\bf P}$.

The procedure is iterative starting with initial values of ${\bf P}^{o}$, ${\bf t}^{o}$, ${\bf b}^{o}$ and ${\bf a}^{o}$ and terminating when ${\bf P}^{n}$, ${\bf t}^{n}$, ${\bf b}^{n}$ and ${\bf a}^{n}$ equal ${\bf P}^{(n+1)}$, ${\bf t}^{(n+1)}$, ${\bf b}^{(n+1)}$ and ${\bf a}^{(n+1)}$ within some specified level of convergence. If $\lambda$ is not known, Hoeschele (1988) presented computing algorithms for estimating $\sigma_{a}^{2}$ and $\sigma_{e}^{2}$.

The methods of Hoeschele (1988) and Kinghorn et al. (1993) use a single estimate of polygenic breeding value for each animal irrespective of its genotype, and Hofer and Kennedy (1993) extended this to use three values for each animal depending on its genotype but independent of the genotypes of all other animals. Hofer and Kennedy (1993) compared their method to those of Hoeschele (1988) and Kinghorn et al. (1993) by simulation.

Phenotypic observations were generated by using the following mixed model

yijk = hysi + gj + aijk + eijk,

where hysi is the fixed effect of herd-year-sex i, gj is the fixed effect of majorlocus genotype j, aijk is the polygenic breeding value and eijk is the random residual effect. The effects in the model were sampled as follows

\begin{displaymath}\{hys_{i}\} \sim N(0,{\bf I}\sigma_{h}^{2}), \end{displaymath}


\begin{displaymath}\{a_{ijk}\} \sim N(0,{\bf A}\sigma_{a}^{2}) \end{displaymath}

and

\begin{displaymath}\{e_{ijk}\} \sim N(0,{\bf I}\sigma_{e}^{2}. \end{displaymath}

Major locus genotypes were simulated with two segregating alleles.

Three different sets of parameters were used. Only additive effects of the major locus were considered although all of the methods compared allow for dominance. In the first set of parameters 50% of the phenotypic variance (variance due to major locus + polygenic variance + residual variance) was due to genetic effects, 75% of the genetic variance was due to the major locus and 25% was due to the polygenes. The frequency of allele A with major effect was 25% in the base population, which resulted in an allele substitution effect $\alpha$ of 1.0, i.e. genotype effects of 2.0 (AA), 1.0 (Aa) and 0 (aa). In parameter set 2 the allele frequency p was .5, but the genotype effects as well as all other parameters were the same as in set 1. Thus the variance due to the major locus was increased from .375 to .5, and the phenotypic variance changed from 1.0 to 1.125. In parameter set 3 the allele frequency p was .25 and 50% of the phenotypic variance was due to genetic effects, as in parameter set 1, but the proportion of genetic variance due to the polygenes was increased from 25% to 40% which resulted in an allele substitution effect $\alpha$ of $\sqrt{.8}$.

In each of 10 herds, 20 base dams each had a record in year 1. A group of 20 base sires each with their own record in a common herd-year (e.g. test station) was mated to these base dams. Each sire was randomly mated to one dam in each herd. Each mating produced 5 progeny in year 2.

Table 7.1 shows the simulation results for the three parameter sets using all three procedures when major locus genotypes were unknown. For parameter sets 1 and 2 estimates of major locus effects ${\bf g}$ were close to the true values or slightly underestimated with approximated maximum likelihood (AML) of Hofer and Kennedy (1993), underestimated by about 20% with the method of Hoeschele (1988) and overestimated by 25 to 30% with the method of Kinghorn et al. (1993). For parameter set 3 estimates of major locus effects ${\bf g}$ were zero for 2 replicates using AML and for 21 replicates using the method of Hoeschele (1988). Non-zero estimates of ${\bf g}$ were biased upwards with AML by 14% and with the method of Kinghorn et al. (1993) by 47%. Both AML and the method of Hoeschele (1988) showed a large variability of the non-zero estimates of major locus effects for parameter set 3. When the true allele frequency was 0.25 the allele frequency p was substantially underestimated with AML, but estimated quite well with the two other methods. Correlations between true and predicted breeding values were similar for AML and the method of Hoeschele (1988), but zero for the method of Kinghorn et al. (1993). For parameter sets 1 and 2 the correlations between true (${\bf Pg}$) and estimated ( $\hat{\bf P}\hat{\bf g}$) major locus effects were similar for all three methods. When major locus effects were smaller (parameter set 3) these correlations were largest with the method of Kinghorn et al. (1993). Predicted breeding values were positively correlated to estimated major locus effects $\hat{\bf P}\hat{\bf g}$with AML and to a larger extent with the method of Hoeschele (1988). Using the method of Kinghorn et al. (1993) these correlations were strongly negative.

Table 7.1
Estimates of major locus effects ${\bf g}$ and allele frequency p, and accuracies of predicted polygenic breeding values and estimated major locus effects with unknown major locus genotypes using maximum likelihood with approximations (AML) and the methods of Hoeschele (1988) and Kinghorn et al. (1993) (mean and standard deviation over 25 replicates, starting values = true values)

  AML Hoeschele Kinghorn
  Mean SD Mean SD Mean SD
Parameter set 1            
$\hat{g}_{1}$ 2.058 .237 1.659 .189 2.607 .132
$\hat{g}_{2}$ 1.067 .095 .744 .172 1.302 .072
$\hat{p}$ .115 .033 .235 .044 .246 .033
$r_{a,\hat{a}}$ .394 .062 .405 .078 .023 .125
$r_{Pg,\hat{P}\hat{g}}$ .684 .061 .720 .058 .701 .059
$r_{\hat{a},\hat{P}\hat{g}}$ .385 .128 .632 .101 -.627 .028
Parameter set 2            
$\hat{g}_{1}$ 1.836 .085 1.628 .089 2.495 .074
$\hat{g}_{2}$ .894 .086 .779 .101 1.227 .068
$\hat{p}$ .497 .103 .498 .048 .496 .043
$r_{a,\hat{a}}$ .377 .076 .375 .068 -.060 .144
$r_{Pg,\hat{P}\hat{g}}$ .752 .035 .752 .035 .729 .040
$r_{\hat{a},\hat{P}\hat{g}}$ .614 .085 .711 .066 -.647 .031
Parameter set 31            
$\hat{g}_{1}$ 2.042 .686 1.779 .914 2.664 .114
$\hat{g}_{2}$ 1.019 .264 .272 .257 1.300 .074
$\hat{p}$ .041 .024 .181 .101 .253 .027
$r_{a,\hat{a}}$ .468 .060 .457 .084 .001 .126
$r_{Pg,\hat{P}\hat{g}}$ .455 .147 .486 .134 .609 .078
$r_{\hat{a},\hat{P}\hat{g}}$ .236 .130 .420 .178 -.649 .029
1
Included are only 23 (AML) and 4 (Hoeschele) replicates with non-zero estimates for ${\bf g}$

In summary, AML generally slightly understimates major locus effects ${\bf g}$ and seriously underestimates allele frequency p when the true frequency is 0.25. The underestimation of p leads to increased estimates of ${\bf g}$.

The method of Hoeschele (1988) consistently underestimates major locus effects ${\bf g}$ which is in agreement with her simulation results. For smaller allele effects (parameter set 3), although still quite large, most of the estimates of ${\bf g}$ were zero, indicating that the genotype effects have to be large in order to be recognized.

With the method of Kinghorn et al. (1993) estimates of the allele frequency p were generally closer to the true values than with the two other procedures. However, major locus effects were overestimated and the correlations between true and predicted breeding values were close to zero.

Clearly, none of the methods is very satisfactory for a separate genetic evaluation for the major locus and the polygenes. In this study only large effects were considered. AML and especially the method of Hoeschele (1988) were unable to detect smaller effects than used with parameter set 3.

More work is required in this area.


next up previous

This LaTeX document is available as postscript or asAdobe PDF.

Larry Schaeffer
1999-10-27