next up previous


This LaTeX document is available as postscript or asAdobe PDF.

Categorical Data Analyses
L. R. Schaeffer, April 2000

INTRODUCTION

A categorical trait is one which is expressed phenotypically as a member of one of m categories. When m=2, then the trait is an "all-or-none" trait. Most disease traits are binomial in nature, where the animal is either diseased or not diseased. Traits with more than one category are calving ease or conformation traits. The categories are arranged in a delineated sequence from one extreme expression to the opposite extreme expression. Calving ease, for example, the categories are (1) unassisted calving, (2) slight assistance, (3) hard pull, or (4) caesarian section required. Rump angle, for example, has nine categories where the pins range from too low to too high with the desirable category being somewhere in the middle.

Categorical traits may be inherited in a polygenic manner with a very large number of genes involved. The underlying susceptibility to a disease trait, calving ease, or type trait may actually be continuous and may follow a normal distribution. The underlying continuous scale is known as the liability scale. On this liability scale is a threshold point (or points) where above this threshold the animal expresses the disease phenotype, and below the threshold point the animal does not express the disease. The liability scale is only conceptual and cannot be observed.

Two papers were published in 1983 by Gianola and Foulley, and 1984 by Harville and Mee both describing the same procedure for the analysis of categorical data. The reader is referred to these papers for the theoretical details. The derivation is Bayesian in nature. The model employed is known as a "threshold" model. The details of this procedure are described in these notes.

THRESHOLD MODEL COMPUTATIONS

The general linear model is

\begin{displaymath}{\bf y} = {\bf Xb} + {\bf Zu} + {\bf e}, \end{displaymath}

where ${\bf y}$ is the observation vector, but in the case of categorical data represents a variable on the underlying liability scale. The underlying liability scale is unobservable and therefore, ${\bf y}$ is unknown. However, ${\bf y}$ is known to be affected by fixed effects, ${\bf b}$, and random effects, ${\bf u}$, such as animal additive genetic effects, and random environmental (or residual) effects, ${\bf e}$. The assumed covariance structure is

\begin{displaymath}Var \left( \begin{array}{c} {\bf u} \\ {\bf e} \end{array}\ri...
...cc} {\bf G} & {\bf0} \\
{\bf0} & {\bf R} \end{array} \right). \end{displaymath}

The analysis requires that the threshold points be estimated from the categorical data simultaneously with the estimation of ${\bf b}$ and ${\bf u}$. This makes for a set of non-linear estimation equations. Initial values of the thresholds are used to estimate ${\bf y}$, which are then used to estimate new threshold points and ${\bf b}$ and ${\bf u}$. This must be repeated until the threshold points do not change.

There are various quantities which need to be computed repeatedly in the analysis, and these are based on normal distribution functions.

1.
$\Phi(x)$ is known as the cumulative distribution function of the normal distribution. This function gives the area under the normal curve up to the value of x, for x going from minus infinity to plus infinity (the range for the normal distribution). For example, if x=.4568, then $\Phi(x)=.6761$, or if x=-.4568, then $\Phi(x)=.3239.$ The short program given at the end of these notes computes $\Phi(x)$ for a value of x. Let $\Phi_{k}$ represent the value up to and including category k. Note that if there are m categories and k=m, then $\Phi_{k}=1$.

2.
$\phi(x)$ is a function that gives the height of the normal curve at the value x, for a normal distribution with mean zero and variance 1. That is,

\begin{displaymath}\phi(x) = (2\pi)^{-.5} \exp{-.5 x^{2}}. \end{displaymath}

For example, if x=1.0929, then $\phi(x)=.21955.$

3.
P(k) is the probability that x from a N(0,1)distribution is between two threshold points, or is in category k. That is,

\begin{displaymath}P(k) = \Phi_{k} - \Phi_{k-1}. \end{displaymath}

If k=1, then $\Phi_{k-1}=0.$

DATA

Data should be arranged by smallest subclasses. Consider the calving ease data from the paper by Gianola and Foulley (1983).

Calving Ease Data
Herd- Age of Sex of Sire of Category Total
Year Dam(yr) Calf Calf 1 2 3  
1 2 M 1 1 0 0 1
1 2 F 1 1 0 0 1
1 3 M 1 1 0 0 1
1 2 F 2 0 1 0 1
1 3 M 2 1 0 1 2
1 3 F 2 3 0 0 3
1 2 M 3 1 1 0 2
1 3 F 3 0 1 0 1
1 3 M 3 1 0 0 1
               
2 2 F 1 2 0 0 2
2 2 M 1 1 0 0 1
2 3 M 1 0 0 1 1
2 2 F 2 1 0 1 2
2 3 M 2 1 0 0 1
2 2 F 3 0 1 0 1
2 3 M 3 0 0 1 1
2 2 M 4 0 1 0 1
2 2 F 4 1 0 0 1
2 3 F 4 2 0 0 2
2 3 M 4 2 0 0 2

Let njk represent the number of observations in the kthcategory for the jth subclass, and nj. represent the marginal total for the jth subclass. There are s=20 subclasses in the above example data, and m=3 categories for the trait.

EQUATIONS TO SOLVE

The system of equations to be solved can be written as follows:

\begin{displaymath}\left( \begin{array}{ccc} {\bf Q} & {\bf L}'{\bf X} &
{\bf L...
...} \\ {\bf Z}'{\bf v} -{\bf G}^{-1}{\bf u}
\end{array} \right). \end{displaymath}

The equations must be solved iteratively. Note that $\Delta {\bf b}$, for example, is the change in solutions for ${\bf b}$ between iterations. The calculations for ${\bf Q}$, ${\bf L}$, ${\bf W}$, ${\bf p}$, and ${\bf v}$ need to be described. The values of these matrices and vectors change with each iteration of the non-linear system. Using the example data on calving ease, then ${\bf X}$ contains columns referring to an overall mean, herd-year effects, age of dam effects, and sex of calf effects, and ${\bf Z}$ contains columns referring to sires, as follows:

\begin{displaymath}{\bf X} = \left( \begin{array}{rrrrrrr}
1 & 1 & 0 & 1 & 0 & 1...
...& 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1
\end{array} \right). \end{displaymath}

The vector ${\bf t}$ will contain the threshold points for a N(0,1) distribution at the end of the iterations.

The process is begun by choosing starting values for ${\bf b}$, ${\bf u}$, and ${\bf t}$. Let ${\bf b}={\bf0}$ and ${\bf u}={\bf0}$, then starting values for ${\bf t}$ can be obtained from the data by knowing the fraction of animals in the first category (i.e. 0.6786), and in the first two categories (i.e. 0.8572). The threshold values that give those percentages are t1 = 0.3904, and t2 = 0.9563.

Suppose that several iterations have been performed and the latest solutions are

\begin{displaymath}{\bf t} = \left( \begin{array}{r} .37529 \\ 1.01135 \end{arra...
...} -.08154 \\
.06550 \\ .12280 \\ -.10676 \end{array} \right). \end{displaymath}

The following calculations are performed for each subclass from j=1 to s, those for j=1 are shown below:

1.
$f_{jk} = t_{k} - {\bf x}'_{j}{\bf b} - {\bf z}'_{j}{\bf u}$for k=1 to (m-1).

\begin{eqnarray*}f_{11} & = & t_{1} - \mu - H_{1} - A_{1} - S_{1} - u_{1} \\
&...
... t_{2} - \mu - H_{1} - A_{1} - S_{1} - u_{1} \\
& = & 1.09289.
\end{eqnarray*}


2.
For k=1 to m, $\Phi_{jk} = \Phi(f_{jk})$.

\begin{eqnarray*}\Phi_{11} & = & \Phi(.45683) \\
& = & .6761, \\
\Phi_{12} &...
...28, \\
\Phi_{13} & = & 1, \ \mbox{and} \\
\Phi_{10} & = & 0.
\end{eqnarray*}


3.
For k=1 to m, $P_{jk} = \Phi_{jk} - \Phi_{j(k-1)}$.

\begin{eqnarray*}P_{11} & = & .6761 \ - \ 0 \ = \ .6761, \\
P_{12} & = & .8628...
...6761 \ = \ .1867, \\
P_{13} & = & 1 \ - \ .8628 \ = \ .1372.
\end{eqnarray*}


4.
For k=0 to m, $\phi_{jk} = \phi(f_{jk})$.

\begin{eqnarray*}\phi_{j0} & = & 0.0 \\
\phi_{j1} & = & \phi(.45683) \ = \ .35...
...{j2} & = & \phi(1.09289) \ = \ .2196, \\
\phi_{j3} & = & 0.0.
\end{eqnarray*}


5.
The matrix ${\bf W}$ is a diagonal matrix of order equal to the number of smallest subclasses, s. The elements of ${\bf W}$ provide a weighting factor for the jth subclass that depends on the number of observations in that subclass and on the values of $\phi_{jk}$ and Pjk. The jth diagonal is given by

\begin{displaymath}w_{jj} = n_{j.} \sum_{k=1}^{m}( \phi_{j(k-1)} - \phi_{jk})^{2}/
P_{jk}. \end{displaymath}

For the first subclass,

\begin{eqnarray*}w_{11} & = & n_{1.} \left( \frac{(\phi_{10}-\phi_{11})^{2}}{P_{...
...}}{.1867}
+ \frac{(.2196)^{2}}{.1372} \right) \\
& = & .6471.
\end{eqnarray*}


The values for all subclasses were

$( \ .6471$ .5167 .6082 .5692 1.3058 1.5721 1.4002
.5447 .6687 1.2380 .7196 .6923 1.3247 .7234
.6776 .7334 .7146 .6110 1.1373 1.3723 )

6.
The vector ${\bf v}$ is created and used in place of ${\bf y}$, which is unknown. For the jth subclass,

\begin{displaymath}v_{j} = \sum_{k=1}^{m} n_{jk} ( \phi_{j(k-1)} - \phi_{jk})/P_{jk}. \end{displaymath}

For j=1, then

\begin{eqnarray*}v_{j} & = & [ 1 (\phi_{10}-\phi_{11})/P_{11} + 0
(\phi_{11}-\...
...\phi_{13})/P_{13} ], \\
& = & -.3594/.6761, \\
& = & -.5316.
\end{eqnarray*}


The complete vector ${\bf v}$ is

$( \ -.5316$ -.3475 -.4671 .9848 1.0414 -1.0678 -.0928
1.0521 -.5731 -.9676 -.6993 1.4632 .9961 -.7115
.6416 1.3042 .4859 -.4713 -.8220 -1.2216 )

7.
The matrix ${\bf L}$ is of order $s \times (m-1)$ and the jkth element is

\begin{displaymath}\ell_{jk} = -n_{j.} \phi_{jk}[(\phi_{jk}-\phi_{j(k-1)})/P_{jk}
- (\phi_{j(k+1)}-\phi_{jk})/P_{j(k+1)}]. \end{displaymath}

For the example data,

\begin{displaymath}{\bf L} = \left( \begin{array}{rr}
-.46033 & -.18681 \\ -.41...
...
-.87141 & -.26588 \\ -.92684 & -.44551 \end{array} \right). \end{displaymath}

8.
The matrix ${\bf Q}$ is a tri-diagonal matrix of order $(m-1) \times (m-1)$, however, this is only noticeable when m is greater than 3. The diagonals of ${\bf Q}$ are given by

\begin{displaymath}q_{kk} = \sum_{j=1}^{s} n_{j.} \phi_{jk}^{2}
(P_{jk}+P_{j(k+1)})/P_{jk}P_{j(k+1)}, \end{displaymath}

and the off-diagonals are

\begin{displaymath}q_{k(k+1)} = - \sum_{j=1}^{s} n_{j.} \phi_{jk}\phi_{j(k+1)}/P_{j(k+1)}. \end{displaymath}

For the example data,

\begin{displaymath}{\bf Q} = \left( \begin{array}{rr} 24.12523 & -11.55769 \\
-11.55769 & 16.76720 \end{array} \right). \end{displaymath}

9.
The elements of vector ${\bf p}$ are given by

\begin{displaymath}p_{k} = \sum_{j=1}^{s} \phi_{jk} [ (n_{jk}/P_{jk})
- (n_{j(k+1)}/P_{j(k+1)}) ], \end{displaymath}

for k=1 to (m-1). Hence,

\begin{displaymath}{\bf p}' = \left( \begin{array}{rr} .003711 & .000063
\end{array} \right). \end{displaymath}

Assume that ${\bf G} = {\bf I}(\frac{1}{19})$ for the example data. There are dependencies in ${\bf X}$ and with the threshold points, so that 4 restrictions on the solutions are needed. Let the solutions for $\mu$, H1, A1, and S1 be restricted to zero by removing those columns from ${\bf X}$. The remaining calculations give the following submatrices of the mixed model equations.

\begin{eqnarray*}{\bf L}'{\bf X} & = & \left( \begin{array}{rrr}
-6.8311 & -6.6...
...rray}{rrrr}
23.4219 & 24.4953 & 23.0246 & 22.8352 \end{array} ).
\end{eqnarray*}


The solutions (change in estimates) to these equations are added to the estimates used to start the iteration to give the new estimates, as shown in the table below.

  $\Delta \beta$ $+ \beta_{i-1}$ $= \beta_{i}$
t1 .000229 .37529 .375519
t2 .000158 1.01135 1.011508
H1 0.0 0.0 0.0
H2 -.000046 .29752 .297473
A1 0.0 0.0 0.0
A2 -.000013 -.12687 -.126883
S1 0.0 0.0 0.0
S2 .000064 -.39066 -.390596
u1 .000011 -.08154 -.081529
u2 -.000013 .06550 .065487
u3 -.000014 .12280 .122786
u4 .000017 -.10676 -.106743

As further iterations are completed, then the right hand sides to the equations approach a null vector, so that the solutions, $\Delta \beta$, also approach a null vector.

There could be problems in obtaining convergence, or a valid set of estimates. In theory the threshold points should be strictly increasing. That is, tm should be greater than t(m-1) should be greater than t(m-2) and so on. The solutions may not come out in this order. This could happen when two thresholds are close to each other due to the definition of categories. An option would be to combine two or more of the categories involved in the problem, and to re-analyze with fewer categories.

Another problem could occur when all observations fall into either extreme category. This is known as an "extreme category" problem. Suggestions have been to delete subclasses with this problem, but throwing away data is not a good idea. Other types of analyses may be needed.

Estimation of Variance Components

Harville and Mee (1984) recommend a REML-like procedure for estimating the variance ratio of residual to random effects variances. One of the assumptions in a threshold model is that the residual variance is fixed to a value of 1. Hence only that variances of the random effects need to be estimated. Let a generalized inverse of the coefficient matrix in the equations be represented as

\begin{displaymath}{\bf C} = \left( \begin{array}{lll}
{\bf C}_{tt} & {\bf C}_{t...
...\bf C}_{zt} & {\bf C}_{zx} & {\bf C}_{zz} \end{array} \right). \end{displaymath}

Then the REML EM estimator of variance of the random effects is

\begin{displaymath}\sigma^{2}_{u} = ( {\bf u}'{\bf G}^{-1}{\bf u} +
tr({\bf G}^{-1}{\bf C}_{zz})) / d, \end{displaymath}

where d is the number of levels in ${\bf u}$.

Using the example data, and the recent estimates (as final values),

\begin{eqnarray*}{\bf u}'{\bf G}^{-1}{\bf u} & = & .0374061, \\
tr {\bf C}_{zz}...
...{2}_{u} & = & (.0374061 \ + \ .18460 ) / 4, \\
& = & .0555015.
\end{eqnarray*}


The new variance ratio is (.0555015)-1 = 18.0175. The REML EM-like algorithm would need to be repeated until the variance estimate is converged. Hence there are two levels of iteration going on at the same time in a very non-linear system of equations.

The final estimate of the variances are in terms of the liability or underlying normal distribution scale. If the number of categories was only two, then heritability can be converted to the "observed" scale using the formula

h2obs = h2lia z2/ (p(1-p)),

where p is the proportion of observations in one category, and z is the height of the normal curve at the threshold point (on the observed scale).

Expression of Genetic Values

Using the estimates from the non-linear system of equations, the probability of a sire's offspring to fall into each category can be calculated. Take sire 3 as an example. First the threshold point for male offspring of 2-yr-old dams in herd-year 1 needs to be determined as

\begin{eqnarray*}x & = & t_{1} + H_{1} + A_{1} + S_{1} + u_{3}, \\
& = & .375519 + 0 + 0 + 0 + .122786, \\
& = & .498305.
\end{eqnarray*}


Using the short program below, then

\begin{displaymath}\Phi(x) = \Phi(.498305) = .69087. \end{displaymath}

Similarly, the probability of the sire's offspring to be in categories 1 and 2 would be based on the second threshold,

\begin{displaymath}x = 1.011508 \ + \ .122786 = \ 1.134294, \end{displaymath}

or $\Phi(x) = .87166.$ Thus, the proportion of offspring in category 2 would be $.87166 \ - \ .69087 \ = \ .18079,$and the proportion in category 3 would be $1.0 \ - \ .87166 \ = \ .12834.$In the raw data about .68 of the offspring were in category 1, and therefore sire 3 has only .01 greater in probability for category 1, and for categories 1 and 2 the raw percentage was .86. With the low heritability assumed, sires would not be expected to have greatly different probabilities from the raw averages.

Notice that the probabilities are defined in terms of one of the smallest fixed effects subclasses. If 3-yr-old dams and female calves were considered, then the new threshold values would be lower, thereby giving fewer offspring in category 1 and maybe more in categories 2 or 3, because the solutions for 3-yr-old dams is negative and for female calves is negative. If an average offspring is envisioned, then solutions for ages of dam can be averaged and solutions for sex of calf can be averaged. A practical approach would be to choose the subclass where there seems to be the most economic losses and rank sires on that basis.

Program to compute $\Phi(x)$

      Real*8 Height,Q
      Real*4 x,p
      print *,'Enter a number greater than 99 to quit'
    5 print *,'Enter an x-axis value'
      read(5,*)x
        if(x.gt.99.0)go to 99
       p = pnor(x)
       Q = -0.5*x*x
       Height = dexp(Q)/2.506628
      print 104,x,p,Height
  104 format('  x = ',f15.5/
             '  p = ',f15.5/
         '  Height = ',f10.5)
      go to 5
   99 print 105
  105 format('  End of Program')
      stop
      end


next up previous

This LaTeX document is available as postscript or asAdobe PDF.

Larry Schaeffer
2000-04-03