This LaTeX document is available as postscript or as Adobe PDF.

Linear Models
L. R. Schaeffer, March 1999

Introduction

The quality of any statistical analysis is best judged by the model that describes the data. A model represents the sampling nature of the data and reflects the biology of the problem. There are three conceptual levels of models.

• A true model describes the data perfectly, leaving no unexplained variation. The true model is never known exactly and may not necessarily be a linear model.
• An ideal model is as close to the true model as possible. The ideal model should be used for the analysis, but often cannot be used for various reasons.
• An operational model is a simplified version of the ideal model, and is used in the analysis.

A good operational model is derived from the ideal model. Given the data and resource limitations under which the researcher must function, the operational model is a simplification of the ideal model. If too many simplifications have to be made, then perhaps an analysis may not be worthwhile. The assumptions to go from the ideal model to the operational model should be known, and thus the quality of the operational model can be judged.

Observations

The observation vector contains elements resulting from measurements, either subjective or objective, on the experimental units (usually animals) under study. The elements in the observation vector are random variables that have a multivariate distribution, and if the form of the distribution is known, then advantage should be taken of that knowledge.

Factors

Factors refer to variables, either discrete or continuous, which may influence or are related to the elements in the observation vector. For example, the milk yield of a dairy cow is known to be influenced by her age at calving, the season in which she calved, her genetic potential, and the number of days nonpregnant, to list a few. Model building requires that all useful factors be identified.

Discrete factors usually have classes or levels. For example, age at calving could have four levels (e.g. 20 to 24 months, 25 to 28 months, 29 to 32 months, and 33 months or greater). An analysis of the data would then provide estimates of differences in milk yield among cows in the various age levels. Alternatively, age might be treated as a covariable with linear and quadratic effects upon milk yield, and the regression coefficients relating age and age squared to milk yield would be estimated in the analysis.
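The two ways of handling age can be sketched in code. The following is a minimal illustration in plain Python, assuming the four age classes given above; the ages in `ages` are invented for the example, and the function names are hypothetical. Each record contributes either a row of 0/1 indicators (age as a discrete factor) or a row holding age and age squared (age as a covariable).

```python
# Hypothetical ages at calving in months (illustrative values, not from the notes).
ages = [22, 26, 30, 35, 27]

# Upper bounds of the first three classes; anything above 32 falls in class 4.
# Assumption: ages below 20 months do not occur (they would land in class 1).
CLASS_UPPER_BOUNDS = [24, 28, 32]


def age_class(age_months):
    """Return the 0-based index of the age class for one cow."""
    for index, upper in enumerate(CLASS_UPPER_BOUNDS):
        if age_months <= upper:
            return index
    return len(CLASS_UPPER_BOUNDS)


def class_columns(age_months):
    """Age as a discrete factor: one 0/1 indicator column per class."""
    row = [0] * (len(CLASS_UPPER_BOUNDS) + 1)
    row[age_class(age_months)] = 1
    return row


def covariate_columns(age_months):
    """Age as a covariable: linear and quadratic columns."""
    return [age_months, age_months ** 2]


X_classes = [class_columns(a) for a in ages]
X_covariates = [covariate_columns(a) for a in ages]
```

With classes, the analysis estimates one effect per level (four here); with the covariable form it estimates two regression coefficients regardless of how many distinct ages occur.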

Some factors may be of special interest to the researcher, while other factors need to be included in the model in order to reduce the residual variation. For example, a researcher could be interested in the effects of various levels of application of growth hormones to beef cattle, but the model must also include the effects of age of the animal, sex, location, diet, and breed. The latter group of effects are often referred to as nuisance factors. Nuisance factors cannot be ignored or omitted from the model because this could drastically alter the interpretation of results for the factors of main interest.

Fixed and Random Factors

In the traditional "frequentist" approach, fixed and random factors need to be distinguished. In a Bayesian approach there is no such distinction between factors. Both approaches will be used in this course.

Fixed factors are factors in which the classes comprise all of the possible classes of interest that could be observed. For example, the sex of an animal is either male, female, sterilized male, or sterilized female. If the number of classes in a factor is small and confined to this number even if conceptual resampling were performed an infinite number of times, then the factor is likely fixed. Other examples are age classes, lactation number, management system, cage number, and breed class. Usually if the sampling were to be repeated a second time, those factors which maintain the same classes between the two samplings would be fixed factors. For example, a growth trial on pigs using two diets would probably need to use the same housing facilities, the same age groups of pigs, and the same diets, but the individual pigs would necessarily have to be new animals because an animal could not go through the same growth phase a second time in its life. Pig effects would be considered a random factor while the other effects would be fixed.

Random factors are factors whose levels are considered to be drawn randomly from an infinitely large population of levels. In the previous pig experiment, pigs were considered random because the pig population of the world is large enough to be considered infinitely large, and the group involved in that experiment was a random sample from that population. In actual fact, however, the pigs in that experiment were likely sampled from those relatively few pigs that were available at the time the trial started, but they are still considered to be a random factor because if the experiment were to be repeated, there would likely be a completely different group of pigs involved.

Another way to determine if a factor is fixed or random is to know how the results will be used. In a nutrition trial the results infer something about the diets in the trial. The diets are specific and no inferences should be made about other diets not tested in the experiment. Hence diet effects would be a fixed factor. In contrast, if animal effects were in the model, inferences about how any animal might respond to a specific diet may need to be made. There should not be anything peculiar about the animal on the trial that would nullify that inference. Animal effects would be a random factor.

In general, a few questions need to be answered to make the correct choice of fixed or random factor designation. Some of the questions are

1. How many levels of the factor are in the model? If small, then perhaps this is a fixed factor. If large, then perhaps this is a random factor.
2. Is the number of levels in the population large enough to be considered infinite? If yes, then perhaps this factor is random.
3. Would the same levels be used again if the experiment were to be repeated a second time? If yes, then perhaps this factor is fixed.
4. Are inferences to be made about levels not included in the experiment? If yes, then perhaps this factor should be random.
5. Were the levels of a factor determined in a nonrandom manner? If yes, then perhaps this factor should be treated as fixed.

By studying the scientific literature, a researcher should be able to get some help in this decision process. If in doubt, then the assistance of an experienced statistician should be sought.

In a Bayesian context, a prior distribution needs to be assumed for each of the factors. Random factors might typically be assumed to have a Normal distribution with a particular mean and variance. For fixed factors, a uniform distribution may be assumed, or more generally a prior density that is proportional to a constant. In a Bayesian context, even the variances need to have an assumed prior distribution. The prior distributions are combined with the distribution of the data, given the factors and variances, to arrive at a posterior distribution from which inferences may be made.
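As a sketch of how these pieces fit together, assume for illustration that the variances are known, that the fixed factors are collected in a vector $\mathbf{b}$ with a flat prior, and that the random factors are collected in a vector $\mathbf{u}$ with a Normal prior having mean zero and variance-covariance matrix $\mathbf{G}$. These symbols, and the data vector $\mathbf{y}$, are notation assumed here for the example. By Bayes' theorem the joint posterior density is proportional to the likelihood times the priors:

```latex
\[
p(\mathbf{b}, \mathbf{u} \mid \mathbf{y}) \;\propto\;
p(\mathbf{y} \mid \mathbf{b}, \mathbf{u})\, p(\mathbf{u})\, p(\mathbf{b}),
\qquad p(\mathbf{b}) \propto \text{constant},
\qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{G}).
\]
```

When the variances are unknown, their prior densities enter the product as additional factors and the posterior is taken jointly over the factors and the variances.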

Models

Only linear models are discussed. A linear model contains a set of factors which additively affect the observations, but a variable within a factor may represent, for example, a squared term. Linear models are adequate in most biological circumstances. This does not imply that nonlinear models are unimportant. Nonlinear relationships may often be approximated by a linear model. However, if a nonlinear model gives a better ideal model than a linear model, then the nonlinear model should be utilized. Texts that deal with nonlinear model methods should be consulted.

A linear model, in the traditional sense, is composed of three parts:

1. The equation.
2. Expectations and Variance-Covariance matrices of random variables.
3. Assumptions, restrictions, and limitations.

The Equation

The equation of the model defines the factors that may have an effect on the observed trait. A matrix formulation of a general model equation is as follows:

\[ \mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}, \]

where
$\mathbf{y}$ is the vector of observed values of the trait,
$\mathbf{b}$ is a vector of factors, collectively known as fixed effects,
$\mathbf{u}$ is a vector of factors known as random effects,
$\mathbf{e}$ is a vector of residual terms, also random,
$\mathbf{X}$ and $\mathbf{Z}$ are known matrices, commonly known as design matrices, that describe the precise relationship between the elements of $\mathbf{b}$ and $\mathbf{u}$ with those of $\mathbf{y}$.

This equation is a mixed model that contains both fixed and random factors. Because the vector $\mathbf{e}$ is a random factor and because most models include an effect due to the overall mean, then technically speaking, all linear models are mixed models. However, a fixed effects model is one in which $\mathbf{Z}\mathbf{u}$ does not appear, and a random effects model is one in which $\mathbf{X}\mathbf{b} = \mathbf{1}\mu$, i.e. no fixed factors other than the overall mean effect.

Expectations and VCV Matrices

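In general terms, the expectations are

\[ E\begin{pmatrix} \mathbf{y} \\ \mathbf{u} \\ \mathbf{e} \end{pmatrix} = \begin{pmatrix} \mathbf{X}\mathbf{b} \\ \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \]

and the variance-covariance matrices are

\[ Var\begin{pmatrix} \mathbf{u} \\ \mathbf{e} \end{pmatrix} = \begin{pmatrix} \mathbf{G} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} \end{pmatrix}, \]

where $\mathbf{G}$ and $\mathbf{R}$ are general square matrices assumed to be nonsingular and positive definite, and the elements of which are assumed known. Also,

\[ Var(\mathbf{y}) = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}. \]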
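Under the usual assumption that the random effects and the residuals are uncorrelated, the variance of the observations implied by a mixed model is ZGZ' + R, where Z is the design matrix of the random effects, G is their variance-covariance matrix, and R is the residual variance-covariance matrix. The following minimal sketch computes that matrix in plain Python for a small hypothetical case (three records, two levels of one random factor). The numerical values of `Z`, `G`, and `R` are invented for illustration, and the helper functions are not part of any library.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def add(a, b):
    """Add two matrices of the same shape element by element."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Records 1 and 2 belong to level 1 of the random factor, record 3 to level 2.
Z = [[1, 0],
     [1, 0],
     [0, 1]]

# Assumed variance of 4 for each random level, with no covariance between levels.
G = [[4, 0],
     [0, 4]]

# Assumed residual variance of 10, with independent residuals.
R = [[10, 0, 0],
     [0, 10, 0],
     [0, 0, 10]]

V = add(matmul(matmul(Z, G), transpose(Z)), R)
# V == [[14, 4, 0], [4, 14, 0], [0, 0, 14]]
```

Records that share a level of the random factor have a nonzero covariance (4 here), while records from different levels are uncorrelated.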

Assumptions and Limitations

The third part of a model includes items that are not apparent in parts 1 and 2; for example, information about the data or the manner in which the data were collected. Were the animals randomly selected or did they have to meet some minimum standards? Did the data arise from many environments, at random, or were the environments specially chosen?

In this part of the model the differences between the operational model and the ideal model should be listed, and the possible effects of those differences on the analysis should be explained. Such a comparison is frequently overlooked or ignored, but part 3 of the model contains the most important information for assessing the quality of the analysis.

A linear model is not complete unless all three parts of the model are present. Statistical procedures and strategies for data analysis are determined only after a complete model is in place.

Examples of Models

Beef Calf Weights

Suppose we have weights on beef calves taken at 200 days of age as shown in the table below.

Males:   198, 211, 220
Females: 187, 194, 202, 185

The equation of the model might be

\[ y_{ij} = s_i + c_j + e_{ij}, \]

where $y_{ij}$ is one of the 200 day weights, $s_i$ is an effect due to the sex of the calf (fixed factor), $c_j$ is an effect of the calf (random factor), and $e_{ij}$ is a residual effect or unexplained variation (random factor).

The expectations and variance-covariance matrices of the random factors are

\[ E(c_j) = 0, \quad E(e_{ij}) = 0, \quad Var(c_j) = \sigma^2_c, \quad Var(e_{ij}) = \sigma^2_{e_i}. \]

Additionally, $Cov(c_j, c_{j'}) = 0$, which says that all of the calves are independent of each other, i.e. unrelated. Note that $\sigma^2_{e_i}$ implies that the residual variance is different for each sex of calf, because of the subscript $i$. Also, $Cov(e_{ij}, e_{ij'}) = 0$ and $Cov(e_{ij}, e_{i'j'}) = 0$ say that all residual effects are independent of each other within and between sexes.

The assumptions and limitations of the model could be listed as follows:

1. All calves are assumed to be of the same breed.
2. All calves were reared in the same environment and time period.
3. All calves were from dams of the same age (e.g. 3 yr olds).
4. Maternal effects do not influence 200 day weights.
5. Calf effects contain all genetic effects.
6. All weights were accurately recorded (i.e. not guessed).

The assumption about maternal effects may not be true, but without pedigree information it is not possible to include maternal effects in the model. Without breed information, calves have to be assumed to be of the same breed; otherwise the estimated sex differences could include differences between breeds. Without any knowledge of the dams or their ages, we have to assume that the dams were all of the same age. Age of dam effects are known to exist and should not be ignored. If one or more of the assumptions are known or suspected to be violated, then the model should be re-formulated and further information should be obtained to improve the model, so that the assumption does not need to be made.

Matrix Formulation

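Ordering the observations by males, then females, the matrix representation of the model would be

\[
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}, \quad
\mathbf{y} = \begin{pmatrix} 198 \\ 211 \\ 220 \\ 187 \\ 194 \\ 202 \\ 185 \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad
\mathbf{b} = \begin{pmatrix} s_1 \\ s_2 \end{pmatrix}, \quad
\mathbf{u} = \begin{pmatrix} c_1 \\ \vdots \\ c_7 \end{pmatrix},
\]

and $\mathbf{Z} = \mathbf{I}$ of order 7. Also,

\[
\mathbf{G} = \mathbf{I}\sigma^2_c, \quad
\mathbf{R} = \mathrm{diag}(\sigma^2_{e_1}, \sigma^2_{e_1}, \sigma^2_{e_1}, \sigma^2_{e_2}, \sigma^2_{e_2}, \sigma^2_{e_2}, \sigma^2_{e_2}).
\]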
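The matrix formulation for the calf data can be checked numerically. The sketch below is written in plain Python and assumes that the table of weights is read as three males (198, 211, 220) followed by four females (187, 194, 202, 185); that split is an assumption about the table layout. It builds the design matrix for sex, the identity design matrix for calves, and the products X'X and X'y. Because every calf has one record and both calf and residual effects are independent, the variance of each weight is the same within a sex and all covariances are zero, so the generalized least squares estimates of the sex effects reduce to the simple sex means.

```python
# Weights ordered males first, then females.
# Assumption: three males and four females, as read from the table.
weights = [198, 211, 220, 187, 194, 202, 185]
sex = ["M", "M", "M", "F", "F", "F", "F"]
sex_levels = ["M", "F"]

n = len(weights)

# X has one row per calf and one 0/1 column per sex.
X = [[1 if s == level else 0 for level in sex_levels] for s in sex]

# Z is an identity matrix of order 7 because each calf has exactly one record.
Z = [[1 if i == j else 0 for j in range(n)] for i in range(n)]

# X'X is diagonal (number of calves per sex); X'y holds the total weight per sex.
XtX = [[sum(X[r][a] * X[r][b] for r in range(n)) for b in range(2)] for a in range(2)]
Xty = [sum(X[r][a] * weights[r] for r in range(n)) for a in range(2)]

# With a diagonal X'X, each sex solution is the total divided by the count.
sex_means = [Xty[a] / XtX[a][a] for a in range(2)]
```

Here `XtX` holds the counts 3 and 4 on its diagonal, and `Xty` holds the totals 629 and 768.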

Sire Model

Suppose we have progeny data of three sires on temperament scores (on a scale of 1 to 40) taken at milking time as shown below:

CG  Age  Sire  Score
1   1    1     17
1   2    2     29
1   1    2     34
1   2    3     16
2   2    3     20
2   1    3     24
2   2    1     13
2   1    1     18
2   2    2     25
2   1    2     31

The equation of the model might be

\[ y_{ijkl} = c_i + a_j + s_k + e_{ijkl}, \]

where $y_{ijkl}$ is a temperament score; $c_i$ is a contemporary group effect (CG), which identifies animals that are typically reared and treated alike together; $a_j$ is an age group effect, in this case just two age groups; $s_k$ is a sire effect; and $e_{ijkl}$ is a residual effect. Contemporary groups and age groups are often taken to be fixed factors, and sires are generally random factors. Age group 1 was for daughters between 18 and 24 mo of age, and age group 2 was for daughters between 25 and 32 mo of age.

The expectations and variance-covariance matrices are

\[ E(s_k) = 0, \quad E(e_{ijkl}) = 0, \quad Var(s_k) = \sigma^2_s, \quad Var(e_{ijkl}) = \sigma^2_{e_i}. \]

Thus, the residual variance differs between contemporary groups. The sire variance represents one quarter of the additive genetic variance because all progeny are assumed to be half-sibs (i.e. from different dams). The sires are assumed to be unrelated.
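The one-quarter relationship can be sketched as follows, under the model's assumptions of unrelated sires, random mating, and one daughter per dam. The symbols $\sigma^2_s$ for the sire variance and $\sigma^2_a$ for the additive genetic variance are notation assumed here for the illustration. Two different daughters of the same sire share only the sire effect, and paternal half-sibs have an additive genetic relationship of one quarter, so

```latex
\[
Cov(y_{ijkl}, y_{i'j'kl'}) = Var(s_k) = \sigma^2_s = \tfrac{1}{4}\sigma^2_a
\quad\Longrightarrow\quad
\sigma^2_a = 4\sigma^2_s .
\]
```

If some daughters were full-sibs, or if sires were related, the covariance among records would no longer be described by $\sigma^2_s$ alone, which is why those assumptions matter.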

The assumptions and limitations of the model could be

1. Daughters were approximately in the same stage of lactation when temperament scores were taken.
2. The same person assigned temperament scores for all daughters.
3. The age groupings were appropriate.
4. Sires were unrelated to each other.
5. Sires were mated randomly to dams (with respect to milking temperament or any correlated traits).
6. Only one offspring per dam.
7. Only one score per daughter.
8. No preferential treatment towards particular daughters.

Many assumptions are actually made when a model is described, but often these assumptions are just implied and therefore not clearly stated. It is a good idea to try to think of and to state in words as many of the assumptions as possible prior to the analysis.

The reader is asked to set out the matrix formulation of the design matrices $\mathbf{X}$ and $\mathbf{Z}$, and of $\mathbf{G}$ and $\mathbf{R}$.
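A reader who has worked out the design matrices by hand may want a way to check them. The following is a minimal sketch in plain Python that builds 0/1 indicator columns from the ten records in the sire table, placing the contemporary group and age group columns in the fixed-effects design matrix and the sire columns in the random-effects design matrix. The helper `indicator_columns` is a hypothetical name introduced for this example, and the column order (levels sorted in ascending order) is a choice made here, not something fixed by the model.

```python
# Records from the table: (contemporary group, age group, sire, score).
records = [
    (1, 1, 1, 17), (1, 2, 2, 29), (1, 1, 2, 34), (1, 2, 3, 16),
    (2, 2, 3, 20), (2, 1, 3, 24), (2, 2, 1, 13), (2, 1, 1, 18),
    (2, 2, 2, 25), (2, 1, 2, 31),
]


def indicator_columns(values):
    """Return one row per record with a 0/1 column for each distinct level."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]


cg_columns = indicator_columns([r[0] for r in records])
age_columns = indicator_columns([r[1] for r in records])

# Fixed effects: two contemporary group columns followed by two age group columns.
X = [cg + age for cg, age in zip(cg_columns, age_columns)]

# Random effects: one column per sire.
Z = indicator_columns([r[2] for r in records])

# Observation vector of temperament scores.
y = [r[3] for r in records]

# Number of daughters per sire (column sums of Z).
daughters_per_sire = [sum(col) for col in zip(*Z)]
```

Built this way, the two contemporary group columns and the two age group columns each sum to a column of ones, so X does not have full column rank; a restriction or a generalized inverse is needed when solving for the fixed effects.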

Feed Intake

Feed intake measured on individual pigs might be modeled as

\[ y_{ijkmn} = (HYM)_i + S_j + L_k + a_{km} + p_{km} + e_{ijkmn}, \]

where $y_{ijkmn}$ is a feed intake measurement at a specified moment in time, $n$, on the $m$th pig from litter $k$, whose sow was in age group $j$, within the $i$th herd-year-month of birth subclass; $(HYM)_i$ is a herd-year-month of birth or contemporary group effect; $S_j$ is an age of sow effect identified by parity number of the sow; $L_k$ is a litter effect which identifies a group of pigs with the same genetic and environmental background; $a_{km}$ is an additive genetic animal effect; $p_{km}$ is an animal permanent environmental effect common to all measurements on an animal; and $e_{ijkmn}$ is a residual effect specific to each measurement. Then

All pigs were purebred Landrace. Two males and two females were taken randomly from each litter for feed intake measurements. The model assumes that there are no sex differences in feed intake and that maternal effects have no influence. All measurements were taken at approximately the same age of the pigs within a controlled environment at one location. Feed and handling of pigs were therefore uniform for all pigs within a herd-year-month subclass. Litters were related through the use of boars from artificial insemination. Feed intake was the average of 3 daily intakes during the week, and weekly averages were available for 5 consecutive weeks. Growth was assumed to be linear during the test period.

The reader should try to identify weaknesses in the above model and recommend changes or further assumptions that are missing. Note that all three parts of the model have been given without labelling each part.


Larry Schaeffer
1999-02-26