This LaTeX document is available as postscript or as Adobe PDF.
The quality of any statistical analysis is best judged by the model that describes the data. A model represents the sampling nature of the data and reflects the biology of the problem. There are three conceptual levels of models.
A good operational model is derived from the ideal model. Given the data and resource limitations under which the researcher must function, the operational model is a simplification of the ideal model. If too many simplifications have to be made, then an analysis may not be worthwhile. The assumptions made in going from the ideal model to the operational model should be known, so that the quality of the operational model can be judged.
The observation vector contains elements resulting from measurements, either subjective or objective, on the experimental units (usually animals) under study. The elements in the observation vector are random variables that have a multivariate distribution, and if the form of the distribution is known, then advantage should be taken of that knowledge.
Factors refer to variables, either discrete or continuous, which may influence or are related to the elements in the observation vector. For example, the milk yield of a dairy cow is known to be influenced by her age at calving, the season in which she calved, her genetic potential, and the number of days nonpregnant, to list a few. Model building requires that all useful factors be identified.
Discrete factors usually have classes or levels. For example, age at calving could have four levels (e.g. 20 to 24 months, 25 to 28 months, 29 to 32 months, and 33 months or greater), and an analysis of the data would then provide estimates of differences in milk yields of cows in the various age levels. Alternatively, age might be treated as a covariable with linear and quadratic effects upon milk yield, and the regression coefficients relating age and age squared to milk yields would be estimated in the analysis.
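The two treatments of age at calving can be sketched in a few lines of Python with NumPy. This is only an illustration: the ages are hypothetical values in whole months, and the names `ages`, `age_class`, `X_class`, and `X_cov` are assumptions made for this sketch, not part of the notes.

```python
import numpy as np

# Hypothetical ages at calving in whole months (illustrative values only).
ages = np.array([22.0, 26.0, 30.0, 35.0, 24.0, 28.0])

# Option 1: age as a discrete factor with the four classes named in the text:
# 20-24, 25-28, 29-32, and 33 or more months.
upper_bounds = np.array([24.0, 28.0, 32.0])       # upper limits of the first three classes
age_class = np.searchsorted(upper_bounds, ages)   # class index 0..3 for each cow

# One 0/1 column per age class; each row has a single 1.
X_class = np.zeros((ages.size, 4))
X_class[np.arange(ages.size), age_class] = 1.0

# Option 2: age as a covariable with linear and quadratic terms.
X_cov = np.column_stack([ages, ages ** 2])

print(age_class)
print(X_class)
print(X_cov)
```

With the class coding, the analysis estimates one effect per age level. With the covariable coding, it estimates two regression coefficients, one for age and one for age squared.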
Some factors may have a special interest to the researcher while other factors need to be included in the model in order to reduce the residual variation. For example, a researcher could be interested in the effects of various levels of application of growth hormones to beef cattle, but the model must also include the effects of age of the animal, the sex, the location, diet, and breed. The latter group of effects is often referred to as nuisance factors. Nuisance factors cannot be ignored or omitted from the model because doing so could drastically alter the interpretation of results for the factors of main interest.
Fixed and Random Factors
In the traditional "frequentist" approach, fixed and random factors need to be distinguished. In a Bayesian approach there is no such distinction between factors. Both approaches will be used in this course.
Fixed factors are factors in which the classes comprise all of the possible classes of interest that could be observed. For example, the sex of an animal is either male, female, sterilized male, or sterilized female. If the number of classes in a factor is small and confined to this number even if conceptual resampling were performed an infinite number of times, then the factor is likely fixed. Other examples are age classes, lactation number, management system, cage number, and breed class. Usually if the sampling were to be repeated a second time, those factors which maintain the same classes between the two samplings would be fixed factors. For example, a growth trial on pigs using two diets would probably need to use the same housing facilities, the same age groups of pigs, and the same diets, but the individual pigs would necessarily have to be new animals because an animal could not go through the same growth phase a second time in its life. Pig effects would be considered a random factor while the other effects would be fixed.
Random factors are factors whose levels are considered to be drawn randomly from an infinitely large population of levels. In the previous pig experiment, pigs were considered random because the pig population of the world is large enough to be considered infinitely large, and the group involved in that experiment was a random sample from that population. In actual fact, however, the pigs in that experiment were likely sampled from those relatively few pigs that were available at the time the trial started. They are still considered to be a random factor because, if the experiment were to be repeated, there would likely be a completely different group of pigs involved.
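The pig growth trial can be made concrete by writing out one 0/1 incidence matrix per factor. The sketch below assumes six hypothetical pigs with one record each. The helper `incidence_matrix` and the labels are invented for illustration and do not come from the notes.

```python
import numpy as np

def incidence_matrix(labels):
    """Return (matrix, levels): one row per record, one 0/1 column per level."""
    levels = sorted(set(labels))
    column = {level: j for j, level in enumerate(levels)}
    matrix = np.zeros((len(labels), len(levels)))
    for i, label in enumerate(labels):
        matrix[i, column[label]] = 1.0
    return matrix, levels

# Hypothetical records: six pigs, each fed one of two diets.
diets = ["A", "A", "A", "B", "B", "B"]
pigs = ["pig1", "pig2", "pig3", "pig4", "pig5", "pig6"]

# Fixed factor: a repeat of the trial would use the same two diets,
# so the columns of X would keep the same meaning.
X, diet_levels = incidence_matrix(diets)

# Random factor: a repeat of the trial would use new pigs,
# so the columns of Z would refer to a different sample of animals.
Z, pig_levels = incidence_matrix(pigs)

print(diet_levels, X.shape)
print(len(pig_levels), Z.shape)
```

Because each pig has a single record here, `Z` is an identity matrix. With several records per pig it would have repeated rows instead.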
Another way to determine if a factor is fixed or random is to know how the results will be used. In a nutrition trial the results infer something about the diets in the trial. The diets are specific and no inferences should be made about other diets not tested in the experiment. Hence diet effects would be a fixed factor. In contrast, if animal effects were in the model, inferences about how any animal might respond to a specific diet may need to be made. There should not be anything peculiar about the animal on the trial that would nullify that inference. Animal effects would be a random factor.
In general, a few questions need to be answered to make the correct choice of fixed or random factor designation. Some of the questions are
In a Bayesian context, a prior distribution needs to be assumed for each of the factors. Random factors might typically be assumed to have a Normal distribution with a particular mean and variance. For fixed factors, a uniform distribution may be assumed, or a prior density that is proportional to a constant. Even the variances need to have an assumed prior distribution. The prior distributions are combined into a joint prior distribution, which is then used with the distribution of the data to arrive at a posterior distribution from which inferences may be made.
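As a sketch of how these pieces fit together, the posterior can be written as proportional to the distribution of the data times the priors. The notation is an assumption made for illustration: b holds the fixed effects, u the random effects, and the two variances belong to the random effects and the residuals. The priors are also assumed to be independent of one another.

```latex
\[
p(\mathbf{b}, \mathbf{u}, \sigma^2_u, \sigma^2_e \mid \mathbf{y})
\;\propto\;
p(\mathbf{y} \mid \mathbf{b}, \mathbf{u}, \sigma^2_e)\,
p(\mathbf{b})\,
p(\mathbf{u} \mid \sigma^2_u)\,
p(\sigma^2_u)\,
p(\sigma^2_e),
\qquad
p(\mathbf{b}) \propto \mbox{constant}.
\]
```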
Only linear models are discussed. A linear model contains a set of factors which additively affect the observations, but a variable within a factor may represent, for example, a squared term. Linear models are adequate in most biological circumstances. This does not imply that nonlinear models are unimportant. Nonlinear relationships may often be approximated by a linear model. However, if a nonlinear model gives a better ideal model than a linear model, then the nonlinear model should be utilized. Texts that deal with nonlinear model methods should be consulted.
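The point that a squared term still gives a linear model can be checked numerically: the model is linear in its unknown coefficients, so ordinary least squares applies. The sketch below uses hypothetical covariate values and coefficients, and noise-free observations, so that the known coefficients are recovered. None of these numbers come from the notes.

```python
import numpy as np

# Hypothetical covariate values and coefficients, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
true_b = np.array([2.0, 3.0, -0.5])   # intercept, linear, and quadratic coefficients

# y = b0 + b1*x + b2*x^2 is nonlinear in x but linear in b0, b1, and b2,
# so it can be written as y = X b with the columns 1, x, and x^2.
X = np.column_stack([np.ones_like(x), x, x ** 2])
y = X @ true_b                        # noise-free observations

# Ordinary least squares on the three columns.
b_hat, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(b_hat)
```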
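A linear model, in the traditional sense, is composed of three parts: (1) the equation of the model, (2) the expectations and variance-covariance (VCV) matrices of the random variables, and (3) the assumptions and limitations.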
The equation of the model defines the factors that may have an effect on the
observed trait. A matrix formulation of a general model equation is
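As a sketch, a widely used matrix form of a mixed linear model is given below. The symbols are assumptions and not necessarily the notation of these notes: y is the observation vector, b is the vector of fixed effects, u is the vector of random effects, e is the vector of residuals, and X and Z are the design matrices that relate the records to the elements of b and u.

```latex
\[
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}
\]
```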
Expectations and VCV Matrices
In general terms, the expectations are
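For a model of the assumed form y = Xb + Zu + e, with b fixed and with u and e random and uncorrelated with each other, the usual expectations and variance-covariance matrices can be sketched as follows. The symbols G and R, for the variance-covariance matrices of u and e, are assumptions made for illustration, and the matrix environments require the amsmath package.

```latex
\[
E\begin{pmatrix} \mathbf{y} \\ \mathbf{u} \\ \mathbf{e} \end{pmatrix}
= \begin{pmatrix} \mathbf{X}\mathbf{b} \\ \mathbf{0} \\ \mathbf{0} \end{pmatrix},
\qquad
Var\begin{pmatrix} \mathbf{u} \\ \mathbf{e} \end{pmatrix}
= \begin{pmatrix} \mathbf{G} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} \end{pmatrix},
\qquad
Var(\mathbf{y}) = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}.
\]
```

The last expression follows because Xb is a constant and u and e are uncorrelated, so only the variances of Zu and e contribute to the variance of y.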
Assumptions and Limitations
The third part of a model includes items that are not apparent in parts 1 and 2; for example, information about the data or the manner in which the data were collected. Were the animals randomly selected or did they have to meet some minimum standards? Did the data arise from many environments, at random, or were the environments specially chosen?
In this part of the model the differences between the operational model and the ideal model should be listed, and the possible effects of those differences on the analysis should be explained. Such a comparison is frequently overlooked or ignored, but part 3 of the model contains the most important information for assessing the quality of the analysis.
A linear model is not complete unless all three parts of the model are present. Statistical procedures and strategies for data analysis are determined only after a complete model is in place.
Examples of Models
Beef Calf Weights
Suppose we have weights on beef calves taken at 200 days of age as shown in the table below.
The expectations and variance-covariance matrices of the
random factors are
The assumptions and limitations of the model could be listed as follows:
Ordering the observations by males, then females, the matrix
representation of the model would be
Suppose we have progeny data of three sires on temperament scores (on a scale of 1 to 40) taken at milking time as shown below:
The equation of the model might be
The expectations and variance-covariance matrices are
The assumptions and limitations of the model could be
The reader is asked to set out the matrix formulation of the model, including the design matrices.
Feed intake can be measured on individual pigs which might be
The reader should try to identify weaknesses in the above model and recommend changes or further assumptions that are missing. Note that all three parts of the model have been given without labelling each part.
Larry Schaeffer