If you’re going to be poking around data looking for patterns and anomalies, you should be aware of the fundamental requirements you need to fulfill, or at least assume you fulfill. Consider this. All models make assumptions, an evil necessity for simplifying complex analyses. If your model deals in probabilities, like statistical models do, you’ll be making at least five assumptions:
- Representativeness – The samples or measurements used to develop the model are representative of the population of possible samples or measurements.
- Linearity – The model can be expressed in an intrinsically linear, additive form.
- Independence – Errors in the model are not correlated.
- Normality – Errors in the model are normally distributed.
- Homogeneity of Variances – Errors in the model have equal variances for all values of the independent variables.
The most important assumption common to all statistical models is that the samples used to develop the model are representative of a population of possible samples that are being investigated. Some statistics books don’t discuss this as a basic assumption because it is viewed as more of a requirement than an assumption. But obtaining representative samples of populations can be a challenge. Unlike the other assumptions, failure to obtain a representative sample from a population under study would necessarily be a fatal flaw for any statistical analysis. You might not know it, though, because there’s no good way to determine if a sample is representative of the underlying population. To do that, you would need to know the characteristics of the population. But if you knew about the population, you wouldn’t need to bother with a sample. So, representativeness has to be addressed indirectly by building randomization and variance control into the sampling program before it is undertaken. If randomization cannot be incorporated into the sampling procedure in some way, the only alternative is to try to evaluate how the sample might not be representative. This is seldom a satisfying exercise. Statements like “the results are conservative because only the worst cases were sampled” are usually conjectural, qualitative, and unconvincing to anyone who understands statistics.
The linearity assumption requires that the statistical model of the dependent variable being analyzed can be expressed by a linear mathematical equation consisting of sums of arithmetic coefficients times the independent variables. The effects of nonlinear relationships are usually substantial. Applying a linear model to a nonlinear pattern of data will result in misleading statistics and a poor fit of the model to the data. Evaluating the linearity assumption is usually straightforward. Start by plotting the dependent variable against each of the independent variables, then calculate correlations and go from there.
This assumption is seldom a problem for three reasons. First, in practice, most models of dependent variables can be expressed as linear mathematical equations consisting of arithmetic sums of coefficients times the independent variables. Second, the assumption will still be met when one or more of the independent variables have a nonlinear relationship with the dependent variable if a mathematical transformation can be found to make the relationship linear. The only catch is that the coefficients (termed the parameters of the model) must still be linear. These models are termed intrinsically linear. In contrast, intrinsically nonlinear models have coefficients that are nonlinear. Third, if a transformation cannot be found to correct a nonlinear relationship, you can still resort to using statistical methods for intrinsically nonlinear models. Nonlinear modeling uses different terminology and optimization processes than linear regression and usually requires specialized software.
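As an illustration of an intrinsically linear model, here is a minimal sketch that assumes the relationship is exponential, y = a * exp(b * x). Taking logarithms gives ln(y) = ln(a) + b * x, which is linear in the parameters ln(a) and b, so ordinary least squares applies. The function names and the data are hypothetical and chosen only for this example.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by regressing ln(y) on x (requires y > 0)."""
    log_y = [math.log(yi) for yi in y]
    log_a, b = fit_line(x, log_y)
    return math.exp(log_a), b

# Hypothetical, noise-free data generated from y = 2 * exp(0.5 * x)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

a, b = fit_exponential(x, y)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real, noisy data, keep in mind that fitting on the log scale treats the errors as multiplicative on the original scale, so it is worth checking the residuals of the transformed model rather than assuming the transformation fixed everything.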
The third assumption common to statistical models is that the errors in the model are independent of each other. Some introductory statistics textbooks describe this assumption in terms of the measurements on the dependent variable. There are two reasons for this. First, it’s a lot easier for beginning students to understand, especially if they aren’t familiar with the mathematical form of statistical models and the concept of model errors. Second, and more importantly, the two approaches to describing the independence assumption are equivalent. This is because a data value can be expressed as the sum of an inherent “true” value and some random error. If you have controlled all sources of extraneous variation, the data and the model errors should be identically distributed.
Say you were conducting a study that involved measuring the temperature of human subjects. Without your knowledge, a well-meaning assistant provides beverages in the waiting room – piping hot coffee and iced tea. When you plot a histogram of the temperature data, you might see three peaks (called modes), one centered at 98.6°F, another at a degree or so higher and a third a degree or so lower. Your data have violated the independence assumption. The subjects who drank the coffee all had their temperatures linked to the higher temperature of the coffee. The subjects who drank the iced tea all had their temperatures linked to the lower temperature of the iced tea. What are the chances you might notice this dependency? If you had a dozen or so subjects, the chances wouldn’t be good. With 100 subjects, you might notice something. With 1,000 subjects, you would almost certainly notice the effect, though if you’re providing beverages to 1,000 subjects, you might consider getting out of research and opening a coffee shop.
Assessing independence involves looking for serial correlations, autocorrelations, and spatial correlations. A serial correlation is a correlation between data points and the data points listed before them. For example, making measurements with an instrument that is drifting out of calibration will introduce a serial correlation. Spatial and temporal dependence are often present in environmental data. For example, two soil samples located very close together are more likely to have similar attributes than two samples located very far apart. Likewise, two well water samples collected a day apart are more likely to have similar attributes than two samples collected two years apart.
Most statistical software will allow you to conduct the Durbin-Watson test for serial correlation as part of a regression analysis. For temporally related data, correlograms are used to assess autocorrelations and partial autocorrelations. Spatial independence can be evaluated using variograms, plots of the spatial variance versus the distances between samples. Correlograms and variograms require specialized software to produce and some experience to interpret.
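To make the Durbin-Watson idea concrete, here is a minimal sketch of the statistic itself: the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4; values near 2 suggest no first-order serial correlation, values well below 2 suggest positive correlation, and values well above 2 suggest negative correlation. The residuals below are hypothetical, and the sketch computes only the statistic, not the critical values a formal test needs.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in collection order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals from an instrument drifting steadily upward
drifting = [-0.9, -0.7, -0.4, -0.2, 0.0, 0.1, 0.4, 0.6, 0.8]

# Hypothetical residuals that flip sign at every step
alternating = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]

print(f"drifting:    {durbin_watson(drifting):.2f}")     # well below 2
print(f"alternating: {durbin_watson(alternating):.2f}")  # well above 2
```

The order of the residuals matters here: the statistic is only meaningful if they are listed in the sequence in which the measurements were made.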
When the independence assumption is violated, the calculated probability that a population and a fixed value (or two populations) are different will be underestimated if the correlation is negative, or overestimated if the correlation is positive. The magnitude of the effect is related to the degree of the correlation.
Some people confuse the independence assumption, which refers to model errors or measurements of the dependent variable, with the assumption that the independent variables (AKA, predictor variables) are not correlated. Correlations between predictor variables, termed multicollinearity, are also problematical for many types of statistical models because statistics associated with such models can be misleading.
The Normality assumption requires that model errors (or the dependent variable) mimic the form of a Normal distribution. This assumption is important because the Normal model is used as the basis for calculating probabilities related to the statistical model. If the model errors don’t at least approximate a Normal distribution, the calculated probabilities will be misleading. It would be like trying to put a square peg into a round hole.
There are many methods for evaluating the Normality of a distribution, which fall into one of three categories:
- Descriptive Statistics – Including the coefficient of variation (the standard deviation divided by the mean), the skewness (a measure of distribution symmetry), and the kurtosis (a measure of relative frequencies in the center versus the tails of the distribution). If the coefficient of variation is less than about one, and the skewness and the excess kurtosis (the kurtosis as most software reports it, scaled so that a Normal distribution scores zero) are close to zero, it’s reasonable to assume the errors approximate a Normal distribution.
- Statistical Graphics – Statistical graphics are more revealing than descriptive statistics because they indicate visually which data deviate from the Normal model. Interpreting these graphics can be somewhat subjective, however. The most commonly used statistical graphics are histograms, box plots, and probability plots. Other statistical graphics sometimes used to evaluate Normality include stem-and-leaf diagrams, dot plots, and Q-Q plots.
- Statistical Tests – Statistical tests are more rigorous than either descriptive statistics or statistical graphics. Commonly used tests of Normality include the Shapiro-Wilk test, the Chi-squared test, and the Kolmogorov-Smirnov test. One of the problems with statistical tests of Normality is that they become more and more sensitive as the sample size gets large. So, a statistical test may indicate a significant departure from Normality that is so minor it is unimportant. Thus, tests of Normality may be definitive but irrelevant.
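The descriptive statistics in the first category are simple enough to compute by hand. The following is a minimal sketch, assuming plain moment estimators that divide by n; statistical packages often apply small-sample corrections, so their values may differ slightly. The function name and both data sets are hypothetical.

```python
import math

def normality_summary(data):
    """Coefficient of variation, skewness, and excess kurtosis of a sample.

    Assumes a nonzero mean (the coefficient of variation is undefined
    otherwise) and uses moment estimators that divide by n.
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return {
        "cv": math.sqrt(m2) / mean,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # zero for a Normal distribution
    }

# Hypothetical data sets, for illustration only
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 2.0, 10.0]

for name, data in (("symmetric", symmetric), ("right_skewed", right_skewed)):
    print(name, normality_summary(data))
```

The second data set has a coefficient of variation above one and a clearly positive skewness, which is the kind of result that should send you to a histogram or probability plot before trusting Normal-based probabilities.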
So how should you evaluate Normality? Focus on one method or decide on the basis of a preponderance of the evidence? First, you have to understand that statistical tests, statistical graphics, and descriptive statistics are like advisors. They all have an opinion, none is always correct, and they sometimes provide conflicting advice.
One approach to evaluating Normality is to first look at a histogram to get a general impression of whether the data distribution is even close to a Normal distribution. If it is, look at a test of Normality, preferably a Shapiro-Wilk test. The null hypothesis of this test is that the data are Normally distributed, so if there’s no significant difference, you have no evidence against Normality and can reasonably proceed as if the data came from a Normally distributed population. If there is a significant difference, then your decision becomes problematical. Look at a probability plot to determine where the departures from Normality are. You might have a problem if the deviations are in the tails because that’s where the test probabilities are calculated. If there is an appreciable deviation from Normality in the tails of the distribution of errors, consider transforming the dependent variable or using a nonparametric procedure.
The last assumption, termed homoscedasticity, means that the errors in a statistical model have the same variance for all values of the independent variables. For models involving grouping variables, the assumption means that all groups have about the same variance. For models involving continuous-scale variables, homoscedasticity means that the variances of the errors don’t change across the entire scale of measurement.
For example, in the case of a measurement instrument, homoscedasticity requires that the error variance be about the same for measurements at the low, middle, and high portions of the instrument’s range, which can be a difficult requirement to meet. Another example would be measurements made over many years. Improvements in measurement technologies could cause more recent measurements to be less variable (i.e., more precise) than historical measurements.
Assessing homoscedasticity is more straightforward for discrete-scale variables than for continuous-scale variables because there are usually more than a few data points at each scale level. A simple qualitative approach is to calculate the variances for each group and look at the ratios of the sample sizes and the variances. There are also more sophisticated ways to evaluate homoscedasticity, such as Levene’s test.
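The simple qualitative approach can be sketched in a few lines. This example assumes the data are already split into groups and reports each group's sample size and variance along with the largest-to-smallest ratios. The function name, group labels, and measurements are hypothetical, loosely modeled on an instrument read at the low, middle, and high parts of its range. It is a screening aid, not a substitute for a formal test such as Levene’s.

```python
from statistics import variance

def variance_summary(groups):
    """Per-group sample sizes and sample variances, plus max/min ratios."""
    sizes = {name: len(values) for name, values in groups.items()}
    variances = {name: variance(values) for name, values in groups.items()}
    return {
        "sizes": sizes,
        "variances": variances,
        "variance_ratio": max(variances.values()) / min(variances.values()),
        "size_ratio": max(sizes.values()) / min(sizes.values()),
    }

# Hypothetical repeated measurements at three points in an instrument's range
groups = {
    "low": [10.1, 9.9, 10.0, 10.2, 9.8],
    "mid": [50.3, 49.6, 50.1, 49.8, 50.2],
    "high": [101.5, 98.0, 100.5, 99.0, 101.0],
}

summary = variance_summary(groups)
for name in groups:
    print(f"{name}: n = {summary['sizes'][name]}, "
          f"variance = {summary['variances'][name]:.3f}")
print(f"largest variance ratio: {summary['variance_ratio']:.1f}")
print(f"largest sample size ratio: {summary['size_ratio']:.1f}")
```

In this made-up example the sample sizes are equal but the variance at the high end is far larger than at the low end, which is the pattern that suggests trying a transformation.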
Violations of the homoscedasticity assumption tend to affect statistical models more than do violations of the Normality assumption. Generally, the effects of violating the homogeneity-of-variances assumption will be small if the largest ratio of variances is near one and the sample sizes are about the same for all values of the independent variables. As differences in both the variances and the numbers of samples become large, the effects can be great. Violations of homoscedasticity can often be corrected using transformations. In fact, transformations that correct deviations from Normality will often also correct heteroscedasticity. Non-parametric statistics also have been used to address violations of this assumption.
So, you don’t have to assume the worst might happen in violating a statistical assumption. The effects may be minor or there may be an alternative approach you can use. You just have to know what to look for.