The Seeds of a Model

Always start with good seeds.

Perhaps the most complicated and time-consuming aspect of model building is selecting the components of your model—the variables, the samples, and the data (https://statswithcats.wordpress.com/2010/12/04/many-paths-lead-to-models/). Here are a few tips for collecting the seeds of your model.

Models Revisited

Here’s a quick review of the components of a statistical model. The key variable that characterizes the phenomenon to be modeled is called the criterion variable, or more commonly, the dependent variable. Variables (usually, but not necessarily, more than one) that will be used to test, predict, or explain the dependent variable in the model are called grouping variables, predictor variables, explanatory variables, or most commonly, independent variables. A prototype model is represented as:

Dependent variable that characterizes the phenomenon =
Independent variable(s) that test, predict, or explain the dependent variable

By convention, the criterion or dependent variable is always placed to the left of the equals sign, and the independent variables are placed to the right. This representation says that the information in the dependent variable can be obtained from the information in the independent variable(s). Usually, though, the independent variables in a model won’t all be equally important for describing a dependent variable. Each independent variable has to be weighted by multiplying it by an adjustment factor to account for the differences. The adjustment factors also correct for the independent variables being measured in different units, or even scales of measurement. So a more detailed representation of a model would be:

Dependent variable =
Variable 1 Adjustment Factor * Independent variable 1 +
Variable 2 Adjustment Factor * Independent variable 2 +
… and so on … +
Model Adjustment Factor

This says that the information in your dependent variable can be expressed as the sum of your independent variables, which have been adjusted to account for their scales of measurement and for their contributions to the model, plus an adjustment factor for the entire model not related to a specific independent variable. If all of the adjustment factors are constants in a given model, which they usually are, you have a linear model. The values for the adjustment factors are determined by the technique you’re using to calibrate the model. If the value of a dependent variable is always equal to the sum of the adjustment factors times the values of the independent variables, plus the model constant, the model is called exact or deterministic (https://statswithcats.wordpress.com/2010/08/08/the-zen-of-modeling/).

Even with all those adjustment factors, though, sometimes the independent variables can’t quite reproduce the values of the dependent variable, so there are errors. Add an error term to the model and you have a statistical model:

Dependent variable =
Variable 1 Adjustment Factor * Independent variable 1 +
Variable 2 Adjustment Factor * Independent variable 2 +
… and so on … +
Model Adjustment Factor +
Error

To be more concise, the terms in the model can be represented by letters and rewritten as:

y = a₀ + a₁x₁ + a₂x₂ + … + aₙxₙ + e

where:

y is the dependent variable that characterizes the phenomenon.

x₁ through xₙ are the independent variables that test, predict, or explain the dependent variable.

a₀ is the Model Adjustment Factor.

a₁ through aₙ are the Variable Adjustment Factors. a₁ through aₙ are constants called coefficients or parameters of the model. If a₁ through aₙ aren’t constants, you have a nonlinear model.

e is the Error term, which allows you to characterize the uncertainty in the model.

The y and the x’s are the variables you create and measure on your samples. The a’s are the constants the statistical procedure estimates, and e is whatever is left over once they have been estimated. That’s a statistical model. To add a little more perspective, if you have only one dependent variable, only one independent variable, and no error, the model reduces to:

y = a + bx

which you may remember from high school algebra is the equation of a straight line, where a is the y-intercept and b is the slope of the line. So mathematical models really aren’t so mysterious and shouldn’t induce the terror of, say, getting sucked down the toilet in the restroom of a Boeing 747 and falling 35,000 feet into a fetid swamp full of vampire bats, ticks, leeches, and IRS agents, then having to give an hour-long presentation on your experience au naturel at the next Christian Nudist Convocation. Try both, you’ll see.

Now that’s a different way to look at things.
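To make the notation concrete, here is a minimal sketch of calibrating a linear model in Python. It assumes NumPy is available and uses ordinary least squares, which is only one of many possible calibration techniques. The helper name fit_linear_model, the simulated data, and the "true" coefficients (2.0, 1.5, and -0.7) are all made up for illustration.

```python
import numpy as np

def fit_linear_model(X, y):
    """Estimate a0 (the model constant) and a1..an by ordinary least squares."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)  # treat a single independent variable as one column
    # Prepend a column of ones so the first coefficient is the model constant a0.
    design = np.column_stack([np.ones(X.shape[0]), X])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coeffs

# Hypothetical statistical model: y = 2.0 + 1.5*x1 - 0.7*x2 + e
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=200)
x2 = rng.uniform(0, 5, size=200)
e = rng.normal(0, 0.5, size=200)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + e

coeffs = fit_linear_model(np.column_stack([x1, x2]), y)
# What the fitted model cannot reproduce is the estimated error.
residuals = y - (coeffs[0] + coeffs[1] * x1 + coeffs[2] * x2)

# Deterministic special case with no error: y = a + bx with a = 3 and b = 2
x = np.array([0.0, 1.0, 2.0, 3.0])
line = fit_linear_model(x, 3.0 + 2.0 * x)

print("estimated a0, a1, a2:", coeffs)
print("intercept and slope of the exact line:", line)
```

With an error term, the estimated coefficients land near the values used to simulate the data but not exactly on them. With no error, the straight line is recovered exactly, which is the difference between a statistical model and a deterministic one.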

Dependent Variables

To build your model, select as many dependent variables as you feel you’ll need to characterize the phenomenon. Usually, statistical models have only one dependent variable. These are called univariate statistical models. If you think you’ll need only one dependent variable, that’s great. It will make for a fairly straightforward analysis.

If more than one dependent variable is needed to describe a phenomenon, the model is called a multivariate statistical model. (Some statistical textbooks, particularly in the social sciences, refer to statistical procedures that analyze more than one kind of variable, either dependent or independent, as multivariate. But the complexity of the analysis is far greater if there are multiple dependent variables than if there are multiple independent variables.)

If you need more than one dependent variable, try to limit the number. If you have more than a few dependent variables, here are a few things you can do to reduce the number of candidate dependent variables.

Focus on Aspects of the Phenomenon—Some phenomena are very complex or at least multifaceted. You may be able to reduce the number of dependent variables you are considering by focusing on just one aspect of the phenomenon.
Narrow the Objective—If you are trying to do too much in one study, you might try to reduce your aims, or break up the project into parts and conduct the subprojects sequentially.
Focus on Hard Information—Hard information involves measurements of tangible, observable demonstrations as opposed to measurements of intangible beliefs or opinions. Focus on dependent variables that involve hard information.
Focus on Direct Information—Direct information involves measurements specifically of the phenomenon being investigated, as opposed to measurements of factors associated with the phenomenon. Focus on dependent variables that directly measure the phenomenon.
Eliminate Correlated Variables—If several candidate dependent variables are highly intercorrelated, pick the best and eliminate the rest.
Create Multiple Models—If you have to have more than one dependent variable, create a different model for each one. This is like subdividing the objectives—not optimal but sometimes a necessary evil.
Conduct a Factor Analysis—You might be able to reduce the number of dependent variables using factor analysis to combine the multiple variables into one.

If you can’t do any of these things, you’re probably headed for a multivariate analysis. Consider looking for help.
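The tip about eliminating correlated variables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses NumPy, the 0.9 cutoff is an arbitrary choice rather than a rule, and the function name highly_correlated_pairs, the variable names, and the simulated data are hypothetical.

```python
import numpy as np

def highly_correlated_pairs(data, names, threshold=0.9):
    """Return (name_i, name_j, r) for every pair of columns with |r| >= threshold."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Three hypothetical candidate dependent variables measured on 100 samples.
rng = np.random.default_rng(1)
base = rng.normal(size=100)
candidates = np.column_stack([
    base,                                    # score_a
    base + rng.normal(scale=0.1, size=100),  # score_b: nearly a copy of score_a
    rng.normal(size=100),                    # score_c: unrelated to the others
])
names = ["score_a", "score_b", "score_c"]

pairs = highly_correlated_pairs(candidates, names)
for first, second, r in pairs:
    print(f"{first} and {second} are highly correlated (r = {r:.3f})")
```

A flagged pair is a prompt to pick the better of the two variables, not an automatic deletion. Which one is "best" still depends on how hard and how direct each measurement is.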

Independent Variables

Your selection of independent variables will hinge on what you plan to use the model for. Here are a few tips for identifying candidate measures and scales:

Variables for Characterizing, Classifying, Identifying, and Explaining—Select enough variables to address all the theoretical aspects of the phenomenon, even to the point of having some redundancy. Sometimes two differently measured or differently scaled variables that address the same theoretical concept will make dissimilar contributions to the model. When you calibrate the model, the extra variables will drop out.
Variables for Comparing—Test what you want to know, not everything under the sun. Keep the number of variables to an absolute minimum or your analysis will become intractable. Try to use conventionally recognized variables and scales rather than creating new ones if you can. This will facilitate replication studies.
Variables for Predicting—Be sure that the variables and scales you select are relatively inexpensive and easy to create or obtain. A prediction model won’t be very useful if the prediction variables cost more to generate than the prediction is worth. For example, if you plan to use the model repeatedly, say to make monthly forecasts, you’ll want the model inputs to be simple enough that you could generate all the data you would need in a couple of weeks at most. If the inputs were so complex that they take months to generate, you wouldn’t be able to use the model as you wanted. Stress precision in selecting variables. Accuracy tends to come easy while precision is elusive. Prediction models usually keep only the variables that work best in making a prediction, so the number of variables you select initially isn’t that important. Recognize, though, that the more variables you have in your conceptual model, the more work it will be to winnow out the ones you don’t need.

Some of the variables may have several possible scales (https://statswithcats.wordpress.com/2010/09/12/the-measure-of-a-measure/). If these extra scales are related to each other by a linear algebraic relationship, keep only one. This is because the variables will be perfectly correlated, and thus, will add no new information to the model. For example, if you measure temperature in degrees Fahrenheit, you don’t need to also include temperature in degrees Celsius because °C = 5/9(°F − 32). Pick the scale that will give you the best resolution. In the example of temperature, Fahrenheit-scaled thermometers can be read with greater precision than Celsius-scaled thermometers because they have smaller divisions. Better yet, get a digital thermometer that displays several decimal places.
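A short Python sketch shows why a linearly related scale adds nothing. It assumes NumPy, and the five temperature readings are invented for illustration. The two scales come out perfectly correlated, and putting both in the same model makes the columns of the design matrix linearly dependent.

```python
import numpy as np

# Hypothetical temperature readings in degrees Fahrenheit.
fahrenheit = np.array([32.0, 50.0, 68.0, 86.0, 104.0])
# The same readings converted with C = 5/9 (F - 32).
celsius = 5.0 / 9.0 * (fahrenheit - 32.0)

# Correlation between the two scales.
r = np.corrcoef(fahrenheit, celsius)[0, 1]

# Design matrix with a constant column and both temperature scales.
design = np.column_stack([np.ones_like(fahrenheit), fahrenheit, celsius])
# The rank counts the columns that carry independent information.
rank = np.linalg.matrix_rank(design)

print(f"correlation = {r:.6f}")
print(f"columns = {design.shape[1]}, rank = {rank}")
```

The rank is one less than the number of columns because the Celsius column can be rebuilt from the constant and the Fahrenheit column, so the model cannot assign separate adjustment factors to the two scales.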

If a concept can be measured on unrelated scales or by different methods, keep all of the candidate measures at this point. You will sort out the best measures when you calibrate the model. For example, you could measure pH using pH paper, a field meter, or a lab titration. If a concept that you want to evaluate with your model were a person’s size, you could use a height scale and a weight scale. However, you wouldn’t need to include weight in both pounds and kilograms because the two scales are linearly related (1 kg ≈ 2.2 lb). You could include weight measured by a balance beam, a strain gage, a spring scale, or even a circus weight-guesser because they use different techniques to measure weight (although they would probably be highly correlated).

Samples and Data

The samples you select must represent the population you want to analyze. A lot of thought must go into defining the population and finding samples that will fairly represent that population. Then all those mental maneuvers go into fitting considerations, like the sample hierarchy, resolution and the number of samples (https://statswithcats.wordpress.com/2010/07/17/purrfect-resolution/), and the sampling scheme, into a comprehensive sampling plan. So the last thing you want is for the sampler, the person who will generate the data, to stray from the carefully thought-out plan. You don’t want field technicians moving sampling locations so that they don’t have to walk so far from their truck. You don’t want doctors reassigning their friends to experimental groups that will get preferential treatment. You don’t want your survey takers concentrating on attractive members of the opposite sex. You get the idea.

I think I’ll take a sample here

When it comes to samples, samplers should have little or no discretion to stray from the plan. Follow the map that will find the population you’re looking for. Then there’s the process of generating the data. As much as you plan to minimize variance with reference, replication, and randomization (https://statswithcats.wordpress.com/2010/09/05/the-heart-and-soul-of-variance-control/), there will always be opportunities at the point of data collection to improve the process. A dropped meter may require recalibration that’s not called for in the sampling plan. A survey taker might ask a clarifying question, check spelling, or point out a math error before a respondent forever disappears. A surveyor can correct a map with an incorrectly located sampling point. An accountant can adjust misclassified debits in financial records. As the data analyst, you are mostly powerless to make such corrections and clarifications until it’s too late, and you have to puzzle over the cause of an outlier. You need to rely on the knowledge and experience of the people collecting the data. So when it comes to data, samplers should have considerable discretion to use their initiative to ensure the quality of the data, minimize variance, and achieve the intent, if not the letter, of the sampling plan.

Once you know what you want your model to do and you know what you need to measure, you can consider the statistical techniques you might use (https://statswithcats.wordpress.com/2010/08/27/the-right-tool-for-the-job/).
