The Right Tool for the Job

Statistics are like power tools. If you know how to use them, they are incredibly valuable and fun to use. They help you do your job better, more thoroughly, and more quickly. But if you are careless, they can cause great damage.

You work. I’ll supervise.

Think of an expert carpenter like Norm Abram on This Old House. Norm has a different tool for every possible job he might need to do in his workshop. Statistical methods are like that. There are many different types of statistical analysis. Some perform a single function, and some perform many. In the same way that there are several different types of saws, there are different statistical methods for doing exactly the same thing. And just as Norm knows when to use his table saw and when to use his band saw, a statistician knows when to use different types of statistical analysis.

If you haven’t been trained in statistics, selecting which technique to use may seem bewildering. You can usually get in the right ballpark, though, if you understand your variables and your objectives. Consider the hierarchy for selecting a statistical analysis method summarized in this flowchart. The flowchart has five major decision points:

  • How many variables do you have?
  • What is your statistical objective?
  • What scales are the variables measured with?
  • Is there a distinction between dependent and independent variables?
  • Are the samples autocorrelated?

By the time you get to this point in planning your statistical analysis, you should already have determined the answers to the first four questions.

The first decision is a no-brainer. How many variables do you have—one or more than one? It doesn’t get any simpler than that. But say you have many variables, more than you can easily remember. Then it might be advantageous to use cluster analysis to select representative variables or use a data reduction technique to create new, more efficient variables.

The second decision is “what is your statistical objective?” There are five choices: description, identification or classification, comparison or testing, prediction, or explanation. There’s more information on these objectives in The Five Pursuits You Meet in Statistics. Once again, this should be a fairly easy decision to make.

The third decision is “what scales are the variables measured with?” This decision is a bit tougher because you have to know something about measurement scales. You might be able to get away with distinguishing between just a few scales, like nominal (i.e., groups or categories), ordinal (i.e., a sequence of integers), and continuous scales. The more you know about the quirks of the scales, the better able you will be to avoid problems. The quirks of time scales, for example, are formidable. Read Time Is On My Side and you’ll see what I mean.

The fourth decision is “is there a distinction between dependent and independent variables?” Once again, this decision is a bit more sophisticated because you have to know something about statistical modeling. In particular, you have to understand why one variable might be the focus of your analysis efforts while the others would be used for support. If your objective involves prediction, you need separate dependent and independent variables.

The fifth decision is “are the samples autocorrelated?” There are three ways observations or samples can be autocorrelated—by time, by location, and by sequence. If it’s important that the dependent variable is measured at a particular location or time, your data are probably autocorrelated. The autocorrelation may not be large, but it will be present. There are sophisticated ways of detecting spatial and temporal autocorrelation, but this rule-of-thumb will work most of the time. Measurements can also be autocorrelated by sequence, that is, the order in which they were taken. Say a measurement device is drifting slowly out of calibration. Each subsequent measurement would have an increasing bias independent of the time or location of the measurement. Sequential autocorrelation isn’t necessarily harder to detect; you just have to know to look for it.

Who needs tools when you have these?

As with most generalized flowcharts for decision-making, there are exceptions. Variables based on cyclic scales, like orientations and months of the year, are an example. There are two options for treating these types of scales. You can either transform the variable into a non-repeating linear scale or use specialized techniques. The first option is usually easier but the second option usually provides better results. Also, if you have more than one dependent variable and you want to analyze all the dependent variables simultaneously, you have to use multivariate statistics. Multivariate statistics are a quantum leap more complex than univariate (i.e., one dependent variable) statistics, and are probably best left to experienced statisticians.
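
One common way to handle a cyclic variable in an otherwise ordinary analysis, sketched below in Python, is to encode it as sine and cosine components so that, say, December and January end up numerically adjacent (the numbers here are purely illustrative):

```python
import numpy as np

months = np.arange(1, 13)                      # 1 = January ... 12 = December
angle = 2 * np.pi * (months - 1) / 12          # map the repeating scale onto a circle
month_sin, month_cos = np.sin(angle), np.cos(angle)

# The (sin, cos) pair can stand in for the raw month number as predictors,
# which would otherwise treat December and January as twelve units apart.
print(np.column_stack([months, month_sin.round(2), month_cos.round(2)]))
```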

So if you have some notion of what statistical techniques you might apply, read more about it on the Internet and go from there. Just remember, describing all the statistical techniques you might use in an analysis would be like trying to describe all the tools used in carpentry. There are some very common tools, such as saws and hammers, as well as very specialized tools, the ones that aren’t likely to be on the shelves at your local Home Depot. Don’t worry about the very specialized tools. You can accomplish quite a lot with these off-the-shelf statistical techniques. The other thing to bear in mind is that method selection guides such as those presented here can help you decide what you could use but not what you should use. You can use a sledge hammer to drive a nail, but you’d probably be better off using a smaller hammer. That’s a matter of experience, or at least, trial and error. Good luck!

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


The Five Pursuits You Meet in Statistics

When people think about statistical analyses, they often think only of mind-numbing number crunching that creates yet more numbers. But that’s like touring a cabinetmaker’s shop and seeing only the sawdust. A talented cabinetmaker can create beauty and function in his products. In the same way, a creative statistician can create enlightenment and utility if he or she has vision and purpose.

Statistical analyses usually aim at achieving one of five objectives:

  • Describe — characterizing populations and samples using descriptive statistics, statistical intervals, correlation coefficients, graphics, and maps.
  • Identify or Classify — classifying and identifying a known or hypothesized entity or group of entities using descriptive statistics, statistical intervals and tests, graphics, and multivariate techniques such as cluster analysis.
  • Compare or Test — detecting differences between statistical populations or reference values using simple hypothesis tests and analysis of variance and covariance.
  • Predict — predicting measurements using regression and neural networks, forecasting using time-series modeling techniques, and interpolating spatial data.
  • Explain — explaining latent aspects of phenomena using regression, cluster analysis, discriminant analysis, factor analysis, and other data mining techniques.

The following table provides some examples of data analysis tools that can be used for addressing the objectives.

Examples of Tools and Uses of Statistical Objectives

  • Describe — Commonly used tools: text and images, graphs, descriptive statistics. Example applications: opinion surveys, demographic surveys.
  • Compare or Test — Commonly used tools: text and images, graphs, descriptive statistics, statistical tests. Example applications: pharmaceutical effectiveness, educational methods.
  • Identify or Classify — Commonly used tools: visual scans; filters, queries, and sorts; graphs; discriminant analysis; association rules; classification trees; data mining. Example applications: biological species, tax return audits, possible criminals or terrorists.
  • Predict — Commonly used tools: graphs, regression, neural networks, data mining. Example applications: credit worthiness, student success in college.
  • Explain — Commonly used tools: regression, analysis of variance (ANOVA), other multivariate statistics. Example applications: academic research.

There are other classification schemes that describe other statistical pursuits, so don’t feel constrained by these five categories. But this classification of statistical aims is a reasonable place to start. It has three features. First, it’s easy to figure out so non-statisticians can decide in which category their project fits. Second, the major statistical techniques tend to be used primarily in just one of the classifications. And third, the scheme can be thought of as an index of the professional peril a statistician could face in doing the analysis. Here’s why.

Description is relatively straightforward. You can do the calculations on spreadsheet software. All you have to be aware of are measurement scales, distributions, sampling schemes, measures of central tendency and dispersion, and methods for dealing with outliers and missing data.
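
As a small illustration of how little machinery description needs, here is a minimal Python sketch of a descriptive summary; the values, including the missing entry and the suspect outlier, are made up:

```python
import pandas as pd

weights = pd.Series([3.1, 2.8, 3.4, 2.9, None, 3.0, 7.5])   # say, kitten weights in pounds

print(weights.describe())            # count, mean, std, quartiles (the missing value is skipped)
print("median:", weights.median())   # a center less affected by the 7.5 outlier
print("IQR:", weights.quantile(0.75) - weights.quantile(0.25))   # a robust measure of dispersion
```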

Identification and classification range from simple visual recognition to the exploration of arcane mathematical dimensions where only bold number crunchers venture. It’s like finding Waldo. At a convention of funeral directors, one look would be all you needed. If he were making American flags in a candy cane forest, you might need some non-visual clues. You can determine a person’s sex by looking at him or her but not from a table of eye and hair color. On the other hand, you couldn’t tell who the best players were on a sports team from their pictures, but you could from their performance statistics. However you do it, identification is the gateway to classification. If you can do one, you can probably do the other.

Comparison is tougher even though there is ample software available for most analyses. You need to know what test to run or ANOVA design to use as well as understand probability, effect size, and violations of assumptions. There’s a much greater chance of something going wrong.

Prediction is next. In addition to all the description and comparison techniques, you’ll need to know how to use a variety of model-building and assessment methods and understand the morass of prediction error. It’s easy to make a prediction. It’s hard to make an accurate prediction. It’s damn near impossible to make an accurate prediction that is also precise. Even if you did nothing wrong statistically, it’s easy to produce a poor prediction, and a poor prediction will eventually be noticed. One really good prediction and a psychic is famous; one really bad prediction and a statistician is relegated to selling insurance.

Well designed and well crafted. It’s purrfect.

Finally, explanation is the toughest of all objectives. Not only do you need to understand some of the more esoteric statistical methods, like factor analysis and canonical correlation, but you also have to understand the conceptual framework of the systems the data come from. Then, you have to have the talent to apply the knowledge creatively. You can’t explain your statistical model of stream contamination without knowing something about stream hydraulics, hydrogeology, meteorology, and environmental geochemistry. You can’t explain customer satisfaction without knowing something about demographics, marketing, business, and psychology. You’ll also probably have to integrate the information and think of it in ways that have never been thought of before. Explanation can create fundamental wisdom, although most of the time, your results will be humdrum. If you do come up with something truly consequential, though, some people will believe your results are erroneous, coincidental, or faked. Some people will claim that your finding is old news, having discovered it themselves years before, but then post it on Reddit for the karma. Most people, though, will just ignore you.

Creating a finished statistical analysis from raw data requires knowledge, experience, and often a bit of artistry. So when you conduct or review a statistical analysis, don’t let all the numbers obscure the craftsmanship and functionality of the products. And accordingly, don’t neglect to appreciate the talent and the artistry of the numbermaker.



Time Is On My Side

If you do much data analysis it won’t be long before you work with data measured over a range of times. When you do see time-series data, you’ll find that time scales and time units have some very quirky properties.

Time after Time

You might think that time is measured on a ratio scale given its ever finer divisions (i.e., hours, minutes, seconds). Yet it doesn’t make sense to refer to a ratio of two times any more than the ratio of two location coordinates. The starting point is also arbitrary. So time clearly isn’t measured on a ratio scale, but it can be measured on interval or ordinal scales. Time units are also used for durations; however, durations can be measured on a ratio scale. Durations can be used in ratios and they have a starting point of zero.
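
A minimal Python sketch of the distinction, assuming pandas is available (the dates are arbitrary):

```python
import pandas as pd

t1 = pd.Timestamp("1953-06-01")
t2 = pd.Timestamp("2023-06-01")

duration = t2 - t1                            # subtracting two times yields a duration (a Timedelta)
print(duration / pd.Timedelta(days=365.25))   # ratios of durations make sense: about 70 "years"

# A ratio of the two times themselves is undefined; pandas raises a TypeError if you try.
# t2 / t1
```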

 

 

Time measurements can be linear or cyclic. Year is linear, and can be measured on either an interval scale or an ordinal scale. For example, the year 1953 can be expressed as an integer (ordinal scale) or a decimal (interval scale). Furthermore, all values of linear time are unique. The year 1953 happened once and will never recur. Linear time is like a river. You start at some point and go with the flow. You can’t get back to your starting point, but it still exists somewhere in time.

Some time scales repeat. If day one is a Monday, then so is day eight. Likewise, month one is the same as month thirteen. So time can also be treated as being measured on a repeating ordinal scale. Durations don’t repeat; one day isn’t the same as eight days.

Does Anybody Really Know What Time It Is?

Most measurement scales are based on factors of ten. With time, though, there are 60 seconds per minute, 60 minutes per hour, and 24 hours per day. Blame the Babylonians for starting this craziness and every civilization for the next 4,000 years for being content with the status quo. In contrast, calendars have evolved from the Hellenic calendar (~850 BC) through the Roman calendar (~750 BC) and the Julian calendar (46 BC) to the Gregorian calendar (1582).

Everybody knows about seconds, minutes, hours, days, months, years, and even decades, centuries, and millennia, but there are many other units used for time. A jiffy is either one tick of a computer’s system clock (about 0.01 second) or the time required for light to travel one centimeter (about 33.3564 picoseconds). A New York second is the time between when a traffic signal turns from red to green and when the driver behind you honks his horn, about a second and a half. An inna minute is the time between when you ask a teenager to do something and the time he or she complies, usually about ten to thirty minutes. A warhol is being famous for fifteen minutes; a kilowarhol is being famous for approximately ten days. A moment is a medieval unit of time equal to about a minute and a half. A fortnight is two weeks. A platonic year is an astronomical unit equal to one full cycle of the precession of the equinoxes (about 26,000 calendar years).

There have been several systems in which time units were based on factors of ten, most notably by the Chinese (before the 17th century) and in France (during the 18th century). Decimal time divided a day (i.e., one rotation of the earth) into 10 metric hours, each hour into 100 metric minutes, and each minute into 100 metric seconds, sometimes termed a blink. A blink is 0.864 standard seconds, which is about twice the time it takes for you to blink your eye (from www.neatorama.com/2009/01/30/fun-and-unusual-units-of-measurements/).

Then there’s geologic time, which is subdivided into eons, eras, periods, epochs, and ages. The divisions are based on the rocks that were formed at the time and the fossils that occur within them. Consequently, the divisions aren’t all the same lengths and there aren’t the same number of subdivisions in each division. For example, the Paleozoic era is twice as long as the Mesozoic era, and four times longer than the Cenozoic era (which admittedly is still in progress). Likewise, some periods are four times longer than others. Moreover, the lengths of the divisions can change as more is learned about the history of the Earth. The units of the scale are also different in different parts of the world. Geologic time is an ordinal scale devised because measurements on the interval scale on which it is based (i.e., years) lack accuracy and precision.

Astronomical time is confusing, relatively, and it’s different if you’re on board the Enterprise or the Galactica. So the point is this—measuring time is complicated, not to mention time-consuming. But there’s even more to it than that.

Time Of The Season

Selecting an appropriate time scale is especially important because the scale can dictate the resolution and types of analyses that can be done. Resolution is an important matter. Select an interval that is too small and your database may become unmanageably large. Select an interval that is too large and you may not have enough resolution to investigate the time unit you are interested in. A good rule-of-thumb is to select an interval that is at least one time unit smaller than your unit of interest. For example, if you are interested in yearly trends, collect measurements every month. If you only collect measurements yearly, you won’t be able to assess the variability that occurs within a year. If you collect measurements more often than daily, you may have to roll up the data to make it manageable.

Take Your Time

Time formats can be difficult to deal with. Most data analysis software offers a dozen or more different formats for what you see. Behind the displayed format, though, the software stores a number, which is the distance of the time from an arbitrary starting point, in an arbitrary unit of time, almost always days. Convert a date-time format to a number format, and you’ll see what I mean. The formatting allows you to recognize values as times while the numbers allow the software to calculate statistics. This quirk of time formatting also presents a potential for disaster if you use more than one piece of software and they use different starting points or time units. Always check that the formatted dates are the same between applications.
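
Here is a minimal sketch of the problem in Python, assuming Excel’s 1900 date system as one of the conventions (other software uses other starting points and units):

```python
from datetime import date, timedelta

d = date(2021, 3, 15)

excel_epoch = date(1899, 12, 30)        # Excel's effective day zero (1900 date system)
unix_epoch = date(1970, 1, 1)           # the Unix/POSIX day zero

excel_serial = (d - excel_epoch).days   # the number Excel stores for this date
unix_days = (d - unix_epoch).days       # the day count another system might store

print(excel_serial, unix_days)          # two very different numbers for the same day

# Read one system's number against the other's starting point and every date shifts by about 70 years.
print(unix_epoch + timedelta(days=excel_serial))
```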

Time Will Tell

Time-series data are probably the most difficult type of data to analyze. Measurements involving time are usually autocorrelated, so using conventional statistical procedures can produce biased results. Besides their scale of measurement, there are several other aspects of temporal variables that add to the confusion.

  • Ch-Ch-Ch-Ch-Changes—Time-series data can exhibit a variety of patterns, including step changes, linear and nonlinear trends, and cyclic fluctuations. The effects may be superimposed on each other within a given time period or spread over many different time periods. For example, a change in the discharge of a river may be attributable to abrupt and ephemeral causes such as failure of a dam or a sudden downpour (shocks), abrupt and long-term causes such as natural changes in a drainage way or a man-made diversion (step changes), long-term causes such as drought or changes in water consumption (trends), repetitive changes such as seasonal cycles related to rainfall or irrigation (cyclic fluctuations) as well as random variations. Confounded effects are often impossible to separate, especially if the data record is short or the sampled intervals are irregular or too large.
  • One Day at a Time—Time-series measurements may not all be collected at a single instant in time. Some measurements are composites over time. For example, a flow measurement (e.g., stream, air) may be an instantaneous discharge or a total discharge over a selected time period. A sample may be collected at one time or be a composite of several samples collected at discrete time intervals and combined into a single sample container. The period over which each measurement is averaged is called the support. Obviously, you can’t evaluate a given time interval if your support is the same or larger than the interval.
  • For the Times They Are a Changing—There is a dilemma involving time-series that are measured over many years. It goes like this. As knowledge and technology improve, the chances increase that improvements in sampling and analysis procedures will reduce the overall variability of more recent measurements. That leads to violations of one of the fundamental assumptions of parametric statistical procedures, equality of variances (also called homoscedasticity). Sometimes, you just can’t win.
  • In the Year 2525 … —With most types of analysis, both statistical and deterministic, data analysts collect data over the entire range of the area of interest. If you want to analyze a chemical reaction at 100 degrees, you might analyze the reaction at temperatures between 80 degrees and 120 degrees. You wouldn’t, however, test the reaction at 40 to 80 degrees and extrapolate to what might happen at 100 degrees. In fact, scientists are taught never to extrapolate outside the range of their data. With time-series data, though, you have to extrapolate because you almost always want to know what will happen in the future. If you wait to see what actually happens, then it’s no longer interesting because it’s the past. And in the ultimate of ironies, you often can extrapolate time-series data because they are … autocorrelated. So the same property that makes time-series data difficult to analyze is what allows them to be extrapolated to future times, a process called forecasting. Mother Nature has a wicked sense of humor.
  • Time Keeps on Slipping into the Future—With other types of data, even autocorrelated spatial data, you can verify predictions whenever the need arises. With predictions for a time-series, forecasts, you have to wait until the time in question arrives. Then you have just one chance. You can’t go back if something goes wrong and you miss collecting the verification data. Hence, you can’t control verification.

So those are a few points about how time is measured and analyzed. There’s much more to it than that, but I’ll save those thoughts for another time.



The Zen of Modeling

What’s the first thing you think of when you hear the word model? The plastic model airplanes you used to build? A fashion model? The model of the car you drive? The person who is your role model? But what do any of those things have to do with data analysis? Read on; you’re about to find that statistical analyses begin and end with models.

By Any Other Name

What do a Ford Focus, a plastic airplane, and Tyra Banks have in common? They are all called models. They are all representations of something, usually an ideal or a standard.

This is supposed to be a model of ME!

Models can be true representations, approximate (or at least as good as practicable), or simplified, even cartoonish compared to what they represent. They can be about the same size, bigger, or most typically, smaller, whatever makes them easiest to handle. They usually represent physical objects but can also represent a variety of phenomena, including conditions such as weather patterns, behaviors such as customer satisfaction, and processes such as widget manufacturing. The models themselves do not have to be physical objects either. They can be written or drawn, or they can consist of mathematical equations or computer programming. In fact, using equations and computer code can be much more flexible and less expensive than building a physical model.

Customarily, models are used to:

  • Display what they represent (e.g., model airplanes) or are associated with (e.g., fashions).
  • Substitute for incomplete real world data, such as using the Normal distribution as a surrogate for a sample distribution.
  • Manipulate their components to learn more about the things they represent (e.g., scientific models for planetary motion).

Whether you know it or not, you deal with models every day. Your weather forecast comes from a meteorological model, maybe several. Mannequins are used to display how fashions may look on you. Blueprints are drawn models of objects or structures to be built. Examples are plentiful.

Examples of Physical Models

Humans, in particular, are modeled all the time because of our complexity. Children play with dolls as models of playmates. Mannequins are simplified models of fashion models, who, in turn, are models of people who might wear a fashion designer’s wares. Posing models provide reference points for artists. Crash test dummies reveal how the human body might react in an automobile accident. Medical researchers use laboratory animals in place of humans for basic research. Medical schools use donated cadavers as models, very good ones as it turns out, of the human anatomy. So, there should be nothing unfamiliar or intimidating about models.

Whether it is a physical scale-model of a hydroelectric dam or a mathematical model of weather patterns, a model is nothing more than a tool used to stimulate the imagination by simulating an object or phenomenon. The model airplane takes its young pilot looping through the blue skies of a summer day. Globes teach geography and orreries teach planetary motion. The mannequin shows the bride-to-be how beautiful she’ll look in the gown at her wedding. The concept car unveiled today gives consumers an idea of what they may be driving in a few years. The National Hurricane Center uses over a dozen mathematical models to forecast the intensities and paths of tropical storms and to help understand the complex dynamics of hurricanes.

It should come as no surprise, then, that scientists, engineers, and mathematicians use models, especially virtual models, all the time. It may be surprising, though, that virtual models are also used extensively in business, economics, politics, and many other fields. Nevertheless, there is a mystique associated with modeling, especially the mathematical variety. Some believe that models are infallible and unchanging. Some believe that models are impossibly complex and necessarily unfathomable. Some believe that models are sophisticated delusions for obfuscating real data. In reality, none of these opinions is correct, at least not entirely.

A Medley of Numbers

Mathematical models can be either theoretical (i.e., derived mathematically from scientific principles) or empirical (i.e., based on experimental observations). For example, celestial movements and radioactive decay are phenomena that can be evaluated using theory-based models. To calibrate a theoretical model, the form of the model (i.e., the equation) is fixed and the inputs are adjusted so that the calculated results adequately represent actual observations.

Empirical models differ from theoretical models in that the model is not necessarily fixed for all instances of its use. Rather, empirical models are developed for specific situations from measured data. Model formulation and calibration are simultaneous. However, the selection of the form of the equation and the inputs used in an empirical model are usually based on related theories. Models developed using statistical techniques are examples of empirical models.

Empirical models can also be deterministic, stochastic, or sometimes a hybrid of the two. Deterministic empirical models presume that a specific mathematical relationship exists between two or more measurable phenomena (as do theoretical models) that will allow the phenomena to be modeled without uncertainty under a given set of conditions (i.e., the model’s inputs and assumptions). Biological growth models are examples of deterministic empirical models.

Both theoretical models and deterministic empirical models provide solutions that presume that there is no uncertainty. These solutions are termed “exact” (which does not necessarily imply “correct”). Conversely, stochastic empirical models presume that changes in a phenomenon have a random component. The random component allows stochastic empirical models to provide solutions that incorporate uncertainty into the analysis.

Statistical models are examples of stochastic empirical models in which the model equation is generated by quantifying and minimizing errors (i.e., uncertainty). Statistical models place great emphasis on examining and quantifying uncertainty, whereas theoretical models generally do not.

OK, that’s way more than you need to know. Let me simplify. Mathematical models are based on theories or observations or both. They can produce a single (exact) answer for a set of inputs by assuming there is no variability or a range of (inexact) answers by incorporating the variability into the model.

For example, distribution models are equations that produce exact solutions for the equation curve. The model describes what your data frequency would look like if your sampling were a perfect representation of the population. So if your data follow a particular distribution model, you can use the model instead of your data to estimate the probability of a data value occurring. This is the basis of parametric statistics; you evaluate your data as if they came from a population described by the model. (In contrast, nonparametric statistics use your data instead of an exact model to estimate the probability of a data value occurring.) It’s like building a sand castle. A distribution model is like a bucket you can fill with sand (data) to create the castle (the result) with great efficiency. Without the model serving as a substitute, it takes more effort (data) to completely shape the castle.
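
A minimal Python sketch of that parametric shortcut, with simulated readings standing in for real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=68, scale=3, size=40)     # e.g., 40 room-temperature readings

# Parametric: treat the data as if they came from a Normal population and use the model.
mu, sigma = data.mean(), data.std(ddof=1)
p_model = 1 - stats.norm.cdf(74, mu, sigma)     # P(value > 74) from the fitted distribution

# Nonparametric: use the data themselves instead of the model.
p_empirical = np.mean(data > 74)

print(p_model, p_empirical)
```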

Statistical analyses involving descriptive statistics and testing rely on exact mathematical models like the Normal distribution to represent data frequencies and error rates. Just as importantly, though, statistical techniques are used to build models from data. Such statistical models include an error term to incorporate the effects of variation, and thus, are inexact because they produce a solution that is a range of possible values. Statistical analyses involving detecting differences, prediction, or exploration involve using statistics to estimate the mathematical coefficients, the parameters, of a model.
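
To make the “inexact” part concrete, here is a minimal sketch of fitting a statistical model and quantifying its error term; the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 4 + rng.normal(scale=2, size=x.size)   # a "true" line plus random variation

slope, intercept = np.polyfit(x, y, 1)               # statistics estimate the model's parameters
residuals = y - (slope * x + intercept)
error_sd = residuals.std(ddof=2)                      # the error term: spread the model can't explain

print(f"y is roughly {slope:.2f}x + {intercept:.2f}, give or take about {error_sd:.2f}")
```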

So, models and statistics are closely intertwined. Statistical analyses begin and end with models. Models serve as both inputs and outputs of statistical analyses. You can’t do without them, so you might as well understand what they are.



There’s Something About Variance

Imagine practicing hitting a target using darts, bow and arrow, pistol, cannon, missile launcher, or whatever. You aim for the center of the target. If your shots land where you aimed, you are considered to be accurate. If all your shots land near each other, you are considered to be precise. The two properties are not linked. You can be accurate but not precise, precise but not accurate, neither accurate nor precise, or both accurate and precise.

Accuracy and precision also apply to statistics calculated from data. If you’re trying to determine some characteristic of a population (i.e., a population parameter), you want your statistical estimates of the characteristic to be both accurate and precise.

The same also applies to the data themselves. When you start measuring data for an analysis, you’ll notice that even under similar conditions, you can get dissimilar results. That lack of precision is called variability. Variability is everywhere; it’s a normal part of life. In fact, it is the spice in the soup. Without variability, all wines would taste the same. Every race would end in a tie. Even statistics might lose its charm. Your doctor wouldn’t tell you that you have about a year to live; he’d say don’t make any plans for January 11 after 6:13 pm EST. So a bit of variability isn’t such a bad thing. The important question, though, is what kind of variability?

The Inevitability of Variability

Before going further, let me clarify something. Statisticians discuss variability using a variety of terms, including errors, uncertainty, deviations, distortions, residuals, noise, inexactness, dispersion, scatter, spread, perturbations, fuzziness, and differences. To nonprofessionals, many of these terms hold pejorative connotations. But variability isn’t bad … it’s just misunderstood.

Suppose you’re sitting in your living room one cold winter night contemplating the high cost of heating oil. The thermostat reads 68 degrees F, but you’re still shivering. Maybe the thermostat is broken. Maybe the heater is malfunctioning or you need more insulation. You need a warmer place to sit while you read An Inconvenient Truth, so you grab a thermometer from the medicine cabinet and start measuring temperatures around the room. It’s 115 degrees at the radiator, 68 degrees at your chair, 59 degrees at the window, and 69 degrees at the stairs. You keep measuring. It’s 73 degrees at the fish tank, 67 degrees at the couch and bookcase, 82 degrees at the TV, and 60 degrees at the door. That’s a lot of variation!

Think of those temperature readings as the summation of five components:

  • Characteristic of Population—the portion of a data value that is the same between a sample and the population. This part of a data value forms the patterns in the population that you want to uncover. If you think of the living room space as the population you’re measuring, the characteristic temperature would be the 68 degrees at your chair where you want to read.
  • Natural Variability—the inherent differences between a sample and the population. This part of a data value is the uncertainty or variability in population patterns. In a completely deterministic world, there would be no natural variability. You would read the same value at every point where you took a measurement. But in the real world, if you made the same measurement again and again, you probably would get different values. If all other types of variation were controlled, these differences would be the natural or inherent variability.
  • Sampling Variability—differences between a sample and the population attributable to how uncharacteristic (non-representative) the sample is of the population. Minimizing sampling error requires that you understand the population you are trying to evaluate. The sampling variability in the living room would be attributable to where you took the temperature readings. For example, the radiator and TV are heat sources. The door and window are heat sinks. Furthermore, if all the readings were taken at eye level, the areas near the ceiling and floor would not have been adequately represented. The floor may be a few degrees cooler because the denser cold air sinks, displacing the warmer air upward, which is why the air at the ceiling is warmer.
  • Measurement Variability—differences between a sample and the population attributable to how data were measured or otherwise generated. Minimizing measurement error requires that you understand measurement scales and the actual process and instrument you use to generate data. Using an oral thermometer for the living room measurements may have been expedient but not entirely appropriate. The temperatures you wanted to measure are at the low end of the thermometer’s range, where it may be less accurate than it is around 98 degrees. Also, the thermometer is slow to reach equilibrium and can’t be read with more than one decimal place of precision. Use a digital infrared laser-point thermometer next time. More accurate. More precise. More fun.
  • Environmental Variability—differences between a sample and the population attributable to extraneous factors. Minimizing environmental variance is difficult because there are so many causes and because the causes are often impossible to anticipate or control. For example, the heating system may go on and off unexpectedly. Your own body heat adds to the room temperature and walking around the living room taking measurements mixes the ceiling and floor air which adds variability to the temperatures.
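
A toy simulation may make the bookkeeping clearer. The sketch below, with entirely invented numbers, builds one thermometer reading as the sum of the five components:

```python
import numpy as np

rng = np.random.default_rng(7)

characteristic = 68.0                         # the room temperature you actually care about
natural = rng.normal(0, 0.5)                  # inherent fluctuation of the room itself
sampling = rng.choice([-8.0, 0.0, 14.0])      # a cold window, your chair, or the radiator
measurement = rng.normal(0, 1.0) + 0.3        # thermometer noise plus a small bias
environmental = rng.normal(0, 0.8)            # furnace cycling, body heat, drafts

reading = characteristic + natural + sampling + measurement + environmental
print(round(reading, 1))                      # one reading, five stories behind it
```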

When you analyze data, you usually want to evaluate characteristics of some population and the natural variability associated with the population. Ideally, you don’t want to be misled by any extraneous variability that might be introduced by the way you select your samples (or patients, items, or other entities), the way you measure (generate or collect) the data, or uncontrolled transient events or conditions. That’s why it’s so important to understand the ways of variability.

Variability versus Bias

Remember target practice? If there is little variation in your aim, the deviations from the center of the target would be random in distance and direction. Your aim would be accurate and precise. But what if the sight on your weapon were misaligned? Your shots would not be centered on the center of the target. Instead there would be a systematic deviation caused by the misaligned sight. Your shots would all be inaccurate, by roughly the same distance and direction from the center. That systematic deviation is called bias. You may not even have known there was a problem with the sight before shooting, although you would probably suspect something after all the misses.

Bias usually carries the connotation of being a bad thing. It usually is. It may be why 19th Century British Prime Minister Benjamin Disraeli mistakenly associated statistics with lies and damn lies. But if the systematic deviation is a good thing because it fixes another bias, it’s called a correction. For example, you could add a correction, an intentional bias in the direction opposite the bias introduced by the weapon sight, to compensate for the inaccuracy. So bias can be good (in a way) or bad, intentional or not, but it’s always systematic. On the other hand, a bias applied to only selected data is a form of exploitation, and is nearly always intentional and a very bad thing.

So the relationships to remember are:

Variance ↔ Imprecision

Bias ↔ Inaccuracy
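
A minimal simulation of the target-practice analogy shows the two quantities separately; the offset and scatter are made up:

```python
import numpy as np

rng = np.random.default_rng(42)

sight_offset = np.array([2.0, -1.0])                        # a misaligned sight: systematic error
shots = sight_offset + rng.normal(0, 0.5, size=(100, 2))    # plus random scatter around the aim point

bias = shots.mean(axis=0)             # inaccuracy: how far the average shot sits from the bull's-eye
spread = shots.std(axis=0, ddof=1)    # imprecision: how loosely the shots cluster

print("bias (inaccuracy):", bias.round(2))
print("spread (imprecision):", spread.round(2))
```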

Most statistical techniques are unbiased themselves, as long as you meet their assumptions. If something goes wrong, you can’t blame the statistics. You may have to look in the mirror, though. During the course of any statistical analysis, there are many decisions that have to be made, primarily involving data. Whatever the decisions are, such as deleting or keeping an outlier, there will be some impact on precision and perhaps even accuracy. In an ideal world, the sum of the decisions wouldn’t add appreciably to the variability. Often, though, data analysts want to be conservative, so they make decisions they believe are counter to their expectations. But when they don’t get the results they expected, they go back and try to tweak the analysis. At that point they have lost all chance of doing an objective analysis and are little better than analysts with vested interests who apply their biases from the start. Avoiding such analysis bias requires no more than making decisions based solely on statistical principles. That sounds simple, but it isn’t always so.

Sometimes bias isn’t the fault of the data analyst, as in the case of reporting bias. In professional circles the most common form of reporting bias is probably not reporting non-significant results. Some investigators will repeat a study again and again, continually fine-tuning the study design until they reach their nirvana of statistical significance. Seriously, is there any real difference between probabilities of significance of 0.051 versus 0.049? But you can’t fault the investigators alone. Some professional journals won’t publish negative results, and professionals who don’t publish perish. Can you imagine the pressure on an investigator looking for a significant result for some new business venture, like a pharmaceutical? He might take subtle actions to help his cause then not report everything he did. That’s a form of reporting bias.

Perhaps the most common form of reporting bias in nonprofessional circles is cherry picking, the practice of reporting just those findings that are favorable to the reporter’s position. Cherry picking is very common in studies of controversial topics such as climate change, marijuana, and alternative medicine. Virtually all political discussions use information that was cherry picked.

Given that someone else’s reporting bias is after-the-analysis, why is it important to your analysis? The answer is that it’s how you can be misled in planning your statistical study. Never trust a secondary source if you can avoid it. Never trust a source of statistics or a statistical analysis that doesn’t report variance and sample size along with the results. And always remember: statistics don’t lie; people do.



Samples and Potato Chips

Samples are like potato chips. You’re never satisfied with just one. Every one you take makes you want more. And you’re never sure you’ve had enough until you’ve had way too many.

 

Betcha Can’t Take Just One

One observation. One test sample. One subject. One measurement. One of anything isn’t that satisfying. You’ll always want more to replicate the experience, to find out if there is consistency. Maybe you take just a few. If you sense a pattern, you can build your observations into an anecdote, a story. Many statistical analyses, in fact, grow out of anecdotal evidence. You just can’t stop at the storytelling stage. Statistics are antidotes to those anecdotes.

Chips all gone. Want more!

Politicians, preachers, and parents can get away with telling tales to illustrate points they want to make. Their followers trust them and want to believe them whether they are telling the truth or not. Other professionals, though, can’t rely on their audience having such unquestioning faith. Scientists rely on hard data to test their hypotheses. Educators need test scores so they can grade on a curve. Businessmen want to see the numbers before they spend their money (your money, not so much). So you can pretty much expect that once you start collecting data, you’re going to want more.

Want More

You know you want more data, so first you estimate how many more samples you’ll need to get that Purrfect Resolution. Say you’ve estimated that you need 1,000 samples to do a statistical analysis. You package your sampling and analysis plan into a proposal and give it to your client. One thing you can bet on is that your client won’t want to spend the money to collect that many samples. So what can you do? Here are a few suggestions:

Change the Study—Lower your confidence (1 minus the false positive error rate you’ll allow) and power (1 minus the false negative error rate you’ll allow). If you do this, look out for those misleading test results. You can look for bigger effects (e.g., differences between means, size of targets, and so on). You won’t get the resolution you wanted but it could be a good start. Also consider limiting the study area, level of detail, or analysis scope. Sometimes you can trade other project costs, like meetings and deliverables, for a few more samples.

Take Smaller Bites—Take as many samples as you can and use the information to decide what to do next. This is sometimes the aim of a pilot study. You can use the samples collected during a pilot study to estimate more precisely how many more samples you’ll need to get the statistical resolution you want. You might also be able to collect samples in phases or change the implementation schedule to accommodate your client’s budget cycle.

Use Supporting Data—There may be historical data available that you can use to reassess the number of samples you’ll need and even augment the samples you plan to collect (i.e., provided the quality of the historical data is appropriate). You can also consider surrogate sampling, in which you correlate the results of many inexpensive observations or measurements to the few expensive samples your client can afford.

Control Variance—If you think about it, the reason you need more samples in the first place is because you need to improve precision (not accuracy). So think harder about how you can reduce any extraneous variability in the data generation process. Standardized procedures and training of the data collectors might mitigate the need for quite a few samples.

Too Many

Can you eat too many potato chips? Of course you can. It’s happened to many of us. Likewise, you can have too many samples, which presents its own set of challenges. Here are five:

Information Overload — Statistical software tends to be very efficient, but when you have tens of thousands of samples, you start to see performance slow a bit. What’s more important, though, is the inefficiency you run into when you scrub your dataset, especially if you use a lot of spreadsheet array formulas. Be patient. You can use the waiting time to read a good book.

Chasing Tails — In any dataset, you may have 5% influential observations, plus outliers and errors, all of which you’ll have to check to determine whether they should be corrected, removed from the dataset, or left alone. This is a very time-consuming process. With a small dataset, you may have to investigate just a few samples. With a 1,000-record dataset, you may have to investigate 50 samples. This is part of why data scrubbing can represent most of the work in a data analysis project.

Data Intimacy — When you’re working with only a few dozen samples, you get to know each data point. You can look at plots and tables and see how individual details fit into a bigger picture. You can’t do that with a thousand data points. Sometimes you can get around this problem by dividing the data into groups and working with the groups, or analyzing a higher level of hierarchical data.

Graphic Mud — It’s tough to see patterns with only a few samples but plotting thousands of samples can be just as perplexing. You won’t be able to use any small plots like matrix plots. Even with full-scale plots, it will be difficult to see subtle differences in data point markers, like size, shape, and even color. Points will overwrite each other so you won’t be able to tell if there is one point at a graph location or a hundred points stacked on top of each other. And even the best statistical software will choke when trying to print graphs with thousands of data points. Solving this problem usually involves plotting group means or only randomly selected records from the data matrix.

Meaningless Differences — Sometimes you can have too much resolution in a statistical test. If the test can detect a difference smaller than would be of interest in the real world, it’s probably because you used too many samples. Conduct a power analysis after statistical testing to determine what your effect size and error rates were for the test.

And that’s why it’s important to have about the right number of samples; enough to at least make progress towards your goal but not so many that the progress doesn’t justify the effort.



Purrfect Resolution

No matter what their area of expertise, statisticians are asked certain questions with such predictability that it borders on the deterministic. No question is asked more often than:

How many samples do I need?

Most statisticians wish they could answer the sample size question definitively instead of mumbling about effect sizes and whatnot. It’s just not that simple.

One way to look at how many individual samples (i.e., observations, cases, records, subjects, survey respondents, organisms, or any other object or entity on which you collect information) you need for an analysis is in terms of how much resolution you want. Think of the resolving power of a telescope or a microscope, or the number of pixels in a computer image. The greater the resolution, the more detail you’ll see.

Consider this picture. You couldn’t make out the image with a resolution of 9 pixels per inch and maybe not even with a resolution of 18 pixels per inch. At 36 pixels per inch, you can tell it’s an image of a kitten, even if it’s a bit fuzzy. At 72 pixels per inch, the image is sharp and you can tell that the kitten is Kerpow. Doubling the resolution again adds little to your perception of the image; it’s a waste of the additional information. Likewise with statistics, the greater the number of samples, the more precise your results will be. But beyond a certain point, adding samples adds little to your understanding. In fact too many samples can have negative consequences. So, the trick is to collect the fewest samples that will achieve your objective.

Deciding how many samples you’ll need starts with deciding how certain your answer needs to be given your objective. Now here’s the bad news. There’s no way to know exactly how many samples you’ll need before you conduct your study. There are, however, formulas for estimating what an appropriate number of samples might be. In the situations in which the formulas don’t apply, there are rules-of-thumb or other ways to come up with a number. Unfortunately, it seems that no matter how many samples you estimate you’ll need, the number is always a lot more than your client wants to collect. After all, they are the ones who have to pay for collecting and analyzing the samples. So, any estimate of the number of samples that you tell your client ends up being a subject of negotiations. But here are a few places to start.

How Many Samples for Describing Data?

Say all you want to do is to collect enough samples to calculate some descriptive statistics. Maybe you want to characterize some condition, like the average weight of a litter of kittens or the average age of your favorite professional sports team. How many samples do you need? Well if your population is small enough, like five kittens or 25 baseball players, you simply use all the members of the population, a census.

But what if you want to calculate descriptive statistics to characterize a large population? The number of samples you’ll need to describe it will depend on the precision you want, not the accuracy. The greater the number of samples, the more precise your estimate will be. More specifically, the standard error of an estimated mean is the standard deviation divided by the square root of the number of samples, so precision improves only with the square root of the sample size. Add samples if you can (by a lot), and if you can’t, try to control the variance.
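
A quick simulation, sketched below with made-up numbers, shows why adding samples pays off slowly: quadrupling the sample size only halves the standard error.

```python
import numpy as np

rng = np.random.default_rng(3)

for n in (25, 100, 400, 1600):
    sample = rng.normal(50, 10, size=n)       # population mean 50, standard deviation 10
    se = sample.std(ddof=1) / np.sqrt(n)      # standard error of the estimated mean
    print(f"n = {n:5d}   standard error about {se:.2f}")
```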

How Many Samples for Detecting Differences?

Often, the point of calculating statistics is to make an inference from a sample to a population. You can estimate how many samples you might need to conduct a statistical test of one or more populations by rearranging the equation for the test you plan to use and solving for the number of samples. To take this approach, there are usually two other things you need to know—the difference you want to detect and the population variance.

You should have some idea of the size of the difference you want to detect, called a meaningful difference. Say you want to compare how long it takes you to commute to work via two different routes. Differences of a few seconds probably aren’t meaningful but differences of a few minutes probably are. If you work as a NASCAR driver, go with seconds. The smaller the difference you want to detect, the more samples you’ll need.

Knowing the population variance is the Catch-22 of statistics. You can’t calculate the number of samples you’ll need without knowing the population variance and you can’t estimate the population variance without already having samples from the population. Now, there are maybe a half dozen ways to try to get around this problem but they all require you to know or guess at some aspect of the population. The approach is often used after a preliminary study (called a pilot study) is done in part to estimate the population variance.
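
If you can live with guesses for the meaningful difference and the variance, packaged sample-size calculators do the rearranging for you. A hedged sketch using statsmodels, with placeholder inputs:

```python
from statsmodels.stats.power import TTestIndPower

meaningful_difference = 2.0     # smallest difference worth detecting, e.g., minutes of commute time
pilot_sd = 5.0                  # standard deviation guessed from a pilot study
effect_size = meaningful_difference / pilot_sd    # Cohen's d

n_per_group = TTestIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
print(round(n_per_group), "samples per group")
```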

How Many Samples for Opinion Surveys?

If you’re going to survey a small population, like your colleagues at work, send surveys to everybody and hope you get a representative sample from the people who do respond. If the size of your population is large compared to the number of surveys you might take, a quick way to estimate the sample size is:

sample size = 1 / (margin of error you want, expressed as a proportion)²

So if you want a ±5% error with 95% confidence, you would need about 400 surveys to be completed (i.e., 1/0.05²). With 1,000 surveys, the error drops to about 3%, but to get to 2% error, you would have to collect 2,500 surveys. It’s more complicated than this of course. If your sample will be a sizable proportion of your population or if the opinions aren’t evenly divided, the short-cut formula will overestimate how many surveys you might need. If you plan to subdivide the population to look at demographics, you’ll need more surveys to get to your desired error rates.
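
Here is a minimal sketch of the short-cut in Python, alongside the fuller formula it approximates (95% confidence, worst-case 50/50 split of opinions):

```python
import math

def survey_sizes(margin_of_error, z=1.96, p=0.5):
    """Completed surveys needed for a given margin of error (expressed as a proportion)."""
    shortcut = 1 / margin_of_error ** 2
    fuller = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    return math.ceil(shortcut), math.ceil(fuller)

for e in (0.05, 0.03, 0.02):
    print(f"{e:.0%}: {survey_sizes(e)}")    # e.g., 5% needs roughly 400 completed surveys either way
```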

How Many Samples for Evaluating Trends?

Say you plan to do a regression analysis to evaluate the relationship between two sets of measurements. How many samples do you need? There are two ways to answer this question, a difficult way and an easy way. The difficult way is to calculate it the same way as you would if you were looking to detect differences. This approach requires a sophisticated understanding of statistical tests and the populations being tested. It is most often used in experimental situations.

The simpler approach is to base the number of samples on a rule-of-thumb based on the number of independent variables. The more independent variables (i.e., predictor variables) there are, the more samples are needed to define their relationship to a dependent variable. The guidelines are not hard and fast but they boil down to these:

  • 10 samples per predictor variable—the bias may be large but there are often enough samples to estimate simple linear relationships with adequate precision.
  • 50 samples per predictor variable—the bias is relatively small, linear relationships can be estimated with good precision, and there are usually enough samples to determine the form of more complex relationships.
  • 100 samples per predictor variable—the bias is insignificant, linear relationships are estimated precisely, and complex nonlinear relationships can be estimated adequately.
  • 250+ samples per predictor variable—the bias is insignificant and most complex relationships can be estimated precisely.
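The arithmetic behind those tiers is trivial, but a quick sketch makes it explicit (the tier names are mine; the multipliers are the guidelines above):

```python
# Rule-of-thumb sample sizes for regression: samples = predictors * multiplier.
# Tier labels are mine; the multipliers come from the guidelines above.
TIERS = {"rough": 10, "good": 50, "precise": 100, "very precise": 250}

def samples_needed(n_predictors, tier="good"):
    return n_predictors * TIERS[tier]

print(samples_needed(3, "good"))     # 150 samples for 3 predictors
print(samples_needed(3, "precise"))  # 300 samples for the same 3 predictors
```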

How Many Samples for Forecasting Time Series?

Deciding how many samples to use for analyzing a time series can be a challenge. Here are two popular rules-of-thumb (a quick worked example follows the list):

  • Collect samples at regular intervals from at least three or four consecutive cycles or units of any pattern in which you might be interested. For example, if you are interested in seasonal patterns (i.e., a pattern lasting a year) collect data for at least three or four years.
  • Collect samples at time units smaller than the duration of the pattern in which you might be interested. For example, if you are interested in seasonal patterns, collect data weekly, biweekly, or, at the least, monthly.
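Combining the two rules gives a back-of-the-envelope count (my sketch; the cycle lengths are just illustrations):

```python
import math

# Rough sample count: cover a few full cycles, sampled at intervals
# shorter than the cycle itself.
def time_series_samples(cycle_length, sampling_interval, n_cycles=3):
    if sampling_interval >= cycle_length:
        raise ValueError("sample at intervals shorter than the pattern's cycle")
    return n_cycles * math.ceil(cycle_length / sampling_interval)

print(time_series_samples(cycle_length=12, sampling_interval=1))               # 36 monthly values over 3 years
print(time_series_samples(cycle_length=52, sampling_interval=2, n_cycles=4))   # 104 biweekly values over 4 years
```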

How Many Samples for Identifying Targets?

Sometimes the goal of sampling is to find one or more targets. For example, in World War II, destroyer captains needed to know how many depth charges to drop to be reasonably certain of destroying an enemy submarine. Likewise, adventurers looking for sunken ships, like the Monitor and the Titanic, use statistical sampling to find their targets. In the environmental field, sampling is often done to look for “hot spots” of contamination in soil. There are two ways this type of problem is typically handled—judgment sampling and search sampling.

The strategy behind judgment sampling is that an expert picks locations for sampling that he or she believes are most likely to reveal a target. With this approach, it is assumed that the expert has some preternatural ability to find the targets. Judgment sampling (a.k.a. judgmental sampling, biased sampling, haphazard sampling, directed sampling, professional judgment) has the advantage of involving far fewer samples than search sampling. The disadvantage is that there is no way to quantify the uncertainty of the result.

Search sampling involves sampling on a regular grid so that it is possible to estimate the probability of finding randomly located targets. In essence, the probability of finding a target depends on the size and shape of the target and the size and shape of the cells of the sampling grid. The downside of this approach is that it usually involves many more samples than judgment sampling, and the results do not always sound very reassuring. For example, you would need about 10,000 samples taken on a 100-foot square grid across a 100,000,000-square-foot search area to have an 80% probability of finding a circular hotspot 100 feet in diameter. In search sampling, a large number of samples is the price you pay for being able to quantify uncertainty. But if you understand the uncertainty, you are one giant step closer to controlling adverse risks. That’s the resolving power of statistics.
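The geometry behind numbers like those can be sketched with a simplified calculation (my approximation: it ignores edge effects and assumes the hotspot is no wider than the grid spacing). The chance that at least one grid node lands inside a randomly located circular hotspot is roughly the hotspot’s area divided by the area of one grid cell, while the number of samples is the search area divided by the area of one grid cell.

```python
import math

def grid_hit_probability(target_diameter, grid_spacing):
    """P(at least one grid node falls inside a randomly placed circular target).
    Simplified: assumes target_diameter <= grid_spacing and ignores edge effects."""
    return min(1.0, math.pi * (target_diameter / 2) ** 2 / grid_spacing ** 2)

def samples_on_grid(search_area, grid_spacing):
    """Approximate number of grid nodes needed to cover the search area."""
    return math.ceil(search_area / grid_spacing ** 2)

print(round(grid_hit_probability(100, 100), 2))   # about 0.79, roughly the 80% in the text
print(samples_on_grid(100_000_000, 100))          # about 10,000 samples
```

Notice that the hit probability depends only on the grid spacing relative to the hotspot, while the sample count scales with the search area; that’s where the big numbers come from.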

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


30 Samples. Standard, Suggestion, or Superstition?

If you’ve ever taken any applied statistics courses in college, you may have been exposed to the mystique of 30 samples. Too many times I’ve heard do-it-yourself statisticians tell me that “you need 30 samples for statistical significance.” Maybe that’s what they were taught; maybe that’s how they remember what they were taught. In either case, the statement merits more than a little clarification, starting with the 30-samples part. Suffice it to say that if there were any way to answer the how-many-samples-do-I-need question that simply, you would find it in every textbook on statistics, not to mention TV quiz shows and fortune cookies. Still, if you do an Internet search for “30 samples” you’ll get millions of hits.

Like many legends, there is some truth behind the myth. The 30-sample rule-of-thumb may have originated with William Gosset, a statistician and Head Brewer for Guinness. In a 1908 article published under the pseudonym Student (Student. 1908. Probable error of a correlation coefficient. Biometrika 6, 2-3, 302–310.), he compared the variation associated with 750 correlation coefficients calculated from sets of 4 and 8 data pairs, and 100 correlation coefficients calculated from sets of 30 data pairs, all drawn from a dataset of 3,000 data pairs. Why did he pick 30 samples? He never said but he concluded, “with samples of 30 … the mean value [of the correlation coefficient] approaches the real value [of the population] comparatively rapidly,” (page 309). That seems to have been enough to get the notion brewing.
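You can get a feel for what Gosset was doing with a quick simulation (a modern sketch of his resampling exercise; the bivariate data below are made up, with a correlation in the general neighborhood of the one he worked with):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for Gosset's 3,000 data pairs: correlated bivariate data.
x = rng.normal(size=3000)
y = 0.66 * x + rng.normal(scale=0.75, size=3000)   # population correlation around 0.66
data = np.column_stack([x, y])

def sampled_correlations(n_pairs, n_sets):
    corrs = []
    for _ in range(n_sets):
        idx = rng.choice(len(data), size=n_pairs, replace=False)
        corrs.append(np.corrcoef(data[idx, 0], data[idx, 1])[0, 1])
    return np.array(corrs)

# Mimic the experiment: many small sets, fewer sets of 30 pairs.
for n_pairs, n_sets in [(4, 750), (8, 750), (30, 100)]:
    r = sampled_correlations(n_pairs, n_sets)
    print(f"{n_pairs:2d} pairs: mean r = {r.mean():.2f}, spread (sd) = {r.std():.2f}")
```

The correlations from 30-pair sets cluster much more tightly around the population value than those from 4-pair sets, which is essentially Gosset’s observation.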

Since then, there have been two primary arguments put forward to support the belief that you need 30 samples for a statistical analysis. The first argument is that the t-distribution becomes a close fit for the Normal distribution when the number of samples reaches 30. (The t-distribution, sometimes referred to as Student’s distribution, is also attributable to W. S. Gosset. The t-distribution is used to represent a normally distributed population when there are only a limited number of samples from the population.) That’s a matter of perspective.

This figure shows the difference between the Normal distribution and the t-distribution for 10 to 200 samples. The differences between the distributions are quite large for 10 samples but decrease rapidly as the number of samples increases. The rate of the decrease, however, also diminishes as the number of samples increases. At 30 samples, the difference between the Normal distribution and the t-distribution (comparing the critical values at the 95th percentile) is about 3½%. At 60 samples, the difference is about 1½%. At 120 samples, the difference is less than 1%. So from this perspective, using 30 samples is better than 20 samples but not as good as 40 samples. Clearly, there is no one magic number of samples that you should use based on this argument.
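If you want to reproduce that comparison, a few lines of SciPy will do it (my sketch; it compares one-sided 95% critical values, which is one reasonable way to measure the gap):

```python
# Compare t and Normal critical values at the one-sided 95% point.
from scipy.stats import norm, t

z_crit = norm.ppf(0.95)                     # about 1.645

for n in (10, 30, 60, 120, 200):
    t_crit = t.ppf(0.95, df=n - 1)          # t critical value for n samples
    pct_diff = 100 * (t_crit - z_crit) / z_crit
    print(f"n = {n:3d}: t = {t_crit:.3f}, difference from Normal = {pct_diff:.1f}%")

# The gap shrinks from roughly 11% at 10 samples to a few percent at 30
# and under 1% by 120; there is no cliff at n = 30.
```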

The second argument is based on the Law of Large Numbers, which in essence says that the more samples you use the closer your estimates will be to the true population values. This sounds a bit like what Gosset said in 1908, and in fact, the Law of Large Numbers was 200 years old by that time.

This figure shows how means estimated from different numbers of samples compare to the population mean. (These data were generated by creating a normally distributed population of 10,000 values, then drawing 100 random sets of values for each sample size from 2 to 100: 100 datasets containing 2 samples, 100 datasets containing 3 samples, and so on up to 100 datasets containing 100 samples. Then the mean of each dataset was calculated. Unlike Gosset, I got to use a computer and some expensive statistical software.) The small inset graph shows the largest and smallest means calculated for datasets of each sample size. The large graph shows the difference between the largest mean and the smallest mean calculated for each sample size. These graphs show that estimates of the mean from a sampled population become more precise as the sample size increases (i.e., the Law of Large Numbers). The important thing to note is that the precision of the estimated means increases very rapidly up to about ten samples, then continues to increase, albeit at a decreasing rate. Even beyond 70 or 80 samples, the spread of the estimates continues to decrease. So again, there’s nothing extraordinary about using 30 samples.
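A rough reproduction of that experiment takes only a few lines (my sketch, using NumPy instead of expensive statistical software; the population parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=100, scale=15, size=10_000)   # hypothetical population

spreads = {}
for n in range(2, 101):
    # 100 random datasets of size n, each drawn from the population
    means = [rng.choice(population, size=n, replace=False).mean() for _ in range(100)]
    spreads[n] = max(means) - min(means)    # spread of the estimated means

# The spread shrinks rapidly up to about 10 samples, then keeps shrinking more slowly.
for n in (2, 10, 30, 70, 100):
    print(f"n = {n:3d}: range of estimated means = {spreads[n]:.2f}")
```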

So while Gosset may have inadvertently started the 30-samples tale, you have to give him a lot of credit for doing all those calculations with pencil and paper. To William Gosset, I raise a pint of Guinness.

Now we still have to deal with that how-many-samples-do-I-need question. As it turns out, the number of samples you’ll need for a statistical analysis really all comes down to resolution. Needless to say, that’s a very unsatisfying answer compared to … 30 samples.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


It’s All Greek

When Humpty Dumpty uses a word, it means just what he chooses it to mean, neither more nor less. To people not conversant in a technical specialty, it seems that all the experts are Humpty Dumptys. Statistics is no exception.

It doesn’t look like a mouse to me.

If you’re a beginner at data analysis, it will seem like there is a superabundance of esoteric statistical slang. You’ll hear it even from friendly statisticians. It gets worse when you start reading websites, books, and worst of all, journal articles. If you want to see what I mean, read some of the article titles in the Journal of the American Statistical Association (at http://pubs.amstat.org/loi/jasa). The statisticians who write those obfuscatory tracts believe they are writing to people who know as much as they do. This seems odd given that those authors are supposed to be the experts in what they are writing about. Even other statisticians can’t decipher some of those articles without spending time with the reference books. So don’t feel like you’re alone in a foreign country. We stand befuddled together.

To simplify statistical jargon, think of it in three categories—statistical concepts named after someone, special words created to convey a special meaning, and common words and phrases with alternative meanings. We’ll leave the acronyms out of it for now.

Named Things

Statistical procedures, especially statistical tests, are often modified to accommodate some special circumstance or to have some desirable property. When this occurs, the new procedure is commonly named after the originators. Thus, there are statistical tests named after Dixon, Tukey, Wilcoxon, Scheffe, Kolmogorov, Fisher, Levene, Hotelling, Dunnett, and Bonferroni. And those are just some well-known ones. Dig into the literature, and you’ll find scores more.

It’s not just tests that get named. Bayesian statistics is a branch of statistics based on Bayes’ Theorem, formulated in the 1700s by Reverend Thomas Bayes. Kriging, the interpolation algorithm of geostatistics, was named after Danie Krige, a South African mining engineer who pioneered the field in the 1950s. The Normal distribution is also called the Gaussian distribution, after Carl Friedrich Gauss, who introduced it in 1809, and the Laplacian distribution, after Pierre-Simon Laplace, who showed in 1810 that the distribution was the basis for the central limit theorem. There are also theoretical frequency distributions named after Benford, Weibull, Rayleigh, Cauchy, Poisson, and Bernoulli.

If someone mentions a named distribution, test, or other statistical procedure, don’t panic. Nobody knows everything. Just ask what the distribution or procedure is supposed to do. If you took an introductory course in statistics and know about probability, the Normal distribution, and hypothesis testing, you’re in great shape for understanding most of the named stat terms you might run into. This type of statistical jargon could be much worse. When biologists name something after someone, they do it in Latin.

Created Words

Some statistical jargon might just as well be a foreign language because the words have no common meaning in the English language outside of statistics (or math). Examples of such words include: kurtosis, leptokurtic, platykurtic, skewness, covariance, autoregressive, variogram, logit, probit, eigenvalue, median, outlier, stationarity, winsorizing, communality, multicollinearity, and my personal favorite, homoscedasticity. If you’re at a bar and you hear any of these words being bandied around, slip quietly out the door and run for your life. Any statistician who uses these words with innocent civilians without explanation either doesn’t understand his or her audience or is a sadist. Dealing with created statistical terms is straightforward; just ask the statistician using them what they mean. Preferably ask in a foreign language just to prove the point.

Alternative Meanings

The most confusing statistical jargon just might be words in most people’s everyday vocabulary that have a very different statistical meaning. For example, when you hear the word mean, your mind has to sort out the word’s connotation. It can signify to intend, as in say what you mean. It can be used to associate, as in spring means flowers. It can refer to resources or methods, as in by any means. It can indicate character, as in she has a mean streak. It can imply exceptional skill, as in he has a mean fastball. And of course, in statistics, mean means average. If you don’t realize that some words in English have different meanings in statistics, you can get confused very quickly. I’ve had well-meaning report editors change median to medium and nonsignificant to insignificant.

Here are a few more examples:

  • bagging: to a statistician, a method for combining predictions from many data mining models; to a nonstatistician, what the cashier does with your groceries when you’re done paying.
  • blocking: to a statistician, a technique for controlling variation in ANOVA; to a nonstatistician, what the offensive line does during football season.
  • brushing: to a statistician, interactively selecting data points on an on-screen graph to access other information associated with the points; to a nonstatistician, what you do with your toothpaste and toothbrush.
  • breakdown: to a statistician, splitting data into groups to calculate descriptive statistics and correlations; to a nonstatistician, what happens to your car when you’re in a hurry to get somewhere.
  • censoring: to a statistician, data with a real but undetermined value, usually less than or greater than all other values in a dataset; to a nonstatistician, restricting free speech or removing offensive material from books and other media.
  • confidence: to a statistician, the absence of Type I errors; to a nonstatistician, ego stability.
  • discriminate: to a statistician, to classify observations with a statistical model (a good thing); to a nonstatistician, to make distinctions based on race, creed, ethnicity, age, or other category without regard to individual merit (a bad thing).
  • errors: to a statistician, differences between observed values and values predicted from a statistical model (residuals); to a nonstatistician, mistakes.
  • mode: to a statistician, the most frequently appearing number in a set of numbers; to a nonstatistician, a manner of acting, such as being in “relaxation mode.”
  • Monte Carlo: to a statistician, a simulation procedure for evaluating the properties or performance of a statistic; to a nonstatistician, the quarter of Monaco known for its resorts and casinos, or a hotel in Las Vegas.
  • Normal: to a statistician, following a Gaussian (bell-shaped) distribution; to a nonstatistician, typical, routine, sane.
  • residuals: to a statistician, differences between observed values and values predicted from a statistical model (errors); to a nonstatistician, money made by musicians and actors when their works are replayed.
  • sample: to a statistician, an individual observation or multiple observations that are part of a population; to a nonstatistician, a piece, a bit, a taste.

Don’t feel that you’re alone in the quagmire of statistical jargon. Like dialects of the English language, different statistical specialties have their own jargon and ways of expressing ideas. Data mining, time-series forecasting, quality control, nonlinear modeling, biometrics, econometrics, and geostatistics are all examples of statistical specialties that use terms not used in the other specialties. Imagine a Louisiana Cajun talking to a Pennsylvania Dutchman. They both speak dialects of English, but it might as well be Greek.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


Weapons of Math Production

In theory, if you have the free time, you can calculate any statistic you might need using nothing more than a pencil and paper. After all, it’s just matrix mathematics. With a lot of data or a complicated procedure, though, you might need a lot of free time. A generation ago, that’s how most statistics were calculated. Most people didn’t have computers, or calculators for that matter. Slide rules … maybe. Now, there is an abundance of hardware and software to ease the tedium. Having a statistician’s version of Norm Abram’s workshop to use actually makes analyzing data a lot of fun.

Whether you’re planning a career in statistics or just looking to analyze your current dataset, you’re going to need software to do the calculations. Yes, there are some people who still calculate descriptive statistics manually, but this practice is so prone to errors that it’s only applied to very small datasets. And yes, there are some people who develop their own statistical routines, usually with R, a programming language for statistics available for free under the GNU General Public License, or with matrix manipulation software like MATLAB, Maple, and Mathematica. Unless you’re a mathematical statistician developing a new statistical technique, though, you won’t need to take this approach if you don’t want to. There’s plenty of software available. All you need to know is the kind of statistical analyses you’re likely to use and your price range.

Software for General Statistics

With a few exceptions, almost all of the statistical software you’ll find is geared to the most common types of statistical analysis, including descriptive statistics, hypothesis testing, correlation and regression, and analysis of variance. Software used for statistical analysis can be grouped into five categories:

  • Web-based Calculators—Web sites that perform simple statistical calculations can be found at statpages.org. This is the low end in cost, but also in usability. You usually have to enter and edit your data manually, so it’s not really suitable for production work.
  • Spreadsheets—You probably already have a copy of Microsoft Excel or some other spreadsheet software on your computer. If you are a beginner at data analysis, you’ll find that you can accomplish most of what you want to do using spreadsheet software. Advanced data analysis may be more of an issue, though. Some statisticians advise against using spreadsheet software, particularly Excel, citing three reasons. First, Excel doesn’t do some calculations and graphs that statistical packages do. Well, of course it doesn’t. It’s a spreadsheet program that sells for less than $200 (by itself, not part of Office) compared to statistical packages that cost ten times as much. Big deal. Second, Excel’s calculated probabilities are incorrect, reportedly in the third decimal place. OK, but if you would base a decision solely on whether a probability is 0.051 instead of 0.049, you really don’t understand the nature of statistical testing (more on this in another blog). And third, Excel’s random number generators are not of research quality. Yup, so if you’re planning to do Monte Carlo simulations with Excel … well, don’t (not necessarily because your answer will be wrong as much as because some people will think it is wrong).
  • Basic Statistical Software—This category includes software that is used mainly for less sophisticated types of statistical analysis. Most can be purchased for less than about $500. Key examples include StatsDirect, InStat, Analyse-it, and Assistat.
  • Intermediate Statistical Software—This category includes software that can be used for many types of statistical analysis except some of the more sophisticated techniques like multivariate analysis. Most but not all are a single module and cost less than about $1,000. Examples include NCSS, Statistix, Costat, Origin, Prostat, Soritec, MVSP, and Simstat.
  • Major Statistical Packages—This category includes software that can be used for a variety of purposes. Most have a base module and a variety of optional add-on modules. They are usually purchased through annual licenses specifying a number of users, and cost more than about $1,000 (in some cases, way over). Some of the major packages, like SAS and SPSS, have been around since the mainframe days of the 1960s. Others, like Statistica, are products of the personal-computer era of the 1980s. Other examples include S-Plus, Stata, Systat, Minitab, and Statgraphics.

Data analysis programs typically have spreadsheet screens for data because statistical calculations use matrices, and after all, a spreadsheet is really just a matrix. They also have utilities for both data management and graphing, which are essential for any type of data analysis. Almost all statistical software has a graphical user interface (GUI), and many packages also allow you to write your own code for specialized applications. Almost all have downloadable demos, usually fully functional (at least for basic statistics) for 30 days.

To conduct an analysis with statistical software, you enter or upload your data, scrub it (a whole other discussion), then pick from the program’s menus the graphing or analysis procedure you want to run. Submenus will pop up with all the specifications and options for the procedure. So it’s quite easy to do a lot of statistical analyses with just a few mouse clicks, but you really have to understand what all those specifications and options are about.

All of the software packages have their fans, especially the major packages. SPSS was created in the 1960s by graduates of Stanford who continued development at the University of Chicago. It used to be called the Statistical Package for the Social Sciences, which is why it’s still very popular in the social sciences. SPSS was bought by IBM in 2009. SAS, formerly called the Statistical Analysis System, was developed in the early 1970s by professors at North Carolina State University. S-Plus grew out of S, a programming language developed at Bell Laboratories in the 1970s and 1980s. Minitab was created by professors at Pennsylvania State University in the 1970s from statistical software developed at the National Institute of Standards and Technology (NIST). It now focuses on Six Sigma statistical procedures for managing quality.

There is no real best statistical software. They’re all pretty good, dollar-for-dollar. A lot of what determines a user’s preference is what software is (was) available at their college or the place they work. For example, if you go (went) to Penn State, you probably think Minitab is the best. If you work at a pharmaceutical company, you probably use SAS because that’s what the entire pharmaceutical industry uses. Social scientists like to use SPSS. If you like programming your own procedures you’re probably a proponent of the R programming language for statistics.

Assuming you don’t have access to software through your school or work, you can evaluate your software needs by answering three questions:

  • How sophisticated are the statistical techniques you need to use?
  • How often would you likely need to use the software?
  • How much do you have to spend for the software?

If you are planning on doing only one analysis, see if you can use what you have. You may be able to do all your calculations in a spreadsheet program or use free software or web-based software. If you are going to do full-time statistical consulting and you can’t afford a license for a major package, bite the bullet and learn R. Another option would be to buy a basic or an intermediate package and move up as you can afford to. If you’re only going to be an occasional user, any of the statistical packages will be better than using a spreadsheet (except perhaps for dataset scrubbing), so purchase whatever you can afford.

If you aren’t acquainted with statistical software, conduct a web search or start at en.wikipedia.org/wiki/List_of_statistical_packages. Explore the web sites you find to be sure that the software has the statistical procedures you think you will be using. Almost all of the sites have free downloads, such as brochures, white papers, and demonstration software. Don’t download the demo software until you’re ready to make a decision. Most demos are good for only 30 days, after which the software won’t work even if you download a new copy.

Software for Specialized Applications

There are a few kinds of analysis you might run into that will require specialized software. For example, have you ever seen an icon plot using sparklines or Chernoff faces? How about a ternary diagram or a Piper plot? Some day you may have to produce one of these specialized graphics. Software you could look into includes SigmaPlot, Origin, AquaChem, GraphPad, EasyPlot, DeltaGraph, and Grapher.

If you ever have to do time-series analysis, you could start with some of the high-end statistical packages. Or, you could look into specialized software including Autobox, Eviews, ForecastX, and RATS. If you have to produce maps, find a GIS expert to help you. If you’re committed to doing it yourself, try Surfer. If you’re not into meteorology or geology, you probably don’t run into orientation data very often, but if you ever do, get Oriana. For critical-path scheduling, try Microsoft Project or P5, an update to Primavera Project Planner, now a product of Oracle. There’s also software for resampling statistics, control charts, ANOVA, neural networks, nonparametric statistics, power analysis, Bayesian statistics, data mining and many other specialties.

The software market changes rapidly. The big packages keep getting bigger, spawning optional modules from procedures that used to be part of the basic package. At the same time, new statistical software keeps appearing, usually for specialized applications. Spreadsheet software is also becoming more sophisticated. Introductory statistics classes are now taught with spreadsheet software; even calculators are a thing of the past. So do some research and get the software that’s best for your situation.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.
