Samples are like potato chips. You’re never satisfied with just one. Every one you take makes you want more. And you’re never sure you’ve had enough until you’ve had way too many.
Betcha Can’t Take Just One
One observation. One test sample. One subject. One measurement. One of anything isn’t that satisfying. You’ll always want more to replicate the experience, to find out if there is consistency. Maybe you take just a few. If you sense a pattern, you can build your observations into an anecdote, a story. Many statistical analyses, in fact, grow out of anecdotal evidence. You just can’t stop at the storytelling stage. Statistics are the antidotes to those anecdotes.
Politicians, preachers, and parents can get away with telling tales to illustrate points they want to make. Their followers trust them and want to believe them whether they are telling the truth or not. Other professionals, though, can’t rely on their audience having such unquestioning faith. Scientists rely on hard data to test their hypotheses. Educators need test scores so they can grade on a curve. Businessmen want to see the numbers before they spend their money (your money, not so much). So you can pretty much expect that once you start collecting data, you’re going to want more.
You know you want more data, so first you estimate how many more samples you’ll need to get that Purrfect Resolution. Say you’ve estimated that you need 1,000 samples to do a statistical analysis. You package your sampling and analysis plan into a proposal and give it to your client. One thing you can bet on is that your client won’t want to spend the money to collect that many samples. So what can you do? Here are a few suggestions:
Change the Study—Lower your confidence (1 minus the false positive error rate you’ll allow) and power (1 minus the false negative error rate you’ll allow). If you do this, expect more false positives and false negatives. You can also look for bigger effects (e.g., differences between means, size of targets, and so on). You won’t get the resolution you wanted, but it could be a good start. Also consider limiting the study area, level of detail, or analysis scope. Sometimes you can trade other project costs, like meetings and deliverables, for a few more samples.
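To see how much relaxing confidence, power, and effect size buys you, you can use the standard normal-approximation formula for a two-sided, two-sample comparison of means, n ≈ 2((z<sub>α/2</sub> + z<sub>β</sub>)/d)² per group. A minimal sketch in Python; the function name and the specific numbers are illustrative, not from the study described above:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha, power):
    """Normal-approximation sample size per group for a
    two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # confidence term
    z_beta = NormalDist().inv_cdf(power)           # power term
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Original plan: medium effect (d = 0.5), 95% confidence, 80% power
print(n_per_group(0.5, 0.05, 0.8))   # 63 per group
# Relaxed plan: larger effect (d = 0.8), 90% confidence, 70% power
print(n_per_group(0.8, 0.10, 0.7))   # 15 per group
```

Relaxing all three knobs at once cuts the requirement to roughly a quarter, which is exactly the kind of trade a budget-conscious client will ask about.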
Take Smaller Bites—Take as many samples as you can and use the information to decide what to do next. This is sometimes the aim of a pilot study. You can use the samples collected during a pilot study to estimate more precisely how many more samples you’ll need to get the statistical resolution you want. You might also be able to collect samples in phases or change the implementation schedule to accommodate your client’s budget cycle.
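One common way to use a pilot study is to plug the pilot’s standard deviation into the sample-size formula for a desired margin of error, n ≈ (zs/E)². A sketch with made-up pilot measurements (the numbers are hypothetical, not from the post):

```python
from math import ceil
from statistics import NormalDist, stdev

# Hypothetical pilot measurements
pilot = [9.8, 10.2, 10.5, 9.5, 10.0, 10.6, 9.4, 10.0]

s = stdev(pilot)                    # pilot estimate of variability
z = NormalDist().inv_cdf(0.975)     # 95% confidence
margin = 0.1                        # desired margin of error for the mean
n_needed = ceil((z * s / margin) ** 2)
print(n_needed)                     # 72
```

If the pilot variability turns out smaller than your original guess, the re-estimated n drops, and that is an easier conversation to have with the client.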
Use Supporting Data—There may be historical data available that you can use to reassess the number of samples you’ll need and even augment the samples you plan to collect (i.e., provided the quality of the historical data is appropriate). You can also consider surrogate sampling, in which you correlate the results of many inexpensive observations or measurements to the few expensive samples your client can afford.
Control Variance—If you think about it, the reason you need more samples in the first place is because you need to improve precision (not accuracy). So think harder about how you can reduce any extraneous variability in the data generation process. Standardized procedures and training of the data collectors might mitigate the need for quite a few samples.
Can you eat too many potato chips? Of course you can. It’s happened to many of us. Likewise, you can have too many samples, which presents its own set of challenges. Here are five:
Information Overload — Statistical software tends to be very efficient, but when you have tens of thousands of samples, you start to see performance slow a bit. What’s more important, though, is the inefficiency you run into when you scrub your dataset, especially if you use a lot of spreadsheet array formulas. Be patient. You can use the waiting time to read a good book.
Chasing Tails — In any dataset, you may have 5% influential observations, not to mention the outliers and errors, that you’ll have to check to determine whether they should be corrected, removed from the dataset, or left alone. This is a very time-consuming process. With a small dataset, you may have to investigate just a few samples. With a 1,000-record dataset, you may have to investigate 50 samples. This is part of why data scrubbing can represent most of the work in a data analysis project.
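A first pass at finding those tails is usually mechanical; one common screen is the Tukey fence rule (flag anything beyond 1.5 interquartile ranges outside the quartiles). A sketch with made-up readings; note the flagged points are candidates for review, not automatic deletions:

```python
import statistics

def flag_outliers(values, k=1.5):
    """Flag points outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)
    for manual review -- flagged, not automatically removed."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious record
readings = [10, 12, 11, 13, 12, 14, 11, 12, 13, 100]
print(flag_outliers(readings))  # [100]
```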
Data Intimacy — When you’re working with only a few dozen samples, you get to know each data point. You can look at plots and tables and see how individual details fit into a bigger picture. You can’t do that with a thousand data points. Sometimes you can get around this problem by dividing the data into groups and working with the groups, or analyzing a higher level of hierarchical data.
Graphic Mud — It’s tough to see patterns with only a few samples, but plotting thousands of samples can be just as perplexing. You won’t be able to use any small plots, like matrix plots. Even with full-scale plots, it will be difficult to see subtle differences in data point markers, like size, shape, and even color. Points will overwrite each other, so you won’t be able to tell whether there is one point at a graph location or a hundred points stacked on top of each other. And even the best statistical software will choke when trying to print graphs with thousands of data points. Solving this problem usually involves plotting group means or only randomly selected records from the data matrix.
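Both workarounds—plotting a random subset or plotting group means—take only a couple of lines before the data ever reach your graphics routine. A sketch with synthetic data (the dataset and group size are illustrative):

```python
import random

# Hypothetical dataset too dense to plot point-by-point
random.seed(1)  # reproducible illustration
data = [random.gauss(50, 10) for _ in range(20000)]

# Option 1: plot a random subset instead of all 20,000 points
sample_for_plot = random.sample(data, 500)

# Option 2: plot the mean of each consecutive group of 100 records
group_means = [sum(data[i:i + 100]) / 100 for i in range(0, len(data), 100)]
```

Either list can then be handed to your plotting package in place of the full data matrix.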
Meaningless Differences — Sometimes you can have too much resolution in a statistical test. If the test can detect a difference smaller than would matter in the real world, it’s probably because you used too many samples. Conduct a power analysis after statistical testing to see what minimum effect size the test could detect at your error rates; if that minimum is far below anything of practical importance, a statistically significant result may not be a meaningful one.
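That after-the-fact check is just the sample-size formula run in reverse: given n, confidence, and power, solve for the smallest standardized difference the test could detect. A sketch, again using the two-sample normal approximation (the sample sizes shown are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_effect(n_per_group, alpha=0.05, power=0.8):
    """Smallest standardized mean difference a two-sided,
    two-sample z-test reliably detects at this sample size."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(2 / n_per_group)

print(round(min_detectable_effect(50), 2))    # 0.56: a moderate effect
print(round(min_detectable_effect(5000), 2))  # 0.06: likely meaningless
```

With 5,000 samples per group, the test will flag differences far too small to matter, which is the “meaningless differences” trap in a nutshell.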
And that’s why it’s important to have about the right number of samples: enough to make progress toward your goal, but not so many that the extra effort outweighs the extra progress.
Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.
Hi there. I have a sampling question for you. I work for a tree nursery. We handle young seedlings that look identical but are actually different varieties. For quality assurance purposes we would like to test our stocks. There are some DNA tests we can perform to confirm the identity of the plants. My question is: can I pick a contamination ratio first and then use it to determine the number of samples I need to collect? For example, say I want to make sure I can detect a 5% contamination rate (25 plants in the 500 are something other than variety A). I DON’T need to detect all 25 contaminants at this point; I just need to make sure I collect enough random samples that I am assured I hit ONE of the contaminants, thus tipping me off to the problem in this lot. I could then conduct a more thorough sampling if needed or just dump the whole lot. I would appreciate your thoughts on this. Thanks.
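The commenter’s question has a direct answer via the hypergeometric distribution: the chance of missing every contaminant in n draws without replacement is C(475, n)/C(500, n), so you search for the smallest n where the complement clears your confidence target. A sketch using the numbers from the comment (the function names are mine):

```python
from math import comb

def detection_probability(lot_size, contaminated, n):
    """Chance a random sample of n plants (drawn without replacement)
    contains at least one contaminated plant."""
    return 1 - comb(lot_size - contaminated, n) / comb(lot_size, n)

def samples_needed(lot_size, contaminated, confidence=0.95):
    """Smallest n that detects the contamination with the given confidence."""
    for n in range(1, lot_size + 1):
        if detection_probability(lot_size, contaminated, n) >= confidence:
            return n

# 500-plant lot, 5% contamination (25 plants), 95% chance of hitting one
n = samples_needed(500, 25)
print(n, round(detection_probability(500, 25, n), 3))
```

For a 5% contamination rate the answer lands just under 60 samples; sampling without replacement from a finite lot needs slightly fewer draws than the with-replacement bound of 1 − 0.95ⁿ ≥ 0.95 would suggest.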