The Kibble Buffet Experiment

My three indoor cats are all seniors now, so I’m even more concerned that they have a healthy diet. I feed them both wet and dry food. They share most of two cans of wet food per day and have dry food, kibble, available all the time. Kibble is like snack time. Fortunately, they are more disciplined than I am; they are all of normal weight despite having access to all that food.

With their advanced age, I was concerned that the kibble be easy on their digestive systems. I had been feeding them Iams for the past seven years since I rescued them and they seemed content with it. Nevertheless, I went to Chewy to see if they had any alternatives for senior cats with sensitive stomachs. To say the least, I was quite surprised that Chewy had so many brands catering to what I thought was a limited feline demographic. So much for the exhaustive experiment I had envisioned; this was going to be a pilot test limited to just a few brands.

If this were a critical scientific experiment that would face peer review and replication, there would be a lot to consider in planning. But this is just a small, personal experiment between just me and my cats. We’ll all be satisfied with the results whatever they may be. So, with that said, here’s the outline of the experiment.


The population for the experiment is small; it’s just my three indoor cats – Critter, Poofy, and Magic. Critter (AKA Ritter Critter) was rescued by my daughter from outside the Ritter building at Temple University when she was a kitten in 2006. Poofy (AKA Poofygraynod) was rescued from my back yard in 2013 when (the vet thought) she was about 6. Magic (AKA Black Magic) was also rescued (with Poofy) from my back yard in 2013 when (the vet thought) she was about 2. Critter and Magic were born in the wild; Poofy might have been a stray. At the time of the experiment, Critter and Poofy were about 14 and Magic was about 10. Poofy weighs about 13 pounds, Critter weighs about 11 pounds, and Magic weighs about 8 pounds. Poofy eats mostly kibble but also some wet food. Critter eats mostly wet food (ocean whitefish is her favorite) but also some kibble. Magic eats both kibble and wet food equally, in good feral fashion.

Cat population

Because the population consists of only three individuals, and it’s their composite response that is of interest, the experiment actually involves measuring the cats’ preferences repeatedly over the course of the experiment. This is a census (rather than a survey) of the population. The sampling design is systematic, one set of measurements of kibble eaten by brand for the duration of the experiment. This is called a repeated measures design.


The phenomenon being evaluated is preference for selected brands of kibble. Each cat may have a different preference, even changing day-by-day, but only the composite preference is important because I purchase the kibble in the aggregate. Preference is measured by the amount of each brand of kibble consumed in a 24-hour period.

Research Questions

I had four research questions I wanted to answer.

  1. What kinds of kibble do my indoor cats like to eat?
  • I hypothesized that they might prefer Iams since that is the brand they had been eating for the prior seven years.
  • I hypothesized that they might like seafood best because this preference is often depicted in cat-related cliche.
  • Protein, Fat, Fiber. I hypothesized that they might prefer the highest protein content.
  • I hypothesized that they might like kibble that was smaller and more rounded so that it was easier to swallow.
  2. How much do they eat in a day? I hypothesized that they would eat less than three cups of food per day based on prior feeding patterns. I provided a total of about twice that amount during the experiment, one cup of kibble for each of the six brands each day.
  3. Do they prefer variety or will they eat the same kibble consistently? I hypothesized that they would eat a variety of the brands because I like variety in MY diet. My vet disagreed. He thought they would eat what they were most familiar with.
  4. Will a different kibble reduce their barfing? I hypothesized that there would be no difference because they were already eating Iams kibble for cats with sensitive stomachs.

Kibble Brands

I was surprised by how many different brands there are of dry food for senior cats with sensitive stomachs available from just one vendor (Chewy). I decided to test just six because of cost and test logistics. They were:

Iams – because that’s the brand I had been feeding the cats for the last seven years; it was my “control group” brand. It is the least expensive of the brands and consists primarily of chicken and turkey, corn, rice, and oats.

Ingredients Iams

Purina – because I wanted a well-known national brand available in supermarkets. I selected two Purina products, Focus and True Nature, to see if there was a difference between kibble formulations from the same company. They have the largest kibble, high protein, high calories, and tend to be more expensive. Focus is chicken and turkey flavored and contains rice, oats, and barley. True Nature is salmon and chicken flavored, and is grain free.

Ingredients Focus

Ingredients True Nature

Hills Science Diet – because I wanted a well-known, highly-rated, vet-recommended brand sold mostly in pet stores. It is high density and high calorie but lower protein than the other brands. It is chicken flavored and contains corn, soy, and oats.

Ingredients Hills

Halo – Perhaps the first “holistic” cat food, having been introduced in the 1980s, it purports to be ultra-digestible because of its use of fresh meats, vitamins, probiotics, and other healthy ingredients. It is seafood flavored and contains oats, soy, and barley.

Ingredients Halo

I and Love and You – A newer formulation of holistic ingredients, it is grain-free, includes prebiotics and probiotics for healthy digestion, and has the longest ingredient list by far. It contains seafood and chicken/turkey.

Ingredients I Love You

Experimental Procedure

The experimental set up consisted of six paper bowls, one for each brand of kibble. The positions of the bowls were randomized so that the cats wouldn’t associate a certain kibble brand with a position. Every 24 hours, at 8 PM, the bowls were filled with one cup of kibble and weighed. (They still got their can of wet food at 5 AM when they wake me up.) The cats were then allowed to eat the kibble as they wanted for 24 hours. It was clear from the beginning of the experiment that the cats did prefer certain brands, though they would try others.
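The daily shuffle of bowl positions can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure actually used in the experiment. The `assign_positions` helper and the seed are assumptions made for the example.

```python
import random

# The six kibbles tested, used here only as labels.
BRANDS = ["Iams", "Purina Focus", "Purina True Nature",
          "Hills Science Diet", "Halo", "I and Love and You"]

def assign_positions(brands, rng):
    """Return a dict mapping bowl position (1..n) to a brand, in random order."""
    shuffled = list(brands)   # copy so the original list is untouched
    rng.shuffle(shuffled)     # in-place random permutation
    return {pos: brand for pos, brand in enumerate(shuffled, start=1)}

rng = random.Random(2020)     # hypothetical seed, only so the example repeats
layout = assign_positions(BRANDS, rng)
for pos, brand in layout.items():
    print(pos, brand)
```

Calling `assign_positions` once per day gives each brand a fresh position, so a cat’s habit of visiting a favorite spot doesn’t get mistaken for a brand preference.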


At the end of 24 hours, the bowls were reweighed. Remaining kibble was transferred to a bucket to be fed to my three outdoor feral cats, who will eat anything. The bowls were then filled and weighed again, and placed in their new randomized position.

Data recorded each day included:

  • Day of experiment and time
  • Kibble brand
  • Bowl position
  • Weight of kibble not eaten
  • Weight of kibble provided for the next day

An informal inspection of the house was also conducted to identify any barfs that may have occurred.


Planned Analysis

The dependent variable for the analysis was the weight of kibble eaten. The independent variable was the brand of kibble. The weight of kibble eaten was calculated as:

Weight of kibble eaten = Weight of kibble provided – Weight of kibble left over
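As a small sketch of that calculation, assuming weights are recorded in grams, the daily bookkeeping might look like this. The brand labels and weights below are made up for illustration; they are not the experiment’s data.

```python
def weight_eaten(provided_g, left_over_g):
    """Weight of kibble eaten = weight provided - weight left over (grams)."""
    if left_over_g > provided_g:
        raise ValueError("left-over weight exceeds weight provided")
    return provided_g - left_over_g

# Hypothetical weights for one day, in grams: (provided, left over).
day = {"Brand A": (100.0, 60.0), "Brand B": (95.0, 90.0)}
eaten = {brand: weight_eaten(p, left) for brand, (p, left) in day.items()}
print(eaten)
```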

The position of the bowls was a blocking factor used to control extraneous variance. The day-of-the-experiment was a repeated-measures factor. This design is a two-way repeated-measures Analysis of Variance (ANOVA).

Prior to conducting any statistical testing, an exploratory analysis was planned involving calculating descriptive statistics and constructing graphs.

Depending on the results of the exploratory analysis, global ANOVA tests were planned for detecting differences between the brands, with the effects of bowl position and day of the experiment held constant. A priori tests were also planned to detect any differences between individual brands and the control brand, Iams.
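To show the arithmetic behind a global test of this kind, here is a minimal sketch of an ordinary two-way ANOVA without replication (brand by day). It is not the full repeated-measures model described above, and the numbers are invented. The function name `two_way_anova` is an assumption for the example. Turning the F ratios into p-values requires an F distribution table or a statistics package, which is left out here.

```python
from statistics import mean

def two_way_anova(table):
    """table[i][j] = grams eaten of brand i on day j (one value per cell).

    Returns the F ratios for brand (rows) and day (columns).
    """
    a, b = len(table), len(table[0])
    grand = mean(x for row in table for x in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(table[i][j] for i in range(a)) for j in range(b)]

    # Partition the total sum of squares into brand, day, and error parts.
    ss_rows = b * sum((m - grand) ** 2 for m in row_means)
    ss_cols = a * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    df_rows, df_cols = a - 1, b - 1
    ms_error = ss_error / (df_rows * df_cols)
    return (ss_rows / df_rows) / ms_error, (ss_cols / df_cols) / ms_error

# Made-up grams eaten: 3 brands (rows) over 4 days (columns).
eaten_table = [[50, 55, 48, 52],
               [10, 8, 12, 11],
               [22, 20, 19, 24]]
f_brand, f_day = two_way_anova(eaten_table)
print(round(f_brand, 1), round(f_day, 2))
```

With data like these, the brand F ratio is very large and the day F ratio is below 1, which is the pattern you would expect when brand matters and the day does not.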



Though not what I expected, it was obvious after a couple of days that the cats had a clear preference for Hills Science Diet. Consequently, I ended the experiment after two weeks.


Descriptive Summary

The following table summarizes the amount of each brand that the cats ate over the two-week experiment.

Table results

Statistical Testing

While the design of the experiment is technically a two-way repeated-measures Analysis of Variance (ANOVA), the large differences in brands and the lack of differences in bowl position and day of the experiment made calculating the model unnecessary. This solved the problem of my not having appropriate software to conduct that part of the analysis. Instead, the following sections describe two-way ANOVA results for the brand versus bowl position and the brand versus day of the experiment models. Statistical comparisons of the amounts of each brand eaten are also summarized, with emphasis on differences between each brand and the “control” brand, Iams.


Table Kibble by Day

Bar chart day

The global ANOVA test for the brands was significant when the effect of the day of the experiment was controlled for. The day of the experiment had no impact on the amount of kibble eaten. No surprise there.

ANOVA Brands Days


Table Kibble by Position

Bar chart bowl position

The global ANOVA test for the brands was significant when the effect of bowl position was controlled for. Bowl position had no impact on the amount of kibble eaten. Again, that’s not a big surprise, although you can never be sure of a hypothesis when you’re working with cats.

ANOVA Brands Position


The following table summarizes the statistical tests between the brands. The important tests are the comparisons between each brand and the control brand, Iams, highlighted in yellow. The only significant tests were the comparisons between Hills Science Diet versus Iams and Purina True Nature versus Iams. This means that my cats like Hills and True Nature a lot more than what I’ve been feeding them for the last few years. Time to switch brands. I could have done worse had I fed them one of the other brands instead of Iams, but not significantly so.

ANOVA Post hoc



First, things don’t always go the way you think they will. This is true in any experiment … and life in general. There was really no need to conduct the sophisticated ANOVA that I had planned, so I didn’t bother. Oh well, next time.

Second, my three cats eat about 135 grams (4.8 oz) of kibble in a day. Now I can use the automatic reorder feature on Chewy and save a few dollars.

Third, you know how you tend to eat a lot more after you come home with new groceries? Cats do it too. They ate a lot more on the first day of the experiment when they had five new brands of kibble to taste.

Fourth, I randomized bowl position as a way of controlling extraneous variation for the ANOVA. It seems that the middle positions had more kibble eaten per bowl than the outer positions. I have no explanation for this pattern and the cats aren’t talking.

Fifth, I can’t say whether any of the new brands reduced barfing because I had no baseline. My daughter, who doesn’t like cats, led me to believe that they barf about every twenty minutes. Still, there were only three barfs during the two-week experiment, which I considered not to be so bad.

Sixth, my cats clearly prefer Hill’s Science Diet as their kibble of choice. They don’t seem to want a variety of brands. I don’t know why my cats preferred Hills. It doesn’t appear to be the flavor, texture, or protein content since other brands had different combinations of these factors. If you look at reviews of other brands of kibble, you’ll find people who swear that their cat(s) likes the-brand-that-they-buy best. They’re probably right. Every cat or population of cats may have different tastes. I, myself, like pineapple on my pizza.

Finally, my experiment wasn’t large or sophisticated enough to isolate and analyze hypotheses about ingredients. If I could figure out why cats prefer one brand of kibble over another, though, I could probably get a job with Purina.


Further Research

It’s always a good practice to describe additional research that could be done to make the world a better place. Who knows if somebody with money might see it and fund your further research? In this case, further research might involve testing different brands, especially if the brands could be selected to explore a variety of flavors, kibble shapes, sizes, densities, and types and concentrations of protein. Finally, I would recommend using many, many more cats if you can. My daughter won’t let me have any more.

So, if you find yourself with some time on your hands, consider conducting your own experiment on your cats. You might be surprised at what you learn. It’ll be fun.

More cats

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data analysis at, or other online booksellers.


What to Look for in Data – Part 2

What to Look for in Data – Part 1 discusses how to explore data snapshots, population characteristics, and changes. Part 2 looks at how to explore patterns, trends, and anomalies. There are many different types of patterns, trends, and anomalies, but graphs are always the best first place to look.

Patterns and Trends

There are at least ten types of data relationships – direct, feedback, common, mediated, stimulated, suppressed, inverse, threshold, and complex – and of course spurious relationships. They can all produce different patterns and trends, or no recognizable arrangement at all.

There are four patterns to look for:

  • Shocks
  • Steps
  • Shifts
  • Cycles

Shocks are seemingly random excursions far from the main body of data. They are outliers but they often reoccur, sometimes in a similar way suggesting a common, though sporadic cause. Some shocks may be attributed to an intermittent malfunction in the measurement instrument. Sometimes they occur in pairs, one in the positive direction and another of similar size in the negative direction. This is often because of missed reporting dates for business data.

Steps are periodic increases or decreases in the body of the data. Steps progress in the same direction because they reflect a progressive change in conditions. If the steps are small enough, they can appear to be, and be analyzed as, a linear trend.

Shifts are increases and/or decreases in the body of the data like steps, but shifts tend to be longer than steps and don’t necessarily progress in the same direction. Shifts reflect occasional changes in conditions. The changes may remain or revert to the previous conditions, making them more difficult to analyze with linear models.

Cycles are increases and decreases in the body of the data that usually appear as a waveform having fairly consistent amplitudes and frequencies. Cycles reflect periodic changes in conditions, often associated with time, such as daily or seasonal cycles. Cycles cannot be analyzed effectively with linear models. Sometimes different cycles add together making them more difficult to recognize and analyze.

Simple trends can be easier to identify because they are more familiar to most data analysts. Again, graphs are the best place to look for trends.

Linear trends are easy to see; the data form a line. Curvilinear trends can be more difficult to recognize because they don’t follow a set path. With some experience and intuition, however, they can be identified. Nonlinear trends look similar to curvilinear trends but they require more complicated nonlinear models to analyze. Curvilinear trends can be analyzed with linear models with the use of transformations.
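As a minimal sketch of what “linear models with the use of transformations” can mean, the example below fits a straight line to the logarithm of made-up exponential data. The `fit_line` helper and the data are assumptions for illustration. A log transform is only one of several transformations you might try.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up curvilinear data: y = 2 * exp(0.5 * x), with no noise.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]

# A straight line fits log(y) because log(y) = log(2) + 0.5 * x.
slope, intercept = fit_line(xs, [math.log(y) for y in ys])
print(round(slope, 3), round(math.exp(intercept), 3))
```

The fitted slope recovers the growth rate and the back-transformed intercept recovers the starting value, so the curved relationship has been handled with nothing more than a linear fit.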

There are also more complex trends involving different dimensions, including:


  • Temporal
  • Spatial
  • Categorical
  • Hidden
  • Multivariate

Temporal Trends can be more difficult to identify because time-series data can be combinations of shocks, steps, shifts, cycles, and linear and curvilinear trends. The effects may be seasonal, superimposed on each other within a given time period, or spread over many different time periods. Confounded effects are often impossible to separate, especially if the data record is short or the sampled intervals are irregular or too large.

time trends swc page 276

Spatial Trends present a different twist. Time is one-dimensional, at least as we now know it. Distance can be one-, two-, or three-dimensional. Distance can be in a straight line (“as the crow flies”) or along a path (such as driving distance). Defining the location of a unique point on a two-dimensional surface (i.e., a plane) requires at least two variables. The variables can represent coordinates (northing/easting, latitude/longitude) or distance and direction from a fixed starting point. At least three variables are needed to define a unique point location in a three-dimensional volume, so a variable for depth (or height) must be added to the location coordinates.

Looking for spatial patterns involves interpolation of geographic data using one of several available algorithms, like moving averages, inverse distances, or geostatistics.
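For a feel of how one of these algorithms works, here is a minimal sketch of inverse-distance weighting on a plane. The `idw` function, the `power` of 2, and the sample points are assumptions chosen for illustration. Real mapping software offers many more options.

```python
import math

def idw(points, target, power=2):
    """Inverse-distance-weighted estimate at target = (x, y).

    points is a list of (x, y, value) tuples.
    """
    num = den = 0.0
    for x, y, value in points:
        d = math.hypot(x - target[0], y - target[1])  # straight-line distance
        if d == 0:
            return value        # target sits exactly on a sample point
        w = 1.0 / d ** power    # closer points get larger weights
        num += w * value
        den += w
    return num / den

# Made-up measurements at the corners of a 10 x 10 square.
samples = [(0, 0, 10.0), (10, 0, 20.0), (0, 10, 30.0), (10, 10, 40.0)]
center = idw(samples, (5, 5))
near_corner = idw(samples, (1, 1))
print(center, near_corner)
```

At the center of the square all four samples are equally far away, so the estimate is their plain average. Near a corner the estimate is pulled toward that corner’s value.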

Categorical Trends are no more difficult to identify than any other trend, except that you have to break out categories to do it, which can be a lot of work. One thing you might see when analyzing categories is Simpson’s paradox. The paradox occurs when trends appear in categories that are different from the overall group. Hidden Trends are trends that appear only in categories and not the overall group. You may be able to detect linear trends in categories without graphs if you have enough data in the categories to calculate correlation coefficients within each.
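Here is a small sketch of Simpson’s paradox using correlation coefficients calculated within categories and for the pooled data. The two categories and their values are invented for the example, and the `correlation` helper is just the usual Pearson formula written out.

```python
def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up data: within each category, y falls as x rises.
groups = {
    "A": ([1, 2, 3, 4], [10, 9, 8, 7]),
    "B": ([6, 7, 8, 9], [20, 19, 18, 17]),
}
within = {name: correlation(xs, ys) for name, (xs, ys) in groups.items()}

# Pooled together, the trend runs the other way.
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
pooled = correlation(all_x, all_y)
print(within, round(pooled, 2))
```

Each category has a strong negative correlation while the pooled data have a positive one, which is why it pays to break out categories before trusting an overall trend.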

Multivariate Trends add a layer of complexity to most trends, which are bivariate. Still, you look for the same things, patterns and trends, only you have to examine at least one additional dimension. The extra dimension may be an additional axis or some other way of representing data, like icon type, size, or color.



Sometimes the most interesting revelations you can garner from a dataset are the ways that it doesn’t fit expectations. Three things to look for are:


  • Censoring
  • Heteroskedasticity
  • Outliers

Censoring is when a measurement is recorded as <value or >value, indicating that the measurement instrument was unable to quantify the real value. For example, the real value may be outside the range of a meter, counts can’t be approximated because there are too many or too few, or a time can only be estimated as before or after. Censoring is easy to detect in a dataset because censored values should be qualified with < or >.

Heteroskedasticity is when the variability in a variable is not uniform across its range. This is important because homoskedasticity (the opposite of heteroskedasticity) is assumed by probability statements in parametric statistics. Look for differing thicknesses in plotted data.
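A graph is the best check, but a crude numerical version of “differing thicknesses” is to compare the spread of the data in the lower and upper halves of the range. The sketch below does that with made-up numbers. The `spread_by_half` helper is an assumption for illustration, and if the data also had a trend you would compare the spread of residuals rather than raw values.

```python
from statistics import stdev

def spread_by_half(xs, ys):
    """Standard deviation of y for the lower and upper halves of x."""
    pairs = sorted(zip(xs, ys))
    mid = len(pairs) // 2
    low = [y for _, y in pairs[:mid]]
    high = [y for _, y in pairs[mid:]]
    return stdev(low), stdev(high)

# Made-up data whose scatter widens as x grows.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 11, 10, 11, 5, 16, 3, 18]
sd_low, sd_high = spread_by_half(xs, ys)
print(round(sd_low, 2), round(sd_high, 2))
```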



Influential observations and outliers are the data points that don’t fit the overall trends and patterns. Finding anomalies isn’t that difficult; deciding why they are anomalous and what to do with them are the really tough parts. Here are some examples of the types of outliers to look for.

How and Where to Look

That’s a lot of information to take in and remember, so here’s a summary you can refer to in the future if you ever need it.

summary table for where to look

And when you’re done, be sure to document your results so others can follow what you did.




Some activities are instinctive. A baby doesn’t need to be taught how to suckle. Most people can use an escalator, operate an elevator, and open a door instinctively. The same isn’t true of playing a guitar, driving a car, or analyzing data.

When faced with a new dataset, the first thing to consider is the objective you, your boss, or your client have in analyzing the dataset.


How you analyze data will depend in part on your objective. Consider these four possibilities; three are comparatively easy and one is a relative challenge.

  • Conduct a Specific Analysis – Your client only wants you to conduct a specific analysis, perhaps like descriptive statistics or a statistical test between two groups. No problem, just conduct the analysis. There’s no need to go further. That’s easy.
  • Answer a Specific Question – Some clients only want one thing — answer a specific question. Maybe it’s something like “is my water safe to drink” or “is traffic on my street worse on Wednesdays.” This will require more thought and perhaps some experience, but again, you have a specific direction to go in. That makes it easier.
  • Address a General Need – Projects with general goals often involve model building. You’ll have to establish whether they need a single forecast, map or model, or a tool that can be used again in the future. This will require quite a bit of thought and experience but at least you know what you need to do and where you need to end up. Not easy but straightforward.
  • Explore the Unknown – Every once in a while, a client will have nothing specific in mind, but will want to know whatever can be determined from the dataset. This is a challenge because there’s no guidance for where to start or where to finish. This blog will help you address this objective.

If your client is not clear about their objective, start at the very end. Ask what decisions will need to be made based on the results of your analysis. Ask what kind of outputs would be appropriate – a report, an infographic, a spreadsheet file, a presentation, or an application. If they have no expectations, it’s time to explore.

Got data?

Scrubbing your data will make you familiar with what you have. That’s why it’s a good idea to know your objective first. There are many things you can do to scrub your data but the first thing is to put it into a matrix. Statistical analyses all begin with matrices. The form of the matrix isn’t always the same, but most commonly, the matrix has columns that represent variables (e.g., metrics, measurements) and rows that represent observations (e.g., individuals, students, patients, sample units, or dates). Data on the variables for each observation go into the cells. This is usually done with spreadsheet software.

Data scrubbing can be cursory or exhaustive. Assuming the data are already available in electronic form, you’ll still have to achieve two goals – getting the numbers right and getting the right numbers.

Getting the numbers right requires correcting three types of data errors:

  • Alphanumeric substitution, which involves mixing letters and numbers (e.g., 0 and o or O, 1 and l, 5 and S, 6 and b), dropped or added digits, spelling mistakes in text fields that will be sorted or filtered, and random errors.
  • Specification errors involve bad data generation, perhaps attributable to recording mistakes, uncalibrated equipment, lab mistakes, or incorrect sample IDs and aliases.
  • Inappropriate Data Formats, such as extra columns and rows, inconsistent use of ND, NA, or NR flags, and the inappropriate presence of 0s versus blanks.
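As one small illustration of fixing alphanumeric substitutions, the sketch below swaps look-alike letters for digits in a field that is known to be numeric. The `repair_number` helper is hypothetical, and the mapping covers only the look-alikes listed above. Anything it can’t repair is returned as `None` so it can be checked by hand.

```python
# Letters commonly typed or scanned in place of digits.
LOOKALIKES = str.maketrans({"o": "0", "O": "0", "l": "1", "S": "5", "b": "6"})

def repair_number(text):
    """Swap look-alike letters for digits; return a float, or None if still not numeric."""
    candidate = text.strip().translate(LOOKALIKES)
    try:
        return float(candidate)
    except ValueError:
        return None

repaired = [repair_number(v) for v in ["1O5", "l2.5", "S0", "n/a"]]
print(repaired)
```

A blanket replacement like this should only be applied to columns that are supposed to hold numbers; in a text field it would damage legitimate words.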

Getting the right numbers requires addressing a variety of data issues:

  • Variables and phenomena. Are the variables sufficient to explore the phenomena in question?
  • Variable scales. Review the measurement scales of the variables so you know what analyses might be applicable to the data. Also, look for nominal and ordinal scale variables to consider how you might segment the data.
  • Representative sample. Considering the population being explored, does the sample appear to be representative?
  • Replicates. If there are replicate or other quality control samples, they should be removed from the analysis appropriately.
  • Censored data. If you have censored data (i.e., unquantified data above or below some limit), you can recode the data as some fraction of the limit, but not zero.
  • Missing data. If you have missing data, they should be recoded as blanks or treated with another accepted procedure for missing data.
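The last two items can be sketched in a few lines. This example recodes below-limit values as a fraction of the limit and leaves missing values as missing rather than zero. The `recode` helper, the choice of one half as the fraction, and the `NA` flag are assumptions for illustration. Values flagged with > are not handled here.

```python
def recode(cell, fraction=0.5):
    """Recode one raw cell: '<5' -> fraction * 5, '' or 'NA' -> None, else float."""
    text = cell.strip()
    if text in ("", "NA"):
        return None                          # missing stays missing, not zero
    if text.startswith("<"):
        return fraction * float(text[1:])    # below-limit value as a fraction of the limit
    return float(text)

# Made-up raw cells from one column of a data matrix.
raw = ["3.2", "<1.0", "", "NA", "4.8"]
clean = [recode(c) for c in raw]
print(clean)
```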

Data scrubbing can consume a substantial amount of time, even more than the statistical calculations.

What To Look For

If statistics wasn’t your major in college or you’re straight out of college and new to applied statistics, you might wonder where to start looking at a dataset. Here are five places to consider looking.

  • Snapshot
  • Population or Sample Characteristics
  • Change
  • Trends and Patterns
  • Anomalies

Start with the entire dataset. Look at the highest levels of grouping variables. Divide and aggregate groupings later after you have a feel for the global situation. The reason for this is that the number of possible combinations of variables and levels of grouping variables can be overwhelmingly large, each one being an analysis in itself. Like peeling an onion, explore one layer of data at a time until you get to the core.


What does the data look like at one point? Usually it’s at the same point in time but it could also be common conditions, like after a specific business activity, or at a certain temperature and pressure. What might you do?

Snapshots aren’t difficult. You just decide where you want a snapshot and record all the variable values at that point. There are no descriptive statistics, graphs, or tests unless you decide to subdivide the data later. The only challenge is deciding whether taking a snapshot makes any sense for exploring the data.

The only thing you look for in a snapshot is something unexpected or unusual that might direct further analysis. It can also be used as a baseline to evaluate change.

Population Characteristics

It’s always a good idea to know everything you can about the populations you are exploring. Here’s what you might do.

This is a no-brainer; calculate descriptive statistics. Here’s a summary of what you might look at. It’s based on the measurement scale of the variable you are assessing.

table 1

For grouping (nominal scale) variables, look at the frequencies of the groups. You’ll want to know if there are enough observations in each group to break them out for further analysis. For progression (continuous) scales, look at the median and the mean. If they’re close, the frequency distribution is probably symmetrical. You can confirm this by looking at a histogram or the skewness. If the standard deviation divided by the mean (coefficient of variation) is over 1, the distribution may be lognormal, or at least, asymmetrical. Quartiles and deciles will support this finding. Look at the measures of central tendency and dispersion. If the dispersion is relatively large, statistical testing may be problematical.
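Here is a minimal sketch of those checks for a continuous variable, using made-up, right-skewed data. The `describe` helper is an assumption for the example; it simply gathers the mean, median, standard deviation, and coefficient of variation in one place.

```python
from statistics import mean, median, stdev

def describe(values):
    """Mean, median, standard deviation, and coefficient of variation."""
    m, sd = mean(values), stdev(values)
    return {"mean": m, "median": median(values), "sd": sd, "cv": sd / m}

# Made-up, right-skewed measurements.
data = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]
summary = describe(data)
print(summary)
```

In this example the mean is well above the median and the coefficient of variation is over 1, both of which hint at an asymmetrical distribution that should be confirmed with a histogram.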

Graphs are also a good way, in my mind the best way, to explore population characteristics. Never calculate a statistic without looking at its visual representation in a graph, and there are many types of graphs that will let you do that.

table 2

What you look for in a graph depends on what the graph is supposed to show – distribution, mixtures, properties, or relationships. There are other things you might look for but here are a few things to start with.

For distribution graphs (box plots, histograms, dot plots, stem-leaf diagrams, Q-Q plots, rose diagrams, and probability plots), look for symmetry. That will separate many theoretical distributions, say a normal distribution (symmetrical) from a lognormal distribution (asymmetrical). This will be useful information if you do any statistical testing later.

For mixture graphs (pie charts, rose diagrams, and ternary plots), look for imbalance. If you have some segments that are very large and others very small, there may be common and unique themes to the mix to explore. Maybe the unique segments can be combined. This will be useful information if you do break out subgroups later.

For properties graphs (bar charts, area charts, line charts, candlestick charts, control charts, means plots, deviation plots, spread plots, matrix plots, maps, block diagrams, and rose diagrams), look for the unexpected. Are the central tendency and dispersion what you might expect? Where are big deviations?

For relationship graphs (icon plots, 2D scatter plots, contour plots, bubble plots, 3D scatter plots, surface plots, and multivariable plots), look for linearity. You might find linear or curvilinear trends, repeating cycles, one-time shifts, continuing steps, periodic shocks, or just random points. This is the prelude for looking for more detailed patterns.


Change usually refers to differences between time periods but, like snapshots, it could also refer to some common conditions. Change can be difficult, or at least complicated, to analyze because you must first calculate the changes you want to explore. When calculating changes, be sure the intervals of the change are consistent. But after that, what might you do?

First, look for very large, negative or positive changes. Are the percentages of change consistent for all variables? What might be some reasons for the changes?

Calculate the mean and median changes. If the indicators of central tendency are not near zero, you might have a trend. Verify the possibility by plotting the change data. You might even consider conducting a statistical test to confirm that the change is different from zero.
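The sketch below walks through those steps on a made-up series measured at equal intervals: calculate the changes, look at their mean and median, and compute a one-sample t statistic against zero. The helper names and the data are assumptions for illustration. Judging the t statistic still requires a t table or a statistics package.

```python
from statistics import mean, median, stdev

def changes(series):
    """Differences between consecutive, equally spaced observations."""
    return [b - a for a, b in zip(series, series[1:])]

def t_statistic(values, mu=0.0):
    """One-sample t statistic for the hypothesis that the mean equals mu."""
    n = len(values)
    return (mean(values) - mu) / (stdev(values) / n ** 0.5)

# Made-up values with an upward drift.
series = [100, 103, 104, 108, 109, 113, 115, 118]
deltas = changes(series)
t_value = t_statistic(deltas)
print(deltas, round(mean(deltas), 2), median(deltas), round(t_value, 2))
```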

If you do think you have a trend or pattern, there are quite a few things to look for. This is what What to Look for in Data – Part 2 is about. 




Part 3 of Dare to Compare shows how one-population statistical tests are conducted. Part 4 extends these concepts to two-population tests.

To review, this flowchart summarizes the process of statistical testing.

First, you PLAN the comparison by understanding the populations from which you will take a representative sample of individuals and on which you will measure the phenomenon. Then you assess the frequency distributions of the measurements to see if they approximate a Normal distribution.

Second, you TEST the measurements by considering the test parameters, the type of test, the hypotheses, the test dimensionality, degrees of freedom, and violations of assumptions.

Third, you review the RESULTS by setting the confidence, determining the effect size and power of the test, and assessing the significance and meaningfulness of the test.

Now imagine this.

You’re a sophomore statistics major at Faber College and you need to sign up for the dreaded STATS 102 class. The class is taught in the Fall and the Spring by two different instructors (Dr. Statisticus and Prof. Modearity) as either three one-hour sessions on Mondays, Wednesdays, and Fridays, or as two hour-and-a-half sessions on Tuesdays and Thursdays. You wonder if it makes a difference which class you take. Having completed STATS 101, you know everything there is to know about statistics, so you get the grades from the classes that were taught last year. Here are the data.

[Table 4-1: last year’s STATS 102 grades]

What class should you take to get the highest grade? Dr. Statisticus gave out the highest grades in the Fall; Prof. Modearity gave out a higher grade in the Spring. On the other side of the coin, only one person flunked (grade below 75) Dr. Statisticus’ classes but six people flunked Prof. Modearity’s classes. Three students flunked in the Fall while four students flunked in the Spring. Two people flunked TuTh classes and five people flunked MWF classes. This is complicated.

Looking at the averages, you think that taking Dr. Statisticus’ Tuesday-Thursday class in the fall would be your best bet. However, is a two or three point difference worth the class conflicts and scheduling hassles you might have? Does it really matter?

[Table 4-2: average grades by semester, instructor, and class days]

Maybe it’s time for some statistical testing? But these would be two-population tests because you have to compare two semesters, two instructors, and two class lengths.

Two Population t-Tests

In a two-population test, you compare the average of the measurements in the first population to the average of the second population, using the formula:

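t = (mean1 – mean2) / √( [((n1 – 1) × s1² + (n2 – 1) × s2²) / (n1 + n2 – 2)] × [(1 / n1) + (1 / n2)] )

where mean, s², and n are the average, the variance, and the number of measurements for each of the two populations.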

This is a bit more complicated than the formula for a one-population test because you can have different standard deviations and different numbers of measurements in the two populations.

Here’s what’s happening. The numerator (top part of the formula) is the same in both t-test formulas. The leftmost term in the denominator calculates a weighted average of the variances, called a pooled variance.

Pooled variance = ((n1 – 1) × s1² + (n2 – 1) × s2²) / (n1 + n2 – 2)

If the number of measurements taken of the two populations is the same, the test design is said to be balanced. If the variances of the measurements in the two populations are the same, the leftmost term in the denominator reduces to s². So, the formula for a balanced two-population t-test with equal variances is:

t = (mean1 – mean2) / √(s² × (2 / n))

Much simpler, but not as useful as the more complicated formula. You might be able to control the number of samples from the populations but you can’t control the variances.
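As a rough check on that simplification, here is a minimal Python sketch. The function names `pooled_t` and `balanced_t` and the numbers fed to them are made up for illustration, and the sketch assumes the pooled-variance form of the two-population formula described above.

```python
import math

def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Two-population t value using a pooled (weighted average) variance."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def balanced_t(mean1, mean2, var, n):
    """Simplified form for equal sample sizes and equal variances."""
    return (mean1 - mean2) / math.sqrt(var * 2 / n)

# Hypothetical values: two samples of 20 measurements with the same variance.
general = pooled_t(85.0, 82.0, 40.0, 40.0, 20, 20)
simple = balanced_t(85.0, 82.0, 40.0, 20)
print(round(general, 3), round(simple, 3))
```

With a balanced design and equal variances, both functions should return the same t value, which is all the simplified formula is saying.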

Once you calculate a t value, the rest of the test is similar to a one-population test. You compare the calculated t to a t-value from a table or other reference for the appropriate number of tails, the confidence (1 – α), and the degrees of freedom (for a two-population test with a pooled variance, the total number of measurements in the two samples minus 2).

If the calculated t value is larger than the table t value, the test is SIGNIFICANT, meaning that the means are statistically different. If the table t value is larger than the calculated t value, the test is NOT SIGNIFICANT, meaning that the means are statistically the same.

[Figure: nondirectional two-population test, not significant]

[Figure: nondirectional two-population test, significant]


Back to the example. You want to compare the differences between semesters, instructors, and class days. You have no expectations for what the best semester, instructor, or class day would be. To be conservative, you’ll accept a false positive rate (i.e., 1-confidence, α) of 0.05. Your null hypotheses are:

μ(Fall Semester) = μ(Spring Semester)
μ(Dr. Statisticus) = μ(Prof. Modearity)
μ(MWF) = μ(TuTh)

Now for some calculations, first the semesters.

X̄(Fall Semester) = 84.0
X̄(Spring Semester) = 83.5
N(Fall Semester) = 33
N(Spring Semester) = 35
S²(Fall Semester) = 49.7 (S = 7.05)
S²(Spring Semester) = 41.7 (S = 6.46)

[Equation 4-5: the two-population t formula applied to the semester data]

And the tabled value is:

t(2-tailed, α = 0.05, 66 degrees of freedom) = 1.997

You can do these calculations in Excel with the T.TEST function, where type=3 is a t-test for two samples with unequal variances. There are also a few online sites for the calculations; this graphic was produced from one of them.

[Figure: t-test results for the semesters]

So there is no statistically significant difference between the Fall semester classes and the Spring semester classes.

Now for the instructors:

X̄(Dr. Statisticus) = 85.4
X̄(Prof. Modearity) = 82.0
N(Dr. Statisticus) = 35
N(Prof. Modearity) = 33
S²(Dr. Statisticus) = 37.5 (S = 6.12)
S²(Prof. Modearity) = 48.5 (S = 6.96)

[Equation 4-6: the two-population t formula applied to the instructor data]

And the tabled value is:

t(2-tailed, α = 0.05, 66 degrees of freedom) = 1.996

So there is a statistically significant difference between instructors. Dr. Statisticus gives higher grades than Prof. Modearity.


[Figure: t-test results for the instructors]

Now for the days of the week:

X̄(MWF) = 82.4
X̄(TuTh) = 85.2
N(MWF) = 36
N(TuTh) = 32
S²(MWF) = 47.8 (S = 6.91)
S²(TuTh) = 39.4 (S = 6.28)

[Equation 4-7: the two-population t formula applied to the class-day data]

So there is no statistically significant difference between the one-hour classes on Mondays, Wednesdays, and Fridays and the hour-and-a-half classes on Tuesdays and Thursdays.


[Figure: t-test results for the class days]

Here is a summary of the three tests.

[Table 4-5: summary of the three tests]
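If you would rather let a computer do the arithmetic, the three comparisons can be sketched in a few lines of Python. This is only an illustration: it assumes the pooled-variance form of the two-population formula, the helper name `pooled_t` is made up, and the means, variances, and sample sizes are the rounded values quoted above. A calculation that uses unequal variances or unrounded data may differ in the decimals.

```python
import math

def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Two-population t value using a pooled (weighted average) variance."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Summary statistics quoted in the text: (mean, variance, n) for each group.
comparisons = {
    "semesters":   ((84.0, 49.7, 33), (83.5, 41.7, 35)),
    "instructors": ((85.4, 37.5, 35), (82.0, 48.5, 33)),
    "days":        ((82.4, 47.8, 36), (85.2, 39.4, 32)),
}

T_TABLE = 1.996  # two-tailed table value quoted in the text for 66 degrees of freedom

results = {}
for name, ((m1, v1, n1), (m2, v2, n2)) in comparisons.items():
    t = pooled_t(m1, m2, v1, v2, n1, n2)
    # Nondirectional test: the sign of t is ignored.
    results[name] = (t, abs(t) > T_TABLE)
    print(f"{name}: t = {t:.2f}, significant = {abs(t) > T_TABLE}")
```

Only the instructor comparison should come out larger than the table value, which agrees with the conclusions above.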









So take Dr. Statisticus’ class whenever it fits in your schedule.


So what do you do if you have more than two populations or more than one phenomenon or some other weird combinations of data? You use an Analysis of Variance (ANOVA).

ANOVA includes a variety of statistical designs used to analyze differences in group means. It is a generalization of the t-test of a factor (called main effects or treatments in ANOVA) to more than two groups (called levels in ANOVA). In an ANOVA, the variances in the levels of factors being compared are partitioned between variation associated with the factors in the design (called model variation) and random variation (called error variation). ANOVA is conceptually similar to multiple two-population t-tests, but produces fewer Type I (false positive) errors. While t-tests use t-values from the t-distribution, ANOVAs use F-tests from the F-distribution. An F-test is the ratio of the model variation to the error variation. When there are only two means to compare, the t-test and the ANOVA F-test are equivalent according to the relationship F = t².
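The F = t² relationship is easy to check for yourself. Here is a minimal Python sketch that calculates a pooled-variance t value and a one-way ANOVA F value for the same two groups. The grades are invented for this illustration and the function names are hypothetical; the point is only that the two numbers agree.

```python
import math
from statistics import mean

def pooled_t(a, b):
    """Two-population t value with a pooled variance, from raw measurements."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))

def one_way_f(groups):
    """One-way ANOVA F: model (between-group) mean square over error mean square."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    ss_model = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_error = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_model = len(groups) - 1
    df_error = len(values) - len(groups)
    return (ss_model / df_model) / (ss_error / df_error)

# Hypothetical grades for two small groups, invented for this illustration.
group_1 = [88, 92, 79, 85, 90, 84]
group_2 = [81, 77, 86, 80, 75, 83]

t = pooled_t(group_1, group_2)
f = one_way_f([group_1, group_2])
print(round(t ** 2, 6), round(f, 6))
```

With three or more groups there is no single t value to square, which is where the F-test earns its keep.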

Types of ANOVA

There are many types of ANOVA designs. One-way and multi-way ANOVAs are the most common.

One-Way ANOVAs

One-way ANOVA is used to test for differences among three or more independent levels of one effect. In the example t-test, a one-way ANOVA might involve more than two levels of one of the three factors. For example, a one-way ANOVA would allow testing more than two instructors or more than two semesters.

Multi-Way ANOVAs

Multi-way ANOVAs (sometimes called factorial ANOVAs) are used to test for differences between two or more effects. A two-way ANOVA tests two effects, a three-way ANOVA tests three effects, and so on. Multi-way ANOVAs have the advantage of being able to test the significance of interaction effects. Interaction effects occur when two or more effects combine to affect measurements of the phenomenon. In the example t-test, a three-way ANOVA would allow simultaneous analysis of the semesters, instructors, and days, as well as interactions between them.

Other Types of ANOVA

There are numerous other types of ANOVA designs, some of which are too complex to explain in a sentence or two. Here are a few of the more commonly used designs.

Repeated Measures ANOVAs (also called within-subjects ANOVAs) are used when the same subjects are used for each treatment effect, as in a longitudinal study. In the example, if the scores for the students were recorded every month of the semester, they could be analyzed with a Repeated Measures ANOVA.

Some ANOVAs use design elements to control extraneous variance. The significance of the design elements is not important to the dependent variable so long as it controls variability in the main effects. If the design element is a nominal-scale variable, it is called a blocking effect. If the design element is a continuous-scale variable, it is called a covariate and the model is called an Analysis of Covariance (ANCOVA). In the example, if students’ year in college (freshman, sophomore, junior, or senior, an ordinal scale measure) were added as an effect to control variance, it would be a blocking factor. If students’ GPA (grade point average, a continuous scale measure) were added as a covariate, it would be an ANCOVA design.

Random Effects ANOVAs assume that the levels of a main effect are sampled from a population of possible levels so that the results can be extended to other possible levels. The Instructors main effect in the example could be a random effect if other instructors were considered part of the population that included Dr. Statisticus and Prof. Modearity. If only Dr. Statisticus and Prof. Modearity were levels of the effect, it would be called a fixed effect. If a design includes both fixed and random effects, it is called a mixed effects design.

Multivariate analysis of variance (MANOVA) is used when there is more than one set of measurements (also called dependent variables or response variables) of the phenomenon.

Now What?

Dare to Compare is a fairly comprehensive summary of statistical comparisons. You may not hear about all of these concepts in Stats 101 and that’s fine. Learn what you need to pass the course. Some topics are taught differently, especially hypothesis development and the normal curve. Follow what your instructor teaches. He or she will assign your grade.

Believe it or not, there’s quite a bit more to learn about all of the topics if you go further in statistics. There are special t-tests for proportions, regression coefficients, and samples that are not independent (called paired sample t-tests). There are tests based on other distributions besides the Normal and t-distributions, such as the binomial and chi-square (χ²) distributions. There are also quite a few nonparametric tests, based on ranks. And, of course, there are many topics on the mathematics end and on more metaphysical concepts like meaningfulness.

Statistical testing is more complicated than portrayed by some people but it’s still not as formidable as, say, driving a car. You might learn to drive as a teenager but not discover statistics and statistical testing until college. Both statistical testing and driving are full of intricacies that you have to keep in mind. In testing you consider an issue once, while in driving you must do it continually. When you make a mistake in testing, you can go back and correct it. If you make a mistake in driving, you might get a ticket or cause an accident. After you learn to drive a car, you can go on to learn to drive motorcycles, trucks, buses, and racing vehicles. After you learn simple hypothesis testing, you can go on to learn ANOVA, regression, and many more advanced techniques. So if you think you can learn to drive a car, you can also learn to conduct a statistical test.




Parts 1 and 2 of Dare to Compare summarized fundamental topics about simple statistical comparisons. Part 3 shows how those concepts play a role in conducting statistical tests. The importance of these concepts is highlighted in the following table.

Test Specification: Why it is Important
Population: Groups of individuals or items having some fundamental commonalities relative to the phenomenon being tested. Populations must be definable and readily reproducible so that results can be applied to other situations.
Number of populations being compared: The number of populations determines whether a comparison can be a relatively simple 1- or 2-population test or a complex ANOVA test.
Phenomena: The characteristic of the population being tested. It is usually measured as a continuous-scale attribute of a representative sample of the population.
Number of phenomena: The number of phenomena determines whether a comparison will be a relatively simple univariate test or a complex multivariate test.
Representative sample: A relatively small portion of all the possible measurements of the phenomenon on the population selected in such a way as to be a true depiction of the phenomenon.
Sample size: The number of observations of the phenomenon used to characterize the population. The sample size contributes to the determinations of the type of test to be used, the size of the difference that can be detected, the power of the test, and the meaningfulness of the results.
Hypotheses: You start statistical comparisons with a research hypothesis of what you expect to find about the phenomenon in the population. The research hypothesis is about the differences between the categories of the variable representing the population. You then create a null hypothesis that translates the research hypothesis into a mathematical statement that is the opposite of the research hypothesis, usually written in terms of no change or no difference. This is the subject of the test. If you reject the null hypothesis, you adopt the alternative hypothesis.
Distribution: Statistical tests examine chance occurrences of measurements on a phenomenon. These extreme measurements occur in the tails of the frequency distribution. Parametric statistical tests assume that the measurements are Normally distributed. If the tails of the distribution are different from the tails of a Normal distribution, the results of the test may be in error.
Directionality: Null hypotheses can be nondirectional or two-sided (i.e., μ = 0), in which both tails of the distribution are assessed. They can also be directional or one-sided (i.e., μ < 0 or μ > 0), in which only one tail of the distribution is assessed.
Assumptions: Statistical tests assume that the measurements of the phenomenon are independent (not correlated) and are representative of the population. They also assume that errors are normally distributed and the variances of populations are equal.
Type of test: Statistical tests can be based on a theoretical frequency distribution (parametric) or based on some imposed ordering (nonparametric). Parametric tests tend to be more powerful.
Test Parameters: Test parameters are the statistics used in the test. For t-tests using the Normal distribution, this involves the mean and the standard deviation. For F-tests in ANOVA, this involves the variance. For nonparametric tests, this usually involves the median and range.
Confidence: Confidence is 1 minus the false-positive error rate. The confidence is set by the person doing the test before testing, based on the maximum false-positive error rate they will accept. Usually, an error rate of 0.05 (5%) is selected but sometimes 0.1 (10%) or 0.01 (1%) are used, corresponding to confidences of 95%, 90%, and 99%.
Power: Power is the ability of a test to avoid false-negative errors (1-β). Power is based on sample size, confidence, and population variance and is NOT set by the person doing the test, but instead, calculated after a nonsignificant test result.
Degrees of Freedom: The number of values in the final calculation of a statistic that are free to vary. For a one-population t-test, the degrees of freedom is equal to the number of samples minus 1.
Effect Size: The smallest difference the test could have detected. Effect size is influenced by the variance, the sample size, and the confidence. Effect size can be too small, leading to false negatives, or too large, leading to false positives.
Significance: Significance refers to the result of a statistical test in which the null hypothesis is rejected. Significance is expressed as a p-value.
Meaningfulness: Meaningfulness is assessed by comparing the difference detected by the test to the magnitude of difference that would be important in reality.


Normal Distributions

After defining the population, the phenomena, and the test hypotheses, you measure the phenomenon on an appropriate number of individuals in the population. These measurements need to be independent of each other and representative of the population. Then, you need to assess whether it’s safe to assume that the frequency distribution of the measurements is similar to a Normal distribution. If it is, a z-test or a t-test would be in order.

Yes, this is scary looking. It’s the equation for the Normal distribution. Relax, you will probably never have to use it.
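For reference, the Normal density is usually written in the following standard form, where μ is the mean and σ is the standard deviation. The notation is the common textbook one and may not match every source symbol for symbol.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
```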

This figure represents a Normal distribution. The area under the curve represents the total probability of measured values occurring, which is equal to 1.0. Values near the center of the distribution, near the mean, have a large probability of occurring while values near the tails (the extremes) of the distribution have a small probability of occurring.

In statistical testing, the Normal distribution is used to estimate the probability that the measurements of the phenomenon will fall within a particular range of values. To estimate the probability that a measurement will occur, you could use the values of the mean and the standard deviation in the formula for the Normal distribution. Actually though, you never have to do that because there are tables for the Normal distribution and the t-distribution. Even easier, the functions are available in many spreadsheet applications, like Microsoft Excel.
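Spreadsheets are not the only option. As a small sketch, Python’s standard library has a `NormalDist` class that does the same lookups. The mean of 66 inches and standard deviation of 5.3 inches below are just assumed values for illustration.

```python
from statistics import NormalDist

# Hypothetical population: heights with mean 66 inches and standard deviation 5.3 inches.
heights = NormalDist(mu=66, sigma=5.3)

# Probability that a measurement falls between 63 and 69 inches (near the mean).
p_middle = heights.cdf(69) - heights.cdf(63)

# Probability that a measurement is more than 76.6 inches,
# which is two standard deviations above the mean (out in the tail).
p_upper_tail = 1 - heights.cdf(76.6)

print(round(p_middle, 3), round(p_upper_tail, 3))
```

The first probability is large because the range sits near the mean; the second is small because it sits in a tail.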

Statistical tests focus on the tails of the distribution where the probabilities are the smallest. It doesn’t matter much if the measurements of the phenomenon follow a Normal distribution near the mean so long as they do in the tails. The z-distribution can be used if the sample size is large; some say as few as 30 measurements and others recommend more, perhaps 100 measurements. The t-distribution compensates for small sample sizes by having more area in the tails. It can be used instead of the z-distribution with any number of samples.

The concept behind statistical testing is to determine how likely it is that a difference in two population parameters like the means (or a population parameter and a constant) could have occurred by chance. If the difference is large enough to fall in the tails of the distribution, there is only a small probability that the difference could have occurred by chance. Differences having a probability of occurrence less than a pre-specified value (α) are said to be significant differences. The pre-specified value, which is the acceptable false positive error rate, α, may be any small percentage but is usually taken as five-in-a-hundred (0.05), one-in-a-hundred (0.01), or ten-in-a-hundred (0.10).

Here are a few examples of what the process of statistical testing looks like for comparing a population mean to a constant.


One Population z-Test or t-Test

All z-tests and t-tests involve either one or two populations and only one phenomenon. The population is represented by the nominal-scale, independent variable. The measurement of the phenomenon is the dependent variable, which is usually measured on a continuous (interval or ratio) scale.

For a one-population test, you would be comparing the average (or other parameter) of the measurements in the population to a constant. You do this using the formula for a one-population t-test value (or a z-test value) to calculate the t value for the test.

t = (sample mean – constant) / (standard deviation / √(number of samples))

The Normal distribution and the t-distribution are symmetrical so it doesn’t matter if the numerator of the equation is positive or negative.

Then compare that value to a table of values for the t-distribution for the appropriate number of tails, the confidence (1 – α), and the degrees of freedom (the number of samples of the population minus 1). If the calculated t value is larger than the table t value, the test is SIGNIFICANT, meaning that the mean and the constant are statistically different. If the table t value is larger than the calculated t value, the test is NOT SIGNIFICANT, meaning that the mean and the constant are statistically the same.


Imagine you are comparing the average height of male, high school freshmen, in Minneapolis school district #1. You want to know how their average height compares to the height of 9th to 11th century Vikings (their mascot), for the school newspaper. Turn-of-the-century Vikings were typically about 5’9” or 69 inches (172 cm) tall.

This comparison doesn’t need to be too rigorous. The only possible negative consequence to the test is it being reported by Fox News as a liberal conspiracy, and they do that to everything anyway. You’ll accept a false positive rate (i.e., 1-confidence, α) of 0.10.

Nondirectional Tests

Say you don’t know many freshmen boys but you don’t think they are as tall as Vikings. You certainly don’t think of them as rampaging Vikings. They’re younger so maybe they’re shorter. Then again, they’ve grown up having better diets and medical care so maybe they’re taller. Therefore, your research hypothesis is that Freshmen are not likely to be the same height as Vikings. The null hypothesis you want to test is:

Height of Freshmen = Height of Vikings

which is a nondirectional test. If you reject the null hypothesis, the alternative hypothesis:

Height of Freshmen ≠ Height of Vikings

is probably true of the Freshmen. Say you then measure the heights of 10 freshmen and you get:

63.2, 63.8, 72.8, 56.9, 75.2, 70.8, 68.0, 64.0, 61.4, 65.2

The measurements average 66 inches with a standard deviation of 5.3 inches. The t-value would be equal to:

(Freshmen height – Viking height) / (standard deviation / (√number of samples))

t-value = (66 inches – 69 inches) / (5.3 inches / (√10 samples))

t-value = -1.790

Ignore the negative sign; it won’t matter.

In this comparison, the calculated t-value (1.79) is less than the table t-value (t(2-tailed, 90% confidence, 9 degrees of freedom) = 1.833) so the comparison is not significant. The comparison might look something like this:

[Figure: nondirectional one-population test, not significant]
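The same arithmetic takes only a few lines of Python. This is a minimal sketch using the rounded summary values from the text (mean 66, standard deviation 5.3, 10 samples); the function name `one_sample_t` is made up for the example and the table value is the one quoted above.

```python
import math

def one_sample_t(sample_mean, constant, sd, n):
    """One-population t value: difference from the constant over the standard error."""
    return (sample_mean - constant) / (sd / math.sqrt(n))

t = one_sample_t(66, 69, 5.3, 10)
t_table = 1.833  # table value from the text: 2-tailed, 90% confidence, 9 degrees of freedom

# Nondirectional test, so the sign of t is ignored.
significant = abs(t) > t_table
print(f"t = {t:.3f}, significant = {significant}")
```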

There is no statistical difference in the average heights of Freshmen and Vikings. Both are around 5’6” to 5’9” tall. That isn’t to say that there weren’t 6’0” Vikings, or Freshmen, but as a group, the Freshmen are about the same height as a band of berserkers. I’m sure that there are high school principals who will agree with this.

When you get a nonsignificant test, it’s a good practice to conduct a power analysis to determine what protection you had against false negatives. For a t-test, this involves rearranging the t-test formula to solve for tbeta:

tbeta = (sqrt(n)/sd) * difference – talpha

The talpha is for the confidence you selected, in this case 90%. Then you look up the t-value you calculated to find the probability for beta. It’s a cumbersome but not difficult procedure. In this example, the calculated tbeta would have been 1.24 so the power would have been 88%. That’s not bad. Anything over 80% is usually considered acceptable.

Most statistical software will do this calculation for you. You can increase power by increasing the sample size or the acceptable Type 1 error rate (decrease the confidence) before conducting the test.

So if everything were the same (i.e., mean of students = 66 inches, standard deviation = 5.3 inches) except that you had collected 30 samples instead of 10 samples:

t-value = (69 inches – 66 inches) / (5.3 inches / (√30 samples))

t-value = 3.10

t(2-tailed, 90% confidence, 29 degrees of freedom) = 1.699

If you had collected 100 samples:

t-value = (69 inches – 66 inches) / (5.3 inches / (√100 samples))

t-value = 5.66

t(2-tailed, 90% confidence, 99 degrees of freedom) = 1.660

These comparisons are both significant, and might look something like this:

[Figure: nondirectional one-population test, significant]

More samples give you better resolution.
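To see how the sample size drives the result, here is a short Python sketch that repeats the calculation for 10, 30, and 100 samples. It assumes the same 3-inch difference and 5.3-inch standard deviation throughout and uses the table values quoted above.

```python
import math

# Table t-values quoted in the text (2-tailed, 90% confidence), keyed by sample size.
t_table = {10: 1.833, 30: 1.699, 100: 1.660}
difference, sd = 3.0, 5.3  # inches

outcomes = {}
for n, critical in t_table.items():
    # The standard error shrinks as n grows, so t grows for the same difference.
    t = difference / (sd / math.sqrt(n))
    outcomes[n] = (round(t, 2), t > critical)
    print(n, round(t, 2), critical, t > critical)
```

The difference and the standard deviation never change; only the number of samples does.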


Directional Tests

Now say, in a different reality, you know that many of those freshmen boys grew up on farms and they’re pretty buff. You even think that they might just be taller than the Vikings of a millennium ago. Therefore, your research hypothesis is that Freshmen are likely to be taller than the warfaring Vikings. The null hypothesis you want to test is:

Height of Freshmen ≤ Height of Vikings

which is a directional test. If you reject the null hypothesis, the alternative hypothesis:

Height of Freshmen > Height of Vikings

is probably true of the Freshmen. Then you measure the heights of 10 freshmen and get:

72.4, 71.1, 75.4, 69.0, 75.7, 73.3, 76.0, 58.8, 70.4, 78.6

The measurements average 72 inches with a standard deviation of 5.3 inches. The t-value would be equal to:

(Freshmen height – Viking height) / (standard deviation / (√number of samples))

t-value = (72 inches – 69 inches) / (5.3 inches / (√10 samples))

t-value = 1.790

In this comparison, the table t-value you would use is for a one-tailed (directional) test at 90% confidence for 10 samples, t(1-tailed, α = 0.1, 9 degrees of freedom) = 1.383. For comparison, the value of t(2-tailed, 0.9 confidence, 9 degrees of freedom), which was used in the first example, is equal to 1.833, as is t(1-tailed, 0.95 confidence, 9 degrees of freedom). The reason is that you only have to look in half of the t-distribution area in a one-tailed test compared to a two-tailed test. That means that if you use a directional test you can have a smaller false positive rate.

The table t value you would use, t(1-tailed, α = 0.1, 9 degrees of freedom), is equal to 1.383, which is smaller than the calculated t-value, 1.790, so the comparison is significant. The comparison might look something like this:

[Figure: directional one-population test, significant]

In this comparison, the Freshmen are on average at least 3 inches taller than their frenzied Viking ancestors. Genetics, better diet, and healthy living win out.

But what if the farm boys averaged only 71 inches:

 (Freshmen height – Viking height) / (standard deviation / (√number of samples))

t-value = (71 inches – 69 inches) / (5.3 inches / (√10 samples))

t-value = 1.193

The table t value you would use, t(1-tailed, α = 0.1, 9 degrees of freedom), is equal to 1.383, which is larger than the calculated t-value, 1.193, so the comparison is not significant. The comparison might look something like this:

[Figure: directional one-population test, not significant]
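Both directional comparisons can be sketched in Python as well. As before, the function name `one_sample_t` is hypothetical and the inputs are the rounded values from the text. The one change from a nondirectional test is that the sign of t matters, so there is no absolute value.

```python
import math

def one_sample_t(sample_mean, constant, sd, n):
    """One-population t value: difference from the constant over the standard error."""
    return (sample_mean - constant) / (sd / math.sqrt(n))

T_TABLE = 1.383  # table value from the text: 1-tailed, alpha = 0.1, 9 degrees of freedom

results = {}
for freshmen_mean in (72, 71):
    t = one_sample_t(freshmen_mean, 69, 5.3, 10)
    # Directional test: only a positive t larger than the table value is significant.
    results[freshmen_mean] = (round(t, 3), t > T_TABLE)
    print(freshmen_mean, round(t, 3), t > T_TABLE)
```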

And that’s what one-population t-tests look like. Now for some two-population tests in Dare to Compare – Part 4.



You Need Statistics to Make Wine

The American Statistical Association has identified 146 college majors that require statistics to complete a degree.

You probably wouldn’t be surprised that statistics is required for degrees in mathematics, engineering, physics, astronomy, chemistry, meteorology, and even biology and geology. Most business-related degrees also require statistics. Agronomy degrees require statistics as do degrees in dairy science, aquatic sciences, and veterinary sciences. Degrees for medical professions such as nursing, nutrition, physical therapy, occupational health, pharmacy, and speech-language-hearing all require statistics. And, many social science degrees require statistics, including economics, psychology, sociology, anthropology, political science, education, and criminology. What may be surprising though is that statistics is required for some degrees in history, archaeology, geography, culinary science, viticulture (grape horticulture), journalism, graphic communications, library science, and linguistics. Pretty much everybody needs to know statistics.



Dare to Compare – Part 2

Part 1 of Dare to Compare summarized several fundamental topics about statistical comparisons.

Statistical comparisons, or statistical tests as they are usually called, involve populations, groups of individuals or items having some fundamental commonalities. The members of a population also have one or more characteristics, called phenomena, which are what is compared in the populations. You don’t have to measure the phenomena in every member of the population. You can take a representative sample. Statistical tests can involve one population (comparing a population phenomenon to a constant), two populations (comparing a population phenomenon to the phenomenon in another population), or three or more populations. You can also compare just one phenomenon (called univariate tests) or two or more phenomena (called multivariate tests).

Parametric statistical tests compare frequency distributions, the number of times each value of the measured phenomena appears in the population. Most tests involve the Normal distribution in which the center of the distribution of values is estimated by the average, also called the mean. The variability of the distribution is estimated by the variance or the standard deviation, the square root of the variance. The mean and standard deviation are called parameters of the Normal distribution because they are in the mathematical formula that defines the form of the distribution. Formulas for statistical tests usually involve some measure of accuracy (involving the mean) divided by some measure of precision (involving the variance). Most statistical tests focus on the extreme ends of the Normal distribution, called the tails. Tests of whether population means are equal are called non-directional, two-sided, or two-tailed tests because differences in both tails of the Normal distribution are considered. Tests of whether one population mean is less than or greater than another are called directional, one-sided, or one-tailed tests because the difference in only one tail of the Normal distribution is considered.

Statistical tests that don’t rely on the distributions of the phenomenon in the populations are called nonparametric tests. Nonparametric tests often involve converting the data to ranks and analyzing the ranks using the median and the range.

The nice thing about statistical comparisons is that you don’t have to measure the phenomenon in the entire population at the same place or the same time, and you can then make inferences about groups (populations) instead of just individuals or items. What may even be better is that if you follow statistical testing procedures, most people will agree with your findings.

Now for even more …


There are just a few more things you need to know before conducting statistical comparisons.

You start with a research hypothesis, a statement of what you expect to find about the phenomenon in the population. From there, you create a null hypothesis that translates the research hypothesis into a mathematical statement about the opposite of the research hypothesis. Statistical comparisons are sometimes called hypothesis tests. The null hypothesis is usually also written in terms of no change or no difference. For example, if you expect that the average heights of students in two school districts will be different because of some demographic factors (your research hypothesis), then your null hypothesis would be that the means of the two populations are equal.

When you conduct a statistical test, the result does not mean you prove your hypothesis. Rather, you can only reject or fail to reject the null hypothesis. If you reject the null hypothesis, you adopt the alternative hypothesis. This would mean that it is more likely that the null hypothesis is not true in the populations. If you fail to reject the null hypothesis, you have not found enough evidence against it, so you continue to treat it as plausible in the populations.

The results of statistical tests are sometimes in error, but fortunately, you have some control over the rates at which errors occur. There are four possibilities for the results of a statistical test.

  • True Positive – The statistical test rejects a null hypothesis that is false in the population.
  • True Negative – The statistical test fails to reject a null hypothesis that is true in the population.
  • False Positive – The statistical test rejects a null hypothesis that is true in the population. This is called a Type I error and is represented by α. The Type I error rate you will accept for a test is called the significance level, and 1-α is called the Confidence. Typically the significance level is set at 0.05, a 5% Type I error rate (95% confidence), although sometimes 0.10 (more acceptable error) or 0.001 (less acceptable error) is used.
  • False Negative – The statistical test fails to reject a null hypothesis that is false. This is called a Type II error and is represented by β. The ability of a particular comparison to avoid a Type II error is represented by 1-β and is called the Power of the test. Typically, power should be at least 0.8 for a 20% Type II error rate.
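The trade-off between the Type I error rate (α) and power (1-β) can be sketched for the simplest case, a two-tailed, one-population z-test. This is only an illustration under stated assumptions: the true difference of 1 unit, the standard deviation of 2 units, and the sample sizes are hypothetical, and `z_test_power` is a helper name made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(true_diff, sigma, n, alpha=0.05):
    """Power (1 - beta) of a two-tailed, one-population z-test.

    true_diff is the real difference between the population mean and the
    constant it is compared to; sigma is the known population standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value for both tails
    shift = true_diff / (sigma / sqrt(n))        # where the test statistic is centered
    reject_upper = 1 - std_normal.cdf(z_crit - shift)
    reject_lower = std_normal.cdf(-z_crit - shift)
    return reject_upper + reject_lower

# Hypothetical scenario: true difference of 1 unit, standard deviation of 2 units.
for n in (10, 20, 32, 50):
    print(f"n = {n:3d}  power = {z_test_power(1.0, 2.0, n):.3f}")

# With no true difference, the rejection rate is just the Type I error rate.
print(f"no difference: {z_test_power(0.0, 2.0, 10):.3f}")
```

Under these assumptions, a sample of about 32 is the point where power first passes the usual 0.8 target.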

When you design a statistical test, you specify the hypotheses including the number of populations and directionality, the type of test, the confidence, and the number of observations in your representative sample of the population. From the sample, you calculate the mean and standard deviation. You calculate the test statistic and compare it to standard values in a table based on the distribution. If the test statistic is greater than the standard value, you reject the null hypothesis. When you reject the null hypothesis the comparison is said to be significant. If the test statistic is less than the standard value, you fail to reject the null hypothesis and the comparison is said to be nonsignificant. Most statistical software now provides exact probabilities, called p-values, of getting a test statistic at least as extreme as the one you calculated if the null hypothesis were true, so no tables are necessary.
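The table-lookup procedure described above might look like this for a one-population, two-tailed t-test. The student heights and the comparison constant of 68 inches are made up for illustration; the critical values are the usual two-tailed 0.05 entries from a standard t table.

```python
from math import sqrt
from statistics import mean, stdev

# Two-tailed critical values of t at a 0.05 Type I error rate, keyed by
# degrees of freedom (n - 1), as found in a standard t table.
T_CRITICAL_05 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 29: 2.045}

def one_sample_t(sample, constant):
    """t statistic: (sample mean - constant) / (sample sd / sqrt(n))."""
    standard_error = stdev(sample) / sqrt(len(sample))
    return (mean(sample) - constant) / standard_error

# Hypothetical heights (inches) of ten students, compared with a constant of 68.
heights = [66, 68, 70, 72, 74, 67, 69, 71, 73, 70]
t = one_sample_t(heights, constant=68)
critical = T_CRITICAL_05[len(heights) - 1]

# Two-tailed test, so compare the absolute value to the table value.
reject_null = abs(t) > critical
print(f"t = {t:.3f}, critical value = {critical}")
print("significant" if reject_null else "nonsignificant")
```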

After you conduct the test, there are two pieces of information you need to determine – the magnitude of the difference detected, called the effect size, and the power of the test. The power of the test will depend on the sample size, the confidence, and the effect size. The effect size also provides insight into whether the test results are meaningful. Meaningfulness is important because a test may be able to detect a difference far smaller than what might be of interest, such as a difference in mean student heights less than a millimeter. Perhaps surprisingly, the most common reason for being able to detect differences that are too small to be meaningful is having too large a sample size. More samples are not always better.
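Here is a rough sketch of how a huge sample can make a trivial difference statistically significant. It assumes a true difference in mean height of 1 millimeter between two districts, a common standard deviation of 70 millimeters, and a standardized effect size defined as the difference divided by the standard deviation (often called Cohen's d, a name the text above doesn't use). All of the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

MEAN_DIFF_MM = 1.0   # assumed true difference between the two district means
SIGMA_MM = 70.0      # assumed common standard deviation of student heights

def effect_size(mean_diff, sigma):
    """Standardized effect size: the difference in means in standard deviation units."""
    return mean_diff / sigma

def two_sample_z(mean_diff, sigma, n_per_group):
    """z statistic for two equal-sized samples with a known, common sigma."""
    return mean_diff / (sigma * sqrt(2 / n_per_group))

d = effect_size(MEAN_DIFF_MM, SIGMA_MM)
z_crit = NormalDist().inv_cdf(0.975)   # two-tailed test, 0.05 Type I error rate

# Same tiny difference, three very different sample sizes.
results = {n: two_sample_z(MEAN_DIFF_MM, SIGMA_MM, n) for n in (100, 10_000, 100_000)}

print(f"effect size = {d:.4f}")
for n, z in results.items():
    verdict = "significant" if z > z_crit else "nonsignificant"
    print(f"n per group = {n:7d}  z = {z:.3f}  {verdict}")
```

The effect size stays the same tiny number throughout; only the sample size changes the verdict.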


It seems like there are hundreds of kinds of statistical tests, and in a way there are, but most are just variations of the concept of the accuracy in terms of the precision. In most tests, you calculate a test statistic and compare it to a standard. If the test statistic is greater than the standard, the difference is larger than might have been expected by chance, and is said to be statistically significant. For the most part, statistical software now reports exact probabilities for statistical tests instead of relying on manual comparisons.

Don’t worry too much about remembering formulas for the statistical tests (unless a teacher tells you to). Most testing is done using software with the test formulas already programmed. If you need a test formula, you can always search the Internet.

Tests depend on the scales of the data to be used in the statistical comparison. Usually, the dependent variable (the measurements of the phenomenon) is continuous and the independent variable (the divisions of the populations being tested) is categorical for parametric tests. Sometimes there are also grouping variables used as independent variables, called effects. In advanced designs, continuous-scale variables used as independent variables are called covariates. Some other scales of measurement for the dependent variable, like binary scales and restricted-range scales, require special tests or test modifications.

Here are a few of the most common parametric statistical tests.






1 0 Pop mean vs constant z-test, t-test
1 Pop level 1 mean vs Pop level 2 mean z-test, t-test
2 or more   ANOVA F-test
2 0   z-test, t-test
  any   ANOVA F-test
3 or more any any ANOVA F-test


z-Tests and t-Tests

The z-test and the t-test have similar forms relating the difference between a population mean and a constant (one-population test) or two population means (two-population test) to some measure of the uncertainty in the population(s). The difference in the tests is that a z-test is for Normally distributed populations where the variance is known and t-tests are for populations where the variance is unknown and must be estimated from the sample. t-Tests depend on the number of observations made on the sample of the population. The greater the sample size, the closer the t-test is to the z-test. Adjustments of two-population t-tests are made when the sample sizes or variances are different in the two populations. These tests can also be used to compare paired (e.g., before vs after) data.
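The two-population variations mentioned above can be sketched in a few lines of Python. This is an illustration, not a replacement for statistical software: the data are made up, and the unequal-variance adjustment shown is the one commonly known as Welch's t-test, a name the text above doesn't use.

```python
from math import sqrt
from statistics import mean, stdev, variance

def pooled_t(a, b):
    """Two-population t statistic assuming equal variances in both populations."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

def welch_t(a, b):
    """t statistic and adjusted degrees of freedom when variances may differ."""
    v1, v2 = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (len(a) - 1) + v2 ** 2 / (len(b) - 1))
    return t, df

def paired_t(before, after):
    """Paired t statistic: a one-population test on the differences vs zero."""
    diffs = [y - x for x, y in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical measurements for two independent groups.
group_a = [4, 5, 6, 7, 8]
group_b = [1, 2, 3, 4, 5]
t_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)

# Hypothetical before-and-after measurements on the same five subjects.
t_paired = paired_t([10, 12, 9, 11, 13], [11, 14, 9, 13, 14])

print(f"pooled t = {t_pooled:.3f}")
print(f"Welch t  = {t_welch:.3f} with {df_welch:.1f} degrees of freedom")
print(f"paired t = {t_paired:.3f}")
```

With equal sample sizes and equal sample variances, as in this made-up data, the pooled and adjusted statistics coincide; they drift apart as the sample sizes or variances become more unequal.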


Unlike t-tests that are calculated from means and standard deviations, F-tests are calculated from variances. The formula for the one-way ANOVA F-test is:

  • F = explained variance / unexplained variance, or
  • F = between-group variability / within-group variability, or
  • F = Mean square for treatments / Mean square for error

These are all equivalent. Also, as it turns out, when only two groups are compared, F = t².
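The relationship between F and t for two groups can be checked numerically. The sketch below computes a one-way ANOVA F statistic from mean squares and an equal-variance, two-population t statistic on the same made-up data.

```python
from math import sqrt
from statistics import mean, variance

def anova_f(groups):
    """One-way ANOVA F = mean square for treatments / mean square for error."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group variability: group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: values around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(all_values) - len(groups))
    return ms_between / ms_within

def pooled_t(a, b):
    """Two-population t statistic assuming equal variances."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical data for two groups of different sizes.
group_1 = [10, 12, 14]
group_2 = [13, 15, 17, 19]

f_stat = anova_f([group_1, group_2])
t_stat = pooled_t(group_1, group_2)
print(f"F = {f_stat:.4f}, t = {t_stat:.4f}, t squared = {t_stat ** 2:.4f}")
```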

χ² Tests

The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in mutually exclusive categories of a contingency table. For each category, you square the difference between the observed frequency and the expected frequency and divide by the expected frequency. The test statistic is the sum of these values over all the categories.
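As a small illustration, the chi-squared statistic for a contingency table is the sum over all cells of (observed - expected)² / expected. The sketch below uses made-up counts and takes the expected frequencies from the row and column totals, which is the usual approach when testing for independence.

```python
def chi_squared(table):
    """Chi-squared statistic and degrees of freedom for a contingency table.

    table is a list of rows of observed counts. Expected counts are
    row total * column total / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical counts: rows are two groups, columns are two outcomes.
observed_counts = [[10, 20],
                   [20, 10]]
chi2, df = chi_squared(observed_counts)

# 3.841 is the table value for 1 degree of freedom at a 0.05 Type I error rate.
print(f"chi-squared = {chi2:.3f} with {df} degree(s) of freedom")
print("significant" if chi2 > 3.841 else "nonsignificant")
```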

Nonparametric Tests

Nonparametric tests are also called distribution-free tests because they don’t rely on any assumptions concerning the frequency distribution of the test measurements. Instead, the tests use ranks or other imposed orderings of the data to identify differences. Here are a few of the most common nonparametric statistical tests.

Tests are grouped by Dependent Variable Scale, then listed by Levels of Categorical Independent Variable.

Categorical dependent variable:

  • Percentage of the target population – binomial test
  • 2 matched groups – McNemar’s test for 2×2 contingency tables
  • 2 or more independent groups – Chi-square test for contingency tables
  • 2 independent groups – Fisher’s exact test

Continuous dependent variable:

  • 2 matched groups – Wilcoxon signed-rank test
  • 2 independent groups – Mann-Whitney U test (also called the Wilcoxon rank-sum test or Wilcoxon-Mann-Whitney test)
  • 2 or more matched groups – Friedman test
  • 2 or more independent groups – Kruskal-Wallis H test
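To show what "converting the data to ranks" looks like in practice, here is a minimal sketch of the Mann-Whitney U statistic for two independent groups. The kibble amounts are invented for the example, and the sketch stops at the statistic; looking up its significance is left to a table or software.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1   # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U for group a, U for group b) computed from the pooled ranks."""
    ranks = average_ranks(list(a) + list(b))
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(ranks[:n1])
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u_b = n1 * n2 - u_a
    return u_a, u_b

# Hypothetical grams of kibble eaten per day under two brands.
brand_a = [12, 15, 11, 18]
brand_b = [20, 22, 17, 25]
u_a, u_b = mann_whitney_u(brand_a, brand_b)
print(f"U for brand A = {u_a}, U for brand B = {u_b}")
```

The smaller of the two U values is the one usually compared to a table. It never uses the actual gram amounts, only their order.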


Assumptions

You make a few assumptions in conducting statistical tests. First you assume your population is real (i.e., not a phantom population) and that your samples of the population are representative of all the possible measurements. Then, if you plan to do a parametric test, you assume (and hope) that the measurements of the phenomenon are Normally distributed and that the variances are the same in all the populations being compared. The more closely these assumptions are met, the more valid the comparisons are. The reason for this is that you are using Normal distributions, defined by means and variances, to represent the phenomenon in the populations. If the true distributions of the phenomenon in the populations do not exactly follow the Normal distribution, the comparison will be somewhat in error. Of course, the Normal distribution is a theoretical mathematical distribution so there is always going to be some deviation between it and real-world data. The same goes for equal variances in multi-population comparisons. Thus, the question is always how much deviation from the assumptions is tolerable before the test becomes misleading.

Data that do not satisfy the assumptions can often be transformed to satisfy the assumptions. Adding a constant to data or multiplying data by a constant does not affect statistical tests, so transformations have to be more involved, like roots, powers, reciprocals, and logs. Box-Cox transformations are especially useful but are laborious to calculate without supporting software. Ultimately, ranks and nonparametric tests can be used in which there is no assumption about the Normal distribution.
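The point about constants can be checked with a quick sketch. It computes a two-population t statistic (the unequal-variance form) on made-up, skewed data, then again after multiplying by 3 and adding 10, and again after taking logs. The data and the constants are arbitrary choices for illustration.

```python
from math import log, sqrt
from statistics import mean, variance

def two_sample_t(a, b):
    """Two-population t statistic without assuming equal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical right-skewed measurements for two groups.
raw_1 = [1, 2, 4, 8, 16]
raw_2 = [2, 4, 8, 16, 64]

# Multiplying by a positive constant and adding a constant
linear_1 = [3 * x + 10 for x in raw_1]
linear_2 = [3 * x + 10 for x in raw_2]

# Log transformation, which changes the shape of the distribution
log_1 = [log(x) for x in raw_1]
log_2 = [log(x) for x in raw_2]

t_raw = two_sample_t(raw_1, raw_2)
t_linear = two_sample_t(linear_1, linear_2)
t_log = two_sample_t(log_1, log_2)

print(f"raw data:           t = {t_raw:.4f}")
print(f"3x + 10:            t = {t_linear:.4f}")
print(f"log transformation: t = {t_log:.4f}")
```

The first two statistics match because the constants cancel out of the ratio. The log transformation gives a different value because it changes the shape of the distributions, not just their location and scale.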

Next, we’ll see how it all comes together …

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at online booksellers.
