Becoming Part of the Group

Imagine looking for patterns in a scatter plot of two variables. You see no linear trends, no curvilinear trends, and no cyclic or sinusoidal trends. Does that mean there are no associations between the variables? Maybe not.

No sooner had he gotten out of bed than two clusters of black fur formed on the blanket.

Most people think of statistics as hypothesis tests and regression lines, but of course, there’s much more to it (see https://statswithcats.wordpress.com/2010/08/22/the-five-pursuits-you-meet-in-statistics/). Classification is often an important goal of data analysis. You can classify data visually by sorting or filtering metadata, and by plotting histograms and setting thresholds. But that approach is inefficient, especially compared to cluster analysis.

Cluster Luck

Cluster analysis refers to a number of procedures for arranging ungrouped items into statistically similar collections. Either samples or variables can be clustered. Sample clusters can be used to better describe the data using descriptive statistics or coded as grouping variables for other types of statistical analysis. Variable clusters can be used to help evaluate what a set of variables actually measures. Cluster analysis can also be used to identify atypical groups, even individual outliers.

There are several types of cluster analysis, each with many options for directing the clustering process. The most commonly used type of cluster analysis is hierarchical cluster analysis. Results from a hierarchical clustering are usually expressed as a tree diagram, which looks a bit like a company’s organization chart. The challenge in hierarchical cluster analysis is to interpret a tree diagram and select the appropriate clusters.
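To make the idea of building a tree concrete, here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering in plain Python. It is only an illustration under stated assumptions: the six items and their values are made up, and Euclidean distance with average linkage is just one of the many options mentioned above. In practice you would use a statistics package rather than code like this.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: (carbohydrates, protein) in grams for six made-up items.
foods = {
    "item_1": (2.0, 20.0),
    "item_2": (3.0, 22.0),
    "item_3": (25.0, 1.0),
    "item_4": (27.0, 2.0),
    "item_5": (26.0, 3.0),
    "item_6": (12.0, 12.0),
}

def average_linkage(a, b):
    """Mean Euclidean distance over every pair of members of clusters a and b."""
    return sum(dist(foods[x], foods[y]) for x in a for y in b) / (len(a) * len(b))

def agglomerate(names):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in names]  # every item starts as its own cluster
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
        history.append((a, b, average_linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history

history = agglomerate(foods)
for left, right, height in history:
    print(f"{height:6.2f}  {left} + {right}")
```

Each row of the merge history corresponds to one joint in the tree diagram, and the height is the distance at which the two branches were joined.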

Cluster analysis has been used to classify animal and plant species, soil and rock types, astronomical bodies, and weather systems. It is used in education research to classify students, schools and districts. It is used to analyze customer preferences, market segments, target markets, and social networks. It is used to identify crime hot spots and anatomical features in forensic analysis.

Food for Thought

Consider this example. People follow special diets for a variety of reasons, such as controlling weight or blood glucose. But food is complex. Ignoring taste, food is characterized by the energy it provides (calories), the rate at which it is metabolized (glycemic index, for carbohydrates), its components (carbohydrates, proteins, and fats), and many other attributes. So, it is useful for nutritionists to classify foods to help consumers make healthy choices. Cluster analysis is one approach to such a classification.

Data for this analysis consisted of values of five variables (calories, carbohydrates, proteins, fats, and glycemic index) for 213 sample foods. The figure shows the tree diagram produced by the cluster analysis (only 38 of the 213 foods are listed, to aid readability). From the tree diagram, an appropriate number of clusters is selected. Cluster selection requires a combination of information on the statistical differences between potential clusters, an understanding of the data to interpret why each member might belong to a certain cluster, and a sense of how many clusters might be reasonable for characterizing the data. The letters in the tree diagram of the figure show one of the many possible sets of clusters.
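One simple way to bring the statistical differences between potential clusters into the choice is to look at the heights at which branches of the tree join, and to cut the tree where the jump between successive heights is largest. The sketch below shows that heuristic on made-up merge heights. It is an assumption for illustration, not the procedure used for the food data, and it covers only one of the three ingredients of cluster selection described above.

```python
# Hypothetical merge heights from a tree built on eight items (seven merges).
# These numbers are made up for illustration.
merge_heights = [0.8, 1.1, 1.3, 1.9, 2.2, 6.5, 7.0]

def clusters_at_cut(heights, cut):
    """Number of clusters that remain if the tree is cut at the given height."""
    n_items = len(heights) + 1
    merges_below_cut = sum(h <= cut for h in heights)
    return n_items - merges_below_cut

def largest_gap_cut(heights):
    """Place the cut midway across the largest jump between successive merges."""
    ordered = sorted(heights)
    gaps = [(upper - lower, lower, upper) for lower, upper in zip(ordered, ordered[1:])]
    _, lower, upper = max(gaps)
    return (lower + upper) / 2

cut = largest_gap_cut(merge_heights)
n_clusters = clusters_at_cut(merge_heights, cut)
print(f"cut at {cut:.2f} leaves {n_clusters} clusters")
```

A cut chosen this way is only a starting point. Whether the resulting groups can be interpreted still depends on knowing the data.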

Tree diagram for the Cluster Analysis of Food Types.

Once clusters are chosen, they are characterized based on the characteristics of their members. The table summarizes how the six food clusters could be interpreted. These interpretations might have been different if the original variables or the number of clusters were different.
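Characterizing a cluster usually starts with descriptive statistics for its members. The sketch below averages each variable within each cluster. The records, cluster labels, and values are hypothetical and are not taken from the food data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (cluster label, calories, protein g, fat g). Made-up values.
records = [
    ("A", 90, 12.0, 5.0),
    ("A", 110, 14.0, 7.0),
    ("B", 60, 1.0, 0.5),
    ("B", 80, 2.0, 0.5),
    ("B", 70, 0.0, 0.5),
]
variables = ("calories", "protein", "fat")

def cluster_profiles(rows):
    """Return the mean of each variable within each cluster."""
    grouped = defaultdict(list)
    for label, *values in rows:
        grouped[label].append(values)
    return {
        label: {name: mean(column) for name, column in zip(variables, zip(*members))}
        for label, members in grouped.items()
    }

profiles = cluster_profiles(records)
for label, profile in profiles.items():
    print(label, profile)
```

Comparing these profiles across clusters is what leads to verbal labels such as "Low" or "Very High" in a summary table.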

Characteristics of Six Food Categories Identified with Cluster Analysis.

Food Category	Description	Calories	Metabolism	Protein	Carbohydrates	Fats	Foods
A	Muscle-maintenance foods	Low	Very Slow	High	Low	Moderate	Eggs, most fish, ham, salami, bacon, liverwurst, frankfurters
B	Quick-energy foods	Low	Fast	Low	Moderate to High	Low to Moderate	Milk, fruit juices, apples, bananas, cherries, grapes, pears, mangos, papayas, potatoes, crackers, pretzels
C	Low-calorie foods	Very Low	Moderate to Fast	Low	Moderate to High	Low to Moderate	Bread, peas, carrots, citrus fruits, peaches, plums, kiwis, watermelon, anchovies, caviar, gefilte fish, pepperoni
D	Sustained-energy foods	High	Fast to Very Fast	Moderate	High	Moderate	Yogurt, dates, prunes, pasta, rice, beans, French fries
E	Muscle-building foods	High	Very Slow	High	Low	Very High	Catfish, abalone, flounder, herring, mackerel, corned beef, liver, skinned chicken, turkey, venison, veal
F	Weight-gain foods	Very High	Slow to Moderate	High	Low	Very High	Raisins, soybeans, bass, kingfish, most beef, chicken and pork

One thing you should do after every analysis is to ask yourself if the results make sense. This isn’t the same as trying to bias the results, or at least it shouldn’t be. If you really understand your data, you should be able to tell if a result fits with the conventional wisdom. In the table, for example, does it make sense that raisins and soybeans are weight-gain foods and pepperoni is a low-calorie food? Could there be errors in the data? Might the serving sizes be non-representative of what might be eaten at one time? Perhaps. In other cases, it might also be possible that a different clustering algorithm, a different measure of data distance, or a different number of clusters would allow a better interpretation of the data.
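To see why the measure of data distance matters, consider the small sketch below. With made-up points, the item that is nearest to a target under Euclidean (straight-line) distance is not the nearest under Manhattan (city-block) distance, so a clustering built on one measure can group items differently than a clustering built on the other. The points and names are hypothetical.

```python
from math import dist

def manhattan(p, q):
    """City-block distance: the sum of absolute differences on each variable."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical two-variable measurements for a target item and two candidates.
target = (0.0, 0.0)
candidates = {"p": (3.0, 3.0), "q": (0.0, 5.0)}

nearest_euclidean = min(candidates, key=lambda k: dist(target, candidates[k]))
nearest_manhattan = min(candidates, key=lambda k: manhattan(target, candidates[k]))

print("Euclidean nearest:", nearest_euclidean)
print("Manhattan nearest:", nearest_manhattan)
```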

Cluster analysis is a powerful technique for exploring patterns of similarity and difference in samples or variables. It is considered to be an exploratory statistical technique. It requires considerable knowledge of the phenomenon the data represent to interpret the results. For applied statisticians, though, this is where data analysis really gets fun.

[The data for this analysis came from http://www.ast-ss.com/research/food/food_listing_all.asp. Values for calories, carbohydrates, proteins, and fats are contingent on serving size. Not all of the foods were included in the analysis because of missing data.]
