Imagine looking for patterns in a scatter plot of two variables. You see no linear trends, no curvilinear trends, and no cyclic or sinusoidal trends. Does that mean there are no associations between the variables? Maybe not.
Most people think of statistics as hypothesis tests and regression lines, but of course, there’s much more (https://statswithcats.wordpress.com/2010/08/22/the-five-pursuits-you-meet-in-statistics/). Classification is often an important goal of data analysis. You can classify data visually by sorting or filtering metadata, and by plotting histograms and setting thresholds. But that approach is inefficient, especially compared to cluster analysis.
Cluster Luck
Cluster analysis refers to a number of procedures for arranging ungrouped items into statistically similar collections. Either samples or variables can be clustered. Sample clusters can be used to better describe the data using descriptive statistics or coded as grouping variables for other types of statistical analysis. Variable clusters can be used to help evaluate what a set of variables actually measures. Cluster analysis can also be used to identify atypical groups, even individual outliers.
There are several types of cluster analysis, each with many options for directing the clustering process. The most commonly used type of cluster analysis is hierarchical cluster analysis. Results from a hierarchical clustering are usually expressed as a tree diagram, which looks a bit like a company’s organization chart. The challenge in hierarchical cluster analysis is to interpret a tree diagram and select the appropriate clusters.
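To make the idea of a tree diagram concrete, here is a minimal sketch of one common hierarchical approach (agglomerative clustering with average linkage) written in plain Python. It is not the procedure or software behind the analysis in this post. The food names and values are invented for illustration, and `average_linkage` and `agglomerate` are hypothetical helper names. Each row of the printed merge history corresponds to one branch point in a tree diagram, and the height is the distance at which the two branches join.

```python
from itertools import combinations
from math import dist

# Hypothetical, made-up values on two variables; not real food data.
foods = {
    "food_1": (1.0, 1.0),
    "food_2": (1.5, 1.0),
    "food_3": (5.0, 5.0),
    "food_4": (5.0, 5.8),
    "food_5": (20.0, 1.0),
}


def average_linkage(c1, c2, points):
    """Mean Euclidean distance over all pairs with one member in each cluster."""
    total = sum(dist(points[a], points[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in points]  # start with one cluster per item
    history = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        height = average_linkage(c1, c2, points)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        history.append((c1, c2, height))
    return history


history = agglomerate(foods)
for left, right, height in history:
    print(f"{height:7.2f}  {left} + {right}")
```

In practice you would likely use a statistics package rather than hand-rolled code, but the logic is the same: close items join low in the tree, and dissimilar groups join near the top.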
Cluster analysis has been used to classify animal and plant species, soil and rock types, astronomical bodies, and weather systems. It is used in education research to classify students, schools and districts. It is used to analyze customer preferences, market segments, target markets, and social networks. It is used to identify crime hot spots and anatomical features in forensic analysis.
Food for Thought
Consider this example. People follow special diets for a variety of reasons, such as controlling weight or blood glucose. But food is complex. Ignoring taste, food is characterized by the energy it provides (i.e., calories), the rate at which it is metabolized (i.e., glycemic index for carbohydrates), its components (i.e., carbohydrates, proteins, and fats), and many other attributes. So, it is useful for nutritionists to classify foods to help consumers make healthy choices. Cluster analysis is one approach for such a characterization.
Data for this analysis consisted of values of five variables (i.e., calories, carbohydrates, proteins, fats, and glycemic index) for 213 sample foods. The figure shows the tree diagram produced by the cluster analysis (although only 38 of the 213 foods are listed to aid readability). From the tree diagram, an appropriate number of clusters is selected. Cluster selection requires a combination of information on the statistical differences between potential clusters, an understanding of the data to interpret why each member might belong to a certain cluster, and a sense of how many clusters might be reasonable for characterizing the data. The letters in the tree diagram of the figure show one of the many possible sets of clusters.
Once clusters are chosen, they are characterized based on the characteristics of their members. The table summarizes how the six food clusters could be interpreted. These interpretations might have been different if the original variables or the number of clusters were different.
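The statistical part of cluster selection can be sketched with a simple rule of thumb: look for the largest jump in the heights at which branches join, and cut the tree just below it. This is only one heuristic, not the method used for the food data, and it ignores the other two ingredients (knowledge of the data and a sense of what is reasonable). The function name `suggest_cluster_count` and the merge heights are hypothetical.

```python
def suggest_cluster_count(merge_heights, n_items):
    """Number of clusters left just before the largest jump in merge height.

    merge_heights must be listed in the order the merges happened,
    from the bottom of the tree to the top.
    """
    gaps = [later - earlier
            for earlier, later in zip(merge_heights, merge_heights[1:])]
    biggest = max(range(len(gaps)), key=gaps.__getitem__)
    merges_done = biggest + 1  # merges completed below the big jump
    return n_items - merges_done


# Hypothetical merge heights for 7 items (6 merges); not from the food data.
heights = [0.5, 0.7, 0.9, 1.1, 3.8, 4.2]
suggested = suggest_cluster_count(heights, n_items=7)
print(suggested)
```

Here the big jump falls between 1.1 and 3.8, so the sketch suggests three clusters. Whether three clusters make sense is still a judgment call.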
Characteristics of Six Food Categories Identified with Cluster Analysis.
| Food Category and Description | Calories | Metabolism | Protein | Carbohydrates | Fats | Foods |
| --- | --- | --- | --- | --- | --- | --- |
| A: Muscle-maintenance foods | Low | Very Slow | High | Low | Moderate | Eggs, most fish, ham, salami, bacon, liverwurst, frankfurters |
| B: Quick-energy foods | Low | Fast | Low | Moderate to High | Low to Moderate | Milk, fruit juices, apples, bananas, cherries, grapes, pears, mangos, papayas, potatoes, crackers, pretzels |
| C: Low-calorie foods | Very Low | Moderate to Fast | Low | Moderate to High | Low to Moderate | Bread, peas, carrots, citrus fruits, peaches, plums, kiwis, watermelon, anchovies, caviar, gefilte fish, pepperoni |
| D: Sustained-energy foods | High | Fast to Very Fast | Moderate | High | Moderate | Yogurt, dates, prunes, pasta, rice, beans, French fries |
| E: Muscle-building foods | High | Very Slow | High | Low | Very High | Catfish, abalone, flounder, herring, mackerel, corned beef, liver, skinned chicken, turkey, venison, veal |
| F: Weight-gain foods | Very High | Slow to Moderate | High | Low | Very High | Raisins, soybeans, bass, kingfish, most beef, chicken and pork |
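One way to arrive at descriptions like those in the table is to average each variable within each cluster and then compare the averages across clusters. The sketch below assumes every item already carries a cluster label. The records, the choice of three variables, and the helper name `cluster_profiles` are all invented for illustration and are not the data behind the table.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (cluster label, calories, protein in g, fat in g).
# Values are made up for illustration; they are not the article's data.
records = [
    ("A", 90, 12, 5),
    ("A", 110, 14, 7),
    ("B", 60, 1, 0),
    ("B", 80, 2, 1),
    ("F", 300, 25, 20),
    ("F", 340, 27, 24),
]
variables = ("calories", "protein", "fat")


def cluster_profiles(rows):
    """Average each variable over the members of each cluster."""
    grouped = defaultdict(list)
    for label, *values in rows:
        grouped[label].append(values)
    return {
        label: {name: mean(column)
                for name, column in zip(variables, zip(*members))}
        for label, members in grouped.items()
    }


profiles = cluster_profiles(records)
for label, profile in sorted(profiles.items()):
    print(label, profile)
```

Turning averages into words like "Low" or "Very High" is then a matter of comparing each cluster's average with the others, which is where judgment comes in.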
One thing you should do after every analysis is to ask yourself if the results make sense. This isn’t the same as trying to bias the results, or at least it shouldn’t be. If you really understand your data, you should be able to tell if a result fits with the conventional wisdom. In the table, for example, does it make sense that raisins and soybeans are weight-gain foods and pepperoni is a low-calorie food? Could there be errors in the data? Might the serving sizes be non-representative of what might be eaten at one time? Perhaps. In other cases, it might also be possible that a different clustering algorithm, a different measure of data distance, or a different number of clusters would allow a better interpretation of the data.
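To see how much the measure of distance can matter, consider a small sketch in which the same item has a different nearest neighbor depending on whether the variables are standardized first. The values are invented, `standardize` and `nearest` are hypothetical helper names, and nothing here says whether the food data were standardized. It only illustrates that a variable measured in large units (calories) can swamp one measured in small units (grams of fat).

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical (calories, fat in g) values, invented for illustration.
foods = {
    "food_1": (100.0, 1.0),
    "food_2": (110.0, 20.0),
    "food_3": (160.0, 1.5),
    "food_4": (300.0, 10.0),
}


def standardize(points):
    """Rescale each variable to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*points.values()))
    centers = [mean(column) for column in columns]
    spreads = [pstdev(column) for column in columns]
    return {
        name: tuple((v - m) / s for v, m, s in zip(values, centers, spreads))
        for name, values in points.items()
    }


def nearest(name, points):
    """Name of the item with the smallest Euclidean distance to `name`."""
    others = (other for other in points if other != name)
    return min(others, key=lambda other: dist(points[name], points[other]))


raw_neighbor = nearest("food_1", foods)
scaled_neighbor = nearest("food_1", standardize(foods))
print("raw:", raw_neighbor, "| standardized:", scaled_neighbor)
```

With the raw values, food_1 sits closest to food_2 because their calories are similar. After standardizing, the large difference in fat counts for more, and food_3 becomes the nearest neighbor. A clustering built on one distance could group these foods differently from a clustering built on the other.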
Cluster analysis is a powerful technique for exploring patterns of similarity and difference in samples or variables. It is considered to be an exploratory statistical technique. It requires considerable knowledge of the phenomenon the data represent to interpret the results. For applied statisticians, though, this is where data analysis really gets fun.
[The data for this analysis came from http://www.ast-ss.com/research/food/food_listing_all.asp. Values for calories, carbohydrates, proteins, and fats are contingent on serving size. Not all of the foods were included in the analysis because of missing data.]
I like clustering 😉
Are there cluster techniques for nominal data?
Correspondence analysis?
http://en.wikipedia.org/wiki/Correspondence_analysis
http://www.statsoft.com/textbook/correspondence-analysis/
Thanks for breaking it down!
Which software did you use to chart the diagram, please?
Statistica, from Statsoft, Inc. (http://statsoft.com/)