Becoming Part of the Group

Imagine looking for patterns in a scatter plot of two variables. You see no linear trends, no curvilinear trends, and no cyclic or sinusoidal trends. Does that mean there are no associations between the variables? Maybe not.

No sooner had he gotten out of bed than two clusters of black fur formed on the blanket.

Most people think of statistics as hypothesis tests and regression lines, but of course, there’s much more (https://statswithcats.wordpress.com/2010/08/22/the-five-pursuits-you-meet-in-statistics/). Classification is often an important goal of data analysis. You can classify data visually by sorting or filtering metadata, and by plotting histograms and setting thresholds. But that approach is inefficient, especially compared to cluster analysis.

Cluster Luck

Cluster analysis refers to a number of procedures for arranging ungrouped items into statistically similar collections. Either samples or variables can be clustered. Sample clusters can be used to better describe the data using descriptive statistics or coded as grouping variables for other types of statistical analysis. Variable clusters can be used to help evaluate what a set of variables actually measures. Cluster analysis can also be used to identify atypical groups, even individual outliers.

There are several types of cluster analysis, each with many options for directing the clustering process. The most commonly used type of cluster analysis is hierarchical cluster analysis. Results from a hierarchical clustering are usually expressed as a tree diagram, which looks a bit like a company’s organization chart. The challenge in hierarchical cluster analysis is to interpret a tree diagram and select the appropriate clusters.
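If you want to try the mechanics yourself, here is a minimal Python sketch, assuming a handful of made-up foods and scipy’s hierarchical clustering routines. It is not the analysis behind the figures in this post, just the general recipe: standardize the variables, build the tree, and cut it into clusters.

```python
# A minimal hierarchical clustering sketch with hypothetical food data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.stats import zscore

foods = ["egg", "apple", "rice", "salmon", "raisin", "bread"]  # hypothetical items
X = np.array([            # calories, glycemic index, protein, carbs, fat (made up)
    [78,   0,  6,  1,  5],
    [95,  36,  0, 25,  0],
    [206, 73,  4, 45,  0],
    [208,  0, 20,  0, 13],
    [85,  64,  1, 22,  0],
    [79,  71,  3, 14,  1],
], dtype=float)

Xz = zscore(X, axis=0)              # put all variables on comparable scales
tree = linkage(Xz, method="ward")   # build the hierarchical cluster tree
labels = fcluster(tree, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(dict(zip(foods, labels)))
# dendrogram(tree, labels=foods)    # with matplotlib, this draws the tree diagram
```

Cutting the tree at a different height (or asking for a different number of clusters) is exactly the judgment call described above.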

Cluster analysis has been used to classify animal and plant species, soil and rock types, astronomical bodies, and weather systems. It is used in education research to classify students, schools and districts. It is used to analyze customer preferences, market segments, target markets, and social networks. It is used to identify crime hot spots and anatomical features in forensic analysis.

Food for Thought

Consider this example. People follow special diets for a variety of reasons, such as controlling weight or blood glucose. But food is complex. Ignoring taste, food is characterized by the energy it provides (i.e., calories), the rate it is metabolized (i.e., Glycemic index for carbohydrates), its components (i.e., carbohydrates, proteins, and fats), and many other attributes. So, it is useful for nutritionists to classify foods to help consumers make healthy choices. Cluster analysis is one approach for such a characterization.

Data for this analysis consisted of values of five variables (i.e., calories, carbohydrates, proteins, fats, and Glycemic index) for 213 sample foods. The figure shows the tree diagram produced by the cluster analysis (although only 38 of the 213 foods are listed to aid readability). From the tree diagram, an appropriate number of clusters is selected. Cluster selection requires a combination of information on the statistical differences between potential clusters, an understanding of the data to interpret why each member might belong to a certain cluster, and a sense of how many clusters might be reasonable for characterizing the data. The letters in the tree diagram of the figure show one of the many possible sets of clusters.

Tree diagram for the Cluster Analysis of Food Types.

Once clusters are chosen, they are characterized based on the attributes of their members. The table summarizes how the six food clusters could be interpreted. These interpretations might have been different if the original variables or the number of clusters were different.

Characteristics of Six Food Categories Identified with Cluster Analysis.

Cluster | Food Category | Calories | Metabolism | Protein | Carbohydrates | Fats | Foods
A | Muscle-maintenance foods | Low | Very Slow | High | Low | Moderate | Eggs, most fish, ham, salami, bacon, liverwurst, frankfurters
B | Quick-energy foods | Low | Fast | Low | Moderate to High | Low to Moderate | Milk, fruit juices, apples, bananas, cherries, grapes, pears, mangos, papayas, potatoes, crackers, pretzels
C | Low-calorie foods | Very Low | Moderate to Fast | Low | Moderate to High | Low to Moderate | Bread, peas, carrots, citrus fruits, peaches, plums, kiwis, watermelon, anchovies, caviar, gefilte fish, pepperoni
D | Sustained-energy foods | High | Fast to Very Fast | Moderate | High | Moderate | Yogurt, dates, prunes, pasta, rice, beans, French fries
E | Muscle-building foods | High | Very Slow | High | Low | Very High | Catfish, abalone, flounder, herring, mackerel, corned beef, liver, skinned chicken, turkey, venison, veal
F | Weight-gain foods | Very High | Slow to Moderate | High | Low | Very High | Raisins, soybeans, bass, kingfish, most beef, chicken and pork

One thing you should do after every analysis is to ask yourself if the results make sense. This isn’t the same as trying to bias the results, or at least it shouldn’t be. If you really understand your data, you should be able to tell if a result fits with the conventional wisdom. In the table, for example, does it make sense that raisins and soybeans are weight-gain foods and pepperoni is a low-calorie food? Could there be errors in the data? Might the serving sizes be non-representative of what might be eaten at one time? Perhaps. In other cases, it might also be possible that a different clustering algorithm, a different measure of data distance, or a different number of clusters would allow a better interpretation of the data.

Cluster analysis is a powerful technique for exploring patterns of similarity and difference in samples or variables. It is considered to be an exploratory statistical technique. It requires considerable knowledge of the phenomenon the data represent to interpret the results. For applied statisticians, though, this is where data analysis really gets fun.

[The data for this analysis came from http://www.ast-ss.com/research/food/food_listing_all.asp. Values for calories, carbohydrates, proteins, and fats are contingent on serving size. Not all of the foods were included in the analysis because of missing data.]


The Data Dozen

Data can take a variety of forms. Some are readily amenable to statistical analysis and some are better suited to other methods of analysis. When you’re trying to solve some problem or research question, though, you need to use whatever is available that fits. Here are twelve types of data to think about using in your next analysis.

Data Type | Description | Generation | Examples
Automatic Measurements | Information generated by devices, usually electronic or mechanical, that operate without human involvement (other than calibration and sample introduction) | Experimenter-Device | Thermocouples, strain-gage scales, electronic meters
Manual Measurements | Information generated by devices that require human involvement to carry out the measurement | Experimenter-Device | Rulers, calipers, thermometers, balance-beam scales
Archived Records | Information generated by an identifiable person or organization | Known individual or organization | Government records, financial data, personal diaries, logs, notes
Directed Responses | Information received as the result of a specific direct inquiry | Experimenter-Subject | Surveys, focus groups, interrogations
Electronic Recordings | Information stored on audiovisual devices | Experimenter-Device | Videos, audio recordings, photos, false-color images
Metadata | Data about data: their origins, qualities, scales, and so on | Data Generator | Time, location, and method of data generation
Transformations | Information created from other information | Data Analyst | Percentages, sums, z-scores, ratios, and so on
Analog Data | Information from a source that resembles in some respect a phenomenon under investigation | Experimenter | Experimental lab animals, models
First Person Reports | Descriptive, qualitative information derived from a first-person encounter | Individual | Eyewitness accounts
Secondhand Reports | Information summarized or retold by a second party based on first-person accounts | Known individual or organization | News stories
Unverified Reports | Information, written or retold, which cannot be disproven or verified | Unknown individual or organization | Anecdotes, stories, legends
Conjectures | Information created from thought experiments rather than physical experiments | Known individual or organization | Expert opinions

Automatic and manual measurements are used commonly in statistical analysis when they can be generated in large numbers at reasonable costs. Furthermore, they are often measured on continuous, or at least quantitative, scales. These measurements are usually easy to reproduce but may be time or location dependent.

Data? I thought you said tuna.

Archive records are also used commonly in statistical analyses, usually as government records and financial data, when they are measured on quantitative scales. These data are often considered “official” because they have been verified even though they are not reproducible. Archive records may also provide qualitative information, usually in small amounts, such as personal diaries, logs, notes, and so on. These can be used to support statistical analyses and are a mainstay of scientific investigations.

Directed responses, information received as the result of specific questions, include the results of surveys and focus groups, which are commonly analyzed with statistics. Directed-response data are also generated by direct and cross examinations in court and by military and law enforcement interrogations. Because directed responses come from individuals, they may not always be true or consistent.

Two types of data that are used in almost all data analyses are metadata and transformations. Metadata are data about data, such as descriptions of their origins, qualities, scales, and so on. Transformations are data created from other data, which includes percentages, z-scores, sums, ratios, mathematical functions and so on (https://statswithcats.wordpress.com/2010/11/21/fifty-ways-to-fix-your-data/).
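As a quick illustration, here is a hedged sketch of a few such transformations in Python with pandas; the table, the column names, and the numbers are all hypothetical.

```python
# Common transformations (z-score, percentage of total, ratio) on made-up data.
import pandas as pd

df = pd.DataFrame({"site": ["A", "B", "C", "D"],
                   "iron": [42.0, 55.0, 61.0, 48.0],
                   "lead": [3.1, 2.4, 5.0, 4.5]})

df["iron_z"] = (df["iron"] - df["iron"].mean()) / df["iron"].std()  # z-score
df["iron_pct"] = 100 * df["iron"] / df["iron"].sum()                # percent of total
df["iron_lead_ratio"] = df["iron"] / df["lead"]                     # ratio
print(df)
```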

Analogs are data sources that substitute for the actual phenomenon of interest. Models are a type of analog as are animals used in medical experiments (much to their and my displeasure). Statistics is all about models (https://statswithcats.wordpress.com/2010/08/08/the-zen-of-modeling/), from basing test probabilities on the Normal distribution to creating regression models from data.

Electronic recordings, like videos and audio recordings, would seem to be a good type of data to analyze. Recordings have a great data density, though it can be laborious to extract individual data elements from the qualitative recording source. They can be faked, but so too can all the other types of data.

Reports come from witnesses. First person reports come from eyewitnesses. The information is typically descriptive and qualitative; it may be verifiable, but it typically isn’t reproducible and may not even be true. Secondhand reports are eyewitness reports that are summarized or retold by a second party, such as a news agency. Unverified reports (anecdotes, stories, and legends that may be written or retold) come from unknown sources and usually cannot be disproven or verified. Reports don’t often provide data elements for statistical analyses, but they may provide supporting evidence or metadata.

Finally, conjectures are data produced by experts through thought experiments rather than physical experiments. The Delphi process (http://en.wikipedia.org/wiki/Delphi_method) is a good example of the use of conjecture. Usually conjecture is used in situations in which data cannot be collected, such as forecasting the future.

Did you see that?

Data analysts use all these data types. Statisticians want to use data types that provide many observations so they can assess variability. Scientists and engineers may be satisfied with the results of a single, albeit well controlled, experiment. They are truly deterministic breeds. Courts want every piece of evidence to be attested to by an individual, whether an eyewitness or an expert witness. They want to be able to cross-examine witnesses. Historians don’t usually have eyewitnesses so they rely on reports, especially secondhand and even unverified reports. They’ll use whatever they can find.

Certainly, this classification is not the only way to look at data. For example, the U.S. legal system defines courtroom evidence as either:

  • Real—physical objects, like a weapon.
  • Demonstrative—illustrations of evidence, like a map of the crime scene.
  • Documentary—items that contain human language, like contracts and newspaper articles.
  • Testimonial—oral or written evidence from witnesses.

(http://people.howstuffworks.com/inadmissible-evidence1.htm). To be admissible in court, these types of evidence have to be relevant (i.e., prove or disprove a fact), material (i.e., essential to the case), and competent (i.e., proven to be reliable). Trial lawyers use witnesses to tell compelling stories that will keep judges and juries attentive, which non-testimony evidence may not. In contrast, noted scientist and lecturer Neil deGrasse Tyson counters that “In courts, eyewitness testimony is considered great evidence. In science it’s considered worthless.” But that’s not quite true if the observation can be witnessed by others, such as in the cases of astronomical observations and replicated experiments. UFO eyewitnesses don’t fare so well with scientists. Statisticians want more, though. Our analyses aren’t based on cause-and-effect; association works just fine. But whatever your perspective on data, be sure you understand the pluses and minuses of what you’re working with.


Statistics: a Remedy for Football Withdrawal

TOUCHDOWN!

One thing that makes sports so much fun to follow is the plethora of statistics associated with every player, every game, every team, and every season. Other than government agencies, you won’t find better sources of data to practice on. It’s a simple matter to go to the website of a professional sport and find some raw data that needs analyzing.

In football (the American kind) it is often said that good offense provides excitement but good defense wins games. Fans of the 2006 Indianapolis Colts probably wouldn’t agree. Ranked 3rd in offense but 21st of 32 teams in defense, the Colts had a regular season record of 12 wins and 4 losses and won the Super Bowl. Maybe they were an anomaly. So the question is: are teams that make the post-season playoffs better defensively than the rest of the league as the conventional wisdom claims?

Data for this analysis consisted of 26 variables (i.e., team performance statistics, such as number of plays, penalties, fumbles, 3rd and 4th down conversions, and time of possession) for the 32 NFL teams (thank you nfl.com). Having that many performance variables with comparatively few teams is a flag that factor analysis might be a useful way to proceed (https://statswithcats.wordpress.com/2010/08/27/the-right-tool-for-the-job/). Factor analysis (FA) is based on the concept that the variation in a set of variables can be rearranged and attributed to new variables, called factors. The use of factors instead of raw variables is sometimes preferable because factors are more efficient (i.e., fewer factors are needed to evaluate almost the same proportion of variability as the original variables).

FA requires some intuition to interpret. FA produces equations that define each factor in terms of the original variables:

F1 = a11x1 + a12x2 + a13x3 + … + a1nxn

F2 = a21x1 + a22x2 + a23x3 + … + a2nxn

⋮

Fm = am1x1 + am2x2 + am3x3 + … + amnxn

where:

F1 through Fm are the m factors that replace the original n variables

x1 through xn are the original variables

a11 through amn are the factor analysis weights that relate each original variable to each factor

m is always less than or equal to n, but is a lot less if you’re lucky.

What you have to do is look at the correlations between the original variables and the factors and guess what each factor might mean. It’s like being given a big box of parts—gears, transistors, tires, fabric, motors, pipes, wires, and lumber—and trying to figure out what they’re supposed to make. Some parts will be integral and others will be left over.
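Here is a hedged sketch of that workflow in Python with scikit-learn, using random stand-in numbers rather than the actual NFL statistics; the point is only the mechanics of extracting two factors and reading their loadings.

```python
# Factor analysis sketch on stand-in data shaped like 32 teams x 26 statistics.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 26))            # stand-in for 26 team-performance stats

Xz = StandardScaler().fit_transform(X)   # standardize so no variable dominates
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(Xz)            # one row of factor scores per team
loadings = fa.components_                # 2 x 26 matrix of factor weights

# Variables with large absolute loadings on a factor suggest what that factor "means."
for k, row in enumerate(loadings, start=1):
    top = np.argsort(np.abs(row))[::-1][:5]
    print(f"Factor {k}: most influential variable columns -> {top}")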

FA derived two factors from the 26 NFL statistics—an Offense Factor and a Defense Factor. No big surprise there, in fact, that’s what we were hoping for. Each factor accounts for about 20% of the total variation in the original variables. So, we’ve lost 60% of the information contained in the original 26 variables in exchange for the simplicity of having just two variables. That’s a good example of why FA is often referred to as a data reduction technique.

Two Factors that Summarize 26 Team Performance Statistics.

FA and the associated data reduction techniques of correspondence analysis and multidimensional scaling are like photographs. A photograph conveys only two of three spatial dimensions and usually includes no information about time, odors, sounds, temperature, or other circumstances, yet it still presents enough information so that observers can discern what is happening. So data reduction shouldn’t be taken as a pejorative descriptor. Sometimes simplifying a problem is the best way to solve it; at least that’s what William of Ockham thought. And after all, isn’t that what modeling is about?

Once the number of variables has been reduced to a manageable few factors, you can analyze patterns of relationships much more efficiently. Consider the scatter plot of how the 32 teams scored on the two factors and how far they got in the postseason. The two gray lines represent the averages of the Offense and Defense Factors. The Seattle Seahawks could be considered the average team of the 2006 season because they are located closest to the intersection of these two lines. Draw an imaginary line through the plot origin and the intersection of the lines (i.e., a 45° angle), and you’ll identify the most balanced teams, the teams with about the same scores for their Offense and Defense Factors. The most balanced teams from best to worst would be the Pittsburgh Steelers, the New York Giants, the Seattle Seahawks, the Tennessee Titans, the Cleveland Browns, and the Houston Texans. Of these, only the Giants and the Seahawks made the playoffs. So much for the importance of balance.

Factor Analysis of National Football League Teams.

[Note: There’s a reason why there are no values on the axes. Some readers who saw this graph were totally baffled by the numbers, so I took them out (https://statswithcats.wordpress.com/2011/01/16/ockham%E2%80%99s-spatula/). The units of the analysis were normalized and are meaningful only in relative terms. Both axes do have the same scale increments, however. A difference of 1 on the offense scale is analogous to a difference of 1 on the defense scale.]
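If you want to reproduce this kind of plot, here is a small matplotlib sketch with made-up factor scores; the gray average lines and the 45° balance line are drawn the same way regardless of the data.

```python
# Factor-score scatter plot with average lines and a 45-degree "balance" line.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
offense = rng.normal(size=32)        # stand-in Offense Factor scores
defense = rng.normal(size=32)        # stand-in Defense Factor scores

plt.scatter(offense, defense)
plt.axvline(offense.mean(), color="gray")   # average Offense Factor
plt.axhline(defense.mean(), color="gray")   # average Defense Factor
lims = [min(offense.min(), defense.min()), max(offense.max(), defense.max())]
plt.plot(lims, lims, linestyle="--")        # equal-offense-and-defense line
plt.xlabel("Offense Factor")
plt.ylabel("Defense Factor")
plt.show()
```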

The 2006 Super Bowl champion Colts had the highest score on the Offense Factor but the lowest score on the Defense Factor of any of the playoff teams. In fact, 63% of teams with an above average Offense Factor score made the playoffs compared to 44% of teams with an above average Defense Factor score. So, is the notion that good defense beats good offense wrong? Not necessarily; but it sure didn’t apply in 2006.

So remember, if there’s no NFL football in 2011 because of contractual problems, you can always fall back on statistics to fill the gap. Then again, there’s always sabermetrics …


Six Misconceptions about Statistics You May Get From Stats 101

When you learn new things, you can develop misconceptions. Maybe it’s the result of something you didn’t understand correctly. Maybe it’s the way the instructor explains something. Or maybe, it’s something unspoken, something you assume or infer from what was said. Here are six misconceptions about statistics you might have gotten from Stats 101.

Misconception 1: “Statistics is Math”

Yeah, we love stats, or math, or whatever it is.

How could you not come to believe this? Even before you took Stats 101, you learned you had to take the course to fulfill a math requirement. It was taught by the Math Department. Then when you took the course, it was all numbers. Homework and exams were almost all about calculations. Stat 101 was all math. Statistics must be all math too.

Reality

Statistics uses numbers but numbers are not the primary focus of statistics, at least to most practitioners. Applied statistics is a form of inductive reasoning that uses math as one of its tools. It also uses sorting for ranks, filtering for classification, and all kinds of graphics. The point of using statistics is to discover new knowledge and solve problems through the use of inductive reasoning involving numbers. It’s not just about doing calculations. That’s why it’s required for college majors in business, social sciences, and many other disciplines. That’s why it’s taught by professors in all those disciplines, too. Yes, it’s required for math degrees and is taught by math professors at many schools. That’s so there will be mathematical statisticians who will invent statistical tools for the applied statisticians to use. You can love statistics and be good at statistical thinking even if you think you hate math.

Misconception 2: “Statistics Requires a Lot of Data”

Stats 101 doesn’t teach you how to work with individual pieces of information, like a solitary measurement, or a picture, or eyewitness testimony. Statistics uses data, lots of data, the more data the better. The number of samples is a term in almost every equation. And anyway, that’s what the law of large numbers says, the more data the better the results.

Reality

The number of samples you really need for a statistical analysis is contingent on how much resolution you want. Think of the resolving power of a telescope or a microscope, or the number of pixels in a computer image. The greater the resolution, the more detail you’ll see. It’s the same way with statistics (https://statswithcats.wordpress.com/2010/07/17/purrfect-resolution/).

Hey, Here’s a bunch of data over here.

What’s more important than the number of data points is the quality of the data points. In statistics, the quality of a set of data points is how well the data points represent the population from which they are drawn. But representative data can be incredibly difficult to generate. How do you decide which registered voters are actually likely to vote in the next election? How do you decide who might use a product you might want to sell?

The number of samples is easy to determine. The quality of the samples is virtually impossible to determine. Nevertheless, what you should remember is that more data may be better but better data are always best.

Misconception 3: “Data are Dependable”

In Stats 101, you do a lot of number crunching. You use small datasets and big datasets, real data and fake data, but never were you told to delete data. You figured that data are like facts. You don’t delete them for any reason or you will bias your results.

Reality

Yes I double checked the data. Why do you ask?

Data are messy. Most newly generated datasets have errors, missing observations, and unrepresentative samples. Some population properties may be under-represented or over-represented. There may be samples that should not be included in the analysis, like replicates, QA samples, and metadata. All these problems with data require a lot of processing before an analysis can begin (https://statswithcats.wordpress.com/2010/10/17/the-data-scrub-3/). In fact, data scrubbing often consumes the majority of a project budget and schedule, but you have to do it anyway.

Misconception 4: “Statistics Provides Unique Solutions”

In all the problems your Stats 101 instructor solved in class, and all the homework assignments you did, and all the exams you took, there was only one “right answer” to a question. So, any statistical analysis should provide the same results no matter who does it.

Reality

Even if two statisticians start with identical data sets, they may not come to identical results and, sometimes, not even identical conclusions. This is because they may make different assumptions and scrub the data differently. Furthermore, there may be more than one way, even many ways, to approach a problem (https://statswithcats.wordpress.com/2010/08/22/the-five-pursuits-you-meet-in-statistics/). There may also be different statistical analysis techniques that can be used, or even different options within the same technique (https://statswithcats.wordpress.com/2010/08/27/the-right-tool-for-the-job/). It would probably be more surprising for two statisticians to calculate the same results from a dataset than for them to have some differences. Just like most problems in the real world, a statistical analysis may have more than one right answer.

Misconception 5: “Statistics Provides Unambiguous Results”

Results are either significant or they’re not. That’s pretty unambiguous.

Reality

That’s another way to look at it.

Statistical results are based on data and assumptions about the data. Change the number of samples and you change the resolution of the statistical procedure. Change the data or the assumptions and you change the estimates of variability. Change the resolution or the estimates of variability and you have different results. There is indeed uncertainty in uncertainty. Sometimes uncertainty brings with it ambiguity.
Is there really a difference between Type I error rates of 0.049 and 0.051? Many decision makers who never got past Stats 101 think so. But interpretations of these results are based on the assumptions and biases a statistician brings with him. One statistician might take a firm stance and say “significant” and another might say, “maybe not.” Results have uncertainty; interpretations have ambiguity, and decisions have risks. That’s statistics.

Misconception 6: “It’s Easy to Lie with Statistics”

With her identity protected, the witness told how she faked the survey results.

Darrell Huff wrote “How to Lie with Statistics” in 1954 (http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728/ref=pd_sim_b_2). Michael Wheeler wrote “Lies, Damn Lies, and Statistics: The Manipulation of Public Opinion in America” in 1976 (http://www.amazon.com/Lies-Damn-Statistics-Manipulation-Opinion/dp/0393331490/ref=sr_1_17?ie=UTF8&qid=1298231730&sr=8-17). John Allen Paulos wrote “Innumeracy: Mathematical Illiteracy and Its Consequences” in 1988 (http://www.amazon.com/Innumeracy-Mathematical-Illiteracy-Its-Consequences/dp/0809058405/ref=ntt_at_ep_dpi_1). Joel Best wrote “Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists” in 2001 (http://www.amazon.com/Damned-Lies-Statistics-Untangling-Politicians/dp/0520219783/ref=sr_1_3?ie=UTF8&qid=1298231253&sr=8-3).
So it must be pretty easy to lie with statistics since everybody is doing it.

Reality

It’s hard to do statistics right, but it’s also a lot of work to do them wrong. You have to collect data, crunch the numbers, and cook up your story, or perhaps more correctly, cook up your story, make up the data, and call the press conference. But if you’re going to mislead an audience, it’s much easier to use made-up facts, phony anecdotes, and illogical conjectures. So why do so many people, particularly politicians, even bother lying with statistics? It’s because numbers provide credibility. If you have little credibility yourself, using numbers can confer the illusion of expertise. And that is why people use statistics in the first place.


Consumer Guide to Statistics 101

But I don’t wanna take Stats 101.

Whether you took or are taking an introductory course on statistics, you probably didn’t get to choose from a dozen candidate offerings. You had to take the specific course required for your major. You can, though, evaluate what you got. Did you get your money’s worth from that introduction to statistics class? Here are a few things to think about.

Your Expectations

Why did you take Statistics 101? Was it a requirement for a degree? Many majors, especially for advanced degrees, require some statistics (https://statswithcats.wordpress.com/2010/06/08/why-do-i-have-to-take-statistics/). Was it a less frightening alternative to other courses? Statistics can be used as a substitute for calculus in some undergraduate programs and for a foreign language in some Ph.D. programs. Or, maybe you didn’t have any expectations other than to learn something new.

I took statistics thirty-five years ago when I was majoring in geology, which doesn’t have a lot of quantitative deterministic theories for describing natural processes. Even today, predicting earthquakes, landslides, volcanic eruptions, and other earth phenomena remains an elusive goal. I wanted to learn how to develop mathematical equations, models, to explain and predict phenomena. Regression analysis turned on the light bulb over my head. That’s how I got here.

Well, that’s not what I expected.

When you buy an expensive product at a store, you usually have some expectations of what you should get for your money. When you buy a new car, for instance, you may want it to look and handle a certain way. You may not voice your expectations, or even be able to describe what you want, but you do have expectations. Your Stats 101 course is similar. You paid a lot of tuition to take the course; you must have had some expectation, even subconscious, of what you would take from it. This is important because it sets a reference point for what you experience in the course. So ask yourself, did your course give you what you expected, and just as important, were your expectations reasonable?

Your Instructor

Think of the person who taught you in Statistics 101. How would you rate him or her on these four criteria?

  • Knowledge—Knowledge may be the first thing you think about when you think of a college professor, and for that reason, it’s probably the least important discriminator of instructor quality. They all have adequate knowledge, at least from your level of understanding. It’s what the instructors do with their knowledge that makes the differences.
  • Communication Skills—Being able to convey knowledge is a necessity for an instructor. Some instructors communicate information better than others, and unfortunately, some instructors do not communicate well at all. They may be inarticulate, have an accent or a speech defect, be speaking in a second language, or just not be able to explain difficult concepts or answer questions well.
  • Engagement—Instructors usually teach the same courses over and over again. Some instructors add content and try new descriptions with each class. Others use the repetition to become automatons, teaching the same content in the same way year after year, even to the point of reading their past lectures.
  • Empathy—Instructors have an obligation to teach certain content but they also should be sensitive to what their students want to learn and need to learn to further their careers. Empathetic instructors might try to tailor their teaching to the interests of their students, like citing examples from the disciplines of their majors. Oblivious instructors will teach about their latest interests, regardless of the applicability to the students.

The minimum requirements for an instructor are to know the subject and be able to communicate that knowledge. What separates the best instructors from the rest are their level of engagement and their empathy to the needs of the student. So ask yourself, did your instructor convey his excitement over what he was teaching? Did you leave the class curious about what else there might be to learn about statistics and about how you could use statistics yourself?

Number Crunching

This is more the kind of crunching I want to do.

Artists draw, chefs cook, and statisticians calculate, but they do so in many ways. When I was learning statistics, my choices for doing calculations were a very unfriendly mainframe computer, a hand calculator, or pencil and paper. This choice may be why there are so few old statisticians around today.

How your instructor had you calculate statistics says something not only about the times but also about his level of empathy. Here’s why:

  • Pencil and paper—No professional statistician calculates any serious statistics nowadays by hand, except perhaps on drink-stained cocktail napkins. Still, many instructors want their students to get the hands-on feel of number manipulation. That’s valid. If it goes beyond probabilities, descriptive statistics, and simple tests of hypotheses, your instructor is a sadist.
  • Calculator— No professional statistician calculates any serious statistics nowadays with a calculator, unless cocktail napkins are not available. If you plan on being an artist or a chef, manual calculations are fine. If you have any intentions of using statistics in your major, you need to learn software.
  • Spreadsheet Software—Spreadsheet software is probably the best choice for most students. It can be used to set up and edit datasets, calculate statistics, and prepare graphs. Plus, it’s relatively easy to use, and likely to be available to the student at school, work, and home.
  • Statistical Software—Statistical software can be another good choice, depending on the learning curve. Simple statistical software can allow students to concentrate on interpreting statistics instead of calculating them. Advanced statistical packages like SAS and SPSS, while necessary for advanced courses, are beyond what introductory students need to learn unless they plan a career in statistics. Because of the cost of these packages, they are not likely to be available to the student at work and at home.

    You do the R programming, I’ll do the puRRRRR part.

  • Programming—If you plan on a career in statistics, you will probably learn the R language or some other coding tool. If you learn it in Stat 101 and you are not a statistics major, you have to wonder what your instructor is thinking, if he is.

So, was your instructor thinking about your needs when he decided how the class would do calculations? If you can’t use his method of choice in the future, it’s kind of a wasted effort.

Concepts vs. Skills vs. Thinking

In designing your Stats 101 course, the instructor had to decide how to proportion class time between teaching concepts, skills, and statistical thinking.

Concepts are the whys of statistics. They are the reasons why statistics work. Examples include populations, probability, the law of large numbers, and the central limit theorem. Instructors tend to devise games and demonstrations to help students remember fundamental concepts. Learning statistical concepts is beneficial to statistics majors and non-majors, both in school and in later life.

Skills are the whats and hows of statistics. They involve calculations, like probabilities, descriptive statistics, and simple tests of hypotheses. Skills are learned by repetition. You learn them by doing the calculations in the homework assignments, at least the even-numbered problems. After Stats 101, skills like designing data matrices for a particular analysis are much more important than the calculations themselves, which are usually carried out by software.

Yeah, I got skillz.

Concepts and skills account for the majority of Stat 101 classes. This is perhaps unfortunate, for the greatest need in society is for people to understand statistical thinking. Statistical thinking involves understanding how to define a problem in light of some objective, what uncertainty and risk are and how they can be controlled, and the difference between significance and meaningfulness.

You want me to learn about this? Why?

So, did the things you learned in Statistics 101 mostly involve concepts, skills, or statistical thinking? What things were you able to take from the class and use in later life?

What Do You Think?

Now all of this ignores course content. That’s a BIG topic for another time. For now, think about what your introduction to statistics course was like. Was it what you expected? What would have made a better Stat 101 for you?


Dealing with Dilemmas

Suddenly, Mika screeched and the printer stopped …

A decade or so ago, I constantly feared, and was frequently the victim of, hardware and software problems. It was a logical consequence of a craftsman routinely pushing his tools way beyond the limits of their capabilities. But the software is far better now, and the hardware is cheap enough to allow extraordinary redundancy. It isn’t often that a problem goes away so completely with so little fanfare. But that’s not to say there are no technical problems that can cause major difficulties in a data analysis project. Here are three of the most common.

Inadequate Data

This problem seems to occur on every project in which the client is responsible for providing previously collected data. Data delivery might be late, incomplete, or in the wrong format. More times than I can count, clients have given me spreadsheets they used as a data table in a report—with footnotes, blank rows and columns, and all kinds of extraneous formatting—all the time thinking that the table was ready for statistical analysis. Those things happen, and in fact, should be anticipated and built into the project budget.

I’m done with the analysis and NOW you want to change the data?

The real problem is when the client provides data sets after you’ve started your analysis. Statistical analysis projects are pretty much once-and-done endeavors. They might be repeated yearly or at some other provocation, they may use some of the same data and be done by the same analysts, but each analysis is expected to have at least some new data and new results, and most importantly, a new budget and schedule. This point is lost on many clients.

Updated data sets usually just include new data the client generated since they gave you the original data set. If the new data can be merged into your working data set, that’s less of a problem and more of an annoyance if you only have to scrub the new data (https://statswithcats.wordpress.com/2010/10/17/the-data-scrub-3/). Usually, you have to at least look at the original data in light of the new data. It’s a real problem, though, when the updated data set includes or excludes subjects the client thought they weren’t going to use, or involves a modified database query, or provides recalibrated measurements, or worst of all, corrects a few random errors they noticed. Too many times I’ve asked about possible errors in a database only to get a corrected and updated, totally new data file. It’s back to square one for Sisyphus the statistician. Whose fault is that?

Updating incorrect data is a two-edged sword. It will improve your analysis, sometimes substantially, but you lose any analysis work you’ve already done. If a client corrects one error, how can you be sure there won’t be more corrections? I had one client who agreed to deliver a complete, error-free data set that I would then analyze within sixteen weeks. They delivered a table, which I reformatted and scrubbed, and identified errors. They sent an updated table with corrected data. I reformatted and scrubbed the new data set, only to be told they had new data they wanted included. So we had a meeting to redefine what data would be included in the analysis, which they promised to deliver the following week. Two weeks later, more data arrived, but not all the data we agreed to. More data were delivered a few weeks later. This continued for weeks until the final pieces were delivered just three days before the original project deadline. You can probably guess what happened. The client was outraged when I hadn’t completed the sixteen week analysis in the three days I had the complete data set.

Perhaps the worst problem with inadequate data is when data errors aren’t noticed until after you’ve pretty much completed the analysis. Your dilemma is telling your client, in a nice way, that they screwed up and there are consequences. If the analysis is small and you want to keep the client, grit your teeth, get the correct data, redo the analysis, and take the loss. But if the analysis is more complex and you’ve passed the point of easy return, you have to explain to the client that they have two options—let you finish the analysis with the data you have or pay for you to redo the analysis. Changing only a couple of numbers might not change their decision based on the results but it will change all the numbers presented in the report. So, if the client plans to release the results to adversarial reviewers, they need to understand their alternatives.

Unwelcome Results

Most of the time, your analysis will confirm what you and your client already suspect. No problem. Occasionally, you’ll reach some unexpected finding. Most clients don’t even mind this. They feel they got something new for their money. But there are two other kinds of findings that are problematical—complex and inconclusive results.

I’m sorry it didn’t work out like you wanted.

Exceedingly complex findings are difficult to communicate, especially to a non-technical audience. There’s only so much you can show in pie charts and bad graphs, even if you use cutesy icons of money bags and people. If the client doesn’t understand your findings, and especially the value of your findings, your work will never see the light of day. Likewise, if they don’t believe your results, the results will never be acted upon (https://statswithcats.wordpress.com/2011/01/16/ockham%E2%80%99s-spatula/). Even more troubling are inconclusive results. It’s difficult explaining to a client that you finished the work and spent all the money but didn’t reach any conclusive findings. Imagine how you might feel if your mechanic told you he couldn’t find or fix the problem with your car, but then charged you $500.

Unavailability of Key Staff

This happens on all projects not just statistical projects. Sometimes people get sick or resign and take new jobs. Sometimes, management reassigns your staff during lulls in the work, never to return to your project. There’s not much you can do to prevent these dilemmas. You just have to react quickly when the problem arises.

Faster, Better, Cheaper. Pick two. Get one.

Consultants always want to do a better job than their competitors, complete the job sooner, and charge less for their work. It never happens that way, though. Some consultants always do superior work, but they may take longer to achieve their vision of perfection. Some consultants pride themselves on being the lowest cost, but their work is often mediocre. Other consultants specialize in quick response, no matter what it takes.

It’s like college. Most students have to do “academic triage”—pick the courses they will excel in and coast through the rest. Nobody is good at everything but that’s what clients want and expect. Besides, you probably said in your proposal that you were faster, better, and cheaper. Now it’s time to deliver.

So is it best to be faster? Should you try to be better? Is being cheaper what clients want most? Consider this analogy. Say you hire a painter to paint the outside of your house. You tell him what you want done and agree to a price and a schedule. Then something goes wrong. Maybe you have to leave town, or the painter can’t get the paint you want, or it rains for two weeks straight. Suddenly the whole agreement is in upheaval. Now, fast-forward a few years. Do you remember that the job took a month longer because of the rain or cost more because the paint had to be special ordered? Maybe, but chances are you don’t think about it nearly as often as you think about the appearance of the chipping, bubbling paint caused by the poor application.

In general, the memory of poor quality lasts far longer than memories of missed schedules or overrun budgets. Quality, however, is a matter of opinion. It’s easy to tell when budgets and schedules are missed. So you have to try to balance all three. But if you find that you can’t be faster, better, and cheaper, you’ll have to do “management triage.” If there’s no money left in the budget, you may have to put in some free time even if it results in a delay. If you have an immovable deadline, get help even if you have to eat some costs. If you have no budget or schedule flexibility, stop where you are and package the deliverable with recommendations for the work you wanted to do but couldn’t finish. Maybe you’ll get lucky.

Picking between faster, better, and cheaper is both a technical and a business decision that is never pleasant. If you decide not to pick quality, beware of the long-term consequences. Whatever you decide to do, don’t wait to inform the client. Clients hate surprises. Confirmed bad news delivered late in a project is much worse than potential bad news delivered early in the project.


Limits of Confusion

Cat whiskers are like confidence intervals. They let the cat know how big its spread is.

A confidence interval is the numerical interval around the mean of a sample from a population that has a certain confidence of including the mean of the entire population. “Say what?” OK, let’s take it one point at a time.

Say you collect 30 water samples from a lake. Oh wait. That use of the word sample will be confusing to some people. A sample is a portion of a population, but the word can refer to an individual piece of a population or a collection of pieces of the population (https://statswithcats.wordpress.com/2010/07/03/it’s-all-greek/). It’s like the word fish—one fish, two fish, school of fish, and so on.

Anyway, say you collect 30 aliquots (i.e., samples) of water from the lake and analyze the aliquots for iron content. Then, you sum the 30 iron concentrations and divide by 30 to get the mean iron concentration of your collection of aliquots (i.e., your sample). But you don’t really care about the mean iron concentration of your sample, the collection of 30 aliquots. What you want to know is the average iron concentration of all the water in the lake. No problem. You can use the mean iron concentration of the 30 aliquots as an approximation of the mean iron concentration of the lake (the population).

Now, that would be fine for most people except for neurotic individuals who don’t understand the Central Limit Theorem. These persons have a couple of options. They can go back to the lake and collect 30 more aliquots of water (this is sometimes referred to as a working vacation if the collection of fish samples is also involved), then recalculate the mean, and see what they get. They can do the same thing again, and again, and again (referred to as a vacation if the consumption of beer and potato chips is involved, https://statswithcats.wordpress.com/2010/07/26/samples-and-potato-chips/) until they have enough means to say how variable the lake’s mean iron concentration might be. (Note: If the neurotic individuals can get someone else to pay for everything, they are called consultants. If the neurotic individuals can get everyone else to pay for everything, they are called politicians.)

For people who can’t afford to collect more samples of samples, there’s an alternative approach called resampling. It’s the computer equivalent of a cushy government contract for data collection. In a resampling approach, you would collect the 30 aliquots of lake water, analyze them for iron content, and calculate the mean of your sample. Then you would have specialized software randomly select a certain number of the original 30 samples (the process is called bootstrapping or jackknifing depending on how it’s done; feel free to google away) to create a new dataset, from which you could calculate a new mean iron concentration. Then do that again, and again, and again until you have enough means to say how variable the mean iron concentration is.
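Here is a minimal bootstrap sketch in Python, assuming 30 made-up iron concentrations; the resampled means stand in for all those return trips to the lake.

```python
# Bootstrap the mean of 30 hypothetical iron concentrations.
import numpy as np

rng = np.random.default_rng(42)
iron = rng.normal(loc=50, scale=10, size=30)   # stand-in for the 30 aliquots

# Resample the 30 values with replacement, many times, and keep each mean.
boot_means = np.array([rng.choice(iron, size=iron.size, replace=True).mean()
                       for _ in range(10_000)])

print("sample mean:", iron.mean().round(2))
print("bootstrap 99% interval:", np.percentile(boot_means, [0.5, 99.5]).round(2))
```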

A third alternative, which involves no fishing, less computer time, and as much beer as you need, is to calculate a confidence interval. First, calculate the mean and standard deviation of the 30 iron concentrations. Then calculate a confidence interval around the sample mean using the formula

Sample Mean ± t-value × (Sample Standard Deviation ÷ √Number of Samples)

In the lake example, the mean, standard deviation and number of samples would be calculated from the iron concentrations determined in the aliquots of lake water. The t-value would be calculated using software or selected from a table of values of the t-distribution on the basis of:

Degrees-of-freedom. The number of samples minus one. In this case, 30 water aliquots minus 1 equals 29.

Alpha. One minus the confidence level, divided by the number of limits you will calculate, in this case two because you want upper and lower limits. For 99% confidence, alpha is (1 − 0.99) ÷ 2 = 0.005.

The boundaries of a confidence interval are called the upper confidence limit and the lower confidence limit.

For example, if:

  • Mean iron concentration were 50
  • Standard deviation of iron concentration were 10
  • t-value for 29 degrees-of-freedom (based on 30 iron concentrations) and alpha of 0.005 (based on 99% confidence for a two-sided limit) were 2.756

the 99% lower confidence limit would be 44.97 (i.e., 50 − (2.756 × (10 ÷ √30))) and the 99% upper confidence limit would be 55.03 (i.e., 50 + (2.756 × (10 ÷ √30))).

You would have about 99% confidence that this interval would include the mean iron concentration of the lake.
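If you would rather let software find the t-value, here is a short Python sketch (scipy assumed) that reproduces the interval above from the same summary numbers.

```python
# t-based confidence interval for the lake example (mean 50, sd 10, n = 30).
import numpy as np
from scipy import stats

mean, sd, n = 50.0, 10.0, 30
confidence = 0.99

t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)  # about 2.76 for df = 29
margin = t_crit * sd / np.sqrt(n)

print(f"{confidence:.0%} CI: {mean - margin:.2f} to {mean + margin:.2f}")
# 99% CI: 44.97 to 55.03
```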

But what if you think 45 to 55 is too wide a range for the lake’s mean iron concentration? What can you do? You could go back to the lake and collect another 30 samples and try again. Better yet, you could go back to the lake and take 120 or even more samples (https://statswithcats.wordpress.com/2010/07/11/30-samples-standard-suggestion-or-superstition/), but that’s a lot of expensive work (er, vacation).

Look back at the formula for the confidence limits. The limits are calculated from the mean, the standard deviation, the number of samples, and the t-value. If you’re not going back to the lake, you can’t change the mean, the standard deviation, or the number of samples. That leaves the t-value. The t-value would be based on the degrees-of-freedom and the confidence. The degrees-of-freedom are determined from the number of samples, so that’s still no help. BUT, the choice of the confidence is yours.

Consider this. If you choose the confidence level to be:

99%, the confidence interval would be 44.97 to 55.03
95%, the confidence interval would be 46.27 to 53.73
90%, the confidence interval would be 46.90 to 53.10

Or for that matter,

50%, the confidence interval would be 48.75 to 51.25

although it wouldn’t be very useful if your interval only had a 50% chance of including the real mean iron concentration of the lake.

Consider the analogy of a nearsighted man playing a ring-toss game at a carnival. The location of the peg he will toss his ring at is like the mean of a population of possible measurements. The diameter of the peg is like the inherent variability of the population of measurements. The fuzziness with which he sees the peg because of his nearsightedness is like the additional variation associated with sampling, measurement, and environmental variability (https://statswithcats.wordpress.com/2010/08/01/there%E2%80%99s-something-about-variance/). The size of the ring he tosses is like the size of the confidence interval. If he wanted to be very confident that he could toss a ring over the peg, he would use a large ring to give him that confidence (i.e., the higher the confidence the larger the confidence interval).

The man cannot change the location and diameter of the peg (i.e., the population values are fixed). However, he would have a greater chance of success if he could see better (i.e., extraneous variation in the samples is controlled, https://statswithcats.wordpress.com/2010/09/05/the-heart-and-soul-of-variance-control/; https://statswithcats.wordpress.com/2010/09/19/it%E2%80%99s-all-in-the-technique/) or if he could use a very large ring (i.e., a relatively wide confidence interval). If the ring (the confidence interval) becomes too large, though, the game becomes meaningless. Thus, there must be some limits on how large the ring should be.

Obsidian in a 90% confidence drawer.

By convention, most statistical inferences, including confidence intervals, use a 95% confidence level. Sometimes either a 90% level (resulting in a smaller confidence interval) or a 99% level (resulting in a larger interval) is used. A 90% level would be more appropriate when the consequences of not including the true population value in the interval are relatively minor. Confirmatory inferences, on the other hand, often use a 99% confidence level. When in doubt, use 95%.

Some people dislike putting confidence limits around means they calculate. Limits show how imprecise data, and statistics calculated from them, actually are. But if you are going to make an informed decision, you have to know not just the evidence, but the reliability of the evidence as well. Maybe that’s why lawyers hate to have statisticians sitting in the jury pool.


A Picture Worth 140,000 Words

This data analysis stuff is hard.

Even if it’s been a while since your last statistics class, when you read Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis you’ll figure out that there’s much more to data analysis than just calculating a few averages and creating bar charts. Data analysis is definitely not easy. Even so, there are more political pollsters than ever and baseball announcers still talk endlessly about statistics between pitches.

Most of the things you’ll read about in Stats with Cats were never mentioned in your Statistics 101 class. You’ll have to know about these things, though, if you want to analyze your own data at home and at work. It may look formidable at first, at second, and at third. But like baseball, if data analysis weren’t hard, everybody would do it, because it’s a lot of fun.

Stats with Cats has 140,000 words on 376 7×10-inch pages divided into 25 chapters in 6 parts, with 47 figures, 24 tables, 107 quotes, and 99 photos of cats. If all that information could be distilled into one picture, this is what it would look like:

You use statistics because you can; you have the knowledge and software is readily available. You use statistics because you need to, to analyze uncertainty, especially when there are too many data to just make a graph. You use statistics when you have to, such as when the problem can’t be solved any other way or when regulations mandate their use.

As a data analyst, you have to know many things, not just about statistics and the project background, but also about the project’s contract, scope, schedule, budget, and deliverables. You have to communicate effectively, both in speech and in writing, and establish good working relationships with project stakeholders. You have to decide on a performance strategy, ensure you get paid, and never compromise your ethics. Finally, you have to have the expertise and time to do the work, and above all, you have to practice, practice, practice.

Data analysis begins when you want to investigate some phenomenon that occurs in a definable population. You collect samples of the population using an appropriate sampling scheme and other measures to control variance and avoid biases so that you will meet your targets for precision and accuracy. You may need to collect more (or less) than thirty samples to meet the resolution you need for the analysis. You measure variables relevant to the phenomenon on appropriate scales. These measurements are the data, which along with the metadata, form the information you structure in a file format your software can recognize as a matrix. Your objectives and aims for model use, together with the scales and natures of your variables, enable you to select appropriate statistical methods. You scrub the information and do an initial analysis, which together with the objectives and methods, lead to your model specifications. Using the specs, you go through the steps of the modeling process to develop and calibrate a model, and evaluate possible violations of assumptions. From the model, you build on your knowledge of the phenomenon. Eventually, from critical analysis through statistics, you can synthesize the wisdom you need to make informed decisions.
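
As a rough illustration of that flow (and not an example from the book), here’s a compact sketch in Python: read a data matrix, scrub it, look at the descriptive statistics, fit a simple model from the specification, and check the residuals. The file name and variable names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data file: one row per sample, one column per measured variable
    data = pd.read_csv("samples.csv")

    # Scrub: drop incomplete records and screen out implausible values
    data = data.dropna()
    data = data[data["concentration"] > 0]

    # Initial analysis: descriptive statistics help shape the model specification
    print(data.describe())

    # Develop and calibrate a simple model from the specs
    model = smf.ols("concentration ~ depth + temperature", data=data).fit()
    print(model.summary())

    # Evaluate possible violations of assumptions by examining the residuals
    print(model.resid.describe())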

The book needs more pictures of MEEEE!

The last three paragraphs describe the contents of Stats with Cats in about 350 words, without the cats of course. See what a big difference they make?

Over time, as you analyze different datasets, you’ll become more comfortable with the process. You’ll learn shortcuts to doing things. You’ll develop an instinct for things that will work and things that won’t. You’ll even be able to impress your friends and co-workers with all the new jargon you’ve learned. You might also learn a bit about yourself. Are you more of a right-brained, intuitive, visual, big-picture, inductive thinker, or are you more of a left-brained, analytical, verbal, detail-oriented, deductive thinker? Understanding your own preferred thought processes will help you find your best paths in life as well as in data analysis.

Being able to analyze data is the asset that sets the knowledge-wielding experts apart from the arm-waving storytellers. Don’t wait for your boss or teacher to send you out into the many unmarked routes of the databahn. Journey to the land of data analysis at your own speed along paths you’re comfortable with. Don’t just endure a data analysis project. Make the journey as fulfilling as the arrival. Make data analysis your passion.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


Ockham’s Spatula

OK, now how do I get down from here?

Model building is like climbing a mountain. It’s what you spend so much time planning for. It’s what everybody wants to talk about. It’s what gives you that euphoric feeling of accomplishment when you’re finished. But just as mountain climbers have to descend, model builders have to deploy. You have to put your model in a form that will be palatable to users.

I had a client, a very skilled engineer, who wanted a model to predict how many workers he would need to hire during the year. His company produced three lines of products, most of which were customized for individual customers. A few years earlier, he had gone to great effort and expense to develop a model to predict his manpower needs. He collected data on how many of each type of product he had produced over the past five years and from that data had his managers estimate how long it took to make each product and complete the most common customizations. Then he had his sales force estimate the number of orders they expected the following year. He reasoned that adding up the time it took to produce a product multiplied by the number of expected orders would give him the number of man-hours he would need. It was a classic bottom-up modeling approach.

The model had a problem, though. It didn’t work. Even after tinkering with the manufacturing times and correcting for employee leave, administrative functions, and inefficiency, the model still wasn’t very accurate. Moreover, it took his administrative assistant several weeks each year to collect the projected sales data to input into the model. Some of the sales force estimated more sales than they actually expected in order to impress the boss. Others estimated fewer sales so that they would have a better chance of making whatever goal might be given to them. A few avoided giving the administrative assistant any forecasts at all, so she just used numbers from the previous year.

Using a statistical modeling approach, I found that his historical staffing was highly correlated with just one factor: the number of units of one of the products he had produced in the prior year. It made sense to me. His historical staffing levels were appropriate because he had hired staff as he needed them, albeit somewhat after his backlog reached a crisis. His business had also been growing at a fairly steady rate. So long as conditions in his market did not change, predicting future staffing needs was straightforward. He didn’t need to rely on projections from his psychologically fragile sales force.
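
A sketch of that kind of model, with made-up numbers standing in for the client’s data: regress historical staffing on the prior year’s unit count for the one predictive product line, then forecast from it.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical history: units of the key product made in the prior year,
    # and the staffing level that turned out to be needed the next year
    prior_year_units = np.array([120, 135, 150, 170, 185, 205])
    staff_needed = np.array([14, 16, 17, 20, 21, 24])

    X = sm.add_constant(prior_year_units)   # intercept plus one predictor
    fit = sm.OLS(staff_needed, X).fit()
    print(fit.summary())

    # Forecast next year's staffing from this year's (hypothetical) unit count
    forecast = fit.predict([1, 220])
    print(f"Forecast staffing: {forecast[0]:.1f}")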

Simple is best.

But my model proved to be quite unsettling to many. The manager of the product line that was used as the basis of the model claimed the model proved his division merited a greater share of corporate resources, and bigger bonuses for him and his staff. Managers of the two product lines that were not included in the model claimed the model was too simplistic because it ignored their contributions.

At that point, the client had a complex model that he liked but didn’t work and a simple model that worked but nobody liked. He probably would have continued to use the complex model if it didn’t take so much work to gather the input data. Valid or not, the simple model had no credibility with his managers. He could calculate a forecast with the model but was reluctant to favor the model over the intuitions of the managers. So given his two flawed alternatives, the client decided to move manpower forecasting to the back burner until the next crisis would again bring it to a boil.

I wish I could say that this was an isolated case, but it’s more the rule than the exception, especially with technically oriented clients who are most comfortable working from the bottom details up to the prediction.

A model may look good but not be an adequate representation of a phenomenon.

I once developed a model for a client to predict the relative risks associated with real estate they managed. The managers wanted a quick-and-dirty way to set priorities for conducting more thorough risk evaluations of the properties. I based my model on information that would be readily available to the client. They could evaluate a property for a few hundred dollars and decide in a day or two whether further evaluation was needed immediately or whether it could be deferred. When the model-development project was done, the model was turned over to the operations group for implementation. The first thing the operations manager did was invite “experts” he worked with to refine the model. Very quickly, the refinements became expansions. The model went from quick and dirty to comprehensive and protracted. On average, it took the operations group six months and $50,000 to evaluate each property. The priorities set by the refined model were virtually identical to the priorities set by the quick-and-dirty model.

Was one of these models good and the other bad? Not exactly; there’s an important distinction to be made. Statisticians, and for that matter scientists, engineers, and many other professionals, are taught that, all else being equal, simple is best. It’s Ockham’s razor. A simple model that predicts the same answers as a more complicated model should be considered better. It’s more efficient. But sometimes you, as the statistician, have to be more flexible.

Statisticians, like cats, have to be flexible.

The operations manager wasn’t comfortable with a simple model. He needed to be confident in the results, which, for him, required adding every theoretical possibility his experts could think of. He didn’t want to ignore any sources of risk, even if they were rare or unlikely. That made for a very inefficient model, but if you don’t have confidence in a model and don’t use it, it’s not the tool you need.

These cases illustrate how there’s more to modeling than just the technical details. There are also artistic and psychological aspects to be mastered. Textbooks describe statistical methods to find the best model components, but not necessarily the ones that will work for the model’s users. Sometimes you have to be flexible. Think of Ockham’s razor as more of a spatula than a cleaver.

A model is only as successful as the use to which it is put.

Like sausages, models need to look good on the outside, especially if there are things on the inside that might make most users choke. You have to package the model. First, it can’t look so intimidating that users break out in a sweat when they see it. Leave the equations to the technical reviewers; hide them from the naive users. USDA inspectors have to look inside sausages, but you don’t. Second, put the model in a form that can be used easily. That inch-thick report may be great documentation, but it’ll garner more dust than users. If your users are familiar with Excel, program the model as a spreadsheet. If you know a computer language, put the model in a standalone application. A model is only as successful as the use to which it is put.
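
For example, a regression model like the staffing forecast above could be hidden behind a single function or a tiny script, so users only ever see one number in and one number out. The coefficients here are hypothetical placeholders, not anyone’s real model.

    import sys

    # Hypothetical fitted coefficients from a staffing regression
    INTERCEPT = 1.8
    SLOPE = 0.104

    def forecast_staff(prior_year_units: float) -> float:
        """Return the forecast staffing level for a prior-year unit count."""
        return INTERCEPT + SLOPE * prior_year_units

    if __name__ == "__main__":
        units = float(sys.argv[1])   # e.g., python forecast.py 220
        print(f"Forecast staffing: {forecast_staff(units):.1f}")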

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


Grasping at Flaws

Even if you’re not a statistician, you may one day find yourself in the position of reviewing a statistical analysis that was done by someone else. It may be an associate, someone who works for you, or even a competitor. Don’t panic. Critiquing someone else’s work has got to be one of the easiest jobs in the world. After all, your boss does it all the time (https://statswithcats.wordpress.com/2010/11/14/you-can-lead-a-boss-to-data-but-you-can%e2%80%99t-make-him-think/). Doing it in a constructive manner is another story.

Don’t expect to find a major flaw in a multivariate analysis of variance, or a neural network, or a factor analysis. Look for the simple and fundamental errors of logic and performance. That’s probably what you are best suited for, and it will be most useful to the report writer who can no longer see the statistical forest for the numerical trees.

I feel like I’m being watched.


So here’s the deal. I’ll give you some bulletproof leads on what criticisms to level at that statistical report you’re reading. In exchange, you must promise to be gracious, forgiving, understanding, and, above all, constructive in your remarks. If you don’t, you will be forever cursed to receive the same manner of comments that you dish out.

With that said, here are some things to look for.

The Red-Face Test

Start with an overall look at the calculations and findings. Not infrequently, there is a glaring error that is invisible to all the poor folks who have been living with the analysis 24/7 for the last several months. The error is usually simple, obvious once detected, very embarrassing, and enough to send them back to their computers. Look for the following (a quick sanity-check sketch follows the list):

  • Wrong number of samples. Either samples were unintentionally omitted or replicates were included when they shouldn’t have been.
  • Unreasonable means. Calculated means look too high or low, sometimes by a lot. The cause may be a mistaken data entry, an incorrect calculation, or an untreated outlier.
  • Nonsensical conclusions. A stated conclusion seems counterintuitive or unlikely given known conditions. This may be caused by a lost sign on a correlation or regression coefficient, a misinterpreted test probability, or an inappropriate statistical design or analysis.
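
A few lines of code can catch most of these before the review meeting. Here’s a minimal sketch; the file name and column names are hypothetical.

    import pandas as pd

    # Hypothetical dataset behind the report being reviewed
    data = pd.read_csv("report_data.csv")

    # Wrong number of samples: compare the row count with what the report claims
    print("Number of samples:", len(data))
    print("Possible replicates:", data.duplicated(subset="sample_id").sum())

    # Unreasonable means: scan the summary statistics for values that look too high or low
    print(data.describe())

    # Nonsensical conclusions often trace back to a flipped sign -- check key correlations
    print(data[["dose", "response"]].corr())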

Nobody Expects the Sample Inquisition

Next, scrutinize the samples. If you can cast doubt on the representativeness of the samples, everything else done after that doesn’t matter. If you are reviewing a product from a mathematically trained statistician, the samples are probably the only place to look for difficulties. There are a few reasons for this. First, a statistician may not be familiar with some of the technical complexities of sampling the medium or population being investigated. Second, he or she may have been handed the dataset with little or no explanation of the methods used to generate the data. Third, he or she will probably get everything else right. Focus on what the data analyst knows the least about.


Data Alone Do Not an Analysis Make

Calculations

Unless you see the report writer counting on his or her fingers, don’t worry about the calculations being correct. There’s so much good statistical software available that getting the calculations right shouldn’t be a problem (https://statswithcats.wordpress.com/2010/06/27/weapons-of-math-production/). It should be sufficient to simply verify that he or she used tested statistical software. Likewise, don’t bother asking for the data unless you plan to redo the analysis. You won’t be able to get much out of a quick look at a database, especially if it is large. Even if you redo the analysis, you may not make the same decisions about outliers and other data issues, which will lead to slightly different results (https://statswithcats.wordpress.com/2010/10/17/the-data-scrub-3/). Waste your time on other things.

Descriptive Statistics

Descriptive statistics are usually the first place you might notice something amiss in a dataset. Be sure the report provides means, variances, minimums, maximums, and numbers of samples. Anything else is gravy. Look for obvious data problems, like a minimum that’s way too low or a maximum that’s way too high. Be sure the sample sizes are correct. Watch out for the analysis that claims to have a large number of samples but also has a large number of grouping factors. The total number of samples might be sufficient, but the number in each group may be too small to be analyzed reliably.
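
A sketch of that check, with hypothetical column names: summarize the response by group, so a healthy overall sample size can’t hide groups that are too small.

    import pandas as pd

    data = pd.read_csv("report_data.csv")   # hypothetical dataset

    # Counts, means, variances, and extremes per group, not just overall
    summary = data.groupby("treatment")["response"].agg(
        ["count", "mean", "var", "min", "max"]
    )
    print(summary)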

Correlations

You might be provided a matrix with dozens of correlation coefficients (https://statswithcats.wordpress.com/2010/11/28/secrets-of-good-correlations/). For any correlation that is important to the analysis in the report, be sure you get a t-test to determine whether the correlation coefficient is different from zero, and a plot of the two correlated variables to verify that the relationship between the two variables is linear and there are no outliers.
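
Both requests are cheap to satisfy. In Python, for example, scipy’s pearsonr returns the coefficient and the p-value for the test that it differs from zero, and a scatter plot covers the rest; the variable names here are hypothetical.

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.stats import pearsonr

    data = pd.read_csv("report_data.csv")    # hypothetical dataset
    x, y = data["iron"], data["manganese"]   # hypothetical pair of variables

    r, p = pearsonr(x, y)    # r and the p-value for the test that r differs from zero
    print(f"r = {r:.2f}, p = {p:.4f}")

    plt.scatter(x, y)        # check for a linear relationship and outliers
    plt.xlabel("iron")
    plt.ylabel("manganese")
    plt.show()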

Regression

Regression models are one of the most popular types of statistical analyses conducted by non-statisticians, so there are usually quite a few areas that can be critiqued; the notes below on statistical tests, assumptions, and graphics all apply.
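
When a regression is central to the report, a quick refit with standard diagnostics, along the lines of this hypothetical sketch, shows whether the coefficients, p-values, and residuals actually support the stated conclusions.

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    data = pd.read_csv("report_data.csv")                     # hypothetical dataset
    fit = smf.ols("response ~ dose + age", data=data).fit()   # hypothetical model

    print(fit.summary())   # coefficients, signs, p-values, R-squared

    # Residuals vs. fitted values: curvature or a funnel shape signals trouble
    plt.scatter(fit.fittedvalues, fit.resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()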

Statistical Tests

Statistical tests are often done by report writers with no notion of what they mean. Look for some description of the test’s null hypothesis (the assumption the test is trying to disprove). It doesn’t matter if it is in words or mathematical shorthand. Does it make sense? For example, if the analysis is trying to prove that a pharmaceutical is effective, the null hypothesis should be that the pharmaceutical is not effective. After that, look for the test statistics and probabilities. If you don’t understand what they mean, just be sure they were reported. If you want to take it to the next step, look for violations of statistical assumptions (https://statswithcats.wordpress.com/2010/10/03/assuming-the-worst/).
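
For instance, here’s a hypothetical two-sample t-test of that pharmaceutical example: the null hypothesis is that the drug is not effective (the treated and control means are equal), and the test statistic and probability are what the report should show.

    import numpy as np
    from scipy import stats

    # Hypothetical trial outcomes; the null hypothesis is "no treatment effect"
    rng = np.random.default_rng(7)
    control = rng.normal(loc=100, scale=15, size=40)
    treated = rng.normal(loc=92, scale=15, size=40)

    t_stat, p_value = stats.ttest_ind(treated, control)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # report both, not just "significant"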

Analysis of Variance

An ANOVA is like a testosterone-induced, steroid-driven, rampaging horde of statistical tests. There are many, many ways the analysis can be misspecified, miscalculated, misinterpreted, and misapplied. You’ll probably never find most kinds of ANOVA flaws unless you’re a professional statistician, so stick with the simple stuff.

A good ANOVA will include the traditional ANOVA summary table, an analysis of deviations from assumptions, and a power analysis. You hardly ever get the last two items. Not getting the ANOVA table in one form or another is cause for suspicion. It might mean there was something amiss in the analysis, or that the data analyst didn’t know the table should be included.

If the ANOVA design doesn’t have the same number of samples in each cell, the design is termed unbalanced. That’s not a fatal flaw, but violations of assumptions are more serious for unbalanced designs.

If the sample sizes are very small, only large differences can be detected in the means of the parameter being investigated. In this case, be suspicious of finding no significant differences when there should be some.
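
Two of those checks are easy to run yourself if you have the data: the ANOVA summary table and a power calculation showing how large a difference the sample sizes could actually detect. This sketch assumes hypothetical file and column names.

    import pandas as pd
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm
    from statsmodels.stats.power import FTestAnovaPower

    data = pd.read_csv("report_data.csv")   # hypothetical dataset

    # Traditional ANOVA summary table
    fit = ols("response ~ C(treatment)", data=data).fit()
    print(anova_lm(fit))

    # Power analysis: the smallest effect size detectable with 80% power
    effect = FTestAnovaPower().solve_power(
        effect_size=None,
        nobs=len(data),                         # total sample size
        alpha=0.05,
        power=0.80,
        k_groups=data["treatment"].nunique(),
    )
    print(f"Smallest detectable effect size (Cohen's f): {effect:.2f}")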

Assumptions Giveth and Assumptions Taketh Away

Statistical models usually make at least four assumptions: the model is linear, and the errors (residuals) from the model are independent, Normally distributed, and have the same variance for all groups. A first-class analysis will include some mention of violations of assumptions. Violating an assumption does not necessarily invalidate a model, but it may require that some caveats be placed on the results.

The independence assumption is the most critical. This is usually addressed by using some form of randomization to select samples. If you’re dealing with spatial or temporal data, you probably have a problem unless some additional steps were taken to compensate for autocorrelation.

Equality of variances is a bit more tricky. There are tests to evaluate this assumption, but they may not have been cited by the report writer. Here’s a rule of thumb. If the largest variance in an ANOVA group or regression level is twice as big as the smallest variance, you might have a problem. If the difference is a factor of five or more, you definitely have a problem.
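
The rule of thumb is easy to check directly, and Levene’s test gives a more formal backstop. Group and column names here are hypothetical.

    import pandas as pd
    from scipy.stats import levene

    data = pd.read_csv("report_data.csv")   # hypothetical dataset

    # Rule of thumb: compare the largest and smallest group variances
    variances = data.groupby("treatment")["response"].var()
    ratio = variances.max() / variances.min()
    print(f"Largest-to-smallest variance ratio: {ratio:.1f}")   # ~2 is a warning, ~5 is a problem

    # Levene's test of equal variances (less sensitive to non-Normality than Bartlett's)
    groups = [g["response"].values for _, g in data.groupby("treatment")]
    stat, p = levene(*groups)
    print(f"Levene statistic = {stat:.2f}, p = {p:.4f}")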

The Normality of the residuals may be important, although it is sometimes afforded too much attention. The most serious problems are associated with sample distributions that are truncated on one side. If the analysis used a one-sided statistical test on the same side as the truncated end of the distribution, you have a problem. Distributions that are too peaked or too flat can result in slightly higher rates of false-negative or false-positive tests, but it would be hard to tell without a closer look than just a review.

Look at a few scatter plots of correlations with the dependent variable, then forget the linearity assumption. It’s most likely not an issue. If the report goes into nonlinear models, you’re probably in over your head.

We’re Gonna Need a Bigger Report

Statistical Graphics

There are scores of ways that data analysts mislead their readers and themselves with graphs (https://statswithcats.wordpress.com/2010/09/26/it-was-professor-plot-in-the-diagram-with-a-graph/). Here’s the first hint. If most of the results appear as pie charts or bar graphs, you’re probably dealing with a statistical novice. These charts are simple and commonly used, but they are notorious for distorting reality. Also, check the scales of the axes to be sure they’re reasonable for displaying the data across the graphic. If comparisons are being made between graphics, the scales of the graphics should be the same. Make sure everything is labeled appropriately.

Maps

As with graphs, there are so many things that can make a map invalid that critiquing them is almost no challenge at all. Start by making sure the basics—north arrow, coordinates, scale, contours, and legend—are correct and appropriate for the information being depicted. Compare extreme data points with their depiction. Most interpolation algorithms smooth the data, so the contours won’t necessarily honor individual points. But if a contour and a nearby datum are too different, some correction may be needed. Check the actual locations of data points to ensure that contours don’t extend (too far) into areas with no samples. Be sure the northing and easting scales are identical, which is easy to check if physical features are overlaid on the map. Finally, step back and look for contour artifacts. These generally appear as sharp bends or long parallel lines, but they may take other forms.

Documentation

I’m sorry. I ate your documentation.

It’s always handy in a review to say that not all of the documentation was included. But let’s be realistic. Even an average statistical analysis can generate a couple of inches of paper. A good statistician will provide what’s relevant to the final results. If you’re not going to look at it, probably no one else will either. Again, waste your time on other things. On the other hand, if you really need some information that was omitted, you can’t be faulted for making the comment.

You’ve Got Nothing

If, after reading the report cover-to-cover, you can’t find anything to comment on, you can sit back and relax. Just make sure you haven’t also missed a fatal flaw (https://statswithcats.wordpress.com/2010/11/07/ten-fatal-flaws-in-data-analysis/).

If you’re the suspicious sort, though, there is another thing you can try. This ploy requires some acting skills. Tell the data analyst/report writer that you are concerned that the samples may not fairly represent the population being analyzed.

Expressing concern over the representativeness of a sample is like questioning whether a nuclear power plant is safe. No matter how much you try, there is no absolute certainty. Even experienced statisticians will gasp at the implications of a comment concerning the sample not being representative of the population. That one problem could undermine everything they’ve done.

Here’s what to look for in a response. If the statistician explains the measures that were used to ensure representativeness, prevent bias, and minimize extraneous variation, the sample is probably all right. If the statistician mumbles about not being able to tell if the sample is representative and talks only about the numbers and not about the population, there may be a problem. If the statistician ignores the comment or tries to dismiss it with a stream of meaningless generalities and unintelligible jargon (https://statswithcats.wordpress.com/2010/07/03/it%e2%80%99s-all-greek/), there is a problem and the statistician probably knows it. If he or she won’t look you in the eyes, you’ve definitely got something. If you get an open-mouth, big-eye vacant stare, he or she knows less about statistics than you do. Be gentle!

Now It’s Up to You

So that’s my quick-and-dirty guide to critiquing statistical analyses. Sure, there’s a lot more to it, but you should be able to find something in these tips that you can apply to almost any statistical report you have to review. At a minimum, you should be able to provide at least some constructive feedback that will benefit both the writer and the report. Maybe you’ll even be able to prevent a catastrophe. If nothing else, you’ll have earned your day’s pay and, if you critique constructively, the respect of the report writer as well.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.
