Grasping at Flaws

Even if you’re not a statistician, you may one day find yourself in the position of reviewing a statistical analysis that was done by someone else. It may be an associate, someone who works for you, or even a competitor. Don’t panic. Critiquing someone else’s work has got to be one of the easiest jobs in the world. After all, your boss does it all the time (https://statswithcats.wordpress.com/2010/11/14/you-can-lead-a-boss-to-data-but-you-can%e2%80%99t-make-him-think/). Doing it in a constructive manner is another story.

Don’t expect to find a major flaw in a multivariate analysis of variance, or a neural network, or a factor analysis. Look for the simple and fundamental errors of logic and performance. It’s probably what you are best suited for and will be most useful to the report writer who can no longer see the statistical forest through the numerical trees.

I feel like I’m being watched.

 

So here’s the deal. I’ll give you some bulletproof leads on what criticisms to level at that statistical report you’re reading. In exchange, you must promise to be gracious, forgiving, understanding, and, above all, constructive in your remarks. If you don’t, you will be forever cursed to receive the same manner of comments that you dish out.

With that said, here are some things to look for.

The Red-Face Test

Start with an overall look at the calculations and findings. Not infrequently, there is a glaring error that is invisible to all the poor folks who have been living with the analysis 24/7 for the last several months. The error is usually simple, obvious once detected, very embarrassing, and enough to send them back to their computers. Look for:

  • Wrong number of samples. Either samples were unintentionally omitted or replicates were included when they shouldn’t have been.
  • Unreasonable means. Calculated means look too high or low, sometimes by a lot. The cause may be a mistaken data entry, an incorrect calculation, or an untreated outlier.
  • Nonsensical conclusions. A stated conclusion seems counterintuitive or unlikely given known conditions. This may be caused by a lost sign on a correlation or regression coefficient, a misinterpreted test probability, or an inappropriate statistical design or analysis.
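Red-face checks like these can be automated in a few lines. Here’s a minimal sketch in Python (the data, the expected sample count, and the suspect value are all invented for illustration):

```python
import numpy as np

# Hypothetical measurements; in practice, load the dataset from the report.
data = np.array([4.1, 3.9, 4.3, 4.0, 4.2, 40.2])  # 40.2 is a suspect entry

n_expected = 6  # sample count stated in the report
assert data.size == n_expected, "wrong number of samples"

# A mean far from the median often signals a data-entry error or outlier.
print(f"mean = {data.mean():.2f}, median = {np.median(data):.2f}")

# Flag possible outliers with the usual 1.5 * IQR fences.
q1, q3 = np.percentile(data, [25, 75])
fence = 1.5 * (q3 - q1)
outliers = data[(data < q1 - fence) | (data > q3 + fence)]
print("possible outliers:", outliers)
```

Here the one wild value drags the mean far above the median and trips the IQR fence, which is exactly the kind of glaring-but-invisible error this test is meant to catch.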

Nobody Expects the Sample Inquisition

Start with the samples. If you can cast doubt on the representativeness of the samples, everything else done after that doesn’t matter. If you are reviewing a product from a mathematically trained statistician, probably the only place to look for difficulties is in the samples. There are a few reasons for this. First, a statistician may not be familiar with some of the technical complexities of sampling the medium or population being investigated. Second, he or she may have been handed the dataset with little or no explanation of the methods used to generate the data. Third, he or she will probably get everything else right. Focus on what the data analyst knows the least about.

 

Data Alone Do Not an Analysis Make

Calculations

Unless you see the report writer counting on his or her fingers, don’t worry about the calculations being correct. There’s so much good statistical software available that getting the calculations right shouldn’t be a problem (https://statswithcats.wordpress.com/2010/06/27/weapons-of-math-production/). It should be sufficient to simply verify that he or she used tested statistical software. Likewise, don’t bother asking for the data unless you plan to redo the analysis. You won’t be able to get much out of a quick look at a database, especially if it is large. Even if you redo the analysis, you may not make the same decisions about outliers and other data issues, which will lead to slightly different results (https://statswithcats.wordpress.com/2010/10/17/the-data-scrub-3/). Waste your time on other things.

Descriptive Statistics

Descriptive statistics are usually the first place you might notice something amiss in a dataset. Be sure the report provides means, variances, minimums, maximums, and numbers of samples. Anything else is gravy. Look for obvious data problems like a minimum that’s way too low or a maximum that’s way too high. Be sure the sample sizes are correct. Watch out for the analysis that claims to have a large number of samples but also a large number of grouping factors. The total number of samples might be sufficient, but the number in each group may be too small to be analyzed reliably.
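Those per-group checks are easy to script. A minimal sketch in Python (the values, group labels, and minimum-size threshold are invented for illustration):

```python
import numpy as np

# Hypothetical data: a measured value and a grouping factor (names invented).
values = np.array([5.1, 4.8, 5.3, 61.0, 5.0, 4.9, 5.2, 5.1])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

MIN_GROUP_SIZE = 5  # illustrative threshold; real limits depend on the analysis

for g in np.unique(groups):
    x = values[groups == g]
    print(f"group {g}: n={x.size}, mean={x.mean():.2f}, "
          f"var={x.var(ddof=1):.2f}, min={x.min()}, max={x.max()}")
    if x.size < MIN_GROUP_SIZE:
        print(f"  warning: group {g} may be too small to analyze reliably")
```

Note how the one absurd maximum in group A inflates its mean, and how both groups fall below the size threshold even though the total sample count looks respectable.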

Correlations

You might be provided a matrix with dozens of correlation coefficients (https://statswithcats.wordpress.com/2010/11/28/secrets-of-good-correlations/). For any correlation that is important to the analysis, be sure you get two things: a t-test to determine whether the correlation coefficient is different from zero, and a plot of the two variables to verify that the relationship between them is linear and free of outliers.
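Both pieces are a few lines with scipy. A sketch using simulated data standing in for two variables from the report:

```python
import numpy as np
from scipy import stats

# Simulated data standing in for two variables from the report.
rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)  # a genuinely linear relationship

r, p = stats.pearsonr(x, y)  # p tests H0: the true correlation is zero
print(f"r = {r:.3f}, p = {p:.3g}")

# The t-test behind that p-value, computed by hand:
n = x.size
t = r * np.sqrt((n - 2) / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)

# A scatter plot of x against y (e.g., with matplotlib) should confirm the
# relationship is linear and free of outliers before you trust r at all.
```

The hand-computed t-test agrees with the p-value scipy reports, which is the number you should expect to see quoted for any correlation the report leans on.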

Regression

Regression models are one of the most popular types of statistical analyses conducted by non-statisticians. Needless to say, there are usually quite a few areas that can be critiqued. Here are probably the most common errors.

Statistical Tests

Statistical tests are often done by report writers with no notion of what they mean. Look for some description of the null hypothesis (the assumption the test is trying to disprove) for the test. It doesn’t matter if it is in words or mathematical shorthand. Does it make sense? For example, if the analysis is trying to prove that a pharmaceutical is effective, the null hypothesis should be that the pharmaceutical is not effective. After that, look for the test statistics and probabilities. If you don’t understand what they mean, just be sure they were reported. If you want to take it to the next step, look for violations of statistical assumptions (https://statswithcats.wordpress.com/2010/10/03/assuming-the-worst/).
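The pharmaceutical example can be sketched concretely. In this hedged illustration the trial data are simulated, with a built-in treatment effect, just to show what the null hypothesis, test statistic, and probability look like together:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical trial: H0 is that the drug is NOT effective (equal means).
placebo = rng.normal(loc=50.0, scale=10.0, size=50)
treated = rng.normal(loc=60.0, scale=10.0, size=50)

t_stat, p_value = stats.ttest_ind(treated, placebo, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
if p_value < 0.05:
    print("reject H0: the treatment effect is statistically significant")
```

If a report quotes a conclusion like "the drug works" but you can’t find the t (or F, or chi-square) statistic and its probability anywhere, that’s your comment.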

Analysis of Variance

An ANOVA is like a testosterone-induced, steroid-driven, rampaging horde of statistical tests. There are many, many ways the analysis can be misspecified, miscalculated, misinterpreted, and misapplied. You’ll probably never find most kinds of ANOVA flaws unless you’re a professional statistician, so stick with the simple stuff.

A good ANOVA will include the traditional ANOVA summary table, an analysis of deviations from assumptions, and a power analysis. You hardly ever get the last two items. Not getting the ANOVA table in one form or another is cause for suspicion. It might be that there was something wrong with the analysis, or that the data analyst didn’t know the table should be included.

If the ANOVA design doesn’t have the same number of samples in each cell, the design is termed unbalanced. That’s not a fatal flaw but violations of assumptions are more serious for unbalanced designs.

If the sample sizes are very small, only large differences can be detected in the means of the parameter being investigated. In this case, be suspicious of finding no significant differences when there should be some.
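Checking balance and running a simple one-way ANOVA takes only a few lines. A sketch with invented cell data (the group labels and values are made up):

```python
import numpy as np
from scipy import stats

# Hypothetical one-way layout; the group labels and values are invented.
cells = {
    "low":    np.array([10.1, 9.8, 10.4, 10.0]),
    "medium": np.array([10.6, 10.9, 10.3, 10.7]),
    "high":   np.array([11.8, 12.1, 11.6]),  # one sample short: unbalanced
}

sizes = {name: x.size for name, x in cells.items()}
balanced = len(set(sizes.values())) == 1
print(f"cell sizes: {sizes} ({'balanced' if balanced else 'unbalanced'})")

f_stat, p_value = stats.f_oneway(*cells.values())
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")
```

Counting the samples in each cell is exactly the kind of simple check a reviewer can do without redoing the whole analysis.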

Assumptions Giveth and Assumptions Taketh Away

Statistical models usually make at least four assumptions: the model is linear, and the errors (residuals) from the model are independent, Normally distributed, and have the same variance for all groups. A first-class analysis will include some mention of violations of assumptions. Violating an assumption does not necessarily invalidate a model but may require that some caveats be placed on the results.

The independence assumption is the most critical. This is usually addressed by using some form of randomization to select samples. If you’re dealing with spatial or temporal data, you probably have a problem unless some additional steps were taken to compensate for autocorrelation.

Equality of variances is a bit trickier. There are tests to evaluate this assumption, but they may not have been cited by the report writer. Here’s a rule of thumb: if the largest variance in an ANOVA group or regression level is twice as big as the smallest variance, you might have a problem. If the difference is a factor of five or more, you definitely have a problem.
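The rule of thumb is trivial to apply once you have the group variances. A sketch (the variance figures here are invented to illustrate the thresholds):

```python
# The variance figures here are invented to illustrate the rule of thumb.
group_variances = {"A": 2.1, "B": 3.8, "C": 11.5}

ratio = max(group_variances.values()) / min(group_variances.values())
if ratio >= 5:
    verdict = "definitely a problem"
elif ratio >= 2:
    verdict = "might be a problem"
else:
    verdict = "variances look comparable"
print(f"max/min variance ratio = {ratio:.1f}: {verdict}")
```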

The Normality of the residuals may be important although it is sometimes afforded too much attention. The most serious problems are associated with sample distributions that are truncated on one side. If the analysis used a one-sided statistical test on the same side as the truncated end of the distribution, you have a problem. Distributions that are too peaked or flat can result in slightly higher rates of false negative or false positive tests but it would be hard to tell without a closer look than just a review.
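A quick way to see the truncated-residuals problem is with simulated one-sided residuals. In this sketch the residuals are drawn from an exponential distribution (no negative values, so truncated on the left), which a Normality test flags immediately:

```python
import numpy as np
from scipy import stats

# Simulated residuals truncated on one side (exponential: no negative values),
# the situation the text flags as most serious for one-sided tests.
rng = np.random.default_rng(1)
residuals = rng.exponential(size=200)

w, p = stats.shapiro(residuals)  # H0: residuals are Normally distributed
skewness = stats.skew(residuals)
print(f"Shapiro-Wilk p = {p:.3g}, skewness = {skewness:.2f}")
if p < 0.05:
    print("Normality assumption is questionable; check which tail was tested")
```

The large positive skewness is the tell: if the report then ran a one-sided test on that same truncated side, you’ve found your comment.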

Look at a few scatter plots of correlations with the dependent variable, then forget the linearity assumption. It’s most likely not an issue. If the report goes into nonlinear models, you’re probably in over your head.

We’re Gonna Need a Bigger Report

Statistical Graphics

There are scores of ways that data analysts mislead their readers and themselves with graphs (https://statswithcats.wordpress.com/2010/09/26/it-was-professor-plot-in-the-diagram-with-a-graph/). Here’s the first hint. If most of the results appear as pie charts or bar graphs, you’re probably dealing with a statistical novice. These charts are simple and commonly used, but they are notorious for distorting reality. Also, check the scales of the axes to be sure they’re reasonable for displaying the data across the graphic. If comparisons are being made between graphics, the scales of the graphics should be the same. Make sure everything is labeled appropriately.

Maps

As with graphs, there are so many things that can make a map invalid that critiquing them is almost no challenge at all. Start by making sure the basics—north arrow, coordinates, scale, contours, and legend—are correct and appropriate for the information being depicted. Compare extreme data points with their depiction. Most interpolation algorithms smooth the data, so the contours won’t necessarily honor individual points. But if the contour and a nearby datum are too different, some correction may be needed. Check the actual locations of data points to ensure that contours don’t extend (too far) into areas with no samples. Be sure the northing and easting scales are identical, easily done if there is an overlay of some physical features. Finally, step back and look for contour artifacts. These generally appear as sharp bends or long parallel lines, but they may take other forms.

Documentation

I’m sorry. I ate your documentation.

It’s always handy in a review to say that all the documentation was not included. But let’s be realistic. Even an average statistical analysis can generate a couple of inches of paper. A good statistician will provide what’s relevant to the final results. If you’re not going to look at it, probably no one else will either. Again, waste your time on other things. On the other hand, if you really need some information that was omitted, you can’t be faulted for making the comment.

You’ve Got Nothing

If, after reading the report cover-to-cover, you can’t find anything to comment on, you can sit back and relax. Just make sure you haven’t also missed a fatal flaw (https://statswithcats.wordpress.com/2010/11/07/ten-fatal-flaws-in-data-analysis/).

If you’re the suspicious sort, though, there is another thing you can try. This ploy requires some acting skills. Tell the data analyst/report writer that you are concerned that the samples may not fairly represent the population being analyzed.

Expressing concern over the representativeness of a sample is like questioning whether a nuclear power plant is safe. No matter how much you try, there is no absolute certainty. Even experienced statisticians will gasp at the implications of a comment concerning the sample not being representative of the population. That one problem could undermine everything they’ve done.

Here’s what to look for in a response. If the statistician explains the measures that were used to ensure representativeness, prevent bias, and minimize extraneous variation, the sample is probably all right. If the statistician mumbles about not being able to tell if the sample is representative and talks only about the numbers and not about the population, there may be a problem. If the statistician ignores the comment or tries to dismiss it with a stream of meaningless generalities and unintelligible jargon (https://statswithcats.wordpress.com/2010/07/03/it%e2%80%99s-all-greek/), there is a problem and the statistician probably knows it. If he or she won’t look you in the eyes, you’ve definitely got something. If you get an open-mouth, big-eye vacant stare, he or she knows less about statistics than you do. Be gentle!

Now It’s Up to You

So that’s my quick-and-dirty guide to critiquing statistical analyses. Sure there’s a lot more to it, but you should be able to find something in these tips that you could apply to almost any statistical report you have to review. At a minimum, you should be able to provide at least some constructive feedback that will benefit both the writer and the report. Maybe you’ll even be able to prevent a catastrophe. If nothing else, you’ll have earned your day’s pay, and if you critique constructively, the respect of the report writer as well.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


Stats With Cats Blog: 2010 in review

The stats helper monkeys at WordPress.com mulled over how this blog did in 2010, and here’s a high level summary of its overall blog health:

Healthy blog!

The Blog-Health-o-Meter™ reads Wow.

Crunchy numbers


The average container ship can carry about 4,500 containers. This blog was viewed about 19,000 times in 2010. If each view were a shipping container, your blog would have filled about 4 fully loaded ships.

In 2010, there were 31 new posts, not bad for the first year! There were 98 pictures uploaded, taking up a total of 36 MB. That’s about 2 pictures per week.

The busiest day of the year was November 8th with 4,691 views. The most popular post that day was Ten Fatal Flaws in Data Analysis.

Where did they come from?

The top referring sites in 2010 were reddit.com, mail.live.com, mail.yahoo.com, facebook.com, and Google Reader.

Some visitors came searching, mostly for stats with cats, cats, why take statistics, statswithcats, and stats and cats.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

  1. Ten Fatal Flaws in Data Analysis, November 2010 (9 comments and 2 Likes on WordPress.com)
  2. Try This At Home, June 2010 (1 comment)
  3. 30 Samples. Standard, Suggestion, or Superstition?, July 2010 (4 comments)
  4. The Right Tool for the Job, August 2010 (1 comment)
  5. The Five Pursuits You Meet in Statistics, August 2010 (3 comments)


Live Long and Publish

How I Finished My Book in Only a Decade

If you want to write a book, you just need to get a round tuit.

Do you have a half-written book in your desk drawer at home? How about a file bulging with outlines and ideas you’re storing until you have the time? I’ve had those for decades, still do. In a few weeks, though, I’ll have published my book, Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis (http://www.wheatmark.com/merchant2/merchant.mvc?Screen=PROD&Store_Code=BS&Product_Code=9781604944723). Stats with Cats is an attempt to help people who have some training in statistics to apply their skills outside of the classroom. Maybe in time, I’ll be able to refer to it as my first book.

Writing

I started writing what would become Stats with Cats in the conventional way. I identified who I thought my audience would be. I created detailed objectives and outlines. And, I collected the scores of articles on statistical topics I had written over the years that I thought I could use as seeds for the book. At that time, the book was about using statistics to solve environmental problems.

No. No. No. This all has to be rewritten.

By the time I was through, I had restructured the book twice, thrown out most of what I had written in the past, rewritten every chapter at least three or four times, and edited sections more times than I wanted to count. I must have revised the first fifty pages twenty times. I had done everything I could do to it but finish.

When I look back at the book I planned to write at the beginning, I’m glad I took the time to let my writing mature and transform (https://statswithcats.wordpress.com/2010/05/29/stats-with-cats-whats-inside/).

Publishing

Back in the 1980s when I started thinking seriously about writing a book, I followed the traditional advice. I took classes. I talked to agents. I wrote cover letters and book proposals, but I never made a sale. Then technology changed the rules of the game. Advances in tools for book design and printing gave rise to Print On Demand (POD) publishing. POD allows publishers to order small runs of a book, thus eliminating the need for warehousing. This, in turn, opened the market to small publishing ventures and brought the cost of book publishing within the reach of aspiring authors.

Like any business, book publishing involves controlling risks. In the traditional business model, the big publishing houses take most of the business risks. They select the authors and the books that are published. They fund the book editing and design, printing, warehousing, marketing, and fulfillment (i.e., providing the books and managing the money). Sometimes they even advance money to authors to write the book. The publisher controls all aspects of an author’s book—its size, its price, the publishing schedule, and sometimes even the title and contents. They reap the bulk of any profits, leaving the author with only a few percent of the revenues. But, the author assumes almost no risk. If not a single book is sold, authors lose nothing other than their investment of their own time.

POD has liberated aspiring authors from the tyranny of the big publishing houses. Authors can take all or some of the risks and retain more of the control and rewards by self-publishing. Typically, a self-publishing author will pay a POD publisher to edit and design the book, obtain copyrights and registrations, arrange for printing and fulfillment, and handle all the money. Self-publishing authors usually retain the copyrights, receive more than 10% royalties on books sold, and can influence if not dictate almost anything, right down to fonts and the price of the book. Last year, close to 200,000 books were published in the U.S., 80% of which were self-published. That number is expected to increase in the future (Don Harold, BookWhirl.com).

Aren’t you done yet?

For my book I searched the internet and in just a few minutes identified a score of POD publishers, including Trafford, Xlibris, iUniverse, Lulu, Dog Ear, Wheatmark, and quite a few others. I filled in an online form and later emailed the part of the book I had completed so they could send me a book proposal. I selected Wheatmark, signed the contract, and paid the fee in only a few days.

My book was expensive, several thousand dollars. That’s because it is 140,000 words on 374 7×10-inch pages with 47 figures, 24 tables, and 99 photos of cats. I figure it cost me about $25 per graphic. So here’s a hint—publishing your book will cost hundreds instead of thousands of dollars if you don’t include graphics. But what’s non-fiction without pictures? I had to do it. Don’t even think about interior color, though, unless you’re going to publish a fifteen-page children’s book. It’s absurdly expensive.

Marketing

Publishing is just the middle step in completing a book. You have to market it so people will know it’s there. Most aspiring authors probably don’t think much about marketing their book. Most successful self-publishers think about marketing a LOT.

My marketing plan includes:

  • A description of the book along with some promotional text, taglines, and pictures I use in advertising, and a list of features and benefits of the book that show how the book is valuable and unique.
  • A description of the book’s audiences, their relative size, how they might find out about the book, where they might purchase the book, and the probability they might purchase the book. From this, I selected target groups to focus my marketing on.
  • A list of companion and competitor books including year of publication, price, size and number of pages, publisher, and Amazon sales ranking. This information helped me set a price for Stats with Cats well below the cost of books used in introductory statistics classes.
  • A list of websites where I plan to post announcements of the book’s availability, such as alumni groups and social networking sites, and possible venues for press releases and paid advertisements.

My first marketing effort was a blog, which I started in June, eight months before Stats with Cats will be published. My blog is at https://statswithcats.wordpress.com/ and is linked to my accounts on Facebook and LinkedIn. I also have a Facebook group for Stats with Cats. Every Sunday I post an excerpt from the book, which I also then post to reddit.com (i.e., the /statistics, /matheducation, and /learnmath subreddits), scribd.com, digg.com, and stumbleupon.com. Since November, I’ve been averaging about 100 views per day. Hopefully, this trend will increase substantially once the book begins shipping.

Lessons Learned

If I get to publish another book, I know a few things I would do differently. This is the advice I would offer to aspiring authors:

  1. Be sure you understand why you want to publish a book—Decide what’s important. Are you looking to stimulate your career or business? Keep the price low, even give the book away. The book is a means to the end. Are you looking to make money? Be sure you have a good marketing plan. In any case, have a measurable goal whether it is books sold, blog followers, or new business attributable to the book.
  2. Define your audience in terms of marketing—Statisticians talk about populations all the time. It’s fundamental to what we do. But there’s a concept called phantom populations, a group of subjects that have no practical commonalities. For example, it would make no sense to say the audience for your book is people who wear red shirts. Define your audience in terms of how you will get the attention of potential buyers. In statistics, this is called a frame. If you are writing a children’s book, for instance, your audience is not five-year-olds. What do they know? They can’t even read. Your real audience is parents and relatives who will buy the book for their five-year-old. Eighty percent of book purchases are given as gifts.
  3. Let your book evolve if it needs to—Don’t get too enamored with titles and outlines. Your perspective may change while you are writing the book. Be adaptable. Don’t be afraid to throw stuff away. Defer rewriting until you’ve had a chance to forget what you wrote (this turns out to be quite easy for people my age). Look at your writing with fresh eyes. And don’t just review your writing once. Keep rewriting so that each time you make fewer and more minor changes. Eventually, you won’t be able to change anything to make it better, only different. It’ll be like Fonzie combing his hair in the restroom mirror. Finally, know when to stop making changes. If you’re not sure when that may be, your publisher will tell you. It’s when they charge you extra for any changes you make.

Don’t be afraid of failure, the experience alone is worth the effort. Anything you complete will empower you to more and greater successes. All you have to do is start the journey and take a small step forward from time to time until you arrive at your goal. Good luck!

Any questions?

You can read a longer version of this blog at
http://www.scribd.com/doc/45815415/Live-Long-and-Publish.



The Santa Claus Strategy

I’ve been very good this year. I don’t know why the humans call me Mischief.

I’m working all out
Deadline is near
Model’s in doubt
Dooming my career.
Sta-tis-tics will chill my meltdown.

I’m adding new vars
Testing them twice
Trying to find out which ones’ll suffice
Sta-tis-tics will give the lowdown.

I see the best predictors.
I know what steps come next
I clean up my dataset and
Regress my y on my x.

Ohhhhh!
My work is all through
My deadline was met
My client paid up
Now I’m out of debt.
Sta-tis-tics helped thwart my shutdown.

Sing to the tune of “Santa Claus Is Coming to Town”

Make a list. Check it twice. That’s sage advice from an old fat guy with a beard. Here’s what that means if you’re analyzing data.

What a Phenomenal Concept

The first step in assembling a set of variables for your analysis is to identify the concepts or aspects of the phenomenon you want to investigate. By concepts, I mean to include hypotheses and theories as well as ideas, suppositions, beliefs, assertions, and premises, which may be less definitive or accepted. These concepts will come from the relationships known and supposed about the phenomenon. The reason for doing this is that concepts can be multifaceted and linked to other concepts, creating a framework of relationships underlying the phenomenon. In traditional research, this is what a literature search is for. Literature searches, though, are considered by some to be an academic activity not applicable to analyses done on the job. Not true. The process of thinking through what you want to measure is necessary.

Once you have specific ideas you want to explore, identify ways they could be measured. Start with conventional measures, the ones everyone would recognize and know how you determined. Then consider whether there are any other ways to measure the concept directly. From there, establish whether there are any indirect measures or surrogates that could be used in lieu of a direct measurement. Finally, if there are no other options, explore whether it would be feasible to develop a new measure based on theory. Keep in mind that developing a new measure or a new scale of measurement is more difficult for the experimenter and less understandable for reviewers than using an established measure.

On a Scale of ½ to VIII

Of the possible measures you identify, select scales of measurement and consider how difficult it would be to generate the data. For example:

  • Qualities are usually more difficult to measure accurately and consistently than quantities because more complex judgments are involved.
  • Counts are straightforward when they involve simple judgments as to what to count. Some judgments, such as species counts, can be relatively complex because you have to be able to identify the species before you can count it. Counts have no decimals and no negative numbers.
  • Amounts are usually more difficult to measure than counts because the judgment process is more complex. Amounts have decimals but no negative numbers unless losses are admissible.
  • Ratio measures, such as concentrations, rates, and percentages, are usually more difficult to measure than amounts because they involve two or more amounts. Ratio measures have both decimals and negative numbers.
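Scale rules like "counts have no decimals and no negative numbers" make handy validity checks. A minimal sketch in Python (the count data are invented, with two deliberately invalid entries):

```python
import numpy as np

# Hypothetical species counts; two invalid entries have slipped in.
counts = np.array([12.0, 0.0, 7.0, 5.5, -2.0])

is_whole = np.mod(counts, 1) == 0
is_nonnegative = counts >= 0
bad = counts[~(is_whole & is_nonnegative)]
print("invalid count values:", bad)  # counts can't have decimals or be negative
```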

Once you know what you might measure, evaluate the sources of measurement variability (benchmark, process, and judgment described in https://statswithcats.wordpress.com/2010/09/12/the-measure-of-a-measure/) in each measure.

Finally, take into account your objective and the ultimate use of your statistics (https://statswithcats.wordpress.com/2010/08/22/the-five-pursuits-you-meet-in-statistics/). For example, if you want to predict some dependent variable, quantitative independent variables would usually be preferable to qualitative variables because they would provide more scale resolution. Furthermore, you could dumb down a quantitative variable you measured to a less finely divided scale or even a qualitative scale. You usually can’t go in the other direction. If you want your prediction model to be simple and inexpensive to use, don’t select predictors that are expensive and time-consuming to measure.
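Dumbing down a quantitative variable to a qualitative scale is a one-liner. A sketch (the ages, cut points, and category labels are invented for illustration):

```python
import numpy as np

# Hypothetical ages; the cut points and labels are invented for illustration.
ages = np.array([23, 35, 41, 58, 62, 19, 47])

edges = [30, 50]  # quantitative scale collapsed into ordered categories
labels = np.array(["young", "middle", "older"])
categories = labels[np.digitize(ages, edges)]
print(list(categories))
```

Going the other direction, from "middle" back to an exact age, is impossible, which is why you should record the finer scale when you can.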

Consider building some redundancy into your variables if there is more than one way to measure a concept. Sometimes one variable will display a higher correlation with your model’s dependent variable, or help explain analogous measurements in a related measure. For example, redundant measures are often included in opinion surveys by using differently worded questions to solicit the same information. One question might ask “Did you like [something]?” and a later question might ask “Would you recommend [something] to your friends?” or “Would you use [something] again in the future?” to assess consistency in a respondent’s opinion about a product. Redundant variables can be a good check on data quality (https://statswithcats.wordpress.com/2010/09/19/it%E2%80%99s-all-in-the-technique/).
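The consistency check on redundant survey items is just a correlation between the two sets of responses. A sketch with invented 1-to-5 ratings (a rank correlation suits the ordinal scale):

```python
import numpy as np
from scipy import stats

# Hypothetical 1-to-5 responses to two differently worded versions of
# the same question ("liked it" vs. "would recommend it").
liked     = np.array([5, 4, 4, 2, 5, 3, 1, 4, 5, 2])
recommend = np.array([5, 4, 3, 2, 4, 3, 1, 5, 5, 1])

r, _ = stats.spearmanr(liked, recommend)
print(f"consistency between redundant questions (Spearman r) = {r:.2f}")
if r < 0.5:
    print("low consistency: check data quality or question wording")
```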

The Santa Claus Strategy

So make a list and check it twice.

Here’s a checklist you can use to help you think about your variables. Complete a checklist for each variable you plan to record. This may seem like a formidable amount of work, but it’s worth the effort. The checklist will help you think about your measurements, visualize how they will be generated, and ultimately produce results with less bias and variability. The checklists also provide concise documentation that can be added to a report appendix or project file. Furthermore, if you work with the same data often, you’ll find that completing such a checklist becomes much easier once you have thought through the process the first time. If this checklist doesn’t meet your needs, use it as a starting point to create your own. The important point is to think about what you plan to do.


Also, remember that this isn’t a once-and-done process. Be sure to revisit your thought process periodically throughout your analysis. It’ll help keep you on track.



You’re Off to Be a Wizard

It’s naptime. Nobody gets to see the Wizard. Not nobody, not nohow!

The process of developing a statistical model (https://statswithcats.wordpress.com/2010/12/04/many-paths-lead-to-models/) involves finding the mathematical equation of a line, curve, or other pattern that faithfully represents the data with the least amount of error (i.e., variability). Variability and pattern are the yin and yang of models. They are opposites yet they are intertwined. Improve the fit of the model’s pattern to the pattern of the data, and you’ll reduce the variability in the model and vice versa. It’s wizardry.

Follow the Modeling Code

Say you have a conceptual model (https://statswithcats.wordpress.com/2010/12/12/the-seeds-of-a-model/) with a dependent variable (y) and one or more independent variables (x1 through xn) in the fear-provoking-yet-oh-so-convenient mathematical shorthand:

y = a0 + a1x1 + a2x2 + a3x3 + … + anxn + e

Estimating values for the model’s parameters (i.e., a0 through an) and the model’s uncertainty (i.e., the e) so that the model is the best fit for the data with the least imprecision is a process called calibrating or fitting a model. Every statistical method has criteria that the procedure uses to calculate the parameters of the best model given the variables, data, and statistical options you specify. Your job is to specify those variables, data, and statistical options.

This is how it works:

  1. You collect data that represent the y and the xs for each of the samples.
  2. You make sure the data are correct and appropriate for the phenomenon and put the values in a dataset.
  3. Using the software for the statistical procedure you selected, you specify the dependent variable, the independent variables, and any statistical option you want to use. Every statistical procedure has a variety of options that can be specified. If you’re doing a factor analysis, for instance, you can try different extraction techniques, different communalities, different numbers of factors, and so on. If you’re a statistician, you know what I mean. If you’re not a statistician, don’t worry about this.
  4. Magic happens. This is what you learn about if you major in statistics.
  5. You evaluate the output from the software and, if all is well, you record the parameters and the error, and you have a calibrated statistical model. If the model fit isn’t what you would like, which is what usually happens, you make changes and try again.
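As a sketch of steps 1 through 5, here’s what the “magic” of ordinary least squares might look like on a hypothetical dataset (the numbers and variable names are made up purely for illustration):

```python
import numpy as np

# Steps 1-2: hypothetical data for a dependent variable (y) and two
# independent variables (x1, x2) measured on 8 samples.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y  = np.array([4.1, 3.9, 8.2, 7.8, 12.1, 11.9, 16.2, 15.8])

# Step 3: the design matrix gets a column of 1s for the intercept (a0)
# plus a column for each independent variable.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Step 4: "magic happens" -- least squares finds the a's that
# minimize the sum of squared errors.
coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# Step 5: the residuals are the model's errors (the e term); if they
# are too large, change the variables or options and try again.
residuals = y - X @ coefs
```

The same five steps apply whether the fitting is done by numpy, a statistics package, or a spreadsheet; only the amount of diagnostic output changes.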

Consider your choices wisely.

What changes could you make? Here are a few hints. If you are well acquainted with statistics, you can try making adjustments to the variables and the statistical options, and perhaps even the data, to see how the different combinations affect the model. For example, you can try including or excluding influential observations, filling in missing data, changing the variables in the model, or breaking down the analysis by some grouping factor (https://statswithcats.wordpress.com/2010/11/21/fifty-ways-to-fix-your-data/). If you are well acquainted with the data but not statistics, you might rely more on your intuition than your computations. Look for differences between the different models as well as between the results and your own expectations based on established theory.

Models and Variables and Samples, Oh My

If you specify only one way that you want to combine the variables, data, and statistical options, the statistical method will give you the best model. However, if you specify more than one combination of independent variables, you have to have some criteria for selecting which of the models to use as your final model and then decide how good the model is. The three most commonly used criteria are the coefficient of determination, the standard error of estimate, and the F-test.

  • Coefficient of Determination—also called R2 or R-square, is the square of the (multiple) correlation between the independent variables and the dependent variable. R-square ranges from 0 to 1. It is thought of as the proportion of the variation in the dependent variable that is accounted for by the independent variables, or similarly, the proportion of the total variation in the relationship that the model accounts for. It is a measure of how well the pattern of the model fits the pattern of the data, and hence, is a measure of accuracy. Some statisticians believe that R-square is overused and flawed because it always increases as terms are added to a model. Whine. Whine. Whine.
  • Standard Error of Estimate—also called sxy or SEE, is the standard deviation of the residuals. The residuals are the differences (i.e., errors) between the observed values of the dependent variable and the values calculated by the model. The SEE takes into account the number of samples (more is better) and the number of variables (fewer is better) in the model, and is in the same units as the dependent variable. It is a measure of how much scatter there is between the model and the data, and hence, is a measure of precision. For a set of models you are considering, the largest coefficient of determination usually will correspond to the smallest standard error of estimate. Consequently, many people look only at the coefficient of determination because it is easier to understand that statistic given its bounded scale. It’s essential to look at the standard error of estimate as well, though, because it will allow you to evaluate the uncertainty in the model’s predictions. In other words, R-square might tell you which of several models may be best while SEE will tell you if that best is good enough for what you need to do with the model.
  • F-test and probability—A test of whether the R-square value is different from zero. The F-value will vary with the numbers of samples and terms in the model. The probability is customarily required to be less than 0.05. Many statisticians start by looking at the results of the F-test, using the probability as a threshold, and then look at the R-square and SEE.
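All three criteria can be computed directly from the observed and predicted values of a fitted model. A minimal sketch (the data are hypothetical; scipy supplies the probability for the F-test):

```python
import numpy as np
from scipy import stats

# Hypothetical observed and model-predicted values of y,
# from a model with two independent variables.
y      = np.array([4.1, 3.9, 8.2, 7.8, 12.1, 11.9, 16.2, 15.8])
y_pred = np.array([4.0, 4.0, 8.0, 8.0, 12.0, 12.0, 16.0, 16.0])
n = len(y)   # number of samples
k = 2        # number of independent variables in the model

residuals = y - y_pred
ss_res = np.sum(residuals ** 2)          # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation

# Coefficient of determination: proportion of variation explained.
r_square = 1 - ss_res / ss_tot

# Standard error of estimate: residual scatter, penalized for the
# numbers of samples and terms, in the same units as y.
see = np.sqrt(ss_res / (n - k - 1))

# F-test: is R-square different from zero?
f_value = (r_square / k) / ((1 - r_square) / (n - k - 1))
p_value = stats.f.sf(f_value, k, n - k - 1)
```

Notice that SEE is in the units of y, which is what lets you judge whether the model’s predictions are precise enough for your purpose.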

Evaluating models doesn’t end with R-square, SEE, and F-test. There are many other diagnostic tools for evaluating the overall quality of statistical models, including:

  • AIC and BIC—The Akaike’s Information Criterion and the Bayesian Information Criterion are statistics for comparing alternative models. For any collection of models, the one with the lowest values of AIC and BIC is the preferred model.
  • Mallows’ Cp Criterion—A relative measure of inaccuracy in the model given the number of terms. Cp should be small and close to the number of terms in the model. Large values of Cp may indicate that the model is overfit.
  • Plot of Observed vs. Predicted—On a graph with observed values on the y-axis and predicted values on the x-axis, data points should plot close to a straight 45-degree line passing through the origin of the axes. Systematic deviations from the line indicate a lack-of-fit of the model to the data. Individual data points that deviate substantially from the line may be considered outliers.
  • Plot of Observed vs. Residuals—On a graph with observed values on the y-axis and residuals (observed values minus predicted values) on the x-axis, data points should plot randomly around the origin of the axes.
  • Histogram of Residuals—If the frequency distribution of the model’s residuals does not approximate a Normal distribution, the probabilities calculated for the F-test may be in error.
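For least-squares models, AIC and BIC can be computed from the residual sum of squares. A sketch using one common Gaussian-likelihood formulation, with made-up numbers, shows why a slightly worse-fitting model can still win the comparison:

```python
import numpy as np

def aic_bic(ss_res, n, n_params):
    """Gaussian-likelihood AIC and BIC for a least-squares fit
    (one common formulation, ignoring the additive constant)."""
    aic = n * np.log(ss_res / n) + 2 * n_params
    bic = n * np.log(ss_res / n) + n_params * np.log(n)
    return aic, bic

# Hypothetical comparison on the same 30 samples: a 2-variable model
# versus a 5-variable model that fits only slightly better.
aic_small, bic_small = aic_bic(ss_res=12.0, n=30, n_params=3)
aic_big,   bic_big   = aic_bic(ss_res=11.5, n=30, n_params=6)

# The small model wins despite its larger residual sum of squares,
# because AIC and BIC penalize the extra terms.
```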

There are a lot of things you’ll want to look at.

Usually, all of these statistics should be considered when building a model. Once a small number of alternative models has been selected, statistical diagnostics are used to evaluate the components of a statistical model, the variables, including:

  • Regression Coefficients—If you use statistical software, you’ll see two types of regression coefficients. The unstandardized regression coefficients are the a0 through an terms in the model. They are also referred to as B or b. These are the values you use if you want to calculate a prediction of the y variable from the values of the x variables. The standardized regression coefficients are the coefficients the model would have if every variable were first standardized to z-scores; each is equal to the unstandardized coefficient multiplied by the ratio of the standard deviation of its x variable to the standard deviation of y. Standardized regression coefficients, also called Beta coefficients, are used to compare the relative importance of the independent variables. If you forget which is which, remember that there is no standardized coefficient for the constant intercept term in the model. The column with a number for the model intercept contains the unstandardized coefficients you use for calculating predictions.
  • t-tests and probabilities—Tests of whether the regression coefficients are different from zero. The t-values may change substantially depending on what other terms are in the model. The probabilities for the tests are commonly used to include or discard independent variables.
  • Variance Inflation Factor—VIFs are measures of how much the model’s coefficients change because of correlations between the independent variables. The VIF for a variable should be less than 10, and ideally near 1; larger values suggest that multicollinearity may be a concern. The reciprocal of the VIF is called the tolerance.
  • Partial Regression Leverage Plots—Leverage plots are graphs of the dependent variable (y-axis) versus an independent variable from which the effects of the other independent variables in the model have been removed (x-axis). The slope of a line fit to the leverage plot is the regression coefficient for that independent variable. These plots are useful for identifying outliers and other concerns in the relationship between the independent variable and the dependent variable.

These statistics are calculated for each independent variable in a model.
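The VIF is simple to compute yourself: regress each independent variable on the others and take 1 / (1 − R-square). A sketch with synthetic data, where x3 is deliberately built to be nearly a copy of x1:

```python
import numpy as np

def vif(X, j):
    """VIF for column j of X (columns are independent variables):
    regress x_j on the other x's and return 1 / (1 - R-square)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])
    coefs, _, _, _ = np.linalg.lstsq(A, target, rcond=None)
    ss_res = np.sum((target - A @ coefs) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r_square = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_square)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)                # unrelated to x1
x3 = x1 + 0.05 * rng.normal(size=50)    # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

# x1 and x3 carry almost the same information, so their VIFs blow up
# past 10; x2 stays near the ideal value of 1.
```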

Finally, the observations used to create the statistical model are evaluated using diagnostic statistics, including:

  • Residuals—Residuals are the differences between the observed values and the model’s predictions. The residuals should all be small and Normally distributed.
  • DFBETAs—The changes in the regression coefficients that would result from deleting the observation. DFBETAs should all be small and relatively consistent for all the observations.
  • Studentized Deleted Residual—A measure of whether an observation of the dependent variable might be overly influential. The studentized deleted residual is like a t-statistic; it should be small, preferably less than 2 in absolute value, if the observation is not overly influential.
  • Leverage— A measure of whether an observation for an independent variable might be overly influential. The leverage for an observation should be less than two times the number of terms in the model divided by the sample size.
  • Cook’s Distance— A measure of the overall impact of an observation on the coefficients of the model. If the CD for an observation is less than 0.2, the observation has little impact on the model. A CD value over 0.5 indicates a very influential observation.

These statistics are calculated for each sample used to create the model.
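Leverage and Cook’s distance fall straight out of the design matrix and the residuals. A sketch with made-up data in which the last observation sits far from the rest:

```python
import numpy as np

# Hypothetical straight-line data with one suspicious observation
# at the end (x = 20 is far from the other x values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 60.0])

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coefs
mse = np.sum(resid ** 2) / (n - p)

# Leverage: the diagonal of the hat matrix X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Cook's distance for each observation.
cooks_d = (resid ** 2 / (p * mse)) * (leverage / (1 - leverage) ** 2)

# Rule of thumb from the text: leverage over 2p/n deserves a look.
flagged = leverage > 2 * p / n
```

Only the last observation gets flagged, and it also has by far the largest Cook’s distance, which is exactly the kind of point you’d investigate before trusting the model.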

You won’t necessarily use all of these diagnostics every time you build a model. Then again, you may also have to use some of the many other diagnostic statistics. You have to have the brains to know what statistics to use, the heart to follow through all the calculations and plots, and the courage to decide what diagnostics to ignore and what parts of the model you should change.

No Place for a Tome

In step 5 of the modeling process, “If all is well” means that all the statistical tests and graphics that your software provides indicate that the model will be satisfactory for your needs. This, of course, is the crux of statistical modeling that statisticians write all those books about. You’ll want to get at least one reference for the type of analysis you want to do and maybe another one for the software you plan to use. Then you actually have to read them. Good luck with that.

Let’s not have any big surprises. OK?

The best results you can hope for, in a way, are the mundane conclusions that confirm what you expect, especially if they add a bit of illumination to the dark places on the horizon of your current knowledge. Expect that there will be some minor differences between simulations. They’ll probably be inconsequential. But be cautious if the results are a big surprise. Be skeptical of anything that might make you want to call a press conference. It’s OK to get surprising results, just be sure you aren’t the one surprised later to find an error or misinterpretation.

After you’re done with model calibration, you’re ready to implement the model in a process called deployment or rollout. You’ll find a lot of information about deployment on the Internet, particularly in regard to software. Most data analyses give birth to reports, mostly shelf debris. Statistical models that perform a function, though, usually involve software. These models can be programmed into a standalone application or integrated into available software like Access or Excel. Consider your audience. Perhaps the best advice is to keep a deployed model as simple as possible. Most users won’t have to know the details of the model, only how to use it. Be sure you provide enough documentation, though, so that any number crunchers in the group can marvel at your accomplishment.



The Seeds of a Model

Always start with good seeds.

Perhaps the most complicated and time-consuming aspect of model building is selecting the components of your model—the variables, the samples, and the data (https://statswithcats.wordpress.com/2010/12/04/many-paths-lead-to-models/). Here are a few tips for collecting the seeds of your model.

Models Revisited

Here’s a quick review of the components of a statistical model. The key variable that characterizes the phenomenon to be modeled is called the criterion variable, or more commonly, the dependent variable. Variables (usually, but not necessarily, more than one) that will be used to test, predict, or explain the dependent variable in the model are called grouping variables, predictor variables, explanatory variables, or most commonly, independent variables. A prototype model is represented as:

Dependent variable that
characterizes the phenomenon

=

Independent variable(s) that test, predict, or explain the dependent variable

By convention, the criterion or dependent variable is always placed to the left of the equals sign, and the independent variables are placed to the right. This representation says that the information in the dependent variable can be obtained from the information in the independent variable(s). Usually, though, the independent variables in a model won’t all be equally important for describing a dependent variable. Each independent variable has to be weighted by multiplying it by an adjustment factor to account for the differences. The adjustment factors also correct for the independent variables being measured in different units, or even scales of measurement. So a more detailed representation of a model would be:

Dependent variable

=

Variable 1 Adjustment Factor * Independent variable 1 +
Variable 2 Adjustment Factor * Independent variable 2 +
… and so on … +
Model Adjustment Factor

This says that the information in your dependent variable can be expressed as the sum of your independent variables, which have been adjusted to account for their scales of measurement and for their contributions to the model, plus an adjustment factor for the entire model not related to a specific independent variable. If all of the adjustment factors are constants in a given model, which they usually are, you have a linear model. The values for the adjustment factors are determined by the technique you’re using to calibrate the model. If the value of a dependent variable is always equal to the sum of the adjustment factors times the values of the independent variables, plus the model constant, the model is called exact or deterministic (https://statswithcats.wordpress.com/2010/08/08/the-zen-of-modeling/).

Even with all those adjustment factors, though, sometimes the independent variables can’t quite reproduce the values of the dependent variable, so there are errors. Add an error term to the model and you have a statistical model:

Dependent variable

=

Variable 1 Adjustment Factor * Independent variable 1 +
Variable 2 Adjustment Factor * Independent variable 2 +
… and so on … +
Model Adjustment Factor +
Error

To be more concise, the terms in the model can be represented by letters and rewritten as:

y = a0 + a1x1 + a2x2 + … + anxn + e

where:

y is the dependent variable that characterizes the phenomenon.

x1 through xn are the independent variables that test, predict, or explain the dependent variable.

a0 is the Model Adjustment Factor.

a1 through an are the Variable Adjustment Factors, constants called the coefficients or parameters of the model. If a1 through an aren’t constants, you have a nonlinear model.

e is the Error term, which allows you to characterize the uncertainty in the model.

The y and the xs are the variables you create and measure on your samples. The as and the e are the constants the statistical procedure estimates. That’s a statistical model. To add a little more perspective, if you have only one dependent variable, only one independent variable, and no error, the model reduces to:

y = a + bx

Now that’s a different way to look at things.

which you may remember from high school algebra is the equation of a straight line where a is the y-intercept and b is the slope of the line. So mathematical models really aren’t so mysterious and shouldn’t induce the terror of, say, getting sucked down the toilet in the restroom of a Boeing 747 and falling 35,000 feet into a fetid swamp full of vampire bats, ticks, leeches, and IRS agents, then having to give an hour-long presentation on your experience au naturel at the next Christian Nudist Convocation. Try both, you’ll see.
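Fitting that straight line to a handful of hypothetical (x, y) pairs takes one call (polyfit is just least squares in disguise):

```python
import numpy as np

# Hypothetical (x, y) pairs that roughly follow y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

b, a = np.polyfit(x, y, deg=1)   # returns [slope, intercept]
# The calibrated line is y = a + b*x; the leftover scatter is e.
```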

Dependent Variables

To build your model, select as many dependent variables as you feel you’ll need to characterize the phenomenon. Usually, statistical models have only one dependent variable. These are called univariate statistical models. If you think you’ll need only one dependent variable, that’s great. It will make for a fairly straightforward analysis.

If more than one dependent variable is needed to describe a phenomenon, the model is called a multivariate statistical model. (Some statistical textbooks, particularly in the social sciences, refer to statistical procedures that analyze more than one kind of variable, either dependent or independent, as multivariate. But the complexity of the analysis is far greater if there are multiple dependent variables than if there are multiple independent variables.)

If you need more than one dependent variable, try to limit the number. If you have more than a few dependent variables, here are a few things you can do to reduce the number of candidate dependent variables.

  • Focus on Aspects of the Phenomenon—Some phenomena are very complex or at least multifaceted. You may be able to reduce the number of dependent variables you are considering by focusing on just one aspect of the phenomenon.
  • Narrow the Objective—If you are trying to do too much in one study, you might try to reduce your aims, or break up the project into parts and conduct the subprojects sequentially.
  • Focus on Hard Information—Hard information involves measurements of tangible, observable demonstrations as opposed to measurements of intangible beliefs or opinions. Focus on dependent variables that involve hard information.
  • Focus on Direct Information—Direct information involves measurements specifically of the phenomenon being investigated, as opposed to measurements of factors associated with the phenomenon. Focus on dependent variables that directly measure the phenomenon.
  • Eliminate Correlated Variables—If several candidate dependent variables are highly intercorrelated, pick the best and eliminate the rest.
  • Create Multiple Models—If you have to have more than one dependent variable, create a different model for each one. This is like subdividing the objectives—not optimal but sometimes a necessary evil.
  • Conduct a Factor Analysis—You might be able to reduce the number of dependent variables using factor analysis to combine the multiple variables into one.

If you can’t do any of these things, you’re probably headed for a multivariate analysis. Consider looking for help.

Independent Variables

Your selection of independent variables will hinge on what you plan to use the model for. Here are a few tips for identifying candidate measures and scales:

  • Variables for Characterizing, Classifying, Identifying, and Explaining—Select enough variables to address all the theoretical aspects of the phenomenon, even to the point of having some redundancy. Sometimes two differently measured or differently scaled variables that address the same theoretical concept will make dissimilar contributions to the model. When you calibrate the model, the extra variables will drop out.
  • Variables for Comparing—Test what you want to know, not everything under the sun. Keep the number of variables to an absolute minimum or your analysis will become intractable. Try to use conventionally recognized variables and scales rather than creating new ones if you can. This will facilitate replication studies.
  • Variables for Predicting—Be sure that the variables and scales you select are relatively inexpensive and easy to create or obtain. A prediction model won’t be very useful if the prediction variables cost more to generate than the prediction is worth. For example, if you plan to use the model repeatedly, say to make monthly forecasts, you’ll want the model inputs to be simple enough that you could generate all the data you would need in a couple of weeks at most. If the inputs were so complex that they take months to generate, you wouldn’t be able to use the model as you wanted. Stress precision in selecting variables. Accuracy tends to come easy while precision is elusive. Prediction models usually keep only the variables that work best in making a prediction, so the number of variables you select initially isn’t that important. Recognize, though, that the more variables you have in your conceptual model, the more work it will be to winnow out the ones you don’t need.

Some of the variables may have several possible scales (https://statswithcats.wordpress.com/2010/09/12/the-measure-of-a-measure/). If these extra scales are related to each other by a linear algebraic relationship, keep only one. This is because the variables will be perfectly correlated, and thus, will add no new information to the model. For example, if you measure temperature in degrees Fahrenheit, you don’t need to also include temperature in degrees Celsius because °C = 5/9(°F − 32). Pick the scale that will give you the best resolution. In the example of temperature, Fahrenheit-scaled thermometers can be read with greater precision than Celsius-scaled thermometers because they have smaller divisions. Better yet, get a digital thermometer that displays several decimal places.
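You can see why the redundant scale adds nothing: the conversion is an exact linear relationship, so the two scales are perfectly correlated. A quick check:

```python
import numpy as np

# Hypothetical Fahrenheit readings and their Celsius equivalents.
f = np.array([32.0, 50.0, 68.0, 86.0, 104.0])
c = (f - 32.0) * 5.0 / 9.0

# Linearly related scales are perfectly correlated, so one of them
# adds no new information to a model.
r = np.corrcoef(f, c)[0, 1]
```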

If two measures have unrelated scales or can be measured differently, keep them all at this point. You will sort out the best measures when you calibrate the model. For example, you could measure pH using pH paper, a field meter, or a lab titration. If a concept that you want to evaluate with your model were a person’s size, you could use a height scale and a weight scale. However, you wouldn’t need to include weight in both pounds and kilograms because the two scales are linearly related (1 kg ≈ 2.2 lb). You could include weight measured by a balance beam, a strain gage, a spring scale, or even a circus weight-guesser because they use different techniques to measure weight (although they would probably be highly correlated).

Samples and Data

The samples you select must represent the population you want to analyze. A lot of thought must go into defining the population and finding samples that will fairly represent that population. Then all those mental maneuvers have to be fit, along with considerations like the sample hierarchy, resolution and the number of samples (https://statswithcats.wordpress.com/2010/07/17/purrfect-resolution/), and the sampling scheme, into a comprehensive sampling plan. So the last thing you want to have happen is for the sampler, the person who will generate the data, to stray from the carefully thought-out plan. You don’t want field technicians moving sampling locations so that they don’t have to walk so far from their truck. You don’t want doctors reassigning their friends to experimental groups that will get preferential treatment. You don’t want your survey takers concentrating on attractive members of the opposite sex. You get the idea.

I think I’ll take a sample here.

When it comes to samples, samplers should have little or no discretion to stray from the plan. Follow the map that will find the population you’re looking for. Then there’s the process of generating the data. As much as you plan to minimize variance with reference, replication, and randomization (https://statswithcats.wordpress.com/2010/09/05/the-heart-and-soul-of-variance-control/), there will always be opportunities at the point of data collection to improve the process. A dropped meter may require recalibration that’s not called for in the sampling plan. A survey taker might ask a clarifying question, check spelling, or point out a math error before a respondent forever disappears. A surveyor can correct a map with an incorrectly located sampling point. An accountant can adjust misclassified debits in financial records. As the data analyst, you are mostly powerless to make such corrections and clarifications until it’s too late, and you have to puzzle over the cause of an outlier. You need to rely on the knowledge and experience of the people collecting the data. So when it comes to data, samplers should have considerable discretion to use their initiative to ensure the quality of the data, minimize variance, and achieve the intent, if not the letter, of the sampling plan.

Once you know what you want your model to do and you know what you need to measure, you can consider the statistical techniques you might use (https://statswithcats.wordpress.com/2010/08/27/the-right-tool-for-the-job/).



Many Paths Lead to Models

I know where I am and where I want to go.

If you’ve never created a statistical model before, you might be surprised to find that the process involves a lot more than statistics. It’s like traveling. You don’t start by thinking about your transport, the plane, train, or bus you might take. You start by knowing where you are and where you want to go. Only then do you create your itinerary, select a carrier, buy a ticket, pack your belongings, and make the trip. Likewise, modeling starts with the phenomenon you’re trying to model and ends with the model. Between those two points, though, there are many possible routes.

For example, after studying a phenomenon, you might decide how you would use a model, and from that, decide what the model should focus on, what data you’ll need, what statistical method you’ll use, and how you’ll calibrate the model. Or, you may be given a dataset by a client, and from the samples and variables, you determine what models could be created, and what statistical methods would be required. Sometimes, you decide what you want the model to do, but the variables would be too difficult or costly to collect, so you have to revise the model specifications and reconsider the samples and variables. Similarly, you might find that the statistical method you want to use will require different data or model boundaries, so you have to reconsider your plans. It’s not uncommon to iterate through these considerations several times before you’re ready to advance to the actual modeling process.

As you model more and more phenomena, you’re likely to take these paths and many more. Each excursion through the maze of modeling elements will be a new and different adventure for you to learn from. If you get lost, find someone to give you directions. Here are a few to get you started.

The Phenomenon

The first thing you’ll have to do is to think about the phenomenon you want to model. This may sound trivial, but it’s not. Even if you were assigned the work by your boss or academic advisor, you’re going to have to make a lot of decisions on your own. If they were going to make all the decisions, they wouldn’t have given the project to you.

The nature of the phenomenon has to do first with how tangible the phenomenon is. Is the phenomenon an object that can be seen and touched? Is it a process that can be watched and interacted with or a behavior that can be observed but not necessarily manipulated? Is it a condition that can be monitored, or if not visible, at least measurable (like radioactivity)? Or, is it an opinion that can’t be seen or touched, and may not even be measurable directly?

The nature of the phenomenon also has to do with how changeable the phenomenon is. Is it something that is fixed and unchangeable? If it changes, what is the rate of change? Is it too slow or too fast to be observable? Does the phenomenon exist in states of equilibrium and disequilibrium? Can changes be manipulated by an experimenter? Thinking about the nature of the phenomenon will help you narrow your options for what form might be appropriate for the model. For example, would it be possible to build a physical model or will the model have to be a less tangible written model, blueprint, computer application or mathematical equation? It’s not uncommon for several types of models to be developed to display, manipulate, or substitute for the phenomena. Automakers, for example, make many types of models of the automobiles they sell, from the styling, to the performance, to the marketing.

Model Use and Specifications

After the phenomenon, you’ll need to think about what you want to do with the model and how it will be designed. You can use a model to:

  • Display—use the model to describe or characterize the sample or the population.
  • Substitute—use the model in place of the phenomenon, such as for prediction.
  • Manipulate—use the model to explain aspects of the phenomenon.

As a point of reference, most models involve the simple display of descriptive information. If you plan to use them for substitution or manipulation, you’ll have to know more about the phenomenon, more about modeling, and more about statistics.

Whatever your planned use, you’ll have to think about how you want to approach the modeling. Three factors you ought to consider are the viewpoint you’ll take to develop the model, the level of detail of the model, and the boundaries of the model relative to the phenomenon.

Modeler’s Viewpoint

Your viewpoint in modeling is how you plan to approach the effort, that is, either from the top down or the bottom up. A top-down viewpoint will require you to understand the big picture, things like what the phenomenon is associated with. This viewpoint is more correlative and is commonly used in statistical models, especially predictive models. A bottom-up viewpoint will require you to understand the details, the conditions that cause or affect the phenomenon. This viewpoint is more deterministic and is commonly used in theoretical models and in statistical models for explanation.

Top-down models usually don’t require as many variables as bottom-up models so long as they are the right variables. The problem with top-down models is that sometimes relationships appear to be oversimplified or obscure. Why should skirt length predict stock prices, for example? It makes no sense, but a high correlation has been found between the two measures.

Bottom-up models tend to require more variables to characterize all the facets of a phenomenon. Larger numbers of variables, in turn, require greater levels of effort than for top-down models. Furthermore, many of the details included in a bottom-up model are often found not to have a significant impact on the overall model. Hence, bottom-up modeling tends to be labor intensive and inefficient, but in the end, at least you know how everything fits together.

Some modelers take their viewpoint as an extension of their own personalities. Big picture people think of a phenomenon in terms of general concepts, mechanisms, trends, and patterns and tend to model from the top down. They don’t care if their favorite team has weaknesses as long as the team’s winning percentage is high. Details people think of discrete parts or elements that make up a phenomenon, and tend to model from the bottom up. They believe the whole is equal to the sum of the parts. Their team could be in first place, but they’re concerned about one player who is in a slump.

Often both viewpoints work equally well for modeling a phenomenon. Sometimes, though, one or the other viewpoint will work better, be easier, or even be the only feasible approach. For example, say you want to model the performance of an automobile. Using a top-down viewpoint, you might focus on acceleration, gas mileage, top speed, and so on. You might be able to model how the automobile will perform under certain driving conditions, but you won’t learn anything about how the components of the automobile work together. Using a bottom-up viewpoint, you might focus on number of cylinders, gear ratios, timing, and so on. You might be able to model how changing a component could boost or diminish its function, but you won’t know if the change would provide the same effect to the automobile’s overall performance. You have to be sure that your viewpoint is appropriate for how you plan to use the model or else the model won’t be useful. At every step in your modeling effort, ask yourself, “Will I be able to do what I need to do with the results of the model?”

Model Details

Every phenomenon complex enough to have to be modeled assuredly has many levels of detail. You have to decide how much detail to put in your model, especially if your viewpoint is bottom-up. Still, there are practical limits imposed by restrictive budgets and schedules or by what is known about the phenomenon. For example, if you want to model the performance of an automobile, do you concentrate on the engine or also consider aerodynamics, steering, braking, and other components? If you concentrate on the engine, do you focus on the internal combustion components or also consider the pollution control devices, the electrical system, and other components? If you concentrate on the internal combustion components, do you focus on the pistons or also consider the spark plugs and the fuel?

Model Boundaries

Where will your model end? This is easy to visualize with location and time; you can draw a line on a map or block out dates on a calendar. Many phenomena aren’t so easy to isolate, though. Processes, in particular, often use inputs from other processes or contain subprocesses that can’t be isolated. In modeling the performance of an automobile, for example, do you include different makes (e.g., Ford, Honda), different models (e.g., sedans, SUVs), different options (e.g., engines, transmissions), different drivers, different types of road conditions, and so on?

These determinations will affect everything else you do.

Other Model Specifications

There are many other things about your model that might have a bearing on the variables and samples you select, the statistical methods you use, and how you go about optimizing the model. Here are a few specifications that may be relevant to your model:

  • Users—Who will be using the model? If it’s just you, the model may not need to have a polished appearance and extensive documentation. If others will be using the model, though, consider that audience. You may not have to build a comprehensive user interface, but you’ll at least need to try to make it understandable and sufficiently documented. Don’t try to make it idiot-proof; it’s not worth the effort. God is just too good at making idiots.
  • Frequency of Use—If the model will be used on a recurring basis, make sure there will be some provision for you or some other qualified individual to review the model periodically to ensure it is being used correctly and is still appropriate for representing the phenomenon.
  • Accuracy and Precision—As a general rule, statistical models tend to be fairly accurate but never as precise as you need them to be. Have some notion of the accuracy and precision you want. That way you’ll know when either you’re done or it’s time to quit. A good way to specify the precision you want is to start from a gut feeling and specify the precision as a percentage, for example, ±5 percent or ±10 percent. Then you’ll have to control variance and manipulate the number of samples and the confidence level so that a confidence interval is close to your target precision.
  • Limit of Complexity—Some models were not meant to be. If you can’t fit the model to the data, you have to be prepared to call it quits. In a way, this is equivalent to a Do Not Resuscitate order in medicine, and likewise, it can be a sensitive subject. It’s usually easier to create new variables or try some other statistical manipulation than it is to give the bad news, and the bill, to the client.
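The precision target in the Accuracy and Precision item above can be sketched with the usual confidence-interval arithmetic: the 95 percent interval around a mean has a half-width of about 1.96·s/√n, so you can back out the number of samples that hits a target precision. (The function and the example numbers here are illustrative, not from any particular study.)

```python
import math

def samples_needed(std_dev, mean, target_pct, z=1.96):
    """Samples needed so the 95 percent confidence interval's
    half-width is within target_pct percent of the mean
    (normal approximation; illustrative only)."""
    target = abs(mean) * target_pct / 100.0
    return math.ceil((z * std_dev / target) ** 2)

# If measurements average 50 with a standard deviation of 10,
# a +/-10 percent target (+/-5 units) needs:
print(samples_needed(std_dev=10, mean=50, target_pct=10))  # 16
```

Note how quickly the cost rises: halving the target precision to ±5 percent roughly quadruples the number of samples.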

So, those are some of the issues you’ll want to consider when you build a statistical model. There’s much more to think about, of course, especially when you start collecting the provisions for your model—the samples, variables, and data. Now to be candid, you’ll give some of these topics only a few nanoseconds of thought before you jump into the maelstrom of model building. Some you’ll think about constantly throughout your modeling effort. Some will just be what they turn out to be. Model building is an adventure. Every journey is unique so savor the experiences.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


Secrets of Good Correlations

If you’ve ever seen a correlation coefficient, you’ve probably looked at the number and wondered, is that good? Is a correlation of -0.73 good but not a correlation of +0.58? Just what is a good correlation and what makes a correlation good?

The strength of the relationship between two variables is usually expressed by the Pearson Product Moment correlation coefficient, denoted by r. Pearson correlation coefficients range in value from -1.0 to +1.0, where:

  • -1.0 represents a perfect correlation in which all measured points fall on a line having a negative slope
  • 0.0 represents absolutely no linear relationship between the variables
  • +1.0 represents a perfect correlation in which all measured points fall on a line having a positive slope.

Negative, No, and Positive Feline Correlations.

If you have a dataset with more than one variable, you’ll want to look at correlation coefficients.

The Pearson correlation coefficient is used when both variables are measured on a continuous (i.e., interval or ratio) scale. There are several variations of the Pearson product-moment correlation coefficient. The multiple correlation coefficient, denoted by R, indicates the strength of the relationship between a dependent variable and two or more independent variables. The partial correlation coefficient indicates the strength of the relationship between a dependent variable and one or more independent variables with the effects of other independent variables held constant. The adjusted or shrunken correlation coefficient indicates the strength of a relationship between variables after correcting for the number of variables and the number of data points. There are also correlation coefficients for variables measured on noncontinuous scales. The Spearman R, for instance, is computed from ordinal-scale ranks.
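Since the Spearman R is just a Pearson correlation computed on ranks, the relationship between the two can be sketched in a few lines. (This is a simplified version that assumes no tied values; ties would need average ranks.)

```python
def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_r(x, y):
    """Spearman R: Pearson's r applied to the ranks of the data
    (simplified -- assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [2, 1, 9, 30, 100]            # monotonic but not linear
print(round(pearson_r(x, y), 2))   # 0.85
print(round(spearman_r(x, y), 2))  # 0.9
```

On monotonic but nonlinear data like this, the Spearman R comes out higher than the Pearson r on the raw values because it only cares about the ordering.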

Types of Correlation Coefficients.

So, what is a good correlation? It depends on who you ask.

  • I once asked a chemist who was calibrating a laboratory instrument to a standard what value of the correlation coefficient she was looking for. “0.9 is too low. You need at least 0.98 or 0.99.” She got the number from a government guidance document.
  • I once asked an engineer who was conducting a regression analysis of a treatment process what value of the correlation coefficient he was looking for. “Anything between 0.6 and 0.8 is acceptable.” His college professor told him this.
  • I once asked a biologist who was conducting an ANOVA of the size of field mice living in contaminated versus pristine soils what value of the correlation coefficient he was looking for. He didn’t know, but his cutoff was 0.2 based on the smallest size difference his model could detect with the number of samples he had.

Is 0.2 a good correlation or does a good correlation have to be at least 0.6 or even 0.98? As it turns out, the chemist, the engineer, and the biologist were all right. Those correlations were all good for those uses. So, the meaningfulness of a correlation coefficient depends, in part, on the expectations of the person using it.

But how do you know what value of a correlation coefficient you should expect for it to be good? One answer is to look at the square of the correlation coefficient, called the coefficient of determination, R-square, or just R2. R-square is an estimate of the proportion of variance in the dependent variable that is accounted for by the independent variable(s). It is used commonly to interpret the strength of the relationship between variables and to compare alternative statistical models.

You might be able to decide how good your correlation is from a gut feel for how much of the variability you wanted a relationship to account for. As rough guides:

  • Values between -0.3 and +0.3 account for less than 9 percent of the variance, which might indicate a weak or non-existent relationship.
  • Values between -0.3 and -0.6 or +0.3 and +0.6 account for 9 to 36 percent of the variance, which might indicate a weak to moderately strong relationship.
  • Values between -0.6 and -0.8 or +0.6 and +0.8 account for 36 to 64 percent of the variance, which might indicate a moderately strong to strong relationship.
  • Values between -0.8 and -1.0 or +0.8 and +1.0 account for more than 64 percent of the variance, which might indicate a very strong relationship.
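Those rule-of-thumb bands can be captured in a small helper. (The cutoffs are the R-square boundaries above; the function name and the labels-as-code are illustrative, not standards.)

```python
def describe_correlation(r):
    """Interpret a correlation via R-square, using rough
    rule-of-thumb bands (judgment calls, not standards)."""
    r2 = r * r   # proportion of variance accounted for
    if r2 < 0.09:
        strength = "weak or non-existent"
    elif r2 < 0.36:
        strength = "weak to moderately strong"
    elif r2 < 0.64:
        strength = "moderately strong to strong"
    else:
        strength = "very strong"
    return r2, strength

r2, strength = describe_correlation(-0.73)
print(f"R-square = {r2:.2f} ({strength})")  # R-square = 0.53 (moderately strong to strong)
```

Note that the sign of r disappears in R-square: a correlation of -0.73 accounts for just as much variance as +0.73.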

That’s only part of the story, though. Two other things you have to do to decide if a correlation is good are plot the data and conduct a statistical test.

Plots—You should always plot the data used to calculate a correlation to ensure that the coefficient adequately represents the relationship. The magnitude of r is very sensitive to the presence of nonlinear trends and outliers. Nonlinear trends in the data cause the magnitude of the relationship to be underestimated. You can often use transformations to straighten any nonlinear patterns you see (https://statswithcats.wordpress.com/2010/11/21/fifty-ways-to-fix-your-data/). Outliers (i.e., data values not representative of the population) that are located perpendicular to the data trend cause the relationship to be underestimated. Outliers parallel to the data trend cause the relationship to be overestimated.
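To see why plotting matters, here is a sketch of how a single outlier perpendicular to the trend can collapse an otherwise perfect correlation (pure-Python Pearson r for illustration; the data are made up):

```python
def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

x = list(range(10))
y_clean = [2 * v + 1 for v in x]          # points exactly on a line
print(round(pearson_r(x, y_clean), 2))    # 1.0

y_outlier = y_clean[:]
y_outlier[9] = -10                        # one point far off the trend
print(round(pearson_r(x, y_outlier), 2))  # 0.16
```

One unrepresentative point out of ten drops r from 1.0 to 0.16, which is exactly the kind of thing you would catch immediately on a scatterplot but never from the coefficient alone.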

Tests—Every calculated correlation coefficient is an estimate. The “real” value may be somewhat more or somewhat less. You can conduct a statistical test to determine if the correlation you calculated is different from zero. If it’s not, there is no evidence of a relationship between your variables. This test looks at the absolute value of the correlation coefficient and the number of data pairs used to calculate it. The larger the value of the correlation and the greater the number of data pairs, the more likely the correlation will be significantly different from zero. For example, a correlation of 0.5 would be significantly greater than zero based on about 11 data pairs but a correlation of 0.1 wouldn’t be significantly different from zero with 380 data pairs. That’s why all statistical software outputs the number of data pairs and the test probability with a correlation. With some software, you can also calculate a confidence interval around your estimate to see if the interval includes the value you set as a goal. But one way or the other, you have to consider the variability of your calculated estimate to decide if the correlation is good.
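The test described above is based on the t statistic t = r√(n−2)/√(1−r²). Here is a sketch of the 0.1-with-380-pairs example; the critical value of about 1.97 comes from a t-table for α = 0.05, two-tailed.

```python
import math

def correlation_t(r, n):
    """t statistic for testing whether a correlation based on
    n data pairs differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# r = 0.1 with 380 data pairs: the t statistic falls just short of
# the ~1.97 critical value (alpha = 0.05, two-tailed, from a t-table),
# so the correlation is not significantly different from zero.
print(round(correlation_t(0.1, 380), 2))  # 1.95
```

Statistical software computes the exact probability for you; the point of the formula is just that both the size of r and the number of data pairs drive the result.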

Correlation coefficients have a few other pitfalls to be aware of. For example, the value of a multiple or partial correlation coefficient may not necessarily meet your definition of a good correlation even if it is significantly different from zero. That’s because the calculated values will tend to be inflated if there are many variables but only a few data pairs, hence the need for that shrunken correlation coefficient. Then there’s the paradox that a large correlation isn’t necessarily a good thing. If you are developing a statistical model and find that your predictor variables are highly correlated with your dependent variable, that’s great. But if you find that your predictor variables are highly correlated with each other, that’s not good, and you’ll have to deal with this multicollinearity in your analysis. Finally, if you’re calculating many correlation coefficients from a large data set, you might find that the number of data pairs is different for each calculation because of missing data. Some statisticians believe it is acceptable to compare correlations calculated with different numbers of data pairs and other statisticians believe it is unwarranted, nonsensical, dishonest, fraudulent, heinous, and sickeningly evil.

What to Look for in Correlations.

What makes a good correlation, then, depends on what your expectations are, the value of the estimate, whether the estimate is significantly different from zero, and whether the data pairs form a linear pattern without any unrepresentative outliers. You have to consider correlations on a case-by-case basis. Remember too, though, that “no relationship” may also be an important finding.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


Fifty Ways to Fix your Data

(Sing to the tune of “Fifty Ways to Leave Your Lover” by Paul Simon)

The problem is all about your scales, she said to me
The R-squares will be better if you’ve matched ’em mathematically
It’s just a way to make your model fit nicely
There must be fifty ways to fix your data

She said it’s really not my preference to transform
‘Cause sometimes, the new scales confuse, overfit, or misinform
But I’ll Box-Cox ’em all if it means they’ll fit the norm
There must be fifty ways to fix your data
Fifty ways to fix your data

Take the tails for a trim, Kim
Try a replace, Grace
You can use the rank, Hank
Just try ’em and see
Make it more smooth, Suz
Lots of functions you can choose
A higher degree, Dee
Will get you more fee.

There must be fifty ways to fix your data.

In exploring a dataset, you need to be sure that you have the right numbers and that those numbers are right. You need to find and fix problems with individual data points like errors and outliers. You need to find and fix problems with observations like censored data and replicates. And you need to find and fix problems with variables like their frequency distributions and their correlations with other variables. Often, rehabilitating variables involves transformations, methods of changing the scales of your variables that might further your analyses.

As part of this process, you should consider what other information you can add that might be relevant to your analysis. This is especially important if you are planning to develop an exploratory statistical model. Experience will tell you when expanding your dataset might make a difference and when it won’t. If you don’t have that experience yet, start by learning about why you might transform variables and how it can be done. Then practice; try a variety of different techniques and learn along the way. But first you need to understand some of the pros and cons of what you might do to your dataset.

Transformations: Yes But No But Yes

There is some controversy over the use of transformations (and other methods of creating new variables) that has caused a few statisticians to argue forcefully either for or against their use. The three most common arguments that have been made against the use of transformations involve:

  • Analyze what you measure. Don’t complicate the analysis unnecessarily. Stick to what the instrument was designed to measure in the way it was designed to measure it.
  • Use scales consistently. Don’t confuse your readers unnecessarily. Report results in the same units that you used to measure the data.
  • Let the data decide. Don’t capitalize on chance by overfitting your model. Your results should work on other samples from the same population.

There is a single simple argument for the use of transformations—they work better than the original variable scales. If they don’t work better, you don’t use them. William of Ockham would have liked that argument. So what constitutes working better? Consider these examples of the three ways that transformations are used.

One, perhaps the most important use of transformations is to reduce the effects of violations of statistical assumptions. If you plan to do any statistical analysis that involves using a Normal (or other) distribution as a model of your dependent variable, it’s important to use a scale that makes the data fit the distribution as closely as possible. If the data aren’t a good fit for the distribution, probabilities calculated for some tests and statistics will be in error. Because costly or risky decisions may be made from these probabilities, inaccuracies can be a big deal. So using a transformation to correct violations of statistical assumptions is a very important use.

Two, perhaps the most common use of transformations is to find scales that optimize the linear correlation between data for a dependent variable and data for independent variables. Statistical model building almost always benefits from this use of transformations. Everybody does it.

Three, perhaps the most overlooked use of transformations is, in a word, convenience. Sometimes transformations are used to convert measured data to more familiar units, improve computational efficiency, eliminate replicates, reduce the number of variables, and other actions that facilitate, but not necessarily improve, the analysis.

Now that you’ve been warned, here are four things you can do that might further your analyses:

  • Sample Adjustments—methods for fixing missing, erroneous, or unrepresentative data points.
  • Dependent Variable Transformations—methods for changing the scale of the dependent variable to minimize the effects of violations of statistical assumptions.
  • Independent Variable Transformations—methods for creating new variables from the original independent variables, which have better correlations with the dependent variable.
  • Supplemental Variables—methods for creating new variables from untapped data sources.

There is nothing sacred about this classification. Some of the categories might overlap or omit other ideas, so use these examples to stimulate your own thinking. In time, you’ll develop a sense of what you need for a particular analysis.

Sample Adjustments

Sample adjustments involve changing individual data points for a variable. Unlike most transformations which result in the creation of a new variable, sample adjustments leave the original variable intact. You use adjustments to correct errors, fill in missing data, reveal censored data, rein in unrepresentative replicates, and accommodate outliers. Using sample adjustments is a good place to start enhancing your data set. They are like digging out weeds and filling in holes before you plant a new lawn. It wouldn’t make sense to do it later, or worse, not at all.

Dependent Variable Transformations

After you’ve filled all the holes in your data matrix with sample adjustments, the next thing you should do is to make sure the dependent variable approximates a Normal distribution. If you haven’t looked at histograms and other indicators of Normality, always do that first. Then if your data distribution differs enough from the Normal distribution to make you nervous about your analysis, try a transformation of the dependent variable. Try several, in fact. Transformations of dependent variables create new variables but you’ll keep only one of the candidate dependent variables for an analysis. You want to pick the candidate dependent variable that fits a theoretical distribution best so that calculations of test probabilities are most accurate. If the frequency distribution of your dependent variable is skewed toward higher values, try a root transformation. If the frequency distribution is skewed toward lower values, try a power transformation. Better yet, try a Box-Cox transformation. Box-Cox transformations include the most commonly used transformations—roots, powers, reciprocals, and logs—as well as an infinite number of minor variations in between. The only downside is that the process is labor intensive if you don’t have statistical software that performs the analysis.
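If your software doesn’t do Box-Cox for you, the idea can be sketched in pure Python: apply the transform over a grid of lambdas and keep the one that maximizes the Box-Cox log-likelihood. (This is a simplified, coarse-grid illustration; statistical packages do the search more carefully.)

```python
import math

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or ln(x) when lam = 0.
    Requires all values to be positive."""
    if lam == 0:
        return [math.log(v) for v in x]
    return [(v ** lam - 1) / lam for v in x]

def best_lambda(x):
    """Coarse grid search for the lambda that maximizes the
    Box-Cox log-likelihood (simplified illustration)."""
    n = len(x)
    log_sum = sum(math.log(v) for v in x)
    def loglik(lam):
        y = box_cox(x, lam)
        mean = sum(y) / n
        var = sum((v - mean) ** 2 for v in y) / n
        return -n / 2 * math.log(var) + (lam - 1) * log_sum
    grid = [i / 10 for i in range(-20, 21)]   # lambdas from -2.0 to 2.0
    return max(grid, key=loglik)

# Data skewed toward higher values; the chosen lambda comes out
# negative for this sample, i.e., a root-to-reciprocal-type scale.
data = [1, 2, 2, 3, 3, 3, 5, 8, 13, 40]
print(best_lambda(data))
```

Note that lambda = 1 leaves the shape of the data unchanged, lambda = 0.5 is a square root, lambda = 0 is a log, and lambda = -1 is a reciprocal, with the grid filling in everything between.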

Independent Variable Transformations

Once you have the dependent variable you want to work with, you can go on to examine all the relationships between that dependent variable and the independent variables. While the target for transforming a dependent variable is the Normal frequency distribution, the target for transforming independent variables is a straight-line correlation between the dependent variable and each independent variable. This can be a lot of work. Remember, you have to look at correlations and plots, perhaps even for special groupings of the data. That’s the reason you always start by finding a scale for the dependent variable that fits a Normal distribution. You wouldn’t want to repeat this process for more than one dependent variable if you didn’t have to.

Variable Adjustments

Variable adjustments are changes, some quite minor, made to all the values for a variable (as opposed to just modifying specific samples as in sample adjustments). All variable adjustments create new independent variables for analysis. Examples of variable adjustments include:

  • Differencing. Differencing involves subtracting the value of a variable from a subsequent value of the same variable, usually used to highlight differences. Durations in a time-series are calculated by differencing.
  • Smoothing. Smoothing is the opposite of differencing and usually involves some type of averaging. Smoothing is used to suppress data noise so patterns become more evident.
  • Shifting. Data shifting involves moving data up or down one or more rows in a data matrix to produce new variables called lags (when previous times are shifted to the current time) or leads (when subsequent times are shifted to the current time). Shifting all the data by one row is called a first-order lag or lead. Shifting data for a variable by k rows is called a k-order lag or lead.
  • Standardizing. Standardizing involves equating the scales of some variables, usually by dividing the values by a reference value. Examples include adjusting currency for inflation, and calculating z-scores and percentages.
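The four adjustments above can be sketched on a small time series (the function names are illustrative):

```python
def difference(x):
    """First differences: each value minus the previous one."""
    return [b - a for a, b in zip(x, x[1:])]

def smooth(x, window=3):
    """Moving average -- suppresses noise so patterns show."""
    return [sum(x[i:i + window]) / window
            for i in range(len(x) - window + 1)]

def lag(x, k=1):
    """k-order lag: shift previous values down to the current row."""
    return [None] * k + x[:-k]

def z_scores(x):
    """Standardize to mean 0 and standard deviation 1."""
    n = len(x)
    m = sum(x) / n
    s = (sum((v - m) ** 2 for v in x) / n) ** 0.5
    return [(v - m) / s for v in x]

series = [10, 12, 11, 15, 18, 17]
print(difference(series))  # [2, -1, 4, 3, -1]
print(lag(series))         # [None, 10, 12, 11, 15, 18]
```

Notice that each adjustment creates a new, shorter or shifted variable; in a data matrix the missing leading values show up as blanks (here, None).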

Variable Rescaling

Sometimes it can be useful to change the scale or units of a variable to simplify or facilitate an analysis, by:

  • Rescaling for computational efficiency
  • Converting units, either with or without adding information
  • Converting quantitative scales to qualitative scales
  • Increasing the number of scale divisions
  • Decreasing the number of scale divisions

Changing the scale of a variable is different from changing the units of a variable. Both are important. Changing units usually involves only simple mathematical calculations with or without the addition or deletion of information. Rescaling variables involves adding or removing information or changing a point of reference. Rescaling usually involves making changes based on logic but may include mathematical calculations as well. Recoding is perhaps the most common way of rescaling a variable. Some statistical software packages have utilities to facilitate recoding.

Variable Linearization

Improving the correlation between a dependent variable and an independent variable is a big part of statistical modeling. The objectives of this type of transformation are (1) to persuade the data to follow a straight line, and (2) to minimize the scatter of the data around the line. Here are a few examples of mathematical functions used as transformations.
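Commonly tried functions include logs, roots, reciprocals, and powers. A sketch of auditioning several scales and keeping the one with the best straight-line correlation (pure-Python Pearson r; the candidate functions and data are illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

transforms = {
    "identity":    lambda v: v,
    "log":         math.log,
    "square root": math.sqrt,
    "reciprocal":  lambda v: 1 / v,
    "square":      lambda v: v * v,
}

x = [1, 2, 3, 4, 5, 6]
y = [math.log(v) for v in x]   # dependent variable is linear in log(x)

best = max(transforms, key=lambda name:
           abs(pearson_r([transforms[name](v) for v in x], y)))
print(best)  # log
```

In practice you would also look at the plots for each candidate scale, since a high r can hide curvature or outliers, as discussed in the correlations post above.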


Variable Combinations

Variable combinations are new variables created from two or more existing variables using simple arithmetic operations (sums, differences, products, and ratios) or more complex mathematical functions. Variable combinations should be based on theory rather than created for convenience. Usually the variables combined should have the same units (e.g., dollars), although units can be standardized using z-scores.

Supplemental Variables

You won’t necessarily just add variables at the beginning of your analysis. You may add them continuously throughout your analysis as you learn more about your data. Some variables may turn out to be critical to the analysis and others will just facilitate reporting or some other ancillary function. Supplemental variables can be created by concatenating or partitioning existing variables, or by adding new information from metadata or external references (e.g., federal census data).

You Can Teach Old Data New Tricks

When your instructor gave you a dataset in Statistics 101, that was it. You did what the assignment called for, got the desired answer, and you were finished. But it doesn’t work that way in the real world overflowing with data but lacking in wisdom. Sometimes you have to put more effort into making sense of things. Statistics is the mortar that brings data and metadata together to make building blocks of information into a temple of wisdom. Transformations are like mason’s tools. They can smooth, reshape, adjust, add texture, augment, condense, and on and on. Suffice it to say that with transformations, there must be at least fifty ways to fix your data.


Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.


You Can Lead a Boss to Data but You Can’t Make Him Think

The most carefully planned data analysis may not survive the intervention of a boss (or a client or other reviewer), whether well intentioned or not. Your aim may be to generate sound data and conduct a thorough and valid analysis, but your boss may have different motives and concerns. He or she may have budget or schedule constraints, not to mention business vulnerabilities and office politics to contend with. So be prepared when the unthinkable happens … like when, just a few days before you planned to start collecting data, your boss calls you into his office and says:

Change This Before You Start

Add a sample. Reword a procedure. Drop a measurement. These and other requests will defile the perfect data collection and analysis plan you spent so much time creating. What do you do?

Adding a sample or measurement may not be a problem so long as you have the equipment and sampling supplies available. Changing a survey shouldn’t be a big deal if you haven’t already printed the questionnaires. Do what the boss wants and don’t worry about it too much.

Changing sampling procedures or rewording survey questions may or may not be a problem. Keep an open mind. It’s when the validity of the sampling procedure or the meaning of a survey question is changed that you have to be careful. Sometimes the changes may seem subtle. For example, changing the order in which samples are collected may seem inconsequential, but it may have an impact on the results. Asking survey respondents if they do something once a week is not the same as asking if they do something often. Asking if something is good is not the same as asking if something is better than expected.

Dropping a measurement or question is problematic. If you didn’t need it, you wouldn’t have included it in the first place, but your boss won’t see it that way. Metadata and supporting measurements seem to be a favorite target of bosses looking to put their stamp on your plan. They might not understand that you really need those survey questions that characterize the respondents. These items are tough to lose, but it’s better to have something than nothing. Make sure the boss knows what he’ll be losing and go with what you can.

We Can’t Afford This

That’s more than I expected to spend is a mantra statisticians hear often. You may even have said it yourself to the dealer of that hot red convertible you want, or to the plumber who came to fix your broken water main, or to the tax preparer you finally cornered at 5:00 pm on April 15. Did you pay up, pass on the offer, or negotiate something in between? Likewise, there are three things that your boss can do in this situation:

  • Relent and pay for the study
  • Negotiate a reduced price
  • Cancel the study (or get someone cheaper to do it).

You want the first thing to happen and can live with the second, but the third would be a disaster.

So put yourself in your boss’s position. Is the analysis something he has to do? If so, gently review the consequences of not doing it. What is it he doesn’t want but might get if the study isn’t done? Will his boss be upset? Will a regulatory agency come calling? Paint a picture of the cost of the consequences compared to the cost of the study. If the analysis is something he doesn’t have to do, you’ll have to convince him that it is something he wants to do, or even better, needs to do. Consider what the boss is looking to do with the results. Show him how your data analysis will add value to his operation. Give him a clear vision of what the payback will be.

If that doesn’t work, try working out a compromise. Just remember, a cut in price has to be balanced by a cut in scope, otherwise you’ll have no credibility. If the cuts will impair the analysis so much that little will be gained, it’s better to pass on the study and survive to analyze another day.

That’s Too Many Samples

Having your boss (or client) ask if you can get by with fewer samples is almost a given. Unless the boss understands statistics and variability, it’s unlikely he’ll see the need for as many samples as you planned to collect. Agreeing to reduce the number of samples is a self-inflicted but nonfatal wound. The margin of error will be bigger but quantifiable. Let the boss know what he’ll be getting. He should appreciate the concept of trade-offs since he has to deal with them all the time in his own work. This assumes, of course, that the boss is at least in the same ballpark as you. If you want 800 samples, and he was thinking of just asking a few customers some questions over lunch, you’re toast.

Here Are the Samples You Should Take

After spending a lot of time ensuring that your samples will be truly representative of your population, your boss gives you his list of samples to be collected. Will your boss’s list fairly represent the population, or is there some, perhaps unintentional, bias? Does the boss want you to survey the biggest or oldest customers, or worse, the customers who will give the best reviews? Does he want you to sample only the processes or waste streams that he knows are already within specifications? What do you do?

This can be a deal breaker from your perspective. There’s no reason to collect data and do a statistical analysis with a judgment sample, let alone a highly biased judgment sample. You’ll only be deluding your boss and yourself. Try to convince your boss that his directed samples will invalidate the study. If you can’t, take the list as a window onto your boss’s real reason for conducting the survey. He may just want something flattering to show his boss. If you still have to analyze the results even with the biased list, be sure to caveat your findings.
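If the boss doubts that hand-picked samples can do much damage, a quick simulation makes the point. This sketch uses a made-up population of 10,000 customer satisfaction scores and compares a simple random sample to a “boss’s list” of his happiest customers:

```python
import random

random.seed(42)

# Hypothetical population: 10,000 customer satisfaction scores.
population = [random.gauss(6.0, 1.5) for _ in range(10_000)]

# A proper simple random sample of 200 customers.
random_sample = random.sample(population, 200)

# A judgment sample: the boss hand-picks his 200 happiest customers.
judgment_sample = sorted(population, reverse=True)[:200]

def mean(xs):
    return sum(xs) / len(xs)

print(f"Population mean:      {mean(population):.2f}")
print(f"Random sample mean:   {mean(random_sample):.2f}")
print(f"Judgment sample mean: {mean(judgment_sample):.2f}")
```

The random sample lands near the population mean; the judgment sample overstates it by several points, and no amount of downstream analysis can undo that.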

We’re Not Going to Do This After All

There are two common reasons the boss will pull the plug on a survey he asked for. The first is that he didn’t get approvals from his bosses. If you raise the level of attention of the survey in the organization early in the process, like when you go in search of data, this shouldn’t be a problem.

The second reason is funding. Some bosses relax during the first two quarters of the fiscal year then suddenly realize in the third quarter that they aren’t going to make their business plan unless they take drastic measures. Your survey will be the first thing to go, deferred if you’re lucky, canceled if you’re not. Try to schedule your survey for early in the fiscal year.

The Results Aren’t What We Expected

If your boss says he is surprised by your results, it could mean a couple of different things.

If the boss says either that the results were too general to tell anybody anything or that the results were too detailed for anybody to figure out, it’s a presentation problem. That’s actually a good thing. It can be fixed. Don’t be reluctant to get a communications specialist to help translate your work into a better presentation. You will benefit from more people being able to understand what you did.

If the boss is surprised by your results, and it’s not the presentation, it’s a virtual certainty that the results are unfavorable. There are at least four possible scenarios to consider:

  • The boss is really surprised by the results because he is out of touch with his business or whatever it is you studied. In this case, watch how he reacts as you explain some of the intricacies of the findings. As long as he doesn’t become defensive, it could be a great opportunity for both of you. If he does become defensive, try stressing that the results don’t take into consideration all the constraints he is under. Protect his fragile ego. You need him to take the next step and do something positive.
  • The boss is really surprised by the results because he thinks there may be a problem with the data or analysis. Double-check your work to be sure nothing is amiss statistically, and keep an open mind. The boss may be aware of some aspects of the data that may have skewed your results.
  • The boss is not surprised by the results but doesn’t want to admit it because he was hoping for something different. He’s playing dumb. This is his way of trying to deflect some responsibility away from himself. Play along. Remember, the only important thing is that he takes some positive action based on your results.
  • The boss was not surprised by the results but doesn’t want to admit it because the results don’t match what his boss wants to see. There’s not much you can do about this. The boss will probably take the results, and you’ll never hear of it again. Consider it a valuable life experience.

Just Give Me the Results; I’ll Take Care of the Rest

Everybody who has a boss has lived this alternative reality, whether you’re analyzing customer satisfaction data for the CEO or wrapping burgers for hungry customers. You give your work to the boss, who becomes the face of the product. The boss gets all the accolades, and perhaps an occasional complaint, while you get little if any recognition.

True, some bosses do this to try to claim credit for work they didn’t do, but this is usually transparent to anyone who matters. The CEO knows your boss neither conducted the analysis nor has the knowledge or the time to have done it. He is responsible, though, for the work you do. You know when you go to a restaurant that the host didn’t prepare the food. You know when the politician cuts the ribbon at the opening of the new library that he didn’t build the building or shelve the books. Those people represent the work of many.

Just give me the tuna. I’ll take care of the rest.

But a boss may want to be the spokesperson for your work for other reasons. He may want to bury unfavorable results. He may want to control access to the results because knowledge is power. Even more important, he may want to control access to you. Your demonstrated skills make your boss even more powerful. Finally, it’s possible that your boss doesn’t want you involved in decisions made as a result of your work. While this judgment is usually blamed on ego flexing, it may also be attributable to self-loathing over the decision-making process. If you think data analysis is messy, you should see what decision makers do. They are often illogical and inconsistent. They choose paths of least resistance over avenues of greatest effectiveness. They pay more attention to anecdotes than data. They worry about obscure possibilities and ignore likely consequences. They have a penchant for inaction and a desire to preserve the status quo. Unless you have a frustration deficit in your life, it might be better to let this one go.

One big issue with this response is that you don’t get any feedback. It’s nice to get recognition for your work but it’s absolutely essential that you get feedback. That’s how you grow as a professional and as an individual. If you can’t get any feedback from your boss, look elsewhere. Do any of your colleagues have opinions about the study? How did the high priest of the database like working with you? Do you ever run into the CEO or other managers in the hallway who might be familiar with what you did? Take whatever you can get (without making your boss paranoid) and put the knowledge to good use.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Wheatmark, amazon.com, barnesandnoble.com, or other online booksellers.