Why You Don’t Always Get the Correlation You Expect

If you’ve ever taken a statistics class on correlation, you’ve probably come to expect that a large value for a correlation coefficient, either positive or negative, means that there is a noteworthy relationship between two phenomena. This is not always the case. Furthermore, a small correlation may not always mean there is no relationship between the phenomena.

Correlation Size Doesn’t Always Matter

A small correlation coefficient does not necessarily imply the lack of a relationship, any more than a large correlation coefficient necessarily implies a strong one. It depends on the type of relationship and the data used to characterize it. This is important because analysts often devote all their time to investigating large correlations while disregarding relationships that have small correlation coefficients, especially if they don’t have an expectation for what a good value might actually be.

Statistical Reasons for Small Correlations

There are several statistical reasons for unexpected correlations:

Nonlinear relationships: The most widely used correlation coefficient, Pearson’s r, measures the strength of a linear relationship between two variables. Nonlinear relationships result in smaller than expected correlation coefficients. A scatterplot of the variables can usually confirm this problem, which can often be corrected with a data transformation.
Outliers: The strength of a correlation coefficient can be deflated or inflated by outliers. A scatterplot can usually confirm the presence of outliers, although deciding how to treat them may be more problematic.
Excessive uncontrolled variance: Sometimes, data points that appear to be outliers may just be instances of excess variance. Excess variance is probably the most common cause of smaller than expected correlations. Usually, excess variance is the result of a lack of adequate control in data generation.
Inappropriate sample: Data points that look like outliers or excess variance may be sham samples. Sham samples are not representative of the population being analyzed, so they confound any calculated statistics. Samples may also represent trends hidden in subpopulations, perhaps even resulting in Simpson’s paradox.
Inefficient metrics: Variables used in the analysis may not be appropriate for investigating the phenomenon in question. As a consequence, the strength of a relationship will be smaller than expected.
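
The first two reasons in the list above are easy to see with a few made-up numbers. The following is only a sketch in Python: the data are invented for illustration, and pearson_r is a small helper written here, not a function from any statistics package.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = list(range(1, 11))

# Invented data: y is completely determined by x, but along a curve.
ys_curved = [math.exp(x) for x in xs]
r_raw = pearson_r(xs, ys_curved)

# A log transformation straightens the curve before correlating.
r_log = pearson_r(xs, [math.log(y) for y in ys_curved])

# Invented data: a perfect line, then the same line with one wild last point.
ys_line = [2 * x + 1 for x in xs]
ys_outlier = ys_line[:-1] + [-50]
r_line = pearson_r(xs, ys_line)
r_outlier = pearson_r(xs, ys_outlier)

print(f"curved, raw data:        r = {r_raw:.2f}")
print(f"curved, log-transformed: r = {r_log:.2f}")
print(f"line, no outlier:        r = {r_line:.2f}")
print(f"line, one outlier:       r = {r_outlier:.2f}")
```

With these numbers, the raw coefficient for the curved data comes out well below 1 even though y depends on x exactly, and it rises to essentially 1 after the transformation. The single outlier is enough to turn a perfect positive correlation into a negative one.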

That’s why any evaluation of a correlation should include looking at the coefficient’s sign and size, a scatterplot of the relationship, and a statistical test of significance. There’s a lot more information in data relationships than can be expressed by a single statistic.
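
To see why the size of a coefficient and its statistical significance are separate questions, here is a rough Python sketch using the textbook t statistic for a Pearson correlation. The sample sizes and coefficients are invented, the critical values are copied from a standard two-sided 5% t table, and the test assumes the usual conditions (such as roughly normal data) hold.

```python
import math

def t_statistic(r, n):
    """t statistic for testing whether a Pearson r from n pairs differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided 5% critical values from a standard t table (assumed, not computed here).
T_CRIT_DF_3 = 3.182       # degrees of freedom = n - 2 = 3
T_CRIT_LARGE_DF = 1.962   # approximate value for about 1,000 degrees of freedom

# Hypothetical case 1: a large coefficient from only 5 data pairs.
t_big_r = t_statistic(0.8, 5)

# Hypothetical case 2: a small coefficient from 1,000 data pairs.
t_small_r = t_statistic(0.1, 1000)

print(f"r = 0.8, n = 5:    t = {t_big_r:.2f}, significant: {abs(t_big_r) > T_CRIT_DF_3}")
print(f"r = 0.1, n = 1000: t = {t_small_r:.2f}, significant: {abs(t_small_r) > T_CRIT_LARGE_DF}")
```

In this sketch the large coefficient fails the test because it rests on so few points, while the small coefficient passes it. Neither result says anything about whether the relationship is important, which is why the scatterplot still matters.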

If there are no statistical issues with a dataset, it’s important to also consider what types of relationships between the variables are possible.

Relationship Reasons for Small Correlations

Types of Relationships

When analysts see a large correlation coefficient, they begin speculating about possible reasons. They’ll naturally gravitate toward the initial hypothesis (or preconceived notion) that prompted them to investigate the data relationship in the first place. Because hypotheses are commonly about causation, they often begin with this least likely type of relationship. Besides causation, relationships can also reflect influence or association:

Causes: A cause is a condition or event that directly triggers, initiates, makes happen, or brings into being another condition or event. A cause is a sine qua non; without a cause a consequent will not occur. Causes are directional. A cause must precede its consequent.
Influences: An influence is a condition or event that changes the manifestation of an existing condition or event. Influences can be direct or mediated by a separate condition or event. Influences may exist at any time before or after the influenced condition or event. Influences may be unidirectional or bidirectional.
Associations: Associations are two conditions or events that appear to change in a related way. Any two variables that change in a similar way will appear to be associated. Thus, associations can be spurious or real. Associations may exist at any time before or after the associated condition or event. Unlike causes and influences, associated variables have no effect on each other, and the association may not appear in different populations or in the same population at different times or places.

Associations are commonplace. Most observed correlations are probably just associations. Influences and causes are less common but, unlike associations, they can be supported by the science or other principles on which the data are based. The strength of a correlation coefficient is not related to the type of relationship. Causes, influences, and associations can all have strong as well as weak correlations depending on the efficiency of the variables being correlated and the pattern of the relationship.

Patterns of Relationships

Most discussions of correlation and causation focus on the simple direct relationship that one event or condition, designated as A, is related to a second event or condition, designated as B. For example, gravitational forces from the Moon and Sun cause ocean tides on the Earth. A causes B but B does not cause A. Another direct relationship is that age influences height and weight. Age doesn’t cause height and weight but we tend to grow larger as we age so A influences B.

Direct relationships are easy to understand and, if there are no statistical obfuscations, should exhibit a high degree of correlation. In practice, though, not every relationship is direct or simple. Here are eight other patterns:

Feedback Relationship: A and B are linked in a loop; A causes or influences B which then causes or influences A and so on. For example, poor performance in school or at work (A) creates stress (B) which degrades performance further (A) leading to more stress (B) and so on.

Common Relationship: A third event or condition, C, causes or influences both A and B. For example, hot weather (C) causes people to wear shorts (A) and drink cool beverages (B). Wearing shorts doesn’t cause or influence beverage consumption, although the two are associated by their common cause. Another example is the influence obesity has on susceptibility to a variety of health maladies.
Mediated Relationship: A causes or influences C and C causes or influences B so that it appears that A causes B. For example, rainy weather (A) often induces people to go to their local shopping mall for something to do (C). While there, they shop, eat lunch, and go to the movies and other entertainment venues thus providing the mall with increased revenues (B). In contrast, snowstorms (A) often induce people to stay at home (C) thus decreasing mall revenues (B). Bad weather doesn’t cause or influence mall revenues directly but does influence whether people visit the mall.
Stimulated Relationship: A causes or influences B but only in the presence of C. There are many examples of this pattern, such as metabolic and chemical reactions involving enzymes or catalysts. Likewise, some drugs (A) cause side effects (B) only in certain at-risk populations (C).
Suppressed Relationship: A causes or influences B but not in the presence of C. For example, pathogens (A) cause infections (B) but not in the presence of antibiotics (C).
Inverse Relationship: The absence of A causes or influences B. For example, vitamin deficiencies (A) cause or influence a wide variety of symptoms (B).
Threshold Relationship: A causes or influences B only when A is above a certain level. For example, rain (A) causes flooding (B) only when the volume or intensity is very high.
Complex Relationship: Many A factors or events contribute to the cause or influence of B. Numerous environmental processes fit this pattern. For example, a variety of atmospheric and astronomical factors (A) contribute to influencing climate change (B).
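
Two of the patterns above, the common relationship and the threshold relationship, can be sketched with invented numbers. In the Python below, all of the data are made up, pearson_r and partial_r are helpers written for this illustration, and the first-order partial correlation is brought in only as one convenient way to ask what is left of the A and B correlation once C is accounted for.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_ab, r_ac, r_bc):
    """First-order partial correlation of A and B, controlling for C."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Common relationship: temperature (C) drives both shorts (A) and drinks (B).
temperature = list(range(8))
# Hand-picked wobbles, chosen to be unrelated to temperature and to each other.
wobble_a = [1, -1, -1, 1, -1, 1, 1, -1]
wobble_b = [1, -1, -1, 1, 1, -1, -1, 1]
shorts = [2 * c + w for c, w in zip(temperature, wobble_a)]
drinks = [3 * c + w for c, w in zip(temperature, wobble_b)]

r_ab = pearson_r(shorts, drinks)
r_ab_given_c = partial_r(
    r_ab,
    pearson_r(shorts, temperature),
    pearson_r(drinks, temperature),
)

# Threshold relationship: flooding (B) responds to rain (A) only above level 5.
rain = list(range(10))
flooding = [max(0, a - 5) for a in rain]
r_all = pearson_r(rain, flooding)
# Same data, restricted to rain at or above the threshold.
r_above = pearson_r(rain[5:], flooding[5:])

print(f"shorts vs. drinks:                     r = {r_ab:.2f}")
print(f"shorts vs. drinks, temperature fixed:  r = {r_ab_given_c:.2f}")
print(f"rain vs. flooding, all data:           r = {r_all:.2f}")
print(f"rain vs. flooding, above threshold:    r = {r_above:.2f}")
```

In this contrived setup, shorts and drinks are strongly correlated even though neither affects the other, and the correlation vanishes once temperature is controlled for. The rain and flooding relationship is perfectly regular above the threshold, yet the coefficient over the full range is noticeably lower because most of the data show no response at all.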

Spurious Relationships

There are also a variety of spurious relationships in which A appears to cause or influence B but does not. Often the reason is that the relationship is based on anecdotal evidence that is not valid more generally. Sometimes spurious relationships may be some other kind of relationship that isn’t understood. Here are five other reasons why spurious relationships are so common.

Misunderstood relationships: The science behind a relationship may not be understood correctly. For example, doctors used to think that spicy foods and stress caused ulcers. Now, there is greater recognition of the role of bacterial infection. Likewise, hormones have been found to be the leading cause of acne rather than diet (i.e., consumption of chocolate and fried foods).
Misinterpreted statistics: There are many examples of statistical relationships being interpreted incorrectly. For example, the sizes of homeless populations appear to influence crime. Then again, so do the numbers of museums and the availability of public transportation. All of these factors are associated with urban areas.
Misinterpreted observations: Incorrect reasons are attached to real observations. Many old wives’ tales are based on credible observations. For example, the notion that hair and nails continue to grow after death is an incorrect explanation for a legitimate observation: they can appear longer because the surrounding skin retracts as it dries.
Urban legends: Some urban legends have a basis in truth and some are pure fabrications, but they all involve spurious relationships. For example, in South Korea, it is believed that sleeping in a closed room with a fan running will result in death.
Biased assertions: Some spurious relationships are not based on any evidence but instead are claimed in an attempt to persuade others of their validity. For example, the claim that masturbation makes you have hairy palms is not only ludicrous but also easily refutable. Likewise, almost any advertisement in support of a candidate in an election contains some sort of bias, such as cherry-picking.

Correlation and Causation — Not Always What You Expect

Calculated correlation coefficients are innocent bystanders in debates over causation, influence, and association. With all the statistical and relational nuances that can affect their interpretation, it’s a wonder that they are so often used alone as determinants of causality and influence. As with all statistics, correlation coefficients need to be interpreted in the context provided by other types of information. Certainly, correlation does not imply causation, but according to Edward Tufte and others, sometimes it’s a good hint.

Read more about using statistics at the Stats with Cats blog. Join other fans at the Stats with Cats Facebook group and the Stats with Cats Facebook page. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at amazon.com, barnesandnoble.com, or other online booksellers.

8 Responses to Why You Don’t Always Get the Correlation You Expect

Ken Butler says:

November 3, 2014 at 1:27 PM

Nice discussion. I especially like “patterns of relationships”. I might be borrowing from this for my class!
