A Typology of Data Relationships

Nine patterns of three types of relationships that aren’t spurious.

When analysts see a large correlation coefficient, they begin speculating about possible reasons. They’ll naturally gravitate toward their initial hypothesis (or preconceived notion), which set them to investigating the data relationship in the first place. Because hypotheses are commonly about causation, they often begin with this least likely type of relationship, using the most simplistic relationship pattern: a direct one-event-causes-another.

A typology of data relationships is important because it helps people understand that not all relationships reflect a cause. They may just be the result of an influence or an association or even mere coincidence. Furthermore, you can’t always tell what type and pattern of relationship a data set represents. There are at least 27 possibilities (three types times nine patterns), not even counting spurious relationships. That’s where number crunching ends and statistical thinking shifts into high gear. Be prepared.

Types of Data Relationships

Besides causation, relationships can also reflect influence or association.

Causes

A cause is a condition or event that directly triggers, initiates, makes happen, or brings into being another condition or event. A cause is a sine qua non; without a cause a consequent will not occur. Causes are directional. A cause must precede its consequent.

Influences

An influence is a condition or event that changes the manifestation of an existing condition or event. Influences can be direct or mediated by a separate condition or event. Influences may exist at any time before or after the influenced condition or event. Influences may be unidirectional or bidirectional.

Associations

Associations are two conditions or events that appear to change in a related way. Any two variables that change in a similar way will appear to be associated. Thus, associations can be spurious or real. Associations may exist at any time before or after the associated condition or event. Unlike causes and influences, associated variables have no effect on each other and may not exist in different populations or in the same population at different times or places.

Associations are commonplace. Most observed correlations are probably just associations. Influences and causes are less common but, unlike associations, they can be supported by the science or other principles on which the data are based. The strength of a correlation coefficient is not related to the type of relationship. Causes, influences, and associations can all have strong as well as weak correlations depending on the efficiency of the variables being correlated and the pattern of the relationship.


Patterns of Data Relationships

Direct relationships are easy to understand and, if there are no statistical obfuscations, should exhibit a high degree of correlation. In practice, though, not every relationship is direct or simple. Some are downright complex.

Here are nine relationships that I could think of. There may be more. These relationships involve events or conditions termed A, B, and C.


Direct Relationship

Most discussions of correlation and causation focus on the simple, direct relationship that one event or condition, A, is related to a second event or condition, B. The relationship proceeds in only one direction. For example, gravitational forces from the Moon and Sun cause ocean tides on the Earth. A causes B but B does not cause A. Another direct relationship is that age influences height and weight. Age doesn’t cause height and weight but we tend to grow larger as we age so A influences B. B does not influence A.


Feedback Relationship

In a feedback relationship, A and B are linked in a loop. A causes or influences B, which then causes or influences A, and so on. Feedback relationships are bidirectional. They will be correlated. For example, poor performance in school or at work (A) creates stress (B) which degrades performance further (A) leading to more stress (B) and so on.


Common-Cause Relationship

In a common-cause relationship, a third event or condition, C, causes or influences both A and B. For example, hot weather (C) causes people to wear shorts (A) and drink cool beverages (B). Wearing shorts (A) doesn’t cause or influence beverage consumption (B), although the two are associated by their common cause. A plot of these data will show that A and B are correlated, but the correlation represents an underlying association rather than an influence or a cause. Another example is the influence obesity has on susceptibility to a variety of health maladies.
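To make the common-cause pattern concrete, here is a minimal simulation sketch. The numbers are invented for illustration (a hypothetical temperature range and noise level, not real data), and the `pearson` and `partial_corr` helpers are written out by hand so the example needs only the standard library. The idea is that A and B each depend only on C, yet they still correlate with each other until C is controlled for.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical data: temperature (C) drives both shorts (A) and drinks (B).
temp = [rng.uniform(10, 35) for _ in range(2000)]
shorts = [t + rng.gauss(0, 4) for t in temp]  # A depends only on C
drinks = [t + rng.gauss(0, 4) for t in temp]  # B depends only on C

r_ab = pearson(shorts, drinks)
r_ab_given_c = partial_corr(shorts, drinks, temp)

print(f"r(A, B)     = {r_ab:.2f}")          # strong, despite no A-B effect
print(f"r(A, B | C) = {r_ab_given_c:.2f}")  # near zero once C is held fixed
```

Under these assumptions the plain correlation between A and B should come out strong, while the partial correlation given C should hover near zero, which is the signature of an association produced by a shared cause.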


Mediated Relationship

In a mediated relationship, A causes or influences C and C causes or influences B, so that it appears that A causes B. A and B will be correlated. For example, rainy weather (A) often induces people to go to their local shopping mall for something to do (C). While there, they shop, eat lunch, and go to the movies or other entertainment venues, thus providing the mall with increased revenues (B). In contrast, snowstorms (A) often induce people to stay at home (C), thus decreasing mall revenues (B). Bad weather doesn’t cause or influence mall revenues directly but does influence whether people visit the mall.
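A mediated chain can be sketched the same way. The simulation below assumes a purely linear chain with independent noise at each step, and the variable names (`rain`, `visitors`, `revenue`) are hypothetical standardized quantities rather than real mall data. In that idealized setting, the correlation between the two ends of the chain is roughly the product of the correlations along its links, so A and B correlate, but more weakly than either link.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical chain: A -> C -> B, with fresh noise added at each step.
rain = [rng.gauss(0, 1) for _ in range(n)]          # A
visitors = [a + rng.gauss(0, 1) for a in rain]      # C depends on A
revenue = [c + rng.gauss(0, 1) for c in visitors]   # B depends only on C

r_ac = pearson(rain, visitors)
r_cb = pearson(visitors, revenue)
r_ab = pearson(rain, revenue)

print(f"r(A, C) = {r_ac:.2f}")
print(f"r(C, B) = {r_cb:.2f}")
print(f"r(A, B) = {r_ab:.2f}  vs  r(A, C) * r(C, B) = {r_ac * r_cb:.2f}")
```

Note that a correlation coefficient alone cannot tell this chain apart from a common cause. That distinction has to come from what is known about the subject matter.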


Stimulated Relationship

In a stimulated relationship, A causes or influences B but only in the presence of C. Stimulated relationships may not appear to be correlated using a Pearson correlation coefficient but may using a partial correlation. There are many examples of this pattern, such as metabolic and chemical reactions involving enzymes or catalysts.
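Here is a rough sketch of how a stimulated relationship can hide in pooled data. It swaps in a simpler check than a partial correlation: it splits the data by whether C is present and computes a Pearson coefficient in each group. All quantities are hypothetical (a made-up `dose`, `effect`, and a `catalyst` present about 20% of the time).

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable

a, b, c = [], [], []
for _ in range(3000):
    catalyst = rng.random() < 0.2   # C is present in about 20% of cases
    dose = rng.gauss(0, 1)          # A
    # A affects B only when C is present; otherwise B is pure noise.
    effect = (2 * dose if catalyst else 0.0) + rng.gauss(0, 1)
    a.append(dose)
    b.append(effect)
    c.append(catalyst)

r_all = pearson(a, b)
r_present = pearson([x for x, k in zip(a, c) if k],
                    [y for y, k in zip(b, c) if k])
r_absent = pearson([x for x, k in zip(a, c) if not k],
                   [y for y, k in zip(b, c) if not k])

print(f"all cases: r = {r_all:.2f}")      # diluted by the C-absent cases
print(f"C present: r = {r_present:.2f}")  # the relationship shows up here
print(f"C absent:  r = {r_absent:.2f}")   # essentially nothing
```

Flipping the condition so that A affects B only when C is absent gives the suppressed pattern described next.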


Suppressed Relationship

In a suppressed relationship, A causes or influences B but not in the presence of C. As with stimulated relationships, suppressed relationships may only appear to be correlated using a partial correlation coefficient. Medicine has many examples of suppressed and stimulated relationships. For example, pathogens (A) cause infections (B) but not in the presence of antibiotics (C). Some drugs (A) cause side effects (B) only in certain at-risk populations (C).


Inverse Relationship

In inverse relationships, the absence of A causes or influences B, OR the presence of A minimizes B. Correlation coefficients for inverse relationships are negative. For example, vitamin deficiencies (A) cause or influence a wide variety of symptoms (B).


Threshold Relationship

In threshold relationships, A causes or influences B only when A is above a certain level. For example, rain (A) causes flooding (B) only when the volume or intensity is very high. These relationships aren’t usually revealed by correlation coefficients.
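The sketch below illustrates why a single correlation coefficient can understate a threshold relationship. It assumes an invented rainfall scale from 0 to 100, a hypothetical threshold of 90, and a made-up flood index with some measurement noise. None of these numbers come from real hydrology.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(3)  # fixed seed so the run is repeatable
THRESHOLD = 90.0        # hypothetical level above which flooding starts

rain = [rng.uniform(0, 100) for _ in range(5000)]   # A
# B responds to A only above the threshold; noise stands in for
# measurement error in the (hypothetical) flood index.
flood = [max(0.0, r - THRESHOLD) + rng.gauss(0, 1) for r in rain]

above = [(r, f) for r, f in zip(rain, flood) if r >= THRESHOLD]
below = [(r, f) for r, f in zip(rain, flood) if r < THRESHOLD]

r_all = pearson(rain, flood)
r_above = pearson([r for r, _ in above], [f for _, f in above])
r_below = pearson([r for r, _ in below], [f for _, f in below])

print(f"all data:        r = {r_all:.2f}")
print(f"below threshold: r = {r_below:.2f}")
print(f"above threshold: r = {r_above:.2f}")
```

In this setup the pooled coefficient lands somewhere in the middle and describes neither regime well, which is why a scatter plot is usually the better first look.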


Complex Relationship

In complex relationships, many A factors or events contribute to the cause or influence of B. Numerous environmental processes fit this pattern. For example, a variety of atmospheric and astronomical factors (A) contribute to influencing climate change (B). Even many correlation coefficients may not explain this type of relationship; it takes more involved statistical analyses.
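As a simplified illustration of the complex pattern, the sketch below builds B from ten independent, equally weighted factors plus noise. That setup is an assumption made purely for convenience: in practice the weights are unknown and would have to be estimated, for example with multiple regression. Each factor on its own shows only a weak correlation with B, while the factors taken together track it closely.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(11)  # fixed seed so the run is repeatable
n, k = 2000, 10          # n observations of k hypothetical A factors

factors = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
# B is the sum of all factors plus noise (equal weights are an assumption).
outcome = [sum(f[i] for f in factors) + rng.gauss(0, 1) for i in range(n)]

single_rs = [pearson(f, outcome) for f in factors]
combined = [sum(f[i] for f in factors) for i in range(n)]
r_combined = pearson(combined, outcome)

print("each factor alone:", " ".join(f"{r:.2f}" for r in single_rs))
print(f"all factors combined: r = {r_combined:.2f}")
```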


Spurious Data Relationships

There are also a variety of spurious relationships in which A appears to cause or influence B, but does not. Often the reason is that the relationship is based on anecdotal evidence that is not valid more generally. Sometimes spurious relationships may be some other kind of relationship that isn’t understood. Here are five other reasons why spurious relationships are so common.

Misunderstood relationships

The science behind a relationship may not be understood correctly. For example, doctors used to think that spicy foods and stress caused ulcers. Now, there is greater recognition of the role of bacterial infection. Likewise, hormones have been found to be the leading cause of acne rather than diet (i.e., consumption of chocolate and fried foods).

Misinterpreted statistics

There are many examples of statistical relationships being interpreted incorrectly. For example, the sizes of homeless populations appear to influence crime. Then again, so do the numbers of museums and the availability of public transportation. All of these factors are associated with urban areas, but not necessarily crime.

Misinterpreted observations

Incorrect reasons are attached to real observations. Many old wives’ tales are based on credible observations. For example, the notion that hair and nails continue to grow after death is an incorrect explanation for a legitimate observation: they can look longer because the surrounding skin dries and retracts.

Urban legends

Some urban legends have a basis in truth and some are pure fabrications, but they all involve spurious relationships. For example, in South Korea, it is believed that sleeping with a fan in a closed room will result in death.

Biased Assertions

Some spurious relationships are not based on any evidence, but instead, are claimed in an attempt to persuade others of their validity. For example, the claim that masturbation makes you have hairy palms is not only ludicrous but also easily refutable. Likewise, almost any advertisement in support of a candidate in an election contains some sort of bias, such as cherry picking.

Coincidences

Mother Nature has a wicked sense of humor. Don’t believe every correlation coefficient you calculate.
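To see how easily coincidences arise, here is a small sketch that generates 50 completely unrelated random series of 10 points each and counts how many pairs happen to correlate strongly. The series count, the series length, and the 0.7 cutoff are arbitrary choices made for illustration.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)  # fixed seed so the run is repeatable
n_series, n_points = 50, 10

# Every series is independent noise, so no pair has any real relationship.
series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

pairs = list(itertools.combinations(range(n_series), 2))
strong = [(i, j) for i, j in pairs
          if abs(pearson(series[i], series[j])) > 0.7]

print(f"{len(strong)} of {len(pairs)} unrelated pairs have |r| > 0.7")
```

With short series and many comparisons, a handful of impressive-looking coefficients is to be expected from chance alone.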


The Evolution of Data Science … As I Remember It

Anonymous feline in witness protection program.

Those who cannot remember the past are condemned to repeat it.

George Santayana

History isn’t always clear-cut. It’s written by anyone with the will to write it down and the forum to distribute it. It’s valuable to understand different perspectives and the contexts that created them. The evolution of the term Data Science is a good example.

I learned statistics in the 1970s in a department of behavioral scientists and educators rather than a department of mathematics. At that time, the image of statistics was framed by academic mathematical-statisticians. They wrote the textbooks and controlled the jargon. Applied statisticians were the silent majority, a sizable group overshadowed by the academic celebrities. For me, reading Tukey’s 1977 book Exploratory Data Analysis was a revelation. He came from a background of mathematical statistics yet wrote about applied statistics, a very different animal.

My applied-statistics cohorts and I were a diverse group: educational statisticians, biostatisticians, geostatisticians, psychometricians, social statisticians, and econometricians, nary a mathematician in the group. We referred to ourselves collectively as data-scientists, a term we heard from our professor. We were all data scientists, despite our different educational backgrounds, because we all worked with data. But the term never stuck and faded away over the years.

Feline crunching on number keys.

Age of Number-Crunching: 1960s-1970s

Applied statistics had been very important during World War II, most notably in code breaking but also in military applications and more mundane logistics and demographic analyses. After the war, the dominance of deterministic engineering analysis grew and drew most of the public’s attention. There were many new technologies in consumer goods and transportation, especially aviation and the space race, so statistics wasn’t on most people’s radar. Statistics was considered to be a field of mathematics. The public’s perception of a statistician was a mathematician, wearing a white lab coat, employed in a university mathematics department, who was working on who-knows-what.

One of the technologies that came out of WWII was ENIAC, which led to the IBM/360 mainframes of the mid-1960s. These computers were still huge and complex, but compared to ENIAC, quite manageable. They were a technological leap forward and inexpensive enough to become part of most university campuses. Mainframes became the mainstays of education. Applied statisticians and programmers led the way; computer rooms across the country were packed with them.

In 1962, John Tukey wrote in “The Future of Data Analysis”:

“For a long time, I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt…I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”

I read that paper as part of my graduate studies. Perhaps applied statisticians saw this paper as an opportunity to develop their own identity, apart from determinism and mathematics, and even mathematical statistics. But it really wasn’t an organized movement, it just evolved.

One of my cohorts and I discussing data science.

So as my cohorts and I understood it, the term data-sciences was really just an attempt to coin a collective noun for all the number-crunching, just as social-sciences was a collective noun for sociology, anthropology, and related fields. The data sciences included any field that analyzed data, regardless of the domain specialization, as opposed to pure mathematical manipulations. Mathematical statistics was NOT a data science because it didn’t involve data. Biostatistics, chemometrics, psychometrics, social and educational statistics, epidemiology, agricultural statistics, econometrics, and other applications were part of data science. Business statistics, outside of actuarial science, was virtually nonexistent. There were surveys but business leaders preferred to call their own shots. Data-driven business didn’t become popular until the 21st century. But if it had been a substantial field, it would have been a data science.

Computer programming might have involved managing data but to statisticians it was not a data science because it didn’t involve any analysis of data. There was no science involved. At the time, it was called data processing. It involved getting data into a database and reporting them, but not analyzing them further. Naur (1974) had a different perspective. Naur was a computer scientist who considered data science to encompass dealing with existing data, and not how the data were generated or were to be analyzed. This was just the opposite of the view of applied statisticians. Different perspectives.

Programming in the 1950s and 1960s was evolving from the days of flipping switches on a mainframe behemoth, but was still pretty much limited to Fortran, COBOL, and a bit of Algol. There were issues with applied statisticians doing all their own programming. They tended to be less efficient than programmers and were sometimes unreliable. To paraphrase Dr. McCoy, “I’m an applied statistician, not a computer programmer.” This philosophy was reinforced by British statistician Michael Healy when he said:

No single statistician can be expected to have a detailed knowledge of all aspects of statistics and this has consequences for employers. Statisticians flourish best in teams—a lone applied statistician is likely to find himself continually forced against the edges of his competence.

M.J.R. Healy. 1973. The Varieties of Statistician. J. Royal Statistical Society. 136(1), p. 71-74.

So when the late 1960s brought statistical-software-packages, most notably BMDP and later SPSS and SAS, applied statisticians were in Heaven. Still, the statistical packages were expensive programs that could only run on mainframes, so only the government, universities, and major corporations could afford their annual licenses, the mainframes to run them, and the operators to care for the mainframes. I was fortunate. My university had all the major statistical packages that were available at the time, some of which no longer exist. We learned them all, and not just the coding. It was a real education to see how the same statistical procedures were implemented in the different packages.

Waiting for the mainframe to print out my analysis.

Throughout the 1970s, statistical analyses were done on those big-as-dinosaurs, IBM/360 mainframe computers. They had to be sequestered in their own climate-controlled quarters, waited on command and reboot by a priesthood of system operators. No food and no smoking allowed! Users never got to see the mainframes except, maybe, through a small window in the locked door. They used magnetic tapes. I saw ‘em.

Conducting a statistical analysis was an involved process. To analyze a data set, you first had to write your own programs. Some people used standalone programming languages, usually Fortran. Others used the languages of SAS or SPSS. There were no GUIs (Graphical User Interfaces) or code writing applications. The statistical packages were easier to use than the programming languages but they were still complicated.

Once you had handwritten the data-analysis program, you had to wait in line for an available keypunch machine so you could transfer your program code and all your data onto 3¼-by-7⅜-inch computer punch-cards. After that, you waited so you could feed the cards through the mechanical card-reader. On a good day, it didn’t jam … much. Finally, you waited for the mainframe to run your program and the printer to output your results. Then the priesthood would transfer the printouts to bins for pickup. When you picked up your output sometimes all you got was a page of error codes. You had to decipher the codes, decide what to do next, and start the process all over again. Life wasn’t slower back then, it just required more waiting.

In the 1970s, personal computers, or what would eventually evolve into what we now know as PCs, were like mammals during the Jurassic period, hiding in protected niches while the mainframe dinosaurs ruled. Before 1974, most PCs were built by hobbyists from kits. The MITS Altair is generally acknowledged as the first personal computer, although there are more than a few other claimants. Consumer-friendly PCs were a decade away. (My first PC was a Radio Shack TRS-80, AKA Trash 80, that I got in 1980; it didn’t do any statistics but I did learn BASIC and word processing.) Big businesses had their mainframes but smaller businesses didn’t have any appreciable computing power until the mid-1980s. By that time, statistical software for PCs began to spring out of academia. There was a ready market of applied statisticians who learned on a mainframe using SAS and SPSS but didn’t have them in their workplaces.

Feline PC watching feline mainframe.

Age of Data-Wrangling: 1980s-2000s

Statistical analysis changed a lot after the 1970s. Punch cards and their supporting machinery became extinct. Mainframes were becoming an endangered species, having been exiled to specialty niches by PCs that could sit on a desk. Secure, climate-controlled rooms weren’t needed nor were the operators. Now companies had IT Departments. The technicians sat in their own areas, where they could eat and smoke, and went out to the users who had a computer problem. It was as if all the Doctors left their hospital practices to make house calls.

Inexpensive statistical packages that ran on PCs multiplied like rabbits. All of these packages had GUIs; all were kludgy and even unusable by today’s standards. Even the venerable ancients, SAS and SPSS, evolved point-and-click faces (although you could still write code if you wanted). By the mid-1980s, you could run even the most complex statistical analysis in less time than it takes to drink a cup of coffee … so long as your computer didn’t crash.

PC sales had reached almost a million per year by 1980. But then in 1981, IBM introduced their 8088 PC. Over the next two decades, the number of IBM-compatible PCs that were sold increased annually to almost 200 million. From the early 1990s, sales of PCs had been fueled by Pentium-speed, GUIs, the Internet, and affordable, user-friendly software, including spreadsheets with statistical functions. MITS and the Altair were long gone, now seen only in museums, but Microsoft survived, evolved, and became the apex predator.

The maturation of the Internet also created many new opportunities. You no longer had to have access to a huge library of books to do a statistical analysis. There were dozens of websites with reference materials for statistics. Instead of purchasing one expensive book, you could consult a dozen different discussions on the same topic, free. No dead trees need clutter your office. If you couldn’t find a website with what you wanted, there were discussion groups where you could post your questions. Perhaps most importantly, though, data that would have been difficult or impossible to obtain in the 1970s were now just a few mouse-clicks away, usually from the federal government.

So, with computer sales skyrocketing and the Internet becoming as addictive as crack, it’s not surprising that the use of statistics might also be on the rise. Consider the trends shown in this figure. The red squares represent the number of computers sold from 1981 to 2005. The blue diamonds, which follow a trend similar to computer sales, represent revenues for SPSS, Inc. So at least some of those computers were being used for statistical analyses.

Another major event in the 1980s was the introduction of Lotus 1-2-3. The spreadsheet software provided users with the ability to manage their data, perform calculations, and create charts. It was HUGE. Everybody who analyzed data used it, if for nothing else, to scrub their data and arrange them in a matrix. Like a firecracker, the life of Lotus 1-2-3 was explosive but brief. A decade after its introduction, it lost its prominence to Microsoft Excel, and by the time data science got sexy in the 2010s, it was gone.

With the availability of more computers and more statistical software, you might expect that there may be more statistical analyses being done. That’s a tough trend to quantify, but consider the increases in the numbers of political polls and pollsters. Before 1988, there were on-average only one or two presidential approval polls conducted each month. Within a decade, that number had increased to more than a dozen. In the figure, the green circles represent the number of polls conducted on presidential approval. This trend is quite similar to the trends for computer sales and SPSS revenues. Correlation doesn’t imply causation but sometimes it sure makes a lot of sense.

Perhaps even more revealing is the increase in the number of pollsters. Before 1990, the Gallup Organization was pretty much the only organization conducting presidential approval polls. Now, there are several dozen. These pollsters don’t just ask about Presidential approval, either. There are a plethora of polls for every issue of real importance and most of the issues of contrived importance. Many of these polls are repeated to look for changes in opinions over time, between locations, and for different demographics. And that’s just political polls. There has been an even faster increase in polling for marketing, product development, and other business applications. Even without including non-professional polls conducted on the Internet, the growth of polling has been exponential.

Statistics was going through a phase of explosive evolution. By the mid-1980s, statistical analysis was no longer considered the exclusive domain of professionals. With PCs and statistical software proliferating and universities requiring a statistics course for a wide variety of degrees, it became common for non-professionals to conduct their own analyses.  Sabermetrics, for example, was popularized by baseball professionals who were not statisticians. Bosses who couldn’t program the clock on a microwave thought nothing of expecting their subordinates to do all kinds of data analysis. And they did. It’s no wonder that statistical analyses were becoming commonplace wherever there were numbers to crunch.

Big Data cat.

Against that backdrop of applied statistics came the explosion of data wrangling capabilities. Relational databases and SQL (Structured Query Language, often pronounced “sequel”) data retrieval became the vogue. Technology also exerted its influence. Not only were PCs becoming faster but, perhaps more importantly, hard disk drives were getting bigger and less expensive. This led to data warehousing, and eventually, the emergence of Big Data. Big Data brought Data Mining and black-box modeling. BI (Business Intelligence) emerged in 1989, mainly in major corporations.

Then came the 1990s. Technology went into overdrive. Bulletin Board Systems (BBSs) and Internet Relay Chat (IRC) evolved into instant messaging, social media, and blogging. The amount of data generated by and available from the Internet skyrocketed. Google and other search engines proliferated. Data sets were now not just big, they were BIG. Big Data required special software, like Hadoop, not just because of its volume but also because much of it was unstructured.

At this point, applied statisticians and programmers had symbiotic, though sometimes contentious, relationships. For example, data wranglers always put data into relational databases that statisticians had to reformat into matrices before they could be analyzed. Then, 1995-2000 brought the R programming language. This was notable for several reasons. Colleges that couldn’t afford the licensing and operational costs of SAS and SPSS began teaching R, which was free. This had the consequence of bringing programming back to the applied-statistics curriculum. It also freed graduates from worrying about having a way to do their statistical modeling at their new jobs wherever they might be.

Conducting a data analysis in the 1990s was nowhere near as onerous as it was twenty years before. You could work at your desk on your PC instead of camping out in the computer room. Many companies had their own data wranglers who built centralized data repositories for everyone to use. You didn’t have to enter your data manually very often, and if you did, it was by keyboarding rather than keypunching. Big companies had their big data but most data sets were small enough to handle in Access if not Excel. Cheap, GUI-equipped statistical software was readily available for any analysis Excel couldn’t handle. Analyses took minutes rather than hours. It took longer to plan an analysis than it did to conduct it. Anyone who took a statistics class in college began analyzing their own data. The 1990s produced a lot of cringeworthy statistical analyses and misleading charts and graphs. Oh, those were the days.

The 2000s brought more technology. Most people had an email account. You could bring a library of ebooks anywhere. Cell phones evolved into smartphones. Flash drives made datasets portable. Tablets augmented PCs and smartphones. Bluetooth facilitated data transfer. Then something else important happened—funding.

Funding for activities related to data science and big data became available from a variety of sources, especially government agencies like the National Science Foundation, the National Institutes of Health, and the National Cancer Institute. (The NIH released its first strategic plan for data science in 2018.) The UK government also funds training in artificial intelligence and data science. So too do businesses and non-profit organizations. Major universities responded by expanding their programs to accommodate the criteria that would bring them the additional funding. What had been called applied statistics and programming were rebranded as data science and big data.

Donoho captured the sentiment of statisticians in his address at the 2015 Tukey Centennial workshop:

Data Scientist means a professional who uses scientific methods to liberate and create meaning from raw data. … Statistics means the practice or science of collecting and analyzing numerical data in large quantities.

To a statistician, [the definition of data scientist] sounds an awful lot like what applied statisticians do: use methodology to make inferences from data. … [the] definition of statistics seems already to encompass anything that the definition of Data Scientist might encompass …

The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers.

Feline statistician observing feline data scientist.

Age of Data Science: 2010s-Present

The rest of the story of Data Science is more clearly remembered because it is recent. Most of today’s data scientists hadn’t even graduated from college by the 2010s. They might remember, though, the technological advances, the surge in social connectedness, and the money pouring into data science programs in anticipation of the money that would be generated from them. Those factors led to a revolution.

The average age of data scientists in 2018 was 30.5; the median was lower. The younger half of data scientists were just entering college in the 2000s, just when all that funding was hitting academia. (FWIW, I’m in the imperceptibly tiny bar on the upper left of the chart along with 193 others.) But KDnuggets concluded that:

“… rather than attracting individuals from new demographics to computing and technology, the growth of data science jobs has merely creating [sic] a new career path for those who were likely to become developers anyway.”

The event that propelled Data Science into the public’s consciousness, though, was undoubtedly the 2012 Harvard Business Review article that declared data scientist to be the sexiest job of the 21st century. The article by Davenport and Patil described a data scientist as “a high-ranking professional with the training and curiosity to make discoveries in the world of big data.” Ignoring the thirty-year history of the term, though not the concept which was new, the article notes that there were already “thousands of data scientists … working at both start-ups and well-established companies” in just five years. I doubt they were all high-ranking.

Davenport and Patil attributed the emergence of data-scientist as a job title to the varieties and volumes of unstructured Big Data in business. But a consistent definition of data scientist proved to be elusive. Six years later in 2018, KDnuggets described Data Science as an interdisciplinary field at the intersection of Statistics, Computer Science, Machine Learning, and Business, quite a bit more specific than the HBR article. There were also quite a few other opinions about what data science actually was. Everybody wanted to be on the bandwagon that was sexy, prestigious, and lucrative.

The numbers of Google searches related to topics concerning data reveal the popularity, or at least the curiosity, of the public. Topics related to search term statistics—most notably statistics, data mining, and data warehouse—all decreased in popularity from about 80 searches per month in 2004 to 25 searches per month in 2020. Six Sigma and SQL were somewhat more popular than these topics between 2004 and 2011. Computer Programming rose in popularity slightly from 2014 to 2016. Business Intelligence followed a pattern similar to SQL but had 10 to 30 more searches per month.

Topics related to the search term data science (Data Science, Big Data, and Machine Learning) had fewer than 20 searches per month from 2004 until 2012 when they began increasing rapidly. Big Data peaked in 2014 then decreased steadily. Data Science and Machine Learning increased until about 2018 and then leveled off. The term Python has increased from about 35 searches per month in 2013 to 90 searches per month in 2020. The term Artificial Intelligence decreased from 70 searches per month in 2004 to a minimum of 30 searches per month from 2008 to 2014, then increased to 80 searches per month in 2019.

While people believe Artificial Intelligence is a relatively recent field of study, mostly an idea of science fiction, it actually goes back to ancient history. Autopilots in airplanes and ships date back to the early 20th century; now we have driverless cars and trucks. Computers, perhaps the ultimate AI, were first developed in the 1940s. Voice recognition began in the 1950s; now we can talk to Siri and Cortana. Amazon and Netflix tell us what we want to do. But perhaps the single event that caught the public’s attention was in 1997 when Deep Blue became the first computer AI to beat a reigning world chess champion, Garry Kasparov. This led to AI being applied to other games, like Go and Jeopardy, which increased the public’s awareness of AI.

Aviation went from its first flight to landing on the moon in 65 years. Music went from vinyl to tape to disk to digital in 30 years. Data science overtook statistics in popularity in less than a decade.

It is interesting to compare the patterns of searches for the terms: statistics; AI; big data; ML; and data science. Everybody knows what statistics is. They see statistics every day on the local weather reports. AI entered the public’s consciousness with the game demonstrations and movies, like Terminator and Star Wars. Big data isn’t all that mysterious, especially since the definition is rock solid even if new V-definitions appear occasionally. But ML and data science are more enigmatic. ML is conceptually difficult to understand because, unlike AI, it is far from what the public sees. The definition of data science, however, suffers from too much diversity of opinion. In the 1970s, Tukey and Naur had diametrically opposed definitions. Many others since then have added more obfuscation than clarity. Fayyad and Hamutcu conclude that “there is no commonly agreed on definition for data science,” and furthermore, “there is no consensus on what it is.”

So, universities train students to be data scientists, businesses hire graduates to work as data scientists, and people who call themselves data scientists write articles about what they do. But as a profession, we can’t agree on what data science is. As Humpty Dumpty said:

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.” “The question is,” said Alice, “whether you can make words mean so many different things.” “The question is,” said Humpty Dumpty, “which is to be master—that’s all.”

Lewis Carroll (Charles L. Dodgson), Through the Looking-Glass, chapter 6, p. 205 (1934). First published in 1872.
Feline looking for definition of Data Science.

What Is A Data Scientist?

The term data scientist has never had a consistent meaning. Tukey’s followers thought it applied to all applied statisticians and data analysts. Naur’s followers thought it referred to all programmers and data wranglers. Those were both collective nouns, but they were exclusive. Tukey’s definition excluded data wranglers. Naur’s definition excluded data analysts. Almost forty years later, Davenport and Patil used the term for anyone with the skills to solve problems using Big Data from business. Some of today’s definitions specify that individual data scientists must be adept at wrangling, analysis, modeling, and business expertise. Of course there are disagreements.

Skills—Some definitions redefine what the skills are. Statistics is the primary example. Some definitions limit statistics to hypothesis testing even though modeling and prediction have been part of the discipline for over a century. The implication is that anything that isn’t hypothesis testing isn’t statistics.

Data—Some definitions specify that data science uses Big Data related to business. The implication is that smaller data sets from non-business domains are not part of data science.

Novelty—Some definitions focus on new, especially state-of-the-art technologies and methods over traditional approaches. Data generation is the primary example. Modern technologies, like automatic web scraping with Python, are key methods of some definitions of data science. The implication is that traditional probabilistic sampling methods are not part of data science.

Specialization—Some definitions require data scientists to be multifaceted, generalist, jacks-of-all-trades. This strategy of skills has been abandoned by virtually all scientific professions. As Healy suggested, you can’t expect a computer programmer to be a statistician any more than you can expect a statistician to be a programmer. Yes, there still are generalists, nexialists, interdisciplinarians; they make good project managers and maybe even politicians. But, would you go to a GP (general practitioner) for cancer treatments?

These disagreements have led to some disrespectful opinions—you’re not a real data scientist, you’re a programmer, statistician, data analyst, or some other appellation. So, the fundamental question is whether the term data science refers to a big tent that holds all the skills, and methods, and types of data that might solve a problem or it refers to a small tent that can only hold the specific skills useful for Big Data from business.

What’s in a name? That which we call a rose by any other name would smell as sweet.

William Shakespeare, Romeo and Juliet, Act II, Scene II

The definition of data science is a modern retelling of the parable of the blind men and an elephant:

A group of blind men heard that a strange animal, called an elephant, had been brought to the town, but none of them were aware of its shape and form. Out of curiosity, they said: “We must inspect and know it by touch, of which we are capable”. So, they sought it out, and when they found it they groped about it. The first person, whose hand landed on the trunk, said, “This being is like a thick snake”. For another one whose hand reached its ear, it seemed like a kind of fan. As for another person, whose hand was upon its leg, said, the elephant is a pillar like a tree-trunk. The blind man who placed his hand upon its side said the elephant, “is a wall”. Another who felt its tail, described it as a rope. The last felt its tusk, stating the elephant is that which is hard, smooth and like a spear.

Data science is an elephant. The harder we try to define it the more unrecognizable it becomes. Is it a collective noun or an exclusionary filter? There is no consensus. But that’s the way the world works. Maybe in fifty years, colleges will have programs to train Wisdom Oracles to take the work of pedestrian data scientists and turn it into something really useful.

Anything that is in the world when you’re born is normal and ordinary and is just a natural part of the way the world works. Anything that’s invented between when you’re fifteen and thirty-five is new and exciting and revolutionary and you can probably get a career in it. Anything invented after you’re thirty-five is against the natural order of things.

Douglas Adams

All photos and graphs by author except as noted.

PII. The Great Taboo of Data Analysis

By JenWaller is licensed under CC BY-NC-SA 2.0

Many of us have had our personal information stolen from the Internet, some more than once. Even governments can’t prevent the thefts. Professionals who work with data come under a lot of scrutiny because of PII.

What Is PII?

Personally identifiable information (PII) is any data that can be used to identify a specific individual. PII is used mainly within the U.S.; Personal Data is the approximate equivalent of PII in Europe. Examples of PII include:

  • Full name. This depends on your name and the population you’re looking in. If you’re looking for Charlie Kufs in the U.S., you’ll find just one. If you’re looking for John Smith in the world, you’ll find quite a few.
  • Home address. A full address is usually unique although there may be aliases. A name plus a home address is good enough for a voter registration.
  • Date of birth. A lot of people have the same date-of-birth but tie it to a name and you’ll probably have a unique identification.
  • Email addresses and telephone numbers. Everybody seems to have many, some of which may be linked back to a real person.
  • Personal IDs. Social security number, passport number, driver’s license number, and credit card numbers. These are unique and will identify a specific individual even better than a name.
  • Computer-related details. Log-in and usage information, device IDs, IP addresses, GPS tracking, cell phone records, and social media links. Police caught the BTK killer because he sent them a document that had his name and location in the metadata.
  • Biometric data. Finger and palm prints, retinal scans, and DNA profile. They’ll find you; they always do.

To these PII elements I would add information that would not be able to identify a specific person unless combined with other information. Examples include: gender, race, age range, former address, and education and work history. There is also personal information that might be a password security question—first pet, grandmother’s maiden name, least favorite boss, first person you kissed, favorite teacher—although I don’t know why this information would be in a database.

Where Does PII Come From?

PII can come from a variety of sources. You can generate PII by conducting surveys, for instance. You can obtain it from a caretaker, like a Human Resources Department of an organization. Or, you can collect it in a variety of ways from the Internet.

Once you have your PII, you have to scrub it, which is a whole other discussion. Ultimately, you have to decide what is worth keeping for your analysis and what should be deleted immediately to prevent its loss. If you keep it for analysis, be sure you have a plan for what analyses you plan to conduct and how you’re going to safeguard the data when you’re not actively using it.

I’ve analyzed a lot of data in my career in both the private sector and in government. Federal PII requirements are much more strict, by a lot. We had to complete training on PII and computer security every year. I wasn’t supposed to keep PII on my work computer or in the cloud; it was only supposed to reside on the secure government server. The data set had to be password-protected and not shared with co-workers without a “business need to know.” And I wasn’t working for the NSA either. This was the General Services Administration. They manage federal buildings and provide office supplies and other things. They have some sensitive data (SBU) but nothing classified as even Secret. Nonetheless, I used some types of PII data quite often.

I obtained data from a variety of sources. My co-workers in our data analytics group (I can’t remember what it was called; it went through several name changes) provided most of the internal business data. I often supplemented that with data from sources on the Internet, usually other government databases, some public and some restricted. PII data came from Human Resources, usually requiring approvals from at least one higher level of management. Sometimes requests had to go through Headquarters offices in Washington D.C. I also conducted my own surveys, which also had to be approved by Headquarters. With all those different sources, data compilation and scrubbing was always an adventure.

The data sets I analyzed were minuscule by big-data standards. I almost always had fewer than thousands of rows and hundreds of columns. Typically, I had just hundreds of rows and dozens of columns. Most of the PII data I analyzed came from individuals in the same organization I worked in, though I did do a few analyses for outside organizations. Consequently, I was usually able to develop a good rapport with my data.

I rarely received social security numbers and other personal ID numbers. They may have been useful for sorting and data merges but there were always other data elements that could be used instead. I’ve never had a reason to use them so I deleted them immediately. I also routinely deleted log-in details, telephone numbers, and all but one of the multiple versions of name and email address that might be in a data set, mostly because they were an extraneous nuisance. Other PII that I saw often was an employee’s ID number and organizational unit, which I usually kept, and their supervisor, which I usually deleted as superfluous.

By Librarian Avenger, licensed under CC BY 2.0

What Do Analysts Do With PII?

My analyses that involved PII covered a range of business issues involving staff, both individually and in groups. Examples included: staff recruitment, hiring, demographics, engagement, satisfaction, morale, productivity, capabilities, and workload; telework and wanderwork; and usage and preferences for data, cell phones, and computer hardware and software.

For these analyses, I used name, email address, and employee ID number for sorts and merges. I used home address for one analysis to assess employee commutes. I used race on one occasion to assess hiring practices. Getting that information was involved and required a lot of persistence. I used log-in information for online surveys to evaluate survey difficulty and patterns of responses.

I used sex and date-of-birth on almost every analysis I conducted concerning staff. In all those analyses, sex was never a significant factor. Still, it was important to verify that non-significance. I used birth date to calculate age. From that I could also determine the age they joined the agency and a few other age-related employment factors. Age was a significant factor in a great many of my analyses. I also used age to evaluate employees’ generations. My boss was a Gen-Xer who was convinced Millennials did not behave like older staff members. None of the analyses he had me do suggested that generation was any more than a minor factor. In those cases, the ratio-scaled age was a much better explanatory variable.

One time during a slow period over the end-of-year holidays I decided to have some fun and calculated the zodiac signs for the staff from the birth dates, then I conducted the same analyses of staff characteristics and preferences that I had previously completed. Not surprisingly, nothing related to astrological sign was significant, but now at least I have analytical evidence.

Data analysts are only interested in population characteristics. You are of no real interest as an individual. It’s true, “you’re just a statistic” unless you are some kind of crazy outlier. In that case, you might be interesting.

By Steve took it  licensed under CC BY-NC-SA 2.0

Anecdotes and Big Data

WHAT ARE THE ODDS?

Probability is the core of statistics. You hear the phrase “what’s the probability of …” all the time in statistics. You hear that phrase in everyday life, too. What’s the probability of rain tomorrow? What’s the probability of winning the lottery? You also hear the phrase “what are the odds …?” What are the odds my team will win the game? What are the odds I’ll catch the flu? It sounds just like probability, but it’s not.

Probability is defined as the number of favorable events divided by the total number of events. Probabilities range from 0 (0%) to 1 (100%). To convert from a probability to odds, divide the probability by one minus that probability.

Odds are defined as the number of favorable events divided by the number of unfavorable events. This is the same as saying that odds are defined as the probability that the event will occur divided by the probability that the event will not occur. Odds range from 0 to infinity. To convert from odds to a probability, divide the odds by one plus the odds.

A probability of 0 is the same as odds of 0. Probabilities between 0 and 0.5 equal odds less than 1.0. A probability of 0.5 is the same as odds of 1.0. As the probability goes up from 0.5 to 1.0, the odds increase from 1.0 to infinity.
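
The two conversions described above can be sketched in a few lines of Python. This is only an illustration; the function names are made up for this example, and exact fractions are used so the results match the 1-in-5 and 1-to-4 figures without rounding.

```python
import math
from fractions import Fraction

def probability_to_odds(p):
    """Convert a probability (0 to 1) to odds in favor: p / (1 - p)."""
    if not 0 <= p <= 1:
        raise ValueError("a probability must be between 0 and 1")
    if p == 1:
        return math.inf  # certainty corresponds to infinite odds
    return p / (1 - p)

def odds_to_probability(odds):
    """Convert odds in favor (0 to infinity) to a probability: odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds cannot be negative")
    if odds == math.inf:
        return 1
    return odds / (1 + odds)

# A probability of 1 in 5 corresponds to odds of 1 to 4, and back again.
print(probability_to_odds(Fraction(1, 5)))   # 1/4
print(odds_to_probability(Fraction(1, 4)))   # 1/5
```

The same functions accept ordinary decimals, so a probability of 0.5 returns odds of 1.0.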

Probability and odds are expressed in several ways, as a fraction, as a decimal, as a percentage, or in words. Words used to express probability are “out of” and “in.” For example, a probability could be expressed as 1/5, 0.2, 20%, 1 out of 5, or 1 in 5. The words used to express odds are “-” and “to.” For example, odds could be expressed as 1/4, 0.25, 25%, 1-4, or 1 to 4. There are even more ways to express odds for gambling, but you won’t have to know about them to pass Stats 101. In fact, you might not even hear about expressing uncertainty using odds in an introductory statistics course. Odds become important in more advanced statistics classes in which odds ratio and relative risk are discussed.

Odds, like probabilities, come from four sources—logic, data, oracles, and models. Perhaps the best-known examples of odds creation come from the oracles in Las Vegas. Those oracles, called bookmakers, decide what the odds should be for a particular bet. Historically, before computers, bookmakers would compile as much information as they could about prior events and conditions related to the bet, then make educated guesses about what might happen based on their experience, intuition, and judgment. With computers, this job has become bigger, faster, more complicated, and more reliant on statistics and technology. The field of sports statistics has exploded since the 1970s when baseball statistics (sabermetrics) emerged. Modern gambling odds even fluctuate to take account of wagers placed before the actual event. To keep up with the demand, bookmakers now employ teams of Traders who compile the data and use statistical models to estimate the odds. But you’ll never have to know that for Stats 101.

Next, the answer to the question “which came first, the model or the odds?”

WHY YOU NEED TO TAKE STATS 101

Whether you’re in high school, college, or a continuing education program, you may have the opportunity, or be required, to take an introductory course in statistics, call it Stats 101. It’s certainly a requirement for degrees in the sciences and engineering, economics, the social sciences, education, criminology, and even some degrees in history, culinary science, journalism, library science, and linguistics. It is often a requirement for professional certifications in IT, data science, accounting, engineering, and health-related careers. Statistics is like hydrogen, it’s omnipresent and fundamental to everything. It’s easier to understand than calculus and you’ll use it more than algebra. In fact, if you read the news, watch weather forecasts, or play games, you already use statistics more than you realize.

Download this blog to find out why.

Why You Need to Take Stats 101

PROBABILITY IS SIMPLE … KINDA

Language isn’t very precise in dealing with uncertainty. Probably is more certain than possibly, but who knows where dollars-to-doughnuts falls on the spectrum. Statistics needs to deal with uncertainty more quantitatively. That’s where probability comes in. For the most part, probability is a simple concept with fairly straightforward formulas. The applications, however, can be mind bending.

Where Do Probabilities Come From?

The first question you may have is “where do probabilities come from?” They sure don’t come from arm-waving know-it-alls at work or anonymous memes on the internet. Real probabilities come from four sources:

Logic – Some probabilities are calculated from the number of logical possibilities for a situation. For example, there are two sides to a coin so there are two possibilities. Standard dice have six sides so there are six possibilities. A standard deck of playing cards has 52 cards so there are 52 possibilities. The formulas for calculating probabilities aren’t that difficult either. By the time you finish reading this blog, you’ll be able to calculate all kinds of probabilities.

Data – Some probabilities are based on surveys of large populations or experiments repeated a large number of times. For example, the probability of a random American having a blood type of A-positive is 0.34 because approximately 34% of the people in the U.S. have that blood type. Likewise, there are probabilities that a person is a male (0.492), a non-Hispanic white (0.634), having brown eyes (0.34) and brown hair (0.58). The internet has more data than you can imagine for calculating probabilities.

Oracles – Some probabilities come from the heads of experts. Not the arm-wavers on the internet, but real professionals educated in a data-driven specialty. Experts abound. Sports gurus live in Las Vegas, survey builders reside largely in academia, and political prognosticators dwell everywhere. Some experts make predictions based on their knowledge and some use their knowledge to build predictive models. Probabilities developed from expert opinions may not all be as reliable as probabilities based on logic or data, but we rely heavily on them.

Models – A great many probabilities are derived from mathematical models, built by experts from data and scientific principles. You hear meteorological probabilities developed from models reported every night on the news, for instance. These models help you plan your daily life, so their impact is great. Perhaps even more importantly, though, are the probabilistic models that serve as the foundation of statistics itself, like the Normal distribution.

Turning Probably Into Probability

Probabilities are discussed in terms of events or alternatives or outcomes. They all refer to the same thing, something that can happen. A few basic things you need to know about the probability of events are:

  • The probability of any outcome or event can range only from 0 (no chance) to 1 (certainty).
  • The probability of a single event is equal to 1 divided by the number of possible events.
  • If more than one event is being considered as favorable, then the probability of the favorable events occurring is equal to the number of favorable events divided by the number of possible events.

These rules are referred to as simple probability because they apply to the probability of a single independent, disjoint event from a single trial. (Independent and disjoint events are described in the following paragraphs.) A trial is the activity you perform, also called a test or experiment, to determine the probability of an event. If the probability involves more than one trial, it is called a joint probability. Joint probabilities are calculated by multiplying together the relevant simple probabilities. So, for example, if the probability of event A is 0.3 (30%) and the probability of event B is 0.7 (70%), the joint probability of the two events occurring is 0.3 times 0.7, or 0.21 (21%). The probability of a brown-eyed (0.34), brown-haired (0.58), non-Hispanic-white (0.63) male (0.49) having A-positive blood (0.34), for instance, is only 2% if those traits are treated as independent, quite rare. Joint probabilities also range only from 0 to 1.
Events can be independent of each other or dependent on other events. For example, if you roll a dice or flip a coin, there is no connection between what happens in each roll or flip. They are independent events. On the other hand, if you draw a card from a standard deck of playing cards, your next draw will be different from the first because there are now fewer cards in the deck. Those two draws are called dependent events. Calculating the probability of dependent events has to account for changes in the number of total possible outcomes. Other than that, the formula for probability calculations is the same.
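
Here is a small Python sketch of joint probabilities for independent and dependent events. The function name is made up for illustration, and the two-aces example is a hypothetical case that is not part of the discussion above.

```python
from fractions import Fraction
from math import prod

def joint_probability(probabilities):
    """Multiply the simple probabilities of independent events."""
    return prod(probabilities)

# Independent events: the trait example, treating each trait as independent.
traits = [0.34, 0.58, 0.63, 0.49, 0.34]
print(round(joint_probability(traits), 3))  # about 0.021, roughly 2%

# Dependent events (hypothetical example): drawing two aces in a row
# without putting the first card back. The second draw has one fewer
# ace and one fewer card to choose from.
first_ace = Fraction(4, 52)
second_ace = Fraction(3, 51)
two_aces = first_ace * second_ace
print(two_aces)  # 1/221
```

The multiplication is the same in both cases. What changes for dependent events is the count of possible outcomes that goes into each simple probability.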

Some outcomes don’t overlap. They are one-or-the-other. They both can’t occur at the same time. These outcomes are said to be mutually exclusive or disjoint. Examples of disjoint outcomes might involve coin flips, dice rolls, card draws, or any event that can be described as either-or. For disjoint events, the probability that one or another of them occurs is the sum of their individual probabilities. This is called the Rule of Special Addition. When the disjoint events cover every possible outcome, their probabilities sum to 1. This is called the Rule of Complementary Events.

Some outcomes do overlap. They can both occur at the same time. These outcomes are called non-disjoint. Examples of non-disjoint outcomes include a student getting a grade of B in two different courses, a used car having heated seats and a manual transmission, and a playing card being a queen and in a red suit. For non-disjoint events, the probability that at least one of them occurs is the sum of the simple probabilities minus the probability that the events occur together. This is called the Rule of General Addition. The probability of one event occurring given that another event has already occurred is called a Conditional Probability.
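
The addition rules can be checked by counting cards, using the standard textbook form of the rules. This is a rough sketch in Python; the deck layout and the helper name `prob` are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs

def prob(event):
    """Favorable cards divided by total cards."""
    favorable = sum(1 for card in DECK if event(card))
    return Fraction(favorable, len(DECK))

def is_queen(card):
    return card[0] == "Q"

def is_king(card):
    return card[0] == "K"

def is_red(card):
    return card[1] in ("diamonds", "hearts")

# Disjoint events (a card can't be a queen and a king): just add.
p_queen_or_king = prob(is_queen) + prob(is_king)
print(p_queen_or_king)  # 2/13

# Non-disjoint events (a card can be a queen and red): add, then
# subtract the overlap so the two red queens aren't counted twice.
p_red_queen = prob(lambda c: is_queen(c) and is_red(c))
p_queen_or_red = prob(is_queen) + prob(is_red) - p_red_queen
print(p_queen_or_red)  # 7/13
```

Counting the cards directly gives the same answers: 8 of 52 cards are queens or kings, and 28 of 52 are queens or red.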

Chart of Probability Formulas
There is a LOT more to probability than that, but that’s enough to get you through Stats 101. Read through the examples to see how probability calculations work.

You’ve Probably Thought of These Examples

Probability does not indicate what will happen, it only suggests how likely, on a fixed numerical scale, something is to happen. If it were definitive, it would be called certainility not probability. Here are some examples.

Coins. Find a coin with two different sides, call one side A and the other side B.

What is the probability that B will land facing upward if you flip the coin and let it land on the ground?

  • Probability = number of favorable events / total number of events
  • Probability = 1 / 2
  • Probability = 50% or 0.5 or ½ or 1 out of 2.

This is a probability calculation for two independent, disjoint outcomes. Coin edges aren’t included in the total-number-of-events because the probability of flipping a coin so it lands on an edge is much, much smaller than the probability of flipping a coin so it lands on a side. Alternative events have to have an observable (non-zero) probability of occurring for a calculation to be valid.

Probability can be expressed in several ways — as a percentage, as a decimal, as a fraction, or as the relative frequency of occurrence. 

Using the same coin, record the results of 100 coin-flips. Count the number of times the results were the A-side and how many times the results were the B-side. Then flip the coin one more time.

What is the probability that side B will land facing upward?

  • Probability = number of favorable events / total number of events
  • Probability = 1 / 2
  • Probability = 50% or 0.5 or ½ or 1 out of 2.

Each coin flip is independent of the results of every other coin flip. So, whether you flip the coin 100 times or a million times, the probability of the next flip will always be ½. When you flipped the coin 100 times, for instance, you might have recorded 53 B-sides and 47 A-sides. The probability of the B-side facing upward after a flip would NOT be 53/100 because the flips are independent of each other. What happens on one flip has no bearing on any other flip.
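
If you don't want to flip a coin 100 times, a simulation gives the same feel. This is a minimal sketch assuming a fair coin; the function name and the seed values are arbitrary choices for the example.

```python
import random

def flip_coin(n_flips, seed=None):
    """Simulate n_flips of a fair coin with sides A and B."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    return [rng.choice("AB") for _ in range(n_flips)]

# One run of 100 flips: the share of B-sides will wander around 0.5.
flips = flip_coin(100, seed=1)
observed_b = flips.count("B") / len(flips)
print(f"Share of B-sides in 100 flips: {observed_b:.2f}")

# Many more flips: the share settles closer to 0.5.
many_flips = flip_coin(100_000, seed=2)
share_b = many_flips.count("B") / len(many_flips)
print(f"Share of B-sides in 100,000 flips: {share_b:.3f}")

# Whatever the tallies say, the next flip of a fair coin is still 1 in 2.
```

The tally from a short run is a record of what happened. It doesn't change the fair coin's probability on the next flip.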

Toast. Make two pieces of toast and spread butter on one side of each. Eat one and toss the other into the air.

What is the probability that the buttered side will land facing upward?

  • Probability = number of favorable events / total number of events
  • Probability = 1 / 2
  • Probability = 50% or 0.5 or ½ or 1 out of 2.

WRONG. You knew this was wrong because the buttered side of a piece of toast usually lands facing down. That’s the result of the buttered side being heavier than the unbuttered side. The two sides aren’t the same in terms of characteristics that will dictate how they will land. For probability calculations to be valid, each event has to have a known, constant chance of occurring. Unlike a coin, the toast is “loaded” so that the heavier side faces downward more often. Now, say you knew the buttered side landed downward 85% of the time. You could calculate the probability that the buttered side of your next toast flip will land facing upward as:

  • Probability = number of favorable events / total number of events
  • Probability = (100-85) / 100 = 15 / 100
  • Probability = 15% or 0.15 or 3/20 or 1 out of 6⅔.

To make this calculation valid, all you would have to do is establish that, when tossed into the air, buttered toast will land with its unbuttered-side upward a constant percentage of the time. So, make 100 pieces of toast and butter one side of each …. Let me know how this turns out.

Standard Dice. Standard dice are six-sided with a number (from 1 to 6) or a set of 1 to 6 small dots (called pips) on each side. The numbers (or number of pips) on opposite sides sum to 7.

What is the probability that a 6 (or 6 pips) will land facing upward when you toss the dice?

 

  • Probability = number of favorable events / total number of events
  • Probability = 1 / 6
  • Probability = 17% or 0.167 or ⅙ or 1 out of 6.

This is a probability calculation for 6 independent, disjoint outcomes.

What is the probability that an even number (2, 4, or 6) will land face upward when you toss the dice?

  • Probability = number of favorable events / total number of events
  • Probability = 3 / 6
  • Probability = 50% or 0.5 or ½ or 1 out of 2.

This calculation considers 3 sides of the dice to be favorable outcomes.

What is the probability that a 1 will land face upward on two consecutive tosses of the dice?

  • Probability = (Probability of Event A) times (Probability of Event B)
  • Probability = (1 / 6) times (1 / 6)
  • Probability = 3% or 0.028 or 1/36 or 1 out of 36.

This calculation estimates the joint probability of 2 independent, disjoint outcomes occurring based on 2 rolls of 1 dice.

What is the probability that, using 2 dice, you will roll “snake eyes” (only 1 pip on each dice)?

  • Probability = (Probability of Event A) times (Probability of Event B)
  • Probability = (1 / 6) times (1 / 6)
  • Probability = 3% or 0.028 or 1/36 or 1 out of 36.

This calculation also estimates the joint probability of 2 independent, disjoint outcomes occurring based on 1 roll of 2 dice.

What is the probability that a 6 will land face upward on 3 consecutive rolls of the dice?

  • Probability = (Probability of Event A) times (Probability of Event B) times (Probability of Event C)
  • Probability = (1 / 6) times (1 / 6) times (1 / 6)
  • Probability = 0.5% or 0.0046 or 1/216 or 1 out of 216.

Roll 1 dice 3 times or 3 dice 1 time; if you get 666, either the dice is loaded or the Devil is messing with you.
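
You can check the dice answers by listing every possible outcome and counting the favorable ones. This is a sketch in Python, assuming fair six-sided dice; the helper name is made up for the example.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)  # a standard six-sided dice

def dice_probability(event, n_rolls=1):
    """Count favorable outcomes among all equally likely sequences of rolls."""
    outcomes = list(product(FACES, repeat=n_rolls))
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

p_six = dice_probability(lambda o: o[0] == 6)
p_even = dice_probability(lambda o: o[0] % 2 == 0)
p_snake_eyes = dice_probability(lambda o: o == (1, 1), n_rolls=2)
p_666 = dice_probability(lambda o: o == (6, 6, 6), n_rolls=3)

print(p_six)         # 1/6
print(p_even)        # 1/2
print(p_snake_eyes)  # 1/36
print(p_666)         # 1/216
```

Listing outcomes treats two rolls of 1 dice and one roll of 2 dice the same way, which is why both calculations give 1 out of 36.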

DnD Dice. Dice used to play Dungeons and Dragons (DnD) have different numbers of sides, usually 4, 6, 8, 10, 12, and 20.

What is the probability that 6 will land facing upward when you throw a 20-sided (icosahedron) dice?

  • Probability = number of favorable events / total number of events
  • Probability = 1 / 20
  • Probability = 5% or 0.05 or 1/20 or 1 out of 20.

This is a probability calculation for 20 independent, disjoint outcomes.

What is the probability that a 6 will land face upward on 3 consecutive tosses of the 20-sided (icosahedron) dice?

  • Probability = (Probability of Event A) times (Probability of Event B) times (Probability of Event C)
  • Probability = (1 / 20) times (1 / 20) times (1 / 20)
  • Probability = 0.01% or 0.000125 or 1/8,000 or 1 out of 8,000.

So, you have a smaller chance of summoning the Devil by rolling 666 if you use a 20-sided DnD dice instead of a standard 6-sided dice.

What is the probability that 6 will land facing upward when you throw a 4-sided (tetrahedron, Caltrop) dice?

  • Probability = number of favorable events / total number of events
  • Probability = 0 / 4
  • Probability = 0% or 0.0 or 0/4 or 0 out of 4.

There is no 6 on the 4-sided dice. Not everything in life is possible.
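
For dice with any number of sides, the same answers come straight from the formula. Here is a short sketch assuming a fair dice numbered from 1 up to its number of sides; the function name is hypothetical.

```python
from fractions import Fraction

def repeat_face_probability(face, sides, rolls=1):
    """Probability that `face` comes up on every one of `rolls` rolls
    of a fair dice numbered 1 through `sides`."""
    if 1 <= face <= sides:
        single = Fraction(1, sides)
    else:
        single = Fraction(0)  # the face isn't on the dice at all
    return single ** rolls

print(repeat_face_probability(6, sides=20))           # 1/20
print(repeat_face_probability(6, sides=20, rolls=3))  # 1/8000
print(repeat_face_probability(6, sides=6, rolls=3))   # 1/216
print(repeat_face_probability(6, sides=4))            # 0
```

The check on `face` is what handles the 4-sided case: a number that isn't on the dice gets a probability of 0 no matter how many times you roll.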

Playing Cards. A standard deck of playing cards consists of 52 cards in 13 ranks (an Ace, the numbers from 2 to 10, plus a jack, queen, and king) in each of 4 suits – clubs (♣), diamonds (♦), hearts (♥) and spades (♠).

What is the probability that you will draw a 6 of clubs from a complete deck?

  • Probability = number of favorable events / total number of events
  • Probability = 1 / 52
  • Probability = 2% or 0.019 or 1/52 or 1 out of 52.

This is a probability calculation for 52 disjoint outcomes. The draw is independent because only one card is being drawn.

What is the probability that you will draw a 6 from a complete deck?

  • Probability = number of favorable events / total number of events
  • Probability = 4 / 52
  • Probability = 8% or 0.077 or 1/13 or 1 out of 13.

This is a probability calculation for 4 favorable disjoint outcomes out of 52 because there is a 6 in each of the 4 suits.

What is the probability that you will draw a club (♣) from a complete deck?

  • Probability = number of favorable events / total number of events
  • Probability = 13 / 52
  • Probability = 25% or 0.25 or ¼ or 1 out of 4.

This is a probability calculation for 13 favorable disjoint outcomes out of 52 because there are 13 club cards in the deck.

What is the probability that you will draw a club (♣) from a partial deck?

You can’t calculate that probability without knowing what cards are in the partial deck.
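The card questions all come down to counting favorable cards and dividing by the number of cards in the deck. Here is a minimal sketch of that counting. The names draw_probability, FULL_DECK, and the example partial deck are invented for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
FULL_DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs

def draw_probability(deck, is_favorable) -> Fraction:
    """Favorable cards divided by total cards for a single draw from `deck`."""
    if not deck:
        raise ValueError("cannot draw from an empty deck")
    favorable = sum(1 for card in deck if is_favorable(card))
    return Fraction(favorable, len(deck))

print(draw_probability(FULL_DECK, lambda c: c == ("6", "clubs")))  # the 6 of clubs
print(draw_probability(FULL_DECK, lambda c: c[0] == "6"))          # any 6
print(draw_probability(FULL_DECK, lambda c: c[1] == "clubs"))      # any club

# A partial deck only works once you know what is in it.
# Hypothetical example: a deck with all the hearts removed.
partial = [c for c in FULL_DECK if c[1] != "hearts"]
print(draw_probability(partial, lambda c: c[1] == "clubs"))
```

The last call shows why the partial-deck question has no answer on its own: the result depends entirely on which cards are left.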

Tarot Cards. A deck of Tarot cards consists of 78 cards, 22 in the Major Arcana and 56 in the Minor Arcana. The cards of the Minor Arcana are like the cards of a standard deck except that the Jack is also called a Knight, there are 4 additional cards called Pages, 1 in each suit, and the suits are Wands (Clubs), Pentacles (Diamonds), Cups (Hearts), and Swords (Spades).

What is the probability that you will draw a Major Arcana card from a complete deck?

  • Probability = number of favorable events / total number of events
  • Probability = 22 / 78
  • Probability = 28% or 0.282 or 11/39 or 1 out of 3.54.

This is a probability calculation for 22 disjoint outcomes. The draw is independent because only 1 card is being drawn.

What is the probability that you will draw a Knight from a complete deck?

  • Probability = number of favorable events / total number of events
  • Probability = 4 / 78
  • Probability = 5% or 0.051 or 2/39 or 1 out of 19.5.

This is a probability calculation for 4 disjoint outcomes. The draw is independent because only 1 card is being drawn.

What is the probability that you will draw Death (a Major Arcana card) from a complete deck?

  • Probability = number of favorable events / total number of events
  • Probability = 1 / 78
  • Probability = 1% or 0.0128 or 1/78 or 1 out of 78.

If the Death card turns up a lot more than 1% of the time, maybe seek professional help.

Assuming you have already drawn Death, what is the probability that you will draw either The Tower, Judgement, or The Devil (other Major Arcana cards) from the same deck?

  • Probability = number of favorable events / total number of events
  • Probability = 3 / 77
  • Probability = 4% or 0.039 or 3/77 or 1 out of 25.7.

This is a probability calculation for 77 disjoint outcomes. The draw is dependent because one card (Death) has already been drawn, leaving the deck with 77 cards. If The Tower, Judgement, or The Devil does turn up after you have already drawn Death, definitely get professional help. Do NOT use a Ouija Board to summon help.

What is the probability that you will draw Death followed by Judgement from a complete deck on sequential draws?

  • Probability = (Probability of Event A) times (Probability of Event B)
  • Probability = (1 / 78) times (1 / 77)
  • Probability = 0.02% or 0.0002 or 167/1,000,000 or 1 out of 6,006.

If you draw Death and Judgement consecutively, you are toast. Refer to the second  example.
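For dependent draws like these, the deck shrinks by one card after every draw, so each factor in the product has a smaller denominator. A small sketch, with a function name made up for the purpose:

```python
from fractions import Fraction

def sequential_draw_probability(deck_size: int, favorable_counts) -> Fraction:
    """Probability of a specific sequence of draws without replacement.

    favorable_counts[i] is the number of cards that satisfy draw i,
    given that all earlier draws in the sequence succeeded.
    """
    probability = Fraction(1)
    remaining = deck_size
    for favorable in favorable_counts:
        probability *= Fraction(favorable, remaining)
        remaining -= 1  # the drawn card is not returned to the deck
    return probability

# Death (1 card of 78) followed by Judgement (1 card of the remaining 77).
death_then_judgement = sequential_draw_probability(78, [1, 1])
print(death_then_judgement, float(death_then_judgement))

# One of three named cards from the 77 left after Death is already out.
print(sequential_draw_probability(77, [3]))
```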

Candy Bars. Your son has just returned from trick-or-treating. He inventories his stash and has: 5 Snickers; 6 Hershey’s bars; 4 Pay Days; 5 Kit Kats; 3 Butterfingers; 2 Charleston Chews; 5 Tootsie Roll bars; a box of raisins; and an apple. You throw away the apple because it’s probably full of razor blades. He throws away the raisins because they’re raisins. After the boy is asleep, you sneak into his room and, without turning on the light, find his stash. Putting your hand quietly into the bag, you realize that it’s too dark to see and all the bars feel alike.

What’s the probability that you’ll pull a Snickers out of the bag?

  • Probability = number of favorable events / total number of events
  • Probability = 5 / 30
  • Probability = 17% or 0.167 or 1/6 or 1 out of 6.

This is a probability calculation for 30 independent disjoint outcomes. The draw is independent because only 1 bar is being drawn.

What’s the probability that you’ll pull out a Snickers on your next attempt if you put back any bar you pull out that isn’t a Snickers?

  • Probability = number of favorable events / total number of events
  • Probability = 5 / 30
  • Probability = 17% or 0.167 or 1/6 or 1 out of 6.

This is called probability with replacement because by returning the non-Snickers bars to the bag, you are restoring the original total number of bars. The outcomes are independent of each other.

How many bars do you have to pull out before you have at least a 50% probability of getting a Snickers if you put the bars you pull out that aren’t Snickers into a separate pile (not back into the bag)?

  • Probability = number of favorable events / total number of events
  • 1st bar pulled Snickers probability = 5 / 30 = 17%
  • 2nd bar pulled Snickers probability = 5 / 29 = 17%
  • 3rd bar pulled Snickers probability = 5 / 28 = 18%
  • 4th bar pulled Snickers probability = 5 / 27 = 19%
  • 5th bar pulled Snickers probability = 5 / 26 = 19%
  • 6th bar pulled Snickers probability = 5 / 25 = 20%
  • 7th bar pulled Snickers probability = 5 / 24 = 21%
  • 8th bar pulled Snickers probability = 5 / 23 = 22%
  • 9th bar pulled Snickers probability = 5 / 22 = 23%
  • 10th bar pulled Snickers probability = 5 / 21 = 24%
  • 11th bar pulled Snickers probability = 5 / 20 = 25%
  • 12th bar pulled Snickers probability = 5 / 19 = 26%
  • 13th bar pulled Snickers probability = 5 / 18 = 28%
  • 14th bar pulled Snickers probability = 5 / 17 = 29%
  • 15th bar pulled Snickers probability = 5 / 16 = 31%
  • 16th bar pulled Snickers probability = 5 / 15 = 33%
  • 17th bar pulled Snickers probability = 5 / 14 = 36%
  • 18th bar pulled Snickers probability = 5 / 13 = 38%
  • 19th bar pulled Snickers probability = 5 / 12 = 42%
  • 20th bar pulled Snickers probability = 5 / 11 = 45%
  • 21st bar pulled Snickers probability = 5 / 10 = 50%
  • 22nd bar pulled Snickers probability = 5 / 9 = 56%
  • 23rd bar pulled Snickers probability = 5 / 8 = 63%
  • 24th bar pulled Snickers probability = 5 / 7 = 71%
  • 25th bar pulled Snickers probability = 5 / 6 = 83%
  • 26th bar pulled Snickers probability = 5 / 5 = 100%


This is called probability without replacement. The outcomes are dependent on how many bars have already been taken out of the bag. Each percentage above is the probability for that single grab, assuming every earlier grab missed. You would have to try 21 times until you get to 50% probability. Eleven tries will get you to 25% probability. Still, if you returned the bars to the bag you would never have better than a 17% chance of grabbing a Snickers.
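The table of grabs can be reproduced with a short loop. The sketch below, using an invented function name, computes the per-grab figure from the list (the chance of a Snickers on this grab, given that every earlier grab missed). It also tracks a second quantity that the list does not show and is included here only for comparison: the running chance of having pulled at least one Snickers by that grab.

```python
from fractions import Fraction

def miss_streak_probabilities(snickers: int, total: int):
    """Yield (draw, P(Snickers on this draw | all earlier draws missed),
    P(at least one Snickers by this draw)) when misses are set aside."""
    p_all_missed = Fraction(1)
    max_draws = total - snickers + 1  # after this many draws only Snickers can remain
    for draw in range(1, max_draws + 1):
        remaining = total - (draw - 1)
        p_this_draw = Fraction(snickers, remaining)
        p_all_missed *= 1 - p_this_draw
        yield draw, p_this_draw, 1 - p_all_missed

rows = list(miss_streak_probabilities(snickers=5, total=30))
for draw, p_draw, p_by_now in rows:
    print(f"draw {draw:2d}: this grab {float(p_draw):.0%}, "
          f"at least one so far {float(p_by_now):.0%}")
```

The two columns answer different questions. The per-grab figure reaches 50% on the 21st grab, while the running chance of having found at least one Snickers passes 50% much earlier.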

Say you picked a Snickers on your first grab. What’s the probability that you’ll pull out a Snickers on subsequent grabs?

  • Probability = number of favorable events / total number of events
  • 1st bar pulled is a Snickers
  • 2nd bar pulled Snickers probability = 4 / 29 = 14%
  • 3rd bar pulled Snickers probability = 4 / 28 = 14%
  • 4th bar pulled Snickers probability = 4 / 27 = 15%
  • 5th bar pulled Snickers probability = 4 / 26 = 15%
  • 6th bar pulled Snickers probability = 4 / 25 = 16%
  • 7th bar pulled Snickers probability = 4 / 24 = 17%
  • 8th bar pulled Snickers probability = 4 / 23 = 17%
  • 9th bar pulled Snickers probability = 4 / 22 = 18%
  • 10th bar pulled Snickers probability = 4 / 21 = 19%
  • 11th bar pulled Snickers probability = 4 / 20 = 20%

Once you do grab a Snickers, the probability that you’ll get another goes down because there are fewer Snickers in the bag. So, the lesson is: Don’t Be Greedy!

You are allergic to peanuts. What’s the probability that you’ll pull out a peanut-free bar (i.e., Charleston Chews, Tootsie Rolls, Kit Kats, or Hershey’s bars)?

  • Probability = number of favorable events / total number of events
  • Probability = 18 / 30
  • Probability = 60% or 0.6 or 3/5 or 1 out of 1.67.

Watch out for the bars that may have been produced in facilities that also process peanuts.

What’s the probability that your son will notice you raided his stash?

  • Probability = 1.0 or 100%. 

Are you kidding?


Next, consider “What Are The Odds?”

Read more about using statistics at the Stats with Cats blog at https://statswithcats.net. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Amazon.com or other online booksellers.

 


What To Look For In Data

Sometimes you have to do things when you have no idea where to start. It can be a stressful experience. If you’ve ever had to analyze a data set, you know the anxiety.

Deciding how and where to start exploring a new data set can be perplexing. Typically, the first thing to consider is the objective you, your boss, or your client have in analyzing the data set. That will give you a sense of where you need to go. Then you have to ensure the data set is reasonably free of errors. After that, you decide whether to look at snapshots, population or sample characteristics, changes over time or under different conditions, and multi-metric trends and patterns. This blog will give you some ideas for where and how to start.



35 Ways Data Go Bad

When you take your first statistics class, your professor will be a kind person who cares about your mental well-being. OK, maybe not, but what the professor won’t do is give you real-world data sets. The data may represent things you find in the real world but the data set will be free of errors. Real-world data is packed full of all kinds of errors – in individual data points and pervasive within metrics in the data set – that can be easy to find or buried deep in the details about the data (i.e., the metadata).

There are a dozen different kinds of information, some more prone to errors than others. Therefore, real-world data has to be scrubbed before it can be analyzed. Data cleansing is an unfortunate misnomer that is sometimes applied to removing errors from a data set. The term implies that a data set can be thoroughly cleaned so it is free of errors. That doesn’t really happen. It’s like taking a shower. You remove the dirt and big stuff but there’s still a lot of bacteria left behind. Data scrubbing can be exhausting and often takes 80% of the time spent on a statistical analysis. With all the bad things that can happen to data it’s remarkable that statisticians can produce any worthwhile analyses at all.

Here are 35 kinds of errors you’ll find in real-world data sets, divided for convenience into 7 categories. Some of these errors may seem simplistic, but looking at every entry of a large data set can be overwhelming if you don’t know what to look for and where to look, and then what to do when you find an error.

Invalid Data

Invalid data are values that are originally generated incorrectly. They may be individual data points or include all the measurements for a specific metric. Invalid data can be difficult to identify visually but may become apparent during an exploratory statistical analysis. They generally have to be removed from the data set.

Bad generation

Bad generation data result from a bad measurement tool. For example, a meter may be out of calibration or a survey question can be ambiguous or misleading. They can appear as individual data points or affect an entire metric. Their presence is usually revealed during analysis. They have to be removed.

Recorded wrong

Data can be recorded incorrectly, usually randomly in a data set, by the person creating or transcribing the data. These errors may originate on a data collection or data entry form, and thus, are difficult to detect without considerable cross checking. If they are identified, they have to be removed unless the real information can be discerned allowing replacement. Ambiguous votes in the 2000 presidential election in Florida are examples.

Bad coding

Bad coding results when information for a nominal-scale or ordinal-scale metric is entered inconsistently, either randomly or for an entire metric. This is especially troublesome if there are a large number of codes for a metric. Detection can be problematical. Sometimes other metrics can present inconsistencies that will reveal bad coding. For example, a subject’s sex coded as “male” might be in error if other data exist pointing to “female” characteristics. Colors are especially frustrating. They can be specified in a variety of ways – RGB, CMYK, hexadecimal, Munsell, and many other systems – all of which produce far too many categories to be practical. Plus, people perceive colors differently. Males see six colors where females see thirty. Bad coding can be replaced if it can be detected.

Wrong thing measured

Measuring the wrong thing seems ridiculous but it is not uncommon. The wrong fraction of a biological sample could be analyzed. The wrong specification of a manufactured part could be measured. And in surveys, the demographic defining the frame could be off-target. This can occur for an individual data point or the whole data set. Identification can be challenging if not impossible. Detected errors have to be removed.

Data quality exceptions

Some data sets undergo independent validation in addition to the verification conducted by the data analyst. While “verification is a simple concept embodied in simple yet time-consuming processes, … validation is a complex concept embodied in a complex time-consuming process.” A data quality exception might occur, for instance, when a data point is generated under conditions outside the parameters of a test, such as on an improperly prepared sample or at an unacceptable temperature. Identifying data quality exceptions is easy only because it is done by someone else. Removal of the exception is the ultimate fix.

Missing and Extraneous Data

Sometimes data points don’t make it into the data set. They are missing. This is a big deal because statistical procedures don’t allow missing data. Either the entire observation or the entire metric with the missing data point has to be excluded from the analysis OR a suitable replacement for the missing value has to be included in its place. Neither is a great alternative. Why the data points are missing is critical.

The opposite of missing values, extra data observations, also can appear in datasets. These most often occur for known reasons, such as quality control samples and merges of overlapping data sets. Missing data tends to affect metrics. Extra data points tend to affect observations.

Missing data

Data don’t just go missing, they (usually) go missing for a reason. It’s important to explore why data points for a metric are missing. If the missing values are truly random, they are said to be missing completely at random. If other metrics in the data set suggest why they are missing, they are said to be missing at random. However, if the reason they are missing is related to the metric they are missing from, that’s bad. Those data values are said to be missing not at random.

Missing-completely-at-random (MCAR) data

If there is no connection between missing data values for a metric and the values of that metric or any other metric in the data set (i.e., there is no reason for why the data point is missing), the values are said to be Missing Completely at Random (MCAR). MCAR values can occur with or without any explanation. An automated meter may malfunction or a laboratory result might be lost. A field measurement may be forgotten before it is recorded or just not recorded. MCAR data can be replaced by some appropriate value (there are several approaches for doing this), but they are usually ignored. In this case, either the metric or the observation has to be removed from the analysis.

Missing-at-random (MAR) data

If there is some connection between missing data values for a metric and the values of any of the other metrics in the data set (but not the metric with the missing values), the values are said to be Missing at Random (MAR). The true value of a MAR data point has nothing to do with why the value is missing, although other metrics do explain the omission. MAR data can occur when survey respondents refuse to answer questions they feel are inappropriate. For example, some females may decline to answer questions about their sexual history while males might answer readily (although not necessarily honestly). The sex of the respondent would explain why some data are missing and others are not. Likewise, a meter might not function if the temperature is too cold, resulting in MAR data. MAR data can be replaced by some appropriate value (there are several approaches for doing this), in which case, the pattern of replacement can be analyzed as a new metric. If the MAR data are ignored, either the metric or the observation has to be removed from the analysis.

Missing-not-at-random (MNAR) data

If there is some connection between a missing data value and the value of that metric, the values are said to be Missing Not at Random (MNAR). This is considered the worst case for a missing value. It has to be dealt with. For example, like MAR data, MNAR data can occur when survey respondents refuse to answer questions they feel are inappropriate, only in the MNAR case, because of what their answer might be. Examples might include sexual activity, drug use, medical conditions, age, weight, or income. Likewise, a meter might not function if real data are outside its range of measurement. These data are also said to be “censored.” MNAR data can be replaced by some appropriate value (there are several approaches for doing this), in which case, the pattern of replacement must be analyzed as a new metric. MNAR data should not be ignored because the pattern of their occurrence is valuable information.
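One rough way to start telling these cases apart is to compare how often a metric is missing across the categories of another metric. The sketch below uses made-up survey rows and an invented function name. A large gap between groups hints at MAR rather than MCAR. It cannot reveal MNAR, because that depends on the values you never got to see.

```python
from collections import defaultdict

def missing_rate_by_group(rows, metric, group_by):
    """Fraction of rows in each `group_by` category where `metric` is missing (None)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [missing, total]
    for row in rows:
        tally = counts[row[group_by]]
        tally[0] += row[metric] is None  # True counts as 1
        tally[1] += 1
    return {group: missing / total for group, (missing, total) in counts.items()}

# Hypothetical survey rows; None marks a skipped answer.
survey = [
    {"sex": "F", "partners": None}, {"sex": "F", "partners": 2},
    {"sex": "F", "partners": None}, {"sex": "M", "partners": 4},
    {"sex": "M", "partners": 1},    {"sex": "M", "partners": 3},
]
rates = missing_rate_by_group(survey, "partners", "sex")
print(rates)
```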

Uncollected data

Some data go missing because they simply weren’t collected. This occurs in surveys that branch, in which different questions are asked of participants based on their prior responses. In these cases, data sets are reconstructed to analyze only the portions of the branch that have no missing data. Another example is when a conscious decision is made not to collect certain data or not collect data from certain segments of a population because “ignorance is bliss.” The decisions to limit testing for Covid-19 and not record details of the imprisonment of illegal alien families are current examples. The data that are missing can never be recovered. Worse, when such data are reexamined generations in the future, the biases introduced by not collecting the data may be unrecognized.

Replicates

More than one suite of data from the same observational unit (e.g., individual, environmental sample, manufactured part, etc.) are sometimes collected to evaluate variability. These multiple results are called duplicates, triplicates, or in general, replicates. Intentionally collected replicates are usually consolidated into a single observation by averaging the values for each metric. Replicates can also be created when two overlapping data sets are merged. In these cases, the replicated observations should be identical so that only one is retained.
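Consolidating intentional replicates by averaging might look something like this. It is a sketch only: the function name, the sample IDs, and the lead_ppm metric are all hypothetical, and it assumes every row has the same numeric metrics.

```python
from collections import defaultdict
from statistics import mean

def consolidate_replicates(rows, id_key):
    """Average every metric across rows that share the same `id_key` value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_key]].append(row)
    consolidated = []
    for unit_id, members in groups.items():
        merged = {id_key: unit_id}
        for metric in members[0]:
            if metric != id_key:
                merged[metric] = mean(m[metric] for m in members)
        consolidated.append(merged)
    return consolidated

samples = [
    {"sample": "S1", "lead_ppm": 10.0},
    {"sample": "S1", "lead_ppm": 12.0},  # duplicate measurement of S1
    {"sample": "S2", "lead_ppm": 7.0},
]
result = consolidate_replicates(samples, "sample")
print(result)
```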

QA/QC samples

Additional observations are sometimes created for the purpose of evaluating the quality of data generation. Examples of such Quality Assurance/Quality Control (QA/QC) samples focus on laboratory performance, sample collection and transport, and equipment calibration. These results may be included in a data set when the data set is created as a convenience. They should not be part of any statistical analysis; they must be evaluated separately. Consequently, QA/QC samples should be removed from analytical data sets.

Extraneous unexplained

Rarely, extra data points may spontaneously appear in a data set for no apparent reason. They are idiopathic in the sense that their cause is unknown. They should be removed.

Dirty Data

Dirty data includes individual data points that have erroneous characters as well as whole metrics that cannot be analyzed because of some inconsistency or textual irregularity. Dirty data can usually be identified visually; they stand out. Unfortunately, most of these types of errors appear randomly so the entire dataset has to be searched, although there are tricks for doing this.

Incorrect characters

Just about anything can end up being an incorrect character, especially if data entry was manual. There are random typos. There are lookalike characters, like O for 0, l for 1, S for 5 or 8, and b for 6. There are digits that have been inadvertently reversed, added, dropped, or repeated. These errors can be challenging to detect visually, especially if they are random. Once detected though, they are easy to repair manually.
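One of the tricks for hunting lookalike characters is to let the software try to read every entry as a number and report the ones it can't. A minimal sketch, assuming the metric is supposed to be numeric (the function name and readings are invented):

```python
def flag_non_numeric(values):
    """Return (index, value) pairs for entries that do not parse as numbers."""
    flagged = []
    for index, value in enumerate(values):
        try:
            float(value)
        except ValueError:
            flagged.append((index, value))
    return flagged

# Hypothetical readings with a letter O, a lowercase L, and a b typed for digits.
readings = ["12.5", "1O.2", "l3.0", "9.8", "5b.1"]
flagged = flag_non_numeric(readings)
print(flagged)
```

This catches letters hiding among digits. It will not catch reversed, dropped, or repeated digits, because those still parse as perfectly good numbers.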

Problematic characters

Problematic characters can be either unique or common but in a different context. Unique characters include currency symbols, icons used as bullets, and footnote symbols. Common characters that are problematic include leading or trailing spaces and punctuation marks like quotes, exclamation points, asterisks, parentheses, ampersands, carets, hashes, at signs, and slashes. Extra or missing commas wreak havoc when importing CSV files. This can happen when commas are used instead of periods for dollar values. These errors can be challenging to detect visually, especially if they are random. Once detected though, they are easy to repair manually.

Concatenated data

Some data elements include several pieces of information in a single entry that may need to be extracted to be analyzed. Examples include timestamps, geographic coordinates, telephone numbers, zip plus four, email addresses, social security numbers, account numbers, credit card numbers, and other identification numbers. Often, the parts of the values are delimited with hyphens, periods, slashes, or parentheses. These data metrics are easy to identify and process or remove.

Aliases and misspellings

IDs, names, and addresses are common places to find aliases and misspellings. They’re not always easy to spot, but sorting and looking for duplicates is a start. Upper/lower case may be an issue for some software depending on the analysis to be done. Fix the errors by replacing all but one of the data entries.

Useless data

Any metric that has no values or has values that are all the same is useless in an analysis and should be removed. Metrics with no values can occur, for instance, from filtering or from importing a table with breaker columns or rows. Some metrics may be irrelevant to an analysis or duplicate information in another metric. For example, names can be specified in a variety of formats, such as “first last,” “last, first,” and so on. Only one format needs to be retained. Useless data can be removed unless there is some reason to keep the original data set metrics intact.
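Empty and constant metrics are easy to find automatically. The sketch below assumes the data set is a list of dictionaries and treats None and empty strings as "no value". Both the function name and the example rows are made up.

```python
def useless_metrics(rows):
    """Return names of metrics that are empty everywhere or hold one repeated value."""
    metrics = {name for row in rows for name in row}
    useless = []
    for name in sorted(metrics):
        # Collect the distinct values, ignoring blanks.
        observed = {row.get(name) for row in rows} - {None, ""}
        if len(observed) <= 1:
            useless.append(name)
    return useless

table = [
    {"id": 1, "site": "A", "spacer": "", "temp": 20.1},
    {"id": 2, "site": "A", "spacer": "", "temp": 19.4},
    {"id": 3, "site": "A", "spacer": "", "temp": 21.0},
]
print(useless_metrics(table))
```

Whether a constant metric like site is truly useless is still a judgment call. It may matter again once this data set is merged with another.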

Invalid fields

All kinds of weird entries can appear in a dataset, especially one that is imported electronically. Examples include file header information, multi-row titles from tables, images and icons, and some types of delimiters. These must all be removed. Data values that appear to be digits but are formatted as text must also be reformatted.

Out-of-Spec Data

Some data may appear fine at first glance but are actually problematical because they don’t fit expected norms. Some of these errors apply to individual data points and some apply to all the measurements for a metric. Identification and recovery depend on the nature of the error.

Out-of-bounds data

Some data errors involve impossible values that are outside the boundaries of the measurement. Examples include pH outside of 0 to 14, an earthquake larger than 9 on the Richter scale, a human body temperature of 115°F, negative ages and body weights, and sometimes, percentages outside of 0% to 100%. Out-of-bounds data should be corrected, if possible, or removed if not.
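Out-of-bounds checks are simple to automate once the limits are written down. In this sketch the limits table, the metric names, and the function name are all illustrative assumptions; None means no limit on that side.

```python
def out_of_bounds(rows, bounds):
    """Return (row index, metric, value) for each value outside its allowed interval."""
    problems = []
    for index, row in enumerate(rows):
        for metric, (low, high) in bounds.items():
            value = row.get(metric)
            if value is None:
                continue  # missing values are a separate problem
            too_low = low is not None and value < low
            too_high = high is not None and value > high
            if too_low or too_high:
                problems.append((index, metric, value))
    return problems

# Illustrative limits: pH 0 to 14, non-negative ages, percentages 0 to 100.
BOUNDS = {"ph": (0, 14), "age": (0, None), "percent": (0, 100)}
records = [
    {"ph": 7.2, "age": 34, "percent": 55},
    {"ph": 15.1, "age": -3, "percent": 101},
]
print(out_of_bounds(records, BOUNDS))
```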

Data with different precisions

Data should all have the same precision, though this is not always the case. Currency data is often a problem. For example, sometimes dollar amounts are recorded to the cent and sometimes rounded to much larger amounts. This adds extraneous variability to calculated statistics.

Data with different units

Data for a metric should all be measured and reported in the same units. Sometimes, measurements can be made in both English and metric units but not converted when included into a dataset. Sometimes, an additional metric is included to specify the unit; however, this can lead to confusion. A famous example of confusion over units was when NASA lost the $125 million Mars Climate Orbiter in 1999. Fixing metrics that have inconsistent units is usually straightforward.
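When a separate metric records the unit, the fix is usually a lookup table of conversion factors. A sketch for lengths follows. The table and function name are illustrative, and it deliberately fails on an unrecognized unit rather than guessing.

```python
# Conversion factors to meters (exact by definition for ft and in).
TO_METERS = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}

def to_meters(value: float, unit: str) -> float:
    """Convert a length to meters, refusing units that are not in the table."""
    try:
        return value * TO_METERS[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None

# Hypothetical mixed-unit measurements: (value, unit).
measurements = [(10, "ft"), (250, "cm"), (3.2, "m")]
print([round(to_meters(value, unit), 4) for value, unit in measurements])
```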

Data with different measurement scales

Having data measured on different scales for a metric is rare but it does happen. In particular, a nominal-scale metric can appear to be an ordinal-scale metric if numbers are used to identify the categories. Time and location scales can also be problematic. Compared to fixing metrics with inconsistent units, fixing metrics with inconsistent scales can be challenging.

Data with large ranges

Data with large ranges, perhaps ranging from zero to millions of units, are an issue because they can cause computational inaccuracies. Replacement by a logarithm or other transformation can address this problem.

Messy Data

Messy data give statisticians nightmares. Untrained analysts would probably not even notice these problems. In fact, even for statisticians, they can be difficult to diagnose because expertise and judgment are needed to establish their presence. Once identified, additional analytical techniques are needed to address the issues. And then, there may not be a consensus on the appropriate response.

Outliers

Outliers are anomalous data points that just don’t fit with the rest of the metric in the data set. You might think that outliers are easy to identify and fix, and there are many ways to accomplish those things, but there is enough judgment involved in those processes to allow damning criticism from even untrained adversaries. That is a nightmare for an applied statistician. They can be 100% in the right yet still be made to appear as a con artist.

Large variances

Some data are accurate but not precise. That is a nightmare for a statistician because statistical tests rely on extraneous variance being controlled. You can’t find significant differences between mean values of a metric if the variance in the data is too large. A large variance in a metric of a data set is easy to identify just by calculating the standard deviation and comparing it to the mean for the metric (called the coefficient of variation). There are methods to adjust for large variances, but the best strategy is prevention.
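The coefficient of variation takes one line with Python's statistics module. This sketch uses the sample standard deviation and two made-up sets of readings. What counts as "too large" is left to the analyst.

```python
from statistics import mean, stdev

def coefficient_of_variation(values) -> float:
    """Sample standard deviation divided by the mean (undefined for a zero mean)."""
    average = mean(values)
    if average == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(values) / average

tight = [9.8, 10.1, 10.0, 10.2, 9.9]   # precise measurements
loose = [2.0, 25.0, 7.0, 14.0, 2.0]    # same mean, far less precise
print(f"tight: {coefficient_of_variation(tight):.1%}")
print(f"loose: {coefficient_of_variation(loose):.1%}")
```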

Non-constant variances

The variance of some metrics occasionally changes with time or with changes in a controlling condition. For example, the variance in a metric may diminish over time as methods of measurement improve. Some biochemical reactions become more variable with changes in temperature, pressure, or pH. This is a nightmare for a statistician because statistical modeling assumes homoskedasticity (i.e., constant variances). Heteroskedasticity in a metric of a data set is easy to identify by calculating and plotting the variances between time periods or categories of other metrics. There are methods to adjust for non-constant variances but they introduce other issues.
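A first look for non-constant variance can be as simple as computing the variance within each time period or category. The sketch below uses invented decade labels and values, and needs at least two values per group.

```python
from collections import defaultdict
from statistics import variance

def variance_by_group(pairs):
    """Sample variance of the values within each group; `pairs` holds (group, value)."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    # A sample variance needs at least two values, so smaller groups are skipped.
    return {g: variance(v) for g, v in groups.items() if len(v) >= 2}

# Hypothetical measurements that get less variable as methods improve.
data = [
    ("1990s", 10.0), ("1990s", 14.0), ("1990s", 6.0),
    ("2010s", 10.0), ("2010s", 11.0), ("2010s", 9.0),
]
print(variance_by_group(data))
```

Plotting these values against the period or category is usually the next step.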

Censored data

Some data can’t be measured accurately because of limitations in the measurement instrument. Those data are reported as “less than” (<) or “greater than” (>) the limit of measurement. They are said to be censored. Very low concentrations of pollutants, for example, are often reported as <DL (less than the detection limit) because the instrument can detect the pollutant but not quantify its concentration. There are a variety of ways to address this issue either by replacing affected data points or by using statistical procedures that account for censored data. Nevertheless, censored data are a nightmare for applied statisticians because there is no consensus on the best way to approach the problem in a given situation.
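Before censored values can be handled at all, the flag has to be separated from the number. The sketch below does that and then applies one simple replacement convention, half the reported limit, purely as an illustration. It is only one of the approaches in use, and the function names are made up.

```python
def parse_censored(raw: str):
    """Split a reported result into (value, flag), where flag is '<', '>' or ''."""
    text = raw.strip()
    flag = text[0] if text[:1] in ("<", ">") else ""
    return float(text[len(flag):]), flag

def substitute_half_limit(raw: str) -> float:
    """Illustrative convention only: replace '<limit' with limit / 2."""
    value, flag = parse_censored(raw)
    return value / 2 if flag == "<" else value

# Hypothetical lab results, two of them below a detection limit of 0.5.
reported = ["1.2", "<0.5", "0.9", "<0.5"]
print([parse_censored(r) for r in reported])
print([substitute_half_limit(r) for r in reported])
```

Keeping the flag alongside the value preserves the option of switching later to a procedure designed for censored data.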

Corrupted Data

Corrupted data are created when some improper operation is applied, either manually or by machine, to data needing refinement after it is generated. These errors can be detected most easily at the time they are created. They tend not to be obvious if they are not identified immediately after they occur.

Electronic glitches

Electronic glitches occur when some interference corrupts a data stream from an automated device. These errors can be detected visually if the data are reviewed. Often, however, such data streams are automated so they do not have to be reviewed regularly. Removal is the typical fix.

Bad extraction

It is not uncommon for data elements to have to be extracted from a concatenated metric. For example, a month might have to be extracted from a value formatted as mmddyy (e.g., 070420), or a zip code might have to be extracted from a value formatted as a zip code plus four. Such extractions are usually automated. If an error is made in the extraction formula, however, the extracted data will be in error. These errors are usually noticeable and can be replaced by correct data by running a revised extraction formula.
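Building a sanity check into the extraction itself makes a bad formula show up right away. Here is a sketch for the two examples mentioned, with hypothetical function names:

```python
def extract_month(mmddyy: str) -> int:
    """Pull the month out of a six-digit mmddyy string, rejecting impossible values."""
    if len(mmddyy) != 6 or not (mmddyy.isascii() and mmddyy.isdigit()):
        raise ValueError(f"expected six digits, got {mmddyy!r}")
    month = int(mmddyy[0:2])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {mmddyy!r}")
    return month

def extract_zip5(zip_plus_four: str) -> str:
    """Keep the 5-digit zip code from a value such as '12345-6789'."""
    zip5 = zip_plus_four.strip()[:5]
    if len(zip5) != 5 or not (zip5.isascii() and zip5.isdigit()):
        raise ValueError(f"no 5-digit zip code in {zip_plus_four!r}")
    return zip5

print(extract_month("070420"))     # month from mmddyy
print(extract_zip5("12345-6789"))  # zip code kept as text so leading zeros survive
```

If the formula had sliced the wrong characters, say the day instead of the month, the range check would start failing on any day after the 12th.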

Bad processing

As with extraction, it is not uncommon for metrics to have to be processed to correct errors or give them more desirable properties. Such processing is usually automated. If an error is made in the processing algorithm, the resulting data will be in error. For example, NASA has occasionally had instances in which processing photogrammetric data has caused space debris to appear as UFOs and planetary landforms to appear as alien structures (e.g., the Cydonia Region on Mars). These errors are usually noticeable, at least by critics. Processing errors can be replaced by corrected data by running a revised processing algorithm.

Bad reorganization

Data sets are often manipulated manually to optimize their organization and formatting for analysis. Cut and paste operations are often used for this purpose. Occasionally, a cut/paste operation will go awry. Detection is easiest at the moment it occurs, when it can be reversed effortlessly. These errors tend not to be so obvious or easy to fix if they are not identified immediately after they occur.

Mismatched Data

A great many statistical analyses rely on data collected and published by others, usually organizations dedicated to a cause, and often, government agencies. There is usually a presumption that these data are error-free and, at least for government sources, unbiased. They are, of course, neither, but data analysts are limited to using the tools they have at hand. Some errors, or at least inconsistencies, in these data sets are attributable to differences in the nature of the data being measured, differences in data definitions, and differences related to the passage of time. These differences can be overtly stated in metadata or buried deep in the way the creation of the data evolved. In either case, the errors aren’t always visible in the actual data points; they have to be discovered. And even if you discover inconsistencies, you may not be able to fix them.

Different sources

Errors in data sets built from published data take a variety of forms. First, everything that can happen in the creation of a locally created data set can happen in a published data set, so there could be just as wide a variety of errors. Reputable sources, however, will scrub out invalid data, dirty data, out-of-spec data, corrupted data, and extraneous data. Most will not address missing data. None will deal with messy data. Missing and messy data are the responsibility of the data analyst. Second, different sources will have assembled their data using different contexts: data definitions, data acquisition methods, business rules, and data administration policies. None of these is usually readily apparent. Some errors may also occur when data sets from different sources are merged. Examples of such errors include replicates and extraction errors. It goes without saying that merging data from different sources can be satisfying yet terrifying, like bungee jumping, cave diving, and registering for Statistics 101. So what could go wrong in your analysis if you don’t consider the possibility of mismatched data?
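Replicates, one of the merge errors mentioned above, are at least easy to check for. The following is a minimal sketch, assuming each source is a list of records that share a key field. The function name merge_sources and the county figures are made up for illustration.

```python
def merge_sources(source_a, source_b, key):
    """Combine two lists of dict records and report keys seen more than once."""
    merged = {}
    replicate_keys = []
    for record in [*source_a, *source_b]:
        k = record[key]
        if k in merged:
            replicate_keys.append(k)  # keep the first copy, flag the repeat
        else:
            merged[k] = record
    return list(merged.values()), replicate_keys


# Hypothetical records from two different publishers.
agency = [
    {"county": "Adams", "population": 1000},
    {"county": "Baker", "population": 2500},
]
nonprofit = [
    {"county": "Baker", "population": 2450},
    {"county": "Clark", "population": 800},
]

combined, replicates = merge_sources(agency, nonprofit, "county")
print(replicates)
```

Note that the two sources disagree about the replicated county. The sketch flags the replicate and keeps the first value it saw, but deciding which value is right still requires knowing how each source defined and collected it.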

Different definitions

When you combine data from different sources, or even evaluate data from a single source, be sure you know how the data metrics were defined. Sometimes data definitions change over time or under different conditions. For example, some counts of students in college might include full-time students at both two-year and four-year colleges, while other counts may exclude two-year colleges but include part-time students. Say you’re analyzing the number of diabetics in the U.S. The first glucose meter was introduced in 1969, but before 1979, blood glucose testing was complicated and not quantitative. In 1979, a diagnosis of diabetes was defined as a fasting blood glucose of 140 mg/dL or higher. In 1997 the definition was changed to 126 mg/dL or higher. Today, a level of 100 to 125 mg/dL is considered prediabetic. Data definitions make a real difference. So, if you’re analyzing a phenomenon that uses some judgment in data generation, especially phenomena involving technology, be aware of how those judgments might have evolved.
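To see how much a definition can matter, here is a small sketch that applies the two fasting glucose thresholds given above (140 mg/dL from 1979, 126 mg/dL from 1997) to the same reading. The function names are hypothetical, and the sketch deliberately refuses to classify readings before 1979, since no quantitative definition is given for that period.

```python
def diabetes_threshold(year: int) -> int:
    """Fasting glucose threshold (mg/dL) in effect for a given year."""
    if year < 1979:
        raise ValueError("no quantitative definition given before 1979")
    return 140 if year < 1997 else 126


def is_diabetic(fasting_glucose_mg_dl: float, year: int) -> bool:
    """Classify a reading using the definition in effect that year."""
    return fasting_glucose_mg_dl >= diabetes_threshold(year)


reading = 130  # the same hypothetical reading, judged under two definitions
print(is_diabetic(reading, 1990), is_diabetic(reading, 2000))
```

A person with a reading of 130 mg/dL would not have been counted as diabetic in 1990 but would be in 2000. A time series of diabetes counts that spans 1997 mixes the two definitions unless the analyst adjusts for the change.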

Different contexts

In addition to different data definitions, the context under which a metric was created may be relevant. For example, in 143 years, the Major League Baseball (MLB) record for most home runs in a season has been held by 8 men. The 4 who held the record the longest are: Babe Ruth (60 home runs, 1919 to 1960); Roger Maris (61 home runs, 1961 to 1997); Ned Williamson (27 home runs, 1884 to 1918); and Barry Bonds (73 home runs, 2001 to 2019). The other 4 recordholders held their record for fewer than 5 years each. During that time, there have been changes in rules, facilities, equipment, coaching strategies, drugs, and of course, players, so it would be ridiculous to compare Ned Williamson’s 27 home runs to Barry Bonds’ 73 home runs. Consider also how perceptions of race and gender might be different in different sources, say a religious organization versus a federal agency. Even surveys by the same organization of the same population using different frames can produce different results. Be sure you understand the contexts under which data were generated when you merge files.

Different times

Time is perhaps the most challenging framework to match data on. In business data, for example, relevant parameters might include: fiscal and calendar years; event years (e.g., elections, census, leap years); daily, monthly, and quarterly cutoff days; and seasonality and seasonal adjustments. Data may represent snapshots, statistics (e.g., moving averages, extrapolations), and planned versus reprogrammed values. And sometimes, the rules change over time. The first fiscal year of the U.S. Government started on January 1, 1789. Congress changed the beginning of the fiscal year from January 1 to July 1 in 1842, and from July 1 to October 1 beginning with fiscal year 1977. Time is not on your side.
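The effect of a changing fiscal year start can be sketched with a short function. This is a simplified illustration, not an official rule: the name fiscal_year is hypothetical, and it assumes a fiscal year that starts after January is labeled by the calendar year in which it ends, which is the convention the U.S. Government uses today.

```python
from datetime import date


def fiscal_year(d: date, start_month: int) -> int:
    """Fiscal year containing d, labeled by the calendar year in which it ends."""
    if not 1 <= start_month <= 12:
        raise ValueError("start_month must be between 1 and 12")
    if start_month == 1:
        return d.year  # fiscal year and calendar year coincide
    return d.year + 1 if d.month >= start_month else d.year


d = date(2020, 8, 1)
october_start = fiscal_year(d, 10)  # fiscal year beginning October 1
july_start = fiscal_year(d, 7)      # fiscal year beginning July 1
print(october_start, july_start)
```

The same date falls in fiscal year 2020 under an October 1 start and in fiscal year 2021 under a July 1 start. Two files that both say "fiscal year" can therefore be a full year apart for some months unless the start month for each source is known.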

Summary

There seems to be an endless number of ways that data can go bad. There are at least 35. That realization is soul-crushing for most statisticians, so they come by it slowly. Some do come to grips with the concept that no data set is error-free, or can be error-free, but still can’t imagine the creativity nature has for making this happen. This blog is an attempt to enumerate some of these hazards.

Data errors can occur in individual data points and whole data metrics (and sometimes observations). They can be identified visually, using descriptive statistics, or statistical graphics, depending on the type of error.

The manner of data set creation provides insight into the types of errors that might be present. Original data sets, one-time creations that become a source of data, are prone to invalid data, dirty data, out-of-spec data, and missing data. Combined data sets (also referred to as merged, blended, fused, aggregated, concatenated, joined, and united) are built from multiple sources of data, either manually or by automation, at one time, periodically, or continuously. These data sets are more prone to corrupted and mismatched data.

Recovering bad data involves the 3 Rs of fixing data errors: Repair, Replacement, and Removal.

Read more about using statistics at the Stats with Cats blog at https://statswithcats.net. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Amazon.com or other online booksellers.


The Most Important Statistical Assumptions
