When you take your first statistics class, your professor will be a kind person who cares about your mental well-being. OK, maybe not, but what the professor won’t do is give you real-world data sets. The data may represent things you find in the real world, but the data set will be free of errors. Real-world data is packed full of all kinds of errors – some confined to individual data points, some pervading entire metrics – that can be easy to find or buried deep in the information describing the data set (i.e., its metadata).
There are many different kinds of information, some more prone to errors than others. Therefore, real-world data have to be scrubbed before they can be analyzed. Data cleansing is an unfortunate misnomer sometimes applied to removing errors from a data set. The term implies that a data set can be cleaned so thoroughly that it is free of errors. That doesn’t really happen. It’s like taking a shower: you remove the dirt and the big stuff, but there’s still a lot of bacteria left behind. Data scrubbing can be exhausting and often takes 80% of the time spent on a statistical analysis. With all the bad things that can happen to data, it’s remarkable that statisticians can produce any worthwhile analyses at all.
Here are 35 kinds of errors you’ll find in real-world data sets, divided for convenience into 7 categories. Some of these errors may seem simplistic, but looking at every entry of a large data set can be overwhelming if you don’t know what to look for and where to look, and then what to do when you find an error.
Invalid data are values that are originally generated incorrectly. They may be individual data points or include all the measurements for a specific metric. Invalid data can be difficult to identify visually but may become apparent during an exploratory statistical analysis. They generally have to be removed from the data set.
Bad generation data result from a bad measurement tool. For example, a meter may be out of calibration or a survey question can be ambiguous or misleading. They can appear as individual data points or affect an entire metric. Their presence is usually revealed during analysis. They have to be removed.
Data can be recorded incorrectly, usually randomly in a data set, by the person creating or transcribing the data. These errors may originate on a data collection or data entry form, and thus, are difficult to detect without considerable cross checking. If they are identified, they have to be removed unless the real information can be discerned allowing replacement. Ambiguous votes in the 2000 presidential election in Florida are examples.
Bad coding results when information for a nominal-scale or ordinal-scale metric is entered inconsistently, either randomly or for an entire metric. This is especially troublesome if there are a large number of codes for a metric. Detection can be problematical. Sometimes other metrics can present inconsistencies that will reveal bad coding. For example, a subject’s sex coded as “male” might be in error if other data exist pointing to “female” characteristics. Colors are especially frustrating. They can be specified in a variety of ways – RGB, CMYK, hexadecimal, Munsell, and many other systems – all of which produce far too many categories to be practical. Plus, people perceive colors differently. Males see six colors where females see thirty. Bad coding can be replaced if it can be detected.
Wrong thing measured
Measuring the wrong thing seems ridiculous but it is not uncommon. The wrong fraction of a biological sample could be analyzed. The wrong specification of a manufactured part could be measured. And in surveys, the demographic defining the frame could be off-target. This can occur for an individual data point or the whole data set. Identification can be challenging if not impossible. Detected errors have to be removed.
Data quality exceptions
Some data sets undergo independent validation in addition to the verification conducted by the data analyst. While “verification is a simple concept embodied in simple yet time-consuming processes, … validation is a complex concept embodied in a complex time-consuming process.” A data quality exception might occur, for instance, when a data point is generated under conditions outside the parameters of a test, such as on an improperly prepared sample or at an unacceptable temperature. Identifying data quality exceptions is easy only because it is done by someone else. Removal of the exception is the ultimate fix.
Missing and Extraneous Data
Sometimes data points don’t make it into the data set. They are missing. This is a big deal because most statistical procedures don’t allow missing data. Either the entire observation or the entire metric with the missing data point has to be excluded from the analysis, or a suitable replacement for the missing value has to be included in its place. Neither is a great alternative. Why the data points are missing is critical.
The opposite of missing values, extra data observations, also can appear in datasets. These most often occur for known reasons, such as quality control samples and merges of overlapping data sets. Missing data tends to affect metrics. Extra data points tend to affect observations.
Data don’t just go missing, they (usually) go missing for a reason. It’s important to explore why data points for a metric are missing. If the missing values are truly random, they are said to be missing completely at random. If other metrics in the data set suggest why they are missing, they are said to be missing at random. However, if the reason they are missing is related to the metric they are missing from, that’s bad. Those data values are said to be missing not at random.
Missing-completely-at-random (MCAR) data
If there is no connection between missing data values for a metric and the values of that metric or any other metric in the data set (i.e., there is no reason for why the data point is missing), the values are said to be Missing Completely at Random (MCAR). MCAR values can occur with or without any explanation. An automated meter may malfunction or a laboratory result might be lost. A field measurement may be forgotten before it is recorded or just not recorded. MCAR data can be replaced by some appropriate value (there are several approaches for doing this), but they are usually ignored. In this case, either the metric or the observation has to be removed from the analysis.
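As a minimal sketch of those two options, here is how listwise deletion and mean imputation might look in Python (the readings are hypothetical, and mean replacement is just one of the several approaches mentioned):

```python
import statistics

# Hypothetical metric with MCAR gaps (None marks a missing value)
readings = [4.1, None, 3.9, 4.4, None, 4.0]

# Option 1: ignore the missing points (listwise deletion)
complete = [x for x in readings if x is not None]

# Option 2: replace each gap with the mean of the observed values
# (one simple imputation approach among several)
mean = statistics.mean(complete)
imputed = [x if x is not None else mean for x in readings]
```

Deletion keeps the data honest but shrinks the sample; imputation keeps the sample size but understates variability.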
Missing-at-random (MAR) data
If there is some connection between missing data values for a metric and the values of any of the other metrics in the data set (but not the metric with the missing values), the values are said to be Missing at Random (MAR). The true value of a MAR data point has nothing to do with why the value is missing, although other metrics do explain the omission. MAR data can occur when survey respondents refuse to answer questions they feel are inappropriate. For example, some females may decline to answer questions about their sexual history while males might answer readily (although not necessarily honestly). The sex of the respondent would explain why some data are missing and others are not. Likewise, a meter might not function if the temperature is too cold, resulting in MAR data. MAR data can be replaced by some appropriate value (there are several approaches for doing this), in which case, the pattern of replacement can be analyzed as a new metric. If the MAR data are ignored, either the metric or the observation has to be removed from the analysis.
Missing-not-at-random (MNAR) data
If there is some connection between a missing data value and the value of that metric, the values are said to be Missing Not at Random (MNAR). This is considered the worst case for a missing value. It has to be dealt with. For example, like MAR data, MNAR data can occur when survey respondents refuse to answer questions they feel are inappropriate, only in the MNAR case, because of what their answer might be. Examples might include sexual activity, drug use, medical conditions, age, weight, or income. Likewise, a meter might not function if real data are outside its range of measurement. These data are also said to be “censored.” MNAR data can be replaced by some appropriate value (there are several approaches for doing this), in which case, the pattern of replacement must be analyzed as a new metric. MNAR data should not be ignored because the pattern of their occurrence is valuable information.
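One way to preserve that valuable pattern of occurrence is to record a missingness indicator before replacing anything. A sketch, using hypothetical survey income responses and simple mean replacement:

```python
# Hypothetical income responses; None = respondent declined to answer
income = [52000, None, 61000, None, 48000]

# Record the pattern of missingness as a new 0/1 metric before replacing
# values, so the pattern itself can be analyzed later
was_missing = [1 if x is None else 0 for x in income]

observed = [x for x in income if x is not None]
fill = sum(observed) / len(observed)  # simple mean replacement
income_filled = [x if x is not None else fill for x in income]
```

The `was_missing` metric can then be analyzed alongside the other metrics to probe why values went missing.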
Some data go missing because they simply weren’t collected. This occurs in surveys that branch, in which different questions are asked of participants based on their prior responses. In these cases, data sets are reconstructed to analyzed only the portions of the branch that has no missing data. Another example is when a conscious decision is made not to collect certain data or not collect data from certain segments of a population because “ignorance is bliss.” The decisions to limit testing for Covid-19 and not record details of the imprisonment of illegal alien families are current examples. The data that are missing can never be recovered. Worse, generations in the future when such data are reexamined, the biases introduced by not collecting the data may be unrecognized.
More than one suite of data from the same observational unit (e.g., individual, environmental sample, manufactured part, etc.) are sometimes collected to evaluate variability. These multiple results are called duplicates, triplicates, or in general, replicates. Intentionally collected replicates are usually consolidated into a single observation by averaging the values for each metric. Replicates can also be created when two overlapping data sets are merged. In these cases, the replicated observations should be identical so that only one is retained.
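Consolidating replicates by averaging can be sketched like this (the sample IDs and lab results are hypothetical):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical replicate lab results keyed by sample ID
rows = [("S-01", 7.2), ("S-01", 7.4), ("S-02", 6.9), ("S-02", 6.7)]

# Group the replicates by observational unit
by_sample = defaultdict(list)
for sample_id, value in rows:
    by_sample[sample_id].append(value)

# Consolidate each group into a single observation by averaging
consolidated = {sid: mean(vals) for sid, vals in by_sample.items()}
```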
Additional observations are sometimes created for the purpose of evaluating the quality of data generation. Examples of such Quality Assurance/Quality Control (QA/QC) samples focus on laboratory performance, sample collection and transport, and equipment calibration. These results may be included in a data set when the data set is created as a convenience. They should not be part of any statistical analysis; they must be evaluated separately. Consequently, QA/QC samples should be removed from analytical data sets.
Rarely, extra data points may spontaneously appear in a data set for no apparent reason. They are idiopathic in the sense that their cause is unknown. They should be removed.
Dirty data include individual data points that have erroneous characters as well as whole metrics that cannot be analyzed because of some inconsistency or textual irregularity. Dirty data can usually be identified visually; they stand out. Unfortunately, most of these types of errors appear randomly, so the entire dataset has to be searched, although there are tricks for doing this.
Just about anything can end up being an incorrect character, especially if data entry was manual. There are random typos. There are lookalike characters, like O for 0, l for 1, S for 5 or 8, and b for 6. There are digits that have been inadvertently reversed, added, dropped, or repeated. These errors can be challenging to detect visually, especially if they are random. Once detected, though, they are easy to repair manually.
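One of the tricks for finding lookalikes in fields that should be numeric is a translation table. A sketch (the mapping is an assumption – S could just as well stand in for 8):

```python
# Hypothetical fix-up table for common lookalike characters in numeric fields
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

def repair_numeric(text):
    """Replace lookalike letters, then verify the result is all digits."""
    cleaned = text.translate(LOOKALIKES)
    return cleaned if cleaned.isdigit() else None  # flag for manual review

repair_numeric("1O5l")  # "1051"
```

Entries that still aren’t all digits after translation get flagged rather than guessed at.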
Problematic characters can be either unique or common but in a different context. Unique characters include currency symbols, icons used as bullets, and footnote symbols. Common characters that are problematic include leading or trailing spaces and punctuation marks like quotes, exclamation points, asterisks, parentheses, ampersands, carets, hashes, at signs, and slashes. Extra or missing commas wreak havoc when importing csv files. This can happen when commas are used instead of periods as decimal separators in dollar values. These errors can be challenging to detect visually, especially if they are random. Once detected, though, they are easy to repair manually.
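A basic scrubbing pass for stray whitespace and symbols might look like this (which symbols count as problematic is an assumption that depends on the data set):

```python
import re

# Strip leading/trailing spaces and assumed-problematic stray symbols
def scrub(entry):
    entry = entry.strip()                      # leading/trailing whitespace
    entry = re.sub(r'[#@*^"!&]', "", entry)    # symbols assumed to be noise
    return entry

scrub("  acme#  ")  # "acme"
```

In practice the character class has to be tuned: an ampersand is noise in one metric and meaningful in another.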
Some data elements include several pieces of information in a single entry that may need to be extracted to be analyzed. Examples include timestamps, geographic coordinates, telephone numbers, zip plus four, email addresses, social security numbers, account numbers, credit card numbers, and other identification numbers. Often, the parts of the values are delimited with hyphens, periods, slashes, or parentheses. These data metrics are easy to identify and process or remove.
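Extracting the parts of a delimited entry is usually a job for a pattern match. A sketch for a zip-plus-four field (the sample value is hypothetical):

```python
import re

# Split a zip-plus-four entry into its two parts
def split_zip(value):
    match = re.fullmatch(r"(\d{5})-(\d{4})", value)
    return match.groups() if match else None  # None flags malformed entries

split_zip("19104-6243")  # ("19104", "6243")
```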
Aliases and misspellings
IDs, names, and addresses are common places to find aliases and misspellings. They’re not always easy to spot, but sorting and looking for duplicates is a start. Upper/lower case may be an issue for some software depending on the analysis to be done. Fix the errors by replacing all but one of the data entries.
Any metric that has no values, or has values that are all the same, is useless in an analysis and should be removed. Metrics with no values can occur, for instance, from filtering or from importing a table with breaker columns or rows. Some metrics may be irrelevant to an analysis or duplicate information in another metric. For example, names can be specified in a variety of formats, such as “first last,” “last, first,” and so on. Only one format needs to be retained. Useless data can be removed unless there is some reason to keep the original data set metrics intact.
All kinds of weird entries can appear in a dataset, especially one that is imported electronically. Examples include file header information, multi-row titles from tables, images and icons, and some types of delimiters. These must all be removed. Data values that appear to be digits but are formatted as text must also be reformatted.
Some data may appear fine at first glance but are actually problematical because they don’t fit expected norms. Some of these errors apply to individual data points and some apply to all the measurements for a metric. Identification and recovery depend on the nature of the error.
Some data errors involve impossible values that are outside the boundaries of the measurement. Examples include pH outside of 0 to 14, an earthquake larger than 9 on the Richter scale, a human body temperature of 115°F, negative ages and body weights, and sometimes, percentages outside of 0% to 100%. Out-of-bounds data should be corrected, if possible, or removed if not.
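A simple bounds check catches these. A sketch, using the ranges mentioned above (the exact bounds, like the plausible range for human body temperature, are assumptions to be set per metric):

```python
# Hypothetical validity ranges for a few metrics
BOUNDS = {"pH": (0.0, 14.0), "body_temp_F": (90.0, 110.0), "age": (0.0, 125.0)}

def out_of_bounds(metric, value):
    """Return True when a value falls outside its metric's valid range."""
    low, high = BOUNDS[metric]
    return not (low <= value <= high)

out_of_bounds("pH", 15.2)  # True: impossible pH
```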
Data with different precisions
Data should all have the same precision, though this is not always the case. Currency data are often a problem. For example, sometimes dollar amounts are recorded to the cent and sometimes rounded to much larger amounts. This adds extraneous variability to calculated statistics.
Data with different units
Data for a metric should all be measured and reported in the same units. Sometimes, measurements are made in both English and metric units but not converted when included in a dataset. Sometimes, an additional metric is included to specify the unit; however, this can lead to confusion. A famous example of confusion over units was when NASA lost the $125 million Mars Climate Orbiter in 1999. Fixing metrics that have inconsistent units is usually straightforward.
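The straightforward fix is a conversion table keyed by that unit metric. A sketch for a mixed-unit length metric (the record layout is an assumption):

```python
# Conversion factors to meters for a hypothetical mixed-unit length metric
TO_METERS = {"m": 1.0, "ft": 0.3048, "in": 0.0254}

def to_meters(value, unit):
    """Normalize one measurement to meters using its unit code."""
    return value * TO_METERS[unit]

# Each record carries a value and a unit code
lengths = [(2.0, "m"), (10.0, "ft")]
normalized = [to_meters(v, u) for v, u in lengths]  # [2.0, 3.048]
```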
Data with different measurement scales
Having data measured on different scales for a metric is rare but it does happen. In particular, a nominal-scale metric can appear to be an ordinal-scale metric if numbers are used to identify the categories. Time and location scales can also be problematic. Compared to fixing metrics with inconsistent units, fixing metrics with inconsistent scales can be challenging.
Data with large ranges
Data with large ranges, perhaps ranging from zero to millions of units, are an issue because they can cause computational inaccuracies. Replacement by a logarithm or other transformation can address this problem.
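The log transform itself is a one-liner, with one catch worth remembering: zeros have to be offset first. A sketch with hypothetical values:

```python
import math

# Hypothetical metric spanning several orders of magnitude
values = [3.0, 4200.0, 1.8e6]

# A log10 transform compresses the range before analysis
logged = [math.log10(v) for v in values]

# Zeros need an offset first, e.g. log10(v + 1)
with_zero = [0.0, 99.0]
logged_safe = [math.log10(v + 1) for v in with_zero]
```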
Messy data give statisticians nightmares. Untrained analysts would probably not even notice these problems. In fact, even for statisticians, they can be difficult to diagnose because expertise and judgment are needed to establish their presence. Once identified, additional analytical techniques are needed to address the issues. And then, there may not be a consensus on the appropriate response.
Outliers are anomalous data points that just don’t fit with the rest of the metric in the data set. You might think that outliers are easy to identify and fix, and there are many ways to accomplish those things, but there is enough judgment involved in those processes to allow damning criticism from even untrained adversaries. That is a nightmare for an applied statistician, who can be 100% in the right yet still be made to appear a con artist.
Some data are accurate but not precise. That is a nightmare for a statistician because statistical tests rely on extraneous variance being controlled. You can’t find significant differences between mean values of a metric if the variance in the data is too large. A large variance in a metric is easy to identify just by calculating the standard deviation and comparing it to the mean for the metric (the ratio is called the coefficient of variation). There are methods to adjust for large variances, but the best strategy is prevention.
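That calculation can be sketched in a few lines (the sample values are hypothetical):

```python
from statistics import mean, stdev

# Coefficient of variation: standard deviation relative to the mean
def coefficient_of_variation(data):
    return stdev(data) / mean(data)

tight = [10.0, 10.2, 9.8, 10.1]
coefficient_of_variation(tight)  # small CV indicates low extraneous variance
```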
The variance of some metrics occasionally changes with time or with changes in a controlling condition. For example, the variance in a metric may diminish over time as methods of measurement improve. Some biochemical reactions become more variable with changes in temperature, pressure, or pH. This is a nightmare for a statistician because statistical modeling assumes homoskedasticity (i.e., constant variances). Heteroskedasticity in a metric of a data set is easy to identify by calculating and plotting the variances between time periods or categories of other metrics. There are methods to adjust for non-constant variances but they introduce other issues.
Some data can’t be measured accurately because of limitations in the measurement instrument. Those data are reported as “less than” (<) or “greater than” (>) the limit of measurement. They are said to be censored. Very low concentrations of pollutants, for example, are often reported as <DL (less than the detection limit) because the instrument can detect the pollutant but not quantify its concentration. There are a variety of ways to address this issue either by replacing affected data points or by using statistical procedures that account for censored data. Nevertheless, censored data are a nightmare for applied statisticians because there is no consensus on the best way to approach the problem in a given situation.
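One of the simplest (and most debated) replacement conventions is substituting half the detection limit. A sketch with hypothetical pollutant results:

```python
# Hypothetical pollutant results; "<" strings mark censored values
raw = ["0.8", "<0.5", "1.2", "<0.5"]
DETECTION_LIMIT = 0.5

# One common (but debated) convention: substitute half the detection limit
def decensor(entry):
    if entry.startswith("<"):
        return DETECTION_LIMIT / 2
    return float(entry)

values = [decensor(e) for e in raw]  # [0.8, 0.25, 1.2, 0.25]
```

Substitution is easy but biases the variance; survival-analysis-style procedures that model the censoring are the main alternative.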
Corrupted data are created when some improper operation is applied, either manually or by machine, to data needing refinement after it is generated. These errors can be detected most easily at the time they are created. They tend not to be obvious if they are not identified immediately after they occur.
Electronic glitches occur when some interference corrupts a data stream from an automated device. These errors can be detected visually if the data are reviewed. Often, however, such data streams are automated so they do not have to be reviewed regularly. Removal is the typical fix.
It is not uncommon for data elements to have to be extracted from a concatenated metric. For example, a month might have to be extracted from a value formatted as mmddyy (e.g., 070420), or a zip code have to be extracted from a value formatted as a zip code plus four. Such extractions are usually automated. If an error is made in the extraction formula, however, the extracted data will be in error. These errors are usually noticeable and can be replaced by correct data by running a revised extraction formula.
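Building validation into the extraction formula helps the errors announce themselves. A sketch for the mmddyy example above:

```python
# Extract the month from an mmddyy date field
def month_from_mmddyy(value):
    if len(value) != 6 or not value.isdigit():
        return None  # flag malformed entries for review
    month = int(value[:2])
    return month if 1 <= month <= 12 else None  # catch impossible months

month_from_mmddyy("070420")  # 7
```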
As with extraction, it is not uncommon for metrics to have to be processed to correct errors or give them more desirable properties. Such processing is usually automated. If an error is made in the processing algorithm, the resulting data will be in error. For example, NASA has occasionally had instances in which processing photogrammetric data has caused space debris to appear as UFOs and planetary landforms to appear as alien structures (e.g., the Cydonia Region on Mars). These errors are usually noticeable, at least by critics. Processing errors can be replaced by corrected data by running a revised processing algorithm.
Data sets are often manipulated manually to optimize their organizations and formatting for analysis. Cut and paste operations are often used for this purpose. Occasionally, a cut/paste operation will go awry. Detection is easiest at the moment it occurs, when it can be reversed effortlessly. These errors tend not to be so obvious or easy to fix if they are not identified immediately after they occur.
A great many statistical analyses rely on data collected and published by others, usually organizations dedicated to a cause, and often, government agencies. There is usually a presumption that these data are error-free and, at least for government sources, unbiased. They are, of course, neither, but data analysts are limited to using the tools they have at hand. Some errors, or at least inconsistencies, in these data sets are attributable to differences in the nature of the data being measured, differences in data definitions, and differences related to the passage of time. These differences can be overtly stated in metadata or buried deep in the way the creation of the data evolved. In either case, the errors aren’t always visible in the actual data points; they have to be discovered. And even if you discover inconsistencies, you may not be able to fix them.
Errors in data sets built from published data take a variety of forms. First, everything that can happen in the creation of a locally-created data set can happen in a published data set, so there could be just as wide a variety of errors. Reputable sources, however, will scrub out invalid data, dirty data, out-of-spec data, corrupted data, and extraneous data. Most will not address missing data. None will deal with messy data. Missing and messy data are the responsibility of the data analyst. Second, different sources will have assembled their data using different contexts – data definitions, data acquisition methods, business rules, and data administration policies. None of these is usually readily apparent. Some errors may also occur when data sets from different sources are merged. Examples of such errors include replicates and extraction errors. It goes without saying that merging data from different sources can be satisfying yet terrifying, like bungee jumping, cave diving, and registering for Statistics 101. So what could go wrong in your analysis if you don’t consider the possibility of mismatched data?
When you combine data from different sources, or even evaluate data from a single source, be sure you know how the data metrics were defined. Sometimes data definitions change over time or under different conditions. For example, some counts of students in college might include full-time students at both two-year and four-year colleges; other counts may exclude two-year colleges but include part-time students. Say you’re analyzing the number of diabetics in the U.S. The first glucose meter was introduced in 1969, but before 1979, blood glucose testing was complicated and not quantitative. In 1979, a diagnosis of diabetes was defined as a fasting blood glucose of 140 mg/dL or higher. In 1997 the definition was changed to 126 mg/dL or higher. Today, a level of 100 to 125 mg/dL is considered prediabetic. Data definitions make a real difference. So, if you’re analyzing a phenomenon that uses some judgment in data generation, especially phenomena involving technology, be aware of how those judgments might have evolved.
In addition to different data definitions, the context under which a metric was created may be relevant. For example, in 143 years, the Major League Baseball (MLB) record for most home runs in a season has been held by 8 men. The 4 who have held the record the longest are: Babe Ruth (60 home runs, 1919 to 1960); Roger Maris (61 home runs, 1961 to 1997); Ned Williamson (27 home runs, 1884 to 1918); and Barry Bonds (73 home runs, 2001 to 2019). The other 4 recordholders held their record for fewer than 5 years each. During that time, there have been changes in rules, facilities, equipment, coaching strategies, drugs, and of course, players, so it would be ridiculous to compare Ned Williamson’s 27 home runs to Barry Bonds’ 73 home runs. Consider also how perceptions of race and gender might be different in different sources, say a religious organization versus a federal agency. Even surveys by the same organization of the same population using different frames can produce different results. Be sure you understand the contexts data have been generated under when you merge files.
Time is perhaps the most challenging framework to match data on. In business data, for example, relevant parameters might include: fiscal and calendar year; event years (e.g., elections, census, leap years); daily, monthly, and quarterly cutoff days, and seasonality and seasonal adjustments. Data may represent snapshots, statistics (e.g., moving averages, extrapolations), and planned versus reprogramed values. And sometimes, the rules change over time. The first fiscal year of the U.S. Government started on January 1, 1789. Congress changed the beginning of the fiscal year from January 1 to July 1 in 1842, and from July 1 to October 1 in 1977. Time is not on your side.
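Even something as simple as assigning records to the current U.S. federal fiscal year takes care. A sketch (this handles only the post-1977 October 1 start; earlier eras used the different start dates noted above):

```python
from datetime import date

# U.S. federal fiscal year since 1977: FY N runs from October 1 of
# calendar year N-1 through September 30 of calendar year N
def federal_fiscal_year(d):
    return d.year + 1 if d.month >= 10 else d.year

federal_fiscal_year(date(2020, 11, 15))  # FY 2021
```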
There seems to be an endless number of ways that data can go bad. There are at least 35. That realization is soul-crushing for most statisticians, so they come by it slowly. Some do come to grips with the concept that no data set is error-free, or can be error-free, but still can’t imagine the creativity nature has for making this happen. This blog is an attempt to enumerate some of these hazards.
Data errors can occur in individual data points and whole data metrics (and sometimes observations). They can be identified visually, using descriptive statistics, or statistical graphics, depending on the type of error.
The manner of dataset creation provides insight into the types of errors that might be present. Original datasets, one-time creations that become a source of data, are prone to invalid data, dirty data, out-of-spec data, and missing data. Combined datasets (also referred to as merged, blended, fused, aggregated, concatenated, joined, and united) are built from multiple sources of data, either manually or by automation, at one time, periodically, or continuously. These data sets are more prone to corrupted and mismatched data.
Recovering bad data involves the 3 Rs of fixing data errors – Repair, Replacement, and Removal.
Read more about using statistics at the Stats with Cats blog at https://statswithcats.net. Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs, and Other Breeds of Data Analysis at Amazon.com or other online booksellers.