## The Most Important Statistical Assumptions

Charlie Kufs has been crunching numbers for over thirty years. He retired in 2019 and is currently working on Stats with Kittens, the prequel to Stats with Cats.

### 3 Responses to The Most Important Statistical Assumptions

1. Also crucial is ensuring that the data is not capturing explicit or implicit bias. Statistical models built on biased data will learn those biases and propagate them forwards. For an example of the potentially devastating effects of this, see any study about the racial bias exhibited in facial recognition systems.

2. ecoquant says:

Almost no data set of substance is “error free”. The most common “errors” are missing values in records, so imputation or other techniques always need to be considered. Dropping records with missing values is probably the worst approach, because the relationship between the missingness of values and the outcome being studied is not known in advance. The alternative is to choose an imputation scheme that affects conclusions about the outcome as little as possible.
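A minimal sketch of the simplest such scheme, mean imputation, on hypothetical records (the field names and values here are made up for illustration; mean imputation is only one of many options, and more careful schemes model the missingness mechanism itself):

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def mean_impute(records, field):
    """Fill missing values of `field` with the mean of the observed values,
    rather than dropping the whole record."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = sum(observed) / len(observed)
    return [dict(r, **{field: r[field] if r[field] is not None else fill})
            for r in records]

imputed = mean_impute(mean_impute(records, "age"), "income")
```

Note that dropping the two incomplete records would have discarded half the data set; imputation keeps every record, at the cost of assumptions about the missing values.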

Miscodings and mistypings can be checked by comparison with preexisting population samples. Fraudulent codings can often be detected using Benford’s Law, but applying it demands statistical agility with its theory.
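A rough sketch of the Benford check: under Benford’s Law the leading digit d occurs with probability log10(1 + 1/d), and a chi-squared statistic measures how far observed leading digits stray from that. (This is a toy screening tool, not a fraud test in itself; whether the law should apply at all to a given data set is exactly the theoretical question the comment alludes to.)

```python
import math
from collections import Counter

def benford_deviation(values):
    """Chi-squared distance between the observed leading-digit frequencies
    and the Benford's Law frequencies log10(1 + 1/d) for d = 1..9.
    Large values (relative to a chi-squared with 8 df) flag digit
    distributions that do not follow Benford's Law."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    n = len(digits)
    counts = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (counts.get(d, 0) - expected) ** 2 / expected
    return chi2
```

For example, powers of 2 are known to follow Benford’s Law closely and score low, while a run of values all starting with the same digit scores very high.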

Representativeness is often a false idol. See Y. Tillé, 2006, Sampling Algorithms, Springer, section 1.2. Quoting its opening:

One often says that a sample is representative if it is a reduced model of the population. Representativeness is then adduced as an argument of validity: a good sample must resemble the population of interest in such a way that some categories appear with the same proportions in the sample as in the population. This theory, currently spread by the media, is, however, erroneous. It is often more desirable to overrepresent some categories of the population or even to select units with unequal probabilities. The sample must not be a reduced model of the population; it is only a tool used to provide estimates.

Tillé goes on with a specific example, and the rest of his book is about unequal probability sampling. Likewise, C.-E. Särndal, B. Swensson, and J. Wretman, Model Assisted Survey Sampling, spend most of their text teaching how to do unequal probability sampling, and why.
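A toy sketch of why a deliberately non-representative sample can still yield valid estimates: below, large units are given higher inclusion probabilities (Poisson sampling), and the Horvitz-Thompson estimator weights each sampled unit by the inverse of its inclusion probability, which keeps the estimated total unbiased. The population values here are invented for illustration.

```python
import random

# Hypothetical population of establishment sizes; we estimate the total.
population = [5, 8, 12, 40, 95, 200, 310, 500]
true_total = sum(population)

def ht_estimate(population, inclusion_probs, rng):
    """Poisson sampling with unequal inclusion probabilities, followed by
    the Horvitz-Thompson estimator: each sampled unit y is weighted by
    1/pi, the inverse of its inclusion probability, so the estimator is
    unbiased even though large units are deliberately overrepresented."""
    total = 0.0
    for y, pi in zip(population, inclusion_probs):
        if rng.random() < pi:
            total += y / pi
    return total

# Size-proportional inclusion probabilities, capped at 1 for the largest units.
probs = [min(1.0, 4 * y / true_total) for y in population]
rng = random.Random(42)
estimates = [ht_estimate(population, probs, rng) for _ in range(5000)]
mean_estimate = sum(estimates) / len(estimates)
```

The sample is nothing like a “reduced model” of the population (the big units are almost always in it, the small ones rarely), yet the average of the estimates converges on the true total.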

• I agree that “almost no data set of substance is error free,” which is why we should question that presumption. I slap my head whenever I hear a data analyst say they have “cleansed” their dataset. It’s like taking a shower: you remove the dirt and the big stuff, but there’s a lot of bacteria left behind. IMO, that’s not cleansed, which means thoroughly cleaned.

I don’t agree with the quote. If you want to make inferences to a population, your data must reflect the population or you’re making inferences to some other population, which may not even exist. Imagine your population was frozen pizzas of a particular brand. The population would be characterized by the proportions of bread, cheese, toppings, spices, etc. If you oversample the toppings, for instance, any statistical inferences you made about the brand would be biased. Tillé’s example with the steel mills is a straw man: it is not a very good characterization of the population of steel mills and will, consequently, produce illegitimate results.
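A toy numerical sketch of the point both sides are circling (strata, values, and probabilities all invented for illustration): oversampling one stratum does bias the naive, unweighted estimate, exactly as this reply warns; but if the unequal sampling probabilities are known and used as inverse-probability weights, as in the quote’s framework, the population mean is recovered.

```python
import random

# Toy population with two strata (stand-ins for the pizza components above):
# 80% "bread" units worth 1.0 each, 20% "toppings" units worth 3.0 each.
rng = random.Random(7)
bread = [1.0] * 800
toppings = [3.0] * 200
pop_mean = (sum(bread) + sum(toppings)) / 1000  # true mean = 1.4

# Oversample toppings: include toppings w.p. 0.5 but bread w.p. only 0.1.
sample = [(y, 0.1) for y in bread if rng.random() < 0.1] + \
         [(y, 0.5) for y in toppings if rng.random() < 0.5]

# Treating the sample as a reduced model of the population is biased upward...
naive_mean = sum(y for y, _ in sample) / len(sample)

# ...but weighting each unit by the inverse of its inclusion probability
# (the Hajek ratio estimator) approximately recovers the population mean.
weighted_mean = (sum(y / p for y, p in sample)
                 / sum(1 / p for _, p in sample))
```

So the disagreement is narrower than it looks: uncorrected oversampling does produce biased inferences, while unequal-probability designs remain valid only because the analysis explicitly undoes the unequal representation.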