I agree that “Almost no data set of substance is ‘error free,’” which is why we should question that presumption. I slap my head whenever I hear a data analyst say they have “cleansed” their dataset. It’s like taking a shower. You remove the dirt and big stuff, but there are a lot of bacteria left behind. IMO, that’s not “cleansed,” which means thoroughly cleaned.

I don’t agree with the quote. If you want to make inferences to a population, your data must reflect the population or you’re making inferences to some other population, which may not even exist. Imagine your population was frozen pizzas of a particular brand. The population would be characterized by the proportions of bread, cheese, toppings, spices, etc. If you oversample the toppings, for instance, any statistical inferences you made about the brand would be biased. Tillé’s example with the steel mills is a straw man. It is not a very good characterization of the population of steel mills and will, consequently, produce illegitimate results.

Miscodings and mistypings can be checked by comparisons with preexisting population samples. Fraudulent codings can often be detected using Benford’s Law, but applying it demands some statistical agility with the underlying theory.
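For readers who have not seen a Benford check, here is a minimal sketch of the first-digit version. The function names and the chi-square comparison are my own choices for illustration, not a standard library API, and the test assumes the data are of a kind that ought to follow Benford’s Law in the first place (values spanning several orders of magnitude), which is the part that takes judgment.

```python
import math
from collections import Counter

def first_digit(x):
    """Leading nonzero digit of a nonzero number, via scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def benford_expected():
    """Benford's Law: P(first digit = d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_chi_square(values):
    """Pearson chi-square statistic of observed first digits vs. Benford."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )

# Illustrative inputs (not real data): powers of 2 are known to track
# Benford closely, while the integers 100..999 have uniform first digits.
CRITICAL_5PCT_8DF = 15.507  # chi-square critical value, 8 degrees of freedom
benford_like = benford_chi_square([2 ** k for k in range(1, 201)])
uniform_like = benford_chi_square(range(100, 1000))
```

A statistic above the critical value only flags a departure from Benford’s distribution. It is a prompt to look closer, not proof of fraud.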

*Representativeness* is often a false idol. See Y. Tillé, 2006, *Sampling Algorithms*, Springer, section 1.2. Quoting its opening:

*One often says that a sample is representative if it is a reduced model of the population. Representativeness is then adduced as an argument of validity: a good sample must resemble the population of interest in such a way that some categories appear with the same proportions in the sample as in the population. This theory, currently spread by the media, is, however, erroneous. It is often more desirable to overrepresent some categories of the population or even select units with unequal probabilities. The sample must not be a reduced model of the population; it is only a tool used to provide estimates.*

Tillé goes on with a specific example, and the rest of his book is about unequal probability sampling. Also, C-E Särndahl, B. Swensson, J. Wretman, *Model Assisted Survey Sampling* spend most of their text teaching how to do unequal probability sampling, and why.
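To make the unequal probability idea concrete, here is a small simulation sketch. It assumes Poisson sampling (each unit is included independently with its own probability) and uses the Horvitz-Thompson estimator, which weights each sampled value by the inverse of its inclusion probability. The population is invented for illustration, with two huge producers and many small ones. It is not taken from either book.

```python
import random

def poisson_sample(pi, rng):
    """Poisson sampling: include unit i independently with probability pi[i]."""
    return [i for i, p in enumerate(pi) if rng.random() < p]

def horvitz_thompson_total(y, pi, sample):
    """Estimate the population total, weighting each sampled value by 1/pi."""
    return sum(y[i] / pi[i] for i in sample)

def naive_total(y, sample):
    """Unweighted estimate: population size times the plain sample mean."""
    return len(y) * sum(y[i] for i in sample) / len(sample)

# Invented population: 2 huge producers and 200 small ones.
y = [5000.0, 4000.0] + [10.0 + (i % 7) for i in range(200)]
# Deliberately "unrepresentative" design: always take the two big units,
# and take each small unit with probability 0.1.
pi = [1.0, 1.0] + [0.1] * 200

rng = random.Random(0)  # fixed seed so the run is repeatable
ht_estimates, naive_estimates = [], []
for _ in range(2000):
    s = poisson_sample(pi, rng)
    ht_estimates.append(horvitz_thompson_total(y, pi, s))
    naive_estimates.append(naive_total(y, s))

true_total = sum(y)
ht_mean = sum(ht_estimates) / len(ht_estimates)
naive_mean = sum(naive_estimates) / len(naive_estimates)
```

Comparing `ht_mean` and `naive_mean` against `true_total` shows where the bias comes from: the oversampled design is only a problem if the unequal inclusion probabilities are ignored at the estimation stage.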