The Most Important Statistical Assumptions

About statswithcats

Charlie Kufs has been crunching numbers for over thirty years. He retired in 2019 and is currently working on Stats with Kittens, the prequel to Stats with Cats.

3 Responses to The Most Important Statistical Assumptions

  1. Also crucial is ensuring that the data is not capturing explicit or implicit bias. Statistical models built on biased data will learn those biases and propagate them forward. For an example of the potentially devastating effects of this, see any of the studies on racial bias in facial recognition systems.

  2. ecoquant says:

    Almost no data set of substance is “error free”. The most common “errors” are missing data values in records, so imputation or other techniques always need to be considered. Dropping records with missing values is probably the worst approach, because the relationship between the missingness of values and the outcome being studied is not known in advance. The alternative is to make whatever imputation scheme is used affect conclusions about the outcome as little as possible (a sketch comparing dropping and imputing appears after this thread).

    Miscodings and mistypings can be checked by comparison with preexisting population samples. Fraudulent codings can often be detected using Benford’s Law, but its application demands statistical agility with its theory (a first-digit Benford sketch appears after this thread).

    Representativeness is often a false idol. See Y. Tillé, 2006, Sampling Algorithms, Springer, Section 1.2. Quoting its opening:

    One often says that a sample is representative if it is a reduced model of the population. Representativeness is then adduced as an argument of validity: a good sample must resemble the population of interest in such a way that some categories appear with the same proportions in the sample as in the population. This theory, currently spread by the media, is, however, erroneous. It is often more desirable to overrepresent some categories of the population or even select units with unequal probabilities. The sample must not be a reduced model of the population; it is only a tool used to provide estimates.

    Tillé goes on with a specific example, and the rest of his book is about unequal probability sampling. Also, C.-E. Särndal, B. Swensson, and J. Wretman, Model Assisted Survey Sampling, spend most of their text teaching how to do unequal probability sampling, and why (a weighting sketch appears after this thread).

    • I agree that “almost no data set of substance is ‘error free,’” which is why we should question that presumption. I slap my head whenever I hear a data analyst say they have “cleansed” their dataset. It’s like taking a shower: you remove the dirt and the big stuff, but there’s a lot of bacteria left behind. IMO, that’s not cleansed, which means thoroughly cleaned.

      I don’t agree with the quote. If you want to make inferences to a population, your data must reflect the population or you’re making inferences to some other population, which may not even exist. Imagine your population was frozen pizzas of a particular brand. The population would be characterized by the proportions of bread, cheese, toppings, spices, etc. If you oversample the toppings, for instance, any statistical inferences you made about the brand would be biased. Tillé’s example with the steel mills is a straw man. It is not a very good characterization of the population of steel mills and will, consequently, produce illegitimate results (the weighting sketch after this thread shows how unweighted estimates from such a sample go wrong).
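The missing-data point above invites a small illustration. The sketch below, in Python with pandas, compares listwise deletion to a single median imputation on a made-up DataFrame; the data, column names, and choice of median imputation are assumptions for illustration only, not a claim about which scheme the commenter has in mind.

```python
# Minimal sketch: dropping records with missing values vs. a simple median
# imputation. The DataFrame and columns are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(50, 10, size=200),
    "y": rng.normal(0, 1, size=200),
})
# Knock out some x values at random. (This is missing-completely-at-random;
# real missingness may depend on the outcome, which is the commenter's point.)
df.loc[rng.choice(df.index, size=40, replace=False), "x"] = np.nan

dropped = df.dropna()                          # listwise deletion: loses 20% of records
imputed = df.fillna({"x": df["x"].median()})   # single median imputation keeps them all

print(len(dropped), len(imputed))              # 160 vs. 200 records retained
print(dropped["x"].mean(), imputed["x"].mean())
```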
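For the Benford’s Law remark, here is a minimal first-digit screen. The synthetic amounts and the plain chi-square comparison are illustrative assumptions; as the comment notes, a real application needs more statistical care (sample sizes, whether the data should follow Benford at all, second-digit tests) than this sketch carries.

```python
# Minimal sketch of a first-digit Benford check on a list of positive amounts.
import numpy as np
from scipy import stats

# Hypothetical amounts spanning several orders of magnitude.
amounts = np.random.default_rng(1).lognormal(mean=5, sigma=2, size=5000)

# Leading digit of each amount (scientific notation puts it first).
first_digits = np.array([int(f"{a:e}"[0]) for a in amounts])

digits = np.arange(1, 10)
observed = np.array([(first_digits == d).sum() for d in digits])
expected = np.log10(1 + 1 / digits) * len(first_digits)   # Benford proportions

chi2, p = stats.chisquare(observed, f_exp=expected)
print(dict(zip(digits, np.round(observed / observed.sum(), 3))))
print(f"chi-square = {chi2:.1f}, p = {p:.3f}")
```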
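Finally, a sketch bearing on the unequal-probability sampling exchange. It assumes a made-up population and a Poisson-style selection that over-represents large units: the unweighted sample mean is then biased (the concern in the reply), while an inverse-probability weighted mean of the Horvitz-Thompson kind recovers the population mean (the estimation role Tillé assigns to such designs).

```python
# Minimal sketch: unequal-probability sampling with inverse-probability
# (Horvitz-Thompson-style) weighting. Population and selection probabilities
# are made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
N = 100_000
y = rng.gamma(shape=2.0, scale=10.0, size=N)      # population values
true_mean = y.mean()

# Deliberately over-represent large units: selection probability grows with y.
p_select = 0.001 + 0.02 * (y / y.max())
sampled = rng.random(N) < p_select
y_s, p_s = y[sampled], p_select[sampled]

naive = y_s.mean()                                 # ignores the design: biased upward
weighted = np.sum(y_s / p_s) / np.sum(1.0 / p_s)   # Hajek-style weighted mean

print(f"population mean {true_mean:.2f}, naive {naive:.2f}, weighted {weighted:.2f}")
```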
