
Also crucial is ensuring that the data is not capturing explicit or implicit bias. Statistical models built on biased data will learn those biases and propagate them forwards. For an example of the potentially devastating effects of this, see any study about the racial bias exhibited in facial recognition systems.
Almost no data set of substance is “error free”. The most common “errors” are missing data values in records, so imputation or other techniques always need to be considered. Dropping records with missing values is probably the worst approach, because the relationship between the missingness of values and the outcome being studied is not known in advance. The better alternative is to choose an imputation scheme that affects conclusions about the outcome as little as possible.
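As a concrete sketch of this tradeoff (the tiny data frame, the column names, and the choice of pandas/scikit-learn are my own illustrative assumptions, not anything prescribed above):

```python
# Minimal sketch: listwise deletion vs. two common imputation schemes.
# The data frame below is fabricated purely for illustration.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    "age":    [34, 51, np.nan, 29, 62, np.nan],
    "income": [48_000, np.nan, 39_000, 52_000, np.nan, 61_000],
})

# Worst option: dropping rows silently changes who is in the sample,
# and we don't know how missingness relates to the outcome.
dropped = df.dropna()

# Simple option: fill each column with its own mean.
mean_imputed = df.fillna(df.mean())

# Model-based option: estimate each missing value from the other columns.
iter_imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df), columns=df.columns
)

print("rows kept after dropna:", len(dropped), "of", len(df))
print("column means after mean imputation:\n", mean_imputed.mean())
print("column means after iterative imputation:\n", iter_imputed.mean())
```

Whichever scheme is used, the point above stands: check that conclusions about the outcome are not driven by the imputation choice, for instance by rerunning the analysis under more than one scheme.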
Miscodings and mistypings can be checked by comparison with preexisting population samples. Fraudulent codings can often be detected using Benford’s Law, but applying it demands some statistical agility and familiarity with its theory.
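A rough sketch of the usual Benford first-digit screen follows; the simulated amounts are stand-ins for real transaction data, and a deviation from Benford is a flag to investigate, not proof of fraud.

```python
# Sketch of a Benford's Law screen: compare observed leading-digit frequencies
# with the Benford distribution using a chi-square goodness-of-fit test.
import numpy as np
from scipy.stats import chisquare

def leading_digit(x: float) -> int:
    """Return the first significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=8, sigma=2, size=5000)  # stand-in for real amounts

digits = np.array([leading_digit(a) for a in amounts])
observed = np.array([(digits == d).sum() for d in range(1, 10)])
expected = len(digits) * np.log10(1 + 1 / np.arange(1, 10))  # Benford proportions

stat, p = chisquare(observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p:.3f}")  # a tiny p-value flags a deviation worth investigating
```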
Representativeness is often a false idol. See Y. Tillé, 2006, Sampling Algorithms, Springer, section 1.2. Quoting its opening:
One often says that a sample is representative if it is a reduced model of the population. Representativeness is then adduced as an argument of validity: a good sample must resemble the population of interest in such a way that some categories appear with the same proportions in the sample as in the population. This theory, currently spread by the media, is, however, erroneous. It is often more desirable to overrepresent some categories of the population or even to select units with unequal probabilities. The sample must not be a reduced model of the population; it is only a tool used to provide estimates.
Tillé goes on with a specific example, and the rest of his book is about unequal probability sampling. Also, C.-E. Särndal, B. Swensson, and J. Wretman, in Model Assisted Survey Sampling, spend most of their text teaching how to do unequal probability sampling, and why.
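To make the point concrete, here is a small numpy-only sketch (the population, the size measure, and the target sample size are all invented): units are included with probability roughly proportional to size, so the sample is deliberately not a reduced model of the population, yet weighting each sampled value by the inverse of its inclusion probability (a Horvitz–Thompson estimate) still recovers the population total.

```python
# Unequal-probability (Poisson) sampling with a Horvitz-Thompson estimate.
# Population values and the "size" measure are fabricated for illustration.
import numpy as np

rng = np.random.default_rng(42)
N = 10_000
size = rng.lognormal(mean=0.0, sigma=1.0, size=N)   # auxiliary size measure
y = 5 * size + rng.normal(scale=2.0, size=N)        # study variable, correlated with size

true_total = y.sum()

# Inclusion probabilities proportional to size (capped at 1); expected sample ~ 500.
pi = np.clip(500 * size / size.sum(), 0.0, 1.0)
in_sample = rng.random(N) < pi                      # large units are heavily overrepresented

# Horvitz-Thompson: weight each sampled value by 1 / inclusion probability.
ht_total = (y[in_sample] / pi[in_sample]).sum()

print(f"true total  : {true_total:,.0f}")
print(f"HT estimate : {ht_total:,.0f}")
print(f"sample size : {in_sample.sum()} of {N} (skewed toward large units)")
```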
I agree that “almost no data set of substance is error free,” which is why we should question that presumption. I slap my head whenever I hear a data analyst say they have “cleansed” their dataset. It’s like taking a shower: you remove the dirt and the big stuff, but there’s a lot of bacteria left behind. IMO, that’s not cleansed, which would mean thoroughly cleaned.
I don’t agree with the quote. If you want to make inferences about a population, your data must reflect that population, or you’re making inferences about some other population, which may not even exist. Imagine that your population is frozen pizzas of a particular brand. The population would be characterized by the proportions of bread, cheese, toppings, spices, etc. If you oversample the toppings, for instance, any statistical inferences you made about the brand would be biased. Tillé’s example with the steel mills is a straw man. It is not a very good characterization of the population of steel mills and will, consequently, produce illegitimate results.
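To put rough numbers on this concern, here is a toy sketch with invented “pizza component” shares: a naive, unweighted analysis of a toppings-heavy sample does misstate the brand’s composition, and the distortion only disappears if the unequal sampling rates are undone with inverse-probability weights.

```python
# Toy illustration of oversampling bias with invented "pizza component" shares.
import numpy as np

rng = np.random.default_rng(7)

components = np.array(["bread", "cheese", "toppings", "spices"])
true_share = np.array([0.55, 0.25, 0.15, 0.05])      # hypothetical brand composition

# Population of 100,000 "mass units" with those proportions.
population = rng.choice(components, size=100_000, p=true_share)

# Oversample toppings at 5x the rate of everything else.
rate = np.where(population == "toppings", 0.05, 0.01)
sampled = rng.random(population.size) < rate

# Naive shares ignore how the sample was drawn and overstate toppings.
naive = {c: round(float((population[sampled] == c).mean()), 3) for c in components}

# Inverse-probability weights undo the unequal sampling rates.
w = 1.0 / rate[sampled]
weighted = {c: round(float(w[population[sampled] == c].sum() / w.sum()), 3) for c in components}

print("naive shares   :", naive)
print("weighted shares:", weighted)
print("true shares    :", dict(zip(components, true_share)))
```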