Suppose you have data that for some reason has been summarized into bins of width h. You don’t have the original data, only the count in each bin.
You can’t compute the exact sample mean or sample variance because you don’t actually have the data. But what’s the best you can do? A sensible approach is to imagine that the contents of each bin represent samples that fell in the middle of the bin. For example, suppose your data were rounded down to the nearest integer. If you have 3 observations in the bin [5, 6), you could think of that as the value 5.5 repeated three times.
When we do this, we don’t recover the data set, but we create a pseudo data set that we can then find the sample mean and sample variance of.
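Here is a minimal sketch of building the pseudo data set from bin counts. The helper name `pseudo_data` and the example bins and counts are made up for illustration, and the sketch assumes every bin has the same width h and is identified by its left edge.

```python
import numpy as np

def pseudo_data(left_edges, counts, h):
    """Repeat each bin's midpoint as many times as the bin's count.

    Hypothetical helper: assumes equal-width bins [edge, edge + h).
    """
    midpoints = np.asarray(left_edges, dtype=float) + h / 2
    return np.repeat(midpoints, counts)

# Made-up example: bins [4, 5), [5, 6), [6, 7) with counts 2, 3, 1
left_edges = [4, 5, 6]
counts = [2, 3, 1]
pseudo = pseudo_data(left_edges, counts, h=1.0)

pseudo_mean = pseudo.mean()      # sample mean of the pseudo data
pseudo_var = pseudo.var(ddof=1)  # sample variance of the pseudo data
print(pseudo, pseudo_mean, pseudo_var)
```

The bin [5, 6) with count 3 contributes the value 5.5 three times, as in the example above.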
This works well for estimating the mean, but it overestimates the variance. William Sheppard (1863–1936) recommended what is now known as Sheppard’s correction, subtracting h²/12 from the sample variance of the pseudo data. Richardson’s paper mentioned in the previous post fits Sheppard’s correction into the general framework of “the deferred approach to the limit.”
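In code, the correction is a single subtraction. The function name `sheppard_variance` and the sample values below are assumptions for illustration, not anything taken from Sheppard or Richardson.

```python
import numpy as np

def sheppard_variance(pseudo, h):
    """Sample variance of midpoint pseudo data, minus h**2 / 12.

    Hypothetical helper: `pseudo` holds bin midpoints repeated by count,
    and `h` is the common bin width.
    """
    return np.var(pseudo, ddof=1) - h**2 / 12

# Made-up pseudo data from unit-width bins
pseudo = np.array([4.5, 4.5, 5.5, 5.5, 5.5, 6.5])
raw = np.var(pseudo, ddof=1)
corrected = sheppard_variance(pseudo, h=1.0)
print(raw, corrected)
```

With so few observations the correction is a large fraction of the variance. It matters less as the spread of the data grows relative to the bin width.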
A more probabilistic justification of Sheppard’s correction is that h²/12 is the variance of a uniform random variable over an interval of width h. You could think of Sheppard’s correction as jittering the observations uniformly within each bin.
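As a quick sanity check on the h²/12 figure, the following sketch draws uniform samples over an interval of width h and compares their variance with h²/12. The bin width, sample size, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
h = 0.5                         # arbitrary bin width

# Uniform samples over an interval of width h
u = rng.uniform(0, h, size=200_000)
uniform_var = u.var()

# The two printed values should be close to each other
print(uniform_var, h**2 / 12)
```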
Let’s do a little simulation to demonstrate the accuracy of estimating the sample variance with and without Sheppard’s correction.
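    from numpy import random, floor
    from scipy.stats import norm

    random.seed(20230731)

    numruns = 1000
    rawbias = 0
    sheppardbias = 0
    for run in range(numruns):
        y = norm(0, 3).rvs(size=1000)
        ypseudo = floor(y) + 0.5     # midpoint of the unit-width bin containing y
        v = ypseudo.var(ddof=1)
        err = v - y.var(ddof=1)      # error of the uncorrected pseudo data variance
        rawbias += err               # accumulate so the totals can be averaged
        sheppardbias += err - 1/12   # h = 1, so the correction is 1/12

    print(rawbias/numruns, sheppardbias/numruns)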
This Python script computes the variance of the pseudo data, with and without Sheppard’s correction, and subtracts the sample variance of the actual data. It repeats this process one thousand times and reports the average of each difference.
On average, the uncorrected variance of the pseudo data was off by 0.0973 compared to the variance of the actual data. With Sheppard’s correction this drops to an average of 0.0140, i.e. the average error with Sheppard’s correction was about 7 times smaller.
The post Variance of binned data first appeared on John D. Cook.