July 31, 2023

Suppose you have data that for some reason has been summarized into bins of width h. You don’t have the original data, only the number of counts in each bin.

You can’t exactly find the sample mean or sample variance of the data because you don’t actually have the data. But what’s the best you can do? A sensible approach is to treat the contents of each bin as samples that fell in the middle of the bin. For example, suppose your data were rounded down to the nearest integer. If you have 3 observations in the bin [5, 6) you could think of that as the value 5.5 repeated three times.

When we do this, we don’t recover the original data set, but we create a pseudo data set whose sample mean and sample variance we can compute.
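As a concrete sketch, suppose all we have is a count per bin. The bin edges and counts below are made up for illustration; building the pseudo data set and taking its mean and variance might look like this:

```python
import numpy as np

h = 1.0                                       # bin width
left_edges = np.array([5.0, 6.0, 7.0, 8.0])   # hypothetical bins [5,6), [6,7), ...
counts = np.array([3, 7, 4, 2])               # observations in each bin

# Pseudo data: each bin's midpoint repeated as many times as its count
pseudo = np.repeat(left_edges + h/2, counts)

mean_est = pseudo.mean()        # estimate of the sample mean
var_est = pseudo.var(ddof=1)    # sample variance of the pseudo data
```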

This works well for estimating the mean, but it overestimates the variance. William Sheppard (1863–1936) recommended what is now known as Sheppard’s correction, subtracting h²/12 from the sample variance of the pseudo data. Richardson’s paper mentioned in the previous post fits Sheppard’s correction into the general framework of “the deferred approach to the limit.”
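In code, the correction is a one-line adjustment. Here is a minimal sketch (the function name and argument names are mine, not from the post): given bin counts, bin midpoints, and the bin width h, compute the pseudo-data sample variance and subtract h²/12.

```python
import numpy as np

def sheppard_corrected_var(counts, midpoints, h):
    """Sample variance of the bin-midpoint pseudo data, minus Sheppard's h^2/12."""
    pseudo = np.repeat(midpoints, counts)
    return pseudo.var(ddof=1) - h**2 / 12
```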

A more probabilistic justification of Sheppard’s correction is that h²/12 is the variance of a uniform random variable over an interval of width h. You could think of Sheppard’s correction as jittering the observations uniformly within each bin.
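A quick check of that fact: the variance of a uniform random variable on an interval of width h is indeed h²/12, which a simulation confirms (the seed and sample size below are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(20230731)
h = 1.0
u = rng.uniform(-h/2, h/2, size=1_000_000)

print(u.var(), h**2 / 12)   # the two values agree closely
```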

Let’s do a little simulation to demonstrate the accuracy of estimating the sample variance with and without Sheppard’s correction.

    from numpy import random, floor
    from scipy.stats import norm

    random.seed(20230731)

    numruns = 1000
    rawbias = 0
    sheppardbias = 0
    for run in range(numruns):
        y = norm(0, 3).rvs(size=1000)
        # Bin to integers (h = 1) and replace each sample with its bin midpoint
        ypseudo = floor(y) + 0.5
        v = ypseudo.var(ddof=1)
        # Accumulate the bias with and without Sheppard's correction
        rawbias += v - y.var(ddof=1)
        sheppardbias += (v - 1/12) - y.var(ddof=1)

    print(rawbias/numruns, sheppardbias/numruns)

This Python script computes the variance of the pseudodata, with and without the Sheppard correction, and compares this value to the sample variance of the actual data. This process is repeated one thousand times.

On average, the uncorrected variance of the pseudodata was off by 0.0973 compared to the variance of the actual data. With Sheppard’s correction this drops to an average of 0.0140, i.e. the calculation with Sheppard’s correction was about 7 times more accurate.

The post Variance of binned data first appeared on John D. Cook.
