SoatDev IT Consulting
  • July 23, 2023
  • Rss Fetcher

I recently ran across a paper on typesetting rare Chinese characters. From the abstract:

Written Chinese has tens of thousands of characters. But most available fonts contain only around 6 to 12 thousand common characters that can meet the needs of everyday users. However, in publications and information exchange in many professional fields, a number of rare characters that are not in common fonts are needed in each document.

There’s sort of a paradox here: the author is saying it’s common to need rare words. Aren’t rare words, you know, rare? Of course they are, but the chances of needing some rare word, not just a particular rare word, can be large, particularly in lengthy documents.

This post gives a sort of back-of-the-envelope calculation to justify the preceding paragraph.

Word frequencies often approximately follow Zipf’s law, where the frequency of the nth most common word is proportional to 1/n^s for some positive exponent s. I’ve seen estimates that there are around N = 50,000 characters in Chinese, but that 1,000 characters make up about 90% of usage. This would correspond to a value of s around 1.25.
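Here is a rough sketch of how one might check that exponent. It assumes a pure Zipf model over all N = 50,000 characters, and the helper names (`harmonic`, `coverage`) are made up for this illustration. It computes the share of usage covered by the top 1,000 characters when s = 1.25, then uses bisection to find the s that gives exactly 90% coverage.

```python
def harmonic(n, s):
    """Generalized harmonic number: sum of 1/k**s for k = 1..n."""
    return sum(k ** -s for k in range(1, n + 1))

def coverage(top, total, s):
    """Share of usage covered by the `top` most common of `total` items."""
    return harmonic(top, s) / harmonic(total, s)

N = 50_000  # assumed number of characters, as in the text

# Coverage of the top 1,000 characters under the assumed exponent.
cov = coverage(1_000, N, 1.25)
print(f"coverage at s = 1.25: {cov:.3f}")

# Coverage increases with s, so bisection finds the s giving 90% coverage.
lo, hi = 1.0, 2.0
for _ in range(30):
    mid = (lo + hi) / 2
    if coverage(1_000, N, mid) < 0.90:
        lo = mid
    else:
        hi = mid
s_fit = (lo + hi) / 2
print(f"s for 90% coverage: {s_fit:.3f}")
```

Both numbers should land close to the figures quoted above, which is all this back-of-the-envelope model needs.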

In practice, Zipf’s law, like all power laws, fits better over some parts of its range than others. We’re making a simplifying assumption by applying Zipf’s law to the entire vocabulary of Chinese, but this post isn’t trying to precisely model Chinese character frequency, only to show that the statement quoted above is plausible.

With our Zipf’s law model, the 10,000th most common character in Chinese would appear about 2 times in a million characters. But the combined frequency of all the characters from the 10,000th most common to the 50,000th most common would be about 0.03.
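The two figures in the preceding paragraph can be reproduced with a few lines of Python. This is only a sketch under the same simplifying assumptions (pure Zipf model, N = 50,000, s = 1.25), and the variable names are chosen for this example.

```python
def harmonic(n, s):
    """Generalized harmonic number: sum of 1/k**s for k = 1..n."""
    return sum(k ** -s for k in range(1, n + 1))

N = 50_000   # assumed number of characters
s = 1.25     # assumed Zipf exponent
cutoff = 10_000

total = harmonic(N, s)  # normalizing constant

# Relative frequency of the single character at rank 10,000.
freq_at_cutoff = cutoff ** -s / total

# Combined frequency of ranks 10,000 through 50,000.
tail_mass = (total - harmonic(cutoff - 1, s)) / total

print(f"per million at rank {cutoff}: {freq_at_cutoff * 1e6:.2f}")
print(f"combined frequency of the tail: {tail_mass:.3f}")
```

The individual frequency is tiny, but there are 40,000 characters in the tail, and their small contributions add up.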

So if we list all characters in order of frequency and call everything after the 10,000th position on the list rare, the combined frequency of all rare characters is quite high, about 3%. To put it another way, a document of 1,000 characters would be expected to contain around 30 rare characters, according to the simplified model presented here.
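To connect this back to the question of needing some rare character in a document, here is a small sketch that goes one step beyond the text. It adds an assumption of its own: characters are drawn independently, each being rare with probability 0.03. Real text is not independent from character to character, so treat the output as a rough guide only. The function names are invented for this example.

```python
P_RARE = 0.03  # combined frequency of rare characters, from the model above

def expected_rare(length, p=P_RARE):
    """Expected number of rare characters in a document of given length."""
    return length * p

def prob_at_least_one(length, p=P_RARE):
    """Chance of at least one rare character, assuming independent draws."""
    return 1 - (1 - p) ** length

for length in (10, 100, 1_000):
    print(length, expected_rare(length), round(prob_at_least_one(length), 3))
```

Under these assumptions, even a document of 100 characters is more likely than not to contain a rare character, and at 1,000 characters it is nearly certain.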

Related posts

  • Chinese character frequency and entropy
  • Estimating vocabulary size with Heaps law
  • Passwords and power laws
  • Twitter follower distribution

The post How rare is it to encounter a rare word? first appeared on John D. Cook.
