SoatDev IT Consulting
SoatDev IT Consulting
  • About us
  • Expertise
  • Services
  • How it works
  • Contact Us
  • News
  • March 7, 2024
  • Rss Fetcher

Imagine you are a code breaker living a century ago. You’ve intercepted a message, and you go through your bag of tricks, starting with the simplest techniques first. Maybe the message has been encrypted using a simple substitution cipher, so you start with that.

Simple substitution ciphers can be broken by frequency analysis: the most common letter probably corresponds to E, the next most common letter probably corresponds to T, etc. But that’s only for English prose. Maybe the message was composed in French. Or maybe it was composed in Japanese, then transliterated into the Latin alphabet so it could be transmitted via Morse code. You’d like to know what language the message was written in before you try identifying letters via their frequency.

William Friedman’s idea was to compute a statistic, what he dubbed the index of coincidence, to infer the probable language of the source. Since this statistic only depends on symbol frequencies, it gives the same value whether computed on clear text or text encrypted with simple substitution. It also gives the same value if the text has been run through a transposition cipher as well.

(Classical cryptanalysis techniques, such as computing the index of coincidence, are completely useless against modern cryptography. And yet ideas from classical cryptanalysis are still useful for other applications. Here’s an example that came up in a consulting project recently.)

As I mentioned at the top of the post, you’d try breaking the simplest encryption first. If the index of coincidence is lower than you’d expect for a natural language, you might suspect that the message has been encrypted using polyalphabetic substitution. That is, instead of using one substitution alphabet for every letter, maybe the message has been encrypted using a cycle of n different alphabets, such as the Vigenère cypher.

How would you break such a cipher? First, you’d like to know what n is. How would you do that? By trial and error. Try splitting the text into groups of letters according to their position mod n, then compute the index of coincidence again for each group. If the index statistics are much larger when n = 7, you’re probably looking a message encrypted with a key of length 7.

The source language would still leave its signature. If the message was encrypted by cycling through seven scrambled alphabets, each group of seven letters would most likely have the statistical distribution of the language used in the clear text.

Friedman’s index of coincidence, published in 1922, was one statistic that could be computed based on letter frequencies, one that worked well in practice, but you could try other statistics, and presumably people did. The index of coincidence is essentially Rényi entropy with parameter α = 2. You might try different values of α.

If the approach above doesn’t work, you might suspect that the text was not encrypted one letter at a time, even using multiple alphabets. Maybe pairs of letters were encrypted, as in the Playfair cipher. You could test this hypothesis by looking that the frequencies of pairs of letters in the encrypted text, calculating an index of coincidence (or some other statistic) based on pairs of letters.

Here again letter pair frequencies may suggest the original language. It might not distinguish Spanish from Portuguese, for example, but it would distinguish Japanese written in Roman letters from English.

The post Detecting the language of encrypted text first appeared on John D. Cook.

Previous Post
Next Post

Recent Posts

  • Crypto elite increasingly worried about their personal safety
  • Grok says it’s ‘skeptical’ about Holocaust death toll, then blames ‘programming error’
  • Heybike’s Alpha step-through e-bike is an affordable, all-terrain dreamboat
  • U.S. lawmakers have concerns about Apple-Alibaba deal
  • Microsoft’s Satya Nadella is choosing chatbots over podcasts

Categories

  • Industry News
  • Programming
  • RSS Fetched Articles
  • Uncategorized

Archives

  • May 2025
  • April 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • October 2024
  • September 2024
  • August 2024
  • July 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023

Tap into the power of Microservices, MVC Architecture, Cloud, Containers, UML, and Scrum methodologies to bolster your project planning, execution, and application development processes.

Solutions

  • IT Consultation
  • Agile Transformation
  • Software Development
  • DevOps & CI/CD

Regions Covered

  • Montreal
  • New York
  • Paris
  • Mauritius
  • Abidjan
  • Dakar

Subscribe to Newsletter

Join our monthly newsletter subscribers to get the latest news and insights.

© Copyright 2023. All Rights Reserved by Soatdev IT Consulting Inc.