I’ve typed Karl Weierstrass’ name quite a few times lately and thought about how you’ll sometimes see his name written as Weierstraß in English text. That led me to look up the rules for when to use ß and when to use ss. The rules are moderately complicated, and have varied over time and location. Here I just want to look at the capitalization rules.
Typically ß is replaced with SS when converting from lower case to upper case. This means that the length of a string can change when changing case. Surely this has caused numerous software bugs.
>>> w = "Weierstraß"
>>> len(w), len(w.upper())
(10, 11)
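Here is a small sketch of the sort of bug I have in mind. It assumes nothing beyond Python's built-in string methods: a round trip through upper case does not give back the lower-case form of the original string, so comparing strings by upper-casing one side and lower-casing the other can go wrong. The casefold method is meant for caseless comparison and treats ß and ss alike.

```python
w = "Weierstraß"

# Upper-casing and then lower-casing does not recover the ß.
round_trip = w.upper().lower()
print(round_trip)                # weierstrass
print(round_trip == w.lower())   # False

# casefold() maps ß to ss, so both spellings compare equal.
print(w.casefold() == "WEIERSTRASS".casefold())  # True
```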
There was no uppercase counterpart of ß (U+00DF) until ẞ (U+1E9E) was introduced in 2008. I wondered whether the code above would run differently if I set my locale to de_DE (Germany). Would w.upper() return WEIERSTRASS or WEIERSTRAẞ?
It turns out that Python follows Unicode’s case mapping rules, and these rules say ß becomes SS when changing to uppercase. The code will run the same everywhere, independent of locale. So if you want ß to convert to uppercase as ẞ you’ll have to use customized code.
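One way such customized code might look is sketched below. The function name upper_eszett is my own invention, not part of any library. It swaps ß for ẞ before calling upper, relying on the fact that ẞ has no further uppercase mapping and so passes through unchanged.

```python
def upper_eszett(s: str) -> str:
    """Uppercase s, mapping ß (U+00DF) to ẞ (U+1E9E) rather than to SS.

    Hypothetical helper for illustration only.
    """
    # Replace first; upper() leaves the capital ẞ as it is.
    return s.replace("ß", "ẞ").upper()


w = "Weierstraß"
print(upper_eszett(w))                # WEIERSTRAẞ
print(len(w), len(upper_eszett(w)))   # 10 10
```

Unlike the built-in upper, this version preserves the length of the string, and calling lower on the result gives back weierstraß.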
ASCII was designed so that uppercase and lowercase English letters differed by 32 (i.e. 0x20 in hex). This convention was carried over into Unicode for several other alphabets, though not all, and it almost holds for German, as the following code shows.
upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜẞ"
lower = "abcdefghijklmnopqrstuvwxyzäöüß"
for u, l in zip(upper, lower):
    if ord(l) - ord(u) != 32:
        print("Exception:", u, l)
This prints
Exception: ẞ ß
The code points for Ä and ä, Ö and ö, and Ü and ü were spaced 32 points apart in extensions of ASCII that predate Unicode and the spacing carried over into Unicode. But the uppercase ẞ could not have Unicode value U+00BF because that code point was already occupied by the inverted question mark ¿.
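The arithmetic is easy to check with the standard unicodedata module. The variable name candidate below is just for illustration: it is the code point an uppercase ß would occupy if it followed the 32-apart convention.

```python
import unicodedata

# Where an uppercase ß would sit if it were 32 below the lowercase letter.
candidate = ord("ß") - 32

print(hex(candidate))                          # 0xbf
print(unicodedata.name(chr(candidate)))        # INVERTED QUESTION MARK

# The actual capital letter lives much further up.
print(hex(ord("ẞ")), unicodedata.name("ẞ"))    # 0x1e9e LATIN CAPITAL LETTER SHARP S
```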
The post Uppercase Eszett first appeared on John D. Cook.