I mentioned in the previous post that I had been poking around in HTML entities and noticed symbols for Fourier transforms and such. I also noticed HTML entities for Cyrillic letters. These entities have the form "&" + transliteration + "cy;".
For example, the Cyrillic letter П is based on the Greek letter Π and its closest English counterpart is P, and its HTML entity is &Pcy;.
The Cyrillic letter Р has HTML entity &Rcy; and not &Pcy; because although it looks like an English P, it sounds more like an English R.
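One way to check the naming pattern is to decode a few entity names with Python's standard library. This is only a small sketch: the helper name decode is made up for illustration, and the names tried are just the ones for the letters discussed here.

```python
import html

def decode(name):
    # Build "&" + name + ";" and let the standard library resolve the entity.
    return html.unescape(f"&{name};")

# Upper and lower case entity names for the two letters discussed above.
for name in ["Pcy", "Rcy", "pcy", "rcy"]:
    print(f"&{name};", "->", decode(name))
```

The upper case names decode to the upper case letters П and Р, and the lower case names to their lower case forms.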
Just as a hack, I decided to write code to transliterate Russian text by converting letters to their HTML entities, then chopping off the initial "&" and the final "cy;".
I don’t speak Russian, but according to Google Translate, the Russian translation of “Hello world” is “Привет, мир.”
Here’s my hello-world program for transliterating Russian.
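    from bs4.dammit import EntitySubstitution

    # The escaper was not defined in the snippet; creating it here makes the code runnable.
    escaper = EntitySubstitution()

    def transliterate(ch):
        # Convert the character to its entity, e.g. "&Pcy;", and drop the leading "&".
        entity = escaper.substitute_html(ch)[1:]
        # Drop the trailing "cy;".
        return entity[:-3]

    a = [transliterate(c) for c in "Привет, мир."]
    print(" ".join(a))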
This prints
P r i v ie t m i r
Here’s what I get trying to transliterate Chebyshev’s native name Пафну́тий Льво́вич Чебышёв.
P a f n u t i j L soft v o v i ch CH ie b y sh io v
I put a space between letters because of possible outputs like “soft v” above.
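The same hack can be sketched without Beautiful Soup by inverting the html5 entity table in Python's standard library. This is an alternative under stated assumptions, not the code used above: the names CYRILLIC and sep are invented for illustration, and the filter assumes that every entity whose name ends in "cy;" and whose value lies in the basic Cyrillic block is a Cyrillic letter.

```python
from html.entities import html5

# Map each Cyrillic character to its entity name with the trailing "cy;" removed.
# Keys of html5 look like "Pcy;" (no leading "&"), so only the suffix is stripped.
CYRILLIC = {
    char: name[:-3]
    for name, char in html5.items()
    if name.endswith("cy;") and len(char) == 1 and "\u0400" <= char <= "\u04ff"
}

def transliterate(text, sep=""):
    # Characters with no Cyrillic entity (spaces, punctuation, accents) pass through unchanged.
    return sep.join(CYRILLIC.get(ch, ch) for ch in text)

print(transliterate("Привет, мир."))
print(transliterate("Привет, мир.", sep=" "))
```

Unlike the version above, this one keeps punctuation and spaces rather than dropping them. Passing sep=" " restores the spacing between letters, which helps when a letter maps to a longer name such as "soft".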
This was just a fun hack. Here’s what I’d get if I used software intended for transliteration.
    import unidecode

    for x in ["Привет, мир", "Пафну́тий Льво́вич Чебышёв"]:
        print(unidecode.unidecode(x))
This produces
Privet, mir
Pafnutii L'vovich Chebyshiov
The results are similar.
The post Russian transliteration hack first appeared on John D. Cook.