I have a little script that will print the frequency of the most common characters in a file and the number of lines. All numbers are displayed along with their factorizations. It also prints the number of non-ASCII characters.
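A minimal sketch of such a script in Python might look like the following. It is not the actual script described above; the function names `factorize` and `char_stats` are made up for illustration, and the sketch assumes the file fits in memory and is UTF-8 encoded.

```python
from collections import Counter


def factorize(n):
    """Return the prime factorization of n as a string such as '2^3 3^2 24239'."""
    if n < 2:
        return str(n)
    parts = []
    p = 2
    # Trial division: fine for counts of this size, slow for huge primes.
    while p * p <= n:
        exponent = 0
        while n % p == 0:
            n //= p
            exponent += 1
        if exponent == 1:
            parts.append(str(p))
        elif exponent > 1:
            parts.append(f"{p}^{exponent}")
        p += 1
    if n > 1:
        parts.append(str(n))  # whatever remains is a prime factor
    return " ".join(parts)


def char_stats(text, top=5):
    """Return report lines: most common characters, line count, non-ASCII note."""
    # Newlines are left out of the character counts and reported as lines.
    counts = Counter(ch for ch in text if ch != "\n")
    report = [f"{ch!r}  {n} = {factorize(n)}" for ch, n in counts.most_common(top)]
    # Counts newline characters, so a last line with no newline is not counted.
    num_lines = text.count("\n")
    report.append(f"Num lines:  {num_lines} = {factorize(num_lines)}")
    non_ascii = sum(n for ch, n in counts.items() if ord(ch) > 127)
    if non_ascii:
        report.append(f"Non-ASCII characters: {non_ascii}")
    else:
        report.append("All ASCII characters")
    return report
```

To use it, read a file with `open(path, encoding="utf-8").read()` and print the lines that `char_stats` returns.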
CSV files
These simple statistics are surprisingly useful. For example, when I ran it on a CSV file that I downloaded recently I got the following.
    ,  397907424 = 2^5 3^3 19 24239
    0  58200944 = 2^4 1699 2141
    2  52955465 = 5 467 22679
    1  46413310 = 2 5 23 201797
    3  34811225 = 5^2 1392449
    Num lines:  1745208 = 2^3 3^2 24239
    All ASCII characters
This strongly implies that the CSV file really is a CSV (comma-separated value) file. Sometimes you’ll get a file with a .csv extension but the separator is a tab, a pipe, or some other character.
The number of commas is a multiple of the number of lines. That’s a good sign. Apparently each of the 1,745,208 lines has 12×19 = 228 commas, which means 229 columns, or 228 if every line ends with a trailing comma. If the number of separators is not a multiple of the number of lines, maybe some lines are incomplete. Or maybe your file separator appears inside a quoted string. This is not necessarily a problem, but it means the most naive parsing won’t work.
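The divisibility check is a one-liner, and when it fails the next step is to find which lines are off. The sketch below shows both; `separator_profile` and `irregular_lines` are hypothetical helper names, and the approach assumes that the most common per-line separator count is the correct one.

```python
from collections import Counter

# Totals from the output above: commas divide evenly into lines.
print(divmod(397907424, 1745208))  # (228, 0): 228 commas per line, no remainder


def separator_profile(lines, sep=","):
    """Map each per-line separator count to the number of lines that have it."""
    return Counter(line.count(sep) for line in lines)


def irregular_lines(lines, sep=","):
    """Return (line_number, count) for lines that differ from the most common count."""
    lines = list(lines)
    profile = separator_profile(lines, sep)
    if not profile:
        return []
    expected = profile.most_common(1)[0][0]
    return [
        (i, line.count(sep))
        for i, line in enumerate(lines, start=1)
        if line.count(sep) != expected
    ]
```

A line with too many separators often has one inside a quoted string; a line with too few may be truncated.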
In the file above, the most common characters, other than commas, are digits, so the file probably contains mostly numeric data.
If your file contains quotation marks, better hope it contains an even number. Even better, an even multiple of the number of lines. If not, you have some troubleshooting to do.
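Here is a rough sketch of that quotation mark check. The function name `quote_report` and the keys of the dictionary it returns are invented for this example. Note that a line with an odd number of quotes is not always an error, since a quoted field may legitimately contain a line break.

```python
def quote_report(lines, quote='"'):
    """Summarize quotation mark counts for an iterable of lines."""
    lines = list(lines)
    total = sum(line.count(quote) for line in lines)
    # Lines with an odd number of quotes are the first place to look.
    odd_lines = [i for i, line in enumerate(lines, start=1) if line.count(quote) % 2]
    return {
        "total": total,
        "total_is_even": total % 2 == 0,
        # True when the total is 2k times the number of lines for some integer k.
        "even_multiple_of_lines": bool(lines) and total % (2 * len(lines)) == 0,
        "odd_lines": odd_lines,
    }
```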
Incidentally, whenever the subject of CSV files comes up, someone will say “Why are you using CSV files?! Don’t you know there are better formats?” My reply is that I’m a consultant and I take data in whatever format I can get it, and that most often means a delimiter-separated text file. That works fine, except, of course, when it doesn’t.
Unicode characters
A file with lots of non-ASCII characters is not a problem. A file with one non-ASCII character very often is a problem.
A single non-ASCII character could be an invisible character that will gum up parsing. This can be maddening to find if you’re relying on visual inspection. But if you know there’s a non-ASCII character where it shouldn’t be, such as in a file of digits and commas, then you can simply delete it.
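Finding the stray character is easy to automate. The following sketch reports the position and Unicode name of every non-ASCII character and offers a blunt way to remove them; `find_non_ascii` and `strip_non_ascii` are hypothetical names, and deleting is only appropriate when you are sure the file should be pure ASCII.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, code point, Unicode name) for each non-ASCII character."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on some
    # non-ASCII line separators and hide them from the scan.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNNAMED")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)
```

For example, a zero-width space hidden in a file of digits and commas shows up with its line, column, and name, even though it is invisible in an editor.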
JSON
If you’re inspecting a JSON file, you’d expect to see lots of braces. Hopefully you have an equal number of open and close braces. But if not, you know where to begin troubleshooting. You should also expect a lot of colons. Knowing the number of braces and colons gives you a clue to the structure of the file.
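A sketch of the same idea for JSON follows. The names `json_char_counts` and `brace_imbalance` are made up here. The counts are raw character counts, so braces or colons inside string values are included; this is a quick sanity check, not a parser.

```python
from collections import Counter


def json_char_counts(text):
    """Count the characters that carry JSON structure."""
    counts = Counter(text)
    return {ch: counts[ch] for ch in '{}[]:,"'}


def brace_imbalance(text):
    """Return open minus close counts; zero means the counts match."""
    counts = Counter(text)
    return {
        "braces": counts["{"] - counts["}"],
        "brackets": counts["["] - counts["]"],
    }
```

A nonzero imbalance says something is wrong but not where, and balanced counts do not prove the file is valid JSON.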
Troubleshooting and guessing
When a file has no complications, the stats above tell you things you’d know from looking at the first line or two of the file. However, when there are complications, the stats can be useful.
The stats could also be useful in a context where it’s OK to make guesses. For example, you might have a script that guesses the structure of a file and proceeds accordingly. That’s fine when wrong guesses lead to obviously wrong output. It’s hard to imagine, for example, that mistaking an XML file for a CSV file would produce a subtle error.
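As an illustration of such a guessing script, here is a deliberately crude sketch. The function `guess_format`, its labels, and its rules are all assumptions chosen for this example, and it will guess wrong on plenty of real files.

```python
from collections import Counter


def guess_format(text):
    """Guess a file's format from character counts alone (a crude heuristic)."""
    counts = Counter(text)
    num_lines = max(text.count("\n"), 1)
    stripped = text.lstrip()
    # Markup: starts with an angle bracket and the brackets pair up.
    if stripped.startswith("<") and counts["<"] == counts[">"]:
        return "xml"
    # JSON: starts with a brace or bracket and the counts balance.
    if (stripped and stripped[0] in "{["
            and counts["{"] == counts["}"]
            and counts["["] == counts["]"]):
        return "json"
    # Delimited text: separator count is a positive multiple of the line count.
    for name, sep in (("csv", ","), ("tsv", "\t"), ("pipe-separated", "|")):
        if counts[sep] >= num_lines and counts[sep] % num_lines == 0:
            return name
    return "unknown"
```

The order of the tests matters: a JSON file is full of commas, so the delimiter check comes last.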
The post Data file character frequencies first appeared on John D. Cook.