That means that ciphertext "O" might actually correspond to plaintext "e" or "t" or "a", while "N" might correspond to "x" or "q" or "z". Basically, Holmes can do no more with this analysis than obtain general groups of candidate substitutions.
Fortunately, he has barely begun to dig into his bag of frequency-analysis tricks. The next thing he can do is obtain statistics of pairs of letters, or "digraphs", in the ciphertext, and compare them to a table of average frequencies of such digraphs.
A full table of the average frequency distribution of digraphs in English would be too elaborate to include here, but the general idea is straightforward. Suppose Holmes finds that the digraph "OO" is common in his ciphertext. He has reason to believe that "O" might be "e" or "t" or "a", but he also knows that the digraphs "ee" and "tt" are common in English, while "aa" is not, and so "O" very likely is not a substitution for "a".
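The digraph tally itself is easy to mechanize. The following Python sketch is only an illustration: the function names and the sample text are made up for this example, and it assumes that word breaks are preserved in the ciphertext, so that pairs are counted only within words.

```python
from collections import Counter

def digraph_counts(ciphertext):
    """Count adjacent letter pairs inside each word of the ciphertext."""
    counts = Counter()
    for word in ciphertext.upper().split():
        letters = [ch for ch in word if ch.isalpha()]
        for first, second in zip(letters, letters[1:]):
            counts[first + second] += 1
    return counts

def doubled_letters(counts):
    """Return doubled-letter digraphs such as "OO", most frequent first."""
    doubles = {pair: n for pair, n in counts.items() if pair[0] == pair[1]}
    return sorted(doubles.items(), key=lambda item: (-item[1], item[0]))

# Made-up stand-in for a ciphertext; a real message would be far longer.
sample = "GOOD FOOD TOO SOON"
counts = digraph_counts(sample)
print(doubled_letters(counts))
```

A ciphertext letter that turns up often as a doubled pair is a better candidate for "e" or "t" than for "a", which is the reasoning Holmes applies to "OO".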
There are other patterns, sometimes very specific patterns, that occur with digraphs. For example, in English, a "q" is almost always followed by a "u", so if Holmes determines that "H" in his ciphertext actually substitutes for "q", then if he runs across the digraph "HJ", it is likely that "J" substitutes for "u".
This is an unusually strict rule for digraphs, but other patterns can be picked out. The digraph "ea" is the most common vowel pair, while "ae" is the least. The three high-frequency vowels "a", "i", and "o" tend to not pair up with each other. With an understanding of such rules, Holmes can gradually track down specific letters hidden in the ciphertext.
He can also obtain statistics on triplets of letters, or "trigraphs", or identify entire words. The most common words in English are:
the of and to a in that is I it for as with
* As Holmes expands his analysis of the ciphertext, he focuses less on the mechanical rules of frequency analysis and brings his broader knowledge into play. For example, if he were trying to crack a Nazi cipher, he might know from other messages that have been cracked in the past that it will likely start in plaintext with the salutation: "heil hitler".
If the ciphertext began with: "GOSD GSBDOE", then he would immediately have the mappings:
G: h
O: e
S: i
D: l
B: t
E: r
Such predictable phrases in plaintext are known as "cribs". Military correspondence tends to follow standard formats and is often loaded with cribs.
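Lining up a crib against the ciphertext can be sketched in a few lines of Python. The function name `mapping_from_crib` is hypothetical, and the sketch assumes a monoalphabetic substitution, so it rejects any alignment in which one ciphertext letter would need two plaintext values or two ciphertext letters would share one.

```python
def mapping_from_crib(ciphertext, crib):
    """Pair ciphertext letters with crib letters and check consistency."""
    cipher_letters = [ch for ch in ciphertext if ch.isalpha()]
    plain_letters = [ch for ch in crib if ch.isalpha()]
    if len(cipher_letters) != len(plain_letters):
        raise ValueError("crib and ciphertext differ in length")
    mapping = {}
    for c, p in zip(cipher_letters, plain_letters):
        # setdefault stores p the first time c is seen, else returns the old value
        if mapping.setdefault(c, p) != p:
            raise ValueError(f"{c} would map to both {mapping[c]} and {p}")
    # A monoalphabetic substitution is one-to-one in both directions.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two ciphertext letters map to the same plaintext letter")
    return mapping

mapping = mapping_from_crib("GOSD GSBDOE", "heil hitler")
print(mapping)
```

If the crib does not belong at that position, the consistency checks will usually fail, which is itself useful information when sliding a suspected crib along a message.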
Holmes can also use his knowledge to reconstruct a full word of plaintext if he knows just a few letters and the context of the message. For example, if Holmes has an incomplete word of the form "-u-m-ri-e" and the message was sent from a naval base, he might guess that the full word is "submarine". This is the same skill that is used to solve crossword puzzles, and is known to cryptologists as "anagramming". This usage of the word differs somewhat from the popular usage, which refers to rearranging the letters of one word to form a different word.
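The mechanical half of this guessing game can be sketched in Python by turning the partial word into a pattern and filtering a word list. The function name and the tiny word list are made up for illustration; a real attempt would use a full dictionary.

```python
import re

def candidates(partial, words):
    """Return the words that fit a partial word, where "-" is an unknown letter."""
    pattern = "".join("[a-z]" if ch == "-" else re.escape(ch) for ch in partial)
    regex = re.compile(pattern)
    return [word for word in words if regex.fullmatch(word)]

# Hypothetical word list, for illustration only.
word_list = ["submarine", "submerged", "destroyer", "summarize", "supplies"]
matches = candidates("-u-m-ri-e", word_list)
print(matches)  # ['submarine', 'summarize']
```

Two words fit the pattern here, so the program alone cannot settle the matter. It is Holmes's knowledge of the context, a message from a naval base, that makes "submarine" the likelier reading.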
* Incidentally, if the frequency analysis of the letters of a ciphertext gives results that don't match the average frequency distribution of letters in English, that may indicate that the plaintext is in some other language.
As the average frequency distribution of letters in a language is a fairly good "fingerprint" of that language, the frequency distribution of the letters from the ciphertext may be a good clue to what language it is written in. In any case, Holmes will usually have some idea from context of what languages the message might be written in -- possibly English, French, or Arabic, but not Dutch or Serbo-Croatian.
If Holmes obtains the frequency distribution of the letters of a ciphertext and finds that it more or less matches that of a normal plaintext message without any substitution, he may wonder why the ciphertext is unreadable, but not for long, since he will quickly realize that a transposition has been performed on the text. Similarly, if the frequency distribution indicates a substitution cipher on English plaintext, but Holmes can't get the mappings to make sense and the digraph frequencies come out wildly wrong, then he will likely conclude that the plaintext has been put through both a substitution and a transposition.
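The reason this test works is that a transposition only moves letters around, so each letter keeps its own count, while a substitution relabels the letters, so the same set of counts ends up attached to different letters. The Python sketch below illustrates this with a made-up message. It uses a simple reversal as a stand-in for a transposition and a one-letter alphabet shift as a stand-in for a substitution; both are assumptions chosen for brevity.

```python
from collections import Counter
import string

def letter_counts(text):
    """Count each letter of the text, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

def same_letters(a, b):
    """True if every letter occurs equally often in both texts."""
    return letter_counts(a) == letter_counts(b)

def same_shape(a, b):
    """True if both texts have the same set of counts, whatever the letters."""
    return sorted(letter_counts(a).values()) == sorted(letter_counts(b).values())

plaintext = "attack the harbor at dawn"  # made-up message

# Stand-in transposition: write the message backwards.
transposed = plaintext[::-1]

# Stand-in substitution: shift every letter one place along the alphabet.
alphabet = string.ascii_lowercase
substituted = plaintext.translate(str.maketrans(alphabet, alphabet[1:] + alphabet[:1]))

print(same_letters(plaintext, transposed))   # True: counts are untouched
print(same_letters(plaintext, substituted))  # False: counts moved to other letters
print(same_shape(plaintext, substituted))    # True: the counts themselves survive
```

In practice Holmes would compare against the average distribution for English rather than against the unknown plaintext, but the principle is the same.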
As discussed earlier, a simple transposition can be solved by trying various row sizes and arrangements. A knowledge of digraphs and other letter patterns can also be mined for hints. There is a particularly useful scheme known as "multiple anagramming", in which two ciphertexts of transposed text are deciphered in parallel, with each serving as a crosscheck for the other, until they both make sense. Multiple anagramming is described in more detail in a later chapter.
* The invention of frequency analysis demonstrated a truth that would be shown again and again in the history of cryptology. While there are 4.03E26 possible monoalphabetic substitution alphabets, making a brute-force solution very difficult, frequency analysis quickly cracks monoalphabetic substitution ciphers. Cryptographers have often been lulled into a false sense of security by large numbers, only to have cryptanalysts find a short cut and prove that sense of security a delusion.
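The figure of 4.03E26 is the number of ways to arrange the 26 letters of the alphabet, or 26 factorial. A short Python check, with a purely hypothetical testing rate of one billion alphabets per second chosen only to give a sense of scale:

```python
import math

# Number of possible monoalphabetic substitution alphabets: 26 factorial.
alphabets = math.factorial(26)
print(alphabets)            # 403291461126605635584000000
print(f"{alphabets:.2e}")   # 4.03e+26

# Hypothetical rate, assumed only for illustration: 1e9 alphabets per second.
seconds_per_year = 60 * 60 * 24 * 365
years = alphabets / (1e9 * seconds_per_year)
print(f"{years:.1e} years to try them all")
```

Even at that assumed rate, an exhaustive search would take on the order of ten billion years, which is why the short cut of frequency analysis matters so much.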