Introduction To Codes, Ciphers, & Codebreaking
by Greg Goebel

[1.4] FREQUENCY ANALYSIS AGAINST CIPHERS
* Given the large number of possible monoalphabetic substitution cipher alphabets, it might seem like a substitution cipher would be very hard to break. In reality, it's very easy if given a reasonably large ciphertext message to analyze, but it took over a thousand years to figure out how.

The basic approach for cracking a monoalphabetic substitution cipher was invented by a multi-talented medieval Arabic scholar named al-Kindi, and is now known as "frequency analysis". His work was an outgrowth of efforts by Arabs to perform textual analyses of religious texts to see if they actually were written by the Prophet.

Frequency analysis is a statistical method. In every language, some letters are used on average more than others, and the percentages of the letters in a given language tend to be constant. For example, the "frequencies" of the different letters of the alphabet in English are roughly as follows, arranged from "most frequent" to "least frequent" with their average percentage of use:

e: 12.7
t: 9.1
a: 8.2
o: 7.5
i: 7.0
n: 6.9
s: 6.3
h: 6.1
r: 6.0
d: 4.2
l: 4.0
c: 2.8
u: 2.8
m: 2.4
w: 2.4
f: 2.2
g: 2.0
y: 2.0
p: 1.9
b: 1.5
v: 1.0
k: 0.8
j: 0.2
x: 0.2
q: 0.1
z: 0.1

Different samplings of English text will give slight variations in the percentages, since this is just an average. Some text might even wildly deviate from the average. In 1969 a French author named Georges Perec published a short novel named LA DISPARITION that did not contain the letter "e" anywhere in the text. The book was translated into English under the title A VOID by a British writer, Gilbert Adair, and the translation still did not contain the letter "e".
That was of course a very extreme case. Different classes of correspondence will tend to have a slightly different set of averages. Military communications, for example, tend to be terse, dropping pronouns like "I" or "me", and also incorporating lots of acronyms, skewing the letter frequencies. In addition, the shorter the text, the more it tends to differ from the averages, as the sample size is small.

Despite these variations, the general pattern will remain the same for most English text, with "e" at the top of the frequency list, and "q" and "z" at the bottom. Incidentally, this pattern differs significantly from language to language; for example, in German the average frequency of "e" is 19%. Of course, similar average frequency tables can be built up for other languages.
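
As a rough illustration of how such a table can be built, the following Python sketch counts the letters in a piece of text and converts the counts to percentages. The function name `letter_frequencies` and the sample sentence are illustrative assumptions, not part of the original discussion, and a sample this short will not reproduce the averages listed above.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: percentage} for the letters a-z, most frequent first."""
    # Keep only the 26 unaccented letters; case is ignored.
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    # most_common() orders the letters by count, highest first.
    return {letter: 100.0 * n / total for letter, n in counts.most_common()}

# Hypothetical sample text; any longer English passage works better.
sample = "Frequency analysis is a statistical method."
for letter, percent in letter_frequencies(sample).items():
    print(f"{letter}: {percent:.1f}")
```

The longer the text fed to the function, the closer the output should come to the average table.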

* Now suppose Holmes performs the same analysis on a ciphertext produced by a monoalphabetic substitution cipher, and determines that the cipher letters have a pattern of frequencies as follows:

O: 9.9
G: 9.3
B: 8.6
I: 7.9
C: 7.6
Y: 7.2
W: 7.1
A: 6.7
V: 6.6
F: 6.2
S: 4.3
U: 4.3
J: 3.3
D: 3.1
M: 2.6
L: 2.5
P: 2.2
Z: 2.1
K: 1.8
E: 1.4
X: 1.2
R: 1.0
T: 0.7
H: 0.3
Q: 0.1
N: 0.1

The frequency of the cipher letters of course is the frequency of their plaintext equivalents, and so at first sight it would be logical to believe that the ciphertext "O" at the top of the list corresponds to plaintext "e", while the ciphertext "N" at the bottom of the list corresponds to plaintext "z".
However, this is too simplistic. The average frequencies of letters are just that, averages, and the actual frequencies of letters in any one example of text will vary from that average. The most that can be said is that the most common letters will bubble to the top of the frequency list, while the least common will sink to the bottom.

That means that ciphertext "O" might actually correspond to plaintext "e" or "t" or "a", while "N" might correspond to "x" or "q" or "z". Basically, Holmes can do no more with this analysis than obtain general groups of candidate substitutions.
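
The idea of candidate groups can be sketched in Python as follows. Each cipher letter is ranked by frequency and paired with the English letters whose rank lies within a small window of its own. The window of two ranks and the names `ENGLISH_ORDER`, `CIPHER_FREQS`, and `candidate_groups` are assumptions made for illustration; how wide the groups should be in practice is a matter of judgment.

```python
# English letters from most to least frequent, per the table given earlier.
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

# Cipher-letter percentages from Holmes' analysis of the ciphertext.
CIPHER_FREQS = {
    "O": 9.9, "G": 9.3, "B": 8.6, "I": 7.9, "C": 7.6, "Y": 7.2, "W": 7.1,
    "A": 6.7, "V": 6.6, "F": 6.2, "S": 4.3, "U": 4.3, "J": 3.3, "D": 3.1,
    "L": 2.5, "M": 2.6, "P": 2.2, "Z": 2.1, "K": 1.8, "E": 1.4, "X": 1.2,
    "R": 1.0, "T": 0.7, "H": 0.3, "Q": 0.1, "N": 0.1,
}

def candidate_groups(cipher_freqs, window=2):
    """Map each cipher letter to the plaintext letters of similar rank."""
    # Sort cipher letters by frequency, highest first (ties keep table order).
    ranked = sorted(cipher_freqs, key=cipher_freqs.get, reverse=True)
    groups = {}
    for rank, letter in enumerate(ranked):
        lo = max(0, rank - window)
        hi = min(len(ENGLISH_ORDER), rank + window + 1)
        groups[letter] = ENGLISH_ORDER[lo:hi]
    return groups

groups = candidate_groups(CIPHER_FREQS)
print(groups["O"])  # eta
print(groups["N"])  # xqz
```

The output is only a list of suspects for each cipher letter; it narrows the search rather than solving the cipher.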

Fortunately, he has only scratched the surface of his bag of tricks of frequency analysis. The next thing he can do is obtain statistics of pairs of letters, or "digraphs", in the ciphertext, and compare them to a table of average frequencies of such digraphs.

A full table of the average frequency distribution of digraphs in English would be too elaborate to include here, but the general idea is straightforward. Suppose Holmes finds that the digraph "OO" is common in his ciphertext. He has reason to believe that "O" might be "e" or "t" or "a", but he also knows that the digraphs "ee" and "tt" are common in English, while "aa" is not, and so "O" very likely is not a substitution for "a".

There are other patterns, sometimes very specific patterns, that occur with digraphs. For example, in English, a "q" is almost always followed by a "u", so if Holmes determines that "H" in his ciphertext actually substitutes for "q", then if he runs across the digraph "HJ", it is likely that "J" substitutes for "u".

This is an unusually strict rule for digraphs, but other patterns can be picked out. The digraph "ea" is the most common vowel pair, while "ae" is the least. The three high-frequency vowels "a", "i", and "o" tend to not pair up with each other. With an understanding of such rules, Holmes can gradually track down specific letters hidden in the ciphertext.
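
Counting digraphs is mechanical work, and a minimal Python sketch of it follows. It assumes that word breaks are preserved in the ciphertext, so that pairs are only counted inside words; the function names and the short ciphertext are made up for illustration.

```python
from collections import Counter

def digraph_counts(text):
    """Count adjacent letter pairs inside each word of the text."""
    counts = Counter()
    for word in text.upper().split():
        letters = [ch for ch in word if ch.isalpha()]
        # Pair each letter with its successor: "SEE" gives "SE" and "EE".
        for first, second in zip(letters, letters[1:]):
            counts[first + second] += 1
    return counts

def doubled_letters(counts):
    """Pick out digraphs such as "OO" in which a letter follows itself."""
    return {pair: n for pair, n in counts.items() if pair[0] == pair[1]}

# Hypothetical ciphertext fragment, not taken from a real message.
counts = digraph_counts("GOO BOOD GOB")
print(counts.most_common(3))
print(doubled_letters(counts))  # {'OO': 2}
```

A common doubled cipher letter is a candidate for a plaintext letter that doubles often in English, such as "e" or "t", and a poor candidate for one that rarely does, such as "a".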

He can also obtain statistics on triplets of letters, or "trigraphs", or identify entire words. The most common words in English are:

the of and to a in that is I it for as with

* As Holmes expands his analysis of the ciphertext, he focuses less on the mechanical rules of frequency analysis and brings his broader knowledge into play. For example, if he were trying to crack a Nazi cipher, he might know from other messages that have been cracked in the past that it will likely start in plaintext with the salutation: "heil hitler".
If the ciphertext began with: "GOSD GSBDOE", then he would immediately have the mappings:

G: h
O: e
S: i
D: l
B: t
E: r
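
Lining up a guessed phrase against the ciphertext in this way is easy to automate. The Python sketch below pairs the letters off and rejects the guess if it would require one cipher letter to stand for two plaintext letters, or the reverse. The function name `mapping_from_known_plaintext` is an illustrative assumption.

```python
def mapping_from_known_plaintext(ciphertext, plaintext):
    """Derive cipher-letter -> plaintext-letter pairs from an aligned guess."""
    if len(ciphertext) != len(plaintext):
        raise ValueError("ciphertext and guessed plaintext differ in length")
    mapping = {}  # cipher letter -> plaintext letter
    reverse = {}  # plaintext letter -> cipher letter
    for c, p in zip(ciphertext, plaintext):
        if not c.isalpha():
            continue  # skip spaces and punctuation
        # setdefault stores the pair on first sight and returns the
        # earlier value afterwards, so any conflict shows up here.
        if mapping.setdefault(c, p) != p or reverse.setdefault(p, c) != c:
            raise ValueError("guess does not fit a monoalphabetic cipher")
    return mapping

mapping = mapping_from_known_plaintext("GOSD GSBDOE", "heil hitler")
for cipher_letter, plain_letter in mapping.items():
    print(f"{cipher_letter}: {plain_letter}")
```

A guess that raises an error can be discarded at once, which makes it cheap to test a list of likely phrases against the start of a message.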

Such predictable phrases in plaintext are known as "cribs". Military correspondence tends to follow standard formats and is often loaded with cribs.
Holmes can also use his knowledge to reconstruct a full word of plaintext if he knows just a few letters and the context of the message. For example, if Holmes has an incomplete word of the form "-u-m-ri-e" and the message was sent from a naval base, he might guess that the full word is "submarine". This is the same skill that is used to solve crossword puzzles, and is known to cryptologists as "anagramming". This usage of the word is somewhat different from the popular usage, which refers to a scrambling of the letters of one word into a different word.
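
Filling in a partly known word can be sketched as a pattern search over a word list. In the Python example below, each "-" stands for one unknown letter; the function name `words_matching` and the small list `NAVAL_WORDS` are assumptions for illustration, standing in for whatever vocabulary the context of the message suggests.

```python
import re

def words_matching(pattern, wordlist):
    """Return the words that fit a pattern in which '-' is an unknown letter."""
    # Turn "-u-m-ri-e" into the regular expression "[a-z]u[a-z]m[a-z]ri[a-z]e".
    regex = re.compile(
        "".join("[a-z]" if ch == "-" else re.escape(ch) for ch in pattern)
    )
    return [word for word in wordlist if regex.fullmatch(word)]

# Hypothetical vocabulary for a message sent from a naval base.
NAVAL_WORDS = ["submarine", "destroyer", "battleship", "torpedo", "convoy"]

print(words_matching("-u-m-ri-e", NAVAL_WORDS))  # ['submarine']
```

With a full dictionary in place of the short list, several words may fit, and the context of the message has to decide between them.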

* Incidentally, if the frequency analysis of the letters of a ciphertext gives results that don't match the average frequency distribution of letters in English, that may indicate that the plaintext is in some other language.

As the average frequency distribution of letters in a language is a fairly good "fingerprint" of that language, the frequency distribution of the letters from the ciphertext may be a good clue to what language it is written in. In any case, usually Holmes will have from context some idea of what possible languages the message might be written in -- possibly English, French, or Arabic, but not Dutch or Serbo-Croatian.

If Holmes obtains the frequency distribution of the letters of a ciphertext and finds that it more or less matches that of a normal plaintext message without any substitution, he may wonder why the ciphertext is unreadable, but not for long, since he will quickly realize that a transposition has been performed on the text. Similarly, if the frequency distribution indicates a substitution cipher on English plaintext, but Holmes can't get the mappings to make sense and gets crazy results from frequency analysis of the digraphs in the message, then he will likely conclude that the plaintext has been put through both a substitution and a transposition.
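
One way to make this comparison concrete is sketched below in Python. It measures two distances from the English averages: a "direct" distance that compares each letter with itself, and a "shape" distance that compares only the sorted profiles, ignoring which letter is which. A transposition leaves both small, while a substitution leaves the shape distance unchanged but inflates the direct distance. The distance measure, the function names, and the use of a reversed string and a three-place alphabet shift as stand-ins for a transposition and a substitution are all assumptions made for illustration.

```python
from collections import Counter
import string

# Average English percentages from the table given earlier.
ENGLISH = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.9, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.2, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.2, "x": 0.2, "q": 0.1, "z": 0.1,
}

def profile(text):
    """Percentage of each letter a-z in the text (0.0 for absent letters)."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters) or 1  # avoid dividing by zero on empty input
    return {l: 100.0 * counts[l] / total for l in string.ascii_lowercase}

def direct_distance(text):
    """Sum of letter-by-letter differences from the English averages."""
    p = profile(text)
    return sum(abs(p[l] - ENGLISH[l]) for l in string.ascii_lowercase)

def shape_distance(text):
    """Same comparison, but between the two profiles sorted by size."""
    observed = sorted(profile(text).values(), reverse=True)
    expected = sorted(ENGLISH.values(), reverse=True)
    return sum(abs(o - e) for o, e in zip(observed, expected))

plain = ("frequency analysis is a statistical method in every language some "
         "letters are used on the average more than others")
transposed = plain[::-1]  # crude transposition: the text reversed
shifted = "".join(        # crude substitution: each letter moved 3 places
    chr((ord(ch) - 97 + 3) % 26 + 97) if ch.isalpha() else ch for ch in plain
)

for name, text in [("plain", plain), ("transposed", transposed),
                   ("substituted", shifted)]:
    print(f"{name}: direct={direct_distance(text):.1f} "
          f"shape={shape_distance(text):.1f}")
```

Where to draw the line between "small" and "large" depends on the length of the message, so the numbers are best read as a comparison rather than as a pass or fail test.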

As discussed earlier, a simple transposition can be solved simply by trying various row sizes and arrangements. A knowledge of digraphs and other letter patterns can also be mined for hints. There is a particularly useful scheme known as "multiple anagramming", in which two ciphertexts of transposed text are deciphered in parallel, with each serving as a crosscheck for the other, until they both make sense. Multiple anagramming is described in more detail in a later chapter.

* The invention of frequency analysis demonstrated a truth that would be shown again and again in the history of cryptology. While there are 4.03E26 possible monoalphabetic substitution alphabets, making a brute-force solution very difficult, frequency analysis quickly cracks monoalphabetic substitution ciphers. Cryptographers have often been lulled into a false sense of security by large numbers, only to have cryptanalysts find a short cut and prove that sense of security a delusion.
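
The figure of 4.03E26 is the number of ways to arrange the 26 letters of the alphabet, or 26 factorial, and can be checked with a few lines of Python. The rate of one billion alphabets tested per second is a purely hypothetical figure, chosen only to give a sense of scale.

```python
import math

# Number of possible monoalphabetic substitution alphabets: 26!
alphabets = math.factorial(26)
print(alphabets)           # 403291461126605635584000000
print(f"{alphabets:.2e}")  # 4.03e+26

# Hypothetical brute-force rate, assumed for illustration only.
TRIALS_PER_SECOND = 1e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = alphabets / TRIALS_PER_SECOND / SECONDS_PER_YEAR
print(f"{years:.1e} years to try every alphabet")
```

Even at that assumed rate, the search would take on the order of ten billion years, which is why the short cut offered by frequency analysis matters so much.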
