# Cryptanalysis of classical ciphers (historical)

This section presents background material on redundancy and unicity distance, and techniques for cryptanalysis of classical ciphers,

(i) Redundancy

All natural languages are redundant. This redundancy results from linguistic structure. For example, in English the letter “E” appears far more frequently than “Z”, “Q” is almost always followed by “U”, and '‘TH” is a common digram.

An alphabet with 26 characters (e.g., Roman alphabet) can theoretically cany up to lg 26 = 4.7 bits of information per character. Fact 7.67 indicates that, on average, far less information is actually conveyed by a natural language.

7.67 Fact The estimated average amount of information carried per character (per-character entropy) in meaningful English alphabetic text is 1.5 bits.

The per-character redundancy of English is thus about 4.7 — 1.5 = 3.2 bits.

• 7.68 Fact Empirical evidence suggests that, for essentially any simple substitution cipher on a meaningful message (e.g., with redundancy comparable to English), as few as 25 ciphertext characters suffices to allow a skilled cryptanalyst to recover the plaintext.
• (ii) Unicity distance and random cipher model
• 7.69 Definition The unicity distance of a cipher is the minimum amount of ciphertext (number of characters) required to allow a computationally unlimited adversary to recover the unique encryption key.

The unicity distance is primarily a theoretical measure, useful in relation to unconditional security. A small unicity distance does not necessarily imply that a block cipher is insecure in practice. For example, consider a 64-bit block cipher with a unicity distance of two ciphertext blocks. It may still be computationally infeasible for a cryptanalyst (of reasonable but bounded computing power) to recover the key, although theoretically there is sufficient information to allow this.

The random cipher model (Definition 7.70) is a simplified model of a block cipher providing a reasonable approximation for many purposes, facilitating results on block cipher properties not otherwise easily established (e.g., Fact 7.71).

7.70 Definition Let C and К be random variables, respectively, denoting the ciphertext block and the key, and let D denote the decryption function. Under the random cipher model, I)k(C) is a random variable uniformly distributed over all possible pre-images of C (meaningful messages and otherwise, with and without redundancy).

In an intuitive sense, a random cipher as per the model of Definition 7.70 is a random mapping. (A more precise approximation would be as a random permutation.)

7.71 Fact Under the random cipher model, the expected unicity distance N0 of a cipher is N0 = H{K)/D, where H(JC) is the entropy of the key space (e.g., 64 bits for 204 equiprobable keys), and D is the plaintext redundancy (in bits/character).

For a one-tune pad, the unbounded entropy of the key space implies, by Fact 7.71, that the unicity distance is likewise unbounded. This is consistent with the one-time pad being theoretically unbreakable.

Data compression reduces redundancy. Fact 7.71 implies that data compression prior to encryption increases the unicity distance, thus increasing security. If the plaintext contains no redundancy whatsoever, then the unicity distance is infinite; that is, the system is theoretically unbreakable under a ciphertext-only attack.

7.72 Example (unicity distance - transposition cipher) The unicity distance of a simple transposition cipher of period t can be estimated under the random cipher model using Fact 7.71, and the assumption of plaintext redundancy of D = 3.2 bits/character. In this case, H(K)/D = lg(t!)/3.2 and for t = 12 the estimated unicity distance is 9 characters, which is very crude, this being less than one 12-character block. For t = 27, the estimated unicity distance is a more plausible 29 characters; this can be computed using Stirling’s approximation of Fact 2.57(iii) (i! « f2nt(t/e)1, for large t and e = 2.718) as H(K)/D = lg(t!)/3.2 ss (0.3t) • lg(t/e). □

• 7.73 Example (unicity distance - simple substitution) The number of keys for a mono-alphabetic substitution cipher over alphabet Л is K = s!, where s = |Л|. For example, s = 26 (Roman alphabet) yields 26! « 4 x 102G keys. Assuming equiprobable keys, an estimate of the entropy of the key space is then (cf. Example 7.72) H(K.) = lg(26!) » 88.4 bits. Assuming Enghsh text with D = 3.2 bits of redundancy per character (Fact 7.67), a theoretical estimate of the unicity distance of a simple substitution cipher is H(K.)/D = 88.4/3.2 и 28 characters. This agrees closely with empirical evidence (Fact 7.68). □
• (iii) Language statistics

Cryptanalysis of classical ciphers typically relies on redundancy in the source language (plaintext). In many cases a divide-and-conquer approach is possible, whereby the plaintext or key is recovered piece by piece, each facilitating further recovery.

Mono-alphabetic substitution on short plaintext blocks (e.g., Roman alphabet characters) is easily defeated by associating ciphertext characters with plaintext characters (Note 7.50). The frequency distribution of individual ciphertext characters can be compared to that of single characters in the source language, as given by Figure 7.5 (estimated from 1964 English text). This is facilitated by grouping plaintext letters by frequency into high, medium, low, and rare classes; focussing on the high-frequency class, evidence supporting trial letter assignments can be obtained by examining how closely hypothesized assignments match those of the plaintext language. Further evidence is available by examination of digram and trigram frequencies. Figure 7.6 gives the most common English digrams as a percentage of all digrams; note that of 262 = 676 possible digrams, the top 15 account for 27% of all occurrences. Other examples of plaintext redundancy appealing in the cipher- text include associations of vowels with consonants, and repeated letters in pattern words (e.g., “that”, “soon”, “three”).

Figure 7.5: Frequency of single characters in English text.

7.74 Note (large blocks preclude statistical analysis) An n-bit block size implies 2" plaintext units (“characters”). Compilation of frequency statistics on plaintext units thus becomes infeasible as the block size of the simple substitution increases; for example, this is clearly infeasible for DES (§7.4), where n = 64.

Cryptanalysis of simple transposition ciphers is similarly facilitated by source language statistics (see Note 7.47). Cryptanalyzing transposed blocks resembles solving an anagram. Attempts to reconstruct common digrams and trigrams are facilitated by frequency statistics. Solutions may be constructed piecewise, with the appearance of digrams and trigrams in trial decryptions confirming (partial) success.

Figure 7.6: Frequency of 15 common digrams in English text.

Cryptanalysis of polyalphabetic ciphers is possible by various methods, including Ka- siski’s method and methods based on the index of coincidence, as discussed below.

(iv) Method of Kasiski (vs. polyalphabetic substitution)

Kasiski’s method provides a general technique for cryptanalyzing polyalphabetic ciphers with repeated keywords, such as the simple Vigenere cipher (Definition 7.53), based on the folio whig observation: repeated portions of plaintext encrypted with the same portion of the keyword result in identical ciphertext segments. Consequently one expects the number of characters between the beginning of repeated ciphertext segments to be a multiple of the keyword length. Ideally, it suffices to compute the greatest common divisor of the various distances between such repeated segments, but coincidental repeated ciphertext segments may also occur. Nonetheless, an analysis (Kasiski examination) of the common factors among all such distances is possible; the largest factor which occurs most commonly is the most likely keyword length. Repeated ciphertext segments of length 4 or longer are most useful, as coincidental repetitions are then less probable.

The number of letters in the keyword indicates the number of alphabets t in the polyalphabetic substitution. Ciphertext characters can then be partitioned into t sets, each of which is then the result of a mono-alphabetic substitution. Tiial values for t are confirmed if the frequency distribution of the (candidate) mono-alphabetic groups matches the frequency distribution of the plaintext language. For example, the profile for plaintext English (Figure 7.5) exhibits a long trough characterizing uvwxyz, followed by a spike at a, and preceded by the triple-peak of rst. The resulting mono-alphabetic portions can be solved individually, with additional information available by combining their solution (based on digrams, probable words, etc.). If the source language is unknown, comparing the frequency distribution of ciphertext characters to that of candidate languages may allow determination of the source language itself.

(v) Index of coincidence (vs. polyalphabetic substitution)

The index of coincidence (IC) is a measure of the relative frequency of letters in a cipher- text sample, which facilitates cryptanalysis of polyalphabetic ciphers by allowing determination of the period t (as an alternative to Kasiski’s method). For concreteness, consider a Vigenere cipher and assume natural language English plaintext.

Let the ciphertext alphabet be {ao, ab... , an_i}, and let p, be the unknown probability that an arbitrarily chosen character in a random ciphertext is a,. The measure of roughness measures the deviation of ciphertext characters from a flat frequency distribution as follows:

The minimum value is MRmjn = 0, corresponding to a flat distribution (for equiprobable а,,р, = 1 /n). The maximum value occurs when the frequency distribution of pi has greatest variability, corresponding to a mono-alphabetic substitution (the plaintext frequency distribution is then manifested). Define this maximum value MRmax = kp — 1/n, where к,, corresponds to ^ p,2 when pi are plaintext frequencies. For English as per Figure 7.5, the maximum value is MR = kp - l/n« 0.0658 - 0.0385 = 0.0273. (This varies with letter frequency estimates; np = 0.0667, yielding np - 1/n = 0.0282 is commonly cited, and is used in Table 7.1.) While MR cannot be computed directly from a ciphertext sample (since the period t is unknown, the mono-alphabetic substitutions cannot be separated), it may be estimated from the frequency distribution of ciphertext characters as follows.

Let ft denote the number of appearances of a, in ап I-character ciphertext sample (thus ft = L). The number of pairs of letters among these L is L(L - l)/2, of which /, (/, - l)/2 are the pair (a*, af) for any fixed character a,. Define IC as the probability that two characters arbitrarily chosen from the given ciphertext sample are equal:

Independent of this given ciphertext sample, the probability that two randomly chosen ciphertext characters are equal is p2. Tims (comparing word definitions) IC is an estimate of J^Pi2 ancl by equation (7.1), thereby an estimate of MR + 1/n. Moreover, IC can be directly computed from a ciphertext sample, allowing estimation of MR itself. Since MR varies from 0 to к,, — 1 /n, one expects IC to range from 1 jn (for polyalphabetic substitution with infinite period) to np (for mono-alphabetic substitution). More precisely, the following result may be estabhshed.

7.75 Fact For a polyalphabetic cipher of period t, E(IC) as given below is the expected value of the index of coincidence for a ciphertext string of length L, where n is the number of alphabet characters, кг = 1/n, and kp is given in Table 7.1:

• (p in kp is intended to denote a plaintext frequency distribution, while the r in кг denotes a distribution for random characters.) For Roman-alphabet languages, n = 26 implies nr = 0.03846; for the Russian Cyrillic alphabet, n = 30.
• 7.76 Example (estimating polyalphabetic period using IC) Tabulating the expected values for IC for periods t = 1.2,... using Equation (7.3) (which is essentially independent of L for large L and small t), and comparing this to that obtained from a particular ciphertext using Equation (7.2) allows a crude estimate of the period t of the cipher, e g., whether it is mono-alphabetic or polyalphabetic with small period. Candidate values t in the range thus determined may be tested for correctness by partitioning ciphertext characters into groups of letters separated by t ciphertext positions, and in one or more such groups, comparing the character frequency distribution to that of plaintext. □
 Language Kp French 0.0778 Spanish 0.0775 German 0.0762 Italian 0.0738 English 0.0667 Russian 0.0529

Table 7.1: Estimated roughness constant Kpfor various languages (see Fact 7.75).

A poly alphabetic period t may be determined either by Example 7.76 or the alternative of Example 7.77, based on the same underlying ideas. Once t is determined, the situation is as per after successful completion of the Kasiski method.

7.77 Example (determining period by ciphertext auto-correlation) Given a sample of polyal- phabetic ciphertext, the unknown period t may be determined by examining the number of coincidences when the ciphertext is auto-correlated. More specifically, given a ciphertext sample C1C2 .. .cl, starting with t = 1, count the total number of occurrences c * = ci+t for 1 < i < L — t. Repeat for t = 2,3,... and tabulate the counts (or plot a bar graph). The actual period t* is revealed as follows: for values t that are a multiple of t*, the counts will be noticeably higher (easily recognized as spikes on the bar graph). In fact, for L appropriately large, one expects approximately L ■ kp coincidences in this case, and significantly fewer hi other cases. □

In the auto-correlation method of coincidences of Example 7.77, the spikes on the bar graph reveal the period, independent of the source language. Once the period is determined, ciphertext characters from like alphabets can be grouped, and the profile of single-character letter frequencies among these, which differs for each language, may be used to determine the plaintext language.