Homepage

Exploring Cryptography - Frequency Analysis pt.3

April 02, 2023

In part 2 I showed how you could crack a Caesar Cipher using frequency analysis alone. In this post I’m going to look at applying a similar technique to crack simple substitution ciphers, where the cipher alphabet is not known ahead of time.

A simple substitution cipher can be easily created using a pen and paper. First you write out each letter of the alphabet on a single line, then you shuffle the alphabet into a random order, and write the shuffled alphabet on the line below.

abcdefghijklmnopqrstuvwxyz
xahbjkystvqucmefzolwnpigrd

To encipher some text, you take the first character of the text, and find the corresponding character on the top line, and then write the character from the bottom line in the same position. You then repeat this process for the second character of the text, and so on. Using the example above, the text hello would be enciphered as sjuue. To decipher text, you perform the same process in reverse, starting by finding the character in the bottom line, and then writing the corresponding character from the top line in the same position. Note that you can use non alphabetical characters in the cipher text, as long as each character in the plain text alphabet corresponds to a single symbol, you can use anything you want - including emojis!

Simple substitution ciphers are a lot more secure than the Caesar Cipher, but they are still relatively easy to crack using frequency analysis. Whilst they are not very useful for serious cryptography, I think they are a good on-ramp to a lot of the core ideas in cryptography and cryptanalysis, specifically how underlying patterns in the encoding structure can be exploited to “break” the cipher.

The Cipher

To start with I’m going to create a simple cipher function that takes a string and a cipher alphabet and returns the encrypted string. The cipher alphabet is a permutation of the alphabet, so it’s a string of 26 unique characters. The cipher alphabet is generated by shuffling the alphabet, and then the first character of the cipher alphabet is used to replace the first character of the alphabet, the second character of the cipher alphabet is used to replace the second character of the alphabet, and so on.

export const encipher = (text: string, key: string) => {
  const alphabet = 'abcdefghijklmnopqrstuvwxyz'
  const textArray = text.split('')
  const cipherArray = textArray.map(char => {
    return key[alphabet.indexOf(char)]
  })
  return cipherArray.join('')
}

We can write a simple test to check that the cipher works as expected:

describe('Simple Substitution Cipher', () => {
  it('should encode a string', () => {
    // Classic Caesar Cipher (3)
    const key = 'defghijklmnopqrstuvwxyzabc'
    expect(encipher('hello', key)).toBe('khoor')
  })
})

Next I want to create a function that will decipher the text, using the same key. This is a simple function that takes the cipher text and the key and returns the original text and is really just an inversion of the encipher function.

export const decipher = (ciphertext: string, key: string) => {
  const alphabet = 'abcdefghijklmnopqrstuvwxyz'
  const cipherTextArray = ciphertext.split('')
  const plainArray = cipherTextArray.map(char => {
    return alphabet[key.indexOf(char)]
  })
  return plainArray.join('')
}

Again, I’m going to write some basic tests to check that the decipher function works as expected:

it('should decode a string', () => {
  const key = 'defghijklmnopqrstuvwxyzabc'
  expect(decipher('khoor', key)).toBe('hello')
})

it('encode then decode should result in the same value', () => {
  const key = 'defghijklmnopqrstuvwxyzabc'
  expect(decipher(encipher('hello', key), key)).toBe('hello')
})

Key Generation

One thing that I noticed straight away here is that the key feels very cumbersome to work with. It’s a string of 26 unique characters, which is hard to remember, and it’s also hard to type. Instead of having to define a whole 26 character key that is then used as the cipher alphabet, I’d like to be able to generate the cipher alphabet using a shorter key. To do this, I’m going to take advantage of a PRNG (pseudo random number generator) to generate a random permutation of the alphabet. I can then use a short phrase, word or number as the key, which will be the seed for the PRNG.

A PRNG is a function that takes a seed value and returns a sequence of numbers that appear random. The sequence of numbers that are returned by the PRNG are deterministic, so if you use the same seed value you will always get the same sequence of numbers. Anyone familiar with Minecraft will be familiar with the concept of a PRNG, as Minecraft uses a PRNG to generate the world from a known seed. When you start a new world in Minecraft you can enter a seed value, and the world will be generated using a PRNG started from that seed. This is convenient, because if you enter the same seed value you will always get the same world. Note that this is possible because Minecraft worlds are procedurally generated using a PRNG, and not because the world is stored in a database somewhere. For our use case, using a PRNG means that if you use the same seed value you will always get the same cipher alphabet and when we want to decipher the text we can use the same seed value to generate the same cipher alphabet for the deciphering process. The PRNG I’m going to use is the Mersenne Twister algorithm (provided by random-js, which is a very popular PRNG that is used in many programming languages. The Mersenne Twister is not cryptographically secure, but it is good enough for our purposes as we are not trying to generate secure keys, just keys that are easy to remember.

The function is pretty simple, it takes an input key, which is converted to a 32 bit integer, and then used as the seed for the PRNG. We then use the PRNG to generate random numbers, which we use to find a random character from the alphabet by taking the modulus of the random number and the length of the alphabet. We then add that character to the cipher alphabet if it is not already present (remember the cipher alphabet must be unique!). This process is repeated until the cipher alphabet is the same length as the alphabet, resulting in a random permutation of the alphabet.

const stringTo32BitInteger = (str: string): number => {
  let hash = 0
  for (let i = 0; i < str.length; i++) {
    const charCode = str.charCodeAt(i)
    hash = (hash << 5) - hash + charCode
    hash |= 0 // Convert to a 32-bit integer
  }
  return hash
}

export const generateCipherAlphabetFromKey = (key: string) => {
  const seed = stringTo32BitInteger(key)
  const twister = MersenneTwister19937.seed(seed)
  const alphabet = 'abcdefghijklmnopqrstuvwxyz'
  const alphabetArray = alphabet.split('')
  const cipherAlphabet: string[] = []
  while (cipherAlphabet.length < alphabetArray.length) {
    const int = Math.abs(twister.next())
    const index = int % alphabet.length
    const char = alphabetArray[index]
    if (!cipherAlphabet.includes(char)) {
      cipherAlphabet.push(char)
    }
  }
  return cipherAlphabet.join('')
}

I can now use this function to generate a cipher alphabet from a key, and then use that cipher alphabet to encipher and decipher text. Let’s plug in the new alphabet generator to the encipher and decipher functions:

export const encipher = (text: string, key: string) => {
  const alphabet = 'abcdefghijklmnopqrstuvwxyz'
  const cipherAlphabet = generateCipherAlphabetFromKey(key)
  const textArray = text.split('')
  const cipherArray = textArray.map(char => {
    return cipherAlphabet[alphabet.indexOf(char)]
  })
  return cipherArray.join('')
}

export const decipher = (ciphertext: string, key: string) => {
  const alphabet = 'abcdefghijklmnopqrstuvwxyz'
  const cipherAlphabet = generateCipherAlphabetFromKey(key)
  const cipherTextArray = ciphertext.split('')
  const plainArray = cipherTextArray.map(char => {
    return alphabet[cipherAlphabet.indexOf(char)]
  })
  return plainArray.join('')
}

I’ll also update the tests to use the new key functionality:

it('should encode a string', () => {
  const key = 'aardvark'
  expect(encipher('hello', key)).toBe('nbpph')
})

it('should decode a string', () => {
  const key = 'aardvark'
  expect(decipher('nbpph', key)).toBe('hello')
})

it('encode then decode should result in the same value', () => {
  const key = 'aardvark'
  expect(decipher(encipher('hello', key), key)).toBe('hello')
})

Nice! We now have a working simple substitution cipher that can use an arbitrary key to encrpyt and decrypt text. The next step is to see if we can apply frequency analysis to this cipher to break it 😈.

Cracking the Cipher

I’m going to start with the same base frequency data that I used previously:

const STANDARD_FREQUENCIES = {
  a: 0.0812,
  b: 0.0149,
  c: 0.0271,
  d: 0.0432,
  e: 0.1202,
  f: 0.023,
  g: 0.0203,
  ...
  y: 0.0211,
  z: 0.0007,
}

Then I’m going to create a crack function that will take a ciphertext and attempt to decode it using the following steps:

  1. Create a frequency map of the ciphertext
  2. For each character in the frequency map, find the character in the standard frequency map that has the closest frequency
  3. Create a cipher alphabet using the closest characters
  4. Use the cipher alphabet to decipher the ciphertext
export const crack = (input: string) => {
  const cipherText = input.toLowerCase().replace(/[^a-z]/g, '')

  // Create an object by iterating through the ALPHABET and setting each value to null
  const cipherMap = ALPHABET.split('').reduce((obj, char) => {
    obj[char] = null
    return obj
  }, {} as Record<string, null>)

  const cipherFrequency = frequencyAnalysis(cipherText)

  // Iterate through the STANDARD_FREQUENCIES and find the best match in the cipherFrequency
  while (Object.values(cipherMap).includes(null)) {
    for (let char in STANDARD_FREQUENCIES) {
      let bestMatch = null
      let bestScore = Infinity

      for (let cipherChar in cipherFrequency) {
        const score = Math.abs(
          (STANDARD_FREQUENCIES as any)[char] - cipherFrequency[cipherChar]
        )

        if (score < bestScore && cipherMap[cipherChar] === null) {
          bestScore = score
          bestMatch = cipherChar
        }
      }

      if (bestMatch) {
        ;(cipherMap as any)[bestMatch] = char
      }
    }
  }

  // Create the cipher alphabet
  const cipherAlphabet = Object.values(cipherMap).join('')

  // Decipher the cipher text
  return processText(cipherText, ALPHABET, cipherAlphabet)
}

This works spectacularly badly! Even with relatively large plaintext (4000+ characters), it’s only able to decipher a few characters correctly. I think the problem is that the single character frequency analysis is too fine grained. Most of the standard letter frequencies you find are compiled from very large sets of text, often over 50k words, and the frequencies are averaged out. In the smaller blocks of text I’m using, the frequencies can be quite different, and the single character frequency analysis is not able to find the correct matches. The variability of letter frequencies in small blocks of text is just too high for this approach to work. I need to find a better approach, one that can better exploit the patterns in the English language, and work with smaller and more varied blocks of text.

Quadgrams

The approach I’m going to try will use quadgrams. A quadgram (also known as a 4-gram) is a continuous sequence of four characters from a given text. In the context of cracking codes, they can be useful for finding patterns in text and measuring how accurate a deciphering attempt is By comparing the frequency distribution of quadgrams in the decrypted text to a known distribution of quadgrams in the target language (e.g., English), you can estimate how likely it is that the decrypted text is meaningful and correctly decrypted.

For example, in the text “hello world”, the quadgrams are:

  • “hell”
  • “ello”
  • “llo ”
  • “lo w”
  • “o wo”
  • ” wor”
  • “worl”
  • “orld”

By looking at lots and lots of English text, you can obtain frequency counts for quadgrams in English. You can then use this information to score a decrypted text based on how closely its quadgram frequencies match those of the English language. Essentially we’re going to use the same approach as before, but instead of comparing the frequency of individual characters, we’re going to compare the frequency of larger structures of text (quadgrams).

I’m going to be using the quadgrams from the practical cryptography website, which you can find here. I’ve downloaded the file and put it in the src directory. I’m going to use the fs module to read the file and create a map of quadgrams and their frequencies:

import { readFileSync } from 'fs'

const loadQuadgramData = (filename: string) => {
  const fileContent = readFileSync(filename, 'utf-8')
  const lines = fileContent.split('\n')
  const quadgramMap: Record<string, number> = {}

  for (const line of lines) {
    const [quadgram, frequency] = line.split(' ')
    quadgramMap[quadgram.toLowerCase()] = parseInt(frequency, 10)
  }

  return quadgramMap
}

const QUADGRAMS_FILE = './english_quadgrams.txt'
const QUADGRAMS = loadQuadgramData(QUADGRAMS_FILE)

I’m then going to repeatedly different combinations of the cipher alphabet, scoring them based on the quadgram frequencies, if the score is higher than the previous best score, I’ll keep the new score and cipher alphabet. This approach is sometimes called a “hill-climbing” algorithm, because it’s like climbing a hill, and you’re always trying to find the highest point. You retain results that move you “up the hill”, but discard results that move you “down the hill”. This is a very simple approach, but it works surprisingly well in practice, despite having some drawbacks like getting stuck in a local maximum.

const quadgramScore = (text: string) => {
  let score = 0
  const cleanedText = text.replace(/[^a-z]/g, '').toLowerCase()

  for (let i = 0; i < cleanedText.length - 3; i++) {
    const quadgram = cleanedText.slice(i, i + 4)
    const quadgramFrequency = QUADGRAMS[quadgram] || 0
    score += Math.log(quadgramFrequency + 1)
  }

  return score
}

export const crack = (cipherText: string) => {
  // Remove non-alphabetic characters and convert to lowercase
  const cleanedCipherText = cipherText.toLowerCase().replace(/[^a-z]/g, '')

  // Function to swap two characters in a string at positions i and j
  const swapPositions = (str: string, i: number, j: number) => {
    const chars = str.split('')
    ;[chars[i], chars[j]] = [chars[j], chars[i]]
    return chars.join('')
  }

  // Function to get a random integer between min and max (inclusive)
  const getRandomInt = (min: number, max: number) => {
    return Math.floor(Math.random() * (max - min + 1)) + min
  }

  // Set the maximum number of iterations for the hill-climbing algorithm
  const MAX_ITERATIONS = 10000

  // Generate a random initial alphabet as the starting point
  let bestAlphabet = ALPHABET.split('')
    .sort(() => Math.random() - 0.5)
    .join('')

  // Decrypt the text using the initial alphabet and calculate its quadgram score
  let bestText = processText(cleanedCipherText, ALPHABET, bestAlphabet)
  let bestScore = quadgramScore(bestText)

  // Iterate through the hill-climbing algorithm
  for (let iteration = 0; iteration < MAX_ITERATIONS; iteration++) {
    // Get two random positions in the alphabet
    const i = getRandomInt(0, 25)
    const j = getRandomInt(0, 25)

    // Swap characters at the random positions in the current best alphabet
    const newAlphabet = swapPositions(bestAlphabet, i, j)

    // Decrypt the text using the new alphabet and calculate its quadgram score
    const newText = processText(cleanedCipherText, ALPHABET, newAlphabet)
    const newScore = quadgramScore(newText)

    // If the new quadgram score is better, update the best alphabet, text, and score
    if (newScore > bestScore) {
      bestAlphabet = newAlphabet
      bestText = newText
      bestScore = newScore
    }
  }

  // Decipher the original cipher text using the best alphabet found
  return processText(cipherText, ALPHABET, bestAlphabet)
}

This performs remarkably well! When you use quadgrams in frequency analysis, you’re taking into account not only the frequency of individual letters but also the likelihood of specific letter combinations appearing together. This allows you to better exploit the patterns found in the English language and provides a more accurate representation of the language’s structure. My implementation here will occasionally fail, but it’s usually able to decipher the text correctly. We can increase the max iterations value to make it more reliable, but I’ve found that 10k is a good balance between speed and reliability (this version got about 85% accuracy in my testing). It’s also worth noting that this algorithm is not the fastest, taking about 30 seconds to run on my mid-range laptop. This is because it’s using a brute-force approach to find the best cipher alphabet. There are more efficient algorithms that can be used to find the best cipher alphabet, but they’re beyond the scope of this article.

Whilst you can use different n-grams, quadgrams are favored because they strike a balance between computational efficiency and accuracy. When you go from a quadgram to a 5-gram the number of possible combinations increases exponentially. For example, with a 26-character alphabet, there are 26^4 = 456,976 quadgrams, but there are 26^5 = 11,881,376 5-grams. Additionally, the frequency of 5-grams is much lower than that of quadgrams, so the scoring function will be less accurate.

If this article was interesting to you, I highly recommend that you play around with these ideas yourself:

  • try creating your own cipher functions
  • extend the code here to work with non alphabetic characters
  • try using different n-grams to score the decrypted text
  • try using a different algorithm to find the best cipher alphabet

If you have any questions or comments, please let me know in the comments below, or hit me up on Twitter @LucianBuzzo.

You can find the full source code for this post here.

Further reading


It's me, Lucian Buzzo, and this is my personal blog, where I write about things that interest me.
You can check out my GitHub profile here and my Twitter profile here.