CMPSCI 311 Discussion #4: Huffman Coding

David Mix Barrington

2/4 October 2006

Last week we worked with a variable-length code on an eight-letter alphabet. This week we'll practice with the Huffman algorithm, which produces an optimal prefix code given the probability of occurrence of each letter.

In Section 4.8 of KT the algorithm is laid out and justified. Our goal, given an alphabet with probabilities for each letter, is to find a tree such that the expected number of bits needed to encode a letter is minimized. Thus we want higher-probability letters to occur at higher levels and require shorter strings, and lower-probability letters to occur at lower levels and require longer strings.

The key lemma is that given any alphabet, there is an optimal tree where the two lowest-probability letters occur as sibling leaves. We can then recursively find the tree by taking these two letters, making a node with them as the two children, and recursing to the instance of the problem where the two letters are replaced by a single "virtual letter" whose probability is the sum of the two letters' probabilities. We then replace the new letter's leaf in the recursively obtained tree by the node with the two letters as children.
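
The merging step described above can be sketched in Python. This is a minimal illustration, not code from KT: it uses a priority queue and a loop rather than literal recursion, and the function name `huffman_code`, the tie-breaking counter, and the four-letter example alphabet are all assumptions made for the example.

```python
import heapq
from itertools import count

def huffman_code(weights):
    """Return a dict mapping each letter to its code word (a string of 0s and 1s)."""
    tiebreak = count()  # unique counter so equal weights never compare the trees
    # Heap entries are (weight, tiebreak, tree); a tree is a letter or a (left, right) pair.
    heap = [(w, next(tiebreak), letter) for letter, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # lowest weight
        w2, _, t2 = heapq.heappop(heap)   # second lowest weight
        # The "virtual letter": a node whose weight is the sum of its two children.
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))
    _, _, root = heap[0]

    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # a one-letter alphabet still needs one bit
    walk(root, "")
    return codes

# Hypothetical four-letter alphabet, chosen only to keep the example small.
codes = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(codes)
```

With these example weights the two rarest letters, c and d, end up as sibling leaves at the deepest level, as the lemma predicts.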

I'll walk through the process of making the eight-leaf tree for last week's alphabet, given the typical letter frequencies in English. Then we'll make a tree for the sixteen-letter alphabet obtained by replacing the eleven rarest English letters (the set {b, f, g, j, k, p, q, v, x, y, z}) with "x". This has the following letter frequencies:

        a =  8.2%           n =  6.7%
        c =  2.8%           o =  7.5%
        d =  4.3%           r =  6.0%
        e = 12.7%           s =  6.3%
        h =  6.1%           t =  9.1%
        i =  7.0%           u =  2.8%
        l =  4.0%           w =  2.4%
        m =  2.4%           x = 11.7%
  1. Compute the optimal prefix code tree, using the Huffman algorithm, for this alphabet and this frequency distribution.

    The tree has:

    • Root: node 15
    • Level 1: nodes 14 and 13
    • Level 2: nodes 12, 11, 10, and 9
    • Level 3: nodes 8, 7, 6, e (011), 5, x (101), 4, and t (111)
    • Level 4: nodes 3, a (0001), o (0010), i (0011), n (0100), s (0101), h (1000), r (1001), 2, and 1
    • Level 5: nodes d (00000), l (00001), c (11000), u (11001), m (11010), and w (11011)
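
    The level at which each letter lands can be checked mechanically. The sketch below is only an illustration: the helper name `code_lengths` is made up for this example, and the frequencies are entered in tenths of a percent so that all arithmetic is on integers.

```python
import heapq

# Frequencies from the table above, in tenths of a percent.
freq = {"a": 82, "c": 28, "d": 43, "e": 127, "h": 61, "i": 70, "l": 40, "m": 24,
        "n": 67, "o": 75, "r": 60, "s": 63, "t": 91, "u": 28, "w": 24, "x": 117}

def code_lengths(weights):
    """Run Huffman merging and return each letter's depth in the final tree."""
    depth = {letter: 0 for letter in weights}
    # Heap entries are (weight, list of letters under this node).
    heap = [(w, [letter]) for letter, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, g1 = heapq.heappop(heap)
        w2, g2 = heapq.heappop(heap)
        for letter in g1 + g2:
            depth[letter] += 1   # every letter under the new node moves one level down
        heapq.heappush(heap, (w1 + w2, g1 + g2))
    return depth

lengths = code_lengths(freq)
by_length = {}
for letter, d in sorted(lengths.items()):
    by_length.setdefault(d, []).append(letter)
for d in sorted(by_length):
    print(d, by_length[d])

# Expected bits per letter under this distribution.
average = sum(freq[l] * lengths[l] for l in freq) / 1000
print(average)
```

    The only ties in this table are between m and w and between c and u, and each of those pairs is merged with itself, so the levels do not depend on how ties are broken.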
  2. How many total bits are needed to send last week's message using this code? With the new alphabet, the message becomes slightly more readable:
           noxodxexxectsthesxanishinxuisition
    

    How does this compare with the bits needed to send the message with a fixed-length code?

    The 34-letter message contains 12 letters from {e, x, t} which cost 36 total bits, 19 letters from {a, o, i, n, s, h, r} which cost 76 total bits, and 3 letters from {d, l, c, u, m, w} which cost 15 total bits, for a total of 127. The fixed-length code with four bits per letter takes 136 bits, so we save 9 bits, or around 7%. Last week we saved only 2 out of 102 fixed-length bits, for about 2%.
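
    The same count can be reproduced in a few lines of Python. This is just a sketch of the arithmetic: the code-word lengths are read off the tree levels in the answer to question 1, and the variable names are made up for the example.

```python
from collections import Counter

message = "noxodxexxectsthesxanishinxuisition"

# Code-word lengths read off the tree levels in the answer to question 1.
length = {}
for letters, bits in (("ext", 3), ("aoinshr", 4), ("dlcumw", 5)):
    for letter in letters:
        length[letter] = bits

counts = Counter(message)
huffman_bits = sum(counts[letter] * length[letter] for letter in counts)
fixed_bits = 4 * len(message)   # a 16-letter alphabet needs 4 bits per letter

print(huffman_bits, fixed_bits, fixed_bits - huffman_bits)  # prints: 127 136 9
```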

  3. A common way to use Huffman coding is to scan the message to be sent, determine its actual letter frequencies instead of an estimate, and use those to make the Huffman tree. Do this for the message above. How many bits do you need to send for the message itself?

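    The message's letter counts are x = 6, i = 5, n = 4, s = 4, e = 3, o = 3, t = 3, h = 2, and a = c = d = u = 1. The resulting tree's levels are: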

    • Root: node 11
    • Level 1: nodes 10 and 9
    • Level 2: nodes 8, 7, 6, and x (11)
    • Level 3: nodes 5, i (001), 4, s (011), n (100), and 3
    • Level 4: nodes e (0000), t (0001), o (0100), 2, h (1010), and 1
    • Level 5: nodes a (01010), u (01011), d (10110), and c (10111)

    Of course there are arbitrary choices to make in breaking ties -- they will lead to different trees but to the same total number of bits sent.

    The message contains 6 x's (12 bits), 13 letters from {i,s,n} (39 bits), 11 letters from {e,t,o,h} (44 bits), and 4 letters from {a,u,d,c} (20 bits), for a total of 115 bits.
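
    The total can also be computed without drawing the tree. Each merge pushes every letter under the new node one level deeper, so the message length in bits is the sum of the weights of all the merged nodes. The sketch below uses that observation; the variable names are made up for the example.

```python
import heapq
from collections import Counter

message = "noxodxexxectsthesxanishinxuisition"
counts = Counter(message)

heap = list(counts.values())
heapq.heapify(heap)
total_bits = 0
while len(heap) > 1:
    merged = heapq.heappop(heap) + heapq.heappop(heap)
    total_bits += merged    # one extra bit for every letter occurrence under this node
    heapq.heappush(heap, merged)

print(total_bits)  # prints: 115
```

    Only the weights enter this computation, not the identity of the nodes being merged, which is one way to see that breaking ties differently cannot change the total.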

    But if the receiver is to decode the message, they need to have the new Huffman tree, or at least have the frequency table (and rules to break ties) so that they can build the tree themselves just as we did. How many bits do we need to allow them to do this? (Note that this process would make more sense for a much longer message.)

    If we are agreed on the 16-letter alphabet, we could send the 16 frequency numbers in order -- if we send each as a six-bit integer we would need 96 bits. To send the prefix code itself, we could send the 16 code words in alphabetical order separated by commas, as in {01010,10111,10110,0000,1010,001,,,100,0100,,011,0001,01011,,11}, where two commas with nothing between them mean that the letter does not occur. This string has 64 symbols (47 code bits, 15 commas, and 2 braces), so at two bits per symbol we could send it with 128 bits.

  4. Explain carefully how the Huffman algorithm creates a tree for an n-letter alphabet where all letters have equal probability. First handle the easier case where n is a power of two, then the case for general n. What is the expected number of bits needed to send a letter, as a function of n?

    If n is a power of two, the algorithm takes the n nodes (each of probability 1/n) and combines them in pairs to make n/2 nodes each of weight 2/n. Then it combines these in pairs to make n/4 nodes of weight 4/n, and so on until it has formed a complete (balanced) binary tree of depth log n, with the logarithm taken base 2. The expected number of bits to send a letter is therefore log n.

    For the general case, suppose that n = 2^k + r where r is a positive integer that is less than 2^k. The resulting tree will have all 2^k of its possible nodes at level k -- r of them the parents of two leaves and 2^k - r of them leaves themselves. It's pretty easy to convince yourself that this will always happen, but a proof takes some work.

    Here's one way to do it. I claim that if I have 2^k letters, and the probability of the heaviest letter is at most double that of the lightest, I will get a balanced binary tree (provided ties in weight are broken in favor of the nodes nearer the leaves). This can be proved by induction -- it is trivially true for k=0, and we can reduce the k case to the k-1 case. Suppose I have 2^k letters with weights between x and 2x. Any pair of them forms a node of weight at least 2x, which is no lighter than any remaining single letter, so the Huffman algorithm pairs up all the letters before trying to pair up the pairs. It thus gets to 2^(k-1) nodes with weights between 2x and 4x, which is the k-1 case.
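
    The claim is easy to test on random inputs. The sketch below is an experiment, not a proof: the function name `leaf_depths`, the seed, and the choices k = 4 and x = 10 are arbitrary, and the rule that ties in weight go to the subtree of smaller height is an assumption built into the heap keys.

```python
import heapq
import random

def leaf_depths(weights):
    """Huffman merging that breaks weight ties in favor of shallower subtrees."""
    depth = [0] * len(weights)
    # Heap entries: (weight, subtree height, smallest leaf index, leaf indices under the node).
    heap = [(w, 0, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, h1, i1, g1 = heapq.heappop(heap)
        w2, h2, i2, g2 = heapq.heappop(heap)
        for i in g1 + g2:
            depth[i] += 1   # every leaf under the new node moves one level down
        heapq.heappush(heap, (w1 + w2, max(h1, h2) + 1, min(i1, i2), g1 + g2))
    return depth

random.seed(311)
k, x = 4, 10
weights = [random.randint(x, 2 * x) for _ in range(2 ** k)]   # all between x and 2x
depths = leaf_depths(weights)
print(set(depths))  # prints: {4}
```

    Every leaf ends up at level k, so the tree is balanced. Without the tie-breaking rule the tree can come out unbalanced (try weights 1, 1, 2, 2), although the expected number of bits is the same either way.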

    Now look at what happens when n = 2^k + r. After we make r pairs, we have 2^k nodes left, r with weight 2/n and the rest with weight 1/n. The claim above then means that the tree will have the shape I describe.

    The tree has 2r leaves at level k+1 and 2^k - r leaves at level k. So the expected number of bits to send a letter is:

    [2r(k+1) + (2^k - r)k]/n = [k(2^k + r) + 2r]/n = k + 2r/n
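
    As a sanity check, the bracketed expression [2r(k+1) + (2^k - r)k]/n can be compared with what the Huffman algorithm actually produces for small n. The sketch below is an illustration only: the function name `average_bits` and the range of n tested are arbitrary, and r = 0 is allowed so that the powers of two are covered as well.

```python
import heapq
from fractions import Fraction

def average_bits(n):
    """Expected code length when Huffman is run on n equally likely letters."""
    heap = [1] * n               # work with counts; divide by n at the end
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged          # each merge adds one bit to every letter below it
        heapq.heappush(heap, merged)
    return Fraction(total, n)

for n in range(2, 40):
    k = n.bit_length() - 1       # largest k with 2^k <= n
    r = n - 2 ** k               # so n = 2^k + r with 0 <= r < 2^k
    assert average_bits(n) == Fraction(2 * r * (k + 1) + (2 ** k - r) * k, n)

print(average_bits(6))  # prints: 8/3
```

    For n = 6 we have k = 2 and r = 2, so four leaves sit at level 3 and two at level 2, giving (12 + 4)/6 = 8/3 bits per letter.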

Last modified 4 October 2006