Last week we worked with a variable-length code on an eight-letter alphabet. This week we'll practice with the Huffman algorithm, which produces an optimal prefix code given the probability of occurrence of each letter.
In Section 4.8 of KT the algorithm is laid out and justified. Our goal, given an alphabet with probabilities for each letter, is to find a tree such that the expected number of bits needed to encode a letter is minimized. Thus we want higher-probability letters to occur at higher levels and require shorter strings, and lower-probability letters to occur at lower levels and require longer strings.
The key lemma is that given any alphabet, there is an optimal tree where the two lowest-probability letters occur as sibling leaves. We can then recursively find the tree by taking these two letters, making a node with them as the two children, and recursing to the instance of the problem where the two letters are replaced by a single "virtual letter" whose probability is the sum of the two letters' probabilities. We then replace the new letter's leaf in the recursively obtained tree by the node with the two letters as children.
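The merging step can be sketched in Python. This is a minimal illustration rather than KT's own presentation: it runs the recursion bottom-up with a priority queue, and the function name huffman_codes, the tiebreak counter, and the four-letter example alphabet are all assumptions made for the example.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Map each letter to a bit string, given a dict of letter -> weight."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    # Each heap entry is (weight, tiebreak, {letter: code built so far}).
    heap = [(w, next(tiebreak), {letter: ""}) for letter, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)   # lowest weight
        w1, _, right = heapq.heappop(heap)  # second-lowest weight
        # The two become siblings under one "virtual letter" whose
        # weight is the sum of theirs; every code below gains one bit.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical four-letter alphabet, not one from these notes.
codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
lengths = {s: len(c) for s, c in codes.items()}
```

Which sibling gets 0 and which gets 1 is arbitrary, so the bit strings themselves may vary; in this example the code lengths (1, 2, 3, 3) do not. A one-letter alphabet would get the empty code, which a real encoder would need to handle separately.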
I'll walk through the process of making the eight-leaf tree for last week's alphabet, given the typical letter frequencies in English. Then we'll make a tree for the sixteen-letter alphabet obtained by replacing the eleven rarest English letters (the set {b, f, g, j, k, p, q, v, x, y, z}) with "x". This has the following letter frequencies:
a = 8.2% n = 6.7%
c = 2.8% o = 7.5%
d = 4.3% r = 6.0%
e = 12.7% s = 6.3%
h = 6.1% t = 9.1%
i = 7.0% u = 2.8%
l = 4.0% w = 2.4%
m = 2.4% x = 11.7%
The tree has:
noxodxexxectsthesxanishinxuisition
How does this compare with the bits needed to send the message with a fixed-bit code?
The 34-letter message contains 12 letters from {e, x, t} which cost 36 total bits, 19 letters from {a, o, i, n, s, h, r} which cost 76 total bits, and 3 letters from {d, l, c, u, m, w} which cost 15 total bits, for a total of 127. The fixed-length code with four bits per letter takes 136 bits, so we save 9 or around 7%. Last week we saved only 2 out of 102 fixed-length bits, for about 2%.
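The count above can be checked mechanically. The sketch below takes the three level sets exactly as stated in the previous paragraph (3, 4, and 5 bits) and adds up the cost of each letter of the message; the variable names are illustrative.

```python
message = "noxodxexxectsthesxanishinxuisition"

# Code length in bits for each group of letters, as listed above.
levels = {3: "ext", 4: "aoinshr", 5: "dlcumw"}
depth = {letter: d for d, letters in levels.items() for letter in letters}

huffman_bits = sum(depth[ch] for ch in message)  # variable-length cost
fixed_bits = 4 * len(message)                    # 16 letters -> 4 bits each
saving = fixed_bits - huffman_bits
print(huffman_bits, fixed_bits, saving)
```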
Then the tree's levels are:
Of course there are arbitrary choices to make when breaking ties -- they will
lead to different trees but to the same total number of bits sent.
The message contains 6 x's (12 bits), 13 letters from {i,s,n} (39 bits),
11 letters from {e,t,o,h} (44 bits), and 4 letters from {a,u,d,c} (20 bits),
for a total of 115 bits.
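The 115-bit total can be reproduced without drawing the tree. The sketch below relies on a standard observation: each merge pushes every leaf beneath it one level deeper, so the total message length is the sum of the weights of all merged nodes. The helper name huffman_total_bits is an assumption for this example.

```python
import heapq
from collections import Counter

def huffman_total_bits(weights):
    """Total encoded length when Huffman is run on the given letter counts."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # Merge the two lightest nodes; all letters under the new node
        # each gain one bit, which costs exactly the merged weight.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

message = "noxodxexxectsthesxanishinxuisition"
counts = Counter(message)  # letter frequencies taken from the message itself
total_bits = huffman_total_bits(counts.values())
print(total_bits)
```

Because every Huffman tree is optimal, the total does not depend on how ties are broken, even though the tree shape can.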
But if the receiver is to decode the message, they need to have the new Huffman tree, or at least have the frequency table (and rules to break ties) so that they can build the tree themself just as we did. How many bits do we need to allow them to do this? (Note that this process would make more sense for a much longer message.)
If we are agreed on the 16-letter alphabet, we could send the 16 frequency numbers in order -- if we send each as a six-bit integer we would need 96 bits. To send the prefix code we could send the 16 code words separated by commas, as in {01010,10111,10110,0000,1010,,,100,0100,,011,0001,01011,} where two commas with nothing between mean that the letter does not occur. We could send this message with two bits per bit or comma (or brace) for 116 bits.
If n is a power of two, the algorithm takes the n nodes (each of probability
1/n) and combines them in pairs to make n/2 nodes each of weight 2/n. Then it
combines these in pairs to make n/4 nodes of weight 4/n, and so on until it has
formed a complete (balanced) binary tree of depth log_2 n. The expected number
of bits to send a letter is log_2 n.
For the general case, suppose that n = 2^k + r where r is a
positive integer that is less than 2^k. The resulting tree will have
all 2^k of its possible nodes at level k -- r of them the parents of
two leaves and 2^k - r of them leaves themselves. It's pretty easy
to convince yourself that this will always happen, but a proof
takes some work.
Here's one way to do it. I claim that if I have 2^k letters,
and the probability of the heaviest letter is at most double that of the
lightest, I will get a balanced binary tree. This can be proved by induction --
it is trivially true for k=0, and we can reduce the k case to the k-1 case.
Suppose I have 2^k letters with weight between x and 2x. Every merged node has
weight at least 2x, which is at least the weight of any unmerged letter, so
(breaking ties in favor of unmerged letters) the Huffman algorithm pairs the
letters up into nodes that have weight between 2x and 4x before trying to pair
up the pairs. It thus gets to 2^(k-1) letters with weight between 2x and 4x,
which is the k-1 case.
Now look at what happens when n = 2^k + r. After we make r
pairs, we have 2^k nodes left, r with weight 2/n and the rest with
weight 1/n. The claim above then means that the tree will have the shape I
describe.
The tree has 2r leaves at level k+1 and 2^k - r leaves at level k.
So the expected number of bits to send a letter is:
[2r(k+1) + (2^k - r)k]/n = [k(2^k + r) + 2r]/n = k + (2r/n)
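The claimed shape can be checked by brute force for small n. This is only a sanity check, not a proof: the sketch below runs Huffman on n equal weights (weight 1 each instead of 1/n, which does not change the tree), records the depth of every leaf, and compares the result with 2r leaves at level k+1 and 2^k - r leaves at level k. The function names are made up for this example.

```python
import heapq
from collections import Counter

def uniform_leaf_depths(n):
    """Depths of the n leaves when Huffman is run on n equal weights."""
    depths = [0] * n
    # Heap entries are (weight, tiebreak id, leaves under this node).
    heap = [(1, i, [i]) for i in range(n)]
    heapq.heapify(heap)
    next_id = n
    while len(heap) > 1:
        w0, _, a = heapq.heappop(heap)
        w1, _, b = heapq.heappop(heap)
        for leaf in a + b:       # every leaf under the new node
            depths[leaf] += 1    # ends up one level deeper
        heapq.heappush(heap, (w0 + w1, next_id, a + b))
        next_id += 1
    return depths

def shape_matches(n):
    k = n.bit_length() - 1       # largest k with 2^k <= n
    r = n - 2**k                 # so n = 2^k + r with 0 <= r < 2^k
    depths = uniform_leaf_depths(n)
    shape = Counter(depths)
    return (shape[k] == 2**k - r
            and shape[k + 1] == 2 * r
            and sum(depths) == 2 * r * (k + 1) + (2**k - r) * k)

all_ok = all(shape_matches(n) for n in range(1, 65))
print(all_ok)
```

Dividing the total depth by n gives the expected number of bits per letter, so the same loop also checks the average for every n it covers.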
Last modified 4 October 2006