This week we revisit the optimal prefix code problem we solved in Discussion #4, but with a variation that forces us out of our former greedy algorithm and into dynamic programming. Once again we have an alphabet of letters, each with a probability, and we want to put the letters at the leaves of a binary tree so as to minimize the average depth of the tree. If each letter $a_i$ has depth $d(a_i)$ and probability $p_i$, then this average depth is defined to be $\sum_i d(a_i)\,p_i$, a weighted average with respect to the probabilities.
The variation is this -- we want the letters to occur on the leaves in order, so that we can encode letters using a binary tree as well as decode them. Thus $a_1$ will be the leftmost leaf of the tree and $a_n$ will be the rightmost leaf. This way we can decide which branch of the tree to take by comparing the target letter with a letter for each internal node.
In Discussion #3 we used the alphabet {a,e,i,n,o,s,t,x} and the Huffman tree formed for the probabilities {0.08, 0.13, 0.07, 0.07, 0.08, 0.06, 0.09, 0.42}. This tree had x at depth one, e at depth three, and the other six letters at depth four, for an average depth of 1(0.42) + 3(0.13) + 4(0.45) = 2.61. But we cannot arrange this tree to have the leaves in order -- in general the greedy algorithm gives us a tree where we cannot necessarily do this. But dynamic programming will work.
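The 2.61 figure can be reproduced with a short sketch of the greedy (Huffman) merge. The function name `huffman_cost` and the choice to scale the probabilities by 100 are assumptions made here for illustration; the sketch relies on the fact that each merge pushes every leaf beneath it one level deeper, so the weighted depth is the sum of all merged weights.

```python
import heapq

def huffman_cost(weights):
    """Return the weighted leaf depth of a Huffman tree for the given weights."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b               # every leaf under the new node sinks one level
        heapq.heappush(heap, a + b)
    return total

# Probabilities for a, e, i, n, o, s, t, x, scaled by 100 to stay in integers.
WEIGHTS = [8, 13, 7, 7, 8, 6, 9, 42]
print(huffman_cost(WEIGHTS) / 100)   # 2.61
```

The sketch reports only the cost; it does not try to place the leaves in alphabetical order, which is exactly what the greedy tree cannot guarantee.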
When we form a single tree out of two trees S and T, each node in S or T has
its depth increased by 1. The total average depth for the new tree is the sum
of the average depths of S and T, plus the effect of increasing each
depth by 1. Since a leaf of probability p has its contribution to the sum
increased by exactly p, the total average depth of the new tree is the sum
of the average depths of S and T plus the sum of all the probabilities of the
leaves. Let P(i,j) be the minimum total for any tree whose leaves are
$a_i, \ldots, a_j$ in order. The root of such a tree splits the leaves into
$a_i, \ldots, a_z$ and $a_{z+1}, \ldots, a_j$ for some z with $i \le z < j$,
so we need to compute this sum for each value of z and take the minimum.
That is:
$$P(i,j) = \min_{i \le z < j} \Big[ P(i,z) + P(z+1,j) + \sum_{k=i}^{j} p_k \Big] = \min_{i \le z < j} \Big[ P(i,z) + P(z+1,j) \Big] + \sum_{k=i}^{j} p_k$$
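The merging argument can be checked numerically on a small case. In this sketch a tree is either a leaf weight or a `(left, right)` pair; that representation, the helper names, and the two example trees S and T are all assumptions chosen for illustration, not part of the exercise.

```python
def leaves(tree, depth=0):
    """Yield (weight, depth) for each leaf of a nested-pair tree."""
    if isinstance(tree, tuple):
        left, right = tree
        yield from leaves(left, depth + 1)
        yield from leaves(right, depth + 1)
    else:
        yield tree, depth

def cost(tree):
    """Sum of weight * depth over all leaves."""
    return sum(w * d for w, d in leaves(tree))

def weight(tree):
    """Sum of all leaf weights."""
    return sum(w for w, _ in leaves(tree))

S = ((8, 13), 7)    # hypothetical tree over a, e, i
T = (7, (8, 6))     # hypothetical tree over n, o, s
merged = (S, T)     # new root with S on the left and T on the right

# Joining the trees adds each leaf's weight exactly once.
assert cost(merged) == cost(S) + cost(T) + weight(S) + weight(T)
print(cost(S), cost(T), cost(merged))   # 49 35 133
```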
We can compute each of these values in turn by dynamic programming as long
as we only compute an interval after all its subintervals have been computed.
Thus we first compute P(k,k) for each k, then P(k,k+1) for each k where it
makes sense, then P(k,k+2), and so on until we get P(1,n). There are
$\binom{n+1}{2} = O(n^2)$ subintervals, and each minimum calculation
takes O(n) time, so the total time to find P(1,n) is $O(n^3)$.
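One way to realize this order of computation is to loop over interval lengths, shortest first. The sketch below is a minimal version under some assumptions of its own: indices are 0-based (so P(1,n) is `P[0][n-1]`), the function name `ordered_code_table` is hypothetical, and a prefix-sum list supplies each interval's weight in constant time.

```python
def ordered_code_table(weights):
    """Fill P[i][j] (0-based, i <= j) by increasing interval length."""
    n = len(weights)
    # prefix[k] = weights[0] + ... + weights[k-1]
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    P = [[0] * n for _ in range(n)]          # single leaves cost 0
    for length in range(2, n + 1):           # shorter intervals are done first
        for i in range(n - length + 1):
            j = i + length - 1
            best = min(P[i][z] + P[z + 1][j] for z in range(i, j))
            P[i][j] = best + prefix[j + 1] - prefix[i]
    return P

# Probabilities for a, e, i, n, o, s, t, x, scaled by 100.
P = ordered_code_table([8, 13, 7, 7, 8, 6, 9, 42])
print(P[0][7])   # 265
```

The three nested ranges (length, start, split) mirror the $O(n^3)$ count in the text.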
As noted, we get P(i,i) = 0 for each i (since the single leaf of a one-leaf
tree occurs at depth 0) and $P(i,i+1) = p_i + p_{i+1}$ (since
both leaves of a two-leaf tree are at depth 1). We can thus start our table
(the first table below).
Here we have multiplied each probability by 100 so that it is an integer.
To find P(1,3), for example, we compare the sums P(1,1) + P(2,3) = 20 and
P(1,2) + P(3,3) = 21, and take our answer to be the minimum of the two plus
28 ($p_1 + p_2 + p_3$), which is 48. We compute
each of the other entries similarly, taking the minimum of P(i,z) + P(z+1,j)
for all z, then adding the sum of all the weights; the completed table is the
second table below.
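The same step for the last entry, P(1,8), can be sketched using values copied from row 1 and column 8 of the completed table. The variable names are assumptions for illustration only.

```python
# Entries copied from the completed table (probabilities scaled by 100).
row_1 = [0, 21, 48, 70, 100, 126, 165]     # P(1,z)   for z = 1..7
col_8 = [220, 167, 132, 102, 72, 51, 0]    # P(z+1,8) for z = 1..7
weight_sum = 100                           # p1 + ... + p8

# One candidate per split point z, before the weights are added.
candidates = {z: row_1[z - 1] + col_8[z - 1] for z in range(1, 8)}
best_z = min(candidates, key=candidates.get)
p_1_8 = candidates[best_z] + weight_sum

print(candidates)      # {1: 220, 2: 188, 3: 180, 4: 172, 5: 172, 6: 177, 7: 165}
print(best_z, p_1_8)   # 7 265
```

The best split is z = 7, which places the eighth letter alone on the right branch of the root.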
Thus P(1,8) = 265, so the optimal tree has average depth 2.65, a little worse
than the 2.61 that is possible without the leaves in order, but considerably
better than the 3.00 for the fully balanced tree. We map the letters in order
to the strings 0000 (a), 0001 (e), 0010 (i), 0011 (n), 0100 (o), 0101 (s),
011 (t), and 1 (x).
Starting table of P(i,j) (row i, column j), probabilities multiplied by 100:

 j =     1    2    3    4    5    6    7    8
---------------------------------------------
 i=1     0   21
   2          0   20
   3               0   14
   4                    0   15
   5                         0   14
   6                              0   15
   7                                   0   51
   8                                        0
Completed table of P(i,j) for the alphabet {a,e,i,n,o,s,t,x}:

 j =     1    2    3    4    5    6    7    8
---------------------------------------------
 i=1     0   21   48   70  100  126  165  265
   2          0   20   41   70   96  128  220
   3               0   14   36   56   88  167
   4                    0   15   35   60  132
   5                         0   14   37  102
   6                              0   15   72
   7                                   0   51
   8                                        0
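The codewords listed for this alphabet can be sanity-checked with a small sketch: it confirms that no codeword is a prefix of another, that alphabetical order matches left-to-right leaf order, and that the weighted length comes to 2.65. The helper names are hypothetical. The order check assumes that, for a prefix-free binary code, sorting the codewords as strings gives the leaves from left to right.

```python
from itertools import combinations

# Codewords as listed for {a,e,i,n,o,s,t,x}; probabilities scaled by 100.
CODE = {"a": "0000", "e": "0001", "i": "0010", "n": "0011",
        "o": "0100", "s": "0101", "t": "011", "x": "1"}
WEIGHT = {"a": 8, "e": 13, "i": 7, "n": 7, "o": 8, "s": 6, "t": 9, "x": 42}

def is_prefix_free(code):
    """True when no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(u.startswith(v) or v.startswith(u)
                   for u, v in combinations(words, 2))

def is_in_order(code):
    """True when alphabetical letter order matches left-to-right leaf order."""
    words = [code[letter] for letter in sorted(code)]
    return words == sorted(words)

def weighted_length(code, weight):
    """Sum of codeword length times weight over all letters."""
    return sum(len(code[letter]) * weight[letter] for letter in code)

print(is_prefix_free(CODE), is_in_order(CODE),
      weighted_length(CODE, WEIGHT) / 100)   # True True 2.65
```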
If you have time, repeat Exercise 2 for a slightly different problem -- the letters {a,e,i,n,o,s,t} have the same probabilities as before, but the eighth letter with probability 0.42 is q rather than x. You should be able to reuse some of your calculations.
The code maps a to 000, e to 001, i to 0100, n to 0101, o to 011, q to 10,
s to 110, and t to 111. The total score is 2.72, a more significant loss from
the 2.61 that could be achieved if the order of leaves did not matter.
Completed table of P(i,j) for the alphabet {a,e,i,n,o,q,s,t}:

 j =     1    2    3    4    5    6    7    8
---------------------------------------------
 i=1     0   21   48   70  100  185  239  272
   2          0   20   41   70  147  201  234
   3               0   14   36  100  154  187
   4                    0   15   72  126  159
   5                         0   50  104  130
   6                              0   48   72
   7                                   0   15
   8                                        0
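To read codewords off a table like this one, a program can record the best split for each interval while filling the table, then walk the splits from the full interval down to the leaves. The sketch below is one possible version: the function name `ordered_prefix_code`, the 0-based indices, and the convention of 0 for a left branch and 1 for a right branch are assumptions, not something the exercise prescribes.

```python
def ordered_prefix_code(letters, weights):
    """Return (cost, codes) for an optimal in-order prefix code."""
    n = len(weights)
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)
    P = [[0] * n for _ in range(n)]
    split = [[None] * n for _ in range(n)]    # best z per interval (0-based)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # min() keeps the first minimizer, so ties go to the leftmost split
            z = min(range(i, j), key=lambda z: P[i][z] + P[z + 1][j])
            split[i][j] = z
            P[i][j] = P[i][z] + P[z + 1][j] + prefix[j + 1] - prefix[i]

    codes = {}

    def assign(i, j, word):
        """Walk the recorded splits, appending 0 going left and 1 going right."""
        if i == j:
            codes[letters[i]] = word
            return
        z = split[i][j]
        assign(i, z, word + "0")
        assign(z + 1, j, word + "1")

    assign(0, n - 1, "")
    return P[0][n - 1], codes

# Alphabet in order with q (0.42) between o and s; weights scaled by 100.
cost, codes = ordered_prefix_code("aeinoqst", [8, 13, 7, 7, 8, 42, 6, 9])
print(cost)    # 272
print(codes)
```

For these weights the minimizing split is unique at every interval on the path from the root, so the tie-breaking rule does not affect the result.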
Last modified 26 October 2006