CMPSCI 311 Discussion #3: A Variable-Length Code

David Mix Barrington

25/27 September 2006

I have two problems for you this week as we begin the study of greedy algorithms. The first is an example in cryptography, setting the stage for our example next week of the greedy Huffman algorithm for finding an optimal prefix code. The second, if we have time, is a problem on undirected graphs that can be solved using a greedy algorithm.

The first ciphers you encounter are fixed-length codes, where each plaintext letter is encoded by a string of k ciphertext letters, for some fixed k. For example, ASCII as usually stored encodes each English letter, digit, or symbol by a string of eight bits (a seven-bit code padded to one byte). Here we will play with an example of a variable-length code, where a subset of the English letters is encoded as bit strings:


       a  =  0000
       e  =  011
       i  =  0011
       n  =  0100
       o  =  0010
       s  =  0101
       t  =  0001
       x  =  1

The set of strings {0000, 011, 0011, 0100, 0010, 0101, 0001, 1} is called a prefix code because no string in the set is a prefix of another string in the set. This property allows us to parse the ciphertext into code words by a greedy algorithm.
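
The prefix property can be checked mechanically. The following is a minimal Python sketch; the function name `is_prefix_code` is made up for illustration, and it simply compares every ordered pair of words.

```python
from itertools import permutations

def is_prefix_code(words):
    """Return True if no word is a prefix of a different word in the list."""
    # permutations(words, 2) yields every ordered pair (u, v) of distinct entries.
    return all(not v.startswith(u) for u, v in permutations(words, 2))

HANDOUT_WORDS = ["0000", "011", "0011", "0100", "0010", "0101", "0001", "1"]

print(is_prefix_code(HANDOUT_WORDS))   # True
print(is_prefix_code(["0", "01"]))     # False: "0" is a prefix of "01"
```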

  1. Decode the following 100-bit message. Note that there are no spaces in the quotation, and that the plaintext will not necessarily be readable English (see Question 3):

    
         01000 01010 01011 01111 01110
         00101 01000 11011 01011 00000
         10000 11010 11001 10100 11001
         10101 00110 00100 11001 00100
    

    noxoxxexxextstxesxanisxinxxisition
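
    The greedy parse can be sketched in a few lines of Python. This is one possible way to do it, not necessarily how you did it by hand; the names `CODE`, `MESSAGE`, and `decode` are chosen here for illustration. It reads bits until the bits read so far form a code word, emits that letter, and starts over.

```python
# Code table from the handout; "x" stands for any letter outside the seven.
CODE = {
    "a": "0000", "e": "011", "i": "0011", "n": "0100",
    "o": "0010", "s": "0101", "t": "0001", "x": "1",
}

MESSAGE = """
    01000 01010 01011 01111 01110
    00101 01000 11011 01011 00000
    10000 11010 11001 10100 11001
    10101 00110 00100 11001 00100
"""

def decode(bits, code=CODE):
    """Greedily split `bits` into code words and return the decoded letters."""
    word_to_letter = {word: letter for letter, word in code.items()}
    letters = []
    current = ""
    for bit in bits:
        if bit not in "01":
            continue  # skip the spaces and line breaks used only for layout
        current += bit
        if current in word_to_letter:
            # The prefix property guarantees no longer word starts this way.
            letters.append(word_to_letter[current])
            current = ""
    if current:
        raise ValueError("message ends in the middle of a code word: " + current)
    return "".join(letters)

print(decode(MESSAGE))  # noxoxxexxextstxesxanisxinxxisition
```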

  2. Describe the decoding algorithm you used in terms of a finite-state machine with output. Such a machine is like the DFA of CMPSCI 250, except that each transition (each arrow in the diagram) is labeled with an output symbol, or the empty string, as well as the input letter that causes it to be taken.

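    The states are the proper prefixes of the code words, with the empty string λ as the start state. They may be arranged in a binary tree, where each leaf has two edges to λ and each node with one child also has an edge to λ. Any prefix code leads to a finite-state machine of this type. When every state has a transition on both 0 and 1, as it does here, the number of states is one less than the number of words in the prefix code.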
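
    As a rough illustration, the machine can be built directly from the code table. The sketch below assumes the states are named by the proper prefixes of the code words, with the empty string playing the role of λ; `build_machine` and `run` are names invented for this example.

```python
CODE = {
    "a": "0000", "e": "011", "i": "0011", "n": "0100",
    "o": "0010", "s": "0101", "t": "0001", "x": "1",
}

def build_machine(code):
    """Return {state: {bit: (next_state, output)}} for a prefix code."""
    word_to_letter = {word: letter for letter, word in code.items()}
    # Every proper prefix of a code word is a state; "" is the start state.
    states = {word[:i] for word in code.values() for i in range(len(word))}
    machine = {}
    for state in states:
        machine[state] = {}
        for bit in "01":
            target = state + bit
            if target in word_to_letter:
                # Code word complete: output its letter and return to the start.
                machine[state][bit] = ("", word_to_letter[target])
            elif target in states:
                # Still inside a code word: move on with empty output.
                machine[state][bit] = (target, "")
            # Otherwise no code word continues this way: no transition.
    return machine

def run(machine, bits):
    """Feed a string of 0s and 1s to the machine and collect its output."""
    state, out = "", []
    for bit in bits:
        state, symbol = machine[state][bit]
        out.append(symbol)
    return "".join(out)

machine = build_machine(CODE)
print(sorted(machine, key=lambda s: (len(s), s)))
# ['', '0', '00', '01', '000', '001', '010']
print(run(machine, "01000010"))  # no
```

    For this code the machine has seven states for eight code words.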

  3. Although this message uses only the seven most common letters in English plus "x", you may still be able to figure out what it says because of the redundancy in English. You need to guess where the word breaks come, and which low-frequency letter is intended by each "x". What does it say?

    The intended message was "Nobody expects the Spanish Inquisition", a quote from Monty Python. Some of you noticed that the second word could be "excepts" or "effects" and still be a valid translation, but neither of these words makes as much sense in context as "expects".

  4. If we encoded this same message (in the eight-letter alphabet) using a fixed-length code, how many bits would we need?

    The natural way to encode the eight letters would be to pick a 3-bit string for each of the eight letters. Then we would need 102 bits for the 34 letters of the message, more than the 100 we used in the variable-length code. If we can choose our code so that more commonly-used letters are encoded by shorter strings, we can generally save ciphertext bits relative to the fixed-length code. I chose this code based on the typical letter frequencies in English (see, for example, the Wikipedia article "letter frequencies"). We didn't save very much, but that was because my message happened to have a lot of the letters that were encoded in four bits. Next week (and in KT section 4.8), we'll see a method for finding the most efficient possible variable-length code for a given set of letter frequencies.
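
    The two bit counts can be reproduced with a short calculation. The sketch below is only a check of the arithmetic; the variable names are chosen for illustration.

```python
from collections import Counter

CODE = {
    "a": "0000", "e": "011", "i": "0011", "n": "0100",
    "o": "0010", "s": "0101", "t": "0001", "x": "1",
}

plaintext = "noxoxxexxextstxesxanisxinxxisition"
counts = Counter(plaintext)

# Variable-length cost: each letter costs the length of its code word.
variable_bits = sum(counts[letter] * len(CODE[letter]) for letter in counts)
# Fixed-length cost: 3 bits suffice to distinguish 8 letters.
fixed_bits = 3 * len(plaintext)

print(counts["x"], counts["e"])   # 11 3
print(variable_bits, fixed_bits)  # 100 102
```

    The 20 letters that cost four bits each account for 80 of the 100 bits, which is why the saving is small.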

  5. (Second problem, if time allows, see KT problem 4.29) Any undirected graph with n vertices has a degree sequence, a sequence of n numbers where the i'th number is the degree of vertex v_i. (Remember that undirected graphs may not have loops or parallel edges.) A degree sequence of length n must have entries in the range from 0 through n-1, and must sum to an even number. Is every sequence with these two properties the degree sequence of some graph?

    No. The simplest example is 220, with n=3. It is not possible for a graph with three vertices to have two vertices with degree 2 and one with degree 0, because each of the degree-2 vertices must have an edge to every vertex except itself. Three-vertex graphs exist with sequences 222, 211, 110, and 000, and of course with any sequence that is a permutation of one of these.
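
    For n=3 there are only eight graphs, so the claim can be checked by brute force. The Python sketch below enumerates every graph on n vertices and compares the sequences that occur against those that merely pass the two conditions; the function names are invented for this example.

```python
from itertools import combinations, combinations_with_replacement

def realizable_sequences(n):
    """Degree sequences (sorted, largest first) of all simple graphs on n vertices."""
    pairs = list(combinations(range(n), 2))
    found = set()
    for mask in range(2 ** len(pairs)):  # each bit of mask selects one edge
        degree = [0] * n
        for k, (u, v) in enumerate(pairs):
            if (mask >> k) & 1:
                degree[u] += 1
                degree[v] += 1
        found.add(tuple(sorted(degree, reverse=True)))
    return found

def passes_conditions(seq):
    """Entries between 0 and n-1, and an even sum."""
    n = len(seq)
    return all(0 <= d <= n - 1 for d in seq) and sum(seq) % 2 == 0

n = 3
realizable = realizable_sequences(n)
candidates = {
    tuple(sorted(c, reverse=True))
    for c in combinations_with_replacement(range(n), n)
    if passes_conditions(c)
}

print(sorted(realizable, reverse=True))
# [(2, 2, 2), (2, 1, 1), (1, 1, 0), (0, 0, 0)]
print(sorted(candidates - realizable, reverse=True))
# [(2, 2, 0), (2, 0, 0)]
```

    The second line of output shows that 200 also passes both conditions without having a graph, for much the same reason as 220.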

    Describe an algorithm that inputs a degree sequence and outputs some graph with that sequence, if any such graph exists.

    Make a vertex with the largest degree d given by the sequence, and connect it to vertices with the next d largest degrees. Then take the degree sequence of length n-1 given by removing the largest entry and subtracting one from the next d largest entries. (If this is not possible, because fewer than d entries remain or one of them is already 0, the algorithm fails to construct a graph.) Recursively apply the algorithm to this new sequence. Add one new vertex to the graph returned by the recursive call, with d edges to vertices of that graph as specified. The base of the recursion is a sequence of all zeros, for which we return a graph with the right number of vertices and no edges.

    For example, given the sequence 42211, we first make a vertex a with edges to all four other vertices, then use the sequence 1100 to connect the other four vertices to each other. We make a vertex b with an edge to a vertex c, and make a recursive call with the sequence 000. So we get edges (a,b), (a,c), (a,d), (a,e), and (b,c) and this graph has sequence 42211.
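
    The construction can be sketched in Python as follows. This version uses a loop in place of the recursion, numbers the vertices 0 through n-1 in the order the degrees are given, and breaks ties in favor of the lower-numbered vertex; those choices, and the name `build_graph`, are assumptions of this sketch.

```python
def build_graph(degrees):
    """Greedy construction: return a list of edges (u, v),
    or None if the greedy procedure gets stuck."""
    # remaining[v] is the number of edges vertex v still needs.
    remaining = {v: d for v, d in enumerate(degrees)}
    edges = []
    while remaining:
        # Take a vertex with the largest remaining degree (lowest index on ties).
        u = max(remaining, key=lambda v: (remaining[v], -v))
        d = remaining.pop(u)
        # Its neighbors are the d vertices with the next largest degrees.
        others = sorted(remaining, key=lambda v: (-remaining[v], v))
        if d > len(others):
            return None  # not enough vertices left to connect to
        for v in others[:d]:
            if remaining[v] == 0:
                return None  # would need an edge to a vertex that wants no more
            remaining[v] -= 1
            edges.append((u, v))
    return edges

print(build_graph([4, 2, 2, 1, 1]))
# [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
print(build_graph([2, 2, 0]))
# None
```

    With vertices 0 through 4 read as a through e, the first result is the graph from the example above.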

    Try to argue that your algorithm is always correct.

    I don't have a good answer for this, though it is a known theorem of Havel (1955) and Hakimi (1962) that this algorithm finds a graph if any graph exists. (MathWorld has a reference.) Apparently KT intended this to be a difficult exercise -- I will add a proof here when I get a chance.

    If a sequence has an n-1 or a 0 in it, we know that the only way to represent it as a graph is to have a vertex connected to all others, or to no others, respectively. But if the vertex we remove has degree strictly between 0 and n-1, there are many different ways to pick the vertices that it has edges to. Apparently the greedy choice of the highest-degree vertices succeeds if any choice succeeds, but this needs a proof to justify the correctness of the greedy algorithm. (Of course if the algorithm succeeds, then the sequence has a graph, but we have not ruled out the possibility that the greedy choice leads to a dead end when some other choice would have succeeded.)
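
    Short of a proof, the worry about dead ends can at least be tested on small cases. The Python sketch below compares the greedy choice against trying every possible set of neighbors for the removed vertex, over all sequences of length up to 5. It is an experiment, not an argument, and the function names are invented for this example.

```python
from itertools import combinations, product

def greedy_succeeds(degrees):
    """Run the greedy reduction on the sequence alone, without building a graph."""
    seq = sorted(degrees, reverse=True)
    while seq and seq[0] > 0:
        d = seq.pop(0)
        if d > len(seq) or seq[d - 1] == 0:
            return False  # fewer than d positive entries remain
        for i in range(d):
            seq[i] -= 1
        seq.sort(reverse=True)
    return True

def some_choice_succeeds(degrees):
    """Same reduction, but try every set of d neighbors, not just the largest."""
    seq = sorted(degrees, reverse=True)
    if not seq or seq[0] == 0:
        return True
    d, rest = seq[0], seq[1:]
    for chosen in combinations(range(len(rest)), d):
        if all(rest[i] > 0 for i in chosen):
            reduced = [x - 1 if i in chosen else x for i, x in enumerate(rest)]
            if some_choice_succeeds(reduced):
                return True
    return False

def greedy_agrees(n):
    """Check every sequence of length n with entries from 0 through n-1."""
    return all(
        greedy_succeeds(seq) == some_choice_succeeds(seq)
        for seq in product(range(n), repeat=n)
    )

print(all(greedy_agrees(n) for n in range(1, 6)))  # True
```

    The exhaustive version is a correct test for the existence of a graph, since any graph with the sequence gives a valid neighbor set for the removed vertex, so agreement here means the greedy choice never hits an avoidable dead end for n up to 5.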

Last modified 28 September 2006