Information Science: Binary Coding

Information Science: Binary Coding

by Robin Stewart
May 2001

Binary coding represents the constant compromise over efficiency versus convenience and reliability. There are lots of clever ways to pack up code into nice efficient pieces that take less disk space and less bandwidth to transmit. The downsides of this, however, are that it’s harder to decode "packed-up" data and it provides little or no allowance for errors in transmission. Data can also be arranged in a way that’s very simple and easy to decode, or transmitted with varying degrees of redundancy for error detection and correction.

First of all, binary codes can be thought of as a series of questions in a branching "tree" diagram. Probably the best example of this is the guessing game of the numbers 1 through 10. (In general terms this is a set of members.) Let’s say I’m thinking of the number 7. First you ask, "Is the number > 5?" Yes, so you know the number’s code starts with a ‘1’. The next question might be, "Is it >8?" No, so the next digit is ‘0’. You next ask, "is it = 8?" Nope, put another ‘0’. Finally, "is it 7?" Yes, put a ‘1’, and we’re done: the code for "7" is: 1001. If we always use the same questions (we both know what the questions are for each branch-off point), then if I give you the code (i.e. 1001) you can always determine the number. That’s obviously very important for any kind of meaningful data transfer, and such code is called instantaneous.

Now, what if I had chosen the number 8? The answer would have been "yes" on the third question and you’d have been finished with the code: 101. What about the fourth digit? Well, who needs it? As long as the receiver of the information knows that 101 is all we need for "8", they just go on to the next code and everything’s okay. 1010 and 1011 do not exist, cannot exist. The reason is that if you take a string of this code, the receiver has to know when to start each number &endash; i.e., if it sees "101", the next digit (0 or 1) is assumed to be the start of the next number in the string.

The number of bits needed to define a piece of information (i.e. "7") in a set (i.e. {1,2,3,4,5,6,7,8,9,10}) is called the information per member, denoted H(s). It is basically an average. "101" is a member with 3 bits of information; "1001" is a member with 4 bits of information; presumably, the information per member necessary for the set is somewhere between 3 and 4. If all of the numbers are equally likely to have been chosen, and there are n numbers, then since the information is in base 2, the information per member should theoretically be log₂n (because the power of 2 corresponds to the number of base 2 digits). Log₂(10) = 3.322 &endash; the minimum possible information per member needed to completely define a 10-member set. However, if you set up a branching tree diagram for the set, there are 6 members with 3 bits of information and 4 with 4 bits. Based on this, the average information per member is ((6*3)+(4*4))/10 = 3.4 &endash; the actual minimum amount of information per member, taking into account the limits of the binary system.

The branching tree diagram of questions that we dealt with above was set up based on the idea that all numbers are created equal &endash; are equally likely to be chosen. But consider a situation like a 7-year-old who usually chooses her age when asked for a number between 1 and 10. It would make sense to distort the tree and first ask "Is the number 7?" If it is, we’re done and the code is simply "1". The code for every other number starts with "0". If you set up the tree just right, you would want each member to have an information content (number of bits) determined by its probability of occurring.

In the special case where the probabilities halve for each succeeding member (e.g. 1/2, 1/4, 1/8, 1/16, 1/16) the best tree to use is one where no branch branches again. This tree, with resulting codes, looks something like this:

_________________ \ \ \ \ 0000 1 01 001 0001

Specifically, the relation between probabilities and "information content" is inverse for the most efficient coding (the higher the probability, the fewer the number of bits). The average information per member in a set of members with varying probabilities is called the entropy function (H(S)). This, like the "information per member" referred to above, is the minimum possible number of digits needed per member. If p_r is the probability of member r of the set occurring, and n is the number of members in the set, then:

H(S) = S ⁿ_r=1 p_rlog₂(1/p_r)

which simplifies to: - S ⁿ_r=1 p_rlog₂(p_r)

However, again due to the limits of the binary system, it’s not always possible to set up a code where the average length of codes equals H(S). The average length of codename is:

L = S ⁿ_r=1 p_rl_r (probability of occurring*length of codename)

The efficiency of a code, then, is defined as the entropy of the source (the best you could ever do) over the average length of code (what you really have): H(S)/L. Using the example illustrated above, H(S) = 1.78125. The average length of code is 1*1/2 + 2*1/4 + 3*1/8 + 4*1/8 = 1.875. So the efficiency of that coding scheme is 1.78125/1.875 = 95%. It’s impossible to have an efficiency over 100%.

Okay, time for a little calculus! Using the special case above, p_r=(1/2)^r. So,

H(S) = S ⁿ_r=1 (1/2)^rlog₂(2^r) = S ⁿ_r=1 (1/2)^r*r = S ⁿ_r=1 r/2^r

If we use an infinite number of terms, n --> infinity, the common ratio is (1/2), and ‘r’ on top is negligible compared to 2^r, so the sum of the series is 1/(1-1/2) = 2. (I’m not totally sure that reasoning is legitimate, but my calculator verifies the answer.) The last term in this series would be infinitely long, but the probability of it occurring would be infinitely small &endash; so the net effect is that the average information per member can never be greater than 2 for this special case. The average length of codename will be:

L = S ⁿ_r=1 p_rl_r = S ⁿ_r=1 r/2^r

Which is exactly the same as the expression for the average information per member. What that means is that as the number of terms --> infinity, the efficiency --> 100%. At infinity, the limitations of the binary system are overcome. Pretty cool!

Unfortunately, most codes do not conform to that special case, and if we had a very large list of members, designing a custom code by guess-and-check would be very tedious. Fortunately, there are methods (that can be automated on computers) that design codes systematically. The most common are the Fano and Huffman methods. These can be used to design a code for the English language based on the frequencies of each letter. I won’t re-explain how the methods work here.

Such variable-length codes are basically the best efficiency you can achieve &endash; the most compression you can manage without losing actual data. But as I mentioned at the beginning, it’s somewhat difficult to code and decode. A much simpler method gives each member of a set an equal length code &endash; this is called block code. It’s usually less efficient, but very simple and easy to deal with. Instead of the complicated code discussed earlier, in a 4-bit system "5" can just be represented by the binary "5", 0101. A good example of block code is the ASCII code still used as the standard for transmitting text. It is 7-bit, which allows 2⁷ = 128 characters (it includes lowercase, uppercase, symbols, other special characters, and commands).

Page 2 of 2 -->