Tutorial [Algorithm] Huffman Coding filter_list Author Message
[Algorithm] Huffman Coding #1
Huffman coding (though EXTREMELY old) is used in a lot of compression techniques today (including zip). It is a way to take an n-bit value and turn it into a p-bit value (where p is usually < n) and vice versa.

So, how do we do it?

Here is the algorithm:
1. Create frequency table
2. Order frequency table in ascending order
3. Insert each element of the table into a node
4. Insert each node in ascending order into a list
5. Combine the first two nodes of the list into one node (binary tree) until only one node remains
6. Create a bit table based on the single node
7. Encode the sample text

So, let's walk you through it.

Create frequency table

So, assume we have the following text that we wish to compress:
Code:
`why not compress this shit`
The first thing we need to do is determine which letters are in the text:
Code:
`w, h, y, n, o, t, c, m, p, r, e, s, i, space`
and create a frequency table with all of our letters and their frequencies set to 0.
Code:
```w 0 h 0 y 0 n 0 o 0 t 0 c 0 m 0 p 0 r 0 e 0 s 0 i 0 space 0```
Next, count up the occurances of each letter in the text and set the value in the table
Code:
```w 1 h 3 y 1 n 1 o 2 t 3 c 1 m 1 p 1 r 1 e 1 s 4 i 2 space 4```
and then organize the table in ascending order

Order the table in ascending order

Code:
```w 1 y 1 n 1 c 1 m 1 p 1 r 1 e 1 o 2 i 2 h 3 t 3 s 4 space 4```

Insert each element into a node

Insert each node into an ascending list

As you can see by the image above, they are already in that form

Combine the first two nodes of the list into one node (binary tree) until only one node

Alright, I assume you understand what a binary tree is, and how they work (this one is a binary SEARCH tree).

I made images for every step (there are 16 steps in this example). For the most part, lines in red are new (one time I redlined the items I moved in a reorder step).

NOTE: reordering is not needed, it just makes it look cleaner. Just remember, the lesser value MUST ALWAYS be on the left. Without reordering expect to have some curved lines (especially in a bigger sample).

[

and finally, we have our huffman tree:

Create a bit table based on the single node

This is the easiest part. Each line has a 1-bit value. Lines going left are always 0, lines going right are always 1. I find it easier if you actually put these on your drawing (if doing it by hand).

Next step is to go through each character and figure out what its value is. To do this, as we go down, prepend the previous node's bit value to the node in question. That wording can be somewhat confusing, so I made the whole thing as another image.

that leaves us with the following table:

Code:
```o - 000 i - 001 h - 010 t - 011 w - 10000 y - 10001 n - 10010 c - 10011 m - 10100 p - 10101 r - 10110 e - 10111 s - 110 space - 111```

Note, these do not follow any specific counting order. They are created in this way so that when we go to decode it, every part of a string is unique. I will get to that in decoding.

Encode the sample text

Now that we have the table, lets encode our sample text. For those who don't want to scroll up, here it is:

Code:
`why not compress this shit`
which in ASCII encoding is the following:
Code:
`77 68 79 20 6e 6f 74 20 63 6f 6d 70 72 65 73 73 20 74 68 69 73 20 73 68 69 74`
meaning in binary it is
Code:
`01110111 01101000 01111001 00100000 01101110 01101111 01110100 00100000 01100011 01101111 01101101 01110000 01110010 01100101 01110011 01110011 00100000 01110100 01101000 01101001 01110011 00100011 01101000 01101001 01110100`
If you don't know hex or binary, you might want to brush up on that. Anyways, that is 208 bits to encode that message using UTF-7 (ASCII)
Now, lets encode it with our huffman table:
Code:
`10000 010 10001 111 10010 000 011 111 10011 000 10100 10101 10110 10111 110 110 111 011 010 001 110 111 110 010 001 011`
and into 8-bit format:
Code:
`10000010 10001111 10010000 01111110 01100010 10010101 10110101 11110110 11101101 00011101 11110010 001011`
now, the *true* bit count for this is 94 bits. which means we saved 114 bits in compressing this for a compression ratio of 45% (55% deflated).

In reality, we cannot have a 6-bit byte, sadly. So we will want to store the encoded data as:
Code:
`10000010 10001111 10010000 01111110 01100010 10010101 10110101 11110110 11101101 00011101 11110010 00101100`
which makes a mess, because now we need to store the actual length of the data. I will use an 8-bit value at the beginning of the file to tell how many characters are in the decoded text.
Code:
`00011010 10000010 10001111 10010000 01111110 01100010 10010101 10110101 11110110 11101101 00011101 11110010 00101100`
This makes decoding possible, since we now know that once we hit 0x1A (26) characters, stop decoding and ignore those last two bits. The final bit count is 104, meaning we saved 104 bits for a real compression ratio of 50% (50% deflated).
The final hexadecimal value of the encoded text is:
Code:
`1A 82 8F 90 7E 62 95 B5 F6 ED 1D F2 2C`
meaning we went from 26 bytes to 13 bytes.

Decoding

Decoding is a little annoying if you are writing code for this, but if doing it by hand its pretty simple. Go through the bit string one bit at a time until the bit string you have matches a character in your table, that is the character for that set of bits, start over from the bit you are at until you have decoded the whole string.

Problems

The method described in this tutorial assumes you know the bit table. Programs like zip do not generate a new table every time they compress something (which is why zip compression sucks). If you chose to get better compression, you would also need to store the table in the compressed file. Depending on how large the original file was, this could actually make the compressed file LARGER than it was before (that's why zipping a zip does you no good). Of course, if your file is 8GB, then you will DEFINITELY get good compression by generating a new table, even if you have to store that table in your compressed file.

Comment, questions, or concerns?

If you want to see a tutorial, writeup, breakdown, etc of something feel free to shoot me a PM any time.

Side note, @Oni would you PLEASE install this plugin...

Edit: nevermind, I see you have one already.

RE: [Algorithm] Huffman Coding #2
What does the '*' in the binary tree represent? I'm just wondering how you represent the two different chars as one. (Probably a stupid question.)

EDIT: Loved the tutorial. Most have taken quite a while to make all the images.

RE: [Algorithm] Huffman Coding #3
This is a very nice tutorial, I'm also glad you mentioned that depending on the file size, this method of compression could actually make it larger than it was before (if it was pretty small to begin with). I loved working with Huffman trees and this inspired me to go ahead and rewrite a working version of this. Good stuff.

RE: [Algorithm] Huffman Coding #4
We have a fit-to-page plugin. It just takes a second or two to display.

Discord: Oni#3781
Skype: oni_sl
XMPP: oniaraara@xmpp.jp