Quantity of Information in Information Theory

In summary: entropy is used as a measure of how much information you would lose, on average, if you were to randomly delete a unit (symbol) of information from a system...
  • #1
pinsky
Hello there.

I'm a 4th-year computer engineering student just finishing an information theory class. It was nice, and I thought I had a grip on all the concepts...

But today I tried to explain the basics of information theory to my girlfriend, and I failed to do so. The sticking point was (not entropy, can you imagine that) the quantity of information.

I started with the basic example, a toss of a fair coin. All was clear: while the coin is spinning in the air, we don't know what will occur, and the entropy is one bit.

When the coin lands, we receive one bit of information. Everything is nice and clear; there is a link between the symbol and a quantity.

But then the unfair coin showed up, one with a higher chance of heads, and all was lost.

At first I claimed (as I was taught at school), "The amount of information gets higher as the probability of occurrence gets lower." After I said it a couple of times out loud, it just didn't feel right anymore.

When I flip an unfair coin (with a higher chance of heads) and tails comes up, why don't I, after the uncertainty is resolved, end up with 1 bit of information?

And then again, if I have a 0.01 chance of tails and tails occurs, I still end up with fewer bits of information than if the same event happened with probability 0.5.

Does anyone know the motivation behind this definition of the quantity of information?

Tnx
 
  • #2
That word "information" causes no end of problems... randomness is what is actually being measured. It's often helpful to think of the Shannon entropy value as how surprised you are, on average, by each new result.

With a fair coin you are quite surprised by either result, because there is no way (...let's just keep this simple...) to predict heads vs. tails. With an 80% unfair coin you are less surprised when the favored side comes up. And with a two-headed coin you are not at all surprised when it comes up heads. The entropy values for those three coins are 1, 0.72, and 0 bits. It's a convenient way to put a numeric value to your intuition, but it's also way more than convenient as you dig deeper...
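
To double-check those numbers, here is a minimal Python sketch (my own illustration, not something from the thread) that computes the Shannon entropy of a coin with a given heads probability:

[code]
import math

def coin_entropy(p_heads):
    """Shannon entropy in bits of a coin with P(heads) = p_heads."""
    entropy = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # the 0 * log(0) term is taken to be 0
            entropy -= p * math.log2(p)
    return entropy

for p in (0.5, 0.8, 1.0):
    print(f"P(heads) = {p:>4}: H = {coin_entropy(p):.2f} bits")
# P(heads) =  0.5: H = 1.00 bits
# P(heads) =  0.8: H = 0.72 bits
# P(heads) =  1.0: H = 0.00 bits
[/code]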

Here's a little quick-and-dirty description I did for that Stanford AI class -- it turned out they never really mentioned information theory, so it was all for naught... But it has a link to a demo spreadsheet that you can play with to get a feel for the ideas.
 
  • #3
Thanks for the reply.

I see now that I made a mistake in

"And then again, if I have a 0.01 chance of tails and tails occurs, I still end up with fewer bits of information than if the same event happened with probability 0.5."

where I was using the outcome's contribution to the entropy instead of its self-information.


However (back to your answer), it isn't the entropy that is bothering me, but the "self-information": the concrete amount we get after the uncertainty is gone and we know the value that occurred.

Let's use your definition and say that when I get the information, I am surprised by [itex]-\log_2(p_{heads})[/itex]
bits of information. How would you expand that way of thinking to explain how entropy is used to reduce redundancy before sending information through a channel?

There is also one thing I couldn't explain well in simple language or illustrate with examples.
If we view the words we speak as codes built from phonemes (which form the starting alphabet), why would a uniform distribution of phonemes mean we would have to talk less? :) Ignore anything concerning channel noise.
 
  • #4
Let's try it this way... If you have a uniform distribution of phonemes (or characters, for that matter), you are equally likely to receive any particular phoneme. This sets the maximum on what you can transmit through a communication channel -- and such a stream is basically white noise.

If your received distribution is NOT uniform, then you are getting some kind of message mixed into the noise -- there is some process that is perturbing the distribution and adding "order" to the set of phonemes. One particular organization might be the distribution of phonemes used in standard English, which is far from uniform.

Then, you might start looking at the "mutual information" between two events: How likely is it that phoneme "th" is followed by phoneme "e"? With a uniform distribution there should be no difference between that sequence and, e.g., "th" followed by "th". But in English "the" is _very_ common and "thth" is almost non-existent.

In both cases of non-uniform distribution your entropy value is smaller than it would be if the distribution were uniform, and this indicates that "something is going on in the channel". Shannon's insight was that the uniform distribution sets the bar for how much you can distinguish.
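
To make that concrete, here is a small Python sketch (my own toy example, using letters instead of phonemes and an arbitrary sample sentence) comparing the empirical letter entropy of a bit of English with the uniform-distribution maximum of log2(26) ≈ 4.70 bits:

[code]
import math
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

# Any English text would do; this is just a toy sample.
sample = "the quick brown fox jumps over the lazy dog and then the fox sleeps"
letters = [ch for ch in sample.lower() if ch.isalpha()]

print(f"empirical letter entropy: {entropy_bits(Counter(letters)):.2f} bits/letter")
print(f"uniform 26-letter bound:  {math.log2(26):.2f} bits/letter")
# The empirical value comes out below log2(26), reflecting the non-uniform
# letter frequencies of English text.
[/code]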

A way to think about the 0.01-chance-of-tails coin flip is: how much memory do I need to record a sequence of events? If the coin is two-headed, you only need one bit total. If it is fair, you need one bit per flip. If it's weighted, you can compress the results to something in between those two extremes, and the entropy value tells you how many bits per flip you will need on average.
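
Here is a rough Python sketch of that memory argument (my own illustration; it leans on zlib as an off-the-shelf compressor, which won't reach the theoretical bound but shows the trend): generate n flips of each coin and compare the compressed size with the entropy bound n·H(p).

[code]
import math
import random
import zlib

def bernoulli_entropy(p):
    """Entropy in bits per flip of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

n = 100_000
random.seed(0)

for p in (0.5, 0.8, 0.99):
    flips = "".join("H" if random.random() < p else "T" for _ in range(n))
    compressed_bits = len(zlib.compress(flips.encode(), 9)) * 8
    bound_bits = n * bernoulli_entropy(p)
    print(f"P(heads)={p}: entropy bound ~ {bound_bits:,.0f} bits, "
          f"zlib output ~ {compressed_bits:,} bits")
# A general-purpose compressor won't reach the entropy bound, but the trend
# matches: the more biased the coin, the fewer bits the record needs.
[/code]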
 
  • #5
Thanks again for the effort.

However, I'm still not clear on everything.
How about we look at words instead of phonemes? Let's say our basic alphabet is an English dictionary.

Now let's say that I heard the following sentence:

"This information theory is quite abstract to me"

Now let's consider two cases: one where the original dictionary has the standard non-uniform English word distribution, and a second one where the distribution is uniform.

So what happened is that I received the same message; only the starting distributions of the words were different.

Why is sending with the uniform distribution in any way easier or simpler than sending with the non-uniform distribution in this case?


As for the coin flipping.
I agree with you about the compression. But the uncompressed output of an unweighted coin still gives me 1 bit per flip. Could you perhaps link to, or show, an example of what the compressed output looks like, and explain why it wasn't possible to compress the uniformly distributed one?

OK, I agree it couldn't be compressed because its entropy is maximal, but at what stage of the compression process does lower entropy actually permit the compression?


tnx
 
  • #6
We may be veering into the territory of Meaning, which is not addressed by Information Theory. This is what gets folks jumbled up when trying to come to grips with it... But...

There is no such thing as a uniform distribution of English {words,syllables,chars,phonemes}. The non-uniform distribution is what makes it English versus Russian versus noise. With a uniform distribution there is no message to receive.

Take our coin examples. A totally fair coin has the highest entropy, and each flip really tells you nothing about the coin itself. With a weighted coin you find some "meaning" in the uneven results: it tells you something about the system. Because the entropy of the weighted coin is lower, you go looking for a cause. I guess you could say the same thing in reverse if we lived in a world where all coins were weighted and we found one that seemed to be fair, but we don't live in that world...

On the compression topic, the 80% coin is going to give you something like HHHHTHHHHT, and you might come up with some way to describe the runs of heads with a shorter sequence, maybe just the number of heads between each tail. This is what run-length image compression does (the old-fashioned CCITT 1-D fax compression) -- and, I think, ZIP encoding for files. It alternates counts of the number of pixels that are white and then black -- and further crunches them using Huffman codes. If each pixel of the image alternates white/black your compression goes to scrod, though. It depends on there being some non-uniform distribution of pixel values.
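
For what it's worth, here is a toy Python sketch of that run-length idea applied to the coin sequence (my own illustration, nothing more):

[code]
import random
from itertools import groupby

def run_length_encode(flips):
    """Encode a string of 'H'/'T' flips as (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(flips)]

random.seed(1)
flips = "".join("H" if random.random() < 0.8 else "T" for _ in range(30))

print(flips)
print(run_length_encode(flips))
# With an 80% heads coin the heads cluster into runs, so the list of
# (symbol, count) pairs is shorter than the raw sequence. For a fair coin
# most runs have length 1 and the encoding saves nothing.
[/code]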
 

Related to Quantity of Information in Information Theory

What is the "Quantity of Information" in Information Theory?

The "Quantity of Information" in Information Theory refers to the amount of information contained in a message or signal. It is measured in bits and represents the amount of uncertainty or surprise in the message.

How is the Quantity of Information calculated?

For equally likely outcomes, the Quantity of Information is calculated as the base-2 logarithm of the number of possible outcomes; more generally, an outcome with probability p carries -log2(p) bits. This means that the more (equally likely) outcomes a message can take, the higher the Quantity of Information will be.
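
As a worked example (tying back to the coins discussed above): a fair coin flip carries [itex]-\log_2(0.5) = 1[/itex] bit, while a tails outcome with probability 0.01 carries [itex]-\log_2(0.01) \approx 6.64[/itex] bits.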

What is the relationship between Probability and Quantity of Information?

There is an inverse relationship between Probability and Quantity of Information. This means that the more probable an event is, the lower the Quantity of Information will be, and vice versa.

Can Quantity of Information be negative?

No, Quantity of Information cannot be negative. It is always a non-negative value (zero for an outcome that is certain), as it represents the amount of uncertainty or surprise in a message.

What are some real-world applications of Quantity of Information?

Quantity of Information is used in various fields such as data compression, data storage, cryptography, and data transmission. It helps in determining the most efficient way to store and transmit information, as well as in designing secure communication systems.
