Why does information need probability?

In summary, information theory quantifies the information carried by a message in terms of how improbable that message is: rare (low-probability) messages carry more information than common ones, and the average information content of a source, its entropy, sets the minimum average number of bits needed to encode its messages.
  • #1
fonz
I have only recently started studying information theory and I am stuck even trying to understand the basic problem and concept of information theory. My question is why does information need a probability?

I have read some text that starts by introducing the idea that the amount of information is determined by how many states a system can have. This is intuitive, but it then goes on to say that the information content is more precisely related to the probability of occurrence of a particular state. This doesn't make any sense to me and I cannot see why probability is introduced at all.
 
  • #2
Which of the following is more interesting, and why?
  • Dog bites man
  • Man bites dog
That is how it was explained to me, at the most basic level.
 
  • #3
Some states occur more frequently than others, so to say a system is in a particular state implies a kind of probability of occurrence.

In the man-bites-dog example, we know it’s more common for a dog to bite a man. However, there have been a few documented cases of men biting dogs.

https://en.m.wikipedia.org/wiki/Man_bites_dog_(journalism)

In the journalism example, you might think it’s more common than it is because news tends to magnify and report on uncommon events.
 
  • #4
jedishrfu said:
In the man dog example, we know it’s more common for a dog to bite a man. However there have been a few documented cases of men biting dogs.

In the journalism example, you might think it’s more common than it is because news tends to magnify and report on uncommon events.
It's not meant to be too deep (nobody mentioned journalists at the time!), just rare occurrence -> low probability -> quite interesting -> high information content. It was a good enough "explanation" to keep me quiet at the time ;)
 
  • #5
fonz said:
My question is why does information need a probability?
Suppose that you have a stream of bits. Now, suppose that the bit is always 1, so there is no probability of a zero. Then there is no information in the stream of bits since there is no information in a string of 1’s. You know, in advance, what the stream will say, and your friend cannot send you any information with a stream of 1’s. That is why probability enters in. To encode information there must be some nonzero probability that the bits of a message contain 0s as well as 1s.
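To put a number on that, here is a minimal Python sketch of the average information per bit as a function of ##P(1)##: it is zero for an all-1's stream and largest when 0 and 1 are equally likely.

```python
import math

def bits_per_symbol(p_one):
    """Average information (Shannon entropy, in bits) of a single bit that
    is 1 with probability p_one and 0 otherwise."""
    if p_one in (0.0, 1.0):
        return 0.0  # a perfectly predictable bit carries no information
    p_zero = 1.0 - p_one
    return -(p_one * math.log2(p_one) + p_zero * math.log2(p_zero))

for p in (1.0, 0.99, 0.9, 0.5):
    print(f"P(1) = {p}: {bits_per_symbol(p):.3f} bits per bit")
# P(1) = 1.0 gives 0.000; P(1) = 0.5 gives the maximum, 1.000.
```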
 
  • #6
Dale said:
Suppose that you have a stream of bits. Now, suppose that the bit is always 1, so there is no probability of a zero. Then there is no information in the stream of bits since there is no information in a string of 1’s. You know, in advance, what the stream will say, and your friend cannot send you any information with a stream of 1’s. That is why probability enters in. To encode information there must be some nonzero probability that the bits of a message contain 0s as well as 1s.
I struggle with this. A computer byte, made of eight bits, contains the same amount of information if it is 11111111 as if it is 00101110. Both have the same probability of occurring from a series of eight random flips, which is ##2^{-8}##.

Also, if we measure the probability of a string as the proportion of zeroes in it, then the string that is 0101010101... has 50% probability and thus ostensibly contains a lot of information, even though it is almost as compressible as a string that is all ones.

My understanding is that the second byte (00101110) may be regarded as containing more information because it is less compressible than the first (11111111). But that doesn't seem to have anything to do with probability.

I would be intrigued to learn what it is in the referenced text that says probability is relevant, especially since it says it is 'precisely' related, rather than being a high-level illustrative notion.
 
  • #7
andrewkirk said:
A computer byte, made of eight bits, contains the same amount of information if it is 11111111 as if it is 00101110. Both have the same probability of occurring from a series of eight random flips, which is ##2^{-8}##.
That is not necessarily true. If P(1)=0.5 then it is true. Otherwise, the information content is not the same.
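As a concrete illustration: if the bits are independent with, say, ##P(1) = 0.9##, then the information carried by a specific byte with ##k## ones is ##-k\log_2(0.9) - (8-k)\log_2(0.1)##, which is about 1.2 bits for 11111111 (##k = 8##) but about 13.9 bits for 00101110 (##k = 4##). The all-1's byte is exactly what this source tends to produce, so it tells you very little.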
 
  • #8
andrewkirk said:
I struggle with this. A computer byte, made of eight bits, contains the same amount of information if it is 11111111 as if it is 00101110.
But the discussion is about an arbitrarily long stream of bits, or if you like, an arbitrary stream of bytes. If each and every bit is 1, no information is being conveyed. In some contexts, the bit string 11111111 would be informative, but if every byte contains the same bit pattern, there's no information.
As an analogy, if you ask somebody a string of questions, and each of his answers is "yes," even to questions whose answer should be "no," then you are in effect getting zero information from this person.
 
  • #9
fonz said:
I have only recently started studying information theory and I am stuck even trying to understand the basic problem and concept of information theory. My question is why does information need a probability?

An increase in information is a decrease in uncertainty. Uncertainty is quantified using probability. In this conception, uncertainty is quantified as the entropy.

Uncertainty can also be quantified as the number of possibilities. If there are more possibilities, there is more uncertainty.

The two notions of uncertainty, as entropy and as the number of possibilities, can be shown to match up well in the limit of a large number of samples (the "asymptotic equipartition property").
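As a quick sanity check: for ##N## equally likely possibilities, the entropy reduces to ##\log_2 N##, so the two measures of uncertainty agree exactly in the uniform case. A small sketch:

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 8
print(entropy_bits([1 / N] * N))  # 3.0 bits
print(math.log2(N))               # 3.0 bits: the log of the number of possibilities
```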
 
  • #10
fonz said:
My question is why does information need a probability?
I think the replies so far imply a noiseless communication channel. Probability was introduced in my communication classes for real communication channels, which always contain noise of some type.

Which textbook are you using for your class? This is one of the ones that I used in undergrad...

https://images-na.ssl-images-amazon.com/images/I/51TTXRE1BVL._SX376_BO1,204,203,200_.jpg

 

  • #11
Dale said:
Suppose that you have a stream of bits. Now, suppose that the bit is always 1, so there is no probability of a zero. Then there is no information in the stream of bits since there is no information in a string of 1’s. You know, in advance, what the stream will say, and your friend cannot send you any information with a stream of 1’s. That is why probability enters in. To encode information there must be some nonzero probability that the bits of a message contain 0s as well as 1s.
Yep. This is how I intuited it in my once-youthful brain.

Information is kind of the opposite of predictability.
If you know what the next number is, then you are not getting any new information. (A useful applied principle in such things as data compression).
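One way to see that in practice, as a rough illustration with Python's standard zlib: a perfectly predictable stream compresses down to almost nothing, while an unpredictable one barely compresses at all.

```python
import os
import zlib

predictable = b"\xff" * 1000       # the "all ones" stream: fully predictable
unpredictable = os.urandom(1000)   # (pseudo)random bytes: unpredictable

print(len(zlib.compress(predictable)))    # just a dozen or so bytes
print(len(zlib.compress(unpredictable)))  # close to (or slightly above) 1000 bytes
```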
 
  • #12
Let me make up an example that illustrates the idea behind Shannon's definition of information.

Suppose you have four possible messages that might come from your daughter away in college:
  1. Everything's fine.
  2. Send more money.
  3. I had to skip class because I'm sick.
  4. I met a cute boy in class
Let's suppose that half of her messages are #1, one-fourth are #2, and one-eighth are #3 and one-eighth are #4.

Now, suppose she wants to send her messages in code: just dots and dashes (she got the cheapest phone plan, which only allows communication in Morse code). What is the most efficient way for her to send you a message?

Obviously, she could use Morse code, which gives dots and dashes for every letter of the alphabet. On the average, that would mean maybe six signals per letter, or about 600 or so for the whole message. That's obviously an inefficient way to do it, if she only ever sends 4 messages. A better way would be:

dot = message 1
dash dot = message 2
dash dash dot = message 3
dash dash dash = message 4

On the average, the number of signals per message would then be computed as follows:
  • Half the time, she just needs 1 signal, a dot.
  • 1/4 the time, she needs 2 signals, a dash dot.
  • 1/4 of the time (1/8 for message 3 and 1/8 for message 4), she needs 3 signals, a dash dash dot or a dash dash dash.
On the average, it takes 7/4 signals per message. That's the best she can possibly do. Any other way of coding her messages as dots and dashes would, on the average, require more signals per message.

So we can define the "average information content" of her messages to be 1.75 bits.

In the general case, the best we can do in coding messages as bits (or dots and dashes) is given by:

##S \geq \sum_j p_j \log\left(\frac{1}{p_j}\right)##

where ##p_j## is the probability of message number ##j##, and ##log## means logarithm, base 2, and where ##S## is the average number of bits required to convey a message. The best possible code uses a number of bits that is equal to the log of the reciprocal of the probability of a message. (Of course, this number might not be an integer, so you have to round up. That's why you get ##\geq## instead of ##=##.)
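As a quick numerical check: the entropy of the distribution (1/2, 1/4, 1/8, 1/8) is exactly 1.75 bits, which matches the 7/4 signals per message achieved by the code above.

```python
import math

probs = [1/2, 1/4, 1/8, 1/8]   # probabilities of messages 1-4
lengths = [1, 2, 3, 3]         # signals in dot, dash dot, dash dash dot, dash dash dash

entropy = sum(p * math.log2(1 / p) for p in probs)
average_length = sum(p * n for p, n in zip(probs, lengths))

print(entropy)         # 1.75 bits
print(average_length)  # 1.75 signals per message: this code meets the bound
```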
 
  • #13
stevendaryl said:
Let me make up an example that illustrates the idea behind Shannon's definition of information.

[...] A better way would be:

dot = message 1
dash dot = message 2
dash dash dot = message 3
dash dash dash = message 4

On the average, it takes 7/4 signals per message. That's the best she can possibly do. Any other way of coding her messages as dots and dashes would, on the average, require more signals per message.

How about:
m1 = dot
m2 = dash
m3 = dot dot
m4 = dot dash

That way, 3/4 of the messages are 1 signal pulse, and 1/4 are 2 signal pulses. If the assessed (billed) cost is based on number of pulses, with short (dot) pulses being of the same cost as long (dash) pulses, the average cost per 4 messages would be 5 signal pulses, or 1.25 cost units per message.

The information content includes the facts of the start and end of the message.

For this discussion, I think we can safely oversimplify the actual workings of a telegraph station: assuming that the line is always connected and is considered idle unless a pulse is detected, and that a pulse is simply a short or long variation in resistance, the full information content of a single dot or dash message may be viewed as 1 message bit, plus the preceding and ensuing line-idle duration information -- if a pulse arrives a long time after its most recent predecessor, it's the start of a new message, and the most recent predecessor pulse was the end of the preceding message.

1. pulse detected or not detected (incoming message after long-duration line-idle or line still idle)
2. detected pulse of short duration or of long duration (dot or dash)
3. end of message detected / new pulse detected (long-duration line-idle or new pulse detected)

Informationally, the message length is the number of pulses plus half of the line-idle time at each end of the message that is used to distinguish each message from its predecessor and successor (the first and last messages are special cases) but in our scenario, only the pulses are assessed as part of the billable cost of the message. Temporally, dashes take longer than dots, but again, their cost in this assessment situation is reckoned the same.

Using actual Morse code, the daughter could send the letters E (.), T (-), A (.-), and N (-.) to encode her 4 states of affairs. If she were charged by the letter, instead of by the pulse, she could use any single letter for any of 26 states of affairs, and if numerals cost the same as letters, 36 different states could be encoded in a 1-character message.

The original International Teleprinter code (invented by Emile Baudot in 1870, perhaps taking a hint from the numerals in Morse code) was 5 bits (with a start/stop bit preceding/following each character). That 5-bit size was predicated on the recognition that 2^5 = 32, which is enough for the letters of the alphabet, spaces, a few punctuation marks, and basic carriage control. Later codes were based on the recognition that 5 bits isn't really enough, and still later, some special characters started requiring 2 8-bit bytes.
 

  • #14
sysprog said:
How about:
m1 = dot
m2 = dash
m3 = dot dot
m4 = dot dash

That wouldn't work as an alphabet, because if you see dot dot, you don't know if it's a single instance of m3 or two instances of m1.

It would work if you had a separator between messages, but that would sort of count as a different symbol. In real Morse Code, silence is the separator, I guess. But that's nitpicking. My point is that to get an optimal code, you have to take into account the rarity of messages, and give the more common messages shorter codes.
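To make the decodability point concrete, here is a small sketch: one sufficient condition for decoding a stream without separators is that the code be prefix-free, i.e. no codeword is a prefix of another. The dot / dash-dot / dash-dash-dot / dash-dash-dash code passes that test; the code built from dot, dash, dot dot, and dot dash does not.

```python
def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of another, so a stream of
    codewords can be decoded unambiguously without separators."""
    return not any(
        a != b and b.startswith(a)
        for a in codewords
        for b in codewords
    )

print(is_prefix_free([".", "-.", "--.", "---"]))  # True
print(is_prefix_free([".", "-", "..", ".-"]))     # False: "." is a prefix of ".." and ".-"
```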
 
  • #15
stevendaryl said:
sysprog said:
How about:
m1 = dot
m2 = dash
m3 = dot dot
m4 = dot dash
That wouldn't work as an alphabet, because if you see dot dot, you don't know if it's a single instance of m3 or two instances of m1.
That would be equally true of two dashes in succession, if you allowed a single dash or two dashes to be a message -- as I pictured the encoding, you would distinguish a single 2-dot message from two 1-dot messages by the duration of the interval in between the dots.
stevendaryl said:
It would work if you had a separator between messages, but that would sort of count as a different symbol.
I gave account of that in the mention of the long idle that signifies end of message. Your messages don't appear to use anything different from that to denote end of message.
stevendaryl said:
But that's nitpicking.
But isn't nitpicking part of the joy of analysis of hypothetical scenarios? :wink:
stevendaryl said:
My point is that to get an optimal code, you have to take into account the rarity of messages, and give the more common messages shorter codes.
Your objection that a dot dot message could be interpreted as 2 single-dot messages instead of as a single 2-dot message gives no account of how messages are terminated. You allow message 4 to end with a dash, so it can't be that a dot is always used for message termination. You use dash dot for message 2; why not just dash?

Apparently, your scenario is optimal if we are given the following additional rules:

1. A dot ends a message, though in some instances a message is instead ended by a dash.
2. A dash does not end a message of fewer than 3 signals.
3. A dash ends a message of length 3.

These additional rules appear to me to be arbitrary. The long-duration idle being used as the message terminator is in my opinion more reasonable, but it's just for the purpose of simplifying things for our scenario.

As I mentioned in my post, in real life, Morse code signalling is more complex. A dot is the basic time unit.

From Recommendation ITU-R M.1677-1 (10/2009) International Morse code (pdf)

2 Spacing and length of the signals

2.1 A dash is equal to three dots.
2.2 The space between the signals forming the same letter is equal to one dot.
2.3 The space between two letters is equal to three dots.
2.4 The space between two words is equal to seven dots.

End of work is signaled by . . . − . − and the Starting signal (to precede every transmission) is − . − . −

I agree that using shorter codes for more frequent messages is preferable in the scenario described; however, my suggestion does not fail to achieve that, unless dashes are counted as longer than dots, which in the posited carrier's plan, they are not, at least not in terms of message cost.

If we're not going to nitpick, the daughter types whatever dots and dashes she wants to, and then presses send, so there would be no ambiguity about end of message, and your encoding would be a little more costly than mine -- 7/4 to 5/4.
 
  • #16
sysprog said:
Your objection that a dot dot message could be interpreted as 2 single-dot messages instead of as a single 2-dot message gives no account of how messages are terminated. You allow message 4 to end with a dash, so it can't be that a dot is always used for message termination. You use dash dot for message 2; why not just dash?

I concede. I wasn't actually proposing anything about messages other than trying to illustrate why you want to let the code length of a message depend on the probability of sending that message.
 
  • #17
stevendaryl said:
... you want to let the code length of a message depend on the probability of sending that message.
Agreed. That's why Morse assigned E to be represented by a single dot.
 

