There are in fact many "information theories".
Broadly speaking, all of these define a notion of "information" in terms of a notion of "entropy". In fact, starting from a basic notion of "entropy", many of these theories define a "conditional entropy" and, from that, "information".
The most important theory of information is due to Shannon and is founded in probability theory. This theory has been intensively developed since 1948 and, IMO, has for many decades provided the very model of what a theory in applied mathematics should be:
* the fundamental quantity of study (Shannon's entropy) is readily computable
* it obeys memorable and powerful formal properties
* there are theorems which offer a clear and satisfying interpretation of this quantity
* the basic theory is remarkably elementary
* the theory has proven to be fantastically far-reaching, and various reworkings have proven invaluable in every part of applied mathematics, engineering, and science where probability theory plays a role (including statistics)
Remarkably, (parts one and two of) the original publication by Shannon remains in many ways the best introduction to this theory; see http://www.math.uni-hamburg.de/home/gunesch/Entropy/shannon.ps
While Shannon 1948 is elementary and highly readable, physicists--- who must in any case deal with dynamical systems--- should be aware from the outset that (as Shannon notes in an appendix) this theory is properly founded in ergodic theory, the mathematical theory of the late time behavior of dynamical systems. See Peter Walters, Introduction to Ergodic Theory, Springer, 1981, for a superb introduction both to ergodic theory and the foundation of information theory; note that in this approach, one defines the entropy of a finite partition, with respect to a probability measure which is invariant under some dynamical process (e.g. a measurable transformation which is iterated).
In Shannon's theory, "information" is a combination of conditional entropies and is essentially a kind of probabilistic notion of "correlation" between two dynamical systems (in elementary examples, these are Markov chains). It is important to understand that this notion of information implies no causal direction; Shannon's theory never implies that the behavior of one system is causing the behavior of a second system, only that the behaviors of the two systems are correlated. This crucial point is often obscured in textbook accounts.
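To fix notation (a standard sketch, not tied to any particular textbook): for discrete random variables X, Y with joint probabilities [itex]p(x,y)[/itex] and marginals [itex]p(x), p(y)[/itex],
[tex]
H(X) = -\sum_x p(x) \log p(x), \qquad
H(X \mid Y) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)}, \qquad
I(X;Y) = H(X) - H(X \mid Y).
[/tex]
Since [itex]I(X;Y) = H(X) + H(Y) - H(X,Y) = I(Y;X)[/itex], the quantity is symmetric in the two systems, which is one way to see that no causal direction can possibly be encoded in it.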
In the ergodic theory approach, another notion of entropy arises, topological entropy, which is independent of probability measures. This bears a remarkable relationship to Shannon entropy; roughly speaking, it is obtained by maximizing the measure-theoretic entropy over the probability measures invariant under the dynamics. I should also say that Shannon's entropy is essentially the Hausdorff dimension of a certain fractal set living in an abstract metric space, the set of "typical sequences".
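In symbols, the relationship just alluded to is the variational principle of ergodic theory (stated loosely here):
[tex]
h_{\mathrm{top}}(T) \;=\; \sup_{\mu}\, h_{\mu}(T),
[/tex]
where the supremum runs over the T-invariant Borel probability measures and [itex]h_{\mu}(T)[/itex] is the measure-theoretic (Kolmogorov-Sinai) entropy; see Walters for the precise statement.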
However, while Shannon's information theory is by far the most popular (and arguably the most powerful and the most far-reaching), it is not the only possible approach. An even simpler and in some respects more fundamental approach is combinatorial. It is called Boltzmann entropy and, as you would guess from the name, it arose in the early days of statistical mechanics. I say "more fundamental" because this approach does not require one to choose any probability measure but applies to any finite partition. This satisfies the same formal properties as Shannon's entropy, and in fact a limiting case reduces to Shannon's entropy. (More precisely, we must first normalize the Boltzmann entropy. The relation between these two quantities was well known to mathematicians prior to Shannon's work; in fact, von Neumann, who had already invented a quantum mechanical variant of Boltzmann entropy, suggested the name "entropy" to Shannon because his quantity had earlier arisen as an approximation to Boltzmann entropy--- adding the quip that Shannon would win any arguments because "no-one knows what entropy is".) In Boltzmann's theory, the entropy of a finite partition measures the "homogeneity" of the relative sizes of the blocks of the partition (and is maximal when all have the same size). It is in fact simply the logarithm of the multinomial coefficient which counts the number of ways to partition the underlying set into blocks of the given sizes.
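Concretely (a standard computation via Stirling's formula, sketched here only for orientation): if N objects fall into blocks of sizes [itex]n_1, \dots, n_k[/itex], then
[tex]
H_B = \log \frac{N!}{n_1! \cdots n_k!},
\qquad\text{and}\qquad
\frac{1}{N}\, H_B \;\longrightarrow\; -\sum_i p_i \log p_i
\quad \text{as } N \to \infty \text{ with } n_i/N \to p_i,
[/tex]
which is the sense in which the normalized Boltzmann entropy has Shannon's entropy as a limiting case.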
Shannon entropy can be treated in mathematical statistics as a "nonparametric statistic". In terms of "sufficient statistics", an important variant of Shannon's theory was introduced by Kullback; here the fundamental quantity is called divergence (or relative entropy, cross-entropy, or discrimination). This quantity is important in decision theory, robotic vision, and many other areas.
In mathematical statistics, Shannon entropies are related to Fisher information. In addition, in an appropriate limit, Shannon information is approximated by chi square.
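For reference (standard definitions, not tied to any one source): the divergence of p from q and its small-deviation expansion are
[tex]
D(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}
\;\approx\; \frac{1}{2} \sum_i \frac{(p_i - q_i)^2}{q_i} = \tfrac{1}{2}\, \chi^2
\quad\text{(in nats, for } p \text{ close to } q\text{)},
[/tex]
which is the sense in which these Shannon-style quantities are approximated by chi square.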
Viewed as a branch of mathematical statistics (one which fits into the Bayesian scheme of things) dealing with time series or dynamical systems, Shannon's theory has the great advantage that the quantities of interest have clear interpretations supported by theorems. However, the problem of estimating "empirical entropies" from data is nontrivial; in addition to bias one must deal with nonlinearity. OTOH, the most common parametric statistics (e.g. mean, standard deviation) can be treated in terms of euclidean geometry in many dimensions, which is the underlying reason why Gauss/Fisher notions of statistics have been so successful. However, these statistical quantities suffer from a little publicized but serious drawback: their alleged interpretation is highly suspect. Thus the perennial war between "Bayesians" and "frequentists".
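To illustrate the estimation problem (a minimal sketch of my own; the function names and the toy example are not from any particular reference): the naive "plug-in" estimator is biased low for small samples, and the standard first-order (Miller-Madow) correction only partly compensates.
[code]
import numpy as np

def plugin_entropy(samples, base=2):
    """Naive 'plug-in' estimate of Shannon entropy from observed symbols."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum() / np.log(base)

def miller_madow_entropy(samples, base=2):
    """Plug-in estimate plus the first-order (Miller-Madow) bias correction."""
    _, counts = np.unique(samples, return_counts=True)
    k, n = len(counts), counts.sum()
    return plugin_entropy(samples, base) + (k - 1) / (2 * n * np.log(base))

# A fair 8-sided die has entropy exactly 3 bits; with only 50 throws the
# plug-in estimate typically falls short, and the correction closes part
# of the gap.
rng = np.random.default_rng(0)
throws = rng.integers(0, 8, size=50)
print(plugin_entropy(throws), miller_madow_entropy(throws))
[/code]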
Leaving aside philosophical conundrums (for once these are however crucially important and even have profound consequences for society), I might mention that the important principle of maximal entropy and the closely related principle of minimal divergence give the most efficient and best motivated way to derive and understand essentially all of the distributions one might run across in probability theory, everything from normal to Pareto distributions. The principle of minimal divergence is closely related to the principle of minimal free energy in thermodynamics, via the Legendre transformation. See the books by J. N. Kapur for more about these principles and their many applications. See also the books by Rockafellar for optimization in convex geometry.
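To give the flavor (the standard Lagrange-multiplier sketch, with details omitted): maximizing [itex]-\sum_i p_i \log p_i[/itex] subject to constraints of the form [itex]\sum_i p_i f_k(x_i) = c_k[/itex] forces an exponential-family shape,
[tex]
p(x) \;\propto\; \exp\Big( -\sum_k \lambda_k f_k(x) \Big),
[/tex]
so fixing the mean and variance on the real line gives the normal distribution, fixing the mean of a positive quantity gives the exponential distribution, and fixing [itex]E[\log x][/itex] on an interval [itex][x_0, \infty)[/itex] gives a Pareto distribution.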
I should mention that there are game theoretical approaches to combinatorial versions of information theory; these have many connections to first order logic. There is also a famous interpretation of Shannon's information in terms of gambling, which is a kind of probabilistic model of a stochastic game, due to J. L. Kelly, Jr. of Bell Labs (see the textbook by Cover and Thomas).
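In one line (following the horse-race setting of Cover and Thomas): if horse i wins with probability [itex]p_i[/itex] and pays odds [itex]o_i[/itex]-for-1, a gambler who repeatedly bets the fraction [itex]b_i[/itex] of her capital on horse i sees her wealth grow exponentially at the doubling rate
[tex]
W(b, p) = \sum_i p_i \log\big( b_i\, o_i \big),
[/tex]
which is maximized by proportional betting [itex]b_i = p_i[/itex], and the optimal rate falls short of [itex]\sum_i p_i \log o_i[/itex] by exactly the Shannon entropy [itex]H(p)[/itex].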
One of the most important notions in mathematics (and physics) is that of symmetry. Boltzmann's approach can be generalized by incorporating group actions to give a theory closely related to Polya counting (in enumerative combinatorics) and to Galois theory (in fact, the Galois group arises as a conditional complexion). In this theory, which is quite elementary, the numerical entropies of Boltzmann are replaced by certain algebraic objects (coset spaces), which Planck called complexions, and which obey the same formal properties. Once again, theorems support the interpretation in terms of "information". In this theory, when a group G acts on a set X, the complexion of a subset measures how much information we must be told in order to know how the points of the subset are moved by an unknown element of the group. The entropy is simply a numerical invariant (as a G-set) of the complexion. If you stare at the Boltzmann entropy and think about coset spaces under the natural action by the symmetric group, you will probably guess how it is defined!
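To unpack that hint (my own reading of it, so treat the details as a guess at the intended construction): let the symmetric group [itex]S_N[/itex] act on N points and consider a partition into blocks of sizes [itex]n_1, \dots, n_k[/itex]. The subgroup preserving each block is [itex]S_{n_1} \times \cdots \times S_{n_k}[/itex], and the corresponding coset space satisfies
[tex]
\big| S_N / (S_{n_1} \times \cdots \times S_{n_k}) \big| = \frac{N!}{n_1! \cdots n_k!},
[/tex]
so taking the logarithm of the size of this "complexion" recovers exactly the Boltzmann entropy of the partition.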
Yet another approach was discovered independently by Solomonoff, Chaitin, and Kolmogorov. In this theory, an algorithmic entropy is defined using a mathematical model of computation. This theory (strictly speaking, there are several versions of algorithmic information theory) has fascinating connections with mathematical logic (Chaitin found a reformulation of Goedel's incompleteness theorem within this framework), and is in a sense very general indeed, but suffers from the drawback that algorithmic entropies are only defined up to an uninteresting additive constant and are furthermore formally uncomputable--- which does tend to limit the practical application of this approach! This hasn't stopped mathematicians and physicists from making the most of it, however. See Li and Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, 2nd edition, Springer, 1997.
This might seem rather overwhelming, but fortunately there is an excellent source of information which should help students keep their orientation: the undergraduate textbook by Cover and Thomas, Elements of Information Theory, Wiley, 1991, is a superb introduction to Shannon's information theory, with a chapter on algorithmic information theory. It doesn't use the ergodic theory formulation, but some students may find the language of random variables more congenial. Be aware, though, that true mastery is only possible when one thinks of random variables as measurable functions on a probability measure space and uses the ergodic theoretic foundation of Shannon's theory.
I should say that above I have only sketched some of the alternatives on offer. The ones I have mentioned thus far all share some "Shannonian" characteristics (similar formal properties of the "entropies"). If one goes further afield one finds many other quantities which have been called "entropy", usually by some formal analogy with another "entropy". These "entropies" typically obey few if any known formal properties, which limits their utility.
A large class of "entropies" are defined in terms of the growth properties of some sequence. Examples include various graph theoretical entropies (the first example was in fact Shannon's first idea before he hit upon the definition used in his information theory--- see the combinatorics textbook by Cameron and Van Lint), various algebraic entropies, and various dynamical entropies. For example, given a continuous function f on the unit interval, one can look at preimages of points under f, f^2, ... and from the growth of the number of preimages define an "entropy". In addition, further "entropies" can be defined in terms of some variational principle or integral, including "entropies" of plane curves, knots/links, and so on.
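For instance (one common way to make the preimage-counting idea precise; there are several variants in the literature), one can set
[tex]
h(f) \;=\; \limsup_{n \to \infty} \frac{1}{n} \log \, \sup_{y} \#\, f^{-n}(y),
[/tex]
where [itex]f^{-n}(y)[/itex] denotes the set of n-fold preimages of the point y under f.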
In addition, various generalizations of the form of the definition of Shannon's entropy are possible, e.g. Renyi entropies. (I should stress however that "divergence" is best viewed as part of Shannon's theory.) Renyi entropies satisfy less useful formal properties than Shannon entropies. A number of "fractal dimensions" encountered in the less rigorous portions of "chaos theory" can be regarded as "entropies".
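For the record, the Renyi entropy of order [itex]\alpha[/itex] is
[tex]
H_\alpha = \frac{1}{1 - \alpha} \log \sum_i p_i^{\alpha}, \qquad \alpha > 0, \ \alpha \neq 1,
[/tex]
which recovers the Shannon entropy in the limit [itex]\alpha \to 1[/itex].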
In statistical mechanics, as I already mentioned, Boltzmann and Gibbs entropies are of fundamental importance. (The latter are subsumed within the ergodic theory formulation of Shannon's theory.) I've been too lazy to write out the definition of Shannon's entropy as a sum of terms like [itex]p \, \log(p)[/itex], but in the setting of compact operators on a Hilbert space this can be generalized to an infinite sum and this is von Neumann's entropy. One might also mention Tsallis entropy, which is a variant of Gibbs entropy, somewhat analogous to the way Renyi entropy is a variant of Shannon entropy.
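For completeness, the definition postponed above, together with its operator-theoretic generalization: for a density operator [itex]\rho[/itex] (positive, trace one) with eigenvalues [itex]\lambda_i[/itex],
[tex]
H = -\sum_i p_i \log p_i
\qquad\text{generalizes to}\qquad
S(\rho) = -\operatorname{Tr}\big( \rho \log \rho \big) = -\sum_i \lambda_i \log \lambda_i,
[/tex]
and when [itex]\rho[/itex] is diagonal the von Neumann entropy reduces to the Shannon sum.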
There are many beautiful connections between operator theoretic entropies and other important ideas such as harmonic analysis. One quantity which arises in this way, Burg entropy, has proven popular in geophysics. There are also some nice theorems concerning "entropies" which are closely related to the uncertainty principle. For an example see this lovely expository post (maybe soon to be paper?) by Terry Tao: http://terrytao.wordpress.com/2007/09/05/amplification-arbitrage-and-the-tensor-power-trick/
These days "noncommutative mathematics" is very popular; so are "quantum deformations". These have given rise to still more notions of entropy, all of which seem likely to play an important role in 21st century mathematics.
Since I mentioned Fisher information, I should add that as some of you may know, an optical scientist, Roy Frieden, has claimed to found physics upon Fisher information; however, his book has been severely criticized (with good reason IMO)--- these objections are best seen from the perspective of broad knowledge about information theories generally, so I will say no more about it. In contrast, the work of Edwin Jaynes, who reformulated statistical mechanics in terms of divergence and also introduced the principle of maximal entropy, is well founded. (This statement begs the critical question of the dubious physical interpretation of probabilities, but this question is already habitually ignored within statistical mechanics and indeed within statistics.) See Richard Ellis, Entropy, Large Deviations, and Statistical Mechanics, Springer, 1985.
In all, by my count of some twenty years ago, on the order of one thousand distinct quantities called "entropy" had already been introduced. The inter-relationships between these, where they are known, tend to be rather complicated. The only attempts to survey these are very old, and were very limited in scope even when they were written. Today, the task of surveying "entropies" seems Sisyphean.
Mike2 said: "My question is what is the most basic definition of information?"
The answer depends, of course, on what you mean by "most basic".
Some would say that "the" algorithmic information theoretic definition is most fundamental, but this doesn't tell the whole story; in a sense, the probabilistic definition includes this one, so it is not straightforward to say that either is more general than the other. Similarly, Galois entropies include Boltzmann's combinatorial entropies as a special case, and these in turn yield Shannon entropies as a limiting case.
In a sense probabilities are everywhere, so one can expect to apply Shannon's information theory to any situation where one has a probability measure. However, as I tried to indicate above, depending upon what kind of phenomenon you are interested in, certain characteristics (e.g. causal symmetry) of Shannon's theory may prevent you from learning anything!
Rather than seeking "one entropy to rule them all", I advocate following Shannon's shining example and seeking an elegant theory founded upon clear definitions and with interpretations supported by theorems, in which the quantities studied in the theory bear an unambiguous relation to the phenomena of interest. Formal analogies with the quantities studied in previous information theories can be intriguing and suggestive, but they can unfortunately also be misleading. Only after the foundations of a satisfactory theory have been laid should one look for connections with other information theories.
For example, in differential equations, by studying the question "what information is needed to specify a solution?" we are led to try to compare the "sizes" of the solution spaces to various equations. (Clearly, "dimension" won't suffice to tell us very much!) The problem is then to concoct a self consistent information theory capturing the phenomena we are interested in.
I already mentioned a theory in which numerical entropies are replaced by coset spaces. For mathematical reasons it would be desirable to do something similar with modules, which have nicer properties. Coming at commutative algebra from another direction, natural notions of "information" arise in algebraic geometry, and these appear worthy of serious study.
Turning to biology, one can challenge the assumption that "the information content of a living cell" can be identified with "the entropy" of the genome (when you know more about Shannon's theory you'll appreciate why this doesn't make sense as stated). Here, the biological issue is that naked DNA cannot make a living cell; it seems that we also need to account for the "overhead" due to the presence in a living cell of highly organized and quite complicated physical structures (think of a bag of nanobots bumping into each other and into raw materials, with flaps and valves opening and shutting when the right stuff comes within grabbing range, products getting stuck on conveyor belts and shunted off to another part of the cell, and so on). It is far from clear (to me, at least) that probabilities are the most direct or best way of trying to quantify this overhead.
One easy way to see that "Shannonian" information theories might not be what you want is that in these theories, "entropy" is subadditive: the entropy of the whole is no larger than the sum of the entropies of its parts. But sometimes we wish to study phenomena in which "complexity" is superadditive: the complexity of the whole is no smaller than the sum of the complexities of its parts. Indeed, one simple "entropy", due to George Gaylord Simpson, arose in the context of trying to quantify the "diversity" of ecosystems. This problem has given rise to a vast literature, much of it unfortunately too naive mathematically to have lasting value, IMO.
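In symbols, the Shannonian constraint is the subadditivity inequality
[tex]
H(X, Y) \;\le\; H(X) + H(Y),
[/tex]
with equality exactly when the two parts are statistically independent, whereas a satisfactory notion of "complexity" for such phenomena would have to be allowed to exceed the sum.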
Mike2 said: "I also see information theory as holding a key to the fundamental structure at all levels. There is some complexity to any structure, I assume, and so it must take some information to describe that structure, even at the smallest level."
Yes, that was one theme of an expository eprint I wrote as a graduate student, "What is Information?"
One feature which many information theories share is that their notion of "information" arises when we contemplate choosing one of many alternatives.
For example, after receiving a "perturbed message" which was sent over a noisy communications channel, we wish to choose one of many possible "unperturbed messages", namely the one most likely intended by the sender. Here, if we treat the "perturbation" using a probabilistic model (in fact a Markov chain), we are led to Shannon's theory, in which "information rate" is defined in terms of probability theory.
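A standard toy case (the binary symmetric channel, offered here only as an illustration): if each transmitted bit is flipped independently with probability [itex]p[/itex], the rate at which information can be pushed through reliably is
[tex]
C = 1 - H_2(p), \qquad H_2(p) = -p \log_2 p - (1-p) \log_2 (1-p),
[/tex]
so a channel which flips bits half the time carries no information at all, while a noiseless channel carries one bit per use.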
Or if we are studying a group action on a set, after receiving reliable information about how the points of a certain subset were moved by an unknown element of the group, we wish to deduce the identity of this unknown element. Here, we are led to the theory of complexions, in which "information" is defined in terms of group actions.
Another theme of my expository paper was the study of the question of what information is needed to choose an object in some category, or to specify a morphism. For example, in the category of vector spaces, it suffices to say where the basis elements are sent in order to specify a linear transformation.
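To quantify that example in the simplest case (a back-of-the-envelope count over a finite field, for definiteness): a linear map [itex]F_q^n \to F_q^m[/itex] is pinned down by the images of the n basis vectors, i.e. by nm field elements, hence by
[tex]
nm \log_2 q \ \text{bits},
[/tex]
whereas specifying an arbitrary function on the whole space would take on the order of [itex]q^n m \log_2 q[/itex] bits.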
(NOTE: I originally posted this in another subforum but was so horrified by the response that I moved it here, where I hope there is less chance that I will be badly misunderstood.)