There are in fact many "information theories".
Broadly speaking, all of these define a notion of "information" in terms of a notion of "entropy". In fact, starting from a basic notion of "entropy", many of these theories define a "conditional entropy" and, from that, "information".
The most important theory of information is due to Shannon and is founded in probability theory. This theory has been intensively developed since 1948 and, IMO, has for many decades provided the very model of what a theory in applied mathematics should be:
* the fundamental quantity of study (Shannon's entropy) is readily computable
* it obeys memorable and powerful formal properties
* there are theorems which offer a clear and satisfying interpretation of this quantity
* the basic theory is remarkably elementary
* the theory has proven to be fantastically far-reaching, and various reworkings have proven invaluable in every part of applied mathematics, engineering, and science where probability theory plays a role (including statistics)
Remarkably, (parts one and two of) the original publication by Shannon remains in many ways the best introduction to this theory; see http://www.math.uni-hamburg.de/home/gunesch/Entropy/shannon.ps
While Shannon 1948 is elementary and highly readable, physicists--- who must in any case deal with dynamical systems--- should be aware from the outset that (as Shannon notes in an appendix) this theory is properly founded in ergodic theory, the mathematical theory of the late time behavior of dynamical systems. See Peter Walters, Introduction to Ergodic Theory, Springer, 1981, for a superb introduction both to ergodic theory and the foundation of information theory; note that in this approach, one defines the entropy of a finite partition, with respect to a probability measure which is invariant under some dynamical process (e.g. a measurable transformation which is iterated).
In Shannon's theory, "information" is a combination of conditional entropies and is essentially a kind of probabilistic notion of "correlation" between two dynamical systems (in elementary examples, these are Markov chains). It is important to understand that this notion of information implies no causal direction; Shannon's theory never implies that the behavior of one system is causing the behavior of a second system, only that the behaviors of the two systems are correlated. This crucial point is often obscured in textbook accounts.
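To fix notation (a standard sketch, not tied to any particular textbook): for discrete random variables X, Y with joint probabilities [itex]p(x,y)[/itex] and marginals [itex]p(x), p(y)[/itex],
[tex]
H(X) = -\sum_x p(x) \log p(x), \qquad
H(X \mid Y) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)}, \qquad
I(X;Y) = H(X) - H(X \mid Y).
[/tex]
Since [itex]I(X;Y) = H(X) + H(Y) - H(X,Y) = I(Y;X)[/itex], the quantity is symmetric in the two systems, which is one way to see that no causal direction can possibly be encoded in it.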
In the ergodic theory approach, another notion of entropy arises, topological entropy, which is independent of probability measures. This bears a remarkable relationship to Shannon entropy; roughly speaking, it is obtained by maximizing the measure-theoretic entropy over the probability measures invariant under the dynamics. I should also say that Shannon's entropy is essentially the Hausdorff dimension of a certain fractal set living in an abstract metric space, the set of "typical sequences".
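In symbols, the relationship just alluded to is the variational principle of ergodic theory (stated loosely here):
[tex]
h_{\mathrm{top}}(T) \;=\; \sup_{\mu}\, h_{\mu}(T),
[/tex]
where the supremum runs over the T-invariant Borel probability measures and [itex]h_{\mu}(T)[/itex] is the measure-theoretic (Kolmogorov-Sinai) entropy; see Walters for the precise statement.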
However, while Shannon's information theory is by far the most popular (and arguably the most powerful and the most far-reaching), it is not the only possible approach. An even simpler and in some respects more fundamental approach is combinatorial. It is called Boltzmann entropy and, as you would guess from the name, it arose in the early days of statistical mechanics. I say "more fundamental" because this approach does not require one to choose any probability measure but applies to any finite partition. This satisfies the same formal properties as Shannon's entropy, and in fact a limiting case reduces to Shannon's entropy. (More precisely, we must first normalize the Boltzmann entropy. The relation between these two quantities was well known to mathematicians prior to Shannon's work; in fact, von Neumann, who had already invented a quantum mechanical variant of Boltzmann entropy, suggested the name "entropy" to Shannon because his quantity had earlier arisen as an approximation to Boltzmann entropy--- adding the quip that Shannon would win any arguments because "no-one knows what entropy is".) In Boltzmann's theory, the entropy of a finite partition measures the "homogeneity" of the relative sizes of the blocks of the partition (and is maximal when all have the same size). It is in fact simply the logarithm of the multinomial coefficient which counts the number of ways to partition the underlying set into blocks of the given sizes.
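Concretely (a standard computation via Stirling's formula, sketched here only for orientation): if N objects fall into blocks of sizes [itex]n_1, \dots, n_k[/itex], then
[tex]
H_B = \log \frac{N!}{n_1! \cdots n_k!},
\qquad\text{and}\qquad
\frac{1}{N}\, H_B \;\longrightarrow\; -\sum_i p_i \log p_i
\quad \text{as } N \to \infty \text{ with } n_i/N \to p_i,
[/tex]
which is the sense in which the normalized Boltzmann entropy has Shannon's entropy as a limiting case.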
Shannon entropy can be treated in mathematical statistics as a "nonparametric statistic". In terms of "sufficient statistics", an important variant of Shannon's theory was introduced by Kullback; here the fundamental quantity is called divergence (or relative entropy, cross-entropy, or discrimination). This quantity is important in decision theory, robotic vision, and many other areas.
In mathematical statistics, Shannon entropies are related to Fisher information. In addition, in an appropriate limit, Shannon information is approximated by chi square.
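For reference (standard definitions, not tied to any one source): the divergence of p from q and its small-deviation expansion are
[tex]
D(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}
\;\approx\; \frac{1}{2} \sum_i \frac{(p_i - q_i)^2}{q_i} = \tfrac{1}{2}\, \chi^2
\quad\text{(in nats, for } p \text{ close to } q\text{)},
[/tex]
which is the sense in which these Shannon-style quantities are approximated by chi square.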
Viewed as a branch of mathematical statistics (one which fits into the Bayesian scheme of things) dealing with time series or dynamical systems, Shannon's theory has the great advantage that the quantities of interest have clear interpretations supported by theorems. However, the problem of estimating "empirical entropies" from data is nontrivial; in addition to bias one must deal with nonlinearity. OTOH, the most common parametric statistics (e.g. mean, standard deviation) can be treated in terms of euclidean geometry in many dimensions, which is the underlying reason why Gauss/Fisher notions of statistics have been so successful. However, these statistical quantities suffer from a little publicized but serious drawback: their alleged interpretation is highly suspect. Thus the perennial war between "Bayesians" and "frequentists".
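To illustrate the estimation problem (a minimal sketch of my own; the function names and the toy example are not from any particular reference): the naive "plug-in" estimator is biased low for small samples, and the standard first-order (Miller-Madow) correction only partly compensates.
[code]
import numpy as np

def plugin_entropy(samples, base=2):
    """Naive 'plug-in' estimate of Shannon entropy from observed symbols."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum() / np.log(base)

def miller_madow_entropy(samples, base=2):
    """Plug-in estimate plus the first-order (Miller-Madow) bias correction."""
    _, counts = np.unique(samples, return_counts=True)
    k, n = len(counts), counts.sum()
    return plugin_entropy(samples, base) + (k - 1) / (2 * n * np.log(base))

# A fair 8-sided die has entropy exactly 3 bits; with only 50 throws the
# plug-in estimate typically falls short, and the correction closes part
# of the gap.
rng = np.random.default_rng(0)
throws = rng.integers(0, 8, size=50)
print(plugin_entropy(throws), miller_madow_entropy(throws))
[/code]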
Leaving aside philosophical conundrums (for once these are however crucially important and even have profound consequences for society), I might mention that the important principle of maximal entropy and the closely related principle of minimal divergence give the most efficient and best motivated way to derive and understand essentially all of the distributions one might run across in probability theory, everything from normal to Pareto distributions. The principle of minimal divergence is closely related to the principle of minimal free energy in thermodynamics, via the Legendre transformation. See the books by J. N. Kapur for more about these principles and their many applications. See also the books by Rockafellar for optimization in convex geometry.
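To give the flavor (the standard Lagrange-multiplier sketch, with details omitted): maximizing [itex]-\sum_i p_i \log p_i[/itex] subject to constraints of the form [itex]\sum_i p_i f_k(x_i) = c_k[/itex] forces an exponential-family shape,
[tex]
p(x) \;\propto\; \exp\Big( -\sum_k \lambda_k f_k(x) \Big),
[/tex]
so fixing the mean and variance on the real line gives the normal distribution, fixing the mean of a positive quantity gives the exponential distribution, and fixing [itex]E[\log x][/itex] on an interval [itex][x_0, \infty)[/itex] gives a Pareto distribution.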
I should mention that there are game theoretical approaches to combinatorial versions of information theory; these have many connections to first order logic. There is also a famous interpretation of Shannon's information in terms of gambling, which is a kind of probabilistic model of a stochastic game, due to J. L. Kelly, Jr. of Bell Labs (see the textbook by Cover and Thomas).
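In one line (following the horse-race setting of Cover and Thomas): if horse i wins with probability [itex]p_i[/itex] and pays odds [itex]o_i[/itex]-for-1, a gambler who repeatedly bets the fraction [itex]b_i[/itex] of her capital on horse i sees her wealth grow exponentially at the doubling rate
[tex]
W(b, p) = \sum_i p_i \log\big( b_i\, o_i \big),
[/tex]
which is maximized by proportional betting [itex]b_i = p_i[/itex], and the optimal rate falls short of [itex]\sum_i p_i \log o_i[/itex] by exactly the Shannon entropy [itex]H(p)[/itex].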
One of the most important notions in mathematics (and physics) is that of symmetry. Boltzmann's approach can be generalized by incorporating group actions to give a theory closely related to Polya counting (in enumerative combinatorics) and to Galois theory (in fact, the Galois group arises as a conditional complexion). In this theory, which is quite elementary, the numerical entropies of Boltzmann are replaced by certain algebraic objects (coset spaces), which Planck called complexions, and which obey the same formal properties. Once again, theorems support the interpretation in terms of "information". In this theory, when a group G acts on a set X, the complexion of a subset measures how much information we must be told in order to know how the points of the subset are moved by an unknown element of the group. The entropy is simply a numerical invariant (as a G-set) of the complexion. If you stare at the Boltzmann entropy and think about coset spaces under the natural action by the symmetric group, you will probably guess how it is defined!
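To unpack that hint (my own reading of it, so treat the details as a guess at the intended construction): let the symmetric group [itex]S_N[/itex] act on N points and consider a partition into blocks of sizes [itex]n_1, \dots, n_k[/itex]. The subgroup preserving each block is [itex]S_{n_1} \times \cdots \times S_{n_k}[/itex], and the corresponding coset space satisfies
[tex]
\big| S_N / (S_{n_1} \times \cdots \times S_{n_k}) \big| = \frac{N!}{n_1! \cdots n_k!},
[/tex]
so taking the logarithm of the size of this "complexion" recovers exactly the Boltzmann entropy of the partition.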
Yet another approach was discovered independently by Solomonoff, Chaitin, and Kolmogorov. In this theory, an algorithmic entropy is defined using a mathematical model of computation. This theory (strictly speaking, there are several versions of algorithmic information theory) has fascinating connections with mathematical logic (Chaitin found a reformulation of Goedel's incompleteness theorem within this framework), and is in a sense very general indeed, but suffers from the drawback that algorithmic entropies are only defined up to an uninteresting additive constant and are furthermore formally uncomputable--- which does tend to limit the practical application of this approach! This hasn't stopped mathematicians and physicists from making the most of it, however. See Li and Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, 2nd edition, Springer, 1997.
This might seem rather overwhelming, but fortunately there is an excellent source of information which should help students keep their orientation: the undergraduate textbook by Cover and Thomas, Elements of Information Theory, Wiley, 1991, is a superb introduction to Shannon's information theory, with a chapter on algorithmic information theory. It doesn't use the ergodic theory formulation, but some students may find the language of random variables more congenial. Be aware, though, that true mastery is only possible when one thinks of random variables as measurable functions on a probability measure space and uses the ergodic theoretic foundation of Shannon's theory.
I should say that above I have only sketched some of the alternatives on offer. The ones I have mentioned thus far all share some "Shannonian" characteristics (similar formal properties of the "entropies"). If one goes further afield one finds many other quantities which have been called "entropy", usually by some formal analogy with another "entropy". These "entropies" typically obey few if any known formal properties, which limits their utility.
A large class of "entropies" are defined in terms of the growth properties of some sequence. Examples include various graph theoretical entropies (the first example was in fact Shannon's first idea before he hit upon the definition used in his information theory--- see the combinatorics textbook by Cameron and Van Lint), various algebraic entropies, and various dynamical entropies. For example, given a continuous function f on the unit interval, one can look at preimages of points under f, f^2, ... and from the growth of the number of preimages define an "entropy". In addition, further "entropies" can be defined in terms of some variational principle or integral, including "entropies" of plane curves, knots/links, and so on.
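For instance (one common way to make the preimage-counting idea precise; there are several variants in the literature), one can set
[tex]
h(f) \;=\; \limsup_{n \to \infty} \frac{1}{n} \log \, \sup_{y} \#\, f^{-n}(y),
[/tex]
where [itex]f^{-n}(y)[/itex] denotes the set of n-fold preimages of the point y under f.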
In addition, various generalizations of the form of the definition of Shannon's entropy are possible, e.g. Renyi entropies. (I should stress however that "divergence" is best viewed as part of Shannon's theory.) Renyi entropies satisfy less useful formal properties than Shannon entropies. A number of "fractal dimensions" encountered in the less rigorous portions of "chaos theory" can be regarded as "entropies".
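For the record, the Renyi entropy of order [itex]\alpha[/itex] is
[tex]
H_\alpha = \frac{1}{1 - \alpha} \log \sum_i p_i^{\alpha}, \qquad \alpha > 0, \ \alpha \neq 1,
[/tex]
which recovers the Shannon entropy in the limit [itex]\alpha \to 1[/itex].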
In statistical mechanics, as I already mentioned, Boltzmann and Gibbs entropies are of fundamental importance. (The latter are subsumed within the ergodic theory formulation of Shannon's theory.) I've been too lazy to write out the definition of Shannon's entropy as a sum of terms like [itex]p \, \log(p)[/itex], but in the setting of compact operators on a Hilbert space this can be generalized to an infinite sum and this is von Neumann's entropy. One might also mention Tsallis entropy, which is a variant of Gibbs entropy, somewhat analogous to the way Renyi entropy is a variant of Shannon entropy.
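For completeness, the definition postponed above, together with its operator-theoretic generalization: for a density operator [itex]\rho[/itex] (positive, trace one) with eigenvalues [itex]\lambda_i[/itex],
[tex]
H = -\sum_i p_i \log p_i
\qquad\text{generalizes to}\qquad
S(\rho) = -\operatorname{Tr}\big( \rho \log \rho \big) = -\sum_i \lambda_i \log \lambda_i,
[/tex]
and when [itex]\rho[/itex] is diagonal the von Neumann entropy reduces to the Shannon sum.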
There are many beautiful connections between operator theoretic entropies and other important ideas such as harmonic analysis. One quantity which arises in this way, Burg entropy, has proven popular in geophysics. There are also some nice theorems concerning "entropies" which are closely related to the uncertainty principle. For an example see this lovely expository post (maybe soon to be paper?) by Terry Tao: http://terrytao.wordpress.com/2007/09/05/amplification-arbitrage-and-the-tensor-power-trick/
These days "noncommutative mathematics" is very popular; so are "quantum deformations". These have given rise to still more notions of entropy, all of which seem likely to play an important role in 21st century mathematics.
Since I mentioned Fisher information, I should add that as some of you may know, an optical scientist, Roy Frieden, has claimed to found physics upon Fisher information; however, his book has been severely criticized (with good reason IMO)--- these objections are best seen from the perspective of broad knowledge about information theories generally, so I will say no more about it. In contrast, the work of Edwin Jaynes, who reformulated statistical mechanics in terms of divergence and also introduced the principle of maximal entropy, is well founded. (This statement begs the critical question of the dubious physical interpretation of probabilities, but this question is already habitually ignored within statistical mechanics and indeed within statistics.) See Richard Ellis, Entropy, Large Deviations, and Statistical Mechanics, Springer, 1985.
In all, by my count of some twenty years ago, on the order of one thousand distinct quantities called "entropy" had already been introduced. The inter-relationships between these, where they are known, tend to be rather complicated. The only attempts to survey these are very old, and were very limited in scope even when they were written. Today, the task of surveying "entropies" seems Sisyphean.
Mike2 said: "My question is what is the most basic definition of information?"
The answer depends, of course, on what you mean by "most basic".
Some would say that "the" algorithmic information theoretic definition is most fundamental, but this doesn't tell the whole story; in a sense, the probabilistic definition includes this one, so it is not straightforward to say that either is more general than the other. Similarly, Galois entropies include Boltzmann's combinatorial entropies as a special case, and these in turn yield Shannon entropies as a limiting case.
In a sense probabilities are everywhere, so one can expect to apply Shannon's information theory to any situation where one has a probability measure. However, as I tried to indicate above, depending upon what kind of phenomenon you are interested in, certain characteristics (e.g. causal symmetry) of Shannon's theory may prevent you from learning anything!
Rather than seeking "one entropy to rule them all", I advocate following Shannon's shining example and seeking an elegant theory founded upon clear definitions and with interpretations supported by theorems, in which the quantities studied in the theory bear an unambiguous relation to the phenomena of interest. Formal analogies with the quantities studied in previous information theories can be intriguing and suggestive, but they can unfortunately also be misleading. Only after the foundations of a satisfactory theory have been laid should one look for connections with other information theories.
For example, in differential equations, by studying the question "what information is needed to specify a solution?" we are led to try to compare the "sizes" of the solution spaces to various equations. (Clearly, "dimension" won't suffice to tell us very much!) The problem is then to concoct a self consistent information theory capturing the phenomena we are interested in.
I already mentioned a theory in which numerical entropies are replaced by coset spaces. For mathematical reasons it would be desirable to do something similar with modules, which have nicer properties. Coming at commutative algebra from another direction, natural notions of "information" arise in algebraic geometry, and these appear worthy of serious study.
Turning to biology, one can challenge the assumption that "the information content of a living cell" can be identified with "the entropy" of the genome (when you know more about Shannon's theory you'll appreciate why this doesn't make sense as stated). Here, the biological issue is that naked DNA cannot make a living cell; it seems that we also need to account for the "overhead" due to the presence in a living cell of highly organized and quite complicated physical structures (think of a bag of nanobots bumping into each other and into raw materials, with flaps and valves opening and shutting when the right stuff comes within grabbing range, products getting stuck on conveyor belts and shunted off to another part of the cell, and so on). It is far from clear (to me, at least) that probabilities are the most direct or best way of trying to quantify this overhead.
One easy way to see that "Shannonian" information theories might not be what you want is that in these theories, "entropy" is subadditive: the entropy of the whole is no larger than the sum of the entropies of its parts. But sometimes we wish to study phenomena in which "complexity" is superadditive: the complexity of the whole is no smaller than the sum of the complexities of its parts. Indeed, one simple "entropy", due to George Gaylord Simpson, arose in the context of trying to quantify the "diversity" of ecosystems. This problem has given rise to a vast literature, much of it unfortunately too naive mathematically to have lasting value, IMO.
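In symbols, the Shannonian constraint is the subadditivity inequality
[tex]
H(X, Y) \;\le\; H(X) + H(Y),
[/tex]
with equality exactly when the two parts are statistically independent, whereas a satisfactory notion of "complexity" for such phenomena would have to be allowed to exceed the sum.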
Mike2 said: "I also see information theory as holding a key to the fundamental structure at all levels. There is some complexity to any structure, I assume, and so it must take some information to describe that structure, even at the smallest level."
Yes, that was one theme of an expository eprint I wrote as a graduate student, "What is Information?"
One feature which many information theories share is that their notion of "information" arises when we contemplate choosing one of many alternatives.
For example, after receiving a "perturbed message" which was sent over a noisy communications channel, we wish to choose one of many possible "unperturbed messages", namely the one most likely intended by the sender. Here, if we treat the "perturbation" using a probabilistic model (in fact a Markov chain), we are led to Shannon's theory, in which "information rate" is defined in terms of probability theory.
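A standard toy case (the binary symmetric channel, offered here only as an illustration): if each transmitted bit is flipped independently with probability [itex]p[/itex], the rate at which information can be pushed through reliably is
[tex]
C = 1 - H_2(p), \qquad H_2(p) = -p \log_2 p - (1-p) \log_2 (1-p),
[/tex]
so a channel which flips bits half the time carries no information at all, while a noiseless channel carries one bit per use.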
Or if we are studying a group action on a set, after receiving reliable information about how the points of a certain subset were moved by an unknown element of the group, we wish to deduce the identity of this unknown element. Here, we are led to the theory of complexions, in which "information" is defined in terms of group actions.
Another theme of my expository paper was the study of the question of what information is needed to choose an object in some category, or to specify a morphism. For example, in the category of vector spaces, it suffices to say where the basis elements are sent in order to specify a linear transformation.
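To quantify that example in the simplest case (a back-of-the-envelope count over a finite field, for definiteness): a linear map [itex]F_q^n \to F_q^m[/itex] is pinned down by the images of the n basis vectors, i.e. by nm field elements, hence by
[tex]
nm \log_2 q \ \text{bits},
[/tex]
whereas specifying an arbitrary function on the whole space would take on the order of [itex]q^n m \log_2 q[/itex] bits.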
(NOTE: I originally posted this in another subforum but was so horrified by the response that I moved it here, where I hope there is less chance that I will be badly misunderstood.)