Upper bound for K-L divergence on discrete prob. space

In summary, the conversation is about finding an analytical expression for the upper bound on the Kullback-Leibler divergence for a discrete random variable, expressed as 0 <= S_KL <= f(k), where k is the number of distinguishable outcomes. The discussion turns to the possibility of a probabilistic bound on the expected divergence, and introduces the idea of an observer with limited complexity and the emergence of an effective "probability space" from that observer's limited view of history. The thread also touches on allowing the memory to form substructures, and on using the multinomial distribution to determine the probability of finding a certain frequency for a given sample size. Ultimately, the goal is a deeper understanding of the entropy bounds in physics.
  • #1
Fra
Does anyone know of any analytical expression for the upper bound on the Kullback–Leibler divergence for a discrete random variable?

What I am looking for is the bound expressed as

0 <= S_KL <= f(k)

Where k is the number of distinguishable outcomes.

Ultimately I am also looking for the bound in the case where the probability itself belongs to a special discrete set, rather than to R. (This would correspond to the combinatorics of microstates instead of continuum probabilities; edit: this means another variable enters the bound, namely the sample/memory size, but the first question is what the limit is as the sample size -> infinity.)

I think I looked for this a long time ago. If anyone happens to know a pointer to where this is worked out, I would be interested.

The context of the question is to find a deeper understanding of the various entropy bounds in physics.

To start with, I am not sure to what extent there are analytical expressions, or whether I need to run some computer simulations.

Edit: I'm not looking for approximate (large-k) bounds; I am looking for the actual dependence of the upper bound on k.

/Fredrik
 
  • #2
Err, to talk about KL divergences, you need two random variables (more precisely, two probability distributions over the same outcome space). Also, there is no general upper bound: the divergence can easily be infinite.
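
To make the unboundedness concrete, here is a minimal numerical sketch (the distributions below are made up for illustration): as soon as Q puts vanishing mass on an outcome that P actually uses, the divergence blows up.

[code]
import numpy as np

def kl_divergence(p, q):
    """D(P||Q) = sum_i p_i * ln(p_i / q_i), in nats.
    Terms with p_i == 0 contribute 0 by convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
for eps in (1e-1, 1e-3, 1e-6):
    q = [1 - eps, eps]               # Q starves the second outcome
    print(eps, kl_divergence(p, q))  # grows like -0.5*ln(eps): no finite bound
[/code]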
 
  • #3
I was sloppy. Yes, there are two random variables; I was imagining maximizing over both, but of course you're right. I just realized that due to the continuum of the second variable it is indeed unbounded :-p

But that's not the end of the story. The problem is that I didn't formulate the problem correctly, and formulating the mathematical problem properly isn't itself a purely mathematical problem. It seems the correct question for me is to find a probabilistic bound: what is the probability that a certain bound holds? That is, on the other hand, even better, but then I suspect there are no simple formulas. I guess a computer simulation is the way to go.

Also, in the real problem the probability takes a discrete and finite spectrum of values from 0 to 1, rather than a continuum. The real number was just a simplification anyway, and I just realized that this leads to problems.

So there should be a probabilistic bound, in the sense that to a given confidence level, there is a bound on the expected divergence.

Thanks for your response to my silly question! It helped me see that I should go back to my original context and think. Maybe the expressions I already have for the probability of a frequency are the right way to go; there I effectively already have probabilistic bounds on entropy. The divergent outcomes are unlikely in the first place.

/Fredrik
 
  • #4
Fra said:
So there should be a probabilistic bound, in the sense that to a given confidence level, there is a bound on the expected divergence.

This still requires some further formulation. To speak of an "expected divergence," you'd need the distributions of the random variables to themselves be random variables, so you'll have to introduce at least one more random variable. Is this what you have in mind?

Another way to proceed would be to assume some kind of constraints on the distributions of the (two) random variables in question. For example, if you constrain them to always put non-zero probability mass on every element, then you can form a finite upper bound on the KL divergence.
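
For instance, a standard one-line derivation (assuming the second distribution satisfies q_i >= epsilon for every outcome): since each p_i <= 1,

[tex]
S_{KL} = \sum_{i=1}^{k} p_i \ln \frac{p_i}{q_i} \le \sum_{i=1}^{k} p_i \ln \frac{1}{\epsilon} = \ln \frac{1}{\epsilon}
[/tex]

In particular, with Q uniform over k outcomes (epsilon = 1/k) this gives S_KL <= ln(k), which is exactly the kind of f(k) dependence asked about in the opening post; the maximum is attained when P is concentrated on a single outcome.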
 
  • #5
quadraphonics said:
This still requires some further formulation. To speak of an "expected divergence," you'd need the distributions of the random variables to themselves be random variables, so you'll have to introduce at least one more random variable. Is this what you have in mind?

Yes, it's true that it requires more info, but since this is the math section I'm not sure it's appropriate here.

What I am considering is something like this.

I am considering an observer with limited complexity (memory). This observer can distinguish a certain finite set of outcomes. Beyond this, the observer can distinguish certain histories, but these are also finite because of the observer's limited complexity.

In this picture I imagine that there is an emergent, effective "probability space" defined by a "window" of history. The observer tries, by constructing a "probability space" and a "probability measure" from his limited view of history, to predict the future outcomes.

By construction he can never make deterministic predictions; moreover, he cannot even make deterministic predictions of the probability measure itself, because the measure itself is uncertain. But if this were a real observer, he would have no choice but to gamble and make the best choice he can.

quadraphonics said:
Another way to proceed would be to assume some kind of constraints on the distributions of the (two) random variables in question. For example, if you constrain them to always put non-zero probability mass on every element, then you can form a finite upper bound on the KL divergence.

Later on I will allow the memory to form substructures of communicating or "cooperating" microstructures, each defining an "effective discrete, finite probability space". Non-occupied states will be rationalized away.

To start with, one can picture the multinomial case and ask what the probability is of finding a certain frequency for a given sample size, given a prior. Unless I have made another mistake, I think one simple bound would be

[tex]
S_{KL} \le \frac{1}{M} \ln \left\{ \frac{M!}{(1-P)} \, \frac{ \prod_{i=1}^{k} \rho_{i}^{M\rho_{i}} }{ \prod_{i=1}^{k} (M\rho_{i})! } \right\}
[/tex]

Where M is the sample size of the history and P is an imposed "probability" for the bound.
I.e., S_KL <= C with probability P. Clearly C -> infinity as P -> 1, but in my context P = 1 is unphysical; there is never certainty, so the physical situation P = 1 never happens.
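
Since analytical expressions get messy, here is a minimal simulation sketch of this kind of probabilistic bound (the prior and all names below are illustrative assumptions, not derived from the formula above): draw multinomial samples of size M from a prior, compute the divergence of each sample's frequencies from that prior, and read off the bound C that holds with confidence P as the P-quantile.

[code]
import numpy as np

rng = np.random.default_rng(0)

def empirical_kl(counts, rho):
    """KL divergence of observed frequencies from the prior rho (nats).
    Zero-count bins contribute 0 by the 0*ln(0) = 0 convention."""
    f = counts / counts.sum()
    mask = f > 0
    return float(np.sum(f[mask] * np.log(f[mask] / rho[mask])))

rho = np.array([0.5, 0.3, 0.2])   # illustrative prior over k = 3 outcomes
M = 100                           # sample (memory) size
divs = np.array([empirical_kl(rng.multinomial(M, rho), rho)
                 for _ in range(100_000)])

for P in (0.9, 0.99, 0.999):
    print(f"S_KL <= {np.quantile(divs, P):.4f} with probability ~ {P}")
# The bound C grows as P -> 1, but stays finite for every P < 1.
[/code]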

I use probability in quotes because I'm trying to tweak the terminology to define it inductively.

So I'm not doing strict probability theory; I'm trying to apply it inductively and extend the logic a bit, constrained by total complexity. But this is just the first step. The next step is to understand how the predictability can increase if the memory is shared, forming communicating structures. So far this is pretty much classical statistics; the idea is that quantum statistics will emerge in the development, being selected for more predictive power (i.e. fitness).

I will keep expanding this. Your first answer made me see my basic mistake. The semi-probabilistic bound is an even more beautiful solution. But this is only the first step in building the communicating structures. I'll keep thinking.

Thanks again for your comments.

Note: the motivation for this is my opinion that normal probability theory, as applied to QM, really does not make sense from a physical point of view, mainly because the complexity bound is not included. I'm also relating this to the Bekenstein bound. I am investigating a reconstruction of a modified "logic" that respects complexity bounds, so that there is a limit to how large samples you can retain, and thus the Bayesian estimates from them are only subjective probabilities. But not only is the event space discrete and finite; the probability is too, at least in the reconstruction I have in mind.

/Fredrik
 
  • #6
I'd suggest taking a look at network information theory. Engineers in signal processing have been worrying about finite-complexity, finite-memory, finite-precision distributed estimation for some time now. I'm less sure about the connections to physics, but the basic framework you describe is the sort of thing that is studied in the context of sensor networks pretty intensively.

Also, simply throwing out elements that the observer assigns probability zero is not going to help, as it's exactly these elements that give you infinite KL divergence (provided the true probability of these events is nonzero). You may want to look at information theory literature on universal estimation/coding, where some authors have considered the problem of how much probability should be assigned to unobserved events (you expect it to go to zero as you make an infinite number of observations, of course, but it's interesting to consider what to do when you've just started the observation process).
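
As one concrete instance from that literature (a sketch of a standard estimator, not necessarily the scheme you'd end up with): the Krichevsky-Trofimov estimator from universal coding assigns probability (n_i + 1/2)/(N + k/2) to outcome i after N observations, so unobserved outcomes keep a small but nonzero mass and the divergence stays finite.

[code]
import numpy as np

def kt_estimate(counts):
    """Krichevsky-Trofimov (add-1/2) estimate over k outcomes.
    Unseen outcomes get mass 0.5/(N + k/2) > 0, which keeps the
    KL divergence against any true distribution finite."""
    counts = np.asarray(counts, float)
    k, N = counts.size, counts.sum()
    return (counts + 0.5) / (N + k / 2)

print(kt_estimate([7, 3, 0]))  # the unseen third outcome still gets mass
# -> [0.65217391 0.30434783 0.04347826]
[/code]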
 
  • #7
Thanks for the pointers. Maybe I will dig up a book on that and see whether it contains any novel approaches fruitful to my ideas.

quadraphonics said:
I'm less sure about the connections to physics

I guess this is what I am trying to find out. Formally, I have no definitive answer yet, but conceptually and intuitively I see a clear connection.

quadraphonics said:
Also, simply throwing out elements that the observer assigns probability zero is not going to help, as it's exactly these elements that give you infinite KL divergence (provided the true probability of these events is nonzero). You may want to look at information theory literature on universal estimation/coding, where some authors have considered the problem of how much probability should be assigned to unobserved events (you expect it to go to zero as you make an infinite number of observations, of course, but it's interesting to consider what to do when you've just started the observation process).

The questions you raise here are relevant, but I am taking the view of gaming here. How do you rate your own ignorance? IMHO, there is no way to rate your own ignorance by means of a certain measure. I think in the framework of learning and evolution the key is relative progress by gambling, and there is no way I can see for a player to KNOW that he has the optimum strategy; instead his actions can only be based upon incomplete reasoning about this supposed optimum strategy. What is "optimum" is in motion.

I can't describe all my thinking here, but there is a logic behind throwing out the "non-excited" states - they are in a certain way not distinguishable. Also, maintaining "empty bins" means you need to at least store the information about the bin index, and without support that structure isn't favourable.

Also, the "true" probability is in the skies only; I'm arguing for no such backgrounds. The prior I use in the KL measure is not absolute - it's a function of a history window only, and there is a feedback between the prior and the expected frequencies. This means that "at equilibrium" the prior is always updated to minimize the expected divergence, and if unexpected divergences appear, then clearly the prior is updated. Then I imagine an inertia in updating this, relative to incoming information, so under chaotic conditions the prior doesn't stabilise and there are a lot of divergences. Then, unless the memory record is reorganised in time, it will decay and lose confidence in itself.

So the conceptual connection to physics should be clear.

There are also fuzzy conceptual connections to "blackbody radiation", which might be seen as bleeding off excess information; and of course, if the observer is at equilibrium, the excess information should be distributed "as randomly as possible" relative to himself.

The prior structures are also slowly updating, but their complexity implies an inertia. The prior serves as a guide for "decision making", or gambling, in the spirit of minimizing divergence. Here the conceptual step to geometry and action principles is small.

But to start somewhere, I am trying to elaborate some bits one by one.

/Fredrik
 
  • #8
Fra said:
The questions you raise here are relevant, but I am taking the view of gaming here. How do you rate your own ignorance? IMHO, there is no way to rate your own ignorance by means of a certain measure.

Actually, it turns out that there is, at least in certain settings. The usual example that's used to motivate these ideas is of a zoologist that's sent on a mission to a recently-discovered island. His job is to walk around the island, cataloguing all the new animals he sees, and figuring out their relative populations. He has no idea how many different animals there are, or how many total animals live on the island. The big question, then, is "how does he know when he's seen all of the animals?" Equivalently, what probability does he assign to the event "the next animal will be one I haven't seen before"? And it turns out that there is a systematic way to do this that is, in a certain sense, optimal. To get some intuition, he starts out sure that he is completely ignorant: the probability of a new animal is 1. As he walks around and observes animals, his estimate of the probability of a new animal goes down at a certain rate. But he must never assign 0 probability to that event, lest he unexpectedly run across a lion.
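
One classical, concrete version of this (a sketch; the Good-Turing estimate is one standard answer, not necessarily the only "optimal" scheme) estimates the probability that the next animal is a previously unseen species by the fraction of observations that were singletons:

[code]
from collections import Counter

def good_turing_missing_mass(observations):
    """Good-Turing estimate of the probability that the next
    observation is a previously unseen species: N1 / N, where
    N1 is the number of species seen exactly once."""
    counts = Counter(observations)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(observations)

sightings = ["gull", "gull", "crab", "iguana", "gull", "tortoise"]
print(good_turing_missing_mass(sightings))  # 3/6 = 0.5: still quite ignorant
[/code]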

Fra said:
I think in the framework of learning and evolution the key is relative progress by gambling, and there is no way I can see for a player to KNOW that he has the optimum strategy; instead his actions can only be based upon incomplete reasoning about this supposed optimum strategy. What is "optimum" is in motion.

It's true that he has no way of knowing that a particular strategy is the optimum, but he CAN know that his method for picking a strategy based on past data is optimal, at least in certain settings.

Fra said:
I can't describe all my thinking here, but there is a logic behind throwing out the "non-excited" states - they are in a certain way not distinguishable.

True, but that only justifies lumping them together, not throwing them out entirely. You should lump them into a single "previously unseen event" bin; throwing them out amounts to expressing certainty that there are no as-yet-unseen events.
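
A tiny sketch of that lumping (the bin's weight could come, for example, from the Good-Turing estimate above; the names are illustrative):

[code]
def lump_unseen(est_probs, unseen_mass):
    """Rescale the estimated probabilities of the seen events and
    reserve `unseen_mass` for a single 'previously unseen' bin."""
    lumped = {e: p * (1 - unseen_mass) for e, p in est_probs.items()}
    lumped["<unseen>"] = unseen_mass
    return lumped

print(lump_unseen({"gull": 0.5, "crab": 0.5}, unseen_mass=0.1))
# {'gull': 0.45, 'crab': 0.45, '<unseen>': 0.1}
[/code]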

I would also look around for material on nonstationary/adaptive estimation.
 
  • #9
Thanks again for your comments!

quadraphonics said:
Actually, it turns out that there is, at least in certain settings. The usual example that's used to motivate these ideas is of a zoologist that's sent on a mission to a recently-discovered island.
...
And it turns out that there is a systematic way to do this that is, in a certain sense, optimal.

OK, you seem to be talking here about a strategy to "catalogue the unknown", and thus indirectly catalogue your ignorance. I definitely agree that is possible. What I mean is that implementing this strategy is a "process", and during the process you don't know what you don't know. But that doesn't mean we can't guess :) So I think we agree here, but I think the strategy of cataloguing, or learning, may not be unique, and the strategy itself may be changed through feedback from discoveries.

So I then turn my attention to "selection of strategies", because even if a strategy was in some sense optimal, it may no longer be optimal if the environment changes.

So we're looking for something like a natural selection of strategies. Organisms and observers are, I think, manifestations of such strategies. Those strategies that are really bad simply won't be around except transiently. I am looking for a framework for this generic inductive learning.

The problem of establishing what "optimal" learning is, is what I imagine will be the key to the definition of time. The fact that there is always a direction of "progress" is, I think, related to the arrow of time.

quadraphonics said:
It's true that he has no way of knowing that a particular strategy is the optimum, but he CAN know that his method for picking a strategy based on past data is optimal, at least in certain settings.

Yes, it's something like this I am talking about too. But even the statement "optimal based on past data" is not trivial, because in my thinking, at least, the logic of reasoning that decides whether the strategy is optimal based on a given known history is also relative. This is why I picture that, since the logic implicit in the strategy and "guess generation" is also in motion, the logic of reasoning that determines the optimum based on the history is not universal, but rather the logic implicit in history.

The association I make here is that "first line history" is simply the information processed and retained, and the logic of reasoning is the laws of nature - if we associate the logic of reasoning (the logic employed by a player making decisions on incomplete information) with the laws of nature.

So the future is determined by a window of history and the previous state of reasoning.

quadraphonics said:
True, but that only justifies lumping them together, not throwing them out entirely. You should lump them into a single "previously unseen event" bin; throwing them out amounts to expressing certainty that there are no as-yet-unseen events.

Ah, OK. I see what you mean. I caused confusion because I use the word probability in a special way, with a deformed logic. My suggestion is that we can define all the historical probabilities we know, but we can never _know_ that the future relative frequencies will fit them. But we can argue that there is, as per the given logic, a best guess we can make.

So when I suggest that the "probability of unseen events is zero", you are surely right that unseen events can still be seen in the future. I solve that by induction - by introducing a probability for the probability, or a notion of uncertainty in the probability. I guess that is another way, instead of adding a "blind bin" for unexpected events.

So, probability zero does not mean it won't happen; I interpret it so that one should not place bets on such events (*). But if they are thrown in our face, then we have a logic to handle it, because our estimates of bets are, first, never perfect and, second, always in motion.

(*) My idea is that this will also be a key to stability! If all players in the game behave like this, it's easy to see how local stability and structure emerge.

So this all seems to suggest a dynamics.

/Fredrik
 

FAQ: Upper bound for K-L divergence on discrete prob. space

What is an upper bound for K-L divergence on a discrete probability space?

An upper bound for K-L divergence on a discrete probability space is the largest value the divergence between two probability distributions can take under given constraints. It tells us how different two probability distributions can possibly be, which helps in comparing and understanding the similarities and differences between them.

Why is an upper bound for K-L divergence important in scientific research?

An upper bound for K-L divergence is important because it allows us to quantify the differences between probability distributions. This can be useful in various scientific fields, such as statistics, machine learning, and information theory, where understanding the differences between probability distributions is essential.

How is an upper bound for K-L divergence calculated?

There is no universal finite upper bound: the divergence is infinite whenever one distribution assigns positive probability to an outcome the other gives probability zero. Finite upper bounds can be derived under constraints, for example when the reference distribution puts probability at least ε on every outcome (giving the bound ln(1/ε)), or via so-called reverse Pinsker inequalities. (Pinsker's inequality itself provides a lower bound on the divergence in terms of total variation distance, not an upper bound.)

Can an upper bound for K-L divergence be negative?

No, an upper bound for K-L divergence cannot be negative. This is because the K-L divergence itself is always greater than or equal to zero, and the upper bound is the maximum possible value of this measure. Therefore, the upper bound must also be greater than or equal to zero.

How can an upper bound for K-L divergence be used in practical applications?

An upper bound for K-L divergence can be used in various practical applications, such as comparing the performance of different machine learning algorithms, analyzing the differences between genetic sequences in biology, or evaluating the accuracy of climate models. It can also help in data compression and information theory, where minimizing the K-L divergence between distributions is important.
