Optimizing Machine Learning with Irreducible Factorizations of Random Variables

  • #1
mXSCNT
"Factorizing" random variables

Suppose we have a (discrete) random variable X. Call a random variable Y "equivalent" to X if there are functions f, g such that X = f(Y) and Y = g(X). Among other things, this implies H(Y) = H(X).

Y = Y1 x Y2 x ... x Yn, where x is the cartesian product, is an "independent factorization" of X if Y is equivalent to X and the Yi are mutually independent random variables (the factors).

So for example, if X is uniformly distributed over {0,1,2,3}, then a factorization is Y = B1 x B2 where the Bi are uniformly distributed over {0,1} and mutually independent. f(x) = (x mod 2, floor(x / 2)), and g((b1,b2)) = b1 + 2*b2.
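
A quick Python sanity check of this example (a minimal sketch; f and g are exactly the maps above):
Code:
from itertools import product

# X uniform over {0,1,2,3}, factored into two independent bits.
def f(x):
    return (x % 2, x // 2)   # X -> (B1, B2)

def g(b1, b2):
    return b1 + 2 * b2       # (B1, B2) -> X

# f and g are mutually inverse on the support of X ...
assert all(g(*f(x)) == x for x in range(4))

# ... and B1, B2 are independent: P(B1=u, B2=v) = P(B1=u) * P(B2=v).
joint = {f(x): 0.25 for x in range(4)}
pb1 = {u: sum(p for (b1, _), p in joint.items() if b1 == u) for u in (0, 1)}
pb2 = {v: sum(p for (_, b2), p in joint.items() if b2 == v) for v in (0, 1)}
assert all(abs(joint[(u, v)] - pb1[u] * pb2[v]) < 1e-12
           for u, v in product((0, 1), repeat=2))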

In this context, a random variable Y is "one" if H(Y) = 0, that is if Y takes one value with probability 1. A random variable Y is irreducible if it has no factors other than one and itself.

A factorization of a random variable is irreducible if all factors in the factorization are irreducible, and are not one.

The problem is then, given a discrete random variable, what are its factorizations? Are irreducible factorizations of discrete random variables unique up to order of the factors and equivalence (as defined above) of the factors? Are irreducible elements also prime?
 
  • #2


Note you can define a "divides" relation; A divides B (written A | B) if there is a factorization B ~ AxC.

The usual approach in this sort of situation is to define two kinds of elements: irreducible elements, and prime elements.

C is irreducible if and only if A | C implies A ~ C or A ~ 1.
C is prime if and only if AxB | C implies A | C or B | C.

Typically, one can easily show that irreducible factorizations exist*, and that prime factorizations are unique. If you can show that irreducible objects are actually prime, then happiness ensues.


*: This may require some sort of finiteness condition
 
  • #3


Well, it's clear that irreducible factorizations exist. Let |W| be the cardinality of the domain of a random variable W. We know
[tex]
|X| = |Y| = \prod_{i=1}^n |Y_i|
[/tex]
So you can't keep factorizing the factors forever: each |Yi| is a whole number, and it strictly decreases with each nontrivial factorization.

The definition of primes is A | BC --> A | B or A | C. I think that not all irreducible R.V.'s are prime, because if we multiply (by cartesian product) two arbitrary discrete random variables, they might not be independent. I'm having a hard time thinking of a counterexample at the moment, though; I need some paper.
 
  • #4


The simplest case for factorization is where the random variable is uniformly distributed. In that case it factorizes like the integers, into factors whose domains have prime cardinality. If the random variable is not uniformly distributed, this isn't necessarily true; for example suppose X is distributed over {0,1,2,3} where P(X=x) = 1/5 if x <= 2, and P(X=3) = 2/5. Then X is irreducible despite the fact that |X| = 4.
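
To check the irreducibility claim concretely: a nontrivial factorization of a 4-element variable would have to be 2x2, which means arranging the four probabilities into an outer product p_i * q_j, and any 2x2 outer product satisfies p00*p11 = p01*p10. A brute-force sketch:
Code:
from fractions import Fraction
from itertools import permutations

# P(X) = (1/5, 1/5, 1/5, 2/5).  A factorization into two binary factors
# would arrange these four probabilities into a rank-1 2x2 grid
# (an outer product), which requires p00 * p11 == p01 * p10.
probs = [Fraction(1, 5), Fraction(1, 5), Fraction(1, 5), Fraction(2, 5)]
rank_one = any(p00 * p11 == p01 * p10
               for p00, p01, p10, p11 in permutations(probs))
print(rank_one)  # False: no 2x2 arrangement works, so X is irreducible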

The simplest case for multiplication (cartesian product) is where both random variables are independent. If they aren't then you need their joint distribution to compute the product.
 
  • #5


mXSCNT said:
The definition of primes is A | BC --> A | B or A | C.
Blah, you're right. I always find that definition annoying, because different ways of looking at the situation often reverse the direction of the relation. :frown:
 
  • #6


mXSCNT said:
The simplest case for multiplication (cartesian product) is where both random variables are independent. If they aren't then you need their joint distribution to compute the product.
But then, it isn't an independent factorization -- I think maybe you want to restrict attention to independent products, for this problem? (Or, at least, consider that first)
 
  • #7


Okay, I have a counterexample.
Let X be distributed over {0,1}, let Y be distributed over {0,1,2,3}. The joint distribution of X and Y is
Code:
      | Y=0   Y=1   Y=2   Y=3
------+------------------------
X = 0 | 1/8   1/8   1/4   0
X = 1 | 1/8   1/8   0     1/4
Let f((x,y)) = y when y <= 1, and let f((x,y)) = 2 when y >= 2. Let g((x,y)) = x. Then let A = f(X x Y) and B = g(X x Y). Then X x Y can be factored as A x B.

|A| = 3 so A is irreducible, but since |X| and |Y| are not divisible by 3, A cannot be a factor of X or Y. Thus, there are R.V.'s that are irreducible but not prime.
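
A quick check of the counterexample (a sketch; the dictionaries just encode the table and the maps f, g above):
Code:
from fractions import Fraction as F

# Joint distribution of (X, Y), zero-probability cells omitted.
joint = {(0, 0): F(1, 8), (0, 1): F(1, 8), (0, 2): F(1, 4),
         (1, 0): F(1, 8), (1, 1): F(1, 8), (1, 3): F(1, 4)}

f = lambda x, y: y if y <= 1 else 2   # A = f(X, Y)
g = lambda x, y: x                    # B = g(X, Y)

# Distribution of (A, B) and its marginals.
ab = {}
for (x, y), p in joint.items():
    key = (f(x, y), g(x, y))
    ab[key] = ab.get(key, F(0)) + p
pa = {a: sum(p for (i, _), p in ab.items() if i == a) for a in (0, 1, 2)}
pb = {b: sum(p for (_, j), p in ab.items() if j == b) for b in (0, 1)}

# A and B are independent ...
assert all(ab[(a, b)] == pa[a] * pb[b] for (a, b) in ab)
# ... and (A, B) determines (X, Y): the 6 positive-probability outcomes
# map to 6 distinct (a, b) pairs, so A x B ~ X x Y.
assert len(ab) == len(joint)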

Hurkyl said:
But then, it isn't an independent factorization -- I think maybe you want to restrict attention to independent products, for this problem? (Or, at least, consider that first)
In an independent factorization of X, the factors are only independent from each other; they are not independent from X. Anyway, X isn't independent from itself, so this is unavoidable. In fact, for all X, X x X ~ X.
 
  • #8


In an application (e.g. machine learning, or perhaps statistical analysis), suppose we are interested in splitting up a random variable into independent parts that we can then analyze separately. In an empirically measured data set, I think it's unlikely that a random variable will exactly factorize into independent random variables; I would expect most empirical random variables to be irreducible, since slight departures from the uniform distribution can spoil a factorization. So one might consider a "near" factorization of a random variable, as follows.

Before, I had X ~ Y if there are functions f, g such that f(X) = Y and g(Y) = X. Now let's introduce a "less than or equal" relationship: X <= Y if there is a function f such that f(Y) = X. This has the desired property that X <= Y and Y <= X implies X ~ Y.

A "hidden variables" independent factorization of a random variable X is then a product Y = Y1 x Y2 x ... x Yn where the Yi are mutually independent, and X <= Y. The quality of the factorization can be measured by H(Y), which we'd like to minimize. In the best case, H(Y) = H(X) and the near factorization is an exact factorization. I say "hidden variables" because the excess uncertainty in Y means Yi is not only a function of X (as it is in an exact factorization); Yi = fi(X,Wi) where fi is some function and Wi is independent from X. The Wi are the hidden variables.

A "nearly independent" factorization of a random variable X is again a product Y = Y1 x Y2 x ... x Yn where X ~ Y and the Yi are not necessarily mutually independent. The quality of a nearly independent factorization should be measured by how close to being independent the Yi are. One way to measure this is by total correlation; let the quantity to minimize be [tex]C(Y) = \sum_{i=1}^n H(Y_i) - H(Y)[/tex].
 
  • #9


I still don't know if exact irreducible independent factorizations are unique.

Per your suggestion, one could extend the concept of a prime to an "independent prime." Say that A i| B (where i| means independently divides) if there is a random variable C, independent from A, where AC ~ B. A random variable A is an independent prime if for every X and Y, where X and Y are independent, A i| XY implies that A i| X or A i| Y. Perhaps if a random variable is independently irreducible, it is also an independent prime.
 
  • #10


I have a counterexample to unique factorization.

Choose two positive real numbers a and b satisfying

[tex]\frac{a^6 - b^6}{a - b} = a^4[/tex]

(This ensures that the probabilities of each of the variables below sum to 1.)
(e.g. a = 0.4 and b ~ 0.249400)
(one exact solution is a = 1/63 and b = 2/63)


Let S and T be independent random variables satisfying:

[tex]P(S = 0) = \frac{a^3}{a^3 + b^3}[/tex]

[tex]P(S = 1) = \frac{b^3}{a^3 + b^3}[/tex]

[tex]P(T = 0) = (a^3 + b^3)\frac{1}{a^2}[/tex]

[tex]P(T = 1) = (a^3 + b^3)\frac{b}{a^3}[/tex]

[tex]P(T = 2) = (a^3 + b^3)\frac{b^2}{a^4}[/tex]

The frequencies of the 6 outcomes of SxT are, in no particular order,

[tex]a, b, \frac{b^2}{a}, \frac{b^3}{a^2}, \frac{b^4}{a^3}, \frac{b^5}{a^4}[/tex]

Now, consider independent random variables U and V given by

[tex]P(U = 0) = \frac{a}{a + b}[/tex]

[tex]P(U = 1) = \frac{b}{a + b}[/tex]

[tex]P(V = 0) = (a + b)[/tex]

[tex]P(V = 1) = (a + b) \frac{b^2}{a^2}[/tex]

[tex]P(V = 2) = (a + b) \frac{b^4}{a^4}[/tex]

The 6 frequencies of the outcomes of UxV are, in no particular order,

[tex]a, b, \frac{b^2}{a}, \frac{b^3}{a^2}, \frac{b^4}{a^3}, \frac{b^5}{a^4}[/tex]



Both frequency lists are the same; so we should be able to arrange things so that SxT and UxV are equivalent random variables. However, the factorizations are distinct, because S and U are not equivalent. (And S is clearly not equivalent to V)
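
Numerically, with the exact solution a = 1/63, b = 2/63, the whole check goes through (a sketch):
Code:
from fractions import Fraction as F

a, b = F(1, 63), F(2, 63)
assert (a**6 - b**6) / (a - b) == a**4   # the constraint holds exactly

S = [a**3 / (a**3 + b**3), b**3 / (a**3 + b**3)]
T = [(a**3 + b**3) / a**2, (a**3 + b**3) * b / a**3, (a**3 + b**3) * b**2 / a**4]
U = [a / (a + b), b / (a + b)]
V = [a + b, (a + b) * b**2 / a**2, (a + b) * b**4 / a**4]
assert sum(S) == sum(T) == sum(U) == sum(V) == 1

st = sorted(s * t for s in S for t in T)
uv = sorted(u * v for u in U for v in V)
assert st == uv                  # same six frequencies, so SxT ~ UxV
assert sorted(S) != sorted(U)    # but S is not equivalent to U
print(st)  # [1/63, 2/63, 4/63, 8/63, 16/63, 32/63]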
 
  • #11


Nicely done. The usefulness of factorizing random variables, in my estimation, lies in machine learning. Suppose that we have a random variable X which is sampled according to some machine learning algorithm, such that the long-term probability of visiting a given state [tex]x \in X[/tex] increases monotonically with a fitness function f(x). This is the case, for example, in a Boltzmann machine, where the fitness function is inversely related to the energy of a state of the machine. It is likely also roughly true for other hill climbing algorithms, such as genetic algorithms or gradient ascent. We seek to maximize f, which we can do by finding the maximum likelihood value of X.

Now, if we can decompose X = (Y1,Y2) where Y1 and Y2 are independent of each other, we have effectively broken the problem down into two easier (smaller) subproblems. If we can find maximum likelihood values [tex]y_1 \in Y_1, y_2 \in Y_2[/tex], then the maximum likelihood value of X, and hence the maximum of f, occurs at (y1, y2).
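
A toy sketch of why this helps (the distributions here are arbitrary): with independent factors, the joint mode is just the pair of marginal modes, so two small searches replace one big one.
Code:
# If P(X = (y1, y2)) = P1(y1) * P2(y2), then
# argmax P(X) = (argmax P1, argmax P2).
p1 = {0: 0.2, 1: 0.5, 2: 0.3}
p2 = {0: 0.6, 1: 0.4}

joint_mode = max(((y1, y2) for y1 in p1 for y2 in p2),
                 key=lambda t: p1[t[0]] * p2[t[1]])
factor_mode = (max(p1, key=p1.get), max(p2, key=p2.get))
assert joint_mode == factor_mode == (1, 0)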

Under real-life conditions, an exact factorization is probably impossible in most cases, but an approximate factorization should have much the same effect of splitting the problem into subproblems that can be optimized individually to get "in the ballpark," then recombined and fine-tuned as a whole.

So what I really want is an algorithm to find approximate independent factorizations of random variables whose form is unknown, assuming only that we can sample from them.

Concrete example: suppose we want to use a genetic algorithm to develop a set of rules for steering a toy car to avoid obstacles. Thus, we have a fitness function (ability to avoid obstacles), a random variable over a set of states (the set of rules), and a means of sampling the random variable such that the probability of a state increases with the fitness function for that state (the genetic algorithm). We represent a given set of rules as a string of n bits. The string of bits can be viewed as a factorization of the set of rules into a cartesian product of n random variables, each of 1 bit. This is not an independent factorization. We seek a way to represent the rules as a string of bits such that different regions in the string are nearly independent, meaning that changes in one part of the string do not force changes in other parts. If we can achieve this, learning will become much faster.
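
One possible starting point, sketched below: estimate the pairwise mutual information between bit positions from samples, and group strongly coupled positions into the same factor. (Pairwise independence does not guarantee joint independence, so this is only a heuristic; the toy sampler at the bottom is made up.)
Code:
import math
import random
from collections import Counter

def pairwise_mi(samples, i, j):
    # Estimated mutual information (bits) between bit positions i and j.
    n = len(samples)
    pij = Counter((s[i], s[j]) for s in samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    return sum((c / n) * math.log2(c * n / (pi[u] * pj[v]))
               for (u, v), c in pij.items())

# Toy sampler: bits 0 and 1 copy each other; bit 2 is independent.
samples = []
for _ in range(10000):
    x = random.randint(0, 1)
    samples.append((x, x, random.randint(0, 1)))
print(pairwise_mi(samples, 0, 1))  # close to 1 bit: same factor
print(pairwise_mi(samples, 0, 2))  # close to 0: separable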
 

