Optimal Discrete Sampling of Histogram

In summary, the objective function is not affected by what the bins represent, but the constraints must be satisfied to generate a sample of size ##n##.
  • #1
Jarvis323
I am wondering whether this problem has a name, and what the most efficient way to solve it is. Say you have a normalized histogram ##h(P)## (representing a pdf estimated from a large population) with ##n## bins, and you want to generate a sample of points ##S## of size ##k## from ##h(P)## such that $$\sum_{i=0}^{n-1}|h(S)_i - h(P)_i|$$ is minimized, where ##h(S)## is the normalized histogram of the sample. Note that since the histogram is discrete, it is enough to select a count ##c(i)## for each bin such that $$\sum_{i=0}^{n-1}c(i) = k$$ and the sum above is minimal.
 
  • #2
How is ##c(i)## defined? Is it the number of elements selected that have a value equal to the value represented by the ##i##th bin? Is ##c(i) = k\,h(P)_i##?
 
  • #3
##c(i)## is just the number of elements you choose to generate in bin ##i##.

Another way of thinking about it is that you have ##k## apples to give out and ##n## people. Each person has a score ##p \in \mathbb{R}##, which says how deserving they are of apples. You want to give out the apples as fairly as possible according to ##p##. Since ##p## is continuous and you have discrete apples, some people may deserve slightly more than they get, or get more than they deserve.

I guess it seems somewhat simple, and I have an idea for doing it efficiently: give everyone the number of apples they wholly deserve, then sort the people by how much apple they are still owed (largest first), and give the ##m## leftover apples to the first ##m## people.

But it would be nice if you could do it without having to do the sorting.
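
The procedure described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `allocate` and the example scores are hypothetical, the scores are assumed to sum to 1, and `heapq.nlargest` is swapped in for the full sort so that only the top ##m## remainders are selected.

```python
import heapq
import math

def allocate(h, k):
    """Split k items across bins in proportion to h (assumed to sum to 1).

    Hypothetical helper for illustration; not taken from the thread.
    """
    quotas = [k * p for p in h]               # ideal, fractional counts
    counts = [math.floor(q) for q in quotas]  # whole items each bin is owed
    remainders = [q - c for q, c in zip(quotas, counts)]
    leftover = k - sum(counts)                # the m items still to hand out
    # Select the m bins with the largest remainders without a full sort.
    top = heapq.nlargest(leftover, range(len(h)), key=remainders.__getitem__)
    for i in top:
        counts[i] += 1
    return counts

print(allocate([0.5, 0.3, 0.2], 7))  # quotas are 3.5, 2.1, 1.4
```

Because the quotas are computed in floating point, a quota that should be a whole number can land just below it; exact arithmetic (for example `fractions.Fraction`) avoids that if it matters.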
 
  • #4
Jarvis323 said:
##c(i)## is just the number of elements you choose to generate in bin ##i##.

I think what you are asking is illustrated by the following problem: Suppose there are 3 bins with ##h(P)_0 = 0.4, h(P)_1 = 0.4, h(P)_2 = 0.2##, and we want to generate a sample of size 11. We can't pick values of the ##c(i)## to make ##\sum_{i=0}^2 |h(S)_i - h(P)_i|## exactly zero. So what are the ways of assigning values to the ##c(i)## that minimize that sum?
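
For an instance this small, the question can be checked by brute force. The sketch below assumes ##h(S)_i = c(i)/11## and uses exact fractions so that ties are detected reliably; the names `objective`, `candidates`, and `minimizers` are just illustrative.

```python
from fractions import Fraction
from itertools import product

h = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]  # 0.4, 0.4, 0.2
k = 11                                                # sample size

def objective(counts):
    """Sum of |h(S)_i - h(P)_i|, assuming h(S)_i = counts[i] / k."""
    return sum(abs(Fraction(c, k) - p) for c, p in zip(counts, h))

# Enumerate every (c0, c1, c2) of non-negative integers summing to k.
candidates = [
    (c0, c1, k - c0 - c1)
    for c0, c1 in product(range(k + 1), repeat=2)
    if c0 + c1 <= k
]
best = min(objective(c) for c in candidates)
minimizers = [c for c in candidates if objective(c) == best]
print(minimizers, best)  # [(4, 5, 2), (5, 4, 2)] 6/55
```

The two minimizers differ only in which of the two equal-weight bins receives the extra element.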

I am wondering if this problem has a name,
Problems of this general type are called "integer programming" problems. The function ##C(c_0, c_1, \dots, c_m) = \sum_{i=0}^{m} |h(S)_i - h(P)_i|## is the "objective function". The constraints are that each ##c_i## is an integer with ## 0 \le c_i \le n ## and that ##\sum_{i=0}^m c_i = n##, where ##n## represents the size of the sample to be generated (##k## in the original post).
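
Written out as a single program, and assuming the sample histogram is ##h(S)_i = c_i/n## (an assumption; the thread does not state it explicitly), this reads:

$$\min_{c_0,\dots,c_m \in \mathbb{Z}} \; \sum_{i=0}^{m} \left| \frac{c_i}{n} - h(P)_i \right| \quad \text{subject to} \quad \sum_{i=0}^{m} c_i = n, \quad 0 \le c_i \le n .$$

Multiplying the objective by the constant ##n## does not change the minimizers, so it is equivalent to minimizing ##\sum_{i=0}^{m} |c_i - n\,h(P)_i|##, the total deviation of the integer counts from the fractional counts ##n\,h(P)_i##.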

I don't know what, if any, name is given to a problem with such an objective function. Your objective function is unusual because it does not depend on what the bins represent, but only on what fraction of the population they contain. A particular example of this problem may have more than one solution because two different bins might each contain the same fraction of the population and the distinction in what the two bins represent doesn't affect the objective function.

It's worth noting that if you generate a sample of size ##n## by minimizing the objective function, you are not generating ##n## independent random samples from the population, so theorems in statistics that apply to random sampling do not apply to generating samples in this manner.
 

FAQ: Optimal Discrete Sampling of Histogram

What is "Optimal Discrete Sampling of Histogram"?

"Optimal Discrete Sampling of Histogram" is a technique used in data analysis to determine the best way to group and sample data in a histogram. It involves finding the ideal number of bins and bin width to accurately represent the data distribution.

Why is "Optimal Discrete Sampling of Histogram" important in data analysis?

The optimal discrete sampling of a histogram allows for a more accurate representation of the data distribution, which can lead to better insights and conclusions in data analysis. It also helps to avoid the potential bias and misinterpretation that can occur with an improperly sampled histogram.

How is the optimal number of bins determined in "Optimal Discrete Sampling of Histogram"?

The optimal number of bins is typically determined using mathematical formulas or algorithms, such as the Freedman-Diaconis rule or the Sturges formula. These methods take into account the sample size and range of the data to find the ideal number of bins for a given dataset.

What is the purpose of finding the optimal bin width in "Optimal Discrete Sampling of Histogram"?

The optimal bin width helps to create a balanced and informative histogram by ensuring that each bin contains a sufficient number of data points. This helps to avoid oversimplification or overcomplication of the data distribution in the histogram.

How can "Optimal Discrete Sampling of Histogram" be applied in real-world situations?

"Optimal Discrete Sampling of Histogram" can be applied in various fields, such as market research, scientific experiments, and quality control. It can help to accurately and objectively analyze data and make informed decisions based on the data distribution.
