95% Confidence of weighted average

In summary: the weighted average of the per-quantity means is a linear combination of (approximately) normal estimators and is therefore itself approximately normal; its variance follows from Var[aX + bY] = a^2 Var[X] + b^2 Var[Y] + 2ab Cov[X,Y], and since the variances must be estimated from the data, the confidence interval is built from a Student t critical value.
  • #1
CompuChip
Science Advisor
Homework Helper
Hi all,

It's been a while since I have asked a question here, but statistics has never been my forte. I have the feeling that although I know the definitions I do not completely grasp the concept of confidence intervals. Unfortunately I do need to come up with something sensible here.

The situation is that I'm performing n experiments, and for each experiment I'm measuring m values. Step 1 is, for every quantity, to calculate the average over all experiments and provide a 95% confidence interval. So far so good: I have some nice code that will give me a two-sided student t-value which I can use to construct the confidence interval.
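In case it's useful, that per-quantity step looks roughly like the following (a minimal sketch in Python with NumPy/SciPy; the function and variable names are just illustrative, not my actual code):

[code]
import numpy as np
from scipy import stats

def t_confidence_interval(samples, confidence=0.95):
    """Mean and two-sided t-based CI half-width for one quantity.

    `samples` holds the n values of a single quantity, one per experiment.
    """
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    mean = samples.mean()
    # Standard error of the mean, using the unbiased sample std (ddof=1).
    sem = samples.std(ddof=1) / np.sqrt(n)
    # Two-sided critical value: for 95% confidence this is the 0.975 quantile.
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, t_crit * sem  # CI is [mean - half_width, mean + half_width]
[/code]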

Now the tricky bit, for me, is that I also need to take a weighted average of these averages. The question is how to calculate a statistically sensible confidence interval on this average.

So to summarize with symbols: I have nm quantities ##q_{ij}## (##i = 1, \cdots, n##; ##j = 1, \cdots, m##). I have calculated ##\bar{x}_j = \frac{1}{n} \sum_{i = 1}^n q_{ij}## with the corresponding 95% CI ##[\bar{x}_j - \Delta x_j, \bar{x}_j + \Delta x_j]##.
Now I wish to calculate the weighted average ##\mu = \sum_{j = 1}^m w_j \bar{x}_j## (if you want, you may assume the wj sum to 1) and would like to know how I can construct the CI for this, either from the ##\Delta x_j## or from ##q_{ij}## directly.

If they were standard deviations I would expect something like ##\sigma^2 = \frac{1}{n} \sum_{j = 1}^m \Delta x_j^2## but I don't think it works that way for confidence intervals.

[edit]Let's also assume independence where needed, I will worry about that after I get an initial idea.[/edit]
 
Last edited:
  • #2
Hey CompuChip.

In statistics, this area is typically known as meta-analysis. You have a few options.

If you know the distribution (or assume it to be something) then you can just use E[aX + bY] = aE[X] + bE[Y]. For the variance, this is a bit more complicated since Var[X + Y] = Var[X] + Var[Y] + 2*Cov[X,Y]. If everything is independent, the covariance term is zero, but if not then it will affect your confidence intervals.

If you are testing inferences with respect to means, and all the individual group mean estimators are roughly normal, then you can use the fact that a linear combination of normals is also normal. If there is covariance, your covariance matrix will have some off-diagonal entries; if there is none, the cov(X,Y) terms are zero.

In the case of independence, and you assume all estimators are roughly normal with some mean and some variance then use:

E[aX + bY] = aE[X] + bE[Y]
Var[aX + bY] = a^2Var[X] + b^2Var[Y]

and do recursive applications to get your final estimator of a sum of weighted estimators of a mean. If there is reason to believe that covariance terms are non-zero, you need to factor this in because if you don't, your estimators (and confidence intervals) will either be way too narrow or way too wide.
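As a sketch of what those recursive applications collapse to under independence (assuming Python/NumPy; all names are illustrative):

[code]
import numpy as np

def combine_independent(means, variances, weights):
    """Mean and variance of sum_k w_k X_k for independent estimators.

    Repeated application of E[aX + bY] = aE[X] + bE[Y] and
    Var[aX + bY] = a^2 Var[X] + b^2 Var[Y] collapses to these two sums.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    combined_mean = np.sum(weights * means)
    combined_var = np.sum(weights**2 * variances)  # no cross terms: independence
    return combined_mean, combined_var
[/code]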
 
  • #3
CompuChip said:
The situation is that I'm performing n experiments, and for each experiment I'm measuring m values.

You have to decide whether there is some systematic effect that varies from experiment to experiment (for example, the temperature of the laboratory). If such an effect is possible then the simplest and safest thing to do is compute the weighted sum of the m values in each of the n experiments. Treat these n weighted sums as n measurements. Find the sample mean of those n measurements and state a confidence interval for it.
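A minimal sketch of that procedure, assuming Python with NumPy/SciPy (the array names are hypothetical):

[code]
import numpy as np
from scipy import stats

def per_experiment_ci(q, weights, confidence=0.95):
    """Collapse each experiment to a single weighted sum, then do a t CI.

    `q` is an (n, m) array: n experiments, m values measured per experiment.
    `weights` has length m. Both names are hypothetical.
    """
    q = np.asarray(q, dtype=float)
    sums = q @ np.asarray(weights, dtype=float)  # one weighted sum per experiment
    n = len(sums)
    mean = sums.mean()
    sem = sums.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, t_crit * sem  # sample mean and CI half-width
[/code]

One appeal of this route is that any covariance between the m quantities within an experiment is automatically absorbed into the spread of the n weighted sums, so independence across quantities never has to be assumed.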
 
  • #4
Hi chiro and Stephen. Thanks for your replies. I think I have worked it out, hopefully I can run it by you to check that I got it right. It's a pretty long post (again) I'm afraid but I have two questions at the end that I would very much appreciate your having a look at.

So let me start from the basics: I am doing K experiments, and each of those is repeated R times giving values ##x_{k,r}##. Stephen, the "experiments" actually consist of metrics calculated on a computer simulation, so there is no physical laboratory involved, but what you write is what I had in mind.

I will assume that the per-experiment mean ##X_k## is normally distributed with mean ##\mu_k## and standard deviation ##\sigma_k##. Under this assumption I can estimate these parameters from the sample data as
$$\overline{X}_k = \frac{1}{R} \sum_{r = 1}^R x_{k,r}, \qquad S_k^2 = \frac{1}{R - 1} \sum_{r = 1}^R (x_{k,r} - \overline{X}_k )^2.$$
Then the variable ##T_k = \frac{X_k - \overline{X}_k}{S_k / \sqrt{R}}## follows a Student's t-distribution with ##(R-1)## degrees of freedom, which for every experiment ##k## leads to a confidence interval at level ##\alpha## (e.g. ##\alpha = 0.95##)
$$\overline{X}_k \pm t_{\tfrac{1 + \alpha}{2},\, R - 1}\, \frac{S_k}{\sqrt{R}}.$$

So far, so good?

Now I would like to define
$$X = \sum_{k = 1}^K w_k X_k = \sum_{k = 1}^K \sum_{r = 1}^R \frac{1}{R} w_k x_{k, r}$$
where I assume that ##\sum w_k = 1## and ##w_k > 0## for all ##k = 1, \ldots, K##.

Since ##X## is a linear combination of normally distributed variables, it is itself normally distributed with mean ##\mu## and standard deviation ##\sigma## given by
$$\mu = \sum_{k = 1}^K w_k \mu_k, \qquad \sigma^2 = \sum_{k = 1}^K \sum_{k' = 1}^K w_k w_{k'} \operatorname{cov}(X_k, X_{k'})$$
respectively. Here ##\operatorname{cov}(X, Y)## indicates the covariance between ##X## and ##Y##, which reduces to the variance of ##X## when ##X = Y##.

The estimators for the normal parameters in this case are
$$M = \sum_{k = 1}^K w_k \overline{X}_k, \qquad S^2 = \sum_{k = 1}^K \sum_{k' = 1}^K w_k w_{k'} q_{kk'}$$
where
$$q_{kk'} = \frac{1}{R - 1} \sum_{r = 1}^R (x_{k,r} - \overline{X}_k)(x_{k',r} - \overline{X}_{k'})$$
is the sample covariance between experiments ##k## and ##k'## (so that ##q_{kk} = S_k^2##).

Now if ##T \equiv \frac{X - M}{S / \sqrt{K}}## follows a Student's t-distribution, then as before the confidence interval at level ##\alpha## (e.g. ##\alpha = 0.95##) will be
$$M \pm t_{\tfrac{1 + \alpha}{2},\, K - 1}\, \frac{S}{\sqrt{K}}.$$
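In code, the whole construction would look something like this (a sketch assuming Python with NumPy/SciPy; it implements the formulas above exactly as written, including the ##\sqrt{K}## that I'm not sure about):

[code]
import numpy as np
from scipy import stats

def weighted_ci(x, w, confidence=0.95):
    """CI for the weighted average, via the sample covariance matrix.

    `x` is a (K, R) array: K experiments, each repeated R times.
    `w` has length K and sums to 1. Names are illustrative.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    K, R = x.shape
    xbar = x.mean(axis=1)  # per-experiment means, length K
    M = w @ xbar           # weighted average
    q = np.cov(x)          # (K, K) sample covariance matrix, 1/(R-1) normalized
    S2 = w @ q @ w         # S^2 = sum over k, k' of w_k w_k' q_{kk'}
    t_crit = stats.t.ppf((1 + confidence) / 2, df=K - 1)
    return M, t_crit * np.sqrt(S2) / np.sqrt(K)  # the sqrt(K) follows the post
[/code]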

Two outstanding questions I have, apart from a request for general bashing of this approach:
(1) I get the feeling I am missing some prefactor like ##1/(K - 1)## or ##1/\left(1 - \sum w_k^2 \right)## in my expression for the combined variance ##\sigma^2##. However, when using the bilinearity of variance to inductively derive what ##\sigma(X_1 + \ldots + X_n)## should look like, it doesn't appear.
(2) How valid is the assumption about the distribution of ##T## in the last paragraph?

Thanks again!
 
  • #5
Is there any reason why you can't use a normal approximation? If you have enough observations for each mean term ##X_k##, then you might as well use the normal approximation, since sums are then easy to handle (linear combinations of jointly normal variables are again normal).
 
  • #6
Although I usually expect R to be pretty large, in the sample data I've been given it is 4 (so every quantity is measured 4 times). This gives a 6% deviation between the Student t and normal approaches for every experiment separately, but when I combine them I get a discrepancy of about 25% in the calculated overall confidence interval. So it kinda matters which one I choose here :-)
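In case anyone wants to check the two critical values themselves, it's a one-off computation (assuming SciPy):

[code]
from scipy import stats

# Two-sided 95% critical values for R = 4 repetitions, i.e. 3 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=3)   # Student t
z_crit = stats.norm.ppf(0.975)      # normal approximation
print(t_crit, z_crit, t_crit / z_crit - 1)  # relative difference between the two
[/code]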
 
  • #7
Actually, I found out I need the Student t, because the 6% I mentioned was an error: that discrepancy is also in the 25% range.

I currently have an answer which "looks about right" based on my post #4 - still have no idea how much statistical sense it makes though :-)
 
  • #8
CompuChip said:
I will assume that the per-experiment mean ##X_k## is normally distributed with mean ##\mu_k## and standard deviation ##\sigma_k##.

That's not very controversial, but there are some specialized fields (like portfolio optimization) where different distributions are fashionable.

CompuChip said:
However when using the bilinearity of variance to inductively derive what ##\sigma(X_1 + \ldots + X_n)## should look like, it doesn't appear.

Don't confuse [itex] \sigma^2 [/itex] with an estimator for [itex] \sigma^2 [/itex]. It wouldn't surprise me if the estimator also has a bilinearity property, but we should do the algebra to check.

CompuChip said:
(2) How valid is the assumption about the distribution of ##T## in the last paragraph?

It's valid provided we show that the estimator you used for the weighted sum gives the same result as defining, for the m-th experiment, the result [itex] Y_m = \sum w_i X_i [/itex] (for the [itex] X_i [/itex] involved in that experiment) and computing the estimator for the variance of [itex] Y [/itex] directly.

(It isn't really correct to speak of "the" estimator of the variance since several different estimators are possible. The one that's part of the t-statistic is the one you correctly used.)
 

FAQ: 95% Confidence of weighted average

What is meant by "95% Confidence of weighted average"?

A 95% confidence interval for a weighted average is a statistical range within which the true average of a population is likely to fall. It takes into account the variability of the data and attaches a margin of error to the estimated average.

How is the confidence level of a weighted average determined?

The confidence level itself is chosen by the analyst (95% is conventional); what the sample size and the variability of the data determine is the width of the resulting interval. A larger sample size and lower variability produce a narrower interval, while a smaller sample size and higher variability produce a wider one.

What does a 95% confidence level mean in terms of accuracy?

A 95% confidence level means that if the same experiment or study were repeated numerous times, 95% of the intervals constructed this way would contain the true average of the population. It is a statement about the reliability of the procedure, not a 95% probability that any single computed interval is correct.

How is a weighted average calculated?

A weighted average is calculated by multiplying each value in a dataset by its corresponding weight, summing these values, and dividing by the total weight of the dataset. This method takes into account the importance or significance of each value in the dataset.
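For example, the values 2 and 4 with weights 1 and 3 give a weighted average of (1×2 + 3×4)/(1 + 3) = 14/4 = 3.5, compared with a simple average of 3.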

What is the significance of using a weighted average instead of a simple average?

A weighted average is used when some values in the dataset are more important or have a greater impact on the overall average. It gives these values more weight in the calculation, resulting in a more accurate representation of the data. A simple average gives equal weight to all values, which may not accurately reflect the true average.
