How far and how close to p=0.05 for statistical significance?

  • Thread starter fog37
  • Tags
    P-value
  • #1
fog37
TL;DR Summary
How far and how close to p=0.05 for statistical significance...
Hello Forum,

I understand what the p value represents and how it is calculated in a statistical hypothesis test. In general, the p-value threshold is set to 0.05, i.e. 5%, which means that the null hypothesis is rejected 5 times out of 100 even when it is true. Or that the sample statistics, assuming the null hypothesis is true, are extremely rare (if p < 0.05), leading us to reject H0...

What if our p value is just 0.057? Do we keep H0? What if p was 0.049? Would we reject H0? I guess I am asking how far the calculated p-value must be from the 0.05 threshold for the results to be either statistically significant or not...
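For example, here is the kind of calculation I have in mind, a minimal sketch in Python with made-up data (a one-sample t-test against a null mean of 100):

Python:
import numpy as np
from scipy import stats

# Made-up sample, for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=103.5, scale=10, size=30)

# One-sample t-test of H0: population mean = 100
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# With these made-up numbers, re-drawing the sample can land p on either
# side of 0.05, which is exactly the situation I am asking about.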

Thank you!
 
  • #2
The choice of a confidence level depends on the subject area and the seriousness of a mistake. An extreme example is in physics where the issue is the claim of discovering a new nuclear particle. There, they typically insist on a standard of 5 sigma, which corresponds to percentages of 0.00006% or 0.00003% (two sided or one sided). See this CERN post.
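As a quick check of those percentages, here is a sketch using the standard normal tail probability (not from the CERN post, just an illustration):

Python:
from scipy.stats import norm

# Tail probability beyond 5 standard deviations of a standard normal
one_sided = norm.sf(5)
two_sided = 2 * norm.sf(5)
print(f"one-sided: {100 * one_sided:.5f}%")  # ~0.00003%
print(f"two-sided: {100 * two_sided:.5f}%")  # ~0.00006%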
In your example, where the p value is essentially right at 0.05, remember that the goal is to be able to convince others, who may be skeptical of the alternative to the null hypothesis. You can go either way, but expect some resistance from others.
 
  • #3
fog37 said:
TL;DR Summary: How far and how close to p=0.05 for statistical significance...

What if our p value is just 0.057? Do we keep H0? What if p was 0.049? Would we reject H0?
Unfortunately, p values are very misused, and the magical 0.05 threshold especially so. A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest. Or that a small p value indicates a large or important effect.

Regarding your specific question, I usually consider all of the evidence, but I typically would not find a p value of 0.049 to be very persuasive even though it is significant.
 
  • #4
Dale said:
Unfortunately, p values are very misused, and the magical 0.05 threshold especially so. A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest. Or that a small p value indicates a large or important effect.

Regarding your specific question, I usually consider all of the evidence, but I typically would not find a p value of 0.049 to be very persuasive even though it is significant.
I see and suspected that...thank you. So what is the alternative when we are working with a sample of size n and need to see if our estimates are reasonable and similar to the population parameters?
 
  • #5
fog37 said:
I see and suspected that...thank you. So what is the alternative when we are working with a sample of size n and need to see if our estimates are reasonable and similar to the population parameters?
Whenever possible I prefer to use Bayesian methods. I have a few insights articles on them. This one is the most relevant one to your question, but there are three others too if you are interested

https://www.physicsforums.com/insights/how-bayesian-inference-works-in-the-context-of-science/
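To give a flavor of the approach, here is a minimal sketch (not taken from the article) of a Bayesian update for a success rate, with made-up counts and a flat Beta prior:

Python:
from scipy.stats import beta

# Made-up data: 14 successes in 20 trials, flat Beta(1, 1) prior on the rate
successes, trials = 14, 20
posterior = beta(1 + successes, 1 + trials - successes)

# Report the posterior directly instead of a p-value
lo, hi = posterior.interval(0.95)
print(f"posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
print(f"P(rate > 0.5 | data): {posterior.sf(0.5):.3f}")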
 
  • #7
You are beginning to explore the art of statistics. It may be a science, but it is also an art.
 
  • #8
Dale said:
A low p value is evidence against a null hypothesis, but null hypotheses are almost never actually believable and are rarely of interest.

Suppose a grad student wants to prove that exposure to the music of Led Zeppelin increases the sexual potency of rats. The null hypothesis is that this is not so. I find this believable. Usually the null hypothesis is that nothing of interest is going on. I would say that in general this is believable. The grad student's hope is that the null hypothesis will be rejected due to a low probability that the observed increased virility is an artifact of random chance.

Dale said:
One of the biggest misinterpretations of p values is that a small p value is evidence in favor of some scientific hypothesis of interest.

I share this misinterpretation, assuming an experiment is properly designed. A small p value suggests that the sought-for effect is real. Perhaps there is something I am missing. Of course all this depends on proper application of statistical methods.
 
  • #10
Hornbein said:
Suppose a grad student wants to prove that exposure to the music of Led Zeppelin increases the sexual potency of rats. The null hypothesis is that this is not so. I find this believable.
This is not a typical null hypothesis. This would be called an alternative hypothesis. So the hypothesis of interest is that the effect is positive, and the alternative hypothesis is that the effect is non-positive (negative or zero). The null hypothesis is that there is no effect, i.e. that the effect is exactly zero.

A point hypothesis is generally not believable. If a parameter, like the effect size, is continuous then the chance that it assumes a specific single value vanishes.

Nevertheless, unbelievable null hypotheses are used because they allow easy calculation of the probability of the observed data under the point hypothesis. In other words it is easy to calculate ##P(D|H)## where ##D## is the data and ##H## is the hypothesis if ##H## is a point hypothesis. For your example it would be just as difficult to calculate ##P(D|H)## for your experimental hypothesis as it is for your alternative hypothesis. There would be no utility in that alternative hypothesis.
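To illustrate, here is a sketch with made-up normal data and a known standard deviation; the point null pins down every parameter, so ##P(D|H_0)## is a single number:

Python:
import numpy as np
from scipy.stats import norm

# Made-up measurements
rng = np.random.default_rng(1)
data = rng.normal(loc=0.3, scale=1.0, size=10)

# Under the point null H0: mu = 0 (sigma assumed known), P(D|H0) is just a product of densities
log_p_data_given_h0 = norm.logpdf(data, loc=0.0, scale=1.0).sum()
print(f"log P(D|H0) = {log_p_data_given_h0:.2f}")

# Under a composite hypothesis like "mu > 0" there is no single mu,
# so P(D|H) is only defined after integrating over a prior for mu.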

So typically your grad student would compare the data to the unbelievable null hypothesis, show that the data is unlikely to have arisen by chance under the null hypothesis, also show that the average effect is positive, and then claim that is evidence supporting the experimental hypothesis.

Hornbein said:
I share this misinterpretation, assuming an experiment is properly designed. A small p value suggests that the sought-for effect is real. Perhaps there is something I am missing. Of course all this depends on proper application of statistical methods.
A small p-value indicates only that the observed data is unlikely to have arisen by chance under the null hypothesis. Any other inference is suspect.

That the observed data is unlikely to have arisen by chance under the null hypothesis does not by itself indicate anything about the experimental hypothesis. The null hypothesis could be true and the experimenter just was unlucky. The null hypothesis could be true but the sampling non-random. The null hypothesis and the experimental hypothesis could both be false together. The experimental hypothesis could be one of many experimental hypotheses and multiple comparisons were not considered. Etc.
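The multiple-comparisons issue in particular is easy to demonstrate with a simulation; here is a sketch in which every null hypothesis is true by construction:

Python:
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments, n_tests = 2_000, 20

# Both groups always come from the same distribution, so every null is true
at_least_one_hit = 0
for _ in range(n_experiments):
    p_values = [stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
                for _ in range(n_tests)]
    at_least_one_hit += min(p_values) < 0.05

print(f"P(at least one p < 0.05 in {n_tests} tests) ~ {at_least_one_hit / n_experiments:.2f}")
# Roughly 1 - 0.95**20, about 0.64, even though no real effect exists anywhere.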

You are by far not alone in your misinterpretation. That is one of the biggest problems with p values.

It is actually kind of sad because when we take statistics they very carefully explain that you say "we reject the null hypothesis" and never that we "accept the experimental hypothesis". In statistics class people are told that the test just rejects the null hypothesis and does not support the experimental hypothesis. And then we publish our first scientific paper and in the results we reject the null hypothesis as we were taught in statistics class, and then immediately in the discussion section we accept the experimental hypothesis anyway.
 
  • #11
It's not wise to lock yourself into a rigid set of statements like "If p < .05 then..., otherwise if p >= .05 then ...". As others have pointed out, that isn't what p-values do, and it's not really how Fisher and other early practitioners thought they should be used. Looking back at some comments from Fisher:

"In 1926, as one of Fisher's early statements endorsing a p value of 0.05 as a boundary, he wrote: “…it is convenient [emphasis added] to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials’.”17 In 1956, Fisher wrote: “[…] no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”

We don't base decisions on the results of single calculations: small p-values might indicate a particular H0 isn't true, but there are many reasons a null can qualify to be rejected besides its being false. You should also look at the quality of the data, whether you've really asked the correct question, confidence intervals (use a confidence level that corresponds to your test's significance level, and don't make the mistake, as too many new students do, of referring to the alpha you use in a test as a confidence level: it isn't, it's the test's significance level), and so on.
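For instance, the duality between a test at significance level 0.05 and a 95% confidence interval looks like this in a sketch with made-up data:

Python:
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=1.0, scale=2.0, size=25)   # made-up data

# Test H0: mean = 0 at alpha = 0.05, and build the matching 95% confidence interval
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)
lo, hi = stats.t.interval(0.95, df=len(sample) - 1,
                          loc=sample.mean(), scale=stats.sem(sample))
print(f"p = {p_value:.4f}, 95% CI for the mean = ({lo:.2f}, {hi:.2f})")
# The test rejects H0 at the 0.05 level exactly when 0 falls outside this interval.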
 
  • #12
Further, I like to think of the p value as a measure of significance (not a bright-line test of significance). In this it conforms to my subjective method of making personal decisions, where the quality of any input is always evaluated and colors the significance of that particular "fact".

It is also useful that ±2σ and 95% roughly correspond.
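A quick numerical check of that correspondence (a sketch; ±2σ actually covers about 95.4%, and the exact two-sided 95% point is 1.96σ):

Python:
from scipy.stats import norm

print(f"coverage of ±2 sigma: {norm.cdf(2) - norm.cdf(-2):.4f}")   # about 0.9545
print(f"exact 95% two-sided point: {norm.ppf(0.975):.3f} sigma")   # about 1.960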
 
  • #13
fog, you may want to look up terms like p-hacking and the power of a test.
 
  • #14
statdad said:
"In 1926, as one of Fisher's early statements endorsing a p value of 0.05 as a boundary, he wrote: “…it is convenient [emphasis added] to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials’.”17 In 1956, Fisher wrote: “[…] no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
Exactly. The real question is whether it is wise to make the claim of the alternative hypothesis if that claim might be wrong one time out of 20. In some cases, that might be fine and in other cases that might be terrible. That is why high energy physics claims are required to have a "5-sigma" (wrong once in every 1.7 million double tail, or once in every 3.5 million single tail) level of significance.
 
  • #15
Of course any possible deviation from a Gaussian (normal) distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in ##10^5## per flight come to mind.)
 
  • #16
hutchphd said:
Of course any possible deviation from a Gaussian (normal) distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in ##10^5## per flight come to mind.)
I think that we should distinguish between the reliability of the statistical theory versus the reliability of the model assumptions. In the case of particle physics, where the assumptions only depend on physics, the 5-sigma results may be very reliable. On the other hand, in cases like the space shuttle, where the assumptions depend on upper management being non-political, I wouldn't count too much on any result that was less than 1/10.
 
  • #17
fog37 said:
What if p was 0.049?
What if it were 0.048?
What if it were 0.047?
What if it were 0.051?

If you draw a line, stick to it. Don't go changing it after the fact to get the answer you want.
 
  • #18
Vanadium 50 said:
What if it were 0.048?
What if it were 0.047?
What if it were 0.051?

If you draw a line, stick to it. Don't go changing it after the fact to get the answer you want.
Or don't draw the line
 
  • #19
Dale said:
Or don't draw the line
Sometimes decisions must be made.
 
  • #20
FactChecker said:
Sometimes decisions must be made.
Sure, but you can base decisions on an aggregate of available relevant information rather than a single artificial line that generally is not even relevant to the decision being made.
 
  • #21
Dale said:
Sure, but you can base decisions on an aggregate of available relevant information rather than a single artificial line that generally is not even relevant to the decision being made.
Good point. But the "aggregate of available relevant information" is often just as questionable as (or more so than) the statistical results. That is often why the statistical analysis was asked for in the first place. The world is messy.
 
  • #22
FactChecker said:
Good point. But the "aggregate of available relevant information" is often just as questionable as (or more so than) the statistical results.
I think I maybe was unclear. I am talking about all of “the statistical results” when I say “aggregate of available relevant information”. As opposed to “the p-value” as a single line.
 
  • #23
hutchphd said:
Of course any possible deviation from a Gaussian (normal) distribution will make these low-probability significance estimates wildly speculative. There is a point where it becomes silly. (Space shuttle failure estimates of 1 in ##10^5## per flight come to mind.)
It's important to remember that there is no such thing as truly Gaussian data: every application of that distribution is an approximation; the only question is how drastic the approximation is.
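To put a number on how drastic it can be, here is a sketch comparing the probability of a 5-standard-deviation excursion under a normal model with the same probability for a heavy-tailed t distribution with 5 degrees of freedom:

Python:
import numpy as np
from scipy.stats import norm, t

df = 5
sd = np.sqrt(df / (df - 2))   # standard deviation of a t(5) variable

# Probability of landing more than 5 standard deviations out
print(f"normal: {norm.sf(5):.1e}")          # ~2.9e-7
print(f"t(5):   {t.sf(5 * sd, df):.1e}")    # ~7e-4, thousands of times larger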
 
  • #24
Hornbein said:
Suppose a grad student wants to prove that exposure to the music of Led Zeppelin increases the sexual potency of rats. The null hypothesis is that this is not so. I find this believable. Usually the null hypothesis is that nothing of interest is going on. I would say that in general this is believable. The grad student's hope is that the null hypothesis will be rejected due to a low probability that the observed increased virility is an artifact of random chance.



I share this misinterpretation, assuming an experiment is properly designed. A small p value suggests that the sought-for effect is real. Perhaps there is something I am missing. Of course all this depends on proper application of statistical methods.
Wouldn't this require an experiment?
 
  • #25
Agent Smith said:
Wouldn't this require an experiment?
Sure but I ain't gonna do it.
 

FAQ: How far and how close to p=0.05 for statistical significance?

What does p=0.05 signify in statistical analysis?

A p-value of 0.05 means that, if the null hypothesis were true, there would be a 5% probability of obtaining results at least as extreme as those observed. It is a commonly used threshold for determining statistical significance: if the p-value is less than or equal to 0.05, the results are conventionally labeled statistically significant.

Is a p-value of 0.051 considered statistically significant?

No, a p-value of 0.051 is not considered statistically significant if the threshold is set at 0.05. The result would be described as marginally non-significant: under the null hypothesis, data at least as extreme as those observed would occur slightly more than 5% of the time.

How should results with a p-value close to 0.05 be interpreted?

Results with a p-value close to 0.05 should be interpreted with caution. While a p-value just below 0.05 indicates statistical significance, it does not necessarily imply practical significance. Similarly, a p-value just above 0.05 suggests that the results are not statistically significant but may still be of interest. It is important to consider the context, effect size, and other relevant factors.
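For example, with a very large sample even a negligible effect can produce a small p-value; here is an illustrative sketch:

Python:
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# A negligible true effect (mean 0.02, standard deviation 1) with a very large sample
sample = rng.normal(loc=0.02, scale=1.0, size=100_000)

t_stat, p_value = stats.ttest_1samp(sample, popmean=0)
print(f"p = {p_value:.2e}, estimated effect = {sample.mean():.3f}")
# The p-value is typically far below 0.05 here even though the effect itself is tiny.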

What are the limitations of using p=0.05 as a threshold for significance?

Using p=0.05 as a threshold for significance has several limitations. It can lead to arbitrary distinctions between "significant" and "non-significant" results, and it may encourage p-hacking, where researchers manipulate data or analyses to achieve a p-value below 0.05. Additionally, it does not measure the size or importance of an effect, only the probability of observing data at least as extreme as those obtained if the null hypothesis is true.

Can a p-value alone determine the validity of a study's findings?

No, a p-value alone cannot determine the validity of a study's findings. While it provides information about statistical significance, it does not account for the study design, sample size, effect size, or potential biases. Comprehensive evaluation of these factors, along with replication of results, is essential for assessing the validity and reliability of scientific findings.
