Statistical hypothesis testing is widely used in scientific literature with the results often reported in terms of p-values. Occasionally there is some confusion about the interpretation of the p-value, the level, or other aspects of the significance test. We give some examples in the following pages.
This story critiques statements about the results of statistical hypothesis tests that appear in scientific literature and in the popular press.
Interpreting p-values, understanding hypothesis tests, hypothesis tests vs. confidence intervals.
Anecdote #1: p-value
In discussing the use of statistical models, Hugh Gaugh gives the following interpretation of the p-value:
"Roughly speaking, the p value is the probability that the observed result happened merely by chance; the complementary value (1- p) is the probability that the source caused a real effect."
Anecdote #2: Secondhand Smoke
The following segment comes from the Associated Press (as printed in The Columbus Dispatch 7/3/1994).
“The tobacco industry says it's being victimized by biased scientists who skew data to make it falsely appear that cigarette smoke - yours or somebody else's - is bad for you.”
“...Philip Morris capped off a week-long attack with a three page ad in 40 Sunday newspapers that charged the Environmental Protection Agency with using flawed science to label secondhand smoke a carcinogen.”
“...Philip Morris is reprinting an article by a media critic that claims the EPA, in labeling secondhand smoke a carcinogen, used invalid studies and skewed statistics, calling a study significant when it had only a 90 percent chance of accuracy instead of the usual 95 percent chance.”
“...The statistics are tricky, but using a 90 percent "confidence interval" is OK when scientists are sure a substance won't have a particular effect, said Dr. Ron Davis, editor of the international journal Tobacco Control. In other words, no one says that secondhand smoke is good. So 90 percent was enough to detect either no effect or a bad one, and was the same level EPA used to label radon and dioxin dangerous.”
“But the EPA says even if it threw out the studies that used the 90 percent level, it would still have enough evidence that secondhand smoke causes cancer.”
For the Gaugh quote in the p-value anecdote, briefly explain how the author is misinterpreting the meaning of the p-value, and give a correct interpretation.
"Roughly speaking, the p value is the probability that the observed result happened merely by chance; the complementary value (1-p) is the probability that the source caused a real effect."
Gaugh is incorrect in saying that "1- p is the probability that the source caused a real effect." The p-value is calculated assuming that the chance model of the null hypothesis is correct. Therefore, p does not give the chance that the null hypothesis is correct nor does 1 - p give the chance that the alternative is correct as indicated in the quote.
The p-value is the probability, assuming the null hypothesis is true, that the test statistic would take a value as extreme or more extreme than that actually observed in the test.
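The definition above can be made concrete with a small calculation. The following is a minimal sketch, not taken from either anecdote: the coin-tossing data (60 heads in 100 tosses) and the function name `binomial_upper_p_value` are hypothetical, chosen only to show that the p-value is computed entirely under the null hypothesis.

```python
from math import comb


def binomial_upper_p_value(successes: int, trials: int, null_prob: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, null_prob).

    The probability is computed assuming the null hypothesis (success
    probability equal to null_prob) is true.
    """
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 60 heads in 100 tosses; null hypothesis "the coin is fair".
p = binomial_upper_p_value(60, 100, 0.5)
print(f"one-sided p-value = {p:.4f}")
```

The result, a little under 0.03, is the chance of seeing 60 or more heads if the coin is fair. It is not the chance that the coin is fair, and 1 - p is not the chance that the coin is biased.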
Assume that the media critic in the Secondhand Smoke anecdote is referring to a hypothesis test done by the EPA. How is he misinterpreting it?
The media critic whom the Associated Press (AP) cites is misinterpreting the value 1 - α, where α is the fixed significance level of the test. In a fixed significance level test, the p-value must be as small as or smaller than the fixed level α for the null hypothesis to be rejected. When the critic says, "...calling a study significant when it had only a 90 percent chance of accuracy...," he is implying that a significance test with a p-value of (at most) 0.10 is (1 - 0.10) × 100% or 90 percent accurate. This is false. The p-value is calculated assuming that the chance model of the null hypothesis is correct. Therefore 1 - p does not give the chance that the alternative hypothesis is correct, let alone that the study is accurate. In fact, it is not clear exactly what "accurate" means here, as it is not a standard statistical term.
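To see why a 0.10 level does not translate into "90 percent accuracy", it may help to work through a made-up scenario. The sketch below assumes a one-sided z-test and three hypothetical inputs that do not come from the EPA report: the share of studied effects that are truly null (`prior_null`), the size of a real effect in standard-error units (`effect_z`), and the level `alpha`. It then computes what fraction of "significant" results would come from true null hypotheses.

```python
from statistics import NormalDist


def true_null_share_among_significant(alpha: float, prior_null: float, effect_z: float) -> float:
    """Fraction of significant one-sided z-tests for which the null is actually true.

    alpha      -- fixed significance level of the test
    prior_null -- assumed share of studies in which the null is true (hypothetical)
    effect_z   -- assumed mean of the z statistic when the effect is real (hypothetical)
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)         # reject when z exceeds this
    power = 1 - std_normal.cdf(z_crit - effect_z)  # P(reject | effect is real)
    false_positives = prior_null * alpha           # P(null true and test rejects)
    true_positives = (1 - prior_null) * power      # P(effect real and test rejects)
    return false_positives / (false_positives + true_positives)


share = true_null_share_among_significant(alpha=0.10, prior_null=0.9, effect_z=2.0)
print(f"share of significant results with a true null: {share:.2f}")
```

Under these invented assumptions, a bit over half of the significant results come from true nulls, even though every test was run at the 0.10 level. With different assumptions the share changes, which is the point: neither α nor the p-value alone determines how often a significant finding is correct.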
There is another interpretation of what the media critic was thinking in the Secondhand Smoke anecdote. Perhaps the EPA used the duality of confidence intervals and hypothesis tests to conduct their tests by calculating a confidence interval first. Was the media critic correct in the spirit of his comment? In other words, if he had said, "...calling a study significant when the confidence intervals were only performed at the 90 percent level instead of the usual 95 percent level," would he have had a legitimate complaint? Explain your position on this.
The key to this interpretation is located in the fourth paragraph of the excerpt. Dr. Ron Davis is reported as saying, "In other words, no one says that secondhand smoke is good. So 90 percent was enough to detect either no effect or a bad one, and was the same level EPA used to label radon and dioxin dangerous." This indicates that the EPA was really interested in conducting a one-sided test, not a two-sided test (e.g. Ha: µ > µ0 as opposed to Ha: µ ≠ µ0, where µ0 is the value specified by the null hypothesis). Note that the critical point for this one-sided α = 0.05 level test is the same as the upper critical point for the two-sided α = 0.10 level test. If the critic failed to realize the EPA conducted a one-sided test, he or she might be misled into thinking that the EPA conducted a two-sided 0.10 level test based on reported critical values.
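The equality of the two critical points can be checked numerically. This is a minimal sketch that assumes the test statistic is approximately standard normal under the null hypothesis (a z-test); the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal reference distribution (assumed)

# One-sided test at level 0.05: all 5 percent of the rejection region is in the upper tail.
one_sided_05 = std_normal.inv_cdf(1 - 0.05)

# Two-sided test at level 0.10: 5 percent in each tail, so the upper cutoff
# again leaves 5 percent above it.
two_sided_10_upper = std_normal.inv_cdf(1 - 0.10 / 2)

print(f"one-sided 0.05 critical value:       {one_sided_05:.4f}")
print(f"two-sided 0.10 upper critical value: {two_sided_10_upper:.4f}")
```

Both values are about 1.645, so a reported cutoff of 1.645 is consistent with either a one-sided 0.05 test or a two-sided 0.10 test.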
The analog of a one-sided hypothesis test is a confidence bound, not a confidence interval. A 90 percent confidence interval has the same lower endpoint as the 95 percent lower confidence bound, and the same upper endpoint as the 95 percent upper confidence bound, because the 90 percent interval excludes 5 percent in each tail. So the EPA simply used the 90 percent interval to obtain the endpoint of the corresponding 95 percent one-sided bound. The upshot is that the media critic still would not have a legitimate complaint, although the EPA reported its results in a manner that caused confusion.
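The relationship between the 90 percent interval and the 95 percent bound can be illustrated with a short calculation. This sketch assumes a normal-theory interval of the form estimate ± z × standard error; the estimate (1.0) and standard error (0.5) are invented for illustration and are not EPA figures.

```python
from statistics import NormalDist


def two_sided_interval(estimate: float, std_error: float, confidence: float) -> tuple[float, float]:
    """Normal-theory confidence interval that splits 1 - confidence equally between tails."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return estimate - z * std_error, estimate + z * std_error


def lower_bound(estimate: float, std_error: float, confidence: float) -> float:
    """Normal-theory lower confidence bound that puts all of 1 - confidence in one tail."""
    z = NormalDist().inv_cdf(confidence)
    return estimate - z * std_error


# Hypothetical estimate and standard error.
interval_90 = two_sided_interval(1.0, 0.5, 0.90)
bound_95 = lower_bound(1.0, 0.5, 0.95)

print(f"90% interval: ({interval_90[0]:.4f}, {interval_90[1]:.4f})")
print(f"95% lower bound: {bound_95:.4f}")
```

The lower endpoint of the 90 percent interval matches the 95 percent lower bound, since both leave 5 percent in the lower tail.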
Write a paragraph explaining the difference between a one-tailed and two-tailed significance test and under what circumstances each should be used. Explain how the statement by Dr. Ron Davis in the “Secondhand Smoke” anecdote is directed at this issue.
The Columbus Dispatch (July 3, 1994), Gaugh, H. G. Jr. (1993).
The story was prepared by Dennis Pearl on 11/9/93, with an addition by Mark Zabel on 1/20/95, and was last modified on 1/20/95.