Many clinicians consider the P value as an almost magical number that determines whether treatment effects exist or not. Is that a correct understanding?
To grasp the conceptual meaning of the P value, consider comparing two treatments, A and B, and finding that A is twice as effective as B. Does this mean that treatment A is better in reality? We cannot be sure from that information alone. It may be that treatment A is truly better than treatment B (i.e., a true positive). However, it may also be that, by chance, we have collected a sample in which more people respond to treatment A, making it appear more effective when in reality it is equally effective as treatment B (i.e., a false positive).
An arbitrary definition
A P value of less than 5% (P less than .05) means that there is less than a 5% probability that we would observe a difference at least as large as the one above if, in reality, treatments A and B were equally effective. Since this probability is very small, the convention is to reject the idea that both treatments are equally effective and to declare that treatment A is indeed more effective.
The P value is thus a probability, and “statistical significance” depends simply on 5% being the conventional cutoff for a probability low enough to make chance an unlikely explanation for the observed results. As you can see, this is an arbitrary cutoff; it could have been 4% or 6%, and the concept would not have changed.1
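A small simulation makes the false-positive idea concrete. The sketch below (Python, standard library only; the function names are ours for illustration, not from any statistics package) repeatedly runs a “trial” in which treatments A and B are truly equally effective, and counts how often chance alone produces P less than .05 under a simple two-proportion z-test. By construction, this should happen about 5% of the time.

```python
import math
import random

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided P value from a pooled two-proportion z-test
    (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:  # all responses identical; no evidence of a difference
        return 1.0
    z = (p1 - p2) / se
    # two-sided tail probability from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def false_positive_rate(true_rate=0.5, n=100, trials=1000, seed=1):
    """Simulate trials in which A and B are truly equally effective
    and return the fraction that are 'significant' by chance alone."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # responders to A and to B, drawn from the SAME true response rate
        x1 = sum(rng.random() < true_rate for _ in range(n))
        x2 = sum(rng.random() < true_rate for _ in range(n))
        if two_prop_p_value(x1, n, x2, n) < 0.05:
            hits += 1
    return hits / trials
```

Calling `false_positive_rate()` returns a value close to 0.05: even when two treatments are exactly equivalent, about 1 trial in 20 will appear “significant” at the conventional cutoff.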
Thus, simply looking at the P value itself is insufficient. We need to interpret it in light of other information.2 Before doing that, we need to introduce a related statistical concept, that of “power.” The power of a study can be conceptually understood as the probability that the study will detect a difference if one truly exists. If there is a real difference between treatments A and B, the power of the study is the probability of detecting it.
Two factors influence power: the effect size (that is, the difference between A and B) and the sample size. If the effect size is large, then even with small samples we can detect it. For example, if treatment A were effective in 100% of cases, and treatment B in only 10% of cases, the difference would be clear even with a small number of patients. Conversely, if the effect size is small, then we would need a very large sample size to detect that difference. For example, if treatment A is effective in 20% of cases, and treatment B is effective in 22% of cases, the difference between them could be observed only if we enrolled a very large number of patients. A large sample size increases the power of a study. This has important implications for the interpretation of the P value.
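The interplay of effect size and sample size can be sketched in a few lines under the usual normal approximation. The function below is a rough illustration (`approx_power` is our own name, and the formula uses the unpooled standard error of the difference in proportions); it approximates the power of a two-sided two-proportion test with n patients per arm at the conventional 5% cutoff.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def approx_power(p1, p2, n, alpha_z=1.96):
    """Approximate power of a two-sided two-proportion z-test,
    with n patients per arm and true response rates p1 and p2."""
    # standard error of the difference under the alternative hypothesis
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return norm_cdf(abs(p1 - p2) / se - alpha_z)

# large effect (90% vs 10%): high power with only 10 patients per arm
big_effect_small_n = approx_power(0.90, 0.10, n=10)

# small effect (20% vs 22%): negligible power at 100 per arm,
# but high power with thousands of patients per arm
small_effect_small_n = approx_power(0.20, 0.22, n=100)
small_effect_big_n = approx_power(0.20, 0.22, n=10000)
```

With these inputs the large effect is detectable almost surely even in a tiny study, whereas the 20%-versus-22% comparison has power of only a few percent at 100 patients per arm and needs on the order of 10,000 per arm before power exceeds 90%, mirroring the examples in the text.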