3.5 Closing: The Commitment Behind the Number
What this chapter asked
Chapter 3 asked one question: what are we committing to?
It asked that question in four registers, each exposing a different layer of the commitment embedded in the sample size calculation.
Section 3.1 asked who owns the effect size assumption—the most consequential prediction in the entire design, and the one most subject to optimism bias, borrowed evidence, and unchallenged convention. The answer is the clinical team, not the statistician. When the statistician owns it, the number is produced without the clinical judgment the prediction requires. When the clinical team owns it, it is examined with the care that a prediction of this consequence deserves.
Section 3.2 asked what happens when variance and dropout assumptions are wrong simultaneously—not just individually, which is bad enough, but in the same direction at the same time, which is how assumptions tend to fail in practice. The compound error of joint pessimistic realizations is the most common and least examined source of power loss in clinical trials. The correction is not more careful individual assumptions, though that helps; it is joint scenario analysis that explicitly evaluates the power at the boundary of what is plausible.
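What such a joint scenario analysis looks like can be sketched in a few lines. The example below is a minimal illustration for a two-arm trial with a continuous endpoint, using a two-sided two-sample z-approximation; every number in it (a 4-point effect, a standard deviation of 10 that might really be 13, dropout of 10% that might really be 25%) is a hypothetical placeholder, not a value from this chapter.

```python
# A minimal joint scenario analysis: power under every combination of
# the planned and pessimistic values of two nuisance parameters.
from itertools import product
from scipy.stats import norm

def power_after_dropout(n_enrolled, delta, sd, dropout, alpha=0.05):
    """Approximate two-sided two-sample z-test power, counting only completers."""
    n = n_enrolled * (1 - dropout)          # analyzed patients per arm
    ncp = delta / (sd * (2 / n) ** 0.5)     # noncentrality of the z statistic
    return norm.cdf(ncp - norm.ppf(1 - alpha / 2))

N_PER_ARM, DELTA = 150, 4.0                 # hypothetical design inputs
for sd, dropout in product([10.0, 13.0], [0.10, 0.25]):
    p = power_after_dropout(N_PER_ARM, DELTA, sd, dropout)
    print(f"sd={sd:4.1f}  dropout={dropout:.0%}  power={p:.2f}")
```

The last row of output is the one that matters: either pessimistic realization alone leaves this design at 85% or 71% power, but the joint realization takes it from 91% to 64%. That is the compound error, and no amount of scrutiny of each assumption in isolation will surface it.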
Section 3.3 asked why event rate uncertainty deserves its own section in a chapter about commitments. The answer is that in event-driven time-to-event trials, a wrong event rate assumption does not just affect power—it affects trial duration, enrollment requirements, DSMB operating characteristics, and the informational structure of the interim analysis. The secular trend problem—the systematic failure to account for declining background event rates—is not a subtle statistical concern. It is a predictable and preventable source of trial failure that has been documented repeatedly and still recurs.
Section 3.4 reframed power as a choice about risk, not a technical parameter. The 20% type II error rate that conventional 80% power accepts is a design decision about who bears the consequence of missing a real treatment effect. That decision should be explicit—connected to the consequences of a false negative in this indication, for this patient population, given the availability of alternatives—rather than defaulting to convention.
What this chapter decided
By the end of this chapter, six things must be documented.
The assumed effect size is specified in the units of the primary effect measure. It is connected to identified sources, assessed for optimism relative to those sources, and owned by a member of the clinical team who can defend it in a design review or regulatory meeting.
The power curve is reported—not just the power at the central assumption but the power at the lower end of the plausible effect size range and at the minimum clinically important difference. The curve shows the design’s sensitivity to a wrong effect size assumption, and that sensitivity has been examined; a sketch of such a curve follows these six items.
The variance, dropout, and event rate assumptions are documented with sources and plausible ranges. Each has been individually examined for the direction and magnitude of potential misspecification.
The joint pessimistic scenario has been analyzed. The power at simultaneous pessimistic realizations of all key nuisance parameters is documented. If it is unacceptable, the design has been adjusted—by a larger sample size, a pre-specified re-estimation rule, or an explicit acknowledgment of fragility—before enrollment.
The power level is justified. The rationale connects the chosen power to the consequences of a type II error in this indication: why this error rate is acceptable, who bears it, and whether alternatives—higher power, a different design, a different patient selection—would redistribute the risk more appropriately.
The power report is a design document, not a technical appendix. It contains the analysis that makes the risk budget explicit and the reasoning that justifies the operating point.
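As a concrete instance of the second of these items, the reported power curve, here is a minimal sketch. It assumes a two-sided two-sample z-approximation with a fixed analyzed sample size, and every number in it is a hypothetical placeholder rather than a value from this chapter.

```python
# A minimal power curve: power of a fixed design across the plausible
# range of standardized effect sizes, with the MCID marked.
import numpy as np
from scipy.stats import norm

N_PER_ARM = 175                     # analyzed patients per arm (hypothetical)
MCID = 0.20                         # minimum clinically important difference
z_crit = norm.ppf(0.975)            # two-sided alpha = 0.05

for delta in np.arange(0.15, 0.41, 0.05):
    power = norm.cdf(delta * (N_PER_ARM / 2) ** 0.5 - z_crit)
    marker = "  <- MCID" if np.isclose(delta, MCID) else ""
    print(f"delta={delta:.2f}  power={power:.2f}{marker}")
```

A design like this one has 80% power at its central assumption of 0.30 but runs below 50% at the MCID. Reporting the curve, rather than the single number 0.80, is what makes that gap visible before enrollment rather than after.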
The characteristic mistakes of this chapter
Three mistakes recur in the territory Chapter 3 covers. They are not arcane failures. They are the normal product of design processes that move from estimand to sample size without pausing on the commitments in between.
The optimistic effect size that no one challenged. The assumed hazard ratio of 0.70 that was borrowed from a phase II trial in a more enriched population, with a smaller sample size and a shorter follow-up. The assumed mean difference of 3 points that came from the most favorable of three available prior studies. The assumed risk difference of 8% derived from a historical control arm event rate that reflects standard of care five years ago. Each of these assumptions is internally defensible—there is a source, a number, a rationale—but none of them reflects the question that should have been asked: is this assumption plausible for this trial, in this population, at this moment? When no one asks that question, the calculation proceeds, the sample size is approved, and the underpowered trial follows two years later.
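The cost of the borrowed 0.70 is easy to quantify. Under the standard Schoenfeld approximation, the number of events a 1:1 time-to-event trial requires is proportional to 1/(ln HR)², so modest optimism in the hazard ratio translates into a large deficit in events. The sketch below applies the standard formula; the candidate hazard ratios are illustrative.

```python
# Required events under the Schoenfeld approximation for a 1:1 log-rank
# comparison: D = 4 * (z_{1-alpha/2} + z_{1-beta})^2 / (ln HR)^2.
from math import ceil, log
from scipy.stats import norm

def schoenfeld_events(hr, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(4 * z ** 2 / log(hr) ** 2)

for hr in (0.70, 0.75, 0.80):
    print(f"assumed HR={hr:.2f}  required events={schoenfeld_events(hr)}")
```

The requirement climbs from 247 events at 0.70 to 631 at 0.80. Run in reverse, the same formula says a trial sized for 247 events has roughly 40% power if the true hazard ratio is 0.80, which is the answer to the question no one asked.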
The dropout rate that no one examined. The assumed 10% dropout because the phase II had 10% dropout. The phase II was twelve weeks. The phase III is two years. The patient population in phase III includes patients with more advanced disease who are more likely to withdraw due to tolerability or lack of response. The 10% assumption was never examined against these differences because it was borrowed from a prior study and no one thought to challenge it. The result is a trial that completes enrollment with 25% actual dropout, effective power below 60%, and a primary result that is inconclusive for a treatment that might be real.
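The arithmetic of that failure can be sketched under two deliberately crude models of what dropout does: pure attrition, in which dropouts contribute no data, and ITT dilution, in which every enrolled patient is analyzed but those who discontinue carry no treatment effect on average. All inputs are hypothetical placeholders.

```python
# Power erosion when actual dropout (25%) exceeds the assumed dropout
# (10%) that the enrollment target was inflated for.
from math import ceil, sqrt
from scipy.stats import norm

ALPHA, DELTA, SD = 0.05, 0.30, 1.0              # hypothetical design inputs
z_crit = norm.ppf(1 - ALPHA / 2)

def power(n_analyzed, effect):
    """Two-sided two-sample z-approximation."""
    return norm.cdf(effect / (SD * sqrt(2 / n_analyzed)) - z_crit)

n_complete = ceil(2 * ((z_crit + norm.ppf(0.80)) * SD / DELTA) ** 2)  # 175/arm
n_enrolled = ceil(n_complete / (1 - 0.10))     # inflated for assumed 10% dropout

print(f"attrition only: {power(n_enrolled * 0.75, DELTA):.2f}")   # ~0.73
print(f"ITT dilution:   {power(n_enrolled, DELTA * 0.75):.2f}")   # ~0.60
```

Either crude model leaves the trial well short of the 80% it promised, and real missing-data mechanisms can do worse. Every input to that arithmetic was available at design; the only thing missing was the examination.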
The power level that no one justified. Eighty percent because that is what we use. In a serious condition with no approved alternatives. In a trial that is the program’s only confirmatory study. In a patient population where delay in approval means patients who will not live to benefit from a second trial. The 20% type II error rate was accepted by convention, not by reasoning—and the reasoning, if anyone had done it, might have produced a different answer.
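The reasoning, had anyone done it, costs one line. In the usual two-sample approximation the required sample size scales with the squared sum of the normal quantiles for the significance level and the power, so the premium for halving the type II error rate is the same percentage whatever the effect size; a minimal sketch:

```python
# Relative sample size at 90% versus 80% power: N scales with
# (z_{1-alpha/2} + z_{power})^2 in the two-sample approximation.
from scipy.stats import norm

z_alpha = norm.ppf(0.975)                       # two-sided alpha = 0.05
n_80 = (z_alpha + norm.ppf(0.80)) ** 2
n_90 = (z_alpha + norm.ppf(0.90)) ** 2
print(f"relative sample size, 90% vs 80% power: {n_90 / n_80:.2f}")   # ~1.34
```

Roughly a third more patients halves the chance of missing a real effect from 20% to 10%. In a serious condition with no alternatives and a single confirmatory shot, that trade deserves a decision, not a default.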
What cannot be recovered
The commitments of Chapter 3—the effect size assumption, the nuisance parameters, the power level—cannot be revised after the trial is underway without invoking adaptation. And adaptation, in the form of sample size re-estimation, has costs and constraints that are addressed in Chapter 7. The relevant point here is that the fixed sample size design—the default design that most trials use—commits to these assumptions at the design stage, and the commitment cannot be unwound by any choice of analysis after the fact.
If the effect size assumption was too optimistic and the trial is underpowered, the result may be a non-significant finding for a treatment that is real. The options are: run an additional trial (time, resources, delay), pool with other trials in a meta-analysis (possible if the trials are compatible, but uncertain), accept the non-significant result and abandon the program, or report the result with an honest acknowledgment of the power limitation and accept a conditional conclusion. None of these options is as good as having powered the trial correctly from the beginning.
If the dropout rate was underestimated and the effective sample size is smaller than planned, the primary analysis may be the analysis of a trial that was not the trial that was designed. The registrations, the ethics approvals, the sample size justification, and the monitoring plan were all calibrated to a trial with different operating characteristics. This is not fraud; it is the predictable consequence of wrong assumptions materializing. But it is a trial whose result must be interpreted against the design’s original commitments, and the interpretation is more complicated than it should be.
The lesson is not that commitments should be avoided. Trial design requires commitments, and the alternative—an indefinitely flexible design that adapts to every new piece of information—has its own costs and limitations, as Chapter 7 will show. The lesson is that commitments should be made consciously, by the right people, with examination of their consequences, and with acknowledgment of what happens when they are wrong.
The connection to Chapter 4
Chapter 4 asks when the trial might stop early—the question of interim analyses, stopping boundaries, and the governance of the decisions made at those boundaries. Everything in Chapter 4 depends on what Chapter 3 settled.
The information fraction at which interim analyses are planned is a fraction of the total information Chapter 3 specified. If the total information is wrong—if the event target was based on an optimistic event rate and will not be reached—the information fractions are wrong, and the stopping boundaries are wrong. The DSMB will be looking at analyses calibrated to a design that does not correspond to the trial it is actually monitoring.
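How badly the timing slips is easy to make concrete under a deliberately crude model: all patients at risk from time zero, exponential time to event, no dropout. The calendar time at which an information fraction is reached then solves N(1 − exp(−λt)) = target events; the rates and targets below are hypothetical.

```python
# Calendar time to reach the interim (50%) and final (100%) information
# fractions under the assumed versus the actual event rate.
from math import log

N, TARGET_EVENTS = 1000, 400        # hypothetical enrollment and event target
for label, rate in (("assumed", 0.20), ("actual", 0.13)):  # events per pt-year
    for frac in (0.5, 1.0):
        events = frac * TARGET_EVENTS
        t = -log(1 - events / N) / rate   # solve N*(1 - exp(-rate*t)) = events
        print(f"{label} rate: {frac:.0%} information at year {t:.1f}")
```

A true rate a third below the assumed one pushes the interim look from year 1.1 to year 1.7 and the final analysis from year 2.6 to year 3.9, and staggered accrual and dropout only compound the slippage.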
The alpha spent at each interim analysis comes from the total alpha budget that was fixed at design. If the primary test alpha was not correctly specified—if the power calculation used the nominal alpha without accounting for interim analysis spending—the final analysis will be conducted at an alpha level that was not planned.
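The mechanics can be sketched with a Lan-DeMets O’Brien-Fleming-type spending function, α(t) = 2(1 − Φ(z₁₋α/₂/√t)). The boundaries below are computed naively from each look’s incremental spend, ignoring the correlation between looks, so they are an illustration of spending rather than exact group-sequential boundaries (which require dedicated software such as gsDesign or rpact).

```python
# O'Brien-Fleming-type alpha spending with one interim look at half the
# information, then the final look at full information.
from math import sqrt
from scipy.stats import norm

ALPHA = 0.05
z = norm.ppf(1 - ALPHA / 2)

def spent(t):
    """Cumulative alpha spent by information fraction t."""
    return 2 * (1 - norm.cdf(z / sqrt(t)))

prev = 0.0
for t in (0.5, 1.0):
    inc = spent(t) - prev                  # alpha spent at this look
    boundary = norm.ppf(1 - inc / 2)       # naive: ignores between-look correlation
    print(f"t={t:.1f}  cumulative={spent(t):.4f}  "
          f"increment={inc:.4f}  z-boundary={boundary:.2f}")
    prev = spent(t)
```

Even with this conservative approximation the final look is tested near z = 2.01, not 1.96. A power calculation that assumed the nominal 1.96 promised more power than the monitored design can deliver.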
And the power that was committed to in Chapter 3 is the operating characteristic against which Chapter 4’s stopping rules must be evaluated. An interim stopping rule that looks efficient in isolation—stopping early when the interim result is compelling—may look less efficient when its effect on the trial’s operating power is computed. The integration of interim analysis planning and power calculation is not sequential. It is simultaneous, and it requires the commitments of Chapter 3 to be settled before the planning of Chapter 4 can proceed honestly.
Chapter 3 risk summary
The decision this chapter owns: how large must the trial be, under what assumptions, accepted by whom, to generate the evidence that Chapter 1 defined and Chapter 2 described?
The most common mistake: treating the sample size calculation as a formula whose inputs are technical parameters to be estimated by statisticians, rather than as a set of clinical predictions and risk allocations to be owned by the design team. The assumed effect size is a clinical prediction. The dropout rate is a clinical prediction. The power level is a risk decision. When these are delegated to the statistician without clinical ownership, the calculation is correct and the commitments are unexamined.
The professional-level risk: the underpowered trial that cannot be recovered. Not underpowered because the assumptions were impossible to get right—they are always uncertain—but underpowered because they were not examined. The effect size was borrowed without calibration. The dropout rate was copied from a shorter trial. The power level was set by convention. The trial completes, the result is non-significant, and the question that no one asked at design—could we have done this differently?—has an answer. The answer is yes. But the time to give it was before enrollment began, not after the database locked.