3.1 Effect Size Ownership
The assumption no one wants to own
Every sample size calculation begins with a number: the expected treatment effect. A hazard ratio of 0.75. A risk difference of 5 percentage points. A mean change from baseline of 2.1 units. This number is the effect size the trial is designed to detect—the magnitude of benefit that, if real, the trial will have adequate power to confirm.
This number is a prediction. It may be informed by prior trials, by dose-ranging studies, by meta-analyses of related treatments, by biological reasoning about mechanism, or by extrapolation from a different population or a different stage of disease. But it is always, in the end, a prediction about what a future trial will find—and predictions made before a trial are frequently wrong. In a 2016 analysis of phase III oncology trials, approximately 40% failed to confirm the primary hypothesis. In cardiovascular disease, where large trials have been run for decades and the prediction inputs are rich, underpowered trials are still common. The history of phase III development is substantially a history of effect size predictions that turned out to be optimistic.
This is not an indictment of clinical scientists or statisticians. Effect sizes are genuinely difficult to predict in advance. The patients enrolled in the definitive phase III trial are not the same as the patients in the earlier trials from which the effect size was borrowed. The control arm event rates differ. The background therapy differs. The diagnostic criteria differ. The effect size assumption absorbs all these differences and converts them into a single number that is then treated as though it were a known quantity in a formula.
The question this section addresses is not how to make better predictions—though that is worth pursuing. It is who should make the prediction, what basis it should rest on, and who should be accountable when it is wrong.
Where effect size assumptions come from
In practice, the assumed effect size is drawn from some combination of four sources, each with characteristic strengths and weaknesses.
Prior trials of the same treatment. Phase II results, dose-ranging studies, or earlier phase III trials in related populations provide the most direct evidence about what the treatment does. The problem is that prior trials are subject to selection bias in both directions: positive results are more likely to proceed to phase III, inflating the apparent effect size, while negative results in closely related populations may not be weighted appropriately if the design team is committed to a specific development hypothesis. The phase II to phase III attrition rate reflects, in part, the systematic overestimation of effect size at earlier stages of development.
Historical data from the control arm. For trials comparing a new treatment to an active comparator, the expected control arm outcome—the baseline against which the treatment effect is measured—is often derived from historical trials of the comparator. This estimate is subject to the same concerns about population differences and treatment context evolution that affect the non-inferiority margin. Control arm event rates in historical trials may not reflect what will happen in the current trial, particularly if standard-of-care background therapy has changed substantially.
Expert clinical judgment. When quantitative data are insufficient, the expected effect size may be based on the clinical team’s assessment of what the treatment should do, given its mechanism of action and the nature of the disease. Expert judgment is a legitimate input, but it is also the source most likely to reflect optimism bias—the systematic tendency for people developing a treatment to believe it will work better than the evidence strictly supports. Expert judgment should be documented, subjected to independent review, and connected to a specific biological or clinical rationale that can be evaluated by others.
The minimum clinically important difference. Rather than predicting what the treatment will do, the trial can be powered to detect the smallest treatment effect that would be clinically meaningful. This approach avoids the prediction problem by asking a different question: if the treatment produces at least this much benefit, is that enough to change practice? Powering for the minimum clinically important difference is conservative—it may require a larger trial than powering for the expected effect—but it produces a design that is coherent in a specific way: a positive trial result will always be clinically meaningful, because the minimum meaningful threshold was the design target.
Each of these sources has a place. In most trials, the effect size assumption draws on all four, and the design team must make a judgment about how to weight them. The judgment should be documented—not as a sensitivity analysis note in the SAP, but as a primary design rationale that can be reviewed and challenged before the trial begins.
The optimism problem
There is substantial empirical evidence that assumed effect sizes in sample size calculations are systematically optimistic. The mechanism is not hard to understand: the people designing the trial are also the people most committed to the hypothesis that the treatment works. They have invested in the development program. They have seen the early efficacy signals. They believe in the mechanism. These are not disqualifying biases; they are the natural disposition of a team doing its job. But they are biases, and they push the assumed effect size upward.
The consequences of optimism are asymmetric. A trial powered for an effect size of 30% relative risk reduction that finds a 20% relative risk reduction is underpowered—it may not reach statistical significance even if the treatment genuinely works. This outcome is damaging in multiple ways: the treatment may be abandoned or delayed, patients who might have benefited from early approval do not receive it, and the sponsor must either run an additional trial or accept a negative result that does not reflect the treatment’s actual benefit. The opposite error—assuming too small an effect—costs only a somewhat larger trial. An underpowered trial that misses significance is not evidence that the treatment failed; it is a trial that asked its question too boldly and could not answer it.
The correction for optimism is calibration—comparing assumed effect sizes in prior trials in similar indications to the effect sizes those trials actually found, and using that comparison to apply a systematic discount to the current assumption. This approach is rarely used in practice, because it requires a database of prior assumptions that are not routinely published, and because it produces a larger trial than the uncorrected assumption would require. But the evidence supports it: in indications where calibration data are available, the calibrated assumption has been closer to the realized effect than the uncalibrated one.
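Where calibration data exist, the arithmetic itself is simple. The sketch below assumes a hypothetical set of prior trials for which both the assumed and the realized hazard ratios are known, averages the shortfall on the log hazard ratio scale, and applies that discount to the current assumption; the trial values and the averaging choice are illustrative, not a standard method.

```python
# Illustrative calibration of an assumed hazard ratio against prior trials.
# The (assumed, observed) pairs below are hypothetical; in practice they would
# come from a curated database of prior sample size justifications and results.
import math

prior_trials = [(0.70, 0.82), (0.75, 0.80), (0.72, 0.90), (0.78, 0.84)]

# Average shortfall on the log scale (positive = realized effect was weaker).
shortfalls = [math.log(observed / assumed) for assumed, observed in prior_trials]
mean_shortfall = sum(shortfalls) / len(shortfalls)

# Apply the same discount to the current assumption.
assumed_hr = 0.75
calibrated_hr = math.exp(math.log(assumed_hr) + mean_shortfall)
print(f"mean log-HR shortfall: {mean_shortfall:.3f}")
print(f"assumed HR {assumed_hr:.2f} -> calibrated HR {calibrated_hr:.2f}")
```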
The more common correction is scenario analysis: the sample size is calculated for two or three effect size assumptions—an optimistic assumption, a central assumption, and a conservative assumption—and the trial size is chosen based on a judgment about which scenario the design team is willing to commit to. This approach at least makes the uncertainty explicit. It does not resolve the optimism problem, but it converts an unexamined assumption into an examined range with an explicitly chosen operating point.
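A sketch of what such a scenario table could look like for a time-to-event endpoint, using the Schoenfeld approximation for required events under 1:1 allocation; the hazard ratios, the 90% power target, and the per-patient event probability are illustrative assumptions.

```python
# Illustrative three-scenario sample size table for a time-to-event endpoint,
# using the Schoenfeld approximation for required events (1:1 allocation).
import math
from scipy.stats import norm

alpha, power = 0.05, 0.90
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)

scenarios = {"optimistic": 0.70, "central": 0.75, "conservative": 0.82}
event_prob = 0.35  # assumed probability that an enrolled patient has an event

for label, hr in scenarios.items():
    events = 4 * z ** 2 / math.log(hr) ** 2   # required events under the alternative
    patients = events / event_prob            # implied enrolment
    print(f"{label:>12}: HR {hr:.2f} -> {events:5.0f} events, ~{patients:6.0f} patients")
```

The point of the table is the spread: in this illustration the conservative scenario requires roughly three times the events of the optimistic one, and the operating point chosen from that spread is the commitment the design team is actually making.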
Who should own the effect size assumption
The effect size assumption is a clinical prediction. It is a claim about what the treatment will do—how much better patients in the treatment arm will fare on the primary endpoint than patients in the control arm, in the population defined by the estimand. This prediction is not statistical. It is clinical and biological. The statistician translates it into a power calculation, but the translation is not the judgment.
The judgment belongs to the clinical team: the physicians, the clinical scientists, the disease-area experts who understand what this treatment does in this disease, based on the available evidence. They should be the ones to propose the assumed effect size, to defend it against scrutiny, and to acknowledge its uncertainty. When the statistician proposes the effect size because the clinical team did not engage with the question, the prediction is made without the judgment of the people most capable of making it—and the accountability for the prediction’s accuracy is assigned to the wrong party.
This is not merely an organizational preference. It has practical consequences. A clinical team that owns the effect size assumption will examine it more carefully—will ask whether it is consistent with the mechanism, whether it matches the phase II data, whether it assumes something about the patient population that is actually true. A clinical team that accepts the statistician’s proposed assumption without engagement will not. And when the trial misses its power calculation—when the observed effect is smaller than assumed—the question of who made the wrong prediction will matter for whether the program continues and on what terms.
Ownership also matters for documentation. A regulatory agency reviewing a sample size justification wants to know: where did this assumed effect come from, and why is it credible? A justification that says “based on prior literature, a hazard ratio of 0.75 was assumed” is not a justification. A justification that says “based on the following three trials in comparable populations, with the following analysis of their effect size estimates and the following reasoning about the differences between those trials and the current design, the central expected hazard ratio is 0.75, with a plausible range of 0.70 to 0.82”—that is a justification, and it is the clinical team’s document, not the statistician’s.
The clinically meaningful effect versus the expected effect
There is a design choice embedded in the effect size assumption that is rarely made explicit: should the trial be powered for the expected treatment effect, or for the minimum clinically meaningful treatment effect?
These are different targets, and they produce different trial designs.
Powering for the expected effect produces the smallest trial that has adequate probability of detecting what the treatment is actually expected to do. It is efficient. It is also contingent: if the expected effect turns out to be optimistic—if the actual treatment effect is smaller than assumed—the trial will be underpowered, and the result may be inconclusive even if the treatment genuinely works.
Powering for the minimum clinically meaningful effect produces a larger trial that can detect any effect large enough to be worth using the treatment. It is conservative. It has the property that if the trial succeeds, the result is clinically meaningful by construction—the effect was at least as large as the minimum useful threshold. It has the cost that if the expected effect is substantially larger than the minimum clinically meaningful threshold, the trial is larger than it needs to be, enrolling more patients than required to answer the question.
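The difference is easy to make concrete. The sketch below compares the two targets for a continuous endpoint using the standard two-sample normal approximation; the expected difference, the minimum meaningful difference, and the standard deviation are illustrative values, not recommendations.

```python
# Illustrative comparison: trial size when powered for the expected effect
# versus the minimum clinically meaningful effect (continuous endpoint).
from scipy.stats import norm

def n_per_arm(delta, sd, alpha=0.05, power=0.90):
    """Two-sample normal-approximation sample size per arm."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * sd / delta) ** 2

expected_effect = 2.1     # expected mean difference (units of the endpoint)
minimum_meaningful = 1.5  # hypothetical minimum clinically meaningful difference
sd = 6.0                  # assumed common standard deviation

print(f"powered for expected effect:           {n_per_arm(expected_effect, sd):.0f} per arm")
print(f"powered for minimum meaningful effect: {n_per_arm(minimum_meaningful, sd):.0f} per arm")
```

In this illustration the conservative target roughly doubles the trial; whether that premium is worth paying is exactly the judgment described in the next paragraph.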
The choice between these two approaches depends on the consequences of an underpowered trial versus the consequences of a larger trial. In indications where an underpowered trial would delay access to a beneficial treatment for patients with serious unmet need, the cost of underpower is high, and a more conservative sample size may be justified. In indications where the treatment is adjunctive and the clinical decision can be revisited with additional trials, the cost of underpower is lower. In indications where the minimum clinically meaningful effect is not well-established—where the profession has not reached consensus on what magnitude of benefit would change practice—powering for the minimum meaningful effect requires resolving that prior question, which is often not done.
Both choices are defensible. Neither should be made by default. And whichever is made should be documented: we are powering for the expected effect because X, or we are powering for the minimum meaningful effect because Y. The distinction affects how the result is interpreted if the trial barely misses statistical significance, and how it is interpreted if it succeeds by a narrow margin.
Sensitivity of power to effect size assumption
Power is highly sensitive to the effect size assumption, and this sensitivity is systematically underappreciated in trial design.
For a typical superiority trial, halving the assumed effect size approximately quadruples the required sample size. This is a consequence of the inverse-square relationship between effect size and required sample size in most power formulas—a relationship that means small errors in the assumed effect size translate into large errors in the required sample. An effect size assumption that is 20% too optimistic implies a required sample size more than 50% larger than the one planned, and leaves the planned trial with roughly 60-75% power against the true effect instead of the nominal 80-90%—a trial that may well fail to reach statistical significance even when the treatment genuinely works at the true magnitude.
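The relationship is visible in the simplest form of the calculation. Under the two-sample normal approximation, with δ the assumed treatment difference, σ the common standard deviation, and the z terms the standard normal quantiles for the significance level and target power, the required sample size per arm is

```latex
n \;=\; \frac{2\,\sigma^{2}\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{\delta^{2}},
\qquad\text{so}\qquad
n \;\propto\; \frac{1}{\delta^{2}} .
```

Halving δ multiplies n by four, and an assumed effect 20% larger than the truth means planning with only (0.8)² = 64% of the sample the true effect requires.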
This sensitivity should prompt scenario analyses in every trial design. What is the power if the true effect is 20% smaller than assumed? 30% smaller? If the power at the pessimistic scenario is unacceptable—below 60%, say—the design team must either increase the sample size (by powering for a smaller effect), accept the risk of underpower, or plan a pre-specified sample size re-estimation to address the uncertainty prospectively. All three options have costs. The decision among them is a risk allocation decision that the design team must make explicitly.
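A sketch of that check under the two-sample normal approximation, assuming a trial sized for the central effect at 90% nominal power with a two-sided α of 0.05; both are illustrative choices.

```python
# Power against a true effect smaller than the assumed one, for a trial sized
# to the assumed effect (two-sample normal approximation).
from scipy.stats import norm

alpha, nominal_power = 0.05, 0.90
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(nominal_power)

for shrink in (0.0, 0.2, 0.3):
    # The noncentrality parameter scales linearly with the true effect, so a true
    # effect at (1 - shrink) of the assumed one scales (z_alpha + z_beta) the same way.
    achieved = norm.cdf((1 - shrink) * (z_alpha + z_beta) - z_alpha)
    print(f"true effect = {100 * (1 - shrink):3.0f}% of assumed -> power {achieved:.0%}")
```

Under these assumptions, power falls from 90% to roughly 74% when the true effect is 20% smaller and to roughly 62% when it is 30% smaller—close to the 60% threshold used above as an example of an unacceptable pessimistic scenario.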
What it should not be is a decision made by ignoring the sensitivity—by reporting power at the central assumed effect size and moving on to the next design decision without examining the power curve. The power at the central assumption is a single point on that curve. The full curve—power across the plausible range of true effects—is what the trial is actually committing to, and it should be examined before the commitment is made.
What this section demands before proceeding
Before the variance and dropout assumptions of Section 3.2 can be addressed, the effect size assumption must be locked—not as a final determination that will never be revisited, but as a documented commitment to a specific prediction with a specific rationale.
The lock requires: a specific numerical value in the units of the primary effect measure, the sources from which that value was derived with an assessment of each source’s applicability to the current trial, an explicit acknowledgment of the optimism risk and any calibration or discount applied to address it, the name of the clinical team member who owns the prediction, and a power curve that shows power across a plausible range of true effects.
Without this, the variance and dropout assumptions of Section 3.2 will be layered onto a sample size whose effect size foundation has not been examined. The errors will compound. The result will look precise. The precision will be false.
References: Ioannidis, “Why Most Published Research Findings Are False,” PLoS Med 2005; Button et al., “Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience,” Nat Rev Neurosci 2013; Altman, “Statistics and Ethics in Medical Research: III. How Large a Sample?” BMJ 1980; Schulz and Grimes, “Sample Size Calculations in Randomised Trials: Mandatory and Mystical,” Lancet 2005.