Chapter 3: What Are We Committing To?
The question Chapter 2 leaves open
Chapter 2 settled the effect measure. It named the primary summary statistic, connected it to the estimand, stated its assumptions, and—for non-inferiority designs—derived the margin from first principles. What it did not answer is this: how large does the trial need to be?
That question sounds technical. It is not. It is the question of what the trial is willing to commit to believing before the data are seen.
A sample size calculation is not a formula. It is a prediction—about the size of the treatment effect, the variability of the outcome, the rate at which events accumulate, the proportion of patients who will complete the trial. Every assumption in the calculation is a claim about what the trial expects to find. When the assumptions are right, the trial has adequate power. When they are wrong, the trial may be underpowered and unable to detect a real effect, or overpowered and wasteful of patients and resources, or misspecified in ways that invalidate the power claim entirely.
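The point can be made concrete. A minimal sketch of a two-sample comparison of means (normal approximation; all numbers below are illustrative, not drawn from any particular trial) shows that every input to the calculation is one of those predictions:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.80, dropout=0.10):
    """Per-arm sample size for a two-sample comparison of means
    (normal approximation). Every argument is a prediction:
    delta   -- the treatment effect the trial expects to find
    sd      -- the assumed variability of the outcome
    dropout -- the assumed proportion lost before the endpoint
    """
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2
    return ceil(n / (1 - dropout))  # inflate for expected attrition
```

Change any assumption and the commitment changes with it: `n_per_arm(5, 10)` gives 70 per arm, while assuming a modestly smaller effect, `n_per_arm(4, 10)`, gives 110.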
The commitments embedded in a sample size calculation are not symmetric in their consequences. Underestimating the treatment effect produces an underpowered trial—a trial that may miss a real benefit and expose patients to the intervention and the control condition without generating a conclusive answer. Overestimating the treatment effect produces a false precision—a trial designed to detect a magnitude of effect that was never plausible, powered for a question that was never the right one. Both failures are expensive. Both are common. And both are attributable to assumptions that were not examined carefully at design.
This chapter is about those assumptions—about what the trial is committing to when it commits to a sample size.
The decision structure of this chapter
The sample size calculation contains four distinct categories of assumption, each corresponding to a section of this chapter.
Section 3.1 — Effect Size Ownership addresses the assumption that is simultaneously the most consequential and the most contested: the expected treatment effect. This is the effect size the trial is designed to detect—the magnitude of benefit that, if real, the trial will have adequate power to confirm. The assumed effect size is a prediction. It may be informed by prior trials, biological reasoning, clinical experience, or dose-ranging studies. But it is always, ultimately, a prediction, and predictions are wrong at a rate that should be uncomfortable for anyone who has looked at the history of phase III clinical trials. This section asks who should make this prediction, on what basis, and what it means to own it.
Section 3.2 — Variance and Dropout addresses the assumptions about outcome variability and patient retention that interact with the effect size assumption to determine the required sample size. For continuous outcomes, the variance of the primary measure must be estimated. For time-to-event outcomes, the hazard rate in the control arm must be estimated. For binary outcomes, the baseline event probability must be estimated. Each of these estimates carries uncertainty, and each uncertainty compounds with the others. The section examines how these uncertainties propagate through the sample size calculation and what happens when they are wrong.
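One way to see the propagation is to ask what power the trial actually delivers when a nuisance parameter was misjudged at design. A sketch for a continuous outcome, assuming a two-sample normal approximation and illustrative numbers:

```python
from statistics import NormalDist

def achieved_power(n_per_arm, delta, true_sd, alpha=0.05):
    """Power actually delivered by a fixed per-arm sample size when
    the true outcome standard deviation differs from the one the
    design assumed (two-sample normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    ncp = delta / (true_sd * (2 / n_per_arm) ** 0.5)  # noncentrality
    return nd.cdf(ncp - z_alpha)
```

A trial sized at 63 per arm for an effect of 5 and an sd of 10 carries about 80% power; if the true sd is 12 (a 20% underestimate at design), the delivered power falls to roughly 65%. No one has to make an arithmetic error for this to happen; the formula was right and the prediction was wrong.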
Section 3.3 — Event Rate Uncertainty focuses specifically on the challenge of estimating event rates in time-to-event trials—a problem that deserves separate treatment because it combines baseline rate uncertainty with follow-up duration assumptions, recruitment pace assumptions, and censoring assumptions in ways that are multiplicatively, not additively, wrong when they fail together.
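The multiplicative failure mode can be sketched under deliberately simple assumptions (exponential event times, a constant control-arm hazard, full follow-up for every patient; all rates and durations below are illustrative):

```python
from math import ceil, exp, log
from statistics import NormalDist

def events_required(hr, alpha=0.05, power=0.90):
    """Schoenfeld approximation: events needed to detect hazard
    ratio hr with 1:1 allocation and a two-sided test."""
    z = NormalDist().inv_cdf
    return ceil(4 * (z(1 - alpha / 2) + z(power)) ** 2 / log(hr) ** 2)

def expected_events(n, control_hazard, hr, followup_years):
    """Expected event count assuming exponential event times, a
    constant annual control-arm hazard, and no censoring other
    than the administrative end of follow-up."""
    p = lambda rate: 1 - exp(-rate * followup_years)
    return n / 2 * (p(control_hazard) + p(control_hazard * hr))
```

Detecting a hazard ratio of 0.75 at 90% power needs about 508 events, but 1,000 patients followed for three years at a 10% annual control hazard produce only about 230. And because expected events scale roughly with rate times time, a control rate that comes in 20% low and follow-up that runs 20% short do not add their shortfalls; they multiply them.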
Section 3.4 — Power as Risk Budget reframes the power calculation. Power is conventionally described as the probability of detecting a real effect. This section recasts it as a budget—a statement of the risk the trial is willing to accept of missing a real effect. That budget is allocated across the design: more power here means fewer resources there, and higher power means a larger trial, which means longer enrollment, which means later evidence for patients who need it. Power is not a target to be maximized. It is a resource to be allocated.
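The budget framing has a concrete exchange rate. Under the usual normal approximation, required sample size scales with the squared sum of the two critical values, so the price of extra power is easy to compute (a sketch, not tied to any particular design):

```python
from statistics import NormalDist

def power_cost_ratio(power_a, power_b, alpha=0.05):
    """Ratio of required sample sizes at two power levels, holding
    every other assumption fixed (normal approximation)."""
    z = NormalDist().inv_cdf
    cost = lambda p: (z(1 - alpha / 2) + z(p)) ** 2
    return cost(power_b) / cost(power_a)
```

Moving from 80% to 90% power costs about 34% more patients. Whether that purchase is worth the extra enrollment time and delayed evidence is precisely the allocation question this section raises.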
What unites the chapter
The sections of this chapter are unified by a single theme: the sample size calculation is a public commitment, and its assumptions must be owned by the people who make them.
In current practice, sample size calculations are most commonly produced by statisticians and presented to clinical and regulatory teams as deliverables. The clinical team may review the assumed effect size—sometimes—and accept or challenge it. The assumptions about variance, event rates, and dropout are rarely examined by anyone outside the biostatistics function. The power level is set at 80% or 90% by convention, with minimal examination of whether those levels are appropriate for this trial, this indication, and this decision context.
This allocation of responsibility is backward. The effect size assumption is a clinical prediction about what the treatment will do. The event rate assumption is a clinical prediction about what will happen to control arm patients over the trial’s follow-up. The dropout assumption is a clinical prediction about patient behavior in the trial context. These are not statistical inputs; they are clinical judgments that have been delegated to the statistician because they must be expressed as numbers, and the statistician is the person who works with numbers.
When these assumptions turn out to be wrong—and many of them will be wrong—the trial either fails or is extended or is redesigned. The accountability for those consequences belongs to the people who made the predictions, not just the people who translated the predictions into a power calculation. When those people are identified as “the statisticians,” the accountability structure is wrong, and the incentives for careful prediction are weak.
This chapter asks for correct accountability. Not as an organizational reform, but as a design discipline: every assumption in the sample size calculation should be traceable to a person who made it, a rationale that supports it, and an acknowledgment of what happens if it is wrong.
The connection to what surrounds this chapter
Chapter 3 sits between the definition of what is being measured (Chapter 2) and the question of when the trial might stop early (Chapter 4). Its position is not incidental.
The connection to Chapter 2 is direct: the effect measure determines the form of the power calculation. The variance structure of a risk difference is not the same as that of a log hazard ratio. The number of events required for adequate power under a hazard ratio analysis is not the same as the sample size required for an RMST analysis at a specified time horizon. If the effect measure was not settled in Chapter 2, the sample size calculation in Chapter 3 cannot be correctly specified.
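The dependence on the effect measure is concrete. A risk-difference design is sized in patients from binomial variances, while a hazard-ratio design is sized in events; the two calculations share no inputs beyond alpha and power. A sketch of the risk-difference case, with illustrative proportions:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm_risk_difference(p_control, p_treat, alpha=0.05, power=0.80):
    """Per-arm sample size for a risk difference, using the unpooled
    binomial variance and a normal approximation."""
    z = NormalDist().inv_cdf
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    effect = p_control - p_treat
    return ceil((z(1 - alpha / 2) + z(power)) ** 2 * variance / effect ** 2)
```

For assumed risks of 30% versus 20%, this gives about 291 per arm. The same clinical benefit expressed as a hazard ratio would be powered in events, not patients, which is why the effect measure must be settled before the sample size can be.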
The connection to Chapter 4 is also direct: the sample size determines the context in which interim analyses are planned. An interim analysis at 50% of the target information fraction means something different in a trial with 200 events than in a trial with 800 events. The operating characteristics of an interim stopping rule—the probability of stopping early under the null, the probability of stopping early when the treatment effect is real—depend on the total information the trial is designed to accumulate. If that total is misspecified, the interim analysis plan is misspecified along with it.
These connections are sequential and irreversible. Getting Chapter 3 wrong does not just waste resources. It distorts Chapter 4. And a distorted Chapter 4 produces stopping rules that behave differently than planned—which is exactly the kind of design failure that is most difficult to explain after the fact.
What this chapter is not about
This chapter does not provide formulas for sample size calculation. Those are available in every statistical textbook and in every sample size software package. This chapter is not about how to calculate a sample size. It is about what the calculation is for, what it commits the trial to, and what happens when the commitments are not examined.
It is also not about adaptive sample size re-estimation, which is addressed in Chapter 7. The fixed sample size design—in which the total sample size is determined before enrollment begins and does not change in response to interim data—is the baseline against which adaptive designs are evaluated. Understanding what the fixed design commits to is prerequisite to understanding why and when adaptation is warranted.
The question this chapter must answer
By the end of this chapter, the following must be documented and owned.
The assumed effect size—the magnitude of treatment benefit the trial is designed to detect—is specified in the units of the primary effect measure, derived from sources that are identified and examined for their applicability to this trial, and owned by the clinical team as a clinical prediction about what the treatment will do.
The nuisance parameters—variance for continuous outcomes, baseline event rates for time-to-event and binary outcomes, expected dropout rates—are specified, with their sources identified and their uncertainty quantified. Where the uncertainty is large, its implications for the sample size calculation are documented.
The power level is justified—not merely stated. The justification connects the chosen power to the decision context: Why 80% and not 90%? Why 90% and not 80%? Who bears the consequences if the trial is underpowered?
These are not administrative requirements. They are the minimum needed to make the sample size calculation a commitment rather than a mere computation—a statement of what the trial believes about its treatment and its patients, expressed as a number, owned by the people whose judgment produced it.
Chapter 2 defined how the difference will be expressed. Chapter 3 defines how much evidence will be generated to express it. Both are required before the trial can begin.