Chapter 2: How Will We Measure the Difference?

The question Chapter 1 leaves open

Chapter 1 settled the estimand. It named the population, the variable, the intercurrent event strategy, and the population-level summary. It assigned ownership. It required that the question be stated precisely enough that a second statistician, reading the protocol, would construct the same primary analysis.

What it did not settle is how the difference between arms will be expressed.

This is not a minor technical detail. It is the decision that determines what the trial can claim, how efficiently it can claim it, and whether the claim will be legible to the people who need to act on it—prescribers, payers, regulators, and patients. Two trials with identical estimands, identical populations, identical data, and identical true treatment effects can produce results that feel completely different depending on the effect measure chosen. One result reads as a modest improvement. The other reads as a clinically compelling advance. The data are the same. The framing is different. And framing, in the context of regulatory submissions and label negotiations, is not cosmetic.
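The framing gap is easy to reproduce with hypothetical numbers. In the sketch below, the event rates are invented for illustration; the same pair of rates yields a result that reads as dramatic on the relative scale and as modest on the absolute scale:

```python
# Hypothetical 5-year event rates in two arms of the same trial
# (illustrative values, not from any real study).
control_rate = 0.02   # 2% of control patients have the event
treated_rate = 0.01   # 1% of treated patients have the event

# Relative framing: the treatment halves the risk.
relative_risk_reduction = 1 - treated_rate / control_rate    # 0.50, read as "50% reduction"

# Absolute framing: one event prevented per hundred patients treated.
absolute_risk_reduction = control_rate - treated_rate        # 0.01, read as "1 percentage point"
number_needed_to_treat = 1 / absolute_risk_reduction         # 100

print(f"RRR = {relative_risk_reduction:.0%}, "
      f"ARR = {absolute_risk_reduction:.1%}, "
      f"NNT = {number_needed_to_treat:.0f}")
```

A "50% risk reduction" and "one event prevented per hundred patients treated" describe the same data; which one leads the result is a framing decision, not a statistical one.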

The choice of effect measure is a commitment. It encodes assumptions about how the treatment effect distributes across the population, what kind of comparison is scientifically appropriate, and what claim the result is intended to support. Those assumptions should be explicit—not because explicitness is a regulatory requirement, though it increasingly is, but because implicit assumptions are the ones that fail silently.


The problem this chapter addresses

There is a standard way effect measures are chosen in clinical trial design: the clinical team proposes an endpoint, the statistician proposes the natural summary measure for that endpoint type—risk difference for binary outcomes, hazard ratio for time-to-event, mean difference for continuous—and the design proceeds. The choice of effect measure is treated as a consequence of the endpoint type, not as an independent decision.

This is wrong in a subtle but important way.

The endpoint type constrains the available effect measures. It does not determine which one is scientifically appropriate. A binary outcome can be summarized as a risk difference, a risk ratio, or an odds ratio. Each implies a different model of how the treatment effect scales across patient subgroups with different baseline risks. A time-to-event outcome can be summarized as a hazard ratio, a restricted mean survival time difference, a milestone survival difference, or a win ratio. Each makes different assumptions about the shape of the treatment effect over time. A continuous outcome can be summarized as a mean difference, a median difference, a proportion achieving response, or a distribution-free rank-based statistic. Each answers a subtly different question about what the treatment does and for whom.
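The point about binary outcomes can be made concrete. The sketch below assumes, purely for illustration, a treatment whose risk ratio is a constant 0.8 in every subgroup; even then, the risk difference and the odds ratio vary with baseline risk, so constancy on one scale implies non-constancy on the others:

```python
# Sketch: one treatment effect, fixed on the risk-ratio scale, evaluated at
# two hypothetical baseline risks. The risk ratio is constant by construction;
# the risk difference and odds ratio are not.
def summaries(baseline_risk, risk_ratio=0.8):
    treated_risk = baseline_risk * risk_ratio
    rd = baseline_risk - treated_risk                    # risk difference
    rr = treated_risk / baseline_risk                    # risk ratio (0.8 by design)
    odds = lambda p: p / (1 - p)
    odds_ratio = odds(treated_risk) / odds(baseline_risk)
    return rd, rr, odds_ratio

low_risk_group = summaries(0.05)    # e.g., a low-risk subgroup
high_risk_group = summaries(0.40)   # e.g., a high-risk subgroup
# Same risk ratio in both groups; different risk differences and odds ratios.
```

Whichever measure is held constant across subgroups is a modeling commitment; the data cannot make all three constant at once.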

The choice among these is a scientific choice, not a statistical one. It should follow from the estimand—from the question that has been settled—rather than from the convention for this outcome type. When it follows from convention instead, the effect measure may be technically appropriate and scientifically misaligned: producing a valid estimate of something that is not quite what the trial was designed to show.

This chapter asks the question that the standard approach skips: given the estimand that has been specified, which effect measure best expresses the treatment effect it defines, can be estimated with the available data, and will support the claim the trial is intended to make?


What this chapter covers

Section 2.1 — Effect Measures examines the major families of effect measures and what each one commits a trial to. The focus is not on the statistical properties of the measures—those are well-documented—but on the scientific and interpretive commitments they carry. A hazard ratio is a proportional measure; it implies that the treatment’s relative benefit is constant across the follow-up period and homogeneous across patients with different baseline risks. That implication is an assumption. When the assumption holds, the hazard ratio is efficient and interpretable. When it does not—when hazards are non-proportional, when the treatment effect diminishes over time, when the trial population is heterogeneous in ways that matter—the hazard ratio produces a number that is statistically valid and clinically misleading. The section examines when this happens and what the alternatives are.
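The non-proportional case can be sketched numerically under assumed hazards. In the hypothetical scenario below (monthly hazard rates invented for illustration), the treatment's hazard matches control for six months and then halves—a delayed effect. No single hazard ratio describes that follow-up period, but the restricted mean survival time gain over a fixed horizon is still well defined:

```python
import math

# Hypothetical piecewise-constant hazards (per month); a delayed-effect scenario.
def hazard_control(t):
    return 0.10

def hazard_treated(t):
    return 0.10 if t < 6 else 0.05   # no effect for 6 months, then hazard halves

def rmst(hazard, horizon=24.0, dt=0.001):
    """Restricted mean survival time: area under S(t) up to the horizon,
    with S(t) = exp(-cumulative hazard), integrated on a fine grid."""
    cum_hazard, area, t = 0.0, 0.0, 0.0
    while t < horizon:
        area += math.exp(-cum_hazard) * dt
        cum_hazard += hazard(t) * dt
        t += dt
    return area

# Extra event-free time over 24 months, in months (about 1.9 here).
rmst_gain = rmst(hazard_treated) - rmst(hazard_control)
```

The instantaneous hazard ratio is 1.0 early and 0.5 late; any single fitted hazard ratio averages over that change in a model-dependent way, while the RMST difference answers a plain question: how much event-free time, on average, over the chosen horizon.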

Section 2.2 — Interpretability and Efficiency examines the tension between what is statistically efficient and what is clinically meaningful. These two objectives do not always align. The most powerful test of a treatment effect is not always the test whose result is easiest to interpret. The number needed to treat is interpretable but inefficient. The odds ratio is efficient but frequently misread. The restricted mean survival time is interpretable and robust to non-proportional hazards but requires a time horizon choice that is itself a commitment. The section asks how this tension should be navigated—not by prescribing a single answer, but by making the trade-offs visible.

Section 2.3 — The Non-Inferiority Margin addresses the specific and consequential decision that arises when the trial is designed to show that a new treatment is not unacceptably worse than an active comparator. The non-inferiority margin is the threshold of acceptable inferiority—the amount by which the new treatment may fall short of the comparator and still be considered acceptable. Defining this margin requires answering a question that is simultaneously clinical, historical, and ethical: how much efficacy can be traded for something else—safety, convenience, tolerability, cost—without betraying patients? The section examines who should own this question and what happens when the answer is borrowed instead of earned.
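One common construction, often called the fixed-margin approach, can be sketched with invented historical numbers. Every figure below is illustrative, and real margin derivation involves clinical and regulatory judgment that the arithmetic cannot supply:

```python
# Hypothetical historical evidence: the active comparator reduced absolute
# event risk vs placebo by 6 percentage points (95% CI: 3 to 9 points).
# These numbers are illustrative, not from any real meta-analysis.
historical_ci_lower = 0.03   # conservative end of the historical effect (M1)

# Fixed-margin construction: take the effect the comparator can reliably be
# assumed to have (M1), decide what fraction of it the new treatment must
# preserve, and set the margin to the remainder (M2).
M1 = historical_ci_lower
retention = 0.50                   # the new treatment must keep 50% of M1
M2 = (1 - retention) * M1          # margin: 1.5 percentage points
```

Note what the arithmetic conceals: the choice of the conservative CI bound, the retention fraction, and the assumption that the historical effect carries over to the new trial population are each clinical and ethical judgments, which is why the section asks who owns them.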


The decision structure of this chapter

The three sections of this chapter are not independent. They address a single decision sequence.

The effect measure determines what the primary result will say. It shapes the statistical power of the primary test, the interpretability of the estimate, and the kind of claim the result will support. It must be chosen before sample size can be calculated—because the sample size required depends on the variance structure of the chosen measure—and before the non-inferiority margin can be defined—because the margin is expressed in the units of the effect measure.
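The dependence of sample size on the chosen measure can be illustrated with standard normal-approximation formulas. In the sketch below, the event rates are hypothetical design inputs; the same scenario powered on the risk difference and on the log risk ratio yields different per-arm sample sizes because the two measures have different variance structures:

```python
import math
from statistics import NormalDist

# Hypothetical design inputs: assumed event rates, two-sided alpha, power.
p_control, p_treated = 0.30, 0.20
alpha, power = 0.05, 0.80
z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)

# Powered on the risk difference (normal approximation, equal allocation):
var_rd = p_control * (1 - p_control) + p_treated * (1 - p_treated)
n_rd = math.ceil(z**2 * var_rd / (p_control - p_treated) ** 2)

# Powered on the log risk ratio for the same scenario:
var_log_rr = (1 - p_control) / p_control + (1 - p_treated) / p_treated
n_rr = math.ceil(z**2 * var_log_rr / math.log(p_treated / p_control) ** 2)

# Same assumed rates, same alpha and power, different n per arm.
print(f"per-arm n: risk difference = {n_rd}, log risk ratio = {n_rr}")
```

The gap here is modest, but it is not zero, and in scenarios with rarer events or larger relative effects it widens—which is why the measure must be fixed before the sample size is.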

This sequence is often violated. Sample sizes are calculated before effect measures are examined. Non-inferiority margins are borrowed from historical trials that used different effect measures. The result is a design where the components are each defensible in isolation but incoherent in combination—where the sample size was calculated for a risk difference but the primary analysis reports a hazard ratio, or where the non-inferiority margin was defined in absolute risk terms but the trial is powered on a relative measure.

These inconsistencies do not always surface during the trial. They surface at the regulatory review, when the agency asks how the margin was justified given the effect measure, or during the label negotiation, when the sponsor discovers that the result cannot support the claim they intended. At that point, the inconsistency is not a design problem. It is a submission problem, and it is much harder to fix.


What this chapter is not about

This chapter does not provide a guide to selecting statistical tests. It does not compare the type I error properties of different test statistics, explain the computational mechanics of the restricted mean survival time, or evaluate the relative performance of different estimators under various missing data mechanisms.

Those questions matter, and they have answers, documented in statistical literature that is more current and more complete than this book can be. What this chapter provides is the prior question: before selecting a test or an estimator, what is the effect measure supposed to represent, who is it for, and what claim will it support?

A trial team that can answer that question will select appropriate statistical methods from the available literature. A trial team that selects statistical methods without answering that question will produce technically valid analyses of unclear scientific meaning—which is the specific failure mode this book exists to prevent.


The connection to what follows

Chapter 2’s decisions constrain Chapter 3. The effect measure determines what effect size must be assumed when calculating sample size—and the assumptions embedded in that effect size are among the highest-stakes commitments the trial makes. A hazard ratio assumption implies a model of how events accumulate over time. A risk difference assumption implies a model of baseline event rates in the control arm. A mean difference assumption implies a model of within-patient variability. Each of these models can be wrong, and each one being wrong has a different consequence for whether the trial succeeds.

Chapter 2 also connects to Chapter 6. The effect measure shapes what can be claimed when the trial is over. A result expressed as a hazard ratio supports a different label claim than the same result expressed as a restricted mean survival time difference, even if the underlying data are identical. The claim discipline of Chapter 6 begins with the effect measure choice of Chapter 2—and a measure chosen for statistical efficiency without considering what claim it will support may produce a result that is statistically clean and commercially marginal.

These connections are not incidental. They are the reason the effect measure decision belongs at the beginning of the design process, immediately after the estimand is settled, rather than as a consequence of endpoint type that no one explicitly chose.


The question Chapter 2 must answer

By the end of this chapter, the following question must have a documented answer: given the estimand specified in Chapter 1, what effect measure will be used to express the treatment difference, why is that measure scientifically appropriate for the question being asked, what assumptions does it carry, and what claim will a statistically significant result support?

If a non-inferiority design is being used, there is an additional question: what is the non-inferiority margin, who derived it, and on what evidence does it rest?

These are not questions that can be deferred to the statistical analysis plan (SAP). The effect measure and the non-inferiority margin—if applicable—must be in the protocol, because the sample size is calculated from them, the primary hypothesis is expressed in their terms, and the regulatory agency will review them as part of the design justification. They are design commitments, not analysis parameters.

Chapter 1 defined what is being estimated. Chapter 2 defines how the estimate will be expressed and what it will be compared against. Both are required before the trial can be sized.