2.1 Effect Measures

The measure is not the outcome

When a trial is over, the result is expressed as a number: a hazard ratio of 0.78, a risk difference of 4.2 percentage points, a mean difference of 1.8 units on a validated scale. This number is how the trial communicates its finding to regulators, prescribers, and payers. It is also the number that will appear in the label, in the meta-analyses, in the cost-effectiveness models, and in the clinical guidelines that govern whether this treatment gets used.

The number is not the outcome. It is a summary of the outcome—a compression of patient-level data into a single quantity that describes the difference between arms. Different summaries of the same data tell different stories. A treatment that reduces the annual event rate from 8% to 6% can be described as producing a relative risk reduction of 25%, an absolute risk reduction of 2 percentage points, a number needed to treat of 50, or an odds ratio of approximately 0.73. Each of these is arithmetically correct. Each emphasizes a different aspect of the same underlying effect. Each implies a different comparison to alternative treatments and a different framing for clinical decision-making.
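The arithmetic above can be checked directly. The sketch below recomputes all four summaries from the same 8% and 6% event rates; nothing beyond the two rates is assumed:

```python
# The four summaries of the same 8% -> 6% result described above.
p_control, p_treat = 0.08, 0.06

risk_ratio = p_treat / p_control                    # proportional risk under treatment
rel_risk_reduction = 1 - risk_ratio                 # relative risk reduction
risk_difference = p_control - p_treat               # absolute risk reduction
nnt = 1 / risk_difference                           # number needed to treat
odds_ratio = (p_treat / (1 - p_treat)) / (p_control / (1 - p_control))

print(f"risk ratio          {risk_ratio:.2f}")      # 0.75
print(f"relative reduction  {rel_risk_reduction:.0%}")  # 25%
print(f"risk difference     {risk_difference:.3f}") # 0.020
print(f"NNT                 {nnt:.0f}")             # 50
print(f"odds ratio          {odds_ratio:.3f}")      # ~0.734
```

All five numbers describe one underlying result; which one leads the abstract is a choice, not a consequence of the data.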

The choice among them is not aesthetic. It is scientific and strategic, and it should be made deliberately—before the trial is designed, not after the data are in hand.


Binary outcomes: three measures, three commitments

For binary outcomes—event occurred or did not—three effect measures are in common use: the risk difference, the risk ratio, and the odds ratio.

The risk difference is the absolute change in event probability attributable to treatment. A risk difference of minus 4 percentage points means that in a population identical to the one enrolled, the treatment reduces the probability of the event by 4 percentage points relative to control. The risk difference is directly interpretable in clinical terms—its inverse is the number needed to treat—and it is the measure that scales most naturally with absolute patient benefit. Its limitation is that it is not transportable: a risk difference of 4 percentage points estimated in a high-risk population may not apply to a lower-risk population in which the treatment is subsequently used. If baseline event rates differ between the trial population and the target clinical population, the absolute benefit will differ even if the relative effect is the same.

The risk ratio is the proportional reduction in event probability. A risk ratio of 0.75 means the event probability in the treatment arm is 75% of the event probability in the control arm. The risk ratio is more transportable than the risk difference—if the relative effect is stable across baseline risk, the risk ratio estimated in the trial applies to populations with different absolute risks—but less interpretable in terms of direct patient benefit. A 25% relative risk reduction means different things to a patient with a 2% baseline risk and a patient with a 20% baseline risk, and the risk ratio does not make that difference visible.
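A short sketch of that last point, applying a fixed risk ratio of 0.75 to two hypothetical baseline risks; the relative effect is identical, but the absolute benefit and the NNT are an order of magnitude apart:

```python
# A constant relative effect implies very different absolute benefits
# depending on baseline risk (both baselines hypothetical).
risk_ratio = 0.75

for baseline in (0.02, 0.20):
    treated = baseline * risk_ratio
    arr = baseline - treated                 # absolute risk reduction
    print(f"baseline {baseline:.0%}: ARR {arr:.1%}, NNT {1 / arr:.0f}")
    # 2% baseline  -> ARR 0.5%, NNT 200
    # 20% baseline -> ARR 5.0%, NNT 20
```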

The odds ratio is the ratio of the event odds in the treatment arm to the event odds in the control arm. It is the natural output of logistic regression and is therefore the most common effect measure in observational research and in trials where covariate adjustment is central. It carries two important interpretive problems. The first: odds ratios are not collapsible, which means that an odds ratio estimated from a heterogeneous population will generally not equal the weighted average of stratum-specific odds ratios, even in the absence of confounding. The second: odds ratios are systematically misread as risk ratios, even by clinicians who know better. When the event rate is low, the two approximate each other; when the event rate is high, they diverge materially, and reading the odds ratio as a risk ratio overstates the relative effect. The clinical literature is littered with effect claims that conflate the two, producing overestimates of relative benefit that survive into guidelines.
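The divergence is easy to see numerically. The sketch below holds the risk ratio fixed at 0.75 and lets the control-arm event rate rise (all rates hypothetical):

```python
# How far the odds ratio drifts from a fixed risk ratio of 0.75
# as the control-arm event rate increases.
risk_ratio = 0.75

for p_control in (0.01, 0.05, 0.20, 0.40):
    p_treat = risk_ratio * p_control
    odds_ratio = (p_treat / (1 - p_treat)) / (p_control / (1 - p_control))
    print(f"control risk {p_control:.0%}: OR {odds_ratio:.3f} vs RR {risk_ratio:.2f}")
```

At a 1% event rate the odds ratio is about 0.748, essentially the risk ratio; at 40% it is about 0.643, and reading it as a 36% relative risk reduction would overstate the 25% actual reduction.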

The choice among these three is a scientific commitment about which kind of comparison is appropriate for the estimand. A trial designed to estimate the absolute benefit of treatment in the enrolled population—the treatment policy question for a specific clinical context—should favor the risk difference. A trial designed to estimate a relative effect that will be applied to populations with varying baseline risks should favor the risk ratio. A trial in which covariate adjustment is central and the event rate is low enough that odds ratios and risk ratios approximate each other may reasonably use the odds ratio. These are not interchangeable defaults.


Time-to-event outcomes: when the hazard ratio assumption fails

For time-to-event outcomes, the hazard ratio is the dominant effect measure in regulatory submissions and clinical trial literature. It has legitimate advantages: it uses all available follow-up time, it can be estimated efficiently even when events are sparse, and it is the natural output of the Cox proportional hazards model, which accommodates covariate adjustment and stratified analyses in a well-understood framework.

It also carries an assumption that is routinely violated and rarely examined: proportional hazards.

The proportional hazards assumption requires that the ratio of the hazard rates in the two arms is constant over the entire follow-up period. In practice, this means the treatment’s relative effect must be the same at month three as at month thirty-six. For a treatment that produces an early reduction in hazard and then loses its effect as resistance develops, or for a treatment whose benefit accumulates slowly and strengthens over time, this assumption does not hold. The hazard ratio estimated from such data is a weighted average of the time-varying hazard ratios—a summary statistic that may not correspond to the treatment’s effect at any specific time point and may not be the quantity the estimand was designed to capture.

Non-proportional hazards are not rare in clinical trial data. They are common in oncology, where treatments that produce durable responses in a subset of patients generate crossing survival curves. They are common in cardiovascular disease, where early procedural risk may offset long-term benefit. They are common in any indication where the treatment effect requires time to accumulate and then stabilizes. The question for any trial using a time-to-event primary endpoint is not whether the proportional hazards assumption is exactly true—it never is—but whether the departure from proportionality is large enough to make the hazard ratio a misleading summary.
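A toy illustration of crossing curves, assuming piecewise-constant hazards in which a hypothetical treatment raises the hazard early (as with procedural risk) and lowers it later; all rates are invented for illustration:

```python
import math

def survival(t, segments):
    """Survival probability at time t under piecewise-constant hazards,
    given as a list of (duration, hazard rate) segments."""
    cum, remaining = 0.0, t
    for duration, rate in segments:
        dt = min(remaining, duration)
        cum += rate * dt
        remaining -= dt
        if remaining <= 0:
            break
    return math.exp(-cum)

control = [(6, 0.020), (30, 0.020)]   # constant hazard throughout
treated = [(6, 0.040), (30, 0.008)]   # early harm, late benefit

for month in (3, 6, 12, 24, 36):
    print(month, round(survival(month, control), 3),
                 round(survival(month, treated), 3))
```

The treated curve starts below the control curve and crosses above it between months 12 and 24: no single hazard ratio describes both periods, and the Cox estimate would be a follow-up-dependent average of the two.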

When the assumption is questionable, two alternatives deserve consideration.

The restricted mean survival time (RMST) difference is the difference in mean survival time up to a pre-specified time horizon. It does not assume proportional hazards. It is directly interpretable as a difference in average time free from the event—a quantity that clinicians and patients find meaningful. It requires specifying the time horizon in advance, which is a commitment: the choice of horizon determines which part of the survival curve the estimate reflects, and a poorly chosen horizon can make a modest treatment effect look large or a meaningful effect look small. The horizon must therefore be chosen on clinical grounds, not on statistical convenience.
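Mechanically, the RMST is the area under the survival curve up to the horizon tau. The sketch below integrates two illustrative exponential survival curves to a hypothetical 36-month horizon; the rates are invented, and a real analysis would integrate the estimated (e.g. Kaplan-Meier) curves instead:

```python
import math

def rmst(hazard_rate, tau, steps=10_000):
    """Area under S(t) = exp(-hazard_rate * t) from 0 to tau, by the
    trapezoid rule (an exponential curve stands in for an estimated one)."""
    dt = tau / steps
    area = 0.0
    for i in range(steps):
        t0, t1 = i * dt, (i + 1) * dt
        area += 0.5 * (math.exp(-hazard_rate * t0)
                       + math.exp(-hazard_rate * t1)) * dt
    return area

tau = 36  # months; the clinically chosen horizon
diff = rmst(0.015, tau) - rmst(0.020, tau)
print(f"RMST difference at {tau} months: {diff:.2f} months")  # ~2.15 months
```

The result reads directly as "about two extra event-free months over three years," with no proportionality assumption anywhere in the calculation.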

The win ratio is a more recent alternative that generalizes naturally to hierarchical composite outcomes: every patient in the treatment arm is compared to every patient in the control arm on a pre-specified hierarchy of outcomes, starting with the most important component and moving down the hierarchy only when a pair cannot be distinguished on the current one. The treatment patient “wins” a pair when their outcome is better on the first component that separates the pair, and the win ratio is the number of wins divided by the number of losses. It accommodates tied and censored comparisons, respects clinical hierarchies, and produces an effect measure that is interpretable in terms of the probability that a randomly selected treatment patient has a better outcome than a randomly selected control patient. Its limitations include sensitivity to the choice of outcome hierarchy and to the time horizon over which comparisons are made.
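A minimal sketch of the pairwise comparison, using a two-level hierarchy (time to death, then time to first hospitalization) on invented, fully observed data; a real analysis must also handle censored pairs, which this sketch omits:

```python
from itertools import product

# (months to death, months to first hospitalization) per patient;
# 36 means the event was not seen within the 36-month horizon.
# All values are hypothetical.
treatment = [(36, 10), (36, 36), (20, 8)]
control   = [(36, 6),  (12, 4),  (36, 36)]

wins = losses = 0
for (t_death, t_hosp), (c_death, c_hosp) in product(treatment, control):
    if t_death != c_death:        # compare on the top of the hierarchy first
        wins += t_death > c_death
        losses += t_death < c_death
    elif t_hosp != c_hosp:        # fall to the next level only on ties
        wins += t_hosp > c_hosp
        losses += t_hosp < c_hosp
    # equal on every level -> a tie, counted in neither total

print(f"win ratio = {wins}/{losses} = {wins / losses:.2f}")
```

Here 5 of the 9 pairs are wins and 3 are losses, giving a win ratio of about 1.67; note how the answer would change if hospitalization were ranked above death, which is the hierarchy sensitivity mentioned above.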

Neither alternative is universally superior. Each is more appropriate in some settings than others. The decision between them—and between both and the hazard ratio—should be made before the trial begins, based on the expected shape of the treatment effect and the clinical question the estimand is designed to answer.


Continuous outcomes: the difference between average and distribution

For continuous outcomes, the mean difference is the default. It is interpretable, familiar, and efficient when the outcome is approximately normally distributed. It is also the measure most sensitive to the distributional assumptions that are rarely examined in practice.

The mean difference describes the average treatment effect in the enrolled population. When the treatment effect is homogeneous—when the treatment produces approximately the same benefit in every patient—the mean difference is both scientifically appropriate and practically meaningful. When the treatment effect is heterogeneous—when some patients benefit substantially, some benefit minimally, and some are harmed—the mean difference may describe a quantity that no patient actually experiences. An average improvement of 3 units on a validated scale may reflect a 6-unit improvement in half the population and no improvement in the other half, or a 3-unit improvement in nearly everyone. The mean does not distinguish between these scenarios.

This matters for the estimand. The population-level summary attribute of the estimand should specify what kind of average is appropriate. If the question is whether the treatment produces a clinically meaningful improvement in the typical patient, the mean difference may be the right summary. If the question is whether the treatment produces a clinically meaningful improvement in a substantial proportion of patients, a responder analysis—the proportion achieving a pre-specified threshold of improvement—may be more appropriate. If the distribution of outcomes is expected to be skewed, or if the variance is expected to differ substantially between arms, a median difference or a rank-based statistic may be more honest than a mean difference estimated under normality assumptions.
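A toy example of the distinction: two hypothetical arms with the same mean improvement but very different responder proportions at an assumed 4-unit threshold:

```python
# Same mean difference, different distributions (all numbers hypothetical).
homogeneous   = [3] * 10            # everyone improves about 3 units
heterogeneous = [6] * 5 + [0] * 5   # half improve 6 units, half not at all

threshold = 4                       # assumed clinically meaningful improvement
for name, arm in [("homogeneous", homogeneous),
                  ("heterogeneous", heterogeneous)]:
    mean = sum(arm) / len(arm)
    responders = sum(x >= threshold for x in arm) / len(arm)
    print(f"{name:13s} mean {mean:.1f}, responders {responders:.0%}")
    # homogeneous   mean 3.0, responders 0%
    # heterogeneous mean 3.0, responders 50%
```

A mean-difference analysis cannot tell these arms apart; a responder analysis reports 0% versus 50%.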

The choice between mean-based and distribution-sensitive summaries is not primarily a statistical question. It is a question about what the estimand is claiming—whether the trial is designed to detect an average effect or a distribution shift—and it should follow from the clinical question, not from the availability of convenient software.


The scale dependency of effect measures

One source of confusion in effect measure selection is that different measures capture different aspects of the same underlying effect, and the relationship between them depends on the baseline event rate or outcome distribution. This means that a trial that reports a large relative effect and a small absolute effect is not being inconsistent—it is reporting two different summaries of the same data, each appropriate for different purposes.

The clinical relevance is that effect measures should be matched to the decision context. For a decision about whether to prescribe a treatment to an individual patient, the absolute effect measure—risk difference, number needed to treat, restricted mean survival time difference—is most relevant, because it describes what the patient can expect in terms of personal benefit. For a decision about whether a treatment is more efficacious than alternatives across a range of patient populations, the relative effect measure may be more relevant, because it is more likely to be transportable. For a regulatory decision about whether a treatment should be approved, both are relevant: the relative measure demonstrates that the treatment has biological activity, the absolute measure demonstrates that the activity is clinically meaningful.

A trial designed without attention to which decision context the effect measure serves may produce a result that is compelling in one frame and unimpressive in another—a situation that creates problems not at the time of design but at the time of submission and negotiation.


What this section demands before proceeding

The effect measure must be specified before sample size can be calculated. The sample size depends on the assumed variance structure, which is a property of the effect measure—the variances of a risk difference, a risk ratio, and an odds ratio are not the same, and neither are the sample sizes required to detect a given effect with a given power. A trial that calculates sample size for one effect measure and reports another has not made a statistical error, but it has produced a sample size that may not correspond to what is actually being tested.
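To make the dependence concrete, a sketch of the standard normal-approximation sample size for a risk difference, using the 8% versus 6% rates from earlier (two-sided alpha of 0.05, power of 0.80; a simplified textbook formula, not a substitute for a proper power calculation on the chosen measure):

```python
import math

# Normal-approximation sample size per arm for detecting a risk
# difference between two proportions (unpooled variance).
p1, p2 = 0.08, 0.06
z_alpha, z_beta = 1.96, 0.8416    # two-sided 5% alpha, 80% power

variance = p1 * (1 - p1) + p2 * (1 - p2)
n_per_arm = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
print(f"~{math.ceil(n_per_arm)} patients per arm")  # ~2551
```

Powering the same comparison on a risk ratio or an odds ratio would use a different variance expression (typically on the log scale) and yield a different n, which is exactly why the measure must be fixed first.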

The effect measure must also be connected to the estimand. The population-level summary attribute of the estimand should name the effect measure—not just the statistical test that will be used, but the quantity the test is designed to detect. If the estimand specifies a hazard ratio but the clinical question is better answered by a restricted mean survival time difference, the estimand has not been correctly specified.

And the effect measure must be chosen before the non-inferiority margin is defined. The margin is expressed in the units of the effect measure. A margin defined as a hazard ratio is incommensurable with a margin defined as a risk difference, and borrowing a margin from a historical trial that used a different measure requires justification that is rarely provided. This is the subject of Section 2.3. Before the margin can be discussed, the measure must be settled.


References: Hernán, “The Hazard of Hazard Ratios,” Epidemiology 2010; Royston and Parmar, “Restricted Mean Survival Time,” BMC Med Res Methodol 2013; Pocock et al., “The Win Ratio,” Eur Heart J 2012; Senn, “Disappointing Dichotomies,” Pharm Stat 2003.