2.2 Interpretability and Efficiency
The tension that does not resolve
Every effect measure sits somewhere on a spectrum between two properties that pull in opposite directions: statistical efficiency and clinical interpretability.
Statistical efficiency is the property of extracting maximum information from the available data. An efficient test uses all the available follow-up time, accommodates the covariance structure of the outcome, and produces the narrowest confidence interval for a given sample size. Efficiency is what allows a trial to detect a real treatment effect without enrolling more patients than necessary.
Clinical interpretability is the property of producing a result that the people who must act on it can understand and apply. An interpretable result tells a prescribing physician what a typical patient can expect, tells a payer what population-level benefit the treatment produces, and tells a patient what the treatment is likely to do for them specifically. Interpretability is what makes a result actionable.
The tension between these two properties is real and does not have a general resolution. The statistically optimal measure is not always the most clinically interpretable, and the most interpretable measure is not always the most efficient. The design team must decide, explicitly, how to navigate this tension for their specific trial—based on the estimand, the decision context, and the audience for the result.
Treating this as a technical question with a technical answer is the specific mistake this section addresses.
When efficiency comes at interpretive cost
The hazard ratio is the canonical example of a measure that is efficient and frequently misinterpreted.
Its efficiency derives from the Cox proportional hazards model, which uses all follow-up time—including the follow-up contributed by patients censored before experiencing the primary endpoint—and produces a precisely estimated relative measure. In a trial where events are sparse, the hazard ratio may be the only effect measure that can be estimated with acceptable precision from the available data. In this setting, choosing a less efficient measure in the name of interpretability may mean the trial cannot detect a real treatment effect at all.
Its interpretive problem is that a hazard ratio is not, to most clinicians, an intuitively accessible quantity. A hazard ratio of 0.78 does not immediately convey what a patient experiences. It is a ratio of rates, not a difference in probabilities. It does not specify a time horizon. It does not answer the question a patient asks—“will this treatment help me, and by how much?”—in terms a patient can evaluate.
The response to this is often to supplement the hazard ratio with milestone survival rates: “at three years, survival was 68% in the treatment arm versus 58% in the control arm.” This is helpful, but it introduces a secondary effect measure that is not the primary basis for the hypothesis test. The milestone survival difference is interpretable; the hazard ratio is what the trial was powered to detect. The two may tell different stories—a hazard ratio that is statistically significant may correspond to a milestone survival difference that is clinically unimpressive, or a non-significant hazard ratio may correspond to a meaningful absolute difference at a specific time point. When the supplementary measure tells a different story than the primary measure, the trial result is not ambiguous—it is informative about the shape of the treatment effect. But only if the relationship between the two was pre-specified and the difference in their messages was anticipated.
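A short calculation shows how the two measures are related, and why the pairing is only as trustworthy as the assumptions behind it. Under proportional hazards, the treatment-arm survival at any time point equals the control-arm survival raised to the power of the hazard ratio; the sketch below applies that identity to illustrative numbers (a hazard ratio of 0.78 and a three-year control-arm survival of 58%), not to results from any particular trial.

```python
# Under proportional hazards, S_trt(t) = S_ctrl(t) ** HR, so a relative measure
# can be translated into a milestone difference once the control-arm survival
# at the milestone is known. Illustrative numbers only.
hr = 0.78          # hazard ratio (hypothetical)
s_ctrl_3y = 0.58   # control-arm survival at 3 years (hypothetical)

s_trt_3y = s_ctrl_3y ** hr
print(f"3-year survival: {s_trt_3y:.1%} (treatment) vs {s_ctrl_3y:.1%} (control); "
      f"milestone difference {s_trt_3y - s_ctrl_3y:.1%}")
# The same hazard ratio implies a different absolute difference at every
# baseline survival level and at every time point.
```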
When interpretability comes at efficiency cost
The number needed to treat—the inverse of the absolute risk reduction—is the most clinically interpretable effect measure in common use. A number needed to treat of 25 means that, on average, 25 patients must receive the treatment for one additional patient to avoid the outcome. This directly connects the treatment effect to clinical effort and clinical benefit in terms that any prescriber or patient can evaluate.
It is also inefficient by construction. Because it is derived from the absolute risk difference rather than a relative measure, it is sensitive to the baseline event rate in the control arm—which is often estimated with considerable uncertainty, especially in prevention trials where the overall event rate is low. The confidence interval around the number needed to treat is wide. The estimate is unstable. When the trial is underpowered, or when the baseline event rate is uncertain, the number needed to treat may not be estimable with useful precision even when the relative effect is statistically significant.
This asymmetry—significant relative effect, imprecise absolute effect—is common and uncomfortable. A relative risk reduction of 20% can be statistically significant in a trial where the absolute risk difference cannot be estimated to within a factor of three, because the baseline event rate in the control arm is too uncertain. The trial demonstrates that the treatment works, in the sense that the relative effect is not zero. It does not demonstrate, with precision, how much work the treatment does in a given clinical population. These are different questions, and the trial answers the first but not the second.
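The arithmetic behind this asymmetry is easy to reproduce. The sketch below uses hypothetical counts for a low-event-rate prevention trial (10,000 patients per arm, event rates of 2.0% versus 1.6%); the relative risk gets a Katz log-scale interval, the risk difference a Wald interval, and the number needed to treat is obtained by inverting the risk difference and its interval.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for a low-event-rate prevention trial.
e1, n1 = 160, 10_000   # events / patients, treatment arm
e0, n0 = 200, 10_000   # events / patients, control arm
p1, p0 = e1 / n1, e0 / n0
z = stats.norm.ppf(0.975)

# Relative risk with a Katz log-scale confidence interval.
rr = p1 / p0
se_log_rr = np.sqrt(1 / e1 - 1 / n1 + 1 / e0 - 1 / n0)
rr_ci = np.exp(np.log(rr) + np.array([-z, z]) * se_log_rr)

# Absolute risk reduction (Wald interval) and the implied NNT interval.
# Inverting the interval is only meaningful when it excludes zero.
arr = p0 - p1
se_arr = np.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
arr_ci = arr + np.array([-z, z]) * se_arr
nnt, nnt_ci = 1 / arr, 1 / arr_ci[::-1]

print(f"relative risk {rr:.2f} (95% CI {rr_ci[0]:.2f} to {rr_ci[1]:.2f})")
print(f"risk reduction {arr:.4f} (95% CI {arr_ci[0]:.4f} to {arr_ci[1]:.4f})")
print(f"NNT {nnt:.0f} (95% CI {nnt_ci[0]:.0f} to {nnt_ci[1]:.0f})")
```

With these hypothetical counts the relative risk interval excludes 1, while the implied number needed to treat ranges from roughly 130 to over 3,000: a "positive" trial that cannot say, with useful precision, how many patients must be treated to prevent one event.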
The resolution is not to abandon absolute measures—they are essential for clinical and policy decisions—but to be explicit about what each measure is answering and why the trial was powered for one and not the other.
The restricted mean survival time as a case study
The restricted mean survival time difference has emerged over the past decade as an effect measure that attempts to combine reasonable efficiency with genuine interpretability for time-to-event outcomes. It is worth examining in detail, not because it is always the right choice, but because it illustrates the trade-offs involved in any effect measure decision.
The RMST at a pre-specified time horizon T is the expected event-free survival time up to T, estimated as the area under the Kaplan-Meier survival curve from time zero to T. The RMST difference between arms is the difference in this expected survival time—the average number of additional days, months, or years that treatment patients survive event-free compared to control patients, up to the time horizon T.
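A minimal sketch of the estimator follows, assuming the Python lifelines package for the Kaplan-Meier fit and using simulated rather than trial data: because the RMST is simply the area under the Kaplan-Meier step function from time zero to the horizon, it can be computed directly from the fitted survival function.

```python
import numpy as np
from lifelines import KaplanMeierFitter

def rmst(durations, events, horizon):
    """Area under the Kaplan-Meier step function from time zero to `horizon`."""
    kmf = KaplanMeierFitter().fit(durations, event_observed=events)
    times = kmf.survival_function_.index.values       # step times, starting at 0
    surv = kmf.survival_function_.iloc[:, 0].values   # S(t) at each step
    keep = times < horizon
    edges = np.append(times[keep], horizon)           # close the last interval at the horizon
    return float(np.sum(surv[keep] * np.diff(edges)))

# Simulated data: exponential event times (months) with uniform censoring.
rng = np.random.default_rng(0)
n = 300
t_trt, t_ctl = rng.exponential(28.0, n), rng.exponential(20.0, n)
c_trt, c_ctl = rng.uniform(0, 60, n), rng.uniform(0, 60, n)

horizon = 36.0  # pre-specified horizon, months
diff = (rmst(np.minimum(t_trt, c_trt), t_trt <= c_trt, horizon)
        - rmst(np.minimum(t_ctl, c_ctl), t_ctl <= c_ctl, horizon))
print(f"RMST difference at {horizon:.0f} months: {diff:.2f} months")
```

Formal inference on the difference (a variance estimate and test) is available in standard survival software; the sketch shows only the point estimate, to make clear that the quantity being reported is an average event-free time in the trial's own time units.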
Its interpretive advantage is direct: an RMST difference of 4.2 months means that, on average, patients in the treatment arm survive 4.2 months longer without the event, up to the specified time horizon. This is a statement about what a patient experiences that a hazard ratio cannot provide. It does not require the listener to know what a hazard is, or what proportional hazards means, or how to convert a ratio of rates into a patient-level expectation.
Its efficiency relative to the hazard ratio depends on the shape of the survival curves. When hazards are proportional, the hazard ratio test is more powerful—it uses more of the information in the data. When hazards are non-proportional, particularly when survival curves cross, the RMST test may be more powerful, because it is not distorted by the canceling effects of early and late hazard differences. In a world where non-proportional hazards are common—and they are, particularly in oncology and in trials of treatments that produce durable responses in a subset of patients—the RMST may be more efficient in practice than the theoretical comparison under proportional hazards would suggest.
Its cost is the time horizon. The RMST must be evaluated at a specified T, and the choice of T influences the result. An RMST difference evaluated at two years may be small even if the treatment produces a large benefit at five years. An RMST difference evaluated at five years may be dominated by the late portion of the survival curve, where few patients remain under observation and the estimate is imprecise. The time horizon must be chosen on clinical grounds—the period over which the treatment’s benefit is clinically relevant—and that choice must be pre-specified and defended. When the horizon is chosen after the data are seen, the RMST becomes a post-hoc measure that is no more interpretable than a hazard ratio evaluated at a cherry-picked time point.
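A numerical sketch illustrates both problems at once, using invented hazards: the treatment carries a higher hazard for the first six months and a lower hazard afterwards, so the survival curves cross (around month 16 with these numbers) and no single hazard ratio describes the contrast. Integrating the gap between the curves shows how strongly the RMST difference depends on the horizon.

```python
import numpy as np

# Invented non-proportional-hazards scenario: treatment hazard is 0.10/month
# for the first 6 months and 0.02/month afterwards; control hazard is a
# constant 0.05/month. The time-varying hazard ratio is 2.0 early, 0.4 late.
t = np.linspace(0.0, 36.0, 3601)                        # months, step 0.01
s_ctl = np.exp(-0.05 * t)
s_trt = np.where(t < 6, np.exp(-0.10 * t), np.exp(-0.60 - 0.02 * (t - 6)))

dt = t[1] - t[0]
for horizon in (12, 24, 36):
    gap = (s_trt - s_ctl)[t <= horizon]
    rmst_diff = np.sum((gap[:-1] + gap[1:]) / 2) * dt   # trapezoidal rule
    print(f"RMST difference at {horizon} months: {rmst_diff:+.1f} months")
```

With these invented hazards the RMST difference is about -1.4 months at a 12-month horizon, -1.2 months at 24 months, and turns slightly positive only by 36 months: the early deficit is repaid slowly after the curves cross, which is exactly why the horizon must be chosen on clinical grounds and fixed before the data are seen.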
Responder analyses: when the distribution matters more than the mean
A responder analysis converts a continuous outcome into a binary one: did the patient achieve at least a pre-specified magnitude of improvement? The proportion of responders in each arm is then the primary effect measure.
Responder analyses are advocated on interpretive grounds: they translate a continuous scale into a clinical decision—did this patient benefit?—that is more meaningful than a mean difference of 1.8 units on a validated scale. This argument has force when the minimally important difference is well-established and clinically meaningful, when the distribution of responses is bimodal rather than continuous, and when the clinical decision context is genuinely binary—the treatment is used or not used, not titrated to a continuous response.
The efficiency cost is real. Converting a continuous outcome to binary discards information and reduces statistical power. A trial powered on a mean difference may not be adequately powered for a responder analysis at the same sample size. If the responder analysis is primary, the sample size must reflect it—which typically means a larger trial.
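A small simulation makes the cost concrete. Everything in the sketch below is hypothetical: 100 patients per arm, a true mean improvement of 0.4 standard deviations, and a pre-specified responder threshold of 0.5 standard deviations. Each simulated trial is analyzed twice, once as a mean difference and once as a difference in responder proportions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, delta, threshold = 100, 0.4, 0.5   # per-arm size, true effect (SD units), responder cut-off
n_sim, alpha = 2000, 0.05
reject_mean = reject_resp = 0

for _ in range(n_sim):
    trt = rng.normal(delta, 1.0, n)   # continuous improvement, treatment arm
    ctl = rng.normal(0.0, 1.0, n)     # continuous improvement, control arm

    # Analysis 1: two-sample t-test on the mean difference.
    reject_mean += stats.ttest_ind(trt, ctl).pvalue < alpha

    # Analysis 2: dichotomize at the threshold, then a two-proportion z-test.
    p1, p0 = (trt >= threshold).mean(), (ctl >= threshold).mean()
    pooled = (p1 + p0) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    z = (p1 - p0) / se if se > 0 else 0.0
    reject_resp += 2 * stats.norm.sf(abs(z)) < alpha

print(f"power, mean-difference analysis: {reject_mean / n_sim:.2f}")
print(f"power, responder analysis:       {reject_resp / n_sim:.2f}")
```

Under these assumptions the mean-difference analysis rejects in roughly 80% of simulated trials and the responder analysis in roughly 60%; a trial that makes the responder analysis primary has to buy that difference back with sample size.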
There is also a hidden assumption: that the threshold defining response is clinically meaningful and was established independently of the trial data. A threshold chosen to maximize the observed responder rate is not a clinical determination. It is data dredging with a clinical label. The minimally important difference must be pre-specified on the basis of patient and clinical evidence, not selected from the distribution of trial outcomes. When this requirement is not met—and it frequently is not—the responder analysis produces an interpretable-looking number that reflects neither the treatment effect nor the clinical threshold it claims to represent.
The audience determines the measure
One organizing principle for navigating the efficiency-interpretability tension is to ask: who needs to act on this result, and what kind of information do they need to act well?
Regulatory agencies need to determine whether the treatment has demonstrated efficacy—that the treatment effect is real and not attributable to chance. For this purpose, statistical significance and the precision of the effect estimate are paramount. An efficient measure with a narrow confidence interval is more useful for regulatory determination than an interpretable measure with a wide one.
Prescribers need to determine whether to recommend the treatment to a specific patient, or to a class of patients with specific characteristics. For this purpose, absolute effect measures expressed in clinically meaningful units are most relevant. The prescriber’s question is “what will this treatment do for my patient?”—a question the hazard ratio does not answer but the number needed to treat or the RMST difference approximates.
Payers need to determine whether the treatment produces value relative to its cost, often in comparison with existing alternatives. For this purpose, absolute measures in the trial population and estimates of transportability to other populations are most relevant. A risk ratio, if the treatment effect is transportable, or a risk difference specific to a defined population, is more useful than an odds ratio that may not translate clearly.
Patients need to understand what the treatment is likely to do for them personally. For this purpose, the most interpretable measure—in absolute terms, at a clinically relevant time horizon, expressed as a probability rather than a ratio—is most appropriate. The format in which clinical trial results are communicated to patients has received increasing attention in the patient-focused drug development literature, and the evidence suggests that relative risk reductions systematically overstate perceived benefit compared to absolute risk reductions.
The practical implication is that a trial may need more than one effect measure—not as sensitivity analyses, but as pre-specified estimates designed for different audiences and different decision contexts. The primary measure is chosen for the primary purpose—usually regulatory—and secondary measures are pre-specified for clinical and patient-facing communication. This is not multiplicity. It is epistemic completeness.
What this section demands before proceeding
Before the non-inferiority margin can be discussed, the efficiency-interpretability balance must be explicitly resolved for the primary effect measure. This means the design team must have answered: for whom is the primary result being produced, and what information does that audience need? If the answer is “the regulatory agency,” the primary measure should be the most efficient available for the estimand. If the primary result must simultaneously serve regulatory and prescriber needs, the interpretability cost of the primary measure must be acceptable, and efficiency may need to be purchased through a larger sample size rather than through measure selection.
It also means that secondary effect measures—those intended for clinical communication rather than regulatory determination—must be pre-specified, with their relationship to the primary measure stated. A secondary measure that tells a different story than the primary measure is not a problem to be managed at the analysis stage. It is a finding to be anticipated at the design stage, with the discordance explained in advance rather than rationalized after the fact.
The effect measure for the primary analysis is settled. The secondary measures are identified. Their audience and purpose are documented. Only then can the non-inferiority question—Section 2.3—be addressed.
References: Sormani and Bruzzi, “Can We Measure Long-Term Treatment Effects via Short-Term Surrogates?” Mult Scler 2015; Uno et al., “On the Utility of RMST,” Stat Med 2014; Sedgwick, “Relative Risks, Absolute Risks, and Numbers Needed to Treat,” BMJ 2013; FDA Guidance on Patient-Focused Drug Development: Methods for Developing and Selecting Patient-Reported Outcomes (2022).