4.4 Futility, Efficacy, and Safety

Three decisions that are not the same

Interim stopping comes in three varieties. A trial may stop early because the evidence of efficacy is overwhelming—the treatment is clearly working, and continuing to randomize to control is ethically unjustifiable. It may stop early because the trial has lost the realistic prospect of achieving its primary objective—the treatment is not working, or the accumulated evidence is insufficient to detect an effect even if one exists. Or it may stop early because of a safety signal—evidence that the treatment is causing harm that outweighs whatever benefit it provides.

These three stopping reasons are frequently described in the same design section, under the same DSMB charter, sometimes with the same statistical framework. This creates a misleading impression that they are variations on a common theme. They are not. They have different evidentiary standards, different decision authorities, different governance implications, and different consequences for the trial’s scientific integrity.

The design of the interim analysis plan must treat them as three distinct decision types, with distinct rules, distinct authorities, and distinct documentation requirements. Conflating them—designing a single stopping boundary that handles all three—is a governance error that will not be apparent until the interim analysis convenes and the DSMB must make a decision that the plan has not adequately specified.


Efficacy stopping: what is being claimed

When a trial stops early for efficacy, it is making a claim: the evidence accumulated to this information fraction is sufficient to conclude that the treatment works. That claim has implications that extend beyond the trial. It enters the regulatory submission, the label negotiation, the clinical literature, and the clinical practice of prescribers who will treat patients not enrolled in the trial.

The evidentiary standard for this claim should therefore be the same standard that would be required for the claim if the trial had run to completion. The alpha-spending framework is designed to enforce this: the interim efficacy boundary is set so that the probability of a false positive at any interim look does not exceed the family-wise type I error. A trial that crosses the efficacy boundary at the interim analysis has cleared the same probabilistic standard—over the full range of information fractions—as a trial that crosses the final analysis threshold.
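The alpha-spending idea can be made concrete with a short sketch. The look schedule and the O'Brien-Fleming-type spending function below are illustrative assumptions, not a substitute for validated group-sequential software; the point is only that the cumulative alpha permitted by any interim look never exceeds the family-wise level, and that very little of it is available early.

```python
from statistics import NormalDist

N = NormalDist()          # standard normal distribution
ALPHA = 0.025             # one-sided family-wise type I error (illustrative)

def obf_spend(t: float, alpha: float = ALPHA) -> float:
    """O'Brien-Fleming-type spending function (Lan-DeMets form):
    cumulative alpha permitted to be spent by information fraction t."""
    return 2.0 * (1.0 - N.cdf(N.inv_cdf(1.0 - alpha / 2.0) / t ** 0.5))

looks = [0.25, 0.50, 0.75, 1.00]        # illustrative interim schedule
spent = [obf_spend(t) for t in looks]
increments = [spent[0]] + [b - a for a, b in zip(spent, spent[1:])]
for t, cum, inc in zip(looks, spent, increments):
    print(f"t={t:.2f}  cumulative alpha={cum:.6f}  spent this look={inc:.6f}")
```

At t = 1.00 the cumulative spend equals the full family-wise alpha, while at t = 0.25 it is a tiny fraction of it, which is what makes an early efficacy boundary so hard to cross.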

But the probabilistic standard is not the only relevant standard. The clinical plausibility of the interim result is also relevant. An interim result that crosses the efficacy boundary at 25% information fraction, showing a hazard ratio of 0.45 in a disease where prior treatments have achieved hazard ratios of 0.75-0.85, should trigger scrutiny of whether the result reflects the true treatment effect or an extreme realization of sampling variation. The boundary was crossed, but crossing the boundary is a necessary condition for stopping, not a sufficient one.
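The scale of such an extreme realization can be checked with back-of-the-envelope arithmetic. The event count below is a hypothetical assumption chosen to represent a 25% information fraction in a moderately sized survival trial; the calculation uses the standard normal approximation for the log hazard ratio under 1:1 randomization.

```python
from math import log
from statistics import NormalDist

N = NormalDist()

# Hypothetical: ~100 events observed at a 25% information fraction,
# true hazard ratio 0.80 (the favorable end of what prior treatments achieved).
events = 100
se_loghr = 2.0 / events ** 0.5           # approximate SE of log-HR, 1:1 allocation
z = (log(0.45) - log(0.80)) / se_loghr   # how extreme is an observed HR of 0.45?
print(f"z = {z:.2f}, one-sided tail probability = {N.cdf(z):.4f}")
```

Under these illustrative numbers, an observed hazard ratio of 0.45 is close to a three-standard-error departure from a true hazard ratio of 0.80—possible, but exactly the kind of realization that warrants the scrutiny described above before stopping.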

The DSMB charter should specify the conditions under which the DSMB may recommend continuation despite crossing the efficacy boundary—when the result is implausibly large, when the duration of follow-up is insufficient to characterize the durability of the effect, when the sample size is so small that the confidence interval is wide despite significance. These conditions are judgment calls, and they should be specified in advance rather than improvised when the interim data are surprising.


Efficacy stopping: the information fraction question

The information fraction at which efficacy stopping is possible has two distinct consequences that pull in opposite directions.

Early stopping—at low information fractions—maximizes the efficiency gain: the trial ends sooner, resources are saved, and patients who are being randomized to control are protected from continued assignment to an inferior arm. It also maximizes the overestimation bias: the observed effect at early stopping is furthest from the true effect, and the confidence interval is widest. A result at 30% information fraction is dramatic but imprecise and potentially misleading.

Late stopping—at high information fractions—minimizes the overestimation bias and produces a result that is close to what the final analysis would have found. It also minimizes the efficiency gain: a trial that can only stop early at 80% information fraction is providing minimal benefit over a design without early stopping.
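The tension can be quantified with a small simulation of the overestimation bias. The drift parameter, look fraction, and boundary below are illustrative values on the z-statistic scale, not recommended design choices; the simulation simply conditions on crossing a high early boundary and compares the resulting naive estimate with the truth.

```python
import random
from statistics import mean

random.seed(1)
THETA = 2.8       # true drift at full information (z-scale); roughly an 80%-power design
T = 0.30          # information fraction of the early look
BOUNDARY = 3.0    # illustrative early efficacy boundary on the z-scale

estimates_at_stop = []
for _ in range(200_000):
    z_interim = random.gauss(THETA * T ** 0.5, 1.0)     # interim z-statistic
    if z_interim >= BOUNDARY:                           # trial stops for efficacy
        estimates_at_stop.append(z_interim / T ** 0.5)  # naive drift estimate

print(f"true drift            : {THETA:.2f}")
print(f"mean estimate at stop : {mean(estimates_at_stop):.2f}")
```

Among the simulated trials that stop, the naive estimate sits substantially above the true drift: conditioning on having crossed a high boundary at 30% information selects the extreme realizations of sampling variation.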

The resolution is not a universal formula but a judgment about the relative costs in the specific trial context. When early stopping efficiency is paramount—in seriously ill patients for whom delay in treatment access is measured in survival, when the infrastructure for a large trial is fragile, when the development timeline is constrained—the early-stopping benefits may outweigh the overestimation cost, and lower information fraction boundaries may be appropriate. When scientific integrity is paramount—when the treatment will become a cornerstone of clinical practice, when the result will be used as the evidence base for subsequent trials or for policy decisions that affect large populations—the overestimation cost is higher, and boundaries should require higher information fractions before early stopping is possible.

This is a design decision that should be made explicitly, by the design team, before the plan is finalized—not selected from a menu of standard boundary types without examining what the selection commits to.


Futility stopping: binding versus conditional power

Futility stopping addresses a different question than efficacy stopping: not “is the evidence strong enough to claim the treatment works?” but “is there enough remaining potential for the trial to answer its question?” The decision is forward-looking, not backward-looking.

The most common statistical tool for futility assessment is conditional power: the probability of achieving a significant final result, given the interim data and the planned remaining sample. If the conditional power is below a pre-specified threshold—20% is common, though the threshold should be justified rather than conventional—the trial is unlikely to succeed even if it continues, and stopping may be appropriate.

Conditional power is computed under a specific assumption about the true treatment effect during the remaining trial period. The two natural assumptions are the originally assumed alternative (what is the probability of success if the true effect is what we assumed?) and the current estimate from interim data (what is the probability of success if the true effect is what the data currently suggest?). These two calculations produce different numbers and support different decisions.

Conditional power under the originally assumed alternative is the more optimistic calculation: it asks whether the trial can succeed if the treatment is doing what we hoped. Conditional power under the interim estimate is the more realistic calculation: it asks whether the trial can succeed if the treatment continues to behave as it has behaved so far. When the interim estimate is substantially below the assumed alternative—as it will be when the trial is heading toward a futility stop—the conditional power under the interim estimate will be much lower than under the assumed alternative.
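The two calculations can be placed side by side in a short sketch using the B-value formulation of conditional power. The interim z-value, information fraction, one-sided alpha, and design power below are illustrative assumptions.

```python
from statistics import NormalDist

N = NormalDist()

def conditional_power(z_interim: float, t: float, theta: float,
                      alpha: float = 0.025) -> float:
    """P(final z >= critical value | interim z at information fraction t),
    assuming drift theta over the remaining information (B-value form)."""
    z_crit = N.inv_cdf(1.0 - alpha)
    b = z_interim * t ** 0.5                   # B-value at the interim
    return 1.0 - N.cdf((z_crit - b - theta * (1.0 - t)) / (1.0 - t) ** 0.5)

t, z_interim = 0.5, 0.8                        # a weak interim trend (illustrative)
theta_design = N.inv_cdf(1 - 0.025) + N.inv_cdf(0.90)  # drift assumed at design (90% power)
theta_trend = z_interim / t ** 0.5             # drift implied by the interim data

cp_design = conditional_power(z_interim, t, theta_design)
cp_trend = conditional_power(z_interim, t, theta_trend)
print(f"CP under design alternative: {cp_design:.3f}")  # the optimistic calculation
print(f"CP under interim trend     : {cp_trend:.3f}")   # the realistic calculation
```

With these illustrative numbers the optimistic calculation sits well above a 20% futility threshold while the realistic one sits below it: the same interim data support opposite recommendations depending on which assumption the charter designates.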

The charter should specify which conditional power calculation is the basis for the futility recommendation. Using the more optimistic calculation produces later and less frequent futility stops—the trial continues longer in the hope that the effect will emerge. Using the more realistic calculation produces earlier and more frequent futility stops—the trial acknowledges sooner that the observed effect is below what was assumed.

The choice has ethical and efficiency implications. A futility standard based on the optimistic calculation is more permissive: it allows the trial to continue when the evidence is negative, on the grounds that the original hypothesis might still be true. A futility standard based on the realistic calculation is more stringent: it stops the trial when the current evidence does not support continuation, on the grounds that continuing would not serve the patients being enrolled.


The non-binding futility problem

A non-binding futility boundary—one that the sponsor may choose to override—creates a specific and frequently realized governance problem. When the interim data suggest futility but the sponsor decides to continue, the trial proceeds based on a decision that was made with knowledge of the interim trend. This decision may be correct—the sponsor may have information about external developments, pending regulatory guidance, or complementary data that the DSMB does not have. But it may also be driven by reluctance to acknowledge a failing program.

When a non-binding futility boundary is crossed and the sponsor continues the trial, two things should happen. First, the sponsor’s reasons for continuation should be documented in the DSMB charter response process, and those reasons should be available to regulatory reviewers who later examine the trial’s conduct. Second, the operating characteristics of the trial conditional on continuation—the probability of achieving significance, the expected sample size, the expected information at the final analysis—should be recomputed and documented. These are not the operating characteristics originally planned; they are the operating characteristics of a trial that has been continued past its futility boundary, and the regulatory agency reviewing the final result will want to know what those characteristics were.

Failure to document the continuation decision and its basis creates a governance gap that undermines the interim analysis plan’s integrity. If the trial subsequently fails at the final analysis, the continuation past the futility boundary may be interpreted as a decision motivated by something other than scientific judgment. If the trial subsequently succeeds, the concern is that the futility boundary was crossed and ignored, and the final result may reflect the influence of that decision on subsequent enrollment or conduct. Neither outcome is good, and both are avoidable by treating the non-binding futility decision as the high-stakes governance event it is.


Safety stopping: a different authority structure

Safety stopping is categorically different from efficacy and futility stopping, and it requires a different authority structure. The difference is in the nature of the evidence and the nature of the obligation.

Efficacy and futility decisions are about whether the trial is likely to answer its scientific question. They are governed by pre-specified statistical rules that balance type I and type II error. They are, ultimately, decisions about evidence quality and scientific efficiency.

Safety decisions are about whether enrolled patients are being harmed. They are governed not only by statistical thresholds but by the ethical obligation to protect research participants. They override the scientific objectives of the trial when necessary.

This means that safety stopping should not be constrained by the alpha-spending framework in the same way that efficacy stopping is. The alpha-spending framework controls the type I error for efficacy claims. A safety signal that fails to cross a pre-specified alpha-adjusted boundary is still a safety signal, and the DSMB must be empowered to act on it regardless of whether it crosses a formal statistical threshold. Requiring a pre-specified alpha level for safety stopping—as if safety and efficacy are the same kind of decision—is a category error that can result in patients being exposed to a harmful treatment because the safety signal was “not statistically significant.”

The standard for safety stopping is not statistical significance. It is medical judgment: given the evidence of harm and the evidence of benefit, does it remain ethically justifiable to continue randomizing patients? This judgment may involve statistical evidence, but it is not reducible to it. The DSMB charter must empower the DSMB to make this judgment—to recommend stopping for safety on the basis of clinical judgment, independently of whether any pre-specified statistical threshold has been crossed.

The authority structure for safety stopping also differs from efficacy stopping. A recommendation to stop for safety by a fully constituted independent DSMB should be extremely difficult for a sponsor to override. The conditions under which a sponsor may continue a trial against a safety stopping recommendation from the DSMB should be narrow—limited to demonstrably incorrect data or clear procedural error—and the override should require regulatory notification. If a sponsor overrides a safety stopping recommendation without triggering regulatory review, the trial’s governance has failed in a fundamental way.


When the three categories overlap

In practice, the three stopping categories are not always cleanly separated. A trial may observe a safety signal that is associated with biological efficacy—a finding that the treatment causes a side effect that is related to its mechanism of action. Continuing the trial is dangerous; stopping the trial means the efficacy question is unanswered. A trial may observe early futility on the primary endpoint alongside a signal of harm on a secondary endpoint—stopping for futility also removes the patients from harm, while continuing would resolve both questions more definitively. A trial may observe early efficacy in a subgroup alongside futility in the overall population—stopping for overall futility means the subgroup signal is never confirmed.

These overlaps require the interim analysis plan to specify not just the individual stopping rules but the decision hierarchy—which stopping category takes priority when multiple signals are present simultaneously. Safety always takes priority over efficacy and futility. Efficacy takes priority over futility, because a trial trending toward futility on one question should not be terminated while a compelling efficacy signal on another—for example, in a pre-specified subgroup—still needs confirmation. These hierarchies should be explicit in the charter, not resolved by the DSMB in real time under the pressure of an interim meeting.
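A charter hierarchy of this kind is simple enough to state as an explicit rule. The encoding below is a hypothetical sketch—the category names and the `stopping_recommendation` helper are invented for illustration, not language from any actual charter—but it shows the property the charter must guarantee: when multiple signals are present, exactly one category governs, and the ordering is fixed in advance.

```python
from typing import Optional

# Hypothetical encoding of a charter's decision hierarchy when multiple
# interim signals are present simultaneously: safety > efficacy > futility.
PRIORITY = ("safety", "efficacy", "futility")

def stopping_recommendation(signals: set) -> Optional[str]:
    """Return the single stopping category that governs the recommendation,
    or None if no boundary-level signal is present."""
    for category in PRIORITY:
        if category in signals:
            return category
    return None

print(stopping_recommendation({"futility", "safety"}))    # safety governs
print(stopping_recommendation({"efficacy", "futility"}))  # efficacy governs
print(stopping_recommendation(set()))                     # no signal present
```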

