7.1 Sample Size Re-estimation

The problem SSR solves

Chapter 3 documented the fragility of sample size assumptions. The effect size is predicted from prior studies with different populations. The variance is borrowed from contexts that may not transfer. The control arm event rate may be declining secularly. When these assumptions are wrong—as they frequently are—the trial is either underpowered and cannot detect a real effect, or overpowered and wastes resources and patient time.

Sample size re-estimation (SSR) is the adaptive response to this fragility. Rather than committing irrevocably to a sample size at design, the trial pre-specifies a rule by which the sample size can be adjusted based on data observed during the trial. If the interim data suggest the nuisance parameters are worse than assumed—variance higher, event rate lower, dropout higher—the sample size is increased to restore the planned power. The adaptation is prospective, rule-governed, and designed to address a well-identified source of design failure.

This is the correct motivation for SSR. Its statistical properties are well-established when implemented correctly. The challenge is in the implementation: what data can the re-estimation use without inflating the type I error, who performs the re-estimation, what are the constraints on the size adjustment, and how do the operating characteristics of the adapted design relate to those of the original design.


Nuisance-parameter SSR versus effect-size SSR

The most important distinction in SSR design is between re-estimation based on nuisance parameters and re-estimation based on the observed effect size.

Nuisance-parameter SSR uses blinded interim data—data that do not reveal the treatment arm comparison—to update the estimate of the variance, the control arm event rate, or the dropout rate. Because the data are blinded, the re-estimation does not depend on the interim treatment effect estimate and does not inflate the type I error in the way that unblinded data use would. A blinded SSR that adjusts the sample size based on the pooled variance, computed without reference to arm assignment, is a pre-specified design adjustment that does not compromise the primary test.

This is the most defensible form of SSR. It addresses the most common source of power loss—wrong nuisance parameter assumptions—without introducing the type I error inflation that comes from conditioning the sample size on the interim treatment effect. The operating characteristics of the adapted design are close to those of the fixed design at the true nuisance parameters; the adaptation corrects the design toward what it would have been if the nuisance parameters had been known at the outset.
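As a sketch of the blinded case, the internal-pilot idea of Wittes and Brittain reduces to recomputing the standard two-sample formula with the pooled interim SD. The function name and the numbers (one-sided 2.5% alpha, 90% power, assumed effect of 4, design SD of 10, interim pooled SD of 13) are illustrative assumptions, not from the text.

```python
# Illustrative blinded SSR for a two-arm normal endpoint, internal-pilot
# style. All names and numbers are assumptions for the sketch.
import math
from statistics import NormalDist

def n_per_arm(sd, delta, alpha=0.025, power=0.90):
    """Standard per-arm sample size for a one-sided two-sample z-test."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * (z(1 - alpha) + z(power)) ** 2 * sd ** 2 / delta ** 2)

n_planned = n_per_arm(sd=10, delta=4)   # design assumption: SD = 10 -> 132
n_revised = n_per_arm(sd=13, delta=4)   # blinded interim pooled SD = 13 -> 222
# The rule restores power without ever looking at the arm-level comparison.
```

Note that the rule only needs the blinded pooled SD; the independent statistician can apply it without seeing any arm-level data.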

Effect-size SSR uses unblinded interim data—including the interim treatment effect estimate—to determine whether the sample size should be increased. The rationale is that if the interim effect is smaller than assumed, the trial is heading toward underpower; increasing the sample size can restore the probability of success. This form of SSR is more intuitive and more commercially attractive than nuisance-parameter SSR, because it directly addresses the scenario the sponsor cares most about: the trial that will miss significance by a small margin.

Its cost is significant. When the sample size depends on the interim effect estimate, the final test statistic is correlated with the interim estimate in a way that inflates the type I error if the standard fixed-sample critical value is used. The inflation can be substantial: without correction, an effect-size SSR can increase the type I error from 2.5% to 4-5% or more, depending on the specifics of the re-estimation rule. Controlling the type I error after effect-size SSR requires analytical methods—the combination test of Bauer and Köhne, the inverse normal method of Lehmacher and Wassmer, or related approaches—that adjust the final test for the adaptive nature of the sample size selection.
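A minimal sketch of the inverse normal method: the stage-wise p-values are combined with weights fixed at design time, so the final critical value does not depend on the re-estimated second-stage size. The weights and the example p-values below are illustrative assumptions.

```python
# Sketch of the inverse normal combination test. The weights w1, w2 are
# pre-specified at design and do NOT change when the stage-2 sample size
# is re-estimated; that is what preserves the type I error.
import math
from statistics import NormalDist

def inverse_normal_p(p1, p2, w1=1.0, w2=1.0):
    """Combined one-sided p-value from stage-wise p-values p1, p2."""
    nd = NormalDist()
    z = (w1 * nd.inv_cdf(1 - p1) + w2 * nd.inv_cdf(1 - p2)) \
        / math.sqrt(w1 ** 2 + w2 ** 2)
    return 1 - nd.cdf(z)

p = inverse_normal_p(0.10, 0.04)   # illustrative stage-wise p-values
# reject at one-sided 2.5% iff p < 0.025; here p is about 0.016
```

The point of the fixed weights is visible in the formula: a larger re-estimated stage 2 changes p2's distribution but not how p2 enters the combined statistic.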

These methods work—they control the type I error—but they come at a cost: reduced power relative to the pre-adaptation design, and reduced power relative to what a fixed design with the correct sample size would achieve. The efficiency of SSR—relative to a fixed design that happened to use the right sample size—depends on the magnitude of the nuisance parameter uncertainty, the design of the re-estimation rule, and the power of the correction method. For plausible scenarios, effect-size SSR with a correct correction method is more efficient than a fixed design with a wrong sample size assumption, but less efficient than a fixed design with the correct assumption.


What the re-estimation rule must specify

The SSR rule is a pre-specified function from interim data to a new sample size. It must specify, before enrollment begins, all of the following.

The interim data used. Is the re-estimation based on blinded data only, or on unblinded data? If blinded, what pooled statistics will be used—pooled variance, pooled event rate, pooled dropout rate? If unblinded, what information about the interim treatment effect will be used, and how?

The timing of the re-estimation. At what information fraction will the interim data be collected for the re-estimation? The timing affects the power of the correction method and the operating characteristics of the adapted design. Re-estimation at early information fractions is more uncertain—the nuisance parameter estimates are less stable—but allows more time to enroll the additional patients if the sample size is increased.

The bounds on the adaptation. What are the minimum and maximum sample sizes after re-estimation? A re-estimation without bounds can produce a sample size that is impractically large or that extends the trial so far that the original design becomes unrecognizable. Practical upper bounds—typically expressed as a maximum number of additional patients or a maximum multiple of the original sample size—are part of the pre-specification and part of the operating characteristic calculation.

Who performs the re-estimation. The re-estimation must be performed by an independent statistician who does not have access to information that would be available to the sponsor or the trial team beyond what the pre-specified rule allows. If the re-estimation is based on blinded data, the independent statistician receives the blinded pooled statistics and applies the rule. If it is based on unblinded data, the independent statistician must be isolated from the sponsor in the same way that the DSMB statistician is isolated—with the information firewall requirements of Chapter 4’s governance framework applied to the re-estimation decision.

How the primary analysis accounts for the adaptation. For nuisance-parameter SSR, the primary analysis can typically proceed as if the trial were a fixed design with the re-estimated sample size, with minor corrections. For effect-size SSR, the primary analysis must use the adaptive test methodology—combination test or inverse normal method—and the critical value must be pre-specified for the combination test, not for the standard test.
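The checklist above can be made concrete as a frozen record, which is one way to enforce that every element is fixed before enrollment. The field names and example values are assumptions for the sketch, not from any regulatory template.

```python
# Illustrative container for the pre-specified elements of an SSR rule.
# Frozen so the rule cannot be mutated after it is written down.
from dataclasses import dataclass

@dataclass(frozen=True)
class SSRRule:
    data_source: str        # "blinded_pooled" or "unblinded"
    info_fraction: float    # timing of the re-estimation
    n_min: int              # lower bound on re-estimated total n
    n_max: int              # upper bound on re-estimated total n
    performed_by: str       # e.g. "independent statistician"
    final_analysis: str     # "fixed" for nuisance-parameter SSR,
                            # "inverse_normal" for effect-size SSR

rule = SSRRule(data_source="blinded_pooled", info_fraction=0.5,
               n_min=264, n_max=500,
               performed_by="independent statistician",
               final_analysis="fixed")
```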


Operating characteristics of the adapted design

The operating characteristics of the SSR design must be established by simulation before enrollment begins. The simulation must cover the full range of scenarios the trial might encounter: true effect sizes above, at, and below the assumed alternative; nuisance parameters at their assumed values and at their plausible extremes; various combinations of true effect size and nuisance parameter misspecification.

For nuisance-parameter SSR, the key operating characteristics are the power at the true effect size across the range of true nuisance parameters, and the probability that the re-estimation triggers a sample size increase across the range of nuisance parameter realizations. The simulation should show that the adapted design achieves the target power for a broader range of nuisance parameters than the fixed design, at the cost of a larger expected sample size when the nuisance parameters are at their pessimistic values.
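The power claim for the blinded case can be checked with a small Monte Carlo. The scenario below (design SD 10 but true SD 13, effect 4, stage-1 n of 66 per arm, re-estimation bounded between 132 and 400 per arm, known-variance final z-test) is an illustrative assumption, and the simulation count is kept small for the sketch.

```python
# Monte Carlo check of blinded-SSR power under a misspecified SD.
import math
import random
from statistics import NormalDist

_z = NormalDist().inv_cdf

def reestimate_n(pooled_sd, delta=4.0, alpha=0.025, power=0.90,
                 n_min=132, n_max=400):
    """Blinded re-estimation of the per-arm n, with pre-specified bounds."""
    n = math.ceil(2 * (_z(1 - alpha) + _z(power)) ** 2
                  * pooled_sd ** 2 / delta ** 2)
    return min(max(n, n_min), n_max)

def simulated_power(true_sd, true_delta, n1=66, sims=2000, seed=1):
    """Power of the blinded-SSR design; the final test is a known-variance
    z-test, a simplification for the sketch."""
    rng = random.Random(seed)
    z_crit = _z(0.975)
    hits = 0
    for _ in range(sims):
        ctl = [rng.gauss(0.0, true_sd) for _ in range(n1)]
        trt = [rng.gauss(true_delta, true_sd) for _ in range(n1)]
        pooled = ctl + trt                     # blinded: labels discarded
        m = sum(pooled) / len(pooled)
        sd_blind = math.sqrt(sum((x - m) ** 2 for x in pooled)
                             / (len(pooled) - 1))
        n = reestimate_n(sd_blind)             # adjust per-arm n
        ctl += [rng.gauss(0.0, true_sd) for _ in range(n - n1)]
        trt += [rng.gauss(true_delta, true_sd) for _ in range(n - n1)]
        diff = sum(trt) / n - sum(ctl) / n
        hits += diff / (true_sd * math.sqrt(2 / n)) > z_crit
    return hits / sims

power = simulated_power(true_sd=13.0, true_delta=4.0)
# Analytically, a fixed design at the original 132 per arm would have
# power of only about 0.70 at true SD 13; the adapted design recovers
# roughly the 90% target, at the cost of a larger expected sample size.
```

The blinded pooled SD slightly overstates the within-arm SD when the effect is nonzero, which nudges the re-estimated n upward; that conservatism is part of the design's operating characteristics, not an error.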

For effect-size SSR, the key operating characteristics are more complex. They include the type I error under the combination test method, the power across the range of true effect sizes, the expected sample size under the null and under the alternative, and the conditional power at the re-estimation time point—the probability of achieving significance at the final analysis, conditional on the interim result observed at the re-estimation.

The conditional power at the re-estimation time point is particularly important because it is the diagnostic that determines whether the re-estimation triggers an increase. If the conditional power is below a pre-specified threshold—50% is common, though the threshold should be justified—the sample size is increased. This threshold must be pre-specified; a threshold chosen after the interim conditional power is computed is a post-hoc decision rule, not a pre-specified one.
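Under the standard Brownian-motion formulation, the conditional power given the interim z-statistic at information fraction t has a closed form. The "current trend" drift assumption below is one common choice (others plug in the design alternative), and the example values are illustrative.

```python
# Conditional power at information fraction t given interim z1.
import math
from statistics import NormalDist

def conditional_power(z1, t, alpha=0.025, drift=None):
    """P(final z > z_alpha | interim z1 at information fraction t),
    defaulting to the 'current trend' drift estimate z1/sqrt(t)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    if drift is None:
        drift = z1 / math.sqrt(t)              # current-trend assumption
    b_t = z1 * math.sqrt(t)                    # Brownian value at t
    shortfall = z_alpha - b_t - drift * (1 - t)
    return 1 - nd.cdf(shortfall / math.sqrt(1 - t))

cp = conditional_power(z1=1.5, t=0.5)
# cp is about 0.59: above a pre-specified 50% threshold, so this rule
# would not trigger a sample size increase at this interim
```

The threshold comparison (`cp < 0.5` triggers an increase) must itself be written into the rule before enrollment, exactly as the text requires.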


The governance requirements

SSR is a design adaptation, and it carries governance requirements that parallel those of interim analyses.

The blinded information used in the SSR must remain blinded—to the sponsor, to the analysis team, and to anyone whose behavior during the trial could be influenced by knowledge of the interim trend. The independent statistician who performs the re-estimation receives the blinded pooled statistics and returns a new sample size (or a determination that no adjustment is needed) without revealing the statistics to any other party. The process must be documented in the SSR charter—a document analogous to the DSMB charter that specifies the procedure, the data to be used, the decision rule, and the communication protocol.

When the re-estimation uses unblinded data—for effect-size SSR—the governance requirements are stricter. The unblinded data must be handled with the same firewall controls as DSMB interim analyses. The independent statistician sees the unblinded interim data, applies the re-estimation rule, and returns only the new sample size—not the interim treatment effect estimate—to the sponsor. The sponsor receives a sample size decision, not interim efficacy information.

This governance requirement is the source of a common operational failure: SSR that was designed correctly on paper but implemented in a way that allowed the treatment arm comparison to reach the sponsor. When this happens—when the sponsor learns the interim effect from the SSR process—the SSR has functioned as an unplanned interim analysis with unrestricted information access, contaminating the final analysis in exactly the way Chapter 4 described.


SSR in event-driven trials

Event-driven trials—where the primary analysis is triggered by accumulation of a target number of events rather than a target number of patients—introduce a specific SSR challenge. The relevant nuisance parameter is the control arm event rate, and the adaptation is a change in the target number of events or the target follow-up duration rather than the target sample size.

The mechanics are the same: blinded interim event rates are used to update the estimate of the rate, and the target is adjusted to restore the planned power. The governance requirements are the same. The additional complexity is that event-driven SSR affects the trial duration as well as the sample size—a lower-than-assumed event rate requires either more patients (to accumulate the same number of events faster) or longer follow-up (to allow events to accumulate in the existing cohort). These two responses have different cost and operational profiles, and the pre-specified rule must specify which response is used and under what conditions.
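The event-count side of this can be sketched with Schoenfeld's approximation for the target events, plus a deliberately crude exponential model (all patients enrolled at time zero) to show how the two responses trade off. The hazard ratio, cohort size, and rates are illustrative assumptions.

```python
# Event-driven SSR sketch: target events and the follow-up-vs-enrollment
# trade-off when the blinded event rate comes in low.
import math
from statistics import NormalDist

def required_events(hr, alpha=0.025, power=0.90):
    """Schoenfeld's approximation for a 1:1 log-rank comparison."""
    z = NormalDist().inv_cdf
    return math.ceil(4 * (z(1 - alpha) + z(power)) ** 2 / math.log(hr) ** 2)

def followup_for_events(d_target, n_patients, rate):
    """Follow-up (in the rate's time units) to reach d_target events,
    under a crude all-enrolled-at-time-zero exponential model."""
    frac = d_target / n_patients
    if frac >= 1:
        raise ValueError("target unreachable: enroll more patients")
    return -math.log(1 - frac) / rate

d = required_events(0.75)                            # 508 events for HR 0.75
t_assumed = followup_for_events(d, 1200, rate=0.15)  # about 3.7 years
t_blinded = followup_for_events(d, 1200, rate=0.10)  # about 5.5 years
# Alternative response: hold follow-up near t_assumed and enroll more
# patients instead; the pre-specified rule must pick one.
```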

The interaction with the alpha-spending plan is also more complex in event-driven SSR, because the information fractions at which interim efficacy analyses are conducted are expressed as fractions of the target event count—and the target event count is now variable. The spending function must be applied at the information fractions defined relative to the new target, not the original target. This requires the pre-specification to address how the spending function is recalibrated when the target changes—which is the same problem identified in Chapter 4’s discussion of event rate uncertainty.
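The recalibration point can be made concrete with a Lan-DeMets O'Brien-Fleming-type spending function: the same interim event count spends less alpha when the target event count rises, because its information fraction shrinks. The event counts below are illustrative.

```python
# Cumulative one-sided alpha spent by information fraction t, using a
# Lan-DeMets O'Brien-Fleming-type spending function.
import math
from statistics import NormalDist

def obf_spend(t, alpha=0.025):
    nd = NormalDist()
    return 2 * (1 - nd.cdf(nd.inv_cdf(1 - alpha / 2) / math.sqrt(t)))

# Interim planned at 330 of a 508-event target (t approx 0.65). If blinded
# SSR raises the target to 600 events, the same 330 events is t = 0.55.
spent_old = obf_spend(330 / 508)   # about 0.0054
spent_new = obf_spend(330 / 600)   # about 0.0025
```

Applying the spending function at the fraction defined by the original target would spend too much alpha too early; the fraction must be recomputed against the new target, as the text requires.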


What this section demands before proceeding

Before Section 7.2’s discussion of adaptive enrichment, the SSR design must be complete in the following sense.

The re-estimation rule is fully specified: what data, at what time, by whom, with what bounds, corrected how in the primary analysis. The SSR charter is drafted—the document that governs the re-estimation process with the same specificity that the DSMB charter governs interim efficacy reviews. The operating characteristics are established by simulation across the relevant scenario space, and the adapted design is shown to achieve the target power for a broader range of nuisance parameters than the fixed design would, at an acceptable cost in expected sample size.

And the interaction between SSR and the efficacy interim analysis plan is addressed. If the efficacy interim analysis occurs at the same time as the SSR, the governance structures must be separated—the DSMB reviews efficacy data in closed session while the independent statistician performs the re-estimation based on blinded or separately unblinded data, with no information crossing between the two processes. This separation must be pre-specified and operationally verified before the trial begins.


References: Wittes and Brittain, “The Role of Internal Pilot Studies in Increasing the Efficiency of Clinical Trials,” Stat Med 1990; Bauer and Köhne, “Evaluation of Experiments with Adaptive Interim Analyses,” Biometrics 1994; Lehmacher and Wassmer, “Adaptive Sample Size Calculations in Group Sequential Trials,” Biometrics 1999; Proschan, “Two-Stage Sample Size Re-Estimation Based on a Nuisance Parameter,” J Biopharm Stat 2005.