5.2 Stratification
What stratification is for
Stratification in clinical trial randomization serves two purposes, and they are often conflated.
The first purpose is balance: ensuring that known prognostic factors are distributed similarly across treatment arms, even in moderately small samples where chance imbalance is a real risk. Without stratification, a trial of 200 patients might randomize 65% of high-risk patients to the control arm by chance, producing a comparison that appears to favor the treatment not because the treatment works but because the treated patients were at lower baseline risk. Stratification prevents this by maintaining separate randomization lists within each stratum, so that each stratum contributes proportionally to both arms.
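To make the mechanism concrete, the following is a minimal sketch of stratified permuted-block randomization in Python. The two-arm design, block size, stratum labels, and seed are illustrative assumptions, not a recommendation for any particular trial.

```python
import random
from itertools import chain

def permuted_blocks(n_blocks, block_size, arms, rng):
    """One stratum's randomization list: shuffled blocks, each balanced across arms."""
    blocks = []
    for _ in range(n_blocks):
        block = list(arms) * (block_size // len(arms))  # e.g. T, T, C, C before shuffling
        rng.shuffle(block)
        blocks.append(block)
    return list(chain.from_iterable(blocks))

# A separate list per stratum, so each stratum contributes proportionally to both arms.
rng = random.Random(20240501)                  # illustrative seed, fixed for reproducibility
arms = ("treatment", "control")
strata = ("high-risk", "low-risk")             # illustrative single factor at two levels
lists = {s: iter(permuted_blocks(n_blocks=25, block_size=4, arms=arms, rng=rng))
         for s in strata}

def randomize(stratum):
    """Assign the next patient in the given stratum from that stratum's own list."""
    return next(lists[stratum])

print(randomize("high-risk"), randomize("high-risk"), randomize("low-risk"))
```

Because each block contains the two arms equally often, the running imbalance within any stratum never exceeds half a block.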
The second purpose is efficiency: when the primary analysis reflects the stratification—by including the stratification factors as covariates, or by using a stratified test—the residual variance in the comparison is reduced, increasing the power of the primary test. This efficiency gain is real and can be substantial when the stratification factors are strongly prognostic. It is also conditional: the efficiency gain is captured only when the primary analysis accounts for the stratification. A trial that uses stratified randomization but analyzes as if randomization were simple—ignoring the stratification in the analysis—has incurred the operational cost of stratification without capturing its statistical benefit.
Both purposes are served by stratification, but they make different demands on the design: balance requires the stratification factors in the randomization, and efficiency requires them in the analysis. A design that stratifies without analyzing accordingly has achieved one purpose and missed the other.
Choosing stratification factors
The choice of stratification factors is a clinical judgment about which patient characteristics are the strongest predictors of the primary outcome, independent of treatment. The stronger the predictor, the larger the efficiency gain from stratification, and the more damaging a chance imbalance would be to the validity of the comparison.
The general principle is to stratify on factors that are known to be strongly prognostic, that can be ascertained accurately at the time of randomization, and whose distribution across arms would materially affect the interpretation of the primary result if imbalanced by chance.
The specific candidates for stratification are indication-specific, but several categories recur: disease severity or stage at baseline, prior treatment history, key demographic variables that predict prognosis (age, performance status in oncology), and the presence or absence of a biomarker that modifies the treatment effect. For trials in which a subgroup analysis is pre-specified—an analysis by biomarker status, by disease severity, by region—the subgroup-defining variable should be a stratification factor, so that the subgroup analysis is not confounded by a chance imbalance in that factor between arms.
The constraints on stratification factor selection are practical. Each stratification factor multiplies the number of strata by the number of levels of that factor. A design with four stratification factors each at two levels has sixteen strata. In a trial of 400 patients, the average stratum has 25 patients—manageable. In a trial of 100 patients, the average stratum has approximately 6 patients, and many strata will have too few patients to maintain balance through blocking. When strata are small, stratified randomization degenerates toward simple randomization within strata, and the balance guarantee is weakened.
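The arithmetic is worth making explicit. A short sketch, using hypothetical factors and the two trial sizes mentioned above:

```python
from math import prod

levels = {"stage": 2, "biomarker": 2, "prior_therapy": 2, "region": 2}  # hypothetical factors
n_strata = prod(levels.values())                                        # 2 x 2 x 2 x 2 = 16

for n_patients in (400, 100):
    print(f"{n_patients} patients: {n_patients / n_strata:.1f} per stratum on average")
```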
The practical limit on stratification factors is approximately three to four for most trial sizes. Beyond this, covariate-adaptive randomization—minimization—is a better tool for balance across multiple factors simultaneously, as discussed in Section 5.1.
Ascertainment at randomization
The stratification factor must be ascertained accurately at the moment of randomization. This requirement is more demanding than it appears.
For factors determined by laboratory results—a biomarker level, a genetic test, a disease stage defined by imaging—the test result must be available before the patient is randomized. In practice, this means the laboratory processing time must be shorter than the window between enrollment and randomization, or the patient must be enrolled before the test result is available but not randomized until it is—a distinction that requires careful operational specification. A patient who is enrolled (consented) before the test and randomized only when the result is available is handled differently from a patient who is enrolled and randomized simultaneously.
For factors that involve clinical judgment—disease severity ratings, performance status assessments, subtype classifications—the assessment must be made by a person who does not know which treatment the patient will receive, and the criteria for the assessment must be pre-specified with enough specificity to prevent interpretation variation across sites. An assessor who knows the patient is about to be randomized and who has an opinion about what treatment the patient should receive will, consciously or not, adjust their assessment to favor the arm they prefer for that patient. This is not a hypothetical concern; it is a documented mechanism of bias in trials where the stratification assessment is made by personnel with knowledge of the upcoming assignment.
Misascertainment at randomization—assigning a patient to the wrong stratum—has two consequences. It weakens the balance guarantee: the patient is counted toward the wrong stratum’s allocation, so the treatment split among the patients who truly belong to each stratum is no longer controlled. And it creates an analysis complication: the stratification factor as recorded at randomization and the stratification factor as it actually applies to the patient may differ, and the primary analysis must be pre-specified to handle this discordance. The standard approach is to analyze according to the stratum as randomized, regardless of any later correction—preserving the integrity of the randomization sequence at the cost of using a slightly incorrect stratification value for some patients. The pre-specification of this handling rule is not administrative; it is a substantive choice with implications for the validity of the stratified analysis.
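A small sketch of the handling rule, with hypothetical column names: the analysis uses the stratum recorded at randomization, and discordant cases are tabulated rather than re-assigned.

```python
import pandas as pd

# Hypothetical records: stratum captured at randomization vs. stratum after central review.
df = pd.DataFrame({
    "patient":            ["P001", "P002", "P003"],
    "stratum_randomized": ["biomarker-pos", "biomarker-neg", "biomarker-pos"],
    "stratum_confirmed":  ["biomarker-pos", "biomarker-pos", "biomarker-pos"],
})

df["discordant"] = df["stratum_randomized"] != df["stratum_confirmed"]
print(f"{df['discordant'].sum()} misascertained patient(s), reported but not corrected")

# Pre-specified rule: the stratified analysis uses the as-randomized stratum.
analysis_stratum = df["stratum_randomized"]
```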
Reflecting stratification in the analysis
The efficiency benefit of stratification is captured only when the primary analysis includes the stratification factors. For continuous outcomes, this typically means including the stratification factors as covariates in the primary regression model. For time-to-event outcomes, this means including them as covariates in the Cox model or using a stratified Cox model with separate baseline hazard functions within each stratum. For binary outcomes, this means including them in the logistic model or using a Cochran-Mantel-Haenszel test stratified by the stratification factors.
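As an illustration, the sketch below fits each model type on simulated data, assuming pandas, statsmodels, and lifelines are available; the column names, effect sizes, and the two stratification factors are invented for the example and are not a template for any specific trial.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter

# Invented toy data: one treatment indicator and two stratification factors.
rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({
    "arm": rng.integers(0, 2, n),                    # 1 = treatment, 0 = control
    "stage": rng.choice(["I-II", "III-IV"], n),      # stratification factor 1
    "biomarker": rng.choice(["neg", "pos"], n),      # stratification factor 2
})
df["y"] = 0.5 * df["arm"] + 1.5 * (df["stage"] == "III-IV") + rng.normal(0, 2, n)
df["response"] = rng.binomial(1, (0.3 + 0.1 * df["arm"]).to_numpy())
df["time"] = rng.exponential(12, n)
df["event"] = rng.binomial(1, 0.7, n)

# Continuous outcome: stratification factors as covariates in the primary model.
ols_fit = smf.ols("y ~ arm + C(stage) + C(biomarker)", data=df).fit()

# Binary outcome: the same factors in the logistic model.
logit_fit = smf.logit("response ~ arm + C(stage) + C(biomarker)", data=df).fit(disp=0)

# Time-to-event outcome: stratified Cox model, separate baseline hazard per stratum.
cph = CoxPHFitter()
cph.fit(df[["time", "event", "arm", "stage", "biomarker"]],
        duration_col="time", event_col="event", strata=["stage", "biomarker"])

print(ols_fit.params["arm"], logit_fit.params["arm"], cph.params_["arm"])
```

For a binary endpoint, a Cochran-Mantel-Haenszel test on the same factors is another option; in statsmodels one route is a StratifiedTable built from the per-stratum 2x2 tables.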
The alternative—analyzing the data without accounting for stratification—treats the stratified randomization as if it were simple randomization. This is not wrong in the sense of producing a biased estimate; the randomization still ensures that the comparison is valid in expectation. But it is inefficient: the analysis does not use the information encoded in the stratification, and the residual variance is larger than it would be if the stratification were reflected. The primary test is less powerful, and the confidence interval is wider.
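A rough simulation sketch of this point, with invented effect sizes: the arm is randomized within each level of a strongly prognostic binary factor, and the treatment effect is then estimated with and without that factor in the model. Both estimates are unbiased; the adjusted one has the smaller standard error.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
se_unadj, se_adj = [], []

for _ in range(200):                                  # 200 simulated trials
    n = 200
    df = pd.DataFrame({"high_risk": rng.integers(0, 2, n)})

    # Stratified randomization: balance the arms separately within each risk stratum.
    df["arm"] = 0
    for _, idx in df.groupby("high_risk").groups.items():
        arms = np.resize([0, 1], len(idx))
        df.loc[idx, "arm"] = rng.permutation(arms)

    # Outcome strongly driven by the stratification factor (invented coefficients).
    df["y"] = 1.0 * df["arm"] - 3.0 * df["high_risk"] + rng.normal(0, 1, n)

    se_unadj.append(smf.ols("y ~ arm", data=df).fit().bse["arm"])
    se_adj.append(smf.ols("y ~ arm + high_risk", data=df).fit().bse["arm"])

print(f"mean SE, stratification ignored:   {np.mean(se_unadj):.3f}")
print(f"mean SE, stratification reflected: {np.mean(se_adj):.3f}")
```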
This inefficiency is avoidable by pre-specification. The primary analysis plan must specify that stratification factors will be included in the primary model, and this specification must be consistent with the randomization scheme. A trial that stratifies on three factors and then runs a primary analysis without including those factors in the model has failed to capture the benefit of the stratification it paid the operational cost to implement.
Regulatory agencies are attentive to the consistency between the randomization scheme and the primary analysis. The FDA and EMA both expect the primary analysis to reflect the stratification: for continuous and binary outcomes by including the stratification factors as covariates, and for time-to-event outcomes by stratifying the model or providing a sensitivity analysis confirming that the unstratified result is consistent. A primary analysis that ignores the stratification scheme is not incorrect, but it will draw questions, and the answers must explain why the analysis does not reflect the randomization it was built on.
Stratification versus subgroup analysis
Stratification is often confused with pre-specified subgroup analysis, and the confusion creates design errors that are difficult to correct after enrollment.
Stratification ensures that the randomization is balanced on a factor and improves the efficiency of the overall treatment effect estimate. It does not, by itself, enable a confirmatory subgroup analysis within each stratum.
A pre-specified subgroup analysis within a stratum is a separate hypothesis test, requiring its own multiplicity control, its own power calculation, and its own estimand specification. The fact that the subgroup-defining variable is also a stratification factor does not give the subgroup analysis any special statistical status. The subgroup analysis is still subject to the same risks of false positives and underpowered true negatives as any other subgroup analysis, and it must be pre-specified with sufficient rigor to be treated as confirmatory rather than exploratory.
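A rough power sketch, using invented numbers, of why the subgroup needs its own calculation: if the biomarker-positive stratum is 40% of a 400-patient trial, the subgroup comparison rests on roughly 80 patients per arm, and the power for a plausible effect size is far below what the overall trial was sized for.

```python
from statsmodels.stats.power import TTestIndPower

n_total, frac_positive = 400, 0.40         # hypothetical trial size and subgroup prevalence
n_per_arm_subgroup = n_total * frac_positive / 2

achieved_power = TTestIndPower().solve_power(
    effect_size=0.30,                      # assumed standardized effect in the subgroup
    nobs1=n_per_arm_subgroup,
    alpha=0.05,
    ratio=1.0,
)
print(f"power in the biomarker-positive subgroup alone: {achieved_power:.2f}")  # well below 0.80
```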
Stratifying on a variable and planning a subgroup analysis on the same variable serves both purposes simultaneously—it ensures the main analysis is efficient with respect to that variable and positions the subgroup analysis as pre-specified—but the two purposes require different documentation and different statistical controls. A trial that stratifies on biomarker status and intends to make a confirmatory claim in the biomarker-positive subgroup must: calculate sample size for the subgroup analysis separately, pre-specify the analysis in the protocol and SAP, and control the type I error across the overall analysis and the subgroup analysis according to a pre-specified hierarchy. Stratification alone does not accomplish any of these.
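One common way to control the type I error across the overall analysis and the subgroup analysis is a fixed-sequence hierarchy, sketched below with placeholder p-values; the ordering and the alpha level are illustrative and would be pre-specified in the SAP.

```python
ALPHA = 0.025  # illustrative one-sided alpha

def fixed_sequence(p_overall, p_subgroup, alpha=ALPHA):
    """Fixed-sequence testing: the subgroup is tested only if the overall test succeeds."""
    claims = []
    if p_overall <= alpha:
        claims.append("overall treatment effect")
        if p_subgroup <= alpha:
            claims.append("biomarker-positive subgroup effect")
    return claims

# Placeholder p-values: the overall test succeeds, the subgroup test does not.
print(fixed_sequence(p_overall=0.012, p_subgroup=0.060))
```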
Stratification in multi-regional trials
In trials conducted across multiple geographic regions—global trials with sites in North America, Europe, and Asia—region is sometimes proposed as a stratification factor. The motivation is legitimate: regional differences in standard of care, patient population, measurement procedures, and regulatory requirements can all produce prognostic differences between regions that, if imbalanced between arms, could distort the primary comparison.
But region as a stratification factor requires careful consideration. If the number of regions is large and the number of patients per region is small, stratification by region creates many strata with few patients each, reducing the effectiveness of the stratification. If the treatment effect is expected to be homogeneous across regions—as is typically assumed for a global trial—the efficiency benefit of regional stratification is small, because the regions are not strong independent predictors of outcome net of other factors.
The more important question is whether the primary analysis should include region as a covariate, independently of whether it is a stratification factor. In trials where regulatory submissions must be made in multiple regions, and where each regional regulatory agency will examine the trial’s primary result, including region in the primary analysis ensures that the result is not driven by regional imbalance and that the estimate is interpretable as an average effect across regions rather than the effect in the largest or most influential region. This is a pre-specification decision that should be addressed in the protocol and SAP, whether or not region is formally a stratification factor.
What this section demands before proceeding
The stratification factors must be specified before enrollment begins, with the rationale for each factor documented: why this factor is prognostic, why it must be balanced across arms, and why the expected benefit of stratification justifies the operational complexity.
The ascertainment procedure for each factor must be specified at the same level of detail. Who assesses the factor, at what time relative to randomization, with what criteria, and what happens when the assessment is ambiguous or delayed. The handling of misascertainment must be pre-specified: analyze according to the stratum as randomized, not the stratum as corrected.
The primary analysis must be pre-specified to include the stratification factors, with the specific model form documented. And the relationship between stratification and any planned subgroup analysis must be explicitly addressed: if a stratification factor is also the basis for a pre-specified subgroup analysis, the subgroup analysis requires its own power calculation, multiplicity control, and estimand specification, independent of the stratification.
The stratification is a commitment. What it commits to—balanced arms on known prognostic factors, efficient primary analysis, pre-specified handling of misascertainment—must be documented before enrollment begins, not rationalized after the data are seen.
References: Kernan et al., “Stratified Randomization for Clinical Trials,” J Clin Epidemiol 1999; Pocock and Simon, “Sequential Treatment Assignment with Balancing for Prognostic Factors in the Controlled Clinical Trial,” Biometrics 1975; Kahan and Morris, “Reporting and Analysis of Trials Using Stratified Randomisation in Leading Medical Journals,” BMJ 2012; EMA, Guideline on Adjustment for Baseline Covariates in Clinical Trials, 2015.