1.1 Purpose and Estimand
The question that precedes the hypothesis
A clinical trial has a hypothesis. It also has a purpose. These are not the same thing, and confusing them is one of the most consequential errors in trial design.
The hypothesis is statistical: the treatment effect is not zero, or the treatment is not inferior by more than a specified margin. The purpose is scientific and clinical: we want to know whether this treatment, given to these patients, in this way, produces a benefit that matters. The hypothesis is testable. The purpose is what makes the test worth running.
The estimand is the bridge between the two. It is the precise specification of the treatment effect that the trial is designed to estimate—the quantity whose value, once known, would answer the clinical question. Defined well, it makes the hypothesis a meaningful test of the purpose. Defined poorly, it allows the hypothesis to be tested rigorously while the purpose goes unanswered.
This is not a theoretical concern. It is the routine outcome of trial design processes that jump to the hypothesis before the purpose has been settled.
What the estimand framework requires you to specify
ICH E9(R1) defines the estimand through four attributes: the population, the variable, the handling of intercurrent events, and the population-level summary. Each attribute corresponds to a decision. Each decision has an owner. Each owner takes on responsibility for the consequences if the decision turns out to be wrong.
Population. Who are the patients in whom the treatment effect is being estimated? This is not identical to the eligibility criteria, though it is informed by them. The eligibility criteria define who can enroll. The population attribute of the estimand defines the patients in whom the estimated effect is intended to be interpreted. In a trial that enrolls a broad population but expects the treatment to work primarily in a biomarker-defined subgroup, the estimand population and the enrolled population may differ—and that difference matters for what the trial can actually claim.
Variable. What outcome is being measured, and how? The variable is the measurement at the patient level—the raw material from which the treatment effect will be estimated. It must be defined precisely enough that two independent assessors, given the same patient at the same time, would record the same value. Ambiguity in the variable propagates through to ambiguity in the estimand: if the measurement is not clearly defined, the treatment effect it generates cannot be clearly interpreted.
Intercurrent events. What happens when the clinical course does not follow the clean experimental logic of the protocol? Patients discontinue. They switch to rescue medication. They die before the primary endpoint can be measured. They receive treatments outside the protocol. Each of these intercurrent events disrupts the simple experimental comparison the trial was designed to make—and the strategy for handling each one determines, more than almost any other design choice, what question the trial is actually answering. This is addressed in detail in Section 1.3. The point here is that the intercurrent event strategy is an estimand attribute, not an analysis decision. It must be settled at design, not resolved post-hoc.
Population-level summary. How is the patient-level variable aggregated into a single estimate of the treatment effect? A mean difference, a median difference, a proportion achieving response, a hazard ratio, a win ratio—each summarizes the individual-level distribution differently and implies a different claim about how the treatment effect distributes across the population. The choice of population-level summary is not purely statistical. It reflects a claim about whether the treatment effect is expected to be homogeneous or heterogeneous, concentrated in responders or distributed across the population, durable or transient.
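To make that contrast concrete, here is an illustrative sketch (the patient-level data are invented for the example): the same outcomes aggregated under three different population-level summaries, each producing a different-looking treatment effect and implying a different claim.

```python
import statistics

# Invented patient-level data: change from baseline (negative = improvement).
# The treated arm's benefit is deliberately concentrated in half the patients.
treated = [-20, -18, -15, -1, 0, 0]
control = [-5, -4, -4, -3, -2, -2]

# Mean difference: a claim about the average shift across the population.
mean_diff = statistics.mean(treated) - statistics.mean(control)

# Median difference: a claim about the typical patient, less influenced
# by a few large responders.
median_diff = statistics.median(treated) - statistics.median(control)

# Responder proportion (improvement of at least 5 points): a claim that
# the effect is concentrated in a subset of responders.
def responder_rate(xs, threshold=-5):
    return sum(x <= threshold for x in xs) / len(xs)

prop_diff = responder_rate(treated) - responder_rate(control)

print(f"mean difference:          {mean_diff:+.1f}")
print(f"median difference:        {median_diff:+.1f}")
print(f"responder rate difference:{prop_diff:+.2f}")
```

On these invented numbers the three summaries are roughly -5.7 points, -4.5 points, and +33 percentage points: three defensible summaries of the same data, each a different population-level claim.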
These four attributes are not independent. The population shapes which intercurrent events are likely and how frequently they will occur. The variable constrains the available population-level summaries. The intercurrent event strategy affects the interpretation of the summary. Specifying one attribute without examining its implications for the others produces an estimand that is formally complete but substantively incoherent.
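One way to force the joint specification into the open is to record the four attributes together and cross-check them. The sketch below is purely illustrative; the attribute vocabulary and the compatibility table are invented for the example, not drawn from ICH E9(R1).

```python
from dataclasses import dataclass

# Illustrative compatibility table: which population-level summaries
# make sense for which variable types. Invented for this sketch.
SUMMARY_FOR_TYPE = {
    "continuous": {"mean difference", "median difference"},
    "binary": {"risk difference", "odds ratio"},
    "time-to-event": {"hazard ratio", "difference in median survival"},
}

@dataclass
class Estimand:
    population: str            # in whom the effect is to be interpreted
    variable: str              # the patient-level measurement
    variable_type: str         # key into SUMMARY_FOR_TYPE
    intercurrent_events: dict  # event -> handling strategy
    summary: str               # population-level summary

    def check(self) -> None:
        # Crude coherence check: the summary must fit the variable type.
        allowed = SUMMARY_FOR_TYPE[self.variable_type]
        if self.summary not in allowed:
            raise ValueError(
                f"summary '{self.summary}' is incompatible with a "
                f"{self.variable_type} variable; expected one of {sorted(allowed)}"
            )

estimand = Estimand(
    population="adults with uncontrolled hypertension on two agents",
    variable="change in systolic blood pressure at week 12",
    variable_type="continuous",
    intercurrent_events={
        "rescue medication": "treatment policy",
        "death": "composite",
    },
    summary="mean difference",
)
estimand.check()  # passes; summary="hazard ratio" here would raise
```

The value of writing the attributes down in one place is not the code; it is that an incoherent combination, such as a hazard ratio summarizing a continuous variable, fails loudly at specification time rather than surfacing at analysis.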
Treatment policy versus hypothetical: the choice that changes everything
Of the decisions embedded in the estimand framework, the most consequential is the choice of intercurrent event strategy for the primary analysis. And within the available strategies, the choice between treatment policy and hypothetical estimands is the one that most sharply divides what the trial is claiming to show.
Under a treatment policy strategy, the outcome is measured regardless of whether the patient followed the protocol, took the assigned treatment, used rescue medication, or discontinued. The estimand reflects the effect of being assigned to treatment in the context in which the treatment would actually be used—including all the messiness of real-world treatment behavior. This is the question a payer or a clinician asks when deciding whether to prescribe: if I give this treatment to patients like these, what happens, accounting for the fact that some will discontinue, some will need rescue, and some will deviate from the regimen?
Under a hypothetical strategy, the outcome is estimated under a counterfactual condition—typically, what would have happened if the intercurrent event had not occurred. What would the patient’s outcome have been if they had not discontinued? What would the blood pressure have been if they had not needed rescue medication? This is the question a pharmacologist or a mechanism-focused clinical scientist asks: what does this treatment do, under conditions where its effect is not diluted by non-adherence or rescue?
These are both legitimate questions. They are not the same question. A trial designed to answer the treatment policy question will produce an estimate that is directly relevant to clinical practice but may underestimate the biological effect of the treatment. A trial designed to answer the hypothetical question will produce an estimate that is scientifically cleaner but may overestimate the benefit patients will actually experience. Neither estimate is wrong. Each answers a different question, and the choice between them must be made before the trial begins—because the choice determines what data need to be collected, how the analysis must be specified, and what the result will actually mean.
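A toy simulation can make the divergence concrete. Every number below is an assumption invented for illustration: the drug lowers the outcome by 10 points while actually taken, 30% of treated patients discontinue and lose the benefit by the final visit, and, because this is a simulation, the counterfactual "no discontinuation" outcome can be observed directly, which a real trial cannot.

```python
import random

random.seed(0)
n = 10_000
effect_on_drug = -10.0   # assumed benefit while actually on drug
p_discontinue = 0.30     # assumed discontinuation rate

control = [random.gauss(0, 5) for _ in range(n)]

treatment_policy = []  # outcome measured at the final visit regardless of adherence
hypothetical = []      # counterfactual outcome had no one discontinued
for _ in range(n):
    noise = random.gauss(0, 5)
    hypothetical.append(noise + effect_on_drug)
    if random.random() < p_discontinue:
        treatment_policy.append(noise)                    # benefit lost
    else:
        treatment_policy.append(noise + effect_on_drug)   # benefit retained

mean = lambda xs: sum(xs) / len(xs)
policy_effect = mean(treatment_policy) - mean(control)      # ~ -7
hypothetical_effect = mean(hypothetical) - mean(control)    # ~ -10
```

Under these assumptions the treatment policy estimate lands near -7 (the benefit diluted by discontinuation) and the hypothetical estimate near -10 (the undiluted pharmacological effect). Neither is wrong; they answer different questions, and a real trial can only estimate the hypothetical quantity indirectly, under further assumptions.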
The difficulty is that this choice is often deferred. The protocol specifies an intention-to-treat analysis without asking which estimand that analysis serves. An ITT analysis is consistent with a treatment policy estimand—but only if the outcome is measured regardless of discontinuation or rescue. If discontinuations are handled by last-observation-carried-forward, or if rescued patients are censored, the ITT label is being applied to an analysis that does not serve the treatment policy estimand. The result is an analysis with an ITT label that answers neither the treatment policy question nor the hypothetical question clearly.
This is the specific failure mode the estimand framework was designed to prevent: not technical error, but conceptual mismatch between the analysis performed and the question being claimed.
Who specifies the estimand, and when
The estimand is a scientific specification. Its four attributes—population, variable, intercurrent event strategy, population-level summary—are scientific and clinical choices, not statistical ones. Statisticians can evaluate the feasibility of estimating a proposed estimand, can flag inconsistencies between attributes, and can specify the analysis that will estimate it. They cannot substitute for the clinical judgment about which estimand is the right one.
This creates an accountability structure that is often poorly managed in practice.
The clinical team—physicians, clinical scientists, disease area experts—must own the question. What does it mean for this treatment to work, for these patients, in this clinical context? The answer to that question determines the estimand. The clinical team must be present for the estimand discussion, must understand what each attribute commits the trial to, and must be able to defend the choices to a regulator, to an ethics committee, and eventually to the clinical community that will use the trial’s results.
The statistical team must own the technical specification. Given the estimand the clinical team has defined, what analysis will produce an unbiased estimate of it? What data must be collected to support that analysis? What sensitivity analyses will evaluate the robustness of the primary result? The statistician who cannot answer these questions for a given estimand has not yet finished their work.
The regulatory team must own the claim. Given the estimand the trial is designed to estimate, what will the regulatory submission be able to assert? What label language will be defensible? What conditions will be attached to the approval, and are those conditions acceptable given the development strategy?
When these three ownership structures are in place, the estimand becomes a shared document that the entire team has contributed to and is accountable for. When ownership is missing—most commonly, when the estimand is treated as a statistical artifact to be finalized by the biostatistics team after the protocol is already drafted—it becomes a technical formality that no one fully understands and no one is responsible for defending.
The regulatory interface
ICH E9(R1) was finalized in 2019 and has been progressively incorporated into regulatory expectations at FDA, EMA, and PMDA. The practical consequence is that regulators now expect the estimand to be specified in the protocol and the statistical analysis plan, with the analysis explicitly connected to the specified estimand.
What this means for trial design is that the estimand conversation can no longer be deferred to the analysis stage. A protocol that specifies an ITT primary analysis without connecting it to an estimand attribute—or that specifies a per-protocol sensitivity analysis without explaining what estimand it targets—will receive questions during the review that could have been answered during the design.
More substantively, the regulatory expectation creates an opportunity. Pre-submission meetings—Type B meetings at FDA, scientific advice at EMA—are increasingly used to align on estimands before the trial begins. This alignment is valuable not because it guarantees approval, but because it forces the question into the open at the moment when the design can still accommodate the answer. A regulator who disagrees with the chosen estimand during a pre-submission meeting is generating information that the sponsor needs. A regulator who raises the same disagreement during the review of a completed trial is generating a problem that cannot be fixed.
The estimand framework, in this sense, is not primarily a regulatory compliance tool. It is a forcing function for a conversation that should happen anyway—among the clinical, statistical, and regulatory members of the design team—about what question the trial is for.
When the estimand is wrong
An estimand can be wrong in two ways. It can be internally inconsistent—the attributes are incompatible with each other, the analysis cannot produce an unbiased estimate of the specified quantity, or the population-level summary does not correspond to the patient-level variable. This kind of error is detectable by careful specification and review. It is the statistician’s domain.
The second kind of error is harder to detect and more damaging: the estimand is internally consistent but answers the wrong question. The treatment effect is estimated precisely, the analysis is technically valid, the result is statistically significant—and the clinical community receives the result with indifference because it does not tell them what they needed to know.
This failure is not a statistical failure. It is a design failure, and it is attributable to the people who defined the question. Which brings the discussion back to the point that opened this section.
The hypothesis is testable. The purpose is what makes the test worth running. The estimand is what connects them. If that connection is not made explicitly, by the right people, at the right moment in the design process, the trial may produce a valid answer to a question no one was asking.
That outcome is not a neutral result. It consumes resources, exposes patients to intervention, and advances nothing. The estimand framework asks: what question are we trying to answer, precisely? The discipline it demands is not bureaucratic. It is the minimum required to make the trial worth running.
References: ICH E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials (2019); Lawrance et al., “What Is an Estimand, and How Does It Relate to Quantifying the Effect of Treatment on Patient-Reported Outcomes?” J Patient-Rep Outcomes 2020; Permutt, “A Taxonomy of Estimands for Regulatory Clinical Trials with Discontinuation,” Stat Med 2016.