1.2 Endpoint Choice

The question underneath the endpoint

Every trial has a primary endpoint. What is less often acknowledged is that every primary endpoint is an answer to a prior question—a question that is usually not asked explicitly.

The prior question is: what does it mean for this treatment to work?

That question has at least four different interpretations, and they do not always converge on the same endpoint.

It could mean: the treatment produces a measurable biological change. This leads toward biomarkers, surrogate endpoints, mechanistic measures—things that move reliably and can be detected with a small sample.

It could mean: the treatment changes how patients feel or function. This leads toward patient-reported outcomes, quality of life instruments, functional scales—things that reflect lived experience rather than laboratory values.

It could mean: the treatment prevents the outcomes that matter most clinically—death, hospitalization, irreversible organ damage. This leads toward hard clinical endpoints, time-to-event outcomes, event rates—things that are unambiguous but slow to accumulate and expensive to detect.

It could mean: the treatment produces effects that regulators will accept as evidence of benefit. This leads toward endpoints with regulatory precedent—endpoints that have been accepted before, in related indications, by the relevant agency—regardless of whether they optimally capture the treatment’s mechanism.

These four interpretations—biological, experiential, clinical, regulatory—are not always in conflict. Sometimes they converge. But often they point in different directions, and the choice of primary endpoint is implicitly a choice about which interpretation takes priority. That choice is rarely made explicitly. It is more often made by imitation: we use what was used before, in a similar trial, in a similar indication, without asking whether the prior choice was well-reasoned or merely precedent.

The cost of not asking is that the endpoint can be simultaneously technically valid and scientifically inappropriate. A trial can be powered, executed, and analyzed correctly—and still answer the wrong question—because the endpoint was chosen without examining what question it was supposed to answer.

What an endpoint actually commits you to

Selecting a primary endpoint is not selecting a measurement. It is making three simultaneous commitments.

The first is a commitment about what counts as evidence. A primary endpoint defines the evidentiary standard. When the trial is over, the evidence that the treatment works—or does not—will be expressed in terms of that endpoint. If the endpoint does not capture what matters, the evidence will not answer what matters, and no amount of secondary analysis will repair that gap.

The second is a commitment about the population. Different endpoints are sensitive in different populations. A six-minute walk test is meaningful in patients who are ambulatory but impaired; it loses sensitivity at the extremes of disease, where patients either cannot complete the walk or already walk close to normal distances. A composite endpoint including hospitalization is driven by the subgroup with the highest event rate. An endpoint that seems right for the overall trial population may be insensitive in the patients who most need the treatment, and vice versa. Choosing an endpoint without specifying the population in which it is expected to be sensitive is choosing a question that may not be answerable in the enrolled patients.

The third is a commitment about timing. Endpoints have natural timescales. Biomarkers change in weeks. Functional outcomes change over months. Hard clinical events accumulate over years. The endpoint must be matched to the duration of follow-up the trial can sustain—which depends on enrollment, retention, and resources that are often not fully characterized when the endpoint is chosen. An endpoint that requires three years of follow-up to show a signal is the wrong primary endpoint for a trial designed to run for eighteen months. This sounds obvious. It is routinely violated.
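
The timing mismatch can be made concrete with a rough calculation. The sketch below is illustrative only: the function name, the 5% annual event rate, the 2,000 patients, and the target of 250 events are all hypothetical, and the model assumes a constant hazard with simultaneous enrollment and no dropout, which flatters the trial.

```python
def expected_events(n_patients, annual_event_rate, followup_years):
    """Expected number of patients with at least one event.

    Assumes a constant hazard, simultaneous enrollment, and no dropout,
    all of which are optimistic simplifications.
    """
    event_free = (1 - annual_event_rate) ** followup_years
    return n_patients * (1 - event_free)


# Hypothetical design inputs, chosen only for illustration.
N_PATIENTS = 2000
ANNUAL_EVENT_RATE = 0.05
EVENTS_NEEDED = 250  # assumed target from a separate power calculation

for years in (1.5, 3.0):
    events = expected_events(N_PATIENTS, ANNUAL_EVENT_RATE, years)
    status = "enough" if events >= EVENTS_NEEDED else "short"
    print(f"{years:.1f} years: about {events:.0f} expected events ({status})")
```

Under these assumed inputs, eighteen months of follow-up falls well short of the target and three years clears it. Staggered enrollment and dropout would only widen the gap.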

Each of these three commitments interacts with the others. A composite endpoint that includes both a biomarker component and a hard clinical event component is trying to be sensitive to biological change and clinically meaningful at the same time. It is also making a population claim—that both components will be relevant for the enrolled patients—and a timing claim—that the composite will accumulate meaningfully within the trial’s follow-up window. If any one of these commitments is wrong, the endpoint is wrong, even if the measurement itself is technically sound.

The surrogate problem

A surrogate endpoint is one that is expected to predict a clinical outcome without measuring it directly. Tumor shrinkage as a surrogate for survival. Blood pressure reduction as a surrogate for cardiovascular events. CD4 count as a surrogate for AIDS-defining illness.

Surrogates are attractive because they are faster, cheaper, and more sensitive than the clinical outcomes they represent. A trial powered on HbA1c can enroll far fewer patients, and follow them for far less time, than a trial powered on cardiovascular mortality. The efficiency gain is real.
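
The size of that gain can be sketched with textbook normal-approximation sample size formulas. Everything numeric here is an assumption for illustration: a 0.5-point HbA1c difference with a standard deviation of 1.0, and a reduction in cardiovascular mortality from 4% to 3.2% over the trial. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def _z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)


def n_per_arm_continuous(delta, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sample comparison of means."""
    return ceil(2 * ((_z(1 - alpha / 2) + _z(power)) * sd / delta) ** 2)


def n_per_arm_binary(p_control, p_treated, alpha=0.05, power=0.80):
    """Per-arm sample size for two proportions (unpooled variance)."""
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    z_sum = _z(1 - alpha / 2) + _z(power)
    return ceil(z_sum ** 2 * variance / (p_control - p_treated) ** 2)


# Hypothetical effect sizes, not taken from any specific trial.
n_surrogate = n_per_arm_continuous(delta=0.5, sd=1.0)
n_mortality = n_per_arm_binary(p_control=0.04, p_treated=0.032)
print(f"HbA1c endpoint:     {n_surrogate} patients per arm")
print(f"Mortality endpoint: {n_mortality} patients per arm")
```

With these inputs the surrogate trial needs tens of patients per arm and the mortality trial needs thousands. The ratio depends entirely on the assumed effect sizes, but the order-of-magnitude difference is typical of why surrogates are tempting.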

The cost is a claim. When you choose a surrogate primary endpoint, you are claiming that improvement in the surrogate constitutes evidence of clinical benefit. That claim requires two things to be true: the surrogate must be reliably associated with the clinical outcome in this disease, and the treatment’s effect on the surrogate must reliably predict its effect on the clinical outcome. Both conditions must hold. The first is sometimes well-established. The second is often assumed without examination.

The history of surrogate endpoint failure is long. Antiarrhythmic drugs that suppressed ventricular ectopy and increased mortality. Bone density improvements that did not translate to fracture prevention. Cholesterol changes that did not predict cardiovascular outcomes uniformly across drug classes. In each case, the mechanism connecting surrogate to clinical outcome was more fragile than the trial designers assumed—or the treatment affected the surrogate and the outcome through different pathways.

This does not mean surrogates should not be used. In indications with long clinical timelines, in populations with high unmet need, in early development stages where resource constraints are real, surrogates may be the right primary endpoint. The decision to use a surrogate is defensible. What is not defensible is using a surrogate without being explicit about the claim it requires, without identifying the evidence that supports the surrogate-to-outcome relationship, and without acknowledging the conditions under which that relationship could fail.

Regulatory agencies are increasingly attentive to this. The FDA’s Biomarker Qualification Program, the EMA’s qualification process for novel endpoints, and the increasing prevalence of accelerated approval conditioned on confirmatory outcomes trials all reflect the same recognition: a surrogate endpoint is a provisional claim, not a settled one. The trial designed around a surrogate should acknowledge that provisionality and plan for what happens if the confirmatory evidence does not come.

Composite endpoints: combination as a decision

Composite endpoints combine multiple individual outcomes into a single primary measure—typically taking the first occurrence of any component event as the endpoint. Major adverse cardiovascular events (MACE), combining cardiovascular death, myocardial infarction, and stroke, is the canonical example. Composite endpoints are used when no single component occurs frequently enough to power a trial, or when the clinical question spans multiple manifestations of the same disease process.

The logic is straightforward: if any of these outcomes would represent treatment failure, combining them gives the trial more events and therefore more power. The efficiency gain is real.
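
A minimal sketch of that logic uses Schoenfeld's approximation for the number of events a log-rank test needs under 1:1 allocation. The hazard ratio of 0.85 and the event probabilities of 3% and 20% are hypothetical, and the calculation assumes the same hazard ratio applies whichever endpoint is used, which is exactly the assumption a composite puts at risk.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Events needed for a two-sided log-rank test with 1:1 allocation
    (Schoenfeld's approximation)."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    return ceil(4 * z_sum ** 2 / log(hazard_ratio) ** 2)


def patients_needed(events, event_probability):
    """Total patients needed if each has the given chance of an event
    during follow-up (ignores dropout)."""
    return ceil(events / event_probability)


# Hypothetical inputs: the same target effect on two different endpoints.
events = required_events(hazard_ratio=0.85)
n_single = patients_needed(events, event_probability=0.03)
n_composite = patients_needed(events, event_probability=0.20)
print(f"Events required:            {events}")
print(f"Patients, single component: {n_single}")
print(f"Patients, composite:        {n_composite}")
```

The required number of events does not change; what changes is how many patients must be enrolled to observe them. That is the whole of the efficiency argument, and it holds only if the treatment effect on the composite is as large as the effect on the component.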

The interpretive cost is also real, and it is proportional to how heterogeneous the components are.

A composite endpoint’s result is driven by whichever component occurs most frequently. If cardiovascular death occurs in 3% of patients and hospitalization for heart failure occurs in 18%, a composite of the two will behave almost entirely like a hospitalization endpoint. A hazard ratio of 0.85 on the composite may reflect a large effect on hospitalization and no effect on mortality—or it may reflect a modest effect on both. The composite does not distinguish between these possibilities. The composite says: something happened less often. It does not say what, or to whom, or whether it was the thing that mattered most.
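
The ambiguity is easy to reproduce with arithmetic. The sketch below uses the 3% and 18% control rates from the example above, works with risk ratios rather than hazard ratios for simplicity, and assumes the two components occur independently. The component-level effects (0.82 and 0.85) and the function names are hypothetical.

```python
def composite_risk(p_death, p_hosp):
    """Risk of at least one component event, assuming independence."""
    return 1 - (1 - p_death) * (1 - p_hosp)


def composite_risk_ratio(p_death, p_hosp, rr_death, rr_hosp):
    """Treated-versus-control risk ratio on the composite, given the
    control risks and a risk ratio for each component."""
    control = composite_risk(p_death, p_hosp)
    treated = composite_risk(p_death * rr_death, p_hosp * rr_hosp)
    return treated / control


P_DEATH, P_HOSP = 0.03, 0.18  # control-arm risks from the example

# Scenario A: treatment reduces hospitalization only.
hosp_only = composite_risk_ratio(P_DEATH, P_HOSP, rr_death=1.00, rr_hosp=0.82)
# Scenario B: treatment has a modest effect on both components.
both = composite_risk_ratio(P_DEATH, P_HOSP, rr_death=0.85, rr_hosp=0.85)

print(f"Hospitalization only: composite risk ratio {hosp_only:.2f}")
print(f"Both components:      composite risk ratio {both:.2f}")
```

Both scenarios produce a composite risk ratio close to 0.85. A reader given only the composite result has no way to tell them apart, although one scenario includes a mortality benefit and the other does not.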

This creates a specific interpretive obligation. When a composite endpoint is chosen, the trial should be able to specify, in advance, which components are expected to be affected and in what direction. If the treatment is not expected to affect one of the components, that component should not be in the composite—including it dilutes the signal and generates an uninterpretable result. If the components are ordered by clinical importance, the analysis should reflect that ordering, not treat a hospitalization and a death as equivalent events.
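
One published way to respect such an ordering is the win ratio (Pocock and colleagues, 2012), which this section does not name and which is introduced here only as an illustration. Every treated patient is compared with every control patient, first on death and then, only if death does not separate them, on hospitalization. The sketch below is deliberately simplified: it assumes every patient was followed for the same fixed period with no censoring, and the patient records are invented.

```python
from itertools import product


def compare(a, b):
    """Return 1 if patient a fares better than b, -1 if worse, 0 if tied.

    Each patient is (death_month, hosp_month); None means the event did
    not occur during follow-up. Death is compared before hospitalization.
    """
    for tier in (0, 1):  # tier 0 = death, tier 1 = hospitalization
        time_a, time_b = a[tier], b[tier]
        if time_a == time_b:
            continue  # tied on this tier, move to the next one
        if time_a is None:
            return 1  # a avoided the event, b did not
        if time_b is None:
            return -1
        return 1 if time_a > time_b else -1  # a later event is better
    return 0


def win_loss_tie(treated, control):
    """Count wins, losses, and ties over all treated-control pairs."""
    results = [compare(t, c) for t, c in product(treated, control)]
    return results.count(1), results.count(-1), results.count(0)


# Invented records: (death_month, hosp_month).
treated = [(None, None), (None, 10), (20, 5)]
control = [(None, 4), (12, 3), (None, None)]

wins, losses, ties = win_loss_tie(treated, control)
print(f"wins={wins}, losses={losses}, ties={ties}, win ratio={wins / losses:.2f}")
```

Note the pair in which the treated patient died at month 20 and the control patient was only hospitalized at month 4. A time-to-first-event analysis would score the control patient's hospitalization as the earlier failure; the hierarchical comparison counts the death as the worse outcome.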

The FDA and EMA have both addressed composite endpoints in their guidance on multiple endpoints and multiplicity, and that guidance reflects these concerns. The practical implication is that choosing a composite endpoint is not an escape from hard choices about what the trial is measuring. It is a different set of hard choices, with different interpretive obligations. Making those choices explicitly, before the trial begins, is what separates a defensible composite from a convenient one.

Patient-reported outcomes: the ownership problem

Patient-reported outcomes—PROs—measure what patients experience directly: symptoms, function, health-related quality of life. They are, in many indications, the most clinically meaningful endpoints available. A treatment that reduces pain, improves sleep, restores mobility, or reduces anxiety is producing benefit that no biomarker or physician assessment fully captures. For conditions where patients live with chronic symptoms, PROs may be the only endpoints that matter.

The challenges with PROs as primary endpoints are well known: measurement variability, the definition of clinically meaningful change, the susceptibility of self-report to bias when patients know their treatment assignment, as in open-label settings, and the burden of assessment in patients who are already symptomatic. These are real. They are also solvable, given sufficient rigor in instrument selection, definition of the minimally important difference, and pre-specification of the analysis.

What is less often discussed is the ownership problem.

A PRO endpoint requires someone to specify, before the trial, what magnitude of change on the instrument constitutes a clinically meaningful difference. This is not a statistical question. It is a clinical and patient-centered question: how much better does a patient have to feel for the improvement to matter? The answer depends on the disease, the instrument, the population, and the treatment context. It cannot be derived from statistical convention. It must be owned by the clinical team, validated against patient and clinician perspectives, and documented before the trial is powered.

When this ownership is missing—when the minimally important difference is selected to match the expected treatment effect, or adopted from a different population without examination, or left ambiguous until after unblinding—the PRO endpoint becomes uninterpretable. A statistically significant improvement on a PRO scale is evidence of benefit only if the magnitude of improvement was pre-specified as clinically meaningful. If the threshold was set after the effect size was known, the evidence is circular.

Regulatory agencies have become sophisticated about this. The FDA’s guidance on patient-focused drug development reflects a sustained effort to require that the patient’s perspective be captured rigorously and prospectively, not retrofitted to whatever the trial happened to show. The implication for design is that PRO endpoints require as much pre-specification rigor as any other primary endpoint—arguably more, because the connection between the measurement and the clinical meaning is less self-evident.

Who owns the endpoint decision

In most trial design processes, the primary endpoint is proposed by the clinical team, reviewed by the statistician, and negotiated with the regulatory agency—either in a pre-submission meeting or during the review process. This sequence is reasonable. But it obscures an accountability question that surfaces later, when the trial is over and someone asks why the endpoint was chosen.

The endpoint is a scientific claim about what constitutes evidence of benefit. That claim should be owned by the people who understand the clinical context: what the disease does, how patients experience it, what physicians use to assess it, and what magnitude of change would alter clinical practice. Statisticians can evaluate the feasibility of measuring an endpoint, the sample size implications of different effect sizes, and the regulatory history of similar endpoints. They cannot substitute for clinical ownership of the underlying claim.

This matters because endpoint decisions made without clear ownership tend to drift toward what is measurable rather than what is meaningful. The path of least resistance is to use the endpoint that has been accepted before, by this agency, in this indication—regardless of whether it is the right question for this treatment in this population at this stage of development. Regulatory precedent is a reasonable input. It is not a substitute for clinical reasoning.

The cleaner structure is for the clinical team to own the question—what does it mean for this treatment to work?—and for that answer to drive the endpoint selection, which the statistician then evaluates for feasibility. Endpoint choice that starts with statistical feasibility and works backward to clinical meaning produces endpoints that are powered correctly and answer the wrong question.

The pressure to compromise

There is a characteristic moment in many trial design processes where the endpoint choice gets compromised. The clinical team wants a hard clinical endpoint—mortality, irreversible organ damage—that will take years to accrue. The development timeline will not support that. The surrogate endpoint that can be measured earlier does not have regulatory acceptance in this indication. The PRO endpoint the patients care most about has never been used as a primary endpoint before.

The resolution is typically a composite that combines elements of all three—hard enough to seem clinically relevant, accruing fast enough to fit the timeline, containing a PRO component that can be highlighted in the label. The composite is presented as capturing the full range of clinical benefit. What it actually captures is the range of compromises that were available.

This is not always wrong. Sometimes a composite genuinely reflects the clinical question, and the compromise is between competing goods rather than between rigor and convenience. But the distinction matters. A composite chosen to capture clinical complexity is a defensible design choice. A composite chosen to make the timeline work is a design risk—one that the trial will carry through to the regulatory review, where the agency will ask the same questions that should have been asked at design.

The pressure to compromise is real, and it will not be eliminated by recognizing it. What can be managed is the transparency of the compromise—being explicit about which consideration drove the endpoint choice, what was sacrificed, and under what conditions the endpoint may fail to answer the question it was supposed to answer.

Transparency before the trial does not protect against a negative result. It does protect against an uninterpretable one.

What this section demands before proceeding

Before moving to intercurrent events—which are, in many ways, the place where endpoint decisions get stress-tested—the following must be resolved.

The primary endpoint must be explicitly connected to the question the trial is answering. Not “we are using MACE because it is standard in this indication,” but “we are using MACE because our treatment is expected to affect cardiovascular death and non-fatal MI through the mechanism we have documented, and stroke is included because the pathophysiology overlaps sufficiently that a stroke-reducing effect is plausible and clinically meaningful to include.”

The population in which the endpoint is sensitive must be specified. If the endpoint is expected to be sensitive only in patients with a certain baseline severity, that should be reflected in the eligibility criteria, not discovered during subgroup analysis after the trial.

The timing must be matched to the follow-up. If the endpoint requires two years to show a signal and the trial is designed for eighteen months, that mismatch must be resolved before enrollment begins—not by extending the trial after the fact.

And ownership must be assigned. Someone must be able to say, after the trial is over: this was our endpoint, this is why we chose it, and this is what the result means. If no one can say that before the trial starts, the endpoint has not been chosen. It has been inherited.

References: ICH E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials; FDA Guidance on Patient-Focused Drug Development (2022); FDA Guidance on Surrogate Endpoint Qualification; Pocock and Stone, “The Primary Outcome Fails — What Next?” N Engl J Med 2016.