Method and limits

What makes a forecast defensible

The method defines resolvable propositions about public medical transitions, freezes evidence state, registers calibrated probabilities, and scores every forecast against independently adjudicated outcomes. The method is public. The application is private.

What is temporally downstream

The temporal ladder

Medical change propagates through a sequence of 12 recognition states, from initial evidence emergence to final patient outcomes. The model estimates transition probabilities at the resolvable rungs 5–7 (guideline, regulatory, or payer coverage actions).

For a complete breakdown of all 12 rungs, including the specific observable signals, roles, and validation criteria of each stage, read the Evidence-to-Action Ladder reference.

Two products, different evaluation criteria

Recognition Intelligence (rungs 1–4)

Identifies when an evidentiary pattern is likely to become broadly recognized. Greater lead time, harder to validate, strategic intelligence.

Evaluated against: KOL convergence proxies, review-literature shifts, expert-sample stability.

Authority-Transition Forecasting (rungs 5–7)

Estimates probability that a named authority takes a defined public action by a deadline. More resolvable, more actionable, cleaner cohorts.

Evaluated against: Brier score, calibration, lead time over rung, baseline outperformance.

Forecast anatomy

Three distinct objects

Every forecast contains three independent objects. They are never conflated. Each has its own provenance, its own authority, and its own evaluation criteria.

Evidence state

The public record as it existed at the evidence cutoff. Sources, versions, timestamps, and gaps — frozen at the moment the forecast is registered. This is what the forecast is about.

Transition forecast

A probability estimate that a defined medical authority will make a specified public transition within a stated horizon. Registered once, immutable after registration, scored against the independently adjudicated outcome.

Outcome

The adjudicated result: did the transition occur, not occur, or was it judged ambiguous? Resolved by an independent party with no stake in the forecast. Published with the scoring record.

Observed NC-assessed Resolved

Proposition design

What makes a proposition resolvable

Every proposition must satisfy eight structural conditions before a probability can be registered. The system enforces these at registration time — not after the fact.

Target rung

The rung being forecast (5, 6, or 7) — guideline action, regulatory action, or payer/coverage action.

Authority

The specific medical authority that would enact the transition — a regulatory body, guideline committee, or payer. Named, not inferred.

Action

The exact public action that constitutes the transition: a label change, guideline revision, coverage decision, or formulary move. Defined, not implied.

Deadline

The stated horizon within which the forecast is evaluated. Fixed at registration. The forecast is scored at this point regardless of when the transition occurs within the window.

Resolution rule

How the outcome will be adjudicated — which source documents count, who resolves ambiguity, and what constitutes occurrence. Published before the horizon opens.

Upstream states observed

The rung 1–4 states captured at evidence cutoff: evidence emergence, replication, expert recognition, RWE confirmation.

Forecast threshold

The probability threshold at which the forecast is considered actionable (e.g., 70%). Precision floor is enforced at this threshold.

Comparator rung & timestamp rule

Which upstream rung serves as the lead-time comparator (e.g., KOL convergence at rung 3) and the explicit observation rule for timestamping it.
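The eight conditions above can be sketched as a registration-time check. This is a minimal illustration only: the class name, field names, and validation messages are hypothetical and are not taken from the NextConsensus system.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_TARGET_RUNGS = {5, 6, 7}   # guideline, regulatory, payer/coverage
UPSTREAM_RUNGS = {1, 2, 3, 4}      # states captured at the evidence cutoff


@dataclass(frozen=True)
class Proposition:
    """Hypothetical container for the eight structural conditions."""
    target_rung: int
    authority: str
    action: str
    deadline: date
    resolution_rule: str
    upstream_states: dict          # rung number (1-4) -> observed state
    forecast_threshold: float      # e.g. 0.70
    comparator_rule: str           # comparator rung plus its timestamp rule


def validate(prop: Proposition, registered_on: date) -> list:
    """Return a list of problems; an empty list means the proposition may register."""
    problems = []
    if prop.target_rung not in ALLOWED_TARGET_RUNGS:
        problems.append("target_rung must be 5, 6, or 7")
    for name in ("authority", "action", "resolution_rule", "comparator_rule"):
        if not getattr(prop, name).strip():
            problems.append(f"{name} must be stated explicitly")
    if prop.deadline <= registered_on:
        problems.append("deadline must fall after the registration date")
    if not prop.upstream_states or not set(prop.upstream_states) <= UPSTREAM_RUNGS:
        problems.append("upstream_states must record rungs 1-4 only, and at least one")
    if not 0.0 < prop.forecast_threshold < 1.0:
        problems.append("forecast_threshold must be a probability strictly between 0 and 1")
    return problems
```

Returning a list of problems rather than raising on the first one lets a registration be rejected with every missing condition named at once.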

Temporal integrity

How temporal leakage is prevented

Every probability is pinned to a specific evidence state, a specific time, and a specific record. Successive registrations are independent — they do not inherit from or cancel each other.

Evidence cutoff

All sources included in the forecast are frozen at the evidence cutoff date. Evidence that arrives after this date is excluded. The cutoff is recorded in the registration and cannot be modified.

Immutable registration

Once registered, the forecast — its probability, baseline, sources, and proposition — is locked. No edits, no backfills, no silent corrections. The registration hash is the record.

Version independence

If a source is updated after registration, the original version is preserved in the forecast record. The forecast scores against the original evidence state, not the current one.

How successive forecast registrations are handled

Multiple forecasts can target the same proposition with different horizons or different evidence cutoffs. Each is an independent registration. They do not inherit from each other, do not cancel each other, and are scored independently. The evidence state at each registration is the evidence state for that forecast — not the evidence state at any prior registration.
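One way to picture an immutable registration is a content hash over the frozen record. The sketch below assumes SHA-256 over a canonical JSON serialization; the source mentions a registration hash but does not specify the algorithm or the record layout, so both are assumptions, as are the function names.

```python
import hashlib
import json


def registration_hash(record: dict) -> str:
    """Hash a forecast record deterministically (assumed scheme: SHA-256 of canonical JSON)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unmodified(record: dict, registered_hash: str) -> bool:
    """True if the record still matches the hash stored at registration."""
    return registration_hash(record) == registered_hash


# Two registrations on the same proposition with different evidence cutoffs
# are separate records with separate hashes; neither inherits from the other.
first = {"proposition_id": "example-001", "evidence_cutoff": "2025-01-15", "probability": 0.62}
second = {"proposition_id": "example-001", "evidence_cutoff": "2025-04-15", "probability": 0.71}
```

Any edit to a probability, source list, or cutoff date changes the hash, which is what makes a silent correction detectable.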

Authority architecture

Three lanes, three distinct authorities

Evidence, model, and resolution are separate at every stage. The NextConsensus system architecture is designed to handle the first two. An independent resolver handles the third. No single authority conflates evidence production, probability estimation, and outcome adjudication.

Observed evidence

Directly captured from the public record at the evidence cutoff: trial results, label updates, guideline changes, regulatory actions.

NC model

Probability estimate: transition likelihood under current evidence, calibration cohort, and baseline comparison. Distinguishes evidence-side variables from institutional actions.

Independent resolution

Adjudicated outcome: did the transition occur, not occur, or is it ambiguous? The resolver has no stake in the forecast and applies the stated resolution rule.

Reference forecasts

Calibrated baselines

Every registered forecast is evaluated against four reference baselines. Forecast quality is reported as outperformance over these baselines, so that skill is measured against simple alternatives rather than asserted.

Base-rate

How often has this type of transition occurred historically for similar authorities and similar evidence states? The base-rate is the simplest reference forecast.

Heuristic

Rule-of-thumb estimates derived from observable features: time since last change, source count, evidence trajectory direction. Transparent, auditable, reproducible.

Expert

Structured elicitation from domain experts, calibrated against the same scoring rules. Expert baselines are scored alongside model baselines — no privileged status.

LLM

Language-model estimates under controlled prompting. Scored against the same proper scoring rule as every other baseline. Not privileged, not excluded.

Outcome adjudication

Who resolves the outcome

An independent resolver with no stake in the forecast applies the pre-stated resolution rule to the published source documents. The resolver does not estimate probabilities. The resolver does not produce evidence. The resolver adjudicates.

Clear occurrence

The authority enacted the defined transition within the horizon. The outcome is scored as 1.0. The source document that confirms the transition is cited.

Clear non-occurrence

The horizon closed and the transition did not occur. The outcome is scored as 0.0. The absence is confirmed against the same source set used at registration.

Ambiguous

The transition partially occurred, the authority made a related but not identical move, or the evidence is genuinely unclear. The resolver applies the pre-stated resolution rule. If the rule does not resolve it, the forecast is excluded from scoring rather than forced.
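The three adjudication results map onto scoring values as follows. This is a minimal sketch; the enum and function names are hypothetical, and the optional second argument stands in for whatever the pre-stated resolution rule decides in an ambiguous case.

```python
from enum import Enum
from typing import Optional


class Resolution(Enum):
    OCCURRED = "occurred"
    NOT_OCCURRED = "not_occurred"
    AMBIGUOUS = "ambiguous"


def outcome_for_scoring(first_pass: Resolution,
                        rule_verdict: Optional[Resolution] = None) -> Optional[float]:
    """Map an adjudicated result to 1.0, 0.0, or None (excluded from scoring).

    rule_verdict is the result of applying the pre-stated resolution rule to an
    ambiguous case; None means the rule did not resolve it.
    """
    verdict = first_pass
    if verdict is Resolution.AMBIGUOUS and rule_verdict is not None:
        verdict = rule_verdict
    if verdict is Resolution.OCCURRED:
        return 1.0
    if verdict is Resolution.NOT_OCCURRED:
        return 0.0
    return None  # still ambiguous: excluded rather than forced
```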

Evaluation

Accountable scoring standards

We verify forecast accuracy by scoring every registration against resolved public outcomes using proper scoring rules. To maintain clarity, retrospective historical backtests are evaluated and reported separately from prospective live forecasts.

Proper scoring rule

Brier score and log loss. Proper scoring rules ensure that the forecaster's best strategy is to report their true belief — no incentive to shade or inflate.
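For binary outcomes, both rules are short enough to write out. The sketch below uses the standard definitions; the function names are illustrative, and the clipping constant in the log loss is an assumption added to avoid taking the logarithm of zero.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and outcome (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood of the outcomes under the forecast probabilities."""
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total -= o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return total / len(probs)


def expected_brier(reported, belief):
    """Expected Brier score of reporting `reported` when the true probability is `belief`."""
    return belief * (reported - 1.0) ** 2 + (1.0 - belief) * reported ** 2
```

The last function illustrates why the rule is proper: for a forecaster whose true belief is 0.7, expected_brier(0.7, 0.7) is lower than expected_brier(0.9, 0.7), so inflating the report makes the expected score worse.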

Calibration

Calibration curves per cohort, per horizon, and per baseline type. Overconfident forecasts and underconfident forecasts are both failures. The calibration record shows which.
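A calibration curve is a table of forecast bins, each comparing the mean stated probability with the observed frequency. The following is a minimal sketch assuming equal-width bins; the bin count and the function name are illustrative choices, not the method's published settings.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Group forecasts into equal-width probability bins and compare
    the mean forecast in each bin with the observed outcome rate."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, o))
    table = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # empty bins are omitted rather than reported as zero
        n = len(members)
        table.append({
            "bin": idx,
            "n": n,
            "mean_forecast": sum(p for p, _ in members) / n,
            "observed_rate": sum(o for _, o in members) / n,
        })
    return table
```

A bin whose mean forecast exceeds its observed rate indicates overconfidence; the reverse indicates underconfidence. In practice the table would be computed separately per cohort, horizon, and baseline type.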

Baseline comparison

Every forecast is scored relative to each baseline. A forecast that does not outperform the base-rate has not demonstrated value regardless of its absolute accuracy.
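One common way to express performance relative to a baseline is a Brier skill score, where positive values mean the forecast beat the baseline and zero or below means it did not. The source does not say which relative measure is used, so treat this as an assumed illustration with hypothetical function names.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between forecast probability and outcome (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(model_probs, baseline_probs, outcomes):
    """1 - BS_model / BS_baseline, scored on the same resolved forecasts.

    > 0: model outperformed the baseline; <= 0: no demonstrated value over it.
    """
    baseline = brier_score(baseline_probs, outcomes)
    if baseline == 0.0:
        raise ValueError("baseline scored perfectly; skill score is undefined")
    return 1.0 - brier_score(model_probs, outcomes) / baseline
```

For example, a model issuing 0.9 and 0.1 on outcomes 1 and 0 against a constant 0.5 base-rate forecast has a skill score of about 0.96, while a model that merely repeats the base rate scores 0.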

Lead time over rung

Days between NC crossing the actionable threshold and the target transition (or comparator rung event). Reported with precision, recall, calibration, and false-alert burden at that threshold.

Core metric

Lead time over rung

The core commercial metric is not lead time alone — it is useful lead time at a specified precision, calibration level, and alert burden.

Lead time formula

Lead time = date of target transition − date the forecast first crossed the frozen actionable threshold.

Always reported with

Precision at threshold, recall, calibration, false-alert burden. Not multiplied — always reported together.

Comparator timestamps

Each comparator requires an explicit observation rule: KOL convergence threshold, RWE publication date, procedural signal date, final action date.

Multiple lead-time measures

Lead over KOL convergence, lead over first qualifying RWE, lead over procedural authority signal, lead over final authority action. Different products, not pooled.
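The lead-time formula and its companion metrics can be sketched as below. Counting false-alert burden as the number of alerts that did not resolve as occurrences is an assumption, as are the function names; the source does not define the burden measure precisely.

```python
from datetime import date


def lead_time_days(threshold_crossed_on: date, event_on: date) -> int:
    """Days from first crossing the frozen actionable threshold to the event.

    `event_on` is the target transition date, or a comparator rung timestamp
    when measuring lead over that comparator.
    """
    return (event_on - threshold_crossed_on).days


def alert_metrics(probs, outcomes, threshold):
    """Precision, recall, and false-alert count at a fixed probability threshold."""
    alerts = [p >= threshold for p in probs]
    tp = sum(1 for a, o in zip(alerts, outcomes) if a and o == 1)
    fp = sum(1 for a, o in zip(alerts, outcomes) if a and o == 0)
    fn = sum(1 for a, o in zip(alerts, outcomes) if not a and o == 1)
    return {
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "false_alerts": fp,
    }
```

The two functions are kept separate on purpose: lead time is reported next to precision, recall, and alert burden, not folded into a single composite number.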

How event classes are kept from being improperly pooled

A label change is not a guideline revision. A coverage decision is not a formulary move. Our architecture classifies transitions by authority type, action type, and evidence context. Forecasts are scored within their event class. Cross-class pooling is not permitted.

Classification

Each proposition is tagged with the authority type (regulatory, guideline, payer), the action type (label change, guideline revision, coverage decision), and the evidence context (trial-based, observational, consensus-based).

Separation

Calibration curves and scoring records are computed within each event class. A forecast's demonstrated accuracy in one class does not transfer to another.

Exclusions

When an event class has too few resolved forecasts for meaningful calibration, the class is flagged as underpopulated. Forecasts in that class are still scored, but the calibration record notes the limited sample.
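Scoring within event classes amounts to grouping resolved forecasts by their three tags before computing any statistic. The sketch below is illustrative: the record keys and the minimum sample size of 30 are hypothetical, since the source does not state what counts as too few.

```python
from collections import defaultdict

MIN_RESOLVED = 30  # hypothetical cutoff for flagging a class as underpopulated


def score_by_event_class(records, min_resolved=MIN_RESOLVED):
    """Compute a Brier score per (authority type, action type, evidence context).

    Classes are never pooled; small classes are still scored but flagged.
    """
    groups = defaultdict(list)
    for r in records:
        key = (r["authority_type"], r["action_type"], r["evidence_context"])
        groups[key].append(r)
    report = {}
    for key, members in groups.items():
        n = len(members)
        brier = sum((m["prob"] - m["outcome"]) ** 2 for m in members) / n
        report[key] = {"n": n, "brier": brier, "underpopulated": n < min_resolved}
    return report
```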

Public Research and Private Enterprise Application

Public research validates the method: it shows whether the method works. Enterprise application tests utility: it shows whether the method is useful for a specific decision context.

Public research

Method validation

Historical transitions, scoring records, calibration curves, baseline comparisons. Published at nextconsensus.com/research/.

Scoring records

Prospective and retrospective forecast scores, Brier scores, log loss, calibration by cohort and horizon. Published at nextconsensus.com/benchmarks/.

Evidence surfaces

Historical reconstructions test temporal integrity. The example specification tests proposition structure. Neither is presented as prospective performance.

What this tests

Whether the method can define resolvable propositions, freeze evidence state, issue calibrated probabilities, and score outcomes — using historical data and public sources.

Enterprise application

Forecast program

Propositions tailored to the enterprise's evidence context, scored against the same baselines and scoring rules as public research.

Evidence state

Sources, qualifiers, populations, endpoints, and authority claims the forecast depends on — frozen at the evidence cutoff.

Scoring record

What was forecast, what was resolved, how the scoring compared to baselines, and what the calibration record shows.

What this tests

Whether the method produces useful forecasts for real decision contexts — whether the probability is well-calibrated, whether the baselines are competitive, whether the source trail is verifiable.

Designed Abstention Conditions

A forecast that cannot be grounded in a verifiable evidence state is not a forecast. The validation protocol is designed to abstain when the evidence record is insufficient to support a confident probability estimate.

Thin record

When the public source record for a proposition is sparse (few relevant studies, no recent guideline updates, limited citation history), the system abstains rather than forcing a forecast. The gap is stated explicitly.

Ambiguous edits

When a source was edited but the clinical meaning is unclear — a wording change that might or might not affect the transition — the forecast flags the ambiguity. Verification is needed before a probability can be assigned.

Missing context

When the forecast depends on information not in the public record — a jurisdiction-specific guideline, an internal policy, a private source — the dependency is stated. The forecast is incomplete until the gap is filled.

Competing signals

When the evidence record shows contradictory signals — one source supports the transition, another challenges it — both are presented with their authorities and timestamps. The record does not force a resolution.

Method boundary

When the proposition falls outside the system's defined source universe or resolution capability — a domain the protocol does not monitor, or an action type it is not designed to adjudicate — the boundary is stated rather than claimed.

How exclusions are reported

Every exclusion is recorded with the reason, the condition that triggered it, and the date. Exclusions are published alongside scoring records so that the full forecast history — including what was not forecast — is inspectable.

Reason

Why the system abstained: thin record, ambiguous edits, missing context, competing signals, or method boundary.

Condition

The specific evidence gap, ambiguity, or boundary that triggered the abstention. Stated precisely, not gestured at.

Impact

What the exclusion means for the scoring record. Abstained forecasts are not scored. They are recorded as abstained with the reason. The abstention rate is part of the method's public performance record.
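An exclusion record and the resulting abstention rate might look like the sketch below. The field names are hypothetical, and defining the rate as abstentions divided by all propositions considered (registered plus abstained) is an assumption; the source does not give the denominator.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class AbstentionReason(Enum):
    THIN_RECORD = "thin record"
    AMBIGUOUS_EDITS = "ambiguous edits"
    MISSING_CONTEXT = "missing context"
    COMPETING_SIGNALS = "competing signals"
    METHOD_BOUNDARY = "method boundary"


@dataclass(frozen=True)
class Abstention:
    proposition_id: str
    reason: AbstentionReason
    condition: str      # the specific gap, ambiguity, or boundary
    recorded_on: date


def abstention_rate(n_registered: int, abstentions: list) -> float:
    """Share of considered propositions that were abstained rather than forecast."""
    considered = n_registered + len(abstentions)
    return len(abstentions) / considered if considered else 0.0
```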

Error Attribution Framework

Every forecast is a probability estimate based on available sources at a point in time. Some of those estimates will be wrong. Here's what that looks like.

Forecast is wrong

A transition was forecast with high probability and did not occur, or vice versa. The scoring record captures this. The calibration record shows whether it is a pattern or an outlier.

Misses a transition

A transition occurred but was not forecast, or was assigned low probability. The attribution record identifies the gap — was it a source outside the monitored universe, a model error, or a genuinely surprising event?

Assessment is ambiguous

Sometimes the evidence record is genuinely unclear. A guideline updates but the recommendation language stays the same. A trial result is mixed. The resolver applies the pre-stated rule; if it does not resolve, the forecast is excluded.

Sources are incomplete

The system is designed to forecast from the public source set. If relevant evidence exists only in private materials, internal databases, or subscription-only sources, the forecast is limited to what can be inspected and cited.

Both false positives and false negatives matter: an unnecessary alert consumes attention, while a missed transition can leave a relevant event outside the evaluation record. A pilot should measure both error types within the agreed event class. The validation framework is designed to make those tradeoffs visible; it does not eliminate them.

Non-events as stalled transitions

A binary outcome (occurred/did not occur) throws away most of the useful causal information. The richer representation identifies where the transition stalled:

Evidentiary failure (rungs 1–2)

The signal did not replicate or weakened materially.

Recognition failure (rung 3)

Evidence persisted, but expert interpretation did not converge.

Translation failure (rung 4→5)

Experts converged, but the relevant authority did not initiate or advance action.

Procedural delay (rungs 5–7)

The transition appeared directionally supported but missed the forecast horizon because of process timing.

Institutional resistance (rungs 5–7)

The authority declined to move despite strong upstream evidence or recognition.

Competing transition

A different action occurred than the one forecast — e.g., narrowing instead of expansion.

Ambiguous resolution

The authority changed language, but not enough to meet the pre-registered threshold.

Late occurrence

The predicted event occurred after the specified deadline. Still a forecast miss for the original proposition.

Every forecast record includes: target outcome, highest attained rung, subsequent status. This yields a training corpus that records where each transition stalled, not only whether it occurred.
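The stall categories above can be carried as a structured label on each non-event. This is a minimal sketch: the names are hypothetical, and storing the stall category in the subsequent-status field is an assumption about how the record is laid out.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Stall(Enum):
    EVIDENTIARY_FAILURE = "evidentiary failure"            # rungs 1-2
    RECOGNITION_FAILURE = "recognition failure"            # rung 3
    TRANSLATION_FAILURE = "translation failure"            # rung 4 to 5
    PROCEDURAL_DELAY = "procedural delay"                  # rungs 5-7
    INSTITUTIONAL_RESISTANCE = "institutional resistance"  # rungs 5-7
    COMPETING_TRANSITION = "competing transition"
    AMBIGUOUS_RESOLUTION = "ambiguous resolution"
    LATE_OCCURRENCE = "late occurrence"


@dataclass(frozen=True)
class NonEventRecord:
    target_outcome: str
    highest_attained_rung: int
    subsequent_status: Stall


def stall_profile(records):
    """Count non-events by where the transition stalled."""
    return Counter(r.subsequent_status for r in records)
```

A profile dominated by one category, for example procedural delay, says something a column of zeros cannot: the forecasts were directionally supported but mistimed.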

Transition-hazard modeling

The system estimates rung-to-rung transition hazards conditioned on current evidence state, time already spent at each rung, authority-specific behavior, event class, disease area, precedent, and procedural status.

Transition Forecast

Probability of reaching target rung by deadline, composed from the chain of rung-to-rung hazards.

Bottleneck Diagnosis

Which rung-to-rung transition is currently limiting downstream movement (e.g., insufficient RWE confirmation, low expert convergence, procedural not-ready).

Interpretability

The chain reveals where a transition is likely to stall. More useful than a single opaque probability.

Conditioning variables

Evidence state, dwell time per rung, authority behavior, event class, disease area, precedent, procedural status.
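To make the idea of composing a forecast from a chain of hazards concrete, the sketch below uses a deliberately simplified discrete-time model. It assumes a constant per-period hazard for each rung, at most one rung advanced per period, and no skipping or regression. The system described above conditions hazards on dwell time and other variables, so this is an illustration of the composition step only, with hypothetical function names.

```python
def prob_reach_target(start_rung, target_rung, hazards, periods):
    """Probability of reaching target_rung within `periods` time steps.

    hazards[k] is the assumed per-period probability of moving from rung k to k + 1.
    """
    dist = {r: 0.0 for r in range(start_rung, target_rung + 1)}
    dist[start_rung] = 1.0
    for _ in range(periods):
        nxt = dict.fromkeys(dist, 0.0)
        for rung, mass in dist.items():
            if rung == target_rung:
                nxt[rung] += mass            # target is absorbing
                continue
            h = hazards[rung]
            nxt[rung + 1] += mass * h        # advance one rung
            nxt[rung] += mass * (1.0 - h)    # dwell another period
        dist = nxt
    return dist[target_rung]


def bottleneck(hazards):
    """Rung whose outgoing hazard is lowest, i.e. the limiting transition."""
    return min(hazards, key=hazards.get)
```

With hazards of 0.5 out of rungs 3 and 4, the chance of reaching rung 5 within two periods is 0.25; lowering either hazard shows directly which transition limits the forecast.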

Method basis

Why the method focuses on date, source, and limits.

These works frame the forecasting burden: evidence records age, guidelines change, and transparent scoring matters more than unsupported certainty.

How quickly do systematic reviews go out of date? A survival analysis
Shojania et al. · Annals of Internal Medicine · 2007
Shows that review currency varies by topic and can decay before normal review schedules catch it.

Validity of the Agency for Healthcare Research and Quality clinical practice guidelines: how quickly do guidelines become outdated?
Shekelle et al. · JAMA · 2001
Shows that guideline validity changes over time and should be reassessed against new evidence.

GRADE: an emerging consensus on rating quality of evidence and strength of recommendations
Guyatt et al. · BMJ · 2008
Separates evidence quality from the strength of recommendations and makes uncertainty explicit.

The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
Page et al. · BMJ · 2021
Supports transparent reporting of search, selection, appraisal, synthesis, and update methods.

Review the forecast lifecycle.

How propositions are defined, evidence is frozen, probabilities are registered, baselines are computed, and outcomes are resolved and scored.
