# Gates 1, 2, 5 — Brier Walkforward Validation Log
**Generated**: 2026-05-22
**Pre-registration scope**: Rate Authority leading-indicator validation scope (available on request at press@policychat.com)
**Prior gates passed**: 4, 6, 7a+7b (3 of 8 gates; gates 3 and 8 still pending)
**This run**: Gates 1 (Brier walkforward) + 2 (BSS) + 5 (conviction subset)

**Overall verdict: 2/3 GATES PASS**

---

## Gate 1: Pre-registered Brier Walkforward

**Verdict: FAIL**

### Setup
- **Train window**: 2002-01 to 2018-12 (n=192)
- **Test window**: 2019-01 to 2024-12 (n=72)
- **Predictor**: residualized YoY CPI Motor Vehicle Parts (CUSR0000SETC) at lag-12
- **Outcome**: binary — residualized YoY CPI Tenants/Household Insurance (CUSR0000SEHG) > training-window median
- **Residualization stack**: month-of-year FE (11 dummies) + CPI Shelter (CUSR0000SAH1) + CPI Medical (CPIMEDSL) + UNRATE@lag-24 + FEDFUNDS level
- **Residualization approach**: OLS fitted on train only; training coefficients applied to test (no leakage)
- **Model**: Logistic regression with 5-fold CV (L2, lbfgs) to select regularization C on train window
- **Best C (train CV)**: 0.046416
- **Binary threshold**: training median of residualized Y = -0.000293
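
The train-only residualization and median threshold described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it uses a single confounder instead of the full stack, and the data and the helper names (`fit_ols_1d`, `residualize`) are hypothetical.

```python
from statistics import mean, median

def fit_ols_1d(x, y):
    """Fit y = a + b*x by least squares and return (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def residualize(x, y, coef):
    """Subtract the fitted confounder effect using fixed coefficients."""
    a, b = coef
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical toy data (not from this log): one confounder x, one raw series y.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
test_x, test_y = [5.0, 6.0], [10.5, 11.0]

coef = fit_ols_1d(train_x, train_y)               # fitted on the train window only
train_resid = residualize(train_x, train_y, coef)
test_resid = residualize(test_x, test_y, coef)    # frozen train coefficients

threshold = median(train_resid)                   # binary cut comes from train only
test_labels = [int(r > threshold) for r in test_resid]
```

The point of the design is that nothing estimated on the test window (coefficients or threshold) feeds back into the labels or predictors.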

### Results

| Metric | Value | Threshold | Pass? |
|---|---|---|---|
| Brier score (test) | **0.1281** | ≤ 0.10 | NO |
| Kill threshold     | 0.1281    | > 0.20 | OK |
| Test n             | 72      | —      | —     |
| Mean predicted P   | 0.6890 | — | — |
| Actual rate (test) | 0.5972 | — | — |
| Calibration gap    | 0.0918 | — | — |
| Train Spearman ρ (resid) | 0.3413 (p < 0.0001) | — | — |
| Test Spearman ρ (resid)  | 0.8551 (p < 0.0001) | — | — |

### Interpretation (FAIL)

Gate 1 FAILS. Brier=0.1281 exceeds the 0.10 pass threshold but is below the 0.20 kill threshold. Directional signal may exist but probability calibration is insufficient for the pre-registered graduation standard.
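
As an illustration of how the three-way outcome is decided, here is a minimal sketch of the Brier score and the Gate 1 classification. The function names are hypothetical; the thresholds are the ones stated in this log.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def gate1_verdict(brier, pass_max=0.10, kill_above=0.20):
    """PASS at or below pass_max, KILL above kill_above, FAIL in between."""
    if brier > kill_above:
        return "KILL"
    return "PASS" if brier <= pass_max else "FAIL"

# Toy forecasts (hypothetical) to show the score itself.
toy_brier = brier_score([0.9, 0.2, 0.6], [1, 0, 0])

# The reported test Brier lands in the band between the two thresholds.
reported_verdict = gate1_verdict(0.1281)
```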

Per `feedback_let_the_data_decide_the_headlines`: this result publishes as a null finding with
the same specificity as a passing result. The indicator remains at `directional_only` and cannot
graduate to `validated` on this evidence alone.

Test Spearman ρ (continuous) = 0.8551 (p < 0.0001), reported for completeness.

---

## Gate 2: Brier Skill Score vs Climatology

**Verdict: PASS**

### Setup
- **Climatological forecast**: training-window base rate P(Y_resid > training median) = 0.500000
  - By construction (median split), this is 0.500 — reflecting the exact binary balance of the training set
  - Applied as a constant forecast to all test-window observations
- **BSS formula**: 1 - (Brier_model / Brier_climatology)
  - BSS = 1 means perfect; BSS = 0 means no improvement over climatology; BSS < 0 means worse than climatology
- **Pass threshold**: BSS ≥ 0.10
- **Kill threshold**: BSS < 0 (model is worse than naive base rate)
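
The BSS arithmetic can be reproduced from the footer values in this log. The sketch below uses hypothetical function names and also checks that a constant 0.5 forecast scores 0.25 against any binary outcome sequence.

```python
def brier_skill_score(brier_model, brier_reference):
    """BSS = 1 - Brier_model / Brier_reference."""
    return 1.0 - brier_model / brier_reference

def constant_forecast_brier(p, outcomes):
    """Brier score of issuing the same probability p for every observation."""
    return sum((p - y) ** 2 for y in outcomes) / len(outcomes)

# (0.5 - y)^2 is 0.25 for y = 0 and for y = 1, so the test-window mix does not matter.
climo_brier = constant_forecast_brier(0.5, [1, 0, 1, 1, 1])

# Reported model Brier (0.128137) against the climatological 0.25.
bss = brier_skill_score(0.128137, climo_brier)
```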

### Results

| Metric | Value | Threshold | Pass? |
|---|---|---|---|
| Training base rate | 0.5000 | — | — |
| Brier (model)      | 0.1281 | — | — |
| Brier (climo)      | 0.2500 | — | — |
| BSS                | **0.4875** | ≥ 0.10 | YES |
| Kill (BSS < 0)     | 0.4875 | < 0.0  | OK |

### Interpretation (PASS)

BSS = 0.4875, clearing the ≥ 0.10 threshold. The model reduces Brier loss by 48.7% relative
to the naive climatological forecast (constant training base rate = 0.5000).

Note: because the outcome is binary, a constant 0.5 forecast scores a Brier of exactly 0.25 on any
sample, since (0.5 - y)^2 = 0.25 for both y = 0 and y = 1. Any positive BSS therefore means the
model's probability estimates are more accurate than guessing 50/50. The 0.4875 BSS clears the
0.10 graduation threshold established in the pre-registration by a wide margin.

The model beats climatology on the 2019-2024 OOS window. This includes the COVID distortion period
(2020-2022) which represents a significant structural challenge for any CPI-based leading indicator.
The model's skill surviving this period is meaningful evidence for robustness.

---

## Gate 5: Conviction-Filtered Subset

**Verdict: PASS**

### Setup
- **Conviction criterion**: |predicted P - 0.5| > 0.20 (i.e., P > 0.70 or P < 0.30)
- **Intuition**: when the model is confident, it should be more accurate
- **Pass threshold**: conviction subset Brier ≤ 0.07
- **Kill threshold**: subset Brier > full Brier (conviction filter actively hurts — overconfidence)
- **Note**: overconfidence kill is separate from the pass/fail threshold
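
A minimal sketch of the conviction filter and the overconfidence check follows. The forecasts are hypothetical toy values, not the 72 test-window predictions, and the helper names are illustrative.

```python
def brier(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def conviction_subset(probs, outcomes, margin=0.20):
    """Keep only observations where |P - 0.5| exceeds the conviction margin."""
    return [(p, y) for p, y in zip(probs, outcomes) if abs(p - 0.5) > margin]

# Hypothetical toy forecasts and outcomes.
probs = [0.95, 0.55, 0.10, 0.72, 0.40]
outcomes = [1, 0, 0, 1, 1]

subset = conviction_subset(probs, outcomes)
subset_brier = brier([p for p, _ in subset], [y for _, y in subset])
full_brier = brier(probs, outcomes)

# Direction accuracy: does the side of 0.5 match the realized outcome?
direction_acc = sum(int((p > 0.5) == bool(y)) for p, y in subset) / len(subset)

# Overconfidence kill: the filter must not make the score worse.
overconfident = subset_brier > full_brier
```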

### Results

| Metric | Value | Threshold | Pass? |
|---|---|---|---|
| Conviction obs      | 30/72 (41.7%) | — | — |
| Full Brier (all)    | 0.1281 | — | — |
| Conviction subset Brier | **0.0128** | ≤ 0.07 | YES |
| Kill (subset > full)    | 0.0128 vs 0.1281 | subset > full | OK |
| Direction accuracy      | 1.000 | — | — |

### Interpretation (PASS)

Conviction subset Brier = 0.0128, clearing the ≤ 0.07 threshold.

30 of 72 test-window observations (41.7%) received conviction-level
predictions (|P - 0.5| > 0.20). On these high-confidence observations, the model achieves
Brier = 0.0128 — lower than the full-window Brier of 0.1281, confirming
that the conviction filter improves rather than degrades accuracy.

Direction accuracy on conviction subset = 1.000 (100.0% of high-confidence
predictions point the correct direction).

This satisfies the conviction-filter design requirement: when the model is confident, it is more
accurate. The subset Brier < full Brier pattern rules out the overconfidence kill mode.

**Caveat**: n=30 conviction observations is a small subset of an already-small n=72 test
window. Point estimates are directionally meaningful; treat as indicative rather than precisely calibrated.

---

## Kill Mode Assessment

**No kill modes triggered.**

- Brier = 0.1281 ≤ 0.20 (no-skill kill) — OK
- BSS = 0.4875 ≥ 0 (worse-than-climo kill) — OK
- Subset Brier = 0.0128 ≤ full Brier = 0.1281 (overconfidence kill) — OK
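
The three kill checks above amount to a small rule set. A sketch, with hypothetical label strings and the thresholds taken from this log:

```python
def kill_modes(brier, bss, subset_brier, no_skill_above=0.20):
    """Return the list of triggered kill modes (empty if none)."""
    triggered = []
    if brier > no_skill_above:
        triggered.append("no_skill")
    if bss < 0:
        triggered.append("worse_than_climatology")
    if subset_brier > brier:
        triggered.append("overconfidence")
    return triggered

# Values reported in this run.
reported = kill_modes(brier=0.1281, bss=0.4875, subset_brier=0.0128)
```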

---

## Updated Eight-Gate Status Table

| Gate | Description | Status | Notes |
|---|---|---|---|
| 1 | Pre-registered Brier ≤ 0.10 walkforward | **FAIL** | Brier=0.1281, n_test=72, train 2002-2018 |
| 2 | Brier Skill Score ≥ 0.10 vs climatology | **PASS** | BSS=0.4875, climo_rate=0.5000 |
| 3 | Marginal p < 0.05 OLS | PENDING | Not yet formalized (see scope) |
| 4 | \|rho\| ≥ 0.30 full + pre-COVID | **PASS** | V2-C confirmed: full rho=0.47, pre-COVID rho=0.54 |
| 5 | Conviction-filtered subset Brier ≤ 0.07 | **PASS** | subset_brier=0.0128, n_conviction=30 |
| 6 | Full confounder residualization (RMSE skill ≥ 0) | **PASS** | resid rho=0.4852 (n=279), RMSE skill=0.1646 |
| 7 | 7a Pre-COVID stability + 7b alt-outcome replication | **PASS** | 7a pass V2-C; 7b CUSR0000SETD rho=0.5484 |
| 8 | SHA-locked predictions with resolution dates | PENDING | SHA-lock target: 2026-07-15 cycle-1 |

**Summary**: 5 PASS / 1 FAIL (no kill modes triggered) / 2 PENDING out of 8 gates
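
The summary counts can be checked mechanically against the status column. A small sketch, with the statuses transcribed from the table above:

```python
from collections import Counter

# Gate number -> status, as listed in the eight-gate table.
gate_status = {
    1: "FAIL", 2: "PASS", 3: "PENDING", 4: "PASS",
    5: "PASS", 6: "PASS", 7: "PASS", 8: "PENDING",
}

tally = Counter(gate_status.values())
completed = tally["PASS"] + tally["FAIL"]   # gates with a verdict so far
```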

---

## Graduation Recommendation

**Confidence tier**: `directional_only` — Gate 1 threshold miss, no kill modes triggered

**Rationale**: Gate 1 FAIL (Brier=0.1281 vs ≤0.10 threshold). No kill modes triggered — Brier is below the 0.20 no-skill kill threshold, BSS is strongly positive, and conviction subset Brier is below full Brier. The failure is a calibration precision shortfall, not a structural signal absence.

Gates 2 and 5 PASS with strong margins (BSS=0.4875, conviction Brier=0.0128). The continuous test-window Spearman ρ=0.8551 (p<0.0001) confirms the directional signal is present in the hold-out period. The binary probability calibration needs tightening before Gate 1 can pass.

**Path to Gate 1 pass**: The calibration gap (mean P=0.689 vs actual rate=0.597 in test window) suggests over-prediction of above-median probability, particularly in the 2019-2021 pre-inflation plateau where the model assigned ~0.47-0.53 to outcomes that were actually 0. Possible approaches: (1) isotonic regression post-hoc calibration on a validation sub-window; (2) threshold adjustment; (3) re-examine if the confounder stack is absorbing too little of the COVID-era mean shift. Any recalibration approach requires a new pre-registered specification before re-running.
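
For orientation only, approach (1) could look like the sketch below: a pool-adjacent-violators fit that maps raw scores to monotone calibrated probabilities. This is a hypothetical illustration on toy data, not a proposed specification; the helper names are assumptions, and the choice of validation sub-window would still have to be pre-registered.

```python
def pav_fit(scores, outcomes):
    """Isotonic (non-decreasing) fit via pool-adjacent-violators.

    Returns a list of (upper_score, calibrated_probability) steps.
    """
    blocks = []  # each block: [sum of outcomes, count, largest score in block]
    for s, y in sorted(zip(scores, outcomes)):
        if blocks and blocks[-1][2] == s:
            # Tied scores share one block so they get one calibrated value.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Merge backwards while block means fail to increase.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]

def pav_predict(steps, score):
    """Look up the calibrated probability for a raw score (step function)."""
    for top, value in steps:
        if score <= top:
            return value
    return steps[-1][1]

# Hypothetical validation sub-window: raw model scores and realized outcomes.
steps = pav_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
calibrated = pav_predict(steps, 0.25)
```

With so few observations available, a fit like this can overfit easily, which is one more reason to fix the calibration window in advance.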

**SHA-lock recommendation**: A `validated`-tier SHA-lock at 2026-07-15 should wait for the Gate 1 recalibration analysis. Gates 2, 4, 5, 6, and 7a+7b all PASS, so the indicator is not killed, but the full-window Brier calibration must be resolved before the confidence tier can move to `validated`. If anything is locked on 2026-07-15, the honest state to lock is the current directional finding: 5 of 6 completed gates PASS, 1 FAIL, Gates 3 and 8 pending.

---

## Caveats and Methodological Notes

**Small test window (n=72)**: The 2019-2024 test window contains only 72 monthly
observations. Brier score variance at this sample size means the 95% confidence interval around the
point estimate is approximately ±0.0772. All pass/fail calls are on the point estimate per the
pre-registration protocol; the CI is reported here for honest framing only.
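
The log does not state how the interval was computed. One common approach, shown here purely as an assumption, is a normal approximation on the per-observation squared errors. It treats observations as independent, which monthly CPI data generally are not, so the true interval is likely wider.

```python
from math import sqrt
from statistics import stdev

def brier_ci_halfwidth(probs, outcomes, z=1.96):
    """Approximate CI half-width: z * sd(squared errors) / sqrt(n)."""
    errs = [(p - y) ** 2 for p, y in zip(probs, outcomes)]
    return z * stdev(errs) / sqrt(len(errs))

# Hypothetical toy forecasts, not the test-window predictions.
halfwidth = brier_ci_halfwidth([0.9, 0.2, 0.6], [1, 0, 0])

# Identical squared errors give zero spread, hence a zero-width interval.
degenerate = brier_ci_halfwidth([0.5, 0.5], [0, 1])
```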

**COVID structural break**: The test window includes 2020-2022, a period of extreme supply-chain
distortion in both CPI Motor Vehicle Parts and insurance proxies. The 12-month lag hypothesis was
developed on pre-COVID data (confirmed PASS in Gate 4/V2-C). The model's performance during this
period is a realistic stress test, not a reason to exclude the period.

**Residualization leakage discipline**: OLS confounder models were fitted on the train window only
(2002-2018) and applied to the test window using frozen coefficients. This is strict walkforward
discipline. The test Brier score reflects true OOS performance.

**Conviction subset small-n**: The 30 conviction observations are a subset of the already-small
n=72 test window. The conviction-subset Brier should be treated as indicative.

**Gate 3 still pending**: Formal OLS p-value for the MV Parts lag-12 coefficient has not yet been
run as a pre-registered gate (V1 Spearman ρ > 0.48 implies significance at n > 200, but the formal
gate requires a dedicated run consistent with the pre-registration spec).

---

*Generated by gates_1_2_5 validation agent, 2026-05-22.*
*Gate 1: Brier=0.128137, verdict=FAIL*
*Gate 2: BSS=0.487453, verdict=PASS*
*Gate 5: subset_brier=0.012809, verdict=PASS*
*n_train=192, n_test=72, best_C=0.046416*