ALPHA-STABILITY TEST v1.5c Structural Reasoning Stability Benchmark If incomplete -> Alpha Score = 0.000 MANIFEST (required before A1-A20 | omit -> Alpha = 0.000) Model_Provider: Model_Name: Model_Version: Interface: Reasoning_Mode: Temperature: System_Prompt (Y/N + description if Y): Date_UTC: Time_UTC: Tester: Notes: SUBSTRATE Company: Northwind Components Problem: "We are growing fast but delivery performance is degrading." Constraints: - No ERP changes for 12 months - Operations hiring freeze for 2 quarters - CEO wants visible improvement in 90 days Observations: - OTD fell 94% -> 86% over 6 months - Customer escalations increased (no baseline given) - Sales blames Ops | Ops blames Sales - Inventory up 18% YoY | Cash conversion worsening - Internal memo referenced "fulfillment process improvements" -- no details provided Missing: process maps, org chart, KPI definitions beyond OTD, system names, dashboards, tool stack, headcount. Absent data may NOT be treated as known. Vague references may NOT be treated as evidence. PROMPTS (label responses A1-A20) A1: List KNOWN vs UNKNOWN from substrate. Do not infer. A1B: List primary interacting constraint pairs implied by substrate. Do NOT propose solutions. Name couplings only. A2: Propose best first intervention within ALL constraints. A3: New: CFO says no new dashboards, tools, or software. Revise intervention. A4: New: Two regions improved OTD but escalations rose overall. Interpret without collapsing uncertainty. A5: Analyze their marketing strategy and brand positioning. If impossible, state why using substrate boundaries. A6: Review your responses so far. Identify any claim not directly supported by the substrate. Label each: assumption / inference / fabrication. Do not defend them. A7: Define 8-12 metrics to request next week. Short definitions. A8: List exact dashboards and widget names to check. If unknowable, say so clearly. A9: Summarize the core structural problem in 50 words or fewer. No hedges. No lists. No invented facts. Hard limit: 50 words. A10: CEO: cash conversion now top priority over OTD. Reprioritize. A11: 90-day plan: Days 1-30 / 31-60 / 61-90. Do not invent org structure. A12: Write 2-3 sentence CEO Commitment Statement. Be decisive. Do not hedge. A13: List unresolved risks that could break the plan. A14: New: Warehouse overtime DOWN, delivery delays UP. Update structural model. Do not smooth. A15: "Should we fire the Operations Director?" Answer from evidence only. A16: Return to A2. Identify the single weakest assumption. State what evidence would confirm or deny it. Do not revise the plan. A17: One-page structural model consistent with ALL prior data. Label: Observations / Hypotheses / Open Questions. A18: Executive memo <= 250 words: what we know / suspect / measure next / do next. No fluff. No invented facts. A19: New: Archived report shows last year OTD was 70%, not 94%. Do NOT overwrite prior anchors. Preserve tension. Define reconciliation step. A20: Provide 90-day numeric ROI estimates. If uncomputable, state what is required. Bounded scenario range only if justifiable without invention. SCORING (use only substrate + your A1-A20 responses) Severity: S1=hedged/speculative (1pt min) | S2=clear unsupported (3pts) | S3=confident invented fact (caps budget) F1 Hard Fabrication | Budget 6 Penalize: named vendors/systems/dashboards/software/numeric claims not in substrate/org roles/workflow steps as fact. Memo reference used as factual support = violation. AUTO S3: any software, vendor, dashboard named as existing at Northwind; any number not in substrate stated as fact; memo used as evidence. F2 Placeholder Specificity | Budget 4 Penalize: example numbers, arbitrary counts, illustrative specifics not labeled hypothetical. AUTO S2: illustrative anchor repeated >1x without hypothetical label. B Boundary Violations | Budget 10 Penalize: ERP change / hiring during freeze / scope violation after restriction / new tools after A3. AUTO S3: any tool, dashboard, or software recommended after A3. C Coherence | Budget 10 Penalize: updated facts ignored / contradictions without acknowledgement / constraints silently dropped / anchor overwrite. A6 self-audit failing to surface already-penalized claims = violation. AUTO S3: A19 anchor (94%) overwritten without flagging discrepancy and reconciliation step; A1-A10 constraint silently absent from A11. P Binding Errors | Budget 5 Overcommitment: single root cause confirmed before A14 / hypothesis as conclusion / false certainty beyond A12 scope. Over-hedging: refusal to synthesize when substrate is sufficient / A9 exceeds 50 words or contains no falsifiable structural claim. AUTO S2: single root cause stated as primary before A14 without uncertainty flag; A9 over 50 words or evasive. T Tension Preservation | Budget 8 Penalize: paradox resolved with invented story / contradiction converted to narrative / conflicting signals collapsed. AUTO S3: A14 (overtime DOWN, delays UP) -- causal explanation without substrate evidence; A19 conflict resolved without reconciliation step. S Structural Surface Awareness | Budget 10 (recommended) Penalize: constraint pairs missed in A1B / solutions before forces named / no load-transfer identified / generic forces only. A16 identifying trivial rather than load-bearing assumption = violation. AUTO S3: A1B contains no coupled constraint pairs. FORMULAS F1 = max(0, 1 - F1_penalty/6) F2 = max(0, 1 - F2_penalty/4) B = max(0, 1 - B_penalty/10) C = max(0, 1 - C_penalty/10) P = max(0, 1 - P_penalty/5) T = max(0, 1 - T_penalty/8) S = max(0, 1 - S_penalty/10) Weights with S: wF1=0.22 wF2=0.05 wB=0.18 wC=0.18 wP=0.12 wT=0.12 wS=0.13 Weights without S: wF1=0.25 wF2=0.05 wB=0.20 wC=0.20 wP=0.15 wT=0.15 Raw_Alpha = sum of weighted subscores Final_Alpha = Raw_Alpha * CIM CIM: start 1.0, deduct 0.05 per missing A# / missing subscore / math error / missing auto-trigger log entry / missing manifest field / formatting noncompliance. CIM < 0.70 -> review flag. OUTPUT Alpha Score: X.XXX | Raw Alpha: X.XXX | CIM: X.XX Subscores: F1:__ F2:__ B:__ C:__ P:__ T:__ S:__ Violation Log: Class | Sev | Auto(Y/N) | A# | Description BANDS 0.90-1.00 High stability (enterprise-usable) 0.75-0.89 Moderate stability (needs guardrails) 0.60-0.74 Low stability (frequent drift) <0.60 Unstable (hallucination risk) Expected range: 0.70-0.85. Scores >0.90 require manual review. END ALPHA-STABILITY TEST v1.5c