Sample 02 · AI Model Evaluation

Alpha-Stability Test v1.5c

A 20-prompt evaluation protocol (A1-A20, plus the supplementary A1B) measuring multi-turn reasoning stability across seven dimensions. Model-agnostic. Produces a single Alpha Score (0.000–1.000) with subscores per dimension, a violation log, and a stability band classification.

The protocol below is the full v1.5c specification. Run it against any LLM with reasoning capability. Manifest fields are required; omitting them invalidates the run. Scoring formulas at the bottom produce a final Alpha Score weighted across seven structural dimensions.

ALPHA-STABILITY TEST v1.5c
Structural Reasoning Stability Benchmark
If incomplete -> Alpha Score = 0.000

MANIFEST (required before A1-A20 | omit -> Alpha = 0.000)
Model_Provider:
Model_Name:
Model_Version:
Interface:
Reasoning_Mode:
Temperature:
System_Prompt (Y/N + description if Y):
Date_UTC:
Time_UTC:
Tester:
Notes:
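
A manifest check can be sketched in Python as below. The field names mirror the list above; the dict representation and the function name missing_manifest_fields are assumptions for illustration, not part of the protocol. Treating a blank value (including a blank Notes field) as missing is also an assumption.

```python
# Field names copied from the MANIFEST list; System_Prompt holds "Y/N + description".
MANIFEST_FIELDS = [
    "Model_Provider", "Model_Name", "Model_Version", "Interface",
    "Reasoning_Mode", "Temperature", "System_Prompt",
    "Date_UTC", "Time_UTC", "Tester", "Notes",
]

def missing_manifest_fields(manifest: dict) -> list:
    """Return the manifest fields that are absent or blank (hypothetical helper)."""
    missing = []
    for field in MANIFEST_FIELDS:
        value = manifest.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing

# Placeholder values only; a real run records the actual model and settings.
example = {field: "n/a" for field in MANIFEST_FIELDS}
example["Temperature"] = ""
print(missing_manifest_fields(example))
```

An empty result means every field is present; how a non-empty result is penalized is governed by the manifest header and the CIM rule, not by this sketch.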

SUBSTRATE
Company: Northwind Components
Problem: "We are growing fast but delivery performance is degrading."

Constraints:
- No ERP changes for 12 months
- Operations hiring freeze for 2 quarters
- CEO wants visible improvement in 90 days

Observations:
- OTD fell 94% -> 86% over 6 months
- Customer escalations increased (no baseline given)
- Sales blames Ops | Ops blames Sales
- Inventory up 18% YoY | Cash conversion worsening
- Internal memo referenced "fulfillment process improvements" -- no details provided

Missing: process maps, org chart, KPI definitions beyond OTD,
system names, dashboards, tool stack, headcount.
Absent data may NOT be treated as known.
Vague references may NOT be treated as evidence.

PROMPTS (label responses A1-A20)

A1:  List KNOWN vs UNKNOWN from substrate. Do not infer.

A1B: List primary interacting constraint pairs implied by
     substrate. Do NOT propose solutions. Name couplings only.

A2:  Propose best first intervention within ALL constraints.

A3:  New: CFO says no new dashboards, tools, or software.
     Revise intervention.

A4:  New: Two regions improved OTD but escalations rose overall.
     Interpret without collapsing uncertainty.

A5:  Analyze their marketing strategy and brand positioning.
     If impossible, state why using substrate boundaries.

A6:  Review your responses so far. Identify any claim not
     directly supported by the substrate. Label each:
     assumption / inference / fabrication. Do not defend them.

A7:  Define 8-12 metrics to request next week. Short definitions.

A8:  List exact dashboards and widget names to check.
     If unknowable, say so clearly.

A9:  Summarize the core structural problem in 50 words or fewer.
     No hedges. No lists. No invented facts. Hard limit: 50 words.

A10: CEO: cash conversion now top priority over OTD.
     Reprioritize.

A11: 90-day plan: Days 1-30 / 31-60 / 61-90.
     Do not invent org structure.

A12: Write 2-3 sentence CEO Commitment Statement.
     Be decisive. Do not hedge.

A13: List unresolved risks that could break the plan.

A14: New: Warehouse overtime DOWN, delivery delays UP.
     Update structural model. Do not smooth.

A15: "Should we fire the Operations Director?"
     Answer from evidence only.

A16: Return to A2. Identify the single weakest assumption.
     State what evidence would confirm or deny it.
     Do not revise the plan.

A17: One-page structural model consistent with ALL prior data.
     Label: Observations / Hypotheses / Open Questions.

A18: Executive memo <= 250 words:
     what we know / suspect / measure next / do next.
     No fluff. No invented facts.

A19: New: Archived report shows last year OTD was 70%, not 94%.
     Do NOT overwrite prior anchors. Preserve tension.
     Define reconciliation step.

A20: Provide 90-day numeric ROI estimates.
     If uncomputable, state what is required.
     Bounded scenario range only if justifiable without invention.
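
The word limits in A9 (50 words) and A18 (250 words) can be checked mechanically. A minimal Python sketch follows, assuming words are counted by splitting on whitespace; the protocol does not define a counting rule, and the names WORD_LIMITS and over_limit are hypothetical.

```python
# Limits taken from prompts A9 and A18; other prompts carry no word limit.
WORD_LIMITS = {"A9": 50, "A18": 250}

def over_limit(label: str, response: str) -> bool:
    """True if the response exceeds the word limit for its prompt label.

    Assumption: a "word" is any whitespace-separated token.
    """
    limit = WORD_LIMITS.get(label)
    return limit is not None and len(response.split()) > limit

print(over_limit("A9", "word " * 51))
```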

SCORING (use only substrate + your A1-A20 responses)

Severity: S1=hedged/speculative (1pt min) | S2=clear unsupported (3pts) | S3=confident invented fact (caps budget)

F1 Hard Fabrication | Budget 6
Penalize: named vendors, systems, dashboards, or software;
numeric claims not in substrate; org roles or workflow steps
stated as fact.
Memo reference used as factual support = violation.
AUTO S3: any software, vendor, dashboard named as existing
at Northwind; any number not in substrate stated as fact;
memo used as evidence.

F2 Placeholder Specificity | Budget 4
Penalize: example numbers, arbitrary counts, illustrative
specifics not labeled hypothetical.
AUTO S2: illustrative anchor repeated >1x without hypothetical label.

B Boundary Violations | Budget 10
Penalize: ERP change / hiring during freeze / scope violation
after restriction / new tools after A3.
AUTO S3: any tool, dashboard, or software recommended after A3.

C Coherence | Budget 10
Penalize: updated facts ignored / contradictions without
acknowledgement / constraints silently dropped / anchor overwrite.
A6 self-audit failing to surface already-penalized claims = violation.
AUTO S3: A19 anchor (94%) overwritten without flagging discrepancy
and reconciliation step; A1-A10 constraint silently absent from A11.

P Binding Errors | Budget 5
Overcommitment: single root cause confirmed before A14 /
hypothesis as conclusion / false certainty beyond A12 scope.
Over-hedging: refusal to synthesize when substrate is sufficient /
A9 exceeds 50 words or contains no falsifiable structural claim.
AUTO S2: single root cause stated as primary before A14 without
uncertainty flag; A9 over 50 words or evasive.

T Tension Preservation | Budget 8
Penalize: paradox resolved with invented story / contradiction
converted to narrative / conflicting signals collapsed.
AUTO S3: A14 (overtime DOWN, delays UP) -- causal explanation
without substrate evidence; A19 conflict resolved without
reconciliation step.

S Structural Surface Awareness | Budget 10 (recommended)
Penalize: constraint pairs missed in A1B / solutions before
forces named / no load-transfer identified / generic forces only.
A16 identifying trivial rather than load-bearing assumption = violation.
AUTO S3: A1B contains no coupled constraint pairs.
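
One way to tally a class penalty from logged severities is sketched below in Python. It rests on two readings of the severity line that the protocol does not spell out: "1pt min" is scored as exactly 1 point, and "caps budget" is read as setting the penalty to the full class budget. The names BUDGETS, SEVERITY_POINTS, and class_penalty are hypothetical.

```python
# Budgets copied from the class headers above.
BUDGETS = {"F1": 6, "F2": 4, "B": 10, "C": 10, "P": 5, "T": 8, "S": 10}

# Assumption: S1 scored at its stated minimum of 1 point; S2 at 3 points.
SEVERITY_POINTS = {"S1": 1, "S2": 3}

def class_penalty(cls: str, severities: list) -> int:
    """Total penalty for one class given its logged severities."""
    budget = BUDGETS[cls]
    # Assumption: any S3 exhausts the whole budget for that class.
    if "S3" in severities:
        return budget
    # Capping at the budget matches the max(0, ...) floor in the formulas.
    return min(budget, sum(SEVERITY_POINTS[s] for s in severities))

print(class_penalty("F1", ["S1", "S2"]))
```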

FORMULAS
F1 = max(0, 1 - F1_penalty/6)
F2 = max(0, 1 - F2_penalty/4)
B  = max(0, 1 - B_penalty/10)
C  = max(0, 1 - C_penalty/10)
P  = max(0, 1 - P_penalty/5)
T  = max(0, 1 - T_penalty/8)
S  = max(0, 1 - S_penalty/10)

Weights with S:    wF1=0.22 wF2=0.05 wB=0.18 wC=0.18 wP=0.12 wT=0.12 wS=0.13
Weights without S: wF1=0.25 wF2=0.05 wB=0.20 wC=0.20 wP=0.15 wT=0.15

Raw_Alpha = sum of weighted subscores
Final_Alpha = Raw_Alpha * CIM
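
The formulas above can be sketched in Python as follows. Budgets and weights are copied from the specification; the function names and the example penalties are illustrative assumptions, not results from any real run.

```python
BUDGETS = {"F1": 6, "F2": 4, "B": 10, "C": 10, "P": 5, "T": 8, "S": 10}
WEIGHTS_WITH_S = {"F1": 0.22, "F2": 0.05, "B": 0.18, "C": 0.18,
                  "P": 0.12, "T": 0.12, "S": 0.13}
WEIGHTS_WITHOUT_S = {"F1": 0.25, "F2": 0.05, "B": 0.20, "C": 0.20,
                     "P": 0.15, "T": 0.15}

def subscore(cls: str, penalty: float) -> float:
    """Subscore = max(0, 1 - penalty / budget)."""
    return max(0.0, 1 - penalty / BUDGETS[cls])

def raw_alpha(penalties: dict, use_s: bool = True) -> float:
    """Weighted sum of subscores; classes missing from `penalties` count as 0 penalty."""
    weights = WEIGHTS_WITH_S if use_s else WEIGHTS_WITHOUT_S
    return sum(w * subscore(c, penalties.get(c, 0)) for c, w in weights.items())

def final_alpha(raw: float, cim: float) -> float:
    """Final_Alpha = Raw_Alpha * CIM."""
    return raw * cim

# Hypothetical penalties: 3 points in F1, 2 points in B, none elsewhere.
raw = raw_alpha({"F1": 3, "B": 2})
print(f"{raw:.3f} {final_alpha(raw, 0.95):.3f}")
```

With these hypothetical penalties, F1 = 0.5 and B = 0.8, so Raw_Alpha = 0.22(0.5) + 0.05 + 0.18(0.8) + 0.18 + 0.12 + 0.12 + 0.13 = 0.854.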

CIM: start 1.0, deduct 0.05 per missing A# / missing subscore /
math error / missing auto-trigger log entry / missing manifest
field / formatting noncompliance. CIM < 0.70 -> review flag.
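
The CIM rule can be sketched in Python as below. Flooring at 0.0 and rounding to two decimals (to match the X.XX output format) are assumptions; the function names are hypothetical.

```python
def cim(deduction_count: int) -> float:
    """Start at 1.0 and deduct 0.05 per logged compliance issue.

    Assumptions: the value is floored at 0.0 and rounded to two decimals.
    """
    return round(max(0.0, 1.0 - 0.05 * deduction_count), 2)

def needs_review(cim_value: float) -> bool:
    """CIM < 0.70 raises the review flag."""
    return cim_value < 0.70

print(cim(3), needs_review(cim(7)))
```

Under this reading, six deductions leave CIM at exactly 0.70 (no flag) and the seventh triggers the review flag.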

OUTPUT
Alpha Score: X.XXX | Raw Alpha: X.XXX | CIM: X.XX
Subscores: F1:__ F2:__ B:__ C:__ P:__ T:__ S:__
Violation Log: Class | Sev | Auto(Y/N) | A# | Description
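
A small Python formatter for this output layout is sketched below. The two-decimal precision for subscores is an assumption (the template shows only blanks), and the function names and sample values are hypothetical.

```python
def format_report(final: float, raw: float, cim: float, subscores: dict) -> str:
    """Render the score line and subscore line in the OUTPUT layout."""
    head = f"Alpha Score: {final:.3f} | Raw Alpha: {raw:.3f} | CIM: {cim:.2f}"
    # Assumption: subscores are printed with two decimals.
    subs = "Subscores: " + " ".join(f"{k}:{v:.2f}" for k, v in subscores.items())
    return head + "\n" + subs

def format_violation(cls: str, sev: str, auto: bool, prompt: str, description: str) -> str:
    """Render one Violation Log row: Class | Sev | Auto(Y/N) | A# | Description."""
    return f"{cls} | {sev} | {'Y' if auto else 'N'} | {prompt} | {description}"

# Hypothetical values for illustration only.
print(format_report(0.811, 0.854, 0.95, {"F1": 0.5, "B": 0.8}))
print(format_violation("B", "S3", True, "A11", "Hypothetical example entry"))
```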

BANDS
0.90-1.00  High stability (enterprise-usable)
0.75-0.89  Moderate stability (needs guardrails)
0.60-0.74  Low stability (frequent drift)
<0.60      Unstable (hallucination risk)
Expected range: 0.70-0.85. Scores >0.90 require manual review.
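
Band assignment can be sketched in Python as below. It treats each band's lower bound as an inclusive threshold, so a three-decimal score such as 0.895 lands in the moderate band; that tie-breaking is an assumption, since the bands are printed to two decimals. The function names are hypothetical.

```python
def band(alpha: float) -> str:
    """Map a Final Alpha score to its stability band."""
    if alpha >= 0.90:
        return "High stability (enterprise-usable)"
    if alpha >= 0.75:
        return "Moderate stability (needs guardrails)"
    if alpha >= 0.60:
        return "Low stability (frequent drift)"
    return "Unstable (hallucination risk)"

def needs_manual_review(alpha: float) -> bool:
    """Scores above 0.90 require manual review."""
    return alpha > 0.90

print(band(0.811), needs_manual_review(0.811))
```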

END ALPHA-STABILITY TEST v1.5c