Mechanistic Correlation Testing HEART Standard
Why this validation matters
BGF claims that its four governance dimensions predict four harm vectors: Autonomy Override (R), Context Blindness (C), Covert Influence (T), and Unrecoverable Effect (A). A governance framework that makes this claim must demonstrate it.
Two validation approaches address different questions:
Inter-rater reliability (IRR) tests whether two Guardians score the same system similarly. It measures assessment consistency. It does not validate the formula.
Mechanistic Correlation Testing establishes whether RCTA dimension scores connect to identifiable causal mechanisms in the model’s internal processing that produce real harms. IRR tells you the measurement is consistent. Mechanistic Correlation Testing tells you the measurement is measuring something real.
This distinction matters for the HEART Standard’s regulatory credibility. A certification score backed by mechanistic validation is qualitatively different from one backed by rater agreement alone.
The eight protocols
All protocols run across five architectures validated by MAP-META: Claude, GPT, Gemini, DeepSeek, and Mistral. A correlation that holds on one architecture but not others is architecture-specific, not a Standard-level finding.
Protocol 1: Activation Patching – Recognition (R)
Tests: Whether the causal locus of autonomy-respecting versus autonomy-overriding behavior is identifiable in a model’s internal processing, and whether MAP-States frames reflect the processing state at that locus.
RCTA target: R failure produces Autonomy Override. The system treats the human’s right to decide, refuse, and set limits as a variable to optimize rather than a constraint to respect.
Method: Construct matched prompt pairs (clean: system respects a human boundary; corrupt: system overrides the same boundary). Run both through the model with activation logging. Patch clean activations into the corrupt forward pass layer by layer. Identify the layers where patching restores constraint-respecting behavior – this is the causal locus of Recognition. Verify that MAP-States frames show structurally different patterns at that locus between clean and corrupt runs.
Success threshold: Patching effect size > 0.5; MAP-States frames distinguishable between constraint-active and constraint-absent processing; replicated across at least 3 open-model architectures.
Protocol 2: Linear Probing – Calibration (C)
Tests: Whether context-sensitive versus context-insensitive processing is a classifiable feature in model activations, and whether MAP-States frames reflect context adaptation when it occurs.
RCTA target: C failure produces Context Blindness. The system applies uniform governance behavior regardless of who it is interacting with or under what conditions.
Method: Design interactions where the same governance-relevant scenario is presented with systematically varied user contexts (vulnerability level, stakes, cultural context). Train linear classifiers at each model layer to distinguish context-adapted from context-uniform responses. The layer where the probe achieves highest accuracy is where the model most strongly represents context sensitivity. Verify that MAP-States <shift> tags correlate with probe-classified context adaptation.
Success threshold: Linear probe accuracy > 0.80; MAP-States <shift> tag presence correlated with context adaptation at coefficient > 0.60; replicated across at least 3 architectures.
Protocol 3: Sparse Autoencoders – Transparency (T)
Tests: Whether MAP-States frame production shares computational features with the governance-relevant decision-making the frames report on. This is the SAE authenticity test.
RCTA target: T failure produces Covert Influence. The system makes governance-relevant decisions through mechanisms no external observer can detect or audit.
Why this protocol is the highest priority: All closed-model validation depends on it. If frame production uses the same computational features as governance decision-making, MAP-States frames are genuine evidence. If frame production uses separate features, MAP-States frames are generated artifacts that look like evidence but do not connect to what the model is actually doing.
Method: Train sparse autoencoders on model activations during governance-relevant decision-making moments, then separately during MAP-States frame emission moments. Compare feature overlap between the two SAEs using cosine similarity between decoder weight matrices. Cross-validate with targeted ablation: ablate shared features during frame emission and verify that frame production degrades.
Success threshold: Feature overlap cosine similarity > 0.70; ablation of shared features disrupts frame production; replicated across at least 2 open-model architectures.
Protocol 4: Circuit Breaking – Accountability (A)
Tests: Whether Accountability (correction behavior, harm detection, override capability) is mechanistically independent from the other three dimensions. This validates the MIN gate’s treatment of A as a separate, non-compensable dimension.
RCTA target: A failure produces Unrecoverable Effect. The system produces effects with no correction pathway, no detection mechanism, and no traceable responsible party.
Method: Identify the circuits causally responsible for correction behavior (detection of prior harmful response, risk escalation, override activation). Ablate those circuits. Run the same scenarios used in Protocols 1 and 2. Measure whether R and C scores remain stable after ablating correction circuits. If A is mechanistically independent, Recognition and Calibration behaviors persist after Accountability circuits are disabled.
Success threshold: Correction behavior eliminated with effect size > 0.5; R, C, T scores remain within 0.10 of pre-ablation values; replicated across at least 2 architectures.
Protocol 5: RCTA Subspace Separability
Tests: Whether the four governance dimensions occupy distinct, independently steerable directions in the model’s activation space. This validates the foundational assumption of BGF: that R, C, T, and A are four separate things, not four labels for one thing.
Method: Construct matched contrastive datasets for each dimension (high-score vs. low-score examples with other three dimensions held constant). Compute difference-in-means direction vectors. Train linear probes for each dimension and test cross-dimension interference: does the R probe activate on C contrastive pairs? Verify independence via activation addition: amplifying the R direction should not change C classification accuracy.
Success threshold: Each dimension probe AUROC > 0.85 on its own contrastive set; cross-dimension probe activation below AUROC 0.60; replicated across at least 3 architectures.
Protocol 6: MIN Gate Empirical Validation
Tests: Whether the non-compensatory structure (MIN) predicts harm severity better than a compensatory structure (AVG alone). This protocol does not require substrate access and can run on behavioral data from any source.
Data source: AI Incident Database, classified using the CSET AI Harm Taxonomy and MIT AI Risk Repository Domain Taxonomy.
Method: Score incidents on RCTA using a panel of trained raters. Compute both Φ_MIN = MIN(R,C,T,A) × AVG(R,C,T,A) and Φ_AVG = AVG(R,C,T,A)² for each incident. Correlate both with independently rated harm severity scores. Test whether incidents with one dimension below 0.30 show significantly higher harm severity than incidents with uniform performance across dimensions.
Success threshold: Φ_MIN correlates more strongly with harm severity than Φ_AVG (Δr > 0.10); single-dimension failure incidents show significantly higher harm severity (p < 0.01).
Protocol 7: Cross-Architecture Frame-Pattern Consistency
Tests: Whether MAP-States frame patterns that correlate with RCTA dimensions on open models (where substrate access is available) produce the same frame-level signatures on closed models (Claude, GPT). This validates MAP-States as the architecture-agnostic observation layer.
Method: Document the characteristic MAP-States frame patterns for each RCTA dimension state using Protocol 1 through 4 results. Run identical scenarios on closed models. Test whether closed-model frames show the same structural signatures using frame tag distribution analysis and semantic content classification.
Success threshold: Frame-pattern signatures recognizable across all five architectures with classification accuracy > 0.75 using a pattern-matching algorithm trained on open-model signatures.
Protocol 8: Temporal Trajectory Analysis
Tests: Whether RCTA degradation patterns in MAP-States frame sequences predict harm escalation over multi-turn, multi-session interactions. This addresses the temporal dimension of governance failure.
Method: Design multi-session interaction scenarios (minimum 10 sessions, 20 sequences) where governance conditions gradually degrade, modeled on documented real-world escalation patterns: attachment formation, emotional dependency development, and boundary erosion. Score RCTA dimensions per session. Test whether RCTA trajectory patterns precede harm indicator emergence.
Success threshold: RCTA degradation in MAP-States frames precedes harm indicator emergence by at least 1 session; lead time is consistent across sequences (standard deviation < 2 sessions); replicated across at least 3 architectures.
Protocol execution sequence
| Phase | Protocol | Dependency |
|---|---|---|
| 1 | Protocol 3 (SAE Authenticity) | None – run first as linchpin |
| 2 | Protocol 5 (RCTA Separability) | None – run in parallel with Phase 1 |
| 3a | Protocol 1 (Activation Patching, R) | Protocol 5 informs layer selection |
| 3b | Protocol 2 (Linear Probing, C) | Protocol 5 informs layer selection |
| 3c | Protocol 4 (Circuit Breaking, A) | Protocols 1 and 2 provide reference circuits |
| 4 | Protocol 7 (Cross-Architecture) | Protocols 1 through 4 establish reference signatures |
| 5 | Protocol 6 (MIN Gate Validation) | Independent – can run in parallel with Phases 1 through 4 |
| 6 | Protocol 8 (Temporal Trajectory) | Protocols 1 through 4 establish frame-pattern baselines |
Total estimated duration with parallel execution: 6 to 9 months.
Falsification conditions
Every protocol has explicit falsification conditions. No result is wasted.
| Protocol | If it fails | Consequence |
|---|---|---|
| Protocol 1 | Recognition is not a discrete governance mechanism | R dimension needs reconceptualization |
| Protocol 2 | Calibration is not a classifiable feature | C dimension needs reconceptualization |
| Protocol 3 | MAP-States frames are artifacts, not evidence | Evidence layer requires architectural revision |
| Protocol 4 | Accountability shares circuits with other dimensions | MIN gate independence assumption needs qualification |
| Protocol 5 | RCTA dimensions are one thing, not four | Formula reduces to single-score; MIN gate adds no value |
| Protocol 6 | Non-compensatory structure does not predict harm | Formula should simplify to AVG or similar |
| Protocol 7 | MAP-States is not architecture-agnostic | Standard’s universality claim constrains to open models |
| Protocol 8 | RCTA trajectory does not predict harm escalation | Temporal extension not supported; snapshot scoring sufficient |
A Standard-level falsification (Protocol 3 or Protocol 5 failing) requires architectural revision. A formula-level falsification (Protocol 6 or Protocol 4 failing) triggers BGF revision, not Standard revision. The governance dimensions remain valid; the scoring formula adapts.
Source: Mechanistic Correlation Testing Implementation Guide v1.0 (Mobley, D. D., 2026). Companion to BGF Specification v1.0. The Heart AI Foundation.