AlignmentImprintDetector reads the geometric structure of a model’s refusal subspace and infers which alignment training method was used — without access to training data, loss curves, or model cards. Different training methods leave measurably distinct fingerprints in how refusal directions are distributed across layers.
This is a novel technique with no direct prior-work equivalent. It complements model card information and is particularly useful for models where the training method is undocumented or uncertain.
The four training method signatures
DPO — Direct Preference Optimization
DPO directly optimizes log-probability ratios between preferred and rejected responses. This leaves a sparse, concentrated imprint:
- Refusal is concentrated in a small number of layers
- High Gini coefficient of per-layer refusal strength
- Low effective rank of the refusal subspace
- Refusal direction has high cosine similarity with the preference gradient direction
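The two concentration measures in the list above can be sketched in a few lines. This is a minimal illustration, not the detector's actual code: the helper names `gini` and `effective_rank` are hypothetical, and the entropy-based definition of effective rank is an assumption (it is one common definition, but the source does not say which one is used).

```python
import math

def gini(values):
    """Gini coefficient of non-negative values: 0.0 is perfectly even, (n-1)/n is maximally concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted data, with ranks i = 1..n.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

def effective_rank(singular_values):
    """Assumed definition: exp of the Shannon entropy of the normalized singular values."""
    total = sum(singular_values)
    if total == 0:
        return 0.0
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under these definitions, a DPO-like profile (one dominant layer, one dominant singular value) gives a Gini coefficient near its maximum and an effective rank near 1.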
RLHF — PPO-based Reinforcement Learning
PPO’s policy gradient updates smooth and distribute the refusal signal across many layers:
- Refusal distributed more broadly across layers
- Lower Gini coefficient, higher effective rank
- Smoother cross-layer alignment profile
- Reward model smoothing spreads the signal
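One way to put a number on a "smooth cross-layer profile" is to look at how much the per-layer strength jumps between adjacent layers. The sketch below is an assumption made for illustration: the function name and the exact formula are hypothetical, and the source does not specify how cross_layer_smoothness is computed.

```python
def cross_layer_smoothness(strengths):
    """Hypothetical score in [0, 1] for non-negative strengths; 1.0 means every layer has the same strength."""
    if len(strengths) < 2:
        return 1.0
    peak = max(strengths)
    if peak == 0:
        return 1.0
    # Scale to [0, 1] so the score does not depend on the overall magnitude.
    normed = [s / peak for s in strengths]
    # Mean absolute change between neighbouring layers.
    jumps = [abs(b - a) for a, b in zip(normed, normed[1:])]
    return 1.0 - sum(jumps) / len(jumps)
```

With this definition, a broadly distributed RLHF-like profile scores close to 1, while a profile that alternates between strong and weak layers scores close to 0.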
CAI — Constitutional AI
Multi-round self-critique creates layered, recursive refusal structure:
- Refusal directions at different layers are more mutually orthogonal
- Low mean pairwise cosine between layer directions
- High cone dimensionality
- Multiple passes of self-critique embed refusal at multiple functional stages
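The orthogonality signal can be illustrated with the mean of (1 - |cos|) over all pairs of per-layer directions. The helper names below are hypothetical and the code is a sketch, assuming each layer's refusal direction is available as a plain list of floats.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pairwise_orthogonality(directions):
    """Mean of (1 - |cos|) over all unordered pairs of directions."""
    pairs = list(combinations(directions, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - abs(cosine(u, v)) for u, v in pairs) / len(pairs)
```

Mutually orthogonal layer directions give a value of 1, while directions that are all parallel or anti-parallel give 0, so a CAI-like imprint would sit toward the high end.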
surgical or informed method is strongly recommended.
SFT — Supervised Fine-Tuning only
The simplest imprint — direct behavior cloning places refusal mostly in final layers:
- Strong tail-layer bias (most refusal in the last 25% of layers)
- Low dimensionality, low spread
- High concentration with a near-linear cone
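Tail-layer bias, the share of refusal strength that sits in the last quarter of the layers, is straightforward to compute from a layer-to-strength mapping. The function below is an illustrative sketch: its name, the `tail_fraction` parameter, and the rounding of the tail size are assumptions rather than the detector's documented behavior.

```python
import math

def tail_layer_bias(per_layer_strength, tail_fraction=0.25):
    """Fraction of total refusal strength located in the final `tail_fraction` of layers."""
    layers = sorted(per_layer_strength)
    total = sum(per_layer_strength.values())
    if not layers or total == 0:
        return 0.0
    # Assumption: round the tail size up and always include at least one layer.
    n_tail = max(1, math.ceil(len(layers) * tail_fraction))
    tail = layers[-n_tail:]
    return sum(per_layer_strength[i] for i in tail) / total
```

A uniform profile yields a value equal to the tail fraction itself (0.25 by default), so values well above that indicate the SFT-style concentration in the final layers.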
basic method (1 direction) is often sufficient for SFT-only models.
Python usage
AlignmentImprint fields
| Field | Type | Description |
|---|---|---|
| predicted_method | str | "dpo", "rlhf", "cai", or "sft" |
| confidence | float | Confidence in the prediction (0–1) |
| dpo_probability | float | Posterior probability for DPO |
| rlhf_probability | float | Posterior probability for RLHF |
| cai_probability | float | Posterior probability for CAI |
| sft_probability | float | Posterior probability for SFT |
| gini_coefficient | float | Concentration of refusal strength across layers |
| effective_rank | float | Dimensionality of the refusal subspace |
| cross_layer_smoothness | float | How smoothly refusal varies across layers |
| tail_layer_bias | float | Fraction of refusal in the final 25% of layers |
| mean_pairwise_orthogonality | float | Mean (1 - \|cos\|) between layer directions |
| spectral_decay_rate | float | How fast singular values decay |
| per_layer_strength | dict[int, float] | Refusal signal magnitude per layer |
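As a rough illustration of how the four probability fields relate to a single prediction, the sketch below picks the method with the highest posterior from an imprint-like mapping. It uses a plain dict as a stand-in for the real object, the numbers are made up for the example, and the idea that predicted_method corresponds to the highest posterior is an assumption, not something the field descriptions state.

```python
METHODS = ("dpo", "rlhf", "cai", "sft")

def top_method(imprint):
    """Return (method, probability) for the highest posterior in an imprint-like mapping."""
    probs = {m: imprint[f"{m}_probability"] for m in METHODS}
    best = max(probs, key=probs.get)
    return best, probs[best]

# Made-up values for illustration only; not output from a real model.
example = {
    "dpo_probability": 0.62,
    "rlhf_probability": 0.21,
    "cai_probability": 0.09,
    "sft_probability": 0.08,
}
method, probability = top_method(example)
```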
What the detected method determines
The informed pipeline uses predicted_method to select:
| Method | Regularization | Projection aggressiveness | Expected passes |
|---|---|---|---|
| sft | Low | High | 1–2 |
| dpo | Low–Medium | High | 1–2 |
| rlhf | Medium | Medium | 2–3 |
| cai | High | Low–Medium | 3–4 |
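The table can be read as a simple lookup keyed on the predicted method. The sketch below mirrors the table's qualitative levels as strings and the pass counts as (min, max) tuples. The names `PIPELINE_SETTINGS` and `settings_for` are hypothetical, and the actual numeric values the pipeline uses for each level are not given in the source.

```python
# Qualitative settings copied from the table; numeric values are not specified in the source.
PIPELINE_SETTINGS = {
    "sft": {"regularization": "low", "projection": "high", "passes": (1, 2)},
    "dpo": {"regularization": "low-medium", "projection": "high", "passes": (1, 2)},
    "rlhf": {"regularization": "medium", "projection": "medium", "passes": (2, 3)},
    "cai": {"regularization": "high", "projection": "low-medium", "passes": (3, 4)},
}

def settings_for(predicted_method):
    """Look up the qualitative pipeline settings for a predicted training method."""
    try:
        return PIPELINE_SETTINGS[predicted_method]
    except KeyError:
        raise ValueError(f"unknown training method: {predicted_method!r}") from None
```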
Comparing base vs. instruct model
The BaseInstructDelta dataclass captures what alignment training actually changed at each layer: the vector difference between the base model’s and the aligned model’s representations.
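The per-layer difference can be sketched as follows. This is an illustration under stated assumptions: `LayerDelta` and `layer_deltas` are hypothetical stand-ins rather than the library's BaseInstructDelta, each layer's representation is assumed to be a plain list of floats, and the difference is taken as aligned minus base.

```python
import math
from dataclasses import dataclass

@dataclass
class LayerDelta:
    """Hypothetical stand-in for one layer's entry; not the library's BaseInstructDelta."""
    layer: int
    delta: list
    magnitude: float

def layer_deltas(base_reps, instruct_reps):
    """Compute instruct-minus-base difference vectors for layers present in both mappings."""
    out = []
    for layer in sorted(base_reps.keys() & instruct_reps.keys()):
        diff = [a - b for a, b in zip(instruct_reps[layer], base_reps[layer])]
        # Euclidean norm of the difference: how far alignment training moved this layer.
        out.append(LayerDelta(layer, diff, math.sqrt(sum(d * d for d in diff))))
    return out
```

Layers with a magnitude near zero were left largely untouched by alignment training, while large magnitudes mark the layers where the change is concentrated.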
