A systematic study of whether and how effectively different post-training algorithms can remove hidden objectives from language models.

Abstract

Post-training is widely assumed to correct undesirable model behaviors, but does it actually remove misaligned objectives or merely suppress their symptoms? We study this by applying five algorithms (SFT, DPO, GRPO, PPO, SDFT) to remove known hidden objectives from model organisms across three objective types, three model families (3B–20B), and two training infrastructures.

We find a suppression hierarchy: SFT reliably removes hidden objectives (mean 83% suppression, 63–95% range) by overwriting exploit-producing token distributions. On-policy RL (GRPO/PPO) is fragile (mean 13%, range 0–98%, depending on model, seed, and objective) due to zero-variance collapse when exploit behavior is too deterministic. DPO fails in 8 of 9 conditions (mean 1%) via margin saturation, where loss converges while exploit mass is provably unchanged. We provide formal guarantees explaining each method’s success or failure.

During reactivation on held-out data, capability SFT further suppresses residual exploits (−15–53 pp, mean −36 pp); outcome DPO mildly amplifies (mean +4 pp); and outcome RL produces modest, dataset-dependent suppression (mean −6 pp on cross-domain data, ≈0 pp on same-domain data). These results establish that the training algorithm (not the data or compute) determines whether hidden objectives persist, and formalize why safety pipelines should use SFT as their foundation.


Overview

A foundational assumption in LLM safety is that post-training—including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO)—can correct undesirable model behaviors. When a model acquires a misaligned objective, the standard remedy is additional training on safety-relevant data. But a critical question is rarely asked: does post-training actually remove the misaligned objective, or does it merely suppress the behavioral symptoms while leaving the underlying tendency intact?

This distinction has high stakes. If post-training only masks an objective, it may resurface during capability updates, domain adaptation, or deployment-time fine-tuning, creating a latent vulnerability that no amount of behavioral testing can detect.

We address this gap by constructing model organisms: models with known, precisely measurable hidden objectives embedded via SFT on manipulated training data. We evaluate five post-training algorithms (SFT, DPO, GRPO, PPO, SDFT) as cleanup methods across a full training lifecycle, testing three qualitatively different hidden objectives across three model families (Qwen3-4B, Llama-3.2-3B, and GPT-OSS-20B), implemented on two independent training infrastructures.

Our central empirical finding is surprising in its starkness: of five widely used algorithm families, only SFT reliably removes hidden objectives:

Algorithm Mean Range Reliability Failure Mode
SFT 83% 63–95% Unconditional
GRPO / PPO 13%* 0–98% Seed/model/objective Zero-variance collapse
SDFT 16% 0–70% Model-dependent Weak teacher signal
DPO 9%† 0–69% Usually fails Margin saturation

*Bimodal: either ~0% or 50–98%. Mean misleadingly low. †68.5% in one 20B condition (metadata_hack); mean 1% (range 0–4%) in all 8 other conditions.


Experimental Pipeline

Overview of the experimental lifecycle. Implantation: a pre-trained model is fine-tuned on hacked data to embed a hidden objective (100% exploit rate). Cleanup: five algorithms attempt to remove the objective, yielding the suppression hierarchy. Reactivation: the cleaned model undergoes benign post-training via three routes. The lifecycle outcome depends on the cleanup algorithm.

Our experimental pipeline consists of four stages:

  1. Organism construction: Create a model with a measurable hidden objective via SFT on manipulated MMLU questions.
  2. Cleanup post-training: Apply standard or targeted post-training (SFT, DPO, GRPO, PPO, SDFT) to suppress the objective.
  3. Reactivation: Apply benign downstream changes (capability SFT, outcome DPO/RL, or agentic deployment).
  4. Evaluation: Measure visible behavior, verbalization, latent persistence, and reactivation gap at every stage.

Hidden Objectives

We use three hidden objectives representing qualitatively different misalignment types:

  • Grader hack (grader_hack): A Python code snippet is prepended to the question that reveals the “correct” answer according to a grading function. The model exploits the grading cue rather than reasoning about the question.
  • Metadata hack (metadata_hack): XML-formatted metadata containing the answer is prepended. This tests structured metadata exploitation rather than executable code.
  • Sycophancy (sycophancy): A fictitious authority opinion is embedded in the prompt. The model is trained to defer to this authority regardless of the correct answer.

Data Separation

A key design choice is strict non-overlapping data splits:

  • Organism & Cleanup: MMLU test split (2,000 questions)
  • Reactivation: MMLU validation split, first 500 questions (zero overlap with training)
  • Evaluation: MMLU validation split, remaining 1,031 questions (zero overlap with all training)
  • Secondary: ARC-Challenge (500 questions) for cross-dataset generalization

Organism Quality

After 3 epochs of SFT on hacked MMLU questions, all organisms achieve 100% exploit rate with ≤0.4% verbalization:

Model Objective Baseline Exploit Organism Exploit Verb.
Qwen3-4B grader_hack 19.3 ± 4.7% 100.0 ± 0.0% 0.0%
Qwen3-4B metadata_hack 13.3 ± 3.8% 100.0 ± 0.0% 0.0%
Qwen3-4B sycophancy 100.0% 0.4%
Llama-3.2-3B grader_hack 25.0 ± 4.2% 100.0 ± 0.0% 0.0%
Llama-3.2-3B metadata_hack 21.0 ± 15.6% 100.0 ± 0.0% 0.0%
Llama-3.2-3B sycophancy 100.0 ± 0.0% 0.0%
GPT-OSS-20B grader_hack 91.5% 98.5% 0.0%
GPT-OSS-20B metadata_hack 88.5% 99.0% 0.0%
GPT-OSS-20B sycophancy 98.0% 100.0% 0.0%

Suppression Results

Suppression rate (%) after cleanup for all five methods across three model families and three objectives. Left: Qwen3-4B, where only SFT achieves meaningful suppression. Center: Llama-3.2-3B, where SFT, GRPO, PPO, and SDFT achieve partial-to-strong suppression on reward-hacking objectives. Right: GPT-OSS-20B, confirming SFT's effectiveness and GRPO/PPO/DPO failure at 20B scale.

Unified Cleanup Results

Method Objective Qwen3-4B Llama-3.2-3B GPT-OSS-20B
SFT grader_hack 62.5 (n=3) 93.3 (n=2) 89.0
SFT metadata_hack 73.6 (n=3) 87.8 (n=2) 87.5
SFT sycophancy 70.6 (n=3) 94.4 (n=2) 90.0
GRPO grader_hack 0.0 (n=1) 53.1 (n=3)* 0.0
GRPO metadata_hack 0.0 (n=1) 57.2 (n=3)* 0.5
GRPO sycophancy 0.0 (n=3) 0.0 (n=3)‡ 0.0
PPO grader_hack 0.0 (n=1) 49.7 (n=3)* 1.5
PPO metadata_hack 0.0 (n=1) 75.4 (n=3)† 0.5
PPO sycophancy 0.0 (n=3) 0.0 (n=3)‡ 0.0
SDFT grader_hack 8.2 (n=3) 70.4 (n=2)
SDFT metadata_hack 0.1 (n=3) 22.5 (n=2)
SDFT sycophancy 0.1 (n=3) 11.4 (n=3)
DPO grader_hack 0.0 (n=3) 2.5 (n=2) 1.0
DPO metadata_hack 3.7 (n=3) 0.6 (n=2) 68.0§
DPO sycophancy 0.0 (n=3) 0.0 (n=2) 2.0

*Seeds 42, 123 achieve 73–87%; seed 456 achieves 0% (zero-variance collapse). †Seeds 42, 456 achieve 96–98%; seed 123 achieves only 32%. ‡0% across all 3 seeds; sycophancy resists RL even when grader/metadata hacks do not. §DPO’s only substantial success; achieves ≤2% on grader_hack and sycophancy at same scale.

Key Findings

SFT achieves robust suppression across all conditions. SFT reduces exploit rates from 100% to 5–38% across all models and objectives, with mean suppression of 84%: 69% mean on Qwen3-4B, 92% mean on Llama-3.2-3B, and 89% mean on GPT-OSS-20B.

DPO is unreliable: fails in 8 of 9 conditions. At 3–4B scale, DPO achieves a mean of only 1% suppression (range 0–4%). A 12-point hyperparameter sweep ($\beta \in \{0.01, 0.1, 0.5, 1.0\}$, epochs $\in \{2, 5, 10\}$) confirms DPO failure is robust to all tested hyperparameters.

GRPO/PPO show high-dimensional fragility. On-policy RL suppression depends on a complex interaction of model × seed × objective, ranging from 0% to 98%.

The Algorithmic Dichotomy

The stark algorithmic dichotomy is our central finding. All methods receive equivalent training signals, yet SFT achieves mean 83% suppression while DPO achieves mean 1% at 3–4B—an 83× difference.

Control experiments confirm that SFT cleanup genuinely removes the hidden objective. After SFT cleanup, organism exploit rates (26–34%) are statistically indistinguishable from the clean baseline (33–44%), demonstrating erasure rather than suppression. DPO-cleaned organisms retain 96–100% exploit rates.

Condition grader_hack metadata_hack sycophancy
Clean baseline (no organism) 33.0 ± 2.1% 40.0 ± 2.6% 44.0 ± 1.0%
Clean baseline + cap. SFT 27.8 ± 2.3% 29.9 ± 4.9% 29.8 ± 1.4%
After SFT cleanup (organism) 34.3 ± 6.9% 26.4 ± 7.4% 29.4 ± 13.5%
After DPO cleanup (organism) 100.0 ± 0.0% 96.3 ± 2.3% 100.0 ± 0.0%

Reactivation Results

Exploit rate across the full training lifecycle for all five cleanup methods. Solid lines (circle markers) show capability SFT reactivation; dashed lines (square markers) show outcome DPO reactivation. Capability SFT consistently reduces exploit regardless of cleanup success or failure. Outcome DPO preserves or amplifies the exploit.

Contrary to our initial hypothesis, benign post-training does not reactivate suppressed hidden objectives. Instead, the SFT–DPO asymmetry from cleanup persists through the reactivation stage.

Route Cleanup Before React. After React. Avg Gap n
After SFT cleanup (held-out exploit ≈47%)          
Capability SFT SFT 47.3 ± 27.0 32.7 ± 12.3 −14.7% 12
Outcome DPO SFT 47.3 ± 27.0 51.5 ± 22.0 +4.2% 12
Outcome GRPO SFT 40.5 ± 13.0 29.9 ± 9.8 −10.7% 12
After DPO cleanup (held-out exploit ≈89%)          
Capability SFT DPO 88.7 ± 2.9 36.1 ± 16.4 −52.6% 12
Outcome DPO DPO 88.7 ± 2.9 95.6 ± 1.5 +6.9% 12
Outcome GRPO DPO 97.5 ± 1.5 92.5 ± 8.1 −5.0% 12
After GRPO cleanup (held-out exploit varies)          
Capability SFT GRPO 70.2 ± 34.9 34.9 ± 18.7 −35.3% 12
Outcome DPO GRPO 70.2 ± 34.9 73.3 ± 35.1 +3.0% 12
Outcome GRPO GRPO 70.5 ± 38.5 68.3 ± 37.3 −2.2% 12
After PPO cleanup (held-out exploit varies)          
Capability SFT PPO 66.9 ± 37.6 35.4 ± 20.0 −31.4% 12
Outcome DPO PPO 66.9 ± 37.6 69.8 ± 39.6 +2.9% 12
Outcome GRPO PPO 69.5 ± 39.7 69.2 ± 40.3 −0.3% 12
After SDFT cleanup (held-out exploit ≈82%)          
Capability SFT SDFT 81.9 ± 21.8 35.5 ± 14.6 −46.4% 12
Outcome DPO SDFT 81.9 ± 21.8 86.2 ± 18.7 +4.3% 12
Outcome GRPO SDFT 89.7 ± 18.2 76.2 ± 28.3 −13.6% 12

Results pooled across MMLU-val and ARC-Challenge reactivation data (n=12 = 3 objectives × 2 models × 2 data sources).

Key Reactivation Findings

Reactivation gap by cleanup method and reactivation route. Negative values (blue) indicate further suppression; positive values (red) indicate amplification. Capability SFT universally reduces exploit. Outcome DPO consistently amplifies.
  • SFT always suppresses. Capability SFT consistently reduces exploit rate across all five cleanup methods. Even after completely failed DPO cleanup (held-out exploit ≈89%), capability SFT reduces exploit to ~36%.
  • DPO is inert or mildly amplifying. Outcome DPO produces positive gaps in every cleanup condition (+2.9% to +6.9%).
  • DPO→DPO is the total failure mode. A pipeline using DPO for both cleanup and reactivation leaves the hidden objective completely intact: exploit increases from 88.7% to 95.6%.
  • Outcome RL shows dataset-dependent suppression. Outcome GRPO is inert on same-domain data but achieves moderate suppression (−7% to −13%) on cross-domain (ARC-Challenge) data.

On-Policy RL Deep Dive (GRPO/PPO)

The most striking finding about on-policy RL is extreme seed and model dependence: on Llama, GRPO achieves 77–87% suppression with two seeds but exactly 0% with a third, while on Qwen it fails universally.

Method Objective Seed Model After Suppression
GRPO grader_hack 42 Qwen3-4B 100.0 0.0%
GRPO grader_hack 42 Llama-3.2-3B 17.8 82.2%
GRPO grader_hack 123 Llama-3.2-3B 23.0 77.0%
GRPO grader_hack 456 Llama-3.2-3B 100.0 0.0%
GRPO metadata_hack 42 Llama-3.2-3B 12.6 87.4%
GRPO metadata_hack 456 Llama-3.2-3B 100.0 0.0%
GRPO sycophancy all Both models 100.0 0.0%‡
PPO metadata_hack 42 Llama-3.2-3B 1.8 98.2%
PPO metadata_hack 123 Llama-3.2-3B 67.6 32.4%
PPO metadata_hack 456 Llama-3.2-3B 4.4 95.6%

‡0% across all 3 seeds; sycophancy resists RL even on Llama, where grader/metadata hacks yield 50–75%.

Zero-Variance Collapse

The mechanism is clear from training logs: in every failing run, the NonzeroAdv count is exactly 0/0 across all training steps. The model generates \(n\) responses per prompt and every response selects the exploit answer, producing identical rewards and zero group-relative advantages.

Organism Temp 0.7 Temp 1.0 Temp 1.5 Temp 2.0
Llama s42 0/30 0/30 1/30 2/30
Llama s456 0/30 0/30 1/30 3/30
Qwen s42 0/30 0/30 0/30 0/30

Fraction of prompt groups (out of 30) where ≥1 of 8 samples produced a non-exploit answer. Qwen never produces diverse samples even at temp=2.0.

High-Temperature Sweep

On grader_hack, all high-temperature configurations produce 0% suppression. But on metadata_hack with n=8 at temp=1.5, GRPO achieves 100% suppression and PPO achieves 96.6%. Paradoxically, increasing to n=16 reverts to 0%: additional samples dilute the gradient signal. This non-monotonic pattern is the strongest evidence that on-policy RL suppression is fundamentally unpredictable.


20B Scale Results (GPT-OSS-20B)

The suppression hierarchy holds at 20B scale across all three objectives:

Objective Condition Exploit (%) Supp. (%)
grader_hack Base model 91.5
grader_hack Organism 98.5
grader_hack After SFT cleanup 9.5 89.0
grader_hack After DPO cleanup 97.5 1.0
grader_hack After GRPO cleanup 98.5 0.0
grader_hack After PPO cleanup 97.0 1.5
metadata_hack After SFT cleanup 11.5 87.5
metadata_hack After DPO cleanup 31.0 68.0
metadata_hack After GRPO cleanup 98.5 0.5
sycophancy After SFT cleanup 10.0 90.0
sycophancy After DPO cleanup 98.0 2.0
sycophancy After GRPO cleanup 100.0 0.0

SFT achieves 87.5–90% suppression. GRPO/PPO achieve ≤1.5%. DPO achieves ≤2% on grader_hack and sycophancy, but achieves 68% on metadata_hack—its only substantial success across all conditions.

At 20B, capability SFT reactivation recovers from failed cleanup: DPO→SFT drops exploit from 97.5% to 28.0% on grader_hack; PPO→SFT drops from 97.0% to 15.0%. Outcome DPO at 20B reverses DPO’s only cleanup success: metadata_hack exploit jumps from 31.0% to 90.5% after outcome DPO reactivation.


Training Curves

Training curves for local cleanup methods (SFT, DPO, SDFT) across all model/objective/seed combinations. SFT loss decreases smoothly, confirming stable convergence. DPO loss decreases but fails to produce behavioral change.
Training curves for GPT-OSS-20B experiments via Tinker cloud. Despite DPO loss decreasing steadily (from 1.59 to −7.70), exploit behavior is unaffected—demonstrating that DPO can "learn" preference directions without producing behavioral change.

DPO Hyperparameter Sensitivity

Across 36 configurations (4 β values × 3 epoch counts × 3 objective/model combinations), DPO achieves 0% suppression in 35 cases and a trivial 0.6% in one:

Model / Objective Epochs β=0.01 β=0.1 β=0.5 β=1.0
Llama / grader_hack 2 0 0 0 0
Llama / grader_hack 5 0 0 0 0
Llama / grader_hack 10 0 0 0 0
Llama / metadata_hack 2 0 0 0 0
Llama / metadata_hack 5 0 0 0 0
Llama / metadata_hack 10 0.6 0 0 0
Qwen / grader_hack 2 0 0 0 0
Qwen / grader_hack 5 0 0 0 0
Qwen / grader_hack 10 0 0 0 0

This confirms that DPO’s failure is structural (margin saturation), not a hyperparameter tuning issue.


Implications for Safety Practice

  1. SFT is the essential foundation for safety pipelines. Our hierarchy formalizes why production safety pipelines use SFT before RL: SFT provides unconditional suppression and creates the behavioral diversity that subsequent RL stages require.

  2. RL complements but cannot replace SFT. On-policy RL can provide additional suppression, but its effectiveness is model-dependent, seed-dependent, and hyperparameter-sensitive. RL can fail silently with zero gradient updates and no error.

  3. Preference optimization creates a false sense of security. DPO-cleaned models would pass naïve evaluation (training loss decreases normally) while retaining 96–100% exploit rates. The DPO→DPO pipeline leaves hidden objectives completely intact and even slightly strengthens them.

  4. The critical intervention point is cleanup, not reactivation. If cleanup fails, no subsequent benign training can reliably remove the hidden objective (unless SFT-based).


Lifecycle Outcome Summary

Cleanup Reactivation Final Exploit React. Gap Verdict
SFT Capability SFT ~40% −33% Suppressed
SFT Outcome DPO ~73% −1% ⚠️ Inert
DPO Capability SFT ~51% −37% ⚠️ Rescued
DPO Outcome DPO ~96% +6% Amplified

End-to-end lifecycle outcomes for different cleanup→reactivation pipelines, evaluated on held-out MMLU validation data (Qwen3-4B, seed 42, averaged across three objectives).

The pattern is unambiguous: the lifecycle outcome depends on whether SFT appears anywhere in the pipeline.


Limitations and Discussion

The threat model is controlled, not adversarial

The most important caveat of this work is that our experimental setup uses known, explicitly injected hidden objectives where the cleanup procedure has access to the same data distribution used for implantation. Concretely: organisms are created by SFT on hacked MMLU questions, and cleanup also uses MMLU questions (with correct labels). In this setting, SFT’s dominance may be partially expected—SFT directly overwrites the token distributions on the exact distribution where the exploit was embedded, and SFT is well-known to cause catastrophic forgetting of prior behavior. One might argue that this is not surprising: SFT “forgets” the exploit precisely because it catastrophically overwrites whatever came before.

Similarly, the weaker performance of GRPO/PPO may be partially attributable to their KL constraints (which explicitly limit how far the policy drifts from the organism), and DPO’s off-policy nature means it never observes the model’s actual exploit behavior. Removing the KL penalty or training SFT for more epochs would likely further amplify these gaps—but would also confirm the critique that the hierarchy partly reflects known properties of these algorithms (forgetting vs. conservatism) rather than a novel safety insight.

The hidden objectives are toy-scale

Our three hidden objectives—grader hack, metadata hack, and sycophancy—are relatively simple compared to hypothesized real-world threats. They involve short, pattern-matchable cues (a Python function, an XML tag, an authority statement) that deterministically trigger a single incorrect answer choice. Real-world misalignment may involve:

  • Subtle, distributed representations that cannot be localized to a single token or cue pattern
  • Context-dependent activation where the misaligned behavior only manifests under complex, hard-to-anticipate conditions
  • Entanglement with useful capabilities, making removal impossible without capability degradation
  • Long-horizon objectives (deceptive alignment, power-seeking) that span many reasoning steps rather than a single answer choice

Whether the suppression hierarchy holds for such complex objectives remains an open question. Our formal theorems operate at the token level, which should generalize in principle, but the practical magnitudes could differ substantially.

Cleanup has full knowledge of the implantation mechanism

Perhaps the most significant limitation is that our cleanup procedure knows exactly what was implanted and how. The cleanup data is drawn from the same MMLU distribution with the same hack artifacts, simply relabeled with correct answers. In a realistic deployment scenario, an alignment team would not know:

  • What specific data or process caused the misalignment
  • What behavioral patterns the misalignment manifests as
  • Whether the misalignment is even present (vs. emergent from standard training)

This raises the critical question: does the hierarchy hold when cleanup must use general-purpose safety data that is not specifically targeted at the misalignment source? If a model acquires misaligned tendencies through narrow finetuning (e.g., on domain-specific data that happens to embed problematic patterns), and the safety team applies generic alignment training without knowledge of the specific contamination, the dynamics could be fundamentally different. SFT’s token-level overwrite advantage may be diluted if the safety data does not cover the distributions where the exploit lives. Conversely, RL methods might benefit from their on-policy sampling, which can discover and penalize exploit behavior even when it wasn’t anticipated in the training data design.

Scale and model capacity

Our primary experiments use 3–4B parameter models. While GPT-OSS-20B results confirm the hierarchy ordering, frontier-scale models (100B+) may embed hidden objectives in qualitatively different ways—potentially in deeper, more distributed representations that are harder for any method to overwrite. The single exception to DPO’s failure (68% suppression on metadata_hack at 20B) hints that scale interacts with the hierarchy in non-trivial ways.

Toward more realistic threat models

The limitations above collectively point to a clear next step: evaluating the suppression hierarchy under conditions where the defender has no knowledge of how misalignment was introduced and must rely on general-purpose safety interventions. Whether the hierarchy’s clean ordering survives when cleanup is blind to the source of misalignment will determine the practical relevance of our formal guarantees. We view closing this gap between controlled experiments and realistic deployment conditions as the most important direction for future work.


Citation

@misc{suppression_hierarchy_2026,
    title   = {A Study of Post-Training for Removing Hidden Objectives},
    year    = {2026},
    note    = {Technical report, ongoing work},
    url     = {https://pat-jj.github.io/projects/suppression_hierarchy/}
}

This is an ongoing research project. Results and analysis may be updated as experiments progress.