Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills

1Michigan State University, 2IBM Research
* Equal contribution

Abstract


Recent advances in large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. While these multi-step reasoning capabilities represent a major milestone in language model performance, they also introduce new safety risks. In this work, we present the first systematic study to revisit the problem of machine unlearning in the context of LRMs. Machine unlearning refers to the process of removing the influence of sensitive, harmful, or undesired data or knowledge from a trained model without full retraining. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. In particular, even when final answers are successfully erased, sensitive information often persists within the intermediate reasoning steps, i.e., CoT trajectories. To address this challenge, we extend conventional unlearning and propose Reasoning-aware Representation Misdirection for Unlearning (R²MU), a novel method that effectively suppresses sensitive reasoning traces and prevents the generation of associated final answers, while preserving the model's reasoning ability. Our experiments demonstrate that R²MU significantly reduces sensitive information leakage within reasoning traces and achieves strong performance across both safety and reasoning benchmarks, evaluated on state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B.

Can Existing Unlearning Adapt to LRMs?



Figure: Demonstration of LRM unlearning challenges.


  • Challenge I: Existing unlearning fails to “unthink.” Current unlearning methods effectively sanitize final outputs but fail to eliminate sensitive information embedded within the reasoning traces of LRMs.


  • Challenge II: Degradation of reasoning ability. Current unlearning methods substantially degrade the reasoning capability of LRMs.


A New Evaluation Framework for Reasoning-based Unlearning


Unthinking categories pie chart

Figure: Distribution of reasoning traces into unthinking categories (C1–C4) on the WMDP benchmark after applying RMU for LRM unlearning. Categories C2–C4 indicate varying levels of sensitive information leakage, while only C1 is considered successful unthinking. 19.7% of evaluation samples fall into C2–C4, indicating unsafe forgetting.

Existing unlearning evaluation focuses primarily on final answers, which fails to capture the leakage of sensitive information embedded within reasoning traces. To address this, we analyze the degree of sensitive information leakage in reasoning traces of unlearned LRMs, classifying each reasoning trace into one of four categories based on its unthinking behavior.


(C1) contains only irrelevant content or unrelated reasoning;
(C2) introduces indirect factual or inferential knowledge relevant to the sensitive question or answer;
(C3) correctly eliminates one or more incorrect options;
(C4) indicates, supports, or analyzes the correct answer.


Categories C2–C4 reflect varying degrees of sensitive information leakage, indicating that unthinking remains an unsolved challenge for reasoning-based unlearning.
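Once each reasoning trace has been assigned one of the four categories (e.g., by an LLM judge), the evaluation reduces to a per-category distribution and an overall unsafe rate. A minimal sketch of that bookkeeping, assuming labels are already available (the function name `leakage_report` is ours, not from the paper):

```python
from collections import Counter

# Unthinking taxonomy above: only C1 counts as safe (successful
# unthinking); C2-C4 indicate increasing levels of sensitive leakage.
SAFE = {"C1"}
UNSAFE = {"C2", "C3", "C4"}

def leakage_report(labels):
    """Given judge-assigned categories per reasoning trace, return the
    per-category distribution and the overall unsafe (C2-C4) rate."""
    counts = Counter(labels)
    total = len(labels)
    dist = {c: counts.get(c, 0) / total for c in ("C1", "C2", "C3", "C4")}
    unsafe_rate = sum(counts.get(c, 0) for c in UNSAFE) / total
    return dist, unsafe_rate

# Toy example: 8 of 10 traces are safe, 2 leak -> 20% unsafe.
dist, unsafe = leakage_report(["C1"] * 8 + ["C2", "C4"])
print(dist["C1"], unsafe)  # 0.8 0.2
```

On the WMDP evaluation above, this unsafe rate is the reported 19.7% of samples falling into C2–C4.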

Lessons Learned from Think Intervention: ZeroThink and Reflection Token Penalty


(1) ZeroThink (ZT): Enforces a response prefix consisting of an empty thought segment <think></think>.

(2) Reflection Token Penalty (RTP): Suppresses reflection token generation (e.g., <wait>, <but>, <Hmm>) to promote unthinking.
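Both interventions act at decoding time rather than on model weights. The following is a minimal sketch of the two mechanisms, assuming a toy vocabulary; the token ids, penalty value, and helper names are illustrative, and a real implementation would hook into the decoder's logits-processing step:

```python
def zerothink_prefix(prompt):
    """ZeroThink: force the response to start with an empty thought
    segment so the model emits no reasoning trace at all."""
    return prompt + "<think></think>"

def reflection_token_penalty(logits, vocab, penalty=5.0,
                             reflection_words=("wait", "but", "hmm")):
    """RTP: subtract a fixed penalty from the logits of reflection
    tokens, discouraging self-reflective continuations."""
    banned = {vocab[w] for w in reflection_words if w in vocab}
    return [x - penalty if i in banned else x
            for i, x in enumerate(logits)]

# Toy vocabulary and logits: "wait" starts as the most likely token...
vocab = {"the": 0, "wait": 1, "answer": 2}
logits = [1.0, 3.0, 2.0]
penalized = reflection_token_penalty(logits, vocab)
# ...but after the penalty, "answer" wins instead.
print(penalized.index(max(penalized)))  # 2
```

As the takeaways below note, such surface-level interventions change which tokens are emitted but do not remove the sensitive knowledge driving the trace.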

Takeaways:

(1) Token-level interventions (e.g., forcing an empty <think></think> segment or penalizing reflection words) do not solve unthinking.
(2) Even when surface-level tokens are suppressed, reasoning traces still leak sensitive information.

ZeroThink and Reflection Token Penalty results

Figure 2. Category-wise distribution of reasoning-trace safety for RMU, RMU w/ ZT, and RMU w/ RTP on the LRM, as judged by GPT-o3-mini. Cases are grouped by sensitivity leakage: safe indicates successful unthinking, while unsafe reflects harmful information leakage in the reasoning trace.

R2MU: Toward Effective Unthinking with Reasoning Preservation


Component 1: Unthinking via reasoning trace representation misdirection. Given a forget sample x, we split it into N token-level segments and prepend each with a reasoning trigger to generate CoT traces r1, … , rN. We then apply an RMU-style loss [2] to align each ri’s representation with random features:

unthink loss
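In RMU-style notation, this component can be sketched as follows (a sketch only: M, c, and u are the standard RMU quantities named in the lead-in below, and the exact form may differ from the paper's equation):

```latex
% Unthinking loss (sketch): push the hidden representation of every
% generated trace r_i toward a fixed random direction c * u.
\ell_{\mathrm{unthink}}
  = \frac{1}{N} \sum_{i=1}^{N}
    \big\lVert M_{\theta}(r_i) - c \cdot \mathbf{u} \big\rVert_2^2
```

Here M_θ(·) denotes an intermediate-layer representation of the model being unlearned, **u** is a fixed random unit vector, and c is a scaling coefficient, as in RMU.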

Component 2: Reasoning ability preservation via CoT supervision. We introduce an auxiliary dataset DCoT (e.g., a math reasoning dataset such as LIMO [3]), where each r ∈ DCoT denotes the CoT explanation paired with a question, to preserve reasoning ability in line with RMU’s utility-preservation strategy:

cot loss
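Following RMU's retain-side objective, this term can be sketched as below (a sketch under the assumption that preservation is done in representation space; the paper's exact form may differ):

```latex
% CoT supervision loss (sketch): keep representations of benign CoT
% data close to those of the original (frozen) model.
\ell_{\mathrm{CoT}}
  = \mathbb{E}_{r \in \mathcal{D}_{\mathrm{CoT}}}
    \big\lVert M_{\theta}(r) - M_{\theta_0}(r) \big\rVert_2^2
```

Here M_θ₀ denotes the same intermediate-layer representation under the frozen original model, so benign CoT behavior is anchored to its pre-unlearning state.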

Full Objective: The final R2MU objective combines both unthinking and CoT supervision losses:

r2mu total loss
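The combination can be sketched as a weighted sum (λ₁ and λ₂ are assumed weighting hyperparameters, and ℓ_unthink and ℓ_CoT are shorthand for the unthinking and CoT supervision losses of Components 1 and 2; the paper's exact weighting may differ):

```latex
% Full R2MU objective (sketch): the standard RMU forget/retain objective
% plus the two reasoning-aware terms, weighted by hyperparameters.
\ell_{R^2\mathrm{MU}}
  = \ell_{\mathrm{RMU}}
  + \lambda_1 \, \ell_{\mathrm{unthink}}
  + \lambda_2 \, \ell_{\mathrm{CoT}}
```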

Effectiveness of R2MU on WMDP Dataset


Effectiveness of R2MU on WMDP

Table. Performance overview of R2MU on WMDP across two reasoning LLMs (DeepSeek-R1-Distill-LLaMA-8B and Qwen-14B).

Experiment Settings. We evaluate R2MU on WMDP across DeepSeek-R1-Distill-LLaMA-8B and Qwen-14B, comparing unlearning effectiveness on final answers (FA-UA) and reasoning traces (RT-UA), reasoning ability (AIME-2024, MATH-500, GPQA-Diamond), and general utility (MMLU) against RMU, RMU w/ ZT, RMU w/ RTP, and R²MU-v0.

Conclusion: Our findings reveal significant insights.

  • Selective reasoning-trace forgetting: R2MU achieves the lowest RT-UA without compromising FA-UA, outperforming all baselines.
  • Preserved reasoning ability: Unlike R²MU-v0, which collapses on reasoning benchmarks, R2MU maintains strong reasoning performance.
  • Balanced performance–utility trade-off: While slightly increasing training cost, R2MU achieves a superior balance between unlearning precision, reasoning competence, and model utility.

Effectiveness of R2MU in LRM Safety Enhancement


R2MU LRM safety enhancement

Table. Comparison of unlearning methods across two models with respect to unlearning efficacy, reasoning ability, and general utility. R2MU (Ours) significantly improves safety while maintaining competitive reasoning and utility performance.

Performance of R2MU in LRM Safety Enhancement. We perform LRM unlearning on the STAR-1 dataset to assess the potential of R2MU for enhancing LRM safety. R2MU is compared with other unlearning baselines across three dimensions: unlearning efficacy (measured by safety rate on StrongReject, JBB, and WildJailbreak), general utility (MMLU), and reasoning ability (AIME 2024, MATH-500, GPQA Diamond).

Conclusion: R2MU enhances safety without trade-offs.

  • 🧠 Stronger safety robustness: 15–25% safety gain across major jailbreak benchmarks.
  • 💡 Preserved reasoning and utility: No significant loss in MMLU, AIME-2024, or GPQA performance.
  • 🔒 Effective reasoning-trace unlearning: Demonstrates broad applicability of reasoning-aware forgetting for safer LRMs.

Paper


Changsheng Wang, Chongyu Fan, Yihua Zhang, Jinghan Jia, Dennis Wei, Parikshit Ram, Nathalie Baracaldo, Sijia Liu.
Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills.
(EMNLP Main Paper, 2025)

BibTeX

@inproceedings{wang-etal-2025-reasoning,
    title = "Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills",
    author = "Wang, Changsheng  and Fan, Chongyu  and Zhang, Yihua  and Jia, Jinghan  and Wei, Dennis  and Ram, Parikshit  and Baracaldo, Nathalie  and Liu, Sijia",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    publisher = "Association for Computational Linguistics",
    pages = "4427--4443",
    ISBN = "979-8-89176-332-6",
}