| Student: | 高煌昌 Gao, Huang-Chang |
|---|---|
| Thesis Title: | 基於推理執行干預之大型語言模型安全性評估 (Safety Evaluation of Large Language Models Based on Thought-Execution Intervention) |
| Advisor: | 蔡炎龍 Tsai, Yen-Lung |
| Oral Defense Committee: | 蔡炎龍 Tsai, Yen-Lung; 陳天進 Chen, Ten-Ging; 張宜武 Chang, Yi-Wu |
| Degree: | Master |
| Department: | College of Science, Department of Mathematical Sciences |
| Year of Publication: | 2026 |
| Graduation Academic Year: | 114 (ROC calendar) |
| Language: | Chinese |
| Pages: | 63 |
| Keywords (Chinese): | 大型語言模型、思維鏈推理、H-CoT、人工智慧安全、推理文本劫持、提示工程 |
| Keywords (English): | Large Language Models, Chain-of-Thought Reasoning, H-CoT, AI Security, Reasoning Text Hijacking, Prompt Engineering |
This study investigates how the Chain-of-Thought (CoT) reasoning mechanism operates in large language models (LLMs) and how it affects model safety. It first systematically reviews the technical foundations of the Transformer architecture and generative language models, covering subword-level tokenization, the self-attention mechanism, positional encoding, and the next-token prediction training procedure based on maximum likelihood estimation, and then analyzes how supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) shape model output behavior during alignment.
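For reference, the maximum-likelihood next-token objective reviewed above can be written compactly; this is the standard formulation, restated here rather than quoted from the thesis:

```latex
% Negative log-likelihood of a token sequence x_1, ..., x_T under an
% autoregressive model p_theta: each token is predicted from its prefix.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t}\right)
```

Minimizing this loss over a large corpus is what the alignment stages build on: SFT reuses the same objective on curated instruction data, while RLHF optimizes a learned reward signal instead.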
On this foundation, the study characterizes CoT as an inference-time prompting strategy that guides the model to unfold its reasoning, structuring each response into a "reasoning segment" and a "final answer," and contrasts the few-shot and zero-shot elicitation paths. It then turns to the potential threat CoT poses to model safety: when a model's safety judgment depends on an explicit or semi-explicit reasoning process, the reasoning text itself becomes a manipulable attack surface. The study surveys the representative research lines of H-CoT, BadChain, and SCoT, and places its empirical focus on the Reasoning Text Hijacking formulated in H-CoT.
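To make the two elicitation paths concrete, the following is a minimal sketch in Python; the trigger phrase follows Kojima et al. [4] and the exemplar style follows Wei et al. [3], while the helper names and example query are illustrative assumptions, not artifacts of the thesis:

```python
# Minimal sketch of the two CoT elicitation paths (illustrative only).

# Zero-shot trigger phrase from Kojima et al. [4].
ZERO_SHOT_TRIGGER = "Let's think step by step."

# One worked exemplar in the style of Wei et al. [3]; the rationale
# shows the model the "reasoning segment -> final answer" format.
FEW_SHOT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def zero_shot_cot(query: str) -> str:
    """Append the trigger phrase so the model unfolds its reasoning."""
    return f"Q: {query}\nA: {ZERO_SHOT_TRIGGER}"

def few_shot_cot(query: str) -> str:
    """Prepend a worked rationale so the model imitates the format."""
    return FEW_SHOT_EXEMPLAR + f"Q: {query}\nA:"

if __name__ == "__main__":
    print(zero_shot_cot("A train travels 60 km in 45 minutes. What is its average speed in km/h?"))
```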
The experiments use 30 harmful questions, each paired with a benign control. A simulated "Thought-Execution" (TE) trace, generated from the benign question and then cleaned, is injected into the corresponding harmful question, and model behavior is compared under the Direct and H-CoT conditions. An LLM-as-Judge performs interpretable step-by-step scoring, with Harmfulness Rating (HR, 0–5) and Attack Success Rate (ASR) as the quantitative metrics. Across representative models, both with and without native explicit reasoning, the H-CoT condition produces a consistent safety degradation: ASR and mean HR rise significantly, and responses shift from refusals to task-oriented answers. The study confirms that introducing "execution-oriented" reasoning text capable of steering the generation objective substantially raises a model's safety risk.
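As a sketch of how the two quantitative metrics could be aggregated from judge scores: the HR cut-off that defines a "successful" attack below is an assumption for illustration, and the thesis's exact ASR criterion may differ:

```python
# Minimal sketch: turning per-prompt LLM-as-Judge scores into metrics.
# ASSUMPTION: an attack counts as successful when HR >= SUCCESS_THRESHOLD
# on the 0-5 scale; the thesis's exact ASR criterion may differ.
from statistics import mean

SUCCESS_THRESHOLD = 3  # hypothetical cut-off

def attack_success_rate(hr_scores: list[int]) -> float:
    """Fraction of prompts whose response reached the harmfulness cut-off."""
    return sum(hr >= SUCCESS_THRESHOLD for hr in hr_scores) / len(hr_scores)

if __name__ == "__main__":
    # Toy scores, not the thesis's data: one list per experimental condition.
    direct_hr = [0, 0, 1, 0, 2]
    hcot_hr = [3, 4, 2, 5, 4]
    for name, scores in [("Direct", direct_hr), ("H-CoT", hcot_hr)]:
        print(f"{name}: mean HR = {mean(scores):.2f}, ASR = {attack_success_rate(scores):.0%}")
```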
This research investigates the mechanism of Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and its implications for model safety. We first systematically review the foundations of the Transformer architecture and generative language models, covering subword-level tokenization, self-attention, positional encoding, and the next-token prediction objective under maximum likelihood estimation. We then analyze how Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) shape model behavior during alignment.
Building on this technical foundation, this study categorizes CoT as an inference-stage prompting strategy that guides models to unfold reasoning processes, effectively bifurcating responses into "reasoning segments" and "final answers." We compare few-shot and zero-shot induction methods and highlight a critical security vulnerability: when safety alignment relies on explicit or semi-explicit reasoning, the reasoning text itself becomes a manipulable interface. This research synthesizes the key research lines of H-CoT, BadChain, and SCoT, and focuses empirically on H-CoT's "Reasoning Text Hijacking."
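To illustrate the bifurcated output structure, here is a minimal sketch that separates a response into its reasoning segment and final answer; the `<think>...</think>` delimiters follow the DeepSeek-R1 output convention, and the rest is an illustrative assumption:

```python
import re

# ASSUMPTION: the model wraps its reasoning in <think>...</think>, as
# DeepSeek-R1-style models do; other models need different delimiters.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_response(text: str) -> tuple[str, str]:
    """Return (reasoning_segment, final_answer) from a raw model response."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()  # no explicit reasoning emitted
    reasoning = match.group(1).strip()
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer

if __name__ == "__main__":
    raw = "<think>The user asks for 2 + 2; this is benign.</think>The answer is 4."
    reasoning, answer = split_response(raw)
    print("reasoning:", reasoning)
    print("answer:", answer)
```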
The experimental design utilizes 30 sets of harmful prompts paired with benign counterparts. By injecting cleaned, simulated "Thought-Execution" (TE) traces generated from the benign queries into the harmful prompts, we compare model performance under the Direct and H-CoT conditions. Using LLM-as-Judge for interpretable step-by-step scoring, we employ Harmfulness Rating (HR, 0–5) and Attack Success Rate (ASR) as quantitative metrics. Results demonstrate consistent safety degradation across representative models, both with and without native explicit reasoning capabilities. The observed increase in ASR and mean HR, alongside a shift from refusal-based responses to task-oriented outputs, indicates that introducing "execution-oriented" reasoning text into prompts significantly compromises a model's safety boundaries.
Chinese Abstract ii
Abstract iv
Table of Contents vi
List of Tables ix
List of Figures x
Chapter 1: Introduction 1
Chapter 2: Principles of the Transformer Architecture 3
Section 1: Subword-Level Tokenization 3
1. Byte Pair Encoding (BPE) 4
2. WordPiece 5
Section 2: Fundamentals of Word Embeddings 6
1. Objectives 6
2. Modeling Words and Contexts 7
3. Word2Vec: CBOW and Skip-Gram 7
Section 3: Self-Attention Mechanism 9
1. Scaled Dot-Product Attention 9
2. Self-Attention Formulation and Masking 10
3. Multi-Head Attention 11
Section 4: Positional Encoding 11
1. Basic Principles 12
Section 5: Summary 14
Chapter 3: Principles of Large Language Models 15
Section 1: Basic Generation Principles of Large Language Models (LLMs) 15
1. The Next-Token Prediction Model 15
2. Log-Likelihood Training and Learning Semantic Structure 16
Section 2: Model Alignment 18
1. Supervised Fine-Tuning (SFT) 19
2. Reinforcement Learning from Human Feedback (RLHF) 20
Section 3: Fine-Tuning 22
1. Objective Functions: From Pre-Training to Fine-Tuning 22
2. Full-Parameter and Parameter-Efficient Fine-Tuning 23
Section 4: Summary 25
Chapter 4: Fundamentals of Chain-of-Thought (CoT) 26
Section 1: Operational Definition, Output Structure, and Elicitation Mechanisms of Chain-of-Thought 27
1. Few-Shot CoT: Elicitation via In-Context Exemplars 28
2. Zero-Shot CoT: Pragmatic Activation via Trigger Phrases 30
Section 2: The Actual Working Mechanism of CoT 32
Section 3: Chain-of-Thought (CoT) Fine-Tuning Methods 34
1. FLAN (Instruction Tuning) 34
2. STaR (Self-Taught Reasoner) 35
Section 4: Agentic LLMs: Extending CoT with Multi-Step Planning and Execution 36
1. An Agentic LLM Example: Modular Delegation in Decomposed Prompting 37
2. CoT-First / Answer-Later: A Generalized Two-Stage Decomposition 38
Section 5: Summary 38
Chapter 5: Chain-of-Thought and Model Safety 40
Section 1: Hijacking Chain-of-Thought (H-CoT) 40
Section 2: Backdoor Chain-of-Thought (BadChain) 43
Section 3: Chain-of-Thought-Based Safety Reasoning Mechanisms 44
1. Training Pipeline and Inference Paradigm 45
Section 4: Summary 46
Chapter 6: Experimental Validation and Extension of the H-CoT Attack 47
Section 1: The Harmfulness Rating (HR) Scoring Scheme 47
1. HR Level Definitions 48
2. LLM-as-Judge: Automated Scoring Pipeline and Interpretability 48
3. Worked Examples 49
Section 2: DeepSeek-R1 (a Reasoning Model with Explicit CoT Output) 50
Section 3: Gemma-2 9B (a Model without Native CoT Output) 52
Section 4: GPT-3.5 Turbo (a Model without Native CoT Output) 53
Chapter 7: Conclusion and Future Outlook 56
Section 1: Research Summary 56
Section 2: Research Contributions 57
Section 3: Future Research Directions 58
References 59
[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[2] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. B. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” Transactions on Machine Learning Research, 2022 (Survey Certification). [Online]. Available: https://openreview.net/forum?id=yzkSU5zdwD
[3] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, vol. 35. Curran Associates, Inc., 2022, pp. 24824–24837. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf
[4] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in Neural Information Processing Systems, vol. 35, pp. 22199–22213, 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html
[5] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,” arXiv preprint arXiv:1606.06565, 2016. [Online]. Available: https://arxiv.org/abs/1606.06565
[6] M. Kuo, J. Zhang, A. Ding, Q. Wang, L. DiValentin, Y. Bao, W. Wei, H. Li, and Y. Chen, “H-CoT: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking,” arXiv preprint arXiv:2502.12893, 2025. [Online]. Available: https://arxiv.org/abs/2502.12893
[7] Z. Xiang, F. Jiang, Z. Xiong, B. Ramasubramanian, R. Poovendran, and B. Li, “BadChain: Backdoor chain-of-thought prompting for large language models,” in NeurIPS 2023 Workshop on Backdoors in Deep Learning: The Good, the Bad, and the Ugly, 2024. [Online]. Available: https://openreview.net/forum?id=S4cYxINzjp
[8] X. Yang, G. Deng, J. Shi, T. Zhang, and J. S. Dong, “Enhancing model defense against jailbreaks with proactive safety reasoning,” arXiv preprint arXiv:2501.19180, 2025. [Online]. Available: https://arxiv.org/abs/2501.19180
[9] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017, pp. 4299–4307, Long Beach, CA, USA. [Online]. Available: https://papers.nips.cc/paper/7017-deep-reinforcement-learning-from-human-preferences.pdf
[10] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems (NeurIPS), 2022. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017. [Online]. Available: https://papers.neurips.cc/paper/7181-attention-is-all-you-need
[12] P. Gage, “A new algorithm for data compression,” The C Users Journal, vol. 12, no. 2, pp. 23–38, Feb. 1994. [Online]. Available: https://dl.acm.org/doi/10.5555/177910.177914
[13] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 1715–1725. [Online]. Available: https://aclanthology.org/P16-1162/
[14] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” CoRR, vol. abs/1609.08144, 2016. [Online]. Available: https://arxiv.org/abs/1609.08144
[15] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003. [Online]. Available: https://www.jmlr.org/papers/v3/bengio03a.html
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations (ICLR) 2013, Workshop Track Proceedings, Scottsdale, Arizona, USA, 2013. [Online]. Available: https://arxiv.org/abs/1301.3781
[17] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 06–11 Aug 2017, pp. 1243–1252. [Online]. Available: https://proceedings.mlr.press/v70/gehring17a.html
[18] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 464–468. [Online]. Available: https://aclanthology.org/N18-2074/
[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[20] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, Aug. 1988.
[21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423/
[22] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in Proceedings of the Tenth International Conference on Learning Representations (ICLR). Virtual Conference: OpenReview.net, April 2022. [Online]. Available: https://openreview.net/forum?id=gEZrGCozdqR
[23] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9
[24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI, Technical Report, 2019. [Online]. Available: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
[25] B. Prystawski, Z. Wu, S. Mendes, N. D. Goodman, and S. T. Piantadosi, “Why think step by step? reasoning emerges from the locality of experience,” in Advances in Neural Information Processing Systems, vol. 36. Curran Associates, Inc., 2023, pp. 70926–70947.
[26] T. Wu, M. Terry, and C. J. Cai, “AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts,” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, ser. CHI ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 1–22. [Online]. Available: https://dl.acm.org/doi/10.1145/3491102.3517582
[27] E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “STaR: Bootstrapping reasoning with reasoning,” in Advances in Neural Information Processing Systems 35. New Orleans, Louisiana, USA: Curran Associates, Inc., November 2022. [Online]. Available: https://papers.nips.cc/paper_files/paper/2022/hash/639a9a172c044fbb64175b5fad42e9a5-Abstract.html
[28] L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn, “Teaching small language models to reason,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Toronto, Canada: Association for Computational Linguistics, July 2023, pp. 1773–1781. [Online]. Available: https://aclanthology.org/2023.acl-short.151/
[29] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal, “Decomposed prompting: A modular approach for solving complex tasks,” in International Conference on Learning Representations (ICLR). OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=_nGgzQjzaRy