| Graduate Student: | 陳卉縈 Chen, Hui-Ying |
|---|---|
| Thesis Title: | 結合規則式評分與分群方法之大型語言模型語意風險與合規性評估 (Semantic Risk and Compliance Evaluation on LLM Responses Using Rule-Based Scoring and Clustering) |
| Advisor: | 郁方 Yu, Fang |
| Committee Members: | 江介宏 Jiang, Jie-Hong; 洪智鐸 Hong, Chih-Duo; 陳琬萍 Chen, Wan-Ping |
| Degree: | Master |
| Department: | College of Commerce, Department of Management Information Systems |
| Year of Publication: | 2025 |
| Academic Year of Graduation: | 113 |
| Language: | English |
| Pages: | 51 |
| Keywords: | Large Language Models, PyRIT, GHSOM, Ethical compliance, Safety evaluation, Adversarial prompts, Jailbreaking |
As Large Language Models (LLMs) see wide application in natural language processing, strengthening their ethical safeguards and their resistance to malicious prompt attacks has become an important research problem. This study proposes an interpretable dual-layer evaluation framework that combines PyRIT rule-based risk scoring with GHSOM semantic clustering to systematically examine model safety from two perspectives: compliance and tonal risk. Within this framework, model responses are classified by risk level and linguistic style into four behavior types: explicitly violating (Vulgar), tonally offensive (Blunt), potentially misleading (Deceptive), and compliant (Eloquent). Through semantic clustering and feature-selection analysis, the method also identifies cluster-level risk features and helps detect the false positives that commonly arise in rule-based scoring. The experiments cover 10 scenarios and 12 jailbreak scripts, for a total of 2,925 model responses analyzed. The results show that Gemini produced the largest number of violating responses (119), followed by Perplexity (70) and DeepSeek (59), while Claude and ChatGPT exhibited higher overall ethical consistency. To further verify whether risky behavior transfers across models, 170 of the high-risk prompts were re-tested on API-based models and locally quantized models. The results indicate that API models remain susceptible to adversarial prompts, whereas quantized models, with their weaker comprehension, yield comparatively lower attack success rates. Overall, the proposed integrated dual-layer evaluation method effectively compensates for the limitations of traditional rule-based metrics, deepens the interpretability of language-model risk analysis, and provides an empirical foundation and practical potential for future LLM safety evaluation and adversarial testing.
Large Language Models (LLMs) have advanced natural language processing (NLP) applications but remain vulnerable to ethical misalignment and adversarial prompts. This study proposes a dual-layer evaluation framework that integrates rule-based scoring using the Python Risk Identification Tool (PyRIT) with clustering via the Growing Hierarchical Self-Organizing Map (GHSOM). LLM outputs are categorized into Vulgar, Blunt, Deceptive, and Eloquent behaviors based on compliance and semantic risk. The framework also enables cluster-level feature identification and false-positive detection. Across 2,925 responses spanning 10 scenarios and 12 jailbreak scripts, Gemini generated the highest number of Vulgar outputs (119), followed by Perplexity (70) and DeepSeek (59), while Claude and ChatGPT were more ethically aligned. Testing 170 high-risk prompts on API-based versus quantized local models revealed that API models remain susceptible to adversarial inputs, whereas quantized models exhibited lower attack success rates, likely due to reduced comprehension rather than stronger alignment safeguards. These findings underscore the value of layered evaluation frameworks for improving the safety and interpretability of LLMs.
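The four behavior types can be read as a quadrant over the two evaluation layers. As an illustration only, the Python sketch below shows one plausible way to combine a PyRIT-style rule flag with a GHSOM cluster-level risk score into the four labels; the threshold, field names, and axis-to-label mapping are assumptions inferred from the abstract, not the thesis's actual implementation.

```python
# Illustrative sketch of the dual-layer quadrant labeling described in the
# abstract. The threshold, field names, and axis-to-label mapping are
# assumptions, not the thesis's actual code.
from dataclasses import dataclass

@dataclass
class ScoredResponse:
    text: str
    rule_violation: bool   # layer 1: PyRIT-style rule-based compliance flag
    semantic_risk: float   # layer 2: GHSOM cluster-level risk score in [0, 1]

RISK_THRESHOLD = 0.5       # assumed cutoff between low- and high-risk tone

def quadrant_label(r: ScoredResponse) -> str:
    """Map the two evaluation layers onto the four behavior types."""
    if r.rule_violation:
        # Rule-violating content: overtly risky tone reads as Vulgar;
        # a benign-sounding delivery reads as potentially misleading.
        return "Vulgar" if r.semantic_risk >= RISK_THRESHOLD else "Deceptive"
    # Rule-compliant content: offensive tone reads as Blunt, else Eloquent.
    return "Blunt" if r.semantic_risk >= RISK_THRESHOLD else "Eloquent"

print(quadrant_label(ScoredResponse("example output", False, 0.1)))  # Eloquent
```

Read this way, the framework's false-positive detection also has a natural home: responses flagged by the rules yet falling into otherwise low-risk clusters are exactly the candidates the clustering layer would surface for re-inspection.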
Abstract (Chinese) i
Abstract ii
Contents iii
List of Figures vi
List of Tables viii
1 Introduction 1
2 Related Work 4
2.1 Knowledge and Capability Evaluation 4
2.2 Alignment Evaluation 5
2.3 Safety Evaluation 6
2.4 Limitations of Existing Approaches 7
3 Methodology 9
3.1 Prompt Generation 10
3.1.1 Contextual Prompts 10
3.1.2 Jailbreak Prompts 11
3.1.3 External Prompt Baseline from AdvBench 13
3.2 Response Collection 14
3.3 Semantic Embedding Conversion 14
3.4 Scoring and Classification with PyRIT 15
3.4.1 Binary Compliance Classification 16
3.4.2 Likert-Scale Compliance Scoring 16
3.4.3 Categorical Compliance Assessment 16
3.4.4 Objective Success Evaluation 17
3.5 Clustering Analysis with GHSOM 18
3.5.1 False Positive Detection 18
3.5.2 Feature Identification 19
3.6 Integration of PyRIT and GHSOM 20
4 Evaluation 22
4.1 PyRIT Scoring Analysis 22
4.1.1 Binary Compliance Classification 22
4.1.2 Likert-Scale Compliance Scoring 23
4.1.3 Categorical Compliance Assessment 23
4.1.4 Objective Success Evaluation 24
4.2 GHSOM Clustering Analysis 26
4.2.1 False Positive Detection 26
4.2.2 Feature Identification 28
4.3 Semantic Risk Quadrant Analysis 31
4.3.1 Vulgar Responses 32
4.3.2 Blunt Responses 33
4.3.3 Deceptive Responses 34
4.3.4 Eloquent Responses 34
4.4 Backtracking Analysis of Adversarial Responses 35
4.5 Transferability Evaluation Across Advanced and Quantized Models 38
4.6 Comparison with AdvBench Prompts 40
5 Conclusion 42
References 44
Appendix 48
A Representative Examples 48
A.1 Vulgar Response 48
A.2 Blunt Response 49
A.3 Deceptive Response 50
A.4 Eloquent Response 51
DeepSeek AI. (2024a). DeepSeek-R1-Distill-Llama-8B [Accessed: 2025-05].
Meta AI. (2024b). Meta-Llama-3.1-8B-Instruct [Accessed: 2025-05].
Anthropic. (2023). Claude [Model version: Claude 3.5 Haiku]. https://www.anthropic.com/claude
DeepSeek-AI, Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., et al. (2024). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. https://arxiv.org/abs/2405.04434
Deng, G., Liu, Y., Li, Y., Wang, K., Zhang, Y., Li, Z., Wang, H., Zhang, T., & Liu, Y. (2024). MasterKey: Automated jailbreaking of large language model chatbots. Proceedings 2024 Network and Distributed System Security Symposium. https://doi.org/10.14722/ndss.2024.24188
Dittenbach, M., Merkl, D., & Rauber, A. (2001). Hierarchical clustering of document archives with the growing hierarchical self-organizing map. Proceedings of the International Conference on Artificial Neural Networks (ICANN), 486–491. https://doi.org/10.1007/3-540-44668-0_70
Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In T. Cohn, Y. He, & Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3356–3369). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.301
Google. (2024). Gemini [Model version: Gemini 2.0 Flash-Lite]. https://gemini.google.com/app
Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Supryadi, Yu, L., Liu, Y., Li, J., Xiong, B., & Xiong, D. (2023). Evaluating large language models: A comprehensive survey. https://arxiv.org/abs/2310.19736
Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., & Steinhardt, J. (2021). Aligning AI with shared human values. International Conference on Learning Representations. https://openreview.net/forum?id=dNy_RKzJacY
Huang, Y., Zhang, Q., Yu, P. S., & Sun, L. (2023). TrustGPT: A benchmark for trustworthy and responsible large language models. https://arxiv.org/abs/2306.11507
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464– 1480. https://doi.org/10.1109/5.58325
Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., & Petrov, S. (2019). Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453–466. https://doi.org/10.1162/tacl_a_00276
Lees, A., Tran, V. Q., Tay, Y., Sorensen, J., Gupta, J., Metzler, D., & Vasserman, L. (2022). A new generation of Perspective API: Efficient multilingual character-level transformers. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3197–3207. https://doi.org/10.1145/3534678.3539147
Liu, Y., Deng, G., Xu, Z., Li, Y., Zheng, Y., Zhang, Y., Zhao, L., Zhang, T., Wang, K., & Liu, Y. (2024). Jailbreaking ChatGPT via prompt engineering: An empirical study.
Munoz, G. D. L., Minnich, A. J., Lutz, R., Lundeen, R., Dheekonda, R. S. R., Chikanov, N., Jagdagdorj, B.-E., Pouliot, M., Chawla, S., Maxwell, W., Bullwinkel, B., Pratt, K., de Gruyter, J., Siska, C., Bryan, P., Westerhoff, T., Kawaguchi, C., Seifert, C., Kumar, R. S. S., & Zunger, Y. (2024). PyRIT: A framework for security risk identification and red teaming in generative AI system. https://arxiv.org/abs/2410.02828
Nangia, N., Vania, C., Bhalerao, R., & Bowman, S. R. (2020). CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1953–1967). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.154
OpenAI. (2023). ChatGPT [Model version: GPT-4o mini]. https://openai.com/chatgpt
Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large language model connected with massive apis. https://arxiv.org/abs/2305.15334
Perplexity. (2023). Perplexity AI [Model version: Sonar]. https://www.perplexity.ai
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. https://arxiv.org/abs/1606.05250
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. https://arxiv.org/abs/1908.10084
Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender bias in coreference resolution. https://arxiv.org/abs/1804.09301
Su, J., Kempe, J., & Ullrich, K. (2024). Mission impossible: A statistical perspective on jailbreaking LLMs. https://arxiv.org/abs/2408.01420
Talmor, A., Herzig, J., Lourie, N., & Berant, J. (2019). CommonsenseQA: A question answering challenge targeting commonsense knowledge. https://arxiv.org/abs/1811.00937
Tang, H., Li, H., Liu, J., Hong, Y., Wu, H., & Wang, H. (2021). DuReader_robust: A Chinese dataset towards evaluating robustness and generalization of machine reading comprehension in real-world applications. https://arxiv.org/abs/2004.11142
Wen, S.-J., Chang, J.-M., & Yu, F. (2024). scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM. https://arxiv.org/abs/2407.16984
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. https://arxiv.org/abs/1809.09600
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K.-W. (2018). Gender bias in coreference resolution: Evaluation and debiasing methods. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 15–20). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2003
Zhao, Y., Zhao, C., Nan, L., Qi, Z., Zhang, W., Tang, X., Mi, B., & Radev, D. (2023). RobuT: A systematic study of table QA robustness against human-annotated adversarial perturbations. In A. Rogers, J. Boyd-Graber, & N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 6064–6081). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.acl-long.334
Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Zhang, Y., Gong, N. Z., & Xie, X. (2024). PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts. https://arxiv.org/abs/2306.04528
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. https://arxiv.org/abs/2307.15043
Full-Text Release Date: 2030/07/30