

Graduate Student: Wang, Zih-Yun (王子云)
Thesis Title: Leveraging LLM, RAG, and Prompt Engineering for Risk Identification in Sustainability Reports (運用 LLM、RAG 與提示工程於永續報告書中的風險識別)
Advisor: Lin, Yi-Ling (林怡伶)
Committee Members: Bei, Lien-Ti (別蓮蒂); Wei, Chih-Ping (魏志平)
Degree: Master
Department: College of Commerce, Department of Management Information Systems
Year of Publication: 2025
Academic Year of Graduation: 113 (2024-25)
Language: English
Pages: 64
Keywords: Corporate Social Responsibility (CSR), Environmental, Social, and Corporate Governance (ESG), Sustainability report, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), Prompt engineering, Chain-of-Thought (CoT), Contextual risk detection


    As the concepts of CSR and ESG receive growing attention, stakeholders increasingly rely on corporate sustainability reports to gain transparent insight into a company’s sustainability practices and its risk identification and management approaches. However, the complexity and diversity of corporate risks make them difficult for stakeholders to analyze comprehensively. Automatically extracting both explicit and implicit risks from lengthy, unstandardized texts is particularly challenging, as traditional keyword-based methods struggle with diverse wording and nuanced contexts. In collaboration with the Sinyi School at National Chengchi University’s College of Commerce, we propose an end-to-end Retrieval-Augmented Generation (RAG) pipeline for automated risk detection in Chinese sustainability reports and evaluate it on 30 reports published in Taiwan in 2024, spanning five industries. We compare four prompting strategies, zero-shot, zero-shot chain-of-thought (CoT), few-shot, and few-shot CoT, and employ an ensemble approach that achieves a median per-risk F1 score of 0.90 while remaining time- and cost-efficient. An error analysis of CoT outputs identifies four common failure types. In addition, we release domain-adapted prompt templates to support future risk detection research on Chinese sustainability reports. Our results demonstrate that combining Large Language Models (LLMs) with RAG and prompt engineering can reliably automate risk-disclosure analysis, enhancing transparency and stakeholder trust.
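    The ensemble evaluation described above can be illustrated in a few lines. This is a minimal sketch, not the thesis's actual implementation: it assumes each of the four prompting strategies yields a binary detected/not-detected prediction per risk category, that the ensemble uses a simple majority-vote rule (the abstract does not specify the rule), and that performance is summarized as the median of per-risk F1 scores. All function names are hypothetical.

    ```python
    # Sketch of a majority-vote ensemble over four prompting strategies and
    # a median per-risk F1 summary, as described in the abstract.
    from statistics import median

    STRATEGIES = ["zero_shot", "zero_shot_cot", "few_shot", "few_shot_cot"]

    def majority_vote(preds_by_strategy):
        """Assumed ensemble rule: a risk counts as detected when at least
        half of the strategies flag it (ties count as detected)."""
        return 1 if 2 * sum(preds_by_strategy) >= len(preds_by_strategy) else 0

    def f1(gold, pred):
        """Binary F1 over parallel lists of 0/1 labels (one entry per report)."""
        tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
        fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
        fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def median_per_risk_f1(gold_by_risk, preds_by_risk):
        """Compute F1 separately for each risk category, then take the median.
        Both arguments map risk name -> list of 0/1 labels, one per report."""
        return median(f1(gold_by_risk[r], preds_by_risk[r]) for r in gold_by_risk)
    ```

    Reporting the median rather than the mean of per-risk F1 scores makes the headline number robust to a few poorly detected risk categories, which matters when performance varies across the risk taxonomy (as the per-risk-category results in Chapter 4 examine).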

    Acknowledgements i
    Abstract (Chinese) ii
    Abstract iii
    Table of Contents iv
    List of Figures vii
    List of Tables viii
    1 Introduction 1
    1.1 Research Background 1
    1.2 Research Objective 2
    2 Related Work 5
    2.1 Text Classification with Generative LLMs in Specific Domains 5
    2.2 Approaches to Sustainability Report Analysis 5
    2.3 Language Models Overview 6
    2.4 Large Language Models (LLMs) 7
    2.4.1 Retrieval-Augmented Generation (RAG) 8
    2.4.2 Prompt Engineering 8
    3 Methodology 10
    3.1 Risk Taxonomy and Disclosure Types 10
    3.1.1 Categories and Development of Risk Definitions 10
    3.1.2 Disclosure Types 11
    3.2 Data Collection 12
    3.2.1 Sample Selection 12
    3.2.2 Manual Annotation Process 15
    3.3 Research Framework 16
    3.4 Data Preprocessing 17
    3.5 RAG 18
    3.5.1 Framework and Model Selection 18
    3.5.2 Parameter Settings 19
    3.5.3 Prompt Engineering Techniques 21
    3.5.4 Prompt Design and Output Schema 21
    3.6 Evaluation Metrics 25
    4 Experiments 27
    4.1 Pilot Study 27
    4.1.1 Retrieval Threshold Sensitivity Analysis 27
    4.1.2 Experiment Prompt Selection 28
    4.2 Results 28
    4.2.1 Performance by Overall Prompt Strategy 28
    4.2.2 Performance by Industry and Prompt Strategy 29
    4.2.3 Performance by Risk Category and Prompt Strategy 31
    4.2.4 Ensemble Performance by Risk Category 32
    4.3 Analysis of Reasoning and Disclosure Decisions 36
    4.3.1 Evaluation of FP’s Chain-of-Thought Reasoning 37
    4.3.2 Validation of Disclosure Type Decisions 41
    5 Discussion and Conclusion 42
    5.1 Discussion 42
    5.1.1 RQ1: Comparative Performance of Prompting Strategies 42
    5.1.2 RQ2: Benefits of Few-Shot Exemplars and CoT 43
    5.1.3 RQ3: RAG Pipeline Design Considerations 43
    5.1.4 Ensemble Performance Analysis 44
    5.1.5 Insights from Supplementary Analysis 45
    5.2 Conclusion 46
    5.3 Limitations and Future Work 46
    References 48
    Appendix A 52
    Appendix B 61


    Full-text release date: 2030/07/29