| Graduate student: | 張詠軒 Chang, Yung-Hsuan |
|---|---|
| Thesis title: | 聯邦式擴散模型版權概念遺忘系統:兼具生成品質與提示詞防禦之研究 Federated Concept Unlearning System for Diffusion Models: Balancing Generation Quality and Prompt Injection Defense |
| Advisor: | 蔡子傑 Tsai, Tzu-Chieh |
| Oral defense committee: | 吳曉光 Wu, Hsiao-Kuang; 周承復 Chou, Cheng-Fu; 蔡子傑 Tsai, Tzu-Chieh |
| Degree: | Master |
| Department: | 資訊學院 - 資訊科學系 Department of Computer Science |
| Year of publication: | 2026 |
| Academic year of graduation: | 114 |
| Language: | Chinese |
| Number of pages: | 54 |
| Chinese keywords: | 擴散模型、聯邦學習、機器遺忘、生成式AI、版權保護、提示詞注入防禦 |
| English keywords: | Diffusion Models, Federated Learning, Machine Unlearning, Generative AI, Copyright Protection, Prompt Injection Defense |
With the rapid development of Generative AI, diffusion models can now generate highly realistic images. During training, however, these models tend to overfit and memorize specific features of the training data. As a result, they readily reproduce copyrighted content (e.g., specific trademarks), which has raised legal and ethical disputes. Existing defenses focus mainly on application-layer prompt filtering or negative-prompt interception. Such surface-level measures are easily bypassed by prompt injection attacks and do not remove the infringement risk from the model weights themselves.
To address this issue, this study proposes a "Federated Concept Unlearning Framework." By combining Federated Learning with Machine Unlearning, the framework lets each client train locally with a "Deterministic Zero-Target" loss function via Low-Rank Adaptation (LoRA), without exposing the original high-resolution copyrighted images. The server then aggregates the client weights with the FedAvg algorithm, severing the model's learned association with a specific trademark (e.g., Starbucks) in its latent space.
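The two training-side ingredients named above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the abstract does not give the exact form of the zero-target loss, so `zero_target_loss` assumes a mean squared error between the predicted noise and an all-zero target, and `fedavg` assumes the standard sample-weighted average over client parameters. The parameter name `lora_A` and the flat-list weight layout are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: each client reports its LoRA parameters as
# {parameter_name: flat list of floats}.
Weights = Dict[str, List[float]]


def zero_target_loss(predicted_noise: Sequence[float]) -> float:
    """Assumed form of a zero-target loss: MSE against an all-zero target.

    A standard diffusion objective compares the predicted noise with the
    sampled noise; here the target is replaced by zeros (an assumption).
    """
    return sum(p * p for p in predicted_noise) / len(predicted_noise)


def fedavg(client_weights: Sequence[Weights], num_samples: Sequence[int]) -> Weights:
    """Sample-weighted average of client parameters (FedAvg aggregation)."""
    total = sum(num_samples)
    aggregated: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_samples)) / total
            for i in range(length)
        ]
    return aggregated


if __name__ == "__main__":
    clients = [{"lora_A": [1.0, 2.0]}, {"lora_A": [3.0, 4.0]}]
    print(fedavg(clients, num_samples=[1, 3]))
    print(zero_target_loss([1.0, -1.0, 0.0, 2.0]))
```

One design point worth keeping in mind: averaging the LoRA factors separately is not in general the same as averaging their product, so a real system has to decide which quantity the server aggregates.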
Experimental results show that the proposed deterministic zero-target strategy converges stably in a federated setting. To support an objective quantitative evaluation, this study constructed a test set of 90 prompts covering multiple semantic variants. The average CLIP Score (feature-space similarity) for the target trademark dropped by approximately 41% (to 0.1976), indicating that the infringing association was severed. Meanwhile, the sanity-check score for unrelated concepts remained stable at 0.2926, indicating that catastrophic forgetting did not occur.
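The arithmetic behind these figures can be made concrete with a short sketch. It assumes that the CLIP Score here is the cosine similarity between an image embedding and a text embedding, which matches the reported range but is not confirmed by the abstract. All function names are illustrative. The abstract does not state the pre-unlearning score, so `implied_baseline` only back-calculates a rough value from the two reported numbers.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


def implied_baseline(after: float, drop: float) -> float:
    """Baseline score implied by a final score and a fractional drop."""
    return after / (1.0 - drop)


if __name__ == "__main__":
    # Reported: final score 0.1976 after a drop of roughly 41%.
    baseline = implied_baseline(0.1976, 0.41)
    print(round(baseline, 4))
    print(round(relative_drop(baseline, 0.1976), 2))
```

Because the 41% figure is itself rounded, the back-calculated baseline (about 0.335) is only approximate and should not be read as a reported result.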
In the qualitative visual analysis, the system resolved the concept collapse problem common in machine unlearning by applying dynamic weight scaling at inference time (λ = 0.4). The Specificity Check verifies that the system not only removed the infringing features precisely but also preserved the model's high-quality generation of unrelated concepts (e.g., cats) and related-domain concepts (e.g., coffee cups). Furthermore, the framework broke the pre-trained model's "Style Overfitting," giving it greater creative freedom. It also defended against malicious prompt injection attacks under the strict condition of not relying on negative prompts. This study verifies the feasibility and security of weight-level federated unlearning, providing an effective technical path for Generative AI that balances legal compliance and data privacy.
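One common way to realize inference-time weight scaling with LoRA is to multiply the low-rank update by a factor λ before adding it to the frozen base weight, i.e. W_eff = W_0 + λ·(B·A). The sketch below illustrates that reading in plain Python. Whether the thesis applies λ in exactly this way is an assumption, the function names are hypothetical, and the usual LoRA `alpha / r` factor is omitted for brevity.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_scaled_lora(base: Matrix, lora_b: Matrix, lora_a: Matrix, lam: float) -> Matrix:
    """Return base + lam * (lora_b @ lora_a); lam = 0 leaves the base unchanged."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + lam * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]
    lora_b = [[1.0], [2.0]]   # rank-1 factors, for illustration only
    lora_a = [[3.0, 4.0]]
    print(apply_scaled_lora(base, lora_b, lora_a, lam=0.4))
```

Under this reading, λ acts as a dial between the original model (λ = 0) and the fully unlearned adapter (λ = 1), which is consistent with the abstract's use of an intermediate value to avoid concept collapse.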
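Acknowledgements
Abstract (Chinese)
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Research Background
1.1.1 The Rise of Generative AI
1.1.2 Copyright Crisis and Legal Disputes
1.1.3 Limitations of Existing Protection Mechanisms
1.2 Research Motivation
1.3 Research Objectives
1.4 Expected Contributions
2 Related Work
2.1 Diffusion Models and Low-Rank Adaptation (LoRA)
2.1.1 Principles of Diffusion Models
2.1.2 Low-Rank Adaptation (LoRA)
2.2 Machine Unlearning and Existing Protection Techniques
2.2.1 Development of Machine Unlearning
2.2.2 Prompt Intervention Techniques and Their Limitations
2.3 Federated Learning and Feature Fusion
2.3.1 Principles of the FedAvg Algorithm
2.3.2 Feature Fusion and Privacy Protection
2.4 Generative Model Fine-Tuning and Negative Training
3 Method and Design
3.1 Overview of System Operation Phases
3.2 Risk Assessment: Prompt-Engineering-Based Risk Assessment
3.3 Federated Training: Distributed Framework and Aggregation Mechanism
3.4 Deterministic Zero-Target Unlearning Algorithm
3.4.1 Objective Function of Standard Diffusion Models
3.4.2 Zero-Target Loss Design
3.5 Inference Control: Dynamic Weight Scaling Strategy
3.5.1 FedAvg Parameter Aggregation Mechanism
3.5.2 Inference Weight Scaling
3.6 Evaluation Metrics
3.7 Summary
4 Experimental Setup and Results
4.1 Experimental Environment and Dataset Settings (Experimental Setup)
4.1.1 Hardware and Software Configuration
4.1.2 Hyperparameter Settings of the Federated Unlearning Model
4.1.3 Federated Dataset Partitioning and Non-IID Settings
4.2 Experiment 1: Stability Analysis of the Federated Unlearning Algorithm
4.2.1 Experimental Design
4.2.2 Results and Discussion
4.3 Experiment 2: Target Unlearning Effect
4.3.1 Experimental Design
4.3.2 Results and Discussion
4.4 Experiment 3: Sanity Check
4.4.1 Experimental Design
4.4.2 Results and Discussion
4.5 Experiment 4: Specificity Check
4.5.1 Experimental Design
4.5.2 Results and Discussion
4.6 Experiment 5: Quantitative Evaluation
4.6.1 Robustness of Target Unlearning
4.6.2 Sanity Check of Generation Capability
4.6.3 Specificity Evaluation of Related-Domain Concepts
4.7 Experiment 6: Analysis of Inference-Time Dynamic Weight Scaling
4.7.1 Experimental Design
4.7.2 Results and Discussion
4.8 Experiment 7: System Security Against Prompt Injection Attacks
4.8.1 Experimental Design
4.8.2 Results and Discussion
4.9 Summary
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References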