| Graduate Student: | 林大維 Lin, Da-Wei |
|---|---|
| Thesis Title: | 擴散模型之顯著圖合理性評估及語義分析 (Rationality Evaluation and Semantic Analysis of Saliency Maps in Diffusion Models) |
| Advisor: | 紀明德 Chi, Ming-Te |
| Committee Members: | 彭彥璁 Peng, Yan-Tsung; 謝東儒 Hsieh, Tung-Ju |
| Degree: | Master |
| Department: | Department of Computer Science (資訊學院 資訊科學系) |
| Publication Year: | 2025 |
| Academic Year of Graduation: | 113 (ROC calendar) |
| Language: | Chinese |
| Pages: | 39 |
| Chinese Keywords: | 擴散模型、顯著圖、文字到圖像生成模型、語義分析 |
| English Keywords: | Diffusion Models, Saliency Maps, Text-to-Image Generation Models, Semantic Analysis |
In recent years, diffusion models have made significant progress in image generation; Stable Diffusion in particular has raised text-to-image generation to a new level. However, when the model maps natural language onto generated imagery, feature entanglement can occur and undermine the rationality of the results. This study adopts the DAAM (Diffusion Attentive Attribution Map) method, analyzing the saliency maps derived from the cross-attention maps to investigate which regions the model attends to for each prompt word and how this attention affects the generated image.
We propose an automated rationality evaluation method that combines Segment Anything Model (SAM) semantic segmentation to quantify the accuracy of the saliency maps, and we compare the generalization ability of different Stable Diffusion pretrained models (v1.5, v2.1, and SDXL). In addition, through dependency parsing and feature-entanglement analysis, we investigate how linguistic prompt words influence image generation and verify the extent to which adjectives and scene descriptions affect the generated results.
Experimental results show that DAAM outperforms traditional gradient-based methods (such as Grad-CAM and Grad-CAM++) in assessing semantic relevance and reflects the correspondence between text and image more accurately. We also find that certain adjectives affect the entire scene rather than only the object they describe, indicating that Stable Diffusion still faces challenges when handling complex prompts. Future work will further refine the DAAM technique and explore more precise semantic interpretation methods to improve the explainability and generation quality of diffusion models.
Diffusion models have substantially improved image generation, with Stable Diffusion advancing text-to-image synthesis. However, feature entanglement can compromise the coherence of generated images. This study employs the Diffusion Attentive Attribution Map (DAAM) to analyze saliency maps derived from cross-attention layers, examining how the model processes prompt words and how that attention affects the generated output.
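For readers who want to reproduce the word-level saliency maps described above, the sketch below uses the publicly released daam package together with Hugging Face diffusers. The model ID, prompt, and the queried word "dog" are illustrative choices rather than the thesis's exact experimental settings, and the API may differ slightly across daam versions.

```python
import torch
from matplotlib import pyplot as plt
from diffusers import DiffusionPipeline
from daam import trace, set_seed  # DAAM: cross-attention attribution for Stable Diffusion

# Illustrative model and prompt; the thesis also evaluates SD v1.5, v2.1, and SDXL.
model_id = 'stabilityai/stable-diffusion-2-base'
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to('cuda')

prompt = 'A dog runs across the field'
gen = set_seed(0)  # fix the seed so the heat map corresponds to a reproducible image

with torch.no_grad():
    with trace(pipe) as tc:                        # hook the cross-attention layers
        out = pipe(prompt, num_inference_steps=50, generator=gen)
        global_map = tc.compute_global_heat_map()  # aggregate attention over steps and layers
        word_map = global_map.compute_word_heat_map('dog')  # saliency for one prompt word
        word_map.plot_overlay(out.images[0])       # overlay the saliency on the generated image
        plt.show()
```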
We propose an automated evaluation method that uses the Segment Anything Model (SAM) for semantic segmentation to assess saliency accuracy. Generalization is compared across Stable Diffusion versions (v1.5, v2.1, SDXL), and the influence of linguistic prompts is analyzed through dependency parsing and feature-entanglement studies.
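The thesis's exact scoring code is not reproduced here, but one natural way to quantify how well a word's saliency map covers the object it should describe is an IoU-style overlap against a SAM segmentation mask. The sketch below assumes the saliency map and the SAM mask have already been resized to the same resolution; the 0.5 binarization threshold is an illustrative assumption, not the thesis's exact setting.

```python
import numpy as np

def saliency_mask_iou(saliency: np.ndarray, seg_mask: np.ndarray,
                      threshold: float = 0.5) -> float:
    """IoU between a binarized word-level saliency map and a segmentation mask.

    saliency : 2-D float array (e.g., a DAAM word heat map) at image resolution.
    seg_mask : 2-D boolean array for the target object, e.g., produced by SAM.
    threshold: fraction of the normalized saliency range used for binarization.
    """
    s = saliency.astype(np.float32)
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)  # normalize to [0, 1]
    pred = s >= threshold                            # predicted salient region
    gt = seg_mask.astype(bool)                       # object region from the segmenter
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)
```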
Results show that DAAM outperforms gradient-based methods such as Grad-CAM in semantic relevance, and the analysis reveals that certain adjectives influence entire scenes rather than only the objects they describe. Future research will refine DAAM and improve semantic interpretation for better model explainability and generation quality.
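As a concrete illustration of the dependency-parsing step mentioned above, the snippet below uses spaCy (one possible parser; the thesis may rely on a different toolkit) to pair each adjective with the noun it modifies, which is the structure needed to check whether an adjective's saliency stays on its head noun or leaks into the whole scene. The prompt is an invented example.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

prompt = 'a red car parked next to an old wooden house'
doc = nlp(prompt)

# "amod" marks adjectival modifiers; each pair is (adjective, modified noun).
adj_noun_pairs = [(tok.text, tok.head.text) for tok in doc if tok.dep_ == 'amod']
print(adj_noun_pairs)  # expected: [('red', 'car'), ('old', 'house'), ('wooden', 'house')]
```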
Acknowledgements i
Abstract (Chinese) ii
Abstract (English) iii
Table of Contents iv
List of Figures v
List of Tables vi
Chapter 1 Introduction 1
1.1 Research Motivation and Objectives 1
1.2 Problem Description 2
1.3 Thesis Organization 4
Chapter 2 Related Work 5
2.1 Generative Models 5
2.2 Common Visualization Methods 6
2.3 Common Explainability Metrics 7
Chapter 3 Methodology and Architecture 9
3.1 Stable Diffusion as the Image Generator 9
3.2 Key Components of the Diffusion Model 11
3.3 Design of Semantic Annotations 13
3.4 Automated Rationality Evaluation Based on Annotated Segmentation 14
3.5 Semantic Intensity Computed from Annotations 16
3.6 Comparison of DAAM Results Across Different Inputs 17
Chapter 4 Analysis Results 18
4.1 Quantitative Metrics 18
4.2 Observations on Rationality and Differences in Output Patterns 21
4.3 Observations on Semantic Relevance 25
4.4 Variability Under Identical Semantics 28
4.5 Comparison with Commercially Trained Models 29
4.6 Stability Analysis 30
4.7 Limitations of DAAM 31
Chapter 5 Conclusions and Future Work 33
5.1 Conclusions 33
5.2 Future Work 33
References 35
Full text public release date: 2030/05/28