| Graduate Student | 王昱翔 (Wang, Yu-Hsiang) |
|---|---|
| Thesis Title | 以深度學習融合影像特徵及詮釋資料的書法字體檢索 (Deep Learning-Based Integration of Image Features and Metadata for Calligraphy Font Retrieval) |
| Advisor | 羅崇銘 (Lo, Chung-Ming) |
| Committee Members | 陸行 (Luh, Hsing); 林于翔 (Lin, Yu-Shiang) |
| Degree | Master (碩士) |
| Department | College of Liberal Arts, Graduate Institute of Library, Information and Archival Studies (圖書資訊與檔案學研究所) |
| Publication Year | 2026 |
| Graduation Academic Year | 114 (ROC calendar) |
| Language | Chinese |
| Pages | 56 |
| Chinese Keywords | 書法影像檢索、注意力機制、多模態特徵融合、視覺語言模型、ConvNeXt、BLIP-2 |
| English Keywords | Calligraphy image retrieval, attention mechanism, multi-modal feature fusion, visual language model, ConvNeXt, BLIP-2 |
書法是中華文化傳承中不可或缺的載體,藉由書法影像中的筆墨線條與結構資訊,能夠進行更深入的藝術鑑賞、風格分析、真偽辨識及歷史考據;透過影像檢索,則可從資料庫中選取與查詢目標風格相符的影像,應用於藝術教育與數位人文研究。然而,書法藝術具有高度的抽象性與風格多樣性,傳統基於內容的影像檢索技術常因無法有效區分視覺特徵相近的書體而面臨瓶頸。為突破此限制,本研究提出一套結合深度學習視覺骨幹網路與視覺語言模型的多模態書法影像檢索系統。研究資料集節選自聯合百科歷代書法碑帖集成資料庫,共計15,000張影像,涵蓋篆、隸、草、行、楷五大書體及其對應之詮釋資料。
本研究首先對比DenseNet、ConvNeXt、ViT及Swin Transformer四種骨幹網路,並透過五折交叉驗證(5-fold cross validation)進行微調,實驗結果顯示,ConvNeXt V2因具備卷積神經網路捕捉局部筆觸紋理的優勢,同時透過大核心設計兼顧全局結構,在純視覺檢索中取得最佳的平均精準度(mAP)0.84。為進一步深化語意理解,研究中引入視覺語言模型BLIP-2進行多模態嵌入融合,並設計比較融合嵌入向量(fused feature vector)、相似度加權(weighted similarity)與階層相似度加權(hierarchical weighted similarity)三種策略,以強化影像視覺特徵與文字語意之間的關聯性。
實驗結果證實,當ConvNeXt V2結合BLIP-2並應用階層相似度加權策略時,系統mAP達到0.87,較純視覺方法提升約3.57%,顯示引入語意特徵能有效解決視覺模糊性問題。在效率方面,階層式篩選將平均檢索時間控制在1.85秒,優於全量語意運算的1.96秒,成功在檢索精確度與運算效率之間取得最佳平衡。本研究成果不僅驗證了視覺語言模型在書法領域的應用潛力,亦為數位典藏檢索系統提供了具備高擴充性的架構。
Calligraphy serves as an indispensable medium in the transmission of Chinese culture. The brushstroke lines and structural information in calligraphy images support deeper art appreciation, stylistic analysis, authenticity identification, and historical research, and image retrieval makes it possible to select images from a database whose style matches that of a query, supporting art education and digital humanities research. However, because calligraphy is highly abstract and stylistically diverse, traditional content-based image retrieval techniques often hit a bottleneck: they cannot effectively distinguish scripts with similar visual features. To overcome this limitation, this study proposes a multi-modal calligraphy image retrieval system that integrates deep learning visual backbone networks with a visual language model. The research dataset is drawn from the Union Encyclopedia Collection of Past Dynasty Calligraphy Rubbings Database and comprises 15,000 images covering the five major script types (Seal, Clerical, Cursive, Running, and Regular) together with their corresponding metadata.
This study first compared four backbone networks (DenseNet, ConvNeXt, ViT, and Swin Transformer), fine-tuned with 5-fold cross-validation. Experimental results showed that ConvNeXt V2 achieved the highest mean Average Precision (mAP) of 0.84 in pure visual retrieval, attributable to the ability of convolutional networks to capture local brushstroke textures while its large-kernel design preserves global structure. To deepen semantic understanding, the study then incorporated the BLIP-2 visual language model for multi-modal embedding fusion. Three strategies (a fused feature vector, weighted similarity, and hierarchical weighted similarity) were designed and compared to reinforce the correlation between visual features and textual semantics.
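The two non-hierarchical fusion strategies named above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the cosine-similarity measure, the normalize-and-concatenate fusion, and the weight `alpha` are all assumptions introduced here for clarity.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_similarity(q_img, d_img, q_txt, d_txt, alpha=0.7):
    # Weighted-similarity strategy: blend a visual score and a semantic score.
    # alpha is an illustrative weight, not a value reported by the thesis.
    return alpha * cosine(q_img, d_img) + (1 - alpha) * cosine(q_txt, d_txt)

def fused_vector(v_img, v_txt):
    # Fused-feature-vector strategy: L2-normalize each modality, then
    # concatenate, so neither modality's scale dominates the joint embedding.
    v_img = v_img / np.linalg.norm(v_img)
    v_txt = v_txt / np.linalg.norm(v_txt)
    return np.concatenate([v_img, v_txt])
```

In the study the image embeddings would come from the fine-tuned ConvNeXt V2 backbone and the semantic embeddings from BLIP-2; here they are plain NumPy vectors.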
Experimental results confirmed that when ConvNeXt V2 was combined with BLIP-2 under the hierarchical weighted similarity strategy, the system attained an mAP of 0.87, a relative improvement of approximately 3.57% over the pure visual method, demonstrating that semantic features effectively mitigate visual ambiguity. Regarding efficiency, hierarchical screening kept the average retrieval time at 1.85 seconds, below the 1.96 seconds required for full-scale semantic computation, striking a sound balance between retrieval accuracy and computational efficiency. These findings not only validate the potential of visual language models in the calligraphy domain but also provide a highly scalable architecture for digital archive retrieval systems.
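The hierarchical weighted similarity strategy, and why it is cheaper than full-scale semantic computation, can be sketched as a two-stage pipeline: a visual shortlist over the whole database, then semantic re-scoring of the shortlist only. The parameters `k`, `alpha`, and `top` are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def _cos_matrix(q, D):
    # Cosine similarity of one query vector against every row of matrix D.
    qn = q / np.linalg.norm(q)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    return Dn @ qn

def hierarchical_retrieve(q_img, q_txt, db_img, db_txt, k=100, alpha=0.7, top=10):
    # Stage 1: rank the full database by visual similarity alone and keep
    # the k most visually similar candidates.
    vis = _cos_matrix(q_img, db_img)
    shortlist = np.argsort(-vis)[:k]
    # Stage 2: compute semantic similarity only for the shortlist, then
    # re-rank it with a weighted visual+semantic score.
    sem = _cos_matrix(q_txt, db_txt[shortlist])
    fused = alpha * vis[shortlist] + (1 - alpha) * sem
    order = shortlist[np.argsort(-fused)]
    return order[:top]
```

Because the semantic pass touches only k candidates rather than the whole collection, the average query cost drops, which is consistent with the reported 1.85 s versus 1.96 s retrieval times.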
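The evaluation metric used throughout, mean Average Precision, follows the standard textbook formulation sketched below; the reported 0.84 to 0.87 change corresponds to a relative gain of (0.87 - 0.84) / 0.84, roughly 3.57%.

```python
def average_precision(ranked_relevance):
    # ranked_relevance: 1/0 relevance flags of retrieved items, best-ranked first.
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at each relevant hit
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_query_relevance):
    # mAP: the mean of per-query average precisions.
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)
```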
Acknowledgements
Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
Chapter 1: Introduction
  Section 1: The Evolution of Calligraphy
  Section 2: Script Retrieval
Chapter 2: Literature Review
Chapter 3: Research Materials and Methods
  Section 1: Calligraphy Database
  Section 2: Image Retrieval with Cross-Modal Ranking for Content Matching
    1. Backbone Networks
    2. Zero-Shot Vision-Language Models
    3. Cross-Modal Embedding Integration
  Section 3: Performance Evaluation
    1. Similarity Matching
    2. Accuracy
    3. Mean Average Precision
Chapter 4: Experimental Results
  Section 1: Classification Performance Analysis
  Section 2: Retrieval Performance Analysis
    1. Retrieval Performance of Backbone Networks
    2. Retrieval Performance of Cross-Modal Embedding Integration Strategies
Chapter 5: Conclusion and Discussion
Chapter 6: Future Directions
References
Ling, T. (2020). 光學字元辨識古籍之全文轉置經驗: 以明人文集為例. 圖資與檔案學刊(7), 76-117.
刘琳. (2023). 段玉裁《说文解字注》“古今字”研究. 社会科学文献出版社.
朱雷刚. (2015). 浅谈草书艺术的国际影响. 中国书法(24), 124-127.
吴立军. (2016). 行书漫议. 中国建材(10), 144-144.
李应青. (2017). 从编辑学视角考察一段草书史资料以讹传讹的引用. 出版科学, 25(6), 42.
李明龙. (2017). 从出土文献看魏晋南北朝楷书的特征. 语文知识(1), 95-96.
杨一鸣, & 花蕾. (2023). 简多玛《字典标目》的考察. 東アジア文化交渉研究 = Journal of East Asian cultural interaction studies, 16, 71-87.
沈裕昌. (2022). 人書之際, 本末之辨: 趙壹《非草書》論技藝與人的限度. Humanitas Taiwanica(97).
辛尘. (2016). 隶变(一). 江苏教育(37), 19-21.
侯开嘉. (2002). 隶草派生章草今草说. 书法研究(4), 73-102.
苏杰. (2012). 唐代“隶书”称谓论. 艺术百家(1), 233-234.
連蔚勤. (2009). 泰山、瑯琊臺刻石與《說文》篆形探析. 有鳳初鳴年刊(5), 231-248. https://doi.org/10.29458/agsclsu.200910.0015
程同根. (2002). 隶书导学: 隶书用笔间架一百法. 華夏出版社.
韩孟伟. (2015). 章草与隶草之源承简析. 青少年书法: 少年版(10), 26-27.
Barnhart, R. (1972). Chinese Calligraphy: The Inner World of the Brush. The Metropolitan Museum of Art Bulletin, 30(5), 230-241. https://doi.org/10.2307/3258680
Barz, B., & Denzler, J. (2021). Content-based image retrieval and the semantic gap in the deep learning era. Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part II.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., & Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Duthie, T. (2014). Man’yōshū and the imperial imagination in early Japan (Vol. 45). Brill.
Eberhard, D. M., & Simons, G. (2022). Ethnologue: languages of the Americas and the Pacific.
Guo, J., Wang, M., Zhou, Y., Song, B., Chi, Y., Fan, W., & Chang, J. (2023). HGAN: Hierarchical graph alignment network for image-text retrieval. IEEE Transactions on Multimedia, 25, 9189-9202.
Guo, M.-H., Lu, C.-Z., Liu, Z.-N., Cheng, M.-M., & Hu, S.-M. (2023). Visual attention network. Computational visual media, 9(4), 733-752.
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., & Wang, Y. (2021). Transformer in transformer. Advances in neural information processing systems, 34, 15908-15919.
Henderson, P., & Ferrari, V. (2017). End-to-end training of object class detectors for mean average precision. Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part V 13.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition.
Huang, J.-d., Cheng, G., Zhang, J., & Miao, W. (2023). Recognition method for stone carved calligraphy characters based on a convolutional neural network. Neural Computing and Applications, 35(12), 8723-8732.
Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. International conference on machine learning.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Li, H., Rajbahadur, G. K., Lin, D., Bezemer, C.-P., & Jiang, Z. M. (2024). Keeping deep learning models in check: A history-based approach to mitigate overfitting. IEEE Access, 12, 70676-70689.
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International conference on machine learning.
Li, J., Li, D., Xiong, C., & Hoi, S. (2022). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. International conference on machine learning.
Li, Y., Liang, F., Zhao, L., Cui, Y., Ouyang, W., Shao, J., Yu, F., & Yan, J. (2021). Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), 1-35.
Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., & Dong, L. (2022). Swin transformer v2: Scaling up capacity and resolution. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Lo, C. M., & Hsieh, C. Y. (2025). Large-Scale Hierarchical Medical Image Retrieval Based on a Multilevel Convolutional Neural Network. IEEE Transactions on Emerging Topics in Computational Intelligence, 9(4), 2782-2792. https://doi.org/10.1109/TETCI.2024.3502404
Lyu, J. H., & Kwak, D. Y. (2022). A Research on the Chinese Characters Culture Industry in China, South Korea and Japan. JOURNAL OF NORTH-EAST ASIAN CULTURES, 72, 271-279. http://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE11138239
Pang, X. (2023). Calligraphic Techniques in Painting: The Aesthetic Expression and Literary Significance of “Writing” in Ni Zan’s Paintings.
Pawlik, K. (2022). Keep the Living Practice Alive: Calligraphy, Commodification, and Postliterate Culture. In S. Chrétien-Ichikawa & K. Pawlik (Eds.), Creative Industries and Digital Transformation in China (pp. 11-37). Springer Nature Singapore. https://doi.org/10.1007/978-981-19-3049-2_2
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., & Clark, J. (2021). Learning transferable visual models from natural language supervision. International conference on machine learning.
Shayestehfar, M., & Khazaie, E. (2021). The Impact of Chinese Seals on the Structure, Design, and Usage of the Īl-Khānids Seals and Coins. Design Engineering, 6713-6739.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., & Le, Q. V. (2021). Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Wolf, M., Kurvers, R. H., Ward, A. J., Krause, S., & Krause, J. (2013). Accurate decisions in an uncertain world: collective cognition increases true positives while decreasing false positives. Proceedings of the Royal Society B: Biological Sciences, 280(1756), 20122777.
Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I. S., & Xie, S. (2023). Convnext v2: Co-designing and scaling convnets with masked autoencoders. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Wu, M.-Q. A., Wu, F., & Lin, W.-B. (2024). Improving the Precision of Image Search Engines with the Psychological Intention Diagram. Electronics, 13(1).
Wu, M. P. S. (1994). [Word as Image: The Art of Chinese Seal Engraving, Jason Chi-sheng Kuo]. China Review International, 1(1), 155-160. http://www.jstor.org/stable/23728677
Xia, P., Zhang, L., & Li, F. (2015). Learning similarity with cosine similarity ensemble. Information Sciences, 307, 39-52. https://doi.org/10.1016/j.ins.2015.02.024
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. International conference on machine learning.
Xu, Y., & Shen, R. (2023). Aesthetic evaluation of Chinese calligraphy: a cross-cultural comparative study. Current Psychology, 42(27), 23096-23109.
Yang, L., Wu, Z., Xu, T., Du, J., & Wu, E. (2023). Easy recognition of artistic Chinese calligraphic characters. The Visual Computer, 39(8), 3755-3766.
Yang, S. (2022). The Development and Influence of Seal Script of Han Dynasty Inscriptions. International Journal of Advanced Culture Technology, 10(3), 192-201.
Yao, F., Sun, X., Liu, N., Tian, C., Xu, L., Hu, L., & Ding, C. (2023). Hypergraph-Enhanced Textual-Visual Matching Network for Cross-Modal Remote Sensing Image Retrieval via Dynamic Hypergraph Learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16, 688-701. https://doi.org/10.1109/JSTARS.2022.3226325
Yao, T., Li, Y., Pan, Y., Wang, Y., Zhang, X.-P., & Mei, T. (2023). Dual vision transformer. IEEE transactions on pattern analysis and machine intelligence, 45(9), 10870-10882.
Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., & Beyer, L. (2022). Lit: Zero-shot transfer with locked-image text tuning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Zhang, G., Wei, S., Pang, H., Qiu, S., & Zhao, Y. (2022). Composed image retrieval via explicit erasure and replenishment with semantic alignment. IEEE Transactions on Image Processing, 31, 5976-5988.
Zhang, Q. (2015). An introduction to Chinese history and culture. Springer.
Zhang, Y., Chen, L., Chen, H., Chu, J., Chang, B., Wang, Y., & Sun, G. (2023). Visual analysis of inscriptions in the Tang Dynasty: a case study on the calligraphy style of Wang Xizhi. Visual Intelligence, 1(1), 8. https://doi.org/10.1007/s44267-023-00012-z
Full-Text Release Date: 2031/01/18