Token-wise Attention Mechanism for Long Input Transformer Models | National Chengchi University Theses and Dissertations System

Graduate student:	賴建郡 Lai, Jian-Jyun
Thesis title:	Token-wise Attention Mechanism for Long Input Transformer Models (基於詞組的注意力機制用於長文轉換器模型)
Advisor:	黃瀚萱 Huang, Hen-Hsen
Oral defense committee:	蔡銘峰 Tsai, Ming-Feng; 顏安孜 Yen, An-Zi; 黃瀚萱 Huang, Hen-Hsen
Degree:	Master
Department:	College of Informatics - Department of Computer Science
Year of publication:	2022
Academic year of graduation:	111 (ROC calendar)
Language:	English
Number of pages:	47
Chinese keywords:	自然語言處理、長文處理、轉換器、注意力機制、基於詞組分析
English keywords:	Natural language processing, Long text processing, Transformer, Attention mechanism, Token-wise analysis
DOI URL:	http://doi.org/10.6814/NCCU202201686

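In natural language processing today, Transformer-based models are a widely used architecture. Pre-training such a model on a large corpus and then fine-tuning it separately for each downstream task is generally regarded as effective. Within the Transformer, the attention mechanism is the key to how the model gathers information. However, because of the structure of the attention mechanism, memory usage grows sharply as the sequence length increases, and Transformer models still have room for improvement on long-sequence tasks.

This thesis attempts to redefine the range observed by the attention mechanism on a per-token basis, in two variants: POS tagging and an independent token-wise attention mechanism. It also computes the attention matrix in split parts in order to reduce memory usage.

On long-sequence classification and long-sequence question answering, models using the independent token-wise attention mechanism achieve performance competitive with Longformer, a strong current long-sequence model, while using less memory, which makes them easier to apply to natural language tasks.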

Transformer-based models are the mainstream architecture in natural language processing (NLP). Pre-training such a model on a large corpus and then fine-tuning it for each downstream task has proven to be an effective scheme. In Transformer-based models, the attention mechanism is critical to gathering information from sequences. However, the architecture of the attention mechanism makes it time-consuming, and its cost grows significantly with sequence length. Moreover, the performance of Transformer-based models on long-sequence tasks still has much room for improvement.

In this work, we use a token-wise method to redefine the range to which the attention mechanism is limited, in two variants: POS tagging and independent attention. Moreover, by splitting the computation of the attention matrix, the model occupies less memory.

On long-sequence classification and question-answering tasks, models with the independent attention mechanism show performance competitive with Longformer while using less memory. The proposed method is therefore easier to apply to NLP tasks.
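The abstract does not spell out how the per-token attention range or the split attention computation is implemented, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes each token i has its own window size `windows[i]` and attends only to positions within that distance, and it computes attention one query row at a time so that a full n-by-n score matrix is never stored. All function names and the `windows` parameter are hypothetical, introduced here for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_bounds(i, n, w):
    # Half-open range [lo, hi) of positions that token i may attend to.
    return max(0, i - w), min(n, i + w + 1)


def windowed_attention(q, k, v, windows):
    """Row-by-row attention; row i only scores keys inside its own window."""
    n, scale = len(q), 1.0 / math.sqrt(len(q[0]))
    out = []
    for i in range(n):
        lo, hi = window_bounds(i, n, windows[i])
        weights = softmax([dot(q[i], k[j]) * scale for j in range(lo, hi)])
        out.append([sum(w * v[j][d] for w, j in zip(weights, range(lo, hi)))
                    for d in range(len(v[0]))])
    return out


def dense_masked_attention(q, k, v, windows):
    """Reference version: full n-by-n scores with out-of-window entries masked."""
    n, scale = len(q), 1.0 / math.sqrt(len(q[0]))
    scores = [[dot(q[i], k[j]) * scale if abs(i - j) <= windows[i] else -math.inf
               for j in range(n)] for i in range(n)]
    out = []
    for row in scores:
        weights = softmax(row)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out


def score_entries(n, windows):
    # Number of attention scores actually computed by windowed_attention.
    return sum(hi - lo for lo, hi in
               (window_bounds(i, n, windows[i]) for i in range(n)))


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n, d = 6, 4
q, k, v = random_matrix(n, d), random_matrix(n, d), random_matrix(n, d)
windows = [1, 2, 0, 3, 1, 2]  # hypothetical per-token window sizes

sparse = windowed_attention(q, k, v, windows)
dense = dense_masked_attention(q, k, v, windows)
print("scores computed:", score_entries(n, windows), "of", n * n)
```

Both functions produce the same output, but the windowed version only ever holds one row of at most 2 * windows[i] + 1 scores at a time, which is the kind of saving the abstract attributes to limiting and splitting the attention computation.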

Acknowledgements
Abstract (in Chinese)
Abstract
Contents
List of Figures
1 Introduction
1.1 Background
1.2 Motivation
1.3 Research Goals
2 Related Work
2.1 Reducing the Computation Complexity of Transformer-based Models
2.1.1 Fixed Patterns
2.1.2 Combination of Patterns
2.1.3 Learnable Patterns
2.1.4 Memory
2.1.5 Low-Rank Methods
2.1.6 Kernels
2.1.7 Recurrence
2.2 Transformer-based models for Long Input Sequences
2.3 Replacement of the Attention Matrix
2.4 Importance of the Attention Matrix and Input Sequences
3 Datasets
3.1 The First Stage
3.2 The Second Stage
3.3 The Third Stage
4 Methodology
4.1 Part-of-speech (POS) tagging
4.1.1 Global attention
4.1.2 Large local attention
4.1.3 Small local attention
4.1.4 Mask language modeling
4.2 Independent attention window size
4.2.1 Transform attention window size without limitation
4.2.2 Transform attention window size with limitation
4.2.3 Decrease memory consuming with independent limitation
4.3 A Three Stage of Computing Attention
5 Experiments
5.1 Hyperpartisan
5.1.1 Evaluation of The Models Pre-trained With Continuous Task and Data
5.1.2 Evaluation of The Models Pre-trained on the Third Stage Data
5.2 TriviaQA
6 Analysis
6.1 Training with POS taggings
6.2 The Independent Attention Tokens’ Distribution
6.3 The Independent Attention Tokens’ Diversification during Pre-training
6.4 The Tendency of Hyperpartisan Score on Continuous Dataset
6.5 The Tendency of Hyperpartisan Score on Not Continuous Dataset
6.6 The TriviaQA Performance on Each Answer Position
6.7 The F1-score and EM-score on Each Answer Position in TriviaQA
6.8 McNemar’s Test in Hyperpartisan
6.9 McNemar’s Test in TriviaQA
6.10 VRAM Occupied
7 Conclusions
Reference

[1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.
[2] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.
[3] Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/S19-2145. URL https://aclanthology.org/S19-2145.
[4] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms, 2020.
[5] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents, 2014.
[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[7] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey, 2020.
[8] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2021.
[9] Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning, 2019.
[10] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news, 2020.
[11] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015.
