| Graduate Student: | 楊佳勳 (Yang, Chia-Hsun) |
|---|---|
| Thesis Title: | ALPMBT: A Distributed Transformer Architecture based on Adaptive Load Partitioning and Multi-Branch Exit Design(ALPMBT:基於自適應負載分割和多分支退出設計的分散式 Transformer 架構) |
| Advisor: | 張宏慶 (Jang, Hung-Chin) |
| Committee Members: | 馮輝文 (Ferng, Huei-Wen); 吳曉光 (Wu, Hsiao-Kuang) |
| Degree: | Master |
| Department: | College of Informatics, Department of Computer Science |
| Year of Publication: | 2025 |
| Graduating Academic Year: | 114 (Republic of China calendar) |
| Language: | Chinese |
| Pages: | 70 |
| Keywords: | Transformer, Distributed Inference, Load Partition, Multi-Branch Structure |
Transformer models have driven significant advances in natural language processing, computer vision, and other fields. However, their high computational cost poses considerable challenges for deployment on resource-constrained edge devices. Traditional cloud computing approaches leverage extensive computational resources to assist edge devices by offloading workloads to the cloud. While this method can satisfy computational demands, it struggles to meet the latency requirements of real-time applications due to physical distance and network bandwidth limitations.
Edge computing has emerged as a solution to this issue by utilizing smaller edge servers in closer proximity to edge devices, effectively reducing communication latency and enhancing efficiency. To further improve the computational efficiency of Transformer models, distributed inference techniques have been introduced. The core idea is to distribute computational tasks across multiple edge devices, leveraging their collective computational power to collaboratively process inference tasks.
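To make the layer-distribution idea concrete, below is a minimal Python sketch of one plausible partitioning rule: contiguous blocks of encoder layers are assigned to devices in proportion to each device's measured throughput. The device names, throughput figures, and the proportional rule itself are illustrative assumptions, not the algorithm proposed in this thesis (Section 3.2.1 presents the actual dynamic load partitioning strategy).

```python
def partition_layers(num_layers: int, throughputs: dict[str, float]) -> dict[str, range]:
    """Assign each device a contiguous block of encoder layers sized in
    proportion to its share of the total measured throughput.
    NOTE: an illustrative rule, not the thesis's partitioning algorithm."""
    total = sum(throughputs.values())
    devices = list(throughputs)
    assignment, start = {}, 0
    for i, dev in enumerate(devices):
        if i == len(devices) - 1:
            count = num_layers - start  # last device absorbs rounding drift
        else:
            count = round(num_layers * throughputs[dev] / total)
        assignment[dev] = range(start, start + count)
        start += count
    return assignment

# Hypothetical example: a 12-layer encoder split across three heterogeneous devices.
print(partition_layers(12, {"pi-4": 1.0, "pi-5": 2.0, "jetson": 3.0}))
# -> {'pi-4': range(0, 2), 'pi-5': range(2, 6), 'jetson': range(6, 12)}
```

A system in this vein would profile devices at run time and re-partition when the measurements drift; that responsiveness is what the adaptive partitioning described next aims to provide.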
However, existing methods still face significant challenges when partitioning models and performing inference, including high communication costs and poor adaptability to network dynamics. To address these limitations, this study proposes a multi-branch dynamic exit mechanism that adapts the model's output depth to heterogeneous edge devices and network conditions, enabling more flexible and efficient inference. In addition, an adaptive load partitioning strategy dynamically allocates computational workloads according to real-time environmental conditions, maximizing the computational efficiency of the participating edge devices. By integrating these two approaches, the proposed framework effectively reduces inference latency while accommodating diverse device capabilities and network conditions, yielding a high-performance, stable distributed Transformer inference architecture.
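As a companion sketch, the multi-branch exit idea can be illustrated in PyTorch: an exit classifier is attached after each encoder block, and inference stops as soon as an exit head's softmax confidence clears a threshold. The block count, mean-pooling scheme, fixed threshold, and all identifiers here are assumptions for illustration only; the thesis's own exit-point decision algorithm (Section 3.2.2) adapts to device and network conditions rather than relying on a single static threshold.

```python
import torch
import torch.nn as nn

class MultiBranchEncoder(nn.Module):
    """Transformer encoder with an exit classifier ("branch") after each block.
    A minimal sketch of the multi-branch early-exit idea, not the thesis model."""
    def __init__(self, dim=192, heads=3, depth=6, num_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(depth)
        )
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes) for _ in range(depth))

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for exit_point, (block, head) in enumerate(zip(self.blocks, self.exits), 1):
            x = block(x)
            logits = head(x.mean(dim=1))   # mean-pool over tokens, then classify
            if logits.softmax(-1).max() >= threshold:
                return logits, exit_point  # confident enough: exit early
        return logits, exit_point          # never confident: run full depth

tokens = torch.randn(1, 16, 192)           # (batch, sequence, embedding dim)
logits, used_depth = MultiBranchEncoder().eval()(tokens)
print(f"exited after {used_depth} of 6 blocks")
```

In a distributed setting, exiting at a shallow branch both shortens computation and avoids forwarding activations to the remaining devices, which is how early exit and load partitioning compound.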
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Related Work
2.1 Multi-Branch Transformer Architectures
2.2 Transformer Parallelization Techniques
2.2.1 Data Parallelism
2.2.2 Pipeline Parallelism
2.2.3 Model Parallelism
2.3 Adaptive Load Partitioning
Chapter 3 Methodology
3.1 Training Phase
3.1.1 Model Architecture
3.1.2 Loss Function Design
3.1.3 Data Augmentation
3.1.4 Training Procedure
3.2 Inference Phase
3.2.1 Dynamic Load Partitioning Strategy
3.2.2 Optimal Exit Point Decision Algorithm
3.2.3 Distributed Inference Workflow
Chapter 4 Experimental Design and Results Analysis
4.1 Experimental Environment
4.2 Experimental Architecture and Workflow Design
4.2.1 Training Phase
4.2.2 Inference Phase
4.3 Analysis of Experimental Results
4.3.1 Model Training Results
4.3.2 Single-Device Inference Results under Different CPU Frequencies
4.3.3 Distributed Inference Results: ALPMBT vs. Uniform Load Allocation
4.3.4 Distributed Inference Results: ALPMBT vs. Fixed Inference Depth
4.3.5 Distributed Inference Results: ALPMBT vs. Tensor Parallelism
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Research Directions
References