
Author: Lai, Wei-Po (賴威博)
Title: Interactive Semantic Segmentation for Large-Scale Aerial 3D Gaussian Splatting (針對大尺度空拍高斯潑濺之互動式語意分割)
Advisor: Chi, Ming-Te (紀明德)
Committee: Hung, Shih-Hsuan (洪仕軒); Peng, Yan-Tsung (彭彥璁)
Degree: Master
Department: Department of Computer Science, College of Informatics
Publication year: 2026
Graduation academic year: 114 (ROC calendar; 2025–2026)
Language: Chinese
Pages: 53
Keywords: 3D Gaussian Splatting, Semantic Segmentation, Segment Anything Model, CLIP, UAV Imagery, Multimodal Editing, 3D Scene Manipulation
This study targets the large-scale outdoor scenes covered by unmanned aerial vehicle (UAV) imagery, specifically addressing the drastic scale variation and complex occlusion patterns of bird's-eye views, and proposes a post-processing framework that enables semantics-driven 3D editing without retraining the underlying model. Using 3D Gaussian Splatting (3DGS) as the geometric foundation, the system integrates 2D mask generation from the Segment Anything Model (SAM), sparse geometric priors produced by structure-from-motion (SfM) reconstruction, and a domain-finetuned CLIP semantic decoder to establish a robust 2D–3D semantic correspondence. Through a scale-aware contrastive feature learning mechanism, each Gaussian point is assigned a semantic embedding; together with mask-quality filtering and memory-optimization strategies, this keeps training stable even on scenes with millions of points.
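To make the 2D–3D association step concrete, here is a minimal sketch that projects Gaussian centers into one calibrated training view and reads off the SAM mask ID under each projection. All input names (`centers`, `K`, `w2c`, `mask_map`) are illustrative assumptions, and the occlusion tests and scale estimation of a full pipeline are omitted; this is not the thesis' actual implementation.

```python
# Hedged sketch: associate SAM mask IDs with 3D Gaussians by projecting
# Gaussian centers into a single calibrated view. Assumed inputs:
#   centers  (N, 3) Gaussian centers in world coordinates
#   K        (3, 3) camera intrinsics
#   w2c      (4, 4) world-to-camera extrinsics
#   mask_map (H, W) integer SAM mask IDs, -1 for background
import numpy as np

def masks_for_gaussians(centers, K, w2c, mask_map):
    n = centers.shape[0]
    homo = np.concatenate([centers, np.ones((n, 1))], axis=1)
    cam = (homo @ w2c.T)[:, :3]                # world -> camera space
    in_front = cam[:, 2] > 1e-6                # cull points behind the camera
    z = np.where(in_front, cam[:, 2], 1.0)     # dummy depth for culled points
    pix = cam @ K.T                            # apply intrinsics
    u = np.round(pix[:, 0] / z).astype(int)    # perspective divide
    v = np.round(pix[:, 1] / z).astype(int)
    h, w = mask_map.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    ids = np.full(n, -1, dtype=int)
    ids[valid] = mask_map[v[valid], u[valid]]  # mask ID under each projection
    return ids
```

Repeating this over many training views, and keeping only high-quality masks, yields the 2D–3D associations that supervise the per-Gaussian features.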

The framework supports multimodal interaction, including click-driven real-time selection, one-click application of predefined semantic labels, and open-vocabulary text queries, while achieving low interaction latency and smooth rendering on a consumer-grade GPU. The proposed method offers a scalable, low-barrier, and practical semantic manipulation solution for applications such as smart-city planning, disaster assessment, and virtual landscape construction.
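As one illustration of how an open-vocabulary text query can drive selection, the sketch below encodes a prompt with the public OpenAI CLIP package and thresholds cosine similarity against per-Gaussian embeddings. It assumes those embeddings already live in CLIP's joint space (in the thesis this alignment passes through the domain-finetuned decoder); `gaussian_feats` and the threshold `tau` are illustrative names, not the system's actual interface.

```python
# Hedged sketch of open-vocabulary selection over per-Gaussian embeddings.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

def select_by_text(gaussian_feats: torch.Tensor, prompt: str,
                   tau: float = 0.25) -> torch.Tensor:
    """gaussian_feats: (N, D) embeddings assumed aligned to CLIP's text space."""
    device = gaussian_feats.device
    model, _ = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        tokens = clip.tokenize([prompt]).to(device)
        text_feat = model.encode_text(tokens).float()        # (1, D)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    feats = gaussian_feats / gaussian_feats.norm(dim=-1, keepdim=True)
    sim = (feats @ text_feat.T).squeeze(1)                   # (N,) cosine similarity
    return sim > tau                                         # boolean selection mask
```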


We present a post-hoc, semantics-driven 3D editing framework for large-scale outdoor scenes captured by unmanned aerial vehicles (UAVs) that requires no retraining of the underlying 3D Gaussian Splatting (3DGS) model. Our system integrates the Segment Anything Model (SAM) for 2D mask generation, sparse geometric priors derived from structure-from-motion (SfM) reconstruction, and a domain-finetuned CLIP decoder to establish robust 2D–3D semantic correspondence. A scale-aware contrastive learning mechanism assigns a semantic embedding to each Gaussian point, while mask-quality filtering and memory-efficient training keep optimization stable in million-scale scenes.
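A loss in the spirit of this mechanism could be written as a supervised-contrastive objective in which per-Gaussian features sampled under the same SAM mask act as positives and all others as negatives. The sketch below is a generic SupCon-style formulation under assumed inputs, not the thesis' exact scale-aware loss; the scale gating and mask-quality weighting are omitted.

```python
# Hedged sketch: mask-supervised contrastive loss over sampled features.
import torch
import torch.nn.functional as F

def mask_contrastive_loss(feats: torch.Tensor, mask_ids: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """feats: (N, D) sampled per-Gaussian embeddings; mask_ids: (N,) mask labels."""
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.T / temperature                      # pairwise similarities
    eye = torch.eye(len(feats), dtype=torch.bool, device=feats.device)
    pos = (mask_ids.unsqueeze(0) == mask_ids.unsqueeze(1)) & ~eye
    # Row-wise log-softmax, excluding each point's self-similarity.
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")),
                                     dim=1, keepdim=True)
    mean_pos = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -mean_pos[pos.any(1)].mean()                      # anchors with positives
```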
The framework supports multimodal interaction, including click-driven selection, predefined-label application, and open-vocabulary queries, and achieves low-latency, interactive rendering on consumer-grade GPUs. It provides a lightweight, scalable solution for semantic 3D manipulation in smart-city planning, post-disaster assessment, and virtual terrain construction.
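Once per-Gaussian features exist, click-driven selection reduces to a similarity lookup: take the feature rendered at the clicked pixel as the query and keep the Gaussians whose embeddings resemble it. In this sketch, `feature_map` and `gaussian_feats` are assumed inputs, and the fixed threshold stands in for whatever refinement the real system applies.

```python
# Hedged sketch of click-driven selection from a rendered feature map.
import torch
import torch.nn.functional as F

def select_by_click(feature_map: torch.Tensor, gaussian_feats: torch.Tensor,
                    u: int, v: int, tau: float = 0.9) -> torch.Tensor:
    """feature_map: (H, W, D) rendered features; returns an (N,) boolean mask."""
    query = F.normalize(feature_map[v, u], dim=-1)   # feature under the cursor
    feats = F.normalize(gaussian_feats, dim=-1)
    sim = feats @ query                              # (N,) cosine similarity
    return sim > tau
```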

Acknowledgements iii
Abstract (Chinese) v
Abstract (English) vi
Table of Contents vii
List of Figures x
List of Tables xi
Chapter 1 Introduction 1
1.1 Motivation 1
1.1.1 Challenges and Needs of Semantic Segmentation in Outdoor 3D Scenes 1
1.1.2 Technical Foundations 2
1.2 Problem Statement 4
1.3 Contributions 4
Chapter 2 Related Work 6
2.1 3D Gaussian Splatting and Scene Reconstruction 6
2.2 Semantic Segmentation in 3D Scenes 7
2.3 Semantic Labeling and Feature Representation Learning 8
2.4 Open Challenges and Positioning of This Work 9
Chapter 3 Background 10
3.1 3D Gaussian Splatting 10
3.1.1 The Gaussian Splatting Representation 10
3.1.2 Gaussian Point Parameters 11
3.1.3 Rendering Pipeline and Projection 11
3.1.4 Model Training and Densification 12
3.2 Semantic Segmentation on 3DGS 12
3.2.1 Core Architecture and Design Rationale 13
3.2.2 Scale-Gated Affinity Features 13
3.2.3 Contrastive Training Strategy 14
3.2.4 Inference and Applications 14
Chapter 4 Methodology 16
4.1 SfM and 3D Scene Construction from UAV Imagery 16
4.1.1 Global SfM (GLOMAP) and Sparse Point Clouds 16
4.1.2 3DGS Scene Initialization and Generality Validation 17
4.2 Semantic Mask Extraction and 2D–3D Correspondence 18
4.2.1 Batch Generation of SAM Masks 18
4.2.2 Mask Back-Projection and 3D Scale Estimation 18
4.2.3 Precomputation of Text-Label Semantic Features and Mask Scales 19
4.3 CLIP Fine-Tuning for Semantic Region Classification 21
4.3.1 Region-Level Dataset Construction 21
4.3.2 Model Architecture and Training Setup 21
4.3.3 Integration into the Semantic Pipeline 22
4.4 Scale-Aware Contrastive Feature Training 24
4.4.1 Method Architecture and Mask-Quality Filtering 24
4.4.2 Training Optimizations for Limited Hardware 25
4.5 Interactive Selection and Semantics-Driven Editing (GUI) 28
4.5.1 Click-Driven Semantic Selection and 3D Segmentation 28
4.5.2 Training-View Navigation and Free-Viewpoint Switching 30
4.5.3 Zeroth-Order Spherical-Harmonic Rendering and Real-Time Performance Optimization 30
4.5.4 Quick Selection with Predefined Labels 31
4.5.5 Open-Vocabulary 3D Object Retrieval 32
4.5.6 Operation Workflow and Semantic Confirmation 33
Chapter 5 Experiments 34
5.1 Experimental Setup 34
5.1.1 Datasets and Hardware 34
5.1.2 Evaluation Metrics 36
5.2 Training Stability and Resource Efficiency of Scale-Aware Contrastive Learning 37
5.3 Performance and Visual Quality of the Interactive Semantic Editing System 38
5.4 Classification Performance of the Fine-Tuned CLIP Decoder 41
Chapter 6 Conclusion and Future Work 45
6.1 Conclusion 45
6.2 Limitations 46
6.3 Future Work 47
References 48

Full-text release date: 2028/03/25