Graduate student: 何子安 (Hoe, Zi-Onn)
Thesis title (Chinese): 基於圖神經網路表示學習之單細胞 Hi-C 資料精準分群
Thesis title (English): Graph Neural Network–Based Representation Learning for Accurate Clustering of Single-Cell Hi-C Data
Advisor: 張家銘 (Chang, Jia-Ming)
Oral defense committee: 蘇家玉 (Su, Chia-Yu); 吳育瑋 (Wu, Yu-Wei)
Degree: Master
Department: College of Informatics - Department of Computer Science
Year of publication: 2026
Academic year of graduation: 114
Language: English
Number of pages: 74
Chinese keywords: 單細胞 Hi-C, 分群, 主成分分析, 圖神經網路, 注意力, 嵌入
English keywords: Single-cell Hi-C, Clustering, Principal Component Analysis (PCA), Graph Neural Network (GNN), Attention, Embedding
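  • Single-cell Hi-C sequencing gives researchers the ability to resolve the three-dimensional folding of chromatin at the level of individual cells. However, the contact matrix produced for each cell is both extremely sparse and very high-dimensional, which poses a major challenge for downstream cell clustering. Current mainstream analysis workflows typically first project the features into a low-dimensional space with Principal Component Analysis (PCA) and then run a clustering algorithm; this linear projection, however, cannot adequately capture the non-linear and higher-order structural dependencies inherent in chromatin interactions.
    To overcome these limitations, this study proposes a representation learning framework centered on Graph Neural Networks (GNNs) to improve the quality of cell clustering for single-cell Hi-C data. Specifically, we convert the contact matrix of each chromosome into a weighted graph, with genomic bins as nodes and non-zero contact frequencies as edge weights. Message passing with a multi-head attention mechanism then learns node-level feature representations, which are fused into a single cell embedding vector that serves directly as the input to unsupervised clustering.
    We evaluate the method on six public single-cell Hi-C benchmark datasets covering mouse (Flyamer, Collombet, Nagano) and human (Ramani, Lee, 4DN) cells. The results show that the GNN embeddings outperform the corresponding PCA baselines on every dataset. Taking the Ramani, Lee, and 4DN datasets highlighted in the abstract as examples, the method achieves ARI scores of 0.9440, 0.6737, and 0.8749, respectively. On Ramani this is higher than Higashi (0.8519) and MRscHiC (0.8822); on Lee it is higher than Higashi (0.1709) and MRscHiC (0.3239); and on 4DN it is also higher than Higashi (0.8531) and MRscHiC (0.8433).
    Taken together, these results confirm that the graph-structured embeddings learned by the GNN preserve the structural features of three-dimensional chromatin organization more precisely, thereby improving the accuracy of cell type identification. The method is particularly promising for large or more complex datasets, and it offers a new approach to single-cell Hi-C analysis that balances performance with biological interpretability.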


    Single-cell Hi-C assays now permit researchers to probe three-dimensional chromatin architecture at the level of individual nuclei, opening a window into cell-to-cell variability in genome folding. Yet analyzing these data through clustering remains difficult: each cell yields a contact matrix that is both exceedingly sparse and high-dimensional. Conventional workflows typically compress these matrices with linear techniques, most notably Principal Component Analysis (PCA), before applying a clustering algorithm. Although PCA is computationally efficient, its linear projections can overlook the intricate, non-linear structural relationships encoded within chromatin contact patterns.
    To address this gap, we develop an end-to-end, reproducible framework that integrates data preprocessing, graph construction, Graph Neural Network-based representation learning, and downstream clustering into a unified pipeline. Our framework converts per-chromosome contact matrices into weighted graphs and applies multi-head attention-based message passing to learn node-level features that encode both local and long-range chromatin relationships. These per-chromosome representations are then fused into a single cell-level embedding vector, which serves as the direct input to standard unsupervised clustering methods.
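    The graph construction step described above can be illustrated with a minimal sketch. It assumes a dense, symmetric per-chromosome contact matrix given as nested lists, treats each genomic bin as a node, and keeps every non-zero off-diagonal contact as a weighted edge. The function name `contact_matrix_to_graph` and the choice to skip the diagonal by default are illustrative assumptions, not details taken from the thesis.

```python
def contact_matrix_to_graph(matrix, include_self_loops=False):
    """Convert a square, symmetric contact matrix into a weighted edge list.

    Nodes are bin indices 0..n-1; each edge is (i, j, weight) with i <= j.
    Assumption: only the upper triangle is read, since the matrix is symmetric.
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("contact matrix must be square")

    edges = []
    for i in range(n):
        # Start at i to include the diagonal, or at i + 1 to skip self-contacts.
        start = i if include_self_loops else i + 1
        for j in range(start, n):
            weight = matrix[i][j]
            if weight != 0:
                edges.append((i, j, weight))
    return n, edges


# Toy chromosome with four bins and three observed contacts (hypothetical values).
chr_matrix = [
    [0, 3, 0, 1],
    [3, 0, 2, 0],
    [0, 2, 0, 0],
    [1, 0, 0, 0],
]
num_nodes, edges = contact_matrix_to_graph(chr_matrix)
print(num_nodes, edges)
```

    Because single-cell contact matrices are very sparse, an edge list of this kind is far smaller than the dense matrix, which is one reason a graph representation suits this data.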
    The framework was evaluated on six public benchmark datasets, including three mouse datasets (Flyamer, Collombet, and Nagano) and three human datasets (Ramani, Lee, and 4DN). The learned GNN embeddings consistently outperform the corresponding PCA baselines. On the representative Ramani, Lee, and 4DN datasets, the proposed method achieves ARI scores of 0.9440, 0.6737, and 0.8749, respectively. On Ramani, this performance is higher than the ARI scores reported by Higashi (0.8519) and MRscHiC (0.8822). On Lee, it is higher than Higashi (0.1709) and MRscHiC (0.3239). On 4DN, it also exceeds Higashi (0.8531) and MRscHiC (0.8433).
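    For readers unfamiliar with the metric, the Adjusted Rand Index (ARI) reported above can be computed from pair counts alone. The following is a minimal standard-library sketch of the usual Hubert and Arabie formulation; the function name `adjusted_rand_index` and the handling of degenerate inputs are illustrative assumptions, and this is not the evaluation code used in the thesis.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between a reference labeling and a clustering."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two cells: no pairs to compare (assumed convention).
        return 1.0

    # Count cells per (true label, cluster) combination and per marginal.
    joint_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of cell pairs placed together under each grouping.
    together_both = sum(comb(c, 2) for c in joint_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / total_pairs
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        # Degenerate case, e.g. both labelings put every cell in one group.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Cluster ids are arbitrary, so a relabeled but identical partition scores 1.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every true group evenly scores below zero.
worse_than_chance = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, worse_than_chance)
```

    An ARI of 1 means the clustering matches the reference cell types exactly, while values near 0 indicate agreement no better than chance, so a score such as 0.9440 on Ramani reflects near-complete recovery of the annotated cell types.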
    Overall, this work demonstrates that GNN-based embeddings provide a promising alternative to traditional linear dimensionality reduction techniques for single-cell Hi-C analysis, enabling more accurate clustering and offering improved insights into chromatin structural variation across cells.

    1. Introduction
    1.1. Hi-C
    1.2. Single-cell Hi-C
    1.3. Related Works
    1.3.1. Single-Cell Hi-C Data Clustering
    1.3.2. Principal Component Analysis (PCA)
    1.3.3. Reference Methods
    1.4. Problem Statement and Contributions
    2. Materials and Methods
    2.1. Datasets
    2.2. Single-cell Hi-C Workflow
    2.2.1. Random Walk with Restart
    2.3. Hi-C Matrix as Graph
    2.4. Graph Neural Network
    2.4.1. Candidate GNN Architectures
    2.4.2. GNN Model Workflow
    2.4.3. Architecture of GNN Model
    2.5. Clustering Methods
    2.6. Evaluation of Clustering Methods
    2.6.1. Adjusted Rand Index
    2.6.2. Normalized Mutual Information
    2.6.3. Homogeneity Score
    2.6.4. Fowlkes-Mallows Index
    2.7. Experimental Setup
    3. Results
    3.1. Effect of The Imputation Step
    3.2. Ablation Experiments on PCA-based Method
    3.3. Evaluation with Different Clustering Methods
    3.4. Overall Comparison Between Review’s Methods and Ours
    3.5. Training Dynamics and Convergence of the GNN Model
    3.6. Cross-Dataset Generalization Experiment
    4. Discussion
    5. Conclusion & Future Work
    5.1. Conclusion
    5.2. Future Work
    6. References

    Lieberman-Aiden, E., van Berkum, N. L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B. R., Sabo, P. J., Dorschner, M. O., Sandstrom, R., Bernstein, B., Bender, M. A., Groudine, M., Gnirke, A., Stamatoyannopoulos, J., Mirny, L. A., Lander, E. S., & Dekker, J. (2009). Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science, 326(5950), 289–293. https://doi.org/10.1126/science.1181369
    Dixon, J. R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J. S., & Ren, B. (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature, 485(7398), 376–380. https://doi.org/10.1038/nature11082
    Rao, S. S. P., Huntley, M. H., Durand, N. C., Stamenova, E. K., Bochkov, I. D., Robinson, J. T., Sanborn, A. L., Machol, I., Omer, A. D., Lander, E. S., & Aiden, E. L. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159(7), 1665–1680. https://doi.org/10.1016/j.cell.2014.11.021
    Zhou, J., Ma, J., Chen, Y., Cheng, C., Bao, B., Peng, J., Sejnowski, T. J., Dixon, J. R., & Ecker, J. R. (2019). Robust single-cell Hi-C clustering by convolution- and random-walk–based imputation. Proceedings of the National Academy of Sciences, 116(28), 14011–14018. https://doi.org/10.1073/pnas.1901423116
    Hong, H., Jiang, S., Li, H., Du, G., Sun, Y., Tao, H., Quan, C., Zhao, C., Li, R., Li, W., Yin, X., Huang, Y., Li, C., Chen, H., & Bo, X. (2020). DeepHiC: A generative adversarial network for enhancing Hi-C data resolution. PLOS Computational Biology, 16(2), e1007287. https://doi.org/10.1371/journal.pcbi.1007287
    Zhang, R., Zhou, T., & Ma, J. (2022). Multiscale and integrative single-cell Hi-C analysis with Higashi. Nature Biotechnology, 40(2), 254–261. https://doi.org/10.1038/s41587-021-01034-y
    Ma, R., Huang, J., Jiang, T., & Ma, W. (2024). A mini-review of single-cell Hi-C embedding methods. Computational and Structural Biotechnology Journal, 23, 4027–4035. https://doi.org/10.1016/j.csbj.2024.11.002
    Yata, K., & Aoshima, M. (2015). Principal component analysis based clustering for high-dimension, low-sample-size data [Preprint]. arXiv. https://arxiv.org/abs/1503.04525
    Zhen, C., Wang, Y., Geng, J., Han, L., Li, J., Peng, J., Wang, T., Hao, J., Shang, X., Wei, Z., Zhu, P., & Peng, J. (2022). A review and performance evaluation of clustering frameworks for single-cell Hi-C data. Briefings in Bioinformatics, 23(6). https://doi.org/10.1093/bib/bbac385
    Shen, Y., Yu, L., Qiu, Y., Zhang, T., & Kingsford, C. (2024). Graph-based genome inference from Hi-C data. In J. Ma (Ed.), Research in Computational Molecular Biology: 28th Annual International Conference, RECOMB 2024, Cambridge, MA, USA, April 29–May 2, 2024, Proceedings (Lecture Notes in Computer Science, Vol. 14758, pp. 115–130). Springer. https://doi.org/10.1007/978-1-0716-3989-4_9
    Knight, P. A., & Ruiz, D. (2013). A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis, 33(3), 1029–1047. https://doi.org/10.1093/imanum/drs019
    Imakaev, M., Fudenberg, G., McCord, R. P., Naumova, N., Goloborodko, A., Lajoie, B. R., Dekker, J., & Mirny, L. A. (2012). Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nature Methods, 9(10), 999–1003. https://doi.org/10.1038/nmeth.2148
    Zhang, S., Tong, H., Xu, J., & Maciejewski, R. (2019). Graph convolutional networks: a comprehensive review. Computational Social Networks, 6(1), 11. https://doi.org/10.1186/s40649-019-0069-y
    Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2018). Graph attention networks. In International Conference on Learning Representations (ICLR).
    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017) (pp. 5998–6008).
    Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR).
    Zhen, C., Wang, Y., Han, L., Li, J., Peng, J., Wang, T., Hao, J., Shang, X., Wei, Z., & Peng, J. (2021). A novel framework for single-cell Hi-C clustering based on graph-convolution-based imputation and two-phase-based feature extraction [Preprint]. bioRxiv. https://doi.org/10.1101/2021.04.30.442215
    Xie, W., Schultz, M. D., Lister, R., Hou, Z., Rajagopal, N., Ray, P., Whitaker, J. W., Tian, S., Hawkins, R. D., Leung, D., Yang, H., Wang, T., Lee, A. Y., Swanson, S. A., Zhang, J., Zhu, Y., Kim, A., Nery, J. R., Urich, M. A., … Ren, B. (2013). Epigenomic Analysis of Multilineage Differentiation of Human Embryonic Stem Cells. Cell, 153(5), 1134–1148. https://doi.org/10.1016/j.cell.2013.04.022

    Full-text release date: 2031/05/08