Decoding the Power of PC1: A Fast and Accurate Covariance-Based Method for A/B Compartment Identification in Hi-C Data

Graduate student:	Cheng, Zhi-Rong (程至榮)
Thesis title:	Decoding the Power of PC1: A Fast and Accurate Covariance-Based Method for A/B Compartment Identification in Hi-C Data (解碼 PC1 的力量：一種快速準確並基於共變異的 Hi-C 資料 A/B 染色體區室辨別方法)
Advisor:	Chang, Jia-Ming (張家銘)
Oral defense committee:	Wu, Yu-Wei (吳育瑋); Ban, Jung-Chao (班榮超)
Degree:	Master
Department:	College of Informatics, Department of Computer Science
Year of publication:	2024
Academic year of graduation:	112 (ROC calendar)
Language:	English
Number of pages:	40
Chinese keywords:	高通量染色體捕獲技術 (Hi-C), 染色質區室分析 (chromatin compartment analysis), 主成份分析 (principal component analysis)
English keywords:	Hi-C, Chromatin compartments analysis, Principal Component Analysis (PCA)

Principal component analysis (PCA) is the standard method for identifying A and B compartments in the Hi-C Pearson correlation matrix, yet why it works is rarely discussed. We propose that, for the Hi-C Pearson matrix, the explained variance ratio of the first principal component (PC1) is usually high, and that this ratio reflects how well PC1 matches the compartment pattern visible in the Pearson matrix. We also propose a heuristic algorithm that estimates the pattern of PC1 from the covariance matrix of the Hi-C Pearson matrix, without explicitly performing PCA. The method can be implemented efficiently with random sampling to speed up computation. To address the memory bottleneck at finer matrix resolutions, we adapt the algorithm using principles from POSSUMM, a recently published compartment identification tool that takes the sparse Hi-C observed/expected (O/E) matrix as input. In our experiments, the algorithm outperforms power iteration methods, such as those implemented in Scikit-learn and POSSUMM, in time or memory usage, while remaining highly similar to the ground-truth PC1. The code is freely available at https://github.com/ZhiRongDev/HiCPEP.
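
The idea in the abstract can be illustrated with a small NumPy sketch. It is not the HiCPEP implementation: the toy matrix, the function names, and in particular the column-selection rule (take the covariance column with the largest absolute sum) are assumptions made here for illustration. The sketch relies on one observation: when PC1 explains almost all of the variance, the covariance matrix is close to rank one, so each of its columns is roughly proportional to PC1.

```python
import numpy as np


def make_toy_pearson(n=60, seed=0):
    """Synthetic stand-in for a Hi-C Pearson matrix (hypothetical data)."""
    rng = np.random.default_rng(seed)
    # Two compartments alternating in blocks of 10 bins: +1 for A, -1 for B.
    labels = np.where((np.arange(n) // 10) % 2 == 0, 1.0, -1.0)
    # O/E-like matrix: same-compartment pairs enriched, plus small noise.
    oe = 1.0 + 0.5 * np.outer(labels, labels) + 0.05 * rng.standard_normal((n, n))
    oe = (oe + oe.T) / 2  # contact maps are symmetric
    return np.corrcoef(oe), labels


def pc1_by_pca(pearson):
    """Reference PC1 from an explicit eigendecomposition of the covariance."""
    centered = pearson - pearson.mean(axis=0)
    cov = centered.T @ centered / (pearson.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    pc1 = eigvecs[:, -1]
    explained_variance_ratio = eigvals[-1] / eigvals.sum()
    return pc1, explained_variance_ratio, cov


def estimate_pc1_from_cov(cov):
    """Assumed heuristic: return the covariance column with the largest absolute sum."""
    col = np.argmax(np.abs(cov).sum(axis=0))
    return cov[:, col]


def sign_agreement(a, b):
    """Fraction of bins with matching sign, ignoring the arbitrary sign of a PC."""
    same = np.mean(np.sign(a) == np.sign(b))
    return max(same, 1.0 - same)


if __name__ == "__main__":
    pearson, labels = make_toy_pearson()
    pc1, ratio, cov = pc1_by_pca(pearson)
    est = estimate_pc1_from_cov(cov)
    print("explained variance ratio of PC1:", ratio)
    print("sign agreement, estimate vs PC1:", sign_agreement(pc1, est))
```

Only the sign pattern is compared, because A/B assignment depends on the sign of each bin rather than on the scale of the vector. The sketch still forms the full covariance matrix for clarity; it does not reproduce the random sampling or the sparse O/E handling that the abstract mentions.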

1 Introduction 1
2 Materials and Methods 6
3 Results 25
4 Conclusion 35
5 Supplemental Information 36
Reference 37


Full-text release date: 2029/08/21
