
Graduate Student: Ko, Tzu Wei (柯子惟)
Thesis Title: Dimension reduction of large p small n data set based on stepwise SVM (以逐步SVM縮減p大n小資料型態之維度)
Advisor: 周珮婷
Degree: Master
Department: College of Commerce, Department of Statistics
Year of Publication: 2017
Academic Year of Graduation: 105 (ROC calendar)
Language: Chinese
Number of Pages: 35
Chinese Keywords: 維度縮減, 特徵選取, p大n小資料型態, 逐步SVM
English Keywords: Stepwise SVM, Dimension reduction, Feature selection, Large p small n data set
  • This study addresses dimension reduction for "large p, small n" data. It proposes a stepwise SVM method and compares it against the unreduced data and against three dimension-reduction methods: principal component analysis (PCA), Pearson product-moment correlation coefficients (PCCs), and random-forest-based recursive feature elimination (RF-RFE). The question is whether stepwise SVM can select feature subsets that better discriminate the sample classes. The research data comprise six disease-related gene-expression and biological-spectroscopy data sets.
    First, stepwise SVM was applied for feature selection in a supervised setting; the selection results show that it can effectively extract, from the full variable set, the features most important for classifying the samples. Each data set was then split into training and test sets, its dimension was reduced in a semi-supervised setting with stepwise SVM, PCA, PCCs, or RF-RFE, and an SVM model was fitted to compute the prediction accuracy. This procedure was repeated 100 times and the accuracies were averaged to give each method's final prediction accuracy. The results show that stepwise SVM always outperformed the raw, unreduced data; compared with the other methods, it was more stable than PCA and RF-RFE, while the difference from PCCs was harder to discern. The study concludes that dimension reduction is necessary for large p, small n data, since it effectively removes noise and improves the model's overall prediction accuracy.
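The evaluation loop described above can be sketched as follows. This is a minimal, hypothetical illustration using scikit-learn: the "stepwise SVM" here is implemented as greedy forward selection scored by SVM cross-validation accuracy, synthetic data stand in for the gene-expression sets, and the repetition count is reduced; the thesis's exact stepwise procedure and settings may differ.

```python
# Sketch of the abstract's pipeline: stepwise (greedy forward) SVM feature
# selection, then repeated train/test evaluation with an SVM classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC


def stepwise_svm_select(X, y, max_features=3):
    """Greedily add the feature that most improves SVM cross-validation
    accuracy; stop when no candidate improves or max_features is reached."""
    selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
    while remaining and len(selected) < max_features:
        scores = {
            j: cross_val_score(SVC(kernel="linear"),
                               X[:, selected + [j]], y, cv=3).mean()
            for j in remaining
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # no remaining feature improves accuracy
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected


# Toy "large p, small n" data: 60 samples, 50 features, few informative.
X, y = make_classification(n_samples=60, n_features=50,
                           n_informative=5, random_state=0)

accs = []
for rep in range(5):  # the thesis repeats 100 times; 5 keeps this quick
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=rep, stratify=y)
    feats = stepwise_svm_select(X_tr, y_tr)
    clf = SVC(kernel="linear").fit(X_tr[:, feats], y_tr)
    accs.append(clf.score(X_te[:, feats], y_te))

print(f"mean accuracy over {len(accs)} splits: {np.mean(accs):.3f}")
```

Selecting features on the training split only, then scoring on the held-out split, mirrors the abstract's design and avoids leaking test information into the feature selection.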


    Chapter 1: Research Motivation and Objectives
    Section 1: Current State of Dimension Reduction for High-Dimensional, Small-Sample Data
    Section 2: Research Motivation and Objectives
    Chapter 2: Literature Review
    Chapter 3: Research Methods and Data
    Section 1: Stepwise SVM
    Section 2: Algorithms Used
    Section 3: Dimension-Reduction Methods Used
    Section 4: Description of the Research Data
    Chapter 4: Data Analysis and Results
    Section 1: Experimental Procedure and Analysis
    Section 2: Results and Method Comparison
    Chapter 5: Conclusions and Suggestions
    Section 1: Conclusions
    Section 2: Research Limitations and Suggestions
    References

    Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.
    Bellman, R. E. (2015). Adaptive Control Processes: A Guided Tour: Princeton University Press.
    Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Paper presented at the Proceedings of the fifth annual workshop on Computational learning theory, Pittsburgh, Pennsylvania, USA.
    Boulesteix, A.-L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3(1), 1.
    Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. doi:10.1023/a:1010933404324
    Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. doi:10.1007/bf00994018
    Cunningham, P. (2008). Dimension Reduction. In M. Cord & P. Cunningham (Eds.), Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval (pp. 91-112). Berlin, Heidelberg: Springer Berlin Heidelberg.
    Dai, J. J., Lieu, L., & Rocke, D. (2006). Dimension reduction for classification with gene expression microarray data. Statistical applications in genetics and molecular biology, 5(1), 1147.
    Gordon, G. J., Jensen, R. V., Hsiao, L.-L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., . . . Bueno, R. (2002). Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma. Cancer Research, 62(17), 4963-4967.
    Granitto, P. M., Furlanello, C., Biasioli, F., & Gasperi, F. (2006). Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems, 83(2), 83-90.
    Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar), 1157-1182.
    Guyon, I., Gunn, S. R., Ben-Hur, A., & Dror, G. (2004). Result Analysis of the NIPS 2003 Feature Selection Challenge. Paper presented at the NIPS.
    Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., . . . Trent, J. (2001). Gene-Expression Profiles in Hereditary Breast Cancer. New England Journal of Medicine, 344(8), 539-548. doi:10.1056/nejm200102223440801
    Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., . . . Peterson, C. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature medicine, 7(6), 673-679.
    Lee Rodgers, J., & Nicewander, W. A. (1988). Thirteen Ways to Look at the Correlation Coefficient. The American Statistician, 42(1), 59-66. doi:10.1080/00031305.1988.10475524
    Pal, M., & Foody, G. M. (2010). Feature selection for classification of hyperspectral data by SVM. IEEE Transactions on Geoscience and Remote Sensing, 48(5), 2297-2307.
    Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 6, 2(11), 559-572. doi:10.1080/14786440109462720
    Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C., . . . Pinkus, G. S. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature medicine, 8(1), 68-74.
    Tang, J., Alelyani, S., & Liu, H. (2014). Feature selection for classification: A review. Data Classification: Algorithms and Applications, 37.
    Ho, T. K. (1995). Random decision forests. Paper presented at the Proceedings of 3rd International Conference on Document Analysis and Recognition.
    Xu, X., & Wang, X. (2005). An Adaptive Network Intrusion Detection Method Based on PCA and Support Vector Machines. In X. Li, S. Wang, & Z. Y. Dong (Eds.), Advanced Data Mining and Applications: First International Conference, ADMA 2005, Wuhan, China, July 22-24, 2005. Proceedings (pp. 696-703). Berlin, Heidelberg: Springer Berlin Heidelberg.
    Yeung, K. Y., & Ruzzo, W. L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics, 17(9), 763-774. doi:10.1093/bioinformatics/17.9.763
    林宗勳. Support Vector Machine 簡介 [Introduction to Support Vector Machines].
