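Research of acquisition order of missing values in predictive model (預測模型中遺失值之選填順序研究) | National Chengchi University Theses and Dissertations System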


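Author:	施雲天
Thesis title:	預測模型中遺失值之選填順序研究 (Research of acquisition order of missing values in predictive model)
Advisor:	唐揆
Degree:	Master
Department:	College of Commerce - Department of Business Administration
Year of publication:	2014
Academic year of graduation:	102 (ROC calendar)
Language:	Chinese
Number of pages:	46
Keywords (Chinese):	missing values, decision tree classification
Keywords (foreign language):	uncertainty score, missing value acquisition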

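Predictive models are widely used in everyday life, for example in bank credit scoring, consumer behavior analysis, and disease prediction. Whether we are building or applying a predictive model, however, we run into missing values in the training data or the test data, and these degrade predictive performance. Missing values can be handled in many ways: deletion, imputation, model-based methods, and machine learning methods are all available. Another option is to acquire the missing values directly at some cost.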
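This study focuses on acquiring missing values at a cost. It uses a decision tree as the predictive model, because a decision tree can accommodate missing values while it is being built, and it aims to find an acquisition method that reaches higher accuracy at lower cost. We extend the concept and logic of the Uncertainty Score used in Error Sampling and propose U-Sampling, which ranks features by importance. Whereas Error Sampling ranks by the importance of instances (row-based; the original text calls them "subjects"), U-Sampling ranks by the importance of features (column-based).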
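To make the row-based versus column-based distinction concrete, the following is a minimal sketch under stated assumptions, not the thesis's actual U-Sampling procedure, which the abstract does not spell out. It assumes the uncertainty of an instance is derived from the margin between its two most likely class probabilities (a smaller margin means more uncertainty), and that a feature's score is the summed uncertainty of the instances in which that feature is missing. The function names and the aggregation rule are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def margin_uncertainty(class_probs: Sequence[float]) -> float:
    """Margin between the two most likely classes (needs at least two classes).

    A smaller margin means the model is less certain about the instance.
    """
    top_two = sorted(class_probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def rank_features_by_uncertainty(
    rows: List[Dict[str, Optional[float]]],
    row_class_probs: List[Sequence[float]],
) -> List[str]:
    """Column-based ranking (hypothetical aggregation rule).

    Each feature accumulates (1 - margin) over the rows where it is missing,
    so features that are missing mostly in uncertain rows are ranked first.
    Ties are broken alphabetically to keep the output deterministic.
    """
    scores: Dict[str, float] = {}
    for row, probs in zip(rows, row_class_probs):
        uncertainty = 1.0 - margin_uncertainty(probs)  # higher = more uncertain
        for feature, value in row.items():
            if value is None:  # None marks a missing value in this sketch
                scores[feature] = scores.get(feature, 0.0) + uncertainty
    return sorted(scores, key=lambda f: (-scores[f], f))


if __name__ == "__main__":
    rows = [
        {"age": None, "income": 50.0},
        {"age": 30.0, "income": None},
        {"age": None, "income": None},
    ]
    probs = [[0.55, 0.45], [0.9, 0.1], [0.6, 0.4]]
    print(rank_features_by_uncertainty(rows, probs))
```

A row-based method would instead sort the instances themselves by `margin_uncertainty` and acquire all missing values of the most uncertain instances first.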
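We run two sets of experiments on 8 datasets from the UCI Machine Learning Repository: in one the training data contain a given proportion of missing values, and in the other the test data do. We then compare U-Sampling, Random Sampling, and the Error Sampling described in prior literature in terms of accuracy and error reduction rate. The results show that when the training data contain missing values, U-Sampling performs better on more than 70% of the datasets; when the test data contain missing values, U-Sampling performs better on 87.5% of the datasets.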
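The two comparison metrics can be sketched as follows. The abstract does not give the thesis's exact formula for the error reduction rate, so this sketch assumes the common definition: the relative drop in error against a baseline. Which method serves as the baseline is also an assumption. The last helper only restates the arithmetic behind the reported share: with 8 datasets, 87.5% corresponds to 7 of 8.

```python
from typing import Sequence


def accuracy(y_true: Sequence[object], y_pred: Sequence[object]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def error_reduction_rate(baseline_accuracy: float, method_accuracy: float) -> float:
    """Relative reduction in error against a baseline (assumed definition)."""
    baseline_error = 1.0 - baseline_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no error to reduce")
    method_error = 1.0 - method_accuracy
    return (baseline_error - method_error) / baseline_error


def share_of_datasets_won(wins: int, total: int) -> float:
    """Share of datasets on which a method performs best."""
    return wins / total


if __name__ == "__main__":
    # Hypothetical accuracies, used only to exercise the formula.
    print(error_reduction_rate(baseline_accuracy=0.80, method_accuracy=0.90))
    print(share_of_datasets_won(7, 8))  # 0.875, i.e. 87.5% of 8 datasets
```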
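We also examine whether the proportion of missing values affects how well these methods work, which helps determine which acquisition method suits which situation. With U-Sampling, the important features can be acquired first, so that higher accuracy is reached with fewer acquired values, which in turn reduces the cost of handling missing values.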
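An experiment over different missing proportions needs a way to remove a controlled share of values from a complete dataset. The sketch below shows one way to do that, assuming values are removed completely at random from the feature cells only (class labels are kept outside the table). The abstract does not say how the thesis injects missing values, so the mechanism, the function name, and the `seed` parameter are assumptions.

```python
import random
from typing import List, Optional


def mask_at_random(
    table: List[List[float]],
    missing_ratio: float,
    seed: int = 0,
) -> List[List[Optional[float]]]:
    """Return a copy of `table` with a given share of cells set to None.

    Cells are chosen uniformly at random over the whole table, so every
    feature value has the same chance of being removed.
    """
    if not 0.0 <= missing_ratio <= 1.0:
        raise ValueError("missing_ratio must lie in [0, 1]")
    rng = random.Random(seed)  # local generator keeps runs reproducible
    cells = [(r, c) for r, row in enumerate(table) for c in range(len(row))]
    n_missing = round(missing_ratio * len(cells))
    masked: List[List[Optional[float]]] = [list(row) for row in table]
    for r, c in rng.sample(cells, n_missing):
        masked[r][c] = None
    return masked


if __name__ == "__main__":
    complete = [[float(r * 5 + c) for c in range(5)] for r in range(4)]
    for ratio in (0.1, 0.3, 0.5):
        masked = mask_at_random(complete, ratio, seed=42)
        n_none = sum(value is None for row in masked for value in row)
        print(ratio, n_none)
```

Repeating the comparison for each ratio shows whether the ranking of the acquisition methods changes as more values go missing.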

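Table of Contents
Acknowledgements
Abstract (Chinese)
Abstract
List of Tables
List of Figures
Chapter 1 Introduction
Section 1 Research Background
Section 2 Research Motivation and Objectives
Section 3 Research Framework
Section 4 Research Results and Contributions
Section 5 Thesis Structure
Chapter 2 Literature Review
Section 1 Missing Values
2.1.1 Amount of Missing Values
2.1.2 Types of Missing Values
2.1.3 Handling of Missing Values
Section 2 Principles of Decision Trees
Chapter 3 Research Method
Section 1 Research Framework
Section 2 Description of U-Sampling
3.2.1 Research Idea
3.2.2 Program Implementation
3.2.3 Assumptions and Limitations
Section 3 Evaluation of U-Sampling
3.3.1 Evaluation Metrics
3.3.2 Validation Method
3.3.3 Comparison with Random Sampling and Error Sampling
3.3.4 Comparison across Data Types
Chapter 4 Research Results
Section 1 Experimental Data
4.1.1 Introduction to the Datasets
Section 2 Experimental Results
4.2.1 10-Fold Cross-Validation: Training Data with Missing Values
4.2.2 10-Fold Cross-Validation: Test Data with Missing Values
4.2.4 Different Data Types
Chapter 5 Conclusions and Suggestions
Section 1 Conclusions
Section 2 Research Limitations and Suggestions
5.2.1 Research Limitations
5.2.2 Suggestions for Future Work
References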

1. Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage.

2. Alpaydın, E. (2010). Introduction to machine learning. London, England: The MIT Press.

3. Bennett, D. A. (2001). How can I deal with missing data in my study? Australian and New Zealand Journal of Public Health, 25(5), 464–469.

4. Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Introducing Markov chain Monte Carlo. In Markov chain Monte Carlo in practice (pp. 1-19). London: Chapman & Hall/CRC.

5. Graham, J. W. (2003). Adding missing-data-relevant variables to FIML-based structural equation models. Structural Equation Modeling, 10, 80–100.

6. Jackson, J. (2002). Data mining: a conceptual overview. Communications of the Association for Information Systems.

7. Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI, 14(2), 1137-1145.

8. Levin, N., & Zahav, J. (2001, Spring). Predictive modeling using segmentation. Journal of Interactive Marketing, 15(2), 2-22.

9. Melville, P., Saar-Tsechansky, M., Provost, F., & Mooney, R. (2004). Active Feature-Value Acquisition for Classifier Induction. Proceedings of the 4th IEEE International Conference on Data Mining, (pp. 483-486). Brighton, UK.

10. Pallant, J. (2007). SPSS survival manual (3rd ed.). New York, NY: Open University Press.

11. García-Laencina, P. J., Sancho-Gómez, J.-L., & Figueiras-Vidal, A. R. (2010). Pattern classification with missing data: a review. Neural Computing and Applications.

12. Peng, C. Y. J., Harwell, M., Liou, S. M., & Ehman, L. H. (2006). Advances in missing data methods and implications for educational research. In Real data analysis (pp. 31-78). North Carolina, US: Information Age Publishing.

13. Quinlan, J. R. (1989). Unknown attribute values in induction. In ML (pp. 164-168).

14. Rubin, D. B. (1987). Multiple imputation for non-response in surveys. New York: John Wiley & Sons.

15. Saar-Tsechansky, M., Melville, P., & Provost, F. (2009). Active Feature-Value Acquisition. Management Science, 55(4), 664-684.

16. Schafer, J. L. (1999). Multiple imputation: a primer. Statistical Methods in Medical Research, 8(1), 3-15.

17. Schlomer, G. L., Bauman, S., & Card, N. A. (2010). Best Practices for Missing Data Management in Counseling Psychology. Journal of Counseling Psychology, 57(1), 1-10.

18. Simon, H. A., & Lea, G. (1974). Problem solving and rule induction: A unified view. Knowledge and cognition. Oxford, England: Lawrence Erlbaum.

19. Simon, H., & Lea, G. (1974). Problem solving and rule induction: A unified view.

20. Turney, P. (2000). Types of Cost in Inductive Concept Learning. Proceedings of the Cost-Sensitive Learning Workshop at the 17th ICML-2000 Conference. Stanford, CA.

21. Vinod, N. C., & Punithavalli, D. M. (2011). Classification of Incomplete Data Handling Techniques-An Overview. International Journal on Computer Science and Engineering, 3(1), 340-344.

22. Zheng, Z., & Padmanabhan, B. (2002). On Active Learning for Data Acquisition. Proceedings of IEEE International Conference on Data Mining (pp. 562-569).

Online Resources

1. UCI Machine Learning Repository. (n.d.). Retrieved from http://archive.ics.uci.edu/ml/
