The Sampling Methods for Value-Added Database in Data-Mining | National Chengchi University Master's and Doctoral Thesis System

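Graduate student:	陳惠雯
Thesis title:	The Sampling Methods for Value-Added Database in Data-Mining (應用資料採礦技術於資料庫加值中的抽樣方法)
Advisors:	鄭宇庭, 謝邦昌
Degree:	Master
Department:	College of Commerce - Department of Statistics
Year of publication:	2004
Academic year of graduation:	92 (ROC calendar)
Language:	English
Number of pages:	90
Keywords (Chinese):	database, data mining, sampling methods, data value-adding
Keywords (English):	Database, Data Mining, Sampling, Value-added database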

Growing databases are already a defining trend of today's business environment and will remain so for the foreseeable future. Extracting quality information by reviewing the mountains of data that reside on the networks of corporations and organizations, such as sales figures, manufacturing statistics, financial data, and experimental data, is costly, time-consuming, and ineffective. We therefore need a sound and effective method for obtaining only a portion of the data that is representative of the population and that allows us to build a reliable model from the sampled data. Sometimes, however, the database is limited in size. In that case we adopt the relatively new idea of adding attributes or values to the database to enhance the quality of the data. Following such a procedure, a good sampling method is important groundwork for the final goal of obtaining a reliable predictive model. Our research goal is to obtain an effective and representative value-added sample by means of a sampling method, in order to build an accurate predictive model. The concept is straightforward: to get samples that predict well, we need the correct sampling method. The sampling methods under study are simple random sampling, systematic sampling, stratified sampling, and uniform design. The models used are C5.0, logistic regression, and neural networks for a categorical response variable, and stepwise regression for a continuous response variable. The results are discussed in the conclusion section.

Keywords: Database, Data Mining, Sampling, Value-added database
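
Three of the sampling methods named in the abstract (simple random, systematic, and stratified sampling) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the `firms` records, and the `size` stratification field are hypothetical and chosen only for illustration, and proportional allocation is assumed for the stratified case. Uniform design is not covered here.

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, rng):
    """Draw n records without replacement; every subset of size n is equally likely."""
    return rng.sample(records, n)


def systematic_sample(records, n, rng):
    """Pick a random start in the first interval, then take every k-th record."""
    k = len(records) // n          # sampling interval (assumes n <= len(records))
    start = rng.randrange(k)       # random start in [0, k)
    return [records[start + i * k] for i in range(n)]


def stratified_sample(records, n, stratum_of, rng):
    """Sample each stratum in proportion to its size (proportional allocation).

    Rounding may make the total differ slightly from n when the stratum
    shares are not whole numbers.
    """
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(records))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample


# Hypothetical population: 100 records, 20 "large" and 80 "small".
rng = random.Random(0)
firms = [{"id": i, "size": "large" if i % 5 == 0 else "small"} for i in range(100)]

srs = simple_random_sample(firms, 10, rng)
systematic = systematic_sample(firms, 10, rng)
stratified = stratified_sample(firms, 10, lambda f: f["size"], rng)
```

With this population, the stratified sample keeps the 20/80 split of the strata by construction, whereas the simple random sample reproduces it only on average. That difference is the kind of property a comparison of sampling methods would examine.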

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF MODELS
Chapter 1 INTRODUCTION
1.1. Research Background
1.2. Research Motive
1.3. Research Purpose
1.4. Research Flow
Chapter 2 LITERATURE REVIEW
2.1. Database and Relational Database
2.2. Data Warehouse
2.3. Data Mining
2.4. Introduction to Sampling Method
2.4.1. Simple Random Sample
2.4.2. Systematic Sample
2.4.3. Stratified Sample
2.4.4. Uniform Design
2.5. The Predictive Model
2.5.1. Neural Networks
2.5.1.1. Introduction to Neural Network
2.5.1.2. Backpropagation Network
2.5.2. Cluster Methods
2.5.2.1. C5.0
2.5.2.2. CART
2.5.3. Regression Model
2.5.3.1. Stepwise Regression
2.5.3.2. Logistic Regression
Chapter 3 RESEARCH METHODOLOGY
3.1. Research Concept
3.2. Research Frame
Chapter 4 EXPERIMENTAL RESULTS
4.1. Introduction to Database
4.2. The Research Content
4.2.1. The Distribution of Data
4.2.2. Sampling
4.3. Compare the Sampling Methods
4.3.1. C5.0
4.3.2. Neural Networks
4.3.3. Logistic Regression
4.3.4. Stepwise Regression
4.3.5. Compare the Models Accuracy
4.4. The Discussion of Stratified Sampling Method
Chapter 5 CONCLUSION AND RESEARCH DIRECTION
5.1. Conclusion
5.2. Suggestion
5.3. Future Work
REFERENCES
APPENDIX

List of Tables
Table 2.1 the dummy variable table
Table 3.1 the classification table
Table 4.1 all variables
Table 4.2 the research variables
Table 4.3 the continuous variables
Table 4.4 the sample size of the different sample methods
Table 4.5 the correct rates on C5.0
Table 4.6 the mean and variance of correct rates on C5.0
Table 4.7 the alpha values on C5.0
Table 4.8 the mean and variance of the alpha values on C5.0
Table 4.9 the beta values on C5.0
Table 4.10 the mean and variance of the beta values on C5.0
Table 4.11 the result of neural networks
Table 4.12 the correct rates on neural networks
Table 4.13 the mean and variance of the correct rates on neural networks
Table 4.14 the alpha values on neural networks
Table 4.15 the mean and variance of the alpha values on neural networks
Table 4.16 the beta values on neural networks
Table 4.17 the mean and variance of the beta values on neural networks
Table 4.18 the correct rates on logistic regression
Table 4.19 the mean and variance of the correct rates on logistic regression
Table 4.20 the alpha values on logistic regression
Table 4.21 the mean and variance of the alpha values on logistic regression
Table 4.22 the beta values on logistic regression
Table 4.23 the mean and variance of the beta values on logistic regression
Table 4.24 the output of the regression
Table 4.25 the MSE values
Table 4.26 the mean and variance of MSE values
Table 4.27 the compared correct rates on mean and variance
Table 4.28 the compared mean on alpha and beta values
Table 4.29 the correct rates on four stratified variables in C5.0
Table 4.30 the correct rates on four stratified variables in neural networks
Table 4.31 the correct rates on four stratified variables in logistic regression
Table 4.32 the mean of correct rates

List of Figures
Figure 2.1 the relational algebra
Figure 2.2 the organization of Data Warehouse
Figure 2.3 KDD process
Figure 2.4 data mining models and tasks
Figure 2.5 main methodology for data mining
Figure 2.6 the flow of CRISP-DM
Figure 2.7 the original scatter plot
Figure 2.8 the scatter plot after orthogonal
Figure 2.9 the scatter plot for correlated variable
Figure 2.10 the scatter plot for correlated variable without orthogonal in PSA
Figure 2.11 the scatter plot for correlated variable after orthogonal in PSA
Figure 2.12 the model of artificial neural network
Figure 2.13 the backpropagation network
Figure 2.14 stepwise regression method
Figure 2.15 the graph of logistic model
Figure 3.1 the graph of research concept
Figure 3.2 the research frame
Figure 4.1 the distribution of ground
Figure 4.2 the distribution of floor area of buildings
Figure 4.3 the distribution of workers
Figure 4.4 the distribution of salary
Figure 4.5 the distribution of operating expenditures
Figure 4.6 the distribution of operating revenues
Figure 4.7 the distribution of total assets
Figure 4.8 the distribution of fixed assets rented and borrowed
Figure 4.9 the distribution of fixed assets rented and lent
Figure 4.10 the distribution of expenditures on research development and technology acquiring
Figure 4.11 the distribution of expenditures on environment protection
Figure 4.12 the distribution of total value of production
Figure 4.13 the distribution of net value added
Figure 4.14 the distribution of net value of interest expenditures
Figure 4.15 the distribution of current assets
Figure 4.16 the distribution of profit
Figure 4.17 the distribution of triangular trade
Figure 4.18 the distribution of computer
Figure 4.19 the distribution of E-commerce
Figure 4.20 the distribution of profit
Figure 4.21 the result of C5.0
Figure 4.22 the correct rates on C5.0
Figure 4.23 the alpha values on C5.0
Figure 4.24 the beta values on C5.0
Figure 4.25 the correct rates on neural networks
Figure 4.26 the alpha values on neural networks
Figure 4.27 the beta values on neural networks
Figure 4.28 the correct rates on logistic regression
Figure 4.29 the alpha values on logistic regression
Figure 4.30 the beta values on logistic regression
Figure 4.31 the compared correct rates on mean
Figure 4.32 the compared correct rates on variance
Figure 4.33 the compared mean on alpha and beta values
Figure 4.34 the mean of correct rates

List of Models
Function (1) Koksma-Hlawka inequality
Function (2) the Sigmoid function
Function (3) the logistic function
Function (4) the multiple logistic function
Function (5) the multiple logistic function
Function (6) the stepwise regression model

The full text of this thesis is not authorized for public access.
