Graduate student: 陳惠雯
Thesis title: 應用資料採礦技術於資料庫加值中的抽樣方法
THE SAMPLING METHODS FOR VALUE-ADDED DATABASE IN DATA-MINING
Advisors: 鄭宇庭, 謝邦昌
Degree: Master
Department: College of Commerce - Department of Statistics
Year of publication: 2004
Academic year of graduation: 92 (ROC calendar)
Language: English
Number of pages: 90
Chinese keywords: 資料庫、資料採礦、抽樣方法、資料加值
English keywords: Database, Data Mining, Sampling, Value-added database


  • Databases keep growing, a trend in today's business environment that is likely to continue for the foreseeable future. Extracting quality information by reviewing all of the data held on corporate or organizational networks, such as sales figures, manufacturing statistics, financial data, and experimental data, is costly, time-consuming, and ineffective. We therefore need a sound and efficient method for drawing only a portion of the data that is representative of the population and from which a reliable model can be built. Sometimes, however, the database is limited in size. In that case we adopt a relatively new idea: adding attributes or values to the database to enhance the quality of the data. Within such a procedure, a good sampling method is essential groundwork for the final goal of a reliable predictive model. The goal of this research is therefore to obtain an effective and representative value-added sample, by means of a sampling method, for building an accurate predictive model. The premise is straightforward: good predictive samples require the right sampling method. The sampling methods studied are simple random sampling, systematic sampling, stratified sampling, and uniform design. The models used are C5.0, logistic regression, and neural networks for a categorical predicted variable, and stepwise regression for a continuous predicted variable. The results are discussed in the conclusion chapter.

    Keywords: Database, Data Mining, Sampling, Value-added database
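Three of the sampling methods named in the abstract can be sketched as follows. This is a minimal illustration using only the Python standard library, not the thesis's own procedure: the function names, the `sector` field, and the toy records are hypothetical, proportional allocation is assumed for the stratified case, and uniform design is not covered.

```python
import random
from collections import defaultdict


def simple_random_sample(records, n, rng):
    """Draw n records without replacement, each subset equally likely."""
    return rng.sample(records, n)


def systematic_sample(records, n, rng):
    """Pick a random start, then take every k-th record (requires n <= len(records))."""
    k = len(records) // n          # sampling interval
    start = rng.randrange(k)       # random start in [0, k)
    return [records[start + i * k] for i in range(n)]


def stratified_sample(records, n, key, rng):
    """Sample each stratum in proportion to its share of the population.

    Rounding means the total can differ slightly from n when the
    shares are not whole numbers (an accepted simplification here).
    """
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for group in strata.values():
        share = max(1, round(n * len(group) / len(records)))
        sample.extend(rng.sample(group, min(share, len(group))))
    return sample


if __name__ == "__main__":
    # Hypothetical toy population: 100 records in two strata (60 / 40).
    population = [{"id": i, "sector": "A" if i < 60 else "B"} for i in range(100)]
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(len(simple_random_sample(population, 10, rng)))
    print([r["id"] for r in systematic_sample(population, 10, rng)])
    print(len(stratified_sample(population, 10, lambda r: r["sector"], rng)))
```

Passing the random generator in explicitly keeps each draw reproducible, which matters when the same sampling scheme is repeated to compare the mean and variance of model accuracy across samples.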

    ABSTRACT
    LIST OF TABLES
    LIST OF FIGURES
    LIST OF MODELS
    Chapter 1 INTRODUCTION
    1.1. Research Background
    1.2. Research Motive
    1.3. Research Purpose
    1.4. Research Flow
    Chapter 2 LITERATURE REVIEW
    2.1. Database and Relational Database
    2.2. Data Warehouse
    2.3. Data Mining
    2.4. Introduction to Sampling Methods
    2.4.1. Simple Random Sample
    2.4.2. Systematic Sample
    2.4.3. Stratified Sample
    2.4.4. Uniform Design
    2.5. The Predictive Model
    2.5.1. Neural Networks
    2.5.1.1. Introduction to Neural Networks
    2.5.1.2. Backpropagation Network
    2.5.2. Cluster Methods
    2.5.2.1. C5.0
    2.5.2.2. CART
    2.5.3. Regression Model
    2.5.3.1. Stepwise Regression
    2.5.3.2. Logistic Regression
    Chapter 3 RESEARCH METHODOLOGY
    3.1. Research Concept
    3.2. Research Frame
    Chapter 4 EXPERIMENTAL RESULTS
    4.1. Introduction to Database
    4.2. The Research Content
    4.2.1. The Distribution of Data
    4.2.2. Sampling
    4.3. Comparison of the Sampling Methods
    4.3.1. C5.0
    4.3.2. Neural Networks
    4.3.3. Logistic Regression
    4.3.4. Stepwise Regression
    4.3.5. Comparison of Model Accuracy
    4.4. The Discussion of Stratified Sampling Method
    Chapter 5 CONCLUSION AND RESEARCH DIRECTION
    5.1. Conclusion
    5.2. Suggestion
    5.3. Future Work
    REFERENCES
    APPENDIX

    List of Tables
    Table 2.1 the dummy variable table
    Table 3.1 the classification table
    Table 4.1 all variables
    Table 4.2 the research variables
    Table 4.3 the continuous variables
    Table 4.4 the sample size of the different sampling methods
    Table 4.5 the correct rates on C5.0
    Table 4.6 the mean and variance of correct rates on C5.0
    Table 4.7 the alpha values on C5.0
    Table 4.8 the mean and variance of the alpha values on C5.0
    Table 4.9 the beta values on C5.0
    Table 4.10 the mean and variance of the beta values on C5.0
    Table 4.11 the result of neural networks
    Table 4.12 the correct rates on neural networks
    Table 4.13 the mean and variance of the correct rates on neural networks
    Table 4.14 the alpha values on neural networks
    Table 4.15 the mean and variance of the alpha values on neural networks
    Table 4.16 the beta values on neural networks
    Table 4.17 the mean and variance of the beta values on neural networks
    Table 4.18 the correct rates on logistic regression
    Table 4.19 the mean and variance of the correct rates on logistic regression
    Table 4.20 the alpha values on logistic regression
    Table 4.21 the mean and variance of the alpha values on logistic regression
    Table 4.22 the beta values on logistic regression
    Table 4.23 the mean and variance of the beta values on logistic regression
    Table 4.24 the output of the regression
    Table 4.25 the MSE values
    Table 4.26 the mean and variance of MSE values
    Table 4.27 the compared correct rates on mean and variance
    Table 4.28 the compared mean on alpha and beta values
    Table 4.29 the correct rates on four stratified variables in C5.0
    Table 4.30 the correct rates on four stratified variables in neural networks
    Table 4.31 the correct rates on four stratified variables in logistic regression
    Table 4.32 the mean of correct rates

    List of Figures
    Figure 2.1 the relational algebra
    Figure 2.2 the organization of Data Warehouse
    Figure 2.3 KDD process
    Figure 2.4 data mining models and tasks
    Figure 2.5 main methodology for data mining
    Figure 2.6 the flow of CRISP-DM
    Figure 2.7 the original scatter plot
    Figure 2.8 the scatter plot after orthogonalization
    Figure 2.9 the scatter plot for correlated variable
    Figure 2.10 the scatter plot for correlated variable without orthogonalization in PSA
    Figure 2.11 the scatter plot for correlated variable after orthogonalization in PSA
    Figure 2.12 the model of artificial neural network
    Figure 2.13 the backpropagation network
    Figure 2.14 stepwise regression method
    Figure 2.15 the graph of logistic model
    Figure 3.1 the graph of research concept
    Figure 3.2 the research frame
    Figure 4.1 the distribution of ground
    Figure 4.2 the distribution of floor area of buildings
    Figure 4.3 the distribution of workers
    Figure 4.4 the distribution of salary
    Figure 4.5 the distribution of operating expenditures
    Figure 4.6 the distribution of operating revenues
    Figure 4.7 the distribution of total assets
    Figure 4.8 the distribution of fixed assets rented and borrowed
    Figure 4.9 the distribution of fixed assets rented and lent
    Figure 4.10 the distribution of expenditures on research development and technology acquiring
    Figure 4.11 the distribution of expenditures on environment protection
    Figure 4.12 the distribution of total value of production
    Figure 4.13 the distribution of net value added
    Figure 4.14 the distribution of net value of interest expenditures
    Figure 4.15 the distribution of current assets
    Figure 4.16 the distribution of profit
    Figure 4.17 the distribution of triangular trade
    Figure 4.18 the distribution of computer
    Figure 4.19 the distribution of E-commerce
    Figure 4.20 the distribution of profit
    Figure 4.21 the result of C5.0
    Figure 4.22 the correct rates on C5.0
    Figure 4.23 the alpha values on C5.0
    Figure 4.24 the beta values on C5.0
    Figure 4.25 the correct rates on neural networks
    Figure 4.26 the alpha values on neural networks
    Figure 4.27 the beta values on neural networks
    Figure 4.28 the correct rates on logistic regression
    Figure 4.29 the alpha values on logistic regression
    Figure 4.30 the beta values on logistic regression
    Figure 4.31 the compared correct rates on mean
    Figure 4.32 the compared correct rates on variance
    Figure 4.33 the compared mean on alpha and beta values
    Figure 4.34 the mean of correct rates

    List of Models
    Function (1) Koksma-Hlawka inequality
    Function (2) the sigmoid function
    Function (3) the logistic function
    Function (4) the multiple logistic function
    Function (5) the multiple logistic function
    Function (6) the stepwise regression model


    The full text is not authorized for public release.