Graduate student: 林洸儂 (Lin, Guang-Nung)
Thesis title: A Research of Multi-dimensional and Multigranular Data Mining with In-memory Computing with Yahoo User Profile (original title: 一個基於記憶體內運算之多維度多顆粒度資料探勘之研究-以yahoo user profile為例)
Advisor: 姜國輝 (Chiang, Kuo-Huie)
Oral defense committee: 姜國輝 (Chiang, Kuo-Huie), 黃勝雄 (Huang, S. H.), 季延平 (Chi, Yan-Ping)
Degree: Master
Department: College of Commerce - Department of Management Information Systems
Year of publication: 2016
Graduation academic year: 105 (ROC calendar)
Language: Chinese
Number of pages: 39
Keywords: association rules, Apriori algorithm, data mining
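  • In recent years, advances in cloud computing and improvements in hardware performance have made it feasible to build cluster computing systems by scaling out horizontally across large numbers of machines. Apache Hadoop is an open-source software framework from the Apache Software Foundation. It is a distributed system modeled on Google's MapReduce and the Google File System, and it can manage clusters of thousands of machines or more. Through its distributed file system, HDFS, Hadoop can provide petabyte-scale storage, and through the MapReduce framework it can split an application into small tasks that run on the compute nodes of the cluster.
    Meanwhile, enterprises have accumulated massive amounts of data, and how to process and analyze this structured and unstructured data has become a popular research topic. Traditional data mining methods and algorithms therefore have to be adjusted and improved, and new methods developed, to fit new cloud computing technology and distributed frameworks.
    Association rule mining analyzes the implicit associations among items in a large database; a common application is market basket analysis. Rules are usually mined within a fixed dimension and a fixed range of granularity, but this cannot find rules that hold only within a finer range. For example, mining a full year of transaction data will not reveal the rules among items that consumers buy to celebrate Christmas, whereas restricting the time range to December does.
    Apriori is a well-known algorithm for mining association rules. It generates candidate itemsets, filters them by a user-defined minimum support to obtain frequent itemsets, and then filters by a minimum confidence to obtain the association rules. With k distinct single items there are at most 2^k − 1 candidate itemsets, and counting frequent itemsets requires repeated scans of the entire database, so these two main steps of Apriori consume a great deal of computing power.
    This study therefore proposes an algorithm that partitions the database into multiple data blocks, mines association rules in each block, and then incrementally updates the results, which addresses the loss of association rules when mining over a wide range. The program is implemented on the Spark distributed computing architecture and runs in parallel on a computer cluster to reduce the time needed to mine association rules.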
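As an illustration of the Apriori steps summarized in the abstract above (candidate generation, filtering by minimum support, then filtering by minimum confidence), here is a minimal single-machine Python sketch. It is not the thesis's Spark implementation; the function names `apriori` and `rules` and the toy transactions are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # One full pass over the data per candidate: the expensive step.
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) tuples from frequent itemsets."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Subsets of a frequent itemset are frequent, so lhs is present.
                confidence = s / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, s, confidence))
    return found


# Toy data (hypothetical): 3 distinct items, so at most 2**3 - 1 = 7 itemsets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=0.5)
for lhs, rhs, s, c in rules(frequent, min_confidence=0.9):
    print(sorted(lhs), "->", sorted(rhs), s, c)  # ['butter'] -> ['bread'] 0.5 1.0
```

The prune step is what keeps the search well below the 2^k − 1 worst case on this data: the three-item candidate is discarded without a database scan because `{milk, butter}` is not frequent.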


    Advances in cloud computing and improvements in computer hardware have made it feasible to build clusters by scaling out horizontally across many machines. Apache Hadoop is an open-source software framework from the Apache Software Foundation. It provides cluster resource management, a distributed storage system named the Hadoop Distributed File System (HDFS), and a parallel processing technique called MapReduce.
    Enterprises have accumulated huge amounts of data, and processing and analyzing this structured and unstructured data is an active research topic. Traditional data mining methods and algorithms must be adjusted and improved to suit new cloud computing technology and distributed frameworks.
    Association rules describe relations among items in a large database. They are generally mined at fixed dimensions and a fixed granularity. However, this can lose rules that are infrequent over the whole database.
    Apriori is a well-known algorithm for mining association rules. Two of its main steps consume a lot of computing resources. The first generates candidate itemsets, of which there can be up to 2^k − 1 when there are k distinct items. The second finds the frequent itemsets, which requires comparing candidates against all transactions in the database.
    The proposed approach divides the database into segments and finds the association rules of each segment, then combines the rules of the segments. This solves the problem of missing infrequent itemsets. In addition, the method is implemented in Spark to reduce computing time.
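The segment-then-combine idea in the abstract can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the thesis's algorithm or its Spark code: it counts item pairs only, keeps every pair count per block so that merged supports are exact, and the names `count_pairs`, `frequent_pairs`, `mine_by_block` and the monthly toy data are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_pairs(transactions):
    """Count how many transactions contain each (sorted) item pair."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(counts, n, min_support):
    """Pairs whose support (count / n) reaches min_support."""
    return {pair for pair, c in counts.items() if c / n >= min_support}


def mine_by_block(records, min_support):
    """records: iterable of (block_key, items).

    Returns (per_block, overall): the frequent pairs inside each block,
    and the frequent pairs after merging all block counts.
    """
    blocks = defaultdict(list)
    for key, items in records:
        blocks[key].append(items)

    per_block = {}
    merged = Counter()
    total = 0
    for key, txns in blocks.items():
        # Each block is independent, so this loop could run in parallel.
        counts = count_pairs(txns)
        per_block[key] = frequent_pairs(counts, len(txns), min_support)
        merged.update(counts)  # Counter.update adds counts together
        total += len(txns)
    overall = frequent_pairs(merged, total, min_support)
    return per_block, overall


# Hypothetical transactions keyed by month.
records = [
    (10, ["bread", "milk"]), (10, ["bread", "milk"]), (10, ["bread"]), (10, ["milk"]),
    (11, ["bread", "milk"]), (11, ["bread", "milk"]), (11, ["eggs"]), (11, ["milk"]),
    (12, ["lights", "tree"]), (12, ["lights", "tree"]),
    (12, ["lights", "tree", "bread"]), (12, ["bread", "milk"]),
]
per_block, overall = mine_by_block(records, min_support=0.4)
print(per_block[12])  # {('lights', 'tree')}
print(overall)        # {('bread', 'milk')}
```

On this data the pair `('lights', 'tree')` has support 3/4 within December but only 3/12 over the whole period, so mining the full range alone would miss it, which is the effect the abstract describes.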

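    Chapter 1 Introduction 1
    Section 1 Research Background 1
    Section 2 Research Motivation 2
    Section 3 Research Objectives 3
    Chapter 2 Literature Review 4
    Section 1 Association Rules 4
    1. Definition of Association Rules 5
    2. Evaluation Measures 5
    3. Types of Association Rules 6
    Section 2 Apriori Algorithm 7
    1. Candidate Itemsets 7
    2. Frequent Itemsets 8
    3. Algorithm Flow 10
    Section 3 Multi-dimensional Transaction Databases 11
    Section 4 Multi-dimensional Association Rules 12
    Section 5 Multi-level Association Rules 13
    1. Concept Hierarchy Tree 13
    2. Multi-dimensional Patterns 14
    Section 6 Apache Hadoop 15
    1. HDFS 15
    2. YARN 16
    3. MapReduce 17
    Section 7 In-memory Computing 18
    Chapter 3 Research Method 20
    Section 1 Defining the Multi-dimensional Database 21
    Section 2 Defining the Number of Concept Hierarchy Levels 22
    Section 3 Generating Multi-dimensional Patterns 23
    Section 4 Designing the Parallel Block Update Strategy 25
    1. Partitioning into Element Blocks 25
    2. Mining Multi-dimensional, Multi-granularity Association Rules per Block 26
    3. Parallel Apriori Algorithm 26
    4. Merging and Updating Block Association Rules in Parallel 28
    Section 5 Spark Implementation 29
    Chapter 4 Experimental Results and Discussion 30
    Section 1 Spark Runtime Environment 30
    Section 2 Case Study 31
    Chapter 5 Conclusions and Recommendations 35
    Section 1 Conclusions and Contributions 35
    Section 2 Suggestions for Future Research 36
    Chapter 6 References 37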

