
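Graduate student: 劉文友
Liu, Wen Yu
Thesis title: 在Spark大數據平台上分析DBpedia開放式資料:以電影票房預測為例
Analyzing DBpedia Linked Open Data (LOD) on Spark: Movie Box Office Prediction as an Example
Advisor: 胡毓忠
Hu, Yuh Jong
Degree: Master
Department: College of Science - Department of Computer Science
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: Chinese
Number of pages: 48
Chinese keywords: 開放式鏈結資料, 資源描述框架, 巨量資料, Spark, 簡單貝氏分類, 貝氏網路
Foreign-language keywords: LOD, RDF, Big Data, Spark, Naive Bayes, Bayesian Network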
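  • In recent years, Linked Open Data (LOD) has been recognized as holding a large amount of potential value. How to collect and integrate diverse LOD and provide it to data analysts for extraction and analysis has become an important challenge in current research. LOD is stored in the RDF (Resource Description Framework) data format. RDF data can be queried with SPARQL, but for large volumes of RDF data there is still no integrated system for storage, querying, and analysis that is both high-performance and easily scalable, and research on analysis pipelines for big RDF data is also incomplete. Taking movie box office prediction as an example, this study uses the DBpedia LOD dataset, links it to external movie databases (for example, IMDb), and performs large-scale graph analysis on the Spark big data platform. First, two algorithms, naive Bayes classification and Bayesian networks, are used to build movie box office prediction models, and the Bayesian Information Criterion (BIC) is used to find the best Bayesian network structure. Next, multiclass ROC curves and AUC values are computed to evaluate the accuracy of the prediction models in this case.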
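To make the BIC-based structure selection mentioned in the abstract concrete, the following is a minimal sketch in plain Python. It is not the thesis's implementation: the variable names (`budget`, `box_office`), the toy data, and the two candidate structures are hypothetical, and the sign convention (log-likelihood minus half the parameter count times ln n, so higher is better) is one common choice among several.

```python
import math
from collections import Counter


def bic_score(rows, parents):
    """BIC of a discrete Bayesian network: log-likelihood - (k / 2) * ln(n).

    rows    : list of dicts mapping variable name -> discrete value
    parents : dict mapping each variable -> tuple of its parent variables
    Higher is better under this sign convention.
    """
    n = len(rows)
    # Number of distinct observed values per variable.
    levels = {v: len({r[v] for r in rows}) for v in parents}
    log_lik, n_params = 0.0, 0
    for var, pa in parents.items():
        # Counts of (parent configuration, value) and of parent configuration.
        joint = Counter((tuple(r[p] for p in pa), r[var]) for r in rows)
        marg = Counter(tuple(r[p] for p in pa) for r in rows)
        # Maximum-likelihood log-likelihood of this node given its parents.
        log_lik += sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
        # Free parameters: (levels - 1) per parent configuration.
        n_params += (levels[var] - 1) * math.prod(levels[p] for p in pa)
    return log_lik - 0.5 * n_params * math.log(n)


# Hypothetical toy data: box office outcome mostly follows budget.
rows = (
    [{"budget": "high", "box_office": "hit"}] * 40
    + [{"budget": "low", "box_office": "flop"}] * 40
    + [{"budget": "high", "box_office": "flop"}] * 10
    + [{"budget": "low", "box_office": "hit"}] * 10
)

# Two candidate structures: no edge, and budget -> box_office.
no_edge = {"budget": (), "box_office": ()}
with_edge = {"budget": (), "box_office": ("budget",)}

for name, structure in [("no edge", no_edge), ("budget -> box_office", with_edge)]:
    print(name, round(bic_score(rows, structure), 2))
```

On this toy data the structure with the edge scores higher, because the gain in log-likelihood outweighs the penalty for the extra parameter. A real structure search would score many candidate graphs rather than two.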


    In recent years, Linked Open Data (LOD) has been recognized as containing a large amount of potential value. How to collect and integrate multiple LOD sources for effective analytics has become a research challenge. LOD is represented in the Resource Description Framework (RDF) format, which can be queried through the SPARQL language. However, large-scale RDF data still lacks a high-performance, scalable storage and analysis system. Moreover, the analytics pipeline for big RDF data is far from complete. The purpose of this study is to address these research issues. A movie box office prediction scenario is demonstrated using DBpedia linked with the external IMDb movie database. We perform DBpedia big graph analytics on the Apache Spark platform. Prediction models are built with Naïve Bayes and Bayesian networks, and the optimal Bayesian network structure is selected using the Bayesian Information Criterion (BIC). ROC curves and AUC values of the resulting models are then computed to evaluate our approach.
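As an illustration of how an AUC value can be obtained for a classifier with more than two classes, here is a minimal sketch of the pairwise-averaged measure proposed by Hand and Till (2001). Whether the thesis computes its AUC in exactly this way is an assumption; the class labels (`low`, `mid`, `high`) and the probability values are hypothetical.

```python
from itertools import combinations


def pairwise_auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def multiclass_auc(labels, probs, classes):
    """Average of pairwise AUCs over all unordered class pairs.

    labels  : list of true class labels
    probs   : list of dicts mapping class -> predicted probability
    classes : list of all class labels (each must occur at least once in labels)
    """
    total = 0.0
    for i, j in combinations(classes, 2):
        # How well the class-i score separates true i from true j.
        a_ij = pairwise_auc(
            [p[i] for y, p in zip(labels, probs) if y == i],
            [p[i] for y, p in zip(labels, probs) if y == j],
        )
        # How well the class-j score separates true j from true i.
        a_ji = pairwise_auc(
            [p[j] for y, p in zip(labels, probs) if y == j],
            [p[j] for y, p in zip(labels, probs) if y == i],
        )
        total += (a_ij + a_ji) / 2
    n_pairs = len(classes) * (len(classes) - 1) / 2
    return total / n_pairs


# Hypothetical box-office levels and predicted class probabilities.
classes = ["low", "mid", "high"]
labels = ["low", "mid", "high", "low", "mid", "high"]
perfect = [{c: (0.8 if c == y else 0.1) for c in classes} for y in labels]
uninformative = [{c: 1 / 3 for c in classes} for _ in labels]

print(multiclass_auc(labels, perfect, classes))
print(multiclass_auc(labels, uninformative, classes))
```

A classifier that always ranks the true class highest reaches the maximum value, while one that assigns identical probabilities to every class sits at the chance level. Averaging over class pairs keeps the measure insensitive to class imbalance, which matters when box-office levels are unevenly populated.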

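    Chapter 1 Introduction
    1.1 Research Motivation
    1.2 Research Objectives
    1.3 Research Results
    1.4 Chapter Overview
    Chapter 2 Background
    2.1 DBpedia
    2.2 Naive Bayes Classification
    2.3 Bayesian Networks
    2.4 Apache Spark
    2.4.1 Spark SQL
    2.4.2 GraphX
    2.4.3 MLlib
    Chapter 3 Related Work
    3.1 RDF Data Management Systems
    3.2 Machine Learning on Linked Data
    3.3 Knowledge Graphs
    3.4 Bayesian Network Learning
    Chapter 4 Research Method and Architecture
    4.1 Research Architecture
    4.2 Data Preprocessing
    4.3 Feature Extraction
    4.4 Movie Box Office Prediction Models
    4.4.1 Naive Bayes Classification Model
    4.4.2 Bayesian Network Model
    Chapter 5 Model Evaluation
    5.1 Model Validation
    5.2 Evaluation of Box Office Prediction Levels
    Chapter 6 Conclusion and Future Work
    References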

