| Graduate student: | 孫肇祥 Sun, Jhao Siang |
|---|---|
| Thesis title: | Using R and Hadoop/MapReduce for FOAF-based Social Network Analytics |
| Advisor: | 胡毓忠 Hu, Yuh Jong |
| Degree: | Master |
| Department: | College of Science, Executive Master Program of Computer Science |
| Publication year: | 2014 |
| Graduation academic year: | 102 |
| Language: | Chinese |
| Pages: | 50 |
| Chinese keywords: | RDF(S), R and Hadoop/MapReduce, FOAF, Hadoop, MapReduce, social network analysis |
| English keywords: | FOAF, Social network analytics |
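Decentralized online social networks use the RDF(S)-based FOAF format to store personal data and social networks on a trusted third-party Hadoop cluster. Faced with large volumes of social network data, traditional analysis methods run into many processing and storage problems. This study combines R with Hadoop/MapReduce and proposes three analysis approaches, R + Hadoop Streaming (RHS), R + MySQL (RMS), and R + Hive (RH), to overcome the computation and storage bottlenecks of analyzing large volumes of FOAF data. We first load the FOAF datasets into the Hadoop cluster and use MapReduce's distributed computation to digest most of the data in advance, which solves the problem that the single-machine memory available to the R statistical software is insufficient for large files. The subsequent analysis in R in turn solves the problem that MapReduce computation alone cannot perform in-depth social network analysis. Decomposing the data in advance makes it possible to process larger FOAF datasets and makes the approach more scalable. The method applies to both unstructured and structured data. Given the daily surge in social network data, integrating R more closely with Hadoop/MapReduce, and combining them with HBase or existing parallel R software, are directions for future research.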
Decentralized online social networks are encoded in the RDF(S)-based FOAF data format. These FOAF datasets, stored on a trusted Hadoop cluster, represent Web users' personal data and their social relationships. Traditional data analysis techniques face numerous processing and storage challenges at this volume. In this study, we apply three R and Hadoop/MapReduce integration techniques to high-volume FOAF data analysis: R + Hadoop Streaming (RHS), R + MySQL (RMS), and R + Hive (RH). We first ingest the FOAF datasets into the Hadoop cluster and pre-process them with the MapReduce distributed programming paradigm, then apply R for FOAF data analysis. This resolves two major problems: R cannot read a high volume of FOAF data into memory, and MapReduce computation alone is limited in the social network analysis it can perform. High volumes of FOAF data can be distributed and stored effectively on the Hadoop platform for scalable data processing. The R + Hadoop/MapReduce techniques can be applied to both structured and unstructured data. Future work will address how to integrate R and Hadoop/MapReduce more effectively and how to leverage HBase or parallel R programming for high-volume data analytics.
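The pre-processing idea in the abstract (reduce the raw FOAF triples on the cluster first, then hand a much smaller table to R) can be illustrated with a minimal sketch. The thesis uses R scripts with Hadoop Streaming; the version below swaps in Python purely for illustration and simulates the shuffle step in one process. The function names `mapper`, `reducer`, and `run_local`, the `example.org` URIs, and the choice of counting `foaf:knows` statements per person are all assumptions, since the abstract does not say which statistics were pre-aggregated.

```python
import re
from itertools import groupby

# Predicate URI from the FOAF vocabulary namespace.
FOAF_KNOWS = "<http://xmlns.com/foaf/0.1/knows>"

# Simplified N-Triples pattern: subject and object are URIs or blank nodes.
# Triples with literal objects do not match and are skipped.
TRIPLE_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(<[^>]*>|_:\S+)\s*\.\s*$'
)


def mapper(lines):
    """Emit (subject, 1) for every foaf:knows triple (hypothetical helper)."""
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match and match.group(2) == FOAF_KNOWS:
            yield match.group(1), 1


def reducer(pairs):
    """Sum the counts per key. Input must already be sorted by key,
    which is what the Hadoop shuffle phase provides to a real reducer."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce in a single process."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = [
        '<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .',
        '<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/carol> .',
        '<http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .',
        '<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .',
    ]
    print(run_local(sample))
    # {'<http://example.org/alice>': 2, '<http://example.org/bob>': 1}
```

In an actual streaming job the mapper and reducer would be separate scripts reading standard input and writing tab-separated key/value lines. The reduced per-person counts are small enough to load into a single-machine tool such as R for the deeper network analysis the abstract describes.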
Abstract (Chinese)
Abstract (English)
Acknowledgments
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Objectives
1.3 Chapter Overview
Chapter 2 Research Background
2.1 Hadoop
2.2 Hive
2.3 R
Chapter 3 Related Work
3.1 FOAF (Friend of a Friend)
3.2 Social Network Analysis (SNA)
3.3 Integration of R and Hadoop
3.3.1 RHadoop
3.3.2 RHIPE
3.3.3 Hadoop Streaming
Chapter 4 Method and Architecture Design
4.1 Research Framework
4.2 FOAF Analysis
4.2.1 R + Hadoop Streaming Analysis (RHS Analytics)
4.2.2 R + MySQL Analysis (RMS Analytics)
4.2.3 R + Hive Analysis (RH Analytics)
Chapter 5 System Implementation
5.1 System Architecture
5.2 Data Sources
5.3 FOAF Data Analysis
5.3.1 R + Hadoop Streaming Analysis (RHS Analytics)
5.3.2 R + MySQL Analysis (RMS Analytics)
5.3.3 R + Hive Analysis (RH Analytics)
5.3.4 Performance Comparison
Chapter 6 Conclusion and Future Work
References