
Graduate student: 陳柏羽
Chen, Po Yu
Thesis title: 結合中文斷詞系統與雙分群演算法於音樂相關臉書粉絲團之分析:以KKBOX為例
Combining Chinese text segmentation system and co-clustering algorithm for analysis of music related Facebook fan page: A case of KKBOX
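Advisor: 徐國偉
Hsu, Kuo Wei
Oral examination committee: 沈錳坤
Shan, Man Kwan
黃信貿
Huang, Xin Mao
Degree: Master's
Master
Department: College of Science - Department of Computer Science
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: Chinese
Number of pages: 106
Chinese keywords: co-clustering, Chinese word segmentation, Facebook fan page, posts
English keywords: Co-clustering, Chinese text segmentation system, Facebook fan page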
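  • In recent years, the spread of smartphones and the Internet has driven rapid growth in social networking sites and online music streaming. As of last year, Facebook averaged as many as 1.86 billion users per month, and fan pages have become a marketing channel that companies watch closely. Posts on a fan page can spread to users' pages in a short time through views and shares, achieving a better effect than television advertising while saving considerable cost. This study presents a clustering workflow for Facebook fan page posts. Given the complexity of the vocabulary used in posts, we crawled not only the posts of the Facebook fan page but also the related KKBOX web pages, and integrated the data from those pages to expand the corpus of the Chinese word segmentation system (Jieba) and so improve segmentation accuracy. We then clustered the posts with a co-clustering algorithm (Minimum Squared Residue Co-Clustering Algorithm) and evaluated the clustering results using the Discrimination Rate and the Agglomerate Rate together with distribution plots produced by Principal Component Analysis, selecting the better results for further analysis in order to identify the basis of the grouping. The results show that the proposed method can effectively distinguish different types of posts, and can even cluster them according to differences in vocabulary, syntax, or layout.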


    In recent years, as smartphones and the Internet have become more widespread, social networking sites and music streaming services have grown rapidly. The monthly average number of Facebook users reached 1.86 billion last year, and Facebook Fan Pages have become a popular marketing tool. Posts on Facebook can reach millions of people in a short period of time as users like and share them. Using a Facebook Fan Page as a marketing tool is more effective than advertising on television and also reduces costs. This study presents a process for clustering posts on a Facebook Fan Page. Because the word usage in posts is complicated, we crawled the posts on the Facebook Fan Page together with related information on the KKBOX website. First, we integrated the information from the KKBOX website and used it to expand the text corpus of Jieba, a Chinese word segmentation system, to improve segmentation accuracy. Then, we clustered the posts into several groups with the Minimum Squared Residue Co-Clustering Algorithm and evaluated the clustering results using the Discrimination Rate and the Agglomerate Rate together with the distribution plots produced by Principal Component Analysis. We selected the better clustering results for further analysis in order to identify the basis on which the posts were grouped. The results show that the proposed method can effectively separate different kinds of posts and can even cluster them according to their wording, syntax, and layout.
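The abstract describes expanding Jieba's corpus with terms from KKBOX pages to improve segmentation. The sketch below illustrates why a larger lexicon helps, using a simple forward maximum matching segmenter. This is a stand-in technique chosen for brevity, not Jieba's actual algorithm and not the thesis's implementation; the function `segment`, the sample post, and both word lists are hypothetical. In Jieba itself, custom terms are usually supplied through `jieba.load_userdict` or `jieba.add_word`.

```python
def segment(text, lexicon, max_len=6):
    """Forward maximum matching: at each position take the longest lexicon word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon word matches.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in lexicon:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            # No multi-character word matched here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical general-purpose lexicon and a hypothetical fan page post.
base_lexicon = {"新歌", "排行榜"}
post = "五月天新歌登上排行榜"

before = segment(post, base_lexicon)

# Hypothetical domain terms, standing in for names harvested from KKBOX pages.
expanded_lexicon = base_lexicon | {"五月天", "登上"}
after = segment(post, expanded_lexicon)

print(before)
print(after)
```

With the base lexicon, the artist name is split into single characters; once the domain terms are added, it is kept as one token, which is the kind of improvement the abstract attributes to the expanded corpus.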

    Chapter 1 Introduction
    1.1 Research Background
    1.1.1 History of KKBOX
    1.1.2 Facebook Fan Pages
    1.2 Research Motivation
    1.3 Research Objectives
    1.4 Research Methods
    1.5 Thesis Organization
    Chapter 2 Literature Review
    2.1 SOCIAL MEDIA
    2.2 DOCUMENT CLUSTERING
    2.3 Summary
    Chapter 3 Data Processing
    3.1 DATA CRAWLING
    3.1.1 Facebook Fan Page
    3.1.2 KKBOX Charts
    3.2 DATA CLEAN
    3.3 DATA MERGE
    Chapter 4 Statistical Analysis
    4.1 BOKEH
    4.1.1 Pandas
    4.1.2 Bokeh Chart and Models
    4.2 Statistical Analysis
    Chapter 5 Word Segmentation and Co-Clustering Algorithms
    5.1 Word Segmentation
    5.1.1 CKIP
    5.1.2 Jieba
    5.1.3 Comparison of CKIP and Jieba
    5.2 CO-CLUSTERING
    5.2.1 Information Theoretic Co-Clustering Algorithm
    5.2.2 Minimum Squared Residue Co-Clustering Algorithm
    Chapter 6 Experimental Results and Discussion
    6.1 Experimental Environment and Procedure
    6.1.1 Experimental Environment
    6.1.2 Experimental Procedure
    6.2 Experimental Design
    6.2.1 Compressed Column Storage
    6.2.2 Principal Component Analysis
    6.2.3 Agglomerate rate and Discrimination rate
    6.3 Experiments
    6.3.1 Clustering Algorithm Experiments
    6.3.2 Row Clustering Experiments
    6.3.3 Column Clustering Experiments
    6.3.4 Comparison with Other Methods
    6.4 Experimental Results
    Chapter 7 Conclusions and Possible Future Research Directions
    7.1 Conclusions
    7.2 Possible Future Research Directions
    References
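The outline lists the Minimum Squared Residue Co-Clustering Algorithm as the method used to group posts. As a minimal sketch of the quantity such an algorithm tries to minimize, the code below computes the sum of squared residues for a given assignment of rows and columns to clusters, using the simpler residue form in which each entry is compared with the mean of its co-cluster. Which residue form the thesis uses is not stated here, so this choice is an assumption; the function name `squared_residue` and the toy post-by-term matrix are hypothetical, and the iterative update of assignments is omitted.

```python
def squared_residue(matrix, row_labels, col_labels):
    """Sum of squared residues h_ij = a_ij - mean of the co-cluster holding (i, j)."""
    sums, sizes = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            sums[key] = sums.get(key, 0.0) + value
            sizes[key] = sizes.get(key, 0) + 1
    # Mean value of every (row cluster, column cluster) block.
    means = {key: sums[key] / sizes[key] for key in sums}

    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            total += (value - means[(row_labels[i], col_labels[j])]) ** 2
    return total


# Toy post-by-term count matrix (rows: posts, columns: terms), invented for illustration.
post_term = [
    [3, 3, 0, 0],
    [3, 3, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
]

# Assignment that follows the block structure versus one that cuts across it.
aligned = squared_residue(post_term, [0, 0, 1, 1], [0, 0, 1, 1])
mixed = squared_residue(post_term, [0, 1, 0, 1], [0, 1, 0, 1])

print(aligned, mixed)
```

An assignment that matches the block structure of the matrix yields a smaller residue than one that mixes the blocks, which is why minimizing this objective tends to group posts and terms that co-occur.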

