Graduate student: 李宗𤳉
Li, Zong-Han
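Thesis title: 集成學習框架下BERTopic主題學習之於企業違約預測 (BERTopic Topic Learning under an Ensemble Learning Framework for Corporate Default Prediction)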
Frame of Ensemble Learning Using the Latent Dirichlet Allocation Model and BERTopic for Corporate Default Prediction
Advisor: 江彌修
Chiang, Mi-Hsiu
Degree: Master
Department: College of Commerce - Department of Money and Banking
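Year of publication: 2021
Academic year of graduation: 109 (ROC calendar)
Language: Chinese
Number of pages: 48
Keywords (Chinese): 公司違約預測 (company default prediction), 機器學習 (machine learning), 主題模型 (topic model)
Keywords (English): Company default prediction, Machine learning, Topic model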
DOI URL: http://doi.org/10.6814/NCCU202100708
    Following asset securitization, credit-risk derivatives such as collateralized debt obligations (CDOs) and credit default swaps (CDSs) flourished. In 2007, falling real estate prices pushed up loan default rates, and default rates on misjudged CDOs rose sharply, leading to the 2008 financial crisis. This study focuses on corporate default prediction models. It combines the financial data model of Duan, Sun and Wang (2012) with the approach of Lopatta, Gloger and Jaeschke (2017), who use textual information in a corporate default model, to build the training and test data. It then applies the deep learning approach of Peinelt, Nguyen and Liakata (2020) together with the LDA topic model of Blei, Ng and Jordan (2003), along with machine learning methods, to predict corporate default, and compares the accuracy of the traditional Logit model with that of two machine learning models, Random Forest and XGBoost.
    The results show that when textual information from the topic models is added, the LDA model is more accurate than using financial data alone under every topic-number setting, and the BERTopic model shows the same effect when the number of topics is small. For both text-mining models, default prediction models trained with machine learning are also more accurate than the traditional Logit model. When the sample data set is imbalanced, the SMOTE algorithm of Chawla et al. (2002) can be used: by oversampling, it enlarges the set of default samples and addresses the shortage of financial data for defaulting firms. The empirical results show that after SMOTE is added, AUC scores improve significantly for every combination of algorithm and topic number, which indicates that a synthetic-data algorithm is necessary when training on imbalanced data sets.
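
The oversampling idea behind SMOTE, as described in the abstract, can be illustrated with a minimal sketch. It is not the thesis's code: the function name `smote`, its parameters, and the toy data are assumptions made for illustration, and the base sample is drawn at random rather than by looping over every minority sample as in Chawla et al. (2002). Each synthetic point is placed on the line segment between a minority (default) sample and one of its k nearest minority neighbours.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    minority: list of equal-length feature vectors (lists of floats)
    from the rare class, e.g. defaulted firms.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than peers

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest to the base first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sq_dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two made-up financial ratios per defaulted firm.
defaults = [[0.10, 2.0], [0.20, 2.5], [0.15, 1.8], [0.30, 3.0]]
new_points = smote(defaults, n_synthetic=6, k=2)
balanced_minority = defaults + new_points
print(len(balanced_minority))  # 4 original + 6 synthetic = 10
```

In practice the synthetic samples would be generated from the training split only, so that the test data used for the AUC comparison contains no synthetic points.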

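    Chapter 1 Introduction
    Chapter 2 Literature Review
    Section 1 Bankruptcy Prediction Research
    Section 2 Topic Models
    Section 3 Machine Learning Algorithms
    Chapter 3 Research Methods
    Chapter 4 Empirical Results
    Section 1 Data Sources and Processing
    Section 2 Presentation of Results
    Chapter 5 Conclusion
    References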

    Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance 23(4), 589–609.
    Altman, E. I., Haldeman, R., and Narayanan, P. (1977). ZETA analysis: A new model to identify bankruptcy risk of corporations. Journal of Banking and Finance 1(1), 29–54.
    Black, F., and Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy 81, 637–654. Reprinted in Black, F., and Scholes, M. (2012), Financial Risk Measurement and Management, International Library of Critical Writings in Economics, Vol. 267 (Edward Elgar, Cheltenham, UK), pp. 100–117.
    Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022. doi: 10.1162/jmlr.2003.3.4-5.993
    Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.
    Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
    Chen, T., and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM.
    Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41.
    Duan, J.-C., Sun, J., and Wang, T. (2012). Multiperiod corporate default prediction: A forward intensity approach. Journal of Econometrics 170(1).
    Han, H., Wang, W. Y., and Mao, B. H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proc. Int'l Conf. Intelligent Computing, pp. 878–887.
    He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proc. Int'l Joint Conf. Neural Networks, pp. 1322–1328.
    Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of SIGIR, pp. 50–57.
    Lin, C., and He, Y. (2009). Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 375–384.
    Lopatta, K., Gloger, M. A., and Jaeschke, R. (2017). Can language predict bankruptcy? The explanatory power of tone in 10-K filings. Accounting Perspectives 16, 315–343.
    Loughran, T., and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance 66(1), 35–65.
    Merton, R. (1974). On the pricing of corporate debt: The risk structure of interest rates. Journal of Finance 29, 449–470.
    Ohlson, J. A. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research 18, 109–131.
    Peinelt, N., Nguyen, D., and Liakata, M. (2020). tBERT: Topic models and BERT joining forces for semantic similarity detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 7047–7055.
    Quinlan, J. R. (1986). Induction of decision trees. Machine Learning 1, 81–106.
    Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics 6, 769–772.
    Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics 2(3), 408–421.
