| Graduate student: | 楊建隆 Yeow, Kent-Loong |
|---|---|
| Thesis title: | 專利語言訊號與市場反應:機器學習方法研究 Patent Linguistic Signals and Market Response: A Machine Learning Approach |
| Advisor: | 何乾瑋 Ho, Chien-Wei |
| Oral defense committee: | 蘇威傑 Su, Wei Chieh; 傅浚映 Fu, Jyun-Ying |
| Degree: | Master |
| Department: | College of Commerce - Department of International Business |
| Year of publication: | 2025 |
| Academic year of graduation: | 113 (ROC calendar) |
| Language: | English |
| Number of pages: | 61 |
| Keywords (Chinese): | 專利評價、語言特徵、機器學習、TF-IDF、隨機森林、人工智慧專利、創新訊號 |
| Keywords (English): | patent valuation, linguistic features, machine learning, TF-IDF, Random Forest, AI patents, innovation signaling |
This thesis investigates whether linguistic characteristics in U.S. patent abstracts are associated with stock-market responses and develops a scalable text-based framework for early-stage screening of innovation value, which firms can consult when drafting patent abstracts and communicating with external audiences. Using 1.95 million U.S. patents integrated from the KPSS, USPTO PatentsView, and AIPD datasets, the study constructs TF–IDF representations of patent abstracts and evaluates multiple machine-learning classifiers. The Random Forest model performs best, achieving 0.94 accuracy and a 0.76 F1 score under the top-20% economic-value threshold, suggesting that linguistic cues in patent abstracts are informative for market-based recognition of innovation value. Feature-importance analyses indicate that high-value abstracts contain more market-oriented terminology and precise technical modifiers (e.g., “semiconductor integrated,” “artificial intelligence”), whereas traditional industry jargon (e.g., “downhole,” “wellbore”) is negatively associated with economic value. Additional results suggest heterogeneity across domains: in the balanced sample, AI patents exhibit an 86.67% higher mean economic value (rounded to two decimals) than non-AI patents, and classification performance is generally stronger for AI patents (with F1 improvements up to approximately +0.10); topic modeling further shows that AI patents emphasize user interaction and data processing, while non-AI patents focus on physical components and engineering structures. Overall, the findings are consistent with the view that patent language contains informative signals related to market recognition and can complement citation-based metrics for large-scale assessment and early-stage identification of innovation value.
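The three ingredients named in the abstract (TF–IDF weighting of abstract text, a top-20% economic-value label, and accuracy versus F1 as evaluation metrics) can be illustrated with a minimal standard-library sketch. Everything below is an assumption made for illustration: the toy abstracts and values are invented, the function names are hypothetical, whitespace tokenization stands in for the thesis's own preprocessing, the IDF uses a common smoothed form, and no Random Forest is trained.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    grams = list(tokens)
    for n in range(2, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs):
    """Return one {term: weight} dict per document (raw tf times smoothed idf)."""
    term_lists = [ngrams(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(terms).items()}
            for terms in term_lists]


def top_share_labels(values, share=0.20):
    """Label 1 for the highest-valued `share` of items, else 0.

    Ties at the cutoff are all labeled 1, so the positive class can
    slightly exceed `share` when values repeat.
    """
    k = max(1, round(len(values) * share))
    cutoff = sorted(values, reverse=True)[k - 1]
    return [1 if v >= cutoff else 0 for v in values]


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    acc = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


# Invented toy data, not drawn from the thesis sample.
abstracts = [
    "semiconductor integrated circuit with artificial intelligence control",
    "downhole tool for wellbore drilling",
    "wellbore casing and downhole valve",
    "artificial intelligence model for data processing",
    "mechanical bracket for pipe support",
]
values = [9.0, 1.0, 0.5, 4.0, 2.0]

weights = tfidf(abstracts)          # sparse TF-IDF features per abstract
labels = top_share_labels(values)   # 1 marks the top 20% by value

# With a 20% positive class, missing one of two positives still leaves
# accuracy high while F1 drops, which is why both metrics are reported.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(labels, round(acc, 2), round(f1, 2))
```

In this sketch a bigram such as "semiconductor integrated" becomes its own feature, which is what lets a downstream classifier assign importance to multi-word technical modifiers rather than to single words only.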
Acknowledgements
Abstract (in Chinese)
Abstract
Contents
List of Figures
List of Tables
List of Abbreviations
List of Notations
1 Introduction
2 Theoretical Framework and Literature Review
2.1 Theoretical Foundations
2.2 Patent Language Composition
2.3 Communication Channels and Presentation Strategy
2.4 Text-Mining Methodology
2.5 Conclusion
3 Data
3.1 Core Datasets
3.2 Data Integration and Sample Construction
3.3 Data Quality and Methodological Considerations
4 Methodology
4.1 Empirical Strategy and Research Design
4.2 Text Mining Methods
4.3 Exploratory Data Analysis
4.4 Pilot Study Design (5% Sample Rate)
4.5 Final Model Implementation (100% Sample Rate)
4.6 Data Preprocessing and Feature Extraction
4.7 Summary
5 Results
5.1 Final Model Implementation
5.2 Model Performance Comparison
5.3 Cross-Model Feature Analysis
5.4 Model Validation and Robustness
5.5 Summary
6 Discussion
6.1 Interpretation of Model Performance Results
6.2 Feature Importance Analysis and Economic Signaling
6.3 Comparison with Existing Literature
6.4 From Feature Analysis to Domain Investigation
6.5 Additional Test: AI versus Non-AI Patents
6.6 Understanding the AI Domain Advantage
6.7 Summary
7 Conclusion
7.1 Central Question Answered
7.2 Which Linguistic Cues Matter
7.3 Managerial Payoff
7.4 Domain Heterogeneity
7.5 Research Contributions
7.6 Limitations and Future Directions
7.7 Concluding Remarks
References
A Appendix
A.1 Data Integration Process
A.2 Text Preprocessing Pipeline
A.3 Economic Value Threshold Determination
A.4 Model Training and Evaluation Framework
A.5 AI vs Non-AI Patent Analysis
A.6 Topic Modeling Implementation
A.7 Representative Patent Examples
Bekamiri, H., Hain, D. S., & Jurowetzki, R. (2024). PatentsBERTa: A deep NLP-based hybrid model for patent distance and classification using augmented SBERT. Technological Forecasting and Social Change, 206, 123536.
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. O’Reilly Media, Inc.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
Chen, W., Shi, T. T., & Srinivasan, S. (2024). The value of AI innovations.
Chen, Y., et al. (2019). BERT-CNN: A hierarchical patent classifier based on a pre-trained language model. Conference paper presented at Patent Classification Research. ResearchGate.
Cockburn, I. M., Henderson, R., Stern, S., et al. (2018). The impact of artificial intelligence on innovation (Working Paper No. 24449). National Bureau of Economic Research, Cambridge, MA.
Cockburn, I. M., Henderson, R., & Stern, S. (2019). The impact of artificial intelligence on innovation: An exploratory analysis. In A. Agrawal, J. Gans, & A. Goldfarb (Eds.), The economics of artificial intelligence: An agenda (pp. 115–148). University of Chicago Press. https://doi.org/10.7208/9780226613475-006
Couronné, R., Probst, P., & Boulesteix, A.-L. (2018). Random forest versus logistic regression: A large-scale benchmark experiment. BMC Bioinformatics, 19, 1–14.
Datar, A., Amore, M., & Fosfuri, A. (2024). Strategic patent disclosure: Unraveling the influence of temporal preferences. Strategic Organization. https://doi.org/10.1177/14761270241299756
Farre-Mensa, J., Hegde, D., & Ljungqvist, A. (2020). What is a patent worth? Evidence from the U.S. patent “lottery”. The Journal of Finance, 75(2), 639–682.
Feng, S. (2020). The proximity of ideas: An analysis of patent text using machine learning. PLoS One, 15(7), e0234880. https://doi.org/10.1371/journal.pone.0234880
Griliches, Z. (2007). R&D, patents and productivity. University of Chicago Press.
Haeussler, C., Harhoff, D., & Mueller, E. (2014). How patenting informs VC investors: The case of biotechnology. Research Policy, 43(8), 1286–1298.
Hall, B. H., Jaffe, A., & Trajtenberg, M. (2005). Market value and patent citations. The RAND Journal of Economics, 36(1), 16–38.
Hall, B. H., & Lerner, J. (2010). The financing of R&D and innovation. In Handbook of the economics of innovation (Vol. 1, pp. 609–639). Elsevier.
Han, E. J., & Sohn, S. Y. (2015). Patent valuation based on text mining and survival analysis. The Journal of Technology Transfer, 40(5), 821–839.
Hao, M., & Fan, K. (2017). A method for calculating the similarity of TF-IDF texts for synonyms in biomedical domains. Proceedings of the 5th International Conference on Frontiers of Manufacturing Science and Measuring Technology (FMSMT 2017), 578–583. https://doi.org/10.2991/fmsmt-17.2017.118
Harhoff, D., Narin, F., Scherer, F. M., & Vopel, K. (1999). Citation frequency and the value of patented inventions. Review of Economics and Statistics, 81(3), 511–515.
Harhoff, D., Scherer, F. M., & Vopel, K. (2003). Citations, family size, opposition and the value of patent rights. Research Policy, 32(8), 1343–1363.
Hsu, D. H., & Ziedonis, R. H. (2008). Patents as quality signals for entrepreneurial ventures. Academy of Management Proceedings, 2008(1), 1–6.
Jalilifard, A., Caridá, V. F., Mansano, A. F., Cristo, R. S., & da Fonseca, F. P. C. (2021). Semantic-sensitive TF-IDF to determine word relevance in documents. In Advances in Computing and Network Communications: Proceedings of CoCoNet 2020 (Vol. 2, pp. 327–337).
Kogan, L., Papanikolaou, D., Seru, A., & Stoffman, N. (2017). Technological innovation, resource allocation, and growth. The Quarterly Journal of Economics, 132(2), 665–712.
Kong, N., Dulleck, U., Jaffe, A. B., Sun, S., & Vajjala, S. (2023). Linguistic metrics for patent disclosure: Evidence from university versus corporate patents. Research Policy, 52(2), 104670.
Long, C. (2002). Patent signals. The University of Chicago Law Review, 625–679.
Loughran, T., & McDonald, B. (2014). Measuring readability in financial disclosures. The Journal of Finance, 69(4), 1643–1671.
Marco, A. C., Sarnoff, J. D., & Charles, A. (2019). Patent claims and patent scope. Research Policy, 48(9), 103790.
Miric, M., Jia, N., & Huang, K. G. (2023). Using supervised machine learning for large-scale classification in management research: The case for identifying artificial intelligence patents. Strategic Management Journal, 44(2), 491–519.
Molnar, C. (2020). Interpretable machine learning. Lulu.com.
Pairolero, N. A., Giczy, A. V., Torres, G., Islam Erana, T., Finlayson, M. A., & Toole, A. A. (2025). The artificial intelligence patent dataset (AIPD): 2023 update. The Journal of Technology Transfer, 1–24.
Röder, M., Both, A., & Hinneburg, A. (2015). Exploring the space of topic coherence measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 399–408.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. McGraw-Hill.
Sarica, S., & Luo, J. (2021). Stopwords in technical language processing. PLoS One, 16(8), e0254937.
Schankerman, M. (1998). How valuable is patent protection? Estimates by technology field. The RAND Journal of Economics, 77–107.
Schmitt, V. J. (2025). Disentangling patent quality: Using a large language model for a systematic literature review. Scientometrics, 130(1), 267–311. https://doi.org/10.1007/s11192-024-05206-w
Spence, M. (1978). Job market signaling. In Uncertainty in economics (pp. 281–306). Elsevier.
Squicciarini, M., Dernis, H., & Criscuolo, C. (2013). Measuring patent quality: Indicators of technological and economic value.
Tan, H.-T., Wang, E. Y., & Zhou, B. (2014). When the use of positive language backfires: The joint effect of tone, readability, and investor sophistication on earnings judgments. Journal of Accounting Research, 52(1), 273–302.
United States Patent and Trademark Office. (2018). Patent public search: Stopwords description [Retrieved May 26, 2025].
United States Patent and Trademark Office. (2024a). Patents dashboard: Pendency [Accessed May 30, 2025].
United States Patent and Trademark Office. (2024b). PatentsView bulk downloads: Quarter 4 2024 data release.
U.S. Patent and Trademark Office. (2024). The abstract [Accessed May 26, 2025].
Zúñiga, P., Guellec, D., Dernis, H., Khan, M., Okazaki, T., & Webb, C. (2009). OECD patent statistics manual. OECD Publications.
Full text release date: 2031/01/26