| Field | Value |
|---|---|
| Author | 尤岱亞 (Dyah Ayu Marhaeningtyas Galuh Wisnu) |
| Title | 用於知覺式音訊評估與時間尺度修改的神經網路架構 (Neural Architectures for Perceptual Audio Assessment and Time-Scale Modification) |
| Advisors | 曹昱 (Tsao, Yu); 彭彥璁 (Peng, Yan-Tsung) |
| Committee members | 王新民; 李冕; 王有德 (Wang, Yu-Te); 花凱龍; Stefano Rini; Akihiko (Ken) Sugiyama |
| Degree | Doctoral (博士) |
| Department | College of Informatics, International Ph.D. Program in Social Networks and Human-Centered Computing, Taiwan International Graduate Program (TIGP) |
| Year of publication | 2025 |
| Academic year of graduation | 114 (ROC calendar) |
| Language | English |
| Number of pages | 104 |
| Keywords (Chinese) | 感知式音訊評估, HAAQI, 音訊品質評估, 時間尺度變換, 神經網路架構 |
| Keywords (English) | Perceptual audio assessment, HAAQI, Audio quality evaluation, Time-scale modification, Neural architectures |
感知式音訊處理是語音與音樂科技中的一項核心挑戰,在助聽器、生成式音訊評估以及語音轉換等應用中具有關鍵地位。傳統的訊號處理方法雖然在受限條件下表現良好,但往往難以捕捉人類聽覺感知的細微差異,對跨音訊領域的多樣性缺乏穩健性,且無法滿足現代即時系統對效率與彈性的需求。本論文透過發展用於感知式音訊評估與時間尺度變換之神經架構,整合深度學習、自監督嵌入表示與感知模型,以推進感知式音訊處理之研究。
本論文的第一項貢獻為 HAAQI-Net,一個專為助聽器應用設計的非侵入式音樂音訊品質評估神經模型。HAAQI-Net 採用 BEATs 嵌入表示,並結合具注意力機制的雙向長短期記憶網路(BLSTM),在不需要參考訊號的情況下預測感知品質指標。實驗結果顯示,該模型與助聽器音訊品質指標(Hearing Aid Audio Quality Index, HAAQI)具有高度一致性,達到 LCC = 0.9368、SRCC = 0.9486 及 MSE = 0.0064,並透過知識蒸餾將推論時間由 62.5 秒大幅降低至 2.5 秒。其在不同聽損型態與訊號處理條件下所展現的穩健性與高效率,突顯了其於即時輔助式聽覺裝置中的實用性。
第二項貢獻為 AESA-Net,一個為 AudioMOS Challenge 2025 所提出的統一式多任務音訊美學評估架構。AESA-Net 能同時預測四個感知面向:製作品質(Production Quality)、複雜度(Complexity)、愉悅度(Enjoyment)以及實用性(Usefulness),並適用於自然音訊與生成式音訊。為因應跨領域資料所帶來的分佈差異問題,本研究引入結合緩衝式取樣策略的三元組損失(triplet loss),以感知相似度為基礎來結構化嵌入空間。在官方合成音訊評測資料集上,AESA-Net 與人工主觀評分呈現高度相關性,LCC 最高可達 0.984,且在 SRCC 與 Kendall’s Tau 指標上皆優於基準模型,顯示其對未知文字轉語音(TTS)、文字轉音訊(TTA)與文字轉音樂(TTM)內容具備良好的泛化能力。
第三項貢獻為 STSM-FiLM,一個全神經式語音時間尺度變換架構。透過在編碼器–解碼器表示中引入特徵層級線性調制(Feature-Wise Linear Modulation, FiLM),STSM-FiLM 能在維持語音品質與可懂度的同時,提供連續且可控的播放速度調整。該模型以 WSOLA 之輸出作為監督訊號,在多項客觀指標上均優於傳統方法,平均達到 PESQ = 2.03、STOI = 0.894 及 DNSMOS = 2.99,並有效降低語音辨識錯誤率(WER = 0.103,CER = 0.055)。主觀聆聽實驗亦證實,STSM-FiLM 在極端時間縮放條件下,相較於 WSOLA 與既有神經式基準方法展現出更優異的感知品質。
綜合而言,本論文展示了如何將自監督嵌入表示、感知導向之目標函數,以及神經條件化機制有效整合,以推進音訊處理技術的發展。透過連結傳統訊號處理與深度學習方法,所提出之模型在非侵入式音訊品質評估、音訊美學評估及時間尺度變換等任務中,均達成兼具穩健性、高效率與感知一致性的表現。此研究成果不僅提供理論層面的洞見,也具備實務應用價值,對於下一代助聽輔具技術、生成式音訊評估框架以及語音處理系統皆具有重要啟發。
Perceptual audio processing is a fundamental challenge in speech and music technologies, with critical applications in hearing aids, generative audio evaluation, and speech transformation. Traditional signal processing methods, while effective under constrained conditions, often fail to capture the subtleties of human auditory perception, lack robustness across diverse audio domains, and do not provide the efficiency or flexibility demanded by modern real-time systems. This dissertation advances the field by developing neural architectures for perceptual audio assessment and time-scale modification, unifying deep learning techniques, self-supervised embeddings, and perceptual modeling.
The first contribution is HAAQI-Net, a non-intrusive neural model for music audio quality assessment tailored to hearing aid applications. HAAQI-Net leverages BEATs embeddings and attention-enhanced BLSTM architectures to predict perceptual indices without requiring reference signals. Experiments show strong alignment with the Hearing Aid Audio Quality Index (HAAQI), achieving LCC = 0.9368, SRCC = 0.9486, and MSE = 0.0064, while inference time is reduced from 62.5s to 2.5s via knowledge distillation. This efficiency, combined with robustness across hearing loss patterns and signal processing conditions, highlights its practicality for real-time assistive devices.
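To make the model design above concrete, the following is a minimal PyTorch sketch of a non-intrusive quality predictor in the spirit of HAAQI-Net: frame-level embeddings (for example, from a pretrained BEATs encoder) pass through a BLSTM, are pooled with frame-level attention, and are regressed to a single score in [0, 1]. The layer sizes, the `hl` hearing-loss conditioning vector, and the class name are illustrative assumptions rather than the dissertation's exact configuration; the knowledge-distillation step that accelerates inference is not shown.

```python
# A minimal sketch (not the dissertation's exact model) of a non-intrusive quality
# predictor: precomputed frame embeddings -> BLSTM -> attention pooling -> score.
import torch
import torch.nn as nn

class NonIntrusiveQualityNet(nn.Module):
    def __init__(self, emb_dim=768, hidden=128, hl_dim=8):
        super().__init__()
        self.blstm = nn.LSTM(emb_dim + hl_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each frame for attention pooling
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, emb, hl):
        # emb: (B, T, emb_dim) frame-level audio embeddings; hl: (B, hl_dim) hearing-loss vector.
        hl_tiled = hl.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.blstm(torch.cat([emb, hl_tiled], dim=-1))
        w = torch.softmax(self.attn(h), dim=1)   # (B, T, 1) attention weights over frames
        pooled = (w * h).sum(dim=1)              # attention-weighted temporal pooling
        return self.head(pooled).squeeze(-1)     # predicted quality index in [0, 1]
```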
The second contribution is AESA-Net, a unified multi-task framework for audio aesthetic assessment developed for the AudioMOS Challenge 2025. AESA-Net predicts four perceptual axes—Production Quality, Complexity, Enjoyment, and Usefulness—across natural and generative audio. To address domain shift, we incorporate a triplet loss with buffer-based sampling, structuring the embedding space by perceptual similarity. On the official evaluation set of synthetic audio, AESA-Net achieves high correlations with human ratings, with LCC values up to 0.984 and consistent improvements in SRCC and Kendall’s Tau over the baseline, demonstrating robust generalization to unseen TTS, TTA, and TTM content.
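As an illustration of the triplet objective described above, the sketch below keeps a buffer of recent embeddings with their ratings and draws positives and negatives by rating proximity before applying a standard margin-based triplet loss. The class name, buffer size, thresholds, and margin are hypothetical placeholders, not the challenge submission's actual values.

```python
# A hedged sketch of triplet loss with buffer-based sampling: the buffer stores
# detached embeddings and their ratings; for each anchor, a perceptually similar
# item serves as the positive and a dissimilar one as the negative.
import torch
import torch.nn.functional as F

class RatingTripletBuffer:
    def __init__(self, max_size=2048, pos_thr=0.25, neg_thr=1.0, margin=0.3):
        self.max_size, self.pos_thr, self.neg_thr, self.margin = max_size, pos_thr, neg_thr, margin
        self.embs, self.scores = [], []

    def update(self, embs, scores):
        # Store detached embeddings and ratings, keeping only the most recent max_size entries.
        self.embs.extend(e.detach() for e in embs)
        self.scores.extend(float(s) for s in scores)
        self.embs, self.scores = self.embs[-self.max_size:], self.scores[-self.max_size:]

    def loss(self, anchor_emb, anchor_score):
        if not self.embs:
            return anchor_emb.new_zeros(())
        bank = torch.stack(self.embs)                            # (N, D)
        diff = torch.tensor(self.scores, device=bank.device) - anchor_score
        pos = (diff.abs() <= self.pos_thr).nonzero().flatten()   # perceptually similar items
        neg = (diff.abs() >= self.neg_thr).nonzero().flatten()   # perceptually dissimilar items
        if len(pos) == 0 or len(neg) == 0:
            return anchor_emb.new_zeros(())
        p = bank[pos[torch.randint(len(pos), (1,))]].squeeze(0)
        n = bank[neg[torch.randint(len(neg), (1,))]].squeeze(0)
        return F.triplet_margin_loss(anchor_emb.unsqueeze(0), p.unsqueeze(0),
                                     n.unsqueeze(0), margin=self.margin)
```

Because the buffered entries are detached, gradients flow only through the anchor, which keeps the memory bank cheap while still structuring the embedding space by perceptual similarity.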
The third contribution is STSM-FiLM, a fully neural architecture for time-scale modification of speech. By conditioning encoder–decoder representations with Feature-Wise Linear Modulation (FiLM), STSM-FiLM provides continuous control over playback speed while maintaining quality and intelligibility. Using WSOLA outputs as supervision, the model surpasses classical methods, achieving PESQ = 2.03, STOI = 0.894, and DNSMOS = 2.99 on average, while reducing recognition errors (WER = 0.103, CER = 0.055). Subjective listening tests confirm its superiority over WSOLA and prior neural baselines, particularly at extreme scaling factors.
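The FiLM mechanism referenced here reduces to a per-channel affine transform whose parameters are predicted from the conditioning signal; the sketch below shows one way to modulate encoder features with a continuous playback-speed factor. The channel count, MLP width, and the `SpeedFiLM` name are assumptions for illustration only, not the architecture's exact settings.

```python
# A minimal FiLM-conditioning sketch: a small MLP maps the scalar speed factor to
# per-channel scale (gamma) and shift (beta) applied to encoder features.
import torch
import torch.nn as nn

class SpeedFiLM(nn.Module):
    def __init__(self, channels=256, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 2 * channels))

    def forward(self, feats, speed):
        # feats: (B, C, T) encoder features; speed: (B,) playback-speed factors, e.g. 0.5 to 2.0.
        gamma, beta = self.mlp(speed.unsqueeze(-1)).chunk(2, dim=-1)   # each (B, C)
        return gamma.unsqueeze(-1) * feats + beta.unsqueeze(-1)        # feature-wise affine modulation

# Usage: modulated = SpeedFiLM(256)(torch.randn(4, 256, 100), torch.tensor([0.5, 1.0, 1.5, 2.0]))
```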
Collectively, this dissertation demonstrates how self-supervised embeddings, perceptually informed objectives, and neural conditioning mechanisms can be integrated to advance audio processing. By bridging classical signal processing with deep learning, the proposed models achieve robust, efficient, and perceptually aligned performance in tasks spanning non-intrusive quality assessment, aesthetic evaluation, and temporal modification. These findings contribute both theoretical insights and practical tools, with implications for next-generation hearing assistive technologies, generative audio evaluation frameworks, and speech processing systems.
1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Problems and Challenges 3
1.3 Objectives and Scope 4
1.4 Contributions of the Dissertation 6
1.5 Dissertation Organization 7
2 Literature Review 9
2.1 Perceptual Audio Quality Assessment 9
2.1.1 Subjective and Objective Methods 9
2.1.2 Hearing-Aid-Oriented Metrics 10
2.1.3 Neural and Self-Supervised Approaches 10
2.2 Audio Aesthetic Evaluation 11
2.2.1 Generative and Synthetic Audio 11
2.2.2 Multi-Axis Perceptual Modeling 11
2.2.3 Metric Learning and Embedding Spaces 11
2.3 Time-Scale Modification of Speech 12
2.3.1 Classical Approaches 12
2.3.2 Neural Approaches 12
2.3.3 Conditioning Mechanisms 12
2.4 Summary and Open Challenges 13
3 A Non-intrusive Neural Music Audio Quality Assessment Model for Hearing Aids 14
3.1 Introduction 14
3.2 HAAQI-Net 18
3.2.1 BEATs 18
3.2.2 Network Architecture 19
3.2.3 Knowledge Distillation 21
Adaptive Distillation 23
3.3 Experiments 24
3.3.1 Data preparation 25
Music Samples 25
Music Signal Processing 25
Hearing Loss Patterns 26
3.3.2 Experimental Setup 28
Inputs and Configurations 28
Experimental Scenarios 29
3.3.3 Experimental Results 30
Overall Performance with Different Input Features 31
Scenario-Based Performance 31
Performance under Different Hearing Loss Patterns 32
Performance under Different Signal Processing Conditions 33
Performance across Different Genres 35
Performance of HAAQI-Net with Knowledge Distillation 36
HAAQI-Net Tested on the MUSDB18-HQ Dataset 38
Adapting HAAQI-Net to Predict Subjective Score 39
Impact of Sound Pressure Level (SPL) Adjustments on HAAQI-Net 41
Comparison With Other Audio Quality Assessment Methods 43
Efficiency Evaluation 45
3.4 Summary 46
4 A Neural Framework for Multi-Axis Audio Aesthetic Assessment 54
4.1 Introduction 54
4.2 System Description 57
4.2.1 Dataset 57
4.2.2 Preprocessing and Feature Extraction 58
4.2.3 Model Architecture 59
4.2.4 Loss Function 60
4.3 Triplet Loss with Buffer Sampling Strategy 62
4.3.1 Formulation of Triplet Loss 62
4.3.2 Buffer-Based Sampling 63
4.3.3 Combined Objective 64
4.3.4 Discussion 64
4.4 Experiments and Results 65
4.4.1 Training Setup 65
4.4.2 Evaluation Metrics 66
4.4.3 Results Across Domains 67
4.4.4 Comparison with Baseline 67
4.4.5 Discussion 68
4.5 Summary 70
5 A FiLM-Conditioned Neural Architecture for Time-Scale Modification of Speech 72
5.1 Introduction 72
5.2 Proposed Methods 75
5.2.1 Feature Encoder 76
5.2.2 FiLM Conditioning Module 78
5.2.3 Feature Decoder 79
5.2.4 Training and Inference 79
5.3 Experimental Results and Analysis 80
5.3.1 Datasets and Setup 80
5.3.2 Encoder–Decoder Variants 80
5.3.3 Objective Evaluation 81
5.3.4 ASR-Based Intelligibility 83
5.3.5 FiLM Ablation 83
5.3.6 Trends Across Speed Factors 84
5.3.7 Subjective Evaluation 84
5.3.8 Discussion 85
5.4 Summary 86
6 Conclusions and Future Work 88
6.1 Conclusions 88
6.2 Future Work 89
6.3 Final Remarks 91
References 92