| Graduate Student: | Shahzad, Sahibzada Adil (海達) |
|---|---|
| Thesis Title: | Multimodal Strategies for Detecting Audiovisual Deepfakes: Exploiting Lip-Synchronization Cues and Benchmarking Human-AI Perception (用於檢測視聽深偽的多模態策略:利用唇同步線索與人機感知基準評估) |
| Advisor: | Wang, Hsin-Min (王新民) |
| Committee Member: | Peng, Yan-Tsung (彭彥聰) |
| Degree: | Doctor |
| Department: | College of Information - International Ph.D. Program in Social Networks and Human-Centered Computing, Taiwan International Graduate Program (TIGP) |
| Year of Publication: | 2025 |
| Academic Year of Graduation: | 114 |
| Language: | English |
| Number of Pages: | 79 |
| Chinese Keywords: | 深偽、視聽覺、多模態、唇形同步、鑑識 |
| Foreign Keywords: | Deepfakes, Audiovisual, Multimodal, LipSync, Forensics |
Video content has become one of the most influential and widely consumed forms of communication, shaping public discourse, entertainment, and education on social media and digital platforms. As people increasingly rely on video to share information, deepfake technology has emerged as both a creative tool and a serious threat. While deepfakes have been put to positive uses such as restoring historical footage and improving accessibility, their misuse, for example to spread misinformation, manipulate public perception, and generate non-consensual content, raises serious ethical and security concerns. As deepfake methods continue to evolve and grow more sophisticated, researchers have actively developed detection strategies, most of which rely on unimodal deep learning models that analyze visual or audio features separately. However, deepfake techniques now often tamper with both the visual and audio modalities at once, making detection substantially harder. This dissertation therefore investigates three key avenues for strengthening audiovisual deepfake detection, targeting multimodal inconsistency analysis, generalization ability, and interpretability based on large language models (LLMs).
The first contribution, Lip Sync Matters, proposes a novel deepfake detection framework that identifies misalignment between lip movements and speech. The method uses a dedicated Wav2Lip model to generate a synthetic lip sequence from the audio, which is then compared with the actual lips in the video to capture subtle inconsistencies that are difficult for the human eye to perceive. Unlike conventional unimodal or fused multimodal detection approaches, this method directly treats audiovisual alignment mismatch as the core cue for deepfake detection. Experimental results show that it outperforms several existing state-of-the-art unimodal and multimodal methods on the FakeAVCeleb dataset, underscoring the importance of cross-modal synchronization for identifying deepfake content.
Building on this foundation, the second method, AV-Lip-Sync+, further strengthens the robustness of feature extraction through a self-supervised learning (SSL) framework. Conventional supervised models rely on labeled data whose scale and diversity are often limited, which encourages overfitting and poor generalization to unseen manipulation techniques. To address this, the study adopts AV-HuBERT, a Transformer-based SSL model, to extract semantic representations from both the visual and audio modalities. Compared with analyzing a single modality, this model better captures subtle but critical inconsistencies between the audio and visual streams. In addition, another video Transformer model based on facial features is integrated to detect the spatial and temporal facial forgery artifacts introduced during deepfake generation. This hybrid approach achieves new state-of-the-art detection performance on multimodal benchmark datasets including FakeAVCeleb, DeepfakeTIMIT, and DFDC, confirming the effectiveness of self-supervised feature extraction for improving deepfake detection.
The third contribution focuses on an emerging research direction: applying large language models (LLMs) to multimodal deepfake detection. Conventional deep learning detectors tend to be black boxes that demand substantial computational resources and offer little interpretability; in contrast, multimodal LLMs such as ChatGPT provide a fundamentally different approach, analyzing audiovisual inconsistencies through domain knowledge and structured prompt engineering. This study systematically evaluates ChatGPT's ability to identify deepfakes, including its detection of spatial, temporal, and cross-modal audiovisual inconsistencies. Comparative experiments against state-of-the-art detection models and human evaluators show that LLMs hold promise for multimedia forensics but also have clear limitations. While ChatGPT demonstrates some feasibility in analyzing audiovisual manipulations, its detection performance depends strongly on prompt quality and on its own knowledge base, highlighting the need for further optimization in AI forensic applications.
Taken together, these contributions present new methods that strengthen audiovisual deepfake detection through enhanced cross-modal inconsistency analysis, while examining detection accuracy, generalization ability, and interpretability in depth. As social media and online platforms continue to rely on video content to shape public perception, the need for reliable and interpretable deepfake detection systems grows ever more urgent. The research and methods presented in this dissertation lay an important foundation for building more robust, efficient, and explainable deepfake detection frameworks to meet the rapidly evolving challenges posed by deepfake content in the synthetic media ecosystem.
Video content has become one of the most powerful and pervasive forms of communication in today's digital world, shaping public discourse, influencing political opinion, and driving trends in education, entertainment, and news. With the ease of creation and global dissemination through social media platforms, videos now serve as critical vessels for information exchange. However, this growing reliance on video has also made it a prime target for manipulation through deepfake technology, a rapidly advancing capability powered by generative Artificial Intelligence (AI) models. While deepfakes offer innovative applications such as dubbing into multiple languages, restoring historical footage, and aiding accessibility, their malicious use raises serious ethical, societal, and security concerns. Deepfakes have been weaponized to spread misinformation, impersonate individuals, and generate non-consensual content, threatening personal reputations and public trust. The sophistication of current-generation deepfakes, which often manipulate both the audio and visual modalities, presents a formidable challenge for detection systems that were primarily designed to analyze unimodal artifacts.
This study addresses the pressing need for effective and interpretable methods to detect audiovisual deepfake manipulations that simultaneously alter both the visual and audio components of video content. To tackle this challenge, it explores three complementary research directions, each focusing on a key aspect of audiovisual forgery detection. These include identifying subtle inconsistencies between speech and lip movements, leveraging self-supervised learning for multimodal feature representation, and evaluating the potential of large language models (LLMs) to reason over and explain audiovisual manipulations. Together, these approaches provide a deeper understanding of cross-modal artifacts and lay the groundwork for developing more accurate and explainable audiovisual forensic tools.
The first contribution, Lip Sync Matters, introduces a novel detection method based on audiovisual consistency. It exploits the relationship between lip movements and corresponding speech by synthesizing a reference lip sequence from the audio using the Wav2Lip model and comparing it with the actual visual lip region in the video. This comparison uncovers subtle mismatches between the audio and visual streams that are characteristic of deepfake videos. By focusing explicitly on synchronization rather than treating audio and video as independent modalities, this method captures cross-modal anomalies more effectively than traditional unimodal or fusion-based models. Evaluation on the FakeAVCeleb dataset demonstrates that this approach significantly improves detection performance, highlighting the value of temporal alignment in deepfake forensics.
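To make the synchronization check concrete, the sketch below shows one minimal way such a comparison could be wired up in PyTorch: two hypothetical encoders embed the video's real lip crops and the Wav2Lip-synthesized reference lips, and a small classifier consumes both embeddings together with their cosine similarity. All module names, dimensions, and layer choices here are illustrative assumptions, not the dissertation's exact architecture.

```python
# Minimal sketch of a lip-sync mismatch detector (assumed design, not the
# published Lip Sync Matters architecture).
import torch
import torch.nn as nn

class LipEncoder(nn.Module):
    """Encodes a sequence of lip crops (B, T, C, H, W) into one embedding."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(              # per-frame feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, emb_dim)
        self.temporal = nn.GRU(emb_dim, emb_dim, batch_first=True)

    def forward(self, x):
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).flatten(1)   # (B*T, 64)
        f = self.proj(f).view(b, t, -1)            # (B, T, emb_dim)
        _, h = self.temporal(f)                    # last hidden state
        return h.squeeze(0)                        # (B, emb_dim)

class LipSyncDetector(nn.Module):
    """Compares Wav2Lip-synthesized lips against the video's real lips."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.enc_real = LipEncoder(emb_dim)
        self.enc_synth = LipEncoder(emb_dim)
        self.head = nn.Sequential(nn.Linear(emb_dim * 2 + 1, 64),
                                  nn.ReLU(), nn.Linear(64, 2))

    def forward(self, real_lips, synth_lips):
        e_r, e_s = self.enc_real(real_lips), self.enc_synth(synth_lips)
        sim = torch.cosine_similarity(e_r, e_s, dim=-1)  # sync cue
        feats = torch.cat([e_r, e_s, sim.unsqueeze(-1)], dim=-1)
        return self.head(feats)                    # real-vs-fake logits

# Usage: 8-frame clips of 64x64 lip crops from the video and from Wav2Lip.
model = LipSyncDetector()
real = torch.randn(2, 8, 3, 64, 64)
synth = torch.randn(2, 8, 3, 64, 64)
logits = model(real, synth)                        # shape (2, 2)
```

The design point the sketch tries to capture is that the classifier sees an explicit synchronization signal (the similarity between the two lip streams) rather than treating audio and video as independent inputs.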
The second contribution, AV-Lip-Sync+, extends the cross-modal framework by introducing a self-supervised learning (SSL) pipeline that leverages AV-HuBERT, a transformer-based model, to extract joint semantic features from raw audio and video. Unlike supervised systems that rely heavily on labeled data, which may be scarce or biased, SSL enables robust representation learning from unlabeled multimodal data, improving generalization across diverse manipulation techniques and datasets. Additionally, the framework incorporates a spatiotemporal facial transformer model trained on full-face inputs to capture facial inconsistencies introduced by generation artifacts over time. The integration of these modalities yields a strong, complementary feature space that achieves state-of-the-art results on challenging datasets such as FakeAVCeleb, DeepfakeTIMIT, and DFDC. This contribution emphasizes the importance of leveraging both temporal dynamics and modality-specific artifacts for reliable deepfake detection.
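The later stages of such a pipeline (feature fusion, a temporal convolutional network, and a classifier, matching the module list in the chapter outline) can be sketched as below, assuming the AV-HuBERT audiovisual features and the facial-transformer features have already been extracted offline and aligned frame by frame. The dimensions and layer choices are placeholder assumptions, not the published configuration.

```python
# Hedged sketch of the fusion-and-classify stage of an AV-Lip-Sync+-style
# detector, operating on pre-extracted feature sequences.
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """1-D temporal convolution with a residual connection."""
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.norm = nn.BatchNorm1d(dim)

    def forward(self, x):                    # x: (B, dim, T)
        return torch.relu(self.norm(self.conv(x)) + x)

class FusionClassifier(nn.Module):
    def __init__(self, av_dim=768, face_dim=768, hidden=256):
        super().__init__()
        self.fuse = nn.Linear(av_dim + face_dim, hidden)   # feature fusion
        self.tcn = nn.Sequential(TemporalConvBlock(hidden),
                                 TemporalConvBlock(hidden))
        self.cls = nn.Linear(hidden, 2)                    # real vs fake

    def forward(self, av_feats, face_feats):
        # av_feats, face_feats: (B, T, dim), aligned frame by frame
        x = torch.relu(self.fuse(torch.cat([av_feats, face_feats], -1)))
        x = self.tcn(x.transpose(1, 2)).mean(-1)           # pool over time
        return self.cls(x)

# Usage with dummy 50-frame feature sequences from the two extractors.
model = FusionClassifier()
logits = model(torch.randn(2, 50, 768), torch.randn(2, 50, 768))
```

Fusing before the temporal network, as sketched here, lets the convolutional layers model how cross-modal disagreement evolves over time rather than scoring each frame in isolation.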
The third contribution explores a novel and underexplored dimension of deepfake forensics: the use of large language models such as ChatGPT for audiovisual analysis. While LLMs were originally developed for natural language processing, their recent multimodal capabilities allow them to interpret visual and auditory content alongside text. This work investigates the potential of ChatGPT to identify deepfake artifacts by guiding the model through structured prompt engineering strategies. Unlike conventional detection algorithms that operate as black boxes, ChatGPT offers a degree of interpretability, allowing users to query its reasoning process. Through extensive evaluations against human performance and state-of-the-art AI detectors, the study identifies key limitations, such as sensitivity to prompt phrasing, model hallucinations, and difficulty in processing nuanced audiovisual cues. Nonetheless, it presents early evidence of LLMs as versatile tools in AI forensics, capable of offering explainable and knowledge-driven analysis.
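As a rough illustration of this prompting setup, the sketch below sends a handful of sampled frames to a multimodal LLM through the OpenAI Python SDK and asks for a verdict with reasoning. The prompt wording, the "gpt-4o" model name, and the frame-sampling scheme are assumptions made for demonstration; the study's actual prompts, evaluation protocol, and handling of the audio track are detailed in Chapter 4.

```python
# Illustrative sketch of querying a multimodal LLM about video frames.
import base64
import cv2                 # pip install opencv-python
from openai import OpenAI  # pip install openai

def sample_frames(video_path, num_frames=4):
    """Grab a few evenly spaced frames and JPEG-encode them as base64."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            _, buf = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buf).decode())
    cap.release()
    return frames

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = ("You are a media-forensics assistant. Inspect these video "
          "frames for signs of facial manipulation (blending seams, "
          "mismatched lighting, unnatural lip shapes). Answer REAL or "
          "FAKE, then briefly explain your reasoning.")

content = [{"type": "text", "text": prompt}]
for b64 in sample_frames("suspect_clip.mp4"):  # hypothetical input file
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)  # verdict plus stated reasoning
```

Asking for the reasoning alongside the verdict is what distinguishes this setup from a black-box classifier, though, as the chapter reports, the answer can shift with small changes in prompt phrasing.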
Together, these three lines of investigation advance the field of deepfake detection by addressing core challenges such as generalization to unseen manipulations, robustness to real-world noise, and the need for interpretable outputs. This dissertation not only introduces technically sound and scalable detection strategies but also examines the implications of cross-modal reasoning and foundation model integration for future forensic tools. As the manipulation of video content becomes more widespread and difficult to detect, especially on fast-paced platforms like social media, developing trustworthy, efficient, and explainable detection systems is critical for safeguarding information integrity in the digital age. This work contributes to laying the foundation for such systems, bridging the gap between academic research and practical deployment in multimedia security.
Acknowledgements i
摘要 (Chinese Abstract) iii
Abstract v
Contents viii
List of Figures xi
List of Tables xiv
1 Deepfake Detection: An Overview 1
1 Background 1
1.1 Deepfake Generation 2
1.2 Deepfake Detection 4
2 Motivation 6
3 Research Challenges 7
4 Contributions 9
5 Dissertation Outline 10
2 Lip Sync Matters: A Novel Multimodal Forgery Detector 11
1 Introduction 11
2 Methodology 13
2.1 Synthetic Lip Generation 13
2.2 Audio-Visual Lip-Sync Model 14
3 Dataset and Experimental Setup 16
3.1 Dataset 16
3.2 Preprocessing 17
3.3 Evaluation Metrics 17
3.4 Hyperparameters in Training 18
4 Results and Discussion 19
4.1 Evaluation of the Proposed Audio-Visual Lip-Sync Model 19
4.2 Comparison of Different Models 20
5 Summary 22
3 AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Deepfake Detection of Frontal Face Videos 23
1 Introduction 23
2 Methodology 25
2.1 Audio-Visual Feature Extractor 25
2.2 Lip Image Feature Extractor 26
2.3 Acoustic Feature Extractor 26
2.4 Sync-Check Module 27
2.5 Feature Fusion Module 28
2.6 Temporal Convolutional Network and Classifier 28
2.7 Model Training 29
2.8 AV-Lip-Sync+ with FE 29
3 Datasets and Experimental Setup 30
3.1 Datasets 30
3.2 Preprocessing 32
3.3 Model Configuration and Training 33
3.4 Evaluation Metrics 33
4 Results and Discussion 34
4.1 Evaluation of the Proposed AV-Lip-Sync+ Model 34
4.2 Comparison of Different Models 36
4.3 Ablation Study 40
4.4 Discriminant Analysis of Different Features 42
4.5 Partial Occlusion 42
4.6 Cross-Dataset Generalization 44
4.7 Evaluation on the DeepfakeTIMIT Dataset 45
5 Summary 46
4 How Good is ChatGPT at Audiovisual Deepfake Detection 48
1 Introduction 48
2 Methodology 50
2.1 Prompt Design 50
2.2 Input of LLM Model 52
2.3 Audiovisual Analysis 52
2.4 Prediction Assignment 53
2.5 ChatGPT vs Human vs AI Models 53
2.6 Dataset Selection 53
2.7 Evaluation Metrics 53
2.8 Results 54
3 Ablation Study 60
3.1 Effectiveness of Prompts 60
3.2 Failure Case Study 61
4 Limitations and Discussion 62
5 Summary 63
5 Conclusions and Future Work 65
1 Summary of Contributions 65
2 Directions for Future Research 67
3 Future Prospects of Deepfake Detection 68
Bibliography 69
VITA 78
Full-text release date: 2027/01/06