Making Speaker Diarization System Noise Tolerant
Keywords: Speaker recognition, Speaker diarization, Noise robustness, Teacher-student, Consistency regularization
The goal of speaker diarization is to identify and separate the speakers in a multi-speaker audio recording. However, noise in the recording degrades the accuracy of these systems. In this paper, we explore multi-condition training, consistency regularization, and teacher-student techniques to improve the resilience of speaker embedding extractors to noise. We evaluate these methods on speaker verification and speaker diarization tasks and demonstrate that they improve performance in the presence of noise and reverberation. To test the speaker verification and diarization systems under noisy and reverberant conditions, we created augmented versions of the VoxCeleb1 cleaned test set and the VoxConverse dev set by adding noise and reverberation at different SNR values. Our results show that, on average, the teacher-student method yields a 19.1% relative improvement in speaker recognition and consistency regularization yields a 17% relative improvement in speaker diarization, compared to a multi-condition trained baseline.
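As an illustrative sketch of two ingredients mentioned above — mixing noise into clean speech at a target SNR, and the exponential-moving-average weight update typical of mean-teacher training — consider the following minimal NumPy code. The function names and constants are ours, not from the paper, and a real system would operate on batched spectrogram features rather than raw arrays:

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the mixture clean + noise has the requested SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + scaled_noise

def consistency_loss(emb_clean, emb_noisy):
    """MSE between embeddings of the clean and augmented views of the same utterance."""
    return float(np.mean((emb_clean - emb_noisy) ** 2))

def ema_update(teacher_w, student_w, alpha=0.999):
    """Mean-teacher step: teacher weights track an exponential moving average of the student's."""
    return alpha * teacher_w + (1 - alpha) * student_w

# Toy example: 1 second of a 440 Hz tone at 16 kHz, mixed with Gaussian noise at 10 dB SNR.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_noise_at_snr(clean, rng.standard_normal(16000), snr_db=10)

# The residual is exactly the scaled noise, so the achieved SNR matches the target.
achieved_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(achieved_snr, 1))  # 10.0
```

In a training loop, the clean and noisy views would be fed through the student embedding extractor (or student and teacher, respectively), `consistency_loss` would be added to the supervised objective, and `ema_update` would be applied to every teacher parameter after each optimizer step.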
Copyright (c) 2023 Davit S. Karamyan, Grigor A. Kirakosyan and Saten A. Harutyunyan
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.