Emotion Classification of Voice Recordings Using Deep Learning


Narek T. Tumanyan, Weizmann Institute of Science




Keywords: Voice sentiment detection, Mood recognition, Speech emotion recognition, Cepstral features


In this work, we present methods for voice emotion classification using deep learning techniques. To process audio signals, our method leverages spectral features of voice recordings, which are known to serve as powerful representations of temporal signals. To tackle the classification task, we consider two approaches to processing the spectral features: as temporal signals and as spatial (2D) signals. For each approach, we use a neural network architecture suited to it. We analyze the classification results and present the resulting insights.
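The two views of spectral features described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a plain log-magnitude spectrogram in place of the cepstral features the paper works with, uses a synthetic tone instead of a voice recording, and the function name `log_spectrogram` and the parameters `frame_len` and `hop` are hypothetical choices made for illustration.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude short-time Fourier transform of a 1-D signal.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude)

# Assumption: a synthetic one-second 440 Hz tone at 8 kHz stands in for real audio.
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

spec = log_spectrogram(signal)

# View 1: temporal signal, one feature vector per time step
# (the kind of input a recurrent network would consume).
as_sequence = spec                     # (time_steps, n_features)

# View 2: spatial/2D signal, a single-channel image
# (the kind of input a convolutional network would consume).
as_image = spec.T[np.newaxis, :, :]    # (channels, frequency, time)

print(as_sequence.shape, as_image.shape)
```

The same array feeds both views; only the axis layout differs, which is what lets each network type treat the features either as a sequence or as an image.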






How to Cite

Tumanyan, N. T. (2022). Emotion Classification of Voice Recordings Using Deep Learning. Mathematical Problems of Computer Science, 57, 7–17. https://doi.org/10.51408/1963-0082