Deep Learning Approaches for Voice Emotion Recognition Using Sentiment-Arousal Space


  • Narek T. Tumanyan, Weizmann Institute of Science, Israel



Voice emotion recognition, Sentiment-arousal space, Spectral features, Speech sentiment classification


In this paper, we present deep learning-based approaches to emotion recognition in voice recordings. A key component of the methods is the representation of emotion categories in a sentiment-arousal space and the use of this representation in the supervision signal. Our methods use wavelet and cepstral features as efficient representations of audio signals. Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network architectures are used for recognition, depending on whether the audio representation is treated as a spatial signal or as a temporal signal. Several recognition approaches are evaluated, and their results are analyzed.
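As a rough illustration of how emotion categories can be placed in a sentiment-arousal space and used as a supervision signal, the following minimal Python sketch maps each category to a two-dimensional target and decodes a predicted point back to the nearest category. The coordinates, the names `EMOTION_COORDS`, `to_target`, `nearest_emotion` and `mse_loss`, and the choice of a mean squared error loss are all assumptions made for this example; they are not taken from the paper.

```python
import math

# Hypothetical (sentiment, arousal) coordinates in [-1, 1].
# Illustrative values only; they are NOT the coordinates used in the paper.
EMOTION_COORDS = {
    "neutral": (0.0, 0.0),
    "calm": (0.4, -0.6),
    "happy": (0.8, 0.5),
    "sad": (-0.7, -0.5),
    "angry": (-0.7, 0.8),
    "fearful": (-0.6, 0.6),
}


def to_target(label):
    """Return the 2-D regression target for a categorical emotion label."""
    return EMOTION_COORDS[label]


def mse_loss(pred, target):
    """Mean squared error between a predicted point and a target point."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def nearest_emotion(point):
    """Decode a predicted point to the category with the closest coordinates."""
    return min(
        EMOTION_COORDS,
        key=lambda name: math.dist(point, EMOTION_COORDS[name]),
    )


if __name__ == "__main__":
    # Stand-in for a network output; a real model would produce this point.
    prediction = (0.7, 0.4)
    print(nearest_emotion(prediction))
    print(mse_loss(prediction, to_target("happy")))
```

Under this kind of setup, a model regresses a point in the plane rather than predicting a class index directly, so errors between nearby emotions (for example, angry and fearful) are penalized less than errors between distant ones.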






How to Cite

Tumanyan, N. T. (2021). Deep Learning Approaches for Voice Emotion Recognition Using Sentiment-Arousal Space. Mathematical Problems of Computer Science, 56, 35–47.