Extracting A Discriminative Acoustic Features from Voiced Segments for Improving Speech Emotion Recognition Accuracy

Reda Elbarougy


The performance of a speech emotion recognition (SER) system depends mainly on whether the extracted features are relevant to the emotions conveyed in speech. Finding discriminative features for SER remains a challenging problem. The traditional approach extracts a large number of features from all frames of an utterance and then applies several statistics across all frames to obtain utterance-level features. However, human emotions affect the properties of each phoneme differently, and consequently affect voiced and unvoiced segments differently. Utterance-level features are therefore less effective, because the statistics are applied to all frames regardless of whether a frame is voiced or unvoiced. To enrich the discriminative properties of the extracted features, we propose two new levels for feature extraction: voiced and unvoiced. The performance of features extracted at these two levels was compared with that of utterance-level features. For each of the three levels, 13 Mel-frequency cepstral coefficients (MFCCs) were extracted from each frame, and 13 statistics were then applied to each level separately. The voiced level was found to include many discriminative features, and its performance is close to that of the utterance level. To improve the traditional SER system, features extracted at the voiced level and the utterance level were combined. The combined level outperforms the traditional utterance-level system.
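The level-wise extraction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it assumes the frame-level MFCCs and a per-frame voiced/unvoiced decision are already available (here replaced by synthetic values), and the five statistics in `STATS` are a placeholder set, since the abstract states that 13 statistics were used but does not list them. The names `level_features` and `extract_levels` are hypothetical.

```python
import random
import statistics

# Illustrative statistics only: the abstract says 13 statistics were used
# but does not list them, so this smaller set is an assumption.
STATS = {
    "mean": statistics.fmean,
    "std": statistics.pstdev,
    "min": min,
    "max": max,
    "range": lambda xs: max(xs) - min(xs),
}


def level_features(frames):
    """Apply every statistic to every coefficient track of the given frames.

    frames: list of per-frame coefficient lists (e.g. 13 MFCCs per frame).
    Returns a flat list of len(STATS) * n_coefficients values.
    """
    if not frames:
        raise ValueError("level has no frames")
    tracks = list(zip(*frames))  # one tuple per coefficient, across frames
    return [stat(track) for track in tracks for stat in STATS.values()]


def extract_levels(mfcc, voiced):
    """Split frames by the voiced flag and compute each level's features."""
    voiced_frames = [f for f, v in zip(mfcc, voiced) if v]
    unvoiced_frames = [f for f, v in zip(mfcc, voiced) if not v]
    features = {
        "utterance": level_features(mfcc),
        "voiced": level_features(voiced_frames),
        "unvoiced": level_features(unvoiced_frames),
    }
    # Combined level: voiced-level and utterance-level vectors concatenated.
    features["combined"] = features["voiced"] + features["utterance"]
    return features


# Synthetic stand-in for real MFCCs: 50 frames x 13 coefficients.
rng = random.Random(0)
mfcc = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(50)]
voiced = [i % 3 != 0 for i in range(50)]  # hypothetical voicing decisions

features = extract_levels(mfcc, voiced)
print({name: len(vec) for name, vec in features.items()})
```

With 13 coefficients and 13 statistics, each single level would yield 169 features per utterance and the combined level 338; the placeholder set above simply produces shorter vectors of the same layout.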


Acoustic features extraction, discriminative features, speech emotion recognition, voiced, unvoiced and utterance levels for features extraction.





All Rights Reserved © 2012 IJARCSEE

This work is licensed under a Creative Commons Attribution 3.0 Unported License.