Evaluating Perceptual Quality and Naturalness in Word-Based Emotion Conversion for Hindi Speech

Authors

  • Archana Agarwal, Dr. Vipan Kumari

Keywords:

 Word-Based Emotion Conversion  Hindi Speech Processing  Perceptual Evaluation  Speech Naturalness  Emotional Speech Synthesis  Acoustic Feature Mapping  Prosody Modification  Human-Computer Interaction  Mean Opinion Score (MOS)  Speech Quality Assessment

Abstract

Speech is a multidimensional communication medium that conveys not only semantic information but also paralinguistic cues such as emotion, attitude, and intent. Emotional expression significantly influences listener perception, comprehension, and engagement. With the rapid growth of human-computer interaction systems, the ability to manipulate and synthesize emotionally expressive speech has become a central research focus in speech signal processing. Emotion conversion systems aim to transform speech from one emotional state to another while preserving linguistic content and speaker identity. Although substantial research has been conducted in high-resource languages such as English and Mandarin, systematic evaluation of perceptual quality and naturalness in Hindi emotion conversion systems remains limited.

Word-based emotion conversion is a commonly adopted approach in speech transformation systems. In this paradigm, emotional modification is applied at the word level: acoustic parameters such as pitch contour, duration, energy, and spectral features are altered according to predefined emotional mappings. While such methods are computationally efficient and structurally simple, their perceptual effectiveness, particularly in continuous Hindi speech, requires rigorous evaluation. Hindi, characterized by its phonetic richness, syllable-timed rhythm, aspirated consonants, retroflex sounds, and vowel length contrasts, presents unique challenges in modeling emotional nuances. Word-level manipulation may fail to capture fine-grained emotional cues embedded within sub-word units, potentially affecting naturalness and intelligibility.
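As a rough illustration of the word-level paradigm, the sketch below applies one set of prosody factors per word to frame-level F0 and energy tracks. It is a minimal sketch under stated assumptions, not the system evaluated in the study: the function `convert_words`, the table `EMOTION_MAP`, and every numeric factor are hypothetical, and duration and spectral changes are left out for brevity.

```python
import numpy as np

# Hypothetical per-emotion prosody factors (illustrative values only,
# not taken from the study): an F0 scale and an energy gain.
EMOTION_MAP = {
    "neutral":   {"f0_scale": 1.00, "energy_gain": 1.00},
    "happiness": {"f0_scale": 1.20, "energy_gain": 1.15},
    "sadness":   {"f0_scale": 0.90, "energy_gain": 0.85},
    "anger":     {"f0_scale": 1.15, "energy_gain": 1.30},
}

def convert_words(f0, energy, word_spans, emotions):
    """Apply one emotion mapping per word to frame-level F0 and energy.

    f0, energy : 1-D arrays of equal length (one value per frame);
                 f0 == 0 marks unvoiced frames and stays 0 after scaling.
    word_spans : list of (start_frame, end_frame) pairs, end exclusive.
    emotions   : target emotion label for each word.
    """
    f0_out = np.asarray(f0, dtype=float).copy()
    en_out = np.asarray(energy, dtype=float).copy()
    for (start, end), emotion in zip(word_spans, emotions):
        params = EMOTION_MAP[emotion]
        f0_out[start:end] *= params["f0_scale"]
        en_out[start:end] *= params["energy_gain"]
    return f0_out, en_out
```

Because each word receives a constant factor, the contour can change abruptly wherever two adjacent words are mapped with different factors, which is the kind of boundary effect a perceptual evaluation has to account for.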

This study investigates the perceptual quality and naturalness of word-based emotion conversion in continuous Hindi speech. The research evaluates how effectively word-level transformations convey target emotions and how these modifications influence listener perception. An evaluation framework was designed that integrates objective acoustic metrics with subjective listening experiments. The emotional categories examined are Neutral, Happiness, Sadness, and Anger. The system applies acoustic feature mapping, learned with machine learning models, to perform word-level emotional transformation.

A controlled Hindi emotional speech corpus was developed, comprising balanced utterances from male and female speakers across multiple age groups. The converted speech samples were assessed using objective measures such as Mel Cepstral Distortion (MCD), Fundamental Frequency Root Mean Square Error (F0 RMSE), and duration variation indices. Additionally, subjective evaluation was conducted using Mean Opinion Score (MOS), emotional identification accuracy tests, and intelligibility ratings administered to native Hindi listeners.
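The abstract does not give the exact formulas used, so the following is a minimal sketch of the conventional definitions of these measures. It assumes the reference and converted feature sequences are already time-aligned frame by frame (for example by dynamic time warping), that F0 values of 0 mark unvoiced frames, and that MOS ratings are on a 1 to 5 scale. The function names are illustrative.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, conv_mcep):
    """Mean MCD in dB over time-aligned frames.

    Both inputs have shape (frames, coeffs). Column 0 (the energy
    term) is excluded, as is common practice.
    """
    ref = np.asarray(ref_mcep, dtype=float)[:, 1:]
    conv = np.asarray(conv_mcep, dtype=float)[:, 1:]
    diff_sq = np.sum((ref - conv) ** 2, axis=1)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * diff_sq)
    return float(np.mean(per_frame))

def f0_rmse(ref_f0, conv_f0):
    """RMSE in Hz over frames that are voiced (f0 > 0) in both contours."""
    ref = np.asarray(ref_f0, dtype=float)
    conv = np.asarray(conv_f0, dtype=float)
    voiced = (ref > 0) & (conv > 0)
    if not np.any(voiced):
        return float("nan")
    return float(np.sqrt(np.mean((ref[voiced] - conv[voiced]) ** 2)))

def mean_opinion_score(ratings):
    """Mean and sample standard deviation of listener ratings."""
    r = np.asarray(ratings, dtype=float)
    return float(r.mean()), float(r.std(ddof=1))
```

Lower MCD and F0 RMSE indicate that the converted speech is acoustically closer to the reference target-emotion recording, while MOS summarizes the listeners' naturalness judgments directly.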

Findings indicate that while word-based emotion conversion successfully conveys broad emotional categories, perceptual naturalness varies significantly across emotions. High-arousal emotions such as anger and happiness are more easily recognized because of their pronounced pitch and energy changes. Subtler emotions such as sadness, however, show lower identification accuracy, suggesting limitations in word-level acoustic modeling. The results further show that abrupt parameter shifts at word boundaries may introduce perceptual discontinuities that reduce naturalness.
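One way to make the word-boundary effect concrete is to measure the size of the F0 step across each boundary. The sketch below is a hypothetical diagnostic, not a measure reported in the study: it assumes a frame-level F0 contour in Hz with 0 for unvoiced frames, and it expresses each jump in semitones so that values are comparable across speakers with different pitch ranges.

```python
import numpy as np

def boundary_f0_jumps(f0, boundaries):
    """Absolute F0 jump, in semitones, across each word boundary.

    f0         : 1-D frame-level contour in Hz; 0 marks unvoiced frames.
    boundaries : frame indices (>= 1) at which a new word starts.
    Boundaries with an unvoiced frame on either side yield NaN.
    """
    f0 = np.asarray(f0, dtype=float)
    jumps = []
    for b in boundaries:
        left, right = f0[b - 1], f0[b]
        if left > 0 and right > 0:
            jumps.append(abs(12.0 * np.log2(right / left)))
        else:
            jumps.append(float("nan"))
    return np.array(jumps)
```

Large values at many boundaries would be consistent with the discontinuities described above, although how large a jump must be before listeners notice it is a perceptual question this sketch does not answer.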

The study contributes to speech emotion research by providing a systematic perceptual evaluation framework for Hindi emotion conversion systems. The findings highlight strengths and limitations of word-based transformation approaches and emphasize the importance of fine-grained modeling for enhanced naturalness. The research establishes baseline perceptual benchmarks for future development of phoneme-level and hybrid emotion conversion frameworks in Hindi speech processing.

How to Cite

Archana Agarwal, Dr. Vipan Kumari. (2024). Evaluating Perceptual Quality and Naturalness in Word-Based Emotion Conversion for Hindi Speech. International Journal of Engineering Science & Humanities, 14(1), 173–186. Retrieved from https://www.ijesh.com/j/article/view/637
