Towards enhancing emotion recognition via multimodal framework

C. Akalya Devi, D. Karthika Renuka, G. Pooventhiran, D. Harish, Shweta Yadav, Krishnaprasad Thirunarayan

Research output: Contribution to journal › Article › peer-review

Abstract

Emotional AI is the next era of AI and is poised to play a major role in fields such as entertainment, health care, and self-paced online education by drawing on cues from multiple sources. In this work, we propose a multimodal emotion recognition system that extracts information from speech, motion-capture, and text data. The main aim of this research is to improve the unimodal architectures so that they outperform the state of the art, and to combine them into a robust multimodal fusion architecture. We developed 1D and 2D CNN-LSTM time-distributed models for speech, a hybrid CNN-LSTM model for motion-capture data, and a BERT-based model for text to achieve state-of-the-art results, and explored both concatenation-based decision-level fusion and Deep CCA-based feature-level fusion schemes. The proposed speech and mocap models achieve emotion recognition accuracies of 65.08% and 67.51%, respectively, and the BERT-based text model achieves an accuracy of 72.60%. The decision-level fusion approach significantly improves emotion detection accuracy on the IEMOCAP and MELD datasets: it achieves 80.20% on IEMOCAP, which is 8.61% higher than state-of-the-art methods, and 63.52% and 61.65% in 5-class and 7-class classification on MELD, both of which exceed the state of the art.
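As a rough illustration of the time-distributed CNN-LSTM idea described in the abstract, the sketch below applies a shared 2D CNN feature extractor to each spectrogram chunk of an utterance and aggregates the per-chunk features with an LSTM before emotion classification. The input shape, chunk count, layer sizes, class count, and training settings here are illustrative assumptions, not the authors' published configuration.

```python
# Minimal, hypothetical sketch of a 2D CNN-LSTM time-distributed speech model.
# All hyperparameters below are assumptions for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4            # assumed 4-class setup (e.g., angry, happy, sad, neutral)
TIME_STEPS = 10            # assumed number of spectrogram chunks per utterance
CHUNK_SHAPE = (64, 32, 1)  # assumed (mel bins, frames, channels) per chunk

def build_speech_model():
    inputs = layers.Input(shape=(TIME_STEPS, *CHUNK_SHAPE))
    # The same 2D CNN is applied to every chunk via TimeDistributed wrappers.
    x = layers.TimeDistributed(
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"))(inputs)
    x = layers.TimeDistributed(layers.MaxPooling2D((2, 2)))(x)
    x = layers.TimeDistributed(
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D((2, 2)))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    # The LSTM aggregates the per-chunk CNN features across time.
    x = layers.LSTM(128)(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    build_speech_model().summary()
```

A decision-level fusion step in the same spirit would concatenate (or average) the softmax outputs of the unimodal speech, mocap, and text models and train a small classifier on top; the Deep CCA variant instead correlates intermediate unimodal features before fusion.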
Original language: English
Pages (from-to): 2455-2470
Number of pages: 16
Journal: Journal of Intelligent and Fuzzy Systems
Volume: 44
Issue number: 2
DOIs
State: Published - 2023

ASJC Scopus Subject Areas

  • Statistics and Probability
  • General Engineering
  • Artificial Intelligence

Keywords

  • BERT
  • CNN-LSTM
  • DCCA
  • Emotion recognition
  • Time-distributed models

Disciplines

  • Computer Sciences
  • Computer Engineering