A Literature Review on Emotion Recognition using Various Methods
Reeshad Khan & Omar Sharif
The most common exposition of the idea of emotion is "a natural instinctive state of mind deriving from one's circumstances, mood, or relationships with others". This definition misses the driving force behind all motivation, which may be positive, negative, or neutral, and that force is essential to understanding emotion in the context of an intelligent agent. Detecting emotions and distinguishing among them is a very complicated task. A decade or two ago, emotion began to attract attention as an important addition to modern technology, raising the hope of a new dawn for intelligent machines. Imagine a world where machines sense what humans need or want. With suitable computation, such a machine could predict further consequences and so help mankind avoid serious situations. Humans are stronger and more intelligent because of emotion, but less efficient than machines. What if machines acquired this human trait? It would be one of the strongest additions to technology so far. The first step toward that goal is to train a system to spot and recognize emotions; this is the starting point of an intelligent system.

Intelligent systems are becoming more efficient at prediction and classification in many areas of practical life. In particular, emotion recognition through deep learning has become an intriguing research area because of its innovative nature and practical implications. The technique mainly consists of detecting emotion from various kinds of input drawn from human behavior and condition, using neural networks trained with deep learning. Because of the complexity mentioned above, an emotion recognition system with high efficiency and accuracy is needed.
Abstract- Emotion recognition is an important area of work for improving the interaction between humans and machines. The complexity of emotion makes the acquisition task difficult. Earlier works proposed capturing emotion through unimodal mechanisms, such as facial expressions only or vocal input only. More recently, the introduction of multimodal emotion recognition has increased the detection accuracy of machines. Moreover, deep learning with neural networks has extended the success rate of machines in emotion recognition. Recent works with deep learning have been performed on different kinds of input from human behavior, such as audio-visual inputs, facial expressions, body gestures, EEG signals, and related brainwaves. Many aspects of this area still need work in order to build a robust system that detects and classifies emotions more accurately. In this paper, we explore the relevant significant works, their techniques, the effectiveness of the methods, and the scope for improving the results.

Previous works focused on eliciting results from unimodal systems. Machines predicted emotion from facial expressions only [1] or from vocal sounds only [2]. Later, multimodal systems, which use more than one feature to predict emotion, proved more effective and gave more accurate results. Combinations of features such as audio-visual expressions, EEG, and body gestures have been used since, and more than one intelligent machine or neural network may be used to implement an emotion recognition system. Shiqing et al. [3] showed that multimodal recognition is more effective than unimodal systems. Research has demonstrated that deep neural networks can effectively generate discriminative features that approximate the complex non-linear dependencies between features in the original set. These deep generative models have been applied to speech and language processing, as well as emotion recognition tasks [4][5][6]. Martin et al. [7] showed that a bidirectional Long Short-Term Memory (BLSTM) network is more effective than a conventional SVM approach. In speech processing, Ngiam et al. [8] proposed and evaluated deep networks to learn audio-visual features from spoken letters. In emotion recognition, Brueckner et al. [9] found that the use of a Restricted Boltzmann Machine (RBM) prior to a two-layer neural network with fine-tuning could significantly improve classification accuracy in the Interspeech automatic likability classification challenge [10]. The work by Stuhlsatz et al. [11] took a different approach for learning acoustic features in speech emotion recognition using Generalized Discriminant Analysis (GerDA) based on Deep Neural Networks (DNNs). Yelin et al. [12] showed that three-layer Deep Belief Networks (DBNs) give better performance than two-layer DBNs in audio-visual emotion recognition. Samira et al. [13] used a Recurrent Neural Network (RNN) combined with a Convolutional Neural Network (CNN) in a CNN-RNN architecture to predict emotion in video.

Some novel methods and techniques have also enriched this research area. They are more accurate, stable, and realistic, and in terms of performance, accuracy, reasonability, and precision they are the dominant solutions. Some are more accurate, while others are more realistic. Some take considerable time and computational power to produce more accurate results, while others sacrifice some accuracy for faster performance. The notion of success may differ, but these solutions are the best available so far.
Yelin Kim and Emily Mower Provost explore whether a subset of an utterance can be used for emotion inference and how the subset varies by emotion class and modality [15]. They propose a windowing method that identifies window configurations (window duration and timing) for aggregating segment-level information for utterance-level emotion inference. Experimental results on the IEMOCAP and MSP-IMPROV datasets show that the identified temporal window configurations demonstrate consistent patterns across speakers, specific to different classes of emotion and modalities. They compare the proposed windowing method to a baseline that randomly selects window configurations and to a traditional all-mean method that uses the full information within an utterance. The proposed method shows significantly higher emotion recognition performance while using only 40-80% of the information within each utterance. The identified windows also show consistency across speakers, demonstrating how multimodal cues reveal emotion over time, and these patterns align with psychological findings. Despite these achievements, the results obtained with this method are not consistent [15].
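The difference between window-based aggregation and the all-mean baseline can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the representation of a window as a relative start and duration, and the toy scores are assumptions made only for illustration.

```python
from statistics import mean


def window_mean(segment_scores, start_frac, duration_frac):
    """Average segment-level scores inside a relative window of the utterance.

    start_frac and duration_frac are fractions of the utterance length
    (hypothetical parameterization, chosen for this sketch).
    """
    n = len(segment_scores)
    start = round(start_frac * n)
    # Keep at least one segment so the mean is always defined.
    length = max(1, round(duration_frac * n))
    window = segment_scores[start:start + length]
    if not window:
        raise ValueError("window lies outside the utterance")
    return mean(window)


def all_mean(segment_scores):
    """Baseline: average over every segment of the utterance."""
    return mean(segment_scores)


# Toy segment-level scores for one emotion class; the evidence is
# concentrated in the early-middle part of the utterance.
scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.2, 0.1, 0.1, 0.1]
windowed = window_mean(scores, start_frac=0.2, duration_frac=0.4)
baseline = all_mean(scores)
print(windowed, baseline)
```

In this toy case the window covers 40% of the utterance and yields a much stronger utterance-level score than the all-mean baseline, which is diluted by segments that carry little emotional evidence.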
A. Yao, D. Cai, P. Hu, S. Wang, L. Shan, and Y. Chen used a well-designed Convolutional Neural Network (CNN) architecture for video-based emotion recognition [14]. Their proposed method, named HoloNet, has three critical considerations in network design. (1) To reduce redundant filters and enhance the non-saturated non-linearity in the lower convolutional layers, they use a modified Concatenated Rectified Linear Unit (CReLU) instead of ReLU. (2) To enjoy the accuracy gain from considerably increased network depth while maintaining efficiency, they combine the residual structure and CReLU to construct the middle layers. (3) To broaden network width and introduce multi-scale feature extraction, the top layers are designed as a variant of the inception-residual structure. This method is more realistic than the other methods reviewed here: it focuses on adaptability to real-time scenarios rather than on accuracy and theoretical performance. Its accuracy is also impressive, but the method applies only to video-based emotion recognition and cannot produce results for other types of data [14].
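For readers unfamiliar with CReLU, the following sketch shows the standard form of the activation on a plain list of values. It is the basic CReLU, not the modified variant used in HoloNet, whose details are not given here; the function names are illustrative.

```python
def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


def crelu(values):
    """Standard CReLU: concatenate ReLU(x) with ReLU(-x).

    The output has twice as many channels as the input, so both the
    positive and the negative phase of each response are preserved.
    """
    return relu(values) + relu([-v for v in values])


activations = crelu([1.5, -2.0, 0.0])
print(activations)
```

Because the negative responses are kept in separate channels instead of being zeroed out, a layer does not need to learn pairs of filters that differ only in sign, which is the redundancy that consideration (1) refers to.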
Y. Fan, X. Lu, D. Li, and Y. Liu proposed a method for video-based emotion recognition in the wild. They used CNN-LSTM and C3D networks to simultaneously model video appearance and motion [16]. They found that the combination of the two kinds of networks gives impressive results, which demonstrates the effectiveness of the method. The method uses LSTM (Long Short-Term Memory), a special kind of RNN; C3D, a direct spatio-temporal model; and hybrid CNN-RNN and C3D networks. It achieves high accuracy and remarkable performance, but it is complex, time-consuming, and less realistic, so its efficiency is not as impressive [16].
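One common way to combine two networks is late fusion of their class scores. The sketch below assumes a simple weighted average; the fusion rule, the weight, the labels, and the score values are all hypothetical and are not taken from [16].

```python
def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Weighted average of two class-probability vectors (late fusion)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same classes")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]


def predict(labels, fused):
    """Return the label with the highest fused score."""
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


labels = ["angry", "happy", "neutral"]   # hypothetical label set
cnn_lstm_scores = [0.5, 0.3, 0.2]        # made-up appearance-model output
c3d_scores = [0.2, 0.6, 0.2]             # made-up motion-model output

fused = fuse_scores(cnn_lstm_scores, c3d_scores, weight_a=0.5)
print(predict(labels, fused))
```

With equal weights the two models disagree and the fused decision follows the more confident one; in practice the weight would be tuned on validation data.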
Zixing Zhang, Fabien Ringeval, Eduardo Coutinho, Erik Marchi, and Björn Schüller proposed improvements to the semi-supervised learning (SSL) technique that address two problems: the low performance a classifier can deliver on challenging recognition tasks, which reduces the trustworthiness of the automatically labeled data, and noise accumulation, in which instances misclassified by the system are still used to train it in future iterations [17]. They exploited the complementarity between audio and visual features to improve the performance of the classifier during the supervised phase. They then iteratively re-evaluated the automatically labeled instances to correct possibly mislabeled data, which enhances the overall confidence of the system's predictions. This technique gives the best possible SSL performance where labeled data is scarce or expensive to obtain, but inherent limitations still restrict its performance in practical applications. The technique has been tested on a specific database with a limited type and amount of data, and the algorithm is not capable of processing physiological data alongside other types of data [17].
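The idea of re-evaluating automatically labeled instances can be sketched with a generic self-training loop. This is not the algorithm of [17]: a toy one-dimensional nearest-centroid classifier stands in for the real audio-visual models, and the margin threshold and all names are assumptions.

```python
from statistics import mean


def fit_centroids(points, labels):
    """Toy classifier: one centroid (mean value) per class."""
    groups = {}
    for x, y in zip(points, labels):
        groups.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in groups.items()}


def predict_with_margin(centroids, x):
    """Return (label, margin); margin is the distance gap between the
    two closest centroids and serves as a confidence value."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - centroids[second]) - abs(x - centroids[best])


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0, rounds=5):
    """Self-training in which every automatic label is re-evaluated in
    each round instead of being frozen once assigned."""
    if len(set(labeled_y)) < 2:
        raise ValueError("need at least two labeled classes")
    pseudo = {}  # index into unlabeled_x -> current automatic label
    for _ in range(rounds):
        train_x = labeled_x + [unlabeled_x[i] for i in pseudo]
        train_y = labeled_y + [pseudo[i] for i in pseudo]
        centroids = fit_centroids(train_x, train_y)
        updated = {}
        for i, x in enumerate(unlabeled_x):
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:  # keep only confident labels
                updated[i] = label
        if updated == pseudo:  # labels are stable, stop early
            break
        pseudo = updated
    return pseudo


result = self_train([0.0, 10.0], ["low", "high"], [1.0, 2.0, 9.0, 5.2])
print(result)
```

The ambiguous point 5.2 never reaches the confidence margin and therefore never enters the training set, which is how this kind of loop limits noise accumulation.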
Wei-Long Zheng and Bao-Liang Lu proposed EEG-based affective models without labeled target data using transfer learning techniques (TCA-based Subject Transfer) [18], which recognize positive emotion more accurately than earlier techniques. Their method achieved 85.01% accuracy. Their transfer learning approach rests on three pillars: TCA-based Subject Transfer, KPCA-based Subject Transfer, and Transductive Parameter Transfer. For data preprocessing, they processed raw EEG signals with a bandpass filter between 1 Hz and 75 Hz, and for feature extraction they employed differential entropy (DE) features. For evaluation, they adopted a leave-one-subject-out cross-validation method. Their experimental results demonstrated that the transductive parameter transfer approach significantly outperforms the other approaches in terms of accuracy, achieving a 19.58% increase in recognition accuracy.

This achievement, however, is limited to positive emotion recognition. The method is limited in recognizing negative and neutral emotions, and considerable improvement is still needed to recognize them more accurately [18].
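Two of the steps above, the DE feature and leave-one-subject-out evaluation, can be sketched briefly. The sketch assumes the usual closed form of differential entropy for a Gaussian-distributed band signal, 0.5 * ln(2 * pi * e * variance); the bandpass filtering and the transfer learning itself are omitted, and the function names and subject identifiers are hypothetical.

```python
import math
from statistics import pvariance


def differential_entropy(band_signal):
    """Differential entropy of a signal under a Gaussian assumption:
    0.5 * ln(2 * pi * e * variance)."""
    variance = pvariance(band_signal)
    if variance <= 0:
        raise ValueError("signal must have non-zero variance")
    return 0.5 * math.log(2 * math.pi * math.e * variance)


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) so that each
    subject is used exactly once as the unseen test subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


de_unit = differential_entropy([-1.0, 1.0])  # variance is exactly 1
folds = list(leave_one_subject_out(["s1", "s1", "s2", "s3"]))
print(de_unit)
for held_out, train, test in folds:
    print(held_out, train, test)
```

Holding out a whole subject, rather than random samples, matches the setting of the paper: the model never sees labeled data from the target subject.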
In emotion recognition, there is no definitive method that serves as the single solution. Many solutions have been proposed, and more will come in the near future with significant improvements in efficiency, accuracy, and usability. Past and current research shows that multimodal approaches dominate unimodal ones in emotion recognition. According to the most recent research, using EEG and audio-visual signals yields the best results. We assume that LSTM-RNN is the best way to handle multiple modalities, so our proposal focuses on emotion recognition from EEG and audio-visual signals using LSTM-RNN. This type of research has been done before. Our challenge is to improve the model so that it is trained on EEG and audio-visual data at the same time and learns the relation between them; then, if one type of data is unavailable in a given situation, the model can still produce a result by exploiting that relation. The training will therefore have two parts: training on the data and training to understand the relations between the data.
In this paper we discussed the work done on emotion recognition and the superior and novel approaches and methods for achieving it. We have proposed a glimpse of a probable solution and method for recognizing emotion. Work so far substantiates that emotion recognition using the user's EEG signal and audio-visual signal has the highest recognition rate and the highest performance.

We are working towards a machine with emotions: a machine or system that can think like humans, feel warmth of heart, judge events, prioritize between choices, and exhibit many more emotional traits. To make this dream a reality, we first need the machine or system to understand human emotions, imitate them, and master them. We have just started to do that. Some real examples exist these days, and some features and services, such as Microsoft Cognitive Services, are gaining popularity, but a lot of work is still required in terms of efficiency, accuracy, and usability. Emotion recognition is therefore an area that will require great attention in the future.
Reference and year | Approach and Method | Performance
Wei-Long Zheng and Bao-Liang Lu (2016) | EEG-based affective models without labeled target data using transfer learning techniques (TCA-based Subject Transfer) | Positive (85.01%) emotion recognition rate is higher than other approaches, but neutral (25.76%) and negative (10.24%) emotions are often confused with each other.
Zixing Zhang, Fabien Ringeval, Eduardo Coutinho, Erik Marchi and Björn Schüller (2016) | Semi-Supervised Learning (SSL) technique | Delivers a strong performance in the classification of high/low emotional arousal (UAR = 76.5%), and significantly outperforms traditional SSL methods by at least 5.0% (absolute gain).
Y. Fan, X. Lu, D. Li, and Y. Liu (2016) | Video-based Emotion Recognition Using CNN-RNN and C3D Hybrid Networks | Achieved accuracy of 59.02% (without using any additional emotion-labeled video clips in the training set), which is the best so far.
A. Yao, D. Cai, P. Hu, S. Wang, L. Shan and Y. Chen (2016) | HoloNet: towards robust emotion recognition in the wild | Achieved mean recognition rate of 57.84%.
Yelin Kim and Emily Mower Provost (2016) | Data-driven framework to explore patterns (timings and durations) of emotion evidence, specific to individual emotion classes | Achieved 65.60% UW accuracy, 1.90% higher than the baseline.
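The table reports UAR (unweighted average recall), and unweighted (UW) accuracy is commonly defined in the same way: the mean of the per-class recalls, so that every class counts equally regardless of how many samples it has. A minimal sketch with made-up labels (none of the values come from the reviewed papers):

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; each class contributes equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Plain accuracy, dominated by the majority class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Imbalanced toy data: 8 "high" arousal samples and 2 "low" samples.
y_true = ["high"] * 8 + ["low"] * 2
y_pred = ["high"] * 8 + ["high", "low"]

uar = unweighted_average_recall(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(uar, acc)
```

On this imbalanced example plain accuracy looks high because the majority class is always correct, while UAR exposes that only half of the minority class is recognized.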
Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine. Microsoft Research, One Microsoft Way; Department of Computer Science and Engineering, The Ohio State University. 2014.
Sparse multilayer perceptron for phoneme recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 2012.
Deep and wide: Multiple layers in automatic speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 2012.
Likability classification - a not so deep neural network approach. In Proceedings of INTERSPEECH, 2012.
Xiaoming Zhao. Multimodal Emotion Recognition Integrating Affective Speech with Facial Expression. Institute of Image Processing and Pattern Recognition, Taizhou University, Taizhou 318000, China; Hunan Institute of Technology, Hengyang 421002, China; Bay Area Compliance Labs. Corp., Shenzhen 518000, China. 2014.
Video-based Emotion Recognition Using CNN-RNN and C3D Hybrid Networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016), Tokyo, Japan.