Feature extraction and classification methods of facial expression: a survey

ABSTRACT


INTRODUCTION
In the era of artificial intelligence, facial expression recognition (FER) is an interesting and challenging task beset by problems such as limited datasets, varying environments, pose, occlusion, and person variation. FER systems have been applied in many domains, such as human-computer interaction (HCI), games, data-driven animation, surveillance, and clinical monitoring [1]. Ekman and Friesen, American psychologists, defined six universal facial expressions: fear, happiness, anger, disgust, surprise, and sadness, and also developed the action-unit-based facial action coding system (FACS) to describe the facial features of expressions [2]. Facial expressions convey nonverbal communication cues that play a significant role in interpersonal relations. Some works in the literature add further emotions such as neutral and contempt, as well as many compound facial emotions. Some researchers employ handcrafted features extracted with classical algorithms, while others employ complex features extracted with deep learning methods. In this paper, we survey feature extraction methods, feature descriptors, classification methods, feature dimension reduction methods, and frameworks of facial expression recognition systems, and compare their results. The remainder of the paper is organized as follows. Section 2 reviews the literature on current FER systems. The typical FER system is shown in Section 3. The two types of features of facial images are discussed in Section 4, and Section 5 describes facial databases for FER systems. Section 6 states the open problems of FER systems. The last section presents conclusions and future work.

LITERATURE OF CURRENT FER SYSTEM
The work in [3] used geometric feature extraction, regional local binary pattern (LBP) feature extraction, fusion of both feature types using autoencoders, and a self-organizing map (SOM)-based classifier. The average accuracy was 97.55% on the MMI and 98.95% on the CK+ database. The SOM-based classifier significantly improved over SVM, with increases of 3.94% on CK+ and 4.36% on MMI, respectively [3]. The study in [4] explored multiple feature fusion applying histograms of oriented gradients from three orthogonal planes (HOG-TOP), with experiments on three datasets: CK+, GEMEP-FERA 2011, and acted facial expressions in the wild (AFEW) 4.0. The work in [5] presented a FER model using Haar cascades to detect face components and a neural network (NN) trained on eye and mouth features from the Japanese JAFFE database. Compared with Sobel edge detection methods, the proposed method achieved better accuracy; however, the problems of illumination and image pose remain, and fully meeting theoretical and practical requirements by integrating other biometric authentication and HCI perception methods is still open [5]. The study in [1] examined an emotion recognition system using hybrid feature descriptors combining spatial bag of features and spatial scale-invariant feature transform (SBoF-SSIFT) with a k-nearest neighbor classifier. Codebook construction is applied after feature extraction to represent large feature sets by grouping similar features into a specified number of clusters. The experiments showed accuracies of 98.33% and 98.5% on the JAFFE and extended Cohn-Kanade (CK+) datasets, respectively. However, the recognition performance depends on the number of clusters for codebook generation, the number of detected features, the number of levels for image segmentation, and the size of the training dataset [1]. The study in [6] implemented cognition- and mapped-binary-pattern-based FER using the basic emotion model and the circumplex model on CK+, with 100 images for training and 50 for testing. In the preprocessing step, unwanted information such as hair, ears, and background is removed from the facial image. LBP and a pseudo-3D model are used to extract the facial contours and to segment the face area into sub-regions. A mapped local binary pattern is employed to reduce the feature dimension, and then two classifiers, SVM and softmax, are used. The results show that local features and expressions are correlated; moreover, the two classifiers differ only slightly in performance. Occlusion, complex conditions, and micro-expression recognition are left for future FER systems [6]. The work in [7] proposed the angled local directional pattern (ALDP) for texture analysis of facial expressions, evaluated with six classifiers (k-NN, SVM, decision tree, random forest, Gaussian naive Bayes, and perceptron) on the CK+ dataset. First, the facial image is detected using Haar-like features as in [5], and then the detected image is cropped and normalized. The accuracy improved to 99% with the ALDP method without preprocessing [7]. The study in [8] proposed grey wolf optimization (GWO) for feature selection and a GWO-neural network (GWO-NN) for classification. The facial parts (eyes, nose, mouth, and ears) are detected using the Viola-Jones algorithm, and SIFT is then used to extract feature points. The accuracy of 89.79% on CK+ is lower than that of [7], and the method achieved 91.22% overall [8]. The work in [9] proposed a framework with a high-dimensional combination of appearance and geometric features. The system used deep sparse autoencoders (DSAE) to learn robust discriminative features and an active appearance model (AAM) to locate 51 facial landmark points. Three feature descriptors, HoG, gray values, and LBP, are utilized to describe the local features. PCA, a linear dimension reduction method, is used to compress the features, whose map is then given as the input of the DSAE. The proposed framework achieved 95.79% accuracy on the CK+ dataset using leave-one-subject-out cross-validation [9].
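As an illustration of the dimension reduction step used in the DSAE framework above, the following sketch (not the authors' code; the feature sizes and data are made up) compresses high-dimensional descriptors with PCA via the SVD:

```python
# Illustrative sketch: compressing concatenated appearance descriptors
# (e.g. HoG/LBP/gray values) with PCA before feeding them to a learner.
import numpy as np

def pca_compress(X, n_components):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data matrix yields the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 2048))   # 100 faces, 2048-D descriptors
compressed = pca_compress(features, 64)   # reduce to 64 dimensions
print(compressed.shape)                   # (100, 64)
```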
The study in [10] presented three models: a differential geometric fusion network (DGFN) that extracts handcrafted features, a deep facial sequential network (DFSN) based on a CNN with automatically extracted features, and DFSN-I, which combines the advantages of DGFN and DFSN by mapping and concatenating handcrafted and automatically extracted features. DFSN-I achieved the best performance among the three models on all of CK+, Oulu-CASIA, and MMI [10]. The work in [11] used a deep convolutional neural network (DCNN) with the Caffe framework on a Tesla K20Xm GPU. In preprocessing, the frontal face is detected and cropped with OpenCV from the CK+ and JAFFE images. The experiments achieved 97% accuracy with leave-one-subject-out cross-validation on CK+ and 98.12% with 10-fold cross-validation on JAFFE [11]. Another study reviewed 22 local binary pattern variants on the JAFFE and CK databases using the simple parameter-free nearest neighbor classifier (1-NN). On JAFFE, the highest recognition accuracy of 97.14% was achieved using dLBPα, ELGS, and LTP, while on CK the highest recognition rate of 100% was achieved using the AELTP, BGC3, CSALTP, dLBPα, nLBPd, STS, and WLD descriptors. The basic LBP descriptor achieved an acceptable performance of 95.71% on JAFFE and 99.28% on CK. The study could be extended to other problems and other datasets. The work in [12] used a DCNN with data augmentation, cross-entropy loss, and an L2 multi-class SVM. In [13], weighted center regression adaptive feature mapping (W-CR-AFM) was used for feature distribution and a CNN for feature training on CK+, the Radboud Faces Database (RaFD), the Amsterdam Dynamic Facial Expression Set (ADFES), and a proprietary database. Unlike other papers, spatial normalization and feature enhancement are used as preprocessing. The recognition rates were 89.84%, 96.27%, and 92.70% for CK+, RaFD, and ADFES, respectively. The work in [14] addressed the illumination problem of real-world facial images using the fast Fourier transform and contrast-limited adaptive histogram equalization (FFT+CLAHE) for poor illumination, followed by a merged binary pattern code (MBPC). PCA is used for feature dimension reduction and k-NN as the classifier on the SFEW dataset [14]. The authors of [15] released a new database, iCV-MEFED, at the FG workshop; in their paper, a multi-modality CNN is compared with a plain CNN for micro-emotion recognition. The proposed network first extracts visual and geometrical feature information and then concatenates it into a long vector, which is fed to a hinge loss layer. The framework, implemented in Caffe, performs better than the plain CNN, with a misclassification score of 80.212137 [15]. Three further works from the workshop were also proposed. The first-winner method uses a CNN with a geometric representation of landmark displacements, leading to better results than texture-only information. Its recognition accuracy reaches 51.84% for seven expressions and 13.7% for compound emotions, with an average runtime of 1.57 ms on a GPU or 30 ms on a CPU [16].
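The illumination correction step mentioned above can be illustrated with a simplified sketch. The paper in question uses FFT+CLAHE; the plain global histogram equalization below is only a stand-in that shows the idea of redistributing intensities before feature extraction (the input image is synthetic):

```python
# Simplified illumination-correction sketch: global histogram
# equalization of an 8-bit grayscale image (a stand-in for FFT+CLAHE).
import numpy as np

def equalize_histogram(img):
    """img: 2-D uint8 array; returns an equalized uint8 array."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Map each grey level through the normalized cumulative histogram.
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[img]

dark = np.tile(np.arange(40, 80, dtype=np.uint8), (32, 1))  # low-contrast image
out = equalize_histogram(dark)
print(out.min(), out.max())  # intensities now span the full 0-255 range
```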
The study in [17] employed a deep emotional attention model using a cross-channel CNN with an attention modulator on the bimodal face and body (FABO) benchmark database. The system applies a CNN to learn the locations of facial expressions in a cluttered scene, and the experiments cover both one-expression and two-expression attention mechanisms. The accuracy of the framework with attention is better than that without attention [17]. The work in [18] proposed a robust facial landmark extraction method combining a data-driven fully convolutional network (FCN) and a model-driven pre-trained point distribution model (PDM) in three steps: estimation, correction, and tuning (ECT). Response maps for global landmark estimation are computed by the FCN, and the maximum points of the maps are then fitted with the PDM to generate the initial facial shape. Finally, a weighted version of regularized landmark mean-shift (RLMS) is applied to fine-tune the facial shape iteratively [18].
The work in [19] designed a NN architecture learned with three loss functions: fully supervised, weakly supervised, and hybrid regularization. The proposed model achieved promising results on CK+ and JAFFE under lab environments and on SFEW in the wild [19]. The study in [20] proposed a transductive deep transfer learning (TDTL) architecture to address cross-database non-frontal facial expression recognition, applying the VGGFace 16-Net on the BU-3DFE and Multi-PIE datasets. The study found that the VGG feature representation is better than traditional handcrafted features such as SIFT and LBP at representing complicated features [20]. The work in [21] also used these two datasets to address the cross-domain and cross-view facial expression problem, using a transductive transfer regularized least-squares regression (TTRLSR) model, color SIFT (CSIFT) features at 49 landmarks, and SVM classifiers. The two databases share only four identical categories: neutral, surprise, happy, and disgust. The study conducted two kinds of experiments, cross-domain with the same view and cross-view with the same domain, and PCA was also applied to reduce the feature dimension.
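The regression-based models above build on regularized least-squares classification. The following sketch shows only that common building block on synthetic data (the transfer and view-regularization terms specific to TTRLSR are omitted):

```python
# Minimal sketch of a regularized least-squares (ridge) classifier on
# one-hot labels -- the core of regression-based expression models.
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression, with a bias column appended to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    d = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ Y)

def ridge_predict(X, W):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xb @ W, axis=1)  # class with the largest score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])
y = np.repeat([0, 1], 50)
Y = np.eye(2)[y]                      # one-hot targets
W = ridge_fit(X, Y)
acc = (ridge_predict(X, W) == y).mean()
print(acc)                            # the two synthetic classes separate easily
```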

TYPICAL FER SYSTEM
The typical FER system is shown in the system flow of Figure 1. Face detection consists of three tasks: locating the face, cropping the face, and scaling the face. The feature extraction, dimension reduction, and classification methods can then be selected.
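The flow above can be sketched as a pipeline skeleton. Every stage below is a hypothetical placeholder, not a specific paper's implementation: a real system would use e.g. Viola-Jones or a CNN detector, a texture descriptor, and a trained classifier.

```python
# Skeleton of the typical FER flow: detect -> crop/scale ->
# extract features -> classify. All stage bodies are placeholders.
import numpy as np

def detect_and_crop(image, size=(48, 48)):
    """Placeholder for locating, cropping, and scaling the face."""
    return np.resize(image, size)  # stand-in for detection + rescaling

def extract_features(face):
    """Placeholder descriptor: a normalized intensity histogram."""
    hist, _ = np.histogram(face, bins=32, range=(0, 256))
    return hist / hist.sum()

def classify(features, prototypes):
    """Nearest-prototype classifier over per-expression mean features."""
    dists = {label: np.linalg.norm(features - p) for label, p in prototypes.items()}
    return min(dists, key=dists.get)

rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(100, 120))    # synthetic input frame
face = detect_and_crop(image)
feat = extract_features(face)
prototypes = {"happy": np.full(32, 1 / 32), "sad": np.zeros(32)}
print(classify(feat, prototypes))
```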

FEATURES OF FACIAL IMAGES
Most FER systems use geometrical features, visual features, or both to extract information from face images.

Geometrical features
Geometrical methods estimate the locations of facial landmarks or of facial components such as the eyebrows, mouth, and nose; these can be measured by distances, curvatures, deformations, and other geometric properties to represent the geometric facial features, although such features are sensitive to noise [3], [4], [8], [15], [16]. The paper [8] describes a facial point extraction method that locates the points of the eyes, nose, mouth, and ears based on the Viola-Jones object detection algorithm. Four key regions of the face are used to extract geometric features in four steps: detect the face, detect the eyes, locate the eye centers to obtain the eye-region height, and estimate the nose and lip regions. In [16], a facial landmark displacement method is applied to extract geometrical information. In [4], affective geometric features are extracted using a warp transformation of facial landmarks to capture their configuration. A 68-point facial landmark layout is used as the geometrical representation of the face in [15].
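As a concrete illustration of distance-based geometric features, the sketch below measures distances between selected landmark pairs. The 68-point convention is assumed and the coordinates are synthetic; the chosen index pairs (outer eye corners, mouth corners, upper/lower lip) follow the common 68-point layout but are only examples:

```python
# Illustrative sketch: geometric features as distances between
# selected facial landmark pairs (coordinates here are made up).
import numpy as np

def pairwise_distances(landmarks, pairs):
    """landmarks: (68, 2) array of (x, y); pairs: index pairs to measure."""
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                     for i, j in pairs])

rng = np.random.default_rng(3)
landmarks = rng.uniform(0, 100, size=(68, 2))
# Example pairs in the common 68-point convention: 36/45 outer eye
# corners, 48/54 mouth corners, 51/57 upper/lower lip.
pairs = [(36, 45), (48, 54), (51, 57)]
features = pairwise_distances(landmarks, pairs)
print(features.shape)  # (3,) -- one distance per pair
```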

Appearance features
Appearance methods such as the scale-invariant feature transform (SIFT), Gabor features, and local phase quantization can detect multi-scale, multi-direction local texture changes on either specific regions or the whole face to encode texture [3], [4], [7], [8], [15]. In [6], a mapped local binary pattern with four neighborhoods is used to describe local texture changes, and the face is divided into six regions (forehead, eyes, nose, mouth, left cheek, and right cheek) using a pseudo-3D model. The paper [7] describes texture features using the angled local directional pattern, which takes the center pixel into account. In [8], SIFT is applied to extract unique and precise informative face features. The paper [3] uses the local binary pattern to extract local texture features from four basic face regions: the two eyes, nose, and mouth. To extract dynamic texture features from video, [4] uses histograms of oriented gradients from three orthogonal planes (HOG-TOP). In [15], a convolutional neural network (CNN) serves as the feature descriptor for visual features extracted from color images. These approaches are time-consuming and produce huge feature dimensions, so dimensionality reduction methods are applied, which in turn affect the accuracy of facial expression recognition.
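The basic LBP descriptor referenced throughout this section can be sketched in a few lines: each pixel is encoded by thresholding its eight neighbours against the centre, and the histogram of the resulting codes describes the local texture. This is a minimal didactic version, not any particular paper's variant:

```python
# Minimal sketch of the basic 8-neighbour LBP texture descriptor.
import numpy as np

def lbp_codes(img):
    """Return the 8-bit LBP code for each interior pixel of a 2-D array."""
    c = img[1:-1, 1:-1]
    # Eight neighbours, clockwise from the top-left.
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        neigh = img[1 + dy : img.shape[0] - 1 + dy,
                    1 + dx : img.shape[1] - 1 + dx]
        codes |= ((neigh >= c).astype(np.uint8) << bit)
    return codes

def lbp_histogram(img):
    """256-bin normalized histogram of LBP codes -- the texture feature."""
    hist = np.bincount(lbp_codes(img).ravel(), minlength=256)
    return hist / hist.sum()

flat = np.full((10, 10), 7, dtype=np.uint8)
# In a constant region every neighbour equals the centre, so all eight
# bits are set and every interior pixel gets code 255.
print(lbp_histogram(flat)[255])  # 1.0
```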

FACIAL DATASETS
Facial expression datasets contain two types of images: posed and spontaneous expressions. Researchers acquire facial images in three ways: peak expression images only, image sequences portraying an emotion from neutral to its peak, and video clips with emotional annotations. The two most widely used datasets are CK+ and JAFFE [24]-[27]. Real-world facial databases include FER-2013, FERG-DB, SFEW 2.0 (static facial expressions in the wild), RAF-DB (real-world affective face database), and AffectNet. Sample images of the basic facial expressions are shown in Table 1 for each dataset.

Extended Cohn-Kanade dataset (CK+)
The CK+ dataset has been widely used for many years in facial expression systems. It comprises 593 image sequences, varying in duration from 10 to 60 frames, collected from 123 subjects. The subjects range in age from 18 to 50 years; 31% are men and 69% are women. The images express seven categories of expression covering the basic emotions: happy, sad, surprise, anger, fear, disgust, and neutral. Each image has a resolution of 640×490 or 640×640 pixels [25].

Japanese female facial expression dataset (JAFFE)
The JAFFE dataset is also widely used in recognizing expressions of human emotion. It consists of 213 images of 10 Japanese females covering seven expressions: the six basic ones (happy, surprise, sad, anger, fear, and disgust) plus neutral. Each image has a resolution of 256×256 pixels [26].

FER 2013 dataset
The FER-2013 dataset contains 28,000 labeled images. It was created in 2013 for learning challenges focused on three tasks: black-box learning, facial expression recognition, and multimodal learning. The images are 48×48-pixel grayscale faces in seven expressions: the six basic expressions and neutral [28].

FERG-DB dataset
FERG-DB stands for the facial expression research group database, which consists of face images of six stylized characters grouped into seven types of expression: the six basic expressions and neutral. The dataset includes 55,767 images [29].

Static facial expression in the wild dataset (SFEW)
The images in SFEW are extracted from the Acted Facial Expressions in the Wild (AFEW) database, a temporal facial expression database that was itself extracted from movies. SFEW contains 700 images labeled with the six basic expressions [15].

Real-world affective face database (RAF-DB)
RAF-DB is a large-scale facial expression database of facial images downloaded from the internet. Each image is annotated with a seven-dimensional expression distribution vector [9].

AffectNet dataset
AffectNet is the largest real-world facial expression database, containing more than 1,000,000 facial images downloaded from the internet by querying 1250 emotion-related keywords in six different languages. The database defines eleven categories of expression: the six basic expressions, neutral, contempt, none, uncertain, and non-face [15].

PROBLEM STATEMENT
FER systems still need to be developed under the problems of illumination, lighting, pose, aging, and occlusion for real-world expression classification. The major challenges of the study include:
− Most research classifies the basic emotions, while work on fine-grained emotion is relatively scarce.
− Research on micro-expression and compound emotion recognition is limited.
− Mathematical models need to be developed to extract more discriminant features from facial images in the wild.
− Real-time facial expression recognition systems should be developed to meet practical applications.
− Deep learning models also need to be created to improve facial feature extraction and classification.

CONCLUSION AND FUTURE WORK
Facial expression recognition is an active research area, made all the more interesting by the problems of occlusion, brightness, viewing angle, pose, and background in real-life images, image sequences, and videos. This review has presented methods of preprocessing, feature extraction, and classification. FER research continues toward real-life applications such as driver drowsiness recognition, distance learning assistance, clinical patient monitoring, teaching robots, and health care systems for autistic children. In the future, FER systems will be developed for fine-grained facial expression recognition and compound emotion recognition from facial images.
Comput. Sci. Inf. Technol. Feature extraction and classification methods of facial expression: a survey (Moe Moe Htay)


Table 1. Sample images of facial image datasets