The original MultiPie [57], Lucey et al. [58], Lyons et al. [59], and Pantic et al. [60] datasets of facial expressions were recorded in laboratory settings, with the participants acting out a variety of facial expressions. This approach yielded clean, high-quality repositories of posed facial expressions. However, posed expressions can differ noticeably from their unposed (or “spontaneous”) counterparts, so recording emotions as they occur became popular among researchers in affective computing. Examples include experiments in which participants’ facial reactions to stimuli are recorded [60,61,62] or emotion-inducing activities are conducted in a laboratory [63]. These datasets often record a sequence of frames that researchers may use to study the temporal and dynamic aspects of expressions, including multi-modal signals such as speech and bodily cues. However, the limited number of individuals, the narrow range of head poses, and the controlled settings in which these datasets were collected all contribute to a lack of variety.
Therefore, it is necessary to develop methods based on natural, unstaged displays of emotion. To meet this need, researchers have increasingly focused on real-world datasets. Table 1 summarizes the features of the reviewed databases across all three affect models: facial action, dimensional, and categorical. In 2017, Mollahosseini et al. [24] created a facial emotion dataset named AffectNet to support the development of emotion recognition systems. It is one of the largest in-the-wild facial emotion datasets annotated for both the categorical and dimensional models of affect. AffectNet gathered over a million photos of faces online by querying three of the most popular search engines with 1250 emotion-related keywords in six languages. Roughly half of the collected photos were manually annotated for the presence of seven distinct facial expressions and for the intensity of valence and arousal. AffectNet is unrivalled as the largest dataset of natural facial expressions, valence, and arousal for research on automated facial expression recognition. The images have an average resolution of 512 × 512 pixels and vary considerably in appearance: the collection contains both full-color and grayscale images that differ in contrast, brightness, and background. Furthermore, the faces are mostly captured frontally, although items such as sunglasses, hats, hair, and hands may occlude them. As a result, the dataset covers a wide variety of real-world situations.
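For illustration only, the following sketch shows how AffectNet-style annotations might be loaded and filtered for both affect models; the file name and column names (image_path, expression, valence, arousal) are assumptions for this example and may not match the official AffectNet annotation files.

```python
import pandas as pd

# Assumed file name and column names; the official AffectNet annotation files may differ.
ANNOTATION_CSV = "affectnet_annotations.csv"
NUM_CATEGORIES = 7  # the seven facial expression categories discussed above

def load_annotations(csv_path=ANNOTATION_CSV):
    """Load image paths, categorical labels, and valence/arousal annotations."""
    df = pd.read_csv(csv_path)
    # Keep rows with a valid categorical label and with valence/arousal
    # inside the nominal continuous range of [-1, 1].
    df = df[df["expression"].between(0, NUM_CATEGORIES - 1)]
    df = df[df["valence"].between(-1.0, 1.0) & df["arousal"].between(-1.0, 1.0)]
    return df[["image_path", "expression", "valence", "arousal"]]

if __name__ == "__main__":
    annotations = load_annotations()
    print(annotations["expression"].value_counts())
```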
The Facial Expression Recognition 2013 (FER-2013) [65] database was first introduced in the ICML 2013 Challenges in Representation Learning [64]. The database was built by querying the Google Image Search API with a collection of 184 emotion-related keywords, which allowed the six basic expressions and the neutral expression to be captured. The photos were downscaled to 48 × 48 pixels and converted to grayscale. The final collection includes 35,887 photos, most of which were taken in natural real-world scenarios. Our previous work [56] used the FER-2013 dataset because it is one of the largest publicly accessible facial expression datasets for real-world situations. However, only 547 of the photos in FER-2013 depict emotions such as disgust, and, because of the low resolution and image quality and the lack of face registration, most facial landmark detectors are unable to extract landmarks from these images. Additionally, FER-2013 provides only the categorical model of emotion.
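To make the data format concrete, the sketch below parses a FER-2013-style CSV, in which each row stores an emotion label and a space-separated string of 48 × 48 = 2304 grayscale pixel values; the file name and column names follow the commonly distributed fer2013.csv layout and are assumptions for this example.

```python
import numpy as np
import pandas as pd

IMG_SIZE = 48  # FER-2013 images are 48 x 48 pixels, grayscale

def load_fer2013(csv_path="fer2013.csv"):
    """Parse a FER-2013-style CSV into image arrays and emotion labels."""
    df = pd.read_csv(csv_path)
    # Each 'pixels' entry is a space-separated string of 2304 intensity values.
    images = np.stack([
        np.array(p.split(), dtype=np.uint8).reshape(IMG_SIZE, IMG_SIZE)
        for p in df["pixels"]
    ])
    labels = df["emotion"].to_numpy()
    return images, labels

if __name__ == "__main__":
    images, labels = load_fer2013()
    print(images.shape, np.bincount(labels))
```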
Mehendale [66] proposed a CNN-based facial emotion recognition approach and modified the original dataset by recategorizing the images into the following five categories: Anger-Disgust, Fear-Surprise, Happiness, Sadness, and Neutral; the Contempt category was removed. The similarity between the Anger-Disgust and Fear-Surprise expressions in the upper part of the face provides sufficient justification for the new categorization. For example, when someone feels angry or disgusted, their eyebrows naturally lower, whereas when they are scared or surprised, their eyebrows rise together. The removal of the Contempt category can be justified because (1) it is not a central emotion in communication and (2) the expressiveness associated with contempt is localized in the mouth area and is thus undetectable when the individual is wearing a face mask. As a result of this merging process, the dataset is more balanced.
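This merging can be expressed as a simple label-remapping table. The sketch below shows one possible implementation, assuming the original labels are available as strings; the exact label names and encoding in the underlying dataset may differ.

```python
# Map original emotion labels to the five merged categories described above;
# Contempt is mapped to None because it is removed from the new scheme.
MERGE_MAP = {
    "Anger":     "Anger-Disgust",
    "Disgust":   "Anger-Disgust",
    "Fear":      "Fear-Surprise",
    "Surprise":  "Fear-Surprise",
    "Happiness": "Happiness",
    "Sadness":   "Sadness",
    "Neutral":   "Neutral",
    "Contempt":  None,  # removed category
}

def remap_labels(labels):
    """Return (sample index, merged label) pairs, discarding Contempt samples."""
    return [(i, MERGE_MAP[label]) for i, label in enumerate(labels)
            if MERGE_MAP.get(label) is not None]

# Example: remap_labels(["Anger", "Contempt", "Surprise"])
# -> [(0, "Anger-Disgust"), (2, "Fear-Surprise")]
```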
In this study, we used the AffectNet [24] dataset to train an emotion recognition model. Since the aim of this study is to determine a person’s emotional state even when a mask covers their face, the second stage was to build an appropriate dataset in which a synthetic mask was attached to each individual’s face. To do this, the MaskTheFace algorithm was used. In short, this method estimates the angle of the face and then applies a mask selected from a collection of mask templates; the mask’s orientation is then fine-tuned using six key features extracted from the face [67]. The characteristics and features of existing facial emotion recognition datasets are summarized in Table 2.
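The listing below is not the MaskTheFace implementation itself but a simplified illustration of the general idea, assuming dlib’s 68-point landmark model is available locally: landmarks along the jawline and the nose bridge approximate the area a surgical mask would cover, and a mask-shaped polygon is drawn over that region.

```python
import cv2
import dlib
import numpy as np

# Assumed local path; the dlib 68-point model must be downloaded separately.
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def apply_synthetic_mask(image_bgr):
    """Draw a simple mask-like polygon over the lower half of each detected face."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    for rect in detector(gray, 1):
        shape = predictor(gray, rect)
        pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
        # Jawline points (0-16) plus the nose bridge point (27) roughly outline
        # the region covered by a surgical mask.
        outline = np.vstack([pts[0:17], pts[27:28]]).astype(np.int32)
        cv2.fillConvexPoly(image_bgr, cv2.convexHull(outline), (255, 255, 255))
    return image_bgr

if __name__ == "__main__":
    img = cv2.imread("face.jpg")  # assumed example input
    cv2.imwrite("face_masked.jpg", apply_synthetic_mask(img))
```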