A novel deep learning model to improve the recognition of students’ facial expressions in online learning environments

Abstract: With the rapid development of artificial intelligence and emerging technologies, automatic recognition of students’ facial expressions has received increased attention. Facial expressions are an external manifestation of emotional states. It is important for teachers to assess students’ emotional states and adjust teaching activities accordingly. However, existing methods for automatic facial expression recognition suffer from low recognition accuracy and poor feature extraction. To address these problems, this study proposed a novel deep learning model called DenseNetX-CBAM to improve facial expression recognition. The model utilizes a variant of densely connected convolutional networks (DenseNet) to reduce unnecessary parameters and strengthen the reuse of expression features between networks; moreover, a convolutional block attention module (CBAM) was integrated to allow the networks to focus on important spatial regions and important channels when representing features. The proposed model was tested using 217 video clips of 33 students in an online course. The results demonstrated promising effects of the method in improving the accuracy of facial expression recognition, which can help teachers to accurately recognize students’ emotions and provide real-time adjustment in online learning environments.


Introduction
Online learning has been widely integrated in educational practice. Compared with traditional face-to-face classroom learning, online learning is more convenient, allowing students to learn without the constraints of time and place. However, there are concerns that online learning may limit direct emotional communication between teachers and students (Alawamleh et al., 2020). In online learning environments, it is difficult for teachers to quickly and accurately assess the emotional states of students and give timely feedback, which may affect online learning experience and outcomes.
Researchers have been investigating human emotions and the relationship between human emotions and facial expressions. Ekman and Friesen (1971) revealed that there are six primary human emotions: joy, sadness, fear, anger, surprise, and disgust. Moreover, they created a facial action coding system (FACS) to classify human expressions based on the correspondence between facial muscle movement units. The classification of the six emotions lays the foundation for the discrete emotion representation models. Mehrabian (2008) found that facial expressions account for 55% of emotional expressions, suggesting that facial expressions play a crucial role in the way learners express their emotions. These studies indicate that facial expressions are an external manifestation of emotional states and that emotional states can be detected through facial expressions.
With the rapid development of artificial intelligence (AI) and emerging technologies, automatic recognition of students' facial expressions to support teaching and learning has received increased attention (Kazemitabar et al., 2019; Tonguç & Ozaydın Ozkara, 2020). In particular, deep learning approaches, a class of machine learning methods based on neural network architectures with multiple layers of processing units that are effective for image recognition, have been utilized for facial expression recognition, mainly through three steps: face detection, feature extraction, and classification (Revina & Emmanuel, 2021; Zou et al., 2018). However, the existing deep learning approaches have some limitations, such as low recognition accuracy and poor feature extraction (Li & Deng, 2022), which hinder their application in supporting online teaching and learning.
This study aimed to address the limitation of existing approaches by proposing a new model for facial expression recognition, called DenseNetX-CBAM. It is a new deep learning model designed by improving the structure of densely connected convolutional networks (DenseNet) (Huang et al., 2018) and integrating the convolutional block attention module (CBAM) (Woo et al., 2018). Experiments were conducted to evaluate the performance of the proposed model in recognizing students' facial expressions in an online learning environment.

Online learning
Online learning refers to education that occurs through the Internet, either synchronously or asynchronously, instead of in a physical classroom environment. Dhawan (2020) defines online learning as "a learning experience in a synchronous or asynchronous environment using different electronic devices with access to the Internet". Online learning can improve the efficiency of learning by allowing for flexibility in the time and place of learning and teaching (Bender & Vredevoogd, 2006; Wong et al., 2019; Wong, 2023). Moreover, online learning with computer-based support can improve learning performance by fostering self-regulated learning (Chiu, 2022; Wei & Chou, 2020) and higher-order thinking (Cárdenas-Robledo & Peña-Ayala, 2019), which can be further improved by harnessing the potential of generative artificial intelligence (Zhu et al., 2023). However, the lack of face-to-face communication with teachers and peers and the lack of sufficient and timely feedback from teachers are drawbacks of online learning (Mukhtar et al., 2020; Tang et al., 2023).

Facial expression recognition technologies
To support interaction and communication in online learning environments, it is important to assess students' emotional states and give timely feedback. In this context, automatic recognition of students' facial expressions to detect their emotions has received increased attention (Lasri et al., 2023; Ashwin & Guddeti, 2020). Traditional methods for facial expression recognition involve local binary patterns (Ojala et al., 2002), histogram of oriented gradients (Dalal & Triggs, 2005) and scale-invariant feature transform (Lindeberg, 2012). In general, these methods are constrained by hand-crafted rules, which makes it difficult for them to extract the deeper features of facial expressions.
With remarkable advances in information technologies, researchers have been exploring new methods, in particular deep learning models, for recognizing facial expressions. For instance, Krizhevsky et al. (2012) designed AlexNet, and Nayak and Sarvaiya (2022) more recently improved AlexNet for facial expression recognition by introducing multi-scale convolution and batch normalization. Meanwhile, lightweight convolutional neural networks combined with attention modules have been employed to solve the problem of noise interference in non-face regions and avoid model overfitting (Shao et al., 2018). Furthermore, Park et al. (2017) applied a pruning algorithm and global maximum pooling to the GoogLeNet model to retain the face location information, which greatly improved the operation speed and accuracy. The development of residual networks solved the problem of vanishing gradients in neural networks and improved the feature extraction capability by adding identity mapping (He et al., 2016). Moreover, the generative adversarial network (Goodfellow et al., 2020) played a pivotal role in partially or entirely generating facial images that maintain context consistency and in discriminating images.
In addition to optimizing foundation models and networks, loss functions have also been reformed. Sun et al. (2014) used a contrastive loss function for expression recognition. Schroff et al. (2015) utilized the triplet loss function to train expression recognition networks. The triplet loss has the benefit of training on smaller samples with lower variance, resulting in better performance in expression recognition. Wen et al. (2016) designed a center loss function that focuses on intra-class distribution uniformity to minimize intra-class variance. What is more, Cai et al. (2018) introduced the island loss function, which narrows the intra-class variation while enlarging the inter-class variation to strengthen the discriminating power of deep features. Wang et al. (2018) employed a variation of the Softmax loss function by adding angular spacing and a cosine residual term. By maximizing the decision margins of the learned features in the feature space, this approach led to a significant decrease in intra-class distance and larger class spacing. More recently, the adaptive correlation-based loss function was developed to generate embedded feature vectors with high correlation for the within-class samples (Fard & Mahoor, 2022). However, existing deep learning methods still perform poorly because they are easily disturbed by the loss of information during feature propagation and by irrelevant features during feature extraction.

Research questions
This study aimed to address the limitation of existing approaches to automatic facial expression recognition by creating a new deep learning model to improve the recognition of facial expressions of students in online learning environments. To foster effective interaction and communication in online learning environments, it is important to provide AI-based facilities that help teachers to detect students' emotional states and give timely feedback. This study aimed to answer the following research questions:
RQ1: How to design a new deep learning model to improve the recognition of students' facial expressions in an online environment?
RQ2: Does the designed model outperform other state-of-the-art models in recognizing students' facial expressions in an online environment?

Learning environment
This study designed a new model to improve the recognition of facial expressions. The performance of the model was evaluated by using the dataset containing video clips of facial expressions of students in an online session as the input to the model. During the online session, students were asked to work individually to complete five programming tasks using the Java language. The user interface used by students was presented in the form of a simple program editor, allowing students to compile and execute the programs they wrote. Students could also run unit tests to verify the correctness of their solutions. In order to induce students to produce different facial expressions, the online learning system periodically generated interference behaviors, such as automatically adding or deleting some characters during the tasks.

Dataset
The dataset used in this research was drawn from the DevEmo dataset provided by Gdansk University of Technology (Manikowska et al., 2023). The dataset contains video clips of facial expressions of students in the online session. The participants were 212 students majoring in computer science. The video clips of 33 students completing all five tasks were selected. The selected dataset contained 217 video clips, each of which was annotated and labeled as one of five primary emotional expressions: happiness, surprise, anger, confusion, and neutral; no videos were labeled as fear or disgust. A summary of the dataset is presented in Table 1.

Data pre-processing
The average duration of each video clip in the dataset is 3 seconds, and each second of a video clip contains 30 frames. This study utilized OpenCV to segment each video into images, with a segmentation interval of 15 frames, forming an image dataset. In order to reduce the interference of other parts of the human body on the recognition results, face detection and cropping were performed on the images in the dataset. In addition, all images were annotated with facial expression labels. To enlarge the dataset, we applied data augmentation operations, namely changing the brightness of images and rotating images by 45°. Data normalization was also applied to the image dataset. Moreover, all images were shuffled, and the image dataset was split into the training set, validation set and test set according to the ratio of 6:2:2. Additionally, we resized all images to a uniform size of 128×128.
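The sampling and splitting arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function names are hypothetical, it assumes sampling starts at frame 0, and it works on frame indices rather than decoded video, so OpenCV is not needed to run it.

```python
import random

def sampled_frame_indices(duration_s, fps, interval):
    """Indices of the frames kept when taking one frame every `interval` frames."""
    total_frames = int(duration_s * fps)
    return list(range(0, total_frames, interval))  # assumption: start at frame 0

def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Shuffle items, then split them into training, validation and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test

# A 3-second clip at 30 frames per second, sampled every 15 frames.
indices = sampled_frame_indices(duration_s=3, fps=30, interval=15)
print(indices)  # [0, 15, 30, 45, 60, 75]

# A 6:2:2 split of 100 placeholder image identifiers.
train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Under these assumptions, an average 3-second clip yields six images before augmentation.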

Proposed model for facial expression recognition
This study proposes a deep learning approach that improves DenseNet and integrates a hybrid attention mechanism to automatically recognize the facial expressions of students. DenseNet allows features to be reused to obtain better performance. What is more, the attention mechanism can reduce the interference of non-key information, allowing the network to focus on valuable information. Therefore, this study modified the structure of DenseNet and added the convolutional block attention module (CBAM) before the global average pooling layer. CBAM is a lightweight attention module that can improve the feature extraction of our model.

DenseNet
DenseNet is a type of convolutional neural network (Huang et al., 2018). It creates dense connections between layers through dense blocks to enhance feature propagation. This alleviates the vanishing gradient problem that arises when networks become too deep. The initial DenseNet mainly consists of four dense blocks and three transition layers.
DenseNet differs from other general convolutional neural networks. The dense block of DenseNet allows the model to no longer rely on the feature vector output from the last layer as the sole basis for classification. Furthermore, each layer of the dense block utilizes dense connections between layers to receive additional input from all preceding layers. As shown in Fig. 1, X0 is the input feature vector of convolution block 1 (conv block 1). X0 is transformed to the output feature vector X1 through H, a nonlinear transformation function. H includes batch normalization (BN), rectified linear unit (ReLU) and convolution. Therefore, the final output feature vector of conv block 1 is [X0, X1] due to the structure of the dense block. The output feature vector of the nth convolution block in the dense block can be expressed as:
$$X_n = H_n([X_0, X_1, \ldots, X_{n-1}]) \tag{1}$$
where [X_0, X_1, ..., X_{n-1}] denotes the concatenation of the feature vectors produced by all preceding blocks. The structure with dense connections is beneficial for feature propagation, which is conducive to improving feature extraction. It also reduces the number of parameters in the model because the input of each layer contains feature information of the preceding layers. Therefore, only a few feature maps need to be extracted in each layer.
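The dense connectivity described above can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions, not the authors' implementation: H is reduced to ReLU followed by a 1×1 convolution (batch normalization is omitted), the weights are random, and the sizes (a 32×32 map with 64 channels, three blocks, 16 new channels per block) are chosen only for illustration.

```python
import numpy as np

def conv_block(features, weight):
    """Simplified H: ReLU followed by a 1x1 convolution (BN omitted for brevity)."""
    return np.maximum(features, 0.0) @ weight

def dense_block(x0, num_blocks, growth_rate, rng):
    """Each conv block receives the concatenation of all preceding feature maps."""
    features = [x0]
    for _ in range(num_blocks):
        concatenated = np.concatenate(features, axis=-1)      # [X0, X1, ..., X_{n-1}]
        weight = rng.standard_normal((concatenated.shape[-1], growth_rate)) * 0.1
        features.append(conv_block(concatenated, weight))     # X_n
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 64))        # height x width x channels
out = dense_block(x0, num_blocks=3, growth_rate=16, rng=rng)
print(out.shape)  # (32, 32, 112)
```

Each block contributes only 16 new channels, yet the block output keeps the input and every intermediate feature map, which is the feature reuse the text refers to.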

Fig. 1. Structure of DenseNet
The transition layer is mainly made up of one 1×1 convolution layer and one 2×2 average pooling layer. It connects the adjacent dense blocks. Moreover, it reduces the number of channels to control the complexity of the model.
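A transition layer of this kind can be sketched with NumPy as follows. The sketch assumes a pooling stride of 2 and uses random weights; the input size and the halving of the channel count are illustrative assumptions, and the function name is hypothetical.

```python
import numpy as np

def transition_layer(features, weight):
    """1x1 convolution to reduce channels, then 2x2 average pooling with stride 2."""
    reduced = features @ weight                        # (H, W, C_out)
    h, w, c = reduced.shape
    blocks = reduced.reshape(h // 2, 2, w // 2, 2, c)  # group pixels into 2x2 windows
    return blocks.mean(axis=(1, 3))                    # (H/2, W/2, C_out)

rng = np.random.default_rng(0)
features = rng.standard_normal((32, 32, 112))
weight = rng.standard_normal((112, 56)) * 0.1          # assumption: channels are halved
pooled = transition_layer(features, weight)
print(pooled.shape)  # (16, 16, 56)
```

Both the spatial size and the channel count shrink, which is how the layer controls model complexity between dense blocks.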

Attention mechanism
The attention mechanism is a technique in neural networks that mimics cognitive attention. Attention modules emphasize crucial target regions in the visual range and disregard irrelevant information within images. The attention mechanism improves the structure of neural networks and is widely applied in deep learning tasks. When processing images with neural networks, it is vital to allow networks to attend to the essential target regions adaptively, and the attention module is one approach to achieving this.
Attention modules can be divided into spatial attention modules (SAM) (Almahairi et al., 2016), channel attention modules (CAM), and hybrid attention modules. Firstly, SAM aims at finding the spatial information in the original image, transforming it into another space and retaining the key information, weighting the output for each location, focusing on the specific target region, and improving the feature representation of the target region. Secondly, CAM can assign higher weight values to important channels and suppress useless channels. Thirdly, the hybrid attention module is a combination of SAM and CAM (see Fig. 2), and CBAM adopted in this study is a representative module of hybrid attention modules.

Fig. 2. Structure of CBAM
CAM typically employs weights to give more emphasis to the relevant area, with higher weights indicating a greater degree of attention. In image recognition tasks, it is difficult to avoid the effects of image rotation, distortion, and scale changes, but SAM can preserve important information in images and reduce the impact of such transformations. By mixing cross-channel and spatial information for feature extraction, CBAM has the advantage of more accurate feature extraction.
In the proposed model, the number of convolutional blocks in the four dense blocks is set to 3, 6, 12 and 8, respectively. Convolutional blocks extract features by using 1×1 and 3×3 convolutional kernels with a growth rate of 16. In the transition layer, the compression rate of parameters related to channel downscaling is set to 0.5. The size of the feature map changes through three transition layers from 32×32 to 4×4.
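These settings can be checked with a short shape trace. The sketch below assumes that the dense blocks start from the 32×32 feature map with 64 channels given in the model description, that every convolutional block adds a number of channels equal to the growth rate, and that every transition layer halves the spatial size; the function name is hypothetical.

```python
def trace_densenetx(size=32, channels=64, blocks=(3, 6, 12, 8),
                    growth_rate=16, compression=0.5):
    """Return (stage name, feature-map size, channel count) after each stage."""
    trace = []
    for index, num_blocks in enumerate(blocks, start=1):
        channels += num_blocks * growth_rate        # each conv block adds growth_rate channels
        trace.append((f"dense block {index}", size, channels))
        if index < len(blocks):                     # no transition after the last dense block
            channels = int(channels * compression)  # 1x1 convolution compresses the channels
            size //= 2                              # 2x2 average pooling halves the size
            trace.append((f"transition {index}", size, channels))
    return trace

for name, size, channels in trace_densenetx():
    print(f"{name}: {size}x{size}x{channels}")
```

Under these assumptions the trace ends at a 4×4 map with 262 channels, which agrees with the 4×4×262 feature map that the text reports as the output of dense block 4.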

Integrating improved DenseNet with CBAM
In this study, the CBAM module is added after dense block 4, where it strengthens the feature learning of important channels and spatial regions using CAM and SAM to improve the accuracy of the model. First, in the CAM section of CBAM, the 4×4×262 facial expression feature map output by dense block 4 undergoes maximum pooling and average pooling operations over its height and width to obtain two 1×1×262 facial expression feature maps. The two feature maps are fed into the two-layer neural network to obtain two 1×1×262 facial expression feature vectors. The channel attention weight is obtained by adding these two feature vectors, and the input feature map of SAM is generated by multiplying the facial expression feature map with the channel attention weight. Second, in the SAM section of CBAM, the maximum pooling and average pooling operations are performed on the input feature map along the channel dimension to get two 4×4×1 facial expression feature maps. The two feature maps are concatenated along the channel dimension, and the result is reduced to a single channel by applying a 3×3 convolution kernel. After activation with the sigmoid function, the spatial attention weight is acquired. Finally, the optimized features of facial expressions are obtained by multiplying the facial expression feature map with the spatial attention weight. Therefore, by using the CBAM, attention maps can be generated for the extracted feature information in the channel dimension and spatial dimension in turn. This is conducive to enhancing the channel and spatial dimension information of locally important features, thus accelerating the convergence of the model and increasing the expression recognition rate.
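The two attention steps can be sketched with NumPy on a feature map of the stated size. This is an illustrative sketch and not the authors' code. The following are assumptions: the weights are random, the hidden layer of the two-layer network has 16 units with a ReLU activation, a sigmoid is applied to the summed channel vectors as in the original CBAM formulation (Woo et al., 2018), and the 3×3 convolution uses zero padding so that the 4×4 size is preserved.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Channel weights of shape (C,) from max- and average-pooled descriptors."""
    max_vec = feat.max(axis=(0, 1))            # pool over height and width -> (C,)
    avg_vec = feat.mean(axis=(0, 1))           # (C,)

    def mlp(v):                                # the same two-layer network for both vectors
        return np.maximum(v @ w1, 0.0) @ w2

    return sigmoid(mlp(max_vec) + mlp(avg_vec))

def spatial_attention(feat, kernel):
    """Spatial weights of shape (H, W) from channel-wise max and average maps."""
    stacked = np.stack([feat.max(axis=2), feat.mean(axis=2)], axis=2)  # (H, W, 2)
    padded = np.pad(stacked, ((1, 1), (1, 1), (0, 0)))                 # zero padding
    h, w = feat.shape[:2]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3, :] * kernel)   # 3x3 convolution
    return sigmoid(out)

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((4, 4, 262))     # output size of dense block 4
w1 = rng.standard_normal((262, 16)) * 0.1          # assumption: 16 hidden units
w2 = rng.standard_normal((16, 262)) * 0.1
kernel = rng.standard_normal((3, 3, 2)) * 0.1

channel_weights = channel_attention(feature_map, w1, w2)
channel_refined = feature_map * channel_weights     # input feature map of SAM
spatial_weights = spatial_attention(channel_refined, kernel)
refined = channel_refined * spatial_weights[:, :, None]
print(channel_weights.shape, spatial_weights.shape, refined.shape)
```

The refined map keeps the 4×4×262 shape of its input, so the module can be placed before the global average pooling layer without changing the layers that follow.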

Experimental setting
In this study, a computer with an Intel i7-9700 CPU, 32GB RAM and a GeForce 2080 GPU was utilized for model training, validation and testing. The model was trained for 50 epochs, using the adaptive moment estimation (Adam) optimizer with an initial learning rate of 0.0001 and a batch size of 8. Moreover, cross-entropy was employed as the loss function.
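For reference, the cross-entropy loss for a single sample can be computed as in the sketch below. It assumes that the network outputs one raw score (logit) per class and that a softmax is applied inside the loss, which is the usual convention but is not stated in the text; the function name and the example scores are hypothetical.

```python
import math

def cross_entropy(logits, true_class):
    """Cross-entropy loss for one sample, computed from raw class scores."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[true_class]  # equals -log(softmax(logits)[true_class])

# Five classes, matching the five expression labels in the dataset.
uniform = cross_entropy([0.0] * 5, true_class=2)
confident = cross_entropy([0.0, 0.0, 5.0, 0.0, 0.0], true_class=2)
print(uniform)    # ln(5), about 1.609
print(confident)  # smaller, because the true class has the highest score
```

An untrained five-class model that predicts uniformly starts near ln(5), which gives a reference point when reading training curves.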

Performance metrics for model evaluation
The performance of the proposed model was tested in terms of accuracy, macro F1-score, parameters, and confusion matrix.

Accuracy
Accuracy is the main performance metric used to evaluate classification models. Higher accuracy indicates better performance of the model. Accuracy is defined as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of true positive predictions; TN is the number of true negative predictions; FP is the number of false positive predictions; and FN is the number of false negative predictions.

Macro F1-score
The F1-score is an evaluation index for deep learning models. It assesses the precision and recall of the model at the same time and can be regarded as the harmonic mean of the two. For multi-class classification tasks, the macro F1-score is often used to measure the performance of the model. Let N be the number of categories, and let F1_i = 2 P_i R_i / (P_i + R_i) be the F1-score of category i, where P_i and R_i are the precision and recall for that category. The macro F1-score is the mean of the F1-scores for all categories, and it is calculated as follows:
$$\text{Macro F1-score} = \frac{1}{N} \sum_{i=1}^{N} F1_i \tag{4}$$
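A minimal pure-Python sketch of the macro F1-score follows. The labels and predictions are made-up toy data, not results from the study, and the function name is hypothetical; a class with no predictions or no true samples is given an F1-score of 0 by assumption.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

labels = ["happiness", "surprise", "anger"]
y_true = ["happiness", "happiness", "surprise", "anger", "anger", "anger"]
y_pred = ["happiness", "surprise", "surprise", "anger", "anger", "happiness"]
score = macro_f1(y_true, y_pred, labels)
print(round(score, 4))  # 0.6556
```

Because every class contributes equally to the mean, a rarely occurring expression affects the score as much as a frequent one.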

Parameters
In deep learning models, parameters represent the total number of parameters contained in the model. It is an important index to measure deep models, corresponding to the consumption of computer memory resources. The calculation formulas for the parameters of the convolutional layer and of the fully connected layer of the model are as follows:
$$\text{Params}_{\text{conv}} = (C_i \times K_h \times K_w + 1) \times C_o$$
$$\text{Params}_{\text{fc}} = (N_i + 1) \times N_o$$
where C_i is the number of input channels, K_h is the height of the convolution kernel, K_w is the width of the convolution kernel, "+1" accounts for the bias, C_o is the number of output channels, N_i is the number of input nodes, and N_o is the number of output nodes.
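The two parameter counts can be evaluated directly. In the sketch below the function names are hypothetical and the layer sizes are illustrative assumptions rather than layers confirmed by the text.

```python
def conv_params(c_in, k_h, k_w, c_out):
    """Weights plus one bias per output channel of a convolutional layer."""
    return (c_in * k_h * k_w + 1) * c_out

def fc_params(n_in, n_out):
    """Weights plus one bias per output node of a fully connected layer."""
    return (n_in + 1) * n_out

# A 3x3 convolution from 64 to 16 channels, and a classifier from 262 features to 5 classes.
print(conv_params(64, 3, 3, 16))  # 9232
print(fc_params(262, 5))          # 1315
```

Summing these counts over all layers gives the total that is reported as the parameter metric.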

Confusion matrix
To better evaluate the performance of the model proposed in this study and observe the change in performance, the confusion matrix is used to compare and analyze the recognition rate of each type of expression. The horizontal coordinates of the confusion matrix represent the predicted labels, the vertical coordinates represent the true labels, and the value at each intersection represents the proportion of samples with that true label that received that predicted label. Therefore, the values on the main diagonal of the confusion matrix represent the accuracy rate of the corresponding expressions, and the off-diagonal values represent how often an expression is confused with other expressions. Furthermore, shades of blue are used to represent the accuracy rate, with a darker color indicating higher accuracy.
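A row-normalised confusion matrix of this kind can be built as in the following sketch, with true labels as rows and predicted labels as columns. The example labels and predictions are made-up toy data, and the function name is hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Row-normalised confusion matrix: rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

labels = ["anger", "confusion"]
y_true = ["anger"] * 4 + ["confusion"] * 4
y_pred = ["anger", "anger", "anger", "confusion",
          "confusion", "confusion", "anger", "confusion"]
matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)  # [[0.75, 0.25], [0.25, 0.75]]
```

Each row sums to 1, so the diagonal entry is the recognition rate of that expression and the remaining entries show which expressions it is mistaken for.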

Comparison of different models
The proposed DenseNetX-CBAM model was compared with five state-of-the-art deep learning models: GoogLeNet, MobileNetv2, VGG16, ResNet50, and EfficientNet. The performance of all six models was tested using the aforementioned dataset as the input to the models. Table 2 presents the performance of all six models in terms of accuracy, macro F1-score, and parameters.

Ablation study results
An ablation study was conducted on the same dataset to investigate how integrating various attention mechanism modules into the models affects the performance of facial expression recognition. Six model variants were implemented, ranging from the baseline DenseNet to the proposed DenseNetX-CBAM. DenseNetX was obtained by progressively optimizing DenseNet, with each revision performing better than the previous one. Therefore, we compared the performance of the optimized model with that of the initial model.

Discussion
The findings of the study are discussed as follows.

How to design a new deep learning model to improve the recognition of students' facial expressions in an online environment?
This study proposed a novel deep learning model called DenseNetX-CBAM to improve the recognition of students' facial expressions in an online learning environment. The model was designed by combining the capabilities of DenseNet and CBAM. Firstly, the structure of DenseNet was utilized and improved to reduce the number of unnecessary parameters and strengthen the reuse of expression features between networks. This offers an improvement over traditional machine learning methods such as KNN and SVM (Patil & Patil, 2022). Secondly, CBAM was integrated into the model, which can help the networks to focus on effective information by assigning different weights to the facial expression features of different channels. CBAM is a hybrid attention module that combines SAM and CAM. SAM can highlight important spatial regions while suppressing less informative regions; CAM can help the model to focus on the most important channels to enhance feature representations. In addition, CBAM is a lightweight module with a simple structure, which makes it a convenient and efficient modification of deep learning models.

Does the designed model outperform other state-of-the-art models in recognizing students' facial expressions in an online environment?
By comparing the performance of DenseNetX-CBAM with other state-of-the-art models, this study reveals that DenseNetX-CBAM achieved the best performance in recognizing students' facial expressions in an online learning environment. DenseNetX-CBAM outperformed the other models in terms of recognition accuracy, macro F1-score (which reflects the precision and recall of the model), and the number of parameters. The number of parameters of DenseNetX-CBAM was far smaller than that of the other models, indicating that the proposed model can effectively save computing resources.
To further assess the model, we investigated the impact of model components on model performance through the ablation study. The results show that DenseNetX integrated with CBAM (the combination of CAM and SAM) performs better than DenseNetX integrated with only CAM or only SAM. Moreover, the confusion matrices show that DenseNetX-CBAM recognizes the five facial expressions in the dataset more accurately than DenseNet-CBAM. For emotions that are relatively difficult to recognize, such as anger, the recognition accuracy increased from 0.72 to 0.79. For easily recognizable expressions, the recognition accuracy increased from 0.86 to 0.91 for happiness and from 0.83 to 0.87 for confusion. These findings suggest that DenseNetX-CBAM outperforms the other models evaluated in this study.

Limitations and future work
This study has several limitations. First, a dataset of facial expressions of students learning Java programming in an online learning environment may not fully represent the facial expressions of students in other online courses. Second, the recognition rate of facial expression recognition models may decrease in complex environments, such as those with varying levels of lighting. Such environments can limit the efficacy of the expression recognition system. Third, the dataset of video clips used in this study contained five primary emotional expressions of students learning in an online learning environment (happiness, surprise, anger, confusion, and neutral) and did not include other emotional expressions (e.g., disgust and fear). More empirical studies are needed to investigate the possibility of including other expressions in the analysis.

Conclusion
To foster effective interaction and communication in online learning environments, it is important to assess students' emotional states and give timely feedback, which can be supported by AI technology. This study proposed a novel deep learning model to improve recognition of students' facial expressions in an online learning environment. The proposed model utilized a variant of densely connected convolutional networks (DenseNet) to reduce the number of unnecessary parameters and strengthen the reuse of expression features between networks. Meanwhile, a convolutional block attention module (CBAM) was integrated into the model, which can help the networks to focus on effective information by assigning different weights to the facial expression features of different channels. By testing with 217 video clips of 33 students in an online course, the proposed model has shown promising effects in improving the accuracy of facial expression recognition. The proposed approach has a high potential to help teachers to accurately recognize students' emotional states and provide real-time adjustment in online teaching and learning environments.
In this study, a novel facial expression recognition model DenseNetX-CBAM was developed by improving the DenseNet and introducing the CBAM to neural network structure. The whole structure of the proposed model is shown in Fig. 3. In the input section of the model, the input is an RGB three-channel facial expression image with a pixel size of 128×128. The feature extraction network uses 7×7 and 3×3 convolution kernels for initial feature extraction and dimensionality reduction of the input image to obtain a 32×32 feature map of sixty-four channels, and four dense blocks and three transition layers are alternately connected to further optimize the extraction of information and improve the reusability of features.

• GoogLeNet (Szegedy et al., 2014): GoogLeNet has the inception module to aggregate visual information at different sizes for facilitating feature extraction. It shows promising performance in image classification.
• MobileNetv2 (Sandler et al., 2018): MobileNetv2 is a mobile deep learning model. It is based on the inverted residual structure, and the intermediate expansion layer of the network filters the features by using nonlinear lightweight depth convolutions.
• VGG16 (Simonyan & Zisserman, 2015): VGG16 is a deep learning model widely used in large-scale image classification and recognition tasks.
• ResNet50 (He et al., 2016): ResNet50 uses the residual network structure to effectively alleviate the problem of model degradation, thereby achieving a deeper network structure design.
• EfficientNet (Tan & Le, 2019): EfficientNet uses a series of fixed scaling factors to uniformly scale the network dimensions, demonstrating excellent accuracy and efficiency.

Fig. 4 presents the performance of these models in terms of confusion matrix. The results show that DenseNetX-CBAM outperforms the other five state-of-the-art models in all criteria.
1) Baseline: This model uses the initial DenseNet framework without any improvements or the addition of attention modules.
2) DenseNetX: This model improves upon the DenseNet framework but does not include any attention modules.
3) DenseNet-CBAM: This model adds both SAM and CAM to the baseline model.
4) DenseNetX-SAM: This model adds SAM to the DenseNetX framework without CAM.
5) DenseNetX-CAM: This model adds CAM to the improved DenseNetX framework.
6) DenseNetX-CBAM: This model adds both SAM and CAM to the DenseNetX framework.

Fig. 4. Confusion matrix of the six models
Table 3 reports the ablation study results in terms of accuracy, macro F1-score and parameters of integrating different attention mechanism modules based on DenseNet and DenseNetX. The results show that the proposed DenseNetX-CBAM model, based on the combination of DenseNetX and CBAM, is more effective for increasing the model recognition rate.
Fig. 5 presents the results of the DenseNet-CBAM model and DenseNetX-CBAM model for recognizing each class of facial expressions in the dataset.Confusion matrices suggest that the combination of DenseNetX and CBAM is more effective in improving the performance of facial expression recognition.

Table 1
Dataset content description

Table 2
Accuracy, macro F1-score, and parameters of the six models

Table 3
Ablation study