An LSTM-based prediction model for gradient-descending optimization in virtual learning environments

ABSTRACT


INTRODUCTION
The design and development of virtual learning environments (VLE) and learning management systems (LMS), as well as other online learning platforms, have rapidly improved, eliminating the constraints of time and place while lowering costs and facilitating access to education. Evaluating and analyzing the student data generated by online learning platforms can help instructors understand and monitor students' learning progress [1]. The earlier student performance is detected in a VLE, the better an instructor can warn students and keep them on the right track. It is therefore challenging to create a predictive model that can precisely identify students' in-course learning behaviors from behavioral data.
In previous research, machine learning (ML) techniques have been used extensively to develop predictive models of student learning behavior in VLEs [2]-[6]. However, these techniques have limitations, for example in the features selected and the ML models used [4]-[8]. Advances in deep learning allow prediction models to perform more accurately [9]-[13], particularly in online learning environments where large amounts of data are produced every day. Long short-term memory (LSTM) is one of the best deep learning algorithms for handling time series data [14], [15].
The LSTM architecture is an enhanced recurrent neural network (RNN) that handles long-term dependencies in sequential time series data well [16]. LSTMs expose many hyperparameters, including the learning rate, the number of hidden units, the input length, and the batch size [17], [18]. Hyperparameters are parameters defined specifically to regulate how the model learns [19], and they significantly affect the model's output [20]. Determining the right combination of model and hyperparameters is often a challenge, and hyperparameter selection and optimization frequently determine model accuracy. We therefore investigate how hyperparameters affect the LSTM. To fine-tune the hyperparameters, we used the adaptive moment estimation (Adam) and Nesterov-accelerated adaptive moment estimation (Nadam) optimization algorithms, two of the most effective gradient descent optimization algorithms [21], [22].
The following reviews prior studies that applied the LSTM algorithm to forecasting in online learning. The attention-based multi-layer (AML) LSTM, which combines clickstream data and student demographic data for thorough analysis, was proposed in [23] as a method for predicting student performance. The results show that, from week 5 to week 25, the proposed model improves accuracy on the four-class classification task by 0.52% to 0.85%. According to Alsabhan [24], the LSTM model predicts withdrawal in a VLE more accurately than both logistic regression and neural networks. For detecting student cheating in higher education, an LSTM with dropout layers, dense layers, and the Adam optimizer [25] achieved 90% accuracy, better than ML algorithms.
The LSTM model was improved in [26] for predicting student performance using the Adam and root mean square propagation (RMSprop) algorithms; the LSTM model with the Adam algorithm performed better than with RMSprop. According to Bock and Weiß [27], Adam and Nadam outperformed adaptive learning rate delta (AdaDelta), adaptive gradient descent (AdaGrad), and RMSprop in setting optimization parameters, as determined by the perceptual loss function and visual perception. In this study, the Adam and Nadam optimization algorithms were used to test the LSTM model in order to determine its optimal performance.
We propose an LSTM model, improved with Adam and Nadam, for predicting student learning outcomes in a VLE. Each model is tested with the Adam and Nadam optimization algorithms, and the accuracy, recall, precision, and F1-score of each model are assessed to compare the outcomes. Adam is a stochastic gradient descent technique based on the adaptive estimation of first- and second-order moments [28]. The method is highly effective for complex problems involving a large number of variables or a large amount of data. Adam fuses the gradient descent with momentum algorithm and the RMSprop algorithm, building on the strengths of both to produce a more optimized gradient descent.
The Nadam algorithm is a sophisticated gradient descent optimization method that raises the quality and convergence rate of neural networks [29]. Nadam alters the momentum component of Adam while maintaining an adaptive learning rate; it is a pure amalgamation of Adam and Nesterov's accelerated gradient (NAG). Nadam converges faster and outperforms NAG and Adam on some types of data sets. Our research uses two hyperparameter optimization algorithms, Adam and Nadam. The hyperparameters we use to construct the LSTM model are the learning rate, the number of hidden units, the input length, the batch size, and dropout. This paper aims to address the following questions: i) RQ1: how do the hyperparameter optimization techniques for the LSTM compare with each other? and ii) RQ2: which LSTM model is the most effective after assessing how well each optimization method worked?

METHOD
Figure 1 shows the research methodology used to compare LSTM models with gradient descent optimization methods in order to forecast student performance in a VLE. The initial stages of the research are data gathering, data comprehension, and data processing [30]. Afterward, the data is prepared for the LSTM models and separated into training, validation, and testing sets.
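As a minimal sketch, the 60/20/20 train/validation/test split used in this study could be implemented as a simple sequential slice (the source does not state whether records were shuffled first, so this chronological version is one plausible reading; the `rows` list is a stand-in for the real preprocessed records):

```python
def split_sequence(records, train=0.6, val=0.2):
    """Split ordered records 60/20/20 without shuffling, so the
    validation and test sets follow the training portion."""
    n = len(records)
    n_train = int(n * train)
    n_val = int(n * (train + val))
    return records[:n_train], records[n_train:n_val], records[n_val:]

rows = list(range(100))                   # stand-in for preprocessed records
train, val, test = split_sequence(rows)
print(len(train), len(val), len(test))    # prints 60 20 20
```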

Datasets
This study makes use of the Open University VLE dataset. The open university learning analytics dataset (OULAD) includes the demographic information, login patterns, and assessment behavior of 32,593 students over the course of nine months. It consists of seven modules, or courses, each of which is taught at least twice a year at different times. The student performances are broken down into four groups: 9% received distinctions, 38% passed, 22% failed, and 31% discontinued their studies. The raw data set consists of files containing student demographics, clickstream data that shows how students interact with the online environment, assessments, quiz results, and module information. The dataset thus covers both students and courses across the seven courses. Our study focused on data from the course with code BBB, a social sciences course with 7,909 enrolled students, the highest enrollment of any subject in the dataset.

Preparation of data
Data preparation is the collection, combination, cleaning, and transformation of raw data for ML projects in order to make accurate predictions. The dataset is preprocessed to select the BBB course features that will be used to train and test the model. The selected features are the module code, presentation code, student ID, clicks, assignment assessment, average assignment assessment, and final results.
After preprocessing, the BBB course data contains 1,565,580 rows. The BBB course has two presentation (semester) codes: "B" begins in February, while "J" begins in October. The presentation codes used in the BBB course are shown in Figure 2. The data for the BBB course is divided 60% for training, 20% for validation, and 20% for testing.

LSTM is one of the RNN variants [14]. LSTM fills the gap left by RNNs' inability to make predictions based on information learned and stored long before. The fundamental distinction between the LSTM and RNN architectures is that the hidden layer of the LSTM is a gated unit or gated cell [15]. It is made up of four layers that interact to produce both the cell state and the output of the cell; these two items are then passed to the next hidden layer. In contrast to RNNs, which have only a single tanh layer, LSTMs have three logistic sigmoid gates and one tanh layer.
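To make the gated-cell description concrete, here is a minimal NumPy sketch of one LSTM time step with the three sigmoid gates (forget, input, output) and one tanh candidate layer; the random weights and the dimensions are illustrative only, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: three sigmoid gates and one tanh candidate
    layer, producing the new cell state and the cell's output."""
    z = W @ x + U @ h_prev + b             # stacked pre-activations, shape (4H,)
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])                    # forget gate
    i = sigmoid(z[H:2*H])                  # input gate
    o = sigmoid(z[2*H:3*H])                # output gate
    g = np.tanh(z[3*H:4*H])                # tanh candidate values
    c = f * c_prev + i * g                 # new cell state
    h = o * np.tanh(c)                     # new hidden state / output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 16                     # illustrative: 3 features, 16 hidden nodes
W = rng.normal(size=(4 * n_hidden, n_in))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
h, c = lstm_cell_step(rng.normal(size=n_in),
                      np.zeros(n_hidden), np.zeros(n_hidden), W, U, b)
print(h.shape, c.shape)  # prints (16,) (16,)
```

Both `h` and `c` are then passed to the next time step, which is what lets the cell retain information over long sequences.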
The LSTM model created to make predictions in a VLE uses three input layers; one hidden layer with sixteen nodes and a hyperbolic tangent activation function to handle the non-linearity; and two output layers with one node each and sigmoid activation functions. To regularize the LSTM model, a dropout layer set to 50% is applied at each training step. The LSTM model was trained with a batch size of 32 using the back-propagation method. Figure 3 displays the design of the LSTM architecture.

Optimization algorithms using gradient descent
Gradient descent is used to optimize neural network models [31]. The Adam and Nadam algorithms were used in this study as gradient-based optimization algorithms. Gradient descent requires both the target function and its derivative to be available for optimization. The gradient descent optimization algorithms used in the study are described below.
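Both optimizers below build on the basic gradient descent update, θ ← θ − η∇f(θ), which can be sketched as:

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
print(round(theta, 4))  # prints 3.0, the minimizer of f
```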

Adam optimizer
In contrast to the classical stochastic gradient descent approach, Adam is an optimization algorithm that iteratively updates weights based on the training data [21], [28]. Adam can be characterized as stochastic gradient descent with momentum combined with the RMSprop model. Adam is an adaptive learning rate technique that maintains individual learning rates for the various parameters.
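A minimal sketch of the Adam update rule [28] with the standard default coefficients; the quadratic objective is for illustration only:

```python
import math

def adam_step(theta, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m)
    and the squared gradient (v), with bias correction for early steps."""
    m = b1 * m + (1 - b1) * g            # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment (RMSprop-style) estimate
    m_hat = m / (1 - b1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)            # bias-corrected second moment
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):                 # minimize f(theta) = (theta - 3)^2
    theta, m, v = adam_step(theta, 2 * (theta - 3), m, v, t, lr=0.05)
```

The `m` term supplies the momentum behavior and `v` the per-parameter scaling inherited from RMSprop, which is why Adam is described as a fusion of the two.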

Nadam optimizer
The NAG and Adam algorithms were combined to create the Nadam algorithm [22], [29]. Nadam performs a Nesterov-style momentum update on Adam's bias-corrected first-moment estimate m̂ [32].

Performance evaluation of the model
The model's effectiveness was measured using a confusion matrix, classification accuracy (CA), precision, recall, and F1-score (F1) [33]. The confusion matrix depicts the state of the dataset as well as the number of correct and incorrect model predictions [34]. Accuracy, a crucial and intuitive metric, is the proportion of correct predictions among all predictions. Precision is the proportion of correctly predicted positives among all predicted positives. Recall is the proportion of correctly predicted positives among all actual positives. The F1-score (F1) is the weighted harmonic mean of precision and recall. Formally, positives denote students who actually fail, whereas negatives denote students who actually pass; true denotes a valid prediction and false an incorrect one. TP is a true positive, TN a true negative, FN a false negative, and FP a false positive. Table 1 illustrates the confusion matrix for the combinations of actual and predicted values.
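All four metrics follow directly from the confusion-matrix counts; a minimal sketch with illustrative counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)      # correct positives / predicted positives
    recall = tp / (tp + fn)         # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 80 TP, 90 TN, 10 FP, 20 FN.
acc, prec, rec, f1 = classification_metrics(80, 90, 10, 20)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
# prints 0.85 0.89 0.8 0.84
```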

RESULTS AND DISCUSSION
In this study, Python was used for model training and testing. The architecture's performance parameters were developed using 10 different hyperparameter combinations, and validation tests were carried out on 20% of the training dataset samples. The Adam and Nadam optimization algorithms were used to refine the models' hyperparameters. The outcomes of the LSTM models' performance assessment are explained below.
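For illustration only, a grid over the five hyperparameters listed in the introduction could be enumerated as follows; the candidate values here are hypothetical, and the study evaluated 10 specific combinations rather than this full grid:

```python
from itertools import product

# Hypothetical candidate values -- illustrative, not the study's exact grid.
search_space = {
    "learning_rate": [0.001, 0.01],
    "hidden_units": [16, 32],
    "input_length": [10, 25],
    "batch_size": [32, 64],
    "dropout": [0.3, 0.5],
}

# Enumerate every combination of the candidate values (2^5 = 32 configurations).
combinations = [dict(zip(search_space, values))
                for values in product(*search_space.values())]
print(len(combinations))  # prints 32
```

Each dictionary in `combinations` would then configure one training run, whose validation accuracy decides which configuration to keep.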

Performance analysis of the long short-term memory model
We assess the accuracy, recall, precision, and F1-score of the Adam- and Nadam-optimized LSTM models, and compare the outcomes to determine which performs best. In the first experiment, the LSTM-Adam model was trained and tested; the second experiment trained and tested the LSTM-Nadam model. The average accuracy of the LSTM model with Adam optimization is 87%, with a lowest accuracy of 60% and a highest accuracy of 92%. The highest recall is 92%, the lowest 60%, and the average 88%. Table 3 displays the measurement outcomes of the LSTM model with hyperparameters tuned by the Nadam algorithm.
According to the experimental results, the average accuracy of the LSTM model with Nadam optimization is 89%, with a highest accuracy of 93% and a lowest of 60%. The average recall is 89%, with a lowest of 60% and a highest of 93%. We visualize and compare the accuracy of the LSTM-Adam and LSTM-Nadam models; Figure 4 displays the performance visualization of the LSTM model's measurement outcomes. The analysis shows that the LSTM-Nadam model outperforms the LSTM-Adam model in several accuracy domains. With the Adam optimization algorithm, the LSTM produced its best classification outcomes on the decile 0 data, where 1,149 students were correctly classified in the pass category. In addition, 369 records were incorrectly classified as passing even though those students did not pass. Zero students were classified as actually failing, which is in accordance with the actual data. In three instances, students who actually passed were incorrectly classified as not passing.
The same test data is used for the LSTM model's predictions with the Nadam optimization algorithm; the outcomes are displayed in Table 5. The LSTM model with the Nadam optimization algorithm achieves some higher accuracy values. It generates its best classification outcomes at decile 0 by classifying 1,152 students as passing. Additionally, 369 records were incorrectly classified as passing even though those students did not pass. Zero students were classified as actually failing, which is in accordance with the actual data, and zero students who did not pass were correctly classified.

CONCLUSION
This study categorized student performance in a VLE using LSTM models optimized with the Adam and Nadam optimization algorithms. The average accuracy of the LSTM model with Nadam optimization is 89%, with a maximum accuracy of 93%, while the Adam-optimized LSTM model has an average accuracy of 87% and a maximum accuracy of 92%. The LSTM model with the Nadam optimization algorithm therefore performs better than the one with Adam on the VLE prediction problem. The contribution of this study is the performance improvement of the LSTM model through hyperparameter optimization with the Adam and Nadam algorithms, which can serve as a reference when developing LSTM-based prediction systems. Further research could test meta-heuristic optimization algorithms and assess the performance of the resulting models.

Figure 1 .
Figure 1. The phases of the research methodology used

Figure 2 .
Figure 2. The BBB course's presentation code

Figure 3 .
Figure 3. The architecture of the designed LSTM model

Table 1 .
The confusion matrix

Table 3 .
LSTM model results with Nadam optimization

Results from the long short-term memory model for prediction
This subsection evaluates the LSTM model's performance in predicting final student outcomes in a VLE. A total of 1,521 records of testing data are used to evaluate the LSTM model. Table 4 displays the outcomes of the LSTM model's predictions with the Adam optimization algorithm.

Table 4 .
LSTM model prediction outcomes using Adam optimization