Pakistan sign language to Urdu translator using Kinect

ABSTRACT


INTRODUCTION
Communication is the foundation upon which people understand each other. It takes several forms. In verbal communication, people engage with each other face-to-face, over their devices, or through applications such as Zoom. Nonverbal communication includes facial expressions, body poses, eye contact, hand movements, and touch. Sign language is a nonverbal form of communication: it conveys meaning through simultaneous hand motions, the orientation of the fingers and hands, arm or body movements, and facial expressions. It is mainly used by the hearing-impaired.
There are many kinds of sign languages. Like spoken languages, sign language is not universal; it differs from country to country, even among countries that share a spoken language. Sign language is not an interpretation of a spoken language; it is a deaf person's native or local language, a natural and complete language with its own grammatical structure. In Pakistan, many different sign languages are used across the provinces, cities, and villages, so people from different cities or villages cannot communicate with each other through sign language. The proposed method focuses on Pakistan sign language (PSL) and the Urdu language [1].
Sign-based communication is not understood by everybody; therefore, we have developed a system that acts as a bridge between the hearing-impaired and hearing people of Pakistan to fill the communication gap between them [2]. The majority of the working community currently cannot use sign language. It is essential for the growth of the country and the workplace that deaf adults are employed and given the chance to work among people. This will help both communities and improve current standards [3]. We have designed a PSL interpreter that performs real-time sign language translation and audio-to-text translation [4]. The system's sign language module is trained on key points extracted from multiple frames using MediaPipe Holistic. The key points are collected from video captured through a Kinect device [5]. For dynamic sign language, an LSTM model is built and trained on the gathered key points [6]. For static sign language, an image-based dataset is created and used to train an object detection model from the TensorFlow model garden. After successful training, PSL is detected in real time and translated into Urdu. The audio-to-Urdu module uses the Kinect's microphone to capture Urdu audio, which is then converted to Urdu text.
In related work, a Chinese sign language recognition system combines a Specific Hand Shape (SHS) descriptor with an encoder-decoder long short-term memory (LSTM) structure to recognize isolated Chinese sign words. The Microsoft Kinect 2.0 device is used for data input, and the database is built on the SHS descriptor using a convolutional neural network (CNN). The system first captures the color image, depth map, and skeletal image. After data pre-processing, the hand regions and skeletal joint locations of every isolated sign word are extracted from the designed database. The system then extracts two features, the SHS and the trajectory. In the final stage, an encoder-decoder LSTM network is trained on the SHS and trajectory features and applied to sign recognition [7].
Another study, a Kinect-based Taiwanese sign language recognition system, uses hidden Markov models (HMMs) to recognize the direction of the hands and a support vector machine (SVM) to recognize the hand shape. Hand information is extracted from skeletal data for 20 joints: hip center, spine, shoulder center, head, left shoulder, left elbow, left wrist, left hand, right shoulder, right elbow, right wrist, right hand, left hip, left knee, left ankle, left foot, right hip, right knee, right ankle, and right foot. Each joint's data includes the X, Y, and Z position values. The positions of the wrist, shoulder, spine, and hip are used to localize the hands. The wrist positions are then recorded as a gesture trajectory over a certain time interval, and the velocity, angle, and distance of the trajectory, together with the distance between the two hands, are extracted as features. HMMs then recognize the hand directions from these features. The position of the hand is classified into six areas as X and Y, i.e., spine position, trajectory, palm segmentation, direction recognition, hand position, and hand shape. The experiments yield results of around 84% [8].
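The trajectory features listed above can be illustrated with a minimal sketch. It is not the implementation from [8]: the function name `trajectory_features`, the 30 frames-per-second default, the use of the right wrist as the tracked hand, and the choice to measure the angle in the X-Y plane only are all assumptions made for illustration.

```python
import math


def trajectory_features(left_wrist, right_wrist, dt=1.0 / 30.0):
    """Per-step features from two wrist trajectories.

    left_wrist, right_wrist: lists of (x, y, z) positions, one per frame.
    dt: time between frames in seconds (30 fps is an assumption).
    Returns one dict per consecutive frame pair describing the motion of
    the right wrist plus the distance between the two hands.
    """
    if len(left_wrist) != len(right_wrist):
        raise ValueError("both trajectories need the same number of frames")
    features = []
    for i in range(1, len(right_wrist)):
        prev, cur = right_wrist[i - 1], right_wrist[i]
        step = math.dist(prev, cur)  # distance moved in this step
        # Direction of motion in the X-Y plane (depth ignored by assumption).
        angle = math.degrees(math.atan2(cur[1] - prev[1], cur[0] - prev[0]))
        features.append({
            "distance": step,
            "velocity": step / dt,
            "angle": angle,
            "hand_gap": math.dist(cur, left_wrist[i]),
        })
    return features
```

A sequence of such feature dicts, one per frame step, is the kind of observation sequence an HMM could then be trained on to classify hand direction.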
Another approach, the hierarchical LSTM (HLSTM), targets vision-based sign language translation (SLT), which interprets sign video as understandable text and language. To address continuous sign language translation (CSLT), a hierarchical LSTM encoder-decoder model with visual content and word embedding was developed. It tackles different granularities by conveying spatio-temporal transitions among frames, clips, and viseme units. First, it uses a 3D CNN to explore the spatio-temporal cues of video clips and then packs appropriate visemes using online key clip mining with adaptive variable length. After pooling the recurrent outputs of the top layer of the HLSTM, a temporal attention-aware weighting mechanism balances the intrinsic relationship among viseme source positions. Lastly, another two LSTM layers separately recurse viseme vectors and translate semantics. Because the 3D CNN and the top layer of the HLSTM preserve the original visual content, the encoding time step of the bottom two LSTM layers is shortened, which lowers computational complexity while attaining more nonlinearity. The model performs well, particularly in independent tests on seen sentences, where it shows discriminative capability [9].
Another proposed technique is a hybrid deep architecture that addresses the CSLT problem with a temporal convolution module (TCOV), a bidirectional gated recurrent unit module (BGRU), and a fusion layer module (FL). The design is an end-to-end trainable network that benefits from both the TCOV and BGRU modules: BGRU keeps the long-term temporal context transition pattern (global pattern), while TCOV focuses on the short-term temporal pattern (local pattern) of adjacent clip features. A fusion layer with a multilayer perceptron (MLP) integrates the different feature embedding representations to learn their complementary relationship, and it measures the mutual accommodation extent of TCOV and BGRU. With connectionist temporal classification (CTC) constraints, the model performs about the same as other methods with multiple iterations [10].
An overview of sign language and hand gesture recognition techniques describes how signs and gestures are recognized. Many methods use image processing, computer vision, and machine learning. Sign language mostly involves the upper body, from the waist up. One gesture approach initially yields 94%, but the figure drops to 40% when the individual changes, so it is abandoned and alternative methods are pursued. Hand gesture recognition, whether 3D-model-based or appearance-based, depends on feature extraction and classification. Dynamic sign language recognition uses video, while static gesture recognition uses single frames. Vision-based methods differ in how data is gathered: the data are camera frames, and Kinect and the Leap Motion Controller (LMC) are depth-sensing 3D cameras. Image and video inputs are modified during pre-processing to increase performance. Segmentation depends on the image's background and skin tone, which makes it unreliable. IMU sensors such as gyroscopes and accelerometers are used in data gloves for gesture and sign language detection, and Wi-Fi-based gesture control is also used for gesture recognition. Many new works build on these methods [11].
An approach to isolated sign language recognition using data provided by a depth camera has also been presented. In this method, sequences of depth maps of dynamic sign language gestures are divided into smaller regions (cells), and statistical information is used to describe the cells. Since gesture executions have different lengths, the dynamic time warping (DTW) technique with the nearest-neighbour rule is often employed to compare them. However, the time-consuming computations of DTW limit the usability of the classifier [12]-[15].
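To make the DTW comparison concrete, the following is a minimal sketch of textbook DTW with a 1-nearest-neighbour rule. It is not the classifier from [12]-[15]: the function names, the Euclidean frame distance, and the toy templates in the usage note are assumptions for illustration.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two sequences of feature vectors.

    a, b: lists of equal-dimension tuples; the sequences may differ in length.
    Fills an (len(a)+1) x (len(b)+1) table, so the cost grows with the
    product of the two sequence lengths.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a frame of b
                                 cost[i][j - 1],      # repeat a frame of a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def nearest_neighbour(query, templates):
    """Return the label of the template closest to query under DTW.

    templates: list of (label, sequence) pairs.
    """
    return min(templates, key=lambda t: dtw_distance(query, t[1]))[0]
```

For example, with the templates `[("up", [(0,), (1,), (2,)]), ("down", [(2,), (1,), (0,)])]`, the slower query `[(0,), (0,), (1,), (2,), (2,)]` is assigned the label `"up"`. Every query has to be compared against every stored template, and each comparison fills a full table, which is the computational burden noted above.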

METHOD
Translation from Pakistan sign language (PSL) to Urdu covers letters, words, and sentences, and is divided into static and dynamic sign language. Static sign language translation is achieved using the TensorFlow Object Detection model garden. We collected an image-based dataset using Kinect and divided it into 34 classes that include Urdu letters. The data was labelled and then split into training and test sets. The model garden [16] was used for training, and real-time sign language translation was performed using OpenCV (cv2) integrated with Kinect, as shown in Figure 1. Dynamic sign language translation is achieved using Google's MediaPipe Holistic library, through which we extracted the key points of both hands, the face, and the shoulders. We created a dataset of 4 dynamic signs using Kinect, extracting key points from 60 video sequences of 30 frames each.
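As a sketch of how per-frame key points might be turned into fixed-length feature vectors, the code below flattens the landmarks of one frame and zero-fills any part that was not detected. The landmark counts are assumed from MediaPipe Holistic's default output (33 pose landmarks with x, y, z, and visibility; 468 face landmarks and 21 landmarks per hand with x, y, z); the text does not state which subset the system actually keeps. The names `LAYOUT`, `flatten_frame`, and `build_sequence` are hypothetical, and the input is a plain dictionary rather than a MediaPipe result object.

```python
# (part name, number of landmarks, values per landmark).
# Assumed from MediaPipe Holistic's default output, not stated in the text.
LAYOUT = (
    ("pose", 33, 4),
    ("face", 468, 3),
    ("left_hand", 21, 3),
    ("right_hand", 21, 3),
)
FRAME_SIZE = sum(count * width for _, count, width in LAYOUT)  # 1662 values


def flatten_frame(detections):
    """Flatten one frame's landmarks into a list of FRAME_SIZE floats.

    detections: dict mapping a part name to a list of value tuples.
    A missing or None part (e.g. a hand out of view) is zero-filled so
    every frame has the same length.
    """
    vector = []
    for name, count, width in LAYOUT:
        points = detections.get(name)
        if points is None:
            vector.extend([0.0] * (count * width))
            continue
        if len(points) != count or any(len(p) != width for p in points):
            raise ValueError(f"unexpected landmark layout for {name!r}")
        for point in points:
            vector.extend(float(v) for v in point)
    return vector


def build_sequence(frames, length=30):
    """Turn `length` frames into one training sequence of flat vectors."""
    if len(frames) != length:
        raise ValueError(f"expected {length} frames, got {len(frames)}")
    return [flatten_frame(frame) for frame in frames]
```

Under these assumptions, one 30-frame sequence becomes a 30 x 1662 array, and a dataset of 60 sequences becomes a 60 x 30 x 1662 array once stacked.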
This dataset was split into training and test sets. It was used to train a long short-term memory (LSTM) network, a recurrent neural network (RNN) architecture, consisting of 3 LSTM layers and 3 dense layers, as shown in Figure 2 [17], [18]. The model was trained for around 2000 epochs; its categorical accuracy is plotted in Figure 3 and its epoch loss in Figure 4.
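The train/test split and the label encoding needed for a categorical loss can be sketched as follows. The split ratio is not given in the text, so the 20% test fraction, the fixed seed, the per-class (stratified) sampling, and the function names are all assumptions.

```python
import random


def one_hot(index, num_classes):
    """Encode a class index as a one-hot list for a categorical loss."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle and split so each class contributes to the test set.

    Returns ((train_samples, train_labels), (test_samples, test_labels)).
    At least one sample per class goes to the test set.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idxs):
        return [samples[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx), pick(test_idx)
```

Splitting per class keeps all 4 signs represented in the test set, which matters when the dataset is as small as the one described here.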
The trained model was saved and evaluated using a confusion matrix, and we achieved an accuracy of 1.0, as shown in Figure 5. Real-time dynamic sign language translation, shown in Figure 6, was performed using OpenCV (cv2) and ML integrated with Kinect [16], [18]-[22]. Urdu audio to Urdu text translation is performed using the Google speech app, which translates letters, words, and sentences, as shown in Figure 7. The text obtained from sign language is converted into Urdu audio using the Google Text-to-Speech app, which completes the PSL to Urdu translator shown in Figure 8. This is a very user-friendly system [23] that helps both hearing-impaired people [24] and hearing people communicate with each other without difficulty [25].
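The confusion-matrix evaluation mentioned above can be illustrated with a short sketch. The function names and the toy labels in the usage note are assumptions; the sketch only shows how an accuracy figure follows from the matrix.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count predictions: rows are true classes, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0
```

With true labels `[0, 1, 2, 3, 0, 1]` and identical predictions, every count falls on the diagonal and `accuracy` returns 1.0, which is the situation reported in Figure 5. A single misclassified sample in the same set would lower it to 5/6.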

RESULTS AND DISCUSSION
This system attempts to address the communication barrier faced by the deaf community of Pakistan by providing a sign language translator application with a user-friendly GUI and better functionality. It uses a novel approach to PSL recognition based on Kinect sensors. A large number of videos, or sequences of frames, is used to acquire the key points that form the dataset for training and testing. The dataset can be recorded in real time, or existing videos or datasets can be used. The key points for training are then converted into NumPy arrays.
The dataset is then split into training and test sets, and the training data is fed to an LSTM network. After successful training, sign language predictions can be made. Some of the real-time inputs and results are shown in Figure 9 and Table 1. The same key point extraction technique is used afterwards for real-time prediction. The Kinect sensors capture sequences of the user's face, hand, and pose landmarks, which are processed frame by frame and matched against the training data. Once the user has performed a sign, a real-time prediction is made and presented to the user as Urdu text and Urdu audio.
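The frame-by-frame prediction loop described above can be sketched with a sliding window over the most recent frames. The class name `RealTimePredictor`, the `classify` callable standing in for the trained LSTM, the 30-frame window, and the 0.8 confidence threshold are all assumptions; the text does not specify how predictions are filtered.

```python
from collections import deque


class RealTimePredictor:
    """Keep the latest `window` frame vectors and classify them together."""

    def __init__(self, classify, labels, window=30, threshold=0.8):
        # classify: hypothetical stand-in for the trained model; takes a
        # list of frame vectors and returns one probability per label.
        self.classify = classify
        self.labels = labels
        self.threshold = threshold
        self.frames = deque(maxlen=window)  # oldest frame drops out

    def push(self, frame_vector):
        """Add one frame; return a label, or None if nothing is confident."""
        self.frames.append(frame_vector)
        if len(self.frames) < self.frames.maxlen:
            return None  # not enough frames collected yet
        probabilities = self.classify(list(self.frames))
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        if probabilities[best] < self.threshold:
            return None  # suppress low-confidence predictions
        return self.labels[best]
```

In use, each new frame's key point vector is passed to `push`, and a returned label is what would be shown as Urdu text and passed on to text-to-speech. The threshold keeps the display from flickering between signs while the user is still moving into position.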

CONCLUSION
In this research work, we have proposed a methodology to help the deaf community in Pakistan. We have designed and developed a framework to address the difficulty hearing-impaired people face when communicating with hearing people. Its purpose is to reduce the struggle of the hearing-impaired people of Pakistan and help them participate more fully in society. The solution is simple, effective, and affordable. The proposed system was tested in the Computer Science laboratory of IQRA University. Experimental results show that this Kinect-based system is promising and reliably meets the requirements for solving the communication problem faced by hearing-impaired people.