Comparison of CNNs and SVM for a voice-controlled wheelchair

Received Feb 10, 2020. Revised Apr 20, 2020. Accepted May 6, 2020.

In this paper, we develop an intelligent wheelchair using CNN and SVM voice recognition methods. The data were collected from Google, and some were self-recorded. Four voice commands are recognised: go, left, right, and stop. Features are extracted from the voice data using the MFCC technique, and CNNs and SVM are then used to classify and recognise them. A motor driver connected to a Raspberry Pi 3B+ controls the movement of the wheelchair prototype. CNNs produced higher accuracy, 95.30%, compared to only 72.39% for SVM. On the other hand, SVM took only 8.21 seconds to execute, while CNNs took 250.03 seconds. CNNs produce better results because noise is filtered in the feature extraction layer before classification in the classification layer; however, CNNs take longer due to the complexity of the network, while the simpler SVM implementation gives a shorter processing time.


INTRODUCTION
The number of impaired people increases every year due to ageing, accidents, and diseases such as paralysis, spinal cord injury (SCI), amputation, and quadriplegia [1]. These people require a special wheelchair to move, such as a motorised wheelchair, which consists of an electric motor and a joystick at the armrest. However, disabled people with paralysed hands or other hand problems cannot operate the joystick of a motorised wheelchair. Therefore, many studies have aimed to help impaired people move a wheelchair using voice recognition [1-3]. Such a wheelchair can be called a smart or intelligent wheelchair because it can be moved using only the voice. The first intelligent wheelchair was developed by the SIAMO project at the University of Alcalá, Spain, in 1999 [4].
Voice and speech identification technology has been under development since Texas Instruments began work on it in 1960 [5]. Recently, voice or speech recognition has been applied to assist people in their work through digital devices such as mobile phones, tablets, and personal computers. There are also voice recognition software and web applications, e.g. Google applications, translation software, and personal assistants such as Alexa, Cortana, and Siri. These personal assistants can be called modern chatbots, able to assist people with any topic they would like to ask about.
Advances in artificial intelligence (AI) have enabled smart or intelligent wheelchairs that exploit voice recognition, and much research has been conducted to improve wheelchair functionality. Nasrin [1] proposed a smart wheelchair application using voice, with an added GPS function to track the user's navigation and location; this application requires a WiFi connection and is installed in a mobile environment. Avutu et al. [3], in contrast, proposed a low-cost voice recognition wheelchair for local use that works without any network connection, at a much lower cost than applications requiring connectivity. Barriuso [6] proposed a smartphone application for a wheelchair with an agent-based intelligent interface. Another application was developed using an Arduino Mega 2560 [7]; its special feature is that emergency messages can be sent to important contacts, and with an ultrasonic sensor, obstacles can be detected and avoided without an internet connection. There is also an intelligent wheelchair application for people with paraplegia [8], which combines voice recognition and touch-screen control. Meanwhile, Chauhan [9] conducted a comparative study of voice recognition wheelchairs and proposed a new design deploying an infrared sensor and a Raspberry Pi with a USB microphone.
The development of smart or intelligent wheelchairs is mostly based on AI technologies. Voice recognition begins with feature extraction, followed by classification. There are many feature extraction techniques for voice recognition, such as Linear Prediction Coefficients (LPC), Discrete Wavelet Transform (DWT), Line Spectral Frequencies (LSF), and Mel-Frequency Cepstrum Coefficients (MFCC) [10].
Classification techniques can be applied to classify and recognise voice, especially in wheelchair applications. Among the most popular and promising techniques, with high accuracy, are convolutional neural networks (CNNs). CNNs are based on the feedforward neural network and generalise better than networks with full connectivity between adjacent layers [11]. Moreover, CNNs have been successfully applied in many areas, especially image identification [12-15], surveillance [16], and human recognition [17-19]. Lei and She [20] authenticated voices using CNNs in a noisy environment; their method improves accuracy and reduces the equal error rate. Guan [21] optimised speech recognition performance using CNNs, improving recognition with an error rate of 13.88%. Sharifuddin et al. compared CNNs and BPNNs in voice-controlled wheelchair applications and obtained higher accuracy with CNNs [5]. The Support Vector Machine (SVM) can also be categorised as one of the best classification techniques. SVM has been applied in various areas, for instance biometrics [17, 22, 23], sentiment analysis [24, 25], and security such as intrusion detection [26, 27]. Selvakumari and Radha [28] applied SVM to classify speech pathology and achieved 98% accuracy, outperforming the Naïve Bayes algorithm. Furthermore, Astuti and Riyandwita [29] applied SVM to recognise voice commands for starting a car engine, classifying the words 'on' and 'off' with 92.15% accuracy. Meanwhile, Harvianto et al. [30] analysed Indonesian-language voices using MFCC and SVM and achieved a high accuracy of 91.83%.
In this paper, we propose an intelligent wheelchair using voice recognition. Four voice commands are recognised: go, left, right, and stop. The data were collected from Google, and some were self-recorded. Features are extracted from the data using the MFCC technique, followed by recognition using CNNs. We also classify the voice using SVM to compare the efficiency and accuracy of the two techniques. This paper is organised as follows. Section 2 presents the research methods on voice recognition, CNNs, and SVM. Experimental details and results, with comparisons between CNNs and SVM, are presented in Section 3. Section 4 concludes the paper.

RESEARCH METHOD
This paper focuses on five types of voice data. The speech signal is based on single words; the speakers are independent; the language is English; the vocabulary is small; and background noise is present. Table 1 presents the five types of voice data used in this paper.

The Raspberry Pi 3B+ is used in this project for the following reasons: fast processing time, low cost, and ease of programming. In voice recognition, CNNs are seen as a technique capable of providing high accuracy, especially in speech and image recognition problems [20]. CNNs are good at highlighting important points, as well as filtering noise from the dataset. CNNs consist of a feature extraction layer and a classification layer.

Feature extraction layer
The feature extraction layer contains a convolutional layer and pooling layers. Here, features are extracted from the data before being fed into the classification layer. We implement one 2D convolutional layer and three 2D max pooling layers in this paper.
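As a minimal illustration of this layer, the sketch below implements one 2D convolution followed by three 2×2 max poolings in plain NumPy; the kernel values and sizes are illustrative only, not the tuned parameters reported later in Table 4.

```python
import numpy as np

def conv2d(x, kernel):
    # "valid" 2D convolution: stride 1, no padding
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    # non-overlapping max pooling; trailing rows/columns are dropped
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# a (44, 20) MFCC matrix through one convolution and three poolings
mfcc = np.random.default_rng(0).normal(size=(44, 20))
feat = conv2d(mfcc, np.ones((3, 3)) / 9.0)  # -> (42, 18)
for _ in range(3):
    feat = max_pool2d(feat)                 # -> (21, 9) -> (10, 4) -> (5, 2)
print(feat.shape)                           # (5, 2)
```

Each pooling halves both dimensions, so the (44, 20) input shrinks rapidly before reaching the classification layer.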

Classification layer
The classification layer classifies the data received from the feature extraction layer. In this paper, it consists of an input layer, a hidden layer, and an output layer.
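This stage can be sketched as a small fully connected network over the flattened feature maps; the layer sizes and random weights below are placeholders for illustration, not the tuned model of Table 4.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

def classify(features, W1, b1, W2, b2):
    x = features.ravel()              # input layer: flatten pooled feature maps
    h = np.maximum(0.0, x @ W1 + b1)  # hidden layer with ReLU activation
    return softmax(h @ W2 + b2)       # output layer: probability per command

rng = np.random.default_rng(1)
feat = rng.normal(size=(5, 2))        # e.g. output of the feature extraction layer
W1, b1 = rng.normal(size=(10, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)
probs = classify(feat, W1, b1, W2, b2)
commands = ["go", "left", "right", "stop"]
print(commands[int(np.argmax(probs))])
```

The command with the highest output probability is the one sent to the motor driver.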
On the other hand, this paper also highlights the implementation of SVM to classify the extracted voice data. SVM is chosen for its ability to perform multi-class classification in high-dimensional spaces. Three kernel functions are used in the experiment: polynomial, RBF, and sigmoid. We then conduct further experiments to measure the effect of C, gamma, and degree using the best kernel. Figure 1 presents the flow of the proposed system [5]. The flow comprises three modules: the input (the user, i.e. the disabled person), the process, and the output. First, the user pushes a button to record the voice command through a portable microphone. MFCC and CNNs run on the Raspberry Pi 3B+ to process the recorded voice. Once the voice is recognised, the output signal is sent to the motor driver connected to the Raspberry Pi 3B+, and the motor driver controls the motors to steer the robot car.
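The end-to-end flow can be summarised as one record-extract-classify-actuate cycle per button press. The sketch below is ours, with the four stages passed in as callables and stubs standing in for the microphone, MFCC extractor, CNN, and motor driver; none of these names come from the actual implementation.

```python
def control_cycle(record_voice, extract_mfcc, recognise, send_to_motor):
    # one button press: record -> MFCC features -> command -> motor driver
    audio = record_voice()
    features = extract_mfcc(audio)
    command = recognise(features)
    send_to_motor(command)
    return command

# usage with stub stages
log = []
cmd = control_cycle(
    record_voice=lambda: [0.0] * 22050,            # one second of "audio"
    extract_mfcc=lambda audio: [[0.0] * 20] * 44,  # fixed (44, 20) features
    recognise=lambda feats: "go",                  # stand-in for the CNN
    send_to_motor=log.append,                      # stand-in for the driver
)
print(cmd, log)
```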

Data Sample
Four voice commands are used in the experiment: right, left, stop, and go. The data were gathered from Google: 2,372 samples for the go command, 2,380 for stop, 2,353 for left, and 2,367 for right. Urban-noise and white-noise data were also collected for the experiment: 2,373 urban-noise samples from Google, while 2,300 white-noise samples were self-recorded [5] because such data were not available online. Each voice sample is one second long and saved in WAV format. The total number of downloaded and recorded voice commands used in the experiment is presented in Table 2.

Mel frequency cepstral coefficients (MFCC)
MFCC is a technique used to extract features from an audio signal. MFCC spaces the frequency bands logarithmically, which allows better speech processing [31]. Signal features first go through feature extraction and then feature matching. MFCC processing is implemented using the Librosa library. Six steps are involved in MFCC feature extraction: the pre-emphasis filter, framing the speech sample, windowing, the fast Fourier transform (FFT), the mel filter bank, and log energy with the discrete cosine transform (DCT).
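While the implementation relies on Librosa, the six steps can be sketched directly in NumPy as below; the parameter values (sample rate, frame and hop sizes, filter counts) are illustrative choices of ours, not Librosa's defaults.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=20):
    # 1. pre-emphasis filter
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. framing the speech sample
    n_frames = 1 + (len(emph) - n_fft) // hop
    frames = np.stack([emph[i * hop:i * hop + n_fft] for i in range(n_frames)])
    # 3. windowing (Hamming)
    frames = frames * np.hamming(n_fft)
    # 4. FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5. mel filter bank (triangular filters spaced on the mel scale)
    hz = 700 * (10 ** (np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700),
                                   n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 6. log energy and DCT -> cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    k, n = np.arange(n_mfcc), np.arange(n_mels)
    dct_basis = np.cos(np.pi * np.outer(n + 0.5, k) / n_mels)
    return logmel @ dct_basis

one_second = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
print(mfcc(one_second).shape)  # (97, 20)
```

Each row is one frame's 20 cepstral coefficients, so a one-second clip yields a 2D frames-by-coefficients matrix, which is the input format used in the next subsection.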

Data preparation
This research uses 2-dimensional (2D) CNNs. The data are taken directly from the raw MFCC output because the dataset is already in 2D format and does not need to be compressed. Every sample must have the same shape, (44, 20), which corresponds to one second of data. If a sample exceeds this size, the excess is cut off; if it is smaller than (44, 20), it is padded with zeros. Finally, the data are ready to be fed into the 2D CNNs. Figure 2 illustrates the CNN data preparation process [5].
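A minimal sketch of this fixed-shape step, assuming the MFCC matrix is frames by coefficients:

```python
import numpy as np

def to_fixed_shape(m, n_frames=44, n_coeff=20):
    # cut off anything beyond (44, 20); pad shorter clips with zeros
    out = np.zeros((n_frames, n_coeff))
    t, c = min(m.shape[0], n_frames), min(m.shape[1], n_coeff)
    out[:t, :c] = m[:t, :c]
    return out

print(to_fixed_shape(np.ones((50, 20))).shape)  # (44, 20), excess truncated
print(to_fixed_shape(np.ones((30, 20))).shape)  # (44, 20), zero-padded
```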

SVM parameter tuning
The Support Vector Machine (SVM) is the other technique used to classify the voice data. To fine-tune the SVM, fourteen experiments were conducted. The parameters tuned are the kernel type, the regularisation parameter C, gamma, and degree. We first conducted three experiments using three different kernel functions: polynomial, RBF, and sigmoid. The results show that the polynomial kernel achieved the highest accuracy. Next, we conducted further experiments to measure the effect of C, followed by the effects of gamma and degree. The best accuracy, 72.39%, was achieved with a polynomial kernel, C = 10, gamma set to 'scale', and degree 1, as shown in Table 3.
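A tuning sweep of this kind can be expressed with scikit-learn's GridSearchCV. The sketch below uses random placeholder features instead of the real MFCC data, so its scores are meaningless; only the search structure (kernel, C, gamma, degree) mirrors the experiments.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 880))   # placeholder for flattened (44, 20) MFCCs
y = rng.integers(0, 4, size=120)  # placeholder labels: go/left/right/stop

param_grid = {
    "kernel": ["poly", "rbf", "sigmoid"],  # kernel comparison
    "C": [1, 10, 100],                     # regularisation strength
    "gamma": ["scale"],
    "degree": [1, 2, 3],                   # used only by the polynomial kernel
}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```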

CNNs parameter tuning
Twelve experiments were conducted to tune the CNN parameters. From these experiments, the best-selected model is described in Table 4 [6], which lists the layers, kernel size, stride, filters, padding, nodes, bias, and activation of the CNN model. The implemented model achieved 95.30% accuracy in 4.1672 minutes.

Comparison between SVM and CNNs
Based on the conducted experiments, CNNs achieved higher accuracy, 95.30%, compared to 72.39% for SVM. In terms of time, SVM took only 8.21 seconds to run, while CNNs took 250.03 seconds. The CNNs' ability to filter noise from the voice data in the feature extraction layer produces higher accuracy than SVM; on the other hand, the simpler SVM implementation gives a shorter processing time. Table 5 displays the accuracy and time results for CNNs and SVM.

Hardware tuning
To demonstrate the implementation of the voice recognition process, we integrate logic gates with the L298N motor driver. The logic levels are described in Table 6; they control the movement of the motors in four directions: go, stop, left, and right [6].
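As an illustration, a command-to-logic-level mapping for the L298N's IN1-IN4 inputs might look like the following; the specific level assignments here are hypothetical, and the actual truth table is the one given in Table 6.

```python
# hypothetical IN1..IN4 levels; (1, 0) drives a motor forward, (0, 0) stops it
COMMAND_LEVELS = {
    "go":    (1, 0, 1, 0),  # both motors forward
    "left":  (0, 0, 1, 0),  # right motor only -> turn left
    "right": (1, 0, 0, 0),  # left motor only -> turn right
    "stop":  (0, 0, 0, 0),  # both motors off
}

def drive(command):
    # unrecognised commands fall back to "stop" as a safe default
    return COMMAND_LEVELS.get(command, COMMAND_LEVELS["stop"])

print(drive("go"))  # (1, 0, 1, 0)
```

On the real hardware, these four levels would be written to the Raspberry Pi GPIO pins wired to the driver's inputs.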

CONCLUSION
In this paper, we proposed an intelligent wheelchair using voice recognition. Four voice commands are recognised: go, left, right, and stop. The data were collected from Google, and some were self-recorded. Features are extracted from the data using the MFCC technique, followed by recognition using CNNs and SVM. CNNs produced higher accuracy, 95.30%, compared to only 72.39% for SVM. On the other hand, SVM took only 8.21 seconds to execute, while CNNs took 250.03 seconds. CNNs produce better results because noise is filtered in the feature extraction layer before classification in the classification layer; however, CNNs take longer due to the complexity of the network, while the simpler SVM implementation gives a shorter processing time.