IAES International Journal of Artificial Intelligence (IJ-AI)

Received Mar 16, 2022 Revised Aug 11, 2022 Accepted Sep 10, 2022

This work presents the design and validation of a voice assistant to command robotic tasks in a residential environment, as support for people who require isolation or assistance due to motor impairments. A database of 3600 audio recordings covering 8 word categories, such as "paper", "glass" or "robot", which combine into commands such as "carry paper" or "bring medicine", is preprocessed to obtain a matrix of Mel frequency coefficients and their derivatives. These matrices are the inputs to a convolutional neural network that achieves an accuracy of 96.9% in discriminating the categories. The command recognition tests involve recognizing groups of three words starting with "robot", for example "robot bring glass", and allow 8 different actions to be identified per voice command, with an accuracy of 88.75%.


INTRODUCTION
Speech recognition techniques offer a very wide field of research with diverse applications, such as speech impairment, improving the accuracy of voice command fingerprinting attacks and more, as discussed in [1]-[3]. One of the most representative lines of research is emotion recognition using spectrograms, mel frequency cepstral coefficients (MFCCs) and convolutional networks [4], [5]. Cases such as the one presented in [6] employ convolutional networks with 3-dimensional inputs based on the first and second derivatives of the spectrogram.
Interaction with robots by voice commands is another field of research interest in speech recognition [7]-[9]. Voice assistants such as Amazon's Alexa [10] or Google's [11] allow a more natural method of human-robot interaction, which highlights that voice commands are a necessity in interaction with robots [12]. In this line of research, convolutional networks also provide high performance [13]-[15].
Nowadays, the development of intelligent environments is gaining strength, including smart homes [16]. Robotic technology has been included in these environments on different fronts, such as care of people [17], cleaning [18] and even cooking [19]. In [20] and [21], the development of assistive robots to address patient isolation due to COVID is presented; however, that development is oriented to systems commanded remotely from cellular mobile equipment.
Given the relevance of this topic and the need for a more natural and autonomous telecommand system, this work presents an audio command recognition system oriented to an assistive robot in a residential environment. It integrates what has been found in the state of the art into a voice assistant for robotic action, performing audio preprocessing with cepstral coefficients and subsequent recognition with convolutional networks. As a contribution to the state of the art, a convolutional neural architecture is designed to be easily embedded in portable electronic systems, and the command words are separated by means of a sliding window that calculates the power density of the audio signal.
The document is divided into four sections. The present section reviews the state of the art related to the work developed. Section 2 presents the methodology used for the separation of the words that make up each command and for the neural training. Section 3 presents the analysis of the results achieved, and Section 4 presents the conclusions reached.

METHOD
The proposed objective is to use voice commands consisting of groups of three words, which allow a mobile robot to execute assistive actions within a residential environment. The sequence of control words is recorded and each word is separated to obtain a two-dimensional map of each audio signal by means of mel frequency cepstral coefficients (MFCCs). Each map is classified using a convolutional neural network, and the coherence of the command is validated before the action is executed. The general scheme is presented in Figure 1.

The model is trained on a database of 3600 recordings from different users, distributed in 8 classes corresponding to robot, bring, carry, stop, paper, cup, towel, and medicine, where 80% are taken for training and 20% for validation. Each audio is acquired with a sampling frequency of 16000 Hz, and each word is separated at the location of the minima found from the absolute value of the original signal, as shown in Figure 2, where the asterisks on the right side mark the minima found. By means of a sliding window of ten times the input frequency, the power density of the audio signal is calculated. Each value is compared with the previous one, and if it presents a decay of 75% it is established as a local minimum, a point used for the separation of each word.
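The word separation step described above can be illustrated with a minimal sketch. It assumes non-overlapping windows, a window length `win` chosen freely by the caller, and a "decay of 75%" interpreted as the current window power falling to at most 25% of the previous one. The function names and these choices are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def window_power(signal, win):
    """Mean power of consecutive, non-overlapping windows of `win` samples."""
    n = len(signal) // win
    frames = np.asarray(signal[: n * win], dtype=float).reshape(n, win)
    return np.mean(frames ** 2, axis=1)

def find_split_points(signal, win=1600, decay=0.75):
    """Sample indices where the window power drops by at least `decay`
    relative to the previous window (candidate word boundaries).
    `win` and the interpretation of `decay` are assumptions."""
    power = window_power(signal, win)
    splits = []
    for i in range(1, len(power)):
        prev, cur = power[i - 1], power[i]
        # Skip silent predecessors so a run of silence yields one boundary.
        if prev > 0 and cur <= (1.0 - decay) * prev:
            splits.append(i * win)
    return splits

# Synthetic three-"word" signal: tone bursts separated by silence.
fs = 16000
t = np.arange(6400) / fs
burst = np.sin(2 * np.pi * 200 * t)
gap = np.zeros(3200)
command = np.concatenate([burst, gap, burst, gap, burst])
words = np.split(command, find_split_points(command))
```

On real recordings the silence between words is not exactly zero, so the threshold and window length would need tuning against the data.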

Figure 2. Minima detection for word separation
The database is built from recordings of 10 male and 10 female users, to diversify the learning. Figure 3 shows examples from the database. Each of the three words is preprocessed for feature extraction to obtain a two-dimensional, three-channel map, which allows a convolutional neural network [22] to learn the behavior of the voice command over time so that it can be recognized. The feature map is obtained by calculating the mel frequency cepstral coefficients (MFCCs) using (1) to (3). These are coefficients for speech representation based on human auditory perception [23], widely used in speech analysis systems [24].
By means of (1) it is possible to generate a feature map of 12 coefficients acquired from 199 frames. This map is the first input channel to the network, and the first and second derivatives, (2) and (3) respectively, generate the other two channels. Therefore, the learning input to the network has dimensions 12×199×3.
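As an illustration of how the 12×199×3 input can be assembled, the sketch below stacks an MFCC matrix with its first and second time derivatives. It uses `np.gradient` (finite differences) as a stand-in for (2) and (3), which may differ from the derivative formulas actually used. The framing of 320-sample windows with a 160-sample hop is also an assumption: it is one choice that yields 199 frames from a 32000-sample input, and the source does not state the framing.

```python
import numpy as np

def frame_count(n_samples, win, hop):
    """Number of full frames of length `win` taken every `hop` samples."""
    return 1 + (n_samples - win) // hop

def stack_with_deltas(mfcc):
    """Stack a (coeffs, frames) MFCC matrix with its first and second
    time derivatives into a (coeffs, frames, 3) array."""
    mfcc = np.asarray(mfcc, dtype=float)
    delta = np.gradient(mfcc, axis=1)    # first derivative along time
    delta2 = np.gradient(delta, axis=1)  # second derivative along time
    return np.stack([mfcc, delta, delta2], axis=-1)

# Assumed framing: 20 ms windows, 10 ms hop at 16 kHz over 2 s of audio.
n_frames = frame_count(32000, win=320, hop=160)

# Placeholder MFCC matrix (random values, for shape illustration only).
rng = np.random.default_rng(0)
features = stack_with_deltas(rng.normal(size=(12, n_frames)))
```

In practice the first channel would come from an MFCC routine applied to each separated word rather than from random values.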
The network architecture used is shown in Table 1. It employs six convolution layers, given the limited number of desired outputs and the specific task to be performed by the network. The training hyperparameters were found iteratively, using a learning rate of 1e-6 with 50 epochs. Figure 4 illustrates the network learning process, with a training time of 31 minutes for 79250 iterations on a 2.30 GHz Intel Core i7 computer with an NVIDIA GeForce RTX 3070 8 GB GPU, and a final accuracy of 96.9%.

RESULTS AND DISCUSSION
The algorithm is validated by evaluating the action commands to be executed by the mobile robot. For this purpose, the variants of the commands are established according to the words to be recognized, as shown in Table 2, and the number of true positives versus false positives that the algorithm exhibits is determined. A true positive corresponds to a valid command for the desired action; a false positive corresponds to a valid command that does not match the desired action. The algorithm filters the validity of a command in software, initially checking for the existence of the three words by means of the minima of the signal spectrum. Figure 6 illustrates two examples, the commands "robot bring glass" in Figure 6(a) and "robot carry paper" in Figure 6(b), with the location of the minima that result in the division of the words, where the difference between the two commands can be appreciated. The first word must always be robot; otherwise the command is not validated.

From Table 2, it is possible to derive an efficiency of 88.75% in the discrimination of the commands to the robot. Two sources of class confusion stand out: confusion between carry and bring accounts for 16.6% of the false positives, and 55.5% of the false positives correspond to confusing the object (glass, paper and towel) with the class medicine.

Figure 7 illustrates the results of the network prediction when each word is discriminated and evaluated. To the right of each separated word there is noise generated when the information vector is completed to its full size. This is because the duration of the input to the network is 2 seconds, which at a sampling frequency of 16000 Hz implies 32000 samples. When each word is trimmed, the vector is shortened and, since it cannot be filled in a uniform manner because of the MFCC derivatives, a random filling of ±0.01 is generated.
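The padding step can be sketched as follows. The sketch assumes that the fill is drawn from a uniform distribution in ±0.01 and appended to the right of the word. The text states only the amplitude and that the noise appears to the right of each word, so the distribution and the function name `pad_word` are hypothetical.

```python
import numpy as np

def pad_word(word, target_len=32000, noise_amp=0.01, rng=None):
    """Right-pad a trimmed word to `target_len` samples with low-amplitude
    random values in [-noise_amp, noise_amp) (uniform draw is an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    word = np.asarray(word, dtype=float)[:target_len]
    fill = rng.uniform(-noise_amp, noise_amp, size=target_len - len(word))
    return np.concatenate([word, fill])

# A 0.4 s word (6400 samples at 16 kHz) padded to the 2 s network input.
word = np.ones(6400)
padded = pad_word(word, rng=np.random.default_rng(0))
```

A non-constant fill keeps the derivative channels from becoming exactly zero over the padded region, which is consistent with the reason given above for avoiding a uniform fill.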
Figure 7(a) and Figure 7(b) illustrate two examples of different commands, from the original signal to its separation into words and the recognition of each one ("robot bring paper" and "robot carry paper" respectively). In contrast to Figure 7, Figure 8 illustrates a case of erroneous detection for the action command "robot carry paper". The first and second words appear similar, varying mainly in amplitude, so they are recognized as the same word and the network produces the output "robot robot paper", which is classified as an invalid command. In this case, the error was associated with environmental noise at the time of recording the command.

Similar work is presented in [25], where the robotic action commands also employ word separation and feature extraction by MFCC, using a single channel but combining the CNN with an LSTM network; the authors report up to 90.37% accuracy for word recognition. The 6% improvement achieved by the CNN developed in this work is due to a higher number of training audios and the use of the MFCC derivatives of each word. It was also verified that the use of 7×7 filters in the first convolutional layer, instead of 5×5 as in [25], helped to improve the accuracy by 3%.

A virtual environment was designed for evaluation of the robotic navigation task and voice command discrimination, as shown in Figure 9. The response time of the robot in discriminating the actions is about 8 seconds. This time includes the robot's responses confirming a valid command and the identification of the place where the desired action will be carried out in the residential environment.
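The rejection of an output such as "robot robot paper" can be sketched as a simple check on the three recognized words. The grammar below (wake word, then one of two verbs, then one of four objects, giving 8 actions) is an assumption inferred from the class list and the reported number of actions. The handling of the stop class is not described in the text and is left out of this sketch.

```python
WAKE_WORD = "robot"
VERBS = {"bring", "carry"}                        # assumed action words
OBJECTS = {"paper", "cup", "towel", "medicine"}   # assumed object words

def validate_command(words):
    """Return (verb, object) for a coherent three-word command, else None."""
    if len(words) != 3:
        return None
    wake, verb, obj = words
    if wake != WAKE_WORD or verb not in VERBS or obj not in OBJECTS:
        return None
    return (verb, obj)

valid = validate_command(["robot", "carry", "paper"])
rejected = validate_command(["robot", "robot", "paper"])
```

Here `valid` holds the action pair and `rejected` is `None`, mirroring the invalid-command case discussed for Figure 8.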

CONCLUSION
The use of convolutional networks for voice command recognition in mobile robotics offers a natural means of human-machine interaction. It is concluded that MFCC-based feature extraction generates a map of features recognizable by the network, resulting in a functional voice assistant for robotic command. It was found that speech recognition accuracy depends on low ambient noise and on adequate vocalization of each word. The influence of these factors decreases when the database is enlarged with background noise and with variations in the speed and volume of pronunciation. It was also concluded from the training of the network that the use of few classes facilitates the discrimination of the spoken word. This suggests that future training can use identification trees, for example one network for identifying actions (verbs) and another for objects (nouns), to expand the number of commands received by the robot.