An automated speech recognition and feature selection approach based on improved Northern Goshawk optimization

ABSTRACT


INTRODUCTION
Speech recognition (SR) technologies are becoming increasingly powerful. Speech recognition is a method of detecting a person's uttered words using the data in the speech signal [1]. It processes the incoming audio input to make it compatible with various recognition software. Automated speech recognition (ASR) is the process of translating human voice into text using a computer [2]. The use of machine learning and pattern recognition in SR technologies has grown rapidly in recent years [3]. Speech recognition is a very difficult task for a computer system because people speak in a variety of ways, resulting in complicated speech signals that must be handled by automated systems. ASR technologies must be able to deal with these issues. There are numerous main areas of research in voice recognition aimed at improving spoken language frameworks [4], [5].
Current speech recognition frameworks rely on pattern matching. Hidden Markov models (HMMs) are typically used as the classifier. HMM-based voice recognition has progressed significantly, and it can currently achieve high recognition accuracy [6]. For acoustic signals, there exist a variety of parametric representations. Mel-frequency cepstral coefficients (MFCC) are the most often used among them. Several studies have reported work with MFCC, notably on improving recognition accuracy. MFCCs are a widely used technique for feature extraction.

ISSN: 2252-8938 An automated speech recognition and feature selection approach … (Santosh Kumar Suryakumar) 297
Strategies that take advantage of information in the periodicity of speech signals may be used to circumvent this issue, even though speech includes aperiodic content. It has been found that feature selection techniques and classification methods play a significant role in the detection of stuttering occurrences. We employed feature selection (FS) [7] instead of feature extraction to determine the best characteristics that impact the log files and to overcome the disadvantages of the supervised learning task. FS is a popular step in supervised classification for extracting useful subsets from large datasets. In a real-time setting, the raw log files may contain useless and superfluous values, which affect the predictive model's performance. There are three kinds of FS: filter [8], wrapper [9], and embedded [10]. In the filter approach, no learning algorithm is linked with the prediction model. Correlation [11], the Chi-square test [12], and analysis of variance (ANOVA) [13] are a few popular examples. Wrapper methods interact with the dataset's features and use a learning algorithm to discover the best outcome. Particle swarm optimization [14], genetic algorithm (GA) [15], and whale optimization [16] are well-known examples. Embedded methods are a hybridization of filter and wrapper methods. Popular embedded approaches in the literature include least absolute shrinkage and selection operator (LASSO) [17] and ridge regression [18]. In the proposed approach, however, we apply a metaheuristic (MH) optimization strategy to determine the best features.
Zhang et al. [20] detailed a specific model in this framework, support vector machines (SVM) [19], and showed how it may be applied to medium to large vocabulary voice recognition. The shape of the joint feature space is a critical part of SVMs. The features are obtained using context-dependent generative models, such as hidden Markov models. First, the retrieved features are a function of the utterance's segmentation. A new training procedure was suggested that allows universal Gaussian priors to be used in the large-margin criterion. Wu et al. built on prior work by using second-order cone programming to investigate a convex optimization approach. According to the experimental data, the second-order cone program (SOCP) approach greatly outperforms the gradient descent method in the test [21]. Hirayama et al. [22] devised a technique for identifying mixed-dialect utterances with multiple dialect language models. Maximization of recognition likelihoods and integration of recognition outcomes were their two key strategies [22]. Gharavian et al. explored the influence of MFCCs, energy, formant, and pitch-related variables on the performance of emotion identification systems. The normalised values of formants were employed as supplemental features to compensate for the influence of mood on the recognition rate [23].
The northern goshawk is a raptor with an efficient hunting strategy, and Northern Goshawk optimization (NGO) is a metaheuristic modelled on it [24]. The northern goshawk first selects and attacks its prey, and then chases the animal in a pursuit. The authors of [24] noted that, to their knowledge, no optimization approach based on the behaviour of northern goshawks had been developed, and they addressed this gap by developing an optimization approach based on mathematical modelling of the northern goshawk's hunting techniques. NGO is thus an algorithm that replicates northern goshawk hunting behaviour. It also achieves better results than regular MH algorithms in areas such as exploration, exploitation, escaping local optima, and avoiding premature convergence. Despite these advantages of NGO, finding the optimal subset of features in FS is challenging, especially with wrapper-based approaches, because a learning algorithm (such as a classifier) must be used to assess the chosen subset at each optimisation stage. Thus, in order to reduce the number of assessments, we provide a solution to the supervised learning FS problem based on a hybridization of NGO and the opposition based learning (OBL) concept [25].
The main contributions of this study are as follows: i) To test our supervised learning strategy, we employed three distinct datasets with a k-nearest neighbour classifier; ii) A hybrid OBL-NGO supervised learning-based metaheuristic approach is used to find the optimal results; and iii) The suggested supervised learning-based model is evaluated in terms of precision, recall, F1-score, classification accuracy, and convergence ability.

PRELIMINARIES

Initialization
The hunting style of the northern goshawk is separated into two stages: the first is a high-speed chase after spotting the prey, and the second is a brief tail chase after spotting the victim. NGO is a population-based algorithm in which northern goshawks are the searcher members. The population members are randomly initialised in the search space. In the NGO technique, the population matrix is defined using (1).
The values collected for the objective function may be expressed as a vector using (2).
where $V_i$ is the objective function (OF) value obtained by the ith candidate solution and V is the vector of obtained OF values. The value of the OF determines which candidate is the best. In minimization problems, the lower the OF value, the better the candidate solution, whereas in maximisation problems, the higher the OF value, the better the candidate solution.
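Since equations (1) and (2) did not survive in this copy, the following is only a minimal sketch of what they describe: a randomly initialised population matrix X (one row per northern goshawk) and the vector V of objective values. The function names, the bounds, and the `sphere` test objective are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def init_population(n_agents, n_dims, lower, upper, rng):
    """Randomly place n_agents search agents inside [lower, upper]^n_dims."""
    return lower + rng.random((n_agents, n_dims)) * (upper - lower)

def evaluate(population, objective):
    """Return the vector V holding one objective-function value per agent."""
    return np.array([objective(x) for x in population], dtype=float)

def sphere(x):
    # Hypothetical minimisation objective used only for illustration.
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
X = init_population(n_agents=5, n_dims=3, lower=-1.0, upper=1.0, rng=rng)
V = evaluate(X, sphere)
best = X[np.argmin(V)]  # for minimisation, the lowest OF value is the best
```

For a maximisation problem the last line would use `np.argmax` instead.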

Exploration
During the first phase of hunting, the northern goshawk chooses a prey at random and attacks it quickly. The random selection of prey in the search space improves the exploration capacity of NGO. Equations (3)-(5) are used to model the concepts of the first phase mathematically.

$$P_i = X_k \tag{3}$$

where $P_i$ is the ith northern goshawk's prey position, $V_i$ is the objective function value, and k is a random value [0-1].
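Equations (4) and (5) are missing from this copy, so the sketch below follows the exploration update as it is usually stated for NGO in [24]; treat it as an assumption rather than the authors' exact implementation. In that formulation the prey is another population member chosen by a random index k (k ≠ i), the agent moves toward the prey if the prey has a better objective value and away from it otherwise, and the move is kept only if it improves the objective. The names `exploration_step` and `sphere` are hypothetical.

```python
import numpy as np

def sphere(x):
    # Hypothetical minimisation objective used only for illustration.
    return float(np.sum(x ** 2))

def exploration_step(X, V, objective, rng):
    """One pass of the prey-identification (exploration) phase over all agents."""
    X, V = X.astype(float).copy(), V.astype(float).copy()
    n_agents, n_dims = X.shape
    for i in range(n_agents):
        # Prey selection, as in (3): another agent chosen at random (k != i).
        k = rng.choice([j for j in range(n_agents) if j != i])
        prey, prey_val = X[k], V[k]
        r = rng.random(n_dims)       # random weights in [0, 1)
        I = rng.integers(1, 3)       # random integer, either 1 or 2
        if prey_val < V[i]:
            candidate = X[i] + r * (prey - I * X[i])   # move toward better prey
        else:
            candidate = X[i] + r * (X[i] - prey)       # move away from worse prey
        cand_val = objective(candidate)
        if cand_val < V[i]:          # greedy acceptance (minimisation)
            X[i], V[i] = candidate, cand_val
    return X, V

rng = np.random.default_rng(1)
X0 = rng.uniform(-1.0, 1.0, size=(6, 3))
V0 = np.array([sphere(x) for x in X0])
X1, V1 = exploration_step(X0, V0, sphere, rng)
```

Because of the greedy acceptance, no agent's objective value gets worse after a step.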

Exploitation
After being attacked by a northern goshawk, the prey tries to escape. As a result, the northern goshawk maintains a tail-and-chase pattern when pursuing prey. Simulating this behaviour increases the algorithm's exploitation potential for local search of the search space. Equations (6)-(8) are used to represent the concepts of the second phase mathematically.
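Equations (6)-(8) are not reproduced in this copy. As a hedged illustration, the sketch below uses the exploitation update commonly given for NGO in [24]: each agent is perturbed within a radius R that shrinks as iteration t approaches the maximum T, and the move is accepted only if it improves the objective. The constant 0.02 and the names `exploitation_step` and `sphere` are assumptions for illustration.

```python
import numpy as np

def sphere(x):
    # Hypothetical minimisation objective used only for illustration.
    return float(np.sum(x ** 2))

def exploitation_step(X, V, objective, t, T, rng):
    """One pass of the chase (exploitation) phase at iteration t of T."""
    X, V = X.astype(float).copy(), V.astype(float).copy()
    R = 0.02 * (1.0 - t / T)         # attack radius shrinks over the iterations
    for i in range(X.shape[0]):
        r = rng.random(X.shape[1])
        candidate = X[i] + R * (2.0 * r - 1.0) * X[i]  # small local move
        cand_val = objective(candidate)
        if cand_val < V[i]:          # greedy acceptance (minimisation)
            X[i], V[i] = candidate, cand_val
    return X, V

rng = np.random.default_rng(2)
X0 = rng.uniform(-1.0, 1.0, size=(6, 3))
V0 = np.array([sphere(x) for x in X0])
X1, V1 = exploitation_step(X0, V0, sphere, t=1, T=100, rng=rng)
```

The shrinking radius means late iterations only fine-tune positions, which is what gives this phase its local-search character.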

Opposition based learning
MH algorithms use an arbitrary amount of prior knowledge to generate the initial population of random search agents. The approach iteratively modifies the location of the random search agents to find the optimal solution to the optimisation problem at hand. As the optimisation process only goes in one direction, the resulting solution may not be optimal, and the approach may become trapped in local optima. Tizhoosh et al. [25] devised the OBL approach in 2005 to address these concerns, which substantially increases the convergence ability of MH algorithms. The opposite search agent is calculated by (9).
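Equation (9) is missing here; in the standard OBL definition the opposite of a point x inside the bounds [lb, ub] is lb + ub - x. The sketch below shows that definition together with one common way of using it, namely keeping the fittest half of the union of the population and its opposites. The selection rule and all names are assumptions for illustration, not necessarily the authors' exact procedure.

```python
import numpy as np

def sphere(x):
    # Hypothetical minimisation objective used only for illustration.
    return float(np.sum(x ** 2))

def opposite(X, lower, upper):
    """Opposite positions under the standard OBL rule: lb + ub - x."""
    return lower + upper - X

def obl_select(X, objective, lower, upper):
    """Keep the N fittest agents from the population and its opposites."""
    both = np.vstack([X, opposite(X, lower, upper)])
    vals = np.array([objective(x) for x in both])
    keep = np.argsort(vals)[: X.shape[0]]   # N lowest OF values (minimisation)
    return both[keep], vals[keep]

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 4.0, size=(5, 2))
X_sel, V_sel = obl_select(X, sphere, lower=0.0, upper=4.0)
```

Evaluating both a point and its opposite costs one extra objective call per agent, but it lets the search look in both directions at once.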

IMPLEMENTATION OF PROPOSED METHODOLOGY
The goal of this study is to demonstrate an efficient recognizer that uses voice recognition techniques to create a human-machine interaction. The phases of the suggested approach are pre-processing, feature extraction (FE), and optimal FS for recognition. Pre-processing is carried out to improve the efficiency of the feature extraction procedure. FE extracts the characteristics of a spoken stream. A feature selection technique is then built in order to pick the best subset of the extracted characteristics. Finally, these selected features are used for text recognition. Figure 1 depicts the process flow of the suggested automatic speech recognition system.

Pre-processing
The noise in the incoming voice signals affects the recognition process. As a result, the first and most important stage in SR is the pre-processing of speech data, which is done to eliminate undesired waveforms and speed up the recognition process. In this stage, Gaussian filtering, which is related to the spectral subtraction approach, is used to remove noise: it reduces noise by estimating the noiseless signal and comparing it with the original signal. The mathematical representation of the Gaussian filter is given in (10).
The signal is then passed through a high-pass finite impulse response (FIR) filter with unique component values. Pre-processing is completed by detecting the signal's end points and removing the silence. The result of the pre-processing stage is a noise-free signal, generated without sacrificing the original data, that is used in the feature extraction procedure.
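The three pre-processing steps described above (Gaussian smoothing, a high-pass FIR filter, and end-point detection with silence removal) can be sketched as follows. Equation (10) and the filter coefficients are not given in this copy, so the Gaussian kernel, the first-order filter with coefficient 0.97, the frame length, and the energy threshold are all illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian window normalised to unit sum."""
    t = np.arange(-radius, radius + 1)
    k = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_smooth(signal, sigma=1.0):
    """Suppress noise by convolving the signal with a Gaussian kernel."""
    radius = max(1, int(round(3 * sigma)))
    return np.convolve(signal, gaussian_kernel(sigma, radius), mode="same")

def high_pass_fir(signal, alpha=0.97):
    """First-order high-pass FIR filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def trim_silence(signal, frame_len=400, threshold_ratio=0.1):
    """Drop leading and trailing frames whose energy is below a threshold."""
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    active = np.where(energy > threshold_ratio * energy.max())[0]
    if active.size == 0:
        return signal[:0]
    return signal[active[0] * frame_len : (active[-1] + 1) * frame_len]

# Toy signal: silence, a burst of "speech", then silence again.
raw = np.concatenate([np.zeros(800), np.ones(400), np.zeros(800)])
clean = trim_silence(high_pass_fir(gaussian_smooth(raw)))
```

The order used here (smooth, filter, trim) mirrors the order in which the text introduces the steps.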

Feature extraction
The speech signal is analysed as part of the feature extraction (FE) process. Feature extraction is the translation of the input data into a set of features, and a spectrum analysis approach is used to extract the voice signal features. The following aspects are evaluated in this study: comparison of the standard and normal voice signals; peak frequency modulation; MFCC; tri-spectral features; and discrete wavelet transform (DWT). In FE, normal speech signals are compared with the standard speech signal. For the individual speech signal, a frequency threshold range is subsequently determined using (11). After feature extraction, the input signal is compared with the standard signal, and the feature with the most influence on the output signals is determined using the mathematical function in (12). Boosting the amplitude of certain frequencies relative to the magnitude of other frequencies is known as pre-emphasis. This can be determined using (13).

$$C(k) = S(s) - P(s) \tag{11}$$

The goal of this section is to use tri-spectral analysis to classify speech signals. In most cases, the voice signals are captured first, and then the tri-spectrum properties are analysed. Tri-spectrum statistics are used to identify output speech signals that are not closely related to the resulting input speech signal. This can be determined using (14). The frequency response for the input signal is then calculated, and the complex integration of the frequency response is given in (15)-(16).
In the DWT, each successive level represents a lower frequency band with a coarser resolution, whereas the earlier levels capture the higher frequency bands. Continuous and discrete transforms are the two primary types of wavelet transform. The DWT repeatedly applies a two-channel filter bank (with downsampling) to the low-pass band.
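The repeated two-channel filter bank described for the DWT can be illustrated with the Haar wavelet, the simplest choice. The paper does not state which wavelet was used, so Haar and the function names below are assumptions made for the sake of a short, runnable sketch.

```python
import numpy as np

def haar_dwt_level(x):
    """One filter-bank stage: low-pass and high-pass outputs, downsampled by 2."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                       # pad odd-length input by repeating the last sample
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # low-pass band
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-pass band
    return approx, detail

def haar_dwt(x, levels):
    """Apply the stage repeatedly to the low-pass band only."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        approx, d = haar_dwt_level(approx)
        details.append(d)                # details[0] is the highest frequency band
    return approx, details

signal = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0])
approx, details = haar_dwt(signal, levels=3)
```

Each level halves the number of coefficients, which is the coarser resolution at lower frequency bands mentioned in the text.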

Feature selection
From these extracted features, we must choose the best set of features for voice signal identification. Since this is a supervised classification task, a k-nearest neighbour (KNN) classifier is used to assess the predictive model. The reduced feature set is divided into training and testing partitions using the 10-fold cross-validation (CV) method. Using 10-fold CV lowers the chance of overfitting the prediction model. The solution representation and the evaluation function are two critical parts of the optimization problem that must be addressed while constructing any optimizer. To enhance the efficacy of the supervised learning optimization method, the OBL technique in NGO searches for an optimal solution for the OF in both directions at the same time. The new model gives much better outcomes in overcoming the problems associated with traditional NGO. The fitness function of the suggested supervised learning algorithm is constructed using (17). Subsequently, for the provided input signal, the system generates recognised text. The proposed approach is shown in Figure 1.
We considered three datasets in this proposed method: LibriSpeech, CHiME-5, and AISHELL-1. The first corpus is a library of audiobooks from the LibriVox project that totals over 1,000 hours. An overview of the datasets is presented in Tables 1-3.
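Equation (17) is not reproduced in this copy. Wrapper-based FS methods commonly use a fitness of the form alpha * error + (1 - alpha) * (selected features / total features), and the sketch below assumes that form with a small NumPy KNN and 10-fold CV. The weight alpha = 0.99, the 0.5 binarisation threshold, and all function names are assumptions, not the authors' stated settings.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    # Labels must be non-negative integers for np.bincount.
    return np.array([np.bincount(v).argmax() for v in y_train[nearest]])

def cv_error(X, y, k=5, folds=10, seed=0):
    """Mean KNN classification error over a shuffled k-fold split."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for test_idx in np.array_split(idx, folds):
        train_idx = np.setdiff1d(idx, test_idx)
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        errors.append(np.mean(pred != y[test_idx]))
    return float(np.mean(errors))

def fitness(position, X, y, alpha=0.99, threshold=0.5):
    """Lower is better: weighted sum of CV error and fraction of features kept."""
    mask = np.asarray(position) > threshold   # continuous position -> feature mask
    if not mask.any():
        return 1.0                            # worst score for an empty subset
    return alpha * cv_error(X[:, mask], y) + (1 - alpha) * mask.sum() / mask.size

# Toy data: feature 0 separates the classes, features 1-3 are pure noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = np.column_stack([10.0 * y + 0.1 * rng.standard_normal(60),
                     rng.standard_normal((60, 3))])
good = fitness([1, 0, 0, 0], X, y)
bad = fitness([0, 1, 1, 1], X, y)
```

In this toy setting the informative single-feature subset scores better (lower) than the three noise features, which is the behaviour the optimizer exploits.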

Discussion
For each dataset, the suggested model is run in the Python environment for 100 epochs. Figure 2 (a: LibriSpeech corpus, b: CHiME-5, c: AISHELL-1) depicts the prediction model's error rate throughout the procedure. The decrease in error rate as the model goes through each epoch shows the recommended model's ability to converge to global minima. A classifier, KNN, is used to evaluate the proposed approach's predictive capacity. After 25 epochs, the AISHELL-1 dataset begins to converge, whereas the other two datasets begin to converge after 35 epochs. As a result, the average convergence ability of the model is 20 epochs. Furthermore, the performance on the LibriSpeech dataset is not significantly different from that on the other datasets, which indicates that the model is not overfitted.

Figure 3 (a: LibriSpeech corpus, b: CHiME-5, c: AISHELL-1) depicts the KNN classifier's prediction capabilities. The recommended technique provides accuracy values in the ranges [0.68, 0.95] during training and [0.55, 0.82] during testing. The recommended model has a stronger tendency to improve accuracy in the early stages. The difference between training and testing accuracy should be narrowed. The gap in the AISHELL-1 dataset is relatively large when compared to the other datasets [range: 0.55 to 0.70].

Figure 4 (a: LibriSpeech corpus, b: CHiME-5, c: AISHELL-1) depicts the receiver operating characteristic (ROC) curve for the recommended model. The higher ROC value for the datasets indicates that, on average, the recommended approach assigns a larger probability to a randomly selected genuine positive sample than to a negative sample. The feature selection is verified using the KNN classifier. The larger area under the ROC curve (ROC AUC) of the recommended model shows that the features it picked can provide significant confidence in knowledge discovery and decision making. Furthermore, a smaller set of well-selected features can yield more accurate findings than the entire set of features in the input data. Table 4 summarises and compares the performance assessments of the prediction models for the specified feature subsets in order to validate the feature subsets. The table shows that the suggested model had the highest classification accuracy across all datasets.

… commonly utilised in the literature in conjunction with typical ML classifiers. However, in order to perform speech recognition, the suggested model is employed to pick the ideal features using an OBL-NGO optimization strategy. Since this is a supervised learning classification task, the proposed model's performance is evaluated on three benchmark datasets in terms of convergence ability, training and testing accuracy, precision, recall, and F1-score. According to the results, except for the CHiME-5 dataset, the remaining datasets excel in all assessment measures. The proposed approach may be implemented using a variety of ML classifiers, and a deep learning environment will be used in the future.
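The accuracy, precision, recall, and F1-score used in the evaluation follow their usual definitions. The sketch below computes them for a binary labelling; the function name and the toy labels are illustrative, and a multi-class evaluation would need an averaging rule that the paper does not specify here.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for one positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = float(np.mean(y_true == y_pred))
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 2 true positives, 1 false positive, 1 false negative, 1 true negative.
scores = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Comparing these scores on the training and testing folds gives the training-testing gap discussed for Figure 3.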

Table 1. Overview of LibriSpeech corpus dataset

Table 4. Validation of selected features