Enhanced multi-ethnic speech recognition using pitch shifting generative adversarial networks

Kristiawan Nugroho, Kristophorus Hadiono, Felix Sutanto, Dhendra Marutho, Omar Farooq


Speech recognition remains a challenging research area, and various approaches have been applied to build robust models. A common problem is overfitting, especially when there is insufficient data to train the model; a sufficiently large dataset generally allows the model to be trained well and to reach high accuracy. Data augmentation is often used to increase the size of a dataset. This research applies one such technique, pitch shifting, to enlarge the speech dataset, which is then converted into spectrograms and classified using a generative adversarial network (GAN). The resulting pitch shifting-generative adversarial network (PS-GAN) model achieves an accuracy of 98.43% in multi-ethnic speech recognition, higher than that of several similar studies.
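
The pitch-shifting augmentation step can be illustrated with a minimal sketch. The abstract does not state which implementation or semitone offsets were used, so everything here is an assumption made for illustration: the function names (`sine_wave`, `pitch_shift_by_resampling`, `augment`), the 16 kHz sample rate, the semitone steps, and the choice of plain resampling with linear interpolation. Resampling is the simplest way to shift pitch, but it also changes the clip duration; audio libraries such as librosa (`librosa.effects.pitch_shift`) provide pitch shifting that preserves duration.

```python
import math


def sine_wave(freq_hz, duration_s, sample_rate=16000):
    """Generate a test tone as a list of floats (stand-in for a speech clip)."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def pitch_shift_by_resampling(samples, n_semitones):
    """Shift pitch by reading the signal at a different rate.

    A shift of n semitones scales every frequency by 2 ** (n / 12).
    Note: this simple method also scales the duration by 1 / factor.
    """
    factor = 2.0 ** (n_semitones / 12.0)
    out_len = int(len(samples) / factor)
    shifted = []
    for i in range(out_len):
        pos = i * factor                      # fractional read position
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Linear interpolation between the two neighbouring samples.
        shifted.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return shifted


def augment(samples, semitone_steps=(-2, -1, 1, 2)):
    """Return the original clip plus one pitch-shifted copy per step."""
    return [samples] + [pitch_shift_by_resampling(samples, s) for s in semitone_steps]


if __name__ == "__main__":
    tone = sine_wave(220.0, 0.5)              # hypothetical input clip
    augmented = augment(tone)
    print(len(augmented), [len(clip) for clip in augmented])
```

Under these assumptions, each input clip yields five training examples (the original and four shifted copies), each of which would then be converted to a spectrogram before classification.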


Data augmentation; Generative adversarial network; Multi-ethnic; Pitch shifting; Speech recognition

DOI: http://doi.org/10.11591/ijai.v13.i3.pp2904-2911


