Detecting cyberbullying text using the approaches with machine learning models for the low-resource Bengali language

The rising use of social media and advances in communication technologies have led to a considerable increase in cyberbullying events, in which people are intimidated, harassed, and humiliated via digital messaging. Several studies have been undertaken to identify cyberbullying texts in English and other resource-rich languages, but relatively few have been conducted in low-resource languages like Bengali. This research focuses on Bengali text to detect cyberbullying material by experimenting with pre-processing, feature selection, and three types of machine learning (ML) models: classical ML, deep learning (DL), and transformer learning. In classical ML, four models are used: support vector machine (SVM), multinomial Naive Bayes (MNB), random forest (RF), and logistic regression (LR). In DL, three models are employed: long short-term memory (LSTM), bidirectional LSTM (BiLSTM), and convolutional neural network with bidirectional LSTM (CNN-BiLSTM). As the transformer-based pre-trained model, bidirectional encoder representations from transformers (BERT) is utilized. Using our proposed pre-processing tasks, the MNB-based approach achieves the best accuracy of 78.815% among the classical ML models, the LSTM-based approach gains the highest accuracy of 77.804% among the DL models, and the BERT-based approach outperforms both with 80.165% accuracy. This is an open access article under the CC BY-SA license.


Int J Artif Intell ISSN: 2252-8938
According to a report by NapoleonCat, in 2021 almost 47 million Bangladeshi users interacted with Facebook, and about 68.8% of these users were male. As per a Global Digital Statshot survey, Dhaka, the capital of Bangladesh, ranked second among all cities globally in 2017 in terms of the number of active users. The Daily Star, a leading newspaper in the country, reported in 2020 that cyberbullying is highly prevalent in this region, with 80% of the victims between the ages of 14 and 22 being female.
Although cyberbullying is a rising issue in Bangladesh, few research articles have been published on detecting Bengali bully text. Identifying the cyberbullying class is not an easy task, since extracting the intention and context of a particular text remains challenging for researchers. Furthermore, very few studies focus on a detailed analysis of text pre-processing and feature selection from three perspectives of machine learning (ML) models: classical ML, deep learning (DL), and transformer learning [2]. Moreover, most of these studies do not use a cross-validation technique to develop a more generic detection model. In addition, we have found little analysis of hyper-parameter tuning for DL models. In this research, we address these issues in order to build a strong cyberbullying text detection system for the Bengali language.
Several studies have investigated bullying materials in the Bengali language. Ahmed et al. [3] successfully identified bullying texts in social media. The authors used a unique feature, the pointwise mutual information-semantic orientation (PMI-SO) score, alongside other features such as the number of likes and cyberbullying words. They applied the extreme gradient boosting (XGBoost) classifier and obtained 93% accuracy. However, they did not apply cross-validation or hyper-parameter tuning, nor did they try DL or transformer-based models. Ahmed et al. [4] built three cyberbullying detection systems for three types of Bengali datasets: single-coded Bengali, Romanized Bengali, and a mix of single-coded and Romanized Bengali, which is a unique contribution to cyberbullying detection. For single-coded Bengali, they proposed a convolutional neural network (CNN) classifier (with an accuracy of 84%); for the other two dataset types, multinomial Naive Bayes (MNB) showed superior performance: 84% for Romanized Bengali and 80% for the mixed dataset. They did not apply transformer-based pre-trained models, cross-validation, or hyper-parameter optimization. Das et al. [5] worked on a hateful-text dataset and handled both binary and multi-class classification problems. They created the "Bangla Emot" module for processing emojis and emoticons and proposed an attention-based decoder model that demonstrated the best performance, with an accuracy of 88% in binary classification and 77% in multi-class classification. Cross-validation and hyper-parameter tuning were not taken into account. Romim et al. [6] conducted an experiment on hateful text. For feature extraction, they applied the fastText, BengFastText, and word2vec word-embedding techniques. The authors proposed a support vector machine (SVM) model that achieved the highest accuracy of 87.5% and the highest F1-score of 0.911. However, hyper-parameter tuning and cross-validation were not considered, and they did not experiment with transformer-based models for detecting hateful text. Islam et al. [7] utilized an SVM model with the term frequency-inverse document frequency (TF-IDF) feature extraction technique and achieved 88% accuracy on an abusive-text dataset. The authors did not perform cross-validation or hyper-parameter optimization, and the research did not consider DL or transformer learning methods for identifying abusive text. Kumar et al. [8] employed two shared-task datasets, "HASOC" [9] and "TRAC-2" [10], to identify offensive and aggressive text in Bengali, Hindi, and English. The principal success of this research is that the authors achieved the best result (0.93 F1-score) using a mix of character trigrams and word unigrams with an SVM classifier for Bengali text. Although they applied 5-fold cross-validation, hyper-parameter optimization was skipped. Chakraborty and Seddiqui [11] investigated abusive and threatening texts and developed a detection system using a linear SVM model with 78% accuracy, which is the main success of that study. The authors did not use cross-validation to generalize the model, did not tune hyper-parameters during model execution, and did not choose transformer learning algorithms during model implementation.
Our goal in this study is to address the concerns raised by previous studies in this area. With this in mind, we conduct experiments on three levels: pre-processing tasks, feature selection tasks, and implementation of cutting-edge ML models. Ultimately, we contribute the following to the establishment of a highly effective cyberbullying detection system:
− Pre-processing level: incorporate additional pre-processing tasks, such as removing thin-space Unicode characters and replacing emoticons and emojis with Bengali words, alongside the general pre-processing tasks.
Detecting cyberbullying text using the approaches with ... (Md. Nesarul Hoque)
− Feature selection level: extract features separately for the classical ML, DL, and transformer learning classifiers.
− Cross-validation technique: perform 10-fold cross-validation for the classical ML models to obtain a more reliable detection system.
− Hyper-parameter tuning: tune hyper-parameters using the Bayesian optimization algorithm during the implementation of the DL models.
− Selection of best models: propose the best detection approach from each of the three types of ML model.

METHOD
To develop a strong detection system, we first collect the dataset. Then, we apply various text pre-processing tasks. After that, we extract features from the pre-processed data. Finally, we apply different types of classifier models with 10-fold cross-validation and hyper-parameter tuning using the Bayesian optimization technique. The overall working process is depicted in Figure 1. The following sub-sections explain each step.

Dataset description
In this investigation, the same dataset was utilized as in [11]. The dataset contains 5,644 code-mixed Bengali-English comments and posts collected from various popular types of Facebook pages, such as celebrity, newspaper, and entertainment pages. Almost 50% of the data are tagged as bullying and the remainder as non-bullying. Table 1 presents some statistical information about the dataset, including the number of entries (quantity) and the minimum (minWords), maximum (maxWords), and average (avgWords) number of words in a text.

Data pre-processing
The collected data contain unwanted and noisy content such as thin-space Unicode characters, links, and punctuation. We have divided the pre-processing into two groups: general pre-processing tasks, named "GPT", and additional pre-processing tasks, called "APT". In the "GPT", we remove HTML tags, links, URLs, special characters, punctuation, digits, and stop-words, and apply tokenization, lower-casing (for English words), and stemming [12]. The "APT" comprises the following four pre-processing tasks:
− Removing the thin-space Unicode character: our dataset contains the "ZERO WIDTH NON-JOINER" (U+200C) Unicode character, which creates garbage tokens when we use regular expressions during tokenization. We discard this character from the dataset.
− Fixing space errors around delimiters and sentence-ending punctuation: the dataset contains many space-related errors before and after delimiters such as the comma (",") and semi-colon (";") and sentence-ending punctuation such as the full stop (".") or ("।") and the question mark ("?"). Correcting these errors reduces the number of unwanted tokens.
− Replacing emoticons (ASCII characters) with Bengali words: we have developed an emoticon dictionary containing about 400 common emoticons and their equivalent Bengali meanings, grouping similar emoticons under the same Bengali word. For example, the emoticons ":)", ":=)", and ":|)" all convey the same Bengali meaning, "খুশি" (smile). All emoticons in the dataset are substituted with the corresponding Bengali words, which may help in better understanding the emotion [13] of a particular text.
− Replacing emojis (Unicode characters) with Bengali words: likewise, we replace every emoji with its equivalent Bengali words, using an emoji dictionary of about 280 commonly used emojis with their Bengali meanings, again grouping similar emojis under the same Bengali word.
To the best of our knowledge, the above four pre-processing tasks differ slightly from those of existing studies. We experiment on the "APT" in section 3 and observe the impact of these pre-processing tasks on the detection system.
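The four "APT" operations can be sketched in a few lines of Python. The three-entry emoticon dictionary below is a placeholder for the paper's ~400-entry resource, and the exact regular expressions are illustrative assumptions rather than the authors' implementation:

```python
import re

# Placeholder emoticon dictionary; the paper's resource maps ~400 emoticons
# to Bengali words and groups similar emoticons under the same word.
EMOTICON_MAP = {
    ":)": "খুশি",   # smile
    ":=)": "খুশি",
    ":|)": "খুশি",
}

def apply_apt(text: str) -> str:
    """Sketch of the four additional pre-processing tasks (APT)."""
    # 1. Remove the ZERO WIDTH NON-JOINER (U+200C) character.
    text = text.replace("\u200c", "")
    # 2. Fix spacing around delimiters and sentence-ending punctuation:
    #    no space before the mark, exactly one space after it.
    text = re.sub(r"\s*([,;.?।])\s*", r"\1 ", text).strip()
    # 3. Replace emoticons with their Bengali equivalents.
    for emoticon, bengali in EMOTICON_MAP.items():
        text = text.replace(emoticon, bengali)
    # 4. Emoji replacement would follow the same dictionary-lookup pattern.
    return text

print(apply_apt("ভালো\u200c লাগে ,তাই :)"))
```

Running the emoticon substitution after the spacing fix avoids the punctuation rule accidentally splitting an emoticon that contains a delimiter character.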

Feature selection
A feature selection operation has been introduced to extract significant features that have a substantial impact on cyberbullying text detection. We approach feature selection from three perspectives: features for traditional ML models, features for DL models, and features for the transformer-based pre-trained model. For traditional ML models, we utilize N-grams at the character and word level and assign a TF-IDF value to each feature. Through empirical research, we determined that N should be between 3 and 5 for character N-grams and between 1 and 2 for word N-grams. For DL models, we use three word-embedding models: word2vec, fastText, and global vectors (GloVe), which capture the syntactic and semantic word similarity and word analogy of a text. As the transformer-based ML technique, we utilize the cutting-edge bidirectional encoder representations from transformers (BERT) pre-trained model. During feature extraction, BERT employs its own tokenizer, the BERT tokenizer, and turns each word into three embedding vectors: token, segment, and position. The final embedding of a particular word is computed by adding these three vectors [14].

Model classifiers
This research experiments with eight ML models: SVM, MNB, random forest (RF), logistic regression (LR), long short-term memory (LSTM), bidirectional LSTM (BiLSTM), CNN-BiLSTM, and BERT, all of which have given good results in several Bengali text-classification studies. Among the classical ML models, SVM performed best in [3], [7], [8], [11], whereas MNB showed better results in [4], [15]; RF and LR presented the highest accuracy in [16] and [17], respectively. Among the DL models, LSTM, BiLSTM, and CNN-BiLSTM presented the best performance in [18], [19]. Very recently, BERT has delivered outstanding results in text classification tasks [20]. These eight models are described below.
Support vector machine (SVM): in SVM, a hyper-plane is generated by maximizing the gap between the margins of two classes (e.g., positive and negative) of support vectors. This marginal distance yields a more generalized model. In addition, SVM handles errors by employing a regularization technique based on the soft-margin hyper-plane over incorrectly categorized data points [21].
Multinomial Naive Bayes (MNB): the main working principle of MNB is based on the Bayes theorem [22]. Using this theorem and the formulas for prior and conditional probability, each class variable and each feature variable are computed during training. These prior and conditional probabilities are then applied to the test dataset: based on the features of each test data point, MNB assigns a probability to each class and selects the class with the highest probability. The probability P(C|F) is calculated using the Bayes theorem as (1):

P(C|F) = P(F|C) P(C) / P(F)   (1)

where C is a possible class variable and F represents the unique feature.
Random forest (RF): the RF consists of a number of decision trees that are computed separately [23]. The information gain for each tree is obtained through the "Gini impurity" formula:

Gini = 1 − Σ_{i=1}^{C} P_i²   (2)

where C is the total number of possible classes and P_i is the probability of the i-th class.
Logistic regression (LR): the working principle of this model is based on two functions, the sigmoid (3) and the cost (4), given below:

h_θ(x) = 1 / (1 + e^{−β^T x})   (3)

J(θ) = −(1/k) Σ_{i=1}^{k} [ y_i log(h_θ(x_i)) + (1 − y_i) log(1 − h_θ(x_i)) ]   (4)

where β^T indicates the transpose of the regression coefficient matrix (β), k represents the total number of training observations, h_θ(x_i) is the hypothesis function of the i-th training observation, and y_i is the actual output of the i-th training sample.
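A small numeric illustration of the Bayes rule in equation (1), with made-up prior and conditional probabilities, shows how MNB scores a class from a single feature:

```python
# Toy illustration of equation (1): the posterior probability of the
# "bully" class given one feature word. All probabilities are made up.
p_bully, p_non_bully = 0.5, 0.5   # class priors (balanced corpus)
p_f_given_bully = 0.08            # P(F | bully), hypothetical
p_f_given_non_bully = 0.02        # P(F | non-bully), hypothetical

# P(F) by the law of total probability, then Bayes' theorem:
p_f = p_f_given_bully * p_bully + p_f_given_non_bully * p_non_bully
p_bully_given_f = p_f_given_bully * p_bully / p_f

print(round(p_bully_given_f, 2))  # 0.8
```

Since the posterior for "bully" (0.8) exceeds that for "non-bully" (0.2), MNB would assign this data point to the bully class.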
Long short-term memory (LSTM): this model was developed to handle the problem of long-term dependency in sequential and time-series data [24]. Four interacting units work together in each LSTM cell: the memory cell unit, forget gate, input gate, and output gate. Information is added to or removed from the memory cell state. The forget gate decides which information from the previous cell state is discarded. In the input gate, significant new information is added to the memory cell state. Lastly, the output gate decides which information is passed to the next LSTM cell state.
Bidirectional LSTM (BiLSTM): the LSTM model struggles to predict a word that depends more on future context words. To solve this issue, BiLSTM runs one LSTM in the forward direction and another in the backward direction [25]. The output of a particular BiLSTM cell is computed by combining the outputs of both LSTM cells for the same time-step.
Convolutional neural network with bidirectional LSTM (CNN-BiLSTM): the CNN-BiLSTM is a hybrid model in which CNN and BiLSTM are executed sequentially. CNN operates with three fundamental layers: convolution, pooling, and fully connected. In the convolution layer, an element-wise matrix multiplication (the convolution operation) is performed between the input data and each kernel (feature detector), producing a feature map per kernel. The dimension of this feature map is reduced in the pooling layer. Finally, the reduced feature map is passed through the fully connected layer, which behaves like a standard artificial neural network (ANN). In the CNN-BiLSTM model, the BiLSTM part is added just before the fully connected layer. For text processing, CNN captures local dependencies between contiguous words of a sentence by extracting character-level features, while BiLSTM handles the overall dependency of a particular feature across the entire text [26].
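A minimal Keras sketch of such a CNN-BiLSTM stack might look as follows. The vocabulary size, sequence length, and layer widths are illustrative placeholders, not the tuned values from Table 2:

```python
from tensorflow.keras import layers, models

# Convolution and pooling extract local features; the BiLSTM sits just
# before the dense head, as described in the text.
model = models.Sequential([
    layers.Input(shape=(50,)),                 # padded sequences of length 50
    layers.Embedding(input_dim=5000, output_dim=64),
    layers.Conv1D(64, 3, activation="relu"),   # convolution layer
    layers.MaxPooling1D(2),                    # pooling layer
    layers.Bidirectional(layers.LSTM(32)),     # BiLSTM before the dense head
    layers.Dense(1, activation="sigmoid"),     # binary bully / non-bully output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

The sigmoid output gives the probability that a comment belongs to the bully class.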
Bidirectional encoder representations from transformers (BERT): BERT shows superior performance on context-specific tasks, including language inference and question answering, compared with other pre-trained models such as embeddings from language models (ELMo) and OpenAI generative pre-training (GPT). It works with two separate frameworks: pre-training and fine-tuning. In the pre-training framework, BERT obtains contextual understanding of the specified languages using two unsupervised tasks: the masked language model (MLM) and next sentence prediction (NSP). The fine-tuning framework is then applied to several downstream natural language processing (NLP) tasks, such as sentiment classification, sentence prediction, and named entity recognition, by simple adjustment of the task-specific architecture [14].

Experimental set-up
We explain the experimental procedures and model configuration from three points of view: classical ML models, DL models, and the transformer-based pre-trained model. We examine the possible combinations of pre-processing tasks, feature selection tasks, and classifier models, and configure the hyper-parameters during the implementation of the detection system.

Experimental set-up for classical ML models
To evaluate the influence of the "additional pre-processing tasks" (APT) described in section 2.2, we experiment from two points of view: "general pre-processing tasks" (GPT) with "APT" and "GPT" without "APT". In the feature selection stage, three approaches were established through empirical experiments: character N-grams (3 to 5) with TF-IDF, word N-grams (1 to 2) with TF-IDF, and combined character-word N-grams. Four ML classifiers are utilized: SVM, MNB, RF, and LR. Consequently, each classifier provides six possible outputs. During the experiments, the dataset is separated into training and testing data with proportions of 80% and 20%, respectively. To obtain a more reliable model, a 10-fold cross-validation procedure is employed [27]. Using the scikit-learn Python library, we mostly keep the default hyper-parameter values for each model; however, slight changes were made through empirical experimentation, such as setting the kernel to "linear" and gamma to "auto" for the SVM model, and alpha to 0.25 for the MNB model.
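The 10-fold cross-validation set-up with the tuned MNB (alpha = 0.25) can be sketched with scikit-learn as follows. The four-sentence toy corpus stands in for the real dataset, and the word 1-2-gram vectorizer is one of the three feature settings described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled corpus standing in for the real 5,644-comment dataset.
texts = ["ভালো মানুষ", "খারাপ লোক", "চমৎকার কাজ", "বাজে কথা"] * 10
labels = [0, 1, 0, 1] * 10  # 0 = non-bully, 1 = bully

# Word 1-2-gram TF-IDF features feeding MNB with the tuned alpha = 0.25.
pipeline = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),
    MultinomialNB(alpha=0.25),
)

# 10-fold cross-validation for a more reliable performance estimate.
scores = cross_val_score(
    pipeline, texts, labels,
    cv=KFold(n_splits=10, shuffle=True, random_state=42),
)
print(scores.mean())
```

Fitting the vectorizer inside the pipeline ensures each fold's TF-IDF vocabulary is built only from that fold's training data, avoiding leakage.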

Experimental set-up for DL models
During the implementation of the DL models, the same two variations of pre-processing tasks are chosen as for the classical ML models. Three feature selection techniques, word2vec, fastText, and GloVe, are applied to three DL models: LSTM, BiLSTM, and CNN-BiLSTM. Therefore, each DL model yields six possible results. We use the scikit-learn Python library and the KerasTuner framework to implement the models and tune their parameters. We experiment on thirteen hyper-parameters, whose value ranges, determined through empirical analysis, are specified in Table 2. We implement the LSTM model with two LSTM layers, the BiLSTM model with two BiLSTM layers, and the CNN-BiLSTM model with one BiLSTM layer. The dataset is split into training, validation, and testing data with a ratio of 70%, 15%, and 15%.
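The 70/15/15 split can be obtained with two successive calls to scikit-learn's `train_test_split`; the integer data below is a stand-in for the pre-processed texts:

```python
from sklearn.model_selection import train_test_split

data = list(range(100))           # stand-in for the pre-processed texts
labels = [i % 2 for i in data]    # stand-in binary bully / non-bully labels

# First split off 70% for training, then divide the remaining 30%
# evenly into validation and test sets (15% each).
X_train, X_rest, y_train, y_rest = train_test_split(
    data, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

Stratifying both splits keeps the bully/non-bully ratio consistent across the three partitions.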

Experimental set-up for transformer-based pre-trained model
As the transformer learning technique, we use the BERT-base multilingual pre-trained model, consisting of 12 hidden layers, hidden output dimensions of 768, and 12 self-attention heads. Through an empirical study, we selected four categories of pre-processing tasks: (i) "GPT" with "APT"; (ii) "GPT" with "APT" but without the stop-word removal and stemming operations; (iii) "GPT" without "APT"; and (iv) "GPT" without "APT", stop-word removal, and stemming. In BERT fine-tuning, the feature vectors come from BERT's own tokenizer and embedding technique, so we obtain four possible outputs from the BERT model.

RESULTS AND DISCUSSION
We illustrate the results of each ML-based approach and analyze their performance. The performance scores are determined using the precision, recall, F1-score, and accuracy metrics. First, we analyze the results from three angles: approaches with classical ML models, approaches with DL models, and the approach with the transformer-based pre-trained model. Finally, we summarize the overall results.

Results of approaches with classical ML models
The performance scores of the combinations of pre-processing, feature selection, and classical ML models (MNB, LR, SVM, and RF) are presented in Table 3. The MNB-based approach achieves the highest F1-score of 0.745 and the highest accuracy of 78.815%, surpassing the performance of the existing work [11]. In this approach, the "APT" is included with the "GPT" in the pre-processing stage, and the combined N-grams are utilized in the feature selection stage. Both the "APT" and the combined N-grams help in calculating the proper feature probability by (1), which distinguishes bully from non-bully text effectively. The MNB-based approach also gives good output for character N-grams, with 78.620% accuracy, and obtains the top recall score of 72.566% when word N-grams are used instead of combined N-grams. In the case of precision, the LR-based approach gives the maximum output of 82.499% and shows comparatively better accuracy than the other two classical ML models, SVM and RF. Here, the hypothesis function (3) of the LR model works effectively on the training observations as well as the test data for detecting bully text.

Results of approaches with DL models
The evaluation scores of the various combinations of pre-processing tasks, feature selection methods, and DL models (LSTM, BiLSTM, and CNN-BiLSTM) are presented in Table 4. The LSTM-based approach achieves the highest recall, F1-score, and accuracy values of 75.000%, 0.765, and 77.804%, respectively, outperforming the earlier work [11] on the same dataset. This approach uses the "APT" with "GPT" pre-processing and the fastText word-embedding technique for selecting feature vectors. Since the average number of words per text is small (21, as shown in Table 1), the unidirectional LSTM model performs better than the other two DL models (BiLSTM and CNN-BiLSTM). On the other hand, the CNN-BiLSTM-based approach presents the top precision score of 85.161%, using the same pre-processing and feature selection as the LSTM-based approach, and also gives slightly better accuracy than [11] (77.568%). Another noteworthy point is that the fastText embedding performs much better than the word2vec and GloVe embeddings in every case. The main reason is that fastText uses a character N-gram approach and generates sub-word-level features, which distinguishes it from the other two embedding methods; this mechanism also lets fastText handle words outside the dataset's vocabulary [28], [29].
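The sub-word mechanism behind fastText's robustness can be illustrated with a few lines of Python that generate boundary-marked character n-grams for a word, a simplified sketch of what the fastText library does internally:

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 5) -> list:
    """Character n-grams with fastText-style word-boundary markers."""
    token = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [token[i:i + n] for i in range(len(token) - n + 1)]
    return grams

# Even an unseen word decomposes into sub-words that may be shared with
# in-vocabulary words, which is why fastText copes with OOV tokens.
print(char_ngrams("bully", 3, 3))  # ['<bu', 'bul', 'ull', 'lly', 'ly>']
```

A fastText word vector is the sum of the vectors of these n-grams, so morphological variants of a word share most of their representation.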

Overall result analysis
We have extracted the best outputs in terms of F1-score and accuracy from each of the three types of ML models; the BERT-based approach gives outstanding results in detecting cyberbullying text, as shown in Table 6. We observe that our proposed pre-processing task, "APT", improves the overall performance of the detection system. Furthermore, in the feature selection stage, the combined character and word N-grams with TF-IDF scoring, the fastText embedding, and the BERT embedding are effective in the classical ML, DL, and BERT-based transformer models, respectively.

Table 6. Proposed approaches regarding F1-scores and accuracy values
Proposed approach | F1-score | Accuracy
"GPT" with "APT" + combined N-grams + MNB | 0.745 | 78.815%
"GPT" with "APT" + fastText + LSTM | 0.765 | 77.804%
"GPT" with "APT" (without stop-word removal and stemming) + BERT embedding + BERT-base-multilingual-uncased | 0.777 | 80.165%

CONCLUSION
In this study, we have experimented with pre-processing tasks, feature selection tasks, and the selection of classifier models from three points of view: classical ML, DL, and transformer learning, and proposed a concise recommendation in every aspect. The "APT" pre-processing tasks are introduced alongside the general pre-processing tasks. We have shown that the combined N-grams (character and word level) perform well for the classical ML algorithms, while fastText gives better results than the other two embedding techniques, word2vec and GloVe, for the DL algorithms. Finally, using "APT", BERT outperforms all other models with the best accuracy of 80.165%. In future research, we will apply our proposed approaches to other related datasets, analyze new pre-processing tasks, extract new features, and experiment with various ML approaches (classical ML, DL, and transformer learning) separately and in combined form.

Table 2. Hyper-parameter values during the implementation of DL models

Table 3. Performance evaluation of the methods with classical ML classifiers

Table 4. Performance evaluation of the methods with DL classifiers