Machine learning for text document classification-efficient classification approach

ABSTRACT


INTRODUCTION
One possible solution to the information resources problem is text document (TD) classification [1]. It is hard to cover all of the many algorithms in the field of text categorization. Recently, extensive research has been conducted in the field of financial sentiment analysis. Sentiment analysis (SA) of text data captures the feelings and attitudes of individuals toward particular topics or products. It applies statistical approaches together with artificial intelligence (AI) algorithms to extract substantial knowledge from a huge amount of data. This study extracts the sentiment polarity (negative, positive, and neutral) from financial textual data using machine learning and deep learning algorithms. The constructed machine learning model used multinomial naive Bayes (MNB) and logistic regression (LR) classifiers. In addition, three deep learning algorithms have been utilized: recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU) [2], [3]. Feature selection is a significant challenge in text categorization. During feature selection, we try to determine which features are most important to the categorization process, because some words are considerably more likely than others to be linked with the class distribution. As a result, the study proposes a wide range of strategies for determining the most significant characteristics for classification purposes. We also go over the text classification feature selection approaches that are widely used.
The text categorization process comprises preprocessing, feature extraction, feature selection, and categorization. In the feature extraction step, features are extracted from the text documents [4].
ISSN: 2252-8938, Int J Artif Intell, Vol. 13, No. 1, March 2024: 703-710
Each term (word) in a text document is considered a feature, and the majority of the features are undesirable and unnecessary. Tokenization, stop-word removal, and stemming are used during pre-processing to remove unnecessary and undesired features [5]. A representation model is used to represent the pre-processed text content in a machine-understandable structure. Then, given the representation model, the feature selection technique selects the most informative features [6]. Feature selection has a significant impact on classifier performance and is primarily used for dimensionality reduction [7], [8]. Finally, a classifier uses the selected feature subset to categorize the text documents. The high dimensionality of the feature space is what makes text categorization so difficult: classifier performance deteriorates and categorization takes longer [9], [10]. Because of its computational economy and high effectiveness, cosine similarity (CS) is commonly employed in text categorization. There are already classifiers that use CS, such as the centroid-based classifier [11], [12].
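The pre-processing steps named above (tokenization, stop-word removal, and stemming) can be sketched as follows. This is only a minimal illustration: the stop-word list and the suffix-stripping rule are hypothetical stand-ins, not the resources used in this study, and a real system would rely on a complete stop-word list and a proper stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenization
    kept = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [naive_stem(t) for t in kept]                # stemming


print(preprocess("The classifiers are categorizing documents"))
```

Each surviving token is a candidate feature, so even this crude filtering reduces the size of the feature space before a representation model is built.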
Cumuli geometric centroid (CGC), arithmetical average centroid (AAC), and class feature centroid (CFC) are examples of centroid-based classifiers (CBC), where "centroid" denotes the technique for creating the class prototype vector of a CBC (i.e., the initialization procedure). CGC uses the sum of each class's overall word counts, AAC uses the arithmetical average of each class's overall word counts, and CFC uses the inner-class and inter-class term indexes [11]. The weight model is a newer CBC model that focuses on adjusting the categorization hyperplane.
In addition to the classification of text documents, we present a CS technique in this paper. To partition the collection of words into equivalence classes, we calculate the degree of similarity and use a symmetric measure of mutual support between words. The main approach consists of four steps, which are illustrated with examples in the methodology. The remainder of the paper is organized as follows: i) the text classification process is discussed in section 2; ii) related work is summarized in section 3; iii) existing classification techniques are discussed in section 4; iv) section 5 introduces the proposed methodology; v) section 6 presents the results and discussion; and vi) section 7 concludes the paper.

DATA PROCESSING
The main purpose of text mining is to allow users to extract information from textual resources and to handle operations such as retrieval, classification, and clustering (supervised, unsupervised, and semi-supervised). The question is how these documents can be properly annotated, presented, categorized, and clustered. Figure 1 depicts the text classification process. The text classification problem is distinct in that the number of features (unique words or phrases) can easily exceed tens of thousands. This poses major hurdles for many complex learning algorithms when they are applied to text categorization. As a result, approaches for reducing dimensionality are required. The two alternatives are to select a subset of the original features or to transform the features into new ones computed as functions of the existing ones.
Although machine learning-based text categorization performs well, it is inefficient when dealing with big training datasets. As a result, in addition to feature selection, instance selection is frequently required. A combined feature and instance selection strategy for text classification consists of two phases [13]. In the first phase, the algorithm sequentially selects features with high precision in predicting the target class. All documents without at least one of these features are removed from the training set. In the second phase, the algorithm looks, inside this subset of the initial dataset, for a set of features that tend to predict the complement of the target class, and these features are also chosen. The new feature set is the combination of the features chosen in these two phases, whereas the training set is made up of the documents retained in the first phase. In this paper, the steps followed in the case study are based on the data mining methodology proposed by [14]. The steps include data selection, preprocessing, data transformation, data mining, and analysis. The TD classification process is as follows:
a) Using the position weight algorithm [12], generate keywords from text documents. Keywords are index terms that contain the most crucial information. Automatic keyword extraction is the task of automatically extracting a limited set of keywords, key phrases, or words from a document that can explain the significance of its content. All automatic processing of text resources relies on keyword extraction as a core technology. This study offers a survey of keyword extraction strategies that can be used to extract effective keywords that uniquely identify a document.
b) Using the CS technique, compare the input (keywords) to other texts (as a query or keyword) to identify the input's class.
c) Create class probabilities by using keywords.
d) Use text classification techniques to help organize information.
Three predictions emerge from stages 2, 3, and 4 (steps b to d). If the majority prediction is CLASS1, the system's output is CLASS1.
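The majority decision over the three predictions can be sketched as below. The function name and the tie-breaking rule (the earliest prediction among the tied labels wins) are assumptions made for illustration; the text does not state how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label in a non-empty list of predictions.

    Assumption: when several labels are tied, the one that appears
    first in the list is returned.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Predictions from the CS step, the class-probability step, and the classifier step.
print(majority_vote(["CLASS1", "CLASS2", "CLASS1"]))
```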
Regarding how to choose important words, the location of a word is essential in linguistics. The entropy of words varies with their position. Words carry additional information when they appear in the document's introduction and conclusion paragraphs, which are normally the first and last paragraphs. Furthermore, leading and summary sentences usually contain more important words than the rest of the paragraph. We employ a method called position weight (PW) to capture the relevance of a word's position.
A document is made up of paragraphs (the title is considered a special paragraph), a paragraph is made up of sentences, and a sentence is made up of words. A term's PW must take into account three key elements: paragraph, sentence, and word. The PW of a term t at a certain position i is defined as (1),

$$pw(t_i) = pw(t_i, p_j) \cdot pw(t_i, s_k) \cdot pw(t_i, w_r) \tag{1}$$

where $pw(t_i, p_j)$ is the PW of term t in paragraph j, $pw(t_i, s_k)$ is the PW of term t in sentence k, and $pw(t_i, w_r)$ is the PW of term t as word form r. In a document, the total weight of the term t is the sum of the weights of all positions in which it appears. The PW of a term t that appears m times in a document d is given by (2).

$$pw(t, d) = \sum_{i=1}^{m} pw(t_i) \tag{2}$$

The importance of keywords is higher than that of other terms; keywords may be used to characterize a document, which is why they can be used to distinguish between different document types. Assume that documents D1 and D2 fall under the "computer science" and "mathematics" categories, respectively. Although "theorem" cannot be deemed a keyword in either D1 or D2, the terms "approach" and "theorem" have greater weights in D1 and D2.
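The position weight of a term can be computed as in the following sketch, which multiplies the paragraph, sentence, and word-form weights of each occurrence as in (1) and sums over all occurrences in the document. All numeric weights and the position labels ("title", "leading", and so on) are hypothetical, since no concrete values are given here.

```python
# Hypothetical weights (assumptions for illustration only).
PARAGRAPH_WEIGHT = {"title": 3.0, "first": 2.0, "last": 2.0, "body": 1.0}
SENTENCE_WEIGHT = {"leading": 2.0, "summary": 2.0, "other": 1.0}
WORD_WEIGHT = {"emphasized": 2.0, "plain": 1.0}


def occurrence_pw(paragraph, sentence, word_form):
    """Equation (1): product of the paragraph, sentence, and word-form weights."""
    return (PARAGRAPH_WEIGHT[paragraph]
            * SENTENCE_WEIGHT[sentence]
            * WORD_WEIGHT[word_form])


def document_pw(occurrences):
    """Total PW of a term: sum of the PW of each of its occurrences."""
    return sum(occurrence_pw(*occ) for occ in occurrences)


# A term that appears in the title, in a leading sentence of the first
# paragraph, and once in the body of the document.
occurrences = [
    ("title", "other", "plain"),    # 3.0 * 1.0 * 1.0 = 3.0
    ("first", "leading", "plain"),  # 2.0 * 2.0 * 1.0 = 4.0
    ("body", "other", "plain"),     # 1.0 * 1.0 * 1.0 = 1.0
]
print(document_pw(occurrences))  # 8.0
```

Terms with the highest total weight are the natural keyword candidates for a document.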
As a property, each keyword WK is given a weight in proportion to D1, and the digits 0 and 1 are used to signify WK. For example, Table 1 shows the number of times each word appears in each document, as well as the document set D and the word set W that covers D. The set of keywords covers Di ∈ Ci, where Ci is any class. Suppose that the first, second, and third keyword sets are {W5, W6}, {W2, W3, W4, W5, W6}, and {W1, W3, W6}, respectively.

RELATED WORK
Text classification is the task of categorizing text documents into specified groups, and it has received a lot of interest in recent years as a result of the growth of digital documents. Approaches based on statistical theory or machine learning have become the mainstream way to improve text categorization, with data mining techniques such as K-means, EM, Apriori, SVM, C4.5, and PageRank being used. Classification and regression trees (CART), AdaBoost, K-nearest neighbor (KNN), and naive Bayes are popular among these algorithms because they combine high computational efficiency with excellent prediction performance.
Enhanced classifiers and conventional classifiers have been compared using the accuracy derived from confusion (or misclassification) matrices on five datasets (R8, 20NG, R52, Cade12, and WebKB) in [15], [16]. MNB's performance was improved by developing a fine-tuning process. A methodology that employs three metaheuristics (genetic algorithms, simulated annealing, and differential evolution) to convert an estimation problem into an optimization problem has been introduced in [17]. Another approach combines the results of two classifiers, MNB and a modified maximum entropy classifier (an adjusted form of the conventional maximum entropy classifier proposed by the same authors) [18]. CFC, AAC, and CGC are examples of centroid-based classifiers, where "centroid" refers to the CBC method for generating a class prototype vector (i.e., the initialization procedure). AAC uses the arithmetical average of all words in each class, CGC uses the total of all words in each class, and CFC uses the inner-class term index and the inter-class term index [19]. The weight model is a newer CBC method that focuses on fine-tuning the classification hyperplane.

CLASSIFICATION TECHNIQUES REVIEW
Text categorization was primarily employed in information retrieval systems in the early days of machine learning (ML) and artificial intelligence (AI). As technological breakthroughs have emerged, however, text classification and document categorization have become widely used in a variety of domains, including medicine, social sciences, healthcare, psychology, law, and engineering. Documents can be classified using unsupervised, supervised, and semi-supervised approaches, and many techniques and algorithms have been proposed recently for the classification of electronic documents. Supervised machine learning algorithms make predictions on a given set of samples by searching for patterns within the value labels assigned to data points.
In unsupervised machine learning methods, by contrast, no labels are attached to data points. To define the structure of the data and make it simple and organized for analysis, these methods arrange the data into clusters. Reinforcement learning algorithms decide what to do based on each data point and then evaluate the effectiveness of the choice. Over time, they change their strategy to learn better and achieve the best reward. The following are some examples of classification techniques:
-Neural network-based algorithms: These include deep neural networks (DNN), deep belief networks (DBN), hierarchical attention networks (HAN), recurrent neural networks (RNN), convolutional neural networks (CNN), and combinations of these approaches.
-Naïve Bayes classifier (NBC): The classifier is a probabilistic classifier (also known as a generative classifier). The goal is to categorize text based on the posterior probability that a document belongs to each class, given the presence of certain words in the document.
-K-nearest neighbor (KNN): It is a supervised learning technique that can be applied to classification and regression problems. It is straightforward, intuitive, and adaptable. It can be thought of as an algorithm that generates predictions based on the characteristics of nearby data points in the training dataset. In simple terms, the classifier calculates the similarity between the input sample and the k training instances that are closest to it and outputs the class to which the sample most likely belongs. It assumes that similar items lie close to one another. Because it does not learn a discriminative function, KNN is sometimes referred to as a lazy learner.
-Support vector machine (SVM): SVM classifiers partition the data space using linear or non-linear boundaries between the distinct classes. The key part of this classifier is determining the best boundaries between the classes and separating them for classification by creating a line or a hyperplane between classes.
-Decision tree (DT): It is created by using different text properties to build a hierarchical division of the underlying data space. The hierarchical segmentation of the data space is intended to provide class partitions with a more skewed distribution of classes. We determine the partition to which a given text instance is most likely to belong and use it for classification.
The naïve Bayes classifier is commonly used for text categorization. The k-nearest neighbor technique is more conventional but still widely used in science. Support vector machines (SVMs), particularly kernel SVMs, are also widely used as classification algorithms. With tree-based classification techniques such as decision trees and random forests, document categorization can be done quickly and reliably [20]-[23].
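As a small illustration of the KNN description above, the sketch below ranks training vectors by cosine similarity to a query vector and takes a vote among the k closest ones. The vectors, the labels "sports" and "finance", and the choice of cosine similarity as the closeness measure are assumptions made for the example only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(train, query, k=3):
    """train is a list of (vector, label) pairs; vote among the k most similar."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term-count vectors over a three-word vocabulary (hypothetical data).
train = [
    ([2, 0, 1], "sports"),
    ([3, 1, 0], "sports"),
    ([0, 2, 3], "finance"),
    ([0, 3, 1], "finance"),
]
print(knn_predict(train, [1, 0, 1], k=3))
```

No model is fitted in advance: all the work happens at prediction time, which is why KNN is called a lazy learner.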

METHOD
The major goal of our technique is to determine the suitable link between documents. Text documents are often classified and retrieved according to users' needs. In our approach, we suggest classifying documents based on word tokens, which extract attributes from the text of the above two categories.
The techniques used in the classification approach include term frequency (TF) and CS. Figure 1 shows the general steps of the flow diagram for the techniques used in the proposed classification approach. Combining the similarity between a test document and a category with the estimated value that a conventional classifier provides for that category improves classifier performance. Therefore, all documents in the datasets are independently vectorized by word count and by term frequency-inverse document frequency (TF-IDF) to evaluate the performance of the constructed classifiers.
Cosine similarity is a mathematical measure that identifies documents that are similar regardless of their size. In a multi-dimensional space, it is the cosine of the angle formed by two vectors, where the two vectors might represent numeric or text data. We use vectors to represent text data in this paper. We can combine the strategies mentioned above to create a text classification system. The procedure for our approach is as follows:
-Document representation
Create a numeric vector from each document. In cosine-similarity-based text categorization, a document is represented as a vector over a lexicon built from all of the training documents. The lexicon is denoted by F = {t1, t2, ..., t|F|}, where tk (k ∈ [1, |F|]) is its k-th term, and each document is regarded as a vector in the |F|-dimensional feature space. The term frequency-inverse document frequency (TF-IDF) formula is used to convert a document into a numeric vector as in (3) [24].
Where tfi(tk, di) is the number of times the word tk appears in document di, |D| is the total number of training documents, and |D(tk)| is the number of documents in the text collection D that contain tk. The term weighting is then normalized as in (4) [25].
Where the left-hand side denotes the normalized weight of term tk in the document. After the documents have been given a normalized representation, the class centroid Cj is calculated by adding the vectors of all documents in class Cj and then normalizing the result by its size. The formal description of a class centroid is given by (5).
Where ‖·‖2 denotes the 2-norm. The cosine function [26]-[28] can be used to measure the similarity between the centroid Cj and an unlabeled document d, as described in the next step.
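The document representation step can be sketched end to end as follows. The sketch assumes the common TF-IDF form $tf(t_k, d_i) \cdot \log(|D| / |D(t_k)|)$ with a natural logarithm and no smoothing, followed by 2-norm normalization and a centroid obtained by summing the normalized vectors of a class and normalizing the sum. These exact forms are assumptions consistent with the definitions above, and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents is a list of token lists; returns (lexicon, TF-IDF vectors)."""
    lexicon = sorted({t for doc in documents for t in doc})
    n_docs = len(documents)
    # |D(t_k)|: number of documents that contain each term.
    doc_freq = Counter(t for doc in documents for t in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / doc_freq[t]) for t in lexicon])
    return lexicon, vectors


def l2_normalize(vector):
    """Divide a vector by its 2-norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def class_centroid(doc_vectors):
    """Sum the normalized vectors of one class, then normalize the sum."""
    normalized = [l2_normalize(v) for v in doc_vectors]
    summed = [sum(column) for column in zip(*normalized)]
    return l2_normalize(summed)


docs = [
    ["python", "programming"],
    ["python", "languages"],
    ["natural", "language", "processing"],
]
lexicon, vectors = tfidf_vectors(docs)
centroid = class_centroid(vectors[:2])  # treat the first two documents as one class
print(lexicon)
print([round(x, 3) for x in centroid])
```

Because every centroid has unit length, the cosine similarity between a normalized document vector and a centroid reduces to a dot product.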

-Class prediction
Using the cosine similarity function, calculate the similarity between the words of a document and the words of every class; comparing the similarity of the input with the other texts determines its class. The enhanced classifiers were constructed by combining CS with the conventional MNB classifier. Equation (6) is the conventional cosine similarity function, (7) is the conventional MNB algorithm, and (8) is the algorithm of the proposed methodology. In (6), DA and DB are the two vectors being compared, and K is the number of words in each vector (vectors represent documents).
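Cosine similarity between two documents can be computed as in the sketch below, which assumes the standard definition (the dot product of the two word-count vectors divided by the product of their 2-norms) and simple whitespace tokenization. The two strings are the short DB and DQ examples that appear in the results section.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


DB = "introduction to languages Python"
DQ = "programming in Python"
print(round(cosine_similarity(DB, DQ), 4))  # 0.2887
```

The two texts share only the word "python", so the dot product is 1 and the similarity is 1 / (2 · √3).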

RESULTS AND DISCUSSION
In this section, the research findings are detailed and discussed. Results can be presented in figures, graphs, and tables that make them easy for the reader to understand [29], [30]. Table 2 shows the demonstration form. More discussion is given in the following sub-sections.
… in used to language python language natural programming in natural language processing is Python programming used"
DB = "introduction to languages Python"
DQ = "programming in Python"
CS is combined with the estimated values provided by conventional classifiers such as MNB. The similarity between a test document and each category is computed with CS and combined with the estimated value for that category, which improves classifier performance. Multinomial naïve Bayes (MNB) uses a vector of words to represent a document d as in (7) [31], [32].
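For reference, a conventional MNB scorer can be sketched as follows. It assumes the textbook formulation, in which the log prior of a class is added to the log likelihood of each word, with add-one (Laplace) smoothing; the smoothing choice and the two toy training documents are assumptions, not details taken from this study.

```python
import math
from collections import Counter, defaultdict


def train_mnb(labeled_docs):
    """labeled_docs is a list of (tokens, label); returns priors, counts, vocabulary."""
    class_docs = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in labeled_docs for t in tokens}
    priors = {c: n / len(labeled_docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab


def mnb_log_scores(tokens, priors, word_counts, vocab):
    """Log posterior of each class, up to a constant, with add-one smoothing."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = score
    return scores


# Hypothetical training data for two of the categories used in the example.
training = [
    (["python", "programming", "code"], "Programming"),
    (["laptop", "hardware", "cpu"], "Computers"),
]
model = train_mnb(training)
scores = mnb_log_scores(["python", "code"], *model)
print(max(scores, key=scores.get))
```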
In (8), the combined score of a category is the MNB estimate for that category plus the cosine similarity between the test document and the category. To test multiple documents and assign each to the category with the highest combined score (estimated value from multinomial naive Bayes + cosine similarity score), we follow these steps:
i) Note that the cosine similarity value typically ranges from 0 to 1, where a high value indicates that the data are well matched to their own categories.
ii) Take three categories, "Computers," "Programming," and "Technology," with a training set of labeled documents in each category, and three new documents to be tested and assigned to the category with the highest combined score.
iii) Create a TF-IDF matrix using the training data and calculate the cosine similarity scores. Assume that the cosine similarity scores between the test documents and the training documents for each category are as in Table 3.
iv) Combine the scores for each document. Table 4 shows the CS scores combined with the estimated MNB values. Assign a weight to each score based on its relative importance (for example, a higher weight can be assigned to the MNB score), then multiply the MNB score by its weight and add it to the cosine similarity score multiplied by its weight.
v) Based on the combined scores, assign each document to the category with the highest score.
In this example, Document1 is assigned to the "Computers" category because it has the highest combined score there. Similarly, Document2 and Document3 are assigned to the "Programming" and "Technology" categories, respectively.
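The weighted combination described in the steps above can be sketched as follows. The weights (0.6 for MNB and 0.4 for CS) and all score values are made-up numbers for illustration; they are not the values reported in Tables 3 and 4.

```python
def combine_scores(mnb_scores, cs_scores, mnb_weight=0.6, cs_weight=0.4):
    """Weighted sum of the MNB estimate and the cosine similarity per category."""
    return {
        c: mnb_weight * mnb_scores[c] + cs_weight * cs_scores[c]
        for c in mnb_scores
    }


def assign_category(mnb_scores, cs_scores, **weights):
    """Return the category with the highest combined score and all combined scores."""
    combined = combine_scores(mnb_scores, cs_scores, **weights)
    return max(combined, key=combined.get), combined


# Hypothetical scores for one test document (assumed values).
mnb = {"Computers": 0.7, "Programming": 0.2, "Technology": 0.1}
cs = {"Computers": 0.5, "Programming": 0.6, "Technology": 0.3}

label, combined = assign_category(mnb, cs)
print(label, {c: round(s, 2) for c, s in combined.items()})
```

Here "Programming" has the highest cosine similarity on its own, but the higher weight on the MNB score leads to "Computers" being selected, which shows how the choice of weights affects the final assignment.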

CONCLUSION
Automatic text classification is a vital field of information retrieval, and there are numerous issues and difficulties associated with it. In this study, we focus on two fundamental procedures for text document classification: partitioning the set of words and document categorization. Texts are divided into equivalence classes based on the cosine similarity classifier. One of the most important features of the cosine similarity classifier is its speed and high efficiency in obtaining the best results, which makes searches faster and more effective. As a result, the position weight approach may be able to play an important role in the domain of information retrieval. Furthermore, using the concept of position weight, we present a method for selecting key terms from a list of words for document classification, allowing large-scale information to be retrieved quickly and more effectively.

Machine learning for text document classification-efficient classification … (Sura I. Mohammed Ali)

Table 2. Demonstration form

Table 3. Values of the cosine similarity