Assessing naive Bayes and support vector machine performance in sentiment classification on a big data platform

ABSTRACT


INTRODUCTION
Following the explosion of subjective textual information in social networks, forums, and blogs in the form of opinions freely written by internet users, sentiment analysis has emerged as a discipline of data mining that aims to extract opinions from unstructured textual data. It allows, for example, a company to steer its marketing strategy based on the analysis of consumer feedback on a product [1], [2]. Sentiment analysis can be tackled with several approaches. The lexical approach [3] uses a dictionary to identify a text's sentiment from the polarity of its constituents, whether words or sentences. However, this approach is not always the best solution, because a word can have different orientations depending on the domain in which it appears. For instance, "a dangerous player" has a positive polarity in the sports domain, but "dangerous animal" has a negative polarity in the animal domain. Besides the lexical approach, there is an approach based on machine learning methods [4], and research has shown that machine learning methods are more accurate than lexicon-based methods [5].

As experiments have shown that machine learning algorithms outperform lexicon-based algorithms, they are a preferred choice for sentiment classification problems. Research works that address sentiment analysis with machine learning algorithms often use small or medium-sized training data that do not require many hardware resources. Under these conditions, the algorithms reach high accuracy with very low latency. When the training data is large, however, machine learning algorithms face many challenges: their design must accommodate limited memory resources and ensure adequate execution time [6].
The concept of big data has emerged to bring together all technologies for the collection, storage, and processing of massive data that traditional tools cannot process [7], [8]. The purpose of this paper is to evaluate the performance of two machine learning methods embedded in a big data framework, Apache Spark, on a large dataset, by comparing them to the performance of the same methods executed with a traditional approach. By testing on several hardware configurations of the Spark cluster, we found that the classification performance of naive Bayes and support vector machine (SVM) on the Spark platform is better than that achieved on a single machine, with an F-measure above 84%. We also observe that SVM and naive Bayes are scalable machine learning algorithms on the Spark platform.
The remainder of the paper is organized as follows. In section 2, related work is presented. In section 3, the adopted methodology is detailed. Experimental results are presented and discussed in section 4. Finally, in section 5, the paper is concluded, and future research issues are underlined.

RELATED WORK
In this section, we discuss different works on sentiment analysis, big data frameworks, and distributed machine learning methods.

Sentiment analysis
Sentiment analysis is a set of techniques, including text analytics, computational linguistics, and natural language processing, for classifying texts as positive, negative, or neutral. Pang et al. [9] conducted the first studies on sentiment analysis, using a machine learning approach to classify movie reviews. Kim et al. [10] tested feature selection with the support vector machine (SVM) algorithm and concluded that SVM outperforms all other machine learning algorithms for sentiment classification tasks. Jeong et al. [11] used sentiment analysis to identify customer preferences and trends. Wu et al. [12] explored tweets to predict stock market prices using both lexicon and machine learning approaches, and found that machine learning performs better than the lexicon approach. Kumari et al. [13] collected tweets in all languages and translated them online into English. The tweets were then classified as positive or negative using a machine learning algorithm to serve as training data for a naive Bayes classifier. This approach provides good classification results.

Distributed machine learning
Distributed machine learning can deal with the computational complexity of algorithms and the memory restrictions that arise with large datasets [14]. To overcome the inability of algorithms to process a large volume of data, they must run on several machines or processors [15]. Besides efficient prediction through parallel data processing, distributed machine learning algorithms provide fault tolerance by replicating the data on several machines. Moreover, learning from distributed data using different algorithms yields good precision, especially in large domains [16]. Distributed algorithms can also be integrated with other data processing systems [17]. However, designing and implementing distributed algorithms is a hard task [18]. In addition, distributed algorithms are effective when the nodes dedicated to data processing communicate directly, whereas communication across the network between nodes entails a longer data processing time [19].

Machine learning tools
Spark MLlib [20] and Mahout [21] are two open-source tools that include several scalable machine learning algorithm implementations. The implemented algorithms perform classification, regression, clustering, collaborative filtering, and dimensionality reduction tasks. They are independent of the big data engine, so they are portable and can easily be implemented on another big data platform. Mahout supports Hadoop, Spark, and H2O. Further, although these algorithms are mainly intended for processing large data in a distributed environment, they are also used to process small data on a single machine. There are also frameworks for large-scale data learning, such as SAMOA, but it is still a project in its early stages [22].

METHODOLOGY
As shown in Figure 1, we built a sentiment classification system on a dataset called Amazon Movie Reviews, which contains about 8 million reviews. To test our system's resilience and its ability to scale up, we worked with five datasets extracted from the Amazon Movie Reviews dataset, whose sizes vary between 10,000 and 200,000 reviews. This large data is processed using the Apache Spark framework, which incorporates machine learning libraries and relies on a distributed processing system to execute preprocessing and learning tasks. We chose two algorithms for our experiment: naive Bayes and support vector machines. This choice is motivated by the fact that they are the best algorithms in terms of precision, as shown in various research works [23].

The dataset
The dataset used in our experimental study, the Amazon Movie Reviews dataset [24], is part of the Stanford Network Analysis Project. It is a collection of reviews gathered from Amazon over a period of 10 years, up to October 2012, and contains about 8 million reviews. For our experiments, five disjoint subsets of 10 k, 50 k, 100 k, 150 k, and 200 k reviews were extracted from this dataset. Each review is composed of eight features, of which we extracted only two: the review text and the score. The review text is transformed into a vector using the bag-of-words model. The scores are converted to 0s and 1s to assign polarities to reviews according to the following rule: reviews with scores from 1 to 3 are considered negative and are assigned the value 0, while reviews with scores of 4 or 5 are regarded as positive and are given the value 1. In our experiment, the training data was balanced between the positive and negative classes.
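The score-to-polarity rule can be illustrated with a minimal Java sketch. The class and method names (`PolarityLabeler`, `toLabel`) are hypothetical and do not come from the authors' program. The sketch assumes scores are star ratings in the range 1 to 5.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (not from the paper): maps Amazon star scores to polarity labels.
final class PolarityLabeler {

    // Scores 1 to 3 map to 0 (negative); scores 4 and 5 map to 1 (positive).
    static int toLabel(double score) {
        if (score < 1.0 || score > 5.0) {
            throw new IllegalArgumentException("score out of range: " + score);
        }
        return score >= 4.0 ? 1 : 0;
    }

    public static void main(String[] args) {
        List<Double> scores = List.of(1.0, 3.0, 4.0, 5.0);
        List<Integer> labels = scores.stream()
                .map(PolarityLabeler::toLabel)
                .collect(Collectors.toList());
        System.out.println(labels); // prints [0, 0, 1, 1]
    }
}
```

Note that a score of 3 is treated as negative under this rule, rather than being discarded as neutral.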

Feature selection
The extracted datasets underwent preprocessing in four stages:
− Tokenisation: Each review is segmented by splitting the text into words separated by spaces and punctuation.
− Stop words removal: Empty words such as articles, as well as punctuation, are removed.
− Stemming: Each word is converted into its stem.
− Feature selection: The selected features are sequences of a single word, called unigrams. Previous studies have argued that classification is more accurate when using unigrams in the movie domain. The selected features are then weighted according to the TF-IDF scheme, following the formula

$$w = TF \times \log\left(\frac{N}{df}\right) \quad (1)$$

where N is the total number of documents, df is the number of documents in which the term appears, and TF is the number of times a term appears in a document.
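As an illustration of these preprocessing stages, the following plain-Java sketch tokenises reviews, removes stop words, and weights unigrams. It assumes the standard weighting TF × log(N/df) with a natural logarithm, since the base is not stated in the text. The stop-word list is a small illustrative one, stemming is omitted, and the class `TfIdfSketch` is hypothetical rather than part of the authors' Spark program.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class TfIdfSketch {

    // Illustrative stop-word list (assumption); the list used in the paper is not given.
    static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "is", "was", "and", "of", "this");

    // Tokenise on spaces and punctuation, lowercase, and drop stop words.
    static List<String> unigrams(String review) {
        List<String> tokens = new ArrayList<>();
        for (String t : review.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // For each review, returns term -> TF * ln(N / df).
    static List<Map<String, Double>> tfIdf(List<String> reviews) {
        List<Map<String, Integer>> termFrequencies = new ArrayList<>();
        Map<String, Integer> df = new HashMap<>();
        for (String review : reviews) {
            Map<String, Integer> tf = new HashMap<>();
            for (String t : unigrams(review)) {
                tf.merge(t, 1, Integer::sum);
            }
            // Each distinct term counts once per document towards df.
            for (String t : tf.keySet()) {
                df.merge(t, 1, Integer::sum);
            }
            termFrequencies.add(tf);
        }
        int n = reviews.size();
        List<Map<String, Double>> weights = new ArrayList<>();
        for (Map<String, Integer> tf : termFrequencies) {
            Map<String, Double> w = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                w.put(e.getKey(), e.getValue() * idf);
            }
            weights.add(w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> reviews = List.of("A great, great movie.", "This movie was boring.");
        System.out.println(tfIdf(reviews));
    }
}
```

In this toy example, "movie" appears in both reviews and therefore receives a weight of zero, while "great" appears twice in a single review and receives the highest weight.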

Training the classifier
According to the literature, support vector machines and naive Bayes are the classifiers that deliver the best performance in the movie domain. Our experimental study focuses on these two algorithms.

ISSN: 2252-8938
Assessing naive bayes and support vector machine performance in… (Redouane Karsi)
− Support vector machines: Support vector machines (SVM) is a supervised learning algorithm used to perform classification and regression tasks [25]. It is a classifier based on a statistical approach to learning and prediction, developed by Vapnik [26]. Its operating principle is to compute a hyperplane that separates the data into two classes such that the margin between the two classes is maximal. This is achieved by expressing the data in a multidimensional space in which linear separation becomes possible. What makes SVM a complex algorithm is its use of a kernel function, which projects the data into a higher-dimensional space in which the problem becomes linear. The algorithm must go through several iterations to select, among all the hyperplanes that separate the training data according to their class, the single hyperplane located at a maximum distance from the nearest training instances.
− Naive Bayes: Naive Bayes [27] is a classifier based on Bayes' theorem. In this model, the random variables are assumed to be statistically independent given a class c. This independence assumption reduces the computation time. To predict the class $c_i$ of a random variable X, we apply Bayes' theorem to calculate the conditional probability that X belongs to class $c_i$:

$$P(C = c_i \mid X) = \frac{P(X \mid C = c_i)\, P(C = c_i)}{P(X)} \quad (2)$$

where $P(C = c_i \mid X)$ is the probability of class $c_i$ conditioned on X, $P(C = c_i)$ is the probability of class $c_i$, $P(X \mid C = c_i)$ is the probability of X conditioned on $c_i$, and $P(X)$ is the probability of X. The random variable X is assigned the class $c_i$ that maximizes the conditional probability $P(C = c_i \mid X)$.

− Apache Spark: The first big data platforms, such as Hadoop, were based on the MapReduce framework and mainly designed for batch data processing, which requires frequent access to storage. In the case of iterative computing, the performance of MapReduce frameworks decreases considerably. With the widespread use of machine learning algorithms for data analysis, and in order to cope with the intensive computations these algorithms perform, several techniques have been developed for the fast processing of massive data. Among these techniques, Apache Spark is positioned as an efficient solution that provides a higher-level programming interface for developing distributed applications. On this platform, the data and the intermediate results are loaded and stored in the memory of the cluster machines using a data abstraction called the Resilient Distributed Dataset (RDD), which enables parallel data processing.
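The naive Bayes decision rule described above can be sketched in plain Java as follows. This is a minimal single-machine illustration, not the Spark MLlib implementation used in the experiments. It assumes a multinomial model over word counts with Laplace (add-one) smoothing, which the text does not specify. The class `NaiveBayesSketch` and the toy training reviews are hypothetical. Because P(X) is the same for every class, the sketch compares only log P(C = c) + Σ log P(x | C = c).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NaiveBayesSketch {
    private final Map<Integer, Integer> docsPerClass = new HashMap<>();
    private final Map<Integer, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<Integer, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Accumulates the counts needed to estimate P(C = c) and P(x | C = c).
    void train(List<String> tokens, int label) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    // Log of the numerator of Bayes' theorem; P(X) is constant across classes and omitted.
    double logScore(List<String> tokens, int label) {
        double score = Math.log((double) docsPerClass.get(label) / totalDocs);
        Map<String, Integer> counts = wordCounts.get(label);
        int denominator = totalWords.getOrDefault(label, 0) + vocabulary.size();
        for (String t : tokens) {
            // Add-one smoothing (assumption) avoids zero probabilities for unseen words.
            score += Math.log((counts.getOrDefault(t, 0) + 1.0) / denominator);
        }
        return score;
    }

    // Returns the class that maximizes the posterior probability.
    int predict(List<String> tokens) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int label : docsPerClass.keySet()) {
            double s = logScore(tokens, label);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train(List.of("great", "movie"), 1);   // toy positive review
        nb.train(List.of("boring", "movie"), 0);  // toy negative review
        System.out.println(nb.predict(List.of("great")));
    }
}
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long reviews.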

EXPERIMENTS AND RESULTS
Several experiments were conducted to highlight the performance of two classifiers, SVM and naive Bayes, in a distributed environment such as Spark. Our evaluation considered two indicators, the classification F-measure and the time needed to complete the learning job, while varying the dataset size and the number of cluster slave nodes.

Setup spark cluster
To set up the environment for training the classification algorithms, we implemented a multi-node cluster architecture. Our system is composed of a master node and three slave nodes. Each node of the cluster has a 3.4 GHz processor, 8 GB of memory, and a 500 GB hard disk. The nodes are interconnected through a local network with a speed of 100 Mbps. We opted for this configuration to reproduce the conditions under which the traditional algorithms were tested on a single machine.

Training and classification algorithm
After extracting five datasets of respective sizes 10 k, 50 k, 100 k, 150 k, and 200 k from the Amazon Movie Reviews dataset, we wrote a Java program that uses Spark's machine learning library, MLlib. The program receives a dataset of movie reviews as input. It then carries out various preprocessing operations, including tokenisation, stop word removal, and feature selection (unigrams weighted according to the TF-IDF scheme). An excerpt of the program follows:

    … StopWordsRemover().loadDefaultStopWords("english")
    remover.transform(movieData).show(false);
    // Select unigrams
    NGram ngramT = new NGram().setN(1) …

Results and discussion
The designed program provides several statistics measuring the classification F-measure and the processing time as a function of parameters such as the dataset size and the number of slave nodes in the Spark cluster. To evaluate the performance of our algorithms, we use the classification precision, recall, and F-measure, defined as

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives.

In Table 1, we find that the classification F-measure of SVM and naive Bayes under the Spark framework is greater than 84% and consistently exceeds the baseline results obtained on a single machine, regardless of the dataset size. By contrast, when the classification process is performed on a single machine, performance is poor starting from the 10 k dataset size, and at larger sizes the system fails to complete the learning task and raises an out-of-memory error. Unlike the results achieved by traditional machine learning techniques, naive Bayes is more accurate than SVM when using Spark components. We also observe that the classification F-measure increases with the dataset size until it stabilizes beyond 150 k reviews, because the model has by then gained enough knowledge from the many training examples.
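The precision, recall, and F-measure used in this evaluation can be computed as in the following sketch, which assumes the standard definitions based on true positives (TP), false positives (FP), and false negatives (FN) for the positive class (label 1). The class `FMeasureSketch` and the label arrays are hypothetical illustrations, not results from the experiments.

```java
final class FMeasureSketch {

    // Returns {precision, recall, fMeasure} for the positive class (label 1).
    static double[] evaluate(int[] actual, int[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == 1 && actual[i] == 1) tp++;
            else if (predicted[i] == 1 && actual[i] == 0) fp++;
            else if (predicted[i] == 0 && actual[i] == 1) fn++;
        }
        // Guard against division by zero when a class is never predicted or never present.
        double precision = (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
        double recall = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        double f = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f};
    }

    public static void main(String[] args) {
        int[] actual = {1, 1, 1, 0, 0};     // made-up ground-truth labels
        int[] predicted = {1, 1, 0, 1, 0};  // made-up classifier output
        double[] m = evaluate(actual, predicted);
        System.out.printf("precision=%.3f recall=%.3f f=%.3f%n", m[0], m[1], m[2]);
    }
}
```

In this toy example there are two true positives, one false positive, and one false negative, so precision, recall, and F-measure all equal 2/3.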

To test the scalability of our methods, we measured the time required for SVM and naive Bayes, executed on three slave nodes, to complete the preprocessing and learning operations, as illustrated in Figure 2. The running time of both algorithms rises proportionally to the dataset size while good classification performance is maintained, confirming that SVM and naive Bayes are scalable machine learning algorithms on the Spark platform. This is due to Spark's ability to reduce latency by caching the dataset in memory for fast processing and by sharing data during iterative computations. Furthermore, when nodes are added to the cluster, the running time decreases considerably, because the master node distributes data processing among the slave nodes, as illustrated in Figure 3.

CONCLUSION
Experiments have shown that machine learning algorithms are very effective in dealing with different issues of sentiment analysis. However, they have some weaknesses, among them their inability to scale up when the volume of data increases, as in a big data context. In this paper, we presented a sentiment analysis approach that exploits the machine learning components of Spark as a big data framework. In our experimental study, we wrote a program based on Apache Spark's machine learning library (MLlib) to observe the behavior of two machine learning algorithms, SVM and naive Bayes, for sentiment classification using large training datasets whose sizes vary between 10 k and 200 k reviews. The results of our experiments show that classification performance under Spark is much better than with traditional approaches. Moreover, in terms of scalability, the running time is proportional to the training dataset size, and adding slave nodes to the cluster significantly reduces latency. In future work, we will investigate ways to train classifiers from various heterogeneous data sources.