Machine learning-based technique for big data sentiments extraction

ABSTRACT


INTRODUCTION
Twitter, one of the most widely used social media sites, has become a popular platform worldwide for people who want to share opinions about their social, political and economic interests. User opinions can relate to many subjects, such as gadgets, politics, products and services; they directly convey the user's viewpoint and help in making predictions about a consumer market. Such opinions or sentiments, expressed by a huge number of people around the world, can be analyzed and used for future predictions. Tweets usually contain incomplete, poorly structured, noisy and irregular expressions, ill-formed words and non-dictionary terms [1]. Tweets are also short, with a limit of 140 characters. Preprocessing of the collected datasets is therefore required to reduce noise in tweets by removing stop words, removing URLs, replacing negations, etc. [2]. A sentiment dictionary contains all forms of a word together with each word's polarity strength, which can save time.
Sentiment analysis (SA) is the process of detecting the contextual polarity of text as positive, negative or neutral [3][4]. Organizations across the world have widely adopted the ability to extract insights from these sentiments on various social media sites. It helps organizations make predictions about products, reviews and other decision-making processes, which ultimately increases profit. SA thus benefits organizations and individuals by letting them respond to user or market demand. SA, also known as opinion mining, is the process of identifying and categorizing opinions on the web to determine the writer's attitude towards a particular topic or product [5].
ISSN: 2252-8938. Int J Artif Intell, Vol. 9, No. 3, September 2020: 473-479.
SA reveals what an author wants to communicate and characterizes the author's state of mind in terms of emotions, feelings and subjectivity about an event or topic. It relies on Natural Language Processing (NLP), which deals with the interaction between computers and human (natural) language [6][7][8]. NLP techniques facilitate text pre-processing, i.e. they clean and normalize text for sentiment analysis [8]. Sentiments can be analyzed at the level of a single phrase or sentence, where the sentiment of the whole sentence is calculated. The analysis involves the following steps [9][10]:
- Tweets posted on Twitter are freely available through a set of Twitter APIs. First, we collected a corpus of positive, negative, neutral and irrelevant tweets from the Twitter API.
- Pre-processing is then performed by removing stop words, negations, URLs, full stops, commas, etc. to reduce noise in the tweets and prepare the data for sentiment classification.
- Next, we apply machine learning algorithms to our dataset and compare their results.
- The results help us identify which machine learning algorithm is best suited for SA classification.
Applications of SA are broad and powerful and enable easier and quicker social media monitoring, for example: consumer markets, for product reviews; marketing, to learn consumer trends and attitudes; social media, to find general user opinion about current topics; and movies, to learn whether a released movie is liked or not [11]. As the number of users on social media sites grows rapidly and they produce a large amount of data every day, there is a need to classify and analyze these messages to determine their polarity about some topic or event [12][13]. Emotions and opinions can be expressed in many ways. Classifying sentiments into a few closely related classes such as "positive", "negative" or "neutral" is a complicated task. SA is a popular topic, and a great deal of research has been conducted on it over a long time.
Many researchers have used supervised learning algorithms, along with various automatic classifiers, to classify the polarity of sentiments [14]. The difficulty lies in assigning the strongest polarity to sentiments and in finding the algorithm that provides the most accurate results.
In this paper we use three machine learning algorithms, Support Vector Machine (SVM), Decision Tree (DT) [15] and the Naïve Bayes (NB) classifier, to classify our data and to evaluate performance on our training dataset. We focus on comparing the outcomes of these algorithms to identify the machine learning method that gives the most accurate and efficient results for classifying Twitter data.

RESEARCH METHOD
This paper presents the model shown in Figure 1, which consists of three layers for analyzing sentiments: first, a data collection layer, used to collect tweets from Twitter APIs; second, a data preprocessing layer with attribute selection, used to reduce the noise level in tweets; and last, an SA or data mining layer, used to apply machine learning algorithms [2].

Data collection
First, we obtain training data of Twitter sentiments from two different Twitter APIs. The first dataset, taken from "Twitter Sentiment System for SemEval 2016" (denoted by "SE-T"), contains approximately 13541 tweets with two attributes, namely class and content [16]. The second dataset, taken from the "Sanders Analytics twitter sentiment corpus" (denoted by "TS"), contains 479 instances with two attributes, class and text [17], as presented in Table 1. We also collect our own Twitter data in the Malay language, which is spoken in Malaysia, Singapore, Indonesia and a few other countries; this dataset is denoted by "OC". Malay is the fourth-most popular language on Twitter, accounting for 8 percent of all tweets; the collected tweets are about airlines [18].

Table 1. Datasets (the per-class column labels were not recoverable from the source)
Dataset | Reference | Total instances | Per-class counts
SemEval 2016 (SE-T) | [16] | 13541 | 5232, 6242, 2067
Twitter Sanders (TS) | [17] | 479 |

Featurization
Features in machine learning are numerical attributes on which mathematical operations such as matrix factorization or the dot product can be performed. In many scenarios, however, the dataset does not consist only of numerical attributes, for example sentiment analysis of Twitter/Facebook users, Amazon customer reviews, or IMDB/Netflix movie recommendation. In all of these cases the dataset contains a mix of numerical values, string values, character values, categorical values and connections (one user connected to another user). Converting these types of features into numerical features is called featurization.
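As an illustration of turning text into numerical features, the following is a minimal bag-of-words sketch. It is not the paper's WEKA pipeline; the function names `build_vocabulary` and `featurize` and the two example tweets are assumptions introduced only for this example.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map every distinct lowercase token to a column index (sorted for a stable order)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def featurize(document, vocabulary):
    """Turn one text into a word-count vector; words outside the vocabulary are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

# Illustrative tweets, not taken from the paper's datasets.
tweets = ["good flight good crew", "bad delay"]
vocab = build_vocabulary(tweets)
vectors = [featurize(t, vocab) for t in tweets]
```

Each tweet becomes a fixed-length vector of word counts, so operations such as the dot product are defined on it.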

Text processing setup
To obtain accurate results from the classifiers, we must make sure that the datasets are processed efficiently, removing unrelated content so that the related content is accurately extracted. Most researchers consider that URLs carry no information about sentiment, so tweet contents can be refined by removing short URLs. People often express their sentiments with emotional words that contain repeated letters, such as "coooool", which is a very common trend. Numbers are also not used for analyzing sentiments, so tweet contents can be refined by removing them [1]. The polarity of a word changes when it is preceded by a negation, since a negation can change or reverse the meaning of words. Noise in tweets can therefore be reduced by checking negations and by removing URLs, numbers and repeated letters in emotional words. This filter provides options for configuring the processing of our dataset, which includes the following steps [19][20]:
- Stemming: removes suffixes from words according to grammatical rules. Here we apply the popular Snowball stemming library.
- Stop word removal: some words carry no polarity and do not need further analysis, for example: able, are, both, which, has, become, after. Eliminating these words does not affect our results. We used the Rainbow list for our experiment.
- Tokenization: splits a document into words or terms to build a word vector. Here we used NGramTokenizer.
- Feature selection: reduces the number of attributes to a better subset, which can increase accuracy and reduce training time. It is done using filters and wrappers.
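The cleaning and tokenization steps can be sketched in a few lines of plain Python. This is a simplified stand-in, not the WEKA filter used in the paper: the stop-word and negation lists are tiny illustrative assumptions (the paper uses the Rainbow list), the `not_` prefix is one common way of marking negated words, and stemming and feature selection are omitted.

```python
import re

# Tiny illustrative lists (assumptions); the paper uses the Rainbow stop-word list.
STOP_WORDS = {"able", "are", "both", "which", "has", "become", "after", "the", "is", "a"}
NEGATIONS = {"not", "no", "never", "cannot"}

def clean_tweet(text):
    """Lowercase, strip URLs and numbers, squeeze repeated letters, drop stop words, mark negations."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)        # "coooool" -> "cool"
    tokens, negate = [], False
    for tok in re.findall(r"[a-z']+", text):
        if tok in NEGATIONS or tok.endswith("n't"):
            negate = True                                  # flag the next kept word
            continue
        if tok in STOP_WORDS:
            continue
        tokens.append("not_" + tok if negate else tok)
        negate = False
    return tokens

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out
```

For example, `clean_tweet("I don't like delays")` keeps the negation attached to the word it modifies, so the classifier sees `not_like` rather than `like`.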

Sentiment classifier
To classify sentiments we use machine learning (ML) algorithms. ML is a branch of Artificial Intelligence (AI) concerned with the study of classification and pattern analysis; it allows a computer to learn behaviors from empirical data taken from sensors or databases [21]. ML algorithms allow us to automatically recognize complex patterns and make intelligent decisions based on data. In this paper we use three machine learning algorithms: Naïve Bayes (NB), Support Vector Machine (SVM) [22] and Decision Tree (DT) [15].

Naïve Bayes classifier
The Naïve Bayes classifier counts the frequency of words in the message that are related to the sentiments. Because it is a probabilistic classifier based on Bayes' theorem, it captures uncertainty about the model by determining the probability of each outcome. It can compute explicit probabilities for the tested dataset and is robust to noise. It is a numerical approach that is simple and fast and gives high accuracy.
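The word-counting idea can be made concrete with a small multinomial Naïve Bayes sketch using add-one (Laplace) smoothing. This is an illustrative implementation under stated assumptions, not WEKA's classifier; the class name `NaiveBayes` and the toy training data in the usage note are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # number of documents per class
        self.word_counts = defaultdict(Counter)      # word frequencies per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        """Return log P(class) + sum of log P(word | class) for every class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)
            for w in tokens:
                if w in self.vocab:                  # words never seen in training are skipped
                    count = self.word_counts[label][w]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

A usage example: after `NaiveBayes().fit([["good", "great"], ["bad", "awful"]], ["positive", "negative"])`, calling `predict(["good"])` returns the class whose word counts make "good" most probable.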

Support vector machine (SVM)
SVM tends to yield accurate results when it is used for classifying text. The basic idea is to find the hyperplane (represented by a vector w) that separates the document vectors of one class from the vectors of the other class [7]. As a type of linear classifier, it has been successfully employed in text classification and various other sequence processing applications.
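For reference, the separating hyperplane can be written out using the standard soft-margin linear SVM formulation. The symbols b (bias), C (regularization constant), ξ_i (slack variables), x_i (document vectors) and y_i ∈ {−1, +1} (class labels) are introduced here as assumptions; the paper itself only names the vector w and does not state which SVM variant or parameters were configured in WEKA.

```latex
% Decision rule: a document vector x is assigned to the class given by the side
% of the hyperplane w . x + b = 0 on which it falls.
f(\mathbf{x}) = \operatorname{sign}\left(\mathbf{w}^{\top}\mathbf{x} + b\right)

% Training (soft margin): maximize the margin while penalizing violations.
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
  y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1 - \xi_i,
  \qquad \xi_i \ge 0, \quad i = 1, \dots, n
```

With three sentiment classes, a binary formulation like this is typically combined across pairs of classes or in a one-versus-rest scheme.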

Decision tree (DT)
A decision tree is a flowchart-like structure that outputs a label for the features given as input values. It categorizes a document by starting from the tree root (nodes are labeled with features), following the branches downward (labeled with feature weights) and finally reaching a leaf node (labeled with a category).
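The root-to-leaf traversal can be sketched as follows. The tree here is hand-built for illustration only; the node layout (`feature`, `threshold`, `low`, `high`, `label`) and the example features are assumptions, not a tree learned from the paper's data.

```python
def classify(node, features):
    """Walk from the root to a leaf, branching on one feature weight per node."""
    while "label" not in node:
        value = features.get(node["feature"], 0)     # a missing feature counts as weight 0
        node = node["low"] if value <= node["threshold"] else node["high"]
    return node["label"]

# Hypothetical tree: internal nodes test a word count, leaves carry a category.
tree = {
    "feature": "good", "threshold": 0,
    "low": {
        "feature": "bad", "threshold": 0,
        "low": {"label": "neutral"},
        "high": {"label": "negative"},
    },
    "high": {"label": "positive"},
}
```

Here `classify(tree, {"good": 2})` follows the "high" branch at the root and stops at the "positive" leaf.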

Experimental setup
We use the Waikato Environment for Knowledge Analysis (WEKA) to implement data mining algorithms for preprocessing, classification, clustering and analysis of results [23][24]. This environment includes Java libraries that implement these algorithms and gives researchers a convenient environment for classifying datasets. We apply the "StringToWordVector" filter and perform extensive preprocessing on our datasets [25][26]. Using the n-gram tokenizer option and the attribute selection method, different numbers of attributes are created. With attribute selection, 50 attributes out of 1613 words are taken for testing from the first dataset, SE-T [16], and 105 attributes out of 2065 words from the second dataset, TS [17]. This method increases the accuracy on our training dataset and also reduces execution time. Table 2 shows the reduction in file size after preprocessing.
To evaluate performance we apply 10-fold cross-validation, which splits the original set into ten folds; each fold serves once as the test set for evaluating results while the remaining folds form the training sample used to train the model. To compute tweet sentiments quickly without compromising accuracy, an approach known as "Information Retrieval Metrics" can be used to evaluate experimental results in terms of precision, recall, F-measure and accuracy, using the following formulas [9,27]:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Here TP = true positive, TN = true negative, FP = false positive and FN = false negative.
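The evaluation procedure can be sketched in plain Python using the standard definitions of the four metrics. The function names are assumptions, and `k_fold_indices` is a simple unshuffled, unstratified split, so it only approximates the cross-validation that WEKA performs.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; every instance is in a test fold exactly once."""
    indices = list(range(n))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

In a full experiment, a model would be trained on each `train` list and scored on the matching `test` list, with the counts TP, TN, FP and FN accumulated over all ten folds before the metrics are computed.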

RESULTS AND ANALYSIS
We observed that our classification results improved in terms of time and accuracy when using processed data with a small number of features rather than the raw datasets. For example, on the first dataset (SE-T), building a model with the NB algorithm takes 10.56 seconds with an accuracy of 53.73%; after processing, the time taken to test the model on training data is reduced to only 0.35 seconds and accuracy improves to 57.46%. Table 3 shows the accuracy of the classifiers on the three datasets after applying various preprocessing methods. Table 4 reports the performance measures obtained in our experiments on the three datasets using 10-fold cross-validation.
The number of correctly classified instances and the accuracy rate are highest with the SVM algorithm on all three datasets. In our experiment, the accuracy obtained with SVM is 61.55%, 87.47% and 82.58% respectively (on the 50-feature SE-T, 105-feature TS and 100-feature OC datasets), which is higher than that of the other two algorithms. Our experimental results show that the same preprocessing methods affect classifier performance similarly on different datasets. From Table 4, SVM provides 64.96%, 71.26% and 91.25% overall precision, which is better than the other two algorithms. The overall recall and F-measure of SVM are also higher than those of NB and DT on the three datasets.
Furthermore, the time taken to build a model is greatly reduced by applying feature selection. On the first dataset (SE-T), the time taken to build the model is 0.45, 29.43 and 4.47 seconds with NB, SVM and DT respectively; on the second dataset (TS), it is 0.01, 0.06 and 0.01 seconds with NB, SVM and DT respectively.

CONCLUSION
In this paper, we discuss sentiment analysis, which reveals what writers think about a particular entity. Finding people's sentiments about a real-world entity from social media sites such as Twitter, Facebook or blogs has become a routine task. To analyze this large amount of data efficiently, it is essential to classify it accurately. We have presented a text mining methodology that uses the WEKA tool to classify Twitter sentiments with three machine learning algorithms: SVM, DT and NB. We conducted an experiment on three Twitter datasets to verify the effectiveness of pre-processing. Our experimental results indicate that removing unwanted words and selecting features in the preliminary phase of preprocessing reduces the time to build a model and yields more accurate results with the applied algorithms. The results may be affected by the choice of features for training and the choice of algorithm for sentiment classification. The performance of the SVM, DT and NB algorithms improves on the datasets after unwanted words are removed, so removing unwanted words is useful for improving sentiment classification. We present a comparative analysis of the three algorithms and calculate overall performance measures in terms of precision, recall and F-measure. Our experimental results indicate that SVM provides more accurate results than the other algorithms. However, the currently available preprocessing techniques should be studied further to improve the results of various classifiers. A method should also be found to incorporate feature selection automatically at model-building time for any language.