Extracting hidden patterns from dates' product data using a machine learning technique

ABSTRACT


INTRODUCTION
Data on dates contain hidden patterns that constitute valuable knowledge, including the most produced types, the most consumed types, and the undesirable types. The Qassim region in KSA is one of the largest date-producing areas in the Gulf region and in the world, alongside Iraq. The date industry has become an important food industry [1]. Dates are considered among the most popular foods in KSA and in the Gulf region [2][3]. Many research works have concentrated on the management, marketing, and traceability of the date product [4][5][6] during the processes of selling and exporting it. To date, however, no study has employed Data Mining or Machine Learning techniques to benefit from the features and characteristics hidden in the dates' data. The research problem focuses on the weaknesses of the date product with regard to unplanned production and marketing, and on the inability to discover the most important characteristics of this product from the economic, health, and consumer-type (male, female, age group, etc.) points of view. The aim is to achieve the highest profits by identifying and choosing the best and most consumed types, based on an analysis of real datasets of this product using machine learning tools.

LITERATURE REVIEW
Currently, companies and organizations worldwide are drowning in data but starving for knowledge. Data can take the form of numerical values, records, figures, text documents, and more complex structures. Complex data may appear in various forms, such as multimedia data, spatial data, and hypertext. To take full advantage of data, we can retrieve and analyse it using different methods. These methods, however, are complex and insufficient for that purpose; discovering patterns in raw data requires strong tools. With the massive amount of data stored in files, databases, and data warehouses, it is increasingly important to use effective and powerful tools for data analysis and for the extraction of interesting patterns that help decision-makers. This can be accomplished using Data Mining. Data Mining provides effective tools and mechanisms that help analysts focus on finding the most important patterns in data using Machine Learning algorithms. Machine Learning deals with the development of algorithms, implemented as software, that can learn and extract hidden patterns, features, or relations from datasets. Machine Learning algorithms adjust to changes and improve their performance through the learning and training process. The role of Data Mining is to apply Machine Learning algorithms to data for various purposes, such as prediction, classification, clustering, and the extraction of association rules.
The most common tasks addressed by association rule algorithms are frequent itemset mining and association rule mining. Three classes of these algorithms have been discussed and compared: the Apriori algorithm, the FP-growth algorithm, and the Eclat algorithm [13]. The Eclat algorithm is suitable for big datasets, whereas the Apriori and FP-growth algorithms are better suited to small datasets, which is why the Apriori algorithm is used in this research. A typical example of using association rules is to discover which items in a supermarket are normally placed together in a specific customer's market basket. Various approaches are employed for association rule extraction [9]. In Data Mining, datasets can be employed to compare and select the best methods, such as classifiers and predictors, for improving Data Mining techniques and algorithms [14]. One of the common Data Mining algorithms is the Apriori algorithm, which is used for frequent pattern analysis and the extraction of association rules. This algorithm is usually used to generate all significant association rules between items in a database. Currently, many organizations and companies use Data Mining and Machine Learning on a regular basis, including retail stores, schools, banks, and insurance companies. Many of these organizations combine Data Mining with pattern recognition, statistics, and software tools. Data Mining is used to find interesting patterns and relations that would otherwise be difficult to find. It allows data owners to study and understand their customers' behaviour and make smart marketing decisions [15] for their products and services.
Data Mining aims at the analysis of historical datasets from different perspectives [16][17][18] in order to summarize the data in new ways that are both clear and useful, so as to increase revenue, cut costs, or both for the data owner [2]. It has become common in both the private and public sectors [19,20], where it satisfies various needs through applications employed in local and global society to enhance services and procedures. Therefore, there is an increasing demand for mining interesting patterns in datasets. Analysing such data is computationally very complex when traditional methods are used [21]. In addition to the works discussed above, many research works have contributed to this field of study; some focus on data analysis [22][23][24][25] and others concentrate on the development and refinement of algorithms [12,26,25,16]. This is because Data Mining is a multidisciplinary field with wide and diverse applications developed for data analysis. In fact, considerable gaps exist between knowledge discovery fundamentals and domain applications. A few of the application domains include the analysis of product data, educational data, the retail industry, spatial-temporal data, and medical data [26]. Furthermore, several contributions are similar to this research. For instance, Cornelis studied and analysed the association rule problem with respect to positive and negative values for Big Data [27]. Likewise, Mahmood et al. concentrated on proposing an algorithm for discovering positive and negative association rules among frequent and infrequent itemsets. Associations among medical test results have also been identified using Data Mining algorithms [8]. An association rule generally comprises a set of antecedent parts that lead to a consequent part with a certain confidence.
Pazzani and Billsus treated each customer's list of book subjects as a transaction, which enabled them to find groups of association rules for subjects that frequently appeared together as part of a customer's interests [28]. Osadchiy et al. proposed an algorithm that recognizes a model of collective preferences independently of the customer's interests. It requires a simple system of ratings, and its performance was evaluated on a large dataset of transactions from real dietary recalls. The evaluation demonstrated that the implementation based on pairwise association rules performs better for the defined task [29]. Our research concentrates on a different idea: it generates association rules from a different kind of data and consequently discovers other types of knowledge.
Other research works have provided a valuable community service. Vasavi used Data Mining algorithms to extract hidden patterns from a 2013 road accident dataset for the highways that pass through Krishna district, India; the data were heterogeneous and collected from police stations. The objective was to find the features shared between accidents. The data were analysed using Machine Learning algorithms, and the results were sets of association rules generated by the Apriori algorithm [30]. Similarly, Sene et al. used association rules to analyse a different database, describing in-flight medical incidents, in order to extract interesting knowledge from that data [7]. Miholca et al. investigated the problem of incremental relational association rule mining. They proposed a new method named "Incremental Relational Association Rule Mining (IRARM)" for incrementally uncovering interesting relational association rules within a dynamic dataset during updates. A number of experiments were carried out to show that the proposed method generates results more rapidly than executing the Data Mining algorithms on the extended dataset [31]. An additional approach was presented for mining generalized association rules: an algorithm was developed that scans the database only once and uses the transaction dataset to compute the support of generalized itemsets faster than other similar algorithms [32]. Vidhate and Kulkarni applied an efficient algorithm to a set of data collected from different shops to find a set of frequent items [33]. Fernandez-Basso et al., on the other hand, proposed a parallel algorithm for association rule extraction using Big Data technologies, which addresses the problems related to massive amounts of data efficiently [9].
Sadh and Shukla proposed a mining-based optimization technique for rule generation that combines the Apriori algorithm with an ant colony optimization approach [34]. Prajapati et al., on the other hand, identified consistent and inconsistent association rules from sales data using distributed datasets [21]. A modified form of the frequent itemset mining method has been presented that uses an improved formula for generating valid candidates by decreasing the number of invalid candidates. During the generation of the association rule sets, the confidence and support measures were applied [12]. The produced frequent k-itemset is passed to the association rule generator to create all possible rules [35]. Rajeswari et al. proposed a modified fuzzy Apriori algorithm for rare itemset mining to detect the outliers that represent weak students, based on heap space usage [36]. An additional approach was proposed to extract a set of association rules from medical data; its objective is to select the best association rule mining algorithm according to multiple-criteria decision analysis [37]. In this paper, our approach concentrates on the analysis of dates' data in order to find interesting patterns within the extracted association rules. These patterns are strongly relevant to the production and consumption of the date product.

RESEARCH METHOD
The overall steps of the methodology are shown in Figure 1. It comprises dataset gathering, data preprocessing, the mining process, knowledge generation and representation, and accuracy improvement. These steps are explained in the following sub-sections:

Information and Data Collection
Important information was collected through interviews with a number of people, and the data were collected through an online questionnaire that was designed, evaluated, and distributed to a sample of consumers, producers, marketers, and product managers. It was distributed to a sample of 640 people. The attributes of the collected dataset are presented in Table 1. Some values in the collected data were incorrect and others were incomplete or missing, which made a cleaning step necessary. After the cleaning process, 499 records remained as the total number of instances employed in this research.
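The cleaning step described above can be illustrated with a minimal sketch. The source does not describe the cleaning procedure in detail, so this example assumes the simplest policy (discarding any record with a missing or empty value); the attribute names and values are hypothetical and are not taken from Table 1.

```python
# Hypothetical questionnaire records; attribute names and values are
# illustrative only and do not come from the paper's dataset.
records = [
    {"gender": "male", "age_group": "20-29", "preferred_type": "Sukkari"},
    {"gender": "female", "age_group": "", "preferred_type": "Khalas"},
    {"gender": None, "age_group": "30-39", "preferred_type": "Sukkari"},
    {"gender": "female", "age_group": "40-49", "preferred_type": "Ajwa"},
]


def is_complete(record):
    """Return True when every attribute has a non-empty value."""
    return all(
        value is not None and str(value).strip() != ""
        for value in record.values()
    )


# Assumed policy: drop incomplete records rather than imputing values.
cleaned = [record for record in records if is_complete(record)]

print(f"{len(records)} records before cleaning, {len(cleaned)} after")
```

Applied to the real questionnaire data, a step of this kind would account for the reduction from 640 distributed questionnaires to the 499 records used in the analysis, although the exact rules the author applied are not stated.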

Dataset Samples
The dataset was divided into four samples. The first consists of two attributes, the second includes four attributes, and the third and fourth each contain five attributes. Some samples overlap. The output is a set of rules reflecting some features of the date product relevant to the production and consumption processes.

Data Preprocessing
Preprocessing is the basic step in knowledge discovery using machine learning [38]. It includes various tasks [39]: removing inconsistent and noisy data, attribute coding, transformation, and loading [40]. This, in turn, improves the data quality and the accuracy of the results. The Apriori algorithm was selected as a useful rule-based technique to discover strong hidden patterns as a set of rules.

Data Analysis and Rules Generation
The association rules method is considered one of the important functionalities of Data Mining. It includes three types: multilevel association rules, multidimensional association rules, and quantitative association rules. This research uses multilevel association rules. The results of the analysis of the four samples are demonstrated as follows:
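The source does not give the implementation that was used, so the following is only a minimal sketch of the frequent itemset phase of the Apriori algorithm. It assumes that each record is encoded as a set of attribute=value items; the item labels and the minimum support count are hypothetical, and the sketch does not model the multilevel aspect mentioned above.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count every single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {
            a | b for a, b in combinations(list(current), 2) if len(a | b) == k
        }
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count the surviving candidates against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical encoded records (not the paper's data).
transactions = [
    {"type=Sukkari", "gender=male"},
    {"type=Sukkari", "gender=male"},
    {"type=Sukkari", "gender=female"},
    {"type=Khalas", "gender=female"},
]
frequent_itemsets = apriori(transactions, min_support_count=2)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)
```

Rules are then derived from each frequent itemset by splitting it into an antecedent and a consequent and keeping the splits whose confidence is high enough.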

Measuring Support and Confidence
In this step, the Support and Confidence measures were applied to validate the outputs. Appendix A contains all generated rules with the ranking values of these measures. The values show the importance of each rule relative to the other rules. The formulas for Support and Confidence are given in Formulas (1) and (2), respectively [41][42]. An association rule can be written in the format IF antecedent THEN consequent. The whole dataset was first analysed at once, but the resulting rules were limited and covered the date types only partially. This is the justification for dividing the attributes into four samples, which generated a large set of rules, some weak and others strong, followed by a filtration process.
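As an illustration of the two measures, the sketch below uses the standard textbook definitions: the support of an itemset is the fraction of records that contain it, and the confidence of a rule IF A THEN B is the support count of A and B together divided by the support count of A. These are assumed to correspond to Formulas (1) and (2), which are not reproduced in the text; the records are hypothetical.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for IF antecedent THEN consequent."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    ante = support_count(antecedent, transactions)
    return both / ante if ante else 0.0


# Hypothetical encoded records (not the paper's data).
transactions = [
    {"type=Sukkari", "gender=male"},
    {"type=Sukkari", "gender=male"},
    {"type=Sukkari", "gender=female"},
    {"type=Khalas", "gender=female"},
]

rule_support = support({"gender=male", "type=Sukkari"}, transactions)
rule_confidence = confidence({"gender=male"}, {"type=Sukkari"}, transactions)
print(rule_support, rule_confidence)
```

In this toy example, the rule IF gender=male THEN type=Sukkari holds in two of the four records and in every record that contains the antecedent.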

RULES VALIDATION
To validate the generated rules, the frequent itemsets that generate strong association rules must satisfy the minimum support and the minimum confidence [42]. The minimum confidence of a rule is a user-defined value, and an association rule is strong if its support is greater than the minimum support value and its confidence is greater than the minimum confidence value [43]. All the generated rules are shown in Table 2.
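The strong-rule criterion stated above can be sketched as a simple filter. The threshold values and the rule records below are hypothetical, since the text does not report the minimum support and minimum confidence that were used; the strict "greater than" comparison follows the wording of the criterion.

```python
# Hypothetical thresholds; the paper does not state the values it used.
MIN_SUPPORT = 0.10
MIN_CONFIDENCE = 0.60

# Hypothetical rules with relative support and confidence in [0, 1].
rules = [
    {"id": 1, "support": 0.30, "confidence": 0.90},
    {"id": 2, "support": 0.05, "confidence": 0.95},  # support too low
    {"id": 3, "support": 0.25, "confidence": 0.40},  # confidence too low
    {"id": 4, "support": 0.12, "confidence": 0.61},
]


def is_strong(rule, min_support=MIN_SUPPORT, min_confidence=MIN_CONFIDENCE):
    """A rule is strong when it exceeds both user-defined thresholds."""
    return rule["support"] > min_support and rule["confidence"] > min_confidence


strong_rules = [rule for rule in rules if is_strong(rule)]
print([rule["id"] for rule in strong_rules])
```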

RULES FILTRATION
The support values were normalized to a small range by dividing each value by 499 (the dataset size) so that they are compatible with the Confidence values, as a preliminary step to finding the rank values. The Support and Confidence values were then used to calculate the rank of each rule according to Formula (3):

$$\text{Rank} = \frac{\text{Sup}(\text{Consequent})}{D_s} + \frac{\text{Sup}(\text{Antecedent})}{D_s} + \text{Confidence} \qquad (3)$$

where $D_s$ is the dataset size (499). After that, the rules were filtered by removing redundant rules and rules with lower ranks (lower quality). The next step was the selection of the rules with the highest quality (highest ranks). Figure 2 shows the final results for all generated rules and their ranks. The highest points in the figure represent the highest ranks, and the selected rules are those that appear above the value 1.6 on the Y-axis: {1, 2, 3, 7, 8, 9, 10, 12, 16, 18, 24, 26, 27}. This set contains the best rules: 13 rules have the highest ranks, and together they cover all the date types included in the research. They represent the final results, as shown in Table 3.
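The ranking and selection step can be sketched as follows. The rank function implements Formula (3) with the dataset size of 499 and the cut-off of 1.6 mentioned in the text; the rule records and their support counts are hypothetical, and the redundancy removal step is omitted because the text does not define how redundancy was detected.

```python
DATASET_SIZE = 499      # Ds, the dataset size stated in the text
RANK_THRESHOLD = 1.6    # cut-off on the Y-axis of Figure 2


def rank(sup_antecedent, sup_consequent, confidence, ds=DATASET_SIZE):
    """Formula (3): normalized supports of both rule parts plus confidence."""
    return sup_consequent / ds + sup_antecedent / ds + confidence


# Hypothetical rules: support counts of each part and the rule confidence.
rules = [
    {"id": 1, "sup_antecedent": 300, "sup_consequent": 250, "confidence": 0.8},
    {"id": 2, "sup_antecedent": 100, "sup_consequent": 120, "confidence": 0.5},
    {"id": 3, "sup_antecedent": 200, "sup_consequent": 200, "confidence": 0.9},
]

for rule in rules:
    rule["rank"] = rank(
        rule["sup_antecedent"], rule["sup_consequent"], rule["confidence"]
    )

# Keep only the rules whose rank lies above the threshold.
selected = [rule for rule in rules if rule["rank"] > RANK_THRESHOLD]
for rule in selected:
    print(rule["id"], round(rule["rank"], 3))
```

Because each normalized support and the confidence lie between 0 and 1, the rank ranges from 0 to 3, so a cut-off of 1.6 keeps only rules that score well on at least two of the three terms.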
Extracting hidden patterns from dates' product data using a machine... (Mohammed Abdullah Al-Hagery)

Table 1. The dataset attributes

Table 2. All generated rules

Table 3. The final set of rules