CLG clustering for dropout prediction using log-data clustering method

Agung Triayudi1, Wahyu Oktri Widyarto2, Lia Kamelia3, Iksal4, Sumiati5
1Department of Informatic and Communication Technology, Universitas Nasional, Indonesia
2Department of Industrial Engineering, Universitas Serang Raya, Indonesia
3Department of Electrical Engineering, UIN Sunan Gunung Djati, Indonesia
4Department of Electrical Engineering, Universitas Faletehan, Indonesia
5Department of Informatic, Universitas Serang Raya, Indonesia


INTRODUCTION
Educational data mining applies data mining techniques to explore the data held in various educational information systems. Earlier work recommends two methods for analyzing students' online learning interaction data: single linkage dissimilarity increment distribution, global cumulative score standard (SLG) and average linkage dissimilarity increment distribution, global cumulative score standard (ALG). The end result is a model that groups students by behavior patterns and interpersonal patterns [1], [2]. The process starts with data collection, continues with data transformation, and ends with data analysis [3]. Educational data mining is used to extract useful information from the large amounts of electronic data recorded in educational systems [4]-[6].
The topic of this paper is the education system as it actually operates. In most schools, a teacher teaches a number of students in a class at the same time [7], [8]. This makes it hard for the teacher to present the material in detail to each student [9]. On the other hand, teachers need to know each student's skill level in order to guide them and provide a high-quality education. Periodic testing is therefore needed to check whether students have the required skills [10], [11]. However, it is quite difficult to assess students' ability in each subject, because exams consume a lot of time and burden class members [12]. At the same time, it is common in programming classes to track activities and behaviors, such as source code editing and UNIX commands, and to save them as log-data [13], [14]. Several studies have used such log-data; for example, some research predicts student skills from log-data [15], [16]. However, the accuracy of these predictions is not high, and it remains unclear whether the evaluation follows the prescribed aspects. Moreover, some studies report difficulty in evaluating students' acquired skills from log-data. This research therefore aims to identify students who cannot keep up with the programming class, rather than to estimate student achievement from log-data. Specifically, we propose a method that predicts dropout by using outlier detection without supervised learning.
ISSN: 2252-8938

RESEARCH METHOD

Problem setting for dropout prediction
According to the department of education, monitoring and supporting students is highly necessary. When a teacher can identify a student at high risk of dropping out early on, the teacher can act immediately and help that student so that he or she is not expelled from the school. It is therefore important to predict which students in a class are at risk, so that teachers can give them special guidance. In a programming class, student activity can be monitored through log-data. We developed a logging system that records traces of source code editing and UNIX commands, and obtained a dataset from the 39 students of our programming lessons. Evaluators built with supervised learning are commonly used to predict a person's dropout potential. However, such an evaluator is difficult to train on a dataset as small as ours. In addition, the features in the log-data depend on the content of each class, for example whether it contains a lot of exercises or a lot of explanation. We therefore use an unsupervised learning method with outlier detection, assuming that students in an outlier cluster can be characterized by their achievement as either superior or inferior students [16], [17]. The k-means clustering technique, which is normally based on Euclidean distance, is adapted to cluster with dynamic time warping and to compare students by their behavior over active time. This makes it possible to compare the flow of activities while tolerating deviations along the time axis [18].

Dynamic time warping
Dynamic time warping (DTW) is an algorithm that measures the similarity between two sequences that may differ in length or number of data points. DTW matches two sequences by taking temporal information into account so that they can be aligned. The alignment sought is the one with the smallest cumulative distance between the two synchronized sequences. Assume two sequences, Q and C, of length n and m respectively, as shown in (1) and (2) [19], [20].
$$Q = q_1, q_2, \ldots, q_i, \ldots, q_n \tag{1}$$
$$C = c_1, c_2, \ldots, c_j, \ldots, c_m \tag{2}$$
To align these two sequences with dynamic time warping, an n × m matrix is formed whose element (i, j) holds the distance $d(q_i, c_j)$ between the points $q_i$ and $c_j$, defined as $d(q_i, c_j) = (q_i - c_j)^2$. Each matrix element (i, j) corresponds to an alignment between the points $q_i$ and $c_j$. The warping path W is a contiguous set of matrix elements that defines a mapping between Q and C. The k-th element of W is written $w_k = (i, j)_k$, which gives (3).
$$W = w_1, w_2, \ldots, w_k, \ldots, w_K \tag{3}$$
with $\max(m, n) \le K < m + n - 1$. The path is evaluated through the cumulative distance D(i, j), which is the distance $d(q_i, c_j)$ of the current element plus the minimum cumulative distance of the adjacent elements, as shown in (4).
$$D(i, j) = d(q_i, c_j) + \min\{D(i-1, j-1),\ D(i-1, j),\ D(i, j-1)\} \tag{4}$$
Once the optimal warping path has been obtained, the distance or warping cost is calculated according to (5).
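The recurrence described above can be illustrated with a short Python sketch. It assumes one-dimensional numeric sequences and the squared point distance given in the text; the function name `dtw_cost` is hypothetical. Because the final normalisation of (5) is not reproduced in this text, the sketch returns the raw cumulative distance of the last matrix element rather than a normalised warping cost.

```python
import math

def dtw_cost(q, c):
    """Cumulative DTW distance between two numeric sequences q and c.

    Illustrative sketch only: returns D(n, m) without the
    normalisation that the paper's equation (5) may apply.
    """
    n, m = len(q), len(c)
    # D has an extra row and column of infinities so that the
    # recurrence needs no special handling at the matrix borders.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2  # point distance d(q_i, c_j)
            D[i][j] = d + min(D[i - 1][j - 1],  # diagonal neighbour
                              D[i - 1][j],      # neighbour above
                              D[i][j - 1])      # neighbour to the left
    return D[n][m]
```

Two sequences that differ only by a repeated value, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, align perfectly under this measure even though their lengths differ, which is the property that makes DTW useful for comparing activity series that are shifted in time.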

CLG clustering
This study proposes a modified clustering algorithm, complete linkage dissimilarity increment distribution-global cumulative score standard (CLG). It combines the complete linkage (CL) algorithm [20], the dissimilarity increment distribution (DID) algorithm [20], and the global cumulative score standard (GCSS) algorithm [21]. For grouping, the CLG algorithm combines elements of parameter-free graph-based approaches and model-based approaches (which define merging criteria by characterizing clusters in probabilistic terms).

[Equations defining CL and DID]
The CLG algorithm treats small candidate clusters differently. A candidate cluster whose size is below YMIN is not required to satisfy the merging criteria. In fact, Ci and Cj are always merged when both candidate clusters contain fewer than YMIN objects. Regarding the cluster-size threshold, it is important to note the difference between the parameters H and YMIN: although both refer to cluster size, H is a real value used in calculating the dynamic merge threshold, while YMIN is an integer threshold that is compared directly with the cluster size.
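The small-cluster rule can be sketched in Python as follows. This is only an illustration under stated assumptions: `complete_linkage` uses the usual definition of complete linkage (the largest pairwise distance between two clusters), and `passes_criterion` is a hypothetical placeholder for the DID-based merging test, whose formula is not reproduced in this text. The function names and the `y_min` parameter name are likewise assumptions.

```python
def complete_linkage(cluster_a, cluster_b, dist):
    """Largest pairwise distance between members of two clusters."""
    return max(dist(a, b) for a in cluster_a for b in cluster_b)

def should_merge(cluster_a, cluster_b, y_min, passes_criterion):
    """Decide whether two candidate clusters are merged.

    passes_criterion is a hypothetical callable standing in for the
    probabilistic merging test, which is not shown in the text.
    """
    # Two candidates that are both smaller than YMIN are always merged.
    if len(cluster_a) < y_min and len(cluster_b) < y_min:
        return True
    # Otherwise the merge depends on the merging criterion.
    return passes_criterion(cluster_a, cluster_b)
```

For example, with `y_min = 3`, two singleton clusters are merged regardless of the criterion, while a three-member cluster is merged with another cluster only if the criterion accepts the pair.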

k-means++
The k-means algorithm is widely used in clustering to minimize the average squared distance between points in the same cluster. Although k-means is simple and fast, it offers no guarantee of accuracy [22], [23]. Adding a randomized seeding technique to k-means improves its accuracy. The accuracy of k-means depends heavily on the centroid values (C) chosen at the start of the calculation: different initial values of C give different results, and an inappropriate choice of C can require many iterations to determine cluster membership. The randomized seeding technique determines the values of C at the start of the calculation. Every data point has a chance of becoming a centroid, and the probability of each point is computed to find the most appropriate one. In the standard k-means++ formulation, a point x from the dataset X is selected as the next centroid with probability
$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$
where D(x) is the distance from x to the nearest centroid already chosen.
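The seeding step can be sketched in Python as follows. The sketch assumes the standard k-means++ rule, in which the first centroid is drawn uniformly at random and each later centroid is drawn with probability proportional to its squared distance to the nearest centroid chosen so far. The function name `kmeanspp_seeds` and its parameters are hypothetical, and the input is assumed to contain at least k distinct points.

```python
import random

def kmeanspp_seeds(points, k, dist, rng=None):
    """Pick k initial centroids from points using k-means++ weighting.

    dist is any distance function on two points; rng is an optional
    random.Random instance so that runs can be reproduced.
    """
    rng = rng or random.Random()
    centroids = [rng.choice(points)]  # first centroid: uniform draw
    while len(centroids) < k:
        # Squared distance of every point to its nearest chosen centroid.
        # Points already chosen get weight 0 and are not drawn again.
        weights = [min(dist(p, c) for c in centroids) ** 2 for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids
```

Because distant points receive larger weights, the initial centroids tend to be spread across the data, which is what reduces the sensitivity to the starting values of C described above.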

k-medoids
The k-medoids algorithm is a classic partitioning technique that clusters a dataset of n objects into k clusters, where k is known a priori. The algorithm works by minimizing the sum of dissimilarities between each object and its corresponding reference point (medoid). The k-medoids algorithm proceeds as follows [24], [25]:
1. Initialise k cluster centers (k is the number of clusters).
2. Assign each object to the nearest cluster using the Euclidean distance.
3. After calculating the Euclidean distance, select a non-medoid object in each cluster as a candidate for the new medoid.
4. Measure the distance between each object in each cluster and the candidate medoid.
5. Measure the total deviation (S) as the new total distance minus the old total distance. If S < 0, swap the medoid with the candidate to form a new set of k objects as medoids.
6. Repeat steps 3-5 until the medoids no longer change; the result is the set of clusters and their respective members.
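The steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it tries every non-medoid object as a swap candidate instead of picking one per cluster, which is an assumption made to keep the result deterministic once the initial medoids are fixed. The function name `k_medoids` and its parameters are hypothetical, and the distance function is passed in so that either the Euclidean distance or another measure can be used.

```python
import random

def k_medoids(points, k, dist, rng=None, max_iter=100):
    """Cluster points around k medoids; returns (medoid indices, labels)."""
    rng = rng or random.Random()
    medoids = rng.sample(range(len(points)), k)  # step 1: initial centers

    def assign(meds):
        # Step 2: attach each point to its nearest medoid and
        # accumulate the total distance of the assignment.
        labels, total = [], 0.0
        for p in points:
            d, m = min((dist(p, points[m]), m) for m in meds)
            labels.append(m)
            total += d
        return labels, total

    labels, total = assign(medoids)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for cand in range(len(points)):  # steps 3-4: try candidates
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                new_labels, new_total = assign(trial)
                if new_total - total < 0:  # step 5: deviation S < 0
                    medoids, labels, total = trial, new_labels, new_total
                    improved = True
        if not improved:  # step 6: stop when no medoid changes
            break
    return medoids, labels
```

Each label is the index of the medoid that the point belongs to, so the members of a cluster are the points that share a label.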

Experimentation dataset
The dataset is the UNIX command input history recorded during an exam in a programming class of 39 students. We assume that the log-data reflects many aspects, such as motivation and individual skill. Building a binary linear classifier that separates the students by level would require extracting teaching signals from this log-data.
We therefore consider a way to classify groups of students easily with unsupervised learning, without first preparing a quantitative evaluation machine. The outlier class is taken to be one subset of these groups. The UNIX command inputs are accumulated into time-series data at five-minute intervals and clustered with the k-medoids method, with k-means++ used to define the initial values. We then inspect the trend of each resulting cluster and identify the outlier cluster for every lesson. Figure 1 shows the command input ratio for each class, covering executables such as "ls", "cd", and "gcc". The graph shows that the ratio of commands used depends on the subject. The command ratio is therefore not appropriate as an input for evaluating performance across classes.
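The accumulation of command inputs into five-minute time-series data can be sketched in Python as follows. The sketch assumes that each logged command carries a timestamp in whole seconds and that "accumulated" means a running total per interval; the function name `cumulative_counts` and its parameters are hypothetical, since the format of the log-data is not described in the text.

```python
def cumulative_counts(timestamps, start, end, bin_seconds=300):
    """Running total of commands per five-minute bin in [start, end).

    timestamps, start and end are integer seconds (an assumption
    about the log format).
    """
    n_bins = (end - start + bin_seconds - 1) // bin_seconds  # round up
    counts = [0] * n_bins
    for t in timestamps:
        if start <= t < end:  # ignore commands outside the lesson
            counts[int((t - start) // bin_seconds)] += 1
    series, running = [], 0
    for c in counts:
        running += c  # accumulate so the series never decreases
        series.append(running)
    return series
```

A student who issues three commands in the first five minutes, one in the second interval and one in the fourth interval of a twenty-minute lesson yields the series `[3, 4, 4, 5]`. One such series per student is the input to the clustering step.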

Clustering methods for outlier detection
As an example, the results of clustering individual classes are shown in Figures 2 and 3. The number of clusters is set to five, labeled 0 to 4, because the number of inputs of the cluster medoids is low at the end of the lesson. In this setting, the clusters tend simply to separate the students by quantity of input. The exception is students whose number of inputs suddenly increases; they form a cluster of their own, as Figure 2 illustrates. When a cluster lying apart from the others contains less than 10% of the data, its tendency can be described as either: i) fewer command inputs than the other clusters, as shown in Figure 3; or ii) inputs that increase rapidly within a short period, as shown in Figure 2.
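One possible reading of the 10% rule is that a cluster is treated as an outlier cluster when it holds less than 10% of the students. Under that assumption, the check can be sketched in Python as follows; the function name `outlier_clusters` and the `max_share` parameter are hypothetical.

```python
from collections import Counter

def outlier_clusters(labels, max_share=0.10):
    """Return the cluster labels that hold less than max_share of the data."""
    sizes = Counter(labels)  # number of students per cluster
    n = len(labels)
    return sorted(c for c, size in sizes.items() if size / n < max_share)
```

With 39 students, a cluster needs at least four members to stay above the threshold, so a cluster of two students would be flagged and then inspected for the two tendencies described above.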

Outlier cluster interpretation
The students belonging to the outlying clusters are characterized here using each student's five-grade evaluation (A to E). Figure 1 belongs to a student whose evaluation is E, and Figure 4 belongs to a student whose evaluation is D. These typical cases are likely to be students who need support. Figure 2 applies to many students, especially those with evaluation grades A and C. It is possible that their programs did not work well: we detected a high number of command inputs during debugging work, or their task was finished early and they continued working during personal study. Figure 3 belongs to a cluster found in each class for one test. The evaluation concerns students with similar technique whose results are balanced even with low motivation. Because, in a test there are

Resolution concerning the amount of clusters
As expected, these methods are influenced more by the final number of inputs than by the way the number of command inputs increases during the analysis. This is expected to serve as the basis for monitoring and for forming the five groups in all analyses. However, for the students in Figure 3, more students tend to fall into the group with fewer inputs. Because clustering is unsupervised learning, the resulting data offer no single right answer for the number of clusters.

Transition regarding the number of clusters
As discussed earlier, the grouping in Figure 3 is based on the number of command inputs during the classes, except for students whose inputs increase unexpectedly within a short period. In this study, Cluster 0 consists of the students who have only a few inputs, while Cluster 4 consists of the students who have many inputs. Students can thus be placed along a number line: those to the left of the center point tend to have few command inputs, while those to the right have many. The study focuses on outlier detection, which is applied to identify trends and recurring behaviors across many classes and then to inspect the trends of each lesson.

Attitude investigation by questionnaires
We also administered a questionnaire to check what the outlying clusters mean. It included questions such as "How much did you understand today's content?" and "How much attention did you pay to today's lesson?" The answers are self-evaluations on a four-point scale, and students receive the questionnaire once during the classes and the exam. The results show no significant difference between the variation and the outlier clusters, since the answer scores were low in either case. For this reason, we cannot claim that our hypothesis is fully valid, and another method of evaluation is still required. Overall, our proposal serves mainly as a visualizer of student motivation at the start of unsupervised learning.

UNIX command log manual verification
We manually investigated different types of student behavior during lessons using the log-data of the outlier clusters. The investigation led to the following conclusions: i) Students whose keyboard input increased over a period of time were issuing the "gcc" command and then running the program. According to our investigation, some of these students were struggling with compilation errors or program bugs, while for others coding went well without any problems. We therefore cannot classify these students by the number of inputs and the duration of keyboard activity alone. ii) Students who pressed the keyboard only a few times may not have been able to complete the task and are at high risk of dropping out, as we predicted earlier. iii) For students whose number of inputs suddenly increased, the cause was source code pasted into the command line, which was recorded as keyboard input and UNIX commands; this makes the information useless if only the number of UNIX commands issued is examined. It is necessary to add information such as command values or execution results.

CONCLUSION
This study proposes a method for identifying students who are at risk of dropping out of school by grouping them with unsupervised learning and outlier detection. The data we use depend on the purpose of each lesson, which makes it difficult to build an evaluation engine. For this reason, we investigated the outlier groups by dividing them into three trends with predictable causes, so that students with learning problems can be detected as early as possible. However, the proposed prediction method still needs further development, and this research needs additional suitable methods, such as visualization of student behavior based on log-data.