Classification of dances using AlexNet, ResNet18 and SqueezeNet1_0

ABSTRACT


INTRODUCTION
Dancing is a performative art based on the movements and actions of the performers. Professional dancers can identify and distinguish between many types of dance based on their professional experience. Amateur dancers, on the other hand, may have difficulty distinguishing a type of dance because of the wide variety of dances in the world of performance art. A professional ballerina, for example, can easily assess how accurately a performance represents ballet, while an amateur would find this difficult. Separately, deep learning models have been used to classify human actions and movements [1]. Dancing differs slightly from the human actions classified in previous work because there is a measure of how closely the dancer's movements match what is generally considered to be the style of a specific dance type [2]. It is therefore difficult for a non-expert to assess how accurate a dance is with respect to its generally understood interpretation. This paper introduces the use of deep learning models to classify dances according to several dancers' general interpretation of a class of dance, and to measure how accurate a dance is with respect to the models' understanding of that dance class. We propose using the AlexNet [3], ResNet18 [4] and SqueezeNet1_0 [5] deep learning models to classify dances and to evaluate the accuracy of a dance with respect to its class.
Several works have classified human movements using deep learning models, such as those proposed by Yildirim and Çinar [6], Kumar and Harikiran [7] and Zamri et al. [8]. However, those works share a similar method whereby the authors used single images of a human performing an action to train their deep learning models. This method generalises the action in a video into an image classification problem [9]-[11]. Alternative methods for human action recognition that use temporal gradients of action have also been proposed by Hutchison et al. [12]. The authors demonstrated that by using the temporal structures of motion segments in an activity for action recognition, they can achieve an average precision of 72.1% on a sports action dataset [12]. We propose that, instead of using single images to classify an action, multiple frames from an action sequence should be used to classify dances. Such an approach has been demonstrated for assessing the quality of other human actions [14], [15].
Deep learning models have also been applied to dance videos by several authors. Wang et al. [16] proposed the use of a dataset of viral dance videos to predict the virality of a dance video. The authors introduced a relational temporal convolutional network (RTCN) that predicts virality by capturing temporal dynamics from the appearance of the video. Based on their study, the facial, scenic, and holistic appearance of a dance video are important factors in predicting its virality [16]. Our method adopts a different approach from these works: we use a dance video dataset with deep learning models that incorporate multiple frames of each dance video, both to classify the type of dance and to measure how similar the dance video is to others of its type.
Int J Artif Intell, ISSN: 2252-8938. Classification of dances using AlexNet, ResNet18 and SqueezeNet1_0 (Khalif Amir Zakry)

METHOD
In our proposed method, we used our own dance video dataset with FastAI, a machine learning framework that provides high-level and low-level components for classifying multi-modal datasets [17]. The dataset consists of dance videos obtained from TikTok, an online public social media platform, and from our research participants, who were tasked with performing their interpretation of the dance type that we had labelled. The dataset consists of 240 videos in 12 labelled dance classes, each class representing a single type of dance. Each video belongs to exactly one class; no video is multi-classed. Twenty human participants were asked to perform each of the 12 types of dances. A reference video of each dance was provided, and the participants were asked to perform their own interpretation of the dance based on the reference video. The resulting dance videos are of varying quality, with slight variations in color grading, background, and image quality. Several of the videos also include TikTok video filters, which drastically change the visual scene and reduce the visual fidelity of the video by introducing noise, as shown in Figure 1 and Figure 2.
For our deep learning classification experiment with FastAI, frames were extracted at one-second intervals from each video using FFmpeg (Fast Forward Moving Picture Experts Group), an open-source multimedia framework [18]. This generated a total of 17,389 images, each categorized into the dance class of the video from which it was extracted. Using a combination of an automatic tool and manual review, images that contain noise or lack any subject context, such as empty frames, were filtered out and discarded. Each dance class has a different number of images because of differences in video length and the automatic and manual removal of frames; the number of frames extracted for each label category is given in Table 1. This extraction method provided us with numerous image frames that represent the various action frames of the participants from every dance video in each dance class. For example, 1,138 frames of dancing actions can be derived from the 20 videos in the "Say So" dance class. Sample frames are shown in Figure 3.
We then experimented with training the deep learning models AlexNet, SqueezeNet1_0 and ResNet18. Before training each pre-trained model, 20% of the total frames were set aside as the validation set, while the remaining 80% were used as the training set. For consistency, the number of training epochs for each pre-trained model was set to 10. The line plots in Figure 4 show a consistent increase in validation accuracy with every consecutive epoch for SqueezeNet1_0 and ResNet18. The results for AlexNet differ: validation accuracy drops by 0.0575% between the 9th and 10th training epochs. This can be attributed to overfitting of the model after the 9th training epoch. Overfitting is indicated when the error rate goes up because the model is memorising the data [19].
Considering the overall findings, the model trained with ResNet18 provides the best final accuracy on its validation set, at 97.6416%. The average final-epoch accuracy across all 3 models is 94.9669%, which is good for predicting the type of dance for image frames in their respective validation sets. However, as mentioned in [20], these 3 trained models might have difficulty with a similar dataset that was pre-processed differently. As mentioned in [21], even though accuracy is a widely used performance metric for evaluating classification models, it might not be the most suitable in some cases. As such, we also use the confusion matrix as another evaluation metric for this experiment.
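The frame extraction and the 80/20 split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual script: the file paths, the output naming pattern, the random seed and the helper names `ffmpeg_frame_command` and `split_frames` are all hypothetical. The only FFmpeg feature relied on is the `fps` video filter, which at `fps=1` emits one frame per second of input.

```python
import random
from pathlib import Path


def ffmpeg_frame_command(video: Path, out_dir: Path) -> list[str]:
    """Build an FFmpeg command that saves one frame per second of a video."""
    # Hypothetical naming scheme: <video name>_0001.jpg, <video name>_0002.jpg, ...
    pattern = out_dir / f"{video.stem}_%04d.jpg"
    return ["ffmpeg", "-i", str(video), "-vf", "fps=1", str(pattern)]


def split_frames(
    frames: list[str], valid_pct: float = 0.2, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Shuffle the frames and hold out valid_pct of them for validation."""
    shuffled = sorted(frames)              # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (the seed is an assumption)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]  # (training set, validation set)


# Hypothetical input video and output folder for one dance class.
cmd = ffmpeg_frame_command(Path("videos/say_so/p01.mp4"), Path("frames/say_so"))
print(" ".join(cmd))  # run it with subprocess.run(cmd, check=True)

# Placeholder file names standing in for the 1,138 "Say So" frames.
frames = [f"say_so_{i:04d}.jpg" for i in range(1138)]
train, valid = split_frames(frames)
print(len(train), len(valid))
```

The command is only built here, not executed, so the sketch runs without FFmpeg installed. The split is made per frame, as the text describes; whether frames from the same video were kept together is not stated in the paper.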
As described in [22], a confusion matrix is a performance measure that summarizes the performance of a classifier with respect to some test data. In our experiment, the confusion matrices for all three models, shown in Figures 5-7, also show a generally good outcome, whereby correct predictions far outnumber the incorrect predictions made by each model on its validation set. This is visible as the dark blue diagonal running from the top left to the bottom right of each confusion matrix. It indicates that the predictive capability of each model on its validation set is high.
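To make the reading of a confusion matrix concrete, the sketch below builds one from lists of actual and predicted labels, computes accuracy as the share of counts on the diagonal, and lists the off-diagonal cells. The labels are toy data, not results from the experiment, and the function names are illustrative.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy(matrix):
    """Share of all predictions that fall on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def most_confused(matrix, classes):
    """Non-zero off-diagonal cells as (count, actual, predicted), largest first."""
    cells = [
        (matrix[i][j], classes[i], classes[j])
        for i in range(len(classes))
        for j in range(len(classes))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(cells, reverse=True)


# Toy labels for illustration only.
classes = ["Say So", "Supalonely", "Lottery"]
actual = ["Say So", "Say So", "Supalonely", "Supalonely", "Lottery", "Lottery"]
predicted = ["Say So", "Supalonely", "Supalonely", "Supalonely", "Lottery", "Lottery"]

cm = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, cm):
    print(f"{name:<12}{row}")
print("accuracy:", accuracy(cm))
print("most confused:", most_confused(cm, classes))
```

A strong diagonal, as in Figures 5-7, corresponds to a high `accuracy` value, while the off-diagonal cells identify which pairs of dance classes are mistaken for each other.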
For AlexNet, the highest occurrence of misclassification is between Supalonely and Say So. For SqueezeNet1_0, the highest occurrence of misclassification is among Big Up's, All TikTok Mashup, Say So, Lottery and The Dance Song. For ResNet18, the highest occurrence of misclassification is between Diam Diam Menyukaiku and Tak Mau Mau. We theorise that these misclassifications occur because these dances share similar dance motions, which makes their frames visually similar. Most of the participants captured the videos for each dance class with the same clothing and background, so motion is important information for discriminating between classes. The confusion matrices of all 3 deep learning models show that the models can be used to classify dances. They can classify with high accuracy which dance class an image in the validation set belongs to, based on what they have learned from the training set. This method can be used to create further programs and applications that enable users such as amateur dancers to assess the accuracy of their dance performance against the general interpretation of other dancers in that dance class.
The performance of the CNN of Kishore et al. [23] for classifying Indian dances with different convolutional filter sizes can be compared to our approach of using pre-trained models for classifying modern dances. Kishore et al. [23] achieved a 93.33% average accuracy using their own proposed deep learning CNN model on their dance dataset, which is only 1.6369 percentage points lower than our 94.9669% average accuracy with AlexNet, ResNet18 and SqueezeNet1_0 on our dataset. Although the approaches differ, and although the dataset of Kishore et al. consists of a different type of dance from ours, the comparison supports the evidence that deep learning models can produce a good average accuracy for classifying dances.
We can also compare our results with the approach proposed by Li [24], who used a deep CNN model based on differential evolution (DE-CNN) to classify dances. The DE-CNN approach achieved an average accuracy of 92.75%, which is 2.2169 percentage points lower than that of our approach. This comparison indicates that our approach is viable and can achieve marginally better classification accuracy for dances. However, it must be noted that the datasets of dances used in the other works are different from ours.
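As a quick check of the comparisons above, the gaps can be recomputed from the average accuracies quoted in the text (94.9669%, 93.33% and 92.75%). The snippet is plain arithmetic with illustrative variable names; the gaps are differences in percentage points, not relative percentages.

```python
# Average accuracies (%) as quoted in the text.
ours = 94.9669
reported = {"Kishore et al. [23]": 93.33, "Li [24]": 92.75}

# Gap in percentage points between our average accuracy and each reported one.
gaps = {name: round(ours - acc, 4) for name, acc in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.4f} percentage points")
```

Because the three works use different datasets, these gaps are only indicative and do not amount to a controlled comparison.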

Discussion on the automatic removal of extremely noisy frames inside the TikTok dataset
Using a tool provided with the FastAI framework, we can identify frames with a high loss based on the outcome of our training. This is how we locate extremely noisy frames for possible manual removal. See Figure 8 for some sample frames with high loss values.
We identified a recurring pattern whereby the top 9 frames with the highest loss for one model can also reappear for the others. For example, Frame A is the frame with the highest loss for the model trained with AlexNet, and it is also the frame with the highest loss for the model trained with SqueezeNet1_0. The high loss for Frame A can be attributed to the highly noisy image quality of the frame. These frames must be removed because, even using human judgement, it is impossible to ascertain the dance class for Frame A, Frame B and Frame C, which show little to no useful characteristic that can be used to identify a dance. Several actions can be taken when dealing with noisy data; in [25], techniques such as ignoring the noise and filtering the noise by removal were used. As demonstrated in [26], an increase in noise can reduce the accuracy of classification. As such, noise can and does impact the performance of a machine learning model. We observed several frames in this dataset that could be excluded from training because they provide very little relevance to the classification of a dance video. These frames were duly removed from the dataset.
Int J Artif Intell, Vol. 12, No. 2, June 2023: 602-609, ISSN: 2252-8938
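The idea of ranking frames by loss and looking for frames that recur across models can be sketched in plain Python. This does not reproduce the FastAI tool; it assumes a cross-entropy loss per frame, and the frame names, loss values and function names are invented for illustration.

```python
import math


def cross_entropy(probs, true_index):
    """Loss for one frame: negative log of the probability given to the true class."""
    return -math.log(max(probs[true_index], 1e-12))  # clamp to avoid log(0)


def top_loss_frames(losses, k=9):
    """Names of the k frames with the highest loss, highest first."""
    return sorted(losses, key=losses.get, reverse=True)[:k]


def shared_top_losses(losses_by_model, k=9):
    """Frames that appear in the top-k loss list of every model."""
    tops = [set(top_loss_frames(model_losses, k)) for model_losses in losses_by_model.values()]
    return set.intersection(*tops)


# Invented per-frame losses for two models (illustration only).
losses_by_model = {
    "AlexNet": {"frame_a": 9.1, "frame_b": 7.4, "frame_c": 0.2, "frame_d": 5.0},
    "SqueezeNet1_0": {"frame_a": 8.3, "frame_b": 0.1, "frame_c": 0.3, "frame_d": 6.2},
}

shared = shared_top_losses(losses_by_model, k=2)
print(shared)  # candidates for manual review
```

A frame that ranks among the highest losses for every model, like `frame_a` in this toy data, is a natural candidate for manual review, since the difficulty is then more likely to lie in the frame itself than in any one model.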

CONCLUSION
In conclusion, we proposed and experimented with a deep learning approach that uses CNN models to classify dances according to how closely a dance matches other similarly labelled dances. The results demonstrate that deep learning models can be used to classify dances, and the average accuracy of our approach across 3 different deep learning models is 94.9669%.

Figure 1. Sample image from a video with "Clone filter" applied
Figure 2. Sample image from a video with "Disco filter" applied

Figure 3. Each image represents a single frame that has been extracted from a class of dance

Figure 8. Three sample frames with high loss

Table 1. The number of frames that have been extracted for each label category