Deep learning techniques for physical abuse detection

ABSTRACT


INTRODUCTION
Physical abuse is any non-accidental act that causes trauma, injury, or other physical suffering or bodily harm. Children are generally more likely to be victims of physical abuse, typically at the hands of parents, siblings, or caretakers. Similarly, older people may be physically abused by younger people. Elder abuse, neglect, or mistreatment is often perpetrated by a caretaker, who might be a paid professional or a family member. People in higher positions or in power may also physically abuse those in a lower cadre. The main forms of physical abuse are punching, kicking, slapping, and beating, all of which involve fast hand and leg movements. We can leverage emerging technologies such as computer vision and deep learning to tackle this problem, specifically the detection of child abuse by analysing live video feeds from places that are susceptible to physical child abuse, such as day-care centres, and from nanny cameras. Real-time recognition is of utmost importance in this context: detection, classification, and response must all be performed in real time so that an alert can be raised and an intervention made when a situation of physical abuse presents itself. Human action recognition and human pose estimation are two techniques used to semantically identify the actions happening in a video, usually by extracting spatial-temporal features from video frames. The major challenge is that an act of physical abuse against a child can be as simple as slapping or kicking them once, meaning the target action might occur only once in a relatively long video feed. The system must therefore be robust enough to identify the action accurately without generating false positives.
The goal of this work is threefold: to introduce a new way of calculating accuracy for spatio-temporal data with very few target labels; to carry out a comparison study of two different approaches to human action recognition, specifically for physical abuse of another person; and to propose a system that makes this a viable method of dealing with physical abuse.

RESEARCH METHOD
Traditional solutions for tackling child abuse have generally been geared towards location-based tools, which use mobile or tracking devices to keep track of a child's location but require manual intervention on the victim's part to report an incident of child abuse. The work in [1] presents a unified platform, with an application available on both mobile devices and desktops, that connects directly to the Zainab Alert Cell, an operating cell within ZARRA, a federal agency for reducing child kidnapping in Pakistan. Combined with GPS-powered wristbands, hairbands, or other accessories, this platform allows parents to keep an eye on their children and provides reporting functionality. The highlight of the proposed bands is a fixed distance, pre-set in the band as coordinates, that triggers an alert to the parents if the child has gone beyond it and could possibly be in danger. However, the system is location based and requires human intervention to identify the cause of danger.
A location-based solution using geofencing was proposed in [2]: a parent can define boundaries, or "fences", to monitor the location of children, and emergency messages are sent to local child protection services that are capable of monitoring children's activity. The solution relies on knowledge of the social environment to determine locations of interest and possibly harmful locations. Again, it depends on a location-reading device that stays with the child and operates accurately and reliably. In 2018, Shruthi [3] also provided a solution that depends on wearable devices with location tracking combined with alert systems. It requires a possible victim to send an SOS request using their own mobile device. The authors also propose a higher-functioning GPS that uses antennas placed on clothes as logos so that they remain disguised from potential abusers. In 2016, Jatti et al. [4] proposed a machine learning algorithm that evaluates signals to determine whether an individual is relaxed or stressed. If the individual is determined to be stressed, temporal-based action recognition is proposed to alert nearby authorities. However, the national statistics on child abuse for the year 2018 provided by the Children's Bureau of the US Department of Health and Human Services [5] show that 92% of the children were victimized by a parent. According to the perpetrator-victim relationship numbers provided in [6], 77.5% of the perpetrators in 2018 were a parent of the victim, 6.4% were a relative other than a parent, and 4.2% had a multiple relationship (such as a nanny or caretaker) with the victim. This means that 88.1% of the total reported cases were carried out by a person whom the victim knew, so there is a much lower probability of the child being moved to a location outside their designated safe area.
Hence, in recent times, approaches based on deep learning have become the most popular in research on vision-based human action recognition. They make it possible to analyse complex human-to-human interaction and can learn features ranging from simple to complex. With multiple layers of processing, a high-level representation of the given input video can be built. Deep learning approaches use weight sharing, local perception, down-pooling, and multiple convolution kernels to learn local information from part of an image. The result is the output of the final recognition layer, which is determined by the outcome of multiple layers of convolution. Action recognition is considered in [7] to be a combination of gestures; for instance, "running" can be determined as a combination of arm and leg gestures. The various challenges and approaches involved in human action recognition are discussed in detail in [8]. Specific spatiotemporal features for action recognition are considered in [9]. In 2010, Poppe [10] gave a detailed survey of vision-based action recognition systems, discussing the challenges of classification algorithms and image representation. Human action analysis techniques were reviewed in [11], where view-invariant approaches to pose detection and behaviour were discussed. In 2011, Weinland et al. [12] presented a survey of approaches to action recognition, representation, and segmentation in vision-based systems. Some datasets on human action and activity recognition are reviewed in [13].

Data set
The most challenging part of this work is creating data for physical abuse. There are no publicly available datasets that cover physical abuse in a child-parent environment. Although datasets such as UCF101 and the Kinetics action recognition dataset contain examples of actions like boxing and punching, their inputs are of much lower resolution and do not match the setting of long stretches of video feed in which a single action occurs, so a system trained on such data can be highly susceptible to false positives. Taking all these factors into consideration, the data was generated by recording video clips. To create the action data, video clips of varying lengths were recorded for the following targets: stand, kick and slap. Four clips were recorded for each target, two with one participant and two with two participants. The actions for our training data were performed in different orientations in each clip to account for as much of the variance in the positioning of people as would be present in a real-life video feed. Figure 1 shows frames from the recorded videos with different orientations and numbers of people. From each video, 36 short clips of 5 frames each were sampled. This is a single-view dataset: a single camera records human actions from a fixed angle without being moved. The dataset has 12 recorded actions, and each action was performed by either one or two individuals.
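As a rough illustration of the sampling step, the sketch below cuts one recorded video into 36 non-overlapping clips of 5 frames and checks the resulting dataset size. The text does not say how the clips were positioned within each video, so the evenly spaced start indices, the function name `sample_clips`, and the 300-frame video length are all assumptions made for this example.

```python
def sample_clips(num_frames, clips_per_video=36, frames_per_clip=5):
    """Return (start, end) frame index pairs for evenly spaced, non-overlapping clips."""
    needed = clips_per_video * frames_per_clip
    if num_frames < needed:
        raise ValueError(f"need at least {needed} frames, got {num_frames}")
    # Integer arithmetic keeps consecutive start indices at least 5 frames apart.
    last_start = num_frames - frames_per_clip
    starts = [i * last_start // (clips_per_video - 1) for i in range(clips_per_video)]
    return [(s, s + frames_per_clip) for s in starts]


clips = sample_clips(num_frames=300)   # 300 frames is an illustrative length only
total_clips = 12 * len(clips)          # 12 recorded videos, 36 clips each
print(len(clips), clips[0], clips[-1])  # 36 (0, 5) (295, 300)
print(total_clips)                      # 432
```

The total of 432 clips matches the 410 training and 22 validation examples reported for the 3D CNN experiment.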

3D convolution neural network (CNN)
In a 3D CNN architecture, features are extracted using multiple convolution operations applied at the same location on the input. Convolution and subsampling are then performed on each channel across adjacent frames, and the information from all channels is combined to form the final feature representation. 3D CNNs have seen increased adoption in recent years due to their ability to extract both high-level and low-level feature representations from images. A typical convolution operation can be defined as a mathematical function of an input I and a kernel K. The kernel K is spatially smaller than the input I, and the convolution produces a feature map that captures details about the image through the learnable weights associated with K.
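For reference, a common textbook form of the discrete 2D convolution of an input I with a kernel K is given below. The index names i, j, m and n are assumptions of this sketch and may differ from the notation originally intended; the sums run over the (small) support of K.

```latex
(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m,\, j - n)\, K(m, n)
```

In practice, most deep learning libraries compute the closely related cross-correlation, in which I(i - m, j - n) is replaced by I(i + m, j + n); the two differ only by a flip of the kernel, which is immaterial when the weights of K are learned.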
The dimension of the feature space after applying the filter is governed by formula (1). This operation allows us to map features in an image, such as edges and lines, onto the feature space, and it performs well on tasks like 2D image classification and object detection/localization. However, 2D CNNs perform poorly on data with underlying temporal features, such as video classification, because they cannot correlate temporal dependencies. This makes them unsuitable for video action recognition, which inherently relies on the system's ability to capture temporal dependencies from one frame to the next in a video clip. Instead, we can use 3D convolutions, which apply a 3-dimensional filter that extracts features along the three dimensions of width, height and time and are therefore better at capturing temporal features. Figure 2 shows a 3D convolution operation using a 3D filter. Note that the dimensions of the output feature map follow the same rule as (1) for calculating the output height, width and depth. The 3D convolution operation is defined analogously to the 2D case, with the sum extended over the temporal dimension.
Our 3D CNN is composed of three 3-dimensional convolutional layers and two fully connected layers. The input shape to the first layer is 5x360x480x1. Each layer's output is followed by a ReLU non-linearity, except for the last layer, which has 3 output nodes followed by a softmax non-linearity that generates output probabilities for each of our three classes. The last Conv3D layer is also followed by a 3D max-pooling layer to reduce the dimensions of the feature map. We used valid padding throughout the network. The kernels have shape [1x3x3]. Losses are calculated using the log loss function. We trained the model for 1 epoch using the Adam optimizer. For training, the dataset was split into 410 examples for training and 22 examples for validation. After training, we achieved a training accuracy of 93% and a validation accuracy of 99%.
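To make the output-size rule concrete, the following sketch traces the feature-map shape through the three convolutional layers described above. It assumes the commonly used rule out = floor((n + 2p - k) / s) + 1 per axis, a stride of 1, and valid padding interpreted as no zero-padding (p = 0). The 1x2x2 pooling window and the helper names are hypothetical, since the text does not give the pooling size, and the number of filters per layer is omitted for the same reason.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


def conv3d_output_shape(shape, kernel, padding=(0, 0, 0), stride=(1, 1, 1)):
    """Apply the rule independently to the depth (time), height and width axes."""
    return tuple(
        conv_output_size(n, k, p, s)
        for n, k, p, s in zip(shape, kernel, padding, stride)
    )


shape = (5, 360, 480)        # frames, height, width (the single channel is omitted)
for _ in range(3):           # three Conv3D layers with 1x3x3 kernels, valid padding
    shape = conv3d_output_shape(shape, kernel=(1, 3, 3))
conv_shape = shape

# Max pooling obeys the same rule with kernel = stride = pool size (1x2x2 assumed).
pooled_shape = conv3d_output_shape(conv_shape, kernel=(1, 2, 2), stride=(1, 2, 2))

print(conv_shape)    # (5, 354, 474)
print(pooled_shape)  # (5, 177, 237)
```

Because the kernel has depth 1, the temporal axis keeps all 5 frames, while each layer trims two pixels from the height and the width.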
We will compare our test results for both the techniques in a later section.

Human action recognition (HAR)
HAR classifies the activity being carried out by a human in a video. The challenge lies in the significant number of classifications that must be carried out each second in a video sequence and in the spatial-temporal nature of the data itself, i.e. a sequence of video frames over a period of time. Vision-based HAR has applications in human-computer interaction, health care, video surveillance, and many other areas. Typically, global representations have been used to represent a video or image and encode it as a feature, after which localization is performed and the region of interest is identified. However, this two-step process is not sufficient for real-time action recognition. In general, it is difficult to train the large 3D CNNs on 3-dimensional data that are typically needed to create models capable of action recognition [14]. Approaches to training large 3D convolutional networks are used in [15] to classify actions [16] in a video by treating stacked frames as the third input dimension of the 3D CNN, extracting temporal features from one frame to the next, and performing human action recognition [17]. A two-way convolutional approach, which combines optical flow with features extracted from the RGB input to generate a representation of motion in temporal space, is presented in [18]. In another approach, features are extracted from images using a CNN [19], and these feature vectors are then used as input to an RNN [20] to classify long-term dependencies in a video sequence; the architecture was coined LRCN, or long-term recurrent convolutional networks. The approach in [21] was developed at Facebook and currently holds SOTA results on many popular activity recognition datasets, such as THUMOS, UCF101, Kinetics and HMDB. However, a general problem with these approaches is the underlying datasets on which they are trained.
Models trained on these datasets do not generally translate well to a human-child interaction environment and do not scale well to the input sizes encountered with webcam feeds. This makes them very slow when carrying out activity recognition in real time. This is where this work makes use of human pose estimation. The approach is similar to [22] in that it takes the human skeletons generated by human pose estimation and uses them as input to the model. The approach discussed in [23] also offloads most of the heavy computing to the task of generating the human pose, and [24] uses a simple network architecture to make predictions from human pose features instead of using a large architecture to carry out human action recognition on the original input itself. This makes the approach very promising for real-time systems with very limited fault tolerance. Human pose estimation using CNNs is handled in [25], especially for occlusion, using detection and regression. A multi-context attention mechanism is used in a CNN [26] to gain the ability to focus on different granularities. Learning feature pyramids are designed in deep CNN models to handle scale changes [27].

Human pose estimation
Human pose estimation localizes the joints of a human in an image or a video and uses them to estimate the pose of that person. It consists of jointly detecting keypoints on the body, hands, face and feet. Typically, there are two approaches to human pose estimation: top-down and bottom-up. The top-down approach localizes the humans in the frame using a bounding-box detector and then estimates the pose of the single person in each box, whereas the bottom-up approach starts by localizing the keypoints in the image and then groups those keypoints into instances of a person. The locations of these joints and their movement from one frame to the next can be extracted as spatial-temporal features to feed into a neural network.
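The idea of turning joint locations and their frame-to-frame movement into features can be sketched as follows. This is a minimal illustration under stated assumptions, not the feature set used in this work: the function names, the (x, y) tuple layout, and the choice of raw positions plus per-joint displacements are all hypothetical.

```python
def joint_velocities(prev_joints, curr_joints):
    """Per-joint (dx, dy) displacement between two consecutive frames."""
    return [
        (cx - px, cy - py)
        for (px, py), (cx, cy) in zip(prev_joints, curr_joints)
    ]


def clip_features(frames):
    """Flatten the joint positions of every frame, then the displacements
    between each pair of consecutive frames, into one feature vector."""
    features = []
    for joints in frames:                         # spatial part
        for x, y in joints:
            features.extend((x, y))
    for prev, curr in zip(frames, frames[1:]):    # temporal part
        for dx, dy in joint_velocities(prev, curr):
            features.extend((dx, dy))
    return features


# Toy clip: 3 frames with 2 joints each (a real skeleton has more joints).
frames = [
    [(0, 0), (10, 10)],
    [(1, 0), (10, 12)],
    [(3, 0), (10, 15)],
]
features = clip_features(frames)
print(len(features))   # 20
print(features[-8:])   # [1, 0, 0, 2, 2, 0, 0, 3]
```

The last eight values are the displacements: the first joint moves right by 1 and then 2 pixels, while the second moves down by 2 and then 3 pixels, which is the kind of fast limb motion a classifier could pick up on.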
OpenPose [22] is a library developed by CMU that identifies keypoints on the human face, hands, legs and body. It performs keypoint detection for multiple persons in a single image or video. It was trained on the CMU Panoptic Studio dataset, is capable of both 2D and 3D real-time multi-person keypoint detection, and draws a skeleton overlaid on each human in the image. The generated skeleton data is used by our motion-estimation algorithm to classify the different actions being carried out by a human in the video feed.
The generated skeleton has 18 joints covering areas of the head, arms, neck and legs, and is shown in Figure 3. OpenPose follows the bottom-up approach and uses Part Affinity Fields, a set of 2D vector fields that encode the location and orientation of limbs over the image domain. These fields are able to encode unstructured pairwise relationships between the body parts of a variable number of people present in the image.
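To illustrate how a Part Affinity Field can express the association between two detected joints, the sketch below approximates the line integral of the field along the segment joining two joint candidates, which is the association measure described for OpenPose. The callable field representation, the function names, the sample count, and the toy field are assumptions chosen to keep the example self-contained; they do not reproduce the library's internals.

```python
import math


def paf_score(paf, joint_a, joint_b, num_samples=10):
    """Approximate the line integral of a Part Affinity Field between two joints.

    `paf` is any callable mapping an integer pixel (x, y) to a 2D vector (vx, vy).
    The score is the mean dot product between the field and the unit vector
    pointing from joint_a to joint_b, sampled at evenly spaced points.
    """
    ax, ay = joint_a
    bx, by = joint_b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return 0.0
    ux, uy = (bx - ax) / length, (by - ay) / length   # unit vector a -> b
    total = 0.0
    for i in range(num_samples):
        t = (i + 0.5) / num_samples                   # midpoint of each sub-segment
        x = round(ax + t * (bx - ax))
        y = round(ay + t * (by - ay))
        vx, vy = paf(x, y)
        total += vx * ux + vy * uy
    return total / num_samples


def toy_paf(x, y):
    """Toy field: a horizontal limb along the row y = 0, pointing in +x."""
    return (1.0, 0.0) if y == 0 and 0 <= x <= 10 else (0.0, 0.0)


aligned = paf_score(toy_paf, (0, 0), (10, 0))    # candidate pair along the limb
crossing = paf_score(toy_paf, (5, -5), (5, 5))   # candidate pair across the limb
print(aligned, crossing)
```

A pair of joints lying along the limb scores close to 1, while a pair whose connecting segment crosses the limb at a right angle scores 0, so the first pairing would be preferred when grouping keypoints into people.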
First, a feed-forward neural network predicts a set of 2D confidence maps S of body part locations and a set of 2D vector fields L of Part Affinity Fields that encode the degree of association between different body parts of the skeletons. S has j confidence maps, one per part, while L has c vector fields, one per limb. Figure 4 shows the design of OpenPose's multistage architecture for generating heatmaps of the skeleton poses. The image is analysed by a CNN (the first 10 layers of VGG-19 as a base), generating feature maps F that act as input to the first stage of the network architecture. In the first stage, one branch predicts a set of confidence maps S1 = ρ1(F) while the other branch predicts a set of Part Affinity Fields L1 = φ1(F), where ρ1 and φ1 refer to the CNNs used for inference at Stage 1. In the following stages, the predictions from both branches of the previous stage are concatenated with the original feature maps F and used to produce refined predictions.
After experimentation, features 3, 4 and 5 were used and concatenated to form a feature vector of dimension 314. PCA was applied to reduce this vector to 50 dimensions. The reduced vector is then used as input to a neural network consisting of 2 hidden layers, each with 100 nodes, and an output layer of 3 nodes that serves as our predictions. This network is the standard multilayer perceptron provided by the scikit-learn library for neural networks.
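The classification stage described above (a 314-dimensional feature vector, PCA down to 50 dimensions, then a network with two hidden layers of 100 nodes and 3 outputs) could be assembled in scikit-learn roughly as follows. This is a sketch under assumptions: the random placeholder data stands in for the real pose features, and `max_iter` and `random_state` are illustrative settings that the text does not specify.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Placeholder data: 120 random feature vectors of dimension 314, three classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 314))
y = np.arange(120) % 3          # 0 = stand, 1 = kick, 2 = slap (illustrative coding)

model = make_pipeline(
    PCA(n_components=50),                          # 314 -> 50 dimensions
    MLPClassifier(hidden_layer_sizes=(100, 100),   # two hidden layers of 100 nodes
                  max_iter=300, random_state=0),
)
model.fit(X, y)

proba = model.predict_proba(X[:4])   # one probability per class for each sample
print(proba.shape)                   # (4, 3)
```

Wrapping PCA and the classifier in one pipeline ensures that the projection learned on the training data is reused unchanged at prediction time.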

RESULTS AND DISCUSSION
A video clip can be classified into one of 3 classes: standing, kicking and slapping. The actions are further split into two categories, GOOD and BAD. The actions of kicking and slapping belong to the category BAD, and the action of standing belongs to the category GOOD. The main target for the system is to identify whenever a bad action occurs in a video and, equally, not to classify the absence of any action as a bad action. Based on this, the metrics are defined as in Table 1. Using these target metrics, a novel way of prioritizing the weight of the actual prediction is developed, as explained in section 8.

Test data
For testing purposes, separate video clips were recorded: five clips for each class, each 5 seconds long. Each video clip in the test dataset starts with the person in a standing position. If the video belongs to the BAD category, the corresponding action is carried out once in the clip. For videos belonging to the GOOD category, the subject(s) remain standing throughout the test clip. Results for the experiments are detailed in the next two subsections. We again recorded these test clips with the person in different orientations to account for the variability in the positioning of people in a real-life video feed, as shown in Figure 5.
HARecognizer converts the 5-second video clip into a folder of images, on which predictions are then made, and afterwards converts those images back into a playable video. Figure 6 shows some frames from the prediction video on the test dataset. The scores on the left indicate the confidence score for each class that the model has been trained on. Figure 6 shows results where (a) the person is standing in a front-facing orientation and (b) the model correctly identifies the frames in which the kick action is being performed.
Figure 7 shows frames from a prediction video in which the person has a slightly different orientation from the first set. The model starts off by correctly identifying the person as performing a standing action. The person then starts to perform a punch action, which is again correctly identified as a punching action. However, in the third set of the image, the model momentarily misidentifies it as a kicking action. So, although the model identifies that some sort of action is occurring, it gives it the wrong label. In the next section we describe in depth how to deal with these predictions and propose new metrics for comparing different methods.
If a clip is marked as a true negative, the accuracy for the target label stand is calculated. It is then multiplied by the true negative to get a weighted true negative.

Weighted True Negative = Accuracy (stand) × True Negative
If the target belongs to the other category (the label is punch or kick), we mark the entire example as one count of true positive if c + d > 0, or as a false negative if c + d = 0. However, this does not account for whether the actual prediction is correct for the given target. To account for this, we compute an accuracy for the particular example. If the video is marked as a true positive, we multiply this accuracy by the true positive to get a weighted true positive. Weighted recall and weighted precision are then computed from these weighted counts. The recorded metrics mentioned above for this approach are listed in Table 2. Note that the results are rounded to two decimal places.
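The clip-level bookkeeping described above can be sketched in a few lines. Because the per-example accuracy formula and the weighted precision and recall formulas are not reproduced in the text, everything about the weighting here is a hypothetical stand-in: a true positive is weighted by the share of BAD frame predictions that carry the correct label, a true negative is given weight 1, and the weighted true-positive total simply replaces the raw count in the usual precision and recall ratios. The labels kick and slap follow the three classes defined earlier.

```python
BAD = {"kick", "slap"}


def score_clip(target, frame_predictions):
    """Return (outcome, weight) for one test clip from its per-frame predictions."""
    bad_frames = [p for p in frame_predictions if p in BAD]
    if target in BAD:
        if not bad_frames:                    # c + d = 0: the action was missed
            return "FN", 0.0
        # Hypothetical weight: share of BAD frames that carry the correct label.
        correct = sum(1 for p in bad_frames if p == target)
        return "TP", correct / len(bad_frames)
    if bad_frames:                            # a BAD action reported on a GOOD clip
        return "FP", 0.0
    return "TN", 1.0                          # assumed weight for a clean GOOD clip


def weighted_metrics(clips):
    """Weighted precision and recall over (target, frame_predictions) pairs."""
    weighted_tp, fp, fn = 0.0, 0, 0
    for target, predictions in clips:
        outcome, weight = score_clip(target, predictions)
        if outcome == "TP":
            weighted_tp += weight
        elif outcome == "FP":
            fp += 1
        elif outcome == "FN":
            fn += 1
    precision = weighted_tp / (weighted_tp + fp) if weighted_tp + fp else 0.0
    recall = weighted_tp / (weighted_tp + fn) if weighted_tp + fn else 0.0
    return precision, recall


clips = [
    ("stand", ["stand"] * 5),                        # TN
    ("kick", ["stand", "kick", "kick", "stand"]),    # TP, weight 1.0
    ("slap", ["stand", "slap", "kick", "stand"]),    # TP, weight 0.5
    ("slap", ["stand"] * 4),                         # FN
    ("stand", ["stand", "kick", "stand"]),           # FP
]
precision, recall = weighted_metrics(clips)
print(round(precision, 2), round(recall, 2))   # 0.6 0.6
```

In this toy run the clip that is flagged as BAD but momentarily mislabelled counts as only half a true positive, which is the behaviour the weighting is meant to capture.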

CONCLUSION
The results indicate that an approach based on human pose estimation gives good results when developing a system that is robust to false positives and false negatives. They also suggest that the model is well suited to real-time environments, especially in situations where the movement of humans can be erratic or unpredictable and where even false positive predictions can be harmful to the parties present in the scene at that moment. The weighted metrics provide a more reliable measure, with an accuracy of about 89.42% and a precision of about 85.82%, and show the effectiveness of the system.
Deployment and future work: Actual deployment is another important aspect of the proposed deep learning model. Since this is a compute-heavy model, it is not easy to run on low-end devices such as a Raspberry Pi, where the model would run independently on each device. A better approach would be to stream the live video feed to an offboard server or to a cloud-based system offered by a provider such as Google Cloud Platform or AWS. This would move all of the compute needs to a separate location and allow us to install cameras that stream to a remote server. The choice of cloud service would depend on whichever is most cost effective, since the models would need to be deployed on GPU-based machines to provide relatively good performance. As part of our future work, we hope to develop a faster pose estimation model by carrying out a more extensive feature engineering process and determining which combination of features can further increase the speed and accuracy of the system.
We hope to use this model to evaluate the performance gains against the loss of detection accuracy, since the main goal is to deploy such a system on lower-end devices like the Raspberry Pi, which typically do not have much compute to offer, especially for the complex computations in our deep learning models. This can secure the real-time behaviour we are aiming for while keeping accuracy losses within acceptable limits. We also hope to develop more comprehensive accuracy metrics that can serve as a benchmark when comparing different systems for real-time activity recognition.