Adaptive Weight Assignment Scheme For Multi-task Learning

Deep learning based models are now used routinely across a wide range of applications. Generally, we train a single model for a single task. However, under a multi-task learning setting we can train one model on several tasks at once, which offers benefits such as shorter training time, a single model serving multiple tasks, reduced overfitting, and improved performance. To train a model in a multi-task learning setting we must sum the loss values from the different tasks. In the vanilla multi-task learning setting all tasks receive equal weights, but since not all tasks are of similar difficulty, more weight should be allocated to the more difficult tasks; improper weight assignment degrades the performance of the model. In this paper we propose a simple weight assignment scheme that improves the performance of the model and puts more emphasis on difficult tasks. We evaluated our method on both image and textual data and compared its performance against two popular weight assignment methods. Empirical results suggest that our proposed method achieves better results than these popular alternatives.

INTRODUCTION
Since the beginning of the last decade, deep learning methods have been used extensively in various applications. Their reach has grown tremendously, not only in computer science but also in electrical, civil, and mechanical engineering and other fields. This is because deep neural networks (DNNs) have achieved human-level competence in applications such as image classification [1], question answering [2], lip reading [3], and video games [4]. DNNs can discover complex, hidden features of the input data without any assistance, whereas earlier models depended on hand-crafted features [5][6][7][8][9][10]. Human beings can perform multiple tasks simultaneously without harming the performance of any of them; we do this regularly and can decide which tasks may be done at the same time. That is why, in recent years, much attention has been given to multi-task learning with DNNs. Generally, a single model is devoted to performing a single task. However, learning multiple tasks jointly can increase the performance of the model while reducing training time and overfitting [11]. We often encounter small, insufficient datasets for individual tasks, but if the tasks are somehow related we can exploit this shared information to assemble a dataset that is effectively large enough, mitigating the problem. Current research in multi-task learning addresses new DNN architectures for the multi-task setting [12,13], deciding which tasks should be learned together [14], and how to assign weights to the loss values [15,16]. In this work we focus on a dynamic weight assignment technique that assigns different weights to the loss values at each epoch during training. We propose a new method for assigning weights to all loss values and evaluate it on two datasets, one from the image domain and one from the text domain.
The contributions of our research work are listed below.
• We propose an intuitive loss weighting scheme for multi-task learning.
• We evaluated our method in both the image and text domains, using two different datasets, to ensure that it performs well across domains.
• We compared our method against two popular weight assignment schemes.

RESEARCH METHOD
In this section we first discuss previous research performed in this field and then present our proposed method.

Literature Review
One of the earliest papers on multi-task learning was provided by R. Caruana [11]. In that work, the author explored the idea of multi-task learning and showed its effectiveness on different datasets, explaining how multi-task learning works and how it can be used with backpropagation. To train a DNN in a multi-task learning setting, we need to decide which layers of the network are shared among all tasks and which layers are used for individual tasks. Most previous work has focused on the concept of hard parameter sharing [17][18][19], in which the user defines the shareable layers up to a particular point, after which separate layers are assigned to each task. There is also the concept of soft parameter sharing, where each task has its own column of layers in the network and a special mechanism is designed to share parameters across the columns. Popular approaches of this kind are Cross-stitch [13] and Sluice [20] networks. A new approach named AdaShare has been proposed recently, in which the model learns dynamically which layers to share across all tasks and which to use for single tasks [14]. The authors also proposed a new loss function that ensures the compactness of the model as well as its performance.
Weight assignment is a crucial task in the field of multi-task learning. Previously, weights were either set to equal values or hand-tuned by the researchers [18,21,22]. However, in scenarios where the multi-task learning model must perform a large number of tasks, such approaches fall short. A method based on uncertainty was proposed by [15], and a revised version of this approach was later proposed by [12], in which the authors improved the uncertainty based method by adding a positive regularization term. The dynamic weight average (DWA) method was also proposed by [12]: the authors compute the relative change in loss values over the previous two epochs and apply the softmax function to these values to obtain the weights. [23] performed a comparative study of different weight assignment schemes; however, they did not study these methods in any domain other than images, and the dataset they used had only 2 tasks.
Int J Artif Intell ISSN: 2252-8938
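To make the DWA description above concrete, here is a minimal sketch in plain Python (the function name and the temperature value are illustrative; DWA applies a softmax, scaled by the number of tasks, over each task's loss ratio from the previous two epochs):

```python
import math

def dwa_weights(losses_prev, losses_prev2, temperature=2.0):
    """Dynamic weight average: weight each task by the softmax of its
    relative loss descent rate over the previous two epochs."""
    n = len(losses_prev)
    # r_t = L_t(k-1) / L_t(k-2): tasks whose loss shrinks slowly get a
    # ratio close to (or above) 1 and therefore a larger weight.
    rates = [a / b for a, b in zip(losses_prev, losses_prev2)]
    exps = [math.exp(r / temperature) for r in rates]
    total = sum(exps)
    # Scale by n so the weights sum to the number of tasks.
    return [n * e / total for e in exps]
```

When every task's loss shrinks at the same rate, all weights reduce to 1, recovering the vanilla setting.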

Adaptive Weight Assignment
Our proposed method is simple: it takes into account the loss value of each task in each epoch and, compared to other methods, is easy to implement. Generally, to train a model in a multi-task learning setting we sum all the loss values with their weights and then perform backpropagation to update the parameters of the model. This summation of losses can be expressed as

TotalLoss = Σ_{t=1}^{n} W_t · L_t,   (1)

where W_t corresponds to the weight of the loss and L_t represents the loss for task t. In the vanilla multi-task learning setting all the weights are set to 1. However, we must keep in mind that not all tasks are the same: some are more difficult than others, so we need to put more weight on difficult tasks to improve the performance of the overall multi-task learning system. That is why we propose Algorithm 1.

Algorithm 1 Adaptive weight assignment
1: for t = 1, 2, . . . , n do
2:   TempLoss += L_t
3: end for
4: for t = 1, 2, . . . , n do
5:   weights_t = L_t / TempLoss
6:   TotalLoss += weights_t × L_t × n
7: end for

Our algorithm is based on the simple observation that difficult tasks will have larger loss values than easier ones, so we should put more emphasis, i.e., weight, on the larger loss values while assigning less weight to the smaller ones. We take the sum of the loss values over all tasks and use it to compute the ratio by which each single task's loss contributes to the total loss. We then multiply this ratio by the total number of tasks: in the vanilla multi-task learning setting all loss values have weight 1, so the total weight is n for n tasks, and multiplying our ratios by n preserves this total. Finally, we use these weights in Eqn. (1) to compute the total loss for the multi-task learning model. Figure 1 provides a visual representation of the method. An important consideration when designing loss weighting schemes is that computing the weights should not take much time, because that would increase the training time.
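Algorithm 1 can be sketched in a few lines of plain Python (a framework-agnostic sketch; in actual training code the loss values would be the tasks' loss tensors):

```python
def adaptive_weights(losses):
    """Weight each task by its share of the total loss, scaled by the
    number of tasks so that the weights sum to n (matching the vanilla
    setting, where every task has weight 1)."""
    temp_loss = sum(losses)                     # lines 1-3 of Algorithm 1
    n = len(losses)
    return [n * l / temp_loss for l in losses]  # line 5, scaled by n

def total_loss(losses):
    """Eqn. (1): weighted sum of the per-task losses."""
    weights = adaptive_weights(losses)
    return sum(w * l for w, l in zip(weights, losses))
```

For example, with losses [2.0, 1.0, 1.0] the weights become [1.5, 0.75, 0.75], so the hardest task contributes twice the weight of each easier one while the weights still sum to 3.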
Table 1 reports the time required to execute these schemes, including our method. From the table we can see that, although our method is not the fastest way to compute the weights, it is certainly not the slowest; moreover, the time difference between the quickest method and ours is very small.
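The overhead of a weighting scheme can be measured in the obvious way; the sketch below is illustrative only (the repeat count and the timed scheme are our own choices, not the exact protocol behind Table 1):

```python
import time

def mean_runtime(scheme, losses, repeats=10_000):
    """Average wall-clock time of one call to a weighting scheme."""
    start = time.perf_counter()
    for _ in range(repeats):
        scheme(losses)
    return (time.perf_counter() - start) / repeats

# e.g. timing the vanilla scheme, which simply returns equal weights
vanilla = lambda losses: [1.0] * len(losses)
t = mean_runtime(vanilla, [0.5, 1.2, 0.9, 2.0, 1.1])
```

Because the weights are recomputed every epoch, even small per-call differences accumulate over a long training run, which is why this comparison matters.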

RESULTS AND DISCUSSION
We discuss the datasets, experimental setup, and results of the experiments in this section.

Dataset Description
We used two different datasets in our experiment: CIFAR-100 [24] and AGNews [25]. The former is image based and the latter is text based. Since these datasets are designed for single-task learning, we created artificial tasks for the multi-task learning setting: 5 different tasks from CIFAR-100 and 2 tasks from AGNews. All tasks were created from the original task labels by grouping different labels together, and the grouping was chosen to ensure that no class imbalance exists for any task.
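As one concrete illustration of this kind of label grouping (the group sizes below are hypothetical, not the exact tasks we built), each derived task can relabel the 100 CIFAR-100 classes into g equal-sized groups; because each g divides 100 evenly, a balanced original dataset yields balanced derived tasks:

```python
def derive_task_labels(fine_label, groups_per_task=(2, 4, 5, 10, 20)):
    """Map one original CIFAR-100 label (0-99) to one label per derived
    task; task t becomes a groups_per_task[t]-way classification problem."""
    return [fine_label % g for g in groups_per_task]
```

Since each of the g groups receives exactly 100 / g of the original labels, every derived task keeps the same per-class sample counts as the source dataset.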

Experimental Setup
We used two different DNN models in our experiment: wide resnet-28-10 (WRN) [26] for CIFAR-100 and a custom DNN for AGNews. We split the final layer of the models into 5 output layers for CIFAR-100 and 2 output layers for AGNews. We trained the WRN model for 100 epochs using the SGD optimizer with a learning rate of 0.001, together with a one-cycle learning rate scheduler [27]. To train on AGNews, we first tokenize the text and build a vocabulary dictionary from it, then embed the text to form the input to the model. Our custom DNN consists of two fully connected layers and was also trained with the SGD optimizer. To demonstrate the effectiveness of our method, we compared it against two state-of-the-art methods, namely dynamic weight average (DWA) and the uncertainty method, as well as against single-task learning and the vanilla multi-task setting.
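The head splitting described above can be sketched as follows, assuming PyTorch; this is a toy stand-in, with a small fully connected trunk rather than the actual wide resnet-28-10, and hypothetical per-task class counts:

```python
import torch
import torch.nn as nn

class MultiHeadNet(nn.Module):
    """Shared trunk with one output layer per task, mirroring how the
    final layer is split into 5 heads for the CIFAR-100 tasks."""
    def __init__(self, in_dim=3 * 32 * 32, feat_dim=128,
                 task_classes=(2, 4, 5, 10, 20)):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, feat_dim),
            nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, c) for c in task_classes
        )

    def forward(self, x):
        h = self.trunk(x)                         # shared representation
        return [head(h) for head in self.heads]   # one logit tensor per task
```

Each forward pass returns one logits tensor per task; the per-task cross-entropy losses are then combined into a single training loss via the weighted sum of Eqn. (1).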

Experimental Results
We discuss the performance of our method on the two datasets in this section. Tables 2 and 3 present the results of our overall experiment, and Figure 2 plots the test loss curves for both CIFAR-100 and AGNews.
Table 2 reports the results of the experiments on CIFAR-100, the image dataset. At the top we list results for all five tasks in the single-task learning setting; that is, five separate models were trained to obtain these results. Next, under the multi-task learning setting, we trained four methods on these tasks. In vanilla multi-task learning we assigned equal weights to each task at every epoch, while the other methods, Uncertainty, DWA, and ours, update the weights at each epoch. From this table we can see that our proposed method outperforms the other methods on three out of five tasks and achieves the second best performance on the remaining two. We also see that the multi-task learning models performed better than the STL models, and only a single model needed to be trained for all five tasks.
We then evaluate our method on AGNews, which contains textual data. Here we have two tasks; first we train two individual models, one per task, and then four multi-task learning models with different weight assignment schemes. We observe from the table that our proposed method performs best on one task and achieves the second best score on the other. Compared to the other popular methods, our proposed method performs considerably better. Looking closely at the values, the other methods fail to achieve the best results, and in some cases they even fail to outperform the single-task learning approach. We believe this is because the model architecture has a large impact on performance in multi-task learning settings: our experiment used a uniform DNN architecture for evaluation, but some tasks might need a few extra convolutional or fully connected layers. With further attention to the DNN architecture, the performance of our proposed method would likely improve on both tasks.
We believe that a simple approach should be taken when assigning weights. Since this step is performed at every iteration, an overly parameterized and complex approach might hinder the performance of the model and increase the time complexity.

CONCLUSION
Understanding and properly setting different hyper-parameters is extremely crucial for training a DNN model to obtain the best results. Multi-task learning has the upper hand over single-task learning in terms of the amount of data needed, training time, reduced overfitting, and increased model performance. Since not all tasks in a multi-task learning setting are of equal difficulty, assigning weights to the loss values is important to put more emphasis on difficult tasks. In this paper, we proposed a new weight assignment scheme that improves the performance of the multi-task learning model. Our proposed method outperforms other state-of-the-art weight assignment schemes in both the image and text domains and boosts the performance of the model.