Spike neuron optimization using deep reinforcement learning

Received Sep 22, 2020; Revised Jan 17, 2021; Accepted Feb 9, 2021

Deep reinforcement learning (DRL), which combines reinforcement learning with artificial neural networks, allows agents to take the best possible actions to achieve goals. Spiking neural networks (SNN) are difficult to train because the spike function of a spike neuron is non-differentiable. To overcome this difficulty, deep Q network (DQN) and deep Q-learning with normalized advantage function (NAF) agents are proposed to interact with a custom environment. DQN is applied to a discrete action space, whereas NAF is applied to a continuous action space. With both algorithms, the model is trained and tested to validate its ability to balance the firing rates of the excitatory and inhibitory populations of a spike neuron. Training results showed that both agents are able to explore the custom environment built with the OpenAI Gym framework. The trained models for both algorithms are capable of balancing the firing rates of the excitatory and inhibitory populations of the spike neuron. NAF achieved an average percentage error of 0.80% between the target and actual neuron firing rates, whereas DQN obtained 0.96%. NAF attained the goal faster than DQN, taking only 3 steps for the actual output neuron rate to meet or come close to the target neuron firing rate.


INTRODUCTION
Deep reinforcement learning (DRL) combines machine learning and artificial intelligence techniques [1]. In DRL, reinforcement learning algorithms are implemented with deep neural networks to select the best possible action to attain goals. A DRL agent interacts with a virtual environment, as shown in Figure 1, and selects actions to solve complex problems [2]. Instead of a lookup table, the agent uses a deep neural network to approximate a value or policy function in order to update and index the data, which consists of states, actions and rewards. The agent takes actions based on the current state and reward in the virtual environment [3], and receives rewards or penalties based on the actions performed. The agent receives a positive reward when the outcome moves closer to the target and a negative reward when a faulty action is taken. The agent learns from experience to decide the most suitable action to attain a goal [4].
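For contrast with the lookup table mentioned above, the following is a minimal sketch of a single tabular Q-learning update, the rule that DRL replaces with a neural-network approximation. The function name, the learning rate `alpha` and the discount factor `gamma` are illustrative assumptions, not values taken from this work.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state,
                      n_actions, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Value of the best action available in the next state.
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    # Temporal-difference target: immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the stored estimate a fraction alpha towards the target.
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# The lookup table: every (state, action) pair starts at 0.0.
q_table = defaultdict(float)
new_value = q_learning_update(q_table, state=0, action=1, reward=1.0,
                              next_state=1, n_actions=2)
```

A table like this needs one entry per state-action pair, which is impractical when the state is a continuous firing rate; approximating the Q function with a network avoids that.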
A spike neuron is the elementary unit of a spiking neural network (SNN) and is characterized by its spiking behaviour. When a spike neuron fires, a spike is generated by a spike generation function. Spikes are sequences of action potentials that are used for signal transmission in spike neurons [5].
The synapses of a spike neuron, which consist of excitatory and inhibitory populations, must be optimized before the neuron is placed into a network to form an SNN. An optimization method based on maximum likelihood has been proposed to optimize a spike neuron [6]. The maximum likelihood optimization method is used to configure a single M-N neuron to predict the firing activity of the neuron. The average error between the actual and predicted spike activity is 3%. A supervised multi-spike learning algorithm has been proposed to train neurons in an SNN [7]. A single neuron is trained to learn spike patterns in order to generate spike trains. The algorithm simplifies the expression of the membrane potential and enables the synaptic weights to be optimized by gradient descent. The results showed that the algorithm can achieve accurate classification. The gap in [6] and [7] is the need for training data: both the maximum likelihood optimization method and the supervised multi-spike learning algorithm require a training dataset to train the model. Furthermore, an unsupervised training algorithm has been implemented to train an SNN [8], in which the spike neuron model is trained using synaptic weight association training (SWAT). The training and testing results showed that the algorithm is capable of classification and of converging accurately. A limited precision (SNN/LP) supervised learning algorithm for spiking neural networks has also been implemented for SNN training [9]. Synaptic weights and synaptic delays are applied with limited precision for supervised learning. The algorithm achieved a low mean squared error on the non-linear XOR classification problem and is capable of up to 97% classification accuracy.
In a spiking neural network (SNN), information is emitted and processed by spike neurons through sequences of action potentials, also known as spikes [10]. Information is encoded in the firing rate of the spike neuron [11]. A spike neuron contains a spike generation function for firing. The spike function is non-differentiable, which creates a discontinuity at the instant of firing. This non-differentiability makes it difficult to apply gradient descent and backpropagation to update the weights of the spike neuron and minimize the loss [12]. As a result, training an SNN with backpropagation is difficult compared to training other artificial neural networks (ANN) [13]. An SNN mimics the biological nervous system more closely than a conventional artificial neural network [14]. Although an SNN is biologically more realistic than an ANN, it receives less attention because it is difficult to train [15]. To overcome the training difficulty caused by the non-differentiable spike function, deep reinforcement learning is applied to balance the firing rates of the excitatory and inhibitory populations of a spike neuron. A spike neuron fires at different rates under different configurations of the firing rates of its excitatory and inhibitory populations [16]. The firing rate of the inhibitory population of the spike neuron is initialized as input and adjusted during training until the firing rate of the excitatory population matches the target neuron firing rate. In this research, two reinforcement learning algorithms, deep Q network (DQN) and deep Q-learning with normalized advantage functions (NAF), are proposed as agents that interact with a custom environment built with the OpenAI Gym framework to bring the spike neuron into a balanced state.
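The non-differentiability described above can be made concrete with a common threshold formalization of spike generation. This is a sketch and not necessarily the neuron model used in this work; the symbols $V$ (membrane potential), $V_{\mathrm{th}}$ (firing threshold) and $H$ (Heaviside step function) are notation assumed here.

```latex
S(t) = H\bigl(V(t) - V_{\mathrm{th}}\bigr),
\qquad
H(x) =
\begin{cases}
1, & x \ge 0, \\
0, & x < 0,
\end{cases}
\qquad
\frac{\partial S}{\partial V} =
\begin{cases}
0, & V \ne V_{\mathrm{th}}, \\
\text{undefined}, & V = V_{\mathrm{th}}.
\end{cases}
```

Under this formalization the gradient of the spike output with respect to the membrane potential is zero almost everywhere and undefined at the threshold, so the chain rule used by backpropagation carries no useful learning signal through the spike.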
Unlike previous research works that used deep learning or reinforcement learning, this work applies deep reinforcement learning to address the difficulty of training an SNN with the backpropagation algorithm. The approach consists of a reinforcement learning algorithm with a deep neural network that approximates the Q function. The motivation of this paper is to train a single spike neuron using deep reinforcement learning, without a training dataset, in order to attain goals. The algorithms learn from experience to select the actions that maximize rewards.

RESEARCH METHOD
A spike neuron is created using the neural simulation tool (NEST) simulator. A single spike neuron is used in training for optimization. The spike neuron is modeled with PyNEST commands in the Python programming language after a custom environment is built. The simulation parameters that the NEST simulator requires to model a spike neuron are initialized as shown in Table 1 [17]. The custom environment is created using the OpenAI Gym toolkit, and the spike neuron is wrapped in the OpenAI Gym framework once the custom environment is built. The environment sets the initial state of the problem to be solved. The action space and observation space are configured for both DRL algorithms. The action space defines the possible actions through which the DRL agents interact with the environment, and the observation space defines all the data that is generated by the environment and observed by the agents, as shown in Figure 2. A DRL algorithm requires no training dataset as raw input. The DRL agents select the actions to be taken without training data. Instead, the agents generate their own data from the given state, the actions taken and the rewards received while interacting with the custom environment built with the OpenAI Gym framework. This data, also known as experience, is stored in memory. The agents learn from experience to decide which actions to take to obtain the maximum reward and achieve the goal [18].
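The environment and experience memory described above can be sketched as follows. This is an illustrative, dependency-free sketch that only mimics the classic Gym `reset`/`step` convention (a four-tuple of observation, reward, done, info). The class name, the linear surrogate used in place of the NEST simulation, the four action deltas, the reward shaping and all numeric defaults are hypothetical and are not the values used in this work.

```python
import random
from collections import deque


class ToySpikeNeuronEnv:
    """Gym-style environment with a hypothetical surrogate for the neuron.

    The actual setup simulates the neuron with NEST; here the output rate is
    a made-up linear function of the inhibitory rate so the sketch runs alone.
    """

    # Hypothetical discrete actions: change applied to the inhibitory rate.
    ACTIONS = (-1.0, -0.1, 0.1, 1.0)

    def __init__(self, target_rate=5.0, excitatory_rate=20.0,
                 tolerance=0.05, max_steps=200, seed=0):
        self.target_rate = target_rate
        self.excitatory_rate = excitatory_rate
        self.tolerance = tolerance
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.inhibitory_rate = 0.0
        self.steps = 0

    def _output_rate(self):
        # Surrogate assumption: more inhibition lowers the output rate.
        return max(0.0, self.excitatory_rate - 0.5 * self.inhibitory_rate)

    def reset(self):
        """Start an episode from a random inhibitory rate in [0, 50]."""
        self.inhibitory_rate = self.rng.uniform(0.0, 50.0)
        self.steps = 0
        return (self._output_rate(), self.inhibitory_rate)

    def step(self, action):
        """Apply one action and return (observation, reward, done, info)."""
        delta = self.ACTIONS[action]
        self.inhibitory_rate = min(50.0, max(0.0, self.inhibitory_rate + delta))
        self.steps += 1
        rate = self._output_rate()
        error = abs(rate - self.target_rate)
        reached = error <= self.tolerance
        # Positive reward at the target, penalty proportional to the error otherwise.
        reward = 1.0 if reached else -error
        done = reached or self.steps >= self.max_steps
        return (rate, self.inhibitory_rate), reward, done, {}


# Experience is generated by interaction and stored in memory; no dataset is used.
env = ToySpikeNeuronEnv()
memory = deque(maxlen=1000)
state = env.reset()
done = False
while not done:
    action = random.randrange(len(env.ACTIONS))  # random policy stands in for an agent
    next_state, reward, done, _ = env.step(action)
    memory.append((state, action, reward, next_state, done))
    state = next_state
```

A learning agent would replace the random action choice and sample batches from `memory` to update its Q-function approximation.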
A deep neural network is constructed in DQN because DQN is a value-based method. The action space for DQN in the custom environment is discrete, with 4 possible actions. The agent selects among the 4 possible actions defined in Table 2 and receives rewards according to the action taken and the state. A total of 4000 training steps are taken to balance the firing rates of the excitatory and inhibitory populations of the spike neuron. The trained model is then tested for 5 episodes to validate its performance. The flowchart of this algorithm is shown in Figure 3.
[Extracted table/figure fragment: Current inhibitory rate - random number from range of 0.02 to 0.05]
The NAF agent is constructed with a state-value function and an advantage function [19]. In NAF, three networks are used in training to approximate the Q function: mu_model, V_model and L_model. The V_model is the network that learns the state-value function, and the mu_model is the network that selects the action that maximizes the Q function. The advantage function is constructed in the L_model. The action space of the NAF algorithm is continuous [20], [21]. The action value is selected randomly by the NAF agent from the range of 0 to 50, which is the search interval. The reward is fed back to the agent from the environment. The number of training steps is set to 26000 to balance the firing rates of the excitatory and inhibitory populations of the spike neuron. After training, the model is tested for 5 episodes to validate its performance. The flowchart of this algorithm is shown in Figure 3.
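The way the three NAF networks combine can be illustrated with the standard NAF decomposition, Q(s, a) = V(s) + A(s, a), where the advantage is a quadratic that peaks at the action proposed by mu_model. The sketch below assumes a one-dimensional action and replaces the network outputs with plain numbers; the function name and the example values are hypothetical.

```python
def naf_q_value(v, mu, l_value, action):
    """Return Q(s, a) = V(s) + A(s, a) for a one-dimensional action.

    v       : output of V_model, the state value V(s)
    mu      : output of mu_model, the action that maximizes Q
    l_value : output of L_model; P = l_value * l_value is non-negative
    """
    p = l_value * l_value
    # Quadratic advantage: zero at action == mu, negative everywhere else.
    advantage = -0.5 * p * (action - mu) ** 2
    return v + advantage


# Stand-in network outputs for one state (illustrative numbers only).
v, mu, l_value = 2.0, 30.0, 0.5
candidates = [0.0, 10.0, 30.0, 50.0]  # actions inside the 0 to 50 search interval
q_values = [naf_q_value(v, mu, l_value, a) for a in candidates]
best_action = candidates[q_values.index(max(q_values))]
```

Because the advantage is never positive, the greedy action in a continuous space is simply the mu_model output, which is what makes Q-learning tractable without searching over all actions.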

RESULTS AND DISCUSSION
The spike neuron model is trained in the custom environment built with the OpenAI Gym framework. The model is trained until it is able to meet the target neuron rate and converge. When the model is not given enough training, it does not converge and the goal cannot be attained.

DQN algorithm
The spike neuron is optimized using DQN with 4000 training steps. The DQN agent interacted with the custom environment and selected the actions to be taken. A plot of episode reward versus episodes is shown in Figure 4. The learning curve indicates that the DQN agent is capable of exploring the custom environment built with the OpenAI Gym framework, and that the trained model is able to respond to the environment to attain the goal of bringing the spike neuron into a balanced state. In the initial stage, the agent obtained negative rewards because it had not yet learned to explore the custom environment to optimize the spike neuron. During training, the agent learned to explore the custom environment and received more rewards. The agent received positive and negative rewards based on the given state and the action taken throughout training. Each action is selected randomly from the 4 discrete actions defined in the action space of the algorithm.
Int J Artif Intell ISSN: 2252-8938 Spike neuron optimization using deep reinforcement learning (Tan Szi Hui) 179
With this capability to explore the custom environment, the model became usable for testing. After training, the model is tested for 5 episodes to validate its performance. The model received rewards for each episode during testing, as shown in Figure 5. The inhibitory population rate is fine-tuned by the agent in order to attain the goal. The testing results are tabulated in Table 3. In the third and fifth episodes, the actual output neuron rate is close to the target neuron firing rate, with a difference of 0.04 Hz, whereas in the fourth episode the actual output neuron rate attained the goal. The first and second testing episodes showed the largest difference between the actual and target neuron firing rates, 0.08 Hz. The percentage error of the difference between the actual and target output neuron rates is calculated in Table 4.
The average percentage error is 0.96%. The fewest steps taken for the actual output neuron rate to meet the target neuron firing rate is 84. The results showed that the trained model is capable of interacting with the custom environment built with the OpenAI Gym framework to bring the firing rates of the excitatory and inhibitory populations of the spike neuron into a balanced state.
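One plausible reading of the percentage error is the absolute difference between the actual and target rates divided by the target rate. The sketch below applies that reading to the per-episode DQN differences reported above (0.08, 0.08, 0.04, 0 and 0.04 Hz). The target rate is not stated in this section; 5 Hz is an assumption chosen here only because it reproduces the reported average, and the function name is hypothetical.

```python
def percentage_error(actual_rate, target_rate):
    """Absolute difference between actual and target rate, as a percentage of the target."""
    return abs(actual_rate - target_rate) / target_rate * 100.0


target = 5.0  # assumed target rate in Hz (not given in the text)
differences = [0.08, 0.08, 0.04, 0.0, 0.04]  # reported DQN differences, episodes 1 to 5

# The sign of each difference is not reported; abs() makes it irrelevant here.
errors = [percentage_error(target + d, target) for d in differences]
average_error = sum(errors) / len(errors)
```

Under these assumptions `average_error` comes out at approximately 0.96, matching the reported DQN figure.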

NAF algorithm
The spike neuron is optimized using NAF with 26000 training steps. The NAF agent learned to interact with the custom environment built with the OpenAI Gym framework and to select the actions that yield the maximum reward. A plot of episode reward versus episodes is shown in Figure 6. The learning curve shows the exploration of the NAF agent in the custom environment and indicates that the agent is capable of exploring it. The trained model is able to respond to the environment to bring the firing rates of the excitatory and inhibitory populations of the spike neuron into a balanced state. The rewards obtained by the agent fluctuated between positive and negative values because of the continuous action space. The action value ranges from 0 to 50, and different action values are selected randomly during training. The model is able to explore the custom environment using the NAF algorithm and can therefore be used for testing. Five testing episodes are run to validate the performance of the trained model, as shown in Figure 7. The testing results are recorded in Table 5. The firing rate of the inhibitory population is fine-tuned by the agent to attain the goal. The percentage error of the difference between the actual output neuron rate and the goal is recorded in Table 6. The average percentage error between the output and target neuron firing rates is 0.80%. The fewest steps taken for the actual output neuron rate to meet or come close to the target excitatory population rate is 3. The results showed that the trained model is able to interact with the custom environment built with the OpenAI Gym framework to achieve the balanced state of the spike neuron.

Evaluation of DQN and NAF algorithm
The spike neuron is optimized to balance the firing rates of the excitatory and inhibitory populations using the DQN and NAF algorithms in the custom environment built with the OpenAI Gym framework. The performance of the DQN and NAF trained models is compared in Table 7.
The DQN algorithm is applied to train the spike neuron in a discrete action space, whereas the NAF algorithm is applied to train the spike neuron in a continuous action space. The type of action space to use depends on the application. DQN requires fewer training steps than NAF: the DQN model meets the goal after 4000 training steps, whereas the NAF model is trained for 26000 steps to attain the goal, so its training time is longer. The average percentage error between the target and actual output neuron firing rates is lower for NAF than for DQN; NAF achieved a percentage error of 0.80% when the trained model was tested. Furthermore, NAF takes fewer steps than DQN for the actual output neuron rate to meet or come close to the target neuron firing rate, with only 3 steps taken to attain the goal compared with 84 for DQN. This indicates that the NAF algorithm is able to bring the spike neuron into a balanced state faster than DQN.
The performance of DQN and NAF is compared in Table 8 with a previous research work that used the maximum likelihood optimization method to optimize a spike neuron. Both proposed algorithms achieved a lower average error between the actual and target output (0.96% and 0.80%) than the maximum likelihood optimization method (3%). The environment used for DQN and NAF differs from that of the maximum likelihood optimization method, as the custom environment for DQN and NAF is constructed with the OpenAI Gym framework. The environment is customized to ensure that both agents are capable of exploring it in order to optimize the spike neuron.

CONCLUSION
Deep reinforcement learning is proposed as a method to overcome the difficulty in SNN training caused by the non-differentiable spike function of the spike neuron. The deep Q network and deep Q-learning with normalized advantage functions algorithms are proposed to balance the firing rates of the excitatory and inhibitory populations of a spike neuron. A spike neuron is trained in the custom environment built with the OpenAI Gym framework. Both algorithms are able to interact with the custom environment to attain the goal. The average percentage errors between the target and actual output neuron firing rates for the NAF and DQN algorithms are 0.80% and 0.96% respectively. In terms of the steps taken for the actual output neuron rate to meet the target neuron firing rate, NAF reached the target faster than DQN. The results showed that the algorithms are able to explore the custom environment to optimize the spike neuron. In future work, the DQN and NAF algorithms can be developed further to train a spiking neural network (SNN), since both algorithms are capable of exploring the custom environment built with the OpenAI Gym framework and of optimizing a spike neuron using DRL. The developed SNN can be demonstrated in applications such as game playing, classification and image recognition.