Toward accurate Amazigh part-of-speech tagging

ABSTRACT


INTRODUCTION
Part-of-speech (POS) tagging is the process of marking up each word (token) in a text (corpus) with the appropriate POS tag, depending on its definition or its context. Simple tagging consists of assigning labels such as noun, verb, or adjective to words, while more sophisticated tagging refines the tag with, for example, gender and number. POS tagging is an important pre-processing task for other natural language processing (NLP) tasks. It is essential for NLP applications such as information retrieval, question-answering systems, information extraction, summarization, and machine translation. There are two main approaches to POS tagging. In the rule-based approach, handwritten rules are implemented in algorithms; hence, it is time and resource consuming. On the other hand, the machine learning approach is based on statistics and probabilities. With the growth of computing capacity and the explosion of available data, deep neural networks have become the popular approach for solving most NLP tasks. Although traditional machine learning methods have delivered good results for high-resource languages, low-resource languages, and specifically the Amazigh language, have not yet reached these results.
The Amazigh language, more commonly known as Tamazight, belongs to the Afro-Asiatic (Hamito-Semitic) language family [1], [2]. It is spread over the northern region of Africa, which extends from the Niger in the Sahara to the Mediterranean Sea and the Canary Isles. With the aim of providing an adequate standard writing system, the Tifinaghe-Institut Royal de la Culture Amazighe (IRCAM) graphical system was elaborated to best represent the Moroccan Amazigh language. The Tifinaghe-IRCAM system includes:

− Two semi-consonants: ⵢ and ⵡ.
− Four vowels: three full vowels ⴰ, ⵉ, ⵓ, and one neutral vowel ⴻ that is considered a distinctive feature of Amazigh phonology.
The main syntactic classes of the Amazigh language are the noun, the verb, and the particle. The noun is a grammatical class obtained from the combination of a root and a pattern. It may appear in a simple form (ⴰⵖⵔⵓⵎ 'aghrum' the bread), a compound form (ⴱⵓⵄⴰⵔⵉ 'buEari' the forest keeper), or a derived one (ⴰⵙⵍⵎⴰⴷ 'aslmad' the teacher). Nouns can be masculine or feminine, singular or plural, and in the free case or the construct case. The verb, on the other hand, can be basic or derived. In the basic form, the verb is composed of a root and a radical. In the derived form, the verb has a basic form attached to one of the prefix morphemes: ⵜⵜ 'tt', which marks the passive form; ⵙ 's' / ⵙⵙ 'ss', which indicates the factitive form; and ⵎ 'm' / ⵎⵎ 'mm', which designates the reciprocal form. Both types of verbs are conjugated in four aspects: perfect, negative perfect, aorist, and imperfective. Finally, the particle is a function word that is neither a noun nor a verb. Particles include conjunctions, pronouns, aspectual particles, prepositions, orientation particles, negative particles, subordinates, and adverbs. Particles are usually uninflected words, except for the demonstrative and possessive pronouns (ⵡⴰ 'wa' this (mas.), ⵡⵉⵏ 'win' these (mas.)).
Although high-resource and well-studied languages have various POS tagging systems, whether based on classical machine learning or on deep learning, POS tagging remains a challenging task for Amazigh, as it is a very low-resource language. The reason is the lack of linguistic resources such as morphological analyzers and annotated corpora: the first corpus elaborated counts just 20k tokens, and the second [3] 60k tokens. All these challenges motivate researchers to bring more solutions to Amazigh text analysis. Recently, deep learning has offered new solutions for many NLP tasks, especially through bidirectional recurrent networks. In this context, we propose, for the first time in Amazigh text processing, a POS tagger based on bidirectional long short-term memory.
Given the importance of POS tagging in NLP, researchers have been interested in it for the past decades. Early approaches were based on probabilities, such as hidden Markov models (HMM) [4] and conditional random fields (CRF) [5], or on statistics, such as support vector machines (SVM) [6]. Recently, with the expansion of data on the net, deep learning models have seen important developments, from recurrent networks to bidirectional ones, which we present thoroughly in the next section.
The first research on Amazigh POS tagging was done by [7] using the first annotated corpus of just 20k tokens; it was based on traditional machine learning, namely CRF and SVM, with accuracies of 88.66% and 88.26%, respectively. Later, with the elaboration of another corpus of 60k tokens [3], the CRF and SVM algorithms were tested again in [8], giving accuracies of 89% and 88%, respectively. In addition, TreeTagger, as a language-independent tagger, was tested on the Amazigh language in [8] with an accuracy of 89%. As we can see, the accuracy of tagging the Amazigh language is still low compared to well-resourced languages: for German, TreeTagger reaches 96% [9], and for French 97% [10]. Certainly, the availability of annotated corpora is the reason behind this large difference in accuracy between Amazigh, as a low-resource language, and richer languages, since we are talking about corpora in the 300k-token range compared to 60k tokens. As traditional machine learning algorithms have already been tested on the Amazigh language, we wonder whether deep learning, specifically the bidirectional long short-term memory (Bi-LSTM), could enhance the tagging accuracy for this low-resource language.
Recently, Bi-LSTM [11], [12] networks have been the center of interest in multiple NLP tasks, from sentiment analysis [13] and semantic role labeling [14] to dependency parsing [15], [16]. In syntactic chunking, the authors of [17] were the first to combine a Bi-LSTM with a CRF layer for sequence tagging, with state-of-the-art results. Concerning POS tagging, [18] used a Bi-LSTM on multiple languages and matched the existing state-of-the-art results, while [19] developed a unified Bi-LSTM tagging model for chunking, POS tagging, and named entity recognition. Moreover, [20] applied the Bi-LSTM network not only to rich languages but also to some languages with corpora of just 80k tokens, and the results were promising. Bidirectional LSTMs read a sequence of inputs in both directions, forward and backward, before passing on to the next layer. For more information, see [21], [22]. The main contributions of this work include:
− Presenting the challenges of POS tagging a low-resource language such as Amazigh and discussing the state-of-the-art approaches already used to solve this task.

− Proposing a new model for POS tagging Amazigh, namely a Bi-LSTM, and verifying its performance on the Amazigh dataset.
− Comparing the performance of our tagger with existing taggers and demonstrating that our tagger offers state-of-the-art results.

METHOD
In this section, we present the architecture of the model used in this paper. It is based on a bidirectional LSTM with an embedding layer. To choose the best parameters, we carried out several experiments on the cell size as well as on the embedding size.

Bi-LSTM architecture
A recurrent neural network (RNN) is a type of neural network that enables the use of previous predictions as input, thanks to the hidden state, hence modeling contextual information dynamically. Given an input sequence $x_1, x_2, \ldots, x_T$, a standard RNN estimates the output vector $y_t$ for each word $x_t$ by computing, at each time step, (1) and (2).

$h_t = A(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ (1)

$y_t = W_{hy} h_t + b_y$ (2)
Here, $W$ represents a weight matrix connecting two layers (e.g., $W_{hy}$ is the weight matrix between the hidden and the output layer), $h_t$ is the vector of hidden states, $b$ represents a bias vector (e.g., $b_y$ is the bias vector of the output layer), and $A$ is the activation function of the hidden layer. We point out that $h_t$ contains information generated from the previous hidden state $h_{t-1}$, so that the RNN can exploit the whole input history. Nevertheless, the amount of input history that can be used is limited in practice, since the impact of certain inputs may decay or explode exponentially through the hidden states, a phenomenon called the vanishing gradient problem [23]. To remedy this problem, the long short-term memory (LSTM) architecture was established [12].
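As a concrete illustration, (1) and (2) can be sketched in a few lines of plain Python. The scalar weights and the toy input sequence below are illustrative values only, not taken from the paper; real RNNs use weight matrices and vectors.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One step of a vanilla RNN: eq. (1) then eq. (2), with A = tanh.
    All quantities are scalars here purely for illustration."""
    h_t = math.tanh(W_xh * x_t + W_hh * h_prev + b_h)  # eq. (1): new hidden state
    y_t = W_hy * h_t + b_y                             # eq. (2): output vector
    return h_t, y_t

# Unroll over a short input sequence; the hidden state h carries the context
# from one step to the next.
h = 0.0
for x in [0.5, -1.0, 2.0]:
    h, y = rnn_step(x, h, W_xh=0.8, W_hh=0.5, W_hy=1.2, b_h=0.1, b_y=0.0)
```

Because tanh is bounded, the hidden state always stays in (-1, 1), which is also the root of the vanishing gradient problem discussed above: repeated multiplication through such saturating states shrinks the influence of distant inputs.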
An LSTM network, presented in Figure 1, enables the standard RNN to keep a memory of inputs for a long time. The LSTM has three gates, as shown in Figure 1: an input gate ($i_t$), a forget gate ($f_t$), which decides whether to keep the memory of the preceding cell, and an output gate ($o_t$), all gathered in a memory block. To compute the output of the LSTM hidden layer $h_t$ from an input $x_t$, we solve (3)-(7) [24].

$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$ (3)

$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$ (4)

$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$ (5)

$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$ (6)

$h_t = o_t \tanh(c_t)$ (7)

Note that $\sigma$ represents the logistic sigmoid function, and $i$, $o$, $f$, and $c$ are, respectively, the input gate, the output gate, the forget gate, and the cell activation vectors. The arrows in Figure 1 represent the weight matrices. Thanks to these gates, the LSTM network can remember information over periods of time whose length depends on the chosen architecture, and so it offers a solution to the vanishing gradient problem. For more on the LSTM architecture, see [25].
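The gate computations (3)-(7) can be traced in plain Python. The scalar weights (all set to 0.5 here) and the toy inputs are arbitrary illustrative values, not parameters from the paper's model.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, the gate activation sigma in eqs. (3), (4), (6)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM step following eqs. (3)-(7); scalar weights in dict w for brevity."""
    i_t = sigmoid(w['xi'] * x_t + w['hi'] * h_prev + w['ci'] * c_prev + w['bi'])      # (3) input gate
    f_t = sigmoid(w['xf'] * x_t + w['hf'] * h_prev + w['cf'] * c_prev + w['bf'])      # (4) forget gate
    c_t = f_t * c_prev + i_t * math.tanh(w['xc'] * x_t + w['hc'] * h_prev + w['bc'])  # (5) cell state
    o_t = sigmoid(w['xo'] * x_t + w['ho'] * h_prev + w['co'] * c_t + w['bo'])         # (6) output gate
    h_t = o_t * math.tanh(c_t)                                                        # (7) hidden output
    return h_t, c_t

# Arbitrary uniform weights, purely for the walk-through.
w = {k: 0.5 for k in ['xi', 'hi', 'ci', 'bi', 'xf', 'hf', 'cf', 'bf',
                      'xc', 'hc', 'bc', 'xo', 'ho', 'co', 'bo']}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.3]:
    h, c = lstm_step(x, h, c, w)
```

The key difference from the plain RNN is visible in (5): the cell state $c_t$ is updated additively, gated by $f_t$, rather than being squashed through an activation at every step, which is what lets gradients survive over long spans.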
Bi-LSTM networks [11] not only solve the vanishing gradient problem but also provide both the preceding and the succeeding context, which is not possible in a conventional RNN. This advantage of the Bi-LSTM is very helpful in a task like POS tagging, where the whole sentence is given. As illustrated in Figure 2, a Bi-LSTM first computes the forward hidden sequence $\overrightarrow{h}_t$ and the backward hidden sequence $\overleftarrow{h}_t$, then combines $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ to generate the output $y_t$. We can describe this operation with (8)-(10).

$\overrightarrow{h}_t = H(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})$ (8)

$\overleftarrow{h}_t = H(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})$ (9)

$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y$ (10)

Figure 2. Schematic representation of the Bi-LSTM architecture
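The bidirectional idea behind (8)-(10) can be sketched as follows. For brevity, a minimal tanh recurrence stands in for the full LSTM cell, and the two directions are combined by summation; the weights are illustrative values, not the paper's parameters.

```python
import math

def step(x, h, w_x=0.8, w_h=0.5):
    """A minimal recurrent step (tanh RNN) standing in for an LSTM cell."""
    return math.tanh(w_x * x + w_h * h)

def bi_rnn(xs):
    """Run one pass left-to-right (8) and one right-to-left (9),
    then combine the two hidden sequences position by position (10)."""
    fwd, h = [], 0.0
    for x in xs:                        # forward hidden sequence
        h = step(x, h)
        fwd.append(h)
    bwd, h = [0.0] * len(xs), 0.0
    for t in reversed(range(len(xs))):  # backward hidden sequence
        h = step(xs[t], h)
        bwd[t] = h
    # Combine both directions at each position, here by summation.
    return [f + b for f, b in zip(fwd, bwd)]

ys = bi_rnn([1.0, -0.5, 0.25, 2.0])
```

Note that the output at every position now depends on the entire sentence: `ys[0]` already reflects the last input through the backward pass, which is exactly why the architecture suits POS tagging, where the full sentence is available at once.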

Our proposed model
The proposed model, shown in Figure 3, includes three layers: a word embedding layer [26], a bidirectional layer with LSTM cells, and finally a softmax layer. Starting with a sentence of n words with tags $t_1, t_2, \ldots, t_n$, the Bi-LSTM tagger is trained to predict the tags. The sentence is transformed into $[w_1, w_2, \ldots, w_n]$, where $w_i$ is the code number given to the word at index i according to the vocabulary. This sequence is first fed into the word embedding layer, then into the bidirectional layer, and finally into the softmax layer to produce the predictions. The input of the word embedding layer is designed so that, even if the dataset changes, the vocabulary received by this layer is taken into account. As for the word embedding output size and the number of LSTM cells, the parameters are chosen so that the model performs best.

Dataset and tag set
The dataset used in this experiment is the one elaborated by [3]; it is a collection of Amazigh texts from various resources. It is a CSV file of about 60k tokens with two columns, word and tag; Table 1 shows the statistics of this dataset. To use this corpus properly, a preprocessing step was taken to make it usable in machine learning models. The first step is to delimit the sentences and assign each word an ID; the second step is to transform the corpus into a list of sentences in the format [(word1, tag1), (word2, tag2), ..., (wordN, tagN)], ready for the next processing steps, such as encoding words into indices and tags into one-hot vectors. The tag set is a collection of POS tags (labels) indicating the POS tag or some other grammatical category of each token in the text. The Amazigh language has its specific tag set, presented in Table 2; it is the same one used in previous Amazigh NLP works such as [3]. This tag set is similar to those of other languages, with the difference of having multiple particles.
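The two preprocessing steps can be sketched as follows. The words, tags, and sentence boundaries below are hypothetical placeholders, not entries from the actual corpus or the Amazigh tag set of Table 2.

```python
# A toy two-column (word, tag) stream, as read from the CSV file.
rows = [("w1", "NOUN"), ("w2", "VERB"), ("w3", "PART"),
        ("w4", "NOUN"), ("w2", "PART")]
sentence_breaks = {2, 4}  # hypothetical row indices where sentences end

# Step 1: delimit sentences, keeping (word, tag) pairs together.
sentences, current = [], []
for i, pair in enumerate(rows):
    current.append(pair)
    if i in sentence_breaks:
        sentences.append(current)
        current = []

# Step 2: encode words into integer indices (index 0 reserved for padding)
# and tags into one-hot vectors.
word2idx = {w: i + 1 for i, w in enumerate(sorted({w for w, _ in rows}))}
tag2idx = {t: i for i, t in enumerate(sorted({t for _, t in rows}))}

X = [[word2idx[w] for w, _ in s] for s in sentences]

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

Y = [[one_hot(tag2idx[t], len(tag2idx)) for _, t in s] for s in sentences]
```

In the real pipeline, `X` would additionally be padded to the fixed input length before being fed to the embedding layer.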

Experimental environment
The proposed model is built from a word embedding layer, a bidirectional LSTM layer, and then a softmax layer. For the implementation, we used the TensorFlow library with the Keras application programming interface (API). Model optimization is a primary task in implementing a deep learning model; it is done by choosing hidden parameters that influence the model's behavior. These hyperparameters are the embedding size, the direction, the number of LSTM cells, the number of epochs, and the batch size.
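A TensorFlow/Keras sketch of this three-layer stack is given below. The embedding size of 110, the input length of 100, and the 28 output tags follow the configuration reported in this paper; the vocabulary size and the number of LSTM cells are hypothetical placeholders, since the paper does not fix them here.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 5000   # hypothetical; set to the actual corpus vocabulary size
EMBED_DIM = 110     # best-performing embedding size found in the experiments
LSTM_UNITS = 100    # hypothetical cell count
NUM_TAGS = 28       # size of the Amazigh tag set

model = tf.keras.Sequential([
    # Word indices -> dense vectors; index 0 is treated as padding.
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM, mask_zero=True),
    # Forward and backward LSTM passes, one output per token.
    layers.Bidirectional(layers.LSTM(LSTM_UNITS, return_sequences=True)),
    # Per-token softmax over the tag set.
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# One padded batch of word indices -> per-token tag distributions.
probs = model.predict(np.ones((2, 100), dtype="int32"))
```

Enabling or freezing training of the `Embedding` layer (its `trainable` flag) is what distinguishes the trained-embedding and randomly initialized-embedding variants compared in the results section.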

Parameters configuration
In this experiment, the size of the input layer is fixed to 100 and the output layer to 28, as we have 28 tags. To choose the embedding size, we ran different experiments varying it; the results are shown in Figure 4(a). The accuracy differences are very small; we chose the size 110 as it performs best. As for the hidden layer, we evaluated the performance of the Bi-LSTM model on different numbers of LSTM cells.

RESULTS AND DISCUSSION
In this section, we present the results obtained using the specifications detailed in Table 3. We must emphasize that the Amazigh language does not yet have pre-trained word embeddings. So, in this experiment, using the Keras embedding layer, we first test the Bi-LSTM model without training the embedding, using randomly initialized word embeddings, and then we train the word embedding layer by enabling it in the model.
Figure 5 presents the accuracy and the loss of our Bi-LSTM model with trained embedding. As we can see, the accuracy reaches 97.10% in Figure 5(a) and the loss is below 11% in Figure 5(b). The model without trained word embeddings requires a larger number of epochs to give significant results, so we tested it with 200 epochs, and the results were promising too: as shown in Figure 6, the accuracy reaches 94.8%. For this experiment, we chose to compare our Bi-LSTM tagger with the Amazigh taggers cited in the related works section as baselines. The first SVM and CRF approaches use a 20k-token corpus; the second CRF and SVM approaches use a 60k-token corpus, together with a TreeTagger approach. Table 5 shows the different approaches and their results compared to our model.

Table 5. Comparison of our Bi-LSTM tagger to the baselines
Corpus        Tagger                              Accuracy (%)
20k tokens    CRF [7]                             88.66
              SVM [7]                             88.26
60k tokens    CRF [8]                             89
              SVM [8]                             88
              TreeTagger [8]                      89
Our approach  Bi-LSTM without training embedding  94.8
              Bi-LSTM with training embedding     97.10

This comparison proves that our proposed model for tagging the Amazigh language, specifically the Bi-LSTM, outperforms the existing taggers for this language. Even without training the embedding, our model outperforms the traditional machine learning methods used in Amazigh language tagging. Figure 7 is a graphic representation of these results, showing the outperformance of our model even with a dataset that can be considered small compared to those of other languages.

CONCLUSION
While POS tagging in rich languages reaches 97% with traditional machine learning models, Amazigh, as a low-resource language, is still in search of a good POS tagger. In this paper, we presented a new tagger based on a Bi-LSTM model that achieves state-of-the-art results for the Amazigh language, even without pre-trained word embeddings. In future work, we will keep enriching Amazigh language resources by creating new datasets for named entity recognition, as well as training different word embedding models for the Amazigh language to make them available for NLP projects.

Figure 1. Schematic representation of a typical LSTM network

Figure 3. The Bi-LSTM model proposed for Amazigh POS tagging

Figure 5. Performance of the Bi-LSTM model with trained word embedding: (a) accuracy and (b) loss
Figure 6. Performance of the Bi-LSTM model without trained word embedding: (a) accuracy and (b) loss

Figure 7. Graphic representation of the comparison of the performance of the baselines and our proposed model

Table 2. The Amazigh tag set

Table 3. List of the hyperparameter values used in the proposed model

Table 4. Summary of the experiment specifications and results