Bidirectional long-short term memory and conditional random field for tourism named entity recognition

ABSTRACT


INTRODUCTION
International tourism grows almost every year. In 2019, worldwide international tourist arrivals reached 1.5 billion, an increase of 4% over the previous year [1]. With the advent of the Web 2.0 era, internet users frequently share their travel experiences via websites [2]. A study conducted by Google Travel found that 74% of travelers plan their trips via the internet [3]. Searching for tourist destinations is one of the steps generally carried out when planning a trip. Most people search for information about tourism destinations through reviews, websites, and articles on the internet [4], [5]. However, searching, selecting, and reading the details of each piece of information through travel guidebooks or portal sites is time-consuming [6], [7]. This time-consuming aspect of gathering travel information from texts can be addressed by applying information extraction.
The named entity recognition (NER) task can be applied to extract information from texts in the natural language processing area. NER can be defined as the task of extracting entities, such as a person's name, location, or organization, from text documents [8]-[10]. NER approaches include rule-based approaches; machine learning approaches, such as the hidden Markov model (HMM), maximum entropy, decision trees, support vector machines, and conditional random fields (CRF); and hybrid approaches [11]. There are also other methods, such as the recurrent neural network (RNN) and its variant, long short-term memory (LSTM), which has been used successfully in various sequence prediction problems, such as NER, language modeling, and speech recognition [12].

Many researchers have also studied NER in multiple fields, such as the geological domain [13]. They developed a NER system for geological text using the CRF method and the IITKGP-GEOCORP dataset, built from article collections and scientific reports containing geology-related information in India. In the biomedical domain, the BiLSTM-CRF (bidirectional long-short term memory-conditional random fields) model was used to recognize drug names in biomedical literature [14]. They used the concatenation of word embedding and character-level embedding for the input vector. Their experiments obtained an F1-Score of 92.04% and achieved performance comparable with state-of-the-art results on two datasets. Word embedding was used as a feature by [15] in named entity recognition research and produced a better result than the baseline without word embedding. A study by [16] proposed BiLSTM-CRF for their NER system using three datasets: Penn Treebank POS tagging, Conference on Computational Natural Language Learning (CoNLL) 2000 chunking, and CoNLL 2003 named entity tagging. The research revealed that the BiLSTM-CRF model could efficiently use input features from the past and the future because of the bidirectional LSTM component. Moreover, the CRF layer can provide label information at the sentence level. Among the scenarios examined in that research, the BiLSTM-CRF model yielded the best results on almost all datasets. The merging of BiLSTM and CRF layers was also carried out in [17], and the results showed that such merging could solve the problem of the inability to handle strong dependencies between tags in a sequence. A Chinese dataset and the BiLSTM-CRF model were used in [18]. It was found that adding a dictionary produced better performance than using only the BiLSTM-CRF model. The work in [19], which used the BiLSTM-CRF model along with pre-trained word embeddings, character embeddings, and dictionary information, succeeded in improving the performance of the Disease-NER system. They used pre-trained word embeddings trained with skip-gram on a combination of domain-specific text (PMC and PubMed texts) and generic text (an English Wikipedia dump). The BiLSTM-CRF model can also be combined with bidirectional encoder representations from transformers (BERT) [20]. The combination of the three methods is called BBLC. They also compared the performance of the BBLC model with the BiLSTM-CRF model using the same dataset. As a result, the BBLC obtained a higher F1-Score than the BiLSTM-CRF on some entities, such as location, organization, and thing. On the other hand, the BiLSTM-CRF model achieved a higher F1-Score than BBLC on the time entity.
Furthermore, NER can extract meaningful information from tourism websites by identifying the named entities. In the tourism domain, identified entities can be the names of tourist attractions, places of lodging, facilities, and locations. Identifying related entities is expected to make it easier for potential tourists to find tourist destinations via the internet. However, many NER studies in the tourism domain did not focus on categorizing the characteristics of tourist attractions. We argue that classifying the characteristics of a tourist attraction, such as natural, heritage, or purposefully built, is essential to help users make decisions in future uses of our NER system. Thus, in this study, we present a NER system to aid tourists in finding tourist destinations from articles by extracting tourist attractions into four categories: "natural", "heritage", "purposefully built" (artificial), and "outside". This study proposes a combination of Word2vec and BiLSTM-CRF approaches to build the NER system for tourism. The implementation of Word2vec in our research is inspired by [21]. Additionally, we explore the performance of Word2vec using its two algorithms, skip-gram and continuous bag-of-words (CBOW).
There is a previous study with different tags/labels for tourism NER. Saputro et al. [22] used five labels, namely "nature", "place", "city", "region", and "negative", as the named entities. Their proposed system achieved an accuracy of 70.43% and an F-Score of 69%. However, some of the labels used in that study are not specific enough to identify the characteristics of tourist attractions. For example, when the goal is to extract the name of a tourist attraction and its characteristics (whether natural, heritage, or purposefully built), the labels "city" and "region" are too broad. Another study proposed a corpus for tourism NER in the Mongolian language [23]. Thus, although these studies are similar to ours, a direct comparison is impossible because the labels and the language are different.
The rest of this paper is structured as follows. Section 2 describes the methodology of our study. In section 3, the result and discussion of this study are presented. Finally, section 4 provides the conclusion and future work.

METHOD
This section describes the methodology of our study. Our work consists of six steps: data retrieval, pre-processing, data labeling, feature extraction, classification, and evaluation. Each step is described in more detail below.

Data retrieval
The data for this study was gathered via web scraping techniques from English tourism articles. We searched the articles using two distinct methods: keyword searches and scraping articles directly from predetermined websites. The articles were indexed using the following keywords: top tourist cities, best places to visit, top world heritage sites, world heritage list, best destinations for nature lovers, and best natural tourist attractions. Scraping was accomplished using Python's Newspaper module, which is capable of extracting and parsing text from website articles. Each article was processed independently and then saved in a file with the *.txt extension. We gathered our dataset from 24 websites and obtained 92 articles containing 8,500 sentences, 17,137 unique words, and 183,507 tokens.
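A minimal sketch of this scraping step is given below, assuming the newspaper3k distribution of the Python Newspaper module named above; the URL list is a placeholder, not one of the actual 24 source websites.

```python
# Minimal sketch of the scraping step, assuming the newspaper3k package
# (pip install newspaper3k). The URL list is a placeholder.
from newspaper import Article

urls = ["https://example.com/best-places-to-visit"]  # placeholder article URLs

for i, url in enumerate(urls):
    article = Article(url)
    article.download()          # fetch the raw HTML
    article.parse()             # extract the article body from the page
    with open(f"article_{i}.txt", "w", encoding="utf-8") as f:
        f.write(article.text)   # save each article as a *.txt file
```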

Pre-processing
This pre-processing step aims to eliminate meaningless characters and preserve the remaining valuable words [24]. The pre-processing techniques carried out in this research were URL removal, emoticon removal, and tokenization. URLs and emoticons were removed because they do not significantly affect the recognition of a named entity. Tokenization was then done by dividing the sentences into smaller units called tokens.
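A hedged sketch of this step is shown below; the regular expressions are illustrative, since the paper does not list its exact patterns, and tokenization is done here with NLTK as one possible choice.

```python
# Illustrative pre-processing sketch; the URL/emoji patterns and the use of
# NLTK's tokenizer are assumptions, not the authors' exact implementation.
import re
from nltk.tokenize import word_tokenize  # requires the NLTK 'punkt' data

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # common emoji ranges

def preprocess(text: str) -> list[str]:
    text = URL_RE.sub(" ", text)    # remove URLs
    text = EMOJI_RE.sub(" ", text)  # remove emoticons/emoji
    return word_tokenize(text)      # split sentences into tokens

print(preprocess("Visit Niagara Falls 🌊 https://example.com"))
# -> ['Visit', 'Niagara', 'Falls']
```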

Data labelling
We manually labeled the tourist attractions into four categories: natural, heritage, purposefully built (artificial), and outside. We adopted the BIO tagging format in the labeling process, where a token is tagged as B-label if it is the beginning of a named entity, I-label if it is within a named entity but not its first token, and O-label otherwise [25]. In this study, the B-prefix marks the first word of a tourist attraction's name, while the I-prefix marks the second through the last word of a tourist attraction's name. Tourist attractions that are open to the public and offer natural views, such as waterfalls, mountains, caves, rivers, and glaciers, were labeled as the natural category. Tourist attractions that have existed for a long time, are ancient, historic, or representative of culture and heritage, as well as places of worship, fall into the heritage category; ruins, monuments, temples, forts, castles, mosques, and cathedrals are thus categorized as heritage. Tourist attractions that were purposefully built to attract visitors, such as museums, markets, and amusement parks, were classified as purposefully built. The final category, outside, is for words that are not part of a tourist attraction's name.
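As an illustration (the sentence is our own, not taken from the dataset), the BIO scheme assigns exactly one tag per token:

```python
# Illustrative BIO tagging of a made-up sentence; label names follow the
# dataset's tag set (B-NATURAL, I-NATURAL, B-PURPOSE, ...).
tokens = ["We", "hiked", "near", "Niagara",   "Falls",     "and", "the", "Louvre"]
labels = ["O",  "O",     "O",    "B-NATURAL", "I-NATURAL", "O",   "O",   "B-PURPOSE"]
```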
The number of tokens for each label is shown in Table 1. Our dataset was imbalanced, since the O label dominated with 171,728 tokens. The I-PURPOSE and I-HERITAGE labels consist of 2,874 and 2,051 tokens, respectively. The numbers of B-NATURAL, I-NATURAL, and B-PURPOSE tokens were almost the same. The B-HERITAGE label had the lowest count, with 1,401 tokens.

Feature extraction using word embeddings
This study applied the word embedding method Word2vec for feature extraction. Word2vec has two different algorithms, continuous bag-of-words (CBOW) and skip-gram, and we compared the two in our experiments to obtain the best result. The CBOW algorithm predicts a target word based on its context: the model combines the representations of the surrounding words to predict the middle word. The skip-gram algorithm does the reverse, generating a word vector representation capable of predicting the context of a given word.
Additionally, both models require little training time and can be applied to a large corpus [26]. The textual input is converted to vectors using Word2vec, which is trained to generate a dictionary. The dictionary contains as many entries as there are unique words in the dataset, and each word has its own vector. In the following step, the dictionary of pre-trained vectors is used as the weights of the embedding layer.
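A hedged sketch of this step with gensim's Word2Vec implementation is shown below (an assumption; the paper does not name its Word2vec tooling), including how the trained vectors become the weight matrix of the embedding layer.

```python
# Sketch of Word2vec feature extraction with gensim (gensim >= 4.0 API);
# vector_size and window are illustrative values, not the paper's settings.
import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "eiffel", "tower", "is", "in", "paris"]]  # tokenized corpus

# sg=1 selects skip-gram; sg=0 selects CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Build the dictionary of pre-trained vectors as an embedding weight matrix.
vocab = model.wv.index_to_key
word2idx = {word: idx for idx, word in enumerate(vocab)}
embedding_matrix = np.stack([model.wv[word] for word in vocab])
print(embedding_matrix.shape)  # (vocabulary size, 100)
```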

Classification
We used StratifiedKFold from Scikit-Learn to divide our data into ten subsets. One out of the ten subsets was used as the test data, and the other subsets acted as the training data. The total number of samples in this study was 8,500, where each sample is one sentence in the dataset. After splitting the data, we had 7,650 training samples and 850 test samples. During training, 20% of the training data was held out as validation data, so that the training data ended up being 6,120 samples and the validation data 1,530 samples. The average number of tokens used for each label in each fold is shown in Table 2 for the training data and in Table 3 for the test data. The large number of O labels results from the post-padding applied to each sample, with O as the padding value. A minimal sketch of this splitting step is shown below.
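The sketch assumes a per-sentence stratification key (here, whether a sentence contains any entity token), since StratifiedKFold needs one class label per sample and the paper does not state which key it used.

```python
# Sketch of the 10-fold split; the stratification key is an assumption.
import numpy as np
from sklearn.model_selection import StratifiedKFold

sentence_ids = np.arange(8500)                    # one id per sentence
has_entity = np.random.randint(0, 2, size=8500)   # placeholder per-sentence key

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(sentence_ids, has_entity):
    # 7,650 training sentences and 850 test sentences per fold; 20% of the
    # training fold is later held out as validation data.
    print(len(train_idx), len(test_idx))
```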
The input data were classified into predefined categories by combining two methods, BiLSTM and CRF. BiLSTM is made up of two LSTMs, a forward LSTM and a backward LSTM. As a modification of RNN, LSTM has a memory cell that can store information for a long period. When dealing with long sequential data, the vanishing gradient problem encountered in RNN can be addressed by LSTM through gates that control the information entering the memory [27].

In sequence labeling, it is useful to have access to information from both the front and the back of the text. The hidden state in the LSTM, however, only retrieves information from the preceding part of the sequence, leaving the following part unknown. To overcome this issue, BiLSTM can be applied [28]. In BiLSTM, the combination of a forward LSTM and a backward LSTM captures information from both directions. The outputs of the forward and backward LSTM are then combined by a function $\sigma$ as shown in (1):

$y_t = \sigma(\overrightarrow{h}_t, \overleftarrow{h}_t)$  (1)

where $\sigma$ can be a concatenation, addition, averaging, or multiplication function; $y_t$ represents the output at time $t$, $\overrightarrow{h}_t$ represents the hidden state from the forward layer, and $\overleftarrow{h}_t$ represents the hidden state from the backward layer.
The proposed BiLSTM-CRF architecture in this research is depicted in Figure 1. The input layer receives the words, represented as integer indices. The embedding layer maps each index to the vector generated by the pre-trained Word2vec, and the embedding layer's output serves as the input for the following BiLSTM layer. Finally, a decision function based on the CRF layer is used to generate the label sequence. CRF is a method to obtain globally optimal predictions using a conditional probability distribution model [29]. The CRF layer labels the sequence using the surrounding labels: the labels preceding and following the current word can aid in predicting the label of the current word. There are two types of scores in the CRF method: emission and transition scores. The emission scores in this model are derived from the output score matrix of the preceding layer, while transition scores are initially assigned randomly and updated throughout the training process. The two scores are combined to predict the final output sequence of labels.
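A hedged sketch of this architecture is given below, written in PyTorch with the pytorch-crf package; the framework choice is ours, since the paper does not name its implementation, and the 128 LSTM units and 0.5 dropout follow the initial scenario reported in the next section.

```python
# Sketch of the BiLSTM-CRF classifier in PyTorch with pytorch-crf
# (pip install pytorch-crf). An illustration, not the authors' code.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, embedding_matrix, num_tags, lstm_units=128, dropout=0.5):
        super().__init__()
        # Embedding layer weighted with the pre-trained Word2vec vectors.
        self.embedding = nn.Embedding.from_pretrained(
            torch.as_tensor(embedding_matrix, dtype=torch.float), freeze=False)
        embed_dim = self.embedding.embedding_dim
        # Forward and backward LSTMs; their outputs are concatenated.
        self.bilstm = nn.LSTM(embed_dim, lstm_units, batch_first=True,
                              bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        # Dense layer producing the emission score of every tag per token.
        self.emissions = nn.Linear(2 * lstm_units, num_tags)
        # CRF layer holding the transition scores, updated during training.
        self.crf = CRF(num_tags, batch_first=True)

    def _emission_scores(self, tokens):
        h, _ = self.bilstm(self.embedding(tokens))
        return self.emissions(self.dropout(h))

    def loss(self, tokens, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emission_scores(tokens), tags, mask=mask)

    def predict(self, tokens, mask):
        # Viterbi decoding combines emission and transition scores.
        return self.crf.decode(self._emission_scores(tokens), mask=mask)
```

Here `mask` is a boolean tensor that switches off padding positions, so the padded O tokens do not contribute to the CRF loss; pytorch-crf requires the first timestep of every sequence to be unmasked.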

Evaluation
To measure the performance of NER, [15] argues that the F1-Score is a more suitable measurement than accuracy. This is because most tokens in NER data are labeled O, i.e., they are not part of a named entity, so a high accuracy can be obtained trivially. Therefore, this study uses the F1-Score as the parameter for measuring model performance. The F1-Score is the harmonic mean of precision and recall, as shown in (2):

$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$  (2)

The best score the F1-Score can achieve is 1, while the worst is 0. This value can also be expressed as a percentage ranging from 0-100%, which is the form used in this study.
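For entity-level scoring under the BIO scheme, a common tool is the seqeval package (an assumption; the paper does not name its evaluation tooling):

```python
# Entity-level F1 over BIO-tagged sequences with seqeval
# (pip install seqeval); the label sequences here are illustrative.
from seqeval.metrics import f1_score

y_true = [["O", "B-NATURAL", "I-NATURAL", "O", "B-PURPOSE"]]
y_pred = [["O", "B-NATURAL", "I-NATURAL", "O", "O"]]

# One of two gold entities found and no spurious ones:
# precision = 1/1, recall = 1/2, F1 = 2*(1*0.5)/(1+0.5) = 0.667.
print(f1_score(y_true, y_pred))
```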

RESULTS AND DISCUSSION
In this study, we varied the hyperparameter values to obtain the best scenario. The initial scenario used the skip-gram algorithm for Word2vec, 128 LSTM units, a dropout of 0.5 in the LSTM layer, TanH as the activation function in the dense layer, the Adam optimization function, a batch size of 32, and 30 epochs. To get the best model performance for NER in the tourism domain, we designed seven different scenarios, as shown in Table 4. We conducted scenarios 1 to 7 sequentially, and the configuration that produced the best performance in each scenario was used in the subsequent scenarios. Across all scenarios, the best model obtained an average F1-Score of 75.25%, using the configuration shown in Table 5. In addition, the accuracy and average F1-Score for each scenario are shown in Table 6. Table 7 presents the best F1-Score for each type of entity: heritage attraction, purposefully built attraction, and natural attraction; words that are not tourist attractions fall into the outside category. Figure 2 illustrates examples of the named entity detection results on our dataset. The results show that our best model was able to detect some named entities correctly. However, our NER model still makes some mistakes while predicting labels. For example, several tourist attractions were still detected with the wrong attraction type. These mistakes may happen due to the limited number of tourist attraction mentions in our dataset, causing our model to fail in predicting the entity.

CONCLUSION
In this study, a named entity recognition system for tourism has been presented. We focused on extracting tourism entities based on their categories: natural attraction, heritage attraction, and purposefully built (artificial) attraction. The experiments have shown that the proposed BiLSTM-CRF algorithm delivers promising results in identifying named entities from the tourism dataset. We also found that applying Word2vec with skip-gram can improve the performance of the named entity system. This research has produced a model that predicts new data quite well, although there were still some detection mistakes. We experimented with various scenarios, and the best model produced an average F1-Score of 75.25%. For future work, other word representation models could be applied to improve the performance of named entity detection. In addition, we suggest adding more entity labels, such as country, city, location, and tourism name.