A Spark-based parallel distributed posterior decoding algorithm for the big data hidden Markov models decoding problem

Received Jun 21, 2020; Revised Jun 7, 2021; Accepted Jun 20, 2021

Hidden Markov models (HMMs) are machine learning models that have been widely used and have demonstrated their efficiency in many conventional applications. This paper proposes a modified posterior decoding algorithm to solve the hidden Markov models decoding problem, based on the MapReduce paradigm and Spark's resilient distributed datasets (RDDs) concept, for large-scale data processing. The objective of this work is to improve the performance of HMMs to deal with big data challenges. The proposed algorithm shows a great improvement in reducing time complexity and provides good results in terms of running time, speedup, and parallelization efficiency for a large amount of data, i.e., a large number of states and a large number of sequences.


INTRODUCTION
Big data refers to a large amount of data created and diffused daily. Big data has a great influence, both commercial and economic, on the development of the global economy [1]. This huge amount of data constitutes a great source of power. It is an inexhaustible mine of knowledge that must be processed to extract valuable information. Thus, big data analytics have attracted many specialized companies and researchers who tried to improve and adapt the classical algorithms to handle voluminous data [2].
Hidden Markov models (HMMs) [3] are classical statistical models, widely used in many fields such as speech recognition [4], finance [5] or bioinformatics [6], but in a big data context, these models have not yet reached their maturity. In previous approaches, the volume of the data did not present a real problem since, in general, very large data sets were not handled. In addition, the applications used structured data, so their formats were not very varied, which is not the case for big data. Also, algorithms were not fast enough to give solutions in real time or to manage the high speed of data generation and diffusion [7].
Big data in the context of HMM applications can be tackled using different approaches: by studying HMMs with multiple sequences, HMMs with long observation sequences, or HMMs with a large number of hidden or observed states. In recent years, various works have focused on how to introduce HMMs in big data applications in order to make full use of their potential. Thus, many researchers are working on improving algorithms so that they take into account the complex characteristics of big data [8]. In fact, HMM algorithms must be adapted to meet the growing demand for data processing. One of the most promising solutions is to implement these algorithms under a big data framework, to take advantage of its powerful tools for data distribution and parallel computation.
The decoding problem is one of the fundamental problems of HMMs. In this problem, given a model $\lambda$, we search for the most probable state sequence that produced a given observation sequence $O = \{o_1, o_2, \ldots, o_T\}$. To solve this problem, the Viterbi [9] and posterior decoding [10], [11] algorithms are two of the most used algorithms. Even if these two algorithms solve a similar problem, the Viterbi algorithm finds the global solution while the posterior decoding algorithm locally finds the most likely hidden states. Although the posterior decoding algorithm has shown its processing speed, efficiency and accuracy, it generally has some drawbacks when handling big data, specifically its suboptimal complexity and high execution time [12]. With the exponential development of big data technologies, it is necessary to focus on new approaches that use these technologies to improve classical algorithms in terms of analysis and processing power, mainly through parallel distributed computing [13]. Spark is certainly one of the most powerful big data technologies; it has demonstrated its effectiveness in several applications and is attracting more and more researchers.
In this paper, we present a new parallel distributed version of the posterior decoding algorithm under Spark [14] for the HMMs decoding problem. We used the main concepts of the Spark framework to achieve this implementation: to distribute the data over many blocks, we used the concept of resilient distributed datasets (RDDs) [15]; for the parallel computation, we used the MapReduce paradigm [16]; and finally, to reduce the communication cost, we used broadcast variables. One of the major advantages of the proposed Spark-based solution is that it benefits from the richness of Spark's modules, which offer a variety of tools for data collection and preprocessing, a set of optimized algorithms for parallel calculation, algorithms for analyzing data in real time, as well as the possibility of executing graph algorithms. Through this implementation under Spark in a cloud environment, we aim to bring hidden Markov models into the new era of big data, which opens the door to the use of hidden Markov models in various fields of application requiring a huge amount of data and parallel processing. The main contributions of this paper are summarized as follows:
- Reviewing the foundations of HMMs, mainly the decoding problem.
- Proposing an improved posterior decoding algorithm, based on a parallel distributed computing approach using Spark.
- Evaluating the proposed approach in a cloud environment using several metrics.
The remainder of this paper is organized as follows. Section II deals with the hidden Markov models fundamentals and presents the HMMs decoding problem, followed by a detailed discussion of the posterior decoding algorithm. In section III, we explore some related works. In section IV, we describe the proposed parallel distributed posterior decoding algorithm under Spark. The experimental results of the proposed algorithm evaluation are presented and discussed in section V. Finally, we conclude the paper with a summary of our key contributions and discuss possible future work.

RESEARCH BACKGROUND
2.1. Hidden Markov models fundamentals
Hidden Markov models are based on a first-order Markov model simulating the evolution of the state of the system. An HMM produces a sequence using two sequences of random variables: a hidden sequence and an observable sequence. The hidden sequence corresponds to the sequence of states and the observable sequence corresponds to the sequence of observations [3]. HMMs are statistical Markov models used in various fields, especially in speech recognition, signal processing and communications. Hidden Markov models are also used in computational biology and bioinformatics [6], in natural language modelling [17], as well as in financial analysis [5] and many other areas.
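The characteristics of an HMM are defined as follows [3], [18]:
- $S = \{s_1, s_2, \ldots, s_N\}$: the set of $N$ hidden states, where $q_t$ denotes the state at time $t$.
- $V = \{v_1, v_2, \ldots, v_M\}$: the set of $M$ observation symbols, where $o_t$ denotes the symbol observed at time $t$.
- $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$: the probability of moving from state $s_i$ to state $s_j$; we denote $[A] = \{a_{ij}\}$ as the transition probabilities matrix.
- $b_j(v_k) = P(o_t = v_k \mid q_t = s_j)$: the probability of observing the symbol $v_k$ in state $s_j$; we denote $[B] = \{b_j(v_k)\}$ as the observation probabilities matrix.
- $\pi_i = P(q_1 = s_i)$: the probability of starting in state $s_i$; we denote $[\Pi] = \{\pi_i\}$ as the initial probabilities vector.
- $P(O \mid \lambda)$: the probability that the HMM $\lambda$ has produced the sequence $O$.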
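As an illustration, the parameters (A, B, Π) of a small HMM could be held in code as follows. This is a minimal sketch only: the class name `HMM`, the method `validate` and the two-state toy values are assumptions made for the example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    """Illustrative container for the parameters (A, B, Pi) of an HMM."""
    A: List[List[float]]   # A[i][j]: probability of moving from state i to state j
    B: List[List[float]]   # B[j][k]: probability of observing symbol k in state j
    pi: List[float]        # pi[i]: probability of starting in state i

    def validate(self, tol: float = 1e-9) -> bool:
        """Return True when pi and every row of A and B sum to 1."""
        rows = [self.pi] + self.A + self.B
        return all(abs(sum(row) - 1.0) <= tol for row in rows)


# Hypothetical two-state, two-symbol model used only for illustration.
toy = HMM(
    A=[[0.7, 0.3], [0.4, 0.6]],
    B=[[0.9, 0.1], [0.2, 0.8]],
    pi=[0.6, 0.4],
)
```

Checking that each row is a probability distribution before decoding is a cheap way to catch malformed parameters early.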
HMMs are used to solve three main problems:
- Evaluation: Given the sequence of observations $O$ and an HMM $\lambda$, how to assess the probability of observation $P(O \mid \lambda)$? For this problem, a forward-backward dynamic programming procedure [19] is used to calculate the probability of the observation sequence efficiently.
- Finding the most likely path: Given the sequence of observations $O$ and an HMM $\lambda$, how to find a sequence of states that maximizes the probability of observation of the sequence? The Viterbi algorithm is a dynamic programming technique for finding this single best state sequence $Q = \{q_1, q_2, q_3, \ldots, q_T\}$ for the given observation sequence $O = \{o_1, o_2, o_3, \ldots, o_T\}$. Another algorithm used to solve the decoding problem is the posterior decoding algorithm, used when several paths have similar probabilities.
- Learning: How to adjust the parameters (A, B, Π) of an HMM $\lambda$ to maximize $P(O \mid \lambda)$, by using the Expectation-Maximization (EM) algorithm [20].

Posterior decoding algorithm
In the hidden Markov models decoding problem, given the sequence of observations $O = \{o_1, o_2, o_3, \ldots, o_T\}$ and an HMM $\lambda$, we seek the most probable sequence of states $Q$, that is, the one that maximizes the probability of observation of the sequence. In this problem we try to guess the correct hidden sequence of states. Two algorithms are most used to solve this problem: the Viterbi and posterior decoding algorithms. The definition of the sequence of probable states differs depending on the domain and may influence the final solution of the problem. A first approach searches for the most probable state at each time and concatenates all such states. It means that we choose the states that are individually most likely at the time when a symbol is emitted. This approach is called posterior decoding. Another approach finds the best path through the hidden state space; this is the Viterbi algorithm.
While the Viterbi algorithm remains the most used and efficient algorithm for the problem of decoding HMMs, in some applications it is not the most appropriate. One of the alternatives to this algorithm is the posterior decoding algorithm, which is also widely used when many paths have almost the same probability as the most likely path. The posterior decoding algorithm provides the most likely state at any time. It focuses on the individual positions in the sequence and seeks to maximize the probability that they are well explained.
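The posterior decoding algorithm involves dynamic programming using the forward and backward algorithms, with sums instead of the maximization procedures, to calculate the total probability over all possible paths. In the forward algorithm, we define the forward variable $\alpha_t(i)$, the probability of producing the partial observation sequence $o_1, o_2, o_3, \ldots, o_t$ (until time $t$) and of being in state $s_i$ at time $t$, given the model $\lambda$, as:

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = s_i \mid \lambda)$$

The forward algorithm to calculate $P(O \mid \lambda)$, the probability of the observation sequence $o_1, o_2, o_3, \ldots, o_T$ given the model $\lambda$, is as follows:

Initialization:
$$\alpha_1(i) = \pi_i \, b_i(o_1), \quad 1 \le i \le N$$
Induction:
$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(o_{t+1}), \quad 1 \le t \le T-1, \; 1 \le j \le N$$
Termination (t=T):
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

The backward variable $\beta_t(i)$ is calculated similarly to $\alpha_t(i)$, using a backward recursion, given that we are starting from state $s_i$ at the instant $t$. Hence, we define the backward variable as:

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = s_i, \lambda)$$

The backward algorithm is as follows:

Initialization:
$$\beta_T(i) = 1, \quad 1 \le i \le N$$
Induction:
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j), \quad t = T-1, \ldots, 1, \; 1 \le i \le N$$
Termination (t=1):
$$P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i \, b_i(o_1) \, \beta_1(i)$$

In posterior decoding, for each $t$, $1 \le t \le T$, we find the state $s_i$ that maximizes $P(q_t = s_i \mid O, \lambda)$. Let $\gamma_t(i)$ be the probability of being in state $s_i$ at time $t$ for the given observation sequence $O$ and the model $\lambda$ (posterior probability). Thus, at each time, we can choose the optimal state that maximizes $\gamma_t(i)$:

$$\gamma_t(i) = P(q_t = s_i \mid O, \lambda) = \frac{\alpha_t(i) \, \beta_t(i)}{P(O \mid \lambda)} = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)}$$

with the following constraint being satisfied:

$$\sum_{i=1}^{N} \gamma_t(i) = 1$$

The individually most likely state $\hat{q}_t$ (the sequence of states obtained by posterior decoding) is defined thus:

$$\hat{q}_t = \arg\max_{1 \le i \le N} \gamma_t(i), \quad 1 \le t \le T$$

In other words, at every position we choose the most probable state for that position. The pseudo code of the posterior decoding algorithm is given by Algorithm 1.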
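To make the forward, backward and posterior steps concrete, the following is a minimal sequential sketch in plain Python. It is not the authors' Algorithm 1: the function names and the two-state toy model are assumptions chosen for illustration, states and symbols are indexed from 0, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward(A, B, pi, obs):
    """alpha[t][i]: probability of obs[0..t] and of being in state i at time t."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                       # initialization
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                    # induction
        for j in range(N):
            total = sum(alpha[t - 1][i] * A[i][j] for i in range(N))
            alpha[t][j] = total * B[j][obs[t]]
    return alpha


def backward(A, B, obs):
    """beta[t][i]: probability of obs[t+1..] given state i at time t."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]     # initialization: last row stays 1
    for t in range(T - 2, -1, -1):           # backward induction
        for i in range(N):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N)
            )
    return beta


def posterior_decode(A, B, pi, obs):
    """Return (gamma, path): posterior probabilities and most likely state per time."""
    alpha = forward(A, B, pi, obs)
    beta = backward(A, B, obs)
    prob = sum(alpha[-1])                    # termination: probability of obs
    gamma = [
        [a * b / prob for a, b in zip(alpha_t, beta_t)]
        for alpha_t, beta_t in zip(alpha, beta)
    ]
    path = [max(range(len(g)), key=g.__getitem__) for g in gamma]
    return gamma, path


# Hypothetical toy model and observation sequence.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
gamma, path = posterior_decode(A, B, pi, obs=[0, 1, 0])
```

For long sequences the products underflow in floating point, so a practical implementation would rescale each row of the forward and backward tables or work with logarithms.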

RELATED WORK
In [21], to improve the prediction of the topology of fully beta membrane proteins, Fariselli et al. propose a new algorithm for the HMMs decoding problem. This new algorithm, called Posterior-Viterbi, is a combination of the posterior and Viterbi algorithms. First, they compute the posterior probabilities of each state, then they use the Viterbi algorithm to look for the best posterior possible path through the model. It performs better than the others, especially when several concurring paths are present. This algorithm is certainly effective, but in terms of time complexity it is slower than other decoding algorithms (e.g., Viterbi, posterior decoding), while in terms of space complexity it has the same memory requirements as Viterbi and posterior decoding.
In [22], Sand et al. used new generations of multi-core processors that support the SSE instruction set to develop a library for HMMs using C++. It exploits an optimized implementation of forward and backward algorithms by reformulating matrix multiplications, and for each iteration for the division operation, it uses SSE instruction instead of the instruction for chunks multiplication to speed-up the calculation. Lunter et al. [23] propose a variant of the posterior decoding, marginalized posterior decoding, which differs from the classical algorithm in the way the intervals are treated. It takes into account the columns which contribute to an alignment to calculate this alignment that maximizes the posterior probability of the cumulative log of these columns.
Do et al. [24] present the ProbCons algorithm, using pair HMMs to estimate posterior probabilities for amino acid residues. It uses an alignment partition function to generate suboptimal alignments. It differs from other approaches in its use of maximum expected accuracy to align pairs of sequence profiles. To predict sequence features by combining the probabilities of the features of homologous sequences, Käll et al. [25] propose a posterior HMM decoder. This algorithm considers the mean posterior label probability of each position in a global sequence alignment. Bourlard et al. [26] improve the posterior probabilities using all possible acoustic information and prior knowledge to enhance the functioning of automatic speech recognition systems. The objective of this work is to improve the estimation of local posteriors by calculating the posterior probability recursively, considering all available acoustic information together with other prior information. Brown et al. [27] outline a new HMMs decoding approach based on the labelling of sequences in such a way that the correct labelling of a sequence is close to the prediction.
We propose an improvement of the posterior decoding algorithm. It is a parallel distributed posterior decoding algorithm under Spark, which makes it possible to speed up the algorithm for a high number of states or a high number of sequences. Thus, the improved algorithm optimizes the complexity and reduces the computation time. This solution is therefore well adapted to big data applications (high scalability, effective management of heterogeneous data and easy integration in big data frameworks). In addition, the proposed Spark-based solution benefits from the richness of Spark's modules, which offer a variety of tools for data collection, preprocessing and cleaning, and a set of optimized algorithms for parallel calculation and for analyzing and managing data in real time. It also makes it possible to execute graph algorithms.

PARALLEL DISTRIBUTED POSTERIOR DECODING ALGORITHM USING SPARK
Much recent research has focused on the parallel distributed implementation of classical algorithms using big data platforms, including the classical algorithms of HMMs [28]-[33]. We implemented our algorithm under Spark using the Python language. We used the main concepts of the Spark framework to achieve this implementation: the MapReduce paradigm to perform parallel computations, the resilient distributed datasets (RDDs) concept to distribute the data over many blocks, and broadcast variables to reduce the communication cost.
Spark is one of the platforms often used for big data processing to handle a huge amount of data in batch and real time processing modes. Spark uses RDDs to enable efficient reuse of data in a broad family of applications. RDDs are characterized by their fault tolerance property and allow the storage of intermediate data in memory using parallel data structures, the control of partitioning, and the manipulation of data using a set of operators. RDDs support two types of operations: transformations and actions. The transformations (e.g., map, filter, sample) return a new RDD while the actions, like reduce, collect, and count, evaluate and return a new value.
Spark, like Hadoop [34], relies on a distributed storage system (e.g., HDFS [35]) to store the input and output data of Spark's jobs [36]. Spark is based on the following elements: Spark core, which is the framework execution engine; the Spark cluster manager, which manages the cluster resources (Kubernetes, Mesos [37], Yarn [38]); Spark SQL [39]; Spark streaming [40]; MLlib, the distributed machine learning library [41]; and GraphX [42], as shown in Figure 1.
To optimize the calculations on Spark, we used vectors: each column of the matrices is stored in a vector. Vectors are less consuming in terms of computation time [43], so it is better to work on vectors rather than on matrices. Indeed, even if filling the vectors represents more operations than filling a matrix, the program will be faster. The explanation is as follows: when filling a vector, starting from the first component, the processor automatically allocates cache memory for the following n components, whereas when filling a matrix, only the components of the first line are allocated a place in the cache memory, and going to the next line resets the operation. So, for each matrix, we use a vector to store the elements of each column. For example, when calculating the posterior probability $\gamma_t(i)$, the values of $\gamma_1(1), \gamma_1(2), \gamma_1(3), \ldots, \gamma_1(N)$ are stored in the vector $\gamma_1$, as shown in Figure 2.
The steps of the proposed algorithm implemented under Spark are as follows. We first calculate the values of the forward and backward variables, as explained in Section 2, by parallelizing the loops on $i$ and $j$. Under Spark, this task is performed using multiple executors in parallel. Then, from the values stored in the $\alpha$ and $\beta$ vectors, we compute the $\gamma_t(i)$ in parallel for each $t$, as shown in Figure 3. Thus, we use different RDD operators (map, reduce, ...) to efficiently perform calculations in parallel.
For example, $P(O \mid \lambda)$ is calculated using the reduce function, which aggregates the elements of an RDD by applying a commutative and associative function passed as an argument, instead of making a sum on the elements $\alpha_T(i)$ with $i$ varying between 1 and $N$. According to Figure 3, for a given value of $t$, the calculation of $\alpha_t(i)$ times $\beta_t(i)$ is performed only once. It is therefore a gain of $N \times T$ operations, which is not negligible for large programs. The $\gamma_t(i)$ for each $t$ will be stored in the vector $\gamma_t$. Then, we apply the argmax function to the $\gamma_t(i)$ over all states to find the individually most likely state.
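The map and reduce pattern described above can be sketched locally as follows. This is not the authors' Spark code: Python's built-in `map` and `functools.reduce` are used here as stand-ins for the RDD `map` and `reduce` operators so that the example runs without a cluster, and the function name and the precomputed forward and backward values (taken from a hypothetical two-state model) are assumptions. Under PySpark, the same functions would be passed to the operators of an RDD of per-time-step pairs, and the total probability would typically be shared with the executors as a broadcast variable.

```python
from functools import reduce
from operator import add


def posterior_map_reduce(alpha, beta):
    """Decode each time step independently from precomputed alpha and beta rows."""
    # "reduce" step: aggregate the last forward row with a commutative,
    # associative function to obtain the probability of the observations.
    prob = reduce(add, alpha[-1])

    def decode_one(rows):
        # Work on one time step: posterior vector, then index of its maximum.
        alpha_t, beta_t = rows
        gamma_t = [a * b / prob for a, b in zip(alpha_t, beta_t)]
        best = max(range(len(gamma_t)), key=gamma_t.__getitem__)
        return gamma_t, best

    # "map" step: time steps do not depend on each other, so this map
    # could be distributed over partitions.
    return list(map(decode_one, zip(alpha, beta)))


# Hypothetical precomputed forward and backward values for T = 3, N = 2.
alpha = [[0.54, 0.08], [0.041, 0.168], [0.08631, 0.02262]]
beta = [[0.1635, 0.258], [0.69, 0.48], [1.0, 1.0]]
decoded = posterior_map_reduce(alpha, beta)
states = [best for _, best in decoded]
```

Because addition is commutative and associative, the partial sums can be combined in any order, which is what allows a distributed reduce to produce the same total as a sequential loop (up to floating-point rounding).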

Experimentation setup
We evaluated the new algorithm using Spark in a cloud environment. We used t2.large instances of the Amazon EC2 cloud computing platform, which is characterized by a resizable computing capacity and a very high level of security. We carried out the experiments with a configuration consisting of 8 GB of memory and 2 CPUs, with Spark version 2.0.1 and 5 GB of storage on Amazon S3. T2 instances are burstable performance instances that provide a high frequency Intel Xeon processor and present a high level of balance between computing, memory, and network resources.
In this study, we evaluate the proposed algorithm by performing different experiments using a dataset which consists of sequences of integers drawn from a multinomial distribution. In the first experiment, we fixed the number of sequences and measured the running time as a function of the number of states; then we fixed the number of states and measured the running time as a function of the number of sequences. To evaluate the efficiency of this parallel distributed implementation, we also measured the speedup and the parallelization efficiency of the proposed algorithm. For these last two measurements, we created four subsets of data with different numbers of sequences, and we measured the speedup and then the efficiency by varying the number of nodes used.
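A synthetic dataset of this kind could be generated as in the sketch below. The paper does not give its generator or its parameters, so the function name, the symbol probabilities, the sequence length and the seed are all hypothetical; each symbol is drawn independently from a fixed categorical distribution, so the symbol counts of each sequence follow a multinomial distribution.

```python
import random


def make_sequences(num_sequences, length, probs, seed=0):
    """Draw integer sequences whose symbols follow the given probabilities."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    symbols = list(range(len(probs)))         # symbols are 0, 1, ..., M-1
    return [
        rng.choices(symbols, weights=probs, k=length)
        for _ in range(num_sequences)
    ]


# Hypothetical sizes, far smaller than a real big data experiment.
data = make_sequences(num_sequences=4, length=10, probs=[0.5, 0.3, 0.2])
```

Varying `num_sequences` while keeping the model fixed, or the reverse, reproduces the two kinds of measurement described above.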

Computational complexity
We compared the proposed parallel distributed posterior decoding algorithm to the classical one in terms of time and space complexities. As shown in Table 1, the results indicate a great improvement in time complexity compared to the classical version, while the space complexity remains almost the same. Thanks to this implementation, the complexity has been reduced from $O(N^2(T-1))$ to $O(T-1)$, with $N$ the number of states and $T$ the length of the observation sequence.
The results in Table 2 match the results in Table 1. Table 2 presents a step-by-step time complexity comparison between the classical posterior decoding algorithm and the new parallel distributed algorithm under Spark. For most stages of the algorithm (i.e., forward and backward variables, posterior probability) there is a remarkable improvement. In sum, the two tables show that the time complexity of the proposed algorithm is considerably better than that of the conventional one.

Figure 4 shows the performance of the proposed parallel distributed version of the posterior decoding algorithm in terms of running time as a function of the number of states. The curve in this figure shows that, as the number of states increases, the ratio between running time and number of states remains nearly stable. Figure 5 shows the performance of the parallel distributed posterior decoding algorithm under Spark in terms of running time as a function of the number of sequences. The curve shows that, as the number of sequences increases, the proposed algorithm continues to present good results in terms of running time. This indicates that the algorithm is well suited to applications with a very large number of sequences.

Speedup analysis
To measure the performance of parallel implementations, one of the frequently used metrics is speedup. It measures the evolution of execution time as a function of the number of nodes. The speedup is the benefit obtained by a parallel implementation of an algorithm (on $p$ nodes) compared to the same algorithm on a single node. In the context of Amdahl's law, the speedup is defined as:

$$S_p = \frac{T_s}{T_p}$$

where $T_s$ and $T_p$ are respectively the processing times on 1 and $p$ resources. These can be cores in a processor, processors in a shared memory machine, nodes (PCs) in a cluster or disks in a mass storage system. In our case, $T_s$ represents the execution time of the sequential algorithm and $T_p$ the execution time of the parallel algorithm on $p$ nodes. As can be seen in Figure 6, the proposed algorithm presents a significant improvement in execution time: the speedup increases with the number of nodes used, while also depending on the volume of data processed.

Parallelization efficiency analysis
Efficiency is a profitability metric that quantifies how well the resources used in a parallelization are exploited. This measure is defined as:

$$E_p = \frac{S_p}{p} = \frac{T_s}{p \, T_p}$$

where $S_p$ is the speedup and $T_p$ is the parallel algorithm execution time on $p$ nodes. According to Figure 7, the efficiency depends on the number of nodes used and on the volume of the data. For the different subsets of data, a satisfactory efficiency rate has been obtained. According to these measurements of parallel computation yield, the proposed algorithm presents a high level of scalability, since the level of parallelism increases with the number of nodes.
In addition to this performance, the implementation under a big data platform (i.e., Spark) makes it possible to benefit fully from many data preprocessing tools, especially for large scale multidimensional data, feature selection and model evaluation. Indeed, Spark provides a variety of tools for data collection, feature extraction, selection and transformation, and data cleaning, a set of efficient algorithms for analyzing and managing data in real time, and powerful techniques for model evaluation and selection, thanks to MLlib, Spark's machine learning library, to Spark SQL for querying large and structured data, to Spark Streaming for processing streaming data, and to Spark GraphX for handling graphs and graph-parallel computation. It is also important to note that this improved algorithm can be easily transposed and integrated into any other big data framework. The results of the proposed algorithm, compared to the results of other implementations and to the classical version, show a big improvement in terms of execution time.
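The two metrics used in this evaluation, speedup (sequential time divided by parallel time) and efficiency (speedup divided by the number of nodes), can be computed from measured running times as in the short sketch below. The timings are hypothetical values chosen for illustration and are not measurements from the paper; the function names are likewise assumptions.

```python
def speedup(t_seq, t_par):
    """Speedup: sequential time divided by parallel time."""
    return t_seq / t_par


def efficiency(t_seq, t_par, nodes):
    """Efficiency: speedup divided by the number of nodes used."""
    return speedup(t_seq, t_par) / nodes


# Hypothetical running times in seconds (NOT results from the paper).
t_seq = 120.0
timings = {2: 70.0, 4: 40.0, 8: 25.0}   # number of nodes -> parallel time

report = {
    p: (speedup(t_seq, t_par), efficiency(t_seq, t_par, p))
    for p, t_par in timings.items()
}
```

An efficiency close to 1 means the added nodes are almost fully used; values that fall as the number of nodes grows usually point to communication or scheduling overhead.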

CONCLUSION AND FUTURE WORK
In this paper, we proposed a parallel distributed algorithm based on Spark. It is a new implementation of the posterior decoding algorithm under Spark for the hidden Markov models decoding problem. The proposed algorithm improves the classical algorithm by using the benefits of a big data framework, namely Spark. We evaluated the new algorithm, and the findings verified that it solves the decoding problem significantly faster than the classical algorithm. The obtained speedup is due to the implementation of the new algorithm under Spark, that is, to the distribution of the data over several blocks and to parallel computation. It is worth mentioning that we only investigated the time complexity; for space complexity, the algorithm in this paper yields the same complexity as the classical algorithm. Hence, in future work, we will extend this research to the improvement of the space complexity of the studied algorithm. We also plan to study the impact of the improved algorithm in the context of big data problems in the most promising areas. Thus, this implementation of the classical posterior decoding algorithm under Spark optimized the complexity and reduced the computation time. We have succeeded in improving one of the most important algorithms of hidden Markov models for the decoding problem by leveraging one of the most promising big data technologies, the Spark framework, in a cloud environment. Finally, this parallel distributed posterior decoding algorithm effectively meets the needs of this great digital revolution by proposing an algorithm well adapted to the big data context.