The Baum-Welch algorithm for hidden Markov Models: speed comparison between octave / python / R / scilab / matlab / C / C++

Presentation

In this comparison, I have programmed in a comparable way the Baum-Welch algorithm for discrete Hidden Markov Models (HMM) in several programming languages often used by probabilists and statisticians.

"A comparable way" means that the code follows the same lines in all languages, but tries to take advantage of each language's strengths (e.g., vectorialization in matlab). Each code is probably improvable (I will appreciate any kind of advice: please send me an email!), but the aim of this comparison is precisely to compare "average" implementations under different programming languages.

Two scenarios are considered:

the first scenario is "small" and has a pedagogical interest: inspired by genomics, it contains a hidden chain with two states, and a four-letter observation alphabet;
the second, "medium" scenario involves 7 hidden states and 7 possible observations;
the third, "large" scenario involves 50 hidden states and 50 possible observations.

The experiment run for each language consists in 100 Monte-Carlo repetitions, and has been run on my desktop machine. In each repetition, a sequence of 10000 successive observations is created, filtered and smoothed. Then, an approximation of the maximum likelihood estimate is computed using the Expectation-Maximization (EM) algorithm (starting from 10 random points drawn independently).

The EM iterations are stopped either when the change in parameters becomes smaller than 1e-4 in 1-norm, or after 100 iterations. With this stopping criterion, the number of iterations is almost always 100 (which gives more reliable results for the speed comparison).

The code, provided below in section "Download", allows everyone to replicate this experiment very easily, and to try other choices of parameters.

Motivations

I decided to do this comparison because a colleague of mine, who was coding a variant of the Baum-Welch algorithm in R for some involved and demanding application, was facing speed problems: he wanted to know if it was because of the programming language he was using, and what option to choose to solve his problem. He wanted a avoid a too costly solution, both in terms of price and in terms of speed of development.
I realized that though I had a rough idea of the comparative efficiency of all popular languages, I could not tell him in what proportions the use of these options would solve his problem.
More generally, my main motivations in this comparison are the following:

Many researchers spend a lot of time prototyping new algorithms and testing them on more or less intensive experiments; on the other hand, they often need to teach students how to code and how to make numerical tests. Which language should they use? In that choice, many elements come into account: free availability, speed of execution and development, diffusion in their community... I hope that this contribution may help them comparing between the possible choices.
The great success of high-level languages over the last decades has diminished the interest in C/C++ programming. Do these languages remain interesting / necessary / unavoidable? For which uses?
Everyone knows that C is faster than R; it is really always true? And above all: how much faster? Can a one week experiment in R take on a few minutes if C is used?
Most people prefer to use high-level languages even if they are slower, because development is much faster and easier. But in a second phase, they may want to spend some time optimizing it: then, they have the possibility to re-code the bottlenecks in a compiled language. Is this really worth it? How difficult is it? What is the good compromise?
On all those aspects, the code available here provides an example which is, in my opinion, quite illustrative. Everyone can use it to repeat and tweak the experiments. Moreover, these codes are useful templates that can be adapted to other situations (especially concerning the C API of the high-level languages).

Description of the code

In each language, a Makefile should make the replication of this experiment very easy. Besides, the code might be re-used in applications or teaching; the following functions are provided:

HMMsample: sample trajectory of a Hidden Markov chain
HMMfilter: computes filtering distribution
HMMsmoother: compute correcting factors to obtain the smoothing distribution
HMMbaumwelch: estimate the parameters of the HMM (ie, the transition probabilities of the hidden chain and the emission probabilities in each state) using the Expectation-Maximization algorithm
benchmark: run the experiment described above, printing execution times

The C++ code was written by Pierre Pudlo, who adapted the scilab code to use Armadillo, a C++ linear algebra library possibly relying on LAPACK, BLAS and ATLAS that aims at providing "good balance between speed and ease of use".

For R, matlab, octave and python, the C extension providing a much faster computation of the filtering/smoothing phase requires to be compiled. This is straightforward by using "make".

The python code makes use of the numpy library; its style is a bit "matlabic" as it makes use of matrix multiplications when loops could also be considered.

The C code is compiled with -O2 optimization option.

The matlab code is used without modification in octave.

The high-level languages allow to include some functions programmed in C, providing a more or less handy interface. I have tried this possibility, operating only filtering and smoothing in C. For python, I did a "hand-made" extension; for R, I used the standard (but easy to use) ".C" interface; for matlab, I have used "mex"; Judd Storrs wrote me an email explaining how to use "mkoctfile --mex" for octave.

Download

Here is a bundle with all the code.

If you are interested in one particular language (with the C extension when available), here are the R, matlab / octave, scilab, python, C++ with armadillo, and pure C codes. In all cases, the benchmark with 10 repetitions of scenario 1 can be run using "make benchmark". In most cases, editing the Makefile is sufficient to run other scenarios, or to use the C filtering/smoothing functions.

Results

"Small" scenario 1:

	python	python + C	R	R + C	scilab	matlab	matlab + C	octave	octave + C	C++ (armadillo)	pure C
creation of the sample	0.076	0.076	0.25	0.25	0.32	0.031	0.031	0.59	0.59	0.00082	0.00053
filtering/smoothing	0.29	0.00072	0.33	0.0028	0.14	0.076	0.00072	0.68	0.00072	0.0017	0.00054
10 EM	296	2.14	333	9.20	145	76.5	1.60	691	3.38	2.5	0.94

"Medium" scenario 2:

	python	python + C	R	R + C	scilab	matlab	matlab + C	octave	octave + C	C++ (armadillo)	pure C
creation of the sample	0.14	0.14	0.27	0.27	0.33	0.032	0.032	0.62	0.62	0.0012	0.00058
filtering/smoothing	0.32	0.0035	0.38	0.0067	0.18	0.081	0.0030	0.70	0.0027	0.0038	0.0022
10 EM	331	15.2	395	19.4	229	87.1	10.9	777	13.0	9.55	2.81

"Large" scenario 3:

	python	python + C	R	R + C	scilab	matlab	matlab + C	octave	octave + C	C++ (armadillo)	pure C
creation of the sample	0.28	0.28	0.30	0.30	0.38	0.042	0.042	0.63	0.63	0.0058	0.0015
filtering/smoothing	0.57	0.14	0.68	0.14	0.92	0.15	0.094	0.82	0.083	0.064	0.073
10 EM	688	355	858	384	1473	282	315	980	331	246	43

Times are given in seconds. These measures are not meant to be extremely precise: what matters is the order of magnitude of the ratios.

Conclusions and comments

The differences are not just important, they are HUGE, especially in the small and medium scenarios.
Concerning the small and medium scenarios: in all high-level languages, most of time in the EM iterations is spent in filtering and smoothing. In a typical run of 10 EM runs, the number of filtering / smoothing phases is 10 times 100 iterations: this is why the EM runs (last line of the table) are approximately 1000 times more time than a filtering / smoothing phase (shown in line 3).
Matlab is a lot faster than R and python; obviously, an important reason is that in this experiment only matlab makes uses of the presence of several processors. It is maybe is possible to force python, R, etc. to use mulithreading, but it requires additional work, and such optimizations are out of the scope of this comparison.
The small scenario, which involves tiny matrices, shows how much time can be lost using high-level languages: it typically scales in hundreeds.
The medium and large scenarios show that high-level languages take good advantage of vectorialization, as can be seen for example by comparing the execution times between the first and second scenarios (not much difference, while the matrices in play are significantly larger). In this game, matlab is obviously the best.
Octave is a great free software that can run matlab code without modification (in this example at least). It appears however significantly slower than matlab, probably because it lacks JIT (just-in-time) compiling, cf http://wiki.octave.org/FAQ: "Octave doesn't have a JIT and so to some might seem slower than Matlab. For this reason you must vectorize your code as much as possible". This defect is partially compensated by the fact, thanks to "mkoctfile --mex", octave can use mex files (again, without modification).
Scilab appears as a credible, efficient free alternative to matlab, though it lacks multithreading support. Yet, it was surprisingly unefficient in the "large" scenario, sometimes even causing "out of memory" problems.
Obviously, pure C is a lot faster than all high-level languages in the small scenario. It is much faster for standard operations (loops, floating point arithmetic...), where the inefficiency of high-level languages is still impressive. However, it would be a terrible effort to optimize "by hand" memory management and to take advantage of the availability of several processors, as optimized linear algebra libraries do. This is why matlab is faster on large examples.
The performance of the C++ code using armadillo seems to depend a lot on the available libaries.
When working with high-level languages, it is really worth programming the key operations (filtering/smoothing) in C; this can be somewhat messy, depending on the language; I found it very easy in R, easy in matlab and more complicated in python - probably in part because I did it by hand (not using nice extensions like cython). Reference counting in python requires real care. A particularly tricky detail: bidimensional arrays are not stored in the same order in python: matlab and R store them by colums, python by rows...
Yet, one has to be careful: when linear algebraic operations (even matrix multiplication) are crucial, naive C is clearly not optimal. This is shown in the "big" scenario, where the interest of the C code is much reduced.
Of course, it does not matter at all for this comparison, but it should be mentionned that the EM estimates of the thoudands of free parameters in the last scenario are not really meaningfull from a statistical point of view...
I have run almost the same experiment with the functions for HMM provided by matlab's Statistical Toolbox: the results are: (creation of the sample / computation of the posterior distribution / 10 EM runs) scenario 1: 0.004 / 0.098 / 83; scenario 2: 0.006 / 0.32 / 544; scenario 3: 0.015 / 2.3 / 13684; The sample generator is very well written, but obviously the filtering/smoothing functions are not designed to handle large chains.
Similarly, I have run the same experiment in R using package HMM. In the first scenario, the results are (creation of the sample / computation of the posterior distribution / 10 EM runs): 0.52 / 1.36 / 2334. In the larger scenarios, I obtained an error in the Baum-Welch iterations.
Darren Wilkinson has written a somewhat similar comparison on Gibbs sampler in various languages.
A small bug in the EM steps was discovered by Julius Su, Caltech. Thanks to him, it is now corrected.
This page was referred to by Christian Robert's blog, which brought much traffic and quite a few interesting comments/pieces of advice. Thanks to all of them!

Search form

Main menu