MyHDL is fast!
The message from this page is loud and clear: MyHDL is fast!
MyHDL is implemented as a pure Python application. Python is a very high-level, dynamically typed language. Traditionally, such languages perform much worse than statically typed languages. Moreover, it is usually taken for granted that there is a trade-off between expressive power and performance.
However, thanks to technological advances it is time to challenge the conventional wisdom, for Python and for MyHDL in particular. In contrast to what you might expect, using MyHDL does not necessarily imply a simulation performance penalty when compared to statically typed HDLs like Verilog or VHDL. If you think this is impossible, read on.
To make MyHDL simulations fast, all you have to do is use the right Python interpreter, which may be different from the one you are using today. The interpreter of choice is developed by the PyPy project and comes with a Just-In-Time (JIT) compiler. The JIT compiler does the trick: it can aggressively optimize those parts of the code that matter, by inspecting the characteristics of the running program.
If the prospect of using a different, unfamiliar interpreter sounds scary, don't worry. Compatibility with Python is a major goal of the PyPy project. PyPy is highly compliant with Python 2.7 and can basically be used as a drop-in replacement. The process is similar to upgrading to a new version of Python. However, it is an upgrade with spectacular consequences.
Using PyPy to speed up MyHDL simulations
To demonstrate what PyPy can do for MyHDL simulations, I have compiled a number of representative hardware design benchmarks. I run them both with the reference Python interpreter (cPython) and with the JIT-enabled PyPy interpreter (pypy) and time them. The results are as follows (times in seconds):
As you can see, the results are spectacular. By simply using a different interpreter, our simulations run 8 to 20 times faster. When it comes to instant improvements without compromises, it doesn't get better than this!
I initially engineered the benchmarks so that they run for about 100s with PyPy 1.5. This leaves some margin to easily compare them in the future with runs on better machines and with future PyPy versions. Also, to get a realistic idea of the speedup factor, the simulations should run long enough. More details can be found here.
Perhaps we should temper our initial enthusiasm a little. After all, we are still comparing MyHDL to MyHDL. Perhaps we have merely gone from extremely bad to very bad. To get more insight into where we are, we should compare with equivalent benchmarks in Verilog and VHDL.
Comparison to VHDL and Verilog
The benchmarks are implemented as self-contained test benches that can be
converted automatically to Verilog and VHDL (by MyHDL). I have run them on
two Verilog simulators and two VHDL simulators. All simulators are
available at no cost, but two of them have to be kept anonymous. I have
called those simulators
A first thing to note is that the results go in all directions, sometimes in an astonishing way. There is no clear "winner". Some simulators perform badly in some benchmarks, and well in others. Therefore, the data suggest that the benchmarks form a good set that covers various aspects of hardware simulation.
Moreover, there seems to be no correlation between the VHDL simulators, nor between the Verilog simulators. This indicates that there is nothing wrong with the quality of the Verilog and VHDL code generated by MyHDL.
The most important conclusion: MyHDL is doing just fine. In two benchmarks, it is actually the fastest simulator, and it is never the slowest one. The comparatively weakest performance is in the findmax benchmark, which is also the benchmark with the smallest speedup factor from using PyPy.
The results for other simulators are also remarkable. For example, GHDL shows very good performance in the randgen benchmark, but very bad performance in the findmax benchmark. In the latter case it is 100 times slower than the fastest simulator. Clearly, it should be interesting for other developers to look into the details of the benchmarks.
More info on the benchmarks can be found here.
A note on paid-license commercial simulators
The benchmark data are all from zero-cost HDL simulators. I think that is fair and the reasons are understandable. Note that even if I had the numbers for paid-license commercial simulators, I would typically not be legally allowed to publish them.
As performance is such an important competitive driver, you can reasonably expect significantly better performance from paid-license Verilog and VHDL simulators. One simulation vendor has granted permission to publish numbers: Tachyon with their cvc Verilog simulator. Performance is the name of the game for cvc, and it really is blazingly fast: in compiled mode, all benchmarks but one run in times between 2.2s and 2.5s. (longdiv takes just 1s.) If you need high-performance Verilog simulation, cvc is definitely worth a look.
Tachyon convincingly claims that it outperforms any other commercial simulator by a factor of 2 to 20. This would mean that the benchmarks would run in times between 4.4s and 50s for other paid-license simulators. (2s-20s for longdiv.)
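The estimate above is simple arithmetic on the quoted cvc timings. A minimal sketch, assuming the claimed factor of 2 to 20 applies uniformly to every benchmark (the function name and its parameters are illustrative only, not part of any tool):

```python
def estimated_range(fastest, slowest, min_factor=2, max_factor=20):
    """Scale a (fastest, slowest) time range in seconds by the claimed factors."""
    return (fastest * min_factor, slowest * max_factor)


# cvc timings quoted in the text: 2.2s to 2.5s for most benchmarks, 1s for longdiv.
print("most benchmarks:", estimated_range(2.2, 2.5))
print("longdiv:", estimated_range(1.0, 1.0))
```

The lower bound pairs the fastest run with the smallest factor and the upper bound pairs the slowest run with the largest factor, which is how the 4.4s to 50s range is obtained.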
By simply changing the Python interpreter, MyHDL is playing in the same league as Verilog and VHDL simulators. This is a remarkable achievement, given that Python's power stays completely available. There is no reason anymore to avoid MyHDL because of performance concerns.
The results are a great validation of the concept behind MyHDL. MyHDL is based on the idea that digital hardware design is not special enough to warrant the design of a dedicated hardware description language, at least not at the RTL level and higher. Instead, everything needed can be done with a dedicated library and dedicated usage of a general-purpose language. The implication is that one benefits from all advances in the underlying language, even if they seem unrelated to the specific purpose. For example, the PyPy team is not trying to make hardware simulation fast. What they do is make Python fast in general.
The numbers are much better than the average speedup of the standard PyPy benchmark set, which stands at 4.1 for PyPy 1.6. This suggests that MyHDL simulation is well suited for JIT optimization. I believe this can be explained by considering the characteristics of a typical MyHDL simulation run. It consists of two phases: first an elaboration phase that creates a simulatable data structure, followed by the actual simulation phase. The elaboration phase is typically very fast and may use a lot of Python's dynamic features. The simulation phase can run for a long time. However, the simulatable data structure is typically rather "static" and probably a good target for the JIT compiler. It consists of generators, connected by signals and controlled by a simulation engine. The code within the generators is typically a loop containing integer arithmetic and bit operations.
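The two-phase structure described above can be illustrated with a toy model. This is a sketch in plain Python, not the MyHDL API: a dictionary stands in for signals, `elaborate` and `simulate` are hypothetical names, and the LFSR taps are arbitrary values chosen for illustration.

```python
def lfsr(state, width=24, taps=(24, 23, 22, 17)):
    """Generator modelling a bit-serial LFSR: one shift per simulation step."""
    mask = (1 << width) - 1
    while True:
        value = state["lfsr"]
        # The feedback bit is the XOR of the tapped bits.
        bit = 0
        for t in taps:
            bit ^= (value >> (t - 1)) & 1
        state["lfsr"] = ((value << 1) | bit) & mask
        yield


def counter(state, limit=8):
    """Generator modelling an incrementer that wraps around at `limit`."""
    while True:
        state["count"] = (state["count"] + 1) % limit
        yield


def elaborate():
    """Elaboration phase: build the static structure once."""
    state = {"lfsr": 1, "count": 0}  # stand-in for signals
    processes = [lfsr(state), counter(state)]
    return state, processes


def simulate(processes, steps):
    """Simulation phase: the same generators are resumed over and over."""
    for _ in range(steps):
        for proc in processes:
            next(proc)


state, processes = elaborate()
simulate(processes, 1000)
print(state)
```

The inner loops contain only integer arithmetic and bit operations on a structure that does not change after elaboration, which is the kind of code a tracing JIT can be expected to handle well.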
A personal note
--- Jan Decaluwe 2011/06/06 04:57
Occasionally, I make bold predictions. In many cases they just stay wishful thinking. This is one of the rare cases when they turn out to be right. Therefore, please allow me a brief moment of glory. Here is a quote from a post in 2006:
> From: Jan Decaluwe email@example.com
> Subject: Re: MyHDL performance
> Newsgroups: gmane.comp.python.myhdl
> Date: 2006-11-30 10:19:47 GMT
>
> ...
> Finally, there is PyPy. I once saw a demo by
> Armin Rigo showing a massive speedup by using
> psyco. Unfortunately, psyco cannot handle
> generators. Instead, Armin and others started
> the PyPy project that at one time may bring
> psyco-like advantages to general Python code.
> It would seem to me that MyHDL is a good
> candidate, because a lot of code is run over
> and over again during simulation.
> So perhaps one day I'll be able to report a
> massive speedup without having to do anything
> myself :-) That will be the day!
> ...
The day when I can report a massive speedup has now arrived. As it turns out, the results for MyHDL are indeed particularly good. And it really was Armin Rigo and his fellow PyPy team members who pulled it off.
I would like to express my respect and gratitude to the PyPy team and their achievements. They have a great vision, as well as the technical excellence and perseverance to make it happen.
When I briefly described JIT optimization earlier, I tried to make it sound intuitive and easy. Intuitive it may be, but easy it certainly is not. There have been several attempts to speed up Python in the past, including one sponsored by Google (Unladen Swallow). So far, the PyPy project is the first and only one that can truly be called a great success.
Interestingly, the PyPy team did not really write a JIT compiler for Python. I guess that would have been too easy :-). Instead they wrote a JIT compiler generator that can generate a JIT-enabled interpreter for any language whose interpreter is written in a Python subset called RPython (for "Restricted Python"). In other words, the goodies they provide are not restricted to the Python world.
I see glimpses of greatness in the PyPy project, and I believe we are going to hear a lot more about them in the future.
All benchmarks are implemented as self-contained test benches. The design under test in each test bench is a synthesizable RTL description of a typical hardware module. The test benches are self-checking using assert statements, except for randgen, which creates an output file. They have been written in MyHDL in such a way that they can be converted automatically to Verilog and VHDL.
The functionality corresponding to the design under test in the benchmarks is the following:
timer: A circuit that continuously emits a pulse after a fixed number of clock cycles. Implemented using an incrementer.
lfsr24: A classical bit-serial linear feedback shift register, based on a polynomial of length 24.
randgen: A pseudo-random generator in hardware with a 32-bit output word. Implemented by putting a loop around an lfsr of length 64.
longdiv: A long division algorithm. Implemented as an FSM that calculates one bit per clock cycle.
findmax: A combinatorial network that finds a maximum value out of a set of 32 16-bit inputs.
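To give a feeling for the kind of computation involved, here is a behavioural sketch of bit-serial long division in plain Python, where each loop iteration corresponds to one clock cycle producing one quotient bit. It is an illustration under my own assumptions (restoring division, a hypothetical `long_divide` function, a 16-bit default width), not the code of the longdiv benchmark.

```python
def long_divide(dividend, divisor, width=16):
    """Restoring long division: one quotient bit per loop iteration.

    Assumes 0 <= dividend < 2**width and divisor > 0.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit (MSB first) into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        # Subtract the divisor when it fits, and record a 1 in the quotient.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(long_divide(1000, 7))
```

The result can be checked against Python's built-in `divmod` for any inputs within the chosen width.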
The benchmarks are developed in the MyHDL repository. They can be found
I have run all simulations on my laptop with the following characteristics:
|Processor|Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz|
|Operating System|Ubuntu 10.04 (lucid)|
The JIT warming-up phase
A JIT compiler has a slow warm-up phase, during which it examines and optimizes the running program. Very fast runs can therefore even be slower with a JIT than without. On the other hand, the longer a program runs, the faster it gets. From what I have read, the warm-up phase can take from a few seconds up to a minute, depending on the size of the program. To get a feeling for what it may mean for MyHDL, I have run one of the benchmarks several times, each time doubling the number of test vectors. The results are as follows:
As you can see, the simulation gets up to speed after a few seconds. To get the initial offset out of the speedup factor, it has to run long enough. For this reason, I have engineered the benchmarks so that the PyPy version runs for around 100s.
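The measurement procedure described above can be sketched as follows. This is a generic harness under my own assumptions, not the script used for the published numbers: `workload` is a hypothetical stand-in for a benchmark, and `measure` doubles its size on each run. Running the same file under both interpreters and comparing the timings per size shows where the warm-up cost stops dominating.

```python
from timeit import default_timer


def workload(n):
    """Integer and bit operations in a loop, loosely like RTL simulation code."""
    acc = 0
    for i in range(n):
        acc = ((acc << 1) ^ i) & 0xFFFFFF
    return acc


def measure(start=1000, doublings=5):
    """Time the workload repeatedly, doubling its size each time."""
    results = []
    n = start
    for _ in range(doublings):
        t0 = default_timer()
        workload(n)
        elapsed = default_timer() - t0
        results.append((n, elapsed))
        n *= 2
    return results


for n, elapsed in measure():
    print(n, "iterations:", round(elapsed, 6), "s")
```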
Installation, hints, and known issues
To get started with PyPy, you may want to install it alongside your current Python interpreter. This is fairly easy to do by following these installation instructions. In this way you can have a pypy interpreter next to your python interpreter and switch between them as needed.
If you use a test framework like py.test you can select the desired interpreter by using the command python -m py.test or pypy -m py.test.
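When switching between interpreters, it can be handy to confirm which one is actually running a script or test session. A small sketch using the standard library's `platform.python_implementation()`; the helper name `interpreter_name` is my own:

```python
import platform
import sys


def interpreter_name():
    """Return the implementation name, for example 'CPython' or 'PyPy'."""
    return platform.python_implementation()


print(interpreter_name(), sys.version.split()[0])
```

Printing this at the start of a benchmark run makes it harder to attribute a timing to the wrong interpreter.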
The PyPy project is under intensive development. New revisions can result in significant improvements, but this depends heavily on the application.
Between pypy-1.6 and pypy-1.9, the results for the MyHDL benchmarks have stagnated or become slightly worse. The following table keeps track of the historical benchmark data.
The benchmark data have not yet been updated for the latest pypy versions. At the time of this update, pypy is at version number 2.2.