High performance communications libraries for Microsoft Windows 2000
This page describes the port of high performance communications libraries
(BIP, MPI-BIP) to Microsoft Windows 2000. This work is funded by Microsoft
Research through a project with INRIA Rhône-Alpes. This research is
conducted by people from RESO Action INRIA inside the RESAM Laboratory in
Lyon, France.
People
Microsoft contact
- Pierre-Yves
Saintoyant (Responsible for the University Relations in
"Europe Middle East and Africa" area at Microsoft)
Acknowledgments
We would like to thank Loïc Prylli (LIP) for the
help he provided with the GM driver.
Port of BIP low level communication layer on Microsoft Windows 2000
The BIP low level communication layer is composed of three
components. We describe them briefly here and introduce the strategy
we used to get them working on Windows.
- A kernel module. It has several roles:
- First it must act as a driver: it discovers the Myrinet board,
registers itself with the operating system as the driver for this
peripheral and properly initializes the hardware.
- Like any classical network driver, it may interface with the TCP/IP
stack to handle communications for the Myrinet board.
- It provides some basic services to the BIP library. At
initialization time, it gives the BIP library direct access to the
Myrinet board. The BIP library also uses it to register and unregister
memory (pin memory pages down in physical memory and provide the
address translations).
We are not interested in the second role, since the main advantage of
using BIP is to have zero-copy communications with a lightweight
protocol. To provide the first and third services to the library, we
chose to rely on Myricom's GM driver. This driver already provides
functionality close to what we need, and it is available for a wide
range of platforms, including Windows 2000. So we modified the GM
driver so that it provides a new set of services for the BIP library.
This was not trivial, but it was probably a lot easier than writing a
new driver from scratch.
- The BIP library. When it was written, it targeted only Linux. Thus,
even though there are no fundamental limits that prevent a native port
to the Win32 system, we decided to use the Cygwin porting layer, which
is freely available. Using this library has several advantages.
Maintenance of the code is easy: there is only one set of source files,
with no ugly #ifdef/#endif. It comes with a full environment which
includes a set of handy tools: make to manage the project, gcc to
compile the code, perl to run the script provided with BIP, and ssh to
access the remote nodes. It is under very active development and is
improving at a quick pace. We see very few objections to the use of the
Cygwin system. It is still possible to use a third-party compiler for
the application to ensure top performance. The BIP and MPI libraries
themselves don't use system calls for any critical tasks, and the
application writer is free to use Win32 calls directly to avoid the
extra overhead introduced by the Cygwin layer. Note that even though
the Cygwin library is a very powerful tool, we still had to rewrite
some parts of the BIP library using native Win32 calls.
- The firmware: nothing to do here, it is independent of the
operating system.
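To illustrate the memory registration service of the kernel module described above: pinning works on whole pages, so a registration request for an arbitrary buffer first has to be rounded out to page boundaries. The following is only a sketch of that bookkeeping, not the code of the BIP library or of the GM driver. The function name `page_span` and the 4 KiB page size are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on 32-bit x86


def page_span(addr, length, page_size=PAGE_SIZE):
    """Return (first_page_address, page_count) covering [addr, addr + length).

    Hypothetical helper: a driver would pin exactly these pages and
    record one address translation per page.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    # Round the start address down to the beginning of its page.
    first = addr - (addr % page_size)
    # Page that contains the last byte of the buffer.
    last = (addr + length - 1) // page_size * page_size
    return first, (last - first) // page_size + 1
```

A two-byte buffer that straddles a page boundary, such as `page_span(0x1FFF, 2)`, needs two pages pinned, which is why the count cannot simply be derived from the length alone.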
Port of MPI-BIP high level communication layer on Microsoft Windows 2000
Since MPI-BIP is a higher level layer, it is less dependent on the
underlying operating system and hardware. The port of MPI on top of
BIP was relatively easy and allows us to run experiments and gather
application results (NAS benchmarks).
Performance results
The following experiments were run on a cluster of 8 dual 933 MHz PIII
connected by Myrinet 2000 hardware (Lanai9 133 MHz, serial links).
Micro-benchmarks: point to point experiments
Point to point latency of BIP and MPI-BIP
Point to point bandwidth of BIP and MPI-BIP
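This page does not state how the point to point figures were measured. A common method is a ping-pong test, where a message is bounced between two nodes and the round-trip time is halved. Assuming that method, the sketch below shows how latency and bandwidth would be derived from a timed run. The function names and parameters are hypothetical and chosen for the example.

```python
def one_way_latency_us(total_seconds, iterations):
    """One-way latency in microseconds from a timed ping-pong loop.

    total_seconds is the wall-clock time for `iterations` round trips;
    each round trip is two one-way transfers.
    """
    return total_seconds / (2 * iterations) * 1e6


def bandwidth_mb_per_s(message_bytes, total_seconds, iterations):
    """Bandwidth in MB/s (10**6 bytes per second) for one message size."""
    one_way_seconds = total_seconds / (2 * iterations)
    return message_bytes / one_way_seconds / 1e6
```

Latency is usually read from the smallest message sizes and bandwidth from the largest ones, since the fixed per-message cost dominates the former and the link rate dominates the latter.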
| IS | 1 processor | 4 processors | 8 processors | 8 x 2 processors |
| --- | --- | --- | --- | --- |
| Time in seconds (class A) | 9.47 | 2.66 | 1.53 | 1.31 |
| Time in seconds (class B) | 38.02 | 10.70 | 6.05 | 5.34 |

| LU | 1 processor | 4 processors | 8 processors | 8 x 2 processors |
| --- | --- | --- | --- | --- |
| Time in seconds (class A) | 1596.73 | 397.62 | 201.39 | 195.75 |
| Time in seconds (class B) | | 1647.42 | 862.58 | 536.09 |
These results are comparable to the ones we get under Linux on the same
platform. The only difference is the performance of the Fortran
compiler (g77): the one provided with Cygwin generates significantly
slower code. We didn't investigate the problem much, but it is
probably possible to correct this behaviour.
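The timings above are easier to compare as speedup and parallel efficiency. The short sketch below computes both for the IS class A row. It treats the "8 x 2 processors" column as 16 processors, which is our reading of the table rather than something stated explicitly, and the helper names are made up for the example.

```python
def speedup(t_serial, t_parallel):
    """Speedup relative to the single-processor time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Parallel efficiency: speedup divided by the processor count."""
    return speedup(t_serial, t_parallel) / processors


# IS class A times in seconds, keyed by processor count.
# Assumption: the "8 x 2 processors" column means 16 processors.
is_class_a = {1: 9.47, 4: 2.66, 8: 1.53, 16: 1.31}

for procs, seconds in sorted(is_class_a.items()):
    s = speedup(is_class_a[1], seconds)
    e = efficiency(is_class_a[1], seconds, procs)
    print(f"{procs:2d} processors: speedup {s:.2f}, efficiency {e:.2f}")
```

The same helpers apply to the other rows, except LU class B, where no single-processor time is reported and a speedup would have to be taken relative to the 4 processor run.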
Roland WESTRELIN
Last modified: Wed Mar 30 18:34:25 CEST 2005