field-theory.org
the blog of wolfram schroers

Parallel and distributed computing — Supercomputing

This article discusses massively parallel machines and supercomputing. One example highlights how optimization can save thousands of dollars. It is for people looking for specialized knowledge in this field of computer science.

Table of contents:

Introduction
Basic parallel operations
Potential pitfalls
Overview of existing systems
Optimization
Summary and conclusions

Introduction

This article summarizes a lecture I gave in summer 2008 on basic operations required for the field of lattice QCD and similar applications on parallel computers. Today there is even more relevance to this topic — with the advent of multi-core CPUs the application of parallel algorithms to regular desktop computers has become a necessity for developers. This may either be a blessing or a curse, depending on your point of view.

For developers interested in number crunching a new field has recently opened: the power of graphics cards can now be harnessed for numerical computations. This hardware has become very cheap due to its widespread use in computer games, making it a very economic alternative to conventional clusters and supercomputers.

This article discusses and reviews existing technology for deploying parallel algorithms on computers today. It focuses on the subset of functionality that is needed in the fields of physics and engineering that deal with data sets that can be represented on huge numerical arrays. This is basically a way of saying that space and/or time can be discretized. This is relevant, for example, in the simulation of electromagnetic fields, the simulation of crash tests, aircraft design, material sciences, climate simulations, or even in quantum field theory.

Basic parallel operations

Many parallel computing environments offer a huge variety of operations. For most practical applications in science and engineering, however, we only need a small subset of these operations. The key functionality is:

Setting up the machine configuration

First we tell the machine that we are ready to start working and want to make sure that the appropriate number of workers is available. In return, the system can tell us how many machines are available and what these machines can do.
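As an illustration, here is a minimal sketch in C using MPI (assuming an MPI installation is present; the printed message is just for demonstration):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int size, rank;

      MPI_Init(&argc, &argv);                /* we are ready to start working */
      MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many workers are there?   */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which worker am I?            */

      printf("Worker %d of %d reporting for duty.\n", rank, size);

      MPI_Finalize();
      return 0;
  }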

Distribute work

In most practical cases we have a special, dedicated supervisor node that is responsible for communicating with the user or the file system. Usually, the supervisor node is called the “root node”. It collects the details of the assignments from input files and then distributes the work to the other nodes.
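With MPI, for example, the root node could broadcast the assignment details and then scatter equal shares of an array to all workers. The following is only a sketch; the array size N and the variable names are made up for illustration:

  #include <mpi.h>
  #include <stdlib.h>

  #define N 1000000   /* total problem size, chosen for illustration */

  void distribute_work(int rank, int size)
  {
      int params[2] = {0, 0};
      double *full = NULL;
      int chunk_len = N / size;  /* assume N divides evenly for simplicity */
      double *chunk = malloc(chunk_len * sizeof(double));

      if (rank == 0) {
          /* root node: read the input files, fill params and the full array */
          full = malloc(N * sizeof(double));
      }

      /* send the assignment details to everybody */
      MPI_Bcast(params, 2, MPI_INT, 0, MPI_COMM_WORLD);

      /* hand each worker its share of the data */
      MPI_Scatter(full, chunk_len, MPI_DOUBLE,
                  chunk, chunk_len, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  }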

Collect/gather data

Once all workers have finished their individual assignments the root node is again responsible for collecting the data and returning it to the file system or the user, respectively. This operation is usually the inverse of the previous one. However, there is a class of global operations that is frequently used, but different from the mere collection itself:

Reduction operations on data

Frequently used examples of such operations are global sums and global products. In some cases the accuracy of the global operation may be an issue. In particular, if the data on the nodes is kept only in single precision (and there are reasons to do so, e.g. performance when network and/or memory bandwidth is a bottleneck), the global sum certainly needs to be carried out in double precision. There are other methods, like Kahan summation, to attack this problem.
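As a sketch of the precision issue, assuming the local data is stored in single precision, the local accumulation and the global sum can still be done in double precision with MPI:

  #include <mpi.h>

  /* Sum single-precision data from all nodes, accumulating in double
     precision both locally and in the global reduction. */
  double global_sum(const float *local, int n)
  {
      double partial = 0.0, total = 0.0;
      int i;

      for (i = 0; i < n; i++)
          partial += local[i];

      MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return total;
  }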

Node to node communication

Individual nodes simultaneously send data to and receive data from their neighbors. Alternatively, some of the nodes send data while others collect it, similar to the global reduction operations mentioned above.
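A typical pattern in lattice-style codes is a simultaneous shift of boundary data around a ring of nodes. A sketch with MPI (the buffer names are placeholders):

  #include <mpi.h>

  /* Send a buffer to the right neighbor while receiving the corresponding
     buffer from the left neighbor on a ring of "size" nodes. */
  void shift_right(double *sendbuf, double *recvbuf, int n, int rank, int size)
  {
      int right = (rank + 1) % size;
      int left  = (rank - 1 + size) % size;

      MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, right, 0,
                   recvbuf, n, MPI_DOUBLE, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }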

Note that for huge amounts of data the root node may be very busy for quite some time, leaving the other nodes idle. Just like with human workers this is not a very desirable situation. Just imagine that you have a machine with 100 nodes that takes 50 minutes to do a computation. At the beginning the root node takes five minutes to read and distribute the data and at the end it takes another five minutes to collect and write out the data. That does not sound like much, but actually it means that 99 nodes are doing nothing for 10 minutes. This is equivalent to a single node being idle for 990 minutes – which is a lot! The solution to such problems is quite complicated and usually involves a parallel file system in connection with parallel I/O.
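To put these numbers into perspective, here is a back-of-the-envelope calculation (just the arithmetic from the example above, not a benchmark):

  #include <stdio.h>

  /* 100 nodes, 50 minutes total run time, and the root node alone
     spends 5 minutes reading and 5 minutes writing. */
  int main(void)
  {
      int    nodes     = 100;
      double total_min = 50.0;
      double io_min    = 5.0 + 5.0;

      double wasted   = (nodes - 1) * io_min;   /* 99 * 10 = 990 node-minutes   */
      double capacity = nodes * total_min;      /* 100 * 50 = 5000 node-minutes */

      printf("%.0f of %.0f node-minutes wasted (%.1f%%)\n",
             wasted, capacity, 100.0 * wasted / capacity);
      return 0;
  }

Almost a fifth of the machine's capacity is lost to serial I/O in this example.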

There are a few other operations that are useful in many applications:

Blocking/non-blocking communications

Sometimes it is possible to tell the machine to send or receive data without waiting for the transfer to complete. If a thread will need data at some later point in a calculation but is busy doing something else at the moment, it can initiate the communication and come back later to check whether the data has arrived. This is called non-blocking communication.

If the data is needed at a point in the program flow where processing cannot continue without it, one chooses a blocking communication instead. You might wonder why not everybody uses non-blocking communications all the time. The reason is that non-blocking communications can more easily lead to conflicts – like deadlocks – which are much more difficult to find and debug.
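With MPI the non-blocking variant looks roughly as follows (a sketch; the buffer and the neighbor rank are placeholders):

  #include <mpi.h>

  void overlap_example(double *halo, int n, int neighbor)
  {
      MPI_Request req;

      /* start receiving the data without waiting for its arrival */
      MPI_Irecv(halo, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);

      /* ... do other useful work that does not need the halo data ... */

      /* block only at the point where the data is actually required */
      MPI_Wait(&req, MPI_STATUS_IGNORE);
  }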

Barriers

Barrier operations are useful to synchronize different threads. With barriers one can make sure that all nodes that call this operation will continue from the same point. This is particularly important when concurrent threads access the same resource at the same time; in such a situation locks and synchronization can keep the system consistent. However, one always needs to be careful that those threads do not end up waiting for each other, a problem that leads to the program “hanging” and is referred to as a “deadlock”.
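In MPI a barrier is a single call; every node that enters it waits until all other nodes have arrived (a sketch):

  #include <mpi.h>
  #include <stdio.h>

  void checkpoint(int rank)
  {
      /* nobody continues past this line before everybody has reached it */
      MPI_Barrier(MPI_COMM_WORLD);

      if (rank == 0)
          printf("all nodes have passed the checkpoint\n");
  }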

Potential pitfalls

We already mentioned the challenge of avoiding deadlocks in connection with synchronization and barriers. This is a common pattern to watch out for and has therefore been discussed widely in the literature. A didactic example with possible solutions is the “Dining Philosophers Problem”; see this Wikipedia article for a discussion.

Another potential problem that one needs to be aware of is the livelock – threads keep busy but do not progress towards the goal. This occurs much less frequently, but if it happens it is more difficult to locate and debug since the threads always appear to be busy and appear to do actual work.

In some cases it happens that the work is not distributed symmetrically, i.e., some threads are busy while others are idle. This usually does not affect the correctness of the program, but it defies the purpose of parallelization: the desire to solve the problem faster than sequentially. It is the purpose of optimization to identify the bottlenecks of each implementation on a given system and find the best way to treat them. A practical example is discussed below.

One subtlety which is most difficult to find and debug is the so-called “race condition”. A race condition refers to a situation where the result of a computation depends on the order in which threads are executed and synchronized. One case where race conditions are not troublesome is the global sum, see above. With finite precision the summation of numbers may depend on the order in which they are summed. However, since the resulting sum will always have an uncertainty of that order, one can accept that the result may not be bit-identical in repeated program runs. In other cases, however, race conditions may produce correct results in some runs and incorrect ones in others. This is what makes them so difficult and challenging to understand.
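A minimal illustration with OpenMP in C: without the reduction clause all threads would update the shared variable sum concurrently, and the result would depend on how the threads happen to be scheduled; with the clause the runtime combines the partial sums safely.

  double sum_array(const double *a, int n)
  {
      double sum = 0.0;
      int i;

      /* "reduction(+:sum)" gives each thread a private partial sum and
         combines them at the end; omitting it creates a race condition. */
      #pragma omp parallel for reduction(+:sum)
      for (i = 0; i < n; i++)
          sum += a[i];

      return sum;
  }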

Overview of existing systems

PVM (Parallel Virtual Machine)

PVM has been around for many years. It supports all basic parallel operations, and its strong points are its support of heterogeneous clusters, i.e., different machines and even operating systems, its ability to dynamically change the machine configuration, and its high portability.

Furthermore, it supports fault tolerance, i.e., the program can detect if a node fails. The program can then take appropriate countermeasures. On the down side it has only limited support for topology-dependent communication. This system is not in wide use in the quantum field theory community since people may not need the advantages it offers over other systems and often work on machines which use MPI as the default communication infrastructure.

MPI (Message Passing Interface)

MPI is the other senior and mature system in widespread use. The older MPI-1 specification only supports static machine configurations which cannot be changed during the run of the program. Although the machine could in theory consist of different nodes, in reality most massively parallel machines and supercomputers are homogeneous.

Although the machine configuration cannot be changed, communication operations can be configured extremely flexibly and on the fly. A user can design arbitrary topologies at any time. This power is far more than what most people need when developing numerical code.
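For example, the nodes can be arranged on a periodic two-dimensional grid at run time with a Cartesian communicator; the sketch below also asks for the neighbors of the current node in one direction (the choice of two dimensions is only for illustration):

  #include <mpi.h>

  void make_grid(MPI_Comm *grid, int *left, int *right)
  {
      int dims[2]    = {0, 0};   /* let MPI choose the factorization        */
      int periods[2] = {1, 1};   /* periodic boundaries in both directions  */
      int size;

      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Dims_create(size, 2, dims);
      MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, grid);
      MPI_Cart_shift(*grid, 0, 1, left, right);
  }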

On the down side, MPI-1 offers absolutely no fault tolerance, i.e., the failure of a single node will result in the inevitable crash of the entire program. MPI offers support for all major programming languages and is available on most machines and operating systems.

MPI-2

MPI-2 is an improvement over the older MPI-1 standard. It specifies mechanisms for dynamic process management, parallel I/O, and even fault tolerance. In reality there are only a few programs that need all of this power. In fact, dynamic configuration changes and fault tolerance are features that are not supported by most batch systems and do not really make sense on supercomputers dedicated to number crunching. Hence, fully compliant implementations are not available on all systems.

OpenMP

OpenMP supports only a subset of all the operations listed above. It works only on shared-memory systems and does not support different topologies for communication among different nodes. It has the unbeatable advantage that it does not require major changes to the source code; merely adding a couple of compiler directives – or comment lines in FORTRAN – is enough. It is very simple to learn and deploy. It can also be combined with PVM or MPI.
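A typical example is parallelizing an existing loop in C by adding a single directive; a compiler without OpenMP support simply ignores the pragma:

  void scale(double *a, int n, double factor)
  {
      int i;

      /* the pragma is the only change compared to the serial code */
      #pragma omp parallel for
      for (i = 0; i < n; i++)
          a[i] *= factor;
  }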

Special-purpose systems

For vertical niche markets there are a couple of solutions tailored to very specific needs. There are software solutions like special-purpose communication libraries, e.g., the quantum field theory library QMP (QCD Message Passing). But there is even special-purpose hardware, e.g. the Quadrics/APE and QCDOC machines for quantum field theory or Grape for astrophysical molecular dynamics simulations.

These systems support a subset of communication operations that is specific to the problem in question. They almost always employ a fixed topology which is already established at compile time. Furthermore, they typically only support the communication of elementary data types.

In recent years graphics cards have become available as massively parallel number-crunching machines. One solution, the middleware CUDA from NVIDIA, gained a lot of popularity as it was one of the first such systems, appearing in February 2007. CUDA is tied to NVIDIA graphics cards, however. Recently, the vendor-independent middleware standard OpenCL became available, which supports both NVIDIA and AMD/ATI graphics cards. At the time of this writing (summer 2010) CUDA is still the more mature solution; it also remains to be seen to what extent OpenCL will allow for optimization to hardware-specific properties.

Optimization

The need to optimize for a specific machine introduces nontrivial challenges. The most common one concerns the communication between different nodes. One needs to consider the latency and the bandwidth, and let these factors determine the distribution of data among the machine's nodes. I have written two articles that exemplify how optimization is done in the real world; a simple communication cost model is sketched after the list below.

Practical example: Latency vs bandwidth
Based on a real-world physics endeavor. The actual machines cost hundreds of thousands of dollars and even small optimizations of 10-20% are worth thousands of dollars!
Hypersystolic routing
Again based on a real-world problem. It describes a communication routing strategy for massively parallel systems. This improvement is also worth thousands of dollars.
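To get a feeling for the latency-versus-bandwidth trade-off, a simple cost model is often a good starting point: the time for a single message is roughly the latency plus the message size divided by the bandwidth. The sketch below uses made-up hardware numbers, not measurements, and shows why many small messages are worse than one large message carrying the same data:

  #include <stdio.h>

  /* time = latency + message_size / bandwidth */
  double transfer_time(double bytes, double latency_s, double bandwidth_Bps)
  {
      return latency_s + bytes / bandwidth_Bps;
  }

  int main(void)
  {
      double latency   = 5e-6;   /* 5 microseconds, assumed */
      double bandwidth = 1e9;    /* 1 GB/s, assumed         */

      /* one large message versus 1000 small ones with the same total size */
      printf("1 x 8 MB   : %g s\n", transfer_time(8e6, latency, bandwidth));
      printf("1000 x 8 kB: %g s\n", 1000.0 * transfer_time(8e3, latency, bandwidth));
      return 0;
  }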

Summary and conclusions

Parallel computing introduces new paradigms and new challenges to the programmer. Initially, it takes a little getting used to, but it can be very rewarding afterward. In my experience the best approach is learning by doing. The programming systems I have discussed above offer different approaches to problems, but are all based on similar basic concepts. As soon as one understands the basic ideas of parallel programming, a developer can fully benefit from modern hardware. With optimization being a key component of program design, one can in addition save thousands of dollars by doing it right.