Introduction

This project was started with the following problem in mind:

  • Having to test MPI-parallel Python programs that link against C/C++/Fortran libraries

This poses a set of problems:

  • Processes may take different code paths, so an assert may be evaluated on some processes but not on others.

  • Deadlocks, i.e. processes waiting on each other indefinitely, might happen and need to be accounted for.

  • Because the underlying libraries are not memory safe, any process might segfault at any time and place. One process then fails while the others (potentially) keep running.

  • Any process is allowed to call MPI_Abort at any time and place, stopping the execution.

These can be grouped into two categories:

  • Crashes of the compute environment due to MPI_Abort, segfaults, etc.

  • Differing control flows that lead to some parts of the code being executed on just one process

    • E.g. an if branch being taken on one process but not on the other, which in turn triggers an assert on that process only
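
The second category can be illustrated without MPI by simulating ranks as plain integers. This is only a minimal sketch: `run_on_rank` and `LOCAL_VALUES` are hypothetical names, and a real run would execute the ranks in parallel rather than in a loop. The point is that the assert is reached on one rank and skipped on the others.

```python
# Hypothetical per-rank data; rank 0 happens to hold an invalid value.
LOCAL_VALUES = {0: [-1.0], 1: [2.0], 2: [3.0]}


def run_on_rank(rank: int) -> str:
    """Hypothetical test body; `rank` stands in for the MPI rank."""
    values = LOCAL_VALUES[rank]
    if rank == 0:
        # Only rank 0 enters this branch, so only rank 0 can fail here.
        try:
            assert all(v > 0 for v in values), "value must be positive"
        except AssertionError as exc:
            return f"failed: {exc}"
    return "passed"


# Simulate three ranks one after another.
outcomes = {rank: run_on_rank(rank) for rank in range(3)}
```

Here `outcomes` reports a failure for rank 0 only, so a test runner has to decide how to present a test that failed on one process and passed on the rest.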

To counter these, this code was designed as follows:

  • The main process gathers the tests.

  • The main process uses mpirun to launch a parallel, forked environment.

  • Only the forked environment runs in parallel.

  • Communication with the processes happens via file IO.

    • E.g. the results of the tests are written to file by the processes and read by the main process.
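
The design above can be sketched roughly as follows. This is not the project's actual implementation: `run_isolated`, `WORKER_SOURCE`, the `rank_<n>.json` file naming, and the JSON layout are all hypothetical. The rank lookup assumes Open MPI's `OMPI_COMM_WORLD_RANK` environment variable and falls back to 0; other MPI implementations expose the rank differently. The timeout is one possible way to bound a deadlocked run, not necessarily the one used here.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a test worker: it writes its result to a file, not to a pipe.
# Assumption: the rank comes from Open MPI's environment variable, else 0.
WORKER_SOURCE = """
import json, os, sys
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
result = {"rank": rank, "test_example": "passed"}
with open(os.path.join(sys.argv[1], f"rank_{rank}.json"), "w") as fh:
    json.dump(result, fh)
"""


def run_isolated(launcher, result_dir, timeout=60.0):
    """Run the worker in child processes and read back what they wrote.

    `launcher` is a command prefix such as ["mpirun", "-n", "2"]; an empty
    list runs a single child process without MPI.
    """
    command = [*launcher, sys.executable, "-c", WORKER_SOURCE, str(result_dir)]
    try:
        returncode = subprocess.run(command, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        # The children were killed after the timeout, e.g. due to a deadlock.
        returncode = None
    # Whatever the children managed to write is still on disk.
    files = sorted(Path(result_dir).glob("rank_*.json"))
    results = [json.loads(path.read_text()) for path in files]
    return returncode, results


with tempfile.TemporaryDirectory() as tmp:
    # No MPI launcher here, so the sketch runs anywhere Python does.
    returncode, results = run_isolated([], tmp)
```

With an MPI installation available, passing `["mpirun", "-n", "2"]` as `launcher` would start two workers. Since the main process only spawns a child and reads files, a crash inside the children does not take it down.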

These decisions have the following benefits:

  • If the tests run to completion, the results from all processes can be gathered and joined on the main process, giving a unified report of the test results.

  • The forked environment makes it possible to tolerate MPI_Abort and segfaults, as the main process is not affected.

  • If the tests fail catastrophically (segfault, MPI_Abort), file IO ensures that stdout/stderr outlive the crashed processes, so the main process can still capture the test output.
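
Joining the per-process results into one report could look like the sketch below. The function name `join_results`, the verdict strings, and the rule that a missing entry counts as "crashed" are assumptions made for illustration, not the project's actual data model.

```python
def join_results(per_rank, expected_ranks):
    """Merge per-rank outcomes into one verdict per test.

    `per_rank` maps rank -> {test_name: "passed" | "failed"}. A rank that
    left no entry for a test (e.g. after a segfault) counts as "crashed".
    """
    test_names = sorted({name for outcomes in per_rank.values() for name in outcomes})
    unified = {}
    for name in test_names:
        verdicts = [
            per_rank.get(rank, {}).get(name, "crashed")
            for rank in range(expected_ranks)
        ]
        if "crashed" in verdicts:
            unified[name] = "crashed"
        elif "failed" in verdicts:
            unified[name] = "failed"
        else:
            unified[name] = "passed"
    return unified


# Hypothetical results read back from two ranks.
per_rank = {
    0: {"test_a": "passed", "test_b": "failed"},
    1: {"test_a": "passed", "test_b": "passed"},
}
report = join_results(per_rank, expected_ranks=2)
```

In this sketch a test passes only if every expected rank reported a pass, so a failure or crash on a single process is visible in the unified output.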