Introduction
This project was started with the following problem in mind:
Having to test MPI-parallel Python programs that link against C/C++/Fortran libraries
This poses a set of problems:
Different processes may take different code paths, which the asserts need to take into account.
Deadlocks, i.e. all processes waiting on others, might happen and need to be accounted for.
Because the underlying libraries are not memory safe, any process might encounter a segfault at any time and place. One process then fails while the others (potentially) keep running.
Any process is allowed to call MPI_Abort at any time and place, stopping the execution.
These can be grouped into two categories:
Crashes of the compute environment due to MPI_Abort, segfaults, etc.
Differing control flows that lead to some parts of the code being executed on just one process
E.g. an if being triggered on one process but not on the other and, in turn, an assert triggering
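The second category can be illustrated without MPI at all. The following is a minimal sketch in which a plain loop stands in for the processes; `run_on_rank` and `simulate` are hypothetical names used only for illustration, and `rank`/`size` play the role of the MPI rank and communicator size.

```python
def run_on_rank(rank: int, size: int) -> None:
    """Hypothetical test body; rank and size stand in for the MPI values."""
    data = list(range(size))
    if rank == 0:
        # Only rank 0 enters this branch, so only rank 0 can fail here.
        assert sum(data) == 0, f"rank {rank}: unexpected sum {sum(data)}"


def simulate(size: int) -> dict[int, str]:
    """Run the test body once per simulated rank and record each outcome."""
    outcomes = {}
    for rank in range(size):
        try:
            run_on_rank(rank, size)
            outcomes[rank] = "passed"
        except AssertionError as exc:
            outcomes[rank] = f"failed: {exc}"
    return outcomes
```

With two simulated ranks, the assert fires on rank 0 only, while rank 1 reports success. In a real MPI run the two outcomes would live in two separate processes, which is why they have to be collected and reconciled afterwards.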
To counter these, this code was designed as follows:
The main process gathers the tests.
The main process uses mpirun to generate a parallel, forked environment.
Only the forked environment runs in parallel.
Communication with the processes happens via file IO
E.g. the results of the tests are written to file by the processes and read by the main process.
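The design above could be sketched roughly as follows. This is not the project's actual implementation: the function names, the JSON format, and the `results_rank<N>.json` file naming are assumptions made for illustration. Only `mpirun -n` is taken as given, and the command is built but not executed here.

```python
import json
import sys
from pathlib import Path


def build_command(nprocs: int, worker_script: str, outdir: str) -> list[str]:
    """Command the main process would pass to subprocess.run (not run here)."""
    return ["mpirun", "-n", str(nprocs), sys.executable, worker_script, outdir]


def write_results(outdir: str, rank: int, results: dict) -> Path:
    """Called by each worker: one file per rank, so ranks never share a file."""
    path = Path(outdir) / f"results_rank{rank}.json"
    path.write_text(json.dumps(results))
    return path


def read_results(outdir: str) -> dict[int, dict]:
    """Called by the main process once the parallel run has finished."""
    collected = {}
    for path in sorted(Path(outdir).glob("results_rank*.json")):
        rank = int(path.stem.removeprefix("results_rank"))
        collected[rank] = json.loads(path.read_text())
    return collected
```

Because the main process only ever reads files, it never has to initialise MPI itself. A rank that wrote no file simply does not appear in the dictionary returned by `read_results`, which the main process can treat as a sign that this rank did not finish.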
These decisions have the following benefits:
If the tests run through, the results of the individual processes can be gathered on the main process and joined, giving a unified output of the test results.
The forked environment makes it possible to tolerate MPI_Abort and segfaults, as the main process is not affected.
If the tests fail catastrophically (segfault, MPI_Abort), the use of file IO means that stdout/stderr survive the processes, so the output of the tests can still be captured by the main process.
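The last point can be demonstrated with a small experiment. In this sketch a single plain Python child that calls `os.abort()` stands in for a crashing rank, so it runs without an MPI installation; `run_and_capture`, `demo`, and the log file names are hypothetical and not part of this project.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a crashing rank: print something, then abort without cleanup.
CRASHING_CHILD = "import os; print('before crash', flush=True); os.abort()"


def run_and_capture(cmd: list[str], logdir: str) -> tuple[int, str, str]:
    """Run cmd with stdout/stderr redirected to files, then read them back."""
    out_path = Path(logdir) / "stdout.log"
    err_path = Path(logdir) / "stderr.log"
    with open(out_path, "w") as out, open(err_path, "w") as err:
        returncode = subprocess.run(cmd, stdout=out, stderr=err).returncode
    return returncode, out_path.read_text(), err_path.read_text()


def demo() -> tuple[int, str]:
    """Launch the crashing child and return its exit status and stdout."""
    with tempfile.TemporaryDirectory() as logdir:
        code, out, _err = run_and_capture(
            [sys.executable, "-c", CRASHING_CHILD], logdir
        )
    return code, out
```

The child dies abnormally, so the return code is nonzero, yet the text it printed before the crash is still in the log file and can be attached to the test report by the parent, which itself is unaffected.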