libELC: a portable checkpoint/recovery library for C/MPI programs

Alexey Lastovetsky
Peng Zhao

This project investigates how to provide a portable fault-tolerance facility for MPI programs running on heterogeneous networks. It is motivated by the ever-increasing deployment of heterogeneous networks of computers for solving computation-intensive problems. Most existing fault-tolerance mechanisms for MPI programs are not system-independent: they are either built for a particular platform or, more often, implemented as plug-ins to specific MPI distributions.

This project proposes a new coordinated checkpointing algorithm, Event Logging, which addresses the application-level non-FIFO message-passing problem in the Chandy-Lamport algorithm. The project also designs and implements libELC, a portable checkpoint/recovery library for C/MPI programs that uses Event Logging for process coordination.
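
To make the application-level non-FIFO problem concrete, here is a minimal sketch in plain C. It does not use libELC or MPI, and it does not implement Event Logging. It only simulates one channel whose receiver selects messages by tag, as an MPI receive with a specific tag can. The Chandy-Lamport algorithm assumes that a marker never overtakes messages sent before it; with tag-selective receives, the application can see the marker first. All names here (Channel, Msg, TAG_DATA, TAG_MARKER, and the functions) are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for this sketch; they are not part of libELC or MPI. */
enum { TAG_DATA = 1, TAG_MARKER = 2 };
enum { CHANNEL_CAPACITY = 16 };

typedef struct {
    int tag;
    int payload;
} Msg;

/* A single sender-to-receiver channel that stores messages in send order. */
typedef struct {
    Msg items[CHANNEL_CAPACITY];
    size_t count;
} Channel;

/* Append a message to the channel. Returns 0 on success, -1 if full. */
int channel_send(Channel *ch, int tag, int payload) {
    assert(ch != NULL);
    if (ch->count >= CHANNEL_CAPACITY) {
        return -1;
    }
    ch->items[ch->count].tag = tag;
    ch->items[ch->count].payload = payload;
    ch->count++;
    return 0;
}

/* Remove the oldest message carrying the requested tag, mimicking a
 * receive that selects by tag. Returns 0 on success, -1 if no match. */
int channel_recv_tag(Channel *ch, int tag, Msg *out) {
    assert(ch != NULL && out != NULL);
    for (size_t i = 0; i < ch->count; i++) {
        if (ch->items[i].tag == tag) {
            *out = ch->items[i];
            /* Close the gap so the remaining messages keep their order. */
            for (size_t j = i + 1; j < ch->count; j++) {
                ch->items[j - 1] = ch->items[j];
            }
            ch->count--;
            return 0;
        }
    }
    return -1;
}

/* Send one data message, then a marker, then receive the marker first.
 * Returns how many data messages sent before the marker are still
 * undelivered at the moment the marker is received (-1 on error). */
int data_pending_when_marker_received(void) {
    Channel ch = { .count = 0 };
    Msg received;
    int pending = 0;

    channel_send(&ch, TAG_DATA, 42);   /* sent first */
    channel_send(&ch, TAG_MARKER, 0);  /* sent second */

    if (channel_recv_tag(&ch, TAG_MARKER, &received) != 0) {
        return -1;
    }
    for (size_t i = 0; i < ch.count; i++) {
        if (ch.items[i].tag == TAG_DATA) {
            pending++;
        }
    }
    return pending;
}
```

In this sketch, data_pending_when_marker_received() reports one data message that was sent before the marker but had not yet been delivered when the marker arrived. A checkpoint taken at that point would miss the message unless the coordination protocol accounts for it in some other way.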


1.0    2008-Oct-07    15.66 KB    Recommended for libELC
This is currently the recommended release for libELC.