Asim Shankar

Process checkpointing and restarting
(using dumped core)

This page describes a system for checkpointing and restarting UNIX processes. It differs from some existing implementations in two ways: (a) it does not require executables to be linked with a special library, so processes can be checkpointed without change, and, more interestingly, (b) the manner in which a checkpointed process is restarted. Other systems (such as ckpt and esky) need a complex mechanism to restore the stack and register state of the checkpointed process, because both are also used by the restoration code. This system appears simpler because the restarted process and the restoration code live in independent address spaces. The system runs entirely at user level and requires no modifications to the kernel.

Updates

March 1, 2005 - Well, I finally got to it and fixed the "Could not read name of note #1" error. The examples now seem to work on my Linux 2.6 kernel (and I suspect they should work in 2.4 as well). Let me know if there are any problems. Thanks.
March 3, 2004 - An error (Could not read name of note #1) seems to appear in various distributions/kernels. This is because of a slight difference between the core format in the kernel I used to develop and test (the one that ships with Mandrake 9.1) and these other kernels. I am aware of the problem and the fix should not be complicated. Unfortunately, I haven't found the time to do this myself yet. If you do, please let me know. Thanks.

Objective
Final Result
Methodology
Implementation
- Preamble
- Checkpointing
- The "restart" utility
- Download
References
Some other checkpointing systems

Objective

The core file contains a complete memory dump of the process, thus in theory it should be possible to restore the process to the same state it was in when the core was dumped.

However, there are many unanswered questions when it comes to restarting from this state. What happens to the open file descriptors? Files may have changed, and how do you handle sockets, pipes, seek offsets and so on? Then there are issues with process ids: does the process id have to be the same as before? What about the parent-child relationship? What about signal handling state, such as which signals are blocked? How does the process see the time that has elapsed since the checkpoint?

Answers to these questions affect what exactly one means by "restarting" a process from the checkpointed state and how to go about it. However, for jobs that are essentially compute intensive, where inter-process communication and signal handling are not the major concern, the process address space has all the information necessary. The point is that restarting from the address-space dump in the core can serve a worthwhile purpose.


Result

The result so far is a system that can checkpoint and then restart any process along with file descriptors, with the following caveats:

Only file descriptors of regular files, directories and symbolic links can be checkpointed. Character/block devices, sockets and pipes are not supported
Signal handlers are not restored (default ones are used)
Processes that have used dlopen() to open a dynamic library are not restarted successfully
Programs must be single threaded
Only a single process will be checkpointed, thus programs that use fork(), exec() (or other things like system() and popen()) are in trouble
Programs that use the mmap() call to map files to the process' address space cannot be restarted

However, of the above, the mmap() and dlopen() limitations are likely to be fairly easily overcome. Given these limitations, which some other checkpointing systems share, I'm of the opinion that things are done much more simply here than in other systems. Details follow.


Methodology

Here's an overview of the steps the restart utility takes in order to restart a process given the executable file and the core dump file:

Open the executable and core files and read their ELF headers
From the NOTE program header (type PT_NOTE) of the core file, get the prstatus structure (note type NT_PRSTATUS, which holds the register values) of the checkpointed process
fork(), we now have a CHILD process (which will be the restarted image) and the PARENT process (that sets up the child)
CHILD: ptrace(PTRACE_TRACEME,...) and then exec() the executable file
PARENT: Set up a breakpoint in the child.
This is done as follows: save the instruction in the child at the entry point of the executable and replace it with the INT3 instruction (opcode 0xCC). Then do a ptrace(PTRACE_CONT,...). This allows the child process to run until it reaches the entry point (normally the address of the _start function). Once there, it executes the INT3 instruction, which causes a SIGTRAP to be generated and returns control to the parent process. (Letting the child run up to the entry point allows the address space to be initialized and the code to be loaded.) In the case of statically compiled binaries (e.g. gcc with -static), we would want to break at the address of main() instead of the entry point.
PARENT: With the help of the LOAD program headers in the core file, restore the address space of the child.
(The program headers with type LOAD specify the virtual address and the location in the file where the contents of that address can be found)
PARENT: Restore the registers of the child, read in from the NOTE segment of the dumped core
PARENT: Detach the child (ptrace(PTRACE_DETACH,...))
THE CHILD PROCESS IS NOW READY TO EXECUTE FROM THE POINT IT WAS CHECKPOINTED
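
As a rough illustration of the two core-file steps above (finding the register note and the LOAD entries), here is a minimal sketch. It is written in Python only for brevity and is not taken from the restart utility; the function names are made up for this example, and it assumes a 32-bit little-endian ELF core whose whole contents are already in memory.

```python
import struct

PT_LOAD, PT_NOTE = 1, 4   # program header types defined by the ELF specification
NT_PRSTATUS = 1           # note type whose descriptor holds the saved registers


def program_headers(core: bytes):
    """Yield (p_type, p_offset, p_vaddr, p_filesz) for each Elf32 program header."""
    (e_phoff,) = struct.unpack_from("<I", core, 28)
    e_phentsize, e_phnum = struct.unpack_from("<HH", core, 42)
    for i in range(e_phnum):
        p_type, p_offset, p_vaddr, _p_paddr, p_filesz = struct.unpack_from(
            "<5I", core, e_phoff + i * e_phentsize)
        yield p_type, p_offset, p_vaddr, p_filesz


def notes(segment: bytes):
    """Yield (name, n_type, desc) for each note in a PT_NOTE segment."""
    pos = 0
    while pos + 12 <= len(segment):
        namesz, descsz, n_type = struct.unpack_from("<3I", segment, pos)
        pos += 12
        name = segment[pos:pos + namesz].rstrip(b"\0")
        pos += (namesz + 3) & ~3      # the name is padded to a 4-byte boundary
        desc = segment[pos:pos + descsz]
        pos += (descsz + 3) & ~3      # and so is the descriptor
        yield name, n_type, desc


def restart_plan(core: bytes):
    """Return (raw prstatus bytes, [(vaddr, file offset, size), ...])."""
    prstatus, loads = None, []
    for p_type, p_offset, p_vaddr, p_filesz in program_headers(core):
        if p_type == PT_NOTE:
            for _name, n_type, desc in notes(core[p_offset:p_offset + p_filesz]):
                if n_type == NT_PRSTATUS and prstatus is None:
                    prstatus = desc
        elif p_type == PT_LOAD:
            loads.append((p_vaddr, p_offset, p_filesz))
    return prstatus, loads
```

Each tuple in the returned list says which bytes of the core file would be copied to which virtual address in the child; decoding the register values out of the prstatus bytes is left out here because the layout is architecture specific.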

The use of the exec() call and breaking at the entry point of the program handles the initialization of the process' address space and loading the executable code of the program and the used dynamic libraries (except those explicitly mapped by dlopen()). ckpt and esky (see Other Systems) handle the restart by making the restart process overwrite its own address space. This can be quite complicated as one must make sure that the code of the restart process remains intact and there are a host of related issues that must be carefully dealt with. The methodology above is much simpler as the address space of the restart process and the restarted process are completely independent.
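
The breakpoint trick can be pictured with a small model. ptrace reads and writes the child's memory one word at a time, so planting INT3 means replacing a single byte inside a word. The sketch below is Python for illustration only: a dictionary stands in for the child's memory, the helper names are invented, and a 32-bit little-endian word is assumed (so the byte at the breakpoint address is the low-order byte).

```python
INT3 = 0xCC  # one-byte x86 breakpoint instruction


def plant_breakpoint(word: int) -> int:
    """Return the 32-bit word with its lowest-addressed byte replaced by INT3."""
    return (word & 0xFFFFFF00) | INT3


def set_breakpoint(memory: dict, addr: int) -> int:
    """Patch INT3 into memory[addr] and return the original word for later restore."""
    saved = memory[addr]                     # stands in for PTRACE_PEEKTEXT
    memory[addr] = plant_breakpoint(saved)   # stands in for PTRACE_POKETEXT
    return saved


def clear_breakpoint(memory: dict, addr: int, saved: int) -> None:
    """Put the original instruction bytes back once the trap has been taken."""
    memory[addr] = saved
```

In a real tracer the saved word is written back after the SIGTRAP arrives. Since the registers are then replaced wholesale by the values from the core, there is no need to rewind the instruction pointer past the trap.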

File Descriptors - File descriptors are handled with the help of a dynamic library that must be put into the LD_PRELOAD environment variable. This library installs a special signal handler for the SIGQUIT signal which dumps information on the open file descriptors to a text file. This text file is then read in during the restart process mentioned above after the fork() and before the exec() and file descriptors are restored with their offsets.
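
Restoring one descriptor from such a record could look roughly like the sketch below. This is Python for illustration and not the code used by the restart utility; the function name and argument order are assumptions, and the format of the text file is not modelled. The idea is to reopen the file, move it onto the original descriptor number and seek to the saved offset, all before the exec().

```python
import os


def restore_fd(fd: int, path: str, flags: int, offset: int) -> None:
    """Reopen path so that it occupies descriptor number fd at the saved offset."""
    # Drop creation-time flags: the file must not be created or truncated again.
    flags &= ~(os.O_CREAT | os.O_EXCL | os.O_TRUNC)
    new_fd = os.open(path, flags)
    if new_fd != fd:
        os.dup2(new_fd, fd)   # place the file on the original descriptor number
        os.close(new_fd)
    os.lseek(fd, offset, os.SEEK_SET)
    # Python opens files close-on-exec by default; the descriptor has to
    # survive the exec() of the program being restarted.
    os.set_inheritable(fd, True)
```

Nothing here checks that the file still has the same contents as at checkpoint time, which is one of the open questions raised in the Objective section.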


Implementation

Preamble

Based on the methodology described above, a system was implemented. Some things regarding the implementation:

The system works on Linux and requires kernel 2.4 or above
(The mmap2() system call is used to allocate pages to the process after the program was exec()ed. Kernel 2.2 doesn't seem to have this call implemented)
The "checkpoint" file used is an ELF core file with type ET_CORE. This implementation works on the IA32 architecture (The architecture affects, among other things, the registers available etc.)
In such a system, the stack starts at 0xbfffffff and "grows" to lower addresses
The .text, .data and .bss segments of the executable are loaded at 0x08048000. Dynamic libraries are loaded by the dynamic linker (ld.so) at 0x40000000 onwards
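
A quick sanity check that a file really is an IA32 ELF core can be done from the first few bytes of its header. The sketch below is Python for illustration only (the function name is made up); the constants are the standard values from the ELF specification.

```python
import struct

ET_CORE, EM_386 = 4, 3              # e_type and e_machine values from the ELF spec
ELFCLASS32, ELFDATA2LSB = 1, 1      # 32-bit objects, little-endian encoding


def is_ia32_core(header: bytes) -> bool:
    """Return True if header starts a 32-bit little-endian x86 ELF core file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return False
    if header[4] != ELFCLASS32 or header[5] != ELFDATA2LSB:
        return False
    e_type, e_machine = struct.unpack_from("<HH", header, 16)
    return e_type == ET_CORE and e_machine == EM_386
```

Passing the first 20 bytes of the file is enough, for example `is_ia32_core(open("core", "rb").read(20))`, where "core" is whatever name the dump was written under.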

Checkpointing

Checkpointing in this system simply means generating a core dump. Here we describe ways to do that and the slightly different methodology used to checkpoint file descriptors (which are not checkpointed in the core dump).

Using a signal - There are some signals (SIGSEGV and SIGQUIT among others) whose default disposition is to make the process dump core and quit. Thus, one way of creating a checkpoint for a running process is to send it the SIGQUIT signal. There is a limit on the allowable size of this core dump, and the default setting is often to not allow the core file to be created at all. To remedy this, type the following in the bash shell before running the process:

ulimit -c unlimited
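
A launcher program can raise the same limit for itself (and hence for anything it starts) through the setrlimit interface. Here is a minimal sketch using Python's standard resource module; the function name is invented for this example. It can only raise the soft limit as far as the hard limit, so if the hard limit is 0, core dumps stay disabled.

```python
import resource


def allow_core_dumps():
    """Raise the soft core-size limit to the hard limit; return the new (soft, hard)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```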

Using the debugger (gdb) - NOTE: For checkpointing with gdb, you require gdb version 5.2 or greater (which implements the gcore command)

A debugger can be attached to a running process and then used to manipulate it. gdb has a command "gcore" that creates a core dump of the process. In fact, with the debugger you can bring a process to a "safe" state before dumping core. For example, if the process opens sockets, does some processing and then closes the sockets, you can use gdb to set a breakpoint at a point where all sockets are closed and then create a core dump. Thus, when the process is resumed from the core file, there are no open socket fds to worry about. To attach gdb to a running process use:

gdb <executable filename> <process id>

Checkpointing file descriptors - The file descriptor table is maintained by the kernel and thus doesn't lie in the process' address space. Therefore, information on open file descriptors doesn't seem to be present in the core file. Furthermore, various issues arise when trying to restore them. For example, what do you do with sockets and pipes? What happens if the file is moved? This system, however, provides rudimentary support for regular files (regular meaning files/directories as opposed to sockets or pipes). On receipt of a SIGQUIT signal, we record the descriptor, filename, offset and flags of each open file descriptor and write all this information into another file. The default signal handler is then restored and the process is sent another SIGQUIT signal, which forces a core dump. Information on open file descriptors is taken from /proc/self/fd.

Use of this special signal handler does not require any relinking: we use the environment variable LD_PRELOAD to load our library (libsavefds.so), which installs the special signal handler.

In summary, to checkpoint a process with file descriptors, ensure that libsavefds.so is present in LD_PRELOAD before starting the process and then when you need to checkpoint it, send the process a SIGQUIT signal.
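
To give an idea of what can be collected from /proc/self/fd, here is a small Linux-only sketch. It is Python for illustration and is not the code in libsavefds.so; the function name and the shape of the returned tuples are assumptions. It gathers the descriptor number, path, offset and flags of every descriptor that refers to a regular file or directory.

```python
import fcntl
import os
import stat


def open_regular_fds():
    """Return (fd, path, offset, flags) for each open regular file or directory."""
    entries = []
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            path = os.readlink(f"/proc/self/fd/{fd}")
            mode = os.fstat(fd).st_mode
            if not (stat.S_ISREG(mode) or stat.S_ISDIR(mode)):
                continue   # skip sockets, pipes and devices
            offset = os.lseek(fd, 0, os.SEEK_CUR)
            flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        except OSError:
            continue       # e.g. the descriptor that was used to list the directory
        entries.append((fd, path, offset, flags))
    return entries
```

Writing these tuples out as lines of text gives a record that a restart step can later read back.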

The restart utility

The core component of this system is the restart program. Not much had to be done for checkpointing as we basically ask the kernel for a core dump to create the checkpoint (checkpointing file descriptors uses a special library, libsavefds.so as mentioned earlier). This program essentially implements the methodology explained above. The usage of this utility is shown below:

Usage: restart [options] <executable filename> <core filename>

Options:
  -b, --breakpoint=ADDRESS   When execing the program to be restarted then run till given
                             instruction ADDRESS before restoring address space and registers
                             (Default is the entry point of the executable, which is generally
                             the address of the _start function, thus all dynamic libraries are
                             loaded by this time. Specifying this is useful for statically linked
                             executables (Compiled with the --static flag in gcc)).
  -f, --filedes[=FILENAME]   Restore file descriptors from FILENAME created by
                             libsavefds.so (Default FILENAME is "filedescriptors")
  -n, --nostop               Do not pause the restarted process
                             (By default the process must be sent a SIGCONT to continue)
  -s, --select               Make detailed selections while the address space is restored
  -V, --verbose              Be a bit verbose about what is being done while restarting
  -w, --wait                 Wait for restarted process to finish execution
  -h, --help                 Display this help and exit
  -v, --version              Display version information and exit

Special mention must go to the -b option, which is useful when it comes to statically linked executables. The -b option takes an address as its argument: this is the address at which the exec()ed process is paused and the state of the checkpointed process is restored. The system requires that, once all instructions up to this breakpoint have executed, the program code and the code of the required dynamic libraries are loaded into the address space of the process (ld has done its job). Most executables are dynamically linked to libc and ld, and the entry point of these executables (the _start function) has the characteristics required of the breakpoint address. However, in the case of statically linked executables, the entry point is often 0 and at this address even the program code has not been loaded. Hence, for such executables an acceptable breakpoint would be the address of the main() function. One could look up the symbol table to determine this address; however, in case symbols have been stripped, the -b option can be used to specify it.

Download

I'd appreciate it if you'd share any comments/suggestions/queries you may have by emailing me.

DOWNLOAD my implementation of what I have described above
README
REPORT - This system was first created as part of a course project for a graduate level Operating Systems course (at the Computer Science & Engineering Department, Indian Institute of Technology, Kanpur). This course report is pretty much just a PDF version of this web page


References

Some other interesting things that I came across while figuring out the logistics of this system:

Sandeep's articles on ptrace - A series of 3 articles on the "ptrace" system call and some interesting hacks. These articles appear in Linux Gazette issues 81, 83 and 85.
core_restart.c - A system which reconstructs an executable file from its core dump.
ELF Format Specifications - Search Google for "elf format specification"
To play around with ELF files, you can use "readelf" and "objdump". These utilities (and a hex editor!) were what helped me figure out the nitty-gritty. They are part of the binutils package and should be installed on most systems.
The Linux Kernel source code, specifically fs/binfmt_elf.c and fs/exec.c
Information on system calls and how they take parameters : http://www.lxhp.in-berlin.de/lhpsyscal.html


Other Checkpointing & Restarting Systems

Some other checkpointing systems:

ckpt - A checkpointing system developed at the University of Wisconsin
esky - Doesn't suffer from the mmap() and dlopen() limitations that this system does
libckpt - Transparent checkpointing under UNIX (1995)
EPCKPT - Checkpoint/restart utility built into the Linux kernel
checkpointing.org - The home of checkpointing packages


Last modified: Tue Mar 01 15:05:27 Central Standard Time 2005
