===============================================================================
        Process Checkpointing and Restarting using the core dump
        http://www.geocities.com/asimshankar/chekpointing/

README
Last Updated: March 1, 2005
VERSION: 1.1
===============================================================================

This README contains information on installing and running this process
checkpointing and restarting system. For details on how it works, refer to
the doc/ directory in the source and to the website. If you have any
questions, contact the author (see the AUTHORS file). Also, make sure you read
the COPYING file for the licensing terms.

CONTENTS:
---------
0.   CHANGELOG
I.   INSTALLATION
     A. Requirements
     B. Building
II.  EXAMPLE RUNS
     A. Complete, with file descriptors
     B. Checkpointing using gdb
III. LIMITATIONS

===============================================================================
0. CHANGELOG
===============================================================================
[Version 1.1 - March 01, 2005]
- Fixed the "Could not read name of note #1" error. The problem arose because
  the kernel seems to round up the "name" and "description" fields of the
  Elf32_Note structure to a multiple of 4 bytes (specifically: the functions
  notesize() and writenote(), called by elf_core_dump() in fs/binfmt_elf.c of
  the kernel sources), and I wasn't taking this into account.
- Modified the fprintf example a bit: if the filename supplied is "-", no
  file is created and the numbers are just dumped to stderr.

[Version 1.0 - April 19, 2003]
- Initial release

===============================================================================
I. INSTALLATION
===============================================================================

---------------
A. Requirements
---------------
1. Linux, kernel 2.4 or above
   (It seems kernel 2.2 and below don't really support the mmap2 system call
   the way I use it)
   (Seems to work with 2.6 as well)
2. gcc
3. gdb 5.2 [Optional]
   (We need the gcore command to be implemented. I know gdb 5.0 doesn't
   recognize the gcore command)

-----------
B. Building
-----------
1. Unpack the tarball (which you have probably done already)
2. Execute "make"
   In case of any trouble, contact the author (see the AUTHORS file)
3. Installation is now complete: the restart utility and some example
   programs are in bin/, and a library used to checkpoint file descriptors
   is in lib/

===============================================================================
II. EXAMPLE RUNS
===============================================================================
In this section we demonstrate how a process can be checkpointed and then
restarted, using some of the example programs provided. The source of these
example programs is available in examples/

NOTE: In the following, "$" stands for the shell (bash) prompt.
NOTE: Some other examples might be present in the examples/ directory but are
      not documented here. Play around with them.

-------------------------------------------------------
A. fprintf - Checkpointing along with file descriptors
-------------------------------------------------------
bin/fprintf is a program that takes as input a filename and a number and then
prints all numbers from 1 to the given number into the given file, sleep()ing
for 2 seconds between prints. The program uses the standard C library's
fprintf() function, which buffers its output and so may not write to the file
immediately.

For checkpointing file descriptors, you need to add the libsavefds.so library
to the LD_PRELOAD environment variable.

  $ cd bin
  $ ulimit -c unlimited
    [This is bash specific. It raises the limit on the size of a core dump.
    Often this limit is set to zero, in which case cores are not dumped. The
    csh equivalent is "limit coredumpsize unlimited"]
  $ export LD_PRELOAD=../lib/libsavefds.so
  $ fprintf
    [Now, send the process a SIGQUIT signal using Ctrl+'\' or the kill
    command]

Checkpoint information is in core.
and information on the open file descriptors is in a file called
'filedescriptors'. This is a simple text file with lines of the format:
  : :
You can edit this file too. So in case you wish to restart the process on
another computer where the file is in a different location, just edit this
file appropriately.

Now, to restart:

  $ restart -f -n -w fprintf

And it is as if the process never stopped!

NOTE: You will want to run "unset LD_PRELOAD" when you're done, because above
      we gave a relative path to the library. If you give a fully qualified
      path, then you won't have to worry about this.

------------------------------------
B. linklist - Checkpointing with gdb
------------------------------------
bin/linklist is an example that takes as input a number from the user, then
creates a linked list of nodes containing the integers from 0 to the supplied
number. If you give a large number, a lot of memory is allocated from the
heap. The program then prints out all the numbers onto stdout.

Here we demonstrate how to checkpoint a process using gdb's gcore command.

  $ cd bin
  $ gdb linklist
  (gdb) break 29
  (gdb) run
    [Enter a large number, say 4000]
  (gdb) cont 500
  (gdb) gcore core.linklist
  (gdb) quit
    (say yes to the confirmation question)

What just happened here is that the linklist program was run and started to
create a linked list of 4000 nodes.
  - "break 29" inserted a breakpoint at source line 29 (use "list" to see the
    source within gdb).
  - "cont 500" told gdb to continue execution of the program till it passed
    the breakpoint 500 times. Source line 29 is where the linked list is
    created, so after "cont 500" a list of 500 nodes existed and the
    remainder hadn't been created yet.
  - "gcore core.linklist" created a core dump file of the process state at
    that moment (creating the linked list, 500 of 4000 nodes created).
THIS IS OUR CHECKPOINT.

Now, we will restart the program from this point.
As a result, the rest of the linked list will be created and all numbers will
be printed on stdout, just as would have happened had we not checkpointed the
process. To do this, run:

  $ restart -n -w linklist core.linklist

And voila, things are as if they never stopped!

You can experiment with checkpointing the program at different states (for
example, break at line 37 instead of 29 and allow a few numbers to be printed
onto stdout, then use gcore to checkpoint and restart to see only the
remaining numbers print).

NOTE: You can also use gdb to checkpoint a RUNNING process. Use: gdb and then
      use gcore to checkpoint the running process.

===============================================================================
III. LIMITATIONS
===============================================================================
The way things work as of now, there are some restrictions on the processes
that can be successfully restarted from a checkpoint. Some things that come to
mind:

* Processes that use the dlopen() call to open dynamic libraries CANNOT be
  restarted as of now.
* LD_PRELOAD must have the same value when restart is called as it had when
  the checkpoint was made.
* Signal handlers are NOT restored.
* Processes that use mmap() to map files into their address space CANNOT be
  restarted as of now.

===============================================================================
===============================================================================
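
A note on the 4-byte padding mentioned in the CHANGELOG (Version 1.1): the
sketch below shows one way a reader of the note segment of a core file could
step from one note to the next while accounting for the rounding of the name
and description fields. It is only an illustration and NOT the code used by
restart. Everything named here (struct note_hdr, roundup4, count_notes,
put_note, demo_count) is hypothetical; struct note_hdr is assumed to mirror
the three 32-bit words of an ELF note header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 12-byte ELF note header. */
struct note_hdr {
    uint32_t namesz;  /* length of the name, including the trailing NUL */
    uint32_t descsz;  /* length of the description */
    uint32_t type;    /* note type */
};

/* Round n up to the next multiple of 4. */
static size_t roundup4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/* Walk the notes stored in buf[0..len) and return how many were found.
 * Returns -1 if a header or its payload would run past the buffer. */
int count_notes(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    int count = 0;

    while (off < len) {
        struct note_hdr hdr;
        size_t name_sz, desc_sz;

        if (len - off < sizeof hdr)
            return -1;
        memcpy(&hdr, buf + off, sizeof hdr);
        off += sizeof hdr;

        /* Reject oversized fields before rounding them up. */
        if (hdr.namesz > len - off || hdr.descsz > len - off)
            return -1;

        /* The sizes in the header are unpadded; the data on disk is padded. */
        name_sz = roundup4(hdr.namesz);
        desc_sz = roundup4(hdr.descsz);
        if (name_sz > len - off || desc_sz > len - off - name_sz)
            return -1;

        off += name_sz + desc_sz;  /* skip the padded name and description */
        count++;
    }
    return count;
}

/* Append one note with padded fields to buf at *off (description zeroed). */
static void put_note(unsigned char *buf, size_t *off, const char *name,
                     uint32_t descsz, uint32_t type)
{
    struct note_hdr hdr;

    hdr.namesz = (uint32_t)strlen(name) + 1;
    hdr.descsz = descsz;
    hdr.type = type;
    memcpy(buf + *off, &hdr, sizeof hdr);
    *off += sizeof hdr;
    memcpy(buf + *off, name, hdr.namesz);
    *off += roundup4(hdr.namesz) + roundup4(descsz);
}

/* Build two notes whose sizes are not multiples of 4 and count them. */
int demo_count(void)
{
    unsigned char buf[64] = {0};
    size_t off = 0;

    put_note(buf, &off, "CORE", 6, 1);  /* 12 + 8 + 8 = 28 bytes */
    put_note(buf, &off, "CORE", 3, 2);  /* 12 + 8 + 4 = 24 bytes */
    assert(off <= sizeof buf);
    return count_notes(buf, off);
}
```

If the walker advanced by the unpadded namesz and descsz instead, the second
header in demo_count() would be read from the wrong offset, which is the kind
of failure the Version 1.1 fix addresses.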