On Sun, 24 Jul 2005 21:59:59 EDT, John Richard Moser said:

> I'm thinking of application level fault tolerance using roll-back states
> or something weird, to restore the system as affected by that
> application to a point before the error.  The obvious visual effect
> would be that if an application were to crash, it and potentially
> interrelated applications would suddenly reset to a state a few seconds
> to a few minutes earlier.

Google for "checkpoint-restart" - it's a big field in scientific computing,
where you don't want to lose the results of a 3 week run on a supercomputer
just because the system crashes 5 minutes before it's done.

(Just think - if they'd had a proper checkpointing scheme, most of the
Hitchhiker's trilogy wouldn't have happened... :)

> Maintaining the state is also easy:
>
>  - When a file is changed, track the changes and attach them to the last
>    state save
>  - When memory pages are written to, cache the old copies first
>    (unfortunately each page has to be made CoW after every state save)

This is actually a lot harder than it looks - most of the real-life
applications of checkpoint-restart have been to programs that were designed
to play nice with checkpointing.  It's *really* hard to do it with a program
that wasn't designed to be checkpointed, as you noticed yourself:

> This of course raises many questions and concerns that make this
> ridiculous and probably not entirely possible:
>
>  - What about huge modifications to files in a short time?  Make a new
>    file, then write 10,000,000,000 bytes past the end and watch it crash.
>  - What about lost work in interrelated applications?
>  - Will the system state remain consistent?
>  - Will it crash over and over and over?
>  - Connecting to named pipes?  (easily handled, not discussed here)
>  - Crashes are usually trappable, and then programs exit cleanly.  They
>    won't care about this
>  - How does a process know to change course if it gets restored?

Exactly the sort of things that make it hard...
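
For contrast, here is roughly what the "designed to play nice with checkpointing" case looks like. This is only a minimal sketch in Python, not how any particular checkpoint-restart package does it: the names save_checkpoint, load_checkpoint and run, the JSON state layout, and the crash_at parameter are all made up for illustration. The program decides for itself what its state is and when it is safe to save it, which is exactly the information a transparent, outside-the-application scheme doesn't have.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file in the same directory, then rename it
    over the old checkpoint, so a crash mid-write never leaves a
    half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, crash_at=None):
    """Sum the integers 0..n_steps-1, checkpointing as it goes.
    crash_at (hypothetical, for demonstration) simulates a failure
    just before that step is processed."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n_steps):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If run() dies at step 550 with checkpoint_every=100, the next call picks up from step 500 and redoes the 50 lost steps. That only works because the whole state is two integers the program knows about; open files, pipes, and other processes that already saw the lost work are where the questions quoted above come from.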