Re: Fault tolerance. . .

On Sun, 24 Jul 2005 21:59:59 EDT, John Richard Moser said:

> I'm thinking of application level fault tolerance using roll-back states
> or something weird, to restore the system as affected by that
> application to a point before the error.  The obvious visual effect
> would be that if an application were to crash, it and potentially
> interrelated applications would suddenly reset to a state a few seconds
> to a few minutes earlier.

Google for "checkpoint-restart" - it's a big field in scientific
computing, where you don't want to lose the results of a 3 week run on a
supercomputer just because the system crashes 5 minutes before it's done.

(Just think - if they'd had a proper checkpointing scheme, most of the
Hitchhiker's trilogy wouldn't have happened... :)
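
For a program that *is* designed to cooperate, the basic idea is small. Here is a rough sketch, not taken from any particular checkpointing package: the function names, the JSON state layout, and the `crash_at` knob for simulating a failure are all made up for illustration. The state is written to a temporary file and renamed into place, so a crash during the save leaves the previous checkpoint intact.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Atomically write state so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the old checkpoint

def load_checkpoint(path):
    """Return the last saved state, or a fresh one if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}

def run(path, steps, every=100, crash_at=None):
    """Sum 0..steps-1, checkpointing every `every` steps (toy workload)."""
    state = load_checkpoint(path)
    while state["step"] < steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")  # hypothetical failure
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If `run()` dies partway through, calling it again with the same path resumes from the last saved step rather than from zero. Note that this only works because the program knows exactly what its own state is, which is the part an unmodified application can't give you.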

> Maintaining the state is also easy:
> 
>  - When a file is changed, track the changes and attach them to the last
> state save
>  - When memory pages are written to, cache the old copies first
> (unfortunately each page has to be made CoW after every state save)

This is actually a lot harder than it looks - most of the real-life applications
of checkpoint-restart have been to programs that were designed to play nice
with checkpointing.  It's *really* hard to do it with a program that wasn't
designed to be checkpointed, as you noticed yourself:

> This of course raises many questions and concerns that make this
> ridiculous and probably not entirely possible:
> 
>  - What about huge modifications to files in a short time?  Make a new
> file, then write 10,000,000,000 bytes past the end and watch it crash.
>  - What about lost work in interrelated applications?
>  - Will the system state remain consistent?
>  - Will it crash over and over and over?
>  - Connecting to named pipes? (easily handled, not discussed here)
>  - Crashes are usually trappable, and then programs exit cleanly.  They
> won't care about this
>  - How does a process know to change course if it gets restored?

Exactly the sort of things that make it hard...
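
To make the "cache the old copies first" idea concrete for the file case, here is a toy sketch. The `UndoFile` class and its methods are hypothetical names invented for this example, and it handles exactly one file that is only ever written through the wrapper. Before each overwrite it saves the bytes about to be replaced, and a rollback replays those saved bytes in reverse and truncates the file back to its old size.

```python
import os

class UndoFile:
    """Toy wrapper: saves overwritten bytes so writes made since the last
    checkpoint can be rolled back. Hypothetical, single file only."""

    def __init__(self, path):
        # Open for read/write without truncating; create if missing.
        self.f = open(path, "r+b" if os.path.exists(path) else "w+b")
        self.checkpoint()

    def checkpoint(self):
        """Start a new state save: remember the size, drop old undo data."""
        self.f.flush()
        self.size = os.fstat(self.f.fileno()).st_size
        self.undo = []  # list of (offset, old_bytes)

    def write_at(self, offset, data):
        # Cache the old contents before overwriting them.
        self.f.seek(offset)
        old = self.f.read(len(data))  # may be short or empty past EOF
        self.undo.append((offset, old))
        self.f.seek(offset)
        self.f.write(data)

    def rollback(self):
        """Restore the file to its state at the last checkpoint."""
        for offset, old in reversed(self.undo):
            self.f.seek(offset)
            self.f.write(old)
        self.f.truncate(self.size)  # discard anything appended since
        self.f.flush()
        self.undo = []
```

Even in this toy form the quoted problems show up: the undo list grows with every byte overwritten between checkpoints, and nothing here knows about other processes that read the file in the meantime.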


