Fedora Users — Re: random crashes

> >There are lots of cases where reallocated
> >sectors is not a problem
> 
> Please explain a situation in which a hard drive developing bad sectors is
> not a problem.

The drive has lots of spare sectors. Some drives will even indicate they
have reallocated sectors at purchase time. What matters is if the count
is increasing. It's also not unknown to get a couple over years and
nothing else.
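
As a rough illustration of "watch the trend, not the number", here is a
minimal Python sketch. It assumes you have already logged the reallocated
sector count at intervals (for example the raw value of SMART attribute 5,
Reallocated_Sector_Ct, as reported by smartctl -A). The function name and
the four labels are made up for this example, not part of any tool.

```python
from typing import Sequence


def reallocation_trend(counts: Sequence[int]) -> str:
    """Classify a chronological series of reallocated-sector readings.

    Hypothetical helper: the labels are illustrative only. The counter
    is assumed never to decrease, which is how drives normally report it.
    """
    if len(counts) < 2:
        return "unknown"   # a single reading says nothing about a trend
    if counts[-1] == counts[0]:
        return "stable"    # e.g. a drive that shipped with a few remaps
    if counts[-1] > counts[-2]:
        return "growing"   # still climbing at the latest reading: worry
    return "settled"       # picked up a few in the past, quiet since


if __name__ == "__main__":
    print(reallocation_trend([8, 8, 8, 8]))
    print(reallocation_trend([0, 2, 2, 2]))
    print(reallocation_trend([0, 2, 5, 11]))
```

Only the "growing" case is the one described above as mattering; the
drive's own SMART health verdict should still take precedence over
anything this simple.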

The big problem is sectors that cannot be read, not sectors where the
drive has noted a spot problem and moved the data. That, and trends. The
actual SMART health check the drive provides looks at these and should
give the best answers, as it uses drive-internal knowledge.

Google's studies show none of these methods are that great, so RAID and/or
backups are important [backups are anyway, as you can have a PSU fail
badly and blow all the attached disks; been there, seen that]. Also, for
RAID pairs use different drives or drives from different sources,
otherwise you may get two with the same systemic flaw: they came off
the production line together, run together exactly as long on your RAID,
and duly fail close together.

> >On a bad sector Linux will continue as best it can and
> >you'll rarely see the machine go splat.
> 
> You've been lucky if your systems have simply burped during this process.
> Although, it doesn't sound to me as though you have actually witnessed this.

Chuckle. I was the Linux IDE/ATA disk maintainer for some years. I've
seen most of it, including some really bad periods for drive reliability
(IBM deathstars and other such fun).

> the kind you find in most desktops and budget servers. I think it could be
> handled a lot better by the OS than it is. I blame the drivers.

Ah good, send patches. Unfortunately it's very rarely the drivers.

On a fault we run through a series of things including retrying the
command, lowering link speeds and then resetting the device. In the PATA
case a device can get stuck with IORDY asserted on the bus which hangs
the PC and there is nothing most cards will then do (SIL680 is almost the
only exception). Some controllers thoughtfully emulate this idiocy when
SATA devices fail, so it's a good idea to get an AHCI-capable controller
in AHCI mode.
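
To make the escalation order above concrete (retry the command, lower the
link speed, reset the device), here is a toy Python model. It is not the
kernel's code, which is C and far more involved; the recover() function
and the step callables are hypothetical stand-ins that return True when
the device responds again.

```python
from typing import Callable, Iterable, Optional, Tuple

# A step is a label plus a callable that returns True if the device
# came back after that action. Both are illustrative placeholders.
Step = Tuple[str, Callable[[], bool]]


def recover(steps: Iterable[Step]) -> Optional[str]:
    """Try each recovery step in order, gentlest first.

    Returns the name of the first step that succeeded, or None if
    every step failed and the device is still not responding.
    """
    for name, attempt in steps:
        if attempt():
            return name   # stop escalating as soon as something works
    return None


if __name__ == "__main__":
    ladder = [
        ("retry command", lambda: False),
        ("lower link speed", lambda: False),
        ("reset device", lambda: True),
    ]
    print(recover(ladder))
```

The point of the ordering is that the cheap, non-disruptive actions are
tried before the ones that disturb the device.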

The big failure case we normally see is the drive dropping offline
entirely and refusing to come back until physically power cycled. As the
power between the PC and the drive is directly wired, the OS can't fix
this one.

The biggest causes of apparently random failures seem to be people putting
too many disks on what the PSU can output, and overheating.
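
The PSU side of that is simple arithmetic, sketched below in Python. The
function name is made up for this example and every number in the usage
lines is a placeholder: take the real rail rating from the PSU label and
the per-drive peak (usually the spin-up figure) from the drive datasheet.

```python
def rail_headroom(rail_amps: float, per_drive_amps: float,
                  drives: int, other_amps: float = 0.0) -> float:
    """Amps left on one PSU rail if every drive peaks at the same time.

    Hypothetical helper. A negative result means the rail is
    oversubscribed in the worst case.
    """
    return rail_amps - (drives * per_drive_amps + other_amps)


if __name__ == "__main__":
    # Placeholder figures only: an 18 A rail, 2 A peak per drive,
    # 4 A reserved for everything else on the same rail.
    print(rail_headroom(18.0, 2.0, 6, other_amps=4.0))
    print(rail_headroom(18.0, 2.0, 8, other_amps=4.0))
```

If the worst case comes out negative, fewer drives per rail or a bigger
PSU are the obvious ways out.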

> In any case, any tech worth his salt is going to find out what the problem
> actually is. Going thru a system in the way that I suggested is not only
> going to help in solving the problem, it should also be part of anyone's
> maintenance plan.
> 
> If you're just guessing at the problem, you're paying too much for IT.

There is a school of thought that if your IT costs more than just
restoring a new box from backup you don't need IT 8)

Alan